### Extraction of Posts and Comments from Reddit

In this project, we extract posts and comments from Reddit by combining web scraping with calls to Reddit's API. Reddit hosts user-generated content on a wide range of topics, which makes it a valuable resource for applications such as sentiment analysis, trend tracking, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and comments from specific subreddits based on thematic criteria (e.g., r/mentalhealth, r/fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the collected data in a structured format, such as a CSV file or database, for later analysis.
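
The processing and storage objectives can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual pipeline: the helper names `clean_text` and `write_posts_csv`, the two column names, and the sample row are assumptions made for the example.

```python
import csv
import io
import re

URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs and collapse runs of whitespace into single spaces."""
    without_urls = URL_PATTERN.sub("", text or "")
    return WHITESPACE_PATTERN.sub(" ", without_urls).strip()


def write_posts_csv(posts, file_obj, fieldnames=("postId", "postTitle")):
    """Write cleaned post rows to an already-open text file object as CSV."""
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for post in posts:
        row = dict(post)  # copy so the caller's dictionary is left untouched
        row["postTitle"] = clean_text(row.get("postTitle"))
        writer.writerow(row)


# Hypothetical sample row, written to an in-memory buffer instead of a file
sample_posts = [
    {"postId": "t3_example", "postTitle": "  Sleep tips\n see https://example.com  "}
]
buffer = io.StringIO()
write_posts_csv(sample_posts, buffer)
print(buffer.getvalue())
```

Passing an open file object rather than a path keeps the sketch easy to try in memory; for a real run, the same function could be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.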

#### Tools and Technologies
- **Python**: The primary programming language for web scraping and interacting with the API.
- **PRAW**: The Python Reddit API Wrapper, used to query Reddit's API.
- **Requests**: A library for making HTTP requests, in case additional scraping is required.
- **Pandas**: A data manipulation library for handling and analyzing the extracted data.

#### Getting Started
1. **Set Up the Environment**: Install necessary libraries using pip (`praw`, `requests`, `pandas`).
2. **Obtain API Credentials**: Create a Reddit account and register an application to get API credentials (client ID and secret), then choose a descriptive user agent string to identify the script.
3. **Define the Extraction Logic**: Write functions to extract data from specific subreddits or threads based on keywords or categories.
4. **Run the Scraper**: Launch the script and monitor the data collection process.
5. **Analyze the Data**: Use Pandas to analyze the collected posts and comments for insights.
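
Steps 2 and 3 might look roughly like the sketch below, assuming the `praw` package is installed and credentials are available. The function names `make_client` and `fetch_posts` and the dictionary keys are illustrative choices, not part of PRAW; the PRAW calls used are `praw.Reddit(...)`, `reddit.subreddit(name).hot(limit=...)`, and the submission attributes `id`, `title`, `author`, `num_comments`, and `created_utc`.

```python
from datetime import datetime, timezone


def make_client(client_id, client_secret, user_agent):
    """Build a read-only PRAW client (hypothetical helper).

    praw is imported lazily so the rest of this sketch can be read and
    exercised without the package installed.
    """
    import praw  # third-party: pip install praw

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def fetch_posts(reddit, subreddit_name, limit=10):
    """Collect basic fields from the current 'hot' listing of one subreddit."""
    rows = []
    for submission in reddit.subreddit(subreddit_name).hot(limit=limit):
        author = submission.author  # None when the account was deleted
        created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
        rows.append(
            {
                "postId": submission.id,
                "postTitle": submission.title,
                "authorName": str(author) if author is not None else None,
                "commentCount": submission.num_comments,
                "createdAt": created.isoformat(),
            }
        )
    return rows
```

With real credentials, a call such as `fetch_posts(make_client(client_id, client_secret, user_agent), "fitness")` would return a list of dictionaries that can be passed directly to `pandas.DataFrame` for step 5.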

#### Conclusion
This project provides a hands-on introduction to collecting data from Reddit, through its API and through web scraping, and to analyzing content from a dynamic online community with Python.


<p style="color:#FBCE60;text-align:center;font-size:30px"> Scraping Reddit's Posts and Articles </p>

In [None]:
# Installing BeautifulSoup4
!pip install beautifulsoup4

# Installing Selenium
!pip install selenium










### Scraping Reddit's Health-Related Topics

In [7]:
# Importing necessary libraries
from bs4 import BeautifulSoup  # BeautifulSoup is used to parse HTML content.
from datetime import datetime, timezone  # Used to timestamp each collection run in UTC.
from urllib.request import urlopen, Request  # Allows sending HTTP requests and opening URLs.

# List to store the collected information about posts
posts = []

# Function to collect the posts listed on a health-related subreddit page
def collectSubRedditsPosts(url):

    # Creating an HTTP header to simulate a web browser. Without it, the server may block the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'
    }
    # Preparing the HTTP request with the target URL and headers
    request = Request(url, headers=headers)
    print(f"Navigating to {url}")  # Display the URL for user feedback

    # Sending the request and retrieving the page content
    with urlopen(request) as response:  # Opening the URL
        page_source = response.read()  # Reading the HTML content of the page

    # Timestamp of this collection run, e.g. 2024-12-01T09:24:15.255182+0000
    # (%f always zero-pads the microseconds to six digits)
    collectedAt = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f%z")

    # Creating a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(page_source, 'html.parser')
    # Each post is rendered as a <shreddit-post> element whose attributes carry the metadata
    postsElements = soup.find_all("shreddit-post")
    for postsElement in postsElements:
        post = {}
        post["authorName"] = postsElement.get("author")
        post["authorId"] = postsElement.get("author-id")
        post["commentCount"] = postsElement.get("comment-count")
        post["commentsLink"] = postsElement.get("content-href")
        post["createdAt"] = postsElement.get("created-timestamp")
        post["postId"] = postsElement.get("id")
        post["postTitle"] = postsElement.get("post-title")
        post["subredditName"] = postsElement.get("subreddit-prefixed-name")
        post["collectedAt"] = collectedAt
        # Look for the flair inside the current post rather than the whole page,
        # and tolerate posts that have no flair
        flairElement = postsElement.find("div", class_="flair-content")
        post["interactionCategory"] = flairElement.get_text().replace("\n", "").strip() if flairElement else None
        posts.append(post)  # Keep the record so it can be analyzed or saved later
        print(post)
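
The attributes read from each `shreddit-post` element are strings, so counts and timestamps usually need converting before analysis. The sketch below shows one way to do that with the standard library. The helper name `normalize_post` is hypothetical, and the sample values only follow the timestamp layout the scraper emits.

```python
from datetime import datetime

# Layout used for createdAt / collectedAt, e.g. 2024-12-01T07:59:29.479000+0000
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def normalize_post(post):
    """Return a copy of a scraped post with typed count and timestamp fields."""
    normalized = dict(post)
    count = post.get("commentCount")
    normalized["commentCount"] = int(count) if count not in (None, "") else None
    for key in ("createdAt", "collectedAt"):
        value = post.get(key)
        normalized[key] = datetime.strptime(value, TIMESTAMP_FORMAT) if value else None
    return normalized


raw = {
    "commentCount": "11",
    "createdAt": "2024-12-01T07:59:29.479000+0000",
    "collectedAt": "2024-12-01T09:24:15.255182+0000",
}
post = normalize_post(raw)
# How old the post was when it was collected
age = post["collectedAt"] - post["createdAt"]
print(post["commentCount"], age)
```

Once the fields are typed, values such as the post's age at collection time can be computed directly instead of by string manipulation.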


In [8]:
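import pandas as pd
# Load the list of health-related communities (columns used: topicName, topicUrl)
topicsList = pd.read_csv("../data/healthRedditCommunities.csv")
# Only the first community is scraped here; widen the range to cover more rows
for index in range(0, 1):
    topic = topicsList.iloc[index]
    topicUrl = topic["topicUrl"]
    collectSubRedditsPosts(topicUrl)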

Navigating to https://www.reddit.com/r/ADHD
{'authorName': 'kswildcatmom', 'authorId': 't2_13byht', 'commentCount': '11', 'commentsLink': 'https://www.reddit.com/r/ADHD/comments/1h3yg30/my_normal_husband/', 'createdAt': '2024-12-01T07:59:29.479000+0000', 'postId': 't3_1h3yg30', 'postTitle': 'My normal husband', 'subredditName': 'r/ADHD', 'collectedAt': '2024-12-01T09:24:15.255182+0000', 'interactionCategory': 'Tips/Suggestions'}
{'authorName': 'Paranoia_King', 'authorId': 't2_6jipcd7v', 'commentCount': '78', 'commentsLink': 'https://www.reddit.com/r/ADHD/comments/1h3ebqm/a_friendly_reminder_for_my_fellow_adhders_start/', 'createdAt': '2024-11-30T14:59:51.150000+0000', 'postId': 't3_1h3ebqm', 'postTitle': "A friendly reminder for my fellow ADHD'ers, start buying your christmas gifts NOW", 'subredditName': 'r/ADHD', 'collectedAt': '2024-12-01T09:24:15.255182+0000', 'interactionCategory': 'Tips/Suggestions'}
{'authorName': 'check1232', 'authorId': 't2_19y9p7vbrf', 'commentCount': '36', 'c