### Extraction of Posts and Comments from Reddit

In this project, we focus on extracting posts and comments from Reddit using web scraping techniques and interaction with the API. Reddit is a platform rich in user-generated content on various topics, making it a valuable resource for applications such as sentiment analysis, trend tracking, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and comments from specific subreddits based on thematic criteria (e.g., r/mentalhealth, r/fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the collected data in a structured format, such as a CSV file or database, for later analysis.
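
The processing objective can be sketched with Pandas as follows. This is a minimal illustration, not the project's actual cleaning pipeline: the function name `clean_posts` and the sample records are hypothetical, and the field names (`postId`, `postTitle`, `commentCount`, `createdAt`) are assumed to match the dictionaries produced by the scraper in this notebook, where scraped values arrive as strings and `createdAt` is a Unix time in milliseconds.

```python
import pandas as pd


def clean_posts(records):
    """Return a de-duplicated, typed DataFrame from raw scraped post dictionaries."""
    df = pd.DataFrame(records)
    # The same post can show up in several listings, so keep one row per postId
    df = df.drop_duplicates(subset="postId").reset_index(drop=True)
    # Scraped HTML attributes are strings; missing or malformed counts become 0
    df["commentCount"] = (
        pd.to_numeric(df["commentCount"], errors="coerce").fillna(0).astype(int)
    )
    # Interpret createdAt as milliseconds since the Unix epoch, in UTC
    df["createdAt"] = pd.to_datetime(
        pd.to_numeric(df["createdAt"], errors="coerce"), unit="ms", utc=True
    )
    df["postTitle"] = df["postTitle"].str.strip()
    return df


# Hypothetical sample records, shaped like the scraper's output
raw = [
    {"postId": "thing_t3_a", "postTitle": " First post ", "commentCount": "481", "createdAt": "1732143952000"},
    {"postId": "thing_t3_a", "postTitle": " First post ", "commentCount": "481", "createdAt": "1732143952000"},
    {"postId": "thing_t3_b", "postTitle": "Second post", "commentCount": None, "createdAt": "1733029325000"},
]
cleaned = clean_posts(raw)
print(cleaned)
```

For the storage objective, the cleaned frame could then be written with `cleaned.to_csv(path, index=False)`, where `path` is whatever location the project uses.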

#### Tools and Technologies
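- **Python**: The primary programming language for web scraping and interacting with the API.
- **PRAW**: The Python Reddit API Wrapper, for querying Reddit's API with the registered credentials.
- **BeautifulSoup**: An HTML parsing library for extracting post fields from scraped pages.
- **Requests**: A library for making HTTP requests, in case additional scraping is required.
- **Pandas**: A data manipulation library for handling and analyzing the extracted data.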

#### Getting Started
1. **Set Up the Environment**: Install necessary libraries using pip (`praw`, `requests`, `pandas`).
2. **Obtain API Credentials**: Create a Reddit account and register an application to get API credentials (client ID, secret, and user agent).
3. **Define the Extraction Logic**: Write functions to extract data from specific subreddits or threads based on keywords or categories.
4. **Run the Scraper**: Launch the script and monitor the data collection process.
5. **Analyze the Data**: Use Pandas to analyze the collected posts and comments for insights.
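
Steps 2 and 3 above can be sketched with PRAW as follows. This is a minimal sketch under stated assumptions, not the extraction logic used later in this notebook (which scrapes HTML pages instead). The helper names `build_client`, `fetch_new_posts`, and `fetch_comments` are hypothetical, as are the placeholder credentials and user agent in the usage comment. The output keys are chosen to resemble the scraper's field names.

```python
def build_client(client_id, client_secret, user_agent):
    """Create a read-only PRAW client from the credentials registered on Reddit."""
    import praw  # imported here so the helpers below can be tried without PRAW installed

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def fetch_new_posts(reddit, subreddit_name, limit=10):
    """Return the newest posts of a subreddit as a list of plain dictionaries."""
    rows = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        rows.append({
            "postId": submission.id,
            "postTitle": submission.title,
            # author is None when the account has been deleted
            "authorName": submission.author.name if submission.author else None,
            "commentCount": submission.num_comments,
            "createdAt": submission.created_utc,  # Unix time in seconds
            "commentsLink": submission.permalink,
        })
    return rows


def fetch_comments(reddit, submission_id):
    """Return every loaded comment of one post as a list of plain dictionaries."""
    submission = reddit.submission(id=submission_id)
    # Drop the "load more comments" placeholders instead of fetching them
    submission.comments.replace_more(limit=0)
    return [
        {
            "commentId": comment.id,
            "postId": submission_id,
            "authorName": comment.author.name if comment.author else None,
            "body": comment.body,
        }
        for comment in submission.comments.list()
    ]


# Usage (placeholder values, replace with your own credentials):
# reddit = build_client("CLIENT_ID", "CLIENT_SECRET", "health-posts-collector/0.1")
# posts = fetch_new_posts(reddit, "mentalhealth", limit=25)
# comments = fetch_comments(reddit, posts[0]["postId"])
```

Returning plain dictionaries keeps the result easy to load with `pd.DataFrame(rows)` for the analysis step.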

#### Conclusion
This project provides a hands-on introduction to Reddit's API and to collecting, cleaning, and analyzing data from a dynamic online community with Python.


<p style="color:#FBCE60;text-align:center;font-size:30px"> Scraping Reddit's Posts and Articles </p>

In [None]:
# Installing BeautifulSoup4 (imported as bs4)
!pip install beautifulsoup4

# Installing Selenium
!pip install selenium





### Scraping Reddit's Health-Related Topics

In [1]:
import time
import urllib.request
from datetime import datetime, timezone
from urllib.error import URLError
from urllib.request import Request, urlopen

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# Timestamp taken once when this cell runs; every post collected in the run shares it
now = datetime.now(timezone.utc)
posts = []

# Browser-like User-Agent sent with every request
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}


# Function to periodically save collected data
def periodicSave(data):
    data = pd.DataFrame(data)
    data.to_csv("../data/healthRedditPosts/redditPosts.csv")
    print("Periodic save is done, total of saved posts is", len(data))


# Function to collect the posts of a subreddit listing, following "next" links page by page
def collectSubRedditsPosts(url):
    while url:
        time.sleep(5)  # pause between requests
        request = Request(url, headers=HEADERS)
        print(f"Navigating to {url}")

        # Open the URL and read the page content
        try:
            with urlopen(request) as response:
                page_source = response.read()
        except URLError as error:
            print(f"Request failed for {url}: {error}")
            break

        # Parse the page source with BeautifulSoup
        soup = BeautifulSoup(page_source, 'html.parser')

        # Find and process posts
        for post_element in soup.find_all("div", class_="thing"):
            try:
                flair = post_element.find("span", class_="flairrichtext")
                post = {
                    "authorName": post_element.get("data-author"),
                    "authorId": post_element.get("data-author-fullname"),
                    "commentCount": post_element.get("data-comments-count"),
                    "commentsLink": post_element.get("data-url"),
                    "createdAt": post_element.get("data-timestamp"),
                    "postId": post_element.get("id"),
                    "postTitle": post_element.find("a", class_="title may-blank").get_text(),
                    "subredditName": post_element.get("data-subreddit-prefixed"),
                    # %f is zero-padded to six digits; %z renders UTC as +0000
                    "collectedAt": now.strftime("%Y-%m-%dT%H:%M:%S.%f%z"),
                    "interactionCategory": flair.get_text() if flair else "N/A",
                }
            except AttributeError:
                # Skip elements that have no title link
                continue
            print(post)
            posts.append(post)
            if len(posts) % 500 == 0:
                periodicSave(posts)

        # Move on to the next page, or stop when there is none
        nextButton = soup.find("span", class_="next-button")
        nextLink = nextButton.find("a") if nextButton else None
        url = nextLink.get("href") if nextLink else None
        if not url:
            print("no next button found")

    return posts


In [None]:
import pandas as pd
import time

topicsList = pd.read_csv("../data/healthRedditCommunities.csv")
collectedPosts = []
timeWindows = ["all", "month", "year", "week", "hour"]

# Only the first community is processed here; widen the range to cover more rows
for index in range(0, 1):
    topic = topicsList.iloc[index]
    topicName = topic["topicName"]
    baseUrl = topic["topicUrl"]

    # Default, new and rising listings, then controversial and top for each time window
    extendedUrls = [baseUrl, baseUrl + "/new/", baseUrl + "/rising/"]
    for sort in ("controversial", "top"):
        extendedUrls.append(f"{baseUrl}/{sort}/")
        extendedUrls += [f"{baseUrl}/{sort}/?sort={sort}&t={window}" for window in timeWindows]

    for topicUrl in extendedUrls:
        collectedPosts = collectSubRedditsPosts(topicUrl)


Navigating to https://old.reddit.com/r/ADHD
{'authorName': 'Rich-Wolverine8912', 'authorId': 't2_14qqqlmoto', 'commentCount': '481', 'commentsLink': '/r/ADHD/comments/1gw1cs3/what_are_your_adhd_home_hacks/', 'createdAt': '1732143952000', 'postId': 'thing_t3_1gw1cs3', 'postTitle': 'What are your ADHD home hacks?', 'subredditName': 'r/ADHD', 'collectedAt': '2024-12-01T12:24:17.173698+0000', 'interactionCategory': 'Tips/Suggestions'}
{'authorName': 'AutoModerator', 'authorId': 't2_6l4z3', 'commentCount': '1', 'commentsLink': '/r/ADHD/comments/1h3vqnv/need_to_get_something_off_your_chest_rant_vent/', 'createdAt': '1733029325000', 'postId': 'thing_t3_1h3vqnv', 'postTitle': 'Need to get something off your chest? Rant, vent, get it out here!', 'subredditName': 'r/ADHD', 'collectedAt': '2024-12-01T12:24:17.173698+0000', 'interactionCategory': 'N/A'}
{'authorName': 'kswildcatmom', 'authorId': 't2_13byht', 'commentCount': '49', 'commentsLink': '/r/ADHD/comments/1h3yg30/my_normal_husband/', 'crea