### Extraction of Posts and Comments from Reddit

In this project, we extract posts and comments from Reddit by combining web scraping with calls to the Reddit API. Reddit hosts a large volume of user-generated content across many topics, which makes it a valuable source for applications such as sentiment analysis, trend tracking, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and comments from specific subreddits based on thematic criteria (e.g., r/mentalhealth, r/fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the collected data in a structured format, such as a CSV file or database, for later analysis.
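
The processing and storage objectives can be illustrated with a small sketch. It assumes each post is a dictionary with a `postText` field (the column name used later in this notebook) and uses only the standard library; the helper names `clean_text`, `clean_posts`, and `save_posts` are hypothetical, and the cleaning rules (removing URLs, collapsing whitespace, dropping `[deleted]`/`[removed]` placeholders) are one reasonable choice rather than the project's confirmed pipeline.

```python
import csv
import re

# Matches http:// and https:// links up to the next whitespace character
URL_PATTERN = re.compile(r"https?://\S+")
# Texts treated as "no usable content" (assumed placeholder values)
PLACEHOLDERS = {"", "[deleted]", "[removed]"}


def clean_text(text):
    """Strip URLs and collapse runs of whitespace into single spaces."""
    without_urls = URL_PATTERN.sub(" ", text or "")
    return " ".join(without_urls.split())


def clean_posts(posts):
    """Return cleaned copies of the posts, dropping ones with no usable text."""
    cleaned = []
    for post in posts:
        text = clean_text(post.get("postText"))
        if text in PLACEHOLDERS:
            continue
        cleaned.append({**post, "postText": text})
    return cleaned


def save_posts(posts, path):
    """Write the posts to a CSV file with one column per dictionary key."""
    if not posts:
        return
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(posts[0].keys()))
        writer.writeheader()
        writer.writerows(posts)
```

A typical call would be `save_posts(clean_posts(raw_posts), "posts_clean.csv")`, where `raw_posts` and the file name are placeholders. The same steps could be expressed with Pandas once the data is loaded into a DataFrame.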

#### Tools and Technologies
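- **Python**: The primary programming language for web scraping and interacting with the API.
- **PRAW**: The Python Reddit API Wrapper, used to query the Reddit API with the registered credentials.
- **Requests**: A library for making HTTP requests, in case additional scraping is required.
- **BeautifulSoup**: An HTML parsing library used to extract post fields from the scraped pages.
- **aiohttp**: An asynchronous HTTP client used to fetch individual post pages.
- **Pandas**: A data manipulation library for handling and analyzing the extracted data.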

#### Getting Started
1. **Set Up the Environment**: Install necessary libraries using pip (`praw`, `requests`, `pandas`).
2. **Obtain API Credentials**: Create a Reddit account and register an application to get API credentials (client ID, secret, and user agent).
3. **Define the Extraction Logic**: Write functions to extract data from specific subreddits or threads based on keywords or categories.
4. **Run the Scraper**: Launch the script and monitor the data collection process.
5. **Analyze the Data**: Use Pandas to analyze the collected posts and comments for insights.
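
Steps 2 and 3 can be sketched with PRAW as follows. This is a minimal illustration, not the code used in the cells below (which scrape HTML pages instead): the environment variable names, the default user agent string, and the helper names `make_client`, `submission_to_row`, and `collect_posts` are assumptions made for the example. The output keys mirror the column names used later in this notebook.

```python
import os
from datetime import datetime, timezone


def make_client():
    """Build a PRAW client from environment variables (variable names are assumptions)."""
    import praw  # imported lazily so the helpers below work without PRAW installed

    return praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent=os.environ.get("REDDIT_USER_AGENT", "health-posts-collector/0.1"),
    )


def submission_to_row(submission):
    """Flatten one submission into a plain dict that Pandas can load."""
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "postId": submission.id,
        "postTitle": submission.title,
        "postText": submission.selftext,
        # Deleted accounts have no author object
        "authorName": str(submission.author) if submission.author else None,
        "commentCount": submission.num_comments,
        "createdAt": created.isoformat(),
        "commentsLink": submission.permalink,
    }


def collect_posts(reddit, subreddit_name, limit=100):
    """Collect the newest posts of a subreddit as a list of dicts."""
    listing = reddit.subreddit(subreddit_name).new(limit=limit)
    return [submission_to_row(submission) for submission in listing]
```

With valid credentials set in the environment, `collect_posts(make_client(), "mentalhealth", limit=50)` would return rows that can be passed directly to `pd.DataFrame(...)`.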

#### Conclusion
This project provides a hands-on introduction to Reddit's API and to data analysis with Python, using real data from an active online community.


<p style="color:#FBCE60;text-align:center;font-size:30px"> Scraping Reddit's Posts and Articles </p>

In [None]:
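# Installing BeautifulSoup4
!pip install beautifulsoup4

# Installing Selenium
!pip install selenium

# Installing the other libraries imported in the cells below
!pip install aiohttp nest_asyncio pandas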










### Scraping Reddit's Health-Related Topics

In [2]:
import asyncio
import aiohttp
import nest_asyncio
from bs4 import BeautifulSoup

# Allow asyncio.run() to be called inside the notebook's already-running event loop
nest_asyncio.apply()

# Browser-like User-Agent sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'
}

async def fetch_page(session, postUrl):
    """Fetch the page content asynchronously; return None on any failure."""
    try:
        # Give up on a page after 10 seconds in total
        timeout = aiohttp.ClientTimeout(total=10)
        async with session.get(postUrl, headers=headers, timeout=timeout) as response:
            if response.status == 200:
                print(f"Navigating to {postUrl}")
                return await response.text()
            print(f"Failed to fetch {postUrl}: HTTP {response.status}")
            return None
    except Exception as e:
        print(f"Error fetching {postUrl}: {e}")
        return None

async def collect_subreddit_post_text(postUrl):
    """Collect the text body of a subreddit post asynchronously."""
    async with aiohttp.ClientSession() as session:
        page_source = await fetch_page(session, postUrl)
        if not page_source:
            return ""
        soup = BeautifulSoup(page_source, 'html.parser')
        # The post body sits in the <div slot="text-body"> element
        response_element = soup.find("div", slot="text-body")
        if not response_element:
            print(f"No content found at {postUrl}")
            return ""
        paragraphs = [p.get_text(strip=True) for p in response_element.find_all("p")]
        return " ".join(paragraphs)

async def getPostText(post_url):
    """Return the text body of the post at post_url ("" if unavailable)."""
    return await collect_subreddit_post_text(post_url)



In [10]:
import urllib.request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import time
from datetime import datetime, timezone
from urllib.request import urlopen, Request
import pandas as pd 
# Timestamp recorded on every collected post
now = datetime.now(timezone.utc)

# Function to periodically save collected data
def periodicSave(data, topic):
    data = pd.DataFrame(data)
    data.to_csv(f"../data/healthRedditPosts/redditPosts{topic}.csv")
    print("Periodic save is done. Total saved posts:", len(data))

# Function to collect the posts of a subreddit listing, following the "next" links
def collectSubRedditsPosts(url, topic, posts):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'}
    while url:
        # Pause between requests to avoid hammering the site
        time.sleep(5)
        request = Request(url, headers=headers)

        # Open the URL and read the page content
        with urlopen(request) as response:
            page_source = response.read()
        print(f"Navigating to {url}")

        # Parse the page source with BeautifulSoup
        soup = BeautifulSoup(page_source, 'html.parser')

        # Find and process posts
        posts_elements = soup.find_all("div", class_="thing")
        for post_element in posts_elements:
            try:
                flair = post_element.find("span", class_="flairrichtext")
                post = {
                    "authorName": post_element.get("data-author"),
                    "authorId": post_element.get("data-author-fullname"),
                    "commentCount": post_element.get("data-comments-count"),
                    "commentsLink": post_element.get("data-url"),
                    "createdAt": post_element.get("data-timestamp"),
                    "postId": post_element.get("id"),
                    "postTitle": post_element.find("a", class_="title may-blank").get_text(),
                    "subredditName": post_element.get("data-subreddit-prefixed"),
                    # %f zero-pads microseconds to six digits; %z yields "+0000" for UTC
                    "collectedAt": now.strftime("%Y-%m-%dT%H:%M:%S.%f%z"),
                    "interactionCategory": flair.get_text() if flair else "N/A",
                }
                print(post)
                posts.append(post)
                if len(posts) % 100 == 0:
                    periodicSave(posts, topic)
            except AttributeError:
                # Skip entries that have no title link
                continue

        # Follow the "next" link until the last page is reached
        nextLink = soup.find("a", rel="nofollow next")
        url = nextLink.get("href") if nextLink else None

    print("No next button found. Stopping scroll.")
    periodicSave(posts, topic)
    return posts


In [8]:
import pandas as pd 
topicsList=pd.read_csv("../data/healthRedditPosts/healthRedditCommunities.csv")


In [None]:
import pandas as pd
import time

collectedPosts = []

for _, topic in topicsList.iterrows():
    topicName = topic["topicName"]
    baseUrl = topic["topicUrl"]
    # Listing pages to crawl for each subreddit: default, new, rising, and the
    # controversial and top listings over several time windows
    extendedUrls = [
        baseUrl,
        baseUrl + "/new/",
        baseUrl + "/rising/",
        baseUrl + "/controversial/",
        baseUrl + "/controversial/?sort=controversial&t=all",
        baseUrl + "/controversial/?sort=controversial&t=month",
        baseUrl + "/controversial/?sort=controversial&t=year",
        baseUrl + "/controversial/?sort=controversial&t=week",
        baseUrl + "/controversial/?sort=controversial&t=hour",
        baseUrl + "/top/",
        baseUrl + "/top/?sort=top&t=all",
        baseUrl + "/top/?sort=top&t=month",
        baseUrl + "/top/?sort=top&t=year",
        baseUrl + "/top/?sort=top&t=week",
        baseUrl + "/top/?sort=top&t=hour",
    ]
    # Share one list across all listings of a topic so that each periodic save
    # keeps the posts gathered from the earlier listings instead of overwriting them
    topicPosts = []
    for topicUrl in extendedUrls:
        collectSubRedditsPosts(topicUrl, topicName, topicPosts)
    collectedPosts.extend(topicPosts)


In [None]:
import pandas as pd
import asyncio

# Load the dataset
dataset = pd.read_csv("../data/healthRedditPosts/redditPostsBipolarReddit.csv")

async def process_posts(dataset):
    """Fetch the text of every post, one at a time, and store it in a postText column."""
    for i in range(len(dataset)):
        post = dataset.iloc[i]
        postLink = post["commentsLink"]
        # Relative links point to Reddit; absolute links are used unchanged
        postUrl = postLink if postLink.startswith("http") else "https://reddit.com" + postLink
        postText = await getPostText(postUrl)
        dataset.loc[i, "postText"] = postText
        print(f"Post number {i} is done.")
    return dataset

# Run the asynchronous loop
import nest_asyncio
nest_asyncio.apply()

async def main():
    updated_dataset = await process_posts(dataset)
    # Save the updated dataset after processing
    updated_dataset.to_csv("../data/healthRedditPosts/redditPostsBreast_Cancer_with_text.csv", index=False)

asyncio.run(main())


Navigating to https://reddit.com/r/breastcancer/comments/1focbhi/signing_off_best_wishes_to_all/
Post number 0 is done.
Navigating to https://reddit.com/r/breastcancer/comments/ynowi8/i_need_advice/
Post number 1 is done.
Navigating to https://reddit.com/r/breastcancer/comments/1fqrkxe/dame_maggie_smith/
Post number 2 is done.
Navigating to https://reddit.com/r/breastcancer/comments/1fgjbkm/beating_the_odds/
Post number 3 is done.
Navigating to https://reddit.com/r/breastcancer/comments/1dbb3cq/lost_my_wife/
Post number 4 is done.
Navigating to https://reddit.com/r/breastcancer/comments/1gevsrb/remind_me_to_never_post_outside_of_this_sub_again/
Post number 5 is done.
Navigating to https://reddit.com/r/breastcancer/comments/1gnwmli/bc_treatments_are_all_terrible_and_im_not/
Post number 6 is done.
Navigating to https://reddit.com/r/breastcancer/comments/12sbun0/cancer_free/
Post number 7 is done.
Navigating to https://reddit.com/r/breastcancer/comments/11o7x3h/its_over_i_did_it/
Post num