### Extraction of Posts and Comments from Reddit

In this project, we extract posts and comments from Reddit by combining web scraping with the Reddit API. Reddit hosts user-generated content on a wide range of topics, which makes it a valuable resource for applications such as sentiment analysis, trend tracking, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and comments from specific subreddits based on thematic criteria (e.g., r/mentalhealth, r/fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the collected data in a structured format, such as a CSV file or database, for later analysis.
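
The processing and storage objectives can be sketched as follows. This is only an illustration under stated assumptions: the sample rows, the column names (`post_id`, `title`, `body`), and the helper `clean_posts` are hypothetical and not part of the project code. It assumes that removed content shows up as the placeholder text `[deleted]` or `[removed]`.

```python
import io

import pandas as pd

# Hypothetical sample rows; real rows would come from the scraper or the API.
raw_posts = [
    {"post_id": "a1", "title": "  Sleep tips  ", "body": "Go to bed early.\n\nAvoid screens."},
    {"post_id": "a1", "title": "  Sleep tips  ", "body": "Go to bed early.\n\nAvoid screens."},
    {"post_id": "b2", "title": "Deleted thread", "body": "[deleted]"},
    {"post_id": "c3", "title": "Running plan", "body": "  5k   in 8 weeks "},
]


def clean_posts(rows):
    """Deduplicate posts, drop placeholder bodies, and normalize whitespace."""
    df = pd.DataFrame(rows)
    # Keep only the first occurrence of each post.
    df = df.drop_duplicates(subset="post_id")
    # Assumption: removed content appears as one of these placeholder strings.
    df = df[~df["body"].isin(["[deleted]", "[removed]"])].copy()
    # Collapse runs of whitespace (including newlines) into single spaces.
    for column in ("title", "body"):
        df[column] = df[column].str.replace(r"\s+", " ", regex=True).str.strip()
    return df.reset_index(drop=True)


cleaned = clean_posts(raw_posts)

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Writing to a `StringIO` buffer keeps the sketch free of file-system side effects; passing a path to `to_csv` instead would store the same rows on disk.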

#### Tools and Technologies
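- **Python**: The primary programming language for web scraping and interacting with the API.
- **PRAW**: The Python Reddit API Wrapper, used to query the Reddit API with the registered credentials.
- **BeautifulSoup**: A library for parsing the HTML of scraped pages.
- **Requests**: A library for making HTTP requests, in case additional scraping is required.
- **Pandas**: A data manipulation library for handling and analyzing the extracted data.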

#### Getting Started
1. **Set Up the Environment**: Install necessary libraries using pip (`praw`, `requests`, `pandas`).
2. **Obtain API Credentials**: Create a Reddit account and register an application to get API credentials (client ID and client secret), then choose a descriptive user agent string to send with each request.
3. **Define the Extraction Logic**: Write functions to extract data from specific subreddits or threads based on keywords or categories.
4. **Run the Scraper**: Launch the script and monitor the data collection process.
5. **Analyze the Data**: Use Pandas to analyze the collected posts and comments for insights.
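
Steps 2 to 4 might look roughly like the sketch below, assuming PRAW is the API client. The helper names (`make_client`, `collect_posts`, `collect_comments`) and the output field names are hypothetical. The credentials are placeholders that must be replaced with the values from your registered application.

```python
def make_client(client_id, client_secret, user_agent):
    """Build a read-only PRAW client (requires `pip install praw`)."""
    import praw  # Imported lazily so the helpers below can be used without it.

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def collect_posts(reddit, subreddit_name, limit=10):
    """Return basic fields for the current 'hot' posts of one subreddit."""
    posts = []
    for submission in reddit.subreddit(subreddit_name).hot(limit=limit):
        posts.append({
            "post_id": submission.id,
            "title": submission.title,
            "body": submission.selftext,
            "score": submission.score,
            "num_comments": submission.num_comments,
            "created_utc": submission.created_utc,
        })
    return posts


def collect_comments(submission):
    """Return the comments of one submission as a flat list of dictionaries."""
    # Drop the "load more comments" stubs instead of fetching them.
    submission.comments.replace_more(limit=0)
    return [
        {
            "comment_id": comment.id,
            "post_id": submission.id,
            "body": comment.body,
            "score": comment.score,
        }
        for comment in submission.comments.list()
    ]


# Example usage (needs network access and real credentials):
# reddit = make_client("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", "health-scraper/0.1 by u/your_name")
# posts = collect_posts(reddit, "fitness", limit=25)
```

Passing the client into `collect_posts` instead of creating it inside the function makes it easy to swap in a stand-in object when testing without network access.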

#### Conclusion
This project provides a hands-on introduction to collecting data through Reddit's API and web scraping, and to analyzing content from a dynamic online community with Python.


<p style="color:#FBCE60;text-align:center;font-size:30px"> Scraping Reddit's Posts and Articles </p>

In [None]:
# Installing BeautifulSoup4
!pip install beautifulsoup4

# Installing Selenium
!pip install selenium















### Scraping Reddit's Health-Related Topics

In [7]:
# Importing necessary libraries
from bs4 import BeautifulSoup  # BeautifulSoup is used to parse HTML content.
from urllib.parse import urljoin  # Resolves relative links against a base URL.
from urllib.request import urlopen, Request  # Allows sending HTTP requests and opening URLs.

# Target URL containing information about Reddit health-related communities
url = "https://www.reddit.com/r/Health/wiki/communities/"

# List to store the collected information about subreddits
topicList = []

# Function to collect health-related subreddits
def collectSubReddits():
    # Creating an HTTP header that mimics a web browser, since requests sent with
    # the default urllib User-Agent may be rejected by the server.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.5615.137 Safari/537.36'
    }
    # Preparing the HTTP request with the target URL and headers
    request = Request(url, headers=headers)
    print(f"Navigating to {url}")  # Display the URL for user feedback

    # Sending the request and retrieving the page content
    with urlopen(request) as response:  # Opening the URL
        page_source = response.read()  # Reading the HTML content of the page

    # Creating a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(page_source, 'html.parser')

    # Searching for the division containing the list of subreddits. The "md wiki" class seems to identify this area in the page.
    divElement = soup.find("div", class_="md wiki")

    # Extracting the first unordered list (<ul>) inside that division
    listOfTopics = divElement.find("ul") if divElement is not None else None
    if listOfTopics is None:
        # Fail with a clear message if the page layout is not the expected one.
        raise RuntimeError("Could not find the list of subreddits on the page")

    # Iterating through each list item (<li>) element
    for liElement in listOfTopics.find_all("li"):
        # Searching for the <a> (link) tag in each list item
        aElement = liElement.find("a")
        if aElement is None or not aElement.get("href"):
            continue  # Skip list items that do not contain a usable link

        # Creating a dictionary to store information about the subreddit
        topic = {}
        topic["topicName"] = aElement.get_text(strip=True)  # Extracting the subreddit name (text of the link)
        # Constructing the full URL; urljoin handles both relative and absolute links
        topic["topicUrl"] = urljoin("https://old.reddit.com/", aElement.get("href"))

        # Adding the dictionary to the list of topics
        topicList.append(topic)

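Links on a wiki page can be relative (`/r/Health`) or already absolute, so building the full subreddit URL deserves some care. The sketch below shows how `urllib.parse.urljoin` from the standard library resolves each case against the `https://old.reddit.com/` base. The helper name `build_topic_url` and the sample `href` values are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://old.reddit.com/"


def build_topic_url(href, base=BASE_URL):
    """Resolve a link taken from the wiki page against the base URL."""
    # urljoin avoids doubled slashes and leaves absolute URLs untouched.
    return urljoin(base, href)


# Hypothetical href values covering the three common shapes of a link.
for href in ("/r/Health", "r/nutrition", "https://www.reddit.com/r/Fitness/"):
    print(href, "->", build_topic_url(href))
```

Plain string concatenation would turn `/r/Health` into a URL with a doubled slash and would corrupt links that are already absolute, which is why `urljoin` is the safer choice here.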

In [8]:
import os  # Used to create the output folder if it does not exist yet.
import pandas as pd  # Library for handling data in the form of DataFrames.

# Calling the function to collect subreddits
collectSubReddits()

# Converting the list of topics into a pandas DataFrame
# Each dictionary in `topicList` becomes a row in the DataFrame.
# A separate variable name keeps `topicList` a plain Python list.
topicsDf = pd.DataFrame(topicList)

# Saving the DataFrame as a CSV file
# The path can be modified based on the desired location.
outputPath = "../data/healthRedditCommunities.csv"
os.makedirs(os.path.dirname(outputPath), exist_ok=True)  # Create the folder if needed
topicsDf.to_csv(outputPath, index=False)  # `index=False` excludes the pandas index in the CSV file.

# Displaying a confirmation message about the save
print(f"The data has been saved in the file '{outputPath}'")


Navigating to https://www.reddit.com/r/Health/wiki/communities/
The data has been saved in the file '../data/healthRedditCommunities.csv'
