### Scraping Reddit Posts and Comments

In this project, we focus on extracting posts and comments from Reddit using web scraping and API interaction techniques. Reddit is a platform with a wealth of user-generated content on various topics, making it an excellent resource for applications such as sentiment analysis, trend monitoring, and topic modeling.

#### Objectives
- **Data Collection**: Retrieve posts and comments from specific subreddits based on topic criteria (e.g., r/mentalhealth, r/fitness).
- **Data Processing**: Clean and preprocess the extracted data to prepare it for analysis.
- **Data Storage**: Store the scraped data in a structured format, such as CSV or a database, for further analysis.
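
As a rough illustration of the processing and storage objectives, the sketch below cleans a couple of records and writes them out as CSV. It uses only the standard library so it runs without extra installs; the helper names (`clean_text`, `prepare_rows`, `write_csv`), the `subreddit` field, and the sample records are assumptions made for this example, not part of the project's code.

```python
import csv
import io


def clean_text(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return " ".join((text or "").split())


def prepare_rows(raw_posts):
    """Clean each post and drop entries whose text is empty after cleaning."""
    rows = []
    for post in raw_posts:
        text = clean_text(post.get("postText"))
        if text:
            rows.append({"subreddit": post.get("subreddit", ""), "postText": text})
    return rows


def write_csv(rows, handle):
    """Write cleaned rows to any file-like object in CSV form."""
    writer = csv.DictWriter(handle, fieldnames=["subreddit", "postText"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical sample input, not real Reddit data
sample = [
    {"subreddit": "fitness", "postText": "  Leg day\n tips?  "},
    {"subreddit": "mentalhealth", "postText": None},
]
buffer = io.StringIO()
write_csv(prepare_rows(sample), buffer)
print(buffer.getvalue())
```

Passing an open file instead of the in-memory buffer would write the same rows to disk; a Pandas `DataFrame` built from `prepare_rows(...)` could be saved with `to_csv` just as well.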

#### Tools and Technologies
- **Python**: The primary programming language for web scraping and API interaction.
- **Requests**: A library for making HTTP requests if additional scraping is needed.
- **Pandas**: A data manipulation library to handle and analyze the scraped data.
- **PRAW**: The Python Reddit API Wrapper, used to query Reddit's API with the registered credentials.
- **Selenium** and **BeautifulSoup**: Used to load rendered Reddit pages in a browser and parse the post text from the HTML.

#### Getting Started
1. **Set Up the Environment**: Install the necessary libraries using pip (`praw`, `requests`, `pandas`).
2. **Obtain API Credentials**: Create a Reddit account and register an application to get API credentials (client ID and secret), then choose a descriptive user agent string to send with each request.
3. **Define Data Extraction Logic**: Write functions to extract data from specific subreddits or threads based on keywords or categories.
4. **Run the Scraper**: Execute the script and monitor the data collection process.
5. **Analyze the Data**: Use Pandas to analyze the collected posts and comments for insights.
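
Steps 2 and 3 could look roughly like the sketch below, which assumes the API route through PRAW rather than browser scraping. The function names `make_client` and `collect_posts`, the record keys, and the placeholder credentials are hypothetical; the PRAW calls used (`praw.Reddit`, `subreddit(...).search`, `comments.replace_more`, `comments.list`) are part of its public API.

```python
def make_client(client_id, client_secret, user_agent):
    """Build a read-only PRAW client from the credentials of a registered app."""
    import praw  # imported lazily so collect_posts can be used without PRAW installed

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def collect_posts(reddit, subreddit_name, keyword, limit=25):
    """Return posts matching a keyword in one subreddit, with their comment bodies."""
    records = []
    for submission in reddit.subreddit(subreddit_name).search(keyword, limit=limit):
        # Drop "load more comments" placeholders instead of fetching them
        submission.comments.replace_more(limit=0)
        records.append(
            {
                "postId": submission.id,
                "title": submission.title,
                "postText": submission.selftext,
                # list() flattens the comment tree, so replies are included
                "comments": [comment.body for comment in submission.comments.list()],
            }
        )
    return records


# Example call (placeholders, not real credentials):
# reddit = make_client("CLIENT_ID", "CLIENT_SECRET", "health-scraper/0.1 by u/your_username")
# posts = collect_posts(reddit, "fitness", "stretching", limit=10)
```

Because `collect_posts` only relies on the attributes it reads, it can be exercised with a stand-in object before real credentials are available.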

#### Conclusion
This project serves as a practical introduction to using Reddit's API and data analysis with Python, providing valuable experience in handling real-world data from a vibrant online community.

<p style="color:#FBCE60;text-align:center;font-size:30px"> Scraping Reddit's Posts And Articles </p>

In [None]:
!pip install bs4
!pip install selenium

In [2]:
# importing packages
import requests
from bs4 import BeautifulSoup

### Scraping Health Boards

In [3]:
## importing libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup

<p style="color:#FBCE60;text-align:center;font-size:20px"> Searching for Reddit's health-related topics </p>

In [1]:
import pandas as pd
postDataset=pd.read_csv("../data/redditPostsDataset.csv")


In [2]:
# Check if there are any NaN values
if postDataset["postText"].isna().sum() > 0:
    # Find the index of the first NaN value
    first_na_index = postDataset["postText"].isna().idxmax()
    print(f"The index of the first NaN value is: {first_na_index}")
else:
    print("There are no NaN values in the 'postText' column.")


The index of the first NaN value is: 45280


In [3]:
from selenium import webdriver  # type: ignore
from selenium.webdriver.common.by import By  # type: ignore
from selenium.webdriver.support.ui import WebDriverWait  # type: ignore
from selenium.webdriver.support import expected_conditions as EC  # type: ignore
from bs4 import BeautifulSoup  # type: ignore


def cleanText(inputText):
    # Collapse line breaks and runs of whitespace into single spaces
    return " ".join(inputText.split())


def scrape_page(url):
    # Start a Chrome session (a matching ChromeDriver must be available)
    driver = webdriver.Chrome()
    try:
        driver.get(url)

        # Wait up to 10 seconds for the first paragraph of the post body
        # to be present in the DOM
        wait = WebDriverWait(driver, 10)
        wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div[slot='text-body'] p"))
        )

        # Parse the rendered page and return the first paragraph of the post body
        soup = BeautifulSoup(driver.page_source, "html.parser")
        new_content = soup.find("div", slot="text-body").find("p")
        return new_content.get_text()
    except Exception as e:
        print(f"Error processing post: {e}")
        return None
    finally:
        # Always close the browser so sessions do not pile up across calls
        driver.quit()




In [None]:
# Fill in postText for every row from the first missing value onward
for i in range(first_na_index, len(postDataset)):
    try:
        topic = postDataset.iloc[i]
        link = topic["commentsLink"]
        scrappingResult = cleanText(scrape_page(link))
        print(scrappingResult)
        postDataset.loc[i, "postText"] = scrappingResult
    except Exception:
        # Mark rows that could not be scraped so they can be found later
        postDataset.loc[i, "postText"] = "an error has occured while scrapping"

Does any work at grocery store upfront doing bagging, sweeping, carts, clean up spills etc., have autism/Asperger syndrome and has had a problem with a cashier you like, you gossiped to them about other coworkers and they where saying some rude things then the next week your manager brought you into the office to have a talk to you about gossiping to the cashier you like and told you that he or she heard you gossiping and told you to stop talking to that cashier and you also got told you where doing some other things wrong and the manager told you if you keep this up you could get get fired and lastly you got disappointed and sad because you can only talk to the cashier about work and you where doing some other things that could get you fired?
Possible autism
I am not a huge fan of all the depressing posts, although i get why people make them. I totally do.
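
Page loads can fail for transient reasons such as timeouts, so one option is to retry a page a few times before writing the error marker. The helper below is a minimal sketch of that idea, not part of the notebook: `scrape_with_retry`, its parameters, and the linear backoff are assumptions. It treats both an exception and a `None` result as a failed attempt.

```python
import time


def scrape_with_retry(scrape, url, attempts=3, delay=2.0, sleep=time.sleep):
    """Call scrape(url) up to `attempts` times, pausing longer after each failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = scrape(url)
            if result is not None:
                return result
            last_error = ValueError("scraper returned no content")
        except Exception as error:  # any scraper failure triggers another attempt
            last_error = error
        if attempt < attempts:
            sleep(delay * attempt)  # linear backoff: delay, 2 * delay, ...
    raise last_error
```

Inside the loop above, `scrape_with_retry(scrape_page, link)` could stand in for the direct `scrape_page(link)` call; the surrounding `try`/`except` would still record the error marker once all attempts fail.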


In [7]:
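# index=False keeps the row index out of the file, so reloading it with
# read_csv does not add an extra "Unnamed: 0" column on every save
postDataset.to_csv("../data/redditPostsDataset.csv", index=False)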