| column                          | datatype | explanation                                                     |
|---------------------------------|----------|-----------------------------------------------------------------|
| title                           | object   | Title of the post                                               |
| post_text                       | object   | Text content of the post                                        |
| id                              | object   | Unique identifier for the post                                  |
| score                           | int64    | Reddit score (net votes) of the post                            |
| total_comments                  | int64    | Total number of comments on the post                            |
| post_url                        | object   | URL of the post                                                 |
| subreddit                       | object   | Subreddit where the post was made                               |
| post_type                       | object   | Listing the post was collected from (new, hot, top, or rising)  |
| time_uploaded                   | object   | UTC timestamp when the post was created                         |
| stopword_dropped_title_and_text | object   | Title and text content with stopwords removed                   |
| title_text_stemmed              | object   | Title and text content after stemming                           |
| title_text_lemmatized           | object   | Title and text content after lemmatization                      |

In [1]:
# Standard library imports
import concurrent.futures
import datetime
import itertools
import re
import string
import time
from collections import Counter, defaultdict

# Third-party library imports
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import praw
import requests
from bs4 import BeautifulSoup
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import bigrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Custom Functions or Classes (if applicable)

In [2]:
# Set pandas display options to show the entire content of the "Post Text" column
#pd.set_option('display.max_colwidth', None)

In [3]:
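# Initialize the Reddit API client.
# Credentials are read from environment variables so that secrets are not stored in the notebook.
# The variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are assumptions; set them for your own Reddit app.
import os

redditscrapper = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent='(REDACTED NAME HERE)'
)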

In [4]:
# List of subreddit names to scrape
subreddit_names = ["intermittentfasting", "AnorexiaNervosa"]

In [5]:
# Dictionary to store post data
posts_dict = {
    "Title": [],
    "Post Text": [],
    "ID": [],
    "Score": [],
    "Total Comments": [],
    "Post URL": [],
    "Subreddit": [],
    "Post Type": [],  # Listing the post was fetched from (new, hot, top, rising)
    "Time uploaded": []
}

In [6]:
# Set to keep track of collected post IDs
collected_post_ids = set()

In [7]:
# Iterate through the subreddit names to test accessibility
for subreddit_name in subreddit_names:
    reddit_url = f"https://www.reddit.com/r/{subreddit_name}"
    response = requests.get(reddit_url, timeout=10)  # Time out rather than hang on a stalled request

    if response.status_code == 200:
        print(f"Success! The subreddit at {reddit_url} is accessible.")
    else:
        print(f"Error! The subreddit at {reddit_url} returned a status code of {response.status_code}.")

Success! The subreddit at https://www.reddit.com/r/intermittentfasting is accessible.
Success! The subreddit at https://www.reddit.com/r/AnorexiaNervosa is accessible.


## Web Scraping

In [8]:
# Map each post-type label to the name of the PRAW listing method used to fetch it
post_type_mapping = {
    "new": "new",
    "hot": "hot",
    "top": "top",
    "rising": "rising"
}

# Iterate through the subreddit names and post types to fetch and collect posts
for subreddit_name in subreddit_names:
    subreddit = redditscrapper.subreddit(subreddit_name)
    
    for post_type, fetch_function in post_type_mapping.items():
        # limit=None asks PRAW for as many posts as the API will return (Reddit caps each listing at roughly 1,000 items)
        posts = getattr(subreddit, fetch_function)(limit=None)
        
        for post in posts:
            if post.id not in collected_post_ids:
                collected_post_ids.add(post.id)
                # Skip posts whose text has been deleted or removed
                if post.selftext not in ("[deleted]", "[removed]"):
                    posts_dict["Title"].append(post.title)
                    posts_dict["Post Text"].append(post.selftext)
                    posts_dict["ID"].append(post.id)
                    posts_dict["Score"].append(post.score)
                    posts_dict["Total Comments"].append(post.num_comments)
                    posts_dict["Post URL"].append(post.url)
                    posts_dict["Subreddit"].append(subreddit_name)
                    posts_dict["Post Type"].append(post_type)
                    posts_dict["Time uploaded"].append(post.created_utc)

# Create a DataFrame from the collected data
all_posts = pd.DataFrame(posts_dict)

# Convert the "Time uploaded" column from Unix epoch seconds (UTC) to datetime in one vectorized call
all_posts['Time uploaded'] = pd.to_datetime(all_posts['Time uploaded'], unit='s')

# Print a summary of the collected data
print(f"Total number of non-deleted/non-removed posts collected: {len(all_posts)}")
all_posts.head()

Total number of non-deleted/non-removed posts collected: 3960


|   | Title | Post Text | ID | Score | Total Comments | Post URL | Subreddit | Post Type | Time uploaded |
|---|-------|-----------|----|-------|----------------|----------|-----------|-----------|---------------|
| 0 | Does taking flavoured creatine break a fast? | Taking one scoop, roughly 3g. It has sucralose... | 16shh83 | 1 | 0 | https://www.reddit.com/r/intermittentfasting/c... | intermittentfasting | new | 2023-09-26 07:57:13 |
| 1 | I lost 120 lbs.......she lost 80. One meal a d... |  | 16shbmz | 6 | 1 | https://i.redd.it/cft42u8lso151.jpg | intermittentfasting | new | 2023-09-26 07:46:54 |
| 2 | Does fasting out of spite work? | We’ll see in 4 weeks when I go to a wedding wh... | 16sfrlc | 0 | 2 | https://www.reddit.com/r/intermittentfasting/c... | intermittentfasting | new | 2023-09-26 06:10:27 |
| 3 | Daily Fasting Check-in! | \* \*\*Type\*\* of fast (water, juice, smoking, etc... | 16sfl07 | 1 | 0 | https://www.reddit.com/r/intermittentfasting/c... | intermittentfasting | new | 2023-09-26 06:00:31 |
| 4 | 90 Days of Intermittent Fasting - IT WORKS! | Hi Everyone, \n\nToday was the 90th day of my ... | 16sdl2e | 17 | 8 | https://www.reddit.com/r/intermittentfasting/c... | intermittentfasting | new | 2023-09-26 04:10:24 |
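
The collection loop deduplicates by post ID across the four listings, so each post keeps the label of the first listing in which it appears, and posts whose text is `[deleted]` or `[removed]` are dropped. The sketch below replays that logic offline. `collect_posts`, `make_post`, and `fake_subreddit` are hypothetical names introduced only for illustration, and the `SimpleNamespace` objects are stand-ins for PRAW submissions, not a reproduction of them.

```python
from types import SimpleNamespace

# Text markers that identify posts to skip
SKIP_MARKERS = {"[deleted]", "[removed]"}


def collect_posts(subreddit, listing_names, limit=None):
    """Collect posts from several listings, keeping each post ID only once."""
    seen_ids = set()
    rows = []
    for listing_name in listing_names:
        fetch = getattr(subreddit, listing_name)  # e.g. subreddit.new
        for post in fetch(limit=limit):
            if post.id in seen_ids:
                continue  # already seen in an earlier listing
            seen_ids.add(post.id)
            if post.selftext in SKIP_MARKERS:
                continue  # deleted or removed text
            rows.append(
                {"ID": post.id, "Title": post.title, "Post Type": listing_name}
            )
    return rows


def make_post(post_id, title, selftext):
    """Hypothetical stand-in exposing the attributes the loop reads."""
    return SimpleNamespace(id=post_id, title=title, selftext=selftext)


# Hypothetical stand-in for a subreddit with two listings
fake_subreddit = SimpleNamespace(
    new=lambda limit=None: [
        make_post("a1", "First", "some text"),
        make_post("b2", "Second", "[removed]"),
    ],
    hot=lambda limit=None: [
        make_post("a1", "First", "some text"),  # duplicate of a post in "new"
        make_post("c3", "Third", ""),           # empty text (e.g. an image post) is kept
    ],
)

rows = collect_posts(fake_subreddit, ["new", "hot"])
print([(row["ID"], row["Post Type"]) for row in rows])
# [('a1', 'new'), ('c3', 'hot')]
```

One consequence of this design is that the "Post Type" column depends on the order in which listings are visited: a post that appears in both `new` and `hot` is recorded only as `new`.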


In [9]:
# Save the data to a CSV file with datetime format
all_posts.to_csv("reddit_posts_datetime.csv", index=False)
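
A CSV file stores datetimes as text, so the "Time uploaded" column has to be parsed again when the file is read back. The sketch below illustrates this with an in-memory buffer instead of a file. The two-row frame is made up for illustration, with epoch values chosen to correspond to the first two timestamps in the output above, and `parse_dates` is one way, not necessarily the one used in the cleaning notebook, to restore the datetime type.

```python
import io

import pandas as pd

# Two illustrative rows with Unix epoch seconds, as PRAW's created_utc supplies them
frame = pd.DataFrame(
    {"ID": ["x1", "x2"], "Time uploaded": [1695715033, 1695714414]}
)
frame["Time uploaded"] = pd.to_datetime(frame["Time uploaded"], unit="s")

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Reading back without parse_dates leaves the column as plain text
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Reading back with parse_dates restores a datetime column
buffer.seek(0)
as_datetime = pd.read_csv(buffer, parse_dates=["Time uploaded"])

print(as_text["Time uploaded"].dtype)
print(as_datetime["Time uploaded"].dtype)
print(as_datetime["Time uploaded"].iloc[0])
```

This is consistent with the data dictionary at the top, which lists `time_uploaded` as a text-like `object` column rather than a datetime.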

[Click here to see the notebook used to clean the scraped data](Project%203%20Cleaning%20%28caa%20250923%202057%29.ipynb)