# Data acquisition

We used the **PRAW** (Python Reddit API Wrapper) library to programmatically access Reddit data through its API.


In [1]:
!pip install praw pandas numpy



Importing the required packages:

In [2]:
import praw
import pandas as pd



Setting up credentials required to authenticate with Reddit's API using PRAW:

In [3]:
# --- CONFIG ---
CLIENT_ID = '3Ptv1n3uzKL-RaqAQnrMlg'
CLIENT_SECRET = 'pa5OheU7NtiIw6jl5MaFAz8ouLrZDQ'
USER_AGENT = 'reddit-popularity-predictor'
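Hard-coding credentials in a notebook exposes them to anyone who can read the file. A minimal sketch of one alternative is to read them from the environment. The variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are hypothetical and not part of the original setup.

```python
import os

# Hypothetical variable names; choose whatever fits your environment.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")


def load_reddit_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
```

With both variables set, `CLIENT_ID, CLIENT_SECRET = load_reddit_credentials()` could stand in for the literal strings.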

Fetching posts from a range of different subreddits:

In [4]:
SUBREDDITS = ['technology', 'sports', 'funny', 'science', 'politics', 'gaming', 'movies']
POSTS_PER_SUBREDDIT = 750
SAMPLE_PER_BUCKET = 300 # how many posts per popularity bucket to keep

In [5]:
# Initialize Reddit API
reddit = praw.Reddit(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    user_agent=USER_AGENT
)

In [6]:

def fetch_posts(subreddit, sort, limit):
    """Fetch posts from a subreddit with given sort and limit."""
    posts = []
    submissions = getattr(reddit.subreddit(subreddit), sort)(limit=limit)
    for submission in submissions:
        posts.append({
            'subreddit': subreddit,
            'id': submission.id,
            'title': submission.title,
            'selftext': submission.selftext,
            'score': submission.score,
            'num_comments': submission.num_comments,
            'created_utc': submission.created_utc,
            'flair': submission.link_flair_text,
            'upvote_ratio': submission.upvote_ratio,
            'is_self': submission.is_self,
            'nsfw': submission.over_18,
            'author': str(submission.author),
            'url': submission.url,
            'sort_type': sort
        })
    return posts
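`fetch_posts` selects the listing by name: `getattr(reddit.subreddit(subreddit), sort)` looks up the method called `sort` (for example `new` or `top`), and the trailing call invokes it with `limit`. The sketch below illustrates that dispatch with a stand-in class rather than PRAW itself. `FakeSubreddit`, `ALLOWED_SORTS`, and `get_listing` are hypothetical names, and the allow-list is an assumption added here so that a misspelled sort fails early.

```python
class FakeSubreddit:
    """Stand-in for a PRAW subreddit; returns labelled placeholder items."""

    def new(self, limit):
        return [f"new-{i}" for i in range(limit)]

    def top(self, limit):
        return [f"top-{i}" for i in range(limit)]


# Assumed allow-list; the original code accepts any attribute name.
ALLOWED_SORTS = {"new", "top"}


def get_listing(subreddit, sort, limit):
    """Call the listing method named by `sort`, rejecting unknown names."""
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"Unsupported sort: {sort!r}")
    # Equivalent to subreddit.new(limit=limit) or subreddit.top(limit=limit).
    return getattr(subreddit, sort)(limit=limit)


items = get_listing(FakeSubreddit(), "top", 3)
```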


In [7]:
all_posts = []

# Fetch different types of posts (new posts, top posts)
for sub in SUBREDDITS:
    print(f"Fetching new posts from r/{sub}...")
    all_posts.extend(fetch_posts(sub, 'new', POSTS_PER_SUBREDDIT))

for sub in SUBREDDITS:
    print(f"Fetching top posts from r/{sub}...")
    all_posts.extend(fetch_posts(sub, 'top', POSTS_PER_SUBREDDIT))

# Create DataFrame
df = pd.DataFrame(all_posts)

# Remove duplicates (some posts may appear in both new and top)
df = df.drop_duplicates(subset='id')

print(f"Total posts before bucketing: {len(df)}")

Fetching new posts from r/technology...
Fetching new posts from r/sports...
Fetching new posts from r/funny...
Fetching new posts from r/science...
Fetching new posts from r/politics...
Fetching new posts from r/gaming...
Fetching new posts from r/movies...
Fetching top posts from r/technology...
Fetching top posts from r/sports...
Fetching top posts from r/funny...
Fetching top posts from r/science...
Fetching top posts from r/politics...
Fetching top posts from r/gaming...
Fetching top posts from r/movies...
Total posts before bucketing: 10108
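By default, `drop_duplicates` keeps the first occurrence of each `id`, so a post that appears in both listings keeps the `sort_type` of the listing fetched first (here `new`). The following pandas-free sketch applies the same keep-first rule to made-up rows. The helper `drop_duplicate_ids` is hypothetical.

```python
def drop_duplicate_ids(rows):
    """Keep the first row seen for each 'id', preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


# Made-up rows: post "a1" shows up in both the new and the top listing.
rows = [
    {"id": "a1", "sort_type": "new"},
    {"id": "b2", "sort_type": "new"},
    {"id": "a1", "sort_type": "top"},
    {"id": "c3", "sort_type": "top"},
]
deduped = drop_duplicate_ids(rows)
```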


In [8]:
df.head()

| | subreddit | id | title | selftext | score | num_comments | created_utc | flair | upvote_ratio | is_self | nsfw | author | url | sort_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | technology | 1lzgoop | Disaster Looms As President Trump Plans To Def... | | 80 | 6 | 1752480000.0 | Space | 0.93 | False | False | upyoars | https://autos.yahoo.com/articles/disaster-loom... | new |
| 1 | technology | 1lzfv56 | You can still enable uBlock Origin in Chrome, ... | | 6 | 16 | 1752477000.0 | Software | 0.58 | False | False | moeka_8962 | https://www.neowin.net/guides/you-can-still-en... | new |
| 2 | technology | 1lzfoze | Japan using generative AI less than other coun... | | 409 | 43 | 1752477000.0 | Artificial Intelligence | 0.96 | False | False | moeka_8962 | https://www3.nhk.or.jp/nhkworld/en/news/202507... | new |
| 3 | technology | 1lze324 | ‘Fossil fuel flunkies’: US senator warns of Bi... | | 100 | 3 | 1752471000.0 | Energy | 0.96 | False | False | upyoars | https://www.straitstimes.com/world/united-stat... | new |
| 4 | technology | 1lzdxu7 | Security vulnerability on U.S. trains that let... | | 152 | 17 | 1752470000.0 | Security | 0.97 | False | False | SelflessMirror | https://www.tomshardware.com/tech-industry/cyb... | new |
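The `created_utc` column holds Unix timestamps in seconds. As a minimal sketch, the standard library can convert one into a timezone-aware datetime. The value below is the one displayed for the first row, which may be rounded in the preview.

```python
from datetime import datetime, timezone

created_utc = 1752480000.0  # value shown for the first row of df.head()
created_at = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(created_at.isoformat())
```

For the whole column, `pd.to_datetime(df['created_utc'], unit='s', utc=True)` does the same conversion in pandas.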


Instead of predicting Reddit post scores directly (a regression task), we simplify the problem into a classification task by grouping the scores into three buckets: low, medium, and high popularity.

The scores are split into these three categories at the 33rd and 66th percentiles. This turns the continuous `score` into a new categorical variable, `popularity_bucket`, which is suitable for classification models.


In [9]:
# --- Bucket scores into low/medium/high popularity ---

# Buckets could be defined by fixed thresholds or by score quantiles.
# Here: use the 33rd and 66th percentiles to split into 3 roughly equal groups

quantiles = df['score'].quantile([0.33, 0.66]).values
low_threshold, high_threshold = quantiles[0], quantiles[1]

def bucket_score(score):
    if score <= low_threshold:
        return 'low'
    elif score <= high_threshold:
        return 'medium'
    else:
        return 'high'

df['popularity_bucket'] = df['score'].apply(bucket_score)

print(df['popularity_bucket'].value_counts())

popularity_bucket
high      3437
low       3336
medium    3335
Name: count, dtype: int64
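To make the thresholding concrete, here is a pandas-free sketch on made-up scores. It assumes linear interpolation between neighbouring values, which is the default behaviour of pandas `quantile`. The helper names `linear_quantile` and `bucket` are hypothetical.

```python
def linear_quantile(sorted_values, q):
    """Quantile with linear interpolation between neighbouring sorted values."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    step = sorted_values[upper] - sorted_values[lower]
    return sorted_values[lower] + (pos - lower) * step


def bucket(score, low_threshold, high_threshold):
    """Same rule as bucket_score, with the thresholds passed in explicitly."""
    if score <= low_threshold:
        return "low"
    if score <= high_threshold:
        return "medium"
    return "high"


# Made-up scores, already sorted.
scores = [0, 1, 2, 5, 10, 50, 100, 500, 1000, 5000]
low_t = linear_quantile(scores, 0.33)
high_t = linear_quantile(scores, 0.66)
labels = [bucket(s, low_t, high_t) for s in scores]
counts = {name: labels.count(name) for name in ("low", "medium", "high")}
```

Because the cut points are 0.33 and 0.66 rather than exact thirds, the top bucket covers the remaining 34% of posts and tends to be slightly larger, which matches the counts above.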


Since the dataset is already balanced across the `popularity_bucket` categories, we don't need any additional resampling.
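The balance can be checked directly from the bucket counts printed above. The snippet below is a small sketch that computes each bucket's share and the ratio between the largest and smallest bucket. Whether a given ratio counts as "balanced" is a judgment call, not a fixed rule.

```python
# Counts copied from the value_counts() output above.
counts = {"high": 3437, "low": 3336, "medium": 3335}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"largest/smallest bucket: {imbalance_ratio:.3f}")
```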

In [10]:
print("Length of the dataset:", len(df))

Length of the dataset: 10108


Finally, we save the data to a CSV file:

In [11]:
# --- Save dataset ---
df.to_csv('../data/reddit_dataset.csv', index=False)
print("Saved dataset to reddit_dataset.csv")

Saved dataset to reddit_dataset.csv
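`to_csv` does not create missing directories, so the save step fails if `../data` does not exist yet. A minimal sketch of a guard, using a hypothetical helper `ensure_parent_dir`:

```python
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return it as a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

It could wrap the output path in the save call, for example `df.to_csv(ensure_parent_dir('../data/reddit_dataset.csv'), index=False)`.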
