# Reddit API Project (Gym/Fitness)

**Class:** MSBA 212  
**Student:** Mancy Khadka  
**Topic:** Gym / Fitness discussions on Reddit

This project pulls posts from **r/Gymshark**, **r/bodybuilding**, and **r/Fitness**.  
I collected “hot” posts and also ran a keyword search for **"preworkout"**.  
Then I cleaned the data, removed duplicates, and saved everything to **`reddit_data.csv`**.

## 1. Packages
Installing and importing the required libraries (`praw`, `pandas`, `python-dotenv`).


In [73]:
%pip install -q praw python-dotenv pandas

import os
import pandas as pd
import praw
from dotenv import dotenv_values

## 2. Credentials
Loading `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` from the dotenv file at `ENV_FILE`, and failing early if the file or any key is missing.

In [74]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [75]:
from dotenv import dotenv_values
import os

ENV_FILE = "/content/reddit"

if not os.path.exists(ENV_FILE):
    raise FileNotFoundError(f"Not found: {ENV_FILE}")

cfg = dotenv_values(ENV_FILE)

required = ["REDDIT_CLIENT_ID","REDDIT_CLIENT_SECRET","REDDIT_USER_AGENT"]
missing = [k for k in required if not cfg.get(k)]
if missing:
    raise ValueError(f"Missing keys in {ENV_FILE}: {missing}")

print(f"Environment variables loaded from {ENV_FILE}")
print({k: ("set" if bool(v) else None) for k, v in cfg.items()})

Environment variables loaded from /content/reddit
{'REDDIT_CLIENT_ID': 'set', 'REDDIT_CLIENT_SECRET': 'set', 'REDDIT_USER_AGENT': 'set'}


## 3. PRAW Authentication
Creating the `praw.Reddit` client from the loaded credentials and setting `read_only = True`.



In [76]:

reddit = praw.Reddit(
    client_id=cfg["REDDIT_CLIENT_ID"],
    client_secret=cfg["REDDIT_CLIENT_SECRET"],
    user_agent=cfg["REDDIT_USER_AGENT"],
)

reddit.read_only = True



## 4. Parameter Configuration

In [77]:
SUBS = ["Gymshark", "bodybuilding", "Fitness"]
HOT_LIMIT = 50
SEARCH_TERM = "preworkout"
SEARCH_LIMIT = 30
OUT_PATH = "/content/reddit_data.csv"
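
As a rough sanity check on these limits, the collection step can produce at most one row per hot post and one per search hit in each subreddit. A minimal sketch of that upper bound, reusing the values above (`max_rows` is a name introduced here for illustration):

```python
# Values copied from the configuration cell above.
SUBS = ["Gymshark", "bodybuilding", "Fitness"]
HOT_LIMIT = 50
SEARCH_LIMIT = 30

# Upper bound before de-duplication: each subreddit contributes at most
# HOT_LIMIT hot posts plus SEARCH_LIMIT search hits.
max_rows = len(SUBS) * (HOT_LIMIT + SEARCH_LIMIT)
print(max_rows)  # 240
```

The actual count can be lower when a search returns fewer hits than `SEARCH_LIMIT` or when the same post appears in both listings.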


## 5. Helper function
Turning submissions into rows and keeping the columns consistent.


In [78]:
from typing import Optional, List, Dict, Any
import pandas as pd

REQUIRED_COLUMNS = [
    "title", "score", "upvote_ratio", "num_comments", "author", "subreddit",
    "url", "permalink", "created_utc", "is_self", "selftext", "flair",
    "domain", "search_query"
]

def safe_author(author_obj) -> Optional[str]:
    """Return author name or None; robust to deleted/suspended authors."""
    try:
        return None if author_obj is None else getattr(author_obj, "name", None)
    except Exception:
        return None

def trunc(text: Optional[str], n: int = 500) -> Optional[str]:
    """Truncate selftext to n chars per rubric (None-safe)."""
    if text is None:
        return None
    return text[:n] if len(text) > n else text

def submission_to_row(s, sub_name: str, search_query: Optional[str] = None) -> Dict[str, Any]:
    """
    Map a PRAW Submission to the assignment schema with defensive gets.
    Adds `search_query` for provenance when the row came from a keyword search.
    """
    permalink = (
        f"https://www.reddit.com{getattr(s, 'permalink', '')}"
        if getattr(s, "permalink", None) else None
    )

    created_val = getattr(s, "created_utc", None)
    created_utc = int(created_val) if created_val is not None else None

    return {
        "title": getattr(s, "title", None),
        "score": getattr(s, "score", None),
        "upvote_ratio": getattr(s, "upvote_ratio", None),
        "num_comments": getattr(s, "num_comments", None),
        "author": safe_author(getattr(s, "author", None)),
        "subreddit": sub_name,
        "url": getattr(s, "url", None),
        "permalink": permalink,
        "created_utc": created_utc,
        "is_self": getattr(s, "is_self", None),
        "selftext": trunc(getattr(s, "selftext", None), 500),
        "flair": getattr(s, "link_flair_text", None),
        "domain": getattr(s, "domain", None),
        "search_query": search_query,
    }

def sub_to_rows(sub_name: str, items, query: Optional[str] = None) -> List[Dict[str, Any]]:
    """Convert an iterable of submissions to a list of dict rows for one subreddit."""
    return [submission_to_row(s, sub_name=sub_name, search_query=query) for s in items]

def make_dataframe(rows: List[Dict[str, Any]]) -> pd.DataFrame:
    """
    Convert rows to a DataFrame, drop duplicates (by `id` and `permalink`
    when those columns exist; rows built by `submission_to_row` carry only
    `permalink`), and enforce the assignment column order.
    """
    df = pd.DataFrame(rows)

    for key in ("id", "permalink"):
        if key in df.columns:
            df = df.drop_duplicates(subset=[key], keep="first")

    df = df.reindex(columns=REQUIRED_COLUMNS)

    return df
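
The defensive helpers can be checked offline, without calling Reddit, by passing in simple stand-in objects. The sketch below is only illustrative: it repeats the logic of `safe_author` and `trunc` so that it runs on its own, and the `normal` and `deleted` objects are hypothetical stand-ins, not real PRAW submissions.

```python
from types import SimpleNamespace
from typing import Optional

def safe_author(author_obj) -> Optional[str]:
    """Same logic as the helper above: author name, or None if unavailable."""
    return None if author_obj is None else getattr(author_obj, "name", None)

def trunc(text: Optional[str], n: int = 500) -> Optional[str]:
    """Same logic as the helper above: cap text at n characters (None-safe)."""
    return None if text is None else text[:n]

# Hypothetical stand-ins for submissions (made-up values, not Reddit data).
normal = SimpleNamespace(author=SimpleNamespace(name="example_user"), selftext="x" * 800)
deleted = SimpleNamespace(author=None, selftext=None)

print(safe_author(normal.author), len(trunc(normal.selftext)))  # example_user 500
print(safe_author(deleted.author), trunc(deleted.selftext))     # None None
```

A post whose author was deleted comes through with `author=None` instead of raising an error, and long self-text is capped at 500 characters.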


## 6. Data Collection (hot + search) with console summaries


In [79]:
all_rows = []

# Hot posts
for sub in SUBS:
    posts = list(reddit.subreddit(sub).hot(limit=HOT_LIMIT))
    all_rows += sub_to_rows(sub, posts, query=None)
    print(f"Collected {len(posts)} hot posts from r/{sub}.")

# Keyword search
if SEARCH_TERM:
    for sub in SUBS:
        hits = list(reddit.subreddit(sub).search(query=SEARCH_TERM, sort="relevance", limit=SEARCH_LIMIT))
        all_rows += sub_to_rows(sub, hits, query=SEARCH_TERM)
        print(f"Collected {len(hits)} search hits for '{SEARCH_TERM}' in r/{sub}.")

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

Collected 50 hot posts from r/Gymshark.
Collected 50 hot posts from r/bodybuilding.
Collected 50 hot posts from r/Fitness.
Collected 1 search hits for 'preworkout' in r/Gymshark.
Collected 30 search hits for 'preworkout' in r/bodybuilding.
Collected 30 search hits for 'preworkout' in r/Fitness.
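
The same post can show up in both the hot listing and the search results for a subreddit, which is why the rows are de-duplicated by permalink before saving. Below is a minimal plain-Python sketch of that keep-first behavior. The `dedupe_by_key` helper and the example rows are hypothetical and are not part of the notebook, which uses pandas for this step.

```python
from typing import Any, Dict, List

def dedupe_by_key(rows: List[Dict[str, Any]], key: str = "permalink") -> List[Dict[str, Any]]:
    """Keep only the first row seen for each value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        value = row.get(key)
        if value in seen:
            continue  # a row with this key value was already kept
        seen.add(value)
        unique.append(row)
    return unique

# Made-up rows: the first and third share a permalink (hot copy, then search copy).
rows = [
    {"permalink": "https://www.reddit.com/r/example/comments/aaa/", "search_query": None},
    {"permalink": "https://www.reddit.com/r/example/comments/bbb/", "search_query": None},
    {"permalink": "https://www.reddit.com/r/example/comments/aaa/", "search_query": "preworkout"},
]

unique = dedupe_by_key(rows)
print(len(unique))                 # 2
print(unique[0]["search_query"])   # None
```

Because hot posts are collected before search hits, keeping the first copy means a post found by both routes keeps an empty `search_query`. If search provenance matters more, collecting the search hits first would reverse that.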


## 7. Data Cleaning, De-duplication, and CSV Export

In [80]:
# Reuse the helper from Section 5: it builds the DataFrame, drops duplicate
# permalinks, and enforces the required column order.
df = make_dataframe(all_rows)

df.to_csv(OUT_PATH, index=False)
print(f"Saved {len(df)} rows to {OUT_PATH}")
df.head(3)


Saved 211 rows to /content/reddit_data.csv


,title,score,upvote_ratio,num_comments,author,subreddit,url,permalink,created_utc,is_self,selftext,flair,domain,search_query
0,BUY | TRADE | SELL Megathread,11,0.93,41,V0dka-Coke,Gymshark,https://www.reddit.com/r/Gymshark/comments/1ok...,https://www.reddit.com/r/Gymshark/comments/1ok...,1761847172,True,Going forwards we will be trialling a megathre...,BUY / SELL / TRADE,self.Gymshark,
1,Onyx quality,10,0.92,8,Immediate-Step4130,Gymshark,https://i.redd.it/2s6ch09mlwyf1.jpeg,https://www.reddit.com/r/Gymshark/comments/1om...,1762115345,False,Anyone else received the onyx with this square...,Question,i.redd.it,
2,size rec,2,1.0,0,Any-Macaron-2212,Gymshark,https://www.reddit.com/r/Gymshark/comments/1on...,https://www.reddit.com/r/Gymshark/comments/1on...,1762145796,True,"hi, I wanna get my friend a gymshark compressi...",Question,self.Gymshark,
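
The `created_utc` column holds Unix epoch seconds, which are hard to read at a glance. A small sketch of converting one of the values above into an ISO 8601 UTC timestamp with the standard library (`epoch_to_iso` is a helper name introduced here, not part of the notebook):

```python
from datetime import datetime, timezone

def epoch_to_iso(created_utc: int) -> str:
    """Convert Unix epoch seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()

# created_utc of the first row in the preview above.
print(epoch_to_iso(1761847172))  # 2025-10-30T17:59:32+00:00
```

Keeping the raw integer in the CSV and converting only for display or analysis avoids any time zone ambiguity in the saved file.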


## 8. GitHub
Pushing the project folder to the GitHub repository.

In [81]:
%cd /content/drive/MyDrive/assignment_folder

/content/drive/MyDrive/assignment_folder


In [82]:
from getpass import getpass

GITHUB_USERNAME = "mancykhadka"
REPO_NAME = "reddit-api-assignment"

# Prompt for the token so it never appears in the notebook source.
token = getpass("GitHub PAT: ").strip()

# The token is embedded in the remote URL, so this URL must not be echoed.
remote_url = f"https://{GITHUB_USERNAME}:{token}@github.com/{GITHUB_USERNAME}/{REPO_NAME}.git"

!git remote remove origin 2>/dev/null || true
!git remote add origin "$remote_url"
!git branch -M main
# Mask the credentials before listing remotes so the token stays out of the output.
!git remote -v | sed -E 's#//[^@]+@#//***@#'


GitHub PAT: ··········
origin	https://***@github.com/mancykhadka/reddit-api-assignment.git (fetch)
origin	https://***@github.com/mancykhadka/reddit-api-assignment.git (push)


In [83]:
!git push -u origin main


Branch 'main' set up to track remote branch 'main' from 'origin'.
Everything up-to-date
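
Because the remote URL carries the access token, any URL that is printed or logged should have its credentials masked first. A minimal sketch using only the standard library follows. The `redact_credentials` helper and the example URL are hypothetical and are not part of the notebook.

```python
from urllib.parse import urlsplit, urlunsplit

def redact_credentials(url: str) -> str:
    """Return `url` with any user:password part replaced by '***'."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url  # nothing to hide
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, f"***@{host}", parts.path, parts.query, parts.fragment))

# Made-up URL with placeholder credentials.
print(redact_credentials("https://someuser:sometoken@github.com/someuser/some-repo.git"))
# https://***@github.com/someuser/some-repo.git
```

A token that has already appeared in saved notebook output should be treated as compromised and revoked in the GitHub settings.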
