# Computational Methods - HT26 
## Oxford Internet Institute
## Week 3 - Parsing online data

Note: This formative will be marked by the TA and/or the instructor. It is due Monday of Week 5 at 11am.

**Data**: Reddit - collect yourself; only SFW subs that do not require registration.

**Claim**: To be determined by you. 

**Representation**: Comparison between two groups. This can be statistical or narrative. You are welcome to use graphics. 

**AI tooling**: I believe that AI code is likely to be _more_ robust and accurate than self-written code. You are welcome to use AI for this work. Much of the key conceptual work will be involved in appropriate case selection for a legitimate comparison between two subreddits. 

- You will see that there is a lot of code below. It is mostly AI written but it does follow my logic very closely. For this assignment, I will assume that you will be using AI but I will also assume that you will be providing: 
  - **Skepticism**. How can you establish when your code is wrong, or when the data is off? 
  - **Keen eye for defaults**. There are a couple of places where I created a little break. Notice below where I wrote `if i > 5: break`. That stops the loop after the first 6 stories (indices 0 to 5). It should only be used in testing. You'll want all 25 stories for your work. 
  - **Focus on your core tasks and goals**. AI can sometimes be a bit enthusiastic. It might recommend tests or approaches that you do not fully understand. You must ensure that all claims are defensible within your write-up. AI should NOT be used for any writing, only for the code that follows your prompting. 
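
One way to keep a testing limit visible, rather than buried in a loop as a hard-coded `break`, is to make it an explicit parameter. This is a minimal sketch; the helper `limit_items` and its `max_items` argument are hypothetical and not part of the lab code.

```python
from itertools import islice


def limit_items(items, max_items=None):
    """Return an iterator over at most max_items items; None means no limit."""
    # islice with a stop of None passes every item through unchanged
    return islice(items, max_items)


stories = [f"story {n}" for n in range(25)]  # stand-in for 25 downloaded posts

test_run = list(limit_items(stories, max_items=5))  # quick test while developing
full_run = list(limit_items(stories))               # the real collection

print(len(test_run), len(full_run))
```

With the limit in the function signature, a reader can see at a glance whether a run was a test or the full collection.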


# Exercise 1. Getting the data from online

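We will be downloading Reddit data "the hard way". This is also the slow way, but it is useful for this exercise. You will need the `requests` library. It is not part of the Python standard library, but it is preinstalled in many Python distributions; if the import fails, install it with `pip install requests`.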

First, we will want to create a means of collecting a single Reddit post. Navigate online to find a post of your choice and then notice the first part of the URL. For example, if I go to: 
https://www.reddit.com/r/LabourUK/comments/1qtp6ee/in_gorton_and_denton_i_found_a_longfestering/

If you then replace www.reddit.com with api.reddit.com, the data will download as JSON. We will collect this data. 

Below, set `post_url` to the URL of your chosen post to collect the JSON. You will also need to edit the `User-Agent` header of the request (replace the <...>) so that you are uniquely identified. Then I want you to navigate the returned data. What we are looking to do is find the part where there is a list of comments. Report how many comments came down and compare this to the number of comments reported on the site. 

We want to know how many different users made a comment in this specific data. This is not the 'full comment tree' but instead is a truncated tree where you can request more. We will not be doing that in this lab. 
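
As a rough guide to navigating the response, the sketch below walks a small hand-made stand-in for the JSON (not real Reddit data). It assumes the layout used in this lab: a two-item list with the post first and the comments second, where real comments have `kind` equal to `t1` and truncated branches appear as `more` objects. Treating `[deleted]` as a non-user is my assumption; check how your own data marks removed accounts.

```python
# Toy stand-in for the structure returned by the API (all values are invented)
data = [
    {"data": {"children": [
        {"kind": "t3", "data": {"title": "Example post", "num_comments": 5}},
    ]}},
    {"data": {"children": [
        {"kind": "t1", "data": {"author": "user_a", "body": "First"}},
        {"kind": "t1", "data": {"author": "user_b", "body": "Second"}},
        {"kind": "t1", "data": {"author": "user_a", "body": "Third"}},
        {"kind": "t1", "data": {"author": "[deleted]", "body": "[removed]"}},
        {"kind": "more", "data": {"count": 1, "children": ["abc123"]}},
    ]}},
]

post = data[0]["data"]["children"][0]["data"]
children = data[1]["data"]["children"]

# Keep only real comments; "more" objects are placeholders for truncated branches
comments = [c["data"] for c in children if c["kind"] == "t1"]

# Count distinct authors, ignoring the placeholder for deleted accounts (assumption)
authors = {c["author"] for c in comments if c["author"] != "[deleted]"}

print(f"Reported on the site: {post['num_comments']}")
print(f"Top-level comments downloaded: {len(comments)}")
print(f"Unique authors: {len(authors)}")
```

Only the top-level entries are counted here. Replies are nested inside each comment, which is one reason the downloaded count can be lower than the number shown on the site.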


In [8]:
import requests
import pandas as pd
from pandas import json_normalize
import time


In [None]:

def get_reddit_post(url, user_agent):
    """
    Download Reddit post data as JSON.
    
    Args:
        url: A reddit post URL (www.reddit.com or api.reddit.com)
        user_agent: Unique identifier for your requests (edit this!)
    
    Returns:
        Parsed JSON data from the Reddit API
    """
    # Convert www.reddit.com to api.reddit.com if needed
    api_url = url.replace("www.reddit.com", "api.reddit.com")
    
    # Ensure URL ends with .json (alternative to using api subdomain)
    if not api_url.endswith(".json"):
        api_url = api_url.rstrip("/") + ".json"
    
    headers = {
        "User-Agent": user_agent
    }
    
    response = requests.get(api_url, headers=headers)
    response.raise_for_status()  # Raise exception for HTTP errors
    
    data = response.json()  # Equivalent to json.loads(response.text)
    
    return data


def parse_reddit_data(data):
    """
    Parse Reddit JSON data into DataFrames.
    
    Args:
        data: JSON data returned from get_reddit_post()
    
    Returns:
        Tuple of (post_df, comments_df)
    """
    # The data is a list with 2 elements:
    # data[0] contains the post
    # data[1] contains the comments
    
    # Parse the post (it's nested under data -> children -> [0] -> data)
    post_df = json_normalize(data[0]["data"]["children"])
    
    # Parse the comments (also under data -> children)
    comments_df = json_normalize(data[1]["data"]["children"])
    
    return post_df, comments_df

def clean_column_names(df):
    """
    Remove 'data.' prefix from DataFrame column names.
    
    Args:
        df: DataFrame with 'data.' prefixed column names
    
    Returns:
        DataFrame with cleaned column names
    """
    df.columns = df.columns.str.replace("data.", "", regex=False)
    return df

# Example usage
if __name__ == "__main__":
    post_url = "https://www.reddit.com/r/LabourUK/comments/1qtp6ee/in_gorton_and_denton_i_found_a_longfestering/"
    my_user_agent = "CompMethods Lab Exercise (u/<...>)"  # edit: replace <...> with your username
    
    data = get_reddit_post(post_url, user_agent=my_user_agent)
    
    post_df, comments_df = parse_reddit_data(data)

    post_df = clean_column_names(post_df)
    
    # Check if the post is stickied
    is_stickied = post_df["stickied"].iloc[0]
    print(f"Is post stickied? {is_stickied}")
    
    comments_df = clean_column_names(comments_df)
    
    print("Post DataFrame:")
    print(f"  Shape: {post_df.shape}")
    print(f"  Columns: {list(post_df.columns)[:10]}...")  # First 10 columns
    
    print("\nComments DataFrame:")
    print(f"  Shape: {comments_df.shape}")
    print(f"  Columns: {list(comments_df.columns)[:10]}...")
    
    # Preview the comments (the 'data.' prefix was removed above, so the column is 'author')
    if "author" in comments_df.columns:
        print(f"\nNumber of comments downloaded: {len(comments_df)}")
        print(f"Unique authors: {comments_df['author'].nunique()}")


## Part 2

Building this out: now let's navigate to a subreddit and do the same thing. For example, make a request to https://api.reddit.com/r/aww. However, I want you to request the top stories this year. This involves using an "argument". For "aww" the example would be: 

> https://api.reddit.com/r/aww/top/?t=year

This time we are looking for the posts. Each story is a single item in the json. 
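
The `?t=year` at the end of that URL is a query argument. Here is a small sketch of building such URLs with the standard library rather than by hand. The helper name `listing_url` is hypothetical, and the `limit` argument is an assumption about the listing endpoint that you should verify before relying on it.

```python
from urllib.parse import urlencode


def listing_url(subreddit, sort="top", **params):
    """Build a listing URL such as https://api.reddit.com/r/aww/top/?t=year."""
    base = f"https://api.reddit.com/r/{subreddit}/{sort}/"
    # urlencode turns {"t": "year"} into "t=year" and escapes special characters
    query = urlencode(params)
    return f"{base}?{query}" if query else base


print(listing_url("aww", t="year"))
print(listing_url("aww", t="month", limit=10))  # "limit" is an assumed argument
```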

Report how many stories were collected. The default is 25 but there may be more if there is a post which is "stickied". Identify the stickied post and remove it. For the remaining posts, collect the comments under the story.

We will want to create two DataFrames - one for posts and one for comments on the stories for each post. Notice that we do not store all of the data here. 
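
Stickied posts can also be dropped before any comments are fetched, which saves requests. This is a minimal sketch on an invented listing (the titles and permalinks are made up); it assumes each child carries a boolean `stickied` field, as the post data used in this lab does.

```python
# Toy stand-in for listing_data["data"]["children"] (all values are invented)
children = [
    {"kind": "t3", "data": {"title": "Community rules", "stickied": True,
                            "permalink": "/r/example/comments/aaa/rules/"}},
    {"kind": "t3", "data": {"title": "A popular story", "stickied": False,
                            "permalink": "/r/example/comments/bbb/story/"}},
    {"kind": "t3", "data": {"title": "Another story", "stickied": False,
                            "permalink": "/r/example/comments/ccc/another/"}},
]

# Split the listing into stickied and regular posts
stickied = [c["data"] for c in children if c["data"].get("stickied", False)]
regular = [c["data"] for c in children if not c["data"].get("stickied", False)]

print(f"Stories in listing: {len(children)}")
print(f"Stickied (removed): {[p['title'] for p in stickied]}")
print(f"Remaining to fetch comments for: {len(regular)}")
```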

In [None]:
def get_subreddit(subreddit, user_agent, time_filter="year", rating_filter="top"):
    """
    Download top posts from a subreddit (stage 1: get post list).
    """
    url = f"https://api.reddit.com/r/{subreddit}/{rating_filter}/?t={time_filter}"
    
    headers = {
        "User-Agent": user_agent
    }
    
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    
    return response.json()


def get_post_details(permalink, user_agent):
    """
    Download full details for a single post including comments.
    """
    url = f"https://api.reddit.com{permalink}"
    
    headers = {
        "User-Agent": user_agent
    }
    
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    
    return response.json()


def build_dataframes(subreddit, user_agent, time_filter="year", rating_filter="top"):
    """
    Two-stage download: get post list, then fetch each post's comments.
    
    Returns:
        Tuple of (posts_df, comments_df)
    """
    import time
    
    # Stage 1: Get list of top posts
    print(f"Fetching top posts from r/{subreddit}...")
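    # Pass arguments by keyword so they cannot be mixed up with the user agent
    listing_data = get_subreddit(
        subreddit,
        user_agent=user_agent,
        time_filter=time_filter,
        rating_filter=rating_filter,
    )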
    
    posts = []
    comments = []
    children = listing_data["data"]["children"]
    
    # Stage 2: Fetch each post's full details and comments
    for i, child in enumerate(children):
        time.sleep(0.5)
        
        if i > 5: break
        
        post_summary = child["data"]
        permalink = post_summary["permalink"]
        
        print(f"  Fetching post {i+1}/{len(children)}: {post_summary['title'][:40]}...")
        
        # Get full post data (index 0 = post, index 1 = comments)
        post_data = get_post_details(permalink, user_agent)
        post_full = post_data[0]["data"]["children"][0]["data"]
        post_id = post_full["id"]
        
        # Extract post info
        posts.append({
            "id": post_id,
            "subreddit": subreddit,
            "author": post_full["author"],
            "created_utc": post_full["created_utc"],
            "title": post_full["title"],
            "url": post_full["url"],
            "ups": post_full["ups"],
            "num_comments": post_full["num_comments"],
            "stickied": post_full["stickied"],
        })
        
        # Extract comments
        comment_children = post_data[1]["data"]["children"]
        for comment_child in comment_children:
            # Skip "more" objects (truncated comment lists)
            if comment_child["kind"] != "t1":
                continue
            
            comment_data = comment_child["data"]
            comments.append({
                "comment_id": comment_data["id"],
                "post_id": post_id,
                "subreddit": subreddit,
                "author": comment_data["author"],
                "body": comment_data["body"],
                "created_utc": comment_data["created_utc"],
                "ups": comment_data["ups"],
                "score": comment_data["score"],
            })
        
        # Be polite to Reddit's servers
        time.sleep(0.5)
    
    posts_df = pd.DataFrame(posts)
    comments_df = pd.DataFrame(comments)
    
    return posts_df, comments_df


# Example usage
if __name__ == "__main__":
    import pandas as pd
    
    # my_user_agent = "CompMethods Lab Exercise (u/berniehogan)"
    my_user_agent = "CompMethods Lab Exercise (u/<...>)"  # edit
    
    posts_df, comments_df = build_dataframes("news", time_filter="year", user_agent=my_user_agent)
    
    # Remove stickied posts
    stickied_ids = posts_df[posts_df["stickied"] == True]["id"].tolist()
    posts_df = posts_df[posts_df["stickied"] == False]
    comments_df = comments_df[~comments_df["post_id"].isin(stickied_ids)]
    
    print(f"Posts collected: {len(posts_df)}")
    print(f"Comments collected: {len(comments_df)}")
    print(f"Unique commenters: {comments_df['author'].nunique()}")
    
    print("\nComments per post:")
    print(comments_df.groupby("post_id").size().describe())
    
    print("\nSample comments:")
    print(comments_df[["author", "body", "ups"]].head())

Fetching top posts from r/news...
  Fetching post 1/25: Luigi Mangione accepts nearly $300K in d...
  Fetching post 2/25: Elon Musk and Prince Andrew named in lat...
  Fetching post 3/25: Personal information of 4,500 ICE and Bo...
  Fetching post 4/25: ABC suspends Jimmy Kimmel’s late-night s...
  Fetching post 5/25: Tesla investor calls for Elon Musk to st...
  Fetching post 6/25: As top Trump aides sent texts on Signal,...
Posts collected: 6
Comments collected: 219
Unique commenters: 192

Comments per post:
count     6.000000
mean     36.500000
std      13.560973
min      18.000000
25%      27.500000
50%      38.000000
75%      44.750000
max      54.000000
dtype: float64

Sample comments:
                 author                                               body  \
0      TalmadgeReyn0lds  My MS medication costs $163k every 6 months, w...   
1        brickyardjimmy    Please, oh please, let this trial be televised.   
2          Insciuspetra         So..\n\nOne week in a hospital be

## Part 3. Comparing two subs

Now you have the means to create a DataFrame for a sub, and a DataFrame of comments. Now here's where your work really begins: 

- Compare two subreddits using a research question that can be answered with the columns we have selected. 
- This is wide open for you. Some ideas:
  - Compare two subs with a regional focus. How often does a keyword appear in the titles of posts in each sub?
  - Compare two political subs and report on the different news sites that they link to. 
  - If you want to use time data: 
    - Consider the top posts of the last year: were they posted at different times of the year? 
    - Do the top posts get posted at a specific time of day?
  - Compare the posters in the comments: 
    - How many people are featured in multiple posts? Do people who post in multiple posts tend to get more upvotes on average? 
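
Several of these ideas reduce to a few pandas operations on the columns already collected. The sketch below uses invented rows in place of the `posts_df` and `comments_df` built earlier; the derived column names (`domain`, `month`, `hour`, `multi_post`) are my own.

```python
from urllib.parse import urlparse

import pandas as pd

# Invented rows with the same column names as the DataFrames built earlier
posts_df = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "title": ["Storm hits the coast", "Budget vote delayed", "Storm cleanup begins"],
    "url": ["https://www.bbc.co.uk/news/1", "https://apnews.com/a/2",
            "https://www.bbc.co.uk/news/3"],
    "created_utc": [1735732800, 1751371200, 1751414400],
})
comments_df = pd.DataFrame({
    "post_id": ["p1", "p1", "p2", "p3", "p3"],
    "author": ["ann", "bob", "ann", "cat", "ann"],
    "ups": [10, 3, 7, 1, 4],
})

# Keyword: share of titles containing a word (case-insensitive)
keyword_share = posts_df["title"].str.contains("storm", case=False).mean()

# News sites: take the host part of each linked URL
posts_df["domain"] = posts_df["url"].map(lambda u: urlparse(u).netloc)
domain_counts = posts_df["domain"].value_counts()

# Time: created_utc is seconds since the epoch, in UTC
created = pd.to_datetime(posts_df["created_utc"], unit="s", utc=True)
posts_df["month"] = created.dt.month
posts_df["hour"] = created.dt.hour

# Commenters: number of distinct posts per author, and their mean upvotes
per_author = comments_df.groupby("author").agg(
    n_posts=("post_id", "nunique"),
    mean_ups=("ups", "mean"),
)
per_author["multi_post"] = per_author["n_posts"] > 1

print(keyword_share)
print(domain_counts)
print(posts_df[["month", "hour"]])
print(per_author)
```

Hours here are in UTC. Whether that is the right clock for a regional subreddit is a judgement you would need to defend in your write-up.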

What we will be looking for: 
- **Code**: How clean is it? Do you use methods / functions? 
- **Comparison**: Is it a fair comparison or a trivial one? 
- **Write up**: How do you interpret your results? Have you used a table or an image to display some work? 


# Your answer: 

In your answer, first produce all the code cells, then produce a short write-up of fewer than 500 words. You will want to consider:
- Am I going to create two DFs, one for each subreddit or am I going to merge them into a single DF? 
- What statistics will I use to compare if any? 
- What should we interpret from these results? 
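
On the first question, one common pattern is to stack the two DataFrames into one and group by the `subreddit` column, which the DataFrames in this lab already carry. This is a sketch with invented numbers; the subreddit names `sub_a` and `sub_b` are placeholders.

```python
import pandas as pd

# Invented per-subreddit post tables that share a "subreddit" column
posts_a = pd.DataFrame({"subreddit": "sub_a", "id": ["a1", "a2", "a3"],
                        "num_comments": [120, 80, 100]})
posts_b = pd.DataFrame({"subreddit": "sub_b", "id": ["b1", "b2", "b3"],
                        "num_comments": [30, 45, 60]})

# Stack into a single DataFrame; ignore_index gives a clean 0..n-1 index
both = pd.concat([posts_a, posts_b], ignore_index=True)

# One summary row per subreddit, side by side
summary = both.groupby("subreddit")["num_comments"].agg(["count", "median", "mean"])
print(summary)
```

Keeping everything in one DataFrame makes grouped summaries and plots simpler, while two separate DataFrames can be easier to reason about when the subreddits need different cleaning steps.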

Below your answer write your own AI declaration. Within this, be sure to remark on any times where the AI led you astray, did things you couldn't explain or otherwise did not work as predicted. 

# AI Declaration 



The code was written primarily by Claude. I wrote the summaries and fed them to it, then read the code, ran it and tweaked it. I ensured that we have a means for setting user-agent strings and that requests are spaced out with `time.sleep()`. 

Neither Claude nor I had any luck in downloading selftext (the body of the post).

For students, you are welcome to use an AI to assist you in this task. The important aspects of this task will be down to:
- Checking that the results match what you expect. 
- Ensuring that your research question is grounded and coherent, ideally with some reference to prior literature where possible.
- Checking that you will use a statistical test that is relevant for the data.
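
As one illustration of matching a test to the data: upvote and comment counts are often heavily skewed, so a test that assumes normality may not suit them. The sketch below is a simple permutation test on the difference in means, written with the standard library only. It is one option among several, not a required method, and the numbers are invented.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add one to numerator and denominator so the p-value is never exactly zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Invented upvote counts for comments in two subreddits
ups_a = [1, 2, 2, 3, 5, 8, 40]
ups_b = [1, 1, 2, 2, 3, 4, 6]

diff, p = permutation_test(ups_a, ups_b)
print(f"Observed difference in means: {diff:.2f}, p = {p:.3f}")
```

Note how a single large value (40) drives the difference in means; comparing medians or ranks instead would tell a different story, which is worth checking before settling on a test.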