# Sample Project

## Structure overview

I'm going to show you how I would organize an example project. There are three directories: `code`, `data`, and `paper`. The `code` directory has three programs: one to get the data, one to clean the data and create measures, and one to analyze the data and create visualizations. A more complex project could have many more programs. Because this class has only used notebooks, I use notebooks here, but in my own work most of these would be `.py` files, which are easier to run from the command line or include in pipelines.

The `data` directory holds the data. Often, I will have a `raw_data` directory that holds the initial data, and a `cleaned_data` or `summarized_data` directory that holds the measures. Sometimes it's wise to keep even more intermediate datasets.

Finally, I'll show how these might fit into a `LaTeX` publishing pipeline. The figures can go straight into a paper, and everything can be updated if the data changes.

This project includes a `Makefile`. In this case it just turns the `LaTeX` file into a PDF, but you can write more complicated Makefiles that automatically rerun code as things change. I've also used `snakemake` in the past, which is based on Python and is designed for scientific workflows.
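The directory layout described in this section can be set up by hand, but a short script makes it repeatable. This is a minimal sketch, assuming the `raw_data` and `cleaned_data` subdirectories mentioned above; the function name `create_project_layout` is made up for illustration, and the demo builds everything in a temporary directory so it doesn't touch your real project.

```python
from pathlib import Path
import tempfile


def create_project_layout(root):
    """Create the code/data/paper skeleton under `root` and return the created paths."""
    root = Path(root)
    # Subdirectory names follow the structure described above;
    # adjust them to match your own project.
    subdirs = [
        "code",
        "data/raw_data",
        "data/cleaned_data",
        "paper",
    ]
    created = []
    for sub in subdirs:
        path = root / sub
        # parents=True creates `data` along the way; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Build the layout in a throwaway directory and list what was created
with tempfile.TemporaryDirectory() as tmp:
    for path in create_project_layout(tmp):
        print(path.relative_to(tmp))
```

To use it on a real project, call `create_project_layout` with the project's root directory instead of a temporary one.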


## Project overview

For this simple project, we will try to answer the question, "In conversations about Purdue, do shorter or longer comments get more upvotes?"


First, we load libraries and authenticate to Reddit. Note that you will need to (1) copy your `reddit_authentication.py` file to the `code` directory and (2) create a `data` directory (where the raw data will be stored).
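The notebook only relies on `reddit_authentication` exposing five module-level names: `client_id`, `client_secret`, `user_agent`, `username`, and `password`. A minimal sketch of such a file might look like the following. Reading the values from environment variables is an assumption of this sketch, and the environment variable names and the default user agent string are hypothetical; your own file may simply assign the strings directly.

```python
# reddit_authentication.py (hypothetical contents)
import os

# The environment variable names below are illustrative assumptions.
# The notebook only needs these five module-level variables to exist.
client_id = os.environ.get("REDDIT_CLIENT_ID", "")
client_secret = os.environ.get("REDDIT_CLIENT_SECRET", "")
user_agent = os.environ.get("REDDIT_USER_AGENT", "sample-project script")
username = os.environ.get("REDDIT_USERNAME", "")
password = os.environ.get("REDDIT_PASSWORD", "")
```

Whichever way you store the credentials, keep this file out of version control so they are not published with the project.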


In [1]:
import praw
import reddit_authentication
from prawcore.exceptions import NotFound
import csv

# Create an instance called reddit. We'll use this to call the API.
reddit = praw.Reddit(client_id=reddit_authentication.client_id,
                     client_secret=reddit_authentication.client_secret,
                     user_agent=reddit_authentication.user_agent,
                     username=reddit_authentication.username,
                     password=reddit_authentication.password)

fn = '../data/raw_data.csv'


Version 7.7.1 of praw is outdated. Version 7.8.1 was released Friday October 25, 2024.


Because we can't always trust network connections to remain stable, we're going to write out to our file for every post that we look at.

This also means that we need to add logic that figures out where we left off and starts at the next post.
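The same write-as-you-go pattern can be tried out without any network access. This is a minimal sketch with made-up item IDs and a temporary file; the helper names `load_completed` and `process_items` and the column names are hypothetical and are not part of the project code.

```python
import csv
import os
import tempfile


def load_completed(path):
    """Return the set of item ids already in the CSV, or an empty set if there is no file."""
    try:
        with open(path, "r", newline="") as f:
            return {row["item_id"] for row in csv.DictReader(f)}
    except FileNotFoundError:
        return set()


def process_items(items, path):
    """Append one row per unseen item, writing to disk after every item."""
    completed = load_completed(path)
    # Write a header only if the file is missing or empty
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    for item_id, value in items:
        if item_id in completed:
            continue  # already saved in an earlier run
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["item_id", "value"])
            if need_header:
                writer.writeheader()
                need_header = False
            writer.writerow({"item_id": item_id, "value": value})
        completed.add(item_id)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    # Pretend the first run was interrupted after two items
    process_items([("a", 1), ("b", 2)], path)
    # The second run sees all four items but only writes the two new ones
    process_items([("a", 1), ("b", 2), ("c", 3), ("d", 4)], path)
    print(sorted(load_completed(path)))  # ['a', 'b', 'c', 'd']
```

The sketch keeps the completed IDs in a set so the membership check stays cheap as the file grows; a list works too for a few hundred posts.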

In [12]:
num_posts = 500


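# Get the list of submissions that are already in the output file
completed = []
try:
    with open(fn, 'r', newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            if row['submission_id'] not in completed:
                completed.append(row['submission_id'])
        # A file that exists but is empty still needs a header row
        no_file = reader.fieldnames is None
except FileNotFoundError:
    # No file yet, so nothing is completed and the header must be written
    no_file = True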


In [13]:
i = 0
for submission in reddit.subreddit('all').search('Purdue', sort = 'new', limit=num_posts):
    comments = []
    i += 1
    if i % 10 == 0:
        print(f"Submission {i} of {num_posts}")
    if submission.id in completed:
        continue
    # Get comments
    if submission.num_comments == 0:
        continue
    submission.comments.replace_more(limit=None)
    for comment in submission.comments.list():
        # I'm skipping deleted or removed comments. You might want to include them.
        if comment.author is None:
            continue
        # At this point, make sure you include every attribute you might possibly need.
        comments.append({
            'author': comment.author.name,
            'body': comment.body,
            'created_utc': comment.created_utc,
            'id': comment.id,
            'link_id': comment.link_id,
            'parent_id': comment.parent_id,
            'depth': comment.depth,
            'score': comment.score,
            'score_hidden': comment.score_hidden,
            'upvotes': comment.ups,
            'downvotes': comment.downs,
            'subreddit': comment.subreddit.display_name,
            'submission_id': comment.submission.id,
            'submission_title': comment.submission.title,
            'submission_created_utc': comment.submission.created_utc,
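            # The submission's author is None if the account was deleted
            'submission_author': comment.submission.author.name if comment.submission.author else None,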
            'submission_num_comments': comment.submission.num_comments,
            'submission_score': comment.submission.score,
            'submission_body': comment.submission.selftext,
            'submission_url': comment.submission.url
        })
    # Nothing to write if every comment was deleted or removed
    if len(comments) == 0:
        continue
    with open(fn, 'a', newline='') as f: # 'a' means append to the file
        out = csv.DictWriter(f, fieldnames=comments[0].keys())
        # If the file is new, write the header
        if no_file:
            out.writeheader()
            no_file = False
        out.writerows(comments)
    # Mark the submission as done only after its comments are on disk
    completed.append(submission.id)


Submission 10 of 500
Submission 20 of 500
Submission 30 of 500
Submission 40 of 500
Submission 50 of 500
Submission 60 of 500
Submission 70 of 500
Submission 80 of 500
Submission 90 of 500
Submission 100 of 500
Submission 110 of 500
Submission 120 of 500
Submission 130 of 500
Submission 140 of 500
Submission 150 of 500
Submission 160 of 500
Submission 170 of 500
Submission 180 of 500
Submission 190 of 500
Submission 200 of 500
Submission 210 of 500
Submission 220 of 500
Submission 230 of 500
Submission 240 of 500
Submission 250 of 500
