# Introduction to Reddit Data for Social Scientists

[Reddit](https://www.reddit.com) is one of the most popular websites in the world. It is composed of hundreds of thousands of "subreddits" - topical communities which users subscribe to. These communities are then aggregated into a personalized home page for each user.

The size and influence of Reddit have made it the focus of an increasing amount of social science scholarship. In addition, it is by far the most open of the major social networking sites, with full text comments and information about users made available via API.

There are currently two main approaches to getting data from Reddit. The first is through the Reddit API. The API itself is powerful but confusing to get started with. I highly recommend [praw](https://praw.readthedocs.io/en/latest/), a great "wrapper" for the API (similar to tweepy for Twitter). As with tweepy, you will need to install it through conda or pip.

In order to use the Reddit API, you need to create a Reddit account, and then [create an application](https://www.reddit.com/prefs/apps/). This is basically the same as what we did with tweepy - you will be given a client ID and a secret, which are used together with the username and password for the account.

These should be stored in their own file, which you should be careful not to accidentally upload to GitHub! :)

In the same directory as the notebook you use to run this code, create a file called `reddit_auth.py` with the following information (replace with actual strings from the application you create):

```
client_id = 'aaaaaaaaa'
secret = 'bbbbbbbbb'
user = 'ccccccccc'
pw = 'ddddddddd'
```
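
If you would rather not keep secrets in a file at all, one alternative is to read them from environment variables. This is only a sketch of that idea, not part of the setup used in this notebook: the variable names `REDDIT_CLIENT_ID` and `REDDIT_SECRET` and the function `load_reddit_credentials` are made up for illustration.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Return (client_id, secret) read from environment variables.

    REDDIT_CLIENT_ID and REDDIT_SECRET are hypothetical names chosen for
    this example; use whatever names you set in your own shell.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["REDDIT_CLIENT_ID"], env["REDDIT_SECRET"]


# Demo with a plain dict standing in for the real environment
print(load_reddit_credentials({"REDDIT_CLIENT_ID": "aaaaaaaaa",
                               "REDDIT_SECRET": "bbbbbbbbb"}))
```

The two returned values would then be passed to `praw.Reddit` in place of `reddit_auth.client_id` and `reddit_auth.secret`.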

Then, we authenticate.

You can use praw to do all sorts of interaction with the site - writing comments or posts, moderating things, etc. I am just going to show data collection, which only requires read-only access. For this, you don't actually need to pass your username and password. So it looks like this:

In [1]:
import praw
import reddit_auth

# Create an instance called reddit. We'll use this to call the API.
reddit = praw.Reddit(client_id=reddit_auth.client_id,
                     client_secret=reddit_auth.secret,
                     user_agent="Example for COM 674 class") # Call this something descriptive


And here are a few examples of basic things you might want to do.

First, getting recent posts from a set of subreddits.

In [3]:
# First, let's find the top 10 COVID-related subreddits, according to reddit's search
# (limit=10 asks the API for only the first 10 results)
top_covid_subs = list(reddit.subreddits.search('Coronavirus', limit=10))

# Each of these is a subreddit object, and we can get information about the subreddit from it

In [4]:
for s in top_covid_subs:
    print(f"Name: {s.display_name}\tSubscribers: {s.subscribers}")

Name: Coronavirus	Subscribers: 2284728
Name: CoronavirusCirclejerk	Subscribers: 55097
Name: LockdownSkepticism	Subscribers: 55875
Name: CoronavirusDownunder	Subscribers: 183217
Name: CoronavirusUS	Subscribers: 141023
Name: CoronavirusUK	Subscribers: 90265
Name: Coronavirus_NZ	Subscribers: 33456
Name: CoronavirusColorado	Subscribers: 65770
Name: CoronavirusCA	Subscribers: 122395
Name: CoronavirusIllinois	Subscribers: 147543


In [5]:
# Then, we can also iterate through each subreddit and get comments

comment_dict = {}

for sr in top_covid_subs:
    curr_comments_text = []
    curr_sr_comments = sr.comments(limit=500)
    for comment in curr_sr_comments:
        curr_comments_text.append(comment.body)
    comment_dict[sr.display_name] = curr_comments_text
        

KeyboardInterrupt: 

In [None]:
comment_dict
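
Printing the whole dictionary gets unwieldy with thousands of comments, so a quick summary per subreddit can be easier to read. A minimal sketch, assuming `comment_dict` maps each subreddit name to a list of comment strings as built above; the helper name `summarize_comments` and the example data are made up.

```python
def summarize_comments(comment_dict):
    """Return {subreddit: (number of comments, mean comment length in characters)}."""
    summary = {}
    for name, comments in comment_dict.items():
        n = len(comments)
        # Guard against subreddits where no comments were collected
        mean_len = sum(len(c) for c in comments) / n if n else 0.0
        summary[name] = (n, mean_len)
    return summary


# Made-up data standing in for the real comment_dict
example = {"Coronavirus": ["Stay safe", "Thanks!"], "CoronavirusUS": []}
for name, (n, mean_len) in summarize_comments(example).items():
    print(f"{name}: {n} comments, mean length {mean_len:.1f}")
```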

The other thing we might want is information about users. Here's a simple example of how to do that for the users who made the last 10 comments in the "Purdue" subreddit.

In [None]:
authors = []
for c in reddit.subreddit('Purdue').comments(limit=10):
    # c.author is None when the account has been deleted, so skip those
    if c.author is not None:
        authors.append(c.author)

for author in authors:
    print(f"Name: {author.name}\t Comment karma: {author.comment_karma}")

# Pushshift.io

The other option for Reddit data is [Pushshift](https://github.com/pushshift/api). Jason Baumgartner has been archiving every comment and post on Reddit, and gives researchers access to this data in a way that is often faster and much more convenient than the Reddit API. Also, because it's only public data, there is no need for authentication.

It also includes some cool tools for aggregating data, but for this example I'm just showing how to grab data.

Pushshift is organized just around submissions and comments - it's designed for getting Reddit-level or subreddit-level data about how much content is being produced around topics. It does not, for example, let you get direct information about users. Nor can you directly navigate a comment tree, like the Reddit API lets you do. However, it's really great for many common use cases.

_Note:_ There is a wrapper for Pushshift called `psaw`. It's not much easier than just using Pushshift directly, IMO, so I'm just showing how to do that.

In [11]:
# This is how you do the same thing - get the last 500 comments from each of the subreddits in our list

import requests
import time

# Comment search endpoint
endpt = "https://api.pushshift.io/reddit/search/comment"


comment_dict = {}
for sr in top_covid_subs:
    params = {'subreddit': sr.display_name,
              'size': 500} # Get 500 comments for each
    print(sr.display_name)
    r = requests.get(endpt, params=params)
    # Stop with a clear HTTP error if the request failed, rather than failing later in r.json()
    r.raise_for_status()
    # Take the requests response object, and extract just the text
    comment_dict[sr.display_name] = [x['body'] for x in r.json()['data']]
    # Add a sleep, to be kind to the API (really only needed if you are getting a lot of data)
    time.sleep(20)

Coronavirus


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
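
A `JSONDecodeError` at "line 1 column 1" means the response body was empty or was not JSON at all (for example, an HTML error page), often because the request itself failed. The sketch below shows one way to turn that into a more readable error. The helper name `parse_pushshift_body` is made up, and it assumes a successful response is a JSON object with a `data` list, as in the examples here.

```python
import json


def parse_pushshift_body(status_code, text):
    """Return the list stored under 'data', or raise an error that says what went wrong."""
    if status_code != 200:
        raise RuntimeError(f"Request failed with HTTP status {status_code}")
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as err:
        # Show the start of the body so it is obvious what came back instead of JSON
        raise RuntimeError(f"Response was not JSON; it began: {text[:80]!r}") from err
    return payload.get("data", [])


# Demo with hand-written response bodies
print(parse_pushshift_body(200, '{"data": [{"body": "hi"}]}'))
try:
    parse_pushshift_body(200, "<html>Service unavailable</html>")
except RuntimeError as err:
    print(err)
```

With a real `requests` response `r`, the call would be `parse_pushshift_body(r.status_code, r.text)`.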

In [9]:
r.json()['data'][0]

{'subreddit_id': 't5_2gx1h6',
 'author_is_blocked': False,
 'comment_type': None,
 'edited': False,
 'author_flair_type': 'richtext',
 'total_awards_received': 0,
 'subreddit': 'CoronavirusIllinois',
 'author_flair_template_id': '8630883c-54d8-11ec-b865-c650f3516720',
 'id': 'jd2eqkx',
 'gilded': 0,
 'archived': False,
 'collapsed_reason_code': None,
 'no_follow': True,
 'author': 'theoryofdoom',
 'send_replies': True,
 'parent_id': None,
 'score': 1,
 'author_fullname': 't2_oop22',
 'all_awardings': [],
 'body': 'This sort of free-floating speculation is not permitted.',
 'top_awarded_type': None,
 'author_flair_css_class': None,
 'author_patreon_flair': False,
 'collapsed': False,
 'author_flair_richtext': [{'e': 'text', 't': 'Vaccinated &amp; Recovered'}],
 'is_submitter': False,
 'gildings': {},
 'collapsed_reason': None,
 'associated_award': None,
 'stickied': True,
 'author_premium': True,
 'can_gild': True,
 'link_id': 't3_11x3blz',
 'unrepliable_reason': None,
 'author_flair_te

In [None]:
# If you need more than 500 comments, you will need a while loop.

endpt = "https://api.pushshift.io/reddit/search/comment"

comment_dict = {}
for sr in top_covid_subs[:2]:
    curr_comments = []
    # Reset for each subreddit so paging starts from its newest comments
    # (requests leaves out parameters whose value is None)
    before = None
    while len(curr_comments) < 2000:
        params = {'subreddit': sr.display_name,
                  'before': before,
                  'size': 500} # Get up to 500 comments per request
        r = requests.get(endpt, params=params)
        data = r.json()['data']
        if not data:
            break # No older comments left, so stop instead of looping forever
        # Take the response data, and extract just the text
        curr_comments += [x['body'] for x in data]
        # Get the created time of the last comment; for the next run, only get 
        # the comments older than that
        before = data[-1]['created_utc']
    comment_dict[sr.display_name] = curr_comments
    # Add a sleep, to be kind to the API (really only needed if you are getting a lot of data)
    time.sleep(20)
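
The paging logic in that loop can be tried out without touching the network by separating it from the request. This is a sketch under assumptions: `collect_comments` and `fake_fetch` are made-up names, and `fetch_page(before, size)` stands for any function that returns comment dicts with `body` and `created_utc` keys, newest first.

```python
def collect_comments(fetch_page, target=2000, page_size=500):
    """Page backwards in time until `target` comments are collected or data runs out."""
    bodies = []
    before = None  # None means "start from the newest comments"
    while len(bodies) < target:
        page = fetch_page(before, page_size)
        if not page:
            break  # nothing older is available
        bodies.extend(item["body"] for item in page)
        # Next request: only comments older than the last one we saw
        before = page[-1]["created_utc"]
    return bodies[:target]


# Stand-in data source: 1,200 fake comments with timestamps 1200, 1199, ..., 1
fake = [{"body": f"comment {t}", "created_utc": t} for t in range(1200, 0, -1)]


def fake_fetch(before, size):
    older = [c for c in fake if before is None or c["created_utc"] < before]
    return older[:size]


print(len(collect_comments(fake_fetch)))  # 1200: the data ran out before reaching 2000
```

One caveat with this approach: if several comments share the same `created_utc` second at a page boundary, a strict "older than" cutoff can skip some of them.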

In [28]:

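import time

import pandas as pd
from prawcore.exceptions import TooManyRequests

df = pd.read_csv('../resources/data/purdue_iu_comments.csv')
# Posts we already have. Note that `x in df.post_id` would check the index, not the values.
seen_posts = set(df.post_id)

comments = []

for sr in ['Purdue', 'IndianaUniversity']:
    i = 0
    for submission in reddit.subreddit(sr).new(limit=1000):
        i += 1
        if i % 100 == 0:
            print(i)
        if submission.id in seen_posts:
            continue
        try:
            # Drop the "load more comments" placeholders so only real comments remain
            submission.comments.replace_more(limit=0)
        except TooManyRequests:
            # Rate limited: wait, then try once more
            time.sleep(30)
            submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            try:
                comments.append({'subreddit': sr, 
                                'body': comment.body, 
                                'author': comment.author.name,
                                'score': comment.score,
                                'created_utc': comment.created_utc,
                                'post_id': submission.id
                                })
            except AttributeError:
                # comment.author is None for deleted accounts; skip those comments
                pass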

100
200
300
400
500
600
700
800
900
100
200
300
400
500
600
700
800
900
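
The `try`/`except` around `replace_more` retries once after a fixed 30-second wait. If rate limits come up a lot, a small general-purpose retry helper is one option. This is a sketch, not something praw provides: `call_with_retries` and its parameters are made up, and the doubling delay is just one reasonable choice.

```python
import time


def call_with_retries(func, retry_on, max_attempts=4, base_delay=30, sleep=time.sleep):
    """Call func(); on `retry_on` exceptions wait and retry, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Demo with a function that fails twice before succeeding; the waits are
# recorded in a list instead of actually sleeping
attempts = []
waits = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("pretend rate limit")
    return "done"


print(call_with_retries(flaky, ConnectionError, sleep=waits.append), waits)
```

In the loop above, the call might look like `call_with_retries(lambda: submission.comments.replace_more(limit=0), TooManyRequests)`.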


In [29]:
import pandas as pd

df = pd.concat([df, pd.DataFrame(comments)])

df.to_csv('../resources/data/purdue_iu_comments.csv', index=False)

In [32]:
df.drop_duplicates().to_csv('../resources/data/purdue_iu_comments.csv', index=False)
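
The `created_utc` values saved above are Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read directly. A minimal sketch of converting one to a readable UTC date using only the standard library; the helper name `utc_timestamp_to_iso` is made up for this example.

```python
from datetime import datetime, timezone


def utc_timestamp_to_iso(created_utc):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).isoformat()


print(utc_timestamp_to_iso(0))      # 1970-01-01T00:00:00+00:00
print(utc_timestamp_to_iso(86400))  # 1970-01-02T00:00:00+00:00
```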