## Reddit API


This notebook contains the code used to extract post data from the Reddit API.

In [1]:
# Import necessary packages
import json
import time
import requests
import pandas as pd
from datetime import datetime

In [2]:
# Accessing private API keys and Reddit login information
with open('reddit_keys.json') as file:
    data = json.load(file)

# Defining credentials
client = data['reddit_api']['CLIENT_ID']
secret = data['reddit_api']['SECRET_ID']
reddit_data = data['reddit_data']
auth = requests.auth.HTTPBasicAuth(client, secret)
headers = {'User-Agent': 'MyAPI/0.0.1'}

# Gaining access to API
res = requests.post('https://www.reddit.com/api/v1/access_token',
                   auth=auth, data=reddit_data, headers=headers)
TOKEN = res.json()['access_token']
headers = {**headers, **{'Authorization': f'bearer {TOKEN}'}}
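
A minimal sketch of how the token response could be validated before it is used. The layout of `reddit_keys.json` shown in the comment is an assumption inferred from the keys the cell above reads, and `extract_token` is a hypothetical helper that is not part of the original notebook.

```python
# Assumed layout of reddit_keys.json (placeholders, inferred from the keys read above):
# {
#   "reddit_api":  {"CLIENT_ID": "...", "SECRET_ID": "..."},
#   "reddit_data": {"grant_type": "password", "username": "...", "password": "..."}
# }

def extract_token(payload):
    """Return the access token from a decoded token response, or raise a clear error."""
    if 'access_token' not in payload:
        # Surface whatever the API sent back instead of failing with a bare KeyError
        raise RuntimeError(f"Token request failed: {payload.get('error', payload)}")
    return payload['access_token']

# Example with a hand-written payload (no network call is made here)
print(extract_token({'access_token': 'abc123', 'token_type': 'bearer'}))
```

In the notebook this could be called as `TOKEN = extract_token(res.json())`, so that a rejected login produces a readable message.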

In [3]:
# Creating empty DataFrames to hold the Reddit posts
new_posts = pd.DataFrame()
all_posts = pd.DataFrame()
hot_posts = pd.DataFrame()

In [4]:
subreddits = ['birthcontrol', 'IUD', 'TwoXChromosomes']
categories = ['top', 'new', 'hot']
prefix = 't3_'

As we loop through posts from the new, hot and top pages of r/birthcontrol, r/IUD and r/TwoXChromosomes, we pull only the columns that we know we will need. This reduces unnecessary cleaning and data storage.

Because we make repeated requests to the three subreddits, there is a strong possibility of exceeding the Reddit API request limits. To stay within them, we add a time delay between requests. The pull is placed inside a try/except statement, so when a request raises an error we move on to the next iteration of the loop.
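
The delay-and-skip pattern described above can be sketched in isolation as follows. `fetch_with_delay` and its parameters are hypothetical names used for illustration, and the `fetch` callable stands in for the `requests.get` call so that the sketch runs without network access.

```python
import time

def fetch_with_delay(fetch, pages, delay=2.0):
    """Call fetch(page) for each page, pausing between calls and skipping failures."""
    results, failed = [], []
    for page in pages:
        try:
            results.append(fetch(page))
        except Exception as err:
            # Record the failure and move on to the next page
            failed.append((page, repr(err)))
        # Pause after every attempt, successful or not, to respect request limits
        time.sleep(delay)
    return results, failed

def fake_fetch(page):
    # Stand-in for an API call; page 2 simulates a failed request
    if page == 2:
        raise ValueError('simulated API error')
    return f'page-{page}'

results, failed = fetch_with_delay(fake_fetch, pages=[1, 2, 3], delay=0)
print(results)
print(failed)
```

Recording failures rather than silently continuing makes it easier to see which pulls may need repeating.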

Whilst pulling the relevant columns, we complete some data transformations so that we do not need to iterate through all the data again at a later stage.
* The UTC timestamp is converted into a datetime object using the datetime package as we pull it in.
* Leading and trailing whitespace, including '\n' newline characters, is stripped from the post text.
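
The two transformations listed above can be sketched on a single hand-written post. `clean_post` is a hypothetical helper and the sample values are made up for illustration. Note that `datetime.fromtimestamp` without a `tz` argument returns local time, so this sketch passes `timezone.utc` explicitly to keep the result in UTC; that is a choice made here, not necessarily what the original pull did.

```python
from datetime import datetime, timezone

def clean_post(raw):
    """Keep the text and timestamp of one post, applying the two transformations."""
    return {
        # Remove leading and trailing whitespace, including newline characters
        'text': raw['selftext'].strip(),
        # Convert the Unix timestamp (seconds) into a timezone-aware UTC datetime
        'datetime': datetime.fromtimestamp(raw['created_utc'], tz=timezone.utc),
    }

sample = {'selftext': '\n\nFirst line\nSecond line\n', 'created_utc': 0}
cleaned = clean_post(sample)
print(repr(cleaned['text']))
print(cleaned['datetime'].isoformat())
```

`strip()` only touches the ends of the string, so newlines inside the post text are kept and would need a separate step if they should also be removed.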


In [5]:
# Pull data from Reddit API and store

for s in subreddits:
    for c in categories:
        counter = 0
        for i in range(20):
            try:
                # Calling API for relevant url without after parameter only for first run
                if counter == 0:
                    request = requests.get(f'https://oauth.reddit.com/r/{s}/{c}', headers=headers,
                                           params={'limit': '100'})

                # Reading in data from API
                ls = []
                for post in request.json()['data']['children']:
                    # Taking only the columns that are relevant
                    new_dict = {
                        'subreddit': post['data']['subreddit'],
                        'title': post['data']['title'],
                        'text': post['data']['selftext'].strip(),
                        'upvote_ratio': post['data']['upvote_ratio'],
                        'score': post['data']['score'],
                        'flair': post['data']['link_flair_text'],
                        'datetime': pd.to_datetime(datetime.fromtimestamp(post['data']['created_utc'])),
                        'num_comments': post['data']['num_comments'],
                        'id': post['data']['id']}
                    ls.append(new_dict)

                # Adding new posts to the all_posts DataFrame
                new_posts = pd.DataFrame(ls)
                all_posts = pd.concat([all_posts, new_posts], axis=0)

                # Updating fullname id to pull next 100
                last_id = new_posts['id'].iloc[-1]
                fullname = prefix + last_id

                # Reset API GET call to include new fullname id
                request = requests.get(f'https://oauth.reddit.com/r/{s}/{c}', headers=headers,
                                       params={'limit': '100', 'after': fullname})

                # Using time delay command to respect API request limits
                time.sleep(2)
                counter += 1

            except Exception:
                # Skip this iteration on any error and move on to the next one
                continue
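
The loop above builds the `after` value from the id of the last post on each page. As an alternative sketch, the listing response also carries an `after` field inside `data` that can be passed straight back as the cursor. This assumes the standard Reddit listing shape; `paginate` and `fake_get` are hypothetical names, with `fake_get` replacing the real `requests.get` call so that the example runs offline.

```python
def paginate(get_page, max_pages=20):
    """Collect post ids page by page, following the listing's 'after' cursor."""
    ids, after = [], None
    for _ in range(max_pages):
        listing = get_page(after)['data']
        ids.extend(child['data']['id'] for child in listing['children'])
        after = listing.get('after')
        if after is None:
            # No cursor means there are no further pages to request
            break
    return ids

# Two hand-written pages standing in for API responses
PAGES = {
    None: {'data': {'children': [{'data': {'id': 'a1'}}, {'data': {'id': 'a2'}}],
                    'after': 't3_a2'}},
    't3_a2': {'data': {'children': [{'data': {'id': 'a3'}}], 'after': None}},
}

def fake_get(after):
    # Return the canned page that matches the cursor
    return PAGES[after]

print(paginate(fake_get))
```

Stopping as soon as the cursor is empty avoids spending requests on pages that do not exist.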

Save the raw dataset for further data cleaning and exploratory analysis in the main project file.

In [6]:
#all_posts.to_csv('reddit_posts.csv', index=False)