# Part 2: Data Collection

---

## Notebook Summary

This notebook shows how an API webscraper was developed to collect text posts from two different subreddits. The code is written and commented here to show the overall development of the webscraper and the API requests; the executable Python code is saved in a separate .py file for the user to run in their own terminal. This notebook includes:

* Library Imports
* User Credentials and Authorization Functions
* Webscraper and Transaction Log Functions
* Main Function
* Notebook Conclusion

---

## Library Imports

First, we will import the requisite libraries.

In [1]:
# import libraries
import pandas as pd
import requests
import os
import datetime

---

## User Credentials and Authorization Functions

Next, we will gather the user's authorization credentials for the Reddit API, along with the two subreddits the user wishes to scrape for data. The first function asks the user for these values and stores them for later use. The second function reads the stored credentials and requests an access token.
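
As a minimal sketch of the input step, the variant below hides the secret and password prompts with the standard library's `getpass` and writes the answers with `json` rather than pandas. The function name `collect_credentials`, its parameters, and the injectable prompt functions are illustrative assumptions, not part of the original script.

```python
import getpass
import json
import os

# Field names mirror the keys used in this notebook's user_info record.
PLAIN_FIELDS = ['subred_1', 'subred_2', 'client_id', 'user_agent', 'username']
SECRET_FIELDS = ['client_secret', 'password']


def collect_credentials(path, ask=input, ask_secret=getpass.getpass):
    """Prompt for each field and save the answers to a JSON file (hypothetical helper)."""
    user_info = {}
    for field in PLAIN_FIELDS:
        user_info[field] = ask(f'{field}: ').strip()
    for field in SECRET_FIELDS:
        # getpass does not echo what the user types
        user_info[field] = ask_secret(f'{field}: ')

    # create the target folder if needed, then write the record
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'w') as f:
        json.dump(user_info, f)
    return user_info
```

A file written this way is a plain JSON object, so it would be read back with `json.load` rather than with the `pd.read_json` call used in `authorize_creds`.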

In [2]:
# client_id = '<client_id>'
# client_secret =  '<client_secret>'
# user_agent =  '<user_agent>'
# username =  '<username>'
# password =  '<password>'

In [3]:
def gather_user_input():
    # asks the user to provide two subreddits of interest and authorization credentials
    subred_1 = input('What is the first subreddit from which you would like to collect data? ')
    subred_2 = input('What is the second subreddit from which you would like to collect data? ')
    client_id = input('What is your API client ID? ')
    client_secret =  input('What is your API client secret? ')
    user_agent =  input('What is the user agent? ')
    username =  input('What is the user name? ')
    password =  input('What is the password? ')
    
    # stores all user data solicited above as a one-row list containing a dictionary
    user_info = [{
    'subred_1': subred_1,
    'subred_2': subred_2,
    'client_id': client_id,
    'client_secret': client_secret,
    'user_agent': user_agent,
    'username': username,
    'password': password
    }]
    
    # stores the user data in a json file to be read later, creating the folder if needed;
    # prints a message to the user indicating credentials stored or the error if the write failed
    try:
        os.makedirs('./data_files', exist_ok=True)
        pd.DataFrame(user_info).to_json('./data_files/user_info.json')
        print("User credentials successfully stored.")
    except Exception as E:
        print(f'Error is {E}.')

In [4]:
def authorize_creds():
    # reads in json file with stored user credential data
    user_info = pd.read_json('./data_files/user_info.json')
    
    # uses user credentials to define authorization and data access
    auth = requests.auth.HTTPBasicAuth(user_info.loc[0, 'client_id'], user_info.loc[0, 'client_secret'])
    data = {
        'grant_type': 'password',
        'username': user_info.loc[0, 'username'],
        'password': user_info.loc[0, 'password']
    }
    # uses the stored user agent as an informative header for the application
    headers = {'User-Agent': user_info.loc[0, 'user_agent']}
    
    # submits request to API
    res = requests.post(
        'https://www.reddit.com/api/v1/access_token',
        auth=auth,
        data=data,
        headers=headers)
    
    # prints response code to user, 200 if status ok
    print(res)
    
    #retrieve access token
    token = res.json()['access_token']

    # creates access token in headers for scraping data
    headers['Authorization'] = f'bearer {token}'
    
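    # confirms the token works by requesting the authenticated user's profile
    me = requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)
    if me.status_code != 200:
        print(f'Token check failed with status {me.status_code}.')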
    
    # returns new header with access token for use in webscraper
    return headers
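
To make the authorization step easier to inspect without contacting Reddit, the pieces of the token request can be built by one pure function and the bearer header by another. This is a sketch only: `build_token_request` and `bearer_headers` are hypothetical names, and `user_info` is assumed to be a plain dictionary with the same keys as the stored record.

```python
TOKEN_URL = 'https://www.reddit.com/api/v1/access_token'


def build_token_request(user_info):
    """Return the keyword arguments for the token POST (hypothetical helper)."""
    return {
        'url': TOKEN_URL,
        # requests accepts a (user, password) tuple as HTTP basic auth
        'auth': (user_info['client_id'], user_info['client_secret']),
        'data': {
            'grant_type': 'password',
            'username': user_info['username'],
            'password': user_info['password'],
        },
        'headers': {'User-Agent': user_info['user_agent']},
    }


def bearer_headers(user_agent, token):
    """Return the headers used for authenticated requests to oauth.reddit.com."""
    return {'User-Agent': user_agent, 'Authorization': f'bearer {token}'}
```

With `requests` installed, `requests.post(**build_token_request(info))` would send a request with the same fields as the one in `authorize_creds`.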
    

---

## Webscraper and Transaction Log Functions

Now that we have set up functions to solicit the subreddits of interest and authorize the user's credentials, we can send requests to the Reddit API to begin collecting data. Because we collect data on a daily basis, we will also build a separate transaction log to track the date and time of each API get request, the number of posts collected in each script run, and the total posts collected to date.

In [5]:
def api_scrape(headers):
    # reads in the user credentials to pull the two subreddits of interest
    user_info = pd.read_json('./data_files/user_info.json')
    base_url = 'https://oauth.reddit.com/r/'
    subreddits = [user_info.loc[0, 'subred_1'], user_info.loc[0, 'subred_2']]
    
    # create an empty list to append a text dictionary of one post at a time
    posts = []
    
    # creates a for loop to iterate through both of the subreddits of interest
    for subreddit in subreddits:
        # tracks titles already collected and resets the pagination cursor for each subreddit
        seen_titles = set()
        after_param = None
        
        # creates a for loop to collect up to 100 posts at a time 10 separate times
        for i in range(10):
            # sets post collection limit to 100 per API restrictions
            params = {'limit': 100}
            # pagination set after first iteration to gather next 100 posts
            if after_param:
                params['after'] = after_param
            
            # submits request to API and parses the response once
            res = requests.get(base_url+subreddit, 
                               headers=headers, 
                               params = params)
            children = res.json()['data']['children']
            
            # ends the for loop if there are no more posts to collect
            if not children:
                break
            
            # loops through all the subreddit posts in the get request, up to 100
            for child in children:
                # creates a dictionary for each post and stores title, selftext, and subreddit
                text = {
                    'title': child['data']['title'],
                    'selftext': child['data']['selftext'],
                    'subreddit': child['data']['subreddit']
                }
                # appends the whole text dictionary to the post list if it is not a duplicate title
                if text['title'] not in seen_titles:
                    seen_titles.add(text['title'])
                    posts.append(text)
            
            # sets the after parameter to the name of the last post for the next 100 posts
            after_param = children[-1]['data']['name']
    
    # checks if csv file storing all posts already exists
    # if csv file does exist, reads in the file, creates a df of the posts just created,
    # concatenates the two dfs, stores as csv file, and then accesses transaction log
    if os.path.exists('./data_files/subreddit_posts.csv'):
        subreddit_posts = pd.read_csv('./data_files/subreddit_posts.csv')
        new_posts = pd.DataFrame(posts)
        subreddit_posts = pd.concat([subreddit_posts, new_posts], ignore_index = True)
        subreddit_posts.to_csv('./data_files/subreddit_posts.csv', index = False)
        transaction_log(posts, subreddit_posts)
    # if csv file does not exist, creates a df of the posts just created, stores as csv file,
    # then accesses transaction log
    else:
        subreddit_posts = pd.DataFrame(posts)
        subreddit_posts.to_csv('./data_files/subreddit_posts.csv', index = False)
        transaction_log(posts, subreddit_posts)
    
    # returns the df for user to see
    return subreddit_posts
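
The pagination and duplicate check can be sketched separately from the network call. In this sketch, `fetch_page` is a hypothetical callable that takes the request parameters and returns a parsed listing shaped like the JSON used above; the cursor is read from the last post's `name`, as in `api_scrape`.

```python
def collect_posts(fetch_page, max_pages=10, limit=100):
    """Page through one listing and keep posts with unseen titles (illustrative)."""
    posts = []
    seen_titles = set()
    after = None

    for _ in range(max_pages):
        params = {'limit': limit}
        if after is not None:
            params['after'] = after

        children = fetch_page(params)['data']['children']
        if not children:
            # an empty page means there is nothing further to request
            break

        for child in children:
            data = child['data']
            if data['title'] not in seen_titles:
                seen_titles.add(data['title'])
                posts.append({'title': data['title'],
                              'selftext': data['selftext'],
                              'subreddit': data['subreddit']})

        # the name of the last post is the cursor for the next page
        after = children[-1]['data']['name']

    return posts
```

Keeping the seen titles in a set makes the membership test compare titles with titles; comparing a title string against a list of dictionaries would never find a match.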

Within the api_scrape function, we call the transaction log function. The purpose of the transaction log is to create a separate csv file in which the user can see when the webscraper was run to collect subreddit posts, how many posts were collected when the script was run, and how many total posts have been collected to date.
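
The logging behavior described above can be sketched without pandas by appending one row per run with the standard library's `csv` module. The function name `append_log_row` and its parameters are assumptions for illustration; the column names match those used in this notebook.

```python
import csv
import datetime
import os

LOG_COLUMNS = ['datetime_retrieved', 'posts_retrieved', 'total_posts']


def append_log_row(log_path, posts_retrieved, total_posts, now=None):
    """Append one run's counts to the log, writing a header for a new file."""
    now = now or datetime.datetime.now()
    is_new = not os.path.exists(log_path)

    # newline='' is what the csv module expects for files it writes
    with open(log_path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=LOG_COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow({'datetime_retrieved': now.isoformat(sep=' '),
                         'posts_retrieved': posts_retrieved,
                         'total_posts': total_posts})
```

Appending in place avoids rereading the whole log on every run, at the cost of not validating the rows already in the file.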

In [6]:
def transaction_log(posts, subreddit_posts):
    # creates an empty list to store new log entry as a dictionary
    new_log = []
    
    # if transaction log already exists, reads in the log,
    # creates a new log entry as a list with a dictionary, 
    # saves new log entry as a df, concatenates both logs,
    # then saves transaction log as csv
    if os.path.exists('./data_files/transaction_log.csv'):
        trans_log = pd.read_csv('./data_files/transaction_log.csv')
        current_log = [{'datetime_retrieved': datetime.datetime.now(), 
                       'posts_retrieved': len(posts), 
                       'total_posts': subreddit_posts.shape[0]}]
        new_log = pd.DataFrame(current_log)
        trans_log = pd.concat([trans_log, new_log], ignore_index = True)
        trans_log.to_csv('./data_files/transaction_log.csv', index = False)
    # if no transaction log exists yet, 
    # creates a new log entry as a list with a dictionary,
    # saves new log entry as a df, then saves transaction log as csv
    else:
        current_log = [{'datetime_retrieved': datetime.datetime.now(), 
                       'posts_retrieved': len(posts), 
                       'total_posts': subreddit_posts.shape[0]}]
        trans_log = pd.DataFrame(current_log)
        trans_log.to_csv('./data_files/transaction_log.csv', index = False)
    
    # prints out transaction log status to user
    return print(f'There were {len(posts)} retrieved at {datetime.datetime.now()} for a total of {subreddit_posts.shape[0]} to date.')

---

## Main Function

The main block executes after all of the functions above have been defined. It checks whether the user_info json file exists to determine whether to gather user input first or to proceed directly to credential authorization and the API webscrape.
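
Since the script is meant to be run from the terminal, the same control flow could be wrapped in a function and called behind an `if __name__ == '__main__':` guard. The sketch below passes the steps in as arguments so the flow can be exercised without credentials; `run_pipeline` and its parameter names are hypothetical.

```python
import os


def run_pipeline(info_path, gather, authorize, scrape):
    """Gather input only when no stored credentials exist, then authorize and scrape."""
    if not os.path.exists(info_path):
        gather()
    headers = authorize()
    return scrape(headers)
```

In the script, this might be called as `run_pipeline('./data_files/user_info.json', gather_user_input, authorize_creds, api_scrape)` inside the guard.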

In [7]:
# if the user info json file exists, authorize credentials
if os.path.exists('./data_files/user_info.json'):
    headers = authorize_creds()
# if the user info json file does not exist, gather user input,
# then authorize credentials
else:
    gather_user_input()
    headers = authorize_creds()

# call webscrape function to run script and return response status and log status
subreddit_posts = api_scrape(headers)

# print new df shape
subreddit_posts.shape

<Response [200]>
There were 1759 retrieved at 2023-08-09 14:13:04.464410 for a total of 18115 to date.


(18115, 3)

In [8]:
# print new df
subreddit_posts

| | title | selftext | subreddit |
|---|---|---|---|
| 0 | Megathread: US Medication Shortage | As many of you are aware by now, the current U... | ADHD |
| 1 | Did you do something you're proud of? Somethin... | What success have you had this week?\n\nDid yo... | ADHD |
| 2 | The Vyvanse poops have taken over my mornings.. | I now wake up at least 1.5 hours early to ensu... | ADHD |
| 3 | Why does someone forcing you to push through e... | I can’t even explain how it hurts but it’s so ... | ADHD |
| 4 | Just had an epiphany- isn’t it crazy how relig... | So my mom can believe in all her saints, God, ... | ADHD |
| ... | ... | ... | ... |
| 18110 | Me when I friend zone someone by accident | I friend zoned a girl for 5 years once by acci... | autism |
| 18111 | Trying to come off as less rude at work | Hello, this is my first post here. For clarifi... | autism |
| 18112 | Do you feel hyper sensitive to negative affect? | I don't always place specifics, but I seem to ... | autism |
| 18113 | Limiting food intake | Hello, do any of you purposely not eat all you... | autism |


---

## Notebook Conclusion

In this notebook, we built a series of functions to gather user input, authorize user credentials, query the Reddit API, and gather up to 1000 unique user posts from each of two different subreddits per run. In the process, the script creates a transaction log to keep track of prior script executions. Because the script is intended to be run in the terminal, a separate Python file is included with this report, while this notebook is preserved to give the reader more detail on how the data were collected.

In Part 3, we will begin the process of data cleaning and exploratory data analysis based on the posts that we have collected from this script.