# Data Collection 

This is notebook 1 (out of 5) for <b>Project 3</b> of the GA Data Science Immersive<br>
Notebook by: <b>Martijn de Vries</b><br>
martijndevries91@gmail.com

## Problem Statement

A US political consultancy company is researching how news sources and discussed topics differ between the US political mainstream and the conservative right-wing media. In the last decade or so, the US political right wing has been increasingly described as living in an entirely separate information ecosystem from the political mainstream. In order to gauge how intense this effect is, we will collect, process, and classify the Reddit content of two politically themed subreddits that reflect mainstream and conservative voters respectively: <b>r/politics</b> and <b>r/conservative</b>. 

For this project, we will build two separate branches of models: one for post submissions (largely consisting of links to news sites), and another for comments (consisting of actual Reddit users discussing political news). As this is a binary classification problem where the two classes are of equal interest and will be approximately balanced, we will use the accuracy score as the main metric to gauge the success of the classification model. 

Because political news is always evolving, we have chosen a specific moment in time: the month leading up to the 2022 midterms, October 6th to November 6th, 2022. This ensures that 1) the same news cycle is covered for both subreddits, 2) both subreddits were at peak activity, and 3) there is maximum potential for interesting insights into the way that news is discussed within these two subreddits.

## In this Notebook

I will use the Pushshift API to obtain data from the two subreddits (r/politics and r/conservative) for the specified date range. I will then extract the information I want from the obtained data, convert it to pandas dataframes, and save the dataframes to two csv files: all_submissions.csv and all_comments.csv.

In [1]:
import requests
import datetime as dt
import time
import pandas as pd

## Posts

The first thing I would like to do is collect all posts made in the specified month (October 6th, 2022 until November 6th, 2022). First let's define the two subreddits:

In [110]:
subreddit1 = 'politics'
subreddit2 = 'conservative'

Now I'll write a function to actually collect data with the Pushshift API. From previous experiments, it seems that I'm not always able to successfully connect to the API. So I'll build in a couple of failsafes: I'll loop over each day of the month and collect all the data. At the end of the loop, I'll repeat for the days where it didn't manage to connect successfully, until all data has been collected:

In [241]:
def collect_subreddit_data(subreddit, start_date='10-06-2022', n_days=30, rtype='submission'):
    """
    This function collects data (submissions or comments) from a given subreddit, over a given range of days, using the Pushshift API.
    Because the Pushshift API is kinda finicky and API requests aren't always successful, I'm wrapping everything in a while loop
    so that it can keep trying until it's made a successful request for each of my specified dates.
    Returns: a list of JSON objects, one for each of the days
    """
    
    pushshift_url= 'https://api.pushshift.io/reddit/' + rtype + '/search'
    print(pushshift_url)
    subr_data = []
    success_list = []
    mm, dd, yy = start_date.split('-')
    tot_posts = 0
    # only stop when data was successfully retrieved from all the days
    q = 0 
    while len(success_list) < n_days:
        s_epoch_start = int(dt.datetime(int(yy),int(mm),int(dd),0,0).timestamp())

        for i in range(1, n_days+1):
            print('-' * 30)

            # Note: the window is advanced before its first use, so day 1 starts 24 hours after start_date
            s_epoch_start += 24*3600
            if i in success_list: continue
            print(f'Day of the month {i}')
            s_epoch_end = s_epoch_start + 24*3600
            
            params = {'subreddit':subreddit, 'since':s_epoch_start, 'until':s_epoch_end , 'size':1000}
            res = requests.get(pushshift_url, params)
            print(f'Status code: {res.status_code}')
            
            if res.status_code != 200:
                print('Connection unsuccessful')
                continue
            else:
                data = res.json()
                if len(data['data']) == 0:
                    print('Connection successful but no data retrieved.') 
                    continue
                subr_data.append(data)
                success_list.append(i) # no need to try this day again on subsequent loops
                tot_posts += len(data['data'])
                
            time.sleep(15) #polite
        print(f'Total length of the list: {len(success_list)}')
        q +=1
        if q == 10: break # Failsafe to make sure that if the API is down, the while loop doesn't keep going forever
    print(f'Great Success! Total number of posts/comments retrieved: {tot_posts}')
    return subr_data
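One thing worth noting about the function above: calling `.timestamp()` on a naive `datetime` interprets it in the local timezone of the machine running the notebook, so the day boundaries depend on where the code is run. A minimal sketch of building the daily windows explicitly in UTC is shown below. `daily_windows` is a hypothetical helper, not part of the original notebook, and it assumes the first window should start at the start date itself.

```python
import datetime as dt

def daily_windows(start_date, n_days):
    """Return a list of (start_epoch, end_epoch) tuples, one per day, in UTC.

    start_date uses the same 'MM-DD-YYYY' format as collect_subreddit_data.
    """
    mm, dd, yy = (int(part) for part in start_date.split('-'))
    # Attach an explicit UTC timezone so the result does not depend on the local machine
    start = dt.datetime(yy, mm, dd, tzinfo=dt.timezone.utc)
    windows = []
    for i in range(n_days):
        day_start = start + dt.timedelta(days=i)
        day_end = day_start + dt.timedelta(days=1)
        windows.append((int(day_start.timestamp()), int(day_end.timestamp())))
    return windows

windows = daily_windows('10-06-2022', 31)
print(len(windows), windows[0])
```

Each tuple could then be passed as the `since` and `until` parameters of a request, in place of the running `s_epoch_start` counter.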

In [50]:
subr1_data = collect_subreddit_data(subreddit1, start_date='10-06-2022', n_days=31)

------------------------------
Day of the month 1
Status code: 200
------------------------------
Day of the month 2
Status code: 200
------------------------------
Day of the month 3
Status code: 200
------------------------------
Day of the month 4
Status code: 200
------------------------------
Day of the month 5
Status code: 200
------------------------------
Day of the month 6
Status code: 200
------------------------------
Day of the month 7
Status code: 200
------------------------------
Day of the month 8
Status code: 200
------------------------------
Day of the month 9
Status code: 200
------------------------------
Day of the month 10
Status code: 200
------------------------------
Day of the month 11
Status code: 200
------------------------------
Day of the month 12
Status code: 200
------------------------------
Day of the month 13
Status code: 200
------------------------------
Day of the month 14
Status code: 200
------------------------------
Day of the month 15
Status

In [52]:
subr2_data = collect_subreddit_data(subreddit2, start_date='10-06-2022', n_days=31)

------------------------------
Day of the month 1
Status code: 524
Connection unsuccessful
------------------------------
Day of the month 2
Status code: 200
------------------------------
Day of the month 3
Status code: 200
------------------------------
Day of the month 4
Status code: 200
------------------------------
Day of the month 5
Status code: 524
Connection unsuccessful
------------------------------
Day of the month 6
Status code: 200
------------------------------
Day of the month 7
Status code: 200
------------------------------
Day of the month 8
Status code: 200
------------------------------
Day of the month 9
Status code: 200
------------------------------
Day of the month 10
Status code: 200
------------------------------
Day of the month 11
Status code: 200
------------------------------
Day of the month 12
Status code: 200
------------------------------
Day of the month 13
Status code: 200
------------------------------
Day of the month 14
Status code: 524
Connectio

In [244]:
#display the keys of a single post dictionary
subr1_data[0]['data'][0].keys()

dict_keys(['all_awardings', 'allow_live_comments', 'archived', 'author', 'author_created_utc', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'can_gild', 'category', 'content_categories', 'contest_mode', 'created_utc', 'discussion_type', 'distinguished', 'domain', 'edited', 'gilded', 'gildings', 'hidden', 'hide_score', 'id', 'is_created_from_ads_ui', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media', 'media_embed', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'post_hint', 'prev

subr1_data and subr2_data are now lists with the following structure:
1) each entry is all the posts for a given day
2) indexing into the lists, selecting ['data'], gives me a list with all the data for that day
3) indexing into this gives me a dictionary with the data for a single post
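The nesting described above can be illustrated with a small mock. The values below are made up and only mimic the shape of the collected data (a list of per-day dictionaries, each with a 'data' list of post dictionaries).

```python
# Mock data with the same nesting as subr1_data; the ids and titles are invented
mock_data = [
    {'data': [{'id': 'aaa111', 'title': 'First post of day 1'},
              {'id': 'aaa112', 'title': 'Second post of day 1'}]},
    {'data': [{'id': 'bbb221', 'title': 'First post of day 2'}]},
]

day = mock_data[0]      # 1) one entry per retrieved day
posts = day['data']     # 2) list with all the posts for that day
post = posts[0]         # 3) dictionary with the data for a single post

# Flatten the two levels into a single list of post dictionaries
all_posts = [p for d in mock_data for p in d['data']]
print(len(all_posts), post['title'])
```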

Let's write a function to take these lists, and return a pandas dataframe with
1) the unique ID of the post
2) the time it was posted
3) the title
4) the selftext (if present)
5) the URL the post links to
6) the number of comments
7) the upvote ratio

In [327]:
def create_submissions_df(subr_data, subr_name):
    
    #create a new list of dictionaries that has the format we want
    df_l = []
    features = ['id', 'created_utc', 'title', 'selftext', 'url', 'num_comments', 'upvote_ratio']
    for subs_day in subr_data:
        
        for post in subs_day['data']:
            # keep only the selected features of each post
            post_dict = {key: post[key] for key in features}
            df_l.append(post_dict)
            
    sub_df = pd.DataFrame(df_l)
    sub_df['subreddit'] = subr_name
    sub_df.set_index('id', inplace=True)
    return sub_df

Let's try this out for our first subreddit, r/politics

In [328]:
subr1_df = create_submissions_df(subr1_data, subreddit1)
subr1_df.shape

(10010, 7)

In [329]:
subr1_df.head()

id,created_utc,title,selftext,url,num_comments,upvote_ratio,subreddit
xylvpq,1665212044,"Editorial: Hey, QAnon — Texas had an actual ch...",,https://www.houstonchronicle.com/opinion/edito...,688,0.97,politics
xylh3y,1665210574,Sanders: Biden’s Marijuana Pardons Are Good — ...,,https://truthout.org/articles/sanders-bidens-m...,269,0.97,politics
xyla6d,1665209886,Elon Musk suggests making Taiwan a ‘special ad...,[removed],,24,0.26,politics
xykwzh,1665208590,Anyone else in Chicago noticing how Fox News k...,[removed],,1,1.0,politics
xykox1,1665207791,Urfi wore a bold saree! Spread the flames of h...,,https://countryconnect.in/entertainment-news/u...,1,1.0,politics
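
The `created_utc` column holds Unix timestamps in seconds. A minimal sketch of converting them to readable datetimes with `pd.to_datetime` follows. The small dataframe reuses two ids and timestamps from the table above, and the `created_dt` column name is an assumption for illustration.

```python
import pandas as pd

example_df = pd.DataFrame(
    {'created_utc': [1665212044, 1665210574]},
    index=pd.Index(['xylvpq', 'xylh3y'], name='id'),
)

# unit='s' tells pandas the integers are seconds since the Unix epoch
example_df['created_dt'] = pd.to_datetime(example_df['created_utc'], unit='s', utc=True)
print(example_df)
```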


Seems like that worked!
Let's double check if there are any duplicates. There shouldn't be, since I iterated over subsequent time intervals.

In [330]:
subr1_df.groupby(level=0).first().shape

(10010, 7)
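
An equivalent and somewhat more direct check is to ask pandas whether the index is unique. The sketch below uses a small made-up dataframe (the ids are placeholders) to show `Index.is_unique` and `Index.duplicated`.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'title': ['a', 'b', 'c', 'a again']},
    index=pd.Index(['id1', 'id2', 'id3', 'id1'], name='id'),
)

# is_unique is False as soon as any id repeats
print(toy_df.index.is_unique)

# duplicated() marks every repeat beyond the first occurrence
n_dupes = toy_df.index.duplicated().sum()
print(n_dupes)

# Keep only the first occurrence of each id
deduped = toy_df[~toy_df.index.duplicated(keep='first')]
print(deduped.shape)
```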

Now let's repeat the procedure for r/conservative

In [331]:
subr2_df = create_submissions_df(subr2_data, subreddit2)
subr2_df.shape

(10411, 7)

In [332]:
subr2_df.groupby(level=0).first().shape

(10411, 7)

In [333]:
subr2_df.head()

id,created_utc,title,selftext,url,num_comments,upvote_ratio,subreddit
xzewes,1665298432,"Kanye is getting ""cancelled"" on twitter by the...",[removed],,0,1.0,conservative
xzek87,1665297163,"Attack On Free Speech' Paypals ""Misinformation...",[removed],https://www.reddit.com/r/Conservative/comments...,0,1.0,conservative
xzeit9,1665297021,Is it really any different if I move?,[removed],https://www.reddit.com/r/Conservative/comments...,0,1.0,conservative
xzefy5,1665296708,Will some people always assume the worst about...,[removed],https://www.reddit.com/r/Conservative/comments...,1,1.0,conservative
xzebti,1665296283,Any Help against Far-Left Liberals on Twitter?,[removed],https://www.reddit.com/r/Conservative/comments...,0,1.0,conservative


Now let's merge this into one CSV and save

In [334]:
subr_tot = pd.concat([subr1_df, subr2_df])


In [335]:
subr_tot.to_csv('../data/all_submissions.csv')

Looking at the 'selftext' column, it seems that the vast majority of selftexts are empty, or removed/deleted:

In [336]:
subr_tot['selftext'].value_counts(normalize=True)[:3]

             0.775133
[removed]    0.179962
[deleted]    0.036482
Name: selftext, dtype: float64
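
To make these categories explicit, one option is to label each selftext as empty, removed, deleted, or containing actual text. This is a sketch on made-up rows. `selftext_status` is a hypothetical helper and the label names are assumptions.

```python
import pandas as pd

def selftext_status(text):
    """Classify a selftext value into a coarse category."""
    if pd.isna(text) or str(text).strip() == '':
        return 'empty'
    if text == '[removed]':
        return 'removed'
    if text == '[deleted]':
        return 'deleted'
    return 'text'

toy = pd.Series(['', '[removed]', '[deleted]', 'An actual self post', None])
status = toy.apply(selftext_status)
print(status.value_counts())
```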

## Comments

I would also like to look at comments to obtain more information on how people actually speak on these subreddits. Comments are not included in the submissions data, so I will need to scrape them separately:

In [337]:
subr1_comments = collect_subreddit_data(subreddit1, start_date='10-06-2022', n_days=31, rtype='comment')

https://api.pushshift.io/reddit/comment/search
------------------------------
Day of the month 1
Status code: 200
------------------------------
Day of the month 2
Status code: 200
------------------------------
Day of the month 3
Status code: 200
------------------------------
Day of the month 4
Status code: 200
------------------------------
Day of the month 5
Status code: 200
------------------------------
Day of the month 6
Status code: 200
------------------------------
Day of the month 7
Status code: 200
------------------------------
Day of the month 8
Status code: 200
------------------------------
Day of the month 9
Status code: 200
------------------------------
Day of the month 10
Status code: 200
------------------------------
Day of the month 11
Status code: 200
------------------------------
Day of the month 12
Status code: 200
------------------------------
Day of the month 13
Status code: 200
------------------------------
Day of the month 14
Status code: 200
----------

In [251]:
subr2_comments = collect_subreddit_data(subreddit2, start_date='10-06-2022', n_days=31, rtype='comment')

https://api.pushshift.io/reddit/comment/search
------------------------------
Day of the month 1
Status code: 524
Connection unsuccessful
------------------------------
Day of the month 2
Status code: 200
------------------------------
Day of the month 3
Status code: 200
------------------------------
Day of the month 4
Status code: 200
------------------------------
Day of the month 5
Status code: 200
------------------------------
Day of the month 6
Status code: 200
------------------------------
Day of the month 7
Status code: 200
------------------------------
Day of the month 8
Status code: 200
------------------------------
Day of the month 9
Status code: 200
------------------------------
Day of the month 10
Status code: 200
------------------------------
Day of the month 11
Status code: 200
------------------------------
Day of the month 12
Status code: 200
------------------------------
Day of the month 13
Status code: 200
------------------------------
Day of the month 14
Sta

What do the keys of 'comment' dictionaries look like?

In [253]:
subr1_comments[0]['data'][0].keys()

dict_keys(['all_awardings', 'archived', 'associated_award', 'author', 'author_created_utc', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_is_blocked', 'author_patreon_flair', 'author_premium', 'body', 'can_gild', 'collapsed', 'collapsed_because_crowd_control', 'collapsed_reason', 'collapsed_reason_code', 'comment_type', 'controversiality', 'created_utc', 'distinguished', 'edited', 'gilded', 'gildings', 'id', 'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 'permalink', 'retrieved_utc', 'score', 'score_hidden', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_type', 'top_awarded_type', 'total_awards_received', 'treatment_tags', 'unrepliable_reason', 'updated_utc', 'body_sha1', 'utc_datetime_str'])

It seems like the actual comments are saved in 'body':

In [264]:
subr1_comments[0]['data'][3]['body']

'If he was making calls to Georgia for someone else, imagine what he was doing for himself in SC.'

We'll want to save these comments in another csv file. Let's write a function that saves
1) id
2) parent id 
3) the author
4) date created
5) actual comment ('body')
6) score

In [313]:
def create_comments_df(subr_data, subr_name):
    
    #create a new list of dictionaries that has the format we want
    df_l = []
    features = ['id', 'parent_id', 'author', 'created_utc', 'body', 'score']
    for subs_day in subr_data:
        for post in subs_day['data']:
            # keep only the selected features of each comment
            post_dict = {key: post[key] for key in features}
            df_l.append(post_dict)
    com_df = pd.DataFrame(df_l)
    com_df['subreddit'] = subr_name
    com_df.set_index('id', inplace=True)
    return com_df

In [314]:
subr1_com_df = create_comments_df(subr1_comments, subreddit1)
subr1_com_df.shape

(30956, 6)
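
Each request in `collect_subreddit_data` asks for at most 1000 items (`'size': 1000`), and 30956 is close to 31 × 1000, so it may be that most days hit that limit and only part of each day's comments was retrieved. A minimal sketch for checking this on the collected list is shown below. `days_at_cap` is a hypothetical helper and the demo data is made up.

```python
def days_at_cap(subr_data, cap=1000):
    """Return the positions in subr_data whose 'data' list reached the request size cap."""
    return [i for i, day in enumerate(subr_data) if len(day['data']) >= cap]

# Made-up demo with a cap of 3 instead of 1000
demo = [{'data': [1, 2, 3]}, {'data': [1]}, {'data': [1, 2, 3]}]
capped = days_at_cap(demo, cap=3)
print(capped)
```

Running `days_at_cap(subr1_comments)` on the real list would show which entries are likely truncated samples rather than complete days.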

In [315]:
subr1_com_df.head()

id,parent_id,author,created_utc,body,score,subreddit
irhr7g7,40844090000.0,stickznstonez_,1665212396,https://youtu.be/i1oCQ6bZ_Ws\n\nThis guy might...,1,politics
irhr7fd,40840990000.0,PoliticsModeratorBot,1665212395,Hi `PhilipLiptonSchrute`. [Your comment](/r/po...,1,politics
irhr7bp,,valcatrina,1665212393,I am surprised it takes the FBI to draw this l...,1,politics
irhr79r,40843470000.0,StrillyBings,1665212392,If he was making calls to Georgia for someone ...,1,politics
irhr79a,,After_Ad_9636,1665212391,Duh?\n\nWhy wouldn’t he?,1,politics


Not sure about those NaN values for parent_id, but let's keep this as is.

In [316]:
subr2_com_df = create_comments_df(subr2_comments, subreddit2)
subr2_com_df.shape

(30953, 6)

In [317]:
subr2_com_df.head()

id,parent_id,author,created_utc,body,score,subreddit
irlvo5v,40851400000.0,Domination11,1665298765,this is just such a shittily written article a...,1,conservative
irlvmx0,40848250000.0,[deleted],1665298736,[removed],1,conservative
irlvm7f,,SadNYSportsFan-11209,1665298721,But but ring wing white supremacist terrorists...,1,conservative
irlvj5a,,[deleted],1665298654,[removed],1,conservative
irlvfqo,,EpicToshima,1665298577,Why should she visit the border when the borde...,1,conservative


Now we can merge the two, and save:

In [318]:
subr_com_tot = pd.concat([subr1_com_df, subr2_com_df])
subr_com_tot.to_csv('../data/all_comments.csv')
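
When loading these files in later notebooks, passing `index_col='id'` to `pd.read_csv` restores the post or comment id as the index. The sketch below demonstrates the round trip with an in-memory buffer instead of the actual file, using made-up rows.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {'body': ['first comment', 'second comment'],
     'subreddit': ['politics', 'conservative']},
    index=pd.Index(['c1', 'c2'], name='id'),
)

buffer = io.StringIO()
toy.to_csv(buffer)      # same call as above, but writing to memory
buffer.seek(0)

# index_col='id' turns the first CSV column back into the dataframe index
restored = pd.read_csv(buffer, index_col='id')
print(restored)
```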