<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web Scraping: Data Gathering Notebook

_Authors: Patrick Wales-Dinan_

---

This started as a daunting project but quickly became a fun and interesting experience. I scraped Reddit's JSON API to gather unique posts from two subreddits: the California Politics subreddit and the Texas Politics subreddit. At first glance these seemed to provide a nice contrast with each other: the topics should overlap when the content centers on national politics and diverge when it centers on state and local politics. The hope was that they would provide enough features for a natural language processing model to correctly classify a post's origin.

I set up the classification problem to predict whether a post came from the California subreddit. The two samples were relatively large and almost equal in size (~980 Texas posts vs. ~930 California posts). After running multiple models through GridSearch, I found that they all performed well, with 98-99% accuracy on the training set and 92-93% accuracy on the testing set. This told me that my model did not suffer from a high degree of overfitting.

I decided to re-run my models without including important features such as 'california' & 'texas' to see how this changed the results. The accuracy score of the training set dropped to ~92% and the testing set accuracy dropped to ~82%. 

This told me that these features were playing a very strong role in helping to classify a post correctly. I then explored the beta values (coefficients) of the features to see how they looked.
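The re-run without the state names can be pictured as stripping those tokens from each title before the text is vectorized. The sketch below is only an assumption about how that could be done, not the code from the modeling notebook; `STATE_WORDS` and `strip_state_words` are hypothetical names.

```python
import re

# Hypothetical list of give-away tokens; the modeling notebook may have used a different set.
STATE_WORDS = ("california", "texas")

# Match each state name as a whole word, optionally followed by a possessive 's.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(STATE_WORDS) + r")(?:['’]s)?\b",
    re.IGNORECASE,
)

def strip_state_words(title):
    """Return the title with the state names removed and whitespace collapsed."""
    without = _PATTERN.sub(" ", title)
    return " ".join(without.split())

print(strip_state_words("New California rules for deadly police force"))
print(strip_state_words("Let's Talk About Why Texas's Hemp Law is Stupid"))
```

Related words such as "Californians" are left alone because the pattern only matches the bare state names.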

I was able to conclude that the model ran very consistently and that the misclassified posts were generally a result of a particular subreddit post explicitly talking about what was happening in the opposing state.

Overall this was a fantastic learning experience and I thoroughly enjoyed the process.

## Contents:
- [Import Libraries](#Import-our-Libraries)
- [Creating our URLs](#Instantiate-our-URL)
- [Accessing the API](#Access-Reddit-API-and-Scrape-Posts)
- [Keep only unique Posts](#Check-to-be-Sure-Posts-are-Unique)
- [Cleaning and Creating Master DataFrame](#Clean-up-the-DataFrame)
- [Exporting our DataFrame for Modeling](#Export-as-CSV-File)

Please visit the Data Modeling notebook for an in-depth look at my data modeling and data visualization process: [Data Modeling Notebook](/Project_3_Data_Modeling.ipynb)


## Import our Libraries

In [71]:
import requests
import time
import pandas as pd
import numpy as np
import copy


pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Instantiate our URL

In [2]:
tx_url = 'https://www.reddit.com/r/TexasPolitics.json'
ca_url = 'https://www.reddit.com/r/California_Politics.json'
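Each listing URL serves a single page of posts; further pages come from the same URL with query parameters appended. As a rough illustration, the final request URL could be assembled as below. The helper name `listing_url` is made up for this sketch, and the example token is simply a post name of the form Reddit returns; `after` and `limit` are query parameters accepted by Reddit's listing endpoints.

```python
from urllib.parse import urlencode

def listing_url(base_url, after=None, limit=None):
    """Build the URL for one page of a subreddit listing."""
    params = {}
    if after is not None:
        params["after"] = after   # pagination token from the previous page
    if limit is not None:
        params["limit"] = limit   # posts per page
    return base_url if not params else f"{base_url}?{urlencode(params)}"

base = "https://www.reddit.com/r/TexasPolitics.json"
print(listing_url(base))
print(listing_url(base, after="t3_cb1r8d", limit=100))
```

Passing a `params` dictionary to `requests.get` performs the same encoding, so a helper like this is mainly useful for logging or debugging.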

## Access Reddit API and Scrape Posts

In [7]:
def get_posts(url):
    # Unique user agent so that Reddit accepts the requests
    user_agent = {'User-agent': 'pat bot 0.1'}

    # Accumulates the raw post dictionaries from every page
    posts = []

    # `after` is the pagination token; None requests the first page
    after = None

    for i in range(60):
        print(i)
        params = {} if after is None else {'after': after}
        res = requests.get(url, params=params, headers=user_agent)
        if res.status_code != 200:
            # Report the failing status code and stop scraping
            print(res.status_code)
            break
        data = res.json()['data']
        posts.extend(data['children'])
        after = data['after']
        # Pause between requests to be polite to the API
        time.sleep(2)
    return posts
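The pagination pattern in `get_posts` can be exercised without touching the network by swapping the HTTP call for a function argument. The sketch below is an illustration only: `collect_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and stopping as soon as `after` comes back as `None` is an added assumption, not something the cell above does.

```python
def collect_pages(fetch_page, max_pages=60):
    """Follow `after` tokens until the listing ends or max_pages is reached."""
    posts, after = [], None
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:  # the listing has no further pages
            break
    return posts

# A made-up two-page listing shaped like Reddit's JSON response.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}},
                                 {"data": {"name": "t3_b"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}],
                      "after": None}},
}

def fake_fetch(after):
    """Stand-in for the HTTP request: look the page up by its token."""
    return FAKE_PAGES[after]

names = [p["data"]["name"] for p in collect_pages(fake_fetch)]
print(names)
```

Without such a stop, a request sent with no `after` token starts again from the first page, which is one reason a uniqueness check is still needed after scraping.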

In [5]:
# Scrape the Texas subreddit (slow: the function pauses 2 seconds between requests)
tx_posts = get_posts(tx_url)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59


In [6]:
# Scrape the California subreddit (slow: the function pauses 2 seconds between requests)
ca_posts = get_posts(ca_url)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59


In [8]:
len(ca_posts)
len(set([p['data']['name'] for p in ca_posts]))

938

In [9]:
len(tx_posts)
len(set([p['data']['name'] for p in tx_posts]))

982

## Check to be Sure Posts are Unique
### If they aren't, remove the duplicates. Then put the unique posts in a DataFrame

In [10]:
# Checking to ensure that the posts are unique
ca_post_new = []
ca_post_names = set() # Names of California posts already seen, kept in a set to retain uniqueness
for post_dict in ca_posts:
    keep_data = post_dict['data']
    if keep_data['name'] not in ca_post_names:
        ca_post_new.append(keep_data)
        ca_post_names.add(keep_data['name'])
df_ca = pd.DataFrame(ca_post_new) # Adding unique posts to DataFrame
df_ca.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,theProgressiveGOP,,,,[],,,,text,t2_3x0d2uzk,False,,,False,False,,False,,False,1562712000.0,1562683000.0,,,,,calmatters.org,0,False,0,{},False,True,cb1s4p,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_cb1s4p,False,0,0,,False,,/r/California_Politics/comments/cb1s4p/state_m...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,13,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/GcoPg0hQ78iGU...,93.0,140.0,"State May Push Cities and Counties to Draw ""fa...",0,13,https://calmatters.org/articles/redistricting-...,[],,False,,
1,[],False,,,False,CALmatters,,,,[],,,,text,t2_kwolsnv,False,,,False,False,,False,,False,1562668000.0,1562639000.0,,,,,calmatters.org,0,False,0,{},False,False,caupt1,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_caupt1,False,3,1,,False,,/r/California_Politics/comments/caupt1/new_cal...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,45,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/2FFGh5hCRRYIk...,93.0,140.0,New California rules for deadly police force g...,0,45,https://calmatters.org/articles/ca-passes-dead...,[],,False,,
2,[],False,,,False,BlankVerse,,,,[],,,,text,t2_97a3,False,,,False,False,,False,,False,1562634000.0,1562606000.0,,,,,thetrace.org,0,False,0,{},False,False,canr0y,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_canr0y,False,21,0,,False,,/r/California_Politics/comments/canr0y/the_nra...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,53,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/5y-0hwerp_6jF...,93.0,140.0,The NRA Opposes A California Gun Regulation It...,0,53,https://www.thetrace.org/rounds/california-rea...,[],,False,,
3,[],False,,,False,travadera,,,,[],,,,text,t2_10ukzyn2,False,,,False,False,,False,,False,1562636000.0,1562607000.0,,,,,latimes.com,0,False,0,{},False,False,cao5lo,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_cao5lo,False,5,1,,False,,/r/California_Politics/comments/cao5lo/ca15_er...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,20,,{},,,False,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/EXvjRI9EHJh4w...,78.0,140.0,[CA-15] Eric Swalwell is expected to withdraw ...,0,20,https://www.latimes.com/politics/la-na-pol-202...,[],,False,,
4,[],False,,,False,BlankVerse,,,,[],,,,text,t2_97a3,False,,,False,False,,False,,False,1562642000.0,1562613000.0,,,,,cnn.com,0,False,0,{},False,False,capegr,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_capegr,False,4,0,,False,,/r/California_Politics/comments/capegr/eric_sw...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,12,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/pKle1lzNaOWeA...,78.0,140.0,Eric Swalwell expected to end presidential bid...,0,12,https://www.cnn.com/2019/07/08/politics/eric-s...,[],,False,,


In [11]:
# Checking to ensure that the posts are unique
tx_post_new = []
tx_post_names = set() # Names of Texas posts already seen, kept in a set to retain uniqueness
for post_dict in tx_posts: 
    keep_data = post_dict['data']
    if keep_data['name'] not in tx_post_names:
        tx_post_new.append(keep_data)
        tx_post_names.add(keep_data['name'])
df_tx = pd.DataFrame(tx_post_new) # Adding unique posts to DataFrame
df_tx.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_metadata,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,arcanition,,3,[],17553cd2-9c63-11e7-b44c-0e30f0006cb4,3rd District (Northern Dallas Suburbs),dark,text,t2_5d5mc,False,,,False,False,,False,,False,1559800000.0,1559771000.0,,,,moderator,self.TexasPolitics,0,1.55984e+09,0,{},False,False,bx8cik,False,False,False,False,True,True,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_bx8cik,False,22,0,,False,,/r/TexasPolitics/comments/bx8cik/welcome_new_r...,False,,False,,,False,13,,{},"Hey all,\n\nAfter much time reading applicatio...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,True,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,,,Welcome New /r/TexasPolitics Moderators - Q&amp;A,0,13,https://www.reddit.com/r/TexasPolitics/comment...,[],,False,,
1,[],True,,,False,Texas_Monthly,,verified,[],,Verified - Texas Monthly,dark,text,t2_3x7xx9qc,False,,,False,False,,False,,False,1561003000.0,1560974000.0,,,,,self.TexasPolitics,0,1.56107e+09,0,{},False,False,c2lven,False,False,False,False,True,True,False,,,ama,[],b8855642-9c62-11e7-ae9f-0e71ceb054c0,AMA,dark,text,False,,{},,False,,,,[],t3_c2lven,False,243,0,,False,,/r/TexasPolitics/comments/c2lven/im_chris_hook...,False,,False,,,False,85,,{},"Hey, r/TexasPolitics! I’m Chris Hooks, a write...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,True,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,qa,,"I’m Chris Hooks, a Texas Monthly writer who wo...",0,85,https://www.reddit.com/r/TexasPolitics/comment...,[],,False,,
2,[],False,,,False,beanzamillion21,,12,[],2584f856-9c63-11e7-93b7-0e2bf15991f0,12th Congressional District (Western Fort Worth),dark,text,t2_50w01,False,,,False,False,,False,,False,1562712000.0,1562683000.0,,,,,dallasnews.com,0,False,0,{},False,True,cb1r8d,False,False,False,False,True,False,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_cb1r8d,False,8,0,,False,,/r/TexasPolitics/comments/cb1r8d/ross_perot_se...,False,,False,,,False,25,,{},,,True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,,,"Ross Perot, self-made billionaire, patriot and...",0,25,https://www.dallasnews.com/business/business/2...,[],,False,,
3,[],False,,,False,irony_glazed,,,[],,,,text,t2_442pim8f,False,,,False,False,,False,,False,1562678000.0,1562649000.0,,,,,self.TexasPolitics,0,1.56268e+09,0,{},False,False,cawekm,False,False,False,False,True,True,False,,,discussion,[],ac3a0f90-9c62-11e7-9e00-0e65ddf91c6e,Discussion,dark,text,False,,{},,False,,,,[],t3_cawekm,False,14,1,,False,,/r/TexasPolitics/comments/cawekm/lets_talk_abo...,False,,False,,,False,37,,{},"Disclaimer: I am not a lawyer, this is not leg...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,,,Let's Talk About Why Texas's Hemp Law is Stupid,0,37,https://www.reddit.com/r/TexasPolitics/comment...,[],,False,,
4,[],False,,,False,beanzamillion21,,12,[],2584f856-9c63-11e7-93b7-0e2bf15991f0,12th Congressional District (Western Fort Worth),dark,text,t2_50w01,False,,,False,False,,False,,False,1562712000.0,1562683000.0,,,,,housingwire.com,0,False,0,{},False,True,cb1sin,False,False,False,False,True,False,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_cb1sin,False,3,0,,False,,/r/TexasPolitics/comments/cb1sin/this_texas_to...,False,,False,,,False,2,,{},,,True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,,,This Texas town is the most affordable housing...,0,2,https://www.housingwire.com/articles/49504-thi...,[],,False,,
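The two uniqueness cells above follow the same steps, so the logic could be shared through one helper. This is a minimal sketch under that assumption; `unique_posts` is a hypothetical name and the sample records are made up, shaped like the `children` entries in Reddit's response.

```python
def unique_posts(posts):
    """Return each post's 'data' dict once, keeping first-seen order."""
    seen = set()
    unique = []
    for post in posts:
        data = post["data"]
        if data["name"] not in seen:
            seen.add(data["name"])
            unique.append(data)
    return unique

# Made-up sample with one repeated post name.
sample = [
    {"data": {"name": "t3_cb1s4p", "title": "first"}},
    {"data": {"name": "t3_caupt1", "title": "second"}},
    {"data": {"name": "t3_cb1s4p", "title": "first"}},
]

deduped = unique_posts(sample)
print(len(sample), len(deduped))
```

With such a helper, each DataFrame could be built as `pd.DataFrame(unique_posts(ca_posts))` and `pd.DataFrame(unique_posts(tx_posts))`.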


## Clean up the DataFrame

In [12]:
# Dropping all columns except for the subreddit identifier, the title and the number of comments. 
# (Considered that it might be fun to do a number of comments analysis in the future)
df_tx = df_tx[['subreddit', 'title', 'num_comments']]

In [13]:
# Checking the length
df_tx.shape

(982, 3)

In [14]:
# Dropping all columns except for the subreddit identifier, the title and the number of comments. 
# (Considered that it might be fun to do a number of comments analysis in the future)
df_ca = df_ca[['subreddit', 'title', 'num_comments']]

In [15]:
# Checking the length
df_ca.shape

(938, 3)

In [16]:
# Putting the DataFrames together
df_reddit = pd.concat([df_ca, df_tx])
df_reddit.head(5)
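Stacking two frames this way keeps each frame's original row labels, so the combined index contains repeated values. The toy frames below are made up to show the difference that `ignore_index=True` makes; whether a fresh index is wanted here is a design choice and not something the notebook specifies.

```python
import pandas as pd

# Two tiny stand-ins for df_ca and df_tx.
ca = pd.DataFrame({"title": ["ca one", "ca two"]})
tx = pd.DataFrame({"title": ["tx one", "tx two"]})

kept = pd.concat([ca, tx])                      # row labels repeat
fresh = pd.concat([ca, tx], ignore_index=True)  # row labels renumbered

print(kept.index.tolist())
print(fresh.index.tolist())
```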

In [18]:
# Categorize the California subreddit as 1 and the Texas subreddit as 0.
# I will try to predict the California subreddit
df_reddit['ca'] = df_reddit['subreddit'].map({'California_Politics':1,
                                                 'TexasPolitics':0})
df_reddit.drop(labels='subreddit', axis=1, inplace=True)

In [20]:
df_reddit

Unnamed: 0,title,num_comments,ca
0,"State May Push Cities and Counties to Draw ""fa...",0,1
1,New California rules for deadly police force g...,3,1
2,The NRA Opposes A California Gun Regulation It...,21,1
3,[CA-15] Eric Swalwell is expected to withdraw ...,5,1
4,Eric Swalwell expected to end presidential bid...,4,1
5,Tom Steyer Is Telling Allies He’s Running for ...,9,1
6,California's Governor is Asking Trump for Emer...,29,1
7,State Promises to Rebuild: Ridgecrest Will Not...,4,1
8,How California made a 'dramatic' impact on kin...,0,1
9,California's Politically Powerful Unions Aim T...,12,1
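`Series.map` with a dictionary leaves `NaN` for any value missing from the dictionary, so a misspelled subreddit name would silently produce an unlabeled row. The sketch below adds a guard for that case; `add_ca_label` and `LABELS` are hypothetical names and the sample frame is made up.

```python
import pandas as pd

LABELS = {"California_Politics": 1, "TexasPolitics": 0}

def add_ca_label(df):
    """Return a copy with a binary 'ca' column in place of 'subreddit'."""
    out = df.copy()
    out["ca"] = out["subreddit"].map(LABELS)
    missing = out["ca"].isna()
    if missing.any():
        unknown = sorted(out.loc[missing, "subreddit"].unique())
        raise ValueError(f"Unmapped subreddits: {unknown}")
    out["ca"] = out["ca"].astype(int)
    return out.drop(columns="subreddit")

sample = pd.DataFrame({
    "subreddit": ["California_Politics", "TexasPolitics"],
    "title": ["a California title", "a Texas title"],
})
labeled = add_ca_label(sample)
print(labeled)
```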


## Export as CSV File

In [74]:
# Save the DataFrame to a CSV to use in the modeling notebook
df_reddit.to_csv('reddit.csv', index=True)
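Because the export keeps the index (`index=True`), the first column of `reddit.csv` has an empty header. A plain `pd.read_csv` then surfaces it as a column named `Unnamed: 0`, while `index_col=0` restores it as the index. The toy frame below is made up for illustration and the round trip goes through an in-memory buffer instead of a file; how the modeling notebook actually reads the file is not shown here.

```python
import io
import pandas as pd

# Small stand-in for df_reddit.
df = pd.DataFrame({"title": ["a", "b"], "num_comments": [0, 3], "ca": [1, 0]})

buffer = io.StringIO()
df.to_csv(buffer, index=True)

buffer.seek(0)
naive = pd.read_csv(buffer)                  # index comes back as a column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index comes back as the index

print(list(naive.columns))
print(list(restored.columns))
```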