<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web Scraping: Data Gathering Notebook

_Authors: Patrick Wales-Dinan_

---

This lab was incredibly challenging. We had to extensively clean a data set that was missing many values and had a large amount of categorical data. Then we had to decide which features to use to model that data. After that we had to build and fit the models, deciding whether to use polynomial features or dummy variables, and whether to log-scale the features or the dependent variable.

After that we had to rerun our model repeatedly, looking at the different values of $\beta$ and checking whether they contributed to the predictive power of the model. We had to decide whether to throw those values out or keep them. We also had to make judgment calls about whether our model appeared to be overfitting or suffering from bias.

## Contents:
- [Creating our URLs](#Instantiate-our-URL)
- [Data Import](#Data-Import)
- [Feature Creation](#Feature-Creation)
- [Choosing the Features](#Feature-Choice)
- [Log Scaling](#Log-Scaling-Independent-Variables)
- [Cleaning the Data and Modifying the Data](#Cleaning-&-Creating-the-Data-Set)
- [Modeling the Data](#Modeling-the-Data)
- [Model Analysis](#Analyzing-the-model)

For additional visuals, see the [Graphs & Relationships notebook](/Users/pwalesdi/Desktop/GA/GA_Project_2/Project_2_Graphs_&_Relationships.ipynb).


In [71]:
import requests
import time
import pandas as pd
import numpy as np
import seaborn as sns
import copy

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB, MultinomialNB

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Instantiate our URL

In [2]:
tx_url = 'https://www.reddit.com/r/TexasPolitics.json'
ca_url = 'https://www.reddit.com/r/California_Politics.json'

In [7]:
def get_posts(url):
    # Unique user agent so that Reddit accepts the requests
    user_agent = {'User-agent' : 'pat bot 0.1'}
    
    # Empty posts list
    posts = []
    
    # `after` starts as None so the first request asks for the first page
    after = None
    
    for i in range(0,60):
        print(i)
        if after is None:
            params = {}
        else:
            params = {'after' : after}
        res = requests.get(url, params=params, headers=user_agent)
        if res.status_code == 200:
            data = res.json()
            posts.extend(data['data']['children'])
            after = data['data']['after']
        else: 
            # Report the failing status code and stop pulling
            print(res.status_code)
            break
        # Pause between requests
        time.sleep(2)
    return posts
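
The cursor logic in `get_posts` can be tried without touching the network. The sketch below is a minimal illustration, not the notebook's method: `paginate`, `fetch_page`, and `FAKE_PAGES` are hypothetical names, and the fake pages stand in for Reddit's responses. Unlike the loop above, it stops once the cursor comes back empty; continuing with an empty cursor would request the first page again, which may be one source of the duplicate posts removed later.

```python
def paginate(fetch_page, max_pages=60):
    """Collect items across pages by following an 'after' cursor.

    fetch_page is a hypothetical callable: it takes the current cursor
    (None for the first page) and returns (items, next_cursor).
    """
    items = []
    after = None
    for _ in range(max_pages):
        page_items, after = fetch_page(after)
        items.extend(page_items)
        if after is None:  # no further pages to request
            break
    return items


# Stand-in for the network: three fake pages keyed by cursor.
FAKE_PAGES = {
    None: (["t3_a", "t3_b"], "t3_b"),
    "t3_b": (["t3_c", "t3_d"], "t3_d"),
    "t3_d": (["t3_e"], None),
}

collected = paginate(lambda cursor: FAKE_PAGES[cursor])
print(collected)
```

Passing the fetch step in as a function keeps the paging logic testable on its own; a real `fetch_page` would wrap the `requests.get` call and the two-second pause.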

In [5]:
tx_posts = get_posts(tx_url)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59


In [6]:
ca_posts = get_posts(ca_url)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59


In [8]:
# Only the last expression is displayed: the number of unique post names
len(ca_posts)
len(set([p['data']['name'] for p in ca_posts]))

938

In [9]:
# Only the last expression is displayed: the number of unique post names
len(tx_posts)
len(set([p['data']['name'] for p in tx_posts]))

982

In [10]:
ca_post_new = []
ca_post_names = set()
for post_dict in ca_posts:
    keep_data = post_dict['data']
    if keep_data['name'] not in ca_post_names:
        ca_post_new.append(keep_data)
        ca_post_names.add(keep_data['name'])
df_ca = pd.DataFrame(ca_post_new)
df_ca.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,theProgressiveGOP,,,,[],,,,text,t2_3x0d2uzk,False,,,False,False,,False,,False,1562712000.0,1562683000.0,,,,,calmatters.org,0,False,0,{},False,True,cb1s4p,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_cb1s4p,False,0,0,,False,,/r/California_Politics/comments/cb1s4p/state_m...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,13,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/GcoPg0hQ78iGU...,93.0,140.0,"State May Push Cities and Counties to Draw ""fa...",0,13,https://calmatters.org/articles/redistricting-...,[],,False,,
1,[],False,,,False,CALmatters,,,,[],,,,text,t2_kwolsnv,False,,,False,False,,False,,False,1562668000.0,1562639000.0,,,,,calmatters.org,0,False,0,{},False,False,caupt1,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_caupt1,False,3,1,,False,,/r/California_Politics/comments/caupt1/new_cal...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,45,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/2FFGh5hCRRYIk...,93.0,140.0,New California rules for deadly police force g...,0,45,https://calmatters.org/articles/ca-passes-dead...,[],,False,,
2,[],False,,,False,BlankVerse,,,,[],,,,text,t2_97a3,False,,,False,False,,False,,False,1562634000.0,1562606000.0,,,,,thetrace.org,0,False,0,{},False,False,canr0y,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_canr0y,False,21,0,,False,,/r/California_Politics/comments/canr0y/the_nra...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,53,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/5y-0hwerp_6jF...,93.0,140.0,The NRA Opposes A California Gun Regulation It...,0,53,https://www.thetrace.org/rounds/california-rea...,[],,False,,
3,[],False,,,False,travadera,,,,[],,,,text,t2_10ukzyn2,False,,,False,False,,False,,False,1562636000.0,1562607000.0,,,,,latimes.com,0,False,0,{},False,False,cao5lo,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_cao5lo,False,5,1,,False,,/r/California_Politics/comments/cao5lo/ca15_er...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,20,,{},,,False,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/EXvjRI9EHJh4w...,78.0,140.0,[CA-15] Eric Swalwell is expected to withdraw ...,0,20,https://www.latimes.com/politics/la-na-pol-202...,[],,False,,
4,[],False,,,False,BlankVerse,,,,[],,,,text,t2_97a3,False,,,False,False,,False,,False,1562642000.0,1562613000.0,,,,,cnn.com,0,False,0,{},False,False,capegr,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_capegr,False,4,0,,False,,/r/California_Politics/comments/capegr/eric_sw...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,12,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7746,public,,https://b.thumbs.redditmedia.com/pKle1lzNaOWeA...,78.0,140.0,Eric Swalwell expected to end presidential bid...,0,12,https://www.cnn.com/2019/07/08/politics/eric-s...,[],,False,,


In [11]:
tx_post_new = []
tx_post_names = set()
for post_dict in tx_posts:
    keep_data = post_dict['data']
    if keep_data['name'] not in tx_post_names:
        tx_post_new.append(keep_data)
        tx_post_names.add(keep_data['name'])
df_tx = pd.DataFrame(tx_post_new)
df_tx.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_metadata,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,arcanition,,3,[],17553cd2-9c63-11e7-b44c-0e30f0006cb4,3rd District (Northern Dallas Suburbs),dark,text,t2_5d5mc,False,,,False,False,,False,,False,1559800000.0,1559771000.0,,,,moderator,self.TexasPolitics,0,1.55984e+09,0,{},False,False,bx8cik,False,False,False,False,True,True,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_bx8cik,False,22,0,,False,,/r/TexasPolitics/comments/bx8cik/welcome_new_r...,False,,False,,,False,13,,{},"Hey all,\n\nAfter much time reading applicatio...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,True,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,,,Welcome New /r/TexasPolitics Moderators - Q&amp;A,0,13,https://www.reddit.com/r/TexasPolitics/comment...,[],,False,,
1,[],True,,,False,Texas_Monthly,,verified,[],,Verified - Texas Monthly,dark,text,t2_3x7xx9qc,False,,,False,False,,False,,False,1561003000.0,1560974000.0,,,,,self.TexasPolitics,0,1.56107e+09,0,{},False,False,c2lven,False,False,False,False,True,True,False,,,ama,[],b8855642-9c62-11e7-ae9f-0e71ceb054c0,AMA,dark,text,False,,{},,False,,,,[],t3_c2lven,False,243,0,,False,,/r/TexasPolitics/comments/c2lven/im_chris_hook...,False,,False,,,False,85,,{},"Hey, r/TexasPolitics! I’m Chris Hooks, a write...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,True,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,qa,,"I’m Chris Hooks, a Texas Monthly writer who wo...",0,85,https://www.reddit.com/r/TexasPolitics/comment...,[],,False,,
2,[],False,,,False,beanzamillion21,,12,[],2584f856-9c63-11e7-93b7-0e2bf15991f0,12th Congressional District (Western Fort Worth),dark,text,t2_50w01,False,,,False,False,,False,,False,1562712000.0,1562683000.0,,,,,dallasnews.com,0,False,0,{},False,True,cb1r8d,False,False,False,False,True,False,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_cb1r8d,False,8,0,,False,,/r/TexasPolitics/comments/cb1r8d/ross_perot_se...,False,,False,,,False,25,,{},,,True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,,,"Ross Perot, self-made billionaire, patriot and...",0,25,https://www.dallasnews.com/business/business/2...,[],,False,,
3,[],False,,,False,irony_glazed,,,[],,,,text,t2_442pim8f,False,,,False,False,,False,,False,1562678000.0,1562649000.0,,,,,self.TexasPolitics,0,1.56268e+09,0,{},False,False,cawekm,False,False,False,False,True,True,False,,,discussion,[],ac3a0f90-9c62-11e7-9e00-0e65ddf91c6e,Discussion,dark,text,False,,{},,False,,,,[],t3_cawekm,False,14,1,,False,,/r/TexasPolitics/comments/cawekm/lets_talk_abo...,False,,False,,,False,37,,{},"Disclaimer: I am not a lawyer, this is not leg...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,,,Let's Talk About Why Texas's Hemp Law is Stupid,0,37,https://www.reddit.com/r/TexasPolitics/comment...,[],,False,,
4,[],False,,,False,beanzamillion21,,12,[],2584f856-9c63-11e7-93b7-0e2bf15991f0,12th Congressional District (Western Fort Worth),dark,text,t2_50w01,False,,,False,False,,False,,False,1562712000.0,1562683000.0,,,,,housingwire.com,0,False,0,{},False,True,cb1sin,False,False,False,False,True,False,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_cb1sin,False,3,0,,False,,/r/TexasPolitics/comments/cb1sin/this_texas_to...,False,,False,,,False,2,,{},,,True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5415,public,,,This Texas town is the most affordable housing...,0,2,https://www.housingwire.com/articles/49504-thi...,[],,False,,
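
The California and Texas de-duplication cells above repeat the same steps. As a sketch only, the shared logic could live in one helper; `dedupe_posts` and the `sample` data are hypothetical names introduced for illustration, and the input is assumed to have the same `{'data': {'name': ...}}` shape as the scraped posts.

```python
def dedupe_posts(posts):
    """Return each post's 'data' dict once, keeping first occurrences in order."""
    seen = set()
    unique = []
    for post in posts:
        data = post['data']
        if data['name'] not in seen:
            seen.add(data['name'])
            unique.append(data)
    return unique


# Toy input with one repeated post name.
sample = [
    {'data': {'name': 't3_a', 'title': 'first'}},
    {'data': {'name': 't3_b', 'title': 'second'}},
    {'data': {'name': 't3_a', 'title': 'first'}},
]
unique_sample = dedupe_posts(sample)
print([p['name'] for p in unique_sample])
```

With such a helper, each subreddit's frame would be built as `pd.DataFrame(dedupe_posts(ca_posts))` and `pd.DataFrame(dedupe_posts(tx_posts))`.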


In [12]:
df_tx = df_tx[['subreddit', 'title', 'num_comments']]


In [13]:
df_tx.shape

(982, 3)

In [14]:
df_ca = df_ca[['subreddit', 'title', 'num_comments']]


In [15]:
df_ca.shape

(938, 3)

In [16]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
df_reddit = pd.concat([df_ca, df_tx])

In [17]:
df_reddit.head(5)

Unnamed: 0,subreddit,title,num_comments
0,California_Politics,"State May Push Cities and Counties to Draw ""fa...",0
1,California_Politics,New California rules for deadly police force g...,3
2,California_Politics,The NRA Opposes A California Gun Regulation It...,21
3,California_Politics,[CA-15] Eric Swalwell is expected to withdraw ...,5
4,California_Politics,Eric Swalwell expected to end presidential bid...,4


In [18]:
df_reddit['ca'] = df_reddit['subreddit'].map({'California_Politics':1,
                                                 'TexasPolitics':0})


In [19]:
df_reddit.drop(labels='subreddit', axis=1, inplace=True)

In [20]:
df_reddit

Unnamed: 0,title,num_comments,ca
0,"State May Push Cities and Counties to Draw ""fa...",0,1
1,New California rules for deadly police force g...,3,1
2,The NRA Opposes A California Gun Regulation It...,21,1
3,[CA-15] Eric Swalwell is expected to withdraw ...,5,1
4,Eric Swalwell expected to end presidential bid...,4,1
5,Tom Steyer Is Telling Allies He’s Running for ...,9,1
6,California's Governor is Asking Trump for Emer...,29,1
7,State Promises to Rebuild: Ridgecrest Will Not...,4,1
8,How California made a 'dramatic' impact on kin...,0,1
9,California's Politically Powerful Unions Aim T...,12,1


In [74]:
df_reddit.to_csv('reddit.csv', index=True)

In [21]:
# Getting our baseline accuracy
df_reddit['ca'].value_counts(normalize=True)

0    0.511458
1    0.488542
Name: ca, dtype: float64
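
The baseline can be checked by hand from the two row counts reported earlier, 938 California posts and 982 Texas posts. The short sketch below redoes that arithmetic; the variable names are illustrative only.

```python
n_ca = 938  # rows in df_ca
n_tx = 982  # rows in df_tx
total = n_ca + n_tx

share_tx = n_tx / total  # class 0
share_ca = n_ca / total  # class 1

# Always predicting the larger class gives the baseline accuracy.
baseline = max(share_tx, share_ca)
print(round(share_tx, 6), round(share_ca, 6))
```

Any model worth keeping should beat the roughly 51% accuracy of always guessing the majority class.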

In [60]:
X = df_reddit['title']
y = df_reddit['ca']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=79)
pipe = Pipeline([
            ('vec', CountVectorizer()),
            ('model', LogisticRegression())
])



In [72]:
pipe_params = {
    'vec' : [CountVectorizer(), TfidfVectorizer()],
    'vec__max_features': [1500, 2000, 2500, 2700],
    'vec__min_df': [2, 3, 4],
#     'vec__max_df': [0.5, .60, .70],
    'vec__ngram_range': [(1,2), (1,1)],
    # Every entry must be an estimator instance, so MultinomialNB needs parentheses
    'model' : [LogisticRegression(), LogisticRegression(penalty='l1', solver='liblinear'), LogisticRegression(penalty='l2', solver='liblinear'), MultinomialNB()]
#     'vec__stop_words': ['english']
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5, verbose=1, n_jobs=2)
gs.fit(X_train, y_train)

print(f' Best Parameters: {gs.best_params_}')
print('')
print(f' Cross Validation Accuracy Score: {gs.best_score_}')
print(f' Training Data Accuracy Score: {gs.score(X_train, y_train)}')
print(f' Testing Data Accuracy Score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  61 tasks      | elapsed:    6.0s
[Parallel(n_jobs=2)]: Done 361 tasks      | elapsed:   19.5s
[Parallel(n_jobs=2)]: Done 717 out of 720 | elapsed:   35.9s remaining:    0.2s
[Parallel(n_jobs=2)]: Done 720 out of 720 | elapsed:   35.9s finished


 Best Parameters: {'model': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False), 'vec': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1500, min_df=4,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None), 'vec__max_features': 1500, 'vec__min_df': 4, 'vec__ngram_range': (1, 2)}

 Cross Validation Accuracy Score: 0.91875
 Training Data Accuracy Score: 0.98125
 Testing Data Accuracy Score: 0.9208333333333333
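
The size of the search reported in the log can be reproduced by multiplying the number of options per parameter. The sketch below assumes three `model` options, because that is the count that yields the logged 144 candidates; the four entries listed in the `model` list would give 192 instead. That the recorded run used three is an inference from the log, not something the notebook states.

```python
from math import prod

# Number of options per grid key; 'model': 3 is an assumption (see above).
option_counts = {
    'vec': 2,
    'vec__max_features': 4,
    'vec__min_df': 3,
    'vec__ngram_range': 2,
    'model': 3,
}
cv_folds = 5

n_candidates = prod(option_counts.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)

# The same grid with four model options.
n_candidates_four = prod({**option_counts, 'model': 4}.values())
print(n_candidates_four, n_candidates_four * cv_folds)
```

Running this kind of count before calling `fit` is a quick way to estimate how long a grid search will take.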


In [70]:
gs.cv_results_['param_model']



{'mean_fit_time': array([0.18865619, 0.03053555, 0.07005663, 0.03046961, 0.07289577,
        0.02966752, 0.07243733, 0.03045788, 0.0748312 , 0.03077445,
        0.06893911, 0.03069229, 0.07196903, 0.03061595, 0.07140865,
        0.03032808, 0.07067137, 0.03006439, 0.07305241, 0.03057675,
        0.07306881, 0.02957802, 0.07030916, 0.03063912, 0.06786838,
        0.02912092, 0.06871009, 0.0287528 , 0.06891675, 0.02798734,
        0.06915956, 0.02865872, 0.06998906, 0.03007245, 0.06853371,
        0.03061614, 0.07114568, 0.03089533, 0.07210298, 0.0287921 ,
        0.07042923, 0.02851915, 0.07878366, 0.02995596, 0.06798496,
        0.02878227, 0.07238388, 0.02834811, 0.07018409, 0.02945461,
        0.06937637, 0.02930965, 0.06950932, 0.02811322, 0.06867752,
        0.02877254, 0.07135801, 0.02861009, 0.07596879, 0.02905493,
        0.07254014, 0.02970834, 0.06847043, 0.02909288, 0.07176232,
        0.02889848, 0.07128401, 0.03021598, 0.07185802, 0.03041396,
        0.07037158, 0.02943029,

In [None]:
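from sklearn.tree import DecisionTreeClassifier

vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('grad', GradientBoostingClassifier()),
    # liblinear supports both the l1 and l2 penalties searched below
    ('logreg', LogisticRegression(solver='liblinear'))
])

pipe = Pipeline([
    # The titles are raw text and must be vectorized before voting
    # (assumption: CountVectorizer, as in the first pipeline)
    ('vec', CountVectorizer()),
    ('vote', vote)
])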

pipe_params = {
    'vote__tree__max_depth' : [None, 1, 2],
    'vote__ada__n_estimators' : [40, 50, 60],
    'vote__grad__n_estimators' : [90, 100],
    'vote__logreg__penalty' : ['l1', 'l2'],
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
gs.fit(X_train, y_train)
print(gs.best_score_) # cross val accuracy score
gs.best_params_