<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web Scraping

_Authors: Patrick Wales-Dinan_

---

This lab was incredibly challenging. We had to extensively clean a data set that was missing a lot of values and had a large amount of categorical data. Then we had to decide which features to use to model that data. After that we had to build and fit the models, making decisions such as whether to use polynomial features or dummy variables, and whether to log-scale the features or the dependent variable.

After that we had to rerun our model over and over again, looking at the different values of $\beta$ to see whether they were contributing to the predictive power of the model. We had to decide whether to throw those values out or leave them in. We also had to make judgement calls about whether our model appeared to be overfitting or suffering from bias.

## Contents:
- [Data Import](#Data-Import)
- [Feature Creation](#Feature-Creation)
- [Choosing the Features](#Feature-Choice)
- [Log Scaling](#Log-Scaling-Independent-Variables)
- [Cleaning the Data and Modifying the Data](#Cleaning-&-Creating-the-Data-Set)
- [Modeling the Data](#Modeling-the-Data)
- [Model Analysis](#Analyzing-the-model)

Please visit the Graphs & Relationships notebook for additional visuals: Notebook - [Here](/Users/pwalesdi/Desktop/GA/GA_Project_2/Project_2_Graphs_&_Relationships.ipynb)


In [53]:
import requests
import time
import pandas as pd
import numpy as np
import seaborn as sns
import copy

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
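from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # replaces the removed sklearn.feature_extraction.stop_words module
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer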

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [47]:
tx_url = 'https://www.reddit.com/r/TexasPolitics.json'
ca_url = 'https://www.reddit.com/r/California_Politics.json'

In [3]:
# headers = {'User-agent' : 'pat bot 0.1'}
# tx_res = requests.get(tx_url, headers=headers)
# wa_res = requests.get(wa_url, headers=headers)
# print(f' TX Status: {tx_res.status_code}')
# print(f' WA Status: {wa_res.status_code}')
# tx_json = tx_res.json()
# wa_json = wa_res.json()

 TX Status: 200
 WA Status: 200


In [34]:
# tx_res.text

In [35]:
# tx_json['data'].keys()

In [36]:
# tx_json['data']['children'][0]['data']

In [37]:
# tx_json['data']['after']

In [9]:
# params = {'after' : 't3_c6m277'}

In [11]:
# requests.get(tx_url, params=params, headers=headers)

<Response [200]>

In [27]:
def get_posts(url):
    # Unique user agent so that Reddit accepts the requests
    user_agent = {'User-agent' : 'pat bot 0.1'}
    
    # Posts collected across all pages
    posts = []
    
    # `after` is the pagination cursor; None requests the first page
    after = None
    
    for i in range(60):
        print(i)
        # If no `after` value came back, this request starts again from the
        # first page; the resulting duplicates are removed later
        params = {} if after is None else {'after' : after}
        res = requests.get(url, params=params, headers=user_agent)
        if res.status_code == 200:
            data = res.json()['data']
            posts.extend(data['children'])
            after = data['after']
        else:
            # Report the status of the failing request and stop
            print(res.status_code)
            break
        # Pause between requests to avoid hammering the API
        time.sleep(4)
    return posts
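
The cursor logic in `get_posts` can be exercised without touching the network. The sketch below is a minimal illustration, assuming a hypothetical `fake_fetch` function and made-up page data that only mimic the `data` / `children` / `after` shape of the listing JSON used above. Unlike `get_posts`, it stops as soon as no `after` cursor is returned.

```python
# Hypothetical in-memory "listing": maps a cursor to one page of results.
# The cursors and post names are invented for illustration only.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}},
                                 {'data': {'name': 't3_b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}],
                      'after': None}},
}


def fake_fetch(after=None):
    """Stand-in for requests.get(...).json(); returns one page for a cursor."""
    return FAKE_PAGES[after]


def collect_posts(fetch, max_pages=60):
    """Follow the `after` cursor until it runs out or max_pages is reached."""
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch(after)['data']
        posts.extend(page['children'])
        after = page['after']
        if after is None:
            # No further pages: stop instead of starting over from page one
            break
    return posts


names = [post['data']['name'] for post in collect_posts(fake_fetch)]
print(names)
```

Stopping on an empty cursor avoids re-pulling the first page, at the cost of returning fewer than `max_pages` pages when the listing is short.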

In [29]:
wa_posts = get_posts(wa_url)
tx_posts = get_posts(tx_url)

0
1
...
59
0
1
...
59


In [48]:
ca_posts = get_posts(ca_url)

0
1
...
59


In [112]:
len(ca_posts)
len(set([p['data']['name'] for p in ca_posts]))

937

In [117]:
len(tx_posts)
len(set([p['data']['name'] for p in tx_posts]))

983
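
Counting the unique `name` values shows how many distinct posts the pulls contain. As a minimal sketch with made-up records (the `raw_posts` list is hypothetical, shaped like the `children` entries above), a dictionary keyed on `name` keeps the first copy of each post while preserving order:

```python
# Made-up records in the same nested shape as the scraped `children` entries
raw_posts = [
    {'data': {'name': 't3_a', 'title': 'first'}},
    {'data': {'name': 't3_b', 'title': 'second'}},
    {'data': {'name': 't3_a', 'title': 'first'}},  # repeated pull
]

unique = {}
for post in raw_posts:
    record = post['data']
    # setdefault only stores the record the first time a name is seen
    unique.setdefault(record['name'], record)

deduped = list(unique.values())
print(len(raw_posts), len(deduped))
```

The resulting list of dictionaries could be passed straight to `pd.DataFrame`; calling `drop_duplicates(subset='name')` on an already built frame would be an alternative route to the same result.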

In [118]:
ca_post_new = []
ca_post_names = set()
for post_dict in ca_posts:
    keep_data = post_dict['data']
    if keep_data['name'] not in ca_post_names:
        ca_post_new.append(keep_data)
        ca_post_names.add(keep_data['name'])
df_ca = pd.DataFrame(ca_post_new)
df_ca.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,BlankVerse,,,,[],,,,text,t2_97a3,False,,,False,False,,False,,False,1562634000.0,1562606000.0,,,,,thetrace.org,0,False,0,{},False,False,canr0y,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_canr0y,False,16,0,,False,,/r/California_Politics/comments/canr0y/the_nra...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,41,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7729,public,,https://b.thumbs.redditmedia.com/5y-0hwerp_6jF...,93.0,140.0,The NRA Opposes A California Gun Regulation It...,0,41,https://www.thetrace.org/rounds/california-rea...,[],,False,,
1,[],False,,,False,travadera,,,,[],,,,text,t2_10ukzyn2,False,,,False,False,,False,,False,1562636000.0,1562607000.0,,,,,latimes.com,0,False,0,{},False,False,cao5lo,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_cao5lo,False,3,0,,False,,/r/California_Politics/comments/cao5lo/ca15_er...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,15,,{},,,False,False,False,California_Politics,t5_357go,r/California_Politics,7729,public,,https://b.thumbs.redditmedia.com/EXvjRI9EHJh4w...,78.0,140.0,[CA-15] Eric Swalwell is expected to withdraw ...,0,15,https://www.latimes.com/politics/la-na-pol-202...,[],,False,,
2,[],False,,,False,BlankVerse,,,,[],,,,text,t2_97a3,False,,,False,False,,False,,False,1562642000.0,1562613000.0,,,,,cnn.com,0,False,0,{},False,False,capegr,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_capegr,False,2,0,,False,,/r/California_Politics/comments/capegr/eric_sw...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,6,,{},,,True,False,False,California_Politics,t5_357go,r/California_Politics,7729,public,,https://b.thumbs.redditmedia.com/pKle1lzNaOWeA...,78.0,140.0,Eric Swalwell expected to end presidential bid...,0,6,https://www.cnn.com/2019/07/08/politics/eric-s...,[],,False,,
3,[],False,,,False,Admiral_Red_Wings,,,,[],,,,text,t2_29oepqqb,False,,,False,False,,False,,False,1562648000.0,1562619000.0,,,,,politico.com,0,False,0,{},False,False,caqq3c,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_caqq3c,True,0,0,,False,,/r/California_Politics/comments/caqq3c/eric_sw...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,1,,{},,,False,False,False,California_Politics,t5_357go,r/California_Politics,7729,public,,https://b.thumbs.redditmedia.com/HUYb9u5408zqO...,93.0,140.0,"Eric Swalwell ends White House bid, citing low...",0,1,https://www.politico.com/story/2019/07/08/swal...,[],,False,,
4,[],False,,,False,travadera,,,,[],,,,text,t2_10ukzyn2,False,,,False,False,,False,,False,1562585000.0,1562556000.0,,,,,theatlantic.com,0,False,0,{},False,False,cafu29,False,False,False,False,True,False,False,,,,[],,dark,text,False,,{},False,,,,[],t3_cafu29,False,9,0,,False,,/r/California_Politics/comments/cafu29/tom_ste...,False,link,{'images': [{'source': {'url': 'https://extern...,,False,,,False,15,,{},,,False,False,False,California_Politics,t5_357go,r/California_Politics,7729,public,,https://b.thumbs.redditmedia.com/-zW0yrIqKre44...,72.0,140.0,Tom Steyer Is Telling Allies He’s Running for ...,0,15,https://www.theatlantic.com/politics/archive/2...,[],,False,,


In [121]:
tx_post_new = []
tx_post_names = set()
for post_dict in tx_posts:
    keep_data = post_dict['data']
    if keep_data['name'] not in tx_post_names:
        tx_post_new.append(keep_data)
        tx_post_names.add(keep_data['name'])
df_tx = pd.DataFrame(tx_post_new)
df_tx.head()

Unnamed: 0,all_awardings,allow_live_comments,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,banned_at_utc,banned_by,can_gild,can_mod_post,category,clicked,content_categories,contest_mode,created,created_utc,crosspost_parent,crosspost_parent_list,discussion_type,distinguished,domain,downs,edited,gilded,gildings,hidden,hide_score,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,likes,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media,media_embed,media_metadata,media_only,mod_note,mod_reason_by,mod_reason_title,mod_reports,name,no_follow,num_comments,num_crossposts,num_reports,over_18,parent_whitelist_status,permalink,pinned,pwls,quarantine,removal_reason,report_reasons,saved,score,secure_media,secure_media_embed,selftext,selftext_html,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,[],False,,,False,arcanition,,3,[],17553cd2-9c63-11e7-b44c-0e30f0006cb4,3rd District (Northern Dallas Suburbs),dark,text,t2_5d5mc,False,,,False,False,,False,,False,1559800000.0,1559771000.0,,,,moderator,self.TexasPolitics,0,1.55984e+09,0,{},False,False,bx8cik,False,False,False,False,True,True,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_bx8cik,False,22,0,,False,,/r/TexasPolitics/comments/bx8cik/welcome_new_r...,False,,False,,,False,12,,{},"Hey all,\n\nAfter much time reading applicatio...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,True,TexasPolitics,t5_2t47s,r/TexasPolitics,5412,public,,,Welcome New /r/TexasPolitics Moderators - Q&amp;A,0,12,https://www.reddit.com/r/TexasPolitics/comment...,[],,False,,
1,[],True,,,False,Texas_Monthly,,verified,[],,Verified - Texas Monthly,dark,text,t2_3x7xx9qc,False,,,False,False,,False,,False,1561003000.0,1560974000.0,,,,,self.TexasPolitics,0,1.56107e+09,0,{},False,False,c2lven,False,False,False,False,True,True,False,,,ama,[],b8855642-9c62-11e7-ae9f-0e71ceb054c0,AMA,dark,text,False,,{},,False,,,,[],t3_c2lven,False,243,0,,False,,/r/TexasPolitics/comments/c2lven/im_chris_hook...,False,,False,,,False,82,,{},"Hey, r/TexasPolitics! I’m Chris Hooks, a write...","&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",True,False,True,TexasPolitics,t5_2t47s,r/TexasPolitics,5412,public,qa,,"I’m Chris Hooks, a Texas Monthly writer who wo...",0,82,https://www.reddit.com/r/TexasPolitics/comment...,[],,False,,
2,[],False,,,False,SoggyFlakes4US,,,[],,,,text,t2_3762m0pi,False,,,False,False,,False,,False,1562619000.0,1562590000.0,t3_cak0yu,"[{'approved_at_utc': None, 'subreddit': 'polit...",,,texastribune.org,0,False,0,{},False,False,cakl8x,False,False,False,False,True,False,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_cakl8x,False,90,0,,False,,/r/TexasPolitics/comments/cakl8x/texas_is_goin...,False,,False,,,False,59,,{},,,True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5412,public,,,Texas is going to court to end Obamacare. It h...,0,59,https://www.texastribune.org/2019/07/08/texas-...,[],,False,,
3,[],False,,,False,ComfortAarakocra,,33,[],4694488a-9c63-11e7-989b-0e888b735072,33rd District (West Dallas),dark,text,t2_npxl0zp,False,,,False,False,,False,,False,1562616000.0,1562587000.0,t3_cajeq9,"[{'approved_at_utc': None, 'subreddit': 'polit...",,,theguardian.com,0,False,0,{},False,False,cak4u1,False,False,False,False,True,False,False,,,analysis,[],ae3e8d52-9c62-11e7-9ea1-0ec0fd94671e,Analysis,dark,text,False,,{},,False,,,,[],t3_cak4u1,False,69,0,,False,,/r/TexasPolitics/comments/cak4u1/protesters_as...,False,,False,,,False,45,,{},,,True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5412,public,,,'Protesters as terrorists': growing number of ...,0,45,https://www.theguardian.com/environment/2019/j...,[],,False,,
4,[],False,,,False,Philo1927,,,[],,,,text,t2_iwi67,False,,,False,False,,False,,False,1562634000.0,1562606000.0,,,,,latimes.com,0,False,0,{},False,False,canq67,False,False,False,False,True,False,False,,,,[],,,dark,text,False,,{},,False,,,,[],t3_canq67,False,3,0,,False,,/r/TexasPolitics/comments/canq67/death_at_the_...,False,,False,,,False,16,,{},,,True,False,False,TexasPolitics,t5_2t47s,r/TexasPolitics,5412,public,,,"Death at the border: 4 from Guatemala, 3 of th...",0,16,https://www.latimes.com/world/la-fg-guatemala-...,[],,False,,


In [122]:
df_tx = df_tx[['subreddit', 'title', 'num_comments']]


In [123]:
df_tx.shape

(983, 3)

In [119]:
df_ca = df_ca[['subreddit', 'title', 'num_comments']]


In [120]:
df_ca.shape

(937, 3)

In [124]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
df_reddit = pd.concat([df_ca, df_tx])

In [131]:
df_reddit.head(5)

Unnamed: 0,subreddit,title,num_comments
0,California_Politics,The NRA Opposes A California Gun Regulation It...,16
1,California_Politics,[CA-15] Eric Swalwell is expected to withdraw ...,3
2,California_Politics,Eric Swalwell expected to end presidential bid...,2
3,California_Politics,"Eric Swalwell ends White House bid, citing low...",0
4,California_Politics,Tom Steyer Is Telling Allies He’s Running for ...,9


In [132]:
df_reddit['ca'] = df_reddit['subreddit'].map({'California_Politics':1,
                                                 'TexasPolitics':0})


In [134]:
df_reddit.drop(labels='subreddit', axis=1, inplace=True)

In [135]:
df_reddit

Unnamed: 0,title,num_comments,ca
0,The NRA Opposes A California Gun Regulation It...,16,1
1,[CA-15] Eric Swalwell is expected to withdraw ...,3,1
2,Eric Swalwell expected to end presidential bid...,2,1
3,"Eric Swalwell ends White House bid, citing low...",0,1
4,Tom Steyer Is Telling Allies He’s Running for ...,9,1
5,California's Governor is Asking Trump for Emer...,27,1
6,State Promises to Rebuild: Ridgecrest Will Not...,4,1
7,California's Politically Powerful Unions Aim T...,10,1
8,How California made a 'dramatic' impact on kin...,0,1
9,New state budget a windfall for unions,15,1


In [136]:
df_reddit['ca'].value_counts(normalize=True)

0    0.511979
1    0.488021
Name: ca, dtype: float64
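
The class shares above give the baseline a classifier has to beat: always predicting the majority class (Texas, `ca == 0`) would be right about 51% of the time. A minimal sketch of that calculation, assuming class counts of 983 and 937 taken from the two frame shapes reported earlier:

```python
from collections import Counter

# Labels rebuilt from the reported row counts: 983 Texas (0), 937 California (1)
labels = [0] * 983 + [1] * 937

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)

print(majority_class, round(baseline_accuracy, 6))
```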

In [149]:
X = df_reddit['title']
y = df_reddit['ca']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=79)
pipe = Pipeline([
            ('cvec', CountVectorizer()),
            ('logreg', LogisticRegression())
])
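
`CountVectorizer` turns each title into token counts before the logistic regression sees it. The following is a simplified, standard-library sketch of that idea, not scikit-learn's implementation: it lowercases the text, keeps tokens of two or more word characters, and counts unigrams plus bigrams, which corresponds to an `ngram_range` of `(1, 2)`. The helper `count_ngrams` and the example title are made up for illustration.

```python
import re
from collections import Counter


def count_ngrams(text, ngram_range=(1, 2)):
    """Count word n-grams in `text` for every n in the inclusive range."""
    # Lowercase and keep runs of two or more word characters as tokens
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            counts[' '.join(tokens[start:start + n])] += 1
    return counts


title_counts = count_ngrams('Texas budget vote: Texas budget passes')
print(title_counts['texas budget'], title_counts['budget'], title_counts['vote'])
```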



In [181]:
pipe_params = {
    'cvec__max_features': [2500, 2550, 2600],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.70, .80],
    'cvec__ngram_range': [(1,2), (1,1)],
#     'logreg__C' : [0.1, 1, 10, 10_000],
#     'cvec__stop_words': ['english']
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)

gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_



0.9083333333333333


{'cvec__max_df': 0.7,
 'cvec__max_features': 2500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2)}
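
`GridSearchCV` tries every combination in `pipe_params`, so the cost of the search grows multiplicatively with each added option. The small sketch below counts the candidates for the grid above using only the standard library. With `cv=5` each candidate is fit once per fold; the final refit on the full training set is not counted here.

```python
from itertools import product

# Same active options as the grid above (commented-out entries omitted)
pipe_params = {
    'cvec__max_features': [2500, 2550, 2600],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.70, .80],
    'cvec__ngram_range': [(1, 2), (1, 1)],
}
cv_folds = 5

# One candidate per element of the Cartesian product of all option lists
candidates = [dict(zip(pipe_params, combo))
              for combo in product(*pipe_params.values())]

print(len(candidates), len(candidates) * cv_folds)
```

Re-enabling the commented-out `logreg__C` list with its four values would multiply both numbers by four.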

In [182]:
gs.score(X_train, y_train)

0.9986111111111111

In [183]:
gs.score(X_test, y_test)

0.93125