<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web Scraping: Modeling Notebook

_Authors: Patrick Wales-Dinan_

---

This lab was incredibly challenging. We had to extensively clean a data set that was missing a lot of values and had TONS of categorical data. Then we had to decide which features to use to model that data. After that, we had to build and fit the models, making decisions such as whether to use polynomial features or dummy variables, and whether to log-scale the features or the dependent variable.

After that, we had to rerun our model over and over again, looking at the different values of $\beta$ to see whether they were contributing to the predictive power of the model. We had to decide whether to throw those features out or leave them in. We also had to make judgment calls about whether our model appeared to be overfitting or suffering from bias.

## Contents:
- [Data Import](#Data-Import)
- [Baseline Accuracy](#Calculate-the-Baseline-Accuracy)
- [Train Test Split](#Train-Test-Split-Our-Data)
- [Log Scaling](#Log-Scaling-Independent-Variables)
- [Cleaning the Data and Modifying the Data](#Cleaning-&-Creating-the-Data-Set)
- [Modeling the Data](#Modeling-the-Data)
- [Model Analysis](#Analyzing-the-model)

Please visit the Graphs & Relationships notebook for additional visuals: [Here](/Users/pwalesdi/Desktop/GA/GA_Project_2/Project_2_Graphs_&_Relationships.ipynb)


In [1]:
import requests
import time
import pandas as pd
import numpy as np
import seaborn as sns
import copy

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
# Stop words come from a custom list named `stop_words`, not from sklearn
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB, MultinomialNB

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Data Import

In [2]:
df_reddit = pd.read_csv('./reddit.csv')

## Calculate the Baseline Accuracy

In [3]:
# Baseline accuracy: always predicting the majority class (0) would be correct about 51% of the time.
df_reddit['is_ca'] = df_reddit['ca']
df_reddit['is_ca'].value_counts(normalize=True)

0    0.511458
1    0.488542
Name: is_ca, dtype: float64
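The baseline above is the accuracy of a model that always predicts the most common class. A minimal sketch of that calculation, using a small made-up label series (`labels` is hypothetical and only stands in for `df_reddit['is_ca']`):

```python
import pandas as pd

# Hypothetical labels standing in for df_reddit['is_ca'] (1 = CA post, 0 = other).
labels = pd.Series([0, 1, 0, 1, 0, 0, 1, 0], name="is_ca")

# Share of each class; the largest share is the accuracy obtained by
# always predicting the majority class.
class_shares = labels.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should beat this number on the held-out data.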

## Train Test Split Our Data

In [None]:
df_reddit.loc[(df_reddit['ca'] == 1)]['title']

In [4]:
X = df_reddit['title']
y = df_reddit['is_ca']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=79)
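Passing `stratify=y` keeps the class balance the same in the training and testing sets. A small sketch on made-up data (the `toy` frame and its titles are hypothetical) shows one way to check this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 titles, 20 per class.
toy = pd.DataFrame({
    "title": [f"post {i}" for i in range(40)],
    "is_ca": [0, 1] * 20,
})

# Same call pattern as the notebook: default 75/25 split, stratified on the label.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["title"], toy["is_ca"], stratify=toy["is_ca"], random_state=79
)

# With stratification, both splits keep the 50/50 class balance of the toy data.
print(f"Train share of class 1: {y_tr.mean():.2f}")
print(f"Test share of class 1:  {y_te.mean():.2f}")
```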

In [19]:
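# Custom stop-word list (the same list is used for the CountVectorizer and word cloud below)
stop_words = ['to', 'the', 'in', 'of', 'for', 'and', 'on', 'is', 'it', 'with', 'what', 'about', 'are', 'as', 'from', 'at', 'will', 'that', 'says', 'by', 'be', 'this', 'can', 'has', 'how', 'california', 'texas']

pipe_params = {
    'vec': [CountVectorizer(), TfidfVectorizer()],
    'vec__max_features': [1300, 1500],
    'vec__min_df': [2, 3, 4],
#     'vec__max_df': [0.5, .60, .70],
    'vec__ngram_range': [(1, 2), (1, 1)],
    'model': [
        LogisticRegression(),
        LogisticRegression(penalty='l1', solver='liblinear'),
        LogisticRegression(penalty='l2', solver='liblinear'),
        MultinomialNB(),
    ],
    'vec__stop_words': [frozenset(stop_words)]
}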

def model_analysis(X, y, **pipe_params):
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=79)
    pipe = Pipeline([
            ('vec', CountVectorizer()),
            ('model', LogisticRegression())])

    gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5, verbose=1, n_jobs=2)
    gs.fit(X_train, y_train)

    print(f' Best Parameters: {gs.best_params_}')
    print('')
    print(f' Cross Validation Accuracy Score: {gs.best_score_}')
    print(f' Training Data Accuracy Score: {gs.score(X_train, y_train)}')
    print(f' Testing Data Accuracy Score: {gs.score(X_test, y_test)}')

In [20]:
model_analysis(X, y, **pipe_params)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    3.2s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:   14.0s
[Parallel(n_jobs=2)]: Done 446 tasks      | elapsed:   32.3s
[Parallel(n_jobs=2)]: Done 480 out of 480 | elapsed:   35.5s finished


 Best Parameters: {'model': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True), 'vec': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1300, min_df=2,
        ngram_range=(1, 2), preprocessor=None,
        stop_words=frozenset({'in', 'from', 'be', 'is', 'on', 'will', 'california', 'this', 'as', 'to', 'are', 'of', 'says', 'for', 'about', 'by', 'and', 'that', 'can', 'what', 'it', 'has', 'texas', 'with', 'the', 'at', 'how'}),
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), 'vec__max_features': 1300, 'vec__min_df': 2, 'vec__ngram_range': (1, 2), 'vec__stop_words': frozenset({'in', 'from', 'be', 'is', 'on', 'will', 'california', 'this', 'as', 'to', 'are', 'of', 'says', 'for', 'about', 'by', 'and', 'that', 'can', 'what', 'it', 'has', 'texas', 'with', 'the', 'at', 'how'})}

 Cross Validat
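The grid above swaps whole pipeline steps (the vectorizer and the model) in addition to tuning their settings. A minimal sketch of that pattern on made-up titles (the `titles`, `labels`, and `grid` names are hypothetical, and the grid is much smaller than the notebook's):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four titles per class.
titles = [
    "beach surf sun", "surf beach coast", "sun coast beach", "coast surf sun",
    "ranch cattle oil", "oil ranch rodeo", "rodeo cattle ranch", "cattle oil rodeo",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("vec", CountVectorizer()), ("model", LogisticRegression())])

# Using a step name as a grid key replaces the whole step with each candidate object.
grid = {
    "vec": [CountVectorizer(), TfidfVectorizer()],
    "model": [LogisticRegression(), MultinomialNB()],
}

search = GridSearchCV(pipe, param_grid=grid, cv=2)
search.fit(titles, labels)

# One row per vectorizer/model combination, with its mean cross-validated accuracy.
results = pd.DataFrame(search.cv_results_)[["param_vec", "param_model", "mean_test_score"]]
print(results)
```

Looking at the full `cv_results_` table, rather than only `best_params_`, shows how close the runner-up combinations were.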

In [16]:
stop_words = ['to', 'the', 'in', 'of', 'for', 'and', 'on', 'is', 'it', 'with', 'what', 'about', 'are', 'as', 'from', 'at', 'will', 'that', 'says', 'by', 'be', 'this', 'can', 'has', 'how', 'california', 'texas']
vectorizer = CountVectorizer(tokenizer = None,
                            preprocessor = None,
                            stop_words = frozenset(stop_words),
                            max_features = 1500,
                            ngram_range= (1,1),
                            analyzer = 'word',
                            min_df=3) 
vectorizer.fit(X_train)
X_train = vectorizer.transform(X_train)
X_test = vectorizer.transform(X_test)
X_train_df = pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names())
X_test_df = pd.DataFrame(X_test.toarray(), columns=vectorizer.get_feature_names())
y_train_df = pd.DataFrame(y_train)


In [None]:
X_train_df.sum().sort_values(ascending=False)

In [None]:
X_test_df.sum().sort_values(ascending=False)

In [None]:
# Drop the original index so the labels line up with X_train_df by position
y_train_df = y_train_df.reset_index(drop=True)
y_train_df

In [None]:
# print(X_train_df.index)
# print(y_train_df.reset_index().index)

corr = pd.concat([X_train_df, y_train_df], axis=1)

In [None]:
corr.corr()[['is_ca']].sort_values('is_ca', ascending=False).head(100)

In [None]:
lr = LogisticRegression()
lr.fit(X_train_df, y_train)
print(lr.score(X_train_df, y_train))
print(lr.score(X_test_df, y_test))
print(f'Intercept: {lr.intercept_}')
print('')
print(f'Coefficient: {lr.coef_}')
print('')
print(f'Exponentiated Coefficient: {np.exp(lr.coef_)}')
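Exponentiating a logistic regression coefficient gives an odds ratio: the factor by which the odds of the positive class change when that feature increases by one. A small sketch on made-up word counts (the `beach` and `ranch` columns and all values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical word-count features; "beach" appears only in class-1 posts,
# "ranch" only in class-0 posts.
X_toy = pd.DataFrame({
    "beach": [2, 1, 0, 0, 1, 0, 3, 0],
    "ranch": [0, 0, 1, 2, 0, 1, 0, 2],
})
y_toy = [1, 1, 0, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X_toy, y_toy)

# Odds ratio per feature: above 1 pushes toward class 1, below 1 toward class 0.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X_toy.columns)
print(odds_ratios.sort_values(ascending=False))
```

Pairing each odds ratio with its feature name makes the output much easier to read than the raw coefficient array.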


In [None]:
print(f'Logreg predicted values: {lr.predict(X_train_df.head())}')
print(f'Logreg predicted probabilities: {lr.predict_proba(X_train_df.head())}')


In [None]:
preds = lr.predict(X_test_df)
confusion_matrix(y_test, # True values.
                 preds)  # Predicted values.


In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

In [None]:
spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

In [None]:
sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')
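Specificity and sensitivity are both read off the four cells of the confusion matrix. A self-contained sketch with made-up labels (`y_true` and `y_pred` are hypothetical) shows the calculation end to end:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = CA post).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)  # share of actual negatives predicted as negative
sensitivity = tp / (tp + fn)  # share of actual positives predicted as positive

print(f"Specificity: {specificity:.2f}")
print(f"Sensitivity: {sensitivity:.2f}")
```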

In [None]:
coef_df = pd.DataFrame({'variables': X_train_df.columns})
print(lr.coef_.shape)
# Exponentiated coefficients are odds ratios for each word
coe = pd.DataFrame({'ß - Beta': np.squeeze(np.exp(lr.coef_))})
coef_df = pd.concat([coef_df, coe], axis=1)
# Pair the training features with the training labels, aligned by position
values = X_train_df.copy()
values['CA_Post'] = y_train.reset_index(drop=True)
values.head()

In [None]:
coef_df.sort_values('ß - Beta', ascending=False)

In [None]:
from os import path
import scipy
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from scipy.ndimage import zoom
cali_mask = np.array(Image.open("./download.png"))

In [None]:
def transform_format(val):
    if val == 0:
        return 255
    else:
        return val
transformed_cali_mask = np.ndarray((cali_mask.shape[0],cali_mask.shape[1]), np.int32)

for i in range(len(cali_mask)):
    transformed_cali_mask[i] = list(map(transform_format, cali_mask[i]))


In [None]:
im_small = zoom(transformed_cali_mask, (2.25))

In [None]:
# Join the titles of all posts from the CA subreddit into one string
text = " ".join(post for post in df_reddit.loc[df_reddit['ca'] == 1, 'title'])
# Split on whitespace so that we count words rather than characters
print("There are {} words in all CA posts.".format(len(text.split())))

In [None]:
wordcloud = WordCloud(stopwords=stop_words, max_font_size=30, max_words=500, background_color="white", mask=im_small, contour_color='grey', contour_width=0.5).generate(text)
plt.figure(figsize= [8,23])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
results = pd.DataFrame(lr.predict(X_test_df), columns=['predicted'])

# Create column for observed values.
y_test = y_test.reset_index()
y_test.head()
results['actual'] = y_test['is_ca']

In [None]:
results.head()

In [None]:
row_ids = results[results['predicted'] != results['actual']].index

In [None]:
row_ids

In [22]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier


In [23]:
vote = VotingClassifier([
    ('tree', DecisionTreeClassifier()),
    ('ada', AdaBoostClassifier()),
    ('grad', GradientBoostingClassifier()),
    ('logreg', LogisticRegression(solver='liblinear'))  # liblinear supports both the l1 and l2 penalties searched below
])

pipe = Pipeline([
    ('vote', vote)
])

pipe_params = {
    'vote__tree__max_depth' : [None, 1, 2],
    'vote__ada__n_estimators' : [40, 50, 60],
    'vote__grad__n_estimators' : [90, 100],
    'vote__logreg__penalty' : ['l1', 'l2'],
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
gs.fit(X_train, y_train)
print(gs.best_score_) # cross val accuracy score
gs.best_params_

0.7756944444444445

{'vote__ada__n_estimators': 40,
 'vote__grad__n_estimators': 90,
 'vote__logreg__penalty': 'l2',
 'vote__tree__max_depth': 2}