<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web Scraping: Modeling Notebook

_Authors: Patrick Wales-Dinan_

---

This lab was incredibly challenging. We had to extensively clean a data set that was missing a lot of values and had a large amount of categorical data. Then we had to decide which features to use to model that data. After that we had to build and fit the models, making decisions such as whether to use polynomial features or dummy variables, and whether to log-scale the features or the dependent variable.

After that we had to rerun our model over and over again, looking at the different values of $\beta$ to see whether they contributed to the predictive power of the model. We had to decide whether to throw those features out or leave them in. We also had to make judgment calls about whether our model appeared to be overfitting or suffering from bias.

## Contents:
- [Data Import](#Data-Import)
- [Baseline Accuracy](#Calculate-the-Baseline-Accuracy)
- [Train Test Split](#Train-Test-Split-Our-Data)
- [Log Scaling](#Log-Scaling-Independent-Variables)
- [Cleaning the Data and Modifying the Data](#Cleaning-&-Creating-the-Data-Set)
- [Modeling the Data](#Modeling-the-Data)
- [Model Analysis](#Analyzing-the-model)

Please visit the Graphs & Relationships notebook for additional visuals: [Here](/Users/pwalesdi/Desktop/GA/GA_Project_2/Project_2_Graphs_&_Relationships.ipynb)


In [6]:
import requests
import time
import pandas as pd
import numpy as np
import seaborn as sns
import copy

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import confusion_matrix
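from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # replaces the old sklearn.feature_extraction.stop_words module
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer, which newer scikit-learn releases removed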

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## Data Import

In [2]:
df_reddit = pd.read_csv('./reddit.csv')

## Calculate the Baseline Accuracy

In [3]:
# Baseline accuracy: always predicting the majority class (0) would be correct about 51% of the time.
df_reddit['ca'].value_counts(normalize=True)

0    0.511458
1    0.488542
Name: ca, dtype: float64
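
The baseline above is just the share of the most common class. Here is a minimal sketch of the same calculation in plain Python. The `labels` list is made up for illustration and is not the Reddit data.

```python
from collections import Counter

# Hypothetical labels standing in for df_reddit['ca'] (not the real data).
labels = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]

# Count each class and pick the most frequent one.
counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A model that always predicts the majority class gets this accuracy.
baseline_accuracy = majority_count / len(labels)

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any model we fit should beat this number to be worth keeping.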

## Train Test Split Our Data

In [8]:
X = df_reddit['title']
y = df_reddit['ca']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=79)
pipe = Pipeline([
            ('vec', CountVectorizer()),
            ('model', LogisticRegression())
])



In [5]:
pipe_params = {
    'vec' : [CountVectorizer(), TfidfVectorizer()],
    'vec__max_features': [1500, 2000, 2500, 2700],
    'vec__min_df': [2, 3, 4],
#     'vec__max_df': [0.5, .60, .70],
    'vec__ngram_range': [(1,2), (1,1)],
    'model' : [LogisticRegression(), LogisticRegression(penalty='l1', solver='liblinear'), LogisticRegression(penalty='l2', solver='liblinear'), MultinomialNB()]
#     'vec__stop_words': ['english']
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5, verbose=1, n_jobs=2)
gs.fit(X_train, y_train)

print(f' Best Parameters: {gs.best_params_}')
print('')
print(f' Cross Validation Accuracy Score: {gs.best_score_}')
print(f' Training Data Accuracy Score: {gs.score(X_train, y_train)}')
print(f' Testing Data Accuracy Score: {gs.score(X_test, y_test)}')

Fitting 5 folds for each of 192 candidates, totalling 960 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  60 tasks      | elapsed:    5.0s
[Parallel(n_jobs=2)]: Done 360 tasks      | elapsed:   18.8s
[Parallel(n_jobs=2)]: Done 860 tasks      | elapsed:   44.5s


 Best Parameters: {'model': MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True), 'vec': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1500, min_df=3,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), 'vec__max_features': 1500, 'vec__min_df': 3, 'vec__ngram_range': (1, 1)}

 Cross Validation Accuracy Score: 0.9243055555555556
 Training Data Accuracy Score: 0.9729166666666667
 Testing Data Accuracy Score: 0.9229166666666667


[Parallel(n_jobs=2)]: Done 960 out of 960 | elapsed:   49.3s finished
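
The "192 candidates, totalling 960 fits" line in the log follows directly from the size of `pipe_params`. The sketch below redoes that arithmetic. The counts in `grid_sizes` are the lengths of the lists in the grid above, and the dictionary itself is only an illustration.

```python
from itertools import product

# Number of options per hyperparameter, mirroring the list lengths in pipe_params.
grid_sizes = {
    "vec": 2,                # CountVectorizer, TfidfVectorizer
    "vec__max_features": 4,  # 1500, 2000, 2500, 2700
    "vec__min_df": 3,        # 2, 3, 4
    "vec__ngram_range": 2,   # (1, 2), (1, 1)
    "model": 4,              # three LogisticRegression variants and MultinomialNB
}
cv_folds = 5

# Enumerate every combination, the way an exhaustive grid search does.
n_candidates = sum(1 for _ in product(*(range(n) for n in grid_sizes.values())))
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}")
print(f"Cross-validation fits: {n_fits}")
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set, which is not part of the 960.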


In [None]:
gs.cv_results_['param_model']

In [None]:
# Rebuild the vectorizer with the best settings found by the grid search above.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=1500,
                             ngram_range=(1, 1),
                             min_df=3)
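
To make it clearer what the vectorizer does to a title, here is a rough pure-Python sketch of bag-of-words counting with a `min_df` cutoff. It is a simplified stand-in and not scikit-learn's implementation: `tokenize`, `build_vocabulary`, `count_vector`, and the example titles are all hypothetical, and tie-breaking under `max_features` may differ from `CountVectorizer`.

```python
import re
from collections import Counter

# Same default token pattern that appears in the grid search output:
# lowercase words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(docs, min_df=1, max_features=None):
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for doc in docs:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Drop terms that appear in fewer than min_df documents.
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Keep the most frequent terms if max_features is set.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

def count_vector(doc, vocabulary):
    counts = Counter(tokenize(doc))
    return [counts[t] for t in vocabulary]

# Made-up titles, not taken from reddit.csv.
titles = [
    "California wildfire update",
    "Texas senate vote update",
    "California senate race",
]
vocab = build_vocabulary(titles, min_df=2)
print(vocab)
print(count_vector("California senate senate news", vocab))
```

Words outside the vocabulary (here "news") are simply ignored, which is why the model only ever sees a fixed-length row of counts.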

In [9]:
# Fitting LogisticRegression directly on the raw titles raises
# "ValueError: could not convert string to float", because the model needs numeric features.
# Chaining the CountVectorizer defined above with the model lets `lr` accept raw text.
lr = Pipeline([
    ('vec', vectorizer),
    ('model', LogisticRegression())
])
lr.fit(X_train, y_train)
logreg = lr.named_steps['model']
print(f'Intercept: {logreg.intercept_}')
print('')
print(f'Coefficient: {logreg.coef_}')
print('')
print(f'Exponentiated Coefficient: {np.exp(logreg.coef_)}')
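
The reason for exponentiating the coefficients is that, in logistic regression, $e^{\beta}$ is the factor by which the odds of class 1 are multiplied when a feature goes up by one. The sketch below checks this with made-up numbers. The `intercept` and `coef` values are hypothetical and do not come from a fitted model.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Hypothetical values, not taken from the fitted model.
intercept = -0.5
coef = 0.8  # coefficient for a single word-count feature

p_absent = sigmoid(intercept)          # the word does not appear (count = 0)
p_present = sigmoid(intercept + coef)  # the word appears once (count = 1)

# The ratio of the two odds equals exp(coef).
odds_ratio = odds(p_present) / odds(p_absent)
print(f"Odds ratio: {odds_ratio:.4f}")
print(f"exp(coef):  {math.exp(coef):.4f}")
```

So an exponentiated coefficient above 1 means the word pushes a title toward class 1, and a value below 1 pushes it toward class 0.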

In [None]:
print(f'Logreg predicted values: {lr.predict(X_train.head())}')
print(f'Logreg predicted probabilities: {lr.predict_proba(X_train.head())}')


In [None]:
preds = lr.predict(X_test)
confusion_matrix(y_test, # True values.
                 preds)  # Predicted values.


In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

In [None]:
spec = tn / (tn + fp)

print(f'Specificity: {round(spec,4)}')

In [None]:
sens = tp / (tp + fn)

print(f'Sensitivity: {round(sens,4)}')
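
As a quick check on the two formulas above, here is a small worked example with a made-up confusion matrix. The counts are hypothetical. The layout assumes binary labels 0 and 1, with true classes as rows and predicted classes as columns, which is how `confusion_matrix` arranges them.

```python
# Hypothetical confusion matrix: rows are true classes, columns are predictions.
cm = [[50, 10],   # true 0: 50 predicted 0 (tn), 10 predicted 1 (fp)
      [5, 35]]    # true 1: 5 predicted 0 (fn), 35 predicted 1 (tp)

(tn, fp), (fn, tp) = cm

specificity = tn / (tn + fp)  # share of true 0s we got right
sensitivity = tp / (tp + fn)  # share of true 1s we got right
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"Specificity: {round(specificity, 4)}")
print(f"Sensitivity: {round(sensitivity, 4)}")
print(f"Accuracy: {round(accuracy, 4)}")
```

Looking at both rates alongside accuracy shows whether the model's mistakes are concentrated in one class.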

In [None]:
# vote = VotingClassifier([
#     ('tree', DecisionTreeClassifier()),
#     ('ada', AdaBoostClassifier()),
#     ('grad', GradientBoostingClassifier()),
#     ('logreg', LogisticRegression())
# ])

# pipe = Pipeline([
#     ('vote', vote)
# ])

# pipe_params = {
#     'vote__tree__max_depth' : [None, 1, 2],
#     'vote__ada__n_estimators' : [40, 50, 60],
#     'vote__grad__n_estimators' : [90, 100],
#     'vote__logreg__penalty' : ['l1', 'l2'],
# }

# gs = GridSearchCV(pipe, param_grid=pipe_params, cv=3)
# gs.fit(X_train, y_train)
# print(gs.best_score_) # cross val accuracy score
# gs.best_params_