## 04_model_test

**Description:** Using the production logistic regression model to predict on unseen headlines.

In [2]:
import numpy as np
import pandas as pd
import pickle
import re
from sklearn.neighbors import KNeighborsClassifier
# GridSearchCV lives in sklearn.model_selection; the old sklearn.grid_search module was removed.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

**Importing Vectorizer, Model, and New Data**

In [9]:
with open('../data/new_content.csv', 'rb') as f:
    df = pd.read_csv(f)
with open('../pickle/tfidf_2.pkl', 'rb') as f:
    tfidif_2 = pickle.load(f)
with open('../pickle/log_model.pkl', 'rb') as f:
    log_model = pickle.load(f)

**Examining New Data**

In [17]:
pd.set_option("display.max_colwidth", 200)
df

| Unnamed: 0 | title | target |
|---|---|---|
| 0 | MTA Official Too Nervous To Tell Commuters Waiting For Train That Service Shut Down Permanently An Hour Ago | 0 |
| 1 | ‘Rock The Caliphate’ Charity Concert Features U2, Ed Sheeran, Dua Lipa Coming Together To Raise Money For Struggling Islamic State | 0 |
| 2 | Deformed, Half-Feathered Audubon Society President Flees Into Forest After Injecting Self With Bird DNA | 0 |
| 3 | New Study Confirms This Didn’t Even Feel Like A 4-Day Work Week | 0 |
| 4 | Cory Booker Expelled From Senate, Stripped Naked, Forced To Wander Maryland Bog In Woe For All Eternity | 0 |
| 5 | Fox Business Host Calls Former President George W. Bush a 'Radical' Liberal | 1 |
| 6 | Russian State Media Accuses Anime of Promoting Child Suicide | 1 |
| 7 | Dangerous drug trend called 'wasping' combines insecticide with meth | 1 |
| 8 | Nokia 9 Smartphone Packing Five Rear Camera Lens Image Leaked | 1 |
| 9 | 'Remodelling the lizard people's lair': Denver airport trolls conspiracy theorists | 1 |


In [19]:
term_mat = tfidif_2.transform(df['title'])

**Predicting on New Data**

In [21]:
log_model.predict_proba(term_mat)

array([[0.5455661 , 0.4544339 ],
       [0.73684455, 0.26315545],
       [0.77122721, 0.22877279],
       [0.95058404, 0.04941596],
       [0.84775819, 0.15224181],
       [0.65895784, 0.34104216],
       [0.29002262, 0.70997738],
       [0.27191088, 0.72808912],
       [0.4416558 , 0.5583442 ],
       [0.44680134, 0.55319866]])

In [22]:
log_model.predict(term_mat)

array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
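
The predictions can be scored against the `target` column directly. The sketch below uses plain Python and values copied from the outputs above, so it runs without the pickled model. The variable names are illustrative, and it assumes the default behaviour of a binary `LogisticRegression`, where `predict` picks whichever class has the larger probability.

```python
# Class probabilities copied from the predict_proba output above
# (column 0 = class 0, column 1 = class 1).
proba = [
    (0.5455661, 0.4544339),
    (0.73684455, 0.26315545),
    (0.77122721, 0.22877279),
    (0.95058404, 0.04941596),
    (0.84775819, 0.15224181),
    (0.65895784, 0.34104216),
    (0.29002262, 0.70997738),
    (0.27191088, 0.72808912),
    (0.4416558, 0.5583442),
    (0.44680134, 0.55319866),
]

# True labels copied from the `target` column of the new data.
target = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Pick the more probable class for each row (equivalent to a 0.5 cut-off).
predicted = [int(p1 > p0) for p0, p1 in proba]

# Fraction of rows where the prediction matches the target.
accuracy = sum(p == t for p, t in zip(predicted, target)) / len(target)

# Row indices the model got wrong.
misclassified = [i for i, (p, t) in enumerate(zip(predicted, target)) if p != t]

print(predicted)      # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(accuracy)       # 0.9
print(misclassified)  # [5]
```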

**Analysis**: I'm impressed with 90% accuracy (9 of 10 correct) on this new sample, though it is within what I'd expect from a test accuracy of ~81.7%. With only ten headlines, a random new sample should score this well fairly often by chance.
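
How often a ten-headline sample would score this well can be estimated with a binomial calculation. This is a rough sketch that assumes each headline is classified correctly, independently of the others, with probability equal to the ~81.7% test accuracy. The function name `prob_at_least` is illustrative.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: per-headline accuracy equals the reported test accuracy.
test_accuracy = 0.817
p_nine_plus = prob_at_least(9, 10, test_accuracy)

print(f"P(at least 9 of 10 correct) = {p_nine_plus:.3f}")
```

Under those assumptions the probability comes out at roughly 0.43, so a 90% result on ten headlines is consistent with the test accuracy rather than evidence of better performance.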

Even at a glance, each of the new headlines resembles the ones I built my model with, at least in the sense that the Onion headlines are much more verbose. I wonder whether headline length strongly influences the prediction, given the difference in average character length between Onion and non-Onion posts. I don't yet know enough about how vectorizers work to say whether length feeds into the TF-IDF values, but I would like to take some time in the future to study this and find out.
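
The length gap is easy to check on the ten new headlines. The sketch below uses plain Python and assumes `target` 0 marks The Onion, which matches the rows shown earlier. The helper `length_summary` is illustrative. Note that scikit-learn's `TfidfVectorizer` L2-normalizes each row by default, so raw length is not a feature in itself, although a longer headline does activate more terms. Whether `tfidif_2` was fitted with that default is not shown in this notebook.

```python
from statistics import mean

# (title, target) pairs copied from the new data; 0 is assumed to mean The Onion.
headlines = [
    ("MTA Official Too Nervous To Tell Commuters Waiting For Train That Service Shut Down Permanently An Hour Ago", 0),
    ("‘Rock The Caliphate’ Charity Concert Features U2, Ed Sheeran, Dua Lipa Coming Together To Raise Money For Struggling Islamic State", 0),
    ("Deformed, Half-Feathered Audubon Society President Flees Into Forest After Injecting Self With Bird DNA", 0),
    ("New Study Confirms This Didn’t Even Feel Like A 4-Day Work Week", 0),
    ("Cory Booker Expelled From Senate, Stripped Naked, Forced To Wander Maryland Bog In Woe For All Eternity", 0),
    ("Fox Business Host Calls Former President George W. Bush a 'Radical' Liberal", 1),
    ("Russian State Media Accuses Anime of Promoting Child Suicide", 1),
    ("Dangerous drug trend called 'wasping' combines insecticide with meth", 1),
    ("Nokia 9 Smartphone Packing Five Rear Camera Lens Image Leaked", 1),
    ("'Remodelling the lizard people's lair': Denver airport trolls conspiracy theorists", 1),
]


def length_summary(rows):
    """Return {label: (mean characters, mean whitespace-separated words)}."""
    summary = {}
    for label in sorted({lab for _, lab in rows}):
        titles = [title for title, lab in rows if lab == label]
        summary[label] = (
            mean(len(title) for title in titles),
            mean(len(title.split()) for title in titles),
        )
    return summary


summary = length_summary(headlines)
for label, (chars, words) in summary.items():
    print(f"target={label}: {chars:.1f} characters, {words:.1f} words on average")
```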

I would be interested to know what factors contributed to the one incorrect prediction my model made (row 5, the Fox Business headline, which was predicted as class 0). My guess is that a term like "Fox" appears far more often within the Onion posts, as the news source is frequently poked fun at.

**Conclusion**: I'm very confident that this model can be used to accurately differentiate between Satirical and Sensational headlines (so long as those Satirical headlines come from The Onion). That said, I do think it will be worthwhile to eventually refine this model by tightening the preprocessing and tuning steps as follows:

- Remove “American Voices” headlines from The Onion (these headlines are closer to Sensational headlines in terms of their length).
- Consider increasing the n-gram range to (1, 3).
- Further tune the Random Forest hyperparameters.
- Lemmatize the dataset.
- Consider tuning SVD to eliminate fewer features.
- Apply SVD to the multi-ngram vector.
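
Two of these items, the wider n-gram range and SVD on the multi-ngram matrix, could be wired together roughly as follows. This is a sketch on a toy corpus, not the configuration used in the earlier notebooks. The step names, the made-up headlines, and `n_components=2` are all assumptions chosen so the example runs on four rows. On real data, `n_components` is the setting to raise in order to eliminate fewer features.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical refinement pipeline: unigrams through trigrams, then SVD, then the classifier.
refined = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),  # tiny only because the toy corpus is tiny
    ("clf", LogisticRegression()),
])

# Toy stand-ins (hypothetical data, NOT the notebook's training set).
toy_titles = [
    "Area Man Bravely Decides To Eat Entire Cake Alone",
    "Nation Demands Fox Host Explain Himself To Local Dog",
    "Police arrest man after airport security breach",
    "Study finds new drug trend spreading among teens",
]
toy_labels = [0, 0, 1, 1]

refined.fit(toy_titles, toy_labels)
preds = refined.predict(toy_titles)
print(preds)
```

Because every step lives in one `Pipeline`, the same object could later be passed to `GridSearchCV` with parameter names such as `tfidf__ngram_range` and `svd__n_components`.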

Nonetheless, I'm happy with how the model turned out, and I'm excited to start using this in the field.