## 04_model_test

**Description:** Using the pickled production models (random forest and AdaBoost) to predict on unseen headlines.

In [1]:
import pickle
import re

import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

**Importing Vectorizer, Model, and New Data**

In [24]:
df2 = pd.read_csv('../data/new_content.csv')
with open('../pickle/rf_model.pkl', 'rb') as f:
    rf_model = pickle.load(f)
with open('../pickle/ad_model.pkl', 'rb') as f:
    ad_model = pickle.load(f)
with open('../pickle/tfidf_n3.pkl', 'rb') as f:
    tfidf_n3 = pickle.load(f)
with open('../data/nto_update.pkl', 'rb') as f:
    nto_update = pickle.load(f)
with open('../data/to_update.pkl', 'rb') as f:
    to_update = pickle.load(f)

#### Examining New Data

In [3]:
len(nto_update), len(to_update)

(100, 100)

In [4]:
def new_X(json_list):
    """Return the 'title' field of every post in a list of post dicts."""
    return [post['title'] for post in json_list]

In [5]:
new_nto = new_X(nto_update)
new_to = new_X(to_update)

##### Adding The Onion + Not The Onion Posts to DataFrame

In [57]:
df = pd.DataFrame(new_to, columns=['title'])
df['target'] = 0

In [58]:
df_nto = pd.DataFrame(new_nto, columns=['title'])
df_nto['target'] = 1

In [59]:
df = pd.concat([df, df_nto], ignore_index=True, sort=True)

##### Examining New Posts - The Onion

In [60]:
pd.set_option("display.max_colwidth", 200)
df.head(20)

| | target | title |
|---|---|---|
| 0 | 0 | George Lucas said WHAT?! |
| 1 | 0 | Putin Condemns Ukrainian People’s Unprovoked 1,000-Year Occupation Of South Russia |
| 2 | 0 | Tear Gas Manufacturers Worried About Association With Everything Tear Gas Used For |
| 3 | 0 | GM Announces Money Saved From Layoffs To Fund Massive Investment In Lake Homes, Private Jets |
| 4 | 0 | California Camp Fire Fully Contained |
| 5 | 0 | Horrified Nation Wakes Up On Cyber Monday To Find Amazon Echo Devices Embedded Beneath Skin |
| 6 | 0 | Human Slave From Future Remembers When Cyber Monday Was About Celebrating Savings, Not Robot Uprising |
| 7 | 0 | Report: More Travelers Avoiding Long Lines At Airport Thanks To Cinnabon PreCheck Memberships |
| 8 | 0 | Trump Unveils Plan To Address Migrants With New Open-Fire Policy |
| 9 | 0 | The 5 Times Dad Was Irrefutably In The Zone |


##### Examining New Posts - Not The Onion

In [61]:
pd.set_option("display.max_colwidth", 200)
df.tail(20)

| | target | title |
|---|---|---|
| 180 | 1 | Gus Johnson calls out Gus Johnson |
| 181 | 1 | British urgently want the blood of Irish people |
| 182 | 1 | Band fakes a fan base to book UK tour. |
| 183 | 1 | Top Democrat Says Caravan ‘Should Be Allowed To Come In’ Hours Before Mi... |
| 184 | 1 | Some, including an exorcist, are convinced Celine Dion's new children's clothing line is 'demonic' |
| 185 | 1 | Hawaii burger eatery closes after video seems to show rat cooking |
| 186 | 1 | Woman claims GPS directed her to drive on railroad tracks |
| 187 | 1 | Elon Musk considers move to Mars despite 'good chance of death' |
| 188 | 1 | Migrant Caravan STORM US Border Trying To Cross Illegally (Full Compilat... |
| 189 | 1 | Otter eats 3 more koi, evades capture at Chinatown park |


##### Removing Duplicates + Posts with 'Onion' in Them

In [62]:
df.drop_duplicates(inplace=True)

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 0 to 199
Data columns (total 2 columns):
target    182 non-null int64
title     182 non-null object
dtypes: int64(1), object(1)
memory usage: 4.3+ KB


In [64]:
o_index = df[df['title'].str.contains('onion|Onion')].index
df.drop(index=o_index, inplace=True)
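The pattern `'onion|Onion'` would miss other capitalizations such as "ONION". A minimal sketch of a case-insensitive filter, using a hypothetical three-row frame rather than the real data:

```python
import pandas as pd

# Hypothetical titles, used only to illustrate the filter.
toy = pd.DataFrame(
    {"title": ["The Onion wins", "ONION ring recall", "Otter eats koi"]}
)

# case=False matches any capitalization; na=False treats missing titles as non-matches.
mask = toy["title"].str.contains("onion", case=False, na=False)

# Keep only the rows that do not mention "onion" in any form.
kept = toy[~mask]
print(kept["title"].tolist())
```

Whether an all-caps match should also be dropped is a judgment call; the cell above keeps the original behavior.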

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 177 entries, 0 to 199
Data columns (total 2 columns):
target    177 non-null int64
title     177 non-null object
dtypes: int64(1), object(1)
memory usage: 4.1+ KB


##### How Balanced Are the Targets?

In [66]:
df['target'].value_counts()

1    91
0    86
Name: target, dtype: int64

Roughly 51% of the dataset is from "Not the Onion", and the remaining 49% is from "The Onion".
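The percentages can be read directly from pandas instead of being worked out by hand. The sketch below rebuilds a series with the same class counts as the output above (91 and 86); the series itself is a stand-in, not the real `df['target']`.

```python
import pandas as pd

# Stand-in series with the class counts shown above: 1 -> 91 rows, 0 -> 86 rows.
target = pd.Series([1] * 91 + [0] * 86, name="target")

# normalize=True returns each class's share of the rows instead of raw counts.
shares = target.value_counts(normalize=True)

pct_not_the_onion = round(shares.loc[1] * 100, 1)
pct_the_onion = round(shares.loc[0] * 100, 1)
print(pct_not_the_onion, pct_the_onion)
```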

##### Vectorizing with TFIDF

In [67]:
term_mat = tfidf_n3.transform(df['title'])
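The cell above calls `transform` rather than `fit_transform` because the pickled vectorizer was already fitted, so the new matrix must have the same columns the models were trained on. Here is a small sketch of that behavior with a toy vectorizer; the headlines are made up, and `ngram_range=(1, 3)` is only an assumption suggested by the name `tfidf_n3`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the training and unseen headlines (hypothetical data).
train_titles = ["man bites dog", "dog bites man again"]
new_titles = ["cat bites dog", "entirely unseen words"]

# ngram_range=(1, 3) is an assumption based on the pickle's name, not a known setting.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
train_mat = vectorizer.fit_transform(train_titles)

# transform() reuses the fitted vocabulary: the column count is unchanged and
# n-grams that were never seen during fitting are dropped.
new_mat = vectorizer.transform(new_titles)

print(train_mat.shape[1] == new_mat.shape[1])
print(new_mat[1].nnz)  # number of non-zero features in the second new headline
```

A headline made entirely of unseen words becomes an all-zero row, which is one way new data can degrade predictions.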

##### Making Predictions - AdaBoost

In [68]:
ad_model.score(term_mat, df['target'])

0.7062146892655368
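To put the score in context, it helps to convert it back to a count of correct predictions and compare it with a majority-class baseline. This is a rough sketch that only uses numbers already printed above (177 rows, 91 in the larger class).

```python
n_rows = 177                    # rows left after cleaning, from df.info() above
n_majority = 91                 # size of the larger class, from value_counts() above
accuracy = 0.7062146892655368   # ad_model.score output above

# Accuracy is correct / total, so multiplying back recovers the count.
n_correct = round(accuracy * n_rows)
n_wrong = n_rows - n_correct

# Accuracy of a trivial model that always predicts the larger class (1).
baseline = n_majority / n_rows

print(n_correct, n_wrong, round(baseline, 3))
```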

In [69]:
ad_model.predict(term_mat)

array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1])

#### Evaluating Predictions

In [70]:
df['predicted'] = ad_model.predict(term_mat)

In [71]:
order = ['title', 'target', 'predicted']
df = df[order]
df.reset_index(inplace=True)

In [75]:
df.drop(columns='index', inplace=True)

In [88]:
df.iloc[65]

title        Lucky Old Woman Getting Wheeled Around Airport
target                                                    0
predicted                                                 0
Name: 65, dtype: object
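Checking single rows with `iloc` works, but a confusion matrix and a filter for misclassified rows give a fuller picture. The sketch below uses a hypothetical six-row frame with the same `title`, `target`, and `predicted` columns; the values are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical mini-frame with the same columns as df above.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f"],
    "target": [0, 0, 0, 1, 1, 1],
    "predicted": [0, 0, 1, 1, 0, 0],
})

# Rows are true labels, columns are predicted labels, both ordered [0, 1].
cm = confusion_matrix(toy["target"], toy["predicted"], labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Titles the model got wrong, for manual inspection.
misses = toy[toy["target"] != toy["predicted"]]

print(cm)
print(misses["title"].tolist())
```

On the real data, the same two steps would run against `df['target']` and `df['predicted']`.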

**Analysis**: The AdaBoost model classifies about 70.6% of the 177 unseen headlines correctly.

**Conclusion**: Possible next steps to improve the model:

- Remove “American Voices” headlines from The Onion.
- Lemmatize the dataset.

Nonetheless, I'm happy with how the model turned out, and I'm excited to start using this in the field.