## Article Collection and Volume Reduction Pipeline 
1. Find efficient keywords with word embeddings (gensim)
2. Remove duplicate articles with cosine similarity on TF-IDF vectors (scikit-learn)
3. Remove duplicate articles with entity extraction and Jaccard similarity (spaCy)
4. **Classify relevant articles** (scikit-learn)

## Summary
Using the article title and summary, we build a TF-IDF matrix and use it as the feature set for a logistic regression that predicts whether an article is about the war on terror. In practice, we tune the probability cutoff on a validation dataset to improve recall, accepting some loss of precision.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np

In [2]:
usecols = ['Article_ID', 'Title', 'Summary', 'War on Terror']
data = (pd
        .read_csv('./nyt_ftpg_1996_2006_no_text.csv', engine='python', usecols=usecols)
        .assign(text=lambda x: x['Title'] + ' ' + x['Summary'])
    )

In [3]:
data.head()

|   | Article_ID | Title | Summary | War on Terror | text |
|---|---|---|---|---|---|
| 0 | 1 | Nation's Smaller Jails Struggle To Cope With S... | Jails overwhelmed with hardened criminals | 0 | Nation's Smaller Jails Struggle To Cope With S... |
| 1 | 2 | Dancing (and Kissing) In the New Year | new years activities | 0 | Dancing (and Kissing) In the New Year new yea... |
| 2 | 3 | Forbes's Silver Bullet for the Nation's Malaise | Steve Forbes running for President | 0 | Forbes's Silver Bullet for the Nation's Malais... |
| 3 | 4 | Up at Last, Bridge to Bosnia Is Swaying Gatewa... | U.S. military constructs bridge to help their ... | 0 | Up at Last, Bridge to Bosnia Is Swaying Gatewa... |
| 4 | 5 | 2 SIDES IN SENATE DISAGREE ON PLAN TO END FURL... | Democrats and Republicans can't agree on plan ... | 0 | 2 SIDES IN SENATE DISAGREE ON PLAN TO END FURL... |


In [4]:
data['War on Terror'].value_counts(normalize=True)

0    0.863795
1    0.136205
Name: War on Terror, dtype: float64

In [5]:
tfidf = TfidfVectorizer(stop_words='english', min_df=2)
tfidf_array = tfidf.fit_transform(data['text'])


In [6]:
X = tfidf_array
y = data['War on Terror'].values

tts = train_test_split(X, y, test_size=0.33, random_state=666)
X_train, X_test, y_train, y_test = tts

In [7]:
lr = LogisticRegressionCV()
lr.fit(X_train, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

In [8]:
y_pred = lr.predict(X_test)
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.97      0.98      0.98      8845
          1       0.89      0.80      0.84      1397

avg / total       0.96      0.96      0.96     10242



In [9]:
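# Lower the cutoff from the default 0.5 so that more articles are flagged,
# trading precision for recall on the positive class.
threshold = 0.05
y_pred_threshold = lr.predict_proba(X_test)[:, 1] > threshold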
print(classification_report(y_test, y_pred_threshold))

             precision    recall  f1-score   support

          0       0.99      0.93      0.96      8845
          1       0.68      0.96      0.79      1397

avg / total       0.95      0.93      0.94     10242
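
The cutoff of 0.05 above is set by hand. As a minimal sketch of how such a cutoff could be chosen from a recall target instead, the block below uses scikit-learn's `precision_recall_curve` on synthetic scores. The helper `pick_threshold`, the `target_recall` value, and the synthetic `y_val` / `scores` arrays are assumptions made for illustration and are not part of the notebook. Note also that `precision_recall_curve` treats a score as positive when it is `>=` the threshold, whereas the cell above uses `>`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def pick_threshold(y_true, scores, target_recall=0.95):
    """Return (threshold, precision, recall) for the highest threshold
    whose recall is still at least target_recall (hypothetical helper)."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; the extra
    # final point (recall = 0) has no threshold, so drop it.
    ok = recall[:-1] >= target_recall
    # Recall never increases as the threshold grows, so the last
    # qualifying index is the highest qualifying threshold.
    idx = np.flatnonzero(ok)[-1]
    return thresholds[idx], precision[idx], recall[idx]


# Synthetic stand-in for validation labels and predicted probabilities.
rng = np.random.default_rng(0)
y_val = (rng.random(2000) < 0.14).astype(int)
scores = np.clip(0.6 * y_val + rng.normal(0.2, 0.15, size=2000), 0, 1)

thr, p, r = pick_threshold(y_val, scores, target_recall=0.95)
print(f"threshold={thr:.3f} precision={p:.3f} recall={r:.3f}")
```

With the real model, `scores` would be `lr.predict_proba(X_val)[:, 1]` on a held-out validation split, which keeps the test set untouched for the final report.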



In [10]:
print("confusion matrix")
pd.crosstab(pd.Series(y_test, name='actual'),
            pd.Series(y_pred_threshold, name='predicted'))

confusion matrix


| actual / predicted | False | True |
|---|---|---|
| 0 | 8208 | 637 |
| 1 | 56 | 1341 |


In [11]:
# Fraction of test articles flagged as relevant at this threshold
sum(y_pred_threshold) / len(y_test)

0.1931263425112283
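
The flagged fraction can be cross-checked against the confusion matrix. The sketch below recomputes recall, precision, and the share of articles kept for review directly from the four counts reported above; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix at threshold = 0.05.
tn, fp = 8208, 637   # actual 0: predicted False / True
fn, tp = 56, 1341    # actual 1: predicted False / True

total = tn + fp + fn + tp
recall = tp / (tp + fn)        # share of relevant articles that are kept
precision = tp / (tp + fp)     # share of kept articles that are relevant
flagged = (tp + fp) / total    # share of all articles kept for review

print(f"total={total} recall={recall:.3f} "
      f"precision={precision:.3f} flagged={flagged:.3f}")
```

In other words, at this cutoff roughly one article in five is kept, while about 96% of the relevant articles survive the filter.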