## Article Collection and Volume Reduction Pipeline 
1. **Find efficient keywords with word embeddings** (gensim)
2. Remove duplicitous articles with cosine similarity on TFIDF vectors (scikit-learn)
3. Remove duplicitous articles with entity extraction and jaccard similarity (spacy)
4. Classify relevant articles (scikit-learn)

## Summary
Using both the title and summary as input text, we create a word embedding model and query it for similar words to 'election'

In [1]:
import pandas as pd
from gensim.utils import simple_tokenize
from gensim.models.word2vec import Word2Vec

In [2]:
usecols = ['Article_ID', 'Title', 'Summary']
data = pd.read_csv('../data/nyt_ftpg_1996_2006_no_text.csv', engine='python', usecols=usecols)


In [3]:
data.head()

Unnamed: 0,Article_ID,Title,Summary
0,1,Nation's Smaller Jails Struggle To Cope With S...,Jails overwhelmed with hardened criminals
1,2,Dancing (and Kissing) In the New Year,new years activities
2,3,Forbes's Silver Bullet for the Nation's Malaise,Steve Forbes running for President
3,4,"Up at Last, Bridge to Bosnia Is Swaying Gatewa...",U.S. military constructs bridge to help their ...
4,5,2 SIDES IN SENATE DISAGREE ON PLAN TO END FURL...,Democrats and Republicans can't agree on plan ...


In [4]:
tokenized_title = data['Title'].apply(lambda x: list(simple_tokenize(x.lower()))).tolist()
tokenized_summary = data['Title'].apply(lambda x: list(simple_tokenize(x.lower()))).tolist()

tokenized_text = tokenized_title + tokenized_summary

In [5]:
w2v_model = Word2Vec(sentences=tokenized_text)

In [6]:
w2v_model.wv.most_similar('election')

[('mccain', 0.9658215045928955),
 ('republican', 0.956897497177124),
 ('recall', 0.9452403783798218),
 ('memo', 0.9370315670967102),
 ('texas', 0.9364608526229858),
 ('victory', 0.9356061816215515),
 ('republicans', 0.9334008097648621),
 ('showdown', 0.9322413206100464),
 ('battle', 0.9317525029182434),
 ('homestretch', 0.9315381646156311)]