## Article Collection and Volume Reduction Pipeline 
1. Find efficient keywords with word embeddings (gensim)
3. **Remove duplicate articles with cosine similarity on TF-IDF vectors** (scikit-learn)
4. Remove duplicate articles with entity extraction and Jaccard similarity (spaCy)
4. Classify relevant articles (scikit-learn)

## Summary
We compute the cosine similarity between the TF-IDF vectors of every pair of article titles. When a pair's similarity exceeds a chosen threshold, we treat the two articles as duplicates and remove one of them.
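
As a minimal sketch of the idea, the example below scores three toy titles and prints the pairs above a cutoff. The titles and the `THRESHOLD` value are invented for illustration and are not taken from the NYT data or from this notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles, invented for illustration (not from the NYT data).
titles = [
    "Senate Passes Budget Bill After Long Debate",
    "Senate Passes Budget Bill After Lengthy Debate",
    "Flu Surge Strains City Hospitals",
]

# TfidfVectorizer L2-normalises each row by default (norm='l2'),
# so the dot product of two rows is their cosine similarity.
vectors = TfidfVectorizer(stop_words='english').fit_transform(titles)
similarity = (vectors @ vectors.T).toarray()

THRESHOLD = 0.5  # hypothetical cutoff, chosen only for this toy example

# Report each unordered pair once (i < j) when it exceeds the cutoff.
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        if similarity[i, j] > THRESHOLD:
            print(f"{similarity[i, j]:.2f}\t{titles[i]}\t{titles[j]}")
```

Only the first two titles share vocabulary, so they are the only pair that can be flagged; the third title has no words in common with the others and scores 0 against both.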

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [2]:
data = (pd
        .read_csv('../data/nyt_ftpg_1996_2006_no_text.csv', engine='python', usecols=['Article_ID', 'Title'])
        .assign(title_length=lambda x: x['Title'].apply(len)))

In [3]:
data.head()

| | Article_ID | Title | title_length |
|---|---|---|---|
| 0 | 1 | Nation's Smaller Jails Struggle To Cope With S... | 62 |
| 1 | 2 | Dancing (and Kissing) In the New Year | 38 |
| 2 | 3 | Forbes's Silver Bullet for the Nation's Malaise | 48 |
| 3 | 4 | Up at Last, Bridge to Bosnia Is Swaying Gatewa... | 59 |
| 4 | 5 | 2 SIDES IN SENATE DISAGREE ON PLAN TO END FURL... | 52 |


In [4]:
# Keep only titles longer than 41 characters.
data_titles = data[data['title_length'] > 41]

In [5]:
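# min_df=2 ignores terms that appear in fewer than two titles.
tfidf = TfidfVectorizer(stop_words='english', min_df=2)
# todense() returns a dense np.matrix, for which `*` is matrix multiplication.
tfidf_array = tfidf.fit_transform(data_titles['Title']).todense()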

In [6]:
%time cosine_similarity = tfidf_array * tfidf_array.T
print(cosine_similarity.shape)

CPU times: user 6min 27s, sys: 10.6 s, total: 6min 38s
Wall time: 1min 43s
(22333, 22333)
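
The matrix product above yields cosine similarities only because `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`). Writing $x_i$ for the TF-IDF row vector of title $i$ and $X$ for the matrix of all rows (notation introduced here for illustration):

$$
\cos(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert_2 \, \lVert x_j \rVert_2} = x_i \cdot x_j
\quad \text{when } \lVert x_i \rVert_2 = \lVert x_j \rVert_2 = 1,
$$

so entry $(i, j)$ of $X X^\top$ is the cosine similarity of titles $i$ and $j$. A title whose words are all stop words or fall below `min_df` becomes an all-zero row, and its similarity to every other title is 0.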


In [7]:
# every document is an exact match with itself
cosine_similarity[np.diag_indices(len(data_titles))] = 0

In [8]:
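# Label both axes with the original row labels so row and column lookups agree.
similarities = pd.DataFrame(cosine_similarity, index=data_titles.index, columns=data_titles.index)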

In [9]:
reference_document_index = 0
print("Ref:", data_titles.loc[reference_document_index, 'Title'], "\n", sep="\t")

most_similar_indices = similarities[reference_document_index].sort_values(ascending=False).head()
for index, value in most_similar_indices.items():  # iteritems() was removed in pandas 2.0
    print(round(value, 2), data_titles.loc[index, 'Title'], sep="\t")

Ref:	Nation's Smaller Jails Struggle To Cope With Surge in Inmates 	

0.33	Smaller Bookstores End Court Struggle Against Two Chains
0.29	As Parents Age, Baby Boomers And Business Struggle to Cope
0.29	New York Hospitals Struggle With Flu Surge
0.28	 Issue in China: Many in Jails Without Trial
0.26	American Jails In Iraq Bursting With Detainees
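
The notebook stops at inspecting neighbours. One possible way to apply the threshold described in the Summary is a greedy pass that keeps a title unless it is too similar to one already kept. The sketch below uses a hypothetical helper `keep_indices` and an invented toy matrix; it is an illustration under those assumptions, not the pipeline's actual removal step.

```python
import numpy as np

def keep_indices(similarity, threshold):
    """Greedy pass: keep a row unless it is too similar to a row already kept."""
    kept = []
    for i in range(similarity.shape[0]):
        # Compare row i only against rows that survived so far.
        if all(similarity[i, j] <= threshold for j in kept):
            kept.append(i)
    return kept

# Toy symmetric similarity matrix with a zeroed diagonal (values invented).
toy = np.array([
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.2],
    [0.1, 0.2, 0.0],
])
print(keep_indices(toy, threshold=0.8))  # [0, 2]
```

The returned values are row positions, so on the real data they would be used with positional indexing, for example `data_titles.iloc[kept]`. The Python loop is quadratic in the worst case and would likely need vectorising for the full 22,333-title matrix.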
