## Article Collection and Volume Reduction Pipeline 
1. Find efficient keywords with word embeddings (gensim)
2. **Remove duplicate articles with cosine similarity on TF-IDF vectors** (scikit-learn)
3. Remove duplicate articles with entity extraction and Jaccard similarity (spaCy)
4. Classify relevant articles (scikit-learn)

## Summary
We compute the cosine similarity between the TF-IDF vectors of every pair of article titles. If the similarity of a pair exceeds a threshold, we treat the two articles as duplicates and remove one of them.
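The thresholding step can be sketched end to end on a few toy titles. This is only an illustration: the titles are made up, the `threshold` of 0.8 is an assumed cut-off rather than a value taken from this notebook, and the default tokenizer stands in for the spaCy lemmatizer used further down.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles (hypothetical, not from the NYT data set)
titles = [
    "Senate votes to end budget furlough",
    "Senate votes to end the budget furlough",
    "Bridge to Bosnia opens at last",
]

# TfidfVectorizer L2-normalizes each row by default, so the dot product
# of two rows is their cosine similarity.
vectors = TfidfVectorizer().fit_transform(titles)
similarity = (vectors @ vectors.T).toarray()

threshold = 0.8  # assumed cut-off; tune on real data

# Look only above the diagonal so each pair is examined once,
# and mark the later title of every pair above the threshold.
rows, cols = np.where(np.triu(similarity, k=1) > threshold)
to_drop = set(cols.tolist())

kept = [title for i, title in enumerate(titles) if i not in to_drop]
print(kept)
```

Dropping the later member of each pair is one simple policy; other choices (keeping the longer title, for example) would fit the same structure.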

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import spacy

In [2]:
data = (pd
        .read_csv('../data/nyt_ftpg_1996_2006_no_text.csv', engine='python', usecols=['Article_ID', 'Title'])
        # count the spaces in each title as a proxy for its word count
        .assign(title_words=lambda x: x['Title'].str.count(' ')))

In [3]:
data.head()

,Article_ID,Title,title_words
0,1,Nation's Smaller Jails Struggle To Cope With S...,10
1,2,Dancing (and Kissing) In the New Year,7
2,3,Forbes's Silver Bullet for the Nation's Malaise,7
3,4,"Up at Last, Bridge to Bosnia Is Swaying Gatewa...",11
4,5,2 SIDES IN SENATE DISAGREE ON PLAN TO END FURL...,10


In [4]:
# keep only titles with more than 6 spaces (roughly seven or more words)
data_titles = data[data['title_words'] > 6]
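Because `title_words` counts spaces rather than words, stray whitespace in a title changes the count. The sketch below compares the space count with a whitespace-split word count on hypothetical variants of one title. The `word_count` alternative is a suggestion, not what this notebook uses.

```python
import pandas as pd

# Hypothetical variants of one title: clean, trailing space, double space
titles = pd.Series([
    "Dancing (and Kissing) In the New Year",
    "Dancing (and Kissing) In the New Year ",
    "Dancing (and  Kissing) In the New Year",
])

space_count = titles.str.count(' ')         # what the notebook computes
word_count = titles.str.split().str.len()   # whitespace-delimited tokens

print(pd.DataFrame({'space_count': space_count, 'word_count': word_count}))
```

Switching to the split-based count would shift which titles pass the length filter, so the threshold would need to be revisited.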

In [5]:
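# the 'en' shortcut was removed in spaCy 3; this assumes the small English model is installed
nlp = spacy.load('en_core_web_sm')
def lemmatizer(text):
    # lemmatize each token, dropping stop words and punctuation
    return [token.lemma_ for token in nlp(text) if not (token.is_stop or token.is_punct)]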

In [6]:
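# min_df=3 ignores lemmas that appear in fewer than three titles
tfidf = TfidfVectorizer(min_df=3, analyzer=lemmatizer)
# .todense() returns a numpy matrix, so `*` below is matrix multiplication
tfidf_array = tfidf.fit_transform(data_titles['Title']).todense()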

In [7]:
pd.DataFrame(tfidf_array, columns=tfidf.get_feature_names_out())[['nation', 'jails', 'dancing', 'bullet']].head()

,nation,jails,dancing,bullet
0,0.218469,0.426234,0.0,0.0
1,0.0,0.0,0.807545,0.0
2,0.251103,0.0,0.0,0.518494
3,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0


In [8]:
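# TfidfVectorizer L2-normalizes each row, so the product of the matrix with its transpose is the pairwise cosine similarity
%time cosine_similarity = tfidf_array * tfidf_array.T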
print(cosine_similarity.shape)

CPU times: user 3min 28s, sys: 4.88 s, total: 3min 33s
Wall time: 46.4 s
(19615, 19615)
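A dense 19615 x 19615 matrix of 64-bit floats occupies roughly 3 GB. One possible alternative, sketched here on hypothetical stand-in titles with the default tokenizer, is to keep the TF-IDF matrix sparse and multiply it directly. The sketch also checks the unit-length property that lets a plain dot product serve as cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in titles; the notebook uses the NYT titles instead
titles = [
    "Jails struggle to cope with surge in inmates",
    "Towns turn to inmates for odd jobs",
    "Smaller bookstores end court struggle",
]

# scipy sparse matrix; rows are L2-normalized by default (norm='l2')
sparse_tfidf = TfidfVectorizer().fit_transform(titles)

# Every row has unit length, so no extra division by norms is needed
row_norms = np.sqrt(sparse_tfidf.multiply(sparse_tfidf).sum(axis=1))
print(np.allclose(row_norms, 1.0))

# Sparse product: only non-zero similarities are stored
sparse_similarity = sparse_tfidf @ sparse_tfidf.T

# Same numbers as the dense product
dense_tfidf = sparse_tfidf.toarray()
dense_similarity = dense_tfidf @ dense_tfidf.T
print(np.allclose(sparse_similarity.toarray(), dense_similarity))
```

How much memory and time this saves depends on how many title pairs share at least one term, so it is worth measuring on the real data before switching.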


In [9]:
# zero the diagonal: every document matches itself exactly, which would otherwise top every ranking
cosine_similarity[np.diag_indices(len(data_titles))] = 0

In [10]:
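# rows carry the original article index; columns keep positional labels 0..n-1,
# so the two only line up while no earlier title has been filtered out
similarities = pd.DataFrame(cosine_similarity, index=data_titles.index)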

In [11]:
def highlight_nonzero(val):
    color = 'yellow' if val > 0 else 'white'
    return f'background-color: {color}'

similarities.loc[:10,:10].style.applymap(highlight_nonzero)

,0,1,2,3,4,5,6,7,8,9,10
0,0.0,0.0,0.118228,0.0349719,0.0289165,0.0,0.0,0.0,0.0460681,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0387648,0.0,0.0,0.0,0.0,0.367494,0.0788027
2,0.118228,0.0,0.0,0.0803917,0.0,0.0,0.0,0.0,0.0570019,0.0,0.0
3,0.0349719,0.0,0.0803917,0.0,0.0,0.0352836,0.0233534,0.0,0.0314574,0.0420533,0.0
4,0.0289165,0.0387648,0.0,0.0,0.0,0.0,0.0,0.0,0.0260106,0.0,0.0
5,0.0,0.0,0.0,0.0352836,0.0,0.0,0.0244428,0.0617879,0.0,0.0440152,0.0
6,0.0,0.0,0.0,0.0233534,0.0,0.0244428,0.0,0.0,0.0,0.0291326,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0617879,0.0,0.0,0.0,0.0,0.0
8,0.0460681,0.0,0.0570019,0.0314574,0.0260106,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.367494,0.0,0.0420533,0.0,0.0440152,0.0291326,0.0,0.0,0.0,0.150297


In [12]:
reference_document_index = 0
print("Ref:", data_titles.loc[reference_document_index, 'Title'], "\n", sep="\t")

most_similar_indices = similarities[reference_document_index].sort_values(ascending=False).head()
for index, value in most_similar_indices.items():
    print(round(value, 2), data_titles.loc[index, 'Title'], sep="\t")

Ref:	Nation's Smaller Jails Struggle To Cope With Surge in Inmates 	

0.31	H.M.O.'s Cope With a Backlash on Cost Cutting 
0.31	No Need to Stew: A Few Tips To Cope With Life's Annoyances
0.27	As Parents Age, Baby Boomers And Business Struggle to Cope
0.25	Smaller Bookstores End Court Struggle Against Two Chains
0.23	 Towns With Odd Jobs Galore Turn to Inmates
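The lookup above works for reference document 0 because its label and its position coincide. For an arbitrary reference document, a helper that converts the label to a position first may be safer. The function `most_similar` and the small matrix below are hypothetical illustrations, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def most_similar(similarity, labels, ref_label, k=5):
    """Return the k labels most similar to ref_label as a Series of scores.

    similarity: square array whose rows and columns follow the order of `labels`.
    labels: index labels (for example, data_titles.index).
    """
    labels = pd.Index(labels)
    position = labels.get_loc(ref_label)            # label -> row position
    scores = np.asarray(similarity)[position].ravel().copy()
    scores[position] = -np.inf                      # never return the reference itself
    top = np.argsort(scores)[::-1][:k]              # positions of the k highest scores
    return pd.Series(scores[top], index=labels[top])

# Hypothetical 4x4 similarity matrix whose labels skip 2 (as after filtering)
sim = np.array([
    [1.0, 0.1, 0.7, 0.3],
    [0.1, 1.0, 0.2, 0.0],
    [0.7, 0.2, 1.0, 0.4],
    [0.3, 0.0, 0.4, 1.0],
])
labels = [0, 1, 3, 4]

result = most_similar(sim, labels, ref_label=3, k=2)
print(result)
```

With the notebook's objects, a call along the lines of `most_similar(cosine_similarity, data_titles.index, ref_label, k=5)` would return labels that can be passed straight to `data_titles.loc`.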
