## Article Collection and Volume Reduction Pipeline 
1. Find efficient keywords with word embeddings (gensim)
2. Remove duplicate articles with cosine similarity on TF-IDF vectors (scikit-learn)
3. **Remove duplicate articles with entity extraction and Jaccard similarity** (spaCy)
4. Classify relevant articles (scikit-learn)

## Summary
We extract the named entities from each article's summary and compute the Jaccard similarity between the entity sets of every pair of articles. If a pair's similarity is above a threshold, we remove one of the two articles. The threshold varies with the size of the entity set.
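
The thresholding rule described above could be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual implementation: the helper names `size_dependent_threshold` and `drop_near_duplicates`, the cutoff values, and the example entity sets are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0 if either is empty)."""
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def size_dependent_threshold(n_entities):
    # Hypothetical cutoffs: small entity sets overlap by chance more easily,
    # so they must match exactly; larger sets are allowed to differ slightly.
    return 1.0 if n_entities <= 2 else 0.75


def drop_near_duplicates(entity_sets):
    """Greedy pass over (index, entity_set) pairs; returns the indices kept."""
    kept = []
    for index, ents in entity_sets:
        # ">=" is used so that a threshold of 1.0 still catches identical sets.
        is_duplicate = any(
            jaccard(ents, kept_ents)
            >= size_dependent_threshold(min(len(ents), len(kept_ents)))
            for _, kept_ents in kept
        )
        if not is_duplicate:
            kept.append((index, ents))
    return [index for index, _ in kept]


# Made-up entity sets, for illustration only.
example = [
    (0, {"Bosnia", "U.S."}),
    (1, {"U.S.", "Bosnia"}),  # same entities as article 0
    (2, {"Democrats", "Republicans"}),
    (3, {"Bosnia", "U.S.", "NATO", "Sarajevo"}),
    (4, {"Bosnia", "U.S.", "NATO", "Sarajevo", "Serbia"}),  # 4 of 5 shared with article 3
]
print(drop_near_duplicates(example))  # [0, 2, 3]
```

Keeping the first article of each near-duplicate group is one arbitrary choice among several; any rule for picking which article of a pair to drop would fit the same structure.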

In [1]:
import pandas as pd
import numpy as np
import spacy

In [2]:
usecols=['Article_ID', 'Title', 'Summary']
data = pd.read_csv('../data/nyt_ftpg_1996_2006_no_text.csv', engine='python', usecols=usecols)


In [3]:
data.head()

| | Article_ID | Title | Summary |
|---|---|---|---|
| 0 | 1 | Nation's Smaller Jails Struggle To Cope With S... | Jails overwhelmed with hardened criminals |
| 1 | 2 | Dancing (and Kissing) In the New Year | new years activities |
| 2 | 3 | Forbes's Silver Bullet for the Nation's Malaise | Steve Forbes running for President |
| 3 | 4 | Up at Last, Bridge to Bosnia Is Swaying Gatewa... | U.S. military constructs bridge to help their ... |
| 4 | 5 | 2 SIDES IN SENATE DISAGREE ON PLAN TO END FURL... | Democrats and Republicans can't agree on plan ... |


In [4]:
# 'en' is a spaCy v2 shortcut link; spaCy v3 needs a full package name such as 'en_core_web_sm'
nlp = spacy.load('en')

In [5]:
%%time
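# Entity extraction with spaCy, streaming the summaries through nlp.pipe.
# Note: n_threads is deprecated as of spaCy 2.1; later versions offer n_process for parallelism.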

entities = []
for index, doc in enumerate(nlp.pipe(data['Summary'].tolist(), n_threads=-1)):
    doc_ents = set(ent.text for ent in doc.ents)
    if len(doc_ents) > 0:
        entities.append((index, doc_ents))

CPU times: user 1min 45s, sys: 9.5 s, total: 1min 54s
Wall time: 1min 33s


In [6]:
len(entities)

19586

In [7]:
entities[:5]

[(1, {'new years'}),
 (2, {'Steve Forbes'}),
 (3, {'Bosnia', 'U.S.'}),
 (4, {'Democrats', 'Republicans'}),
 (5, {'Interstate Commerce Commission'})]

In [8]:
# Pairwise Jaccard is quadratic in the number of articles, so we run it on the first 5,000 entity sets only
limited_entities = entities[:5000]

## Jaccard Index / Similarity
![jaccard](https://wikimedia.org/api/rest_v1/media/math/render/svg/eaef5aa86949f49e7dc6b9c8c3dd8b233332c9e7)
![jaccardimg](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Intersection_of_sets_A_and_B.svg/400px-Intersection_of_sets_A_and_B.svg.png)
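
As a quick worked example of the index shown above (size of the intersection divided by size of the union), consider two small entity sets. The sets are made up for illustration.

```python
a = {"Bosnia", "U.S.", "NATO"}
b = {"Bosnia", "U.S.", "Serbia"}

intersection = a & b  # the two shared entities: 'Bosnia' and 'U.S.'
union = a | b         # four distinct entities in total
similarity = len(intersection) / len(union)
print(similarity)  # 0.5
```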

In [None]:
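def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|. Returns 0 if either set is empty."""
    if len(a) == 0 or len(b) == 0:
        return 0.0
    else:
        c = a.intersection(b)
        # |a | b| = |a| + |b| - |a & b|
        return float(len(c)) / (len(a) + len(b) - len(c))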

In [None]:
%%time

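# Brute-force pairwise comparison: O(n^2) Jaccard calls in pure Python.
# Only the upper triangle (diagonal included) is computed, row by row.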

n_statements = len(limited_entities)
jc_similarities = []

for i, (ent_ix, ents) in enumerate(limited_entities):
    entities_subset = limited_entities[i:]
    for j, (ent_c_ix, ents_comparison) in enumerate(entities_subset):
        scs = 1 if j == 0 else jaccard(ents, ents_comparison)
        jc_similarities.append(scs)

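# Fill the upper triangle (row-major order matches the loop above),
# then mirror the strictly upper part to get a symmetric matrix.
jc_empty = np.zeros((n_statements, n_statements))
upper_indices = np.triu_indices(n_statements)
jc_empty[upper_indices] = jc_similarities
# k=1 excludes the diagonal so it is not added twice
jaccard_sim_matrix = np.triu(jc_empty, 1).T + jc_empty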


In [None]:
# zero the diagonal so a document is never reported as its own closest match
jaccard_sim_matrix[np.diag_indices(n_statements)] = 0
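
One possible alternative to the Python-level double loop is to encode each entity set as a row of a binary indicator matrix and obtain all intersection sizes with a single matrix product. The sketch below assumes the vocabulary is small enough for a dense matrix (a SciPy sparse matrix would likely be a better fit for large vocabularies); `jaccard_matrix` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np


def jaccard_matrix(entity_sets):
    """Pairwise Jaccard similarities for a list of sets, with a zeroed diagonal."""
    # Map every distinct entity to a column index.
    vocab = {ent: col for col, ent in enumerate(sorted(set().union(*entity_sets)))}
    indicator = np.zeros((len(entity_sets), len(vocab)), dtype=np.int64)
    for row, ents in enumerate(entity_sets):
        for ent in ents:
            indicator[row, vocab[ent]] = 1

    # (i, j) entry = number of entities shared by sets i and j.
    intersection = indicator @ indicator.T
    sizes = indicator.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - intersection

    # Divide only where the union is non-empty; other entries stay 0.
    sim = np.zeros(union.shape, dtype=float)
    np.divide(intersection, union, out=sim, where=union > 0)
    np.fill_diagonal(sim, 0)
    return sim


# Made-up entity sets, for illustration only.
sets = [{"Bosnia", "U.S."}, {"Bosnia", "U.S.", "NATO"}, {"Democrats", "Republicans"}]
sim = jaccard_matrix(sets)
print(sim.round(2))
```

In the notebook's terms, this would be called as `jaccard_matrix([ents for _, ents in limited_entities])` and should give the same symmetric, zero-diagonal matrix as the loop-based version.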

In [None]:
kept_indices = [i[0] for i in limited_entities]
similarities = pd.DataFrame(jaccard_sim_matrix, index=kept_indices, columns=kept_indices)

In [None]:
ix = 3
reference_document_index = limited_entities[ix][0]
print("Ref:", data.loc[reference_document_index, 'Summary'], "\n", sep="\t")

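# the five documents most similar to the reference, by Jaccard similarity
most_similar_indices = similarities[reference_document_index].sort_values(ascending=False).head()
# Series.iteritems() was removed in pandas 2.0; items() works across versions
for index, value in most_similar_indices.items():
    print(round(value, 2), data.loc[index, 'Summary'], sep="\t")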