# Cosine Similarity

This was one of my minimum viable project investigations: which documents have the lowest cosine similarity
to the rest of the budget?

In [45]:
import pandas as pd
import numpy as np 

from pathlib import Path

from gensim import corpora

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [46]:
from budget_corpus import read_raw_corpus, read_documents

raw_corpus = read_raw_corpus()
corpus = read_documents()

In [47]:
# build the token <-> id mapping from the tokenized documents
dictionary = corpora.Dictionary(corpus)

2019-02-28 15:35:12,349 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-02-28 15:35:12,429 : INFO : built Dictionary(4392 unique tokens: ['acquisition', 'aircraft', 'authorize', 'capital', 'derive']...) from 1248 documents (total 69715 corpus positions)


In [48]:
tokened_corpus = [dictionary.doc2bow(tokens) for tokens in corpus ]

# Data prep

Use TF-IDF to normalize and LSI to reduce dimensionality.

In [60]:
# dimensionality reduction
from gensim.models import TfidfModel, LsiModel

# first convert words to tfidf values
tfidf = TfidfModel(dictionary=dictionary, normalize=True)
vectored_corpus = [ tfidf[doc] for doc in tokened_corpus]
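
To make the weighting step concrete, here is a minimal plain-Python sketch of what TF-IDF with unit-length normalization does to a bag of words. It assumes the scheme I understand to be gensim's default (raw term frequency times `log2(N / df)`, then L2 normalization); the toy documents and the `tfidf_vector` helper are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical, not from the budget corpus).
docs = [
    ["funds", "act", "funds"],
    ["funds", "secretary"],
    ["aircraft", "acquisition"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf_vector(doc):
    """Return {token: weight} using tf * log2(N / df), scaled to unit L2 length."""
    counts = Counter(doc)
    weights = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
    # Tokens that occur in every document get weight zero and are dropped.
    weights = {t: w for t, w in weights.items() if w > 0.0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

vectors = [tfidf_vector(doc) for doc in docs]
```

In the first toy document, "act" ends up with more weight than "funds" even though "funds" occurs twice, because "funds" also appears in another document. That is the effect the normalization is meant to have on boilerplate budget vocabulary.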

In [65]:
# next do dimensionality reduction
lsi = LsiModel(corpus=vectored_corpus, num_topics=200, id2word=dictionary)

2019-02-28 15:37:19,975 : INFO : using serial LSI version on this node
2019-02-28 15:37:19,976 : INFO : updating model with new documents
2019-02-28 15:37:19,976 : INFO : preparing a new chunk of documents
2019-02-28 15:37:19,994 : INFO : using 100 extra samples and 2 power iterations
2019-02-28 15:37:19,996 : INFO : 1st phase: constructing (4392, 300) action matrix
2019-02-28 15:37:20,020 : INFO : orthonormalizing (4392, 300) action matrix
2019-02-28 15:37:20,217 : INFO : 2nd phase: running dense svd on (300, 1248) matrix
2019-02-28 15:37:20,269 : INFO : computing the final decomposition
2019-02-28 15:37:20,270 : INFO : keeping 200 factors (discarding 15.371% of energy spectrum)
2019-02-28 15:37:20,280 : INFO : processed documents up to #1248
2019-02-28 15:37:20,286 : INFO : topic #0(5.975): 0.177*"assistance" + 0.146*"necessary" + 0.146*"remain" + 0.144*"year" + 0.142*"transfer" + 0.142*"exceed" + 0.141*"head" + 0.140*"secretary" + 0.139*"law" + 0.138*"public"
2019-02-28 15:37:20,287

In [66]:
lsi_corpus = [ lsi[doc] for doc in vectored_corpus ]

lsi_array = np.array(lsi_corpus)
lsi_array = lsi_array[:,:,1]
lsi_array.shape

(1248, 200)
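
The `[:,:,1]` slice only works when every document comes back with exactly 200 `(topic_id, value)` pairs, which was the case here. If any document were returned with fewer entries (an empty document, for example), `np.array` would not produce a clean 3-D array. A minimal sketch of a more defensive conversion, using a hypothetical `to_dense` helper and toy data:

```python
def to_dense(sparse_docs, num_topics):
    """Turn gensim-style [(topic_id, value), ...] lists into fixed-width rows."""
    rows = []
    for doc in sparse_docs:
        row = [0.0] * num_topics  # topics missing from the sparse list stay 0.0
        for topic_id, value in doc:
            row[topic_id] = value
        rows.append(row)
    return rows

# Hypothetical LSI output for three documents and three topics: the second
# document is missing topic 1 and the third document is empty.
toy_lsi = [
    [(0, 0.5), (1, -0.2), (2, 0.1)],
    [(0, 0.3), (2, 0.4)],
    [],
]
dense = to_dense(toy_lsi, num_topics=3)
```

gensim also provides `matutils.corpus2dense`, which as far as I know does the same job and returns a features-by-documents array that would need transposing.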

# Which documents are least similar to the rest of the corpus?

In [71]:
from gensim import similarities

index = similarities.MatrixSimilarity(lsi_corpus)

2019-02-28 15:38:04,313 : INFO : creating matrix with 1248 documents and 200 features


In [72]:
sims = index[lsi_corpus]

In [73]:
sims.shape

(1248, 1248)

In [83]:
df_sims = pd.DataFrame(sims)
# median (not mean) similarity of each document to every document in the corpus
avg_sim = df_sims.median()

In [84]:
avg_sim.sort_values().head(10)

806    0.000035
763    0.000035
55     0.000043
670    0.000367
302    0.000701
805    0.000707
657    0.000716
874    0.000891
575    0.000902
919    0.001006
dtype: float32
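
For reference, the ranking above can be reproduced on a small scale without gensim. This is a minimal sketch on hypothetical 2-dimensional vectors: it builds the full cosine similarity matrix, takes the median of each row, and picks the document with the lowest median. Because the matrix is symmetric, row medians and the column medians that `DataFrame.median()` computes are the same.

```python
import math
from statistics import median

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical document vectors: three point roughly the same way, one does not.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]

# Document-by-document similarity matrix, playing the role of `sims`.
toy_sims = [[cosine(a, b) for b in toy_docs] for a in toy_docs]

# Median similarity of each document to all documents (itself included).
median_sim = [median(row) for row in toy_sims]

# Index of the document with the lowest median similarity.
outlier = min(range(len(toy_docs)), key=median_sim.__getitem__)
```

Each document's similarity to itself (1.0) is included in its median, as it is in the notebook; with 1248 documents that single value barely moves the result.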

In [85]:
least_sim = avg_sim.sort_values().head(10).index
for i in least_sim:
    print('-----')
    print(raw_corpus[i][:500])

-----
401.None of the funds in this Act shall be used for the planning or execution of any program to pay the expenses of, or otherwise compensate, non-Federal parties intervening in regulatory or adjudicatory proceedings funded in this Act.
-----
601.None of the funds in this Act shall be used for the planning or execution of any program to pay the expenses of, or otherwise compensate, non-Federal parties intervening in regulatory or adjudicatory proceedings funded in this Act.
-----
624.None of the funds made available in this Act may be used in contravention of chapter 29, 31, or 33 of title 44, United States Code.
-----
For salaries and expenses, not otherwise provided for, $48,134,000.
-----
This title may be cited as the Judiciary Appropriations Act, 2019 .
-----
520.Funds available to the General Services Administration shall be available for the hire of passenger motor vehicles.
-----
517.None of the funds made available in this Act may be used for first-class travel by the emp

# Simple BOW

This run skips the TF-IDF normalization and feeds the raw bag-of-words counts straight into LSI.

In [86]:
# dimensionality reduction directly on the bag-of-words corpus
lsi2 = LsiModel(corpus=tokened_corpus, num_topics=200, id2word=dictionary)

2019-02-28 15:42:09,463 : INFO : using serial LSI version on this node
2019-02-28 15:42:09,464 : INFO : updating model with new documents
2019-02-28 15:42:09,465 : INFO : preparing a new chunk of documents
2019-02-28 15:42:09,481 : INFO : using 100 extra samples and 2 power iterations
2019-02-28 15:42:09,481 : INFO : 1st phase: constructing (4392, 300) action matrix
2019-02-28 15:42:09,517 : INFO : orthonormalizing (4392, 300) action matrix
2019-02-28 15:42:09,714 : INFO : 2nd phase: running dense svd on (300, 1248) matrix
2019-02-28 15:42:09,757 : INFO : computing the final decomposition
2019-02-28 15:42:09,758 : INFO : keeping 200 factors (discarding 3.810% of energy spectrum)
2019-02-28 15:42:09,772 : INFO : processed documents up to #1248
2019-02-28 15:42:09,773 : INFO : topic #0(236.398): 0.437*"assistance" + 0.305*"secretary" + 0.290*"appropriate" + 0.217*"law" + 0.200*"include" + 0.197*"committee" + 0.190*"public" + 0.182*"foreign" + 0.147*"head" + 0.141*"report"
2019-02-28 15:4

In [88]:
lsi2_corpus = [ lsi2[doc] for doc in tokened_corpus ]

lsi2_array = np.array(lsi2_corpus)
lsi2_array = lsi2_array[:,:,1]

In [89]:
avg_sim = pd.DataFrame(lsi2_array).median()

In [90]:
least_sim = avg_sim.sort_values().head(10).index
for i in least_sim:
    print('-----')
    print(raw_corpus[i][:500])

-----
235.(a) Authority The Secretary of Housing and Urban Development (in this section referred to as the Secretary ) may carry out a mobility demonstration program to enable public housing agencies to administer housing choice voucher assistance under section 8(o) of the United States Housing Act of 1937 ( 42 U.S.C. 1437f(o) ) in a manner designed to encourage families receiving such voucher assistance to move to lower-poverty areas and expand access to opportunity areas.(b) Selection of PHAs (1) Re
-----
For necessary expenses of the Center for Middle Eastern-Western Dialogue Trust Fund, as authorized by section 633 of the Departments of Commerce, Justice, and State, the Judiciary, and Related Agencies Appropriations Act, 2004 ( 22 U.S.C. 2078 ), the total amount of the interest and earnings accruing to such Fund on or before September 30, 2019, to remain available until expended.
-----
406.Except as otherwise specifically provided by law, not to exceed 50 percent of unobligated bal
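
One caveat when reading this output: `lsi2_array` has one row per document and one column per LSI dimension, so `pd.DataFrame(lsi2_array).median()` yields one value per dimension (200 of them) and not one per document, and no `MatrixSimilarity` index is built in this section. The indices passed to `raw_corpus` here are therefore dimension numbers, not document rankings. The sketch below contrasts the two computations on hypothetical toy vectors; the variable names are illustrative only.

```python
import math
from statistics import median

# Hypothetical feature matrix: 4 documents (rows) x 2 reduced dimensions (columns).
features = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]

# Median down each column of the feature matrix: one value per dimension.
per_dimension = [median(column) for column in zip(*features)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Median of each document's similarities to all documents: one value per document.
per_document = [median(cosine(a, b) for b in features) for a in features]
```

Only `per_document` has one entry per document, so only it can be sorted to find the least similar documents. Repeating the `MatrixSimilarity` steps from the TF-IDF section on `lsi2_corpus` would give the comparable result for the bag-of-words run.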

> Looking at cosine similarity alone does not point to unusual parts of the budget.