# An attempt at clustering

If you have read the unique_words notebook, it's clear that
clustering this corpus in a traditional way is hopeless.

The one algorithm that worked somewhat was HDBSCAN, which let me
keep some very small clusters (by setting cluster_selection_method='leaf').
This at least grouped a bunch of cut-and-pasted sections. But even so, most
of the corpus was unclassifiable.

In [58]:
import pandas as pd
import numpy as np 

from gensim import corpora
from gensim.models import TfidfModel, LsiModel

# Read the data

In [59]:
from budget_corpus import read_documents, read_raw_corpus
corpus = read_documents()
raw_corpus = read_raw_corpus() # so I can see the original doc for debugging

In [60]:
dictionary = corpora.Dictionary(corpus)

In [61]:
tokened_corpus = [dictionary.doc2bow(tokens) for tokens in corpus ]

# TFIDF and LSI

TFIDF compensates for variation in document length (or at least it
would if this were a normal corpus).

LSI (called LSA in sklearn) does dimensionality reduction. I went from 4000
vocabulary words down to 100 dimensions so that HDBSCAN would have something
it could cluster.
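
As a rough illustration of what these two steps do, here is a NumPy-only sketch on a toy count matrix. It is not gensim's implementation. The weighting (log2 of N/df, then L2-normalising each document) is what I understand gensim's default to be, so treat it as an assumption, and a plain SVD stands in for gensim's solver. The names `tfidf_matrix` and `lsi_project` are made up for this example.

```python
import numpy as np

def tfidf_matrix(counts):
    """counts: (n_docs, n_terms) raw term counts -> L2-normalised tf-idf rows."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)        # documents containing each term
    idf = np.log2(n_docs / np.maximum(df, 1))    # a term in every doc gets weight 0
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # leave all-zero rows as zeros
    return weighted / norms

def lsi_project(matrix, num_topics):
    """Project each row onto the top `num_topics` right singular vectors."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return matrix @ vt[:num_topics].T

toy_counts = np.array([
    [2, 1, 0, 0],
    [4, 2, 0, 0],   # same proportions as doc 0, but twice as long
    [0, 0, 3, 1],
    [0, 1, 1, 2],
])
weights = tfidf_matrix(toy_counts)
reduced = lsi_project(weights, num_topics=2)
print(reduced.shape)
```

Documents 0 and 1 end up with identical tf-idf rows even though one is twice as long, which is the length compensation mentioned above.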

In [62]:
# first convert words to tfidf values (this helps compensate for the
# uneven document sizes)

tfidf = TfidfModel(dictionary=dictionary)
vectored_corpus = [ tfidf[doc] for doc in tokened_corpus]

# next do dimensionality reduction
lsi = LsiModel(corpus=vectored_corpus, num_topics=100, id2word=dictionary, onepass=False, power_iters=3)

In [63]:
lsi.print_topics()

[(0,
  '0.177*"assistance" + 0.146*"necessary" + 0.146*"remain" + 0.144*"year" + 0.142*"transfer" + 0.142*"exceed" + 0.141*"head" + 0.140*"secretary" + 0.139*"law" + 0.138*"authorize"'),
 (1,
  '0.405*"inspector" + -0.253*"loan" + 0.251*"reception" + 0.248*"representation" + 0.209*"official" + 0.206*"necessary" + 0.195*"exceed" + -0.186*"housing" + 0.184*"vehicle" + 0.166*"motor"'),
 (2,
  '0.581*"loan" + 0.333*"inspector" + 0.254*"rural" + 0.247*"housing" + 0.210*"guarantee" + 0.138*"direct" + 0.122*"guaranteed" + -0.118*"committee" + 0.097*"principal" + 0.094*"farm"'),
 (3,
  '-0.718*"inspector" + 0.198*"reception" + 0.186*"representation" + 0.167*"official" + 0.163*"exceed" + -0.160*"foreign" + -0.145*"assistance" + 0.133*"loan" + -0.125*"carry" + 0.115*"authorize"'),
 (4,
  '-0.312*"assistance" + 0.257*"transfer" + -0.254*"foreign" + 0.209*"inspector" + 0.185*"obligation" + 0.172*"year" + 0.172*"house" + 0.170*"current" + 0.167*"expressly" + 0.160*"senate"'),
 (5,
  '0.337*"housing

In [64]:
lsi_corpus = [ lsi[doc] for doc in vectored_corpus ]

# each element of lsi_corpus is a list of (topic number, weight) pairs;
# to turn this into an array of vectors for clustering, we need to
# drop the topic numbers
lsi_corpus[0][ :10]

[(0, 0.08837494518307108),
 (1, 0.026074224555865326),
 (2, -0.016551861764710026),
 (3, 0.047667914966584166),
 (4, 0.015880671024923635),
 (5, 0.04245770095562856),
 (6, -0.02706591516144045),
 (7, -0.018622189133790976),
 (8, -0.0010137140280048898),
 (9, -0.04471173692014821)]

In [65]:
lsi_array = np.array(lsi_corpus)
lsi_array.shape

(1248, 100, 2)

In [66]:
lsi_array = lsi_array[:,:,1]
lsi_array.shape

(1248, 100)

In [67]:
# and you can see we have the same numbers as the sample above,
# just in a different shape
lsi_array[0, :10]

array([ 0.08837495,  0.02607422, -0.01655186,  0.04766791,  0.01588067,
        0.0424577 , -0.02706592, -0.01862219, -0.00101371, -0.04471174])
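
One caveat on the `np.array(lsi_corpus)` step: it only gives a regular `(docs, topics, 2)` array when every document has an entry for every topic. Gensim's sparse format can leave entries out (I believe values very close to zero are dropped), and then the lists are ragged. Below is a defensive sketch that fills by topic id instead. `to_dense` is a hypothetical helper, not part of gensim; `gensim.matutils.corpus2dense` should do the same job, though it returns topics by documents, so it would need transposing.

```python
import numpy as np

def to_dense(sparse_docs, num_topics):
    """Turn lists of (topic_id, weight) pairs into a (n_docs, num_topics) array."""
    dense = np.zeros((len(sparse_docs), num_topics))
    for row, doc in enumerate(sparse_docs):
        for topic_id, weight in doc:
            dense[row, topic_id] = weight   # topics with no entry stay at 0.0
    return dense

# toy input: the second document has no entry for topic 1
sample = [[(0, 0.088), (1, 0.026), (2, -0.017)],
          [(0, 0.5), (2, -0.25)]]
print(to_dense(sample, num_topics=3))
```

In this notebook that would be `to_dense(lsi_corpus, 100)`, which should match `lsi_array` whenever no entries are missing.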

In [68]:
# cluster with hdbscan, telling it to go with fine grained clusters
# because the broad ones created with other models make even less sense...

import hdbscan
hdb = hdbscan.HDBSCAN(cluster_selection_method='leaf')
hdb.fit(lsi_array)

HDBSCAN(algorithm='best', allow_single_cluster=False, alpha=1.0,
    approx_min_span_tree=True, cluster_selection_method='leaf',
    core_dist_n_jobs=4, gen_min_span_tree=False, leaf_size=40,
    match_reference_implementation=False, memory=Memory(location=None),
    metric='euclidean', min_cluster_size=5, min_samples=None, p=None,
    prediction_data=False)

In [69]:
hdb.labels_

array([-1, -1, -1, ...,  4, -1, 12])

In [70]:
labels = pd.Series(hdb.labels_)

In [71]:
labels.value_counts()

-1     980
 12    177
 4      12
 1       9
 13      8
 10      7
 8       7
 6       7
 5       7
 9       6
 3       6
 2       6
 0       6
 11      5
 7       5
dtype: int64
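
To put a number on "most of the corpus was unclassifiable", here is a small sketch that works from the counts printed above. The dictionary is copied by hand from that output; in the notebook itself, `labels.value_counts(normalize=True)` would give the shares directly.

```python
# cluster label -> document count, copied from the value_counts() output above
counts = {-1: 980, 12: 177, 4: 12, 1: 9, 13: 8, 10: 7, 8: 7, 6: 7, 5: 7,
          9: 6, 3: 6, 2: 6, 0: 6, 11: 5, 7: 5}

total = sum(counts.values())
noise_share = counts[-1] / total                    # -1 is HDBSCAN's noise label
small_clusters = total - counts[-1] - counts[12]    # everything outside noise and cluster 12
print(f"{total} docs, {noise_share:.1%} noise, "
      f"{small_clusters} docs in the small clusters")
```

So 980 of the 1248 sections are noise, and apart from the one big cluster only 91 sections landed in a cluster at all.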

In [72]:
# The clustering is good at finding repeated texts at least

In [73]:
for idx, lab in enumerate(hdb.labels_):
    if lab == 8:
        print(" " )
        print(raw_corpus[idx])

 
For necessary expenses, not otherwise provided for, in the conduct and support of science research and development activities, including research, development, operations, support, and services; maintenance and repair, facility planning and design; space flight, spacecraft control, and communications activities; program management; personnel and related costs, including uniforms or allowances therefor, as authorized by sections 5901 and 5902 of title 5, United States Code; travel expenses; purchase and hire of passenger motor vehicles; and purchase, lease, charter, maintenance, and operation of mission and administrative aircraft, $6,905,700,000, to remain available until September 30, 2020: Provided , That, of the amounts provided, $545,000,000 is for an orbiter and $195,000,000 is for a lander to meet the science goals for the Jupiter Europa mission as recommended in previous Planetary Science Decadal surveys: Provided further , That the National Aeronautics and Space Administratio

In [74]:
# Let's take a peek at the biggest cluster: you can see
# that it's not particularly obvious why these sections are combined

In [75]:
import random
for idx, lab in enumerate(hdb.labels_):
    if lab == 12 and random.random() > .8:
        print(" " )
        print(raw_corpus[idx][:500])

 
516.(a) Notwithstanding any other provision of law or treaty, none of the funds appropriated or otherwise made available under this Act or any other Act may be expended or obligated by a department, agency, or instrumentality of the United States to pay administrative expenses or to compensate an officer or employee of the United States in connection with requiring an export license for the export to Canada of components, parts, accessories or attachments for firearms listed in Category I, secti
 
7042.(a) African great lakes region assistance restriction Funds appropriated by this Act under the heading International Military Education and Training for the central government of a country in the African Great Lakes region may be made available only for Expanded International Military Education and Training and professional military education until the Secretary of State determines and reports to the Committees on Appropriations that such government is not facilitating or otherwise par

In [47]:
# Finally, a peek at what the clusterer could not cluster at all

In [48]:
for idx, lab in enumerate(hdb.labels_):
    if lab == -1 and random.random() > .99:
        print(" " )
        print(raw_corpus[idx][:500])

 
For expenses necessary for the enforcement of antitrust and kindred laws, $164,977,000, to remain available until expended: Provided , That notwithstanding any other provision of law, fees collected for premerger notification filings under the Hart-Scott-Rodino Antitrust Improvements Act of 1976 ( 15 U.S.C. 18a ), regardless of the year of collection (and estimated to be $136,000,000 in fiscal year 2019), shall be retained and used for necessary expenses in this appropriation, and shall remain a
 
415.None of the funds appropriated or otherwise made available under this Act may be used by the Surface Transportation Board to charge or collect any filing fee for rate or practice complaints filed with the Board in an amount in excess of the amount authorized for district court civil suit filing fees under section 1914 of title 28, United States Code.
 
183.None of the funds in this Act shall be available for salaries and expenses of more than 125 political and Presidential appointees in

In [49]:
# and where's my butterfly? See above...

In [50]:
raw_corpus[234]

'231.None of the funds made available by this Act or prior Acts are available for the construction of pedestrian fencing—(1) within the Santa Ana Wildlife Refuge;(2) within the Bentsen-Rio Grande Valley State Park;(3) within La Lomita Historical park;(4) within the National Butterfly Center; or(5) within or east of the Vista del Mar Ranch tract of the Lower Rio Grande Valley National Wildlife Refuge.'

In [51]:
hdb.labels_[234]

11
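
A natural follow-up is to look at what else landed in the same cluster as the butterfly section. This is a minimal sketch using NumPy only; `cluster_peers` is a made-up helper name, and the toy labels stand in for `hdb.labels_`.

```python
import numpy as np

def cluster_peers(labels, idx):
    """Indices of the other documents sharing a cluster with document idx."""
    labels = np.asarray(labels)
    if labels[idx] == -1:                 # -1 is noise, not a real cluster
        return np.array([], dtype=int)
    peers = np.flatnonzero(labels == labels[idx])
    return peers[peers != idx]

toy_labels = [11, -1, 11, 4, 11, -1]
print(cluster_peers(toy_labels, 0))

# In the notebook this would be something like:
# for i in cluster_peers(hdb.labels_, 234):
#     print(raw_corpus[i][:300])
```

Since cluster 11 holds only 5 sections, this should print just four neighbours to read through.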