# Unique words in the budget

This notebook documents the prevalence of unique words in the 
budget.  


In [1]:
import pandas as pd
import numpy as np 

from pathlib import Path

In [2]:
from gensim import corpora

In [3]:
from budget_corpus import read_raw_corpus, read_documents

In [4]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [5]:
raw_corpus = read_raw_corpus()
corpus = read_documents()

In [6]:
dictionary = corpora.Dictionary(tokens for tokens in corpus)

2019-02-27 21:05:26,259 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2019-02-27 21:05:26,317 : INFO : built Dictionary(4392 unique tokens: ['acquisition', 'aircraft', 'authorize', 'capital', 'derive']...) from 1248 documents (total 69715 corpus positions)


## No frequent words left to filter

My stopword set includes a number of legalese and budget
words; see budget_corpus.py.

The statement below keeps all the words that are only used once
(no_below=0), because "butterfly" was only used once. It would
delete words that appear in over 50% of the documents, but there
aren't any.

In [7]:
dictionary.filter_extremes(no_below=0, no_above=.5)

2019-02-27 21:05:26,324 : INFO : discarding 0 tokens: []...
2019-02-27 21:05:26,326 : INFO : keeping 4392 tokens which were in no less than 0 and no more than 624 (=50.0%) documents
2019-02-27 21:05:26,329 : INFO : resulting dictionary: Dictionary(4392 unique tokens: ['acquisition', 'aircraft', 'authorize', 'capital', 'derive']...)
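
To make the thresholds concrete, here is a minimal pure-Python sketch of the rule that the log message describes ("in no less than 0 and no more than 624 documents"). It is not gensim's implementation; the helper names and the toy corpus are made up for illustration, and the use of `int()` to turn the fraction into a document count is an assumption that happens to match the numbers in the log.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each word, how many documents contain it."""
    dfs = Counter()
    for doc in docs:
        dfs.update(set(doc))  # set(): count a word once per document
    return dfs

def filter_by_document_frequency(docs, no_below=0, no_above=0.5):
    """Return the words found in at least `no_below` documents and in
    at most int(no_above * number_of_documents) documents."""
    dfs = document_frequencies(docs)
    max_docs = int(no_above * len(docs))  # assumption: fraction is truncated
    return {word for word, df in dfs.items() if no_below <= df <= max_docs}

# Hypothetical toy corpus, not taken from the budget.
toy_docs = [
    ["fund", "aircraft", "butterfly"],
    ["fund", "aircraft"],
    ["fund", "wine"],
    ["fund", "grape"],
]

kept = filter_by_document_frequency(toy_docs, no_below=0, no_above=0.5)
print(sorted(kept))  # ['aircraft', 'butterfly', 'grape', 'wine']
```

In the toy corpus only "fund" is dropped, because it appears in all four documents. The single-document words survive because `no_below=0` sets no lower limit.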


## Convert documents to tokens

Then count how many words appear in only one document
in the budget.

In [8]:
tokened_corpus = [dictionary.doc2bow(words) for words in corpus ]

In [9]:
# dictionary.dfs maps token id -> number of documents containing that token
odd_tokens = [token for token, freq in dictionary.dfs.items() if freq == 1]
len(odd_tokens)

1685
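
A document frequency of 1 means "appears in exactly one document", which is not quite the same as "used exactly once": a word repeated within a single document still has a document frequency of 1. The sketch below shows the difference on a made-up toy corpus, and also counts how many documents contain at least one single-document word. It uses plain `collections.Counter` rather than gensim, and all names in it are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the budget.
toy_docs = [
    ["fund", "aircraft", "butterfly", "butterfly"],
    ["fund", "aircraft"],
    ["fund", "wine"],
    ["fund"],
]

# Document frequency: number of documents containing each word.
doc_freq = Counter()
for doc in toy_docs:
    doc_freq.update(set(doc))

# Collection frequency: total number of occurrences of each word.
coll_freq = Counter(word for doc in toy_docs for word in doc)

single_doc_words = {w for w, n in doc_freq.items() if n == 1}
used_once_words = {w for w, n in coll_freq.items() if n == 1}

# Documents that contain at least one single-document word.
docs_with_single_doc_word = sum(
    1 for doc in toy_docs if single_doc_words.intersection(doc)
)

print(sorted(single_doc_words))   # ['butterfly', 'wine']
print(sorted(used_once_words))    # ['wine']
print(docs_with_single_doc_word)  # 2
```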

> Too many unique words to print - let's do 20 lines.

In [10]:
words = [ dictionary[s] for s in odd_tokens ]
output = ""
count = 0
for word in words:
    output += word  + ' '
    if len(output) > 100:
        count += 1
        if count > 20:
            break
        print(output)
        output = ''

moderately focused random randomized landlord promising forbid apprentice trainee unemployed bid delay 
postage suffrage deobligation rehire wellness dump heroin willing nonagricultural nonimmigrant committed 
practical liaison accessory exporter shipper barrel cylinder frame breech postmaster canadian maintained 
escalation helsinki reinsurance turkey turkish indictment stand coalition procedural elizabeth beach 
downgrade resiliency telemarketing correspondingly manifest packing grape varietal wine microorganism 
kindred dune retitle lakeshore nonapplication douglas redesignation miller assurance supreme cotton avian 
specialty zoonotic stockpile scrapie screwworm formulate brucellosis escort philosophical aids improper 
analytic confer alignment earlier afford roma escobare grulla salineno les multiply redistribution burned 
furnished capitalized retardant equality raise setting marriage genital cutting mutilation constrain 
malawi ebola cancele darfur blue nile abyei referendum via
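
As an aside, the standard library's `textwrap` module can do the same kind of line wrapping. This is a sketch of an alternative, not the code that produced the output above, and it behaves slightly differently: `textwrap.wrap` keeps each line within the width instead of breaking just after passing it, and it keeps a short final line. The function name `first_lines` is made up for illustration.

```python
import textwrap

def first_lines(words, width=100, max_lines=20):
    """Join words with spaces, wrap to `width` columns, keep `max_lines` lines."""
    wrapped = textwrap.wrap(
        " ".join(words),
        width=width,
        break_long_words=False,  # never split a single word
        break_on_hyphens=False,  # keep hyphenated words whole
    )
    return wrapped[:max_lines]

# Hypothetical sample; in the notebook this would be the `words` list.
sample = ["alpha", "beta", "gamma", "delta"] * 10
for line in first_lines(sample, width=30, max_lines=3):
    print(line)
```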

Not telling the lemmatizer to remove proper names was a deliberate
design choice: I wanted to keep "National Butterfly Center", after all.
But there are a lot of other words that are not part of proper names.

I can classify the language of the budget into roughly three groups:
- Stop words in the usual NLP sense ('the', etc.)
- Stop words in the government budget document sense ('pay', 'agency', 'fiscal', 'section', 'subsection', 'title', 'code')
- Unique words that have to do with the specific thing being funded in one or possibly a few budget sections and not mentioned again.

There are some more typical-looking words, like 'beginning' and 'moderately',
which you might expect to come up more than once in a 200,000-word sample,
but the federal budget is not a normal sample.

If I were to drop all the words that appear in more than
5 percent of the documents (for demonstration purposes, not
for actual analysis), I would still be dropping only 135 out of
about 4400 tokens.

In [11]:
dictionary.filter_extremes(no_below=0, no_above=.05)

2019-02-27 21:05:26,409 : INFO : discarding 135 tokens: [('acquisition', 68), ('authorize', 368), ('capital', 72), ('derive', 81), ('excess', 66), ('advance', 69), ('appropriate', 340), ('committee', 233), ('current', 95), ('day', 163)]...
2019-02-27 21:05:26,410 : INFO : keeping 4257 tokens which were in no less than 0 and no more than 62 (=5.0%) documents
2019-02-27 21:05:26,413 : INFO : resulting dictionary: Dictionary(4257 unique tokens: ['aircraft', 'donation', 'exist', 'obtain', 'offset']...)
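
The numbers in the log can be checked with a little arithmetic. This sketch only reuses figures printed above (1248 documents, 4392 tokens, 135 discarded); truncating the fraction with `int()` is an assumption that matches both the 62 here and the 624 in the earlier log.

```python
num_docs = 1248          # documents, from the Dictionary log line
tokens_before = 4392     # tokens before filtering
tokens_discarded = 135   # tokens dropped with no_above=.05

# Largest document count a kept token may have (assumed truncation).
max_docs = int(0.05 * num_docs)

tokens_kept = tokens_before - tokens_discarded
share_discarded = tokens_discarded / tokens_before

print(max_docs)                   # 62
print(tokens_kept)                # 4257
print(f"{share_discarded:.1%}")   # 3.1%
```

So even a cutoff as aggressive as 5 percent of documents removes only about 3 percent of the vocabulary.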
