<h1> Choosing a recommender </h1>

Running through several combinations of vectorizers, models, and distance metrics to find the best performing combination for the backend of the report recommender.

In [1]:
import numpy as np
from pymongo import MongoClient
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS as stopwords 
from sklearn.decomposition import NMF, TruncatedSVD
from bson.objectid import ObjectId
import textract

In [2]:
client = MongoClient('mongodb://localhost:27017/')
db = client.project_4_database

In [3]:
dc_abbreviations = [abbrev for abbrev in db.crs_abbreviations.find()][0]['DC_abbrevs']

In [4]:
stopwords = list(stopwords) + dc_abbreviations

<h1> Test Texts </h1>

Here are a few texts that are either short descriptions of legislation pieces or the full texts of a bill in Congress.

In [5]:
house_hydro = '''
H.R. 3043 modernizes the regulatory permitting process and encourages the expansion of hydropower generation by improving administrative efficiency, accountability, and transparency; promotes new hydropower infrastructure; requires balanced and timely decision making; and reduces duplicative oversight.

Specifically, the legislation establishes the Federal Energy Regulatory Commission as the lead agency for all hydropower authorizations, approvals, and requirements mandated by federal law. The bill modifies the definition of renewable energy to include hydropower, extends the timeframe for a preliminary permit from 3 to 4 years, and extends the time limit for construction “for not more than 8 additional years.”

The legislation also establishes procedures for trial-type hearings conducted by an Administrative Law Judge to resolve disputes relating to conditions and fishway prescriptions under Part I of the Federal Power Act. In addition, the legislation facilitates the timely and efficient completion of license proceedings by minimizing duplication of studies and establishing a program to compile a comprehensive collection of studies and data on a regional or basin-wide scale.
'''

In [6]:
house_chip ="\n\nWHAT IS CHIP?\nCHIP is short for the Children's Health Insurance Program - Pennsylvania's program to provide health insurance to uninsured children and teens who are not eligible for or enrolled in Medical Assistance. Nine out of 10 CHIP parents report satisfaction with their child�s health plan, and 96% received an appointment for check-ups and vaccinations as soon as they wanted. There are a lot of reasons kids might not have health insurance - maybe their parents lost a job, don't have health insurance at work or maybe it just costs too much. Whatever the reason, CHIP may be able to help. All families need to do is apply today.\n\n\nNOW THAT HEALTHCARE REFORM HAS BEGUN, IS CHIP STILL AVAILABLE FOR MY CHILDREN?\nUninsured Pennsylvania children and teens who are not eligible for Medical Assistance have access to affordable, comprehensive health-care coverage. CHIP is still available for uninsured kids and teens. CHIP is  there for your kids with quality, comprehensive health insurance coverage for routine doctor visits, prescriptions, dental, eye care, prescriptions and much more. All families need to do is apply today. For a full list of covered benefits available through CHIP, click here.\n\nI thought CHIP was only for low-income families?\n\nParents may think their kids can't get CHIP because they make too much money. Not true! CHIP covers uninsured kids and teens up to age 19 who are not eligible for Medical Assistance.\n\nWHAT'S THE COST FOR CHIP COVERAGE?\n\nFor most families, it's free. Families with incomes above the free CHIP limits will pay low monthly premiums and co-pays for some services. View comprehensive income information (PDF).\n\nHOW DOES CHIP PUT INSURANCE WITHIN REACH?\n\nCHIP is brought to you by leading health insurance companies who offer your kids quality, comprehensive coverage. You'll have a choice of major insurance companies with large networks of physicians, specialists and care facilities near you.  In fact, your kids may even be able to keep visiting the same doctors they see now. Find a CHIP health insurance company in your county.\n\nHOW LONG IS MY CHILD COVERED ONCE THEY ARE ENROLLED IN CHIP?\n\nOnce enrolled, children are guaranteed 12 months of CHIP coverage unless they no longer meet the basic eligibility requirements. Families must renew their coverage every year in order for the coverage to continue. CHIP insurance companies send renewal notices 90 days before their benefits are going to end, and families must fill out and send the renewal information back to their CHIP insurance company in order for benefits to continue.\n\nI recently gained legal custody of my grandchildren. They are uninsured and need health benefits. Can I apply for CHIP for them?\n\nYes! Any legal guardian who is exercising care and control of the children, can apply for CHIP.\n\nMY CHILD HAS A PRE-EXISTING CONDITION. WILL THAT AFFECT OUR ELIGIBILITY?\n\nPre-existing conditions are covered. There are no exclusions for pre-existing conditions in CHIP or Medical Assistance. However, if your child has a serious medical condition or disability, he or she may be considered for Medical Assistance.\n\nDOES CHIP HAVE A WAITING LIST?\n\nNo. There is no waiting list to enroll in CHIP.\n\nI HAVE MORE QUESTIONS. CAN I TALK TO SOMEONE ABOUT CHIP?\n\nCouldn't find the answer you're looking for? Maybe it's under another category in our FAQ section. Check it out. Or, find a CHIP insurance company in your county and give them a call!\n\n\n"

In [7]:
hydro_bill = textract.process('/Users/jonathanjramirez/Downloads/BILLS-115hr3043rh.pdf')

<h1> Defining classes for text handlers </h1>

In [8]:
def get_from_mongo(mongo_id, art_name):
    
    """
    Given a mongoDB ID and the name of the article, this will return the text from the database.
    """
    
    return db.cleaned_pdfs.find({'_id': mongo_id})[0][art_name]

In [9]:
class text_handler():
    
    def __init__(self):
        
        self.total_unique = len([article for article in db.cleaned_pdfs.find()])
        
       
        
    def get_vectorizer(self, vectorizer = 'count', ngram_range = (1,2), stop_words = 'english', max_df = 0.6, max_features = 5000):
        
        """
        This function allows the user to create either a count vectorizer or a tfidf vectorizer
        while setting preferences such as the ngram range any stopwords, max document frequency, and max features.
        --------
        INPUT:
            vectorizer: string (eiter 'count' or 'tfidf' to specify kind of vectorizer)
            ngram_range: tuple (min, max)
            stop_words: string or list of stop words
            max_df: number between 0 and 1 to indicate the maximum percentage of the documents that a term can appear in
            max_features: total number of features desired
            
        OUTPUT:
            count or tfidf vectorizer with the desired specifications.
        """

        if vectorizer == 'count':

            count_vectorizer = CountVectorizer(ngram_range=ngram_range,  
                                       stop_words=stop_words, 
                                       token_pattern="\\b[a-z][a-z]+\\b",
                                       lowercase=True,
                                       max_df = max_df, max_features = max_features)

            return count_vectorizer

        elif vectorizer == 'tfidf':

            tfidf_vectorizer = TfidfVectorizer(ngram_range=ngram_range,  
                                       stop_words=stop_words, 
                                       token_pattern="\\b[a-z][a-z]+\\b",
                                       lowercase=True,
                                       max_df = max_df, max_features = max_features)
            return tfidf_vectorizer

        else:

            print("Type in either 'count' or 'tfidf'")
            
            
            
    def get_all_docs(self):
        
        """
        This function gets all articles from mongoDB.
        """
        
        all_unique_articles = [article for article in db.cleaned_pdfs.find()]
        
        return all_unique_articles



    def get_txt(self, article):
        
        art_name = [key for key in article][1]

        art_text = article[art_name]

        return (art_name,art_text)


    def get_all_texts(self):
        
        """
        This function loads all document names and mongoDB _ids into the handler object.
        """

        unique_arts = self.get_all_docs()
        
        self.doc_names = []
        self.doc_ids = []
        
        for art in unique_arts:
            
            _id = art['_id']
                                
            self.doc_ids.append(_id)
            
            
        self.doc_names = []
        
        for doc in self.doc_ids:
            
            keys = [key for key in db.cleaned_pdfs.find({'_id': doc})[0]]
            
            for key in keys:
                
                if "R" in key:
                    
                    name = key
                    self.doc_names.append(key)
        
        output = []
        
        for i,key in enumerate(self.doc_names):
            
            output.append(unique_arts[i][key])
            
        return output

In [10]:
handler = text_handler()

In [11]:
cv = handler.get_vectorizer(vectorizer='count', 
                            ngram_range=(1,3), 
                            stop_words = stopwords, 
                            max_df = 0.1)

In [12]:
X = cv.fit_transform(handler.get_all_texts())

In order to compare documents by the strength of words (and eventually the strength of topics), we need to put each document on the same scale.

In [13]:
def normalize(sparse_matrix):
    
    from sklearn.preprocessing import Normalizer
    import numpy as np
    from scipy import sparse

    n = Normalizer()
    n.fit(sparse_matrix.toarray())

    X = n.transform(sparse_matrix.toarray())
    
    X_sparse_cv = sparse.csr_matrix(X)
    
    return X_sparse_cv


In [14]:
X = normalize(X)

In [94]:
class recommender():
    
    def __init__(self, handler):
        
        self.model_dict = {}
        self.new_article = None
        self.recommendations = []
        self.handler = handler

    def get_recommendations(self, new_article, model, vectorizer, training_vectors,n_neighbors, method, metric = 'cosine'):
        
        """
        INPUT:
            new_article: string - This is the new article for which the user will generate a recommended reading list
            model: model (e.g. nmf, lsa, lda) - 
            vectorizer: the vectorizer used (e.g. cv or tfidf)
            training_vectors: the vectorized and normalized data
            n_neighbors: number of recommendations desired
            method: string - to keep track of the results per combination
            metric = 'cosine': can try 'euclidean' for euclidean distance
            
        OUTPUT:
            array of ints which are the index numbers of the name and mongoDB object ids.
        """
        
        self.new_article = new_article
        
        new_vec = model.transform(
            vectorizer.transform([new_article]))
        
        nn = NearestNeighbors(n_neighbors=n_neighbors, metric=metric, algorithm='brute')
        
        nn.fit(training_vectors)
        
        results = nn.kneighbors(new_vec)
        
        self.model_dict[method] = (vectorizer, training_vectors)
        
        self.recommendations.append(results[1][0])
        
        return results[1][0]
    

In [16]:
rec = recommender(handler)

In [17]:
n_topics = 30
n_iter = 30
lda = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=n_iter,
                                random_state=42,
                               learning_method='online')

In [18]:
data = lda.fit_transform(X)

In [19]:
rec.get_recommendations(new_article=house_chip, model = lda, vectorizer=cv,
                       training_vectors = lda.transform(X), n_neighbors = 10, method = 'lda_cv')

array([2536, 2690, 2274, 2261,  959, 3454, 2057, 2292, 1015, 2270])

In [143]:
get_from_mongo(mongo_id=handler.doc_ids[2536], art_name = handler.doc_names[2536])[:3000]

'Comparing Medicaid and Exchanges: Benefits and Costs for Individuals and Families Evelyne P Baumrucker Analyst in Health Care Financing Bernadette Fernandez Specialist in Health Care Financing June ,   Congressional Research Service - wwwcrs  R  CRS Report for Congress Prepared for Members and Committees of Congress  \x0cComparing Medicaid and Exchanges: Benefits and Costs for Individuals and Families  Summary The Patient Protection and Affordable Care Act ACA, PL -, as amended expands health insurance coverage primarily through two mechanisms: by expanding the existing Medicaid program and by establishing new health insurance exchanges where certain individuals and businesses can purchase private health insurance Under ACA, Medicaid and exchanges are envisioned to work in tandem, with the potential to provide a continuous source of subsidized coverage for lower-income individuals and families, beginning in  On June , , the US Supreme Court issued a decision in National Federation of 

This recommendation seems inline with the topic as a whole. How does it do with the hydro bill summary?

In [20]:
rec.get_recommendations(new_article=house_hydro, model = lda, vectorizer=cv,
                       training_vectors = lda.transform(X), n_neighbors = 10, method = 'lda_cv')

array([1708, 3624, 3772, 3600, 2070,  837, 2696, 2646,  166, 2460])

In [144]:
get_from_mongo(mongo_id=handler.doc_ids[1708], art_name = handler.doc_names[1708])[:3000]

'Previewing a  Farm Bill Rene Johnson, Coordinator Specialist in Agricultural Policy March ,   Congressional Research Service - wwwcrs  R  \x0cPreviewing a  Farm Bill  Summary Congress periodically establishes agricultural and food policy in an omnibus farm bill The th Congress faces reauthorization of the  farm billthe Agricultural Act of  PL -, HRept -because many of its provisions expire in  The  farm bill is the most recent omnibus farm bill It was enacted in February  and succeeded the Food, Conservation, and Energy Act of  PL -,  farm bill In recent decades, the breadth of farm bills has steadily grown to include new and expanding food and agricultural interests The  farm bill contains  titles encompassing farm commodity revenue supports, farm credit, trade, agricultural conservation, research, rural development, energy, and foreign and domestic food programs, among other programs Provisions in the  farm bill reshaped the structure of farm commodity support, expanded crop insuran

This is not an idea match for a bill about clean energy and the environment.

<h1>TFIDF</h1>

In [22]:
tfidf = handler.get_vectorizer(vectorizer='tfidf', ngram_range=(1,3),stop_words=stopwords, max_df = 0.1)

In [23]:
X2 = normalize(tfidf.fit_transform(handler.get_all_texts()))

In [24]:
n_topics = 30
n_iter = 30
lda2 = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=n_iter,
                                random_state=42,
                               learning_method='online')

In [25]:
data2 = lda2.fit_transform(X2)

In [26]:
rec.get_recommendations(new_article=house_chip, model = lda2, vectorizer=tfidf,
                       training_vectors = lda2.transform(X2), n_neighbors = 10, method = 'lda_tfidf')

array([1734, 2940, 1556, 3454, 2146, 2594, 2274, 2292, 2270,  932])

In [145]:
get_from_mongo(mongo_id=handler.doc_ids[1734], art_name = handler.doc_names[1734])[:3000]

'Finding Medicare Enrollment Statistics Michele L Malloy Research Librarian June ,   Congressional Research Service - wwwcrs  R  \x0cFinding Medicare Enrollment Statistics  Contents Purpose and Scope  Categories and Definitions   Selected Sources for Medicare Enrollment Statistics   Centers for Medicare & Medicaid Services   Medicare Enrollment Dashboard   Medicare Enrollment Reports   Congressional District Report   Medicare Advantage/Part D Contract and Enrollment Data   CMS Program Statistics   CMS Fast Facts   Medicare-Medicaid Coordination Office: Analytic Reports and Data Resources   CMS Office of Legislation Contact Information   Medicare Board of Trustees   Medicare Board of Trustees Annual Report   Medicare Payment Advisory Commission   MedPAC Reports   Data Book: Health Care Spending and the Medicare Program  Americas Health Insurance Plans  Trends in Medigap Enrollment and Coverage Options  Beneficiaries with Medigap Coverage    Figures Figure  Medicare Enrollment Dashboard 

Again, this is a reasonable recommendation, but does using TFIDF fare better with the hydro bill?

In [27]:
rec.get_recommendations(new_article=house_hydro, model = lda2, vectorizer=tfidf,
                       training_vectors = lda2.transform(X2), n_neighbors = 10, method = 'lda_tfidf')

array([2475,  984, 2752, 3541, 3172, 2296, 3063, 2978, 1415, 2084])

In [146]:
get_from_mongo(mongo_id=handler.doc_ids[2475], art_name = handler.doc_names[2475])[:3000]

'The Federal Minimum Wage: In Brief David H Bradley Specialist in Labor Economics June ,   Congressional Research Service - wwwcrs  R  \x0cThe Federal Minimum Wage: In Brief  Summary The Fair Labor Standards Act FLSA, enacted in , is the federal legislation that establishes the minimum hourly wage that must be paid to all covered workers The minimum wage provisions of the FLSA have been amended numerous times since , typically for the purpose of expanding coverage or raising the wage rate Since its establishment, the minimum wage rate has been raised  separate times The most recent change was enacted in  PL -, which increased the minimum wage to its current level of $ per hour In addition to setting the federal minimum wage rate, the FLSA provides for several exemptions and subminimum wage categories for certain classes of workers and types of work Even with these exemptions, the FLSA minimum wage provisions still cover the vast majority of the workforce Despite this broad coverage, ho

Unforunately, it recommends a PDF on the minimum wage.

<h1> NMF CV </h1>

Ultimately, NMF + CV produces the most reasonable recommendations for the two test cases.

In [29]:
import pickle

In [30]:
def pickle_dis(dis,dis_name):
    
    output = open(dis_name + '.pkl', 'wb')

    pickle.dump(dis, output)

    output.close()
    
def unpickle_dis(dis):
    
    pkl_file = open(dis, 'rb')

    data = pickle.load(pkl_file)

    pkl_file.close()
    
    return data

In [31]:
n_topics = 30
n_iter = 30

#Instantiate an NMF
nmf = NMF(n_components=n_topics, max_iter=n_iter, random_state=42)

In [32]:
pickle_dis(nmf, 'nmf')

In [33]:
#Be sure to fit transform and normalize your data.

X3 = normalize(cv.fit_transform(handler.get_all_texts()))

In [34]:
pickle_dis(X3, 'X')

In [35]:
#Now to put the data into topic space from word space.

nmf_data = nmf.fit_transform(X3)

In [36]:
pickle_dis(nmf_data, 'nmf_data')

In [37]:
pickle_dis(cv, 'cv')

In [38]:
rec.get_recommendations(new_article=house_chip, 
                        model = nmf, 
                        vectorizer=cv,
                        training_vectors=nmf.transform(X3), 
                        n_neighbors=5, method = 'nmf_cv')


array([2126, 2106, 2239,  557, 2705])

In [147]:
get_from_mongo(handler.doc_ids[2126], handler.doc_names[2126])[:3000]

'CHIP and the ACA Maintenance of Effort MOE Requirement: In Brief Alison Mitchell Specialist in Health Care Financing Evelyne P Baumrucker Specialist in Health Care Financing September ,   Congressional Research Service - wwwcrs  R  \x0cCHIP and the ACA Maintenance of Effort MOE Requirement: In Brief  Summary The State Childrens Health Insurance Program CHIP is a means-tested program that provides health coverage to targeted low-income children and pregnant women in families that have annual income above Medicaid eligibility levels but do not have health insurance CHIP is jointly financed by the federal government and the states and administered by the states The federal government sets basic requirements for CHIP, but states have the flexibility to design their own version of CHIP within the federal governments basic framework States may design their CHIP programs in three ways: a CHIP Medicaid expansion, a separate CHIP program, or a combination approach in which the state operates a

Again, a reasonable recommendation, but will it give a good recommendation for the hydro power bill?

In [40]:
rec.get_recommendations(new_article=house_hydro, 
                        model = nmf, 
                        vectorizer=cv,
                        training_vectors=nmf.transform(X3), 
                        n_neighbors=5, method = 'nmf_cv')

array([2879, 3484, 2372, 2741, 2614])

The recommendations seem to be about clean energy, exactly the kind that would help build a background on the topic as a whole.

In [140]:
get_from_mongo(handler.doc_ids[2879], handler.doc_names[2879])[:3000]

'Algaes Potential as a Transportation Biofuel Kelsi Bracmort Specialist in Agricultural Conservation and Natural Resources Policy April ,   Congressional Research Service - wwwcrs  R  CRS Report for Congress Prepared for Members and Committees of Congress  \x0cAlgaes Potential as a Transportation Biofuel  Summary Congress continues to debate the federal role in biofuel research, biofuel tax incentives, and renewable fuel mandates The debate touches on topics such as fuel imports and security, job creation, and environmental benefits, and is particularly significant for advanced biofuels, such as those produced by algae Congress established the Renewable Fuel Standard RFSa mandate requiring that the national fuel supply contain a minimum amount of fuel produced from renewable biomass The RFS is essentially composed of two biofuel mandatesone for unspecified biofuel, which is being met with corn-starch ethanol, and one for advanced biofuels or non-corn starch ethanol, which may not be me

In [141]:
get_from_mongo(handler.doc_ids[3484], handler.doc_names[3484])[:3000]

'Renewable Fuel Standard RFS  Overview and Issues Randy Schnepf Specialist in Agricultural Policy Brent D Yacobucci Section Research Manager March ,   Congressional Research Service - wwwcrs  R  CRS Report for Congress Prepared for Members and Committees of Congress  \x0cRenewable Fuel Standard RFS  Overview and Issues  Summary Federal policy has played a key role in the emergence of the US biofuels industry Policy measures include minimum renewable fuel usage requirements, blending and production tax credits, an import tariff, loans and loan guarantees, and research grants One of the more prominent forms of federal policy support is the Renewable Fuel Standard RFSwhereby a minimum volume of biofuels is to be used in the national transportation fuel supply each year This report describes the general nature of the RFS mandate and its implementation, and outlines some emerging issues related to the continued growth of US biofuels production needed to fulfill the expanding RFS mandate, as

In [43]:
pickle_dis(rec, 'rec')
pickle_dis(handler, 'handler')

In [66]:
chip_bill = textract.process('/Users/jonathanjramirez/Downloads/BILLS-115hr2843ih.pdf')

In [142]:
chip_bill_recs = rec.get_recommendations(new_article=chip_bill, 
                        model = nmf, 
                        vectorizer=cv,
                        training_vectors=nmf.transform(X3), 
                        n_neighbors=5, method = 'nmf_cv')

first_rec = chip_bill_recs[0]

get_from_mongo(handler.doc_ids[first_rec], handler.doc_names[first_rec])[:3000]

'Who Pays for Long-Term Services and Supports? A Fact Sheet Kirsten J Colello Specialist in Health and Aging Policy Scott R Talaga Analyst in Health Care Financing July ,   Congressional Research Service - wwwcrs  R  \x0cWho Pays for Long-Term Services and Supports? A Fact Sheet  L  ong-term services and supports LTSS refer to a broad range of health and health-related services and supports that are needed by individuals over an extended period of time The need for LTSS affects persons of all ages and is generally measured by limitations in an individuals ability to perform daily personal care activities eg, eating, bathing, dressing, walking or activities that allow individuals to live independently in the community eg, shopping, housework, meal preparation The most recent published data estimating the number of Americans in need of LTSS indicate that about  million individuals living in the community need LTSS, or  percent  of the community-resident population It was estimated anothe

The recommended readings do appear to be in within the topics of the bills.

Let's do a sanity check to make sure that the topics themselves make sense.

In [137]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [138]:
display_topics(nmf, cv.get_feature_names(), 10)


Topic  0
tsa, passenger, cargo, airport, cbp, faa, airports, transportation security, passengers, explosives

Topic  1
aca, premiums, care act, affordable care, cost sharing, affordable care act, enrollment, cms, protection affordable, protection affordable care

Topic  2
carbon, ghg, greenhouse, emission, ghg emissions, greenhouse gas, epas, power plants, fired, clean air

Topic  3
aliens, alien, visa, ina, visas, unauthorized aliens, immigrant, immigrants, lpr, ice

Topic  4
fy enacted, nsf, nih, enacted fy, nasa, request fy, federal research, national science, increase million, presidents fy

Topic  5
offense, offenses, defendant, sentencing, imprisonment, cir united, cir united states, sentence, conviction, conspiracy

Topic  6
nomination, nominations, nominees, confirmation, nominee, appointments, recess, cloture, judiciary committee, nominated

Topic  7
chinas, prc, taiwan, beijing, taiwans, hong, communist, kong, hong kong, japanese

Topic  8
tax rate, tax rates, deduction, ded

<h1> NMF TFIDF </h1>

1. Initialize an NMF

In [69]:
n_topics = 30
n_iter = 30

nmf2 = NMF(n_components=n_topics, max_iter=n_iter, random_state=42)

In [70]:
X4 = normalize(tfidf.fit_transform(handler.get_all_texts()))

Fit your data to your NMF

In [71]:
nmf_data2 = nmf2.fit_transform(X4)

In [72]:
rec.get_recommendations(new_article=house_chip, 
                        model = nmf2, 
                        vectorizer=tfidf,
                        training_vectors=nmf.transform(X4), 
                        n_neighbors=10, method = 'nmf_cv')

array([2291, 2377, 2057, 3273, 2143, 2408, 2690, 2418, 2499, 2015])

In [148]:
get_from_mongo(mongo_id = handler.doc_ids[2191], art_name = handler.doc_names[2191])[:3000]

'Drug Enforcement in the United States: History, Policy, and Trends Lisa N Sacco Analyst in Illicit Drugs and Crime Policy October ,   Congressional Research Service - wwwcrs  R  \x0cDrug Enforcement in the United States: History, Policy, and Trends  Summary The federal government prohibits the manufacturing, distribution, and possession of many intoxicating substances that are solely intended for recreational use notable exceptions are alcohol and tobacco; however, the federal government also allows for and controls the medical use of many intoxicants Federal authority to control these substances primarily resides with the Attorney General of the United States Over the last decade, the United States has shifted its stated drug control policy toward a comprehensive approach; one that focuses on prevention, treatment, and enforcement In order to restrict and reduce availability of illicit drugs in the United States, a practice referred to as supply reduction, the federal government cont

In [74]:
rec.get_recommendations(new_article=house_hydro, 
                        model = nmf2, 
                        vectorizer=tfidf,
                        training_vectors=nmf.transform(X4), 
                        n_neighbors=10, method = 'nmf_cv')

array([2104, 2745, 3421, 3199, 2565, 1063, 2667, 2473, 3119, 2533])

In [149]:
get_from_mongo(mongo_id = handler.doc_ids[2104], art_name = handler.doc_names[2104])[:3000]

' National Ambient Air Quality Standard NAAQS for Fine Particulate Matter PM  Designating Nonattainment Areas Robert Esworthy Specialist in Environmental Policy December ,   Congressional Research Service - wwwcrs  R  \x0c NAAQS for Fine Particulate Matter PM  Designating Nonattainment Areas  Summary On April , , the Environmental Protection Agency EPA published amendments to the January , , final rule designating areas for compliance with the  primary annual National Ambient Air Quality Standard NAAQS for fine particulate matter PM Revising a NAAQS established under the Clean Air Act CAA sets in motion a process under which the states and EPA identify areas that exceed the standard nonattainment areas using multi-year air quality monitoring data and other criteria, requiring states to take steps to reduce pollutant concentrations in order to meet the standard The  revisions to the PM NAAQS were the subject of considerable congressional oversight EPAs designation of nonattainment areas

In [80]:
display_topics(nmf2,tfidf.get_feature_names(),10)


Topic  0
usaid, foreign aid, state foreign operations, state foreign, hiv, oco, hiv aids, global health, operations related, foreign operations related

Topic  1
aca, premiums, cost sharing, affordable care, care act, affordable care act, health plans, cms, enrollment, premium tax

Topic  2
ghg, carbon, emission, greenhouse, ghg emissions, greenhouse gas, epas, power plants, ccs, clean air

Topic  3
aliens, ina, alien, visa, visas, unauthorized aliens, lpr, immigrant, nonimmigrant, lprs

Topic  4
tsa, fy enacted, cbp, dhs appropriations, passenger, cargo, transportation security, enacted fy, airport, directorate

Topic  5
fisa, foreign intelligence, fisc, electronic surveillance, foreign intelligence surveillance, foreign power, intelligence surveillance, patriot, usa patriot, patriot act

Topic  6
nomination, nominations, nominees, nominee, recess, confirmation, appointments, cloture, court nominations, court nominees

Topic  7
wto, tpp, fta, trade agreements, ftas, trade agreement, 

In [150]:
get_from_mongo(handler.doc_ids[1412], handler.doc_names[1412])[:3000]


'The Temporary Assistance for Needy Families TANF Block Grant: A Primer on TANF Financing and Federal Requirements Gene Falk Specialist in Social Policy November ,   Congressional Research Service - wwwcrs  RL  \x0cThe Temporary Assistance for Needy Families TANF Block Grant  Summary The Temporary Assistance for Needy Families TANF block grant provides federal grants to the  states, the District of Columbia, American Indian tribes, and the territories for a wide range of benefits, services, and activities It is best known for helping states pay for cash welfare for needy families with children, but it funds a wide array of additional activities TANF was created in the  welfare reform law PL - TANF provides a basic block grant of $ billion It also requires states to contribute in the aggregate from their own funds at least $ billion for benefits and services to needy families with childrenthis is known as the maintenance-of-effort MOE requirement States may use TANF and MOE funds in any

In [82]:
rec.get_recommendations(new_article=house_hydro, 
                        model = nmf2, 
                        vectorizer=tfidf,
                        training_vectors=nmf.transform(X4), 
                        n_neighbors=10, method = 'nmf_cv')

array([2104, 2745, 3421, 3199, 2565, 1063, 2667, 2473, 3119, 2533])

In [151]:
get_from_mongo(mongo_id = handler.doc_ids[178], art_name = handler.doc_names[178])[:3000]

'Government Transparency and Secrecy: An Examination of Meaning and Its Use in the Executive Branch Wendy Ginsberg Analyst in American National Government Maeve P Carey Analyst in Government Organization and Management L Elaine Halchin Specialist in American National Government Natalie Keegan Analyst in American Federalism and Emergency Management Policy November ,   Congressional Research Service - wwwcrs  R  CRS Report for Congress Prepared for Members and Committees of Congress  \x0cGovernment Transparency: An Examination of Its Use in the Executive Branch  Summary From the beginnings of the American federal government, Congress has required executive branch agencies to release or otherwise make available government information and records Some scholars and statesmen, including James Madison, thought access to information commonly referred to in contemporary vernacular as transparencywas an essential cornerstone of ic governance Today, the federal government attempts to balance acce

<h1> LSA CV </h1>

1. Initialize an LSA

In [84]:
n_topics = 30
n_iter = 30

lsa = TruncatedSVD(n_components = n_topics, n_iter = n_iter, random_state=42)


Fit your data to your LSA

In [85]:
X4 = cv.fit_transform(handler.get_all_texts())


In [86]:
lsa_data = lsa.fit_transform(X4)


In [87]:
rec.get_recommendations(new_article=house_chip, 
                        model = lsa, 
                        vectorizer=cv,
                        training_vectors=lsa.transform(X4), 
                        n_neighbors=3, method = 'lsa_cv')


array([2239, 2126, 2146])

In [152]:
get_from_mongo(mongo_id = handler.doc_ids[2238], art_name = handler.doc_names[2238]) [:3000]

'Hatch Act: Candidacy for Office by Federal Employees in the Executive Branch Jack Maskell Legislative Attorney July ,   Congressional Research Service - wwwcrs  R  \x0cHatch Act: Candidacy for Office by Federal Employees in the Executive Branch  Summary The federal law commonly known as the Hatch Act applies to all federal officers and employeesother than the President and Vice Presidentin the agencies, departments, bureaus, and offices of the executive branch of the federal government Under the significant amendments made to the law in , the Hatch Act now generally permits most federal employees to engage in a wide range of voluntary, partisan political activities on their own off-duty time and away from the federal workplace Some employees in specified agencies and positions, including those dealing with law enforcement and national security matters, it should be noted, may be subject to further restrictions on their off-duty partisan political activities, and may not take any activ

In [153]:
get_from_mongo(mongo_id = handler.doc_ids[2125], art_name = handler.doc_names[2125])[:3000]

'Climate Change Adaptation by Federal Agencies: An Analysis of Plans and Issues for Congress Jane A Leggett, Coordinator Specialist in Energy and Environmental Policy February ,   Congressional Research Service - wwwcrs  R  \x0cClimate Change Adaptation by Federal Agencies  Summary Though Congress has debated the significance of global climate change and what federal policies, if any, should address them, the Government Accountability Office GAO since  has identified the changing climate as one of the  most significant risks facing the federal government President Obama established adaptation as a prominent part of his Climate Action Plan in June  The November  Executive Order , Preparing the United States for the Impacts of Climate Change, directed agencies to undertake vulnerability assessments and planning for adaptation The Administration aimed efforts at reducing agencies own risks, taking advantage of no-regrets adaptation opportunities, and actions that promote resilience to cli

In [90]:
rec.get_recommendations(new_article=house_hydro, 
                        model = lsa, 
                        vectorizer=cv,
                        training_vectors=lsa.transform(X4), 
                        n_neighbors=3, method = 'lsa_cv')


array([1966,  878, 2081])

In [154]:
get_from_mongo(mongo_id = handler.doc_ids[1965], art_name = handler.doc_names[1965])[:3000]

'EPAs Clean Power Plan: Implications for the Electric Power Sector Richard J Campbell Specialist in Energy Policy November ,   Congressional Research Service - wwwcrs  R  \x0cEPAs Clean Power Plan: Implications for the Electric Power Sector  Summary On October , , the Environmental Protection Agency EPA released the final version of regulations to reduce greenhouse gas GHG emissions from existing power plants also referred to as electric generating units or EGUs by EPA Since carbon dioxide CO from fossil fuel combustion is the largest source of US GHG emissions, and fossil fuels are used for the majority of electric power generation, reducing CO emissions from power plants plays a key role in the Administrations climate change policy Under the provisions of the Clean Power Plan CPP, states must prepare plans that reduce either total CO emissions or emission rates at affected EGUs When implemented, EPA projects the state plans will reduce CO emissions from US power generation approximat

In [155]:
get_from_mongo(mongo_id = handler.doc_ids[877], art_name = handler.doc_names[877])[:3000]

'US Postal Service Workforce Size and Employment Categories, FY-FY Kathryn A Francis Analyst in Government Organization and Management October ,   Congressional Research Service - wwwcrs  RS  \x0cUS Postal Service Workforce Size and Employment Categories, FY-FY  Summary This report provides data from the past  years on the size and composition of the US Postal Services USPSs workforce Reforms to the size and composition of the workforce have been an integral part of USPSs strategy to reduce costs and regain financial solvency, particularly between FY and FY Since , USPS has experienced significant revenue losses that have affected its ability to manage its expenses Personnel costs are one of the primary drivers of USPSs operating expenses As such, USPS has employed strategies to reform the size and composition of its workforce in an effort to cut personnel costs, primarily through attrition and separation incentives and increased use of lower-cost employees These strategies reduced per

<h1> LSA TFIDF </h1>

1. Initialize an LSA

In [95]:
n_topics = 30
n_iter = 30

lsa_tfidf = TruncatedSVD(n_components = n_topics, n_iter = n_iter, random_state=42)

In [96]:
X5 = tfidf.fit_transform(handler.get_all_texts())

Fit your data to your LSA

In [97]:
lsa_data_tfidf = lsa_tfidf.fit_transform(X5)

In [98]:
rec.get_recommendations(new_article=house_chip, 
                        model = lsa_tfidf, 
                        vectorizer=tfidf,
                        training_vectors=lsa.transform(X5), 
                        n_neighbors=3, method = 'lsa_tfidf')

array([565, 755, 747])

In [156]:
get_from_mongo(mongo_id = handler.doc_ids[564], art_name = handler.doc_names[564])[:3000]

'The Greek Debt Crisis: Overview and Implications for the United States Rebecca M Nelson, Coordinator Specialist in International Trade and Finance Paul Belkin Analyst in European Affairs James K Jackson Specialist in International Trade and Finance April ,   Congressional Research Service - wwwcrs  R  \x0cThe Greek Debt Crisis: Overview and Implications for the United States  Summary Crisis Overview Since , Greece has grappled with a serious debt crisis Most economists believe that Greeces public debt,  percent  of Greek gross domestic product GDP, is unsustainable The ramifications of the debt have been felt throughout the Greek economy, which contracted by  percent  from its precrisis level A fifth of Greeks are unemployed, with youth unemployment at nearly  percent , and the Greek banking system is unstable Although other Eurozone governments, the International Monetary Fund IMF, and the European Central Bank coordinated a substantial crisis response, Greece continues to face serio

In [100]:
rec.get_recommendations(new_article=house_hydro, 
                        model = lsa_tfidf, 
                        vectorizer=tfidf,
                        training_vectors=lsa.transform(X5), 
                        n_neighbors=3, method = 'lsa_tfidf')

array([ 35,  51, 433])

In [157]:
get_from_mongo(mongo_id = handler.doc_ids[34], art_name = handler.doc_names[34])[:3000]

'Order Code RS Updated August ,   CRS Report for Congress Received through the CRS Web  Navy Ship Procurement Rate and the Planned Size of the Navy: Background and Issues for Congress Ronald ORourke Specialist in National Defense Foreign Affairs, Defense, and Trade Division  Summary There is currently no officially approved, consensus plan for the future size and structure of the Navy The absence of such a plan could complicate Congress ability to conduct oversight of the Navys budget and individual Navy ship-acquisition programs DOD is proposing to procure new Navy ships during most of its amended FY-FY Future Years Defense Plan FYDP at an average rate less than what would be required, over the long run, to maintain a Navy of  or more ships over the long run This report will be updated as events warrant  Background Historical and Current Size of the Navy The Navy reached a late-Cold War peak of  battle force ships in FY and has since been declining in size The Navy fell below  battle 

<h1> Can we optimize how LDA recommends documents </h1>

In [102]:
#First, instantiate a new recommender

rec_euclid = recommender(handler)

In [103]:
X = cv.fit_transform(handler.get_all_texts())

In [104]:
X = normalize(X)

In [105]:
n_topics = 30
n_iter = 30
lda_euclid = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=n_iter,
                                random_state=42,
                               learning_method='online')

In [106]:
data_lda_cv_euclid = lda_euclid.fit_transform(X)

In [107]:
rec_euclid.get_recommendations(new_article=house_chip, model = lda_euclid, vectorizer=cv,
                       training_vectors = lda_euclid.transform(X), n_neighbors = 10, method = 'lda_cv', metric = 'euclidean')

array([2015, 2536, 2690, 2261, 3273, 2408,  959, 2274, 1010, 2057])

In [158]:
get_from_mongo(mongo_id = handler.doc_ids[1706], art_name = handler.doc_names[1706])[:3000]

'Science and Technology Issues in the th Congress Frank Gottron, Coordinator Specialist in Science and Technology Policy May ,   Congressional Research Service - wwwcrs  R  \x0cScience and Technology Issues in the th Congress  Summary Science and technology S&T have a pervasive influence over a wide range of issues confronting the nation Public and private research and development spur scientific and technological advancement Such advances can drive economic growth, help address national priorities, and improve health and quality of life The constantly changing nature and ubiquity of science and technology frequently create public policy issues of congressional interest The federal government supports scientific and technological advancement directly by funding and performing research and development and indirectly by creating and maintaining policies that encourage private sector efforts Additionally, the federal government establishes and enforces regulatory frameworks governing many

In [109]:
rec_euclid.get_recommendations(new_article=house_hydro, model = lda_euclid, vectorizer=cv,
                       training_vectors = lda_euclid.transform(X), n_neighbors = 10, method = 'lda_cv', metric = 'euclidean')

array([1708, 3624, 3772, 3600, 2070,  837, 2696, 2646,  166, 2460])

In [159]:
get_from_mongo(mongo_id = handler.doc_ids[2810], art_name = handler.doc_names[2810])[:3000]

'Federal Depository Library Program: Issues for Congress R Eric Petersen Specialist in American National Government Jennifer E Manning Information Research Specialist Christina M Bailey Information Research Specialist March ,   Congressional Research Service - wwwcrs  R  CRS Report for Congress Prepared for Members and Committees of Congress  \x0cFederal Depository Library Program: Issues for Congress  Summary Congress established the Federal Depository Library Program FDLP to provide free public access to federal government information The programs origins date to ; the current structure of the program was established in  and is overseen by the Government Printing Office GPO Access to government information is provided through a network of depository libraries across the United States In the past half-century, information creation, distribution, retention, and preservation has expanded from a tangible, paper-based process to include digital processes managed largely through computeriz

In [112]:
rec_euclid.get_recommendations(new_article=house_chip, model = lda_euclid, vectorizer=cv,
                       training_vectors = lda_euclid.transform(X), n_neighbors = 15, method = 'lda_cv', metric = 'euclidean')

array([2015, 2536, 2690, 2261, 3273, 2408,  959, 2274, 1010, 2057, 2143,
       2292, 2270, 3256, 2281])

<h1> Euclidean Distance + LDA + TFIDF </h1>

In [114]:
rec_euclid2 = recommender(handler)

In [115]:
X2 = tfidf.fit_transform(handler.get_all_texts())
X2 = normalize(X2)

In [116]:
n_topics = 30
n_iter = 30
lda_euclid2 = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=n_iter,
                                random_state=42,
                               learning_method='online')

In [117]:
data_lda_tfidf_euclid = lda_euclid2.fit_transform(X2)

In [118]:
rec_euclid2.get_recommendations(new_article=house_chip, model = lda_euclid2, vectorizer=tfidf,
                       training_vectors = lda_euclid2.transform(X2), n_neighbors = 10, method = 'lda_tfidf', metric = 'euclidean')

array([2594,  932, 2940, 1946, 2562, 1556, 1734, 2126, 2146, 2526])

In [167]:
get_from_mongo(mongo_id = handler.doc_ids[2594], art_name = handler.doc_names[2594])[:3000]

'Medicaid Disproportionate Share Hospital Payments Alison Mitchell Specialist in Health Care Financing June ,   Congressional Research Service - wwwcrs  R  \x0cMedicaid Disproportionate Share Hospital Payments  Summary The Medicaid statute requires states to make disproportionate share hospital DSH payments to hospitals treating large numbers of low-income patients This provision is intended to recognize the disadvantaged financial situation of those hospitals because low-income patients are more likely to be uninsured or Medicaid enrollees Hospitals often do not receive payment for services rendered to uninsured patients, and Medicaid provider payment rates are generally lower than the rates paid by Medicare and private insurance As with most Medicaid expenditures, the federal government reimburses states for a portion of their Medicaid DSH expenditures based on each states federal medical assistance percentage FMAP While most federal Medicaid funding is provided on an open-ended basi

In [120]:
rec_euclid2.get_recommendations(new_article=house_hydro, model = lda_euclid2, vectorizer=tfidf,
                       training_vectors = lda_euclid2.transform(X2), n_neighbors = 10, method = 'lda_tfidf', metric = 'euclidean')

array([2475,  984, 2752, 3541, 3172, 2296, 3063, 2978, 1415, 2084])

In [166]:
get_from_mongo(mongo_id = handler.doc_ids[2475], art_name = handler.doc_names[2475])[:3000]

'The Federal Minimum Wage: In Brief David H Bradley Specialist in Labor Economics June ,   Congressional Research Service - wwwcrs  R  \x0cThe Federal Minimum Wage: In Brief  Summary The Fair Labor Standards Act FLSA, enacted in , is the federal legislation that establishes the minimum hourly wage that must be paid to all covered workers The minimum wage provisions of the FLSA have been amended numerous times since , typically for the purpose of expanding coverage or raising the wage rate Since its establishment, the minimum wage rate has been raised  separate times The most recent change was enacted in  PL -, which increased the minimum wage to its current level of $ per hour In addition to setting the federal minimum wage rate, the FLSA provides for several exemptions and subminimum wage categories for certain classes of workers and types of work Even with these exemptions, the FLSA minimum wage provisions still cover the vast majority of the workforce Despite this broad coverage, ho

<h1> Euclidean + NMF + CV </h1>

1. Initialize an NMF

In [122]:
n_topics = 30
n_iter = 30

nmf_euclid = NMF(n_components=n_topics, max_iter=n_iter, random_state=42)

In [123]:
X3 = normalize(cv.fit_transform(handler.get_all_texts()))

Fit your data to your NMF

In [124]:
nmf_data_euclid = nmf_euclid.fit_transform(X3)

In [125]:
rec_euclid.get_recommendations(new_article=house_chip, 
                        model = nmf_euclid, 
                        vectorizer=cv,
                        training_vectors=nmf.transform(X3), 
                        n_neighbors=10, method = 'nmf_cv', metric = 'euclidean')

array([2690, 3078, 2499, 2143, 3273, 1654, 2408, 2388, 1707, 2418])

In [165]:
get_from_mongo(handler.doc_ids[2690], handler.doc_names[2690])[:3000]

'Health Insurance Exchanges Under the Patient Protection and Affordable Care Act ACA Bernadette Fernandez Specialist in Health Care Financing Annie L Mach Analyst in Health Care Financing January ,   Congressional Research Service - wwwcrs  R  CRS Report for Congress Prepared for Members and Committees of Congress  \x0cHealth Insurance Exchanges Under the Patient Protection and Affordable Care Act ACA  Summary The fundamental purpose of a health insurance exchange is to provide a structured marketplace for the sale and purchase of health insurance The authority and responsibilities of an exchange may vary, depending on statutory or other requirements for its establishment and structure The Patient Protection and Affordable Care Act ACA, PL -, as amended requires health insurance exchanges to be established in every state by January ,  ACA provides certain requirements for the establishment of exchanges, while leaving other choices to be made by the states Qualified individuals and smal

In [127]:
rec_euclid.get_recommendations(new_article=house_hydro, 
                        model = nmf_euclid, 
                        vectorizer=cv,
                        training_vectors=nmf.transform(X3), 
                        n_neighbors=10, method = 'nmf_cv', metric = 'euclidean')

array([ 873, 2372, 3282, 2064, 1392, 2613, 3330, 2879, 2943, 1260])

In [164]:
get_from_mongo(handler.doc_ids[873], handler.doc_names[873])[:3000]

'Waiver Authority Under the Renewable Fuel Standard RFS Brent D Yacobucci Section Research Manager September ,   Congressional Research Service - wwwcrs  RS  CRS Report for Congress Prepared for Members and Committees of Congress  \x0cWaiver Authority Under the Renewable Fuel Standard RFS  Summary Transportation fuels are required by federal law to contain a minimum amount of renewable fuel each year This renewable fuel standard RFS, established by the Energy Policy Act of  EPAct, PL - and amended by the Energy Independence and Security Act of  EISA, PL -, requires that  billion gallons of renewable fuels be blended into gasoline and other transportation fuels in  Most of this mandate  percent  for  will be met using cornbased ethanol Other biofuels used to meet the remainder of the mandate include cellulosic biofuels, biomass-based diesel fuels, and other advanced biofuels Questions have been raised over whether the overall mandate diverts enough corn supply from food/feed production 

<h1> LSA + CV + Euclidean </h1>

1. Initialize an LSA

In [129]:
n_topics = 30
n_iter = 30

lsa_euclid = TruncatedSVD(n_components = n_topics, n_iter = n_iter, random_state=42)


Fit your data to your LSA

In [130]:
X4 = normalize(cv.fit_transform(handler.get_all_texts()))

In [131]:
lsa_cv_euclid_data = lsa_euclid.fit_transform(X4)


In [132]:
rec_euclid.get_recommendations(new_article=house_chip, 
                        model = lsa_euclid, 
                        vectorizer=cv,
                        training_vectors=lsa.transform(X4), 
                        n_neighbors=10, method = 'lsa_cv', metric = 'euclidean')


array([578, 747, 645, 519, 848, 487, 582, 565, 224, 466])

In [162]:
get_from_mongo(handler.doc_ids[578], handler.doc_names[578])[:3000]

'Cuba: Issues and Actions in the th Congress Mark P Sullivan Specialist in Latin American Affairs January ,   Congressional Research Service - wwwcrs  R  \x0cCuba: Issues and Actions in the th Congress  Summary Cuba remains a one-party communist state with a poor record on human rights The countrys political succession in  from the long-ruling Fidel Castro to his brother Ral was characterized by a remarkable degree of stability In , Ral began his second and final fiveyear term, which is scheduled to end in February , when he would be  years of age Castro has implemented a number of market-oriented economic policy changes over the past several years An April  Cuban Communist Party congress endorsed the current gradual pace of Cuban economic reform Few observers expect the government to ease its tight control over the political system While the government has released most long-term political prisoners, shortterm detentions and harassment have increased significantly over the past severa