# A Search Engine for US Congressional Bills
### Holden Huntzinger -- holdenh@umich.edu
Currently includes bills 1993-2016

See the [accompanying report](https://docs.google.com/document/d/18zFmpfLCy-gGJyt9WcHooT_v3RIC75SHj0IgbeayRYo/edit?usp=sharing) for additional information, or download the corpus and other code from [GitHub](https://github.com/hhuntz/congressional_search).

## Step 1: Create an Index Using PyTerrier

In [78]:
import pandas as pd
import numpy as np
import pyterrier as pt
import warnings
warnings.filterwarnings('ignore')

In [2]:
# start pyterrier
if not pt.started():
    pt.init()

PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [30]:
# get data
bills_df = pd.read_csv('data/us_congress_bills_1993-2016.csv', index_col = 0)
bills_df.columns = ['docno', 'text', 'summary', 'title'] # col name 'docno' prescribed by pyterrier

In [33]:
# create index
index_dir = './congressional_search_index'
iter_indexer = pt.IterDictIndexer(index_dir, overwrite=True)
bills_dict = bills_df.to_dict(orient="records") 
indexref = iter_indexer.index(bills_dict, fields=('text', 'summary', 'title'))
index = pt.IndexFactory.of(indexref)

In [34]:
# collection stats
print(index.getCollectionStatistics().toString())

Number of documents: 22218
Number of terms: 42804
Number of postings: 5367104
Number of fields: 3
Number of tokens: 18691928
Field names: [text, summary, title]
Positions:   false
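
As a quick sanity check on the statistics above, a few averages can be derived by hand. This is only a sketch: it assumes that "postings" counts (term, document) pairs and that "tokens" are counted as the indexer saw them, after whatever preprocessing it applied. The variable names are illustrative and not part of the notebook.

```python
# Values copied from the collection statistics printed above.
num_docs = 22218
num_terms = 42804
num_postings = 5367104
num_tokens = 18691928

# Mean number of indexed tokens per bill.
avg_doc_length = num_tokens / num_docs

# Mean number of bills in which a given term appears
# (assumes one posting per (term, document) pair).
avg_docs_per_term = num_postings / num_terms

# Mean number of distinct terms per bill, under the same assumption.
avg_unique_terms_per_doc = num_postings / num_docs

print(f"{avg_doc_length:.1f} tokens per document")
print(f"{avg_docs_per_term:.1f} documents per term")
print(f"{avg_unique_terms_per_doc:.1f} distinct terms per document")
```

Average document length is worth knowing because length-normalizing models such as BM25 compare each document's length against it.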



## Step 2: Retrieve Documents for Annotation

The goal here is to retrieve results for sample queries so that they can be manually annotated for relevance on a scale of 1 to 5, with 5 being most relevant. The annotations are loaded into this notebook below. Many of the sample queries are based on the long-running [Gallup poll](https://news.gallup.com/poll/1675/most-important-problem.aspx) that asks Americans, "What do you think is the most important problem facing the country today?"

In [85]:
# load queries
queries = pd.read_csv('data/sample_queries.csv', index_col = 0)
queries.head()

| | qid | query |
|---|---|---|
| 0 | q1 | elephants |
| 1 | q2 | gun rights |
| 2 | q3 | oil prices |
| 3 | q4 | economic recession depression economy |
| 4 | q5 | immigration |


In [43]:
# use different basic weighting schemes
tfidf = pt.BatchRetrieve(index, wmodel = 'TF_IDF')
bm25 = pt.BatchRetrieve(index, wmodel = 'BM25')
pl2 = pt.BatchRetrieve(index, wmodel = 'PL2')

In [80]:
# get sample query results from base models for annotation

all_results = pd.DataFrame()

for q in queries.qid:
    
    # get results for each ranking model 
    docno_list = []
    for model in [tfidf, bm25, pl2]:
        model_res = model(queries[queries.qid == q]).head(100)
        returned_docnos = list(model_res.docno)
        docno_list.extend(returned_docnos)
    
    # get text, summary, title, and query for each returned bill
    # copy the filtered rows so that adding a column does not write to a view of bills_df
    q_results = bills_df[bills_df.docno.isin(docno_list)].copy()
    q_results['qid'] = q
    all_results = pd.concat([all_results, q_results])

results = pd.merge(all_results, queries, on = 'qid')

results.head()

| | docno | text | summary | title | qid | query |
|---|---|---|---|---|---|---|
| 0 | 115_s1256 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Ghost Army Congressional Gold Medal Act This b... | Ghost Army Congressional Gold Medal Act | q1 | elephants |
| 1 | 114_s27 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Wildlife Trafficking Enforcement Act of 2015 T... | Wildlife Trafficking Enforcement Act of 2015 | q1 | elephants |
| 2 | 115_hr226 | SECTION 1. SHORT TITLE.\n\n This Act may be... | African Elephant Conservation and Legal Ivory ... | African Elephant Conservation and Legal Ivory ... | q1 | elephants |
| 3 | 109_s2254 | SECTION 1. FINDINGS.\n\n Congress finds tha... | Directs the Secretary of the Army to: (1) carr... | A bill to authorize the Secretary of the Army ... | q1 | elephants |
| 4 | 115_hr2701 | SECTION 1. SHORT TITLE.\n\n This Act may be... | Ghost Army Congressional Gold Medal Act This b... | Ghost Army Congressional Gold Medal Act | q1 | elephants |
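
The loop above builds an annotation pool: for each query it takes the top 100 results from every model and keeps each bill once, however many models returned it. The sketch below shows the same idea in plain Python on toy data. The function name `pool_top_k` and the document ids are hypothetical, and it preserves first-seen order, which the `isin` filter in the notebook does not attempt to do.

```python
def pool_top_k(rankings, k=100):
    """Union the top-k docnos from several rankings, keeping first-seen order."""
    pooled = []
    seen = set()
    for ranked_docnos in rankings.values():
        for docno in ranked_docnos[:k]:
            if docno not in seen:
                seen.add(docno)
                pooled.append(docno)
    return pooled


# Made-up rankings for a single query; the ids are placeholders, not real bills.
toy_rankings = {
    "TF_IDF": ["d1", "d2", "d3"],
    "BM25": ["d2", "d1", "d4"],
    "PL2": ["d5", "d2", "d1"],
}

pool = pool_top_k(toy_rankings, k=2)
print(pool)  # ['d1', 'd2', 'd5']
```

Because the three models often agree on the top results, the pool is usually much smaller than 3 × k, which keeps the manual annotation workload manageable.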


In [None]:
# read annotations in as df w/ columns=['qid','docno','label'] for experiments
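
The annotation file itself is not shown here, so the following is only a sketch of what loading it might look like. It assumes a CSV with exactly the three columns named in the comment above; the in-memory sample and its labels are placeholders invented for illustration, not real annotations.

```python
import io

import pandas as pd

# Stand-in for the annotation CSV. The docnos come from the sample output above,
# but the labels are invented placeholders, not actual relevance judgments.
sample_csv = io.StringIO(
    "qid,docno,label\n"
    "q1,114_s27,5\n"
    "q1,115_hr226,5\n"
    "q1,115_s1256,1\n"
)

# Read ids as strings and labels as integers.
qrels = pd.read_csv(sample_csv, dtype={"qid": str, "docno": str, "label": int})

# Basic validation before using the annotations in experiments.
assert list(qrels.columns) == ["qid", "docno", "label"]
assert qrels["label"].between(1, 5).all(), "labels should be on the 1-5 scale"
assert not qrels.duplicated(["qid", "docno"]).any(), "one label per (query, bill) pair"

print(qrels)
```

A frame in this shape could then be passed as the `qrels` argument of `pt.Experiment`, alongside the retrieval models and the `queries` frame.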
