# Developing A Search Engine for US Congressional Bills
### Holden Huntzinger -- holdenh@umich.edu
Currently includes bills 1993-2016

See the [accompanying report](https://docs.google.com/document/d/18zFmpfLCy-gGJyt9WcHooT_v3RIC75SHj0IgbeayRYo/edit?usp=sharing) for additional information, or download the corpus and other code from [GitHub](https://github.com/hhuntz/congressional_search).

## Step 1: Create an Index using PyTerrier

In [1]:
import pandas as pd
import numpy as np
import pyterrier as pt
from pyterrier.measures import *
import warnings
warnings.filterwarnings('ignore')

In [2]:
# start pyterrier
if not pt.started():
    pt.init()

PyTerrier 0.9.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [3]:
# get data
bills_df = pd.read_csv('data/us_congress_bills_1993-2016.csv', index_col = 0)
bills_df.columns = ['docno', 'text', 'summary', 'title'] # col name 'docno' prescribed by pyterrier

In [52]:
# create index
index_dir = './congressional_search_index'
iter_indexer = pt.IterDictIndexer(index_dir, overwrite=True)
bills_dict = bills_df.to_dict(orient="records") 
indexref = iter_indexer.index(bills_dict, fields=(['text']))
index = pt.IndexFactory.of(indexref)

In [5]:
# collection stats
print(index.getCollectionStatistics().toString())

Number of documents: 22218
Number of terms: 42804
Number of postings: 5367104
Number of fields: 3
Number of tokens: 18691928
Field names: [text, summary, title]
Positions:   false



## Step 2: Retrieve Documents for Annotation

The goal here is to retrieve results for sample queries so that they can be manually annotated for relevance on a scale of 1 to 5 (5 being most relevant). Annotations are loaded into this notebook below. Many of the sample queries are based on the long-running [Gallup poll](https://news.gallup.com/poll/1675/most-important-problem.aspx) of Americans that asks "What do you think is the most important problem facing the country today?"

In [6]:
# load queries
queries = pd.read_csv('data/sample_queries.csv', index_col = 0)
queries.head()

Unnamed: 0,qid,query
0,q1,elephants
1,q2,gun rights
2,q3,oil prices
3,q4,economic recession depression economy
4,q5,immigration


In [7]:
# use a few different basic weighting schemes
tfidf = pt.BatchRetrieve(index, wmodel = 'TF_IDF')
bm25 = pt.BatchRetrieve(index, wmodel = 'BM25')
pl2 = pt.BatchRetrieve(index, wmodel = 'PL2')

In [8]:
# get sample query results from base models for annotation

all_results = pd.DataFrame()

for q in queries.qid:
    
    # get results for each ranking model 
    docno_list = []
    for model in [tfidf, bm25, pl2]:
        model_res = model(queries[queries.qid == q]).head(100)
        returned_docnos = list(model_res.docno)
        docno_list.extend(returned_docnos)
    
    # get text, summary, title, and query for each returned bill
    q_results = bills_df[bills_df.docno.isin(docno_list)].copy()  # copy so adding 'qid' does not write to a slice of bills_df
    q_results['qid'] = q
    all_results = pd.concat([all_results, q_results])

results = pd.merge(all_results, queries, on = 'qid')

results.head()

Unnamed: 0,docno,text,summary,title,qid,query
0,115_s1256,SECTION 1. SHORT TITLE.\n\n This Act may be...,Ghost Army Congressional Gold Medal Act This b...,Ghost Army Congressional Gold Medal Act,q1,elephants
1,114_s27,SECTION 1. SHORT TITLE.\n\n This Act may be...,Wildlife Trafficking Enforcement Act of 2015 T...,Wildlife Trafficking Enforcement Act of 2015,q1,elephants
2,115_hr226,SECTION 1. SHORT TITLE.\n\n This Act may be...,African Elephant Conservation and Legal Ivory ...,African Elephant Conservation and Legal Ivory ...,q1,elephants
3,109_s2254,SECTION 1. FINDINGS.\n\n Congress finds tha...,Directs the Secretary of the Army to: (1) carr...,A bill to authorize the Secretary of the Army ...,q1,elephants
4,115_hr2701,SECTION 1. SHORT TITLE.\n\n This Act may be...,Ghost Army Congressional Gold Medal Act This b...,Ghost Army Congressional Gold Medal Act,q1,elephants


## Step 3: Calculate Benchmark Accuracy

I have downloaded the above dataframe and manually annotated the 2353 rows with a relevance score relative to the query for which the bill was returned by one of the basic models above. These scores range from 1 to 5 (with 5 for very relevant bills) and are saved in the 'labels' column that you can see below. These 'ground truth' values will allow me to compare different models and train machine learning models to power the search function. Before building a new model, I'll get baseline normalized discounted cumulative gain values against which to compare my own results. 
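To make the metric concrete, here is a minimal sketch of how nDCG can be computed for a single query. It assumes linear gains (the label itself) and a log2 rank discount; the evaluation library used by `pt.Experiment` may differ in details, and the labels in the example are hypothetical rather than taken from the annotation file.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_labels, judged_labels, k=None):
    """nDCG for one query.

    ranked_labels: labels of the returned documents, in ranked order
                   (0 for a document that was never annotated).
    judged_labels: labels of every annotated document for the query,
                   used to build the ideal ordering.
    """
    ideal = sorted(judged_labels, reverse=True)
    if k is not None:
        ranked_labels, ideal = ranked_labels[:k], ideal[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_labels) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical labels for one query
judged = [5, 4, 2, 1, 1]
perfect = ndcg_at_k([5, 4, 2, 1, 1], judged, k=5)   # ideal ordering
shuffled = ndcg_at_k([1, 2, 5, 1, 4], judged, k=5)  # same documents, worse order
print(round(perfect, 3), round(shuffled, 3))
```

A ranking that places the highest labels first scores 1.0; any other ordering of the same documents scores lower.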

In [9]:
# read in annotation labels
annotations = pd.read_csv('data/congress_bills_annotations.csv')
labels = annotations[['qid', 'docno', 'label']].copy()  # copy so the dtype change below does not write to a slice
labels['label'] = labels['label'].astype('int32')
labels.head()

Unnamed: 0,qid,docno,label
0,q1,115_s1256,1
1,q1,114_s27,4
2,q1,115_hr226,5
3,q1,109_s2254,2
4,q1,115_hr2701,1


In [10]:
pt.Experiment(
    [tfidf, bm25, pl2],
    queries,
    labels,
    eval_metrics=['map', nDCG, nDCG@5, nDCG@10])

Unnamed: 0,name,map,nDCG,nDCG@5,nDCG@10
0,BR(TF_IDF),0.959585,0.909306,0.769908,0.744234
1,BR(BM25),0.956574,0.907808,0.772988,0.748468
2,BR(PL2),0.951638,0.921291,0.802127,0.776127


These scores are high largely because of how the annotations were collected: the three models return similar documents, and I annotated only documents that these same models returned. That biases the comparison in favor of the baselines and may make these scores hard to beat.
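The label scale may also contribute to the high MAP values. As far as I understand, trec_eval-style average precision treats any label of 1 or more as relevant by default, and the annotations here range from 1 to 5, so every annotated bill would count as relevant. The sketch below illustrates that effect for one query; the `min_rel` parameter and the labels are hypothetical, and this is not the evaluation code PyTerrier runs.

```python
def average_precision(ranked_labels, judged_labels, min_rel=1):
    """Average precision for one query with a binary relevance cutoff.

    ranked_labels: labels of the returned documents, in ranked order
                   (0 for unannotated documents).
    judged_labels: labels of every annotated document for the query.
    min_rel:       smallest label that counts as relevant.
    """
    n_relevant = sum(1 for label in judged_labels if label >= min_rel)
    if n_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label >= min_rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / n_relevant

# Hypothetical labels for one query, in the order a model returned them
ranking = [1, 5, 2, 4, 1]
ap_any = average_precision(ranking, ranking, min_rel=1)     # every label counts
ap_strict = average_precision(ranking, ranking, min_rel=4)  # only 4s and 5s count
print(ap_any, ap_strict)
```

With a cutoff of 1, any ordering of annotated documents scores perfectly, which is why a graded metric such as nDCG is the more informative comparison here.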

Congress.gov, hosted by the Library of Congress, provides a similar search function to what I'm hoping to build here. I'll use that as the most significant benchmark for comparison, with the goal of beating the US Government. 

Congress.gov results were restricted to the relevant years and sorted by relevance via the web interface before download. For the last query ('corporate corruption greed'), Congress.gov found no results. These results were manually downloaded and then programmatically reorganized to fit the format requirements of PyTerrier. As results are not explicitly scored by Congress.gov, scores from 1 to 5 were assigned according to each result's quintile within its result list. For example, if 100 results were returned, the first 20 would get a score of 5, the next 20 would get a score of 4, and so on.

Citation: Congress.gov. "Quick Search." December 9, 2022. https://www.congress.gov/quick-search/legislation.
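The quintile scoring described above can be sketched as follows. This is an illustration rather than the script that produced `data/congress_results.csv`: the function names are hypothetical, the handling of lists whose length is not a multiple of 5 is an assumption, and the `qid`/`docno`/`rank`/`score` layout (with ranks starting at 0) follows the columns PyTerrier result frames conventionally use.

```python
def quintile_scores(n_results):
    """Return a score from 5 down to 1 for each rank position, by quintile."""
    scores = []
    for position in range(n_results):
        quintile = position * 5 // n_results  # 0 for the top fifth, 4 for the bottom fifth
        scores.append(5 - quintile)
    return scores

def to_run_rows(qid, docnos):
    """Turn an ordered list of docnos into rows with qid, docno, rank, and score."""
    scores = quintile_scores(len(docnos))
    return [
        {"qid": qid, "docno": docno, "rank": rank, "score": score}
        for rank, (docno, score) in enumerate(zip(docnos, scores))
    ]

# 100 results: positions 0-19 score 5, positions 20-39 score 4, and so on
scores_100 = quintile_scores(100)
print(scores_100[0], scores_100[19], scores_100[20], scores_100[99])

# Hypothetical docnos for one query
rows = to_run_rows("q1", ["bill_a", "bill_b", "bill_c", "bill_d", "bill_e"])
print(rows[0], rows[-1])
```

The resulting rows could be loaded into a DataFrame and passed to `pt.Experiment` alongside the retrieval models.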

In [45]:
gov_results = pd.read_csv('data/congress_results.csv')

In [172]:
pt.Experiment(
    [tfidf, bm25, gov_results],
    queries,
    labels,
    eval_metrics=['map', nDCG, nDCG@5, nDCG@10]) 

Unnamed: 0,name,map,nDCG,nDCG@5,nDCG@10
0,BR(TF_IDF),0.959585,0.909306,0.769908,0.744234
1,BR(BM25),0.956574,0.907808,0.772988,0.748468
2,Unnamed: 0 docno qid rank sco...,0.160404,0.308643,0.371456,0.341436


## Step 4: Learning to Rank

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import fastrank
import xgboost as xgb  # needed for the XGBRanker cells below

In [96]:
features_pipe = tfidf >> bm25
topics = features_pipe.transform(queries)

In [97]:
features = bm25 ** tfidf

In [98]:
tr_va_topics, test_topics = train_test_split(topics, test_size = 0.2, random_state = 42)
train_topics, valid_topics =  train_test_split(tr_va_topics, test_size = 0.2, random_state = 42)

In [99]:
# set up random forest
rf = RandomForestRegressor(n_estimators = 400, random_state = 42, n_jobs = 2)

rf_pipe = features >> pt.ltr.apply_learned_model(rf)

%time rf_pipe.fit(train_topics, labels)

CPU times: user 7.17 s, sys: 155 ms, total: 7.33 s
Wall time: 4.52 s


In [100]:
# set up fastrank
train_request = fastrank.TrainRequest.coordinate_ascent()

params = train_request.params
params.init_random = True
params.normalize = True
params.seed = 1234567

ca_pipe = features >> pt.ltr.apply_learned_model(train_request, form='fastrank')

%time ca_pipe.fit(train_topics, labels)

---------------------------
Training starts...
---------------------------
[+] Random restart #1/5...
[+] Random restart #2/5...
[+] Random restart #3/5...
[+] Random restart #4/5...
Shuffle features and optimize!
----------------------------------------
   2|Feature         |   Weight|     NDCG
----------------------------------------
Shuffle features and optimize!
----------------------------------------
   0|Feature         |   Weight|     NDCG
----------------------------------------
Shuffle features and optimize!
----------------------------------------
   1|Feature         |   Weight|     NDCG
----------------------------------------
Shuffle features and optimize!
----------------------------------------
   3|Feature         |   Weight|     NDCG
----------------------------------------
   1|0               |    0.000|    0.810
   0|0               |   -0.440|    0.381
   0|0               |   -0.540|    0.381
   0|0               |   -0.740|    0.381
   1|0               |   -0.1

In [130]:
lmart_x = xgb.sklearn.XGBRanker(objective='rank:ndcg',
      learning_rate = 0.1,
      gamma = 1.0,
      min_child_weight = 0.1,
      max_depth = 6,
      verbose = 2,  # ignored by XGBoost (see the warning below); the recognized parameter is `verbosity`
      random_state = 42)

lmart_x_pipe = features >> pt.ltr.apply_learned_model(lmart_x, form="ltr")
lmart_x_pipe.fit(train_topics, labels, valid_topics, labels)

Parameters: { "verbose" } are not used.



In [131]:
pt.Experiment(
    [bm25, tfidf, gov_results, rf_pipe, ca_pipe, lmart_x_pipe],
    topics,
    labels,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "mrt"])

Unnamed: 0,name,map,ndcg,ndcg_cut_10,mrt
0,BR(BM25),0.956574,0.907808,0.748468,0.026402
1,BR(TF_IDF),0.959585,0.909306,0.744234,0.036957
2,Unnamed: 0 docno qid rank sco...,0.160404,0.308643,0.341436,0.0
3,"Compose(FUnion(BR(BM25), BR(TF_IDF)), <pyterri...",0.960541,0.964854,0.913514,0.137471
4,"Compose(FUnion(BR(BM25), BR(TF_IDF)), <pyterri...",0.955094,0.910001,0.75337,0.132982
5,"Compose(FUnion(BR(BM25), BR(TF_IDF)), <pyterri...",0.855016,0.872096,0.644952,0.122942


## Step 5: Iterate

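Though all three learned pipelines are slower than the baselines, the random forest beats them easily (nDCG@10 of 0.914 versus 0.748 for BM25), and its nDCG is more than 3 times that of the Congress.gov search! The coordinate ascent model only roughly matches the baselines, and the XGBoost ranker falls below them. Now, the goal is just to beat the previous score (and also just to try some cool stuff).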

### More Indices, More Problems?

Our corpus also includes human-generated titles and summaries, and we can include the TF-IDF and BM25 scores for the query and each of these additional texts as features for the models.
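Conceptually, each ranker contributes one score per document, and the learned model sees those scores side by side as a feature vector. The sketch below shows that idea for a single query in plain Python. It is not PyTerrier's implementation of the `**` operator: `feature_union` is a hypothetical helper, the scores and docnos are made up, and filling in 0.0 when a ranker did not score a document is an assumption.

```python
def feature_union(*score_maps):
    """Combine per-ranker scores into one feature vector per document.

    Each argument maps docno -> score for one query and one ranker.
    Documents a ranker did not score get 0.0 for that feature (an assumption).
    """
    docnos = sorted(set().union(*score_maps))
    return {
        docno: [scores.get(docno, 0.0) for scores in score_maps]
        for docno in docnos
    }

# Hypothetical scores for one query from a full-text ranker and a title ranker
text_scores = {"bill_a": 12.1, "bill_b": 9.4}
title_scores = {"bill_b": 3.3, "bill_c": 2.0}

features_by_doc = feature_union(text_scores, title_scores)
for docno, vector in features_by_doc.items():
    print(docno, vector)
```

Each extra index adds one column to every vector, which is also why every query now has to be run against every index before the model can score anything.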

In [55]:
# build title index
index_dir = './congressional_search__title_index'
iter_indexer = pt.IterDictIndexer(index_dir, overwrite=True)
bills_dict = bills_df.to_dict(orient="records") 
indexref = iter_indexer.index(bills_dict, fields=(['title']))
title_index = pt.IndexFactory.of(indexref)

# get collection stats
print(title_index.getCollectionStatistics().toString())

Number of documents: 22218
Number of terms: 7614
Number of postings: 232926
Number of fields: 1
Number of tokens: 250795
Field names: [title]
Positions:   false



In [56]:
# build summary index
index_dir = './congressional_search__summary_index'
iter_indexer = pt.IterDictIndexer(index_dir, overwrite=True)
bills_dict = bills_df.to_dict(orient="records") 
indexref = iter_indexer.index(bills_dict, fields=(['summary']))
summary_index = pt.IndexFactory.of(indexref)

# get collection stats
print(summary_index.getCollectionStatistics().toString())

Number of documents: 22218
Number of terms: 16982
Number of postings: 1401326
Number of fields: 1
Number of tokens: 2280046
Field names: [summary]
Positions:   false



In [102]:
tfidf_title = pt.BatchRetrieve(title_index, wmodel = 'TF_IDF')
bm25_title = pt.BatchRetrieve(title_index, wmodel = 'BM25')
tfidf_summary = pt.BatchRetrieve(summary_index, wmodel = 'TF_IDF')
bm25_summary = pt.BatchRetrieve(summary_index, wmodel = 'BM25')

In [121]:
multiindex_features = bm25 ** tfidf ** bm25_title ** bm25_summary

In [122]:
# set up multi-index random forest
rf = RandomForestRegressor(n_estimators = 400, random_state = 42, n_jobs = 2)

rf_multiindex_pipe = multiindex_features >> pt.ltr.apply_learned_model(rf)

%time rf_multiindex_pipe.fit(train_topics, labels)

CPU times: user 14.2 s, sys: 276 ms, total: 14.5 s
Wall time: 9.67 s


In [123]:
# set up multi-index fastrank
train_request = fastrank.TrainRequest.coordinate_ascent()

params = train_request.params
params.init_random = True
params.normalize = True
params.seed = 1234567

ca_multiindex_pipe = multiindex_features >> pt.ltr.apply_learned_model(train_request, form='fastrank')

%time ca_multiindex_pipe.fit(train_topics, labels)

---------------------------
Training starts...
---------------------------
[+] Random restart #1/5...
[+] Random restart #4/5...
[+] Random restart #3/5...
[+] Random restart #2/5...
Shuffle features and optimize!
----------------------------------------
   0|Feature         |   Weight|     NDCG
----------------------------------------
Shuffle features and optimize!
----------------------------------------
Shuffle features and optimize!
----------------------------------------
   1|Feature         |   Weight|     NDCG
   2|Feature         |   Weight|     NDCG
----------------------------------------
----------------------------------------
Shuffle features and optimize!
----------------------------------------
   3|Feature         |   Weight|     NDCG
----------------------------------------
   1|3               |    0.000|    0.762
   0|3               |    0.228|    0.612
   0|3               |    0.328|    0.667
   0|3               |    0.528|    0.708
   3|3               |    0.4

   4|1               |   -0.000|    0.808
   4|1               |   -0.000|    0.808
   4|1               |   -0.200|    0.813
   0|1               |    0.000|    0.831
   4|2               |    0.000|    0.813
   0|1               |    0.028|    0.831
   4|2               |   -0.000|    0.813
   4|2               |    0.000|    0.813
   4|2               |    0.000|    0.813
   4|2               |    0.001|    0.813
   1|0               |    0.411|    0.835
   4|2               |    0.003|    0.813
   4|2               |    0.006|    0.814
   4|2               |    0.012|    0.815
   4|2               |    0.050|    0.821
   4|2               |    0.099|    0.826
   4|2               |    0.199|    0.828
---------------------------
Shuffle features and optimize!
----------------------------------------
   4|Feature         |   Weight|     NDCG
----------------------------------------
   3|0               |   -0.014|    0.834
   4|3               |    0.050|    0.829
   4|3             

In [128]:
lmart_x_multiindex = xgb.sklearn.XGBRanker(objective='rank:ndcg',
      learning_rate=0.1,
      gamma=1.0,
      min_child_weight=0.1,
      max_depth=6,
      verbose=2,  # ignored by XGBoost (see the warning below); the recognized parameter is `verbosity`
      random_state=42)

lmart_x_multiindex_pipe = multiindex_features >> pt.ltr.apply_learned_model(lmart_x_multiindex, form="ltr")
lmart_x_multiindex_pipe.fit(train_topics, labels, valid_topics, labels)

Parameters: { "verbose" } are not used.



In [129]:
pt.Experiment(
    [bm25, tfidf, gov_results, rf_pipe, ca_pipe, rf_multiindex_pipe, ca_multiindex_pipe, lmart_x_multiindex_pipe],
    topics,
    labels,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "mrt"])

Unnamed: 0,name,map,ndcg,ndcg_cut_10,mrt
0,BR(BM25),0.956574,0.907808,0.748468,0.031461
1,BR(TF_IDF),0.959585,0.909306,0.744234,0.045614
2,Unnamed: 0 docno qid rank sco...,0.160404,0.308643,0.341436,0.0
3,"Compose(FUnion(BR(BM25), BR(TF_IDF)), <pyterri...",0.960541,0.964854,0.913514,0.149241
4,"Compose(FUnion(BR(BM25), BR(TF_IDF)), <pyterri...",0.955094,0.910001,0.75337,0.106049
5,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.959618,0.969221,0.937189,0.246948
6,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.877736,0.912184,0.790205,0.229661
7,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.848265,0.895405,0.733043,0.237494


Separately indexing the titles and summaries and using the BM25 scores for these indices as features does slightly improve search. The biggest gain is at the top of the ranking: for the random forest, nDCG@10 rises from 0.914 to 0.937 (about 2.4 points). It does seriously slow down the search, though: the random forest was already about 5 times slower than BM25, and the additional indices raise its mean response time from 0.149 to 0.247.
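The size of these changes can be checked directly from the table above. The short sketch below copies the random forest and BM25 values from that table and separates the absolute change from the relative one; the `compare` helper is just for illustration.

```python
def compare(before, after):
    """Return the absolute change and the relative change from before to after."""
    absolute = after - before
    return absolute, absolute / before

# Values copied from the results table above
rf_ndcg10, rf_multi_ndcg10 = 0.913514, 0.937189
bm25_mrt, rf_mrt, rf_multi_mrt = 0.031461, 0.149241, 0.246948

ndcg_abs, ndcg_rel = compare(rf_ndcg10, rf_multi_ndcg10)
rf_vs_bm25 = rf_mrt / bm25_mrt     # slowdown of the single-index random forest
multi_vs_rf = rf_multi_mrt / rf_mrt  # further slowdown from the extra indices

print(f"nDCG@10 change: {ndcg_abs:+.4f} absolute, {ndcg_rel:+.1%} relative")
print(f"random forest vs BM25 response time: {rf_vs_bm25:.1f}x")
print(f"multi-index vs single-index response time: {multi_vs_rf:.2f}x")
```

Response times vary from run to run (the same BM25 pipeline reports different `mrt` values in each table), so these ratios are best read as rough indications.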

### Tuning the Random Forest

I also tried indexing all facets of the corpus (text, title, and summary) in one index, but got slightly worse results than with only indexing the text. Because the random forest model seems to be much better than the others, I'll do a bit of tuning to see if I can eke out slightly better nDCG or speed up the timing at all. I tried 100, 200, 300, 400, 500, and 600 estimators; this changes how long the model takes to train (100 takes ~4.3 seconds, while 600 takes ~12), but makes almost no difference to nDCG. The settings compared below differ by less than 0.01 in nDCG@10, so I'll keep 400.

In [162]:
# random forest w/ 600 estimators
rf = RandomForestRegressor(n_estimators = 600, random_state = 42, n_jobs = 2)
rf_600_pipe = multiindex_features >> pt.ltr.apply_learned_model(rf)
%time rf_600_pipe.fit(train_topics, labels)

CPU times: user 5.51 s, sys: 77.1 ms, total: 5.59 s
Wall time: 4.32 s


In [163]:
# random forest w/ 200 estimators
rf = RandomForestRegressor(n_estimators = 200, random_state = 42, n_jobs = 2)
rf_200_pipe = multiindex_features >> pt.ltr.apply_learned_model(rf)
%time rf_200_pipe.fit(train_topics, labels)

CPU times: user 9.04 s, sys: 152 ms, total: 9.19 s
Wall time: 6.49 s


In [164]:
pt.Experiment(
    [bm25, tfidf, gov_results, rf_200_pipe, rf_multiindex_pipe, rf_600_pipe],
    topics,
    labels,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "mrt"])

Unnamed: 0,name,map,ndcg,ndcg_cut_10,mrt
0,BR(BM25),0.956574,0.907808,0.748468,0.035066
1,BR(TF_IDF),0.959585,0.909306,0.744234,0.032767
2,Unnamed: 0 docno qid rank sco...,0.160404,0.308643,0.341436,0.0
3,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.958904,0.970156,0.939071,0.273361
4,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.959618,0.969221,0.937189,0.29179
5,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.957601,0.96758,0.932107,0.251668


I also tried to tune based on the number of concurrent jobs; I wondered if this would speed up the model training. It did, and it also gave a very marginal benefit to mean response time. 

In [165]:
# random forest w/ 1 job
rf = RandomForestRegressor(n_estimators = 400, random_state = 42, n_jobs = 1)
rf_1_pipe = multiindex_features >> pt.ltr.apply_learned_model(rf)
%time rf_1_pipe.fit(train_topics, labels)

CPU times: user 11 s, sys: 136 ms, total: 11.2 s
Wall time: 11.3 s


In [166]:
# random forest w/ 3 jobs
rf = RandomForestRegressor(n_estimators = 400, random_state = 42, n_jobs = 3)
rf_3_pipe = multiindex_features >> pt.ltr.apply_learned_model(rf)
%time rf_3_pipe.fit(train_topics, labels)

CPU times: user 15.2 s, sys: 192 ms, total: 15.4 s
Wall time: 7.34 s


In [167]:
# random forest w/ 4 jobs
rf = RandomForestRegressor(n_estimators = 400, random_state = 42, n_jobs = 4)
rf_4_pipe = multiindex_features >> pt.ltr.apply_learned_model(rf)
%time rf_4_pipe.fit(train_topics, labels)

CPU times: user 18.5 s, sys: 401 ms, total: 18.9 s
Wall time: 8.52 s


In [169]:
# random forest w/ all jobs (-1 uses all available CPU)
rf = RandomForestRegressor(n_estimators = 400, random_state = 42, n_jobs = -1)
rf_all_pipe = multiindex_features >> pt.ltr.apply_learned_model(rf)
%time rf_all_pipe.fit(train_topics, labels)

CPU times: user 16.3 s, sys: 235 ms, total: 16.5 s
Wall time: 6.76 s


In [171]:
pt.Experiment(
    [bm25, tfidf, gov_results, rf_1_pipe, rf_multiindex_pipe, rf_3_pipe, rf_4_pipe, rf_all_pipe],
    topics,
    labels,
    eval_metrics=["map", "ndcg", "ndcg_cut_10", "mrt"])

Unnamed: 0,name,map,ndcg,ndcg_cut_10,mrt
0,BR(BM25),0.956574,0.907808,0.748468,0.028733
1,BR(TF_IDF),0.959585,0.909306,0.744234,0.031061
2,Unnamed: 0 docno qid rank sco...,0.160404,0.308643,0.341436,0.0
3,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.959618,0.969221,0.937189,0.262539
4,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.959618,0.969221,0.937189,0.22837
5,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.959618,0.969221,0.937189,0.225564
6,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.959618,0.969221,0.937189,0.21742
7,"Compose(FUnion(BR(BM25), FUnion(BR(TF_IDF), FU...",0.959618,0.969221,0.937189,0.222468


Using all available CPUs is the fastest to train (6.76 s wall time), but it is very slightly slower to return results than 4 jobs (mean response time of 0.222 versus 0.217), and I don't yet understand why. For this reason, I'll stick with 4 jobs.