# Draft Paper - Jennings Gamble

In [11]:
import requests 
import pandas as pd 

**Problem statement and hypothesis**

The idea of this project is to predict the amount of money that an industry's lobbyists pay to get a piece of legislation passed. My hypothesis is that certain industries require much more funding, spread over a much broader selection of committees, than others. For instance, the tobacco industry may have to lobby much harder than the telecommunications industry. I would like to arrive at a dollar figure that gives an X% likelihood that a bill will pass.

**Description of dataset and how it was obtained**

The dataset has been cobbled together from three sources: 
1. OpenSecrets: my source of lobbying dollars. 
2. GovTrack: my source of congressional voting data. 
3. Congress Bio Guide: my source of information on the terms and jobs of congresspeople. 

**Description of pre-processing**

Preprocessing isn't necessary for much of the data gathered via API. However, I'm using bulk tables for many of the lobbying dollar totals going back to 1998. One issue with pulling information from these tables is the non-traditional delimiter. Because the tables use official names for congresspeople, the fields themselves contain periods and commas, so columns are separated by '|,|'. This can be tough to work around, since many of the methods we have for pulling information from CSVs or JSON expect a single-character separator rather than a multi-character or regex one.
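
A minimal sketch of one possible workaround, assuming every field in the bulk tables is wrapped in pipes so that a row looks like `|a|,|b|,|c|` (which is what a '|,|' separator implies). Under that assumption, Python's standard `csv` module can treat the pipe as the quote character and the comma as the delimiter, so commas and periods inside names are left alone. The sample rows and the `parse_pipe_quoted` helper are made up for illustration and are not taken from the OpenSecrets tables.

```python
import csv
import io

# Hypothetical rows in the ASSUMED bulk-table layout: every field is
# wrapped in pipes, and fields are separated by commas.
sample = (
    "|2015|,|Doe, Jane Q.|,|Example Industry|,|1000|\n"
    "|2015|,|Roe, Richard, Jr.|,|Example Industry|,|2500|\n"
)

def parse_pipe_quoted(text):
    """Parse pipe-quoted, comma-separated text into a list of rows."""
    reader = csv.reader(io.StringIO(text), delimiter=',', quotechar='|')
    return [row for row in reader]

rows = parse_pipe_quoted(sample)
for row in rows:
    print(row)
```

If the real files follow the same layout, the same idea should carry over to pandas by passing `sep=','` and `quotechar='|'` to `pd.read_csv`. If some fields turn out not to be pipe-wrapped, this approach would need to be checked against the actual files.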

**Details learned from exploring the data**

OpenSecrets has an API that allows a user to make a number of different calls into the OpenSecrets database. First I investigated the `congCmteIndus` method, which pulls summary fundraising information by committee, industry, and congress number. 

In order to make calls using this method, it's important to understand the values you'll be passing in: 
1. Congress number is simple. It's the congress session whose data you want to pull. The current congress, which started on January 3rd, 2017, is the 115th. 
2. Industry has a hierarchy to it. The API takes a parameter called "catorder", which is a 3-character string of one capital letter and 2 digits. Each catorder lives within a sector and maps to an industry. Below the industry there are "catnames", which signify more specific categories. 
<img src="pictures/Sector Map.png">
3. Committee codes can be captured in a separate table.
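
The three inputs above can be sketched as a small URL builder. This is an illustration under stated assumptions rather than the exact calls used for this project: the base URL and the query parameter names (`congno`, `indus`, `cmte`, `output`, `apikey`) are my reading of the OpenSecrets API documentation and should be checked against it, and the helper names, the API key, and the example codes `A02` and `HAGR` are placeholders to be replaced with values from the industry and committee tables.

```python
import re
from urllib.parse import urlencode

# ASSUMED base URL for the OpenSecrets API; verify against the API docs.
BASE_URL = "https://www.opensecrets.org/api/"

def is_valid_catorder(code):
    """True if code is one capital letter followed by two digits."""
    return re.fullmatch(r"[A-Z][0-9]{2}", code) is not None

def cong_cmte_indus_url(apikey, congno, indus, cmte):
    """Build a congCmteIndus request URL (parameter names are assumed)."""
    if not is_valid_catorder(indus):
        raise ValueError("catorder must look like 'A02': %r" % (indus,))
    params = [
        ("method", "congCmteIndus"),
        ("congno", congno),   # congress number, e.g. 114
        ("indus", indus),     # catorder code for the industry
        ("cmte", cmte),       # committee code from the committee table
        ("output", "json"),
        ("apikey", apikey),
    ]
    return BASE_URL + "?" + urlencode(params)

# Placeholder key and example codes; look up real codes before use.
url = cong_cmte_indus_url("YOUR_API_KEY", 114, "A02", "HAGR")
print(url)
```

With `requests` imported as at the top of this notebook, something like `requests.get(url).json()` would then fetch the response, assuming a valid API key.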

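There are 18 sectors of varying sizes (the `Sector` entry in the output below, with a count of 1, appears to be the file's header line read in as data): 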

In [12]:
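# Tab-delimited lookup table of OpenSecrets industry category codes
path = './data/all_industries.txt'
ic_names = ['catcode', 'catname', 'catorder', 'industry', 'sector', 'sector long']
industry_codes = pd.read_csv(path, sep='\t', names=ic_names)
# Count how many category codes fall under each sector
print(industry_codes.sector.value_counts())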

Misc Business            103
Ideology/Single-Issue     46
Transportation            37
Finance/Insur/RealEst     36
Communic/Electronics      35
Health                    32
Energy/Nat Resource       30
Labor                     30
Agribusiness              30
Other                     23
Construction              22
Defense                    8
Lawyers & Lobbyists        6
Non-contribution           6
Unknown                    5
Joint Candidate Cmtes      5
Party Cmte                 4
Sector                     1
Candidate                  1
Name: sector, dtype: int64


There are far more industries: 

In [13]:
# Count how many category codes fall under each industry
print(industry_codes.industry.value_counts())

Misc Manufacturing & Distributing      31
Misc Issues                            13
Misc Unions                            11
Retail Sales                           11
Business Services                      10
Health Professionals                   10
Misc Transport                         10
Securities & Investment                 9
Air Transport                           9
Health Services/HMOs                    9
Agricultural Services/Products          9
Oil & Gas                               9
TV/Movies/Music                         9
Misc Services                           9
Pharmaceuticals/Health Products         8
Real Estate                             8
Electronics Mfg & Equip                 8
Building Materials & Equipment          7
Education                               7
Food & Beverage                         7
Automotive                              7
Misc Finance                            6
Industrial Unions                       6
Business Associations             

Let's look at this for a single industry's contributions to a single committee in a single congress: the tobacco industry's contributions to the House Agriculture Committee in the 114th congress. The tobacco industry contributed $97,518 to this single committee in this single congress. That total is split between individual contributions and PAC contributions. 48 percent of that money went to the top 4 of the 22 committee members returned. 
<img src="pictures/Tobacco to House Agriculture.png">
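
A concentration figure like the 48 percent above can be computed from per-member totals along these lines. This is a sketch only: `top_n_share` is a hypothetical helper, and the numbers in `example_totals` are placeholders chosen to make the arithmetic easy to follow, not the actual per-member contributions.

```python
def top_n_share(totals, n):
    """Fraction of the summed totals that goes to the n largest entries."""
    grand_total = sum(totals)
    if grand_total == 0:
        return 0.0
    top = sorted(totals, reverse=True)[:n]
    return sum(top) / float(grand_total)

# Placeholder per-member totals, NOT the real committee figures.
example_totals = [400, 300, 200, 100, 0, 0]

share = top_n_share(example_totals, 2)
print("{:.0%}".format(share))  # 70% for these placeholder numbers
```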


198 bills moved through this committee during the 114th congress. 16 of these passed the House, and 9 were enacted. None of the enacted bills appeared to include the word "tobacco" in their text. However, 7 bills in this committee that did not pass did contain the word "tobacco". In light of Monday's class, my next step is to use natural language processing to parse these bills and see whether their stance on tobacco is positive or negative. This will allow me to establish whether a bill was good or bad for the industry and, therefore, whether lobbyists would have been paying to pass the bill or to kill it.  
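
A keyword check of this kind can be sketched as a whole-word, case-insensitive search over bill text. The bill titles and texts below are invented for illustration, and `mentions` and `count_mentions` are hypothetical helpers, not the code used to produce the counts above.

```python
import re

def mentions(text, word):
    """True if word appears in text as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def count_mentions(bill_texts, word):
    """Count the bills whose text mentions word at least once."""
    return sum(1 for text in bill_texts.values() if mentions(text, word))

# Invented example texts, NOT real bills.
bills = {
    "Example Bill A": "A bill to amend crop insurance rules for tobacco growers.",
    "Example Bill B": "A bill to reauthorize rural broadband grants.",
    "Example Bill C": "TOBACCO product labeling, and for other purposes.",
}

print(count_mentions(bills, "tobacco"))  # 2 of the 3 example texts match
```

A plain keyword match only shows that a bill touches on tobacco. It says nothing about whether the bill helps or hurts the industry, which is why a sentiment-style analysis would still be needed.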

**Choosing the features for analysis**

The main features for analysis are currently details about legislation (sponsors, co-sponsors, enactment status, etc.) and details about lobbying dollars (the companies the dollars come from, the committees the dollars went to, the amount of funding, etc.). For now, I believe these are the features that will end up being the most predictive. 

The prediction target (Y) is the dollar amount an industry has to spend on lobbying to give a bill an X% chance of passing. 
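
One possible way to make this target concrete, offered only as an illustration since the modeling approach is not yet settled: assume the probability of passage follows a logistic curve in lobbying dollars, then invert the curve to get the dollars needed for a target probability. The logistic form, the function names, and the coefficients below are all assumptions; nothing here has been fit to the data.

```python
import math

def pass_probability(dollars, intercept, slope):
    """P(pass) under an ASSUMED logistic model in lobbying dollars."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * dollars)))

def dollars_for_probability(target, intercept, slope):
    """Invert the logistic model: dollars needed to reach target P(pass)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if slope == 0:
        raise ValueError("slope must be non-zero to invert the model")
    logit = math.log(target / (1.0 - target))
    return (logit - intercept) / slope

# Placeholder coefficients, NOT estimated from any data.
intercept, slope = -3.0, 0.00002

needed = dollars_for_probability(0.5, intercept, slope)
print(round(needed))  # 150000 for these placeholder coefficients
```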

**Details of the modeling process**

Currently still working on the modelling process...

**Challenges**

The biggest challenges so far: 
1. Not all committee members are returned by the OpenSecrets API. This is puzzling, because the API does return an object for congresspeople who had $0 contributed to them, so a lack of contributions doesn't explain the gaps. I need to figure out why. 
2. It's a lot easier to measure industries that give a lot of money to get a bill passed than industries that give money to kill a bill. This is because bills can be killed in so many ways that aren't recorded. For instance, the House and the Senate can kill a bill simply by choosing not to hold a vote on it. This leaves behind no voting record to parse. This often happens when the votes aren't there to get the measure passed. 
3. Scope. I would need to apply natural language processing to the text of each bill to adequately parse out what it is about, because a lot of bills have very general names like the "Agriculture Reauthorization Act of 2015". Running these processes on all bills in just a single committee would be very resource intensive. Running them for all bills in congress would be infeasible. This means I have to make some decisions about which committees to choose. 
4. Figuring out the lobbying dollars spent on particular bills is still a challenge. I'm investigating better ways to find this link. 
5. Bill mortality rates are very high. In the committee examined above, only 9 of the 198 bills were enacted. 
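
To put numbers on the last point, the counts reported earlier for the committee examined in the 114th congress (198 bills, 16 passed the House, 9 enacted) work out as follows. The variable names and the `rate` helper are mine, and "mortality" is taken here to mean the share of bills that were not enacted, which is an assumption about how the term should be defined.

```python
# Counts reported earlier in this paper for one committee, 114th congress.
bills_in_committee = 198
passed_house = 16
enacted = 9

def rate(count, total):
    """Share of total represented by count."""
    return count / float(total)

house_pass_rate = rate(passed_house, bills_in_committee)
enactment_rate = rate(enacted, bills_in_committee)
# ASSUMED definition: a bill "dies" if it is never enacted.
mortality_rate = 1.0 - enactment_rate

print("Passed the House: {:.1%}".format(house_pass_rate))  # 8.1%
print("Enacted: {:.1%}".format(enactment_rate))            # 4.5%
print("Not enacted: {:.1%}".format(mortality_rate))        # 95.5%
```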

**Possible extensions or business applications of your project**

There are a few business applications for this project. It would obviously be useful for lobbying organizations themselves, but I'm more interested in applications around public dissemination of lobbying information. It would be valuable for constituents to understand that, say, $50,000 for a tobacco initiative has the potential to sway a public health vote, and that it has done so X times in the past. As for extensions, I can always redo the same analysis on additional congressional committees. 

**Conclusions and key learnings**

Currently still working on conclusions and key learnings...