# An Analysis of Political Contributions During the 2020 House of Representatives Election

#### In this part, you will obtain as much data as you can on the campaign contributions received by each candidate. This data is available through the website https://www.opensecrets.org/.

## Part 1: Data Gathering

In [4]:
import requests
from requests.exceptions import HTTPError  # needed by the except clauses below
import pandas as pd
from bs4 import BeautifulSoup
from IPython.core.display import HTML
import io
import re
import regex
import csv
from datetime import datetime as dt
import urllib3
# import re2


### 1. Start by acquiring the data from Tennessee's 7th District, which is available at https://www.opensecrets.org/races/summary?cycle=2020&id=TN07&spec=N. If you click the "Download .csv file", you can get a csv for this district. However, we don't want to have to click this button across all districts. Instead, we'll use Python to help automate this process. Start by sending a get request to the download button URL, https://www.opensecrets.org/races/summary.csv?cycle=2020&id=TN07. Convert the result to a DataFrame.

In [6]:
#Nitin's code
url = 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=TN07'

# Wrap every HTTP request in try/except.

# .raise_for_status() raises an HTTPError for 4xx and 5xx status codes.
# For a successful response it does nothing and execution continues.

try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
except Exception as err:
    print(f"Other error occurred: {err}")
else:
    data = response.content.decode('utf8')
    df = pd.read_csv(io.StringIO(data))
    # Populate the district ID column with TN07 so that we can use it later
    df['DistIDCurr'] = 'TN07'

df.head(2)

Unnamed: 0,cid,FirstLastP,Rcpts,Spent,PACs,Indivs,Cand,Other,EndCash,LgIndivs,...,Result,CRPICO,State,IncCID,Incumbent,primarydate,DistIDCurr,capeye,sort,SmLgIndivsNote
0,N00041873,Mark Green (R),1194960.47,935486.67,171900.0,819151.42,0.0,203909.05,287888.55,819151.42,...,W,I,Tennessee,,,2020-08-06 00:00:00 +0000,TN07,0,1,N
1,N00045536,Kiran Sreepada (D),206644.28,207190.98,4000.0,202644.28,0.0,0.0,0.0,179129.75,...,L,C,Tennessee,,,2020-08-06 00:00:00 +0000,TN07,0,2,N
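
The behaviour of `raise_for_status()` described in the comments above can be checked without touching the network. The sketch below is an illustration rather than part of the original analysis: `status_is_ok` is a hypothetical helper that builds an empty `requests.Response`, sets its status code by hand, and reports whether `raise_for_status()` raised.

```python
import requests
from requests.exceptions import HTTPError


def status_is_ok(status_code):
    """Return True if raise_for_status() accepts the code, False if it raises.

    Hypothetical helper for illustration: the Response is built by hand,
    so no request is actually sent.
    """
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()
    except HTTPError:
        return False
    return True


# Try one successful code, one client error, and one server error.
for code in (200, 404, 503):
    print(code, status_is_ok(code))
```

Note that the method only does its check when it is called with parentheses; referencing `response.raise_for_status` without them evaluates to the bound method and checks nothing.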


In [7]:
url = 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=TN07'

r = requests.get(url)
# Call the method: a bare r.raise_for_status only references it and checks nothing.
r.raise_for_status()
#data = r.content.decode('utf8')
#df = pd.read_csv(io.StringIO(data))
#df.head(2)

In [8]:
dist_01 = 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=TN01'

r = requests.get(dist_01)
data = r.content.decode('utf8')
df = pd.read_csv(io.StringIO(data))
df.head(2)

Unnamed: 0,cid,FirstLastP,Rcpts,Spent,PACs,Indivs,Cand,Other,EndCash,LgIndivs,...,Result,CRPICO,State,IncCID,Incumbent,primarydate,DistIDCurr,capeye,sort,SmLgIndivsNote
0,N00046688,Diana Harshbarger (R),2126945.6,1869099.77,222800.0,359728.5,1461293.0,83124.1,257845.83,315489.1,...,W,O,Tennessee,,,2020-08-06 00:00:00 +0000,,0,2,N
1,N00046686,Blair Nicole Walsingham (D),140209.14,134994.55,1520.0,138689.14,0.0,0.0,5214.59,70085.2,...,L,O,Tennessee,,,2020-08-06 00:00:00 +0000,,0,2,N


### 2. Once you have working code for Tennessee's 7th District, expand on your code to capture all of Tennessee's districts into a single DataFrame. Make sure that you can distinguish which district each result came from. Export the results to a csv file.

In [10]:
# nitin's code

url = 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=TN07'

# Wrap every HTTP request in try/except.

# .raise_for_status() raises an HTTPError for 4xx and 5xx status codes.
# For a successful response it does nothing and execution continues.

try:
    response = requests.get(url)
    response.raise_for_status()
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error occurred: {http_err}")
except Exception as err:
    print(f"Other error occurred: {err}")
else:
    data = response.content.decode('utf8')
    df = pd.read_csv(io.StringIO(data))
    # Populate the district ID column with TN07 so that we can use it later
    df['DistIDCurr'] = 'TN07'

df.head(2)

Unnamed: 0,cid,FirstLastP,Rcpts,Spent,PACs,Indivs,Cand,Other,EndCash,LgIndivs,...,Result,CRPICO,State,IncCID,Incumbent,primarydate,DistIDCurr,capeye,sort,SmLgIndivsNote
0,N00041873,Mark Green (R),1194960.47,935486.67,171900.0,819151.42,0.0,203909.05,287888.55,819151.42,...,W,I,Tennessee,,,2020-08-06 00:00:00 +0000,TN07,0,1,N
1,N00045536,Kiran Sreepada (D),206644.28,207190.98,4000.0,202644.28,0.0,0.0,0.0,179129.75,...,L,C,Tennessee,,,2020-08-06 00:00:00 +0000,TN07,0,2,N
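
The cell above still fetches only TN07. One way to extend it to every Tennessee district is sketched below, assuming the district IDs run from TN01 to TN09 (Tennessee had nine House seats in 2020) and that each ID works with the same summary URL. The names `fetch_csv_text` and `collect_state` and the output file name are hypothetical. The `fetch` parameter exists only so the loop can be tried with a stand-in function instead of live requests.

```python
import io

import pandas as pd

BASE_URL = "https://www.opensecrets.org/races/summary.csv?cycle=2020&id="


def fetch_csv_text(district_id):
    """Download one district's summary CSV as text (needs network access)."""
    import requests  # imported here so collect_state can also be used offline

    response = requests.get(BASE_URL + district_id, timeout=30)
    response.raise_for_status()
    return response.content.decode("utf8")


def collect_state(state_abbr, n_districts, fetch=fetch_csv_text):
    """Fetch districts 01..n_districts and stack them into one DataFrame."""
    frames = []
    for d in range(1, n_districts + 1):
        district_id = f"{state_abbr}{d:02d}"
        frame = pd.read_csv(io.StringIO(fetch(district_id)))
        # Record the district explicitly: the downloaded column can be empty.
        frame["DistIDCurr"] = district_id
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Usage (performs nine live requests, so it is left commented out):
# tn = collect_state("TN", 9)
# tn.to_csv("tn_2020_contributions.csv", index=False)
```

Setting `DistIDCurr` inside the loop keeps every row traceable to its district even when the download leaves that column blank, as it does for TN01 above.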


### 3. Once you have working code for all of Tennessee's districts, expand on it to capture all states and districts. The number of districts for each state can be found at https://en.wikipedia.org/wiki/2020_United_States_House_of_Representatives_elections. You may also find the table of state abbreviations here helpful: https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations. Export a csv file for each state.

In [12]:
# https://en.wikipedia.org/wiki/2020_United_States_House_of_Representatives_elections
# Number of representatives for each state.
# Scrape the table from the wiki page; its values are used later to build the OpenSecrets URLs.

wiki_rep_url = 'https://en.wikipedia.org/wiki/2020_United_States_House_of_Representatives_elections'

r = requests.get(wiki_rep_url)
wiki_rep_soup = BeautifulSoup(r.text, features="html.parser")

table_html_rep_wiki = str(wiki_rep_soup.find('table', attrs={'class':['wikitable', 'sortable jquery-tablesorter'], 'style':'text-align:center'}))
HTML(table_html_rep_wiki)
# reformat table as a df

wiki_rep_df = pd.read_html(io.StringIO(str(table_html_rep_wiki)))[0]
#wiki_rep_df.head(2)
wiki_rep_df_limited = wiki_rep_df[['State', 'Total seats']]
wiki_rep_df_limited_flat = wiki_rep_df_limited.to_csv(header=None,index=False)
wiki_rep_df_limited_flat_df = pd.read_csv(io.StringIO(wiki_rep_df_limited_flat), names=['US State', 'Number of Districts'])
wiki_rep_df_limited_flat_df.head(2)

Unnamed: 0,US State,Number of Districts
0,Alabama,7
1,Alaska,1


In [13]:
wiki_rep_df_limited = wiki_rep_df[['State', 'Total seats']]
wiki_rep_df_limited.head(3)

Unnamed: 0_level_0,State,Total seats
Unnamed: 0_level_1,State,Total seats
0,Alabama,7
1,Alaska,1
2,Arizona,9
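
The two header rows in this output mean the scraped table has two-level (MultiIndex) columns. As an alternative to the `to_csv`/`read_csv` round trip, the header can be flattened directly. This is a minimal sketch on a toy frame, assuming both header levels carry the same labels as shown above; `flatten_columns` is a hypothetical helper name.

```python
import pandas as pd


def flatten_columns(df):
    """Return a copy of df whose columns keep only the top header level."""
    flat = df.copy()
    if isinstance(flat.columns, pd.MultiIndex):
        flat.columns = flat.columns.get_level_values(0)
    return flat


# Toy frame (not scraped) with the same repeated two-level header.
toy = pd.DataFrame(
    [["Alabama", 7], ["Alaska", 1]],
    columns=pd.MultiIndex.from_tuples(
        [("State", "State"), ("Total seats", "Total seats")]
    ),
)

# Flatten, then rename to the column names used in the rest of the notebook.
flat = flatten_columns(toy).rename(
    columns={"State": "US State", "Total seats": "Number of Districts"}
)
print(flat)
```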


In [14]:
wiki_rep_df_limited_flat = wiki_rep_df_limited.to_csv(header=None,index=False)
wiki_rep_df_limited_flat_df = pd.read_csv(io.StringIO(wiki_rep_df_limited_flat), names=['US State', 'Number of Districts'])
wiki_rep_df_limited_flat_df.head(2)

Unnamed: 0,US State,Number of Districts
0,Alabama,7
1,Alaska,1


In [15]:
# https://www.worldatlas.com/geography/usa-states.html
# Abbreviations of each state.
# Scrape the table from the page; the abbreviations are used later to build the OpenSecrets URLs.
StateAbbrev_url = 'https://www.worldatlas.com/geography/usa-states.html'

r = requests.get(StateAbbrev_url)
StateAbbrev_soup = BeautifulSoup(r.text, features="html.parser")

table_html_StateAbbrev = str(StateAbbrev_soup.find('table'))
HTML(table_html_StateAbbrev)
wiki_st_opp_df = pd.read_html(io.StringIO(str(table_html_StateAbbrev)))[0]
wiki_st_opp_df.head(2)

Unnamed: 0,US State,Abbreviation
0,Alabama,AL
1,Alaska,AK


In [16]:
# Merge the district-count table with the state-abbreviation table on the state name
st_rep_df_merged = pd.merge(wiki_st_opp_df, wiki_rep_df_limited_flat_df, on = "US State", how = "inner")
st_rep_df_merged_drop = st_rep_df_merged.drop('US State', axis=1)
#st_rep_df_merged_drop['Number of Districts'] = st_rep_df_merged_drop['Number of Districts'].astype(str).str.zfill(2)
st_rep_df_merged_drop.head(2)

Unnamed: 0,Abbreviation,Number of Districts
0,AL,7
1,AK,1
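
An inner merge silently drops any state whose name is spelled differently in the two sources. A quick way to check for that is an outer merge with pandas' `indicator=True` option, which adds a `_merge` column marking where each row came from. The sketch below uses made-up toy data, and `unmatched_states` is a hypothetical helper name.

```python
import pandas as pd


def unmatched_states(abbrev_df, seats_df, key="US State"):
    """Return the rows that appear in only one of the two tables."""
    merged = pd.merge(abbrev_df, seats_df, on=key, how="outer", indicator=True)
    return merged[merged["_merge"] != "both"]


# Toy data (not scraped): "Alaska*" mimics a footnote marker left in a name.
abbrev = pd.DataFrame(
    {"US State": ["Alabama", "Alaska"], "Abbreviation": ["AL", "AK"]}
)
seats = pd.DataFrame(
    {"US State": ["Alabama", "Alaska*"], "Number of Districts": [7, 1]}
)
print(unmatched_states(abbrev, seats))
```

An empty result on the real tables would suggest that every state matched and nothing was lost in the inner merge.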


In [17]:
st_rep_df_merged_drop['Abbreviation'][2]

'AZ'

In [18]:
st_rep_df_merged_drop['Number of Districts'][5]

7

In [19]:
#st_rep_df_merged_drop['tuple_try'] = list(zip(st_rep_df_merged_drop.)

In [20]:
st_abbr = st_rep_df_merged_drop['Abbreviation']
dist_num = st_rep_df_merged_drop['Number of Districts']
open_secrets_csv_url_base = 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id='
#open_secrets_csv_url_base
st_abbr[:6]

0    AL
1    AK
2    AZ
3    AR
4    CA
5    CO
Name: Abbreviation, dtype: object

In [21]:
dist_num[:6]

0     7
1     1
2     9
3     4
4    53
5     7
Name: Number of Districts, dtype: int64

In [22]:
urls_st = []
dist_num_container = []
#def add_to_url_begin(st_abbr):

# First attempt: the base URL and the district numbers end up as separate list items
for i in range(0, len(st_abbr)):
    urls_st.append(open_secrets_csv_url_base + st_abbr[i])
    for item in range(dist_num[i], 0, -1):
        urls_st.append(str(item).zfill(2))
urls_st


['https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL',
 '07',
 '06',
 '05',
 '04',
 '03',
 '02',
 '01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AK',
 '01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ',
 '09',
 '08',
 '07',
 '06',
 '05',
 '04',
 '03',
 '02',
 '01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AR',
 '04',
 '03',
 '02',
 '01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=CA',
 '53',
 '52',
 '51',
 '50',
 '49',
 '48',
 '47',
 '46',
 '45',
 '44',
 '43',
 '42',
 '41',
 '40',
 '39',
 '38',
 '37',
 '36',
 '35',
 '34',
 '33',
 '32',
 '31',
 '30',
 '29',
 '28',
 '27',
 '26',
 '25',
 '24',
 '23',
 '22',
 '21',
 '20',
 '19',
 '18',
 '17',
 '16',
 '15',
 '14',
 '13',
 '12',
 '11',
 '10',
 '09',
 '08',
 '07',
 '06',
 '05',
 '04',
 '03',
 '02',
 '01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=CO',
 '07',
 '06',
 '05',
 '04',
 '03',
 '02',
 '01',
 'https://www.opensecrets.org/r

In [23]:
urls_st = []
dist_num_container = []
#def add_to_url_begin(st_abbr):

for i in range(0, len(st_abbr)):
    for d in range(dist_num[i], 0, -1):
        urls_st.append(open_secrets_csv_url_base + st_abbr[i] + str(d).zfill(2)) 

urls_st

['https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL07',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL06',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL05',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL04',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL03',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL02',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AK01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ09',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ08',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ07',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ06',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ05',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ04',
 'https://www.opensecrets.org/race
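
Since the task asks for one csv file per state, it can help to keep the URLs grouped by state instead of in one flat list. The following is a sketch only: `district_urls_by_state` is a hypothetical helper, and it assumes every district ID is the state abbreviation followed by a zero-padded two-digit number, as in the list above. It pairs the two columns with `zip` rather than positional indexing and lists districts in ascending order.

```python
BASE_URL = "https://www.opensecrets.org/races/summary.csv?cycle=2020&id="


def district_urls_by_state(abbreviations, district_counts):
    """Map each state abbreviation to its list of district CSV URLs."""
    urls = {}
    for abbr, count in zip(abbreviations, district_counts):
        urls[abbr] = [
            f"{BASE_URL}{abbr}{d:02d}" for d in range(1, int(count) + 1)
        ]
    return urls


# Small example using the first two rows of the merged table.
by_state = district_urls_by_state(["AL", "AK"], [7, 1])
print(by_state["AK"])
```

With the notebook's own data the call would be `district_urls_by_state(st_abbr, dist_num)`, and each dictionary entry could then be fetched and written to its own file.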

In [24]:
# Note: this appends one district-less URL per state to the end of the existing urls_st list
for i in range(0, len(st_abbr)):
    urls_st.append(open_secrets_csv_url_base + st_abbr[i])

urls_st

['https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL07',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL06',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL05',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL04',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL03',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL02',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AK01',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ09',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ08',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ07',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ06',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ05',
 'https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ04',
 'https://www.opensecrets.org/race

In [25]:
urls_st_dist = []
for i in range(0, len(dist_num)):
    # Use a separate name for the inner loop so it does not shadow the state index i
    for d in range(dist_num[i], 0, -1):
        urls_st_dist.append(d)
#        print(d)
urls_st_dist
#dist_num[i]

[7,
 6,
 5,
 4,
 3,
 2,
 1,
 1,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 4,
 3,
 2,
 1,
 53,
 52,
 51,
 50,
 49,
 48,
 47,
 46,
 45,
 44,
 43,
 42,
 41,
 40,
 39,
 38,
 37,
 36,
 35,
 34,
 33,
 32,
 31,
 30,
 29,
 28,
 27,
 26,
 25,
 24,
 23,
 22,
 21,
 20,
 19,
 18,
 17,
 16,
 15,
 14,
 13,
 12,
 11,
 10,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 5,
 4,
 3,
 2,
 1,
 1,
 27,
 26,
 25,
 24,
 23,
 22,
 21,
 20,
 19,
 18,
 17,
 16,
 15,
 14,
 13,
 12,
 11,
 10,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 14,
 13,
 12,
 11,
 10,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 2,
 1,
 2,
 1,
 18,
 17,
 16,
 15,
 14,
 13,
 12,
 11,
 10,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 4,
 3,
 2,
 1,
 4,
 3,
 2,
 1,
 6,
 5,
 4,
 3,
 2,
 1,
 6,
 5,
 4,
 3,
 2,
 1,
 2,
 1,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 14,
 13,
 12,
 11,
 10,
 9,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 8,
 7,
 6,
 5,
 4,
 3,
 2,
 1,
 4,
 3,
 2,
 1,
 8,
 7,
 6,
 5,
 4,
 3,