# An Analysis of Political Contributions During the 2020 House of Representatives Election

In this part, you will obtain as much data as you can on the campaign contributions received by each candidate. This data is avaiable through the website https://www.opensecrets.org/.

### Part 1: Data Gathering
1. Start by acquiring the data from Tennessee's 7th District, which is available at https://www.opensecrets.org/races/summary?cycle=2020&id=TN07&spec=N. If you click the "Download .csv file", you can get a csv for this district. However, we don't want to have to click this button across all districts. Instead, we'll use Python to help automate this process. Start by sending a get request to the download button URL, https://www.opensecrets.org/races/summary.csv?cycle=2020&id=TN07. Convert the result to a DataFrame.
2. Once you have working code for Tennessee's 7th District, expand on your code to capture all of Tennessee's districts into a single DataFrame. Make sure that you can distinguish which district each result came from. Export the results to a csv file.
3. Once you have working code for all of Tennessee's districts, expand on it to capture all states and districts. The number of districts for each state can be found at https://en.wikipedia.org/wiki/2020_United_States_House_of_Representatives_elections. You may also find the table of state abbreviations here helpful: https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations. Export a csv file for each state.
4. Finally, combine all of the data you've gathered together into a single DataFrame.

In [2]:
import requests
!pip install beautifulsoup4
from bs4 import BeautifulSoup
from IPython.core.display import HTML
import pandas as pd
import io
import re
import csv




In [3]:
url =  "https://www.opensecrets.org/races/summary.csv?cycle-2020&id=TN07"
response = requests.get (url)
data = response.content.decode('utf8')
df = pd.read_csv(io.StringIO(data))
print(df.head())

         cid               FirstLastP       Rcpts       Spent      PACs  \
0  N00041873           Mark Green (R)  1969949.53  1993585.66  253250.0   
1  N00054151          Megan Barry (D)  1183659.72  1032142.74   30462.5   
2  N00055083  Shaun Joseph Greene (I)      180.00      166.37       0.0   

       Indivs  Cand      Other    EndCash    LgIndivs  ...  Result CRPICO  \
0  1279148.93   0.0  437550.60  103711.18  1084402.81  ...       L      I   
1  1157094.37   0.0   -3897.15  151516.98   939482.84  ...       L      C   
2      100.00  80.0       0.00       6.63        0.00  ...       W      C   

       State  IncCID Incumbent primarydate DistIDCurr  capeye  sort  \
0  Tennessee     NaN       NaN         NaN       TN07       0     1   
1  Tennessee     NaN       NaN         NaN                  0     2   
2  Tennessee     NaN       NaN         NaN                  0     2   

   SmLgIndivsNote  
0               N  
1               N  
2               N  

[3 rows x 24 columns]


In [4]:
url =  "https://www.opensecrets.org/races/summary?cycle=2020&id=TN07&spec=N."
response = requests.get (url)
data = response.content.decode('utf8')
df = pd.read_csv(io.StringIO(data))
print(df.head())

    #df = pd.read_csv(io.StringIO(data))
#response.raise_for_status()
#soup = BeautifulSoup(response.text, "html.parser")
#links = [a["href"] for a in soup.find_all("a", href-True)]
#for link in links:
 #print(link)
# = response.content.decode('utf8')
#df = pd.read_csv(io.StringIO(data))

Empty DataFrame
Columns: [<!DOCTYPE html><html lang="en-US"><head><title>Just a moment...</title><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta name="robots" content="noindex, nofollow"><meta name="viewport" content="width=device-width, initial-scale=1"><style>*{box-sizing:border-box;margin:0;padding:0}html{line-height:1.15;-webkit-text-size-adjust:100%;color:#313131;font-family:system-ui, -apple-system, BlinkMacSystemFont, Segoe UI, Roboto, Helvetica Neue, Arial, Noto Sans, sans-serif, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji}body{display:flex;flex-direction:column;height:100vh;min-height:100vh}.main-content{margin:8rem auto;max-width:60rem;padding-left:1.5rem}@media (width <= 720px){.main-content{margin-top:4rem}}.h2{font-size:1.5rem;font-weight:500;line-height:2.25rem}@media (width <= 720px){.h2{font-size:1.25rem;line-height:1.5rem}}#challenge-error-text{background-image:url(da

In [5]:
# 2nd option without beautifulsoup
url =  "https://www.opensecrets.org/races/summary.csv?cycle-2020&id=TN07"
response = requests.get (url)
data = response.content.decode('utf8')
df = pd.read_csv(io.StringIO(data))
print(df.head())

         cid               FirstLastP       Rcpts       Spent      PACs  \
0  N00041873           Mark Green (R)  1969949.53  1993585.66  253250.0   
1  N00054151          Megan Barry (D)  1183659.72  1032142.74   30462.5   
2  N00055083  Shaun Joseph Greene (I)      180.00      166.37       0.0   

       Indivs  Cand      Other    EndCash    LgIndivs  ...  Result CRPICO  \
0  1279148.93   0.0  437550.60  103711.18  1084402.81  ...       L      I   
1  1157094.37   0.0   -3897.15  151516.98   939482.84  ...       L      C   
2      100.00  80.0       0.00       6.63        0.00  ...       W      C   

       State  IncCID Incumbent primarydate DistIDCurr  capeye  sort  \
0  Tennessee     NaN       NaN         NaN       TN07       0     1   
1  Tennessee     NaN       NaN         NaN                  0     2   
2  Tennessee     NaN       NaN         NaN                  0     2   

   SmLgIndivsNote  
0               N  
1               N  
2               N  

[3 rows x 24 columns]


In [6]:
base_url = "https://www.opensecrets.org/races/summary.csv?cycle=2020&id=TN{}"
district_data = []
for district in range(1, 10):
    district_code = f"{district:02d}"
    url = base_url.format(district_code)
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, features="html.parser")
        csv_file = io.StringIO(soup.prettify())
        df = pd.read_csv(csv_file)
        df['District'] = district_code
        district_data.append(df)
    else:
        print(f"Failed to retrieve data for district {district_code} from {url}")
final_df = pd.concat(district_data, ignore_index=True)
final_df.loc[0, 'DistIDCurr'] = 'TN01'
final_df['DistIDCurr'] = final_df['DistIDCurr'].replace(r'^\s*$', None, regex=True)
final_df['DistIDCurr'] = final_df['DistIDCurr'].fillna(method='ffill')
pd.DataFrame(final_df)

  final_df['DistIDCurr'] = final_df['DistIDCurr'].fillna(method='ffill')


Unnamed: 0,cid,FirstLastP,Rcpts,Spent,PACs,Indivs,Cand,Other,EndCash,LgIndivs,...,CRPICO,State,IncCID,Incumbent,primarydate,DistIDCurr,capeye,sort,SmLgIndivsNote,District
0,N00046688,Diana Harshbarger (R),2126945.6,1869099.77,222800.0,359728.5,1461293.0,83124.1,257845.83,315489.1,...,O,Tennessee,,,2020-08-06 00:00:00 +0000,TN01,0,2,N,1
1,N00046686,Blair Nicole Walsingham (D),140209.14,134994.55,1520.0,138689.14,0.0,0.0,5214.59,70085.2,...,O,Tennessee,,,2020-08-06 00:00:00 +0000,TN01,0,2,N,1
2,N00047760,Steve Holder (I),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,O,Tennessee,,,2020-08-06 00:00:00 +0000,TN01,0,2,N,1
3,N00041594,Tim Burchett (R),1336275.75,878487.63,269535.0,1072845.61,0.0,-6104.86,593677.72,729831.26,...,I,Tennessee,,,2020-08-06 00:00:00 +0000,TN02,0,1,N,2
4,N00041699,Renee Hoyos (D),812783.86,816793.15,3100.0,807459.01,0.0,2224.85,209.82,807459.01,...,C,Tennessee,,,2020-08-06 00:00:00 +0000,TN02,0,2,N,2
5,N00047761,Matthew Campbell (I),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,C,Tennessee,,,2020-08-06 00:00:00 +0000,TN02,0,2,N,2
6,N00030815,Chuck Fleischmann (R),1051653.39,381411.2,453858.46,603344.93,0.0,-5550.0,1880341.32,599059.93,...,I,Tennessee,,,2020-08-06 00:00:00 +0000,TN03,0,1,N,3
7,N00046911,Meg Gorman (D),85843.21,77759.83,2671.6,81271.61,2000.0,-100.0,8083.38,50245.2,...,C,Tennessee,,,2020-08-06 00:00:00 +0000,TN03,0,2,N,3
8,N00046589,Nancy Baxley (I),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,C,Tennessee,,,2020-08-06 00:00:00 +0000,TN03,0,2,N,3
9,N00047762,Amber Hysell (I),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,C,Tennessee,,,2020-08-06 00:00:00 +0000,TN03,0,2,N,3


In [7]:
base_url = "https://www.opensecrets.org/races/summary.csv?cycle=2020&id={}"

states_and_districts = {
    "AL": 7, "AK": 1, "AZ": 9, "AR": 4, "CA": 53, "CO": 7, "CT": 5, "DE": 1,
    "FL": 27, "GA": 14, "HI": 2, "ID": 2, "IL": 18, "IN": 9, "IA": 4, "KS": 4,
    "KY": 6, "LA": 6, "ME": 2, "MD": 8, "MA": 9, "MI": 14, "MN": 8, "MS": 4,
    "MO": 8, "MT": 1, "NE": 3, "NV": 4, "NH": 2, "NJ": 12, "NM": 3, "NY": 27,
    "NC": 13, "ND": 1, "OH": 16, "OK": 5, "OR": 5, "PA": 18, "RI": 2, "SC": 7,
    "SD": 1, "TN": 9, "TX": 36, "UT": 4, "VT": 1, "VA": 11, "WA": 10, "WV": 3,
    "WI": 8, "WY": 1
}

all_links = []

for state, num_districts in states_and_districts.items():
    for district in range(1, num_districts + 1):

        url = base_url.format(f"{state}{district:02d}") 
        all_links.append(url)

for link in all_links:
    print(link)
    link = pd.DataFrame(all_links, columns=['Generated URLs'])

https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL01
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL02
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL03
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL04
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL05
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL06
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AL07
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AK01
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ01
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ02
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ03
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ04
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ05
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ06
https://www.opensecrets.org/races/summary.csv?cycle=2020&id=AZ07
https://www.opensecrets.o

In [20]:
urls = link['Generated URLs'].tolist()
all_data = []

for url in urls:
    response = requests.get(url)
    response.raise_for_status() 

    # Extract the CSV data directly
    csv_data = pd.read_csv(io.StringIO(BeautifulSoup(response.text, "html.parser").get_text()))
    all_data.append(csv_data)  

# Concatenate all DataFrames into a single DataFrame
combined_data = pd.concat(all_data, ignore_index=True)

# Write the aggregated data to a single CSV file
combined_data.to_csv('combined_district_data.csv', index=False)
print("All data saved to combined_district_data.csv.")

All data saved to combined_district_data.csv.


In [27]:
num_of_rows = len(combined_data)
print(f"The number of rows is {num_of_rows}")





The number of rows is 1264
