# 06 - Combining data with merges

In [8]:
import pandas as pd
import math

pd.set_option('display.max_columns', 100)

## 1. Match candidate districts from the FEC with demographic data from the census

### a. Load candidate and demographic files to identify common attributes

Candidate files are available as bulk data from the [FEC's website](https://www.fec.gov/data/browse-data/?tab=bulk-data).

In [9]:
candidate_header = pd.read_csv('../answers/downloaded_data/cn_header_file.csv').columns.tolist()
candidates = pd.read_csv('../answers/downloaded_data/cn22.txt', sep='|', names=candidate_header)

In [12]:
# candidate_header

In [13]:
candidates.head(2)

Unnamed: 0,CAND_ID,CAND_NAME,CAND_PTY_AFFILIATION,CAND_ELECTION_YR,CAND_OFFICE_ST,CAND_OFFICE,CAND_OFFICE_DISTRICT,CAND_ICI,CAND_STATUS,CAND_PCC,CAND_ST1,CAND_ST2,CAND_CITY,CAND_ST,CAND_ZIP
0,H0AK00105,"LAMB, THOMAS",NNE,2020,AK,H,0.0,C,N,C00607515,1861 W LAKE LUCILLE DR,,WASILLA,AK,99654.0
1,H0AL01055,"CARL, JERRY LEE, JR",REP,2022,AL,H,1.0,I,C,C00697789,PO BOX 852138,,MOBILE,AL,36685.0


In [15]:
# district level
demographics = pd.read_csv('../answers/downloaded_data/ACSDT5Y2019.B01003_2021-07-14T121439/ACSDT5Y2019.B01003_data_with_overlays_2021-07-14T121436.csv', header=1)

In [16]:
demographics.head(2)

Unnamed: 0,id,Geographic Area Name,Estimate!!Total,Margin of Error!!Total
0,5001600US0101,"Congressional District 1 (116th Congress), Ala...",710135.0,615
1,5001600US0102,"Congressional District 2 (116th Congress), Ala...",679684.0,2213


In [25]:
demographics['Geographic Area Name'][0]

'Congressional District 1 (116th Congress), Alabama'

Looking at the first couple of entries, what could be common attributes?

- Geographic area
    - District number
    - State

In [23]:
# demographics['state_name'] = demographics['Geographic Area Name'].str.split(', ')

In [26]:
demographics['district_id'] = demographics['id'].str[-2:]

In [30]:
demographics['state_fips'] = demographics['id'].str[-4:-2]

In [40]:
demographics.head()

Unnamed: 0,id,Geographic Area Name,Estimate!!Total,Margin of Error!!Total,state_name,district_id,state_fips
0,5001600US0101,"Congressional District 1 (116th Congress), Ala...",710135.0,615,"[Congressional District 1 (116th Congress), Al...",1,1
1,5001600US0102,"Congressional District 2 (116th Congress), Ala...",679684.0,2213,"[Congressional District 2 (116th Congress), Al...",2,1
2,5001600US0103,"Congressional District 3 (116th Congress), Ala...",708888.0,1544,"[Congressional District 3 (116th Congress), Al...",3,1
3,5001600US0104,"Congressional District 4 (116th Congress), Ala...",684757.0,1511,"[Congressional District 4 (116th Congress), Al...",4,1
4,5001600US0105,"Congressional District 5 (116th Congress), Ala...",720362.0,234,"[Congressional District 5 (116th Congress), Al...",5,1


### b. Format candidates table

In [47]:
candidates['district_str'] = candidates['CAND_OFFICE_DISTRICT'].astype(str)

In [48]:
candidates['district_str'] = candidates['district_str'].apply(lambda x: x.split('.')[0])

In [51]:
candidates['district_str'] = candidates['district_str'].str.zfill(2)
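
The three cells above turn the float district number into a zero-padded, two-character string. As a minimal sketch, the same steps on a toy Series (the sample values are illustrative, not read from the FEC file) look like this:

```python
import pandas as pd

# Toy district numbers stored as floats, as in CAND_OFFICE_DISTRICT.
districts = pd.Series([0.0, 1.0, 12.0])

as_text = districts.astype(str)                    # '0.0', '1.0', '12.0'
whole = as_text.apply(lambda x: x.split('.')[0])   # '0', '1', '12'
padded = whole.str.zfill(2)                        # '00', '01', '12'

print(padded.tolist())
```

The padding matters because the census district IDs are two-character strings, and a merge only matches values that are exactly equal.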

### c. Format demographics table

How can we identify districts and states the way the FEC does? Split the `id` column into a state ID and a district ID. The FEC labels states with two-letter postal codes, while the census `id` carries a numeric FIPS code, so we also need a lookup table from FIPS code to postal code.

[Read up on slice notation here](https://www.oreilly.com/content/how-do-i-use-the-slice-notation-in-python/)

[Find FIPS codes here](https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696)
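
As a small sketch of slice notation on one census `id` (the value is the Oregon 2nd district `id` that appears in the demographics file; the variable names are illustrative), negative indices count from the end of the string:

```python
# Census id for Oregon's 2nd congressional district.
geo_id = '5001600US4102'

district_id = geo_id[-2:]    # last two characters: the district number
state_fips = geo_id[-4:-2]   # the two characters before those: the state FIPS code

print(state_fips, district_id)  # 41 02
```

In pandas, `.str[-2:]` and `.str[-4:-2]` apply the same slices to every row of a column.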

In [61]:
demographics.head()

Unnamed: 0,id,Geographic Area Name,Estimate!!Total,Margin of Error!!Total,state_name,district_id,state_fips
0,5001600US0101,"Congressional District 1 (116th Congress), Ala...",710135.0,615,"[Congressional District 1 (116th Congress), Al...",1,1
1,5001600US0102,"Congressional District 2 (116th Congress), Ala...",679684.0,2213,"[Congressional District 2 (116th Congress), Al...",2,1
2,5001600US0103,"Congressional District 3 (116th Congress), Ala...",708888.0,1544,"[Congressional District 3 (116th Congress), Al...",3,1
3,5001600US0104,"Congressional District 4 (116th Congress), Ala...",684757.0,1511,"[Congressional District 4 (116th Congress), Al...",4,1
4,5001600US0105,"Congressional District 5 (116th Congress), Ala...",720362.0,234,"[Congressional District 5 (116th Congress), Al...",5,1


Identify...

- Left table: demographics
- Right table: state_lookup
- Left key: state_fips
- Right key: FIPS
- Merge type: left
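
The choices listed above map onto the arguments of `DataFrame.merge`. The following is a minimal sketch with toy tables (the rows are illustrative stand-ins, not the real data) showing how a left merge behaves when the key columns have different names and one key has no match:

```python
import pandas as pd

# Toy stand-ins for the demographics and state lookup tables.
left = pd.DataFrame({'state_fips': ['01', '02', '11']})
right = pd.DataFrame({'FIPS': ['01', '02'],
                      'Postal Code': ['AL', 'AK']})

# A left merge keeps every row of `left`; keys missing from `right`
# (here '11') get NaN in the columns that come from `right`.
labeled = left.merge(right, how='left', left_on='state_fips', right_on='FIPS')

print(labeled)
```

Rows left with NaN after the merge are a quick signal that the lookup table is missing an entry.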

In [130]:
state_lookup = pd.read_html('https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696')
state_lookup = state_lookup[0]
state_lookup = state_lookup.drop(index=[55])
# [({'Name': 'District of Columbia', 'Postal Code': 'DC', 'FIPS': '11'})]

# Normalize FIPS to a zero-padded, two-character string so it matches state_fips
state_lookup['FIPS'] = state_lookup['FIPS'].apply(lambda x: str(x).split('.')[0].zfill(2))

In [75]:
state_lookup.head(2)

Unnamed: 0,Name,Postal Code,FIPS
0,Alabama,AL,1
1,Alaska,AK,2


In [83]:
demographics_labeled = demographics.merge(state_lookup[['FIPS', 'Postal Code']], how='left', left_on='state_fips', right_on='FIPS')

In [84]:
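# errors='ignore' avoids a KeyError if the optional state_name column was never created
demographics_labeled = demographics_labeled.drop(['state_name', 'FIPS'], axis=1, errors='ignore')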

In [86]:
demographics_labeled.head(2)

Unnamed: 0,id,Geographic Area Name,Estimate!!Total,Margin of Error!!Total,district_id,state_fips,Postal Code
0,5001600US0101,"Congressional District 1 (116th Congress), Ala...",710135.0,615,1,1,AL
1,5001600US0102,"Congressional District 2 (116th Congress), Ala...",679684.0,2213,2,1,AL


### d. Merge formatted tables

The identifying columns must have the same format and data type in both tables. We can either merge on two columns at once (district and state) or build a single key in each table that combines the district number and the state abbreviation.

[Read about merging in pandas here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)

Identify...

- Left table: candidates
- Right table: demographics_labeled
- Left key: key
- Right key: key
- Merge type: left
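
Both approaches should give the same matches. As a hedged sketch on toy tables (the names ending in `_toy`, the `population` column, and the sample rows are all hypothetical), the two options can be compared side by side:

```python
import pandas as pd

candidates_toy = pd.DataFrame({
    'CAND_NAME': ['A', 'B', 'C'],
    'CAND_OFFICE_ST': ['AL', 'AL', 'OR'],
    'district_str': ['01', '02', '06'],
})
districts_toy = pd.DataFrame({
    'Postal Code': ['AL', 'AL', 'OR'],
    'district_id': ['01', '02', '05'],
    'population': [100, 200, 300],
})

# Option 1: merge on two column pairs at once.
two_col = candidates_toy.merge(
    districts_toy, how='left',
    left_on=['district_str', 'CAND_OFFICE_ST'],
    right_on=['district_id', 'Postal Code'],
)

# Option 2: build one concatenated key per table, then merge on it.
candidates_toy['key'] = candidates_toy['district_str'] + candidates_toy['CAND_OFFICE_ST']
districts_toy['key'] = districts_toy['district_id'] + districts_toy['Postal Code']
one_key = candidates_toy.merge(districts_toy, how='left', on='key')

# Same matches either way; candidate C has no matching district.
print(two_col['population'].equals(one_key['population']))
```

The single key is convenient for checking uniqueness and for spotting unmatched rows, since one column carries the whole identity of a district.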

In [87]:
candidates['key'] = candidates['district_str'] + candidates['CAND_OFFICE_ST']

In [89]:
demographics_labeled['key'] = demographics_labeled['district_id'] + demographics_labeled['Postal Code']

In [92]:
len(demographics_labeled['key']), len(demographics_labeled['key'].unique())

(440, 440)

In [93]:
candidates_demographics = candidates.merge(demographics_labeled, on='key', how='left')

## 2. Clean data

In [96]:
len(candidates_demographics), len(candidates), len(demographics_labeled)

(5377, 5377, 440)
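
The merged table has the same number of rows as `candidates`, which is what a left merge gives when every key on the right side is unique. As an optional safeguard, `merge` accepts a `validate` argument that raises an error instead of silently adding rows. A minimal sketch with toy tables (all values are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['01AL', '01AL', '02AL']})
right_ok = pd.DataFrame({'key': ['01AL', '02AL'], 'population': [100, 200]})
right_dup = pd.DataFrame({'key': ['01AL', '01AL'], 'population': [100, 101]})

# With unique right-hand keys, a left merge returns one row per left row.
merged = left.merge(right_ok, on='key', how='left', validate='many_to_one')
print(len(merged) == len(left))

# With duplicated right-hand keys, validate raises a MergeError.
try:
    left.merge(right_dup, on='key', how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```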

In [107]:
candidates_2022 = candidates_demographics[candidates_demographics['CAND_ELECTION_YR'] == 2022]

In [109]:
# candidates_2022

### a. Separate House candidates from Senate candidates

In [110]:
senate_demographics = candidates_2022[candidates_2022['CAND_OFFICE'] == 'S']
house_demographics = candidates_2022[candidates_2022['CAND_OFFICE'] == 'H']

### b. Find districts that don't exist

In [119]:
demographics_labeled[demographics_labeled['Postal Code'] == 'AS']

Unnamed: 0,id,Geographic Area Name,Estimate!!Total,Margin of Error!!Total,district_id,state_fips,Postal Code,key


In [121]:
demographics_labeled[demographics_labeled['Postal Code'] == 'OR']

Unnamed: 0,id,Geographic Area Name,Estimate!!Total,Margin of Error!!Total,district_id,state_fips,Postal Code,key
323,5001600US4101,"Congressional District 1 (116th Congress), Oregon",842952.0,1268,1,41,OR,01OR
324,5001600US4102,"Congressional District 2 (116th Congress), Oregon",817793.0,1035,2,41,OR,02OR
325,5001600US4103,"Congressional District 3 (116th Congress), Oregon",837545.0,1804,3,41,OR,03OR
326,5001600US4104,"Congressional District 4 (116th Congress), Oregon",803194.0,1213,4,41,OR,04OR
327,5001600US4105,"Congressional District 5 (116th Congress), Oregon",828319.0,1515,5,41,OR,05OR


In [118]:
house_demographics[house_demographics['id'].isna()].sort_values('CAND_OFFICE_ST')

Unnamed: 0,CAND_ID,CAND_NAME,CAND_PTY_AFFILIATION,CAND_ELECTION_YR,CAND_OFFICE_ST,CAND_OFFICE,CAND_OFFICE_DISTRICT,CAND_ICI,CAND_STATUS,CAND_PCC,CAND_ST1,CAND_ST2,CAND_CITY,CAND_ST,CAND_ZIP,district_str,key,id,Geographic Area Name,Estimate!!Total,Margin of Error!!Total,district_id,state_fips,Postal Code
1547,H2AL12017,"VON KRIEG, BIANCA",DEM,2022,AL,H,12.0,C,N,C00777615,530 DIVISADERO,SUITE #458,SAN FRANCISCO,CA,94117.0,12,12AL,,,,,,,
2495,H4AS00036,"AMATA, AUMUA",REP,2022,AS,H,0.0,I,N,C00393041,PO BOX 6171,,PAGO PAGO,AS,96799.0,0,00AS,,,,,,,
1585,H2CA00146,"TOOMIM, LEAH MELISSA",REP,2022,CA,H,0.0,C,N,C00776948,1112 MONTANA AVENUE #3-88,,SANTA MONICA,CA,90403.0,0,00CA,,,,,,,
246,H0DC00058,"NORTON, ELEANOR HOLMES",DEM,2022,DC,H,0.0,I,C,C00244335,"10 NINTH STREET, SE",,WASHINGTON,DC,20003.0,0,00DC,,,,,,,
1715,H2DC01011,"HAMILTON, WENDY REV",DEM,2022,DC,H,0.0,C,C,C00763896,85 DANBURY ST SW,,WASHINGTON,DC,20032.0,0,00DC,,,,,,,
1716,H2DC10012,"WINDHAUSER, ANGELA MARIE",REP,2022,DC,H,0.0,,N,,PO BOX 785098,,WINTER GARDEN,FL,34778.0,0,00DC,,,,,,,
2734,H6FL08213,"GRAYSON, ALAN MARK",DEM,2022,FL,H,28.0,C,N,C00424713,4415 GWYNDALE CT,,ORLANDO,FL,328375509.0,28,28FL,,,,,,,
3123,H8GU01020,"SAN NICOLAS, MICHAEL F.Q. MR.",DEM,2022,GU,H,0.0,I,C,C00668335,198 W. SANTA BARBARA AVE.,,DEDEDO,GU,96929.0,0,00GU,,,,,,,
1902,H2IN90019,"PATEL, HIREN MR.",REP,2022,IN,H,90.0,C,N,C00762542,4236 SOUTHPORT TRACE DR,,INDIANAPOLIS,IN,46237.0,90,90IN,,,,,,,
1904,H2KS21010,"GARRETT, FRANK DUNCAN MR",REP,2022,KS,H,21.0,C,N,,11216 BURTON STREET,,INDEPENDENCE,MO,64054.0,21,21KS,,,,,,,


To-do:
- Add DC to the FIPS table

Districts that don't exist:

- Oregon 6th district
- NH ("at large")
- IN 90th district
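
Besides filtering on `isna()`, one way to list the keys that found no partner is the `indicator` argument of `merge`. This is a sketch on toy tables, not the cell used in this notebook; the candidate names and `population` values are made up:

```python
import pandas as pd

candidates_toy = pd.DataFrame({'CAND_NAME': ['A', 'B', 'C'],
                               'key': ['01AL', '90IN', '00DC']})
districts_toy = pd.DataFrame({'key': ['01AL', '02AL'],
                              'population': [100, 200]})

# indicator=True adds a `_merge` column: 'both' for matched rows,
# 'left_only' for left rows with no partner in the right table.
checked = candidates_toy.merge(districts_toy, on='key', how='left', indicator=True)
unmatched_keys = checked.loc[checked['_merge'] == 'left_only', 'key'].tolist()

print(unmatched_keys)  # ['90IN', '00DC']
```

Unmatched keys can come from either side: a district that a candidate entered incorrectly, or an area missing from the lookup or demographics table.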

### c. Extract strings

[Find in pandas](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.find.html?highlight=find#pandas.Series.str.find)
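
`str.find` returns the position of the first match in each string, or -1 when the substring is absent, which is why the cell below compares against -1. A small sketch using three candidate names from this file, with `str.contains` shown as an alternative that returns the boolean mask directly:

```python
import pandas as pd

names = pd.Series(['NORTON, ELEANOR HOLMES', 'LAMB, THOMAS', 'NORTON, THOMAS JOHN'])

# str.find gives the index of the first match, or -1 when there is none.
positions = names.str.find('NORTON')
mask_from_find = positions > -1

# str.contains gives the boolean mask in one step; regex=False treats
# the pattern as a literal substring.
mask_from_contains = names.str.contains('NORTON', regex=False)

print(positions.tolist())       # [0, -1, 0]
print(mask_from_find.tolist())  # [True, False, True]
```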

In [129]:
house_demographics[house_demographics['CAND_NAME'].str.find('NORTON') > -1]

Unnamed: 0,CAND_ID,CAND_NAME,CAND_PTY_AFFILIATION,CAND_ELECTION_YR,CAND_OFFICE_ST,CAND_OFFICE,CAND_OFFICE_DISTRICT,CAND_ICI,CAND_STATUS,CAND_PCC,CAND_ST1,CAND_ST2,CAND_CITY,CAND_ST,CAND_ZIP,district_str,key,id,Geographic Area Name,Estimate!!Total,Margin of Error!!Total,district_id,state_fips,Postal Code
246,H0DC00058,"NORTON, ELEANOR HOLMES",DEM,2022,DC,H,0.0,I,C,C00244335,"10 NINTH STREET, SE",,WASHINGTON,DC,20003.0,0,00DC,,,,,,,
673,H0MI03266,"NORTON, THOMAS JOHN",REP,2022,MI,H,3.0,C,C,C00704270,4208 PETTIS AVE.,,ADA,MI,48301.0,3,03MI,5001600US2603,"Congressional District 3 (116th Congress), Mic...",742923.0,121.0,3.0,26.0,MI
