### Dataset setup

The cells below combine the source data into a single dataframe that describes each county: its health and demographic information, and the share of the vote that went to each presidential candidate.

This data is pulled from:
- [County Health Rankings & Roadmaps](https://www.countyhealthrankings.org/)
- [The MIT Election Data and Science Lab](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ)

In [1]:
import pandas as pd

In [2]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

In [3]:
data_location = 's3://mlspace-data-521454461163/project/07NNClassification/datasets/politicalPredictions/'

In [4]:
df = pd.read_csv(data_location+'analytic_data2023.csv')
cols = list(df.columns[:5])
for col in df.columns:
    if 'raw value' in col or 'Ratio' in col:
        cols.append(col)
cutdown = df[df['County FIPS Code'] != 0][cols]




In [5]:
elect = pd.read_csv(data_location+'countypres_2000-2020.csv')
elect = elect[elect['year'] == 2020]
elect = elect[['state', 'county_name', 'county_fips', 'candidate', 'party', 'candidatevotes', 'totalvotes']]
elect['vote_ratio']=elect['candidatevotes']/elect['totalvotes']
elect_pivot=elect.pivot_table(index='county_fips',columns='candidate',values='vote_ratio')
elect_pivot=elect_pivot.fillna(0)

In [6]:
merged=cutdown.merge(elect_pivot,left_on='5-digit FIPS Code',right_index=True,how='outer')
merged.drop(columns=[
    'Living Wage raw value',
    'Children Eligible for Free or Reduced Price Lunch raw value',
    'Residential Segregation - Black/White raw value',
    'Child Care Cost Burden raw value',
    'Child Care Centers raw value',
], inplace=True)

In [7]:
print(merged.columns)
merged.head()

Index(['State FIPS Code', 'County FIPS Code', '5-digit FIPS Code',
       'State Abbreviation', 'Name', 'Premature Death raw value',
       'Poor or Fair Health raw value', 'Poor Physical Health Days raw value',
       'Poor Mental Health Days raw value', 'Low Birthweight raw value',
       'Adult Smoking raw value', 'Adult Obesity raw value',
       'Food Environment Index raw value', 'Physical Inactivity raw value',
       'Access to Exercise Opportunities raw value',
       'Excessive Drinking raw value',
       'Alcohol-Impaired Driving Deaths raw value',
       'Sexually Transmitted Infections raw value', 'Teen Births raw value',
       'Uninsured raw value', 'Primary Care Physicians raw value',
       'Ratio of population to primary care physicians.', 'Dentists raw value',
       'Ratio of population to dentists.', 'Mental Health Providers raw value',
       'Ratio of population to mental health providers.',
       'Preventable Hospital Stays raw value',
       'Mammography Scree

| | State FIPS Code | County FIPS Code | 5-digit FIPS Code | State Abbreviation | Name | Premature Death raw value | Poor or Fair Health raw value | Poor Physical Health Days raw value | Poor Mental Health Days raw value | Low Birthweight raw value | ... | % Native Hawaiian or Other Pacific Islander raw value | % Hispanic raw value | % Non-Hispanic White raw value | % Not Proficient in English raw value | % Female raw value | % Rural raw value | DONALD J TRUMP | JO JORGENSEN | JOSEPH R BIDEN JR | OTHER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.0 | 1.0 | 1.0 | 1001.0 | AL | Autauga County | 8027.394727 | 0.169 | 3.432211 | 4.797351 | 0.097382 | ... | 0.001185 | 0.033268 | 0.724545 | 0.002312 | 0.513783 | 0.420022 | 0.714368 | 0.0 | 0.270184 | 0.015448 |
| 3.0 | 1.0 | 3.0 | 1003.0 | AL | Baldwin County | 8118.358206 | 0.149 | 3.276177 | 4.75375 | 0.083857 | ... | 0.000685 | 0.048417 | 0.831488 | 0.007597 | 0.513477 | 0.422791 | 0.761714 | 0.0 | 0.22409 | 0.014196 |
| 4.0 | 1.0 | 5.0 | 1005.0 | AL | Barbour County | 12876.760319 | 0.275 | 4.605432 | 4.954855 | 0.119147 | ... | 0.002323 | 0.049591 | 0.453052 | 0.013827 | 0.467033 | 0.677896 | 0.534512 | 0.0 | 0.457882 | 0.007606 |
| 5.0 | 1.0 | 7.0 | 1007.0 | AL | Bibb County | 11191.474323 | 0.216 | 4.012182 | 5.364779 | 0.100331 | ... | 0.00129 | 0.030876 | 0.735641 | 0.004431 | 0.460159 | 0.683526 | 0.784263 | 0.0 | 0.206983 | 0.008755 |
| 6.0 | 1.0 | 9.0 | 1009.0 | AL | Blount County | 10787.014541 | 0.184 | 3.866048 | 5.37758 | 0.078599 | ... | 0.001236 | 0.098677 | 0.863298 | 0.017269 | 0.501922 | 0.899515 | 0.895716 | 0.0 | 0.095694 | 0.008591 |


### Assignment

Your job is to use a neural net to build the best predictor of the 2020 election you can. Follow these steps:

- Identify NaNs, and decide what to do with them,
- Split the data into training and testing sets,
- Split off the targets from the training data,
- Construct a neural net with four logit outputs and cross entropy loss to predict the percentage vote for each candidate,
- Use testing error to find the best version of your network.
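
The fourth step pairs logit outputs with targets that are vote shares rather than single class labels. The sketch below uses plain NumPy, not the notebook's PyTorch code, and shows the quantity that cross entropy computes in that setting. The function name `soft_cross_entropy` and the example logits are illustrative assumptions; the target row is the Autauga County vote shares from the table above.

```python
import numpy as np

def soft_cross_entropy(logits, target_probs):
    """Mean cross entropy between softmax(logits) and probability targets."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(target_probs * log_probs).sum(axis=1).mean())

# Vote shares for one county (Trump, Jorgensen, Biden, other).
target = np.array([[0.714368, 0.0, 0.270184, 0.015448]])

# Hypothetical logits: a uniform guess, and a guess that reproduces the target.
uniform_logits = np.zeros((1, 4))
matched_logits = np.log(target + 1e-12)  # small constant avoids log(0)

loss_uniform = soft_cross_entropy(uniform_logits, target)  # equals ln(4)
loss_matched = soft_cross_entropy(matched_logits, target)  # about 0.66
```

Even a perfect prediction does not drive this loss to zero: with soft targets the minimum is the entropy of the target distribution, so a loss that levels off well above zero is not by itself a sign that the network has failed to learn.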

This will get you up to 90%. For the final 10%, perform an *ablation test*, in which you remove inputs and observe whether, and by how much, your predictor gets worse. A steep drop in accuracy when an input is removed suggests that the input is very important to the prediction.
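
One way to organize an ablation test is sketched below. To keep the example small and self-contained, it retrains a least-squares model as a stand-in for the neural net and uses synthetic data; the helper names `fit_and_score` and `ablation_test` and the feature names are assumptions for illustration only.

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test):
    """Stand-in model: least-squares fit, scored by test mean squared error."""
    # Append a column of ones so the linear model can learn an intercept.
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    weights, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_test @ weights - y_test) ** 2))

def ablation_test(X_train, y_train, X_test, y_test, names):
    """Refit without each input in turn and record the rise in test error."""
    baseline = fit_and_score(X_train, y_train, X_test, y_test)
    increases = {}
    for j, name in enumerate(names):
        keep = [c for c in range(X_train.shape[1]) if c != j]
        score = fit_and_score(X_train[:, keep], y_train, X_test[:, keep], y_test)
        increases[name] = score - baseline
    return baseline, increases

# Synthetic data: the target depends strongly on "a", weakly on "b", not on "c".
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=400)

baseline, increases = ablation_test(X[:300], y[:300], X[300:], y[300:], ["a", "b", "c"])
ranked = sorted(increases, key=increases.get, reverse=True)  # most important first
```

With the neural net, `fit_and_score` would instead train a fresh `VotePredict` on the reduced columns and return its test loss. Correlated inputs can mask each other here: removing one of two near-duplicate columns may barely change the error even though the information they share matters.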

Keep in mind **correlation** vs **causation**.  We are discovering correlation, not causation.

In [8]:
merged[merged.columns[5:]] = merged[merged.columns[5:]].fillna(
    merged[merged.columns[5:]].mean())
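
The cell above fills every gap with a column mean computed over all rows, which includes rows that later land in the test set. A hedged alternative, sketched on a small made-up frame rather than on `merged`, is to count the NaNs first and then impute using means taken from the training rows only. The values and the three-row "training" split are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the numeric columns of `merged`.
frame = pd.DataFrame({
    "Adult Smoking raw value": [0.17, np.nan, 0.21, 0.19],
    "Uninsured raw value": [0.10, 0.12, np.nan, np.nan],
    "% Rural raw value": [0.42, 0.42, 0.68, 0.90],
})

# Step 1: count missing values per column, largest first.
nan_counts = frame.isna().sum().sort_values(ascending=False)

# Step 2: treat the first three rows as training rows (an arbitrary split)
# and compute the fill values from those rows only.
train, test = frame.iloc[:3], frame.iloc[3:]
train_means = train.mean()

# Step 3: apply the same training means to both parts.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
```

Columns with a very large NaN count may be better dropped than imputed, and rows whose vote shares are missing are arguably better removed than filled, since an averaged target is not a real election result.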

In [9]:
merged.head()

| | State FIPS Code | County FIPS Code | 5-digit FIPS Code | State Abbreviation | Name | Premature Death raw value | Poor or Fair Health raw value | Poor Physical Health Days raw value | Poor Mental Health Days raw value | Low Birthweight raw value | ... | % Native Hawaiian or Other Pacific Islander raw value | % Hispanic raw value | % Non-Hispanic White raw value | % Not Proficient in English raw value | % Female raw value | % Rural raw value | DONALD J TRUMP | JO JORGENSEN | JOSEPH R BIDEN JR | OTHER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.0 | 1.0 | 1.0 | 1001.0 | AL | Autauga County | 8027.394727 | 0.169 | 3.432211 | 4.797351 | 0.097382 | ... | 0.001185 | 0.033268 | 0.724545 | 0.002312 | 0.513783 | 0.420022 | 0.714368 | 0.0 | 0.270184 | 0.015448 |
| 3.0 | 1.0 | 3.0 | 1003.0 | AL | Baldwin County | 8118.358206 | 0.149 | 3.276177 | 4.75375 | 0.083857 | ... | 0.000685 | 0.048417 | 0.831488 | 0.007597 | 0.513477 | 0.422791 | 0.761714 | 0.0 | 0.22409 | 0.014196 |
| 4.0 | 1.0 | 5.0 | 1005.0 | AL | Barbour County | 12876.760319 | 0.275 | 4.605432 | 4.954855 | 0.119147 | ... | 0.002323 | 0.049591 | 0.453052 | 0.013827 | 0.467033 | 0.677896 | 0.534512 | 0.0 | 0.457882 | 0.007606 |
| 5.0 | 1.0 | 7.0 | 1007.0 | AL | Bibb County | 11191.474323 | 0.216 | 4.012182 | 5.364779 | 0.100331 | ... | 0.00129 | 0.030876 | 0.735641 | 0.004431 | 0.460159 | 0.683526 | 0.784263 | 0.0 | 0.206983 | 0.008755 |
| 6.0 | 1.0 | 9.0 | 1009.0 | AL | Blount County | 10787.014541 | 0.184 | 3.866048 | 5.37758 | 0.078599 | ... | 0.001236 | 0.098677 | 0.863298 | 0.017269 | 0.501922 | 0.899515 | 0.895716 | 0.0 | 0.095694 | 0.008591 |


In [10]:
data = merged[merged.columns[5:-4]]
targets = merged[merged.columns[-4:]]

In [11]:
print(data.shape, targets.shape)

(3181, 84) (3181, 4)


In [12]:
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(data, targets)

N, k = Xtr.shape
_, outputs = ytr.shape
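
The input columns sit on very different scales: the Premature Death values in the table above are in the thousands, while most other columns lie between 0 and 1. That mismatch may contribute to the very large loss values at the start of training. A common remedy, which is not part of the original notebook, is to standardize each column using statistics from the training rows only. The helper names `fit_scaler` and `apply_scaler` and the sample rows are illustrative assumptions.

```python
import numpy as np

def fit_scaler(X_train):
    """Return per-column mean and standard deviation of the training rows."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Constant columns have zero spread; use 1.0 to avoid dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_scaler(X, mean, std):
    """Shift and scale columns with previously fitted statistics."""
    return (X - mean) / std

# Hypothetical rows: a premature death rate and a rural fraction per county.
X_train = np.array([
    [8027.4, 0.42],
    [8118.4, 0.42],
    [12876.8, 0.68],
    [11191.5, 0.68],
])
X_test = np.array([[10787.0, 0.90]])

mean, std = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, mean, std)
X_test_scaled = apply_scaler(X_test, mean, std)  # reuses the training statistics
```

In the notebook this would be applied to `Xtr.values` and `Xte.values` before they are converted to tensors.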

In [13]:
class VotePredict(nn.Module):
    def __init__(self):
        super(VotePredict, self).__init__()
        self.hidden1 = nn.Linear(k, 50)
        self.hidden1_act = nn.ReLU()
        self.hidden2 = nn.Linear(50, 50)
        self.hidden2_act = nn.ReLU()
        self.hidden3 = nn.Linear(50, 50)
        self.hidden3_act = nn.ReLU()
        self.output = nn.Linear(50, outputs)

    def forward(self, x):
        x = self.hidden1(x)
        x = self.hidden1_act(x)
        x = self.hidden2(x)
        x = self.hidden2_act(x)
        x = self.hidden3(x)
        x = self.hidden3_act(x)
        x = self.output(x)
        return x

In [14]:
# Use the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
Xtr = torch.tensor(Xtr.values, dtype=torch.float32).to(device)
ytr = torch.tensor(ytr.values, dtype=torch.float32).to(device)
Xte = torch.tensor(Xte.values, dtype=torch.float32).to(device)
yte = torch.tensor(yte.values, dtype=torch.float32).to(device)

In [15]:
vp = VotePredict().to(Xtr.device)  # keep the model on the same device as the data
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(vp.parameters(), lr=.01)

EPOCHS = 10000

for epoch in range(EPOCHS):
    predictions = vp(Xtr)
    loss = criterion(predictions, ytr)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        with torch.no_grad():
            pred = vp(Xte)
            test_loss = criterion(pred, yte)
        print(f'Training Loss: {loss}, Testing Loss: {test_loss}')

Training Loss: 63.20655822753906, Testing Loss: 1380.306396484375
Training Loss: 0.877166748046875, Testing Loss: 0.8415446877479553
Training Loss: 0.5856888890266418, Testing Loss: 0.593890368938446
Training Loss: 0.5847991108894348, Testing Loss: 0.5888544321060181
Training Loss: 0.5774627327919006, Testing Loss: 0.5871855616569519
Training Loss: 0.5762054324150085, Testing Loss: 0.5855302214622498
Training Loss: 0.5742086172103882, Testing Loss: 0.5832658410072327
Training Loss: 0.5732352137565613, Testing Loss: 0.5821996927261353
Training Loss: 0.5721359252929688, Testing Loss: 0.5813013315200806
Training Loss: 0.5709612369537354, Testing Loss: 0.5806616544723511
Training Loss: 0.7324068546295166, Testing Loss: 0.7241612672805786
Training Loss: 0.588180661201477, Testing Loss: 0.6002880334854126
Training Loss: 0.5764032006263733, Testing Loss: 0.5836740732192993
Training Loss: 0.5724239945411682, Testing Loss: 0.5817884802818298
Training Loss: 0.5711221098899841, Testing Loss: 0.58
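
The cross entropy values above are hard to read as prediction quality. A minimal sketch of a more interpretable check follows: convert the logits to predicted vote shares with a softmax, then report the mean absolute error in vote share and how often the predicted winner matches the actual one. The function name `evaluate` and the example logits are assumptions; the true shares are the Autauga and Barbour County rows from the table above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def evaluate(logits, true_shares):
    """Return (mean absolute share error, fraction of winners predicted)."""
    pred = softmax(logits)
    mae = float(np.abs(pred - true_shares).mean())
    same_winner = pred.argmax(axis=1) == true_shares.argmax(axis=1)
    return mae, float(same_winner.mean())

# Hypothetical network outputs for two counties.
logits = np.array([
    [2.0, -3.0, 1.0, -2.0],
    [0.5, -3.0, 1.5, -2.0],
])
# Actual vote shares (Trump, Jorgensen, Biden, other).
true_shares = np.array([
    [0.714368, 0.0, 0.270184, 0.015448],
    [0.534512, 0.0, 0.457882, 0.007606],
])

mae, winner_acc = evaluate(logits, true_shares)  # winner_acc is 0.5 here
```

For the trained network, the inputs would be `vp(Xte).detach().cpu().numpy()` and `yte.cpu().numpy()`.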