### Dataset setup

The cells below combine the two sources into a single dataframe with one row per county. Each row holds the county's health and demographic measures together with the share of votes cast for each presidential candidate.

The data comes from:
- [County Health Rankings & Roadmaps](https://www.countyhealthrankings.org/)
- [The MIT Election Data and Science Lab](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ)

In [1]:
import pandas as pd

In [2]:
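df = pd.read_csv('analytic_data2023.csv')
# Keep the first five (identifier) columns plus every raw-value or ratio measure.
cols = list(df.columns[:5])
for col in df.columns:
    if 'raw value' in col or 'Ratio' in col:
        cols.append(col)
# Rows with County FIPS Code 0 are state- and national-level summaries; drop them.
cutdown = df[df['County FIPS Code'] != 0][cols]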

In [3]:
elect = pd.read_csv('countypres_2000-2020.csv')
elect = elect[elect['year'] == 2020]
elect = elect[['state', 'county_name', 'county_fips', 'candidate', 'party', 'candidatevotes', 'totalvotes']]
# One row per county, one column per candidate. aggfunc='sum' adds up counties whose
# votes are reported across several rows; the pivot_table default would average them.
elect_pivot = elect.pivot_table(index='county_fips', columns='candidate', values='candidatevotes', aggfunc='sum')
elect_pivot = elect_pivot.fillna(0)
# Convert raw counts into each candidate's share of the county's votes.
elect_pivot = elect_pivot.div(elect_pivot.sum(axis=1), axis=0)

In [4]:
merged = cutdown.merge(elect_pivot, left_on='5-digit FIPS Code', right_index=True, how='outer')
merged = merged.drop(columns=[
    'Living Wage raw value',
    'Children Eligible for Free or Reduced Price Lunch raw value',
    'Residential Segregation - Black/White raw value',
    'Child Care Cost Burden raw value',
    'Child Care Centers raw value',
])
# Keep only counties that appear in both the health data and the election data.
merged = merged.dropna(subset=['State FIPS Code', 'DONALD J TRUMP'])

In [5]:
merged.head()

| Unnamed: 0 | State FIPS Code | County FIPS Code | 5-digit FIPS Code | State Abbreviation | Name | Premature Death raw value | Poor or Fair Health raw value | Poor Physical Health Days raw value | Poor Mental Health Days raw value | Low Birthweight raw value | ... | % Native Hawaiian or Other Pacific Islander raw value | % Hispanic raw value | % Non-Hispanic White raw value | % Not Proficient in English raw value | % Female raw value | % Rural raw value | DONALD J TRUMP | JO JORGENSEN | JOSEPH R BIDEN JR | OTHER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.0 | 1.0 | 1.0 | 1001.0 | AL | Autauga County | 8027.394727 | 0.169 | 3.432211 | 4.797351 | 0.097382 | ... | 0.001185 | 0.033268 | 0.724545 | 0.002312 | 0.513783 | 0.420022 | 0.714368 | 0.0 | 0.270184 | 0.015448 |
| 3.0 | 1.0 | 3.0 | 1003.0 | AL | Baldwin County | 8118.358206 | 0.149 | 3.276177 | 4.75375 | 0.083857 | ... | 0.000685 | 0.048417 | 0.831488 | 0.007597 | 0.513477 | 0.422791 | 0.761714 | 0.0 | 0.22409 | 0.014196 |
| 4.0 | 1.0 | 5.0 | 1005.0 | AL | Barbour County | 12876.760319 | 0.275 | 4.605432 | 4.954855 | 0.119147 | ... | 0.002323 | 0.049591 | 0.453052 | 0.013827 | 0.467033 | 0.677896 | 0.534512 | 0.0 | 0.457882 | 0.007606 |
| 5.0 | 1.0 | 7.0 | 1007.0 | AL | Bibb County | 11191.474323 | 0.216 | 4.012182 | 5.364779 | 0.100331 | ... | 0.00129 | 0.030876 | 0.735641 | 0.004431 | 0.460159 | 0.683526 | 0.784263 | 0.0 | 0.206983 | 0.008755 |
| 6.0 | 1.0 | 9.0 | 1009.0 | AL | Blount County | 10787.014541 | 0.184 | 3.866048 | 5.37758 | 0.078599 | ... | 0.001236 | 0.098677 | 0.863298 | 0.017269 | 0.501922 | 0.899515 | 0.895716 | 0.0 | 0.095694 | 0.008591 |


### Assignment

Your job is to use a neural net to build the best predictor of the 2020 election that you can. Follow these steps:

- Identify NaNs, and decide what to do with them,
- Split the data into training and testing sets,
- Split off the targets from the training data,
- Construct a neural net with four logit outputs and cross entropy loss to predict the percentage vote for each candidate,
- Use testing error to find the best version of your network.

Completing these steps earns up to 90%. For the final 10%, perform an *ablation test*: remove inputs and observe whether, and by how much, your predictor gets worse. A steep drop in accuracy after removing an input suggests that the input is important for making this prediction.
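
One way to organize an ablation test is a loop that drops one input column at a time, retrains from scratch, and records how much the test loss rises relative to the full model. The sketch below does this on synthetic data with a deliberately small network so that it runs in seconds. `fit_and_score`, `ablation_table`, and the feature names are hypothetical helpers introduced for illustration, and the tiny architecture is not the `VoteModel` defined later in this notebook.

```python
import torch
import torch.nn as nn

def fit_and_score(Xtr, ytr, Xte, yte, epochs=300, seed=0):
    """Train a small MLP on soft labels and return its test cross entropy."""
    torch.manual_seed(seed)  # fixed seed so repeated runs give the same numbers
    model = nn.Sequential(
        nn.Linear(Xtr.shape[1], 16), nn.ReLU(), nn.Linear(16, ytr.shape[1])
    )
    criterion = nn.CrossEntropyLoss()  # probability targets need PyTorch >= 1.10
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(Xtr), ytr)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return criterion(model(Xte), yte).item()

def ablation_table(Xtr, ytr, Xte, yte, names):
    """Return the baseline test loss and the loss increase from removing each column."""
    base = fit_and_score(Xtr, ytr, Xte, yte)
    deltas = {}
    for j, name in enumerate(names):
        keep = [i for i in range(len(names)) if i != j]
        deltas[name] = fit_and_score(Xtr[:, keep], ytr, Xte[:, keep], yte) - base
    return base, deltas

# Synthetic data: column 0 drives the targets strongly, column 1 weakly, column 2 not at all.
torch.manual_seed(0)
X = torch.randn(400, 3)
true_logits = torch.stack([2 * X[:, 0], -2 * X[:, 0], 0.5 * X[:, 1]], dim=1)
y = torch.softmax(true_logits, dim=1)

base, deltas = ablation_table(X[:300], y[:300], X[300:], y[300:],
                              ["strong", "weak", "noise"])
print(f"baseline test loss: {base:.4f}")
for name, d in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{name:>6}: {d:+.4f}")
```

Because each retraining starts from a random initialization, small differences between runs are noise; repeating the loop with a few seeds gives a sense of how large a change has to be before it means anything.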

Keep in mind **correlation** vs **causation**.  We are discovering correlation, not causation.

In [6]:
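# Fill each missing value with its column mean (computed over all counties, before the train/test split).
merged[merged.columns[5:]] = merged[merged.columns[5:]].fillna(merged[merged.columns[5:]].mean())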
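
Before imputing, it can help to see which columns actually contain missing values and how many. The sketch below uses a small made-up frame (the values are illustrative, not taken from `merged`). It also shows a variant of mean imputation that computes the means on the training rows only, so that no test-set information leaks into the fill values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `merged`; the numbers are made up for illustration.
toy = pd.DataFrame({
    "Premature Death raw value": [8000.0, np.nan, 12000.0, 10000.0],
    "% Rural raw value": [0.4, 0.6, np.nan, np.nan],
    "% Female raw value": [0.51, 0.50, 0.47, 0.46],
})

# Number of missing entries per column, largest first.
nan_counts = toy.isna().sum().sort_values(ascending=False)
print(nan_counts)

# Mean imputation using statistics from the "training" rows only (the first three here).
train, test = toy.iloc[:3], toy.iloc[3:]
train_means = train.mean()
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
print(test_filled)
```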

In [7]:
Xdf = merged[ merged.columns[5:-4] ]
ydf = merged[ merged.columns[-4:] ]
n,k=Xdf.shape
print(n,k)

3115 84


In [8]:
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(Xdf.values, ydf.values, random_state=1)
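
The inputs span very different scales: `Premature Death raw value` is in the thousands, while most other columns are fractions below 1. Standardizing each column is one common way to make training less sensitive to this. The notebook as written does not do it, so treat the following as an optional sketch; the arrays are synthetic stand-ins for `Xtr` and `Xte`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: one large-scale column and one fraction-like column.
Xtr_demo = np.column_stack([rng.normal(9000, 2000, 200), rng.uniform(0, 1, 200)])
Xte_demo = np.column_stack([rng.normal(9000, 2000, 50), rng.uniform(0, 1, 50)])

# Fit the statistics on the training split only, then reuse them for the test split.
mu = Xtr_demo.mean(axis=0)
sigma = Xtr_demo.std(axis=0)
sigma[sigma == 0] = 1.0  # guard against constant columns

Xtr_scaled = (Xtr_demo - mu) / sigma
Xte_scaled = (Xte_demo - mu) / sigma

print(Xtr_scaled.mean(axis=0).round(6), Xtr_scaled.std(axis=0).round(6))
```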

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim

In [10]:
class VoteModel(nn.Module):
    """Three hidden ReLU layers followed by four raw logits, one per candidate column."""
    def __init__(self, n_feats):
        super().__init__()
        self.hidden1 = nn.Linear(n_feats, 100)
        self.hidden1_act = nn.ReLU()
        self.hidden2 = nn.Linear(100, 50)
        self.hidden2_act = nn.ReLU()
        self.hidden3 = nn.Linear(50, 50)
        self.hidden3_act = nn.ReLU()
        self.output = nn.Linear(50,4)

    def forward(self, x):
        x = self.hidden1(x)
        x = self.hidden1_act(x)
        x = self.hidden2(x)
        x = self.hidden2_act(x)
        x = self.hidden3(x)
        x = self.hidden3_act(x)
        x = self.output(x)
        return x

In [11]:
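# Use the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
Xtr = torch.tensor(Xtr, dtype=torch.float32).to(device)
Xte = torch.tensor(Xte, dtype=torch.float32).to(device)
ytr = torch.tensor(ytr, dtype=torch.float32).to(device)
yte = torch.tensor(yte, dtype=torch.float32).to(device)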

In [12]:
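model = VoteModel(k).to(Xtr.device)  # put the model on the same device as the data

# With floating-point targets shaped like the logits, CrossEntropyLoss treats
# each target row as class probabilities (requires PyTorch 1.10 or later).
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=.001)
EPOCHS = 10001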
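
Because the targets here are vote shares rather than a single class index, the loss for one county is the sum over candidates of `-p_c * log(softmax(z)_c)`, averaged over counties. The sketch below checks, on made-up numbers, that `nn.CrossEntropyLoss` with probability targets matches that formula.

```python
import torch
import torch.nn as nn

# Made-up logits and vote shares for two counties and four candidates.
logits = torch.tensor([[2.0, -1.0, 1.0, -2.0],
                       [0.5, -2.0, 1.5, -3.0]])
shares = torch.tensor([[0.68, 0.01, 0.30, 0.01],
                       [0.37, 0.01, 0.61, 0.01]])

# Built-in loss with probability targets (same shape as the logits).
builtin = nn.CrossEntropyLoss()(logits, shares)

# Manual version: average over counties of -sum_c p_c * log softmax(z)_c.
manual = -(shares * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()

print(builtin.item(), manual.item())
```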

In [13]:
for epoch in range(EPOCHS):
    # Full-batch step: forward pass on all training counties, then one Adam update.
    predictions = model(Xtr)
    loss = criterion(predictions, ytr)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Every 1000 epochs, report the training loss and the held-out test loss.
    if epoch%1000 == 0:
        print(f'epoch: {epoch} train_loss: {loss.item()}')
        with torch.no_grad():
            test_predictions = model(Xte)
            test_loss = criterion(test_predictions, yte)
            print(f'  test_loss: {test_loss.item()}')

epoch: 0 train_loss: 193.2246551513672
  test_loss: 155.3465576171875
epoch: 1000 train_loss: 3.8059353828430176
  test_loss: 2.7310922145843506
epoch: 2000 train_loss: 0.7183730006217957
  test_loss: 0.7316306829452515
epoch: 3000 train_loss: 0.6942815780639648
  test_loss: 0.7071449756622314
epoch: 4000 train_loss: 0.6866461634635925
  test_loss: 0.6968273520469666
epoch: 5000 train_loss: 0.7071458697319031
  test_loss: 0.7077979445457458
epoch: 6000 train_loss: 0.6850718855857849
  test_loss: 0.6917330026626587
epoch: 7000 train_loss: 0.681161642074585
  test_loss: 0.6894801259040833
epoch: 8000 train_loss: 0.6784219741821289
  test_loss: 0.6902355551719666
epoch: 9000 train_loss: 0.6879373788833618
  test_loss: 0.6969490051269531
epoch: 10000 train_loss: 0.6843848824501038
  test_loss: 0.6901247501373291
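
The plateau near 0.69 does not by itself mean the fit is poor. Cross entropy against soft targets decomposes as H(p, q) = H(p) + KL(p || q), so it can never fall below the entropy H(p) of the true vote shares. As a rough illustration, the sketch below evaluates this for the three counties whose actual and predicted shares are printed later in this notebook (values copied from that output). The floor over the full test set is not computed here.

```python
import numpy as np

# Actual and predicted shares for Cass, St. Louis and Anne Arundel, copied from
# the notebook's printed output (columns: Trump, Jorgensen, Biden, Other).
actual = np.array([[0.68351089, 0.00942774, 0.30451589, 0.00254549],
                   [0.37281129, 0.01143515, 0.61324657, 0.00250699],
                   [0.41573686, 0.0151306, 0.56210553, 0.00702701]])
predicted = np.array([[0.7204388, 0.01245539, 0.26291192, 0.00419395],
                      [0.3957362, 0.00976669, 0.5910058, 0.00349131],
                      [0.36002544, 0.01389829, 0.62118596, 0.00489032]])

entropy = -(actual * np.log(actual)).sum(axis=1)           # H(p): lowest achievable loss
cross_entropy = -(actual * np.log(predicted)).sum(axis=1)  # H(p, q): what the loss measures
kl = cross_entropy - entropy                               # KL(p || q): the reducible part

for name, h, ce, d in zip(["Cass", "St. Louis", "AA"], entropy, cross_entropy, kl):
    print(f"{name:>10}: floor={h:.4f} loss={ce:.4f} gap={d:.4f}")
```

For these three counties the gap between the loss and its floor is small (on the order of 0.01 or less), which suggests comparing models on that gap, or on another error measure, rather than on the raw cross-entropy value.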


In [14]:
# Row labels in `merged` for three counties to inspect.
indices = [618, 1605, 1216]
counties = ['Cass', 'St. Louis', 'AA']  # 'AA' is Anne Arundel County, MD

In [15]:
ofInterest = merged.loc[indices]
ofInterest

| Unnamed: 0 | State FIPS Code | County FIPS Code | 5-digit FIPS Code | State Abbreviation | Name | Premature Death raw value | Poor or Fair Health raw value | Poor Physical Health Days raw value | Poor Mental Health Days raw value | Low Birthweight raw value | ... | % Native Hawaiian or Other Pacific Islander raw value | % Hispanic raw value | % Non-Hispanic White raw value | % Not Proficient in English raw value | % Female raw value | % Rural raw value | DONALD J TRUMP | JO JORGENSEN | JOSEPH R BIDEN JR | OTHER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 618.0 | 17.0 | 17.0 | 17017.0 | IL | Cass County | 8575.094202 | 0.166 | 3.301286 | 3.838231 | 0.090244 | ... | 0.002114 | 0.206921 | 0.726924 | 0.054994 | 0.496281 | 0.521331 | 0.683511 | 0.009428 | 0.304516 | 0.002545 |
| 1605.0 | 29.0 | 189.0 | 29189.0 | MO | St. Louis County | 8753.449963 | 0.128 | 3.03685 | 4.449772 | 0.097503 | ... | 0.000182 | 0.031497 | 0.646994 | 0.010331 | 0.521771 | 0.01138 | 0.372811 | 0.011435 | 0.613247 | 0.002507 |
| 1216.0 | 24.0 | 3.0 | 24003.0 | MD | Anne Arundel County | 6936.516502 | 0.099 | 2.62228 | 4.433571 | 0.077281 | ... | 0.001284 | 0.08951 | 0.650828 | 0.013995 | 0.502644 | 0.053041 | 0.415737 | 0.015131 | 0.562106 | 0.007027 |


In [16]:
print(counties)
print()

inputs = ofInterest[ ofInterest.columns[5:-4]].values
truth = ofInterest[ ofInterest.columns[-4:]].values

print('ORDER')
print(ofInterest.columns[-4:])

inputs = torch.tensor(inputs, dtype=torch.float32).to(next(model.parameters()).device)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(inputs)
print()
print('LOGITS:')
print(logits.cpu().numpy())

predprobs = torch.softmax(logits, dim=1)
print()
print('PREDICTED PROBS')
print(predprobs.cpu().numpy())

print()
print('ACTUAL PROBS')
print(truth)

['Cass', 'St. Louis', 'AA']

ORDER
Index(['DONALD J TRUMP', 'JO JORGENSEN', 'JOSEPH R BIDEN JR', 'OTHER'], dtype='object')

LOGITS:
[[ 3.384496   -0.67321074  2.3764546  -1.7617202 ]
 [ 0.9247706  -2.777       1.3258486  -3.8056989 ]
 [ 8.648645    5.3942366   9.194101    4.3497276 ]]

PREDICTED PROBS
[[0.7204388  0.01245539 0.26291192 0.00419395]
 [0.3957362  0.00976669 0.5910058  0.00349131]
 [0.36002544 0.01389829 0.62118596 0.00489032]]

ACTUAL PROBS
[[0.68351089 0.00942774 0.30451589 0.00254549]
 [0.37281129 0.01143515 0.61324657 0.00250699]
 [0.41573686 0.0151306  0.56210553 0.00702701]]
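
The predicted shares above are simply the softmax of each logit row. The check below reproduces them with NumPy from the printed logits and adds a per-county error measure, the mean absolute difference in vote share. That error measure is an illustrative choice, not part of the original notebook.

```python
import numpy as np

# Values copied from the printed output above.
logits = np.array([[3.384496, -0.67321074, 2.3764546, -1.7617202],
                   [0.9247706, -2.777, 1.3258486, -3.8056989],
                   [8.648645, 5.3942366, 9.194101, 4.3497276]])
printed_probs = np.array([[0.7204388, 0.01245539, 0.26291192, 0.00419395],
                          [0.3957362, 0.00976669, 0.5910058, 0.00349131],
                          [0.36002544, 0.01389829, 0.62118596, 0.00489032]])
actual = np.array([[0.68351089, 0.00942774, 0.30451589, 0.00254549],
                   [0.37281129, 0.01143515, 0.61324657, 0.00250699],
                   [0.41573686, 0.0151306, 0.56210553, 0.00702701]])

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(logits)
mae = np.abs(probs - actual).mean(axis=1)  # mean absolute share error per county

print(probs.round(4))
print(mae.round(4))
```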


In [17]:
# Ablation: drop the last feature column ('% Rural raw value').
Xtr_ab = Xtr[:,:-1]
Xte_ab = Xte[:,:-1]

In [20]:
model2 = VoteModel(k-1).to(Xtr_ab.device)  # one fewer input; same device as the data

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model2.parameters(), lr=.001)
EPOCHS = 10001

In [21]:
for epoch in range(EPOCHS):
    predictions = model2(Xtr_ab)
    loss = criterion(predictions, ytr)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch%1000 == 0:
        print(f'epoch: {epoch} train_loss: {loss.item()}')
        with torch.no_grad():
            predictions = model2(Xte_ab)
            test_loss = criterion(predictions, yte)
            print(f'  test_loss: {test_loss.item()}')

epoch: 0 train_loss: 499.0849609375
  test_loss: 145.1230010986328
epoch: 1000 train_loss: 2.5738470554351807
  test_loss: 1.4926680326461792
epoch: 2000 train_loss: 0.8526983857154846
  test_loss: 0.7514132857322693
epoch: 3000 train_loss: 0.7004548907279968
  test_loss: 0.7080554962158203
epoch: 4000 train_loss: 0.684840977191925
  test_loss: 0.6998701095581055
epoch: 5000 train_loss: 0.6879040002822876
  test_loss: 0.6966198086738586
epoch: 6000 train_loss: 0.8504706621170044
  test_loss: 0.8525370359420776
epoch: 7000 train_loss: 0.690767228603363
  test_loss: 0.6942671537399292
epoch: 8000 train_loss: 0.6843013167381287
  test_loss: 0.6889854669570923
epoch: 9000 train_loss: 0.6803603172302246
  test_loss: 0.6883827447891235
epoch: 10000 train_loss: 0.6800031065940857
  test_loss: 0.6864268779754639
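
To read the two runs side by side: the final test loss without `% Rural` is slightly lower than with it, and the difference is smaller than the swings the baseline run shows between its last few checkpoints. On this single run, then, removing `% Rural` shows no measurable harm. The arithmetic, using the test losses printed in the two training logs:

```python
# Test losses copied from the two training logs (epochs 7000 to 10000).
baseline_tail = [0.6894801259040833, 0.6902355551719666,
                 0.6969490051269531, 0.6901247501373291]
ablated_tail = [0.6942671537399292, 0.6889854669570923,
                0.6883827447891235, 0.6864268779754639]

delta = ablated_tail[-1] - baseline_tail[-1]      # change in final test loss
spread = max(baseline_tail) - min(baseline_tail)  # checkpoint-to-checkpoint variation

print(f"delta = {delta:+.4f}, baseline spread = {spread:.4f}")
# prints: delta = -0.0037, baseline spread = 0.0075
```

Both numbers come from one training run per model with one random initialization each, so repeating the comparison over several seeds would give a firmer answer.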
