Progress summary: I trained random forest and logistic regression models to predict the cluster labels for the four clusters. Both reach high cross-validated accuracy (about 0.90 for the tuned random forest and about 0.96 for logistic regression).

I printed the most important features for the random forest and the coefficients with the largest absolute values for logistic regression.

A possible next step is more EDA on these features; I'm not sure what else to try.

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report

In [2]:
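def cv_score(clf, x, y, score_func=accuracy_score):
    """Average score_func over 5 unshuffled folds (x is an array, y is a pandas Series)."""
    result = 0
    nfold = 5
    # random_state only applies when shuffle=True (newer scikit-learn raises an error otherwise), so it is omitted
    for train, test in KFold(nfold).split(x): # split data into train/test groups, 5 times
        clf.fit(x[train], y.iloc[train]) # fit
        result += score_func(y.iloc[test], clf.predict(x[test])) # score_func expects (y_true, y_pred)
    return result / nfold # average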
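
For comparison, scikit-learn's built-in cross_val_score can compute the same per-fold average. The sketch below runs on synthetic stand-in data from make_classification (an assumption, not the survey data) and checks that the built-in and a hand-rolled loop agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption; not the contents of clustered.csv).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

clf = LogisticRegression(max_iter=1000)
folds = KFold(n_splits=5)  # unshuffled 5-fold split, as in cv_score

# Built-in: one accuracy per fold, then average.
builtin_mean = cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()

# Hand-rolled loop mirroring cv_score.
manual = []
for train, test in folds.split(X):
    clf.fit(X[train], y[train])
    manual.append(accuracy_score(y[test], clf.predict(X[test])))
manual_mean = float(np.mean(manual))

print(builtin_mean, manual_mean)
```

One difference worth noting: cross_val_score fits clones of the estimator, so the object passed in is left unfitted, whereas the hand-rolled loop leaves it fit on the last fold.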

In [3]:
df = pd.read_csv('clustered.csv')

In [4]:
variables = [col for col in df.columns if col not in ('four_cluster_label', 'Unnamed: 0')]

print(variables)

['gender', 'legalstatus', 'borderpatrol', 'legalstatusHS', 'deport', 'minwage12', 'faminc', 'mandatorymin', 'bodycamera', 'threestrikes', 'pew_bornagain', 'increasepolice', 'fuelefficiency', 'EPACO2', 'abortionchoice', 'abortioncoverage', 'abortion20wks', 'repealACA', 'banmostabortion', 'child18', 'TPP', 'NCLB', 'primary', 'Iransanctions', 'infraspending', 'backgroundcheck', 'medicarereform', 'concealedcarry', 'gunregistry', 'banassault', 'gaymarriage', 'hispanic', 'investor', 'trans', 'votereg_post', 'militaryoil', 'militaryterror', 'militarycivilwar', 'militarydemocracy', 'militaryally', 'militaryUN', 'polmeeting', 'polsign', 'campaignwork', 'campaigndonate', 'donateblood', 'edloan', 'runoffice', 'Obama', 'Romney', 'age', 'whiteadvantage_Disagree', 'whiteadvantage_Neutral', 'angryracism_Disagree', 'angryracism_Neutral', 'racismrare_Disagree', 'racismrare_Neutral', 'fearrace_Disagree', 'fearrace_Neutral', 'statewelfare_Increase', 'statewelfare_Maintain', 'statehealthcare_Increase', 's

In [5]:
Xtrain, Xtest, ytrain, ytest = train_test_split(df[variables].values, df['four_cluster_label'], random_state = 42, test_size = 0.2)

In [6]:
rf_clf = RandomForestClassifier(random_state = 42, class_weight = 'balanced')
cv_score(rf_clf, Xtrain, ytrain)

0.88671945995066859

In [7]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)

rf_clf.fit(Xtrain, ytrain)
print(classification_report(ytrain, rf_clf.predict(Xtrain)))

             precision    recall  f1-score   support

          0       0.99      1.00      1.00      9140
          1       1.00      1.00      1.00     12336
          2       1.00      0.99      1.00      6593
          3       1.00      1.00      1.00     10446

avg / total       1.00      1.00      1.00     38515



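Training scores are essentially 1.00, versus a cross-validated accuracy of about 0.89, so the default forest is overfitting.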
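
One way to quantify overfitting is the gap between training and held-out accuracy. The following is a minimal sketch on synthetic data (an assumption, as are the helper name train_test_gap and the chosen hyperparameters); a depth-limited forest will typically show a smaller gap than an unrestricted one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in data (an assumption; not the survey data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=4, n_clusters_per_class=1,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def train_test_gap(clf):
    """Fit clf and return (train accuracy, held-out accuracy, difference)."""
    clf.fit(X_tr, y_tr)
    train_acc = clf.score(X_tr, y_tr)
    test_acc = clf.score(X_te, y_te)
    return train_acc, test_acc, train_acc - test_acc

models = {
    "unrestricted": RandomForestClassifier(n_estimators=50, random_state=0),
    "max_depth=3": RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0),
}
results = {name: train_test_gap(clf) for name, clf in models.items()}
for name, (train_acc, test_acc, gap) in results.items():
    print(f"{name}: train={train_acc:.3f} held-out={test_acc:.3f} gap={gap:.3f}")
```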

In [8]:
from sklearn.model_selection import GridSearchCV

In [9]:
param_grid = {'max_depth':[10,20,100,500], 'min_impurity_decrease':[1e-7,1e-6,1e-5, 1e-4, 1e-3, 1e-2]}
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)
rf_clf_cv = GridSearchCV(rf_clf, param_grid, cv = 5)
rf_clf_cv.fit(Xtrain, ytrain)

print(rf_clf_cv.best_params_)

{'max_depth': 20, 'min_impurity_decrease': 0.0001}
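
Besides best_params_, a fitted GridSearchCV exposes best_score_ and a cv_results_ dictionary, which helps show how close the runner-up settings are. This is a small sketch on synthetic data with a reduced grid (both assumptions, chosen only to keep the example fast).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the survey data).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# A deliberately small grid so the sketch runs quickly.
param_grid = {"max_depth": [2, 5, 10], "min_impurity_decrease": [1e-4, 1e-2]}
search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(search.cv_results_)
cols = ["param_max_depth", "param_min_impurity_decrease",
        "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[cols].sort_values("rank_test_score")

print(summary)
print(search.best_params_, search.best_score_)
```

If the top few rows differ by less than their std_test_score, the choice between them is probably within noise.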


In [20]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 20, min_impurity_decrease = 0.0001)

cv_score(rf_clf, Xtrain, ytrain)

0.89178242243281824

CV score is slightly better after tuning. Narrowing down the grid search:

In [24]:
param_grid = {'max_depth':[10,20,25,30,50,100], 'min_impurity_decrease':[1e-6,5e-5,1e-4,5e-3,1e-3]}
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)
rf_clf_cv = GridSearchCV(rf_clf, param_grid, cv = 5)
rf_clf_cv.fit(Xtrain, ytrain)

print(rf_clf_cv.best_params_)

{'max_depth': 20, 'min_impurity_decrease': 5e-05}


In [23]:
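# Note: max_depth = 25 is used here, although the grid search output above reports max_depth = 20 as best
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 25, min_impurity_decrease = 5e-05)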

cv_score(rf_clf, Xtrain, ytrain)

0.89552122549655988

In [25]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 25, min_impurity_decrease = 5e-05)

rf_clf.fit(Xtrain, ytrain)
print(classification_report(ytrain, rf_clf.predict(Xtrain)))

             precision    recall  f1-score   support

          0       0.94      0.95      0.94      9140
          1       0.98      0.97      0.97     12336
          2       0.92      0.97      0.95      6593
          3       0.98      0.95      0.96     10446

avg / total       0.96      0.96      0.96     38515



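On the training set, recall is between 0.95 and 0.97 and precision between 0.92 and 0.98 for all four classes. These scores are much closer to the cross-validated accuracy (about 0.90) than the near-perfect training scores of the untuned forest, so the tuned model overfits less.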

In [26]:
# Pair each feature name with its importance and sort from most to least important
ranked = sorted(zip(variables, rf_clf.feature_importances_), key = lambda x: x[1], reverse = True)
ranked

[('stateedu_Increase', 0.062096795211724322),
 ('stateLE_Maintain', 0.05548018141892147),
 ('pres_Trump', 0.054110428251454699),
 ('stateLE_Increase', 0.050504341430196831),
 ('Romney', 0.046799458441476628),
 ('statehealthcare_Increase', 0.043699247870838134),
 ('repealACA', 0.041898778589493361),
 ('Obama', 0.036654067106673641),
 ('stateinfra_Maintain', 0.033218776264035202),
 ('abortion20wks', 0.033018118241591995),
 ('minwage12', 0.032743429221233314),
 ('Party_Republican', 0.023444680423831368),
 ('EPACO2', 0.022806179628141778),
 ('deport', 0.020794325230335132),
 ('stateedu_Maintain', 0.0197410603400071),
 ('campaigndonate', 0.01969761482866992),
 ('stateinfra_Increase', 0.017895441305307513),
 ('abortionchoice', 0.017626960995552853),
 ('banmostabortion', 0.016702156498949166),
 ('statehealthcare_Maintain', 0.016595189318431562),
 ('abortioncoverage', 0.01629464598355649),
 ('concealedcarry', 0.016227361477420717),
 ('pres_None', 0.013342681851936298),
 ('gaymarriage', 0.01316

Interestingly, the 2012 vote (Romney, Obama) ranks among the most important features.
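
The ranking above uses the forest's impurity-based feature_importances_. As a cross-check, permutation importance on held-out data is a different technique that could be run alongside it; the sketch below uses synthetic data and made-up feature names (both assumptions) and is not what the notebook itself computed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names (assumptions).
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Permutation importance: mean drop in validation accuracy when a column is shuffled.
perm = permutation_importance(rf, X_val, y_val, n_repeats=5, random_state=0)

ranked_perm = sorted(zip(names, perm.importances_mean),
                     key=lambda t: t[1], reverse=True)
ranked_impurity = sorted(zip(names, rf.feature_importances_),
                         key=lambda t: t[1], reverse=True)

for (p_name, p_val), (i_name, i_val) in zip(ranked_perm, ranked_impurity):
    print(f"permutation: {p_name} {p_val:.3f} | impurity: {i_name} {i_val:.3f}")
```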

In [15]:
log_clf = LogisticRegression(class_weight = 'balanced')

cv_score(log_clf, Xtrain, ytrain)

0.95503050759444363

Quite good out of the box.

In [16]:
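# log_clf was last fit inside cv_score, so these coefficients come from the final CV fold's training split, not all of Xtrain
coefficients = log_clf.coef_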
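
Coefficient magnitudes are only comparable across features that share a scale. The notebook does not say whether columns such as age were rescaled, so the following is a sketch under that uncertainty: it standardizes features in a pipeline before fitting, using synthetic data (an assumption) with one column deliberately inflated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the survey data).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)
X[:, 0] *= 50.0  # put one column on a much larger scale, e.g. a raw age in years

# Standardize every feature, then fit the classifier on the full training data.
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(class_weight="balanced", max_iter=1000))
pipe.fit(X, y)

# One row of coefficients per class, one column per (standardized) feature.
coefs = pipe.named_steps["logisticregression"].coef_
print(coefs.shape)
```

With standardized inputs, each coefficient describes the effect of a one-standard-deviation change, which makes ranking by absolute value easier to interpret.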

Print the 10 coefficients with the largest absolute value for each cluster:

In [17]:
for i in range(0,4):
    cluster_coefs = []
    for item, item2 in zip(variables, coefficients[i]):
        cluster_coefs.append((item, item2))
    cluster_coefs = sorted(cluster_coefs, key = lambda x: abs(x[1]), reverse = True)
    print("Cluster",i)
    for j in range(10):
        print(cluster_coefs[j])

Cluster 0
('Romney', -1.595174941638742)
('stateLE_Increase', 1.3688140965060851)
('stateLE_Maintain', -1.3337265712282713)
('statehealthcare_Increase', 1.2744207580084015)
('campaigndonate', -1.2129050673673913)
('age', -1.1083470618677991)
('stateedu_Increase', 1.0774949910932754)
('pres_Johnson', 0.91184881220532921)
('abortion20wks', 0.88947554311442145)
('runoffice', 0.81708584422050801)
Cluster 1
('Romney', 2.8067926491891084)
('pres_Trump', 2.4387703026415486)
('EPACO2', -2.2837649902200807)
('minwage12', -1.9633614412995659)
('statehealthcare_Increase', -1.8784000404583596)
('stateLE_Maintain', -1.8763254868465344)
('Party_Republican', 1.8707182311824313)
('TPP', -1.7411768002305101)
('Obama', -1.7397960200659095)
('banassault', -1.7046230478533297)
Cluster 2
('stateLE_Increase', -1.9481623729776008)
('stateinfra_Increase', -1.5894853786886598)
('stateedu_Increase', -1.5459652257505498)
('Romney', -1.4439755017678089)
('statehealthcare_Increase', -1.4153609790833266)
('pres_Joh

In [18]:
for i in [0,2]:
    cluster_coefs = []
    for item, item2 in zip(variables, coefficients[i]):
        cluster_coefs.append((item, item2))
    cluster_coefs = sorted(cluster_coefs, key = lambda x: abs(x[1]), reverse = True)
    print("Cluster",i)
    for j in range(25):
        print(cluster_coefs[j])

Cluster 0
('Romney', -1.595174941638742)
('stateLE_Increase', 1.3688140965060851)
('stateLE_Maintain', -1.3337265712282713)
('statehealthcare_Increase', 1.2744207580084015)
('campaigndonate', -1.2129050673673913)
('age', -1.1083470618677991)
('stateedu_Increase', 1.0774949910932754)
('pres_Johnson', 0.91184881220532921)
('abortion20wks', 0.88947554311442145)
('runoffice', 0.81708584422050801)
('stateedu_Maintain', -0.81258012657265644)
('religimp_Very important', 0.78427727870686348)
('pres_Trump', -0.77230388459376875)
('banmostabortion', 0.75246707467827967)
('race_Black', 0.74559672149994882)
('EPACO2', 0.73453459338142479)
('TPP', 0.71456886155642718)
('minwage12', 0.68193209578937375)
('relig_Muslim', 0.67814559149220832)
('increasepolice', 0.67448531972917602)
('primary', -0.64693341762060785)
('relig_Roman Catholic', 0.63778010663393614)
('pres_Other', 0.63019151466405743)
('pres_McMullin', 0.60751411469709271)
('college', -0.59295158568350681)
Cluster 2
('stateLE_Increase', -1.

Clusters 0 and 2 are the most interesting because they are less partisan. People in cluster 0 tend to favor increasing law enforcement spending, tended not to vote for Romney, are younger, tended to vote for Johnson or Other, and tended not to donate to campaigns.

Cluster 2 leans against law enforcement and infrastructure spending (large negative coefficients on stateLE_Increase and stateinfra_Increase) and is associated with voting for McMullin.
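
The two coefficient-printing cells differ only in which clusters they loop over and how many rows they print, so a small helper could cover both. The function name top_coefficients and the toy inputs below are hypothetical; in the notebook it would be called with variables and coefficients.

```python
import numpy as np

def top_coefficients(names, coef_matrix, class_index, k=10):
    """Return the k (name, coefficient) pairs with the largest |coefficient| for one class."""
    row = np.asarray(coef_matrix)[class_index]
    order = np.argsort(-np.abs(row))[:k]  # indices sorted by descending magnitude
    return [(names[i], float(row[i])) for i in order]

# Toy example with made-up names and values (assumptions for illustration).
toy_names = ["a", "b", "c", "d"]
toy_coefs = np.array([[0.5, -2.0, 1.0, 0.1],
                      [-0.3, 0.2, -1.5, 0.9]])

for cluster in range(toy_coefs.shape[0]):
    print("Cluster", cluster)
    for pair in top_coefficients(toy_names, toy_coefs, cluster, k=2):
        print(pair)
```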