Progress summary: I've trained random forest and logistic regression models to predict the cluster labels for the four clusters, both with high cross-validated accuracy (about 0.89 for the random forest and 0.96 for logistic regression).

I printed the most important features (for the random forest) and the largest-magnitude coefficients (for logistic regression).

A possible next step is more EDA on these features; I'm not sure what else.

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report

In [2]:
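def cv_score(clf, x, y, score_func=accuracy_score):
    """Return the mean of score_func over 5 unshuffled folds."""
    result = 0
    nfold = 5
    # random_state only matters when shuffle=True, and recent scikit-learn
    # versions raise an error if it is set without shuffling, so it is omitted
    for train, test in KFold(nfold).split(x): # split data into train/test groups, 5 times
        clf.fit(x[train], y.iloc[train]) # fit
        result += score_func(y.iloc[test], clf.predict(x[test])) # score_func expects (y_true, y_pred)
    return result / nfold # average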
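
For comparison, scikit-learn's built-in cross_val_score does roughly the same thing as cv_score. The sketch below is only an illustration: it runs on synthetic data from make_classification (an assumption, not the survey data), and the names X_demo, y_demo and folds are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the survey data (assumption: not the real clustered.csv)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=4, random_state=0
)

clf = LogisticRegression(max_iter=1000)
folds = KFold(n_splits=5)  # unshuffled folds, like cv_score

# One accuracy value per fold; the mean corresponds to what cv_score returns
scores = cross_val_score(clf, X_demo, y_demo, cv=folds, scoring="accuracy")
mean_score = scores.mean()
print(scores, mean_score)
```

One difference worth knowing: cross_val_score fits clones of the estimator, so clf itself stays unfitted afterwards, whereas cv_score leaves clf fitted on the last fold's training split.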

In [3]:
df = pd.read_csv('clustered.csv')

In [4]:
variables = [col for col in df.columns if (col != 'four_cluster_label' and col != 'Unnamed: 0')]
             
print(variables)

['gender', 'legalstatus', 'borderpatrol', 'legalstatusHS', 'deport', 'minwage12', 'faminc', 'mandatorymin', 'bodycamera', 'threestrikes', 'pew_bornagain', 'increasepolice', 'fuelefficiency', 'EPACO2', 'abortionchoice', 'abortioncoverage', 'abortion20wks', 'repealACA', 'banmostabortion', 'child18', 'TPP', 'NCLB', 'primary', 'Iransanctions', 'infraspending', 'backgroundcheck', 'medicarereform', 'concealedcarry', 'gunregistry', 'banassault', 'gaymarriage', 'hispanic', 'investor', 'trans', 'votereg_post', 'militaryoil', 'militaryterror', 'militarycivilwar', 'militarydemocracy', 'militaryally', 'militaryUN', 'polmeeting', 'polsign', 'campaignwork', 'campaigndonate', 'donateblood', 'edloan', 'runoffice', 'Obama', 'Romney', 'age', 'whiteadvantage_Disagree', 'whiteadvantage_Neutral', 'angryracism_Disagree', 'angryracism_Neutral', 'racismrare_Disagree', 'racismrare_Neutral', 'fearrace_Disagree', 'fearrace_Neutral', 'statewelfare_Increase', 'statewelfare_Maintain', 'statehealthcare_Increase', 's

In [5]:
Xtrain, Xtest, ytrain, ytest = train_test_split(df[variables].values, df['four_cluster_label'], random_state = 42, test_size = 0.2)

In [6]:
rf_clf = RandomForestClassifier(random_state = 42, class_weight = 'balanced')
cv_score(rf_clf, Xtrain, ytrain)

0.88793976372841743

In [7]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)

rf_clf.fit(Xtrain, ytrain)
print(classification_report(ytrain, rf_clf.predict(Xtrain)))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6726
          1       1.00      1.00      1.00      9422
          2       1.00      1.00      1.00     10188
          3       1.00      1.00      1.00     12179

avg / total       1.00      1.00      1.00     38515



Perfect training-set scores against a cross-validated accuracy of about 0.89 mean the untuned forest is overfitting.
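
One way to quantify the overfitting is the gap between training accuracy and cross-validated accuracy. The sketch below uses synthetic data (an assumption, not the survey data) and compares a fully grown forest with a depth-limited one; the variable names and the depth of 5 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, slightly noisy four-class data (assumption: stand-in for the real features)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=6, n_classes=4,
    flip_y=0.1, random_state=0
)

train_acc, cv_acc = {}, {}
for depth in (None, 5):  # None = grow trees fully, 5 = arbitrary depth limit
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    # Cross-validated accuracy is measured on data each fold's model has not seen
    cv_acc[depth] = cross_val_score(forest, X_demo, y_demo, cv=5).mean()
    # Training accuracy is measured on the same data the model was fit on
    train_acc[depth] = forest.fit(X_demo, y_demo).score(X_demo, y_demo)
    print(depth, round(train_acc[depth], 3), round(cv_acc[depth], 3))
```

A large gap between the two numbers for a given setting is the same symptom as the 1.00 training report versus the 0.89 CV score here.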

In [8]:
from sklearn.model_selection import GridSearchCV

In [9]:
param_grid = {'max_depth':[10,20,100,500], 'min_impurity_decrease':[1e-7,1e-6,1e-5, 1e-4, 1e-3, 1e-2]}
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)
rf_clf_cv = GridSearchCV(rf_clf, param_grid, cv = 5)
rf_clf_cv.fit(Xtrain, ytrain)

print(rf_clf_cv.best_params_)

{'max_depth': 10, 'min_impurity_decrease': 1e-06}
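
The selected parameters do not have to be retyped by hand: a fitted GridSearchCV also exposes the best cross-validated score and a refitted best model. A minimal sketch on synthetic data (an assumption; small_grid and the other names are invented for the example and the grid is deliberately tiny):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for Xtrain / ytrain (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)

small_grid = {"max_depth": [3, 10], "min_impurity_decrease": [1e-6, 1e-3]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42), small_grid, cv=3
)
search.fit(X_demo, y_demo)

best_params = search.best_params_        # same dict that gets printed above
best_cv_accuracy = search.best_score_    # mean CV accuracy of the best setting
tuned_forest = search.best_estimator_    # refit on all of X_demo (refit=True is the default)
print(best_params, round(best_cv_accuracy, 3))
```

Using best_estimator_ directly avoids any mismatch between the parameters the search reports and the ones passed to the next classifier.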


In [10]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 20, min_impurity_decrease = 0.0001)

cv_score(rf_clf, Xtrain, ytrain)

0.89025055173309098

With max_depth = 20 and min_impurity_decrease = 1e-4, CV accuracy is only slightly higher than for the default forest (0.890 vs. 0.888). Narrowing down the grid search:

In [11]:
param_grid = {'max_depth':[10,20,25,30,50,100], 'min_impurity_decrease':[1e-6,5e-5,1e-4,5e-3,1e-3]}
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)
rf_clf_cv = GridSearchCV(rf_clf, param_grid, cv = 5)
rf_clf_cv.fit(Xtrain, ytrain)

print(rf_clf_cv.best_params_)

{'max_depth': 20, 'min_impurity_decrease': 5e-05}


In [12]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 25, min_impurity_decrease = 5e-05)

cv_score(rf_clf, Xtrain, ytrain)

0.8934960405036998

In [13]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 25, min_impurity_decrease = 5e-05)

rf_clf.fit(Xtrain, ytrain)
print(classification_report(ytrain, rf_clf.predict(Xtrain)))

             precision    recall  f1-score   support

          0       0.92      0.96      0.94      6726
          1       0.93      0.94      0.94      9422
          2       0.97      0.95      0.96     10188
          3       0.98      0.96      0.97     12179

avg / total       0.96      0.95      0.95     38515



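On the training set, recall is between 0.94 and 0.96 for all four classes and precision is between 0.92 and 0.98. The gap to the cross-validated accuracy (0.893) is much smaller than for the untuned forest, so the tuned classifier is pretty good overall.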
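
The report above is computed on the training data. The split from In [5] also produced Xtest and ytest, which have not been used so far, and a report on those rows would be a less optimistic check. The sketch below shows the pattern on synthetic data (an assumption; X_tr, X_te and the other names are invented), using the same forest settings as In [13].

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=6, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

forest = RandomForestClassifier(
    class_weight="balanced", random_state=42,
    max_depth=25, min_impurity_decrease=5e-05,
)
forest.fit(X_tr, y_tr)

# Per-class precision / recall on rows the model never saw during fitting
test_pred = forest.predict(X_te)
print(classification_report(y_te, test_pred))
report = classification_report(y_te, test_pred, output_dict=True)
```

In the notebook itself this would amount to calling classification_report(ytest, rf_clf.predict(Xtest)) after the fit in In [13].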

In [14]:
# Pair each feature name with its importance and sort from most to least important
ranked = sorted(zip(variables, rf_clf.feature_importances_), key = lambda x: x[1], reverse = True)
ranked

[('Romney', 0.070745452080867355),
 ('stateLE_Increase', 0.065715457859039358),
 ('repealACA', 0.063744443518362892),
 ('abortioncoverage', 0.058143864303410356),
 ('stateedu_Increase', 0.053947661287040802),
 ('statehealthcare_Increase', 0.040667231614730598),
 ('stateLE_Maintain', 0.037446680337461652),
 ('Obama', 0.033869736733895137),
 ('stateinfra_Increase', 0.028069483356168858),
 ('pres_Trump', 0.027236463906360225),
 ('legalstatusHS', 0.026968354437650964),
 ('minwage12', 0.026373158023005151),
 ('abortion20wks', 0.025283542374712392),
 ('stateedu_Maintain', 0.021759069124853505),
 ('EPACO2', 0.021515057853525255),
 ('campaigndonate', 0.018933648902625246),
 ('ideo_Moderate or Not sure', 0.01887009923792158),
 ('abortionchoice', 0.01867247680193259),
 ('deport', 0.016300032924500985),
 ('primary', 0.016165903844125435),
 ('stateinfra_Maintain', 0.015488909975746668),
 ('gaymarriage', 0.014820597343414791),
 ('statehealthcare_Maintain', 0.014688327272316556),
 ('age', 0.01143451

Interestingly, the 2012 presidential vote is a major factor: Romney is the most important feature and Obama ranks eighth.
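
Impurity-based importances from a random forest can be cross-checked with permutation importance, a different technique that measures how much held-out accuracy drops when one feature's values are shuffled. This is not something the notebook does; the sketch below is only an illustration on synthetic data, with hypothetical feature names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names (assumption: not the survey columns)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in the validation set and record the mean drop in accuracy
perm = permutation_importance(forest, X_val, y_val, n_repeats=5, random_state=42)

comparison = pd.DataFrame(
    {
        "impurity_importance": forest.feature_importances_,
        "permutation_importance": perm.importances_mean,
    },
    index=names,
).sort_values("permutation_importance", ascending=False)
print(comparison)
```

If both rankings agree on the top features, that is some reassurance that the ordering is not an artifact of how the trees were grown.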

In [15]:
log_clf = LogisticRegression(class_weight = 'balanced')

cv_score(log_clf, Xtrain, ytrain)

0.95747111514994165

Quite good out of the box: about 0.957 CV accuracy with default settings, well above the tuned random forest's 0.893.

In [16]:
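# cv_score refits log_clf on every fold, so these coefficients come from the
# last fold's fit (4/5 of Xtrain), not from a fit on all of Xtrain
coefficients = log_clf.coef_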

Print the 10 largest-magnitude coefficients for each cluster:

In [17]:
for i in range(0,4):
    cluster_coefs = []
    for item, item2 in zip(variables, coefficients[i]):
        cluster_coefs.append((item, item2))
    cluster_coefs = sorted(cluster_coefs, key = lambda x: abs(x[1]), reverse = True)
    print("Cluster",i)
    for j in range(10):
        print(cluster_coefs[j])

Cluster 0
('stateLE_Increase', -2.0805822792968378)
('stateedu_Increase', -1.6430276195418712)
('stateinfra_Increase', -1.5280517275909118)
('statehealthcare_Increase', -1.4767138593713485)
('Romney', -1.4406633147462773)
('pres_Johnson', 1.2620078109540516)
('stateedu_Maintain', 1.1066999032031224)
('ideo_Very conservative', -1.1056522543307397)
('ideo_Moderate or Not sure', 1.1054643492532916)
('statehealthcare_Maintain', 1.045519422194904)
Cluster 1
('Romney', -1.5073656725444844)
('stateLE_Maintain', -1.3226832467524661)
('stateLE_Increase', 1.3125056511225635)
('statehealthcare_Increase', 1.2660211591613006)
('campaigndonate', -1.2018299337707725)
('age', -1.0814025017319289)
('ideo_Very liberal', -1.0686448670228497)
('stateedu_Increase', 0.9803724578077132)
('stateedu_Maintain', -0.91048214794985982)
('pres_Johnson', 0.88069766336038835)
Cluster 2
('stateLE_Increase', -3.5821691495216448)
('repealACA', -2.7168170547875357)
('stateedu_Maintain', -2.4811898562271395)
('abortion20w
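
The same ranking can be kept in a labeled table instead of being printed from nested loops. In scikit-learn, the rows of coef_ follow the order of classes_, so a DataFrame indexed by classes_ makes the cluster label explicit. A sketch on synthetic data (an assumption; the feature names and the helper top_by_magnitude are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic four-class data with hypothetical feature names (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row per class (in model.classes_ order), one column per feature
coef_table = pd.DataFrame(model.coef_, index=model.classes_, columns=names)

def top_by_magnitude(table, label, k=3):
    """Return the k coefficients with the largest absolute value for one class."""
    row = table.loc[label]
    order = row.abs().sort_values(ascending=False).index[:k]
    return row[order]

for label in coef_table.index:
    print("Cluster", label)
    print(top_by_magnitude(coef_table, label))
```

Keeping the sign while sorting by absolute value matches what the loop in In [17] does.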

In [19]:
for i in [0,1]:
    cluster_coefs = []
    for item, item2 in zip(variables, coefficients[i]):
        cluster_coefs.append((item, item2))
    cluster_coefs = sorted(cluster_coefs, key = lambda x: abs(x[1]), reverse = True)
    print("Cluster",i)
    for j in range(25):
        print(cluster_coefs[j])

Cluster 0
('stateLE_Increase', -2.0805822792968378)
('stateedu_Increase', -1.6430276195418712)
('stateinfra_Increase', -1.5280517275909118)
('statehealthcare_Increase', -1.4767138593713485)
('Romney', -1.4406633147462773)
('pres_Johnson', 1.2620078109540516)
('stateedu_Maintain', 1.1066999032031224)
('ideo_Very conservative', -1.1056522543307397)
('ideo_Moderate or Not sure', 1.1054643492532916)
('statehealthcare_Maintain', 1.045519422194904)
('relig_Muslim', 1.0440300161189549)
('stateLE_Maintain', 1.0424766167623269)
('campaigndonate', -1.0236067753013502)
('stateinfra_Maintain', 0.97355003739190238)
('pres_None', 0.93930106515743705)
('primary', -0.88241750329348378)
('statewelfare_Increase', -0.86682051390461434)
('pres_Trump', -0.85504676574117977)
('ideo_Very liberal', -0.82591160115334217)
('pres_McMullin', 0.78079066323859236)
('pres_Other', 0.75299317015783462)
('pres_Stein', 0.66713971365507374)
('militaryally', -0.64219855635989931)
('backgroundcheck', 0.63257182942392098)
(

Clusters 0 and 1 are the most interesting, because they are less partisan.

Cluster 0 wants to maintain law enforcement and infrastructure spending. It is associated with voting for Johnson and McMullin, and with ideological moderation.

People in cluster 1 want to increase law enforcement spending. They tended not to vote for Romney, are younger, tended to vote for Other or Johnson, and tended not to donate to campaigns. They do not tend to identify as ideologically liberal, but are more likely to be moderate or unsure.
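
A simple EDA follow-up for readings like these is to compare per-cluster means of the features in question; for 0/1 indicator columns the mean is the share of respondents in the cluster with that answer. The sketch below uses randomly generated values purely to show the mechanics: the column names are borrowed from the notebook, but the numbers are not survey data and say nothing about the real clusters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200

# Random 0/1 values under real column names (assumption: illustration only)
toy = pd.DataFrame({
    "four_cluster_label": np.arange(n_rows) % 4,
    "stateLE_Increase": rng.integers(0, 2, n_rows),
    "stateLE_Maintain": rng.integers(0, 2, n_rows),
    "Romney": rng.integers(0, 2, n_rows),
})
features = ["stateLE_Increase", "stateLE_Maintain", "Romney"]

# Share of each cluster answering 1 on each indicator
cluster_means = toy.groupby("four_cluster_label")[features].mean()
print(cluster_means)
```

On the real data, the equivalent would be df.groupby('four_cluster_label')[...] with the top-ranked features, which gives a direct check on what the coefficient signs suggest.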
