Progress summary: I've trained random forest and logistic regression models to predict the four cluster labels, with high cross-validated accuracy (about 0.90 for the tuned random forest and 0.96 for logistic regression).

I printed the most important features (for the random forest) and the coefficients with the largest absolute values (for logistic regression).

Next step: more EDA on these features. I'm not sure what else to try.

In [2]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report

In [2]:
def cv_score(clf, x, y, score_func=accuracy_score):
    result = 0
    nfold = 5
    # Recent scikit-learn versions reject random_state unless shuffle=True;
    # the folds here are unshuffled, so random_state is dropped.
    for train, test in KFold(nfold).split(x): # split data into train/test groups, 5 times
        clf.fit(x[train], y.iloc[train]) # fit on the training folds
        result += score_func(y.iloc[test], clf.predict(x[test])) # score_func expects (y_true, y_pred)
    return result / nfold # average over the folds
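
For reference, scikit-learn's built-in `cross_val_score` can stand in for the hand-rolled loop. A minimal sketch on synthetic data: the `make_classification` dataset and the names `X_demo`, `y_demo`, and `folds` are placeholders, not the survey data. Passing an explicit unshuffled `KFold` mirrors the 5-fold split in `cv_score`; with a plain `cv=5` and a classifier, scikit-learn would use stratified folds instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the survey features and cluster labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=4, random_state=42
)

clf = LogisticRegression(max_iter=1000)

# Unshuffled 5-fold split, like the loop in cv_score.
folds = KFold(n_splits=5)
scores = cross_val_score(clf, X_demo, y_demo, cv=folds, scoring="accuracy")

mean_score = scores.mean()
print(scores, mean_score)
```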

In [3]:
df = pd.read_csv('clustered.csv')

In [4]:
variables = [col for col in df.columns if (col != 'four_cluster_label' and col != 'Unnamed: 0')]
             
print(variables)

['gender', 'legalstatus', 'borderpatrol', 'legalstatusHS', 'deport', 'milstat_1', 'milstat_2', 'milstat_3', 'milstat_4', 'minwage12', 'faminc', 'mandatorymin', 'bodycamera', 'threestrikes', 'pew_bornagain', 'increasepolice', 'cleanair', 'fuelefficiency', 'EPACO2', 'abortionchoice', 'abortioncoverage', 'federalabortion', 'renewables', 'abortion20wks', 'repealACA', 'banmostabortion', 'child18', 'TPP', 'NCLB', 'primary', 'Iransanctions', 'infraspending', 'backgroundcheck', 'medicarereform', 'concealedcarry', 'gunregistry', 'banassault', 'gaymarriage', 'hispanic', 'investor', 'trans', 'votereg_post', 'militaryoil', 'militaryterror', 'militarycivilwar', 'militarydemocracy', 'militaryally', 'militaryUN', 'polmeeting', 'polsign', 'campaignwork', 'campaigndonate', 'donateblood', 'edloan', 'runoffice', 'Obama', 'Romney', 'age', 'whiteadvantage_Disagree', 'whiteadvantage_Neutral', 'angryracism_Disagree', 'angryracism_Neutral', 'racismrare_Disagree', 'racismrare_Neutral', 'fearrace_Disagree', 'fe

In [5]:
Xtrain, Xtest, ytrain, ytest = train_test_split(df[variables].values, df['four_cluster_label'], random_state = 42, test_size = 0.2)

In [6]:
rf_clf = RandomForestClassifier(random_state = 42, class_weight = 'balanced')
cv_score(rf_clf, Xtrain, ytrain)

0.88887446449435292

In [7]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)

rf_clf.fit(Xtrain, ytrain)
print(classification_report(ytrain, rf_clf.predict(Xtrain)))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      9277
          1       1.00      1.00      1.00     11795
          2       1.00      1.00      1.00      6669
          3       1.00      1.00      1.00     10774

avg / total       1.00      1.00      1.00     38515



Perfect scores on the training data, versus about 0.89 in cross-validation: the untuned forest is overfitting!
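
The gap between training accuracy and cross-validated accuracy can be measured directly. A rough sketch on synthetic data (the dataset, `deep_rf`, `train_acc`, and `cv_acc` are illustrative names, not from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic four-class problem standing in for the survey data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=4, random_state=42
)

# Fully grown trees, as in the untuned forest.
deep_rf = RandomForestClassifier(n_estimators=50, random_state=42)
deep_rf.fit(X_demo, y_demo)

train_acc = deep_rf.score(X_demo, y_demo)  # accuracy on rows the model has seen
cv_acc = cross_val_score(deep_rf, X_demo, y_demo, cv=5).mean()  # accuracy on held-out folds

print(f"train accuracy: {train_acc:.3f}, CV accuracy: {cv_acc:.3f}")
```

A large difference between the two numbers is the overfitting signal; tuning `max_depth` or `min_impurity_decrease` should shrink it.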

In [8]:
from sklearn.model_selection import GridSearchCV

In [9]:
param_grid = {'max_depth':[10,20,100,500], 'min_impurity_decrease':[1e-7,1e-6,1e-5, 1e-4, 1e-3, 1e-2]}
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)
rf_clf_cv = GridSearchCV(rf_clf, param_grid, cv = 5)
rf_clf_cv.fit(Xtrain, ytrain)

print(rf_clf_cv.best_params_)

{'max_depth': 10, 'min_impurity_decrease': 1e-06}


In [10]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 10, min_impurity_decrease = 1e-06)

cv_score(rf_clf, Xtrain, ytrain)

0.89183435025314817

CV score is slightly better after tuning (0.892 vs. 0.889). Narrowing down the grid search:

In [11]:
param_grid = {'max_depth':[10,25,50,100], 'min_impurity_decrease':[1e-7,5e-6,1e-6,5e-5,1e-5]}
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42)
rf_clf_cv = GridSearchCV(rf_clf, param_grid, cv = 5)
rf_clf_cv.fit(Xtrain, ytrain)

print(rf_clf_cv.best_params_)

{'max_depth': 25, 'min_impurity_decrease': 5e-05}
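
`GridSearchCV` also keeps the cross-validated score of every parameter combination, which saves re-running the scoring loop by hand. A small sketch on synthetic data (the grid values, `search`, and `results` are illustrative, not the settings used above). Note that with `cv = 5` and a classifier, `GridSearchCV` uses stratified folds, so `best_score_` may differ slightly from a plain `KFold` average.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=5, n_classes=4, random_state=42
)

# Deliberately tiny grid so the sketch runs quickly.
param_grid = {"max_depth": [3, 10], "min_impurity_decrease": [1e-6, 1e-3]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=42), param_grid, cv=5
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best mean CV accuracy first.
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_min_impurity_decrease", "mean_test_score", "std_test_score"]
].sort_values("mean_test_score", ascending=False)

print(search.best_params_, search.best_score_)
print(results)
```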


In [12]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 25, min_impurity_decrease = 5e-05)

cv_score(rf_clf, Xtrain, ytrain)

0.89546929767623007

In [13]:
rf_clf = RandomForestClassifier(class_weight = 'balanced', random_state = 42, max_depth = 25, min_impurity_decrease = 5e-05)

rf_clf.fit(Xtrain, ytrain)
print(classification_report(ytrain, rf_clf.predict(Xtrain)))

             precision    recall  f1-score   support

          0       0.93      0.95      0.94      9277
          1       0.98      0.96      0.97     11795
          2       0.92      0.97      0.94      6669
          3       0.98      0.95      0.96     10774

avg / total       0.96      0.96      0.96     38515



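Recall is between 0.95 and 0.97 for all four classes, and precision between 0.92 and 0.98. These are training-set scores, so they run higher than the cross-validated accuracy of about 0.90, but the gap is much smaller than before tuning. Overall the classifier is pretty good.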
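
The report above is computed on the training rows. The 20% split held out in cell 5 (`Xtest`, `ytest`) has not been used yet, and scoring on it would give an unbiased check. A sketch of that step on synthetic data (dataset and the names `X_tr`, `X_te`, `y_tr`, `y_te`, `test_acc` are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df[variables] and four_cluster_label (assumption).
X_demo, y_demo = make_classification(
    n_samples=800, n_features=20, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same hyperparameters as the tuned forest in cell 13.
rf = RandomForestClassifier(
    class_weight="balanced", random_state=42, max_depth=25, min_impurity_decrease=5e-05
)
rf.fit(X_tr, y_tr)

# Evaluate only on rows the model never saw during fitting or tuning.
y_pred = rf.predict(X_te)
test_acc = accuracy_score(y_te, y_pred)
print(classification_report(y_te, y_pred))
```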

In [14]:
# Pair each feature name with its importance, then sort from most to least important
ranked = sorted(zip(variables, rf_clf.feature_importances_), key = lambda x: x[1], reverse = True)
ranked

[('pres_Trump', 0.080069316354228565),
 ('stateLE_Increase', 0.062372391873136659),
 ('stateedu_Increase', 0.049556628354393714),
 ('Obama', 0.047190781821376267),
 ('stateinfra_Increase', 0.04620984385781772),
 ('repealACA', 0.044047424265772281),
 ('Romney', 0.039045182215672311),
 ('statehealthcare_Increase', 0.031209075656139333),
 ('stateLE_Maintain', 0.028486175026776122),
 ('stateedu_Maintain', 0.027797928655782345),
 ('statehealthcare_Maintain', 0.027461490608496612),
 ('gaymarriage', 0.026073409226767708),
 ('abortioncoverage', 0.025425305985564511),
 ('federalabortion', 0.024200422669977944),
 ('cleanair', 0.022260814891625048),
 ('abortion20wks', 0.018168679425092333),
 ('minwage12', 0.017663785115495025),
 ('whiteadvantage_Disagree', 0.017281398286528655),
 ('stateinfra_Maintain', 0.015802638836985874),
 ('EPACO2', 0.015522428406865057),
 ('renewables', 0.015310524826665239),
 ('banassault', 0.014407623558220468),
 ('legalstatus', 0.013340961616259967),
 ('legalstatusHS', 0

Interestingly, the 2012 vote (Obama, Romney) is a major factor, alongside pres_Trump, which ranks first.
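
Impurity-based importances like the ones above can favor some features for reasons unrelated to predictive value, so a cross-check may be useful. One option, which this notebook does not use, is scikit-learn's `permutation_importance` on held-out rows. A toy sketch where only one synthetic feature carries signal (all names and data here are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data: the label depends only on the first column (assumption for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
names = ["signal", "noise_a", "noise_b", "noise_c"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out rows and record the drop in accuracy.
perm = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=42)
ranked_perm = sorted(
    zip(names, perm.importances_mean), key=lambda pair: pair[1], reverse=True
)
for name, score in ranked_perm:
    print(f"{name}: {score:.3f}")
```

If the two rankings broadly agree on the real data, the importance list is more trustworthy.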

In [15]:
log_clf = LogisticRegression(class_weight = 'balanced')

cv_score(log_clf, Xtrain, ytrain)

0.95939244450214201

Quite good out of the box.

In [16]:
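# cv_score refits log_clf on every fold, so these coefficients come from the
# last fold's fit (four fifths of Xtrain), not from a fit on all of Xtrain.
coefficients = log_clf.coef_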

Print the 10 coefficients with the largest absolute value for each cluster:

In [17]:
for i in range(0,4):
    cluster_coefs = []
    for item, item2 in zip(variables, coefficients[i]):
        cluster_coefs.append((item, item2))
    cluster_coefs = sorted(cluster_coefs, key = lambda x: abs(x[1]), reverse = True)
    print("Cluster",i)
    for j in range(10):
        print(cluster_coefs[j])

Cluster 0
('stateLE_Increase', 1.345915254851912)
('Romney', -1.2954577920287345)
('stateLE_Maintain', -1.2024122975389)
('statehealthcare_Increase', 1.1550477791032436)
('campaigndonate', -1.1176900213028396)
('age', -1.1150169473062193)
('pres_Other', 1.0527138588071092)
('pres_Johnson', 1.0055749555457958)
('stateedu_Increase', 0.91442630542997871)
('stateedu_Maintain', -0.86117923103695726)
Cluster 1
('Romney', 2.5263127326243198)
('pres_Trump', 2.3214492967752913)
('EPACO2', -2.0659664811097458)
('minwage12', -2.0392347098624724)
('statehealthcare_Increase', -1.9496879224193429)
('Obama', -1.92164056628168)
('cleanair', -1.9201875350527464)
('banassault', -1.6980916930646739)
('TPP', -1.6932017554049166)
('whiteadvantage_Disagree', 1.6802727794660475)
Cluster 2
('stateLE_Increase', -1.9342054027013087)
('stateinfra_Increase', -1.6210150181073359)
('stateedu_Increase', -1.3851842924342705)
('statehealthcare_Increase', -1.3231852354449627)
('Romney', -1.2456866399319484)
('pres_McMu

In [18]:
for i in [0,2]:
    cluster_coefs = []
    for item, item2 in zip(variables, coefficients[i]):
        cluster_coefs.append((item, item2))
    cluster_coefs = sorted(cluster_coefs, key = lambda x: abs(x[1]), reverse = True)
    print("Cluster",i)
    for j in range(25):
        print(cluster_coefs[j])

Cluster 0
('stateLE_Increase', 1.345915254851912)
('Romney', -1.2954577920287345)
('stateLE_Maintain', -1.2024122975389)
('statehealthcare_Increase', 1.1550477791032436)
('campaigndonate', -1.1176900213028396)
('age', -1.1150169473062193)
('pres_Other', 1.0527138588071092)
('pres_Johnson', 1.0055749555457958)
('stateedu_Increase', 0.91442630542997871)
('stateedu_Maintain', -0.86117923103695726)
('EPACO2', 0.79358259494602945)
('abortion20wks', 0.78505618357244078)
('banmostabortion', 0.75873099364443031)
('race_Black', 0.70271399599778339)
('milstat_1', 0.69157955597548293)
('minwage12', 0.67124792663075161)
('religimp_Very important', 0.67011936673217809)
('relig_Mormon', 0.67002401547420076)
('TPP', 0.65674462656781052)
('runoffice', 0.65530218756994296)
('increasepolice', 0.64185535924565729)
('relig_Roman Catholic', 0.64153097483538968)
('repealACA', 0.63195837426913792)
('relig_Muslim', 0.62855536256501643)
('relig_Hindu', 0.59272655948494379)
Cluster 2
('stateLE_Increase', -1.934

Clusters 0 and 2 are the most interesting ones, because they are less partisan. Going by the signs of the largest coefficients, people in cluster 0 tend to want more law enforcement spending, tended not to vote for Romney, are younger, tended to vote for Other or Johnson, and tended not to donate to campaigns.

Cluster 2 has negative coefficients on increasing law enforcement and infrastructure spending, so its members are less likely to want increases there (they may prefer to maintain or cut it). Cluster 2 is also associated with voting for McMullin.
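
A simple follow-up for the EDA would be to compare per-cluster means of the top features, to check whether the coefficient signs show up in the raw data. A sketch on a tiny hand-made frame: the values below are invented purely to show the mechanics and say nothing about the real clusters, and `demo`, `top_features`, and `cluster_means` are hypothetical names.

```python
import pandas as pd

# Invented miniature of clustered.csv (values are NOT real survey data).
demo = pd.DataFrame({
    "four_cluster_label": [0, 0, 1, 1, 2, 2, 3, 3],
    "stateLE_Increase":   [1, 1, 0, 1, 0, 0, 1, 0],
    "Romney":             [0, 0, 1, 1, 0, 0, 0, 0],
})

top_features = ["stateLE_Increase", "Romney"]

# For 0/1 indicator columns, the group mean is the share of each cluster with that answer.
cluster_means = demo.groupby("four_cluster_label")[top_features].mean()
print(cluster_means)
```

On the real data, the same call would be `df.groupby('four_cluster_label')[top_features].mean()`, with `top_features` taken from the importance and coefficient lists above.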