### Feature selection--wrapper model
* This notebook iteratively select 50 most significant features out of the 907 numerical features using inertia score of the K-means algorithm as criteria. The algorithm was run on a subsample of the original data (1.25% of the original data set).

In [3]:
from sklearn.cluster import KMeans
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist, pdist
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm
from sklearn.decomposition import PCA

In [3]:
# Use a smaller data set to save time
df = pd.read_csv('PHBsample14_sss.csv', low_memory=False)
# drop the column resulted from sampling of the original data set
df.drop('Unnamed: 0', axis=1, inplace=True)
# In order to run K-means, drop all the categoricald data for now.
df = df.select_dtypes(include=['float64', 'int64'])
# Impute missing values with means
df = df.fillna(df.mean())

In [4]:
X = df

In [5]:
n, m = X.shape[0], X.shape[1]
print(n, m)

59159 907


In [5]:
model_test = KMeans(n_clusters=7)
model_test.fit(X) 
pred_y=model_test.labels_

In [9]:
print("Clustered class labels:", "\n", pd.value_counts(pd.Series(pred_y)))

Clustered class labels: 
 0    15491
5    14160
4     8910
6     5925
3     5687
1     5313
2     3673
dtype: int64


In [10]:
X.columns

Index(['ValDate', 'IssDate', 'IssAgeALB', 'Dur', 'AttAge', 'JointInd', 'AV',
       'CSV', 'SCPeriod', 'WDtoDate',
       ...
       'Match4', 'tie3', 'HealthScore_C5', 'Surr', 'EligibleInd', 'WDResponse',
       'FirstEligQInd', 'UtilizationInd', 'WDModelFilterIn', 'PolNum_UW'],
      dtype='object', length=907)

In [15]:
X.head()

Unnamed: 0,ValDate,IssDate,IssAgeALB,Dur,AttAge,JointInd,AV,CSV,SCPeriod,WDtoDate,...,Match4,tie3,HealthScore_C5,Surr,EligibleInd,WDResponse,FirstEligQInd,UtilizationInd,WDModelFilterIn,PolNum_UW
0,16343.0,16104.0,65.0,1.0,65.8,0.0,448559.96,421076.98,5,0.0,...,1.0,64.859049,0.5,0.0,1.0,0.0,0.0,0.0,1.0,294692
1,15613.0,14397.0,69.0,4.0,72.4,0.0,67321.31,64451.77,7,0.0,...,1.0,53.0,0.869053,0.0,1.0,0.0,0.0,0.0,1.0,281394
2,16070.0,13518.0,55.0,7.0,62.6,0.0,301121.04,295758.92,7,56438.97,...,1.0,57.0,0.5,0.0,1.0,0.0,0.0,1.0,0.0,475776
3,16343.0,14419.0,53.0,6.0,58.3,0.0,187344.04,180762.56,7,0.0,...,1.0,104.0,0.5,0.0,1.0,0.0,0.0,0.0,1.0,288738
4,15613.0,15044.0,63.0,2.0,65.2,0.0,183155.51,171789.66,6,1845.38,...,0.0,64.859049,0.869053,0.0,1.0,0.0,0.0,1.0,1.0,15320


In [42]:
# let's assume there are 7 clusters
num_of_cluster = 7
# Let's assume we're going to select 50 features out of 907 features, therefore we're going to iterate 50 times
num_of_iter = 50
model = KMeans(n_clusters=num_of_cluster)
score = np.zeros([num_of_iter, m]) # the sum of squared distances of samples to their closest cluster center
exclude_columns = [] # best performed models with selected features will be added to this list after every iteration
include_columns = [i for i in range(np.shape(score)[1]) if i not in exclude_columns] # rest of the features

for iteration in range(num_of_iter):
    # The first iteration, we're going to test clustering models on each individual variables
    if iteration == 0:
        print("Now processing iteration %d" %iteration, "\n")   
        for i in range(m):
            model.fit(X.iloc[:, i][:, np.newaxis])
#             pred_y = model.labels_
#             print("cluster labels based on variable %s:" %X.columns[i], "\n", pd.value_counts(pd.Series(pred_y)))
            score[iteration][i] = model.inertia_
#             print("the sum of squared distances of samples to their closest cluster center based on variable %s" \
#                  %X.columns[i], "is:", score[iteration][i])   
        
        selected_feature_index = np.argmin(score[iteration], axis=0) 
        selected_feature_score = np.amin(score[iteration], axis=0) 
        selected_feature = X[X.columns[selected_feature_index]]
        exclude_columns.append(selected_feature_index)
        print("Conclusion: cluster based on variable %s" %X.columns[selected_feature_index], "gives the best performance", "\n") 
    #for following iteration, we're going to add the rest the feature to the selected feature and perform cluster model
    else:
        print("Now processing iteration %d" %iteration, "\n") 
        for i in range(m):
            if i not in exclude_columns:
                # Generate data with features selected from last iteration plus each individual rest of the features
                data = pd.concat([selected_feature, X[X.columns[i]]], axis=1)
                model.fit(data)
#                 pred_y = model.labels_
#                 print("cluster labels based on variables:", data.columns, "\n", pd.value_counts(pd.Series(pred_y)))
                score[iteration][i] = model.inertia_
#                 print("the sum of squared distances of samples to their closest cluster center based on variables:", \
#                  data.columns, "is:", score[iteration][i]) 
        include_columns = [i for i in range(np.shape(score)[1]) if i not in exclude_columns]
        selected_feature_score = np.amin(score[:,include_columns][iteration], axis=0) 
        selected_feature_index = np.argmin(score[:,include_columns][iteration], axis=0) 
        selected_feature = pd.concat([selected_feature, X[X.columns[selected_feature_index]]], axis=1)
        exclude_columns.append(selected_feature_index)
        print("Conclusion: cluster based on variable %s" %X.columns[exclude_columns], "gives the best performance", "\n") 
print("Selected features are %s" %X.columns[exclude_columns])

Now processing iteration 0 

Conclusion: cluster based on variable JointInd gives the best performance 

Now processing iteration 1 

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1'], dtype='object') gives the best performance 

Now processing iteration 2 

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc'], dtype='object') gives the best performance 

Now processing iteration 3 

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily'], dtype='object') gives the best performance 

Now processing iteration 4 

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily', 'CEN_tr_pctAdministrationSales'], dtype='object') gives the best performance 

Now processing iteration 5 

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily', 'CEN_tr_pc

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily', 'CEN_tr_pctAdministrationSales', 'iau34_C4', 'CEN_bg_ageUnder5', 'CEN_tr_pctInformationProd',
       'CEN_tr_pctSeasonalHousingUnits', 'CEN_tr_pctConstructionCon', 'CEN_tr_pctManufacturingService', 'CEN_tr_pctManufacturingCon', 'ifn31_C4', 'CEN_tr_pctManagementPrivate',
       'CEN_bg_pctWorkforceFemale', 'CEN_bg_age35plus', 'CEN_tr_pctLT25KAge45plus', 'CEN_bg_pctGE60KAge65plus', 'CEN_bg_pctHHWageIncome', 'CEN_tr_pctHHincomeLT20K', 'CEN_tr_pctConstructionService',
       'CEN_tr_pctArtsProd', 'CEN_tr_pctLT10KAge25plus'],
      dtype='object') gives the best performance 

Now processing iteration 23 

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily', 'CEN_tr_pctAdministrationSales', 'iau34_C4', 'CEN_bg_ageUnder5', 'CEN_tr_pctInformationProd',
       'CEN_tr_pctSeasonalHousingUnits', 'CEN_tr_pctConstruc

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily', 'CEN_tr_pctAdministrationSales', 'iau34_C4', 'CEN_bg_ageUnder5', 'CEN_tr_pctInformationProd',
       'CEN_tr_pctSeasonalHousingUnits', 'CEN_tr_pctConstructionCon', 'CEN_tr_pctManufacturingService', 'CEN_tr_pctManufacturingCon', 'ifn31_C4', 'CEN_tr_pctManagementPrivate',
       'CEN_bg_pctWorkforceFemale', 'CEN_bg_age35plus', 'CEN_tr_pctLT25KAge45plus', 'CEN_bg_pctGE60KAge65plus', 'CEN_bg_pctHHWageIncome', 'CEN_tr_pctHHincomeLT20K', 'CEN_tr_pctConstructionService',
       'CEN_tr_pctArtsProd', 'CEN_tr_pctLT10KAge25plus', 'CEN_bg_pctHHincomeGE50K', 'CEN_bg_pctGE125KAge65plus', 'CEN_tr_pctHHincomeGE125K', 'CEN_tr_pctProfessionalService', 'CEN_bg_pctCleaningOcc',
       'CEN_bg_age45plus', 'CEN_tr_pctEducationProd', 'CEN_tr_pctServicePrivate', 'CEN_bg_populationDensity', 'CEN_tr_pctOwnOccValGE500K'],
      dtype='object') gives the best performance 

Now processing iterati

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily', 'CEN_tr_pctAdministrationSales', 'iau34_C4', 'CEN_bg_ageUnder5', 'CEN_tr_pctInformationProd',
       'CEN_tr_pctSeasonalHousingUnits', 'CEN_tr_pctConstructionCon', 'CEN_tr_pctManufacturingService', 'CEN_tr_pctManufacturingCon', 'ifn31_C4', 'CEN_tr_pctManagementPrivate',
       'CEN_bg_pctWorkforceFemale', 'CEN_bg_age35plus', 'CEN_tr_pctLT25KAge45plus', 'CEN_bg_pctGE60KAge65plus', 'CEN_bg_pctHHWageIncome', 'CEN_tr_pctHHincomeLT20K', 'CEN_tr_pctConstructionService',
       'CEN_tr_pctArtsProd', 'CEN_tr_pctLT10KAge25plus', 'CEN_bg_pctHHincomeGE50K', 'CEN_bg_pctGE125KAge65plus', 'CEN_tr_pctHHincomeGE125K', 'CEN_tr_pctProfessionalService', 'CEN_bg_pctCleaningOcc',
       'CEN_bg_age45plus', 'CEN_tr_pctEducationProd', 'CEN_tr_pctServicePrivate', 'CEN_bg_populationDensity', 'CEN_tr_pctOwnOccValGE500K', 'MinLoanTermRem_C1', 'CEN_tr_pctFinanceMgt', 'iat96m06_C4',
       'CEN_bg

Conclusion: cluster based on variable Index(['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily', 'CEN_tr_pctAdministrationSales', 'iau34_C4', 'CEN_bg_ageUnder5', 'CEN_tr_pctInformationProd',
       'CEN_tr_pctSeasonalHousingUnits', 'CEN_tr_pctConstructionCon', 'CEN_tr_pctManufacturingService', 'CEN_tr_pctManufacturingCon', 'ifn31_C4', 'CEN_tr_pctManagementPrivate',
       'CEN_bg_pctWorkforceFemale', 'CEN_bg_age35plus', 'CEN_tr_pctLT25KAge45plus', 'CEN_bg_pctGE60KAge65plus', 'CEN_bg_pctHHWageIncome', 'CEN_tr_pctHHincomeLT20K', 'CEN_tr_pctConstructionService',
       'CEN_tr_pctArtsProd', 'CEN_tr_pctLT10KAge25plus', 'CEN_bg_pctHHincomeGE50K', 'CEN_bg_pctGE125KAge65plus', 'CEN_tr_pctHHincomeGE125K', 'CEN_tr_pctProfessionalService', 'CEN_bg_pctCleaningOcc',
       'CEN_bg_age45plus', 'CEN_tr_pctEducationProd', 'CEN_tr_pctServicePrivate', 'CEN_bg_populationDensity', 'CEN_tr_pctOwnOccValGE500K', 'MinLoanTermRem_C1', 'CEN_tr_pctFinanceMgt', 'iat96m06_C4',
       'CEN_bg

In [4]:
res = ['JointInd', 'NoOfCars_C1', 'CEN_tr_pctAdminOcc', 'CEN_tr_pctSalesFamily', 'CEN_tr_pctAdministrationSales', 'iau34_C4', 'CEN_bg_ageUnder5', 'CEN_tr_pctInformationProd',
       'CEN_tr_pctSeasonalHousingUnits', 'CEN_tr_pctConstructionCon', 'CEN_tr_pctManufacturingService', 'CEN_tr_pctManufacturingCon', 'ifn31_C4', 'CEN_tr_pctManagementPrivate',
       'CEN_bg_pctWorkforceFemale', 'CEN_bg_age35plus', 'CEN_tr_pctLT25KAge45plus', 'CEN_bg_pctGE60KAge65plus', 'CEN_bg_pctHHWageIncome', 'CEN_tr_pctHHincomeLT20K', 'CEN_tr_pctConstructionService',
       'CEN_tr_pctArtsProd', 'CEN_tr_pctLT10KAge25plus', 'CEN_bg_pctHHincomeGE50K', 'CEN_bg_pctGE125KAge65plus', 'CEN_tr_pctHHincomeGE125K', 'CEN_tr_pctProfessionalService', 'CEN_bg_pctCleaningOcc',
       'CEN_bg_age45plus', 'CEN_tr_pctEducationProd', 'CEN_tr_pctServicePrivate', 'CEN_bg_populationDensity', 'CEN_tr_pctOwnOccValGE500K', 'MinLoanTermRem_C1', 'CEN_tr_pctFinanceMgt', 'iat96m06_C4',
       'CEN_bg_pctHHincomeLT40K', 'ihi21_C4', 'CEN_tr_Top5PercentMeanIncome', 'CEN_tr_pctWorkforceGovt', 'imt42_C4', 'ifn96m06_C4', 'CEN_bg_pctProtectServiceOcc', 'CEN_tr_age80plus',
       'CEN_bg_pctGE150KAge65plus', 'CEN_tr_pctHealthPractitionersOcc', 'iau42_C4', 'i03ccmv1_C4', 'noOfRooms_zip_mean_C1', 'i03ccpq1_C4']
res2 = {}
df_dic = pd.read_excel("/data/capstone_data/DataDictionary_allPHB_allvendors_cleaned.xlsx")
for column in res:
    res2[column] = df_dic.loc[df_dic['Variable'] == column, 'Description'].item()
    print(df_dic.loc[df_dic['Variable'] == column, 'Description'].item())
selected_feature = pd.DataFrame.from_dict(res2, orient='index')
selected_feature.reset_index(level=0, inplace=True)
selected_feature.columns = ['Variable', 'Description']

Indicator of a joint contract
Self explanatory
Pecentage of people with Office and administrative support occupations
Percentage of People work as Self-employed in own not incorporated business workers and unpaid family workers with Sales and office occupations
Pecentage of people in Public administration Industry with Sales and office occupations
Utilization of auto trades verified in last 12 months
Percentage of Age under 5
Pecentage of people in Information Industry with Production, transportation, and material moving occupations
Pecentage of Seasonal Housing Units
Pecentage of people in Construction Industry with Natural resources, construction, and maintenance occupations
Pecentage of people in Manufacturing Industry with Service occupations
Pecentage of people in Manufacturing Industry with Natural resources, construction, and maintenance occupations
Percentage of open auto trades > 75% of limit verified in last 12 months
Percentage of People work as Employee of private company w

In [44]:
pd.set_option('display.max_colwidth', -1) 
selected_feature

Unnamed: 0,Variable,Description
0,JointInd,Indicator of a joint contract
1,NoOfCars_C1,Self explanatory
2,CEN_tr_pctAdminOcc,Pecentage of people with Office and administrative support occupations
3,CEN_tr_pctSalesFamily,Percentage of People work as Self-employed in own not incorporated business workers and unpaid family workers with Sales and office occupations
4,CEN_tr_pctAdministrationSales,Pecentage of people in Public administration Industry with Sales and office occupations
5,iau34_C4,Utilization of auto trades verified in last 12 months
6,CEN_bg_ageUnder5,Percentage of Age under 5
7,CEN_tr_pctInformationProd,"Pecentage of people in Information Industry with Production, transportation, and material moving occupations"
8,CEN_tr_pctSeasonalHousingUnits,Pecentage of Seasonal Housing Units
9,CEN_tr_pctConstructionCon,"Pecentage of people in Construction Industry with Natural resources, construction, and maintenance occupations"


In [5]:
selected_feature.to_csv('selected_feature_Kmeans_inertia.csv')