**Section I: Import the Data**

Imports and Read in File

In [1]:
%matplotlib inline 

import pandas as pd
import numpy as np
from sklearn import cluster
from sklearn import metrics
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
matplotlib.style.use('ggplot') 

In [2]:
adult = pd.read_csv("../../assets/datasets/adult.csv")

**Section II: Format the Data**

Convert the categorical data to numeric, and prepare a dataframe with these data.  For now, focus on: 'workclass', 'education-num', 'hours-per-week', 'income'.  

In [12]:
#adult.head()
#adult.info()
df=adult.loc[:, ['workclass', 'education-num', 'hours-per-week', 'income']]

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 4 columns):
workclass         46043 non-null object
education-num     48842 non-null int64
hours-per-week    48842 non-null int64
income            32561 non-null object
dtypes: int64(2), object(2)
memory usage: 1.5+ MB


| | workclass | education-num | hours-per-week | income |
|---|---|---|---|---|
| 0 | State-gov | 13 | 40 | small |
| 1 | Self-emp-not-inc | 13 | 13 | small |
| 2 | Private | 9 | 40 | small |
| 3 | Private | 7 | 40 | small |
| 4 | Private | 13 | 40 | small |


Check for and drop NaNs - our data are messy!

In [22]:
# Drop rows with missing values; .copy() avoids SettingWithCopyWarning on the assignments below
dfn=df.dropna().copy()
dfn.info()
from sklearn.preprocessing import LabelEncoder

# Use one encoder per column so each keeps its own classes_ mapping
workclass_encoder=LabelEncoder()
income_encoder=LabelEncoder()

dfn['workclassENC']=workclass_encoder.fit_transform(dfn['workclass'])
dfn['incomeENC']=income_encoder.fit_transform(dfn['income'])

dfn.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30725 entries, 0 to 32560
Data columns (total 4 columns):
workclass         30725 non-null object
education-num     30725 non-null int64
hours-per-week    30725 non-null int64
income            30725 non-null object
dtypes: int64(2), object(2)
memory usage: 1.2+ MB


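| | workclass | education-num | hours-per-week | income | workclassENC | incomeENC |
|---|---|---|---|---|---|---|
| 0 | State-gov | 13 | 40 | small | 6 | 1 |
| 1 | Self-emp-not-inc | 13 | 13 | small | 5 | 1 |
| 2 | Private | 9 | 40 | small | 3 | 1 |
| 3 | Private | 7 | 40 | small | 3 | 1 |
| 4 | Private | 13 | 40 | small | 3 | 1 |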
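
To read the encoded columns, it helps to know which integer stands for which label. The following is a minimal sketch on a toy series, not the lab's data: the value 'large' is an assumption made for illustration, since only 'small' appears in the rows shown above. `LabelEncoder` stores its labels sorted in `classes_`, and each label's code is its position in that array.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'income' column; the value 'large' is an assumption
income = pd.Series(["small", "large", "small", "small"])

encoder = LabelEncoder()
codes = encoder.fit_transform(income)

# classes_ is sorted, and each label's integer code is its position in it
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the codes follow alphabetical order, they carry no meaningful ranking for a column like 'workclass', which is one reason to prefer one-hot columns for distance-based methods.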


In [29]:
d=pd.get_dummies(dfn['workclass'])
dfn=pd.concat([dfn, d], axis=1)
dfn.head()

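| | workclass | education-num | hours-per-week | income | workclassENC | incomeENC | Federal-gov | Local-gov | Never-worked | Private | Self-emp-inc | Self-emp-not-inc | State-gov | Without-pay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | 13 | 40 | small | 6 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | Self-emp-not-inc | 13 | 13 | small | 5 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | Private | 9 | 40 | small | 3 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Private | 7 | 40 | small | 3 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Private | 13 | 40 | small | 3 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |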


Calculating the silhouette score can take a long time, so for this lab, subset the data to about 2,000 randomly sampled rows.  *Hint*: pandas has a method for randomly sampling a dataframe.

In [24]:
from sklearn.metrics import silhouette_score

In [32]:
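# Features plus the encoded income label
X=dfn[['education-num', 'hours-per-week', 'Federal-gov', "Local-gov","Never-worked", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov", "Without-pay", "incomeENC"]]
# frac=0.05 keeps about 1,536 of the 30,725 rows; sample() returns a copy, so X is unchanged
Xsample=X.sample(frac=0.05)
Xsample.shape
# Split off the income label, then score how well it separates the sampled features
ysample=Xsample.pop('incomeENC')
silhouette_score(Xsample, ysample, metric='euclidean')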

0.042256847363000692
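
The instructions ask for 2,000 rows, while sampling a fraction gives a row count that depends on the size of the data. As a hedged sketch, `DataFrame.sample` also accepts `n` for an exact count and `random_state` for a repeatable draw. The toy frame below is an assumption standing in for `X`, sized to match the 30,725 rows left after dropping NaNs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X: 30,725 rows, matching the row count after dropna()
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "education-num": rng.integers(1, 17, size=30725),
    "hours-per-week": rng.integers(1, 100, size=30725),
})

# frac=0.05 keeps 5% of the rows, which is about 1,536 here rather than 2,000
by_fraction = toy.sample(frac=0.05, random_state=42)

# n=2000 keeps exactly 2,000 rows; random_state makes the draw repeatable
by_count = toy.sample(n=2000, random_state=42)
same_again = toy.sample(n=2000, random_state=42)

print(by_fraction.shape, by_count.shape)
```

Without a fixed `random_state`, the silhouette score changes from run to run because a different subset is drawn each time.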

Scale your features.  Add the scaled features as features to your dataframe.

In [35]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

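# Keep X's index so the scaled rows stay aligned with dfn (dropna() left gaps in the index)
Xscaled=pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)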
Xscaled.head()

| | education-num | hours-per-week | Federal-gov | Local-gov | Never-worked | Private | Self-emp-inc | Self-emp-not-inc | State-gov | Without-pay | incomeENC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.120047 | -0.078956 | -0.17959 | -0.27037 | -0.015096 | -1.681295 | -0.194142 | -0.300262 | 4.761411 | -0.021351 | 0.575784 |
| 1 | 1.120047 | -2.33136 | -0.17959 | -0.27037 | -0.015096 | -1.681295 | -0.194142 | 3.33042 | -0.210022 | -0.021351 | 0.575784 |
| 2 | -0.44083 | -0.078956 | -0.17959 | -0.27037 | -0.015096 | 0.59478 | -0.194142 | -0.300262 | -0.210022 | -0.021351 | 0.575784 |
| 3 | -1.221269 | -0.078956 | -0.17959 | -0.27037 | -0.015096 | 0.59478 | -0.194142 | -0.300262 | -0.210022 | -0.021351 | 0.575784 |
| 4 | 1.120047 | -0.078956 | -0.17959 | -0.27037 | -0.015096 | 0.59478 | -0.194142 | -0.300262 | -0.210022 | -0.021351 | 0.575784 |
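
The task also asks for the scaled features to be added to the dataframe. One possible way, sketched on a toy frame rather than the lab's data, is to build the scaled frame with the same index and concatenate column-wise. The `_scaled` suffix is a hypothetical naming choice.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in with gaps in the index, like the ones dropna() leaves behind
toy = pd.DataFrame(
    {"education-num": [13, 9, 7, 13], "hours-per-week": [40, 13, 40, 50]},
    index=[0, 2, 3, 7],
)

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy),
    columns=[f"{col}_scaled" for col in toy.columns],  # hypothetical suffix
    index=toy.index,  # keep the original row labels so concat aligns rows
)

# Column-wise concat aligns on the index, so each row gets its own scaled values
toy_with_scaled = pd.concat([toy, scaled], axis=1)
print(toy_with_scaled)
```

If the scaled frame were given a fresh 0..n-1 index instead, the concat would align on mismatched labels and introduce NaN rows.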


**Section III: Clustering Analysis**

Cluster the data with sklearn.cluster.KMeans.  Cluster on the scaled features.

Get the labels and centroids

Compute the silhouette score

Add these new cluster labels to your dataframe. 
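
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the lab's solution: the data are synthetic, `n_clusters=2` is chosen only because the toy data has two groups, and the `cluster` column name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sampled features: two loose groups of workers
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "education-num": np.concatenate([rng.normal(9, 1, 100), rng.normal(14, 1, 100)]),
    "hours-per-week": np.concatenate([rng.normal(35, 4, 100), rng.normal(50, 4, 100)]),
})

# Cluster on scaled features so neither column dominates the distances
features_scaled = StandardScaler().fit_transform(toy)

# n_clusters=2 is an assumption for this toy data; the lab does not fix k
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(features_scaled)

labels = kmeans.labels_              # one cluster id per row
centroids = kmeans.cluster_centers_  # shape (n_clusters, n_features), in scaled units

# Mean silhouette over all rows, between -1 and 1; higher means better-separated clusters
score = silhouette_score(features_scaled, labels, metric="euclidean")

# Attach the labels to the dataframe (hypothetical column name)
toy["cluster"] = labels
print(centroids)
print(score)
```

On the real data, trying several values of `n_clusters` and comparing silhouette scores is one common way to pick k.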

**Section IV: Interpreting your clusters**

Create scatterplots, showing the clusters in different hues.

Look at your scatterplots. See how each of the clusters breaks down. Come up with descriptions for each of the clusters.
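
A minimal sketch of this step, again on synthetic data with an assumed `n_clusters=2` and a hypothetical `cluster` column: draw one scatter series per cluster so each gets its own color, then summarize each cluster in the original units to help describe it.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sampled features: two loose groups of workers
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "education-num": np.concatenate([rng.normal(9, 1, 100), rng.normal(14, 1, 100)]),
    "hours-per-week": np.concatenate([rng.normal(35, 4, 100), rng.normal(50, 4, 100)]),
})

# Cluster on scaled features, then store the labels (hypothetical column name)
features_scaled = StandardScaler().fit_transform(toy)
toy["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features_scaled)

# One scatter call per cluster gives each cluster its own color and legend entry
fig, ax = plt.subplots()
for cluster_id, group in toy.groupby("cluster"):
    ax.scatter(group["education-num"], group["hours-per-week"],
               label=f"cluster {cluster_id}", alpha=0.6)
ax.set_xlabel("education-num")
ax.set_ylabel("hours-per-week")
ax.legend()

# Per-cluster means in the original units help put words to each cluster
summary = toy.groupby("cluster")[["education-num", "hours-per-week"]].mean()
print(summary)
```

Plotting the unscaled columns keeps the axes readable, even though the clustering itself ran on the scaled values.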
