In [50]:
import pandas as pd
import numpy as np
import math
import pydot
from io import BytesIO
from sklearn.tree import export_graphviz
df = pd.read_csv('./HouseholderAtRisk.csv')
def analyse_feature_importance(dm_model, feature_names, n_to_display=20):
    # grab feature importances from the model
    importances = dm_model.feature_importances_
    
    # sort them out in descending order
    indices = np.argsort(importances)
    indices = np.flip(indices, axis=0)

    # keep only the top n_to_display features; skip this step to print everything
    indices = indices[:n_to_display]

    for i in indices:
        print(feature_names[i] + ': ' + str(importances[i]))
        
def visualize_decision_tree(dm_model, feature_names, save_name):
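    # out_file=None makes export_graphviz return the DOT source as a string,
    # which avoids writing text into a bytes buffer
    dot_data = export_graphviz(dm_model, out_file=None, feature_names=feature_names)
    graph = pydot.graph_from_dot_data(dot_data)
    graph[0].write_png(save_name)  # write the rendered tree to save_name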

<h2>Task 1</h2>
<h3>1. What is the proportion of householders at risk?</h3>

In [64]:
print(df['AtRisk'].value_counts())

High    30498
Low      9501
Name: AtRisk, dtype: int64


Of the 39,999 observations:<br/>
High risk: 30,498 (76.247%)<br/>
Low risk: 9,501 (23.753%)
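
The percentages can be reproduced from the counts with `value_counts(normalize=True)`. A minimal sketch, using a stand-in Series rebuilt from the reported counts (an assumption made so the example runs without the CSV file):

```python
import pandas as pd

# Stand-in for df['AtRisk'], rebuilt from the counts reported above
at_risk = pd.Series(['High'] * 30498 + ['Low'] * 9501, name='AtRisk')

# normalize=True returns class proportions instead of raw counts
proportions = at_risk.value_counts(normalize=True) * 100
print(proportions.round(3))
```

In the notebook itself, the same call would be applied to `df['AtRisk']` directly.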

<h3>2. Did you have to fix any data quality problems?</h3>

Missing values:

In [61]:
print(df.isna().sum())

ID                            0
Age                         967
WorkClass                   972
Weighting                  1292
Education                   972
NumYearsEducation           972
MaritalStatus               972
Occupation                  986
Relationship                972
Race                      39954
Gender                      972
CapitalLoss                 972
CapitalGain                 972
CapitalAvg                  972
NumWorkingHoursPerWeek      972
Sex                         972
Country                      30
AtRisk                        0
dtype: int64


The following attributes had quality problems:
<h5>Race</h5>
Most values in this attribute were missing.
The fix was to drop this column.
<h5>Age</h5>
There were a few values less than 1 and 967 missing values.
Both were imputed with the mean value of 38.66.
<h5>WorkClass</h5>
The values were prepended with a space. The space was removed.
There were 2240 records with the invalid value "?" and 972 missing values. Both were imputed with the mode, "Private".
<h5>NumYearsEducation</h5>
There were 972 missing values. These values were replaced with the mean value of 10.
<h5>MaritalStatus</h5>
The values in this attribute were prepended with a space. The space was removed.

There were 972 missing values. These were replaced with the mode "Married-civ-spouse".
<h5>Occupation</h5>
The values in this attribute were prepended with a space. The space was removed.

There were 2246 records with invalid value of "?" and there were 986 missing values. These values were imputed with the mode "Prof-specialty".
<h5>Relationship</h5>
The values in this attribute were prepended with a space. The space was removed.

The 972 missing values were imputed with ‘Husband’ which is the mode. 
<h5>CapitalLoss</h5>
The 972 missing values were imputed with the mode of 0. Because the values are heavily concentrated at 0 with a long tail of large losses (a right-skewed distribution), imputing 0 is a reasonable choice.
<h5>CapitalGain</h5>
The 972 missing values were imputed with the mode of 0.
<h5>CapitalAvg</h5>
The 972 missing values were imputed with the mode of 0.
<h5>NumWorkingHoursPerWeek</h5>
There were 972 missing values. These values were imputed with the mean value of 40.
<h5>Sex</h5>
There were 972 missing values. These values were imputed with the mode of 0.
<h5>Country</h5>
The values in this attribute were prepended with a space. The space was removed.

<p>699 values were ‘?’. These were imputed with the mode, ‘United-States’.</p>
<p>30 missing values were imputed with ‘United-States’.</p>
<p>917 values were ‘USA’. These were changed to ‘United-States’.</p>
<p>9 values were ‘US’. These were changed to ‘United-States’.</p>
<p>20 values were ‘Hong’. These were changed to ‘Hong Kong’.</p>
<p>97 values were ‘South’. These were imputed with ‘United-States’.</p>
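
The same kind of clean-up can be written with vectorised pandas string methods. This is a sketch on a toy column whose values are illustrative only, and it is an alternative to the loop-based code in this notebook rather than the code that was actually run:

```python
import numpy as np
import pandas as pd

# Toy column mimicking the problems described above (values are illustrative)
country = pd.Series([' United-States', ' ?', ' USA', np.nan, ' Hong', ' US'])

# Strip surrounding whitespace; NaN entries stay NaN
country = country.str.strip()

# Treat the '?' placeholder as missing, then harmonise the spelling variants
country = country.replace('?', np.nan)
country = country.replace({'USA': 'United-States', 'US': 'United-States',
                           'Hong': 'Hong Kong'})

# Impute what is still missing with the most frequent value
country = country.fillna(country.mode()[0])
print(country.value_counts())
```

Converting '?' to NaN before computing the mode keeps the placeholder from being counted as a real category.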

<h3>Data types</h3>
<h5>Age</h5>
The data type was converted from float to int.
<h5>Sex</h5>
The data type was converted from float to binary.
<h5>NumYearsEducation</h5>
The data type was converted from float to int.
<h5>Weighting</h5>
The data type was converted from float to int.
<h5>AtRisk</h5>
There are only two possible values, 'High' or 'Low'. This was encoded as a binary variable (High = 1, Low = 0).

<h3>One-Hot Encoding</h3>
The following categorical variables need to be converted to numerical variables:
<h5>WorkClass</h5>
<h5>MaritalStatus</h5>
<h5>Occupation</h5>
<h5>Relationship</h5>
<h5>Country</h5>
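
To illustrate what one-hot encoding does to such a column, here is a small sketch on a toy frame (the rows are made up for illustration; only the column names follow the dataset):

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values are illustrative)
toy = pd.DataFrame({'Age': [25, 40, 31],
                    'MaritalStatus': ['Divorced', 'Married-civ-spouse', 'Divorced']})

# get_dummies leaves numeric columns alone and expands each text column
# into one indicator column per category, named <column>_<category>
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

This naming scheme is why encoded features appear under names such as `MaritalStatus_Married-civ-spouse`.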

<h3>3. Irrelevant and redundant variables</h3>

<h5>ID</h5>
This attribute is a unique identifier and does not provide useful information for predicting the target variable.
<h5>Gender</h5>
This attribute is identical to the Sex attribute under a different name. Sex was kept because a variable with only two possible values is best represented as a binary variable.
<h5>Education</h5>
The Education and NumYearsEducation attributes map one-to-one. The only difference is that Education is an ordinal label while NumYearsEducation is numeric, so Education was dropped.
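
A one-to-one relationship like this can be checked before dropping a column. The sketch below uses toy rows (illustrative values, not the real data); in the notebook the same check would run on `df[['Education', 'NumYearsEducation']]` before the drop:

```python
import pandas as pd

# Toy data standing in for the two education columns
toy = pd.DataFrame({
    'Education': ['Bachelors', 'HS-grad', 'Bachelors', 'Masters', 'HS-grad'],
    'NumYearsEducation': [13, 9, 13, 14, 9],
})

# One-to-one means each label pairs with exactly one number, and vice versa
years_per_label = toy.groupby('Education')['NumYearsEducation'].nunique()
labels_per_year = toy.groupby('NumYearsEducation')['Education'].nunique()
is_one_to_one = bool((years_per_label == 1).all() and (labels_per_year == 1).all())
print(is_one_to_one)
```

Rows with a missing key are ignored by `groupby` by default, so the check only covers records where both values are present.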

In [36]:
# Drop ID, Race, Gender, Education (Weighting is kept and imputed below)
df.drop(['ID', 'Race', 'Gender', 'Education'], axis=1, inplace=True)

### Age Column
# Age less than 1 is invalid
# Impute the invalid values and missing values with mean
# because ...
mask = df['Age'] < 1
df.loc[mask, 'Age'] = np.nan
df['Age'].fillna(df['Age'].mean(), inplace=True)

### WorkClass column
# Remove surrounding spaces (missing values stay NaN)
df['WorkClass'] = df['WorkClass'].str.strip()

mask = df['WorkClass'] == '?'
df.loc[mask, 'WorkClass'] = np.nan
df['WorkClass'].fillna('Private', inplace=True)

### Weighting column
df['Weighting'].fillna(df['Weighting'].mean(), inplace=True)

### NumYearsEducation column
df['NumYearsEducation'].fillna(df['NumYearsEducation'].mean(), inplace=True)

### MaritalStatus column
# Remove surrounding spaces (missing values stay NaN)
df['MaritalStatus'] = df['MaritalStatus'].str.strip()

df['MaritalStatus'].fillna('Married-civ-spouse', inplace=True)

### Occupation column
# Remove surrounding spaces (missing values stay NaN)
df['Occupation'] = df['Occupation'].str.strip()

mask = df['Occupation'] == '?'
df.loc[mask, 'Occupation'] = np.nan
df['Occupation'].fillna('Prof-specialty', inplace=True)

### Relationship column
# Remove surrounding spaces (missing values stay NaN)
df['Relationship'] = df['Relationship'].str.strip()

df['Relationship'].fillna('Husband', inplace=True)

### CapitalLoss column
# Impute missing values with 0, which is the median,
# because the data is heavily right-skewed with large outliers
df['CapitalLoss'].fillna(0, inplace=True)

### CapitalGain column
# Impute missing values with 0
df['CapitalGain'].fillna(0, inplace=True)

### CapitalAvg column
# Impute with 0
df['CapitalAvg'].fillna(0, inplace=True)

### NumWorkingHoursPerWeek column
# Impute with mean of 40
df['NumWorkingHoursPerWeek'].fillna(df['NumWorkingHoursPerWeek'].mean(), inplace=True)

### Sex column
# Impute with 0 which is the mode
df['Sex'].fillna(0, inplace=True)

### Country column
# Remove surrounding spaces (missing values stay NaN)
df['Country'] = df['Country'].str.strip()

mask = df['Country'] == '?'
df.loc[mask, 'Country'] = 'United-States'
mask = df['Country'] == 'USA'
df.loc[mask, 'Country'] = 'United-States'
mask = df['Country'] == 'US'
df.loc[mask, 'Country'] = 'United-States'
mask = df['Country'] == 'Hong'
df.loc[mask, 'Country'] = 'Hong Kong'
mask = df['Country'] == 'South'
df.loc[mask, 'Country'] = 'United-States'
df['Country'].fillna('United-States', inplace=True)

### Data types
# format Sex to binary
data_type_map = {1.0: 1, 0.0: 0}
df['Sex'] = df['Sex'].map(data_type_map)
# format Age to int
df['Age'] = df['Age'].astype(int)
# format NumYearsEducation to int
df['NumYearsEducation'] = df['NumYearsEducation'].astype(int)
# format Weighting to int
df['Weighting'] = df['Weighting'].astype(int)
# format AtRisk to binary
data_type_map = {'High': 1, 'Low': 0}
df['AtRisk'] = df['AtRisk'].map(data_type_map)


### One-Hot Encoding
df = pd.get_dummies(df)

<h3>4. What distribution scheme did you use? What data partitioning allocation did you set?</h3>
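I used a 70/30 train/test split (30% held out for testing) with stratified sampling. The dataset is imbalanced (76% of instances are high risk), so stratifying on the target keeps the class proportions the same in both partitions. A plain random split could leave the two sets with different class ratios, which makes the test accuracy a less reliable estimate.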
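
The effect of stratification can be seen on a toy target. This is a sketch under the assumption of a 76/24 class ratio similar to the one reported above; the feature matrix is a placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 76 positives and 24 negatives
y = np.array([1] * 76 + [0] * 24)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# stratify=y keeps the class ratio (almost) identical in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=20)

# Share of the positive class in each partition
print(y_train.mean(), y_test.mean())
```

Both shares stay close to the overall 76%, whichever random seed is used.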

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
y = df['AtRisk']
x = df.drop(['AtRisk'], axis=1)

rs = 20

import warnings
warnings.filterwarnings("ignore")

x_mat = x.values  # as_matrix() is deprecated and was removed in pandas 1.0
x_train, x_test, y_train, y_test = train_test_split(x_mat, y, test_size=0.3, stratify=y, random_state=rs)



<h2>Task 2. Decision Trees</h2>
<h3>1. Build a decision tree using the default settings. Examine the tree results and answer the following</h3>


In [45]:
model = DecisionTreeClassifier(random_state=rs)
model.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=20,
            splitter='best')

In [48]:
print(model)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=20,
            splitter='best')


<h5>a. What is classification accuracy on training and test datasets?</h5>

In [28]:
print("Accuracy on training dataset: ", model.score(x_train, y_train))
print("Accuracy on test dataset: ", model.score(x_test, y_test))

('Accuracy on training dataset: ', 0.9942855101967928)
('Accuracy on test dataset: ', 0.8121666666666667)


<b>Training Dataset</b>: 99.43% accuracy<br/>
<b>Test Dataset</b>: 81.22% accuracy

<h5>b. Which variable is used for the first split? What are the variables used for the second split?</h5>
The graph image wasn't visible...
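
When the rendered image is not available, the split variables can also be read from a fitted tree's `tree_` attribute. A minimal sketch on scikit-learn's bundled iris data, where `clf` and `names` are stand-ins for this notebook's `model` and `x.columns`:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and model; in the notebook, use `model` and `x.columns` instead
data = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=20).fit(data.data, data.target)
names = data.feature_names

tree = clf.tree_
root = 0  # node 0 is always the root


def split_feature(node):
    # Leaves carry a negative feature index because they do not split
    idx = tree.feature[node]
    return names[idx] if idx >= 0 else 'leaf'


print('First split:', split_feature(root))
print('Second splits:',
      split_feature(tree.children_left[root]),
      split_feature(tree.children_right[root]))
```

The children of the root give the second-level splits. A child reported as 'leaf' means that branch stops after the first split.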

<h5>c. What are the 5 important variables in building the tree?</h5>

In [34]:
analyse_feature_importance(model, x.columns, 5)

MaritalStatus_Married-civ-spouse: 0.19643569062546867
Weighting: 0.18514243213032736
NumYearsEducation: 0.12266823377842039
Age: 0.12095449603378713
CapitalGain: 0.08649375604168594


The top 5 important variables are:<br/>
MaritalStatus_Married-civ-spouse: 0.19643569062546867<br/>
Weighting: 0.18514243213032736<br/>
NumYearsEducation: 0.12266823377842039<br/>
Age: 0.12095449603378713<br/>
CapitalGain: 0.08649375604168594<br/>

In [82]:
visualize_decision_tree(model, x.columns, "./graph1.png")

<h5>d. Report if you see any evidence of model overfitting</h5>

Training accuracy is 18.21 percentage points higher than test accuracy (99.43% vs 81.22%). This is evidence of overfitting.

<h3>2. Build another decision tree tuned with GridSearchCV</h3>

In [38]:
from sklearn.model_selection import GridSearchCV

params = {'criterion': ['gini', 'entropy'],
          'max_depth': range(2, 7), 
          'min_samples_leaf': range(20, 60, 10)}

cv = GridSearchCV(param_grid=params, estimator=DecisionTreeClassifier(random_state=rs), cv=10)
cv.fit(x_train, y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=20,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [2, 3, 4, 5, 6], 'min_samples_leaf': [20, 30, 40, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

<h5>a. What is classification accuracy on training and test datasets?</h5>

In [39]:
print("Train accuracy:", cv.score(x_train, y_train))
print("Test accuracy:", cv.score(x_test, y_test))

('Train accuracy:', 0.8554948391013965)
('Test accuracy:', 0.8540833333333333)


<b>Training Dataset</b>: 85.55% accuracy<br/>
<b>Test Dataset</b>: 85.41% accuracy<br/>
<h5>b. What are the parameters used? Explain your decision.</h5>
The hyperparameters used were:
<ul>
    <li><b>max_depth</b>: Pre-prunes the maximal tree. Limiting the maximum depth limits the size of the tree and therefore reduces overfitting.</li>
    <li><b>min_samples_leaf</b>: Setting a larger value for this parameter has a similar effect to limiting max_depth. It limits the granularity of the tree and reduces overfitting.</li>
</ul>
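
The effect of these two hyperparameters on tree size can be illustrated on synthetic data. This is a sketch only: the data comes from `make_classification`, not the householder dataset, and the parameter values are picked for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=20)

full = DecisionTreeClassifier(random_state=20).fit(X, y)
pruned = DecisionTreeClassifier(max_depth=6, min_samples_leaf=30,
                                random_state=20).fit(X, y)

# get_depth() and get_n_leaves() summarise how large each fitted tree is
for name, tree in [('default', full), ('pre-pruned', pruned)]:
    print(name, 'depth:', tree.get_depth(), 'leaves:', tree.get_n_leaves())
```

The pre-pruned tree can have at most 64 leaves (depth 6), and each leaf must hold at least 30 samples, so it cannot memorise individual noisy records the way the default tree can.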
<h5>c. What are the optimal parameters for this decision tree?</h5>

In [40]:
print(cv.best_params_)

{'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 30}


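The best parameters from the initial grid search were:<br/>
max_depth: 6<br/>
min_samples_leaf: 30<br/>
Since max_depth 6 sat at the upper edge of the first grid, a second search was run over a narrower grid around these values: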

In [41]:
params = {'criterion': ['gini', 'entropy'],
    'max_depth': range(4, 8), 
          'min_samples_leaf': range(25, 35)}

cv = GridSearchCV(param_grid=params, estimator=DecisionTreeClassifier(random_state=rs), cv=10)
cv.fit(x_train, y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=20,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'criterion': ['gini', 'entropy'], 'max_depth': [4, 5, 6, 7], 'min_samples_leaf': [25, 26, 27, 28, 29, 30, 31, 32, 33, 34]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [42]:
print("Train accuracy:", cv.score(x_train, y_train))
print("Test accuracy:", cv.score(x_test, y_test))
print(cv.best_params_)

('Train accuracy:', 0.8589592485445908)
('Test accuracy:', 0.8555833333333334)
{'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 27}


<b>Training Dataset</b>: 85.90% accuracy<br/>
<b>Test Dataset</b>: 85.56% accuracy<br/>

The optimal parameters are:<br/>
max_depth: 7<br/>
min_samples_leaf: 27<br/>
<h5>d. Which variable is used for the first split? What are the variables that are used for the second split?</h5>

In [43]:
visualize_decision_tree(cv.best_estimator_, x.columns, "./graph2.png")

<img src="./graph2.png"/>
<b>First split</b>: MaritalStatus_Married-civ-spouse<br/>
<b>Second split</b>: CapitalGain, NumYearsEducation
<h5>e. What are the 5 important variables in building the tree? </h5>

In [44]:
analyse_feature_importance(cv.best_estimator_, x.columns, 5)

MaritalStatus_Married-civ-spouse: 0.4268844001353971
NumYearsEducation: 0.22390731862508767
CapitalAvg: 0.13072234618972783
CapitalGain: 0.10450153204574017
CapitalLoss: 0.03311808855743001


The top 5 important variables are:<br/>
MaritalStatus_Married-civ-spouse: 0.4268844001353971<br/>
NumYearsEducation: 0.22390731862508767<br/>
CapitalAvg: 0.13072234618972783<br/>
CapitalGain: 0.10450153204574017<br/>
CapitalLoss: 0.03311808855743001<br/>
<h5>f. Report if you see any evidence of model overfitting</h5>
The accuracy on the training and test datasets is almost the same (85.90% vs 85.56%), so there is no evidence of overfitting.
<h3>3. What significant differences do you see between these two decision tree models?</h3>
<h5>Performance Difference</h5>
Tuning the decision tree with grid search cross-validation eliminated the overfitting and improved test accuracy by 4.34 percentage points (from 81.22% to 85.56%).
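
The figures behind this comparison can be recomputed from the accuracies printed earlier in the notebook. A small sketch with the values copied from those outputs:

```python
# Accuracies copied from the outputs reported earlier in this notebook
default_train, default_test = 0.9942855101967928, 0.8121666666666667
tuned_train, tuned_test = 0.8589592485445908, 0.8555833333333334

# Train/test gap for each model, in percentage points
default_gap = (default_train - default_test) * 100
tuned_gap = (tuned_train - tuned_test) * 100

# Change in test accuracy after tuning, in percentage points
test_gain = (tuned_test - default_test) * 100

print(round(default_gap, 2), round(tuned_gap, 2), round(test_gain, 2))
```

Reporting the differences in percentage points avoids confusion with a relative percentage change.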
<h5>Changes</h5>
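- Depth: the tuned tree is capped at a max_depth of 7, whereas the default tree has no depth limit.
- Feature importance: MaritalStatus_Married-civ-spouse remains the top variable, but its importance rises from 0.196 to 0.427. Weighting and Age drop out of the top 5 and are replaced by CapitalAvg and CapitalLoss.
- Purity: with min_samples_leaf set to 27, the tuned tree's leaves are larger and less pure than those of the default tree, which allows single-sample leaves.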
