# Decision Tree: Income Prediction
In this lab, we will build a decision tree to predict the income of a given population, which is labelled as either <= 50k or > 50k. The attributes (predictors) include age, working class type, marital status, gender, and race.

In the following sections, we'll:

- clean and prepare the data,
- build a decision tree with default hyperparameters,
- understand all the hyperparameters that we can tune, and finally
- choose the optimal hyperparameters using grid search cross-validation.

In [None]:
# Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Reading the csv file and putting it into 'df' object.
df = pd.read_csv('csv_file_name')

In [None]:
# Let's look at the data to understand what it looks like.

# NOTE
This dataset marks missing values with '?'. Remove the missing values by dropping the rows that contain them.

In [None]:
# select all categorical variables
df_categorical = df.select_dtypes(include=['object'])

In [None]:
# checking whether any other columns contain a "?"
df_categorical.apply(lambda x: x=="?", axis=0).sum()

In [None]:
# dropping the "?"s
df = df[df['workclass'] != '?']
df = df[df['occupation'] != '?']
df = df[df['native.country'] != '?']
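
An alternative to filtering column by column is to treat every '?' as a missing value and drop all affected rows in one step. The sketch below illustrates this on a small made-up frame; the `toy` data and its values are hypothetical and only stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "?", "Tech-support", "?"],
})

# Turn every "?" into NaN, then drop any row that contains a NaN.
toy_clean = toy.replace("?", np.nan).dropna()

print(toy_clean)
print("rows before:", len(toy), "rows after:", len(toy_clean))
```

This removes a row if a '?' appears in any column, so it should only be used when that is the intended behaviour.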

In [None]:
# clean dataframe

# Data Preparation
There are a number of preprocessing steps we need to do before building the model.

First, note that we have both categorical and numeric features as predictors. In previous models such as linear and logistic regression, we created dummy variables for the categorical variables, since those models (being mathematical equations) can process only numeric variables.

Dummy variables are not required for decision trees, since they can process categorical variables easily. However, we still need to encode the categorical variables as numbers so that sklearn can understand them and build the tree. We'll do that using the LabelEncoder class, which comes with sklearn.preprocessing.
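
To see what LabelEncoder does, here is a minimal sketch on a hand-written list of values (the list is illustrative, not taken from the dataset). Each distinct string is mapped to an integer, with the classes sorted alphabetically.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the real column has more categories.
workclass = ["Private", "State-gov", "Private", "Self-emp"]

le = LabelEncoder()
codes = le.fit_transform(workclass)

print(list(le.classes_))                  # distinct labels, sorted
print(list(codes))                        # integer code for each input value
print(list(le.inverse_transform(codes)))  # map the codes back to labels
```

Note that the integer codes carry an order (0 < 1 < 2) that the original categories do not have, which is worth keeping in mind when interpreting the splits.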

In [None]:
#import required libraries
from sklearn import preprocessing

In [None]:
# select all categorical variables from the clean dataframe
df_categorical = df.select_dtypes(include=['object'])

In [None]:
# apply Label encoder to df_categorical
le = preprocessing.LabelEncoder()
df_categorical = df_categorical.apply(le.fit_transform)

In [None]:
# concat df_categorical with original df
df = df.drop(df_categorical.columns, axis=1)
df = pd.concat([df, df_categorical], axis=1)

In [None]:
# convert target variable income to categorical
df['income'] = df['income'].astype('category')

Now all the categorical variables are suitably encoded. Let's build the model.

# Model Building and Evaluation
Let's first build a decision tree with default hyperparameters. Then we'll use cross-validation to tune them.

In [None]:
# Importing train-test-split 
from sklearn.model_selection import train_test_split

In [None]:
# Putting feature variable to X
# Putting response variable to y

In [None]:
# Splitting the data into train and test (70/30 ratio)
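
One possible way to separate the predictors from the target and make a 70/30 split is sketched below. The `toy` frame and the column name `hours` are hypothetical, and `random_state=99` is an arbitrary choice that only makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, already-encoded toy data with an 'income' target column.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63, 24, 55],
    "hours": [40, 50, 40, 40, 30, 30, 40, 32, 40, 10],
    "income": [0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
})

# Features are every column except the target; the target is 'income'.
X = toy.drop("income", axis=1)
y = toy["income"]

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=99
)

print(X_train.shape, X_test.shape)
```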

In [None]:
# Importing decision tree classifier from sklearn library
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Build a Decision Tree

In [None]:
# Let's check the evaluation metrics of our default model
# Importing classification report and confusion matrix from sklearn metrics
# Making predictions
# Printing classification report

Question 1: Find the accuracy of the model. [Mark the correct answer in graded questions segment]

In [None]:
# Printing confusion matrix and accuracy
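
The fit-predict-evaluate steps above can be sketched as follows. This is only an illustration on synthetic data from `make_classification`, so the numbers it prints say nothing about the income dataset; `random_state=0` is an assumption added for reproducibility, and all other hyperparameters are left at their defaults.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Decision tree with default hyperparameters.
dt_default = DecisionTreeClassifier(random_state=0)
dt_default.fit(X_train, y_train)

# Predictions on the held-out test set.
y_pred = dt_default.predict(X_test)

# Per-class precision, recall and F1.
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Accuracy is the share of correct predictions (diagonal of cm / total).
acc = accuracy_score(y_test, y_pred)
print(acc)
```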

# Plotting the Decision Tree

To visualise decision trees in Python, you need to install certain external libraries. You can read about the process in detail here: http://scikit-learn.org/stable/modules/tree.html

We need the `graphviz` library to plot a tree.

In [None]:
# Importing required packages for visualization
from IPython.display import Image  
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from sklearn.tree import export_graphviz
import pydotplus, graphviz

In [None]:
# Putting features

In [None]:
# Import path for Graphviz
import os
# Raw string so that backslashes (e.g. "\b") are not read as escape sequences
os.environ["PATH"] += os.pathsep + r'C:\Program Files (x86)\Graphviz2.38\bin'

In [None]:
# Plot a Decision Tree
dot_data = StringIO()  
export_graphviz(dt_default, out_file=dot_data,
                feature_names=features, filled=True,rounded=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
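
If Graphviz is not available, a text rendering of the tree can serve as a fallback. The sketch below uses `export_text` from `sklearn.tree` on a small synthetic tree; the data, the feature names `f0`..`f3`, and the `max_depth=2` limit are assumptions chosen only to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with four hypothetical feature names.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
features = ["f0", "f1", "f2", "f3"]

# A shallow tree so that the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per node: the split condition or the predicted class at a leaf.
rules = export_text(tree, feature_names=features)
print(rules)
```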

# Optimal Hyperparameters

In [None]:
#import libraries required
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

Question 2: Which is the optimal criterion for splitting? [Mark the correct answer in graded questions segment]

In [None]:
# Create the parameter grid 

In [None]:
# Instantiate the grid search model
# Fit the grid search to the data

In [None]:
# printing the optimal accuracy score and hyperparameters
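
A grid search over tree hyperparameters might look like the sketch below. The grid values, the 5-fold setup, and the synthetic data are all assumptions for illustration; the grid intended for this lab may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: 2 x 3 x 2 = 12 candidate combinations.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [5, 20],
}

# 5-fold cross-validation with shuffling.
folds = KFold(n_splits=5, shuffle=True, random_state=0)

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    cv=folds,
    scoring="accuracy",
)
grid_search.fit(X, y)

# Mean cross-validated accuracy of the best combination, and the combination itself.
print(grid_search.best_score_)
print(grid_search.best_params_)
```

With the default `refit=True`, the best combination is refit on all the data passed to `fit` and is available as `grid_search.best_estimator_`.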

# Running the Model with the Best Parameters from Grid Search

In [None]:
# model with optimal hyperparameters

Question 3: What is the change in accuracy after tuning the hyperparameters? [Mark the correct answer in graded questions segment]

In [None]:
# accuracy score

In [None]:
# plotting the tree