# Part 1: Introduction to Decision Trees

## The decision tree algorithm is a supervised learning algorithm: we first construct the tree from historical data, then use it to predict an outcome.

In [1]:
import pandas as pd
pd.options.display.max_columns = 100
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
## Individual income data from the 1994 census.

## Set index_col to False so that pandas does not treat the first column (age) as the row index
income = pd.read_csv('income.csv', index_col=False)
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  high_income     32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


## Categorical columns such as workclass and sex need to be converted to numerical values.

## Convert the columns to a categorical type. Pandas will display the labels as strings, but internally store them as numbers so we can do computations with them.

In [6]:
col = pd.Categorical(income['workclass'])
income['workclass'] = col.codes
income['workclass'].head(5)

0    7
1    6
2    4
3    4
4    4
Name: workclass, dtype: int8
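The integer codes above are only useful if they can be mapped back to their labels. As a minimal sketch, assuming a toy series built from the first five `workclass` values shown earlier (the names `workclass`, `cat` and `code_to_label` are illustrative, not part of the notebook), the mapping can be read from the `categories` attribute:

```python
import pandas as pd

# Toy stand-in for the workclass column (first five rows of the table above).
workclass = pd.Series(
    ['State-gov', 'Self-emp-not-inc', 'Private', 'Private', 'Private']
)

cat = pd.Categorical(workclass)

# categories holds the sorted unique labels; a code is a position in that list.
code_to_label = dict(enumerate(cat.categories))
print(code_to_label)
print(list(cat.codes))
```

The codes printed here differ from the ones in the notebook because the toy series contains only three distinct labels, while the full column contains more. Missing values, if any, receive the code -1.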

In [13]:
cols = ['education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country',
       'high_income']
for col in cols:
    col_cat = pd.Categorical(income[col])
    income[col] = col_cat.codes

In [14]:
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


## Split income into two parts based on the value of the workclass column

In [17]:
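# After the categorical conversion, code 4 is the 'Private' workclass
private_incomes = income[income['workclass'] == 4]
# Every other workclass code goes into the second part
public_incomes = income[income['workclass'] != 4]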

In [19]:
print(private_incomes.shape)
print(public_incomes.shape)

(22696, 15)
(9865, 15)


## The nodes at the bottom of the tree, where we decide to stop splitting, are called terminal nodes, or leaves. To ensure that we can make a prediction on future data, all rows in each leaf must have only one value for the target column.

## Try to predict the high_income column.

## In order to be able to make a prediction for high_income, all of the rows in a leaf should have only a single value for high_income. Leaves can't have both 0 and 1 values in the high_income column.
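The single-value condition on a leaf can be checked directly. This is a small sketch, not part of the original notebook; `is_pure` is a hypothetical helper name and the label lists are made-up examples:

```python
def is_pure(labels):
    """Return True if every row in a candidate leaf has the same target value."""
    # A set keeps only distinct values; a pure leaf has at most one.
    return len(set(labels)) <= 1

# Made-up high_income values for two candidate leaves.
print(is_pure([0, 0, 0]))  # all rows agree, so this could be a leaf
print(is_pure([0, 1, 0]))  # mixed values, so this node needs another split
```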
***

## When we split, we'll try to separate as many 0s from 1s in the high_income column as we can. To do this, we need a metric for how "together" the different values in the high_income column are. A metric called entropy is commonly used for this purpose. Entropy refers to disorder: the more "mixed together" the 1s and 0s are, the higher the entropy.

## The formula for entropy looks like this: $-\sum_{i=1}^{c}P(x_i)\log_b P(x_i)$
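For a binary target such as high_income, the sum has only two terms ($c = 2$), and base $b = 2$ is the usual choice. Writing $p_0$ and $p_1$ for the proportions of 0s and 1s (notation introduced here for illustration), the formula becomes:

$$H = -\left(p_0 \log_2 p_0 + p_1 \log_2 p_1\right), \qquad p_0 + p_1 = 1$$

With base 2, $H$ is 1 when the two values are evenly mixed ($p_0 = p_1 = 1/2$) and approaches 0 as one value comes to dominate.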

| age | high_income |
| --- | --- |
| 25 | 1 |
| 50 | 1 |
| 30 | 0 |
| 50 | 0 |
| 80 | 1 |

In [21]:
# For the above data

import math
entropy = -(2/5 * math.log(2/5, 2) + 3/5 * math.log(3/5, 2))
print(entropy)

0.9709505944546686
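The same calculation can be wrapped in a reusable function so that the proportions do not have to be typed by hand. This is a sketch under the assumption that the labels fit in a plain Python list; the function name `entropy` and its `base` parameter are illustrative choices, not part of the notebook:

```python
import math
from collections import Counter

def entropy(labels, base=2):
    """Entropy of a sequence of class labels, following the formula above."""
    total = len(labels)
    counts = Counter(labels)  # number of rows for each distinct label
    # Sum P(x) * log_b P(x) over the labels that occur, then negate.
    return -sum(
        (n / total) * math.log(n / total, base) for n in counts.values()
    )

# high_income values from the five-row example: three 1s and two 0s.
print(entropy([1, 1, 0, 0, 1]))
```

This should print roughly 0.971, in line with the hand-computed value. Only labels that actually occur are counted, so the function never evaluates the logarithm of zero.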


In [25]:
print(income.high_income.value_counts())
print(income.shape[0])

0    24720
1     7841
Name: high_income, dtype: int64
32561


In [27]:
total = income.shape[0]
number_0 = income[income['high_income'] == 0].shape[0]
number_1 = total - number_0
p_0 = number_0 / total  # proportion of rows with high_income == 0
p_1 = number_1 / total  # proportion of rows with high_income == 1
income_entropy = -(p_0 * math.log(p_0, 2) + p_1 * math.log(p_1, 2))
income_entropy

0.7963839552022132