# Module 7 - Decision Trees

A decision tree is a supervised learning algorithm used mainly for classification (regression trees also exist), and it is popular for its ease of use and generally good performance. It learns decision boundaries by repeatedly splitting the dataset into groups that share similar attributes and target outcomes. This technique is called **recursive partitioning**.

When new data is given to the decision tree model for prediction, it starts at the **root node** (the most significant decision for prediction), then follows a path of **decision nodes** that ultimately leads to a **terminal node** holding the predicted category.

![Decision Tree](https://miro.medium.com/max/2649/1*iMOtF7bwKPHl1Pg52xN7fg.png)
Source: [Towards Data Science: Decision Tree in Layman's Terms](https://towardsdatascience.com/decision-tree-in-laymans-terms-part-1-76e1f1a6b672)
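
To make the root node, decision node, and terminal node vocabulary concrete, here is a minimal sketch of how a prediction walks down a tree. The tree below is written by hand with made-up splits (the `toy_tree` dictionary and the `predict_one` helper are hypothetical names, not part of scikit-learn), so it only illustrates the traversal, not what the model will actually learn from the Titanic data.

```python
# A hand-written toy tree (hypothetical splits, NOT learned from the Titanic data).
# Decision nodes hold a feature, a threshold, and two branches;
# terminal nodes hold only a prediction.
toy_tree = {
    "feature": "sex",        # root node: the first question asked
    "threshold": 0.5,        # sex <= 0.5 means female (0), otherwise male (1)
    "left": {"prediction": "survived"},
    "right": {
        "feature": "age",    # decision node reached only by male passengers
        "threshold": 10.0,
        "left": {"prediction": "survived"},
        "right": {"prediction": "died"},
    },
}


def predict_one(node, passenger):
    """Follow decision nodes from the root until a terminal node is reached."""
    while "prediction" not in node:
        if passenger[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]


print(predict_one(toy_tree, {"sex": 1, "age": 35}))
print(predict_one(toy_tree, {"sex": 0, "age": 35}))
```

Each pass through the loop answers one yes/no question, which is why a trained tree can be read as a flowchart.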

## Titanic Survival

In this lesson, we will clean and prepare the dataset of Titanic passengers using the same methods as the previous lesson, in order to build a decision tree model. Then we will compare the results of both the logistic regression and decision tree models to determine which one is the better performing model. 

In [None]:
# import libraries 
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# import functions directly from scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# read in dataset
filepath = "datasets/titanic.xls"

df = pd.read_excel(filepath)
df.head()

### Data Dictionary

The dataset contains the following features (characteristics) in the columns:

- `pclass`: passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class) 
- `survived`: survival status (0 = No(died), 1 = Yes(survived))
- `name`: passenger's name
- `sex`: passenger's sex (male/female)
- `age`: passenger's age
- `sibsp`: number of siblings and/or spouses with passenger
- `parch`: number of parents and/or children with passenger
- `ticket`: ticket number
- `fare`: total fare for passenger and others in party (currency: British Pound)
- `cabin`: room cabin number(s) for passenger and their party
- `embarked`: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- `boat`: lifeboat name (combination of letters and/or numbers)
- `body`: body identification number
- `home.dest`: hometown or destination after disembark

### Clean and Prepare Data

After conducting exploratory analysis, we need to clean up the data and prepare it in a format the predictive model can use. Columns relevant to our model will be cleaned and prepared, while columns and rows that are not significant for prediction will be removed.

`age` and `embarked` are the only columns with missing values that will be used in the predictive model. We will fill in their missing values with a "guesstimate"; all other columns with missing values will be removed from the dataframe.

In [None]:
# identify columns with missing values
df.isnull().sum()

#### Clean `age` column

Because there are many values missing in the `age` column, we will create an estimate value for `age` that's specific to a person's survival status, as well as other significant characteristics like sex and passenger class (a proxy for socio-economic status).

The `.transform()` function returns a column with the same number of rows as the original dataframe, where each row holds the aggregated value of the group that row belongs to. In this example, `.transform('mean')` produces a column where each row (passenger) holds the mean age of all passengers with the same survival status, sex, and passenger class.

We then pass the transformed column to the `.fillna()` function, which uses the value from the matching row of the transformed column only where the passenger's age is missing.
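
Before applying this to the Titanic data, it may help to see the same pattern on a tiny example. The sketch below uses a made-up five-row dataframe (`toy`, `group_mean`, and `age_filled` are illustrative names) and groups by a single column to keep the output easy to read.

```python
import numpy as np
import pandas as pd

# made-up data: two passengers have a missing age
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "age": [20.0, np.nan, 30.0, 50.0, np.nan],
})

# one value per ORIGINAL row: the mean age of that row's group
# (missing ages are ignored when the mean is computed)
group_mean = toy.groupby("sex")["age"].transform("mean")

# fillna only changes rows where age is missing; known ages are kept
toy["age_filled"] = toy["age"].fillna(group_mean)

print(toy)
```

Because `group_mean` has the same index as `toy`, pandas can line the two columns up row by row when filling the gaps.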

In [None]:
# average age grouped by survival status, sex, and passenger class
df.groupby(['survived', 'sex', 'pclass'])['age'].mean()

In [None]:
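# store transformed column as a variable
transform_age = df.groupby(['survived', 'sex', 'pclass'])['age'].transform('mean')

In [None]:
# fill missing values for age using values from transformed column
# (assigning the result back avoids in-place modification of a selected column, which newer pandas versions warn about)
df['age'] = df['age'].fillna(transform_age)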

In [None]:
# verify there are no more missing age values
df.isnull().sum()

#### Clean `embarked` column

Because the values in the `embarked` column are string categories, we can't use statistical methods to impute the missing information. There are very few rows of missing data, so we can fill in those values with the most common port of embarkation.

In [None]:
# number of passengers for each embarkation port
df['embarked'].value_counts()

In [None]:
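# fill missing values with "S" for Southampton (most common port)
# (assigning the result back avoids in-place modification of a selected column, which newer pandas versions warn about)
df['embarked'] = df['embarked'].fillna('S')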

In [None]:
# verify no missing values in 'embarked' column
df.isnull().sum()
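
As an alternative to typing the port letter by hand, the most common category can be looked up with `.mode()`. The sketch below runs on a small made-up series (`ports` is an illustrative stand-in for the `embarked` column), so the values are assumptions rather than the real passenger counts.

```python
import numpy as np
import pandas as pd

# made-up stand-in for the 'embarked' column, with two missing values
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# .mode() returns a Series of the most frequent value(s); missing values are ignored
most_common = ports.mode()[0]

# fill the gaps with the most common port
filled = ports.fillna(most_common)

print(most_common)
print(filled.isnull().sum())
```

Note that `.mode()` can return more than one value when there is a tie, in which case `[0]` simply takes the first.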

#### Remove unnecessary columns

Now that the values are filled for columns that will be used in the predictive model, we can remove the columns that we do not need.

In [None]:
# remove columns that will not be used in the model
modeldf = df.drop(['name','ticket','fare', 'cabin', 'boat', 'body', 'home.dest'], axis=1)

In [None]:
# columns in the new dataframe
modeldf.columns

### Feature Engineering

Some columns in the data are object types that have string values that cannot be used in the algorithm function. During the Module 4 lesson, we transformed ordinal qualitative values into a numerical representation that preserved their ranking. For nominal (non-ordered) data, there is no ranking, so we will use **one-hot encoding** to numerically represent the values.

One-hot encoding is a technique that takes discrete (categorical) data values from a column and creates a new column for each distinct category value. Within each column, the values `0` or `1` will be assigned, indicating a `True (1)` or `False (0)` value for that category. The one-hot encoded columns created are called **dummy variables**. 

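The `pd.get_dummies()` function extracts the categories from a column, turns each one into a dummy variable, and assigns the indicator values. When a column is listed in the `columns=` argument, `pd.get_dummies()` automatically drops that source column from the result. Depending on the pandas version, the indicators may be displayed as `True`/`False` rather than `1`/`0`; both forms work with the model.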

In [None]:
# dummy variables for embarkation port
modeldf = pd.get_dummies(data=modeldf, columns=['embarked'])
modeldf.head()
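
The effect of one-hot encoding is easiest to see on a few rows. This is a minimal sketch on a made-up three-row dataframe (`toy` and `encoded` are illustrative names); `dtype=int` is passed only so the dummy variables print as `0`/`1` regardless of pandas version.

```python
import pandas as pd

# made-up data: one passenger per embarkation port
toy = pd.DataFrame({"pclass": [1, 3, 2], "embarked": ["S", "C", "Q"]})

# one new column per distinct port; the original 'embarked' column is dropped
encoded = pd.get_dummies(data=toy, columns=["embarked"], dtype=int)

print(encoded)
```

Each row has exactly one `1` across the `embarked_*` columns, marking the port that passenger boarded from.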

#### Categorical sex values as boolean

Boolean values are commonly used to represent binary (two-option) categorical data. For the model, we will replace the string values for sex with `0` and `1`.

In [None]:
# reassign 'female'= 0, 'male'= 1
modeldf['sex'] = modeldf['sex'].map({'female':0, 'male':1})
modeldf.head()

#### Combine family member total

In the dataset, there are separate columns for immediate family members in the same generation as the passenger (`sibsp` - sibling/spouse) and in different generations (`parch` - parent/child). For the purposes of the sinking, what matters is how many family members a passenger was traveling with in total, so we will account for them all in a single column. We can also hypothesize that the more family members a passenger was traveling with, the more difficult it would have been to quickly move everybody to safety. For this reason, more family members might be linked to decreased survival likelihood, which is why we will create a new `family_num` column in the dataframe.

In [None]:
# create new column based on number of family members
modeldf['family_num'] = modeldf['sibsp'] + modeldf['parch']

# drop sibsp and parch columns
modeldf.drop(['sibsp', 'parch'], axis=1, inplace=True)
modeldf.head()
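
One way to sanity-check the family-size hypothesis is to compare the survival rate for each value of `family_num`. The sketch below shows the pattern on a made-up five-row dataframe (`toy` and `survival_rate` are illustrative names, and the values are invented, so the numbers say nothing about the real passengers). The same `groupby` could be run on `modeldf`.

```python
import pandas as pd

# invented values, for illustration only
toy = pd.DataFrame({
    "sibsp":    [0, 1, 0, 4, 3],
    "parch":    [0, 0, 0, 2, 2],
    "survived": [1, 1, 0, 0, 0],
})

# total family members traveling with each passenger
toy["family_num"] = toy["sibsp"] + toy["parch"]

# the mean of a 0/1 column is the share of 1s, i.e. the survival rate per family size
survival_rate = toy.groupby("family_num")["survived"].mean()

print(survival_rate)
```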

### Predictive Modeling

The data is "done" being cleaned and prepared, so now we can build, or **fit**, our decision tree model. There are a few final tasks that need to be done before the data is given to the model:

- Separate the attributes (features used to predict) from the target (outcome to predict)
- Shuffle the order of the rows in the dataset, then separate into a dataset for training (for the model to learn from) and testing (to see how well it predicts with new data)

When the model finishes "learning" with the training data, we will evaluate its performance.

#### Separate attributes and target

The target is the column of data we are teaching the model to predict. In math, this information is typically represented as the variable `y`, so we keep the same convention. The attributes (characteristics) used to predict `y` are stored in a variable called `X`. Although `y` is a single column, `X` is a dataframe of all the attribute columns.

In [None]:
# 'survived' is target variable
y = modeldf['survived']

In [None]:
# attributes are all the columns EXCEPT 'survived'
X = modeldf.drop(['survived'], axis=1)

#### Separate training and test data

Scikit-learn's `train_test_split()` function takes the attribute columns (`X` variable) and target column (`y` variable), shuffles the rows, and then splits them into training and test sets. Shuffling happens by default; the `random_state=` argument seeds the random number generator so that the same shuffle, and therefore the same split, is reproduced on every run (see the [`random_state` documentation](https://scikit-learn.org/stable/glossary.html#term-random-state) for more information). By default, `test_size=` sets aside 25% of the dataset as the test set, leaving the other 75% for the training set, but you can adjust the split ratio.

`train_test_split()` then generates four outputs in this order - a dataframe of the attributes for the training set (`X_train`), a dataframe of the attributes for the test set (`X_test`), a column of the target for the training set (`y_train`), and a column of the target for the test set (`y_test`).

In [None]:
# separate 80% for training data, 20% for test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
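
The behavior described above can be checked on a small made-up dataset. In this sketch (`X_toy`, `y_toy`, and the shortened output names are illustrative), the split is run twice with the same `random_state` to show that the rows land in the same sets both times.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up data: 10 rows, 2 attribute columns, 1 target column
X_toy = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})
y_toy = pd.Series([0, 1] * 5)

# same arguments, run twice
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)          # 8 training rows, 2 test rows
print(X_te.index.equals(X_te2.index))  # same test rows in both runs
print(X_tr.index.equals(y_tr.index))   # attributes and target stay aligned
```

Changing or removing `random_state` would generally select different rows, which makes results harder to compare between runs.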

#### Train the model

`DecisionTreeClassifier()` creates an untrained decision tree model. Its `.fit()` method then takes the `X_train` and `y_train` data and learns the sequence of splits on the attributes that best separates the target categories in the training data.

In [None]:
# create a decision tree classifier and assign it to a variable
model = DecisionTreeClassifier()

In [None]:
# give training data to learn
model.fit(X_train, y_train)

In [None]:
# overall ratio of correct predictions for training data
model.score(X_train, y_train)
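
To see what a fitted tree has actually learned, scikit-learn's `export_text()` prints the splits as indented rules. The sketch below fits a separate tree on a made-up six-row dataset (`X_toy`, `y_toy`, and `toy_model` are illustrative names, and the data is invented so that sex alone separates the two outcomes). The same call could be applied to the `model` trained above, though its printout would be much longer.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# invented data in which sex alone determines the outcome
X_toy = pd.DataFrame({
    "sex":    [0, 0, 0, 1, 1, 1],
    "pclass": [1, 2, 3, 1, 2, 3],
})
y_toy = [1, 1, 1, 0, 0, 0]

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X_toy, y_toy)

# text view of the learned splits, starting from the root node
rules = export_text(toy_model, feature_names=list(X_toy.columns))
print(rules)

# number of decision levels between the root and the deepest terminal node
print(toy_model.get_depth())
```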

#### Evaluate the model on test data

To assess how well the model will perform on new data, we will use the test set to:

- Display the ratio of overall correct predictions
- Compare the number of correct and incorrect predictions for each target category
- For each target category, compare the ratio of correct predictions to all predicted values (precision) and to all actual values (recall)

In [None]:
# overall ratio of correct predictions for test data
model.score(X_test, y_test)

In [None]:
# generate predictions
y_pred = model.predict(X_test)

In [None]:
# compare how many items in each category model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted: Died', 'Predicted: Survived'],
    index=['Actual: Died', 'Actual: Survived']
)

cm

In [None]:
# compare ratio of correct predictions vs all predicted values for each category (precision)
# compare ratio of correct predictions vs all actual values for each category (recall)

print(classification_report(y_test, y_pred))
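
The precision and recall figures in the report can be reproduced by hand from the four cells of the confusion matrix. This is a minimal sketch on made-up labels (`y_true` and `y_hat` are illustrative, not the Titanic test set), computing both metrics for the "survived" category (1) and checking them against scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# made-up actual and predicted labels (0 = died, 1 = survived)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# for binary labels, ravel() returns the cells in this order:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of everyone predicted to survive, how many did
recall = tp / (tp + fn)     # of everyone who survived, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(accuracy)
```

Swapping the roles of the two categories (treating "died" as the positive class) gives the other row of the report.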