# Decision Trees

**Decision trees** are supervised learning models used for classification and regression problems. Trees are highly flexible, and that flexibility comes at a price: on one hand, trees can capture complex non-linear relationships; on the other hand, they are prone to memorizing the noise present in a dataset. By aggregating the predictions of trees that are trained differently, **ensemble methods** take advantage of the flexibility of trees while reducing their tendency to memorize noise. **Ensemble methods** are used across a variety of fields and have a proven track record in machine learning competitions.

In this notebook, you'll learn how to use Python to train **decision trees** and **tree-based models** with the user-friendly scikit-learn machine learning library. You'll understand the advantages and shortcomings of trees and demonstrate how ensembling can alleviate these shortcomings, all while practicing on real-world datasets. Finally, you'll also understand how to tune the most influential hyperparameters in order to get the most out of your models.
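As a quick illustration of the trade-off described above, the sketch below grows an unconstrained tree on noisy data and compares its accuracy on the data it was trained on with its accuracy on held-out data. The synthetic dataset, its parameters, and the variable names are assumptions chosen for illustration; they are not part of the exercises that follow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y=0.2 assigns a
# random class to about 20% of the samples, which injects label noise.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# No depth limit: the tree keeps splitting until every leaf is pure,
# so it can fit the noisy labels exactly.
deep_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

train_acc = deep_tree.score(X_tr, y_tr)
test_acc = deep_tree.score(X_te, y_te)
print("Train accuracy: {:.2f}".format(train_acc))
print("Test accuracy:  {:.2f}".format(test_acc))
```

A large gap between the two numbers is the usual symptom of a tree that has memorized noise rather than learned a general pattern.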

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

path = 'data/dc21/'

## Classification and Regression Trees (CART)

CART are a set of supervised learning models used for problems involving classification and regression. In this chapter, you'll be introduced to the CART algorithm.
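To give a feel for what happens at a single node, here is a minimal sketch of split selection for one feature, assuming the Gini impurity (scikit-learn's default criterion for classification trees). The helper names `gini` and `best_split` and the toy data are hypothetical, and the loop is a simplification of what a real implementation does.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    n = len(values)
    best_threshold, best_impurity = None, gini(labels)
    uniq = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(uniq, uniq[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Impurity of the two children, weighted by their sizes.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Toy data (made up for illustration): one feature, two classes.
radius = [11.0, 12.0, 13.0, 17.0, 18.0, 20.0]
label = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(radius, label)
print(threshold, impurity)  # 15.0 0.0: this split separates the classes perfectly
```

A full tree repeats this search over every feature at every node and recurses into the two children until a stopping rule such as `max_depth` is reached.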

<img src="images/tree_class_01.png" alt="" style="width: 400px;"/>

<img src="images/tree_class_02.png" alt="" style="width: 400px;"/>

<img src="images/tree_class_03.png" alt="" style="width: 400px;"/>

<img src="images/tree_class_04.png" alt="" style="width: 400px;"/>


## Train your first classification tree

In this exercise you'll work with the [Wisconsin Breast Cancer Dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) from the UCI machine learning repository. You'll predict whether a tumor is malignant or benign based on two features: the mean radius of the tumor (`radius_mean`) and its mean number of concave points (`concave points_mean`).

The dataset is split into 80% train and 20% test. The feature matrices are assigned to `X_train` and `X_test`, while the arrays of labels are assigned to `y_train` and `y_test`, where `class 1` corresponds to a malignant tumor and `class 0` corresponds to a benign tumor. To obtain reproducible results, a variable called `SEED` is set to 1.
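If the pre-split CSV files are not at hand, an 80/20 split like the one described above could be produced along the following lines. This is a sketch under assumptions: it uses scikit-learn's bundled copy of the dataset rather than the notebook's files, the column names differ from the CSVs, and the short variable names are chosen so they do not overwrite the notebook's own `X_train` and `X_test`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

SEED = 1

# scikit-learn's bundled copy of the dataset, used here as a stand-in for
# the CSV files (an assumption). Its column names are "mean radius" and
# "mean concave points" rather than radius_mean / concave points_mean.
data = load_breast_cancer(as_frame=True)
X = data.data[["mean radius", "mean concave points"]]

# The bundled target encodes malignant as 0 and benign as 1, so flip it
# to match this notebook's convention (1 = malignant, 0 = benign).
y = 1 - data.target

# 80% train / 20% test; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, which is the role `SEED` plays throughout the notebook.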

In [14]:
X_train = pd.read_csv(path+'X_train.csv', index_col=0)
X_train.head()

| | radius_mean | concave points_mean |
|---|---|---|
| 195 | 12.91 | 0.02377 |
| 560 | 14.05 | 0.04304 |
| 544 | 13.87 | 0.02369 |
| 495 | 14.87 | 0.04951 |
| 527 | 12.34 | 0.02647 |


In [15]:
y_train = pd.read_csv(path+'y_train.csv', index_col=0)
y_train.head()

| 195 | 0 |
|---|---|
| 560 | 0 |
| 544 | 0 |
| 495 | 0 |
| 527 | 0 |
| 222 | 0 |


In [16]:
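# Note: both reads point at the training CSVs, so X_test and y_test are
# copies of the training data (the shapes printed below confirm this).
# Point them at the held-out test files to evaluate on unseen data.
X_test = pd.read_csv(path+'X_train.csv', index_col=0)
y_test = pd.read_csv(path+'y_train.csv', index_col=0)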

In [17]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(454, 2)
(454, 1)
(454, 2)
(454, 1)


In [18]:
SEED = 1

# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6
dt = DecisionTreeClassifier(max_depth=6, random_state=SEED)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict test set labels
y_pred = dt.predict(X_test)
print(y_pred[0:5])

[0 0 0 0 0]


You've just trained your first classification tree! The cell output above shows the first five predictions made by the fitted tree on the test set. In the next exercise, you'll evaluate the tree's performance on the entire test set.

## Evaluate the classification tree

Now that you've fit your first classification tree, it's time to evaluate its performance on the test set. You'll do so using the **accuracy metric**, which corresponds to the fraction of correct predictions made on the test set.

The trained model `dt` from the previous exercise is loaded in your workspace along with the test set features matrix `X_test` and the array of labels `y_test`.
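To make the definition of accuracy concrete, the short sketch below computes it by hand and checks the result against `accuracy_score`. The label arrays are made up for illustration and are unrelated to the tumor data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration (1 = malignant, 0 = benign).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy = number of correct predictions / number of predictions.
manual_acc = np.mean(y_true == y_hat)

# accuracy_score takes the true labels first, then the predictions.
sklearn_acc = accuracy_score(y_true, y_hat)

print(manual_acc, sklearn_acc)  # 0.75 0.75 (6 of 8 predictions are correct)
```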

In [19]:
# Import accuracy_score
from sklearn.metrics import accuracy_score

# Predict test set labels
y_pred = dt.predict(X_test)

# Compute test set accuracy (true labels first, then predictions)
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))

Test set accuracy: 0.71


## Logistic regression vs classification tree

A **classification tree** divides the feature space into rectangular regions. In contrast, a linear model such as **logistic regression** produces only a single linear decision boundary dividing the feature space into two decision regions.

We have written a custom function called `plot_labeled_decision_regions()` that you can use to plot the decision regions of a list containing two trained classifiers. You can type `help(plot_labeled_decision_regions)` in the IPython shell to learn more about this function.

In [20]:
# Import LogisticRegression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression

# Instantiate logreg
logreg = LogisticRegression(random_state=1)

# Fit logreg to the training set; ravel() flattens the single-column
# label frame to 1-D, which avoids a DataConversionWarning
logreg.fit(X_train, y_train.values.ravel())

# Define a list called clfs containing the two classifiers logreg and dt
clfs = [logreg, dt]

# Review the decision regions of the two classifiers
# (left commented out because the helper is not defined in this notebook)
#plot_labeled_decision_regions(X_test, y_test, clfs)


<img src="images/tree_class_05.png" alt="" style="width: 400px;"/>

Notice how the decision boundary produced by logistic regression is linear while the boundaries produced by the classification tree divide the feature space into rectangular regions.
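The same contrast can be checked without a plot by inspecting the fitted models. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled copy of the dataset (with its own column names and a flipped target) instead of the notebook's CSV files, and it fits on the full data only to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (an assumption): the bundled dataset, two features only.
data = load_breast_cancer(as_frame=True)
X = data.data[["mean radius", "mean concave points"]]
y = 1 - data.target  # flip so that 1 = malignant, 0 = benign

tree_clf = DecisionTreeClassifier(max_depth=6, random_state=1).fit(X, y)
logreg = LogisticRegression(random_state=1).fit(X, y)

# Each internal node of the tree tests ONE feature against ONE threshold,
# so every boundary segment is parallel to an axis (rectangular regions).
is_split = tree_clf.tree_.feature >= 0  # leaves carry a negative marker
for f, t in zip(tree_clf.tree_.feature[is_split],
                tree_clf.tree_.threshold[is_split]):
    print("{} <= {:.4f}".format(X.columns[f], t))

# Logistic regression has a single boundary: w0*x0 + w1*x1 + b = 0.
w0, w1 = logreg.coef_[0]
b = logreg.intercept_[0]
print("{:.3f} * radius + {:.3f} * concave_points + {:.3f} = 0".format(w0, w1, b))
```

The tree prints one axis-aligned threshold per internal node, while logistic regression is fully described by three numbers that define one straight line.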