# Decision Trees
### Workshop 1 of DASIL's series on "Introduction to Machine Learning"
### Created by Martin Pollack, Yusen He, and Declan O'Reilly

In this Jupyter notebook we will teach you how to use Python's `scikit-learn` package to fit the machine learning models we talked about last week.

All of our example datasets come from the `datasets` sub-package within `scikit-learn`. So we import them now.

In [2]:
from sklearn import datasets

## Supervised Learning - Classification

#### Dataset Introduction

Let's continue our example of a classification problem, where the outcome takes one of 2 or more discrete values. Our predictors, however, can be either continuous or discrete.

We are going to use `scikit-learn`'s breast cancer dataset here. Each row is a patient, and our outcome is 0 for a malignant (cancerous) tumor or 1 for a benign (non-cancerous) tumor.

So in this case we actually have a *binary classification* problem, meaning our outcome can only take on 2 discrete values. In most binary classification problems, as in this case, the categories are coded as 0 and 1 to distinguish the presence or absence of some trait.

In [3]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer(as_frame=True)

In [None]:
breast_cancer.target.value_counts()

1    357
0    212
Name: target, dtype: int64
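To check which label each code stands for, we can look at the dataset's `target_names` attribute. The short sketch below pairs each code with its name and its count; the variable names `names` and `counts` are ours, chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer(as_frame=True)

# target_names[i] is the label that the integer code i stands for
names = list(breast_cancer.target_names)
counts = breast_cancer.target.value_counts()

# Print each code next to its label and how many patients have it
for code, name in enumerate(names):
    print(code, name, counts.loc[code])
```

Reading the codes off `target_names` is safer than assuming that 1 always means "has the disease".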

The `breast_cancer` dataset contains 30 predictive variables, built from the following 10 measurements:

* radius (mean of distances from center to points on the perimeter)
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry
* fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest (mean of the three worst/largest values) of these features were computed for each image, resulting in 30 features. For instance, field 0 is Mean Radius, field 10 is Radius SE, field 20 is Worst Radius.
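The field numbering above can be checked against the dataset's `feature_names` attribute. This is a minimal sketch; note that `scikit-learn` labels the standard error columns with the word "error" (for example `radius error`) rather than "SE".

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer(as_frame=True)
feature_names = list(breast_cancer.feature_names)

# Fields 0-9 are means, 10-19 are standard errors, 20-29 are "worst" values
for i in (0, 10, 20):
    print(i, feature_names[i])
```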



In [None]:
breast_cancer.data

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


#### Define the predictor variable set and a target


For supervised learning tasks like this one, a feature set `X` and a target `Y` need to be defined first.

In [4]:
X = breast_cancer.data
Y = breast_cancer.target
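Before fitting anything, it can help to confirm that the features and the target line up. The sketch below is only a sanity check, not a required step: it prints the shapes and verifies that both objects share the same row labels.

```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer(as_frame=True)
X = breast_cancer.data
Y = breast_cancer.target

# X has one row per patient and one column per predictor
print(X.shape)
# Y has one label per patient
print(Y.shape)
# The row labels must match so each label is paired with the right patient
print(X.index.equals(Y.index))
```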

#### Fit a decision tree

Now we can fit our first machine learning model!

To start we need to import the proper algorithm from `scikit-learn`. In this case, we want the `DecisionTreeClassifier`.

In [5]:
from sklearn.tree import DecisionTreeClassifier

As you will see, the machine learning process in `scikit-learn` consists of 3 steps.

First, we need to create our model object and save it to a variable. In this step we also need to specify any *hyperparameters* like the maximum tree depth or the number of features to consider at each split.

Below, we do not specify any of these hyperparameters, so `scikit-learn` will use its default values. The default maximum tree depth (`max_depth`) is `None`, so the tree keeps growing until every leaf is pure or has too few observations to split further. Note that the default splitting criterion is Gini impurity rather than entropy. The default number of features to consider at each split (`max_features`) is also `None`, so all features will always be considered.

In [6]:
classifier = DecisionTreeClassifier(random_state=0)
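To see the default values for yourself, you can call `get_params()` on the model object. The sketch below also builds a second model with hyperparameters set by hand; the values 3 and 5 and the names `default_tree` and `small_tree` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier

# Default hyperparameters, as in the cell above
default_tree = DecisionTreeClassifier(random_state=0)
params = default_tree.get_params()
print(params["max_depth"], params["max_features"], params["criterion"])

# Hypothetical alternative: at most 3 levels deep, and only 5 randomly
# chosen features considered at each split
small_tree = DecisionTreeClassifier(max_depth=3, max_features=5, random_state=0)
print(small_tree.get_params()["max_depth"])
```

Creating the object does not train anything yet; it only records the settings that the later `fit` call will use.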

Second, we fit our model object to our data. The first parameter to the fit method should be the input data, and the second should be the output data.

In [7]:
classifier.fit(X, Y)

DecisionTreeClassifier(random_state=0)
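Once the model is fitted, we can ask how large the tree actually grew. This is a small self-contained sketch that repeats the steps so far and then uses the `get_depth()` and `get_n_leaves()` methods; the variable names `depth` and `n_leaves` are ours.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

breast_cancer = load_breast_cancer(as_frame=True)
X = breast_cancer.data
Y = breast_cancer.target

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X, Y)

# With no depth limit, the tree is free to grow until its leaves are pure
depth = classifier.get_depth()
n_leaves = classifier.get_n_leaves()
print(depth, n_leaves)
```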

Third, we can use our model to predict the outcome for any set of input values. Just give one argument to the predict method: the input data you want to predict with. We do this on the first line below.

On the second line below, we compute how accurate our model was on the data it was trained on, that is, what percentage of observations were correctly predicted. We got 100%! That may seem good, but on Thursday we will talk about why it potentially is not.



In [14]:
print(classifier.predict(X))
print((classifier.predict(X) == Y).mean() * 100)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 