<font size="+5">#02 | Decision Tree. A Supervised Classification Model</font>

- Subscribe to my [Blog ↗](https://blog.pythonassembly.com/)
- Let's keep in touch on [LinkedIn ↗](https://www.linkedin.com/in/jsulopz) 😄

# Discipline to Search Solutions in Google

> Apply the following steps when **looking for solutions in Google**:
>
> 1. **Necessity**: How to load an Excel file in Python?
> 2. **Search in Google**: by keywords
>   - `load excel python`
>   - ~~how to load excel in python~~
> 3. **Solution**: What's the `function()` that loads an Excel file in Python?
>   - A function is to programming what the atom is to physics.
>   - Every time you want to do something in programming,
>   - **you will need a `function()`** to do it.
>   - Therefore, you must **look for parentheses `()`**
>   - among all the words you see on a website,
>   - because they indicate the presence of a `function()`.

# Load the Data

> Load the Titanic dataset with the commands below
> - This dataset describes the **people** (rows) aboard the Titanic
> - And their **sociological characteristics** (columns)
> - The aim is to predict the probability that a passenger `survived`
> - Based on their sociodemographic characteristics.

In [1]:
import seaborn as sns

df = sns.load_dataset(name='titanic').iloc[:, :4]

In [2]:
df.head()

| | survived | pclass | sex | age |
|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 |
| 1 | 1 | 1 | female | 38.0 |
| 2 | 1 | 3 | female | 26.0 |
| 3 | 1 | 1 | female | 35.0 |
| 4 | 0 | 3 | male | 35.0 |


# `DecisionTreeClassifier()` Model in Python

## Build the Model

> 1. **Necessity**: Build Model
> 2. **Google**: How do you search for the solution?
> 3. **Solution**: Find the `function()` that makes it happen

## Code Thinking

> Which function computes the Model?
> - `fit()`
>
> How can you **import the function in Python**?

In [3]:
from sklearn.tree import DecisionTreeClassifier

In [4]:
model = DecisionTreeClassifier()

In [5]:
model.fit()

TypeError: fit() missing 2 required positional arguments: 'X' and 'y'

### Separate Variables for the Model

> Regarding their role:
> 1. **Target Variable `y`**
>
> - [ ] What would you like **to predict**?
>
> 2. **Explanatory Variable `X`**
>
> - [ ] Which variable will you use **to explain** the target?

In [6]:
target = df.survived

In [7]:
df['pclass', 'sex']

KeyError: ('pclass', 'sex')

In [8]:
df.keys()

Index(['survived', 'pclass', 'sex', 'age'], dtype='object')

In [9]:
explanatory = df[['pclass', 'sex', 'age']]
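
The `KeyError` above comes from the brackets: `df['pclass', 'sex']` asks for a single column named by the tuple `('pclass', 'sex')`, while a list inside the brackets selects several columns. A minimal sketch, assuming a toy DataFrame built from the five rows shown by `df.head()` (not the full dataset):

```python
import pandas as pd

# Toy frame copied from the rows displayed by df.head(); an assumption for illustration.
df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'pclass': [3, 1, 3, 1, 3],
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Single brackets with two names: pandas looks for ONE column keyed by a tuple.
try:
    df['pclass', 'sex']
except KeyError as error:
    print('KeyError:', error)

# A list inside the brackets selects SEVERAL columns and returns a DataFrame.
explanatory = df[['pclass', 'sex', 'age']]

# A single name returns a Series; df.survived is shorthand for df['survived'].
target = df['survived']

print(type(explanatory).__name__, explanatory.shape)
print(type(target).__name__, target.shape)
```
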

### Finally `fit()` the Model

In [10]:
model = DecisionTreeClassifier()

In [11]:
model.fit(X=explanatory, y=target)

ValueError: could not convert string to float: 'male'

In [12]:
float('2.34')

2.34

In [13]:
float('male')

ValueError: could not convert string to float: 'male'

In [14]:
import pandas as pd

In [15]:
read_csv

NameError: name 'read_csv' is not defined

In [16]:
df

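| | survived | pclass | sex | age |
|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 |
| 1 | 1 | 1 | female | 38.0 |
| 2 | 1 | 3 | female | 26.0 |
| 3 | 1 | 1 | female | 35.0 |
| 4 | 0 | 3 | male | 35.0 |
| ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 |
| 887 | 1 | 1 | female | 19.0 |
| 888 | 0 | 3 | female | NaN |
| 889 | 1 | 1 | male | 26.0 |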


In [17]:
df = pd.get_dummies(data=df, drop_first=True)

In [18]:
df

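| | survived | pclass | age | sex_male |
|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 |
| 1 | 1 | 1 | 38.0 | 0 |
| 2 | 1 | 3 | 26.0 | 0 |
| 3 | 1 | 1 | 35.0 | 0 |
| 4 | 0 | 3 | 35.0 | 1 |
| ... | ... | ... | ... | ... |
| 886 | 0 | 2 | 27.0 | 1 |
| 887 | 1 | 1 | 19.0 | 0 |
| 888 | 0 | 3 | NaN | 0 |
| 889 | 1 | 1 | 26.0 | 1 |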
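
The model could not turn `'male'` into a number, so `pd.get_dummies()` replaces the text column with a 0/1 column. The sketch below repeats that step on a toy DataFrame built from the first five rows shown above. The `dtype=int` argument is an assumption added here, because recent pandas versions return `True`/`False` columns by default instead of the `1`/`0` shown in the output above.

```python
import pandas as pd

# Toy frame copied from the first five rows displayed above; not the full dataset.
df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'pclass': [3, 1, 3, 1, 3],
    'sex': ['male', 'female', 'female', 'female', 'male'],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Only the text column is encoded; numeric columns pass through unchanged.
# drop_first=True drops the first category (female), leaving a single sex_male column.
# dtype=int (assumption) forces 1/0 instead of True/False.
encoded = pd.get_dummies(data=df, drop_first=True, dtype=int)

print(encoded)
```

With two categories, one column is enough: `sex_male = 0` already means female, so keeping `sex_female` as well would add no information.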


In [34]:
explanatory = df.drop(columns='survived')

In [36]:
target = df.survived

In [38]:
model.fit(X=explanatory, y=target)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

In [39]:
explanatory

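| | pclass | age | sex_male |
|---|---|---|---|
| 0 | 3 | 22.0 | 1 |
| 1 | 1 | 38.0 | 0 |
| 2 | 3 | 26.0 | 0 |
| 3 | 1 | 35.0 | 0 |
| 4 | 3 | 35.0 | 1 |
| ... | ... | ... | ... |
| 886 | 2 | 27.0 | 1 |
| 887 | 1 | 19.0 | 0 |
| 888 | 3 | NaN | 0 |
| 889 | 1 | 26.0 | 1 |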


In [45]:
df = df.dropna().reset_index(drop=True)

In [46]:
explanatory = df.drop(columns='survived')
target = df.survived
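
The `NaN` error points at missing values, such as the empty `age` in row 888. A minimal sketch of how one might locate them before dropping the rows, assuming a toy DataFrame with three of the rows displayed above:

```python
import pandas as pd

# Rows 0, 1 and 888 as displayed above (after get_dummies); toy data for illustration.
df = pd.DataFrame({
    'survived': [0, 1, 0],
    'pclass': [3, 1, 3],
    'age': [22.0, 38.0, float('nan')],
    'sex_male': [1, 0, 0],
})

# Count the missing values in each column to see where the NaN comes from.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row with at least one missing value, then renumber the index from 0.
clean = df.dropna().reset_index(drop=True)
print(clean)
```

Dropping rows is the simplest option; it discards every passenger whose age is unknown, which is the trade-off the notebook accepts here.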

In [54]:
model = DecisionTreeClassifier()

In [55]:
model.__dict__

{'criterion': 'gini',
 'splitter': 'best',
 'max_depth': None,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'min_weight_fraction_leaf': 0.0,
 'max_features': None,
 'max_leaf_nodes': None,
 'random_state': None,
 'min_impurity_decrease': 0.0,
 'class_weight': None,
 'ccp_alpha': 0.0}

In [56]:
model.fit(X=explanatory, y=target)

DecisionTreeClassifier()

In [57]:
model.__dict__

{'criterion': 'gini',
 'splitter': 'best',
 'max_depth': None,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'min_weight_fraction_leaf': 0.0,
 'max_features': None,
 'max_leaf_nodes': None,
 'random_state': None,
 'min_impurity_decrease': 0.0,
 'class_weight': None,
 'ccp_alpha': 0.0,
 'feature_names_in_': array(['pclass', 'age', 'sex_male'], dtype=object),
 'n_features_in_': 3,
 'n_outputs_': 1,
 'classes_': array([0, 1]),
 'n_classes_': 2,
 'max_features_': 3,
 'tree_': <sklearn.tree._tree.Tree at 0x166121490>}
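
Comparing the two `__dict__` outputs, the entries added by `fit()` all end with an underscore, which is the scikit-learn convention for values learned from the data. The sketch below checks this on a toy dataset (the first five rows shown earlier, an assumption for illustration) using `check_is_fitted()` from `sklearn.utils.validation`:

```python
import pandas as pd
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_is_fitted

# Toy data copied from the first five rows displayed earlier; not the full dataset.
explanatory = pd.DataFrame({
    'pclass': [3, 1, 3, 1, 3],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'sex_male': [1, 0, 0, 0, 1],
})
target = pd.Series([0, 1, 1, 1, 0], name='survived')

model = DecisionTreeClassifier()

# Before fit(): only the configuration exists, so the check raises NotFittedError.
try:
    check_is_fitted(model)
    fitted_before = True
except NotFittedError:
    fitted_before = False

model.fit(X=explanatory, y=target)

# After fit(): the learned attributes are the ones whose names end with "_".
learned = [name for name in vars(model) if name.endswith('_')]

print('fitted before fit():', fitted_before)
print('learned attributes:', learned)
```
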

## Calculate a Prediction with the Model

> - `model.predict_proba()`

In [49]:
df[:1]

| | survived | pclass | age | sex_male |
|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 |
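
A possible way to complete this step: pass the first passenger to `predict_proba()`, using only the explanatory columns (the `survived` column must not be part of the input). The sketch assumes a toy dataset built from the first five rows shown earlier, so the probabilities it prints are not those of the full Titanic model.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data copied from the first five rows displayed earlier; not the full dataset.
explanatory = pd.DataFrame({
    'pclass': [3, 1, 3, 1, 3],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'sex_male': [1, 0, 0, 0, 1],
})
target = pd.Series([0, 1, 1, 1, 0], name='survived')

model = DecisionTreeClassifier(random_state=0)
model.fit(X=explanatory, y=target)

# Slice with [:1] to keep a DataFrame (one row, same columns used in fit()).
first_passenger = explanatory[:1]

# One row per passenger, one column per class, ordered like model.classes_:
# column 0 is P(survived=0) and column 1 is P(survived=1).
probabilities = model.predict_proba(X=first_passenger)

print(model.classes_)
print(probabilities)
```
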


## Model Visualization

> - `tree.plot_tree()`
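
A minimal sketch of `tree.plot_tree()`, assuming a toy dataset built from the first five rows shown earlier. The class labels passed to `class_names` are hypothetical names chosen for readability, and the `matplotlib.use('Agg')` line is only there so the code also runs as a script without a display; it is not needed inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, skip this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Toy data copied from the first five rows displayed earlier; not the full dataset.
explanatory = pd.DataFrame({
    'pclass': [3, 1, 3, 1, 3],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'sex_male': [1, 0, 0, 0, 1],
})
target = pd.Series([0, 1, 1, 1, 0], name='survived')

model = DecisionTreeClassifier(random_state=0)
model.fit(X=explanatory, y=target)

fig, ax = plt.subplots(figsize=(6, 4))

# Each box shows the split condition, the impurity, the sample count and the class.
annotations = tree.plot_tree(
    decision_tree=model,
    feature_names=list(explanatory.columns),
    class_names=['not survived', 'survived'],  # hypothetical labels for classes 0 and 1
    filled=True,
    ax=ax,
)
```
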

## Model Interpretation

> Why is `sex` the most important column? What does this have to do with **EDA** (Exploratory Data Analysis)?
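
One way to approach both questions is to compare the model's `feature_importances_` with a simple EDA summary, such as the survival rate per group. The sketch below assumes a toy dataset built from the first five rows shown earlier, so its numbers only illustrate the mechanics and say nothing about the full dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data copied from the first five rows displayed earlier; not the full dataset.
df = pd.DataFrame({
    'survived': [0, 1, 1, 1, 0],
    'pclass': [3, 1, 3, 1, 3],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'sex_male': [1, 0, 0, 0, 1],
})
explanatory = df.drop(columns='survived')
target = df['survived']

model = DecisionTreeClassifier(random_state=0)
model.fit(X=explanatory, y=target)

# Model view: share of the impurity reduction credited to each column (sums to 1).
importances = pd.Series(model.feature_importances_, index=explanatory.columns)
print(importances.sort_values(ascending=False))

# EDA view: average of the 0/1 target per group, i.e. the survival rate by sex.
survival_rate_by_sex = df.groupby('sex_male')['survived'].mean()
print(survival_rate_by_sex)
```

When the survival rates of two groups differ sharply, splitting on that column separates the classes well, which is what a high importance reflects.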

In [1]:
%%HTML

<iframe width="560" height="315" src="https://www.youtube.com/embed/7VeUPuFGJHk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

# Prediction vs Reality

> How good is our model?

## Accuracy

> - `model.score()` returns the accuracy: the share of predictions that match reality
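
For a classifier, `model.score()` returns the mean accuracy, that is, the share of rows where the prediction equals the real value. The sketch below checks this against `accuracy_score()` on a toy dataset (the first five rows shown earlier, an assumption for illustration). Scoring on the same rows used in `fit()` tends to be optimistic.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Toy data copied from the first five rows displayed earlier; not the full dataset.
explanatory = pd.DataFrame({
    'pclass': [3, 1, 3, 1, 3],
    'age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'sex_male': [1, 0, 0, 0, 1],
})
target = pd.Series([0, 1, 1, 1, 0], name='survived')

model = DecisionTreeClassifier(random_state=0)
model.fit(X=explanatory, y=target)

# score() predicts internally and compares the predictions with y.
score = model.score(X=explanatory, y=target)

# The same number, computed step by step.
predictions = model.predict(X=explanatory)
manual_accuracy = accuracy_score(y_true=target, y_pred=predictions)

print(score, manual_accuracy)
```
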

## Confusion Matrix

> 1. **Sensitivity** (share of correct predictions among the positive cases, $y=1$)
> 2. **Specificity** (share of correct predictions among the negative cases, $y=0$).
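
Both metrics can be read off the confusion matrix. The sketch below uses two short, hypothetical lists of real and predicted values (not output from the Titanic model) to show the arithmetic with `confusion_matrix()` from `sklearn.metrics`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 1 = survived, 0 = did not survive.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, rows are the real classes and columns the predicted ones:
# [[TN, FP],
#  [FN, TP]]
matrix = confusion_matrix(y_true=y_true, y_pred=y_pred)
tn, fp, fn, tp = matrix.ravel()

# Sensitivity: correct predictions among the real positives (y=1).
sensitivity = tp / (tp + fn)

# Specificity: correct predictions among the real negatives (y=0).
specificity = tn / (tn + fp)

print(matrix)
print('sensitivity:', sensitivity)  # 3 / (3 + 1) = 0.75
print('specificity:', specificity)  # 3 / (3 + 1) = 0.75
```
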

## ROC Curve

> A way to summarise how sensitivity and specificity trade off as the classification threshold changes
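
The ROC curve is built from predicted probabilities rather than from 0/1 predictions: each threshold gives one pair of true positive rate (sensitivity) and false positive rate (1 - specificity). A minimal sketch with hypothetical values, using `roc_curve()` and `roc_auc_score()` from `sklearn.metrics`; with a fitted model, the scores would come from `model.predict_proba(X)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical real values and predicted probabilities of class 1, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per threshold: tpr is the sensitivity, fpr is 1 - specificity.
fpr, tpr, thresholds = roc_curve(y_true=y_true, y_score=y_score)

# Area under the curve: 0.5 is a random guess, 1.0 a perfect ranking.
auc = roc_auc_score(y_true=y_true, y_score=y_score)

print('fpr:', fpr)
print('tpr:', tpr)
print('auc:', auc)  # 0.75
```
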