
# Decision tree (classification) algorithm
---
- **Training**:
    1. Find the most *informative* combination of `node of the tree`, `feature`, and `split value`.
    2. Perform the split if `max_depth` has not been reached.
    3. Repeat steps 1-2.
    
    
- **Inference** (prediction):
    - Follow the split rules from the root down to a leaf and predict that leaf's class.
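
The training and inference steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function names (`entropy`, `best_split`, `build_tree`, `predict_one`) and the toy data are hypothetical. It also assumes a simple depth-first recursion that splits each node in turn, rather than picking the best node across the whole tree.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def best_split(X, y):
    """Return (feature index, threshold, information gain) of the best split.

    Tries every midpoint between consecutive distinct values of each feature.
    Returns (None, None, 0.0) when no split reduces the entropy.
    """
    n_samples, n_features = X.shape
    parent = entropy(y)
    best = (None, None, 0.0)
    for j in range(n_features):
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, j] <= t
            n_left = left.sum()
            # Weighted average entropy of the two children
            child = (n_left * entropy(y[left])
                     + (n_samples - n_left) * entropy(y[~left])) / n_samples
            gain = parent - child
            if gain > best[2]:
                best = (j, t, gain)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split until max_depth is reached or no split helps."""
    feature, threshold, _ = best_split(X, y)
    if depth >= max_depth or feature is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": True, "label": values[np.argmax(counts)]}
    left = X[:, feature] <= threshold
    return {
        "leaf": False, "feature": feature, "threshold": threshold,
        "left": build_tree(X[left], y[left], depth + 1, max_depth),
        "right": build_tree(X[~left], y[~left], depth + 1, max_depth),
    }

def predict_one(tree, x):
    """Inference: follow the split rules from the root down to a leaf."""
    while not tree["leaf"]:
        go_left = x[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["label"]

# Toy data: feature 0 is noise, feature 1 separates the classes at 2.5
X_toy = np.array([[1.0, 1.0], [2.0, 2.0], [1.0, 3.0], [2.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

toy_feature, toy_threshold, toy_gain = best_split(X_toy, y_toy)
toy_tree = build_tree(X_toy, y_toy)
toy_prediction = predict_one(toy_tree, np.array([1.5, 3.2]))
```

On the toy data the search picks feature 1 with a threshold halfway between 2 and 3, because that split leaves both children pure.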
    

## Decision tree example

![](fancy_tree.png)

Picture link: https://yadi.sk/i/fKgXgTdruFVMng

---

## Probabilities (sample means)

> Before the first split:

$$P(y=\text{BLUE}) = \frac{9}{20} = 0.45$$

$$P(y=\text{YELLOW}) = \frac{11}{20} = 0.55$$

> After the first split:

$$P(y=\text{BLUE} \mid X \leq 12) = \frac{8}{13} \approx 0.62$$
$$P(y=\text{BLUE} \mid X > 12) = \frac{1}{7} \approx 0.14$$

$$P(y=\text{YELLOW} \mid X \leq 12) = \frac{5}{13} \approx 0.38$$
$$P(y=\text{YELLOW} \mid X > 12) = \frac{6}{7} \approx 0.86$$
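
The same sample means can be reproduced from the raw class counts. The snippet below is a small sketch; the dictionary layout and the helper name `class_probabilities` are made up for illustration, and the counts are the ones used in the fractions above.

```python
# Class counts as (BLUE, YELLOW) for the root and for each side of the split X <= 12
counts = {"parent": (9, 11), "left": (8, 5), "right": (1, 6)}

def class_probabilities(blue, yellow):
    """Return (P(BLUE), P(YELLOW)) as sample fractions."""
    total = blue + yellow
    return blue / total, yellow / total

probs = {node: class_probabilities(*c) for node, c in counts.items()}

for node, (p_blue, p_yellow) in probs.items():
    print(f"{node:>6}: P(BLUE)={p_blue:.2f}  P(YELLOW)={p_yellow:.2f}")
```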


---

## Entropy

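$$
H(p) = - \sum_{i=1}^{K} p_i\log(p_i)
$$

Here $K$ is the number of classes, $p_i$ is the fraction of samples in class $i$, and $\log$ is the natural logarithm.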


> Before the first split

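$$H = - 0.45 \log 0.45 - 0.55 \log 0.55 \approx 0.69 $$

> After the first split

$$H_{\text{left}} = - \frac{8}{13} \log \frac{8}{13} - \frac{5}{13} \log \frac{5}{13} \approx 0.67$$

$$H_{\text{right}} = - \frac{1}{7} \log \frac{1}{7} - \frac{6}{7} \log \frac{6}{7} \approx 0.41$$

The entropy after the split is the average of the child entropies, weighted by the share of samples in each child:

$$H_{\text{total}} = \frac{13}{20} \cdot 0.67 + \frac{7}{20} \cdot 0.41 \approx 0.58$$

## Information Gain
$$
IG = H(\text{parent}) - \sum_{c \in \text{children}} \frac{n_c}{n} H(c)
$$


$$IG = 0.69 - 0.58 = 0.11$$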
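
To check the arithmetic, the entropy and information gain of the first split can be computed directly from the class counts. This is a short sketch under stated assumptions: the helper name `entropy` and the variable names are illustrative, and the natural logarithm is assumed.

```python
import numpy as np

def entropy(counts):
    """Entropy (natural log) of a class-count vector; empty classes contribute 0."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

# (BLUE, YELLOW) counts before the split and in each child
parent, left, right = [9, 11], [8, 5], [1, 6]
n = sum(parent)

h_parent = entropy(parent)
# Child entropies weighted by the share of samples that falls into each child
h_children = sum(left) / n * entropy(left) + sum(right) / n * entropy(right)
info_gain = h_parent - h_children

print(f"H(parent)={h_parent:.2f}  H(children)={h_children:.2f}  IG={info_gain:.2f}")
```

Entropy is never negative, so a valid split always gives an information gain between 0 and the parent entropy.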

In [None]:
# !pip install -q kaggle
# !kaggle competitions download -c forest-cover-type-prediction
# !unzip forest-cover-type-prediction.zip -d forest-cover-type-prediction

# Kaggle's 'Forest Cover Type Prediction' competition

Read the data into pandas DataFrames. The data was downloaded as CSV files from the Kaggle competition's Data page: https://www.kaggle.com/c/forest-cover-type-prediction/data.

> You can install the `kaggle` package (https://github.com/Kaggle/kaggle-api) and download this dataset with `kaggle competitions download -c forest-cover-type-prediction`.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('./forest-cover-type-prediction/train.csv', index_col=0)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.Cover_Type.value_counts()

In [None]:
df.count()

# Split data

In [None]:
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('Cover_Type', axis=1),
                                                    df.Cover_Type, train_size=.80, random_state=1)

In [None]:
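clf = DecisionTreeClassifier()

# Hyperparameter grid: 2 criteria x 27 depths x 4 values of min_samples_split = 216 candidates
params_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': np.arange(3, 30),
    'min_samples_split': np.arange(10, 30, 5)
}

# 5-fold cross-validation with shuffling; random_state makes the folds reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=5)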

In [None]:
gs = GridSearchCV(clf, param_grid=params_grid, cv=cv, n_jobs=-1, verbose=1)

gs.fit(X_train, y_train)
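
Beyond the single best configuration, it can help to look at how every candidate scored. The sketch below shows one way to turn `cv_results_` into a ranked table. It is self-contained and runs on synthetic data from `make_classification` as a stand-in for the forest cover data; the `demo_*` names and the reduced grid are assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any tabular classification set works here)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)

demo_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 5]}
demo_cv = KFold(n_splits=3, shuffle=True, random_state=5)
demo_gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid=demo_grid, cv=demo_cv)
demo_gs.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
columns = ["param_criterion", "param_max_depth",
           "mean_test_score", "std_test_score", "rank_test_score"]
summary = pd.DataFrame(demo_gs.cv_results_)[columns].sort_values("rank_test_score")
print(summary)
```

Comparing `mean_test_score` against `std_test_score` shows whether the top candidates are separated by more than the fold-to-fold noise.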

In [None]:
gs.best_estimator_

In [None]:
gs.best_params_

In [None]:
gs.best_score_

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
y_pred = gs.predict(X_test)

accuracy_score(y_test, y_pred)
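
Accuracy is a single number, so it hides which classes get confused with each other. A confusion matrix gives the per-class picture. The example below is a sketch on small made-up label arrays (`y_true_demo` and `y_pred_demo` are hypothetical, not results from this dataset).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for three classes, purely for illustration
y_true_demo = np.array([1, 1, 2, 2, 3, 3, 3])
y_pred_demo = np.array([1, 2, 2, 2, 3, 3, 1])

acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3])

# Fraction of each true class that was predicted correctly (recall)
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(f"accuracy = {acc:.3f}")
print(cm)
print(per_class_recall)
```

The same calls work with `y_test` and `y_pred` from the cell above.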

# Public test set (we do not have labels for it)

In [None]:
test = pd.read_csv('./forest-cover-type-prediction/test.csv', index_col=0)

In [None]:
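# Refit the best model on all labeled data before predicting on the public test set
gs.best_estimator_.fit(df.drop('Cover_Type', axis=1), df.Cover_Type)

In [None]:
# gs.predict delegates to gs.best_estimator_, which was refit in the previous cell
y_pred_leaderboard = gs.predict(test)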

In [None]:
predictions = pd.DataFrame(data=y_pred_leaderboard,
                           index=test.index, 
                           columns=['Cover_Type'])
predictions.to_csv('decision_tree.csv')

In [None]:
# !kaggle competitions submit -c forest-cover-type-prediction -f decision_tree.csv -m "{'criterion': 'entropy', 'max_depth': 24, 'min_samples_split': 10}"

In [None]:
y_test.value_counts()

In [None]:
predictions.Cover_Type.value_counts()

# Useful links


- All parameters of `DecisionTreeClassifier` explained: https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680
