# 6. Decision Trees and Ensemble Learning

## 6.1 Credit risk scoring project

- `Dataset`: https://github.com/gastonstat/CreditScoring

| **`Column`** | **`Meaning`**              |
| ------------ | -------------------------- |
| 1 Status     | credit status              |
| 2 Seniority  | job seniority (years)      |
| 3 Home       | type of home ownership     |
| 4 Time       | time of requested loan     |
| 5 Age        | client's age               |
| 6 Marital    | marital status             |
| 7 Records    | existence of records       |
| 8 Job        | type of job                |
| 9 Expenses   | amount of expenses         |
| 10 Income    | amount of income           |
| 11 Assets    | amount of assets           |
| 12 Debt      | amount of debt             |
| 13 Amount    | amount requested of loan   |
| 14 Price     | price of good              | 

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 6.2 Data cleaning and preparation

- Downloading the dataset
- Re-encoding the categorical variables
- Doing the train / validation / test split

### Downloading the dataset

In [2]:
data = "https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-06-trees/CreditScoring.csv"
!wget -c $data

--2023-10-17 16:36:55--  https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-06-trees/CreditScoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8001::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [3]:
df = pd.read_csv(data)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4455 entries, 0 to 4454
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Status     4455 non-null   int64
 1   Seniority  4455 non-null   int64
 2   Home       4455 non-null   int64
 3   Time       4455 non-null   int64
 4   Age        4455 non-null   int64
 5   Marital    4455 non-null   int64
 6   Records    4455 non-null   int64
 7   Job        4455 non-null   int64
 8   Expenses   4455 non-null   int64
 9   Income     4455 non-null   int64
 10  Assets     4455 non-null   int64
 11  Debt       4455 non-null   int64
 12  Amount     4455 non-null   int64
 13  Price      4455 non-null   int64
dtypes: int64(14)
memory usage: 487.4 KB


In [5]:
df.head()

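|   | Status | Seniority | Home | Time | Age | Marital | Records | Job | Expenses | Income | Assets | Debt | Amount | Price |
| - | ------ | --------- | ---- | ---- | --- | ------- | ------- | --- | -------- | ------ | ------ | ---- | ------ | ----- |
| 0 | 1 | 9 | 1 | 60 | 30 | 2 | 1 | 3 | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | 1 | 60 | 58 | 3 | 1 | 1 | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | 2 | 10 | 2 | 36 | 46 | 2 | 2 | 3 | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | 1 | 0 | 1 | 60 | 24 | 1 | 1 | 1 | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | 1 | 0 | 1 | 36 | 26 | 1 | 1 | 1 | 46 | 107 | 0 | 0 | 310 | 910 |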


The categorical variables are already encoded as numbers. This is convenient for a machine learning model, but hard for humans to read, so it helps to find out which categories the numbers stand for.

Lowercasing the column headers:

In [6]:
df.columns = df.columns.str.lower()

The original data was preprocessed with the following R script: [Part1_CredScoring_Processing.R](https://github.com/gastonstat/CreditScoring/blob/master/Part1_CredScoring_Processing.R)

### Re-encoding the categorical variables

Relevant code:
```R
# change factor levels (i.e. categories)
levels(dd$Status) = c("good", "bad")
levels(dd$Home) = c("rent", "owner", "priv", "ignore", "parents", "other")
levels(dd$Marital) = c("single", "married", "widow", "separated", "divorced")
levels(dd$Records) = c("no_rec", "yes_rec")
levels(dd$Job) = c("fixed", "partime", "freelance", "others")
```

In [7]:
status_values = {
    1: "ok", 
    2: "default",
    0: "unk"
}
df["status"] = df["status"].map(status_values)

home_values = {
    1: "rent",
    2: "owner",
    3: "private",
    4: "ignore",
    5: "parents",
    6: "other",
    0: "unk"
}
df["home"] = df["home"].map(home_values)

marital_values = {
    1: "single",
    2: "married",
    3: "widow",
    4: "separated",
    5: "divorced",
    0: "unk"
}
df["marital"] = df["marital"].map(marital_values)

records_values = {
    1: "no",
    2: "yes",
    0: "unk"
}
df["records"] = df["records"].map(records_values)

job_values = {
    1: "fixed",
    2: "partime",
    3: "freelance",
    4: "others",
    0: "unk"
}
df["job"] = df["job"].map(job_values)

In [8]:
df.head()

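|   | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
| - | ------ | --------- | ---- | ---- | --- | ------- | ------- | --- | -------- | ------ | ------ | ---- | ------ | ----- |
| 0 | ok | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | ok | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | default | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | ok | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | ok | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910 |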


**Handling missing values**

In [9]:
df.describe().round()

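|       | seniority | time | age | expenses | income | assets | debt | amount | price |
| ----- | --------- | ---- | --- | -------- | ------ | ------ | ---- | ------ | ----- |
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 763317.0 | 1060341.0 | 404382.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 8703625.0 | 10217569.0 | 6344253.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3500.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 166.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 99999999.0 | 99999999.0 | 99999999.0 | 5000.0 | 11140.0 |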


The `max` values of `income`, `assets` and `debt` are $99999999$, a value that stems from the R script mentioned above and acts as a placeholder for missing data. It has to be replaced with a proper missing-value marker such as `np.nan`.

In [10]:
for c in ["income", "assets", "debt"]:
    df[c] = df[c].replace(to_replace=99999999, value=np.nan)

In [11]:
# The 99999999-values as max are gone now 
df.describe().round()

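|       | seniority | time | age | expenses | income | assets | debt | amount | price |
| ----- | --------- | ---- | --- | -------- | ------ | ------ | ---- | ------ | ----- |
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4421.0 | 4408.0 | 4437.0 | 4455.0 | 4455.0 |
| mean | 8.0 | 46.0 | 37.0 | 56.0 | 131.0 | 5403.0 | 343.0 | 1039.0 | 1463.0 |
| std | 8.0 | 15.0 | 11.0 | 20.0 | 86.0 | 11573.0 | 1246.0 | 475.0 | 628.0 |
| min | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3000.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 12.0 | 60.0 | 45.0 | 72.0 | 165.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 48.0 | 72.0 | 68.0 | 180.0 | 959.0 | 300000.0 | 30000.0 | 5000.0 | 11140.0 |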


Removing the customer with unknown status (`unk`):

In [12]:
df = df[df["status"] != "unk"].reset_index(drop=True)

### Doing the train / validation / test split

In [13]:
from sklearn.model_selection import train_test_split
rs = 11

In [14]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=rs)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=rs)

print(f"train: {len(df_train)} | {len(df_train) / len(df)*100:.0f}%")
print(f"val: {len(df_val)} | {len(df_val) / len(df)*100:.0f}%")
print(f"test: {len(df_test)} | {len(df_test) / len(df)*100:.0f}%")


train: 2672 | 60%
val: 891 | 20%
test: 891 | 20%


In [15]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [16]:
# obtaining the target-values that are used for the machine learning model. default -> 1, ok -> 0 
y_train = (df_train["status"] == "default").astype("int").values
y_val   = (df_val["status"] == "default").astype("int").values
y_test  = (df_test["status"] == "default").astype("int").values

In [17]:
# removing the target values from the data (errors="ignore" keeps the cell safe to re-run)
for frame in (df_train, df_val, df_test):
    frame.drop(columns=["status"], errors="ignore", inplace=True)

The preprocessing is done, and the data can now be used to train a machine learning model.
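The frames still contain string categories and `NaN` values, which scikit-learn estimators generally cannot consume as-is. One common way to bridge that gap is `DictVectorizer`, which one-hot encodes string values and passes numbers through. The sketch below is only an illustration on a tiny made-up frame (`df_demo` and its values are hypothetical, not taken from the credit data), and filling missing numbers with `0` is an assumption, not the only reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Tiny made-up frame (hypothetical values) with one categorical
# and one numerical column that contains a missing value.
df_demo = pd.DataFrame({
    "home": ["rent", "owner", "rent"],
    "income": [129.0, 200.0, np.nan],
})

# Assumption: missing numbers are filled with 0 before vectorizing.
dicts_demo = df_demo.fillna({"income": 0}).to_dict(orient="records")

# sparse=False returns a plain NumPy array instead of a sparse matrix.
dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(dicts_demo)

print(dv_demo.feature_names_)  # ['home=owner', 'home=rent', 'income']
print(X_demo)
```

On the real data the same pattern would be fitted on `df_train` and then reused with `transform` on `df_val` and `df_test`, so that all three matrices share the same columns.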

## 6.3 Decision Trees

- What a decision tree looks like
- Training a decision tree
- Overfitting
- Controlling the size of a tree
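The points above can be sketched with scikit-learn's `DecisionTreeClassifier`. To stay self-contained, the example uses synthetic data from `make_classification` as a stand-in (an assumption, not the credit scoring dataset), and all `*_syn` names are illustrative. It compares an unrestricted tree with one limited by `max_depth` and prints the rules of the small tree.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption: NOT the credit scoring dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_syn_train, X_syn_val, y_syn_train, y_syn_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)


def auc_of(model, X, y):
    """AUC based on the predicted probability of the positive class."""
    return roc_auc_score(y, model.predict_proba(X)[:, 1])


# No limits: the tree keeps splitting until every leaf is pure.
dt_full = DecisionTreeClassifier(random_state=1).fit(X_syn_train, y_syn_train)
# Limited depth: at most two levels of if-then-else rules.
dt_small = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_syn_train, y_syn_train)

for name, model in [("unrestricted", dt_full), ("max_depth=2", dt_small)]:
    print(
        f"{name:>12}: train AUC {auc_of(model, X_syn_train, y_syn_train):.3f}"
        f" | val AUC {auc_of(model, X_syn_val, y_syn_val):.3f}"
    )

# The learned rules of the small tree as readable text.
feature_names_syn = [f"x{i}" for i in range(X_syn.shape[1])]
print(export_text(dt_small, feature_names=feature_names_syn))
```

The unrestricted tree memorizes the training set (training AUC of 1.0), and the gap between its training and validation scores is the symptom of overfitting that limiting the tree size is meant to reduce.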

## 6.4 Decision Tree learning algorithm

- Finding the best split for one column
- Finding the best split for the entire dataset
- Stopping criteria
- Decision Tree learning algorithm
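Finding the best split for one column can be sketched in a few lines. The version below is a simplified illustration, not scikit-learn's implementation: it uses the misclassification rate as impurity and weights both sides by their size (both are assumptions; Gini or entropy are common alternatives). The toy values in `assets_demo` and `default_demo` are made up.

```python
import numpy as np


def misclassification_rate(labels):
    """Share of labels that disagree with the majority label of the group."""
    if len(labels) == 0:
        return 0.0
    share_positive = labels.mean()
    return min(share_positive, 1 - share_positive)


def best_split_one_column(values, labels):
    """Try every candidate threshold T (rule: value <= T) and keep the least impure one."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    best_t, best_impurity = None, np.inf
    # The largest value is skipped: splitting there would leave the right side empty.
    for t in np.unique(values)[:-1]:
        left = labels[values <= t]
        right = labels[values > t]
        # Size-weighted average impurity of the two groups.
        impurity = (
            len(left) * misclassification_rate(left)
            + len(right) * misclassification_rate(right)
        ) / len(labels)
        if impurity < best_impurity:
            best_t, best_impurity = t, impurity
    return best_t, best_impurity


# Toy data (made up): asset values and whether the client defaulted (1) or not (0).
assets_demo = [8000, 2000, 0, 5000, 5000, 4000, 9000, 3000]
default_demo = [1, 1, 1, 0, 0, 0, 0, 1]

best_t, best_impurity = best_split_one_column(assets_demo, default_demo)
print(f"best threshold: {best_t}, impurity: {best_impurity:.3f}")
# best threshold: 3000, impurity: 0.125
```

For the entire dataset, the same search would be repeated for every column and the overall least impure (column, threshold) pair kept. The procedure is then applied recursively to both groups until a stopping criterion is met, for example a pure group, the maximum depth, or a group that is too small.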

## 6.5 Decision Trees parameter tuning

- selecting `max_depth`
- selecting `min_samples_leaf`
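One straightforward way to select both parameters is to try a small grid and compare validation AUC. The sketch below again uses synthetic stand-in data (an assumption), and the candidate values are arbitrary examples rather than recommendations.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the credit scoring dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_syn_train, X_syn_val, y_syn_train, y_syn_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)

tuning_rows = []
for depth in [2, 4, 6, 10]:
    for leaf in [1, 5, 20]:
        dt = DecisionTreeClassifier(
            max_depth=depth, min_samples_leaf=leaf, random_state=1
        )
        dt.fit(X_syn_train, y_syn_train)
        y_pred = dt.predict_proba(X_syn_val)[:, 1]
        tuning_rows.append((depth, leaf, roc_auc_score(y_syn_val, y_pred)))

df_tuning = pd.DataFrame(tuning_rows, columns=["max_depth", "min_samples_leaf", "auc"])

# One row per min_samples_leaf, one column per max_depth.
tuning_pivot = df_tuning.pivot(index="min_samples_leaf", columns="max_depth", values="auc")
print(tuning_pivot.round(3))
```

The pivoted table can also be passed to `sns.heatmap(tuning_pivot, annot=True, fmt=".3f")` to spot the best region visually.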

## 6.6 Ensembles and random forest

- Board of experts
- Ensemble models
- Random forest - ensembling decision trees
- Tuning random forest

Other useful parameters:
- `max_features`
- `bootstrap`

Link: [sklearn.ensemble.RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
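A minimal sketch of training and tuning a random forest, assuming synthetic stand-in data instead of the credit data. The values chosen for `max_depth`, `min_samples_leaf` and the `n_estimators` candidates are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the credit scoring dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_syn_train, X_syn_val, y_syn_train, y_syn_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)

rf_rows = []
for n in [10, 40, 70, 100]:
    rf = RandomForestClassifier(
        n_estimators=n,        # number of trees in the ensemble
        max_depth=5,           # size limit for each individual tree
        min_samples_leaf=3,
        max_features="sqrt",   # features considered at each split
        bootstrap=True,        # each tree sees a bootstrap sample of the rows
        random_state=1,
        n_jobs=-1,
    )
    rf.fit(X_syn_train, y_syn_train)
    y_pred = rf.predict_proba(X_syn_val)[:, 1]
    rf_rows.append((n, roc_auc_score(y_syn_val, y_pred)))

for n, auc in rf_rows:
    print(f"n_estimators={n:>3}: val AUC {auc:.3f}")
```

Row sampling (`bootstrap`) and feature sampling (`max_features`) are what make the individual trees differ from each other, which is the point of the "board of experts" idea.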

## 6.7 Gradient boosting and XGBoost

- Gradient boosting vs random forest
- Installing `XGBoost`
- Training the first model
- Performance monitoring
- Parsing `XGBoost`'s monitoring output
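To illustrate sequential training and per-round performance monitoring without requiring an XGBoost installation, the sketch below swaps in scikit-learn's `GradientBoostingClassifier` as a stand-in. It is a different implementation of gradient boosting, so parameter names and results will not match XGBoost. The data is synthetic (an assumption).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the credit scoring dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_syn_train, X_syn_val, y_syn_train, y_syn_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)

# Trees are added one after another; each new tree corrects the current ensemble.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.3, max_depth=3, random_state=1
)
gb.fit(X_syn_train, y_syn_train)

# staged_predict_proba yields the predictions after 1, 2, ..., n_estimators trees,
# which gives one AUC value per boosting round.
train_auc = [
    roc_auc_score(y_syn_train, p[:, 1]) for p in gb.staged_predict_proba(X_syn_train)
]
val_auc = [
    roc_auc_score(y_syn_val, p[:, 1]) for p in gb.staged_predict_proba(X_syn_val)
]

for i in range(0, len(val_auc), 10):
    print(f"round {i:>3}: train AUC {train_auc[i]:.3f} | val AUC {val_auc[i]:.3f}")

best_round = int(np.argmax(val_auc))
print(f"best validation AUC at round {best_round}")
```

With XGBoost itself, comparable per-round numbers are reported during training when evaluation sets are passed through the `evals` argument of `xgboost.train`.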

## 6.8 XGBoost parameter tuning

Tuning the following parameters:
- `eta`
- `max_depth`
- `min_child_weight`

Other parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

Useful ones:
- `subsample` and `colsample_bytree`
- `lambda` and `alpha`
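The parameters above are passed to XGBoost as a plain dictionary. The sketch below only builds candidate dictionaries and does not import XGBoost. The candidate values are arbitrary examples, and a full grid is just one option, since tuning one parameter at a time is cheaper.

```python
from itertools import product

# Settings kept fixed during tuning (values shown are XGBoost's documented defaults,
# apart from objective, eval_metric and seed, which are choices made for this sketch).
base_params = {
    "objective": "binary:logistic",  # binary classification, outputs probabilities
    "eval_metric": "auc",
    "subsample": 1.0,         # share of rows sampled for each tree
    "colsample_bytree": 1.0,  # share of columns sampled for each tree
    "lambda": 1.0,            # L2 regularization on leaf weights
    "alpha": 0.0,             # L1 regularization on leaf weights
    "seed": 1,
}

# Parameters to tune (candidate values are arbitrary examples).
param_grid = {
    "eta": [0.01, 0.1, 0.3],          # learning rate: how much each new tree contributes
    "max_depth": [3, 6, 10],          # depth limit per tree
    "min_child_weight": [1, 10, 30],  # minimum weight required in a child node
}

# One dictionary per combination of the tuned parameters.
candidates = [
    {**base_params, **dict(zip(param_grid, combo))}
    for combo in product(*param_grid.values())
]

print(len(candidates))  # 27
print(candidates[0])
```

Each dictionary in `candidates` would serve as the parameter argument of one training run, with the validation AUC deciding which combination to keep.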

## 6.9 Selecting the final model

- Choosing between XGBoost, random forest and decision tree
- Training the final model
- Saving the model
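A minimal sketch of the last two steps, assuming a random forest as the chosen model and synthetic stand-in data. Both are assumptions made for illustration, and the file name `model_demo.bin` is hypothetical. The model is trained on the combined train and validation data, evaluated once on the test set, and saved with `pickle`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the credit scoring dataset).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
# "Full train" = train + validation; the test part is only used for the final check.
X_syn_full_train, X_syn_test, y_syn_full_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1
)

final_model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=1)
final_model.fit(X_syn_full_train, y_syn_full_train)

auc_test = roc_auc_score(y_syn_test, final_model.predict_proba(X_syn_test)[:, 1])
print(f"test AUC: {auc_test:.3f}")

# Save the model to a file in a temporary directory and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "model_demo.bin")
with open(model_path, "wb") as f_out:
    pickle.dump(final_model, f_out)

with open(model_path, "rb") as f_in:
    loaded_model = pickle.load(f_in)

# The loaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    final_model.predict_proba(X_syn_test), loaded_model.predict_proba(X_syn_test)
)
print(same_predictions)
```

If the features were produced by a fitted transformer such as a `DictVectorizer`, it makes sense to save that object together with the model, for example as a tuple in the same file, so that new data can be encoded the same way.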

## 6.10 Summary

- Decision trees learn if-then-else rules from data.
- Finding the best split: select the least impure split. This algorithm can overfit, which is why we control it by limiting the maximum depth and the minimum size of a group.
- Random forest is a way of combining multiple decision trees. It should have a diverse set of models to make good predictions.
- Gradient boosting trains models sequentially: each model tries to fix the errors of the previous one. XGBoost is an implementation of gradient boosting.