# Week 6 Notes

## 6.1 [Credit risk scoring project](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/01-credit-risk.md)

This week we'll build a credit risk scoring model. For example, when a customer applies for a loan, the bank decides whether to lend the money based on information about that customer. The model predicts the risk that a customer will default on their loan.

The model is trained on historical data about customers who received a loan and either repaid it or defaulted. It is a binary classification model, similar to last week's customer churn model.

$$
y_i \in \{0, 1\}
$$

0 means the customer did not default, 1 means the customer did default.


## 6.2 [Data cleaning and preparation](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/02-data-prep.md)


In [1]:
# !wget https://github.com/gastonstat/CreditScoring/raw/master/CreditScoring.csv

In [2]:
import pandas as pd


df = pd.read_csv("CreditScoring.csv")

First we'll lower case the columns:

In [3]:
df.columns = df.columns.str.lower()

Next, we'll replace the numerical codes with strings so that we can read the categories directly instead of seeing numbers. To this end we use `Series.map`, which takes a mapping dictionary as its argument:

In [4]:
home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}
 
marital_values = {
    1: 'single', 
    2: 'married', 
    3: 'widow', 
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}
 
records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}
 
job_values = {
    1: 'fixed', 
    2: 'partime', 
    3: 'freelance', 
    4: 'others',
    0: 'unk'
}

df.records = df.records.map(records_values)
df.marital = df.marital.map(marital_values)
df.home = df.home.map(home_values)
df.job = df.job.map(job_values)
 
df.head()

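| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | rent | 60 | 30 | married | no | freelance | 73 | 129 | 0 | 0 | 800 | 846 |
| 1 | 1 | 17 | rent | 60 | 58 | widow | no | fixed | 48 | 131 | 0 | 0 | 1000 | 1658 |
| 2 | 2 | 10 | owner | 36 | 46 | married | yes | freelance | 90 | 200 | 3000 | 0 | 2000 | 2985 |
| 3 | 1 | 0 | rent | 60 | 24 | single | no | fixed | 63 | 182 | 2500 | 0 | 900 | 1325 |
| 4 | 1 | 0 | rent | 36 | 26 | single | no | fixed | 46 | 107 | 0 | 0 | 310 | 910 |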
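
One thing to keep in mind with `Series.map` and a dictionary: any value that is missing from the dictionary becomes `NaN`. The following is a minimal sketch on a made-up Series (the code `9` is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Toy column of numeric codes; 9 is deliberately absent from the mapping.
codes = pd.Series([1, 2, 0, 9])
records_values = {1: "no", 2: "yes", 0: "unk"}

mapped = codes.map(records_values)

print(mapped.tolist())      # the unmapped code 9 shows up as NaN
print(mapped.isna().sum())  # how many codes the dictionary did not cover
```

Checking `isna().sum()` after mapping is a quick way to confirm that the dictionary covers every code in a column.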


In [5]:
df.describe().round()

| | status | seniority | time | age | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 | 4455.0 |
| mean | 1.0 | 8.0 | 46.0 | 37.0 | 56.0 | 763317.0 | 1060341.0 | 404382.0 | 1039.0 | 1463.0 |
| std | 0.0 | 8.0 | 15.0 | 11.0 | 20.0 | 8703625.0 | 10217569.0 | 6344253.0 | 475.0 | 628.0 |
| min | 0.0 | 0.0 | 6.0 | 18.0 | 35.0 | 0.0 | 0.0 | 0.0 | 100.0 | 105.0 |
| 25% | 1.0 | 2.0 | 36.0 | 28.0 | 35.0 | 80.0 | 0.0 | 0.0 | 700.0 | 1118.0 |
| 50% | 1.0 | 5.0 | 48.0 | 36.0 | 51.0 | 120.0 | 3500.0 | 0.0 | 1000.0 | 1400.0 |
| 75% | 2.0 | 12.0 | 60.0 | 45.0 | 72.0 | 166.0 | 6000.0 | 0.0 | 1300.0 | 1692.0 |
| max | 2.0 | 48.0 | 72.0 | 68.0 | 180.0 | 99999999.0 | 99999999.0 | 99999999.0 | 5000.0 | 11140.0 |


`income`, `assets`, and `debt` have a suspiciously large maximum of `99999999`. This value is a placeholder for missing data, not a real amount. We will replace it with `np.nan`:

In [6]:
import numpy as np

for c in ["income", "assets", "debt"]:
    df.loc[df[c]==99999999, c] = np.nan

df[["income", "assets", "debt"]].max()

income       959.0
assets    300000.0
debt       30000.0
dtype: float64
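
As an alternative to the loop, the same replacement can be written with `DataFrame.replace`. This is only a sketch on a small made-up frame (the values below are illustrative, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame that uses 99999999 as a placeholder for "missing".
toy = pd.DataFrame({
    "income": [80, 99999999, 120],
    "assets": [0, 3500, 99999999],
    "debt": [0, 0, 0],
})

# Replace the placeholder everywhere it occurs in the selected columns.
cleaned = toy.replace(99999999, np.nan)

print(cleaned.isna().sum())  # one missing value each in income and assets
print(cleaned.max())         # maxima are no longer dominated by the placeholder
```

Either form works; the loop in the notebook makes it explicit which columns are touched.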

We have one sample with unknown status. We will just drop it:

In [7]:
df.status.value_counts()

status
1    3200
2    1254
0       1
Name: count, dtype: int64

In [8]:
df = df[df["status"]!=0].reset_index(drop=True)

Also, we'll change the encoding. `2` means the customer defaulted, so it should become `1`, while `1` (no default) becomes `0`:

In [9]:
df.status = (df.status==2).astype(int)

In [10]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)

In [11]:
df_full_train = df_full_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
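
The two calls to `train_test_split` above give a 60%/20%/20% split: the first call holds out 20% for testing, and the second holds out 25% of the remaining 80%, which is again 20% of the full dataset. A small sketch on a made-up frame of 100 rows (the frame `toy` is an assumption for illustration) makes the arithmetic visible:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows so the percentages are easy to read off.
toy = pd.DataFrame({"x": range(100)})

# Same two-step split as in the notebook.
full_train, test = train_test_split(toy, test_size=0.2, random_state=11)
train, val = train_test_split(full_train, test_size=0.25, random_state=11)

print(len(train), len(val), len(test))  # 60 20 20
```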

In [12]:
df_train.head()

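| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 10 | owner | 36 | 36 | married | no | freelance | 75 | 0.0 | 10000.0 | 0.0 | 1000 | 1400 |
| 1 | 1 | 6 | parents | 48 | 32 | single | yes | fixed | 35 | 85.0 | 0.0 | 0.0 | 1100 | 1330 |
| 2 | 0 | 1 | parents | 48 | 40 | married | no | fixed | 75 | 121.0 | 0.0 | 0.0 | 1320 | 1600 |
| 3 | 1 | 1 | parents | 48 | 23 | single | no | partime | 35 | 72.0 | 0.0 | 0.0 | 1078 | 1079 |
| 4 | 0 | 5 | owner | 36 | 46 | married | no | freelance | 60 | 100.0 | 4000.0 | 0.0 | 1100 | 1897 |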


Let's define `y`:

In [13]:
y_train = df_train.status.values
y_val = df_val.status.values
y_test = df_test.status.values

Let's drop the target variable from our dataframes:

In [14]:
del df_train["status"]
del df_val["status"]
del df_test["status"]

Now we are ready to train a model.


## 6.3 [Decision trees](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/03-decision-trees.md)


A decision tree consists of a series of conditions that lead to a predicted outcome. In essence, it is a set of if-then-else rules, as in the example below.

<img src=decisiontree.png width=600>

If we were to write this example tree in code, it would be:

In [15]:
def assess_risk(client):
    if client["records"] == "yes":
        if client["job"] == "partime":  # spelled "partime" in the dataset's job mapping
            return "default"
        else:
            return "ok"
    else:
        if client["assets"] > 6000:
            return "ok"
        else:
            return "default"

Let's use this decision tree on the first record in `df_train`: 

In [16]:
xi = df_train.iloc[0].to_dict()

assess_risk(xi)

# y_train[0]

'ok'

Although we encoded a set of rules in the decision tree above, we can also learn the if-then-else rules using sklearn:

In [36]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score

In [24]:
train_dicts = df_train.fillna(0).to_dict(orient="records")

In [25]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

In [114]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

We will check the ROC AUC score on the validation dataset:

In [115]:
val_dicts = df_val.fillna(0).to_dict(orient="records")
X_val = dv.transform(val_dicts)

y_pred = dt.predict_proba(X_val)[:, 1]

roc_auc_score(y_val, y_pred)

np.float64(0.6580247511564263)

It's not great. Let's check it for our training dataset:

In [116]:
y_pred = dt.predict_proba(X_train)[:, 1]

roc_auc_score(y_train, y_pred)

np.float64(1.0)

The AUC score on the training dataset is perfect, whereas on the validation dataset it is poor. Whatever the model learned from the training dataset does not translate well to unseen data in the validation dataset. This is called **overfitting**.

What happens is that the model effectively creates a rule for each customer in the training dataset, but these rules do not hold in general. Such a model is said to have low bias and high variance. So far we have not restricted the depth of the decision tree, which means it can create as many conditions as it needs to fit the data. We end up with overly specific rules that represent individual samples in the training set rather than any general mechanism. To counter this, we can constrain the depth of the tree, which gives us less specific rules.

In [117]:
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)

In [118]:
y_pred = dt.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_pred)

np.float64(0.7761016984958594)

In [119]:
y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)

np.float64(0.7389079944782155)

With a `max_depth` of 3, performance on the validation dataset is better and in line with the training dataset, which suggests that what the model learned from the training dataset generalizes to unseen data.
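
The effect of tree depth on overfitting can also be reproduced on synthetic data. The sketch below is not part of the course notebook: the dataset comes from `make_classification`, and the parameter choices (sample size, label noise via `flip_y`, the list of depths) are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (flip_y).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.1, random_state=1
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=1)

results = {}
for depth in [1, 3, 5, 10, None]:  # None means the depth is unrestricted
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_tr, y_tr)
    auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_va = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    results[depth] = (auc_tr, auc_va)
    print(f"max_depth={depth}: train AUC={auc_tr:.3f}, val AUC={auc_va:.3f}")
```

The unrestricted tree memorizes the training set, so its training AUC is perfect, while the gap between training and validation AUC shows how much of that is overfitting.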

We can visualize the rules that our model came up with by visualizing the decision tree:

In [120]:
from sklearn.tree import export_text

print(export_text(dt, feature_names=dv.feature_names_))

|--- records=no <= 0.50
|   |--- seniority <= 6.50
|   |   |--- amount <= 862.50
|   |   |   |--- class: 0
|   |   |--- amount >  862.50
|   |   |   |--- class: 1
|   |--- seniority >  6.50
|   |   |--- income <= 103.50
|   |   |   |--- class: 1
|   |   |--- income >  103.50
|   |   |   |--- class: 0
|--- records=no >  0.50
|   |--- job=partime <= 0.50
|   |   |--- income <= 74.50
|   |   |   |--- class: 0
|   |   |--- income >  74.50
|   |   |   |--- class: 0
|   |--- job=partime >  0.50
|   |   |--- assets <= 8750.00
|   |   |   |--- class: 1
|   |   |--- assets >  8750.00
|   |   |   |--- class: 0
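
Feature names such as `records=no` and `job=partime` come from `DictVectorizer`, which one-hot encodes string values into 0/1 columns. A condition like `records=no <= 0.50` therefore means "the value of `records` is not `no`". A minimal sketch with two made-up records (the dictionaries below are illustrative, not rows from the dataset):

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy records: one categorical field and one numeric field.
toy_dicts = [
    {"records": "no", "income": 100},
    {"records": "yes", "income": 50},
]

dv_toy = DictVectorizer(sparse=False)
X_toy = dv_toy.fit_transform(toy_dicts)

print(dv_toy.feature_names_)  # ['income', 'records=no', 'records=yes']
print(X_toy)                  # numeric column kept as is, strings become 0/1 columns
```

For the first record, `records=no` is 1 and `records=yes` is 0, so a split on `records=no <= 0.50` sends it to the "greater than" branch.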



## 6.4 [Decision tree learning algorithm](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/04-decision-tree-learning.md)


## 6.5 [Decision trees parameter tuning](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/05-decision-tree-tuning.md)



## 6.6 [Ensemble learning and random forest](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/06-random-forest.md)



## 6.7 [Gradient boosting and XGBoost](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/07-boosting.md)



## 6.8 [XGBoost parameter tuning](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/08-xgb-tuning.md)



## 6.9 [Selecting the best model](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/09-final-model.md)



## 6.10 [Summary](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/10-summary.md)



## 6.11 [Explore more](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/11-explore-more.md)



## 6.12 [Homework](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/06-trees/homework.md)
