# German Credit dataset

## Contents

4. Build a baseline model
5. Prepare the data to better expose the underlying patterns to machine learning algorithms (including feature engineering)
6. Explore many models; select a model and train it
7. Fine-tune the model
8. Present your solution
9. Deploy, monitor and maintain your system



##### TODO
- Ensemble model?
- Deploy


## The metric: f2
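
The F-beta score combines precision P and recall R as F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). With beta = 2, recall counts more heavily than precision. The notebook imports `f2` from `src/modeling_utilities.py`, whose source is not shown here. The sketch below assumes it is a scorer built with scikit-learn's `make_scorer`; the name `f2_scorer` and the toy labels are illustrative only.

```python
from sklearn.metrics import fbeta_score, make_scorer

# Hypothetical stand-in for the `f2` object imported from src/modeling_utilities.py
f2_scorer = make_scorer(fbeta_score, beta=2)

# Toy labels (not from the dataset): 3 true positives, 1 false negative, 1 false positive
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]

# Precision and recall are both 3/4 here, so F2 is 3/4 as well
f2_value = fbeta_score(y_true, y_pred, beta=2)
print(f2_value)
```

A scorer built this way can be passed as `scoring=f2_scorer` to `cross_val_score` or `GridSearchCV`.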

In [1]:
# Imports
import numpy as np
import pandas as pd




In [2]:
# sklearn imports
from sklearn.metrics import (accuracy_score, recall_score, precision_score, fbeta_score, roc_auc_score, classification_report)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV


In [3]:
# Custom utilities imports
from src.modeling_utilities import Baseline, classification_scores, f2

In [4]:
# Get the (user-friendly) data for a baseline model
df = pd.read_csv('data/user_friendly_cats.csv')
df.tail(5)

| | tenure | amount | rate | residence | age | credits | maintenance | history | savings | employment | ... | status | purpose | guarantor | installments | housing | telephone | foreign | sex | personal | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 995 | 12 | 1736 | 3 | 4 | 31 | 1 | 1 | so far so good | [0, 100) | [4, 7) | ... | no account | furniture | none | none | ownership | none | True | female | female divorced/separated/married | 0 |
| 996 | 30 | 3857 | 4 | 4 | 40 | 1 | 1 | so far so good | [0, 100) | [1, 4) | ... | overdrawn | used car | none | none | ownership | yes | True | male | male divorced/separated | 0 |
| 997 | 12 | 804 | 4 | 4 | 38 | 1 | 1 | so far so good | [0, 100) | [7, inf) | ... | no account | television | none | none | ownership | none | True | male | male single | 0 |
| 998 | 45 | 1845 | 4 | 4 | 23 | 1 | 1 | so far so good | [0, 100) | [1, 4) | ... | overdrawn | television | none | none | without payment | yes | True | male | male single | 1 |
| 999 | 45 | 4576 | 3 | 4 | 27 | 1 | 1 | critical | [100, 500) | unemployed | ... | petty | used car | none | none | ownership | none | True | male | male single | 0 |


# 4. Baseline model

This baseline model is based on a simple lookup table approach. You can view the code here:
[src/modeling_utilities.py](src/modeling_utilities.py)
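
The actual `Baseline` class lives in `src/modeling_utilities.py` and is not reproduced in this notebook. As a rough illustration of what a lookup-table classifier can look like, the sketch below stores the positive-label rate for every combination of the selected features and predicts 1 when that rate reaches the threshold. The function names, the fallback to the overall positive rate for unseen combinations, and the toy rows are all assumptions, not the author's implementation.

```python
import pandas as pd

def fit_lookup(X, y, features):
    """Build a table of positive-label rates per feature combination."""
    data = X[features].assign(label=list(y))
    table = data.groupby(features, as_index=False)["label"].mean()
    table = table.rename(columns={"label": "proba"})
    prior = float(data["label"].mean())  # fallback for unseen combinations
    return table, prior

def predict_lookup(X, table, prior, features, threshold=0.5):
    """Look up each row's rate and compare it with the threshold."""
    merged = X[features].merge(table, on=features, how="left")
    proba = merged["proba"].fillna(prior).to_numpy()
    return (proba >= threshold).astype(int)

# Toy rows (category names borrowed from the table above, labels invented)
features = ["status", "history"]
train = pd.DataFrame({
    "status": ["overdrawn", "overdrawn", "no account", "no account", "petty", "petty"],
    "history": ["critical", "critical", "so far so good", "so far so good", "critical", "critical"],
})
labels = [1, 1, 0, 0, 1, 0]
test = pd.DataFrame({
    "status": ["overdrawn", "no account", "petty", "unknown"],
    "history": ["critical", "so far so good", "critical", "critical"],
})

table, prior = fit_lookup(train, labels, features)
preds = predict_lookup(test, table, prior, features, threshold=0.5)
print(preds)
```

Lowering `threshold` makes the rule flag more rows as positive, which is the knob the grid search further down tunes.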

In [5]:
# Train Test Split
X = df.copy()
y = X.pop('label')
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

In [6]:
# This baseline model is based on a simple lookup table approach
baseline = Baseline(best_features=['status', 'history', 'savings'], threshold=0.5)
baseline.fit(Xtrain, ytrain)

In [7]:
# The default threshold of 0.5 gives us the following results on the test set:
ypred = baseline.predict(Xtest)
classification_scores(ytest, ypred)

accuracy     0.73
precision    0.55
recall       0.47
f1           0.51
f2           0.49
dtype: float64

In [8]:
# Cross validation F2 score (on the whole dataset; with the default threshold of 0.5)
cross_val_score(baseline, X, y, scoring=f2, cv=5).mean()

0.5086351820901933

In [9]:
# AUC
ytrue = ytest
yscore = baseline.predict_proba(Xtest)
roc_auc_score(ytrue, yscore)

0.7540569780021636

In [10]:
# Hyperparameter grid search: the best threshold is 0.125, which gives F2 = 0.71 on the test set
gs = GridSearchCV(baseline, {'threshold': np.linspace(0.05, 0.2, num=7)}, cv=5, scoring=f2).fit(Xtrain, ytrain)
print("threshold =", gs.best_estimator_.threshold)

ytrue = ytest
ypred = gs.best_estimator_.predict(Xtest)
classification_scores(ytrue, ypred)

threshold = 0.125


accuracy     0.48
precision    0.35
recall       0.95
f1           0.52
f2           0.71
dtype: float64

So the goal is to beat the baseline's F2 score of 0.71 (and, if possible, its AUC of 0.754).
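
As a sanity check, the F2 values can be recomputed from the rounded precision and recall reported in the two score tables. The helper name `f_beta` is illustrative; the inputs are the rounded figures printed above, so the results agree with the reported scores only to within rounding.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta from precision and recall; beta > 1 favours recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f2_default = f_beta(0.55, 0.47)  # threshold 0.5: roughly 0.48 (reported 0.49)
f2_tuned = f_beta(0.35, 0.95)    # threshold 0.125: roughly 0.71 (reported 0.71)
print(round(f2_default, 3), round(f2_tuned, 3))
```

Because recall is weighted four times as heavily as precision in the denominator, the tuned model's recall of 0.95 outweighs its drop in precision and accuracy.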

# 5. Data Preprocessing

### Load the original dataset

In [11]:
df_original = pd.read_csv("data/german.data", header=None, delimiter=' ')
df_original.head()

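| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | ... | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| 1 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| 2 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| 3 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| 4 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |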


### Train Test Split (stratify=y)

In [12]:
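# Column 20 holds the label; split it off so train_test_split returns four pieces (test_size and random_state mirror the baseline split)
X = df_original.copy()
y = X.pop(20)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)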
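
`train_test_split` returns one train/test pair for each array it receives, so unpacking four values requires passing features and labels separately. The sketch below uses synthetic stand-in data (1000 rows with 30% positives, not the real file) to show that `stratify=y` keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 30% positive labels
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 300 + [0] * 700)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Share of positives in each part; stratification keeps both close to 0.3
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(len(y_tr), len(y_te), train_rate, test_rate)
```

Without `stratify`, the positive share in a 200-row test set can drift by a few percentage points from split to split, which makes F2 comparisons noisier.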



### Plot numerical features to identify outliers (scatter_matrix); do IQR*1.5 test
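
A minimal sketch of the IQR*1.5 test named in the heading: a value is flagged when it lies more than 1.5 interquartile ranges below the first quartile or above the third. The function name `iqr_outliers` is illustrative. The first five amounts are the values of column 4 shown in the table above; the last one is an invented extreme value added to trigger the rule.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Five amounts from the displayed rows plus one invented extreme value
amounts = pd.Series([1169, 5951, 2096, 7882, 4870, 60000])
mask = iqr_outliers(amounts)
print(amounts[mask])
```

For the visual check, `pd.plotting.scatter_matrix` can be applied to the numeric columns (it requires matplotlib).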

### Deal with missing values

...  2. ...



### Prepare the data

1. Data cleaning:
    - Fix or remove outliers (optional).
    - Fill in missing values (e.g., with zero, mean, median...) or drop their rows (or columns).

2. Feature selection (optional):
    - Drop the attributes that provide no useful information for the task.

3. Encode (dummify) categorical features. Consider a numeric encoding for categorical features with too many categories.

4. Feature engineering, where appropriate:
    - Discretize continuous features.
    - Decompose features (e.g., categorical, date/time, etc.).
    - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
    - Aggregate features into promising new features.

5. Feature scaling: standardize or normalize features.
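
The steps above can be sketched as a scikit-learn preprocessing pipeline. This is one possible arrangement, not the notebook's final design: the column choices, the median and most-frequent imputation, and the `log1p` transformation are assumptions. The toy frame reuses a few values from the user-friendly table plus one deliberately missing amount.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

numeric_cols = ["amount", "age"]   # assumed numeric columns
categorical_cols = ["purpose"]     # assumed categorical column

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # step 1: fill missing values
    ("log", FunctionTransformer(np.log1p)),         # step 4: log transformation
    ("scale", StandardScaler()),                    # step 5: standardize
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # step 3: dummify
])
preprocess = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

toy = pd.DataFrame({
    "amount": [1736, 3857, np.nan, 1845],
    "age": [31, 40, 38, 23],
    "purpose": ["furniture", "used car", "television", "television"],
})
features = preprocess.fit_transform(toy)
print(features.shape)  # 2 numeric columns + 3 one-hot columns
```

Wrapping the steps in a single transformer means they are fitted on the training fold only when the whole pipeline is passed to `cross_val_score` or `GridSearchCV`, which avoids leaking test-set statistics.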


### Checklist:
https://github.com/ageron/handson-ml3/blob/main/ml-project-checklist.md

### TODO
Ensemble


