# German Credit dataset

## Contents

4. Build a baseline model
5. Prepare the data to better expose the underlying patterns to machine learning algorithms (incl. feature engineering)
6. Explore many models; select a model and train it
7. Fine-tune the model
8. Present your solution
9. Deploy, monitor and maintain your system



##### TODO
- Ensemble model?
- Deploy


## The metric: f2
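
For reference, F2 is the F-beta score with β = 2, which weights recall more heavily than precision. A plausible motivation (an assumption, not stated in this notebook) is that missing a bad credit costs more than flagging a good one. In terms of precision $P$ and recall $R$:

$$
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R},
\qquad
F_2 = \frac{5\,P\,R}{4\,P + R}
$$

In scikit-learn this corresponds to `fbeta_score(y_true, y_pred, beta=2)`.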

<br>

### Imports

In [20]:
# imports from Python Standard Library
import re

from collections import Counter

In [21]:
# Third party imports
import numpy as np
import pandas as pd


In [22]:
# sklearn imports
from sklearn.metrics import (accuracy_score, recall_score, precision_score, fbeta_score, roc_auc_score, classification_report)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV


In [23]:
# Custom utilities imports
from src.helper_utilities import load_data
from src.modeling_utilities import Baseline, classification_scores, f2

### Load the data

In [24]:
# Get the (user-friendly) data for a baseline model
df = load_data(mode='analysis', format='dataframe')

# Save the "user friendly" dataframe for EDA as csv
df.to_csv("data/user_friendly_cats.csv", index=False)

# Reload the data from the saved CSV to work around the pandas quirk with Categoricals
df = pd.read_csv("data/user_friendly_cats.csv")
df.head()


|   | tenure | amount | rate | residence | age | credits | maintenance | history | savings | employment | ... | status | purpose | guarantor | installments | housing | telephone | foreign | sex | personal | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 1169 | 4 | 4 | 67 | 2 | 1 | critical | no savings | [7, inf) | ... | overdrawn | television | none | none | ownership | yes | True | male | male single | 0 |
| 1 | 48 | 5951 | 2 | 2 | 22 | 1 | 1 | so far so good | [0, 100) | [1, 4) | ... | petty | television | none | none | ownership | none | True | female | female divorced/separated/married | 1 |
| 2 | 12 | 2096 | 2 | 3 | 49 | 1 | 2 | critical | [0, 100) | [4, 7) | ... | no account | education | none | none | ownership | none | True | male | male single | 0 |
| 3 | 42 | 7882 | 2 | 4 | 45 | 1 | 2 | so far so good | [0, 100) | [4, 7) | ... | overdrawn | furniture | guarantor | none | without payment | none | True | male | male single | 0 |
| 4 | 24 | 4870 | 3 | 4 | 53 | 2 | 2 | delay | [0, 100) | [1, 4) | ... | overdrawn | car | none | none | without payment | none | True | male | male single | 1 |


# 4. Baseline model

This baseline model is based on a simple lookup table approach. You can view the code here:
[src/modeling_utilities.py](src/modeling_utilities.py)
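
To make the idea concrete, here is a minimal sketch of what a lookup-table classifier could look like. It is not the code in `src/modeling_utilities.py`: the class name `LookupBaseline`, the tuple-based input and the fallback to the overall positive rate are all assumptions made for illustration.

```python
from collections import defaultdict


class LookupBaseline:
    """Hypothetical lookup-table classifier (illustration only)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, rows, labels):
        # rows: one tuple of categorical feature values per sample
        counts = defaultdict(lambda: [0, 0])  # key -> [n_samples, n_positives]
        for key, label in zip(rows, labels):
            counts[key][0] += 1
            counts[key][1] += label
        # Share of positives observed for every feature combination
        self.table_ = {key: pos / n for key, (n, pos) in counts.items()}
        # Assumed fallback for combinations never seen during training
        self.prior_ = sum(labels) / len(labels)
        return self

    def predict_proba(self, rows):
        return [self.table_.get(key, self.prior_) for key in rows]

    def predict(self, rows):
        return [int(p >= self.threshold) for p in self.predict_proba(rows)]


# Toy data: (status, history) pairs with made-up labels
train_rows = [("overdrawn", "critical"), ("overdrawn", "critical"),
              ("no account", "delay"), ("no account", "delay")]
train_labels = [1, 0, 0, 0]

model = LookupBaseline(threshold=0.5).fit(train_rows, train_labels)
queries = [("overdrawn", "critical"), ("petty", "delay")]
probs = model.predict_proba(queries)
preds = model.predict(queries)
```

The second query was never seen in training, so the sketch falls back to the overall positive rate; the real `Baseline` may handle that case differently.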

In [25]:
# Train Test Split
X = df.copy()
y = X.pop('label')
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

In [26]:
# This baseline model is based on a simple lookup table approach
baseline = Baseline(best_features=['status', 'history', 'savings'], threshold=0.5)
baseline.fit(Xtrain, ytrain)

In [27]:
# The default threshold of 0.5 gives us the following results on the test set:
ypred = baseline.predict(Xtest)
classification_scores(ytest, ypred)

accuracy     0.73
precision    0.55
recall       0.47
f1           0.51
f2           0.49
dtype: float64
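
The five scores above all follow from the confusion-matrix counts. The sketch below shows one way to compute them; `scores_from_labels` is a hypothetical stand-in, not the `classification_scores` helper imported earlier, and it assumes label 1 is the positive class.

```python
def scores_from_labels(y_true, y_pred):
    """Accuracy, precision, recall, F1 and F2 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_beta(beta):
        denom = beta ** 2 * precision + recall
        return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1),
        "f2": f_beta(2),
    }


# Made-up labels: every positive is caught, at the cost of two false alarms
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 1, 1, 0, 0, 0]
demo_scores = scores_from_labels(y_true_demo, y_pred_demo)
```

In this toy case F2 comes out higher than F1 because recall is perfect while precision is not.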

In [28]:
# Cross validation F2 score (on the whole dataset; with the default threshold of 0.5)
cross_val_score(baseline, X, y, scoring=f2, cv=5).mean()

0.5086351820901933

In [29]:
# AUC
ytrue = ytest
yscore = baseline.predict_proba(Xtest)
roc_auc_score(ytrue, yscore)

0.7540569780021636
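
As a reminder of what this number means, the AUC can be read as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counting one half. The quadratic-time sketch below computes exactly that; the function name `pairwise_auc` is made up for this illustration.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc_demo = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because only the ranking of the scores matters, the AUC does not depend on the classification threshold.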

In [30]:
# Hyperparameter grid search: the best threshold is 0.125, which gives F2 = 0.71 on the test set
gs = GridSearchCV(baseline, {'threshold': np.linspace(0.05, 0.2, num=7)}, cv=5, scoring=f2).fit(Xtrain, ytrain)
print("threshold =", gs.best_estimator_.threshold)

ytrue = ytest
ypred = gs.best_estimator_.predict(Xtest)
classification_scores(ytrue, ypred)

threshold = 0.125


accuracy     0.48
precision    0.35
recall       0.95
f1           0.52
f2           0.71
dtype: float64

So the goal is to beat the baseline's F2 score of 0.71 on the test set (and possibly its AUC of 0.754).
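
The tuned baseline reaches its F2 by trading precision (0.35) and accuracy (0.48) for recall (0.95). The sketch below shows the mechanism on made-up scores: lowering the threshold flags more samples as positive, which F2 rewards as long as recall keeps rising. Unlike `GridSearchCV`, it picks the threshold on a single data set without cross-validation, and the helper names are hypothetical.

```python
def f2_score(y_true, y_pred):
    """F2 from confusion counts: 5*TP / (5*TP + 4*FN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0


def best_threshold(y_true, y_score, thresholds):
    """Return (threshold, F2) for the candidate with the highest F2."""
    scored = [(f2_score(y_true, [int(s >= th) for s in y_score]), th)
              for th in thresholds]
    best_f2, best_th = max(scored, key=lambda pair: pair[0])
    return best_th, best_f2


# Made-up labels and predicted probabilities
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_score_demo = [0.9, 0.4, 0.2, 0.6, 0.3, 0.1, 0.1, 0.05, 0.05, 0.05]

best_th, best_f2 = best_threshold(y_true_demo, y_score_demo, [0.5, 0.3, 0.15])
```

On this toy data the lowest candidate threshold wins because it recovers all three positives at the cost of two false positives.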

<br>

# 5. Data Preprocessing

### Note how the features are ordered in the original dataset

In [39]:
# View the attribute names from the info document
path = 'data/german.doc'

with open(path, mode='r') as file:
    text = file.read()
    
pattern = r"Attr?ibute (?P<attr>\d{1,2}):.+?\n\s+(?P<name>.+?)\n"

print('\033[91m{}\033[0m'.format("Column index,"),  "original feature name and", '\033[92m{}\033[0m'.format("my short name"), end="\n\n")

# make a mapping from the "handy" name to the actual column index
column_index = dict()

for m in re.finditer(pattern, text):
    attr_index = int(m['attr']) - 1          # zero-based column index
    original_name = m['name'].strip()
    # candidates: columns whose full name, or name minus its last letter, occurs in the original name
    possible_names = [s for s in df.columns for name_pattern in (fr"\b{s}\b", fr"\b{s[:-1]}")
                      if re.search(name_pattern, original_name, re.IGNORECASE)] or ['tenure']
    my_column_name = Counter(sorted(possible_names)).most_common(1)[0][0]
    print('\033[91m{}\033[0m'.format(attr_index), original_name, '\033[92m({})\033[0m'.format(my_column_name))
    column_index[my_column_name] = attr_index


Column index, original feature name and my short name

0 Status of existing checking account (status)
1 Duration in month (tenure)
2 Credit history (history)
3 Purpose (purpose)
4 Credit amount (amount)
5 Savings account/bonds (savings)
6 Present employment since (employment)
7 Installment rate in percentage of disposable income (rate)
8 Personal status and sex (personal)
9 Other debtors / guarantors (guarantor)
10 Present residence since (residence)
11 Property (property)
12 Age in years (age)
13 Other installment plans (installments)
14 Housing (housing)
15 Number of existing credits at this bank (credits)
16 Job (job)
17 Number of people being liable to pr

### Load the original dataset

In [47]:
# load as ndarray
X, y = load_data(mode='modeling', format='ndarray')

# load as df
features, labels = load_data(mode='modeling', format='dataframe')
features.tail(5)

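|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 995 | A14 | 12 | A32 | A42 | 1736 | A61 | A74 | 3 | A92 | A101 | 4 | A121 | 31 | A143 | A152 | 1 | A172 | 1 | A191 | A201 |
| 996 | A11 | 30 | A32 | A41 | 3857 | A61 | A73 | 4 | A91 | A101 | 4 | A122 | 40 | A143 | A152 | 1 | A174 | 1 | A192 | A201 |
| 997 | A14 | 12 | A32 | A43 | 804 | A61 | A75 | 4 | A93 | A101 | 4 | A123 | 38 | A143 | A152 | 1 | A173 | 1 | A191 | A201 |
| 998 | A11 | 45 | A32 | A43 | 1845 | A61 | A73 | 4 | A93 | A101 | 4 | A124 | 23 | A143 | A153 | 1 | A173 | 1 | A192 | A201 |
| 999 | A12 | 45 | A34 | A41 | 4576 | A62 | A71 | 3 | A93 | A101 | 4 | A123 | 27 | A143 | A152 | 1 | A173 | 1 | A191 | A201 |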


### Train Test Split (stratify=y)

In [33]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, stratify=y, test_size=0.2)
Xtest[:3]

array([['A14', 24, 'A32', 'A43', 1901, 'A62', 'A73', 4, 'A93', 'A101', 4,
        'A123', 29, 'A143', 'A151', 1, 'A174', 1, 'A192', 'A201'],
       ['A13', 6, 'A34', 'A40', 1343, 'A61', 'A75', 1, 'A93', 'A101', 4,
        'A121', 46, 'A143', 'A152', 2, 'A173', 2, 'A191', 'A202'],
       ['A12', 15, 'A30', 'A40', 1778, 'A61', 'A72', 2, 'A92', 'A101', 1,
        'A121', 26, 'A143', 'A151', 2, 'A171', 1, 'A191', 'A201']],
      dtype=object)
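
For readers unfamiliar with `stratify=y`: it keeps the class ratio (approximately) identical in the train and test parts. The dependency-free sketch below shows the idea by splitting each class separately; `stratified_split` is a hypothetical helper and the 70/30 label list is illustrative, not loaded from the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class ratios in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


demo_labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(demo_labels)
test_positive_rate = sum(demo_labels[i] for i in test_idx) / len(test_idx)
train_positive_rate = sum(demo_labels[i] for i in train_idx) / len(train_idx)
```

Note that the split in the cell above does not set `random_state`, so it changes between runs; passing `random_state`, as in the baseline split, would make it reproducible.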

### Determine the features which may be excluded from our model

According to our earlier statistical tests, the following features are uninformative and can be dropped: ['residence', 'job', 'credits', 'telephone', 'maintenance']

In [52]:
# sorted from bad to worst
weak_features = ['residence', 'job', 'credits', 'telephone', 'maintenance']
features_to_keep = sorted(set(range(X.shape[1])) - set(column_index[k] for k in weak_features))
features_to_keep

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 19]
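
The resulting index list can then be used to select columns. The sketch below repeats the set-difference step on a toy four-column example with plain lists; the mapping and rows are made up. With the ndarray `X` loaded above, the equivalent selection would be `X[:, features_to_keep]`.

```python
# Toy mapping and data (not the real column_index or dataset)
toy_column_index = {"status": 0, "tenure": 1, "residence": 2, "job": 3}
toy_weak_features = ["residence", "job"]
n_columns = 4

# Same set-difference logic as in the cell above
toy_features_to_keep = sorted(
    set(range(n_columns)) - {toy_column_index[k] for k in toy_weak_features}
)

toy_rows = [["A11", 6, 4, "A173"],
            ["A12", 48, 2, "A172"]]

# Keep only the selected columns in every row
toy_reduced = [[row[i] for i in toy_features_to_keep] for row in toy_rows]
```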


### Prepare the data




3. Encode (dummify) categorical features. Consider converting categorical features with too many categories to numeric ones.

4. Feature engineering, where appropriate:
    - Discretize continuous features.
    - Decompose features (e.g., categorical, date/time, etc.); here, sex could be split out of the combined "personal status and sex" feature.
    - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
    - Aggregate features into promising new features.

5. Feature scaling: standardize or normalize features.
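
Steps 3 to 5 can be sketched without any dependencies as follows. This is only an illustration on the five rows shown in the data preview above: the helper names are hypothetical, and the choice of a log transform for `amount` is an assumption. In the notebook itself, scikit-learn's `OneHotEncoder` and `StandardScaler` would be the natural tools.

```python
import math
from statistics import mean, pstdev


def one_hot(values):
    """Encode categories as 0/1 indicator vectors (categories in sorted order)."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]


# Values taken from the first five rows of the preview above
housing = ["ownership", "ownership", "ownership", "without payment", "without payment"]
amount = [1169, 5951, 2096, 7882, 4870]

categories, housing_encoded = one_hot(housing)   # step 3: encode
log_amount = [math.log(a) for a in amount]       # step 4: transform a skewed feature
amount_scaled = standardize(log_amount)          # step 5: scale
```

To avoid leakage, the categories, mean and standard deviation should be learned on the training split only and then reused on the test split.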


### Checklist:
https://github.com/ageron/handson-ml3/blob/main/ml-project-checklist.md

### TODO
Ensemble


