<a href="https://colab.research.google.com/github/mnijhuis-dnb/Artificial_Intelligence_and_Machine_Learning_for_SupTech/blob/main/Tutorials/Tutorial%203%20Data%20pre-processing%20and%20assessing%20model%20performance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Artificial Intelligence and Machine Learning for SupTech  
Tutorial 3: Data pre-processing and assessing model performance

*	How to pre-process: standardize your data
*	Pros and cons of standardization
*	Working with the confusion matrix
  *	What if costs are not symmetric?
  *	The trade-off between precision and recall


<br/>

14 March 2023  

**Instructors**  
Prof. Iman van Lelyveld (iman.van.lelyveld@vu.nl)<br/>
Dr. Michiel Nijhuis (m.nijhuis@dnb.nl)  

----

### Previous Tutorials
In this section we re-run some of the code from the first two tutorials to obtain a starting model. With these steps out of the way, we can focus on pre-processing and on evaluating the model further.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
!gdown 1-3c9BhPfl6D92HvTI4kNd0MfmTquiUwQ
!gdown 1-5ZzK3EAqc-i3AgnLOSZXTGGZsEPEmzH

In [None]:
path = 'credit_record.csv'
df_record = pd.read_csv(path)

In [None]:
path = 'application_record.csv'
df_applications = pd.read_csv(path)

In [None]:
df_record.loc[:,'status'] = df_record.loc[:,'STATUS']
df_record.loc[:,'status'] = df_record.loc[:,'status'].replace('X', '0')
df_record.loc[:,'status'] = df_record.loc[:,'status'].replace('C', '0')

In [None]:
df_record.loc[:,'status'] = pd.to_numeric(df_record.loc[:,'status'])

In [None]:
sr_defaults = df_record.groupby('ID')['status'].agg(lambda x: sum(x>2)>0)

In [None]:
df_applications = df_applications.drop_duplicates(subset='ID')

In [None]:
df_applications = df_applications.set_index('ID')

In [None]:
df_applications = df_applications.dropna()

In [None]:
obj_cols = df_applications.select_dtypes(include=['object']).columns.tolist()
dummies_list = [pd.get_dummies(df_applications[col], prefix=col, drop_first=True) for col in obj_cols]
df_applications = pd.concat([df_applications.drop(columns=obj_cols)] + dummies_list, axis=1)

In [None]:
df_data = df_applications.merge(sr_defaults, how='inner', left_index=True, right_on='ID')

In [None]:
df_data= df_data.rename(columns={'status':'DEFAULTED'}).dropna()

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC(C=1.0, 
          kernel='rbf', 
          degree=3, 
          gamma='scale', 
          coef0=0.0, 
          shrinking=True, 
          probability=True,  # needed so that predict_proba can be called later on
          tol=0.1, 
          cache_size=200, 
          class_weight=None, 
          verbose=False, 
          max_iter=5, 
          decision_function_shape='ovr', 
          break_ties=False, 
          random_state=43)

In [None]:
X = df_data.drop(columns='DEFAULTED')
y = df_data['DEFAULTED']

In [None]:
clf = clf.fit(X.iloc[:10000], y.iloc[:10000])

In [None]:
y_model = clf.predict(X)

### Tutorial 3

Evaluate the performance based on the confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(y, y_model)
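
In scikit-learn's convention the rows of the matrix are the true classes and the columns the predicted classes, so for a binary problem `ravel()` returns TN, FP, FN, TP in that order. The sketch below uses made-up labels and hypothetical cost figures (neither comes from the credit data) to show how precision, recall and an asymmetric cost follow from the four counts.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, purely illustrative (not taken from the credit data)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are real positives
recall = tp / (tp + fn)     # share of real positives that are caught

# Hypothetical asymmetric costs: a missed default hurts more than a false alarm
cost_fn, cost_fp = 10, 1
total_cost = cost_fn * fn + cost_fp * fp

print(tn, fp, fn, tp, precision, recall, total_cost)
```

With asymmetric costs, two models with the same accuracy can have very different total costs, which is why the individual cells of the matrix matter.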

Instead of the hard predictions, we can also get the scores the SVM produces (here, the estimated class probabilities):

In [None]:
clf.predict_proba(X)

What was the original prediction threshold?

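The original threshold is 0.5 on the predicted probability of default. Strictly speaking, `SVC.predict` thresholds the decision function at 0, so its labels can differ slightly from thresholding `predict_proba` at 0.5.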

In [None]:
# Column 1 holds the probability of the positive class (DEFAULTED == True)
(clf.predict_proba(X)[:, 1] > 0.5) == y
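
For an SVM, the raw score comes from `decision_function`, and `predict` corresponds to thresholding that score at 0. A minimal sketch on synthetic data (`X_toy`, `y_toy` and the -0.5 cut-off are illustrative assumptions) shows this equivalence and what moving the threshold does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
svm = SVC(kernel='rbf', gamma='scale', random_state=0).fit(X_toy, y_toy)

scores = svm.decision_function(X_toy)  # one signed score per sample
default_pred = svm.predict(X_toy)

# predict() corresponds to thresholding the decision function at 0
same_as_default = np.array_equal(default_pred, (scores > 0).astype(int))

# Lowering the threshold flags more observations as positive
lenient_pred = (scores > -0.5).astype(int)
print(same_as_default, default_pred.sum(), lenient_pred.sum())
```

A lower threshold can only add positives, so recall cannot go down, while precision typically suffers.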

Imagine we value recall twice as much as precision. Can you adjust the decision threshold to get the optimal prediction?
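
One common way to encode "recall counts twice as much as precision" is the F-beta score with beta = 2. The sketch below is one possible approach to the exercise, not the tutorial's official solution. It uses synthetic labels and scores (`y_true` and `scores` are made up) and picks the threshold that maximizes F2 along the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic, imbalanced labels and noisy scores (illustrative only)
y_true = (rng.random(2000) < 0.1).astype(int)
scores = np.clip(0.3 * y_true + rng.normal(0.3, 0.15, size=2000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

beta = 2  # beta > 1 weights recall more heavily than precision
# The last (precision, recall) point has no matching threshold, so drop it
p, r = precision[:-1], recall[:-1]
denom = beta**2 * p + r
f_beta = np.divide((1 + beta**2) * p * r, denom,
                   out=np.zeros_like(denom), where=denom > 0)

best = np.argmax(f_beta)
best_threshold = thresholds[best]

# Apply the chosen threshold and cross-check against scikit-learn's fbeta_score
y_pred = (scores >= best_threshold).astype(int)
print(best_threshold, p[best], r[best], fbeta_score(y_true, y_pred, beta=beta))
```

On the credit data, `scores` would be replaced by the model's probabilities for the positive class.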

An easier way to evaluate the performance is with the precision-recall curve. We can use the following function for that:

In [None]:
from sklearn.metrics import precision_recall_curve

Can you make a precision-recall curve?

In [None]:
precision, recall, thresholds = precision_recall_curve(y, clf.predict_proba(X)[:, 1])  # positive-class scores
plt.plot(recall, precision)

Let's have a look at the distribution of one of the variables

In [None]:
df_data['DAYS_EMPLOYED'].plot.hist(bins=30)

As you can see, the distribution of the data is not ideal for most machine learning algorithms. Can you improve the prediction of defaults by transforming the variable?

In [None]:
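from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()  # Yeo-Johnson by default, so negative values are fine
# fit_transform expects 2D input, hence the double brackets
df_data['DAYS_EMPLOYED'] = pt.fit_transform(df_data[['DAYS_EMPLOYED']]).ravel()
# Rebuild the features and refit, otherwise the curve below would not change
X = df_data.drop(columns='DEFAULTED')
clf = clf.fit(X.iloc[:10000], y.iloc[:10000])
precision_recall_curve(y, clf.predict_proba(X)[:, 1])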
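
To see what the power transform does in isolation, the following sketch applies it to a synthetic, skewed, negative-valued column (a made-up stand-in, not the real `DAYS_EMPLOYED` values) and compares the skewness before and after.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
# Synthetic left-skewed column with only negative values (illustrative only)
days = -rng.lognormal(mean=3.0, sigma=0.8, size=5000)

pt = PowerTransformer(method='yeo-johnson')  # the default method; accepts negative values
days_t = pt.fit_transform(days.reshape(-1, 1)).ravel()  # transformers expect 2D input

print(skew(days), skew(days_t))
# standardize=True is the default, so the output has mean ~0 and standard deviation ~1
print(days_t.mean(), days_t.std())
```

Because the transformer also standardizes by default, the transformed column ends up on a scale comparable to other standardized features.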

Another point about the DAYS_EMPLOYED variable is that its range is between -16000 and 0, which is much wider than that of most other variables. Can you scale this and the other variables to a better range?

In [None]:
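from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
# Scale only the features and keep a DataFrame; the target is left untouched
X = pd.DataFrame(mms.fit_transform(X), columns=X.columns, index=X.index)
clf = clf.fit(X.iloc[:10000], y.iloc[:10000])  # refit on the scaled features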

Does this lead to better model performance?

In [None]:
precision_recall_curve(y, clf.predict_proba(X)[:, 1])
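
A caveat on the evaluation: fitting the scaler on all rows and scoring on rows the model was trained on gives an optimistic picture. A possible alternative, sketched here on synthetic data (`X_toy`, `y_toy`, the split size and the inflated first feature are all illustrative assumptions), is to put the scaler and the SVM in one pipeline and score on held-out rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data; one feature is blown up to mimic a column with a much wider range
X_toy, y_toy = make_classification(n_samples=1000, n_features=6,
                                   weights=[0.8], random_state=0)
X_toy[:, 0] *= 10_000

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

# The pipeline fits the scaler on the training rows only, then reuses it on new data
model = make_pipeline(MinMaxScaler(), SVC(kernel='rbf', gamma='scale', random_state=0))
model.fit(X_train, y_train)

test_scores = model.decision_function(X_test)
ap = average_precision_score(y_test, test_scores)  # one-number summary of the PR curve
print(ap)

# The training features now lie in [0, 1]
X_train_scaled = model.named_steps['minmaxscaler'].transform(X_train)
print(X_train_scaled.min(), X_train_scaled.max())
```

Held-out values can fall slightly outside [0, 1] after scaling, since the minimum and maximum are learned from the training rows only.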