<a href="https://colab.research.google.com/github/mnijhuis-dnb/Artificial_Intelligence_and_Machine_Learning_for_SupTech/blob/main/Tutorials/Tutorial%202%20Regressions%20versus%20Classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DNB Academie: Machine Learning – Tools and applications for policy 
Tutorial 2: Regressions versus Classifiers
*	Logit as a statistical model vs ML model
*	How to find the optimal (hyper)parameters
*	A different classifier: Support vector machines 
  *	Different types of kernels
  *	First glimpse: Dangers of overfitting
  *	Evaluating performance

<br/>

15 & 22 Jan 2024  

**Instructors**  
Prof. Iman van Lelyveld (iman.van.lelyveld@vu.nl)<br/>
Dr. Michiel Nijhuis (m.nijhuis@dnb.nl)  

----

# Preparation

At the beginning of each notebook, we have a short preparation section. This section does three things. First, it loads all the necessary packages, downloading and installing them where needed. Second, it downloads and extracts the data we are going to use during the tutorial. Third, it runs most of the code from the previous notebook so we can continue working with the data.

In [None]:
in_colab = False
if 'google.colab' in str(get_ipython()):
  in_colab = True

In [None]:
!pip install gdown==4.6.0

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
!gdown 1-3c9BhPfl6D92HvTI4kNd0MfmTquiUwQ
!gdown 1-5ZzK3EAqc-i3AgnLOSZXTGGZsEPEmzH

### Tutorial 1
In this section we re-run most of the code from Tutorial 1. This sets up the data so we can use it in this tutorial.

In [None]:
if in_colab:
    df_record = pd.read_csv('/content/credit_record.csv')
    df_applications = pd.read_csv('/content/application_record.csv')
else:
    df_record = pd.read_csv('credit_record.csv')
    df_applications = pd.read_csv('application_record.csv')
    

In [None]:
df_record.loc[:,'status'] = df_record.loc[:,'STATUS']
df_record.loc[:,'status'] = df_record.loc[:,'status'].replace('X', '0')
df_record.loc[:,'status'] = df_record.loc[:,'status'].replace('C', '0')

In [None]:
df_record.loc[:,'status'] = pd.to_numeric(df_record.loc[:,'status'])

In [None]:
sr_defaults = df_record.groupby('ID')['status'].agg(lambda x: sum(x>=2)>0)

In [None]:
df_applications = df_applications.drop_duplicates(subset='ID')

In [None]:
df_applications = df_applications.set_index('ID')

In [None]:
df_applications = df_applications.dropna()

In [None]:
obj_cols = df_applications.select_dtypes(include=['object']).columns.tolist()
dummies_list = [pd.get_dummies(df_applications[col], prefix=col, drop_first=True) for col in obj_cols]
df_applications = pd.concat([df_applications.drop(columns=obj_cols)] + dummies_list, axis=1)

In [None]:
df_data = df_applications.merge(sr_defaults, how='inner', left_index=True, right_on='ID')

In [None]:
df_data= df_data.rename(columns={'status':'DEFAULTED'}).dropna()

## Data analysis
Before we build our model, we first analyse the data to get a better understanding of it.

In [None]:
df_corr = df_data.corr()
df_corr

In [None]:
df_corr.loc['DEFAULTED'].sort_values(ascending=False)
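Sorting the raw correlations puts strongly negative values at the bottom of the list, even though they can be just as informative as strongly positive ones. The following is a minimal sketch of ranking by absolute correlation instead. The frame `df_toy` and its values are made up for illustration and are not part of the credit data.

```python
import pandas as pd

# Hypothetical toy data, only meant to illustrate the ranking
df_toy = pd.DataFrame({
    'income': [10, 20, 30, 40, 50, 60],
    'age': [25, 32, 47, 51, 38, 29],
    'DEFAULTED': [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target, without the target itself
corr_with_target = df_toy.corr()['DEFAULTED'].drop('DEFAULTED')

# Rank by strength of the relation, regardless of its sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

The same two lines can be applied to `df_data` to shortlist candidate columns for the regression.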

# Linear Regression
Now we can apply our linear regression model, both from scikit-learn (which takes more of a machine learning view) and from statsmodels (which takes more of an econometrics view).

In [None]:
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

Select the columns that you want to use for the regression

In [None]:
exogenous_collumns = []

Get the data ready for the regression

In [None]:
sr_endog = df_data.loc[:,'DEFAULTED'].astype(float)

df_exogs = df_data[exogenous_collumns]
df_exogs = sm.add_constant(df_exogs)
df_exogs.head()

## Econometric view: with `statsmodels`

In [None]:
linreg_sm = sm.OLS(
    endog=sr_endog,
    exog=df_exogs,
).fit()

print(linreg_sm.summary())

## Machine learning: with `scikit-learn`

In [None]:
X = df_exogs.values
y = sr_endog.values

In [None]:
linreg_ml = LinearRegression(fit_intercept=False)
linreg_ml.fit(X, y)

In [None]:
linreg_ml.coef_

In [None]:
linreg_ml.score(X, y)
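The scikit-learn model above is created with `fit_intercept=False` because `sm.add_constant` has already added a column of ones to the data. As a rough check of that reasoning, the sketch below fits both variants on synthetic data (the names `features` and `target` are illustrative assumptions) and compares the estimates. For a regressor, `score` returns the $R^2$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known intercept (0.5) and slopes (1.5 and -2.0)
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
target = (0.5 + 1.5 * features[:, 0] - 2.0 * features[:, 1]
          + rng.normal(scale=0.1, size=200))

# Option 1: add the constant column ourselves and switch off the intercept
features_with_const = np.column_stack([np.ones(len(features)), features])
model_manual = LinearRegression(fit_intercept=False).fit(features_with_const, target)

# Option 2: let scikit-learn estimate the intercept
model_auto = LinearRegression(fit_intercept=True).fit(features, target)

print('manual constant:', model_manual.coef_)
print('fit_intercept  :', model_auto.intercept_, model_auto.coef_)
print('R^2            :', model_auto.score(features, target))
```

In the first option the intercept appears as the first entry of `coef_`; in the second it is stored separately in `intercept_`.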

# Classification

Until now we have used linear regressions to predict the outcome. The results are very poor. With an $R^2$ of 3-4%, the model explains almost none of the variation in the outcome, so there is little hope that it will perform well, especially if we are concerned about "external validity".

This is not surprising since we do not have a regression problem. Instead, the outcome is binary. We are not so much interested in a trend or "regressing toward the mean" as in assigning each observation to one of two classes. A better approach is classification, so we can use logistic regression.
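To make the difference concrete, the sketch below fits a linear and a logistic regression on a synthetic binary outcome (all names and the data-generating process are assumptions for illustration). The linear model is free to predict values below 0 or above 1, whereas the logistic model returns probabilities.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic binary outcome whose probability increases with x
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(500, 1))
p_true = 1 / (1 + np.exp(-3 * x_demo[:, 0]))
y_demo = (rng.uniform(size=500) < p_true).astype(int)

lin = LinearRegression().fit(x_demo, y_demo)
logit = LogisticRegression().fit(x_demo, y_demo)

# Evaluate both models on a grid that includes extreme values of x
x_grid = np.linspace(-4, 4, 9).reshape(-1, 1)
lin_pred = lin.predict(x_grid)
logit_prob = logit.predict_proba(x_grid)[:, 1]

print('linear  :', lin_pred.round(2))
print('logistic:', logit_prob.round(2))
```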

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
mdl_logit = LogisticRegression(fit_intercept=True)

Fit the logistic regression

Make a prediction for all the data

Assess the performance of the logit model
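The three steps above (fit, predict, assess) can be sketched on synthetic data as follows. This is not the solution for the credit data: `X_demo`, `y_demo` and `mdl_demo` are stand-ins generated with `make_classification`, with roughly 10% positives to mimic a rare outcome such as a default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic, imbalanced stand-in for the credit data
X_demo, y_demo = make_classification(n_samples=1000, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     weights=[0.9, 0.1], random_state=0)

# Step 1: fit the logistic regression
mdl_demo = LogisticRegression(fit_intercept=True, max_iter=1000)
mdl_demo.fit(X_demo, y_demo)

# Step 2: predict the class for all observations
y_pred_demo = mdl_demo.predict(X_demo)

# Step 3: assess the performance
accuracy = accuracy_score(y_demo, y_pred_demo)
cm = confusion_matrix(y_demo, y_pred_demo)  # rows: actual, columns: predicted
print(f'Accuracy: {accuracy:.3f}')
print(cm)
```

With a rare positive class, accuracy alone can look high even when few positives are found, which is why the confusion matrix is printed next to it.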

### Support Vector Machine
The next step is to use an SVM to predict the class

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC(C=1.0, 
          kernel='rbf', 
          degree=3, 
          gamma='scale', 
          coef0=0.0, 
          shrinking=True, 
          probability=False, 
          tol=0.1, 
          cache_size=200, 
          class_weight=None, 
          verbose=False, 
          max_iter=20, 
          decision_function_shape='ovr', 
          break_ties=False, 
          random_state=43)

Fit the data to the model

Make a prediction

In [None]:
y_model = clf.predict(X)
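With `max_iter=20` the solver may stop before it has converged, in which case scikit-learn emits a `ConvergenceWarning`. SVMs are also sensitive to the scale of the features. A common remedy, sketched here on synthetic data as an assumption rather than as part of the tutorial's own pipeline, is to standardise the features before fitting.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is blown up to mimic an income-like scale
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
X_demo[:, 0] *= 10_000

# Standardise every feature, then fit the SVC without an iteration cap
svm_demo = make_pipeline(StandardScaler(),
                         SVC(kernel='rbf', C=1.0, gamma='scale'))
svm_demo.fit(X_demo, y_demo)

y_model_demo = svm_demo.predict(X_demo)
train_accuracy = svm_demo.score(X_demo, y_demo)
print(f'Training accuracy: {train_accuracy:.3f}')
```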

Calculate the recall and the precision

In [None]:
from sklearn.metrics import recall_score, precision_score

Calculate the precision yourself

In [None]:
recall = recall_score(y,y_model)

print(f'The recall is:{recall} \t The precision is: ')
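Both metrics follow directly from the counts of true positives (TP), false positives (FP) and false negatives (FN): precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes them by hand on a small made-up example (`y_true_demo` and `y_pred_demo` are illustrative) and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predictions for illustration
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 0, 0])

tp = np.sum((y_true_demo == 1) & (y_pred_demo == 1))  # correctly flagged
fp = np.sum((y_true_demo == 0) & (y_pred_demo == 1))  # flagged by mistake
fn = np.sum((y_true_demo == 1) & (y_pred_demo == 0))  # missed positives

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print(f'manual : precision={precision_manual:.3f} recall={recall_manual:.3f}')
print(f'sklearn: precision={precision_score(y_true_demo, y_pred_demo):.3f} '
      f'recall={recall_score(y_true_demo, y_pred_demo):.3f}')
```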

The following code defines a function to plot the decision boundary of an SVC

In [None]:
import matplotlib as mpl

def plot_decision_boundary(model: SVC, 
                           data_x: pd.DataFrame, 
                           data_y: pd.Series) -> None:
    """
    Plot the decision boundary of a classifier using the given data.

    Parameters
    ----------
    model : sklearn.svm.SVC
        The classifier to use for generating the decision boundary.
    data_x : pandas.DataFrame
        The input data to use for generating the decision boundary.
    data_y : pandas.Series
        The target values to use for generating the decision boundary.

    Returns
    -------
    None
        This function does not return anything, but it generates a plot of the decision boundary.
    """


    def make_meshgrid(x_in: pd.Series, 
                      y_in: pd.Series) -> tuple[np.ndarray, np.ndarray]:
        """
        Generate a meshgrid based on the given x and y series.

        Parameters
        ----------
        x_in : pandas.Series
            Input x values.
        y_in : pandas.Series
            Input y values.

        Returns
        -------
        Tuple[np.ndarray, np.ndarray]
            A tuple of two NumPy arrays representing the meshgrid generated from x and y values.
            The first array represents the x-coordinates and the second array represents the y-coordinates.
        """
        x_min, x_max = x_in.min() - 1, x_in.max() + 1
        y_min, y_max = y_in.min() - 1, y_in.max() + 1
        x, y = np.meshgrid(np.arange(x_min, x_max, (x_max - x_min)/100), 
                           np.arange(y_min, y_max, (y_max - y_min)/100))
        
        return x, y

    def plot_contours(ax: mpl.axes.Axes, 
                      clf: SVC, 
                      xx: np.ndarray, 
                      yy: np.ndarray, 
                      **kwargs) -> mpl.contour.QuadContourSet:
        """
        Plot the decision boundaries for a classifier.

        Parameters
        ----------
        ax : matplotlib.axes.Axes
            The axes on which to plot the decision boundaries.
        clf : sklearn.svm.SVC
            The classifier to use for generating the decision boundaries.
        xx : numpy.ndarray
            The x values for the meshgrid used to generate the decision boundaries.
        yy : numpy.ndarray
            The y values for the meshgrid used to generate the decision boundaries.
        **kwargs : Dict[str,Any]
            Additional keyword arguments to pass to the `contourf` method.

        Returns
        -------
        matplotlib.contour.QuadContourSet
            The contour set returned by the `contourf` method.
        """
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        out = ax.contourf(xx, yy, Z, **kwargs)
        return out

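    fig, ax = plt.subplots()
    # Positional indexing, so any two-column DataFrame can be passed in
    x, y = make_meshgrid(data_x.iloc[:, 0], data_x.iloc[:, 1])

    ax.scatter(data_x.iloc[:, 0], 
               data_x.iloc[:, 1], 
               c=data_y, 
               cmap=plt.cm.coolwarm, 
               s=20)
    plot_contours(ax, 
                  model, 
                  x, 
                  y, 
                  cmap=plt.cm.coolwarm, 
                  alpha=0.4)
    # The first column is on the horizontal axis, the second on the vertical axis
    ax.set_xlabel(data_x.columns[0])
    ax.set_ylabel(data_x.columns[1])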
    ax.set_xticks(())
    ax.set_yticks(())
    ax.set_title('Decision boundary SVC ')

    return None

Redo the prediction with just two variables, so we can plot the decision boundary and see the effect of different kernels. You can use the following kernels: 'linear', 'poly', 'rbf', 'sigmoid'

In [None]:
clf = SVC(kernel='linear', 
          degree=3, 
          gamma='scale', 
          coef0=0.0, 
          shrinking=True, 
          probability=False, 
          tol=0.1, 
          cache_size=200, 
          class_weight=None, 
          verbose=False, 
          max_iter=20, 
          decision_function_shape='ovr', 
          break_ties=False, 
          random_state=43)

Select the two columns which you will use for the model and fit the model

In [None]:
two_selected_columns = []

# X is a NumPy array without column names, so select the columns from the DataFrame
clf = clf.fit(df_exogs[two_selected_columns], y)

Plot the decision boundary

In [None]:
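# Pass the same two columns (as a DataFrame) that the model was fitted on
plot_decision_boundary(clf, df_exogs[two_selected_columns], y)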

Adjust the kernel and see the effects
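One way to compare kernels side by side is to loop over them and record a score for each. The sketch below does this on the synthetic `make_moons` data set, which is an assumption chosen because its classes cannot be separated by a straight line; it is not the credit data, and the scores are training accuracies only.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: a non-linear two-feature problem
X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=0)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf_demo = SVC(kernel=kernel, gamma='scale', random_state=43)
    clf_demo.fit(X_moons, y_moons)
    # Accuracy on the data the model was trained on
    kernel_scores[kernel] = clf_demo.score(X_moons, y_moons)

for kernel, score in kernel_scores.items():
    print(f'{kernel:>8}: {score:.3f}')
```

On data like this a flexible kernel such as 'rbf' typically fits the curved boundary better than 'linear', but which kernel works best depends on the data.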

Change the model to get a better prediction

In [None]:
clf = SVC(C=1.0, 
          kernel='poly', 
          degree=30, 
          gamma='scale', 
          coef0=0.0, 
          shrinking=True, 
          probability=False, 
          tol=0.0001, 
          cache_size=200, 
          class_weight=None, 
          verbose=False, 
          max_iter=-1, 
          decision_function_shape='ovr', 
          break_ties=False, 
          random_state=43)

See how the prediction has changed

Do you think this is a good prediction?
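A very flexible model can fit the data it was trained on almost perfectly and still predict poorly on new data. One way to see this is to hold back part of the data and compare the scores. The sketch below uses synthetic data and an RBF kernel with a large `gamma` as the overly flexible model (an assumption swapped in for the degree-30 polynomial above, because it is quicker to fit).

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy synthetic data, split into a training half and a test half
X_moons, y_moons = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_moons, y_moons, test_size=0.5, random_state=0)

results = {}
for label, gamma in [('moderate', 1.0), ('very flexible', 100.0)]:
    model = SVC(kernel='rbf', C=100.0, gamma=gamma).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each setting
    results[label] = (model.score(X_train, y_train),
                      model.score(X_test, y_test))

for label, (train_acc, test_acc) in results.items():
    print(f'{label:>13}: train={train_acc:.3f} test={test_acc:.3f}')
```

A large gap between the training and the test score is a warning sign of overfitting.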

Can you select better parameters for the model?
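Rather than trying parameters by hand, a grid search with cross-validation can evaluate a set of candidates systematically. The following is a minimal sketch on synthetic data. The grid values and the choice of recall as the scoring metric are assumptions for illustration, not recommended settings for the credit data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=300, noise=0.3, random_state=0)

# Candidate values for the regularisation strength C and the kernel width gamma
param_grid = {
    'C': [0.1, 1.0, 10.0],
    'gamma': ['scale', 0.1, 1.0],
}

# Every combination is scored with 5-fold cross-validation
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='recall')
search.fit(X_moons, y_moons)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated recall: {search.best_score_:.3f}')
```

Because each candidate is scored on folds it was not trained on, the selected parameters are less likely to simply reward overfitting.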