# Credit risk prediction with Scikit-learn and custom library

This notebook uses the Credit risk dataset (https://raw.githubusercontent.com/leonardofurnielis/wml-toolkit/master/datasets/credit_risk_dataset.csv).

The notebook trains, evaluates, and saves a credit risk model.

### Contents

1. [Explore and prepare training data](#explore_prepare_data)
1. [Create train and test dataset](#train_test_set)
1. [Train the model](#train_model)
1. [Save the model](#save_model)

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import json

<a id="explore_prepare_data"></a>
## 1. Explore and prepare training data

### 1.1 Importing training data

NOTE: When running locally, the data is read from the `../data` directory.

In [None]:
df_training = pd.read_csv('../data/credit_risk_training.csv')
df_training.head()

### 1.2 Exploring and preparing data

In [None]:
df_training.describe()

In [None]:
ax = sns.countplot(x="Risk", data=df_training)
plt.title("Risk label distribution")
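
The count plot shows class balance visually; the same information can also be read as numbers. The following is a minimal sketch, assuming a small hypothetical frame `toy` with a `Risk` column. The label values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_training; values are illustrative only.
toy = pd.DataFrame({"Risk": ["No Risk", "Risk", "No Risk", "No Risk"]})

# Absolute counts and relative frequencies per class.
counts = toy["Risk"].value_counts()
shares = toy["Risk"].value_counts(normalize=True)
print(counts)
print(shares)
```

Running the same two calls on `df_training["Risk"]` would give the figures behind the plot above. A strongly skewed split is worth keeping in mind when reading the accuracy score later.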

<a id="train_test_set"></a>
## 2. Create train and test dataset

NOTE: Training dataset

In [None]:
Y = df_training['Risk']
df_training = df_training.drop(['Risk'], axis=1)
df_training.head()

NOTE: Test dataset

In [None]:
df_test = pd.read_csv('../data/credit_risk_test.csv')

# Separate the label column from the features
Y_test = df_test['Risk']
df_test = df_test.drop(['Risk'], axis=1)

df_test.head()

<a id="train_model"></a>
## 3. Train the model

Create a scikit-learn Pipeline containing:

1. one-hot encoding of the input features
1. model training (logistic regression)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [None]:
# `sparse_output` needs scikit-learn >= 1.2; older releases name this parameter `sparse`.
pipeline = Pipeline([('hotencoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
                     ('lr', LogisticRegression())])
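
To make the `OneHotEncoder` step more concrete, here is a small sketch of what `handle_unknown='ignore'` does. The toy matrix and its values (`car`, `furniture`, `boat`) are hypothetical and only mimic the mixed object array produced by `df_training.values`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy feature matrix: one text column, one numeric column.
X_train = np.array([["car", 12], ["furniture", 24], ["car", 24]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# Each distinct value seen during fit gets its own output column.
print(encoder.categories_)

# A row made only of known values sets one indicator per input column.
known = encoder.transform(np.array([["car", 24]], dtype=object)).toarray()
print(known)

# "boat" and 36 never appeared during fit, so their indicator blocks stay all zero.
unseen = encoder.transform(np.array([["boat", 36]], dtype=object)).toarray()
print(unseen)
```

Because the encoder is applied to every column, any numeric columns are treated as categories as well, so a numeric value that was not present in the training data contributes nothing to the prediction.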

In [None]:
risk_model = pipeline.fit(df_training.values, Y)

### 3.1 Model evaluation

In [None]:
risk_model_predicted = risk_model.predict(df_test.values)

In [None]:
print(metrics.accuracy_score(Y_test, risk_model_predicted))

In [None]:
print(metrics.classification_report(Y_test, risk_model_predicted))

In [None]:
risk_model_conf_matrix = metrics.confusion_matrix(Y_test, risk_model_predicted)
sns.heatmap(risk_model_conf_matrix, annot=True, fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion matrix, Logistic Regression');
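
The confusion matrix and the accuracy score are two views of the same predictions. The sketch below, using hypothetical labels and predictions rather than the notebook's results, shows how accuracy can be recovered from the matrix.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions, for illustration only.
y_true = ["No Risk", "No Risk", "Risk", "Risk", "No Risk"]
y_pred = ["No Risk", "Risk", "Risk", "No Risk", "No Risk"]

# Rows are true classes, columns are predicted classes, in sorted label order.
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total count.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm, metrics.accuracy_score(y_true, y_pred))
```

The off-diagonal cells separate the two kinds of error, which the single accuracy number hides.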

In [None]:
# True label of the second test row (positional index, matching X_test[1] below)
Y_test.iloc[1]

In [None]:
X_test = df_test.values.tolist()
print(X_test[1])

In [None]:
risk_model_predicted = risk_model.predict([X_test[1]])

In [None]:
risk_model_predicted[0]

<a id="save_model"></a>
## 4. Save the model

In [None]:
import pickle
# Use a context manager so the file is closed after writing
with open('../models/credit_risk_model.pickle', 'wb') as f:
    pickle.dump(risk_model, f)
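
A saved pipeline can be checked by loading it back and comparing predictions. The following is a minimal round-trip sketch under stated assumptions: the toy data, the temporary file path, and the names `model` and `restored` are hypothetical stand-ins for `risk_model` and `../models/credit_risk_model.pickle`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the credit risk features and labels.
X = np.array([["car", "own"], ["furniture", "rent"],
              ["car", "rent"], ["furniture", "own"]], dtype=object)
y = np.array(["Risk", "No Risk", "Risk", "No Risk"])

model = Pipeline([
    ("hotencoder", OneHotEncoder(handle_unknown="ignore")),
    ("lr", LogisticRegression()),
]).fit(X, y)

# Save to a temporary file (stand-in for the notebook's model path).
path = os.path.join(tempfile.mkdtemp(), "toy_model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the pipeline back and check that it predicts the same as the original.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Pickle files should only be loaded from trusted sources, and it is safest to load the model with the same scikit-learn version that was used to save it.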