# Innovate Data Academy
[Krisolis](http://www.krisolis.ie)

## Workshop 1: Simple Predictive Models in Python

In [None]:
# General data handling
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 1000) 
pd.set_option('display.max_colwidth', 200)
import numpy as np

# Drawing plots
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Machine learning with scikit-learn
import sklearn
import sklearn.impute
import sklearn.model_selection
import sklearn.metrics
import sklearn.tree

### Introduction
Credit scoring is one of the most established uses of machine learning and predictive modeling in finance. By recognizing the patterns that precede a borrower running into financial distress, banks can act early to avoid its negative impacts. In this workshop you will use a dataset of past borrowers to build a model that predicts the likelihood that a borrower will experience financial distress in the next two years.

The descriptive features available to describe borrowers are:

- **Age**: The age of the borrower in years.
- **CustomerLifeTime**: How long the borrower has been a customer of the bank. 
- **MonthlyIncome**: The borrower's monthly gross income.
- **NumberOfDependents**: The number of dependents (e.g. spouse or children) in the borrower's family, excluding the borrower.
- **NumberOfTime30_59DaysLateNotWorse**: Number of times borrower has previously been 30-59 days past due, but no worse, in the last 2 years.
- **NumberOfTime60_89DaysLateNotWorse**: Number of times borrower has previously been 60-89 days past due, but no worse, in the last 2 years.
- **NumberOfTimes90DaysLate**: The number of times the borrower has previously been 90 days or more past due.
- **DebtRatio**: Monthly debt payments plus other living expenses paid by the borrower, divided by their monthly gross income.
- **NumberOfOpenCreditLinesAndLoans**: The number of existing loans (e.g. car loans or mortgages) and other lines of credit (e.g. credit cards) that the borrower currently holds. 
- **NumberRealEstateLoansOrLines**: Number of mortgage and other real estate loans held by the borrower.
- **UtilizationOfUnsecuredLinesTotal**: Total balance the borrower owes on credit cards and other short term loans, divided by the sum of their credit limits.
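
The two ratio features in the list above can be illustrated with a little arithmetic. The sketch below uses made-up figures for a single hypothetical borrower; none of the numbers or variable names come from the workshop dataset.

```python
# Hypothetical borrower figures, chosen only to illustrate the two ratios.
monthly_debt_payments = 900.0
other_living_expenses = 600.0
monthly_gross_income = 5000.0

# DebtRatio: (debt payments + other living expenses) / monthly gross income.
debt_ratio = (monthly_debt_payments + other_living_expenses) / monthly_gross_income

total_unsecured_balance = 2500.0   # owed on credit cards and short term loans
total_credit_limit = 10000.0       # sum of the limits on those lines

# UtilizationOfUnsecuredLinesTotal: balance owed / sum of credit limits.
utilization = total_unsecured_balance / total_credit_limit

print(f"DebtRatio: {debt_ratio:.2f}")
print(f"UtilizationOfUnsecuredLinesTotal: {utilization:.2f}")
```

A value above 1 for either ratio would mean the borrower's outgoings exceed their income, or that they owe more than their combined credit limits.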

The target feature to predict is: 

- **SeriousDlqin2yrs**: Whether the borrower was 90 days or more past due on a payment in the following 2 years (0 = No, 1 = Yes).


### Task 1
Load the dataset from the file **credit_scoring_bal.csv** into a pandas DataFrame called `dataset`. View its shape, the column headings, and the first and last few rows.

In [None]:
target_feature_name = 'SeriousDlqin2yrs'

# Add code here
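
As a rough guide, loading and inspecting a CSV with pandas might look like the sketch below. So that it runs without the workshop file, it reads a tiny made-up CSV from an in-memory string; the rows and the reduced set of columns are assumptions, and in the workshop the call would be `pd.read_csv('credit_scoring_bal.csv')`.

```python
import io
import pandas as pd

# Made-up stand-in for credit_scoring_bal.csv (an assumption for illustration).
csv_text = """Age,MonthlyIncome,DebtRatio,SeriousDlqin2yrs
45,5200,0.35,0
31,2800,0.80,1
58,7400,0.12,0
27,1900,1.10,1
39,4100,0.45,0
62,6600,0.20,0
"""

# In the workshop this would be: dataset = pd.read_csv('credit_scoring_bal.csv')
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)               # (number of rows, number of columns)
print(dataset.columns.to_list())   # column headings
print(dataset.head(3))             # first few rows
print(dataset.tail(3))             # last few rows
```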

### Task 2
Extract the descriptive features into a DataFrame named `X` and the target feature into a Series named `y`.

In [None]:
# Add code here


Examine the shape, column headings, and the first and last few rows for `X` and `y` (note `y` will not have column names).

In [None]:
# Add code here


In [None]:
# Add code here
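
One possible way to separate descriptive features from the target is sketched below on a small made-up DataFrame (the values are assumptions, not workshop data). It drops the target column to form `X` and selects it on its own to form `y`.

```python
import pandas as pd

target_feature_name = 'SeriousDlqin2yrs'

# Made-up stand-in for the workshop dataset (an assumption for illustration).
dataset = pd.DataFrame({
    'Age': [45, 31, 58, 27],
    'DebtRatio': [0.35, 0.80, 0.12, 1.10],
    'SeriousDlqin2yrs': [0, 1, 0, 1],
})

# Descriptive features: every column except the target.
X = dataset.drop(columns=[target_feature_name])
# Target feature: a single column, returned as a Series.
y = dataset[target_feature_name]

print(X.shape, y.shape)
print(X.columns.to_list())
print(X.head())
print(y.tail())
```

Because `y` is a Series it has a name and an index but no column headings, which is why only `X` is asked for its columns.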


### Task 3 
Divide the available dataset into a training partition (70%), `X_train` and `y_train`, and a validation partition (30%), `X_valid` and `y_valid`.

In [None]:
# Add code here
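
A 70/30 partition can be sketched with `sklearn.model_selection.train_test_split`, as below. The data is synthetic and the `random_state` value is an arbitrary choice made here so the split is repeatable; neither comes from the workshop.

```python
import numpy as np
import pandas as pd
import sklearn.model_selection

# Synthetic stand-in data: 100 rows, two features, a balanced binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Age': rng.integers(21, 80, size=100),
    'DebtRatio': rng.uniform(0, 2, size=100),
})
y = pd.Series([0, 1] * 50, name='SeriousDlqin2yrs')

# test_size=0.3 holds out 30% of the rows for validation.
X_train, X_valid, y_train, y_valid = sklearn.model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
```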


### Task 4

Create a decision tree classifier object using '*entropy*' as the splitting `criterion` and all other default hyper-parameters.

In [None]:
model_clf = # Add code here


Train the decision tree classifier using its `fit` function with the data in `X_train` and `y_train`.

In [None]:
# Add code here
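
The sketch below shows one way to create and train an entropy-based tree, and to look up the root feature without reading the printed tree. It uses synthetic data in which the target is fully determined by `NumberOfTimes90DaysLate`; that rule, the three columns, and `random_state=0` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import sklearn.tree

# Synthetic stand-in training data (an assumption; not the workshop dataset).
rng = np.random.default_rng(0)
n = 400
X_train = pd.DataFrame({
    'Age': rng.integers(21, 80, size=n),
    'DebtRatio': rng.uniform(0, 2, size=n),
    'NumberOfTimes90DaysLate': rng.integers(0, 4, size=n),
})
# Toy rule: two or more serious late payments means distress.
y_train = (X_train['NumberOfTimes90DaysLate'] >= 2).astype(int)

# Entropy as the splitting criterion, all other hyper-parameters left at defaults.
model_clf = sklearn.tree.DecisionTreeClassifier(criterion='entropy',
                                                random_state=0)
model_clf.fit(X_train, y_train)

# Node 0 is the root; tree_.feature holds the column index tested at each node.
root_feature = X_train.columns[model_clf.tree_.feature[0]]
print(root_feature)
```

On the real dataset the root feature may well be different; the lookup on `tree_.feature[0]` is the part that carries over.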


Print a representation of the decision tree - what feature was chosen to be examined at the root node? 

In [None]:
print(sklearn.tree.export_text(model_clf, 
                               feature_names = X_train.columns.to_list()))

Draw the decision tree. **WATCH OUT!** This tree is most likely very large and will take a long time to draw.

In [None]:
#fig = plt.figure(figsize=(10,10))
#_ = sklearn.tree.plot_tree(model_clf, 
#                           feature_names = X_train.columns.to_list(),
#                           # class_names must be strings, so convert the 0/1 labels
#                           class_names = [str(c) for c in model_clf.classes_],
#                           filled = True)

### Task 5
Make predictions for each of the instances in the **training dataset** and assess the performance of the trained decision tree based on these predictions using **accuracy**. 

In [None]:
# Add code here


Make predictions for each of the instances in the **validation dataset** and assess the performance of the trained decision tree based on these predictions using **accuracy**. 

In [None]:
# Add code here
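
Accuracy on both partitions can be computed with `sklearn.metrics.accuracy_score`, as in the sketch below. The data is synthetic, with 20% of the labels deliberately flipped to mimic noise; the noise level, column set, and random seeds are assumptions made here and say nothing about the workshop dataset.

```python
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
import sklearn.tree

# Synthetic stand-in data (an assumption; not the workshop dataset).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    'Age': rng.integers(21, 80, size=n),
    'DebtRatio': rng.uniform(0, 2, size=n),
    'NumberOfTimes90DaysLate': rng.integers(0, 4, size=n),
})
# Noisy target: driven by late payments, with roughly 20% of labels flipped.
signal = X['NumberOfTimes90DaysLate'] >= 2
flip = rng.random(n) < 0.2
y = (signal ^ flip).astype(int)

X_train, X_valid, y_train, y_valid = sklearn.model_selection.train_test_split(
    X, y, test_size=0.3, random_state=42)

model_clf = sklearn.tree.DecisionTreeClassifier(criterion='entropy',
                                                random_state=0)
model_clf.fit(X_train, y_train)

# Accuracy on the data the tree was trained on.
train_acc = sklearn.metrics.accuracy_score(y_train, model_clf.predict(X_train))
# Accuracy on data the tree has never seen.
valid_acc = sklearn.metrics.accuracy_score(y_valid, model_clf.predict(X_valid))

print(f"Training accuracy:   {train_acc:.3f}")
print(f"Validation accuracy: {valid_acc:.3f}")
```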


What might explain the difference between these performance scores? 

In [None]:
# Add answer here


### Task 6
Train a more effective decision tree by setting the `min_samples_leaf` hyper-parameter to 0.05, so that every leaf must contain at least 5% of the training instances.

In [None]:
model_clf = # Add code here
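
To see what a fractional `min_samples_leaf` does, the sketch below trains a tree on synthetic data and checks the size of its smallest leaf. The data, the noise level, and `random_state=0` are assumptions made for illustration.

```python
import math
import numpy as np
import pandas as pd
import sklearn.tree

# Synthetic stand-in training data (an assumption; not the workshop dataset).
rng = np.random.default_rng(0)
n = 400
X_train = pd.DataFrame({
    'Age': rng.integers(21, 80, size=n),
    'DebtRatio': rng.uniform(0, 2, size=n),
    'NumberOfTimes90DaysLate': rng.integers(0, 4, size=n),
})
signal = X_train['NumberOfTimes90DaysLate'] >= 2
flip = rng.random(n) < 0.2
y_train = (signal ^ flip).astype(int)

# A float min_samples_leaf is read as a fraction of the training instances.
model_clf = sklearn.tree.DecisionTreeClassifier(criterion='entropy',
                                                min_samples_leaf=0.05,
                                                random_state=0)
model_clf.fit(X_train, y_train)

# Leaves are the nodes with no left child (marked with -1).
tree = model_clf.tree_
is_leaf = tree.children_left == -1
smallest_leaf = int(tree.n_node_samples[is_leaf].min())
min_required = math.ceil(0.05 * len(X_train))

print(f"Leaves: {model_clf.get_n_leaves()}")
print(f"Smallest leaf: {smallest_leaf} instances (minimum allowed: {min_required})")
```

With a 5% floor on leaf size the tree can have at most 20 leaves, which keeps it small enough to draw and read.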


Print a representation of the decision tree - what feature was chosen to be examined at the root node? 

In [None]:
print(sklearn.tree.export_text(model_clf, 
                               feature_names = X_train.columns.to_list()))

In [None]:
fig = plt.figure(figsize=(10,10))
_ = sklearn.tree.plot_tree(model_clf, 
                           feature_names = X_train.columns.to_list(),
                           # class_names must be strings, so convert the 0/1 labels
                           class_names = [str(c) for c in model_clf.classes_],
                           filled = True)

Repeat the previous evaluation on the training and validation datasets.

In [None]:
# Add code here


In [None]:
# Add code here


### Task 7
The file **credit_scoring_query.csv** contains a set of query instances for which predictions need to be made. Load this file into a DataFrame named `X_query`.

In [None]:
# Add code here


Use the trained model to make a set of predictions for the instances in `X_query`.

In [None]:
y_pred = # Add code here
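
Scoring query instances might look like the sketch below. Both the training data and the four query rows are made up, and the target follows a toy rule so the outcome is easy to follow; in the workshop `X_query` would come from `pd.read_csv('credit_scoring_query.csv')`. Reordering the query columns to match the training columns is a precaution added here, not a step the workshop asks for.

```python
import numpy as np
import pandas as pd
import sklearn.tree

# Synthetic stand-in training data (an assumption; not the workshop dataset).
rng = np.random.default_rng(0)
n = 400
X_train = pd.DataFrame({
    'Age': rng.integers(21, 80, size=n),
    'DebtRatio': rng.uniform(0, 2, size=n),
    'NumberOfTimes90DaysLate': rng.integers(0, 4, size=n),
})
# Toy rule: two or more serious late payments means distress.
y_train = (X_train['NumberOfTimes90DaysLate'] >= 2).astype(int)

model_clf = sklearn.tree.DecisionTreeClassifier(criterion='entropy',
                                                random_state=0)
model_clf.fit(X_train, y_train)

# Made-up query instances with the columns in a different order.
X_query = pd.DataFrame({
    'NumberOfTimes90DaysLate': [0, 3, 1, 2],
    'DebtRatio': [0.2, 1.5, 0.7, 0.9],
    'Age': [50, 29, 41, 35],
})
# Put the columns in the same order the model saw during training.
X_query = X_query[X_train.columns]

y_pred = model_clf.predict(X_query)

predictions = pd.DataFrame({'prediction': y_pred})
print(predictions)
print(predictions['prediction'].value_counts())
```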

Examine the predictions made using the model. 

In [None]:
predictions = pd.DataFrame({'prediction' : y_pred})
print(predictions.head())

In [None]:
predictions.value_counts()