# Credit Risk Analysis

Project developed by:
* Mariana Ramos - up201806869
* Pedro Ferreira - up201806506
* Pedro Ponte - up201809694

## Table of contents

1. [Introduction](#Introduction)

2. [Required Libraries](#Required-Libraries)

3. [Step 1: Answering the question](#Step-1:-Answering-the-question)

4. [Step 2: Checking the data](#Step-2:-Checking-the-data)

5. [Step 3: Tidying the data](#Step-3:-Tidying-the-data)

6. [Step 4: Exploratory Analysis](#Step-4:-Exploratory-Analysis)

7. [Step 5: Classification](#Step-5:-Classification)

    - [5.1: Decision Trees](#5.1:-Decision-Trees)

        - [5.1.1: Parameter Tuning](#5.1.1:-Parameter-Tuning)
    
    - [5.2: K-Nearest Neighbor](#5.2:-K-Nearest-Neighbor)
        
        - [5.2.1: Parameter Tuning](#5.2.1:-Parameter-Tuning)
        
    - [5.3: Support-Vector Machines](#5.3:-Support-Vector-Machines)
        
        - [5.3.1: Parameter Tuning](#5.3.1:-Parameter-Tuning)
        
    - [5.4: Neural Networks](#5.4:-Neural-Networks)
        
        - [5.4.1: Parameter Tuning](#5.4.1:-Parameter-Tuning)
        
8. [Step 6: Results Analysis](#Step-6:-Results-Analysis)

## Introduction

[[ go back to the top ]](#Table-of-contents)

## Required Libraries

[[ go back to the top ]](#Table-of-contents)

If you don't have Python on your computer, you can use the [Anaconda Python distribution](http://continuum.io/downloads) to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

* **NumPy**: Provides a fast numerical array structure and helper functions.
* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* **scikit-learn**: The essential Machine Learning package in Python.
* **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **Seaborn**: Advanced statistical plotting library.
* **Plotly**: Interactive plotting library, used here to render the results table.

To make sure you have all of the packages you need, install them with `conda`:

    conda install numpy pandas scikit-learn matplotlib seaborn plotly
    
    conda install -c conda-forge watermark

`conda` may ask you to update some of them if you don't have the most recent version. Allow it to do so.

## Step 1: Answering the question

[[ go back to the top ]](#Table-of-contents)

The first step to any data analysis project is to define the question or problem we're looking to solve, and to define a measure (or set of measures) for our success at solving that task. The data analysis checklist has us answer a handful of questions to accomplish that, so let's work through those questions.

>Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?

We're trying to design a predictive model in order to evaluate the credit risk of a given loan and decide whether the loan should be granted or not.

>Did you define the metric for success before beginning?

Let's do that now. Since we're performing classification, we can use [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision) — the fraction of correctly classified loans — to quantify how well our model is performing. The accuracy achieved should be, at least, 

<hr />

## Step 2: Checking the data

[[ go back to the top ]](#Table-of-contents)

In order to draw conclusions and reach our goal, we need to extract and study the data. To do so, we import the pandas library and load the data into a variable so we can work with it.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import plotly.graph_objects as go
import sklearn.tree as tree
from sklearn import neighbors, datasets, preprocessing
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# StratifiedKFold is needed by the neural network parameter tuning in section 5.4.1
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

credit_data = pd.read_csv('data.csv', low_memory=False)

In [None]:
# data frame shape
print('Number of rows: ', credit_data.shape[0])
print('Number of columns: ', credit_data.shape[1])

As we can see, the data file has a total of 855969 rows and 73 columns. By default, `read_csv` reads empty fields as NaN, so we can easily analyse missing values later. The second parameter, `low_memory=False`, makes pandas process the file in a single pass so that the type of each column is inferred consistently.
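
As a minimal sketch of how pandas treats empty fields when reading a CSV, the snippet below parses a tiny in-memory file instead of `data.csv`. The column names and values are made up for illustration only.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for data.csv (hypothetical columns and values)
csv_text = "loan_amnt,emp_title,annual_inc\n5000,,24000\n2500,Ryder,\n"

sample = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Empty fields are parsed as NaN, so isnull() can count them directly
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Each of `emp_title` and `annual_inc` has one empty field in this toy file, which is what the count reports.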

We can start by looking at the first and last rows of the DataFrame to check that everything was loaded correctly.

In [None]:
credit_data.head(10)

In [None]:
credit_data.tail()

In [None]:
credit_data.describe()

Let's examine the structure of the data set.

In [None]:
# all data frame columns
credit_data.columns

In [None]:
# data frame summary
credit_data.info()

We can see that some features have missing values. Let's take a closer look at them.

In [None]:
# percentage of missing values per feature
print((credit_data.isnull().sum() * 100 / credit_data.shape[0]).sort_values(ascending=False))

In [None]:
credit_data.isnull().sum().plot(kind='bar', figsize=(18,8), fontsize=14,);
plt.ylabel('Null values');

## Step 3: Tidying the data
[[ go back to the top ]](#Table-of-contents)

Looking at all the features and their respective percentage of missing values, we can conclude that some of them (*verification_status_joint*, *annual_inc_joint*, *dti_joint*, *il_util*, *mths_since_rcnt_il*, *total_bal_il*, *inq_last_12m*, *open_acc_6m*, *open_il_6m*, *open_il_24m*, *open_il_12m*, *open_rv_12m*, *open_rv_24m*, *max_bal_bc*, *all_util*, *inq_fi*, *total_cu_tl*) have almost all entries missing. As these features are not crucial for the project, we are dropping them.

In [None]:
# drop the columns
credit_data.drop(['verification_status_joint', 'annual_inc_joint', 'dti_joint', 'il_util', 'mths_since_rcnt_il',
                   'total_bal_il', 'inq_last_12m', 'open_acc_6m', 'open_il_6m', 'open_il_24m', 'open_il_12m', 'open_rv_12m',
                   'open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi', 'total_cu_tl'], axis=1, inplace=True)
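
The list above was written out by hand. As a hedged sketch, the same kind of list could be derived from the missing-value fractions. The helper `mostly_missing_columns` and the 95% threshold are assumptions made for illustration, not part of the original analysis, and the toy frame only borrows a few column names.

```python
import numpy as np
import pandas as pd

def mostly_missing_columns(df, threshold=0.95):
    """Return the names of columns whose fraction of missing values exceeds threshold."""
    missing_fraction = df.isnull().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()

# Toy frame: 'dti_joint' is entirely missing, 'revol_util' has one gap
toy = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],
    "dti_joint": [np.nan, np.nan, np.nan, np.nan],
    "revol_util": [10.0, np.nan, 30.0, 40.0],
})

to_drop = mostly_missing_columns(toy, threshold=0.95)
toy = toy.drop(columns=to_drop)
print(to_drop)
```

Only the fully empty column is selected here. A column with a few gaps, such as `revol_util`, is kept so that it can be imputed later.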

Now, let's examine the number of unique values for each feature.

In [None]:
# number of unique observations per column
credit_data.nunique().sort_values()

The feature *policy_code* has only one value, so it carries no information for our analysis and we can drop this column too.

In [None]:
# drop the column "policy_code"
credit_data.drop('policy_code', axis=1, inplace=True)
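
A small sketch of how such single-valued columns could be found automatically from the `nunique()` counts. The helper name `constant_columns` is hypothetical and the data is a toy example.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns with at most one distinct non-null value."""
    unique_counts = df.nunique()
    return unique_counts[unique_counts <= 1].index.tolist()

# Toy frame: 'policy_code' never changes, 'grade' does
toy = pd.DataFrame({
    "policy_code": [1, 1, 1],
    "grade": ["A", "B", "A"],
})

constant = constant_columns(toy)
print(constant)
```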

The *id* and *member_id* features are fields randomly generated by the bank for unique identification purposes only, so we can drop them.

In [None]:
# drop the columns "id" and "member_id"
credit_data.drop(['id', 'member_id'], axis=1, inplace=True)

Features like *funded_amnt*, *funded_amnt_inv*, *mths_since_last_record*, *out_prncp*, *out_prncp_inv*, *total_pymnt*, *total_pymnt_inv*, *total_rec_prncp*, *total_rec_int*, *total_rec_late_fee*, *recoveries*, *collection_recovery_fee*, *last_pymnt_d*, *last_pymnt_amnt*, *next_pymnt_d* can be dropped because they leak data from the future, i.e. they are only known after the loan has already started to be funded.

In [None]:
# drop columns
credit_data.drop(['funded_amnt', 'funded_amnt_inv', 'mths_since_last_record', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
                   'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 
                   'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d'], axis=1, inplace=True)

The *emp_title* feature requires other data and a lot of processing to become potentially useful, so we opt to drop it too.

In [None]:
# drop "emp_title" column
credit_data.drop('emp_title', axis=1, inplace=True)

Also, the *desc*, *initial_list_status* and *total_rev_hi_lim* features don't add value to our model, so we will drop them.

In [None]:
# drop "desc", "initial_list_status", "total_rev_hi_lim" columns
credit_data.drop(['desc', 'initial_list_status', 'total_rev_hi_lim'], axis=1, inplace=True)

The *zip_code* feature is mostly redundant, since only the first 3 digits of the 5-digit zip code are visible, so we drop it.

In [None]:
# drop "zip_code" column
credit_data.drop('zip_code', axis=1, inplace=True)

The *addr_state* column seems to contain categorical values. Let's explore the unique value counts of this column.

In [None]:
print(credit_data['addr_state'].value_counts())

The *addr_state* column contains too many unique values, so it's better to drop it.

In [None]:
# drop "addr_state" column
credit_data.drop('addr_state', axis=1, inplace=True)

Let's look at *pymnt_plan* feature values.

In [None]:
print(credit_data['pymnt_plan'].value_counts())

It seems that this column has two unique values, *y* and *n*, with *y* occurring only 5 times. Let’s drop this column:

In [None]:
# drop "pymnt_plan" column
credit_data.drop('pymnt_plan', axis=1, inplace=True)

Let's also look at *application_type* column.

In [None]:
print(credit_data['application_type'].value_counts())

It seems that this column only has two unique values, *INDIVIDUAL* and *JOINT*, with *JOINT* occurring only 442 times. Let's drop this column too:

In [None]:
# drop "application_type" column
credit_data.drop('application_type', axis=1, inplace=True)

Next, let’s look at the unique value counts for the *purpose* and *title* columns to understand which columns we want to keep.

In [None]:
for name in ['purpose','title']:
    print("Unique Values in column: {}\n".format(name))
    print(credit_data[name].value_counts(),'\n')

It appears that the *purpose* and *title* columns contain overlapping information. The *title* column contains too many different values with distinct information, so it is hard to analyse. On the other hand, *purpose* has many unique values and they are nominal, so converting them to numerical values would create many new columns and make our job harder.
Taking this into account, we opt to drop both columns.

In [None]:
# drop "title" and "purpose" column
credit_data.drop(['title', 'purpose'], axis=1, inplace=True)

Looking at the features that we still have, we see that we have two that are very similar: *grade* and *sub_grade*.

In [None]:
for name in ['grade','sub_grade']:
    print("Unique Values in column: {}\n".format(name))
    print(credit_data[name].value_counts(),'\n')

We can conclude that *sub_grade* contains redundant information that is already in the *grade* column, so we can also drop this column.

In [None]:
# drop "sub_grade" column
credit_data.drop('sub_grade', axis=1, inplace=True)

In [None]:
null_counts = credit_data.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))

The *mths_since_last_major_derog* column has too many null values, in the order of 75%, so we will drop this one too. We also drop the *mths_since_last_delinq* column, as it has a high percentage of null values as well, in the order of 50%.

In [None]:
# drop "mths_since_last_major_derog" and "mths_since_last_delinq" columns
credit_data.drop(['mths_since_last_major_derog', 'mths_since_last_delinq'], axis=1, inplace=True)

Let's now investigate columns that are of the **object** data type and figure out how we can make those values numeric.

In [None]:
object_columns_df = credit_data.select_dtypes(include=['object'])
print(object_columns_df.iloc[0])

These columns seem to hold either categorical values or dates:
* term
* grade
* emp_length
* home_ownership
* verification_status
* earliest_cr_line
* last_credit_pull_d
* issue_d

Let's first explore the categorical columns.

In [None]:
cols = ['term', 'grade', 'emp_length', 'home_ownership', 'verification_status']
for name in cols:
    print(name,':')
    print(credit_data[name].value_counts(),'\n')

The *grade*, *emp_length* and *term* columns contain ordinal values, i.e. they have a natural order and we can sort them in either increasing or decreasing order. For this reason, we can change the values of these columns to the appropriate numeric values.

In [None]:
# keep only the number of months from 'term'; the mapping below turns it into a number
credit_data['term'] = credit_data['term'].str.split(' ').str[1]


mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    },
    "grade": {
        "A": 1,
        "B": 2,
        "C": 3,
        "D": 4,
        "E": 5,
        "F": 6,
        "G": 7
    },
    "term": {
        "36": 36.0,
        "60": 60.0
    }
}
credit_data = credit_data.replace(mapping_dict)

In [None]:
credit_data[['emp_length', 'grade', 'term']].head(10)

*verification_status* and *home_ownership* features contain nominal values, so we can't order them. In this case, we will have to convert them to numerical values using dummy variables.

In [None]:
# Converting nominal features into numerical features by encoding them as dummy variables
nominal_columns = ["verification_status", "home_ownership"]
dummy_loan = pd.get_dummies(credit_data[nominal_columns])
print(dummy_loan.head())

In [None]:
# Insert the dummy variables into the original DataFrame, drop nominal columns
credit_data = pd.concat([credit_data, dummy_loan], axis=1)
credit_data = credit_data.drop(nominal_columns, axis=1)

In [None]:
credit_data.head()

*earliest_cr_line*, *issue_d*, *last_credit_pull_d* columns contain date values. Let's convert them:

In [None]:
# Convert date inputs
cols = ["earliest_cr_line","issue_d","last_credit_pull_d"]
for col in cols:
    credit_data[col] = pd.to_datetime(credit_data[col],format="%d-%m-%Y")
credit_data[cols].head()

Let's now check whether there are still columns with null values.

In [None]:
null_counts = credit_data.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))

As there are some missing values, let's deal with them.

In [None]:
# Fill in the missing values of these numeric features with the median of each feature.
# Assigning the result back avoids calling fillna(inplace=True) on a column accessed
# through the DataFrame, which may not update the DataFrame in recent pandas versions.
median_columns = ['revol_util', 'collections_12_mths_ex_med', 'tot_coll_amt', 'tot_cur_bal']
for col in median_columns:
    credit_data[col] = credit_data[col].fillna(credit_data[col].median())

# Fill in the missing values for 'emp_length' with 0.
credit_data['emp_length'] = credit_data['emp_length'].fillna(0)

In [None]:
null_counts = credit_data.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))

In [None]:
# Drop rows with missing values
credit_data = credit_data.dropna()
print(credit_data.isnull().sum().sort_values(ascending=False))

In [None]:
credit_data.info()

## Step 4: Exploratory Analysis
[[ go back to the top ]](#Table-of-contents)

In [None]:
# Heatmap on the correlations between features in the loan data
# numeric_only=True keeps the date columns out of the correlation matrix
credit_correlations = credit_data.corr(numeric_only=True)
plt.figure(figsize=(20, 20))
plt.imshow(credit_correlations, cmap=None, interpolation='none', aspect='auto')
plt.colorbar()
plt.xticks(range(len(credit_correlations)), credit_correlations.columns, rotation='vertical')
plt.yticks(range(len(credit_correlations)), credit_correlations.columns);
plt.suptitle('Loan correlations Heat Map', fontsize=30, fontweight='bold')
plt.show()

Data preparation is finished and we can now start to apply machine learning algorithms in order to predict the credit risk.

It's usually better to visualize the data in some way. Visualization makes outliers and errors immediately stand out, whereas they might go unnoticed in a large table of numbers.
As the dataset contains too many rows, we first need to take a subset of the original dataset so that it is feasible to apply the supervised learning algorithms and test our model.

## Step 5: Classification
[[ go back to the top ]](#Table-of-contents)

To move on to classification, we need to gather the training and test samples. To do so, we use the train_test_split function from sklearn on the credit_data data, after first drawing a random 5% sample of it.

In [None]:
credit_data_subset = credit_data.sample(frac=0.05)

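As the subset is not balanced (*default_ind* has many more 0s than 1s), we need to balance it. We do so by upsampling the minority class with replacement until both classes have the same number of rows.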

In [None]:
default_ind_0 = credit_data_subset.loc[credit_data_subset['default_ind'] == 0]
default_ind_1 = credit_data_subset.loc[credit_data_subset['default_ind'] == 1]

df_minority_upsampled = resample(default_ind_1, replace=True, n_samples=len(default_ind_0),random_state=123)

credit_data_subset = pd.concat([default_ind_0, df_minority_upsampled])

credit_data_subset.default_ind.value_counts()
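
Upsampling before splitting means that copies of the same minority row can end up in both the training and the test set. The sketch below shows one alternative ordering under that concern: hold out the test set first, then upsample only the training part. It uses toy data with made-up values, and it is an illustration rather than the procedure followed in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 90 non-defaults and 10 defaults (hypothetical values)
toy = pd.DataFrame({
    "loan_amnt": range(100),
    "default_ind": [0] * 90 + [1] * 10,
})

# 1. Hold out the test set first, keeping the class ratio with stratify
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["default_ind"], random_state=1
)

# 2. Upsample the minority class inside the training set only
train_majority = train[train["default_ind"] == 0]
train_minority = train[train["default_ind"] == 1]
train_minority_upsampled = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=123
)
train_balanced = pd.concat([train_majority, train_minority_upsampled])

print(train_balanced["default_ind"].value_counts())
```

With this ordering the test set keeps the original class ratio and shares no rows with the balanced training set.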

Next, we separate the features from the target *default_ind*, leaving out the date columns and a few other features, and hold out 20% of the subset as a test set.

In [None]:
X = credit_data_subset[credit_data_subset.columns.drop(['default_ind', 'issue_d', 'earliest_cr_line', 'last_credit_pull_d', 'dti',
                                                        'inq_last_6mths', 'revol_util', 'collections_12_mths_ex_med', 'home_ownership_ANY',
                                                       'delinq_2yrs'])] 
y = credit_data_subset['default_ind']

# get a test dataset with 20% of the credit_data_subset
(X_train, X_test, y_train, y_test) = train_test_split(X, y, test_size=0.2, random_state=1)

X_test.head(10)

In [None]:
X_train.info()
print()
X_test.info()

### 5.1: Decision Trees
[[ go back to the top ]](#Table-of-contents)

In [None]:
# Create the classifier
decision_tree_classifier = DecisionTreeClassifier()

# Train the classifier on the training set
decision_tree_classifier.fit(X_train, y_train)

dtc_prediction = decision_tree_classifier.predict(X_test)

dtc_classification_report = classification_report(y_test, dtc_prediction, output_dict=True)

print(f"Classification report:\n{classification_report(y_test, dtc_prediction)}\n")

sb.set(font_scale=1.0)

ax = plt.subplot()

confusion_matrix_dtc = confusion_matrix(y_test, dtc_prediction)

sb.heatmap(confusion_matrix_dtc, annot=True, ax=ax, fmt="g")

ax.set_xlabel('Predicted class');
ax.set_ylabel('Observed class');
ax.set_title('Confusion Matrix');
plt.show()

#### 5.1.1: Parameter Tuning
[[ go back to the top ]](#Table-of-contents)

Every Machine Learning model comes with a variety of parameters to tune, and these parameters can be vitally important to the performance of our classifier.

The most common method for model parameter tuning is Grid Search. The idea behind Grid Search is simple: explore a range of parameters and find the best-performing parameter combination. Focus your search on the best range of parameters, then repeat this process several times until the best parameters are discovered.
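
To make the idea concrete, here is a minimal hand-written sketch of what a grid search does: try every parameter combination, score each one with cross-validation, and keep the best. The synthetic data from `make_classification` and the small grid are assumptions for illustration. `GridSearchCV` wraps this same loop with more features.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the loan features (not the real dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical, deliberately small grid
grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6]}

results = []
for criterion, max_depth in product(grid["criterion"], grid["max_depth"]):
    model = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth, random_state=0)
    # Mean accuracy over 5 cross-validation folds for this combination
    score = cross_val_score(model, X_toy, y_toy, cv=5).mean()
    results.append((score, {"criterion": criterion, "max_depth": max_depth}))

# Keep the combination with the highest mean score
best_score, best_params = max(results, key=lambda item: item[0])
print(best_score, best_params)
```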

Let's tune our decision tree classifier. Let's start by tuning four parameters and analyse the results.

In [None]:
parameter_grid = {'criterion': ['gini', 'entropy'],
                  'splitter': ['best', 'random'],
                  'max_depth': range(10, 20),
                  'max_features': range(10,20)}

grid_search = GridSearchCV(DecisionTreeClassifier(),
                           param_grid=parameter_grid,
                           cv=10,
                           verbose=4,
                           n_jobs=-1)

grid_search.fit(X_train, y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
print('Best estimator: {}'.format(grid_search.best_estimator_))

In [None]:
decision_tree_classifier = grid_search.best_estimator_
dtc_prediction = decision_tree_classifier.predict(X_test)

best_dtc_classification_report = classification_report(y_test, dtc_prediction, output_dict=True)

print("--- Improved model ---\n")
print(f"Classification report:\n{classification_report(y_test, dtc_prediction)}\n")

sb.set(font_scale=1.0)

ax = plt.subplot()

confusion_matrix_dtc = confusion_matrix(y_test, dtc_prediction)

sb.heatmap(confusion_matrix_dtc, annot=True, ax=ax, fmt="g")

ax.set_xlabel('Predicted class');
ax.set_ylabel('Observed class');
ax.set_title('Confusion Matrix');
plt.show()

Here we have the best classifier, which allows our model to achieve a score of approximately 94%.

We can also visualize the Decision Tree:

In [None]:
# create the figure first so that the tree is drawn on it
plt.figure(figsize=(15, 150))
tree.plot_tree(decision_tree_classifier)
plt.show()

# export the tree in Graphviz format
with open('iris_dtc.dot', 'w') as out_file:
    tree.export_graphviz(decision_tree_classifier, out_file=out_file)

### 5.2: K-Nearest Neighbor
[[ go back to the top ]](#Table-of-contents)

The K-Nearest Neighbors algorithm (k-NN) is a non-parametric classification method. The input consists of the k closest training examples in the data set. The output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
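
The plurality vote described above can be sketched in a few lines of NumPy. The function `knn_predict` and the toy points are hypothetical and use plain Euclidean distance. scikit-learn's `KNeighborsClassifier` is the implementation actually used in this notebook.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a plurality vote among its k nearest training points."""
    # Euclidean distance from x_new to every training example
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common class label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy 2-D points: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6], [5.0, 5.0], [5.2, 4.8], [4.9, 5.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
near_five = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=1)
print(near_origin, near_five)
```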

Let's start by applying this algorithm to our data with k=5 (the default of `KNeighborsClassifier`) and see the results.

In [None]:
knn = KNeighborsClassifier()

knn.fit(X_train, y_train)
knn_prediction = knn.predict(X_test)

knn_classification_report = classification_report(y_test, knn_prediction, output_dict=True)

print(f"Classification report:\n{classification_report(y_test, knn_prediction, labels=np.unique(y_train))}\n")

sb.set(font_scale=1.0)

ax = plt.subplot()

confusion_matrix_knn = confusion_matrix(y_test, knn_prediction)

sb.heatmap(confusion_matrix_knn, annot=True, ax=ax, fmt="g")

ax.set_xlabel('Predicted class');
ax.set_ylabel('Observed class');
ax.set_title('Confusion Matrix');
plt.show()

#### 5.2.1: Parameter Tuning
[[ go back to the top ]](#Table-of-contents)

In [None]:
parameter_grid = {'n_neighbors': [5,10,15,20],
                  'weights': ['uniform', 'distance'],
                  'algorithm': ['ball_tree', 'kd_tree', 'brute']}

grid_search = GridSearchCV(KNeighborsClassifier(),
                           param_grid=parameter_grid,
                           scoring='precision_weighted',
                           cv=10,
                           n_jobs=3,
                           verbose=4)

grid_search.fit(X_train, y_train)
print(f"Best score: {grid_search.best_score_}")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best estimator: {grid_search.best_estimator_}")

In [None]:
knn = grid_search.best_estimator_
yk_pred = knn.predict(X_test)

best_knn_classification_report = classification_report(y_test, yk_pred, output_dict=True)

print("--- Improved model ---\n")
print(f"Classification report:\n{classification_report(y_test, yk_pred)}\n")

sb.set(font_scale=1.0)

ax = plt.subplot()

confusion_matrix_knn = confusion_matrix(y_test, yk_pred)

sb.heatmap(confusion_matrix_knn, annot=True, ax=ax, fmt="g")

ax.set_xlabel('Predicted class');
ax.set_ylabel('Observed class');
ax.set_title('Confusion Matrix');
plt.show()

### 5.3: Support-Vector Machines
[[ go back to the top ]](#Table-of-contents)

Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for classification or for regression. SVMs are a discriminative classifier: that is, they draw a boundary between clusters of data.
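
As a small illustration of "drawing a boundary", the sketch below fits a linear SVM to two toy clusters and checks on which side of the boundary two new points fall. The data is made up, and the linear kernel is chosen here only to keep the example simple.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy clusters (hypothetical data)
X_toy = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svc_toy = SVC(kernel="linear")
svc_toy.fit(X_toy, y_toy)

# decision_function returns a signed score relative to the boundary:
# negative scores fall on the class 0 side, positive scores on the class 1 side
new_points = np.array([[0.5, 0.5], [4.5, 4.5]])
scores = svc_toy.decision_function(new_points)
predictions = svc_toy.predict(new_points)
print(scores, predictions)
```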

In [None]:
svc = SVC()

svc.fit(X_train, y_train)
svc_prediction = svc.predict(X_test)

svm_classification_report = classification_report(y_test, svc_prediction, output_dict=True)

print(f"Classification report:\n{classification_report(y_test, svc_prediction)}\n")

sb.set(font_scale=1.0)

ax = plt.subplot()

confusion_matrix_svm = confusion_matrix(y_test, svc_prediction)

sb.heatmap(confusion_matrix_svm, annot=True, ax=ax, fmt="g")

ax.set_xlabel('Predicted class');
ax.set_ylabel('Observed class');
ax.set_title('Confusion Matrix');
plt.show()

best_svm_classification_report = svm_classification_report

#### 5.3.1: Parameter Tuning
[[ go back to the top ]](#Table-of-contents)

In [None]:
parameter_grid = {'C' : [0.1, 1, 10], 
                'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}

grid_search = GridSearchCV(SVC(),
                           param_grid=parameter_grid,
                           cv=10,
                           verbose=4,
                           n_jobs=4)

grid_search.fit(X_train, y_train)
print(f"Best score: {grid_search.best_score_}")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best estimator: {grid_search.best_estimator_}")

In [None]:
svc = grid_search.best_estimator_
yk_pred = svc.predict(X_test)

best_svm_classification_report = classification_report(y_test, yk_pred, output_dict=True)

print("--- Improved model ---\n")
print(f"Classification report:\n{classification_report(y_test, yk_pred)}\n")

sb.set(font_scale=1.0)

ax = plt.subplot()

confusion_matrix_svm = confusion_matrix(y_test, yk_pred)

sb.heatmap(confusion_matrix_svm, annot=True, ax=ax, fmt="g")

ax.set_xlabel('Predicted class');
ax.set_ylabel('Observed class');
ax.set_title('Confusion Matrix');
plt.show()

### 5.4: Neural Networks
[[ go back to the top ]](#Table-of-contents)

A neural network is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. In this sense, neural networks refer to systems of neurons, either organic or artificial in nature. Neural networks can adapt to changing input; so the network generates the best possible result without needing to redesign the output criteria.

In [None]:
scaler = StandardScaler()

# Fit only to the training data
scaler.fit(X_train)

# Now apply the transformations to the data:
X_train_nn = scaler.transform(X_train)
X_test_nn = scaler.transform(X_test)

# Create the classifier
ANNClassifier = MLPClassifier(random_state=1, max_iter=500)

# Train the classifier on the training set
ANNClassifier.fit(X_train_nn, y_train)

predictions = ANNClassifier.predict(X_test_nn)

confusion_matrix_ann = confusion_matrix(y_test,predictions)

nn_classification_report = classification_report(y_test, predictions, output_dict=True)
print(classification_report(y_test,predictions))

sb.set(font_scale=1.0)

ax = plt.subplot()

sb.heatmap(confusion_matrix_ann, annot=True, ax=ax, fmt="g")

ax.set_xlabel('Predicted class');
ax.set_ylabel('Observed class');
ax.set_title('Confusion Matrix');
plt.show()

best_nn_classification_report = nn_classification_report

#### 5.4.1: Parameter Tuning
[[ go back to the top ]](#Table-of-contents)

In [None]:
parameter_grid = {'activation': ['tanh','identity','logistic','relu'],
                  'solver': ['adam','lbfgs','sgd'],
                  'hidden_layer_sizes': [3,5,8,13,21,34],
                  'verbose': [True]}

cross_validation = StratifiedKFold(n_splits=10, shuffle=True)

grid_search = GridSearchCV(ANNClassifier,
                           param_grid=parameter_grid,
                           cv=cross_validation)

# fit on the scaled training data, as the network above was trained on scaled inputs
grid_search.fit(X_train_nn, y_train)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
print('Best estimator: {}'.format(grid_search.best_estimator_))

In [None]:
ANNClassifier = grid_search.best_estimator_
# predict on the scaled test data to match the training inputs
yk_pred = ANNClassifier.predict(X_test_nn)

best_nn_classification_report = classification_report(y_test, yk_pred, output_dict=True)

print("--- Improved model ---\n")
print(f"Classification report:\n{classification_report(y_test, yk_pred)}\n")

sb.set(font_scale=1.0)

ax = plt.subplot()

confusion_matrix_ann = confusion_matrix(y_test, yk_pred)

sb.heatmap(confusion_matrix_ann, annot=True, ax=ax, fmt="g")

ax.set_xlabel('Predicted class');
ax.set_ylabel('Observed class');
ax.set_title('Confusion Matrix');
plt.show()

## Step 6: Results Analysis
[[ go back to the top ]](#Table-of-contents)

In [None]:
import matplotlib.patches as mpatches

algorithms_data = [
    [dtc_classification_report['accuracy'], knn_classification_report['accuracy'], svm_classification_report['accuracy'], nn_classification_report['accuracy']],
    [best_dtc_classification_report['accuracy'], best_knn_classification_report['accuracy'], best_svm_classification_report['accuracy'], best_nn_classification_report['accuracy']]
]

X = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(X+0.0, algorithms_data[0], width=0.25, tick_label=['DTC', 'KNN', 'SVM', 'NN'], color = 'b')
ax.bar(X+0.25, algorithms_data[1], width=0.25, color = 'c')

original = mpatches.Patch(color='b', label='Original')
improved = mpatches.Patch(color='c', label='Improved')

plt.legend(handles=[original, improved], bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

plt.show()

In [None]:
best_reports = [best_dtc_classification_report, best_knn_classification_report,
                best_svm_classification_report, best_nn_classification_report]

# 'accuracy' is a top-level key of the report dictionary; f1-score and precision
# are nested per class or average, so the weighted average is used here
algorithms_data = [
    [report['accuracy'] for report in best_reports],
    [report['weighted avg']['f1-score'] for report in best_reports],
    [report['weighted avg']['precision'] for report in best_reports]
]

X = np.arange(4)
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(X+0.0, algorithms_data[0], width=0.25, color = 'b')
ax.bar(X+0.25, algorithms_data[1], width=0.25, tick_label=['DTC', 'KNN', 'SVM', 'NN'], color = 'g')
ax.bar(X+0.50, algorithms_data[2], width=0.25, color = 'y')

accuracy = mpatches.Patch(color='b', label='accuracy')
f1_score = mpatches.Patch(color='g', label='f1-score')
precision = mpatches.Patch(color='y', label='precision')

# the legend must list the three metrics plotted in this chart
plt.legend(handles=[accuracy, f1_score, precision], bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

plt.show()

In [None]:
# labels for the rows of the table, in the same order as the reports below
algs = ['DTC', 'KNN', 'SVM', 'NN']

table = go.Figure(data=[go.Table(header=dict(values=['Algorithm', 'Accuracy Normal', 'Accuracy Improved']),
                 cells=dict(values=[algs, [dtc_classification_report['accuracy'], knn_classification_report['accuracy'], svm_classification_report['accuracy'], nn_classification_report['accuracy']], 
                                   [best_dtc_classification_report['accuracy'], best_knn_classification_report['accuracy'], best_svm_classification_report['accuracy'], best_nn_classification_report['accuracy']]]))
                     ])
table.show()