# Assignment 2 - Intelligent Machines, Ethics and Law (COMP2400/6400)

### Name:

### Student Id:

# Credit Risk Classification

In this assignment, you will implement a **credit risk classification system** as an illustration of *algorithmic decision making*. You will implement it in Python and explore its social implications. The assessment is designed to familiarize you with training a classification model, implementing it in a Python program, and critically examining it from a societal perspective. This assignment will be marked out of 100 and counts for **40%** of the total unit assessment.

## Credit Risk Dataset Description

The Credit Risk dataset provided is based on a synthetic dataset publicly available on [Kaggle](https://www.kaggle.com/). It includes various financial attributes for evaluating credit risk and contains 32,581 samples with 11 variables. The key attributes include borrowers' age, employment status, education level, annual income, loan amount, and interest rate. The dataset is intended to support the prediction of credit default, that is, the failure of the loan applicant (borrower) to make the loan repayments.

## Setup

In [None]:
pip install pandas

In [None]:
pip install scikit-learn

In [None]:
pip install shap

In [None]:
pip install matplotlib

## Task 0 - (0 Marks)

Load the dataset using Python.

Preprocess the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('credit_risk_dataset_v3.csv')

# Show the first few rows of the dataset
print(df.head())

# Size of the dataset
print('Size of the dataset:', df.shape)
print('===============================')

# Check whether the data has null values
print('Null information:')
print('variable \t number of null values')
print(df.isnull().sum())

# Remove rows that contain missing values (axis=0 drops rows, not columns)
df = df.dropna(axis=0)
print('===============================')
print('Null rows removed. \n')

# Confirm that no null values remain
print('Updated null information:')
print('variable \t number of null values')
print(df.isnull().sum())
print('===============================')
print('Updated size of the dataset:', df.shape)

In [None]:
# Show the distribution of the categorical variable 'loan_intent'
print(df['loan_intent'].value_counts())

# Visualize the information
grouped_data = df.groupby('loan_intent').size()
grouped_data.plot(kind='bar', figsize=(4, 3))

In [None]:
# Summarize the distribution of the numerical variable 'person_income'
# (describe() reports count, mean, standard deviation, min, quartiles and max)
print(df['person_income'].describe())

# Visualize the information
# A box plot is a method for graphically depicting groups of numerical data through their quartiles.
df['person_income'].plot.box()


In [None]:
# From the visualization, we notice that 'person_income' is skewed.
# A log transform can reduce this skewness (np.log assumes all incomes are strictly positive).
df['person_income'] = np.log(df['person_income'])
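
The effect of the log transform on skewness can be checked numerically. The following is a minimal sketch using only the standard library; the income values and the `skewness` helper are hypothetical and are not taken from the assignment dataset.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical incomes with one very large value (not the assignment data)
incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 60_000, 90_000, 500_000]
log_incomes = [math.log(v) for v in incomes]

raw_skew = skewness(incomes)
log_skew = skewness(log_incomes)

print(f"skewness before log transform: {raw_skew:.2f}")
print(f"skewness after log transform:  {log_skew:.2f}")
```

A skewness closer to zero indicates a more symmetric distribution. The transform reduces the skew in this example but does not remove it entirely.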

In [None]:
# Some variables, such as 'person_home_ownership' and 'loan_intent', are categorical.
# Others, such as 'person_age' and 'person_income', are numerical.
# We need to convert categorical variables to numeric before we can use the data for our machine learning models.

# Identify all categorical variables
cat_columns = df.select_dtypes(['object']).columns

# Convert each categorical variable into dummy/indicator variables.
# Each n-valued variable is converted to n indicator variables, with 0 indicating FALSE and 1 indicating TRUE.
df = pd.get_dummies(df, columns=cat_columns, dtype=int)

# For example, 'loan_intent' with values EDUCATION, MEDICAL, VENTURE, PERSONAL, DEBTCONSOLIDATION and HOMEIMPROVEMENT
# is now converted to 6 dummy columns: loan_intent_EDUCATION, loan_intent_MEDICAL, ..., loan_intent_HOMEIMPROVEMENT, each taking 0/1 as possible values.
df.head()

In [None]:
# 'loan_status' indicates whether the loan was defaulted on (1) or repaid on time (0).
# This is what we would like to predict based on the dataset.
# Let's visualize it. 
df["loan_status"].value_counts().plot.pie()

### Resample

From the visualization above, we can see that substantially more loans were repaid on time than defaulted.

This class imbalance is problematic for training a model: a model can achieve high accuracy by mostly predicting the majority class while rarely identifying defaults.

To handle an imbalanced dataset, we can either under-sample the majority class or over-sample the minority class.
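
To see why imbalance matters, consider a trivial "model" that always predicts the majority class. The following is a minimal sketch with hypothetical labels (an 80/20 split chosen for illustration, not the actual class ratio in the assignment dataset).

```python
# Hypothetical labels: 0 = repaid on time, 1 = defaulted (not the assignment data)
y_true = [0] * 80 + [1] * 20

# A trivial "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: the fraction of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on defaults: the fraction of actual defaults that the model identifies
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2f}")          # 0.80
print(f"recall on defaults: {recall:.2f}")  # 0.00
```

The accuracy looks reasonable even though the model never identifies a single default, which is the case that matters most for credit risk.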

![Screenshot 2024-05-09 at 10.37.12 pm.png](attachment:56c29b01-07d2-4f5e-a575-13b0fe1b5e5a.png)

In [None]:
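# Under sample
# Look up the class counts by label rather than relying on the order returned by value_counts()
class_counts = df["loan_status"].value_counts()
count_class_0, count_class_1 = class_counts.loc[0], class_counts.loc[1]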

# Divide by class
df_class_0 = df[df["loan_status"] == 0]
df_class_1 = df[df["loan_status"] == 1]

# df_under is the updated under-sampled data frame
df_class_0_under = df_class_0.sample(count_class_1)
df_under = pd.concat([df_class_0_under, df_class_1], axis=0)

print('Random under-sampling:')
print(df_under["loan_status"].value_counts())

df_under["loan_status"].value_counts().plot(kind='bar', title='Count (target)')

In [None]:
# Over sample 
# df_over is the updated over-sampled data frame
df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_over = pd.concat([df_class_0, df_class_1_over], axis=0)

print('Random over-sampling:')
print(df_over["loan_status"].value_counts())

df_over["loan_status"].value_counts().plot(kind='bar', title='Count (target)')

In [None]:
# In this assignment, we use the over-sampled dataset.
# Now we split the dataset into the input X and the target value y.

# Drop the labels from the data frame.
drop_df_over = df_over.drop(['loan_status'], axis=1)
X = drop_df_over.values

# Get the labels separately
y = df_over['loan_status'].values

# Now we split the data into a training dataset and a test dataset.
# We use 80% of the data for training and 20% for testing.
# Note: over-sampling was applied before this split, so copies of the same minority-class row
# can appear in both sets, which may make the test accuracy look better than it really is.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Now we standardize features by removing the mean and scaling to unit variance.
# The scaler is fitted on the training data only and then applied to the test data.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## Task 1: Data Visualisation - (10 Marks)

Visualize at least two other variables in the **over-sampled dataset**.
Note that the column names changed when the categorical variables were converted into dummy/indicator variables.

In [None]:
# your code will go here


In [None]:
# your code will go here


## Task 2: Training a Traditional Machine Learning Model - (25 Marks)

Train a model for predicting loan default using a machine learning algorithm discussed during the lectures **(Week 10 to Week 12)**, or any other appropriate algorithm not covered in the lectures. Which machine learning algorithm did you choose, and why? What is the accuracy of the model on the test set?

In [None]:
# your code will go here


## Task 3: Training a Neural Network-based Model - (25 Marks)

Train an artificial neural network model, specifically a Multi-Layer Perceptron (MLP) classifier, for loan default prediction. What is the accuracy of the model on the test set?

In [None]:
# your code will go here


## Task 4: Model Explainability - (20 Marks)

SHAP importance is calculated at the row level and can be used to understand which features matter for a specific row. The values represent how each feature influences the prediction for a single row relative to the average outcome in the dataset.

It can be used to:

1. Understand which features most influence the predicted outcome.
2. Dive into a feature and understand how the different values of that feature affect the prediction.
3. Understand what is most influential on individual rows or subsets within the data.
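
The idea behind these row-level values can be sketched without the library. The example below computes exact Shapley values for a single row by averaging each feature's marginal contribution over all feature orderings. Everything in it is an assumption made for illustration: `toy_risk_score` is a hypothetical scoring function standing in for a trained model, and the `baseline` and `row` values are invented. "Absent" features are replaced by a fixed baseline value, which is a simplification of what the **shap** library does with background data.

```python
from itertools import permutations


def toy_risk_score(income, loan_amount, interest_rate):
    """Hypothetical scoring function standing in for a trained model."""
    return 0.5 - 0.3 * income + 0.2 * loan_amount + 0.1 * interest_rate * loan_amount


FEATURES = ["income", "loan_amount", "interest_rate"]

# Hypothetical baseline ("average") feature values and the single row to explain
baseline = {"income": 0.0, "loan_amount": 0.0, "interest_rate": 0.0}
row = {"income": 1.0, "loan_amount": 2.0, "interest_rate": 1.5}


def predict(present):
    """Score the row, using baseline values for every feature not in `present`."""
    inputs = {f: (row[f] if f in present else baseline[f]) for f in FEATURES}
    return toy_risk_score(**inputs)


def shapley_values():
    """Average each feature's marginal contribution over all feature orderings."""
    totals = {f: 0.0 for f in FEATURES}
    orderings = list(permutations(FEATURES))
    for order in orderings:
        present = set()
        for feature in order:
            before = predict(present)
            present.add(feature)
            totals[feature] += predict(present) - before
    return {f: total / len(orderings) for f, total in totals.items()}


phi = shapley_values()
for feature, value in phi.items():
    print(f"{feature}: {value:+.3f}")

# Additivity: the contributions sum to (prediction for the row) - (baseline prediction)
print(f"sum of contributions: {sum(phi.values()):.3f}")
print(f"prediction - baseline: {predict(set(FEATURES)) - predict(set()):.3f}")
```

A negative value means the feature pushes this row's score below the baseline, and a positive value pushes it above. Enumerating all orderings is only feasible for a handful of features, which is why the library relies on model-specific or approximate algorithms.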

In this task, you are expected to use SHAP (SHapley Additive exPlanations) to understand the model.

Get the SHAP importance for your traditional machine learning model (Task 2 above) using the **shap** library.

**Make sure you provide a visualization of the SHAP values.**

Based on the visualization, can you explain how each feature contributes to the final prediction of loan default? Does your explanation align with business sense?

In [None]:
# use SHAP (SHapley Additive exPlanations) to understand the model 
import shap

# your code will go here for your traditional machine learning model


***Your analysis and discussion will go here***

## Task 5: Social Implications - (20 Marks)

This task explores the social implications of algorithmic decision making in the context of credit risk prediction. In particular, you will be looking at bias and discrimination. As preparation, read the MIT Technology Review article *Bias isn’t the only problem with credit scores—and no, AI can’t help*, available at: https://www.technologyreview.com/2021/06/17/1026519/racial-bias-noisy-data-credit-scores-mortgage-loans-fairness-machine-learning/

1. **(5 marks)** Summarize the above reading in about 200 words.
2. **(5 marks)** Elaborate on and explain, in about 100 words, the following sentence from this reading: "Incomplete data is troubling because detecting it will require researchers to have a fairly nuanced understanding of societal inequities".
3. **(10 marks)** In the context of the admittedly synthetic credit risk dataset you examined in this assignment, did you notice any possible issues of fairness and discrimination? Describe the steps you would recommend for alleviating such bias in the intended classification model.