## Project Overview
### Train and test classification models on the Default dataset
Before training a model on this dataset, a good understanding of each explanatory variable (feature) is vital.
### Definition of each feature

 - **limit_bal:** The limit balance, also known as the **credit limit**, is the maximum credit granted by the credit card issuer after a card application.
 - **sex:** Sex of the credit card owner: 1 = **male**, 2 = **female**.
 - **education:** Highest level of education of the credit card owner: 1 = graduate school; 2 = university; 3 = high school; 4 = others.
 - **marriage:** Marital status of the credit card owner: 1 = married; 2 = single; 3 = others.
 - **age:** Age of the card owner.
 - **pay_0 to pay_6:** History of past monthly payment records from April (pay_0) to September (pay_6) for each card owner. The columns are pay_0, pay_2, ..., pay_6; there is no pay_1.
    Statuses are 0: paid duly, 1: payment delayed by one month, 2: payment delayed by two months. Negative codes such as -1 and -2 also appear in the data.
 - **bill_amt1 to bill_amt6:** Amount of the bill statement from April (bill_amt1) to September (bill_amt6).
    A **bill statement** is a periodic statement that lists all the payments, purchases and other debits and credits during the billing cycle.
 - **pay_amt1 to pay_amt6:** Amount paid in the previous month, from April (pay_amt1) to September (pay_amt6).
 - **defaulted:** To default means failing to pay a debt by the agreed date. In this case, creditors mostly raise interest rates or decrease the credit limit.
        
Since defaulted is the target variable, suitable models will be trained on the other explanatory variables and tested to see which one predicts with the highest accuracy on the dataset.
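The coded categorical features defined above can be turned back into readable labels for exploration. The snippet below is a minimal sketch of that idea: the three-row `sample` frame is hypothetical (it is not taken from defaults.csv), and the mapping dictionaries simply restate the definitions listed above.

```python
import pandas as pd

# Code-to-label mappings restating the feature definitions above.
SEX = {1: "male", 2: "female"}
EDUCATION = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARRIAGE = {1: "married", 2: "single", 3: "others"}

# Hypothetical three-row sample, not real records from defaults.csv.
sample = pd.DataFrame({"sex": [2, 1, 2], "education": [2, 3, 1], "marriage": [1, 2, 3]})

# Series.map replaces each code with its label; unmapped codes become NaN.
decoded = sample.assign(
    sex=sample["sex"].map(SEX),
    education=sample["education"].map(EDUCATION),
    marriage=sample["marriage"].map(MARRIAGE),
)
print(decoded)
```

Any code that is missing from a dictionary would show up as NaN after mapping, which is a quick way to spot values not covered by the definitions.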


In [69]:
#importing packages needed
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,7)

In [70]:
data = pd.read_csv('defaults.csv')
data.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,defaulted
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [71]:
# check for null values
data.isnull().any().any()

False
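The check above collapses the whole dataframe into a single True/False. If it ever returned True, a per-column count would show where the gaps are. A minimal sketch, using a hypothetical `demo` frame with one missing value rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, for illustration only.
demo = pd.DataFrame({"limit_bal": [20000, 120000, np.nan], "age": [24, 26, 34]})

has_nulls = bool(demo.isnull().any().any())  # one True/False for the whole frame
nulls_per_column = demo.isnull().sum()       # number of missing values per column
print(has_nulls)
print(nulls_per_column)
```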

In [72]:
# Using a Logistic regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# The ID column is kept here because it is the only unique column that can be used to trace a data point that ends up in the train set.
# Since this feature matrix still contains the ID column, it is named X_try.

X_try = data.iloc[:, 0:24].values
y = data.iloc[:, 24].values

'''
 - Splitting the data into train and test sets in the ratio 80:20

 - random_state=0 fixes the seed of the random shuffle, so the same split is reproduced every time the code is run.
'''

Xtrain, Xtest, ytrain, ytest = train_test_split(X_try, y, test_size=0.2, random_state=0)
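
The effect of `random_state` can be checked directly. The sketch below splits a small hypothetical array twice with the same seed and compares the results; `X_demo` and `y_demo` are stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The same rows land in the same sets, in the same order.
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split, a_train.shape, a_test.shape)
```

The guarantee holds only for identical input: changing the number of rows changes the shuffle, even with the same seed.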


In [73]:
type(X_try)

numpy.ndarray

In [74]:
# X_try is a numpy array, so it is converted to a dataframe to make it easier to find records that also occur in the train set.

trial = pd.DataFrame(X_try)
trial.columns = ['id', 'limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5',
                   'bill_amt6', 'pay_amt1',  'pay_amt2',   'pay_amt3', 'pay_amt4',  'pay_amt5',   'pay_amt6'  ]
trial.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6
0,1,20000,2,2,1,24,2,2,-1,-1,...,689,0,0,0,0,689,0,0,0,0
1,2,120000,2,2,2,26,-1,2,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,3,90000,2,2,2,34,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,4,50000,2,2,1,37,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,5,50000,1,2,1,57,-1,0,-1,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [75]:

# Convert the numpy array (Xtrain) to a dataframe to see more easily which records fell into the train set
trial_X = pd.DataFrame(Xtrain)
trial_X.shape

(24000, 24)

In [76]:
# Putting back column names to better visualize the data

trial_X.columns = ['id', 'limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5',
                   'bill_amt6', 'pay_amt1',  'pay_amt2',   'pay_amt3', 'pay_amt4',  'pay_amt5',   'pay_amt6'  ]
trial_X.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6
0,3226,20000,2,3,1,44,0,0,2,0,...,17980,18780,0,0,3000,0,1000,1000,0,0
1,11816,260000,2,2,2,30,-1,-1,-1,-1,...,274,165,333,165,165,274,165,333,165,293
2,7339,20000,1,2,1,39,2,0,0,0,...,19299,19928,20204,20398,1500,1500,900,700,1480,0
3,14981,30000,1,2,1,23,2,2,2,2,...,28635,30127,30525,29793,1800,150,2250,1000,0,700
4,27168,10000,1,2,1,29,0,0,0,0,...,8600,9470,6690,9690,2800,2000,1500,900,3000,0


In [79]:
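'''
To see which records occur in the train set and their corresponding index in the initial dataframe.
Only the first 10 rows of each dataframe are merged to reduce computation time, as just one record is needed to proceed.
Note that the merge key is limit_bal, which is not unique: each output row pairs a record from the initial data
with a train-set record that has the same credit limit, so a match does not by itself prove that both are the same record.
The records with id 1 and 10 are matched this way and are treated as train-set records in what follows.
'''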
common = pd.merge(trial.head(10), trial_X.head(10), how='inner', on=['limit_bal'])
common.head()

Unnamed: 0,id_x,limit_bal,sex_x,education_x,marriage_x,age_x,pay_0_x,pay_2_x,pay_3_x,pay_4_x,...,bill_amt3_y,bill_amt4_y,bill_amt5_y,bill_amt6_y,pay_amt1_y,pay_amt2_y,pay_amt3_y,pay_amt4_y,pay_amt5_y,pay_amt6_y
0,1,20000,2,2,1,24,2,2,-1,-1,...,17980,18780,0,0,3000,0,1000,1000,0,0
1,1,20000,2,2,1,24,2,2,-1,-1,...,19299,19928,20204,20398,1500,1500,900,700,1480,0
2,10,20000,1,3,2,35,-2,-2,-2,-2,...,17980,18780,0,0,3000,0,1000,1000,0,0
3,10,20000,1,3,2,35,-2,-2,-2,-2,...,19299,19928,20204,20398,1500,1500,900,700,1480,0
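
Because several card owners can share a credit limit, a join on limit_bal can pair different records. A membership test on the unique id column is one way to avoid that. The sketch below uses two small hypothetical frames, `full` (standing in for trial) and `train` (standing in for trial_X), not the real data.

```python
import pandas as pd

# Hypothetical ids: 'full' stands in for trial, 'train' for trial_X.
full = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                     "limit_bal": [20000, 120000, 90000, 50000, 50000]})
train = pd.DataFrame({"id": [4, 2, 5],
                      "limit_bal": [50000, 120000, 50000]})

# True where the record's id appears among the train-set ids.
full["in_train"] = full["id"].isin(train["id"])
print(full)
```

With the notebook's frames, the same pattern would be `trial["id"].isin(trial_X["id"])`, which checks all rows at once without a merge.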


 - The model will first be trained and tested on the data with all the records.
 - It will then be trained and tested with the record at index number 1, which is taken to be in the train set, omitted.
 - The accuracy scores of both models will be printed to see the influence this record has on the test accuracy score.

In [80]:
# The ID column is not needed for prediction, so it is dropped here and the split is repeated with the same train/test ratio and the same random_state.

X = data.iloc[:, 1:24].values
y = data.iloc[:, 24].values

# Splitting the data into train and test set in the ratio 80:20
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)


# logistic regression classifier with regularization set to l1.
clf_lr = LogisticRegression(penalty='l1', solver='liblinear')

clf_lr.fit(Xtrain, ytrain)
     
ypred_test = clf_lr.predict(Xtest)

# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
accuracy_score1 = accuracy_score(ytest, ypred_test)
print('[Test] Accuracy score with 1st record is : ', round(accuracy_score1, 4))

[Test] Accuracy score with 1st record is :  0.8193


In [81]:

# The row with ID=1 (the first row of data) is dropped before repeating the computation.

sub_data = data.drop(data.index[0])
print('Sub data size is: ', sub_data.shape)

X = sub_data.iloc[:, 1:24].values
y = sub_data.iloc[:, 24].values

# Splitting the data into train and test set in the ratio 80:20
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

# Convert the new train array to a dataframe and check its shape: it should hold one row fewer than before
trial_X = pd.DataFrame(Xtrain)
print('train set size when the 1st record is removed:', trial_X.shape)

Sub data size is:  (29999, 25)
train set size when the 1st record is removed: (23999, 23)


### Observations from above implementation
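Since the first record is taken to be in the train set after the train/test split, it is dropped from the initial data at this point.  
Converting the numpy array Xtrain into a dataframe shows that the train set has shrunk by one row (23999 instead of 24000). This confirms only the size of the new split: the data was re-split after the drop, so the new train and test sets are not guaranteed to contain the same records as before.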
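
The train-set sizes seen above follow from simple arithmetic. The sketch below assumes that, for a fractional test_size, the test count is rounded up and the remainder goes to the train set, which is my understanding of how train_test_split behaves. The helper `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test), assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(30000))  # all records
print(split_sizes(29999))  # after dropping one record
```

Under this assumption the test set keeps 6000 rows in both cases, so removing any single row from the data lowers the train count by one, whichever set that row originally belonged to.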

In [82]:
'''
The dataframe trial_X is converted back to a numpy array (Xtrain_drop) before fitting.
scikit-learn estimators also accept dataframes, so this conversion is optional.
Xtrain_drop holds the train set obtained after the first record was dropped from the data.
'''

Xtrain_drop = trial_X.to_numpy()
Xtrain_drop

array([[200000,      2,      2, ...,   7521,   2000,   3000],
       [110000,      2,      2, ...,   3000,   4000,   4000],
       [180000,      2,      2, ...,      0,      0,      0],
       ...,
       [200000,      2,      1, ...,   1387,  20057,  51281],
       [130000,      1,      2, ...,   1016,   1026,   1049],
       [170000,      2,      2, ...,     97,   1002,   1018]])

In [83]:
# logistic regression classifier with l1 regularization, trained after the first record has been removed.
clf_lr = LogisticRegression(penalty='l1', solver='liblinear')

clf_lr.fit(Xtrain_drop, ytrain)

ypred_test_drop = clf_lr.predict(Xtest)

# accuracy_score expects (y_true, y_pred)
accuracy_score2 = accuracy_score(ytest, ypred_test_drop)
print('[Test] Accuracy score with 1st record removed : ', round(accuracy_score2, 4))

[Test] Accuracy score with 1st record removed :  0.813


In [84]:
# test accuracy difference 
test_diff = accuracy_score1 - accuracy_score2
print('Accuracy score difference of the test set with and without the 1st record is: ', round(test_diff, 4))

Accuracy score difference of the test set with and without the 1st record is:  0.0063
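
To make the difference more concrete, it can be expressed as a number of test predictions. The sketch below assumes each test set holds 6000 rows (20% of roughly 30000 records) and uses the rounded accuracies printed above.

```python
# Rounded test accuracies reported by the two runs above.
acc_with, acc_without = 0.8193, 0.813
n_test = 6000  # assumption: 20% of roughly 30000 records

# Convert each accuracy into a count of correct test predictions.
correct_with = round(acc_with * n_test)
correct_without = round(acc_without * n_test)
print(correct_with - correct_without)
```

Under these assumptions the gap corresponds to a few dozen test predictions out of 6000.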


### Observations from above
 - With the first record in place, the model produces a test accuracy score of 0.8193.
 - With the first record dropped, the test accuracy score changes to 0.813, even though this data point is not taken to belong to the test set.
 - Dropping this data point reduces the test accuracy by 0.63 percentage points. This difference may seem small but can, in the long run, cause many misclassifications when the model is used. Because the data was re-split after the drop, part of the difference may come from the changed train/test partition rather than from the single record.
 - This highlights the importance of each data point when training a model, as the accuracy on the test set depends on how well the model was trained on the given data.