## Project Overview
### Train and test classification models on the Default dataset
Before training a model on this dataset, a good understanding of each explanatory variable (feature) is vital.
### Definition of each feature

 - **limit_bal:** Limit balance, also known as the **credit limit**, is the maximum credit the card issuer grants after a credit card application.
 - **sex:** Sex of the credit card owner, which is either 1 for **Male** or 2 for **Female**.
 - **education:** Highest level of education of the credit card owner, where 1 = graduate school; 2 = university; 3 = high school; 4 = others.
 - **marriage:** Marital status of the credit card owner, where 1 = married; 2 = single; 3 = others.
 - **age:** Age of the card owner.
 - **pay_0 to pay_6:** History of past monthly payment records from April (pay_0) to September (pay_6) for each card owner.
    Statuses are 0: paid duly, 1: payment delayed by one month, 2: payment delayed by two months.
 - **bill_amt1 to bill_amt6:** Amount of the bill statement from April (bill_amt1) to September (bill_amt6).
    A **bill statement** is a periodic statement that lists all the payments, purchases and other debits and credits during the billing cycle.
 - **pay_amt1 to pay_amt6:** Amount paid in the previous month, from April (pay_amt1) to September (pay_amt6).
 - **defaulted:** To default means failing to pay a debt by the agreed-upon date. In this case, creditors mostly raise interest rates or decrease the credit limit.

Since defaulted is the target variable, suitable models will be trained on the other explanatory variables and tested to see which one predicts with the highest accuracy on the dataset.
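The coded categorical features defined above can be turned into readable labels before exploring the data. The following is a minimal sketch, assuming only the code-to-label mappings listed in the feature definitions; the three-row `sample` frame is hypothetical and stands in for the real dataset.

```python
import pandas as pd

# Code-to-label mappings taken from the feature definitions above.
SEX = {1: "male", 2: "female"}
EDUCATION = {1: "graduate school", 2: "university", 3: "high school", 4: "others"}
MARRIAGE = {1: "married", 2: "single", 3: "others"}

# Hypothetical three-row sample standing in for the real dataset.
sample = pd.DataFrame({
    "sex": [2, 2, 1],
    "education": [2, 2, 1],
    "marriage": [1, 2, 3],
})

# Series.map replaces each code with its label; codes missing from a
# mapping become NaN, which makes undocumented codes easy to spot.
decoded = sample.assign(
    sex=sample["sex"].map(SEX),
    education=sample["education"].map(EDUCATION),
    marriage=sample["marriage"].map(MARRIAGE),
)
print(decoded)
```

Decoding is only for inspection; the models below are still trained on the numeric codes.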


In [50]:
#importing packages needed
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  
plt.rcParams["figure.figsize"] = (10,7)

In [51]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r'..\..\datasets\defaults.csv')
data.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,defaulted
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [52]:
# check for null values
data.isnull().any().any()

False

The pandas method isnull() flags every null (undefined or empty) value; chaining .any().any() reduces those flags to a single True/False for the whole dataset. 
Here the result is False, so there are no null values and we can proceed to the next step.
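When the check does return True, it helps to know which feature holds the missing values. A minimal sketch on a hypothetical toy frame (not the real dataset) shows both the single-flag check and a per-column count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in limit_bal.
toy = pd.DataFrame({
    "limit_bal": [20000, np.nan, 90000],
    "age": [24, 26, 34],
})

has_nulls = toy.isnull().any().any()   # single True/False for the whole frame
nulls_per_column = toy.isnull().sum()  # number of missing values in each column

print(has_nulls)
print(nulls_per_column)
```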

- A logistic regression model is trained and tested on this data with and without a single data point.
- The difference in accuracy score is printed to show the influence of this data point on the accuracy of the model as a whole.

In [53]:
'''
Dataset contains total 30,000 records.
Considering the record with index 99. 
Accuracy score with and without this record will be printed out.
''' 

sub_data = data.drop(data.index[99])
# The record at index 99 falls in the training set; dropping it lets us see its influence on the test accuracy

print('data: ', data.shape)
print('sub_data: ', sub_data.shape)

data:  (30000, 25)
sub_data:  (29999, 25)
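The expression `data.drop(data.index[99])` looks up the label at position 99 and removes the row with that label, leaving the original frame untouched. A small sketch on a hypothetical five-row frame illustrates the behaviour:

```python
import pandas as pd

# Hypothetical five-row frame with the default 0..4 index.
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5]})

# toy.index[2] is the label of the third row; drop returns a new frame.
sub = toy.drop(toy.index[2])

print(toy.shape, sub.shape)
print(sub.index.tolist())  # the dropped label leaves a gap in the index
```

The gap in the index does not matter for the cells that follow, because `iloc` selects by position rather than by label.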


In [65]:
# Using a Logistic regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = data.iloc[:, 1:24].values
y = data.iloc[:, 24].values

# Splitting the data into train and test set in the ratio 80:20
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=10)
print("Accuracy of the logistic regression model on the test set with the 100th record included is" )

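# Logistic regression classifier; with no penalty argument, scikit-learn applies its default L2 regularization.
clf_lr = LogisticRegression(solver='liblinear')

clf_lr.fit(Xtrain, ytrain)

# Predict with the model fitted above
ypred_test = clf_lr.predict(Xtest)

# accuracy_score expects the true labels first, then the predictions
accuracy_score1 = accuracy_score(ytest, ypred_test)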
print('[Test] Accuracy score is: ', accuracy_score1)

Accuracy of the logistic regression model on the test set with the 100th record included is
[Test] Accuracy score is:  0.813
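Accuracy is the fraction of test labels that the model predicts correctly. As a minimal sketch with made-up labels (not taken from the dataset), it can be computed by hand:

```python
import numpy as np

# Made-up labels for illustration only.
y_true = np.array([1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0])

# Fraction of positions where prediction and truth agree.
accuracy = float(np.mean(y_true == y_pred))

# The comparison is symmetric, so swapping the arguments gives the same value.
accuracy_swapped = float(np.mean(y_pred == y_true))

print(accuracy, accuracy_swapped)
```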


In [66]:
# Without this record in the train set

X = sub_data.iloc[:, 1:24].values
y = sub_data.iloc[:, 24].values

# Splitting the data into train and test set in the ratio 80:20
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=10)
print("Accuracy of the logistic regression model on the test set without the 100th record included is" )

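# Logistic regression classifier; with no penalty argument, scikit-learn applies its default L2 regularization.
clf_lr = LogisticRegression(solver='liblinear')

clf_lr.fit(Xtrain, ytrain)

# Predict with the model fitted above; accuracy_score expects the true labels first
y_pred_test = clf_lr.predict(Xtest)
accuracy_score2 = accuracy_score(ytest, y_pred_test)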
print('[Test] Accuracy score is: ', accuracy_score2)

Accuracy of the logistic regression model on the test set without the 100th record included is
[Test] Accuracy score is:  0.8145


In [70]:
# test accuracy difference 
test_diff = accuracy_score2 - accuracy_score1
print('Accuracy score difference of the test set with and without the 100th record is: ', round(test_diff, 4))

Accuracy score difference of the test set with and without the 100th record is:  0.0015
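To put the difference in perspective, it can be converted into a number of test records. This is a rough sketch that assumes the test-set size is 20% of the rows rounded up (6,000 records in both runs); the accuracy values are the ones printed above.

```python
import math

# Assumed test-set sizes: 20% of the rows, rounded up.
n_test_full = math.ceil(30000 * 20 / 100)
n_test_sub = math.ceil(29999 * 20 / 100)

# Correct predictions implied by the printed accuracy scores.
correct_with = round(0.813 * n_test_full)
correct_without = round(0.8145 * n_test_sub)

print(n_test_full, n_test_sub)
print(correct_without - correct_with)
```

Under these assumptions, the 0.0015 gap corresponds to a handful of test predictions out of several thousand.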


## Observations from above
 - With the data point at index 99 (part of the training set) kept, the model reaches an accuracy score of 0.813 on the test set.
 - After dropping the record at index 99, the accuracy score on the test data increases to 0.8145.
 - The difference of 0.0015 means that removing this single record changes test accuracy by 0.15 percentage points. Because the split is recomputed on 29,999 rows, the two runs are not guaranteed to share the same train and test sets, so part of this difference may come from the split rather than from the record itself.
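The cells above drop the record before splitting, so the split itself can change between the two runs. One way to isolate the effect of a single training record is to split once and remove the record from the training set only. The sketch below illustrates this on synthetic data from `make_classification`; the data, the helper `score_on_test`, and the dropped position are all assumptions for illustration, not the notebook's actual dataset or results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data (an assumption, not the real dataset).
X, y = make_classification(n_samples=1000, n_features=10, random_state=10)

# Split once so that both models are scored on the same test set.
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=10)

def score_on_test(X_fit, y_fit):
    """Fit a logistic regression on the given rows and score it on the fixed test set."""
    model = LogisticRegression(solver="liblinear")
    model.fit(X_fit, y_fit)
    return accuracy_score(ytest, model.predict(Xtest))

drop_pos = 99  # position within the training set, chosen for illustration
keep = np.arange(len(Xtrain)) != drop_pos

acc_with = score_on_test(Xtrain, ytrain)
acc_without = score_on_test(Xtrain[keep], ytrain[keep])

print("with record:   ", acc_with)
print("without record:", acc_without)
print("difference:    ", round(acc_without - acc_with, 4))
```

With this setup, any change in test accuracy can be attributed to the removed training record, since the test set and all other training rows are identical in both fits.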