## Project Overview
### Train and test one or more classification models on the Default dataset.
Before training a model on this dataset, it is vital to understand each explanatory variable (feature).
### Definition of each feature

 - **limit_bal:** Limit balance, also known as the **credit limit**: the limit the credit card issuer sets after a credit card application.
 - **sex:** Sex of the credit card owner, which is either 1 for **Male** or 2 for **Female**.
 - **education:** Highest level of education of the credit card owner, where 1 = graduate school; 2 = university; 3 = high school; 4 = others.
 - **marriage:** Marital status of the credit card owner, where 1 = married; 2 = single; 3 = others.
 - **age:** Age of the card owner.
 - **pay_0 to pay_6:** History of past monthly payment records from April (pay_0) to September (pay_6) for each card owner. The columns are pay_0 and pay_2 through pay_6; there is no pay_1.
    Statuses are 0: paid duly, 1: payment delayed by one month, 2: payment delayed by two months.
 - **bill_amt1 to bill_amt6:** Amount of the bill statement from April (bill_amt1) to September (bill_amt6).
    A **bill statement** is a periodic statement that lists all the payments, purchases and other debits and credits during the billing cycle.
 - **pay_amt1 to pay_amt6:** Amount paid in the previous month, from April (pay_amt1) to September (pay_amt6).
 - **defaulted:** To default means failing to pay a debt by the agreed-upon date. In this case, creditors mostly raise interest rates or decrease the credit limit.
        
Since `defaulted` is the target variable, suitable models will be trained on the explanatory variables and tested to see which one predicts it with the highest accuracy on the dataset.
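
Because accuracy is the metric used to compare models, it can help to look first at how the coded features and the target are distributed. The sketch below is only an illustration on a hypothetical toy frame (`toy` and its values are made up, not taken from `defaults.csv`); it counts the codes in one categorical column and computes the accuracy that always predicting the majority class would reach.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "sex": [1, 2, 2, 1, 2, 2],
    "education": [1, 2, 2, 3, 2, 4],
    "defaulted": [0, 0, 1, 0, 0, 1],
})

# Count how often each code appears in a categorical column.
education_counts = toy["education"].value_counts().sort_index()
print(education_counts)

# Share of each class in the target. The largest share is the accuracy
# obtained by always predicting the majority class.
class_share = toy["defaulted"].value_counts(normalize=True)
majority_baseline = class_share.max()
print(class_share)
print("Majority-class baseline accuracy:", round(majority_baseline, 4))
```

A model's test accuracy is easier to interpret when it is read against this baseline.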


In [1]:
#importing packages needed
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10,7)

In [2]:
data = pd.read_csv('defaults.csv')
data.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6,defaulted
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [3]:
# check for null values
data.isnull().any().any()

False
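
The check above returns a single boolean for the whole frame. If it ever returned `True`, a per-column count would show where the gaps are. A minimal sketch on a hypothetical toy frame (the missing value is inserted on purpose; it is not from the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value.
toy = pd.DataFrame({
    "limit_bal": [20000, np.nan, 90000],
    "age": [24, 26, 34],
})

# Single yes/no answer for the whole frame, as in the cell above.
has_any_null = toy.isnull().any().any()

# Number of missing values in each column.
null_counts = toy.isnull().sum()

print(has_any_null)
print(null_counts)
```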

In [9]:
# Using a Logistic regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# The id column is kept for now because it is the only unique column, so it can be used to trace which records end up in the train set.
# Because it still contains id, the feature matrix is named X_try.

X_try = data.iloc[:, 0:24].values
y = data.iloc[:, 24].values

'''
 - Split the data into train and test sets in the ratio 80:20.

 - random_state=0 fixes the seed of the shuffle, so the same split is produced every time the code is run on the same data.
'''

Xtrain, Xtest, ytrain, ytest = train_test_split(X_try, y, test_size=0.2, random_state=0)
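
The effect of `random_state` can be checked directly. The sketch below uses small made-up arrays (`X_demo` and `y_demo` are assumptions, not the notebook's data) and splits them twice with the same seed to confirm that both calls return identical train and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up arrays used only to illustrate the behaviour of random_state.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed and the same input.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Compare Xtrain, Xtest, ytrain, ytest pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print("Identical splits:", same_split)
print("Train rows:", len(split_a[0]), "Test rows:", len(split_a[1]))
```

The guarantee only covers identical input: if the number of rows changes, the same seed is not guaranteed to put the same records in the test set.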


In [5]:
type(X_try)

numpy.ndarray

In [24]:
# X_try is a NumPy array, so it is converted to a DataFrame to make it easy to track which of its records also occur in the train set.

trial = pd.DataFrame(X_try)
trial.columns = ['id', 'limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5',
                   'bill_amt6', 'pay_amt1',  'pay_amt2',   'pay_amt3', 'pay_amt4',  'pay_amt5',   'pay_amt6'  ]
trial.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6
0,1,20000,2,2,1,24,2,2,-1,-1,...,689,0,0,0,0,689,0,0,0,0
1,2,120000,2,2,2,26,-1,2,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,3,90000,2,2,2,34,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,4,50000,2,2,1,37,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,5,50000,1,2,1,57,-1,0,-1,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [25]:


# Convert the NumPy array Xtrain to a DataFrame to see which records fell into the train set
trial_X = pd.DataFrame(Xtrain)
# Restore the column names to make the data easier to read

trial_X.columns = ['id', 'limit_bal', 'sex', 'education', 'marriage', 'age', 'pay_0', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6', 'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5',
                   'bill_amt6', 'pay_amt1',  'pay_amt2',   'pay_amt3', 'pay_amt4',  'pay_amt5',   'pay_amt6'  ]
trial_X.head()

Unnamed: 0,id,limit_bal,sex,education,marriage,age,pay_0,pay_2,pay_3,pay_4,...,bill_amt3,bill_amt4,bill_amt5,bill_amt6,pay_amt1,pay_amt2,pay_amt3,pay_amt4,pay_amt5,pay_amt6
0,3226,20000,2,3,1,44,0,0,2,0,...,17980,18780,0,0,3000,0,1000,1000,0,0
1,11816,260000,2,2,2,30,-1,-1,-1,-1,...,274,165,333,165,165,274,165,333,165,293
2,7339,20000,1,2,1,39,2,0,0,0,...,19299,19928,20204,20398,1500,1500,900,700,1480,0
3,14981,30000,1,2,1,23,2,2,2,2,...,28635,30127,30525,29793,1800,150,2250,1000,0,700
4,27168,10000,1,2,1,29,0,0,0,0,...,8600,9470,6690,9690,2800,2000,1500,900,3000,0


In [28]:
'''
An inner join of the initial dataframe and the train dataframe shows which data
points fall into the train set after splitting the data 80:20 with a random state of 0.
id is used as the common key between them. Only the first 500 rows of each dataframe are compared.
'''
common = pd.merge(trial.head(500), trial_X.head(500), how='inner', on=['id'])
common.head(10)

Unnamed: 0,id,limit_bal_x,sex_x,education_x,marriage_x,age_x,pay_0_x,pay_2_x,pay_3_x,pay_4_x,...,bill_amt3_y,bill_amt4_y,bill_amt5_y,bill_amt6_y,pay_amt1_y,pay_amt2_y,pay_amt3_y,pay_amt4_y,pay_amt5_y,pay_amt6_y
0,65,130000,2,2,1,51,-1,-1,-2,-2,...,0,0,2353,0,0,0,0,2353,0,0
1,160,50000,1,3,1,57,3,2,0,0,...,13447,13427,13711,14083,0,1600,500,500,600,600
2,182,80000,2,3,2,35,0,-1,0,0,...,14873,17364,17770,17460,12500,6500,3000,2000,3000,2000
3,372,160000,1,1,2,30,-1,-1,-1,-1,...,15086,8578,13028,21712,2977,15086,9123,13028,29712,50000
4,378,140000,2,1,2,28,-1,0,0,-1,...,7609,4991,3400,3745,14000,3855,4991,3600,5500,4000
5,453,260000,1,2,2,37,0,0,0,-1,...,36638,122388,127402,131074,14000,5022,130000,7000,6000,6000
6,482,140000,1,2,2,26,0,0,2,2,...,140202,144035,140419,130271,17000,0,11400,550,5300,5400
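
As an alternative to rebuilding dataframes from NumPy arrays and joining them, the split can be run on a DataFrame directly, since `train_test_split` keeps the original index and columns of pandas inputs. This is only a sketch on a hypothetical ten-row frame (`toy` and its values are made up), not the approach used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame; the values are made up for illustration.
toy = pd.DataFrame({
    "id": range(1, 11),
    "limit_bal": [20000, 120000, 90000, 50000, 50000,
                  50000, 500000, 100000, 140000, 20000],
    "defaulted": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

features = toy.drop(columns="defaulted")
target = toy["defaulted"]

# Splitting pandas objects returns pandas objects with the original index.
X_train_df, X_test_df, y_train_s, y_test_s = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Flag which records of the full frame landed in the train set.
in_train = toy["id"].isin(X_train_df["id"])
print(toy.loc[in_train, "id"].tolist())
```

The `id` column would still need to be dropped from the features before fitting a model.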


- Each of these records will be dropped in turn to test its influence on the test accuracy score.
- After each drop, the test accuracy is recomputed and compared with the original score to see whether it decreased or increased.

In [44]:
# The id column is not needed for prediction, so it is dropped here and the data is split again with the same train/test ratio and the same random_state.

X = data.iloc[:, 1:24].values
y = data.iloc[:, 24].values

# Splitting the data into train and test set in the ratio 80:20
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)


# logistic regression classifier with regularization set to l1.
clf_lr = LogisticRegression(penalty='l1', solver='liblinear')

clf_lr.fit(Xtrain, ytrain)
     
ypred_test = clf_lr.predict(Xtest)

accuracy_score1 = accuracy_score(ytest, ypred_test)  # arguments are (y_true, y_pred)
print('[Test] Accuracy score with all the records included : ', round(accuracy_score1, 4))

[Test] Accuracy score with all the records included :  0.8193


In [39]:
# Each common record from the inner join above is dropped in turn to see how it influences the test accuracy.
# Records are dropped by row position (id - 1) instead of by id.

drop_records = [64, 159, 181, 371, 377, 452, 481]
accuracy_score1 = 0.8193  # test accuracy with all records included, from the run above
score = []
diff = []

for i in drop_records:
    
    sub_data = data.drop(data.index[i])
    
    X = sub_data.iloc[:, 1:24].values
    y = sub_data.iloc[:, 24].values

    # Splitting the data into train and test set in the ratio 80:20
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=0)

    # logistic regression classifier with regularization set to l1.
    clf_lr = LogisticRegression(penalty='l1', solver='liblinear')

    clf_lr.fit(Xtrain, ytrain)

    ypred_test_drop = clf_lr.predict(Xtest)

    accuracy_score2 = accuracy_score(ytest, ypred_test_drop)  # arguments are (y_true, y_pred)
    score.append(round(accuracy_score2, 4))
    
    # Difference in test accuracy; a positive value means accuracy fell after the drop
    test_diff = accuracy_score1 - accuracy_score2
    diff.append(round(test_diff, 4))


In [43]:
output = pd.DataFrame({'Dropped record': drop_records, 'Initial test accuracy': accuracy_score1, 'New test accuracy': score, 'Difference': diff})
print(output)

   Dropped record  Initial test accuracy  New test accuracy  Difference
0              64                 0.8193             0.8128      0.0065
1             159                 0.8193             0.8138      0.0055
2             181                 0.8193             0.8137      0.0056
3             371                 0.8193             0.8135      0.0058
4             377                 0.8193             0.8133      0.0060
5             452                 0.8193             0.8130      0.0063
6             481                 0.8193             0.8130      0.0063


## Observations from the table above
- Seven records from the train set were dropped one at a time, and in every case the test accuracy was lower than the initial 0.8193, even though the records sit at different index positions.
- The size of the decrease varied only slightly, from 0.0055 to 0.0065.
- In this experiment, dropping a record therefore coincided with slightly worse performance on the test set, which suggests that the records used for training help the model when it is tested.
- Two caveats apply: seven records are a small sample, and because the data is split again after each drop, the train and test sets are not guaranteed to match those of the initial run, so part of the difference may come from the split rather than from the dropped record.
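
One way to isolate the influence of a single training record is to split once, keep the test set fixed, and remove records from the train set only. The sketch below illustrates that idea under stated assumptions: it uses synthetic data from `make_classification` instead of the Default dataset, a `LogisticRegression` with default settings instead of the L1-penalized model above, and hypothetical row positions in `rows_to_drop`. The helper `fit_and_score` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: not the Default dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Split once; the test set stays the same for every comparison.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)


def fit_and_score(X_fit, y_fit):
    """Fit on the given training rows and score on the fixed test set."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_fit, y_fit)
    return accuracy_score(y_te, model.predict(X_te))


baseline = fit_and_score(X_tr, y_tr)

rows_to_drop = [0, 1, 2]  # hypothetical positions within the train set
differences = []
for row in rows_to_drop:
    keep = np.ones(len(X_tr), dtype=bool)
    keep[row] = False  # leave out exactly one training record
    differences.append(baseline - fit_and_score(X_tr[keep], y_tr[keep]))

print("Baseline test accuracy:", round(baseline, 4))
print("Baseline minus accuracy after each drop:", np.round(differences, 4))
```

With this setup, any change in accuracy comes from the missing training record, because the test rows are identical in every run.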