<h1>Loan Status Prediction</h1>

Credit risk models are one of the most successful applications of data science.
Given a set of information about a user, such a model predicts how 'reliable' that user is in terms of financial stability.
These models have also raised many discussions concerning privacy, racial profiling and related issues.

We will be using an anonymized dataset that consists of a list of a bank's users.
You may find the dataset [here](https://www.kaggle.com/zaurbegiev/my-dataset).

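Let us first install the libraries we need: scikit-learn, NumPy, pandas and matplotlib (note that the PyPI package is named `scikit-learn`, not `sklearn`):

In [1]:
!pip install scikit-learn
!pip install numpy pandas matplotlib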



Now let's import all the required libraries.

In [17]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.impute import SimpleImputer as Imputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
# Silence all warnings (e.g. convergence and deprecation warnings) to keep the output readable
from warnings import filterwarnings
filterwarnings('ignore')

In [18]:
# Load the data to a pandas dataframe
df = pd.read_csv("credit_train.csv")
df.head()

Unnamed: 0,Loan ID,Customer ID,Loan Status,Current Loan Amount,Term,Credit Score,Annual Income,Years in current job,Home Ownership,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts,Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens
0,14dd8831-6af5-400b-83ec-68e61888a048,981165ec-3274-42f5-a3b4-d104041a9ca9,Fully Paid,445412.0,Short Term,709.0,1167493.0,8 years,Home Mortgage,Home Improvements,5214.74,17.2,,6.0,1.0,228190.0,416746.0,1.0,0.0
1,4771cc26-131a-45db-b5aa-537ea4ba5342,2de017a3-2e01-49cb-a581-08169e83be29,Fully Paid,262328.0,Short Term,,,10+ years,Home Mortgage,Debt Consolidation,33295.98,21.1,8.0,35.0,0.0,229976.0,850784.0,0.0,0.0
2,4eed4e6a-aa2f-4c91-8651-ce984ee8fb26,5efb2b2b-bf11-4dfd-a572-3761a2694725,Fully Paid,99999999.0,Short Term,741.0,2231892.0,8 years,Own Home,Debt Consolidation,29200.53,14.9,29.0,18.0,1.0,297996.0,750090.0,0.0,0.0
3,77598f7b-32e7-4e3b-a6e5-06ba0d98fe8a,e777faab-98ae-45af-9a86-7ce5b33b1011,Fully Paid,347666.0,Long Term,721.0,806949.0,3 years,Own Home,Debt Consolidation,8741.9,12.0,,9.0,0.0,256329.0,386958.0,0.0,0.0
4,d4062e70-befa-4995-8643-a0de73938182,81536ad9-5ccf-4eb8-befb-47a4d608658e,Fully Paid,176220.0,Short Term,,,5 years,Rent,Debt Consolidation,20639.7,6.1,,15.0,0.0,253460.0,427174.0,0.0,0.0
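
The preview shows empty cells in `Credit Score`, `Annual Income` and `Months since last delinquent`. The cleaning function below fills such gaps with `SimpleImputer`, whose default strategy replaces each missing value with the mean of its column. A minimal sketch of that behaviour, using only the `Credit Score` and `Annual Income` values of the five preview rows (the array name `preview` is just for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Credit Score and Annual Income of the five preview rows (NaN = missing)
preview = np.array([
    [709.0, 1167493.0],
    [np.nan, np.nan],
    [741.0, 2231892.0],
    [721.0, 806949.0],
    [np.nan, np.nan],
])

# Default strategy is "mean": each NaN is replaced by the mean of its column
filled = SimpleImputer().fit_transform(preview)
print(filled)
```

On the full dataset the column means differ from these five-row means, which is why the imputed values in the cleaned dataframe are not the same as in this toy example.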


In [19]:
def siivoaData(data, slice=1.0):
    """Clean the data ("siivoa data") and return normalized and raw feature sets."""

    # Which variable are we interested in?
    # We of course want to understand the loan status variable, "Loan Status".
    # This dichotomous variable describes whether the loan is paid back or not.

    # Do we want to drop any variables?
    poistettavatMuuttujat = ['Loan ID','Customer ID']
    data = data.drop(poistettavatMuuttujat, axis=1)

    # Fill missing data points with the column means
    sarakkeet =['Current Loan Amount','Credit Score','Annual Income','Years of Credit History',
            'Months since last delinquent','Number of Open Accounts','Number of Credit Problems',
           'Current Credit Balance','Maximum Open Credit','Bankruptcies','Tax Liens']
    muuttujanTäyttäjä = Imputer()
    data[sarakkeet] = muuttujanTäyttäjä.fit_transform(data[sarakkeet])
    data[sarakkeet] = data[sarakkeet].astype(int)

    # Finally, drop the rows that still contain NaN
    data=data.dropna()

    # Note: select only some of the data for testing purposes!
    # .loc slices by index label, inclusive of mid_point
    if slice > 0 and slice < 1:
        mid_point = int(len(data)*slice)
        data = data.loc[:mid_point]


    # Select the variable we are interested in and encode it (1 = 'Fully Paid', 0 = otherwise)
    y = data['Loan Status']
    new_y = []
    for i in y:
        if i == 'Fully Paid':
            new_y.append(1)
        else:
            new_y.append(0)
    data = data.drop('Loan Status', axis=1)

    # Encode the categorical variables (one-hot)
    data = pd.get_dummies(data)
    #print(data.head())

    # Normalize the data
    # We also return dataMean and dataDev in case we want to feed new observations to the model
    dataMean = np.mean(data, axis=0)
    dataDev = np.std(data, axis=0)
    norm_x= (data - dataMean) / dataDev

    x = data.values # convert to a NumPy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    normMinMax = pd.DataFrame(x_scaled)

    return norm_x, normMinMax, data, new_y, dataMean, dataDev

xNorm, xMinMax, xNoNorm, y, xMean, xDev = siivoaData(df, slice=0.25)
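
The function returns two normalized versions of the features: a z-score version computed from `dataMean` and `dataDev`, and a min-max version scaled to the range [0, 1]. The following is a small sketch of both transforms on a toy array, and of how the stored mean and deviation could be reused to scale a new observation. The names `toy` and `new_row` and their values are made up for illustration.

```python
import numpy as np

# Toy feature matrix: 3 observations, 2 features (illustrative values)
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 600.0]])

# Z-score standardization: zero mean and unit standard deviation per column
mean = toy.mean(axis=0)
dev = toy.std(axis=0)
z = (toy - mean) / dev

# Min-max scaling: each column mapped to the range [0, 1]
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
mm = (toy - col_min) / (col_max - col_min)

# A new observation is scaled with the statistics of the original data
new_row = np.array([2.5, 300.0])
new_row_z = (new_row - mean) / dev
print(z, mm, new_row_z, sep="\n")
```

Both transforms put the columns on comparable scales, which matters for classifiers that are sensitive to feature magnitude.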

In [20]:
# Check what the xNorm, xMinMax and xNoNorm dataframes look like
xNoNorm

Unnamed: 0,Current Loan Amount,Credit Score,Annual Income,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts,Number of Credit Problems,Current Credit Balance,Maximum Open Credit,...,Purpose_Medical Bills,Purpose_Other,Purpose_Take a Trip,Purpose_major_purchase,Purpose_moving,Purpose_other,Purpose_renewable_energy,Purpose_small_business,Purpose_vacation,Purpose_wedding
0,445412,709,1167493,5214.74,17,34,6,1,228190,416746,...,0,0,0,0,0,0,0,0,0,0
1,262328,1076,1378276,33295.98,21,8,35,0,229976,850784,...,0,0,0,0,0,0,0,0,0,0
2,99999999,741,2231892,29200.53,14,29,18,1,297996,750090,...,0,0,0,0,0,0,0,0,0,0
3,347666,721,806949,8741.90,12,34,9,0,256329,386958,...,0,0,0,0,0,0,0,0,0,0
4,176220,1076,1378276,20639.70,6,34,15,0,253460,427174,...,0,0,0,0,0,0,0,0,0,0
5,206602,7290,896857,16367.74,17,34,6,0,215308,272448,...,0,0,0,0,0,0,0,0,0,0
6,217646,730,1184194,10855.08,19,10,13,1,122170,272052,...,0,0,0,0,0,0,0,0,0,0
7,648714,1076,1378276,14806.13,8,8,15,0,193306,864204,...,0,0,0,0,0,0,0,0,0,0
8,548746,678,2559110,18660.28,22,33,4,0,437171,555038,...,0,0,0,0,0,0,0,0,0,0
9,215952,739,1454735,39277.75,13,34,20,0,669560,1021460,...,0,0,0,0,0,0,0,0,0,0


Next, we compare the predictive performance obtained with the raw features against the two normalized versions (z-score standardization and min-max scaling).

In [51]:
cases = []
# Define the three cases we will study: #1 standardized, #2 min-max scaled, #3 raw features
for x in [xNorm, xMinMax, xNoNorm]:
    case = {}
    case['x_train'], case['x_test'], case['y_train'], case['y_test'] = train_test_split(x, y, test_size= 0.25, random_state=13)
    cases.append(case)

In [52]:
# Declare the classifiers we will be using
classifiers = [SGDClassifier(), LogisticRegression(), SVC()]

for i, case in enumerate(cases):
    print("Evaluating the models with data from the case #{}".format(i+1))
    for j, clf in enumerate(classifiers):
        #train model with train data
        clf.fit(X=case['x_train'], y=case['y_train'])
        #predict test data
        predictions = clf.predict(X=case['x_test'])
        #calculate the accuracy
        accuracy = accuracy_score(case['y_test'], predictions)

        print("\t Classifier #{} achieved {} accuracy on test data.".format(j+1, accuracy))

Evaluating the models with data from the case #1
	 Classifier #1 achieved 0.821384964242107 accuracy on test data.
	 Classifier #2 achieved 0.8253968253968254 accuracy on test data.
	 Classifier #3 achieved 0.8236525379382522 accuracy on test data.
Evaluating the models with data from the case #2
	 Classifier #1 achieved 0.8250479679051108 accuracy on test data.
	 Classifier #2 achieved 0.8255712541426827 accuracy on test data.
	 Classifier #3 achieved 0.8248735391592534 accuracy on test data.
Evaluating the models with data from the case #3
	 Classifier #1 achieved 0.6493982208267922 accuracy on test data.
	 Classifier #2 achieved 0.8203383917669632 accuracy on test data.
	 Classifier #3 achieved 0.778998778998779 accuracy on test data.
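
In these results the raw features (case #3) lower the accuracy of `SGDClassifier` and `SVC` noticeably, while `LogisticRegression` is barely affected. Note also that `siivoaData` computes the scaling statistics on the whole dataset before the train/test split. One common alternative, not used in this notebook, is to place the scaler in a scikit-learn pipeline so that it is fitted on the training split only. A minimal sketch under that assumption, on synthetic data from `make_classification` rather than the loan dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features (not the real dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=13)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=13)

# The scaler is fitted on X_train only, then applied to X_test at predict time
model = make_pipeline(StandardScaler(), SGDClassifier(random_state=13))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Accuracy: {:.3f}".format(accuracy))
```

The same pipeline object could be passed to `cross_val_score`, in which case the scaler would be refitted inside every fold.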


In [54]:
# Implement Cross Validation using Logistic Regression classifier
# Using xMinMax dataframe

clf = LogisticRegression()
# Use all of the data (slice=1.0):
xNorm, xMinMax, xNoNorm, y, xMean, xDev = siivoaData(df, slice=1)
# 5 fold cross validation
scores = cross_val_score(clf, xMinMax, y, verbose=1, cv=5)
print(scores)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[0.82355398 0.81817707 0.82162247 0.82286609 0.82328374]
Accuracy: 0.822 (+/- 0.004)


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    4.5s finished
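
The summary line reports the mean of the five fold accuracies, with twice their standard deviation as a rough measure of spread. The arithmetic can be reproduced from the printed scores with NumPy alone:

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above
scores = np.array([0.82355398, 0.81817707, 0.82162247, 0.82286609, 0.82328374])

mean_acc = scores.mean()
spread = scores.std() * 2  # two (population) standard deviations

print("Accuracy: %0.3f (+/- %0.3f)" % (mean_acc, spread))
# prints: Accuracy: 0.822 (+/- 0.004)
```

The small spread indicates that the accuracy is stable across folds.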
