# Creditworthiness (Project 4)

## Problem setting

A bank wants to predict the creditworthiness of its customers. Based on the customer records, the credit history, etc., a customer should be classified as creditworthy or unworthy of credit. 
It is five times more 'expensive' for the bank to rate a customer who is unworthy of credit as creditworthy than vice versa. In addition, not all information is available for all customers. 
For 1,000 representatively selected customers, the creditworthiness is known. For these customers the following data has been collected. (Features for which not all values are known are marked with the addition "incomplete".)
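
The asymmetric cost described above can be made concrete with a small cost table. The following is only a sketch: the label encoding (1 = creditworthy, 2 = not creditworthy) is an assumption that the text does not state, and `COST` and `total_cost` are hypothetical helper names introduced for illustration.

```python
# Assumed label encoding (not stated in the text): 1 = creditworthy, 2 = not creditworthy.
GOOD, BAD = 1, 2

# Cost of each (true label, predicted label) pair; correct decisions cost nothing.
COST = {
    (GOOD, GOOD): 0,
    (BAD, BAD): 0,
    (GOOD, BAD): 1,  # creditworthy customer rated as unworthy
    (BAD, GOOD): 5,  # unworthy customer rated as creditworthy: five times as expensive
}

def total_cost(y_true, y_pred):
    """Sum the misclassification costs over all customers."""
    return sum(COST[(t, p)] for t, p in zip(y_true, y_pred))

# Made-up labels, only to show the calculation.
y_true = [GOOD, GOOD, BAD, BAD, GOOD]
y_pred = [GOOD, BAD, GOOD, BAD, GOOD]
print(total_cost(y_true, y_pred))  # one error of cost 1 plus one error of cost 5 = 6
```

A measure like this could be reported next to plain accuracy when comparing classifiers, since accuracy treats both kinds of error as equally bad.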

## Importing Python packages

We import all the Python packages we are going to use:

In [494]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from scipy.stats.mstats import zscore

## Preprocessing the data

First of all we need to import the data from the given file 'kredit.dat'.

In [495]:
#read kredit.dat into a list of lists (one list of fields per line)
with open("./kredit.dat") as f:
    datContent = [line.strip().split() for line in f]

#write it as a new CSV file (newline="" avoids blank rows on Windows)
with open("./kredit.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)

#naming the labels of the columns
columns = ['Status of existing checking account','Duration in month','Credit history','Purpose','Credit amount','Savings account/bonds','Present employment since','Installment rate in percentage of disposable income',
'Personal status and sex','Other debtors/guarantors','Present residence since','Porperty','Age in years','Other installment plans','Housing','Number of existing credits at this brank','Job','Number of people being liable to provide maintenance for',
'Telephone','Foreign worker','Creditworthy']

#creating the dataframe
df = pd.read_csv('./kredit.csv',names=columns)
df.head()

Unnamed: 0,Status of existing checking account,Duration in month,Credit history,Purpose,Credit amount,Savings account/bonds,Present employment since,Installment rate in percentage of disposable income,Personal status and sex,Other debtors/guarantors,...,Porperty,Age in years,Other installment plans,Housing,Number of existing credits at this brank,Job,Number of people being liable to provide maintenance for,Telephone,Foreign worker,Creditworthy
0,A14,36,A32,?,2299,A63,?,4,A93,A101,...,A123,39,A143,A152,1,A173,1,A191,?,1
1,A12,18,A32,A46,1239,A65,A73,4,A93,A101,...,A124,61,A143,A153,1,?,1,A191,A201,1
2,A13,24,A32,A40,947,A61,A74,4,A93,A101,...,A124,38,A141,A153,1,?,2,A191,?,2
3,A14,15,A33,A43,1478,A61,A73,4,A94,A101,...,A121,33,A141,A152,2,A173,1,A191,A201,1
4,A14,24,A32,A40,1525,A64,A74,4,A92,A101,...,A123,34,A143,A152,1,A173,2,A192,A201,1
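
As a possible alternative to the intermediate CSV file, pandas can parse whitespace-separated text directly. The sketch below uses two made-up rows and four hypothetical short column names instead of the real file, so it only illustrates the `sep` and `header` arguments of `pd.read_csv`.

```python
import io
import pandas as pd

# Two made-up rows in a whitespace-separated layout, shortened to four columns.
sample = "A14 36 ? 1\nA12 18 A46 2\n"
cols = ["Status", "Duration", "Purpose", "Creditworthy"]  # hypothetical short names

# sep=r"\s+" splits on any run of whitespace; header=None because there is no header row.
demo = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=cols)
print(demo)
```

Note that the `?` placeholder is kept as an ordinary string here, which matches how the rest of the notebook detects missing values.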


We have now built our dataframe, but to handle the missing values we need to transform some of the attributes, since many of them are divided into classes (like A32, A33, ...). To train a model that can predict our missing values, we need vectors with only numerical values. Therefore we create new features of the following form: A30=0, A31=0, A32=1, A33=0, A34=0 (for a client whose "Credit history" is A32). We need to apply this process to every non-numerical feature.
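
To illustrate this encoding on its own, here is a minimal sketch with three made-up "Credit history" values. The list `levels` is passed explicitly so that all five columns A30 to A34 appear even though the toy data does not contain every class; the variable names are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The five classes named in the text for "Credit history".
levels = ["A30", "A31", "A32", "A33", "A34"]

# Made-up sample values, not taken from kredit.dat.
toy = pd.DataFrame({"Credit history": ["A32", "A34", "A30"]})

# Fixing the categories keeps the column layout independent of the sample.
encoder = OneHotEncoder(categories=[levels])
encoded = encoder.fit_transform(toy[["Credit history"]]).toarray()

one_hot = pd.DataFrame(encoded, columns=encoder.categories_[0])
print(one_hot)  # the first row has a 1.0 only in column A32
```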

In [496]:
print(df.dtypes) #display the data type of each column

Status of existing checking account                         object
Duration in month                                            int64
Credit history                                              object
Purpose                                                     object
Credit amount                                                int64
Savings account/bonds                                       object
Present employment since                                    object
Installment rate in percentage of disposable income          int64
Personal status and sex                                     object
Other debtors/guarantors                                    object
Present residence since                                      int64
Porperty                                                    object
Age in years                                                 int64
Other installment plans                                     object
Housing                                                     ob

As we can see, the features "Status of existing checking account", "Credit history", "Purpose", "Savings account/bonds", "Present employment since", "Personal status and sex", "Other debtors/guarantors", "Porperty", "Other installment plans", "Housing", "Job", "Telephone" and "Foreign worker" are all non-numerical, so we have to transform them.

In [497]:
style = OneHotEncoder()
non_numerics = df.select_dtypes(include='object')
#the four features with missing values ('?') cannot be encoded yet, so we exclude them here
non_numerics = non_numerics.drop(['Purpose','Present employment since','Job','Foreign worker'],axis=1)
#'Telephone' has only two classes, so we encode it within the column: A191 -> 0, A192 -> 1
non_numerics = non_numerics.drop('Telephone',axis=1)
df['Telephone'] = df['Telephone'].map({'A191': 0, 'A192': 1}).astype('int64')
for i in non_numerics.columns.tolist():
    transformation = style.fit_transform(df[[i]]) #transform column i
    df = df.join(pd.DataFrame(transformation.toarray(), columns=style.categories_[0])) #add new categories (of transformation) to our dataframe
    df = df.drop(i, axis=1) #dropping old column since we transformed its information
df.head()

Unnamed: 0,Duration in month,Purpose,Credit amount,Present employment since,Installment rate in percentage of disposable income,Present residence since,Age in years,Number of existing credits at this brank,Job,Number of people being liable to provide maintenance for,...,A121,A122,A123,A124,A141,A142,A143,A151,A152,A153
0,36,?,2299,?,4,4,39,1,A173,1,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,18,A46,1239,A73,4,4,61,1,?,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,24,A40,947,A74,4,3,38,1,?,2,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
3,15,A43,1478,A73,4,3,33,2,A173,1,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,24,A40,1525,A74,4,3,34,1,A173,2,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


In the next step we rearrange the columns of the dataframe so that the four features with missing values become the first four columns. This has no other purpose than giving us a better overview of the data.

In [498]:
old_order = df.columns.tolist()
new_order = [old_order[1]] + [old_order[3]] + [old_order[8]] + [old_order[11]] + [old_order[0]] + [old_order[2]] + old_order[4:8] + old_order[9:11] + old_order[12:]
df = df[new_order]
df.head()

Unnamed: 0,Purpose,Present employment since,Job,Foreign worker,Duration in month,Credit amount,Installment rate in percentage of disposable income,Present residence since,Age in years,Number of existing credits at this brank,...,A121,A122,A123,A124,A141,A142,A143,A151,A152,A153
0,?,?,A173,?,36,2299,4,4,39,1,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,A46,A73,?,A201,18,1239,4,4,61,1,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,A40,A74,?,?,24,947,4,3,38,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
3,A43,A73,A173,A201,15,1478,4,3,33,2,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,A40,A74,A173,A201,24,1525,4,3,34,1,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
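
The positional indices used for this reordering depend on the exact column layout. A minimal sketch of the same idea that selects the columns by name instead; the toy dataframe and the names `incomplete` and `rest` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy dataframe with two of the incomplete features mixed in between complete ones.
toy = pd.DataFrame({
    "Duration in month": [36, 18],
    "Purpose": ["?", "A46"],
    "Credit amount": [2299, 1239],
    "Job": ["A173", "?"],
})

# Columns to move to the front, followed by all remaining columns in their old order.
incomplete = ["Purpose", "Job"]
rest = [c for c in toy.columns if c not in incomplete]
toy = toy[incomplete + rest]

print(toy.columns.tolist())
```

Selecting by name keeps working if columns are added or removed earlier in the pipeline, which positional slices do not guarantee.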


Next, we define a function that prepares the data for training a model for each of the four features with missing values. We need to separate the dataframe into one set of rows where the feature is missing and one set where it is known. The set without missing values is then split again into train and test data.

In [499]:
def prepare_data(dataframe, feature):
    not_missing = dataframe.loc[dataframe[feature] != '?'] #the missing values are represented by '?'; filter on the passed dataframe, not the global df
    df_y = not_missing[[feature]]
    df_X = not_missing.drop([feature],axis=1)

    transformation = style.fit_transform(df_y[[feature]]) #one-hot encode the target feature
    df_y = pd.DataFrame(transformation.toarray(), columns=style.categories_[0]) #one target column per class of the feature

    #get the raw data
    X = df_X.values
    y = df_y.values
    
    #split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33, random_state=0)

    return X_train, X_test, y_train, y_test
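
The separation step can be looked at in isolation. The sketch below uses a made-up two-column dataframe; `known` corresponds to the rows that `prepare_data` keeps for training and testing, while `missing` holds the rows whose value a trained model would have to predict. Both variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; '?' marks a missing value as in the real data.
toy = pd.DataFrame({
    "Job": ["A173", "?", "A172", "?"],
    "Age in years": [39, 61, 38, 33],
})

known = toy.loc[toy["Job"] != "?"]    # rows usable for training and testing
missing = toy.loc[toy["Job"] == "?"]  # rows whose 'Job' value is still unknown

print(len(known), len(missing))
```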

Besides a function that prepares the data for our needs, we also need our linear classification model. It consists of one decision function per class (fitted by linear regression on the one-hot encoded target) and the classifier, which compares the scores of these functions and assigns each sample to the class with the highest score.

In [500]:
def linear_reg_model(X_train,y_train):
    model = LinearRegression()
    model.fit(X_train,y_train)
    return model

def classify(model, X):
    #score every class with its decision function
    pred = model.predict(X)
    #pick the class with the highest score per sample (ties go to the first class)
    biggest = np.argmax(pred, axis=1)
    #return the decisions as one-hot vectors, matching the encoding of y
    return np.eye(pred.shape[1], dtype=int)[biggest]
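
The decision rule in `classify` amounts to a row-wise argmax over the class scores. A small sketch with made-up scores (not output of any model above) shows what happens to three samples with three classes each:

```python
import numpy as np

# Made-up decision-function scores: one row per sample, one column per class.
scores = np.array([
    [0.2, 0.7, 0.1],
    [0.5, 0.5, 0.0],   # tie between class 0 and class 1
    [-0.1, 0.3, 0.9],
])

# Index of the highest score in each row; on a tie the first class wins.
winners = scores.argmax(axis=1)

# Turn the winning indices back into one-hot rows.
one_hot = np.eye(scores.shape[1], dtype=int)[winners]
print(one_hot)
```

Because the scores come from a regression, they are not probabilities and may be negative or larger than 1; only their order within a row matters for the decision.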

For each of the four features we want to predict, we prepare a dataframe that excludes the other three incomplete features, so that no prediction is based on another prediction:

In [501]:
df_1 = df.drop(['Present employment since','Job','Foreign worker'],axis=1) #'Purpose'
df_2 = df.drop(['Purpose','Job','Foreign worker'],axis=1) #'Present employment since'
df_3 = df.drop(['Purpose','Present employment since','Foreign worker'],axis=1) #'Job'
df_4 = df.drop(['Purpose','Present employment since','Job'],axis=1) #'Foreign worker'

### Model for predicting 'Purpose'

In [502]:
X_train, X_test, y_train, y_test = prepare_data(df_1,'Purpose')
model_1 = linear_reg_model(X_train,y_train)
y_pred = classify(model_1,X_train)
print("Accuracy on feature 'Purpose' with train data:",accuracy_score(y_true=y_train,y_pred=y_pred))
y_pred = classify(model_1,X_test)
print("Accuracy on feature 'Purpose' with test data:",accuracy_score(y_true=y_test,y_pred=y_pred))

Accuracy on feature 'Purpose' with train data: 0.4368231046931408
Accuracy on feature 'Purpose' with test data: 0.32234432234432236


### Model for predicting 'Present employment since'

In [503]:
X_train, X_test, y_train, y_test = prepare_data(df_2,'Present employment since')
model_2 = linear_reg_model(X_train,y_train)
y_pred = classify(model_2,X_train)
print("Accuracy on feature 'Present employment since' with train data:",accuracy_score(y_true=y_train,y_pred=y_pred))
y_pred = classify(model_2,X_test)
print("Accuracy on feature 'Present employment since' with test data:",accuracy_score(y_true=y_test,y_pred=y_pred))

Accuracy on feature 'Present employment since' with train data: 0.5400593471810089
Accuracy on feature 'Present employment since' with test data: 0.3772455089820359


### Model for predicting 'Job'

In [504]:
X_train, X_test, y_train, y_test = prepare_data(df_3,'Job')
model_3 = linear_reg_model(X_train,y_train)
y_pred = classify(model_3,X_train)
print("Accuracy on feature 'Job' with train data:",accuracy_score(y_true=y_train,y_pred=y_pred))
y_pred = classify(model_3,X_test)
print("Accuracy on feature 'Job' with test data:",accuracy_score(y_true=y_test,y_pred=y_pred))

Accuracy on feature 'Job' with train data: 0.7023346303501945
Accuracy on feature 'Job' with test data: 0.6417322834645669


### Model for predicting 'Foreign worker'

In [505]:
#df_4 = df_4.drop(['Number of people being liable to provide maintenance for'],axis=1) #makes no difference to the accuracy
X_train, X_test, y_train, y_test = prepare_data(df_4,'Foreign worker')
model_4 = linear_reg_model(X_train,y_train)
y_pred = classify(model_4,X_train)
print("Accuracy on feature 'Foreign worker' with train data:",accuracy_score(y_true=y_train,y_pred=y_pred))
y_pred = classify(model_4,X_test)
print("Accuracy on feature 'Foreign worker' with test data:",accuracy_score(y_true=y_test,y_pred=y_pred))

Accuracy on feature 'Foreign worker' with train data: 0.955607476635514
Accuracy on feature 'Foreign worker' with test data: 0.9622641509433962


As we can see, the models that have to perform multi-class classification do not perform as well as the fourth model, which only has to decide between two classes. This result is expected: with more classes to choose from, there are more ways to misclassify a sample and therefore a higher probability of making a false prediction. Additionally, the boundary between the classes can be very thin. For example, in the feature 'Present employment since' a difference of a single day can put a sample in a different class (364 days = A72 and 365 days = A73).
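
One way to put such accuracies in context is to compare them with a baseline that always predicts the most frequent class. The sketch below uses made-up label lists, not the labels from kredit.dat, and `majority_baseline` is a hypothetical helper; it only illustrates why a binary feature with one dominant class is easier to score well on than a feature with several evenly used classes.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up label distributions for illustration only.
binary = ["yes"] * 9 + ["no"] * 1   # two classes, one dominant
multi = ["a", "b", "c", "d"] * 3    # four evenly distributed classes

print(majority_baseline(binary))  # 0.9
print(majority_baseline(multi))   # 0.25
```

Computing this baseline on the real target columns would show how much of each reported accuracy is actually due to the model.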