# Workshop Task - Training models and basic preprocessing in Scikit-Learn


You have the following dataset:  
A high default rate is undesirable for a bank, because it means that the bank is unlikely to fully recover its investment. If we are successful, our model will identify applicants who are at high risk of default, allowing the bank to refuse their credit requests.

**Target class**
* default
    - 1 = no
    - 2 = yes

**Features**
* months_loan_duration   - Loan duration in months
* amount                 - Loan amount
* installment_rate       - Monthly rate
* residence_history      - Length of time living at the current location
* age                    - Client age
* existing_credits       - Number of existing lines of credit with the bank
* dependents             - Number of dependents


* checking_balance       - Checking account balance
* credit_history         - Repayment status of earlier loans: 'delayed', 'fully repaid', 'critical', etc.
* purpose                - Purpose of the loan
* savings_balance        - Savings account balance
* employment_length      - 0-1 years, 1-4 years, 4-7 years, >7 years, or unemployed
* personal_status        - Relationship status
* other_debtors          - Other debtors on the loan, e.g. a guarantor
* property               - Type of property owned
* installment_plan       - Method of paying
* housing                - Whether the client lives for free, owns, or rents
* telephone              - Whether the client owns a telephone
* foreign_worker         - Whether the client is from another country
* job                    - Job skill level


**Missing values: annotated with 'unknown'**  
Hint: dataset = dataset.replace('unknown', np.nan)
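
As a small illustration of the hint, the sketch below marks a sentinel string as missing on a toy frame. The toy CSV text and the variable names are assumptions made for illustration only; `na_values` is the `pd.read_csv` option that does the same thing while parsing.

```python
import io

import numpy as np
import pandas as pd

# Toy CSV standing in for credit.csv (values are made up for illustration).
csv_text = "checking_balance,amount\n< 0 DM,1169\nunknown,2096\n"

# Option 1: load first, then replace the sentinel string with NaN.
toy = pd.read_csv(io.StringIO(csv_text))
toy = toy.replace('unknown', np.nan)

# Option 2: let read_csv treat the sentinel as missing while parsing.
toy_at_load = pd.read_csv(io.StringIO(csv_text), na_values=['unknown'])

print(toy.isna().sum())
```

Either route gives the same missing-value count; the second avoids a separate pass over the frame.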

Your task is to:

* Preprocess the data
* Make a train/test split: train (70%), test (30%).  
  Use - train_test_split(X, y, test_size=0.30, random_state=0)
* Train the model
* Evaluate the model
* Achieve an accuracy >= 0.76 on the test set
* For reproducibility, please set random_state in train_test_split and at model initialization
* Write a summary:
    - Which model gives the best result?
    - Which model can be interpreted most easily for a client?
    - What can we improve in the future?

Bonus points:
* Add precision/recall evaluation
* Hint: use from sklearn.metrics import classification_report



In [1]:
import pandas as pd
import numpy as np

In [2]:
dataset = pd.read_csv('credit.csv')

In [3]:
dataset.head()

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,...,property,age,installment_plan,housing,existing_credits,default,dependents,telephone,foreign_worker,job
0,< 0 DM,6,critical,radio/tv,1169,unknown,> 7 yrs,4,single male,none,...,real estate,67,none,own,2,1,1,yes,yes,skilled employee
1,1 - 200 DM,48,repaid,radio/tv,5951,< 100 DM,1 - 4 yrs,2,female,none,...,real estate,22,none,own,1,2,1,none,yes,skilled employee
2,unknown,12,critical,education,2096,< 100 DM,4 - 7 yrs,2,single male,none,...,real estate,49,none,own,1,1,2,none,yes,unskilled resident
3,< 0 DM,42,repaid,furniture,7882,< 100 DM,4 - 7 yrs,2,single male,guarantor,...,building society savings,45,none,for free,1,1,2,none,yes,skilled employee
4,< 0 DM,24,delayed,car (new),4870,< 100 DM,1 - 4 yrs,3,single male,none,...,unknown/none,53,none,for free,2,2,2,none,yes,skilled employee


In [4]:
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype


print("Numeric columns")
for column in dataset.columns:
    if is_numeric_dtype(dataset[column]):
        print(column)
print("----------------------------------")        
print("Category columns")        
for column in dataset.columns:
    if is_string_dtype(dataset[column]):
        print(column)

Numeric columns
months_loan_duration
amount
installment_rate
residence_history
age
existing_credits
default
dependents
----------------------------------
Category columns
checking_balance
credit_history
purpose
savings_balance
employment_length
personal_status
other_debtors
property
installment_plan
housing
telephone
foreign_worker
job
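
The same numeric/categorical split can be obtained without an explicit loop. This is a minimal sketch on a toy frame (the toy values are assumptions), using `DataFrame.select_dtypes`:

```python
import pandas as pd

# Toy frame with two numeric columns and one string column (made-up values).
toy = pd.DataFrame({'age': [67, 22],
                    'amount': [1169, 5951],
                    'housing': ['own', 'rent']})

# Columns with a numeric dtype, and everything else.
numeric_columns = toy.select_dtypes(include='number').columns.tolist()
categorical_columns = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_columns)
print(categorical_columns)
```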


In [5]:
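# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0.
dataset = dataset.replace('unknown', np.nan)
# The property column marks missing values as 'unknown/none'.
dataset = dataset.replace('unknown/none', np.nan)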

In [6]:
dataset.head()

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,...,property,age,installment_plan,housing,existing_credits,default,dependents,telephone,foreign_worker,job
0,< 0 DM,6,critical,radio/tv,1169,,> 7 yrs,4,single male,none,...,real estate,67,none,own,2,1,1,yes,yes,skilled employee
1,1 - 200 DM,48,repaid,radio/tv,5951,< 100 DM,1 - 4 yrs,2,female,none,...,real estate,22,none,own,1,2,1,none,yes,skilled employee
2,,12,critical,education,2096,< 100 DM,4 - 7 yrs,2,single male,none,...,real estate,49,none,own,1,1,2,none,yes,unskilled resident
3,< 0 DM,42,repaid,furniture,7882,< 100 DM,4 - 7 yrs,2,single male,guarantor,...,building society savings,45,none,for free,1,1,2,none,yes,skilled employee
4,< 0 DM,24,delayed,car (new),4870,< 100 DM,1 - 4 yrs,3,single male,none,...,,53,none,for free,2,2,2,none,yes,skilled employee


In [7]:
dataset.isna().sum()

checking_balance        394
months_loan_duration      0
credit_history            0
purpose                   0
amount                    0
savings_balance         183
employment_length         0
installment_rate          0
personal_status           0
other_debtors             0
residence_history         0
property                154
age                       0
installment_plan          0
housing                   0
existing_credits          0
default                   0
dependents                0
telephone                 0
foreign_worker            0
job                       0
dtype: int64

In [None]:
#dataset.employment_length.unique()

In [8]:
mapper = {'unemployed':0, '0 - 1 yrs':1, '1 - 4 yrs':2, '4 - 7 yrs':3, '> 7 yrs':4}
dataset['employment_length'] = dataset['employment_length'].replace(mapper)
#dataset

In [None]:
#dataset.savings_balance.unique()

In [9]:
mapper = {'< 100 DM':0, '101 - 500 DM':1, '501 - 1000 DM':2, '> 1000 DM':3}
dataset['savings_balance'] = dataset['savings_balance'].replace(mapper)
#dataset

In [10]:
mapper = {'1 - 200 DM': 1, '< 0 DM': 0, '> 200 DM': 2}
dataset = dataset.assign(checking_balance_add = dataset['checking_balance'].replace(mapper))
#dataset

In [11]:
mapper = {'building society savings': 0, 'other': 1, 'real estate': 2}
dataset = dataset.assign(property_add = dataset['property'].replace(mapper))
#dataset
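
The dictionary mappers above encode ordered categories as integers by hand. An alternative sketch, assuming the category order is known up front, uses an ordered `pd.Categorical`; the toy series and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Category order copied from the employment_length mapper.
order = ['unemployed', '0 - 1 yrs', '1 - 4 yrs', '4 - 7 yrs', '> 7 yrs']
toy = pd.Series(['> 7 yrs', '1 - 4 yrs', 'unemployed', np.nan])

# An ordered categorical assigns codes 0..4 following `order`; missing values get -1.
as_category = pd.Categorical(toy, categories=order, ordered=True)

# Convert to float so the -1 sentinel can be turned back into NaN for later imputation.
codes = pd.Series(as_category.codes, dtype='float64')
codes[codes < 0] = np.nan

print(codes)
```

Keeping the order in one list makes it harder to mistype a label than repeating it as dictionary keys.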

In [12]:
from sklearn.impute import KNNImputer

transformer = KNNImputer(n_neighbors=2)

columns = ['property_add','checking_balance_add']

dataset[columns] = transformer.fit_transform(dataset[columns])
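
With its default uniform weights, `KNNImputer` fills a missing entry with the mean of that column over the nearest rows, so an imputed ordinal code can fall between two categories. That is how a value such as 0.5 can show up in `checking_balance_add`. A toy sketch of the effect follows; the array values are made up.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy ordinal codes; the columns play the role of property_add and checking_balance_add.
toy = np.array([[2.0, 0.0],
                [2.0, 1.0],
                [2.0, np.nan],
                [0.0, 2.0]])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(toy)

# Row 2 is closest to rows 0 and 1 on the first column, so its missing
# second column becomes the mean of their codes 0 and 1.
print(filled)
```

If fractional codes are not wanted, rounding the result or using a most-frequent strategy are possible alternatives.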

In [13]:
# The raw columns are no longer needed once their imputed numeric copies exist.
dataset = dataset.drop(['checking_balance', 'property'], axis=1)

In [14]:
from sklearn.impute import SimpleImputer

def input_values(strategy, dataset, columns):
    """Return a copy of `dataset` with missing values in `columns` filled using `strategy`."""
    dataset_new = dataset.copy()
    simple_imputer = SimpleImputer(strategy=strategy)
    dataset_new[columns] = simple_imputer.fit_transform(dataset_new[columns])
    return dataset_new

In [15]:
dataset_without_missing_values = input_values(strategy='most_frequent', 
                                              dataset = dataset, 
                                              columns = ['savings_balance'])
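
To see what the `most_frequent` strategy actually fills in, the fitted imputer exposes the learned value in `statistics_`. A toy sketch (the values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy savings_balance codes with one missing entry; 0.0 is the most frequent value.
toy = pd.DataFrame({'savings_balance': [0.0, 0.0, 1.0, np.nan, 3.0]})

imputer = SimpleImputer(strategy='most_frequent')
toy[['savings_balance']] = imputer.fit_transform(toy[['savings_balance']])

# The fill value learned for each column during fit.
learned_fill = imputer.statistics_[0]
print(learned_fill)
```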

In [16]:
dataset_without_missing_values.isna().sum()

months_loan_duration    0
credit_history          0
purpose                 0
amount                  0
savings_balance         0
employment_length       0
installment_rate        0
personal_status         0
other_debtors           0
residence_history       0
age                     0
installment_plan        0
housing                 0
existing_credits        0
default                 0
dependents              0
telephone               0
foreign_worker          0
job                     0
checking_balance_add    0
property_add            0
dtype: int64

In [17]:
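def one_hot_encoding(dataset, columns):
    dataset_new = dataset.copy()
    # Without an explicit `columns=` argument, get_dummies only encodes object, string
    # or category columns; a column that is already numeric passes through unchanged.
    data_dummies = pd.get_dummies(dataset_new[columns])
    dataset_new = pd.concat([dataset_new, data_dummies], axis='columns')
    # Drops every column carrying one of these labels, including any numeric pass-through copy.
    dataset_new.drop(columns, axis='columns', inplace=True)

    return dataset_new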
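
One point worth checking in a helper like `one_hot_encoding` is how `pd.get_dummies` treats columns that are already numeric, such as ordinal codes. The sketch below, on a toy frame with made-up values, contrasts the default behaviour with passing `columns=` explicitly.

```python
import pandas as pd

# Toy frame: one string column and one column that already holds integer codes.
toy = pd.DataFrame({'housing': ['own', 'rent', 'own'],
                    'savings_balance': [0, 1, 0]})

# Default: only the string column is expanded; the numeric column is left as it is.
implicit = pd.get_dummies(toy[['housing', 'savings_balance']])

# Explicit columns=: every listed column is expanded, numeric or not.
explicit = pd.get_dummies(toy, columns=['housing', 'savings_balance'])

print(implicit.columns.tolist())
print(explicit.columns.tolist())
```

Whether an ordinal code should be one-hot encoded at all is a modelling choice; the sketch only shows what each call does.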

In [18]:
columns = ['credit_history',
           'purpose',
           'savings_balance',
           'employment_length',
           'personal_status',
           'other_debtors',
           'installment_plan',
           'housing',
           'telephone',
           'foreign_worker',
           'job']

In [19]:
dataset_encoding = one_hot_encoding(dataset_without_missing_values, columns)

In [20]:
dataset_encoding.head()

Unnamed: 0,months_loan_duration,amount,installment_rate,residence_history,age,existing_credits,default,dependents,checking_balance_add,property_add,...,housing_own,housing_rent,telephone_none,telephone_yes,foreign_worker_no,foreign_worker_yes,job_mangement self-employed,job_skilled employee,job_unemployed non-resident,job_unskilled resident
0,6,1169,4,4,67,2,1,1,0.0,2.0,...,1,0,0,1,0,1,0,1,0,0
1,48,5951,2,2,22,1,2,1,1.0,2.0,...,1,0,1,0,0,1,0,1,0,0
2,12,2096,2,3,49,1,1,2,0.5,2.0,...,1,0,1,0,0,1,0,0,0,1
3,42,7882,2,4,45,1,1,2,0.0,0.0,...,0,0,1,0,0,1,0,1,0,0
4,24,4870,3,4,53,2,2,2,0.0,1.0,...,0,0,1,0,0,1,0,1,0,0


In [21]:
from sklearn.preprocessing import MinMaxScaler

columns = ['age', 'months_loan_duration']

transformer = MinMaxScaler(feature_range = (0,1))

dataset_encoding[columns] = transformer.fit_transform(dataset_encoding[columns])

In [22]:
from sklearn.preprocessing import StandardScaler

def scale(dataset, columns):
    """Return a copy of `dataset` with `columns` standardized to zero mean and unit variance."""
    dataset_new = dataset.copy()
    standard_scaler = StandardScaler()
    dataset_new[columns] = standard_scaler.fit_transform(dataset_new[columns])

    return dataset_new

In [23]:
columns_1 = ['amount']

dataset_final = scale(dataset_encoding, columns_1)

In [24]:
dataset_final.head()

Unnamed: 0,months_loan_duration,amount,installment_rate,residence_history,age,existing_credits,default,dependents,checking_balance_add,property_add,...,housing_own,housing_rent,telephone_none,telephone_yes,foreign_worker_no,foreign_worker_yes,job_mangement self-employed,job_skilled employee,job_unemployed non-resident,job_unskilled resident
0,0.029412,-0.745131,4,4,0.857143,2,1,1,0.0,2.0,...,1,0,0,1,0,1,0,1,0,0
1,0.647059,0.949817,2,2,0.053571,1,2,1,1.0,2.0,...,1,0,1,0,0,1,0,1,0,0
2,0.117647,-0.416562,2,3,0.535714,1,1,2,0.5,2.0,...,1,0,1,0,0,1,0,0,0,1
3,0.558824,1.634247,2,4,0.464286,1,1,2,0.0,0.0,...,0,0,1,0,0,1,0,1,0,0
4,0.294118,0.566664,3,4,0.607143,2,2,2,0.0,1.0,...,0,0,1,0,0,1,0,1,0,0
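
The scalers above are fitted on the full dataset before the split, so the test rows influence the scaling parameters. A possible alternative, sketched here on synthetic data, is to fit a `ColumnTransformer` on the training rows only and reuse it on the test rows. The toy frame and the names `toy_train`, `toy_test` and `scaler` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for a few numeric columns of the credit data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(19, 75, size=100),
    'months_loan_duration': rng.integers(4, 72, size=100),
    'amount': rng.integers(250, 18000, size=100),
    'dependents': rng.integers(1, 3, size=100),
})
toy_train, toy_test = train_test_split(toy, test_size=0.30, random_state=0)

# Same scaling choices as in the notebook; untouched columns pass through.
scaler = ColumnTransformer(
    transformers=[
        ('minmax', MinMaxScaler(feature_range=(0, 1)), ['age', 'months_loan_duration']),
        ('standard', StandardScaler(), ['amount']),
    ],
    remainder='passthrough',
)

# Learn min/max and mean/std from the training rows only, then apply to both parts.
train_scaled = scaler.fit_transform(toy_train)
test_scaled = scaler.transform(toy_test)

print(train_scaled.shape, test_scaled.shape)
```

Scaled test values may fall slightly outside [0, 1] with this approach, which is expected when the test set contains values beyond the training range.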


In [25]:
from sklearn.model_selection import train_test_split

X = dataset_final.drop(['default'], axis = 1)
y = dataset_final['default']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [37]:
from sklearn.ensemble import RandomForestClassifier

# Named `forest` so it is not overwritten by the decision tree trained later.
forest = RandomForestClassifier(n_estimators = 500,
                                criterion = 'gini',
                                min_samples_split = 4,
                                min_samples_leaf = 5,
                                max_features = 10,
                                random_state = 0)
forest.fit(X_train, y_train)


print('Accuracy on the training subset: {:.3f}'.format(forest.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(forest.score(X_test, y_test)))

Accuracy on the training subset: 0.866
Accuracy on the test subset: 0.760
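
A fitted random forest exposes `feature_importances_`, which can help explain which inputs drive its predictions. The sketch below uses synthetic data from `make_classification`; the feature names and the `forest_toy` variable are placeholders, not the credit columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for X_train / y_train.
X_toy, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
feature_names = [f'feature_{i}' for i in range(X_toy.shape[1])]

forest_toy = RandomForestClassifier(n_estimators=100, random_state=0)
forest_toy.fit(X_toy, y_toy)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(forest_toy.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```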


In [35]:
from sklearn.svm import SVC 

svm = SVC(kernel='rbf', C=14, gamma='auto', random_state = 0)

svm.fit(X_train, y_train)

print('Accuracy on the training subset: {:.3f}'.format(svm.score(X_train, y_train)))
print('Accuracy on the test subset: {:.3f}'.format(svm.score(X_test, y_test)))

Accuracy on the training subset: 0.844
Accuracy on the test subset: 0.763
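
Values such as `C=14` are easy to tune against the test set by accident. One way to avoid that, sketched here on synthetic data, is to select hyperparameters by cross-validation on the training rows and score the test rows once at the end. The parameter grid below is an arbitrary assumption, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the credit features.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=0)

# Small illustrative grid; each combination is scored with 5-fold cross-validation.
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 'auto']}
search = GridSearchCV(SVC(kernel='rbf', random_state=0), param_grid,
                      cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

print(search.best_params_)
print('Held-out accuracy: {:.3f}'.format(search.score(X_te, y_te)))
```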


In [34]:
from sklearn.tree import DecisionTreeClassifier


tree = DecisionTreeClassifier(random_state = 0,
                              criterion = 'gini',
                              max_depth = 6, 
                              min_samples_leaf = 4)

tree.fit(X_train, y_train)

print('Accuracy on the training subset: {:.3f}'.format(tree.score(X_train, y_train)))
print('Accuracy on the test subset: {:.2f}'.format(tree.score(X_test, y_test)))

Accuracy on the training subset: 0.793
Accuracy on the test subset: 0.75
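
A shallow decision tree can be printed as plain if/else rules, which is one way to present a model to a non-technical audience. A minimal sketch on synthetic data follows; the feature names are borrowed from the dataset purely as labels, and `tree_toy` is a hypothetical variable.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the names below are labels only, not the real credit columns.
X_toy, y_toy = make_classification(n_samples=200, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
feature_names = ['months_loan_duration', 'amount', 'age', 'existing_credits']

tree_toy = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Text rendering of the learned splits and the class predicted at each leaf.
rules = export_text(tree_toy, feature_names=feature_names)
print(rules)
```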


In [29]:
from sklearn.metrics import classification_report

y_pred = tree.predict(X_test)

analysis = classification_report(y_test, y_pred)
print(analysis)

              precision    recall  f1-score   support

           1       0.77      0.93      0.84       214
           2       0.64      0.29      0.40        86

    accuracy                           0.75       300
   macro avg       0.70      0.61      0.62       300
weighted avg       0.73      0.75      0.72       300



In [30]:
from sklearn.metrics import confusion_matrix

# Use a separate name so the imported function is not shadowed by its result.
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[200  14]
 [ 61  25]]
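
The figures in the classification report can be recovered from this matrix by hand. Rows are the true classes (1, 2) and columns the predicted classes; the sketch below redoes the arithmetic for class 2 (default = yes). The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[200, 14],
               [61, 25]])

true_positives = cm[1, 1]    # defaults predicted as defaults
false_negatives = cm[1, 0]   # defaults predicted as non-defaults
false_positives = cm[0, 1]   # non-defaults predicted as defaults

precision_2 = true_positives / (true_positives + false_positives)  # 25 / 39
recall_2 = true_positives / (true_positives + false_negatives)     # 25 / 86
accuracy = cm.trace() / cm.sum()                                   # 225 / 300

print(round(precision_2, 2), round(recall_2, 2), accuracy)
```

The low recall means that most actual defaults (61 of 86) are predicted as non-defaults.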


1) Which model gives the best result?

SVC and Random Forest give the best test accuracy (0.763 and 0.760, respectively); the decision tree reaches 0.75.

2) Which model can be interpreted most easily for a client?

Personally, I do not yet have a clear picture of what "interpreted most easily for a client" means, but I would say that all of them can be explained to a client.

3) What can we improve in the future?

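The class balance of the dataset needs to be improved in order to produce clearer and more reliable results. The test set contains 214 non-default cases against 86 defaults, and the decision tree's recall on the default class is only 0.29.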
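
One option to explore for the imbalance, without collecting more data, is the `class_weight='balanced'` setting that several scikit-learn classifiers accept. The sketch below compares minority-class recall with and without it on synthetic imbalanced data; it makes no claim about the effect on the credit dataset, and the `recalls` dictionary is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly a 70/30 class split, similar in spirit to the task.
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30,
                                          random_state=0, stratify=y_toy)

recalls = {}
for class_weight in (None, 'balanced'):
    model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=4,
                                   class_weight=class_weight, random_state=0)
    model.fit(X_tr, y_tr)
    # Recall on the minority class (label 1 in this synthetic data).
    recalls[str(class_weight)] = recall_score(y_te, model.predict(X_te), pos_label=1)

print(recalls)
```

Reweighting usually trades some precision or accuracy for recall, so the comparison should be judged on the metric that matters to the bank.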