# Exercise 03

## Data preparation and model evaluation exercise with credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of Open loans |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


Read the data into Pandas

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
# read the CSV directly from the zip archive
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f)

data.head()

| |Unnamed: 0|SeriousDlqin2yrs|RevolvingUtilizationOfUnsecuredLines|age|NumberOfTime30-59DaysPastDueNotWorse|DebtRatio|MonthlyIncome|NumberOfOpenCreditLinesAndLoans|NumberOfTimes90DaysLate|NumberRealEstateLoansOrLines|NumberOfTime60-89DaysPastDueNotWorse|NumberOfDependents|
|----|----|----|----|----|----|----|----|----|----|----|----|----|
|0|0|1|0.766127|45.0|2.0|0.802982|9120.0|13.0|0.0|6.0|0.0|2.0|
|1|1|0|0.957151|40.0|0.0|0.121876|2600.0|4.0|0.0|0.0|0.0|1.0|
|2|2|0|0.65818|38.0|1.0|0.085113|3042.0|2.0|1.0|0.0|0.0|0.0|
|3|3|0|0.23381|30.0|0.0|0.03605|3300.0|5.0|0.0|0.0|0.0|0.0|
|4|4|0|0.907239|49.0|1.0|0.024926|63588.0|7.0|0.0|1.0|0.0|0.0|


In [2]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.1

Impute the missing values of `age` and `NumberOfDependents`.

In [3]:
# check for missing values
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [4]:
data.loc[data.age.isnull()].shape,data.loc[data.NumberOfDependents.isnull()].shape

((4267, 12), (4267, 12))

In [5]:
# fill missing values for age with the median age
data['age'] = data['age'].fillna(data['age'].median())

In [6]:
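# fill missing values for NumberOfDependents with '0'
# TODO: the fill value should be a float (and the mode of the column);
# the string '0' turns the column into dtype object
data['NumberOfDependents'] = data['NumberOfDependents'].fillna('0')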

In [7]:
data.loc[data.age.isnull()].shape,data.loc[data.NumberOfDependents.isnull()].shape

((0, 12), (0, 12))

In [8]:
# check for missing values
data.isnull().sum()

Unnamed: 0                              0
SeriousDlqin2yrs                        0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64
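
As a minimal sketch of the imputation step, assuming a small made-up frame (`toy` and its values are illustrative, not taken from the dataset), the median can fill `age` and the mode can fill `NumberOfDependents`. Using a numeric fill value keeps the column as `float64` rather than `object`:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; this is not the Kaggle data.
toy = pd.DataFrame({
    'age': [30.0, np.nan, 50.0, 40.0],
    'NumberOfDependents': [0.0, 2.0, np.nan, 0.0],
})

# Median for a continuous variable such as age.
toy['age'] = toy['age'].fillna(toy['age'].median())

# Mode (most frequent value) for a count variable. mode() returns a Series,
# so take its first entry. The fill value is numeric, so the dtype stays float64.
dependents_mode = toy['NumberOfDependents'].mode().iloc[0]
toy['NumberOfDependents'] = toy['NumberOfDependents'].fillna(dependents_mode)

print(toy.dtypes)
print(toy.isnull().sum())
```

Whether the median and the mode are the right choices for the real data is a modelling decision; the sketch only shows the mechanics.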

# Exercise 3.2

From the set of features, select the subset that maximizes the model's **F1 score**, using K-fold cross-validation.
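
To make the K-fold idea concrete, here is a rough sketch of how contiguous, unshuffled folds partition the row indices. The helper `k_fold_indices` is hypothetical and written only for illustration; it is not a scikit-learn function.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Each sample lands in exactly one test fold and in the training set otherwise.
folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, test_idx in folds:
    print(test_idx)
```

The model is fit once per fold on the training indices and scored on the held-out indices; the reported score is the mean over folds.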

In [9]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

In [10]:
list(data)

['Unnamed: 0',
 'SeriousDlqin2yrs',
 'RevolvingUtilizationOfUnsecuredLines',
 'age',
 'NumberOfTime30-59DaysPastDueNotWorse',
 'DebtRatio',
 'MonthlyIncome',
 'NumberOfOpenCreditLinesAndLoans',
 'NumberOfTimes90DaysLate',
 'NumberRealEstateLoansOrLines',
 'NumberOfTime60-89DaysPastDueNotWorse',
 'NumberOfDependents']

In [11]:
data.dtypes

Unnamed: 0                                int64
SeriousDlqin2yrs                          int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                     float64
NumberOfTime30-59DaysPastDueNotWorse    float64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans         float64
NumberOfTimes90DaysLate                 float64
NumberRealEstateLoansOrLines            float64
NumberOfTime60-89DaysPastDueNotWorse    float64
NumberOfDependents                       object
dtype: object

In [12]:
import itertools

possible_models = []
features = ['RevolvingUtilizationOfUnsecuredLines',
 'age',
 'NumberOfTime30-59DaysPastDueNotWorse',
 'DebtRatio',
 'MonthlyIncome',
 'NumberOfOpenCreditLinesAndLoans',
 'NumberOfTimes90DaysLate',
 'NumberRealEstateLoansOrLines',
 'NumberOfTime60-89DaysPastDueNotWorse',
 'NumberOfDependents']
for i in range(1,len(features)+1):
    possible_models.extend(list(itertools.combinations(features,i)))

possible_models[:10]

[('RevolvingUtilizationOfUnsecuredLines',),
 ('age',),
 ('NumberOfTime30-59DaysPastDueNotWorse',),
 ('DebtRatio',),
 ('MonthlyIncome',),
 ('NumberOfOpenCreditLinesAndLoans',),
 ('NumberOfTimes90DaysLate',),
 ('NumberRealEstateLoansOrLines',),
 ('NumberOfTime60-89DaysPastDueNotWorse',),
 ('NumberOfDependents',)]

In [13]:
import numpy as np
# dtype=object is needed because the tuples have different lengths
np.array(possible_models, dtype=object).shape

(1023,)
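
The count of 1023 can be cross-checked: with 10 features, each one is either in or out of a subset, and the empty set is excluded. A short sketch of that arithmetic (the variable names are illustrative):

```python
import itertools
import math

n_features = 10  # number of candidate features listed above

# Count non-empty subsets size by size, mirroring the combinations loop.
count_by_size = sum(math.comb(n_features, k) for k in range(1, n_features + 1))

# Closed form: every feature is either in or out, minus the empty set.
closed_form = 2 ** n_features - 1

# Direct enumeration over placeholder feature indices.
enumerated = sum(
    1
    for k in range(1, n_features + 1)
    for _ in itertools.combinations(range(n_features), k)
)

print(count_by_size, closed_form, enumerated)  # 1023 1023 1023
```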

In [16]:
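import random
# evaluate only a random sample of the candidate feature sets;
# the sample size (second argument of random.sample) is not specified here
possible_models2 = random.sample(possible_models, )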

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# a very large C effectively disables regularization
logreg = LogisticRegression(C=1e9)

# mean 10-fold F1 score for each sampled feature set
results = pd.DataFrame(index=possible_models, columns=['F1-Score'])
for model in possible_models2:
    X = data[list(model)]
    results.loc[model, 'F1-Score'] = cross_val_score(logreg, X, y, cv=10, scoring='f1').mean()



In [18]:
results.sort_values('F1-Score', ascending=False).head()

|Feature set|F1-Score|
|----|----|
|(NumberOfTime30-59DaysPastDueNotWorse, DebtRatio, NumberOfOpenCreditLinesAndLoans, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse)|0.0792035|
|(NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse)|0.0711885|
|(age, NumberOfOpenCreditLinesAndLoans, NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse)|0.0500513|
|(RevolvingUtilizationOfUnsecuredLines, age, NumberOfOpenCreditLinesAndLoans, NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse)|0.0500513|
|(NumberOfTime30-59DaysPastDueNotWorse, MonthlyIncome, NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse)|0.0422882|


#### The feature set (NumberOfTime30-59DaysPastDueNotWorse, DebtRatio, NumberOfOpenCreditLinesAndLoans, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse) gives the best F1-Score.
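
For reference, F1 is the harmonic mean of precision and recall for the positive class. The sketch below uses made-up labels and a hypothetical helper `f1_from_labels` to show how a classifier that rarely predicts the positive class can score near zero on F1 even when most of its predictions are correct. That is one plausible reading of the low values above, not a verified diagnosis.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and/or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 positives out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
all_negative = [0] * 10                      # 80% accurate, but finds no positives
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]     # 1 true positive, 1 false positive, 1 miss

print(f1_from_labels(y_true, all_negative))  # 0.0
print(f1_from_labels(y_true, one_hit))       # 0.5
```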

# Exercise 3.3

Now, which set of features is best according to AUC?

In [21]:
from sklearn import metrics
from sklearn.model_selection import KFold
import numpy as np

results['AUC'] = 0.0

for model in possible_models2:
    X = data[list(model)]

    # Create k-folds (contiguous, unshuffled splits)
    kf = KFold(n_splits=10)

    res = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        logreg = LogisticRegression(C=1e9)
        logreg.fit(X_train, y_train)

        # score with the predicted probability of the positive class
        y_pred_prob = logreg.predict_proba(X_test)[:, 1]
        res.append(metrics.roc_auc_score(y_test, y_pred_prob))

    # mean AUC over the 10 folds
    results.loc[model, 'AUC'] = np.mean(res)

In [22]:
results.sort_values('AUC', ascending=False).head()

|Feature set|F1-Score|AUC|
|----|----|----|
|(NumberOfTime30-59DaysPastDueNotWorse, DebtRatio, NumberOfOpenCreditLinesAndLoans, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse)|0.0792035|0.679117|
|(age, NumberOfTime30-59DaysPastDueNotWorse, DebtRatio, MonthlyIncome, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse, NumberOfDependents)|0.0290796|0.665295|
|(NumberOfTime30-59DaysPastDueNotWorse, MonthlyIncome, NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse)|0.0422882|0.654419|
|(NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse)|0.0711885|0.648256|
|(RevolvingUtilizationOfUnsecuredLines, age, MonthlyIncome, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse)|0.0349749|0.647792|


#### Based on the AUC criterion, the best feature set is: (NumberOfTime30-59DaysPastDueNotWorse, DebtRatio, NumberOfOpenCreditLinesAndLoans, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse).
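
Unlike F1, AUC does not depend on a decision threshold: it measures how often a randomly chosen positive case receives a higher predicted score than a randomly chosen negative one. A small sketch of that pairwise definition, using made-up scores and a hypothetical helper `auc_by_pairs` (written for illustration; the cells above use `metrics.roc_auc_score`):

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Three of the four (positive, negative) pairs are ordered correctly.
print(auc_by_pairs(y_true, scores))  # 0.75
```

This is why a feature set can rank well by AUC while its F1 score at the default threshold stays low.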