# Exercise 03

## Data preparation and model evaluation exercise with credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether a loan should be granted. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans and lines of credit |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


Read the data into Pandas

In [91]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f)

data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0
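As a side note on the loading step, pandas can usually read a zipped CSV directly when the archive holds a single file, inferring the compression from the `.zip` extension. The sketch below builds a tiny throwaway archive so that it runs on its own; the file name `demo.csv.zip` and its two rows are made up for illustration and are not the exercise data.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Build a small single-file zip archive (hypothetical stand-in for the dataset)
    zip_path = os.path.join(tmp, 'demo.csv.zip')
    with zipfile.ZipFile(zip_path, 'w') as z:
        z.writestr('demo.csv', 'age,MonthlyIncome\n45,9120\n40,2600\n')

    # read_csv infers zip compression from the file extension
    demo = pd.read_csv(zip_path)

print(demo)
```

Opening the archive with `zipfile` first, as the cell above does, is still the more explicit choice when an archive contains several files.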


In [92]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.1

Impute the missing values of age and NumberOfDependents.

In [93]:
# check for missing values
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [83]:
data.shape

(112915, 12)

In [94]:
# fill missing values of age and NumberOfDependents with each column's median
data['age'] = data['age'].fillna(data['age'].median())
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].median())

In [95]:
# check for missing values
data.isnull().sum()

Unnamed: 0                              0
SeriousDlqin2yrs                        0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64
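The median imputation used in this exercise can be checked on a toy frame. This is a minimal sketch: the values in `toy` are invented, and only the two column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same two columns as the exercise data
toy = pd.DataFrame({
    'age': [45.0, np.nan, 38.0, 30.0, np.nan],
    'NumberOfDependents': [2.0, 1.0, np.nan, 0.0, 0.0],
})

# Column medians; NaN values are skipped by default
medians = toy[['age', 'NumberOfDependents']].median()

# Passing a Series to fillna fills each column with its own median
toy_filled = toy.fillna(medians)

print(medians)
print(toy_filled.isnull().sum())
```

The median is less sensitive to outliers than the mean, which is one reason it is a common default for skewed variables such as age or income.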

In [17]:
data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [96]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.2

From the set of features, select the subset that maximizes the **F1 score** of the model, using K-Fold cross-validation.

In [87]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn import metrics

# simulate splitting a dataset of 25 observations into 10 folds
kf = KFold(n_splits=10, shuffle=False)

# print the contents of each training and testing set
# (the loop variables must not be called `data`, or the DataFrame is overwritten)
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, (train_idx, test_idx) in enumerate(kf.split(list(range(25))), start=1):
    print('{:^9} {} {:^25}'.format(str(iteration), str(train_idx), str(test_idx)))

# create the 10 folds for the real data (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10, shuffle=False)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make predictions for the testing set
    y_pred_class = logreg.predict(X_test)

    # calculate testing accuracy for this fold
    results.append(metrics.accuracy_score(y_test, y_pred_class))

Iteration                   Training set observations                   Testing set observations
    1     [ 3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]          [0 1 2]         
    2     [ 0  1  2  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]          [3 4 5]         
    3     [ 0  1  2  3  4  5  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]          [6 7 8]         
    4     [ 0  1  2  3  4  5  6  7  8 12 13 14 15 16 17 18 19 20 21 22 23 24]        [ 9 10 11]        
    5     [ 0  1  2  3  4  5  6  7  8  9 10 11 15 16 17 18 19 20 21 22 23 24]        [12 13 14]        
    6     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 17 18 19 20 21 22 23 24]          [15 16]         
    7     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 19 20 21 22 23 24]          [17 18]         
    8     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 21 22 23 24]          [19 20]         
    9     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 

In [71]:
from sklearn.model_selection import cross_val_score

logreg = LogisticRegression(C=1e9)

results = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')
print(results)

[0.93260716 0.93269571 0.93234148 0.93269571 0.93269571 0.93243004
 0.93295545 0.93295545 0.93330972 0.93286094]


In [36]:
pd.Series(results).describe()

count    10.000000
mean      0.932728
std       0.002739
min       0.929236
25%       0.931208
50%       0.931940
75%       0.935218
max       0.936498
dtype: float64

In [72]:
from sklearn.metrics import confusion_matrix, classification_report
# y_test and y_pred_class come from the last fold of the manual K-Fold loop
print(confusion_matrix(y_test, y_pred_class))


[[10483     6]
 [  793     9]]


In [23]:
y_test.value_counts()

0    10489
1      802
Name: SeriousDlqin2yrs, dtype: int64

In [None]:
The confusion matrix shows that the model is not classifying the observations correctly.
One explanation is that there are many more observations in the non-default class than in the default class, which
affects the estimation of the model.
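The effect of the imbalance can be made concrete by recomputing the usual metrics from the four cells of the confusion matrix printed in this section. This is only a worked arithmetic sketch; it also adds, for comparison, the accuracy of a trivial rule that always predicts class 0.

```python
import numpy as np

# Confusion matrix from the last fold: rows are true classes, columns are predictions
cm = np.array([[10483, 6],
               [793, 9]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
majority_baseline = (tn + fp) / cm.sum()   # accuracy of always predicting "no default"
precision = tp / (tp + fp)                 # share of predicted defaults that are real
recall = tp / (tp + fn)                    # share of real defaults that are detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy          {accuracy:.4f}")
print(f"majority baseline {majority_baseline:.4f}")
print(f"precision (1)     {precision:.4f}")
print(f"recall (1)        {recall:.4f}")
print(f"F1 (1)            {f1:.4f}")
```

The accuracy is barely above the majority baseline, while recall and F1 for the default class stay close to zero, consistent with the class 1 row of the classification report.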

In [57]:
print(classification_report(y_test, y_pred_class))


             precision    recall  f1-score   support

          0       0.93      1.00      0.96     10489
          1       0.60      0.01      0.02       802

avg / total       0.91      0.93      0.90     11291



In [None]:
The 10-fold cross-validation gave an accuracy above 0.9 in every one of the 10 folds, so at first glance the model seems to
fit the data well. However, the confusion matrix shows that the model does not predict the default class correctly: the
original data contain too few default cases, i.e. the classes of the response variable are imbalanced.
The false positive rate is 0.057%: of the 10,489 non-default observations only 6 were misclassified (this is the class
with far more data). The opposite holds for the default class, where the false negative rate is 98.9%: 793 of the 802
actual defaults are classified as non-default.
F1 is therefore a better measure when the classes are imbalanced. Here the F1 score of 0.02 for the default class shows
that defaulters are poorly classified, since the score is close to zero.
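The cells above evaluate only the full feature set, whereas the exercise asks for the subset that maximizes F1. One possible approach is greedy forward selection scored by cross-validated F1. The sketch below is an assumption about how that could be done, not the notebook's own method: it runs on synthetic imbalanced data from `make_classification` rather than the credit data, and the helper `cv_f1` is a name introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, imbalanced stand-in for the credit data (assumption, not the real set)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, n_informative=3,
                                     n_redundant=0, weights=[0.8, 0.2],
                                     random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)


def cv_f1(columns):
    """Mean F1 over the folds, using only the given feature columns."""
    scores = cross_val_score(model, X_demo[:, columns], y_demo, cv=cv, scoring='f1')
    return scores.mean()


selected, best_score = [], 0.0
remaining = list(range(X_demo.shape[1]))
while remaining:
    # Score every one-feature extension of the current subset
    scores = {j: cv_f1(selected + [j]) for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:
        break  # no candidate improves the mean F1, so stop
    selected.append(j_best)
    remaining.remove(j_best)
    best_score = scores[j_best]

print("selected columns:", selected, "mean F1: %.3f" % best_score)
```

Greedy selection does not guarantee the globally best subset, but it needs far fewer model fits than trying every combination.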


# Exercise 3.3

Now, which set of features is the best according to AUC?

In [97]:
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)
pt = np.get_printoptions()['threshold']
from sklearn.feature_selection import VarianceThreshold




In [77]:
# drop features whose variance is not above 0.16, the variance of a 0/1 variable with p = 0.8
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
feature_set = sel.fit_transform(X)
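To see what this filter does, the sketch below applies the same threshold to a small invented matrix. A binary column that is 0 in seven of eight rows has variance 1/8 × 7/8 ≈ 0.11, which is under 0.16, so it is dropped, as is a nearly constant column. The array `X_toy` is an assumption made up for this illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Columns: rare binary flag, widely spread values, nearly constant values
X_toy = np.array([
    [0, 1.0, 10.0],
    [0, 2.0, 10.1],
    [0, 3.0, 9.9],
    [0, 4.0, 10.0],
    [0, 5.0, 10.0],
    [0, 6.0, 10.1],
    [0, 7.0, 9.9],
    [1, 8.0, 10.0],
])

sel_toy = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_kept = sel_toy.fit_transform(X_toy)

print(sel_toy.variances_)     # per-column variance seen during fit
print(sel_toy.get_support())  # True for the columns that are kept
```

Because the filter looks only at the features and never at the target, it can remove uninformative columns but cannot rank features by predictive value.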

In [99]:
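# evaluate nested feature subsets: columns i..end, skipping the index column and the target
y = data.SeriousDlqin2yrs
for i in range(2, len(data.columns)):
    X = data.iloc[:, i:]
    logreg = LogisticRegression(C=1e9)
    results = cross_val_score(logreg, X, y, cv=3, scoring='roc_auc')
    # the label is the number of variables kept in this subset
    print("Variable %s-> ROC_AUC: %s" % (len(data.columns) - i, pd.Series(results).mean()))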

Variable 10-> ROC_AUC: 0.6898472400269444
Variable 9-> ROC_AUC: 0.6912329985465466
Variable 8-> ROC_AUC: 0.655121846476556
Variable 7-> ROC_AUC: 0.6076744672870932
Variable 6-> ROC_AUC: 0.6094700005482953
Variable 5-> ROC_AUC: 0.581782175627945
Variable 4-> ROC_AUC: 0.585609324772863
Variable 3-> ROC_AUC: 0.5900729374608203
Variable 2-> ROC_AUC: 0.5919118789920467
Variable 1-> ROC_AUC: 0.5477952944963926


In [None]:
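Among the nested subsets tested, the best model is the one with 9 variables (all features except RevolvingUtilizationOfUnsecuredLines), since it has the highest AUC (0.6912).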
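The loop above only compares subsets formed by dropping leading columns, so most combinations are never scored. When the number of features is small, every non-empty subset can be evaluated instead. The following is a hedged sketch on synthetic data, not a rerun of the exercise: `X_demo`, `y_demo` and the four-feature setup are assumptions chosen to keep the example fast.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with 4 features (assumption, not the credit data)
X_demo, y_demo = make_classification(n_samples=500, n_features=4, n_informative=2,
                                     n_redundant=0, random_state=0)
model = LogisticRegression(max_iter=1000)

auc_by_subset = {}
n_features = X_demo.shape[1]
for k in range(1, n_features + 1):
    for cols in combinations(range(n_features), k):
        # Mean cross-validated ROC AUC for this particular subset of columns
        scores = cross_val_score(model, X_demo[:, list(cols)], y_demo,
                                 cv=3, scoring='roc_auc')
        auc_by_subset[cols] = scores.mean()

best_cols = max(auc_by_subset, key=auc_by_subset.get)
print("best subset:", best_cols, "mean AUC: %.4f" % auc_by_subset[best_cols])
```

With 10 features there are 1,023 non-empty subsets, which is still feasible for logistic regression but grows exponentially as features are added.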