# Exercise 03

## Data preparation and model evaluation exercise with credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which estimate the probability of default, are the method banks use to decide whether to grant a loan. This competition asks participants to improve on the state of the art in credit scoring by predicting the probability that a borrower will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans (installment, such as a car loan or mortgage) and lines of credit (such as credit cards) |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


Read the data into Pandas

In [2]:
import pandas as pd
pd.set_option('display.max_columns', 500)
import zipfile
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    # read_csv is the public pandas entry point for comma-separated files
    data = pd.read_csv(f)

data.head()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
0,0,1,0.766127,45.0,2.0,0.802982,9120.0,13.0,0.0,6.0,0.0,2.0
1,1,0,0.957151,40.0,0.0,0.121876,2600.0,4.0,0.0,0.0,0.0,1.0
2,2,0,0.65818,38.0,1.0,0.085113,3042.0,2.0,1.0,0.0,0.0,0.0
3,3,0,0.23381,30.0,0.0,0.03605,3300.0,5.0,0.0,0.0,0.0,0.0
4,4,0,0.907239,49.0,1.0,0.024926,63588.0,7.0,0.0,1.0,0.0,0.0


In [24]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.1

Impute the missing values of age and NumberOfDependents.

## Missing value count

In [26]:
# check for missing values
X.isnull().sum()

Unnamed: 0                                 0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

## Imputing "age" with the median

In [27]:
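# Assign the result back rather than calling fillna(inplace=True) on a column,
# which recent pandas versions warn about
X['age'] = X['age'].fillna(X['age'].median())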

## Imputing "NumberOfDependents" with the mean

In [28]:
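# Fill with the mean of NumberOfDependents itself, not the mean of age
X['NumberOfDependents'] = X['NumberOfDependents'].fillna(X['NumberOfDependents'].mean())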

## Verifying the imputation: NA count

In [29]:
X.isnull().sum()

Unnamed: 0                              0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64
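
The imputation steps above can be checked on a small frame where the expected fill values are easy to compute by hand. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the Kaggle data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for X; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [45.0, 40.0, np.nan, 30.0, 49.0],
    "NumberOfDependents": [2.0, 1.0, 0.0, np.nan, 0.0],
})

# Fill each column with a statistic computed from that same column.
toy["age"] = toy["age"].fillna(toy["age"].median())
toy["NumberOfDependents"] = toy["NumberOfDependents"].fillna(
    toy["NumberOfDependents"].mean()
)

# Count the missing values that remain across the whole frame.
missing_after = int(toy.isnull().sum().sum())
print(toy)
print("missing values left:", missing_after)
```

The missing age becomes 42.5 (the median of 30, 40, 45, 49) and the missing dependents value becomes 0.75 (the mean of 2, 1, 0, 0). A mean-imputed count is no longer an integer, so a median or mode may be preferable for NumberOfDependents depending on the model.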

# Exercise 3.2

From the set of features, select those that maximize the **F1 score** of the model using K-fold cross-validation.

In [None]:
y = data['SeriousDlqin2yrs']
# X is reused from Exercise 3.1: rebuilding it with data.drop(...) here would discard the imputed values

## Loading libraries

In [81]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, auc
# sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score, cross_val_predict

## Train/test split

In [61]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Initializing the logistic regression

In [70]:
# C is the inverse of the regularization strength, so C=1e9 makes the L2 penalty negligible
logreg = LogisticRegression(C=1e9)
logreg

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

## K-fold cross-validation using the F1 score

In [62]:
results = cross_val_score(logreg, X_train, y_train, cv=10, scoring='f1')
results

array([0.01034483, 0.02405498, 0.02076125, 0.01730104, 0.01393728,
       0.01718213, 0.01032702, 0.03407155, 0.02409639, 0.01038062])

## Summary statistics of the K-fold F1 scores

In [72]:
pd.Series(results).describe()

count    10.000000
mean      0.018246
std       0.007691
min       0.010327
25%       0.011270
50%       0.017242
75%       0.023232
max       0.034072
dtype: float64
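
To make explicit what `cross_val_score(..., scoring='f1')` computes, the sketch below reproduces it with a manual loop. It assumes synthetic data from `make_classification` as a stand-in for the credit data; the names `X_demo`, `y_demo`, `manual_scores`, and `library_scores` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the credit data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3,
    weights=[0.9, 0.1], random_state=0,
)
model = LogisticRegression(max_iter=1000)

# For a classifier, an integer cv is turned into unshuffled stratified folds.
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    # Fit a fresh copy on the training folds, score on the held-out fold.
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_scores.append(f1_score(y_demo[val_idx], fold_pred))

library_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
print(np.round(manual_scores, 4))
print(np.round(library_scores, 4))
```

The F1 score in each fold is computed from hard predictions at the default 0.5 probability threshold, so on imbalanced data a model that rarely predicts the positive class will score low even when its ranking of borrowers is reasonable.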

## Prediction

In [71]:
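# Out-of-fold predictions: each test-set row is predicted by a model fitted on the other nine folds of the test set
predicted = cross_val_predict(logreg, X_test, y_test, cv=10)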
predicted

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

## F1 SCORE

In [83]:
metrics.f1_score(y_test, predicted)

0.026625704045058884
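
The cells above score the full feature set only. One way to search for a better subset is greedy forward selection, sketched below. This is an assumed approach rather than the exercise's reference solution: `forward_select` is a hypothetical helper, and the data comes from `make_classification` with made-up column names `f0` to `f4`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def forward_select(model, X, y, scoring="f1", cv=5):
    """Greedy forward selection: repeatedly add the feature that most improves the CV score."""
    remaining = list(X.columns)
    selected, best_score = [], float("-inf")
    while remaining:
        # Mean CV score of the current subset extended by each candidate feature.
        scores = {
            col: cross_val_score(
                model, X[selected + [col]], y, cv=cv, scoring=scoring
            ).mean()
            for col in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return selected, best_score


# Synthetic stand-in data with hypothetical column names.
X_arr, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

selected, score = forward_select(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(selected, round(score, 4))
```

Passing `scoring="roc_auc"` instead of `"f1"` gives the corresponding search for Exercise 3.3. Greedy selection does not guarantee the globally best subset; scikit-learn's `SequentialFeatureSelector` in `sklearn.feature_selection` offers a packaged version of the same idea.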

# Exercise 3.3

Now, which set of features is best when selected by AUC?

## K-fold cross-validation using AUC

In [77]:
results_2 = cross_val_score(logreg, X_train, y_train, cv=10, scoring='roc_auc')
results_2

array([0.62820207, 0.65092347, 0.63461797, 0.6359819 , 0.63380247,
       0.63530096, 0.6449723 , 0.65428401, 0.62758971, 0.63796053])

## Summary statistics of the K-fold AUC scores

In [78]:
pd.Series(results_2).describe()

count    10.000000
mean      0.638364
std       0.008971
min       0.627590
25%       0.634006
50%       0.635641
75%       0.643219
max       0.654284
dtype: float64

## Prediction

In [85]:
# Predicted classes
predicted_2 = cross_val_predict(logreg, X_test, y_test, cv=10)
print(predicted_2)

# Predicted probabilities (column 0: class 0, column 1: class 1)
predicted_2_prob = cross_val_predict(logreg, X_test, y_test, cv=10, method='predict_proba')
print(predicted_2_prob)


[0 0 0 ... 0 0 0]
[[0.93950153 0.06049847]
 [0.91250264 0.08749736]
 [0.98391764 0.01608236]
 ...
 [0.89174169 0.10825831]
 [0.90032064 0.09967936]
 [0.94597962 0.05402038]]


## ROC AUC

In [98]:
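# roc_auc_score expects scores or probabilities; hard class labels are passed here,
# so the value reflects a single 0.5 threshold. predicted_2_prob[:, 1] holds the positive-class probabilities.
auc_score = metrics.roc_auc_score(y_test, predicted_2)
auc_score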

0.5065450511059741
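
The value above is close to 0.5 because it was computed from hard class labels rather than probabilities. The sketch below uses a handful of made-up labels and probabilities (not taken from the credit data) to show how much the two inputs to `roc_auc_score` can differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented example: 6 negatives, 2 positives, all probabilities below 0.5.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.08, 0.30, 0.12, 0.20, 0.25, 0.40])

# Thresholding at 0.5 turns every prediction into class 0.
labels = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)    # uses the full ranking
auc_from_labels = roc_auc_score(y_true, labels)  # ranking information is lost
print(auc_from_proba, auc_from_labels)
```

With probabilities, 11 of the 12 positive/negative pairs are ranked correctly, giving an AUC of about 0.917; with the thresholded labels every pair is tied and the AUC is 0.5. In the notebook, the equivalent change would be to pass `predicted_2_prob[:, 1]` as the second argument.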