## Nicolas Leguizamón, student ID: 201727960
## Leidy Araque Molina, student ID: 201727196

# Exercise 03

## Data preparation and model evaluation exercise with credit scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit. 

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years. [Dataset](https://www.kaggle.com/c/GiveMeSomeCredit)

Attribute Information:

|Variable Name	|	Description	|	Type|
|----|----|----|
|SeriousDlqin2yrs	|	Person experienced 90 days past due delinquency or worse 	|	Y/N|
|RevolvingUtilizationOfUnsecuredLines	|	Total balance on credit divided by the sum of credit limits	|	percentage|
|age	|	Age of borrower in years	|	integer|
|NumberOfTime30-59DaysPastDueNotWorse	|	Number of times borrower has been 30-59 days past due |	integer|
|DebtRatio	|	Monthly debt payments, alimony, and living costs divided by monthly gross income	|	percentage|
|MonthlyIncome	|	Monthly income	|	real|
|NumberOfOpenCreditLinesAndLoans	|	Number of open loans (installment, such as a car loan or mortgage) and lines of credit (such as credit cards) |	integer|
|NumberOfTimes90DaysLate	|	Number of times borrower has been 90 days or more past due.	|	integer|
|NumberRealEstateLoansOrLines	|	Number of mortgage and real estate loans	|	integer|
|NumberOfTime60-89DaysPastDueNotWorse	|	Number of times borrower has been 60-89 days past due |integer|
|NumberOfDependents	|	Number of dependents in family	|	integer|


Read the data into Pandas

In [8]:
# Libraries for data handling
import pandas as pd
import numpy as np
import zipfile

# Show up to 500 columns when displaying a DataFrame
pd.set_option('display.max_columns', 500)

# Read the CSV file directly from the zip archive
with zipfile.ZipFile('../datasets/KaggleCredit2.csv.zip', 'r') as z:
    f = z.open('KaggleCredit2.csv')
    data = pd.read_csv(f)

data.tail()

Unnamed: 0.1,Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
112910,112910,0,0.385742,50.0,0.0,0.404293,3400.0,7.0,0.0,0.0,0.0,0.0
112911,112911,0,0.040674,74.0,0.0,0.225131,2100.0,4.0,0.0,1.0,0.0,0.0
112912,112912,0,0.299745,44.0,0.0,0.716562,5584.0,4.0,0.0,1.0,0.0,2.0
112913,112913,0,0.0,30.0,0.0,0.0,5716.0,4.0,0.0,0.0,0.0,0.0
112914,112914,0,0.850283,64.0,0.0,0.249908,8158.0,8.0,0.0,2.0,0.0,0.0
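As a side note, pandas can usually read a single-file zip archive without opening it through `zipfile` first. The following is a minimal sketch under that assumption; the archive, file name, and values are made up so that it runs without the course dataset.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive so the sketch runs without the course dataset.
# The file names and values below are made up for illustration.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "toy_credit.csv.zip")
with zipfile.ZipFile(zip_path, "w") as z:
    z.writestr("toy_credit.csv", ",SeriousDlqin2yrs,age\n0,0,50.0\n1,1,\n")

# pandas infers zip compression from the extension when the archive
# holds exactly one file; index_col=0 reuses the saved index column
# instead of creating an "Unnamed: 0" column.
toy = pd.read_csv(zip_path, index_col=0)
print(toy)
```

Passing `index_col=0` is one way to avoid carrying a saved index around as an extra `Unnamed: 0` column.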


In [9]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.1

Impute the missing values of Age and Number of Dependents

**Missing Values**

Count the number of records with missing values in each variable.

In [10]:
data.isnull().sum()

Unnamed: 0                                 0
SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

Fill the missing values with the mean of each variable.

In [11]:
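# Assign the result back instead of using inplace=True on a column,
# which may not update the DataFrame in recent pandas versions
data['age'] = data['age'].fillna(data['age'].mean())
data['NumberOfDependents'] = data['NumberOfDependents'].fillna(data['NumberOfDependents'].mean())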

In [12]:
data.isnull().sum()

Unnamed: 0                              0
SeriousDlqin2yrs                        0
RevolvingUtilizationOfUnsecuredLines    0
age                                     0
NumberOfTime30-59DaysPastDueNotWorse    0
DebtRatio                               0
MonthlyIncome                           0
NumberOfOpenCreditLinesAndLoans         0
NumberOfTimes90DaysLate                 0
NumberRealEstateLoansOrLines            0
NumberOfTime60-89DaysPastDueNotWorse    0
NumberOfDependents                      0
dtype: int64
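Filling the gaps with a mean computed over the whole dataset uses information from rows that later serve as test folds. One way to avoid this is to refit the imputation inside each training fold. The sketch below assumes scikit-learn's `SimpleImputer` and `make_pipeline`; the toy data and the `toy_X`, `toy_y` names are hypothetical and are not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical values) with gaps in both columns.
rng = np.random.RandomState(0)
toy_X = pd.DataFrame({
    "age": rng.randint(21, 80, size=200).astype(float),
    "NumberOfDependents": rng.randint(0, 4, size=200).astype(float),
})
toy_X.iloc[::10, 0] = np.nan   # every 10th age is missing
toy_X.iloc[5::10, 1] = np.nan  # every 10th dependents value is missing
toy_y = pd.Series(rng.randint(0, 2, size=200))

# The imputer is refit on each training fold, so the fold means never
# see the rows held out for testing.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      LogisticRegression(max_iter=1000))
scores = cross_val_score(model, toy_X, toy_y, cv=5, scoring="f1")
print(scores.mean())
```

With only a small share of values missing, the difference from global-mean imputation is likely small, but the pipeline keeps the evaluation free of that leak.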

In [13]:
y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs', axis=1)

# Exercise 3.2

From the set of features, select those that maximize the model's **F1 score** using K-Fold cross-validation.

**Model 1**
* age
* NumberOfTime30-59DaysPastDueNotWorse
* DebtRatio
* MonthlyIncome
* NumberOfOpenCreditLinesAndLoans
* NumberOfTimes90DaysLate
* NumberRealEstateLoansOrLines
* NumberOfTime60-89DaysPastDueNotWorse
* NumberOfDependents

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define X and y
feature_cols = ['age', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']
X = data[feature_cols]
y = data.SeriousDlqin2yrs

In [15]:
from sklearn import metrics
from sklearn.model_selection import KFold

# Create 10 folds (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make class predictions for the testing fold
    y_pred_class = logreg.predict(X_test)

    # calculate the F1 score on the testing fold
    results.append(metrics.f1_score(y_test, y_pred_class))



In [16]:
pd.Series(results).describe()

count    10.000000
mean      0.041643
std       0.026574
min       0.014815
25%       0.020834
50%       0.028967
75%       0.066970
max       0.080952
dtype: float64

This yields a mean F1 score of 0.042 and a maximum of 0.081.
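The fold-to-fold spread in the summary above (std 0.027) may partly reflect how many positive cases land in each test fold, since unshuffled K-fold does not control for this. The sketch below uses synthetic labels (an assumption, not the credit data) to compare `KFold` with scikit-learn's `StratifiedKFold`, which keeps the class proportions the same in every fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels (an assumption for illustration): 5% positives,
# all clustered at the end of the array.
y_toy = np.array([0] * 950 + [1] * 50)
X_toy = np.zeros((1000, 1))  # features are irrelevant for splitting

def positives_per_fold(splitter):
    # Count positive labels in each test fold.
    return [int(y_toy[test].sum()) for _, test in splitter.split(X_toy, y_toy)]

plain = positives_per_fold(KFold(n_splits=10))
stratified = positives_per_fold(StratifiedKFold(n_splits=10))
print(plain)       # positives concentrated in the last fold
print(stratified)  # 5 positives in every fold
```

The clustering here is deliberately extreme; on real data the imbalance between folds would usually be milder.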

**Model 2**

* RevolvingUtilizationOfUnsecuredLines
* NumberOfTime30-59DaysPastDueNotWorse
* DebtRatio
* NumberOfOpenCreditLinesAndLoans
* NumberOfTimes90DaysLate
* NumberOfTime60-89DaysPastDueNotWorse
* NumberOfDependents

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define X and y
feature_cols = ['RevolvingUtilizationOfUnsecuredLines', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']
X = data[feature_cols]
y = data.SeriousDlqin2yrs

In [18]:
from sklearn import metrics
from sklearn.model_selection import KFold

# Create 10 folds (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make class predictions for the testing fold
    y_pred_class = logreg.predict(X_test)

    # calculate the F1 score on the testing fold
    results.append(metrics.f1_score(y_test, y_pred_class))

In [19]:
pd.Series(results).describe()

count    10.000000
mean      0.076070
std       0.010957
min       0.059102
25%       0.070559
50%       0.075317
75%       0.080384
max       0.097800
dtype: float64

**Model 3**

* RevolvingUtilizationOfUnsecuredLines
* age
* DebtRatio
* MonthlyIncome
* NumberRealEstateLoansOrLines
* NumberOfDependents

In [20]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define X and y
feature_cols = ['RevolvingUtilizationOfUnsecuredLines', 'age', 'DebtRatio', 'MonthlyIncome', 'NumberRealEstateLoansOrLines', 'NumberOfDependents']
X = data[feature_cols]
y = data.SeriousDlqin2yrs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [21]:
from sklearn import metrics
from sklearn.model_selection import KFold

# Create 10 folds (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make class predictions for the testing fold
    y_pred_class = logreg.predict(X_test)

    # calculate the F1 score on the testing fold
    results.append(metrics.f1_score(y_test, y_pred_class))



In [22]:
pd.Series(results).describe()

count    10.0
mean      0.0
std       0.0
min       0.0
25%       0.0
50%       0.0
75%       0.0
max       0.0
dtype: float64

The 10 fold models are run and their F1 scores are stored. Every fold scores 0, which means Model 3 produces no true positives in any test fold.
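A classifier that never predicts the positive class produces exactly this pattern. The sketch below uses a small hypothetical fold (the labels are made up) to show the F1 score, and the AUC computed from hard labels, for all-negative predictions.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical test fold: 3 defaults among 10 borrowers.
y_true = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 1])
# A classifier that never predicts the positive class.
y_pred = np.zeros_like(y_true)

# zero_division=0 silences the undefined-precision warning and returns 0.
f1 = f1_score(y_true, y_pred, zero_division=0)
# AUC computed from constant predictions carries no ranking information.
auc = roc_auc_score(y_true, y_pred)
print(f1, auc)  # 0.0 0.5
```

This does not prove that Model 3 predicts only negatives, but it is the simplest explanation consistent with a score of 0 in all ten folds.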

**Model 4**

* age
* NumberOfTime30-59DaysPastDueNotWorse
* MonthlyIncome
* NumberOfOpenCreditLinesAndLoans
* NumberRealEstateLoansOrLines
* NumberOfTime60-89DaysPastDueNotWorse
* NumberOfDependents

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define X and y
feature_cols = ['age', 'NumberOfTime30-59DaysPastDueNotWorse', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']
X = data[feature_cols]
y = data.SeriousDlqin2yrs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [24]:
from sklearn import metrics
from sklearn.model_selection import KFold

# Create 10 folds (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make class predictions for the testing fold
    y_pred_class = logreg.predict(X_test)

    # calculate the F1 score on the testing fold
    results.append(metrics.f1_score(y_test, y_pred_class))

In [25]:
pd.Series(results).describe()

count    10.000000
mean      0.027214
std       0.014583
min       0.014815
25%       0.020261
50%       0.022263
75%       0.030027
max       0.065421
dtype: float64

Of the models tested, Model 2 maximizes the F1 score, with a mean of 0.076 across folds.
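The same comparison can be written more compactly by looping over the candidate feature sets with `cross_val_score`. This is only a sketch on synthetic data: the column names `f0`..`f5`, the `candidate_sets` dictionary, and the `mean_cv_score` helper are hypothetical, and the model uses default regularization rather than `C=1e9`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; column names f0..f5 are hypothetical.
X_arr, y_arr = make_classification(n_samples=500, n_features=6,
                                   n_informative=3, random_state=0)
toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

candidate_sets = {
    "set_a": ["f0", "f1", "f2"],
    "set_b": ["f3", "f4", "f5"],
    "set_c": ["f0", "f1", "f2", "f3", "f4", "f5"],
}

def mean_cv_score(columns, scoring):
    # 10-fold cross-validation without shuffling.
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, toy[columns], y_arr,
                             cv=KFold(n_splits=10), scoring=scoring)
    return scores.mean()

summary = pd.Series({name: mean_cv_score(cols, "f1")
                     for name, cols in candidate_sets.items()})
print(summary.sort_values(ascending=False))
```

Changing `scoring` to `"roc_auc"` would rank the same sets by AUC computed from predicted scores rather than hard labels.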

# Exercise 3.3

Now, which set of features is best according to AUC?

**Model 1**
* age
* NumberOfTime30-59DaysPastDueNotWorse
* DebtRatio
* MonthlyIncome
* NumberOfOpenCreditLinesAndLoans
* NumberOfTimes90DaysLate
* NumberRealEstateLoansOrLines
* NumberOfTime60-89DaysPastDueNotWorse
* NumberOfDependents

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define X and y
feature_cols = ['age', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']
X = data[feature_cols]
y = data.SeriousDlqin2yrs

In [27]:
from sklearn import metrics
from sklearn.model_selection import KFold

# Create 10 folds (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make class predictions for the testing fold
    y_pred_class = logreg.predict(X_test)

    # calculate AUC on the testing fold (from hard class labels, not probabilities)
    results.append(metrics.roc_auc_score(y_test, y_pred_class))

In [28]:
pd.Series(results).describe()

count    10.000000
mean      0.510376
std       0.006855
min       0.503388
25%       0.505011
50%       0.507142
75%       0.516863
max       0.520519
dtype: float64

This yields a mean AUC of 0.510 and a maximum of 0.521.

**Model 2**

* RevolvingUtilizationOfUnsecuredLines
* NumberOfTime30-59DaysPastDueNotWorse
* DebtRatio
* NumberOfOpenCreditLinesAndLoans
* NumberOfTimes90DaysLate
* NumberOfTime60-89DaysPastDueNotWorse
* NumberOfDependents

In [29]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define X and y
feature_cols = ['RevolvingUtilizationOfUnsecuredLines', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']
X = data[feature_cols]
y = data.SeriousDlqin2yrs

In [30]:
from sklearn import metrics
from sklearn.model_selection import KFold

# Create 10 folds (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make class predictions for the testing fold
    y_pred_class = logreg.predict(X_test)

    # calculate AUC on the testing fold (from hard class labels, not probabilities)
    results.append(metrics.roc_auc_score(y_test, y_pred_class))

In [31]:
pd.Series(results).describe()

count    10.000000
mean      0.519325
std       0.003087
min       0.514513
25%       0.517680
50%       0.519171
75%       0.520654
max       0.525411
dtype: float64

**Model 3**

* RevolvingUtilizationOfUnsecuredLines
* age
* DebtRatio
* MonthlyIncome
* NumberRealEstateLoansOrLines
* NumberOfDependents

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define X and y
feature_cols = ['RevolvingUtilizationOfUnsecuredLines', 'age', 'DebtRatio', 'MonthlyIncome', 'NumberRealEstateLoansOrLines', 'NumberOfDependents']
X = data[feature_cols]
y = data.SeriousDlqin2yrs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [33]:
from sklearn import metrics
from sklearn.model_selection import KFold

# Create 10 folds (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make class predictions for the testing fold
    y_pred_class = logreg.predict(X_test)

    # calculate AUC on the testing fold (from hard class labels, not probabilities)
    results.append(metrics.roc_auc_score(y_test, y_pred_class))

In [34]:
pd.Series(results).describe()

count    10.0
mean      0.5
std       0.0
min       0.5
25%       0.5
50%       0.5
75%       0.5
max       0.5
dtype: float64

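The 10 fold models are run and their AUC scores are stored. Every fold scores exactly 0.5, which is consistent with the model predicting the same class for every record.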

**Model 4**

* age
* NumberOfTime30-59DaysPastDueNotWorse
* MonthlyIncome
* NumberOfOpenCreditLinesAndLoans
* NumberRealEstateLoansOrLines
* NumberOfTime60-89DaysPastDueNotWorse
* NumberOfDependents

In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# define X and y
feature_cols = ['age', 'NumberOfTime30-59DaysPastDueNotWorse', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents']
X = data[feature_cols]
y = data.SeriousDlqin2yrs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [36]:
from sklearn import metrics
from sklearn.model_selection import KFold

# Create 10 folds (no shuffling, so random_state is not needed)
kf = KFold(n_splits=10)

results = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model (large C means almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train, y_train)

    # make class predictions for the testing fold
    y_pred_class = logreg.predict(X_test)

    # calculate AUC on the testing fold (from hard class labels, not probabilities)
    results.append(metrics.roc_auc_score(y_test, y_pred_class))

In [37]:
pd.Series(results).describe()

count    10.000000
mean      0.506592
std       0.003679
min       0.503388
25%       0.504871
50%       0.505349
75%       0.507401
max       0.516217
dtype: float64

Of the models tested, Model 2 maximizes the AUC score, with a mean of 0.519 across folds.
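The AUC values in this exercise are computed from hard class predictions (`predict`), which evaluates a single decision threshold. AUC is more commonly computed from predicted probabilities, so that the full ranking of borrowers is assessed. The sketch below contrasts the two on synthetic, imbalanced data; the data and variable names are assumptions for illustration, and the resulting numbers say nothing about the credit dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 10% positives).
X_toy, y_toy = make_classification(n_samples=2000, n_features=5,
                                   n_informative=3, weights=[0.9, 0.1],
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy,
                                          random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels evaluates a single threshold (0.5).
auc_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from predicted probabilities evaluates the full ranking.
auc_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_labels, auc_proba)
```

On imbalanced data the probability-based value is typically the higher of the two, so rerunning the four feature sets with `predict_proba` could change both the scores and, possibly, the ranking.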