<img src='https://weclouddata.com/wp-content/uploads/2016/11/logo.png' width='30%'>
-------------

<h3 align='center'> Machine Learning Hands-on Workshop </h3>
<h1 align='center'> Home Credit Default Risk Kaggle Competition </h1>

<br>
<center align="left"> <h4>Presented by:</h4> </center>
<center align="left"> <h4>Vanessa Feng, WeCloudData Academy</h4> </center>

----------

[Home Credit Default Risk Kaggle Competition](https://www.kaggle.com/c/home-credit-default-risk): Can you predict how capable each applicant is of repaying a loan?

# Content:
- [Problem analysis](#Problem-analysis)
- [Methodologies](#Methodologies)
    - [Building blocks of a Machine Learning algorithm](#Building-blocks-of-a-Machine-Learning-algorithm)
    - [Attempt 1](#Attempt-1)
        - [Data preparation](#Data-preparation)
        - [Model training](#Model-training)
        - [Model evaluation](#Model-evaluation)
        - [Cross validation](#Cross-validation)
        - [Model tuning](#Model-tuning)
        - [Test data prediction](#Test-data-prediction)
    - [Attempt 2](#Attempt-2)
        - [Data preparation: multiple datasets](#Data-preparation:-multiple-datasets)

# Problem analysis
### Understand the problem
Read the competition overview at https://www.kaggle.com/c/home-credit-default-risk#description.
We are trying to predict whether each credit applicant will have payment difficulties.

### Understand the data
Read the dataset description at https://www.kaggle.com/c/home-credit-default-risk/data.

In [None]:
# take a peek at the data files using a shell command
!head -n 10 data/installments_payments.csv

In [None]:
# take a peek at the data files using pandas
import pandas as pd
filename = 'data/application_train.csv'
df = pd.read_csv(filename)
df.head(20) # print out the first 20 rows

In [None]:
print(df.dtypes) # check the data type of each column

In [None]:
# understand the target distribution (raw count)
df[['SK_ID_CURR', 'TARGET']].groupby('TARGET').count()

In [None]:
# understand the target distribution (ratio)
df[['SK_ID_CURR', 'TARGET']].groupby('TARGET').count()/df.shape[0]

#### Data files relationship
<img src='dataset.png' width='100%'>

<hr></hr>

# Methodologies

## Building blocks of a Machine Learning algorithm
<img src='training.png' width='80%'>

<hr></hr>
    
### Phase 2: validation time

<img src='validation.png' width='80%'>

## Attempt 1
Use one single dataset, basic feature extraction, and a single model to begin with.

### Data preparation

In [None]:
# Use `application_{train|test}.csv` only
# Take a look at the data fields in this file
filename = 'data/HomeCredit_columns_description.csv'
desc_df = pd.read_csv(filename, encoding = "ISO-8859-1")
desc_df[desc_df['Table'] == 'application_{train|test}.csv']

In [None]:
# the target distribution of the entire dataset
def get_target_dist(df):
    rows = df.shape[0]
    target_dist_df = df[['SK_ID_CURR', 'TARGET']].groupby('TARGET').count()
    target_dist_df['PERCENT'] = target_dist_df['SK_ID_CURR']*100/rows
    return target_dist_df

In [None]:
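# let's check the target distribution across the entire dataset
# (reset `filename`: it still points at the column description file from the cell above)
filename = 'data/application_train.csv'
train_df = pd.read_csv(filename)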
rows = train_df.shape[0] # total number of rows in the data
print(f'total rows: {rows}')
print(get_target_dist(train_df))

In [None]:
# let's just use a subset of this dataset by reading the first 10000 rows
train_df = pd.read_csv(filename, nrows=10000)
rows = train_df.shape[0] # total number of rows in the data
print(f'total rows: {rows}')

In [None]:
# double check the target distribution
print(get_target_dist(train_df))

In [None]:
# better approach: randomly sample the entire dataset, not just take the first n rows
filename = 'data/application_train.csv'
train_df = pd.read_csv(filename)
train_df = train_df.sample(n=10000)
rows = train_df.shape[0] # total number of rows in the data
print(f'total rows: {rows}') 
print(get_target_dist(train_df))

In [None]:
nan_cols = train_df.columns[train_df.isnull().any()] # find out columns with any null value
nan_cnt = train_df[nan_cols].isnull().sum()
print(nan_cnt)

In [None]:
import numpy as np

y = [] # collect targets
data = [] # data (all columns except the target)

target_col = 'TARGET'
features = list([x for x in train_df.columns if x != target_col])

for row in train_df.to_dict('records'):
    y.append(row[target_col])
    data.append({k: row[k] for k in features})
    
y = np.array(y)

In [None]:
# training-validation split
# `sklearn.cross_validation` was removed in scikit-learn 0.20; use `model_selection` instead
from sklearn.model_selection import train_test_split

data_train, data_val, y_train, y_val = train_test_split(data, y, train_size=0.8, stratify=y)
print(f'data_train cnt: {len(data_train)}')
print(f'data_val cnt: {len(data_val)}')

In [None]:
from collections import defaultdict

def get_y_dist(y):
    target2cnt = defaultdict(int)
    for yi in y:
        target2cnt[yi] += 1
    
    print('target\tcnt\tratio')
    for target in sorted(target2cnt):
        cnt = target2cnt[target]
        print(f'{target}\t{cnt}\t{cnt/len(y)}')

In [None]:
print('target distribution in training data')
get_y_dist(y_train)

print('\ntarget distribution in validation data')
get_y_dist(y_val)

In [None]:
from sklearn.feature_extraction import DictVectorizer

# transform list of dictionary into numpy/scipy matrix
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(data_train)
print(f'after vectorization: {X_train.shape}')

In [None]:
# inspect feature names
for i, feature in enumerate(vectorizer.feature_names_):
    print(f'{i}\t{feature}')

In [None]:
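# `sklearn.preprocessing.Imputer` was removed in scikit-learn 0.22; `SimpleImputer` replaces it
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler

# for each feature, fill in nan values with the mean value across all samples
imputer = SimpleImputer(strategy='mean')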
X_train = imputer.fit_transform(X_train)

# scaling data by columns so different features have roughly the same magnitude
scaler = MaxAbsScaler()
X_train = scaler.fit_transform(X_train.toarray())
print(f'X_train data type: {type(X_train)}')
print(f'X_train: {X_train.shape}')

In [None]:
# IMPORTANT: use the same set of preprocessors (vectorizer, imputer, and scaler) to transform the validation/test data
X_val = vectorizer.transform(data_val)
X_val = imputer.transform(X_val)
X_val = scaler.transform(X_val)
print(f'X_val data type: {type(X_val)}')
print(f'X_val: {X_val.shape}')
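
One way to make sure that validation and test data always go through the same fitted preprocessors is to chain them in a scikit-learn `Pipeline`. The sketch below is a minimal illustration under assumptions: the three records are made up (only the field names mirror the application data), and the names `preprocess`, `X_train_demo`, and `X_val_demo` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Made-up records for illustration; only the field names mirror the application data
train_records = [
    {'AMT_INCOME_TOTAL': 100.0, 'AMT_CREDIT': 400.0, 'CODE_GENDER': 'F'},
    {'AMT_INCOME_TOTAL': 200.0, 'AMT_CREDIT': np.nan, 'CODE_GENDER': 'M'},
    {'AMT_INCOME_TOTAL': 300.0, 'AMT_CREDIT': 800.0, 'CODE_GENDER': 'F'},
]
val_records = [{'AMT_INCOME_TOTAL': 150.0, 'AMT_CREDIT': np.nan, 'CODE_GENDER': 'M'}]

# Chain the three preprocessors so they are always applied in the same order
preprocess = make_pipeline(DictVectorizer(), SimpleImputer(strategy='mean'), MaxAbsScaler())

# fit_transform learns feature names, column means, and column maxima from training data
X_train_demo = preprocess.fit_transform(train_records)
# transform reuses those learned values; nothing is re-fit on the validation data
X_val_demo = preprocess.transform(val_records)

print(preprocess.named_steps['dictvectorizer'].feature_names_)
print(X_val_demo.toarray())
```

The missing `AMT_CREDIT` in the validation record is filled with the training mean and scaled by the training maximum, which is exactly the behavior the note above asks for.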

### Model training

In [None]:
from sklearn.linear_model import LogisticRegression
import time

# fit model
model = LogisticRegression(class_weight='balanced')

start = time.time()
print(f'Fitting model on {X_train.shape[0]} samples...')
model.fit(X_train, y_train)

end = time.time()
print('Finished model training in %.3f seconds.' % (end-start))

In [None]:
def get_sample_weights(y):
    weights = []
    for yi in y:
        weights.append(10 if yi else 1)
    return np.array(weights)

from sklearn.linear_model import LogisticRegression
import time

# fit model
model = LogisticRegression()

start = time.time()
print(f'Fitting model on {X_train.shape[0]} samples...')
model.fit(X_train, y_train, sample_weight=get_sample_weights(y_train))

end = time.time()
print('Finished model training in %.3f seconds.' % (end-start))
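
The two cells above are alternative ways to counter the class imbalance: `class_weight='balanced'` lets scikit-learn derive the weights, while `sample_weight` sets them by hand (10 for each positive sample here). As a rough sketch of what `'balanced'` computes, the example below uses a made-up target vector `y_demo` (an assumption, not the competition data) and scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up target with a 9:1 imbalance, for illustration only
y_demo = np.array([0] * 9 + [1])

# 'balanced' weights follow n_samples / (n_classes * count_of_class)
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

print(dict(zip(classes, weights)))
```

The rarer the class, the larger its weight, so mistakes on positive samples cost more during training.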

### Model evaluation

The competition uses the area under ROC curve for evaluation, see https://www.kaggle.com/c/home-credit-default-risk#evaluation.
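
The area under the ROC curve is computed from ranked scores rather than hard 0/1 predictions: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks `roc_auc_score` against that pairwise definition; the labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only
y_true_demo = np.array([0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

auc_sklearn = roc_auc_score(y_true_demo, y_score_demo)

# Pairwise definition: fraction of (positive, negative) pairs ranked correctly,
# counting ties as half a correct pair
pos = y_score_demo[y_true_demo == 1]
neg = y_score_demo[y_true_demo == 0]
diff = pos[:, None] - neg[None, :]
auc_pairwise = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc_sklearn, auc_pairwise)
```

This is why the evaluation needs `predict_proba` scores: hard predictions from `predict` discard the ranking information.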

<img src='validation.png' width='100%'>

In [None]:
# apply the trained model to the validation data to get predictions
y_preds = model.predict(X_val)

In [None]:
for i, y_pred in enumerate(y_preds):
    y_true = y_val[i]
    print(f'{i}\ty_pred: {y_pred}, y_true: {y_true}')

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_true=y_val, y_pred=y_preds, labels=[0, 1], target_names=['NO', 'YES']))

In [None]:
def evaluate(X_val, y_val):
    # note: uses the global `model` trained above
    from sklearn.metrics import roc_curve, roc_auc_score
    # ROC AUC needs scores, so take the predicted probability of the positive class (TARGET == 1)
    pos_idx = list(model.classes_).index(1)
    y_score = model.predict_proba(X_val)[:, pos_idx]
    # sanity-check the shapes
    print(X_val.shape, y_score.shape, pos_idx)
    fpr, tpr, _ = roc_curve(y_val, y_score, pos_label=1)
    roc_auc = roc_auc_score(y_val, y_score)
    print(f'auc: {roc_auc}')

    return roc_auc, fpr, tpr

In [None]:
print(X_val.shape)
print(y_val.shape)
roc_auc, fpr, tpr = evaluate(X_val, y_val)

In [None]:
import matplotlib.pyplot as plt
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")

### Cross validation

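To get a more reliable estimate of your model's performance, repeat the train-validation split several times and average the score across the splits. This is called **cross-validation**. In k-fold cross-validation, the data is divided into k parts, and each part serves as the validation set exactly once while the model is trained on the remaining parts.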

The convention is to perform 5-fold or 10-fold cross-validation.
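
A minimal sketch of 5-fold cross-validation with scikit-learn follows. It runs on synthetic data from `make_classification` as a stand-in for the competition data, and the model settings (`max_iter=1000`, fixed `random_state`) are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (not the competition data)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)

model = LogisticRegression(class_weight='balanced', max_iter=1000)

# 5 stratified folds keep the target ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')

print(f'AUC per fold: {scores}')
print(f'mean AUC: {scores.mean():.3f} (std: {scores.std():.3f})')
```

The spread across folds gives a sense of how much a single train-validation split can mislead.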

<hr></hr>

**Do we need more samples?**

Previously, we drew only 10,000 rows from the entire `application_train.csv` dataset, so there is a chance we did not have enough data to train the model well.

Offline practice: See how more training data will affect the overall performance.
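
One possible starting point for this exercise is scikit-learn's `learning_curve`, which retrains the model on growing subsets and reports the cross-validated score for each size. The sketch below uses synthetic stand-in data; the sample counts and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (not the competition data)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

model = LogisticRegression(class_weight='balanced', max_iter=1000)

# Train on 5 evenly spaced fractions (10% to 100%) of each training fold
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring='roc_auc',
)

for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f'{n} training samples -> mean validation AUC {score:.3f}')
```

If the validation score is still rising at the largest size, more training data is likely to help.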

<hr></hr>

### Model tuning

- **Hyper-parameter tuning**: pick a model, try different hyper-parameter settings, and keep the setting with the best validation score
  
  **Example**: http://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py
  
- Try different models
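
As a rough sketch of hyper-parameter tuning, the example below runs a randomized search over the regularization strength `C` and the `class_weight` option of a logistic regression. The data is synthetic, and the search ranges and `n_iter=10` are assumptions picked for illustration rather than recommended values.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the competition data)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)

# Candidate values: C sampled on a log scale, with and without class re-weighting
param_distributions = {
    'C': loguniform(1e-3, 1e2),
    'class_weight': [None, 'balanced'],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,          # number of random settings to try
    scoring='roc_auc',  # same metric as the competition
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(f'best params: {search.best_params_}')
print(f'best cross-validated AUC: {search.best_score_:.3f}')
```

Trying a different model only requires swapping the estimator and its parameter distributions.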

<hr></hr>

## Attempt 2

What if we use some of the other information provided?

### Data preparation: multiple datasets
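
The auxiliary files hold a variable number of rows per current application, so they have to be collapsed to one row per `SK_ID_CURR` before they can be joined to the application data. The toy example below sketches that aggregate-then-join pattern; the three applicants and their amounts are made up for illustration.

```python
import pandas as pd

# Made-up rows: applicant 1 has two previous applications, applicant 2 has one,
# applicant 3 has none
curr = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
prev = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'SK_ID_PREV': [10, 11, 12],
    'AMT_ANNUITY': [100.0, 50.0, 70.0],
})

# Collapse the N rows per applicant into one row of summary features
prev_agg = prev.groupby('SK_ID_CURR').agg(
    PREV_APPS=('SK_ID_PREV', 'count'),
    PREV_AMT_ANNUITY=('AMT_ANNUITY', 'sum'),
)

# Left join keeps every current application; applicants without history get 0
merged = curr.join(prev_agg, on='SK_ID_CURR', how='left').fillna(0)
print(merged)
```

Count and sum are only two of many possible aggregates; mean, min, max, or the most recent value are equally valid choices to experiment with.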

In [None]:
prev_app_filename = 'data/previous_application.csv'
prev_app_df = pd.read_csv(prev_app_filename)
prev_app_df.head(20) # print out the first 20 rows

In [None]:
filename = 'data/HomeCredit_columns_description.csv'
desc_df = pd.read_csv(filename, encoding = "ISO-8859-1")
desc_df[desc_df['Table'] == 'previous_application.csv']

In [None]:
# each `SK_ID_CURR` maps to a variable number of `SK_ID_PREV` rows (a 1-N mapping),
# so we need a way to encode all previous applications
# that belong to a given current application.

# one way to do this is by aggregating
prev_agg = prev_app_df.groupby('SK_ID_CURR')
prev_df = prev_agg.agg({'SK_ID_PREV': 'count', 'AMT_ANNUITY': 'sum'}).rename(columns={
    'SK_ID_PREV': 'PREV_APPS', 'AMT_ANNUITY': 'PREV_AMT_ANNUITY'})

In [None]:
prev_df.head(20)

In [None]:
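# fill missing numeric values with the column mean (numeric_only avoids errors on text columns)
curr_prev_df = train_df.fillna(value=train_df.mean(numeric_only=True)).join(prev_df, on='SK_ID_CURR', how='left')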
curr_prev_df[['PREV_APPS', 'PREV_AMT_ANNUITY']] = curr_prev_df[['PREV_APPS', 'PREV_AMT_ANNUITY']].fillna(value=0)
print(curr_prev_df.shape[0])
curr_prev_df

In [None]:
filename = 'data/bureau.csv'
bureau_df = pd.read_csv(filename)
active_bureau_df = bureau_df[bureau_df['CREDIT_ACTIVE']=='Active']
active_bureau_df

In [None]:
active_bureau_agg = active_bureau_df.groupby('SK_ID_CURR')
active_bureau_agg_df = active_bureau_agg.agg({'AMT_CREDIT_SUM_DEBT': 'sum', 'CNT_CREDIT_PROLONG': 'sum'})
active_bureau_agg_df

In [None]:
df = curr_prev_df.join(active_bureau_agg_df, on='SK_ID_CURR', how='left')
df[['AMT_CREDIT_SUM_DEBT', 'CNT_CREDIT_PROLONG']] = df[['AMT_CREDIT_SUM_DEBT', 'CNT_CREDIT_PROLONG']].fillna(value=0)
print(df.shape[0])
df.shape

In [None]:
df