# Problem Definition

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

- Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

### Data Description

See https://www.kaggle.com/c/home-credit-default-risk/data

The data is split across 8 files:

* <b>application_train.csv</b>: contains the main training data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature `SK_ID_CURR`. The training application data comes with the `TARGET` indicating 0 (the loan was repaid) or 1 (the loan was not repaid). 
* <b>application_test.csv</b>: contains the main testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature `SK_ID_CURR`. 
* <b>bureau.csv</b>: contains data concerning clients' previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
* <b>bureau_balance.csv</b>: contains monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length. 
* <b>previous_application.csv</b>: contains previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature `SK_ID_PREV`. 
* <b>POS_CASH_BALANCE.csv</b>: contains monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
* <b>credit_card_balance.csv</b>: contains monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
* <b>installments_payments.csv</b>: contains payment history for previous loans at Home Credit. There is one row for every payment made and one row for every missed payment.
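
The files form a hierarchy: monthly rows roll up to previous credits, and previous credits roll up to the current loan. The sketch below illustrates that roll-up on made-up toy frames. It assumes `bureau_balance` links to `bureau` through `SK_ID_BUREAU` and carries a `MONTHS_BALANCE` column, as in the Kaggle data description; `MONTHS_ON_RECORD` and `BUREAU_MONTHS_TOTAL` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for three of the tables; all values are made up.
toy_application = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
})
toy_bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 10, 11, 12, 12],
    "MONTHS_BALANCE": [-2, -1, 0, 0, -1, 0],
})

# Step 1: collapse the monthly rows to one row per previous credit.
months_per_credit = (
    toy_bureau_balance.groupby("SK_ID_BUREAU")
    .size()
    .rename("MONTHS_ON_RECORD")
    .reset_index()
)

# Step 2: attach that to bureau, then collapse to one row per current loan.
bureau_with_months = toy_bureau.merge(months_per_credit, on="SK_ID_BUREAU", how="left")
per_client = (
    bureau_with_months.groupby("SK_ID_CURR")["MONTHS_ON_RECORD"]
    .sum()
    .rename("BUREAU_MONTHS_TOTAL")
    .reset_index()
)

# Step 3: a left join keeps every application, even those without bureau data.
linked = toy_application.merge(per_client, on="SK_ID_CURR", how="left")
print(linked)
```

Client 3 has no bureau rows in this toy data, so its aggregated value comes out as missing after the left join.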

#### Imports

In [1]:
import numpy as np
import pandas as pd

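# Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead.
from sklearn.preprocessing import OneHotEncoder, Imputer, PolynomialFeatures, MinMaxScaler
from sklearn.linear_model import LogisticRegression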

# Data

#### Main train data

In [2]:
def get_train_data(path: str, target_name: str) -> tuple:
    df = pd.read_csv(path)
    target = df[target_name]    
    features = df.drop(target_name, axis=1)
    return features, target

In [3]:
path = 'data/application_train.csv'
target_name = 'TARGET'
train_features, train_target = get_train_data(path, target_name)

In [4]:
train_features.shape, train_target.shape

((307511, 121), (307511,))

In [5]:
train_features.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,...,0,0,0,0,,,,,,
4,100007,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
train_target.head()

0    1
1    0
2    0
3    0
4    0
Name: TARGET, dtype: int64

#### Auxiliary training data

Previous out of network loans

In [7]:
bureau_df = pd.read_csv('data/bureau.csv')
bureau_df.shape

(1716428, 17)

In [8]:
bureau_df.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


Extend main train data with previous out of network loans

In [56]:
previous_loan_counts = bureau_df.groupby('SK_ID_CURR')['SK_ID_CURR'].count()
previous_loan_counts.shape

(305811,)

In [57]:
previous_loan_counts.head()

SK_ID_CURR
100001    7
100002    8
100003    4
100004    2
100005    3
Name: SK_ID_CURR, dtype: int64

In [58]:
previous_loan_df = previous_loan_counts.to_frame().rename(columns={'SK_ID_CURR': 'LOAN_COUNT_BUREAU'}).reset_index()
previous_loan_df.shape

(305811, 2)

In [59]:
previous_loan_df.head()

Unnamed: 0,SK_ID_CURR,LOAN_COUNT_BUREAU
0,100001,7
1,100002,8
2,100003,4
3,100004,2
4,100005,3


In [60]:
train_features_bureau = train_features.merge(previous_loan_df, on='SK_ID_CURR', how='left')
train_features_bureau.shape

(307511, 122)

In [62]:
train_features_bureau[['SK_ID_CURR', 'LOAN_COUNT_BUREAU']].head()

Unnamed: 0,SK_ID_CURR,LOAN_COUNT_BUREAU
0,100002,8.0
1,100003,4.0
2,100004,2.0
3,100006,
4,100007,1.0
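
The left merge leaves `LOAN_COUNT_BUREAU` empty for any loan whose client does not appear in `bureau.csv`, which is why the column turned into floats. One possible way to handle this, sketched on made-up toy data, is to treat a missing count as zero. That reading is an assumption, not something the data guarantees.

```python
import pandas as pd

# Toy data mirroring the merge above; the values are made up.
toy_features = pd.DataFrame({"SK_ID_CURR": [100002, 100006, 100007]})
toy_counts = pd.DataFrame({
    "SK_ID_CURR": [100002, 100007],
    "LOAN_COUNT_BUREAU": [8, 1],
})

toy_merged = toy_features.merge(toy_counts, on="SK_ID_CURR", how="left")

# Assumption: a client with no bureau rows has zero recorded previous credits.
toy_merged["LOAN_COUNT_BUREAU"] = toy_merged["LOAN_COUNT_BUREAU"].fillna(0).astype(int)
print(toy_merged)
```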


#### Test data

In [9]:
test_features = pd.read_csv('data/application_test.csv')
test_features.shape

(48744, 121)

In [10]:
test_features.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


#### Dealing with categorical values

Convert categorical features to one-hot encoded features, because most machine learning model implementations cannot handle categorical data directly. We will use pandas' built-in one-hot encoder, `pd.get_dummies`. However, a category that appears in only one of the two sets produces a column that the other set lacks. To align train and test data, we simply drop those extra columns, since they reflect only a small portion of the data.

In [None]:
def make_one_hot_encoded(train_features: pd.DataFrame, test_features: pd.DataFrame) -> tuple:
    train_1h = pd.get_dummies(train_features)
    test_1h = pd.get_dummies(test_features)
    return train_1h.align(test_1h, join='inner', axis=1)

In [None]:
train_1h, test_1h = make_one_hot_encoded(train_features, test_features)
assert train_1h.shape[1]==test_1h.shape[1]

In [None]:
train_1h.shape, test_1h.shape
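
To see what the inner-join alignment does, here is a small sketch on made-up toy frames where the category `Unknown` occurs only in the training set. The column names `AMT` and `FAMILY` are hypothetical.

```python
import pandas as pd

# Toy frames; "Unknown" appears only in the training set.
toy_train = pd.DataFrame({
    "AMT": [1.0, 2.0, 3.0],
    "FAMILY": ["Married", "Single", "Unknown"],
})
toy_test = pd.DataFrame({
    "AMT": [4.0, 5.0],
    "FAMILY": ["Single", "Married"],
})

toy_train_1h = pd.get_dummies(toy_train)
toy_test_1h = pd.get_dummies(toy_test)

# Columns that exist in train but not in test before alignment.
print(sorted(set(toy_train_1h.columns) - set(toy_test_1h.columns)))

# An inner join on the column axis keeps only the columns both frames share.
toy_train_aligned, toy_test_aligned = toy_train_1h.align(toy_test_1h, join="inner", axis=1)
print(list(toy_train_aligned.columns))
```

The rows that had the dropped category are kept; they simply end up with zeros in all remaining `FAMILY_*` columns.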

Let's look at the one-hot encoded columns.

In [None]:
def get_extra_columns(before: np.ndarray, after: np.ndarray) -> list:
    return list(set(after).difference(set(before)))

In [None]:
train_1h[get_extra_columns(train_features.columns.values, train_1h.columns.values)].head()

In [None]:
test_1h[get_extra_columns(test_features.columns.values, test_1h.columns.values)].head()

#### Imputing

We learned earlier that many features have missing values. We need to deal with these before we derive the polynomial features.

In [None]:
type(Imputer(strategy='median'))

In [None]:
def impute(train_data: pd.DataFrame, test_data: pd.DataFrame, strategy: str) -> tuple:
    imputer = Imputer(strategy=strategy)
    
    train_imputed = imputer.fit_transform(train_data)
    train_features = pd.DataFrame(train_imputed, columns=train_data.columns)
    
    test_imputed = imputer.transform(test_data) 
    test_features = pd.DataFrame(test_imputed, columns=test_data.columns)
    
    return train_features, test_features

In [None]:
train_imputed, test_imputed = impute(train_1h, test_1h, strategy='median')

In [None]:
def count_missing_stats(data: pd.DataFrame) -> int:
    # Number of columns that contain at least one missing value.
    return int(data.isnull().any().sum())

In [None]:
print("Columns having null values: {} (before imputing), {} (after imputing)"
      .format(count_missing_stats(train_1h),
              count_missing_stats(train_imputed)))

In [None]:
print("Columns having null values: {} (before imputing), {} (after imputing)"
      .format(count_missing_stats(test_1h),
              count_missing_stats(test_imputed)))
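
The key point of the imputation step is that the medians are learned from the training data and then reused for the test data. The following is a pandas-only sketch of that idea on made-up toy frames. It is not the scikit-learn imputer used above, just an equivalent illustration.

```python
import numpy as np
import pandas as pd

# Toy frames with gaps; the values are made up.
toy_train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 10.0],
    "b": [np.nan, 2.0, 4.0, 6.0],
})
toy_test = pd.DataFrame({"a": [np.nan, 5.0], "b": [1.0, np.nan]})

# Medians are learned from the training data only...
train_medians = toy_train.median()

# ...and reused for both sets, so the test data does not influence the fill values.
toy_train_filled = toy_train.fillna(train_medians)
toy_test_filled = toy_test.fillna(train_medians)

print(train_medians)
print(toy_test_filled)
```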

### Feature Engineering

#### Polynomial feature from correlated features

In our analysis, we learned that `TARGET` is positively correlated with `DAYS_BIRTH` and negatively correlated with `EXT_SOURCE_1`, `EXT_SOURCE_2`, and `EXT_SOURCE_3`. It is worth trying polynomial features built from these four columns. `scikit-learn` provides a utility, `PolynomialFeatures`, to generate them.
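
As an illustration of what "polynomial features" means, the sketch below builds all products of the input columns up to a given degree by hand with NumPy. The helper `polynomial_terms` is hypothetical and only meant to show the idea. With its default settings, `PolynomialFeatures` likewise includes a constant column and the interaction terms, so 4 input columns at degree 3 should expand to 35 columns.

```python
from itertools import combinations_with_replacement
from math import comb

import numpy as np


def polynomial_terms(X: np.ndarray, degree: int) -> np.ndarray:
    """Return all monomials of the columns of X up to `degree`, constant column first."""
    n_samples, n_features = X.shape
    columns = [np.ones(n_samples)]  # degree-0 term
    for d in range(1, degree + 1):
        # Each index tuple picks d columns (with repetition) to multiply together.
        for idx in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


X_toy = np.array([[2.0, 3.0]])
# Columns for inputs a=2, b=3 at degree 2: 1, a, b, a^2, a*b, b^2
print(polynomial_terms(X_toy, degree=2))

# Number of output columns for 4 inputs at degree 3: C(4 + 3, 3)
print(comb(4 + 3, 3))
```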

In [None]:
def engineer_polynomial_features(train_data: pd.DataFrame,
                                 test_data: pd.DataFrame,
                                 feature_columns: list,
                                 degree: int,
                                 merge_id: str,
                                 merge_how: str) -> tuple:
    # Fit the transformer on the training data only, then reuse it for the test data.
    poly_transformer = PolynomialFeatures(degree=degree)

    train_features = poly_transformer.fit_transform(train_data[feature_columns])
    test_features = poly_transformer.transform(test_data[feature_columns])

    # get_feature_names was removed in scikit-learn 1.2; newer versions use get_feature_names_out.
    engineered_column_names = poly_transformer.get_feature_names(feature_columns)

    # Attach the ID so the polynomial features can be merged back onto the full frame.
    poly_df_train = pd.DataFrame(train_features, columns=engineered_column_names)
    poly_df_train[merge_id] = train_data[merge_id]
    poly_train_features = train_data.merge(poly_df_train, how=merge_how, on=merge_id)

    poly_df_test = pd.DataFrame(test_features, columns=engineered_column_names)
    poly_df_test[merge_id] = test_data[merge_id]
    poly_test_features = test_data.merge(poly_df_test, how=merge_how, on=merge_id)

    return poly_train_features, poly_test_features

In [None]:
feature_columns = ['DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
degree = 3
merge_id = 'SK_ID_CURR'
merge_how = 'left'
poly_train_features, poly_test_features = engineer_polynomial_features(train_data=train_imputed,
                                                                       test_data=test_imputed,
                                                                       feature_columns=feature_columns,
                                                                       degree=degree,
                                                                       merge_id=merge_id,
                                                                       merge_how=merge_how)

In [None]:
poly_train_features.shape, poly_test_features.shape

Let's look at the polynomial features.

In [None]:
poly_train_features[get_extra_columns(train_imputed.columns.values, poly_train_features.columns.values)].head()

In [None]:
poly_test_features[get_extra_columns(test_imputed.columns.values, poly_test_features.columns.values)].head()

A useful follow-up is to check whether any of the polynomial features correlate, positively or negatively, with `TARGET`.
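
One way such a check could look is sketched below on synthetic data: each feature column is correlated with the target and the results are sorted. The data, the product column, and the rule that generates the toy target are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for two scores and one engineered interaction term.
toy = pd.DataFrame({
    "EXT_SOURCE_2": rng.uniform(0, 1, n),
    "EXT_SOURCE_3": rng.uniform(0, 1, n),
})
toy["EXT_SOURCE_2 EXT_SOURCE_3"] = toy["EXT_SOURCE_2"] * toy["EXT_SOURCE_3"]

# Made-up rule: default (1) becomes more likely when the product of the scores is low.
noise = rng.uniform(0, 1, n)
toy_target = (2 * toy["EXT_SOURCE_2 EXT_SOURCE_3"] < noise).astype(int)

# Pearson correlation of every column with the target, most negative first.
correlations = toy.corrwith(toy_target).sort_values()
print(correlations)
```

On the notebook's data, the equivalent call would be along the lines of `poly_train_features.corrwith(train_target)`, assuming the rows are still in the same order as `train_target`.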

#### Prediction: Regular data

Scaling

In [None]:
scaler_regular = MinMaxScaler(feature_range=(0, 1))

In [None]:
train_scaled = scaler_regular.fit_transform(train_imputed)
test_scaled = scaler_regular.transform(test_imputed)
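
Min-max scaling maps each column to `(x - min) / (max - min)`, with the minimum and maximum taken from the training data. The NumPy sketch below reproduces that arithmetic on made-up numbers. It also shows that test values outside the training range land outside `[0, 1]`, since the scaler does not clip by default.

```python
import numpy as np

# Made-up single-column data.
toy_train = np.array([[10.0], [20.0], [30.0]])
toy_test = np.array([[5.0], [25.0], [40.0]])

# "Fit": record the per-column minimum and range of the training data.
col_min = toy_train.min(axis=0)
col_range = toy_train.max(axis=0) - col_min

# "Transform": apply the same shift and scale to both sets.
toy_train_scaled = (toy_train - col_min) / col_range
toy_test_scaled = (toy_test - col_min) / col_range

print(toy_train_scaled.ravel())  # 0.0, 0.5, 1.0
print(toy_test_scaled.ravel())   # -0.25, 0.75, 1.5
```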

Training

In [None]:
model = LogisticRegression(C=0.0001)

In [None]:
model.fit(train_scaled, train_target)

Prediction

In [None]:
prediction = model.predict_proba(test_scaled)

In [None]:
prediction[:,1]

In [None]:
submit = test_features[['SK_ID_CURR']]
submit = submit.assign(TARGET=prediction[:,1])

In [None]:
submit.shape

In [None]:
submit.head()

In [None]:
submit.to_csv('model_baseline.csv', index = False)

This baseline submission scored 0.679.
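
The Kaggle competition ranks submissions by area under the ROC curve, so the score above is presumably that metric. As a rough sketch of what it measures, the hypothetical helper below computes ROC AUC directly as the fraction of positive/negative pairs in which the positive example gets the higher predicted probability. It compares all pairs at once, so it is only suitable for small arrays.

```python
import numpy as np


def roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


# Made-up labels and predicted probabilities.
toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the submission scores in context.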

#### Prediction: Polynomial features

Scaling

In [None]:
scaler_poly = MinMaxScaler(feature_range=(0, 1))

In [None]:
train_scaled_poly = scaler_poly.fit_transform(poly_train_features)
test_scaled_poly = scaler_poly.transform(poly_test_features)

In [None]:
model_poly = LogisticRegression(C=0.0001)

Model training

In [None]:
model_poly.fit(train_scaled_poly, train_target)

In [None]:
prediction_poly = model_poly.predict_proba(test_scaled_poly)

In [None]:
submit_poly = test_features[['SK_ID_CURR']]
# Build the submission from submit_poly itself rather than from the baseline `submit` frame.
submit_poly = submit_poly.assign(TARGET=prediction_poly[:,1])

In [None]:
submit_poly.to_csv('model_baseline_poly.csv', index = False)

The submission with polynomial features scored 0.722, up from 0.679 for the baseline.