# Problem Definition

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

- Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

### Data Description

See https://www.kaggle.com/c/home-credit-default-risk/data

There are 7 different sources of data. The two main application files are:

* <b>application_train.csv</b>: contains the main training data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature `SK_ID_CURR`. The training application data comes with the `TARGET` indicating 0 (the loan was repaid) or 1 (the loan was not repaid). 
* <b>application_test.csv</b>: contains the main testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature `SK_ID_CURR`. 

#### Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder, Imputer, PolynomialFeatures

# Data

#### Main train data

In [2]:
def get_train_data(path: str, target_name: str) -> tuple:
    df = pd.read_csv(path)
    target = df[target_name]    
    features = df.drop(target_name, axis=1)
    return features, target

In [3]:
path = 'data/application_train.csv'
target_name = 'TARGET'
train_features, train_target = get_train_data(path, target_name)

In [54]:
train_features.shape, train_target.shape

((307511, 121), (307511,))

In [28]:
train_features.head()

| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [30]:
train_target.head()

0    1
1    0
2    0
3    0
4    0
Name: TARGET, dtype: int64

#### Test data

In [6]:
test_features = pd.read_csv("data/application_test.csv")
test_features.shape

(48744, 121)

#### Dealing with categorical values

Convert categorical features to one-hot encoded features, because most machine learning model implementations cannot handle categorical data directly. We will use the pandas native one-hot encoder, `pd.get_dummies`. However, a category that appears in only one of the two data sets produces a column that the other set lacks. To align the train and test data, we simply drop those additional columns, since they reflect only a small portion of the data.

In [7]:
def make_one_hot_encoded(train_features: pd.DataFrame, test_features: pd.DataFrame) -> tuple:
    train_1h = pd.get_dummies(train_features)
    test_1h = pd.get_dummies(test_features)
    return train_1h.align(test_1h, join='inner', axis=1)

In [65]:
train_1h, test_1h = make_one_hot_encoded(train_features, test_features)
assert train_1h.shape[1]==test_1h.shape[1]

In [66]:
train_1h.shape, test_1h.shape

((307511, 242), (48744, 242))
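To make the effect of the inner alignment concrete, here is a minimal sketch on toy data. The frames and the column names `amount` and `kind` are invented for illustration and are not part of the Home Credit data.

```python
import pandas as pd

# Toy frames: categories "B" and "C" appear only in train, "D" only in test.
train = pd.DataFrame({"amount": [1.0, 2.0, 3.0], "kind": ["A", "B", "C"]})
test = pd.DataFrame({"amount": [4.0, 5.0], "kind": ["A", "D"]})

train_1h = pd.get_dummies(train)
test_1h = pd.get_dummies(test)

# join="inner" on axis=1 keeps only the columns present in both frames.
train_aligned, test_aligned = train_1h.align(test_1h, join="inner", axis=1)

# Columns that the alignment silently discarded from each side.
dropped_from_train = sorted(set(train_1h.columns) - set(train_aligned.columns))
dropped_from_test = sorted(set(test_1h.columns) - set(test_aligned.columns))

print(list(train_aligned.columns))
print(dropped_from_train, dropped_from_test)
```

Listing the dropped columns this way is a cheap check that the alignment removed only rare categories, as assumed above.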

Let's look at the one-hot encoded columns.

In [43]:
def get_extra_columns(before: np.ndarray, after: np.ndarray) -> list:
    return list(set(after).difference(set(before)))

In [52]:
train_1h[get_extra_columns(train_features.columns.values, train_1h.columns.values)].head()

| | ORGANIZATION_TYPE_XNA | FONDKAPREMONT_MODE_reg oper spec account | ORGANIZATION_TYPE_Transport: type 4 | OCCUPATION_TYPE_HR staff | ORGANIZATION_TYPE_Trade: type 6 | WALLSMATERIAL_MODE_Monolithic | FLAG_OWN_REALTY_Y | ORGANIZATION_TYPE_Military | WALLSMATERIAL_MODE_Mixed | ORGANIZATION_TYPE_Self-employed | ... | FONDKAPREMONT_MODE_not specified | NAME_HOUSING_TYPE_Rented apartment | OCCUPATION_TYPE_Private service staff | ORGANIZATION_TYPE_Trade: type 1 | OCCUPATION_TYPE_Core staff | WEEKDAY_APPR_PROCESS_START_THURSDAY | NAME_INCOME_TYPE_Businessman | FLAG_OWN_CAR_Y | ORGANIZATION_TYPE_Religion | ORGANIZATION_TYPE_Hotel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |


In [51]:
test_1h[get_extra_columns(test_features.columns.values, test_1h.columns.values)].head()

| | ORGANIZATION_TYPE_XNA | FONDKAPREMONT_MODE_reg oper spec account | ORGANIZATION_TYPE_Transport: type 4 | OCCUPATION_TYPE_HR staff | ORGANIZATION_TYPE_Trade: type 6 | WALLSMATERIAL_MODE_Monolithic | FLAG_OWN_REALTY_Y | ORGANIZATION_TYPE_Military | WALLSMATERIAL_MODE_Mixed | ORGANIZATION_TYPE_Self-employed | ... | FONDKAPREMONT_MODE_not specified | NAME_HOUSING_TYPE_Rented apartment | OCCUPATION_TYPE_Private service staff | ORGANIZATION_TYPE_Trade: type 1 | OCCUPATION_TYPE_Core staff | WEEKDAY_APPR_PROCESS_START_THURSDAY | NAME_INCOME_TYPE_Businessman | FLAG_OWN_CAR_Y | ORGANIZATION_TYPE_Religion | ORGANIZATION_TYPE_Hotel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


#### Imputing

We learned earlier that many features have missing values. We need to deal with them before we derive the polynomial features.

In [55]:
type(Imputer(strategy='median'))

sklearn.preprocessing.imputation.Imputer

In [16]:
def impute(train_data: pd.DataFrame, test_data: pd.DataFrame, strategy: str) -> tuple:
    imputer = Imputer(strategy=strategy)
    
    train_imputed = imputer.fit_transform(train_data)
    train_features = pd.DataFrame(train_imputed, columns=train_data.columns)
    
    test_imputed = imputer.transform(test_data) 
    test_features = pd.DataFrame(test_imputed, columns=test_data.columns)
    
    return train_features, test_features

In [58]:
train_imputed, test_imputed = impute(train_1h, test_1h, strategy='median')
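The `Imputer` class used above was deprecated in scikit-learn 0.20 and removed in 0.22. On a newer installation, a rough equivalent can be sketched with `sklearn.impute.SimpleImputer`. The function name `impute_simple` and the toy frames below are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_simple(train_data: pd.DataFrame, test_data: pd.DataFrame, strategy: str) -> tuple:
    """Hypothetical stand-in for impute() on scikit-learn >= 0.20."""
    imputer = SimpleImputer(strategy=strategy)
    # Fit on train only, then reuse the learned statistics for test.
    train_out = pd.DataFrame(imputer.fit_transform(train_data),
                             columns=train_data.columns, index=train_data.index)
    test_out = pd.DataFrame(imputer.transform(test_data),
                            columns=test_data.columns, index=test_data.index)
    return train_out, test_out


# Toy data: train medians are 3.0 for "a" and 4.0 for "b".
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0, 10.0], "b": [np.nan, 2.0, 4.0, 6.0]})
toy_test = pd.DataFrame({"a": [np.nan, 5.0], "b": [1.0, np.nan]})
toy_train_imp, toy_test_imp = impute_simple(toy_train, toy_test, "median")
print(toy_test_imp)
```

As in the original function, the test set is filled with statistics learned from the training set, so no information from the test data leaks into the imputation.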

In [61]:
def count_missing_stats(data: pd.DataFrame) -> int:
    # Number of columns that contain at least one missing value.
    return int(data.isnull().any().sum())

In [67]:
print("Columns having null values: {} (before Imputing), {} (after Imputing)"
      .format(count_missing_stats(train_1h),
              count_missing_stats(train_imputed)))

Columns having null values: 61 (before Imputing), 0 (after Imputing)


In [68]:
print("Columns having null values: {} (before Imputing), {} (after Imputing)"
      .format(count_missing_stats(test_1h),
              count_missing_stats(test_imputed)))

Columns having null values: 58 (before Imputing), 0 (after Imputing)


### Feature Engineering

#### Polynomial feature from correlated features

In our analysis, we learned that `TARGET` is positively correlated with `DAYS_BIRTH` and negatively correlated with `EXT_SOURCE_1`, `EXT_SOURCE_2`, and `EXT_SOURCE_3`. It is therefore worth trying polynomial combinations of these four features. `scikit-learn` provides a utility, `PolynomialFeatures`, to generate them.

In [104]:
def EngieerPolynomialFeatures(train_data: pd.DataFrame, 
                              test_data: pd.DataFrame, 
                              feature_columns: list, 
                              degree: int, 
                              merge_id: str, 
                              merge_how: str) -> tuple:
    poly_transformer = PolynomialFeatures(degree=degree)
    
    # Fit on the training data only, then apply the same transform to test.
    train_features = poly_transformer.fit_transform(train_data[feature_columns])
    test_features = poly_transformer.transform(test_data[feature_columns])
    
    # Names such as 'DAYS_BIRTH^2' or 'EXT_SOURCE_1 EXT_SOURCE_3'.
    engineered_column_names = poly_transformer.get_feature_names(feature_columns)
    
    # Attach the id column so the new features can be merged back.
    # The degree-1 terms share names with the original columns, so the
    # merge suffixes those pairs with '_x' and '_y'.
    poly_df_train = pd.DataFrame(train_features, columns=engineered_column_names)
    poly_df_train[merge_id] = train_data[merge_id]
    poly_train_features = train_data.merge(poly_df_train, how=merge_how, on=merge_id)
    
    poly_df_test = pd.DataFrame(test_features, columns=engineered_column_names)
    poly_df_test[merge_id] = test_data[merge_id]
    poly_test_features = test_data.merge(poly_df_test, how=merge_how, on=merge_id)
    
    return poly_train_features, poly_test_features

In [105]:
feature_columns = ['DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
degree = 3    
merge_id = 'SK_ID_CURR'
merge_how = 'left'
poly_train_features, poly_test_features = EngieerPolynomialFeatures(train_data=train_imputed,
                                                                    test_data=test_imputed,
                                                                    feature_columns=feature_columns, 
                                                                    degree=degree, 
                                                                    merge_id=merge_id, 
                                                                    merge_how=merge_how)

In [106]:
poly_train_features.shape, poly_test_features.shape

((307511, 277), (48744, 277))
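The jump from 242 to 277 columns can be checked by hand. A degree-`d` expansion of `n` inputs, including the constant term, yields C(n + d, d) columns; for 4 inputs and degree 3 that is 35. The short sketch below reproduces the arithmetic and cross-checks it against `PolynomialFeatures` on a dummy array (the array itself is only a placeholder).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 4   # DAYS_BIRTH, EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3
degree = 3

# Number of monomials of total degree <= 3 in 4 variables, bias included.
n_poly = comb(n_inputs + degree, degree)

# Cross-check with scikit-learn on a placeholder array of the right width.
poly = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_inputs)))

n_before = 242                 # columns after one-hot encoding
n_after = n_before + n_poly    # the merge key is shared, so it adds nothing
print(n_poly, poly.n_output_features_, n_after)
```

All 35 generated columns are added because the four degree-1 terms, which carry the same names as the original features, are kept by the merge as suffixed `_x` and `_y` pairs rather than being collapsed.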

Let's look at the polynomial features.

In [107]:
poly_train_features[get_extra_columns(train_imputed.columns.values, poly_train_features.columns.values)].head()

| | EXT_SOURCE_1 EXT_SOURCE_3 | EXT_SOURCE_1 EXT_SOURCE_3^2 | DAYS_BIRTH^3 | DAYS_BIRTH^2 EXT_SOURCE_2 | DAYS_BIRTH EXT_SOURCE_1 | DAYS_BIRTH^2 | DAYS_BIRTH EXT_SOURCE_2 | EXT_SOURCE_2 EXT_SOURCE_3^2 | DAYS_BIRTH EXT_SOURCE_3^2 | EXT_SOURCE_2^2 EXT_SOURCE_3 | ... | DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_2 | DAYS_BIRTH EXT_SOURCE_1^2 | DAYS_BIRTH^2 EXT_SOURCE_3 | 1 | DAYS_BIRTH EXT_SOURCE_2^2 | EXT_SOURCE_2^2 | DAYS_BIRTH EXT_SOURCE_3 | EXT_SOURCE_3_y | DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_3 | DAYS_BIRTH_x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.011573 | 0.001613 | -846859000000.0 | 23536670.0 | -785.612748 | 89510521.0 | -2487.756636 | 0.005108 | -183.785678 | 0.009637 | ... | -206.575767 | -65.2349 | 12475600.0 | 1.0 | -654.152107 | 0.069142 | -1318.634256 | 0.139376 | -109.49539 | -9461.0 |
| 1 | 0.166614 | 0.089185 | -4712058000000.0 | 174891600.0 | -5218.396475 | 281065225.0 | -10431.950422 | 0.178286 | -4803.518937 | 0.207254 | ... | -3247.12516 | -1624.316241 | 150447500.0 | 1.0 | -6491.237078 | 0.38719 | -8973.906339 | 0.535276 | -2793.283699 | -16765.0 |
| 2 | 0.369159 | 0.269326 | -6908939000000.0 | 201657200.0 | -9637.236584 | 362750116.0 | -10587.90154 | 0.295894 | -10137.567875 | 0.225464 | ... | -5357.456268 | -4876.421768 | 264650400.0 | 1.0 | -5885.942404 | 0.309038 | -13895.327191 | 0.729567 | -7031.006802 | -19046.0 |
| 3 | 0.270849 | 0.144979 | -6864416000000.0 | 234933100.0 | -9616.490669 | 361190025.0 | -12361.644326 | 0.186365 | -5445.325225 | 0.226462 | ... | -6254.966447 | -4865.924377 | 193336400.0 | 1.0 | -8040.528832 | 0.423074 | -10172.92514 | 0.535276 | -5147.479068 | -19005.0 |
| 4 | 0.270849 | 0.144979 | -7918677000000.0 | 128219000.0 | -10085.550751 | 397284624.0 | -6432.819536 | 0.092471 | -5710.929881 | 0.055754 | ... | -3254.993372 | -5103.267808 | 212657000.0 | 1.0 | -2076.117157 | 0.10416 | -10669.126224 | 0.535276 | -5398.55579 | -19932.0 |


In [108]:
poly_test_features[get_extra_columns(test_imputed.columns.values, poly_test_features.columns.values)].head()

| | EXT_SOURCE_1 EXT_SOURCE_3 | EXT_SOURCE_1 EXT_SOURCE_3^2 | DAYS_BIRTH^3 | DAYS_BIRTH^2 EXT_SOURCE_2 | DAYS_BIRTH EXT_SOURCE_1 | DAYS_BIRTH^2 | DAYS_BIRTH EXT_SOURCE_2 | EXT_SOURCE_2 EXT_SOURCE_3^2 | DAYS_BIRTH EXT_SOURCE_3^2 | EXT_SOURCE_2^2 EXT_SOURCE_3 | ... | DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_2 | DAYS_BIRTH EXT_SOURCE_1^2 | DAYS_BIRTH^2 EXT_SOURCE_3 | 1 | DAYS_BIRTH EXT_SOURCE_2^2 | EXT_SOURCE_2^2 | DAYS_BIRTH EXT_SOURCE_3 | EXT_SOURCE_3_y | DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_3 | DAYS_BIRTH_x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.120057 | 0.019151 | -7123328000000.0 | 292342700.0 | -14481.055414 | 370216081.0 | -15193.73937 | 0.020094 | -489.615795 | 0.099469 | ... | -11435.028416 | -10898.652144 | 59056700.0 | 1.0 | -11997.802403 | 0.623554 | -3069.315478 | 0.15952 | -2310.011305 | -19241.0 |
| 1 | 0.244619 | 0.105911 | -5894429000000.0 | 95169560.0 | -10205.983005 | 326308096.0 | -5268.46553 | 0.054673 | -3386.201665 | 0.036829 | ... | -2976.631403 | -5766.280398 | 141278900.0 | 1.0 | -1536.577117 | 0.085063 | -7821.019554 | 0.432962 | -4418.799416 | -18064.0 |
| 2 | 0.30916 | 0.188894 | -8045687000000.0 | 280979400.0 | -10139.186531 | 401521444.0 | -14022.328504 | 0.261238 | -7480.393855 | 0.299203 | ... | -7095.269204 | -5130.407402 | 245326100.0 | 1.0 | -9812.640816 | 0.489702 | -12243.044232 | 0.610991 | -6194.955045 | -20038.0 |
| 3 | 0.322119 | 0.197364 | -2729912000000.0 | 99554500.0 | -7347.658072 | 195328576.0 | -7123.246872 | 0.191336 | -5246.681115 | 0.159163 | ... | -3744.932912 | -3862.913505 | 119678600.0 | 1.0 | -3630.555667 | 0.259771 | -8563.154516 | 0.612704 | -4501.941285 | -13976.0 |
| 4 | 0.108203 | 0.057919 | -2217342000000.0 | 72384550.0 | -2635.970697 | 170041600.0 | -5550.962315 | 0.121968 | -3736.229463 | 0.096997 | ... | -1122.099233 | -532.848276 | 91019230.0 | 1.0 | -2362.974127 | 0.18121 | -6980.002306 | 0.535276 | -1410.972511 | -13040.0 |


Next, see whether the polynomial features have positive or negative correlations with `TARGET`.