# Feature Engineering, Machine Learning, and a few improvements

In this notebook, I'll run some machine learning algorithms: tree-based models and a few neural net configurations. I hope to get better results than in the previous attempts in First Contact.

In [23]:
import numpy as np
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt

%matplotlib inline

In [24]:
# Importing utils 
os.chdir('/home/hugo/Documents/DataScience/Kaggle/kaggle_credit_risk/code')

from utils import *

# Data directory
os.chdir('/home/hugo/Documents/DataScience/Kaggle/kaggle_credit_risk/data/treated_data')

In [25]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [26]:
train.head()

Unnamed: 0.1,Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,...,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,SK_DPD,SK_DPD_DEF
0,0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,...,,,,,,,,,,
1,1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,...,,,,,,,,,,
2,2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,...,,,,,,,,,,
3,3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,...,0.0,0.0,0.0,,0.0,,,0.0,0.0,0.0
4,4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,...,,,,,,,,,,


In [27]:
test.head()

Unnamed: 0.1,Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,AMT_RECEIVABLE_PRINCIPAL,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,SK_DPD,SK_DPD_DEF
0,0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,...,,,,,,,,,,
1,1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,...,,,,,,,,,,
2,2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,...,17255.559844,18101.079844,18101.079844,0.255556,0.239583,0.0,0.0,18.719101,0.010417,0.010417
3,3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,...,7680.352041,7968.609184,7968.609184,0.045455,2.387755,0.0,2.613636,19.547619,0.0,0.0
4,4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,...,,,,,,,,,,


In [28]:
train.drop('Unnamed: 0', axis=1, inplace=True)
test.drop('Unnamed: 0', axis=1, inplace=True)

I'm gonna drop columns that have more than 70% missing values.

In [29]:
# Dropping columns with too many missing values

train, DROP_COLUMNS = drop_missings_columns(train, treshold = 0.7)
test.drop(DROP_COLUMNS, axis=1, inplace=True)
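
The `drop_missings_columns` helper lives in `utils` and isn't shown in this notebook. Here is a minimal sketch of what such a helper might look like, assuming it returns the filtered frame plus the list of dropped column names. The function name `drop_sparse_columns` and its body are hypothetical, not the actual `utils` code.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.7):
    """Hypothetical stand-in for utils.drop_missings_columns.

    Drops every column whose fraction of missing values is strictly
    greater than `threshold` and returns (filtered_df, dropped_columns).
    """
    missing_ratio = df.isnull().mean()  # fraction of NaNs per column
    to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Toy example: column 'b' is 75% missing, 'a' and 'c' are below the threshold.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [np.nan, 1.0, 2.0, 3.0],
})
toy_kept, toy_dropped = drop_sparse_columns(toy, threshold=0.7)
print(toy_dropped)
```

Reusing the list computed on train to drop the same columns from test, as the cell above does, keeps both frames with identical columns.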

Next, I'll split each dataframe into numerical and categorical columns, then fill the missing values: the mean for numerical columns (let's start there and see where it takes us) and the mode for categorical ones.

In [30]:
train_num = train.select_dtypes(exclude='object')
train_cat = train.select_dtypes(include='object')

In [31]:
test_num = test.select_dtypes(exclude='object')
test_cat = test.select_dtypes(include='object')

In [32]:
train_num = fill_numerical_missings(train_num, 'mean')
test_num = fill_numerical_missings(test_num, 'mean')

train_cat = fill_categorical_missings(train_cat)
test_cat = fill_categorical_missings(test_cat)
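
The fill helpers are also defined in `utils`, so their internals aren't visible here. One thing worth keeping in mind: the cell above calls them separately on train and test, so if each call computes its own statistics, test gets filled with test means and modes. A common alternative is to compute the statistics on train only and reuse them for test. Here is a minimal sketch under that assumption (the function names below are hypothetical):

```python
import numpy as np
import pandas as pd


def fit_fill_values(num_df, cat_df):
    """Compute per-column fill values: mean for numerical, mode for categorical."""
    num_fill = num_df.mean()
    cat_fill = cat_df.mode().iloc[0]  # first mode of each column
    return num_fill, cat_fill


def apply_fill_values(num_df, cat_df, num_fill, cat_fill):
    """Fill missing values using previously computed statistics."""
    return num_df.fillna(num_fill), cat_df.fillna(cat_fill)


toy_train_num = pd.DataFrame({'x': [1.0, 3.0, np.nan]})
toy_train_cat = pd.DataFrame({'g': ['F', 'F', None]})
toy_test_num = pd.DataFrame({'x': [np.nan, 10.0]})
toy_test_cat = pd.DataFrame({'g': [None, 'M']})

# Statistics come from train only, then get applied to test.
num_fill, cat_fill = fit_fill_values(toy_train_num, toy_train_cat)
toy_test_num_f, toy_test_cat_f = apply_fill_values(
    toy_test_num, toy_test_cat, num_fill, cat_fill)
print(toy_test_num_f['x'].tolist(), toy_test_cat_f['g'].tolist())
```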

For the categorical data, let's encode each category as an integer code (catcodes).

In [33]:
train_cat, mappers = get_catcodes(train_cat)

In [34]:
mappers

{'CODE_GENDER': {'F': 1, 'M': 0, 'XNA': 2},
 'EMERGENCYSTATE_MODE': {'No': 0, 'Yes': 1},
 'FLAG_OWN_CAR': {'N': 0, 'Y': 1},
 'FLAG_OWN_REALTY': {'N': 1, 'Y': 0},
 'FONDKAPREMONT_MODE': {'not specified': 3,
  'org spec account': 1,
  'reg oper account': 0,
  'reg oper spec account': 2},
 'HOUSETYPE_MODE': {'block of flats': 0,
  'specific housing': 2,
  'terraced house': 1},
 'NAME_CONTRACT_TYPE': {'Cash loans': 0, 'Revolving loans': 1},
 'NAME_EDUCATION_TYPE': {'Academic degree': 4,
  'Higher education': 1,
  'Incomplete higher': 2,
  'Lower secondary': 3,
  'Secondary / secondary special': 0},
 'NAME_FAMILY_STATUS': {'Civil marriage': 2,
  'Married': 1,
  'Separated': 4,
  'Single / not married': 0,
  'Unknown': 5,
  'Widow': 3},
 'NAME_HOUSING_TYPE': {'Co-op apartment': 5,
  'House / apartment': 0,
  'Municipal apartment': 3,
  'Office apartment': 4,
  'Rented apartment': 1,
  'With parents': 2},
 'NAME_INCOME_TYPE': {'Businessman': 6,
  'Commercial associate': 2,
  'Maternity leave'

In [36]:
for col, values in mappers.items():
    test_cat[col] = test_cat[col].map(values)
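
`get_catcodes` comes from `utils` too. Judging by the `mappers` output, it returns one `{category: code}` dict per column, and the codes look consistent with order of first appearance (row 0 of train has `M`, `N`, `Y`, and those all map to 0), though that's a guess. Here is a minimal sketch under that assumption; the function `make_catcodes` is hypothetical. It also shows why test is worth checking after the mapping: `Series.map` turns any category that never appeared in train into NaN.

```python
import pandas as pd


def make_catcodes(cat_df):
    """Hypothetical stand-in for utils.get_catcodes.

    Encodes each column as integers in order of first appearance and
    returns (encoded_df, mappers) with mappers[col] = {category: code}.
    """
    encoded = pd.DataFrame(index=cat_df.index)
    mappers = {}
    for col in cat_df.columns:
        codes, uniques = pd.factorize(cat_df[col])
        encoded[col] = codes
        mappers[col] = {cat: code for code, cat in enumerate(uniques)}
    return encoded, mappers


toy_train = pd.DataFrame({'CODE_GENDER': ['M', 'F', 'F']})
toy_test = pd.DataFrame({'CODE_GENDER': ['F', 'XNA']})  # 'XNA' never seen in train

toy_train_enc, toy_mappers = make_catcodes(toy_train)
toy_test_enc = toy_test.copy()
for col, values in toy_mappers.items():
    toy_test_enc[col] = toy_test_enc[col].map(values)

# Categories missing from the mapper come back as NaN.
print(toy_test_enc['CODE_GENDER'].isnull().sum())
```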

Great! So now I'm gonna do some feature engineering.

# Feature Engineering

I'll create some features using polynomial features from sklearn. Then I'll try to use some domain knowledge to develop more powerful features.

I'll apply the expansion to the features most correlated with the target, hoping the resulting interaction terms bring some gains to the solution.

In [41]:
from sklearn.preprocessing import PolynomialFeatures

# Selecting some features to perform polynomial expansion
poly_feats_train = train_num[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
poly_feats_test = test_num[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]

# Creating transformer
poly_transformer = PolynomialFeatures(degree = 3)

# Transforming features
poly_transformer.fit(poly_feats_train)

poly_feats_train = poly_transformer.transform(poly_feats_train)
poly_feats_test = poly_transformer.transform(poly_feats_test)

print('Shape of transformed features: '+ str(poly_feats_train.shape))

Shape of transformed features: (307511, 35)
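
The 35 columns follow from the number of monomials of total degree at most d in n variables, which is C(n + d, d) and includes the constant bias column. A quick check with n = 4 features and d = 3:

```python
from math import comb

n_features, degree = 4, 3
# Number of polynomial terms of total degree <= degree, bias term included.
n_output = comb(n_features + degree, degree)
print(n_output)  # 35
```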


Good! Now let's turn the results back into pandas dataframes and concatenate them with our original training/testing data.

In [43]:
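# Names of the expanded features.
# Note: get_feature_names was removed in scikit-learn 1.2;
# get_feature_names_out is its replacement (available since 1.0).
poly_cols = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']
poly_names = poly_transformer.get_feature_names_out(poly_cols)

# Reuse the original indexes so the later concat lines up row by row
poly_feats_train = pd.DataFrame(poly_feats_train, columns=poly_names,
                                index=train_num.index)

poly_feats_test = pd.DataFrame(poly_feats_test, columns=poly_names,
                               index=test_num.index)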

In [44]:
poly_feats_train.head()

Unnamed: 0,1,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,DAYS_BIRTH,EXT_SOURCE_1^2,EXT_SOURCE_1 EXT_SOURCE_2,EXT_SOURCE_1 EXT_SOURCE_3,EXT_SOURCE_1 DAYS_BIRTH,EXT_SOURCE_2^2,...,EXT_SOURCE_2^3,EXT_SOURCE_2^2 EXT_SOURCE_3,EXT_SOURCE_2^2 DAYS_BIRTH,EXT_SOURCE_2 EXT_SOURCE_3^2,EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH,EXT_SOURCE_2 DAYS_BIRTH^2,EXT_SOURCE_3^3,EXT_SOURCE_3^2 DAYS_BIRTH,EXT_SOURCE_3 DAYS_BIRTH^2,DAYS_BIRTH^3
0,1.0,0.083037,0.262949,0.139376,-9461.0,0.006895,0.021834,0.011573,-785.612748,0.069142,...,0.018181,0.009637,-654.152107,0.005108,-346.733022,23536670.0,0.002707,-183.785678,12475600.0,-846859000000.0
1,1.0,0.311267,0.622246,0.510853,-16765.0,0.096887,0.193685,0.159012,-5218.396475,0.38719,...,0.240927,0.197797,-6491.237078,0.162388,-5329.19219,174891600.0,0.133318,-4375.173647,143583000.0,-4712058000000.0
2,1.0,0.50213,0.555912,0.729567,-19046.0,0.252134,0.27914,0.366337,-9563.564279,0.309038,...,0.171798,0.225464,-5885.942404,0.295894,-7724.580288,201657200.0,0.388325,-10137.567875,264650400.0,-6908939000000.0
3,1.0,0.50213,0.650442,0.510853,-19005.0,0.252134,0.326606,0.256514,-9542.976957,0.423074,...,0.275185,0.216129,-8040.528832,0.169746,-6314.981929,234933100.0,0.133318,-4959.747997,184515000.0,-6864416000000.0
4,1.0,0.50213,0.322738,0.510853,-19932.0,0.252134,0.162057,0.256514,-10008.451286,0.10416,...,0.033616,0.05321,-2076.117157,0.084225,-3286.224555,128219000.0,0.133318,-5201.667828,202954000.0,-7918677000000.0


Just as a sanity check, I'll calculate the correlations between those new engineered features and the target.

In [45]:
poly_feats_train['TARGET'] = train_num['TARGET']

poly_feats_train.corr()['TARGET'].sort_values()

EXT_SOURCE_2 EXT_SOURCE_3                -0.194235
EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3   -0.189593
EXT_SOURCE_2^2 EXT_SOURCE_3              -0.176589
EXT_SOURCE_2 EXT_SOURCE_3^2              -0.171729
EXT_SOURCE_1 EXT_SOURCE_2                -0.166538
EXT_SOURCE_1 EXT_SOURCE_3                -0.164933
EXT_SOURCE_2                             -0.160303
EXT_SOURCE_3                             -0.157397
EXT_SOURCE_1 EXT_SOURCE_2^2              -0.156791
EXT_SOURCE_1 EXT_SOURCE_3^2              -0.151139
EXT_SOURCE_2^2                           -0.149502
EXT_SOURCE_3^2                           -0.142517
EXT_SOURCE_2^3                           -0.140217
EXT_SOURCE_1^2 EXT_SOURCE_2              -0.139696
EXT_SOURCE_1^2 EXT_SOURCE_3              -0.139051
EXT_SOURCE_2 DAYS_BIRTH^2                -0.132844
EXT_SOURCE_3^3                           -0.128569
EXT_SOURCE_3 DAYS_BIRTH^2                -0.127602
EXT_SOURCE_1                             -0.099152
EXT_SOURCE_1 DAYS_BIRTH^2      
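
`sort_values()` orders by signed value, so any strong positive correlations end up at the bottom of the list. Here is a small sketch, on made-up data, of ranking by absolute correlation instead (the column names are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    'f_neg': [5.0, 4.0, 3.0, 2.0, 1.0],
    'f_pos': [1.0, 2.0, 3.0, 4.0, 6.0],
    'f_weak': [1.0, -1.0, 1.0, -1.0, 1.0],
    'TARGET': [0.0, 0.0, 1.0, 1.0, 1.0],
})

# Correlation of every feature with the target, target itself excluded.
corr = toy.corr()['TARGET'].drop('TARGET')

# Reorder by magnitude while keeping the original signs.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```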

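Cool! Some of the new interaction features correlate more strongly with the target than the originals: `EXT_SOURCE_2 EXT_SOURCE_3` reaches -0.194, against -0.160 for `EXT_SOURCE_2` alone. The correlations are still modest in absolute terms, but this can be useful when we start to apply machine learning. Let's concatenate all the data.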

In [50]:
poly_feats_train.drop(['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'], 
                      axis=1, inplace=True)

poly_feats_test.drop(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH'], 
                     axis=1, inplace=True)

In [51]:
train_full = pd.concat([train_num, train_cat, poly_feats_train], axis=1)
test_full = pd.concat([test_num, test_cat, poly_feats_test], axis=1)
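
`pd.concat(..., axis=1)` aligns rows on the index rather than on position, so this step relies on all three frames sharing the same index. Here is a minimal sketch of the failure mode and a cheap guard, on toy frames:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
right_ok = pd.DataFrame({'b': [10, 20, 30]}, index=[0, 1, 2])
right_shifted = pd.DataFrame({'b': [10, 20, 30]}, index=[1, 2, 3])

aligned = pd.concat([left, right_ok], axis=1)
misaligned = pd.concat([left, right_shifted], axis=1)

print(aligned.shape)     # same number of rows as the inputs
print(misaligned.shape)  # one extra row, with NaNs where the indexes differ

# Cheap guard to run before concatenating.
assert left.index.equals(right_ok.index)
```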

# Domain Knowledge Features

Now let's create some domain knowledge features. I'm not an expert in credit, but I'll try.