# Introduction
## Learn and estimate the Home Credit Default Risk competition on Kaggle.


In [2]:
# Shared imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Problem 1
## Confirmation of competition contents

1. What to learn and what to predict?

    - We use the dataset to learn, from each customer's attributes, whether they will be able to repay the loan on time, and then predict this for new customers.

2. What kind of file to create and submit to Kaggle?

    - We submit a file containing, for every ID in the test set, the predicted probability that the customer belongs to class 1 (has difficulty repaying the debt).
    - Example row: Id = 1230459, Probability = 0.7

3. What kind of index value will be used to evaluate the submissions?
    - Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

### ROC and AUC [Reference](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#:~:text=AUC%3A%20Area%20Under%20the%20ROC,to%20(1%2C1).)

-  **ROC**: the receiver operating characteristic curve plots the true positive rate (TPR) against the false positive rate (FPR) in a 2D plane. Each point on the curve is the classifier's performance at one decision threshold.

- **AUC**: the area under the ROC curve (its integral from FPR = 0 to FPR = 1), used as a single number for evaluating a classification model. AUC = 1 means every positive is ranked above every negative, AUC = 0.5 is what random scores give, and AUC = 0 means the ranking is perfectly inverted.
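The AUC can also be read as the fraction of (positive, negative) pairs that the scores rank correctly. The sketch below computes it that way in plain Python on made-up labels and scores; `pairwise_auc` is a hypothetical helper written for illustration, not the function Kaggle uses.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]  # toy labels, not competition data
auc_perfect = pairwise_auc(labels, [0.1, 0.2, 0.8, 0.9])   # positives ranked highest
auc_inverted = pairwise_auc(labels, [0.9, 0.8, 0.2, 0.1])  # positives ranked lowest
auc_constant = pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5])  # no ranking information
print(auc_perfect, auc_inverted, auc_constant)
```

Only the ordering of the scores matters, so a model whose scores are inverted gets an AUC below 0.5.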

# Problem 2
## Learning and verification

## Load Training Data

In [3]:
init_houseprice_data = pd.read_csv('../Data/Big/creditinfo_train.csv')
init_houseprice_data.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
print(f'Data shape: {init_houseprice_data.shape}')

Data shape: (307511, 122)


## Null check

In [5]:
row_count = init_houseprice_data.shape[0]
null_count = init_houseprice_data.isnull().sum()
null_cols = init_houseprice_data.columns[null_count > 0]
print('All columns with null values: ', len(null_cols),' columns')
null_count.sort_values(ascending=False).head(15)

All columns with null values:  67  columns


COMMONAREA_MEDI             214865
COMMONAREA_AVG              214865
COMMONAREA_MODE             214865
NONLIVINGAPARTMENTS_MODE    213514
NONLIVINGAPARTMENTS_AVG     213514
NONLIVINGAPARTMENTS_MEDI    213514
FONDKAPREMONT_MODE          210295
LIVINGAPARTMENTS_MODE       210199
LIVINGAPARTMENTS_AVG        210199
LIVINGAPARTMENTS_MEDI       210199
FLOORSMIN_AVG               208642
FLOORSMIN_MODE              208642
FLOORSMIN_MEDI              208642
YEARS_BUILD_MEDI            204488
YEARS_BUILD_MODE            204488
dtype: int64

## Finding unusable columns
As a naive rule, I treat any column with more than 50% null values as unusable and drop it.

In [6]:
null_count_normalized = init_houseprice_data[null_cols].isnull().sum() / row_count * 100
unusable_cols = null_count_normalized.index[null_count_normalized > 50]
print('Unusable Cols: ', len(unusable_cols))
unusable_cols

Unusable Cols:  41


Index(['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG',
       'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG',
       'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MODE',
       'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
       'BASEMENTAREA_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI',
       'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI',
       'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI',
       'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE',
       'WALLSMATERIAL_MODE'],
      dtype='object')
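The same rule can be written more compactly with `isna().mean()`, which gives the null fraction per column directly. This is a minimal sketch on a made-up frame; the column names and the `threshold` variable are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (hypothetical columns).
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, np.nan, 2.0, 3.0],
    "complete": [1, 2, 3, 4],
})

threshold = 0.5                      # drop columns with more than 50% nulls
null_fraction = toy.isna().mean()    # fraction of null cells per column
to_drop = null_fraction.index[null_fraction > threshold]
kept = toy.drop(columns=to_drop)
print(list(kept.columns))
```

Note that a column with exactly 50% nulls survives, because the comparison is strict.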

### After Dropping Unusable Columns

In [8]:
purged_data = init_houseprice_data.drop(columns= unusable_cols)
print('After first Purge, columns left:', purged_data.shape[1])
purged_data.columns

After first Purge, columns left: 81


Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE',
       'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
       'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE',
       'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'YEARS_BEGINEXPLUATATION_AVG', 'FLOOR

## Checking Attribute types and fill up null
I'm just going to take the naive approach and fill the remaining missing cells with the **MODE** of the corresponding column.

MODE matters here because the MEAN may not be in the attribute's domain (for example, for a categorical or integer-valued column).

In [9]:
mode_fill_values = dict()
for col in purged_data:
    mode_fill_values[col] = purged_data[col].mode()[0]
prob2_data = purged_data.fillna(mode_fill_values)
display(prob2_data.head())
print('Data Remains Null Value? ', prob2_data.isna().sum().any())

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Data Remains Null Value?  False
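To make the mode-versus-mean point concrete, here is a small sketch on made-up data. It also shows one possible way to reuse the fill values learned on the training frame for a second frame; the frames and column names are assumptions, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values).
train = pd.DataFrame({
    "children": [0, 0, 1, 2, np.nan],
    "gender": ["F", "F", "M", None, "F"],
})
test = pd.DataFrame({"children": [np.nan, 2], "gender": [None, "M"]})

# Learn one fill value per column from the training frame only.
fill_values = {col: train[col].mode()[0] for col in train.columns}

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)   # same values reused for the test frame

# The mean of the known values is not a valid child count.
mean_children = train["children"].mean()
print(fill_values, mean_children)
```

Here the mode of `children` is 0, which is a valid count, while the mean is 0.75.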


## Building and Testing Model

### Label Encoding

In [15]:
categorical_columns = prob2_data.columns[prob2_data.dtypes == 'object']
prob2_data[categorical_columns].nunique()

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
EMERGENCYSTATE_MODE            2
dtype: int64

In [16]:
from sklearn.preprocessing import LabelEncoder
label_enc = LabelEncoder()

for col in categorical_columns:
    prob2_data[col] = label_enc.fit_transform(prob2_data[col])
prob2_data[categorical_columns]

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,OCCUPATION_TYPE,WEEKDAY_APPR_PROCESS_START,ORGANIZATION_TYPE,EMERGENCYSTATE_MODE
0,0,1,0,1,6,7,4,3,1,8,6,5,0
1,0,0,0,0,1,4,1,1,1,3,1,39,0
2,1,1,1,1,6,7,4,3,1,8,1,11,0
3,0,0,0,1,6,7,4,0,1,8,6,5,0
4,0,1,0,1,6,7,4,3,1,3,4,37,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
307506,0,1,0,0,6,7,4,2,5,14,4,43,0
307507,0,0,0,1,6,3,4,5,1,8,1,57,0
307508,0,0,0,1,6,7,1,2,1,10,4,39,0
307509,0,0,0,1,6,1,4,1,1,8,6,3,0
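The loop above refits a single `LabelEncoder` on every column, so after the loop it only remembers the classes of the last one. If the same encoding has to be applied to another frame later, one option is to keep one fitted encoder per column. This is a sketch on made-up data; the `encoders` dictionary and the toy frames are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames (hypothetical values).
train = pd.DataFrame({
    "gender": ["F", "M", "XNA", "F"],
    "contract": ["Cash loans", "Revolving loans", "Cash loans", "Cash loans"],
})
test = pd.DataFrame({
    "gender": ["M", "F"],
    "contract": ["Revolving loans", "Cash loans"],
})

encoders = {}                 # one fitted encoder per categorical column
train_enc = train.copy()
for col in train.columns:
    encoders[col] = LabelEncoder().fit(train[col])
    train_enc[col] = encoders[col].transform(train[col])

test_enc = test.copy()
for col in test.columns:
    # transform (not fit_transform) keeps the codes learned on the training frame
    test_enc[col] = encoders[col].transform(test[col])

print(test_enc)
```

`transform` raises a `ValueError` for a label it has not seen during `fit`, so unseen categories need separate handling.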


### Let's pick attribute
I'm going to pick the 20 attributes most correlated (in absolute value) with our target,
since I tried training on the whole dataset naively and that did not go well :)

In [17]:
def pick_top_related(data, topn):
    target_corr = data.corr()['TARGET'].abs().sort_values(ascending = False)
    display(target_corr.head(10))
    return target_corr[1:topn+1].index
picked_attributes = pick_top_related(prob2_data,20)
picked_attributes

TARGET                         1.000000
EXT_SOURCE_2                   0.160039
EXT_SOURCE_3                   0.127891
DAYS_BIRTH                     0.078239
REGION_RATING_CLIENT_W_CITY    0.060893
REGION_RATING_CLIENT           0.058899
DAYS_LAST_PHONE_CHANGE         0.055217
NAME_EDUCATION_TYPE            0.054699
CODE_GENDER                    0.054692
DAYS_ID_PUBLISH                0.051457
Name: TARGET, dtype: float64

Index(['EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH',
       'REGION_RATING_CLIENT_W_CITY', 'REGION_RATING_CLIENT',
       'DAYS_LAST_PHONE_CHANGE', 'NAME_EDUCATION_TYPE', 'CODE_GENDER',
       'DAYS_ID_PUBLISH', 'REG_CITY_NOT_WORK_CITY', 'NAME_INCOME_TYPE',
       'FLAG_EMP_PHONE', 'DAYS_EMPLOYED', 'REG_CITY_NOT_LIVE_CITY',
       'FLAG_DOCUMENT_3', 'DAYS_REGISTRATION', 'TOTALAREA_MODE',
       'AMT_GOODS_PRICE', 'FLOORSMAX_AVG', 'FLOORSMAX_MEDI'],
      dtype='object')
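The function above skips the first entry of the sorted series, which works because TARGET correlates perfectly with itself. A variant that drops the target by name is sketched below on a made-up frame; `top_correlated` and the toy columns are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values): one strong, one inverse, one unrelated feature.
toy = pd.DataFrame({
    "TARGET": [0, 0, 1, 1, 0, 1],
    "strong": [1, 2, 8, 9, 1, 9],
    "inverse": [9, 8, 2, 1, 7, 1],
    "noise": [5, 1, 5, 1, 3, 3],
})


def top_correlated(frame, target, n):
    """Return the n columns with the largest absolute correlation to target."""
    corr = frame.corr()[target].drop(target).abs()
    return list(corr.sort_values(ascending=False).head(n).index)


picked = top_correlated(toy, "TARGET", 2)
print(picked)
```

Taking the absolute value keeps strongly negative correlations such as `inverse`, which are as informative as positive ones.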

### Separating and Standardizing

In [18]:
X = prob2_data[picked_attributes]
Y = prob2_data['TARGET']

from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2)

print(f'Sizes: xtrain,ytrain: {x_train.shape, y_train.shape}, xtest,ytest: {x_test.shape,y_test.shape}')

Sizes: xtrain,ytrain: ((246008, 20), (246008,)), xtest,ytest: ((61503, 20), (61503,))


### Using SGDClassifier

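**I tried SVC first but it didn't work, so I went to the scikit-learn model selection diagram and used its recommendation instead.**

That led me to SGDClassifier, a linear classifier trained with stochastic gradient descent (with default settings it fits a linear SVM). Let's see how it does.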

In [19]:
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
cls = SGDClassifier()
cls.fit(x_train,y_train)
pred = cls.predict(x_test)

from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print('Accuracy: ', acc)

Accuracy:  0.9195323805342829
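Accuracy is not the competition metric, and it can look high when one class dominates. The sketch below uses made-up labels with an assumed 8% positive rate (an illustration, not a figure taken from this notebook) to show a predictor that never flags anyone scoring well on accuracy while its ROC AUC stays at 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels: 92 negatives and 8 positives (assumed imbalance for illustration).
y_true = np.array([0] * 92 + [1] * 8)

always_zero = np.zeros_like(y_true)             # predicts class 0 for everyone
constant_scores = np.full(len(y_true), 0.08)    # same score for everyone

acc = accuracy_score(y_true, always_zero)
auc = roc_auc_score(y_true, constant_scores)
print(acc, auc)
```

For the split above, something like `roc_auc_score(y_test, cls.decision_function(x_test))` would give a number comparable to the Kaggle score, since AUC only needs a ranking and not a calibrated probability.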


### Predicted Probability In Separated Test Set
With its default hinge loss, SGDClassifier does not provide predict_proba, so we wrap it in CalibratedClassifierCV to get probabilities.

In [20]:
from sklearn.calibration import CalibratedClassifierCV
calibrator = CalibratedClassifierCV(cls, cv = 'prefit')
problem2_model = calibrator.fit(x_train,y_train)
print('Predicted probability: ')
problem2_model.predict_proba(x_test)

Predicted probability: 


array([[0.96647387, 0.03352613],
       [0.93783676, 0.06216324],
       [0.90887625, 0.09112375],
       ...,
       [0.88671793, 0.11328207],
       [0.98170715, 0.01829285],
       [0.95207134, 0.04792866]])
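The cell above calibrates an already fitted model on the same rows it was trained on. An alternative, sketched here on synthetic data, is to let `CalibratedClassifierCV` run its own cross-validation so that each calibrator is fitted on rows its base model has not seen. The data, `cv=3`, and the random seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier

# Synthetic two-feature data (not the competition data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# The base model is cloned and fitted inside each of the 3 folds.
calibrated = CalibratedClassifierCV(SGDClassifier(random_state=0), cv=3)
calibrated.fit(X_toy, y_toy)

proba = calibrated.predict_proba(X_toy)   # columns follow calibrated.classes_
print(calibrated.classes_, proba[:3])
```

The columns of `predict_proba` are ordered like `classes_`, so with labels 0 and 1 the second column is the probability of class 1.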

# Problem 3
## Estimation on test data

In [21]:
test_data = pd.read_csv('../Data/Normal/application_test.csv')
test_ids = test_data['SK_ID_CURR']
print(f'Test shapes: {test_data.shape}')
display(test_data.head())

Test shapes: (48744, 121)


Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


### Encoding and Picking test data

In [22]:
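prob2_xtest = test_data.copy()
for col in categorical_columns:
    # Caution: fit_transform refits the encoder on the test column, so the
    # integer codes may differ from the ones the model saw during training.
    prob2_xtest[col] = label_enc.fit_transform(test_data[col])
prob2_xtest = prob2_xtest[picked_attributes]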
prob2_xtest.head()

Unnamed: 0,EXT_SOURCE_2,EXT_SOURCE_3,DAYS_BIRTH,REGION_RATING_CLIENT_W_CITY,REGION_RATING_CLIENT,DAYS_LAST_PHONE_CHANGE,NAME_EDUCATION_TYPE,CODE_GENDER,DAYS_ID_PUBLISH,REG_CITY_NOT_WORK_CITY,NAME_INCOME_TYPE,FLAG_EMP_PHONE,DAYS_EMPLOYED,REG_CITY_NOT_LIVE_CITY,FLAG_DOCUMENT_3,DAYS_REGISTRATION,TOTALAREA_MODE,AMT_GOODS_PRICE,FLOORSMAX_AVG,FLOORSMAX_MEDI
0,0.789654,0.15952,-19241,2,2,-1740.0,1,0,-812,0,6,1,-2329,0,1,-5170.0,0.0392,450000.0,0.125,0.125
1,0.291656,0.432962,-18064,2,2,0.0,4,1,-1623,0,6,1,-4469,0,1,-9118.0,,180000.0,,
2,0.699787,0.610991,-20038,2,2,-856.0,1,1,-3503,0,6,1,-4458,0,0,-2175.0,,630000.0,,
3,0.509677,0.612704,-13976,2,2,-1805.0,4,0,-4208,0,6,1,-1866,0,1,-2000.0,0.37,1575000.0,0.375,0.375
4,0.425687,,-13040,2,2,-821.0,4,1,-4262,1,6,1,-2191,0,1,-4000.0,,625500.0,,


### Filling the Nulls?
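The model cannot score rows that contain NaN, so the missing values have to be filled somehow. Here I fill each column with the mode of the test data; reusing the fill values computed on the training data would be more consistent with how the model was trained.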

In [23]:
mode_fill_values = dict()
for col in prob2_xtest:
    mode_fill_values[col] = prob2_xtest[col].mode()[0]
prob2_xtest = prob2_xtest.fillna(mode_fill_values)
display(prob2_xtest.head())
print('Data Remains Null Value? ', prob2_xtest.isna().sum().any())

Unnamed: 0,EXT_SOURCE_2,EXT_SOURCE_3,DAYS_BIRTH,REGION_RATING_CLIENT_W_CITY,REGION_RATING_CLIENT,DAYS_LAST_PHONE_CHANGE,NAME_EDUCATION_TYPE,CODE_GENDER,DAYS_ID_PUBLISH,REG_CITY_NOT_WORK_CITY,NAME_INCOME_TYPE,FLAG_EMP_PHONE,DAYS_EMPLOYED,REG_CITY_NOT_LIVE_CITY,FLAG_DOCUMENT_3,DAYS_REGISTRATION,TOTALAREA_MODE,AMT_GOODS_PRICE,FLOORSMAX_AVG,FLOORSMAX_MEDI
0,0.789654,0.15952,-19241,2,2,-1740.0,1,0,-812,0,6,1,-2329,0,1,-5170.0,0.0392,450000.0,0.125,0.125
1,0.291656,0.432962,-18064,2,2,0.0,4,1,-1623,0,6,1,-4469,0,1,-9118.0,0.0,180000.0,0.1667,0.1667
2,0.699787,0.610991,-20038,2,2,-856.0,1,1,-3503,0,6,1,-4458,0,0,-2175.0,0.0,630000.0,0.1667,0.1667
3,0.509677,0.612704,-13976,2,2,-1805.0,4,0,-4208,0,6,1,-1866,0,1,-2000.0,0.37,1575000.0,0.375,0.375
4,0.425687,0.706205,-13040,2,2,-821.0,4,1,-4262,1,6,1,-2191,0,1,-4000.0,0.0,625500.0,0.1667,0.1667


Data Remains Null Value?  False


### Standardize 

In [24]:
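# Caution: this fits a new scaler on the test data, while the model was trained
# on features scaled with the statistics of the training data.
prob2_xtest = StandardScaler().fit_transform(prob2_xtest)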
prob2_xtest

array([[ 1.4985808 , -1.99561188, -0.73347688, ..., -0.03747724,
        -0.68705158, -0.68384094],
       [-1.24845678, -0.54907933, -0.46139201, ..., -0.83936194,
        -0.31326363, -0.31039554],
       [ 1.00285782,  0.39271301, -0.91771788, ...,  0.49711256,
        -0.31326363, -0.31039554],
       ...,
       [ 0.63318027, -1.33862363,  0.0337701 , ..., -0.43841959,
         1.18009541,  1.18159497],
       [-0.39871853,  0.31053085,  0.48547261, ..., -0.03747724,
         3.79481828,  3.79392168],
       [-0.33892518, -1.3998733 ,  0.48685962, ..., -0.57206704,
        -0.31326363, -0.31039554]])
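Fitting a fresh scaler on the test data shifts and rescales it with different statistics than the ones used for training. A minimal sketch of the difference, on made-up one-column arrays, assuming the scaler fitted on the training features is kept in a variable (here called `scaler`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature arrays (hypothetical values).
train_X = np.array([[0.0], [10.0], [20.0]])
test_X = np.array([[10.0], [30.0]])

scaler = StandardScaler().fit(train_X)      # learn mean and std on training data
test_scaled = scaler.transform(test_X)      # reuse them for the test data

test_refit = StandardScaler().fit_transform(test_X)   # what refitting would give
print(test_scaled.ravel(), test_refit.ravel())
```

With the training statistics, the value 10 maps to 0 because it equals the training mean; after refitting on the test data the same value maps to -1.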

### Predict Result

In [27]:
problem2_model.fit(x_train,y_train)
summision_result = pd.DataFrame(
    index = test_ids,
    # Caution: column 0 is P(TARGET = 0); the competition asks for P(TARGET = 1), which is column 1.
    data = problem2_model.predict_proba(prob2_xtest)[:,0],
    columns = ['TARGET']
)
summision_result.to_csv('../Data/Big/predicted_submission_p2.csv')
summision_result.head()

SK_ID_CURR,TARGET
100001,0.929721
100005,0.848655
100013,0.948004
100028,0.961289
100038,0.884928


## PROBLEM 3 RESULT
**I uploaded the submission to Kaggle and got a score of 0.46684.** Yay me!
![image](../Data/Normal/kaggle_result_p2.PNG)

### Conclusion: Problem 3:
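Turns out, the submitted values are all close to 1. That is because I submitted column 0 of predict_proba, which is the probability of TARGET = 0 rather than TARGET = 1. Submitting the wrong column inverts the ranking, which is consistent with a score slightly below the 0.5 of a random baseline. Either way, the model is not much better than a random baseline.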
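The effect of picking the wrong `predict_proba` column can be checked on synthetic data. The sketch below looks up the column of the positive class through `classes_` and compares the AUC of both columns; the data and the use of `LogisticRegression` are assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic one-feature data (not the competition data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 1))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)
proba = model.predict_proba(X_toy)

pos_col = list(model.classes_).index(1)    # column holding P(class = 1)

auc_pos = roc_auc_score(y_toy, proba[:, pos_col])       # correct column
auc_neg = roc_auc_score(y_toy, proba[:, 1 - pos_col])   # wrong column
print(pos_col, auc_pos, auc_neg)
```

The two values add up to 1, so scoring with the probability of the negative class turns a better-than-random AUC into a worse-than-random one.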

# Problem 4
## Feature engineering

## Conclusion
### Problem 4 is about feature engineering
But I think I've already done that, though only very simply, in the previous problems (2 and 3), including:
- Dropping mostly null columns
- Mode-filling columns to preserve as much data as possible instead of dropping rows with nulls
- Label encoding for categorical data
- Attribute picking relying on correlation
- Standardization