### Title: 
# Feature Engineering

### Description:

What is a feature, and why do we need to engineer it? Basically, all machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly, and this is where the need for feature engineering arises. I think feature engineering efforts mainly have two goals:

- Preparing the proper input dataset, compatible with the machine learning algorithm requirements.
- Improving the performance of machine learning models.

Here we will split our dataset into X (the data without the target) and Y (the target), apply One Hot Encoding, create the training, validation and test datasets, and study the influential outliers.

### Authors:
#### Hugo Cesar Octavio del Sueldo
#### Jose Lopez Galdon

### Date:
11/12/2020

### Version:
2.0

***

### Libraries

In [1]:
    # Numpy & Pandas to work with the DF
import numpy as np
import pandas as pd

    # Visualize DF & images
from IPython.display import display, HTML

    # Import Sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest

## Load data

In [2]:
    # To automate the work as much as possible, we parameterize the code: here we create an object with
    # the file name
name = 'data_preprocessed'

data = pd.read_csv(f'../data/02_intermediate/{name}.csv',  # Path: an f-string that includes the variable name
                   low_memory=False)                       # Set low_memory=False to avoid mixed-dtype warnings

### View data

First, we are going to take a look at our dataframe:

In [3]:
    # First 5 rows using html display in order to view all the columns
display(HTML(data.head().to_html()))

Unnamed: 0,loan_status,funded_amnt,term,int_rate,grade,emp_length,home_ownership,annual_inc,verification_status,addr_state,dti,delinq_2yrs,fico_range_low,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,collections_12_mths_ex_med,acc_now_delinq,open_acc_6m,open_act_il,open_il_24m,open_rv_24m,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc_dlq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_tl,num_il_tl,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,total_il_high_credit_limit
0,0,-0.38082,0,-1.301789,0,6,2,1.146812,0,4,-0.192147,-0.375668,0.867194,-0.706459,1,1,1.466509,1.222434,-0.362962,-1.28006,1.623752,-0.119227,-0.071894,2.166771,0.741897,-0.196922,7.170571,1.898908,1.000292,0.638439,1.019823,-1.736113,-0.091184,-0.017977,0.073427,1.01848,-0.615069,-0.55802,0.10879,0,-0.410373,1.091516,1.049814,1.567798,0.388002,-0.183158,-0.124943,0.157414,2.226422,-0.137225,1.804984
1,0,-0.893563,0,2.547691,4,2,3,-0.394579,1,47,-1.435548,-0.375668,-0.282881,-0.706459,1,0,-1.464425,-0.361617,-0.593404,1.915804,-1.768944,-0.119227,-0.071894,-0.175511,0.741897,1.304953,0.723863,-0.181734,-0.897775,-0.734571,-0.647356,1.338383,-0.091184,-0.017977,-2.038881,-0.100495,0.091934,0.813458,-0.853443,0,0.325333,-0.765696,-1.160979,-1.117975,-1.078166,-0.183158,-1.233508,-1.74942,-0.352351,-0.137225,-0.91816
2,0,-0.858404,0,-1.301789,0,7,1,0.503319,0,43,-0.632803,-0.375668,0.538601,1.406835,0,0,1.283326,-0.361617,0.320419,-1.091328,0.051527,-0.119227,-0.071894,2.166771,3.518892,7.312453,5.328654,6.060191,2.582014,0.078002,3.502667,-1.220693,-0.091184,-0.017977,-1.515649,-1.415019,-0.615069,-0.55802,1.071022,0,-0.410373,3.41303,1.681469,0.948004,-0.14515,-0.183158,0.42934,0.679668,-0.352351,-0.137225,-0.258492
3,0,-0.565408,1,0.605354,2,7,3,-0.364964,1,4,-0.975535,-0.375668,-0.118585,-0.706459,0,1,0.550592,2.806485,-0.493053,-1.393299,-0.444965,-0.119227,-0.071894,2.166771,0.741897,2.806828,4.407696,-0.181734,0.683947,-0.807131,0.356433,-1.42614,-0.091184,-0.017977,0.073427,-1.382428,-0.42225,-0.18398,-0.853443,0,-0.410373,-0.301393,0.418159,-0.084986,-0.678302,-0.183158,0.42934,0.679668,4.805196,-0.137225,-0.88312
4,0,-0.096614,0,0.605354,2,7,1,-0.020455,2,19,-0.890137,0.735458,-0.282881,-0.706459,1,0,-1.464425,-0.361617,-0.633119,1.542534,-1.3552,-0.119227,-0.071894,2.166771,1.667562,2.806828,2.565779,3.979549,0.051259,1.189782,-0.629466,1.017598,-0.091184,-0.017977,-0.333531,-0.915283,-0.486523,-0.30866,-0.372327,0,-0.410373,-1.229999,-1.476806,-0.704779,-0.81159,-0.183158,0.42934,-0.534876,-0.352351,-0.137225,-0.303297


In [4]:
    # Last 5 rows using html display in order to view all the columns
display(HTML(data.tail().to_html()))

Unnamed: 0,loan_status,funded_amnt,term,int_rate,grade,emp_length,home_ownership,annual_inc,verification_status,addr_state,dti,delinq_2yrs,fico_range_low,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,collections_12_mths_ex_med,acc_now_delinq,open_acc_6m,open_act_il,open_il_24m,open_rv_24m,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc_dlq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_tl,num_il_tl,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,total_il_high_credit_limit
425010,0,-0.38082,0,0.510766,3,1,3,-0.762718,2,43,0.956747,-0.375668,-0.447178,-0.706459,0,0,-0.548508,-0.361617,0.085729,0.494022,-0.693211,-0.119227,-0.071894,-0.175511,-0.183767,-0.196922,-0.197095,-0.181734,-1.214119,-0.712989,-0.014724,0.166975,-0.091184,-0.017977,-0.256015,0.029871,2.277215,0.065379,-0.853443,0,-0.410373,1.555819,0.418159,0.32821,-0.545014,-0.183158,-0.679225,0.679668,-0.352351,-0.137225,-0.970347
425011,1,-0.331011,0,-0.281127,1,10,1,-0.200035,2,35,0.570746,0.735458,-0.282881,-0.706459,1,0,0.184225,-0.361617,0.217067,0.691142,2.864982,-0.119227,-0.071894,-0.175511,-0.183767,-0.196922,-0.197095,-0.181734,1.949325,0.313679,-0.321476,0.260688,-0.091184,-0.017977,-0.100984,0.562199,-0.743615,-0.80738,-0.372327,0,0.325333,0.16291,0.733986,-0.704779,4.120066,1.92812,0.42934,0.473195,-0.352351,-0.137225,1.280618
425012,1,-0.213813,1,0.598754,3,5,3,-0.619054,2,42,1.39171,-0.375668,-0.447178,-0.706459,0,1,-0.548508,1.222434,-0.248388,0.338842,-0.362216,-0.119227,-0.071894,-0.175511,-0.183767,-0.196922,-0.197095,-0.181734,0.683947,-0.610649,-0.47437,0.509387,-0.091184,-0.017977,0.150943,-0.79578,-0.42225,-0.18398,-0.853443,0,-0.410373,0.16291,0.102331,-0.911377,0.388002,-0.183158,0.42934,0.679668,2.226422,-0.137225,-0.235132
425013,1,-0.331011,1,1.478636,4,0,3,-0.179084,1,19,0.969272,0.735458,0.045712,1.406835,1,0,0.916959,-0.361617,-0.376632,-0.94873,-0.527713,6.657969,-0.071894,-0.175511,-0.183767,-0.196922,-0.197095,-0.181734,0.367603,-0.604436,-0.391936,-0.016846,-0.091184,-0.017977,-1.360617,-1.067376,-0.486523,-0.30866,-0.853443,0,0.325333,0.627213,0.418159,-0.291584,-0.81159,1.92812,-0.124943,0.072396,-0.352351,-0.137225,0.62172
425014,0,0.606576,0,-0.281127,1,1,3,0.353669,2,9,-0.893552,-0.375668,-0.611474,0.350188,0,1,-0.731692,1.222434,0.778438,1.114741,-1.3552,-0.119227,-0.071894,-0.175511,-0.183767,-0.196922,-0.197095,-0.181734,-1.214119,-0.587273,-0.069771,0.649956,-0.091184,-0.017977,0.383491,-0.28518,-0.486523,-0.30866,-0.853443,0,-0.410373,0.627213,-0.213496,-0.291584,-1.078166,-0.183158,-0.679225,0.679668,-0.352351,2.209005,-1.011601


In [5]:
    # Data dimension
data.shape

(425015, 51)

In [6]:
    # Loan status class counts
data['loan_status'].value_counts()

0    318620
1    106395
Name: loan_status, dtype: int64
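
The counts above are easier to judge as proportions. A minimal sketch, rebuilding the counts by hand so that it runs on its own (on the real dataframe, `data['loan_status'].value_counts(normalize=True)` should give the same figures):

```python
import pandas as pd

# Counts copied from the output above
counts = pd.Series({0: 318620, 1: 106395}, name="loan_status")

# Share of each class in the whole dataset
proportions = counts / counts.sum()
print(proportions.round(4))
```

Roughly three out of four loans belong to class 0, so the target is imbalanced. This is worth keeping in mind when the data is split later on.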

***

## Data X & Data Y

First of all, we have to create a new dataset without the variable we want to predict, in this case `loan_status`. This will be our *X Data*. The *Y Data* will be a vector with the `loan_status` variable.

In [7]:
    # Set X data
X = data.drop('loan_status', axis=1)

    # Set y data
y = data['loan_status']

    # Check dimensions
X.shape, y.shape

((425015, 50), (425015,))

## One Hot Encoding

**One Hot Encoding** is a common way of preprocessing categorical features for machine learning models. This type of encoding creates a new binary feature for each possible category and assigns a value of 1 to the feature of each sample that corresponds to its original category.
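
As a small illustration of the idea, here is a sketch on a toy column. The `color` feature and its values are made up for the example and are not part of the loan data:

```python
import pandas as pd

# Toy frame: 'color' is a hypothetical nominal feature, not a column of the loan data
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)

# drop_first=True removes the first category, leaving K-1 columns
encoded_k1 = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

print(encoded)
print(encoded_k1)
```

Each row of `encoded` contains exactly one 1. With `drop_first=True` the dropped category is implied by a row of zeros, which avoids a redundant column.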

In [8]:
    # Select those categorical columns
columns_categ = ['term', 'grade', 'home_ownership', 'emp_length', 'addr_state', 'verification_status', 'mths_since_last_delinq',
                 'mths_since_last_record', 'mths_since_recent_bc_dlq']
    
    # Below, we transform the variables into categorical with the astype function.
X[columns_categ] = X[columns_categ].astype('category')

    # Check the results
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 425015 entries, 0 to 425014
Data columns (total 50 columns):
 #   Column                      Non-Null Count   Dtype   
---  ------                      --------------   -----   
 0   funded_amnt                 425015 non-null  float64 
 1   term                        425015 non-null  category
 2   int_rate                    425015 non-null  float64 
 3   grade                       425015 non-null  category
 4   emp_length                  425015 non-null  category
 5   home_ownership              425015 non-null  category
 6   annual_inc                  425015 non-null  float64 
 7   verification_status         425015 non-null  category
 8   addr_state                  425015 non-null  category
 9   dti                         425015 non-null  float64 
 10  delinq_2yrs                 425015 non-null  float64 
 11  fico_range_low              425015 non-null  float64 
 12  inq_last_6mths              425015 non-null  float64 
 13 

In the EDA we applied an **Integer Encoding** to our variables, so now we just need to decide whether or not to apply one hot encoding. To do that, we need to pay attention to which categorical variables have an ordinal relationship between their categories. For example, `grade` has an ordinal relationship, but `verification_status` does not.
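
The distinction can be sketched on a toy frame. The raw labels below are illustrative assumptions (the notebook's data already stores these columns as integer codes), and only three grades are used to keep it short:

```python
import pandas as pd

# Hypothetical raw labels, used only for illustration
toy = pd.DataFrame({
    "grade": ["B", "A", "C", "A"],
    "verification_status": ["Verified", "Not Verified", "Source Verified", "Verified"],
})

# Ordinal feature: integer codes that respect the order A < B < C
grade_order = ["A", "B", "C"]
toy["grade"] = pd.Categorical(toy["grade"], categories=grade_order, ordered=True).codes

# Nominal feature: no meaningful order, so expand it into binary columns
toy = pd.get_dummies(toy, columns=["verification_status"], drop_first=True, dtype=int)

print(toy)
```

Keeping `grade` as a single integer column preserves its ordering, whereas an integer code for `verification_status` would suggest an order that does not exist.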

In [9]:
    # Select the columns without ordinal relationship
columns = ['home_ownership', 'addr_state', 'verification_status']
    # One Hot Encoding, dropping the first category of each variable to keep K-1 columns
X = pd.get_dummies(X, columns=columns, drop_first=True)

    # Check results
display(HTML(X.head().to_html()))

Unnamed: 0,funded_amnt,term,int_rate,grade,emp_length,annual_inc,dti,delinq_2yrs,fico_range_low,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,collections_12_mths_ex_med,acc_now_delinq,open_acc_6m,open_act_il,open_il_24m,open_rv_24m,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc_dlq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_tl,num_il_tl,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,total_il_high_credit_limit,home_ownership_1,home_ownership_2,home_ownership_3,addr_state_1,addr_state_2,addr_state_3,addr_state_4,addr_state_5,addr_state_6,addr_state_7,addr_state_8,addr_state_9,addr_state_10,addr_state_11,addr_state_12,addr_state_13,addr_state_14,addr_state_15,addr_state_16,addr_state_17,addr_state_18,addr_state_19,addr_state_20,addr_state_21,addr_state_22,addr_state_23,addr_state_24,addr_state_25,addr_state_26,addr_state_27,addr_state_28,addr_state_29,addr_state_30,addr_state_31,addr_state_32,addr_state_33,addr_state_34,addr_state_35,addr_state_36,addr_state_37,addr_state_38,addr_state_39,addr_state_40,addr_state_41,addr_state_42,addr_state_43,addr_state_44,addr_state_45,addr_state_46,addr_state_47,addr_state_48,addr_state_49,addr_state_50,verification_status_1,verification_status_2
0,-0.38082,0,-1.301789,0,6,1.146812,-0.192147,-0.375668,0.867194,-0.706459,1,1,1.466509,1.222434,-0.362962,-1.28006,1.623752,-0.119227,-0.071894,2.166771,0.741897,-0.196922,7.170571,1.898908,1.000292,0.638439,1.019823,-1.736113,-0.091184,-0.017977,0.073427,1.01848,-0.615069,-0.55802,0.10879,0,-0.410373,1.091516,1.049814,1.567798,0.388002,-0.183158,-0.124943,0.157414,2.226422,-0.137225,1.804984,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,-0.893563,0,2.547691,4,2,-0.394579,-1.435548,-0.375668,-0.282881,-0.706459,1,0,-1.464425,-0.361617,-0.593404,1.915804,-1.768944,-0.119227,-0.071894,-0.175511,0.741897,1.304953,0.723863,-0.181734,-0.897775,-0.734571,-0.647356,1.338383,-0.091184,-0.017977,-2.038881,-0.100495,0.091934,0.813458,-0.853443,0,0.325333,-0.765696,-1.160979,-1.117975,-1.078166,-0.183158,-1.233508,-1.74942,-0.352351,-0.137225,-0.91816,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,-0.858404,0,-1.301789,0,7,0.503319,-0.632803,-0.375668,0.538601,1.406835,0,0,1.283326,-0.361617,0.320419,-1.091328,0.051527,-0.119227,-0.071894,2.166771,3.518892,7.312453,5.328654,6.060191,2.582014,0.078002,3.502667,-1.220693,-0.091184,-0.017977,-1.515649,-1.415019,-0.615069,-0.55802,1.071022,0,-0.410373,3.41303,1.681469,0.948004,-0.14515,-0.183158,0.42934,0.679668,-0.352351,-0.137225,-0.258492,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
3,-0.565408,1,0.605354,2,7,-0.364964,-0.975535,-0.375668,-0.118585,-0.706459,0,1,0.550592,2.806485,-0.493053,-1.393299,-0.444965,-0.119227,-0.071894,2.166771,0.741897,2.806828,4.407696,-0.181734,0.683947,-0.807131,0.356433,-1.42614,-0.091184,-0.017977,0.073427,-1.382428,-0.42225,-0.18398,-0.853443,0,-0.410373,-0.301393,0.418159,-0.084986,-0.678302,-0.183158,0.42934,0.679668,4.805196,-0.137225,-0.88312,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,-0.096614,0,0.605354,2,7,-0.020455,-0.890137,0.735458,-0.282881,-0.706459,1,0,-1.464425,-0.361617,-0.633119,1.542534,-1.3552,-0.119227,-0.071894,2.166771,1.667562,2.806828,2.565779,3.979549,0.051259,1.189782,-0.629466,1.017598,-0.091184,-0.017977,-0.333531,-0.915283,-0.486523,-0.30866,-0.372327,0,-0.410373,-1.229999,-1.476806,-0.704779,-0.81159,-0.183158,0.42934,-0.534876,-0.352351,-0.137225,-0.303297,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


## Train, Validation & Test

- **Training Dataset**: The sample of data used to fit the model.

- **Validation Dataset**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

- **Test Dataset**: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

![Image](https://miro.medium.com/max/3000/1*Nv2NNALuokZEcV6hYEHdGA.png)

We will use the following proportions:

- **Training**: 70 %
- **Validation**: 15 %
- **Test**: 15 %

In [10]:
    # First, we keep 70% of the data for training; the remaining 30% is held out and split again below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1322)

    # Check dimensions
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((297510, 102), (127505, 102), (297510,), (127505,))

In [11]:
    # We divide the held-out 30% in half to obtain the validation and test sets
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=1322)
    
    # Check dimensions
X_test.shape, X_val.shape, y_test.shape, y_val.shape

((63752, 102), (63753, 102), (63752,), (63753,))
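
Because the target is imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in every subset. The sketch below is a variation on the cells above, not what the notebook does: it uses synthetic stand-in data, and the `stratify` argument is an addition.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: 1,000 rows, 25% positives
rng = np.random.default_rng(1322)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series(np.repeat([0, 1], [750, 250]), name="loan_status")

# 70% train, 30% held out; stratify keeps the class ratio in both parts
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=1322
)

# Split the held-out 30% in half: 15% validation, 15% test
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=1322
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(part), round(part.mean(), 3))
```

Without `stratify`, the class ratio in each subset is left to chance. With a dataset as large as this one the difference is usually small, so this is a refinement rather than a correction.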

## Influential Outliers

We use the IsolationForest algorithm to flag outliers in the training set: `fit_predict` labels each sample as an inlier (1) or an outlier (-1) based on its anomaly score.

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
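
To see how the labels and scores behave, here is a sketch on synthetic 2-D data. The data and the names `X_toy` and `iso_toy` are assumptions made for the example; the estimator parameters mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1322)

# 200 synthetic points around the origin plus 5 points placed far away
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = rng.normal(loc=20.0, scale=0.5, size=(5, 2))
X_toy = np.vstack([inliers, far_points])

iso_toy = IsolationForest(max_samples=100, contamination=0.1, random_state=1322)
labels = iso_toy.fit_predict(X_toy)    # +1 = inlier, -1 = outlier
scores = iso_toy.score_samples(X_toy)  # the lower the score, the more anomalous

print((labels == -1).sum(), "of", len(X_toy), "rows flagged")
print("mean score, bulk vs far points:", scores[:200].mean(), scores[-5:].mean())
```

Note that `contamination=0.1` fixes the share of flagged rows at about 10% regardless of how many true outliers exist, so some ordinary points near the edge of the cloud are flagged as well.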

In [12]:
    # Define the isolationForest params
iso = IsolationForest(max_samples=100, contamination=0.1, random_state=1322)

    # Predict those outliers
yhat = iso.fit_predict(X_train)

    # We create a mask that keeps the inliers, which drops the rows flagged as outliers
mask = yhat != -1
X_train, y_train = X_train[mask], y_train[mask]

    # Finally, check the shapes
X_train.shape, y_train.shape

((267759, 102), (267759,))

***

### Save data

In [13]:
    # Training data
X_train.to_csv('X_train.csv', header=True, index=False)
y_train.to_csv('y_train.csv', header=True, index=False)

    # Validation data
X_val.to_csv('X_val.csv', header=True, index=False)
y_val.to_csv('y_val.csv', header=True, index=False)

    # Test data
X_test.to_csv('X_test.csv', header=True, index=False)
y_test.to_csv('y_test.csv', header=True, index=False)
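
When these files are loaded again in a later step, `pd.read_csv` returns a DataFrame even for the single-column target. A minimal round-trip sketch, using small stand-in objects and a temporary directory (both are assumptions for the example):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-ins for X_train / y_train
X_toy = pd.DataFrame({"funded_amnt": [-0.38, 0.61], "term": [0, 1]})
y_toy = pd.Series([0, 1], name="loan_status")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_train.csv"
    y_path = Path(tmp) / "y_train.csv"
    X_toy.to_csv(x_path, header=True, index=False)
    y_toy.to_csv(y_path, header=True, index=False)

    X_back = pd.read_csv(x_path)
    # read_csv always returns a DataFrame; select the column to get a Series again
    y_back = pd.read_csv(y_path)["loan_status"]

print(type(X_back).__name__, type(y_back).__name__)
```

CSV files do not store pandas dtypes, so columns cast to `category` earlier come back as plain numeric columns and would need to be cast again if the distinction matters downstream.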

***
