### Title: 
# Feature Engineering

### Description:
What is a feature, and why does it need engineering? Every machine learning algorithm turns input data into outputs. That input data is made up of features, usually in the form of structured columns. Algorithms need features with specific characteristics to work properly, and this is where feature engineering comes in. Feature engineering efforts mainly have two goals:

- Preparing the proper input dataset, compatible with the machine learning algorithm requirements.
- Improving the performance of machine learning models.

### Authors:
#### Hugo Cesar Octavio del Sueldo
#### Jose Lopez Galdon

### Date:
04/11/2020

### Version:
1.0

***

### Libraries

In [1]:
    # Numpy & Pandas to work with the DF
import numpy as np
import pandas as pd

    # Visualize DF & images
from IPython.display import display, HTML

    # Import Sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest

## Load data

In [2]:
    # To automate the work as much as possible, we parameterize the code: here we create an object with
    # the name of the input file
name = 'data_preprocessed'

data = pd.read_csv(f'../data/02_intermediate/{name}.csv',  # File path: an f-string built from the variable name
                   low_memory = False)                     # Set low_memory = False to avoid mixed-dtype warnings

### View data

First, we take a look at our dataframe:

In [3]:
    # First 5 rows using html display in order to view all the columns
display(HTML(data.head().to_html()))

Unnamed: 0,loan_status,funded_amnt,term,int_rate,emp_length,home_ownership,annual_inc,addr_state,inq_last_6mths,open_acc,revol_bal,revol_util,total_acc,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,num_actv_bc_tl,num_actv_rev_tl,num_bc_tl,num_il_tl,num_tl_op_past_12m,pct_tl_nvr_dlq,total_il_high_credit_limit,debt_settlement_flag
0,0,-0.37408,0,-1.295542,6,2,1.16572,4,-0.706388,1.472352,-0.361733,-1.274618,1.620425,1.008902,0.632728,1.005441,-1.730262,0.073228,1.015729,-0.615098,-0.558955,0.100821,1.098916,1.060693,1.558178,0.391566,-0.122448,0.151335,1.811281,0
1,0,-0.88836,0,2.596361,2,3,-0.401111,47,-0.706388,-1.465798,-0.591312,1.92344,-1.772693,-0.897198,-0.735431,-0.648872,1.344423,-2.043405,-0.104311,0.091684,0.812923,-0.857378,-0.765412,-1.161177,-1.122894,-1.079178,-1.233873,-1.770156,-0.915202,0
2,0,-0.853095,0,-1.295542,7,1,0.511606,43,1.403049,1.288718,0.319091,-1.085757,0.048005,2.597318,0.074271,3.469124,-1.214811,-1.519101,-1.420086,-0.615098,-0.558955,1.059021,3.429325,1.695513,0.939469,-0.14325,0.433265,0.677603,-0.254725,0
3,0,-0.559221,1,0.632618,7,3,-0.371006,4,-0.706388,0.55418,-0.491337,-1.387935,-0.448549,0.691219,-0.807735,0.34717,-1.42027,0.073228,-1.387463,-0.422339,-0.184807,-0.857378,-0.29933,0.425873,-0.091712,-0.678066,0.433265,0.677603,-0.880119,0
4,0,-0.089022,0,0.632618,7,1,-0.020812,19,-0.706388,-1.465798,-0.630879,1.549914,-1.358898,0.055852,1.182123,-0.63112,1.023618,-0.334563,-0.919874,-0.486592,-0.309523,-0.378278,-1.231494,-1.478587,-0.710421,-0.81177,0.433265,-0.546277,-0.299585,0


In [4]:
    # Last 5 rows using html display in order to view all the columns
display(HTML(data.tail().to_html()))

Unnamed: 0,loan_status,funded_amnt,term,int_rate,emp_length,home_ownership,annual_inc,addr_state,inq_last_6mths,open_acc,revol_bal,revol_util,total_acc,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,num_actv_bc_tl,num_actv_rev_tl,num_bc_tl,num_il_tl,num_tl_op_past_12m,pct_tl_nvr_dlq,total_il_high_credit_limit,debt_settlement_flag
407314,0,-0.37408,0,0.536988,1,3,-0.775325,43,-0.706388,-0.547626,0.085279,0.500682,-0.696826,-1.214881,-0.713925,-0.021122,0.172943,-0.256889,0.026179,2.276281,0.064626,-0.857378,1.564998,0.425873,0.32076,-0.544362,-0.678161,0.677603,-0.967453,0
407315,1,-0.324121,0,-0.263631,10,1,-0.203355,35,-0.706388,0.186912,0.216125,0.697938,2.86181,1.961951,0.309115,-0.325507,0.266661,-0.10154,0.559014,-0.743603,-0.808388,-0.378278,0.166752,0.743283,-0.710421,4.135279,0.433265,0.469543,1.286272,0
407316,1,-0.206572,1,0.625946,5,3,-0.62929,42,-0.706388,-0.547626,-0.247588,0.345396,-0.36579,0.691219,-0.611946,-0.477221,0.515375,0.150903,-0.800258,-0.422339,-0.184807,-0.857378,0.166752,0.108463,-0.916657,0.391566,0.433265,0.677603,-0.231336,0
407317,1,-0.324121,1,1.515524,0,3,-0.182059,19,1.403049,0.921449,-0.375352,-0.943061,-0.531308,0.373535,-0.605756,-0.395424,-0.010889,-1.363752,-1.072112,-0.486592,-0.309523,-0.857378,0.632834,0.425873,-0.297949,-0.81177,-0.122448,0.065663,0.626566,0
407318,0,0.616276,0,-0.263631,1,3,0.359487,9,0.348331,-0.73126,0.775397,1.121827,-1.358898,-1.214881,-0.588653,-0.075745,0.655953,0.383927,-0.289172,-0.486592,-0.309523,-0.857378,0.632834,-0.208947,-0.297949,-1.079178,-0.678161,0.677603,-1.008758,0


In [5]:
    # Data dimension
data.shape

(407319, 30)

In [6]:
    # Loan status class counts
data['loan_status'].value_counts()

0    318635
1     88684
Name: loan_status, dtype: int64

***

## Data X & Data Y

First, we create a new dataset without the variable we want to predict, in this case `loan_status`; this will be our *X data*. The *y data* will be a vector containing the `loan_status` variable.

In [7]:
    # Set X data
X = data.drop("loan_status", axis = 1)

    # Set y data
y = data["loan_status"]

    # Check dimensions
X.shape, y.shape

((407319, 29), (407319,))

## One Hot Encoding

**One Hot Encoding** is a common way of preprocessing categorical features for machine learning models. This type of encoding creates a new binary feature for each possible category and assigns a value of 1 to the feature of each sample that corresponds to its original category.
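As a minimal sketch of the idea, the toy frame below (hypothetical values, not the loan data) is encoded with `pd.get_dummies`, with and without `drop_first=True`. By default `get_dummies` only encodes object, string and category columns, which is why integer-coded categories need to be cast to `category` before encoding.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (hypothetical values)
toy = pd.DataFrame({
    "amount": [1.0, 2.0, 3.0, 4.0],
    "home_ownership": pd.Categorical(["RENT", "OWN", "MORTGAGE", "RENT"]),
})

# Full encoding: one binary column per category (K columns)
full = pd.get_dummies(toy)

# drop_first=True removes the first category level (K-1 columns);
# the dropped level is implied when all remaining dummies are 0
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids a perfectly collinear set of dummy columns, which matters mostly for linear models; tree-based models are usually less sensitive to it.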

In [8]:
    # Select those categorical columns
columns_categ = ['term', 'home_ownership', 'emp_length', 'addr_state', 'debt_settlement_flag']
    
    # Below, we transform the variables into categorical with the astype function.
X[columns_categ] = X[columns_categ].astype('category')

    # Check the results
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407319 entries, 0 to 407318
Data columns (total 29 columns):
 #   Column                      Non-Null Count   Dtype   
---  ------                      --------------   -----   
 0   funded_amnt                 407319 non-null  float64 
 1   term                        407319 non-null  category
 2   int_rate                    407319 non-null  float64 
 3   emp_length                  407319 non-null  category
 4   home_ownership              407319 non-null  category
 5   annual_inc                  407319 non-null  float64 
 6   addr_state                  407319 non-null  category
 7   inq_last_6mths              407319 non-null  float64 
 8   open_acc                    407319 non-null  float64 
 9   revol_bal                   407319 non-null  float64 
 10  revol_util                  407319 non-null  float64 
 11  total_acc                   407319 non-null  float64 
 12  acc_open_past_24mths        407319 non-null  float64 
 13 

In [9]:
    # One-hot encoding, dropping the first level of each categorical column so that K categories yield K-1 dummies
X = pd.get_dummies(X, drop_first=True)

    # Check results on the encoded dataframe
display(HTML(X.tail().to_html()))


## Train, Validation & Test

- **Training Dataset**: The sample of data used to fit the model.

- **Validation Dataset**: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.

- **Test Dataset**: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.

![Image](https://miro.medium.com/max/3000/1*Nv2NNALuokZEcV6hYEHdGA.png)

We will use the following split proportions:

- **Training**: 70 %
- **Validation**: 15 %
- **Test**: 15 %

In [10]:
    # First we split off the training set (70 %) from a hold-out set (30 %), temporarily named "test"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1322)

    # Check dimensions
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((285123, 90), (122196, 90), (285123,), (122196,))

In [11]:
    # Then we split the hold-out set in half: 15 % test and 15 % validation
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size = 0.5, random_state = 1322)
    
    # Check dimensions
X_test.shape, X_val.shape, y_test.shape, y_val.shape

((61098, 90), (61098, 90), (61098,), (61098,))
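The two-step split can be sanity-checked on a small synthetic dataset. The sketch below uses hypothetical toy data and additionally passes `stratify`, which the cells above do not use; it is shown only as an option worth considering when the target classes are imbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, roughly 20 % positives (hypothetical, not the loan data)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.normal(size=1000)})
y_toy = pd.Series((rng.random(1000) < 0.2).astype(int), name="target")

# Step 1: hold out 30 % of the rows (stratify is an assumption, not in the original cells)
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# Step 2: split the hold-out set in half -> 15 % validation, 15 % test
X_te, X_va, y_te, y_va = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0, stratify=y_hold)

# Fraction of the original rows that ended up in each split
fractions = {split: len(part) / len(X_toy)
             for split, part in [("train", X_tr), ("val", X_va), ("test", X_te)]}
print(fractions)

# Positive rate per split, to compare against the full dataset
print(y_toy.mean(), y_tr.mean(), y_va.mean(), y_te.mean())
```

With stratification, the positive rate in each split stays close to the rate in the full dataset; without it, the rates can drift by chance.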

## Influential Outliers

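We use the IsolationForest algorithm to flag outliers in the training set. Its `fit_predict` method labels each sample as an inlier (1) or an outlier (-1).

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Outliers tend to be separated from the rest of the data after fewer random splits than typical observations, and this is what the algorithm uses to score them.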
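To make the behaviour concrete, here is a minimal sketch on synthetic data (hypothetical values, with default `max_samples` rather than the setting used in this notebook). One obviously extreme point is appended to 200 standard-normal points, and the fitted forest is asked for both labels and scores.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 200 ordinary points plus one extreme point (hypothetical toy data)
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extreme = np.array([[50.0, 50.0]])
X_toy = np.vstack([inliers, extreme])

# contamination sets the share of samples that will be labelled as outliers
iso_toy = IsolationForest(contamination=0.05, random_state=0)
labels = iso_toy.fit_predict(X_toy)     # +1 = inlier, -1 = outlier
scores = iso_toy.score_samples(X_toy)   # lower score = more anomalous

# Keep only the rows labelled as inliers
keep = labels != -1
X_clean = X_toy[keep]

print(labels[-1], X_toy.shape, X_clean.shape)
```

Note that `contamination` fixes the fraction of rows flagged, so some ordinary points are labelled -1 as well; the value should reflect how many outliers are actually expected in the data.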

In [12]:
    # Define the isolationForest params
iso = IsolationForest(max_samples = 100, contamination = 0.1, random_state = 1322)

    # Predict those outliers
yhat = iso.fit_predict(X_train)

    # We create a mask of the inliers (label != -1) and keep only those rows, dropping the outliers
mask = yhat != -1
X_train, y_train = X_train[mask], y_train[mask]

    # Finally, check the shapes
X_train.shape, y_train.shape

((256610, 90), (256610,))

***

### Save data

In [13]:
    # Save each split (training, validation and test) to its own CSV file, keeping the header and dropping the index
splits = {'X_train': X_train, 'y_train': y_train,
          'X_val': X_val, 'y_val': y_val,
          'X_test': X_test, 'y_test': y_test}

for split_name, split_data in splits.items():
    split_data.to_csv(f'{split_name}.csv', header = True, index = False)
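As a hedged sketch of how files written this way could be read back later, the example below round-trips a toy frame and a toy target through a temporary directory (file names and values are hypothetical). Because `index = False` discards the index, rows are matched by position when the files are reloaded, and a single-column target file comes back as a DataFrame unless it is squeezed into a Series.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for a feature matrix and a target vector (hypothetical values)
X_toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
y_toy = pd.Series([0, 1, 0], name="loan_status")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # Write with a header row and without the index, as in the cell above
    X_toy.to_csv(out_dir / "X_toy.csv", header=True, index=False)
    y_toy.to_csv(out_dir / "y_toy.csv", header=True, index=False)

    # Read back; squeeze turns the one-column frame into a Series
    X_back = pd.read_csv(out_dir / "X_toy.csv")
    y_back = pd.read_csv(out_dir / "y_toy.csv").squeeze("columns")

print(X_back.equals(X_toy), y_back.equals(y_toy))
```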