## Quote prediction: Feature engineering

There will be a notebook for each one of the Machine Learning Pipeline steps:

1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building

**This is the notebook for step 2: Feature engineering**

===================================================================================================

In [67]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

import scipy.stats as stats

# to visualise all the columns and rows in the dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
%config Completer.use_jedi = False

from sklearn import preprocessing
# label_encoder maps each category label to an integer code
label_encoder = preprocessing.LabelEncoder()

import joblib

from datetime import datetime
from datetime import date

In [108]:
# load dataset
data = pd.read_csv('../data/output/step1data.csv')

data.head()

   Ins_Age  Ins_Gender   Ht   Wt   bmi  quote
0       31        Male  510  185  26.5    500
1       35        Male  510  205  29.4    500
2       45      Female  510  125  17.9    900
3       38        Male  503  175  31.0    500
4       39      Female  600  252  34.1    450


In [109]:
data['Ins_Gender'] = label_encoder.fit_transform(data['Ins_Gender'])
print(data.head())

   Ins_Age  Ins_Gender   Ht   Wt   bmi  quote
0       31           1  510  185  26.5    500
1       35           1  510  205  29.4    500
2       45           0  510  125  17.9    900
3       38           1  503  175  31.0    500
4       39           0  600  252  34.1    450
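
To see which integer stands for which label, we can inspect the fitted encoder. The sketch below uses a small hypothetical series rather than the real `Ins_Gender` column; `LabelEncoder` sorts the labels, so `Female` comes before `Male`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column that mirrors the values of Ins_Gender
gender = pd.Series(["Male", "Male", "Female", "Male", "Female"])

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ holds the original labels in sorted order;
# the position of a label in classes_ is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = encoder.inverse_transform(codes)
```

Keeping the fitted encoder around (for example with `joblib`) is one way to apply the same mapping to new data later.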


In [110]:
#data["height"]=((data["Ht"]//100)*30.48+(data["Ht"]%100)*2.54)*0.01
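
The commented-out line above suggests that `Ht` packs feet and inches into one integer (510 read as 5 ft 10 in). The notebook does not state this, so the sketch below is only an illustration under that assumption, on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy frame; Ht is ASSUMED to be feet * 100 + inches
toy = pd.DataFrame({"Ht": [510, 503, 600]})

feet = toy["Ht"] // 100    # 510 -> 5
inches = toy["Ht"] % 100   # 510 -> 10

# 1 ft = 30.48 cm, 1 in = 2.54 cm; multiply by 0.01 to convert cm to metres
toy["height_m"] = (feet * 30.48 + inches * 2.54) * 0.01
print(toy)
```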

### Separate dataset into train and test

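It is important to separate our data into training and testing sets.

When we engineer features, some techniques learn parameters from the data (for example, the minimum and maximum used by a scaler). These parameters should be learned from the train set only, so that no information from the test set leaks into the pipeline and the evaluation on the test set stays honest.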

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**
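
As an illustration of "learn from train only", here is a minimal sketch with `MinMaxScaler` on a hypothetical toy frame. The scaler is not applied to the real data in this notebook; the frame `toy` and its values are made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with a single numeric feature
toy = pd.DataFrame(
    {"bmi": [17.9, 26.5, 29.4, 31.0, 34.1, 22.0, 25.8, 27.3, 15.9, 29.0]}
)

# Fixing random_state makes the split reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)

scaler = MinMaxScaler()
scaler.fit(toy_train)  # min and max are learned from the train set only

train_scaled = scaler.transform(toy_train)
# The test set reuses the train min and max, so its scaled values
# are not guaranteed to stay inside [0, 1]
test_scaled = scaler.transform(toy_test)
```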

In [116]:
# Let's separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Ht', 'Wt', 'quote'], axis=1), # predictor variables
    data['quote'], # target
    test_size=0.3, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

X_train.shape, X_test.shape

((70, 3), (30, 3))

In [117]:
X_train.head()

    Ins_Age  Ins_Gender   bmi
60       37           1  29.0
80       36           1  15.9
90       50           1  17.3
68       19           0  25.8
51       55           1  27.3


In [118]:
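# bin age into decades; intervals are right-closed, e.g. (10, 20]
bins = [10, 20, 30, 40, 50, 60]

# work on explicit copies to avoid pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

X_train['age_bin'] = pd.cut(X_train['Ins_Age'], bins).astype('object')
X_test['age_bin'] = pd.cut(X_test['Ins_Age'], bins).astype('object')

# the raw age is replaced by its bin
X_train.drop(['Ins_Age'], axis=1, inplace=True)
X_test.drop(['Ins_Age'], axis=1, inplace=True)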
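
A quick look at how `pd.cut` treats the bin edges may help here. This is a small sketch on hypothetical ages, not on the real data: by default the intervals are closed on the right, and values outside the outermost edges become missing.

```python
import pandas as pd

bins = [10, 20, 30, 40, 50, 60]

# Hypothetical ages chosen to probe the edges of the bins
ages = pd.Series([19, 20, 21, 60, 65])

binned = pd.cut(ages, bins)
print(binned)

# 19 and 20 share a bin because intervals are right-closed;
# 65 lies beyond the last edge and is therefore missing (NaN)
n_missing = int(binned.isna().sum())
```

If ages above 60 or at most 10 can occur, the outer edges would need widening, otherwise those rows end up with no age bin at all.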

In [119]:
X_train = pd.get_dummies(X_train, columns = ["age_bin"])
X_test = pd.get_dummies(X_test, columns = ["age_bin"])
print(X_test.head())

    Ins_Gender   bmi  age_bin_(10, 20]  age_bin_(20, 30]  age_bin_(30, 40]  \
26           1  18.4                 0                 0                 1   
86           1  29.5                 0                 0                 0   
2            0  17.9                 0                 0                 0   
55           1  20.5                 0                 0                 0   
75           1  28.2                 0                 0                 0   

    age_bin_(40, 50]  age_bin_(50, 60]  
26                 0                 0  
86                 1                 0  
2                  1                 0  
55                 1                 0  
75                 1                 0  


In [121]:
y_train.head()

60     500
80     750
90    1000
68     450
51     500
Name: quote, dtype: int64

In [122]:
X_train.dtypes

Ins_Gender            int64
bmi                 float64
age_bin_(10, 20]      uint8
age_bin_(20, 30]      uint8
age_bin_(30, 40]      uint8
age_bin_(40, 50]      uint8
age_bin_(50, 60]      uint8
dtype: object

### Categorical variables
<a id="remove_rares"></a>




There are a few categories that appear in fewer than 0.05% of the rows, as stated in the preceding notebook. Those categories will be replaced with "rare".
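
A minimal sketch of such a replacement is shown below. The column `occupation`, the toy frames and the threshold are all hypothetical and chosen only so that the effect is visible on ten rows; the frequencies are learned from the train set and then applied to both sets.

```python
import pandas as pd

# Hypothetical toy data: 'occupation' is not a column of the quote dataset
toy_train = pd.DataFrame({"occupation": ["clerk"] * 6 + ["nurse"] * 3 + ["pilot"]})
toy_test = pd.DataFrame({"occupation": ["clerk", "pilot", "diver"]})

threshold = 0.2  # hypothetical cut-off, far larger than 0.05% for illustration

# Category frequencies are learned from the train set only
freq = toy_train["occupation"].value_counts(normalize=True)
frequent = freq[freq >= threshold].index

# Anything that is not a frequent train category becomes "rare",
# including categories never seen in train (such as "diver")
for df in (toy_train, toy_test):
    df["occupation"] = df["occupation"].where(
        df["occupation"].isin(frequent), "rare"
    )

print(toy_test)
```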

In [123]:
# let's now save the train and test sets for the next notebook!

X_train.to_csv('../data/output/xtrain.csv', index=False)
X_test.to_csv('../data/output/xtest.csv', index=False)

y_train.to_csv('../data/output/ytrain.csv', index=False)
y_test.to_csv('../data/output/ytest.csv', index=False)