# Feature Engineering for the House Price Model pipeline

This notebook handles feature engineering. Here, we'll tackle the following:

1. Missing values
2. Non-Gaussian distributed features
3. Categorical Features: Removing the rare labels
4. Categorical Features: convert strings to numbers
5. Handling Multicollinearity
6. Scaling the dataframe

Now, we proceed to import the packages.

In [1]:
# Import Packages
# To handle datasets
import numpy as np
import pandas as pd

# for plotting visuals
import matplotlib.pyplot as plt

# for the yeo-johnson transformation
import scipy.stats as stats

# For splitting dataset
from sklearn.model_selection import train_test_split

# Feature Scaling
from sklearn.preprocessing import StandardScaler

# To save the trained scaler class
import joblib

# To visualize all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load dataset
data = pd.read_csv('train.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(14448, 16)


| Unnamed: 0 | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | age_group | income_bracket | rooms_per_household | bedrooms_per_household | population_per_household | median_income_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -118.02 | 33.93 | 35.0 | 2400.0 | 398.0 | 1218.0 | 408.0 | 4.1312 | 193800.0 | <1H OCEAN | middle-aged | low-income | 5.882353 | 0.97549 | 2.985294 | 0.010125 |
| 1 | -117.09 | 32.79 | 20.0 | 2183.0 | 534.0 | 999.0 | 496.0 | 2.8631 | 169700.0 | NEAR OCEAN | young | low-income | 4.40121 | 1.076613 | 2.014113 | 0.005772 |
| 2 | -120.14 | 34.59 | 24.0 | 1601.0 | 282.0 | 731.0 | 285.0 | 4.2026 | 259800.0 | NEAR OCEAN | young | low-income | 5.617544 | 0.989474 | 2.564912 | 0.014746 |
| 3 | -121.0 | 39.26 | 14.0 | 810.0 | 151.0 | 302.0 | 138.0 | 3.1094 | 136100.0 | INLAND | minor | low-income | 5.869565 | 1.094203 | 2.188406 | 0.022532 |
| 4 | -122.45 | 37.77 | 52.0 | 3188.0 | 708.0 | 1526.0 | 664.0 | 3.3068 | 500001.0 | NEAR BAY | elderly | low-income | 4.801205 | 1.066265 | 2.298193 | 0.00498 |


## Separate dataset into train and test sets

When engineering features, some techniques learn parameters from the data. It is important to learn these parameters from the train set only, so that no information from the test set leaks into the pipeline and we avoid overfitting.

Typically, our feature engineering techniques will learn things like the mean, the mode, the exponents of the Yeo-Johnson transformation, category frequencies, and category-to-number mappings from the train set.
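
As a minimal sketch of this idea, the snippet below learns a mean from a toy train set and reuses that same value to fill missing entries in the test set. The frames `toy_train` and `toy_test` and their values are made up for illustration and are not part of the house price data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy_train = pd.DataFrame({"total_rooms": [2.0, 4.0, np.nan, 6.0]})
toy_test = pd.DataFrame({"total_rooms": [np.nan, 10.0]})

# Learn the parameter from the train set only (NaN is skipped by default)
train_mean = toy_train["total_rooms"].mean()

# Apply the same learned value to both sets
toy_train_filled = toy_train.fillna({"total_rooms": train_mean})
toy_test_filled = toy_test.fillna({"total_rooms": train_mean})

print(train_mean)
print(toy_test_filled)
```

The test set never contributes to `train_mean`; it only receives the value learned from the train set.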

### Reproducibility: Setting the seed

To ensure reproducibility between runs of the same notebook, and between the research and production environments, it is extremely important that we **set the seed** for every step that involves randomness.
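
As a small illustration of what the seed does, the sketch below splits a toy array twice with the same `random_state` and checks that both splits match. The array `values` is a made-up stand-in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy array standing in for a dataset (hypothetical, for illustration only)
values = np.arange(20)

# Two independent calls with the same seed
a_train, a_test = train_test_split(values, test_size=0.1, random_state=0)
b_train, b_test = train_test_split(values, test_size=0.1, random_state=0)

# With an identical seed, both calls produce the same partition
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print(same_split)
```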

In [4]:
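# Let's separate the data into train and test sets
# Remember to set the seed (random_state for this sklearn function)
# Note: 'median_house_value' is not dropped here, so it remains a column of X_train / X_test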

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['total_bedrooms', 'households', 'age_group'], axis=1), # predictive variables
    data['median_house_value'], # target
    test_size=0.1, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

X_train.shape, X_test.shape

((13003, 13), (1445, 13))

## Feature Engineering

We're going to embark on the core feature engineering tasks in order to tackle:

1. Missing values
2. Temporal values
3. Non-Gaussian distributed values
4. Categorical features: remove rare labels, if any
5. Categorical features: convert strings to numbers
6. Put the variables on a similar scale
    
### Target / Label

We apply the Yeo-Johnson transformation to the target variable to bring its distribution closer to Gaussian.

In [6]:
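# Learn the exponent (lambda) from the train target only...
y_train, target_lambda = stats.yeojohnson(y_train)
# ...and apply that same exponent to the test target
y_test = stats.yeojohnson(y_test, lmbda=target_lambda)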
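
A model trained on a transformed target predicts on the transformed scale, so predictions generally need to be mapped back to the original units. The sketch below shows one possible way to do this with scikit-learn's `PowerTransformer`. This is an alternative to the `scipy.stats.yeojohnson` call used in this notebook, not the notebook's own method, and the target values `toy_y_train` and `toy_y_test` are synthetic.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed targets (hypothetical, for illustration only)
rng = np.random.default_rng(0)
toy_y_train = rng.lognormal(mean=0.5, sigma=0.5, size=200).reshape(-1, 1)
toy_y_test = rng.lognormal(mean=0.5, sigma=0.5, size=20).reshape(-1, 1)

# standardize=False keeps only the power transform (no zero-mean / unit-variance step)
pt = PowerTransformer(method="yeo-johnson", standardize=False)

toy_y_train_t = pt.fit_transform(toy_y_train)  # lambda is learned from the train target
toy_y_test_t = pt.transform(toy_y_test)        # the same lambda is reused on the test target

# Map transformed values back to the original scale
toy_y_test_back = pt.inverse_transform(toy_y_test_t)

print(pt.lambdas_)
print(np.allclose(toy_y_test_back, toy_y_test))
```

The fitted transformer can be saved with `joblib.dump`, in the same way as the scaler, so that the same lambda is available in production.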