# Feature engineering for House Price modelling

In this notebook, I will bring together various techniques for feature engineering to tackle a regression problem. I hope to give you a flavour of how to approach the end-to-end pipeline to build machine learning algorithms for regression.

For more feature engineering techniques, check my new course [Feature Engineering for Machine Learning](https://www.udemy.com/feature-engineering-for-machine-learning/?couponCode=PROMO_KGG), which was recently launched on Udemy.

In [None]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import StandardScaler

# for variable transformation
import scipy.stats as stats

# to build the models
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import xgboost as xgb

# to evaluate the models
from sklearn.metrics import mean_squared_error

pd.set_option('display.max_columns', None)

In [None]:
# load dataset
data = pd.read_csv("../input/train.csv")
print(data.shape)
data.head()

In [None]:
# Load the dataset for submission (the one on which our model will be evaluated by Kaggle)
# it contains exactly the same variables, but not the target

submission = pd.read_csv("../input/test.csv")
submission.head()

The House Price dataset contains 80 different variables. We could investigate each of them individually, and in a business scenario I think this would be the right way to proceed, as there are not that many. For the purpose of this notebook, however, I will try to automate the feature engineering pipeline, deciding a priori when to apply one feature engineering technique or another.

There are other good visualisation notebooks on Kaggle that you can check to become more familiar with what the variables look like.

### Types of variables

Let's go ahead and find out what types of variables there are in this dataset.

In [None]:
# find categorical variables
categorical = [var for var in data.columns if data[var].dtype=='O']
print('There are {} categorical variables'.format(len(categorical)))

In [None]:
# find numerical variables
numerical = [var for var in data.columns if data[var].dtype!='O']
print('There are {} numerical variables'.format(len(numerical)))

Numerical variables can be binary, continuous or discrete. A priori, it is good practice to know what each variable means, to then be able to differentiate continuous from discrete variables. In this notebook, I will assume that variables with a definite and low number of unique values are discrete.

#### Find discrete variables

To identify discrete variables, I will select from all the numerical ones, those that contain a finite and small number of distinct values. See below.

In [None]:
# let's visualise the values of the discrete variables
discrete = []
for var in numerical:
    if len(data[var].unique())<20:
        print(var, ' values: ', data[var].unique())
        discrete.append(var)
        
print('There are {} discrete variables'.format(len(discrete)))

As you can see, there are a number of discrete variables in the dataset, for example BedroomAbvGr, whose values indicate the number of bedrooms in the house.

### Types of problems the variables may present

#### Missing values

In [None]:
# let's visualise the percentage of missing values
for var in data.columns:
    if data[var].isnull().sum()>0:
        print(var, data[var].isnull().mean())

There are a few variables that contain missing information (NaN). Some of them contain a lot of missing values, and some of them only a few. Let's first identify those that contain a lot of NaN and then see how we can process the different variables.

In [None]:
# let's inspect the type of those variables with a lot of missing information
for var in data.columns:
    if data[var].isnull().mean()>0.80:
        print(var, data[var].unique())

The variables with a high percentage of missing data are categorical. We will need to fill those in.

#### Outliers

Let's find out now if the variables contain outliers.

In [None]:
# first we make a list of continuous variables (from the numerical ones)
continuous = [var for var in numerical if var not in discrete and var not in ['Id', 'SalePrice']]
continuous

In [None]:
# let's make boxplots to visualise outliers in the continuous variables 
# and histograms to get an idea of the distribution

for var in continuous:
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
    fig = sns.boxplot(y=data[var])
    fig.set_title('')
    fig.set_ylabel(var)
    
    plt.subplot(1, 2, 2)
    fig = sns.histplot(data[var].dropna(), kde=True)  # distplot is deprecated in recent seaborn
    fig.set_ylabel('Number of houses')
    fig.set_xlabel(var)

    plt.show()

Outliers can be visualised as the dots outside the whiskers in the boxplots. The majority of the continuous variables seem to contain outliers. In addition, the majority of the variables are not normally distributed. If we are planning to build a linear regression, we should tackle both issues to improve model performance. Later on in the notebook I will transform the variables with the Box-Cox transformation to try and make them more "Gaussian" looking. I will not cover outlier removal in this notebook though.
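The dots in a boxplot follow the usual convention of lying more than 1.5 times the inter-quartile range beyond the quartiles (the default `whis=1.5`). As a rough sketch, the same rule can be used to count them instead of reading them off the plot. The helper `count_iqr_outliers` and the toy series are hypothetical and not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(series, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((s < lower) | (s > upper)).sum())

# made-up values: nine ordinary observations and one extreme one
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
n_outliers = count_iqr_outliers(toy)
print(n_outliers)
```

On the real data, something like `count_iqr_outliers(data[var])` could be run for each variable in `continuous`.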

In [None]:
# let's look at the distribution of the target variable

plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
fig = sns.boxplot(y=data['SalePrice'])
fig.set_title('')
fig.set_ylabel('SalePrice')

plt.subplot(1, 2, 2)
fig = sns.histplot(data['SalePrice'].dropna(), kde=True)
fig.set_ylabel('Number of houses')
fig.set_xlabel('SalePrice')

plt.show()

The target variable is also skewed. So I will transform it as well to boost the performance of the algorithm.

#### Outliers in discrete variables

Let's calculate the percentage of houses that show each of the values the discrete variables can take. I will call outliers those values that are present in less than 1% of the houses.

In [None]:
# outliers in discrete variables
for var in discrete:
    print(data[var].value_counts() / len(data))
    print()

Most of the discrete variables show values that are shared by a tiny proportion of houses in the dataset. For linear regression, this may not be a problem, but it most likely will be for tree methods. We should take this into account to improve the performance of our trees.
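To make the 1% rule explicit, the frequencies can be filtered rather than read by eye. This is a small sketch on made-up data; `rare_values` is a hypothetical helper, not something defined elsewhere in the notebook.

```python
import pandas as pd

def rare_values(series, threshold=0.01):
    """Return the values whose relative frequency is below `threshold`."""
    freq = series.value_counts(normalize=True)
    return sorted(freq[freq < threshold].index.tolist())

# toy column: 150 houses with value 1, 49 with value 2, a single house with value 3
toy = pd.Series([1] * 150 + [2] * 49 + [3] * 1)
rare = rare_values(toy)
print(rare)
```

Applied to the dataset, this could be called as `rare_values(data[var])` for each variable in `discrete`.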


#### Number of labels: cardinality

Let's now check if our categorical variables have a huge number of categories. This may be a problem for some machine learning models.

In [None]:
for var in categorical:
    print(var, ' contains ', len(data[var].unique()), ' labels')

Most of the variables contain only a few labels, so we do not have to deal with high cardinality. That is good news!

Variables with high cardinality may affect the performance of some machine learning models, for example trees.

### Separate train and test set

In [None]:
# Let's separate into train and test set

X_train, X_test, y_train, y_test = train_test_split(data, data.SalePrice, test_size=0.2,
                                                    random_state=0)
X_train.shape, X_test.shape

### Engineering missing values in numerical variables
#### Continuous variables

In [None]:
# print variables with missing data
for col in continuous:
    if X_train[col].isnull().mean()>0:
        print(col, X_train[col].isnull().mean())

- LotFrontage and GarageYrBlt contain a relatively high percentage of missing values, therefore I will create an additional variable to indicate NA, and then I will do median imputation on the original variable.
- MasVnrArea contains a small percentage of missing values, thus I will just do median imputation.
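As an aside, scikit-learn offers the same two steps (median imputation plus a missingness indicator) in a single transformer. The sketch below is an alternative to the manual approach used in this notebook, not what the notebook itself runs; the values are made up and only the column name comes from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up values, one missing entry in each set
train = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0, 70.0]})
test = pd.DataFrame({'LotFrontage': [np.nan, 65.0]})

imputer = SimpleImputer(strategy='median', add_indicator=True)
imputer.fit(train)  # the median (70.0) is learned from the train set only

# column 0: imputed values, column 1: missingness indicator
test_imputed = imputer.transform(test)
print(test_imputed)
```

Because the median is stored inside the fitted imputer, the same object can be reused on any later dataset without recomputing anything.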

In [None]:
# add variable indicating missingness + median imputation
# (assigning the result back avoids relying on inplace=True on a column selection)
for df in [X_train, X_test, submission]:
    for var in ['LotFrontage', 'GarageYrBlt']:
        df[var+'_NA'] = np.where(df[var].isnull(), 1, 0)
        df[var] = df[var].fillna(X_train[var].median())

for df in [X_train, X_test, submission]:
    df['MasVnrArea'] = df['MasVnrArea'].fillna(X_train['MasVnrArea'].median())

#### Discrete variables

In [None]:
# print variables with missing data
for col in discrete:
    if X_train[col].isnull().mean()>0:
        print(col, X_train[col].isnull().mean())

There are no missing data in the discrete variables. Good, then we don't have to engineer them.

### Engineering Missing Data in categorical variables

In [None]:
# print variables with missing data
for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, X_train[col].isnull().mean())

I will add a 'Missing' label to all of them. If the missing data are rare, I will handle those together with rare labels in a subsequent engineering step.

In [None]:
# add label indicating 'Missing' to categorical variables

for df in [X_train, X_test, submission]:
    for var in categorical:
        df[var] = df[var].fillna('Missing')

In [None]:
# check absence of null values
for var in X_train.columns:
    if X_train[var].isnull().sum()>0:
        print(var, X_train[var].isnull().sum())

In [None]:
# check absence of null values
for var in X_train.columns:
    if X_test[var].isnull().sum()>0:
        print(var, X_test[var].isnull().sum())

In [None]:
# check absence of null values
submission_vars = []
for var in X_train.columns:
    if var!='SalePrice' and submission[var].isnull().sum()>0:
        print(var, submission[var].isnull().sum())
        submission_vars.append(var)

This is important. There are variables in the submission dataset that contain null values (missing data), whereas in the training set they did not. This needs to be taken into consideration when making predictions or deploying models in business scenarios.
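One way to guard against this, sketched here on made-up data, is to learn a fallback value from the train set for every numerical column, not only for those that happened to contain NaN during training. The frames `train` and `new_data` are illustrative only; the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'GarageCars': [1.0, 2.0, 2.0, 3.0],
                      'LotArea': [8000.0, 9500.0, 10000.0, 12000.0]})
# GarageCars is complete in train but has a gap in the new data
new_data = pd.DataFrame({'GarageCars': [2.0, np.nan],
                         'LotArea': [9000.0, 11000.0]})

# one median per column, computed on the train set only
train_medians = train.median()

# fillna with a Series fills each column with the value stored under its name
new_data_filled = new_data.fillna(train_medians)
print(new_data_filled)
```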

In [None]:
#  I will replace NAN by the median 
for var in submission_vars:
    submission[var] = submission[var].fillna(X_train[var].median())

### Transformation of Numerical variables 

As most variables were skewed, I will transform them with the Box-Cox transformation.
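For reference, the Box-Cox transformation maps a positive value x to (x**lambda - 1) / lambda when lambda is not zero, and to log(x) when lambda is zero. The sketch below, on simulated data that is not part of the notebook, shows how `scipy.stats.boxcox` estimates lambda and how a lambda learned on one sample can be reused on another through the `lmbda` argument.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.RandomState(0)
# simulated right-skewed, strictly positive data
train_values = rng.gamma(shape=2.0, scale=1.0, size=500)
test_values = rng.gamma(shape=2.0, scale=1.0, size=100)

# with lmbda left unset, boxcox estimates lambda by maximum likelihood
train_bc, lam = stats.boxcox(train_values)

# reuse the train lambda on new data so both sets share one mapping
test_bc = stats.boxcox(test_values, lmbda=lam)

# the same result computed from the formula (valid for lambda != 0)
manual = (test_values ** lam - 1) / lam

skew_before = stats.skew(train_values)
skew_after = stats.skew(train_bc)
print(lam, skew_before, skew_after)
```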

In [None]:
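def boxcox_transformation(var):
    # estimate lambda on the train set only (+1 because Box-Cox needs positive values)
    X_train[var], param = stats.boxcox(X_train[var]+1)
    # reuse the train lambda so that all datasets share the same transformation
    X_test[var] = stats.boxcox(X_test[var]+1, lmbda=param)
    submission[var] = stats.boxcox(submission[var]+1, lmbda=param)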

In [None]:
for var in continuous:
    boxcox_transformation(var)
    
X_train[continuous].head()

In [None]:
# let's  check if the transformation created infinite values
for var in continuous:
    if np.isinf(X_train[var]).sum()>0:
        print(var)

In [None]:
for var in continuous:
    if np.isinf(X_test[var]).sum()>0:
        print(var)

In [None]:
for var in continuous:
    if np.isinf(submission[var]).sum()>0:
        print(var)

In [None]:
# check absence of null values(there should be none)
for var in X_train.columns:
    if X_test[var].isnull().sum()>0:
        print(var, X_test[var].isnull().sum())

In [None]:
# let's make boxplots to visualise outliers in the continuous variables
# and histograms to get an idea of the distribution
# hopefully the transformation yielded variables that look more "Gaussian"


for var in continuous:
    plt.figure(figsize=(15,6))
    plt.subplot(1, 2, 1)
    fig = sns.boxplot(y=X_train[var])
    fig.set_title('')
    fig.set_ylabel(var)
    
    plt.subplot(1, 2, 2)
    fig = sns.histplot(X_train[var].dropna(), kde=True)
    fig.set_ylabel('Number of houses')
    fig.set_xlabel(var)

    plt.show()

The Box-Cox transformation worked for some of the variables but not for others. For example, variables in which a single value is predominant could not be shaped into a Gaussian-looking distribution.

Also, notice that there are still outliers in several of the variables. Ideally, we would like to remove them somehow.

### Normalisation of the target variable: SalePrice

In [None]:
var = 'SalePrice'
y_train = np.log(y_train) 
y_test = np.log(y_test) 

plt.figure(figsize=(15,6))
plt.subplot(1, 2, 1)
fig = sns.boxplot(y=y_train)
fig.set_title('')
fig.set_ylabel(var)

plt.subplot(1, 2, 2)
fig = sns.histplot(y_train, kde=True)
fig.set_ylabel('Number of houses')
fig.set_xlabel(var)

plt.show()

The transformation of SalePrice worked quite well. It now shows a more Gaussian-looking shape.
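Because the models are trained on the logarithm of SalePrice, their predictions come out on the log scale and have to be passed through `np.exp` to get back to prices. A short sketch with made-up prices illustrates the round trip and what an error on the log scale means.

```python
import numpy as np

prices = np.array([100000.0, 200000.0, 755000.0])  # made-up sale prices

log_prices = np.log(prices)      # scale on which the models are trained
recovered = np.exp(log_prices)   # back to the original scale

# an error of 0.1 on the log scale is roughly a 10.5% relative error in price,
# regardless of whether the house is cheap or expensive
relative_error = np.exp(0.1) - 1
print(recovered, relative_error)
```

This is also why the mean squared errors reported further down are small numbers: they are measured on log prices, not on prices.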

### Engineering rare labels in categorical and discrete variables

In [None]:
def rare_imputation(variable):
    # find frequent labels / discrete numbers
    temp = X_train.groupby([variable])[variable].count()/len(X_train)
    frequent_cat = [x for x in temp.loc[temp>0.03].index.values]
    
    X_train[variable] = np.where(X_train[variable].isin(frequent_cat), X_train[variable], 'Rare')
    X_test[variable] = np.where(X_test[variable].isin(frequent_cat), X_test[variable], 'Rare')
    submission[variable] = np.where(submission[variable].isin(frequent_cat), submission[variable], 'Rare')
    
# replace infrequent labels in categorical variables
for var in categorical:
    rare_imputation(var)
    
for var in ['BsmtFullBath', 'BsmtHalfBath', 'GarageCars']:
    submission[var] = submission[var].astype('int')


### Encode categorical variables

I will order the labels according to the target.
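To illustrate the idea on a toy example (made-up prices, with labels borrowed from the Street variable): each label is replaced by its rank once the labels are sorted by the mean target value.

```python
import pandas as pd

toy = pd.DataFrame({
    'Street': ['Grvl', 'Pave', 'Pave', 'Grvl', 'Missing'],
    'SalePrice': [100, 300, 500, 200, 50],
})

# mean target per label, sorted ascending: Missing (50), Grvl (150), Pave (400)
ordered_labels = toy.groupby('Street')['SalePrice'].mean().sort_values().index
mapping = {label: rank for rank, label in enumerate(ordered_labels)}

encoded = toy['Street'].map(mapping)
print(mapping)
```

A label that is absent from the mapping becomes NaN after `map`, so it is worth checking for null values once the encoding is done.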

In [None]:
def encode_categorical_variables(var, target):
        # make label to house price dictionary
        ordered_labels = X_train.groupby([var])[target].mean().sort_values().index
        ordinal_label = {k:i for i, k in enumerate(ordered_labels, 0)} 
        
        # encode variables
        X_train[var] = X_train[var].map(ordinal_label)
        X_test[var] = X_test[var].map(ordinal_label)
        submission[var] = submission[var].map(ordinal_label)

# encode labels in categorical vars
for var in categorical:
    encode_categorical_variables(var, 'SalePrice')


In [None]:
for var in X_train.columns:
    if var!='SalePrice' and submission[var].isnull().sum()>0:
        print(var, submission[var].isnull().sum())

In [None]:
# let's inspect the dataset
X_train.head()

### Feature scaling

In [None]:
training_vars = [var for var in X_train.columns if var not in ['Id', 'SalePrice']]

In [None]:
# fit scaler
scaler = StandardScaler() # create an instance
scaler.fit(X_train[training_vars]) #  fit  the scaler to the train set for later use

The scaler is now ready and can be used with a machine learning algorithm when required. See below.
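An alternative to carrying the scaler around by hand is to bundle it with the model in a scikit-learn pipeline. The sketch below uses simulated data and an arbitrary `alpha`; it only illustrates the mechanics and is not the configuration used in this notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_toy_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy_train = 0.1 * X_toy_train[:, 0] + rng.normal(scale=0.1, size=100)
X_toy_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# the pipeline fits the scaler on the train data only and reuses the
# learned mean and standard deviation whenever predict() is called
model = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
model.fit(X_toy_train, y_toy_train)
pred_toy = model.predict(X_toy_test)

fitted_scaler = model.named_steps['standardscaler']
print(fitted_scaler.mean_)
```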

### Machine Learning algorithm building

#### xgboost

In [None]:
xgb_model = xgb.XGBRegressor()

eval_set = [(X_test[training_vars], y_test)]
xgb_model.fit(X_train[training_vars], y_train, eval_set=eval_set, verbose=False)

pred = xgb_model.predict(X_train[training_vars])
print('xgb train mse: {}'.format(mean_squared_error(y_train, pred)))
pred = xgb_model.predict(X_test[training_vars])
print('xgb test mse: {}'.format(mean_squared_error(y_test, pred)))

#### Support vector regression


In [None]:
SVR_model = SVR()
SVR_model.fit(scaler.transform(X_train[training_vars]), y_train)

pred = SVR_model.predict(scaler.transform(X_train[training_vars]))
print('SVR train mse: {}'.format(mean_squared_error(y_train, pred)))
pred = SVR_model.predict(scaler.transform(X_test[training_vars]))
print('SVR test mse: {}'.format(mean_squared_error(y_test, pred)))

#### Regularised linear regression

In [None]:
lin_model = Lasso(random_state=2909)
lin_model.fit(scaler.transform(X_train[training_vars]), y_train)

pred = lin_model.predict(scaler.transform(X_train[training_vars]))
print('linear train mse: {}'.format(mean_squared_error(y_train, pred)))
pred = lin_model.predict(scaler.transform(X_test[training_vars]))
print('linear test mse: {}'.format(mean_squared_error(y_test, pred)))

### Submission to Kaggle

In [None]:
pred_ls = []
pred_ls.append(pd.Series(xgb_model.predict(submission[training_vars])))

pred = SVR_model.predict(scaler.transform(submission[training_vars]))
pred_ls.append(pd.Series(pred))

pred = lin_model.predict(scaler.transform(submission[training_vars]))
pred_ls.append(pd.Series(pred))

final_pred = np.exp(pd.concat(pred_ls, axis=1).mean(axis=1))

In [None]:
temp = pd.concat([submission.Id, final_pred], axis=1)
temp.columns = ['Id', 'SalePrice']
temp.head()

### Conclusion

This solution is not among the best-ranking solutions. There is a lot more that can be done to improve the quality of the variables before using them in machine learning models. For example, instead of applying the Box-Cox transformation to all of them, we could apply it only to those variables that benefit from it, and use other techniques, such as discretisation, on the remaining ones. Discretisation also takes care of outliers, which do affect the performance of linear models.

Variable selection is also an essential step in the machine learning pipeline, which I have not covered in this notebook.
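As a pointer, equal-frequency discretisation could be sketched as follows: the bin boundaries are learned on the train set with `pd.qcut` and then applied to new data with `pd.cut`. The values are made up, and opening the outer edges to infinity is one possible design choice for absorbing extreme values into the end bins.

```python
import numpy as np
import pandas as pd

train_col = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
test_col = pd.Series([0.5, 4.5, 1000.0])  # 1000 is an extreme value

# learn quartile boundaries on the train set
_, bins = pd.qcut(train_col, q=4, retbins=True, labels=False)

# open the outer edges so values outside the train range fall in the end bins
edges = np.concatenate(([-np.inf], bins[1:-1], [np.inf]))

test_binned = pd.cut(test_col, bins=edges, labels=False)
print(test_binned.tolist())
```

The extreme value simply lands in the top bin, which is how discretisation limits the influence of outliers.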

I hope you get a flavour of how to approach (or at least how I would approach) a machine learning problem and that you enjoyed the notebook.

Thanks for reading!
