# Homework nr. 3 - feature transformation & selection (deadline 13/12/2018)

In short, the main task is to experiment with transformations and feature selection methods in order to obtain the best results for a linear regression model predicting house sale prices.
  
> The instructions are not given in detail: it is up to you to come up with ideas on how to fulfill the particular tasks as best you can. ;)

## What are you supposed to do

Your aim is to optimize the _RMSLE_ (see the note below) of the linear regression estimator (our prediction model) of the observed sale prices.

### Instructions:

  1. Download the dataset from the course pages (hw3_data.csv, hw3_data_description.txt). It corresponds to [this Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).
  2. Split the dataset into train & test parts exactly as we did in the tutorial.
  3. Transform the features properly (don't forget the target variable).
  4. Try to find the best subset of features.
  5. Compare your results with the [Kaggle leaderboard](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard). You should be able to reach approximately the top 20% there.
  
Give comments on each step of your solution, with short explanations of your choices.

  
**Note**: _RMSLE_ is the root-mean-squared error (RMSE) between the logarithm of the predicted values and the logarithm of the observed sale prices.
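
The definition in the note can be written as a small helper. This is a minimal sketch: the function name `rmsle` and the example prices are illustrative assumptions, and a plain natural logarithm is used, as the note describes.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); both must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true <= 0) or np.any(y_pred <= 0):
        raise ValueError("RMSLE with a plain logarithm needs strictly positive values")
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices, for illustration only
observed = [100000.0, 200000.0]
predicted = [110000.0, 180000.0]
score = rmsle(observed, predicted)
print("RMSLE:", score)
```

Because the error is measured on logarithms, it penalizes relative rather than absolute deviations: being 10% off on a cheap house costs as much as being 10% off on an expensive one.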


## Comments

  * Please follow the instructions from https://courses.fit.cvut.cz/MI-PDD/homeworks/index.html.
  * If the reviewing teacher is not satisfied, they can give you another chance to rework your homework and to obtain more points.

In [1]:
import numpy as np
import pandas as pd

from scipy import stats, optimize

from sklearn import model_selection, linear_model, metrics, preprocessing, feature_selection

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
df = pd.read_csv('hw3_data.csv')

### Feature transformations

First, we focus on transformations that can help to increase the performance of prediction models.

In [3]:
# Converting categorical features into indicators
df = pd.get_dummies(df)
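
One detail of `pd.get_dummies` worth keeping in mind: by default a missing category produces a row of all zeros rather than a NaN, so later imputation never sees it. The sketch below uses a toy frame with made-up values (not the homework data) to show the default behaviour next to `dummy_na=True`.

```python
import pandas as pd

# Toy frame (hypothetical values) with one missing category
toy = pd.DataFrame({"BsmtQual": ["Ex", None, "Gd"]})

plain = pd.get_dummies(toy)                    # the missing row becomes all zeros
with_na = pd.get_dummies(toy, dummy_na=True)   # adds an explicit indicator for NaN

print(plain)
print(with_na)
```

Whether an explicit missing-value indicator helps the regression is something to check on the validation set; it is not assumed here.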

In [4]:
# Converting all non-indicator values to float64
num_cols = df.select_dtypes(['float16', 'float64', 'int64']).columns
df[num_cols] = df[num_cols].astype('float64')

In [5]:
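# Replacing NaNs with the mode if the feature is a string, otherwise with the median.
# Note: after pd.get_dummies above no object columns remain, so only the median branch runs.
for col in df:
    if df[col].dtype == 'object':
        # mode() returns a Series, so take its first entry as a scalar fill value
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    else:
        df[col] = df[col].fillna(df[col].median())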
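
When filling with the mode, note that `Series.mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument on the index. A small sketch on a toy Series (made-up values) illustrates the difference between passing the Series and passing its first entry.

```python
import pandas as pd

s = pd.Series(["a", None, "a", None])

# mode() returns a Series indexed 0..k-1, so fillna aligns on the index
# and only fills positions whose label appears in that Series.
aligned = s.fillna(s.mode())

# Taking the first mode as a scalar fills every missing value.
scalar = s.fillna(s.mode().iloc[0])

print(aligned.isna().sum(), scalar.isna().sum())
```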

In [6]:
# Finding the features most strongly correlated with SalePrice
corr = df.corr().SalePrice
corr_field = corr.sort_values(ascending=False).head(11)
display(corr_field)

# Keep the top five features, skipping SalePrice itself (always first, with correlation 1)
cols = corr_field.index[1:6].values
print("The features that have the most correlation with Sale Price are the following : ", cols)

SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
BsmtQual_Ex     0.553105
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
Name: SalePrice, dtype: float64

The features that have the most correlation with Sale Price are the following :  ['OverallQual' 'GrLivArea' 'GarageCars' 'GarageArea' 'TotalBsmtSF']
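
Sorting signed correlations only surfaces positively correlated features; a strongly negative correlation is just as informative for a linear model. A possible variant ranks by absolute value instead. The helper name `top_correlated` and the synthetic data below are assumptions for illustration, not part of the homework dataset.

```python
import numpy as np
import pandas as pd

def top_correlated(frame, target, k=5):
    """Return the k column names with the largest |Pearson r| against `target`."""
    corr = frame.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Synthetic data (not the homework dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "pos": x + 0.1 * rng.normal(size=200),    # strongly positive
    "neg": -x + 0.1 * rng.normal(size=200),   # strongly negative
    "noise": rng.normal(size=200),            # unrelated
    "target": x,
})
best = top_correlated(demo, "target", k=2)
print(best)
```

To avoid leaking information from the held-out rows, such a ranking would ideally be computed on the training part only, after the split.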


In [7]:
# Checking that there is no missing data before train and test split
df.columns[df.isnull().any()]

Index([], dtype='object')

### Split the dataset into train & test

In [8]:
dt, dv = model_selection.train_test_split(df, test_size=0.25, random_state=17)
dt = dt.copy()
dv = dv.copy()
print('Train: ', len(dt), '; Validation: ', len(dv))

Train:  1095 ; Validation:  365


In [9]:
def linreg(train, validate, plot = False, train_error = True):
    # Prepare the data: features X and target y for both parts
    X = train.drop(['SalePrice'], axis = 1, errors = 'ignore')
    y = train.SalePrice
    Xv = validate.drop(['SalePrice'], axis = 1, errors = 'ignore')
    yv = validate.SalePrice
    
    # Train the linear regression
    clf = linear_model.LinearRegression()
    clf.fit(X, y) 
    
    # Print RMSLE: the RMSE between log(observed) and log(predicted).
    # np.log returns NaN for non-positive predictions, which would make the score NaN.
    print('Linear regression root mean squared validation error:', 
          np.sqrt(metrics.mean_squared_error(np.log(yv), np.log(clf.predict(Xv)))))
    if train_error:
        print('Linear regression root mean squared train error:', 
              np.sqrt(metrics.mean_squared_error(np.log(y), np.log(clf.predict(X)))))
    
    # Joint plot of observed vs. predicted prices on the validation part
    if plot:
        sns.jointplot(x = yv, y = clf.predict(Xv))

In [10]:
linreg(dt,dv)

Linear regression root mean squared validation error: 0.17438604800458019
Linear regression root mean squared train error: 0.1071406902571685


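The model reaches an RMSLE of 0.1071 on the training set, a value that would fall within the top 20% of the [Kaggle Leaderboard](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard). The validation RMSLE is 0.1744, however, and that held-out figure is the one to compare against the leaderboard; the gap between the two suggests overfitting.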
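
Step 3 of the instructions also asks for a transformation of the target variable. One common option, sketched here on synthetic data (not the homework dataset, and not a result from this notebook), is to fit the regression on the logarithm of the price and map predictions back with `exp`. The model then minimizes squared log error directly, and its predictions are always positive.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (not the homework dataset): price grows multiplicatively with X
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = np.exp(11.0 + X @ np.array([1.0, 0.5]) + 0.05 * rng.normal(size=300))

X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

# Fit on log(price); predictions are mapped back with exp and stay positive
model = linear_model.LinearRegression().fit(X_train, np.log(y_train))
pred_val = np.exp(model.predict(X_val))

rmsle_val = np.sqrt(np.mean((np.log(pred_val) - np.log(y_val)) ** 2))
print("validation RMSLE:", rmsle_val)
```

scikit-learn's `TransformedTargetRegressor` (in `sklearn.compose`) wraps the same idea if a single estimator object is preferred.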