# Homework nr. 3 - feature transformation & selection (deadline 22/11/2018)

In short, the main task is to experiment with transformations and feature selection methods in order to obtain the best results for a linear regression model predicting house sale prices.
  
> The instructions are not given in detail: it is up to you to come up with ideas on how to fulfill the particular tasks as best you can. ;)

## What are you supposed to do

Your aim is to optimize the _RMSLE_ (see the note below) of the linear regression estimator (=our prediction model) of the observed sale prices.

### Instructions:

  1. Download the dataset from the course pages (hw3_data.csv, hw3_data_description.txt). It corresponds to [this Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).
  2. Split the dataset into train and test parts exactly as we did in the tutorial.
  3. Transform the features properly (don't forget the target variable).
  4. Try to find the best subset of features.
  5. Compare your results with the [Kaggle leaderboard](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/leaderboard). You should be able to reach approximately the top 20% there.
  
Give comments on each step of your solution, with short explanations of your choices.

  
**Note**: _RMSLE_ is the root-mean-squared error (RMSE) between the logarithm of the predicted values and the logarithm of the observed sale prices.
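
As a minimal sketch of the note above, the metric can be written in a few lines of plain Python. The function name `rmsle` and the sample prices are illustrative assumptions, not part of the assignment. The solution cells in this notebook apply `np.log1p` to the target rather than a plain logarithm, which makes almost no difference at house-price magnitudes.

```python
import math

def rmsle(predicted, observed):
    """RMSE between log(predicted) and log(observed); all values must be positive."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log(p) - math.log(o)) ** 2 for p, o in zip(predicted, observed)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical prices, for illustration only
predicted_prices = [200_000.0, 150_000.0, 320_000.0]
observed_prices = [210_000.0, 140_000.0, 300_000.0]
print(rmsle(predicted_prices, observed_prices))
```

Because the error is measured on logarithms, a prediction that is off by 10% costs the same whether the house sells for 100,000 or 1,000,000.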


## Comments

  * Please follow the instructions from https://courses.fit.cvut.cz/MI-PDD/homeworks/index.html.
  * If the reviewing teacher is not satisfied, they can give you another chance to rework your homework and obtain more points.

In [2]:
import numpy as np
import pandas as pd
from scipy import stats, optimize
from sklearn import model_selection, linear_model, metrics, preprocessing, feature_selection
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
import math

%matplotlib inline

df = pd.read_csv('hw3_data.csv')

In [3]:
# CLEANING
# Convert all object columns to the categorical dtype
df[df.select_dtypes(include=['object']).columns] = df.select_dtypes(include=['object']).apply(pd.Series.astype, dtype='category')
# Fill NaN with 0 in the float64 columns
float_cols = df.select_dtypes(include=['float64']).columns
df.loc[:, float_cols] = df.loc[:, float_cols].fillna(0)

# One-hot encode the categorical columns
df = pd.get_dummies(df)

# Cast the remaining numeric columns to float64
numeric_cols = df.select_dtypes(['float16', 'float64', 'int64']).columns
df[numeric_cols] = df[numeric_cols].astype('float64')

# Drop constant columns (min equals max)
df = df[df.columns[df.min() != df.max()]]


In [4]:
# Total area: basement plus first and second floor
df['TotalArea'] = df['TotalBsmtSF'] + df['1stFlrSF'] + df['2ndFlrSF']

# For every area/square-footage column, add a presence indicator and a square-root transform
for column in df.filter(regex='Area|SF', axis=1).columns:
    df['Has' + column] = (df[column] > 0).astype('uint8')
    df['Sqrt' + column] = np.sqrt(df[column])

In [5]:
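# Standardize every float64 feature except the target (zero mean, unit variance).
# Note: the scaler is fit on the full dataset, before the train/validation split.
columns = df.select_dtypes(include=['float64']).columns
columns = columns.drop('SalePrice')
scaler = preprocessing.StandardScaler()
scaler.fit(df[columns])
df[columns] = scaler.transform(df[columns])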

In [6]:
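def get_rmsle(train, train_price, validate, validate_price):
    # Fit a linear regression on the training part and return the RMSE on the validation part.
    # The prices are expected to be log-transformed already, so this RMSE is the RMSLE.
    clf = linear_model.LinearRegression()
    clf.fit(train, train_price)
    res = clf.predict(validate)
    return math.sqrt(metrics.mean_squared_error(validate_price, res))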

In [7]:
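# Removing features: drop Id and every binary (uint8) column for which Welch's t-test
# on SalePrice between the 0 and 1 groups gives a p-value above 0.6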
columns_to_remove = ['Id']
ttest_pvals = df\
    .drop(columns_to_remove, axis = 1, errors = 'ignore')\
    .select_dtypes(include = ['uint8']).columns\
    .to_series()\
    .apply(lambda x: stats.ttest_ind(df.SalePrice[df[x] == 0], df.SalePrice[df[x] == 1], equal_var = False).pvalue)

columns_to_remove = list(set(columns_to_remove + list(ttest_pvals[ttest_pvals > 0.6].index)))
df.drop(columns_to_remove, axis = 1, errors = 'ignore', inplace=True)


In [8]:
# Log-transform the target so that the RMSE computed on it is the RMSLE
df["SalePrice"] = np.log1p(df["SalePrice"])

In [9]:
# Splitting into train and validation parts
dt, dv = model_selection.train_test_split(df, test_size=0.25, random_state=17)
dt = dt.copy()
dv = dv.copy()
print('Train: ', len(dt), '; Validation: ', len(dv))

Train:  1095 ; Validation:  365


In [10]:
dt_without = dt.drop('SalePrice', axis=1)
dv_without = dv.drop('SalePrice', axis=1)

# I tried many models; Lasso gave the best results.
clf = linear_model.Lasso(alpha = 0.001)
sfm = feature_selection.SelectFromModel(clf)
sfm.fit(dt_without, dt.SalePrice)



SelectFromModel(estimator=Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
        max_features=None, norm_order=1, prefit=False, threshold=None)

In [11]:
feature_idx = sfm.get_support()
feature_name = dt_without.columns[feature_idx]

#print(f'Selected features: \n{list(feature_name)}')
print(f'Number of selected features: {len(feature_name)}')

Number of selected features: 84


In [13]:
X = sfm.transform(dt_without)
y = dt.SalePrice
Xv = sfm.transform(dv_without)
yv = dv.SalePrice

clf = linear_model.LinearRegression()
clf.fit(X, y) 
    
rmsle = np.sqrt(metrics.mean_squared_error(clf.predict(Xv),yv))
print('RMSLE:', rmsle)

RMSLE: 0.1164806015259141


I tried several approaches to this homework; using a **Lasso model** with **sklearn.feature_selection.SelectFromModel** gave the best result. It selected 84 features, and the final RMSLE is 0.1165.
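
To illustrate why this combination works as a feature selector, here is a minimal sketch on synthetic data. The data, the `alpha` value, and all variable names are assumptions chosen for the illustration, not the homework dataset or its tuned settings. As far as I know, `SelectFromModel` uses a very small default threshold for L1-penalized estimators, so it effectively keeps the features whose Lasso coefficient is non-zero.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (an assumption for illustration): 10 features, only the first 3 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# The L1 penalty drives coefficients of uninformative features towards exactly zero;
# SelectFromModel keeps the features whose coefficients stay non-negligible
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X, y)
mask = selector.get_support()
X_selected = selector.transform(X)

# Refit an unpenalized linear regression on the selected columns only
model = LinearRegression().fit(X_selected, y)
print("kept columns:", np.flatnonzero(mask))
print("coefficients:", model.coef_)
```

The same two-stage pattern appears in cells 10 to 13: Lasso is used only to pick the columns, and the reported RMSLE comes from a plain `LinearRegression` fitted on those columns.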
