# Homework 3 - data transformation & dimensionality reduction (deadline 19. 12. 2021, 23:59)

In short, the main task is to play with transformations and dimensionality reduction to obtain the best results for the linear regression model predicting house sale prices.
  
> The instructions are not given in detail: It is up to you to come up with ideas on how to fulfill the particular tasks as best you can!

However, we **strongly recommend and require** the following:
* Follow the assignment step by step. Number each step.
* Properly comment all your steps. Comments are evaluated for 2 points of the total together with the final presentation of the solution. However, it is not desirable to write novels! 
* Do not leave the task to the last minute.
* Hand in a notebook that has already been run (i.e. do not delete outputs before handing in).

## What are you supposed to do:

Your aim is to optimize the _RMSLE_ (see the note below) of the linear regression estimator (= our prediction model) of the observed sale prices.

**Just copied code from tutorial 3 and 5 will not be accepted.**

### Instructions:

  1. Download the dataset from the [course pages](https://courses.fit.cvut.cz/MI-PDD/homeworks/index.html) (data.csv, data_description.txt). It corresponds to [this Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). 
  1. Transform features appropriately and prepare new ones - focus on the increase in the performance of the model (possibly in combination with further steps). Split the dataset into a train and test part exactly as we did in the tutorials. Use the test part for evaluation of the influence of further steps. _(3 points)_
  1. Try to find some suitable subset of features - first without the use of PCA. _(4 points)_
  1. Use PCA (principal component analysis) to reduce the dimensionality. Discuss the influence of the number of principal components. _(4 points)_
  1. Compare the results of previous steps on the test part of the dataset. _(3 points)_
  
Give comments (!) on each step of your solution, with short explanations of your choices.

All your steps and following code **have to be commented!** Comments are evaluated for _2 points_ together with the final presentation of the solution.

**If you do all this properly, you will obtain 16 points.**

**Note**: _RMSLE_ is a Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale prices.


## Comments

  * Please follow the technical instructions from https://courses.fit.cvut.cz/NI-PDD/homeworks/index.html.
  * If the reviewing teacher is not satisfied, she can (!) give you another chance to rework your homework and to obtain more points. However, this is not a given, so do your best! :)
  * English is not compulsory.

In [None]:
import numpy as np
import pandas as pd
import copy
from scipy import stats, optimize

from sklearn import model_selection, linear_model, metrics, preprocessing, feature_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [None]:
df = df.drop(['Id'], axis=1)

In [None]:
print(list(df.isnull().sum()))

[0, 0, 259, 0, 0, 1369, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 0, 0, 37, 37, 38, 37, 0, 38, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 690, 81, 81, 81, 0, 0, 81, 81, 0, 0, 0, 0, 0, 0, 0, 1453, 1179, 1406, 0, 0, 0, 0, 0, 0]


In [None]:
# on the basis of number of null values, columns Alley, FireplaceQu, MiscFeature,Fence and PoolQC will be removed
df = df.drop(['Alley', 'FireplaceQu', 'MiscFeature','Fence', 'PoolQC'], axis=1)

In [None]:
# GOAL: optimize the RMSLE of the linear regression estimator (= our prediction model) of the observed sale prices.
# Convert object data to categorical
df[df.select_dtypes(include=['object']).columns] = df.select_dtypes(include=['object']).apply(pd.Series.astype, dtype='category')

In [None]:
# transform features   
# 1. FEATURE SELECTION - variation approach
df.var() < 0.1
# KitchenAbvGr, BsmtHalfBath, are True -> remove
df = df.drop(['KitchenAbvGr', 'BsmtHalfBath'], axis=1)
# 2. one-hot encoding - convert categorical data to indicators
df = pd.get_dummies(df)
df.dtypes.value_counts()
df[df.select_dtypes(['uint8', 'int64']).columns] = df[df.select_dtypes(['uint8', 'int64']).columns].astype('float64')

In [None]:
df = df.dropna()

In [None]:
np.any(np.isnan(df))

False

In [None]:
# split + scale data
xdata = df.drop(['SalePrice'], axis=1)
ydata = df['SalePrice']
x_train, x_test, y_train, y_test = train_test_split(xdata, ydata, test_size = 0.25, random_state=42)
scaler =  StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)

x_train2, x_test2, y_train2, y_test2 = train_test_split(xdata, ydata, test_size = 0.25, random_state=42)
scaler =  StandardScaler()
scaler.fit(x_train2)
x_train2 = scaler.transform(x_train2)

In [None]:
# subset selection without PCA - L^1 regularisation
subset_lasso = linear_model.Lasso()
select = feature_selection.SelectFromModel(subset_lasso)
select.fit(x_train, y_train)

model = LinearRegression()
model.fit(x_train, y_train)
prediction = model.predict(x_test)
RMSE = np.sqrt(mean_squared_error(prediction, y_test))

  coef_, l1_reg, l2_reg, X, y, max_iter, tol, rng, random, positive
  f"X has feature names, but {self.__class__.__name__} was fitted without"


In [None]:
# use PCA to reduce dimensionality
pca = PCA(0.95) # choose number of principal components so that 95% of variance retains
pca.fit(x_train2)
x_train_pca = pca.transform(x_train2)
x_test_pca = pca.transform(x_test2)

model_pca = LinearRegression()
model_pca.fit(x_train_pca, y_train2)

prediction_pca = model_pca.predict(x_test_pca)
RMSE_pca = np.sqrt(mean_squared_error(prediction_pca, y_test2))


  f"X has feature names, but {self.__class__.__name__} was fitted without"


In [None]:
# PCA with n_components = 99%
pca = PCA(0.99)
pca.fit(x_train2)
x_train_pca = pca.transform(x_train2)
x_test_pca = pca.transform(x_test2)

model_pca = LinearRegression()
model_pca.fit(x_train_pca, y_train2)

prediction_pca = model_pca.predict(x_test_pca)
RMSE_pca_2 = np.sqrt(mean_squared_error(prediction_pca, y_test2))

  f"X has feature names, but {self.__class__.__name__} was fitted without"


In [None]:
# compare results on test data
print("RMSE: " + str(RMSE))
print("RMSE with PCA, n_components=0.95: " + str(RMSE_pca))
print("RMSE with PCA, n_components=0.99: " + str(RMSE_pca_2))

RMSE: 7.04524918206456e+17
RMSE with PCA, n_components=0.95: 63481046.85547718
RMSE with PCA, n_components=0.99: 117148994.89390866


RMSE při použití PCA vyšlo znatelně lépe, jeho použití se tedy jeví jako vhodné, přičemž při varianci 95% je chyba nižší, než při 99%.

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=74cb6787-7726-4c91-8889-edb5587bf483' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>