## Ridge Regression

As we saw in the previous notebook, simple linear regression has many problems in practical use: it only works well when the data is close to ideal.

In this notebook we will apply L2 regularization to linear regression (ridge regression). Fortunately, regularized models can handle collinear data.
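
To see why the L2 penalty helps with collinear features, here is a minimal sketch on synthetic data (not the housing data). The helper `ridge_closed_form` and the variables `X_demo` and `y_demo` are illustrative names, and the intercept is left out for brevity. Ridge adds `alpha` to the diagonal of `X.T @ X`, which keeps the linear system well conditioned even when two columns are almost identical.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y (no intercept, for brevity)."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)    # almost an exact copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + 0.1 * rng.normal(size=200)

# Condition number of the matrix that has to be inverted
cond_ols = np.linalg.cond(X_demo.T @ X_demo)
cond_ridge = np.linalg.cond(X_demo.T @ X_demo + 1.0 * np.eye(2))

w_ridge = ridge_closed_form(X_demo, y_demo, alpha=1.0)

print("condition number without penalty:", cond_ols)
print("condition number with alpha=1:   ", cond_ridge)
print("ridge weights:", w_ridge)
```

With the penalty, the weight of the true signal is split almost evenly between the two duplicated columns instead of blowing up in opposite directions.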

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib as mpl
import seaborn as sns
from IPython.core.display import HTML
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, root_mean_squared_error
from sklearn.linear_model import Ridge, Lasso

In [2]:
data = pd.read_csv("data/train_without_nans.csv")
data_sub = pd.read_csv("data/test_without_nans.csv")

encoded = pd.read_csv("data/encoded_train.csv")
encoded_sub = pd.read_csv("data/encoded_test.csv")

This time we don't need to remove collinear features or re-encode our data. All we need to do is remove
outliers from the data and apply the models with further tuning.

In [3]:
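def make_submission(model, X, y, sub_X, features2scale, path):
    '''
    Fit the model on the whole training set and write a submission
    for the encoded_test.csv data.

    The variable y is the target (SalePrice) on its original scale;
    it is log-transformed inside the function.
    '''
    # Work on copies so the caller's DataFrames are not scaled in place
    X = X.copy()
    sub_X = sub_X.copy()

    scaler = StandardScaler()
    X[features2scale] = scaler.fit_transform(X[features2scale])
    sub_X[features2scale] = scaler.transform(sub_X[features2scale])

    y_real = y.copy()
    y = np.log(y)

    trained_model = model.fit(X, y)

    # Predictions are on the log scale, so map them back with exp
    train_predict = trained_model.predict(X)
    sub_predict = pd.Series(np.exp(trained_model.predict(sub_X)))

    print("WHOLE TRAIN RMSE: {}".format(root_mean_squared_error(y_real, np.exp(train_predict))))

    # Ids are taken from the global data_sub (test_without_nans.csv)
    sub_df = pd.DataFrame()
    sub_df["Id"] = data_sub["Id"]
    sub_df["SalePrice"] = sub_predict

    sub_df.to_csv(path, index=False)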

## Outlier filtering 

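First of all, we will again use the "3-sigma" rule. We may try other filtering methods in the future. Note that the code below sets `std_cnt = 5`, so the threshold actually applied is five standard deviations from the mean.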

### 3-sigma rule

First, one important point. Although we treated all numerical features from the original dataset (except MSSubClass) as continuous, some of them are really categorical features stored with label encoding, or discrete counts and years for which this kind of filtering makes no sense. These features are:
* **OverallQual**
* **OverallCond**
* **BsmtFullBath**, **BsmtHalfBath**, **FullBath**, **HalfBath**, **BedroomAbvGr** &ndash; these are not categorical, but filtering them is a bad idea because every value they take looks normal (an object with, say, 500 full bathrooms would be abnormal, but such cases were already eliminated during our EDA)
* **YearBuilt** &ndash; the same situation as above
* **YearRemodAdd** &ndash; the same
* **KitchenAbvGr**, **TotRmsAbvGrd**, **Fireplaces**, **GarageYrBlt**, **GarageCars** &ndash; likewise
* **MoSold**, **YrSold**

So...

In [4]:
std_cnt = 5

skip_features = ["Id", "MSSubClass", "OverallQual", "OverallCond", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr",
                 "YearBuilt", "YearRemodAdd", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageYrBlt", "GarageCars",
                 "MoSold", "YrSold"]

continuous_features = data.select_dtypes(exclude="object").drop(skip_features, axis=1)

filtered = encoded.copy()

filtered["is_filtered"] = pd.Series(np.zeros(filtered.shape[0]), dtype="int64")

for column in continuous_features.columns.values:
    prev_filtered = filtered["is_filtered"].sum()
    
    col_values = filtered[column]
    mean = col_values.mean()
    std = col_values.std()
    filtered.loc[(mean - std_cnt*std > col_values) | (col_values > mean + std_cnt*std), "is_filtered"] = 1

    print("feature {} ---> filtered {}".format(column, filtered["is_filtered"].sum() - prev_filtered))

print("WHOLE FILTERED CNT", filtered["is_filtered"].sum())

filtered = filtered[filtered["is_filtered"] == 0]
filtered = filtered.drop("is_filtered", axis=1)

feature LotFrontage ---> filtered 2
feature LotArea ---> filtered 5
feature MasVnrArea ---> filtered 7
feature BsmtFinSF1 ---> filtered 0
feature BsmtFinSF2 ---> filtered 14
feature BsmtUnfSF ---> filtered 0
feature TotalBsmtSF ---> filtered 0
feature 1stFlrSF ---> filtered 2
feature 2ndFlrSF ---> filtered 0
feature LowQualFinSF ---> filtered 16
feature GrLivArea ---> filtered 1
feature GarageArea ---> filtered 0
feature WoodDeckSF ---> filtered 3
feature OpenPorchSF ---> filtered 3
feature EnclosedPorch ---> filtered 2
feature 3SsnPorch ---> filtered 18
feature ScreenPorch ---> filtered 5
feature PoolArea ---> filtered 3
feature MiscVal ---> filtered 3
feature SalePrice ---> filtered 2
WHOLE FILTERED CNT 86
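
The filtering loop above can also be sketched as a small reusable function. This is only an illustration on a toy DataFrame: `sigma_filter`, `toy` and its columns are hypothetical names, not part of the notebook. As in the loop, the mean and standard deviation are computed once on the unfiltered data, and a row is dropped if any of the chosen columns falls outside the band.

```python
import pandas as pd

def sigma_filter(df, columns, n_std=3.0):
    """Keep rows whose values in `columns` all lie within
    n_std standard deviations of the column mean."""
    values = df[columns]
    deviation = (values - values.mean()).abs()
    threshold = n_std * values.std()          # one threshold per column
    keep = deviation.le(threshold, axis="columns").all(axis=1)
    return df[keep]

# Toy data: 19 ordinary rows and one extreme value in "area"
toy = pd.DataFrame({"area": [50.0] * 19 + [1000.0], "rooms": [3] * 20})
kept = sigma_filter(toy, ["area"], n_std=3.0)

print(len(toy), "->", len(kept))
```

Passing `n_std=5` would reproduce the looser threshold used by `std_cnt = 5` above.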


## Splitting and Scaling 

Now we need to split and scale our filtered data.

In [5]:
filtered.columns

Index(['LotFrontage', 'LotArea', 'Street', 'Utilities', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       ...
       'MSSubClass_75', 'MSSubClass_80', 'MSSubClass_85', 'MSSubClass_90',
       'MSSubClass_120', 'MSSubClass_150', 'MSSubClass_160', 'MSSubClass_180',
       'MSSubClass_190', 'SalePrice'],
      dtype='object', length=314)

In [6]:
X = filtered.drop(["SalePrice"], axis=1).copy()
y = filtered["SalePrice"].copy()

sub_X = encoded_sub.copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

features2scale = data.select_dtypes(exclude="object").drop(["Id", "MSSubClass", "SalePrice"], axis=1).columns.values

scaler = StandardScaler()

X_train[features2scale] = scaler.fit_transform(X_train[features2scale])
X_test[features2scale] = scaler.transform(X_test[features2scale])

real_y_train = y_train.copy()
real_y_test = y_test.copy()

y_train = np.log(y_train)
y_test = np.log(y_test)

X_train.shape, X_test.shape

((1098, 313), (275, 313))
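
Two details of the cell above are worth spelling out: the scaler is fitted on the training part only and then reused for the test part, and the target is modelled on the log scale and mapped back with `exp`. A minimal NumPy sketch of both steps on made-up numbers (the arrays `train`, `test` and `prices` are assumptions for illustration, not the housing data):

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(loc=100.0, scale=20.0, size=(80, 2))
test = rng.normal(loc=100.0, scale=20.0, size=(20, 2))

# Statistics come from the training part only
mu = train.mean(axis=0)
sigma = train.std(axis=0)    # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma    # test data reuses the train statistics

# Log transform of the target and its inverse
prices = np.array([100_000.0, 250_000.0])
log_prices = np.log(prices)
restored = np.exp(log_prices)

print("train mean/std after scaling:", train_scaled.mean(axis=0), train_scaled.std(axis=0))
print("test mean after scaling:     ", test_scaled.mean(axis=0))
```

The scaled test columns are generally not exactly zero-mean, which is expected: no information from the test part is used when estimating `mu` and `sigma`.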

## Applying the models

Let's take a look at how ridge and lasso perform on our filtered data.

In [7]:
ridge_reg = Ridge().fit(X_train, y_train)

train_pred = ridge_reg.predict(X_train)
test_pred = ridge_reg.predict(X_test)

print("TRAIN LOG RMSE:", root_mean_squared_error(y_train, train_pred))
print("TEST LOG RMSE:", root_mean_squared_error(y_test, test_pred))
print("." * 35)
print("TRAIN RMSE", root_mean_squared_error(real_y_train, np.exp(train_pred)))
print("TEST RMSE", root_mean_squared_error(real_y_test, np.exp(test_pred)))

make_submission(Ridge(), X, y, sub_X, features2scale, "submissions/ridge/ridge_3_sigma_filter.csv")

TRAIN LOG RMSE: 0.08543172708848658
TEST LOG RMSE: 0.11352538730277134
...................................
TRAIN RMSE 15181.930112009515
TEST RMSE 17940.373622376243
WHOLE TRAIN RMSE: 15174.021798496298


With 3-sigma rule filtering we get a score of **0.15896**, which is worse than our baseline ridge regression model.
Perhaps we have thrown away too many objects, but that is only one of many possible causes.

### Checking coefficients

In [128]:
sorted_coef_idx = np.argsort(-np.abs(ridge_reg.coef_))
coef = ridge_reg.coef_[sorted_coef_idx]
for i in range(0, 5):
    print("{:.<025} {:< }".format(X_train.columns[sorted_coef_idx[i]], coef[i]))

MSZoning_C (all)......... -0.27157903728178956
Functional_Maj2.......... -0.17205679113987105
Heating_Grav............. -0.12587106760483902
Neighborhood_MeadowV..... -0.11951078198519736
Neighborhood_StoneBr.....  0.11779848428498227


It's strange that the most influential features are just individual categories. The model doesn't seem to pay attention to such important features as
the area of the lot or the overall quality of the house. Perhaps our alpha parameter is too high, so the important features cannot show themselves.
Let's try some other values of alpha.

In [119]:
ridge_reg = Ridge(alpha=0.00001).fit(X_train, y_train)

train_pred = ridge_reg.predict(X_train)
test_pred = ridge_reg.predict(X_test)

print("TRAIN LOG RMSE:", root_mean_squared_error(y_train, train_pred))
print("TEST LOG RMSE:", root_mean_squared_error(y_test, test_pred))
print("." * 35)
print("TRAIN RMSE", root_mean_squared_error(real_y_train, np.exp(train_pred)))
print("TEST RMSE", root_mean_squared_error(real_y_test, np.exp(test_pred)))

sorted_coef_idx = np.argsort(-np.abs(ridge_reg.coef_))
coef = ridge_reg.coef_[sorted_coef_idx]
for i in range(0, 20):
    print("{:.<025} {:< }".format(X_train.columns[sorted_coef_idx[i]], coef[i]))

TRAIN LOG RMSE: 0.08047429535023858
TEST LOG RMSE: 0.13769599725873466
...................................
TRAIN RMSE 13555.564874746198
TEST RMSE 19269.541243830092
MSSubClass_45............ -0.5152806040355594
HouseStyle_1.5Unf........  0.4640953631162831
MSZoning_C (all)......... -0.445476939288213
Functional_Maj2.......... -0.37870988970050484
MSSubClass_75............ -0.3586691365095825
SaleCondition_Partial.... -0.3202308488436226
HouseStyle_2.5Unf........  0.2894027810667336
SaleType_New.............  0.27447329815794813
Heating_Grav............. -0.2364863222390627
MSSubClass_40............  0.2240606801355614
RoofStyle_Shed...........  0.22406068013390296
SaleCondition_AdjLand....  0.2006771062459512
Foundation_Stone.........  0.19918149802794438
HouseStyle_SFoyer........ -0.19208600593712918
Exterior2nd_Brk Cmn......  0.17977878723579507
GarageType_2Types........ -0.1783804745630281
SaleType_ConLI........... -0.17697750244656038
MSZoning_FV..............  0.17651657397677734
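
The effect seen above, larger coefficients at a smaller alpha, can be checked in isolation. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `true_w` and the alpha grid are assumptions chosen for illustration. It fits `Ridge` for several values of alpha and records the Euclidean norm of the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y_demo = X_demo @ true_w + 0.1 * rng.normal(size=100)

alphas = [1e-5, 1e-2, 1.0, 100.0, 10_000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # A stronger penalty pulls all coefficients toward zero
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:<10g} ||coef||={coef_norms[-1]:.4f}")
```

The norm shrinks as alpha grows, so a very small alpha mostly removes the regularization rather than changing which features the model considers important. On the housing data, a comparison of held-out error over such a grid would be a natural next step for choosing alpha.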