## Linear models 

In this notebook we test OLS, ridge and lasso regression again, this time with additional data preprocessing and tuning.

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib as mpl
import seaborn as sns
from IPython.core.display import HTML
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, root_mean_squared_error
from feature_engine.selection import DropCorrelatedFeatures

In [13]:
train = pd.read_csv("data/train_without_nans.csv")
sub = pd.read_csv("data/test_without_nans.csv")

encoded_train = pd.read_csv("data/encoded_train.csv")
encoded_sub = pd.read_csv("data/encoded_test.csv")

encoded_train.shape

(1459, 313)

## Anomaly removal

To remove anomalies we apply a simple sigma rule to the continuous features. Instead of the usual 3-sigma interval we start with a wider 4-sigma interval, so that fewer objects are removed.

In [14]:
X = encoded_train.drop(["SalePrice"], axis=1).copy()
y = encoded_train["SalePrice"].copy()

continuous_features = train.select_dtypes(exclude="object").drop(["Id", "MSSubClass", "SalePrice"], axis=1)
#print(continuous_features.columns.values)

filtered_X = X.copy()

filtered_X["is_filtered"] = 0  # 1 marks a row with at least one outlying value

# mark outliers
for column in continuous_features.columns.values:
    
    col_values = filtered_X[column]
    mean = col_values.mean()
    std = col_values.std()
    filtered_X.loc[(mean - 4*std > col_values) | (col_values > mean + 4*std), "is_filtered"] = 1

print("CNT FILTERED OBJECTS", filtered_X["is_filtered"].sum())


# deleting outliers
X = filtered_X[filtered_X["is_filtered"] == 0].copy().drop("is_filtered", axis=1)
y = y[filtered_X["is_filtered"] == 0].copy()

CNT FILTERED OBJECTS 269
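
The marking loop above can also be expressed without an explicit loop. The following is a minimal sketch on synthetic data, not the notebook's own code: the helper `sigma_outlier_mask`, the `toy` frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

def sigma_outlier_mask(df, columns, k=4.0):
    """Return a boolean Series that is True for rows where any listed column
    lies more than k standard deviations away from that column's mean."""
    values = df[columns]
    z = (values - values.mean()) / values.std()  # pandas std uses ddof=1
    return (z.abs() > k).any(axis=1)

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.normal(1500, 300, 200),
    "rooms": rng.normal(6, 1, 200),
})
toy.loc[0, "area"] = 10_000  # one planted extreme value

mask = sigma_outlier_mask(toy, ["area", "rooms"], k=4.0)
clean = toy[~mask]
print("flagged rows:", int(mask.sum()), "remaining rows:", len(clean))
```

With `k=4.0` the mask applies the same criterion as the loop: a row is flagged as soon as one of its values falls outside `mean ± 4*std` of its column.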


## Deleting features with low variance

In [15]:
# After the outlier rows are removed, some features may be left with a single
# value or be almost constant, i.e. their standard deviation is zero or near zero.
print(X.shape)

stds = X.std()
low_variance_columns = stds[stds < 0.1].index.values
# Drop the same columns from the submission set so both frames stay aligned
encoded_sub = encoded_sub.drop(low_variance_columns, axis=1)
X = X.drop(low_variance_columns, axis=1)

print(X.shape)

(1190, 312)
(1190, 202)
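
To get a feel for what the `stds < 0.1` cut-off means for dummy columns, the sketch below computes the sample standard deviation of a binary column as a function of how many ones it contains. It assumes the encoded columns are plain 0/1 indicators; the helper `binary_std` is hypothetical, and the row count 1190 is taken from the output above.

```python
import numpy as np
import pandas as pd

def binary_std(n_ones, n_rows):
    """Sample std (ddof=1, the pandas default) of a 0/1 column with n_ones ones."""
    col = pd.Series(np.r_[np.ones(n_ones), np.zeros(n_rows - n_ones)])
    return col.std()

n_rows = 1190  # rows left after outlier removal
for n_ones in (1, 12, 13, 60):
    std = binary_std(n_ones, n_rows)
    print(f"{n_ones:>3} ones -> std {std:.4f}, dropped: {std < 0.1}")
```

Under these assumptions a dummy column is dropped when it is active in roughly 1% of the rows or fewer, which would explain why so many encoded columns disappear at this step.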


## Removing correlated features

Linear models are badly affected by correlated features, because the coefficient estimates become unstable. For this reason we remove features with high pairwise correlations.
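
One way to see the problem is through the condition number of the design matrix: the larger it is, the more strongly the OLS coefficients react to small changes in the data. A minimal sketch on synthetic data (all names are illustrative and unrelated to the housing features):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_correlated = x1 + 1e-3 * rng.normal(size=n)  # almost a copy of x1

# Condition number of the two-column design matrix in each case
cond_independent = np.linalg.cond(np.column_stack([x1, x2_independent]))
cond_correlated = np.linalg.cond(np.column_stack([x1, x2_correlated]))

print("condition number, independent features:", cond_independent)
print("condition number, correlated features: ", cond_correlated)
```

The second matrix is far worse conditioned, which is the situation the correlation filter is meant to avoid.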

In [16]:
tr = DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.4)
Xt = tr.fit_transform(X)
encoded_sub = encoded_sub.drop(tr.features_to_drop_, axis=1)
Xt.shape

(1190, 115)

In [17]:
encoded_sub.shape

(1459, 115)

## Scaling, splitting and applying models

Now we have data without anomalies (in terms of the 4-sigma rule), and the features with correlations above the 0.4 threshold have been removed.

In [18]:
X = Xt

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# After the train/test split, the train set may again contain
# features that have only one value.
# Select all remaining continuous features
continuous_columns = continuous_features.columns.drop(tr.features_to_drop_, errors="ignore")
continuous_columns = continuous_columns.drop(low_variance_columns, errors="ignore")

scaler = StandardScaler()

X_train[continuous_columns] = scaler.fit_transform(X_train[continuous_columns])
X_test[continuous_columns] = scaler.transform(X_test[continuous_columns])

real_y_train = y_train.copy()
real_y_test = y_test.copy()

y_train = np.log(y_train)
y_test = np.log(y_test)

X_train.shape

(1071, 115)
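
The manual `np.log` / `np.exp` bookkeeping for the target can also be delegated to scikit-learn's `TransformedTargetRegressor`. This is an alternative that the notebook does not use. The sketch below runs on synthetic data with made-up names (`X_toy`, `y_toy`) and only checks that both routes give the same predictions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
# Strictly positive target whose logarithm is linear in the features
y_toy = np.exp(12 + X_toy @ np.array([0.3, -0.2, 0.1]) + 0.05 * rng.normal(size=200))

# Manual route: fit on log(y), then exponentiate the predictions
manual_pred = np.exp(LinearRegression().fit(X_toy, np.log(y_toy)).predict(X_toy))

# Wrapped route: the log/exp pair is applied internally
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X_toy, y_toy)
wrapped_pred = wrapped.predict(X_toy)

print("same predictions:", np.allclose(manual_pred, wrapped_pred))
```

The wrapper returns predictions on the original price scale, so metrics on the log scale would still have to be computed on `np.log` of its output.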

## OLS

In [19]:
from sklearn.linear_model import LinearRegression

lr_reg = LinearRegression().fit(X_train, y_train)

train_pred = lr_reg.predict(X_train)
test_pred = lr_reg.predict(X_test)

print("TRAIN LOG RMSE:", root_mean_squared_error(y_train, train_pred))
print("TEST LOG RMSE:", root_mean_squared_error(y_test, test_pred))
print("." * 35)
print("TRAIN RMSE", root_mean_squared_error(real_y_train, np.exp(train_pred)))
print("TEST RMSE", root_mean_squared_error(real_y_test, np.exp(test_pred)))

TRAIN LOG RMSE: 0.10840238051335337
TEST LOG RMSE: 0.15284137685055116
...................................
TRAIN RMSE 21393.48208394398
TEST RMSE 19556.890617807636
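
Because the model is fit on `log(SalePrice)`, the log RMSE can be read as a relative error: a prediction that is off by one RMSE in log space is off by a factor of `exp(RMSE)` in price. A small sketch of that arithmetic, using the two values printed above (the helper name is hypothetical):

```python
import numpy as np

def typical_relative_error(log_rmse):
    """Relative price error that corresponds to an error of one RMSE in log space."""
    return np.exp(log_rmse) - 1.0

# Values copied from the output above
for name, log_rmse in [("train", 0.10840238051335337), ("test", 0.15284137685055116)]:
    print(f"{name}: about {100 * typical_relative_error(log_rmse):.1f}% off")
```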


This time the OLS model does not produce huge values on the test set. That is because we have removed the correlated features and cleared the data of constant-valued features.

In [20]:
print("MIN TRAIN PREDICTION", np.min(train_pred))
print("MIN TEST PREDICTION", np.min(test_pred))
print("MAX TRAIN PREDICTION", np.max(train_pred))
print("MAX TEST PREDICTION", np.max(test_pred))

MIN TRAIN PREDICTION 10.966887874466359
MIN TEST PREDICTION 11.096970078263457
MAX TRAIN PREDICTION 13.13075311110468
MAX TEST PREDICTION 12.768192981025756


Let's check the ten largest coefficients (by absolute value) of the OLS model

In [21]:
sorted_coef_idx = np.argsort(-np.abs(lr_reg.coef_))
coef = lr_reg.coef_[sorted_coef_idx]
for i in range(0, 10):
    print("{:.<025} {}".format(X_train.columns[sorted_coef_idx[i]], coef[i]))

Exterior1st_BrkFace...... 0.1914548268982117
BsmtQual_Ex.............. 0.17436230494845154
1stFlrSF................. 0.16875755094372852
Neighborhood_Somerst..... 0.1634797910980309
Exterior1st_VinylSd...... 0.1569266427707396
Neighborhood_StoneBr..... 0.1460041176551825
Exterior1st_MetalSd...... 0.14346492916766204
Exterior1st_Stucco....... 0.14031542848000667
Neighborhood_SWISU....... -0.13199564899418942
2ndFlrSF................. 0.12803352884434469
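
Since the target is `log(SalePrice)`, a coefficient `b` on a 0/1 dummy column corresponds to multiplying the predicted price by `exp(b)` when that dummy switches from 0 to 1. The sketch below converts a few of the printed coefficients into percentage effects. It assumes these columns are unscaled 0/1 indicators; the helper `pct_effect` is hypothetical.

```python
import numpy as np

# A few coefficients copied from the output above
coefs = {
    "Exterior1st_BrkFace": 0.1914548268982117,
    "BsmtQual_Ex": 0.17436230494845154,
    "Neighborhood_SWISU": -0.13199564899418942,
}

def pct_effect(coef):
    """Percentage change in predicted price when a 0/1 feature goes from 0 to 1."""
    return 100.0 * (np.exp(coef) - 1.0)

for name, coef in coefs.items():
    print(f"{name:.<25} {pct_effect(coef):+.1f}%")
```

For the standardized continuous features such as `1stFlrSF`, the same formula gives the effect of a one standard deviation increase rather than of a 0 to 1 switch.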


The coefficients look reasonable. However, the test results are even worse than those of the ridge regression without preprocessing in the previous notebook.