## Linear regression

In this notebook we test OLS regression with additional data preprocessing and tuning.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib as mpl
import seaborn as sns
from IPython.core.display import HTML
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, root_mean_squared_error
from feature_engine.selection import DropCorrelatedFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

In [2]:
train = pd.read_csv("data/train_without_nans.csv")
sub = pd.read_csv("data/test_without_nans.csv")

## Categorical encoding and dummy variable trap

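Our previously encoded data falls into the dummy variable trap for unregularized linear regression: the full set of one-hot columns of a feature always sums to one, so it is collinear with the intercept. For this reason we re-encode the data separately for this model.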

In [3]:
encoded_train = train.copy()
encoded_sub = sub.copy()

# Split the columns into categorical and continuous ones.
# MSSubClass is numeric in the file but is treated as categorical here.
object_columns = train.select_dtypes(include="object").copy()
object_columns["MSSubClass"] = train["MSSubClass"]
continuous_columns = train.select_dtypes(exclude="object").drop(["SalePrice", "Id", "MSSubClass"], axis=1)

object_columns = object_columns.columns.values
continuous_columns = continuous_columns.columns.values

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore", drop="first")        
one_hot_encoded_train = encoder.fit_transform(encoded_train[object_columns])
one_hot_encoded_sub = encoder.transform(encoded_sub[object_columns])
        
one_hot_df_train = pd.DataFrame(one_hot_encoded_train,
                                columns=encoder.get_feature_names_out(object_columns))
one_hot_df_sub= pd.DataFrame(one_hot_encoded_sub,
                                columns=encoder.get_feature_names_out(object_columns))
        
encoded_train = pd.concat([encoded_train, one_hot_df_train], axis=1)
encoded_train = encoded_train.drop(object_columns, axis=1)

encoded_sub = pd.concat([encoded_sub, one_hot_df_sub], axis=1)
encoded_sub = encoded_sub.drop(object_columns, axis=1)

encoded_train = encoded_train.astype("float64")
encoded_sub = encoded_sub.astype("float64")



In [4]:
encoded_sub.columns[(encoded_sub == 0).all()]

Index(['Utilities_NoSeWa', 'Condition2_RRAe', 'Condition2_RRAn',
       'Condition2_RRNn', 'HouseStyle_2.5Fin', 'RoofMatl_Membran',
       'RoofMatl_Metal', 'RoofMatl_Roll', 'Exterior1st_ImStucc',
       'Exterior1st_Stone', 'Exterior2nd_Other', 'Heating_OthW',
       'Electrical_Mix', 'PoolQC_Fa', 'MiscFeature_TenC'],
      dtype='object')

In the test set no object has any of the above dummy features set to one.

## Anomalous objects and low-information features

To remove anomalies we apply a sigma rule to the continuous features. The usual choice is the 3-sigma interval, but to remove fewer objects we start with the wider 4-sigma interval.

Then we drop uninformative features with low variance.
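
As a minimal sketch of the sigma rule, the same filter can be written in vectorized form. The function name `sigma_outlier_mask` and the toy data are hypothetical. The rule is the one described above: flag a row if any listed column lies outside mean +/- k*std.

```python
import numpy as np
import pandas as pd

def sigma_outlier_mask(df, columns, k=4.0):
    """Boolean Series: True where any column lies outside mean +/- k*std."""
    values = df[columns]
    z = (values - values.mean()) / values.std()  # pandas std uses ddof=1
    return (z.abs() > k).any(axis=1)

# Deterministic toy data with one planted outlier in column "a"
toy = pd.DataFrame({
    "a": np.linspace(-1.0, 1.0, 200),
    "b": np.linspace(0.0, 10.0, 200),
})
toy.loc[0, "a"] = 50.0

mask = sigma_outlier_mask(toy, ["a", "b"], k=4.0)
cleaned = toy[~mask]
print("rows flagged:", int(mask.sum()))
print("rows kept:", len(cleaned))
```

Note that the mean and std are computed with the outliers still in the data, so a very extreme value inflates the std and can hide milder outliers in the same column.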

In [5]:
X_sub = encoded_sub.copy().drop(["Id"], axis=1)
X = encoded_train.drop(["SalePrice"], axis=1).copy().drop(["Id"], axis=1)
y = encoded_train["SalePrice"].copy()

filtered_X = X.copy()
filtered_X["is_filtered"] = pd.Series(np.zeros(filtered_X.shape[0]), dtype="int64")

# mark outliers according to "4 sigma" rule
for column in continuous_columns:
    
    col_values = filtered_X[column]
    mean = col_values.mean()
    std = col_values.std()
    filtered_X.loc[(mean - 4*std > col_values) | (col_values > mean + 4*std), "is_filtered"] = 1

print("CNT FILTERED OBJECTS", filtered_X["is_filtered"].sum())

# deleting outliers
X = filtered_X[filtered_X["is_filtered"] == 0].copy().drop("is_filtered", axis=1)
y = y[filtered_X["is_filtered"] == 0].copy()

print("SHAPE AFTER ANOMALY FILTERING", X.shape)

# After deleting some objects, some features may be left with a single value
# or be nearly constant, i.e. their std is zero or close to zero.

stds = X.std()
low_variance_columns = stds[stds < 0.1].index.values
X_sub = X_sub.drop(low_variance_columns, axis=1)
X = X.drop(low_variance_columns, axis=1)

print("SHAPE AFTER LOW-VARIANCE FEATURES DELETING", X.shape)
X_sub.shape

CNT FILTERED OBJECTS 269
SHAPE AFTER ANOMALY FILTERING (1190, 271)
SHAPE AFTER LOW-VARIANCE FEATURES DELETING (1190, 176)


(1459, 176)
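
To get a feeling for what the 0.1 threshold means for a 0/1 dummy column, here is a small sketch. The helper `dummy_std` is hypothetical, and the row count is taken from the output above. It computes the sample std of a dummy with a given number of ones and finds the largest count that still falls below the threshold.

```python
import numpy as np
import pandas as pd

def dummy_std(n_ones, n_rows):
    """Sample std (ddof=1, the pandas default) of a 0/1 column with `n_ones` ones."""
    col = pd.Series(np.r_[np.ones(n_ones), np.zeros(n_rows - n_ones)])
    return col.std()

n_rows = 1190  # rows left after the outlier filter, see the output above

for n_ones in (5, 12, 13, 50):
    print(n_ones, round(dummy_std(n_ones, n_rows), 4))

# Largest number of ones for which the dummy is still below the threshold
max_dropped = max(c for c in range(1, n_rows // 2) if dummy_std(c, n_rows) < 0.1)
print("dummies with at most", max_dropped, "ones are dropped")
```

So for dummies the threshold removes categories that occur in about 1% of the rows or less. For continuous features the same threshold is applied in the feature's own units, because the filter runs before scaling.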

## Deleting correlated features

Let's delete features so that no remaining pair has a correlation above 0.7.

In [6]:
## Delete correlated with threshold 0.7
tr = DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.7)
X1 = tr.fit_transform(X)
X1_sub = tr.transform(X_sub)
print("SHAPE AFTER CORRELATED WITH 0.7 THRESHOLD DELETING", X1.shape)
X1_sub.shape

SHAPE AFTER CORRELATED WITH 0.7 THRESHOLD DELETING (1190, 136)


(1459, 136)
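
The idea of a pairwise correlation filter can be sketched in plain pandas. This is a simplified greedy version written for illustration. It is not claimed to reproduce the exact procedure of `DropCorrelatedFeatures`, and the function `drop_correlated` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.7):
    """Greedy filter: scan columns left to right and drop a column if its absolute
    Pearson correlation with an already kept column exceeds `threshold`."""
    corr = df.corr(method="pearson").abs()
    kept = []
    for col in df.columns:
        if not any(corr.loc[col, k] > threshold for k in kept):
            kept.append(col)
    dropped = [c for c in df.columns if c not in kept]
    return df[kept], dropped

rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "x_noisy": x + 0.1 * rng.normal(size=500),  # nearly a copy of x
    "independent": rng.normal(size=500),
})

reduced, dropped = drop_correlated(toy, threshold=0.7)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

A filter of this kind only looks at pairs of columns, and which column of a pair survives depends on the column order.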

## Applying the model

Now the data contains no anomalies (in terms of 4 sigmas), and features with a pairwise correlation above 0.7 have been deleted.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.2, random_state=1)

#Select all remaining continuous features
continuous_features = train.select_dtypes(exclude="object").drop(["SalePrice", "Id", "MSSubClass"], axis=1)
continuous_features = continuous_features.drop(tr.features_to_drop_, axis=1, errors="ignore")
continuous_features = continuous_features.drop(low_variance_columns, axis=1, errors="ignore")
continuous_features = continuous_features.columns.values

scaler = StandardScaler()

X_train[continuous_features] = scaler.fit_transform(X_train[continuous_features])
X_test[continuous_features] = scaler.transform(X_test[continuous_features])

real_y_train = y_train.copy()
real_y_test = y_test.copy()

y_train = np.log(y_train)
y_test = np.log(y_test)

print(X_train.shape)

reg1 = LinearRegression().fit(X_train, y_train)

train_pred = reg1.predict(X_train)
test_pred = reg1.predict(X_test)

print("TRAIN LOG RMSE:", root_mean_squared_error(y_train, train_pred))
print("TEST LOG RMSE:", root_mean_squared_error(y_test, test_pred))
print("." * 35)
print("TRAIN RMSE", root_mean_squared_error(real_y_train, np.exp(train_pred)))
print("TEST RMSE", root_mean_squared_error(real_y_test, np.exp(test_pred)))

(952, 136)
TRAIN LOG RMSE: 0.09168057743275664
TEST LOG RMSE: 0.10265950723222719
...................................
TRAIN RMSE 17027.966490021885
TEST RMSE 17761.881040230197


This time our OLS model does not blow up on the test set: the test error is close to the train error. That's because we have deleted correlated features and cleared the data of features with constant values.

Let's check the highest coefficients in the model

In [8]:
sorted_coef_idx = np.argsort(-np.abs(reg1.coef_))
coef = reg1.coef_[sorted_coef_idx]
for i in range(0, 5):
    print("{:.<025} {}".format(X_train.columns[sorted_coef_idx[i]], coef[i]))

GrLivArea................ 135431081302.48347
2ndFlrSF................. -124579564626.18889
1stFlrSF................. -103724018763.79666
LowQualFinSF............. -2171672710.745196
Functional_Typ........... 0.16978049278259277


Once again we get huge coefficients. The four largest belong to the area features GrLivArea, 1stFlrSF, 2ndFlrSF and LowQualFinSF, so it seems we have not deleted enough correlated features. There is a further problem: if we rerun the notebook with a lower correlation threshold of 0.6, the coefficients become normal, but the model gets worse on both the train and the test set. Let's show it.

In [9]:
## Delete correlated with threshold 0.6
tr2 = DropCorrelatedFeatures(variables=None, method='pearson', threshold=0.6)
X2 = tr2.fit_transform(X)
X2_sub = tr2.transform(X_sub)

print("SHAPE AFTER CORRELATED WITH 0.6 THRESHOLD DELETING", X2.shape)
X2_sub.shape

SHAPE AFTER CORRELATED WITH 0.6 THRESHOLD DELETING (1190, 122)


(1459, 122)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size=0.2, random_state=1)

#Select all remaining continuous features
continuous_features = train.select_dtypes(exclude="object").drop(["SalePrice", "Id", "MSSubClass"], axis=1)
continuous_features = continuous_features.drop(tr2.features_to_drop_, axis=1, errors="ignore")
continuous_features = continuous_features.drop(low_variance_columns, axis=1, errors="ignore")
continuous_features = continuous_features.columns.values

X_train[continuous_features] = scaler.fit_transform(X_train[continuous_features])
X_test[continuous_features] = scaler.transform(X_test[continuous_features])

real_y_train = y_train.copy()
real_y_test = y_test.copy()

y_train = np.log(y_train)
y_test = np.log(y_test)

# Train model on new set of features
reg2 = LinearRegression().fit(X_train, y_train)

train_pred = reg2.predict(X_train)
test_pred = reg2.predict(X_test)

print("TRAIN LOG RMSE:", root_mean_squared_error(y_train, train_pred))
print("TEST LOG RMSE:", root_mean_squared_error(y_test, test_pred))
print("." * 35)
print("TRAIN RMSE", root_mean_squared_error(real_y_train, np.exp(train_pred)))
print("TEST RMSE", root_mean_squared_error(real_y_test, np.exp(test_pred)))

TRAIN LOG RMSE: 0.10360826409008214
TEST LOG RMSE: 0.1039663528019374
...................................
TRAIN RMSE 19370.099182081205
TEST RMSE 19226.562372625296


In [11]:
sorted_coef_idx = np.argsort(-np.abs(reg2.coef_))
coef = reg2.coef_[sorted_coef_idx]
for i in range(0, 5):
    print("{:.<025} {}".format(X_train.columns[sorted_coef_idx[i]], coef[i]))

MSZoning_FV.............. 0.172044164477254
Neighborhood_StoneBr..... 0.16847152546880867
Exterior1st_Stucco....... 0.1599256414534786
ExterCond_Fa............. -0.15860989511372817
Exterior1st_BrkFace...... 0.15379551246096257


It means that if we delete too few correlated features we get overfitting, and if we delete enough of them we get underfitting.

What can be the solution here? We can try to find a balance between deleting correlated features and model performance.
Even for simple linear regression this is not a trivial task.

If all features were uncorrelated, we could just look at which features have large coefficients and which do not, and decide from that which features are important.

But with correlated features we can't judge importance from the model's coefficients, because we don't know whether a feature has a large coefficient because it is important or because it is correlated with some other feature.
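
A toy sketch of this problem, under stated assumptions: the columns below are hypothetical, with one column being the exact sum of three others. If, for example, GrLivArea were the sum of 1stFlrSF, 2ndFlrSF and LowQualFinSF (an assumption we do not verify here), the situation would be similar. Every pairwise correlation stays below 0.7, yet the design matrix loses rank and very different coefficient vectors give the same predictions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
# Hypothetical toy features: three independent parts and their exact sum
parts = pd.DataFrame(rng.normal(size=(n, 3)), columns=["part_a", "part_b", "part_c"])
toy = parts.assign(total=parts.sum(axis=1))

# Each part correlates with the total at roughly 1/sqrt(3), about 0.58,
# which is below a pairwise threshold of 0.7.
print(toy.corr().round(2))

# Four columns, but one of them is a linear combination of the others.
design = toy.to_numpy()
rank = int(np.linalg.matrix_rank(design))
print("columns:", design.shape[1], "rank:", rank)

# Two very different coefficient vectors produce the same predictions,
# so individual coefficients say nothing about feature importance here.
w_small = np.array([1.0, 1.0, 1.0, 0.0])
w_large = np.array([1e6 + 1.0, 1e6 + 1.0, 1e6 + 1.0, -1e6])
max_diff = np.abs(design @ w_small - design @ w_large).max()
print("max prediction difference:", max_diff)
```

The difference between the two prediction vectors is only floating point noise, although the coefficients differ by six orders of magnitude.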

For now I prefer not to do this time-consuming work and will address it in future notebooks with regularization.
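
As a preview of why regularization helps, here is a minimal sketch on hypothetical toy data with two almost identical features. `Ridge` with `alpha=1.0` is an arbitrary choice for the illustration, not a tuned model and not the setup of the future notebooks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)      # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=n)  # the target depends on x1 only

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)

# Both models predict almost the same values on this data
pred_gap = np.abs(ols.predict(X) - ridge.predict(X)).max()
print("max prediction gap:", pred_gap)
```

OLS is free to put large coefficients of opposite sign on the two near-copies, while the L2 penalty pushes Ridge towards splitting the effect evenly between them, with very little change in the predictions.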

Anyway, let's make a submission with our second model, which has stable coefficients.

In [12]:
final_sub_X = X2_sub.copy()  # copy so that scaling below does not modify X2_sub

continuous_features = train.select_dtypes(exclude="object").drop(["SalePrice", "Id", "MSSubClass"], axis=1)
continuous_features = continuous_features.drop(tr2.features_to_drop_, axis=1, errors="ignore")
continuous_features = continuous_features.drop(low_variance_columns, axis=1, errors="ignore")
continuous_features = continuous_features.columns.values

final_sub_X[continuous_features] = scaler.transform(final_sub_X[continuous_features])

submission = pd.Series(np.exp(reg2.predict(final_sub_X)))
submission_df = pd.DataFrame()
submission_df["Id"] = sub["Id"]
submission_df["SalePrice"] = submission

submission_df.to_csv("submissions/ols/ols_correlations_0.6.csv", index=False)

We did not train on the whole dataset, because a model fitted on the unsplit data overfits and ends up with very large coefficients.

This solution gives a worse result in the Kaggle submission: **0.17688**

The lesson for us is that simple linear regression has many problems which are not easily solved with data preprocessing alone.