# Improving a Multiple Linear Regression Model by Applying Feature Engineering

## Environment Management

Please read the project's README for instructions on how to set up the project's environment on your computer.

In [1]:
import sys				
sys.executable				

'C:\\Users\\Adespotos\\anaconda3\\envs\\Linear_Regression\\python.exe'

<div style="text-align: justify">
The environment is the same as in linear_regression_prj, because this notebook builds on that project. Its main purpose is to compare the results of different versions of the same model (Multiple Linear Regression). This should help both the author and the readers better understand how predictions can be evaluated and how different modelling choices lead to different results.
</div>

# Project Libraries

In [2]:
import numpy as np  # For numerical operations and arrays.	
import pandas as pd  # For data manipulation and analysis.	
import matplotlib.pyplot as plt  # For basic plotting.	
import seaborn as sns  # For enhanced plotting.	
import plotly.express as px  # For interactive plotting.	
from statsmodels.stats.outliers_influence import variance_inflation_factor  # For calculating VIF to check the features for multicollinearity.
from sklearn.preprocessing import StandardScaler  # For creating scaler instances for standardization purposes.
from sklearn.model_selection import train_test_split  # For splitting the data into train and test sets.
from sklearn.linear_model import LinearRegression  # For creating LinearRegression instances.
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error  # For calculating the error metrics (MAE, MSE, RMSE).
sns.set()  # For overriding default matplotlib styles with those of seaborn.

# Data Frame Manipulation

In [3]:
# Format the values to be displayed without scientific notation:
pd.options.display.float_format = '{:.4f}'.format

# Make pandas display all columns and rows:
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

In [4]:
df_imported = pd.read_csv('cleaned.csv')

<div style="text-align: justify">
This is the cleaned data frame exported from the previous project. Before testing the different versions of the model, the 'Year' column is converted into the car's age, which is easier to interpret.
</div>

In [5]:
# Store the year in which the dataset was created:
year = 2019

# Add the car age column to the data frame:
df_imported['Car Age'] = year - df_imported['Year']

# Drop the 'Year' column:
df_imported = df_imported.drop(columns='Year')

In [6]:
# Create a list of column names (not named `list`, which would shadow the built-in):
cols = df_imported.columns.to_list()

# Move 'Car Age' to position 1 of the list of column names:
cols.insert(1, cols.pop(cols.index('Car Age')))

# Create the reordered data frame from the updated list of column names:
df = df_imported[cols]

<div style="text-align: justify">
This is our basic data frame, with 3867 rows and 8 columns. It will be modified each time to train a different version of the linear regression model.
</div>

# Model Function

<div style="text-align: justify">
We'll define a function that creates the linear regression model, calculates the model metrics and returns a data frame containing these metrics, together with the test targets, the test predictions and the feature weights.
</div>

In [7]:
def model_creation(dataframe, target='Price', testsize=0.2, randomstate=365):
    """
    Creates a linear regression model using the provided dataframe.
    
    Parameters:
    dataframe (pd.DataFrame): The dataset containing features and target. 
    target (str): The column name of the target variable. Defaults to 'Price'.
    testsize (float): Proportion of the dataset to include in the test split. 
    Defaults to 0.2.
    randomstate (int): Seed used by the random number generator. Defaults to 365.
    
    Returns:
    tuple: (params_table, y_test, y_test_pred, coefs), where params_table is a dataframe 
    with the model, the scaler and the train/test metrics, y_test and y_test_pred are the 
    test targets and their predictions, and coefs is a dataframe of feature weights.
    """
    # Define the target:
    y = dataframe[target]
    # Define the features:
    x = dataframe.drop(columns=target)
    
    # Define the scaler instance, fit it on the features and transform them to scaled features:
    scaler = StandardScaler()
    x_scaled = scaler.fit_transform(x)
    
    # Split the data into train and test parts so that the model is evaluated on unseen data:
    x_train, x_test, y_train, y_test = train_test_split(
        x_scaled, 
        y, 
        test_size=testsize, 
        random_state=randomstate
    )
    
    # Create the linear regression instance:
    model = LinearRegression()
    # Fit the instance with the training part of the data:
    model.fit(x_train, y_train)
    # Create predictions over the train and test splits: 
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)
    
    # Assign number of observations and predictors for each split: 
    n_train = x_train.shape[0]
    n_test = x_test.shape[0]
    p = x_train.shape[1]  # This is the same both for training and testing.

    # Create a dictionary with model metrics:
    metrics = {
        'Model Information': ['Model', 'Scaler', 
                              'R-squared (train)', 
                              'Adj R-squared (train)', 
                              'Mean Absolute Error (train)', 
                              'Mean Squared Error (train)', 
                              'Root Mean Squared Error (train)', 
                              'R-squared (test)', 
                              'Adj R-squared (test)', 
                              'Mean Absolute Error (test)', 
                              'Mean Squared Error (test)',
                              'Root Mean Squared Error (test)'],
        'Value': [model, scaler,
                  model.score(x_train, y_train), 
                  1 - (1 - model.score(x_train, y_train)) * (n_train - 1) / (n_train - p - 1),
                  mean_absolute_error(y_train, y_train_pred),
                  mean_squared_error(y_train, y_train_pred),
                  root_mean_squared_error(y_train, y_train_pred),
                  model.score(x_test, y_test),
                  1 - (1 - model.score(x_test, y_test)) * (n_test - 1) / (n_test - p - 1),
                  mean_absolute_error(y_test, y_test_pred),
                  mean_squared_error(y_test, y_test_pred),
                  root_mean_squared_error(y_test, y_test_pred)]
    }
    
    # Create a df table to display the features and their weights:
    coefs = pd.DataFrame(data=x.columns.values, columns=['Features'])  # Create a df with features.
    coefs['Weights'] = model.coef_  # Store the fitted weights.

    intercept = model.intercept_  # Get the fitted intercept.
    print('Intercept:', intercept)  # Print the intercept.

    params_table = pd.DataFrame(metrics).set_index('Model Information')  # Turn dict to a df.
    
    return params_table, y_test, y_test_pred, coefs  # Return important information.
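
The adjusted R-squared rows computed inside the function follow the usual correction for the number of predictors, $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of features. The helper below is a minimal sketch that isolates this formula; `adjusted_r2` is a hypothetical name and the inputs are illustrative numbers, not values from the dataset.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (hypothetical helper)."""
    if n - p - 1 <= 0:
        raise ValueError("n must exceed p + 1 for the adjustment to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Illustrative inputs only: the same R-squared is penalized more as p grows.
few_predictors = adjusted_r2(r2=0.80, n=100, p=5)
many_predictors = adjusted_r2(r2=0.80, n=100, p=50)
print(round(few_predictors, 4), round(many_predictors, 4))
```

With a fixed number of observations, adding features widens the gap between the plain and the adjusted value, which is why the adjusted metric is useful when comparing model versions with different numbers of columns.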

Common assumptions for all models:  

1) The feature data are always scaled using a StandardScaler() instance.  
2) The categorical features are always transformed to dummy variables, dropping the first dummy.
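
The first assumption can be applied in more than one order. `model_creation` fits the scaler on the full feature matrix before splitting, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training part only. The sketch below illustrates that ordering on synthetic data; the column values are made up for illustration, and this is not the procedure that produced the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data (assumed columns, made-up values).
rng = np.random.default_rng(0)
x = pd.DataFrame({
    'Mileage': rng.uniform(0, 300, size=200),
    'EngineV': rng.uniform(1.0, 4.0, size=200),
})
y = 10 - 0.01 * x['Mileage'] + 0.3 * x['EngineV'] + rng.normal(0, 0.1, size=200)

# Split first, so the test rows stay unseen during scaling.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=365
)

# Fit the scaler on the training part only, then reuse it for the test part.
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

model = LinearRegression().fit(x_train_scaled, y_train)
test_r2 = model.score(x_test_scaled, y_test)
print('R-squared (test):', test_r2)
```

On a large, well-behaved dataset the two orderings usually give similar numbers, but fitting on the training part keeps the test metrics an honest estimate of performance on unseen data.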

# "Baseline" MLR Model (1)

<div style="text-align: justify">
This is a "baseline" model in which only the common two assumptions hold.
</div>

In [8]:
df_dummies = pd.get_dummies(df, drop_first=True)  # Create the dummy df.

In [9]:
model_info = model_creation(dataframe=df_dummies)  # Run the model function.
params = model_info[0]  # Assign params_table to a variable.
params

Intercept: 5903897603303517.0


Model Information,Value
Model,LinearRegression()
Scaler,StandardScaler()
R-squared (train),0.8222
Adj R-squared (train),0.8026
Mean Absolute Error (train),5021.5287
Mean Squared Error (train),63244975.7538
Root Mean Squared Error (train),7952.6710
R-squared (test),-1712389644576256238565720064.0000
Adj R-squared (test),-2840509002698381726758993920.0000
Mean Absolute Error (test),108503583499051920.0000


<div style="text-align: justify">
The test results are catastrophic: a negative $R^2$ means that the model performs worse on the test set than simply predicting the mean of the target (a horizontal line). This model cannot be used even as a baseline. Inspecting the 'Model' column of df_imported shows that it contains 291 different car models, of which 211 have 10 or fewer observations, whereas only 21 have more than 50. Consequently, the dummy data frame includes 307 features, of which at least 211 are extremely sparse. This creates very high dimensionality without adding reliable information. Keeping only the most popular models would not fix the issue either, because very few models have more than 100 observations, which is still a small number given the accompanying increase in dimensionality. We should drop the entire 'Model' column for now.
</div>
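
The sparsity described above can be checked by counting how many observations each car model has. The snippet is a minimal sketch on a toy frame; with the real data the same calls would be applied to `df_imported['Model']`, and the thresholds (10 and 50) are the ones quoted in the paragraph.

```python
import pandas as pd

# Toy stand-in for df_imported (made-up rows, for illustration only).
toy = pd.DataFrame({
    'Model': ['A'] * 60 + ['B'] * 12 + ['C'] * 10 + ['D'] * 3 + ['E'] * 1
})

counts = toy['Model'].value_counts()      # Observations per car model.
n_models = counts.size                    # Number of distinct models.
n_sparse = int((counts <= 10).sum())      # Models with 10 or fewer observations.
n_popular = int((counts > 50).sum())      # Models with more than 50 observations.

print(n_models, n_sparse, n_popular)
```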

# MLR: No Model Column (2)

<div style="text-align: justify">
Model Assumptions:  

1) Common assumptions
2) The 'Model' column should be dropped.
</div>

In [10]:
df_no_model = df_imported.drop(columns='Model')  # Drop the 'Model' column.
df_no_model = pd.get_dummies(df_no_model, drop_first=True)  # Create the dummy df.

In [11]:
model_info_2 = model_creation(dataframe=df_no_model)  # Run the model function.
params_2 = model_info_2[0]  # Assign params_table to a variable.
params_2

Intercept: 18097.822285317394


Model Information,Value
Model,LinearRegression()
Scaler,StandardScaler()
R-squared (train),0.6235
Adj R-squared (train),0.6214
Mean Absolute Error (train),7537.5635
Mean Squared Error (train),133907570.8081
Root Mean Squared Error (train),11571.8439
R-squared (test),0.6151
Adj R-squared (test),0.6064
Mean Absolute Error (test),7812.3613


<div style="text-align: justify">
The test performance is much better: the test $R^2$ of 0.6151 is close to the training $R^2$ of 0.6235, which suggests that the model generalizes reasonably well. The error metrics (MAE, MSE, RMSE) are also in a more acceptable range. This indicates that dropping the 'Model' column was the right approach, since it avoids the severe overfitting of the previous version.
</div>

In [12]:
weights_2 = model_info_2[3]
weights_2

Unnamed: 0,Features,Weights
0,Mileage,-5944.9621
1,EngineV,4559.8542
2,Car Age,-6413.1371
3,Brand_BMW,649.8063
4,Brand_Mercedes-Benz,1069.8264
5,Brand_Mitsubishi,-2803.721
6,Brand_Renault,-3304.4908
7,Brand_Toyota,-1959.5549
8,Brand_Volkswagen,-2078.2482
9,Body_hatch,-2168.3448


# MLR: No Model Column and Logarithmic Price (3)

<div style="text-align: justify">
Model Assumptions:  

1) Common assumptions
2) The 'Model' column should be dropped.
3) The target variable ('Price') should be log-transformed.
</div>

In [13]:
df_log_price = df_no_model.copy()  # Copy the previous successful df.
df_log_price['Log Price'] = np.log(df_log_price['Price'])  # Create the logarithmic price column.
df_log_price = df_log_price.drop(columns='Price')  # Drop the original price column.

In [14]:
model_info_3 = model_creation(dataframe=df_log_price, target='Log Price')  # Run the model function.
params_3 = model_info_3[0]  # Assign params_table to a variable.
params_3

Intercept: 9.416549683715537


Model Information,Value
Model,LinearRegression()
Scaler,StandardScaler()
R-squared (train),0.8162
Adj R-squared (train),0.8152
Mean Absolute Error (train),0.2794
Mean Squared Error (train),0.1391
Root Mean Squared Error (train),0.3730
R-squared (test),0.8131
Adj R-squared (test),0.8089
Mean Absolute Error (test),0.2974


<div style="text-align: justify">
The $R^2$ values are considerably higher than in the previous model without the log transformation (0.8131 versus 0.6151 on the test set), and the training and test values are closer to each other. High $R^2$ values indicate a good fit of the model to the data, and the small gap between train and test suggests that the model generalizes well and is not overfitting. However, the above parameters table is hard to interpret on its own, because MAE, MSE and RMSE are expressed in log-price units.
</div>

In [15]:
weights_3 = model_info_3[3]  # Assign weights to a variable.
weights_3

Unnamed: 0,Features,Weights
0,Mileage,-0.1167
1,EngineV,0.2283
2,Car Age,-0.571
3,Brand_BMW,0.0332
4,Brand_Mercedes-Benz,0.0347
5,Brand_Mitsubishi,-0.1126
6,Brand_Renault,-0.169
7,Brand_Toyota,-0.0505
8,Brand_Volkswagen,-0.0679
9,Body_hatch,-0.105


In [16]:
data_df_3 = pd.DataFrame(columns=['Price', 'Price Prediction'])  # Create an empty df.
data_df_3['Price'] = np.exp(model_info_3[1])  # Back-transform y_test to the original price scale.
data_df_3['Price Prediction'] = np.exp(model_info_3[2])  # Back-transform y_test_pred to the original price scale.

In [17]:
# Calculate the errors over original price values:
mae = np.mean(np.abs(data_df_3['Price'] - data_df_3['Price Prediction']))  # MAE
mse = np.mean((data_df_3['Price'] - data_df_3['Price Prediction'])**2)  # MSE
rmse = np.sqrt(mse)  # RMSE

In [18]:
# Create an updated parameters table:
params_3_updated = pd.DataFrame({
    'R-squared (test)': [params_3.loc['R-squared (test)'].values.item()],
    'Adj R-squared (test)': [params_3.loc['Adj R-squared (test)'].values.item()],
    'Mean Absolute Error (test)': mae,
    'Mean Squared Error (test)': mse,
    'Root Mean Squared Error (test)': rmse,
})
params_3_updated.transpose()

Unnamed: 0,0
R-squared (test),0.8131
Adj R-squared (test),0.8089
Mean Absolute Error (test),4913.3928
Mean Squared Error (test),101499227.3206
Root Mean Squared Error (test),10074.6825


<div style="text-align: justify">
The $R^2$ of 0.8131 is computed on the logarithmic scale, so it indicates that 81.31% of the variance in the test set log prices is explained by the model, which is good for many real-world scenarios. Additionally, the MAE and RMSE, now expressed in price units, give a good indication of the average error magnitude. While \$4,913 and \$10,075 might seem large, they could be reasonable depending on the price range and variability of the dataset.
</div>
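
The $R^2$ reported by `model_creation` for this version is measured on log prices. To describe the fit in price units, it can be recomputed after back-transforming both the targets and the predictions. The sketch below uses synthetic numbers and a hand-written `r2` helper (a hypothetical name); with the notebook's objects the inputs would be `data_df_3['Price']` and `data_df_3['Price Prediction']`.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (hypothetical helper)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic log prices and noisy predictions (made-up values).
rng = np.random.default_rng(0)
log_price = rng.normal(9.4, 0.9, size=500)
log_pred = log_price + rng.normal(0, 0.35, size=500)

r2_log = r2(log_price, log_pred)                    # Fit measured on the log scale.
r2_price = r2(np.exp(log_price), np.exp(log_pred))  # Fit measured in price units.
print(round(r2_log, 4), round(r2_price, 4))
```

The two numbers generally differ, because exponentiation gives expensive cars much more weight in the squared errors.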

# MLR: No Model Column, Logarithmic Price and No Car Age Column (4)

<div style="text-align: justify">
Model Assumptions:  

1) Common assumptions
2) The 'Model' column should be dropped.
3) The target variable ('Price') should be log-transformed.
4) The 'Car Age' column should be dropped.
</div>

<div style="text-align: justify">
The motivation for this model version is that car age and mileage are probably correlated. If so, dropping the 'Car Age' column would reduce dimensionality and might not affect the model performance much.
</div>
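
Before dropping a column on the grounds of correlation, the assumption can be checked directly. The sketch below computes the Pearson correlation and a variance inflation factor (VIF) by hand on synthetic data. The notebook imports `variance_inflation_factor` from statsmodels for this kind of check, but the sketch sticks to NumPy so that the calculation is visible. The helper name `vif` and all numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def vif(frame, column):
    """VIF of one column: 1 / (1 - R^2) from regressing it on the other columns."""
    y = frame[column].to_numpy(dtype=float)
    others = frame.drop(columns=column).to_numpy(dtype=float)
    design = np.column_stack([np.ones(len(frame)), others])  # Add an intercept.
    fitted = design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

# Synthetic stand-in: mileage grows with age, plus noise (made-up values).
rng = np.random.default_rng(0)
age = rng.uniform(0, 20, size=500)
features = pd.DataFrame({
    'Car Age': age,
    'Mileage': 12 * age + rng.normal(0, 40, size=500),
    'EngineV': rng.uniform(1.0, 4.0, size=500),
})

age_mileage_corr = features.corr().loc['Car Age', 'Mileage']
vifs = {col: vif(features, col) for col in features.columns}
print(round(age_mileage_corr, 3))
print({col: round(value, 2) for col, value in vifs.items()})
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity. A moderate correlation on its own does not imply that one of the two features is redundant.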

In [19]:
df_no_carage = df_log_price.copy()  # Copy the previous successful df.
df_no_carage = df_no_carage.drop(columns='Car Age')  # Drop the 'Car Age' column.

In [20]:
model_info_4 = model_creation(dataframe=df_no_carage, target='Log Price')  # Run the model function.
params_4 = model_info_4[0]  # Assign params_table to a variable.
params_4

Intercept: 9.416736021792213


Model Information,Value
Model,LinearRegression()
Scaler,StandardScaler()
R-squared (train),0.6328
Adj R-squared (train),0.6309
Mean Absolute Error (train),0.3859
Mean Squared Error (train),0.2779
Root Mean Squared Error (train),0.5272
R-squared (test),0.6669
Adj R-squared (test),0.6598
Mean Absolute Error (test),0.3863


<div style="text-align: justify">
Dropping 'Car Age' reduces the dimensionality, which can simplify the model. However, it appears that 'Car Age' was a significant predictor of the target variable. Its removal has led to a notable decrease in the model's performance: the test $R^2$ drops from 0.8131 to 0.6669 and the test MAE (in log-price units) rises from 0.2974 to 0.3863.
</div>

<div style="text-align: justify">
While 'Car Age' and 'Mileage' may be correlated, they each contain unique information. The results suggest that 'Car Age' provided valuable information that 'Mileage' alone could not fully capture. The decrease in performance indicates that the model truly benefits from having both features.
</div>

# MLR: No Model Column, Logarithmic Price and No Engine Type Column (5)

<div style="text-align: justify">
Model Assumptions:  

1) Common assumptions
2) The 'Model' column should be dropped.
3) The target variable ('Price') should be log-transformed.
4) The 'Engine Type' column should be dropped.
</div>

<div style="text-align: justify">
The motivation for this model version is that engine volume and engine type are probably correlated. If so, dropping the 'Engine Type' column would reduce dimensionality and might not affect the model performance much.
</div>

In [21]:
df_no_engtype = df_imported.drop(columns=['Model', 'Engine Type'])  # Drop the columns.
df_no_engtype = pd.get_dummies(df_no_engtype, drop_first=True)  # Create the dummy df.
df_no_engtype['Log Price'] = np.log(df_no_engtype['Price'])  # Create the logarithmic price column.
df_no_engtype = df_no_engtype.drop(columns='Price')  # Drop the original price column.

In [22]:
model_info_5 = model_creation(dataframe=df_no_engtype, target='Log Price')  # Run the model function.
params_5 = model_info_5[0]  # Assign params_table to a variable.
params_5

Intercept: 9.416211145724507


Model Information,Value
Model,LinearRegression()
Scaler,StandardScaler()
R-squared (train),0.8157
Adj R-squared (train),0.8149
Mean Absolute Error (train),0.2800
Mean Squared Error (train),0.1395
Root Mean Squared Error (train),0.3734
R-squared (test),0.8135
Adj R-squared (test),0.8100
Mean Absolute Error (test),0.2968


<div style="text-align: justify">
Removing the 'Engine Type' column has resulted in very minor changes to the performance metrics, with a slight improvement on the test set (test $R^2$ of 0.8135 versus 0.8131, test MAE of 0.2968 versus 0.2974). This suggests that the model does not rely heavily on the 'Engine Type' feature to make accurate predictions. Dropping it therefore simplifies the model without compromising its accuracy, which makes 'Engine Type' a good candidate for removal.
</div>

In [23]:
weights_5 = model_info_5[3]  # Assign weights to a variable.
weights_5

Unnamed: 0,Features,Weights
0,Mileage,-0.1137
1,EngineV,0.2281
2,Car Age,-0.5718
3,Brand_BMW,0.0337
4,Brand_Mercedes-Benz,0.0358
5,Brand_Mitsubishi,-0.112
6,Brand_Renault,-0.168
7,Brand_Toyota,-0.0503
8,Brand_Volkswagen,-0.0668
9,Body_hatch,-0.1085


# MLR: No Model Column, Logarithmic Price, No Engine Type Column and No Body Column (6)

<div style="text-align: justify">
Model Assumptions:  

1) Common assumptions
2) The 'Model' column should be dropped.
3) The target variable ('Price') should be log-transformed.
4) The 'Engine Type' column should be dropped.
5) The 'Body' column should be dropped.
</div>

We'll drop this column purely out of curiosity.

In [24]:
df_no_body = df_imported.drop(columns=['Model', 'Engine Type', 'Body'])  # Drop the columns.
df_no_body = pd.get_dummies(df_no_body, drop_first=True)  # Create the dummy df.
df_no_body['Log Price'] = np.log(df_no_body['Price'])  # Create the logarithmic price column.
df_no_body = df_no_body.drop(columns='Price')  # Drop the original price column.

In [25]:
model_info_6 = model_creation(dataframe=df_no_body, target='Log Price')  # Run the model function.
params_6 = model_info_6[0]  # Assign params_table to a variable.
params_6

Intercept: 9.415364312683943


Model Information,Value
Model,LinearRegression()
Scaler,StandardScaler()
R-squared (train),0.7959
Adj R-squared (train),0.7953
Mean Absolute Error (train),0.3038
Mean Squared Error (train),0.1545
Root Mean Squared Error (train),0.3930
R-squared (test),0.7987
Adj R-squared (test),0.7963
Mean Absolute Error (test),0.3126


<div style="text-align: justify">
Dropping the 'Body' column makes sense, as it reduces the number of features from 15 to 10. The performance decrease is modest (test $R^2$ of 0.7987 versus 0.8135), and the benefits of a simpler model may outweigh the slight loss in accuracy. Of course, the decision to retain this column or not depends on the context and the desired level of accuracy.
</div>

In [26]:
weights_6 = model_info_6[3]  # Assign weights to a variable.
weights_6

Unnamed: 0,Features,Weights
0,Mileage,-0.1281
1,EngineV,0.2887
2,Car Age,-0.5995
3,Brand_BMW,0.0462
4,Brand_Mercedes-Benz,0.0314
5,Brand_Mitsubishi,-0.0831
6,Brand_Renault,-0.1775
7,Brand_Toyota,-0.0326
8,Brand_Volkswagen,-0.0689


# Conclusions

<div style="text-align: justify">
The best model versions were version 5, where the engine type was dropped, and version 6, where the car body was dropped as well. Model 5 is the better choice when accuracy is the priority, at the cost of somewhat higher complexity. Conversely, model 6 can be chosen when the goal is a simpler, quicker and more easily interpretable model. The performance differences between these two models are minor.
</div>
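
The comparison behind these conclusions can be made explicit by placing the test metrics side by side. The sketch below hard-codes the test R-squared and adjusted R-squared values reported in the parameter tables of models 2 to 6; in the notebook itself the same frame could be built from `params_2` to `params_6`. The frame name `comparison` is an assumption.

```python
import pandas as pd

# Test-set values copied from the parameter tables of models 2 to 6.
comparison = pd.DataFrame(
    {
        'R-squared (test)': [0.6151, 0.8131, 0.6669, 0.8135, 0.7987],
        'Adj R-squared (test)': [0.6064, 0.8089, 0.6598, 0.8100, 0.7963],
    },
    index=['Model 2', 'Model 3', 'Model 4', 'Model 5', 'Model 6'],
)

# Rank the versions by adjusted R-squared, best first.
ranked = comparison.sort_values('Adj R-squared (test)', ascending=False)
print(ranked)
```

Model 2 predicts the price itself, whereas models 3 to 6 predict the log price, so its $R^2$ is not measured on the same scale as the others and the ranking should be read with that in mind.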