># **Linear Regression Models: OLS, Ridge, and Lasso**
>
>The dataframe used in this notebook originates from the preprocessing steps 
>performed in the `"1_4b-preprocessing-feature-engineering-and-preprocessing-for-predictive-models.ipynb"` notebook.
>The final refinement of selected variables is conducted here to meet the 
>specific requirements of the models being developed, based on insights from 
>the aforementioned notebook.

>## Multilinear Regression (OLS)

In [None]:
# Importing Required Libraries

import pandas as pd
import numpy as np
import joblib
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [None]:
# Loading the Dataset

df = pd.read_csv("preprocessing_for_prediction_models_final.csv", index_col = 0, sep = ",")

In [None]:
# Dropping rows with a missing or zero value for the target variable
df = df.dropna(subset=["electric_energy_consumption"])
df = df[df["electric_energy_consumption"] != 0]

In [None]:
# Testing for multicollinearity using the variance inflation factor (VIF)

x = df[["mass_vehicle", "engine_power", "engine_capacity",
"electric_range", "fuel_consumption", "specific_co2_emissions"]].dropna()

# Standardize the features: variance_inflation_factor does not add an intercept, so the columns are centered first
x_scaled = StandardScaler().fit_transform(x)

# Calculate VIF
vif_data = pd.DataFrame()
vif_data["feature"] = x.columns
vif_data["VIF"] = [variance_inflation_factor(x_scaled, i) for i in range(x.shape[1])]
print(vif_data)


                  feature       VIF
0            mass_vehicle  2.528779
1            engine_power  2.382914
2         engine_capacity  1.979144
3          electric_range  2.288244
4        fuel_consumption  2.743936
5  specific_co2_emissions  4.035319
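
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes this definition directly with NumPy on synthetic data; the helper `manual_vif` and the arrays `a`, `b`, `c` are hypothetical names introduced for illustration only. On centered inputs such as `x_scaled` it should agree with what `variance_inflation_factor` reports.

```python
import numpy as np

def manual_vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    # Explicit intercept column, so the inputs do not need to be centered here
    design = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Synthetic demo data (assumption): c is strongly correlated with a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.5 * rng.normal(size=500)
X_demo = np.column_stack([a, b, c])

vif_manual = [manual_vif(X_demo, i) for i in range(X_demo.shape[1])]
print(vif_manual)
```

The correlated pair (`a`, `c`) receives clearly inflated values, while the independent column `b` stays close to 1. By the same reading, the values in the table above (all below 5) point to moderate rather than severe multicollinearity.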


In [None]:
# Transforming variables whose relationship with the target is not visibly linear

df["log_engine_power"] = np.log1p(df["engine_power"])
df["log_engine_capacity"] = np.log1p(df["engine_capacity"])
df["electric_range"] = df["electric_range"].fillna(0)
# Inverse of the electric range; a range of 0 is mapped to 0 rather than inf (1 / 0)
df["inv_electric_range"] = (1 / df["electric_range"].replace(0, np.nan)).fillna(0)


In [31]:
df.isna().sum()

mass_vehicle                   0
engine_capacity                0
engine_power                   0
erwltp                         0
year                           0
electric_range                 0
fuel_consumption               0
specific_co2_emissions         0
electric_energy_consumption    0
fuel_type_diesel/electric      0
fuel_type_e85                  0
fuel_type_lpg                  0
fuel_type_ng                   0
fuel_type_petrol               0
fuel_type_petrol/electric      0
has_innovation                 0
col_0                          0
col_1                          0
col_2                          0
col_3                          0
col_4                          0
col_5                          0
col_6                          0
col_7                          0
log_engine_power               0
log_engine_capacity            0
inv_electric_range             0
dtype: int64

In [32]:
df.head()

ID,mass_vehicle,engine_capacity,engine_power,erwltp,year,electric_range,fuel_consumption,specific_co2_emissions,electric_energy_consumption,fuel_type_diesel/electric,...,col_1,col_2,col_3,col_4,col_5,col_6,col_7,log_engine_power,log_engine_capacity,inv_electric_range
56003435,2005.0,2487.0,136.0,0.0,2021,75.0,1.0,22.0,166.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.919981,7.819234,0.013333
56003436,1985.0,2487.0,136.0,0.0,2021,75.0,1.0,22.0,166.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.919981,7.819234,0.013333
56003437,1985.0,2487.0,136.0,0.0,2021,75.0,1.0,22.0,166.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.919981,7.819234,0.013333
56003438,1985.0,2487.0,136.0,0.0,2021,75.0,1.0,22.0,166.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.919981,7.819234,0.013333
56003439,1985.0,2487.0,136.0,0.0,2021,75.0,1.0,22.0,166.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.919981,7.819234,0.013333


In [None]:
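# Separating features and target, and converting both to float32
# X holds every candidate feature; the models below are fitted on x, the
# six-feature subset defined in the VIF cell (see "Df Model: 6" in the OLS summary)
X = df.drop(columns = ["electric_energy_consumption"])
y = df["electric_energy_consumption"]

x = x.astype("float32")
y = y.astype("float32")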

In [37]:
# Save features and target (the train-test split is performed after loading)
joblib.dump((x, y), "split_data_linear_model.pkl")

print("Data saved successfully!")

Data saved successfully!


In [None]:
# OLS Model

# Add constant (intercept)
x_const = sm.add_constant(x)

# OLS Model
model = sm.OLS(y, x_const).fit()
print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:     electric_energy_consumption   R-squared:                       0.667
Model:                                     OLS   Adj. R-squared:                  0.667
Method:                          Least Squares   F-statistic:                 7.050e+05
Date:                         Mon, 24 Mar 2025   Prob (F-statistic):               0.00
Time:                                 21:50:36   Log-Likelihood:            -9.6856e+06
No. Observations:                      2115211   AIC:                         1.937e+07
Df Residuals:                          2115204   BIC:                         1.937e+07
Df Model:                                    6                                         
Covariance Type:                     nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------
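
The R² in the summary above is computed on the same rows the model was fitted to, whereas the Ridge and Lasso sections score their models on a 20% hold-out. A minimal sketch of scoring an OLS fit on a hold-out in the same way is shown here. It uses scikit-learn's `LinearRegression` (ordinary least squares with an intercept) on synthetic data; `X_demo`, `y_demo` and the coefficients are assumptions made for illustration, not values from this dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic demo data (assumption): six features, linear signal plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 6))
true_coef = np.array([3.0, 1.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=2.0, size=1000)

# Same split settings as the Ridge/Lasso cells: 20% hold-out, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

ols = LinearRegression().fit(X_tr, y_tr)
y_hat = ols.predict(X_te)

r2_train = ols.score(X_tr, y_tr)                     # in-sample R²
r2_test = r2_score(y_te, y_hat)                      # hold-out R²
rmse_test = np.sqrt(mean_squared_error(y_te, y_hat))
print(f"R² train: {r2_train:.3f} | R² test: {r2_test:.3f} | RMSE test: {rmse_test:.2f}")
```

Reporting the hold-out R² next to the in-sample one makes the OLS baseline directly comparable with the regularized models.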

>## RidgeCV

In [None]:
# Load the saved features and target
x, y = joblib.load("split_data_linear_model.pkl")

print("Data loaded successfully!")

Data loaded successfully!


In [7]:
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

In [8]:
# Feature scaling
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [10]:
# RidgeCV with cross-validation over a range of alpha values
alphas = np.logspace(-4, 4, 50)
ridge = RidgeCV(alphas = alphas, scoring = "neg_mean_squared_error", cv = 5)
ridge.fit(x_train_scaled, y_train)

# Prediction and evaluation
y_pred = ridge.predict(x_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display results
print(f"Best alpha: {ridge.alpha_:.5f}")
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.2f}")

# Show feature coefficients
coef_df = pd.DataFrame({
    "Feature": x.columns,
    "Ridge Coefficient": ridge.coef_})
print(coef_df)


Best alpha: 11.51395
MAE: 16.51
RMSE: 23.60
R²: 0.67
                  Feature  Ridge Coefficient
0            mass_vehicle          34.898453
1            engine_power           1.455802
2         engine_capacity          -5.704918
3          electric_range           0.820035
4        fuel_consumption          -1.413962
5  specific_co2_emissions           4.189258
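
To see why the selected alpha has so little effect here, it helps to recall the closed-form ridge solution on centered data, beta = (XᵀX + alpha·I)⁻¹ Xᵀy. With standardized features, each diagonal entry of XᵀX equals the number of training rows (roughly 1.7 million here, 80% of the 2,115,211 observations), so adding alpha ≈ 11.5 to it barely changes the solution. The sketch below checks the closed form against scikit-learn's `Ridge` on synthetic data; `X_demo`, `y_demo` and the value of `alpha` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic demo data (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Center features and target, which is what fitting an intercept amounts to
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

alpha = 10.0
# Closed-form ridge solution: (X'X + alpha * I)^-1 X'y
beta_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
beta_sklearn = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_

# Unpenalized least squares for comparison
beta_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

print("closed form:", beta_closed)
print("sklearn    :", beta_sklearn)
print("OLS        :", beta_ols)
```

The ridge coefficients are shrunk towards zero relative to OLS, and the amount of shrinkage depends on how large alpha is compared with the diagonal of XᵀX, which grows with the sample size.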


>## LassoCV

In [11]:
# Scale features
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [None]:
# LassoCV with automatic alpha selection via cross-validation
alphas = np.logspace(-4, 1, 50)
lasso = LassoCV(alphas=alphas, cv = 5, max_iter = 10000, random_state = 42)
lasso.fit(x_train_scaled, y_train)

# Predictions and evaluation
y_pred = lasso.predict(x_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print results
print(f"Best alpha: {lasso.alpha_:.5f}")
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²: {r2:.2f}")

# Coefficients table
coef_df = pd.DataFrame({
    "Feature": x.columns,
    "Lasso Coefficient": lasso.coef_
})
print(coef_df)

Best alpha: 0.00032
MAE: 16.51
RMSE: 23.60
R²: 0.67
                  Feature  Lasso Coefficient
0            mass_vehicle          34.898235
1            engine_power           1.455708
2         engine_capacity          -5.704424
3          electric_range           0.819414
4        fuel_consumption          -1.412677
5  specific_co2_emissions           4.187680
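
All six Lasso coefficients above are non-zero, which is what one would expect at an alpha as small as 0.00032: the L1 penalty only removes features once alpha is large relative to their contribution. The sketch below illustrates this on synthetic data with two informative and four uninformative features; `X_demo`, `y_demo`, `true_coef` and the two alpha values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic demo data (assumption): only features 0 and 3 carry signal
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(500, 6)))
true_coef = np.array([5.0, 0.0, 0.0, 2.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=500)

# Count the surviving coefficients for a very small and a large penalty
n_nonzero = {}
for alpha in (0.0003, 1.0):
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo).coef_
    n_nonzero[alpha] = int(np.count_nonzero(coef))
    print(f"alpha={alpha}: {n_nonzero[alpha]} non-zero coefficients")
```

With the small penalty the fit behaves almost like unpenalized least squares, while the large penalty keeps only the informative features. In this notebook the cross-validation chose the small-penalty regime, so Lasso acted as a near-copy of OLS rather than as a feature selector.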




## Conclusion

The performance of the three linear models (**OLS**, **Ridge**, and **Lasso**) was evaluated as a first benchmark. Despite the theoretical appeal of these models, their performance on this dataset was modest: R² is about 0.67 in every case (in-sample for OLS, on the 20% test set for Ridge and Lasso), and the test-set errors are non-negligible (MAE 16.51 and RMSE 23.60 for both Ridge and Lasso).

- **OLS** served as a baseline (R² = 0.667 on the data it was fitted to), but struggled to capture the complexity of the underlying relationships in the data.
- **Ridge Regression** introduced L2 regularization, but with the selected alpha of about 11.5 its test R² of 0.67 matches the OLS fit, so regularization offered no measurable improvement in predictive power.
- **Lasso Regression** selected a very small alpha (0.00032) and kept all six coefficients non-zero, so it performed no feature selection here; its coefficients and metrics are almost identical to those of Ridge.

These results suggest that **linear models are insufficient to capture the non-linear or interaction effects** likely present in the data. This is not unexpected, given the complexity of the relationships involved in electric energy consumption.

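Although it would have been possible to improve performance by conducting a more specific preprocessing phase tailored to these linear models (including transformations, interaction terms, or advanced feature selection), this path was deliberately not pursued. In particular, the log and inverse features created earlier in this notebook were not included in the six-feature models fitted here.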
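
As an indication of what such tailored preprocessing could look like, the sketch below adds pairwise interaction terms through a scikit-learn pipeline and compares the hold-out R² with that of a plain linear fit. It runs on synthetic data in which the target depends on a product of two features; `X_demo`, `y_demo` and both pipelines are assumptions for illustration and say nothing about the gain that would be obtained on this dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic demo data (assumption): the target contains an interaction x0 * x1
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 2))
y_demo = (
    X_demo[:, 0]
    + 2.0 * X_demo[:, 0] * X_demo[:, 1]
    + rng.normal(scale=0.1, size=2000)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Plain linear model on the raw features
linear = make_pipeline(StandardScaler(), LinearRegression())

# Same model with pairwise interaction terms added before scaling
with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)

r2_linear = linear.fit(X_tr, y_tr).score(X_te, y_te)
r2_interactions = with_interactions.fit(X_tr, y_tr).score(X_te, y_te)
print(f"R² linear: {r2_linear:.3f} | R² with interactions: {r2_interactions:.3f}")
```

When an interaction of this kind is present, a linear model can only recover it if the product term is supplied as a feature, which is the sort of manual feature work that tree-based models make unnecessary.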

Instead, the project will now advance to the use of **more sophisticated models**, beginning with a range of **machine learning algorithms** (such as tree-based models and ensemble techniques), and later extending to **deep learning architectures**. These are expected to yield substantially better predictive accuracy and capture the complex patterns inherent in the data.
