# EV Car Prices

This assignment focuses on car prices. The data ('car_prices.xlsx') is a pre-processed version of original data scraped from bilbasen.dk by previous MAL1 students. The dataset contains 16 columns:

- **Price (DKK)**: The current listed price of the vehicle in Danish Kroner.
- **Model Year**: The manufacturing year of the vehicle.
- **Mileage (km)**: The total kilometres driven by the vehicle (odometer reading).
- **Electric Range (km)**: The estimated maximum driving range on a full charge.
- **Battery Capacity (kWh)**: The total capacity of the vehicle's battery in kilowatt-hours.
- **Energy Consumption (Wh/km)**: The vehicle's energy consumption in watt-hours per kilometre.
- **Annual Road Tax (DKK)**: The annual road tax cost in Danish Kroner.
- **Horsepower (bhp)**: The vehicle's horsepower (brake horsepower).
- **0-100 km/h (s)**: The time (in seconds) for the car to accelerate from 0 to 100 km/h.
- **Top Speed (km/h)**: The maximum speed the vehicle can achieve.
- **Towing Capacity (kg)**: The maximum weight the vehicle can tow.
- **Original Price (DKK)**: The price of the vehicle when first sold as new.
- **Number of Doors**: The total number of doors on the vehicle.
- **Rear-Wheel Drive**: A binary indicator (1 = Yes, 0 = No) for rear-wheel drive.
- **All-Wheel Drive (AWD)**: A binary indicator (1 = Yes, 0 = No) for all-wheel drive.
- **Front-Wheel Drive**: A binary indicator (1 = Yes, 0 = No) for front-wheel drive.

The first one, **Price**, is the response variable.

The **objectives** of this assignment are:
1. Understand how linear algebra is used in Machine Learning, specifically for correlations and regression
2. Learn how to perform multiple linear regression, ridge regression, lasso regression and elastic net
3. Learn how to assess regression models

Please solve the tasks using this notebook as your template, i.e. insert code blocks and markdown blocks into this notebook and hand it in. Please use 42 as your random seed.


## Import data
 - Import the dataset 
 - Split the data in a training set and test set - make sure you extract the response variable
 - Remember to use the data appropriately; in the tasks below, we do not explicitly state when to use train and test - but in order to compare the models, you must use the same dataset for training and testing in all models.
 - Output: When you are done with this, you should have the following sets: `X` (the original dataset), `X_train`, `X_test`, `y_train`, `y_test`

In [2]:
# Code block for importing and creating data sets. Add more code blocks if needed.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_excel('car_prices.xlsx')
print("Data shape:", df.shape)

X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes to verify the split.
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)



Data shape: (6226, 16)
Training set shape: (4980, 15)
Test set shape: (1246, 15)


## Part 1: Linear Algebra
In this assignment, you have to solve all problems using linear algebra concepts. You are free to use SymPy or NumPy - though NumPy is **significantly** more efficient computationally than SymPy since NumPy is optimized for numerical computations with floating-point arithmetic. Since linear regression is purely numerical, NumPy is the better choice.

### Task 1: Covariance Matrix
In class, we simplified and stated that $X^T X$ was the **covariance matrix**. However, the correct **sample covariance matrix** is computed as:

$$
\Sigma = \frac{X_{\text{centered}}^T X_{\text{centered}}}{n-1}
$$

  where $X_{\text{centered}}$ is the **mean-centered dataset**, and $n$ is the sample size.

The steps to compute the covariance matrix are:
   1. **Center the dataset**: Subtract the **mean of each feature (column-wise)** from every value in that column. This results in a **centred dataset** $X_{\text{centered}}$.
   2. **Compute $X_{\text{centered}}^T X_{\text{centered}}$**: Multiply the transposed centred dataset by itself.
   3. **Normalize by $n-1$**: Divide the resulting matrix by $n-1$ to obtain the final covariance matrix.

Please **display the covariance matrix** in an easy-to-read format (e.g., using Pandas). Here you use all data, i.e. `X`.
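The three steps above can be sanity-checked on a small example before applying them to `X`. The sketch below uses a made-up toy matrix (`X_toy` is a hypothetical name, not part of the assignment data) and compares the manual result with `np.cov`, which also normalizes by $n-1$ by default.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 features on very different scales.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(5, 3)) * np.array([1.0, 10.0, 100.0])

# Step 1: center each column by subtracting its mean.
X_toy_centered = X_toy - X_toy.mean(axis=0)

# Steps 2 and 3: form X_c^T X_c and divide by n - 1.
cov_manual = (X_toy_centered.T @ X_toy_centered) / (X_toy.shape[0] - 1)

# np.cov treats rows as variables by default, so pass rowvar=False.
cov_numpy = np.cov(X_toy, rowvar=False)

matches = np.allclose(cov_manual, cov_numpy)
print("Manual covariance matches np.cov:", matches)
```

The diagonal of the result holds the variance of each feature, which is why features measured in large units dominate the matrix.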





In [3]:
# Use this for Task 1. Add more code blocks if needed.
import numpy as np

X_centered = X - np.mean(X, axis=0)

n_samples = X.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)

print("Covariance Matrix of X using Linear Algebra:")
cov_matrix_df = pd.DataFrame(cov_matrix, columns=df.columns[1:], index=df.columns[1:])

cov_matrix_df.round(2)

Covariance Matrix of X using Linear Algebra:


| | Model Year | Mileage (km) | Electric Range (km) | Battery Capacity (kWh) | Energy Consumption (Wh/km) | Annual Road Tax (DKK) | Horsepower (bhp) | 0-100 km/h (s) | Top Speed (km/h) | Towing Capacity (kg) | Original Price (DKK) | Number of Doors | Rear-Wheel Drive | All-Wheel Drive (AWD) | Front-Wheel Drive |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model Year | 1.97 | -22321.68 | 60.92 | 9.71 | 4.72 | -7.19 | 24.88 | -0.48 | 4.93 | 81.84 | 32748.08 | 0.06 | 0.09 | 0.03 | -0.12 |
| Mileage (km) | -22321.68 | 619783400.0 | -213060.24 | -31194.97 | -8263.15 | 129238.55 | 257683.88 | -2795.57 | 91564.86 | -299666.57 | 159944300.0 | -441.98 | -439.06 | 908.02 | -468.96 |
| Electric Range (km) | 60.92 | -213060.2 | 9673.93 | 1460.7 | 314.73 | -104.55 | 5924.76 | -87.1 | 1353.54 | 6245.94 | 6989896.0 | 12.06 | 13.07 | 11.12 | -24.19 |
| Battery Capacity (kWh) | 9.71 | -31194.97 | 1460.7 | 413.36 | 240.57 | 48.29 | 1347.64 | -18.67 | 264.04 | 2174.63 | 2022926.0 | 2.88 | 1.07 | 3.8 | -4.86 |
| Energy Consumption (Wh/km) | 4.72 | -8263.15 | 314.73 | 240.57 | 595.2 | 173.44 | 1328.02 | -16.75 | 198.43 | 3670.31 | 2687909.0 | 3.69 | -1.84 | 5.44 | -3.6 |
| Annual Road Tax (DKK) | -7.19 | 129238.6 | -104.55 | 48.29 | 173.44 | 983.71 | 448.83 | -5.87 | 85.06 | 1068.0 | 1100685.0 | 0.41 | -1.31 | 2.21 | -0.9 |
| Horsepower (bhp) | 24.88 | 257683.9 | 5924.76 | 1347.64 | 1328.02 | 448.83 | 12068.24 | -174.6 | 2475.43 | 16919.7 | 13413380.0 | 3.5 | -8.01 | 38.06 | -30.05 |
| 0-100 km/h (s) | -0.48 | -2795.57 | -87.1 | -18.67 | -16.75 | -5.87 | -174.6 | 3.17 | -37.0 | -241.74 | -181337.8 | 0.0 | 0.12 | -0.54 | 0.42 |
| Top Speed (km/h) | 4.93 | 91564.86 | 1353.54 | 264.04 | 198.43 | 85.06 | 2475.43 | -37.0 | 684.74 | 2711.77 | 2796699.0 | -1.07 | 0.3 | 6.71 | -7.0 |
| Towing Capacity (kg) | 81.84 | -299666.6 | 6245.94 | 2174.63 | 3670.31 | 1068.0 | 16919.7 | -241.74 | 2711.77 | 124307.92 | 20184330.0 | 23.19 | -30.86 | 71.62 | -40.76 |


### Task 2: Correlation matrix
The **correlation matrix** is closely related to the **covariance matrix**, as both describe relationships between variables. While the **covariance matrix** measures how two variables vary **together** in absolute terms, the **correlation matrix** standardizes these values by dividing by the standard deviations of each variable. This makes the correlation matrix **dimensionless**, allowing direct comparison between different variables regardless of their units or scale. Essentially, the correlation matrix is a **normalized version** of the covariance matrix, ensuring all values are between **-1 and 1**, making it easier to interpret.

The **correlation matrix** $ \mathbf{R} $ can be computed from the **covariance matrix** $ \mathbf{\Sigma} $ (found in Task 1) using the formula:

$$
\mathbf{R} = \mathbf{D}^{-1} \mathbf{\Sigma} \mathbf{D}^{-1}
$$

where:
- $ \mathbf{\Sigma} $ is the **covariance matrix**.
- $ \mathbf{D} $ is a **diagonal matrix** where each diagonal entry is the **standard deviation** of the corresponding variable, i.e., $ D_{ii} = \sqrt{\Sigma_{ii}} $.

Alternatively, in element-wise form:

$$
R_{ij} = \frac{\Sigma_{ij}}{\sigma_i \sigma_j}
$$

where:
- $ \Sigma_{ij} $ is the **covariance** between variables $ i $ and $ j $.
- $ \sigma_i $ and $ \sigma_j $ are the **standard deviations** of variables $ i $ and $ j $, respectively.

This transformation **normalizes** the covariance values, ensuring the correlation coefficients range between **-1 and 1**, making them comparable across different variables.
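As a minimal sketch of the two formulas above, the block below builds $\mathbf{D}^{-1}$ explicitly and checks that the matrix form, the element-wise form, and `np.corrcoef` agree. The toy data and the names `X_toy`, `Sigma`, and `D_inv` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy data: 50 samples, 4 features on different scales.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 4)) * np.array([1.0, 5.0, 20.0, 100.0])

# Sample covariance matrix (normalized by n - 1).
Sigma = np.cov(X_toy, rowvar=False)

# Matrix form: R = D^{-1} Sigma D^{-1}, with D = diag(sqrt(Sigma_ii)).
sigma = np.sqrt(np.diag(Sigma))
D_inv = np.diag(1.0 / sigma)
R_matrix_form = D_inv @ Sigma @ D_inv

# Element-wise form: R_ij = Sigma_ij / (sigma_i * sigma_j).
R_elementwise = Sigma / np.outer(sigma, sigma)

# Library reference for comparison.
R_numpy = np.corrcoef(X_toy, rowvar=False)

print(np.allclose(R_matrix_form, R_elementwise), np.allclose(R_matrix_form, R_numpy))
```

Because $\mathbf{D}$ is diagonal, its inverse is just the reciprocal of each standard deviation, so no general matrix inversion is needed.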

In [4]:
# Use this for Task 2. Add more code blocks if needed.

std_dev = np.sqrt(np.diag(cov_matrix))

correlation_matrix = cov_matrix / np.outer(std_dev, std_dev)

correlation_matrix_df = pd.DataFrame(correlation_matrix, columns=df.columns[1:], index=df.columns[1:])

correlation_matrix_df.round(2)


| | Model Year | Mileage (km) | Electric Range (km) | Battery Capacity (kWh) | Energy Consumption (Wh/km) | Annual Road Tax (DKK) | Horsepower (bhp) | 0-100 km/h (s) | Top Speed (km/h) | Towing Capacity (kg) | Original Price (DKK) | Number of Doors | Rear-Wheel Drive | All-Wheel Drive (AWD) | Front-Wheel Drive |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model Year | 1.0 | -0.64 | 0.44 | 0.34 | 0.14 | -0.16 | 0.16 | -0.19 | 0.13 | 0.17 | 0.14 | 0.08 | 0.14 | 0.04 | -0.18 |
| Mileage (km) | -0.64 | 1.0 | -0.09 | -0.06 | -0.01 | 0.17 | 0.09 | -0.06 | 0.14 | -0.03 | 0.04 | -0.03 | -0.04 | 0.09 | -0.04 |
| Electric Range (km) | 0.44 | -0.09 | 1.0 | 0.73 | 0.13 | -0.03 | 0.55 | -0.5 | 0.53 | 0.18 | 0.42 | 0.22 | 0.27 | 0.26 | -0.51 |
| Battery Capacity (kWh) | 0.34 | -0.06 | 0.73 | 1.0 | 0.48 | 0.08 | 0.6 | -0.52 | 0.5 | 0.3 | 0.59 | 0.25 | 0.11 | 0.44 | -0.49 |
| Energy Consumption (Wh/km) | 0.14 | -0.01 | 0.13 | 0.48 | 1.0 | 0.23 | 0.5 | -0.39 | 0.31 | 0.43 | 0.65 | 0.27 | -0.16 | 0.52 | -0.3 |
| Annual Road Tax (DKK) | -0.16 | 0.17 | -0.03 | 0.08 | 0.23 | 1.0 | 0.13 | -0.11 | 0.1 | 0.1 | 0.21 | 0.02 | -0.09 | 0.16 | -0.06 |
| Horsepower (bhp) | 0.16 | 0.09 | 0.55 | 0.6 | 0.5 | 0.13 | 1.0 | -0.89 | 0.86 | 0.44 | 0.72 | 0.06 | -0.15 | 0.81 | -0.56 |
| 0-100 km/h (s) | -0.19 | -0.06 | -0.5 | -0.52 | -0.39 | -0.11 | -0.89 | 1.0 | -0.79 | -0.39 | -0.6 | 0.0 | 0.14 | -0.71 | 0.49 |
| Top Speed (km/h) | 0.13 | 0.14 | 0.53 | 0.5 | 0.31 | 0.1 | 0.86 | -0.79 | 1.0 | 0.29 | 0.63 | -0.07 | 0.02 | 0.6 | -0.55 |
| Towing Capacity (kg) | 0.17 | -0.03 | 0.18 | 0.3 | 0.43 | 0.1 | 0.44 | -0.39 | 0.29 | 1.0 | 0.34 | 0.12 | -0.18 | 0.47 | -0.24 |


### Task 3: Regression



Linear regression finds the best-fitting line (or hyperplane) by solving for the **coefficient vector** $\mathbf{B}$ that minimizes the squared error:

$$
\mathbf{B} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$

where:
- $\mathbf{X}$ is the **design matrix**, including a column of ones for the intercept.
- $\mathbf{y}$ is the **response variable** (target values).
- $\mathbf{B}$ contains the **regression coefficients**.

**Explanation of Each Step**
1. **Construct the matrix $X$**:
   - Each **row** represents a data point.
   - Each **column** represents a feature.
   - The **first column is all ones** to account for the **intercept**.

2. **Solve for $\mathbf{B}$ using the normal equation**:
   - Compute $X^T X$ (the Gram matrix of the features; it is proportional to the covariance matrix only when the columns are mean-centered).
   - Compute $X^T y$ (cross-product with the target variable).
   - Compute the **inverse of $X^T X$** and multiply by $X^T y$ to get $\mathbf{B}$.

3. **Interpret the results**:
   - The **first value** in $\mathbf{B}$ is the **intercept**.
   - The remaining values are the **coefficients for each feature**.
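The steps above can be sketched on a tiny example where the answer is known in advance. The data below are made up so that $y = 2 + 3x$ holds exactly, and the names `X_design` and `B_toy` are illustrative. Note that the sketch swaps the explicit inverse for `np.linalg.solve`, which solves $(X^T X)\,\mathbf{B} = X^T \mathbf{y}$ directly and is usually more stable numerically.

```python
import numpy as np

# Hypothetical toy data generated from y = 2 + 3 * x (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 + 3.0 * x

# Step 1: design matrix with a leading column of ones for the intercept.
X_design = np.column_stack((np.ones_like(x), x))

# Step 2: normal equation, solved as a linear system instead of inverting X^T X.
XtX = X_design.T @ X_design
Xty = X_design.T @ y_toy
B_toy = np.linalg.solve(XtX, Xty)

# Cross-check with NumPy's least-squares solver.
B_lstsq, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)

# Step 3: first entry is the intercept, the rest are feature coefficients.
print("Intercept:", B_toy[0], "Slope:", B_toy[1])
```

Both solvers should recover an intercept of 2 and a slope of 3 up to floating-point error.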

Once we have the regression coefficients $\mathbf{B}$, we can evaluate how well the model fits the data using two key metrics:

1. **Mean Squared Error (MSE)** – Measures the average squared difference between the predicted and actual values:
   $$
   MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
   $$
   - Lower MSE means better fit.

2. **$R^2$ (Coefficient of Determination)** – Measures how much of the variance in $y$ is explained by $X$:
   $$
   R^2 = 1 - \frac{\sum (y - \hat{y})^2}{\sum (y - \bar{y})^2}
   $$
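   - For an OLS model with an intercept, evaluated on its own training data, $R^2$ lies between **0 and 1**, where **1** indicates a perfect fit and **0** means the model explains no variance. On test data, $R^2$ can also be negative.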


**Explanation of Each Evaluation Step**
1. **Compute Predictions**:  
   $$ \hat{y} = X B $$
   This gives the model’s predicted values.

2. **Compute MSE**:  
   - We square the residuals $ (y - \hat{y})^2 $ and take the mean.

3. **Compute $R^2$**:
   - **Total sum of squares** $ SS_{total} $ measures the total variance in $ y $.
   - **Residual sum of squares** $ SS_{residual} $ measures the variance left unexplained by the model.
   - $ R^2 $ tells us what fraction of variance is explained.

**Interpreting the Results**
- **MSE**: Lower values indicate a better fit.
- **$R^2$ Score**:
  - **$R^2 = 1$** → Perfect fit (all points on the regression line).
  - **$R^2 = 0$** → Model is no better than predicting the mean of $ y $.
  - **$R^2 < 0$** → Model performs worse than a simple average.
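The three $R^2$ cases listed above can be reproduced with a few hand-picked predictions. This is a minimal sketch on made-up numbers; `mse_and_r2` is a hypothetical helper, not a library function.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) computed from the definitions in the text."""
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_residual = np.sum(residuals ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    return mse, 1.0 - ss_residual / ss_total

# Hypothetical targets with mean 3 and total sum of squares 10.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect predictions: MSE = 0, R^2 = 1.
mse_perfect, r2_perfect = mse_and_r2(y_true, y_true.copy())

# Always predicting the mean: MSE = 2, R^2 = 0.
mse_mean, r2_mean = mse_and_r2(y_true, np.full(5, 3.0))

# Predictions in reversed order: MSE = 8, R^2 = 1 - 40/10 = -3.
mse_bad, r2_bad = mse_and_r2(y_true, y_true[::-1])

print(r2_perfect, r2_mean, r2_bad)
```

The last case shows that a model can be worse than the constant mean prediction, which is exactly when $R^2$ drops below zero.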

Implement the above steps using linear algebra so that you both create a regression model and calculate the MSE and $R^2$. Note, here you need to use `X_train`, `X_test`, `y_train` and `y_test` appropriately!


In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_excel('car_prices.xlsx')
#df = df.drop(['All-Wheel Drive (AWD)', 'Front-Wheel Drive', 'Rear-Wheel Drive'], axis='columns')
print("Data shape:", df.shape)

# Extract features and target
X = df.iloc[:, 1:].values  # Feature matrix (excluding target column)
y = df.iloc[:, 0].values   # Target variable (first column)

# Add an intercept column (column of ones) to X
X = np.column_stack((np.ones(X.shape[0]), X))

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print shapes to verify the split
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


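# Compute regression coefficients using the normal equation.
# pinv (pseudo-inverse) is used instead of inv because X^T X can be singular here,
# e.g. if the three drive-type indicator columns always sum to the intercept column.
B = np.linalg.pinv(X_train.T @ X_train) @ X_train.T @ y_train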


# Predictions
y_pred = X_test @ B

# Compute MSE and R² score
mse = np.mean((y_test - y_pred) ** 2)
ss_total = np.sum((y_test - np.mean(y_test)) ** 2)
ss_residual = np.sum((y_test - y_pred) ** 2)
r2 = 1 - (ss_residual / ss_total)

# Print results
print(f"Regression Coefficients (B): {B}")
print(f"Mean Squared Error (MSE) on test set: {mse:.2f}")
print(f"R² Score on test set: {r2:.2f}")


Data shape: (6226, 16)
Training set shape: (4980, 16)
Test set shape: (1246, 16)
Regression Coefficients (B): [-7.60989394e-01  7.19112298e+01 -1.21357928e+00  2.09557799e+02
  2.94950867e+01  3.11165840e+02 -4.24531271e+02 -2.04132747e+01
  4.06304924e+03  1.92571354e+02  2.57020754e+01  7.26956423e-01
  1.15962706e+02  9.65604322e+03 -7.79276708e+03 -1.86403713e+03]
Mean Squared Error (MSE) on test set: 3032238857.47
R² Score on test set: 0.85


In [6]:
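# Use this for Task 3. Add more code blocks if needed.
import numpy as np

# Normal equation with an explicit inverse. This assumes X_train already contains
# the intercept column of ones and that X^T X is invertible (no perfectly collinear columns).
B = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train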

y_pred = X_test @ B

mse = np.mean((y_test - y_pred) ** 2)

ss_total = np.sum((y_test - np.mean(y_test)) ** 2)
ss_residual = np.sum((y_test - y_pred) ** 2)
r2 = 1 - (ss_residual / ss_total)

print(f"Regression Coefficients (B): {B}")
print(f"Mean Squared Error (MSE) on test set: {mse:.2f}")
print(f"R² Score on test set: {r2:.2f}")


Regression Coefficients (B): [-3.68790568e+07  1.82848934e+04 -6.17338869e-01  1.30061548e+02
  3.80657532e+01  1.13990601e+02 -3.51845414e+02 -2.11808036e+01
  6.71412428e+03  2.64351560e+02  1.85628527e+01  7.39893302e-01
  4.02920409e+03]
Mean Squared Error (MSE) on test set: 2801332611.11
R² Score on test set: 0.86
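The two Task 3 cells above differ in using `np.linalg.pinv` versus `np.linalg.inv`. The sketch below illustrates why that choice can matter, under the assumption (not verified against the data) that every car has exactly one drive type, so the three indicator columns add up to the intercept column. The toy data and the names `drive_type` and `dummies` are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 30 cars, each with exactly one of three drive types.
n = 30
drive_type = np.arange(n) % 3
dummies = np.eye(3)[drive_type]                     # one-hot indicator columns
X_design = np.column_stack((np.ones(n), dummies))   # intercept + 3 indicators
y_toy = 100.0 + dummies @ np.array([10.0, 20.0, 30.0])

# The four columns are linearly dependent, so the rank is 3, not 4,
# and X^T X has no ordinary inverse.
rank = np.linalg.matrix_rank(X_design)

# The pseudo-inverse still yields a least-squares solution (the minimum-norm one).
# rcond is set explicitly so the near-zero singular value is discarded.
XtX = X_design.T @ X_design
B_pinv = np.linalg.pinv(XtX, rcond=1e-10) @ X_design.T @ y_toy

# Individual coefficients are not unique here, but the fitted values are.
y_hat = X_design @ B_pinv
fits_exactly = np.allclose(y_hat, y_toy)
print("rank:", rank, "fits exactly:", fits_exactly)
```

Dropping one indicator column (or the intercept) removes the dependency, after which the explicit inverse is well defined and the coefficients become interpretable relative to the dropped category.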


## Part 2: Using Library Functions

### Task 4: Correlation and OLS
For this task you must do the following:
 - Using library functions, build the following models:
   - Correlation matrix where the correlations are printed in the matrix and a heat map is overlaid
   - Ordinary least squares
   - Performance metrics: MSE, RMSE, $R^2$
   - Comment on the real world meaning of RMSE and $R^2$
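As a worked example of the RMSE interpretation, the sketch below takes the test-set MSE printed by the last Task 3 cell (2801332611.11, with $R^2$ = 0.86) and converts it to an RMSE. The variable names are illustrative.

```python
import math

# Test-set MSE as printed in the Task 3 output (unit: DKK squared).
mse_reported = 2801332611.11

# RMSE is the square root of the MSE, so it is back in DKK.
rmse = math.sqrt(mse_reported)

print(f"RMSE: {rmse:,.0f} DKK")
```

An RMSE of roughly 53,000 DKK can be read as the typical size of a price prediction error on the test set, with large errors weighted more heavily. The accompanying $R^2$ of 0.86 means the model accounts for about 86% of the variance in the test-set prices.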


In [7]:
# Use this for Task 4. Add more code blocks if needed.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Use a helper function to evaluate the metrics
def evaluate_model(model, X, y, name):
    y_pred = model.predict(X)
    mse = mean_squared_error(y, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y, y_pred)

    print(f"\n{name}:")
    print(f"  MSE: {mse:.4f}")
    print(f"  RMSE: {rmse:.4f}")
    print(f"  R²: {r2:.4f}")

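# Correlate the 15 feature columns only. X may carry an extra intercept column from Task 3,
# and the labels must be the feature names (df.columns[1:]), not df.columns[:-1].
features = df.iloc[:, 1:]
correlation_matrix = np.corrcoef(features.values, rowvar=False)
correlation_matrix_df = pd.DataFrame(correlation_matrix, columns=features.columns, index=features.columns)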

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_df, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()

# OLS
ols_model = LinearRegression()
ols_model.fit(X_train, y_train)
evaluate_model(ols_model, X_test, y_test, "Ordinary Least Squares")

ModuleNotFoundError: No module named 'seaborn'

### Task 5: Ridge, Lasso and Elastic Net
For Ridge, Lasso and Elastic Net to behave sensibly, you must build the models on scaled data. The regularization penalty depends on coefficient magnitude, so with unscaled features the penalty affects the coefficients unequally. Feel free to use this code to scale the data:

```python
# Standardize X
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Standardize y
scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()
y_test_scaled = scaler_y.transform(y_test.reshape(-1, 1)).flatten()
```
For the final task you must do the following:
 - Build the following models:
   - Ridge regression (using multiple alphas)
   - Lasso regression (using multiple alphas)
   - Elastic Net (using multiple alphas)
 - Discussion and conclusion:
   - Discuss the MSE and $R^2$ of all 3 models and conclude which model has the best performance - note the MSE will be scaled!
   - Rebuild the OLS model from Task 4, but this time use the scaled data from this task - interpret the meaning of the model's coefficients
   - Use the coefficients of the best ridge and lasso model to print the 5 most important features and compare to the 5 most important features in the OLS with scaled data model. Do the models agree about which features are the most important?
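Because the response is standardized, the MSE of these models is expressed in units of the training-set variance of `y`, not in DKK². The sketch below, on made-up numbers, shows how a scaled MSE maps back to the original unit. It standardizes by hand with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; all variable names are hypothetical.

```python
import numpy as np

# Hypothetical prices and predictions in DKK.
rng = np.random.default_rng(42)
y_train_toy = rng.normal(300_000, 150_000, size=200)
y_test_toy = rng.normal(300_000, 150_000, size=50)
y_pred_toy = y_test_toy + rng.normal(0, 50_000, size=50)

# Standardization parameters come from the training set only.
mu = y_train_toy.mean()
sigma = y_train_toy.std()  # ddof=0, matching StandardScaler

def scale(values):
    """Standardize with the training mean and standard deviation."""
    return (values - mu) / sigma

def r2(y_true, y_pred):
    """Coefficient of determination from its definition."""
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

mse_dkk = np.mean((y_test_toy - y_pred_toy) ** 2)
mse_scaled = np.mean((scale(y_test_toy) - scale(y_pred_toy)) ** 2)

# Scaled MSE times sigma^2 gives the MSE in DKK^2; R^2 is unaffected by scaling.
mse_recovered = mse_scaled * sigma ** 2
print(np.isclose(mse_recovered, mse_dkk))
print(np.isclose(r2(y_test_toy, y_pred_toy), r2(scale(y_test_toy), scale(y_pred_toy))))
```

In the notebook, the corresponding factor would be `scaler_y.scale_[0]`, so a scaled RMSE multiplied by that value gives an RMSE in DKK.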

Note: You may get a convergence warning; try increasing the `max_iter` parameter of the model (the default is 1000 - maybe set it to 100000)
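To connect the regularized models back to Part 1, here is a minimal sketch of ridge regression in closed form, $\mathbf{B} = (X^T X + \alpha I)^{-1} X^T \mathbf{y}$, on made-up standardized data. It assumes centered features and a centered response, so no intercept column is needed. The names `ridge_closed_form` and `alphas_toy` are hypothetical, and this is an illustration of the penalty's effect rather than the solver that scikit-learn uses internally.

```python
import numpy as np

# Hypothetical standardized toy data: 100 samples, 3 features.
rng = np.random.default_rng(42)
n, p = 100, 3
X_toy = rng.normal(size=(n, p))
X_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)
y_toy = y_toy - y_toy.mean()

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) B = X^T y for B."""
    identity = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + alpha * identity, X.T @ y)

# Larger alpha means a stronger penalty and smaller coefficients.
alphas_toy = [0.0, 1.0, 10.0, 100.0, 1000.0]
coefs = [ridge_closed_form(X_toy, y_toy, a) for a in alphas_toy]
norms = [np.linalg.norm(b) for b in coefs]

# With alpha = 0 the ridge solution coincides with ordinary least squares.
B_ols, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

for a, b in zip(alphas_toy, coefs):
    print(f"alpha={a:>7}: {np.round(b, 3)}")
```

The coefficient norm shrinks steadily as alpha grows, which is why the choice of alpha grid matters when comparing Ridge, Lasso and Elastic Net.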

In [None]:
# Use this for Task 5. Add more code blocks if needed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Standardize X
scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

# Standardize y
scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.reshape(-1, 1)).flatten()
y_test_scaled = scaler_y.transform(y_test.reshape(-1, 1)).flatten()

# Candidate alphas (a different grid might be a better choice)
alphas = [0.01, 0.1, 1, 10]

# -------------------
# **OLS on the scaled data**
# -------------------
ols_scaled_model = LinearRegression()
ols_scaled_model.fit(X_train_scaled, y_train_scaled)  

# -------------------
# **Ridge**
# -------------------
best_ridge, best_ridge_alpha = None, None
best_ridge_mse = float("inf")

for alpha in alphas:
    ridge_model = Ridge(alpha=alpha, max_iter=100000)
    ridge_model.fit(X_train_scaled, y_train_scaled)  
    
    mse = mean_squared_error(y_test_scaled, ridge_model.predict(X_test_scaled))
    if mse < best_ridge_mse:
        best_ridge_mse = mse
        best_ridge = ridge_model
        best_ridge_alpha = alpha

# -------------------
# **Lasso**
# -------------------
best_lasso, best_lasso_alpha = None, None
best_lasso_mse = float("inf")

for alpha in alphas:
    lasso_model = Lasso(alpha=alpha, max_iter=100000)
    lasso_model.fit(X_train_scaled, y_train_scaled)
    
    mse = mean_squared_error(y_test_scaled, lasso_model.predict(X_test_scaled))
    if mse < best_lasso_mse:
        best_lasso_mse = mse
        best_lasso = lasso_model
        best_lasso_alpha = alpha

# -------------------
# **Elastic Net**
# -------------------
best_elastic, best_elastic_alpha = None, None
best_elastic_mse = float("inf")

for alpha in alphas:
    elastic_net_model = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=100000)
    elastic_net_model.fit(X_train_scaled, y_train_scaled)
    
    mse = mean_squared_error(y_test_scaled, elastic_net_model.predict(X_test_scaled))
    if mse < best_elastic_mse:
        best_elastic_mse = mse
        best_elastic = elastic_net_model
        best_elastic_alpha = alpha

# -------------------
# **Metrics for the best models**
# -------------------
print("\nBest Model Results:")

evaluate_model(ols_scaled_model, X_test_scaled, y_test_scaled, "OLS Regression (Scaled Data)")
evaluate_model(best_ridge, X_test_scaled, y_test_scaled, f"Best Ridge Regression (Alpha={best_ridge_alpha})")
evaluate_model(best_lasso, X_test_scaled, y_test_scaled, f"Best Lasso Regression (Alpha={best_lasso_alpha})")
evaluate_model(best_elastic, X_test_scaled, y_test_scaled, f"Best Elastic Net Regression (Alpha={best_elastic_alpha})")



Best Model Results:

OLS Regression (Scaled Data):
  MSE: 0.1252
  RMSE: 0.3538
  R²: 0.8644

Best Ridge Regression (Alpha=0.01):
  MSE: 0.1252
  RMSE: 0.3538
  R²: 0.8644

Best Lasso Regression (Alpha=0.01):
  MSE: 0.1229
  RMSE: 0.3506
  R²: 0.8669

Best Elastic Net Regression (Alpha=0.01):
  MSE: 0.1238
  RMSE: 0.3518
  R²: 0.8659


In [None]:
feature_names = df.columns[1:]  

ols_coefficients = ols_scaled_model.coef_

sorted_indices = np.argsort(np.abs(ols_coefficients))[::-1]  # Sort descending

# All features
print("\nFeature Importance - OLS Regression (Scaled Data):")
for feature, coef in zip(feature_names, ols_coefficients):
    print(f"{feature}: {coef:.4f}")

# The 5 most important
print("\nTop 5 Most Important Features - OLS Regression (Scaled Data):")
for i in range(5):
    feature = feature_names[sorted_indices[i]]
    coef_value = ols_coefficients[sorted_indices[i]]
    print(f"{i+1}. {feature}: {coef_value:.4f}")



Feature Importance - OLS Regression (Scaled Data):
Model Year: 0.1680
Mileage (km): -0.1032
Electric Range (km): 0.0706
Battery Capacity (kWh): 0.0054
Energy Consumption (Wh/km): 0.0182
Annual Road Tax (DKK): -0.0627
Horsepower (bhp): 0.0180
0-100 km/h (s): 0.0770
Top Speed (km/h): 0.0256
Towing Capacity (kg): 0.0475
Original Price (DKK): 0.8500
Number of Doors: 0.0111
Rear-Wheel Drive: 0.0244
All-Wheel Drive (AWD): -0.0145
Front-Wheel Drive: -0.0117

Top 5 Most Important Features - OLS Regression (Scaled Data):
1. Original Price (DKK): 0.8500
2. Model Year: 0.1680
3. Mileage (km): -0.1032
4. 0-100 km/h (s): 0.0770
5. Electric Range (km): 0.0706


In [None]:
def print_top_features(model, model_name):
    """Prints the 5 most important features of a trained regression model."""
    coefficients = model.coef_
    
    sorted_indices = np.argsort(np.abs(coefficients))[::-1]  # Sort descending
    
    print(f"\nTop 5 Most Important Features - {model_name}:")
    for i in range(5):
        feature = feature_names[sorted_indices[i]]
        coef_value = coefficients[sorted_indices[i]]
        print(f"{i+1}. {feature}: {coef_value:.4f}")

# Top 5 for ridge and lasso
print_top_features(best_ridge, f"Ridge Regression (Alpha={best_ridge_alpha})")
print_top_features(best_lasso, f"Lasso Regression (Alpha={best_lasso_alpha})")



Top 5 Most Important Features - Ridge Regression (Alpha=0.01):
1. Original Price (DKK): 0.8500
2. Model Year: 0.1680
3. Mileage (km): -0.1032
4. 0-100 km/h (s): 0.0770
5. Electric Range (km): 0.0706

Top 5 Most Important Features - Lasso Regression (Alpha=0.01):
1. Original Price (DKK): 0.8419
2. Model Year: 0.1691
3. Mileage (km): -0.0948
4. Electric Range (km): 0.0634
5. Annual Road Tax (DKK): -0.0512
