
Source of data: Electric Vehicle Population Data https://catalog.data.gov/dataset/electric-vehicle-population-data

The data is sourced from data.gov and contains registration records, based on registered VINs, for electric vehicles within the state of Washington.

Downloaded and imported CSV into Google Drive for Public Viewing: https://docs.google.com/spreadsheets/d/1yh6j5HBTlhZzFQy-7JixcnW0CoILKn4x3ioi_LbqmGc/edit?gid=443715183#gid=443715183

In [None]:
#importing the gdown library to import the excel file
import gdown

#importing pandas to create a dataframe, statsmodels for regression models (not used in the matrix steps below), and numpy for matrix operations
import pandas as pd
import statsmodels.api as sm
import numpy as np

In [None]:
#Calling gdown and the file id
!gdown 1yh6j5HBTlhZzFQy-7JixcnW0CoILKn4x3ioi_LbqmGc

Downloading...
From (original): https://drive.google.com/uc?id=1yh6j5HBTlhZzFQy-7JixcnW0CoILKn4x3ioi_LbqmGc
From (redirected): https://docs.google.com/spreadsheets/d/1yh6j5HBTlhZzFQy-7JixcnW0CoILKn4x3ioi_LbqmGc/export?format=xlsx
To: /content/Electric_Vehicle_Population_Data.xlsx
21.7MB [00:00, 74.5MB/s]


In [None]:
# Load the full dataset; the 10-row sample is drawn from it further down.
# This file contains the 'Base MSRP', 'Electric Range', and 'Model Year' data.
# Define a dataframe named 'Evehicles' using pd.read_excel to read the file downloaded in the previous step.
# The input parameter is the location of the file after calling gdown.

Evehicles=pd.read_excel('/content/Electric_Vehicle_Population_Data.xlsx')

# SELECT CONTINUOUS COLUMNS
# Create a new DataFrame containing only the required continuous variables.
# Using .copy() to create a new Dataframe "Evehicles_10cols" so that any modifications will not affect the original dataframe Evehicles
# Y: 'Base MSRP', X1: 'Electric Range', X2: 'Model Year'
cols_of_interest = ['Base MSRP', 'Electric Range', 'Model Year']
Evehicles_10cols = Evehicles[cols_of_interest].copy()

# CLEAN DATA (HANDLE MISSING VALUES)
# Drop any rows where these specific variables have missing (NaN) values, as matrix calculations require complete data.
Evehicles_10cols.dropna(inplace=True)

# RANDOMLY SELECT 10 ROWS
# random_state=42 fixes the seed so the same 10 rows are selected on every run.
# The resulting DataFrame, Evehicles_10_rows, is used for all matrix calculations.
np.random.seed(42)  # not required by .sample() when random_state is given
Evehicles_10_rows = Evehicles_10cols.sample(n=10, random_state=42)

# CHECK CORRELATION BETWEEN PREDICTORS (X1 and X2)
#Running this code shows the correlation coefficient between 'Electric Range' and 'Model Year' is -0.7025.
#Correlation coefficients range from -1 to +1.
#A value close to 0 indicates a weak linear relationship, while values close to -1 or +1 indicate a strong linear relationship (negative or positive, respectively).
correlation = Evehicles_10_rows['Electric Range'].corr(Evehicles_10_rows['Model Year'])

print("Data Selection Summary")
print(f"10-Row Sample Data Head:\n{Evehicles_10_rows.head()}")
print(f"\nCorrelation between Predictors ('Electric Range' and 'Model Year'): {correlation:.4f}")




Data Selection Summary
10-Row Sample Data Head:
        Base MSRP  Electric Range  Model Year
143641        0.0            25.0        2021
157153        0.0             0.0        2022
134641        0.0             0.0        2025
95955         0.0            25.0        2018
100191        0.0             0.0        2025

Correlation between Predictors ('Electric Range' and 'Model Year'): -0.7025
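
To put the correlation of -0.7025 in context, one common diagnostic is the variance inflation factor (VIF). With exactly two predictors it reduces to 1 / (1 - r^2). The snippet below is a minimal sketch that plugs in the correlation printed above; the variable names are illustrative and are not part of the original notebook.

```python
# Correlation between 'Electric Range' and 'Model Year', copied from the output above
r = -0.7025

# With exactly two predictors, the VIF of each predictor is 1 / (1 - r^2)
r_squared = r ** 2
vif = 1.0 / (1.0 - r_squared)

print(f"r^2 = {r_squared:.4f}")
print(f"VIF = {vif:.4f}")
```

A VIF close to 2 means the variance of each slope estimate is roughly doubled compared with what uncorrelated predictors would give.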


In [None]:
# Define Y (Response) and X1, X2 (Predictors).
# Both predictors are continuous. Their sample correlation is -0.7025, a fairly strong
# negative linear relationship, so they are correlated but not perfectly collinear.
# Using .values.reshape(-1, 1) to turn each column into an (n, 1) column vector.
Y = Evehicles_10_rows['Base MSRP'].values.reshape(-1, 1)        # Response vector (Y)
X1 = Evehicles_10_rows['Electric Range'].values.reshape(-1, 1)  # Predictor 1 (X1)
X2 = Evehicles_10_rows['Model Year'].values.reshape(-1, 1)      # Predictor 2 (X2)

# Create Design Matrix X: [Column of 1s | X1 | X2]
# The column of 1s is for the intercept (b0).
X = np.hstack([np.ones(Y.shape), X1, X2])
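
For reference, the matrix steps that follow correspond to the standard ordinary least squares formulas. This is a summary sketch in the notebook's own notation (X, Y, B, e), with n = 10 rows and p = 3 columns taken from the design matrix built above; the hat symbols are added here for estimated quantities.

```latex
\begin{aligned}
X &= \begin{bmatrix} 1 & X_{1,1} & X_{2,1} \\ \vdots & \vdots & \vdots \\ 1 & X_{1,n} & X_{2,n} \end{bmatrix},
\qquad
B = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix} = (X'X)^{-1} X'Y \\[4pt]
\hat{Y} &= XB, \qquad e = Y - \hat{Y} \\[4pt]
\hat{\sigma}^2 &= \frac{e'e}{n - p}, \qquad \widehat{\mathrm{Cov}}(B) = \hat{\sigma}^2 \, (X'X)^{-1}
\end{aligned}
```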

In [None]:
# 1. Calculate the X'X matrix (X transpose times X)
# X'X is symmetric, so its transpose (X'X)' equals X'X.
XTX = np.dot(X.T, X)
print("\n (X'X) Matrix:")
print(XTX)


 (X'X) Matrix:
[[1.0000000e+01 1.2400000e+02 2.0219000e+04]
 [1.2400000e+02 4.0380000e+03 2.5046500e+05]
 [2.0219000e+04 2.5046500e+05 4.0880847e+07]]
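
The entries of X'X can be read as sums over the 10 sampled rows: the top-left entry is n, and the rest of the first row holds the column totals of the two predictors. The sketch below retypes the printed matrix (the values are whole numbers at the displayed precision) and recovers the sample means from it; the variable names are illustrative.

```python
import numpy as np

# X'X copied from the output above
XTX = np.array([
    [10.0,    124.0,    20219.0],
    [124.0,   4038.0,   250465.0],
    [20219.0, 250465.0, 40880847.0],
])

n = XTX[0, 0]               # sum of 1*1 over rows = number of observations
mean_range = XTX[0, 1] / n  # sum(Electric Range) / n
mean_year = XTX[0, 2] / n   # sum(Model Year) / n
is_symmetric = np.allclose(XTX, XTX.T)

print(f"n = {n:.0f}")
print(f"Mean Electric Range = {mean_range:.1f}")
print(f"Mean Model Year = {mean_year:.1f}")
print(f"Symmetric: {is_symmetric}")
```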


In [None]:
# Calculate (X'X) Inverse Matrix (needed for B and Covariance)
XTX_inverse = np.linalg.inv(XTX)
print("\n(X'X) Inverse Matrix:")
print(XTX_inverse)


(X'X) Inverse Matrix:
[[ 1.58746927e+05 -7.86907554e+00 -7.84654289e+01]
 [-7.86907554e+00  7.89514503e-04  3.88707926e-03]
 [-7.84654289e+01  3.88707926e-03  3.87839305e-02]]
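
Because every Model Year value is close to 2020, the Model Year column is nearly proportional to the column of 1s, which tends to make X'X ill-conditioned and its inverse numerically sensitive. The sketch below checks this with `np.linalg.cond` on the printed X'X and compares it with the centered cross-product block for the two predictors. Centering is shown here only as a diagnostic; it is not a step in the original notebook.

```python
import numpy as np

# X'X copied from the notebook output
XTX = np.array([
    [10.0,    124.0,    20219.0],
    [124.0,   4038.0,   250465.0],
    [20219.0, 250465.0, 40880847.0],
])

# Condition number of the raw X'X (large values signal numerical sensitivity)
cond_raw = np.linalg.cond(XTX)

# Centered cross-products of the predictors: sum(x x') - (sum x)(sum x)' / n
n = XTX[0, 0]
sums = XTX[0, 1:]
S_centered = XTX[1:, 1:] - np.outer(sums, sums) / n
cond_centered = np.linalg.cond(S_centered)

print(f"cond(X'X)            = {cond_raw:.3e}")
print(f"cond(centered block) = {cond_centered:.3e}")
```

If the raw condition number is many orders of magnitude larger than the centered one, most of the ill-conditioning comes from the scale of Model Year rather than from the relationship between the two predictors.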


In [None]:
# 2. Calculate X'Y matrix (X transpose times Y)
# np.dot(X.T, Y) is equivalent to X.T @ Y for 2-D arrays, matching the approach used for X'X.
XTY = np.dot(X.T, Y)
print("\n2. (X'Y) Matrix:")
print(XTY)


2. (X'Y) Matrix:
[[3.9995000e+04]
 [1.2798400e+06]
 [8.0749905e+07]]


In [None]:
# 3. Calculate B matrix (Beta coefficients): B = (X'X)^-1 * X'Y
# np.dot(XTX_inverse, XTY) is equivalent to XTX_inverse @ XTY
B = np.dot(XTX_inverse, XTY)
print("\n3. B (Beta) Matrix:")
print(B)


3. B (Beta) Matrix:
[[ 2.93626745e+06]
 [ 1.68057166e+02]
 [-1.45128436e+03]]
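
As a cross-check on the explicit inverse, the normal equations (X'X) B = X'Y can also be solved directly with `np.linalg.solve`, which avoids forming the inverse. This is an alternative sketch, not the notebook's method; it retypes the printed X'X and X'Y values and compares the result with the printed B.

```python
import numpy as np

# X'X and X'Y copied from the notebook output
XTX = np.array([
    [10.0,    124.0,    20219.0],
    [124.0,   4038.0,   250465.0],
    [20219.0, 250465.0, 40880847.0],
])
XTY = np.array([[39995.0], [1279840.0], [80749905.0]])

# Solve (X'X) B = X'Y without computing the inverse explicitly
B_solve = np.linalg.solve(XTX, XTY)

# B as printed by the notebook (9 significant digits)
B_printed = np.array([[2.93626745e+06], [1.68057166e+02], [-1.45128436e+03]])
matches = np.allclose(B_solve, B_printed, rtol=1e-4)

print(B_solve)
print("Matches printed B:", matches)
```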


In [None]:
# 4. Final Regression Equation

# Extract coefficients for the final equation
b0 = B[0, 0]
b1 = B[1, 0]
b2 = B[2, 0]

print("\n4. Final Regression Equation:")
print(f"Y_hat = {b0:.2f} + {b1:.2f}*Electric_Range + {b2:.2f}*Model_Year")




4. Final Regression Equation:
Y_hat = 2936267.45 + 168.06*Electric_Range + -1451.28*Model_Year


In [None]:
# 5. Prediction for 2 Rows (using the first two rows of the design matrix X)
Y_hat_all = np.dot(X, B)
Y_hat_2rows = Y_hat_all[:2, :]  # Select Y_hat for the first 2 rows
Y_2rows = Y[:2, :]              # Select actual Y for the first 2 rows

# Calculate Error (e) for the 2 rows: e = Y - Y_hat
Error_2rows = Y_2rows - Y_hat_2rows

# Create a table for Y, Y_hat, and Error for the 2 rows
results_table = pd.DataFrame({
    'Y (Actual)': Y_2rows.flatten(),
    'Y_hat (Predicted)': Y_hat_2rows.flatten(),
    'Error (e)': Error_2rows.flatten()
})
print("\n5.a. Y, Y_hat, and Error for 2 Selected Rows:")
print(results_table)


5.a. Y, Y_hat, and Error for 2 Selected Rows:
   Y (Actual)  Y_hat (Predicted)    Error (e)
0         0.0        7423.176221 -7423.176221
1         0.0        1770.462703 -1770.462703
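
The two predictions can be reproduced by hand from the regression equation. The sketch below plugs the first two sampled rows (from the data head shown earlier) into the equation using the coefficients from the printed B matrix; the list names are illustrative.

```python
# Coefficients copied from the printed B matrix (9 significant digits)
b0, b1, b2 = 2.93626745e+06, 1.68057166e+02, -1.45128436e+03

# First two sampled rows: (Electric Range, Model Year), with actual Base MSRP
rows = [(25.0, 2021), (0.0, 2022)]
actual = [0.0, 0.0]

# Y_hat = b0 + b1 * Electric_Range + b2 * Model_Year, and e = Y - Y_hat
y_hats = [b0 + b1 * electric_range + b2 * model_year
          for electric_range, model_year in rows]
errors = [y - y_hat for y, y_hat in zip(actual, y_hats)]

for y, y_hat, err in zip(actual, y_hats, errors):
    print(f"Y = {y:.1f}, Y_hat = {y_hat:.2f}, e = {err:.2f}")
```

Small differences from the table are expected because the printed coefficients are rounded, and b2 is multiplied by a Model Year value above 2000, which magnifies any rounding in it.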


In [None]:
# 6. Covariance Matrix Calculations

# 6.a. Calculation for Sigma Squared (Estimate of Error Variance)
# 6.a.i. Calculate error vector (e = Y - Y_hat)
e = Y - Y_hat_all

# 6.a.ii. Calculate Sum of Squared Errors (e'e)
e_T_e = np.dot(e.T, e)

# 6.a.iii. Calculate sigma squared: e'e / (n - p),
# where n is the number of observations and p is the number of estimated parameters (including the intercept).
n = X.shape[0]  # Number of observations
p = X.shape[1]  # Number of parameters (columns in X)
sigma_squared = e_T_e / (n - p)  # 1x1 array, reusing e'e from the previous step
print("\nSigma Squared:")
print(sigma_squared)

# 6.b. Calculation of the Covariance Matrix
cov_matrix = sigma_squared * XTX_inverse
print("\nCovariance matrix of coefficients:")
print(cov_matrix)

# 6.c. Reporting the required values
# Variances are the diagonal elements: (0,0), (1,1), (2,2)
# Covariance of B1:B2 is the (1,2) or (2,1) element
Var_B0 = cov_matrix[0, 0]
Var_B1 = cov_matrix[1, 1]
Var_B2 = cov_matrix[2, 2]
Cov_B1_B2 = cov_matrix[1, 2]

print("\n6.c. Reported Covariance Matrix Results:")
print(f"6.c.1. Variance of B0: {Var_B0:.4f}")
print(f"6.c.2. Variance of B1: {Var_B1:.4f}")
print(f"6.c.3. Variance of B2: {Var_B2:.4f}")
print(f"6.c.4. Covariance of B1:B2: {Cov_B1_B2:.4f}")


Sigma Squared:
[[1.62795962e+08]]

Covariance matrix of coefficients:
[[ 2.58433587e+13 -1.28105372e+09 -1.27738549e+10]
 [-1.28105372e+09  1.28529773e+05  6.32800806e+05]
 [-1.27738549e+10  6.32800806e+05  6.31386726e+06]]

6.c. Reported Covariance Matrix Results:
6.c.1. Variance of B0: 25843358657228.6680
6.c.2. Variance of B1: 128529.7727
6.c.3. Variance of B2: 6313867.2606
6.c.4. Covariance of B1:B2: 632800.8061
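
The variances above are easier to interpret as standard errors (the square roots of the diagonal). The sketch below retypes the printed covariance matrix and B, then derives standard errors, coefficient-to-standard-error ratios, and the correlation between B1 and B2. These derived quantities are an illustrative extension and are not computed in the original notebook.

```python
import numpy as np

# Covariance matrix and coefficients copied from the notebook output
cov_matrix = np.array([
    [ 2.58433587e+13, -1.28105372e+09, -1.27738549e+10],
    [-1.28105372e+09,  1.28529773e+05,  6.32800806e+05],
    [-1.27738549e+10,  6.32800806e+05,  6.31386726e+06],
])
B = np.array([2.93626745e+06, 1.68057166e+02, -1.45128436e+03])

# Standard errors are the square roots of the diagonal variances
std_errors = np.sqrt(np.diag(cov_matrix))

# Ratio of each coefficient to its standard error (t-ratio)
t_ratios = B / std_errors

# Correlation between the B1 and B2 estimates
corr_b1_b2 = cov_matrix[1, 2] / (std_errors[1] * std_errors[2])

print("Standard errors:", std_errors)
print("t-ratios:", t_ratios)
print(f"Correlation of B1 and B2: {corr_b1_b2:.4f}")
```

Each standard error here is larger than the absolute value of its coefficient, which suggests the estimates from this 10-row sample are very imprecise. The correlation between B1 and B2 mirrors, with the opposite sign, the -0.7025 correlation between the two predictors.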
