# Part 1 (50 points): Fuel Consumption → Horsepower Prediction

In [39]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import matplotlib.pyplot as plt

1.1 Load and inspect the dataset

In [40]:
# ============================================================
# Load dataset
# ============================================================

DATA_PATH = "FuelEconomy.csv"
df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
print("\nColumns:")
print(df.columns.tolist())

display(df.head())

print("\nSummary statistics:")
display(df.describe(include="all"))

print("\nMissing values per column:")
display(df.isna().sum())

Shape: (100, 2)

Columns:
['Horse Power', 'Fuel Economy (MPG)']


,Horse Power,Fuel Economy (MPG)
0,118.770799,29.344195
1,176.326567,24.695934
2,219.262465,23.95201
3,187.310009,23.384546
4,218.59434,23.426739



Summary statistics:


,Horse Power,Fuel Economy (MPG)
count,100.0,100.0
mean,213.67619,23.178501
std,62.061726,4.701666
min,50.0,10.0
25%,174.996514,20.439516
50%,218.928402,23.143192
75%,251.706476,26.089933
max,350.0,35.0



Missing values per column:


Horse Power,0
Fuel Economy (MPG),0


1.2 Train/Test split (70% / 30% random)

In [41]:
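# Feature: horsepower. Target: fuel economy (MPG).
X = df[['Horse Power']]
y = df['Fuel Economy (MPG)']

# Hold out 30% of the rows for testing; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)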

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (70, 1)
X_test shape: (30, 1)
y_train shape: (70,)
y_test shape: (30,)


1.3 Model training: Linear + Polynomial regression (15 points)

• Train the following models to predict HP:

(a) Linear Regression

(b) Polynomial Regression (degree 2)

(c) Polynomial Regression (degree 3)

(d) Polynomial Regression (degree 4)

• Do NOT use any regularization (no Ridge/Lasso/ElasticNet).

• Use PolynomialFeatures + LinearRegression for polynomial models.

(a) Linear Regression

In [42]:
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)

(b) Polynomial Regression (degree 2)

In [43]:
pipeline_poly2 = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('linear_reg', LinearRegression())
])
pipeline_poly2.fit(X_train, y_train)

(c) Polynomial Regression (degree 3)

In [44]:
pipeline_poly3 = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3, include_bias=False)),
    ('linear_reg', LinearRegression())
])
pipeline_poly3.fit(X_train, y_train)

(d) Polynomial Regression (degree 4)

In [45]:
pipeline_poly4 = Pipeline([
    ('poly_features', PolynomialFeatures(degree=4, include_bias=False)),
    ('linear_reg', LinearRegression())
])
pipeline_poly4.fit(X_train, y_train)

1.4 Model evaluation (train and test) (10 points)

• For each model, report metrics on both train and test sets: MSE, MAE, R2

• Present your results in a clean table (recommended).

In [46]:
def evaluate_model_formatted(model, X_train, y_train, X_test, y_test, model_name):
    row_data = {'Model': model_name}

    y_train_pred = model.predict(X_train)
    row_data[('Train', 'MSE')] = mean_squared_error(y_train, y_train_pred)
    row_data[('Train', 'MAE')] = mean_absolute_error(y_train, y_train_pred)
    row_data[('Train', 'R2')] = r2_score(y_train, y_train_pred)

    y_test_pred = model.predict(X_test)
    row_data[('Test', 'MSE')] = mean_squared_error(y_test, y_test_pred)
    row_data[('Test', 'MAE')] = mean_absolute_error(y_test, y_test_pred)
    row_data[('Test', 'R2')] = r2_score(y_test, y_test_pred)

    return row_data

formatted_all_results = []

formatted_all_results.append(evaluate_model_formatted(model_linear, X_train, y_train, X_test, y_test, 'Linear Regression'))

formatted_all_results.append(evaluate_model_formatted(pipeline_poly2, X_train, y_train, X_test, y_test, 'Polynomial Degree 2'))

formatted_all_results.append(evaluate_model_formatted(pipeline_poly3, X_train, y_train, X_test, y_test, 'Polynomial Degree 3'))

formatted_all_results.append(evaluate_model_formatted(pipeline_poly4, X_train, y_train, X_test, y_test, 'Polynomial Degree 4'))

results_df_formatted = pd.DataFrame(formatted_all_results).set_index('Model')

metric_order = ['MSE', 'MAE', 'R2']
set_order = ['Train', 'Test']
ordered_columns = pd.MultiIndex.from_product([set_order, metric_order], names=['Set', 'Metric'])

results_df_formatted = results_df_formatted.reindex(columns=ordered_columns)

print("\nModel Evaluation Results:")
display(results_df_formatted)


Model Evaluation Results:


Model,Train MSE,Train MAE,Train R2,Test MSE,Test MAE,Test R2
Linear Regression,2.115741,1.209978,0.90632,1.67495,1.031271,0.913315
Polynomial Degree 2,2.11507,1.210303,0.90635,1.657031,1.025411,0.914243
Polynomial Degree 3,2.06055,1.211527,0.908764,1.903743,1.087196,0.901475
Polynomial Degree 4,1.917714,1.168259,0.915088,2.54846,1.203406,0.868108


1.5 Discussion and interpretation (10 points)

Use your results to answer the following questions with a data-driven explanation:

• Which model performs best on the test set and why?

• Does increasing polynomial degree always improve performance? If not, explain what you observe.

• If a model performs unexpectedly poorly (e.g., low R2 or large test error), propose at least two
plausible reasons, such as: underfitting vs overfitting, weak relationship between features and target, outliers or noise in the data, insufficient feature information for predicting HP.

• Support your claims using your reported metrics (not intuition only)

On the test set, the degree 2 polynomial model performs best. It has the highest R2 (0.914) and the lowest MSE and MAE (1.657 and 1.025, respectively), so its predictions deviate least from the actual values. Among the models tried, the relationship between horsepower and fuel economy is therefore best captured by a quadratic function, although the gain over linear regression is marginal: test R2 rises only from 0.913 to 0.914.

Increasing the polynomial degree does not always improve performance. Moving to degree 3 and degree 4 lowers test R2 to 0.901 and 0.868 and raises test MSE to 1.904 and 2.548, while train MSE keeps falling (2.115 for degree 2, 2.061 for degree 3, 1.918 for degree 4). Falling train error combined with rising test error is the signature of overfitting: the higher-degree models start fitting noise in the 70 training points rather than the underlying trend.

Overfitting is therefore the most plausible reason for a model here to perform poorly, since a flexible model is sensitive to outliers and noise and can be pulled away from the bulk of the data. A second possible reason is that the available feature is insufficient to explain the dependent variable: if the relationship were weak, other unaccounted-for variables would be driving the output, and no polynomial degree would fix that.
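
The overfitting argument can be checked directly from the reported numbers. The sketch below copies the train and test MSE values from the Part 1 results table (the dictionary and variable names are illustrative) and computes the test-minus-train gap for each model.

```python
# (train MSE, test MSE) copied from the Part 1 results table.
reported_mse = {
    "Linear Regression": (2.115741, 1.674950),
    "Polynomial Degree 2": (2.115070, 1.657031),
    "Polynomial Degree 3": (2.060550, 1.903743),
    "Polynomial Degree 4": (1.917714, 2.548460),
}

# Gap between test and train error; a growing positive gap suggests overfitting.
gaps = {name: test - train for name, (train, test) in reported_mse.items()}

# Model with the lowest test MSE.
best_on_test = min(reported_mse, key=lambda name: reported_mse[name][1])

for name, gap in gaps.items():
    print(f"{name:<20} test - train MSE = {gap:+.3f}")
print("Lowest test MSE:", best_on_test)
```

Only the degree 4 model has a test MSE above its train MSE. For the other three the test error is actually lower than the train error, which is possible with a test set of only 30 points and is a reminder that small differences between models here should be read cautiously.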

# Part 2 (50 points): Weather → Daily Electricity Consumption Prediction

2.1 Load and inspect the dataset (10 points)

In [47]:
DATA_PATH_ELECTRICITY = "electricity_consumption_based_weather_dataset.csv"
df_electricity = pd.read_csv(DATA_PATH_ELECTRICITY)

print("Shape:", df_electricity.shape)
print("\nColumns:")
print(df_electricity.columns.tolist())

display(df_electricity.head())

print("\nSummary statistics:")
display(df_electricity.describe(include='all'))

print("\nMissing values per column:")
display(df_electricity.isna().sum())

Shape: (1433, 6)

Columns:
['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption']


,date,AWND,PRCP,TMAX,TMIN,daily_consumption
0,2006-12-16,2.5,0.0,10.6,5.0,1209.176
1,2006-12-17,2.6,0.0,13.3,5.6,3390.46
2,2006-12-18,2.4,0.0,15.0,6.7,2203.826
3,2006-12-19,2.4,0.0,7.2,2.2,1666.194
4,2006-12-20,2.4,0.0,7.2,1.1,2225.748



Summary statistics:


,date,AWND,PRCP,TMAX,TMIN,daily_consumption
count,1433,1418.0,1433.0,1433.0,1433.0,1433.0
unique,1433,,,,,
top,2010-11-26,,,,,
freq,1,,,,,
mean,,2.642313,3.800488,17.187509,9.141242,1561.078061
std,,1.140021,10.973436,10.136415,9.028417,606.819667
min,,0.0,0.0,-8.9,-14.4,14.218
25%,,1.8,0.0,8.9,2.2,1165.7
50%,,2.4,0.0,17.8,9.4,1542.65
75%,,3.3,1.3,26.1,17.2,1893.608



Missing values per column:


date,0
AWND,15
PRCP,0
TMAX,0
TMIN,0
daily_consumption,0


In [48]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which raises a FutureWarning and will stop working in pandas 3.0.
df_electricity['AWND'] = df_electricity['AWND'].fillna(df_electricity['AWND'].mean())

print("Missing values after AWND filled:")
display(df_electricity.isna().sum())


Missing values after AWND filled:


date,0
AWND,0
PRCP,0
TMAX,0
TMIN,0
daily_consumption,0
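
The cell above fills AWND with the mean of the full dataset before the split, so the test rows contribute to the imputed value. With only 15 missing entries the effect is likely small, but one possible alternative is to fit the imputer on the training rows only. The sketch below shows the idea on a tiny made-up frame; `demo` and its values are illustrative assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Tiny made-up frame with one missing AWND value (not the real data).
demo = pd.DataFrame({
    "AWND": [2.0, np.nan, 4.0, 3.0, 1.0, 5.0],
    "TMAX": [10.0, 12.0, 15.0, 9.0, 20.0, 18.0],
})

# Split first, then learn the fill value from the training rows only.
train_demo, test_demo = train_test_split(demo, test_size=0.33, random_state=0)

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train_demo)  # learns column means from train
test_filled = imputer.transform(test_demo)        # reuses the train means

print("Fill value learned for AWND:", imputer.statistics_[0])
```

Placing the imputer as the first step of a Pipeline would apply the same train-only logic automatically whenever the pipeline is fitted.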


2.2 Train/Test split (70% / 30% random) (5 points)

In [49]:
X_electricity = df_electricity.drop(columns=['date', 'daily_consumption'])
y_electricity = df_electricity['daily_consumption']

X_train_electricity, X_test_electricity, y_train_electricity, y_test_electricity = train_test_split(X_electricity, y_electricity, test_size=0.3, random_state=42)

print(f"X_train_electricity shape: {X_train_electricity.shape}")
print(f"X_test_electricity shape: {X_test_electricity.shape}")
print(f"y_train_electricity shape: {y_train_electricity.shape}")
print(f"y_test_electricity shape: {y_test_electricity.shape}")

X_train_electricity shape: (1003, 4)
X_test_electricity shape: (430, 4)
y_train_electricity shape: (1003,)
y_test_electricity shape: (430,)


2.3 Model training: Linear + Polynomial regression (15 points)

• Train the following models to predict daily consumption

(a) Linear Regression

(b) Polynomial Regression (degree 2)

(c) Polynomial Regression (degree 3)

(d) Polynomial Regression (degree 4)

• Do NOT use regularization.

• Ensure your features are correctly separated from the target

(a) Linear Regression (Electricity Consumption)

In [50]:
model_linear_electricity = LinearRegression()
model_linear_electricity.fit(X_train_electricity, y_train_electricity)

(b) Polynomial Regression (degree 2) (Electricity Consumption)

In [51]:
pipeline_poly2_electricity = Pipeline([
    ('poly_features', PolynomialFeatures(degree=2, include_bias=False)),
    ('linear_reg', LinearRegression())
])
pipeline_poly2_electricity.fit(X_train_electricity, y_train_electricity)


(c) Polynomial Regression (degree 3) (Electricity Consumption)

In [52]:
pipeline_poly3_electricity = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3, include_bias=False)),
    ('linear_reg', LinearRegression())
])
pipeline_poly3_electricity.fit(X_train_electricity, y_train_electricity)


(d) Polynomial Regression (degree 4) (Electricity Consumption)

In [53]:
pipeline_poly4_electricity = Pipeline([
    ('poly_features', PolynomialFeatures(degree=4, include_bias=False)),
    ('linear_reg', LinearRegression())
])
pipeline_poly4_electricity.fit(X_train_electricity, y_train_electricity)
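
With four input features, the number of polynomial terms grows quickly with the degree, which matters when judging how flexible each model is. The sketch below counts the columns PolynomialFeatures generates for four inputs and cross-checks the count against the closed form C(n + d, d) - 1; the placeholder array `dummy` is only there to give the transformer a shape to fit.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 4  # AWND, PRCP, TMAX, TMIN
dummy = np.zeros((1, n_features))  # placeholder row; only its width matters

counts = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(dummy)
    counts[degree] = poly.n_output_features_
    # Monomials of total degree <= d in n variables, minus the constant term.
    assert counts[degree] == comb(n_features + degree, degree) - 1
    print(f"degree {degree}: {counts[degree]} features")
```

The degree 4 pipeline therefore fits 69 coefficients plus an intercept on 1003 training rows, with no regularization and no feature scaling.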

2.4 Model evaluation (train and test) (10 points)

In [54]:
formatted_all_results_electricity = []

formatted_all_results_electricity.append(evaluate_model_formatted(model_linear_electricity, X_train_electricity, y_train_electricity, X_test_electricity, y_test_electricity, 'Linear Regression Electricity'))

formatted_all_results_electricity.append(evaluate_model_formatted(pipeline_poly2_electricity, X_train_electricity, y_train_electricity, X_test_electricity, y_test_electricity, 'Polynomial Degree 2 Electricity'))

formatted_all_results_electricity.append(evaluate_model_formatted(pipeline_poly3_electricity, X_train_electricity, y_train_electricity, X_test_electricity, y_test_electricity, 'Polynomial Degree 3 Electricity'))

formatted_all_results_electricity.append(evaluate_model_formatted(pipeline_poly4_electricity, X_train_electricity, y_train_electricity, X_test_electricity, y_test_electricity, 'Polynomial Degree 4 Electricity'))

results_df_formatted_electricity = pd.DataFrame(formatted_all_results_electricity).set_index('Model')

# Define the desired column order for MultiIndex (reusing from Part 1)
metric_order = ['MSE', 'MAE', 'R2']
set_order = ['Train', 'Test']
ordered_columns = pd.MultiIndex.from_product([set_order, metric_order], names=['Set', 'Metric'])

results_df_formatted_electricity = results_df_formatted_electricity.reindex(columns=ordered_columns)

print("\nModel Evaluation Results for Electricity Consumption:")
display(results_df_formatted_electricity)


Model Evaluation Results for Electricity Consumption:


Model,Train MSE,Train MAE,Train R2,Test MSE,Test MAE,Test R2
Linear Regression Electricity,274826.312092,387.047361,0.272945,237216.888723,365.583197,0.311496
Polynomial Degree 2 Electricity,268041.444572,382.087143,0.290894,234831.853379,362.904541,0.318418
Polynomial Degree 3 Electricity,261191.080538,377.738782,0.309017,238445.641493,369.093689,0.307929
Polynomial Degree 4 Electricity,253602.686954,374.729584,0.329092,408466.181456,415.287189,-0.185542


2.5 Discussion and interpretation (10 points)

Write a short, technical discussion that uses your results to answer:

• Which model generalizes best (best test performance), and what does that tell you about the
relationship between weather and electricity usage?

• Do polynomial models improve the fit compared to linear regression? If yes, why might electricity
consumption have nonlinear dependence on weather?

• If higher-degree models perform worse on the test set, explain this behavior using evidence from
metrics (e.g., train error decreases but test error increases).

• If none of the models achieve good test performance, provide at least two reasons supported by
your outputs (e.g., limited feature set, high noise, unmodeled drivers such as occupancy/behavior,
seasonal effects).

The degree 2 polynomial model has the best performance on the test set, with an R2 of 0.318 and the lowest test MSE (234,832) and MAE (362.9). This indicates a weak, mildly nonlinear relationship between weather and electricity usage. Some nonlinearity is plausible because both cold and hot days can raise demand through heating and cooling.

The degree 2 model is only slightly better than the linear model (test R2 of 0.311), and test R2 then falls as the degree increases: to 0.308 for degree 3 and to -0.186 for degree 4. Over the same range train MSE keeps decreasing, from 274,826 for the linear model to 253,603 for degree 4, while test MSE for degree 4 jumps to 408,466. Lower train error with much higher test error indicates severe overfitting to noise.

The low R2 scores across all models, every one below 0.5, indicate that the current feature set of basic weather variables is insufficient to explain most of the variance in electricity consumption. This points to unaccounted-for drivers such as occupancy and seasonal effects.
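
A negative R2 can look like a mistake, so it helps to recall the definition R2 = 1 - SS_res / SS_tot, which is below zero whenever a model's squared error exceeds that of always predicting the mean of the target. The sketch below implements that definition in plain Python on made-up numbers (the helper `r2` and the toy values are illustrative), then applies the same identity to the reported test metrics.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
r2_mean_baseline = r2(y_true, [2.0, 2.0, 2.0])  # always predict the mean
r2_reversed = r2(y_true, [3.0, 2.0, 1.0])       # worse than the mean baseline
print(r2_mean_baseline, r2_reversed)

# Since MSE = SS_res / n, the test-target variance is MSE / (1 - R2).
# Both rows of the reported table should imply the same variance.
implied_var_linear = 237216.888723 / (1 - 0.311496)
implied_var_deg4 = 408466.181456 / (1 - (-0.185542))
print(implied_var_linear, implied_var_deg4)
```

Both reported rows imply roughly the same test-set variance, and the degree 4 test MSE is larger than that variance, which is exactly the condition for R2 to drop below zero.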