# EE344 Assignment 1 - Part 1: Fuel Consumption → Horsepower Prediction

**Dataset**: Fuel Consumption Based on HP (Kaggle)  
**Task**: Build regression models to predict horsepower (HP) based on fuel consumption features

---

## Models to be trained:
- Linear Regression
- Polynomial Regression (degree 2)
- Polynomial Regression (degree 3)
- Polynomial Regression (degree 4)

**Note**: No regularization (Ridge/Lasso/ElasticNet) will be used as per assignment requirements.

## Import Required Libraries

In [10]:
# ============================================================
# Imports
# ============================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Set random seed for reproducibility
np.random.seed(42)

## 1.1 Load and Inspect the Dataset (10 points)

In this section, we:
- Load the CSV file into a pandas DataFrame
- Display basic information about the dataset:
  - Column names
  - Shape (number of rows and columns)
  - Summary statistics
- Check for missing values
- Identify the feature and target variables

In [11]:
# ============================================================
# Load the dataset
# ============================================================

DATA_PATH = "FuelEconomy.csv"
df = pd.read_csv(DATA_PATH)

print("="*60)
print("DATASET OVERVIEW")
print("="*60)

print("\nShape:", df.shape)

print("\nColumn names:")
print(df.columns.tolist())

print("\n" + "="*60)
print("SUMMARY STATISTICS")
print("="*60)
display(df.describe())

print("\n" + "="*60)
print("MISSING VALUES CHECK")
print("="*60)
missing_values = df.isna().sum()
print(missing_values)


DATASET OVERVIEW

Shape: (100, 2)

Column names:
['Horse Power', 'Fuel Economy (MPG)']

SUMMARY STATISTICS


| | Horse Power | Fuel Economy (MPG) |
|---|---|---|
| count | 100.0 | 100.0 |
| mean | 213.67619 | 23.178501 |
| std | 62.061726 | 4.701666 |
| min | 50.0 | 10.0 |
| 25% | 174.996514 | 20.439516 |
| 50% | 218.928402 | 23.143192 |
| 75% | 251.706476 | 26.089933 |
| max | 350.0 | 35.0 |



MISSING VALUES CHECK
Horse Power           0
Fuel Economy (MPG)    0
dtype: int64


### Data Understanding

From the dataset inspection above:

- **Feature (X)**: `Fuel Economy (MPG)` - measures how many miles a vehicle can travel per gallon of fuel
- **Target (y)**: `Horse Power` - the engine's horsepower that we want to predict

## 1.2 Train/Test Split (70% / 30% Random) (5 points)

We split the dataset into:
- **Training set (70%)**: Used to train the models
- **Test set (30%)**: Used to evaluate model performance on unseen data

A fixed `random_state=42` ensures reproducibility of results.

In [12]:
# ============================================================
# Prepare features (X) and target (y)
# ============================================================

# Feature: Fuel Economy (MPG)
X = df[['Fuel Economy (MPG)']].values

# Target: Horse Power
y = df['Horse Power'].values

print("Feature matrix X shape:", X.shape)
print("Target vector y shape:", y.shape)

# ============================================================
# Train-test split (70% train, 30% test)
# ============================================================

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.30, 
    random_state=42
)

print("\n" + "="*60)
print("TRAIN/TEST SPLIT SUMMARY")
print("="*60)
print(f"Training samples: {X_train.shape[0]} ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test samples:     {X_test.shape[0]} ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"Total samples:    {len(X)}")

Feature matrix X shape: (100, 1)
Target vector y shape: (100,)

TRAIN/TEST SPLIT SUMMARY
Training samples: 70 (70.0%)
Test samples:     30 (30.0%)
Total samples:    100
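
As a small illustration of the reproducibility point, the sketch below calls `train_test_split` twice with the same `random_state` and checks that both calls pick the same test rows. The arrays `X_demo` and `y_demo` are hypothetical toy data, not the assignment dataset. Note that when an integer `random_state` is passed, it is that argument (not `np.random.seed`) that fixes the partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature/target arrays (hypothetical values).
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.arange(20)

# Two calls with the same random_state return identical partitions.
_, X_test_a, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)
_, X_test_b, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("Same seed gives the same test rows:", same_split)
```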


## 1.3 Model Training: Linear + Polynomial Regression (15 points)

We will train **four regression models** without any regularization:

1. **Linear Regression**: Fits a straight line (degree 1 polynomial)
2. **Polynomial Regression (degree 2)**: Fits a quadratic curve
3. **Polynomial Regression (degree 3)**: Fits a cubic curve
4. **Polynomial Regression (degree 4)**: Fits a quartic curve

For polynomial models, we use `PolynomialFeatures` to expand the single input into its powers (with only one feature there are no interaction terms), then fit `LinearRegression` on the expanded columns.

In [13]:
# ============================================================
# Train all models
# ============================================================

# Dictionary to store all trained models
models = {}

# 1. Linear Regression
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)
models['Linear Regression'] = {
    'model': model_linear,
    'X_train': X_train,
    'X_test': X_test,
    'degree': 1
}

# 2-4. Polynomial Regression (degrees 2, 3, 4)
for degree in [2, 3, 4]:
    
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree, include_bias=True)
    
    # Transform training data
    X_train_poly = poly_features.fit_transform(X_train)
    
    # Transform test data (use transform, not fit_transform)
    X_test_poly = poly_features.transform(X_test)
    
    # Train linear regression on polynomial features
    model_poly = LinearRegression()
    model_poly.fit(X_train_poly, y_train)
    
    # Store model and transformed data
    models[f'Polynomial (degree={degree})'] = {
        'model': model_poly,
        'X_train': X_train_poly,
        'X_test': X_test_poly,
        'degree': degree,
        'poly_features': poly_features
    }
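
The transform-then-fit steps in the cell above could also be bundled into a single estimator. The sketch below is one possible alternative, not the approach used in this notebook: it uses scikit-learn's `make_pipeline` on synthetic data (the arrays and coefficients are invented) and checks that the pipeline gives the same predictions as the manual two-step version. It sets `include_bias=False` because `LinearRegression` already fits an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (hypothetical, not the assignment dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(10, 35, size=(50, 1))
y_demo = 500 - 12 * X_demo[:, 0] + rng.normal(0, 15, size=50)
X_fit, X_new = X_demo[:35], X_demo[35:]
y_fit = y_demo[:35]

# Pipeline version: the expansion is fitted on the training rows only
# and re-applied automatically at predict time.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
pipe.fit(X_fit, y_fit)
pred_pipe = pipe.predict(X_new)

# Manual version, mirroring the fit_transform / transform steps above.
poly = PolynomialFeatures(degree=2, include_bias=False)
manual = LinearRegression().fit(poly.fit_transform(X_fit), y_fit)
pred_manual = manual.predict(poly.transform(X_new))

print("Predictions match:", np.allclose(pred_pipe, pred_manual))
```

A pipeline removes the need to store transformed copies of `X_train` and `X_test` for every degree, at the cost of hiding the intermediate matrices.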

## 1.4 Model Evaluation (Train and Test) (10 points)

For each model, we compute three key metrics on both training and test sets:

- **MSE (Mean Squared Error)**: Average of squared differences between predicted and actual values. Lower is better.
- **MAE (Mean Absolute Error)**: Average of absolute differences. Lower is better.
- **R² (Coefficient of Determination)**: Proportion of variance explained by the model. Range: (-∞, 1], where 1 is perfect fit.
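
The three definitions above can be checked by hand on a tiny example. The sketch below computes each metric directly with NumPy and compares it with the scikit-learn function used in the notebook. The four horsepower values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted horsepower values.
y_true = np.array([200.0, 150.0, 300.0, 250.0])
y_pred = np.array([210.0, 140.0, 280.0, 255.0])

residuals = y_true - y_pred                       # -10, 10, 20, -5

mse = np.mean(residuals ** 2)                     # mean of squared errors
mae = np.mean(np.abs(residuals))                  # mean of absolute errors

ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1.0 - ss_res / ss_tot                        # 1 - SS_res / SS_tot

print(f"MSE={mse}, MAE={mae}, R2={r2}")
print("Matches sklearn:",
      np.isclose(mse, mean_squared_error(y_true, y_pred)),
      np.isclose(mae, mean_absolute_error(y_true, y_pred)),
      np.isclose(r2, r2_score(y_true, y_pred)))
```

Because R² is `1 - SS_res / SS_tot`, it becomes negative whenever the model's squared error exceeds that of always predicting the mean of the evaluated targets.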

In [14]:
# ============================================================
# Evaluate all models
# ============================================================

results = []

for model_name, model_info in models.items():
    model = model_info['model']
    X_train_transformed = model_info['X_train']
    X_test_transformed = model_info['X_test']
    
    # Training set predictions
    y_train_pred = model.predict(X_train_transformed)
    
    # Test set predictions
    y_test_pred = model.predict(X_test_transformed)
    
    # Compute metrics for training set
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_mae = mean_absolute_error(y_train, y_train_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    
    # Compute metrics for test set
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    # Store results
    results.append({
        'Model': model_name,
        'Train MSE': train_mse,
        'Train MAE': train_mae,
        'Train R²': train_r2,
        'Test MSE': test_mse,
        'Test MAE': test_mae,
        'Test R²': test_r2
    })

# Create DataFrame for better visualization
results_df = pd.DataFrame(results)

print("="*100)
print("MODEL PERFORMANCE SUMMARY")
print("="*100)
display(results_df)

MODEL PERFORMANCE SUMMARY


| | Model | Train MSE | Train MAE | Train R² | Test MSE | Test MAE | Test R² |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 357.69918 | 16.061689 | 0.90632 | 318.561087 | 14.940628 | 0.912561 |
| 1 | Polynomial (degree=2) | 350.879731 | 15.995824 | 0.908106 | 331.105434 | 15.14833 | 0.909118 |
| 2 | Polynomial (degree=3) | 345.108668 | 15.746762 | 0.909618 | 318.404012 | 14.764973 | 0.912604 |
| 3 | Polynomial (degree=4) | 339.700171 | 15.508465 | 0.911034 | 313.798757 | 14.735471 | 0.913868 |


## 1.5 Discussion and Interpretation (10 points)

In this section, we provide a **data-driven analysis** of the model results, answering the key questions posed in the assignment.

### Question 1: Which model performs best on the test set and why?

**Answer:**

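**Polynomial (degree=4)** has the best test-set scores, with Test R² = 0.913868 and Test MSE = 313.80. The margin is small, however: linear regression reaches Test R² = 0.912561 (Test MSE = 318.56), a difference of about 0.0013 in R² measured on only 30 test samples. The results therefore suggest at most mild curvature in the relationship between fuel economy and horsepower, since a straight line already captures almost all of the variance that any of these models explains. For degree 4, Test R² is slightly higher than Train R² (0.913868 vs. 0.911034), so there is no sign of overfitting at this degree.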
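
The comparison can be reproduced from the reported table. The sketch below copies the R² values from the performance summary into a small DataFrame (the column labels `Train R2`, `Test R2`, and `Gap (train - test)` are chosen here for illustration), picks the model with the highest test R², and computes its margin over linear regression.

```python
import pandas as pd

# R² values copied from the Part 1 performance table.
scores = pd.DataFrame({
    "Model": [
        "Linear Regression",
        "Polynomial (degree=2)",
        "Polynomial (degree=3)",
        "Polynomial (degree=4)",
    ],
    "Train R2": [0.906320, 0.908106, 0.909618, 0.911034],
    "Test R2": [0.912561, 0.909118, 0.912604, 0.913868],
}).set_index("Model")

# A positive gap would hint at overfitting; here every gap is near zero.
scores["Gap (train - test)"] = scores["Train R2"] - scores["Test R2"]

best_model = scores["Test R2"].idxmax()
margin_over_linear = (
    scores.loc[best_model, "Test R2"] - scores.loc["Linear Regression", "Test R2"]
)

print(scores)
print(f"Best on test set: {best_model} (margin over linear: {margin_over_linear:.6f})")
```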


### Question 2: Does increasing polynomial degree always improve performance? If not, explain what you observe.

**Answer:**

**No**, increasing polynomial degree does not always improve test performance. Train R² rises steadily with degree (0.906 → 0.908 → 0.910 → 0.911), as expected when each model contains the previous one as a special case, but Test R² follows a **non-monotonic pattern**: Linear (0.912561) → degree 2 (0.909118, decreases) → degree 3 (0.912604) → degree 4 (0.913868, best). The drop from Linear to degree 2 demonstrates that added complexity does not guarantee better generalization. Degrees 3 and 4 recover and slightly exceed the linear model, but all four Test R² values lie within 0.005 of each other, so these differences are small.
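
One way to gauge how much weight small differences in Test R² deserve is to see how much the score moves when only the random split changes. The sketch below does this on synthetic data: 100 points drawn from a roughly linear relationship whose slope, intercept, and noise level are arbitrary assumptions, not estimates from the assignment dataset. It refits linear regression over 200 different 70/30 splits and reports the spread of the test R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (all parameters below are assumptions).
rng = np.random.default_rng(0)
mpg = rng.normal(23.0, 4.7, size=100)
hp = 505.0 - 12.6 * mpg + rng.normal(0.0, 18.0, size=100)
X_syn = mpg.reshape(-1, 1)

# Refit the same model on 200 different 70/30 splits.
test_r2 = []
for seed in range(200):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, hp, test_size=0.30, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    test_r2.append(r2_score(y_te, model.predict(X_te)))

test_r2 = np.array(test_r2)
print(f"Test R² over 200 splits: mean={test_r2.mean():.3f}, std={test_r2.std():.3f}")
```

If the split-to-split standard deviation is larger than the gap between two models, a single split cannot reliably rank them. Repeating the same loop on the real data for each degree would give a fairer comparison.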


### Question 3: If a model performs unexpectedly poorly, what are plausible reasons?

**Answer:**

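**Note**: In this specific dataset, **all models perform well**, with Test R² values above 0.90 (ranging from 0.909 to 0.914), indicating a strong relationship between fuel economy and horsepower. Had one of them performed poorly, plausible causes would include overfitting from an excessive polynomial degree, unstable predictions near the edges of the training range, or an unrepresentative split given only 30 test samples.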

---

# EE344 Assignment 1 - Part 2: Weather → Daily Electricity Consumption Prediction

**Dataset**: Electricity Consumption Based On Weather Data (Kaggle)  
**Task**: Build regression models to predict daily electricity consumption using weather features

---

## Models to be trained:
- Linear Regression
- Polynomial Regression (degree 2)
- Polynomial Regression (degree 3)
- Polynomial Regression (degree 4)

**Note**: No regularization (Ridge/Lasso/ElasticNet) will be used as per assignment requirements.


## 2.1 Load and Inspect the Dataset (10 points)

In this section, we:
- Load the CSV file into a pandas DataFrame
- Display basic information about the dataset:
  - Column names and data types
  - Shape (number of rows and columns)
  - Summary statistics
- Check for missing values
- Clearly identify the dependent variable (target) and independent variables (features)

In [15]:
# ============================================================
# Load the dataset
# ============================================================

DATA_PATH = "electricity_consumption_based_weather_dataset.csv"
df = pd.read_csv(DATA_PATH)

print("="*80)
print("DATASET OVERVIEW")
print("="*80)

print("\nColumn names and data types:")
print(df.dtypes)

print("\n" + "="*80)
print("SUMMARY STATISTICS")
print("="*80)
display(df.describe())

print("\n" + "="*80)
print("MISSING VALUES CHECK")
print("="*80)
missing_values = df.isna().sum()
print(missing_values)

if missing_values.sum() == 0:
    print("\n No missing values detected in the dataset.")
else:
    print(f"\nWarning: {missing_values.sum()} missing values found.")


DATASET OVERVIEW

Column names and data types:
date                  object
AWND                 float64
PRCP                 float64
TMAX                 float64
TMIN                 float64
daily_consumption    float64
dtype: object

SUMMARY STATISTICS


| | AWND | PRCP | TMAX | TMIN | daily_consumption |
|---|---|---|---|---|---|
| count | 1418.0 | 1433.0 | 1433.0 | 1433.0 | 1433.0 |
| mean | 2.642313 | 3.800488 | 17.187509 | 9.141242 | 1561.078061 |
| std | 1.140021 | 10.973436 | 10.136415 | 9.028417 | 606.819667 |
| min | 0.0 | 0.0 | -8.9 | -14.4 | 14.218 |
| 25% | 1.8 | 0.0 | 8.9 | 2.2 | 1165.7 |
| 50% | 2.4 | 0.0 | 17.8 | 9.4 | 1542.65 |
| 75% | 3.3 | 1.3 | 26.1 | 17.2 | 1893.608 |
| max | 10.2 | 192.3 | 39.4 | 27.2 | 4773.386 |



MISSING VALUES CHECK
date                  0
AWND                 15
PRCP                  0
TMAX                  0
TMIN                  0
daily_consumption     0
dtype: int64



### Feature Descriptions

From the dataset inspection above, we have the following columns:

**Independent Variables (Features - X):**
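- `AWND` - Average daily wind speed (units are not stated in the file; the observed range of 0 to 10.2 is consistent with m/s)
- `PRCP` - Daily precipitation (the observed maximum of 192.3 is consistent with mm)
- `TMAX` - Maximum temperature of the day (the observed range of -8.9 to 39.4 is consistent with °C)
- `TMIN` - Minimum temperature of the day (the observed range of -14.4 to 27.2 is consistent with °C)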

**Dependent Variable (Target - y):**
- `daily_consumption` - Daily electricity consumption (target variable to predict)

**Additional Column:**
- `date` - Date of the measurement (not used as a feature)


### Handling Missing Values

From the missing values check above, the `AWND` (Average Wind Speed) column has 15 missing values out of 1433 rows (about 1%). No other column has missing values.

**Strategy**: We will drop rows with missing values to ensure clean data for model training. This is appropriate since:
1. Only about 1% of the rows are affected, so very little data is lost
2. It avoids the bias that imputed values could introduce
3. Assignment instructions require clear handling of missing values
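
Before dropping rows, it can help to quantify how much data would be lost. The sketch below shows the check on a toy DataFrame; the column names mirror the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gap as the real data (hypothetical values).
toy = pd.DataFrame({
    "AWND": [2.1, np.nan, 3.4, 1.8, np.nan],
    "TMAX": [17.8, 8.9, 26.1, 20.0, 15.0],
    "daily_consumption": [1500.0, 1700.0, 1900.0, 1400.0, 1600.0],
})

# Count rows that contain at least one missing value.
rows_with_nan = int(toy.isna().any(axis=1).sum())
fraction_lost = rows_with_nan / len(toy)
print(f"Rows affected: {rows_with_nan} of {len(toy)} ({fraction_lost:.0%})")

# Dropping is reasonable when the affected fraction is small.
toy_clean = toy.dropna()
print("Rows kept:", len(toy_clean))
```

Applied to the notebook's `df`, the same calculation gives 15 of 1433 rows, which is the basis for the "about 1%" figure.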

In [20]:
# ============================================================
# Handle missing values
# ============================================================

print("Dataset shape before handling missing values:", df.shape)

# Drop rows with any missing values
df_clean = df.dropna()

print("Dataset shape after removing missing values:", df_clean.shape)
print(f"Rows removed: {len(df) - len(df_clean)}")

# Verify no missing values remain
print("\nVerification - Missing values after cleaning:")
print(df_clean.isna().sum())

Dataset shape before handling missing values: (1433, 6)
Dataset shape after removing missing values: (1418, 6)
Rows removed: 15

Verification - Missing values after cleaning:
date                 0
AWND                 0
PRCP                 0
TMAX                 0
TMIN                 0
daily_consumption    0
dtype: int64


## 2.2 Train/Test Split (70% / 30% Random) (5 points)

We split the dataset into:
- **Training set (70%)**: Used to train the models
- **Test set (30%)**: Used to evaluate model performance on unseen data

A fixed `random_state=42` ensures reproducibility of results.

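**Important**: We exclude the `date` column from the features. It is stored as a string (`object` dtype), so it cannot be passed to the regression models directly, and it is not a weather measurement.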

In [17]:
# ============================================================
# Prepare features (X) and target (y)
# ============================================================

# Features: All weather-related columns (exclude 'date' and target)
feature_columns = ['AWND', 'PRCP', 'TMAX', 'TMIN']
X = df_clean[feature_columns].values

# Target: Daily electricity consumption
y = df_clean['daily_consumption'].values


# ============================================================
# Train-test split (70% train, 30% test)
# ============================================================

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.30, 
    random_state=42
)

print("\n" + "="*80)
print("TRAIN/TEST SPLIT SUMMARY")
print("="*80)
print(f"Training samples:   {X_train.shape[0]} ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test samples:       {X_test.shape[0]} ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"Total samples:      {len(X)}")
print(f"Number of features: {X_train.shape[1]}")


TRAIN/TEST SPLIT SUMMARY
Training samples:   992 (70.0%)
Test samples:       426 (30.0%)
Total samples:      1418
Number of features: 4


## 2.3 Model Training: Linear + Polynomial Regression (15 points)

We will train **four regression models** without any regularization:

1. **Linear Regression**: Assumes a linear relationship between weather features and electricity consumption
2. **Polynomial Regression (degree 2)**: Captures quadratic relationships and two-way feature interactions
3. **Polynomial Regression (degree 3)**: Captures cubic relationships and higher-order interactions
4. **Polynomial Regression (degree 4)**: Captures even more complex relationships

For polynomial models, we use `PolynomialFeatures` to generate polynomial and interaction features, then apply `LinearRegression`.

In [18]:
# ============================================================
# Train all models
# ============================================================

# Dictionary to store all trained models
models = {}

# 1. Linear Regression
model_linear = LinearRegression()
model_linear.fit(X_train, y_train)
models['Linear Regression'] = {
    'model': model_linear,
    'X_train': X_train,
    'X_test': X_test,
    'degree': 1
}

# 2-4. Polynomial Regression (degrees 2, 3, 4)
for degree in [2, 3, 4]: 
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree, include_bias=True)
    
    # Transform training data
    X_train_poly = poly_features.fit_transform(X_train)
    
    # Transform test data (use transform, not fit_transform)
    X_test_poly = poly_features.transform(X_test)
    
    # Train linear regression on polynomial features
    model_poly = LinearRegression()
    model_poly.fit(X_train_poly, y_train)
    
    # Store model and transformed data
    models[f'Polynomial (degree={degree})'] = {
        'model': model_poly,
        'X_train': X_train_poly,
        'X_test': X_test_poly,
        'degree': degree,
        'poly_features': poly_features
    }

## 2.4 Model Evaluation (Train and Test) (10 points)

For each model, we compute three key metrics on both training and test sets:

- **MSE (Mean Squared Error)**: Average of squared differences between predicted and actual values. Lower is better. Penalizes large errors more heavily.
- **MAE (Mean Absolute Error)**: Average of absolute differences. Lower is better. More robust to outliers.
- **R² (Coefficient of Determination)**: Proportion of variance explained by the model. Range: (-∞, 1], where 1 is perfect fit.
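
The different sensitivity of MSE and MAE to large errors can be seen in a small constructed example. In the sketch below, all values are invented: ten predictions are each off by 100, and then one of them is replaced by a prediction that is off by 5000.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.full(10, 1500.0)

# Case 1: every prediction is off by 100.
y_pred_ok = y_true + 100.0

# Case 2: same, except one prediction is off by 5000.
y_pred_outlier = y_pred_ok.copy()
y_pred_outlier[0] = y_true[0] + 5000.0

mae_ok = mean_absolute_error(y_true, y_pred_ok)
mse_ok = mean_squared_error(y_true, y_pred_ok)
mae_out = mean_absolute_error(y_true, y_pred_outlier)
mse_out = mean_squared_error(y_true, y_pred_outlier)

print(f"MAE: {mae_ok:.0f} -> {mae_out:.0f} (x{mae_out / mae_ok:.1f})")
print(f"MSE: {mse_ok:.0f} -> {mse_out:.0f} (x{mse_out / mse_ok:.1f})")
```

A single bad prediction multiplies the MAE by about 6 but the MSE by about 250, so a large jump in MSE with a modest change in MAE points to a few extreme errors rather than uniformly worse predictions.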

In [19]:
# ============================================================
# Evaluate all models
# ============================================================

results = []

for model_name, model_info in models.items():
    model = model_info['model']
    X_train_transformed = model_info['X_train']
    X_test_transformed = model_info['X_test']
    
    # Training set predictions
    y_train_pred = model.predict(X_train_transformed)
    
    # Test set predictions
    y_test_pred = model.predict(X_test_transformed)
    
    # Compute metrics for training set
    train_mse = mean_squared_error(y_train, y_train_pred)
    train_mae = mean_absolute_error(y_train, y_train_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    
    # Compute metrics for test set
    test_mse = mean_squared_error(y_test, y_test_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    # Store results
    results.append({
        'Model': model_name,
        'Train MSE': train_mse,
        'Train MAE': train_mae,
        'Train R²': train_r2,
        'Test MSE': test_mse,
        'Test MAE': test_mae,
        'Test R²': test_r2
    })

# Create DataFrame for better visualization
results_df = pd.DataFrame(results)

print("="*120)
print("MODEL PERFORMANCE SUMMARY")
print("="*120)
display(results_df)

MODEL PERFORMANCE SUMMARY


| | Model | Train MSE | Train MAE | Train R² | Test MSE | Test MAE | Test R² |
|---|---|---|---|---|---|---|---|
| 0 | Linear Regression | 272403.396174 | 384.465016 | 0.276 | 248125.8 | 375.404537 | 0.299333 |
| 1 | Polynomial (degree=2) | 264765.769932 | 379.648753 | 0.2963 | 255268.5 | 379.039083 | 0.279163 |
| 2 | Polynomial (degree=3) | 259249.53487 | 375.952901 | 0.310961 | 265623.7 | 385.235167 | 0.249922 |
| 3 | Polynomial (degree=4) | 251909.339001 | 372.116566 | 0.33047 | 12151490.0 | 578.6422 | -33.313843 |
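
A quick diagnostic on these numbers is the ratio of RMSE (the square root of MSE) to MAE. RMSE is never smaller than MAE, and the ratio grows when the total error is concentrated in a few very large misses. The sketch below computes it from the test metrics reported above. The column labels `Test RMSE` and `RMSE / MAE` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Test metrics copied from the Part 2 performance table.
test_scores = pd.DataFrame({
    "Model": [
        "Linear Regression",
        "Polynomial (degree=2)",
        "Polynomial (degree=3)",
        "Polynomial (degree=4)",
    ],
    "Test MSE": [248125.8, 255268.5, 265623.7, 12151490.0],
    "Test MAE": [375.404537, 379.039083, 385.235167, 578.6422],
}).set_index("Model")

test_scores["Test RMSE"] = np.sqrt(test_scores["Test MSE"])
test_scores["RMSE / MAE"] = test_scores["Test RMSE"] / test_scores["Test MAE"]

print(test_scores.round(3))
```

The ratio stays close to 1.3 for the first three models and jumps to roughly 6 for degree 4, which is consistent with a handful of extreme test predictions driving the degree-4 MSE.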


## 2.5 Discussion and Interpretation (10 points)

In this section, we provide a **data-driven technical discussion** of the model results, answering the key questions posed in the assignment.

### Question 1: Which model generalizes best and what does it tell us about the weather-electricity relationship?

**Answer:**

Based on test set performance, the **Linear Regression model** generalizes best (Test R² = 0.299333, Test MSE = 248125.79). Polynomial models achieve higher R² on the training set (degree=4 reaches 0.330) but lower R² on the test set, which indicates overfitting. This does not mean the weather-electricity relationship is strongly linear, since the linear model explains only about 30% of the test variance. It means that, with these four weather features, the additional polynomial terms fit noise in the training data rather than structure that carries over to unseen data.


### Question 2: Do polynomial models improve the fit compared to linear regression? Why might electricity consumption have nonlinear dependence on weather?

**Answer:**

Polynomial models improve the fit on the **training set** (Train R² of 0.296, 0.311, and 0.330 for degrees 2, 3, and 4, versus 0.276 for linear regression) but perform worse on the **test set** (Test R² of 0.279, 0.250, and -33.314, versus 0.299). The extra terms therefore overfit the training data and do not improve generalization.

Theoretically, electricity consumption may have nonlinear dependence on weather: for example, the relationship between temperature and air conditioning usage (sharp increase in consumption during extreme heat), threshold effects under extreme weather conditions, and interactions between temperature and precipitation. However, on the current dataset, these nonlinear relationships may be masked by noise or are not significant enough, causing more complex models to perform worse instead.
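
The threshold effects described above are often modeled with hinge-shaped features instead of global polynomials. The sketch below is an illustration only and is not part of the assignment's required models. It assumes temperatures in °C, approximates the daily mean as the average of `TMAX` and `TMIN`, and uses a base temperature of 18 °C. All three are assumptions, and the function name `degree_day_features` is hypothetical.

```python
import numpy as np

BASE_TEMP_C = 18.0  # assumed comfort threshold, not taken from the dataset


def degree_day_features(tmax, tmin, base=BASE_TEMP_C):
    """Return columns [heating, cooling]: how far the mean temperature
    falls below or rises above the base, and zero otherwise."""
    tmean = (np.asarray(tmax, dtype=float) + np.asarray(tmin, dtype=float)) / 2.0
    heating = np.maximum(base - tmean, 0.0)
    cooling = np.maximum(tmean - base, 0.0)
    return np.column_stack([heating, cooling])


# Three hypothetical days: cold, mild, hot.
tmax_demo = np.array([5.0, 20.0, 35.0])
tmin_demo = np.array([-3.0, 14.0, 25.0])
features = degree_day_features(tmax_demo, tmin_demo)
print(features)
```

Each column is zero on one side of the threshold and linear on the other, so a linear model on these two columns can represent consumption that rises in both cold and hot weather without high-degree terms.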


### Question 3: If higher-degree models perform worse on test set, explain using evidence from metrics

**Answer:**

Higher-degree models perform worse on the test set, which is a classic **overfitting** phenomenon. Evidence is as follows:

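- **Training error decreases but test error increases**: The degree=4 model's Train R² improves from 0.276 (linear) to 0.330, but Test R² drops from 0.299 to -33.314 (a negative value means the model predicts worse than always predicting the test-set mean). Test MSE surges from 248,125 to 12,151,490, an increase of approximately 49 times, while Test MAE rises only from 375.4 to 578.6. Because MSE squares each error, this pattern suggests that a small number of extreme predictions dominate the degree=4 test error.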

- **Train-test performance gap widens**: At degree=3, the gap between Train R² (0.311) and Test R² (0.250) is 0.061; at degree=4, the gap expands to 33.644, indicating the model has overlearned noise and details in the training data and cannot generalize to new data.

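- **Trade-off between model complexity and generalization**: As polynomial degree increases, the number of model parameters grows rapidly. With 4 input features, the degree=4 expansion has 70 columns including the bias term (C(8, 4) = 70), compared with 4 features plus an intercept for the linear model. With 992 training samples and unscaled inputs (PRCP reaches 192.3, so its fourth power exceeds 10⁹), the fitted polynomial can produce very large predictions for test points that lie away from the bulk of the training data.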
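
The growth in the number of columns can be verified directly. For `n` input features and degree `d`, the full polynomial expansion including the bias column has C(n + d, d) terms. The sketch below compares that formula with what `PolynomialFeatures` reports for the four weather features. The all-zero `X_dummy` array exists only to let the transformer infer the input width.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 4                       # AWND, PRCP, TMAX, TMIN
X_dummy = np.zeros((1, n_features))  # placeholder row, values are irrelevant

counts = {}
for degree in [1, 2, 3, 4]:
    expected = comb(n_features + degree, degree)
    actual = PolynomialFeatures(degree=degree, include_bias=True).fit(X_dummy).n_output_features_
    counts[degree] = (expected, actual)
    print(f"degree={degree}: formula={expected}, PolynomialFeatures={actual}")
```

Going from degree 1 to degree 4 raises the column count from 5 to 70, while the number of training rows stays at 992.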


### Question 4: If none of the models achieve good test performance, provide reasons supported by outputs

**Answer:**

All models achieve poor test performance (best Test R² is only 0.299, indicating the model explains only about 30% of the variance). Main reasons include:

1. **Limited feature set**: Only 4 weather features (AWND, PRCP, TMAX, TMIN) are used, which may miss key driving factors. Electricity consumption is also influenced by unmodeled factors such as occupancy, weekday/weekend patterns, seasonality, economic activity, and user behavior.

2. **Wide spread and extreme values in the target**: From the summary statistics, the standard deviation of `daily_consumption` (606.82) is large relative to the mean (1561.08), and extreme values exist (minimum 14.22, maximum 4773.39). A large spread is not noise in itself, but the extremes are likely outliers that weather alone cannot explain, and they inflate the squared-error metrics.

3. **Weak linear relationship**: Even the best linear model explains only 30% of the variance, suggesting that the linear relationship between weather features and electricity consumption is weak. There may exist complex nonlinear or interaction effects that current models cannot effectively capture.
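
As an illustration of the first point, calendar information such as weekday/weekend and month could be derived from the unused `date` column. The sketch below is a hypothetical extension that goes beyond the weather-only features the assignment asks for. It assumes the dates parse as ISO-style strings (the real file's date format is not shown in the outputs above), and the column names `month`, `day_of_week`, and `is_weekend` are invented here.

```python
import pandas as pd

# Toy frame with an assumed ISO date format (hypothetical values).
toy = pd.DataFrame({
    "date": ["2021-01-02", "2021-01-04", "2021-07-15"],
    "TMAX": [3.0, 5.5, 31.0],
})

dates = pd.to_datetime(toy["date"])
toy["month"] = dates.dt.month              # coarse seasonality signal
toy["day_of_week"] = dates.dt.dayofweek    # Monday=0 ... Sunday=6
toy["is_weekend"] = dates.dt.dayofweek >= 5

print(toy)
```

Features like these would let a model separate weekday from weekend demand and summer from winter demand, which the four weather columns cannot express on their own.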
