<a href="https://colab.research.google.com/github/kuhunain/Data-Driven-ML/blob/main/PredictHPFromFuelConsumption.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
'''
-- Imports --
'''

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import matplotlib.pyplot as plt

In [5]:
'''
Load CSV File from Kaggle - Fuel Consumption based on HP
'''
DATA_PATH = "FuelEconomy.csv"
df = pd.read_csv(DATA_PATH)

print("Shape - ", df.shape)
print("\nColumns - ")
print(df.columns.tolist())

display(df.head())

print("\nsummary stats:")
display(df.describe(include="all"))

print("\nmissing values in each col:")
# isna().sum() counts the missing values in each column
display(df.isna().sum())

Shape -  (100, 2)

Columns - 
['Horse Power', 'Fuel Economy (MPG)']


Unnamed: 0,Horse Power,Fuel Economy (MPG)
0,118.770799,29.344195
1,176.326567,24.695934
2,219.262465,23.95201
3,187.310009,23.384546
4,218.59434,23.426739



summary stats:


Unnamed: 0,Horse Power,Fuel Economy (MPG)
count,100.0,100.0
mean,213.67619,23.178501
std,62.061726,4.701666
min,50.0,10.0
25%,174.996514,20.439516
50%,218.928402,23.143192
75%,251.706476,26.089933
max,350.0,35.0



missing values in each col:


Unnamed: 0,0
Horse Power,0
Fuel Economy (MPG),0


In [8]:
'''
-- Train/Test Split (70/30) --
'''
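# Feature: fuel economy (MPG). Double brackets keep X as a 2-D DataFrame, as scikit-learn expects.
X = df[["Fuel Economy (MPG)"]]
# Target: horsepower.
y = df["Horse Power"]

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)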

In [9]:
'''
-- Model Training Linear + Poly Regression --
'''
models = {
    "Linear": LinearRegression(),
    "Polynomial deg-2": Pipeline([
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("lr", LinearRegression())
    ]),
    "Polynomial deg-3": Pipeline([
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("lr", LinearRegression())
    ]),
    "Polynomial deg-4": Pipeline([
        ("poly", PolynomialFeatures(degree=4, include_bias=False)),
        ("lr", LinearRegression())
    ])
}

In [11]:
'''
 -- Model Eval: Train/Test on MSE, MAE, R2 --
'''
def compute_metrics(y_true, y_pred):
    """Return the MSE, MAE, and R^2 for the given true and predicted values."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R^2": r2_score(y_true, y_pred),
    }


print("Table for Model Eval")
rows = []

for name, model in models.items():
    model.fit(X_train, y_train)

    yhat_train = model.predict(X_train)
    yhat_test  = model.predict(X_test)

    # Metrics
    train_m = compute_metrics(y_train, yhat_train)
    test_m  = compute_metrics(y_test, yhat_test)

    rows.append({
        "Model": name,
        "Train MSE": train_m["MSE"],
        "Train MAE": train_m["MAE"],
        "Train R^2": train_m["R^2"],
        "Test MSE": test_m["MSE"],
        "Test MAE": test_m["MAE"],
        "Test R^2": test_m["R^2"]
    })

results_df = pd.DataFrame(rows)
display(results_df)

Table for Model Eval


Unnamed: 0,Model,Train MSE,Train MAE,Train R^2,Test MSE,Test MAE,Test R^2
0,Linear,357.69918,16.061689,0.90632,318.561087,14.940628,0.912561
1,Polynomial deg-2,350.879731,15.995824,0.908106,331.105434,15.14833,0.909118
2,Polynomial deg-3,345.108668,15.746762,0.909618,318.404012,14.764973,0.912604
3,Polynomial deg-4,339.700171,15.508465,0.911034,313.798757,14.735471,0.913868


Discussion and interpretation

Use your results to answer the following questions with a data-driven explanation:

• Which model performs best on the test set and why?

**The degree-4 polynomial performs best on the test set. Its test R^2 of 0.913868 is the highest of the four models, and it also has the lowest test MSE (313.798757) and test MAE (14.735471). By comparison, the linear regression model reaches a slightly lower test R^2 of 0.912561. The margin is small (about 0.0013 in R^2), so the degree-4 model captures only a little more of the non-linear relationship between horsepower and fuel economy than the straight-line fit does.**
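
The ranking can be checked directly from the reported numbers. The sketch below is a minimal illustration in plain Python, not part of the original notebook: the values are copied from the test columns of the evaluation table above, and the names `test_metrics`, `ranked`, and `best_name` are made up for this example.

```python
# Test-set metrics copied from the evaluation table above.
test_metrics = {
    "Linear": {"MSE": 318.561087, "MAE": 14.940628, "R2": 0.912561},
    "Polynomial deg-2": {"MSE": 331.105434, "MAE": 15.14833, "R2": 0.909118},
    "Polynomial deg-3": {"MSE": 318.404012, "MAE": 14.764973, "R2": 0.912604},
    "Polynomial deg-4": {"MSE": 313.798757, "MAE": 14.735471, "R2": 0.913868},
}

# Sort models from highest to lowest test R^2.
ranked = sorted(test_metrics.items(), key=lambda kv: kv[1]["R2"], reverse=True)
best_name, best = ranked[0]

# Show how far each model trails the best one.
for name, metrics in ranked:
    behind = best["R2"] - metrics["R2"]
    print(f"{name:<18} test R^2 = {metrics['R2']:.6f}  (behind best by {behind:.6f})")
```

Sorting by test MSE or test MAE instead puts the same model first, which is why the three metrics agree on the answer here.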

• Does increasing polynomial degree always improve performance? If not, explain what you observe.

**Not necessarily. The linear regression model ranks third of the four models by test R^2 (0.912561), yet the degree-2 polynomial, which is one degree higher, does worse with a test R^2 of 0.909118. Degrees 3 and 4 then recover and edge past the linear model (0.912604 and 0.913868). So although a higher degree makes the model more flexible and lets it pick up more structure in the data, that added complexity does not guarantee better test performance.**
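
One way to see this is to compare how train and test R^2 change as the degree goes up. The sketch below is only an illustration using the R^2 values from the evaluation table above, with the linear model treated as degree 1. The helper `is_strictly_increasing` is a hypothetical name introduced for this example.

```python
# R^2 by polynomial degree, copied from the evaluation table above
# (the linear model is treated as degree 1).
degrees = [1, 2, 3, 4]
train_r2 = [0.90632, 0.908106, 0.909618, 0.911034]
test_r2 = [0.912561, 0.909118, 0.912604, 0.913868]


def is_strictly_increasing(values):
    """Return True if every value is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))


train_monotonic = is_strictly_increasing(train_r2)
test_monotonic = is_strictly_increasing(test_r2)

# Change in test R^2 for each one-degree step (1->2, 2->3, 3->4).
test_steps = [later - earlier for earlier, later in zip(test_r2, test_r2[1:])]

print("Train R^2 improves at every degree:", train_monotonic)
print("Test R^2 improves at every degree: ", test_monotonic)
for degree, step in zip(degrees[1:], test_steps):
    print(f"degree {degree - 1} -> {degree}: test R^2 change = {step:+.6f}")
```

Train R^2 rises at every step, which is expected because each higher-degree least-squares fit contains the lower-degree one as a special case. Test R^2 has no such guarantee, and it drops at the step from degree 1 to degree 2.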

• If a model performs unexpectedly poorly (e.g., low R^2 or large test error), propose at least two plausible reasons, such as:
– underfitting vs overfitting,
– weak relationship between features and target,
– outliers or noise in the data,
– insufficient feature information for predicting HP.

**In this case none of the models performed poorly, but I think there are several reasons why one could. The first is overfitting: if a higher-degree polynomial model shows a big gap between its train R^2 and its test R^2, I would attribute that to the model fitting noise in the training data instead of the underlying relationship. The second is insufficient feature information: this dataset has only one feature (fuel economy) for predicting horsepower, so any variation in horsepower that fuel economy does not explain cannot be captured, which could lead to low or erratic test results.**
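
The train/test gap check described above can be sketched as follows. This is a rough illustration, not a definitive test for overfitting. The R^2 pairs are copied from the evaluation table above, while `overfit_gap` and the `GAP_THRESHOLD` cutoff of 0.05 are assumptions made up for this example.

```python
# (train R^2, test R^2) pairs copied from the evaluation table above.
r2_scores = {
    "Linear": (0.90632, 0.912561),
    "Polynomial deg-2": (0.908106, 0.909118),
    "Polynomial deg-3": (0.909618, 0.912604),
    "Polynomial deg-4": (0.911034, 0.913868),
}

# Hypothetical cutoff: flag a model when train R^2 exceeds test R^2 by more than this.
GAP_THRESHOLD = 0.05


def overfit_gap(train_r2, test_r2):
    """Return train R^2 minus test R^2; a large positive value suggests overfitting."""
    return train_r2 - test_r2


gaps = {name: overfit_gap(train, test) for name, (train, test) in r2_scores.items()}
flagged = [name for name, gap in gaps.items() if gap > GAP_THRESHOLD]

for name, gap in gaps.items():
    print(f"{name:<18} train - test R^2 = {gap:+.6f}")
print("Models flagged for possible overfitting:", flagged)
```

For these results every gap is slightly negative, meaning test R^2 is a little higher than train R^2 for all four models, so none of them is flagged.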