## Part 1: Fuel Consumption → Horsepower Prediction ##

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

In [2]:
#1.1

#Load CSV to pandas dataframe.
DATA_PATH = "FuelEconomy.csv"           
df = pd.read_csv(DATA_PATH)

# Summary Statistics Printed
print("\nSummary statistics:")
display(df.describe(include="all"))

# Missing values per column
print("\nMissing values per column:")
display(df.isna().sum())

# Drop rows with missing values
df = df.dropna()


Summary statistics:


Unnamed: 0,Horse Power,Fuel Economy (MPG)
count,100.0,100.0
mean,213.67619,23.178501
std,62.061726,4.701666
min,50.0,10.0
25%,174.996514,20.439516
50%,218.928402,23.143192
75%,251.706476,26.089933
max,350.0,35.0



Missing values per column:


Horse Power           0
Fuel Economy (MPG)    0
dtype: int64

### Missing Values: ###

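I handled missing values by removing any row that contained a null entry (`df.dropna()` drops rows, not columns). This dataset reports 0 missing values in both columns, so no rows were actually removed. The step is a safeguard: every sample used for training and testing has both the feature and the target, so the model is not fitted on incomplete observations.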
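
For reference, `dropna()` with default arguments removes rows that contain a null value, while removing columns requires `axis=1`. A minimal sketch on a hypothetical toy frame (the values are made up and are not the FuelEconomy data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing value (not the real CSV).
toy = pd.DataFrame({
    "Horse Power": [120.0, 200.0, np.nan, 310.0],
    "Fuel Economy (MPG)": [30.1, 24.5, 22.0, 15.8],
})

rows_dropped = toy.dropna()        # default axis=0: drops the row containing NaN
cols_dropped = toy.dropna(axis=1)  # axis=1: drops the whole column containing NaN

print(rows_dropped.shape)  # (3, 2)
print(cols_dropped.shape)  # (4, 1)
```

Dropping rows keeps both variables and loses one sample; dropping columns would remove a variable entirely.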

In [3]:
#1.2

# Divide into independent and dependent variables. 
X = df.drop("Horse Power", axis=1)
y = df["Horse Power"]

# Create test and train splits. 
x_train, x_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42)


In [4]:
#1.3 - Methods

def train_polynomial(degree, X_train, y_train):
    """Trains a polynomial model with linear regression."""
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lr", LinearRegression())
    ])
    model.fit(X_train, y_train)
    return model


In [5]:
#1.3 - Using Methods

# Linear Model Training
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

# 2nd Degree Polynomial Model Training
poly_2 = train_polynomial(2, x_train, y_train)

# 3rd Degree Polynomial Model Training
poly_3 = train_polynomial(3, x_train, y_train)

# 4th Degree Polynomial Model Training
poly_4 = train_polynomial(4, x_train, y_train)


In [6]:
#1.4 - Methods

def compute_metrics(y_true, y_pred):
    """Return MSE, MAE, R^2."""
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R^2": r2_score(y_true, y_pred),
    }

def metrics_table(model, X_train, X_test, y_train, y_test, model_name):
    """Return a table with train and test MSE, MAE, R^2 for the model."""
    yhat_train = model.predict(X_train)
    yhat_test  = model.predict(X_test)
    
    train = compute_metrics(y_train, yhat_train)
    test = compute_metrics(y_test, yhat_test)
    
    rows = []
    rows.append({
        "Model": model_name,
        "Train MSE": train["MSE"],
        "Train MAE": train["MAE"],
        "Train R^2": train["R^2"],
        "Test MSE": test["MSE"],
        "Test MAE": test["MAE"],
        "Test R^2": test["R^2"],
    })
    return pd.DataFrame(rows)


In [7]:
#1.4

# Creating Linear Model Metrics Row
rowsLin = metrics_table(lin_reg, x_train, x_test, y_train, y_test, "Linear_Regression")

# Creating Polynomial 2 Model Metrics Row
rows2 = metrics_table(poly_2, x_train, x_test, y_train, y_test, "Polynomial 2nd Degree")

# Creating Polynomial 3 Model Metrics Row
rows3 = metrics_table(poly_3, x_train, x_test, y_train, y_test, "Polynomial 3rd Degree")

# Creating Polynomial 4 Model Metrics Row
rows4 = metrics_table(poly_4, x_train, x_test, y_train, y_test, "Polynomial 4th Degree")

# Concatenate the rows into a table. 
results = pd.concat([rowsLin, rows2, rows3, rows4], ignore_index = True)
display(results)


Unnamed: 0,Model,Train MSE,Train MAE,Train R^2,Test MSE,Test MAE,Test R^2
0,Linear_Regression,357.69918,16.061689,0.90632,318.561087,14.940628,0.912561
1,Polynomial 2nd Degree,350.879731,15.995824,0.908106,331.105434,15.14833,0.909118
2,Polynomial 3rd Degree,345.108668,15.746762,0.909618,318.404012,14.764973,0.912604
3,Polynomial 4th Degree,339.700171,15.508465,0.911034,313.798757,14.735471,0.913868


### Discussion for Part 1: ###
On the test set, the 4th degree polynomial has the highest R^2 (0.9139) and the lowest MSE and MAE, so it performed best on this split. However, increasing the polynomial's degree does not always improve performance. Linear regression (a polynomial of degree 1) beat the 2nd degree polynomial on the test set: its test R^2 is higher (0.9126 vs 0.9091) and its test MSE and MAE are lower, even though the 2nd degree model fits the training data slightly better. The 2nd degree polynomial therefore has the worst test performance of the four models. A likely explanation is that the added quadratic term fitted some noise in the training data that did not carry over to the test set. The differences between all four models are small (test R^2 ranges from 0.909 to 0.914 on a 30-row test set), so the ranking could change with a different split.
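
Because a single 70/30 split of 100 rows leaves only 30 test rows, the ranking of the four models may depend on the split. One way to check this is k-fold cross-validation, which averages the score over several splits. The sketch below is not part of the assignment code: it uses synthetic data as a stand-in for the CSV (the linear trend and noise level are assumptions), so its numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the MPG -> horsepower data (assumption, not the real CSV).
rng = np.random.default_rng(0)
mpg = rng.uniform(10, 35, size=100).reshape(-1, 1)
hp = 500 - 12 * mpg.ravel() + rng.normal(0, 18, size=100)

# 5 shuffled folds, reused for every degree so the comparison is like-for-like.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_r2 = {}
for degree in (1, 2, 3, 4):
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("lr", LinearRegression()),
    ])
    scores = cross_val_score(model, mpg, hp, cv=cv, scoring="r2")
    cv_r2[degree] = (scores.mean(), scores.std())

for degree, (mean, std) in cv_r2.items():
    print(f"degree {degree}: R^2 = {mean:.3f} +/- {std:.3f}")
```

If the spread across folds is larger than the gap between two degrees, that gap should not be read as a real difference.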


## Part 2: Weather → Daily Electricity Consumption Prediction ##

In [21]:
#2.1

#Load CSV file. 
DATA_PATH = "electricity_consumption_based_weather_dataset.csv"
df2 = pd.read_csv(DATA_PATH)

# Print Columns. 
print("\nColumns:")
print(df2.columns.tolist())

# Print Shape. 
print("\nShape:", df2.shape)

# Print summary statistics. 
print("\nSummary statistics:")
display(df2.describe(include="all"))

# DEPENDENT VARIABLE
TARGET_COL = "daily_consumption"

# Print missing values per column
print("\nMissing values per column:")
display(df2.isna().sum())

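# Drop rows with missing values into a separate frame.
# Note: the cells below work from df2 and drop NaN rows after the split.
df2_clean = df2.dropna()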


Columns:
['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption']

Shape: (1433, 6)

Summary statistics:


Unnamed: 0,date,AWND,PRCP,TMAX,TMIN,daily_consumption
count,1433,1418.0,1433.0,1433.0,1433.0,1433.0
unique,1433,,,,,
top,2006-12-16,,,,,
freq,1,,,,,
mean,,2.642313,3.800488,17.187509,9.141242,1561.078061
std,,1.140021,10.973436,10.136415,9.028417,606.819667
min,,0.0,0.0,-8.9,-14.4,14.218
25%,,1.8,0.0,8.9,2.2,1165.7
50%,,2.4,0.0,17.8,9.4,1542.65
75%,,3.3,1.3,26.1,17.2,1893.608



Missing values per column:


date                  0
AWND                 15
PRCP                  0
TMAX                  0
TMIN                  0
daily_consumption     0
dtype: int64

In [18]:
#2.2

# Separate the features (independent variables) from the target (dependent variable).
# Date is dropped because it is a string, not a number.
X = df2.drop(columns=["date", "daily_consumption"])
y = df2["daily_consumption"]

# Create test and train splits. 
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Drop rows with NaN in the features and keep the targets aligned by index.
x_train = x_train.dropna()
y_train = y_train.loc[x_train.index]

x_test = x_test.dropna()
y_test = y_test.loc[x_test.index]


In [19]:
#2.3

# Linear Model Training
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

# 2nd Degree Polynomial Model Training
poly_2 = train_polynomial(2, x_train, y_train)

# 3rd Degree Polynomial Model Training
poly_3 = train_polynomial(3, x_train, y_train)

# 4th Degree Polynomial Model Training
poly_4 = train_polynomial(4, x_train, y_train)

In [22]:
#2.4

# Creating Linear Model Metrics Row
rowsLin = metrics_table(lin_reg, x_train, x_test, y_train, y_test, "Linear_Regression")

# Creating Polynomial 2 Model Metrics Row
rows2 = metrics_table(poly_2, x_train, x_test, y_train, y_test, "Polynomial 2nd Degree")

# Creating Polynomial 3 Model Metrics Row
rows3 = metrics_table(poly_3, x_train, x_test, y_train, y_test, "Polynomial 3rd Degree")

# Creating Polynomial 4 Model Metrics Row
rows4 = metrics_table(poly_4, x_train, x_test, y_train, y_test, "Polynomial 4th Degree")

# Concatenate all metrics together for table. 
results = pd.concat([rowsLin, rows2, rows3, rows4], ignore_index = True)
display(results)


Unnamed: 0,Model,Train MSE,Train MAE,Train R^2,Test MSE,Test MAE,Test R^2
0,Linear_Regression,276355.629175,387.812202,0.272348,238928.533171,367.573044,0.309317
1,Polynomial 2nd Degree,269661.813847,382.840308,0.289973,236501.999706,364.4351,0.316331
2,Polynomial 3rd Degree,263018.514825,378.848749,0.307465,239999.837927,370.812331,0.30622
3,Polynomial 4th Degree,255303.163264,375.58222,0.32778,367876.042896,410.880431,-0.063439


### Discussion for Part 2: ###

For this dataset, the 2nd degree polynomial has the best test performance (test R^2 of 0.316, with the lowest test MSE and MAE). The 4th degree polynomial performed worst by far, with a negative test R^2, while the other three models were close to each other. This suggests that the relationship between weather and electricity usage is better approximated by a quadratic than by a line, although the improvement over linear regression is small (0.316 vs 0.309). The 2nd degree polynomial is the only higher-degree model that beats linear regression on the test set.

Electricity consumption may depend nonlinearly on weather because electricity is used for both heating and cooling. When it gets hot, people turn on air conditioning, which increases consumption. When it is extremely cold, the heating is turned on, which also increases consumption. Consumption is lowest at moderate temperatures, between the two extremes. This gives a U-shaped curve, which a quadratic can approximate.
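
The U-shape argument can be sketched with a quadratic fit, whose minimum sits at the vertex of the parabola. The data below is synthetic: the 15 degree comfort point, the curvature and the noise level are assumptions chosen for illustration, not values estimated from the electricity dataset.

```python
import numpy as np

# Synthetic U-shaped relationship (assumption): usage is lowest near 15 deg C.
rng = np.random.default_rng(1)
temp = rng.uniform(-10, 35, size=300)  # daily temperature, deg C
usage = 1000 + 3.0 * (temp - 15.0) ** 2 + rng.normal(0, 100, size=300)

# Fit usage ~ a*T^2 + b*T + c; np.polyfit returns the highest power first.
a, b, c = np.polyfit(temp, usage, deg=2)

# For a > 0 the parabola opens upward and its minimum is at T = -b / (2a).
vertex = -b / (2 * a)
print(f"a = {a:.2f}, temperature of minimum predicted usage = {vertex:.1f}")
```

A positive leading coefficient with a vertex inside the observed temperature range is what the heating-and-cooling explanation would predict.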

The metrics show that the higher-degree models overfit. The training error decreases steadily as the degree increases (train MSE falls from about 276,356 for the linear model to about 255,303 for degree 4), but test performance does not improve beyond degree 2 and collapses at degree 4. The 3rd and 4th degree polynomials are most likely fitting noise in the training data rather than the underlying relationship that the 2nd degree polynomial captures.
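
One reason the higher degrees have more room to fit noise is that the number of polynomial terms grows quickly with the degree when there are four input features (AWND, PRCP, TMAX, TMIN). A short sketch that counts the terms `PolynomialFeatures` generates, using a dummy array with four columns in place of the real data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Dummy input with 4 columns, standing in for AWND, PRCP, TMAX, TMIN.
X_dummy = np.zeros((1, 4))

n_terms = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X_dummy)
    n_terms[degree] = poly.n_output_features_

print(n_terms)  # {1: 4, 2: 14, 3: 34, 4: 69}
```

The degree 4 model estimates 69 coefficients plus an intercept, compared with 4 for the linear model, from the same training rows.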

None of the models achieved high accuracy. A high R^2 would be close to 1, and every test R^2 here is below 0.32. Likely reasons are unmodelled features and the loss of some data. Many factors that drive electricity consumption are absent from the dataset, such as the hours of daylight (season), the number of occupants per household, and more. In addition, the date column was dropped because it is a string, which may discard seasonal information, and the 15 rows with a missing average wind speed (AWND) were removed, so the models were fitted on slightly less data.
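
As a reference for reading these R^2 values: a model that always predicts the mean of the targets scores exactly 0, and a model that does worse than the mean scores below 0, which is what the negative test R^2 of the 4th degree polynomial indicates. A small sketch with made-up numbers (not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up consumption values for illustration only.
y_true = np.array([1200.0, 1500.0, 1800.0, 2100.0])

# Baseline: always predict the mean of the true values.
mean_pred = np.full_like(y_true, y_true.mean())

# A deliberately bad model: predictions in reverse order.
bad_pred = y_true[::-1].copy()

r2_mean = r2_score(y_true, mean_pred)
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean)  # 0.0
print(r2_bad)   # -3.0
```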