<a href="https://colab.research.google.com/github/kuhunain/Data-Driven-ML/blob/main/PredictWeatherFromElectricity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
'''
-- Imports --
'''

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [11]:
'''
Load CSV File from Kaggle - electricity based weather report
'''
DATA_PATH = "electricity_consumption_based_weather_dataset.csv"
df = pd.read_csv(DATA_PATH)

# set dependent variable
TARGET_COL = "daily_consumption"

print("Shape - ", df.shape)
print("\nColumns - ")
print(df.columns.tolist())

display(df.head())

print("\nsummary stats:")
display(df.describe(include="all"))

print("\nmissing values in each col:")
# count missing values per column with isna()
display(df.isna().sum())

Shape -  (1433, 6)

Columns - 
['date', 'AWND', 'PRCP', 'TMAX', 'TMIN', 'daily_consumption']


Unnamed: 0,date,AWND,PRCP,TMAX,TMIN,daily_consumption
0,2006-12-16,2.5,0.0,10.6,5.0,1209.176
1,2006-12-17,2.6,0.0,13.3,5.6,3390.46
2,2006-12-18,2.4,0.0,15.0,6.7,2203.826
3,2006-12-19,2.4,0.0,7.2,2.2,1666.194
4,2006-12-20,2.4,0.0,7.2,1.1,2225.748



summary stats:


Unnamed: 0,date,AWND,PRCP,TMAX,TMIN,daily_consumption
count,1433,1418.0,1433.0,1433.0,1433.0,1433.0
unique,1433,,,,,
top,2010-11-26,,,,,
freq,1,,,,,
mean,,2.642313,3.800488,17.187509,9.141242,1561.078061
std,,1.140021,10.973436,10.136415,9.028417,606.819667
min,,0.0,0.0,-8.9,-14.4,14.218
25%,,1.8,0.0,8.9,2.2,1165.7
50%,,2.4,0.0,17.8,9.4,1542.65
75%,,3.3,1.3,26.1,17.2,1893.608



missing values in each col:


Unnamed: 0,0
date,0
AWND,15
PRCP,0
TMAX,0
TMIN,0
daily_consumption,0


In [17]:
'''
 -- Drop missing (15 rows from AWND) --
'''
df = df.dropna()

print("\nmissing values in each col:")
# re-check missing values per column with isna() after dropping rows
display(df.isna().sum())

# find out what data types are
print(df.dtypes)



missing values in each col:


Unnamed: 0,0
date,0
AWND,0
PRCP,0
TMAX,0
TMIN,0
daily_consumption,0


date                  object
AWND                 float64
PRCP                 float64
TMAX                 float64
TMIN                 float64
daily_consumption    float64
dtype: object


In [20]:
'''
-- Train/Test Split (70/30) --
'''
# drop the date column: it is a string (object dtype) and cannot be used directly as a numeric feature
X = df.drop(columns=[TARGET_COL, "date"])
y = df[TARGET_COL]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [21]:
'''
-- Model Training Linear + Poly Regression --
'''
models = {
    "Linear": LinearRegression(),
    "Polynomial deg-2": Pipeline([
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("lr", LinearRegression())
    ]),
    "Polynomial deg-3": Pipeline([
        ("poly", PolynomialFeatures(degree=3, include_bias=False)),
        ("lr", LinearRegression())
    ]),
    "Polynomial deg-4": Pipeline([
        ("poly", PolynomialFeatures(degree=4, include_bias=False)),
        ("lr", LinearRegression())
    ])
}

In [22]:
'''
 -- Model Eval: Train/Test on MSE, MAE, R2 --
'''
def compute_metrics(y_true, y_pred):
    """ find the mse, mae, and r^2 """
    return {
        "MSE": mean_squared_error(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R^2": r2_score(y_true, y_pred),
    }

print("Table for Model Eval")
rows = []

for name, model in models.items():
    model.fit(X_train, y_train)

    yhat_train = model.predict(X_train)
    yhat_test  = model.predict(X_test)

    # Metrics
    train_m = compute_metrics(y_train, yhat_train)
    test_m  = compute_metrics(y_test, yhat_test)

    rows.append({
        "Model": name,
        "Train MSE": train_m["MSE"],
        "Train MAE": train_m["MAE"],
        "Train R^2": train_m["R^2"],
        "Test MSE": test_m["MSE"],
        "Test MAE": test_m["MAE"],
        "Test R^2": test_m["R^2"]
    })

results_df = pd.DataFrame(rows)
display(results_df)

Table for Model Eval


Unnamed: 0,Model,Train MSE,Train MAE,Train R^2,Test MSE,Test MAE,Test R^2
0,Linear,272403.396174,384.465016,0.276,248125.8,375.404537,0.299333
1,Polynomial deg-2,264765.769932,379.648753,0.2963,255268.5,379.039083,0.279163
2,Polynomial deg-3,259249.53487,375.952901,0.310961,265623.7,385.235167,0.249922
3,Polynomial deg-4,251909.339001,372.116566,0.33047,12151490.0,578.642201,-33.313844


Discussion

• Which model generalizes best (best test performance), and what does that tell you about the
relationship between weather and electricity usage?

**The linear regression model generalizes best: it has the highest test R^2 (0.299333) and the lowest test MSE (248125.8). Test performance drops steadily as the degree increases (0.279163 at degree 2, 0.249922 at degree 3), so the extra polynomial terms are not capturing structure that carries over to unseen data. This suggests that the part of electricity usage these weather features can explain is roughly linear. However, a test R^2 of about 0.30 means weather accounts for only around 30% of the variance in daily consumption, so the relationship is weak rather than strongly linear.**
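
As a reference for reading the R^2 values quoted above, here is a minimal sketch of the usual definition, R^2 = 1 - SS_res / SS_tot. The helper name `r2` and the toy numbers are illustrative assumptions, not part of the notebook; the notebook itself uses `r2_score` from scikit-learn.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared residuals between targets and predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the targets
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets (made up for illustration only)
y = [1.0, 2.0, 3.0, 4.0]

print(r2(y, y))                      # perfect predictions
print(r2(y, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean
print(r2(y, [4.0, 3.0, 2.0, 1.0]))   # predictions worse than the mean
```

A score of 1 is a perfect fit, 0 matches a constant prediction of the mean, and a negative score means the model does worse than that constant baseline.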

• Do polynomial models improve the fit compared to linear regression? If yes, why might electricity
consumption have nonlinear dependence on weather?

**Polynomial models improve the fit on the training data only. Training R^2 rises with degree (0.276 for linear, then 0.2963, 0.310961, and 0.33047 for degrees 2, 3, and 4), which is expected because each higher-degree model has more terms available to fit the training set. That improvement does not carry over to the test data, where R^2 falls as the degree increases. Electricity consumption could still depend nonlinearly on weather because of temperature extremes: at mild temperatures little electricity goes to heating or cooling, but usage rises when it is very cold or very hot, which would produce a U-shaped curve rather than a straight line.**
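
One common way to express the heating and cooling effect described above is with heating and cooling degree-day features. The sketch below is an illustration, not something the notebook does: the function name `degree_days` and the 18.0 base temperature are assumptions, and it assumes TMAX and TMIN are in degrees Celsius.

```python
def degree_days(tmax, tmin, base=18.0):
    """Return (heating_degree_days, cooling_degree_days) for one day.

    base is an assumed comfort temperature in degrees Celsius.
    """
    mean_temp = (tmax + tmin) / 2.0
    hdd = max(0.0, base - mean_temp)  # how far below the base (heating need)
    cdd = max(0.0, mean_temp - base)  # how far above the base (cooling need)
    return hdd, cdd


# First row of the dataset shown above: TMAX=10.6, TMIN=5.0 (a cold day)
print(degree_days(10.6, 5.0))

# Hypothetical hot day, not taken from the dataset
print(degree_days(32.0, 22.0))
```

Each feature is zero on one side of the base temperature and grows linearly on the other, so a plain linear regression on both features can represent a U-shaped response without high-degree polynomial terms.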

• If higher-degree models perform worse on the test set, explain this behavior using evidence from
metrics (e.g., train error decreases but test error increases).

**The metrics in the table show the pattern at each degree. From linear to degree 3, training MSE falls (272403.4, 264765.8, 259249.5) while test MSE rises (248125.8, 255268.5, 265623.7) and test R^2 drops from 0.299333 to 0.249922. The degree-3 model is worse than the degree-2 model on the test set, but the gap is small, so I think this is the beginning of overfitting rather than an extreme case. At degree 4 the overfitting is severe: training MSE reaches its lowest value (251909.3), but test MSE jumps to 12151490.0 and test R^2 falls to about -33, meaning the model predicts far worse than a constant prediction of the mean consumption. Test MAE rises much less (578.642201), which suggests that a small number of very large errors, rather than uniformly poor predictions, drive the test MSE.**
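
To see how quickly model complexity grows with degree, the sketch below counts the columns a full polynomial expansion produces for the four weather inputs (AWND, PRCP, TMAX, TMIN). The helper name `n_poly_features` is an assumption for illustration; the count uses the standard formula C(n + d, d) - 1 for all monomials up to degree d without the bias term, which corresponds to `include_bias=False` in the notebook.

```python
from math import comb


def n_poly_features(n_inputs, degree):
    """Number of monomials of total degree 1..degree in n_inputs variables."""
    # comb(n + d, d) counts all monomials up to degree d, including the
    # constant term; subtract 1 because the bias column is excluded.
    return comb(n_inputs + degree, degree) - 1


for d in (1, 2, 3, 4):
    print(d, n_poly_features(4, d))
# 1 4
# 2 14
# 3 34
# 4 69
```

With a 70/30 split of the 1418 rows left after `dropna()`, the degree-4 model fits 69 coefficients on roughly 990 training rows. One possible explanation for the degree-4 test failure is that fourth-power terms amplify unusually large input values, so a few atypical test rows can produce very large prediction errors.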

• If none of the models achieve good test performance, provide at least two reasons supported by
your outputs (e.g., limited feature set, high noise, unmodeled drivers such as occupancy/behavior,
seasonal effects)

**None of the models reaches good test performance: every test R^2 is below 0.3. One reason is the limited feature set. The data does not distinguish between types of day (for example, workdays versus weekends), and the date column was dropped before training. Occupancy matters a great deal: if nobody is home, little electricity is used regardless of the weather. The second reason is high noise. Much of the variation in consumption cannot be explained by weather alone and depends on outside factors, which is consistent with the large errors for every model (test MAE of about 375 or more against a mean daily consumption of about 1561).**
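
The workday versus weekend distinction mentioned above could be recovered from the dropped `date` column. The following is a minimal sketch using only the standard library; the function name `calendar_features` and the feature names are hypothetical, and it assumes dates are ISO strings such as "2006-12-16", as in the output of `df.head()`.

```python
from datetime import date


def calendar_features(date_str):
    """Derive simple numeric calendar features from an ISO date string."""
    d = date.fromisoformat(date_str)
    return {
        "month": d.month,                     # 1..12, a rough proxy for season
        "day_of_week": d.weekday(),           # Monday=0 ... Sunday=6
        "is_weekend": int(d.weekday() >= 5),  # 1 for Saturday or Sunday
    }


# First date in the dataset, which falls on a Saturday
print(calendar_features("2006-12-16"))
# {'month': 12, 'day_of_week': 5, 'is_weekend': 1}
```

In the notebook, the same idea could be applied to the whole column with `pd.to_datetime(df["date"])` and the `.dt` accessor, adding the results as extra columns before the train/test split. Whether these features actually improve test R^2 would need to be checked by rerunning the evaluation.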

