## Import Statements

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from scipy import stats

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

pd.options.display.max_columns = 25

In [None]:
data_main = pd.read_csv('Electricity_Usage_Data.csv')

In [4]:
# Parse the bill date column into datetime values
data_main['bill_date'] = pd.to_datetime(data_main['bill_date'])

In [None]:
# Snap every bill date to the first day of its month
data_main['bill_date'] = data_main['bill_date'].dt.to_period('M').dt.to_timestamp()

In [107]:
address_enc = LabelEncoder()
bill_type_enc = LabelEncoder()

data_main['address_enc'] = address_enc.fit_transform(data_main['service_address'])
data_main['bill_type_enc'] = bill_type_enc.fit_transform(data_main['bill_type'])
data_main['year'] = data_main['bill_date'].dt.year
data_main['month'] = data_main['bill_date'].dt.month

### Regression - Task: Predicting Energy Usage - Sourabh

Models Proposed:
1. Linear Regression - A simple, easy-to-understand model used here to set a baseline. It assumes a linear relationship between the input features and the target variable.
2. Gradient Boosting Regressor - An ensemble model that combines multiple decision trees, so we expect good accuracy from it. It can be computationally expensive and may require more resources than the other models.
3. Decision Tree Regressor - A simple, interpretable model that can handle non-linear relationships between the input features and the target. It can be prone to overfitting on the training data.
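
The trade-offs listed above can be illustrated on synthetic data. The following is a minimal sketch and not part of the original analysis: the data (`X_demo`, `y_demo`), the sine-shaped target, the noise level, and all variable names are assumptions chosen for illustration. It fits the three proposed model types with default settings and reports train and test R2 for each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical data: one feature with a clearly non-linear, noisy target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = np.sin(2 * X_demo[:, 0]) + rng.normal(scale=0.3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_models = {
    'linear': LinearRegression(),
    'gbr': GradientBoostingRegressor(random_state=0),
    'tree': DecisionTreeRegressor(random_state=0),  # no depth limit
}

# Store (train R2, test R2) per model.
demo_scores = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    demo_scores[name] = (
        r2_score(y_tr, model.predict(X_tr)),
        r2_score(y_te, model.predict(X_te)),
    )
    print(f'{name}: train R2 = {demo_scores[name][0]:.3f}, '
          f'test R2 = {demo_scores[name][1]:.3f}')
```

In a setup like this, the linear baseline would be expected to score poorly because a straight line cannot follow the sine curve, while the unrestricted tree fits the training split almost perfectly and loses some accuracy on the test split, which is the overfitting risk noted above.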

In [108]:
def regression_metrics(model, X_train, X_test, y_train, y_test):
    """Print R2 and MSE for a fitted model on the train and test splits."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    print(f'Train R2 Score: {r2_score(y_train, y_train_pred)}')
    print(f'Test R2 Score: {r2_score(y_test, y_test_pred)}')

    print(f'Train MSE Score: {mean_squared_error(y_train, y_train_pred)}')
    print(f'Test MSE Score: {mean_squared_error(y_test, y_test_pred)}')

In [109]:
X = data_main[[
    'business_area', 'address_enc', 'bill_type_enc', 'year', 'month'
]]
y = data_main[['kwh_usage']]

In [110]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Linear Regression

In [111]:
reg = LinearRegression().fit(X_train, y_train)

In [112]:
regression_metrics(reg, X_train, X_test, y_train, y_test)

Train R2 Score: 0.009173833177215318
Test R2 Score: 0.007167343822345074
Train MSE Score: 39877725926.044525
Test MSE Score: 49616557595.280334


#### Gradient Boosting Regression

In [113]:
gbr = GradientBoostingRegressor().fit(X_train, np.ravel(y_train))

In [114]:
regression_metrics(gbr, X_train, X_test, np.ravel(y_train), np.ravel(y_test))

Train R2 Score: 0.7043295722890507
Test R2 Score: 0.731555408814922
Train MSE Score: 11899831348.320082
Test MSE Score: 13415449659.919914


#### Decision Tree Regressor

In [115]:
dtr = DecisionTreeRegressor().fit(X_train, np.ravel(y_train))

In [116]:
regression_metrics(dtr, X_train, X_test, np.ravel(y_train), np.ravel(y_test))

Train R2 Score: 0.9328533753005721
Test R2 Score: 0.9247996538963315
Train MSE Score: 2702446489.891373
Test MSE Score: 3758118027.666927


#### Preliminary Conclusions:

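Based on the R2 scores above, the Decision Tree Regressor performs well even on the unseen test data (test R2 of about 0.92, compared with about 0.73 for Gradient Boosting and about 0.01 for Linear Regression). What is concerning is the large MSE. Because MSE is expressed in squared units of the target (kWh squared), it grows with the scale of `kwh_usage`, which may account for part of the magnitude. We are still investigating why the values are so large and how to mitigate this.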
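
One way to put the MSE values in context is to convert them to RMSE, which is in kWh, and to use the identity R2 = 1 - MSE / Var(y) to back out the spread of the test target. The sketch below is an illustration and not the authors' method: it only reuses the test metrics printed above, and the names `reported`, `rmse`, and `implied_var` are assumptions introduced here.

```python
import numpy as np

# Test-set metrics copied from the outputs printed above.
reported = {
    'LinearRegression': {'r2': 0.007167343822345074, 'mse': 49616557595.280334},
    'GradientBoostingRegressor': {'r2': 0.731555408814922, 'mse': 13415449659.919914},
    'DecisionTreeRegressor': {'r2': 0.9247996538963315, 'mse': 3758118027.666927},
}

rmse = {}
implied_var = {}
for name, m in reported.items():
    # RMSE is in the same unit as the target (kWh).
    rmse[name] = np.sqrt(m['mse'])
    # R2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R2)
    implied_var[name] = m['mse'] / (1 - m['r2'])
    print(f"{name}: RMSE = {rmse[name]:,.0f} kWh, "
          f"implied target std = {np.sqrt(implied_var[name]):,.0f} kWh")
```

On these numbers, the Decision Tree's test RMSE comes to roughly 61,000 kWh against an implied target standard deviation of roughly 224,000 kWh. This suggests the large MSE values may largely reflect the scale and spread of `kwh_usage` rather than a poor fit, though checking the target distribution directly would be needed to confirm it.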