## Import Statements

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from scipy import stats

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

pd.options.display.max_columns=25

In [2]:
data_main = pd.read_csv('Electricity_Usage_Data.csv')

In [3]:
data_main[['bill_date']] = data_main[['bill_date']].apply(pd.to_datetime)

In [4]:
data_main.loc[:,'bill_date'] = data_main['bill_date'].apply(
    lambda x: pd.to_datetime(f'{x.year}-{x.month}-01')
)

In [5]:
address_enc = LabelEncoder()
bill_type_enc = LabelEncoder()

data_main['address_enc'] = address_enc.fit_transform(
    data_main['service_address']
)
data_main['bill_type_enc'] = bill_type_enc.fit_transform(
    data_main['bill_type']
)
data_main['year'] = data_main['bill_date'].dt.year
data_main['month'] = data_main['bill_date'].dt.month

### Regression - Task: Predicting Energy Usage - Sourabh

Models Proposed:
1. Linear Regression - A simple, easy-to-understand model used here to set a baseline. It assumes a linear relationship between the input features and the target variable.
2. Gradient Boosting Regressor - An ensemble model that combines multiple decision trees, so good accuracy is expected. It can be computationally expensive and may require more resources than the other models.
3. Decision Tree Regressor - A simple, interpretable model that can handle non-linear relationships between the input features and the target. It can be prone to overfitting on the training data.
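
To illustrate the trade-offs listed above, here is a minimal sketch on made-up data (not the electricity dataset). The step-shaped target and the names `x`, `y`, and `test_r2` are assumptions chosen only to show how a linear model and tree-based models can differ on a non-linear relationship.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: the target jumps between two levels, so it is not linear in x.
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.where((x.ravel() % 4) < 2, 100.0, 500.0)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

models = {
    'linear': LinearRegression(),
    'gradient_boosting': GradientBoostingRegressor(random_state=42),
    'decision_tree': DecisionTreeRegressor(random_state=42),
}

# Fit each model and record its R2 on the held-out points.
test_r2 = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    test_r2[name] = r2_score(y_test, model.predict(x_test))
    print(f'{name}: test R2 = {test_r2[name]:.3f}')
```

On data like this, a straight line cannot follow the jumps, while the tree-based models can split on `x` to approximate them. How the models rank on real data depends on the features available.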

In [6]:
data_main.head()

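# Quartiles and interquartile range (IQR) of the usage column
Q1 = data_main['kwh_usage'].quantile(0.25)
Q3 = data_main['kwh_usage'].quantile(0.75)
IQR = Q3 - Q1

Q1, Q3, IQR
# Keep only rows whose usage lies within 1.5 * IQR of the quartiles
data_main_filt = data_main[~(
    (data_main['kwh_usage'] < (Q1 - 1.5 * IQR)) |
    (data_main['kwh_usage'] > (Q3 + 1.5 * IQR))
)]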

data_main.shape, data_main_filt.shape

((191253, 11), (157498, 11))

Therefore, 33,755 of the 191,253 rows (about 17.6%) have `kwh_usage` values that are considered outliers by the IQR method and are dropped from the filtered dataset.
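
The same IQR rule can be sketched as a small reusable helper. The function name `iqr_bounds` and the toy `usage` values are hypothetical and only illustrate the rule on data small enough to check by hand.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy readings with one obviously extreme value (made up for illustration).
usage = pd.Series([100, 120, 130, 150, 160, 5000])
lower, upper = iqr_bounds(usage)

# between() is inclusive on both ends, matching the ~(x < lower | x > upper) filter.
kept = usage[usage.between(lower, upper)]
print(lower, upper)
print(kept.tolist())
```

Here only the reading of 5000 falls outside the fences. On a skewed distribution such as real usage data, the rule can flag a large share of legitimate high consumers, which is worth keeping in mind when comparing the two datasets.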

In [7]:
def regression_metrics(model, X_train, X_test, y_train, y_test):
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    print(f'Train R2 Score: {r2_score(y_train, y_train_pred)}')
    print(f'Test R2 Score: {r2_score(y_test, y_test_pred)}')

    print(f'Train MSE Score: {mean_squared_error(y_train, y_train_pred)}')
    print(f'Test MSE Score: {mean_squared_error(y_test, y_test_pred)}')

In [8]:
X = data_main[[
    'business_area', 'address_enc', 'bill_type_enc', 'year', 'month'
]]
y = data_main[['kwh_usage']]

X_filt = data_main_filt[[
    'business_area', 'address_enc', 'bill_type_enc', 'year', 'month'
]]
y_filt = data_main_filt[['kwh_usage']]

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train_filt, X_test_filt, y_train_filt, y_test_filt = train_test_split(
    X_filt, y_filt, test_size=0.2, random_state=42
)

#### Linear Regression

In [10]:
reg = LinearRegression().fit(X_train, y_train)
reg_filt = LinearRegression().fit(X_train_filt, y_train_filt)

In [11]:
print('With Outliers')
regression_metrics(reg, X_train, X_test, y_train, y_test)

print('='*50)

print('Without Outliers')
regression_metrics(
    reg_filt, X_train_filt, X_test_filt, y_train_filt, y_test_filt
)

With Outliers
Train R2 Score: 0.009173833177215318
Test R2 Score: 0.007167343822345074
Train MSE Score: 39877725926.044525
Test MSE Score: 49616557595.280334
Without Outliers
Train R2 Score: 0.08955528262685242
Test R2 Score: 0.09509437141206156
Train MSE Score: 672051.7504459427
Test MSE Score: 694695.2188051218


#### Gradient Boosting Regression

In [12]:
gbr = GradientBoostingRegressor().fit(X_train, np.ravel(y_train))
gbr_filt = GradientBoostingRegressor().fit(
    X_train_filt, np.ravel(y_train_filt)
)

In [13]:
print('With Outliers')
regression_metrics(
    gbr, X_train, X_test, np.ravel(y_train), np.ravel(y_test)
)

print('='*50)

print('Without Outliers')
regression_metrics(
    gbr_filt, X_train_filt, X_test_filt, np.ravel(y_train_filt), np.ravel(y_test_filt)
)

With Outliers
Train R2 Score: 0.7043295722890507
Test R2 Score: 0.7315554088149221
Train MSE Score: 11899831348.320082
Test MSE Score: 13415449659.919912
Without Outliers
Train R2 Score: 0.1523656445581465
Test R2 Score: 0.15408797851650213
Train MSE Score: 625687.800085661
Test MSE Score: 649405.8808887758


#### Decision Tree Regressor

In [14]:
dtr = DecisionTreeRegressor().fit(X_train, np.ravel(y_train))
dtr_filt = DecisionTreeRegressor().fit(X_train_filt, np.ravel(y_train_filt))

In [15]:
print('With Outliers')
regression_metrics(
    dtr, X_train, X_test, np.ravel(y_train), np.ravel(y_test)
)

print('='*50)

print('Without Outliers')
regression_metrics(
    dtr_filt, X_train_filt, X_test_filt, np.ravel(y_train_filt), np.ravel(y_test_filt)
)

With Outliers
Train R2 Score: 0.9328533753005721
Test R2 Score: 0.9252101010668637
Train MSE Score: 2702446489.891373
Test MSE Score: 3737606035.4899774
Without Outliers
Train R2 Score: 0.9934125253356817
Test R2 Score: 0.8822752077931225
Train MSE Score: 4862.594943652055
Test MSE Score: 90377.21470310935


#### Conclusions: Electricity usage can be predicted by using correlated features (Regression)
* Three regression techniques were used: Linear Regression, Gradient Boosting Regression, and Decision Tree Regression.
* Linear Regression was too simple to capture the relationships in the dataset, hence the poor performance on both the train and test sets (R2 below 0.10 with and without outliers). The high MSE values reflect this as well.
* Gradient Boosting Regression captured the relationships reasonably well when outliers were kept (test R2 of about 0.73), but its R2 dropped to about 0.15 on the filtered data. Train and test scores are close in both cases, so this points to underfitting with the default settings rather than overfitting.
* The Decision Tree Regressor gave the best performance of the three models. Averaged over the models trained with and without outliers, its R2 is about 0.96 on the train set and about 0.90 on the test set. The gap between train R2 (0.99) and test R2 (0.88) on the filtered data suggests some overfitting.
* Therefore, regression analysis shows that electricity usage can be predicted given features such as the business area and service address of the connection, the bill type (the type of connection taken), and the year and month for which usage is being predicted.
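
The MSE values printed by `regression_metrics` are in squared units of the target, which makes them hard to read. Taking the square root gives an RMSE in the same units as `kwh_usage`. The sketch below does this for the test-set MSE values reported above for the filtered data; it assumes the column is measured in kWh, as its name suggests.

```python
import math

# Test-set MSE values copied from the printed outputs above (filtered data).
test_mse_filtered = {
    'Linear Regression': 694695.2188051218,
    'Gradient Boosting': 649405.8808887758,
    'Decision Tree': 90377.21470310935,
}

# RMSE is the square root of MSE, so it is in the target's own units.
rmse = {name: math.sqrt(mse) for name, mse in test_mse_filtered.items()}
for name, value in rmse.items():
    print(f'{name}: test RMSE of about {value:.0f} (assumed kWh)')
```

Read this way, the typical test error of the decision tree on the filtered data is roughly a third of that of the other two models.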