## Import Statements

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from scipy import stats

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, f1_score, accuracy_score

pd.options.display.max_columns = 25

In [None]:
data_main = pd.read_csv('Electricity_Usage_Data.csv')

In [4]:
# Parse the bill date strings into datetime values
data_main['bill_date'] = pd.to_datetime(data_main['bill_date'])

In [None]:
# Snap every bill date to the first day of its month
data_main['bill_date'] = data_main['bill_date'].dt.to_period('M').dt.to_timestamp()

In [107]:
address_enc = LabelEncoder()
bill_type_enc = LabelEncoder()

data_main['address_enc'] = address_enc.fit_transform(data_main['service_address'])
data_main['bill_type_enc'] = bill_type_enc.fit_transform(data_main['bill_type'])
data_main['year'] = data_main['bill_date'].dt.year
data_main['month'] = data_main['bill_date'].dt.month
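
`LabelEncoder` assigns integer codes in sorted order of the labels, so it can help to inspect the mapping before interpreting predictions. The following is a minimal sketch on made-up values; the `bill_types` list is hypothetical and does not come from `Electricity_Usage_Data.csv`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical bill types, used only to illustrate the encoding
bill_types = ['Actual', 'Estimated', 'Actual', 'Adjusted']

enc = LabelEncoder()
codes = enc.fit_transform(bill_types)

# classes_ holds the original labels in sorted order; the index is the code
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels
decoded = enc.inverse_transform(codes)
print(list(decoded))
```

The same calls should work on the fitted `bill_type_enc` encoder, for example `bill_type_enc.inverse_transform(model.predict(X_test))`.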

### Classification - Task: Predicting Type of Bill - Sharmisha

Models proposed:
1. Logistic Regression - a widely used, interpretable model that is suitable for setting a baseline accuracy. It assumes a linear relationship between the input variables and the log-odds of the target, so it might give poor results if that assumption does not hold.

2. Decision Tree Classifier - handles non-linear relationships between the input and target variables well, but can be prone to overfitting on the train data.

3. Random Forest Classifier - an ensemble model that combines multiple decision trees to create a more powerful model. It is harder to interpret and requires more computational resources to run.

In [117]:
def classification_metrics(model, X_train, X_test, y_train, y_test):
    """Print macro F1 and accuracy for a fitted model on the train and test sets."""
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Macro averaging weights every class equally, regardless of its size
    print(f'Train F1 Score: {f1_score(y_train, y_train_pred, average="macro")}')
    print(f'Test F1 Score: {f1_score(y_test, y_test_pred, average="macro")}')

    print(f'Train Accuracy Score: {accuracy_score(y_train, y_train_pred)}')
    print(f'Test Accuracy Score: {accuracy_score(y_test, y_test_pred)}')

In [118]:
X = data_main[[
    'business_area', 'address_enc', 'kwh_usage', 'year', 'month'
]]
y = data_main[['bill_type_enc']]

In [119]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
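
When some classes are rare, a purely random split can leave very few of them in the test set. One option is to pass `stratify` to `train_test_split` so that both splits keep the same class proportions. The sketch below uses a synthetic frame (`demo`, with invented class counts) as a stand-in for `data_main`; it is an illustration, not the split used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data_main: 1000 rows, class 2 is very rare
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'kwh_usage': rng.normal(500, 100, size=1000),
    'bill_type_enc': np.array([0] * 900 + [1] * 90 + [2] * 10),
})

X_demo = demo[['kwh_usage']]
y_demo = demo['bill_type_enc']

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.value_counts().sort_index())
print(y_te.value_counts().sort_index())
```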

#### Logistic Regression

In [120]:
lreg = LogisticRegression().fit(X_train, np.ravel(y_train))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
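
The warning above suggests either raising `max_iter` or scaling the data. A minimal sketch of doing both with a scikit-learn pipeline follows. The data is synthetic, the two columns only mimic features on very different scales (such as `kwh_usage` and `year`), and `max_iter=1000` is an arbitrary illustrative value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(5000, 2000, size=500),    # usage-like column
    rng.integers(2015, 2023, size=500),  # year-like column
])
y_demo = (X_demo[:, 0] > 5000).astype(int)

# Standardize the features first, then fit logistic regression
# with a larger iteration budget than the default
lreg_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lreg_scaled.fit(X_demo, y_demo)

print(lreg_scaled.score(X_demo, y_demo))
```

Keeping the scaler inside the pipeline means it is fitted on the training data only, and the same transformation is reused at prediction time.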


In [121]:
classification_metrics(lreg, X_train, X_test, y_train, y_test)

Train F1 Score: 0.6435088261325143
Test F1 Score: 0.6427111304198002
Train Accuracy Score: 0.9986993634070143
Test Accuracy Score: 0.9986928446315129


#### Decision Tree Classifier

In [122]:
dtc = DecisionTreeClassifier().fit(X_train, y_train)

In [123]:
classification_metrics(dtc, X_train, X_test, y_train, y_test)

Train F1 Score: 0.9998707454630322
Test F1 Score: 0.6666403148167668
Train Accuracy Score: 0.9999934641377237
Test Accuracy Score: 0.9998431413557816


#### Random Forest Classifier

In [124]:
rfc = RandomForestClassifier().fit(X_train, np.ravel(y_train))

In [125]:
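# Evaluate the random forest (rfc). The output recorded below came from a run
# that passed dtc by mistake, so it repeats the decision tree metrics.
classification_metrics(rfc, X_train, X_test, np.ravel(y_train), np.ravel(y_test))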

Train F1 Score: 0.9998707454630322
Test F1 Score: 0.6666403148167668
Train Accuracy Score: 0.9999934641377237
Test Accuracy Score: 0.9998431413557816


#### Preliminary Conclusions

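Although accuracy is very high for all of the models above, the more informative metric here is the macro-averaged F1 score, because it weights every class equally and therefore exposes poor performance on rare classes that accuracy hides. The logistic regression model struggles to capture the relationship between the features and the bill type: its F1 score is low (about 0.64) on the train set as well as on the test set, and the solver stopped at its iteration limit.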
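
To illustrate how accuracy and macro F1 can diverge, the toy example below uses invented labels in which one rare class is never predicted. The class counts are assumptions chosen for illustration and are not taken from the dataset.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 997 majority samples, 2 of a second class, 1 of a rare third class
y_true = [0] * 997 + [1] * 2 + [2] * 1
# A model that gets classes 0 and 1 right but never predicts class 2
y_pred = [0] * 997 + [1] * 2 + [0] * 1

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores a class with no predicted samples as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)

print(f'Accuracy: {acc:.4f}')
print(f'Macro F1: {macro_f1:.4f}')
print(f'Per-class F1: {per_class}')
```

A single missed class out of three pulls the macro F1 down to roughly two thirds while accuracy stays near 1. Printing the per-class scores (`average=None`) for the real models would show whether something similar is happening here.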

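The decision tree classifier, by contrast, scores almost perfectly on the train set (F1 of about 1.00) but drops to about 0.67 on the test set, which implies that it is likely overfitting. The random forest metrics recorded above are identical to the decision tree's because the evaluation call was passed `dtc` instead of `rfc`, so the random forest still needs to be evaluated before any conclusion can be drawn about it. One possible reason for the drop on the test set is the class imbalance that exists in the dataset.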
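
If class imbalance is indeed the cause, one thing to try is weighting the classes during training and reading a per-class report. The sketch below runs on synthetic data with a rare positive class. `class_weight='balanced'` and `max_depth=5` are illustrative choices rather than tuned settings, and nothing here has been run on `data_main`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data: the positive class is rare (roughly 5% of rows)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 3))
y_demo = (X_demo[:, 0] > 1.6).astype(int)
print(np.bincount(y_demo))  # class counts

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# class_weight='balanced' reweights samples inversely to class frequency;
# max_depth limits tree depth to reduce overfitting
rfc_balanced = RandomForestClassifier(
    class_weight='balanced', max_depth=5, random_state=42
).fit(X_tr, y_tr)

# Per-class precision, recall and F1, plus macro and weighted averages
report = classification_report(y_te, rfc_balanced.predict(X_te), zero_division=0)
print(report)
```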