# Machine Learning Models and Predicting Public Sector Electricity Costs

# Purpose

The aim of this project is to predict the primary average energy price in the residential sector for each State for the year 2021, based on residential sector features, using a variety of machine learning models.



## Background

### Data information
EIA's State Energy Data System (SEDS) is a comprehensive data set that consists of annual time series estimates of State-level energy use by major economic sectors, energy production and State-level energy price and expenditure data.[6]

The data is organized by State, Year, and MSN.

The MSNs are five-character codes, most of which are structured as follows:
*	First and second characters: the energy source (for example, NG for natural gas, MG for motor gasoline)
*	Third and fourth characters: the energy sector or energy activity (for example, RC for residential consumption, PR for production)
*	Fifth character: the type of data (for example, P for data in physical units, B for data in billion Btu, and D for dollars per million Btu) [6]
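
As a small illustration of the code structure described above, the sketch below splits an MSN into its three parts. The names `MSNParts` and `split_msn` are hypothetical helpers, not part of the project, and the sample code `NGRCB` is simply assembled from the examples given above. Since only most MSNs follow this layout, the split should be treated as a convenience rather than a guarantee.

```python
from typing import NamedTuple


class MSNParts(NamedTuple):
    """The three components of a five-character SEDS MSN code."""
    source: str     # characters 1-2, e.g. "NG" (natural gas)
    activity: str   # characters 3-4, e.g. "RC" (residential consumption)
    data_type: str  # character 5, e.g. "B" (billion Btu)


def split_msn(msn: str) -> MSNParts:
    """Split an MSN code into source, sector/activity and data type."""
    if len(msn) != 5:
        raise ValueError(f"expected a five-character MSN, got {msn!r}")
    return MSNParts(msn[:2], msn[2:4], msn[4])


parts = split_msn("NGRCB")
print(parts.source, parts.activity, parts.data_type)
```

Filtering on `parts.activity == "RC"` would be one way to keep only residential consumption series.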






### Model Background:


### Linear Models

A *regression* attempts to fit a function to observed data to make predictions on new data. A *linear regression* fits a straight line to observed data, attempting to demonstrate a linear relationship between variables and make predictions on new data yet to be observed[1].

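Given a set of predictors, the response is modeled as a weighted sum of those predictors plus an intercept and an error term.

The formula is $y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i$

where $x_{i1}, \dots, x_{ip}$ are the $p$ predictors for observation $i$, $\beta_0, \dots, \beta_p$ are the coefficients, and $\epsilon_i$ is the error term.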
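
To make the linear model concrete, here is a minimal NumPy sketch that estimates the intercept and coefficients by least squares. The data is synthetic and the "true" coefficient values are assumptions chosen only for illustration; it is not the project's data or fitting code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 observations, 3 predictors (illustrative values only).
X = rng.normal(size=(100, 3))
true_beta0 = 2.0
true_beta = np.array([1.5, -0.7, 0.3])
y = true_beta0 + X @ true_beta + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so the intercept is estimated with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate of (beta_0, beta_1, ..., beta_p).
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Fitted values for the observed data.
y_hat = X_design @ beta_hat
print("estimated coefficients:", beta_hat)
```

With little noise, the estimates land close to the values used to generate the data.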


### Support Vector Regression

The basic goal of Support Vector Regression is to find a function $f(x)$ that deviates by at most $\epsilon$ from the actually obtained targets $y_i$ for all of the training data, and at the same time is as flat as possible [2].

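To achieve this, we estimate $\beta$ and $\beta_0$ for a linear function $f(x) = x^T\beta + \beta_0$ by minimizing

$H(\beta,\beta_0) = \sum\limits_{i=1}^{N} V_\epsilon(y_i - f(x_i)) + \frac{\lambda}{2}\|\beta\|^{2}$

where $V_\epsilon$ is the $\epsilon$-insensitive loss. Residuals $r$ smaller in absolute value than $\epsilon$ are ignored and contribute nothing to the loss:

$V_\epsilon(r) = \begin{cases} 0 & \text{if } |r| < \epsilon \\
|r| - \epsilon & \text{otherwise}
\end{cases}$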
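
The loss and objective above can be sketched in a few lines of NumPy. The function names `eps_insensitive` and `svr_objective` are hypothetical, the sample residuals are made up, and the sketch only evaluates the objective for a linear $f(x)$; it does not perform the optimization that an SVR solver carries out.

```python
import numpy as np


def eps_insensitive(r, eps):
    """Epsilon-insensitive loss: 0 inside the tube, |r| - eps outside it."""
    return np.maximum(0.0, np.abs(r) - eps)


def svr_objective(beta, beta0, X, y, eps, lam):
    """Evaluate H(beta, beta0) for a linear f(x) = x @ beta + beta0."""
    residuals = y - (X @ beta + beta0)
    penalty = 0.5 * lam * (beta @ beta)      # (lambda / 2) * ||beta||^2
    return eps_insensitive(residuals, eps).sum() + penalty


# Residuals inside the band [-0.2, 0.2] cost nothing; the others cost |r| - 0.2.
residuals = np.array([-0.5, -0.1, 0.0, 0.15, 0.4])
losses = eps_insensitive(residuals, eps=0.2)
print(losses)
```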

### K-Nearest Neighbors

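K-nearest neighbors is a non-parametric supervised learning method. For a query point $x$, the input consists of the $k$ closest training examples in the data set, and the output is the average of the values of those $k$ nearest neighbors [3]:\
$f(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i$

where $N_k(x)$ is the set of the $k$ training points closest to $x$.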
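
A minimal sketch of this averaging rule follows. The helper `knn_predict` is hypothetical, the toy data is made up, and Euclidean distance is an assumption made for simplicity; the model used later in this project sets its own distance parameter.

```python
import numpy as np


def knn_predict(X_train, y_train, x, k):
    """Average the targets of the k training points closest to x (Euclidean)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]      # indices of the k nearest neighbors
    return y_train[nearest].mean()


X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

# The three closest points to 1.2 are 1.0, 2.0 and 0.0, so the far-away
# point at 10.0 does not influence the prediction.
prediction = knn_predict(X_train, y_train, x=np.array([1.2]), k=3)
print(prediction)
```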

### Neural Network

The Neural Network model used was a Multi-Layer Perceptron regressor, which optimizes the squared error using LBFGS or stochastic gradient descent [4].

It learns a function $f(\cdot): R^{m} \to R^{o}$, where

$m$ = number of input dimensions \
$o$ = number of output dimensions

A basic model consists of an input layer, one or more hidden layers which transform the values, and an output layer [5].
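
The layer structure can be illustrated with a forward pass through a single hidden layer. This is only a sketch: the layer sizes, the random weights and the `tanh` activation are assumptions for illustration, and no training takes place here.

```python
import numpy as np

rng = np.random.default_rng(0)

m, hidden, o = 4, 8, 1   # input, hidden and output dimensions (illustrative)

# Randomly initialized weights and biases; a trained network would learn these.
W1, b1 = rng.normal(size=(m, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, o)), np.zeros(o)


def forward(x, activation=np.tanh):
    """Map inputs in R^m to outputs in R^o through one hidden layer."""
    h = activation(x @ W1 + b1)   # hidden layer transforms the inputs
    return h @ W2 + b2            # output layer, no activation for regression


batch = rng.normal(size=(5, m))   # five samples with m features each
outputs = forward(batch)
print(outputs.shape)              # (5, 1)
```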


### Stochastic Gradient Descent
Stochastic gradient descent fits a linear model by minimizing the loss one sample at a time, updating the model along the way with a learning rate that decreases as training proceeds.
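
The update rule can be sketched as follows for a squared-error loss. The data is synthetic, and the schedule `eta = eta0 / t**power_t` together with all constants are assumptions chosen for illustration; they are not the settings used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, already standardized data (illustrative values only).
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.05, size=200)

w, b = np.zeros(2), 0.0
eta0, power_t = 0.05, 0.25    # assumed schedule: eta = eta0 / t**power_t
t = 1

for epoch in range(50):
    for i in rng.permutation(len(X)):     # visit samples in random order
        eta = eta0 / t ** power_t         # learning rate shrinks as t grows
        error = X[i] @ w + b - y[i]       # residual for a single sample
        w -= eta * error * X[i]           # gradient of 0.5 * error**2 w.r.t. w
        b -= eta * error                  # gradient w.r.t. the intercept
        t += 1

print("weights:", w, "intercept:", b)
```

Because each step uses one sample, the estimates fluctuate slightly around the least-squares solution; the shrinking learning rate damps those fluctuations over time.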


## Procedure

### Data loading, cleaning and initial analysis
All of the data was first loaded, inspected and cleaned.
When loaded, the file contains several columns: \
`Data_Status` - A code \
`MSN` - A five-letter combination that signifies the energy source, sector and unit of data \
`StateCode` - Two-letter state abbreviation \
`Year` - YYYY-MM-DD \
`Data` - Numerical value associated with the MSN

The units for each MSN category are given below.\
Please see https://www.eia.gov/state/seds/CDF/Codes_and_Descriptions.xlsx for a full description of each category.
*	Million Btu per short ton
*	Billion Btu
*	Dollars per million Btu
*	Thousand short tons
*	Million dollars
*	Thousand barrels



Please refer to: https://github.com/oohtmeel1/Machine-Learning-_-Prediction_Electricity_Data/blob/main/working_notebooks/Pipeline_work_B.ipynb

Only the portion of the data associated with the residential sector was used.
When visualized, most of the variables did not follow a normal distribution.


### QQ plots
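
A QQ plot compares the ordered, standardized sample values with the quantiles expected under a normal distribution; points far from the line $y = x$ indicate a departure from normality. The sketch below computes those point pairs using only the Python standard library. The helper `qq_points`, the plotting positions $(i - 0.5)/n$ and the exponential sample are assumptions for illustration, not the code that produced the figure below.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (theoretical, observed) quantiles for a normal QQ plot."""
    n = len(sample)
    mu, sigma = mean(sample), stdev(sample)
    # Observed quantiles: the standardized sample in ascending order.
    observed = sorted((x - mu) / sigma for x in sample)
    # Theoretical quantiles at plotting positions (i - 0.5) / n, which avoid
    # the infinite quantiles at probabilities 0 and 1.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed


random.seed(0)
skewed = [random.expovariate(1.0) for _ in range(200)]   # right-skewed sample
theoretical, observed = qq_points(skewed)

# For normally distributed data the pairs lie close to the line y = x.
largest_gap = max(abs(t - o) for t, o in zip(theoretical, observed))
print(f"largest deviation from the y = x line: {largest_gap:.2f}")
```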

<img src="working_notebooks/qqplots1.png">


### Correlations 

The Pearson product-moment correlation coefficients were computed, and the variables most strongly associated with the dependent variable were retained.


The formula is below:

$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$


$\operatorname{cov}(X,Y)$ is the covariance of the variables \
$\sigma_X$ is the standard deviation of $X$ \
$\sigma_Y$ is the standard deviation of $Y$
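
As a worked illustration of the formula, the sketch below computes the coefficient directly and checks it against NumPy's `np.corrcoef`. The helper `pearson` and the sample values are hypothetical and are not taken from the project data.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))   # population covariance
    return cov / (x.std() * y.std())                 # population std (ddof=0)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

rho = pearson(x, y)

# np.corrcoef returns the full correlation matrix; entry [0, 1] is rho(x, y).
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(rho)
```

Ranking features by the absolute value of this coefficient against the target is one simple way to decide which variables to retain.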


<img src="working_notebooks/corr_matrix.png">


### Preprocessing

The categorical data (Years and States) was encoded using label encoding. The numerical data was transformed before analysis using the scikit-learn `StandardScaler` class, which removes the mean and scales the data to unit variance [7].
Please refer to: https://github.com/oohtmeel1/Machine-Learning-_-Prediction_Electricity_Data/blob/main/working_notebooks/Pipeline_work_B.ipynb
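
To make the two transformations concrete, here is a minimal NumPy sketch of label encoding and standardization. All sample values are hypothetical, and the project itself relies on scikit-learn for these steps; the sketch only shows the arithmetic involved, including the reuse of training statistics on the test data.

```python
import numpy as np

# Label encoding: map each distinct category to an integer code.
states = np.array(["TX", "CA", "NY", "CA", "TX"])        # hypothetical values
categories, state_codes = np.unique(states, return_inverse=True)

# Standardization: learn the mean and standard deviation on the training
# data only, then reuse them for the test data.
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 500.0]])

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)           # population standard deviation (ddof=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(categories, state_codes)
print(X_train_scaled)
print(X_test_scaled)
```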


In [1]:
# Data preprocessing is here
import os

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# Load the prepared training and test tables from the working_notebooks folder
X1 = pd.read_csv(os.path.join(os.getcwd(), 'working_notebooks', 'X1.csv'))
X2 = pd.read_csv(os.path.join(os.getcwd(), 'working_notebooks', 'X2.csv'))

# Shuffle the rows (no random_state is set, so the order differs between runs)
X1 = shuffle(X1)
X2 = shuffle(X2)

# Separate the target column from the feature columns
y_train = X1['Data_x'].reset_index(drop=True)
y_test = X2['Data_x'].reset_index(drop=True)
X_train = X1.iloc[:, 3:]
X_test = X2.iloc[:, 3:]

# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train1 = scaler.transform(X_train)
X_test1 = scaler.transform(X_test)

### Model training and testing

Each model performed regression on a subset of the data, predicting the electrical expenditure amount for the year 2021 for each State.
The results were then compared by graphing them against the true values and by computing the Mean Absolute Error (MAE), Mean Squared Error (MSE), Explained Variance (EV) and $R^2$.
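
The four metrics can be written out in a few lines of NumPy. The helper `regression_metrics` and the sample values are hypothetical and this is not the evaluation code used in the project. The example uses predictions with a constant offset to show how explained variance and $R^2$ can differ.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, explained variance and R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    total = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MAE": np.mean(np.abs(residuals)),
        "MSE": np.mean(residuals ** 2),
        # Explained variance ignores a constant bias in the residuals ...
        "EV": 1.0 - np.var(residuals) / np.var(y_true),
        # ... whereas R^2 penalizes it.
        "R2": 1.0 - np.sum(residuals ** 2) / total,
    }


y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 210.0, 310.0, 410.0])   # every prediction is 10 too high

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Here the explained variance is 1.0 because the residuals are constant, while $R^2$ is 0.992 because the offset still counts as error.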

In [2]:
## The linear model
from sklearn.linear_model import LinearRegression
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

# Fit ordinary least squares on the scaled training data and predict the test set
linear_model = LinearRegression().fit(X_train1, y_train)
linear_pred = linear_model.predict(X_test1)
y_true = y_test


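In [3]:
## The SVR model
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline

# Linear-kernel SVR; residuals smaller than epsilon are not penalized
regr = make_pipeline(SVR(C=23.0, epsilon=0.2, kernel='linear'))
SVR_pred = regr.fit(X_train1, y_train).predict(X_test1)


## The SGD regressor
from sklearn.linear_model import SGDRegressor

# Linear model fitted by stochastic gradient descent with an L1 penalty
reg = make_pipeline(SGDRegressor(penalty='l1'))
gradient_pred = reg.fit(X_train1, y_train).predict(X_test1)


## K neighbors
from sklearn.neighbors import KNeighborsRegressor

# With n_neighbors=1 the prediction is the target of the single closest point,
# so weights='distance' has no effect. p is the Minkowski exponent; p < 1 does
# not define a true metric.
reg = make_pipeline(KNeighborsRegressor(weights='distance', n_neighbors=1, p=1/2))
kneighbors_pred = reg.fit(X_train1, y_train).predict(X_test1)


## Neural network
from sklearn.neural_network import MLPRegressor

# Multi-layer perceptron with identity activation, trained with the LBFGS solver
regr = MLPRegressor(activation='identity', max_iter=2300, solver='lbfgs').fit(X_train1, y_train)
neural_net_pred = regr.predict(X_test1)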



In [17]:
# Collect the true values and every model's predictions in one table,
# one named column per series
df1 = pd.DataFrame({'y_test': y_test,
                    'linear_pred': linear_pred,
                    'SVR_pred': SVR_pred,
                    'gradient_pred': gradient_pred,
                    'kneighbors_pred': kneighbors_pred,
                    'neural_net_pred': neural_net_pred})

In [18]:
import plotly.express as px

fig = px.line(df1,title="Comparisons of predictions for each model as well as the true values",labels={'value':"Predicted amount",'index':'prediction number'})
fig.update_traces({'marker':{'size': 20}})
fig.show()

In [45]:
# Scaled feature values of test points 14, 15 and 16, one column per point
a = X_test1[14]
b = X_test1[15]
c = X_test1[16]
df3 = pd.DataFrame([a, b, c]).T
df3 = df3.rename(columns={0: 'Number 14', 1: 'Number 15', 2: 'Number 16'})
px.scatter(df3, title="Comparison of test points 14 and 16 with test point 15")



### Results and discussion

Most of the models performed very well after optimization.\
The plot above illustrates how close the predictions of the models are to one another, and shows the specific points where they differ.
Prediction number 15 shows some differences, with the k-neighbors regressor performing slightly worse. This might be because that specific data point is somewhat of an outlier. Due to the nature of the data, only one very extreme outlier was actually removed: the US column, which is the sum of all of the data. K-nearest-neighbors models are known to be sensitive to high-dimensional data and to outliers. All of the other models had prediction accuracies of 100%.


For all of the models, the true values were plotted on a bar graph, and the predictions were plotted as a scatterplot to help illustrate.

The Linear Regression model had a prediction accuracy of 100%.\
https://github.com/oohtmeel1/Machine-Learning-_-Prediction_Electricity_Data/blob/main/Linear_models/Machine_learning_Linear_modeling.ipynb


<img src="working_notebooks/linear_model_plot.png">

The `MAE` was 0.03267136765242417\
The `MSE` was 0.003015192149111796\
The `Explained variance` was 0.999999999844917\
The $R^2$ was 0.9999999998350794


The SVR also had a prediction accuracy of 100%.\
https://github.com/oohtmeel1/Machine-Learning-_-Prediction_Electricity_Data/tree/main/SVR

<img src="working_notebooks/svr_plot.png">

The `MAE` was 0.5021485094357804\
The `MSE` was 0.6832135549229309\
The `Explained variance` was 0.9999999633196209\
The $R^2$ was 0.999999962630569



The k-neighbors regressor performed decently, with a prediction accuracy of 80%.\
https://github.com/oohtmeel1/Machine_Learning_Prediction_Kneighborsreg/blob/main/Kneighbors_reg.ipynb

<img src="working_notebooks/kneighbors_plot.png">

The `MAE` was 186.8901960784314\
The `MSE` was 111144.4768627452\
The `Explained variance` was 0.9956994403253784\
The $R^2$ was 0.9939207794879534


The SGDR performed very well, with a prediction accuracy of 100%.\
https://github.com/oohtmeel1/Machine_Learning_Prediction_Stochastic_Gradient/blob/main/Gradient_Descent_regression.ipynb

<img src="working_notebooks/sgdr_plot.png">

The `MAE` was 0.9834673843854801\
The `MSE` was 1.5266552927792496\
The `Explained variance` was 0.9999999450342941\
The $R^2$ was 0.9999999164972077


The Neural Network model performed very well, with a prediction accuracy of 100%.\
https://github.com/oohtmeel1/Machine_Learning_Prediction_Neural_Network_Electricity_Data

<img src="working_notebooks/neural_net_plot.png">

The `MAE` was 0.032168487841461764\
The `MSE` was 0.0030038938705200698\
The `Explained variance` was 0.9999999998453984\
The $R^2$ was 0.9999999998356973



# Citations and references

[1] Thomas Nield, *Essential Math for Data Science* (O'Reilly). Copyright 2022 Thomas Nield, 978-1-098-10293-7.

[2] Alex J. Smola and Bernhard Schölkopf, *A tutorial on support vector regression*. Received July 2002 and accepted November 2003. https://web.archive.org/web/20120131193522/http://eprints.pascal-network.org/archive/00000856/01/fulltext.pdf

[3] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

[4] https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor

[5] https://scikit-learn.org/stable/modules/neural_networks_supervised.html

[7] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Useful Links for data:

[6] The main website: https://catalog.data.gov/dataset/state-energy-data-system-seds

Additional information: https://www.eia.gov/state/seds/

Codes and descriptions: https://www.eia.gov/state/seds/CDF/Codes_and_Descriptions.xlsx
