<a href="https://colab.research.google.com/github/punjabinuclei/RealTimeBatteryMonitoringSystem/blob/main/10.%20Gradient%20Boosting%20regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **State of Charge Estimation using Machine Learning** 

Why regression? State of charge is a continuous quantity, so predicting it is a regression problem rather than a classification problem.

To predict the state of charge, this notebook uses gradient boosting regression (scikit-learn's ‘GradientBoostingRegressor’), an ensemble method that combines many shallow decision trees into a single model.

# Importing Important Libraries




Our primary packages for this project are going to be pandas for data processing, NumPy to work with arrays, matplotlib & seaborn for data visualizations, and finally scikit-learn for building and evaluating our ML model. Let’s import all the required packages into our Python environment.

In [15]:

import pandas as pd # data processing
import numpy as np # working with arrays
import matplotlib.pyplot as plt # visualization
import seaborn as sb # visualization
from termcolor import colored as cl # text customization

from sklearn.model_selection import train_test_split # data split
from sklearn.preprocessing import StandardScaler # feature scaling (imported, not used below)

from sklearn.ensemble import GradientBoostingRegressor # regression model

from sklearn.metrics import explained_variance_score as evs # evaluation metric
from sklearn.metrics import r2_score as r2 # evaluation metric

sb.set_style('whitegrid') # plot style
plt.rcParams['figure.figsize'] = (20, 10)
from sklearn import preprocessing # preprocessData

# Loading our Dataset

Using the ‘read_csv’ function provided by the pandas package, we can import the data into our Python environment. After importing the data, we drop rows with missing values using ‘dropna’ and display the DataFrame to get a glimpse of our dataset.

In [16]:
df = pd.read_csv(r"SocFinal.csv") # load the dataset
df = df.dropna() # drop rows with missing values
df

| Unnamed: 0 | Voltage | Current | Temperature | Capacity | StandardDeviation | DCResistance | MeanofVariables | StateofCharge |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.17604 | -0.15069 | 23.97615 | 2.99746 | 11.492914 | -0.039153 | 7.001010 | 99.915333 |
| 1 | 4.17014 | -0.15069 | 23.97615 | 2.99239 | 11.492369 | -0.016789 | 7.000802 | 99.746333 |
| 2 | 4.16761 | -0.15069 | 23.76583 | 2.98986 | 11.388522 | -0.016444 | 6.948222 | 99.662000 |
| 3 | 4.16509 | -0.15325 | 23.66067 | 2.98732 | 11.336982 | -0.015400 | 6.921297 | 99.577333 |
| 4 | 4.16273 | -0.15325 | 23.76583 | 2.98478 | 11.388420 | -0.014600 | 6.947633 | 99.492667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14952 | 2.80077 | -11.13838 | 8.83332 | 0.52128 | 8.443293 | -0.000061 | 0.743607 | 17.376000 |
| 14953 | 2.80010 | -10.93406 | 8.83332 | 0.52099 | 8.347490 | -0.000016 | 0.794593 | 17.366333 |
| 14954 | 2.79993 | -10.88298 | 8.83332 | 0.52099 | 8.323558 | 0.000117 | 0.807320 | 17.366333 |
| 14955 | 2.80178 | -15.74596 | 8.93848 | 0.53256 | 10.664245 | -0.000108 | -0.384565 | 17.752000 |


## **Feature Selection & Data Split**
In this process we are going to define the ‘X’ variable (the independent variables, or features) and the ‘y’ variable (the dependent variable, ‘StateofCharge’). After defining the variables, we will split the data into a training set, a validation set, and a test set. Splitting the data can be done using the ‘train_test_split’ function provided by scikit-learn.

In [9]:
features = ['Voltage','Current','Temperature', 'Capacity', 'StandardDeviation', 'DCResistance', 'MeanofVariables']
X = df.loc[:, features]
y = df.loc[:, ['StateofCharge']]

In [17]:
# Hold out 20% of the rows for testing, then 25% of the remainder (20% of the total) for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)
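
Because the split is done in two stages, the second ‘test_size’ applies to the remaining 80% of the rows, not to the full dataset. The following is a small arithmetic sketch of the resulting proportions; it uses only the Python standard library, and the variable names are illustrative rather than part of the notebook.

```python
from fractions import Fraction

# Stage 1: hold out 20% of all rows as the test set.
test_frac = Fraction(1, 5)
remaining = 1 - test_frac            # 80% left for train + validation

# Stage 2: hold out 25% of the remaining rows as the validation set.
val_frac = remaining * Fraction(1, 4)
train_frac = remaining - val_frac

print(f"train={float(train_frac):.2f}, val={float(val_frac):.2f}, test={float(test_frac):.2f}")
# prints: train=0.60, val=0.20, test=0.20
```

So the data ends up split roughly 60/20/20. Exact row counts can be off by one row from these fractions because scikit-learn rounds to whole rows.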

## **Modeling**
In this process, we are going to build and train a gradient boosting regression model using the pre-built ‘GradientBoostingRegressor’ provided by the scikit-learn package. The process has three steps: first, we define a variable to store the model; next, we fit the model to the training set; and finally, we make predictions on the test set and the validation set.

Training the Model

In [20]:
# Instantiate Gradient Boosting Regressor: 200 trees of depth 1 (decision stumps)
gbr = GradientBoostingRegressor(n_estimators=200, max_depth=1, random_state=1)

# Fit to training set; ravel() passes y as a 1-D array, which avoids
# scikit-learn's DataConversionWarning for column-vector targets
gbr.fit(X_train, y_train.values.ravel())

# Predict on test set
gbr_yhat_test = gbr.predict(X_test)

# Predict on val set
gbr_yhat_val = gbr.predict(X_val)


Using the algorithm provided by scikit-learn, we have built a gradient boosting regression model. Now, to know how well the model fits our data, we can evaluate it using the evaluation metrics and come to a conclusion.
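
For intuition about what a ‘GradientBoostingRegressor’ with ‘max_depth = 1’ does, here is a simplified, dependency-free sketch of boosting with decision stumps under squared-error loss. It is not scikit-learn's implementation: it handles a single feature, the helper names (‘fit_stump’, ‘fit_boosting’, and so on) are hypothetical, and the toy data is made up rather than taken from the battery dataset. The learning rate of 0.1 is assumed to match scikit-learn's default.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    best = None
    values = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct x values,
    # so both sides of every split are non-empty.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    thr, left_mean, right_mean = stump
    return left_mean if xi <= thr else right_mean


def fit_boosting(x, y, n_estimators=200, learning_rate=0.1):
    # Start from the mean of y, then repeatedly fit a stump to the residuals
    # and add a shrunken copy of its prediction to the running estimate.
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base, learning_rate, stumps


def predict_boosting(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Hypothetical toy data (not the battery dataset): y = x^2 on a small grid.
x = [i / 10 for i in range(50)]
y = [xi ** 2 for xi in x]

model = fit_boosting(x, y)
mse_before = mse(y, [model[0]] * len(y))
mse_after = mse(y, [predict_boosting(model, xi) for xi in x])
print(f"MSE with mean only: {mse_before:.3f}")
print(f"MSE after boosting: {mse_after:.3f}")
```

Each stump is a very weak model on its own; the fit comes from adding many of them, each correcting the errors left by the previous ones.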

## **Model Evaluation**
To evaluate our model we are going to use the ‘explained_variance_score’ metric and the ‘r2_score’ metric functions which are provided by the scikit-learn package in python.

When it comes to the ‘explained_variance_score’ metric, the best possible score is 1.0, and lower values are worse. As a rule of thumb, the score should not be below 0.60 (60%); if it is, the model is not sufficient for our data. So an acceptable ‘explained_variance_score’ lies between 0.60 and 1.0.

Our next evaluation metric is the ‘r2_score’ (R-squared) metric. What is R-squared? R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables. It is one of the most popular evaluation metrics for regression models. As a rule of thumb, the ‘r2_score’ of a model should be more than 0.70 (at least > 0.60).

We are now going to compute both metrics for the gradient boosting model, first on the test set and then on the validation set.
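
To make the difference between the two metrics concrete, here is a dependency-free sketch of the standard definitions: R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares, while explained variance is 1 minus the ratio of the variance of the errors to the variance of the targets. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def r2_manual(y_true, y_pred):
    # 1 - SS_res / SS_tot: penalises every error, including a constant offset.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    m = mean(y_true)
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def evs_manual(y_true, y_pred):
    # 1 - Var(errors) / Var(y): a constant offset has zero variance,
    # so it does not lower this score.
    errors = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(errors) / variance(y_true)


# Hypothetical values: predictions are perfect except for a constant offset of +2.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 22.0, 32.0, 42.0]

evs_value = evs_manual(y_true, y_pred)
r2_value = r2_manual(y_true, y_pred)
print("explained variance:", evs_value)   # 1.0, the offset is ignored
print("R-squared:", round(r2_value, 3))   # about 0.968, the offset is penalised
```

When the two scores are nearly identical, as in the results below, the predictions have almost no systematic offset.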

In [21]:
# 1. Explained Variance Score

print('-------------------------------------------------------------------------------')
print(cl('Explained Variance Score of gbr model is {}'.format(evs(y_test, gbr_yhat_test)), attrs = ['bold']))
print('-------------------------------------------------------------------------------')

-------------------------------------------------------------------------------
Explained Variance Score of gbr model is 0.9997067156945743
-------------------------------------------------------------------------------


In [23]:
# 2. R-squared

print('-------------------------------------------------------------------------------')
print(cl('R-Squared of gbr model is {}'.format(r2(y_test, gbr_yhat_test)), attrs = ['bold']))
print('-------------------------------------------------------------------------------')


-------------------------------------------------------------------------------
R-Squared of gbr model is 0.9997065294054692
-------------------------------------------------------------------------------


With Validation Data

In [24]:
# 1. Explained Variance Score

print(cl('Explained Variance Score of gbr model is {}'.format(evs(y_val, gbr_yhat_val)), attrs = ['bold']))
print('-------------------------------------------------------------------------------')

Explained Variance Score of gbr model is 0.9997084271095571
-------------------------------------------------------------------------------


In [25]:
# 2. R-squared

print('-------------------------------------------------------------------------------')
print(cl('R-Squared of gbr model is {}'.format(r2(y_val, gbr_yhat_val)), attrs = ['bold']))
print('-------------------------------------------------------------------------------')


-------------------------------------------------------------------------------
R-Squared of gbr model is 0.9997082442356037
-------------------------------------------------------------------------------


In [29]:
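# Inspect one validation sample as a single-row 2-D array (shape 1 x 7).
# Note: the prediction in the next cell uses row 13 of X_test, a different sample.
X_val.iloc[13].values.reshape(1,-1)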

array([[ 3.69093000e+00, -1.80012200e+01,  3.85931800e+01,
         2.62954000e+00,  2.36311288e+01, -3.72000000e-05,
         6.16333750e+00]])

In [30]:
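# Predict SoC for row 13 of the test set; double brackets keep it a one-row
# DataFrame, so the feature names seen during fitting are preserved
gbr.predict(X_test.iloc[[13]])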



array([14.83604529])

In [31]:
y_test.iloc[13]

StateofCharge    14.818667
Name: 6057, dtype: float64
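
As a quick check on this single test sample, the prediction error can be worked out from the two values printed above. This is only a sketch for one row, so it says nothing about the error distribution over the whole test set; the variable names are illustrative.

```python
# Values copied from the two cell outputs above.
predicted_soc = 14.83604529
actual_soc = 14.818667

abs_error = abs(predicted_soc - actual_soc)
rel_error_pct = 100 * abs_error / actual_soc

print(f"absolute error: {abs_error:.4f}")      # 0.0174, in the units of StateofCharge
print(f"relative error: {rel_error_pct:.2f}%")  # 0.12%
```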