
# Decision Tree Regression

Decision trees can be applied to regression problems as well as classification; scikit-learn provides the DecisionTreeRegressor class for this.


*   Decision trees are easy to understand, need relatively little data cleaning, handle non-linear relationships between the features and the target, and have few hyper-parameters that must be tuned to get a working model.
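To make the idea concrete, here is a minimal sketch of how a regression tree might choose a single split: it tries every threshold on one feature and keeps the one that minimises the total squared error when each side predicts its own mean. This is a simplified, dependency-free illustration and not scikit-learn's implementation; the function names and the toy voltage and state-of-charge values are invented for the example.

```python
# Simplified illustration of one regression-tree split (a "stump").
# Not scikit-learn's algorithm; names and data are hypothetical.

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) with the lowest squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


# Toy values invented for illustration: voltage (V) against state of charge (%).
voltage = [3.0, 3.2, 3.4, 3.9, 4.0, 4.1]
soc = [10.0, 15.0, 20.0, 80.0, 90.0, 95.0]

threshold, low_mean, high_mean = best_split(voltage, soc)


def predict(v):
    """Predict the mean of whichever side of the threshold v falls on."""
    return low_mean if v <= threshold else high_mean


print(threshold, low_mean, high_mean)
```

A full tree repeats this search recursively inside each side of the split, which is why it can follow non-linear relationships without any feature transformation.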




In [1]:
import pandas as pd # data processing
import numpy as np # working with arrays
from termcolor import colored as cl # text customization

# Decision trees can also be applied to regression problems, using the DecisionTreeRegressor class.
from sklearn import tree
from sklearn.model_selection import train_test_split # data split

from sklearn.metrics import explained_variance_score as evs # evaluation metric
from sklearn.metrics import r2_score as r2 # evaluation metric

from sklearn import preprocessing # preprocessData


# Loading our Dataset
Using the ‘read_csv’ function provided by the pandas package, we can import the data into our Python environment. After importing the data, we can display the DataFrame (or call the ‘head’ function) to get a glimpse of our dataset.

In [3]:
# list of CSV files to merge (File1.csv through File21.csv)
files = [f'File{i}.csv' for i in range(1, 22)]

# read each file into a dataframe
dfs = [pd.read_csv(file) for file in files]

# merge all dataframes into one
merged_df = pd.concat(dfs)

df=merged_df
df=df.drop(columns=['QDischarge_mA_h'])
df=df.dropna()
df

| Unnamed: 0 | Voltage(V) | Current (A) | Temperature (C) | SOC% |
|---|---|---|---|---|
| 0 | 4.181530 | -0.592617 | 23.549685 | 99.999989 |
| 1 | 4.174833 | -0.597150 | 23.699697 | 99.994465 |
| 2 | 4.173691 | -0.595770 | 23.470728 | 99.988943 |
| 3 | 4.172785 | -0.596164 | 23.707592 | 99.983422 |
| 4 | 4.171997 | -0.596558 | 23.628637 | 99.977902 |
| ... | ... | ... | ... | ... |
| 952192 | 2.951639 | 0.000000 | 19.411720 | 26.814976 |
| 952193 | 2.954870 | 0.000000 | 19.253963 | 26.814976 |
| 952194 | 2.957982 | 0.000000 | 19.301291 | 26.814976 |
| 952195 | 2.960897 | 0.000000 | 19.372282 | 26.814976 |


In [4]:
features = ['Voltage(V)','Current (A)', 'Temperature (C)']
X = df.loc[:, features]
y = df.loc[:, ['SOC%']]

The data should ideally be divided into three sets: train, validation (dev), and test.

- **Train Set:** The data that is fed into the model during fitting; the model learns from these examples. For instance, a gradient-based regression model uses them to compute the gradients that reduce its cost function, while a decision tree uses them to choose its splits.
- **Dev Set:** The development (validation) set is used to validate the trained model and forms the basis for choosing between models and hyper-parameters. If the error on the dev set is much larger than the error on the training set, the model has high variance, which means it is over-fitting.
- **Test Set:** The data on which we test the trained and validated model. It gives an estimate of how well the final model performs on unseen data. Many evaluation metrics are available: precision, recall, and accuracy for classification, and metrics such as explained variance and R-squared for regression.
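The over-fitting check described for the dev set can be sketched as a comparison of the error on the two sets. The helper names and the toy numbers below are hypothetical and chosen only to show the calculation; what counts as a "huge" gap depends on the problem and is not fixed here.

```python
# Hypothetical helpers: compare training error with validation error.

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def generalization_gap(train_true, train_pred, val_true, val_pred):
    """Validation error minus training error; a large positive gap hints at over-fitting."""
    return mean_squared_error(val_true, val_pred) - mean_squared_error(train_true, train_pred)


# Invented numbers: the model reproduces the training targets exactly
# but misses the validation targets by 5 points each.
train_true, train_pred = [10.0, 20.0, 30.0], [10.0, 20.0, 30.0]
val_true, val_pred = [15.0, 25.0], [10.0, 30.0]

gap = generalization_gap(train_true, train_pred, val_true, val_pred)
print(gap)
```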

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)
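The two calls above first hold out 20% of the rows for testing and then take 25% of the remaining 80% for validation, which works out to roughly a 60/20/20 split. The sketch below reproduces that arithmetic on row indices with the standard library only. It assumes a simple shuffle and is not how `train_test_split` is implemented internally; `three_way_split` is a hypothetical helper name.

```python
import random


def three_way_split(n_rows, test_size=0.2, val_size=0.25, seed=1):
    """Shuffle row indices and cut them into train, validation and test parts.

    val_size is a fraction of what remains after the test rows are removed,
    mirroring the two-step split used in the notebook.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    test, rest = indices[:n_test], indices[n_test:]
    n_val = round(len(rest) * val_size)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(1000)
sizes = (len(train_idx), len(val_idx), len(test_idx))
print(sizes)
```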


# Model Training
We fit a DecisionTreeRegressor with its default hyper-parameters on the training set, then predict on both the test set and the validation set.


In [6]:

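clf = tree.DecisionTreeRegressor()   # default hyper-parameters
clf = clf.fit(X_train, y_train)      # learn the splits from the training set
clf_yhat_test = clf.predict(X_test)  # predictions for the test set
clf_yhat_val = clf.predict(X_val)    # predictions for the validation set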


# Model Evaluation

To evaluate our model we are going to use the ‘explained_variance_score’ and ‘r2_score’ metric functions, which are provided by the scikit-learn package in Python.

The best possible ‘explained_variance_score’ is 1.0, and lower values are worse. As a rule of thumb, the score should not fall below 0.60 (60%); if it does, the model does not capture enough of the variation in the data to be useful for the given case. An acceptable score therefore lies between 0.60 and 1.0.
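For reference, explained variance is defined as one minus the variance of the residuals divided by the variance of the true values. The sketch below is a plain-Python rendering of that definition for a single target, not the scikit-learn source; the toy values are invented. It also shows a property worth knowing: a constant offset in the predictions does not lower the score, because the residuals then have zero variance.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def explained_variance(y_true, y_pred):
    """1 - Var(y_true - y_pred) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - variance(residuals) / variance(y_true)


# Invented values: every prediction is exactly 5 points too high.
y_true = [10.0, 20.0, 30.0, 40.0]
y_shifted = [t + 5.0 for t in y_true]

score = explained_variance(y_true, y_shifted)
print(score)
```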

Our next evaluation metric is the ‘r2_score’ (R-squared) metric. What is R-squared? R-squared measures how much of the variance of the dependent variable is explained by the independent variables, that is, by the model's predictions. It is one of the most popular evaluation metrics for regression models. As a rule of thumb, the ‘r2_score’ of a model should be more than 0.70 (at least > 0.60).
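R-squared can be written as one minus the ratio of the residual sum of squares to the total sum of squares. The following is a small plain-Python sketch of that formula for a single target, with invented toy values, rather than the scikit-learn implementation. It shows three reference points: perfect predictions score 1.0, always predicting the mean scores 0.0, and a constant offset in the predictions lowers the score.

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot for a single target variable."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented values for illustration.
y_true = [10.0, 20.0, 30.0, 40.0]

perfect = r_squared(y_true, y_true)
mean_only = r_squared(y_true, [25.0] * len(y_true))
shifted = r_squared(y_true, [t + 5.0 for t in y_true])

print(perfect, mean_only, shifted)
```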

We are now going to compute both metrics on the test set and on the validation set to judge how well the model suits the given dataset.

## Evaluating Model with Test Data

In [7]:
print(cl('EXPLAINED VARIANCE SCORE:', attrs = ['bold']))
print('-------------------------------------------------------------------------------')
print(cl('Explained Variance Score of Decision Tree model is {}'.format(evs(y_test, clf_yhat_test)), attrs = ['bold']))

EXPLAINED VARIANCE SCORE:
-------------------------------------------------------------------------------
Explained Variance Score of Decision Tree model is 0.9768925501065459


In [8]:
print(cl('R-SQUARED:', attrs = ['bold']))
print('-------------------------------------------------------------------------------')
print(cl('R-Squared of Decision Tree model is {}'.format(r2(y_test, clf_yhat_test)), attrs = ['bold']))

R-SQUARED:
-------------------------------------------------------------------------------
R-Squared of Decision Tree model is 0.9768925102282883


## Evaluating Model with Validation Data

In [9]:
print(cl('EXPLAINED VARIANCE SCORE:', attrs = ['bold']))
print('-------------------------------------------------------------------------------')
print(cl('Explained Variance Score of Decision Tree model is {}'.format(evs(y_val, clf_yhat_val)), attrs = ['bold']))

EXPLAINED VARIANCE SCORE:
-------------------------------------------------------------------------------
Explained Variance Score of Decision Tree model is 0.9768957819473763


In [56]:
print(cl('R-SQUARED:', attrs = ['bold']))
print('-------------------------------------------------------------------------------')
print(cl('R-Squared of Decision Tree model is {}'.format(r2(y_val, clf_yhat_val)), attrs = ['bold']))

R-SQUARED:
-------------------------------------------------------------------------------
R-Squared of Decision Tree model is 0.9999983803694567


In [61]:
X_test.iloc[0].values.reshape(1,-1)

array([[ 3.90803000e+00, -2.89889000e+00,  9.04363000e+00,
         2.41436000e+00,  5.08306134e+00, -1.13837000e-04,
         2.65960250e+00]])

In [63]:
clf.predict(X_test.iloc[0].values.reshape(1,-1) )



array([80.47066667])

In [65]:
y_test.iloc[0]

StateofCharge    80.478667
Name: 12088, dtype: float64
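As a quick sanity check on this single sample, the predicted and actual values printed above can be compared directly. The two numbers below are copied from those outputs; the variable names are only illustrative.

```python
predicted_soc = 80.47066667  # value returned by clf.predict(...) above
actual_soc = 80.478667       # value of y_test.iloc[0] above

absolute_error = abs(predicted_soc - actual_soc)
relative_error_pct = 100 * absolute_error / actual_soc

print(f"absolute error: {absolute_error:.4f} SOC points")
print(f"relative error: {relative_error_pct:.4f} %")
```

One sample says little about overall quality, so this complements the aggregate metrics rather than replacing them.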