# Predicting Hydropower Consumption using *k*-Nearest Neighbors Regression

### Table of Contents

1. [Introduction](#1.-Introduction)

2. [Base Libraries](#2.-Base-Libraries)

3. [Data Preprocessing](#3.-Data-Preprocessing)

4. [Tuning Hyperparameters](#4.-Tuning-Hyperparameters)

5. [Validating the Models with Metrics](#5.-Validating-the-Models-with-Metrics)

6. [Predicting Energy Generation](#6.-Predicting-Energy-Generation)

7. [Conclusions](#7.-Conclusions)

8. [References](#8.-References)

# 1. Introduction

Forecasting uses historical data as input to predict future trends, but accurately selecting and extracting meaningful features from that data is challenging. The *k*-nearest neighbors (*k*NN) algorithm, one of the simplest and most useful machine learning techniques, can nevertheless be used to build powerful regression models with little effort. In this notebook, I focus on *k*NN itself and on tuning its hyperparameters, both to give a better understanding of the algorithm and to keep this a plain tutorial for anyone who needs it. Other resources such as feature engineering, dimensionality reduction or other machine learning techniques are not used.

The dataset is a compilation of energy generation figures for several European countries, measured in TWh between 2000 and 2019. Its contents were extracted from Our World in Data.

# 2. Base Libraries

First, the basic libraries are imported: **pandas**, **matplotlib** and **numpy** for DataFrames, plots and numeric operations; **MinMaxScaler** to normalize the data between 0 and 1; **train_test_split** to help split the dataset (usually 70% training / 30% testing); **GridSearchCV** for hyperparameter tuning; and **neighbors** to build the models using *k*NN.

In [13]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import neighbors
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error, r2_score, explained_variance_score

To measure the quality of the generated models, we use a bunch of metrics (maybe more than we need):
* **mean_absolute_error** (MAE): the average absolute difference between the actual and the predicted values;
* **mean_squared_error** (used here through its square root, RMSE): the standard deviation of the residuals (prediction errors);
* **mean_squared_log_error** (used here through its square root, RMSLE): an error computed on log(1 + value), so it reflects the ratio between actual and predicted values rather than their absolute difference;
* **r2_score** (R2): the coefficient of determination, the proportion of the variance in the dependent variable that is predictable from the independent variable(s); and
* **explained_variance_score** (EVS): the proportion of the variance in the actual data that the model's predictions account for.
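
As a minimal sketch of what each metric in the list above computes, the snippet below compares the scikit-learn functions with hand-written NumPy formulas. The arrays `y_true` and `y_hat` are made-up toy values, not hydropower data, and the `*_manual` names are introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_squared_log_error,
    r2_score,
    explained_variance_score,
)

# Toy values (hypothetical, not taken from the hydropower dataset)
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_hat = np.array([0.12, 0.22, 0.33, 0.41])  # every prediction is slightly too high

# MAE: average absolute difference
mae = mean_absolute_error(y_true, y_hat)
mae_manual = np.mean(np.abs(y_true - y_hat))

# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# RMSLE: the same computation applied to log(1 + value)
rmsle = np.sqrt(mean_squared_log_error(y_true, y_hat))
rmsle_manual = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_hat)) ** 2))

# R2: 1 - (residual sum of squares / total sum of squares)
r2 = r2_score(y_true, y_hat)
r2_manual = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# EVS: 1 - (variance of the residuals / variance of the target)
evs = explained_variance_score(y_true, y_hat)
evs_manual = 1 - np.var(y_true - y_hat) / np.var(y_true)

print(mae, rmse, rmsle, r2, evs)
```

Because EVS ignores a constant bias in the residuals while R2 does not, EVS is never lower than R2; in this toy case the predictions are all too high, so EVS comes out above R2.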

# 3. Data Preprocessing

Initially, the data is loaded from the dataset and the categorical columns are dropped (in this case, only "Country"). To normalize the data between 0 and 1 while still keeping it as a DataFrame, the output of the *scaler* is passed to the *DataFrame* constructor, which restores the column labels. After this, the *head* function is called to preview the first ten rows of the normalized data.

In [14]:
# Setting the display format
pd.options.display.float_format = '{:.4f}'.format

# Data Loading and Pre-processing
df = pd.read_csv('data/Hydropower_Consumption.csv', sep=',')
df = df.drop(columns=["Country"])

scaler = MinMaxScaler(feature_range=(0, 1))
df = pd.DataFrame(scaler.fit_transform(df),
                        columns=[str(year) for year in range(2000, 2020)])

df.head(10)

| | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0005 | 0.0008 | 0.0009 | 0.0001 | 0.0009 | 0.0001 | 0.001 | 0.0012 | 0.0008 | 0.0012 | 0.0011 | 0.0008 | 0.0001 | 0.0009 | 0.0008 | 0.0009 | 0.0009 | 0.0001 | 0.0001 | 0.0001 |
| 1 | 0.1136 | 0.1321 | 0.1347 | 0.1324 | 0.1396 | 0.1358 | 0.1379 | 0.1494 | 0.1455 | 0.1496 | 0.151 | 0.1517 | 0.1286 | 0.1294 | 0.1168 | 0.1039 | 0.1074 | 0.1119 | 0.1875 | 0.0 |
| 2 | 0.0069 | 0.0057 | 0.0055 | 0.0082 | 0.0086 | 0.0081 | 0.0074 | 0.0004 | 0.0056 | 0.0078 | 0.0108 | 0.0055 | 0.0055 | 0.0077 | 0.0045 | 0.0053 | 0.0062 | 0.0004 | 0.0006 | 0.0032 |
| 3 | 0.0001 | 0.0001 | 0.0001 | 0.0004 | 0.0004 | 0.0008 | 0.0003 | 0.0004 | 0.0004 | 0.0005 | 0.0002 | 0.0005 | 0.0005 | 0.0001 | 0.0002 | 0.0001 | 0.0001 | 0.0 | 0.0002 | 0.0001 |
| 4 | 0.0014 | 0.0016 | 0.0018 | 0.002 | 0.0028 | 0.0033 | 0.0039 | 0.0039 | 0.0046 | 0.0046 | 0.0052 | 0.0054 | 0.0043 | 0.0052 | 0.0047 | 0.0045 | 0.005 | 0.0065 | 0.0107 | 0.0066 |
| 5 | 0.0514 | 0.0685 | 0.0656 | 0.0062 | 0.0565 | 0.006 | 0.0652 | 0.0592 | 0.0557 | 0.0611 | 0.057 | 0.0544 | 0.0427 | 0.0004 | 0.0386 | 0.0375 | 0.0332 | 0.0357 | 0.0588 | 0.0292 |
| 6 | 0.0249 | 0.0263 | 0.0255 | 0.0259 | 0.0251 | 0.0236 | 0.0223 | 0.0205 | 0.0177 | 0.019 | 0.0193 | 0.0269 | 0.0197 | 0.021 | 0.0137 | 0.0126 | 0.0153 | 0.0116 | 0.0245 | 0.0113 |
| 7 | 0.0631 | 0.0661 | 0.0636 | 0.0531 | 0.0587 | 0.0565 | 0.0533 | 0.0581 | 0.0574 | 0.0613 | 0.0539 | 0.047 | 0.0508 | 0.0462 | 0.0039 | 0.0332 | 0.0345 | 0.0329 | 0.0532 | 0.0321 |
| 8 | 0.0023 | 0.0021 | 0.0003 | 0.0004 | 0.0044 | 0.0046 | 0.0038 | 0.0037 | 0.0033 | 0.0035 | 0.0048 | 0.0037 | 0.0021 | 0.0016 | 0.0 | 0.0015 | 0.0017 | 0.0015 | 0.0025 | 0.0012 |
| 9 | 0.0011 | 0.0016 | 0.0012 | 0.0012 | 0.0012 | 0.0011 | 0.0011 | 0.0012 | 0.0001 | 0.0006 | 0.001 | 0.0012 | 0.0009 | 0.0008 | 0.0005 | 0.0008 | 0.0008 | 0.0001 | 0.0012 | 0.0006 |
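
To make the normalization step more concrete, here is a small sketch on a made-up frame (the `toy` data and variable names are assumptions for illustration only). It shows that `MinMaxScaler` works column by column, so each year is rescaled independently using that year's minimum and maximum across countries.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame: three "countries" (rows), two "years" (columns)
toy = pd.DataFrame({"2000": [10.0, 20.0, 30.0], "2001": [5.0, 5.0, 25.0]})

toy_scaler = MinMaxScaler(feature_range=(0, 1))
# fit_transform returns a NumPy array, so the result is wrapped back into a DataFrame
toy_scaled = pd.DataFrame(toy_scaler.fit_transform(toy), columns=toy.columns)

# Same result by hand: (x - column min) / (column max - column min)
toy_manual = (toy - toy.min()) / (toy.max() - toy.min())

print(toy_scaled)
```

One consequence of this design is that the largest producer in a given year always maps to 1 and the smallest to 0, regardless of the absolute amount generated.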


Now, my goal was to check whether a model could predict power generation for 2019, based on the previous 19 years (2000 - 2018), with an R2 score of at least 0.75 (loosely referred to as 75% accuracy in this notebook). For this, the data set was separated into X and y, X being my predictor data and y what I intended to predict. Both are then divided into training (70%) and testing (30%) sets.

In [15]:
# Data Splitting
X = df.drop(columns=["2019"])
y = df["2019"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)


Now that the training and test data are selected, it is necessary to find the value of k that will generate the best results for the algorithm.
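
Before tuning, it may help to see what k actually controls. The sketch below is a simplified from-scratch version of *k*NN regression (Euclidean distance, unweighted mean), written only for illustration; `knn_predict` and the toy arrays are hypothetical names and values, and the result is compared against scikit-learn's `KNeighborsRegressor` with its default settings.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_predict(X_fit, y_fit, x_query, k):
    """Predict one value as the mean target of the k nearest training rows."""
    distances = np.linalg.norm(X_fit - x_query, axis=1)  # Euclidean distance to every row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    return y_fit[nearest].mean()

# Hypothetical toy data: one feature, five samples, the last one far from the rest
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
query = np.array([2.6])

pred_k1 = knn_predict(X_toy, y_toy, query, k=1)  # uses only the closest sample
pred_k3 = knn_predict(X_toy, y_toy, query, k=3)  # averages the three closest samples
pred_k5 = knn_predict(X_toy, y_toy, query, k=5)  # averages everything, including the far sample

# scikit-learn with default settings (uniform weights, Euclidean distance)
sk_pred = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy).predict(query.reshape(1, -1))[0]

print(pred_k1, pred_k3, pred_k5, sk_pred)
```

A small k follows the closest samples tightly, while a large k smooths the prediction and lets distant samples pull it away, which is why k has to be chosen with care.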

# 4. Tuning Hyperparameters

A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data.

In [16]:
# Hyperparameter Tuning using GridSearchCV
params = {'n_neighbors': list(range(1, 21))}
knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)
model.fit(X_train, y_train)

The traditional way of performing hyperparameter optimization is grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search must be guided by some performance metric, typically measured by cross-validation on the training set or by evaluation on a held-out validation set. For our dataset, GridSearchCV (with 5-fold cross-validation) indicates 4 as the best value for *k*, quite similar to what earlier manual tests suggested.
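
The exhaustive search described above can be sketched as a plain loop over candidate values. The example below uses synthetic data from `make_regression` as a stand-in (an assumption; it is not the hydropower frame) and checks that a manual sweep with `cross_val_score` picks the same k as `GridSearchCV` when both use the same folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (hypothetical, for illustration only)
X_demo, y_demo = make_regression(n_samples=60, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=5)  # the splitter GridSearchCV uses for cv=5 with a regressor
k_values = list(range(1, 11))

# Manual sweep: mean cross-validated R2 for every candidate k
mean_scores = [
    cross_val_score(KNeighborsRegressor(n_neighbors=k), X_demo, y_demo, cv=cv).mean()
    for k in k_values
]
manual_best_k = k_values[int(np.argmax(mean_scores))]

# GridSearchCV runs the same sweep and then refits the best model on all the data
search = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": k_values}, cv=cv)
search.fit(X_demo, y_demo)
grid_best_k = search.best_params_["n_neighbors"]

print(manual_best_k, grid_best_k)
```

With no `scoring` argument, both approaches fall back on the estimator's own `score` method, which for a regressor is R2.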

Once the best value for k has been found, it's time to train the model and predict the results. The *score* function of the model can also be used to see its "accuracy rate"; for a regressor this is the R2 score (0.8016, or 80.16%, for our model).

In [17]:
# Model Training and Prediction using the best k value
best_k = model.best_params_['n_neighbors']
knn = neighbors.KNeighborsRegressor(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
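
As a quick check of what the *score* function reports for a regressor, the sketch below fits a *k*NN model on made-up random data (the arrays and names are assumptions, unrelated to the hydropower dataset) and compares `score` with an explicit `r2_score` call.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: the target is simply the sum of three random features
rng = np.random.default_rng(0)
X_fit = rng.random((40, 3))
y_fit = X_fit.sum(axis=1)
X_eval = rng.random((10, 3))
y_eval = X_eval.sum(axis=1)

reg = KNeighborsRegressor(n_neighbors=4).fit(X_fit, y_fit)

score_value = reg.score(X_eval, y_eval)           # what the score method reports
r2_value = r2_score(y_eval, reg.predict(X_eval))  # the same quantity, computed explicitly

print(score_value, r2_value)
```

The two numbers agree, so the percentage quoted for the model is an R2 value rather than a classification-style accuracy.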

Once we have the model ready and the prediction made, we can apply the metrics and analyze the results.

# 5. Validating the Models with Metrics

Here the test values are compared with the predicted values using each metric, so that we can see the result for each of them.

In [18]:
# Performance Metrics
r2_valid = r2_score(y_test, y_pred)
mae_valid = mean_absolute_error(y_test, y_pred)
evs_valid = explained_variance_score(y_test, y_pred)
rmse_valid = np.sqrt(mean_squared_error(y_test, y_pred))
rmsle_valid = np.sqrt(mean_squared_log_error(y_test, y_pred))

print(f'Best K Value: {best_k}')
print('R2 Score:', r2_valid)
print('Explained Variance Score:', evs_valid)
print('Mean Absolute Error:', mae_valid)
print('Root Mean Squared Error:', rmse_valid)
print('Root Mean Squared Log Error:', rmsle_valid)

Best K Value: 4
R2 Score: 0.8016285342188538
Explained Variance Score: 0.8131526313306373
Mean Absolute Error: 0.017960057840249434
Root Mean Squared Error: 0.04999229884993785
Root Mean Squared Log Error: 0.039043708745309796


The results show an R2 of 80.16% and an EVS of 81.32%, indicating that the model fits the sample well (R2) and that its predictions account for most of the variance in the actual data (EVS). RMSLE considers only the relative error between the predicted and the actual value, so the scale of the error is not significant. RMSE, on the other hand, increases in magnitude as the scale of the error increases. For our model, both are low on the normalized 0 to 1 scale (an RMSE of about 0.050 and an RMSLE of about 0.039).
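
The different behavior of RMSE and RMSLE with respect to scale can be illustrated with a small sketch. The values below are made up (every prediction is 10% too high) and the helper names `rmse` and `rmsle` are introduced only for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

def rmse(y_true, y_hat):
    return np.sqrt(mean_squared_error(y_true, y_hat))

def rmsle(y_true, y_hat):
    return np.sqrt(mean_squared_log_error(y_true, y_hat))

# Hypothetical values: every prediction is 10% too high
actual = np.array([100.0, 1000.0])
predicted = actual * 1.10

# Same relative error, ten times the scale
actual_x10, predicted_x10 = actual * 10, predicted * 10

rmse_ratio = rmse(actual_x10, predicted_x10) / rmse(actual, predicted)     # grows with the scale
rmsle_ratio = rmsle(actual_x10, predicted_x10) / rmsle(actual, predicted)  # stays close to 1

print(rmse_ratio, rmsle_ratio)
```

Note that this contrast is clearest for values well above 1. For values much smaller than 1, as in the normalized data used here, log(1 + x) is approximately x, so the two metrics behave more alike.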

# 6. Predicting Energy Generation


Once we have the prediction ready and the model tested, we organize the results side by side to be able to make a comparison.

In [19]:
# Displaying Predictions
data_prediction = pd.DataFrame(list(zip(y_test, y_pred)), columns=['Test', 'Prediction'])
data_prediction.head(10)

| | Test | Prediction |
|---|---|---|
| 0 | 0.0406 | 0.0268 |
| 1 | 0.0044 | 0.0039 |
| 2 | 0.0134 | 0.0096 |
| 3 | 0.028 | 0.0128 |
| 4 | 0.0164 | 0.0125 |
| 5 | 0.0321 | 0.0155 |
| 6 | 0.4982 | 0.3417 |
| 7 | 0.5331 | 0.3417 |
| 8 | 0.0007 | 0.0075 |
| 9 | 0.0032 | 0.0039 |
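
The values in this comparison are on the normalized 0 to 1 scale. If results in the original unit are wanted, one possible approach (a sketch under assumptions, not part of the original workflow) is to undo the scaling for the target column using the fitted scaler's `data_min_` and `data_range_` attributes. The `raw` array and the prediction values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw values in TWh: rows are countries, columns are two years
raw = np.array([[10.0, 12.0], [50.0, 40.0], [200.0, 212.0]])

demo_scaler = MinMaxScaler(feature_range=(0, 1))
scaled = demo_scaler.fit_transform(raw)

# Pretend these are normalized predictions for the last column
scaled_pred = np.array([0.0, 0.25, 1.0])

# Undo the scaling for that column only: x = x_scaled * (max - min) + min
last = -1
pred_twh = scaled_pred * demo_scaler.data_range_[last] + demo_scaler.data_min_[last]

print(pred_twh)
```

This only works if the scaler fitted during preprocessing is kept around, since it stores the per-column minimum and range needed for the conversion.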


The predictions follow the test values in order of magnitude, although the two largest values shown (rows 6 and 7) are clearly underestimated. Overall, the model fulfilled its proposed objective of performing above the 75% target, with an R2 of 80.16%.

# 7. Conclusions

Power generation data sets follow very regular patterns, which can be identified even with a simple regression tool and only minimal pre-processing of the data. Several other techniques and algorithms could also be applied to improve the performance of the model further.

# 8. References

[1] Ali, M., Khan, Z. A., Mujeeb, S., Abbas, S., & Javaid, N. (2019). Short-Term Electricity Price and Load Forecasting using Enhanced Support Vector Machine and K-Nearest Neighbor. 2019 Sixth HCT Information Technology Trends (ITT). doi:10.1109/itt48889.2019.9075063 

[2] Ashfaq, T., & Javaid, N. (2019). Short-Term Electricity Load and Price Forecasting using Enhanced KNN. 2019 International Conference on Frontiers of Information Technology (FIT). doi:10.1109/fit47737.2019.00057 

[3] Chicco, D. (2017). Ten quick tips for machine learning in computational biology. BioData Mining, 10(1). doi:10.1186/s13040-017-0155-3 

[4] Claesen, M., & De Moor, B. (2015). Hyperparameter Search in Machine Learning. arXiv:1502.02127

[5] Zaki, M., & Meira, W. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. New York City, New York: Cambridge University Press. doi:10.1017/CBO9780511810114