WISO100303 / Johannes Schmidt & Peter Regner

# **An introduction to scientific programming**


# Please provide feedback to us!

On BOKU Online - and on menti:

<a href="https://www.menti.com/hvyn9aw7r9" target="_blank">https://www.menti.com/hvyn9aw7r9</a>

<a href="https://www.menti.com/dmzbk2193i" target="_blank">https://www.menti.com/dmzbk2193i</a>

# Final test: hints at end of lecture




# Predicting electricity demand

Predicting electricity demand is essential for operating power systems: system operators can only schedule resources (i.e. generation and transmission facilities) if they know which demand to expect.

Today, we develop a very simple model that can be used to predict electricity demand. We will not use it for operational forecasting. Instead, we use it to get a rough idea of how the corona pandemic and the high gas and electricity prices in 2022 and 2023 have affected electricity demand.



# Scikit Learn
For that purpose, we use scikit-learn, a machine learning library for Python that builds on NumPy, SciPy and matplotlib. It can be used for 
- classification (e.g. is there a cat on a photo? Caution: there are better libraries for that),
- regression (e.g. how high will the electricity demand be under given conditions?),
- and clustering (e.g. which objects belong to the same group?).

It provides a huge toolbox for model selection (i.e. which of my models performs best?), data preprocessing (e.g. normalization), and dimensionality reduction.

We will only take a very brief look at scikit-learn, and we will use just one particular fitting algorithm, random forests, without further explanation. If you are interested in algorithmic details, [this video](https://www.youtube.com/watch?v=v6VJ2RO66Ag&t=33s) gives a brief introduction.
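To give a first impression of how scikit-learn is typically used, here is a minimal sketch of fitting a random forest and predicting with it. The data is synthetic and made up for illustration only (it is not the load data), and the parameter values are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(seed=0)

# Synthetic data (an assumption for illustration): one input feature x
# and a noisy, non-linear output y.
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

# scikit-learn expects a 2-D input array of shape (n_samples, n_features).
X = x.reshape(-1, 1)

# Create the model, fit it to the data, and predict.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
y_hat = model.predict(X)

print(f"In-sample R² score: {r2_score(y, y_hat):.2f}")
```

The pattern "create a model, call `fit`, call `predict`" is the same one used throughout the rest of this lecture.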

# Additional information about regression and the regression algorithm we use, i.e. random forests 

[Introduction to Machine Learning for Beginners](https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08)

[An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples](https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer)

[Regression in Machine Learning](https://medium.datadriveninvestor.com/regression-in-machine-learning-296caae933ec)

# File download

We again need the pickle file to access the load data. Please run the cell below to download it.
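As a reminder of what a pickle file is, here is a small sketch of a round trip with pandas: a data frame is written to a pickle file and read back unchanged. The data frame, the column values and the file name `example.pickle` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# A tiny, made-up data frame with a datetime index.
df = pd.DataFrame(
    {"TotalLoadValue": [6000.0, 6500.0]},
    index=pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00"]),
)

# Write the data frame to a temporary pickle file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pickle")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(df))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.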

In [None]:
# Workaround: Datalore does not allow publishing attached files, so we have to download them.
def download_attached_files():
    import urllib.request  # "import urllib" alone does not reliably load the request submodule
    import os.path
    fnames = {
              'entsoe-demand-shortened.pickle': 'https://files.boku.ac.at/filr/public-link/file-download/0d7483c9959b20360196809f11ff2d67/18707/-4160977441044749444/entsoe-demand-shortened.pickle'
    }
    for fname, url in fnames.items():
        if not os.path.exists(fname):
            print(f'Downloading: {url}')
            urllib.request.urlretrieve(url, filename=fname)
            print('Download finished!')
        else:
            print("File already exists, not downloading again.")

download_attached_files()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import ensemble
from sklearn.metrics import r2_score

In [None]:
def get_hourly_country_data(data, country):
    # The data may contain NAs, therefore interpolate.
    ret_data = data[data["AreaName"] == country].interpolate()
    # Not all hours may be complete (e.g. the last 15 minutes may be missing),
    # therefore interpolate again after resampling to hourly means.
    ret_data = ret_data.resample("1h").mean(numeric_only=True).interpolate()

    return ret_data

power_demand = pd.read_pickle("entsoe-demand-shortened.pickle")

power_demand_at_hourly = get_hourly_country_data(power_demand, "AT CTY")

In [None]:
# Let's have a look into the data...
power_demand_at_hourly.plot()

# Model fitting

We want to understand whether electricity load was lower than expected due to the corona lockdown in 2020 and the high gas prices in 2021/2022/2023. We therefore have to know which electricity load we should have expected without the lockdown and the high prices.

We do so by fitting a function to the electricity load, i.e. $y=f(x_1, x_2, \ldots, x_n)$. Here, $y$ is the output feature, in our case the load, and $f$ is some function of the $x_i$. We call the $x_i$ input features in the following.
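As a minimal sketch of what "fitting a function" means, the following example fits a straight line to made-up data with a single input feature. The data, the "true" relation and the choice of `np.polyfit` are assumptions for illustration; later in this lecture we use a random forest instead of a straight line.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical data: the "true" relation is y = 2*x + 1 plus some noise.
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.3, size=x.size)

# Fit a straight line f(x) = slope * x + intercept (polynomial of degree 1).
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted function to predict the output feature for a new input.
x_new = 12
y_new = slope * x_new + intercept
print(f"f(x) = {slope:.2f} * x + {intercept:.2f}, f({x_new}) = {y_new:.2f}")
```

The fitted slope and intercept should be close to 2 and 1, but not identical, because the fit is based on noisy data.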

Let's see an interactive example of model fitting [here](https://observablehq.com/@grahampullan/interactive-curve-fitting).



## Exercise 1

When you think of the last lecture - how can we build a model that predicts electricity load?

- First, which data in our data frame do we want to predict? How would you store this data in the output feature variable called `Y`?
- Second, which data describes the electricity load pretty well, i.e. which data do you fit to (input features)?
- Can you store this data in a new NumPy array called `X`?

In [None]:
# # # # # YOUR SOLUTION GOES HERE # # # # #

<div style="color:#555;border-top:1px solid #999;text-align:right;padding:4px;">End of exercise</div>

# Fit a first model

Ok, let's start by fitting a model that uses only the month as input feature. We use a random forest as the fitting algorithm. At this point, you could also decide to use a different algorithm, for example linear regression.
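Swapping the algorithm is easy because scikit-learn estimators share the same `fit`/`predict` interface. The sketch below compares a random forest with a linear regression on a made-up seasonal "load" that depends on the month only. The numbers are invented for illustration and are not the Austrian load data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the load data (an assumption for illustration):
# the month as the only input feature and a made-up seasonal "load".
months = np.tile(np.arange(1, 13), 10)  # 10 years of monthly values
load = 7000 + 1000 * np.cos(2 * np.pi * (months - 1) / 12)

X = months[:, np.newaxis]  # shape (120, 1)
Y = load                   # shape (120,)

# Both estimators are fitted and evaluated with exactly the same calls.
scores = {}
for model in [RandomForestRegressor(random_state=0), LinearRegression()]:
    model.fit(X, Y)
    scores[type(model).__name__] = model.score(X, Y)  # R² on the training data

for name, score in scores.items():
    print(f"{name}: in-sample R² = {score:.2f}")
```

In this toy example, the straight line cannot follow the seasonal pattern, while the random forest can. Which algorithm is appropriate depends on the data.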

In [None]:
Y = power_demand_at_hourly["TotalLoadValue"].values
X = power_demand_at_hourly.index.month.values[:, np.newaxis]

Y

## .values?
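A minimal sketch of what `.values` does, using a small made-up series: it returns the underlying NumPy array without the index. The pandas documentation recommends the equivalent method `.to_numpy()` for new code.

```python
import numpy as np
import pandas as pd

series = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

array = series.values    # the underlying NumPy array, without the index
same = series.to_numpy() # equivalent method

print(type(series).__name__)  # Series
print(type(array).__name__)   # ndarray
```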

In [None]:
type(power_demand_at_hourly["TotalLoadValue"])

In [None]:
type(power_demand_at_hourly["TotalLoadValue"].values)

In [None]:
X

Why do we use np.newaxis? Is there any difference?
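The short answer, as a sketch: scikit-learn expects the input features as a 2-D array of shape `(n_samples, n_features)`, and `np.newaxis` turns a 1-D array into such a column. The alternative `reshape(-1, 1)` shown here is not used in this lecture but gives the same result.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

col_newaxis = a[:, np.newaxis]  # add a second axis of length 1
col_reshape = a.reshape(-1, 1)  # same result; -1 lets NumPy infer the number of rows

print(a.shape)            # (4,)
print(col_newaxis.shape)  # (4, 1)
print(np.array_equal(col_newaxis, col_reshape))  # True
```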

In [None]:
power_demand_at_hourly.index.month.values

In [None]:
X.shape

In [None]:
power_demand_at_hourly.index.month.values.shape

In [None]:
a = np.array([1, 2, 3, 4])
print(a.shape)
a.shape

In [None]:
b = np.array([[1], [2], [3], [4]])
b.shape

In [None]:
#a[0, 0]

In [None]:
b[0, 0]

In [None]:
a = np.array([1, 2, 3, 4])
#a[0, 0]

In [None]:
a[:,np.newaxis][0, 0] 

In [None]:
forest_simple = ensemble.RandomForestRegressor()

# Train the model using the training sets
forest_simple.fit(X, Y)

prediction = forest_simple.predict(X)

In [None]:
prediction

In [None]:
def plot_prediction(Y, prediction, alpha=1., linewidth_1=0.1, linewidth_2=1.5):
    plt.plot(Y, label="Observation", linewidth=linewidth_1)
    plt.plot(prediction, label="Prediction", alpha=alpha, linewidth=linewidth_2)
    plt.xlabel("Time")
    plt.ylabel("Load (MW)")
    plt.legend()
    
plot_prediction(Y, prediction)

How could we perhaps do even better?

In [None]:
X = np.array([
    power_demand_at_hourly.index.month.values,
    power_demand_at_hourly.index.weekday.values,
    power_demand_at_hourly.index.hour.values]
).T

X

### .T transposes a Matrix
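As a small sketch with made-up values: stacking three feature arrays with `np.array` gives one row per feature, and `.T` turns this into one row per sample, which is the layout scikit-learn expects. `np.column_stack` is an alternative (not used in this lecture) that produces the same array directly.

```python
import numpy as np

# Four made-up samples with three features each.
month = np.array([1, 1, 1, 2])
weekday = np.array([0, 1, 2, 3])
hour = np.array([21, 22, 23, 0])

rows = np.array([month, weekday, hour])  # one row per feature
X = rows.T                               # one row per sample

print(rows.shape)  # (3, 4)
print(X.shape)     # (4, 3)
print(np.array_equal(X, np.column_stack([month, weekday, hour])))  # True
```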

In [None]:
a = np.array([[1, 2, 3],[1, 2, 3]])
a.shape

In [None]:
a.T.shape

In [None]:
forest_all_time_scales = ensemble.RandomForestRegressor()

# Train the model using the training sets
forest_all_time_scales.fit(X, Y)

predicted = forest_all_time_scales.predict(X)

plot_prediction(Y, predicted, alpha=0.5, linewidth_2=0.1)

Looks ok. But maybe a different plot is more helpful?

Let's zoom in a bit:

## Exercise 2

In this exercise, we aim at better exploring how well our model works. Please observe that the original observations are stored in `Y` and the predicted values in `predicted`.

- Plot a time series of the error between the prediction and the observation: i.e. calculate the difference `Y` - `predicted` and plot it over time. Do not forget to add axis labels. 
- Plot a scatterplot of the observations `Y` against the prediction `predicted`, i.e. the values on the x-axis correspond to `Y` and those on the y-axis to `predicted`. Plot points instead of lines. The command `plt.scatter` may be useful here.



In [None]:
# # # # # YOUR SOLUTION GOES HERE # # # # #

<div style="color:#555;border-top:1px solid #999;text-align:right;padding:4px;">End of exercise</div>

## Exercise 3

Run the cell below this cell and observe the output.

- Can you determine what the function does? 
- How do we evaluate the quality of our model here?
- Are we perhaps doing something wrong here - how could we improve our evaluation procedure?

In [None]:
def train_test_plot(power_demand_hourly, start_train, end_train, start_test, end_test):

    explanatory_string = f'Training period: {start_train}-{end_train}, Test period: {start_test}-{end_test}'
    print(explanatory_string)
    power_demand_at_hourly_train = power_demand_hourly.loc[start_train:end_train]
    power_demand_at_hourly_test = power_demand_hourly.loc[start_test:end_test]
    
    X = np.array([
        power_demand_at_hourly_train.index.month.values,
        power_demand_at_hourly_train.index.weekday.values,
        power_demand_at_hourly_train.index.hour.values
    ]).T

    Y = power_demand_at_hourly_train["TotalLoadValue"].values

    forest_model = ensemble.RandomForestRegressor()

    # Train the model using the training sets
    forest_model.fit(X, Y)

    X_test = np.array([power_demand_at_hourly_test.index.month.values,
        power_demand_at_hourly_test.index.weekday.values,
        power_demand_at_hourly_test.index.hour.values]).T

    Y_test = power_demand_at_hourly_test["TotalLoadValue"].values

    predicted = forest_model.predict(X_test)

    fig, axes = plt.subplots(3, figsize=(10, 10))
    plot_prediction(Y_test, predicted, alpha=0.5)
    axes[0].axhline(0, color="red")  # zero line over the full width (leap years have 8784 hours, not 8760)
    axes[0].plot(Y_test-predicted, 'k', linestyle="", marker="o", markersize=0.1)
    axes[0].set_xlabel('Time (hours)')
    axes[0].set_ylabel('Prediction error load (MW)')
    fig.suptitle(explanatory_string)
    axes[1].scatter(Y_test, predicted, s=0.5)
    axes[1].set_xlabel("Observed load (MW)")
    axes[1].set_ylabel("Predicted load (MW)")

    plt.figure()
    print(f'The R² score of the model is {round(r2_score(Y_test, predicted),2)}')

train_test_plot(power_demand_at_hourly, "2020-01-01", "2020-12-31", "2020-01-01", "2020-12-31")

In [None]:
# # # # # YOUR SOLUTION GOES HERE # # # # #

<div style="color:#555;border-top:1px solid #999;text-align:right;padding:4px;">End of exercise</div>

# Forecasting yesterday's weather is easy
It is very easy to forecast yesterday's weather - so we should always evaluate our model for a different period than the period we used for training.
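The effect can be sketched with made-up data: a random forest is trained on one synthetic "year" and evaluated both on that same year and on a second, independently generated year. The data generator `make_year` and all numbers are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


def make_year(rng, n_days=365):
    """Made-up hourly 'load': a daily cycle plus random noise."""
    day = np.repeat(np.arange(n_days), 24)
    hour = np.tile(np.arange(24), n_days)
    load = 7000 + 1000 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 500, size=hour.size)
    return np.column_stack([day, hour]), load


rng = np.random.default_rng(seed=0)
X_train, Y_train = make_year(rng)  # "training year"
X_test, Y_test = make_year(rng)    # a different "year" with new noise

model = RandomForestRegressor(n_estimators=30, random_state=0)
model.fit(X_train, Y_train)

r2_train = r2_score(Y_train, model.predict(X_train))
r2_test = r2_score(Y_test, model.predict(X_test))
print(f"R² on the training year: {r2_train:.2f}, R² on the other year: {r2_test:.2f}")
```

The score on the training year is higher because the model has partly memorised the noise of that year. Only the score on the other year tells us how well the model generalises.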

## Exercise 4
Train the model with the single years 2015, 2016 and 2022 and always predict for 2020. 
- Which year works best? Which month seems to work the worst? What could be the reason?
- There are two periods where the prediction is consistently very different from the observation, whichever year is used for training. What may be the reason there? How could the model be improved?

In [None]:
# # # # # YOUR SOLUTION GOES HERE # # # # #

<div style="color:#555;border-top:1px solid #999;text-align:right;padding:4px;">End of exercise</div>

# Including holidays
Holidays may be the reason for outliers in the prediction. Let's include them! For that purpose, we have to install a new package called holidays. Click on the library manager on your left, type holidays under "explore", and hit enter to install.
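The idea of a holiday feature can be sketched without the package: every hour of the index is marked `True` if its date is in a list of holiday dates. The hand-written list below is an assumption for illustration only. In the lecture, the dates come from the holidays package.

```python
import numpy as np
import pandas as pd

# Three days of hourly timestamps around New Year.
idx = pd.date_range("2019-12-31", "2020-01-02 23:00", freq="h")

# Hand-written holiday list (assumption for illustration only).
holiday_dates = pd.to_datetime(["2020-01-01", "2020-01-06"])

# Compare the calendar date of every hour with the holiday dates.
is_holiday = np.isin(idx.date, holiday_dates.date)

print(len(idx))          # 72 hours in total
print(is_holiday.sum())  # 24 hours fall on 1 January
```

The resulting boolean array has one entry per hour and can be used as an additional input feature.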

In [None]:
import holidays

holidays_austria = pd.to_datetime(list(holidays.CountryHoliday('AT', years=range(2014,2025)).keys())).sort_values()

idx_period = power_demand_at_hourly.index

idx_period.freq = None
    
holidays_austria_array = np.isin(idx_period.date, holidays_austria.date)

power_demand_at_hourly["holidays"] = holidays_austria_array

In [None]:
def train_test_plot_with_holidays(power_demand_hourly, start_train, end_train, start_test, end_test):
    explanatory_string = f'Training model with holidays. Training period: {start_train}-{end_train}, Test period: {start_test}-{end_test}'
    print(explanatory_string)
    power_demand_at_hourly_train = power_demand_hourly.loc[start_train:end_train]
    power_demand_at_hourly_test = power_demand_hourly.loc[start_test:end_test]
    
    X = np.array([power_demand_at_hourly_train.index.month.values,
        power_demand_at_hourly_train.index.weekday.values,
        power_demand_at_hourly_train.index.hour.values,
        power_demand_at_hourly_train["holidays"].values]).T

    Y = power_demand_at_hourly_train["TotalLoadValue"].values

    forest_model = ensemble.RandomForestRegressor()

    # Train the model using the training sets
    forest_model.fit(X, Y)

    X_test = np.array([power_demand_at_hourly_test.index.month.values,
        power_demand_at_hourly_test.index.weekday.values,
        power_demand_at_hourly_test.index.hour.values,
        power_demand_at_hourly_test["holidays"].values]).T

    Y_test = power_demand_at_hourly_test["TotalLoadValue"].values

    predicted = forest_model.predict(X_test)

    fig, axes = plt.subplots(3, figsize=(10, 10))
    plot_prediction(Y_test, predicted, alpha=0.5)
    axes[0].axhline(0, color="red")  # zero line over the full width (leap years have 8784 hours, not 8760)
    axes[0].plot(Y_test-predicted, 'k', linestyle="", marker="o", markersize=0.1)
    axes[0].set_ylabel('Prediction error (MW)')
    axes[1].scatter(Y_test, predicted, s=0.5) 
    axes[1].set_xlabel("Observed load (MW)")
    axes[1].set_ylabel("Predicted load (MW)")
    fig.suptitle(explanatory_string) 
    print(f'The R² score of the model with holidays is {round(r2_score(Y_test, predicted),2)}')

train_test_plot(power_demand_at_hourly, "2016-01-01", "2016-12-31", "2020-01-01", "2020-12-31")
train_test_plot_with_holidays(power_demand_at_hourly, "2016-01-01", "2016-12-31", "2020-01-01", "2020-12-31")
train_test_plot(power_demand_at_hourly, "2022-01-01", "2022-12-31", "2020-01-01", "2020-12-31")
train_test_plot_with_holidays(power_demand_at_hourly, "2022-01-01", "2022-12-31", "2020-01-01", "2020-12-31")
    

# Loops
We have to copy and paste a lot of code to train different years! Can this be improved somehow? We can use loops!

In [None]:
for year in [2015, 2016, 2022, 2023]:
        train_test_plot(power_demand_at_hourly, f'{year}-01-01', f'{year}-12-31', "2020-01-01", "2020-12-31")
        print("-------")
        train_test_plot_with_holidays(power_demand_at_hourly, f'{year}-01-01', f'{year}-12-31', "2020-01-01", "2020-12-31")
        print("--------------------------------------")

What? How does this work? 

A loop repeats a code block. In a `for` loop, the number of times the loop is repeated is fixed. The syntax is:

`for i in list:`

The loop is then run for each element of that list. The current element is stored in a new variable `i`.
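Besides printing, a loop can also collect results. The following sketch stores one value per year in a dictionary. The example (calendar hours per year, ignoring daylight saving time) is chosen for illustration only.

```python
import calendar

hours_per_year = {}
for year in [2015, 2016, 2020, 2023]:
    # The loop body runs once per list element, with `year` set to that element.
    days = 366 if calendar.isleap(year) else 365
    hours_per_year[year] = days * 24

print(hours_per_year)  # {2015: 8760, 2016: 8784, 2020: 8784, 2023: 8760}
```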

In [None]:
for year in np.arange(2015, 2022):
    print(year)

In [None]:
for i in np.arange(0, 4):
    print(f'i is {i}, and i² is {i**2}')

In [None]:
for pet in ['cat', 'dog', 'rabbit', 'turtle']:
    print(pet)

## Exercise 5

Adapt the function `train_test_plot_with_holidays` and calculate the yearly sum of load in the test data set and in the test prediction and print them. Add a parameter `plot_figs` to the function so that the plots are only created if the parameter is `True`. Call the function `train_test_plot_with_holidays_advanced`. Test the function by running it again for all years.

- Using the training years 2015, 2016, 2022, and 2023, show different predictions of demand for 2020. Do you see a huge impact of the corona lockdowns?
- Which additional data could we use to refine our estimate?

In [None]:
# # # # # YOUR SOLUTION GOES HERE # # # # #

<div style="color:#555;border-top:1px solid #999;text-align:right;padding:4px;">End of exercise</div>

# And the energy prices in 2022 and 2023?

We train the model on each of several other years, predict 2022 and 2023, and compare the predictions with the observations.

In [None]:
for year in [2015, 2016, 2021]:
    train_test_plot_with_holidays_advanced(power_demand_at_hourly, f'{year}-01-01', f'{year}-12-31', "2022-01-01", "2022-12-31", False)
    print("--------")
    train_test_plot_with_holidays_advanced(power_demand_at_hourly, f'{year}-01-01', f'{year}-12-31', "2023-01-01", "2023-12-31", False)
    print("----------------")

There seems to be some impact of prices on electricity demand, in particular in 2023. If you look into the hourly mean of the observations and the prediction, what may be one of the reasons for lower grid electricity demand in 2023?
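The "hourly mean" is the average over all values that share the same hour of the day. Here is a minimal sketch with two made-up days of data. The values and the constant offset of the hypothetical prediction are invented for illustration.

```python
import numpy as np
import pandas as pd

# Two days of made-up hourly "observations".
idx = pd.date_range("2023-01-01", periods=48, freq="h")
df = pd.DataFrame({"TotalLoadValue": np.arange(48, dtype=float)}, index=idx)

# Hypothetical prediction: always 10 MW above the observation.
df = df.assign(predicted=df["TotalLoadValue"] + 10)

# Group all rows by hour of the day (0 to 23) and average within each group.
hourly_mean = df.groupby(df.index.hour).mean()
deviation = hourly_mean["TotalLoadValue"] - hourly_mean["predicted"]

print(hourly_mean.head(3))
print(deviation.unique())
```

Plotting the observed and the predicted hourly mean in one figure shows at which hours of the day the two differ most.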

In [None]:
def train_test_plot_with_holidays_hourly_deviation(power_demand_hourly, start_train, end_train, start_test, end_test, plot_figs):

    explanatory_string = f'Training with holidays. Training period: {start_train}-{end_train}, Test period: {start_test}-{end_test}'
    print(explanatory_string)
    power_demand_at_hourly_train = power_demand_hourly.loc[start_train:end_train]
    power_demand_at_hourly_test = power_demand_hourly.loc[start_test:end_test]
    
    X = np.array([
        power_demand_at_hourly_train.index.month.values,
        power_demand_at_hourly_train.index.weekday.values,
        power_demand_at_hourly_train.index.hour.values,
        power_demand_at_hourly_train["holidays"].values]
    ).T

    Y = power_demand_at_hourly_train["TotalLoadValue"].values

    forest_model = ensemble.RandomForestRegressor()

    # Train the model using the training sets
    forest_model.fit(X, Y)

    X_test = np.array([power_demand_at_hourly_test.index.month.values,
        power_demand_at_hourly_test.index.weekday.values,
        power_demand_at_hourly_test.index.hour.values,
        power_demand_at_hourly_test["holidays"].values]).T

    Y_test = power_demand_at_hourly_test["TotalLoadValue"].values

    predicted = forest_model.predict(X_test)

    predicted_data_frame_column = pd.Series(predicted)
    predicted_data_frame_column.index = power_demand_at_hourly_test.index

    power_demand_at_hourly_test = power_demand_at_hourly_test.assign(predicted = predicted_data_frame_column) 
    power_demand_at_hourly_mean = power_demand_at_hourly_test.groupby(power_demand_at_hourly_test.index.hour.values).mean(numeric_only=True)

    if plot_figs:
        fig, axes = plt.subplots(4, figsize=(10, 10))
        plot_prediction(Y_test, predicted, alpha=0.5)
        axes[0].axhline(0, color="red")  # zero line over the full width (leap years have 8784 hours, not 8760)
        axes[0].plot(Y_test-predicted, 'k', linestyle="", marker="o", markersize=0.1)
        axes[0].set_ylabel('Prediction error (MW)')
        axes[1].scatter(Y_test, predicted, s=0.5) 
        axes[1].set_xlabel("Observed load (MW)")
        axes[1].set_ylabel("Predicted load (MW)")
        axes[2].plot(power_demand_at_hourly_mean["TotalLoadValue"], label="Observation")
        axes[2].plot(power_demand_at_hourly_mean["predicted"], color="red", label="Prediction")
        axes[2].legend()
        fig.suptitle(explanatory_string) 

    print(f'The R² score of the model with holidays is {round(r2_score(Y_test, predicted),2)}')
    # Hourly MW values sum up to MWh; 1 MWh = 1e-6 TWh.
    print(f'The demand in observations with holidays is {round(np.sum(Y_test) * 1e-6, 2)} TWh and the demand in the predictions is {round(np.sum(predicted) * 1e-6, 2)} TWh')

for year in [2015, 2016, 2020, 2021]:
    train_test_plot_with_holidays_hourly_deviation(power_demand_at_hourly, f'{year}-01-01', f'{year}-12-31', "2023-01-01", "2023-12-31", True)
    print("-------------------------")

# How to continue after this course?

This course hopefully gave you a good start, but there is a whole lot more to know and take care of. Here is some additional information.

## How to program without Datalore?

You can run Python as scripts or use Jupyter Notebooks. The latter is _very_ similar to Datalore.

We recommend one of the following:

- Use [Anaconda](https://www.anaconda.com/) to install [Jupyter](https://jupyter.org/) on your computer if you use Windows or Mac. If you run Linux, use [miniconda](https://docs.conda.io/en/latest/miniconda.html) or the package manager of your distribution (e.g. apt on Ubuntu/Debian). [This video might help](https://www.youtube.com/watch?v=LrMOrMb8-3s), but you can probably find other (better?) instructions yourself too.
- Use an editor such as [PyCharm](https://www.jetbrains.com/pycharm/) (not open source, but really powerful, with a free community edition and a free education license), [Spyder](https://www.spyder-ide.org/) (free and open source) or [Visual Studio Code](https://code.visualstudio.com/).
- Use [Datalore](https://datalore.jetbrains.com/) or [Google Colab](https://colab.research.google.com/notebooks/).

## How to debug?

Debugging is important to understand what your code does. Use print statements or special debugging tools! There are great tutorials out there that teach good debugging techniques.
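A minimal sketch of two simple techniques, using a made-up function: temporary print statements that show intermediate values, and an `assert` that stops the program if a result is not what you expect. Python's built-in `breakpoint()` additionally starts an interactive debugger at that line.

```python
def mean_load(values):
    total = 0
    for v in values:
        total += v
        print(f"added {v}, running total is {total}")  # temporary debug output
    return total / len(values)


result = mean_load([6000, 7000, 8000])

# Fail loudly if the result is not what we expect.
assert result == 7000, f"unexpected mean: {result}"

# breakpoint()  # uncomment to pause here and inspect variables interactively
```

Remember to remove the debug output once the code works.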

## How to version control?

Nowadays, software development is never done without version control systems, such as [Git](https://git-scm.com/). When working with notebooks, things are messy anyway, so it might be less important. But you should definitely check out Git when you write scripts and modules.

## How to do testing?

You can save a lot of time if you invest time into proper [unit testing](https://en.wikipedia.org/wiki/Unit_testing). This means you write code that checks automatically whether your code does the right thing. This is similar to the tests you have seen in most homework exercises, but of course you have to write the solution *and* the test on your own. If you need correct results, you probably need automatic ways of testing your code.
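A minimal sketch of a unit test, assuming a small helper function `to_twh` that is made up for this example: the test function calls the helper with inputs for which the correct result is known and checks the output with `assert`.

```python
def to_twh(hourly_load_mw):
    """Sum hourly load values in MW (i.e. MWh per hour) and convert to TWh."""
    return sum(hourly_load_mw) * 1e-6


def test_to_twh():
    # No load at all means zero energy.
    assert to_twh([]) == 0
    # One hour at 1,000,000 MW corresponds to 1 TWh.
    assert abs(to_twh([1_000_000]) - 1.0) < 1e-12
    # Two hours at 500,000 MW also add up to 1 TWh.
    assert abs(to_twh([500_000, 500_000]) - 1.0) < 1e-12


test_to_twh()
print("all tests passed")
```

Here the test is called by hand. Test runners such as pytest can find and run functions whose names start with `test_` automatically.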

# Other courses

## Other Python related courses at BOKU

Introduction to programming

Programming with Python

Machine learning and pattern recognition for bioinformatics

Bayesian data analysis in the life sciences

Meteorological data analysis and visualization

Snow and avalanches - field methods

## Other courses introducing programming languages at our institute

Computer Simulation in Energy & Resource Economics (NetLogo - Agent-based Simulation)

Operations Research and Systems Analysis (GAMS - Optimization)

Applied mathematical programming in natural resource management (GAMS - Optimization)

Agricultural production and policy impact modelling (GAMS - Optimization)

Integrated land use modelling (GAMS - Optimization)

Interdisziplinäres Seminar Umwelt-Informationsmanagement

## Other relevant courses

Check out the Statistics Institute, they have R-related classes.

Check out TU Wien, they have tons of programming classes! (and several computer science programs)

# Master thesis

- This is a bachelor class, but some of you are already in a master program
- If you are interested in applying your newly acquired programming skills in a master thesis, get in touch
- We model renewable energy systems on different spatial levels from local to global
- More details [here](https://homepage.boku.ac.at/jschmidt/masterstudents.pdf)

# Hints for final test
- Content: everything up to lecture 7, but focus on numpy and pandas. Don't forget that these elements also were part of our class:
  - f-strings
  - Pythagoras
  - equation of a circle (via Pythagoras)
  - distances between points in a 2D Cartesian coordinate system
- 4 questions
    - Explaining code
    - Finding errors
    - Writing code
    - Explaining terminology
- If you have a reason to not participate in the test (other exam, doctor, ...) please tell us and we will find a new date

# Reminder: Feedback

https://www.menti.com/hvyn9aw7r9

https://www.menti.com/dmzbk2193i