WISO100303 / Johannes Schmidt & Peter Regner

# **An introduction to scientific programming**


# Download data
We use a rather extensive data set in this class. The code below downloads it automatically. The download needs to be done only once, since Datalore stores the file with the notebook files.

In [None]:
# Workaround: Datalore does not allow publishing attached files, so we have to download them.
def download_attached_files():
    import urllib.request
    import os.path
    fnames = {
              'entsoe-demand-shortened.pickle': 'https://files.boku.ac.at/filr/public-link/file-download/0d7483c9959b20360196809f11ff2d67/18707/-4160977441044749444/entsoe-demand-shortened.pickle'
    }
    for fname, url in fnames.items():
        if not os.path.exists(fname):
            print(f'Downloading: {url}')
            urllib.request.urlretrieve(url, filename=fname)
            print('Download finished!')
        else:
            print("File already exists, not downloading again.")

download_attached_files()

# Pandas

Today, we work with the `Pandas` library. `Pandas` lets you handle multi-type data frames: unlike numpy, which allows only one data type per array, a `Pandas` dataframe can hold a different data type in each column. Think of `Pandas` as providing the functionality of Excel in Python: you can sort data, aggregate data, do calculations on data, etc. The advantages:
- You can automate your tasks
- You can reproduce your analysis
- You can separate data and code

`Pandas` is built on top of numpy, so many operations you know from numpy may work in `Pandas` as well.

`Pandas` dataframes have column names, which can be used to access the columns (see below). They also have an index for the rows, for example the date and time of each sample in the table. During class, we will learn to handle both the column names and the index.
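
To make the idea of mixed column types and a row index concrete, here is a minimal sketch. The table `toy`, the column name `load`, the label `"DE CTY"` and all values are made up for illustration and are not taken from the ENTSO-E data.

```python
import pandas as pd

# Hypothetical toy table: one text column and one numeric column,
# with timestamps as the row index. All values are invented.
toy = pd.DataFrame(
    {
        "AreaName": ["AT CTY", "DE CTY", "AT CTY"],
        "load": [6000.0, 55000.0, 6100.0],
    },
    index=pd.to_datetime(
        ["2015-01-01 00:00", "2015-01-01 00:00", "2015-01-01 01:00"]
    ),
)

print(toy.dtypes)    # each column has its own data type
print(toy.columns)   # the column names
print(toy.index)     # the row labels (here: timestamps)
```

Note that the index does not have to be unique: two rows here share the same timestamp.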

# Data example: which simple patterns can we identify in electricity demand?

Today we want to use the `Pandas` library to study which temporal patterns we can find in Austrian electricity demand. When do people use more electricity? When do they use less electricity?

For that purpose we use data provided by ENTSO-E, the "European association for the cooperation of transmission system operators". They provide hourly data for most European countries, informing about the "load" on the network. The load indicates how much electricity was produced - and consumed. We will look into the dataset, which spans from late 2014 to today, and aim at understanding consumption patterns.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Load data

We load the data from the file we just downloaded. It contains the load measured on grids across Europe, provided by ENTSO-E. You can download the raw CSV files from the ENTSO-E website directly, but we have done so for you, as the download takes substantial time. The pickle file, which is hosted on a bokubox, contains all relevant data.

In [None]:
power_demand = pd.read_pickle("entsoe-demand-shortened.pickle") 

**Note:** Pickle is [not a good file format for data exchange](https://nedbatchelder.com/blog/202006/pickles_nine_flaws.html). There are security dangers and possible compatibility issues. Unfortunately, all formats other than pickle, npz and csv require installing additional packages on Datalore. To avoid the hassle, we used pickle despite its risks.
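
For comparison, a small table can be written to and read from CSV with pandas alone. The following is only a sketch on made-up data (the table `toy` and the column `load` are assumptions), using an in-memory buffer instead of a real file; it is not how the course file was produced.

```python
import io
import pandas as pd

# Hypothetical toy table with a datetime index (values are made up).
toy = pd.DataFrame(
    {"AreaName": ["AT CTY", "AT CTY"], "load": [6000, 6100]},
    index=pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00"]),
)

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)

# Read it back; index_col=0 and parse_dates=True restore the datetime index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(restored)
```

CSV stores everything as text, so data types such as the datetime index have to be restored explicitly when reading.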

What's in there?

In [None]:
power_demand

That's a lot of information... let's get an overview! Which columns are there?

In [None]:
power_demand.columns

In pandas, rows can also have names. They are called the index of the pandas dataframe:

In [None]:
power_demand.index

# Accessing data

The `dataframe.columns` attribute tells us which columns are available in our table. To access a column, you can use `dataframe_name["Column name"]`.

In [None]:
power_demand["AreaName"]

Which areas are there? The method `unique()` returns the unique entries in that column.

In [None]:
power_demand["AreaName"].unique()

What do the abbreviations mean? Do you think Austria is there? But what if the list were very long and we did not want to check manually whether Austria is in the list of unique countries?

We can use a comparison and `np.sum` to find out! Note that a boolean value counts as 1 if it is `True`, and 0 otherwise.

In [None]:
np.sum(power_demand["AreaName"] == "AT CTY")
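
The same counting trick can be tried on a tiny, made-up column first. The series `areas` and its labels other than `"AT CTY"` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical column of area names (made up for illustration).
areas = pd.Series(["AT CTY", "DE CTY", "AT CTY", "FR CTY"])

is_austria = areas == "AT CTY"   # element-wise comparison gives booleans
print(is_austria.tolist())       # [True, False, True, False]

print(np.sum(is_austria))        # number of matching rows: 2
print(is_austria.any())          # True if at least one row matches
```

If only the yes/no answer matters, `.any()` expresses the intent more directly than counting.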

What does that answer tell us?

We can also access rows of the dataframe by their index labels using `.loc`:

In [None]:
power_demand.loc['2015-01-01']

In [None]:
power_demand.loc['2015-01-01 00:00:00']

In [None]:
power_demand.loc['2015-01-01':'2015-02-02']
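
How date strings select rows is easier to see on a small example. The sketch below uses a made-up hourly table `toy` (an assumption, not the course data) and counts the rows each selection returns.

```python
import pandas as pd

# Hypothetical hourly table covering three days (values are made up).
hours = pd.date_range("2015-01-01", periods=72, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"load": range(72)}, index=hours)

one_day = toy.loc["2015-01-01"]                 # all 24 hours of that day
two_days = toy.loc["2015-01-01":"2015-01-02"]   # both end days are included

print(len(one_day))    # 24
print(len(two_days))   # 48
```

Unlike slices of Python lists, a label slice with `.loc` includes its end point, here the whole of the second day.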

If we want to, we can also access rows and columns by their integer position (similar to numpy). For that purpose, we use `.iloc`.

In [None]:
power_demand.iloc[0, 0]
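
A few more position-based selections, again as a sketch on a made-up table (`toy`, the column `load` and all values are assumptions):

```python
import pandas as pd

# Hypothetical toy table (values are made up).
toy = pd.DataFrame(
    {"AreaName": ["AT CTY", "DE CTY", "AT CTY"], "load": [6000, 55000, 6100]},
    index=pd.to_datetime(
        ["2015-01-01 00:00", "2015-01-01 00:00", "2015-01-01 01:00"]
    ),
)

print(toy.iloc[0, 0])    # first row, first column: AT CTY
print(toy.iloc[0])       # the whole first row
print(toy.iloc[:2, 1])   # first two rows of the second column
```

Positions start at 0 and slices exclude the end position, just as in numpy.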

Let's get Austrian data! We can filter rows by the values of a column using `dataframe_name[dataframe_name["column_name"] == criterion]`. This is similar to how we filtered numpy arrays.

In [None]:
def filter_country(load, country):
    country_load = load[load["AreaName"] == country]
    return country_load


power_demand_at = filter_country(power_demand, "AT CTY")
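
What happens inside `filter_country` can be broken into two steps. The sketch below uses a made-up table (`toy`, the column `load` and all values are assumptions) to show the intermediate boolean mask.

```python
import pandas as pd

# Hypothetical toy table (values are made up).
toy = pd.DataFrame(
    {"AreaName": ["AT CTY", "DE CTY", "AT CTY"], "load": [6000, 55000, 6100]}
)

mask = toy["AreaName"] == "AT CTY"   # one boolean per row
print(mask.tolist())                 # [True, False, True]

toy_at = toy[mask]                   # keeps only rows where the mask is True
print(toy_at)
```

The filtered table keeps the original row labels, so its index is no longer consecutive.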

In [None]:
power_demand_at

## Exercise 1

Let's do some summary statistics. Calculate the mean, standard deviation, min, max and the 25%, 50% and 75% quantiles of the distribution of the Austrian data. Hint: There may be a single pandas function that does it for you...

In a second step, do the same for Germany. Does the "times 10" rule hold (everything in Germany is ten times as big as in Austria)?

In [None]:
# # # # # YOUR SOLUTION GOES HERE # # # # #

In [None]:
# # # # # YOUR SOLUTION GOES HERE # # # # #

In [None]:
# # # # # YOUR SOLUTION GOES HERE # # # # #