<div style = "display:flex; flex-direction:row; flex-wrap:wrap">
    <p style = "flex:1 0; width:50%; text-align:left">2023WS_12632</p>
    <p style = "flex:1 0; width:50%; text-align:right">16 January 2024</p>
</div>

### <strong>Assignment:</strong> The Interaction Between two Atoms
##### <strong>Module:</strong> Scientific Programming with Python
##### <strong>Professor:</strong> Karl N. Kirschner
Department of Computer Science, University of Applied Sciences Bonn-Rhein-Sieg,
Sankt Augustin, Germany
<hr>
<br>

<strong>Goal: </strong> The goal of this assignment is to apply the scientific computing tools covered so far, specifically the <code>matplotlib</code>, <code>numpy</code> <strong>[5]</strong> and <code>pandas</code> libraries.

<strong>Problem and Input Data: </strong>In Toronto, Canada, Environment and Climate Change Canada's National Air Pollution
Surveillance (NAPS) program recorded the hourly concentrations (parts per billion; ppb) of ozone ($\ce{O3}$) and nitrogen dioxide
($\ce{NO2}$) at five different stations [5]. These two molecules are considered atmospheric pollutants when near the Earth's surface.
Their relationship to each other can be seen in the following reactions:

$$
\begin{align}
    \ce{NO2 ->[h\nu] NO + O} \quad \text{(1)} \\
    \ce{O + O2 -> O3} \quad \text{(2)} \\
    \ce{O3 + NO -> O2 + NO2} \quad \text{(3)}
\end{align}
$$

where $h\nu$ represents light. During the daytime, Eq. 1 dominates due to the sunlight that is present and, together with Eq. 2, results in an increase in
$\ce{O3}$ concentration. However, at night Eq. 3 dominates, resulting in an increase in $\ce{NO2}$ concentration. As a result, the concentrations of
$\ce{O3}$ and $\ce{NO2}$ are correlated with each other [6].
<br>

The data was collected in five CSV-formatted files, entitled <strong>"Toronto2020_Station_n.csv"</strong>, where n = 1 to 5.

<strong>Each file includes columns for:</strong><br>
- the "Time" (i.e., date and time)<br>

- the $\ce{O3}$ concentration (in ppb)<br>

- the $\ce{NO2}$ concentration (in ppb)

First, the libraries required for the assignment are imported so that all later code blocks can run. For this assignment we need <code>matplotlib</code>, <code>numpy</code> and <code>pandas</code>. The imports are sorted alphabetically:

In [14]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

<hr>
<strong style = "font-size:20px">Task 1</strong>
<hr>

In this task we need to load the target data from the five .csv files mentioned above.

1. To load the data from the files, we use the <code>pandas</code> library. The imported dataframes are named <code>dfn</code>, with n = 1 to 5:

In [15]:
df1 = pd.read_csv("Toronto2020_Station_1.csv", header=0, sep=",")
df2 = pd.read_csv("Toronto2020_Station_2.csv", header=0, sep=",")
df3 = pd.read_csv("Toronto2020_Station_3.csv", header=0, sep=",")
df4 = pd.read_csv("Toronto2020_Station_4.csv", header=0, sep=",")
df5 = pd.read_csv("Toronto2020_Station_5.csv", header=0, sep=",")
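The five near-identical calls could also be written as a loop. The following is only a sketch: the helper name `load_stations` and its parameters are hypothetical, and it assumes the files follow the naming pattern given above.

```python
from pathlib import Path

import pandas as pd


def load_stations(directory=".", n_stations=5):
    """Read Toronto2020_Station_<n>.csv for n = 1..n_stations into a dict.

    Hypothetical helper: the dict maps the station number to its dataframe.
    """
    frames = {}
    for n in range(1, n_stations + 1):
        path = Path(directory) / f"Toronto2020_Station_{n}.csv"
        # header=0 and sep="," are the pandas defaults, so they can be omitted
        frames[n] = pd.read_csv(path)
    return frames
```

With this helper, `frames = load_stations()` followed by `frames[1]` would correspond to `df1` above.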

Printing one of the data sets shows that the dataframe was loaded successfully, as all three data columns are present:

In [16]:
df1

Unnamed: 0,Time,station 1 O3,station 1 NO2
0,2020-01-01T00:00:00Z,27.0,6.0
1,2020-01-01T01:00:00Z,27.0,6.0
2,2020-01-01T02:00:00Z,27.0,6.0
3,2020-01-01T03:00:00Z,,
4,2020-01-01T04:00:00Z,26.0,6.0
...,...,...,...
1435,2020-02-29T19:00:00Z,15.0,25.0
1436,2020-02-29T20:00:00Z,8.0,33.0
1437,2020-02-29T21:00:00Z,8.0,35.0
1438,2020-02-29T22:00:00Z,10.0,33.0


Afterwards, we can merge them using the <code>pandas</code> <code>concat()</code> function. The result is stored in the dataframe <code>df</code>:

In [26]:
'''
Concatenate the DataFrames vertically. ignore_index=True stops
the index from restarting at 0 at the beginning of each child
dataframe.
'''
df = pd.concat([df1, df2, df3, df4, df5], ignore_index=True)

print("\nMerged data:\n\n")
df


Merged data:




Unnamed: 0,Time,station 1 O3,station 1 NO2,station 2 O3,station 2 NO2,station 3 O3,station 3 NO2,station 4 O3,station 4 NO2,station 5 O3,station 5 NO2
0,2020-01-01T00:00:00Z,27.0,6.0,,,,,,,,
1,2020-01-01T01:00:00Z,27.0,6.0,,,,,,,,
2,2020-01-01T02:00:00Z,27.0,6.0,,,,,,,,
3,2020-01-01T03:00:00Z,,,,,,,,,,
4,2020-01-01T04:00:00Z,26.0,6.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
7195,2020-02-29T19:00:00Z,,,,,,,,,20.0,19.0
7196,2020-02-29T20:00:00Z,,,,,,,,,21.0,21.0
7197,2020-02-29T21:00:00Z,,,,,,,,,22.0,20.0
7198,2020-02-29T22:00:00Z,,,,,,,,,20.0,20.0
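The vertical concatenation above keeps each station's readings in its own block of rows, which is why most cells are empty. If one row per timestamp is preferred, a column-wise concatenation aligned on "Time" is one alternative. The sketch below uses two made-up mini frames (`s1`, `s2`) instead of the real files; with the real data, the "Unnamed: 0" column would presumably need to be dropped first to avoid duplicate column names.

```python
import pandas as pd

# Made-up mini data standing in for two of the station files
s1 = pd.DataFrame({
    "Time": ["2020-01-01T00:00:00Z", "2020-01-01T01:00:00Z"],
    "station 1 O3": [27.0, 27.0],
    "station 1 NO2": [6.0, 6.0],
})
s2 = pd.DataFrame({
    "Time": ["2020-01-01T00:00:00Z", "2020-01-01T01:00:00Z"],
    "station 2 O3": [25.0, None],
    "station 2 NO2": [8.0, 9.0],
})

# Index each frame by Time, then place the frames side by side (axis=1).
# Rows with the same timestamp are aligned; missing readings stay NaN.
wide = pd.concat([s.set_index("Time") for s in (s1, s2)], axis=1)

print(wide.shape)  # (2, 4): one row per hour, one column per station and pollutant
```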


Now the 'Time' column is converted to the datetime format, as required by the assignment, using the <code>to_datetime()</code> function:

In [19]:
df['Time'] = pd.to_datetime(df['Time'])

df

Unnamed: 0,Time,station 1 O3,station 1 NO2,station 2 O3,station 2 NO2,station 3 O3,station 3 NO2,station 4 O3,station 4 NO2,station 5 O3,station 5 NO2
0,2020-01-01 00:00:00+00:00,27.0,6.0,,,,,,,,
1,2020-01-01 01:00:00+00:00,27.0,6.0,,,,,,,,
2,2020-01-01 02:00:00+00:00,27.0,6.0,,,,,,,,
3,2020-01-01 03:00:00+00:00,,,,,,,,,,
4,2020-01-01 04:00:00+00:00,26.0,6.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
7195,2020-02-29 19:00:00+00:00,,,,,,,,,20.0,19.0
7196,2020-02-29 20:00:00+00:00,,,,,,,,,21.0,21.0
7197,2020-02-29 21:00:00+00:00,,,,,,,,,22.0,20.0
7198,2020-02-29 22:00:00+00:00,,,,,,,,,20.0,20.0
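As a small illustration of what the conversion provides, the sketch below parses two of the timestamps shown above. The variable names are only for illustration. The trailing "Z" is interpreted as UTC, and the `.dt` accessor then exposes components such as the hour.

```python
import pandas as pd

times = pd.Series(["2020-01-01T00:00:00Z", "2020-02-29T22:00:00Z"])
parsed = pd.to_datetime(times)

print(parsed.dt.tz)             # the trailing "Z" is read as UTC
print(parsed.dt.hour.tolist())  # [0, 22]
```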


<hr>
<strong style = "font-size:20px">Task 2</strong>
<hr>

In this task we need to do the following:

1. Using as much hourly data as available across all stations, compute the hourly mean concentrations for $\ce{O3}$ and $\ce{NO2}$.

2. Smooth the $\ce{O3}$ and $\ce{NO2}$ hourly mean concentrations by computing a rolling (i.e., moving) average, using a window
of 24 hours.

3. Using <code>matplotlib</code>, create a plot that shows the following:

    - $\ce{O3}$ and $\ce{NO2}$ rolling average concentrations as a function of the recording time.

1. To calculate the hourly means across stations, we use the <code>pandas</code> <code>mean()</code> function. The column names for O3 and NO2 of the stations are saved into <code>O3_columns</code> and <code>NO2_columns</code> respectively, and the resulting hourly means are then saved into corresponding columns in the dataframe <code>df</code>:

In [29]:
# Calculate hourly mean concentrations for O3 and NO2
O3_columns = [f'station {i} O3' for i in range(1, 6)]
df['hourly mean O3'] = df[O3_columns].mean(axis=1)

NO2_columns = [f'station {i} NO2' for i in range(1, 6)]
df['hourly mean NO2'] = df[NO2_columns].mean(axis=1)

df

Time,station 1 O3,station 1 NO2,station 2 O3,station 2 NO2,station 3 O3,station 3 NO2,station 4 O3,station 4 NO2,station 5 O3,station 5 NO2,hourly mean O3,hourly mean NO2
2020-01-01T00:00:00Z,27.0,6.0,,,,,,,,,27.0,6.0
2020-01-01T04:00:00Z,26.0,6.0,,,,,,,,,26.0,6.0
2020-01-01T05:00:00Z,27.0,5.0,,,,,,,,,27.0,5.0
2020-01-01T07:00:00Z,25.0,7.0,,,,,,,,,25.0,7.0
2020-01-01T08:00:00Z,26.0,7.0,,,,,,,,,26.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2020-02-29T05:00:00Z,,,,,,,,,11.0,24.0,11.0,24.0
2020-02-29T20:00:00Z,,,,,,,,,21.0,21.0,21.0,21.0
2020-02-29T21:00:00Z,,,,,,,,,22.0,20.0,22.0,20.0
2020-02-29T22:00:00Z,,,,,,,,,20.0,20.0,20.0,20.0
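With a vertically concatenated frame, each row carries readings from one station only, so a row-wise mean returns that station's value. One possible way to average across stations for the same hour is to group by "Time" first. The sketch below also covers the 24-hour rolling average and the plot from steps 2 and 3. All function names are hypothetical, it assumes "Time" is still a regular column as at the end of Task 1, and the integer window corresponds to 24 hours only if no hourly rows are missing.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def hourly_station_mean(df, pollutant, n_stations=5):
    """Mean over all stations for each timestamp, ignoring missing values."""
    cols = [f"station {i} {pollutant}" for i in range(1, n_stations + 1)]
    # After a vertical concat each timestamp appears once per station,
    # so collapse the rows per timestamp before averaging across columns.
    per_time = df.groupby("Time")[cols].mean()
    return per_time.mean(axis=1)


def rolling_24h(series):
    """Moving average over 24 samples (24 hours for gap-free hourly data)."""
    return series.sort_index().rolling(window=24).mean()


def plot_rolling(o3_roll, no2_roll):
    """Plot both rolling averages against the recording time."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(o3_roll.index, o3_roll, label="O3 (24 h rolling mean)")
    ax.plot(no2_roll.index, no2_roll, label="NO2 (24 h rolling mean)")
    ax.set_xlabel("Recording time")
    ax.set_ylabel("Concentration (ppb)")
    ax.legend()
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    return fig, ax
```

A possible usage would be `o3 = rolling_24h(hourly_station_mean(df, "O3"))`, the same for `"NO2"`, and then `plot_rolling(o3, no2)`; outside a notebook, `plt.show()` is needed to display the figure.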


<hr>
<strong style = "font-size:20px">Task 3</strong>
<hr>

- In this task we need to evaluate the speed performance of the Task 2 functions by computing the loss value for $\alpha = \beta = 0.5$.

To complete this task, the <code>timeit</code> library is needed, so it is imported first:

In [75]:
import timeit

The required data are then extracted as pandas Series and converted to NumPy arrays, to be used as parameters for the two custom functions written above:

In [76]:
# First we extract the required data from the corresponding dataframes
r_pred = df_pred['Rainfall Pred.']
r_exp = df_exp['Rainfall']
e_pred = df_pred['Evaporation Pred.']
e_exp = df_exp['Evaporation']

# Then we convert the data from pandas Series to NumPy arrays
r_pred_np = np.array(r_pred)
r_exp_np = np.array(r_exp)
e_pred_np = np.array(e_pred)
e_exp_np = np.array(e_exp)

The <code>%%timeit</code> cell magic is then used to measure the execution time of the entire code block. This form of <code>timeit</code> is only available in IPython and Jupyter Notebooks <strong>[4]</strong>; it is used here because it avoids writing longer timing code or wrapping the calls in lambda functions.

In [77]:
%%timeit

loss_function_regular(alpha=.5, beta=.5, 
                      r_pred=df_pred['Rainfall Pred.'], r_exp=df_exp['Rainfall'], 
                      e_pred=df_pred['Evaporation Pred.'], e_exp=df_exp['Evaporation'])

630 µs ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


As we can see, the execution time per loop is around 630 µs ± 59.2 µs when the regular (non-NumPy) function is used for the calculation.

In [None]:
%%timeit

loss_function_numpy(alpha=.5, beta=.5,
                    r_pred=r_pred_np, r_exp=r_exp_np, 
                    e_pred=e_pred_np, e_exp=e_exp_np)

In contrast, it is 206 µs ± 24.3 µs per loop when the NumPy library is used.

As shown by the results of the <code>%%timeit</code> runs, in this particular case, using the NumPy library for the same calculations on the same data set makes the code run approximately three times faster than using only Python's built-in lists.
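Outside a notebook, where <code>%%timeit</code> is not available, a similar comparison can be sketched with the standard <code>timeit</code> module. The two functions below are simple stand-ins (a sum of squared differences), not the loss functions of this assignment, and the data are random numbers; all names are hypothetical.

```python
import timeit

import numpy as np


def squared_error_loop(pred, exp):
    """Sum of squared differences using a plain Python loop (stand-in function)."""
    total = 0.0
    for p, e in zip(pred, exp):
        total += (p - e) ** 2
    return total


def squared_error_numpy(pred, exp):
    """Same quantity computed with vectorised NumPy operations."""
    return float(np.sum((pred - exp) ** 2))


# Random stand-in data, once as NumPy arrays and once as built-in lists
rng = np.random.default_rng(seed=0)
pred_np = rng.random(1_000)
exp_np = rng.random(1_000)
pred_list = pred_np.tolist()
exp_list = exp_np.tolist()

# timeit.repeat returns the total seconds for `number` calls, once per repeat
loop_times = timeit.repeat(lambda: squared_error_loop(pred_list, exp_list),
                           number=200, repeat=5)
numpy_times = timeit.repeat(lambda: squared_error_numpy(pred_np, exp_np),
                            number=200, repeat=5)

# The minimum is usually the least noisy estimate; divide by `number` per call
print(f"loop:  {min(loop_times) / 200 * 1e6:.1f} µs per call")
print(f"numpy: {min(numpy_times) / 200 * 1e6:.1f} µs per call")
```

The absolute timings depend on the machine and the array size, so only the ratio between the two results is meaningful.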

In conclusion, the NumPy library is very useful and should be used to the best of its capacity, especially by developers who prioritize efficiency.

<hr>

## References:
[1] Wikipedia contributors. Loss Function. https://en.wikipedia.org/wiki/Loss_function. Accessed 30/11/2023<br>
[2] Xiao-xiong You, Zhao-ming Liang, Ya-qiang Wang, Hui Zhang. A study on loss function against data imbalance in deep learning correction of precipitation forecasts. Atmospheric Research. Accessed 01/12/2023 <br>
[3] Joe Young and Adamyoung. Rain in Australia, Kaggle. https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package?resource=download&select=weatherAUS.csv. Online. Accessed 01/12/2023<br>
[4] Python Developers. timeit — Measure execution time of small code snippets. https://docs.python.org/3/library/timeit.html. © Copyright 2001-2023, Python Software Foundation. Accessed 02/12/2023<br>
[5] NumPy Developers. NumPy Documentation. https://numpy.org/doc/stable/reference/index.html#reference. © Copyright 2008-2022. Accessed 02/12/2023