<div style = "display:flex; flex-direction:row; flex-wrap:wrap">
    <p style = "flex:1 0; width:50%; text-align:left">2023WS_12632</p>
    <p style = "flex:1 0; width:50%; text-align:right">30. November 2023</p>
</div>

### <strong>Assignment:</strong> Regular Python Lists when Computing a Loss Function
##### <strong>Module:</strong> Scientific Programming with Python
##### <strong>Professor:</strong> Karl N. Kirschner
Department of Computer Science, University of Applied Sciences Bonn-Rhein-Sieg,
Sankt Augustin, Germany
<hr>
<br>

<strong>Goal: </strong> The goal of this assignment is to understand how Numpy usage <strong>[5]</strong> affects the performance of numerical calculations.

<strong>Problem and Input Data: </strong>Weather researchers created a machine learning model that predicts the rainfall and evaporation on different days at different location in Australia <strong>[3]</strong>. The experimental and model observables collected are shown in Table 1. 
<br>

<strong>The observables are described by the following:</strong>

- Date - The observation date.

- Location - The weather station location.

- MinTemp - The minimum temperature (°C).

- MaxTemp - The maximum temperature (°C).

- Rainfall - The rainfall amount in 24 hours (mm).

- Evaporation - The evaporation amount in 24 hours (mm).

- Sunshine - The sunshine amount in 24 hours (h).

- WindGustSpeed - The maximum wind gust speed in 24 hours (h).

- RainToday - Did it rain on that day? yes: if precipitation >= 1 mm, no:if precipitation < 1 mm.

- RainTomorrow - Did it rain in the following day? yes: if precipitation >= 1 mm, no: if precipitation < 1 mm.

<strong>Equation:</strong> Loss function to calculate the predicted rainfall and evaporation <strong>[1, 2]</strong>:
#### \begin{equation}
    Loss = \alpha*|R^{Pred.} - R^{Exp.}| + \beta*|E^{Pred.} - E^{Exp.}|
\end{equation}
where $\alpha$ is the rainfall weighting factor, $\beta$ is the evaporation weighting factor, $R^{Pred.}$ and $R^{Exp.}$ are the predicted and experimental rainfall values, while $E^{Pred.}$: and $E^{Exp.}$ are the corresponding evaporation values.
<br>
<hr>

First and foremost, required libraries for the assignment should be imported, so that all later code blocks can run well. For this assignment we need 1. Pandas library and 2. Numpy library:

In [68]:
import pandas as pd
import numpy as np

<hr>
<strong style = "font-size:20px">Task 1</strong>
<hr>

In this task we need to load the target data from 2 designated .csv file called <strong><i>"weather_experiment.csv"</i></strong> and <strong><i>"weather_prediction.csv"</i></strong>

1. To load the data from the file, we have to utilize Pandas library. The first imported dataframe, which is the one containing actual values measured in experiments, will be stored in variable <code>df_exp</code>:

In [69]:
df_exp = pd.read_csv("weather_experiment.csv", header=0, sep=",")

print("\nLoaded data:\n\n")
df_exp


Loaded data:




Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,RainToday,RainTomorrow
0,2009-01-01,Cobar,17.9,35.2,0.0,12.0,12.3,48.0,No,No
1,2009-01-02,Cobar,18.4,28.9,0.0,14.8,13.0,37.0,No,No
2,2009-01-04,Cobar,19.4,37.6,0.0,10.8,10.6,46.0,No,No
3,2009-01-05,Cobar,21.9,38.4,0.0,11.4,12.2,31.0,No,No
4,2009-01-06,Cobar,24.2,41.0,0.0,11.2,8.4,35.0,No,No
...,...,...,...,...,...,...,...,...,...,...
55242,2017-06-20,Darwin,19.3,33.4,0.0,6.0,11.0,35.0,No,No
55243,2017-06-21,Darwin,21.2,32.6,0.0,7.6,8.6,37.0,No,No
55244,2017-06-22,Darwin,20.7,32.8,0.0,5.6,11.0,33.0,No,No
55245,2017-06-23,Darwin,19.5,31.8,0.0,6.2,10.6,26.0,No,No


By printing the data set out, we can see that it comprises of 10 columns or types of data, all of which are actual results of real life experiments and measurements in 55247 different cases

Then the data should be cleaned, in which rows either without valid and/or duplicated values will be removed, using Numpy's <code>dropna()</code> and <code>drop_duplicates()</code> functions:

In [70]:
# Drop rows with missing values
df_exp.dropna(axis=0, how='any', inplace=True)

# Drop duplicate rows
df_exp.drop_duplicates(inplace=True)

print("\nCleaned data:\n\n")
df_exp


Cleaned data:




Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,RainToday,RainTomorrow
0,2009-01-01,Cobar,17.9,35.2,0.0,12.0,12.3,48.0,No,No
1,2009-01-02,Cobar,18.4,28.9,0.0,14.8,13.0,37.0,No,No
2,2009-01-04,Cobar,19.4,37.6,0.0,10.8,10.6,46.0,No,No
3,2009-01-05,Cobar,21.9,38.4,0.0,11.4,12.2,31.0,No,No
4,2009-01-06,Cobar,24.2,41.0,0.0,11.2,8.4,35.0,No,No
...,...,...,...,...,...,...,...,...,...,...
55242,2017-06-20,Darwin,19.3,33.4,0.0,6.0,11.0,35.0,No,No
55243,2017-06-21,Darwin,21.2,32.6,0.0,7.6,8.6,37.0,No,No
55244,2017-06-22,Darwin,20.7,32.8,0.0,5.6,11.0,33.0,No,No
55245,2017-06-23,Darwin,19.5,31.8,0.0,6.2,10.6,26.0,No,No


2. And we do the same with the second .csv file, which contains predicted rainfall and evaporation values. The dataframe will be stored in <code>df_predict</code>:

In [71]:
df_pred = pd.read_csv("weather_prediction.csv",header=0, sep=",")

print("\nLoaded data:\n\n")
df_pred


Loaded data:




Unnamed: 0,Date,Location,Rainfall Pred.,Evaporation Pred.
0,2009-01-01,Cobar,7.098998,6.179719
1,2009-01-02,Cobar,1.433238,6.375806
2,2009-01-04,Cobar,0.914834,5.687946
3,2009-01-05,Cobar,5.285904,6.897139
4,2009-01-06,Cobar,0.993975,0.050364
...,...,...,...,...
55242,2017-06-20,Darwin,5.693780,3.400099
55243,2017-06-21,Darwin,1.548031,1.696780
55244,2017-06-22,Darwin,1.516136,2.945245
55245,2017-06-23,Darwin,1.158509,4.960711


And the second data set comprises of 4 columns or types of data, all of which are predicted values, also in 55247 different cases. I'll also do the data cleaning with this dataframe:

In [72]:
# Drop rows with missing values
df_pred.dropna(axis=0, how='any', inplace=True)

# Drop duplicate rows
df_pred.drop_duplicates(inplace=True)

print("\nCleaned data:\n\n")
df_pred


Cleaned data:




Unnamed: 0,Date,Location,Rainfall Pred.,Evaporation Pred.
0,2009-01-01,Cobar,7.098998,6.179719
1,2009-01-02,Cobar,1.433238,6.375806
2,2009-01-04,Cobar,0.914834,5.687946
3,2009-01-05,Cobar,5.285904,6.897139
4,2009-01-06,Cobar,0.993975,0.050364
...,...,...,...,...
55242,2017-06-20,Darwin,5.693780,3.400099
55243,2017-06-21,Darwin,1.548031,1.696780
55244,2017-06-22,Darwin,1.516136,2.945245
55245,2017-06-23,Darwin,1.158509,4.960711


It seems that both of the given data sets do not contain any rows with invalid values. However it is best practice to clean the data regardless, since manually knowing which data set contains invalid data and which do not is outside humans' capability.

<hr>
<strong style = "font-size:20px">Task 2</strong>
<hr>

- In this task we need to create user-defined functions that encodes and computes the loss function (Equation 1), which:

    1. performs the calculation using regular Python lists (i.e. do not use Numpy or ndarrays), and
    
    2. performs the calculation using Numpy (i.e. maximizing the use of Numpy’s library and performance <strong>[5]</strong>).

- To perform the calculation using regular Python lists, a function will be created, which receives the variables as parameters ($\alpha$, $\beta$, $R^{Pred}$, $R^{Exp}$, $E^{Pred}$ and $E^{Exp}$), in that exact order:

In [73]:
# 1. Using regular Python lists
def loss_function_regular(alpha:float, beta:float, r_pred:list, r_exp:list, e_pred:list, e_exp:list):
    return alpha*abs(r_pred - r_exp) + beta*abs(e_pred - e_exp)

- To perform the calculation using Numpy, the data lists will first have to be converted to Numpy arrays using <code>np.array()</code> function, and then the results will be calculated using Numpy's built-in functions:

In [74]:
# 1. Using numpy
# The custom function will receive numpy arrays as parameters instead of lists
def loss_function_numpy(alpha:float, beta:float, 
                        r_pred: np.ndarray, r_exp: np.ndarray, 
                        e_pred: np.ndarray, e_exp: np.ndarray):
    
    # Then we calculate the results using Numpy's functions
    return alpha * np.abs(r_pred_np - r_exp_np) + beta * np.abs(e_pred_np - e_exp_np)

<hr>
<strong style = "font-size:20px">Task 3</strong>
<hr>

- In this task we need to Evaluate the speed performance between Task 2 functions by computing the loss value for when $\alpha$ = $\beta$ = 0.5

To complete this task, <code>timeit</code> library will be needed. As such, it needs to be imported:

In [75]:
import timeit

The required data will then be saved as lists, and converted to Numpy arrays to be used as parameters for the 2 custom functions written above:

In [76]:
# First we extract the required data from the corresponding dataframes
r_pred = df_pred['Rainfall Pred.']
r_exp = df_exp['Rainfall']
e_pred = df_pred['Evaporation Pred.']
e_exp = df_exp['Evaporation']

# Then we convert the data from lists to Numpy arrays
r_pred_np = np.array(r_pred)
r_exp_np = np.array(r_exp)
e_pred_np = np.array(e_pred)
e_exp_np = np.array(e_exp)

Syntax <code>%%timeit</code> will then be used to measure execution time of the entire code block. This implementation of the <code>timeit</code> library is only accepted in Jupyter Notebooks <strong>[4]</strong>, as such it should be fully utilized in order to be able to minimize having to write longer code, or having to use lambda functions.

In [77]:
%%timeit

loss_function_regular(alpha=.5, beta=.5, 
                      r_pred=df_pred['Rainfall Pred.'], r_exp=df_exp['Rainfall'], 
                      e_pred=df_pred['Evaporation Pred.'], e_exp=df_exp['Evaporation'])

630 µs ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


As we can see, execution time per loop is around 614 µs ± 34.1 µs when normal lists are used for the calculation.

In [None]:
%%timeit

loss_function_numpy(alpha=.5, beta=.5,
                    r_pred=r_pred_np, r_exp=r_exp_np, 
                    e_pred=e_pred_np, e_exp=e_exp_np)

While it is 206 µs ± 24.3 µs per loop when Numpy library is utilized

As shown by the results of the <code>%%timeit</code> function, in this particular case, by using Numpy library for the same calculations with the same data set, the codes are executed at approximately triple the efficiency, in comparision to only using normal Python's built-in lists. 

In conclusion, the Numpy library is absolutely useful and should be utilized to the best of it's capacity, especially for developers, people who prioritize efficiency above all.

<hr>

## References:
[1] Wikipedia contributors. Loss Function. https: //en.wikipedia.org/wiki/Loss_function. Accessed 30/11/2023<br>
[2] Xiao-xiong You, Zhao-ming Liang, Ya-qiang Wang, Hui Zhang. A study on loss function against data imbalance in deep learning correction of precipitation forecasts. Atmospheric Research. Accessed 01/12/2023 <br>
[3] Joe Young and Adamyoung. Rain in Australia, Kaggle https://www.kaggle.con/datasets/jsphyg/weather-dataset-rattle-package?resource=downloadéselect=weatherAUS.csv. Online. Accessed 01/12/2023 <br>
[4] Python Developers. timeit — Measure execution time of small code snippets. https://docs.python.org/3/library/timeit.html. © Copyright 2001-2023, Python Software Foundation. Accessed 02/12/2023<br>
[5] NumPy Developers. Numpy Documentations: https://numpy.org/doc/stable/reference/index.html#reference.© Copyright 2008-2022. Accessed 02/12/2023