
<img src="./src/copernicus-logo.png"><span style="margin-left: 40px"></span><img src="./src/cds-logo.jpeg">

# Data Modelling

In this phase, we discuss possible models for predicting the discharge at a particular river location from the air temperature and the location's past history, following the results of the Granger causality analysis.

## Choice of the Model

A wide array of methods is available for time series forecasting. One of the most commonly used is the Autoregressive Moving Average (ARMA) model, a statistical model that predicts future values from past values. However, ARMA does not capture seasonal trends, and it assumes that the time series is stationary, meaning that its statistical properties do not change over time. This assumption often fails in practice, in which case ARMA may give skewed results. Our ADF and KPSS tests indicate how far it applies here: they found air temperature to be stationary, but not discharge. The Autoregressive Integrated Moving Average (ARIMA) model extends ARMA by dropping the stationarity assumption, but it still assumes that the data exhibits little to no seasonality. The seasonal ARIMA (SARIMA) variant can work with non-stationary data and capture some seasonality, so we build a SARIMA model to set an initial baseline, and then consider other approaches and algorithms to improve on this experimental model and find the best one.
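The "integrated" and "seasonal" parts of ARIMA and SARIMA rest on differencing. The sketch below illustrates the idea on a synthetic monthly series rather than on the river data; the series, the `period` of 12, and the helper `seasonal_difference` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, period = 10, 12
t = np.arange(n_years * period)

# Synthetic monthly series: linear trend + annual cycle + noise (not real data)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / period) + rng.normal(0, 1, t.size)


def seasonal_difference(x, period):
    """Subtract from each value the value one full season earlier."""
    return x[period:] - x[:-period]


# First difference: removes the linear trend (the "I" in ARIMA, d = 1)
first_diff = np.diff(series)

# Seasonal difference: cancels the annual cycle (the seasonal D = 1 in SARIMA);
# the linear trend is reduced to a constant offset of 0.5 * period
seasonal_diff = seasonal_difference(series, period)

# Applying both leaves a series that fluctuates around zero
both = np.diff(seasonal_diff)

print("std of original series:    ", round(series.std(), 2))
print("std after seasonal diff:   ", round(seasonal_diff.std(), 2))
print("mean after both differences:", round(both.mean(), 3))
```

After differencing, the spread of the series is far smaller and its mean no longer drifts, which is the kind of behaviour an ARMA-type model expects.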


## Library dependencies

In [13]:
%pip install seaborn

#!pip install seaborn

Collecting seaborn
  Using cached seaborn-0.11.2-py3-none-any.whl (292 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.11.2
You should consider upgrading via the '/Users/kode/Desktop/Copernicus-river-discharges/venv/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [16]:
import datetime, json
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [5]:
# Columns to load from each yearly sample file
cols = ['time', 'lat', 'lon', 'discharge', 'temp', 'prec']
# Numeric suffix in the file name: 6 for 2011, increasing by one per year
j = 6
df = pd.DataFrame(columns=cols)
for i in range(2011, 2022):  # years 2011 through 2021
    tmp = pd.read_csv("samples/italy-dtp-{}-{}.csv".format(i, j), usecols=cols)
    df = pd.concat([df, tmp])
    j += 1

In [6]:
df.head()

| | time | lat | lon | discharge | temp | prec |
|---|---|---|---|---|---|---|
| 0 | 2011-01-01 12:00:00 | 44.750378 | 7.56052 | 11.211914 | 5.205774 | 0.0 |
| 1 | 2011-01-02 12:00:00 | 44.750378 | 7.56052 | 10.950195 | 4.364069 | 6e-06 |
| 2 | 2011-01-03 12:00:00 | 44.750378 | 7.56052 | 10.685547 | 6.473016 | 2e-06 |
| 3 | 2011-01-04 12:00:00 | 44.750378 | 7.56052 | 10.419922 | 8.097437 | 0.0 |
| 4 | 2011-01-05 12:00:00 | 44.750378 | 7.56052 | 10.15918 | 6.304498 | 0.0 |


In [9]:
df.describe()

| | time | lat | lon | discharge | temp | prec |
|---|---|---|---|---|---|---|
| count | 35788326 | 35788330.0 | 35788330.0 | 35788330.0 | 35788330.0 | 35788326.0 |
| unique | 4018 | 8899.0 | 8905.0 | 941887.0 | 1560222.0 | 29903592.0 |
| top | 2011-01-01 12:00:00 | 46.66971 | 9.002067 | 0.0009765625 | 12.26299 | 0.0 |
| freq | 8907 | 8036.0 | 8036.0 | 67538.0 | 76.0 | 3911611.0 |


The Granger causality analysis found no causal dependency of discharge on precipitation, so we leave this column out of model development.

In [11]:
df.drop(columns=['prec'], inplace=True) 

In [None]:
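# Use the parsed timestamps as the index
df.index = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
# Drop the now-redundant column; the result must be assigned back to take effect
df = df.drop(columns=['time'])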
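As a rough illustration of what the datetime index makes possible, the sketch below builds a tiny synthetic stand-in for `df` (the values and the two grid points are made up for the example), applies the same indexing step, and then selects a date range and resamples for a single location. The names `sample` and `site` are hypothetical.

```python
import pandas as pd

# Tiny synthetic stand-in for df: two grid points, four daily readings each
times = pd.date_range("2011-01-01 12:00:00", periods=4, freq="D")
sample = pd.DataFrame({
    "time": list(times.strftime("%Y-%m-%d %H:%M:%S")) * 2,
    "lat": [44.75] * 4 + [46.67] * 4,
    "lon": [7.56] * 4 + [9.00] * 4,
    "discharge": [11.2, 11.0, 10.7, 10.4, 3.1, 3.0, 2.9, 2.8],
})

# Same indexing step as in the cell above
sample.index = pd.to_datetime(sample["time"], format="%Y-%m-%d %H:%M:%S")
sample = sample.drop(columns=["time"])

# Keep one location so that the index holds one row per timestamp
site = sample[(sample["lat"] == 44.75) & (sample["lon"] == 7.56)]

# Date-based slicing by partial date strings
first_two_days = site.loc["2011-01-01":"2011-01-02", "discharge"]

# Resampling to two-day means
two_day_mean = site["discharge"].resample("2D").mean()

print(first_two_days)
print(two_day_mean)
```

Filtering to one location first matters because the full dataset repeats every timestamp once per grid point, so the index alone does not identify a single time series.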

In [None]:
plt.ylabel('Discharge')
plt.xlabel('Time')
plt.xticks(rotation=45)
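The cell above only sets the axis labels. One possible way to complete it, assuming the intent is to plot the discharge of a single location over time, is sketched below on a synthetic one-year series; the series `discharge` and its values are made up for the example and do not come from the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily discharge for one location (illustrative values only)
idx = pd.date_range("2011-01-01 12:00:00", periods=365, freq="D")
discharge = pd.Series(
    10 + 3 * np.sin(2 * np.pi * np.arange(365) / 365),
    index=idx,
    name="discharge",
)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(discharge.index, discharge.values)  # time on x, discharge on y
ax.set_xlabel("Time")
ax.set_ylabel("Discharge")
ax.tick_params(axis="x", labelrotation=45)  # same effect as plt.xticks(rotation=45)
fig.tight_layout()
```

With the real data, the series would first be restricted to one `lat`/`lon` pair, since plotting all grid points as a single line would mix unrelated locations.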