# Introduction

TimeLab was created to work with time series data. Its main purpose is to make it easier to use deep learning models on multi-index temporal datasets. With its built-in functions, one can easily organize the data into pairs of input ("past" data) and output ("future" data) and apply functions to the structured data. It also supports splitting the pairs into training, validation, and testing sets, date-wise indexing, and feature plotting.

The usual data flow works with hierarchical (multi-level column) data frames, which suits data such as the stock market, where each company (asset) in a universe is described by a set of features (channels). In TimeLab, the top-level columns are called "units" and the sub-columns are called "channels".

In the stock market example, take "AAPL" and "AMZN" as the units, each having five channels: "Open", "High", "Low", "Close" and "Volume".
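A minimal pandas sketch of that layout, assuming two-level columns with units on the outer level and channels on the inner level. The dates and the zero values are placeholders, not market data.

```python
import numpy as np
import pandas as pd

units = ["AAPL", "AMZN"]
channels = ["Open", "High", "Low", "Close", "Volume"]

# Outer column level = units, inner column level = channels.
columns = pd.MultiIndex.from_product([units, channels])
index = pd.date_range("2021-01-04", periods=3, freq="D")  # placeholder dates

# Placeholder zeros stand in for real prices and volumes.
stocks = pd.DataFrame(
    np.zeros((len(index), len(columns))), index=index, columns=columns
)

# Selecting a unit returns a frame with only that unit's channels.
print(stocks["AAPL"].columns.tolist())
```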

<!-- In case of data with no Multi-level columns, the TimeLab library will automatically create a unit with the name of "main", and all the features will be converted to the channels. -->

# Setup

In [3]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf

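# Aliased because the cells below call `time_panel.from_xy_data` and
# `time_panel.from_data`, and later bind the name `panel` to the created object.
from wavy import panel as time_panel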
from wavy import frequency
from wavy import utils

# Loading data

## The weather dataset

This tutorial uses a <a href="https://www.bgc-jena.mpg.de/wetter/" class="external">weather time series dataset</a> recorded by the <a href="https://www.bgc-jena.mpg.de" class="external">Max Planck Institute for Biogeochemistry</a>.

This dataset contains 14 different features such as air temperature, atmospheric pressure, and humidity. These were collected every 10 minutes, beginning in 2003. For efficiency, you will use only the data collected between 2009 and 2016. This section of the dataset was prepared by François Chollet for his book <a href="https://www.manning.com/books/deep-learning-with-python" class="external">Deep Learning with Python</a>.

In [2]:
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)

csv_path, _ = os.path.splitext(zip_path)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip


We'll deal with hourly predictions, so start by sub-sampling the data from 10-minute intervals to one-hour intervals:

In [31]:
df = pd.read_csv(csv_path)

# Slice [start:stop:step]: starting from index 5, take every 6th record.
df = df[5::6]

date_time = pd.to_datetime(df.pop('Date Time'), format='%d.%m.%Y %H:%M:%S')
df.index = date_time


df.head()

Date Time,p (mbar),T (degC),Tpot (K),Tdew (degC),rh (%),VPmax (mbar),VPact (mbar),VPdef (mbar),sh (g/kg),H2OC (mmol/mol),rho (g/m**3),wv (m/s),max. wv (m/s),wd (deg)
2009-01-01 01:00:00,996.5,-8.05,265.38,-8.78,94.4,3.33,3.14,0.19,1.96,3.15,1307.86,0.21,0.63,192.7
2009-01-01 02:00:00,996.62,-8.88,264.54,-9.77,93.2,3.12,2.9,0.21,1.81,2.91,1312.25,0.25,0.63,190.3
2009-01-01 03:00:00,996.84,-8.81,264.59,-9.66,93.5,3.13,2.93,0.2,1.83,2.94,1312.18,0.18,0.63,167.2
2009-01-01 04:00:00,996.99,-9.05,264.34,-10.02,92.6,3.07,2.85,0.23,1.78,2.85,1313.61,0.1,0.38,240.0
2009-01-01 05:00:00,997.46,-9.63,263.72,-10.65,92.2,2.94,2.71,0.23,1.69,2.71,1317.19,0.4,0.88,157.0
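To see why the slice `[5::6]` yields on-the-hour rows, here is a small standalone check. It assumes the raw series starts at 00:10, which is consistent with the first hourly row above being 01:00.

```python
import pandas as pd

# Eighteen 10-minute timestamps starting at 00:10, mimicking the raw sampling.
ten_minute = pd.date_range("2009-01-01 00:10:00", periods=18, freq="10min")

# Start at position 5 (01:00) and keep every 6th entry after that.
hourly = ten_minute[5::6]

print(list(hourly.strftime("%H:%M")))  # ['01:00', '02:00', '03:00']
```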


# TimeLab

Before creating the TimeLab object, called a panel, we must ensure that the Date Time index contains no duplicated entries. Otherwise, analysis, preprocessing, and model training would give poor results. The function 'resample_datetimes' removes duplicated Date Time entries and resamples the DataFrame according to the given rule. The default value of the 'rule' parameter is '1H'.

In [32]:
df = frequency.resample_datetimes(df)
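The cell above relies on the library's own implementation. As a rough illustration of the two steps described (dropping duplicated timestamps, then resampling), a plain pandas sketch could look like the following. The helper name `dedupe_and_resample`, keeping the first duplicate, and using `.first()` as the aggregation are assumptions, not the library's confirmed behavior, and the values are toy data.

```python
import pandas as pd

def dedupe_and_resample(df, rule="1h"):
    """Drop repeated timestamps, then put the frame on a regular grid."""
    # Assumption: keep the first row of each duplicated timestamp.
    deduped = df[~df.index.duplicated(keep="first")]
    # Assumption: take the first value in each bin; empty bins become NaN.
    return deduped.resample(rule).first()

idx = pd.to_datetime([
    "2009-01-01 01:00", "2009-01-01 01:00",  # duplicated timestamp
    "2009-01-01 02:00", "2009-01-01 04:00",  # 03:00 is missing
])
raw = pd.DataFrame({"T (degC)": [-8.05, -8.00, -8.88, -9.05]}, index=idx)

clean = dedupe_and_resample(raw)
print(clean)
```

The result has one row per hour from 01:00 to 04:00, with NaN at 03:00 where no observation existed.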

This tutorial covers two ways to initialize a panel, the main TimeLab object.

The first way is to give the xdata (inputs) and ydata (outputs) separately. The second way is to give only the data, without specifying which part is input and which is output.

In both ways, we must specify the lookback value (the number of timesteps in each input) and the horizon (the number of timesteps in each output).

There is a third, optional parameter called gap. It represents how many timesteps one input is shifted to the next one. Its default value is 0.
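To make lookback and horizon concrete, here is a minimal NumPy sketch of the windowing idea. The function `make_pairs` is hypothetical and is not the library's implementation; it ignores the gap parameter and the unit dimension.

```python
import numpy as np

def make_pairs(data, lookback, horizon):
    """Slice a (timesteps, channels) array into input/output windows."""
    n_pairs = data.shape[0] - lookback - horizon + 1
    # Input window i covers timesteps i .. i + lookback - 1.
    X = np.stack([data[i : i + lookback] for i in range(n_pairs)])
    # Output window i covers the next `horizon` timesteps.
    y = np.stack(
        [data[i + lookback : i + lookback + horizon] for i in range(n_pairs)]
    )
    return X, y

# Toy series: 10 timesteps, 2 channels.
data = np.arange(20, dtype=float).reshape(10, 2)
X, y = make_pairs(data, lookback=3, horizon=1)

print(X.shape, y.shape)  # (7, 3, 2) (7, 1, 2)
```

With a lookback of 3 and a horizon of 1, the first input covers timesteps 0 to 2 and the first output is timestep 3.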



The first way uses the function 'from_xy_data' with a lookback of 3 and a horizon of 1.

In [33]:
panel = time_panel.from_xy_data(
        xdata=df, ydata=df, lookback=3, horizon=1
    )

100%|██████████| 70124/70124 [00:01<00:00, 43571.17it/s]


The second way uses the function 'from_data' with a lookback of 3 and a horizon of 1.

In [34]:
panel = time_panel.from_data(
        df, lookback=3, horizon=1
    )

100%|██████████| 70124/70124 [00:01<00:00, 64481.83it/s]


# Panel attributes

Now that the panel object has been created, it contains a list of pairs. Each pair represents one input and output sample, according to the horizon and lookback parameters defined when creating the panel.

The panel and the pairs have many attributes that can be accessed easily. The attributes belonging to the panel concern the complete data, while the attributes belonging to a pair concern only that input and output pair.
The main panel attributes include, among others:
* channels
* units
* horizon
* lookback
* gap
* index

In [7]:
print(panel.channels)
print(panel.units)
print(panel.horizon)
print(panel.lookback)
print(panel.gap)
print(panel.index[:10])

['p (mbar)', 'T (degC)', 'Tpot (K)', 'Tdew (degC)', 'rh (%)', 'VPmax (mbar)', 'VPact (mbar)', 'VPdef (mbar)', 'sh (g/kg)', 'H2OC (mmol/mol)', 'rho (g/m**3)', 'wv (m/s)', 'max. wv (m/s)', 'wd (deg)']
['main']
1
3
0
['2009-01-01 01:00:00', '2009-01-01 02:00:00', '2009-01-01 03:00:00', '2009-01-01 04:00:00', '2009-01-01 05:00:00', '2009-01-01 06:00:00', '2009-01-01 07:00:00', '2009-01-01 08:00:00', '2009-01-01 09:00:00', '2009-01-01 10:00:00']


xdata and ydata are also attributes of the panel. They return a DataFrame with all input data concatenated and a DataFrame with all output data concatenated, respectively.

In [8]:
panel.xdata

Unnamed: 0_level_0,main,main,main,main,main,main,main,main,main,main,main,main,main,main
Unnamed: 0_level_1,p (mbar),T (degC),Tpot (K),Tdew (degC),rh (%),VPmax (mbar),VPact (mbar),VPdef (mbar),sh (g/kg),H2OC (mmol/mol),rho (g/m**3),wv (m/s),max. wv (m/s),wd (deg)
2009-01-01 01:00:00,996.5,-8.05,265.38,-8.78,94.4,3.33,3.14,0.19,1.96,3.15,1307.86,0.21,0.63,192.7
2009-01-01 02:00:00,996.62,-8.88,264.54,-9.77,93.2,3.12,2.9,0.21,1.81,2.91,1312.25,0.25,0.63,190.3
2009-01-01 03:00:00,996.84,-8.81,264.59,-9.66,93.5,3.13,2.93,0.2,1.83,2.94,1312.18,0.18,0.63,167.2
2009-01-01 04:00:00,996.99,-9.05,264.34,-10.02,92.6,3.07,2.85,0.23,1.78,2.85,1313.61,0.1,0.38,240
2009-01-01 05:00:00,997.46,-9.63,263.72,-10.65,92.2,2.94,2.71,0.23,1.69,2.71,1317.19,0.4,0.88,157
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016-12-31 19:00:00,1002.18,-0.98,272.01,-5.36,72,5.69,4.09,1.59,2.54,4.08,1280.7,0.87,1.36,190.6
2016-12-31 20:00:00,1001.4,-1.4,271.66,-6.84,66.29,5.51,3.65,1.86,2.27,3.65,1281.87,1.02,1.92,225.4
2016-12-31 21:00:00,1001.19,-2.75,270.32,-6.9,72.9,4.99,3.64,1.35,2.26,3.63,1288.02,0.71,1.56,158.7
2016-12-31 22:00:00,1000.65,-2.89,270.22,-7.15,72.3,4.93,3.57,1.37,2.22,3.57,1288.03,0.35,0.68,216.7


In [9]:
panel.ydata

Unnamed: 0_level_0,main,main,main,main,main,main,main,main,main,main,main,main,main,main
Unnamed: 0_level_1,p (mbar),T (degC),Tpot (K),Tdew (degC),rh (%),VPmax (mbar),VPact (mbar),VPdef (mbar),sh (g/kg),H2OC (mmol/mol),rho (g/m**3),wv (m/s),max. wv (m/s),wd (deg)
2009-01-01 01:00:00,,,,,,,,,,,,,,
2009-01-01 02:00:00,,,,,,,,,,,,,,
2009-01-01 03:00:00,,,,,,,,,,,,,,
2009-01-01 04:00:00,996.99,-9.05,264.34,-10.02,92.6,3.07,2.85,0.23,1.78,2.85,1313.61,0.1,0.38,240
2009-01-01 05:00:00,997.46,-9.63,263.72,-10.65,92.2,2.94,2.71,0.23,1.69,2.71,1317.19,0.4,0.88,157
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016-12-31 19:00:00,1002.18,-0.98,272.01,-5.36,72,5.69,4.09,1.59,2.54,4.08,1280.7,0.87,1.36,190.6
2016-12-31 20:00:00,1001.4,-1.4,271.66,-6.84,66.29,5.51,3.65,1.86,2.27,3.65,1281.87,1.02,1.92,225.4
2016-12-31 21:00:00,1001.19,-2.75,270.32,-6.9,72.9,4.99,3.64,1.35,2.26,3.63,1288.02,0.71,1.56,158.7
2016-12-31 22:00:00,1000.65,-2.89,270.22,-7.15,72.3,4.93,3.57,1.37,2.22,3.57,1288.03,0.35,0.68,216.7


The first three rows of the ydata are NaN values because the first Date Time of the output comes after three timesteps, since the lookback value is equal to 3.
The same happens with the xdata: the last row is composed of NaN values because the horizon value is equal to 1.

To get the same xdata and ydata as numpy arrays, we access the X and y attributes, respectively.

The X and y are 4-dimensional arrays. The first dimension is the number of pairs. The second dimension is the number of units. The third dimension is the number of timesteps of each pair: for the X array, it is the number of timesteps of each input sample (the lookback), and for the y array, it is the number of timesteps of each output sample (the horizon). Finally, the fourth dimension is the number of channels.

This separation is useful for organizing and accessing each section of the data. For instance, to train on or process the data of a specific unit, channel, or pair, we can select it by specifying the index along each dimension.

In [10]:
X = panel.X
X.shape

(70124, 1, 3, 14)

In [11]:
y = panel.y
y.shape

(70124, 1, 1, 14)
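As an illustration of that layout, the sketch below indexes a placeholder array shaped (pairs, units, timesteps, channels). The array is filled with zeros and is much smaller than the real one; only the axis order is taken from the text.

```python
import numpy as np

# Placeholder array with the layout (pairs, units, timesteps, channels).
n_pairs, n_units, lookback, n_channels = 5, 1, 3, 14
X = np.zeros((n_pairs, n_units, lookback, n_channels))

first_pair = X[0]            # one pair      -> (units, timesteps, channels)
first_unit = X[:, 0]         # one unit      -> (pairs, timesteps, channels)
last_timestep = X[:, :, -1]  # one timestep  -> (pairs, units, channels)
second_channel = X[..., 1]   # one channel   -> (pairs, units, timesteps)

print(first_pair.shape, first_unit.shape,
      last_timestep.shape, second_channel.shape)
```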

If the data has no multi-level columns, there is only one unit, and we can drop the unit dimension from the X and y arrays. To do so, we use the 'smash_array' function.

In [12]:
X = utils.smash_array(X)
X.shape

(70124, 3, 14)

In [13]:
y = utils.smash_array(y)
y.shape

(70124, 1, 14)
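The shapes above are consistent with removing the singleton unit axis. A rough NumPy equivalent is sketched below; `drop_unit_axis` is a hypothetical helper, and that 'smash_array' works this way is an assumption inferred only from the printed shapes.

```python
import numpy as np

def drop_unit_axis(array):
    """Remove axis 1 (units) when it holds exactly one unit."""
    if array.shape[1] != 1:
        raise ValueError("expected exactly one unit on axis 1")
    # Squeeze only axis 1, so a horizon of 1 on axis 2 is preserved.
    return np.squeeze(array, axis=1)

X = np.zeros((4, 1, 3, 14))
y = np.zeros((4, 1, 1, 14))

print(drop_unit_axis(X).shape, drop_unit_axis(y).shape)  # (4, 3, 14) (4, 1, 14)
```

Squeezing a single named axis, rather than all singleton axes, keeps the timestep dimension of y intact even when the horizon is 1.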

# Pairs attributes

To access a pair, we index the panel with the position of the desired pair.

To get the number of pairs, we check the length of the panel.

In [14]:
pair0 = panel[0]
pair9 = panel[9]

len(panel)

70124

Now, to access the input and output data from a single pair, we can access their xframe and yframe attributes.

In [15]:
panel[0].xframe

Unnamed: 0_level_0,main,main,main,main,main,main,main,main,main,main,main,main,main,main
Unnamed: 0_level_1,p (mbar),T (degC),Tpot (K),Tdew (degC),rh (%),VPmax (mbar),VPact (mbar),VPdef (mbar),sh (g/kg),H2OC (mmol/mol),rho (g/m**3),wv (m/s),max. wv (m/s),wd (deg)
2009-01-01 01:00:00,996.5,-8.05,265.38,-8.78,94.4,3.33,3.14,0.19,1.96,3.15,1307.86,0.21,0.63,192.7
2009-01-01 02:00:00,996.62,-8.88,264.54,-9.77,93.2,3.12,2.9,0.21,1.81,2.91,1312.25,0.25,0.63,190.3
2009-01-01 03:00:00,996.84,-8.81,264.59,-9.66,93.5,3.13,2.93,0.2,1.83,2.94,1312.18,0.18,0.63,167.2


In [16]:
panel[0].yframe

Unnamed: 0_level_0,main,main,main,main,main,main,main,main,main,main,main,main,main,main
Unnamed: 0_level_1,p (mbar),T (degC),Tpot (K),Tdew (degC),rh (%),VPmax (mbar),VPact (mbar),VPdef (mbar),sh (g/kg),H2OC (mmol/mol),rho (g/m**3),wv (m/s),max. wv (m/s),wd (deg)
2009-01-01 04:00:00,996.99,-9.05,264.34,-10.02,92.6,3.07,2.85,0.23,1.78,2.85,1313.61,0.1,0.38,240.0
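The relation between a pair's position and its frames can be sketched with plain pandas slicing. The helper `get_pair` is hypothetical and uses a single-level frame for brevity; the 'T (degC)' values are copied from the first four hourly rows shown earlier.

```python
import pandas as pd

index = pd.to_datetime([
    "2009-01-01 01:00:00", "2009-01-01 02:00:00",
    "2009-01-01 03:00:00", "2009-01-01 04:00:00",
])
# 'T (degC)' values from the first four hourly rows of the dataset.
frame = pd.DataFrame({"T (degC)": [-8.05, -8.88, -8.81, -9.05]}, index=index)

def get_pair(frame, i, lookback, horizon):
    """Return the (input, output) frames of pair i."""
    xframe = frame.iloc[i : i + lookback]
    yframe = frame.iloc[i + lookback : i + lookback + horizon]
    return xframe, yframe

xframe, yframe = get_pair(frame, 0, lookback=3, horizon=1)
print(xframe.index[-1], yframe.index[0])
```

For pair 0, the input ends at 03:00 and the output is the single 04:00 row, matching the frames displayed above.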


# Plotting

To plot the data over a specific interval, we use the panel's plot_data function. We must specify the 'start' and 'end', and whether to plot the xdata or the ydata.

We can specify the units and channels we want to plot. If none of them is specified, all are displayed.

In [17]:
panel.plot_data(
    start='2009-01-01 01:00:00',
    end='2009-01-02 01:00:00',
    channels=["p (mbar)", "Tpot (K)", "Tdew (degC)"],
    on="xdata",
)

# Processing the data

TimeLab was created to facilitate the processing of the data.

To apply any function to the input or output data, we use the 'xapply' or 'yapply' functions. Their parameters are the function to be applied and the 'on' parameter, which can be 'timestamps' or 'channels'.

The function to be applied can be an ordinary function like np.max, or a custom function defined by the user.

If we apply on the timestamps, each resulting frame has only one Date Time, the first one of the original frame.
If we apply on the channels, the resulting data has only one channel, and we must provide the name of the new channel.

In [18]:
# Finding the max value of the yframe of each pair:
new_panel1 = panel.yapply(np.max, on='timestamps')
new_panel1[0].yframe.head()

100%|██████████| 70124/70124 [00:01<00:00, 56137.78it/s]


Unnamed: 0_level_0,main,main,main,main,main,main,main,main,main,main,main,main,main,main
Unnamed: 0_level_1,p (mbar),T (degC),Tpot (K),Tdew (degC),rh (%),VPmax (mbar),VPact (mbar),VPdef (mbar),sh (g/kg),H2OC (mmol/mol),rho (g/m**3),wv (m/s),max. wv (m/s),wd (deg)
2009-01-01 04:00:00,996.99,-9.05,264.34,-10.02,92.6,3.07,2.85,0.23,1.78,2.85,1313.61,0.1,0.38,240.0
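The two reduction directions can be illustrated on a single toy frame with NumPy. This is only a sketch of the idea of reducing over timesteps versus over channels; it does not call the library, and the numbers are made up.

```python
import numpy as np

# One toy frame: 3 timesteps (rows) by 2 channels (columns).
frame = np.array([[1.0, 10.0],
                  [4.0, 20.0],
                  [2.0, 60.0]])

# Reducing over timesteps keeps one row per frame, with all channels.
per_channel_max = np.max(frame, axis=0, keepdims=True)

# Reducing over channels keeps every timestep but leaves a single new channel.
per_timestep_mean = np.mean(frame, axis=1, keepdims=True)

print(per_channel_max.shape, per_timestep_mean.shape)  # (1, 2) (3, 1)
```

With a horizon of 1, as in the cell above, a reduction over timesteps returns the single output row unchanged, which is why the displayed yframe equals the original one.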


In [19]:
## Finding the mean value of 'p (mbar)' and 'T (degC)' channels:
# def myfunc(X,pair):
#     return (pair.xframe['main']['p (mbar)'] + pair.xframe['main']['T (degC)'])/2


# new_panel2 = panel.xapply(myfunc, on='channels', new_channel="mean_first_two_channels")
# new_panel2[0].xframe.head()

To add this new feature to the existing ones, we use the 'add_channel' function, passing the new panel and the mode we want to apply: 'X' or 'y'.

In [20]:
# new_panel3 = panel.add_channel(new_panel2, mode='X')
# new_panel3[0].xframe

To select only certain channels or units from the data, we use the 'sel' function and pass the parameters 'xchannels', 'ychannels', 'xunits' and 'yunits'. If a parameter is omitted, all of the possible values for that parameter are selected.

In [21]:
panel = panel.sel(ychannels='T (degC)')
panel[0].yframe

100%|██████████| 70124/70124 [00:02<00:00, 24297.18it/s]


Unnamed: 0_level_0,main
Unnamed: 0_level_1,T (degC)
2009-01-01 04:00:00,-9.05
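On a two-level column frame, channel selection amounts to filtering on the inner column level. The pandas sketch below shows one way to do that; `select_channels` is a hypothetical helper, not the library's 'sel', and the values are copied from the 04:00 row shown earlier.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([["main"], ["p (mbar)", "T (degC)"]])
# Values from the 2009-01-01 04:00:00 row of the dataset.
yframe = pd.DataFrame(
    [[996.99, -9.05]],
    index=pd.to_datetime(["2009-01-01 04:00:00"]),
    columns=columns,
)

def select_channels(frame, channels):
    """Keep only the columns whose inner level (channel) is in `channels`."""
    mask = frame.columns.get_level_values(1).isin(channels)
    return frame.loc[:, mask]

selected = select_channels(yframe, ["T (degC)"])
print(selected.columns.tolist())  # [('main', 'T (degC)')]
```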


# Training a model

In this section we show how to use the data from TimeLab in a model.

First, we define a dense model using Keras from TensorFlow.

In [22]:
model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(units=64, activation='relu'),
        tf.keras.layers.Dense(units=64, activation='relu'),
        tf.keras.layers.Dense(units=1),
        tf.keras.layers.Reshape([1, -1]),
        ])

model.compile(
    loss=tf.losses.MeanSquaredError(),
    optimizer=tf.optimizers.Adam(),
    metrics=[tf.metrics.MeanAbsoluteError()],
)
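To follow how the tensor shapes move through this architecture, here is a NumPy-only shape walk-through with random weights and no biases. It is not Keras and trains nothing; it only mirrors the layer order above, assuming inputs of shape (batch, 3, 14) and a single output channel.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, lookback, n_channels = 8, 3, 14

x = rng.normal(size=(batch, lookback, n_channels))

# Flatten: (batch, 3, 14) -> (batch, 42)
h = x.reshape(batch, -1)

# Two Dense(64, relu) layers followed by Dense(1), with random weights.
for n_out, use_relu in [(64, True), (64, True), (1, False)]:
    w = rng.normal(size=(h.shape[1], n_out))
    h = h @ w
    if use_relu:
        h = np.maximum(h, 0.0)

# Reshape([1, -1]): (batch, 1) -> (batch, 1, 1)
out = h.reshape(batch, 1, -1)
print(out.shape)  # (8, 1, 1)
```

The final reshape gives the prediction a (timesteps, channels) layout of (1, 1) per sample, which lines up with a target that has a horizon of 1 and one selected channel.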



Now, we will split the data into training, validation and testing sets.

We can access the data of each subset through the train, val, and test attributes, each of which returns a panel.

In [23]:
panel_train = panel.train
len(panel_train)

49086

In [24]:
panel_val = panel.val
len(panel_val)

14024

In [25]:
panel_test = panel.test
len(panel_test)

7014

By default, panel.train returns the first 70% of the pairs, panel.val returns the next 20%, and panel.test returns the last 10%.
To change these ratios, we use the panel's 'set_train_val_test_sets' function and pass the 'train_size', 'val_size' and 'test_size' parameters as decimals.

In [26]:
panel.set_train_val_test_sets(train_size=0.9, val_size=0.05, test_size=0.05)

panel_train = panel.train
panel_val = panel.val
panel_test = panel.test

print(len(panel_train))
print(len(panel_val))
print(len(panel_test))


63111
3506
3507
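The lengths printed in this section are consistent with flooring the train and validation sizes and giving the remainder to the test set. The sketch below reproduces them under that assumption; `split_sizes` is a hypothetical helper inferred from the printed numbers, not the library's confirmed implementation.

```python
def split_sizes(n_pairs, train_size, val_size):
    """Sequential split: floor train and val, remainder goes to test."""
    n_train = int(n_pairs * train_size)
    n_val = int(n_pairs * val_size)
    n_test = n_pairs - n_train - n_val
    return n_train, n_val, n_test

print(split_sizes(70124, 0.7, 0.2))   # (49086, 14024, 7014)
print(split_sizes(70124, 0.9, 0.05))  # (63111, 3506, 3507)
```

Because the split is sequential, the validation and test pairs come strictly later in time than the training pairs.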


Now we are going to fit the model to the data. First, we convert the 4-dimensional arrays to 3-dimensional ones by removing the unit dimension (there is only one unit).

In [27]:
x_train = utils.smash_array(panel_train.X)
y_train = utils.smash_array(panel_train.y)

x_val = utils.smash_array(panel_val.X)
y_val = utils.smash_array(panel_val.y)

In [28]:
model.fit(
    x_train,
    y_train,
    epochs=30,
    validation_data=(
        x_val,
        y_val,
        ),
)



Epoch 1/30
...
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7ff7650fe430>

After training the model, we can create another panel with the predicted y data instead of the original y. For this, we can use the 'from_predictions' function and pass the trained model as an argument.

In [29]:
panel_predictions = panel.from_predictions(model)

panel.ydata

100%|██████████| 70124/70124 [00:01<00:00, 60185.61it/s]


Unnamed: 0_level_0,main
Unnamed: 0_level_1,T (degC)
2009-01-01 01:00:00,
2009-01-01 02:00:00,
2009-01-01 03:00:00,
2009-01-01 04:00:00,-9.05
2009-01-01 05:00:00,-9.63
...,...
2016-12-31 19:00:00,-0.98
2016-12-31 20:00:00,-1.4
2016-12-31 21:00:00,-2.75
2016-12-31 22:00:00,-2.89
