# Merlion Data Format
This notebook explains how to use Merlion's UnivariateTimeSeries and TimeSeries classes. These classes are the core data format used throughout the repository. In general, you can think of each TimeSeries as a collection of UnivariateTimeSeries objects, one for each variable.

Let's start by loading some data using pandas.

In [1]:
import os
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/salesforce/Merlion/main/data/example.csv')
df

,timestamp_millis,kpi,kpi_label
0,1583140320000,667.118,0
1,1583140380000,611.751,0
2,1583140440000,599.456,0
3,1583140500000,621.446,0
4,1583140560000,1418.234,0
...,...,...,...
86802,1588376760000,874.214,0
86803,1588376820000,937.929,0
86804,1588376880000,1031.279,0
86805,1588376940000,1099.698,0
86806,1588377000000,935.405,0


The timestamp_millis column consists of Unix timestamps (in units of milliseconds), and the kpi column contains the value of the time series metric at each of those timestamps. We will also create a version of this data frame that is indexed by time:

In [2]:
time_idx_df = df.copy()
time_idx_df["timestamp_millis"] = pd.to_datetime(time_idx_df["timestamp_millis"], unit="ms")
time_idx_df = time_idx_df.set_index("timestamp_millis")
time_idx_df

timestamp_millis,kpi,kpi_label
2020-03-02 09:12:00,667.118,0
2020-03-02 09:13:00,611.751,0
2020-03-02 09:14:00,599.456,0
2020-03-02 09:15:00,621.446,0
2020-03-02 09:16:00,1418.234,0
...,...,...
2020-05-01 23:46:00,874.214,0
2020-05-01 23:47:00,937.929,0
2020-05-01 23:48:00,1031.279,0
2020-05-01 23:49:00,1099.698,0
2020-05-01 23:50:00,935.405,0


## UnivariateTimeSeries: The Basic Building Block

The most transparent way to initialize a UnivariateTimeSeries is to use its constructor.

The constructor takes two arguments:
- time_stamps, a list of Unix timestamps (in units of seconds) or datetime objects,
- and values, a list of the actual values of the time series.

Optionally, you can also provide a name.
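
As a quick sanity check on units, the following pandas-only sketch shows that millisecond timestamps divided by 1000 describe the same instants as the corresponding datetime objects. The two sample values are the first rows of the example data; the variable names are illustrative and not part of Merlion.

```python
import pandas as pd

# First two raw timestamps from the example data, in milliseconds.
millis = pd.Series([1583140320000, 1583140380000])

# Dividing by 1000 gives Unix timestamps in seconds.
seconds = millis / 1000

# Both representations decode to the same datetimes.
from_millis = pd.to_datetime(millis, unit="ms")
from_seconds = pd.to_datetime(seconds, unit="s")

print(from_seconds.tolist())
```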

In [3]:
from merlion.utils import UnivariateTimeSeries

kpi = UnivariateTimeSeries(
    time_stamps=df.timestamp_millis/1000,  # timestamps in units of seconds
    values=df.kpi,                         # time series values
    name="kpi"                             # optional: a name for this univariate
)

kpi_label = UnivariateTimeSeries(
    time_stamps=df.timestamp_millis/1000,  # timestamps in units of seconds
    values=df.kpi_label                    # time series values
)

Alternatively, you can initialize a UnivariateTimeSeries directly from a time-indexed pd.Series:

In [4]:
kpi_equivalent = UnivariateTimeSeries.from_pd(time_idx_df.kpi)
print(f"Are the two UnivariateTimeSeries equal?\n {(kpi == kpi_equivalent).all()}")

Are the two UnivariateTimeSeries equal?
 True


UnivariateTimeSeries is implemented as a pd.Series with a DatetimeIndex:


In [5]:
print(f"Is {type(kpi).__name__} an instance of pd.Series? "
      f"{isinstance(kpi, pd.Series)}")

Is UnivariateTimeSeries an instance of pd.Series? True


In [6]:
kpi

2020-03-02 09:12:00     667.118
2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
                         ...   
2020-05-01 23:46:00     874.214
2020-05-01 23:47:00     937.929
2020-05-01 23:48:00    1031.279
2020-05-01 23:49:00    1099.698
2020-05-01 23:50:00     935.405
Name: kpi, Length: 86807, dtype: float64

You can also convert a UnivariateTimeSeries to a regular pd.Series as follows:

In [7]:
print(f"type(kpi.to_pd()) = {type(kpi.to_pd())}")

type(kpi.to_pd()) = <class 'pandas.core.series.Series'>


In [8]:
kpi.to_pd()

2020-03-02 09:12:00     667.118
2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
                         ...   
2020-05-01 23:46:00     874.214
2020-05-01 23:47:00     937.929
2020-05-01 23:48:00    1031.279
2020-05-01 23:49:00    1099.698
2020-05-01 23:50:00     935.405
Name: kpi, Length: 86807, dtype: float64

You can access the timestamps (either as Unix timestamps or as datetime objects) and the values independently:
- .time_stamps: the Unix timestamps, in seconds
- .index: the datetime objects
- .values: the time series values

In [9]:
# Get the Unix timestamps (first 5 for brevity)
kpi.time_stamps[:5]

[1583140320.0, 1583140380.0, 1583140440.0, 1583140500.0, 1583140560.0]

In [10]:
# Get the datetimes (this is just the index of the UnivariateTimeSeries,
# since we inherit from pd.Series)
kpi.index[:5]

DatetimeIndex(['2020-03-02 09:12:00', '2020-03-02 09:13:00',
               '2020-03-02 09:14:00', '2020-03-02 09:15:00',
               '2020-03-02 09:16:00'],
              dtype='datetime64[ns]', freq=None)

In [11]:
# Get the values
kpi.values[:5]

[667.118, 611.751, 599.456, 621.446, 1418.234]
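
The relationship between the datetime index and the Unix timestamps can be reproduced with plain pandas. The sketch below is only an illustration of the conversion, not Merlion's implementation; `series`, `epoch` and `unix_seconds` are names chosen for this example.

```python
import pandas as pd

# A small time-indexed series with the first two values of the example data.
series = pd.Series(
    [667.118, 611.751],
    index=pd.date_range("2020-03-02 09:12:00", periods=2, freq="1min"),
    name="kpi",
)

# Seconds elapsed since the Unix epoch for each index entry.
epoch = pd.Timestamp("1970-01-01")
unix_seconds = ((series.index - epoch) / pd.Timedelta(seconds=1)).tolist()

# The values as a plain Python list.
values = series.tolist()

print(unix_seconds)
print(values)
```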

You can index into a UnivariateTimeSeries to obtain a tuple of (timestamp, value):



In [12]:
print(f"kpi[0] = {kpi[0]}")

kpi[0] = (1583140320.0, 667.118)


If you use a slice index instead, you get a new UnivariateTimeSeries:


In [13]:
print(f"type(kpi[1:5]) = {type(kpi[1:5])}\n")
print(f"kpi[1:5] = \n\n{kpi[1:5]}")

type(kpi[1:5]) = <class 'merlion.utils.time_series.UnivariateTimeSeries'>

kpi[1:5] = 

2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
Name: kpi, dtype: float64


In [14]:
kpi[1:5]

2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
Name: kpi, dtype: float64

Iterating over a UnivariateTimeSeries yields tuples of (timestamp, value):

In [15]:
for t, x in kpi[:5]:
    print((t, x))

(1583140320.0, 667.118)
(1583140380.0, 611.751)
(1583140440.0, 599.456)
(1583140500.0, 621.446)
(1583140560.0, 1418.234)


## TimeSeries: Merlion’s Standard Data Class

Because Merlion is a general-purpose library that handles both univariate and multivariate time series, our standard data class is TimeSeries. This class acts as a wrapper around a collection of UnivariateTimeSeries. We chose this format over a vector-based approach because it is much more robust to missing values and to different univariates being sampled at different rates.
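
To make the robustness argument concrete, here is a small pandas sketch with two hypothetical variables sampled at different rates. Forcing them into a single table introduces artificial NaN entries, whereas a collection of separate series keeps only the samples that were actually observed. The names `fast`, `slow`, `table` and `collection` are assumptions of this example.

```python
import pandas as pd

# Hypothetical variables: one sampled every minute, one every two minutes.
fast = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2020-03-02 09:12", periods=4, freq="1min"),
    name="fast",
)
slow = pd.Series(
    [10.0, 20.0],
    index=pd.date_range("2020-03-02 09:12", periods=2, freq="2min"),
    name="slow",
)

# One shared table: NaN appears wherever "slow" has no sample.
table = pd.concat([fast, slow], axis=1)
print(table)

# A collection of separate series: every entry is a real observation.
collection = {"fast": fast, "slow": slow}
print({name: len(s) for name, s in collection.items()})
```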

The most transparent way to initialize a TimeSeries is with its constructor, which takes a collection (a list or an (ordered) dictionary) of UnivariateTimeSeries as its only argument:


In [16]:
from collections import OrderedDict
from merlion.utils import TimeSeries

time_series_list = TimeSeries(univariates=[kpi.copy(), kpi_label.copy()])
time_series_dict = TimeSeries(
    univariates=OrderedDict([("kpi_renamed", kpi.copy()),
                             ("kpi_label", kpi_label.copy())]))

In [17]:
time_series_list

                          kpi  kpi_label
2020-03-02 09:12:00   667.118        0.0
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
...                       ...        ...
2020-05-01 23:46:00   874.214        0.0
2020-05-01 23:47:00   937.929        0.0
2020-05-01 23:48:00  1031.279        0.0
2020-05-01 23:49:00  1099.698        0.0
2020-05-01 23:50:00   935.405        0.0

[86807 rows x 2 columns]

In [18]:
time_series_dict

                     kpi_renamed  kpi_label
2020-03-02 09:12:00      667.118        0.0
2020-03-02 09:13:00      611.751        0.0
2020-03-02 09:14:00      599.456        0.0
2020-03-02 09:15:00      621.446        0.0
2020-03-02 09:16:00     1418.234        0.0
...                          ...        ...
2020-05-01 23:46:00      874.214        0.0
2020-05-01 23:47:00      937.929        0.0
2020-05-01 23:48:00     1031.279        0.0
2020-05-01 23:49:00     1099.698        0.0
2020-05-01 23:50:00      935.405        0.0

[86807 rows x 2 columns]

In [19]:
type(time_series_dict),type(time_series_list)

(merlion.utils.time_series.TimeSeries, merlion.utils.time_series.TimeSeries)

Alternatively, **you can initialize a TimeSeries from a pd.DataFrame and convert a TimeSeries back to a pd.DataFrame as follows:**

In [20]:
time_series = TimeSeries.from_pd(time_idx_df)
print(f"type(TimeSeries.from_pd(time_idx_df)) = {type(time_series)}\n")

recovered_time_idx_df = time_series.to_pd()
print("(recovered_time_idx_df == time_idx_df).all()")
print((recovered_time_idx_df == time_idx_df).all())

type(TimeSeries.from_pd(time_idx_df)) = <class 'merlion.utils.time_series.TimeSeries'>

(recovered_time_idx_df == time_idx_df).all()
kpi          True
kpi_label    True
dtype: bool


We can access the names of the individual univariates with time_series.names, access a specific univariate via time_series.univariates[name], and iterate over the univariates with for univariate in time_series.univariates. Concretely:

In [21]:
# When we use a list of univariates, we retain the names of the univariates
# where possible. If a univariate is unnamed, we set its name to its integer
# index in the list of all univariates given. Here, kpi_label was created
# without an explicit name, yet the output below shows "kpi_label" (the name
# of the df.kpi_label column it was built from) rather than 1.
print(time_series_list.names)

['kpi', 'kpi_label']


In [22]:
# If we pass a dictionary instead of a list, all univariates will have
# their specified names. The order is retained from the OrderedDict.
print(time_series_dict.names)

['kpi_renamed', 'kpi_label']


In [23]:
# We can access the KPI like so:
kpi1 = time_series_list.univariates["kpi"]
kpi2 = time_series_dict.univariates["kpi_renamed"]

# kpi1 and kpi2 are the same univariate, just with different names.
# assert raises an AssertionError if the condition is false.
assert (kpi1 == kpi2).all()

In [24]:
# We can iterate over all univariates like so:
for univariate in time_series_dict.univariates:
    print(univariate)
    print()

2020-03-02 09:12:00     667.118
2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
                         ...   
2020-05-01 23:46:00     874.214
2020-05-01 23:47:00     937.929
2020-05-01 23:48:00    1031.279
2020-05-01 23:49:00    1099.698
2020-05-01 23:50:00     935.405
Name: kpi_renamed, Length: 86807, dtype: float64

2020-03-02 09:12:00    0.0
2020-03-02 09:13:00    0.0
2020-03-02 09:14:00    0.0
2020-03-02 09:15:00    0.0
2020-03-02 09:16:00    0.0
                      ... 
2020-05-01 23:46:00    0.0
2020-05-01 23:47:00    0.0
2020-05-01 23:48:00    0.0
2020-05-01 23:49:00    0.0
2020-05-01 23:50:00    0.0
Name: kpi_label, Length: 86807, dtype: float64



In [25]:
# We can also iterate over all univariates & names like so:
for name, univariate in time_series_dict.items():
    print(f"- Univariate {name.upper()}")
    print(univariate)
    print()

- Univariate KPI_RENAMED
2020-03-02 09:12:00     667.118
2020-03-02 09:13:00     611.751
2020-03-02 09:14:00     599.456
2020-03-02 09:15:00     621.446
2020-03-02 09:16:00    1418.234
                         ...   
2020-05-01 23:46:00     874.214
2020-05-01 23:47:00     937.929
2020-05-01 23:48:00    1031.279
2020-05-01 23:49:00    1099.698
2020-05-01 23:50:00     935.405
Name: kpi_renamed, Length: 86807, dtype: float64

- Univariate KPI_LABEL
2020-03-02 09:12:00    0.0
2020-03-02 09:13:00    0.0
2020-03-02 09:14:00    0.0
2020-03-02 09:15:00    0.0
2020-03-02 09:16:00    0.0
                      ... 
2020-05-01 23:46:00    0.0
2020-05-01 23:47:00    0.0
2020-05-01 23:48:00    0.0
2020-05-01 23:49:00    0.0
2020-05-01 23:50:00    0.0
Name: kpi_label, Length: 86807, dtype: float64



## Time Series Indexing & Alignment



An important concept of TimeSeries in Merlion is alignment.  
We call a time series aligned if all of its univariates are sampled at the same timestamps.  
Below we illustrate examples of time series that are and are not aligned:
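
The definition can be stated as a one-line check over the indexes of the individual series. The helper below is a pandas sketch of that definition under the assumption that each univariate is a time-indexed pd.Series; `is_aligned` here is a standalone function written for illustration, not Merlion's property of the same name.

```python
import pandas as pd

def is_aligned(univariates):
    """Return True if every series in the dict shares exactly the same index."""
    series = list(univariates.values())
    first = series[0].index
    return all(first.equals(s.index) for s in series[1:])

idx = pd.date_range("2020-03-02 09:12", periods=4, freq="1min")
a = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
b = pd.Series([0.0, 0.0, 0.0, 0.0], index=idx)

# Same timestamps everywhere: aligned.
print(is_aligned({"a": a, "b": b}))

# One series drops its first sample, the other its last: not aligned.
print(is_aligned({"a": a.iloc[1:], "b": b.iloc[:-1]}))
```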

In [26]:
aligned = TimeSeries({"kpi": kpi.copy(), "kpi_label": kpi_label.copy()})
print(f"Is aligned? {aligned.is_aligned}")

Is aligned? True


In [27]:
not_aligned = TimeSeries({"kpi": kpi[1:],                # 2020-03-02 09:13:00 to 2020-05-01 23:50:00
                          "kpi_label": kpi_label[:-1]})  # 2020-03-02 09:12:00 to 2020-05-01 23:49:00
print(f"Is aligned? {not_aligned.is_aligned}")

Is aligned? False


In [28]:
not_aligned

                          kpi  kpi_label
2020-03-02 09:12:00       NaN        0.0
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
...                       ...        ...
2020-05-01 23:46:00   874.214        0.0
2020-05-01 23:47:00   937.929        0.0
2020-05-01 23:48:00  1031.279        0.0
2020-05-01 23:49:00  1099.698        0.0
2020-05-01 23:50:00   935.405        NaN

[86807 rows x 2 columns]

If your time series is aligned, you can use an integer index to obtain a tuple (timestamp, (value_1, ..., value_k)), or a slice index to obtain a sub-time series:

In [29]:
aligned[0]


(1583140320.0, (667.118, 0.0))

In [30]:
print(f"type(aligned[1:5]) = {type(aligned[1:5])}\n")
print(f"aligned[1:5] = \n{aligned[1:5]}")

type(aligned[1:5]) = <class 'merlion.utils.time_series.TimeSeries'>

aligned[1:5] = 
                          kpi  kpi_label
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0


You can also iterate over an aligned time series with for timestamp, (value_1, ..., value_k) in time_series:


In [31]:
for t, (x1, x2) in aligned[:5]:
    print((t, (x1, x2)))

(1583140320.0, (667.118, 0.0))
(1583140380.0, (611.751, 0.0))
(1583140440.0, (599.456, 0.0))
(1583140500.0, (621.446, 0.0))
(1583140560.0, (1418.234, 0.0))


**Note!**
> Merlion will raise an error if you attempt any of these operations on a time series that is not aligned. For example,


In [32]:
try:
    not_aligned[0]
except RuntimeError as e:
    print(f"{type(e).__name__}: {e}")

RuntimeError: The univariates comprising this time series are not aligned (they have different time stamps), but alignment is required to index into the time series.


You can still get the length/shape of a misaligned time series, but Merlion will emit a warning.



In [33]:
print(len(not_aligned))

The univariates comprising this time series are not aligned (they have different time stamps). The length returned is equal to the length of the _union_ of all time stamps present in any of the univariates.


86807


In [34]:
print(not_aligned.shape)

The univariates comprising this time series are not aligned (they have different time stamps). The length returned is equal to the length of the _union_ of all time stamps present in any of the univariates.


(2, 86807)
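
The reported length is that of the union of all timestamps, which is why it can exceed the length of every individual univariate. A tiny pandas sketch of the same situation (five timestamps, one series missing the first and the other missing the last; all names are illustrative):

```python
import pandas as pd

idx = pd.date_range("2020-03-02 09:12", periods=5, freq="1min")
kpi_part = pd.Series(range(5), index=idx).iloc[1:]     # drops the first timestamp
label_part = pd.Series(range(5), index=idx).iloc[:-1]  # drops the last timestamp

# The union contains every timestamp seen in either series.
union = kpi_part.index.union(label_part.index)
print(len(kpi_part), len(label_part), len(union))
```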


However, you can call time_series.align() to automatically resample the individual univariates of a time series so that it is aligned. By default, this takes the union of all timestamps present in any of the individual univariates, but this is customizable. In the output below, the aligned series covers only the span where the two univariates overlap (09:13 through 23:49).
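
The idea of aligning on the union of timestamps can be sketched in plain pandas. The helper below reindexes every series onto the union and fills the gaps by time-based linear interpolation. Both the function `align_on_union` and the choice of interpolation are assumptions of this sketch; Merlion's own resampling and missing-value policies may differ.

```python
import pandas as pd

def align_on_union(univariates):
    """Reindex every series in the dict onto the union of all timestamps.

    Gaps are filled by time-based linear interpolation (an assumption of
    this sketch, not a statement about Merlion's behaviour).
    """
    union = None
    for s in univariates.values():
        union = s.index if union is None else union.union(s.index)
    return {
        name: s.reindex(union).interpolate(method="time")
        for name, s in univariates.items()
    }

# "a" is sampled every two minutes, "b" every minute.
a = pd.Series([1.0, 3.0],
              index=pd.date_range("2020-03-02 09:12", periods=2, freq="2min"))
b = pd.Series([10.0, 20.0, 30.0],
              index=pd.date_range("2020-03-02 09:12", periods=3, freq="1min"))

resampled = align_on_union({"a": a, "b": b})
print(resampled["a"])
```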

In [35]:
print(f"Is not_aligned.align() aligned? {not_aligned.align().is_aligned}")

Is not_aligned.align() aligned? True


In [36]:
print(not_aligned)
print(not_aligned.align())

                          kpi  kpi_label
2020-03-02 09:12:00       NaN        0.0
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
...                       ...        ...
2020-05-01 23:46:00   874.214        0.0
2020-05-01 23:47:00   937.929        0.0
2020-05-01 23:48:00  1031.279        0.0
2020-05-01 23:49:00  1099.698        0.0
2020-05-01 23:50:00   935.405        NaN

[86807 rows x 2 columns]
                          kpi  kpi_label
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
2020-03-02 09:17:00  1015.559        0.0
...                       ...        ...
2020-05-01 23:45:00   981.007        0.0
2020-05-01 23:46:00   874.214        0.0
2020-05-01 23:47:00   937.929        0.0
2020-05-01 23:48:00  1031.279        0.0
2020-05-01 23:49:00  1099.698        0.0

## TimeSeries: Some Useful Features

We provide much more information about the merlion.utils.time_series.TimeSeries class in the API docs, but here we highlight two more useful features. These work regardless of whether a time series is aligned!

You can **obtain the subset of a time series between times t0 and tf by calling time_series.window(t0, tf)**.  
t0 and tf can be any reasonable datetime format, or a Unix timestamp.
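
The outputs in this notebook end one sample before tf, which suggests that the window includes t0 and excludes tf. Under that assumption, the selection can be sketched in pandas as a boolean mask. The function below is an illustrative stand-in, not Merlion's implementation, and it accepts only datetime-like bounds (a Unix timestamp would first need a conversion such as pd.to_datetime(t, unit="s")).

```python
import pandas as pd

def window(series, t0, tf):
    """Return the samples with t0 <= timestamp < tf."""
    t0, tf = pd.Timestamp(t0), pd.Timestamp(tf)
    mask = (series.index >= t0) & (series.index < tf)
    return series[mask]

# Samples at 23:58, 23:59, 00:00 and 00:01.
idx = pd.date_range("2020-03-31 23:58", periods=4, freq="1min")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Bounds may be strings or pd.Timestamp objects.
sub = window(s, "2020-03-31 23:59",
             pd.Timestamp(year=2020, month=4, day=1, hour=0, minute=1))
print(sub)
```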


In [37]:
aligned

                          kpi  kpi_label
2020-03-02 09:12:00   667.118        0.0
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
...                       ...        ...
2020-05-01 23:46:00   874.214        0.0
2020-05-01 23:47:00   937.929        0.0
2020-05-01 23:48:00  1031.279        0.0
2020-05-01 23:49:00  1099.698        0.0
2020-05-01 23:50:00   935.405        0.0

[86807 rows x 2 columns]

In [38]:
# Select the window from 2020-03-05 12:00:00 up to (but not including) 2020-04-01
aligned.window("2020-03-05 12:00:00", pd.Timestamp(year=2020, month=4, day=1))

                          kpi  kpi_label
2020-03-05 12:00:00  1166.819        0.0
2020-03-05 12:01:00  1345.504        0.0
2020-03-05 12:02:00  1061.391        0.0
2020-03-05 12:03:00  1260.874        0.0
2020-03-05 12:04:00  1202.009        0.0
...                       ...        ...
2020-03-31 23:55:00  1154.397        0.0
2020-03-31 23:56:00  1270.292        0.0
2020-03-31 23:57:00  1160.761        0.0
2020-03-31 23:58:00  1082.076        0.0
2020-03-31 23:59:00  1167.297        0.0

[38160 rows x 2 columns]

In [39]:
# Note that the first value of the KPI (which is missing in not_aligned) is NaN
not_aligned.window(1583140320, 1583226720)

                          kpi  kpi_label
2020-03-02 09:12:00       NaN        0.0
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
...                       ...        ...
2020-03-03 09:07:00  1132.564        0.0
2020-03-03 09:08:00  1087.037        0.0
2020-03-03 09:09:00   984.432        0.0
2020-03-03 09:10:00  1085.008        0.0
2020-03-03 09:11:00  1020.937        0.0

[1440 rows x 2 columns]

You can also split a time series into a left and a right part at any timestamp.
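
In the output shown in this notebook, the split point itself (2020-05-01 00:00:00) lands in the right part. A pandas sketch of a split with that convention follows; the function is written for illustration and is not Merlion's implementation.

```python
import pandas as pd

def bisect(series, t):
    """Split into samples strictly before t and samples at or after t."""
    t = pd.Timestamp(t)
    return series[series.index < t], series[series.index >= t]

# Samples at 23:58, 23:59, 00:00 and 00:01 around the split point.
idx = pd.date_range("2020-04-30 23:58", periods=4, freq="1min")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

left, right = bisect(s, "2020-05-01")
print(left)
print(right)
```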


In [40]:
left, right = aligned.bisect("2020-05-01")

print(f"Left\n{left}\n")
print()
print(f"Right\n{right}\n")

Left
                          kpi  kpi_label
2020-03-02 09:12:00   667.118        0.0
2020-03-02 09:13:00   611.751        0.0
2020-03-02 09:14:00   599.456        0.0
2020-03-02 09:15:00   621.446        0.0
2020-03-02 09:16:00  1418.234        0.0
...                       ...        ...
2020-04-30 23:55:00  1296.091        0.0
2020-04-30 23:56:00  1323.743        0.0
2020-04-30 23:57:00  1203.672        0.0
2020-04-30 23:58:00  1278.720        0.0
2020-04-30 23:59:00  1217.877        0.0

[85376 rows x 2 columns]


Right
                          kpi  kpi_label
2020-05-01 00:00:00  1381.110        0.0
2020-05-01 00:01:00  1807.039        0.0
2020-05-01 00:02:00  1833.385        0.0
2020-05-01 00:03:00  1674.412        0.0
2020-05-01 00:04:00  1683.194        0.0
...                       ...        ...
2020-05-01 23:46:00   874.214        0.0
2020-05-01 23:47:00   937.929        0.0
2020-05-01 23:48:00  1031.279        0.0
2020-05-01 23:49:00  1099.698        0.0
2020-05-01 23:50:0