# NOTEBOOK: COPY OF THE NIXTLA hierarchicalforecast EXAMPLE
- Run of Nixtla's base example notebook: replicate the example code and add extra analysis to understand the existing code and models

- EXAMPLE: THE FULL TOURISM DATASET

- SOURCES:
    - geographical aggregation: https://nixtlaverse.nixtla.io/hierarchicalforecast/examples/australiandomestictourism.html

    - geographical and temporal aggregation: https://nixtlaverse.nixtla.io/hierarchicalforecast/examples/australiandomestictourismcrosstemporal.html
 
    - base research paper (the Nixtla code replicates the results obtained in the paper): https://robjhyndman.com/seminars/fr_overview.html

# POINT RECONCILIATION: Geographical Aggregation (Tourism)

### 0. Install nixtla package

In [1]:
# pip install hierarchicalforecast
# pip install datasetsforecast
# pip install statsforecast

In [2]:
import hierarchicalforecast
from datasetsforecast.hierarchical import HierarchicalData
import pandas as pd

## Run the Nixtla example code
https://nixtlaverse.nixtla.io/hierarchicalforecast/examples/australiandomestictourism.html

In [3]:
import numpy as np
import pandas as pd

### 1. Read data - download Hyndman's example data

#### 1.1 Read raw data

In [11]:
data_raw = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/tourism.csv')
data_raw = data_raw.rename({'Trips': 'y', 'Quarter': 'ds'}, axis=1)
data_raw.insert(0, 'Country', 'Australia')
data_raw = data_raw[['Country', 'Region', 'State', 'Purpose', 'ds', 'y']]
data_raw['ds'] = data_raw['ds'].str.replace(r'(\d+) (Q\d)', r'\1-\2', regex=True)
data_raw['ds'] = pd.PeriodIndex(data_raw["ds"], freq='Q').to_timestamp()
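
The two date-handling lines above can be checked on a handful of values. This is a minimal sketch, assuming the raw `Quarter` column holds strings such as `1998 Q1`; the sample values below are made up for illustration and are not read from the CSV.

```python
import pandas as pd

# Hypothetical sample of raw quarter labels (assumed format: "YYYY Qn").
quarters = pd.Series(["1998 Q1", "1998 Q2", "2017 Q4"])

# Step 1: insert a hyphen so the label becomes "YYYY-Qn".
normalized = quarters.str.replace(r"(\d+) (Q\d)", r"\1-\2", regex=True)

# Step 2: parse as quarterly periods, then take the first day of each quarter.
timestamps = pd.PeriodIndex(normalized, freq="Q").to_timestamp()

print(list(normalized))
print(list(timestamps.strftime("%Y-%m-%d")))
```

Each quarter is mapped to the first day of that quarter, which matches the `ds` values shown in the `data_raw.head()` output.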

In [12]:
# print data - note the difference from the small example, where the S matrix used for forecast reconciliation is loaded directly
data_raw.head()

Unnamed: 0,Country,Region,State,Purpose,ds,y
0,Australia,Adelaide,South Australia,Business,1998-01-01,135.07769
1,Australia,Adelaide,South Australia,Business,1998-04-01,109.987316
2,Australia,Adelaide,South Australia,Business,1998-07-01,166.034687
3,Australia,Adelaide,South Australia,Business,1998-10-01,127.160464
4,Australia,Adelaide,South Australia,Business,1999-01-01,137.448533


#### 1.2 Load the aggregation spec - how the series are grouped at the different levels

In [13]:
spec = [
    ['Country'],
    ['Country', 'State'], 
    ['Country', 'Purpose'], 
    ['Country', 'State', 'Region'], 
    ['Country', 'State', 'Purpose'], 
    ['Country', 'State', 'Region', 'Purpose']
]

#### 1.3 Generate the full set of series and the matrices used to reconcile the results (S, tags)
**Nixtla provides a function to generate the aggregation**

In [14]:
from hierarchicalforecast.utils import aggregate

In [15]:
Y_df, S_df, tags = aggregate(data_raw, spec)
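
To make the role of the summing matrix S concrete, here is a minimal hand-built sketch on a toy hierarchy. It does not call `aggregate`; the ids, values, and the prefix-matching rule are assumptions made for illustration, and the prefix rule only works for a strictly nested hierarchy (it would not cover a grouped level such as `Country/Purpose`).

```python
import numpy as np

# Hypothetical toy hierarchy: total -> 2 states -> 3 bottom-level series.
bottom_ids = ["AU/NSW/Sydney", "AU/NSW/Hunter", "AU/VIC/Melbourne"]
all_ids = ["AU", "AU/NSW", "AU/VIC"] + bottom_ids

# S[i, j] = 1 when bottom series j contributes to series i.
# Here a node contains a bottom series if the ids are equal or the
# bottom id starts with the node id followed by "/".
S = np.array(
    [[1.0 if b == a or b.startswith(a + "/") else 0.0 for b in bottom_ids]
     for a in all_ids]
)

# Made-up bottom-level values for a single time step.
y_bottom = np.array([10.0, 5.0, 7.0])

# Every series in the hierarchy is a sum of bottom-level series.
y_all = S @ y_bottom
for uid, value in zip(all_ids, y_all):
    print(uid, value)
```

The last three rows of this toy S form an identity block, so the bottom-level series map to themselves, while the upper rows add them up.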

In [26]:
# print the head of the different dataframes generated by aggregating the series

In [21]:
Y_df.head(3)

Unnamed: 0,unique_id,ds,y
0,Australia,1998-01-01,23182.197269
1,Australia,1998-04-01,20323.380067
2,Australia,1998-07-01,19826.640511


In [22]:
S_df.iloc[:5, :5]

Unnamed: 0,unique_id,Australia/ACT/Canberra/Business,Australia/ACT/Canberra/Holiday,Australia/ACT/Canberra/Other,Australia/ACT/Canberra/Visiting
0,Australia,1.0,1.0,1.0,1.0
1,Australia/ACT,1.0,1.0,1.0,1.0
2,Australia/New South Wales,0.0,0.0,0.0,0.0
3,Australia/Northern Territory,0.0,0.0,0.0,0.0
4,Australia/Queensland,0.0,0.0,0.0,0.0


In [23]:
tags['Country/Purpose']

array(['Australia/Business', 'Australia/Holiday', 'Australia/Other',
       'Australia/Visiting'], dtype=object)

In [18]:
# shape of the raw data
data_raw.shape

(24320, 6)

In [19]:
# shape of Y_df - more series are generated (the aggregated levels are added)
Y_df.shape

(34000, 3)

In [20]:
# shape of the summing matrix S (425 series as rows; 304 bottom-level columns plus the unique_id column)
S_df.shape

(425, 305)

In [25]:
tags.keys()

dict_keys(['Country', 'Country/State', 'Country/Purpose', 'Country/State/Region', 'Country/State/Purpose', 'Country/State/Region/Purpose'])
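
The shapes printed above can be cross-checked with simple arithmetic. This is a sketch that uses the unique-value counts reported in section 2 of this notebook (8 states, 76 regions, 4 purposes, 304 bottom-level series, 80 quarters); the `Country/State/Purpose` count assumes every state has all four purposes.

```python
# Number of series per level of the spec.
level_sizes = {
    "Country": 1,
    "Country/State": 8,
    "Country/Purpose": 4,
    "Country/State/Region": 76,
    "Country/State/Purpose": 8 * 4,  # assumption: all purposes occur in every state
    "Country/State/Region/Purpose": 304,
}

n_series = sum(level_sizes.values())  # total series in the hierarchy
n_quarters = 80                       # observations per series

print("series in hierarchy:", n_series)
print("rows expected in Y_df:", n_series * n_quarters)
print("columns expected in S_df:", level_sizes["Country/State/Region/Purpose"] + 1)
```

The totals agree with `Y_df.shape` and `S_df.shape`, where the extra column in `S_df` is `unique_id`.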

### 2. Data exploration (own analysis, not part of the Nixtla example)

#### 2.1 Exploring data_raw

In [135]:
# build the unique_id column
data_raw['unique_id'] = data_raw['Country'] + '/' + data_raw['State'] + '/' + data_raw['Region'] + '/' + data_raw['Purpose']
data_raw.head(3)

Unnamed: 0,Country,Region,State,Purpose,ds,y,unique_id
0,Australia,Adelaide,South Australia,Business,1998-01-01,135.07769,Australia/South Australia/Adelaide/Business
1,Australia,Adelaide,South Australia,Business,1998-04-01,109.987316,Australia/South Australia/Adelaide/Business
2,Australia,Adelaide,South Australia,Business,1998-07-01,166.034687,Australia/South Australia/Adelaide/Business


In [136]:
# print the size of the raw data
data_raw.shape

(24320, 7)

In [137]:
# print some example values from each column
unique_values_country = data_raw['Country'].unique().tolist()
unique_values_state = data_raw['State'].unique().tolist()
unique_values_region = data_raw['Region'].unique().tolist()
unique_values_purpose = data_raw['Purpose'].unique().tolist()
print('\nunique_values_country: ', unique_values_country)
print('\nunique_values_state: ', unique_values_state)
print('\nunique_values_region (example 10): ', unique_values_region[0:10])
print('\nunique_values_purpose: ', unique_values_purpose)


unique_values_country:  ['Australia']

unique_values_state:  ['South Australia', 'Northern Territory', 'Western Australia', 'Victoria', 'New South Wales', 'Queensland', 'ACT', 'Tasmania']

unique_values_region (example 10):  ['Adelaide', 'Adelaide Hills', 'Alice Springs', "Australia's Coral Coast", "Australia's Golden Outback", "Australia's North West", "Australia's South West", 'Ballarat', 'Barkly', 'Barossa']

unique_values_purpose:  ['Business', 'Holiday', 'Other', 'Visiting']


In [138]:
# number of unique values in each column
print('shape_unique_values_country: ', len(unique_values_country))
print('shape_unique_values_state: ', len(unique_values_state))
print('shape_unique_values_region: ', len(unique_values_region))
print('shape_unique_values_purpose: ', len(unique_values_purpose))

shape_unique_values_country:  1
shape_unique_values_state:  8
shape_unique_values_region:  76
shape_unique_values_purpose:  4


In [139]:
# number of unique series in the data
data_raw['unique_id'].unique().shape

(304,)

In [140]:
# not every combination of column values exists as a series (2432 possible combinations vs 304 actual series)
len(unique_values_country) * len(unique_values_region) * len(unique_values_state) * len(unique_values_purpose)

2432
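
A likely reason for the gap between 2432 and 304 is that regions are nested inside states: each region occurs with only one state, so most State x Region pairs never appear, and 76 regions x 4 purposes gives exactly 304. The sketch below illustrates the idea with made-up rows; it is not computed from `data_raw`.

```python
from itertools import product

# Hypothetical (state, region, purpose) rows; each region belongs to one state.
rows = [
    ("South Australia", "Adelaide", "Business"),
    ("South Australia", "Adelaide", "Holiday"),
    ("Victoria", "Ballarat", "Business"),
    ("Victoria", "Ballarat", "Holiday"),
]

states = sorted({r[0] for r in rows})
regions = sorted({r[1] for r in rows})
purposes = sorted({r[2] for r in rows})

# The Cartesian product overcounts because it pairs regions with the wrong states.
cartesian = set(product(states, regions, purposes))
observed = set(rows)

print("cartesian combinations:", len(cartesian))
print("observed combinations:", len(observed))
print("regions x purposes:", len(regions) * len(purposes))
```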

In [141]:
# print the start and end dates of the data (NOTE: ALL SERIES HAVE THE SAME NUMBER OF OBSERVATIONS and the same start and end dates)
example_unique_id = data_raw['unique_id'][0]
fecha_inicio_raw = data_raw[data_raw['unique_id'] == example_unique_id]['ds'].min()
fecha_fin_raw = data_raw[data_raw['unique_id'] == example_unique_id]['ds'].max()
shape_unique_serie = data_raw[data_raw['unique_id'] == example_unique_id].shape

print('data start date: ', fecha_inicio_raw)
print('data end date: ', fecha_fin_raw)
print('shape of a single series: ', shape_unique_serie)

data start date:  1998-01-01 00:00:00
data end date:  2017-10-01 00:00:00
shape of a single series:  (80, 7)


In [142]:
# number of series times series length equals the dataframe row count (consistent with every series having 80 observations)
(80 * 304) == data_raw.shape[0]

True
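
The product check above only confirms the total row count, which could also match if some series were shorter and others longer. A stricter per-series check is sketched below on a toy dataframe; the column names follow the notebook, but the data is made up.

```python
import pandas as pd

# Hypothetical toy data: two series with four quarterly observations each.
toy = pd.DataFrame({
    "unique_id": ["a"] * 4 + ["b"] * 4,
    "ds": list(pd.date_range("1998-01-01", periods=4, freq="QS")) * 2,
    "y": range(8),
})

# Start date, end date and number of observations for every series.
summary = toy.groupby("unique_id")["ds"].agg(["min", "max", "count"])

# All series share the same span when each column has a single distinct value.
same_span = bool((summary.nunique() == 1).all())

print(summary)
print("all series share start, end and length:", same_span)
```

The same `groupby` could be applied to `data_raw` to confirm the claim for all 304 series rather than for a single example id.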

#### 2.2 Exploring the aggregated data (data after applying the aggregation function)

In [143]:
# print the size of the data
Y_df.shape

(34000, 3)

In [144]:
# number of unique series in the data
Y_df['unique_id'].unique().shape

(425,)

#### 2.3 Check which series are in the raw data and which are not

In [145]:
# number of series in each dataset
list_unique_id_data_raw = data_raw['unique_id'].unique().tolist()
list_unique_id_data_ydf = Y_df['unique_id'].unique().tolist()
print('list_unique_id_data_raw: ', len(list_unique_id_data_raw))
print('list_unique_id_data_ydf: ',len(list_unique_id_data_ydf))

list_unique_id_data_raw:  304
list_unique_id_data_ydf:  425


In [146]:
# which series are in each dataset

In [156]:
# number of series that are in the raw data but not in the aggregated data (ALL RAW SERIES SHOULD BE PRESENT, so 0 is expected)
len(list(set(list_unique_id_data_raw) - set(list_unique_id_data_ydf)))

0

In [152]:
# number of series that are in the aggregated data but NOT in the raw data (THESE SHOULD BE ALL THE AGGREGATED SERIES)
len(list(set(list_unique_id_data_ydf) - set(list_unique_id_data_raw)))

121

In [154]:
# check that the aggregated series count minus the raw series count gives the same difference
len(list_unique_id_data_ydf) - len(list_unique_id_data_raw)

121

**---> SO THE RAW DATA IS THE MOST DISAGGREGATED LEVEL: THE SERIES AT THE MOST INDIVIDUAL (BOTTOM) LEVEL**
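
The 121 extra series should be exactly the upper levels of the hierarchy (1 + 8 + 4 + 76 + 32 = 121). A way to see which level each extra series belongs to is sketched below with a toy `tags`-like dictionary; the ids are hypothetical and the real `tags` values are arrays rather than lists.

```python
# Hypothetical toy mapping: level name -> ids of the series at that level.
toy_tags = {
    "Country": ["AU"],
    "Country/State": ["AU/NSW", "AU/VIC"],
    "Country/State/Region": ["AU/NSW/Sydney", "AU/VIC/Melbourne", "AU/VIC/Ballarat"],
}

# The bottom level plays the role of the raw data.
raw_ids = set(toy_tags["Country/State/Region"])
all_ids = {uid for ids in toy_tags.values() for uid in ids}

# Series that only exist after aggregation.
only_aggregated = all_ids - raw_ids

# Count how many of those series fall into each level.
per_level = {level: len(only_aggregated & set(ids)) for level, ids in toy_tags.items()}

print(sorted(only_aggregated))
print(per_level)
```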

### 3. Split Train/Test sets (continuation of the Nixtla base code)
We use the final two years (8 quarters) as the test set.

In [159]:
Y_test_df = Y_df.groupby('unique_id', as_index=False).tail(8)
Y_train_df = Y_df.drop(Y_test_df.index)

In [160]:
print('Y_df: ', Y_df.shape)
print('Y_train_df: ', Y_train_df.shape)
print('Y_test_df: ', Y_test_df.shape)

Y_df:  (34000, 3)
Y_train_df:  (30600, 3)
Y_test_df:  (3400, 3)
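
The split pattern (`groupby(...).tail(n)` followed by `drop` on the returned index) can be verified on a toy frame. This is a sketch with made-up data; it assumes rows are sorted by `ds` within each series and that the dataframe index is unique, which the pattern relies on.

```python
import pandas as pd

# Hypothetical toy data: two series with ten quarterly observations each.
toy = pd.DataFrame({
    "unique_id": ["a"] * 10 + ["b"] * 10,
    "ds": list(pd.date_range("2015-01-01", periods=10, freq="QS")) * 2,
    "y": range(20),
})

# Hold out the last 3 observations of every series.
test = toy.groupby("unique_id", as_index=False).tail(3)

# Everything that is not in the test set is training data.
train = toy.drop(test.index)

print(train.shape, test.shape)

# No leakage: the latest training date precedes the earliest test date.
no_leakage = bool(train["ds"].max() < test["ds"].min())
print("train ends before test starts:", no_leakage)
```

In the notebook the same logic gives 425 x 8 = 3400 test rows and 425 x 72 = 30600 training rows, matching the printed shapes.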


### 4. Computing base forecasts

In [162]:
from statsforecast.models import AutoETS
from statsforecast.core import StatsForecast

In [None]:
fcst = StatsForecast(models=[AutoETS(season_length=4, model='ZZA')], 
                     freq='QS', n_jobs=-1)
Y_hat_df = fcst.forecast(df=Y_train_df, h=8, fitted=True)
Y_fitted_df = fcst.forecast_fitted_values()
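
To make the roles of `season_length=4` and `h=8` concrete, the sketch below produces base forecasts for one series with a seasonal naive rule. This is a deliberately simple stand-in written with NumPy, not AutoETS and not part of statsforecast; the function name and the sample values are hypothetical.

```python
import numpy as np

def seasonal_naive(y, season_length, h):
    """Repeat the last observed season to produce h forecasts (stand-in baseline, not AutoETS)."""
    last_season = np.asarray(y, dtype=float)[-season_length:]
    reps = int(np.ceil(h / season_length))
    return np.tile(last_season, reps)[:h]

# Made-up quarterly history covering two years.
y_train = np.array([10, 20, 30, 40, 11, 21, 31, 41], dtype=float)

# Forecast 8 quarters ahead using a seasonal period of 4 quarters.
y_hat = seasonal_naive(y_train, season_length=4, h=8)
print(y_hat)
```

With quarterly data, `season_length=4` is one year and `h=8` covers the two held-out years, so each of the 425 series receives 8 base forecasts.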