# Tabular modeling with AutoGluon

In [1]:
!sudo apt-get update
!sudo apt-get install gcsfuse

Get:1 https://nvidia.github.io/libnvidia-container/stable/deb/amd64  InRelease [1477 B]
Hit:2 https://deb.debian.org/debian bullseye InRelease                         
Hit:3 https://download.docker.com/linux/debian bullseye InRelease   
Hit:4 https://deb.debian.org/debian-security bullseye-security InRelease
Hit:5 https://deb.debian.org/debian bullseye-updates InRelease
Hit:6 https://deb.debian.org/debian bullseye-backports InRelease
Hit:7 https://packages.cloud.google.com/apt gcsfuse-bullseye InRelease
Hit:8 https://packages.cloud.google.com/apt google-compute-engine-bullseye-stable InRelease
Hit:9 https://packages.cloud.google.com/apt cloud-sdk-bullseye InRelease
Hit:10 https://packages.cloud.google.com/apt google-fast-socket InRelease
Fetched 1477 B in 1s (1782 B/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
gcsfuse is already the newest version (3.1.0).
0 upgraded, 0 newly installed, 0 to remove and

In [2]:
#!pip install autogluon.timeseries

# Load libraries

In [3]:
import pandas as pd
from autogluon.timeseries import TimeSeriesPredictor
import numpy as np

In [4]:
!mkdir -p /home/jupyter/franco_maestria/gcs_model_dir_fullpower_serie

In [5]:
#!fusermount -u /home/jupyter/franco_maestria/gcs_model_dir_fullpower_serie

In [6]:
!gcsfuse forecasting_customer_product /home/jupyter/franco_maestria/gcs_model_dir_fullpower_serie

{"timestamp":{"seconds":1752405925,"nanos":203624749},"severity":"INFO","message":"Start gcsfuse/3.1.0 (Go version go1.24.0) for app \"\" using mount point: /home/jupyter/franco_maestria/gcs_model_dir_fullpower_serie\n"}
{"timestamp":{"seconds":1752405925,"nanos":203667874},"severity":"INFO","message":"GCSFuse config","config":{"AppName":"","CacheDir":"","Debug":{"ExitOnInvariantViolation":false,"Fuse":false,"Gcs":false,"LogMutex":false},"DisableAutoconfig":false,"EnableAtomicRenameObject":true,"EnableGoogleLibAuth":false,"EnableHns":true,"EnableNewReader":false,"FileCache":{"CacheFileForRangeRead":false,"DownloadChunkSizeMb":200,"EnableCrc":false,"EnableODirect":false,"EnableParallelDownloads":false,"ExperimentalExcludeRegex":"","ExperimentalParallelDownloadsDefaultOn":true,"MaxParallelDownloads":96,"MaxSizeMb":-1,"ParallelDownloadsPerFile":16,"WriteBufferSize":4194304},"FileSystem":{"DirMode":"755","DisableParallelDirops":false,"ExperimentalEnableDentryCache":false,"ExperimentalEnabl

# ✅ 1) Computing and applying the standardization

### 👉 Key idea:

- Compute the mean and standard deviation per product_id using only the training records (train_set).
- Build a scaler_dict that maps each product_id to its (mean, std).
- Normalize tn and clase on the training set only.
- At test time, apply the same scaler_dict to transform the features before predicting, then invert the transformation on the prediction.
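
The steps above can be sketched as follows. This is a minimal, plain-Python illustration rather than the notebook's actual implementation: the toy `train_rows` data and the helper names `fit_scalers`, `scale`, and `unscale` are hypothetical. Using the population standard deviation and treating a zero standard deviation as 1.0 are also assumptions made for the example. On the real panel, the same per-product statistics would come from a pandas group-by over train_set.

```python
from statistics import mean, pstdev

# Hypothetical training records: (product_id, tn). Values are made up.
train_rows = [
    (20001, 10.0), (20001, 20.0), (20001, 30.0),
    (20002, 5.0), (20002, 5.0),
]

def fit_scalers(rows):
    """Map each product_id to its (mean, std), using training rows only."""
    by_product = {}
    for product_id, tn in rows:
        by_product.setdefault(product_id, []).append(tn)
    scaler_dict = {}
    for product_id, values in by_product.items():
        std = pstdev(values)  # assumption: population standard deviation
        # Assumption: a constant series gets std=1.0 to avoid dividing by zero.
        scaler_dict[product_id] = (mean(values), std if std > 0 else 1.0)
    return scaler_dict

def scale(product_id, value, scaler_dict):
    """Standardize a value with the statistics learned on the training set."""
    mu, sigma = scaler_dict[product_id]
    return (value - mu) / sigma

def unscale(product_id, value, scaler_dict):
    """Invert the standardization, e.g. on a prediction made in scaled space."""
    mu, sigma = scaler_dict[product_id]
    return value * sigma + mu

scaler_dict = fit_scalers(train_rows)

# A test-time value is transformed with the training statistics,
# and a value in scaled space is mapped back to the original units.
z = scale(20001, 40.0, scaler_dict)
restored = unscale(20001, z, scaler_dict)
```

Because `scaler_dict` is fitted once on the training rows and only looked up afterwards, no test-set statistics leak into the transformation.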

In [8]:
# ------------------------
# 1) Load the feature-engineered parquet
# ------------------------

parquet_path = "panel_cliente_producto_fe.parquet"
df = pd.read_parquet(parquet_path)

print(f"✅ Parquet loaded. Shape: {df.shape}")

✅ Parquet loaded. Shape: (12138186, 194)


In [None]:
# ================================================
# ✅ BLOCK: Forecasting with AutoGluon TimeSeries
# ================================================

# ------------------------
# 2) Prepare the dataset for AutoGluon
# ------------------------

# Build item_id by combining customer_id and product_id
df['item_id'] = df['customer_id'].astype(str) + '_' + df['product_id'].astype(str)

# Make sure 'fecha' is parsed as datetime and exposed as 'timestamp'
df['timestamp'] = pd.to_datetime(df['fecha'])

# Keep only the columns AutoGluon needs
df_ts = df[['item_id', 'timestamp', 'tn']].copy()

print(df_ts.head())

# ------------------------
# 3) Configure the predictor
# ------------------------

predictor = TimeSeriesPredictor(
    target='tn',
    eval_metric='MAE',
    freq='M',  # monthly; stored internally as 'ME' (month end)
    prediction_length=1,  # a single step ahead, i.e. month +1 (month +2 would need prediction_length=2)
)

# ------------------------
# 4) Train the predictor
# ------------------------

# ⚙️ time_limit is an argument of .fit(), not of __init__
predictor.fit(
    train_data=df_ts,
    time_limit=18000,          # ⏰ training budget in seconds (5 hours)
    num_val_windows=2
)

# ------------------------
# 5) Prepare the prediction data
# ------------------------

# Find the last observed date and the target date two months later
last_date = df_ts['timestamp'].max()
future_date = last_date + pd.DateOffset(months=2)
print(f"Last date: {last_date.date()} → Target date: {future_date.date()}")

# No future rows are needed: AutoGluon generates the forecast timestamps itself,
# so we pass only the historical series
forecast = predictor.predict(df_ts)

# ------------------------
# 6) Process the final output
# ------------------------

# 🗂️ forecast is a TimeSeriesDataFrame with a MultiIndex (item_id, timestamp)
forecast_df = forecast.reset_index()

# 👀 Check the result
print(forecast_df.head())

# item_id has the form 'customerId_productId'; split it back into its two parts
forecast_df[['customer_id', 'product_id']] = forecast_df['item_id'].str.split('_', expand=True)

# Rename the point-forecast column
forecast_df = forecast_df.rename(columns={'mean': 'tn_pred'})

# Cast product_id back to int (the split yields strings)
forecast_df['product_id'] = forecast_df['product_id'].astype(int)

# Sum the forecasts per product
df_final = (
    forecast_df.groupby('product_id')['tn_pred']
    .sum()
    .reset_index()
    .rename(columns={'tn_pred': 'tn'})
)

print(df_final.head())

# ------------------------
# 7) Export the final CSV
# ------------------------

output_file = 'forecast_customer_producto_serie.csv'
df_final.to_csv(output_file, index=False)
print(f"✅ File saved: {output_file} | Products: {df_final.shape[0]}")
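
One thing worth checking in the cell above is the forecast horizon: `future_date` is two months after `last_date`, while `prediction_length=1` produces a single step. Below is a minimal sketch of that arithmetic, assuming a regular monthly frequency. The helper `monthly_steps` is hypothetical and not part of AutoGluon or pandas.

```python
from datetime import date

def monthly_steps(last_date: date, target_date: date) -> int:
    """Number of monthly steps from the last observed month to the target month."""
    return (target_date.year - last_date.year) * 12 + (target_date.month - last_date.month)

last_date = date(2019, 12, 1)    # last observed month in the panel
target_date = date(2020, 2, 1)   # month +2, as computed in step 5

steps = monthly_steps(last_date, target_date)
```

Here `steps` is 2, so reaching the target month would take `prediction_length=2` and keeping only the last forecast step of each item_id. The forecast timestamps logged for this run (2020-01-31) are consistent with a one-step horizon.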


  offset = pd.tseries.frequencies.to_offset(self.freq)
Frequency 'M' stored as 'ME'
Beginning AutoGluon training... Time limit = 18000s
AutoGluon will save models to '/home/jupyter/franco_maestria/AutogluonModels/ag-20250713_125601'
AutoGluon Version:  1.3.1
Python Version:     3.10.18
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Debian 5.10.237-1 (2025-05-19)
CPU Count:          48
GPU Count:          0
Memory Avail:       329.59 GB / 377.89 GB (87.2%)
Disk Space Avail:   65.11 GB / 97.87 GB (66.5%)

Fitting with arguments:
{'enable_ensemble': True,
 'eval_metric': MAE,
 'freq': 'ME',
 'hyperparameters': 'default',
 'known_covariates_names': [],
 'num_val_windows': 2,
 'prediction_length': 1,
 'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
 'random_seed': 123,
 'refit_every_n_windows': 1,
 'refit_full': False,
 'skip_model_selection': False,
 'target': 'tn',
 'time_limit': 18000,
 'verbosity': 2}



       item_id  timestamp         tn
0  10001_20001 2017-01-01   99.43861
1  10001_20001 2017-02-01  198.84365
2  10001_20001 2017-03-01   92.46537
3  10001_20001 2017-04-01   13.29728
4  10001_20001 2017-05-01  101.00563


train_data with frequency 'IRREG' has been resampled to frequency 'ME'.
Provided train_data has 12138186 rows (NaN fraction=9.1%), 450311 time series. Median time series length is 35 (min=1, max=36). 
	Removing 35117 short time series from train_data. Only series with length >= 7 will be used for training.
	After filtering, train_data has 11981042 rows (NaN fraction=9.2%), 415194 time series. Median time series length is 36 (min=7, max=36). 

Provided data contains following columns:
	target: 'tn'

AutoGluon will gauge predictive performance using evaluation metric: 'MAE'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.

Starting training. Start time is 2025-07-13 13:18:52
Models that will be trained: ['SeasonalNaive', 'RecursiveTabular', 'DirectTabular', 'NPTS', 'DynamicOptimizedTheta', 'AutoETS', 'ChronosZeroShot[bolt_base]', 'ChronosFineTuned[bolt_small]', 'TemporalFusionTransformer', 'DeepAR'

Last date: 2019-12-01 → Target date: 2020-02-01


data with frequency 'IRREG' has been resampled to frequency 'ME'.
Model not specified in predict, will default to the model with the best validation score: WeightedEnsemble


       item_id  timestamp        mean        0.1        0.2         0.3  \
0  10001_20001 2020-01-31  158.124273 -12.723517  49.539174   86.166681   
1  10001_20002 2020-01-31  214.488485  30.852084  93.652966  137.073882   
2  10001_20003 2020-01-31   73.952669  -8.141188  19.989356   42.915080   
3  10001_20004 2020-01-31   15.677887  -3.278758   3.188757    7.894541   
4  10001_20005 2020-01-31    4.193111  -0.167670   1.632387    2.463391   

          0.4         0.5         0.6         0.7         0.8         0.9  
0  124.609566  158.124273  188.906034  224.023255  270.600919  343.839415  
1  177.747329  214.488485  249.180038  292.254952  349.569661  457.365709  
2   58.900472   73.952669   88.942018  110.484680  138.778123  181.103612  
3   12.674014   15.677887   19.898588   24.930783   29.651709   36.153114  
4    3.300738    4.193111    4.970698    5.834702    7.041490    9.012295  
   product_id           tn
0       20001  1030.672333
1       20002   769.951558
2       2000