# 📌 BLOCK 1 - Data loading and source joins (products, stocks, clusters)

In this block we:
- Load the base panel keyed by `customer_id` - `product_id` - `periodo`.
- Filter it down to the target products.
- Left-join four sources:
   1) Country context data (on `periodo`).
   2) Product attributes (on `product_id`).
   3) Stock data (on `product_id` + `periodo`).
   4) DTW clusters (on `product_id`).

We print the shape after each step to confirm how many rows and columns we end up with; the row count should not change across the left joins.
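
The row count only stays constant across a left join when the lookup table has unique keys. A minimal sketch on toy data, assuming hypothetical frames `panel` and `products` and a hypothetical helper `safe_left_merge` (none of them are part of the notebook); `validate="many_to_one"` asks pandas to raise if the right side has duplicate keys:

```python
import pandas as pd

# Toy stand-ins for the panel and a product lookup table (hypothetical data).
panel = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product_id": [10, 11, 10],
    "periodo": [201701, 201701, 201701],
    "tn": [1.5, 0.0, 2.0],
})
products = pd.DataFrame({"product_id": [10, 11], "brand": ["A", "B"]})


def safe_left_merge(left, right, on):
    """Left join that fails loudly if the join would multiply rows."""
    merged = left.merge(right, on=on, how="left", validate="many_to_one")
    assert len(merged) == len(left), "row count changed after merge"
    return merged


enriched = safe_left_merge(panel, products, on="product_id")
print(enriched.shape)  # (3, 5)
```

The same helper could wrap each of the four joins in this block, so a duplicated key in any lookup file stops the run instead of silently inflating the panel.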



In [None]:
# =================================================
# ✅ BLOCK 1 - Load and join multiple sources
# =================================================

import pandas as pd

# ⚙️ 1) Customer-product-period panel
df_panel = pd.read_parquet("C:/Developer/Laboratorio_III/data/panel_completo_cliente_producto.parquet")
print(f"Original panel shape: {df_panel.shape}")

# ⚙️ 2) List of target products
df_products = pd.read_csv("C:/Developer/Laboratorio_III/data/product_id_apredecir201912.txt", sep='\t')
products_objetivo = df_products['product_id'].unique()
df_panel = df_panel[df_panel['product_id'].isin(products_objetivo)].copy()
print(f"Filtered panel shape: {df_panel.shape}")

# Make sure periodo is an int so the join keys match across sources
df_panel['periodo'] = df_panel['periodo'].astype(int)

# ⚙️ 3) Country context (joined on periodo)
df_context = pd.read_csv("C:/Developer/Laboratorio_III/data/datos_contexto_pais.csv")
df_context['periodo'] = df_context['periodo'].astype(int)
df_panel = df_panel.merge(df_context, on='periodo', how='left')
print(f"Panel with country context shape: {df_panel.shape}")

# ⚙️ 4) tb_productos.txt (joined on product_id)
df_prod = pd.read_csv("C:/Developer/Laboratorio_III/data/tb_productos.txt", sep='\t')
df_panel = df_panel.merge(df_prod, on='product_id', how='left')
print(f"Panel with tb_productos shape: {df_panel.shape}")

# ⚙️ 5) tb_stocks.txt (joined on product_id + periodo)
df_stocks = pd.read_csv("C:/Developer/Laboratorio_III/data/tb_stocks.txt", sep='\t')
df_stocks['periodo'] = df_stocks['periodo'].astype(int)
df_panel = df_panel.merge(df_stocks[['product_id', 'periodo', 'stock_final']],
                          on=['product_id', 'periodo'], how='left')
print(f"Panel with tb_stocks shape: {df_panel.shape}")

# ⚙️ 6) clusters_dtw_sample.csv (joined on product_id)
df_clusters = pd.read_csv("C:/Developer/Laboratorio_III/data/clusters_dtw_sample.csv")
df_panel = df_panel.merge(df_clusters[['product_id', 'cluster_dtw']],
                          on='product_id', how='left')
print(f"Panel with clusters_dtw shape: {df_panel.shape}")

# ✅ Final summary
print("\n✅ Final shape of the enriched panel")
print(f"Rows: {df_panel.shape[0]}")
print(f"Total features: {df_panel.shape[1]}")
print("Features:", df_panel.columns.tolist())

df_panel.head(10)  # Show the first 10 rows of the enriched DataFrame

Original panel shape: (17022744, 4)
Filtered panel shape: (12138186, 4)
Panel with country context shape: (12138186, 8)
Panel with tb_productos shape: (12138186, 14)
Panel with tb_stocks shape: (12138186, 15)
Panel with clusters_dtw shape: (12138186, 16)

✅ Final shape of the enriched panel
Rows: 12138186
Total features: 16
Features: ['customer_id', 'product_id', 'periodo', 'tn', 'IPC', 'inflacion', 'cambio_dolar', 'dias_feriados', 'cat1', 'cat2', 'cat3', 'brand', 'sku_size', 'descripcion', 'stock_final', 'cluster_dtw']
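
The shapes show that the joins did not multiply rows, but not how many panel rows found a match in each source. A small sketch of one way to measure that, using hypothetical toy frames (not the real files) and the `indicator=True` option of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical toy data: two of the four panel rows have a stock record.
panel = pd.DataFrame({
    "product_id": [10, 10, 11, 12],
    "periodo": [201701, 201702, 201701, 201701],
})
stocks = pd.DataFrame({
    "product_id": [10, 11],
    "periodo": [201701, 201701],
    "stock_final": [5.0, 7.0],
})

checked = panel.merge(stocks, on=["product_id", "periodo"], how="left", indicator=True)

# Share of panel rows that found a stock record.
match_rate = (checked["_merge"] == "both").mean()

# Share of missing values per column after the join.
missing_by_col = checked.drop(columns="_merge").isna().mean()

print(match_rate)  # 0.5
print(missing_by_col["stock_final"])  # 0.5
```

A low match rate for `stock_final` or `cluster_dtw` would be worth knowing before those columns feed into later features.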


# 📝 What we achieved

## ✔️ The base panel now has:

- Macroeconomic/context variables (joined on `periodo`).

- Product attributes (joined on `product_id`).

- Stock levels (joined on `product_id` + `periodo`).

- Assigned DTW cluster (joined on `product_id`).

✔️ It keeps the `customer_id` + `product_id` + `periodo` granularity.

✔️ It is ready for feature engineering and modeling.
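
The granularity claim can be checked directly by confirming that no key combination appears twice. A minimal sketch, where `assert_unique_keys` is a hypothetical helper and `toy` is made-up data:

```python
import pandas as pd

KEYS = ["customer_id", "product_id", "periodo"]


def assert_unique_keys(df, keys=KEYS):
    """Raise if any key combination appears more than once."""
    dup_count = int(df.duplicated(subset=keys).sum())
    if dup_count:
        raise ValueError(f"{dup_count} duplicated key rows for {keys}")
    return True


toy = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product_id": [10, 10, 10],
    "periodo": [201701, 201702, 201701],
    "tn": [1.0, 2.0, 3.0],
})
ok = assert_unique_keys(toy)
print(ok)  # True
```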

# 🚀 BLOCK 2 - Advanced Feature Engineering

In this step we:
- Build a real date column from `periodo` (YYYYMM).
- Generate the target `clase`: tons (`tn`) at month +2, computed per `customer_id` and `product_id`.
- Compute:
  1) Lags (1 to 36) and their differences.
  2) Rolling means and their differences.
  3) Enriched time context (year, month, quarter, sin/cos, dayofweek, etc.)
  4) Local minima/maxima.
  5) Volatility.
  6) Ratios to higher-level groupings (brand, cat1, cat2, cat3) when present.
  7) DTW cluster as a feature.
  8) Factorization of categorical variables.
  9) Target encoding by `customer_id` and `product_id`.
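
The alignment of the target, a lag, and a rolling mean can be illustrated on a single toy series. This is a sketch with made-up values; the column names follow the notebook, and it assumes every month is present in the series so that shifting by rows equals shifting by months:

```python
import pandas as pd

KEYS = ["customer_id", "product_id"]

toy = pd.DataFrame({
    "customer_id": [1] * 5,
    "product_id": [10] * 5,
    "periodo": [201701, 201702, 201703, 201704, 201705],
    "tn": [1.0, 2.0, 3.0, 4.0, 5.0],
})
toy["fecha"] = pd.to_datetime(toy["periodo"].astype(str) + "01", format="%Y%m%d")
toy = toy.sort_values(KEYS + ["fecha"])

# Target: tn two months ahead (the last two rows have no target).
toy["clase"] = toy.groupby(KEYS)["tn"].shift(-2)

# Lag 1: tn of the previous month.
toy["tn_1"] = toy.groupby(KEYS)["tn"].shift(1)

# Rolling mean over the two previous months; shift(1) excludes the current one.
toy["rollmean_2"] = toy.groupby(KEYS)["tn"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)

print(toy[["periodo", "tn", "clase", "tn_1", "rollmean_2"]])
# clase:      3.0, 4.0, 5.0, NaN, NaN
# tn_1:       NaN, 1.0, 2.0, 3.0, 4.0
# rollmean_2: NaN, NaN, 1.5, 2.5, 3.5
```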

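Lag and rolling features are shifted so they only use past values. The target encoding is the exception: it averages `clase` over all periods, so it includes future information and each row's own target, and it should be recomputed on training data only before modeling.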
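
One way to test whether a feature uses future information is to perturb `tn` after a cutoff period and check that features at or before the cutoff do not change. The sketch below uses hypothetical helpers (`build_features`, `uses_future`) and toy data; `leaky_mean` is included on purpose as a counterexample:

```python
import pandas as pd

KEYS = ["customer_id", "product_id"]


def build_features(df):
    out = df.sort_values(KEYS + ["periodo"]).copy()
    out["tn_1"] = out.groupby(KEYS)["tn"].shift(1)
    out["rollmean_3"] = out.groupby(KEYS)["tn"].transform(
        lambda s: s.shift(1).rolling(3).mean()
    )
    # Deliberately leaky: the mean over all periods includes future months.
    out["leaky_mean"] = out.groupby(KEYS)["tn"].transform("mean")
    return out


def uses_future(df, feature_cols, cutoff):
    """True if any feature at or before `cutoff` changes when later tn changes."""
    perturbed = df.copy()
    perturbed.loc[perturbed["periodo"] > cutoff, "tn"] += 1000.0
    base = build_features(df)
    pert = build_features(perturbed)
    mask = base["periodo"] <= cutoff
    return not base.loc[mask, feature_cols].equals(pert.loc[mask, feature_cols])


toy = pd.DataFrame({
    "customer_id": [1] * 6,
    "product_id": [10] * 6,
    "periodo": [201701, 201702, 201703, 201704, 201705, 201706],
    "tn": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
safe_leaks = uses_future(toy, ["tn_1", "rollmean_3"], cutoff=201704)
leaky_leaks = uses_future(toy, ["leaky_mean"], cutoff=201704)
print(safe_leaks, leaky_leaks)  # False True
```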


In [7]:
# =================================================
# ✅ BLOCK 2 - Advanced Feature Engineering
# =================================================

import numpy as np

df_fe = df_panel.copy()

# ⚙️ 1) Build a real date column
df_fe['periodo'] = df_fe['periodo'].astype(str)
df_fe['fecha'] = pd.to_datetime(df_fe['periodo'] + '01', format='%Y%m%d')

# ⚙️ 2) Sort and create CLASE: tn at period +2
df_fe = df_fe.sort_values(['customer_id', 'product_id', 'fecha'])
df_fe['clase'] = df_fe.groupby(['customer_id', 'product_id'])['tn'].shift(-2)

# ⚙️ 3) Lags (1 to 36) + differences
for lag in range(1, 37):
    df_fe[f'tn_{lag}'] = df_fe.groupby(['customer_id', 'product_id'])['tn'].shift(lag)
    df_fe[f'diff_tn_{lag}'] = df_fe['tn'] - df_fe[f'tn_{lag}']

# ⚙️ 4) Rolling means + differences (shift(1) keeps the current month out of the window)
for window in range(1, 37):
    df_fe[f'rollmean_{window}'] = df_fe.groupby(['customer_id', 'product_id'])['tn'].transform(
        lambda x: x.shift(1).rolling(window).mean()
    )
    df_fe[f'diff_rollmean_{window}'] = df_fe['tn'] - df_fe[f'rollmean_{window}']

# ⚙️ 5) Time context
df_fe['year'] = df_fe['fecha'].dt.year
df_fe['month'] = df_fe['fecha'].dt.month
df_fe['quarter'] = df_fe['fecha'].dt.quarter
df_fe['dayofweek'] = df_fe['fecha'].dt.dayofweek
df_fe['month_sin'] = np.sin(2 * np.pi * df_fe['month'] / 12)
df_fe['month_cos'] = np.cos(2 * np.pi * df_fe['month'] / 12)

# ⚙️ 6) Min and max flags: current tn equals the min/max of the previous window
for window in [3, 6, 12]:
    df_fe[f'is_min_{window}'] = (
        df_fe.groupby(['customer_id', 'product_id'])['tn'].transform(
            lambda x: x == x.shift(1).rolling(window).min()
        ).astype(int)
    )
    df_fe[f'is_max_{window}'] = (
        df_fe.groupby(['customer_id', 'product_id'])['tn'].transform(
            lambda x: x == x.shift(1).rolling(window).max()
        ).astype(int)
    )

# ⚙️ 7) Volatility
df_fe['pct_change_1'] = df_fe.groupby(['customer_id', 'product_id'])['tn'].pct_change(1)
df_fe['rolling_std_3'] = df_fe.groupby(['customer_id', 'product_id'])['tn'].transform(
    lambda x: x.shift(1).rolling(3).std()
)

# ⚙️ 8) Hierarchical: average by brand / cat1 / cat2 / cat3 if present
if 'brand' in df_fe.columns:
    df_fe['brand_avg'] = df_fe.groupby(['brand', 'fecha'])['tn'].transform('mean')
    df_fe['ratio_to_brand_avg'] = df_fe['tn'] / (df_fe['brand_avg'] + 1e-6)

for cat in ['cat1', 'cat2', 'cat3']:
    if cat in df_fe.columns:
        df_fe[f'{cat}_avg'] = df_fe.groupby([cat, 'fecha'])['tn'].transform('mean')
        df_fe[f'ratio_to_{cat}_avg'] = df_fe['tn'] / (df_fe[f'{cat}_avg'] + 1e-6)

# ⚙️ 9) DTW cluster as a feature + interaction
df_fe['cluster_dtw_factorized'], _ = pd.factorize(df_fe['cluster_dtw'].fillna(-1))
df_fe['cluster_x_month'] = df_fe['cluster_dtw_factorized'] * df_fe['month']

# ⚙️ 10) Factorize the other categorical variables
for col in df_fe.select_dtypes(include='object').columns:
    if col not in ['cluster_dtw']:
        df_fe[col + '_factorized'], _ = pd.factorize(df_fe[col])

# ⚙️ 11) Optional target encoding by customer_id and product_id
# NOTE: these means use `clase` from every period, including each row's own
# target, so they leak the target; recompute them on training data only.
product_mean = df_fe.groupby(['product_id'])['clase'].transform('mean')
customer_mean = df_fe.groupby(['customer_id'])['clase'].transform('mean')
df_fe['product_target_enc'] = product_mean
df_fe['customer_target_enc'] = customer_mean

# ✅ Final validation
print("✅ Advanced Feature Engineering completed.")
print(f"Final shape: {df_fe.shape}")
print("Example columns:", df_fe.columns.tolist()[:10], "...")
print(df_fe.head())


✅ Advanced Feature Engineering completed.
Final shape: (12138186, 194)
Example columns: ['customer_id', 'product_id', 'periodo', 'tn', 'IPC', 'inflacion', 'cambio_dolar', 'dias_feriados', 'cat1', 'cat2'] ...
   customer_id  product_id periodo         tn  IPC  inflacion  cambio_dolar  \
0        10001       20001  201701   99.43861  102   2.000000        16.080   
1        10001       20001  201702  198.84365  104   1.960784        15.800   
2        10001       20001  201703   92.46537  107   2.884615        15.645   
3        10001       20001  201704   13.29728  109   1.869159        15.490   
4        10001       20001  201705  101.00563  111   1.834862        16.185   

   dias_feriados cat1         cat2  ... cluster_dtw_factorized  \
0              1   HC  ROPA LAVADO  ...                      0   
1              2   HC  ROPA LAVADO  ...                      0   
2              1   HC  ROPA LAVADO  ...                      0   
3              1   HC  ROPA LAVADO  ...         
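
The lag and rolling loops in this block add 144 columns one at a time, which can trigger pandas' fragmentation `PerformanceWarning` on a frame this size. One possible alternative, sketched here on toy data with a hypothetical helper `add_lags`, collects the new columns in a dict and attaches them with a single `pd.concat`:

```python
import pandas as pd

KEYS = ["customer_id", "product_id"]


def add_lags(df, max_lag):
    """Build all lag and diff columns first, then attach them in one concat."""
    grouped = df.groupby(KEYS)["tn"]
    new_cols = {}
    for lag in range(1, max_lag + 1):
        lagged = grouped.shift(lag)
        new_cols[f"tn_{lag}"] = lagged
        new_cols[f"diff_tn_{lag}"] = df["tn"] - lagged
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)


# Hypothetical toy panel: two customers, one product, three months each.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "product_id": [10] * 6,
    "periodo": [201701, 201702, 201703] * 2,
    "tn": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged_toy = add_lags(toy, max_lag=2)
print(lagged_toy.shape)  # (6, 8)
```

The same pattern would apply to the rolling-mean loop; the resulting columns are the same, only the way they are attached changes.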



# 📌 With this you have:
✅ Target variable `clase` correctly aligned, with lag and rolling features built from past data only (the target encodings are the exception).
✅ Lags, rolling stats, volatility, extreme-value flags, time context, hierarchical ratios, and DTW clusters.
✅ Ready for filtering, feature selection, split, and modeling.
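
For the target encodings, a leakage-safe variant would use only targets that are already observed at each period. Because `clase` at period p is `tn` at p+2, it is known at period t only when p + 2 <= t. The sketch below is one possible approach, not the notebook's method: `past_only_target_encoding` is a hypothetical helper, and it assumes every period is present for every key so that shifting rows equals shifting months.

```python
import pandas as pd


def past_only_target_encoding(df, key, time_col="periodo", horizon=2):
    """Mean of `clase` per key over periods whose target is already observed."""
    per_period = df.groupby([key, time_col])["clase"].agg(["sum", "count"]).sort_index()
    # A target from period p becomes visible `horizon` periods later.
    visible = per_period.groupby(level=key).shift(horizon)
    running = visible.groupby(level=key).cumsum()
    enc = (running["sum"] / running["count"]).rename(f"{key}_target_enc")
    return df.merge(enc.reset_index(), on=[key, time_col], how="left")


# Hypothetical toy panel: two customers buying the same product for five months.
toy = pd.DataFrame({
    "customer_id": [1] * 5 + [2] * 5,
    "product_id": [10] * 10,
    "periodo": [201701, 201702, 201703, 201704, 201705] * 2,
    "tn": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})
toy["clase"] = toy.groupby(["customer_id", "product_id"])["tn"].shift(-2)

encoded = past_only_target_encoding(toy, key="product_id")
print(encoded.loc[encoded["customer_id"] == 1, "product_id_target_enc"].tolist())
# [nan, nan, 4.0, 4.5, 5.0]
```

In 201703 the only visible targets are the 201701 ones (3.0 and 5.0), so the encoding is 4.0; the first two periods have no visible target and stay missing.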

In [8]:
# ================================================
# ✅ Save a snapshot of the enriched dataset
# ================================================

# Output path (adjust to your folder structure)
output_path = "C:/Developer/Laboratorio_III/data/panel_cliente_producto_fe.parquet"

# Save as parquet
df_fe.to_parquet(output_path, index=False)

print(f"✅ Enriched dataset saved to: {output_path}")
print(f"Total rows: {df_fe.shape[0]} | Total columns: {df_fe.shape[1]}")


✅ Enriched dataset saved to: C:/Developer/Laboratorio_III/data/panel_cliente_producto_fe.parquet
Total rows: 12138186 | Total columns: 194
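
With about 12 million rows and 194 columns, the snapshot is large. One possible way to shrink it, assuming float32 precision is acceptable for these features (an assumption not verified in this notebook), is to downcast float64 columns before saving. `downcast_floats` is a hypothetical helper, shown on toy data:

```python
import numpy as np
import pandas as pd


def downcast_floats(df):
    """Return a copy with every float64 column converted to float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: "float32" for col in float_cols})


toy = pd.DataFrame({
    "customer_id": np.arange(1000),
    "tn": np.linspace(0.0, 100.0, 1000),
    "tn_1": np.linspace(1.0, 101.0, 1000),
})
small = downcast_floats(toy)

before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(before, after)
```

Integer and object columns are left untouched, so keys such as `customer_id` keep their original dtype.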
