# Size recommender aimed at reducing returns

This notebook documents the full development of the **size recommender**, from dataset construction to the estimate of expected economic impact. The goal is not to predict returns in isolation, but to **use a risk model to decide when and how to intervene by changing the recommended size**, minimizing returns and quantifying the potential savings.

The project rests on three pillars:
1. Consistent, auditable data (products, customers, tickets, and costs).
2. A trained and calibrated return-probability model.
3. A recommender that uses that model to decide size changes on economic grounds.

---

## 1. Source data and structure

The unit of analysis is the **item** (ticket line). The upstream generators provide:

- **Products**: category, SKU, size, color, and the sizing bias implied by each product's history.
- **Customers**: height, weight, BMI, and purchase and return history.
- **Tickets (online)**: purchase date, item_id, ticket_id, product, and purchased size.
- **Target variable**: `devuelto` (1 if the item was returned).

In addition, a **return-cost dataset** (`items_devoluciones_ajustadas.csv`) is built, which includes:
- a base cost per category,
- a logistics surcharge per zone,
- channel adjustments,
- controlled noise for added realism.

This cost is merged on `(item_id, ticket_id)` in an auditable way: duplicates are resolved with a conservative policy (the highest reported cost wins), and the merge generates the `has_coste` flag.
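
The merge policy described above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the notebook's actual merge cell: the cost column name `coste_devolucion` and the toy frames `items` and `costes` are assumptions.

```python
import pandas as pd

# Hypothetical miniature inputs; `coste_devolucion` is an assumed column name.
items = pd.DataFrame({
    "item_id": ["T1-001", "T1-002", "T2-001"],
    "ticket_id": ["T1", "T1", "T2"],
})
costes = pd.DataFrame({
    "item_id": ["T1-001", "T1-001", "T2-001"],   # T1-001 is duplicated on purpose
    "ticket_id": ["T1", "T1", "T2"],
    "coste_devolucion": [4.5, 6.0, 3.2],
})

keys = ["item_id", "ticket_id"]

# Conservative de-duplication: keep the maximum reported cost per key.
costes_unique = costes.groupby(keys, as_index=False)["coste_devolucion"].max()

# validate="one_to_one" makes the merge fail loudly if duplicate keys remain.
merged = items.merge(costes_unique, on=keys, how="left", validate="one_to_one")

# Flag rows for which a cost was actually found.
merged["has_coste"] = merged["coste_devolucion"].notna().astype("int8")

print(merged)
```

Using a left join keeps every item in the universe, so the row count before and after the merge can be compared as a simple audit check.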

---

## 2. Return model: what it is trained for

The return model is not used as a generic classifier but as a **risk estimator** (`p_dev`) that makes size scenarios comparable.

The question it answers is not:
> "Will this item be returned?"

but:
> "Which size minimizes the probability of return for this customer and this product?"

The model is therefore an **internal component of the recommender**, not an end in itself.

---

## 3. Size-oriented feature engineering

The feature engineering is designed to capture the customer-product interaction:

### Anthropometric variables
- `altura_cm`
- `peso_kg`
- `bmi`

### Size and fit
- `talla_idx`
- `ideal_idx` (estimated from height and weight)
- `desajuste` and `desajuste_abs`
- `talla_extrema`

### Product and context
- `categoria`
- `id_producto`
- historical per-product statistics (in the code cells, smoothed size-mismatch profiles such as `prod_mean_des_smooth`)

These variables let the model learn patterns such as:
- products that systematically run large or small,
- higher risk when the size mismatch is large,
- distinct behavior at the extreme sizes.

---

## 4. Training, validation, and calibration

Training uses a **temporal split** to avoid leakage:

- **Train**: the oldest history.
- **Calibration**: an intermediate period.
- **Test**: the most recent period.

The main model is **XGBoost**, trained with:
- log loss as the evaluation metric during training (`eval_metric="logloss"`), with PR-AUC reported on the test split,
- early stopping, monitored on the calibration split,
- explicit handling of class imbalance.

Afterwards the probabilities are **calibrated** (Platt / isotonic), so that `p_dev` can be read as an actual risk and used directly in economic metrics.

---

## 5. Size recommender: decision logic

The recommender operates at item level and follows these steps:

1. An ideal size (`ideal_idx`) is computed with an anthropometric heuristic.
2. Candidate sizes are generated within a bounded range (±2 sizes).
3. For each candidate size:
   - the size-dependent variables are recomputed,
   - the model estimates the return probability (`p_dev`).
4. The size with the lowest expected risk is selected.
5. The size is changed only if the improvement exceeds a minimum threshold (`min_gain`).

This approach avoids unnecessary changes, reduces friction in the user experience, and prioritizes decisions with real impact.

The key variables generated are:
- `p_dev_actual`
- `p_dev_final`
- `delta_p_final`
- `cambia_talla`
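
The decision steps can be illustrated with a small sketch. It is not the notebook's implementation: `toy_risk` is a hypothetical stand-in for the calibrated model (in the real pipeline the size-dependent features would be recomputed and scored for each candidate), `recommend_size` is an illustrative name, and the sign convention `delta_p_final = p_dev_actual - p_dev_final` is an assumption.

```python
TALLAS_ROPA = ["XS", "S", "M", "L", "XL"]


def toy_risk(talla_idx: int, ideal_idx: int) -> float:
    """Hypothetical stand-in for the calibrated model: risk grows with mismatch."""
    return min(0.95, 0.25 + 0.12 * abs(talla_idx - ideal_idx))


def recommend_size(talla_idx, ideal_idx, risk_fn=toy_risk, max_step=2, min_gain=0.03):
    """Pick the lowest-risk candidate; switch only if the gain exceeds min_gain."""
    # Candidate window of +/- max_step sizes around the ideal, clamped to valid indices.
    lo = max(0, ideal_idx - max_step)
    hi = min(len(TALLAS_ROPA) - 1, ideal_idx + max_step)

    p_dev_actual = risk_fn(talla_idx, ideal_idx)
    risks = {c: risk_fn(c, ideal_idx) for c in range(lo, hi + 1)}
    best_idx = min(risks, key=risks.get)

    # Assumed convention: a positive gain means a reduction in return risk.
    gain = p_dev_actual - risks[best_idx]
    cambia = best_idx != talla_idx and gain > min_gain
    final_idx = best_idx if cambia else talla_idx
    p_dev_final = risks[best_idx] if cambia else p_dev_actual

    return {
        "talla_final": TALLAS_ROPA[final_idx],
        "p_dev_actual": p_dev_actual,
        "p_dev_final": p_dev_final,
        "delta_p_final": p_dev_actual - p_dev_final,
        "cambia_talla": int(cambia),
    }


print(recommend_size(talla_idx=3, ideal_idx=2))  # purchased L, heuristic ideal M
print(recommend_size(talla_idx=2, ideal_idx=2))  # already at the lowest-risk size
```

Keeping the purchased size whenever the gain does not clear `min_gain` is what makes the recommender selective: the threshold trades a small amount of expected risk reduction for fewer interventions.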

---

## 6. Integration with the global model and costs

To put the economic impact in context, a **global return model** (`p_dev_global`) and the estimated cost of each return are integrated.

This makes it possible to measure directly:
- **expected savings per item**,
- **expected savings per intervention** (only when the size changes),
- **the share of total return cost that can be addressed through sizing**.

The central metric is:

`expected_savings_talla = delta_p_final × coste_devolucion`
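
A small worked example of these metrics on three made-up items follows. The rows are invented for illustration, `coste_devolucion` is an assumed column name, and taking `p_dev_global × coste_devolucion` as the total expected cost behind the "share" metric is an assumption rather than the notebook's confirmed definition.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "p_dev_global": [0.40, 0.30, 0.55],
    "p_dev_actual": [0.40, 0.30, 0.55],
    "p_dev_final": [0.28, 0.30, 0.35],
    "cambia_talla": [1, 0, 1],
    "coste_devolucion": [5.0, 4.0, 8.0],
})

# Assumed sign convention: a positive delta means a reduction in return risk.
df["delta_p_final"] = df["p_dev_actual"] - df["p_dev_final"]
df["expected_savings_talla"] = df["delta_p_final"] * df["coste_devolucion"]

# Average over all items vs. only over items where the size was changed.
ahorro_por_item = df["expected_savings_talla"].mean()
ahorro_por_intervencion = df.loc[df["cambia_talla"] == 1, "expected_savings_talla"].mean()

# Assumed denominator: total expected return cost under the global model.
coste_esperado_total = (df["p_dev_global"] * df["coste_devolucion"]).sum()
share_atacable = df["expected_savings_talla"].sum() / coste_esperado_total

print(round(ahorro_por_item, 4), round(ahorro_por_intervencion, 4), round(share_atacable, 4))
```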

---

## 7. Final dataset for BI and metrics

A flat dataset, **`bi_items.csv`**, is built ready for Power BI. It includes:

- item and product information,
- scores from the return model,
- the size recommender's decision,
- expected economic metrics,
- time variables (`year`, `month`).

From this dataset, aggregations are generated for analysis and reporting:
- by **category**,
- by **product** (top impact),
- by **deciles of global risk**,
- **friction vs. savings** curves as a function of the `min_gain` threshold.
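
A friction vs. savings curve of this kind can be sketched as below. The data are synthetic (drawn from arbitrary distributions), and the column names `gain`, `friccion` and `ahorro` are illustrative assumptions, so the numbers say nothing about the real dataset; the sketch only shows the mechanics of sweeping `min_gain`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic items: best achievable risk reduction and return cost per item.
df = pd.DataFrame({
    "gain": rng.beta(1, 12, n),
    "coste_devolucion": rng.uniform(3, 9, n),
})

rows = []
for min_gain in [0.0, 0.02, 0.05, 0.10]:
    # An item is intervened on only if its gain exceeds the threshold.
    act = df["gain"] > min_gain
    rows.append({
        "min_gain": min_gain,
        "friccion": act.mean(),  # share of items whose size would change
        "ahorro": (df.loc[act, "gain"] * df.loc[act, "coste_devolucion"]).sum(),
    })

curve = pd.DataFrame(rows)
print(curve)
```

Because the intervened set can only shrink as `min_gain` grows, both columns are non-increasing; the interesting question in practice is how much faster friction falls than savings.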

---

## 8. Reading the results

The results show that:
- only around **18% of items** require intervention,
- the savings are concentrated in a small fraction of products and in higher-risk items,
- raising the threshold reduces friction at only a marginal cost in expected savings.

This supports the view that the recommender:
- acts selectively,
- is aligned with business objectives,
- and can scale to a production environment.

---

## 9. Conclusion

This recommender does not "guess sizes"; it **optimizes decisions** using a calibrated risk model and a clear economic function.  
The value lies not in changing many sizes but in **changing few, well justified and with measurable impact**.


In [57]:
import json
import os
import sqlite3
import unicodedata

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score, log_loss, brier_score_loss
from sklearn.isotonic import IsotonicRegression

import xgboost as xgb


In [58]:
# Data path and source
DB_PATH = "database/mi_base.db"
TABLE_NAME = "dataset_modelo_a_tallas"

# Categories treated as garments with standard-size recommendations
ROPA_CATS = {"camiseta", "sudadera", "pantalon", "abrigo", "camisa"}

# Supported size space and size<->index lookup dictionaries (useful for modeling/classification)
TALLAS_ROPA = ["XS", "S", "M", "L", "XL"]
TALLA_TO_IDX = {talla: i for i, talla in enumerate(TALLAS_ROPA)}
IDX_TO_TALLA = {i: talla for talla, i in TALLA_TO_IDX.items()}

# Normalization of input sizes (e.g. lowercase text or inconsistent formats)
MAP_TALLA = {"xs": "XS", "s": "S", "m": "M", "l": "L", "xl": "XL"}

# Heuristic reference ranges (baseline) linking size to height/weight.
# Used as an initial rule or as a fallback when the model lacks sufficient signal.
ROPA_RANGES = {
    "XS": {"h": (150, 165), "w": (40, 70)},
    "S":  {"h": (158, 172), "w": (48, 80)},
    "M":  {"h": (166, 180), "w": (58, 95)},
    "L":  {"h": (174, 188), "w": (68, 115)},
    "XL": {"h": (182, 210), "w": (78, 140)},
}

In [59]:
def load_from_sqlite(db_path: str, table_name: str) -> pd.DataFrame:
    """
    Loads a full table from a SQLite database and returns it
    as a pandas DataFrame.
    """
    con = sqlite3.connect(db_path)
    try:
        df = pd.read_sql_query(
            f"SELECT * FROM {table_name}",
            con
        )
    finally:
        con.close()

    return df


df_raw = load_from_sqlite(DB_PATH, TABLE_NAME)

# Quick volume check to confirm the load is consistent
print(f"Dataset cargado: {df_raw.shape[0]} filas, {df_raw.shape[1]} columnas")

df_raw.head()


Dataset cargado: 684561 filas, 13 columnas


Unnamed: 0,item_id,ticket_id,customer_id,canal,sku,id_producto,categoria,talla,altura_cm,peso_kg,bmi,fecha_item,devuelto
0,T000001-001,T000001,C000001,online,P005-NAV-L,P005,abrigo,L,184.2,71.2,20.99,2017-08-01 00:00:00,0
1,T000002-001,T000002,C000002,online,P005-BEI-L,P005,abrigo,L,182.3,75.4,22.68,2017-08-01 00:00:00,0
2,T000003-001,T000003,C000003,online,P002-BLU-XL,P002,camiseta,XL,189.6,94.5,26.29,2017-08-01 00:00:00,0
3,T000003-002,T000003,C000003,online,P003-BLK-XL,P003,sudadera,XL,189.6,94.5,26.29,2017-08-01 00:00:00,1
4,T000004-001,T000004,C000003,online,P003-NAV-XL,P003,sudadera,XL,189.6,94.5,26.29,2021-02-14 00:00:00,0


In [60]:
def normalize_text(value):
    """
    Normalizes text by stripping accents, forcing lowercase, and
    trimming whitespace. Returns NaN if the value is null.
    """
    if pd.isna(value):
        return np.nan

    value = unicodedata.normalize("NFKD", str(value))
    value = "".join(c for c in value if not unicodedata.combining(c))

    return value.lower().strip()


def build_df_clean(df: pd.DataFrame) -> pd.DataFrame:
    """
    Applies basic cleaning and validation to the original dataset,
    leaving a consistent set of observations suitable for the
    size-recommendation logic.
    """
    out = df.copy()

    # Normalize the key categorical variables
    out["categoria"] = out["categoria"].apply(normalize_text)
    out["talla"] = out["talla"].apply(normalize_text)

    # Keep only clothing categories with a standard sizing system
    out = out[out["categoria"].isin(ROPA_CATS)].copy()

    # Normalize and validate the size space
    out["talla"] = out["talla"].map(MAP_TALLA)
    out = out[out["talla"].isin(TALLAS_ROPA)].copy()

    # Explicitly convert anthropometric variables to numeric
    for col in ["altura_cm", "peso_kg", "bmi"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # Drop observations with implausible physical values
    out = out[
        out["altura_cm"].between(145, 210) &
        out["peso_kg"].between(40, 180)
    ].copy()

    # Recompute BMI so there is a single source of truth
    out["bmi"] = out["peso_kg"] / (out["altura_cm"] / 100.0) ** 2

    # Validate the purchase timestamp
    out["fecha_item"] = pd.to_datetime(out["fecha_item"], errors="coerce")
    out = out[out["fecha_item"].notna()].copy()

    # Normalize the target variable (return); missing values are treated as not returned
    out["devuelto"] = (
        pd.to_numeric(out["devuelto"], errors="coerce")
        .fillna(0)
        .astype("int8")
    )

    return out.reset_index(drop=True)


df_clean = build_df_clean(df_raw)

print(f"df_clean generado: {df_clean.shape[0]} filas, {df_clean.shape[1]} columnas")
print("Distribución por categoría:", df_clean["categoria"].value_counts().to_dict())
print("Distribución por talla:", df_clean["talla"].value_counts().to_dict())
print(
    "Rango altura (cm):",
    df_clean["altura_cm"].min(),
    "-",
    df_clean["altura_cm"].max()
)
print(
    "Rango peso (kg):",
    df_clean["peso_kg"].min(),
    "-",
    df_clean["peso_kg"].max()
)
print("Ratio global de devolución:", round(df_clean["devuelto"].mean(), 4))


df_clean generado: 619514 filas, 13 columnas
Distribución por categoría: {'camiseta': 196123, 'abrigo': 169076, 'pantalon': 146619, 'sudadera': 67015, 'camisa': 40681}
Distribución por talla: {'M': 237333, 'L': 207858, 'XL': 99612, 'S': 62986, 'XS': 11725}
Rango altura (cm): 150.1 - 209.5
Rango peso (kg): 40.0 - 153.8
Ratio global de devolución: 0.3159


In [61]:
def infer_talla_ideal_ropa(h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """
    Infers the ideal size from height (cm) and weight (kg) using a distance
    to the centers of the heuristic ranges defined in ROPA_RANGES.
    The distance is normalized by each range's width so that variables on
    different scales are comparable; height is weighted 0.60 and weight 0.40.
    """
    h = h.astype(float)
    w = w.astype(float)

    sizes = TALLAS_ROPA
    n = len(h)
    k = len(sizes)

    h_mid = np.empty(k)
    w_mid = np.empty(k)
    h_span = np.empty(k)
    w_span = np.empty(k)

    for j, talla in enumerate(sizes):
        h_low, h_high = ROPA_RANGES[talla]["h"]
        w_low, w_high = ROPA_RANGES[talla]["w"]

        h_mid[j] = (h_low + h_high) / 2.0
        w_mid[j] = (w_low + w_high) / 2.0
        h_span[j] = (h_high - h_low)
        w_span[j] = (w_high - w_low)

    dh = (h.reshape(n, 1) - h_mid.reshape(1, k)) / h_span.reshape(1, k)
    dw = (w.reshape(n, 1) - w_mid.reshape(1, k)) / w_span.reshape(1, k)

    score = (dh ** 2) * 0.60 + (dw ** 2) * 0.40
    best_idx = np.argmin(score, axis=1)

    return np.array([sizes[j] for j in best_idx], dtype=object)


def build_size_features(df_clean: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Builds the ideal-size and mismatch variables, comparing the purchased size
    with the size inferred by the heuristic.
    Returns:
      - df_all: full dataset with the new variables
      - df_mis: subset with non-zero mismatch
    """
    out = df_clean.copy()

    out["talla_idx"] = out["talla"].map(TALLA_TO_IDX).astype("int8")

    talla_ideal = infer_talla_ideal_ropa(
        out["altura_cm"].to_numpy(),
        out["peso_kg"].to_numpy()
    )
    out["talla_ideal"] = talla_ideal
    out["ideal_idx"] = pd.Series(out["talla_ideal"]).map(TALLA_TO_IDX).astype("int8")

    out["desajuste"] = (out["talla_idx"] - out["ideal_idx"]).astype("int8")
    out["desajuste_abs"] = out["desajuste"].abs().astype("int8")

    out["talla_extrema"] = out["talla"].isin(["XS", "XL"]).astype("int8")

    df_all = out.reset_index(drop=True)
    df_mis = df_all[df_all["desajuste_abs"] > 0].copy().reset_index(drop=True)

    return df_all, df_mis


df_all, df_mis = build_size_features(df_clean)

print(f"df_all: {df_all.shape} | ratio devolución: {round(df_all['devuelto'].mean(), 4)}")
print(f"df_mis: {df_mis.shape} | ratio devolución: {round(df_mis['devuelto'].mean(), 4)}")
print(df_all["desajuste"].value_counts().sort_index())


df_all: (619514, 19) | ratio devolución: 0.3159
df_mis: (80469, 19) | ratio devolución: 0.491
desajuste
-2       320
-1     20676
 0    539045
 1     58679
 2       780
 3        14
Name: count, dtype: int64


In [62]:
def add_product_profile_features(
    df_in: pd.DataFrame,
    prod_col: str = "id_producto",
    des_col: str = "desajuste",
    min_n: int = 15,
    prior_strength: float = 50.0,
) -> tuple[pd.DataFrame, pd.DataFrame, dict]:
    """
    Builds a per-product profile from the historically observed size mismatch.
    Shrinkage toward the global mean is applied to avoid noisy estimates
    for products with few observations.

    Returns:
      - out: original dataset with the profile variables added
      - prof: per-product profile table
      - meta: global parameters used in the construction
    """
    df = df_in.copy()

    global_mean = float(df[des_col].mean())
    global_mean_abs = float(df[des_col].abs().mean())

    g = df.groupby(prod_col, dropna=False)

    prof = (
        g[des_col]
        .agg(prod_n="size", prod_mean_des="mean")
        .reset_index()
    )

    prof_abs = (
        g[des_col]
        .agg(prod_mean_abs=lambda s: float(np.mean(np.abs(s))))
        .reset_index(drop=True)
    )
    prof["prod_mean_abs"] = prof_abs

    n = prof["prod_n"].astype(float)

    prof["prod_mean_des_smooth"] = (
        (n * prof["prod_mean_des"] + prior_strength * global_mean) / (n + prior_strength)
    )
    prof["prod_mean_abs_smooth"] = (
        (n * prof["prod_mean_abs"] + prior_strength * global_mean_abs) / (n + prior_strength)
    )

    prof["prod_has_history"] = (prof["prod_n"] >= min_n).astype("int8")
    prof["prod_bias_dir"] = np.sign(prof["prod_mean_des_smooth"]).astype("int8")
    prof["prod_bias_strength"] = np.abs(prof["prod_mean_des_smooth"]).astype("float32")

    prof = prof[
        [
            prod_col,
            "prod_n",
            "prod_has_history",
            "prod_mean_des_smooth",
            "prod_mean_abs_smooth",
            "prod_bias_dir",
            "prod_bias_strength",
        ]
    ].copy()

    out = df.merge(prof, on=prod_col, how="left")

    # Consistent fallbacks for products with no history (for example, new products)
    out["prod_n"] = out["prod_n"].fillna(0).astype(int)
    out["prod_has_history"] = out["prod_has_history"].fillna(0).astype("int8")

    out["prod_mean_des_smooth"] = out["prod_mean_des_smooth"].fillna(global_mean).astype("float32")
    out["prod_mean_abs_smooth"] = out["prod_mean_abs_smooth"].fillna(global_mean_abs).astype("float32")

    out["prod_bias_dir"] = out["prod_bias_dir"].fillna(0).astype("int8")
    out["prod_bias_strength"] = out["prod_bias_strength"].fillna(abs(global_mean)).astype("float32")

    meta = {
        "global_mean_des": global_mean,
        "global_mean_abs": global_mean_abs,
        "min_n": min_n,
        "prior_strength": prior_strength,
    }

    return out, prof, meta


df_all2, prod_profile, prod_meta = add_product_profile_features(
    df_all,
    prod_col="id_producto",
    des_col="desajuste",
    min_n=15,
    prior_strength=50.0,
)

print(f"df_all2: {df_all2.shape}")
print(f"prod_profile: {prod_profile.shape}")
print("meta:", prod_meta)

prod_profile.sort_values("prod_n", ascending=False).head(5)


df_all2: (619514, 25)
prod_profile: (51, 7)
meta: {'global_mean_des': 0.06289607660198156, 'global_mean_abs': 0.1317113091875244, 'min_n': 15, 'prior_strength': 50.0}


Unnamed: 0,id_producto,prod_n,prod_has_history,prod_mean_des_smooth,prod_mean_abs_smooth,prod_bias_dir,prod_bias_strength
2,P003,54193,1,0.063956,0.132083,1,0.063956
1,P002,54136,1,0.062823,0.130543,1,0.062823
7,P008,51913,1,0.062316,0.132856,1,0.062316
11,P013,48626,1,0.064593,0.132993,1,0.064593
4,P005,43462,1,0.067111,0.133149,1,0.067111
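
The smoothed columns in this profile follow the shrinkage coded in `add_product_profile_features`. Written out, with $n_p$ the number of observations of product $p$, $\bar{d}_p$ its mean mismatch, $\bar{d}$ the global mean mismatch, and $k$ the `prior_strength` (these symbols are introduced here for illustration only):

$$
\tilde{d}_p = \frac{n_p\,\bar{d}_p + k\,\bar{d}}{n_p + k}
$$

With $k = 50$ and $n_p = 54193$ for P003, the prior carries a weight of $50 / 54243 \approx 0.09\%$, so for high-volume products the smoothed value is essentially the raw product mean. The shrinkage only has a visible effect when $n_p$ is comparable to or smaller than $k$.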


In [63]:
def clamp_idx(idx: int) -> int:
    """Ensures a size index falls within the valid range."""
    return int(max(0, min(len(TALLAS_ROPA) - 1, idx)))


def build_product_bias_lookup(prod_profile: pd.DataFrame) -> dict:
    """
    Builds a dictionary id_producto -> smoothed mismatch bias.
    Used as an additional signal to correct systematic sizing per product.
    """
    return (
        prod_profile
        .assign(id_producto=lambda d: d["id_producto"].astype(str))
        .set_index("id_producto")["prod_mean_des_smooth"]
        .astype(float)
        .to_dict()
    )


def bias_to_step(bias: float, thr: float = 0.25, gain: float = 2.0) -> int:
    """
    Converts a continuous bias into a discrete size adjustment in {-1, 0, +1}.
    The threshold avoids applying noise-driven corrections when the history is weak.
    """
    x = gain * float(bias)
    if abs(x) < thr:
        return 0
    return int(np.sign(x))


def recommend_A1(
    df_like: pd.DataFrame,
    id_producto: str,
    altura_cm: float,
    peso_kg: float,
    max_step_ropa: int = 2,
    tau: float = 0.06,
) -> tuple[str | None, pd.DataFrame, str]:
    """
    Baseline A1: recommends the ideal size inferred from height and weight.
    Returns the recommended size, a table of scored candidates, and a
    summary string for debugging/auditing.
    """
    sub = df_like[df_like["id_producto"].astype(str) == str(id_producto)]
    if sub.empty:
        raise ValueError(f"[A1] id_producto={id_producto} no existe en el dataset de referencia.")

    categoria = str(sub.iloc[0]["categoria"]).strip().lower()
    if categoria not in ROPA_CATS:
        return None, pd.DataFrame(), f"[A1] Categoría fuera de alcance: {categoria}"

    ideal = infer_talla_ideal_ropa(np.array([altura_cm]), np.array([peso_kg]))[0]
    idx_ideal = TALLA_TO_IDX[ideal]

    lo = clamp_idx(idx_ideal - max_step_ropa)
    hi = clamp_idx(idx_ideal + max_step_ropa)
    candidatas = TALLAS_ROPA[lo:hi + 1]

    rows = []
    for talla in candidatas:
        dist = abs(TALLA_TO_IDX[talla] - idx_ideal)
        rows.append((talla, -dist))

    tab = (
        pd.DataFrame(rows, columns=["talla", "score"])
        .sort_values("score", ascending=False)
        .reset_index(drop=True)
    )

    s = tab["score"].to_numpy(float)
    s = s - s.max()
    w = np.exp(s / max(1e-6, tau))
    w = w / w.sum()
    tab["pct"] = np.round(100 * w, 2)

    best = str(tab.loc[0, "talla"])
    msg = f"[A1] {id_producto} ({categoria}) {int(altura_cm)}/{int(peso_kg)} -> {best} (ideal={ideal})"

    return best, tab, msg


def recommend_A2(
    df_like: pd.DataFrame,
    id_producto: str,
    altura_cm: float,
    peso_kg: float,
    prod_bias_lookup: dict,
    max_step_ropa: int = 2,
    tau: float = 0.06,
    thr: float = 0.25,
    gain: float = 2.0,
) -> tuple[str | None, pd.DataFrame, str]:
    """
    Baseline A2: starts from the ideal size and applies a discrete per-product
    adjustment based on the smoothed historical mismatch bias.
    """
    sub = df_like[df_like["id_producto"].astype(str) == str(id_producto)]
    if sub.empty:
        raise ValueError(f"[A2] id_producto={id_producto} no existe en el dataset de referencia.")

    categoria = str(sub.iloc[0]["categoria"]).strip().lower()
    if categoria not in ROPA_CATS:
        return None, pd.DataFrame(), f"[A2] Categoría fuera de alcance: {categoria}"

    ideal = infer_talla_ideal_ropa(np.array([altura_cm]), np.array([peso_kg]))[0]
    idx_ideal = TALLA_TO_IDX[ideal]

    bias = float(prod_bias_lookup.get(str(id_producto), 0.0))
    step = bias_to_step(bias, thr=thr, gain=gain)
    idx_target = idx_ideal + step

    lo = clamp_idx(idx_ideal - max_step_ropa)
    hi = clamp_idx(idx_ideal + max_step_ropa)
    candidatas = TALLAS_ROPA[lo:hi + 1]

    rows = []
    for talla in candidatas:
        dist = abs(TALLA_TO_IDX[talla] - idx_target)
        rows.append((talla, -dist))

    tab = (
        pd.DataFrame(rows, columns=["talla", "score"])
        .sort_values("score", ascending=False)
        .reset_index(drop=True)
    )

    s = tab["score"].to_numpy(float)
    s = s - s.max()
    w = np.exp(s / max(1e-6, tau))
    w = w / w.sum()
    tab["pct"] = np.round(100 * w, 2)

    best = str(tab.loc[0, "talla"])
    msg = (
        f"[A2] {id_producto} ({categoria}) {int(altura_cm)}/{int(peso_kg)} -> {best} "
        f"(ideal={ideal}, bias={bias:+.3f}, step={step:+d})"
    )

    return best, tab, msg


prod_bias_lookup = build_product_bias_lookup(prod_profile)

row = df_all2.sample(1, random_state=7).iloc[0]
pid = str(row["id_producto"])
h = float(row["altura_cm"])
w = float(row["peso_kg"])

t1, tab1, msg1 = recommend_A1(df_all2, pid, h, w)
t2, tab2, msg2 = recommend_A2(df_all2, pid, h, w, prod_bias_lookup=prod_bias_lookup)

print(msg1)
display(tab1)

print(msg2)
display(tab2)


[A1] P004 (pantalon) 167/54 -> S (ideal=S)


Unnamed: 0,talla,score,pct
0,S,0,100.0
1,XS,-1,0.0
2,M,-1,0.0
3,L,-2,0.0


[A2] P004 (pantalon) 167/54 -> S (ideal=S, bias=+0.062, step=+0)


Unnamed: 0,talla,score,pct
0,S,0,100.0
1,XS,-1,0.0
2,M,-1,0.0
3,L,-2,0.0
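
The `step=+0` in the A2 output follows directly from the thresholding in `bias_to_step`: with the default `gain=2.0` and `thr=0.25`, the scaled bias is 2.0 × 0.062 = 0.124, which is below the threshold. The snippet below re-declares the function as a standalone copy, only to make that arithmetic easy to check in isolation.

```python
import numpy as np


def bias_to_step(bias: float, thr: float = 0.25, gain: float = 2.0) -> int:
    """Standalone copy of the notebook function, repeated here for illustration."""
    x = gain * float(bias)
    if abs(x) < thr:
        return 0
    return int(np.sign(x))


print(bias_to_step(0.062))   # 2.0 * 0.062 = 0.124 < 0.25, so no adjustment
print(bias_to_step(0.125))   # 2.0 * 0.125 = 0.25 is not below thr, so one size up
print(bias_to_step(-0.30))   # 2.0 * -0.30 = -0.60, so one size down
```

With these defaults a product needs a smoothed bias of at least 0.125 in absolute value before A2 departs from A1. The five highest-volume products in the profile table above all sit between roughly 0.062 and 0.067, so for them both baselines return the same size.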


In [64]:
# Output path
OUTPUT_PATH = "data/processed/df_all2.pkl"

# Create the folder if it does not exist
os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)

# Save the final dataset of the pipeline
df_all2.to_pickle(OUTPUT_PATH)

print(f"Dataset final guardado en: {OUTPUT_PATH}")


Dataset final guardado en: data/processed/df_all2.pkl


# Model

In [65]:
df_all2["fecha_item"] = pd.to_datetime(df_all2["fecha_item"], errors="coerce")

cut_train_end = pd.Timestamp("2024-01-01")
cut_calib_end = pd.Timestamp("2024-09-01")

train_df = df_all2[df_all2["fecha_item"] < cut_train_end].copy()
calib_df = df_all2[(df_all2["fecha_item"] >= cut_train_end) & (df_all2["fecha_item"] < cut_calib_end)].copy()
test_df  = df_all2[df_all2["fecha_item"] >= cut_calib_end].copy()

print("train:", train_df.shape, train_df["fecha_item"].min(), train_df["fecha_item"].max())
print("calib:", calib_df.shape, calib_df["fecha_item"].min(), calib_df["fecha_item"].max())
print("test :", test_df.shape, test_df["fecha_item"].min(), test_df["fecha_item"].max())

print("RR train:", float(train_df["devuelto"].mean()))
print("RR calib:", float(calib_df["devuelto"].mean()))
print("RR test :", float(test_df["devuelto"].mean()))


def drop_prod_cols(df: pd.DataFrame) -> pd.DataFrame:
    """
    Drops the columns derived from the product profile to avoid collisions
    when profiles are recomputed on a different temporal partition.
    """
    prod_cols = [c for c in df.columns if c.startswith("prod_")]
    return df.drop(columns=prod_cols, errors="ignore").copy()


train_base = drop_prod_cols(train_df)
calib_base = drop_prod_cols(calib_df)
test_base  = drop_prod_cols(test_df)


train_df2, prod_profile_train, prod_meta_train = add_product_profile_features(
    train_base,
    prod_col="id_producto",
    des_col="desajuste",
    min_n=15,
    prior_strength=50.0
)

global_mean = float(prod_meta_train["global_mean_des"])
global_abs  = float(prod_meta_train["global_mean_abs"])


def apply_prod_profile(df_base: pd.DataFrame, prod_profile: pd.DataFrame) -> pd.DataFrame:
    """
    Applies a precomputed product profile to the dataset, using global
    values as a fallback for products with no history.
    """
    out = df_base.merge(prod_profile, on="id_producto", how="left")

    out["prod_n"] = out["prod_n"].fillna(0).astype(int)
    out["prod_has_history"] = out["prod_has_history"].fillna(0).astype("int8")

    out["prod_mean_des_smooth"] = out["prod_mean_des_smooth"].fillna(global_mean).astype("float32")
    out["prod_mean_abs_smooth"] = out["prod_mean_abs_smooth"].fillna(global_abs).astype("float32")

    out["prod_bias_dir"] = out["prod_bias_dir"].fillna(0).astype("int8")
    out["prod_bias_strength"] = out["prod_bias_strength"].fillna(abs(global_mean)).astype("float32")

    return out


calib_df2 = apply_prod_profile(calib_base, prod_profile_train)
test_df2  = apply_prod_profile(test_base,  prod_profile_train)

print("train_df2:", train_df2.shape, "calib_df2:", calib_df2.shape, "test_df2:", test_df2.shape)


train: (331148, 25) 2017-08-01 00:00:00 2023-12-31 00:00:00
calib: (96412, 25) 2024-01-01 00:00:00 2024-08-31 00:00:00
test : (191954, 25) 2024-09-01 00:00:00 2025-09-30 00:00:00
RR train: 0.28916073779699714
RR calib: 0.3466373480479608
RR test : 0.34667159840378425
train_df2: (331148, 25) calib_df2: (96412, 25) test_df2: (191954, 25)


In [66]:
TARGET = "devuelto"

cat_features = ["categoria", "id_producto"]
num_features = [
    "altura_cm", "peso_kg", "bmi",
    "talla_idx", "ideal_idx",
    "desajuste", "desajuste_abs",
    "talla_extrema",
    "prod_mean_des_smooth", "prod_bias_strength",
    "prod_has_history",
]

X_train = train_df2[cat_features + num_features].copy()
y_train = train_df2[TARGET].astype(int).copy()

X_cal = calib_df2[cat_features + num_features].copy()
y_cal = calib_df2[TARGET].astype(int).copy()

X_test = test_df2[cat_features + num_features].copy()
y_test = test_df2[TARGET].astype(int).copy()


preprocess = ColumnTransformer(
    transformers=[
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_features),
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
        ]), num_features),
    ],
    remainder="drop",
)

Xtr = preprocess.fit_transform(X_train)
Xca = preprocess.transform(X_cal)
Xte = preprocess.transform(X_test)

print("Matrices:", Xtr.shape, Xca.shape, Xte.shape)


Matrices: (331148, 55) (96412, 55) (191954, 55)


In [67]:
dtrain = xgb.DMatrix(Xtr, label=y_train)
dcal   = xgb.DMatrix(Xca, label=y_cal)
dtest  = xgb.DMatrix(Xte, label=y_test)

params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.05,
    "max_depth": 6,
    "min_child_weight": 5,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "lambda": 1.0,
    "alpha": 0.0,
    "tree_method": "hist",
    "seed": 7,
}

evals = [(dtrain, "train"), (dcal, "calib")]

booster = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=5000,
    evals=evals,
    early_stopping_rounds=200,
    verbose_eval=200,
)


[0]	train-logloss:0.59922	calib-logloss:0.64909
[200]	train-logloss:0.57093	calib-logloss:0.60768
[400]	train-logloss:0.56675	calib-logloss:0.60737
[600]	train-logloss:0.56332	calib-logloss:0.60754
[648]	train-logloss:0.56247	calib-logloss:0.60760


In [68]:
try:
    p_cal_raw  = booster.predict(dcal,  iteration_range=(0, booster.best_iteration + 1))
    p_test_raw = booster.predict(dtest, iteration_range=(0, booster.best_iteration + 1))
except TypeError:
    p_cal_raw  = booster.predict(dcal,  ntree_limit=getattr(booster, "best_ntree_limit", 0) or 0)
    p_test_raw = booster.predict(dtest, ntree_limit=getattr(booster, "best_ntree_limit", 0) or 0)

print("RR real calib:", float(y_cal.mean()), "RR pred calib RAW:", float(np.mean(p_cal_raw)))
print("RR real test :", float(y_test.mean()), "RR pred test  RAW:", float(np.mean(p_test_raw)))


RR real calib: 0.3466373480479608 RR pred calib RAW: 0.3099621534347534
RR real test : 0.34667159840378425 RR pred test  RAW: 0.3105109930038452


In [69]:
eps = 1e-6

z_cal = np.log(np.clip(p_cal_raw, eps, 1 - eps) / np.clip(1 - p_cal_raw, eps, 1 - eps)).reshape(-1, 1)
z_test = np.log(np.clip(p_test_raw, eps, 1 - eps) / np.clip(1 - p_test_raw, eps, 1 - eps)).reshape(-1, 1)

platt = LogisticRegression(max_iter=2000, solver="lbfgs")
platt.fit(z_cal, y_cal)

p_test_platt = platt.predict_proba(z_test)[:, 1]

print("=== XGB calibrado (Platt) ===")
print("AUC:", roc_auc_score(y_test, p_test_platt))
print("PR-AUC:", average_precision_score(y_test, p_test_platt))
print("LogLoss:", log_loss(y_test, p_test_platt))
print("Brier:", brier_score_loss(y_test, p_test_platt))
print("RR real test:", float(y_test.mean()))
print("RR pred test:", float(np.mean(p_test_platt)))


=== XGB calibrado (Platt) ===
AUC: 0.6588341948226026
PR-AUC: 0.5231333377851516
LogLoss: 0.6046991920502619
Brier: 0.20771543233581286
RR real test: 0.34667159840378425
RR pred test: 0.34740813163962375


In [70]:
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(p_cal_raw, y_cal)

p_test_iso = iso.predict(p_test_raw)

print("=== XGB calibrado (Isotonic) ===")
print("AUC:", roc_auc_score(y_test, p_test_iso))
print("PR-AUC:", average_precision_score(y_test, p_test_iso))
print("LogLoss:", log_loss(y_test, p_test_iso))
print("Brier:", brier_score_loss(y_test, p_test_iso))
print("RR real test:", float(y_test.mean()))
print("RR pred test:", float(np.mean(p_test_iso)))


=== XGB calibrado (Isotonic) ===
AUC: 0.6586505294296866
PR-AUC: 0.5149625198702965
LogLoss: 0.6051253823544808
Brier: 0.20777431113671938
RR real test: 0.34667159840378425
RR pred test: 0.34717676043510437


In [71]:
test_scored = test_df2.copy()

# Select the final score to use downstream
test_scored["p_dev"] = p_test_platt  # alternative: p_test_iso

prod_eval = (
    test_scored
    .groupby(["id_producto", "categoria"], as_index=False)
    .agg(
        n=("devuelto", "size"),
        rr_real=("devuelto", "mean"),
        rr_pred=("p_dev", "mean"),
    )
)

prod_eval["residual"] = prod_eval["rr_real"] - prod_eval["rr_pred"]
prod_eval["lift"] = prod_eval["rr_real"] / (prod_eval["rr_pred"] + 1e-9)

prod_eval_big = prod_eval[prod_eval["n"] >= 500].copy()

print("TOP peor (residual alto):")
display(prod_eval_big.sort_values("residual", ascending=False).head(15))

print("TOP mejor (residual bajo):")
display(prod_eval_big.sort_values("residual", ascending=True).head(15))

print("TOP peor (lift alto):")
display(prod_eval_big.sort_values("lift", ascending=False).head(15))

print("TOP mejor (lift bajo):")
display(prod_eval_big.sort_values("lift", ascending=True).head(15))


TOP peor (residual alto):


| | id_producto | categoria | n | rr_real | rr_pred | residual | lift |
|---|---|---|---|---|---|---|---|
| 15 | P060 | sudadera | 2986 | 0.359009 | 0.304 | 0.055008 | 1.180949 |
| 21 | P069 | abrigo | 3207 | 0.43561 | 0.388354 | 0.047256 | 1.121682 |
| 13 | P058 | abrigo | 10660 | 0.40122 | 0.385203 | 0.016016 | 1.041578 |
| 19 | P065 | pantalon | 4098 | 0.344802 | 0.331183 | 0.013619 | 1.041122 |
| 17 | P063 | camiseta | 3050 | 0.261311 | 0.25029 | 0.011022 | 1.044035 |
| 0 | P001 | camiseta | 10841 | 0.259293 | 0.250112 | 0.009181 | 1.036709 |
| 14 | P059 | abrigo | 3027 | 0.39478 | 0.387626 | 0.007154 | 1.018457 |
| 5 | P006 | camiseta | 10903 | 0.264698 | 0.257756 | 0.006942 | 1.026931 |
| 11 | P042 | camisa | 11017 | 0.55369 | 0.552786 | 0.000903 | 1.001634 |
| 3 | P004 | pantalon | 10814 | 0.326891 | 0.326202 | 0.000689 | 1.002112 |


TOP mejor (residual bajo):


| | id_producto | categoria | n | rr_real | rr_pred | residual | lift |
|---|---|---|---|---|---|---|---|
| 20 | P066 | camisa | 3052 | 0.330931 | 0.358334 | -0.027403 | 0.923526 |
| 9 | P020 | abrigo | 10665 | 0.372433 | 0.394179 | -0.021746 | 0.944832 |
| 4 | P005 | abrigo | 10722 | 0.333054 | 0.344757 | -0.011703 | 0.966054 |
| 18 | P064 | camiseta | 6946 | 0.240858 | 0.25232 | -0.011462 | 0.954572 |
| 16 | P062 | camisa | 3130 | 0.346006 | 0.355434 | -0.009428 | 0.973475 |
| 7 | P013 | abrigo | 13200 | 0.361364 | 0.370695 | -0.009331 | 0.974828 |
| 10 | P024 | pantalon | 10891 | 0.316867 | 0.325148 | -0.008281 | 0.974533 |
| 2 | P003 | sudadera | 13760 | 0.298038 | 0.303753 | -0.005716 | 0.981184 |
| 1 | P002 | camiseta | 13442 | 0.250186 | 0.255119 | -0.004933 | 0.980665 |
| 8 | P016 | camiseta | 10848 | 0.220409 | 0.222784 | -0.002374 | 0.989342 |


TOP peor (lift alto):


| | id_producto | categoria | n | rr_real | rr_pred | residual | lift |
|---|---|---|---|---|---|---|---|
| 15 | P060 | sudadera | 2986 | 0.359009 | 0.304 | 0.055008 | 1.180949 |
| 21 | P069 | abrigo | 3207 | 0.43561 | 0.388354 | 0.047256 | 1.121682 |
| 17 | P063 | camiseta | 3050 | 0.261311 | 0.25029 | 0.011022 | 1.044035 |
| 13 | P058 | abrigo | 10660 | 0.40122 | 0.385203 | 0.016016 | 1.041578 |
| 19 | P065 | pantalon | 4098 | 0.344802 | 0.331183 | 0.013619 | 1.041122 |
| 0 | P001 | camiseta | 10841 | 0.259293 | 0.250112 | 0.009181 | 1.036709 |
| 5 | P006 | camiseta | 10903 | 0.264698 | 0.257756 | 0.006942 | 1.026931 |
| 14 | P059 | abrigo | 3027 | 0.39478 | 0.387626 | 0.007154 | 1.018457 |
| 3 | P004 | pantalon | 10814 | 0.326891 | 0.326202 | 0.000689 | 1.002112 |
| 11 | P042 | camisa | 11017 | 0.55369 | 0.552786 | 0.000903 | 1.001634 |


TOP mejor (lift bajo):


| | id_producto | categoria | n | rr_real | rr_pred | residual | lift |
|---|---|---|---|---|---|---|---|
| 20 | P066 | camisa | 3052 | 0.330931 | 0.358334 | -0.027403 | 0.923526 |
| 9 | P020 | abrigo | 10665 | 0.372433 | 0.394179 | -0.021746 | 0.944832 |
| 18 | P064 | camiseta | 6946 | 0.240858 | 0.25232 | -0.011462 | 0.954572 |
| 4 | P005 | abrigo | 10722 | 0.333054 | 0.344757 | -0.011703 | 0.966054 |
| 16 | P062 | camisa | 3130 | 0.346006 | 0.355434 | -0.009428 | 0.973475 |
| 10 | P024 | pantalon | 10891 | 0.316867 | 0.325148 | -0.008281 | 0.974533 |
| 7 | P013 | abrigo | 13200 | 0.361364 | 0.370695 | -0.009331 | 0.974828 |
| 1 | P002 | camiseta | 13442 | 0.250186 | 0.255119 | -0.004933 | 0.980665 |
| 2 | P003 | sudadera | 13760 | 0.298038 | 0.303753 | -0.005716 | 0.981184 |
| 8 | P016 | camiseta | 10848 | 0.220409 | 0.222784 | -0.002374 | 0.989342 |
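
Whether a residual is large depends on the sample size behind it. A rough way to judge this is to compare it with the binomial standard error of the predicted rate. The sketch below is only an approximation: `residual_z` is a hypothetical helper, it treats `rr_pred` as a fixed rate, and it assumes items are independent, which is probably optimistic for repeat customers.

```python
import math


def residual_z(rr_real: float, rr_pred: float, n: int) -> float:
    """Hypothetical helper: z-score of the observed rate against the predicted one,
    using the binomial standard error under the predicted rate."""
    se = math.sqrt(rr_pred * (1.0 - rr_pred) / n)
    return (rr_real - rr_pred) / se


# Values copied from the tables above
z_p060 = residual_z(0.359009, 0.304, 2986)     # largest positive residual
z_p042 = residual_z(0.55369, 0.552786, 11017)  # near-zero residual
print(round(z_p060, 2), round(z_p042, 2))
```

Under these assumptions, P060 sits several standard errors away from its predicted rate, while P042 is well inside the noise band.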


In [72]:
os.makedirs("modelos", exist_ok=True)
os.makedirs("modelos/xgb_devoluciones", exist_ok=True)

MODEL_DIR = "modelos/xgb_devoluciones"

# Booster (XGBoost)
booster.save_model(os.path.join(MODEL_DIR, "xgb_booster.json"))

# Preprocesado y calibradores
import pickle
with open(os.path.join(MODEL_DIR, "preprocess.pkl"), "wb") as f:
    pickle.dump(preprocess, f)

with open(os.path.join(MODEL_DIR, "platt.pkl"), "wb") as f:
    pickle.dump(platt, f)

with open(os.path.join(MODEL_DIR, "isotonic.pkl"), "wb") as f:
    pickle.dump(iso, f)

# Perfil de producto entrenado en train
train_artifacts = {
    "prod_profile_train": prod_profile_train,
}
with open(os.path.join(MODEL_DIR, "prod_profile_train.pkl"), "wb") as f:
    pickle.dump(train_artifacts, f)

with open(os.path.join(MODEL_DIR, "prod_meta_train.json"), "w") as f:
    json.dump(prod_meta_train, f, indent=2)

# Configuración mínima para reproducibilidad
config = {
    "cut_train_end": str(cut_train_end.date()),
    "cut_calib_end": str(cut_calib_end.date()),
    "target": TARGET,
    "cat_features": cat_features,
    "num_features": num_features,
    "xgb_params": params,
    "best_iteration": int(getattr(booster, "best_iteration", -1)),
}
with open(os.path.join(MODEL_DIR, "config.json"), "w") as f:
    json.dump(config, f, indent=2)

print(f"Modelo y artefactos guardados en: {MODEL_DIR}")


Modelo y artefactos guardados en: modelos/xgb_devoluciones
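
For completeness, reading the artifacts back could look roughly like this. It is a minimal round-trip sketch using only the standard library: `load_pickle` and `load_config` are hypothetical helpers, and the file contents written here are placeholders, not the real model. The booster itself would be restored separately with `xgb.Booster()` followed by `load_model(...)`, which is left out to keep the sketch dependency-free.

```python
import json
import os
import pickle
import tempfile


def load_pickle(path: str):
    """Load one pickled artifact (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_config(model_dir: str) -> dict:
    """Read config.json from the model directory."""
    with open(os.path.join(model_dir, "config.json"), "r") as f:
        return json.load(f)


# Round trip in a temporary directory with placeholder contents (assumption:
# the real directory would be modelos/xgb_devoluciones with the files saved above).
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "config.json"), "w") as f:
        json.dump({"target": "devuelto", "best_iteration": 120}, f, indent=2)
    with open(os.path.join(tmp_dir, "platt.pkl"), "wb") as f:
        pickle.dump({"placeholder": "calibrator"}, f)

    cfg = load_config(tmp_dir)
    calibrator = load_pickle(os.path.join(tmp_dir, "platt.pkl"))

print(cfg["target"], cfg["best_iteration"], calibrator)
```

Pickled scikit-learn objects are generally only safe to load with the same library versions used to save them, so recording those versions next to `config.json` may be worth considering.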


# Recommender

In [73]:
def clamp_idx(idx: int, lo: int = 0, hi: int | None = None) -> int:
    """Asegura que un índice de talla cae dentro del rango permitido."""
    if hi is None:
        hi = len(TALLAS_ROPA) - 1
    return int(max(lo, min(hi, int(idx))))


def make_candidates(ideal_idx: int, max_step: int = 2) -> list[str]:
    """
    Genera tallas candidatas alrededor de la talla ideal, restringiendo el salto
    máximo permitido.
    """
    lo = clamp_idx(ideal_idx - max_step)
    hi = clamp_idx(ideal_idx + max_step)
    return [IDX_TO_TALLA[i] for i in range(lo, hi + 1)]
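
A quick way to see how the window behaves at the edges of the size ladder. The functions are repeated so that the snippet runs on its own; the ladder `["XS", "S", "M", "L", "XL"]` is an assumption, since `TALLAS_ROPA` and `IDX_TO_TALLA` are defined elsewhere in the notebook.

```python
# Assumed size ladder; the notebook defines TALLAS_ROPA elsewhere
TALLAS_ROPA = ["XS", "S", "M", "L", "XL"]
IDX_TO_TALLA = dict(enumerate(TALLAS_ROPA))


def clamp_idx(idx, lo=0, hi=None):
    """Keep a size index inside the allowed range."""
    if hi is None:
        hi = len(TALLAS_ROPA) - 1
    return int(max(lo, min(hi, int(idx))))


def make_candidates(ideal_idx, max_step=2):
    """Candidate sizes around the ideal one, at most max_step away."""
    lo = clamp_idx(ideal_idx - max_step)
    hi = clamp_idx(ideal_idx + max_step)
    return [IDX_TO_TALLA[i] for i in range(lo, hi + 1)]


edge = make_candidates(0, max_step=2)    # window truncated at the lower end
middle = make_candidates(2, max_step=1)  # symmetric window around M
print(edge, middle)
```

Because the window is clamped rather than shifted, customers whose ideal size is at either end get fewer candidates than those in the middle.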


In [74]:
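def score_rows(df_rows: pd.DataFrame) -> np.ndarray:
    """
    Score rows with the returns model and return the calibrated p_dev.

    Note: this function calibrates with the isotonic regressor (`iso`), whereas the
    product-level evaluation above used the Platt scores (`p_test_platt`).

    Requires in the environment:
      - cat_features, num_features
      - preprocess (ColumnTransformer)
      - booster (xgboost Booster)
      - iso (IsotonicRegression)
    """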
    X = df_rows[cat_features + num_features].copy()
    Xmat = preprocess.transform(X)

    dmat = xgb.DMatrix(Xmat)

    try:
        p_raw = booster.predict(dmat, iteration_range=(0, booster.best_iteration + 1))
    except TypeError:
        p_raw = booster.predict(dmat, ntree_limit=getattr(booster, "best_ntree_limit", 0) or 0)

    return iso.predict(p_raw)


In [75]:
def build_scenarios(df_base: pd.DataFrame, max_step: int = 2) -> pd.DataFrame:
    """
    Expande cada compra en múltiples escenarios (uno por talla candidata).

    Solo recalcula variables dependientes de la talla:
      - talla, talla_idx
      - desajuste, desajuste_abs
      - talla_extrema

    Mantiene fijo:
      - ideal_idx
      - variables corporales (altura/peso/bmi)
      - perfil de producto (prod_*)
      - categoría / id_producto
    """
    if "ideal_idx" not in df_base.columns:
        raise ValueError("df_base debe contener la columna ideal_idx.")

    base = df_base.copy().reset_index(drop=False).rename(columns={"index": "_orig_idx"})

    cands = base["ideal_idx"].apply(lambda i: make_candidates(int(i), max_step=max_step))

    scen = base.loc[base.index.repeat(cands.str.len())].copy()
    scen["talla"] = np.concatenate(cands.values)

    scen["talla_idx"] = scen["talla"].map(TALLA_TO_IDX).astype("int8")
    scen["desajuste"] = (scen["talla_idx"].astype(int) - scen["ideal_idx"].astype(int)).astype("int16")
    scen["desajuste_abs"] = scen["desajuste"].abs().astype("int16")
    scen["talla_extrema"] = scen["talla"].isin(["XS", "XL"]).astype("int8")

    return scen
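
The expansion step relies on repeating each base row once per candidate and then flattening the candidate lists in the same order. A toy version of that pattern, with hand-written candidate lists instead of `make_candidates` and made-up item labels:

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({"item": ["a", "b"], "ideal_idx": [0, 2]})
# Candidate lists per row (hand-written here; make_candidates would produce them)
cands = pd.Series([["XS", "S"], ["S", "M", "L"]])

# Repeat each base row once per candidate, then flatten the lists in the same order
scen = base.loc[base.index.repeat(cands.str.len())].copy()
scen["talla"] = np.concatenate(cands.values)
print(scen)
```

The alignment only holds because `base` has a default 0..n-1 index at this point, which is why `build_scenarios` resets the index before expanding.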


In [76]:
def recommend_sizes(
    df_base: pd.DataFrame,
    max_step: int = 2,
    min_gain: float = 0.01,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Recommend the size that minimizes p_dev within the neighborhood of the ideal size.

    min_gain acts as a minimum-improvement threshold (hysteresis):
      - if the expected improvement (delta_p) is below min_gain, the current size is kept.

    Returns:
      - recs: one row per purchase with the recommendation and its metrics
      - scen_scored: scenarios with p_dev per size (useful for auditing)
    """
    scen = build_scenarios(df_base, max_step=max_step)
    scen["p_dev"] = score_rows(scen).astype("float32")

    best = (
        scen.sort_values(["_orig_idx", "p_dev"], ascending=[True, True])
            .groupby("_orig_idx", as_index=False)
            .head(1)
            .rename(columns={"talla": "talla_reco", "p_dev": "p_dev_reco"})
    )

    base_scored = df_base.copy().reset_index(drop=False).rename(columns={"index": "_orig_idx"})
    base_scored["p_dev_actual"] = score_rows(base_scored).astype("float32")

    keep_cols = [
        "_orig_idx", "id_producto", "categoria", "talla",
        "altura_cm", "peso_kg", "ideal_idx", "devuelto", "p_dev_actual"
    ]

    recs = (
        base_scored[keep_cols]
        .merge(
            best[["_orig_idx", "talla_reco", "p_dev_reco", "desajuste", "desajuste_abs"]],
            on="_orig_idx",
            how="left",
        )
    )

    recs["delta_p"] = (recs["p_dev_actual"] - recs["p_dev_reco"]).astype("float32")
    recs["cambia_talla_raw"] = (recs["talla"] != recs["talla_reco"]).astype("int8")

    # Regla de decisión final (evita cambios por ruido)
    recs["talla_final"] = np.where(
        recs["delta_p"] >= float(min_gain),
        recs["talla_reco"],
        recs["talla"],
    )
    recs["cambia_talla"] = (recs["talla_final"] != recs["talla"]).astype("int8")

    # p_dev final esperado (si cambio aplico p_dev_reco; si no, p_dev_actual)
    recs["p_dev_final"] = np.where(
        recs["cambia_talla"] == 1,
        recs["p_dev_reco"],
        recs["p_dev_actual"],
    ).astype("float32")

    recs["delta_p_final"] = (recs["p_dev_actual"] - recs["p_dev_final"]).astype("float32")

    return recs, scen
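
The decision rule is easiest to check on a handful of rows. The toy values below are invented for illustration; only the column names and the `min_gain` logic mirror `recommend_sizes`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "talla":        ["S", "M", "L"],
    "talla_reco":   ["M", "M", "M"],
    "p_dev_actual": [0.60, 0.30, 0.305],
    "p_dev_reco":   [0.30, 0.30, 0.300],
})
min_gain = 0.01

toy["delta_p"] = toy["p_dev_actual"] - toy["p_dev_reco"]
# Switch only when the expected improvement reaches the threshold
toy["talla_final"] = np.where(toy["delta_p"] >= min_gain, toy["talla_reco"], toy["talla"])
toy["cambia_talla"] = (toy["talla_final"] != toy["talla"]).astype("int8")
print(toy[["talla", "talla_reco", "delta_p", "talla_final", "cambia_talla"]])
```

The third row shows the point of the threshold: the recommended size is nominally better, but the gain of about half a point is below `min_gain`, so the purchased size is kept.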


In [77]:
sample_n = 20000
test_small = test_df2.sample(sample_n, random_state=7).copy()

recs, scen_scored = recommend_sizes(test_small, max_step=2, min_gain=0.01)

print("Mejora media esperada (delta_p_final):", float(recs["delta_p_final"].mean()))
print("Pct compras donde cambia talla:", float(recs["cambia_talla"].mean()))
print(
    "Mejora media esperada SOLO donde cambia:",
    float(recs.loc[recs["cambia_talla"] == 1, "delta_p_final"].mean())
)

display(recs.sort_values("delta_p_final", ascending=False).head(20))

res_cat = (
    recs.groupby("categoria", as_index=False)
        .agg(
            n=("delta_p_final", "size"),
            pct_change=("cambia_talla", "mean"),
            delta_p_mean=("delta_p_final", "mean"),
        )
        .sort_values("delta_p_mean", ascending=False)
)
display(res_cat)


Mejora media esperada (delta_p_final): 0.034451983869075775
Pct compras donde cambia talla: 0.17905
Mejora media esperada SOLO donde cambia: 0.19241543114185333
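
These three figures are tied together: since `delta_p_final` is zero wherever the size is kept, the overall mean should equal the share of changed purchases times the mean gain among them. A quick arithmetic check using the printed values:

```python
pct_change = 0.17905                      # share of purchases where the size changes
mean_gain_changed = 0.19241543114185333   # mean delta_p_final where it changes
reported_overall = 0.034451983869075775   # mean delta_p_final over all purchases

reconstructed = pct_change * mean_gain_changed
print(reconstructed, abs(reconstructed - reported_overall))
```

The two numbers agree up to float32 rounding, which suggests the overall improvement is driven entirely by the roughly 18% of purchases where the recommender intervenes.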


| | _orig_idx | id_producto | categoria | talla | altura_cm | peso_kg | ideal_idx | devuelto | p_dev_actual | talla_reco | p_dev_reco | desajuste | desajuste_abs | delta_p | cambia_talla_raw | talla_final | cambia_talla | p_dev_final | delta_p_final |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10030 | 39654 | P024 | pantalon | S | 177.8 | 81.5 | 3 | 1 | 0.953488 | L | 0.277466 | 0 | 0 | 0.676022 | 1 | L | 1 | 0.277466 | 0.676022 |
| 2897 | 1865 | P024 | pantalon | L | 162.8 | 60.4 | 1 | 1 | 0.888889 | S | 0.259179 | 0 | 0 | 0.62971 | 1 | S | 1 | 0.259179 | 0.62971 |
| 18270 | 119804 | P058 | abrigo | S | 181.7 | 92.3 | 3 | 1 | 0.953488 | L | 0.327632 | 0 | 0 | 0.625856 | 1 | L | 1 | 0.327632 | 0.625856 |
| 8150 | 155260 | P004 | pantalon | S | 183.8 | 95.4 | 3 | 1 | 0.888889 | L | 0.272097 | 0 | 0 | 0.616791 | 1 | L | 1 | 0.272097 | 0.616791 |
| 9675 | 74343 | P008 | pantalon | L | 168.8 | 66.5 | 1 | 0 | 0.888889 | S | 0.300976 | 0 | 0 | 0.587913 | 1 | S | 1 | 0.300976 | 0.587913 |
| 4022 | 126006 | P005 | abrigo | M | 157.5 | 60.7 | 0 | 1 | 0.888889 | XS | 0.300976 | 0 | 0 | 0.587913 | 1 | XS | 1 | 0.300976 | 0.587913 |
| 7978 | 179664 | P004 | pantalon | XL | 169.7 | 73.2 | 2 | 1 | 0.851064 | M | 0.277466 | 0 | 0 | 0.573598 | 1 | M | 1 | 0.277466 | 0.573598 |
| 15962 | 50623 | P024 | pantalon | L | 167.1 | 66.6 | 1 | 1 | 0.851064 | S | 0.277466 | 0 | 0 | 0.573598 | 1 | S | 1 | 0.277466 | 0.573598 |
| 2075 | 50625 | P024 | pantalon | L | 167.1 | 66.6 | 1 | 1 | 0.851064 | S | 0.277466 | 0 | 0 | 0.573598 | 1 | S | 1 | 0.277466 | 0.573598 |
| 6562 | 161920 | P064 | camiseta | XL | 173.3 | 78.6 | 2 | 1 | 0.795181 | M | 0.22963 | 0 | 0 | 0.565551 | 1 | M | 1 | 0.22963 | 0.565551 |


| | categoria | n | pct_change | delta_p_mean |
|---|---|---|---|---|
| 0 | abrigo | 6676 | 0.235021 | 0.049749 |
| 1 | camisa | 1820 | 0.356593 | 0.035243 |
| 3 | pantalon | 4105 | 0.122046 | 0.02902 |
| 4 | sudadera | 1651 | 0.119322 | 0.027005 |
| 2 | camiseta | 5748 | 0.115692 | 0.022454 |


In [78]:
import os

os.makedirs("data/processed", exist_ok=True)

recs.to_pickle("data/processed/recs_test_small.pkl")

# Escenarios: útil para debug, pero puede crecer mucho si aumentas sample_n
scen_scored.to_pickle("data/processed/scen_test_small.pkl")

print("Guardado completado:")
print("- data/processed/recs_test_small.pkl")
print("- data/processed/scen_test_small.pkl")


Guardado completado:
- data/processed/recs_test_small.pkl
- data/processed/scen_test_small.pkl


# Analysis

In [79]:
import pandas as pd
import numpy as np

COST_FILE = "data/items_devoluciones_ajustadas.csv"
KEYS = ["item_id", "ticket_id"]

items_cost = pd.read_csv(
    COST_FILE,
    usecols=["item_id", "ticket_id", "coste_devolucion"]
)

# Normalización de claves
for c in KEYS:
    items_cost[c] = items_cost[c].astype(str).str.strip()

# Coste a numérico
items_cost["coste_devolucion"] = pd.to_numeric(items_cost["coste_devolucion"], errors="coerce")

print(" Cost file cargado")
print(f"- Filas: {items_cost.shape[0]:,}")
print(f"- NA en coste_devolucion: {items_cost['coste_devolucion'].isna().mean():.2%}")

# Auditoría de duplicados por clave
dup_mask = items_cost.duplicated(subset=KEYS, keep=False)
n_dup_rows = int(dup_mask.sum())

if n_dup_rows > 0:
    n_dup_keys = items_cost.loc[dup_mask, KEYS].drop_duplicates().shape[0]
    print("\n Atención: existen claves duplicadas en el fichero de costes")
    print(f"- Filas duplicadas: {n_dup_rows:,}")
    print(f"- Claves únicas afectadas: {n_dup_keys:,}")
    print("Ejemplo de duplicados:")
    display(items_cost.loc[dup_mask].head(10))
else:
    print("\n No hay duplicados por (item_id, ticket_id)")


 Cost file cargado
- Filas: 905,445
- NA en coste_devolucion: 0.00%

 No hay duplicados por (item_id, ticket_id)


In [80]:
# Resolver duplicados por clave con una política estable
# (conservadora: si hay varias filas, nos quedamos con el máximo coste informado)
items_cost_resolved = (
    items_cost
    .groupby(KEYS, as_index=False)
    .agg(coste_devolucion=("coste_devolucion", "max"))
)

resolved_drop = items_cost.shape[0] - items_cost_resolved.shape[0]

print("\n Tabla de costes preparada para merge")
print(f"- Filas originales: {items_cost.shape[0]:,}")
print(f"- Filas tras resolver duplicados: {items_cost_resolved.shape[0]:,}")
print(f"- Filas colapsadas por duplicados: {resolved_drop:,}")

print("\nResumen coste_devolucion (tabla resuelta):")
display(items_cost_resolved["coste_devolucion"].describe())


 Tabla de costes preparada para merge
- Filas originales: 905,445
- Filas tras resolver duplicados: 905,445
- Filas colapsadas por duplicados: 0

Resumen coste_devolucion (tabla resuelta):


count    905445.000000
mean          3.105396
std           1.122479
min           0.530000
25%           2.170000
50%           2.940000
75%           3.970000
max           6.910000
Name: coste_devolucion, dtype: float64
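
In this run the file has no duplicate keys, so the resolution step is a no-op. To see what the policy would do if duplicates appeared, here is a toy example with made-up identifiers and costs:

```python
import pandas as pd

KEYS = ["item_id", "ticket_id"]
# Made-up rows: the first key appears twice with different costs
toy_cost = pd.DataFrame({
    "item_id":   ["T1-001", "T1-001", "T2-001"],
    "ticket_id": ["T1", "T1", "T2"],
    "coste_devolucion": [2.10, 3.40, 1.95],
})

dup_rows = int(toy_cost.duplicated(subset=KEYS, keep=False).sum())
resolved = (
    toy_cost
    .groupby(KEYS, as_index=False)
    .agg(coste_devolucion=("coste_devolucion", "max"))
)
print(dup_rows)
print(resolved)
```

Taking the maximum is conservative in the sense that it never understates the cost of a return, at the price of possibly overstating savings for items with conflicting records.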

In [81]:
def add_coste_devolucion(df_base: pd.DataFrame, cost_table: pd.DataFrame) -> pd.DataFrame:
    """
    Añade coste_devolucion al dataset base mediante merge por (item_id, ticket_id).
    Deja un flag has_coste y rellena NA con 0 para permitir métricas en € sin nulos.
    """
    df = df_base.copy()

    # Validación de llaves en df_base
    for c in KEYS:
        if c not in df.columns:
            raise KeyError(f"El dataset base no contiene la columna '{c}'.")

        df[c] = df[c].astype(str).str.strip()

    out = df.merge(
        cost_table,
        on=KEYS,
        how="left",
        validate="m:1"  # muchos items en df_base pueden mapear a una fila de coste
    )

    out["has_coste"] = out["coste_devolucion"].notna().astype("int8")

    # Política para BI: NA -> 0, y lo indicamos con has_coste
    out["coste_devolucion"] = out["coste_devolucion"].fillna(0.0).astype("float32")

    # Sanity checks
    neg_rate = float((out["coste_devolucion"] < 0).mean())
    if neg_rate > 0:
        print(f"⚠️ Atención: hay costes negativos ({neg_rate:.2%}). Revisa el fichero de entrada.")

    return out


test_df2_with_cost = add_coste_devolucion(test_df2, items_cost_resolved)

print("\n Merge completado: coste_devolucion añadido a test_df2")
print(f"- NA rate tras merge (debería ser 0 por fillna): {test_df2_with_cost['coste_devolucion'].isna().mean():.2%}")
print(f"- Cobertura de coste (has_coste=1): {test_df2_with_cost['has_coste'].mean():.2%}")

print("\nResumen coste_devolucion (post-merge):")
display(test_df2_with_cost["coste_devolucion"].describe())

print("\nEjemplos:")
display(test_df2_with_cost[["item_id", "ticket_id", "devuelto", "coste_devolucion", "has_coste"]].head(10))


 Merge completado: coste_devolucion añadido a test_df2
- NA rate tras merge (debería ser 0 por fillna): 0.00%
- Cobertura de coste (has_coste=1): 100.00%

Resumen coste_devolucion (post-merge):


count    191954.000000
mean          3.465760
std           1.123252
min           1.490000
25%           2.460000
50%           3.290000
75%           4.440000
max           6.910000
Name: coste_devolucion, dtype: float64


Ejemplos:


| | item_id | ticket_id | devuelto | coste_devolucion | has_coste |
|---|---|---|---|---|---|
| 0 | T000016-001 | T000016 | 1 | 1.93 | 1 |
| 1 | T000084-001 | T000084 | 1 | 2.98 | 1 |
| 2 | T000975-001 | T000975 | 0 | 4.42 | 1 |
| 3 | T001279-001 | T001279 | 0 | 4.44 | 1 |
| 4 | T001279-002 | T001279 | 0 | 3.55 | 1 |
| 5 | T001368-001 | T001368 | 1 | 1.98 | 1 |
| 6 | T001750-001 | T001750 | 0 | 2.89 | 1 |
| 7 | T003130-001 | T003130 | 0 | 4.32 | 1 |
| 8 | T003131-001 | T003131 | 0 | 4.31 | 1 |
| 9 | T003386-001 | T003386 | 0 | 3.3 | 1 |
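
Coverage is 100% here, so the `fillna(0.0)` policy never fires. If some items lacked a cost, the zero fill would pull averages down unless the `has_coste` flag is used as a filter. A toy illustration with made-up rows:

```python
import pandas as pd

KEYS = ["item_id", "ticket_id"]
# Made-up data: the second item has no entry in the cost table
items = pd.DataFrame({"item_id": ["T1-001", "T2-001"], "ticket_id": ["T1", "T2"]})
costs = pd.DataFrame({"item_id": ["T1-001"], "ticket_id": ["T1"], "coste_devolucion": [2.5]})

out = items.merge(costs, on=KEYS, how="left", validate="m:1")
out["has_coste"] = out["coste_devolucion"].notna().astype("int8")
out["coste_devolucion"] = out["coste_devolucion"].fillna(0.0)

# Average over costed rows only, so zero-filled rows do not pull the mean down
mean_cost_covered = out.loc[out["has_coste"] == 1, "coste_devolucion"].mean()
mean_cost_naive = out["coste_devolucion"].mean()
print(out)
print(mean_cost_covered, mean_cost_naive)
```

The `validate="m:1"` argument makes the merge fail loudly if the cost table ever contains repeated keys, which is what keeps the row count of the base dataset stable.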


In [85]:
import pandas as pd
import numpy as np

COST_FILE = "data/items_devoluciones_ajustadas.csv"
KEYS = ["item_id", "ticket_id"]

# 1) Cargar costes
items_cost = pd.read_csv(COST_FILE, usecols=["item_id", "ticket_id", "coste_devolucion"]).copy()

for c in KEYS:
    items_cost[c] = items_cost[c].astype(str).str.strip()

items_cost["coste_devolucion"] = pd.to_numeric(items_cost["coste_devolucion"], errors="coerce")

# 2) Resolver duplicados por clave (conservador: máximo coste)
items_cost = (
    items_cost
    .groupby(KEYS, as_index=False)
    .agg(coste_devolucion=("coste_devolucion", "max"))
)

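# 3) Merge into test_df2 (and reassign)
# Note: merge() returns a fresh RangeIndex, while recs["_orig_idx"] refers to the
# pre-merge index of test_df2; the later join on _orig_idx assumes both coincide.
for c in KEYS:
    if c not in test_df2.columns:
        raise KeyError(f"test_df2 is missing '{c}'; cannot merge the cost.")
    test_df2[c] = test_df2[c].astype(str).str.strip()

# Drop any previous cost columns so the cell can be re-run without _x/_y suffixes
test_df2 = test_df2.drop(columns=["coste_devolucion", "has_coste"], errors="ignore")
test_df2 = test_df2.merge(items_cost, on=KEYS, how="left", validate="m:1")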

# 4) Flags y NA policy
test_df2["has_coste"] = test_df2["coste_devolucion"].notna().astype("int8")
test_df2["coste_devolucion"] = test_df2["coste_devolucion"].fillna(0.0).astype("float32")

print(" coste_devolucion añadido a test_df2")
print(f"- Cobertura (has_coste=1): {test_df2['has_coste'].mean():.2%}")
print(f"- coste_devolucion NA rate: {test_df2['coste_devolucion'].isna().mean():.2%}")
print("Resumen coste_devolucion:")
display(test_df2["coste_devolucion"].describe())


 coste_devolucion añadido a test_df2
- Cobertura (has_coste=1): 100.00%
- coste_devolucion NA rate: 0.00%
Resumen coste_devolucion:


count    191954.000000
mean          3.465760
std           1.123252
min           1.490000
25%           2.460000
50%           3.290000
75%           4.440000
max           6.910000
Name: coste_devolucion, dtype: float64

In [86]:
import os
import numpy as np
import pandas as pd

# Carpeta de salida para BI
BI_DIR = "data/bi"
os.makedirs(BI_DIR, exist_ok=True)

# Checks mínimos para evitar merges rotos
required_base_cols = ["item_id", "ticket_id", "fecha_item", "categoria", "id_producto", "talla", "devuelto", "coste_devolucion"]
missing_base = [c for c in required_base_cols if c not in test_df2.columns]
if missing_base:
    raise KeyError(f"test_df2 no tiene columnas necesarias: {missing_base}")

required_recs_cols = ["_orig_idx", "talla_final", "p_dev_actual", "p_dev_final", "delta_p_final", "cambia_talla"]
missing_recs = [c for c in required_recs_cols if c not in recs.columns]
if missing_recs:
    raise KeyError(f"recs no tiene columnas necesarias: {missing_recs}")

print(" Checks OK: test_df2 y recs tienen las columnas mínimas para construir el dataset BI.")


 Checks OK: test_df2 y recs tienen las columnas mínimas para construir el dataset BI.


In [87]:
def build_bi_item_dataset(df_base: pd.DataFrame, recs: pd.DataFrame) -> pd.DataFrame:
    """
    Devuelve un dataset plano a nivel item, listo para exportar a Power BI.
    Une recs con el universo base mediante _orig_idx (índice original del df_base).
    """
    base = df_base.copy().reset_index(drop=False).rename(columns={"index": "_orig_idx"})

    # Selección de columnas base: añadimos las que suelen interesar en BI si existen
    base_cols = [
        "_orig_idx",
        "item_id", "ticket_id", "customer_id", "canal",
        "sku", "id_producto", "categoria",
        "fecha_item",
        "talla", "altura_cm", "peso_kg", "bmi",
        "ideal_idx", "talla_idx", "desajuste", "desajuste_abs", "talla_extrema",
        "devuelto",
        "coste_devolucion",
        "p_dev_global",  # puede no existir, lo gestionamos abajo
    ]
    base_cols = [c for c in base_cols if c in base.columns]

    # Parte de recs: lo que genera el recomendador
    recs_cols = [
        "_orig_idx",
        "talla_reco", "talla_final",
        "p_dev_actual", "p_dev_reco", "p_dev_final",
        "delta_p", "delta_p_final",
        "cambia_talla_raw", "cambia_talla",
    ]
    recs_cols = [c for c in recs_cols if c in recs.columns]

    out = (
        base[base_cols]
        .merge(recs[recs_cols], on="_orig_idx", how="left", validate="1:1")
    )

    # Seguridad: si no hay p_dev_global, lo dejamos como NaN y seguimos (para BI igual sirve)
    if "p_dev_global" not in out.columns:
        out["p_dev_global"] = np.nan

    # Métricas económicas (esperadas)
    out["expected_cost_global"] = (out["p_dev_global"] * out["coste_devolucion"]).astype("float64")
    out["expected_savings_talla"] = (out["delta_p_final"] * out["coste_devolucion"]).astype("float64")

    # Métricas auxiliares de BI
    out["year"] = pd.to_datetime(out["fecha_item"]).dt.year
    out["month"] = pd.to_datetime(out["fecha_item"]).dt.to_period("M").astype(str)

    return out


bi_items = build_bi_item_dataset(test_df2, recs)

print(" Dataset BI (items) creado.")
print(f"- Filas: {bi_items.shape[0]:,}")
print(f"- Columnas: {bi_items.shape[1]:,}")
display(bi_items.head(5))


 Dataset BI (items) creado.
- Filas: 191,954
- Columnas: 34


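| | _orig_idx | item_id | ticket_id | customer_id | canal | sku | id_producto | categoria | fecha_item | talla | ... | p_dev_final | delta_p | delta_p_final | cambia_talla_raw | cambia_talla | p_dev_global | expected_cost_global | expected_savings_talla | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | T000016-001 | T000016 | C000010 | online | P001-BEI-M | P001 | camiseta | 2024-10-31 | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2024 | 2024-10 |
| 1 | 1 | T000084-001 | T000084 | C000059 | online | P004-NAV-S | P004 | pantalon | 2024-11-06 | S | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2024 | 2024-11 |
| 2 | 2 | T000975-001 | T000975 | C000657 | online | P020-BLK-XL | P020 | abrigo | 2024-10-05 | XL | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2024 | 2024-10 |
| 3 | 3 | T001279-001 | T001279 | C000874 | online | P020-BRN-M | P020 | abrigo | 2025-02-17 | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2025 | 2025-02 |
| 4 | 4 | T001279-002 | T001279 | C000874 | online | P003-BRN-M | P003 | sudadera | 2025-02-17 | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2025 | 2025-02 |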
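
The empty recommender columns in these first rows are expected: `recs` only covers the sampled purchases, and the left merge leaves every other item without a recommendation. The toy example below (invented values) shows how that interacts with pandas aggregations, which skip missing values by default.

```python
import pandas as pd

base = pd.DataFrame({"_orig_idx": [0, 1, 2, 3], "coste_devolucion": [3.0, 2.0, 4.0, 5.0]})
# Only two of the four items were scored by the recommender
recs_toy = pd.DataFrame({
    "_orig_idx": [1, 3],
    "delta_p_final": [0.10, 0.0],
    "cambia_talla": [1, 0],
})

bi = base.merge(recs_toy, on="_orig_idx", how="left", validate="1:1")
bi["expected_savings_talla"] = bi["delta_p_final"] * bi["coste_devolucion"]

n_rows = len(bi)
n_scored = int(bi["delta_p_final"].notna().sum())
total_savings = float(bi["expected_savings_talla"].sum())  # NaN rows are skipped
pct_interv = float(bi["cambia_talla"].mean())              # mean over scored rows only
print(n_rows, n_scored, total_savings, pct_interv)
```

Row counts therefore describe the whole dataset, while sums and means describe only the scored subset. Any report built on this table should probably state both sizes.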


In [88]:
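def print_exec_summary(bi_items: pd.DataFrame, adoption_rate: float = 1.0) -> None:
    """
    Print an executive summary with business-oriented metrics.
    adoption_rate simulates a realistic acceptance level (0.2, 0.4, 0.6...).

    Note: rows without a recommendation carry NaN in cambia_talla and
    expected_savings_talla, so means and sums cover only the scored rows,
    while n_items counts every row in bi_items.
    """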
    df = bi_items.copy()

    n_items = len(df)
    pct_interv = float(df["cambia_talla"].mean())
    n_interv = int(df["cambia_talla"].sum())

    total_savings = float(df["expected_savings_talla"].sum())
    total_savings_adopt = total_savings * float(adoption_rate)

    mean_savings_item_all = float(df["expected_savings_talla"].mean())
    mean_savings_item_interv = float(df.loc[df["cambia_talla"] == 1, "expected_savings_talla"].mean()) if n_interv > 0 else 0.0

    # Métricas por pedido (si ticket_id existe)
    if "ticket_id" in df.columns:
        per_ticket = (
            df.groupby("ticket_id", as_index=False)
              .agg(
                  n_items=("item_id", "size") if "item_id" in df.columns else ("ticket_id", "size"),
                  n_interv=("cambia_talla", "sum"),
                  savings=("expected_savings_talla", "sum"),
              )
        )
        per_ticket["has_interv"] = (per_ticket["n_interv"] > 0).astype(int)

        pct_tickets_interv = float(per_ticket["has_interv"].mean())
        mean_savings_ticket_interv = float(per_ticket.loc[per_ticket["has_interv"] == 1, "savings"].mean()) if per_ticket["has_interv"].sum() > 0 else 0.0
    else:
        pct_tickets_interv = np.nan
        mean_savings_ticket_interv = np.nan

    # Share atacable si existe p_dev_global
    has_global = df["p_dev_global"].notna().any()
    if has_global:
        total_expected_cost = float(df["expected_cost_global"].sum())
        share_cost_addr = (total_savings / total_expected_cost) if total_expected_cost > 0 else np.nan
    else:
        total_expected_cost = np.nan
        share_cost_addr = np.nan

    print("RESUMEN EJECUTIVO — Recomendador de tallas + modelo de devoluciones")
    print(f"Universo analizado: {n_items:,} items")
    print(f"Intervención (cambio de talla final): {n_interv:,} items ({pct_interv:.2%})")

    print("\nImpacto económico esperado (sin A/B; probabilidad × coste):")
    print(f"- Ahorro esperado total por talla (adopción 100%): {total_savings:,.2f} €")
    print(f"- Ahorro esperado total ajustado por adopción ({adoption_rate:.0%}): {total_savings_adopt:,.2f} €")

    print("\nImpacto medio (útil para producto/UX):")
    print(f"- Ahorro medio por item (incluyendo no intervenidos): {mean_savings_item_all:,.4f} €")
    print(f"- Ahorro medio por item intervenido: {mean_savings_item_interv:,.2f} €")

    if not np.isnan(pct_tickets_interv):
        print("\nA nivel pedido (ticket):")
        print(f"- % pedidos con al menos 1 recomendación: {pct_tickets_interv:.2%}")
        print(f"- Ahorro medio por pedido intervenido: {mean_savings_ticket_interv:,.2f} €")

    if has_global:
        print("\nContexto (modelo global):")
        print(f"- Coste esperado total de devoluciones: {total_expected_cost:,.2f} €")
        print(f"- Share del coste atacable por talla: {share_cost_addr:.2%}")
    else:
        print("\nContexto (modelo global): p_dev_global no está disponible en este dataset (se omite share atacable).")

    print("\nLectura recomendada:")
    print("- La métrica 'por item intervenido' es la más representativa del valor cuando el sistema actúa.")
    print("- La métrica global incluye muchos items sin intervención y por eso se diluye.")
    print("=" * 72)


print_exec_summary(bi_items, adoption_rate=1.0)


RESUMEN EJECUTIVO — Recomendador de tallas + modelo de devoluciones
Universo analizado: 191,954 items
Intervención (cambio de talla final): 3,581 items (17.90%)

Impacto económico esperado (sin A/B; probabilidad × coste):
- Ahorro esperado total por talla (adopción 100%): 2,602.65 €
- Ahorro esperado total ajustado por adopción (100%): 2,602.65 €

Impacto medio (útil para producto/UX):
- Ahorro medio por item (incluyendo no intervenidos): 0.1301 €
- Ahorro medio por item intervenido: 0.73 €

A nivel pedido (ticket):
- % pedidos con al menos 1 recomendación: 2.65%
- Ahorro medio por pedido intervenido: 0.74 €

Contexto (modelo global): p_dev_global no está disponible en este dataset (se omite share atacable).

Lectura recomendada:
- La métrica 'por item intervenido' es la más representativa del valor cuando el sistema actúa.
- La métrica global incluye muchos items sin intervención y por eso se diluye.
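
The figures above mix two denominators. The intervention share and the per-item savings are consistent with the 20,000 purchases scored by the recommender (`sample_n`), not with the 191,954 rows of `bi_items`. The arithmetic below uses only numbers printed in this notebook; the adoption rates are the illustrative ones from the docstring, not measured values.

```python
sample_n = 20_000          # size of test_small in the recommender run
n_interv = 3_581           # intervened items reported in the summary
total_savings = 2_602.65   # expected savings at 100% adoption, in EUR
n_universe = 191_954       # rows in bi_items

share_over_sample = n_interv / sample_n
share_over_universe = n_interv / n_universe
savings_per_scored_item = total_savings / sample_n

# Scaling by hypothetical adoption rates
by_adoption = {rate: round(total_savings * rate, 2) for rate in (0.2, 0.4, 0.6)}
print(share_over_sample, share_over_universe, savings_per_scored_item)
print(by_adoption)
```

Read this way, the 2,602.65 € total is the expected saving on the scored sample only. The ticket-level percentage, in contrast, is computed over all tickets and is therefore diluted by unscored items.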


In [89]:
def build_aggregations(bi_items: pd.DataFrame, min_n_prod: int = 500):
    df = bi_items.copy()

    by_cat = (
        df.groupby("categoria", as_index=False)
          .agg(
              n_items=("ticket_id", "size") if "ticket_id" in df.columns else ("categoria", "size"),
              pct_interv=("cambia_talla", "mean"),
              savings_total=("expected_savings_talla", "sum"),
              savings_mean_item=("expected_savings_talla", "mean"),
              savings_mean_interv=("expected_savings_talla", lambda s: float(np.mean(s[df.loc[s.index, "cambia_talla"] == 1])) if df.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
          )
          .sort_values("savings_total", ascending=False)
    )

    by_prod = (
        df.groupby(["id_producto", "categoria"], as_index=False)
          .agg(
              n=("categoria", "size"),
              pct_interv=("cambia_talla", "mean"),
              savings_total=("expected_savings_talla", "sum"),
              savings_mean_interv=("expected_savings_talla", lambda s: float(np.mean(s[df.loc[s.index, "cambia_talla"] == 1])) if df.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
          )
    )
    by_prod_big = by_prod[by_prod["n"] >= int(min_n_prod)].sort_values("savings_total", ascending=False)

    return by_cat, by_prod_big


by_cat, by_prod_big = build_aggregations(bi_items, min_n_prod=500)

print("\n Tabla por categoría (ordenada por ahorro total esperado):")
display(by_cat)

print("\n Top productos (n>=500) por ahorro total esperado:")
display(by_prod_big.head(20))



 Tabla por categoría (ordenada por ahorro total esperado):


Unnamed: 0,categoria,n_items,pct_interv,savings_total,savings_mean_item,savings_mean_interv
0,abrigo,62485,0.235021,1583.959292,0.237262,1.009534
3,pantalon,39494,0.122046,385.982118,0.094027,0.770423
2,camiseta,56030,0.115692,291.269217,0.050673,0.437999
1,camisa,17199,0.356593,175.032288,0.096172,0.269695
4,sudadera,16746,0.119322,166.406744,0.100791,0.844704



 Top productos (n>=500) por ahorro total esperado:


Unnamed: 0,id_producto,categoria,n,pct_interv,savings_total,savings_mean_interv
12,P047,abrigo,11004,0.770728,710.552472,0.77997
7,P013,abrigo,13200,0.128114,247.490754,1.374949
13,P058,abrigo,10660,0.121542,186.394098,1.285477
4,P005,abrigo,10722,0.110052,179.835725,1.416029
9,P020,abrigo,10665,0.119963,170.010256,1.297788
2,P003,sudadera,13760,0.122076,140.374251,0.840564
6,P008,pantalon,13691,0.125884,137.14726,0.77049
11,P042,camisa,11017,0.479381,122.076099,0.218774
10,P024,pantalon,10891,0.119755,108.008666,0.788384
3,P004,pantalon,10814,0.124432,105.841834,0.772568
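
The `savings_mean_interv` lambda reaches back into the outer `df` through `s.index`, which works only while the index is unique. One possible alternative, sketched here on toy data (the column `savings_if_interv` is a hypothetical helper, not part of the notebook), is to mask the savings first and let a plain `"mean"` skip the missing values:

```python
import pandas as pd

# Toy data: illustrative values only.
toy = pd.DataFrame({
    "categoria": ["abrigo", "abrigo", "camisa", "camisa"],
    "cambia_talla": [1, 0, 0, 0],
    "expected_savings_talla": [1.0, 0.0, 0.0, 0.0],
})

# Keep savings only where the size changed; NaN elsewhere so mean() skips them.
toy["savings_if_interv"] = toy["expected_savings_talla"].where(toy["cambia_talla"] == 1)

by_cat_toy = (
    toy.groupby("categoria", as_index=False)
       .agg(
           pct_interv=("cambia_talla", "mean"),
           savings_total=("expected_savings_talla", "sum"),
           savings_mean_interv=("savings_if_interv", "mean"),
       )
)
# Groups without any intervention yield NaN; map them to 0.0 as the lambda does.
by_cat_toy["savings_mean_interv"] = by_cat_toy["savings_mean_interv"].fillna(0.0)
print(by_cat_toy)
```

This gives the same per-group figures as the lambda on the toy data while avoiding a Python callback per group.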


In [90]:
def run_threshold_sweep(
    df_base: pd.DataFrame,
    gains: list[float],
    max_step: int = 2,
    adoption_rate: float = 1.0,
    sample_n: int | None = 20000,
    seed: int = 7,
) -> pd.DataFrame:
    """
    Run the recommender for several min_gain thresholds and return a summary table.
    """
    if sample_n is not None:
        df_eval = df_base.sample(sample_n, random_state=seed).copy()
        scope = f"muestra n={sample_n:,}"
    else:
        df_eval = df_base.copy()
        scope = f"universo completo n={len(df_eval):,}"

    rows = []
    for g in gains:
        recs_g, _ = recommend_sizes(df_eval, max_step=max_step, min_gain=float(g))
        bi_g = build_bi_item_dataset(df_eval, recs_g)

        pct_interv = float(bi_g["cambia_talla"].mean())
        savings_total = float(bi_g["expected_savings_talla"].sum()) * float(adoption_rate)
        savings_per_item = float(bi_g["expected_savings_talla"].mean()) * float(adoption_rate)
        # Mean over intervened items only; 0.0 when no size was changed.
        interv_mask = bi_g["cambia_talla"] == 1
        if interv_mask.any():
            savings_per_interv = float(bi_g.loc[interv_mask, "expected_savings_talla"].mean()) * float(adoption_rate)
        else:
            savings_per_interv = 0.0

        rows.append({
            "scope": scope,
            "min_gain": float(g),
            "pct_interv": pct_interv,
            "savings_total_adj_adoption": savings_total,
            "savings_mean_item_adj_adoption": savings_per_item,
            "savings_mean_interv_item_adj_adoption": savings_per_interv,
        })

        print(f"[min_gain={g:.3f}] intervención={pct_interv:.2%} | ahorro_total(adop)={savings_total:,.2f} €")

    return pd.DataFrame(rows)


gains = [0.00, 0.005, 0.01, 0.02, 0.05]
threshold_curve = run_threshold_sweep(
    test_df2,
    gains=gains,
    max_step=2,
    adoption_rate=1.0,
    sample_n=20000,   # increase it, or set it to None to use the whole test set
    seed=7
)

print("\n Curva fricción vs ahorro (lista para Power BI):")
display(threshold_curve)


[min_gain=0.000] intervención=18.56% | ahorro_total(adop)=2,603.46 €
[min_gain=0.005] intervención=18.02% | ahorro_total(adop)=2,603.32 €
[min_gain=0.010] intervención=17.90% | ahorro_total(adop)=2,602.65 €
[min_gain=0.020] intervención=17.52% | ahorro_total(adop)=2,598.21 €
[min_gain=0.050] intervención=16.02% | ahorro_total(adop)=2,561.79 €

 Curva fricción vs ahorro (lista para Power BI):


Unnamed: 0,scope,min_gain,pct_interv,savings_total_adj_adoption,savings_mean_item_adj_adoption,savings_mean_interv_item_adj_adoption
0,"muestra n=20,000",0.0,0.1856,2603.460727,0.130173,0.701363
1,"muestra n=20,000",0.005,0.18015,2603.320173,0.130166,0.722542
2,"muestra n=20,000",0.01,0.17905,2602.649661,0.130132,0.726794
3,"muestra n=20,000",0.02,0.17515,2598.207219,0.12991,0.741709
4,"muestra n=20,000",0.05,0.16025,2561.785789,0.128089,0.799309
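
The curve shows intervention falling faster than savings as `min_gain` rises, so a threshold can be chosen by fixing how much of the maximum savings must be kept. A minimal sketch of one such rule follows; `pick_min_gain` and `keep_share` are hypothetical names, and the rule is only an illustration, not the notebook's selection criterion:

```python
import pandas as pd

# Values copied from the threshold sweep output (sample n=20,000).
curve = pd.DataFrame({
    "min_gain": [0.0, 0.005, 0.01, 0.02, 0.05],
    "pct_interv": [0.1856, 0.18015, 0.17905, 0.17515, 0.16025],
    "savings_total": [2603.460727, 2603.320173, 2602.649661, 2598.207219, 2561.785789],
})


def pick_min_gain(curve: pd.DataFrame, keep_share: float) -> float:
    """Largest min_gain whose total savings stay >= keep_share of the maximum."""
    floor = keep_share * curve["savings_total"].max()
    ok = curve[curve["savings_total"] >= floor]
    return float(ok["min_gain"].max())


chosen = pick_min_gain(curve, keep_share=0.999)
relaxed = pick_min_gain(curve, keep_share=0.99)
print(chosen, relaxed)
```

Keeping 99.9% of the savings selects `min_gain=0.01`; relaxing to 99% selects `0.02`, which trades a few euros of expected savings for fewer interventions.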


In [91]:
bi_items_path = os.path.join(BI_DIR, "bi_items.csv")
by_cat_path = os.path.join(BI_DIR, "by_category.csv")
by_prod_path = os.path.join(BI_DIR, "by_product_top.csv")
curve_path = os.path.join(BI_DIR, "threshold_curve.csv")

bi_items.to_csv(bi_items_path, index=False)
by_cat.to_csv(by_cat_path, index=False)
by_prod_big.to_csv(by_prod_path, index=False)
threshold_curve.to_csv(curve_path, index=False)

print("\n Exportación completada (Power BI ready):")
print(f"- {bi_items_path}")
print(f"- {by_cat_path}")
print(f"- {by_prod_path}")
print(f"- {curve_path}")



 Exportación completada (Power BI ready):
- data/bi\bi_items.csv
- data/bi\by_category.csv
- data/bi\by_product_top.csv
- data/bi\threshold_curve.csv
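
The mixed separators in the log (`data/bi\...`) are what `os.path.join` produces when a literal `"data/bi"` is joined on Windows, which is presumably where this notebook ran. They are harmless there, but if the printed paths need to be stable across platforms, `pathlib` is one option. A small sketch, using `PureWindowsPath` only so that the behaviour can be reproduced on any operating system:

```python
from pathlib import PureWindowsPath

# PureWindowsPath reproduces Windows path semantics on any OS.
p = PureWindowsPath("data/bi") / "bi_items.csv"

native = str(p)        # native Windows rendering, backslashes throughout
portable = p.as_posix()  # forward slashes, stable across platforms
print(native)
print(portable)
```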


# KPIs

In [92]:
import os
import numpy as np
import pandas as pd

BI_DIR = "data/bi"
os.makedirs(BI_DIR, exist_ok=True)

KEYS = ["item_id", "ticket_id"]

def _require_cols(df: pd.DataFrame, cols: list[str], df_name: str = "df"):
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"{df_name} no tiene columnas necesarias: {missing}")

def _pick_global_pred():
    if "global_pred_test" in globals():
        return global_pred_test.copy(), "global_pred_test"
    if "global_pred" in globals():
        return global_pred.copy(), "global_pred"
    raise NameError("No encuentro global_pred_test ni global_pred en memoria. Crea/carga predicciones globales primero.")

def _consolidate_cost(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Locate the cost column: coste_devolucion, or the merge suffixes coste_devolucion_x/y
    if "coste_devolucion" in out.columns:
        cost = out["coste_devolucion"]
    else:
        cands = [c for c in out.columns if c.startswith("coste_devolucion")]
        if not cands:
            raise KeyError("No encuentro coste_devolucion ni variantes coste_devolucion_x/y.")
        # With both x/y present, a row-wise bfill picks the first non-null value per row
        cost = out[cands].bfill(axis=1).iloc[:, 0] if len(cands) > 1 else out[cands[0]]

    # Missing or non-numeric costs become 0.0, so has_coste flags strictly positive costs
    out["coste_devolucion"] = pd.to_numeric(cost, errors="coerce").fillna(0.0).astype("float32")
    out["has_coste"] = (out["coste_devolucion"] > 0).astype("int8")

    # Drop the duplicated cost columns
    drop_cols = [c for c in out.columns if c.startswith("coste_devolucion") and c != "coste_devolucion"]
    out = out.drop(columns=drop_cols, errors="ignore")

    return out

def _print_exec_kpis(bi_items: pd.DataFrame, title: str):
    df = bi_items.copy()
    df["cambia_talla"] = pd.to_numeric(df["cambia_talla"], errors="coerce").fillna(0).astype("int8")
    df["expected_savings_talla"] = pd.to_numeric(df["expected_savings_talla"], errors="coerce").fillna(0.0)

    n_items = len(df)
    n_orders = df["ticket_id"].nunique() if "ticket_id" in df.columns else np.nan
    pct_item = float(df["cambia_talla"].mean())
    pct_order = float(df.groupby("ticket_id")["cambia_talla"].max().mean()) if "ticket_id" in df.columns else np.nan
    savings_total = float(df["expected_savings_talla"].sum())
    savings_item = float(df["expected_savings_talla"].mean())
    savings_interv = float(df.loc[df["cambia_talla"] == 1, "expected_savings_talla"].mean()) if df["cambia_talla"].sum() > 0 else 0.0

    p0 = float(pd.to_numeric(df["p_dev_actual"], errors="coerce").dropna().mean()) if "p_dev_actual" in df.columns else np.nan
    p1 = float(pd.to_numeric(df["p_dev_final"], errors="coerce").dropna().mean()) if "p_dev_final" in df.columns else np.nan

    print(f"✅ {title}")
    print(f"- Items: {n_items:,} | Pedidos: {n_orders:,}")
    print(f"- Intervención: {pct_item:.2%} (items) | {pct_order:.2%} (pedidos)")
    print(f"- Ahorro esperado total (talla): {savings_total:,.2f} €")
    print(f"- Ahorro medio: {savings_item:,.2f} € por item | {savings_interv:,.2f} € por intervención")
    if not np.isnan(p0) and not np.isnan(p1):
        print(f"- Riesgo medio sin reco: {p0:.4f} | con reco: {p1:.4f} | Δp: {(p0 - p1):.4f}")
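
The cost consolidation relies on a row-wise backfill to coalesce the `_x`/`_y` columns left by a merge. A toy illustration of that step (the values are invented for the example):

```python
import numpy as np
import pandas as pd

# Toy data: illustrative values only.
toy = pd.DataFrame({
    "coste_devolucion_x": [1.0, np.nan, np.nan],
    "coste_devolucion_y": [2.0, 3.0, np.nan],
})

# Row-wise backfill pulls the first non-null value into the leftmost column.
cost = toy.bfill(axis=1).iloc[:, 0]
cost = pd.to_numeric(cost, errors="coerce").fillna(0.0)
has_coste = (cost > 0).astype("int8")
print(cost.tolist(), has_coste.tolist())
```

When both columns are informed the `_x` value wins, and a row with no cost at all ends up with cost 0 and `has_coste = 0`.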


In [93]:
print("=== CHECK UNIVERSO ===")
print("test_df2 existe:", "test_df2" in globals())
if "test_df2" in globals():
    print("test_df2 filas:", len(test_df2))
    print("canales:", test_df2["canal"].value_counts(dropna=False).head(5).to_dict() if "canal" in test_df2.columns else "no canal")
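    # The conditional must guard both .min() and .max(), not only the last print argument
    if "fecha_item" in test_df2.columns:
        print("fecha range:", test_df2["fecha_item"].min(), "->", test_df2["fecha_item"].max())
    else:
        print("fecha range: no fecha")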


=== CHECK UNIVERSO ===
test_df2 existe: True
test_df2 filas: 191954
canales: {'online': 191954}
fecha range: 2024-09-01 00:00:00 -> 2025-09-30 00:00:00


In [94]:
import os
import pandas as pd
import numpy as np
import xgboost as xgb

GLOBAL_DATA_DIR = "data/processed/devoluciones"
GLOBAL_MODEL_PATH = "modelos/devoluciones/xgb_final.json"

X_TEST_PATH = os.path.join(GLOBAL_DATA_DIR, "X_test.parquet")
TEST_INDEX_PATH = os.path.join(GLOBAL_DATA_DIR, "test_index.parquet")

X_test_global = pd.read_parquet(X_TEST_PATH)
test_index = pd.read_parquet(TEST_INDEX_PATH)

required_idx_cols = ["item_id", "ticket_id"]
missing = [c for c in required_idx_cols if c not in test_index.columns]
if missing:
    raise KeyError(f"test_index.parquet no tiene columnas necesarias: {missing}")

if len(X_test_global) != len(test_index):
    raise ValueError(f"X_test ({len(X_test_global):,}) y test_index ({len(test_index):,}) no tienen la misma longitud.")

booster_global = xgb.Booster()
booster_global.load_model(GLOBAL_MODEL_PATH)

dtest_global = xgb.DMatrix(X_test_global)
p_dev_global_test = booster_global.predict(dtest_global)

global_pred_test = test_index[["item_id", "ticket_id"]].copy()
for c in ["item_id", "ticket_id"]:
    global_pred_test[c] = global_pred_test[c].astype(str).str.strip()

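# Assign by position: wrapping the predictions in pd.Series would align them on index labels
# and produce NaN whenever test_index does not carry a default RangeIndex.
global_pred_test["p_dev_global"] = np.asarray(p_dev_global_test, dtype="float32")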
global_pred_test = global_pred_test.drop_duplicates(subset=["item_id", "ticket_id"])

print(" Predicciones globales (test del modelo global) creadas")
print("- Filas:", len(global_pred_test))
print("- Media p_dev_global:", float(global_pred_test["p_dev_global"].mean()))
display(global_pred_test.head(3))

 Predicciones globales (test del modelo global) creadas
- Filas: 226362
- Media p_dev_global: 0.5012995004653931


Unnamed: 0,item_id,ticket_id,p_dev_global
0,T378172-001,T378172,0.539345
1,T378172-002,T378172,0.30995
2,T378174-001,T378174,0.559957
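
When a NumPy prediction vector is attached to an index table, it matters whether the assignment is positional or label-based: pandas aligns a `Series` on index labels, while a bare array is assigned by position. A toy sketch of the difference (whether `test_index` actually carries a non-default index after `read_parquet` is not known from the output):

```python
import numpy as np
import pandas as pd

# Index table whose labels are NOT 0..n-1, as can happen after filtering or a parquet round trip.
index_table = pd.DataFrame({"item_id": ["a", "b", "c"]}, index=[10, 11, 12])
preds = np.array([0.1, 0.2, 0.3], dtype="float32")

index_table["via_series"] = pd.Series(preds)  # aligned on labels 0, 1, 2: no match
index_table["via_array"] = preds              # assigned by position
print(index_table)
```

The `via_series` column is entirely NaN, whereas `via_array` keeps the predictions in row order.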


In [95]:
# =========================================================
# FULL | Align the global score with the recommender's universe
# =========================================================

_require_cols(test_df2, KEYS, df_name="test_df2")

global_pred_use, global_pred_name = _pick_global_pred()
_require_cols(global_pred_use, KEYS + ["p_dev_global"], df_name=global_pred_name)

# Normalize the join keys
for c in KEYS:
    global_pred_use[c] = global_pred_use[c].astype(str).str.strip()

test_df2_aligned = test_df2.copy()
for c in KEYS:
    test_df2_aligned[c] = test_df2_aligned[c].astype(str).str.strip()

# Merge
test_df2_aligned = test_df2_aligned.merge(
    global_pred_use[KEYS + ["p_dev_global"]],
    on=KEYS,
    how="left",
    validate="m:1"
)

cov = float(test_df2_aligned["p_dev_global"].notna().mean())
print(" test_df2_aligned creado")
print("- Filas:", len(test_df2_aligned))
print(f"- Cobertura p_dev_global: {cov:.2%}")
print("- Media p_dev_global (no-NA):", float(pd.to_numeric(test_df2_aligned["p_dev_global"], errors="coerce").dropna().mean()) if cov > 0 else 0.0)


 test_df2_aligned creado
- Filas: 191954
- Cobertura p_dev_global: 69.68%
- Media p_dev_global (no-NA): 0.5495098233222961
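
Coverage below 100% means part of the recommender's universe has no match in the global predictions. To see which rows are left out, `merge` can be asked to label each row's origin with `indicator=True`. A minimal sketch on toy keys (the frames `left` and `right` are illustrative stand-ins for `test_df2_aligned` and `global_pred_use`):

```python
import pandas as pd

# Toy stand-ins: illustrative keys and scores only.
left = pd.DataFrame({"item_id": ["a", "b", "c"], "ticket_id": ["t1", "t1", "t2"]})
right = pd.DataFrame({
    "item_id": ["a", "b"],
    "ticket_id": ["t1", "t1"],
    "p_dev_global": [0.5, 0.3],
})

merged = left.merge(
    right, on=["item_id", "ticket_id"], how="left",
    validate="m:1", indicator=True,
)
coverage = float((merged["_merge"] == "both").mean())
uncovered = merged.loc[merged["_merge"] == "left_only", ["item_id", "ticket_id"]]
print(f"coverage={coverage:.2%}")
print(uncovered)
```

Profiling the `left_only` rows by date or product is a quick way to check whether the uncovered share is random or concentrated in one segment.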


In [96]:
# FULL | Final recommendations for the KPIs

MIN_GAIN_PORTADA = 0.0   # 0.0 gives the ~18-19% intervention rate; raise it to intervene less often

recs_full, scen_full = recommend_sizes(
    test_df2,         # the recommender's original universe
    max_step=2,
    min_gain=float(MIN_GAIN_PORTADA)
)

print(" recs_full generado")
print("- Filas recs_full:", len(recs_full))
print("- % intervención (items):", float(pd.to_numeric(recs_full["cambia_talla"], errors="coerce").fillna(0).mean()))
print("- Δp medio (todo):", float(pd.to_numeric(recs_full["delta_p_final"], errors="coerce").fillna(0).mean()))
print("- Δp medio (solo intervención):",
      float(pd.to_numeric(recs_full.loc[recs_full["cambia_talla"] == 1, "delta_p_final"], errors="coerce").fillna(0).mean()))


 recs_full generado
- Filas recs_full: 191954
- % intervención (items): 0.18784187878345854
- Δp medio (todo): 0.03465647250413895
- Δp medio (solo intervención): 0.184498131275177
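
The three figures printed here are linked: if Δp is zero for every item whose size is not changed (an assumption, though the numbers are consistent with it), the overall mean Δp equals the intervention rate times the mean Δp among intervened items. A quick arithmetic check with the printed values:

```python
pct_interv = 0.18784187878345854   # share of items with a size change
dp_interv = 0.184498131275177      # mean Δp among intervened items
dp_all = 0.03465647250413895       # mean Δp over all items

# If Δp is zero whenever the size is unchanged, the overall mean factorises.
reconstructed = pct_interv * dp_interv
print(f"{reconstructed:.6f} vs {dp_all:.6f}")
```

The product reproduces the overall mean to within floating-point rounding.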


In [97]:
# =========================================================
# FULL | Final BI dataset (Power BI ready)
# =========================================================

base = test_df2_aligned.copy()

# Cost + date
_require_cols(base, ["fecha_item"], df_name="test_df2_aligned")
base["fecha_item"] = pd.to_datetime(base["fecha_item"], errors="coerce")
if base["fecha_item"].isna().any():
    raise ValueError("Hay NA en fecha_item tras parseo. Revisa test_df2_aligned.")

base = _consolidate_cost(base)

# Stable positional index for the merge with recs_full
# (assumes recs_full["_orig_idx"] holds the row positions of test_df2)
base = base.reset_index(drop=True).reset_index().rename(columns={"index": "_orig_idx"})

recs_keep = recs_full[[
    "_orig_idx", "talla_reco", "talla_final", "cambia_talla",
    "delta_p_final", "p_dev_actual", "p_dev_final"
]].copy()

bi_items_full = base.merge(recs_keep, on="_orig_idx", how="left", validate="1:1")

# Defensive cleanup
bi_items_full["cambia_talla"] = pd.to_numeric(bi_items_full["cambia_talla"], errors="coerce").fillna(0).astype("int8")
bi_items_full["delta_p_final"] = pd.to_numeric(bi_items_full["delta_p_final"], errors="coerce").fillna(0.0).astype("float32")

# Economic metrics
bi_items_full["expected_savings_talla"] = (bi_items_full["delta_p_final"] * bi_items_full["coste_devolucion"]).astype("float32")

bi_items_full["p_dev_global"] = pd.to_numeric(bi_items_full["p_dev_global"], errors="coerce")
bi_items_full["expected_cost_global"] = (bi_items_full["p_dev_global"] * bi_items_full["coste_devolucion"]).astype("float32")

# Time
bi_items_full["year"] = bi_items_full["fecha_item"].dt.year.astype("int16")
bi_items_full["month"] = bi_items_full["fecha_item"].dt.to_period("M").astype(str)

print(" bi_items_full creado")
_print_exec_kpis(bi_items_full, title="PÁGINA 1 | RESUMEN EJECUTIVO (FULL)")


 bi_items_full creado
✅ PÁGINA 1 | RESUMEN EJECUTIVO (FULL)
- Items: 191,954 | Pedidos: 132,024
- Intervención: 18.78% (items) | 22.10% (pedidos)
- Ahorro esperado total (talla): 25,074.35 €
- Ahorro medio: 0.13 € por item | 0.70 € por intervención
- Riesgo medio sin reco: 0.3472 | con reco: 0.3125 | Δp: 0.0347


In [98]:
# PORTADA | Share of cost addressable through sizing (covered universe only)

df = bi_items_full.copy()

cov = df["p_dev_global"].notna()
coverage = float(cov.mean())

savings_all = float(df["expected_savings_talla"].sum())
savings_cov = float(df.loc[cov, "expected_savings_talla"].sum())
cost_cov = float(df.loc[cov, "expected_cost_global"].sum())

share_cov = float(savings_cov / (cost_cov + 1e-9))
share_savings_covered = float(savings_cov / (savings_all + 1e-9))

print(" PORTADA | Contexto modelo global (sin mezclar universos)")
print(f"- Cobertura p_dev_global: {coverage:.2%}")
print(f"- Coste esperado total devoluciones (cubierto): {cost_cov:,.2f} €")
print(f"- Ahorro talla (solo cubiertos): {savings_cov:,.2f} €")
print(f"- % del ahorro dentro de cubiertos: {share_savings_covered:.2%}")
print(f"- Share coste atacable por talla (cubierto): {share_cov:.2%}")


 PORTADA | Contexto modelo global (sin mezclar universos)
- Cobertura p_dev_global: 69.68%
- Coste esperado total devoluciones (cubierto): 262,662.19 €
- Ahorro talla (solo cubiertos): 17,386.57 €
- % del ahorro dentro de cubiertos: 69.34%
- Share coste atacable por talla (cubierto): 6.62%
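
Both ratios in this block can be re-derived from the printed totals, which helps when reading them: one divides the covered savings by the covered expected cost, the other by the savings of the whole universe. A worked check using the rounded figures from the outputs:

```python
savings_all = 25_074.35   # expected savings, all items (executive summary)
savings_cov = 17_386.57   # expected savings, items with p_dev_global
cost_cov = 262_662.19     # expected return cost, items with p_dev_global

share_cov = savings_cov / cost_cov                # addressable share of covered cost
share_savings_covered = savings_cov / savings_all  # share of savings inside coverage
print(f"{share_cov:.2%} | {share_savings_covered:.2%}")
```

The share of savings inside the covered subset (about 69.3%) is close to the item coverage (69.68%), so the covered items do not look unusual in terms of savings per item.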


In [99]:
# PORTADA | Exportable KPI table

df = bi_items_full.copy()

items = int(len(df))
orders = int(df["ticket_id"].nunique())
pct_items = float(df["cambia_talla"].mean())
pct_orders = float(df.groupby("ticket_id")["cambia_talla"].max().mean())

savings_total = float(df["expected_savings_talla"].sum())
savings_item = float(df["expected_savings_talla"].mean())
savings_interv = float(df.loc[df["cambia_talla"] == 1, "expected_savings_talla"].mean()) if df["cambia_talla"].sum() > 0 else 0.0

p0 = float(pd.to_numeric(df["p_dev_actual"], errors="coerce").dropna().mean())
p1 = float(pd.to_numeric(df["p_dev_final"], errors="coerce").dropna().mean())
delta_p = float(p0 - p1)

cov = df["p_dev_global"].notna()
coverage = float(cov.mean())
cost_cov = float(df.loc[cov, "expected_cost_global"].sum())
savings_cov = float(df.loc[cov, "expected_savings_talla"].sum())
share_cov = float(savings_cov / (cost_cov + 1e-9))

kpi_portada = pd.DataFrame({
    "kpi": [
        "Items analizados",
        "Pedidos analizados",
        "% items con recomendación",
        "% pedidos afectados",
        "Ahorro esperado total (talla)",
        "Ahorro medio por item",
        "Ahorro medio por intervención",
        "Riesgo medio sin recomendador (p_dev_actual)",
        "Riesgo medio con recomendador (p_dev_final)",
        "Δp medio (todo el universo)",
        "Cobertura p_dev_global (items)",
        "Coste esperado total devoluciones (cubierto)",
        "Share coste atacable por talla (cubierto)"
    ],
    "valor": [
        f"{items:,}",
        f"{orders:,}",
        f"{pct_items:.2%}",
        f"{pct_orders:.2%}",
        f"{savings_total:,.2f} €",
        f"{savings_item:,.2f} €",
        f"{savings_interv:,.2f} €",
        f"{p0:.4f}",
        f"{p1:.4f}",
        f"{delta_p:.4f}",
        f"{coverage:.2%}",
        f"{cost_cov:,.2f} €",
        f"{share_cov:.2%}",
    ]
})

display(kpi_portada)

kpi_portada.to_csv(os.path.join(BI_DIR, "kpi_portada.csv"), index=False)
print(" Exportado:", os.path.join(BI_DIR, "kpi_portada.csv"))


Unnamed: 0,kpi,valor
0,Items analizados,191954
1,Pedidos analizados,132024
2,% items con recomendación,18.78%
3,% pedidos afectados,22.10%
4,Ahorro esperado total (talla),"25,074.35 €"
5,Ahorro medio por item,0.13 €
6,Ahorro medio por intervención,0.70 €
7,Riesgo medio sin recomendador (p_dev_actual),0.3472
8,Riesgo medio con recomendador (p_dev_final),0.3125
9,Δp medio (todo el universo),0.0347


 Exportado: data/bi\kpi_portada.csv


In [100]:
# PORTADA | "Power BI ready" KPIs (wide + long)
# - Wide: 1 row, 1 column per KPI (ideal for cards)
# - Long: 1 row per KPI, with value_num + value_text (more flexible)


import os
import numpy as np
import pandas as pd

df = bi_items_full.copy()

required = [
    "ticket_id", "cambia_talla", "expected_savings_talla",
    "p_dev_actual", "p_dev_final",
]
missing = [c for c in required if c not in df.columns]
if missing:
    raise KeyError(f"bi_items_full no tiene columnas necesarias: {missing}")

# basics
items = int(len(df))
orders = int(df["ticket_id"].nunique())

pct_items = float(pd.to_numeric(df["cambia_talla"], errors="coerce").fillna(0).mean())
pct_orders = float(df.groupby("ticket_id")["cambia_talla"].max().mean())

savings_total = float(pd.to_numeric(df["expected_savings_talla"], errors="coerce").fillna(0).sum())
savings_item = float(pd.to_numeric(df["expected_savings_talla"], errors="coerce").fillna(0).mean())

if int(df["cambia_talla"].sum()) > 0:
    savings_interv = float(df.loc[df["cambia_talla"] == 1, "expected_savings_talla"].mean())
else:
    savings_interv = 0.0

p0 = float(pd.to_numeric(df["p_dev_actual"], errors="coerce").dropna().mean())
p1 = float(pd.to_numeric(df["p_dev_final"], errors="coerce").dropna().mean())
delta_p = float(p0 - p1)

# optional: global context (only if the columns exist)
has_global = ("p_dev_global" in df.columns) and ("expected_cost_global" in df.columns)
if has_global:
    cov = df["p_dev_global"].notna()
    coverage = float(cov.mean())
    cost_cov = float(pd.to_numeric(df.loc[cov, "expected_cost_global"], errors="coerce").fillna(0).sum())
    savings_cov = float(pd.to_numeric(df.loc[cov, "expected_savings_talla"], errors="coerce").fillna(0).sum())
    share_cov = float(savings_cov / (cost_cov + 1e-9))
else:
    coverage = np.nan
    cost_cov = np.nan
    share_cov = np.nan

# ---------------------------------------------------------
# 1) LONG: one row per KPI (numeric + text + unit)
# ---------------------------------------------------------
kpi_specs = [
    ("Items analizados", items, "count"),
    ("Pedidos analizados", orders, "count"),
    ("% items con recomendación", pct_items, "pct"),
    ("% pedidos afectados", pct_orders, "pct"),
    ("Ahorro esperado total (talla)", savings_total, "eur"),
    ("Ahorro medio por item", savings_item, "eur"),
    ("Ahorro medio por intervención", savings_interv, "eur"),
    ("Riesgo medio sin recomendador (p_dev_actual)", p0, "prob"),
    ("Riesgo medio con recomendador (p_dev_final)", p1, "prob"),
    ("Δp medio (todo el universo)", delta_p, "prob"),
    ("Cobertura p_dev_global (items)", coverage, "pct"),
    ("Coste esperado total devoluciones (cubierto)", cost_cov, "eur"),
    ("Share coste atacable por talla (cubierto)", share_cov, "pct"),
]

def _format_value(x: float, unit: str) -> str:
    if pd.isna(x):
        return ""
    if unit == "count":
        # Plain integer, no thousands separator
        return str(int(round(float(x))))
    if unit == "eur":
        return f"{float(x):.2f}"
    if unit in ("pct", "prob"):
        # Stored as a raw fraction; apply the % format in Power BI
        return f"{float(x):.6f}"
    return str(x)

kpi_long = pd.DataFrame(
    [{
        "kpi": name,
        "value_num": (np.nan if pd.isna(val) else float(val)),
        "unit": unit,
        "value_text": _format_value(val, unit),
    } for (name, val, unit) in kpi_specs]
)

# Format hint for Power BI (optional, helps document the export)
fmt_map = {"count": "0", "eur": "0.00", "pct": "0.00%", "prob": "0.0000"}
kpi_long["format_hint"] = kpi_long["unit"].map(fmt_map).fillna("")

# ---------------------------------------------------------
# 2) WIDE: 1 row with numeric columns (direct cards)
# ---------------------------------------------------------
kpi_wide = pd.DataFrame([{
    "items_analizados": items,
    "pedidos_analizados": orders,
    "pct_items_reco": pct_items,
    "pct_pedidos_afectados": pct_orders,
    "ahorro_total_talla_eur": savings_total,
    "ahorro_medio_item_eur": savings_item,
    "ahorro_medio_interv_eur": savings_interv,
    "riesgo_medio_sin_reco": p0,
    "riesgo_medio_con_reco": p1,
    "delta_p_medio": delta_p,
    "cobertura_p_dev_global": coverage,
    "coste_esperado_total_cubierto_eur": cost_cov,
    "share_coste_atacable_talla_cubierto": share_cov,
}])

# Export
os.makedirs(BI_DIR, exist_ok=True)

path_long = os.path.join(BI_DIR, "kpi_portada_long.csv")
path_wide = os.path.join(BI_DIR, "kpi_portada_wide.csv")

kpi_long.to_csv(path_long, index=False)
kpi_wide.to_csv(path_wide, index=False)

print(" KPI portada exportados (Power BI ready)")
print("-", path_long)
print("-", path_wide)

display(kpi_long)
display(kpi_wide)


 KPI portada exportados (Power BI ready)
- data/bi\kpi_portada_long.csv
- data/bi\kpi_portada_wide.csv


Unnamed: 0,kpi,value_num,unit,value_text,format_hint
0,Items analizados,191954.0,count,191954.0,0
1,Pedidos analizados,132024.0,count,132024.0,0
2,% items con recomendación,0.187842,pct,0.187842,0.00%
3,% pedidos afectados,0.221036,pct,0.221036,0.00%
4,Ahorro esperado total (talla),25074.347656,eur,25074.35,0.00
5,Ahorro medio por item,0.130627,eur,0.13,0.00
6,Ahorro medio por intervención,0.695409,eur,0.7,0.00
7,Riesgo medio sin recomendador (p_dev_actual),0.347177,prob,0.347177,0.0000
8,Riesgo medio con recomendador (p_dev_final),0.31252,prob,0.31252,0.0000
9,Δp medio (todo el universo),0.034656,prob,0.034656,0.0000


Unnamed: 0,items_analizados,pedidos_analizados,pct_items_reco,pct_pedidos_afectados,ahorro_total_talla_eur,ahorro_medio_item_eur,ahorro_medio_interv_eur,riesgo_medio_sin_reco,riesgo_medio_con_reco,delta_p_medio,cobertura_p_dev_global,coste_esperado_total_cubierto_eur,share_coste_atacable_talla_cubierto
0,191954,132024,0.187842,0.221036,25074.347656,0.130627,0.695409,0.347177,0.31252,0.034656,0.696818,262662.1875,0.066194
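
The wide and long tables carry the same numbers, so one can be derived from the other rather than maintained by hand. A minimal sketch with `DataFrame.melt`, using three columns copied from the wide output (`kpi_wide_toy` and `kpi_long_toy` are illustrative names; note that the long table in the notebook uses display labels rather than column names as KPI keys):

```python
import pandas as pd

# Three columns copied from the wide KPI table.
kpi_wide_toy = pd.DataFrame([{
    "items_analizados": 191954,
    "pct_items_reco": 0.187842,
    "ahorro_total_talla_eur": 25074.347656,
}])

# One row per KPI: the column name becomes the key, the cell becomes the value.
kpi_long_toy = kpi_wide_toy.melt(var_name="kpi", value_name="value_num")
print(kpi_long_toy)
```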


In [101]:
# PÁGINA 2 | Prioritization tables


df = bi_items_full.copy()

_require_cols(df, ["categoria", "id_producto", "ticket_id", "cambia_talla", "expected_savings_talla", "year", "month"], df_name="bi_items_full")

by_category = (
    df.groupby("categoria", as_index=False)
      .agg(
          n_items=("ticket_id", "size"),
          n_orders=("ticket_id", "nunique"),
          pct_interv=("cambia_talla", "mean"),
          savings_total=("expected_savings_talla", "sum"),
          savings_mean_item=("expected_savings_talla", "mean"),
          savings_mean_interv=("expected_savings_talla",
                              lambda s: float(np.mean(s[df.loc[s.index, "cambia_talla"] == 1]))
                              if df.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
      )
      .sort_values("savings_total", ascending=False)
      .reset_index(drop=True)
)

by_product_top = (
    df.groupby(["id_producto", "categoria"], as_index=False)
      .agg(
          n_items=("ticket_id", "size"),
          n_orders=("ticket_id", "nunique"),
          pct_interv=("cambia_talla", "mean"),
          savings_total=("expected_savings_talla", "sum"),
          savings_mean_interv=("expected_savings_talla",
                              lambda s: float(np.mean(s[df.loc[s.index, "cambia_talla"] == 1]))
                              if df.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
      )
)

MIN_N_PROD = 500
by_product_top = (
    by_product_top[by_product_top["n_items"] >= MIN_N_PROD]
    .sort_values("savings_total", ascending=False)
    .reset_index(drop=True)
)

month_summary = (
    df.groupby(["year", "month"], as_index=False)
      .agg(
          n_items=("ticket_id", "size"),
          n_orders=("ticket_id", "nunique"),
          pct_interv=("cambia_talla", "mean"),
          savings_total=("expected_savings_talla", "sum"),
          savings_mean_item=("expected_savings_talla", "mean"),
          savings_mean_interv=("expected_savings_talla",
                              lambda s: float(np.mean(s[df.loc[s.index, "cambia_talla"] == 1]))
                              if df.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
      )
      .sort_values(["year", "month"])
      .reset_index(drop=True)
)

print(" PÁGINA 2 | Tablas listas")
print("- by_category:", by_category.shape)
print("- by_product_top:", by_product_top.shape)
print("- month_summary:", month_summary.shape)

display(by_category)
display(by_product_top.head(20))
display(month_summary.head(12))

by_category.to_csv(os.path.join(BI_DIR, "by_category.csv"), index=False)
by_product_top.to_csv(os.path.join(BI_DIR, "by_product_top.csv"), index=False)
month_summary.to_csv(os.path.join(BI_DIR, "month_summary.csv"), index=False)

print(" Exportados: by_category.csv, by_product_top.csv, month_summary.csv")


 PÁGINA 2 | Tablas listas
- by_category: (5, 7)
- by_product_top: (22, 7)
- month_summary: (13, 8)


Unnamed: 0,categoria,n_items,n_orders,pct_interv,savings_total,savings_mean_item,savings_mean_interv
0,abrigo,62485,55514,0.251964,14933.295898,0.23899,0.948507
1,pantalon,39494,36590,0.128374,3867.003418,0.097914,0.762723
2,camiseta,56030,50327,0.120667,2871.069092,0.051242,0.424652
3,camisa,17199,16606,0.372696,1763.111938,0.102512,0.275056
4,sudadera,16746,16203,0.123731,1639.86731,0.097926,0.791442


Unnamed: 0,id_producto,categoria,n_items,n_orders,pct_interv,savings_total,savings_mean_interv
0,P047,abrigo,11004,10789,0.848237,6711.040527,0.718989
1,P013,abrigo,13200,12866,0.122576,2112.063721,1.305355
2,P020,abrigo,10665,10460,0.123019,1744.149536,1.329382
3,P005,abrigo,10722,10498,0.122738,1729.842285,1.31447
4,P058,abrigo,10660,10424,0.124953,1612.811279,1.210819
5,P008,pantalon,13691,13334,0.130889,1381.661011,0.771016
6,P003,sudadera,13760,13411,0.123837,1342.200562,0.787676
7,P042,camisa,11017,10773,0.508941,1245.537598,0.22214
8,P024,pantalon,10891,10661,0.133505,1121.13562,0.77107
9,P004,pantalon,10814,10603,0.120862,974.602295,0.745679


Unnamed: 0,year,month,n_items,n_orders,pct_interv,savings_total,savings_mean_item,savings_mean_interv
0,2024,2024-09,12388,8593,0.190911,1664.296875,0.134347,0.70372
1,2024,2024-10,13174,9141,0.194019,1766.463135,0.134087,0.691105
2,2024,2024-11,15934,11053,0.186206,2086.595703,0.130952,0.703268
3,2024,2024-12,16226,11275,0.186614,2128.842041,0.131199,0.703052
4,2025,2025-01,15393,10674,0.18268,2038.308716,0.132418,0.724861
5,2025,2025-02,15300,10336,0.205294,2244.334717,0.146689,0.714529
6,2025,2025-03,15164,10471,0.181812,1943.428467,0.128161,0.704907
7,2025,2025-04,14808,10143,0.183482,1918.897705,0.129585,0.706256
8,2025,2025-05,14827,10184,0.183584,1892.314087,0.127626,0.695193
9,2025,2025-06,15030,10390,0.179375,1826.288086,0.12151,0.677407


 Exportados: by_category.csv, by_product_top.csv, month_summary.csv
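
`month_summary` is sorted by `["year", "month"]`, where `month` is the string produced by `dt.to_period("M").astype(str)`. Because that key has the form `YYYY-MM`, sorting it as text already yields chronological order. A short illustration with invented dates:

```python
import pandas as pd

# Invented dates, deliberately out of order.
dates = pd.Series(pd.to_datetime(["2025-01-15", "2024-09-01", "2024-12-31"]))
month_key = dates.dt.to_period("M").astype(str)

ordered = sorted(month_key)
print(ordered)
```

The `year` column is therefore not needed for ordering, although it remains convenient as a slicer in Power BI.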


In [102]:
# PÁGINA 2 | Top products per category

df = bi_items_full.copy()

by_product_cat = (
    df.groupby(["categoria", "id_producto"], as_index=False)
      .agg(
          n_items=("ticket_id", "size"),
          n_orders=("ticket_id", "nunique"),
          pct_interv=("cambia_talla", "mean"),
          savings_total=("expected_savings_talla", "sum"),
          savings_mean_interv=("expected_savings_talla",
                              lambda s: float(np.mean(s[df.loc[s.index, "cambia_talla"] == 1]))
                              if df.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
      )
)

by_product_cat["rank_in_cat"] = (
    by_product_cat.groupby("categoria")["savings_total"]
    .rank(method="first", ascending=False)
    .astype(int)
)

TOPN = 10
by_product_top_by_category = (
    by_product_cat[by_product_cat["rank_in_cat"] <= TOPN]
    .sort_values(["categoria", "rank_in_cat"])
    .reset_index(drop=True)
)

print(f" PAGE 2 | Top {TOPN} by category:", by_product_top_by_category.shape)
display(by_product_top_by_category.head(30))

by_product_top_by_category.to_csv(os.path.join(BI_DIR, "by_product_top_by_category.csv"), index=False)
print(" Exported:", os.path.join(BI_DIR, "by_product_top_by_category.csv"))


 PAGE 2 | Top 10 by category: (22, 8)


Unnamed: 0,categoria,id_producto,n_items,n_orders,pct_interv,savings_total,savings_mean_interv,rank_in_cat
0,abrigo,P047,11004,10789,0.848237,6711.040527,0.718989,1
1,abrigo,P013,13200,12866,0.122576,2112.063721,1.305355,2
2,abrigo,P020,10665,10460,0.123019,1744.149536,1.329382,3
3,abrigo,P005,10722,10498,0.122738,1729.842285,1.31447,4
4,abrigo,P058,10660,10424,0.124953,1612.811279,1.210819,5
5,abrigo,P069,3207,3161,0.135329,523.081299,1.205256,6
6,abrigo,P059,3027,2992,0.131483,500.307251,1.257053,7
7,camisa,P042,11017,10773,0.508941,1245.537598,0.22214,1
8,camisa,P066,3052,3006,0.13401,263.302612,0.643772,2
9,camisa,P062,3130,3092,0.125879,254.271744,0.64536,3


 Exported: data/bi\by_product_top_by_category.csv
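
The `savings_mean_interv` aggregation above reaches back into `df` from inside the lambda. One possible alternative, sketched here on a toy frame, is to mask the savings column first with `Series.where` and let `mean` skip the resulting NaNs. The toy values and the helper column `savings_if_interv` are assumptions for illustration; they are not part of the notebook.

```python
import pandas as pd

# Toy data (hypothetical values), reusing the column names from the cell above.
toy = pd.DataFrame({
    "categoria": ["abrigo", "abrigo", "abrigo", "camisa", "camisa"],
    "id_producto": ["P1", "P1", "P1", "P2", "P2"],
    "cambia_talla": [1, 0, 1, 0, 0],
    "expected_savings_talla": [2.0, 0.0, 4.0, 0.0, 0.0],
})

# Keep savings only where an intervention happened; other rows become NaN.
toy["savings_if_interv"] = toy["expected_savings_talla"].where(toy["cambia_talla"] == 1)

summary = (
    toy.groupby(["categoria", "id_producto"], as_index=False)
       .agg(
           n_items=("cambia_talla", "size"),
           pct_interv=("cambia_talla", "mean"),
           savings_total=("expected_savings_talla", "sum"),
           # mean() skips NaN, so this is the mean over intervened items only
           savings_mean_interv=("savings_if_interv", "mean"),
       )
)

# Groups without any intervention produce NaN; report them as 0.0 like the lambda does.
summary["savings_mean_interv"] = summary["savings_mean_interv"].fillna(0.0)
print(summary)
```

The result should match the lambda-based version, without indexing the outer frame from inside the aggregation.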


In [103]:
# Export the item-level base dataset and list everything written to BI_DIR
bi_items_full.to_csv(os.path.join(BI_DIR, "bi_items.csv"), index=False)
print(" Exported base dataset:", os.path.join(BI_DIR, "bi_items.csv"))

print("\n Files in data/bi:")
for f in sorted(os.listdir(BI_DIR)):
    print("-", os.path.join(BI_DIR, f))


 Exported base dataset: data/bi\bi_items.csv

 Files in data/bi:
- data/bi\altura_quintiles.csv
- data/bi\bi_items.csv
- data/bi\bmi_quintiles.csv
- data/bi\by_category.csv
- data/bi\by_product_top.csv
- data/bi\by_product_top_by_category.csv
- data/bi\category_daily.csv
- data/bi\channel_daily.csv
- data/bi\channel_overview.csv
- data/bi\customer_daily.csv
- data/bi\customer_monthly.csv
- data/bi\geo_daily.csv
- data/bi\geo_daily_cat_canal.csv
- data/bi\geo_daily_cat_talla.csv
- data/bi\geo_mix_prov.csv
- data/bi\geo_prov_cat_talla.csv
- data/bi\geo_provincia_overview.csv
- data/bi\impact_percentiles_interv.csv
- data/bi\impact_summary.csv
- data/bi\items_global_diagnostico.csv
- data/bi\items_model_global.csv
- data/bi\items_model_global.parquet
- data/bi\items_model_global_enriched.csv
- data/bi\items_model_global_enriched.parquet
- data/bi\kpi_portada.csv
- data/bi\kpi_portada_long.csv
- data/bi\kpi_portada_wide.csv
- data/bi\month_summary.csv
- data/bi\preds_global_item_level.

In [104]:
# PAGE 3 | Prepare the item-level dataset used by the impact tables
import os

import numpy as np
import pandas as pd

df = bi_items_full.copy()

# Columns required by the PAGE 3 tables; fail early if any is missing.
needed = [
    "ticket_id", "categoria", "talla", "altura_cm", "peso_kg", "bmi",
    "cambia_talla", "delta_p_final", "expected_savings_talla"
]
for c in needed:
    if c not in df.columns:
        raise KeyError(f"bi_items_full is missing '{c}'. Check how the dataset is built.")

# Coerce to numeric; unparseable or missing values count as "no intervention" / zero impact.
df["cambia_talla"] = pd.to_numeric(df["cambia_talla"], errors="coerce").fillna(0).astype("int8")
df["delta_p_final"] = pd.to_numeric(df["delta_p_final"], errors="coerce").fillna(0.0).astype("float32")
df["expected_savings_talla"] = pd.to_numeric(df["expected_savings_talla"], errors="coerce").fillna(0.0).astype("float32")

print(" PAGE 3 | Dataset ready")
print("- Items:", len(df))
print("- Intervention rate:", float(df["cambia_talla"].mean()))
print("- Total savings (€):", float(df["expected_savings_talla"].sum()))


 PAGE 3 | Dataset ready
- Items: 191954
- Intervention rate: 0.18784187878345854
- Total savings (€): 25074.34765625


In [105]:
# PAGE 3.1 | Intervention vs no intervention

def _pct(x):
    return f"{100 * x:.2f}%"

# One summary row per group. The Spanish group labels are kept because they are
# written to the exported CSV.
rows = []
for label, flag in [("No intervención", 0), ("Intervención", 1)]:
    mask = df["cambia_talla"] == flag
    rows.append({
        "grupo": label,
        "n_items": int(mask.sum()),
        "pct_items": float(mask.mean()),
        "delta_p_mean": float(df.loc[mask, "delta_p_final"].mean()),
        "savings_total": float(df.loc[mask, "expected_savings_talla"].sum()),
        "savings_mean_item": float(df.loc[mask, "expected_savings_talla"].mean()),
    })
impact_summary = pd.DataFrame(rows)

impact_summary["pct_items"] = impact_summary["pct_items"].apply(_pct)

print(" PAGE 3 | Impact summary")
display(impact_summary)

impact_summary.to_csv(os.path.join(BI_DIR, "impact_summary.csv"), index=False)
print(" Exported:", os.path.join(BI_DIR, "impact_summary.csv"))


 PAGE 3 | Impact summary


Unnamed: 0,grupo,n_items,pct_items,delta_p_mean,savings_total,savings_mean_item
0,No intervención,155897,81.22%,0.0,0.0,0.0
1,Intervención,36057,18.78%,0.184498,25074.347656,0.695409


 Exported: data/bi\impact_summary.csv
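
As a quick sanity check on the summary above, the figures can be cross-checked against each other: the mean saving per intervened item times the number of interventions should recover the total, and the intervened `delta_p_mean` diluted over all items gives the average risk reduction per item. The snippet below is only an arithmetic sketch using numbers copied from the printed tables; the variable names are illustrative.

```python
# Figures copied from the impact summary and the "Dataset ready" output above.
n_interv = 36057
n_total = 191954
savings_mean_item = 0.695409
delta_p_mean_interv = 0.184498

# Mean saving per intervened item x number of interventions ~ total savings.
savings_total_check = n_interv * savings_mean_item

# Non-intervened items contribute delta_p = 0, so the overall mean is diluted
# by the intervention rate.
delta_p_mean_all = delta_p_mean_interv * n_interv / n_total

print(f"savings_total ≈ {savings_total_check:.2f} EUR")
print(f"delta_p_mean over all items ≈ {delta_p_mean_all:.4f}")
```

The first value should agree with the reported total (25074.35 €) up to the rounding of the displayed mean.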


In [106]:
# PAGE 3.2 | Extreme vs non-extreme sizes

df["talla_extrema"] = df["talla"].isin(["XS", "XL"]).astype("int8")

extreme_table = (
    df.groupby("talla_extrema", as_index=False)
      .agg(
          n_items=("ticket_id", "size"),
          n_orders=("ticket_id", "nunique"),
          pct_interv=("cambia_talla", "mean"),
          delta_p_mean=("delta_p_final", "mean"),
          savings_total=("expected_savings_talla", "sum"),
          savings_mean_interv=("expected_savings_talla",
                              lambda s: float(np.mean(s[df.loc[s.index, "cambia_talla"] == 1]))
                              if df.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
      )
)

extreme_table["segmento"] = extreme_table["talla_extrema"].map({0: "No extrema", 1: "Extrema"})
extreme_table = extreme_table[["segmento", "n_items", "n_orders", "pct_interv", "delta_p_mean", "savings_total", "savings_mean_interv"]]

print(" PAGE 3 | Extreme vs non-extreme sizes")
display(extreme_table)

extreme_table.to_csv(os.path.join(BI_DIR, "talla_extremos.csv"), index=False)
print(" Exported:", os.path.join(BI_DIR, "talla_extremos.csv"))


 PAGE 3 | Extreme vs non-extreme sizes


Unnamed: 0,segmento,n_items,n_orders,pct_interv,delta_p_mean,savings_total,savings_mean_interv
0,No extrema,157182,108131,0.183183,0.030444,17986.158203,0.624671
1,Extrema,34772,23893,0.208904,0.0537,7088.188965,0.975797


 Exported: data/bi\talla_extremos.csv


In [107]:
# PAGE 3.3 | BMI quintiles

df_bmi = df.copy()
df_bmi["bmi"] = pd.to_numeric(df_bmi["bmi"], errors="coerce")
df_bmi = df_bmi[df_bmi["bmi"].notna()].copy()

df_bmi["bmi_quintil"] = pd.qcut(df_bmi["bmi"], q=5, labels=[f"Q{i}" for i in range(1, 6)])

bmi_table = (
    df_bmi.groupby("bmi_quintil", as_index=False, observed=False)
          .agg(
              n_items=("ticket_id", "size"),
              pct_interv=("cambia_talla", "mean"),
              delta_p_mean=("delta_p_final", "mean"),
              savings_total=("expected_savings_talla", "sum"),
              savings_mean_interv=("expected_savings_talla",
                                  lambda s: float(np.mean(s[df_bmi.loc[s.index, "cambia_talla"] == 1]))
                                  if df_bmi.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
          )
)

print(" PAGE 3 | BMI quintiles")
display(bmi_table)

bmi_table.to_csv(os.path.join(BI_DIR, "bmi_quintiles.csv"), index=False)
print(" Exported:", os.path.join(BI_DIR, "bmi_quintiles.csv"))


 PAGE 3 | BMI quintiles


Unnamed: 0,bmi_quintil,n_items,pct_interv,delta_p_mean,savings_total,savings_mean_interv
0,Q1,38393,0.217019,0.036819,5309.074707,0.637191
1,Q2,38390,0.198593,0.036967,5408.62207,0.70942
2,Q3,38401,0.187443,0.037661,5558.407715,0.772216
3,Q4,38380,0.167952,0.030348,4313.814453,0.669223
4,Q5,38390,0.168195,0.031485,4484.428711,0.694507


 Exported: data/bi\bmi_quintiles.csv


In [108]:
# PAGE 3.4 | Height quintiles

df_h = df.copy()
df_h["altura_cm"] = pd.to_numeric(df_h["altura_cm"], errors="coerce")
df_h = df_h[df_h["altura_cm"].notna()].copy()

df_h["altura_quintil"] = pd.qcut(df_h["altura_cm"], q=5, labels=[f"Q{i}" for i in range(1, 6)])

altura_table = (
    df_h.groupby("altura_quintil", as_index=False, observed=False)
        .agg(
            n_items=("ticket_id", "size"),
            pct_interv=("cambia_talla", "mean"),
            delta_p_mean=("delta_p_final", "mean"),
            savings_total=("expected_savings_talla", "sum"),
            savings_mean_interv=("expected_savings_talla",
                                lambda s: float(np.mean(s[df_h.loc[s.index, "cambia_talla"] == 1]))
                                if df_h.loc[s.index, "cambia_talla"].sum() > 0 else 0.0),
        )
)

print(" PAGE 3 | Height quintiles")
display(altura_table)

altura_table.to_csv(os.path.join(BI_DIR, "altura_quintiles.csv"), index=False)
print(" Exported:", os.path.join(BI_DIR, "altura_quintiles.csv"))


 PAGE 3 | Height quintiles


Unnamed: 0,altura_quintil,n_items,pct_interv,delta_p_mean,savings_total,savings_mean_interv
0,Q1,39045,0.261109,0.044892,6286.635254,0.616639
1,Q2,38264,0.121707,0.014905,2237.173828,0.48039
2,Q3,38048,0.253417,0.046049,6562.776855,0.680645
3,Q4,38777,0.117699,0.029775,4468.647949,0.979108
4,Q5,37820,0.185061,0.037616,5519.11377,0.788557


 Exported: data/bi\altura_quintiles.csv
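
The height quintiles above are not exactly equal in size (39,045 items in Q1 against 37,820 in Q5). That is consistent with how `pd.qcut` behaves when many rows share a value at a bin edge: tied values all land in the same bin. The toy series below (hypothetical heights) is a minimal sketch of that effect; `retbins=True` returns the edges so they can be inspected.

```python
import pandas as pd

# Toy heights in cm (hypothetical values) with a run of ties at 160.
heights = pd.Series([150, 155, 160, 160, 160, 165, 170, 175, 180, 185], name="altura_cm")

quintile, edges = pd.qcut(
    heights,
    q=5,
    labels=[f"Q{i}" for i in range(1, 6)],
    retbins=True,  # also return the bin edges that were used
)

# Rows per quintile, in quintile order.
counts = quintile.value_counts(sort=False)
print("edges:", edges)
print(counts)
```

If equal-sized groups matter for the BI page, ranking first (for example with `Series.rank(method="first")`) and cutting the ranks is a common workaround, at the cost of splitting tied values arbitrarily.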


In [109]:
# PAGE 3.5 | Intervention by global risk decile

if "p_dev_global" not in df.columns:
    print(" p_dev_global is missing from bi_items_full. This table cannot be generated.")
else:
    df_r = df.copy()
    df_r["p_dev_global"] = pd.to_numeric(df_r["p_dev_global"], errors="coerce")
    df_r = df_r[df_r["p_dev_global"].notna()].copy()

    if len(df_r) == 0:
        print(" p_dev_global is all NA. Check the merge with the global model.")
    else:
        # Decile 1 = lowest predicted return risk, decile 10 = highest.
        df_r["risk_decile"] = pd.qcut(df_r["p_dev_global"], q=10, labels=list(range(1, 11)))

        risk_table = (
            df_r.groupby("risk_decile", as_index=False, observed=False)
                .agg(
                    n_items=("ticket_id", "size"),
                    pct_interv=("cambia_talla", "mean"),
                    delta_p_mean=("delta_p_final", "mean"),
                    savings_total=("expected_savings_talla", "sum"),
                )
        )

        print(" PAGE 3 | Global risk by decile")
        display(risk_table)

        risk_table.to_csv(os.path.join(BI_DIR, "risk_deciles.csv"), index=False)
        print(" Exported:", os.path.join(BI_DIR, "risk_deciles.csv"))


 PAGE 3 | Global risk by decile


Unnamed: 0,risk_decile,n_items,pct_interv,delta_p_mean,savings_total
0,1,13376,0.007177,0.000448,15.580882
1,2,13376,0.013457,0.001406,51.604553
2,3,13375,0.034841,0.00451,162.788498
3,4,13376,0.057865,0.009108,329.474854
4,5,13376,0.091283,0.015654,606.241577
5,6,13375,0.144,0.024108,1049.237915
6,7,13376,0.264279,0.042712,2101.927002
7,8,13375,0.312822,0.056478,2973.012695
8,9,13376,0.371112,0.069669,3603.990723
9,10,13376,0.577751,0.120945,6492.715332


 Exported: data/bi\risk_deciles.csv
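
To read the decile table as a concentration statement, the savings per decile can be turned into shares and accumulated from the riskiest decile downwards. The sketch below only re-uses the `savings_total` column printed above; the column names `share` and `cum_share_from_top` are illustrative. Note that the decile rows add up to 133,757 items, fewer than the 191,954 items in `df`, because rows without `p_dev_global` are dropped before the cut, so the total here is lower than the overall 25,074 €.

```python
import pandas as pd

# savings_total per risk decile, copied from the table above.
risk = pd.DataFrame({
    "risk_decile": list(range(1, 11)),
    "savings_total": [
        15.580882, 51.604553, 162.788498, 329.474854, 606.241577,
        1049.237915, 2101.927002, 2973.012695, 3603.990723, 6492.715332,
    ],
})

total = risk["savings_total"].sum()
risk["share"] = risk["savings_total"] / total

# Cumulative share starting at decile 10 (highest risk) and moving down.
risk["cum_share_from_top"] = risk["share"][::-1].cumsum()[::-1]

print(f"total savings with p_dev_global available: {total:.2f} EUR")
print(risk.round(4))
```

Read from the bottom row upwards, `cum_share_from_top` shows how much of the expected savings would be kept if interventions were limited to the highest-risk deciles.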


In [110]:
# PAGE 3.6 | Impact distribution and savings concentration

df_i = df[df["cambia_talla"] == 1].copy()

if len(df_i) == 0:
    print(" No interventions found (cambia_talla=1). Check min_gain.")
else:
    deltas = df_i["delta_p_final"].to_numpy()

    pct = np.percentile(deltas, [25, 50, 75, 90, 95, 99])
    dist_table = pd.DataFrame({
        "percentil": ["P25", "P50", "P75", "P90", "P95", "P99"],
        "delta_p_final": pct
    })

    # Savings concentration: share captured by the top X% of interventions
    df_i_sorted = df_i.sort_values("expected_savings_talla", ascending=False).reset_index(drop=True)
    total_savings = float(df_i_sorted["expected_savings_talla"].sum())

    def _share_top(frac: float) -> float:
        k = max(1, int(len(df_i_sorted) * frac))
        return float(df_i_sorted.iloc[:k]["expected_savings_talla"].sum() / (total_savings + 1e-9))

    conc_table = pd.DataFrame({
        "top_frac_interv": ["Top 1%", "Top 5%", "Top 10%", "Top 20%"],
        "share_ahorro": [_share_top(0.01), _share_top(0.05), _share_top(0.10), _share_top(0.20)]
    })

    print(" PAGE 3 | Percentiles of delta_p_final (interventions only)")
    display(dist_table)

    print(" PAGE 3 | Savings concentration (interventions only)")
    display(conc_table)

    dist_table.to_csv(os.path.join(BI_DIR, "impact_percentiles_interv.csv"), index=False)
    conc_table.to_csv(os.path.join(BI_DIR, "savings_concentration.csv"), index=False)

    print(" Exported:")
    print("-", os.path.join(BI_DIR, "impact_percentiles_interv.csv"))
    print("-", os.path.join(BI_DIR, "savings_concentration.csv"))


 PAGE 3 | Percentiles of delta_p_final (interventions only)


Unnamed: 0,percentil,delta_p_final
0,P25,0.081811
1,P50,0.188258
2,P75,0.261063
3,P90,0.336538
4,P95,0.390156
5,P99,0.452681


 PAGE 3 | Savings concentration (interventions only)


Unnamed: 0,top_frac_interv,share_ahorro
0,Top 1%,0.034507
1,Top 5%,0.143062
2,Top 10%,0.254063
3,Top 20%,0.433335


 Exported:
- data/bi\impact_percentiles_interv.csv
- data/bi\savings_concentration.csv
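
The concentration table above sorts the interventions once and sums a slice per fraction. A possible variant, sketched below, computes a single cumulative sum and reads every fraction from it, which is convenient if many cut points are needed (for example to draw a full concentration curve). The function name `top_share` and the toy vector are assumptions for illustration; the rounding rule for `k` mirrors `_share_top`.

```python
import numpy as np

def top_share(savings, fracs):
    """Share of total savings captured by the top `frac` of items, per frac."""
    s = np.sort(np.asarray(savings, dtype="float64"))[::-1]  # descending order
    if s.size == 0:
        return {frac: 0.0 for frac in fracs}
    cum = np.cumsum(s)
    total = cum[-1]
    shares = {}
    for frac in fracs:
        k = max(1, int(len(s) * frac))  # at least one item, as in _share_top
        shares[frac] = float(cum[k - 1] / total) if total > 0 else 0.0
    return shares

# Toy savings vector (hypothetical values).
toy_savings = [10.0, 5.0, 3.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shares = top_share(toy_savings, [0.1, 0.2, 0.5])
print(shares)
```

On the notebook data this would be called as `top_share(df_i["expected_savings_talla"], [0.01, 0.05, 0.10, 0.20])`, assuming `df_i` from the cell above is still in scope.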


In [111]:
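# Re-rank the PAGE 2 top products per category, keeping only products with
# at least MIN_N_PROD_CAT items so that small samples do not enter the ranking.
MIN_N_PROD_CAT = 200  # minimum items per product; tune as needed (e.g. 100/200/500)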

tmp = by_product_cat[by_product_cat["n_items"] >= MIN_N_PROD_CAT].copy()

tmp["rank_in_cat"] = (
    tmp.groupby("categoria")["savings_total"]
       .rank(method="first", ascending=False)
       .astype(int)
)

TOPN = 10
by_product_top_by_category = (
    tmp[tmp["rank_in_cat"] <= TOPN]
    .sort_values(["categoria", "rank_in_cat"])
    .reset_index(drop=True)
)

display(by_product_top_by_category.head(30))
by_product_top_by_category.to_csv(os.path.join(BI_DIR, "by_product_top_by_category.csv"), index=False)


Unnamed: 0,categoria,id_producto,n_items,n_orders,pct_interv,savings_total,savings_mean_interv,rank_in_cat
0,abrigo,P047,11004,10789,0.848237,6711.040527,0.718989,1
1,abrigo,P013,13200,12866,0.122576,2112.063721,1.305355,2
2,abrigo,P020,10665,10460,0.123019,1744.149536,1.329382,3
3,abrigo,P005,10722,10498,0.122738,1729.842285,1.31447,4
4,abrigo,P058,10660,10424,0.124953,1612.811279,1.210819,5
5,abrigo,P069,3207,3161,0.135329,523.081299,1.205256,6
6,abrigo,P059,3027,2992,0.131483,500.307251,1.257053,7
7,camisa,P042,11017,10773,0.508941,1245.537598,0.22214,1
8,camisa,P066,3052,3006,0.13401,263.302612,0.643772,2
9,camisa,P062,3130,3092,0.125879,254.271744,0.64536,3
