# 03 - Feature Creation & Data Preprocessing Pipeline
## Project: Store Sales Forecasting

**Objective**: Build a robust preprocessing pipeline that covers:
- Creating derived features
- Imputing missing values
- Encoding categorical variables
- Transforming numerical variables
- Normalization/scaling

In [18]:
import os
import pandas as pd
import numpy as np
import joblib
import warnings
warnings.filterwarnings('ignore')

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

from feature_engine.imputation import MeanMedianImputer, CategoricalImputer
from feature_engine.encoding import CountFrequencyEncoder
from feature_engine.transformation import LogTransformer 
from feature_engine.selection import DropFeatures

# Load the project's custom operators
import operators

print("✅ Libraries loaded successfully")

✅ Libraries loaded successfully


# 1. Loading and Exploring the Dataset

In [2]:
# Load the dataset
data_train = pd.read_csv('../data/raw/stores_sales_forecasting_updated_v3.1.csv', 
                          sep=';',
                          encoding='utf-8')

print(f"📊 Dataset loaded: {data_train.shape}")
print(f"\n📋 First rows:")
data_train.head()

📊 Dataset loaded: (2121, 22)

📋 First rows:


Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,849,CA-2017-107503,1/01/2017,6/01/2017,Standard Class,GA-14725,Guy Armstrong,Consumer,United States,Lorain,...,44052,East,FUR-FU-10003878,Furniture,Furnishings,"Linden 10"" Round Wall Clock, Black",48.896,4,0.2,8.5568
1,4010,CA-2017-144463,1/01/2017,5/01/2017,Standard Class,SC-20725,Steven Cartwright,Consumer,United States,Los Angeles,...,90036,West,FUR-FU-10001215,Furniture,Furnishings,"Howard Miller 11-1/2"" Diameter Brentwood Wall ...",474.43,11,0.0,199.2606
2,8071,CA-2017-151750,1/01/2017,5/01/2017,Standard Class,JM-15250,Janet Martin,Consumer,United States,Huntsville,...,77340,Central,FUR-FU-10002116,Furniture,Furnishings,"Tenex Carpeted, Granite-Look or Clear Contempo...",141.42,5,0.6,-187.3815
3,8072,CA-2017-151750,1/01/2017,5/01/2017,Standard Class,JM-15250,Janet Martin,Consumer,United States,Huntsville,...,77340,Central,FUR-CH-10003199,Furniture,Chairs,Office Star - Contemporary Task Swivel Chair,310.744,4,0.3,-26.6352
4,867,CA-2014-149020,10/01/2014,15/01/2014,Standard Class,AJ-10780,Anthony Jacobs,Corporate,United States,Springfield,...,22153,South,FUR-FU-10000965,Furniture,Furnishings,"Howard Miller 11-1/2"" Diameter Ridgewood Wall ...",51.94,1,0.0,21.2954


In [3]:
# General information about the dataset
print("=== DATASET INFORMATION ===")
print(f"Total records: {len(data_train):,}")
print(f"Total columns: {len(data_train.columns)}")
print(f"\nMemory used: {data_train.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"\n📊 Data types:")
print(data_train.dtypes.value_counts())

=== DATASET INFORMATION ===
Total records: 2,121
Total columns: 22

Memory used: 2.34 MB

📊 Data types:
object     16
int64       3
float64     3
Name: count, dtype: int64


# 2. Feature Engineering - Creating Derived Variables

In [4]:
# Convert dates to datetime
print("📅 Converting dates...")
data_train['Order Date'] = pd.to_datetime(data_train['Order Date'], dayfirst=True, errors='coerce')
data_train['Ship Date'] = pd.to_datetime(data_train['Ship Date'], dayfirst=True, errors='coerce')

# Create date-derived variables
print("🔧 Creating derived features...")
data_train['Order_Month'] = data_train['Order Date'].dt.month
data_train['Order_Quarter'] = data_train['Order Date'].dt.quarter
data_train['Days to Ship'] = (data_train['Ship Date'] - data_train['Order Date']).dt.days

# Make sure the numerical columns have numeric dtypes
print("🔢 Converting numerical variables...")
numeric_cols = ['Postal Code', 'Discount', 'Quantity', 'Order_Month', 'Order_Quarter', 'Days to Ship']
for col in numeric_cols:
    data_train[col] = pd.to_numeric(data_train[col], errors='coerce')

print("\n✅ Derived features created:")
print("   - Order_Month: Month of the order")
print("   - Order_Quarter: Quarter of the order")
print("   - Days to Ship: Days between order and shipment")

print(f"\n📊 Updated dataset: {data_train.shape}")
data_train.head()

📅 Converting dates...
🔧 Creating derived features...
🔢 Converting numerical variables...

✅ Derived features created:
   - Order_Month: Month of the order
   - Order_Quarter: Quarter of the order
   - Days to Ship: Days between order and shipment

📊 Updated dataset: (2121, 25)


Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Order_Month,Order_Quarter,Days to Ship
0,849,CA-2017-107503,2017-01-01,2017-01-06,Standard Class,GA-14725,Guy Armstrong,Consumer,United States,Lorain,...,Furniture,Furnishings,"Linden 10"" Round Wall Clock, Black",48.896,4,0.2,8.5568,1,1,5
1,4010,CA-2017-144463,2017-01-01,2017-01-05,Standard Class,SC-20725,Steven Cartwright,Consumer,United States,Los Angeles,...,Furniture,Furnishings,"Howard Miller 11-1/2"" Diameter Brentwood Wall ...",474.43,11,0.0,199.2606,1,1,4
2,8071,CA-2017-151750,2017-01-01,2017-01-05,Standard Class,JM-15250,Janet Martin,Consumer,United States,Huntsville,...,Furniture,Furnishings,"Tenex Carpeted, Granite-Look or Clear Contempo...",141.42,5,0.6,-187.3815,1,1,4
3,8072,CA-2017-151750,2017-01-01,2017-01-05,Standard Class,JM-15250,Janet Martin,Consumer,United States,Huntsville,...,Furniture,Chairs,Office Star - Contemporary Task Swivel Chair,310.744,4,0.3,-26.6352,1,1,4
4,867,CA-2014-149020,2014-01-10,2014-01-15,Standard Class,AJ-10780,Anthony Jacobs,Corporate,United States,Springfield,...,Furniture,Furnishings,"Howard Miller 11-1/2"" Diameter Ridgewood Wall ...",51.94,1,0.0,21.2954,1,1,5
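
The raw dates are day-first strings such as `1/01/2017` and `10/01/2014`. A minimal sketch, using two hypothetical rows rather than the real file, of how `dayfirst=True` drives the three derived features:

```python
import pandas as pd

# Two hypothetical rows in the same day/month/year layout as the raw file.
demo = pd.DataFrame({
    "Order Date": ["10/01/2014", "1/01/2017"],
    "Ship Date": ["15/01/2014", "6/01/2017"],
})

for col in ["Order Date", "Ship Date"]:
    # dayfirst=True reads 10/01/2014 as 10 January, not 1 October.
    demo[col] = pd.to_datetime(demo[col], dayfirst=True, errors="coerce")

demo["Order_Month"] = demo["Order Date"].dt.month
demo["Order_Quarter"] = demo["Order Date"].dt.quarter
demo["Days to Ship"] = (demo["Ship Date"] - demo["Order Date"]).dt.days
print(demo)
```

Because `errors="coerce"` turns unparseable dates into `NaT`, it may be worth counting nulls in both date columns right after parsing.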


# 3. Creating Artificial Null Values

In [5]:
# Check whether nulls already exist
existing_nulls = data_train.isnull().sum()
has_nulls = existing_nulls[existing_nulls > 0]

if len(has_nulls) == 0:
    print("⚠️  No null values found. Creating some artificially to demonstrate the pipeline...\n")

    np.random.seed(42)

    # Null out 5% of the rows in each column: three categorical, then two numerical.
    # Each column draws its own independent sample of row indices, in this order.
    sample_size = int(len(data_train) * 0.05)
    columns_to_null = ['Ship Mode', 'Segment', 'Sub-Category', 'Quantity', 'Discount']

    for col in columns_to_null:
        null_indices = np.random.choice(data_train.index, sample_size, replace=False)
        data_train.loc[null_indices, col] = np.nan

    print("✅ Null values created artificially\n")
else:
    print("ℹ️  The dataset already contains null values\n")

# Show a summary of null values
print("=== NULL VALUES PER VARIABLE ===")
null_summary = data_train.isnull().sum()
null_summary = null_summary[null_summary > 0].sort_values(ascending=False)
if len(null_summary) > 0:
    for col, count in null_summary.items():
        pct = (count / len(data_train)) * 100
        print(f"{col:20s}: {count:5d} ({pct:5.2f}%)")
else:
    print("There are no null values in the dataset")

⚠️  No null values found. Creating some artificially to demonstrate the pipeline...

✅ Null values created artificially

=== NULL VALUES PER VARIABLE ===
Ship Mode           :   106 ( 5.00%)
Segment             :   106 ( 5.00%)
Sub-Category        :   106 ( 5.00%)
Quantity            :   106 ( 5.00%)
Discount            :   106 ( 5.00%)
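
The null injection can also be written as a reusable function that leaves the input untouched. This is a sketch only: the helper name `inject_nulls` is hypothetical, and it uses `np.random.default_rng` rather than `np.random.seed`, so it will not reproduce the exact row indices chosen in the cell above.

```python
import numpy as np
import pandas as pd

def inject_nulls(df, columns, frac=0.05, seed=42):
    """Return a copy of df with `frac` of the rows set to NaN in each column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_null = int(len(out) * frac)
    for col in columns:
        # Each column gets its own independent sample of row labels.
        idx = rng.choice(out.index, size=n_null, replace=False)
        out.loc[idx, col] = np.nan
    return out

demo = pd.DataFrame({
    "Quantity": np.arange(1.0, 101.0),
    "Segment": ["Consumer"] * 100,
})
demo_nulls = inject_nulls(demo, ["Quantity", "Segment"])
print(demo_nulls.isnull().sum())
```

With 2,121 rows, `int(2121 * 0.05)` gives the 106 nulls per column reported above.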


In [6]:
# Check the available variables
print("=== AVAILABLE VARIABLES ===")
columnas_info = pd.DataFrame({
    'Variable': data_train.columns,
    'Type': data_train.dtypes.values,
    'Nulls': data_train.isnull().sum().values,
    'Unique': [data_train[col].nunique() for col in data_train.columns]
})
print(columnas_info.to_string(index=False))

=== AVAILABLE VARIABLES ===
     Variable           Type  Nulls  Unique
       Row ID          int64      0    2121
     Order ID         object      0    1764
   Order Date datetime64[ns]      0     889
    Ship Date datetime64[ns]      0     960
    Ship Mode         object    106       4
  Customer ID         object      0     707
Customer Name         object      0     707
      Segment         object    106       3
      Country         object      0       1
         City         object      0     371
        State         object      0      48
       Branch         object      0     427
  Postal Code          int64      0     454
       Region         object      0       4
   Product ID         object      0     375
     Category         object      0       1
 Sub-Category         object    106       4
 Product Name         object      0     380
        Sales        float64      0    1636
     Quantity        float64    106      14
     Discount        float64    106      11
 

# 4. Configuring Variables for the Pipeline

In [7]:
# ===== VARIABLE CONFIGURATION =====

# Categorical variables with missing values - imputed with the label 'Missing'
CATEGORICAL_VARS_WITH_NA_MISSING = ['Segment']

# Categorical variables with missing values - imputed with the most frequent category
CATEGORICAL_VARS_WITH_NA_FREQUENT = ['Sub-Category']

# Numerical variables with missing values - imputed with the mean
NUMERICAL_VARS_WITH_NA = ['Quantity', 'Discount']

# Variables to drop (they add nothing to the model)
DROP_FEATURES = [
    'Row ID', 'Order ID', 'Customer ID', 'Customer Name', 
    'Order Date', 'Ship Date', 'Branch', 'Postal Code',
    'Product ID', 'Product Name'
]

# Variables for log transformation
NUMERICAL_LOG_VARS = ['Quantity']

# Variables for ordinal encoding (quality/priority)
QUAL_VARS = ['Ship Mode']

# Mappings for ordinal categorical variables
QUAL_MAPPINGS = {
    'Standard Class': 1, 
    'Second Class': 2, 
    'First Class': 3
}

# Categorical variables for frequency encoding
CATEGORICAL_VARS = [
    'Segment', 'Sub-Category', 'Country', 'City',
    'State', 'Region', 'Category'
]

# ⚠️ IMPORTANT: do NOT include 'Profit', to avoid data leakage
# Profit is highly correlated with Sales (the target variable)
FEATURES = [
    'Quantity', 'Discount', 'Ship Mode', 'Segment', 
    'Country', 'City', 'State', 'Region', 'Category', 'Sub-Category',
    'Order_Month', 'Order_Quarter', 'Days to Ship'
]

print("✅ Variable configuration defined")
print(f"\n📊 Total final features: {len(FEATURES)}")
print(f"\n📋 Selected features:")
for i, feat in enumerate(FEATURES, 1):
    print(f"   {i:2d}. {feat}")

✅ Variable configuration defined

📊 Total final features: 13

📋 Selected features:
    1. Quantity
    2. Discount
    3. Ship Mode
    4. Segment
    5. Country
    6. City
    7. State
    8. Region
    9. Category
   10. Sub-Category
   11. Order_Month
   12. Order_Quarter
   13. Days to Ship
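
In the configuration above, `Ship Mode` received artificial nulls in section 3 but does not appear in any of the three imputation lists. A small check along the following lines could surface that kind of gap before the pipeline is built. It is a sketch on a hypothetical four-row frame, and the helper name `unhandled_null_columns` is an assumption, not part of the project:

```python
import numpy as np
import pandas as pd

def unhandled_null_columns(df, features, imputed_vars):
    """List feature columns that contain NaN but are not covered by any imputer."""
    with_nulls = [col for col in features if df[col].isnull().any()]
    return [col for col in with_nulls if col not in imputed_vars]

# Hypothetical sample: both columns contain a null, only one is imputed.
demo = pd.DataFrame({
    "Ship Mode": ["Standard Class", np.nan, "First Class", "Second Class"],
    "Segment": ["Consumer", np.nan, "Corporate", "Consumer"],
})
imputed = ["Segment"]

print(unhandled_null_columns(demo, ["Ship Mode", "Segment"], imputed))
```

In the notebook, `imputed_vars` would be the concatenation of `CATEGORICAL_VARS_WITH_NA_MISSING`, `CATEGORICAL_VARS_WITH_NA_FREQUENT` and `NUMERICAL_VARS_WITH_NA`.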


# 5. Train/Test Split


In [8]:
# Split
print("📊 Preparing the Train/Test split...")

# Separate features and target
X = data_train.drop(['Sales'], axis=1)
y = data_train['Sales']

# Positional 80/20 split without shuffling. Rows stay in file order, which is not
# chronological here (a 2014 order follows 2017 orders in the preview above), so this
# is a temporal split only if the data is first sorted by 'Order Date'.
split_index = int(len(data_train) * 0.8)

x_train = X.iloc[:split_index].copy()
x_test = X.iloc[split_index:].copy()
y_train = y.iloc[:split_index].copy()
y_test = y.iloc[split_index:].copy()

print("\n✅ Split completed:")
print(f"   Train set: {x_train.shape[0]:,} records ({(len(x_train)/len(X))*100:.1f}%)")
print(f"   Test set:  {x_test.shape[0]:,} records ({(len(x_test)/len(X))*100:.1f}%)")
print(f"\n📋 Columns in x_train: {x_train.shape[1]}")
print(f"📋 Expected columns in FEATURES: {len(FEATURES)}")

# Check that the derived features are present
derived_features = ['Order_Month', 'Order_Quarter', 'Days to Ship']
missing_features = [f for f in derived_features if f not in x_train.columns]

if missing_features:
    print(f"\n❌ ERROR: Missing derived features: {missing_features}")
else:
    print(f"\n✅ All derived features are present")
    
print(f"\n📊 Target variable statistics (Sales):")
print(f"   Train - Mean: ${y_train.mean():.2f}, Std: ${y_train.std():.2f}")
print(f"   Test  - Mean: ${y_test.mean():.2f}, Std: ${y_test.std():.2f}")

📊 Preparing the Train/Test split...

✅ Split completed:
   Train set: 1,696 records (80.0%)
   Test set:  425 records (20.0%)

📋 Columns in x_train: 24
📋 Expected columns in FEATURES: 13

✅ All derived features are present

📊 Target variable statistics (Sales):
   Train - Mean: $345.90, Std: $487.42
   Test  - Mean: $365.54, Std: $562.03
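
A positional split only respects time if the rows are in date order, and the previews above show a 2014 order listed after 2017 orders. One possible way to get a chronological split is to sort by `Order Date` first. The sketch below uses hypothetical rows, and the helper name `chronological_split` is an assumption:

```python
import pandas as pd

def chronological_split(df, date_col, train_frac=0.8):
    """Sort by date_col, then cut positionally so the test set is the most recent slice."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut].copy(), ordered.iloc[cut:].copy()

# Hypothetical rows, deliberately out of date order.
demo = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2017-01-01", "2014-01-10", "2016-05-03", "2015-07-21", "2017-11-30"]
    ),
    "Sales": [48.9, 51.9, 120.0, 75.5, 310.7],
})
train_demo, test_demo = chronological_split(demo, "Order Date")

# Every training date is on or before every test date.
print(train_demo["Order Date"].max() <= test_demo["Order Date"].min())
```

Sorting would change which rows land in each split, so the statistics printed above would no longer apply.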


# 6. Building the Preprocessing Pipeline

In [9]:
# Identify the features to drop
all_features = set(x_train.columns)
features_to_keep = set(FEATURES)
features_to_drop = list(all_features.difference(features_to_keep))

print(f"✅ Features to keep: {len(features_to_keep)}")
print(f"❌ Features to drop: {len(features_to_drop)}")
print(f"\nFeatures that will be dropped:")
for feat in sorted(features_to_drop):
    print(f"   - {feat}")

✅ Features to keep: 13
❌ Features to drop: 11

Features that will be dropped:
   - Branch
   - Customer ID
   - Customer Name
   - Order Date
   - Order ID
   - Postal Code
   - Product ID
   - Product Name
   - Profit
   - Row ID
   - Ship Date


In [10]:
# Build the preprocessing pipeline
print("🔧 Building the preprocessing pipeline...\n")

stores_sales_forecasting_data_pre_proc = Pipeline([
    # 0. Feature selection
    ('drop_features',
     DropFeatures(features_to_drop=features_to_drop)),
    
    # 1. Categorical imputation - 'missing' method
    ('cat_missing_imputation',
     CategoricalImputer(
         imputation_method='missing', 
         variables=CATEGORICAL_VARS_WITH_NA_MISSING
     )),
    
    # 2. Categorical imputation - 'frequent' method
    ('cat_missing_freq_imputation',
     CategoricalImputer(
         imputation_method='frequent', 
         variables=CATEGORICAL_VARS_WITH_NA_FREQUENT
     )),
    
    # 3. Numerical imputation - mean
    ('mean_imputation',
     MeanMedianImputer(
         imputation_method='mean', 
         variables=NUMERICAL_VARS_WITH_NA
     )),
    
    # 4. Ordinal encoding (Ship Mode: Standard < Second < First)
    # 'mappins' is the keyword name this custom Mapper accepts
    ('quality_mapper',
     operators.Mapper(
         variables=QUAL_VARS, 
         mappins=QUAL_MAPPINGS
     )),
    
    # 5. Count (frequency) encoding for categorical variables
    ('cat_freq_encode',
     CountFrequencyEncoder(
         encoding_method='count', 
         variables=CATEGORICAL_VARS
     )),
    
    # 6. Log transformation of skewed variables
    ('continues_log_transform',
     LogTransformer(variables=NUMERICAL_LOG_VARS)),
    
    # 7. MinMax normalization (0-1)
    ('Variable_scaler',
     MinMaxScaler())
])

print("✅ Pipeline built with 8 steps:")
for i, (name, _) in enumerate(stores_sales_forecasting_data_pre_proc.steps, 1):
    print(f"   {i}. {name}")

🔧 Building the preprocessing pipeline...

✅ Pipeline built with 8 steps:
   1. drop_features
   2. cat_missing_imputation
   3. cat_missing_freq_imputation
   4. mean_imputation
   5. quality_mapper
   6. cat_freq_encode
   7. continues_log_transform
   8. Variable_scaler
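
The `operators` module is not shown in this notebook. The following is a hypothetical sketch of what a dictionary-based mapper such as `operators.Mapper` might do, under the assumption that it relies on `Series.map`. The class name `MapperSketch`, its `mappings` parameter and the `Same Day` category are illustrative, not the project's code or data:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MapperSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for operators.Mapper: dictionary-based ordinal encoding."""

    def __init__(self, variables, mappings):
        self.variables = variables
        self.mappings = mappings

    def fit(self, X, y=None):
        # Nothing to learn; the mapping is supplied by the caller.
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            # Series.map returns NaN for values that are missing from the dictionary.
            X[var] = X[var].map(self.mappings)
        return X

demo = pd.DataFrame({"Ship Mode": ["Standard Class", "First Class", "Same Day", np.nan]})
mapper = MapperSketch(
    variables=["Ship Mode"],
    mappings={"Standard Class": 1, "Second Class": 2, "First Class": 3},
)
result = mapper.fit_transform(demo)
print(result)
```

If the real operator behaves this way, any `Ship Mode` value absent from `QUAL_MAPPINGS`, and any missing value, would reach `MinMaxScaler` as NaN. The variable table in section 3 lists 4 unique `Ship Mode` values, while the mapping defines 3.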


# 7. Fitting the Pipeline

In [11]:
print("🚀 Fitting the preprocessing pipeline...\n")

# Fit the pipeline on the training data
stores_sales_forecasting_data_pre_proc.fit(x_train, y_train)

print("✅ Pipeline fitted successfully")
print("\n📊 Pipeline is ready to transform data")

🚀 Fitting the preprocessing pipeline...

✅ Pipeline fitted successfully

📊 Pipeline is ready to transform data


In [12]:
# Check the transformation on a small sample
print("🧪 Testing the pipeline on a data sample...\n")

sample = x_train.head(3)
sample_transformed = stores_sales_forecasting_data_pre_proc.transform(sample)

print(f"✅ Transformation succeeded")
print(f"   Original shape: {sample.shape}")
print(f"   Transformed shape: {sample_transformed.shape}")
print(f"   Expected features: {len(FEATURES)}")

if sample_transformed.shape[1] == len(FEATURES):
    print("\n✅ The number of features matches")
else:
    print(f"\n⚠️  WARNING: Expected {len(FEATURES)} features, got {sample_transformed.shape[1]}")

🧪 Testing the pipeline on a data sample...

✅ Transformation succeeded
   Original shape: (3, 24)
   Transformed shape: (3, 13)
   Expected features: 13

✅ The number of features matches


# 8. Transforming and Saving the Processed Data

In [13]:
def save_processed_data(X, y, str_df_name):
    """
    Transform the data with the fitted pipeline and save the result.
    
    Parameters:
    -----------
    X : DataFrame
        Unprocessed features
    y : Series
        Target variable
    str_df_name : str
        Output file name (without extension)
    """
    print(f"\n🔄 Processing {str_df_name}...")
    
    # Transform the data
    X_transformed = stores_sales_forecasting_data_pre_proc.transform(X)
    
    # Build a DataFrame with the transformed features.
    # Caution: the columns are labeled positionally with FEATURES. DropFeatures keeps
    # the input column order, so the labels only match the values if FEATURES is
    # listed in that same order.
    df_X_transformed = pd.DataFrame(
        data=X_transformed, 
        columns=FEATURES
    )
    
    # Reset the index of y so it can be concatenated
    y_reset = y.reset_index(drop=True)
    
    # Concatenate features and target
    df_transformed = pd.concat(
        [df_X_transformed, y_reset.rename('Sales')], 
        axis=1
    )
    
    # Save the file
    output_path = f"../data/interim/proc_{str_df_name}.csv"
    df_transformed.to_csv(output_path, index=False)
    
    print(f"✅ Processed data saved: {output_path}")
    print(f"   Shape: {df_transformed.shape}")
    print(f"   Columns: {list(df_transformed.columns)}")
    
    return df_transformed

print("✅ Function save_processed_data() defined")

✅ Function save_processed_data() defined


## 8.1 Processing the Training Data

In [14]:
# Process and save the training data
df_train_processed = save_processed_data(
    x_train, 
    y_train, 
    str_df_name="data_train"
)

# Show the first rows
print("\n📊 First rows of the processed training data:")
df_train_processed.head()


🔄 Processing data_train...
✅ Processed data saved: ../data/interim/proc_data_train.csv
   Shape: (1696, 14)
   Columns: ['Quantity', 'Discount', 'Ship Mode', 'Segment', 'Country', 'City', 'State', 'Region', 'Category', 'Sub-Category', 'Order_Month', 'Order_Quarter', 'Days to Ship', 'Sales']

📊 First rows of the processed training data:


Unnamed: 0,Quantity,Discount,Ship Mode,Segment,Country,City,State,Region,Category,Sub-Category,Order_Month,Order_Quarter,Days to Ship,Sales
0,0.0,1.0,0.0,0.014706,0.22063,0.643885,0.0,1.0,0.525299,0.285714,0.0,0.0,0.714286,48.896
1,0.0,1.0,0.0,0.852941,1.0,1.0,0.0,1.0,0.908618,0.0,0.0,0.0,0.571429,474.43
2,0.0,1.0,0.0,0.051471,0.469914,0.413669,0.0,1.0,0.609853,0.857143,0.0,0.0,0.571429,141.42
3,0.0,1.0,0.0,0.051471,0.469914,0.413669,0.0,0.471698,0.525299,0.428571,0.0,0.0,0.571429,310.744
4,0.0,0.522523,0.0,0.132353,0.120344,0.0,0.0,1.0,0.0,0.251576,0.0,0.0,0.714286,51.94
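
One thing worth checking in the preview above: `DropFeatures` keeps the columns in their original DataFrame order, while the output is labeled with `FEATURES`, whose order differs. The row 0 value 0.525299 under `Category`, for example, equals ln 4 / ln 14, which is what a log-transformed, min-max-scaled `Quantity` of 4 would give if quantities range from 1 to 14. A minimal sketch, assuming the pipeline preserves the input column order, of deriving the labels from the input frame instead (the helper name `ordered_feature_names` is hypothetical):

```python
import pandas as pd

def ordered_feature_names(X, features):
    """Return the kept features in the order they appear in X, not in `features`."""
    keep = set(features)
    return [col for col in X.columns if col in keep]

# Hypothetical frame whose column order differs from the FEATURES-style list.
raw = pd.DataFrame({
    "Row ID": [849],
    "Ship Mode": ["Standard Class"],
    "Quantity": [4],
    "Discount": [0.2],
})
features = ["Quantity", "Discount", "Ship Mode"]

names = ordered_feature_names(raw, features)
print(names)
```

Passing a list built this way as `columns=` would keep each label attached to its own values, at the cost of a different column order in the saved CSV files.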


## 8.2 Processing the Test Data

In [15]:
# Process and save the test data
df_test_processed = save_processed_data(
    x_test, 
    y_test, 
    str_df_name="data_test"
)

# Show the first rows
print("\n📊 First rows of the processed test data:")
df_test_processed.head()


🔄 Processing data_test...
✅ Processed data saved: ../data/interim/proc_data_test.csv
   Shape: (425, 14)
   Columns: ['Quantity', 'Discount', 'Ship Mode', 'Segment', 'Country', 'City', 'State', 'Region', 'Category', 'Sub-Category', 'Order_Month', 'Order_Quarter', 'Days to Ship', 'Sales']

📊 First rows of the processed test data:


Unnamed: 0,Quantity,Discount,Ship Mode,Segment,Country,City,State,Region,Category,Sub-Category,Order_Month,Order_Quarter,Days to Ship,Sales
0,0.0,1.0,0.0,0.580882,0.266476,0.643885,0.0,1.0,0.26265,0.285714,0.545455,0.666667,0.571429,69.008
1,0.0,1.0,0.0,0.051471,1.0,1.0,0.0,1.0,0.609853,0.0,0.545455,0.666667,0.571429,215.65
2,0.5,0.522523,0.0,0.095588,0.295129,0.413669,0.0,1.0,0.787949,0.857143,0.545455,0.666667,0.571429,60.288
3,,0.522523,0.0,0.095588,0.295129,0.413669,0.0,0.471698,0.26265,0.428571,0.545455,0.666667,0.571429,253.372
4,0.5,1.0,0.0,0.852941,1.0,1.0,0.0,0.471698,0.525299,0.285714,0.545455,0.666667,0.285714,287.968
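
Row 3 in the preview above has an empty first value, so NaNs survive the pipeline. Besides values that the ordinal mapping does not cover, one possible source in the test split is a category that never appeared during fitting. Whether `CountFrequencyEncoder` returns NaN or raises for unseen categories depends on its version and configuration, so the sketch below reproduces the idea with pandas only, on hypothetical city values:

```python
import pandas as pd

train_city = pd.Series(["Los Angeles", "Los Angeles", "Lorain", "Huntsville"])
test_city = pd.Series(["Los Angeles", "Springfield"])

# Count encoding learned on the training split only.
city_counts = train_city.value_counts()

# A city that never appeared in training has no count, so map() yields NaN.
encoded_test = test_city.map(city_counts)
print(encoded_test)

# One possible policy: treat unseen categories as count 0.
encoded_test_filled = encoded_test.fillna(0).astype(int)
print(encoded_test_filled.tolist())
```

High-cardinality columns such as `City` (371 unique values in 2,121 rows) are the most likely to contain categories that only show up in the test split.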


# 9. Final Pipeline Validation

In [16]:
print("🔍 FINAL PIPELINE VALIDATION\n")
print("=" * 60)

# 1. Check that no null values remain after processing
print("\n1️⃣  Null value check:")
train_nulls = df_train_processed.isnull().sum().sum()
test_nulls = df_test_processed.isnull().sum().sum()

if train_nulls == 0 and test_nulls == 0:
    print("   ✅ No null values in the processed data")
else:
    print(f"   ⚠️  WARNING: Train nulls: {train_nulls}, Test nulls: {test_nulls}")

# 2. Check dimensions
print("\n2️⃣  Dimension check:")
print(f"   Train: {df_train_processed.shape}")
print(f"   Test:  {df_test_processed.shape}")
print(f"   Expected features: {len(FEATURES) + 1} (features + Sales)")

if df_train_processed.shape[1] == len(FEATURES) + 1:
    print("   ✅ Dimensions are correct")
else:
    print("   ⚠️  WARNING: Dimensions do not match")

# 3. Check the ranges of the normalized variables
print("\n3️⃣  Range check (MinMax 0-1):")
numeric_features = df_train_processed.select_dtypes(include=[np.number]).columns
numeric_features = [col for col in numeric_features if col != 'Sales']

all_in_range = True
for col in numeric_features:
    min_val = df_train_processed[col].min()
    max_val = df_train_processed[col].max()
    
    if min_val < 0 or max_val > 1:
        print(f"   ⚠️  {col}: [{min_val:.3f}, {max_val:.3f}] - Outside range [0,1]")
        all_in_range = False

if all_in_range:
    print("   ✅ All variables are within the range [0, 1]")

# 4. Sales statistics (the target must not be normalized)
print("\n4️⃣  Target variable statistics (Sales):")
print(f"   Train - Min: ${df_train_processed['Sales'].min():.2f}, "
      f"Max: ${df_train_processed['Sales'].max():.2f}, "
      f"Mean: ${df_train_processed['Sales'].mean():.2f}")
print(f"   Test  - Min: ${df_test_processed['Sales'].min():.2f}, "
      f"Max: ${df_test_processed['Sales'].max():.2f}, "
      f"Mean: ${df_test_processed['Sales'].mean():.2f}")

print("\n" + "=" * 60)
print("✅ VALIDATION COMPLETED")
print("=" * 60)

🔍 FINAL PIPELINE VALIDATION

============================================================

1️⃣  Null value check:
   ⚠️  WARNING: Train nulls: 166, Test nulls: 92

2️⃣  Dimension check:
   Train: (1696, 14)
   Test:  (425, 14)
   Expected features: 14 (features + Sales)
   ✅ Dimensions are correct

3️⃣  Range check (MinMax 0-1):
   ⚠️  Region: [0.000, 1.000] - Outside range [0,1]

4️⃣  Target variable statistics (Sales):
   Train - Min: $1.89, Max: $4404.90, Mean: $345.90
   Test  - Min: $2.78, Max: $4416.17, Mean: $365.54

============================================================
✅ VALIDATION COMPLETED
============================================================
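
The range check flags one column whose bounds print as [0.000, 1.000], which suggests the strict comparison is tripping on floating-point rounding rather than on a real scaling problem. A tolerance-based variant could look like the sketch below. The helper name `out_of_range_columns`, the tolerance value and the demo frame are assumptions:

```python
import numpy as np
import pandas as pd

def out_of_range_columns(df, low=0.0, high=1.0, atol=1e-9):
    """Return columns whose min or max falls outside [low, high] by more than atol."""
    flagged = []
    for col in df.columns:
        col_min, col_max = df[col].min(), df[col].max()
        if col_min < low - atol or col_max > high + atol:
            flagged.append(col)
    return flagged

demo = pd.DataFrame({
    "ok": [0.0, 0.5, 1.0],
    "rounding": [0.0, 0.5, 1.0 + 2e-16],  # exceeds 1 only by float rounding
    "bad": [-0.2, 0.5, 1.3],
})
print(out_of_range_columns(demo))
```

A strict `max_val > 1` comparison would flag the `rounding` column as well. With the tolerance, only genuinely out-of-range columns are reported.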


# 10. Exporting the Fitted Pipeline

In [19]:
# Save the fitted pipeline
pipeline_path = '../models/stores_sales_forecasting_data_pre_proc.pkl'

joblib.dump(stores_sales_forecasting_data_pre_proc, pipeline_path)

print(f"✅ Pipeline saved successfully to: {pipeline_path}")
print(f"\n📦 File size: {os.path.getsize(pipeline_path) / 1024:.2f} KB")
print("\n🔧 The pipeline can be loaded with: joblib.load(path)")

✅ Pipeline saved successfully to: ../models/stores_sales_forecasting_data_pre_proc.pkl

📦 File size: 8.87 KB

🔧 The pipeline can be loaded with: joblib.load(path)
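
A minimal round-trip sketch of saving and reloading a fitted pipeline with `joblib`. It uses a small stand-in pipeline and a temporary directory instead of the real artifact, and the file name `demo_pre_proc.pkl` is hypothetical:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small stand-in pipeline; the real one also contains feature_engine and operators steps.
demo_pipeline = Pipeline([("scaler", MinMaxScaler())])
demo_pipeline.fit(np.array([[1.0], [4.0], [14.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pre_proc.pkl")
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

# The restored object reuses the fitted statistics; it should not be refit on new data.
new_scaled = restored.transform(np.array([[7.5]]))
print(new_scaled)
```

Pickle stores custom classes by reference, so loading the real pipeline requires the `operators` module, `feature_engine` and scikit-learn to be importable in the loading environment, ideally in the same versions used for fitting.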


# 11. Final Summary

In [21]:
# Summary

print("\n" + "="*70)
print("📊 PREPROCESSING SUMMARY".center(70))
print("="*70)

print("\n📁 GENERATED FILES:")
print(f"   ✓ Pipeline: {pipeline_path}")
print(f"   ✓ Train data: ../data/interim/proc_data_train.csv")
print(f"   ✓ Test data: ../data/interim/proc_data_test.csv")

print("\n📊 DIMENSIONS:")
print(f"   ✓ Train: {df_train_processed.shape[0]:,} records × {df_train_processed.shape[1]} columns")
print(f"   ✓ Test:  {df_test_processed.shape[0]:,} records × {df_test_processed.shape[1]} columns")

print("\n✅ Final features ({} variables):".format(len(FEATURES)))
for i, feat in enumerate(FEATURES, 1):
    print(f"   {i:2d}. {feat}")


======================================================================
                       📊 PREPROCESSING SUMMARY
======================================================================

📁 GENERATED FILES:
   ✓ Pipeline: ../models/stores_sales_forecasting_data_pre_proc.pkl
   ✓ Train data: ../data/interim/proc_data_train.csv
   ✓ Test data: ../data/interim/proc_data_test.csv

📊 DIMENSIONS:
   ✓ Train: 1,696 records × 14 columns
   ✓ Test:  425 records × 14 columns

✅ Final features (13 variables):
    1. Quantity
    2. Discount
    3. Ship Mode
    4. Segment
    5. Country
    6. City
    7. State
    8. Region
    9. Category
   10. Sub-Category
   11. Order_Month
   12. Order_Quarter
   13. Days to Ship
