# EAF Project - Refactored Notebook

## Electric Arc Furnace - Temperature and Chemical Composition Prediction

This refactored notebook organizes the code into clear, separate pipelines:
- **Part 3**: Temperature Pipeline (Sequential)
- **Part 3.5**: Chemical Pipeline (Static)
- **Part 4**: Temperature Model with XGBoost (Implemented)
- **Part 5**: Unpacked Chemical Model (Sequential Blocks)

---
## PART 1: Initial Setup

In [None]:
# =============================================================================
# DEPENDENCY INSTALLATION
# =============================================================================
!pip install -q pandas numpy matplotlib seaborn scikit-learn xgboost joblib kagglehub

In [None]:
# =============================================================================
# IMPORTS
# =============================================================================
import pandas as pd
import numpy as np
import os
import shutil
from pathlib import Path
import json
from typing import Dict, List, Tuple, Optional, Any
import logging
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import kagglehub

from sklearn.model_selection import train_test_split, GroupShuffleSplit
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

import xgboost as xgb
from xgboost import XGBRegressor

warnings.filterwarnings('ignore')

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    force=True
)
logger = logging.getLogger('EAF_Notebook')
logger.setLevel(logging.INFO)

print("Imports completed successfully")

In [None]:
# =============================================================================
# DIRECTORY CONFIGURATION
# =============================================================================
PROJECT_ROOT = Path.cwd()

DIRECTORIES = {
    'DATA_RAW': PROJECT_ROOT / 'data' / 'raw',
    'DATA_PROCESSED': PROJECT_ROOT / 'data' / 'processed',
    'MODELS': PROJECT_ROOT / 'models',
    'CHEMICAL_RESULTS': PROJECT_ROOT / 'models' / 'chemical_results'
}

for dir_name, dir_path in DIRECTORIES.items():
    dir_path.mkdir(parents=True, exist_ok=True)

DATA_RAW = DIRECTORIES['DATA_RAW']
DATA_PROCESSED = DIRECTORIES['DATA_PROCESSED']
MODELS_DIR = DIRECTORIES['MODELS']
CHEMICAL_RESULTS_DIR = DIRECTORIES['CHEMICAL_RESULTS']

print(f"PROJECT_ROOT: {PROJECT_ROOT}")
print(f"DATA_RAW: {DATA_RAW}")
print(f"DATA_PROCESSED: {DATA_PROCESSED}")

---
## PART 2: Data Ingestion and Helper Functions

In [None]:
# =============================================================================
# DATA DOWNLOAD FROM KAGGLE
# =============================================================================
KAGGLE_DATASET = "yuriykatser/industrial-data-from-the-arc-furnace"

ARCHIVOS_ESPERADOS = [
    "eaf_transformer.csv",
    "basket_charged.csv",
    "eaf_temp.csv",
    "eaf_final_chemical_measurements.csv",
    "eaf_added_materials.csv",
    "inj_mat.csv",
    "eaf_gaslance_mat.csv",
    "lf_initial_chemical_measurements.csv",
    "ladle_tapping.csv",
    "lf_added_materials.csv",
    "ferro.csv"
]

faltan_datos = any(not (DATA_RAW / f).exists() for f in ARCHIVOS_ESPERADOS)

if faltan_datos:
    print(f"Downloading {KAGGLE_DATASET}...")
    try:
        cached_path = Path(kagglehub.dataset_download(KAGGLE_DATASET))
        DATA_RAW.mkdir(parents=True, exist_ok=True)
        for archivo in ARCHIVOS_ESPERADOS:
            shutil.copy2(cached_path / archivo, DATA_RAW / archivo)
        print("Download completed.")
    except Exception as e:
        print(f"Error during download: {e}")
        raise
else:
    print("Data already available in data/raw/")

In [None]:
# =============================================================================
# STANDARDIZED LOADING FUNCTION
# =============================================================================
def load_standardized(filepath: Path) -> pd.DataFrame:
    """
    Load a CSV file and standardize its column names.

    Text columns containing European-style numbers (decimal comma) get their
    commas replaced by dots. The values remain strings; callers convert them
    with pd.to_numeric() or pd.to_datetime().
    """
    df = pd.read_csv(filepath, low_memory=False)
    df.columns = df.columns.str.lower().str.strip()

    for col in df.select_dtypes(include=['object']).columns:
        as_text = df[col].astype(str)
        if as_text.str.match(r'^-?\d+,\d+$').any():
            # Keep missing values as NaN instead of the literal string 'nan'
            df[col] = as_text.str.replace(',', '.', regex=False).where(df[col].notna())

    return df

print("load_standardized() function defined")
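
A minimal sketch of the decimal-comma handling, applied to an in-memory Series instead of a CSV file. The helper name `normalize_decimal_commas` and the sample values are hypothetical. Unlike `load_standardized()`, this sketch also converts the result to a numeric dtype.

```python
import pandas as pd


def normalize_decimal_commas(series: pd.Series) -> pd.Series:
    """Convert strings such as '12,5' to floats; leave other columns untouched."""
    as_text = series.astype(str)
    # Same detection rule as load_standardized(): at least one value must look
    # like a number written with a decimal comma.
    if not as_text.str.match(r'^-?\d+,\d+$').any():
        return series
    converted = as_text.str.replace(',', '.', regex=False)
    # Unparseable entries (including missing values) become NaN.
    return pd.to_numeric(converted, errors='coerce')


raw = pd.Series(['1580,5', '1612,0', None, '-3,25'])
clean = normalize_decimal_commas(raw)
labels = normalize_decimal_commas(pd.Series(['FeSi', 'lime']))
print(clean.tolist())
print(labels.tolist())
```

Columns without any decimal-comma value, such as material codes, are returned unchanged.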

In [None]:
# =============================================================================
# TIME-SERIES AGGREGATION FUNCTIONS
# =============================================================================

def aggregate_gas_data(df_gas: pd.DataFrame) -> pd.DataFrame:
    """Aggregate gas lance data per heat (last value)."""
    df = df_gas.copy()
    cols_gas = ['o2_amount', 'gas_amount']
    for col in cols_gas:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    df['revtime'] = pd.to_datetime(
        df['revtime'].astype(str).str.replace(',', '.', regex=False),
        format='%Y-%m-%d %H:%M:%S.%f', errors='coerce'
    )
    df = df.sort_values('revtime')
    return df.groupby('heatid').last()[cols_gas].rename(columns={
        'o2_amount': 'total_o2_lance',
        'gas_amount': 'total_gas_lance'
    })


def aggregate_injection_data(df_inj: pd.DataFrame) -> pd.DataFrame:
    """Aggregate carbon injection data per heat (last value)."""
    df = df_inj.copy()
    df['inj_amount_carbon'] = pd.to_numeric(df['inj_amount_carbon'], errors='coerce')
    df['revtime'] = pd.to_datetime(
        df['revtime'].astype(str).str.replace(',', '.', regex=False),
        format='%Y-%m-%d %H:%M:%S.%f', errors='coerce'
    )
    df = df.sort_values('revtime')
    return df.groupby('heatid').last()[['inj_amount_carbon']].rename(
        columns={'inj_amount_carbon': 'total_injected_carbon'}
    )


def aggregate_transformer_data(df_transformer: pd.DataFrame) -> pd.DataFrame:
    """Aggregate transformer data per heat (total energy, in MW·min)."""
    df = df_transformer.copy()

    def parse_duration(duration_str):
        # Interprets 'A:B' as A minutes and B seconds; anything else counts as 0
        try:
            parts = str(duration_str).strip().split(':')
            if len(parts) == 2:
                return float(parts[0].strip()) + float(parts[1].strip()) / 60.0
            return 0.0
        except (ValueError, TypeError):
            return 0.0

    df['duration_minutes'] = df['duration'].apply(parse_duration)
    df['mw'] = pd.to_numeric(df['mw'], errors='coerce').fillna(0)
    # Power (MW) x time (min): divide by 60 if MWh are needed
    df['energy'] = df['mw'] * df['duration_minutes']

    return df.groupby('heatid').agg({
        'energy': 'sum',
        'duration_minutes': 'sum'
    }).rename(columns={'energy': 'total_energy', 'duration_minutes': 'total_duration'})


def aggregate_charged_amount(df_ladle: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the total amount of charged material per heat."""
    df = df_ladle.copy()
    df['charge_amount'] = pd.to_numeric(df['charge_amount'], errors='coerce').fillna(0)
    return df.groupby('heatid').agg({'charge_amount': 'sum'}).rename(
        columns={'charge_amount': 'total_charged_amount'}
    )


def pivot_materials(df_ladle: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Pivot added materials (top N most frequent)."""
    df = df_ladle.copy()
    df['charge_amount'] = pd.to_numeric(df['charge_amount'], errors='coerce')
    top_materials = df['mat_code'].value_counts().head(top_n).index
    df_filtered = df[df['mat_code'].isin(top_materials)]
    
    return df_filtered.pivot_table(
        index='heatid',
        columns='mat_code',
        values='charge_amount',
        aggfunc='sum',
        fill_value=0
    ).add_prefix('added_mat_')


print("Aggregation functions defined")
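
The transformer aggregation can be checked by hand on a tiny table. This is a sketch with hypothetical values. It assumes, as the arithmetic in `parse_duration` implies, that durations are written as minutes and seconds separated by a colon. The helper name `parse_duration_minutes` is illustrative only.

```python
import pandas as pd


def parse_duration_minutes(value) -> float:
    """Parse 'MM:SS' into minutes; return 0.0 for anything else."""
    parts = str(value).strip().split(':')
    if len(parts) != 2:
        return 0.0
    try:
        return float(parts[0]) + float(parts[1]) / 60.0
    except ValueError:
        return 0.0


# Hypothetical transformer log: two heats, three tap periods.
toy_transformer = pd.DataFrame({
    'heatid': [1, 1, 2],
    'duration': ['10:30', '5:00', 'bad'],
    'mw': [40.0, 60.0, 50.0],
})
toy_transformer['duration_minutes'] = toy_transformer['duration'].apply(parse_duration_minutes)
# MW x minutes, so the unit is MW·min; divide by 60 to obtain MWh.
toy_transformer['energy'] = toy_transformer['mw'] * toy_transformer['duration_minutes']
per_heat = toy_transformer.groupby('heatid')[['energy', 'duration_minutes']].sum()
print(per_heat)
```

An unparseable duration contributes zero energy rather than raising, so malformed rows silently lower a heat's total. Counting them before aggregating is a cheap sanity check.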

In [None]:
# =============================================================================
# FUNCTION TO BUILD THE MASTER DATASET OF STATIC VARIABLES
# =============================================================================

def build_master_dataset(data_raw_path: Path) -> pd.DataFrame:
    """
    Build the master dataset with static variables per heat.
    Includes: energy, materials, gases, etc.
    """
    logger.info("Building master dataset...")

    # Load source files
    df_transformer = load_standardized(data_raw_path / "eaf_transformer.csv")
    df_gas = load_standardized(data_raw_path / "eaf_gaslance_mat.csv")
    df_inj = load_standardized(data_raw_path / "inj_mat.csv")
    df_ladle = load_standardized(data_raw_path / "ladle_tapping.csv")

    # Aggregate data
    grp_transformer = aggregate_transformer_data(df_transformer)
    grp_gas = aggregate_gas_data(df_gas)
    grp_inj = aggregate_injection_data(df_inj)
    grp_charged = aggregate_charged_amount(df_ladle)
    pivot_ladle = pivot_materials(df_ladle, top_n=10)

    # Create the base with every heatid seen in the transformer, gas, or injection data
    all_heatids = set(grp_transformer.index) | set(grp_gas.index) | set(grp_inj.index)
    df_master = pd.DataFrame({'heatid': list(all_heatids)})

    # Merge all components
    df_master = df_master.merge(grp_transformer, on='heatid', how='left')
    df_master = df_master.merge(grp_gas, on='heatid', how='left')
    df_master = df_master.merge(grp_inj, on='heatid', how='left')
    df_master = df_master.merge(grp_charged, on='heatid', how='left')
    df_master = df_master.merge(pivot_ladle, on='heatid', how='left')

    # Fill nulls with 0
    df_master = df_master.fillna(0)

    logger.info(f"Master dataset: {df_master.shape}")
    return df_master

print("build_master_dataset() function defined")

---
## PART 3: TEMPERATURE PIPELINE (Sequential)

This pipeline generates `dataset_sequential_temp.csv` with:
- Dynamic features ($X_t$): current temperature, oxidation, temporal position
- Target ($Y_{t+1}$): temperature of the next record
- Static variables: energy, materials, gases (merged from the master dataset)

**Steps:**
1. Loading and cleaning of `eaf_temp.csv`
2. Physical filters (1000-1850°C) and quantiles
3. Sorting by heatid and time
4. Dynamic feature generation
5. Target generation with shift
6. Merge with static variables
7. Export
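
The core of steps 3 to 5 is a per-heat `shift(-1)`, which pairs each measurement with the next one from the same heat. A minimal sketch on hypothetical data (the column names `datetime` and `temp` are assumptions about the raw file):

```python
import pandas as pd

toy = pd.DataFrame({
    'heatid': [1, 1, 1, 2, 2],
    'datetime': pd.to_datetime([
        '2020-01-01 10:00', '2020-01-01 10:10', '2020-01-01 10:25',
        '2020-01-01 12:00', '2020-01-01 12:05',
    ]),
    'temp': [1550.0, 1580.0, 1620.0, 1540.0, 1600.0],
})
# Sorting must happen before the shift, otherwise "next" is meaningless.
toy = toy.sort_values(['heatid', 'datetime']).reset_index(drop=True)

# Shift within each heat so the last row of one heat never sees the next heat.
toy['target_temp_next'] = toy.groupby('heatid')['temp'].shift(-1)
toy['horizonte_minutos'] = (
    toy.groupby('heatid')['datetime'].shift(-1) - toy['datetime']
).dt.total_seconds() / 60

# The last measurement of each heat has no successor and is dropped.
pairs = toy.dropna(subset=['target_temp_next'])
print(pairs[['heatid', 'temp', 'target_temp_next', 'horizonte_minutos']])
```

Grouping before shifting is the important design choice: a plain `shift(-1)` on the whole column would use the first temperature of heat 2 as the target for the last row of heat 1.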

### 3.1 Loading and Cleaning Temperature Data

In [None]:
# =============================================================================
# STEP 1: LOAD TEMPERATURE DATA
# =============================================================================
print("="*60)
print("TEMPERATURE PIPELINE - START")
print("="*60)

df_temp_seq = load_standardized(DATA_RAW / "eaf_temp.csv")

print("File loaded: eaf_temp.csv")
print(f"Original shape: {df_temp_seq.shape}")
print(f"Columns: {df_temp_seq.columns.tolist()}")

In [None]:
# =============================================================================
# STEP 2: COLUMN DETECTION AND TYPE CONVERSION
# =============================================================================

# Detect the temperature column
cols_temp = [c for c in df_temp_seq.columns if 'temp' in c.lower() and 'time' not in c.lower()]
col_temp = cols_temp[0] if cols_temp else 'temp'

# Detect the time column
cols_time = [c for c in df_temp_seq.columns if 'time' in c.lower() or 'date' in c.lower()]
col_datetime = cols_time[0] if cols_time else 'datetime'

# Detect the oxidation column
cols_ox = [c for c in df_temp_seq.columns if 'ox' in c.lower() or 'o2' in c.lower()]
col_oxidation = cols_ox[0] if cols_ox else None

print(f"Temperature column: {col_temp}")
print(f"Time column: {col_datetime}")
print(f"Oxidation column: {col_oxidation}")

# Convert types
df_temp_seq[col_temp] = pd.to_numeric(df_temp_seq[col_temp], errors='coerce')
df_temp_seq[col_datetime] = pd.to_datetime(
    df_temp_seq[col_datetime].astype(str).str.replace(',', '.', regex=False),
    errors='coerce'
)
if col_oxidation:
    df_temp_seq[col_oxidation] = pd.to_numeric(df_temp_seq[col_oxidation], errors='coerce')

print("\nTemperature statistics (unfiltered):")
print(df_temp_seq[col_temp].describe())

In [None]:
# =============================================================================
# STEP 3: CLEANING FILTERS (CRITICAL)
# =============================================================================
filas_inicial = len(df_temp_seq)
print(f"Initial rows: {filas_inicial:,}")

# FILTER 1: Physical range (1000-1850°C); rows with a missing temperature are also removed here
TEMP_MIN_FISICA = 1000
TEMP_MAX_FISICA = 1850

mask_fisica = (df_temp_seq[col_temp] >= TEMP_MIN_FISICA) & (df_temp_seq[col_temp] <= TEMP_MAX_FISICA)
df_temp_seq = df_temp_seq[mask_fisica].copy()

filas_despues_fisica = len(df_temp_seq)
print(f"Physical filter ({TEMP_MIN_FISICA}-{TEMP_MAX_FISICA}°C): {filas_inicial - filas_despues_fisica:,} rows removed")

# FILTER 2: Quantiles (0.5% at each extreme)
QUANTILE_LOWER = 0.005
QUANTILE_UPPER = 0.995

q_low = df_temp_seq[col_temp].quantile(QUANTILE_LOWER)
q_high = df_temp_seq[col_temp].quantile(QUANTILE_UPPER)

mask_quantile = (df_temp_seq[col_temp] >= q_low) & (df_temp_seq[col_temp] <= q_high)
df_temp_seq = df_temp_seq[mask_quantile].copy()

filas_final = len(df_temp_seq)
print(f"Quantile filter ({QUANTILE_LOWER:.1%}-{QUANTILE_UPPER:.1%}): {filas_despues_fisica - filas_final:,} rows removed")
print(f"  Accepted range: [{q_low:.1f}, {q_high:.1f}] °C")

print(f"\nTotal removed: {filas_inicial - filas_final:,} ({100*(filas_inicial - filas_final)/filas_inicial:.2f}%)")
print(f"Final rows: {filas_final:,}")
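
The order of the two filters matters: the quantile bounds are computed after the physical filter, so gross sensor errors cannot distort them. A minimal sketch with hypothetical readings:

```python
import pandas as pd

temps = pd.Series([25.0, 1500.0, 1550.0, 1600.0, 1650.0, 1700.0, 3000.0])

# Stage 1: hard physical limits remove the obviously invalid readings.
physical = temps[(temps >= 1000) & (temps <= 1850)]

# Stage 2: quantile trimming, computed on the already-filtered values.
q_low, q_high = physical.quantile(0.005), physical.quantile(0.995)
trimmed = physical[(physical >= q_low) & (physical <= q_high)]

print(len(temps), len(physical), len(trimmed))
```

Note that with interpolated quantiles the second stage removes the extreme values even when they are physically plausible, so it trims roughly 1% of rows by construction rather than only removing outliers.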

### 3.2 Sorting and Feature Generation

In [None]:
# =============================================================================
# STEP 4: STRICT SORTING
# =============================================================================

# Drop nulls in critical columns
df_temp_seq = df_temp_seq.dropna(subset=['heatid', col_datetime, col_temp])

# Sort by heatid and time (CRITICAL for time series)
df_temp_seq = df_temp_seq.sort_values(['heatid', col_datetime]).reset_index(drop=True)

print(f"Dataset sorted by heatid and {col_datetime}")
print(f"Shape: {df_temp_seq.shape}")
print(f"Unique heats: {df_temp_seq['heatid'].nunique()}")

In [None]:
# =============================================================================
# STEP 5: CREATE DYNAMIC FEATURES (X_t)
# =============================================================================

# Feature 1: Current temperature
df_temp_seq['temp_actual'] = df_temp_seq[col_temp]

# Feature 2: Current oxidation
if col_oxidation:
    df_temp_seq['oxidacion_actual'] = df_temp_seq[col_oxidation].fillna(0)
else:
    df_temp_seq['oxidacion_actual'] = 0

# Feature 3: Measurement number within the heat
df_temp_seq['num_medicion'] = df_temp_seq.groupby('heatid').cumcount() + 1

# Feature 4: Time elapsed since the first measurement of the heat (minutes)
df_temp_seq['tiempo_desde_inicio'] = df_temp_seq.groupby('heatid')[col_datetime].transform(
    lambda x: (x - x.min()).dt.total_seconds() / 60
)

print("Dynamic features created:")
print("  - temp_actual")
print("  - oxidacion_actual")
print("  - num_medicion")
print("  - tiempo_desde_inicio")

In [None]:
# =============================================================================
# STEP 6: GENERATE THE TARGET (Y_{t+1}) - NEXT TEMPERATURE
# =============================================================================

# Target: temperature of the next record in the same heat (negative shift)
df_temp_seq['target_temp_next'] = df_temp_seq.groupby('heatid')[col_temp].shift(-1)

# Datetime of the next record (used to compute the horizon)
df_temp_seq['datetime_next'] = df_temp_seq.groupby('heatid')[col_datetime].shift(-1)

# Time horizon in minutes
df_temp_seq['horizonte_minutos'] = (
    df_temp_seq['datetime_next'] - df_temp_seq[col_datetime]
).dt.total_seconds() / 60

print("Target generated: target_temp_next")
print("Time horizon statistics:")
print(df_temp_seq['horizonte_minutos'].describe())

In [None]:
# =============================================================================
# STEP 7: EDGE CLEANUP
# =============================================================================

filas_antes = len(df_temp_seq)
n_coladas = df_temp_seq['heatid'].nunique()

# Drop rows where the target is NaN (last record of each heat)
df_temp_seq = df_temp_seq.dropna(subset=['target_temp_next'])

# Filter anomalous horizons (keep gaps between 0 and 120 minutes, exclusive)
mask_horizonte_valido = (df_temp_seq['horizonte_minutos'] > 0) & (df_temp_seq['horizonte_minutos'] < 120)
df_temp_seq = df_temp_seq[mask_horizonte_valido]

filas_despues = len(df_temp_seq)
print(f"Rows before: {filas_antes:,}")
print(f"Rows after edge cleanup: {filas_despues:,}")
print(f"Removed: {filas_antes - filas_despues:,}")

### 3.3 Merge with Static Variables and Export

In [None]:
# =============================================================================
# STEP 8: MERGE WITH THE MASTER DATASET
# =============================================================================

# Build the master dataset of static variables
print("Building master dataset...")
df_master = build_master_dataset(DATA_RAW)

print(f"Master dataset: {df_master.shape}")
print(f"Sequential dataset: {df_temp_seq.shape}")

# Merge (many-to-one: static variables are repeated for every measurement of a heat)
df_sequential = df_temp_seq.merge(df_master, on='heatid', how='left')

print(f"Merged dataset: {df_sequential.shape}")

In [None]:
# =============================================================================
# STEP 9: FINAL PREPARATION AND EXPORT
# =============================================================================

# Drop auxiliary columns first so that fillna(0) never touches datetime columns
cols_drop = ['datetime_next', col_datetime, 'datetime', 'positionrow', 'filter_key_date', 'measure_time']
cols_drop = [c for c in cols_drop if c in df_sequential.columns]
df_sequential = df_sequential.drop(columns=cols_drop)

# Fill nulls with 0
df_sequential = df_sequential.fillna(0)

# Drop the original temperature column if it exists
if col_temp in df_sequential.columns and col_temp != 'temp_actual':
    df_sequential = df_sequential.drop(columns=[col_temp])

# Reorder columns: heatid first, target last
target_col = 'target_temp_next'
cols_order = ['heatid'] + [c for c in df_sequential.columns if c not in ['heatid', target_col]] + [target_col]
df_sequential = df_sequential[cols_order]

# SAVE
output_path = DATA_PROCESSED / "dataset_sequential_temp.csv"
df_sequential.to_csv(output_path, index=False)

print("="*60)
print("TEMPERATURE PIPELINE - COMPLETED")
print("="*60)
print(f"File: {output_path}")
print(f"Shape: {df_sequential.shape}")
print(f"Unique heats: {df_sequential['heatid'].nunique()}")
print(f"Measurements per heat (average): {len(df_sequential) / df_sequential['heatid'].nunique():.1f}")
print(f"\nColumns ({len(df_sequential.columns)}):")
print(df_sequential.columns.tolist()[:15], "...")

---
## PART 3.5: CHEMICAL PIPELINE (Static)

This pipeline generates `dataset_final_chemical.csv` with:
- Inputs: process variables (materials, energy, gases)
- Targets: final chemical composition (C, Mn, Si, P, S, Cu, Cr, Mo, Ni)

**Steps:**
1. Loading of `eaf_final_chemical_measurements.csv`
2. Extraction of chemical targets
3. Merge with the master dataset
4. Cleaning and export
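
Steps 2 and 3 reduce to three pandas operations: prefix the target columns, keep one chemistry row per heat, and inner-join with the process features. A minimal sketch with hypothetical values (`validate='one_to_one'` is an extra safety check added here, not something the pipeline itself uses):

```python
import pandas as pd

# Hypothetical per-heat process features.
features = pd.DataFrame({
    'heatid': [1, 2, 3],
    'total_energy': [720.0, 650.0, 700.0],
})
# Hypothetical final chemistry: heat 2 was sampled twice, heat 3 never.
chemistry = pd.DataFrame({
    'heatid': [1, 2, 2, 4],
    'valc': [0.08, 0.10, 0.12, 0.09],
})

targets = (
    chemistry
    .rename(columns={'valc': 'target_valc'})
    # Keep the last sample of each heat, as the pipeline does.
    .drop_duplicates(subset=['heatid'], keep='last')
)
# Inner join: only heats present on both sides survive.
merged = features.merge(targets, on='heatid', how='inner', validate='one_to_one')
print(merged)
```

Heats 3 and 4 disappear because each is missing from one side. `keep='last'` only selects the final sample if the file is ordered chronologically, which is worth verifying on the real data.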

### 3.5.1 Loading Final Chemical Measurements

In [None]:
# =============================================================================
# STEP 1: LOAD FINAL CHEMICAL DATA
# =============================================================================
print("="*60)
print("CHEMICAL PIPELINE - START")
print("="*60)

df_chem_final = load_standardized(DATA_RAW / "eaf_final_chemical_measurements.csv")

print("File loaded: eaf_final_chemical_measurements.csv")
print(f"Shape: {df_chem_final.shape}")
print(f"Columns: {df_chem_final.columns.tolist()}")

In [None]:
# =============================================================================
# STEP 2: EXTRACT CHEMICAL TARGETS
# =============================================================================

# Chemical elements to extract as targets
chemical_elements = ['valc', 'valmn', 'valsi', 'valp', 'vals', 'valcu', 'valcr', 'valmo', 'valni']

# Check availability
available_elements = [col for col in chemical_elements if col in df_chem_final.columns]
print(f"Available elements: {available_elements}")

# Select heatid and elements
cols_to_select = ['heatid'] + available_elements
df_targets = df_chem_final[cols_to_select].copy()

# Convert to numeric
for col in available_elements:
    df_targets[col] = pd.to_numeric(df_targets[col], errors='coerce')

# Rename with the target_ prefix
rename_dict = {col: f'target_{col}' for col in available_elements}
df_targets = df_targets.rename(columns=rename_dict)

# Drop duplicates per heatid (keep the last row of each heat)
df_targets = df_targets.drop_duplicates(subset=['heatid'], keep='last')

print(f"\nTargets extracted: {len(df_targets)} heats")
print(df_targets.describe())

### 3.5.2 Merge with the Master Dataset

In [None]:
# =============================================================================
# STEP 3: BUILD/REUSE THE MASTER DATASET
# =============================================================================

# Reuse df_master if it already exists, otherwise build it
if 'df_master' not in globals() or df_master is None:
    print("Building master dataset...")
    df_master = build_master_dataset(DATA_RAW)
else:
    print("Reusing existing df_master.")

print(f"Master dataset: {df_master.shape}")

In [None]:
# =============================================================================
# STEP 4: MERGE INPUTS WITH TARGETS
# =============================================================================

# Inner join: only heats present in both the master dataset and the targets
df_chemical = df_master.merge(df_targets, on='heatid', how='inner')

print(f"Master dataset: {len(df_master)} heats")
print(f"Chemical targets: {len(df_targets)} heats")
print(f"Merged dataset: {len(df_chemical)} heats")
print(f"Lost in merge: {len(df_master) - len(df_chemical)} heats")

### 3.5.3 Final Cleaning and Export

In [None]:
# =============================================================================
# STEP 5: FINAL CLEANING AND SAVING
# =============================================================================

# Drop unnecessary columns
cols_drop = ['datetime', 'positionrow', 'filter_key_date', 'measure_time']
df_chemical = df_chemical.drop(columns=[c for c in cols_drop if c in df_chemical.columns])

# Identify target columns
target_cols = [c for c in df_chemical.columns if c.startswith('target_')]

# Drop rows where ALL targets are null
rows_before = len(df_chemical)
df_chemical = df_chemical.dropna(subset=target_cols, how='all')
print(f"Rows removed (all targets null): {rows_before - len(df_chemical)}")

# Fill nulls with 0
# NOTE: this also sets partially missing targets to 0; when training a model
# for one element, consider excluding rows whose target was originally missing.
df_chemical = df_chemical.fillna(0)

# SAVE
output_path_chemical = DATA_PROCESSED / "dataset_final_chemical.csv"
df_chemical.to_csv(output_path_chemical, index=False)

print("\n" + "="*60)
print("CHEMICAL PIPELINE - COMPLETED")
print("="*60)
print(f"File: {output_path_chemical}")
print(f"Shape: {df_chemical.shape}")
print(f"Input features: {len(df_chemical.columns) - len(target_cols)}")
print(f"Target variables: {len(target_cols)}")
print(f"\nTargets: {target_cols}")

In [None]:
# =============================================================================
# QUICK EXPLORATORY ANALYSIS
# =============================================================================
print("\nChemical target statistics:")
print(df_chemical[target_cols].describe().round(4))

---
## PART 4: TEMPERATURE MODEL (XGBoost)

Implementation of the temperature prediction model using:
- **XGBoost Regressor**
- **GroupShuffleSplit** for validation by heat (avoids data leakage)
- Metrics: RMSE, R², MAE
- Visualization: Predicted vs Actual
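
Why a group-aware split matters: measurements from one heat share the same static features, so a row-level random split would put near-duplicates of the test rows into the training set. A minimal sketch on hypothetical data, where the `toy_` names are illustrative only:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 heats with 4 measurements each.
toy_groups = np.repeat(np.arange(10), 4)
toy_X = np.arange(len(toy_groups), dtype=float).reshape(-1, 1)
toy_y = toy_X.ravel() * 2.0

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
toy_train_idx, toy_test_idx = next(splitter.split(toy_X, toy_y, toy_groups))

toy_train_heats = set(toy_groups[toy_train_idx])
toy_test_heats = set(toy_groups[toy_test_idx])
# test_size applies to groups, not rows: 2 of the 10 heats go to test.
print(len(toy_train_heats), len(toy_test_heats), toy_train_heats & toy_test_heats)
```

Because `test_size` is a fraction of groups, the share of rows in the test set can differ from 20% when heats have unequal numbers of measurements.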

### 4.1 Loading the Sequential Dataset

In [None]:
# =============================================================================
# LOAD THE SEQUENTIAL DATASET
# =============================================================================
print("="*60)
print("TEMPERATURE MODEL - START")
print("="*60)

# Load dataset
df_temp = pd.read_csv(DATA_PROCESSED / "dataset_sequential_temp.csv")

print(f"Dataset loaded: {df_temp.shape}")
print(f"Unique heats: {df_temp['heatid'].nunique()}")
print("\nColumns:")
print(df_temp.columns.tolist())

### 4.2 Preparing Features and Target

In [None]:
# =============================================================================
# PREPARE FEATURES AND TARGET
# =============================================================================

# Target
TARGET_COL = 'target_temp_next'

# Features: every column except heatid and the target
feature_cols = [c for c in df_temp.columns if c not in ['heatid', TARGET_COL]]

X = df_temp[feature_cols]
y = df_temp[TARGET_COL]
groups = df_temp['heatid']  # For GroupShuffleSplit

print(f"Features: {len(feature_cols)}")
print(f"Samples: {len(X)}")
print(f"Target: {TARGET_COL}")
print("\nTarget statistics:")
print(y.describe())

### 4.3 Train/Test Split with GroupShuffleSplit

In [None]:
# =============================================================================
# TRAIN/TEST SPLIT WITH GROUPSHUFFLESPLIT
# =============================================================================
# IMPORTANT: GroupShuffleSplit keeps all measurements of a given heat
# entirely in train or entirely in test, never split across both.
# This avoids DATA LEAKAGE.

# Split configuration
TEST_SIZE = 0.2
RANDOM_STATE = 42

gss = GroupShuffleSplit(n_splits=1, test_size=TEST_SIZE, random_state=RANDOM_STATE)

# Get train/test indices
train_idx, test_idx = next(gss.split(X, y, groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Verify that no heat appears in both sets
train_heats = set(groups.iloc[train_idx])
test_heats = set(groups.iloc[test_idx])
overlap = train_heats & test_heats

print("Train/Test split (GroupShuffleSplit):")
print(f"  - Train: {len(X_train):,} samples ({len(train_heats)} heats)")
print(f"  - Test:  {len(X_test):,} samples ({len(test_heats)} heats)")
print(f"  - Overlapping heats: {len(overlap)} (must be 0)")

if len(overlap) > 0:
    print("  WARNING: some heats appear in both train and test!")
else:
    print("  No data leakage from heats shared between train and test.")

### 4.4 Training the XGBoost Model

In [None]:
# =============================================================================
# TRAIN THE XGBOOST MODEL
# =============================================================================

# Hyperparameters
HYPERPARAMS = {
    'n_estimators': 200,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'random_state': RANDOM_STATE,
    'n_jobs': -1
}

print("Training XGBoost Regressor...")
print(f"Hyperparameters: {HYPERPARAMS}")

# Create and train the model
model_temp = XGBRegressor(**HYPERPARAMS)
model_temp.fit(X_train, y_train)

print("Training completed.")

### 4.5 Model Evaluation

In [None]:
# =============================================================================
# PREDICTION AND METRICS
# =============================================================================

# Prediction
y_pred = model_temp.predict(X_test)

# Metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print("="*60)
print("EVALUATION METRICS - TEMPERATURE MODEL")
print("="*60)
print(f"  RMSE: {rmse:.4f} °C")
print(f"  R²:   {r2:.4f}")
print(f"  MAE:  {mae:.4f} °C")
print("="*60)

# Interpretation (rule-of-thumb thresholds)
if r2 > 0.8:
    print("Excellent predictive capability.")
elif r2 > 0.6:
    print("Good predictive capability.")
elif r2 > 0.3:
    print("Moderate predictive capability.")
else:
    print("Low predictive capability. Consider adjustments.")
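
For reference, the three metrics can be computed by hand with NumPy. This sketch uses hypothetical temperatures and `_toy` names to avoid clobbering the notebook's variables. It follows the standard definitions, which is what the scikit-learn functions above compute for a single target.

```python
import numpy as np

y_true_toy = np.array([1600.0, 1620.0, 1580.0, 1640.0])
y_hat_toy = np.array([1610.0, 1615.0, 1570.0, 1650.0])

errors = y_true_toy - y_hat_toy
# RMSE penalizes large errors more heavily than MAE does.
rmse_toy = np.sqrt(np.mean(errors ** 2))
mae_toy = np.mean(np.abs(errors))
# R² compares the model against always predicting the mean of y_true.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print(round(rmse_toy, 3), round(mae_toy, 3), round(r2_toy, 4))
```

RMSE and MAE are in °C and can be read directly as typical errors. R² depends on the variance of the test targets, so it is not comparable across datasets with different temperature spreads.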

### 4.6 Visualization: Predicted vs Actual

In [None]:
# =============================================================================
# PLOT: PREDICTED VS ACTUAL
# =============================================================================

fig, ax = plt.subplots(figsize=(10, 8))

# Scatter plot
ax.scatter(y_test, y_pred, alpha=0.3, color='steelblue', s=10)

# Perfect-prediction line
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')

# Configuration
ax.set_xlabel('Actual Temperature (°C)', fontsize=12)
ax.set_ylabel('Predicted Temperature (°C)', fontsize=12)
ax.set_title('Temperature Model - Predicted vs Actual', fontsize=14, fontweight='bold')

# Metrics box
textstr = f'RMSE: {rmse:.2f}°C\nR²: {r2:.4f}\nMAE: {mae:.2f}°C'
props = dict(boxstyle='round', facecolor='wheat', alpha=0.8)
ax.text(0.05, 0.95, textstr, transform=ax.transAxes, fontsize=11,
        verticalalignment='top', bbox=props)

ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Plot generated.")

### 4.7 Feature Importance

In [None]:
# =============================================================================
# FEATURE IMPORTANCE
# =============================================================================

# Get importances
importances = model_temp.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': importances
}).sort_values('importance', ascending=False)

# Top 15 features (fewer if the model has fewer inputs)
top_n = min(15, len(feature_importance_df))
top_features = feature_importance_df.head(top_n)

# Plot
fig, ax = plt.subplots(figsize=(10, 8))

colors = plt.cm.viridis(np.linspace(0.2, 0.8, top_n))
bars = ax.barh(range(top_n), top_features['importance'].values[::-1], color=colors)
ax.set_yticks(range(top_n))
ax.set_yticklabels(top_features['feature'].values[::-1])
ax.set_xlabel('Importance', fontsize=12)
ax.set_title(f'Top {top_n} Most Important Features - Temperature Model', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nTop 10 most important features:")
print(feature_importance_df.head(10).to_string(index=False))

In [None]:
# =============================================================================
# SAVE MODEL
# =============================================================================

model_path = MODELS_DIR / "model_temperature_xgboost.joblib"
joblib.dump(model_temp, model_path)

# Save metadata (metrics cast to plain floats so they serialize cleanly to JSON)
metadata = {
    'model_type': 'XGBoost Regressor',
    'hyperparameters': HYPERPARAMS,
    'metrics': {'RMSE': float(rmse), 'R2': float(r2), 'MAE': float(mae)},
    'features': feature_cols,
    'n_train_samples': len(X_train),
    'n_test_samples': len(X_test),
    'n_train_heats': len(train_heats),
    'n_test_heats': len(test_heats)
}

with open(MODELS_DIR / "model_temperature_metadata.json", 'w', encoding='utf-8') as f:
    json.dump(metadata, f, indent=4)

print(f"Model saved: {model_path}")
print(f"Metadata saved: {MODELS_DIR / 'model_temperature_metadata.json'}")

---
## PART 5: CHEMICAL MODEL (Unpacked)

This section implements the training of the chemical models in **sequential blocks**
to make debugging and hyperparameter changes easier.

**Flow:**
1. Loading and feature selection
2. Train/Test split
3. Training
4. Evaluation
5. Visualization

**Note:** This example uses `target_valc` (Carbon). You can change the target to model other elements.
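
When switching elements, the feature list has to change with the target, because every `target_*` column and the element's own raw reading must be excluded to avoid leakage. A minimal sketch of that rule as a reusable helper. The function name `select_feature_columns`, the toy list of targets, and the sample values are hypothetical.

```python
import pandas as pd

CHEMICAL_TARGETS_TOY = ['target_valc', 'target_valmn']


def select_feature_columns(df: pd.DataFrame, target: str, all_targets: list) -> list:
    """Return the input columns for `target`, excluding identifiers and every target."""
    if target not in df.columns:
        raise KeyError(f"{target} is not a column of the dataset")
    # Exclude the id, all targets, and the raw reading of the element itself.
    excluded = {'heatid', *all_targets, target.replace('target_', '')}
    return [c for c in df.columns if c not in excluded]


toy_chem = pd.DataFrame({
    'heatid': [1, 2],
    'total_energy': [720.0, 650.0],
    'valc': [0.30, 0.28],          # hypothetical initial carbon reading
    'target_valc': [0.08, 0.10],
    'target_valmn': [0.12, 0.15],
})
features_by_target = {
    target: select_feature_columns(toy_chem, target, CHEMICAL_TARGETS_TOY)
    for target in CHEMICAL_TARGETS_TOY
}
print(features_by_target)
```

With a helper like this, looping over all elements only requires iterating over the list of targets instead of editing the configuration cell each time.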

### 5.1 Configuring the Element to Predict

In [None]:
# =============================================================================
# CONFIGURACIÓN DEL MODELO QUÍMICO
# =============================================================================

# Elemento químico a predecir (cambiar aquí para otros elementos)
TARGET_CHEMICAL = 'target_valc'  # Opciones: target_valc, target_valmn, target_valsi, etc.

# Lista de todos los targets disponibles
CHEMICAL_TARGETS = [
    'target_valc',   # Carbono
    'target_valmn',  # Manganeso
    'target_valsi',  # Silicio
    'target_valp',   # Fósforo
    'target_vals',   # Azufre
    'target_valcu',  # Cobre
    'target_valcr',  # Cromo
    'target_valmo',  # Molibdeno
    'target_valni'   # Níquel
]

# Hiperparámetros
HYPERPARAMS_CHEM = {
    'n_estimators': 100,
    'max_depth': 5,
    'learning_rate': 0.1,
    'random_state': 42,
    'n_jobs': -1
}

TEST_SIZE_CHEM = 0.2

print(f"Target seleccionado: {TARGET_CHEMICAL}")
print(f"Elemento: {TARGET_CHEMICAL.replace('target_val', '').upper()}")

### 5.2 Data Loading and Feature Selection

In [None]:
# =============================================================================
# STEP 1: LOAD DATA
# =============================================================================
print("="*60)
print(f"CHEMICAL MODEL - {TARGET_CHEMICAL.replace('target_val', '').upper()}")
print("="*60)

df_chem = pd.read_csv(DATA_PROCESSED / "dataset_final_chemical.csv")

print(f"Dataset loaded: {df_chem.shape}")
print(f"Columns: {len(df_chem.columns)}")

In [None]:
# =============================================================================
# STEP 2: FEATURE SELECTION (Avoid Data Leakage)
# =============================================================================

# Check that the target exists before building the feature list
if TARGET_CHEMICAL not in df_chem.columns:
    raise ValueError(f"Target '{TARGET_CHEMICAL}' not found in the dataset")

# Name of the element's initial-value column (to exclude it)
initial_feature = TARGET_CHEMICAL.replace('target_', '')  # e.g. 'valc'

# Features: exclude heatid, all targets, and the initial value of the current element
exclude_cols = ['heatid'] + CHEMICAL_TARGETS + [initial_feature]
feature_cols_chem = [c for c in df_chem.columns if c not in exclude_cols]

print(f"Selected features: {len(feature_cols_chem)}")
print(f"Excluded '{initial_feature}' to avoid data leakage")

print("\nFirst 10 features:")
print(feature_cols_chem[:10])
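
The exclusion rule in Step 2 can be illustrated on a small list of column names. This is a minimal sketch: `select_features` is a hypothetical helper rather than notebook code, and `total_o2` is a made-up column name used only to show a feature that survives the filter.

```python
def select_features(columns, target, all_targets, id_cols=("heatid",)):
    """Drop ID columns, every target, and the target element's initial value."""
    initial_feature = target.replace("target_", "", 1)
    excluded = set(id_cols) | set(all_targets) | {initial_feature}
    # Preserve the original column order
    return [c for c in columns if c not in excluded]


demo_columns = ["heatid", "valc", "valmn", "total_o2", "target_valc", "target_valmn"]
demo_targets = ["target_valc", "target_valmn"]

demo_features = select_features(demo_columns, "target_valc", demo_targets)
print(demo_features)
```

For the carbon target, `valc` is removed while `valmn` stays, because only the initial value of the element being predicted is excluded.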

In [None]:
# =============================================================================
# STEP 3: PREPARE X AND Y
# =============================================================================

X_chem = df_chem[feature_cols_chem].copy()
y_chem = df_chem[TARGET_CHEMICAL].copy()

# Treat infinite target values as missing, then drop rows without a valid target
y_chem = y_chem.replace([np.inf, -np.inf], np.nan).dropna()
X_chem = X_chem.loc[y_chem.index]

print(f"Samples after removing NaN/inf targets: {len(X_chem)}")

# Impute NaNs and infinities in the features with 0
X_chem = X_chem.fillna(0).replace([np.inf, -np.inf], 0)

print(f"\nTarget statistics ({TARGET_CHEMICAL}):")
print(y_chem.describe())

### 5.3 Train/Test Split

In [None]:
# =============================================================================
# STEP 4: TRAIN/TEST SPLIT
# =============================================================================

X_train_chem, X_test_chem, y_train_chem, y_test_chem = train_test_split(
    X_chem, y_chem,
    test_size=TEST_SIZE_CHEM,
    random_state=HYPERPARAMS_CHEM['random_state']
)

print("Train/Test split:")
print(f"  - Train: {len(X_train_chem)} samples")
print(f"  - Test:  {len(X_test_chem)} samples")

### 5.4 Model Training

In [None]:
# =============================================================================
# STEP 5: TRAINING
# =============================================================================

print(f"Training XGBoost for {TARGET_CHEMICAL}...")
print(f"Hyperparameters: {HYPERPARAMS_CHEM}")

model_chem = XGBRegressor(**HYPERPARAMS_CHEM)
model_chem.fit(X_train_chem, y_train_chem)

print("Training completed.")

### 5.5 Evaluation

In [None]:
# =============================================================================
# STEP 6: PREDICTION AND METRICS
# =============================================================================

y_pred_chem = model_chem.predict(X_test_chem)

# Metrics
rmse_chem = np.sqrt(mean_squared_error(y_test_chem, y_pred_chem))
r2_chem = r2_score(y_test_chem, y_pred_chem)
mae_chem = mean_absolute_error(y_test_chem, y_pred_chem)

element_name = TARGET_CHEMICAL.replace('target_val', '').upper()

print("="*60)
print(f"METRICS - {element_name}")
print("="*60)
print(f"  RMSE: {rmse_chem:.6f}")
print(f"  R²:   {r2_chem:.4f}")
print(f"  MAE:  {mae_chem:.6f}")
print("="*60)

# Interpretation
if r2_chem < 0:
    print("WARNING: negative R². The model is worse than predicting the mean.")
elif r2_chem < 0.3:
    print("Low R². Consider tuning hyperparameters or features.")
else:
    print(f"Predictive power is {'good' if r2_chem > 0.6 else 'moderate'}.")
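
The warning about a negative R² follows directly from the definition R² = 1 − SS_res / SS_tot, where SS_tot is the squared error of always predicting the mean of the true values. The sketch below recomputes the three metrics in plain Python on made-up numbers to make that relationship concrete. `regression_metrics` is a hypothetical helper written for illustration; the notebook itself uses the scikit-learn functions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, R2 and MAE from two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # mean-baseline error
    rmse_value = math.sqrt(ss_res / n)
    mae_value = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    r2_value = 1.0 - ss_res / ss_tot
    return rmse_value, r2_value, mae_value


y_true_demo = [1.0, 2.0, 3.0, 4.0]        # made-up values
mean_prediction = [2.5, 2.5, 2.5, 2.5]    # always predict the mean
reversed_prediction = [4.0, 3.0, 2.0, 1.0]

_, r2_mean, _ = regression_metrics(y_true_demo, mean_prediction)
_, r2_reversed, _ = regression_metrics(y_true_demo, reversed_prediction)
print(r2_mean, r2_reversed)
```

Predicting the mean gives an R² of zero, and any prediction with a larger squared error than that baseline pushes R² below zero, which is the case the warning branch reports.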

### 5.6 Visualization

In [None]:
# =============================================================================
# STEP 7: PLOTS
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ----- Plot 1: Predicted vs Actual -----
ax1 = axes[0]
ax1.scatter(y_test_chem, y_pred_chem, alpha=0.5, color='steelblue', edgecolors='white', linewidth=0.5)

# Perfect-prediction line
min_val = min(y_test_chem.min(), y_pred_chem.min())
max_val = max(y_test_chem.max(), y_pred_chem.max())
ax1.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')

ax1.set_xlabel(f'Actual Value - {element_name} (%)', fontsize=11)
ax1.set_ylabel(f'Predicted Value - {element_name} (%)', fontsize=11)
ax1.set_title(f'Predicted vs Actual - {element_name}', fontsize=12, fontweight='bold')

# Metrics in a text box
textstr = f'RMSE: {rmse_chem:.6f}\nR²: {r2_chem:.4f}\nMAE: {mae_chem:.6f}'
props = dict(boxstyle='round', facecolor='wheat', alpha=0.8)
ax1.text(0.05, 0.95, textstr, transform=ax1.transAxes, fontsize=10,
         verticalalignment='top', bbox=props)
ax1.legend(loc='lower right')
ax1.grid(True, alpha=0.3)

# ----- Plot 2: Feature Importance -----
ax2 = axes[1]

importances_chem = model_chem.feature_importances_
importance_df_chem = pd.DataFrame({
    'feature': feature_cols_chem,
    'importance': importances_chem
}).sort_values('importance', ascending=False)

# Cap top_n at the number of features so the bar counts always match
top_n = min(15, len(importance_df_chem))
top_features_chem = importance_df_chem.head(top_n)

colors = plt.cm.viridis(np.linspace(0.2, 0.8, top_n))
ax2.barh(range(top_n), top_features_chem['importance'].values[::-1], color=colors)
ax2.set_yticks(range(top_n))
ax2.set_yticklabels(top_features_chem['feature'].values[::-1], fontsize=9)
ax2.set_xlabel('Importance', fontsize=11)
ax2.set_title(f'Top {top_n} Features - {element_name}', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# =============================================================================
# STEP 8: SAVE RESULTS
# =============================================================================

element_short = TARGET_CHEMICAL.replace('target_', '')

# Save model
model_path_chem = MODELS_DIR / f"model_chemical_{element_short}_xgboost.joblib"
joblib.dump(model_chem, model_path_chem)

# Save metadata (metrics are cast to built-in floats so json can serialize them)
metadata_path_chem = MODELS_DIR / f"model_chemical_{element_short}_metadata.json"
metadata_chem = {
    'target': TARGET_CHEMICAL,
    'element': element_short,
    'model_type': 'XGBoost Regressor',
    'hyperparameters': HYPERPARAMS_CHEM,
    'metrics': {'RMSE': float(rmse_chem), 'R2': float(r2_chem), 'MAE': float(mae_chem)},
    'features': feature_cols_chem,
    'n_train_samples': len(X_train_chem),
    'n_test_samples': len(X_test_chem)
}

with open(metadata_path_chem, 'w') as f:
    json.dump(metadata_chem, f, indent=4)

print(f"Model saved: {model_path_chem}")
print(f"Metadata saved: {metadata_path_chem}")

### 5.7 Training for Multiple Elements (Optional)

This cell trains one model for each available chemical element.

In [None]:
# =============================================================================
# TRAINING FOR MULTIPLE ELEMENTS (OPTIONAL)
# =============================================================================
# Set TRAIN_ALL_ELEMENTS = True to train a model for every element

TRAIN_ALL_ELEMENTS = False  # Change to True to train all

if TRAIN_ALL_ELEMENTS:
    results_all = {}

    for target in CHEMICAL_TARGETS:
        if target not in df_chem.columns:
            print(f"Skipping {target} - not found in dataset")
            continue

        print(f"\n{'='*40}")
        print(f"Training: {target}")
        print(f"{'='*40}")

        # Prepare data
        initial_feat = target.replace('target_', '')
        exclude = ['heatid'] + CHEMICAL_TARGETS + [initial_feat]
        feat_cols = [c for c in df_chem.columns if c not in exclude]

        # Clean: drop NaN/inf targets, impute NaN/inf features with 0
        y_all = df_chem[target].replace([np.inf, -np.inf], np.nan).dropna()
        X_all = df_chem.loc[y_all.index, feat_cols].fillna(0).replace([np.inf, -np.inf], 0)

        if len(X_all) < 100:
            print(f"  Not enough samples ({len(X_all)}), skipping...")
            continue

        # Split (same settings as the single-element flow)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_all, y_all,
            test_size=TEST_SIZE_CHEM,
            random_state=HYPERPARAMS_CHEM['random_state']
        )

        # Train
        model_t = XGBRegressor(**HYPERPARAMS_CHEM)
        model_t.fit(X_tr, y_tr)

        # Evaluate (suffixed names avoid overwriting rmse/r2 from the temperature model)
        y_p = model_t.predict(X_te)
        rmse_t = np.sqrt(mean_squared_error(y_te, y_p))
        r2_t = r2_score(y_te, y_p)

        results_all[target] = {'RMSE': rmse_t, 'R2': r2_t}
        print(f"  RMSE: {rmse_t:.6f}, R²: {r2_t:.4f}")

        # Save
        elem = target.replace('target_', '')
        joblib.dump(model_t, MODELS_DIR / f"model_chemical_{elem}_xgboost.joblib")

    # Summary
    print("\n" + "="*60)
    print("SUMMARY OF ALL MODELS")
    print("="*60)
    for t, m in results_all.items():
        print(f"{t}: RMSE={m['RMSE']:.6f}, R²={m['R2']:.4f}")
else:
    print("Multi-element training is disabled. Set TRAIN_ALL_ELEMENTS = True to enable it.")
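
Once the loop has filled a dictionary shaped like `results_all`, sorting it by R² makes it easier to see which elements are modeled well. The sketch below assumes that dictionary layout. `rank_results` is a hypothetical helper, and the numbers are placeholders for illustration, not results from this notebook.

```python
def rank_results(results):
    """Return (target, metrics) pairs sorted by R2, best first."""
    return sorted(results.items(), key=lambda item: item[1]["R2"], reverse=True)


# Placeholder metrics, NOT actual model results
demo_results = {
    "target_valc": {"RMSE": 0.020, "R2": 0.55},
    "target_valmn": {"RMSE": 0.050, "R2": 0.72},
    "target_valsi": {"RMSE": 0.030, "R2": -0.10},
}

ranking = rank_results(demo_results)
for target_name, metrics in ranking:
    print(f"{target_name}: RMSE={metrics['RMSE']:.6f}, R2={metrics['R2']:.4f}")
```

Because RMSE depends on each element's concentration scale, R² is the more comparable quantity across elements, which is why the sketch sorts on it.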

---
## Final Summary

This refactored notebook includes:

1. **PART 3**: Temperature Pipeline (Sequential)
   - Generates `dataset_sequential_temp.csv`
   - Physical and quantile filters
   - Dynamic + static features

2. **PART 3.5**: Chemical Pipeline (Static)
   - Generates `dataset_final_chemical.csv`
   - Extracts 9 chemical elements as targets

3. **PART 4**: Temperature Model with XGBoost
   - GroupShuffleSplit to avoid data leakage
   - Metrics and visualizations

4. **PART 5**: Unpacked Chemical Model
   - Sequential blocks for debugging
   - Easy hyperparameter changes
   - Feature importance visualization