# HR Overtime Prediction and Forecasting
## Predicción y Pronóstico de Horas Extra de RRHH

Este notebook implementa un pipeline automatizado para el pronóstico de horas extra por departamento, utilizando datos históricos almacenados en SQL Server. El flujo incluye:

- **Carga y limpieza de datos:** Importación desde SQL Server, agregación quincenal, interpolación de valores faltantes y detección/corrección de outliers con el filtro de Hampel.
- **Análisis exploratorio y descomposición:** Selección automática del tipo de descomposición (aditiva o multiplicativa) según la varianza de los residuos y análisis de estacionariedad.
- **Entrenamiento y comparación de modelos:** Se entrenan y comparan cuatro modelos de forecasting:
  - ARIMA (selección automática de parámetros)
  - Prophet (con modo de estacionalidad dinámico y silenciamiento de outputs)
  - Holt-Winters (Suavización Exponencial)
  - XGBoost (con variables de rezago y codificación seno/coseno del mes para capturar estacionalidad)
- **Selección automática del mejor modelo:** Basada en métricas de validación cruzada (RMSE, MAPE, etc.), con lógica para evitar modelos de línea plana.
- **Generación de pronósticos:** Predicción de las próximas 12 quincenas (6 meses) con intervalos de confianza, usando solo el mejor modelo por departamento.
- **Visualización interactiva:** Gráficas con Plotly para históricos, descomposición y pronóstico.
- **Trazabilidad y auditoría:** Almacenamiento de modelos, métricas y predicciones en SQL Server para seguimiento y control.
- **Outputs limpios:** El notebook suprime mensajes innecesarios, especialmente de Prophet, y utiliza logging para trazabilidad.

**Objetivo:** Proveer pronósticos robustos y auditables de horas extra, minimizando la intervención manual y maximizando la calidad y trazabilidad del proceso analítico.

## 1. Importar Librerías y Configuración

Importamos las librerías necesarias para análisis, modelado, visualización y conexión con la base de datos. Configuramos el logging para trazabilidad y suprimimos advertencias innecesarias para mantener un output limpio.

In [25]:
import pandas as pd
import numpy as np
import pymssql
import logging
import datetime
from datetime import datetime
import os
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.stattools import adfuller
from pmdarima import auto_arima
from prophet import Prophet
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit
from scipy.stats import median_abs_deviation
import joblib
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from xgboost import XGBRegressor
import contextlib
import sys
import io

# Configurar advertencias y logging
warnings.filterwarnings("ignore")
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler('overtime_forecast.log')
    ]
)

## 2. Conectar a la Base de Datos
Establecemos una conexión segura a SQL Server para cargar datos históricos y almacenar resultados.

In [26]:
def get_db_connection():
    SQL_SERVER = "172.28.192.1:50121"
    SQL_DB = "HR_Analytics"
    SQL_USER = "sa"
    SQL_PASSWORD = "123456"
    try:
        conn = pymssql.connect(
            server=SQL_SERVER,
            database=SQL_DB,
            user=SQL_USER,
            password=SQL_PASSWORD
        )
        logging.info("Conexión a la base de datos exitosa")
        return conn
    except Exception as e:
        logging.error(f"Error de conexión a la base de datos: {e}")
        raise

## 3. Cargar y Procesar Datos Históricos

Cargamos los datos históricos con `work_date < 2025-05-04`, los agregamos quincenalmente y manejamos datos faltantes con interpolación. Se aplica un filtrado de Hampel para detectar y corregir outliers. Finalmente, se realiza una descomposición de series de tiempo, seleccionando automáticamente el modelo (aditivo o multiplicativo) que mejor se ajuste a los datos según la varianza de los residuos.

In [27]:
def load_data():
    try:
        conn = get_db_connection()
        query = "SELECT work_date, department, total_overtime FROM vw_historical_data WHERE work_date < '2025-05-04' ORDER BY work_date"
        df = pd.read_sql(query, conn)
        conn.close()
        logging.info(f"Datos cargados: {len(df)} registros")
        df['work_date'] = pd.to_datetime(df['work_date'])
        
        # Agrupar por quincena
        df.set_index('work_date', inplace=True)
        df = df.groupby('department').resample('2W').sum(numeric_only=True).reset_index()
        df.rename(columns={'work_date': 'ds', 'total_overtime': 'y'}, inplace=True)
        
        return df
    except Exception as e:
        logging.error(f"Error al cargar datos: {e}")
        return pd.DataFrame()

def handle_outliers(df):
    df_cleaned = df.copy()
    for dept in df_cleaned['department'].unique():
        dept_data = df_cleaned[df_cleaned['department'] == dept]['y']
        
        # Filtrado de Hampel para detección de outliers
        median = dept_data.median()
        mad = median_abs_deviation(dept_data)
        threshold = 3 * mad
        
        lower_bound = median - threshold
        upper_bound = median + threshold
        
        outliers = dept_data[(dept_data < lower_bound) | (dept_data > upper_bound)]
        if not outliers.empty:
            logging.warning(f"Outliers detectados en {dept}:\n{df_cleaned[(df_cleaned['department'] == dept) & df_cleaned['y'].isin(outliers)]}")
            df_cleaned.loc[(df_cleaned['department'] == dept) & (df_cleaned['y'].isin(outliers)), 'y'] = np.nan
        
    df_cleaned['y'] = df_cleaned.groupby('department')['y'].transform(lambda x: x.interpolate(method='linear'))
    return df_cleaned

df_all_data = load_data()
df_all_data = handle_outliers(df_all_data)

logging.info(f"Resumen de datos:\nDepartamentos únicos: {df_all_data['department'].unique()}\nFechas únicas: {df_all_data['ds'].unique()}\nTotal registros: {len(df_all_data)}")


def select_decomposition_type(series):
    # Rellenar valores nulos con la media de la serie para evitar errores
    series_filled = series.fillna(series.mean())

    # Realizar el test de Dickey-Fuller para verificar estacionalidad
    adf_result = adfuller(series_filled)
    is_stationary = adf_result[1] <= 0.05

    # Descomposición aditiva
    result_add = seasonal_decompose(series_filled, model='additive', period=12)
    std_add = np.nanstd(result_add.resid)
    
    # Descomposición multiplicativa
    result_mul = seasonal_decompose(series_filled, model='multiplicative', period=12)
    std_mul = np.nanstd(result_mul.resid)

    decomposition_type = 'additive' if std_add < std_mul else 'multiplicative'

    # Limpiar prints: solo logging
    logging.info(f"Desviación estándar de los residuales aditivos: {std_add:.2f}")
    logging.info(f"Desviación estándar de los residuales multiplicativos: {std_mul:.2f}")
    logging.info(f"P-value del test ADF: {adf_result[1]:.2f}")

    if not is_stationary and decomposition_type == 'additive':
        logging.info("La serie no es estacionaria y no tiene una tendencia clara. Se descarta ARIMA.")
        return decomposition_type, False

    logging.info(f"Criterio de selección: La desviación estándar de los residuos {decomposition_type} es menor.")
    return decomposition_type, True

def plot_decomposition(series, decomposition_type, dept):
    # Rellenar valores nulos con la media de la serie para una visualización completa
    series_filled = series.fillna(series.mean())
    result = seasonal_decompose(series_filled, model=decomposition_type, period=12)
    fig = make_subplots(rows=4, cols=1, subplot_titles=['Original', 'Tendencia', 'Estacionalidad', 'Residuos'])
    
    fig.add_trace(go.Scatter(x=series.index, y=series, mode='lines', name='Original'), row=1, col=1)
    fig.add_trace(go.Scatter(x=result.trend.index, y=result.trend, mode='lines', name='Tendencia'), row=2, col=1)
    fig.add_trace(go.Scatter(x=result.seasonal.index, y=result.seasonal, mode='lines', name='Estacionalidad'), row=3, col=1)
    fig.add_trace(go.Scatter(x=result.resid.index, y=result.resid, mode='lines', name='Residuos'), row=4, col=1)
    
    fig.update_layout(height=800, title_text=f"Descomposición de la Serie Temporal ({decomposition_type.capitalize()}) para {dept}")
    fig.show()

# Ejemplo de uso
for dept in df_all_data['department'].unique():
    dept_data = df_all_data[df_all_data['department'] == dept].set_index('ds')['y']
    decomposition_type, _ = select_decomposition_type(dept_data)
    logging.info(f"El departamento {dept} tiene una descomposición {decomposition_type}.")
    plot_decomposition(dept_data, decomposition_type, dept)

2025-08-13 16:26:25,671 - INFO - Conexión a la base de datos exitosa
2025-08-13 16:26:33,666 - INFO - Datos cargados: 420 registros
2025-08-13 16:26:33,666 - INFO - Datos cargados: 420 registros
   department         ds       y
11    Finance 2024-06-02   61.74
12    Finance 2024-06-16  118.32
13    Finance 2024-06-30   74.01
21    Finance 2024-10-20   59.40
25    Finance 2024-12-15  107.94
26    Finance 2024-12-29   98.50
   department         ds       y
36         HR 2023-12-31    7.85
47         HR 2024-06-02   99.88
48         HR 2024-06-16  150.72
49         HR 2024-06-30   91.51
60         HR 2024-12-01   88.00
61         HR 2024-12-15  151.65
62         HR 2024-12-29   93.23
   department         ds       y
11    Finance 2024-06-02   61.74
12    Finance 2024-06-16  118.32
13    Finance 2024-06-30   74.01
21    Finance 2024-10-20   59.40
25    Finance 2024-12-15  107.94
26    Finance 2024-12-29   98.50
   department         ds       y
36         HR 2023-12-31    7.85
47         HR

2025-08-13 16:26:34,197 - INFO - Desviación estándar de los residuales aditivos: 4.23
2025-08-13 16:26:34,199 - INFO - Desviación estándar de los residuales multiplicativos: 0.08
2025-08-13 16:26:34,202 - INFO - P-value del test ADF: 0.01
2025-08-13 16:26:34,206 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:34,208 - INFO - El departamento HR tiene una descomposición multiplicative.
2025-08-13 16:26:34,199 - INFO - Desviación estándar de los residuales multiplicativos: 0.08
2025-08-13 16:26:34,202 - INFO - P-value del test ADF: 0.01
2025-08-13 16:26:34,206 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:34,208 - INFO - El departamento HR tiene una descomposición multiplicative.


2025-08-13 16:26:34,556 - INFO - Desviación estándar de los residuales aditivos: 9.02
2025-08-13 16:26:34,628 - INFO - Desviación estándar de los residuales multiplicativos: 0.09
2025-08-13 16:26:34,630 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:34,641 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:34,645 - INFO - El departamento IT tiene una descomposición multiplicative.
2025-08-13 16:26:34,628 - INFO - Desviación estándar de los residuales multiplicativos: 0.09
2025-08-13 16:26:34,630 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:34,641 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:34,645 - INFO - El departamento IT tiene una descomposición multiplicative.


2025-08-13 16:26:35,933 - INFO - Desviación estándar de los residuales aditivos: 7.68
2025-08-13 16:26:35,934 - INFO - Desviación estándar de los residuales multiplicativos: 0.06
2025-08-13 16:26:35,936 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:35,939 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:35,941 - INFO - El departamento Inventory tiene una descomposición multiplicative.
2025-08-13 16:26:35,934 - INFO - Desviación estándar de los residuales multiplicativos: 0.06
2025-08-13 16:26:35,936 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:35,939 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:35,941 - INFO - El departamento Inventory tiene una descomposición multiplicative.


2025-08-13 16:26:36,029 - INFO - Desviación estándar de los residuales aditivos: 3.56
2025-08-13 16:26:36,030 - INFO - Desviación estándar de los residuales multiplicativos: 0.05
2025-08-13 16:26:36,031 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:36,030 - INFO - Desviación estándar de los residuales multiplicativos: 0.05
2025-08-13 16:26:36,031 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:36,032 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:36,033 - INFO - El departamento Marketing tiene una descomposición multiplicative.
2025-08-13 16:26:36,032 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:36,033 - INFO - El departamento Marketing tiene una descomposición multiplicative.


2025-08-13 16:26:36,254 - INFO - Desviación estándar de los residuales aditivos: 27.08
2025-08-13 16:26:36,256 - INFO - Desviación estándar de los residuales multiplicativos: 0.04
2025-08-13 16:26:36,258 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:36,261 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:36,263 - INFO - El departamento Sales tiene una descomposición multiplicative.
2025-08-13 16:26:36,256 - INFO - Desviación estándar de los residuales multiplicativos: 0.04
2025-08-13 16:26:36,258 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:36,261 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:36,263 - INFO - El departamento Sales tiene una descomposición multiplicative.


## 4. Entrenamiento y Selección de Modelos

Para cada departamento, entrenamos y evaluamos tres modelos de pronóstico:

1.  **ARIMA**: Se utiliza `auto_arima` para encontrar los parámetros óptimos. 
2.  **Prophet**: Actúa como un modelo robusto de respaldo. Su modo de estacionalidad (`additive` o `multiplicative`) se configura dinámicamente según el análisis de descomposición previo.
3.  **Holt (Suavización Exponencial)**: Un modelo clásico que captura tendencias.

Los modelos se evalúan mediante validación cruzada de series de tiempo. El mejor modelo se selecciona en función de su rendimiento en métricas clave (RMSE, MAE, SMAPE, MASE). Se genera una tabla comparativa para visualizar el rendimiento y la selección final.

In [28]:
def create_lag_features(series, n_lags=6):
    df = pd.DataFrame({'y': series})
    for lag in range(1, n_lags + 1):
        df[f'lag_{lag}'] = df['y'].shift(lag)
    df = df.dropna()
    return df

def create_lag_features_with_time(series, n_lags=6):
    """
    Crea un DataFrame con lags y variables seno/coseno del mes del año para capturar estacionalidad.
    """
    df = pd.DataFrame({'y': series})
    df['month'] = df.index.month
    df['sin_month'] = np.sin(2 * np.pi * df['month'] / 12)
    df['cos_month'] = np.cos(2 * np.pi * df['month'] / 12)
    for lag in range(1, n_lags + 1):
        df[f'lag_{lag}'] = df['y'].shift(lag)
    df = df.dropna()
    return df

def train_and_compare_models_with_cv(series, decomposition_type, consider_arima):
    series_filled = series.fillna(series.mean()).interpolate(method='linear')
    min_train_size = 24
    if len(series_filled) < min_train_size:
        raise ValueError(f"La serie tiene solo {len(series_filled)} puntos, se necesitan al menos {min_train_size} para el entrenamiento estacional.")
    n_splits = len(series_filled) // 12 - 1 
    if n_splits < 1:
        n_splits = 1
    tscv = TimeSeriesSplit(n_splits=n_splits)

    all_arima_rmses, all_arima_mapes = [], []
    all_es_rmses, all_es_mapes = [], []
    all_prophet_rmses, all_prophet_mapes = [], []
    all_xgb_rmses, all_xgb_mapes = [], []

    for train_index, val_index in tscv.split(series_filled):
        train_data = series_filled.iloc[train_index]
        val_data = series_filled.iloc[val_index]
        if len(train_data) < min_train_size:
            continue

        # ARIMA
        if consider_arima:
            try:
                arima_model = auto_arima(train_data, seasonal=False, stepwise=True, suppress_warnings=True, error_action='ignore')
                arima_preds = arima_model.predict(n_periods=len(val_data))
                all_arima_rmses.append(np.sqrt(mean_squared_error(val_data, arima_preds)))
                all_arima_mapes.append(mean_absolute_percentage_error(val_data, arima_preds))
            except Exception as e:
                logging.error(f"Error al entrenar ARIMA: {e}")
                all_arima_rmses.append(np.inf)
                all_arima_mapes.append(np.inf)
        else:
            all_arima_rmses.append(np.inf)
            all_arima_mapes.append(np.inf)

        # Exponential Smoothing
        try:
            es_model = ExponentialSmoothing(train_data, trend='add', seasonal=decomposition_type, seasonal_periods=12).fit()
            es_preds = es_model.predict(start=val_data.index[0], end=val_data.index[-1])
            all_es_rmses.append(np.sqrt(mean_squared_error(val_data, es_preds)))
            all_es_mapes.append(mean_absolute_percentage_error(val_data, es_preds))
        except Exception as e:
            logging.error(f"Error al entrenar Exponential Smoothing: {e}")
            all_es_rmses.append(np.inf)
            all_es_mapes.append(np.inf)

        # Prophet (silenciar output)
        try:
            prophet_df_train = train_data.reset_index().rename(columns={'index': 'ds', 'y': 'y'})
            prophet_model = Prophet(seasonality_mode=decomposition_type)
            prophet_model.add_seasonality(name='biweekly', period=26, fourier_order=5)
            with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
                prophet_model.fit(prophet_df_train)
            future = prophet_model.make_future_dataframe(periods=len(val_data), freq='2W')
            with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
                prophet_preds_df = prophet_model.predict(future)
            prophet_preds = prophet_preds_df['yhat'].iloc[-len(val_data):]
            all_prophet_rmses.append(np.sqrt(mean_squared_error(val_data, prophet_preds)))
            all_prophet_mapes.append(mean_absolute_percentage_error(val_data, prophet_preds))
        except Exception as e:
            logging.error(f"Error al entrenar Prophet: {e}")
            all_prophet_rmses.append(np.inf)
            all_prophet_mapes.append(np.inf)

        # XGBoost con lags y seno/coseno del mes
        try:
            n_lags = 6
            df_lag = create_lag_features_with_time(train_data, n_lags=n_lags)
            X_train = df_lag.drop('y', axis=1).values
            y_train = df_lag['y'].values
            # Para validación, usar los últimos n_lags valores del train + val para construir features
            df_val = pd.concat([train_data.iloc[-n_lags:], val_data])
            df_val_lag = create_lag_features_with_time(df_val, n_lags=n_lags)
            X_val = df_val_lag.drop('y', axis=1).values
            y_val = df_val_lag['y'].values
            xgb_model = XGBRegressor(n_estimators=100, random_state=42)
            xgb_model.fit(X_train, y_train)
            xgb_preds = xgb_model.predict(X_val)
            all_xgb_rmses.append(np.sqrt(mean_squared_error(y_val, xgb_preds)))
            all_xgb_mapes.append(mean_absolute_percentage_error(y_val, xgb_preds))
        except Exception as e:
            logging.error(f"Error al entrenar XGBoost: {e}")
            all_xgb_rmses.append(np.inf)
            all_xgb_mapes.append(np.inf)

    avg_arima_rmse = np.mean(all_arima_rmses) if all_arima_rmses else np.inf
    avg_arima_mape = np.mean(all_arima_mapes) if all_arima_mapes else np.inf
    avg_es_rmse = np.mean(all_es_rmses) if all_es_rmses else np.inf
    avg_es_mape = np.mean(all_es_mapes) if all_es_mapes else np.inf
    avg_prophet_rmse = np.mean(all_prophet_rmses) if all_prophet_rmses else np.inf
    avg_prophet_mape = np.mean(all_prophet_mapes) if all_prophet_mapes else np.inf
    avg_xgb_rmse = np.mean(all_xgb_rmses) if all_xgb_rmses else np.inf
    avg_xgb_mape = np.mean(all_xgb_mapes) if all_xgb_mapes else np.inf

    metrics = pd.DataFrame({
        'Modelo': ['ARIMA', 'Exponential Smoothing', 'Prophet', 'XGBoost'],
        'RMSE': [avg_arima_rmse, avg_es_rmse, avg_prophet_rmse, avg_xgb_rmse],
        'MAPE': [avg_arima_mape * 100, avg_es_mape * 100, avg_prophet_mape * 100, avg_xgb_mape * 100]
    }).sort_values('RMSE')

    best_model_name = metrics.iloc[0]['Modelo']
    best_model = None

    series_filled_final = series.fillna(series.mean()).interpolate(method='linear')
    if best_model_name == 'ARIMA':
        best_model = auto_arima(series_filled_final, seasonal=False, stepwise=True, suppress_warnings=True, error_action='ignore')
    elif best_model_name == 'Exponential Smoothing':
        best_model = ExponentialSmoothing(series_filled_final, trend='add', seasonal=decomposition_type, seasonal_periods=12).fit()
    elif best_model_name == 'Prophet':
        prophet_df_all = series_filled_final.reset_index().rename(columns={'index': 'ds', 'y': 'y'})
        best_model = Prophet(seasonality_mode=decomposition_type)
        best_model.add_seasonality(name='biweekly', period=26, fourier_order=5)
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            best_model.fit(prophet_df_all)
    elif best_model_name == 'XGBoost':
        n_lags = 6
        df_lag = create_lag_features_with_time(series_filled_final, n_lags=n_lags)
        X = df_lag.drop('y', axis=1).values
        y = df_lag['y'].values
        best_model = XGBRegressor(n_estimators=100, random_state=42)
        best_model.fit(X, y)

    return best_model, best_model_name, metrics



# Lógica principal del script (Líneas corregidas)
all_metrics = {}
all_forecasts = {}
n_periods_forecast = 12 

for dept in df_all_data['department'].unique():
    logging.info(f"Iniciando análisis para el departamento: {dept}")
    dept_data = df_all_data[df_all_data['department'] == dept].set_index('ds')['y']
    

    decomposition_type, consider_arima = select_decomposition_type(dept_data)
    
    best_model, best_model_name, metrics = train_and_compare_models_with_cv(dept_data, decomposition_type, consider_arima)
    
    # ---- LÓGICA DE VERIFICACIÓN DE LÍNEA PLANA ----
    series_filled_final = dept_data.fillna(dept_data.mean()).interpolate(method='linear')
    
    historical_pred_check = None
    if best_model_name == 'ARIMA':
        historical_pred_check = best_model.predict_in_sample()
    elif best_model_name == 'Exponential Smoothing':
        historical_pred_check = pd.Series(best_model.fittedvalues, index=dept_data.index)
    elif best_model_name == 'Prophet':
        prophet_df_all = series_filled_final.reset_index().rename(columns={'index': 'ds', 'y': 'y'})
        future = best_model.make_future_dataframe(periods=n_periods_forecast, freq='2W', include_history=False)  # Solo futuro
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            forecast_df = best_model.predict(future)
        historical_pred_check = forecast_df.set_index('ds')['yhat'].reindex(dept_data.index, method='ffill')  # Alinear con históricos

    if historical_pred_check is not None and np.std(historical_pred_check) < 0.1:
        logging.warning(f"El mejor modelo actual ({best_model_name}) produce una predicción histórica de línea plana. Se seleccionará el siguiente mejor modelo.")
        
        second_best_model_name = metrics.iloc[1]['Modelo']
        metrics = metrics.iloc[1:].reset_index(drop=True)
        best_model_name = second_best_model_name
        
        if best_model_name == 'ARIMA':
            best_model = auto_arima(series_filled_final, seasonal=False, stepwise=True, suppress_warnings=True, error_action='ignore')
        elif best_model_name == 'Exponential Smoothing':
            best_model = ExponentialSmoothing(series_filled_final, trend='add', seasonal=decomposition_type, seasonal_periods=12).fit()
        elif best_model_name == 'Prophet':
            prophet_df_all = series_filled_final.reset_index().rename(columns={'index': 'ds', 'y': 'y'})
            best_model = Prophet(seasonality_mode=decomposition_type)
            best_model.add_seasonality(name='biweekly', period=26, fourier_order=5)
            with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
                best_model.fit(prophet_df_all)
        elif best_model_name == 'XGBoost':
            n_lags = 6
            df_lag = create_lag_features_with_time(series_filled_final, n_lags=n_lags)
            X = df_lag.drop('y', axis=1).values
            y = df_lag['y'].values
            best_model = XGBRegressor(n_estimators=100, random_state=42)
            best_model.fit(X, y)
            
    # ---- FIN DE LA LÓGICA DE VERIFICACIÓN ----

    all_metrics[dept] = metrics
    
    logging.info(f"El mejor modelo FINAL para {dept} es: {best_model_name} con RMSE de {metrics.iloc[0]['RMSE']:.2f}")
    print(f"Métricas de los modelos para {dept}:\n{metrics}\n")
    
    # Generar predicciones para el periodo histórico + futuro
    if best_model_name == 'ARIMA':
        forecast_historical = best_model.predict_in_sample()
        forecast_future = best_model.predict(n_periods=n_periods_forecast)
        
        forecast_combined = pd.concat([forecast_historical, forecast_future])
        forecast_combined.name = 'yhat'

        future_index = pd.date_range(start=dept_data.index[-1] + pd.Timedelta(days=1), periods=n_periods_forecast, freq='2W')
        forecast_df_final = pd.DataFrame(forecast_combined.iloc[-n_periods_forecast:].values, index=future_index, columns=['yhat'])
        forecast_df_final['yhat_lower'] = forecast_df_final['yhat'] - (forecast_df_final['yhat'] * 0.1)
        forecast_df_final['yhat_upper'] = forecast_df_final['yhat'] + (forecast_df_final['yhat'] * 0.1)

        historical_pred_df = pd.DataFrame(forecast_combined.iloc[:len(dept_data)].values, index=dept_data.index, columns=['yhat'])
        historical_pred_df['yhat_lower'] = historical_pred_df['yhat'] - (historical_pred_df['yhat'] * 0.1)
        historical_pred_df['yhat_upper'] = historical_pred_df['yhat'] + (historical_pred_df['yhat'] * 0.1)

    elif best_model_name == 'Exponential Smoothing':
        forecast_historical = pd.Series(best_model.fittedvalues, index=dept_data.index)
        forecast_future = best_model.forecast(steps=n_periods_forecast)
        
        # Asegurar que las fechas futuras se asignen como columna 'ds'
        future_index = pd.date_range(start=dept_data.index[-1] + pd.Timedelta(days=1), periods=n_periods_forecast, freq='2W')
        forecast_df_final = pd.DataFrame({
            'ds': future_index,
            'yhat': forecast_future.values
        })
        forecast_df_final['yhat_lower'] = forecast_df_final['yhat'] - (forecast_df_final['yhat'] * 0.1)
        forecast_df_final['yhat_upper'] = forecast_df_final['yhat'] + (forecast_df_final['yhat'] * 0.1)
        
        historical_pred_df = pd.DataFrame(forecast_historical, columns=['yhat'])
        historical_pred_df['yhat_lower'] = historical_pred_df['yhat'] - (historical_pred_df['yhat'] * 0.1)
        historical_pred_df['yhat_upper'] = historical_pred_df['yhat'] + (historical_pred_df['yhat'] * 0.1)

    elif best_model_name == 'Prophet':
        prophet_df_all = dept_data.reset_index().rename(columns={'index': 'ds', 'y': 'y'})
        # Generar solo las fechas futuras a partir del día siguiente
        last_date = dept_data.index[-1]
        future_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=n_periods_forecast, freq='2W')
        future_df = pd.DataFrame({'ds': future_dates})
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            forecast_df = best_model.predict(future_df)
        forecast_df_final = forecast_df.set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']]
        future_all = best_model.make_future_dataframe(periods=len(dept_data) + n_periods_forecast, freq='2W', include_history=True)
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            forecast_all = best_model.predict(future_all)
        historical_pred_df = forecast_all.iloc[:len(dept_data)].set_index('ds')[['yhat', 'yhat_lower', 'yhat_upper']]
    elif best_model_name == 'XGBoost':
        n_lags = 6
        # Predicción histórica
        df_lag = create_lag_features_with_time(series_filled_final, n_lags=n_lags)
        X_hist = df_lag.drop('y', axis=1).values
        y_hist = df_lag['y'].values
        yhat_hist = best_model.predict(X_hist)
        historical_pred_df = pd.DataFrame({'yhat': yhat_hist}, index=series_filled_final.index[n_lags:])
        historical_pred_df['yhat_lower'] = historical_pred_df['yhat'] * 0.9
        historical_pred_df['yhat_upper'] = historical_pred_df['yhat'] * 1.1
        # Predicción futura (multi-step, recursiva)
        last_values = list(series_filled_final[-n_lags:])
        last_index = series_filled_final.index[-1]
        preds = []
        future_dates = []
        for i in range(n_periods_forecast):
            next_date = last_index + pd.DateOffset(weeks=2*(i+1))
            month = next_date.month
            sin_month = np.sin(2 * np.pi * month / 12)
            cos_month = np.cos(2 * np.pi * month / 12)
            X_pred = np.array(last_values[-n_lags:] + [month, sin_month, cos_month]).reshape(1, -1)
            # XGBoost espera las columnas en el mismo orden que el entrenamiento
            # Por lo tanto, construimos un DataFrame temporal para asegurar el orden
            cols = [f'lag_{lag}' for lag in range(1, n_lags+1)] + ['month', 'sin_month', 'cos_month']
            X_pred_df = pd.DataFrame([last_values[-n_lags:] + [month, sin_month, cos_month]], columns=cols)
            pred = best_model.predict(X_pred_df.values)[0]
            preds.append(pred)
            last_values.append(pred)
            future_dates.append(next_date)
        forecast_df_final = pd.DataFrame({'ds': future_dates, 'yhat': preds})
        forecast_df_final['yhat_lower'] = forecast_df_final['yhat'] * 0.9
        forecast_df_final['yhat_upper'] = forecast_df_final['yhat'] * 1.1

    all_forecasts[dept] = {'future': forecast_df_final, 'historical_pred': historical_pred_df}

2025-08-13 16:26:44,006 - INFO - Iniciando análisis para el departamento: Finance
2025-08-13 16:26:44,029 - INFO - Desviación estándar de los residuales aditivos: 4.87
2025-08-13 16:26:44,029 - INFO - Desviación estándar de los residuales aditivos: 4.87
2025-08-13 16:26:44,031 - INFO - Desviación estándar de los residuales multiplicativos: 0.16
2025-08-13 16:26:44,035 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:44,038 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:44,031 - INFO - Desviación estándar de los residuales multiplicativos: 0.16
2025-08-13 16:26:44,035 - INFO - P-value del test ADF: 0.00
2025-08-13 16:26:44,038 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:26:46,236 - DEBUG - cmd: where.exe tbb.dll
cwd: None
2025-08-13 16:26:46,236 - DEBUG - cmd: where.exe tbb.dll
cwd: None
2025-08-13 16:26:47,046 - DEBUG - TBB already found in load path
2

Métricas de los modelos para Finance:
                  Modelo       RMSE       MAPE
0                  ARIMA   6.810098  17.271081
1  Exponential Smoothing   9.957076  25.438809
2                Prophet  10.767921  29.569347
3                XGBoost  11.785891  32.289255



2025-08-13 16:26:57,796 - DEBUG - cmd: where.exe tbb.dll
cwd: None
2025-08-13 16:26:58,333 - DEBUG - TBB already found in load path
2025-08-13 16:26:58,333 - DEBUG - TBB already found in load path
2025-08-13 16:26:58,346 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:26:58,349 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:26:58,346 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:26:58,349 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:26:58,351 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
2025-08-13 16:26:58,386 - INFO - n_changepoints greater than number of observations. Using 18.
2025-08-13 16:26:58,351 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=Tru

Métricas de los modelos para HR:
                  Modelo       RMSE       MAPE
0                  ARIMA   9.936516  20.449663
3                XGBoost  10.080056  22.248131
2                Prophet  13.243985  25.069007
1  Exponential Smoothing  17.524741  37.037066



2025-08-13 16:27:03,561 - DEBUG - cmd: where.exe tbb.dll
cwd: None
2025-08-13 16:27:03,978 - DEBUG - TBB already found in load path
2025-08-13 16:27:03,978 - DEBUG - TBB already found in load path
2025-08-13 16:27:03,990 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:27:03,993 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:27:03,996 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
2025-08-13 16:27:03,990 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:27:03,993 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:27:03,996 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
2025-08-13 16:27:04,033 - INFO - n_changepoints greater than number of obse

Métricas de los modelos para IT:
                  Modelo       RMSE       MAPE
0                  ARIMA  22.696867  17.102295
2                Prophet  24.366909  18.851146
1  Exponential Smoothing  25.419806  21.023774
3                XGBoost  27.202833  20.467695



2025-08-13 16:27:12,288 - DEBUG - cmd: where.exe tbb.dll
cwd: None
2025-08-13 16:27:12,767 - DEBUG - TBB already found in load path
2025-08-13 16:27:12,780 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:27:12,767 - DEBUG - TBB already found in load path
2025-08-13 16:27:12,780 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:27:12,782 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:27:12,782 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:27:12,785 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
2025-08-13 16:27:12,785 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
2025-08-13 16:27:12,823 - INFO - n_changepoints greater than number of obse

Métricas de los modelos para Inventory:
                  Modelo       RMSE       MAPE
0                XGBoost  16.438580  10.857656
1  Exponential Smoothing  24.169849  18.311116
2                Prophet  24.884310  19.803423



2025-08-13 16:27:18,499 - DEBUG - cmd: where.exe tbb.dll
cwd: None
2025-08-13 16:27:18,797 - DEBUG - TBB already found in load path
2025-08-13 16:27:18,811 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:27:18,797 - DEBUG - TBB already found in load path
2025-08-13 16:27:18,811 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:27:18,814 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:27:18,815 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
2025-08-13 16:27:18,814 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:27:18,815 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
2025-08-13 16:27:18,856 - INFO - n_changepoints greater than number of obse

Métricas de los modelos para Marketing:
                  Modelo       RMSE       MAPE
0  Exponential Smoothing  11.027292  14.238330
1                XGBoost  12.199810  14.181886
2                Prophet  15.205894  20.692580



2025-08-13 16:27:25,326 - DEBUG - cmd: where.exe tbb.dll
cwd: None
2025-08-13 16:27:25,677 - DEBUG - TBB already found in load path
2025-08-13 16:27:25,688 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:27:25,677 - DEBUG - TBB already found in load path
2025-08-13 16:27:25,688 - INFO - Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
2025-08-13 16:27:25,692 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:27:25,692 - INFO - Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
2025-08-13 16:27:25,694 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
2025-08-13 16:27:25,727 - INFO - n_changepoints greater than number of observations. Using 18.
2025-08-13 16:27:25,694 - INFO - Disabling daily seasonality. Run prophet with daily_seasonality=Tru

Métricas de los modelos para Sales:
                  Modelo       RMSE      MAPE
0                Prophet  56.903104  6.930763
1  Exponential Smoothing  57.499348  7.120223
2                XGBoost  77.892926  9.598943



## 5. Generar Predicciones

Generamos predicciones para las próximas 12 quincenas con intervalos de confianza, utilizando el mejor modelo seleccionado para cada departamento.

In [29]:
def plot_forecast_combined(historical_data, historical_pred, future_forecast, dept, best_model_name):
    fig = go.Figure()

    # Asegurarse de que los índices sean fechas continuas y ordenadas
    historical_data = historical_data.sort_index()
    historical_pred = historical_pred.sort_index()

    # Convertir future_forecast a DataFrame si es una Serie, usando el índice como 'ds'
    if isinstance(future_forecast, pd.Series):
        future_data = pd.DataFrame({
            'ds': future_forecast.index,
            'yhat': future_forecast.values
        })
        # Generar intervalos de confianza aproximados si no están presentes
        if 'yhat_lower' not in future_data.columns or 'yhat_upper' not in future_data.columns:
            future_data['yhat_lower'] = future_data['yhat'] * 0.9
            future_data['yhat_upper'] = future_data['yhat'] * 1.1
    elif isinstance(future_forecast, pd.DataFrame):
        future_data = future_forecast.copy()
        if 'ds' not in future_data.columns:
            future_data['ds'] = pd.to_datetime(future_data.index)
        if 'yhat' not in future_data.columns:
            raise ValueError("future_forecast debe contener la columna 'yhat'")
        if 'yhat_lower' not in future_data.columns or 'yhat_upper' not in future_data.columns:
            future_data['yhat_lower'] = future_data['yhat'] * 0.9  # Placeholder
            future_data['yhat_upper'] = future_data['yhat'] * 1.1  # Placeholder
    else:
        raise ValueError("future_forecast debe ser una Serie o DataFrame")

    # Convertir 'ds' a datetime si no lo es
    future_data['ds'] = pd.to_datetime(future_data['ds'])

    # Depuración: Mostrar el rango de fechas en future_data
    logging.info(f"Rango de fechas en future_data para {dept}: {future_data['ds'].min()} a {future_data['ds'].max()}")

    # Agregar la línea de datos históricos reales
    fig.add_trace(go.Scatter(x=historical_data.index, y=historical_data, mode='lines', name='Histórico Real', line=dict(color='blue')))

    # Agregar la línea de predicción histórica (amarillo dash)
    fig.add_trace(go.Scatter(x=historical_pred.index, y=historical_pred['yhat'], mode='lines', name='Predicción Histórica', line=dict(color='orange', dash='dash')))

    # Agregar la línea de pronóstico futuro (rojo punteado)
    future_mask = future_data['ds'] > historical_data.index[-1]
    if not future_mask.any():
        logging.warning(f"No hay datos futuros para {dept}. Verifica future_periods en calculate_metrics.")
    fig.add_trace(go.Scatter(x=future_data.loc[future_mask, 'ds'], y=future_data.loc[future_mask, 'yhat'], mode='lines', name='Pronóstico Futuro', line=dict(color='red', dash='dot')))

    # Agregar el intervalo de confianza para el pronóstico futuro
    fig.add_trace(go.Scatter(
        x=future_data.loc[future_mask, 'ds'].tolist() + future_data.loc[future_mask, 'ds'].tolist()[::-1],
        y=future_data.loc[future_mask, 'yhat_upper'].tolist() + future_data.loc[future_mask, 'yhat_lower'].tolist()[::-1],
        fill='toself',
        fillcolor='rgba(255,0,0,0.1)',
        line=dict(color='rgba(255,255,255,0)'),
        hoverinfo="skip",
        showlegend=False,
        name='Intervalo de Confianza'
    ))

    fig.update_layout(
        title=f"Pronóstico y Ajuste de Horas Extra para {dept} con Modelo {best_model_name}",
        xaxis_title="Fecha",
        yaxis_title="Horas Extra",
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="center",
            x=0.5
        ),
        template="plotly_white"
    )
    fig.show()

import pandas as pd
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Asegurarse de que dept_data tenga el índice correcto
for dept in all_forecasts:
    # Verificar las columnas disponibles en df_all_data
    logging.info(f"Columnas disponibles en df_all_data para {dept}: {df_all_data.columns.tolist()}")
    
    # Ajustar el nombre de la columna de horas extra
    overtime_column = 'total_overtime' if 'total_overtime' in df_all_data.columns else 'y' if 'y' in df_all_data.columns else None
    if overtime_column is None:
        logging.error(f"No se encontró una columna de horas extra (e.g., 'total_overtime' o 'y') en df_all_data para {dept}")
        raise KeyError(f"No se encontró columna de horas extra en df_all_data para {dept}")
    
    dept_data = df_all_data[df_all_data['department'] == dept].set_index('ds')[overtime_column]
    future_forecast = all_forecasts[dept]['future']
    historical_pred = all_forecasts[dept]['historical_pred']
    
    # Verificar y ajustar future_forecast si es necesario
    if isinstance(future_forecast, pd.Series):
        future_forecast = pd.DataFrame({'ds': future_forecast.index, 'yhat': future_forecast.values})
        if 'yhat_lower' not in future_forecast.columns or 'yhat_upper' not in future_forecast.columns:
            future_forecast['yhat_lower'] = future_forecast['yhat'] * 0.9
            future_forecast['yhat_upper'] = future_forecast['yhat'] * 1.1
    
    # Asegurarse de que future_forecast contenga datos futuros
    if isinstance(future_forecast, pd.DataFrame) and 'ds' in future_forecast.columns:
        future_dates = pd.to_datetime(future_forecast['ds'])
        if future_dates.min() < dept_data.index[-1]:
            future_forecast = future_forecast[future_forecast['ds'] > dept_data.index[-1]].reset_index(drop=True)
            logging.info(f"Ajustado future_forecast para {dept} a partir de {dept_data.index[-1]}")
        elif future_dates.min() > dept_data.index[-1] + pd.Timedelta(days=1):
            logging.warning(f"future_forecast para {dept} comienza en {future_dates.min()} cuando debería ser después de {dept_data.index[-1]}")
    
    plot_forecast_combined(dept_data, historical_pred, future_forecast, dept, all_metrics[dept].iloc[0]['Modelo'])

2025-08-13 16:27:56,980 - INFO - Columnas disponibles en df_all_data para Finance: ['department', 'ds', 'y']
2025-08-13 16:27:56,994 - INFO - Rango de fechas en future_data para Finance: 2025-05-11 00:00:00 a 2025-10-12 00:00:00
2025-08-13 16:27:56,994 - INFO - Rango de fechas en future_data para Finance: 2025-05-11 00:00:00 a 2025-10-12 00:00:00


2025-08-13 16:27:57,134 - INFO - Columnas disponibles en df_all_data para HR: ['department', 'ds', 'y']
2025-08-13 16:27:57,148 - INFO - Rango de fechas en future_data para HR: 2025-05-11 00:00:00 a 2025-10-12 00:00:00
2025-08-13 16:27:57,148 - INFO - Rango de fechas en future_data para HR: 2025-05-11 00:00:00 a 2025-10-12 00:00:00


2025-08-13 16:27:57,522 - INFO - Columnas disponibles en df_all_data para IT: ['department', 'ds', 'y']
2025-08-13 16:27:57,538 - INFO - Rango de fechas en future_data para IT: 2025-05-11 00:00:00 a 2025-10-12 00:00:00
2025-08-13 16:27:57,538 - INFO - Rango de fechas en future_data para IT: 2025-05-11 00:00:00 a 2025-10-12 00:00:00


2025-08-13 16:27:58,232 - INFO - Columnas disponibles en df_all_data para Inventory: ['department', 'ds', 'y']
2025-08-13 16:27:58,265 - INFO - Rango de fechas en future_data para Inventory: 2025-05-18 00:00:00 a 2025-10-19 00:00:00
2025-08-13 16:27:58,265 - INFO - Rango de fechas en future_data para Inventory: 2025-05-18 00:00:00 a 2025-10-19 00:00:00


2025-08-13 16:27:59,101 - INFO - Columnas disponibles en df_all_data para Marketing: ['department', 'ds', 'y']
2025-08-13 16:27:59,121 - INFO - Rango de fechas en future_data para Marketing: 2025-05-11 00:00:00 a 2025-10-12 00:00:00
2025-08-13 16:27:59,121 - INFO - Rango de fechas en future_data para Marketing: 2025-05-11 00:00:00 a 2025-10-12 00:00:00


2025-08-13 16:27:59,506 - INFO - Columnas disponibles en df_all_data para Sales: ['department', 'ds', 'y']
2025-08-13 16:27:59,548 - INFO - Rango de fechas en future_data para Sales: 2025-05-11 00:00:00 a 2025-10-12 00:00:00
2025-08-13 16:27:59,548 - INFO - Rango de fechas en future_data para Sales: 2025-05-11 00:00:00 a 2025-10-12 00:00:00


## 6. Guardar Resultados

Guardamos los modelos, predicciones y métricas en SQL Server. La tabla Overtime_Predictions incluye datos históricos y predicciones con una columna data_type para distinguir entre "Histórico" y "Forecast".


In [30]:

def calculate_metrics(dept_data, historical_pred, model_type, train_test_split=True, test_size=0.2):

    try:
        department = dept_data.name if dept_data.name else 'Unknown'
        # Ensure no NaN in input data
        dept_data_clean = dept_data.fillna(dept_data.median())
        historical_pred_clean = historical_pred['yhat'].reindex(dept_data_clean.index).fillna(dept_data_clean.median()).clip(lower=0)

        # Split data into train and test
        if train_test_split:
            train_size = int(len(dept_data_clean) * (1 - test_size))
            actual = dept_data_clean.iloc[train_size:]
            predictions = historical_pred_clean.iloc[train_size:]
        else:
            actual = dept_data_clean
            predictions = historical_pred_clean

        # Ensure indices align
        common_index = actual.index.intersection(predictions.index)
        if len(common_index) == 0:
            logging.error(f"No hay índices comunes para {department}. Índice real: {actual.index[:3]}..., Índice predicciones: {predictions.index[:3]}...")
            raise ValueError(f"No hay índices comunes para {department}")
        actual = actual.loc[common_index]
        predictions = predictions.loc[common_index]

        # Log data statistics
        logging.info(f"{department} - Actual data: len={len(actual)}, mean={actual.mean():.2f}, std={actual.std():.2f}")
        logging.info(f"{department} - Predictions: len={len(predictions)}, mean={predictions.mean():.2f}, std={predictions.std():.2f}")

        # Calculate metrics
        rmse = np.sqrt(mean_squared_error(actual, predictions))
        mae = mean_absolute_error(actual, predictions)
        smape = np.mean(2 * np.abs(predictions - actual) / (np.abs(predictions) + np.abs(actual) + 1e-10)) * 100
        mape = mean_absolute_percentage_error(actual, predictions) * 100
        naive_forecast = actual.shift(1).fillna(actual.mean())
        mase = mae / mean_absolute_error(actual[1:], naive_forecast[1:]) if naive_forecast.std() > 0 else float('inf')

        # Calculate department-specific thresholds based on mean overtime
        mean_overtime = dept_data_clean.mean()
        mae_threshold_good = 0.05 * mean_overtime  # 5% of mean
        mae_threshold_acceptable = 0.10 * mean_overtime  # 10% of mean
        rmse_threshold_good = 0.10 * mean_overtime  # 10% of mean
        rmse_threshold_acceptable = 0.20 * mean_overtime  # 20% of mean

        # Determine model quality
        if (mae < mae_threshold_good and smape < 10 and mape < 10 and mase < 0.8 and rmse < rmse_threshold_good):
            quality = "Bueno"
        elif (mae < mae_threshold_acceptable and smape < 20 and mape < 20 and mase < 1.2 and rmse < rmse_threshold_acceptable):
            quality = "Aceptable"
        else:
            quality = "Pobre"

        # Ensure finite values for SQL
        rmse = rmse if np.isfinite(rmse) else None
        mae = mae if np.isfinite(mae) else None
        smape = smape if np.isfinite(smape) else None
        mape = mape if np.isfinite(mape) else None
        mase = mase if np.isfinite(mase) else None

        return {
            'rmse': rmse,
            'mae': mae,
            'smape': smape,
            'mape': mape,
            'mase': mase,
            'quality': quality
        }
    except Exception as e:
        logging.error(f"Error calculando métricas para {department} ({model_type}): {e}")
        return {
            'rmse': None,
            'mae': None,
            'smape': None,
            'mape': None,
            'mase': None,
            'quality': 'Pobre'
        }

def save_predictions():

    os.makedirs('Modelos Entrenados', exist_ok=True)
    timestamp = datetime.now()
    logging.info(f"Timestamp establecido: {timestamp}")
    predictions_summary = []
    metrics_summary = []
    insert_errors = []

    # Save models and collect predictions and metrics
    for dept in all_forecasts:
        try:
            # Get best model and metrics
            best_model = None  # Will be retrieved if saving is needed
            best_model_name = all_metrics[dept].iloc[0]['Modelo']
            model_path = f'Modelos Entrenados/overtime_forecast_model_{dept}_{best_model_name}.pkl'

            # Retrieve the best model for saving (re-train to ensure consistency)
            dept_data = df_all_data[df_all_data['department'] == dept].set_index('ds')['y']
            decomposition_type, consider_arima = select_decomposition_type(dept_data)
            series_filled_final = dept_data.fillna(dept_data.mean()).interpolate(method='linear')
            if best_model_name == 'ARIMA' and consider_arima:
                best_model = auto_arima(series_filled_final, seasonal=False, stepwise=True, suppress_warnings=True, error_action='ignore')
                joblib.dump(best_model, model_path)
                logging.info(f"Modelo ARIMA guardado para {dept} en: {model_path}")
            elif best_model_name == 'Exponential Smoothing':
                best_model = ExponentialSmoothing(series_filled_final, trend='add', seasonal=decomposition_type, seasonal_periods=12).fit()
                joblib.dump(best_model, model_path)
                logging.info(f"Modelo Exponential Smoothing guardado para {dept} en: {model_path}")
            elif best_model_name == 'Prophet':
                logging.warning(f"Prophet no se guarda en disco para {dept} debido a problemas de serialización.")
            elif best_model_name == 'XGBoost':
                n_lags = 6
                df_lag = create_lag_features_with_time(series_filled_final, n_lags=n_lags)
                X = df_lag.drop('y', axis=1).values
                y = df_lag['y'].values
                best_model = XGBRegressor(n_estimators=100, random_state=42)
                best_model.fit(X, y)
                joblib.dump(best_model, model_path)
                logging.info(f"Modelo XGBoost guardado para {dept} en: {model_path}")

            # Calculate metrics using historical predictions
            historical_pred = all_forecasts[dept]['historical_pred']
            metrics = calculate_metrics(dept_data, historical_pred, best_model_name)

            # Collect metrics for summary
            metrics_summary.append({
                'timestamp': timestamp,
                'department': dept,
                'rmse': metrics['rmse'],
                'mae': metrics['mae'],
                'smape': metrics['smape'],
                'mape': metrics['mape'],
                'mase': metrics['mase'],
                'model_quality': metrics['quality'],
                'model_type': best_model_name
            })

            # Collect future predictions with robust index handling
            future_forecast = all_forecasts[dept]['future']
            n_periods_forecast = len(future_forecast)
            last_date = dept_data.index[-1]

            # Define expected_index before index handling
            expected_index = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=n_periods_forecast, freq='2W')

            # Ensure future_forecast index is datetime
            if not pd.api.types.is_datetime64_any_dtype(future_forecast.index):
                logging.warning(f"Índice de future_forecast para {dept} no es datetime. Reconstruyendo índice.")
                future_forecast = future_forecast.copy()  # Avoid modifying original
                future_forecast.index = expected_index
            else:
                # Verify index alignment
                if not all(future_forecast.index == expected_index):
                    logging.warning(f"Índice de future_forecast para {dept} no está alineado. Reconstruyendo índice.")
                    future_forecast = future_forecast.copy()
                    future_forecast.index = expected_index

            # Log index details
            logging.info(f"Índice de future_forecast para {dept}: {future_forecast.index[:3]}... (len={len(future_forecast)})")
            logging.info(f"Primeras fechas esperadas: {expected_index[:3]}...")

            for _, row in future_forecast.iterrows():
                predicted_value = row['yhat'] if pd.notna(row['yhat']) else None
                confidence_lower = row['yhat_lower'] if pd.notna(row['yhat_lower']) else None
                confidence_upper = row['yhat_upper'] if pd.notna(row['yhat_upper']) else None
                try:
                    prediction_date = row.name.date() if pd.notna(row.name) and hasattr(row.name, 'date') else None
                    if prediction_date is None:
                        raise ValueError(f"Fecha inválida para {dept}, índice: {row.name}")
                    predictions_summary.append({
                        'timestamp': timestamp,
                        'department': dept,
                        'prediction_date': prediction_date,
                        'predicted_value': predicted_value,
                        'confidence_lower': confidence_lower,
                        'confidence_upper': confidence_upper
                    })
                except Exception as e:
                    logging.error(f"Error procesando fecha de predicción para {dept}: {e}")
                    # Fallback: Assign date from expected_index
                    row_index = future_forecast.index.get_loc(row.name)
                    prediction_date = expected_index[row_index].date()
                    logging.info(f"Asignando fecha de respaldo para {dept}: {prediction_date}")
                    predictions_summary.append({
                        'timestamp': timestamp,
                        'department': dept,
                        'prediction_date': prediction_date,
                        'predicted_value': predicted_value,
                        'confidence_lower': confidence_lower,
                        'confidence_upper': confidence_upper
                    })

        except Exception as e:
            logging.error(f"Error procesando datos para {dept}: {e}")
            insert_errors.append(dept)

    # Create DataFrames
    predictions_summary_df = pd.DataFrame(predictions_summary)
    metrics_summary_df = pd.DataFrame(metrics_summary)

    # Add 'id' column (will be auto-generated by SQL Server)
    if not predictions_summary_df.empty:
        predictions_summary_df.insert(0, 'id', range(1, len(predictions_summary_df) + 1))
    if not metrics_summary_df.empty:
        metrics_summary_df.insert(0, 'id', range(1, len(metrics_summary_df) + 1))

    # Save to database
    conn = get_db_connection()
    cursor = conn.cursor()

    try:
        logging.info("Conexión a la base de datos exitosa")
        # Create tables if they don’t exist
        cursor.execute("""
            IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'Overtime_Predictions')
            CREATE TABLE Overtime_Predictions (
                id INT IDENTITY(1,1) PRIMARY KEY,
                timestamp DATETIME,
                department VARCHAR(100),
                prediction_date DATE,
                predicted_value FLOAT,
                confidence_lower FLOAT,
                confidence_upper FLOAT
            )
        """)

        # Create or alter ML_Model_Metrics_Overtime_Predictions table to include mape and model_type
        cursor.execute("""
            IF NOT EXISTS (SELECT * FROM sys.tables WHERE name = 'ML_Model_Metrics_Overtime_Predictions')
            BEGIN
                CREATE TABLE ML_Model_Metrics_Overtime_Predictions (
                    id INT IDENTITY(1,1) PRIMARY KEY,
                    timestamp DATETIME,
                    department VARCHAR(100),
                    rmse FLOAT,
                    mae FLOAT,
                    smape FLOAT,
                    mape FLOAT,
                    mase FLOAT,
                    model_quality VARCHAR(50),
                    model_type VARCHAR(50)
                )
            END
            ELSE
            BEGIN
                IF NOT EXISTS (SELECT * FROM sys.columns 
                               WHERE object_id = OBJECT_ID('ML_Model_Metrics_Overtime_Predictions') 
                               AND name = 'mape')
                BEGIN
                    ALTER TABLE ML_Model_Metrics_Overtime_Predictions
                    ADD mape FLOAT
                END
                IF NOT EXISTS (SELECT * FROM sys.columns 
                               WHERE object_id = OBJECT_ID('ML_Model_Metrics_Overtime_Predictions') 
                               AND name = 'model_type')
                BEGIN
                    ALTER TABLE ML_Model_Metrics_Overtime_Predictions
                    ADD model_type VARCHAR(50)
                END
            END
        """)

        # Insert predictions
        if not predictions_summary_df.empty:
            for _, row in predictions_summary_df.iterrows():
                cursor.execute("""
                    INSERT INTO Overtime_Predictions 
                    (timestamp, department, prediction_date, predicted_value, confidence_lower, confidence_upper)
                    VALUES (%s, %s, %s, %s, %s, %s)
                """, (
                    row['timestamp'],
                    row['department'],
                    row['prediction_date'],
                    row['predicted_value'],
                    row['confidence_lower'],
                    row['confidence_upper']
                ))
        else:
            logging.warning("predictions_summary_df está vacío, no se insertaron predicciones.")

        # Insert metrics
        if not metrics_summary_df.empty:
            for _, row in metrics_summary_df.iterrows():
                cursor.execute("""
                    INSERT INTO ML_Model_Metrics_Overtime_Predictions 
                    (timestamp, department, rmse, mae, smape, mape, mase, model_quality, model_type)
                    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
                """, (
                    row['timestamp'],
                    row['department'],
                    row['rmse'],
                    row['mae'],
                    row['smape'],
                    row['mape'],
                    row['mase'],
                    row['model_quality'],
                    row['model_type']
                ))
        else:
            logging.warning("metrics_summary_df está vacío, no se insertaron métricas.")

        if insert_errors and len(insert_errors) == len(all_forecasts):
            raise ValueError("No se pudieron insertar datos para ningún departamento")

        conn.commit()
        logging.info("Predicciones y métricas guardadas exitosamente")

        # Display summary
        print("\n=== Resumen de Modelos y Predicciones ===")
        print("\nModelos y Métricas por Departamento:")
        display(metrics_summary_df.drop(columns=['id', 'timestamp']))
        print("\nValores Predichos con Intervalos de Confianza:")
        display(predictions_summary_df.drop(columns=['id', 'timestamp']))

        return predictions_summary_df, metrics_summary_df

    except Exception as e:
        conn.rollback()
        logging.error(f"Error al guardar los datos: {e}")
        raise
    finally:
        conn.close()

def verify_tables():

    conn = get_db_connection()
    cursor = conn.cursor()
    try:
        cursor.execute("SELECT COUNT(*) FROM Overtime_Predictions")
        pred_count = cursor.fetchone()[0]
        
        cursor.execute("SELECT COUNT(*) FROM ML_Model_Metrics_Overtime_Predictions")
        metrics_count = cursor.fetchone()[0]
        
        logging.info(f"Registros en Overtime_Predictions: {pred_count}")
        logging.info(f"Registros en ML_Model_Metrics_Overtime_Predictions: {metrics_count}")
        print(f"\nRegistros en Overtime_Predictions: {pred_count}")
        print(f"Registros en ML_Model_Metrics: {metrics_count}")
    except Exception as e:
        logging.error(f"Error verificando tablas: {e}")
        print(f"Error verificando tablas: {e}")
    finally:
        conn.close()

# Execute saving and display summary
try:
    predictions_summary_df, metrics_summary_df = save_predictions()
    print("\nDatos guardados exitosamente en la base de datos.")
except Exception as e:
    print(f"\nError al guardar los datos: {str(e)}")

# Verify tables
verify_tables()

2025-08-13 16:28:59,754 - INFO - Timestamp establecido: 2025-08-13 16:28:59.754076
2025-08-13 16:28:59,779 - INFO - Desviación estándar de los residuales aditivos: 4.87
2025-08-13 16:28:59,779 - INFO - Desviación estándar de los residuales aditivos: 4.87
2025-08-13 16:28:59,781 - INFO - Desviación estándar de los residuales multiplicativos: 0.16
2025-08-13 16:28:59,781 - INFO - Desviación estándar de los residuales multiplicativos: 0.16
2025-08-13 16:28:59,797 - INFO - P-value del test ADF: 0.00
2025-08-13 16:28:59,797 - INFO - P-value del test ADF: 0.00
2025-08-13 16:28:59,802 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:28:59,802 - INFO - Criterio de selección: La desviación estándar de los residuos multiplicative es menor.
2025-08-13 16:29:06,422 - INFO - Modelo ARIMA guardado para Finance en: Modelos Entrenados/overtime_forecast_model_Finance_ARIMA.pkl
2025-08-13 16:29:06,431 - INFO - y - Actual data: len=8, mean=30.2


=== Resumen de Modelos y Predicciones ===

Modelos y Métricas por Departamento:


Unnamed: 0,department,rmse,mae,smape,mape,mase,model_quality,model_type
0,Finance,5.059173,4.283673,14.334099,15.177939,0.582586,Pobre,ARIMA
1,HR,11.188696,8.220928,20.884262,25.930422,0.637634,Pobre,ARIMA
2,IT,20.702074,15.113123,14.220142,14.315467,0.618774,Pobre,ARIMA
3,Inventory,0.000532,0.000408,0.000373,0.000373,3.3e-05,Bueno,XGBoost
4,Marketing,7.283324,4.961546,7.607099,7.410892,0.455367,Aceptable,Exponential Smoothing
5,Sales,32.661378,27.542054,4.088677,4.205065,0.775864,Bueno,Prophet



Valores Predichos con Intervalos de Confianza:


Unnamed: 0,department,prediction_date,predicted_value,confidence_lower,confidence_upper
0,Finance,2025-05-11,27.935827,25.142244,30.729409
1,Finance,2025-05-25,33.241393,29.917253,36.565532
2,Finance,2025-06-08,31.278375,28.150537,34.406212
3,Finance,2025-06-22,29.692338,26.723104,32.661572
4,Finance,2025-07-06,30.360360,27.324324,33.396396
...,...,...,...,...,...
67,Sales,2025-08-17,687.284441,639.229747,732.892197
68,Sales,2025-08-31,666.157967,618.948173,712.675919
69,Sales,2025-09-14,688.147541,639.873214,734.046673
70,Sales,2025-09-28,700.167608,649.058690,745.334051


2025-08-13 16:29:19,524 - INFO - Conexión a la base de datos exitosa
2025-08-13 16:29:19,579 - INFO - Registros en Overtime_Predictions: 144
2025-08-13 16:29:19,579 - INFO - Registros en Overtime_Predictions: 144



Datos guardados exitosamente en la base de datos.


2025-08-13 16:29:19,582 - INFO - Registros en ML_Model_Metrics_Overtime_Predictions: 36



Registros en Overtime_Predictions: 144
Registros en ML_Model_Metrics: 36
