# 🤖 Machine Learning and Predictive Models

## Objective
This notebook develops **advanced machine learning models** to predict and optimize call center operations, including abandonment prediction, call classification, and resource optimization.

## Contents
1. **Data Preparation for ML**
2. **Abandonment Prediction (Classification)**
3. **Service Time Prediction (Regression)**
4. **Call Type Classification**
5. **Resource Allocation Optimization**
6. **Anomaly Detection**
7. **Demand Forecasting Models**
8. **Model Evaluation and Comparison**
9. **Implementation and Deployment**

---

In [None]:
# Import libraries for machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from datetime import datetime, timedelta
import os
import json
import pickle
from scipy import stats
import sys

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve,
    mean_squared_error, mean_absolute_error, r2_score
)
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Plotly configuration
import plotly.io as pio
pio.renderers.default = "notebook"

sys.path.append('../02_src')
from feature_engineering import FeatureEngineer

print("🤖 Machine learning libraries imported successfully")
print(f"Pandas version: {pd.__version__}")
print("Scikit-learn available")

# Set the random seed for reproducibility
np.random.seed(42)

## 1. Data Preparation for Machine Learning

In [None]:
# Load the cleaned data
print("Loading data for machine learning...")
df = pd.read_parquet('../00_data/processed/call_center_clean.parquet')

print(f"Data loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")

# Build advanced temporal features
df['date'] = pd.to_datetime(df['date'])
df['vru_entry_dt'] = pd.to_datetime(df['vru_entry'])
df['hour'] = df['vru_entry_dt'].dt.hour
df['minute'] = df['vru_entry_dt'].dt.minute
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter
df['week_of_year'] = df['date'].dt.isocalendar().week
df['is_weekend'] = (df['date'].dt.weekday >= 5).astype(int)
df['is_monday'] = (df['date'].dt.weekday == 0).astype(int)
df['is_friday'] = (df['date'].dt.weekday == 4).astype(int)

# Workload features
hourly_load = df.groupby(['date', 'hour'])['call_id'].count().reset_index()
hourly_load.columns = ['date', 'hour', 'hourly_calls']
df = df.merge(hourly_load, on=['date', 'hour'], how='left')

# Customer features (known customers only; per the data dictionary,
# customer_id is NaN or 0 for anonymous calls)
# Note: these aggregates include each call's own outcome, so customer_hang_rate
# leaks the target when it is later used to predict abandonment.
known_customer = df['customer_id'].notna() & (df['customer_id'] != 0)
customer_history = df[known_customer].groupby('customer_id').agg({
    'call_id': 'count',
    'outcome': lambda x: (x == 'HANG').sum(),
    'ser_time': 'mean'
}).round(2)

customer_history.columns = ['customer_call_count', 'customer_hang_count', 'customer_avg_service_time']
customer_history['customer_hang_rate'] = (customer_history['customer_hang_count'] / customer_history['customer_call_count'] * 100).round(2)

df = df.merge(customer_history, left_on='customer_id', right_index=True, how='left')

# Fill NaN for new/unknown customers
df['customer_call_count'] = df['customer_call_count'].fillna(1)
df['customer_hang_count'] = df['customer_hang_count'].fillna(0)
df['customer_avg_service_time'] = df['customer_avg_service_time'].fillna(df['ser_time'].median())
df['customer_hang_rate'] = df['customer_hang_rate'].fillna(0)

# Build efficiency features
df['efficiency_ratio'] = df['ser_time'] / (df['ser_time'] + df['q_time'] + df['vru_time'] + 1)
df['total_time'] = df['ser_time'] + df['q_time'] + df['vru_time']
df['wait_time_ratio'] = (df['q_time'] + df['vru_time']) / (df['total_time'] + 1)

# Encode categorical variables
le_type = LabelEncoder()
le_vru_line = LabelEncoder()
le_server = LabelEncoder()

df['type_encoded'] = le_type.fit_transform(df['type'])
df['vru_line_encoded'] = le_vru_line.fit_transform(df['vru.line'])
df['server_encoded'] = le_server.fit_transform(df['server'])

print("ML features prepared")
print(f"Total columns: {df.shape[1]}")

# Show outcome statistics
outcome_counts = df['outcome'].value_counts()
print("\nOutcome distribution:")
for outcome, count in outcome_counts.items():
    pct = count / len(df) * 100
    print(f"  {outcome}: {count:,} ({pct:.2f}%)")
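
The customer aggregates above are computed over every row, so each call's own outcome contributes to its `customer_hang_rate`. One possible alternative, sketched here on a small hypothetical table, is to count only calls that came earlier for the same customer. The sketch assumes the rows are already sorted chronologically; the column names `prior_calls`, `prior_hangs`, and `prior_hang_rate` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Toy data with hypothetical values, assumed to be sorted chronologically.
calls = pd.DataFrame({
    "customer_id": [1, 1, 2, 1, 2],
    "outcome": ["HANG", "AGENT", "AGENT", "HANG", "HANG"],
})
calls["is_hang"] = (calls["outcome"] == "HANG").astype(int)

grouped = calls.groupby("customer_id")

# Number of earlier calls by the same customer (0 for a first call).
calls["prior_calls"] = grouped.cumcount()

# Number of earlier abandoned calls: running total minus the current row.
calls["prior_hangs"] = grouped["is_hang"].cumsum() - calls["is_hang"]

# Prior hang rate in percent; 0 when the customer has no history yet.
calls["prior_hang_rate"] = (
    100 * calls["prior_hangs"] / calls["prior_calls"].where(calls["prior_calls"] > 0)
).fillna(0.0)

print(calls)
```

Because each row only sees rows above it within its customer group, the feature for a call never depends on that call's own outcome.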

## 2. Abandonment Prediction Model (Classification)

In [None]:
# Prepare data for abandonment prediction
print("Building the abandonment prediction model...")

# Target variable: 1 if the outcome is HANG, 0 otherwise
df['is_hang'] = (df['outcome'] == 'HANG').astype(int)

# Limit the dataset size for quick runs. Sampling happens before the
# train/test split so that it actually reduces the training set.
MAX_ROWS = 10000
if len(df) > MAX_ROWS:
    print(f"⚠️ Using a sample of only {MAX_ROWS} rows to speed up model training.")
    df = df.sample(n=MAX_ROWS, random_state=42).reset_index(drop=True)

# Select model features
# Caution: efficiency_ratio and wait_time_ratio are derived from ser_time, and
# server is NO_SERVER for unanswered calls, so these columns are only known
# once the call has ended.
feature_columns = [
    'priority', 'type_encoded', 'vru_line_encoded', 'vru_time', 'q_time',
    'hour', 'minute', 'day_of_week', 'month', 'is_weekend', 'is_monday', 'is_friday',
    'hourly_calls', 'customer_call_count', 'customer_hang_rate', 'customer_avg_service_time',
    'efficiency_ratio', 'wait_time_ratio', 'server_encoded'
]

# Build the dataset
X = df[feature_columns].copy()
y = df['is_hang']

# Handle missing values
X = X.fillna(X.median())

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Dataset prepared: {X_train.shape[0]:,} training, {X_test.shape[0]:,} test")
print(f"Class balance - Abandoned: {y_train.mean():.3f}, Not abandoned: {1-y_train.mean():.3f}")

# Train several models (simplified for quick execution)
models = {
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42, max_depth=7),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=500),
}

model_results = {}

print("\nTraining models...")
for name, model in models.items():
    print(f"  Training {name}...")
    
    # Train the model (scale-sensitive models use the scaled features)
    if name in ['SVM', 'Neural Network', 'Logistic Regression']:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Compute metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    model_results[name] = {
        'model': model,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc,
        'predictions': y_pred,
        'probabilities': y_pred_proba
    }

print("\nAbandonment prediction model results:")
results_df = pd.DataFrame({
    'Modelo': list(model_results.keys()),
    'Accuracy': [r['accuracy'] for r in model_results.values()],
    'Precision': [r['precision'] for r in model_results.values()],
    'Recall': [r['recall'] for r in model_results.values()],
    'F1-Score': [r['f1'] for r in model_results.values()],
    'AUC': [r['auc'] for r in model_results.values()]
}).round(4)

print(results_df.to_string(index=False))

# Identify the best model
best_model_name = results_df.loc[results_df['AUC'].idxmax(), 'Modelo']
best_model = model_results[best_model_name]['model']

print(f"\nBest model: {best_model_name} (AUC: {results_df['AUC'].max():.4f})")
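
The comparison above rests on a single train/test split, and `cross_val_score` is imported at the top of the notebook but never used. As a rough sketch of a more stable comparison, the scaler and classifier can be wrapped in one pipeline and scored with stratified cross-validation, so the scaler is refit inside each fold. The data below is synthetic (`make_classification`), not the call center dataset, and the fold count of 5 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the abandonment dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Scaling lives inside the pipeline, so each fold fits its own scaler
# on the training part only.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"AUC per fold: {auc_scores.round(3)}")
print(f"Mean AUC: {auc_scores.mean():.3f} (std {auc_scores.std():.3f})")
```

The spread across folds gives a sense of how much an AUC difference between two models can be trusted.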

In [None]:
# Visualize the abandonment model results
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        'Model Comparison (AUC)', 'ROC Curve - Best Model',
        'Confusion Matrix', 'Feature Importance'
    ),
    specs=[[{"type": "bar"}, {"type": "scatter"}],
           [{"type": "heatmap"}, {"type": "bar"}]]
)

# 1. Model comparison
fig.add_trace(
    go.Bar(
        x=results_df['Modelo'],
        y=results_df['AUC'],
        name='AUC Score',
        marker_color='lightblue'
    ),
    row=1, col=1
)

# 2. ROC curve of the best model
best_proba = model_results[best_model_name]['probabilities']
fpr, tpr, _ = roc_curve(y_test, best_proba)

fig.add_trace(
    go.Scatter(
        x=fpr,
        y=tpr,
        mode='lines',
        name=f'ROC {best_model_name}',
        line=dict(color='blue', width=3)
    ),
    row=1, col=2
)

# Diagonal reference line
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode='lines',
        name='Random',
        line=dict(color='red', dash='dash')
    ),
    row=1, col=2
)

# 3. Confusion matrix
best_pred = model_results[best_model_name]['predictions']
cm = confusion_matrix(y_test, best_pred)

fig.add_trace(
    go.Heatmap(
        z=cm,
        x=['Not abandoned', 'Abandoned'],
        y=['Not abandoned', 'Abandoned'],
        colorscale='Blues',
        showscale=True,
        text=cm,
        texttemplate="%{text}",
        textfont={"size": 16}
    ),
    row=2, col=1
)

# 4. Feature importance (Random Forest only)
if best_model_name == 'Random Forest':
    feature_importance = pd.DataFrame({
        'feature': feature_columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=True).tail(10)
    
    fig.add_trace(
        go.Bar(
            x=feature_importance['importance'],
            y=feature_importance['feature'],
            orientation='h',
            name='Importance',
            marker_color='green'
        ),
        row=2, col=2
    )

fig.update_layout(
    height=800,
    title_text="Abandonment Prediction Model - Results",
    title_x=0.5,
    showlegend=False
)

fig.update_xaxes(title_text="Model", row=1, col=1)
fig.update_yaxes(title_text="AUC Score", row=1, col=1)
fig.update_xaxes(title_text="False Positive Rate", row=1, col=2)
fig.update_yaxes(title_text="True Positive Rate", row=1, col=2)
fig.update_xaxes(title_text="Predicted", row=2, col=1)
fig.update_yaxes(title_text="Actual", row=2, col=1)

fig.show()

print("Abandonment model visualization created")

## 3. Service Time Prediction Model (Regression)

In [None]:
# Prepare data for service time prediction
print("⏱️ Building the service time prediction model...")

# Keep only answered calls (AGENT)
df_agent = df[df['outcome'] == 'AGENT'].copy()

# Predictor variables (ser_time is excluded because it is the target)
regression_features = [
    'priority', 'type_encoded', 'vru_line_encoded', 'vru_time', 'q_time',
    'hour', 'minute', 'day_of_week', 'month', 'is_weekend',
    'hourly_calls', 'customer_call_count', 'customer_hang_rate',
    'server_encoded'
]

# Build the dataset
X_reg = df_agent[regression_features].copy()
y_reg = df_agent['ser_time']

# Handle missing values
X_reg = X_reg.fillna(X_reg.median())

# Remove extreme outliers (more than 3 standard deviations from the mean)
z_scores = np.abs(stats.zscore(y_reg))
inlier_mask = z_scores < 3
X_reg = X_reg[inlier_mask]
y_reg = y_reg[inlier_mask]

# Split into training and test sets
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Scale features
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

print(f"Regression dataset: {X_train_reg.shape[0]:,} training, {X_test_reg.shape[0]:,} test")
print(f"Mean service time: {y_train_reg.mean():.2f} seconds")

# Train regression models (Random Forest and Linear Regression only)
regression_models = {
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42, max_depth=7),
    'Linear Regression': LinearRegression(),
}

regression_results = {}

print("\nTraining regression models...")
for name, model in regression_models.items():
    print(f"  Training {name}...")
    # Both models are trained on scaled data
    model.fit(X_train_reg_scaled, y_train_reg)
    y_pred_reg = model.predict(X_test_reg_scaled)
    # Compute metrics
    mse = mean_squared_error(y_test_reg, y_pred_reg)
    mae = mean_absolute_error(y_test_reg, y_pred_reg)
    r2 = r2_score(y_test_reg, y_pred_reg)
    rmse = np.sqrt(mse)
    regression_results[name] = {
        'model': model,
        'mse': mse,
        'mae': mae,
        'r2': r2,
        'rmse': rmse,
        'predictions': y_pred_reg
    }

print("\nService time prediction model results:")
reg_results_df = pd.DataFrame({
    'Modelo': list(regression_results.keys()),
    'R¬≤': [r['r2'] for r in regression_results.values()],
    'RMSE': [r['rmse'] for r in regression_results.values()],
    'MAE': [r['mae'] for r in regression_results.values()]
}).round(4)

print(reg_results_df.to_string(index=False))

# Identify the best regression model
best_reg_model_name = reg_results_df.loc[reg_results_df['R¬≤'].idxmax(), 'Modelo']
best_reg_model = regression_results[best_reg_model_name]['model']

print(f"\nBest regression model: {best_reg_model_name} (R²: {reg_results_df['R¬≤'].max():.4f})")
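
To make the three reported regression metrics concrete, here is a small worked example that computes MAE, RMSE, and R² by hand with NumPy. The four service times are made-up numbers chosen for easy arithmetic, not values from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted service times, in seconds.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred

# Mean absolute error: typical size of a miss, in seconds.
mae = np.abs(errors).mean()

# Root mean squared error: like MAE, but large misses weigh more.
rmse = np.sqrt((errors ** 2).mean())

# R²: share of the variance in y_true explained by the predictions.
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}s  RMSE={rmse:.2f}s  R²={r2:.4f}")
```

A model that always predicts the mean of `y_true` would score R² = 0, which is a useful baseline when reading the table above.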

## 4. Anomaly Detection

In [None]:
# Anomaly detection for call center operations
print("Building the anomaly detection system...")

# Prepare data for anomaly detection
anomaly_features = [
    'ser_time', 'vru_time', 'q_time', 'total_time', 'efficiency_ratio',
    'wait_time_ratio', 'hourly_calls', 'priority'
]

X_anomaly = df[anomaly_features].copy()
X_anomaly = X_anomaly.fillna(X_anomaly.median())

# Scale the data
scaler_anomaly = StandardScaler()
X_anomaly_scaled = scaler_anomaly.fit_transform(X_anomaly)

# Apply Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
anomaly_labels = iso_forest.fit_predict(X_anomaly_scaled)

# -1 marks an anomaly, 1 marks a normal call
df['is_anomaly'] = (anomaly_labels == -1).astype(int)
anomaly_count = df['is_anomaly'].sum()
anomaly_percentage = (anomaly_count / len(df)) * 100

print(f"Anomalies detected: {anomaly_count:,} ({anomaly_percentage:.2f}%)")

# Compare the characteristics of normal and anomalous calls
normal_calls = df[df['is_anomaly'] == 0]
anomalous_calls = df[df['is_anomaly'] == 1]

print("\nNormal vs. anomalous calls:")
comparison_metrics = ['ser_time', 'vru_time', 'q_time', 'total_time']
for metric in comparison_metrics:
    normal_avg = normal_calls[metric].mean()
    anomaly_avg = anomalous_calls[metric].mean()
    print(f"  {metric}: Normal={normal_avg:.2f}s, Anomaly={anomaly_avg:.2f}s")

# Anomaly distribution by outcome
anomaly_by_outcome = df.groupby(['outcome', 'is_anomaly']).size().unstack(fill_value=0)
anomaly_rates = (anomaly_by_outcome[1] / (anomaly_by_outcome[0] + anomaly_by_outcome[1]) * 100).round(2)

print("\nAnomaly rate by outcome:")
for outcome, rate in anomaly_rates.items():
    print(f"  {outcome}: {rate}%")
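
One point worth keeping in mind when reading the anomaly count: with `contamination=0.05`, Isolation Forest sets its threshold so that roughly 5% of the training rows are flagged, so the overall percentage is largely fixed by that parameter rather than discovered from the data. The sketch below illustrates this on synthetic data (not the call center features) and shows `decision_function`, which gives a continuous score that can be ranked instead of a hard label.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for scaled call features (assumption):
# 950 typical rows plus 50 clearly extreme rows.
typical = rng.normal(loc=0.0, scale=1.0, size=(950, 3))
extreme = rng.normal(loc=8.0, scale=1.0, size=(50, 3))
X_demo = np.vstack([typical, extreme])

detector = IsolationForest(contamination=0.05, random_state=42)
labels = detector.fit_predict(X_demo)        # -1 = anomaly, 1 = normal
scores = detector.decision_function(X_demo)  # lower = more anomalous

flagged_share = (labels == -1).mean()
print(f"Share flagged: {flagged_share:.3f}")
print(f"Most anomalous row index: {int(np.argmin(scores))}")
```

Ranking calls by `scores` and reviewing the top few is one way to check whether the flagged rows are operationally meaningful.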

## 5. Resource Optimization

In [None]:
# Resource optimization system
print("Building the resource optimization system...")

# Load analysis by hour and day of week
resource_analysis = df.groupby(['day_of_week', 'hour']).agg({
    'call_id': 'count',
    'ser_time': 'mean',
    'q_time': 'mean',
    'outcome': lambda x: (x == 'HANG').sum(),
    'server': 'nunique'
}).round(2)

resource_analysis.columns = ['total_calls', 'avg_service_time', 'avg_queue_time', 'hangs', 'servers_used']
resource_analysis['hang_rate'] = (resource_analysis['hangs'] / resource_analysis['total_calls'] * 100).round(2)
resource_analysis['calls_per_server'] = (resource_analysis['total_calls'] / resource_analysis['servers_used']).round(2)

# Compute an efficiency score
resource_analysis['efficiency_score'] = (
    (100 - resource_analysis['hang_rate']) * 
    (1 / (1 + resource_analysis['avg_queue_time'] / 60))  # Penalize long queue times
).round(2)

# Identify critical periods
high_load_threshold = resource_analysis['total_calls'].quantile(0.9)
high_hang_threshold = resource_analysis['hang_rate'].quantile(0.9)

resource_analysis['is_high_load'] = (resource_analysis['total_calls'] >= high_load_threshold)
resource_analysis['is_high_hang'] = (resource_analysis['hang_rate'] >= high_hang_threshold)
resource_analysis['needs_attention'] = (resource_analysis['is_high_load'] | resource_analysis['is_high_hang'])

critical_periods = resource_analysis[resource_analysis['needs_attention']]

print("Resource analysis completed")
print(f"⚠️ Critical periods identified: {len(critical_periods)}")

if len(critical_periods) > 0:
    print("\nTop 5 periods that need attention:")
    top_critical = critical_periods.nlargest(5, 'hang_rate')[['total_calls', 'hang_rate', 'servers_used', 'efficiency_score']]
    print(top_critical.to_string())

# Staffing recommendations
def calculate_optimal_servers(calls, target_hang_rate=10, avg_calls_per_server=50):
    """Estimate the number of servers needed for a given call volume.

    Heuristic: one server per `avg_calls_per_server` calls, plus a 20% buffer
    when the volume exceeds one server's capacity. `target_hang_rate` is
    currently not used in the calculation.
    """
    base_servers = max(1, calls // avg_calls_per_server)
    # Add a 20% buffer when the volume exceeds one server's capacity
    adjustment_factor = 1.2 if calls > avg_calls_per_server else 1.0
    return max(1, int(base_servers * adjustment_factor))

resource_analysis['recommended_servers'] = resource_analysis['total_calls'].apply(
    lambda x: calculate_optimal_servers(x)
)

resource_analysis['server_adjustment'] = resource_analysis['recommended_servers'] - resource_analysis['servers_used']

print("\nStaffing adjustment recommendations:")
significant_adjustments = resource_analysis[abs(resource_analysis['server_adjustment']) >= 2]
if len(significant_adjustments) > 0:
    print("Significant adjustments recommended:")
    print(significant_adjustments[['servers_used', 'recommended_servers', 'server_adjustment', 'hang_rate']].head(10).to_string())
else:
    print("No significant staffing adjustments are required.")
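
The staffing heuristic above uses a fixed number of calls per server and does not account for service time or a waiting-time target. A common queueing-based alternative, which is not what the notebook implements, is the Erlang C formula for an M/M/c queue. The sketch below is a minimal version under the usual Erlang C assumptions (Poisson arrivals, exponential service times, no abandonment); the function names and the 80%-within-20-seconds default are illustrative choices, and `avg_service_seconds` must be positive.

```python
import math


def erlang_c_wait_probability(offered_load: float, agents: int) -> float:
    """Probability that an arriving call has to wait (Erlang C)."""
    if agents <= offered_load:
        return 1.0  # Not enough agents: the queue grows without bound.
    # Erlang B via its numerically stable recursion, then convert to Erlang C.
    blocking = 1.0
    for k in range(1, agents + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    utilization = offered_load / agents
    return blocking / (1 - utilization * (1 - blocking))


def min_agents(calls_per_hour: float, avg_service_seconds: float,
               target_seconds: float = 20, service_level: float = 0.8) -> int:
    """Smallest agent count that answers `service_level` of calls
    within `target_seconds`, according to Erlang C."""
    offered_load = calls_per_hour * avg_service_seconds / 3600  # in Erlangs
    agents = max(1, math.ceil(offered_load))
    while True:
        p_wait = erlang_c_wait_probability(offered_load, agents)
        answered_in_time = 1 - p_wait * math.exp(
            -(agents - offered_load) * target_seconds / avg_service_seconds
        )
        if agents > offered_load and answered_in_time >= service_level:
            return agents
        agents += 1


# Example: 100 calls per hour with a 6-minute average service time.
print(min_agents(calls_per_hour=100, avg_service_seconds=360))
```

Applied per `(day_of_week, hour)` cell, this would take `total_calls` (converted to an hourly rate) and `avg_service_time` as inputs, which the heuristic currently ignores.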

## 6. Machine Learning Dashboard

In [None]:
# Build an interactive dashboard of the ML results
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=(
        'Classification Model Performance', 'Predicted vs. Actual (Service Time)',
        'Anomaly Detection by Hour', 'Resource Optimization',
        'Anomaly Distribution', 'Efficiency by Day of Week'
    ),
    specs=[[{"type": "bar"}, {"type": "scatter"}],
           [{"type": "heatmap"}, {"type": "bar"}],
           [{"type": "bar"}, {"type": "bar"}]]
)

# 1. Classification model performance
fig.add_trace(
    go.Bar(
        x=results_df['Modelo'],
        y=results_df['F1-Score'],
        name='F1-Score',
        marker_color='lightcoral'
    ),
    row=1, col=1
)

# 2. Predicted vs. actual service time
best_reg_pred = regression_results[best_reg_model_name]['predictions']
# Plot a random sample of at most 1,000 points (the test set may be smaller)
sample_size = min(1000, len(y_test_reg))
sample_indices = np.random.choice(len(y_test_reg), sample_size, replace=False)

fig.add_trace(
    go.Scatter(
        x=y_test_reg.iloc[sample_indices],
        y=best_reg_pred[sample_indices],
        mode='markers',
        name='Predictions',
        marker=dict(color='blue', size=4, opacity=0.6)
    ),
    row=1, col=2
)

# 3. Heatmap of anomaly rate by hour and day
anomaly_heatmap = df.groupby(['day_of_week', 'hour'])['is_anomaly'].mean().unstack(fill_value=0)

fig.add_trace(
    go.Heatmap(
        z=anomaly_heatmap.values,
        x=anomaly_heatmap.columns,
        # Label only the weekdays that actually appear in the data
        y=[['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'][d] for d in anomaly_heatmap.index],
        colorscale='Reds',
        showscale=True
    ),
    row=2, col=1
)

# 4. Resource optimization - recommended adjustments
adjustment_summary = resource_analysis.groupby('server_adjustment').size().reset_index()
adjustment_summary.columns = ['adjustment', 'frequency']

fig.add_trace(
    go.Bar(
        x=adjustment_summary['adjustment'],
        y=adjustment_summary['frequency'],
        name='Frequency',
        marker_color='lightgreen'
    ),
    row=2, col=2
)

# 5. Anomaly distribution by outcome
anomaly_dist = df.groupby('outcome')['is_anomaly'].mean() * 100

fig.add_trace(
    go.Bar(
        x=anomaly_dist.index,
        y=anomaly_dist.values,
        name='% Anomalies',
        marker_color='orange'
    ),
    row=3, col=1
)

# 6. Mean efficiency by day of week
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
daily_efficiency = resource_analysis.groupby('day_of_week')['efficiency_score'].mean()

fig.add_trace(
    go.Bar(
        # Label only the weekdays that actually appear in the data
        x=[day_names[d] for d in daily_efficiency.index],
        y=daily_efficiency.values,
        name='Efficiency',
        marker_color='purple'
    ),
    row=3, col=2
)

fig.update_layout(
    height=1200,
    title_text="🤖 Machine Learning Dashboard - Call Center Analytics",
    title_x=0.5,
    showlegend=False
)

# Update axis titles
fig.update_xaxes(title_text="Model", row=1, col=1)
fig.update_yaxes(title_text="F1-Score", row=1, col=1)
fig.update_xaxes(title_text="Actual Time (s)", row=1, col=2)
fig.update_yaxes(title_text="Predicted Time (s)", row=1, col=2)
fig.update_xaxes(title_text="Hour", row=2, col=1)
fig.update_yaxes(title_text="Day of Week", row=2, col=1)
fig.update_xaxes(title_text="Server Adjustment", row=2, col=2)
fig.update_yaxes(title_text="Frequency", row=2, col=2)
fig.update_xaxes(title_text="Outcome", row=3, col=1)
fig.update_yaxes(title_text="% Anomalies", row=3, col=1)
fig.update_xaxes(title_text="Day of Week", row=3, col=2)
fig.update_yaxes(title_text="Efficiency Score", row=3, col=2)

fig.show()

print("Machine learning dashboard created")

## 7. Export Models and Results

In [None]:
# Save models and results
print("Saving machine learning models and results...")

# Create the output directory
ml_output_dir = '../03_outputs/machine_learning'
models_dir = f'{ml_output_dir}/models'
os.makedirs(models_dir, exist_ok=True)

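# Save the best classification model
# (Logistic Regression expects inputs transformed with scaler_classification.pkl;
# Random Forest was trained on the unscaled features)
with open(f'{models_dir}/best_classification_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

# Save the best regression model (trained on scaled features)
with open(f'{models_dir}/best_regression_model.pkl', 'wb') as f:
    pickle.dump(best_reg_model, f)

# Save the scalers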
with open(f'{models_dir}/scaler_classification.pkl', 'wb') as f:
    pickle.dump(scaler, f)

with open(f'{models_dir}/scaler_regression.pkl', 'wb') as f:
    pickle.dump(scaler_reg, f)

# Save the anomaly detection model
with open(f'{models_dir}/anomaly_detection_model.pkl', 'wb') as f:
    pickle.dump(iso_forest, f)

with open(f'{models_dir}/scaler_anomaly.pkl', 'wb') as f:
    pickle.dump(scaler_anomaly, f)

# Save the encoders
encoders = {
    'type_encoder': le_type,
    'vru_line_encoder': le_vru_line,
    'server_encoder': le_server
}

with open(f'{models_dir}/encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)

# Save the feature lists used by each model
model_features = {
    'classification_features': feature_columns,
    'regression_features': regression_features,
    'anomaly_features': anomaly_features
}

with open(f'{ml_output_dir}/model_features.json', 'w') as f:
    json.dump(model_features, f, indent=2)

# Save the evaluation results
evaluation_results = {
    'classification_results': results_df.to_dict('records'),
    'regression_results': reg_results_df.to_dict('records'),
    'best_classification_model': best_model_name,
    'best_regression_model': best_reg_model_name,
    'anomaly_detection': {
        'anomalies_detected': int(anomaly_count),
        'anomaly_percentage': float(anomaly_percentage)
    }
}

with open(f'{ml_output_dir}/evaluation_results.json', 'w') as f:
    json.dump(evaluation_results, f, indent=2, default=str)

# Save the resource analysis
resource_analysis.to_csv(f'{ml_output_dir}/resource_optimization.csv')
resource_analysis.to_parquet(f'{ml_output_dir}/resource_optimization.parquet')

# Build the model metadata
model_metadata = {
    'creation_date': datetime.now().isoformat(),
    'training_data_shape': {
        'classification': {'train': X_train.shape, 'test': X_test.shape},
        'regression': {'train': X_train_reg.shape, 'test': X_test_reg.shape}
    },
    'model_performance': {
        'best_classification': {
            'name': best_model_name,
            'auc': float(results_df.loc[results_df['Modelo'] == best_model_name, 'AUC'].iloc[0])
        },
        'best_regression': {
            'name': best_reg_model_name,
            'r2': float(reg_results_df.loc[reg_results_df['Modelo'] == best_reg_model_name, 'R¬≤'].iloc[0])
        }
    },
    'feature_counts': {
        'classification': len(feature_columns),
        'regression': len(regression_features),
        'anomaly': len(anomaly_features)
    }
}

with open(f'{ml_output_dir}/model_metadata.json', 'w') as f:
    json.dump(model_metadata, f, indent=2, default=str)

print(f"Models and results saved to {ml_output_dir}")
print("Files generated:")
print("  📂 models/")
print("    • best_classification_model.pkl")
print("    • best_regression_model.pkl")
print("    • anomaly_detection_model.pkl")
print("    • scaler_*.pkl")
print("    • encoders.pkl")
print("  evaluation_results.json")
print("  model_features.json")
print("  model_metadata.json")
print("  resource_optimization.csv/parquet")

print("\nMACHINE LEARNING ANALYSIS COMPLETED SUCCESSFULLY")
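
The export above stores models and scalers in separate pickle files, so whoever loads them has to know which scaler, if any, belongs to which model, and in what order the features go. One possible way to reduce that risk, sketched here with synthetic data and a temporary directory, is to pickle a single bundle that holds a fitted pipeline (scaler plus model) together with its feature list. The file name `demo_bundle.pkl` and the feature names `f0`, `f1`, `f2` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic training data standing in for the real features (assumption).
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# The pipeline applies scaling itself, so callers pass raw feature values.
bundle = {
    "feature_columns": ["f0", "f1", "f2"],
    "model": make_pipeline(StandardScaler(), LogisticRegression(max_iter=500)).fit(X_demo, y_demo),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = Path(tmp_dir) / "demo_bundle.pkl"
    with open(bundle_path, "wb") as f:
        pickle.dump(bundle, f)
    with open(bundle_path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions.
probabilities_match = np.allclose(
    bundle["model"].predict_proba(X_demo),
    restored["model"].predict_proba(X_demo),
)
print(f"Round trip OK: {probabilities_match}")
```

Pickle files should only be loaded from trusted sources, and generally with the same scikit-learn version that produced them.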

## 8. Summary of Machine Learning Results

In [None]:
# Generate the final ML report
print("\n" + "=" * 70)
print("🤖 FINAL REPORT - CALL CENTER MACHINE LEARNING")
print("=" * 70)

print("\nMODELS DEVELOPED:")
print(f"  Abandonment prediction: {len(models)} models evaluated")
print(f"    ├─ Best model: {best_model_name}")
print(f"    ├─ AUC Score: {results_df.loc[results_df['Modelo'] == best_model_name, 'AUC'].iloc[0]:.4f}")
print(f"    └─ F1-Score: {results_df.loc[results_df['Modelo'] == best_model_name, 'F1-Score'].iloc[0]:.4f}")

print(f"\n  ⏱️ Service time prediction: {len(regression_models)} models evaluated")
print(f"    ├─ Best model: {best_reg_model_name}")
print(f"    ├─ R² Score: {reg_results_df.loc[reg_results_df['Modelo'] == best_reg_model_name, 'R¬≤'].iloc[0]:.4f}")
print(f"    └─ RMSE: {reg_results_df.loc[reg_results_df['Modelo'] == best_reg_model_name, 'RMSE'].iloc[0]:.2f}s")

print("\n  Anomaly detection:")
print("    ├─ Algorithm: Isolation Forest")
print(f"    ├─ Anomalies detected: {anomaly_count:,} ({anomaly_percentage:.2f}%)")
print(f"    └─ Highest rate in: {anomaly_rates.idxmax()} ({anomaly_rates.max():.2f}%)")

print("\nRESOURCE OPTIMIZATION:")
print(f"  Critical periods identified: {len(critical_periods)}")
if len(critical_periods) > 0:
    worst_period = critical_periods.nlargest(1, 'hang_rate').index[0]
    worst_day, worst_hour = worst_period
    print(f"  Most critical period: Day {worst_day}, Hour {worst_hour}")
    print(f"    ├─ Abandonment rate: {critical_periods.loc[worst_period, 'hang_rate']:.2f}%")
    print(f"    └─ Calls: {critical_periods.loc[worst_period, 'total_calls']:.0f}")

avg_efficiency = resource_analysis['efficiency_score'].mean()
best_day_eff = resource_analysis.groupby('day_of_week')['efficiency_score'].mean().idxmax()
worst_day_eff = resource_analysis.groupby('day_of_week')['efficiency_score'].mean().idxmin()

print("\nOPERATIONAL EFFICIENCY:")
print(f"  Mean efficiency score: {avg_efficiency:.2f}")
print(f"  Most efficient day: {day_names[best_day_eff]}")
print(f"  ⚠️ Least efficient day: {day_names[worst_day_eff]}")

total_adjustments_needed = abs(resource_analysis['server_adjustment']).sum()
print(f"  Recommended staffing adjustments: {total_adjustments_needed:.0f} server changes in total")

print("\nEXPECTED IMPACT:")
current_hang_rate = (df['outcome'] == 'HANG').mean() * 100
potential_improvement = 15  # Conservative estimate (hard-coded, not derived from the models above)
print(f"  Estimated relative reduction in abandonment: {potential_improvement}%")
print(f"    ├─ Current rate: {current_hang_rate:.2f}%")
print(f"    └─ Target rate: {current_hang_rate * (1 - potential_improvement/100):.2f}%")

print("\n🚀 NEXT STEPS:")
next_steps = [
    "1. Deploy a real-time prediction system",
    "2. Configure automatic anomaly alerts",
    "3. Build an operational dashboard",
    "4. Establish a monthly retraining process",
    "5. Validate the models with production data"
]

for step in next_steps:
    print(f"  {step}")

print("\n" + "=" * 70)
print("MACHINE LEARNING PROJECT COMPLETED")
print("=" * 70)

# 📖 Data Dictionary: call_center_clean.csv

| Column      | Type              | Description                                                                  | Example    |
|-------------|-------------------|------------------------------------------------------------------------------|------------|
| vru.line    | string            | VRU (IVR) entry line or channel                                              | AA0101     |
| call_id     | int64             | Unique call identifier                                                       | 33116      |
| customer_id | float64           | Customer identifier (NaN or 0 for anonymous customers)                       | 9664491.0  |
| priority    | int8              | Call priority (0: low, 1: medium, 2: high)                                   | 2          |
| type        | string            | Service or transaction type (PS, PE, IN, NE, NW, TT)                         | PS         |
| date        | date              | Call date (YYYY-MM-DD)                                                       | 1999-01-01 |
| vru_entry   | string (hh:mm:ss) | Time of entry into the IVR system                                            | 0:00:31    |
| vru_exit    | string (hh:mm:ss) | Time of exit from the IVR                                                    | 0:00:36    |
| vru_time    | int64             | Time spent in the IVR, in seconds                                            | 5          |
| q_start     | string (hh:mm:ss) | Time the call entered the queue                                              | 0:00:36    |
| q_exit      | string (hh:mm:ss) | Time the call left the queue                                                 | 0:03:09    |
| q_time      | int64             | Time spent in the queue, in seconds                                          | 153        |
| outcome     | string            | Call outcome (AGENT: answered, HANG: abandoned, PHANTOM: phantom call)       | HANG       |
| ser_start   | string (hh:mm:ss) | Service start time                                                           | 0:00:00    |
| ser_exit    | string (hh:mm:ss) | Service end time                                                             | 0:00:00    |
| ser_time    | int64             | Service time, in seconds                                                     | 0          |
| server      | string            | Name or identifier of the agent/server (NO_SERVER if the call was not answered) | NO_SERVER  |
| startdate   | int64             | Auxiliary field (usually 0; may hold a start date/time in other formats)     | 0          |

**Notes:**
- Time fields in string format (hh:mm:ss) may be "0:00:00" when they do not apply.
- customer_id may be 0 or NaN for anonymous calls.
- outcome defines the final disposition of the call.
- Durations (vru_time, q_time, ser_time) are in seconds.
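
As a quick illustration of how these fields relate, the sketch below parses the `h:mm:ss` strings with `pd.to_timedelta`, checks that `q_time` equals `q_exit - q_start`, and maps the anonymous `customer_id` value 0 to NaN. The first row uses the example values from the table; the second row is a made-up anonymous call. The check assumes a call does not cross midnight.

```python
import numpy as np
import pandas as pd

# Row 1: example values from the data dictionary. Row 2: hypothetical.
sample = pd.DataFrame({
    "customer_id": [9664491.0, 0.0],
    "q_start": ["0:00:36", "0:00:00"],
    "q_exit": ["0:03:09", "0:00:00"],
    "q_time": [153, 0],
})

# Convert clock strings to durations since midnight, then take the difference.
queue_seconds = (
    pd.to_timedelta(sample["q_exit"]) - pd.to_timedelta(sample["q_start"])
).dt.total_seconds()

# Flag rows where the stored duration disagrees with the timestamps.
sample["q_time_matches"] = queue_seconds == sample["q_time"]

# Treat customer_id 0 the same as a missing identifier.
sample["customer_id"] = sample["customer_id"].replace(0, np.nan)

print(sample)
```

Running the same consistency check for `vru_time` and `ser_time` on the full dataset is a cheap way to spot rows with inconsistent timestamps before modelling.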