# 1. Training and Evaluating the LSTM Model

This notebook takes the prepared (unscaled) sequences from Notebook 2 and runs the full training and evaluation workflow for the leak-prediction model.

**Steps:**

1.  **Data loading:** The arrays `X_sequences_unscaled.npy` (features) and `y_targets.npy` (target) are loaded.
2.  **Data split (train/validation/test):**
    * The data is split into three sets: **training** (60%), **validation** (20%), and **test** (20%).
    * Both splits are stratified on the target (`stratify=y`, then `stratify=y_temp`) so that the proportion of leak samples is nearly identical in the three sets, which matters for imbalanced data. In the data loaded here, leaks (`y = 1`) make up about 82% of the sequences, so the no-leak class is the less frequent one.
3.  **Feature scaling (`MinMaxScaler`):**
    * A `MinMaxScaler` is initialized to normalize the *features* to the range 0 to 1 (helpful for LSTMs).
    * **Important:** The scaler is **fitted (`fit`) only** on the **training** data (`X_train`) to avoid *data leakage* from the validation or test sets.
    * The three sets (`X_train`, `X_val`, `X_test`) are then **transformed** with the fitted scaler.
    * The data is temporarily reshaped to 2D for scaling and then restored to its original 3D format (`[samples, timesteps, features]`).
4.  **Model architecture definition:**
    * A simple single-input model is defined with the Keras functional API (`Model`):
        * **Input:** Sequences of `N_TIMESTEPS` (e.g. 72) time steps with `N_FEATURES` (e.g. 9) features each.
        * **LSTM layer:** An LSTM layer with 64 units processes the temporal sequence. `return_sequences=False` means only the output of the last time step is returned.
        * **Dropout:** Dropout (30%) is applied for regularization, to reduce overfitting.
        * **Dense layers:** `Dense` layers process the LSTM output, ending in a single-neuron layer with `sigmoid` activation that outputs the leak probability (0 to 1).
5.  **Handling class imbalance:**
    * **Class weights (`class_weight`)** are computed with the `'balanced'` strategy. This automatically assigns a larger weight to samples of the less frequent class during training, so the model pays more attention to them. In this dataset the less frequent class is no-leak (class 0).
6.  **Model compilation:**
    * The model is configured for training with:
        * **Optimizer:** `Adam` (a standard, efficient optimizer).
        * **Loss function:** `binary_crossentropy` (suited to binary classification).
        * **Metrics:** `AUC-PR` (area under the precision-recall curve, treated here as the **primary** metric for imbalanced data), `AUC-ROC`, and `accuracy` are monitored. A random classifier's AUC-PR is roughly the positive-class prevalence (about 0.82 here), so AUC-PR values should be read against that baseline.
7.  **Model training:**
    * The model is trained (`model.fit`) on the scaled training data (`X_train_scaled`, `y_train`).
    * The scaled validation data (`X_val_scaled`, `y_val`) is used to monitor performance on data not used for weight updates.
    * The computed `class_weight` values are applied.
    * **`EarlyStopping`** is used: training stops automatically if `val_auc_pr` (AUC-PR on validation) does not improve for a set number of epochs (`patience=5`), and the weights from the best epoch are restored.
8.  **Model evaluation:**
    * The final performance of the best model (restored by `EarlyStopping`) is evaluated on the **scaled test set** (`X_test_scaled`, `y_test`), which the model has never seen.
    * The key metrics are printed (Loss, AUC-PR, AUC-ROC, Accuracy).
    * Probability and class predictions are generated for the test set.
    * A **`classification_report`** (precision, recall, and f1-score per class) and a **`confusion_matrix`** are shown for a detailed view of performance, especially on leak detection (class 1).
9.  **Saving the model and scaler:**
    * The trained Keras model is saved to a file (`.keras`).
    * The `scaler` object (fitted on the training data) is saved with `joblib`. Both are needed to make predictions on new data later (Notebook 4).

In [11]:
#%pip install tensorflow

In [12]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout, concatenate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import class_weight
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, average_precision_score

## 1.1. Data Loading

In [13]:
# Load Prepared Data
print("--- Loading Prepared Data ---")
x_filename = '../data/X_sequences_unscaled.npy' 
y_filename = '../data/y_targets.npy' 

try:
    X = np.load(x_filename, allow_pickle=True)
    y = np.load(y_filename, allow_pickle=True)

    print("Data loaded successfully:")
    print(f"X shape: {X.shape}")
    print(f"y shape: {y.shape}")

except FileNotFoundError as e:
    print(f"Error loading data: {e}")
    print("Please ensure the data files are in the correct directory.")
    raise  # Re-raise: the following cells cannot run without X and y

--- Loading Prepared Data ---
Data loaded successfully:
X shape: (121114, 72, 9)
y shape: (121114,)
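
Before splitting, it can help to confirm that the loaded arrays have the layout the later cells expect. The following is a minimal sketch, not part of the original notebook: `summarize_arrays` is a hypothetical helper, it assumes a numeric (non-object) dtype for `X`, and the arrays used here are toy stand-ins rather than the real data.

```python
import numpy as np

def summarize_arrays(X, y):
    """Return basic integrity facts for a (samples, timesteps, features) array and its labels."""
    if X.ndim != 3:
        raise ValueError(f"expected a 3D feature array, got shape {X.shape}")
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y have different numbers of samples")
    classes, counts = np.unique(y, return_counts=True)
    return {
        "n_samples": X.shape[0],
        "n_timesteps": X.shape[1],
        "n_features": X.shape[2],
        "has_nan": bool(np.isnan(X).any()),  # assumes a float dtype
        "class_counts": {int(c): int(n) for c, n in zip(classes, counts)},
    }

# Toy stand-in for the real arrays (hypothetical values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 72, 9))
y_demo = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

summary = summarize_arrays(X_demo, y_demo)
print(summary)
```

With the real data, `summarize_arrays(X, y)` would be called right after loading; the class counts also show at a glance which class is the less frequent one.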


## 1.2. Data Split

In [None]:
# Train-Validation-Test Split 
print("\n--- Splitting Data (Train/Validation/Test) ---")

# Stratify ensures that the proportion of leaks (y=1) is similar across splits
# First split: Train (60%) + Temp (40%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
# Second split: Validation (20%) + Test (20%) from Temp
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

# Verify stratification (optional)
# Note: np.mean(y) is the fraction of leak samples (0 to 1), not a percentage
print(f"Leak % in Train: {np.mean(y_train):.4f}")
print(f"Leak % in Val:   {np.mean(y_val):.4f}")
print(f"Leak % in Test:  {np.mean(y_test):.4f}")



--- Splitting Data (Train/Validation/Test) ---
X_train shape: (72668, 72, 9), y_train shape: (72668,)
X_val shape: (24223, 72, 9), y_val shape: (24223,)
X_test shape: (24223, 72, 9), y_test shape: (24223,)
Leak % in Train: 0.8183
Leak % in Val:   0.8184
Leak % in Test:  0.8183
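
The two-stage split can be checked on a small synthetic dataset. This is an illustrative sketch under stated assumptions: `split_60_20_20` is a hypothetical wrapper around the same two `train_test_split` calls, and the toy labels (80% positive) only mimic the class ratio seen above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed=42):
    """Stratified 60/20/20 split done as 60/40 followed by 50/50 on the remainder."""
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.4, random_state=seed, stratify=y
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=seed, stratify=y_temp
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)

# Toy data: 1000 samples, 80% positive; X holds sample ids so overlap can be checked
y_demo = np.array([1] * 800 + [0] * 200)
X_demo = np.arange(1000).reshape(-1, 1)

train, val, test = split_60_20_20(X_demo, y_demo)
for name, (X_part, y_part) in [("train", train), ("val", val), ("test", test)]:
    print(f"{name}: n={len(y_part)}, positive fraction={y_part.mean():.3f}")

# The three parts should be disjoint and cover every sample exactly once
all_ids = np.concatenate([train[0], val[0], test[0]]).ravel()
print("disjoint and complete:", len(np.unique(all_ids)) == len(y_demo))
```

Note that this is a random split. Whether that is appropriate for sequences built from overlapping time windows depends on how Notebook 2 generated them, which this section does not describe.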


## 1.3. Feature Scaling

In [15]:
# Feature Scaling (Fit on Train, Transform All) 
print("\n--- Scaling Features (MinMaxScaler) ---")
scaler = MinMaxScaler()
num_features = X_train.shape[2] # Number of features per time step

# Reshape data for scaler: (samples * timesteps, features)
# This scales each feature across all time steps consistently
X_train_reshaped = X_train.reshape(-1, num_features)
X_val_reshaped = X_val.reshape(-1, num_features)
X_test_reshaped = X_test.reshape(-1, num_features)

# Fit scaler ONLY on the training data
print("Fitting scaler on training data...")
scaler.fit(X_train_reshaped)

# Transform all datasets
print("Transforming train, validation, and test data...")
X_train_scaled_reshaped = scaler.transform(X_train_reshaped)
X_val_scaled_reshaped = scaler.transform(X_val_reshaped)
X_test_scaled_reshaped = scaler.transform(X_test_reshaped)

# Reshape data back to 3D format: (samples, timesteps, features)
X_train_scaled = X_train_scaled_reshaped.reshape(X_train.shape)
X_val_scaled = X_val_scaled_reshaped.reshape(X_val.shape)
X_test_scaled = X_test_scaled_reshaped.reshape(X_test.shape)
print("Scaling complete. Data reshaped back to 3D.")


--- Scaling Features (MinMaxScaler) ---
Fitting scaler on training data...
Transforming train, validation, and test data...
Scaling complete. Data reshaped back to 3D.
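
The reshape, transform, reshape pattern can be wrapped in a small function. The sketch below uses a hypothetical helper name (`scale_3d`) and toy data; it also illustrates that a scaler fitted only on training data can map unseen values outside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def scale_3d(scaler, X):
    """Apply a fitted 2D scaler to a (samples, timesteps, features) array."""
    n_features = X.shape[-1]
    flat = X.reshape(-1, n_features)          # (samples * timesteps, features)
    return scaler.transform(flat).reshape(X.shape)

rng = np.random.default_rng(0)
X_train_demo = rng.uniform(0, 10, size=(50, 4, 3))   # toy training sequences
X_test_demo = rng.uniform(-5, 15, size=(20, 4, 3))   # toy test data with a wider range

# Fit on training data only, then transform both sets
scaler_demo = MinMaxScaler().fit(X_train_demo.reshape(-1, 3))
X_train_s = scale_3d(scaler_demo, X_train_demo)
X_test_s = scale_3d(scaler_demo, X_test_demo)

print("train range:", X_train_s.min(), X_train_s.max())
print("test range: ", X_test_s.min(), X_test_s.max())
```

If out-of-range inputs are a concern at inference time, `MinMaxScaler(clip=True)` is one option; whether clipping is desirable for this model is a design decision the notebook does not address.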


## 1.4. Model Architecture Definition

In [None]:
# Define Model Architecture (Hybrid LSTM + MLP) 
# Adapt this architecture based on your final feature set
print("--- Defining Model Architecture (Hybrid LSTM + MLP) ---")

N_TIMESTEPS = X_train_scaled.shape[1] 
N_FEATURES = X_train_scaled.shape[2] 

# Define Input Layers
# Input A: Sequence for LSTM (All features in this case)
input_sequence = Input(shape=(N_TIMESTEPS, N_FEATURES), name='input_lstm_sequence')

# --- LSTM Path ---
lstm_out = LSTM(64, return_sequences=False, name='lstm_layer')(input_sequence) 
lstm_out = Dropout(0.3, name='lstm_dropout')(lstm_out)

# In this simplified hybrid, the LSTM output directly feeds the final layers
combined_out = Dense(32, activation='relu', name='dense_layer_1')(lstm_out)
output = Dense(1, activation='sigmoid', name='output_fuga')(combined_out) 

# Create Model
model = Model(inputs=input_sequence, outputs=output)

print("Model architecture defined:")
model.summary()


--- Defining Model Architecture (Hybrid LSTM + MLP) ---
Model architecture defined:
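
The trainable parameter counts for this architecture can be estimated by hand. The sketch below assumes the default Keras `LSTM` and `Dense` configurations (bias enabled, one bias vector per gate) and the 9 input features reported in section 1.1; the helper names are illustrative.

```python
def lstm_params(units, input_dim):
    """4 gates, each with an input kernel, a recurrent kernel, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    """Weight matrix plus bias vector."""
    return units * input_dim + units

N_FEATURES = 9  # from X.shape[2]; the number of timesteps does not affect the count

param_counts = {
    "lstm_layer": lstm_params(64, N_FEATURES),
    "lstm_dropout": 0,                      # Dropout has no trainable parameters
    "dense_layer_1": dense_params(32, 64),
    "output_fuga": dense_params(1, 32),
}
total_params = sum(param_counts.values())

for layer, n in param_counts.items():
    print(f"{layer}: {n}")
print("total:", total_params)
```

Under these assumptions the model has 21,057 trainable parameters, most of them in the LSTM layer.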


## 1.5. Handling Class Imbalance

In [None]:
# Handle Class Imbalance 
print("--- Calculating Class Weights ---")

# Calculate weights so that the less frequent class contributes more to the loss
class_weights = class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(y_train),
    y=y_train
)

class_weights_dict = dict(enumerate(class_weights))

print(f"Class weights computed: {class_weights_dict}")
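# 'balanced' uses n_samples / (n_classes * count(class)), so the rarer class gets the larger weight.
# In this dataset the rarer class is 0 (no leak, ~18% of samples), so class 0 receives the larger weight.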


--- Calculating Class Weights ---
Class weights computed: {0: np.float64(2.752367244905689), 1: np.float64(0.610994332991407)}
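
The `'balanced'` heuristic is `n_samples / (n_classes * count(class))`. The sketch below reimplements it with NumPy for illustration (`balanced_class_weights` is a hypothetical helper, not the scikit-learn function). The class counts 13,201 and 59,467 are not printed anywhere in the notebook; they are an assumption inferred by inverting the formula on the weights shown above.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights inversely proportional to class frequency ('balanced' heuristic)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy example: 8 positives, 2 negatives
weights_toy = balanced_class_weights(np.array([1] * 8 + [0] * 2))
print(weights_toy)  # {0: 2.5, 1: 0.625}

# Inferred (assumed) training counts: 13,201 no-leak and 59,467 leak samples
y_inferred = np.repeat([0, 1], [13201, 59467])
weights_inferred = balanced_class_weights(y_inferred)
print(weights_inferred)
```

The inferred counts sum to 72,668, the training-set size reported in section 1.2, and give a leak fraction of about 0.818, which matches the stratification check.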


## 1.6. Model Compilation

In [None]:
# Compile Model
print("--- Compiling Model ---")
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=[tf.keras.metrics.AUC(curve='PR', name='auc_pr'), 
                       tf.keras.metrics.AUC(name='auc_roc'),
                       'accuracy'])
print("Model compiled.")


--- Compiling Model ---
Model compiled.
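
To make the loss concrete, the sketch below computes binary cross-entropy per sample in NumPy and shows how class weights rescale each sample's contribution. It is a simplified illustration: `weighted_bce` is a hypothetical helper, the weights are rounded versions of those printed in section 1.5, and the exact reduction Keras applies internally may differ from the plain mean used here.

```python
import numpy as np

def weighted_bce(y_true, y_prob, class_weights, eps=1e-7):
    """Return per-sample binary cross-entropy and its class-weighted mean."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)  # avoid log(0)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    w = np.where(y_true == 1, class_weights[1], class_weights[0])
    return per_sample, float(np.mean(w * per_sample))

# Three toy samples, all predicted as "leak" with probability 0.9
y_true_demo = [1, 1, 0]
y_prob_demo = [0.9, 0.9, 0.9]

per_sample, weighted_loss = weighted_bce(y_true_demo, y_prob_demo, {0: 2.75, 1: 0.61})
unweighted_loss = float(np.mean(per_sample))

print("per-sample loss:", per_sample)
print("unweighted mean:", unweighted_loss)
print("weighted mean:  ", weighted_loss)
```

The misclassified no-leak sample already has the largest per-sample loss, and the class weight amplifies it further, which is how weighting pushes the model away from always predicting the more frequent class.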


## 1.7. Model Training

In [19]:
# Train Model with Early Stopping
print("--- Starting Model Training ---")
BATCH_SIZE = 64
EPOCHS = 20 # Start with fewer epochs, increase if needed

# Add EarlyStopping to limit overfitting and restore the best weights seen on validation
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_auc_pr', 
    patience=5,          
    mode='max',            
    restore_best_weights=True 
)

history = model.fit(
    X_train_scaled,
    y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(X_val_scaled, y_val),
    class_weight=class_weights_dict, 
    callbacks=[early_stopping]       
)

# Training complete
print("Model training finished.")


--- Starting Model Training ---
Epoch 1/20
1136/1136 ━━━━━━━━━━━━━━━━━━━━ 31s 25ms/step - accuracy: 0.7418 - auc_pr: 0.9203 - auc_roc: 0.7489 - loss: 0.5855 - val_accuracy: 0.8508 - val_auc_pr: 0.9499 - val_auc_roc: 0.8278 - val_loss: 0.5173
Epoch 2/20
1136/1136 ━━━━━━━━━━━━━━━━━━━━ 33s 29ms/step - accuracy: 0.8529 - auc_pr: 0.9430 - auc_roc: 0.8175 - loss: 0.5024 - val_accuracy: 0.8537 - val_auc_pr: 0.9505 - val_auc_roc: 0.8294 - val_loss: 0.4906
Epoch 3/20
1136/1136 ━━━━━━━━━━━━━━━━━━━━ 30s 26ms/step - accuracy: 0.8516 - auc_pr: 0.9451 - auc_roc: 0.8206 - loss: 0.4986 - val_accuracy: 0.8567 - val_auc_pr: 0.9527 - val_auc_roc: 0.8386 - val_loss: 0.5210
Epoch 4/20
1136/1136 ━━━━━━━━━━━━━━━━━━━━ 31s 27ms/step - accuracy: 0.8546 - auc_pr: 0.9471 - auc_roc: 0.8250 - loss: 0.4940 - val_accuracy: 0.8596 - val_auc_pr: 0.9544 - val_auc_roc: 0.8441 - val_loss

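The stopping rule configured above (`mode='max'`, `patience=5`) can be traced in plain Python. This is a simplified sketch of the logic, not the Keras implementation: `early_stopping_trace` is a hypothetical helper, it assumes the default `min_delta=0`, and the second list of scores is made up for illustration.

```python
def early_stopping_trace(val_scores, patience=5):
    """Simulate 'max'-mode early stopping over a list of per-epoch validation scores."""
    best_score = float("-inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return {"stopped_epoch": epoch, "best_epoch": best_epoch, "best_score": best_score}
    return {"stopped_epoch": None, "best_epoch": best_epoch, "best_score": best_score}

# val_auc_pr values from the four epochs visible in the log above
logged = [0.9499, 0.9505, 0.9527, 0.9544]
trace_logged = early_stopping_trace(logged, patience=5)
print(trace_logged)

# Made-up scores that plateau, with a shorter patience to trigger a stop
toy = [0.50, 0.60, 0.65, 0.64, 0.63, 0.65, 0.62, 0.61]
trace_toy = early_stopping_trace(toy, patience=3)
print(trace_toy)
```

In the toy run, epoch 6 ties the best score but does not exceed it, so it counts as a non-improving epoch and training stops there; with `restore_best_weights=True` the weights from epoch 3 would be the ones kept.
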
## 1.8. Model Evaluation

In [20]:
# Evaluate the Model on Test Data 
print("--- Evaluating Model on Test Data ---")
results = model.evaluate(X_test_scaled, y_test, batch_size=BATCH_SIZE)
print(f"Test Loss: {results[0]:.4f}")
print(f"Test AUC-PR: {results[1]:.4f}")
print(f"Test AUC-ROC: {results[2]:.4f}")
print(f"Test Accuracy: {results[3]:.4f}")

# Get predictions for detailed metrics
y_pred_proba = model.predict(X_test_scaled).ravel() 
y_pred_class = (y_pred_proba > 0.5).astype(int)     


print("\nClassification Report (Test Set):")
try:
    print(classification_report(
        y_test,
        y_pred_class,
        target_names=['No Fuga (0)', 'Fuga (1)'],
        labels=[0, 1], 
    ))

except ValueError as e:
     print(f"Error generating classification report even with labels: {e}")
     print("Checking unique values:")
     print("Unique values in y_test:", np.unique(y_test))
     print("Unique values in y_pred_class:", np.unique(y_pred_class))


print("\nConfusion Matrix (Test Set):")
# Rows: Actual, Columns: Predicted
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred_class))

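# Calculate Average Precision (a step-wise estimate of the area under the PR curve) as a cross-check.
# It can differ slightly from the Keras `auc_pr` metric, which approximates the curve with a fixed set of thresholds.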
avg_precision = average_precision_score(y_test, y_pred_proba)
print(f"\nAverage Precision Score (Test Set): {avg_precision:.4f}")

--- Evaluating Model on Test Data ---
379/379 ━━━━━━━━━━━━━━━━━━━━ 7s 18ms/step - accuracy: 0.8686 - auc_pr: 0.9620 - auc_roc: 0.8645 - loss: 0.4038
Test Loss: 0.4038
Test AUC-PR: 0.9620
Test AUC-ROC: 0.8645
Test Accuracy: 0.8686
757/757 ━━━━━━━━━━━━━━━━━━━━ 8s 10ms/step

Classification Report (Test Set):
              precision    recall  f1-score   support

 No Fuga (0)       0.63      0.67      0.65      4401
    Fuga (1)       0.93      0.91      0.92     19822

    accuracy                           0.87     24223
   macro avg       0.78      0.79      0.78     24223
weighted avg       0.87      0.87      0.87     24223


Confusion Matrix (Test Set):
[[ 2962  1439]
 [ 1744 18078]]

Average Precision Score (Test Set): 0.9621
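
The figures in the classification report can be re-derived from the confusion matrix, which is a useful consistency check. The sketch below uses only the four counts printed above; the variable names are illustrative.

```python
# Counts from the confusion matrix above: rows = actual, columns = predicted
tn, fp = 2962, 1439
fn, tp = 1744, 18078
total = tn + fp + fn + tp

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 (leak)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
# Class 0 (no leak): roles of the cells are mirrored
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tp + tn) / total
prevalence = (tp + fn) / total  # fraction of actual leaks in the test set

print(f"leak:    precision={precision_1:.2f} recall={recall_1:.2f} f1={f1(precision_1, recall_1):.2f}")
print(f"no leak: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1(precision_0, recall_0):.2f}")
print(f"accuracy={accuracy:.4f} prevalence={prevalence:.4f}")
```

Two reference points help when reading these numbers: a classifier that always predicts "leak" would already reach an accuracy equal to the prevalence (about 0.818), and a random classifier's average precision is roughly that same value. The no-leak class, with precision 0.63 and recall 0.67, is where the model is weakest.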


## 1.9. Saving the Model and Scaler

In [22]:
# Save the Trained Model and Scaler 
print("\n--- Saving Model and Scaler ---")
model_filename = '../data/gesai_lstm_model.keras' 
scaler_filename = '../data/gesai_scaler.joblib'

try:
    model.save(model_filename)
    print(f"Model saved successfully to: {model_filename}")

    # Save the scaler
    import joblib
    joblib.dump(scaler, scaler_filename)
    print(f"Scaler saved successfully to: {scaler_filename}")

except Exception as e:
    print(f"ERROR saving model or scaler: {e}")


--- Saving Model and Scaler ---
Model saved successfully to: ../data/gesai_lstm_model.keras
Scaler saved successfully to: ../data/gesai_scaler.joblib
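
For later inference, both artifacts have to be reloaded and the scaler applied before the model. The following is a sketch of how that could look, not the content of Notebook 4: `load_artifacts` and `predict_leak_proba` are hypothetical helpers, and the stub classes exist only so the example runs without the saved files.

```python
import numpy as np

def load_artifacts(model_path, scaler_path):
    """Load the saved Keras model and the fitted scaler (heavy imports kept local)."""
    import joblib
    import tensorflow as tf
    return tf.keras.models.load_model(model_path), joblib.load(scaler_path)

def predict_leak_proba(model, scaler, X_new):
    """Scale new (samples, timesteps, features) sequences and return leak probabilities."""
    X_new = np.asarray(X_new, dtype=float)
    if X_new.ndim != 3:
        raise ValueError(f"expected a 3D array, got shape {X_new.shape}")
    n_features = X_new.shape[2]
    X_scaled = scaler.transform(X_new.reshape(-1, n_features)).reshape(X_new.shape)
    return np.asarray(model.predict(X_scaled)).ravel()

# Stand-ins so the sketch runs without the saved files (illustration only)
class StubScaler:
    def transform(self, X):
        return X / 10.0

class StubModel:
    def predict(self, X):
        return X.mean(axis=(1, 2)).reshape(-1, 1)

X_new_demo = np.full((3, 72, 9), 5.0)
proba_demo = predict_leak_proba(StubModel(), StubScaler(), X_new_demo)
print(proba_demo)

# With the real artifacts (paths as saved above):
# model, scaler = load_artifacts('../data/gesai_lstm_model.keras', '../data/gesai_scaler.joblib')
# proba = predict_leak_proba(model, scaler, X_new)
```

New sequences must use the same feature order and number of timesteps as the training data, since the scaler and the model both depend on that layout.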
