
# Congestion Classifier Using Packet Loss and Duplicate ACKs

**Objective:** Train and evaluate **Random Forest** and **XGBoost** models that classify **congestion (1) / no congestion (0)** using **only two features**:
1. `packet_loss_rate`
2. `dup_acks` (or a **proxy** if it is not directly available)

> **Academic context (summary):** Packet loss and duplicate ACKs are classic signals for inferring congestion. Several ML proposals for networking have used supervised tree-based models (boosting / Random Forest) to distinguish congestion losses from losses with other causes and to improve TCP, reporting significant gains in detection and throughput. See, for example, *Boutaba et al., 2018 (ML for networking)* and the project notes (*Proceso.pdf*).



## Requirements
- Python 3.9+
- `pandas`, `numpy`, `scikit-learn`, `xgboost`, `matplotlib`, `joblib`

Installation (optional):

```bash
pip install pandas numpy scikit-learn xgboost matplotlib joblib
```


In [None]:

# =====================
# General configuration
# =====================
CSV_PATH = "../data/mi_dataset.csv"  # <-- Change to your actual path
TIMESTAMP_COL = "ts"                 # Timestamp column if present; otherwise set to None
FLOW_KEYS = ["src_ip", "dst_ip", "src_port", "dst_port", "protocol"]  # optional, used only if these columns exist

# Thresholds (used for fallback labels when there is no target column)
TH_PACKET_LOSS = 0.01   # 1% loss
TH_DUP_ACKS = 3         # 3 duplicate ACKs within the window

RANDOM_STATE = 42
TEST_SIZE = 0.2
N_JOBS = -1


In [None]:

import os
import numpy as np
import pandas as pd
from typing import Tuple, Optional, List

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_auc_score, roc_curve
)
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

import matplotlib.pyplot as plt
from joblib import dump

# Make pandas output easier to read
pd.set_option("display.max_columns", 120)
pd.set_option("display.width", 160)


In [None]:

def load_dataset(path: str) -> pd.DataFrame:
    """Load the CSV. Adjust the separator, encoding, and date parsing to your case.
    This example assumes a ';' separator because the reference dataset uses it.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(f"File not found: {path}")
    df = pd.read_csv(path, sep=';', engine='python')
    # Drop junk columns such as 'Unnamed: 19' if present:
    cols_to_drop = [c for c in df.columns if c.lower().startswith("unnamed")]
    if cols_to_drop:
        df = df.drop(columns=cols_to_drop)
    return df


In [None]:

def derive_features(
    df: pd.DataFrame,
    flow_keys: Optional[List[str]] = None,
    timestamp_col: Optional[str] = None,
) -> pd.DataFrame:
    """Derive the two model features.

    - packet_loss_rate: bytes_retrans / bytes_sent, when both columns exist.
    - dup_acks: taken from a direct column (e.g. dup_acks) when present; otherwise built as a PROXY.

    dup_acks PROXY (an initial suggestion; tune it to your dataset):
    - Use differences ordered by flow and time on 'bytes_acked' and 'data_segs_out' / 'segs_out'.
      In TCP, several ACKs for the same segment can coincide with sliding windows and/or
      reordering, so this builds a conservative proxy based on spikes in 'segs_in'
      that are not matched by progress in 'bytes_acked'.
      **Revise this logic once the final columns are known.**
    """
    df = df.copy()

    # packet_loss_rate
    if {"bytes_retrans", "bytes_sent"} <= set(df.columns):
        denom = (df["bytes_sent"].astype(float).abs() + 1e-9)
        df["packet_loss_rate"] = (df["bytes_retrans"].astype(float).clip(lower=0)) / denom
    else:
        # Fallback: if the columns are missing, fill with NaN and warn
        df["packet_loss_rate"] = np.nan
        print("[WARN] Columns 'bytes_retrans' and/or 'bytes_sent' not found. packet_loss_rate = NaN.")

    # dup_acks (direct column or proxy)
    original_index = df.index  # remembered so the output keeps the input row order
    if "dup_acks" in df.columns:
        df["dup_acks"] = df["dup_acks"].astype(float).clip(lower=0)
    else:
        # dup_acks PROXY:
        # 1) Sort by flow and time when those columns are available.
        #    The index is not reset, so the input row order can be restored below.
        if flow_keys and all(k in df.columns for k in flow_keys):
            sort_cols = flow_keys.copy()
        else:
            sort_cols = []
        if timestamp_col and timestamp_col in df.columns:
            sort_cols.append(timestamp_col)
        if sort_cols:
            df = df.sort_values(sort_cols)

        # 2) Non-negative differences of segs_in and data_segs_out / segs_out.
        #    NOTE: diff() runs over the whole sorted frame, so the first row of each flow
        #    is compared with the last row of the previous flow; the clip only removes
        #    negative jumps.
        segs_in = df["segs_in"].astype(float) if "segs_in" in df.columns else pd.Series(0.0, index=df.index)
        if "data_segs_out" in df.columns:
            out = df["data_segs_out"].astype(float)
        elif "segs_out" in df.columns:
            out = df["segs_out"].astype(float)
        else:
            out = pd.Series(0.0, index=df.index)

        d_in = segs_in.diff().fillna(0).clip(lower=0)
        d_out = out.diff().fillna(0).clip(lower=0)

        # 3) Differences in bytes_acked
        if "bytes_acked" in df.columns:
            d_bytes_acked = df["bytes_acked"].astype(float).diff().fillna(0).clip(lower=0)
        else:
            d_bytes_acked = pd.Series(0.0, index=df.index)

        # 4) Proxy: bursts of arrivals (ACKs received) without proportional progress in bytes_acked.
        #    Simple formula: excess = d_in - (d_bytes_acked / approximate MSS)
        #    With no MSS available, an approximate MSS of 1460 bytes converts bytes to "segments".
        MSS_APROX = 1460.0
        proxy = d_in - (d_bytes_acked / MSS_APROX)
        # The (d_in - d_out) term has weight 0.0: it is a placeholder and does not change the result.
        proxy = (proxy + (d_in - d_out) * 0.0).clip(lower=0)  # keep non-negative

        # Optional smoothing: short rolling window
        df["dup_acks"] = proxy.rolling(window=3, min_periods=1).mean()

    # Keep only the final features plus columns useful for reference
    keep = ["packet_loss_rate", "dup_acks"]
    extras = [c for c in [timestamp_col] if c and c in df.columns]
    df_feats = df[keep + extras].copy()
    # If the proxy branch sorted the frame, restore the input row order; otherwise the
    # features would no longer line up row by row with labels read from the original DataFrame.
    if not df_feats.index.equals(original_index):
        df_feats = df_feats.reindex(original_index)
    return df_feats
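
To make the two formulas in `derive_features` concrete, here is a small worked example on a toy single-flow table. The counter values are invented for illustration and the column names simply mirror the ones the function looks for; it is a sketch of the arithmetic, not a replacement for the function.

```python
import pandas as pd

# Hypothetical cumulative counters for one flow (values invented for illustration).
toy = pd.DataFrame({
    "bytes_sent":    [14600, 29200, 43800, 58400],
    "bytes_retrans": [0,     0,     1460,  2920],
    "segs_in":       [10,    20,    35,    50],
    "bytes_acked":   [14600, 29200, 32120, 35040],
})

# Same ratio as derive_features: retransmitted bytes over sent bytes.
toy["packet_loss_rate"] = toy["bytes_retrans"].clip(lower=0) / (toy["bytes_sent"].abs() + 1e-9)

# Same proxy as derive_features: incoming segments not explained by ACK progress.
MSS_APPROX = 1460.0  # assumed MSS, used to convert bytes to approximate segments
d_in = toy["segs_in"].diff().fillna(0).clip(lower=0)
d_acked = toy["bytes_acked"].diff().fillna(0).clip(lower=0)
toy["dup_acks_raw"] = (d_in - d_acked / MSS_APPROX).clip(lower=0)

# Same smoothing as derive_features: rolling mean over up to 3 rows.
toy["dup_acks"] = toy["dup_acks_raw"].rolling(window=3, min_periods=1).mean()

print(toy[["packet_loss_rate", "dup_acks_raw", "dup_acks"]].round(4))
```

In the last two rows, 15 new segments arrive while `bytes_acked` advances by only 2 MSS, so the raw proxy is 13 in both. After smoothing the values become roughly 4.33 and 8.67, and it is these smoothed values that are later compared against `TH_DUP_ACKS`.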


In [None]:

def get_labels(
    df_original: pd.DataFrame,
    df_feats: pd.DataFrame,
    th_packet_loss: float = 0.01,
    th_dup_acks: float = 3.0,
) -> pd.Series:
    """Return the binary 'congestion' label.
    - If df_original has a 'congestion' column, it is used as is.
    - Otherwise a **fallback rule** with thresholds is applied:
      congestion = 1 if packet_loss_rate >= th_packet_loss or dup_acks >= th_dup_acks
    """
    if "congestion" in df_original.columns:
        y = df_original["congestion"].astype(int)
    else:
        pl = df_feats["packet_loss_rate"].fillna(0)
        da = df_feats["dup_acks"].fillna(0)
        y = ((pl >= th_packet_loss) | (da >= th_dup_acks)).astype(int)
    return y
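
The fallback rule in `get_labels` can be checked by hand on a few rows. The sketch below repeats the same expression on a toy feature table (values invented for illustration), including a row with missing values to show that NaN is treated as 0 and therefore labeled as no congestion.

```python
import numpy as np
import pandas as pd

# Toy features: one clean row, one over each threshold, one with missing values.
feats = pd.DataFrame({
    "packet_loss_rate": [0.000, 0.020, 0.005, np.nan],
    "dup_acks":         [0.0,   1.0,   3.0,   np.nan],
})

th_packet_loss, th_dup_acks = 0.01, 3.0

# Same expression as the fallback branch of get_labels.
pl = feats["packet_loss_rate"].fillna(0)
da = feats["dup_acks"].fillna(0)
labels = ((pl >= th_packet_loss) | (da >= th_dup_acks)).astype(int)

print(labels.tolist())  # [0, 1, 1, 0]
```

Because these labels are a deterministic function of the same two features the models receive, a classifier trained on them can at best re-learn the two thresholds. Evaluation scores obtained this way describe how well the rule is reproduced, not how well real congestion is detected.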


In [None]:

def train_models(X: pd.DataFrame, y: pd.Series):
    # Random Forest
    rf = RandomForestClassifier(
        n_estimators=300,
        max_depth=None,
        random_state=RANDOM_STATE,
        n_jobs=N_JOBS,
        class_weight=None
    )
    rf.fit(X, y)

    # XGBoost
    xgb = XGBClassifier(
        n_estimators=500,
        max_depth=4,
        learning_rate=0.05,
        subsample=0.9,
        colsample_bytree=0.9,
        min_child_weight=1.0,
        reg_lambda=1.0,
        random_state=RANDOM_STATE,
        n_jobs=N_JOBS,
        objective="binary:logistic",
        eval_metric="logloss",
        tree_method="hist"
    )
    xgb.fit(X, y)

    return rf, xgb


def evaluate_models(models, X_train, y_train, X_test, y_test):
    results = {}
    for name, model in models.items():
        y_pred = model.predict(X_test)
        try:
            y_prob = model.predict_proba(X_test)[:, 1]
        except Exception:
            # Some models may not implement predict_proba
            y_prob = None

        metrics = {
            "accuracy": accuracy_score(y_test, y_pred),
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "f1": f1_score(y_test, y_pred, zero_division=0),
        }
        if y_prob is not None and len(np.unique(y_test)) == 2:
            try:
                metrics["roc_auc"] = roc_auc_score(y_test, y_prob)
            except Exception:
                metrics["roc_auc"] = np.nan
        else:
            metrics["roc_auc"] = np.nan

        results[name] = metrics
    return results


def plot_confusion_matrix(y_true, y_pred, title="Confusion matrix"):
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots(figsize=(4.5, 4.5))
    im = ax.imshow(cm, interpolation='nearest')
    ax.figure.colorbar(im, ax=ax)
    ax.set(xticks=np.arange(cm.shape[1]), yticks=np.arange(cm.shape[0]))
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title(title)
    # Cell counts
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], 'd'), ha="center", va="center")
    plt.show()


def plot_roc(y_true, y_prob, title="ROC Curve"):
    if y_prob is None:
        print("[INFO] The model does not provide probabilities. ROC skipped.")
        return
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    fig, ax = plt.subplots(figsize=(5.5, 4.0))
    ax.plot(fpr, tpr, label="ROC")
    ax.plot([0,1], [0,1], linestyle="--")
    ax.set_xlabel("FPR")
    ax.set_ylabel("TPR")
    ax.set_title(title)
    ax.legend()
    plt.show()


def plot_feature_importance(model, feature_names, title):
    if hasattr(model, "feature_importances_"):
        importances = model.feature_importances_
    elif hasattr(model, "get_booster"):
        importances = model.get_booster().get_score(importance_type="gain")
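        # get_score returns a dict; map it onto a vector in feature_names order
        if isinstance(importances, dict):
            # Keys are the training column names when the model was fit on a DataFrame,
            # and f0, f1, ... when it was fit on a plain array. Features that were never
            # used in a split are absent from the dict and get 0.0.
            vect = np.zeros(len(feature_names))
            for i, name in enumerate(feature_names):
                vect[i] = importances.get(name, importances.get(f"f{i}", 0.0))
            importances = vect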
    else:
        print("[INFO] The model does not expose feature importances.")
        return

    fig, ax = plt.subplots(figsize=(6.0, 4.0))
    idx = np.argsort(importances)[::-1]
    ax.bar(range(len(feature_names)), np.array(importances)[idx])
    ax.set_xticks(range(len(feature_names)))
    ax.set_xticklabels([feature_names[i] for i in idx], rotation=0)
    ax.set_title(title)
    plt.tight_layout()
    plt.show()


In [None]:

# =====================
# Main flow
# =====================
df = load_dataset(CSV_PATH)
print("Shape:", df.shape)
display(df.head(3))

# Derive features (only 2 features)
df_feats = derive_features(df, flow_keys=FLOW_KEYS, timestamp_col=TIMESTAMP_COL)
print("Derived features:")
display(df_feats.head(5))

# Labels
y = get_labels(df, df_feats, th_packet_loss=TH_PACKET_LOSS, th_dup_acks=TH_DUP_ACKS)
print("Label distribution (0=no congestion, 1=congestion):")
print(y.value_counts(dropna=False))

# Final dataset for ML (only 2 columns)
X = df_feats[["packet_loss_rate", "dup_acks"]].copy().fillna(0.0)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE, stratify=y
)

# Train
rf, xgb = train_models(X_train, y_train)

# Evaluate
models = {"RandomForest": rf, "XGBoost": xgb}
results = evaluate_models(models, X_train, y_train, X_test, y_test)
print("Test results:")
for name, m in results.items():
    print(f"{name}: " + ", ".join([f"{k}={v:.4f}" if isinstance(v, (int, float)) else f"{k}={v}" for k,v in m.items()]))

# Predictions for the plots
y_pred_rf = rf.predict(X_test)
try:
    y_prob_rf = rf.predict_proba(X_test)[:, 1]
except Exception:
    y_prob_rf = None

y_pred_xgb = xgb.predict(X_test)
try:
    y_prob_xgb = xgb.predict_proba(X_test)[:, 1]
except Exception:
    y_prob_xgb = None

# Plots (one figure per plot)
plot_confusion_matrix(y_test, y_pred_rf, title="Confusion matrix - RandomForest")
plot_confusion_matrix(y_test, y_pred_xgb, title="Confusion matrix - XGBoost")

plot_roc(y_test, y_prob_rf, title="ROC - RandomForest")
plot_roc(y_test, y_prob_xgb, title="ROC - XGBoost")

# Feature importances
plot_feature_importance(rf, X.columns.tolist(), title="Feature importance - RandomForest")
plot_feature_importance(xgb, X.columns.tolist(), title="Feature importance - XGBoost")

# Save the models and the feature names
os.makedirs("/mnt/data", exist_ok=True)
dump(rf, "/mnt/data/rf_congestion_classifier.joblib")
dump(xgb, "/mnt/data/xgb_congestion_classifier.joblib")
pd.Series(X.columns, name="feature_names").to_csv("/mnt/data/feature_names.csv", index=False)

print("\nSaved models:")
print("- /mnt/data/rf_congestion_classifier.joblib")
print("- /mnt/data/xgb_congestion_classifier.joblib")
print("- /mnt/data/feature_names.csv")
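
A saved model is only useful if it can be reloaded and fed columns in the same order it was trained on. The sketch below shows that round trip with `joblib`. To stay self-contained it trains a small throwaway Random Forest on synthetic data and writes it to a temporary directory; the file name `rf_toy.joblib` and the synthetic values are assumptions for illustration, not the artifacts saved above.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

feature_names = ["packet_loss_rate", "dup_acks"]

# Synthetic training data, labeled with the same threshold rule as the fallback.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "packet_loss_rate": rng.uniform(0, 0.05, 200),
    "dup_acks": rng.uniform(0, 6, 200),
})
y_toy = ((X_toy["packet_loss_rate"] >= 0.01) | (X_toy["dup_acks"] >= 3)).astype(int)

with tempfile.TemporaryDirectory() as tmp:
    model_path = str(Path(tmp) / "rf_toy.joblib")  # hypothetical file name
    dump(RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy), model_path)
    model = load(model_path)  # reload as a later session would

# New observations must use the training column names, in the training order.
new_rows = pd.DataFrame([[0.001, 0.0], [0.04, 5.0]], columns=feature_names)
preds = model.predict(new_rows[feature_names])
probs = model.predict_proba(new_rows[feature_names])[:, 1]
print(preds, probs)
```

Selecting `new_rows[feature_names]` explicitly is the reason for storing the feature names next to the models: it guards against a column order that silently differs from training.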



## Next steps
1. **Adjust `derive_features(...)`** so that `dup_acks` uses your real column (if it exists) or a proxy that is more faithful to your logging.
2. **Replace the label fallback** if you already have `congestion` (or a label derived from your ground truth).
3. Add temporal validation (a train/test split by time) if your data is a time series.
4. Log parameters and artifacts with MLflow or a similar tool for traceability.
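
The temporal validation in step 3 can be sketched as follows: sort by timestamp, train on the oldest rows, and test on the newest ones, so the model is never evaluated on data that precedes its training period. `temporal_split` is a hypothetical helper, not part of the notebook, and the toy table is invented for illustration.

```python
import numpy as np
import pandas as pd


def temporal_split(df: pd.DataFrame, timestamp_col: str, test_size: float = 0.2):
    """Hypothetical helper: the oldest rows go to train, the newest to test."""
    ordered = df.sort_values(timestamp_col, kind="stable")
    cut = int(round(len(ordered) * (1.0 - test_size)))
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy data with timestamps deliberately in reverse order.
toy = pd.DataFrame({
    "ts": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(10)[::-1], unit="s"),
    "packet_loss_rate": np.linspace(0.0, 0.045, 10),
    "dup_acks": np.arange(10, dtype=float),
})

train, test = temporal_split(toy, "ts", test_size=0.2)

# Every training row is strictly older than every test row.
assert train["ts"].max() < test["ts"].min()
print(len(train), len(test))  # 8 2
```

Unlike the stratified `train_test_split` used in the main flow, a time-based cut does not preserve class proportions, so it is worth checking that both classes appear on each side before training.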
