# Autoencoder: clean pipeline (Keras/TensorFlow)
This notebook is a **Keras** version of the autoencoder pipeline, organized like a *mini-article*:

1. **Data & alignment** (expression ↔ clinical)
2. **Target construction** (Yes/No → 1/0) and separation of *labeled* / *unlabeled* patients
3. **Patient-level split** (stratified) and **leak-free preprocessing**
4. **Autoencoder (Keras)** + early stopping
5. **Latent space (Z)**: extraction + 2D/3D visualizations
6. **Downstream evaluation**: kNN on **X_in** vs **Z** (ROC/PR + metrics)
7. **Illustrative case**: one patient + their K neighbors (distances + outcomes)

> ⚠️ Important: to avoid any leakage, the transforms (scaler/PCA) and the autoencoder are **fit only on the AE training set**.


## 0) Setup & parameters
- Set the paths to your files here (expression + meta).
- Adapt the ID column names if needed.
- Choose whether to also train the autoencoder on **unlabeled** patients (often useful in a non-parametric semi-supervised setting).


In [None]:

# --- Imports ---
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import (
    roc_auc_score, average_precision_score, f1_score, balanced_accuracy_score,
    RocCurveDisplay, PrecisionRecallDisplay
)
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# --- Reproducibility ---
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

# --- I/O ---
EXPR_PATH = r"PATH/TO/expression.csv"   # e.g. matrix (patients x genes)
META_PATH = r"PATH/TO/meta.csv"         # e.g. clinical table

# ID columns: adapt to your format
EXPR_ID_COL = "patient"                 # or "barcode"/"submitter_id"/etc.
META_ID_COL = "patient"

TARGET_COL  = "paper_recurred_progressed"

# --- Hyperparams ---
TEST_SIZE = 0.20
LATENT_DIM = 32
BATCH_SIZE = 64
EPOCHS = 300
PATIENCE = 25

# Dimensionality reduction before the AE (often essential with 21k genes)
USE_PCA = True
PCA_NCOMP = 512     # typically 256-1024

# Also train the AE on unlabeled patients (recommended if there are many of them)
INCLUDE_UNLABELED_IN_AE_TRAIN = True

# kNN downstream
K_NEIGHBORS = 10

# Figures folder
FIGDIR = "figures_autoencoder_keras"
os.makedirs(FIGDIR, exist_ok=True)

print("TensorFlow:", tf.__version__)


## 1) Loading & patient-level alignment
We load:
- an expression matrix (patients × genes)
- a `meta` table (patients × clinical variables)

Then we align the patients on the intersection of their IDs.
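
As a toy illustration of the alignment step, the sketch below builds two small tables with partially overlapping patient IDs and keeps only the intersection. The IDs (`P1` to `P4`), gene columns and clinical values are invented for the example; only the `intersection` + `.loc` pattern mirrors the pipeline.

```python
import pandas as pd

# Hypothetical toy tables: IDs and values are invented for illustration.
expr_toy = pd.DataFrame(
    {"geneA": [1.0, 2.0, 3.0], "geneB": [0.5, 0.1, 0.9]},
    index=pd.Index(["P1", "P2", "P3"], name="patient"),
)
meta_toy = pd.DataFrame(
    {"paper_recurred_progressed": ["Yes", "No", None]},
    index=pd.Index(["P2", "P3", "P4"], name="patient"),
)

# Keep only the patients present in both tables, in the same order.
common_toy = expr_toy.index.intersection(meta_toy.index)
expr_toy = expr_toy.loc[common_toy]
meta_toy = meta_toy.loc[common_toy]

print("aligned patients:", list(common_toy))
print("same row order:", expr_toy.index.equals(meta_toy.index))
```

Patients found in only one table (`P1` and `P4` here) are dropped, so row `i` of the expression matrix and row `i` of the clinical table always refer to the same patient.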


In [None]:

# --- Loading ---
# ⚠️ adapt read_csv / index_col to your files
expr = pd.read_csv(EXPR_PATH)
meta = pd.read_csv(META_PATH)

# If your matrices are already indexed by patient, you can do:
# expr = pd.read_csv(EXPR_PATH, index_col=0)
# expr.index.name = EXPR_ID_COL

# Here we assume expr has an ID column
assert EXPR_ID_COL in expr.columns, f"Column {EXPR_ID_COL} missing from expr"
assert META_ID_COL in meta.columns, f"Column {META_ID_COL} missing from meta"

expr = expr.set_index(EXPR_ID_COL)
meta = meta.set_index(META_ID_COL)

# Align IDs (intersection)
common_ids = expr.index.intersection(meta.index)
expr = expr.loc[common_ids].copy()
meta = meta.loc[common_ids].copy()

print("N aligned patients:", len(common_ids))
print("Expr shape (patients x features):", expr.shape)


## 2) Target construction (y) + masks
We map `Yes/No` → `1/0`.  
Then:
- `idx_labeled`: patients with a known y
- `idx_unlabeled`: patients with a missing y
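
The mapping and the two index arrays can be checked on a handful of made-up labels. The values in `raw` are hypothetical; the point is that anything other than yes/no (missing values, free text) ends up in the unlabeled set.

```python
import numpy as np
import pandas as pd

# Hypothetical raw labels with inconsistent casing, whitespace and gaps.
raw = pd.Series(["Yes", " no", "YES ", None, "Unknown"])

# Normalize, then map; anything that is not yes/no becomes NaN.
y_toy = raw.astype(str).str.strip().str.lower().map({"yes": 1, "no": 0})

mask_labeled_toy = y_toy.notna().to_numpy()
idx_labeled_toy = np.where(mask_labeled_toy)[0]
idx_unlabeled_toy = np.where(~mask_labeled_toy)[0]

print("labeled positions  :", idx_labeled_toy)
print("unlabeled positions:", idx_unlabeled_toy)
```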


In [None]:

# --- clean binary y ---
y_str = meta[TARGET_COL].astype(str).str.strip().str.lower()
y = y_str.map({"yes": 1, "no": 0})

mask_labeled = y.notna().to_numpy()
idx_labeled = np.where(mask_labeled)[0]
idx_unlabeled = np.where(~mask_labeled)[0]

print(f"Total: {len(y)}")
print(f"Labeled: {len(idx_labeled)} | Unlabeled: {len(idx_unlabeled)}")

# Full X (to encode everyone later)
X_all = expr.to_numpy(dtype=np.float32)

# y for labeled patients only
y_labeled = y.iloc[idx_labeled].to_numpy(dtype=np.int32)
print("Positives:", int(y_labeled.sum()), "/", len(y_labeled), f"({y_labeled.mean()*100:.1f}%)")


## 3) Patient-level split (stratified) + AE/train/test sets
We split **only** the labeled patients, because stratifying on `y` requires known labels.  
Then we define:

- `idx_test`: labeled test patients  
- `idx_train_lab`: labeled train patients  
- `idx_train_ae`: autoencoder training set (labeled train + optionally unlabeled)

👉 The AE **never sees** the test patients, to avoid any leakage.
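
The index bookkeeping can be sketched on a made-up cohort of 12 patients (10 labeled, 2 unlabeled). Sizes, labels and the 50% test fraction are assumptions chosen to keep the example small; names ending in `_lab`, `_unlab`, `_tr`, `_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical cohort: positions 0-9 are labeled, 10-11 are unlabeled.
idx_lab = np.arange(10)
idx_unlab = np.array([10, 11])
y_lab = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Split positions *within* the labeled subset, stratified on the label.
rel_tr, rel_te = train_test_split(
    np.arange(len(idx_lab)), test_size=0.5, random_state=0, stratify=y_lab
)
idx_tr, idx_te = idx_lab[rel_tr], idx_lab[rel_te]

# AE training set = labeled train + unlabeled; test patients stay out.
idx_ae = np.concatenate([idx_tr, idx_unlab])

print("test positives:", int(y_lab[rel_te].sum()), "of", len(rel_te))
print("overlap AE/test:", np.intersect1d(idx_ae, idx_te).size)
```

Stratification keeps the class ratio of the labeled set in both halves, and the overlap count of zero is the same sanity check the notebook asserts.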


In [None]:

# Indices relative to the idx_labeled list
rel_all = np.arange(len(idx_labeled))
rel_train, rel_test = train_test_split(
    rel_all, test_size=TEST_SIZE, random_state=SEED, stratify=y_labeled
)

idx_train_lab = idx_labeled[rel_train]
idx_test      = idx_labeled[rel_test]

if INCLUDE_UNLABELED_IN_AE_TRAIN:
    idx_train_ae = np.concatenate([idx_train_lab, idx_unlabeled])
else:
    idx_train_ae = idx_train_lab.copy()

print("Train labeled:", len(idx_train_lab))
print("Test labeled :", len(idx_test))
print("Train AE     :", len(idx_train_ae))

# Sanity check: no overlap between train_ae and test
assert len(np.intersect1d(idx_train_ae, idx_test)) == 0


## 4) Leak-free preprocessing (scaler + optional PCA)
- **Scaler** fit on `idx_train_ae` only.
- **PCA** (if enabled) fit on `idx_train_ae` only.
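
One way to make the fit-on-train-only rule harder to break is to chain both transforms in a scikit-learn `Pipeline`. This is an alternative sketch, not what the notebook itself does; the data are random and `X_toy`, `train_rows` and `prep` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 8))   # random stand-in for the expression matrix
train_rows = np.arange(20)         # hypothetical AE training rows

prep = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=4, random_state=0)),
])

# Means, variances and principal axes are estimated on the training rows only...
prep.fit(X_toy[train_rows])
# ...and then applied unchanged to every row, including held-out ones.
X_toy_in = prep.transform(X_toy)

print(X_toy_in.shape)
```

Because `fit` and `transform` are separate calls, the held-out rows never influence the scaling statistics or the principal components.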


In [None]:

# --- scaler ---
scaler = StandardScaler(with_mean=True, with_std=True)
X_train_ae_scaled = scaler.fit_transform(X_all[idx_train_ae])

X_all_scaled = scaler.transform(X_all)  # transform everyone, but the scaler was fit on train_ae only

# --- PCA (optional) ---
if USE_PCA:
    # PCA cannot return more components than min(n_samples, n_features)
    n_comp = min(PCA_NCOMP, *X_train_ae_scaled.shape)
    pca = PCA(n_components=n_comp, random_state=SEED)
    X_train_ae_in = pca.fit_transform(X_train_ae_scaled)
    X_all_in      = pca.transform(X_all_scaled)
    input_dim = X_train_ae_in.shape[1]
    explained = pca.explained_variance_ratio_.sum()
    print(f"PCA: input_dim={input_dim} | explained variance ~ {explained:.2f}")
else:
    X_train_ae_in = X_train_ae_scaled
    X_all_in      = X_all_scaled
    input_dim = X_train_ae_in.shape[1]
    print(f"No PCA: input_dim={input_dim}")

# X_in for labeled train/test (downstream)
X_train_in = X_all_in[idx_train_lab]
X_test_in  = X_all_in[idx_test]
y_train    = y.iloc[idx_train_lab].to_numpy(dtype=np.int32)
y_test     = y.iloc[idx_test].to_numpy(dtype=np.int32)

print("X_train_in:", X_train_in.shape, "| X_test_in:", X_test_in.shape)


## 5) Autoencoder (Keras)
Symmetric Dense architecture:
- Encoder: 512 → 256 → `LATENT_DIM`
- Decoder: 256 → 512 → reconstruction
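
To get a feel for the size of this architecture, the Dense weights can be counted by hand. The sketch below assumes `input_dim = 512` (the PCA setting above) and `LATENT_DIM = 32`, and counts only Dense kernels and biases; BatchNormalization parameters are left out. `dense_params` is a hypothetical helper.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

input_dim, latent_dim = 512, 32  # assumed values, matching the defaults above
widths = [input_dim, 512, 256, latent_dim, 256, 512, input_dim]

# One entry per Dense layer: consecutive pairs of widths.
per_layer = [dense_params(a, b) for a, b in zip(widths[:-1], widths[1:])]
print(per_layer)
print("total Dense parameters:", sum(per_layer))
```

With a few hundred patients, a count in the hundreds of thousands is one reason the notebook relies on dropout and early stopping.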


In [None]:

def build_autoencoder(input_dim: int, latent_dim: int = 32, dropout: float = 0.1):
    inp = keras.Input(shape=(input_dim,), name="X_in")

    x = layers.Dense(512)(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(dropout)(x)

    x = layers.Dense(256)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(dropout)(x)

    z = layers.Dense(latent_dim, name="Z")(x)

    x = layers.Dense(256)(z)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(dropout)(x)

    x = layers.Dense(512)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(dropout)(x)

    out = layers.Dense(input_dim, name="X_recon")(x)

    auto = keras.Model(inp, out, name="autoencoder")
    encoder = keras.Model(inp, z, name="encoder")

    auto.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
    return auto, encoder

autoencoder, encoder = build_autoencoder(input_dim=input_dim, latent_dim=LATENT_DIM, dropout=0.15)
autoencoder.summary()

callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=PATIENCE, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", patience=8, factor=0.5, min_lr=1e-5)
]

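# Keras takes validation_split from the *last* rows, before any shuffling.
# idx_train_ae ends with the unlabeled patients, so shuffle the rows first
# to make the validation set a random sample of the AE training set.
perm = np.random.default_rng(SEED).permutation(len(X_train_ae_in))
X_fit = X_train_ae_in[perm]

history = autoencoder.fit(
    X_fit, X_fit,
    validation_split=0.1,
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=callbacks,
    verbose=1
)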

# --- Figure : learning curves ---
plt.figure()
plt.plot(history.history["loss"], label="train")
plt.plot(history.history["val_loss"], label="val")
plt.xlabel("Epoch")
plt.ylabel("MSE reconstruction")
plt.legend()
plt.tight_layout()
plt.savefig(os.path.join(FIGDIR, "Fig1_training_curve.png"), dpi=200)
plt.show()


## 6) Latent space Z extraction + reconstruction error
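
The per-patient reconstruction error is the squared difference between input and reconstruction, averaged over features. A toy NumPy check with made-up numbers (`X_toy` and `X_hat` are hypothetical):

```python
import numpy as np

# Made-up inputs and reconstructions: two patients, three features.
X_toy = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])
X_hat = np.array([[1.0, 2.0, 1.0],
                  [1.0, 1.0, 1.0]])

# Average the squared residuals over features (axis=1): one value per patient.
err = np.mean((X_toy - X_hat) ** 2, axis=1)
print(err)
```

The first patient has a single feature off by 2 (error 4/3), the second has every feature off by 1 (error 1), so a high value can come from one badly reconstructed feature as well as from a uniformly poor reconstruction.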

In [None]:

# --- Encoding ---
Z_all = encoder.predict(X_all_in, batch_size=512, verbose=0)

# --- Reconstruction error ---
X_recon = autoencoder.predict(X_all_in, batch_size=512, verbose=0)
recon_err = np.mean((X_all_in - X_recon)**2, axis=1)

print("Z_all:", Z_all.shape)
print("Recon err: mean", float(np.mean(recon_err)), "std", float(np.std(recon_err)))

# --- Figure: reconstruction error distribution ---
plt.figure()
plt.hist(recon_err, bins=50)
plt.xlabel("Reconstruction error (MSE)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig(os.path.join(FIGDIR, "Fig2_recon_error_hist.png"), dpi=200)
plt.show()

# --- Figure: reconstruction error by label (labeled patients only) ---
plt.figure()
plt.hist(recon_err[idx_labeled][y_labeled==0], bins=40, alpha=0.7, label="No")
plt.hist(recon_err[idx_labeled][y_labeled==1], bins=40, alpha=0.7, label="Yes")
plt.xlabel("Reconstruction error (MSE)")
plt.ylabel("Count")
plt.legend()
plt.tight_layout()
plt.savefig(os.path.join(FIGDIR, "Fig2b_recon_error_by_label.png"), dpi=200)
plt.show()


## 7) Latent space (Z) visualization: 2D & 3D (PCA on Z)

In [None]:

from sklearn.decomposition import PCA

pcaZ2 = PCA(n_components=2, random_state=SEED)
Z2 = pcaZ2.fit_transform(Z_all)

plt.figure()
plt.scatter(Z2[:,0], Z2[:,1], s=10, alpha=0.6)
plt.xlabel("PC1 (Z)")
plt.ylabel("PC2 (Z)")
plt.tight_layout()
plt.savefig(os.path.join(FIGDIR, "Fig3_Z_PCA2_all.png"), dpi=200)
plt.show()

# Colored by label (labeled patients only)
plt.figure()
mask0 = (y.iloc[idx_labeled].to_numpy()==0)
mask1 = (y.iloc[idx_labeled].to_numpy()==1)
plt.scatter(Z2[idx_labeled][mask0,0], Z2[idx_labeled][mask0,1], s=12, alpha=0.6, label="No")
plt.scatter(Z2[idx_labeled][mask1,0], Z2[idx_labeled][mask1,1], s=12, alpha=0.6, label="Yes")
plt.xlabel("PC1 (Z)")
plt.ylabel("PC2 (Z)")
plt.legend()
plt.tight_layout()
plt.savefig(os.path.join(FIGDIR, "Fig3b_Z_PCA2_labeled.png"), dpi=200)
plt.show()

# 3D
from mpl_toolkits.mplot3d import Axes3D  # noqa
pcaZ3 = PCA(n_components=3, random_state=SEED)
Z3 = pcaZ3.fit_transform(Z_all)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(Z3[:,0], Z3[:,1], Z3[:,2], s=8, alpha=0.5)
ax.set_xlabel("PC1 (Z)")
ax.set_ylabel("PC2 (Z)")
ax.set_zlabel("PC3 (Z)")
plt.tight_layout()
plt.savefig(os.path.join(FIGDIR, "Fig3c_Z_PCA3_all.png"), dpi=200)
plt.show()


## 8) Downstream: kNN on X_in vs kNN on Z (ROC/PR)

In [None]:

def eval_knn(Xtr, Xte, ytr, yte, k=10):
    clf = KNeighborsClassifier(n_neighbors=k, weights="distance")
    clf.fit(Xtr, ytr)
    proba = clf.predict_proba(Xte)[:,1]
    pred = (proba >= 0.5).astype(int)

    out = {
        "ROC_AUC": roc_auc_score(yte, proba) if len(np.unique(yte))==2 else np.nan,
        "PR_AUC": average_precision_score(yte, proba) if len(np.unique(yte))==2 else np.nan,
        "F1": f1_score(yte, pred, zero_division=0),
        "BalAcc": balanced_accuracy_score(yte, pred),
        "proba": proba,
        "pred": pred
    }
    return out

res_X = eval_knn(X_train_in, X_test_in, y_train, y_test, k=K_NEIGHBORS)

Z_train = Z_all[idx_train_lab]
Z_test  = Z_all[idx_test]
res_Z = eval_knn(Z_train, Z_test, y_train, y_test, k=K_NEIGHBORS)

print("kNN on X_in:", {k:v for k,v in res_X.items() if k not in ("proba","pred")})
print("kNN on Z   :", {k:v for k,v in res_Z.items() if k not in ("proba","pred")})

# ROC (both curves on the same axes; without ax= each call opens a new figure)
fig, ax = plt.subplots()
RocCurveDisplay.from_predictions(y_test, res_X["proba"], name="kNN on X_in", ax=ax)
RocCurveDisplay.from_predictions(y_test, res_Z["proba"], name="kNN on Z", ax=ax)
fig.tight_layout()
fig.savefig(os.path.join(FIGDIR, "Fig4_ROC_X_vs_Z.png"), dpi=200)
plt.show()

# PR (same pattern: share one axes object)
fig, ax = plt.subplots()
PrecisionRecallDisplay.from_predictions(y_test, res_X["proba"], name="kNN on X_in", ax=ax)
PrecisionRecallDisplay.from_predictions(y_test, res_Z["proba"], name="kNN on Z", ax=ax)
fig.tight_layout()
fig.savefig(os.path.join(FIGDIR, "Fig4b_PR_X_vs_Z.png"), dpi=200)
plt.show()


## 9) Illustrative case: one test patient + their K neighbors in Z
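
The neighbor-based risk is an inverse-distance weighted average of the neighbors' outcomes. A small worked example with made-up distances and labels (all values hypothetical) shows how closer neighbors dominate:

```python
import numpy as np

# Made-up distances and outcomes for three neighbors (1 = Yes, 0 = No).
dist_toy = np.array([1.0, 2.0, 4.0])
y_nb = np.array([1, 0, 0])

w_toy = 1.0 / (dist_toy + 1e-6)   # the small constant avoids division by zero
risk = float(np.sum(w_toy * y_nb) / np.sum(w_toy))

print("weighted risk  :", round(risk, 3))
print("unweighted risk:", round(float(y_nb.mean()), 3))
```

The weights are roughly 1, 0.5 and 0.25, so the single positive neighbor, being the closest, pulls the risk to about 1/1.75, well above the plain average of 1/3.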

In [None]:

# Pick a test patient (e.g. the first one)
i_test = 0
idx_patient = idx_test[i_test]

# Neighbors are searched in the labeled TRAIN set (otherwise leakage)
nbrs = NearestNeighbors(n_neighbors=K_NEIGHBORS, metric="euclidean")
nbrs.fit(Z_train)

dist, ind = nbrs.kneighbors(Z_all[idx_patient].reshape(1,-1))
dist = dist.flatten()
ind = ind.flatten()

neighbor_idx_global = idx_train_lab[ind]
neighbor_y = y.iloc[neighbor_idx_global].to_numpy(dtype=int)

# Simple prediction: inverse-distance weighted average
w = 1.0 / (dist + 1e-6)
pred_risk = float(np.sum(w * neighbor_y) / np.sum(w))

print("Patient index (global):", idx_patient)
print("Predicted risk (neighbors):", pred_risk)
print("Neighbors positives:", int(neighbor_y.sum()), "/", len(neighbor_y))

plt.figure(figsize=(7,3))
plt.bar(np.arange(len(dist)), dist)
plt.xlabel("Neighbor rank")
plt.ylabel("Distance in Z")
plt.tight_layout()
plt.savefig(os.path.join(FIGDIR, "Fig5_patient_neighbor_distances.png"), dpi=200)
plt.show()

plt.figure(figsize=(7,2.6))
plt.bar(np.arange(len(neighbor_y)), neighbor_y)
plt.yticks([0,1], ["No","Yes"])
plt.xlabel("Neighbor rank")
plt.title(f"Neighbor labels (Yes=1) | predicted risk={pred_risk:.2f}")
plt.tight_layout()
plt.savefig(os.path.join(FIGDIR, "Fig5b_patient_neighbor_labels.png"), dpi=200)
plt.show()

# Show IDs if available
patient_id = expr.index[idx_patient]
neighbor_ids = expr.index[neighbor_idx_global]
display(pd.DataFrame({"neighbor_id": neighbor_ids, "distance": dist, "y": neighbor_y}))


## 10) Export (optional): Z + recon_err

In [None]:

out = pd.DataFrame(Z_all, index=expr.index, columns=[f"Z{i+1}" for i in range(Z_all.shape[1])])
out["recon_err"] = recon_err
out["y"] = y  # NaN for unlabeled patients

OUT_CSV = "latent_Z_keras.csv"
out.to_csv(OUT_CSV)
print("Saved:", OUT_CSV)
