# MGMT radiomics baseline pipeline (dummy → RF → GA-RF)

This notebook shows how to:
- load the four pre-split CSV files you created,
- get floor performance using dummy classifiers,
- train a plain Random Forest on all radiomics features,
- plug in a GA-selected feature subset to train GA-RF.

All comments in the code are in English so you can paste this into GitHub or a report.


In [1]:
import random
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

## 1. Load data
We load `X_train.csv`, `X_val.csv`, `y_train.csv`, and `y_val.csv`.  
Labels are squeezed to 1D so that scikit-learn can use them directly.  
If your labels are strings like `"Methylated"`, you can map them to integers right after loading.


In [2]:
# set seeds for reproducibility
random.seed(42)
np.random.seed(42)

# read feature matrices
X_train = pd.read_csv("../data/X_train.csv")   # training features
X_val   = pd.read_csv("../data/X_val.csv")     # validation features

# read target vectors (squeeze → Series, not DataFrame)
y_train = pd.read_csv("../data/y_train.csv").squeeze()
y_val   = pd.read_csv("../data/y_val.csv").squeeze()

# quick sanity check on shapes
print("X_train:", X_train.shape)   # (n_train, d)
print("X_val:  ", X_val.shape)     # (n_val, d)
print("y_train:", y_train.shape)   # (n_train,)
print("y_val:  ", y_val.shape)     # (n_val,)


# if labels are text, map them to 0/1 here, before the GA globals are created
# y_train = (y_train == "Methylated").astype(int)
# y_val   = (y_val == "Methylated").astype(int)

# --- globals used by the GA cells below ---
New_FS = X_train.copy()                   # GA selects columns from this table
y_trn  = y_train.reset_index(drop=True)   # labels with a clean 0..n-1 index

GA_POP_PATH   = "../data/ga_pop.npy"      # checkpoint: current population
GA_SCORE_PATH = "../data/ga_scores.npy"   # checkpoint: best score per generation

size   = 50                # GA population size
n_feat = New_FS.shape[1]   # number of candidate features (chromosome length)


X_train: (42, 725)
X_val:   (11, 725)
y_train: (42,)
y_val:   (11,)


## 2. Dummy baselines
We train two dummy models to establish the floor performance on this split.  
`most_frequent` = always predict the majority class of the training set.  
`stratified` = draw each prediction at random according to the training class proportions, so its score depends on the seed and is noisy on only 11 validation cases.


In [3]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# dummy that always picks the most common class in y_train
dummy_mf = DummyClassifier(strategy="most_frequent")
dummy_mf.fit(X_train, y_train)
y_pred_mf = dummy_mf.predict(X_val)
acc_mf = accuracy_score(y_val, y_pred_mf)
print(f"Dummy (most_frequent) accuracy: {acc_mf:.3f}")

# dummy that samples labels according to class proportion in y_train
dummy_st = DummyClassifier(strategy="stratified", random_state=42)
dummy_st.fit(X_train, y_train)
y_pred_st = dummy_st.predict(X_val)
acc_st = accuracy_score(y_val, y_pred_st)
print(f"Dummy (stratified) accuracy:   {acc_st:.3f}")


Dummy (most_frequent) accuracy: 0.545
Dummy (stratified) accuracy:   0.273
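
A single stratified draw on 11 cases (0.273 above) says little about the baseline itself. The sketch below computes the *expected* accuracy of both dummy strategies from class counts alone. The validation counts (6 and 5) come from the support column of the classification report in section 3; the training counts (22 and 20) are a hypothetical placeholder, and `expected_dummy_accuracy` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

def expected_dummy_accuracy(train_counts, val_counts):
    """Expected validation accuracy of the two dummy strategies.

    most_frequent: share of validation cases in the training majority class.
    stratified:    sum over classes of p_train(k) * p_val(k).
    """
    p_train = np.asarray(train_counts, dtype=float)
    p_train = p_train / p_train.sum()
    p_val = np.asarray(val_counts, dtype=float)
    p_val = p_val / p_val.sum()

    acc_most_frequent = float(p_val[np.argmax(p_train)])
    acc_stratified = float(np.dot(p_train, p_val))
    return acc_most_frequent, acc_stratified

# training counts are hypothetical; validation counts are 6 vs 5
acc_mf_exp, acc_st_exp = expected_dummy_accuracy([22, 20], [6, 5])
print(f"expected most_frequent accuracy: {acc_mf_exp:.3f}")
print(f"expected stratified accuracy:    {acc_st_exp:.3f}")
```

Under these assumed counts both baselines sit close to 0.5, which is the level the later models need to beat.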


## 3. Plain Random Forest (no GA)

We now train a real model on **all 725 radiomics features** using only the 42 training cases.  
Because the training set is small and high-dimensional, we first run a 5-fold **stratified** cross-validation on the training set to get a more stable estimate.  
After that, we fit the same RF on all 42 training samples and evaluate once on the external 11-sample validation set (`X_val`, `y_val`).  
This separates “internal CV score” from “external hold-out score”.


In [4]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced_subsample"
)

# internal CV on the 42 training samples
cv_acc = cross_val_score(rf, X_train, y_train, cv=cv, scoring="accuracy")
print("RF 5-fold CV accuracy:", cv_acc)
print("RF 5-fold CV accuracy (mean):", cv_acc.mean())

if y_train.nunique() == 2:
    cv_auc = cross_val_score(rf, X_train, y_train, cv=cv, scoring="roc_auc")
    print("RF 5-fold CV AUC:", cv_auc)
    print("RF 5-fold CV AUC (mean):", cv_auc.mean())

# fit on all 42 training samples
rf.fit(X_train, y_train)

# evaluate once on the external 11-sample validation set
y_pred_val = rf.predict(X_val)
acc_val = accuracy_score(y_val, y_pred_val)
print("\nRF external accuracy:", round(acc_val, 3))

if y_val.nunique() == 2:
    y_proba_val = rf.predict_proba(X_val)[:, 1]
    auc_val = roc_auc_score(y_val, y_proba_val)
    print("RF external AUC:     ", round(auc_val, 3))

print("\nRF external classification report:")
print(classification_report(y_val, y_pred_val))


RF 5-fold CV accuracy: [0.33333333 0.44444444 0.375      0.875      0.625     ]
RF 5-fold CV accuracy (mean): 0.5305555555555556
RF 5-fold CV AUC: [0.25  0.6   0.625 0.875 0.875]
RF 5-fold CV AUC (mean): 0.645

RF external accuracy: 0.636
RF external AUC:      0.783

RF external classification report:
              precision    recall  f1-score   support

           0       0.67      0.67      0.67         6
           1       0.60      0.60      0.60         5

    accuracy                           0.64        11
   macro avg       0.63      0.63      0.63        11
weighted avg       0.64      0.64      0.64        11
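
The fold accuracies above range from 0.33 to 0.875, which is expected when each fold holds only 8 or 9 cases. One possible way to smooth this, sketched here on synthetic data of a similar size, is to repeat the stratified split several times and report the mean and spread. The data, the names ending in `_demo`, and the choice of 5 repeats are assumptions for illustration, not part of the original pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# synthetic stand-in for a small, wide training set (not the MGMT data)
X_demo, y_demo = make_classification(
    n_samples=42, n_features=50, n_informative=5, random_state=0
)

# 5 folds repeated 5 times with different shuffles -> 25 scores
rcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)

scores_demo = cross_val_score(rf_demo, X_demo, y_demo, cv=rcv, scoring="accuracy")
print(f"accuracy: {scores_demo.mean():.3f} +/- {scores_demo.std():.3f} "
      f"over {len(scores_demo)} folds")
```

In the notebook, the same `cv` object could be passed to the existing `cross_val_score(rf, X_train, y_train, ...)` calls.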



## 4. GA-RF (paper-faithful version)

Below is a version that matches the 2022 code style much more closely.

Key points:

- we recreate their globals: `New_FS`, `y_trn`, `kfold`, `model`
- we keep their function names: `initilization_of_population`, `fitness_score`, `selection`, `crossover`, `mutation`, `generations`
- we fix only the things that would break in pandas/NumPy now:
  - `np.bool` → `bool`
  - `.iloc[train].iloc[:,chromosome]` → `.iloc[train, chromosome]`
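- the one-shot loop `generations(...)` is defined as in their script, but the run below calls `ga_one_step(...)` in three chunks of 10 generations with a checkpoint after each chunk, so you still see `gen 00 ...`, `gen 01 ...` printed
- this version uses **your** data: `X_train` → `New_FS`, `y_train` → `y_trn`

You can change the number of generations or `size` later. After this, in step 5, you can take the best chromosome (`pop[0]` after the last chunk, or `best_chromo[-1]` if you call `generations(...)`) and train a clean RF/XGB on that subset.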
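
The two indexing forms named in the list above can be compared on a toy frame. This is a minimal sketch with made-up data (`df_demo`, `rows`, and `mask` are illustrative names); it shows that a single `.iloc[rows, mask]` call with a plain `bool` mask selects the same cells as the chained form.

```python
import numpy as np
import pandas as pd

# toy frame: 4 rows, 5 feature columns
df_demo = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list("abcde"))

rows = np.array([0, 2])                                        # e.g. one CV fold
mask = np.array([True, False, True, False, True], dtype=bool)  # a chromosome

chained = df_demo.iloc[rows].iloc[:, mask]   # two-step selection
single = df_demo.iloc[rows, mask]            # one-step selection

print(single)
print("same result:", chained.equals(single))
```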


In [5]:
# ================================
# 1. set up data and base model
# ================================
# make the global variables that the original GA code expects
# New_FS: feature table to search on (your X_train)
New_FS = X_train.copy()

# y_trn: labels for GA to use during k-fold CV
y_trn = y_train.reset_index(drop=True)

# kfold: GA will reuse the same stratified 5-fold split every time
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# model: this is the RF the original code keeps calling inside fitness_score
# (you can change n_estimators etc. and GA will automatically use it)
model = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced_subsample"
)

In [6]:
# ================================
# 2. population init + fitness
# ================================
def initilization_of_population(size, n_feat):
    """
    create an initial list of chromosomes
    each chromosome is a boolean mask over features
    about 30% will be OFF (False), the rest ON (True), then shuffled
    """
    np.random.seed(42)
    INIT_OFF_RATIO = 0.3
    population = []
    for _ in range(size):
        chromosome = np.ones(n_feat, dtype=bool)
        chromosome[:int(INIT_OFF_RATIO * n_feat)] = False
        np.random.shuffle(chromosome)
        population.append(chromosome)
    return population


def fitness_score(population):
    """
    for each chromosome:
      1) pick those feature columns
      2) run k-fold CV with the global RandomForest
      3) average the CV accuracy → this is the fitness
    then:
      - sort by fitness (high → low)
      - also return selection weights like the original code
    """
    scores, newtp, newfp, newtn, newfn = [], [], [], [], []

    for chromosome in population:
        tp, fp, tn, fn, acc = [], [], [], [], []

        # k-fold over the selected features
        for train_idx, test_idx in kfold.split(New_FS, y_trn):
            # pick only the features that this chromosome says "True" to
            X_train_sel = New_FS.iloc[train_idx, chromosome]
            X_test_sel  = New_FS.iloc[test_idx, chromosome]

            # train RF (positional indexing, independent of the Series index)
            model.fit(X_train_sel, y_trn.iloc[train_idx])

            # predict
            true_labels = np.asarray(y_trn.iloc[test_idx])
            preds = model.predict(X_test_sel)

            # confusion matrix to count TP/FP/TN/FN
            tn_i, fp_i, fn_i, tp_i = confusion_matrix(true_labels, preds).ravel()

            tp.append(tp_i); fp.append(fp_i)
            tn.append(tn_i); fn.append(fn_i)

            # accuracy for this fold
            acc.append(accuracy_score(true_labels, preds))

        # average CV accuracy for this chromosome
        scores.append(np.mean(acc))

        # total counts over folds (not super important, but original code kept them)
        newtp.append(np.sum(tp)); newfp.append(np.sum(fp))
        newtn.append(np.sum(tn)); newfn.append(np.sum(fn))

    # convert to arrays for sorting
    scores = np.array(scores)
    population = np.array(population, dtype=object)

    # selection weights: higher score → higher chance to be picked
    weights = [s / scores.sum() for s in scores]

    # also sort TP/FP/TN/FN in the same order
    newtp = np.array(newtp); newfp = np.array(newfp)
    newtn = np.array(newtn); newfn = np.array(newfn)

    # sort by score descending (best first)
    inds = np.argsort(scores)[::-1]

    return (
        list(scores[inds]),
        list(population[inds]),
        list(np.array(weights)[inds]),
        list(newtp[inds]),
        list(newfp[inds]),
        list(newtn[inds]),
        list(newfn[inds]),
    )
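
The core of the fitness above (mean CV accuracy on the masked columns, then sorting and normalising into selection weights) can be sketched more compactly with `cross_val_score`. This is an illustrative variant on synthetic data, not the original function: `mask_cv_accuracy` is a hypothetical helper, it does not track TP/FP/TN/FN, and returning 0.0 for an all-False mask is an added assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def mask_cv_accuracy(X, y, mask, estimator, cv):
    """Mean k-fold accuracy using only the columns where mask is True."""
    if not mask.any():
        return 0.0  # assumption: an empty feature set gets the worst fitness
    fold_acc = cross_val_score(estimator, X[:, mask], y, cv=cv, scoring="accuracy")
    return float(fold_acc.mean())

# synthetic stand-in data (not the radiomics table)
X_toy, y_toy = make_classification(
    n_samples=40, n_features=20, n_informative=4, random_state=0
)
cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_toy = RandomForestClassifier(n_estimators=25, random_state=42)

# four random chromosomes with roughly 70% of the bits ON
rng = np.random.default_rng(0)
masks = [rng.random(X_toy.shape[1]) < 0.7 for _ in range(4)]

fitness = np.array([mask_cv_accuracy(X_toy, y_toy, m, rf_toy, cv_toy) for m in masks])
order = np.argsort(fitness)[::-1]      # best chromosome first
weights = fitness / fitness.sum()      # fitness-proportional selection weights

for rank, i in enumerate(order):
    print(f"rank {rank}: acc={fitness[i]:.3f}  weight={weights[i]:.3f}  "
          f"on_features={int(masks[i].sum())}")
```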

In [7]:
# ================================
# 3. GA operators (selection / crossover / mutation)
# ================================
def selection(pop_after_fit, weights, k):
    """
    pick k chromosomes from the fitted population,
    using their fitness-based weights (roulette-wheel style)
    """
    picked = random.choices(pop_after_fit, weights=weights, k=k)
    return list(picked)


def crossover(p1, p2, crossover_rate):
    """
    single-point crossover
    two parents → two children
    sometimes we don't crossover (just copy) depending on the rate
    """
    c1, c2 = p1.copy(), p2.copy()
    if random.random() < crossover_rate and len(p1) > 2:
        pt = random.randint(1, len(p1) - 2)
        c1 = np.concatenate((p1[:pt], p2[pt:]))
        c2 = np.concatenate((p2[:pt], p1[pt:]))
    return [c1, c2]


def mutation(chromosome, mutation_rate):
    """
    go through every bit and flip it with probability mutation_rate
    the chromosome is modified in place (nothing is returned)
    this prevents GA from getting stuck too early
    """
    for i in range(len(chromosome)):
        if random.random() < mutation_rate:
            chromosome[i] = not chromosome[i]
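
To see what the three operators do to a chromosome, here is a small vectorised sketch on 10-bit toy masks. It is a variant, not the original code: it uses a NumPy `Generator` instead of the `random` module, always applies the crossover, and returns a new array from the mutation instead of editing in place. All function names and fitness values are hypothetical.

```python
import numpy as np

def roulette_select(population, fitness, k, rng):
    """Draw k chromosomes with probability proportional to fitness."""
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()
    idx = rng.choice(len(population), size=k, replace=True, p=p)
    return [population[i].copy() for i in idx]

def single_point_crossover(p1, p2, rng):
    """Swap the tails of two parents at one random cut point (1..len-2)."""
    pt = int(rng.integers(1, len(p1) - 1))   # upper bound is exclusive
    c1 = np.concatenate((p1[:pt], p2[pt:]))
    c2 = np.concatenate((p2[:pt], p1[pt:]))
    return c1, c2

def bit_flip(chromosome, rate, rng):
    """Return a copy where each bit is flipped with probability `rate`."""
    flips = rng.random(len(chromosome)) < rate
    return np.logical_xor(chromosome, flips)

rng = np.random.default_rng(42)
pop_toy = [rng.random(10) < 0.7 for _ in range(6)]   # six toy chromosomes
fit_toy = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75]       # made-up fitness values

pa, pb = roulette_select(pop_toy, fit_toy, k=2, rng=rng)
c1, c2 = single_point_crossover(pa, pb, rng)
m1 = bit_flip(c1, rate=0.05, rng=rng)

print("parent a:", pa.astype(int))
print("parent b:", pb.astype(int))
print("child 1: ", c1.astype(int))
print("child 2: ", c2.astype(int))
print("mutated: ", m1.astype(int))
```

At every position the two children carry the same pair of bits as the two parents, so crossover only recombines existing choices; new feature combinations at a single position come from the mutation step.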


In [8]:
# ================================
# 4. full GA loop (original style)
# ================================
def generations(size, n_feat, crossover_rate, mutation_rate, n_gen):
    """
    run GA for n_gen generations in one shot
    (this is the original style)
    """
    best_chromo = []
    best_score = []

    # start with random population
    population_nextgen = initilization_of_population(size, n_feat)

    for gen in range(n_gen):
        # evaluate population
        scores, pop_after_fit, weights, tp, fp, tn, fn = fitness_score(population_nextgen)

        # best of this generation
        top_score = scores[0]
        top_chrom = pop_after_fit[0]
        print("gen", gen, "best_acc=", round(top_score, 4), "on_features=", int(top_chrom.sum()))

        # elitism (keep 2)
        elites = pop_after_fit[:2]
        k = size - 2

        # select parents
        parents = selection(pop_after_fit, weights, k)

        # make children
        children = []
        for i in range(0, len(parents), 2):
            p1 = parents[i]
            p2 = parents[(i + 1) % len(parents)]
            for child in crossover(p1, p2, crossover_rate):
                mutation(child, mutation_rate)
                children.append(child)

        # build next population
        population_nextgen = []
        for c in elites:
            population_nextgen.append(c)
        for c in children:
            if len(population_nextgen) < size:
                population_nextgen.append(c)

        # keep history
        best_chromo.append(top_chrom)
        best_score.append(top_score)

    return best_chromo, best_score


In [9]:
# ================================
# 5. single GA step (chunkable)
# ================================
def ga_one_step(population, size, crossover_rate, mutation_rate):
    """
    run GA for exactly 1 generation
    and return next_population, best_chromo, best_score
    """
    # evaluate current population
    scores, pop_after_fit, weights, tp, fp, tn, fn = fitness_score(population)

    # best in this generation
    best_score  = scores[0]
    best_chromo = pop_after_fit[0]

    # keep top-2
    elites = pop_after_fit[:2]
    k = size - 2

    # select parents for the rest
    parents = selection(pop_after_fit, weights, k)

    # crossover + mutation
    children = []
    for i in range(0, len(parents), 2):
        p1 = parents[i]
        p2 = parents[(i + 1) % len(parents)]
        for child in crossover(p1, p2, crossover_rate):
            mutation(child, mutation_rate)
            children.append(child)

    # build next population
    next_pop = []
    for e in elites:
        next_pop.append(e)
    for c in children:
        if len(next_pop) < size:
            next_pop.append(c)

    return next_pop, best_chromo, best_score


In [10]:
# ================================
# 6. chunk 1: gen 0-9
# ================================
# start fresh population
pop = initilization_of_population(size=size, n_feat=n_feat)

history_chromo = []
history_score  = []

for gen in range(10):  # 0..9
    pop, bc, bs = ga_one_step(pop, size=size, crossover_rate=0.8, mutation_rate=0.05)
    print(f"gen {gen:02d}  best_acc={bs:.4f}  on_features={bc.sum()}")
    history_chromo.append(bc)
    history_score.append(bs)

# save to disk
np.save(GA_POP_PATH, np.array(pop, dtype=object))
np.save(GA_SCORE_PATH, np.array(history_score))
print("saved after 0-9")


gen 00  best_acc=0.6500  on_features=508
gen 01  best_acc=0.6722  on_features=481
gen 02  best_acc=0.6778  on_features=457
gen 03  best_acc=0.6778  on_features=457
gen 04  best_acc=0.6972  on_features=448
gen 05  best_acc=0.7000  on_features=459
gen 06  best_acc=0.7000  on_features=459
gen 07  best_acc=0.7000  on_features=459
gen 08  best_acc=0.7000  on_features=459
gen 09  best_acc=0.7000  on_features=459
saved after 0-9
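
The checkpoint above stores the population as an object array, which is why it has to be loaded with `allow_pickle=True`. Since every chromosome has the same length, one alternative is to stack them into a 2-D boolean matrix, which round-trips without pickling. This is a sketch under that assumption; the temporary file name is hypothetical and is not the `GA_POP_PATH` used in the notebook.

```python
import os
import tempfile
import numpy as np

# toy population: 5 chromosomes of 8 bits each
rng = np.random.default_rng(0)
pop_demo = [rng.random(8) < 0.7 for _ in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ga_pop_demo.npy")     # hypothetical file name

    # stack to shape (size, n_feat) and force a plain bool dtype
    np.save(path, np.stack(pop_demo).astype(bool))

    # allow_pickle is left at its default (False)
    restored = list(np.load(path))

round_trip_ok = all(np.array_equal(a, b) for a, b in zip(pop_demo, restored))
print("chromosomes restored:", len(restored), "| identical:", round_trip_ok)
```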


In [11]:
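# ================================
# 7. chunk 2: gen 10-19
# (run later, in a new cell; history_chromo is kept in memory only,
#  so re-create it as an empty list if the kernel was restarted)
# ================================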
pop = list(np.load(GA_POP_PATH, allow_pickle=True))
history_score = list(np.load(GA_SCORE_PATH, allow_pickle=True))

for gen in range(10, 20):   # 10..19
    pop, bc, bs = ga_one_step(pop, size=size, crossover_rate=0.8, mutation_rate=0.05)
    print(f"gen {gen:02d}  best_acc={bs:.4f}  on_features={bc.sum()}")
    history_chromo.append(bc)
    history_score.append(bs)

np.save(GA_POP_PATH, np.array(pop, dtype=object))
np.save(GA_SCORE_PATH, np.array(history_score))
print("saved after 10-19")


gen 10  best_acc=0.7000  on_features=459
gen 11  best_acc=0.7000  on_features=459
gen 12  best_acc=0.7000  on_features=459
gen 13  best_acc=0.7000  on_features=459
gen 14  best_acc=0.7000  on_features=459
gen 15  best_acc=0.7000  on_features=459
gen 16  best_acc=0.7000  on_features=459
gen 17  best_acc=0.7000  on_features=459
gen 18  best_acc=0.7000  on_features=459
gen 19  best_acc=0.7000  on_features=459
saved after 10-19


In [12]:
# ================================
# 8. chunk 3: gen 20-29 + pick features
# ================================
pop = list(np.load(GA_POP_PATH, allow_pickle=True))
history_score = list(np.load(GA_SCORE_PATH, allow_pickle=True))

# chromosomes are not checkpointed, so this list restarts here
# and only holds the best chromosome of generations 20-29
history_chromo = []

for gen in range(20, 30):   # 20..29
    pop, bc, bs = ga_one_step(pop, size=size, crossover_rate=0.8, mutation_rate=0.05)
    print(f"gen {gen:02d}  best_acc={bs:.4f}  on_features={bc.sum()}")
    history_chromo.append(bc)
    history_score.append(bs)

np.save(GA_POP_PATH, np.array(pop, dtype=object))
np.save(GA_SCORE_PATH, np.array(history_score))
print("saved after 20-29")

gen 20  best_acc=0.7000  on_features=459
gen 21  best_acc=0.7194  on_features=371
gen 22  best_acc=0.7194  on_features=371
gen 23  best_acc=0.7194  on_features=371
gen 24  best_acc=0.7194  on_features=371
gen 25  best_acc=0.7194  on_features=371
gen 26  best_acc=0.7194  on_features=371
gen 27  best_acc=0.7194  on_features=371
gen 28  best_acc=0.7222  on_features=372
gen 29  best_acc=0.7222  on_features=372
saved after 20-29


In [13]:
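# inspect the best chromosome: pop[0] is the top elite that ga_one_step
# carried over from the last evaluated generation (gen 29)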
best_mask = pop[0]
selected_cols = np.where(best_mask)[0]
print("first 30 selected:", selected_cols[:30])
print("total selected:", len(selected_cols))

first 30 selected: [ 0  4  6  8  9 18 21 22 23 24 26 29 33 34 36 38 39 40 41 42 43 47 48 49
 51 52 53 55 57 58]
total selected: 372
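
The remaining step named in the introduction is to train GA-RF on the selected columns. A minimal sketch of that step, on synthetic data with a random mask standing in for the GA result, could look like this. Every name ending in `_demo`, the data, and the mask are assumptions for illustration; the numbers it prints say nothing about the MGMT data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in: 53 cases split into 42 train / 11 hold-out
X_all, y_all = make_classification(
    n_samples=53, n_features=40, n_informative=5, random_state=0
)
X_tr_demo, X_ho_demo, y_tr_demo, y_ho_demo = train_test_split(
    X_all, y_all, test_size=11, stratify=y_all, random_state=42
)

# random mask as a placeholder for the GA's best chromosome
rng = np.random.default_rng(0)
mask_demo = rng.random(X_tr_demo.shape[1]) < 0.5
cols_demo = np.where(mask_demo)[0]

# fit on the selected training columns only
ga_rf_demo = RandomForestClassifier(
    n_estimators=100, random_state=42, class_weight="balanced_subsample"
)
ga_rf_demo.fit(X_tr_demo[:, cols_demo], y_tr_demo)

# apply the same column selection to the hold-out set, evaluate once
pred_demo = ga_rf_demo.predict(X_ho_demo[:, cols_demo])
proba_demo = ga_rf_demo.predict_proba(X_ho_demo[:, cols_demo])[:, 1]
acc_demo = accuracy_score(y_ho_demo, pred_demo)
auc_demo = roc_auc_score(y_ho_demo, proba_demo)
print(f"hold-out accuracy: {acc_demo:.3f}  AUC: {auc_demo:.3f}  "
      f"features used: {len(cols_demo)}")
```

With the notebook's variables, the equivalent selection would be `X_train.iloc[:, selected_cols]` for fitting and `X_val.iloc[:, selected_cols]` for the single hold-out evaluation.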
