# GBM MGMT Methylation Prediction from Radiomics (TCGA-GBM / BraTS-TCGA-GBM)

This notebook trains machine learning models to predict MGMT promoter methylation status in glioblastoma (GBM) using pre-extracted radiomic features.

**Data source**
- Radiomic features were extracted from pre-operative, skull-stripped, co-registered multi-modal MRI (T1, T1-Gd, T2, FLAIR) of TCGA-GBM subjects.
- The dataset (BraTS-TCGA-GBM) includes volumetric features, distance-to-ventricle features, intensity statistics, histogram features, spatial location features, shape descriptors, and multiple texture families (GLCM, GLRLM, GLSZM, NGTDM).
- Each row corresponds to one subject (patient), identified by an `ID` like `TCGA-06-0125`.

**Goal**
1. Load radiomic features.
2. Attach MGMT promoter methylation status (methylated vs unmethylated).
3. Train and evaluate several models:
   - Dummy baseline
   - Random Forest
   - XGBoost
   - RBF SVM

We will report accuracy, sensitivity, specificity, and ROC AUC using stratified 5-fold cross-validation, which is a common evaluation setup in MGMT radiogenomics studies.
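
The evaluation setup described above can be sketched as follows. This is a minimal illustration on synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins, not the TCGA-GBM features), and it uses an RBF SVM only because that is one of the models listed. Sensitivity and specificity are derived from the confusion matrix, treating class 1 (methylated) as the positive class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the radiomics table (assumption: not real data).
X_demo, y_demo = make_classification(
    n_samples=60, n_features=20, n_informative=5, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rows = []

for train_idx, test_idx in cv.split(X_demo, y_demo):
    # Scaling sits inside the pipeline so it is fit on the training fold only.
    model = Pipeline([
        ("scale", StandardScaler()),
        ("svm", SVC(kernel="rbf")),
    ])
    model.fit(X_demo[train_idx], y_demo[train_idx])

    y_true = y_demo[test_idx]
    y_pred = model.predict(X_demo[test_idx])
    # Continuous decision scores are sufficient for ROC AUC.
    y_score = model.decision_function(X_demo[test_idx])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fold_rows.append({
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),  # recall for class 1
        "specificity": tn / (tn + fp),  # recall for class 0
        "roc_auc": roc_auc_score(y_true, y_score),
    })

# Average each metric over the five folds.
cv_summary = {k: float(np.mean([row[k] for row in fold_rows])) for k in fold_rows[0]}
print(cv_summary)
```

Keeping the scaler inside the pipeline matters with so few subjects: fitting it on all rows before splitting would leak held-out statistics into training.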


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

from collections import Counter

## 1. Load radiomic features

We load the radiomic feature table.

What this table is:
- `ID`: subject identifier such as `TCGA-06-0125`.
- `Date`: scan date.
- `VOLUME_*`: enhancing tumor core / edema / whole tumor volumes and ratios.
- `DIST_*`: distance of tumor components from anatomical landmarks (e.g. ventricles).
- `INTENSITY_*`: mean / std of intensities per sequence and subregion.
- `HISTO_*`: histogram bin counts of intensities.
- `SPATIAL_*`: coarse anatomical location flags (frontal, temporal, insula, etc.).
- Shape descriptors like `ECCENTRICITY_*`, `SOLIDITY_*`.
- Texture features:
  - `TEXTURE_GLCM_*`
  - `TEXTURE_GLRLM_*`
  - `TEXTURE_GLSZM_*`
  - `TEXTURE_NGTDM_*`
- Tumor growth model features: `TGM_*`.

In this step we:
1. Import libraries.
2. Read the training and test CSVs (radiomic features).
3. Do a very light inspection (shape, head, columns).


In [2]:
# ------------------------------------------------------------------
# 1. Load training data
# ------------------------------------------------------------------

# load dataset
train_radiomics_path = "../data/Training_dataset.csv"
test_radiomics_path = "../data/test_dataset.csv"

df_train = pd.read_csv(train_radiomics_path)

print("Radiomics dataframe loaded.")
print("Shape (rows, cols):", df_train.shape)

# Show the first few rows just to confirm it looks correct
display(df_train.head(5))

# Save a copy of the original columns for reference
all_columns_train = df_train.columns.tolist()
print("Number of columns:", len(all_columns_train))


Radiomics dataframe loaded.
Shape (rows, cols): (53, 734)


,Unnamed: 0,ID,IDH1_status,MGMT_status,Methyl_class,G-CIMP,Exp_class,Therapy_class,Age,Gender,...,TGM_Cog_Z_4,TGM_T_4,TGM_Cog_X_5,TGM_Cog_Y_5,TGM_Cog_Z_5,TGM_T_5,TGM_Cog_X_6,TGM_Cog_Y_6,TGM_Cog_Z_6,TGM_T_6
0,0,TCGA-02-0006,WT,UNMETHYLATED,CL_2,non-G-CIMP,Mesenchymal,"Standard Radiation, TMZ Chemo",56.2,FEMALE,...,,,,,,,,,,
1,1,TCGA-02-0009,WT,METHYLATED,CL_4,non-G-CIMP,Classical,Nonstandard Radiation,61.5,FEMALE,...,,,,,,,,,,
2,2,TCGA-02-0011,WT,METHYLATED,CL_6,non-G-CIMP,Proneural,"TMZ Chemoradiation, TMZ Chemo",19.0,FEMALE,...,,,,,,,,,,
3,3,TCGA-02-0027,WT,METHYLATED,CL_1,non-G-CIMP,Classical,"TMZ Chemoradiation, TMZ Chemo",33.9,FEMALE,...,,,,,,,,,,
4,4,TCGA-02-0033,WT,METHYLATED,CL_2,non-G-CIMP,Mesenchymal,Standard Radiation,55.0,MALE,...,,,,,,,,,,


Number of columns: 734
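
The preview above shows `NaN` in the trailing `TGM_*` columns. A per-column missing-value summary is a cheap way to see how widespread that is before modeling. The sketch below uses a small toy frame (`df_demo` and its values are hypothetical); the same two lines would apply to `df_train`.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one fully missing, one half missing.
df_demo = pd.DataFrame({
    "VOLUME_ET": [1662, 4362, 33404, 12114],
    "TGM_T_5": [np.nan, np.nan, np.nan, np.nan],
    "TGM_T_4": [0.5, np.nan, np.nan, 0.7],
})

# Fraction of missing values per column, worst first.
missing_fraction = df_demo.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns with no observed values carry no information for a model.
all_missing_cols = missing_fraction[missing_fraction == 1.0].index.tolist()
print("Columns with no observed values:", all_missing_cols)  # ['TGM_T_5']
```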


In [3]:
# Inspect the column names to identify the label, metadata, and feature columns
all_columns_train

['Unnamed: 0',
 'ID',
 'IDH1_status',
 'MGMT_status',
 'Methyl_class',
 'G-CIMP',
 'Exp_class',
 'Therapy_class',
 'Age',
 'Gender',
 'VOLUME_ET',
 'VOLUME_NET',
 'VOLUME_ED',
 'VOLUME_TC',
 'VOLUME_WT',
 'VOLUME_BRAIN',
 'VOLUME_ET_OVER_NET',
 'VOLUME_ET_OVER_ED',
 'VOLUME_NET_OVER_ED',
 'VOLUME_ET_over_TC',
 'VOLUME_NET_over_TC',
 'VOLUME_ED_over_TC',
 'VOLUME_ET_OVER_WT',
 'VOLUME_NET_OVER_WT',
 'VOLUME_ED_OVER_WT',
 'VOLUME_TC_OVER_WT',
 'VOLUME_ET_OVER_BRAIN',
 'VOLUME_NET_OVER_BRAIN',
 'VOLUME_ED_over_BRAIN',
 'VOLUME_TC_over_BRAIN',
 'VOLUME_WT_OVER_BRAIN',
 'DIST_Vent_TC',
 'DIST_Vent_ED',
 'INTENSITY_Mean_ET_T1Gd',
 'INTENSITY_STD_ET_T1Gd',
 'INTENSITY_Mean_ET_T1',
 'INTENSITY_STD_ET_T1',
 'INTENSITY_Mean_ET_T2',
 'INTENSITY_STD_ET_T2',
 'INTENSITY_Mean_ET_FLAIR',
 'INTENSITY_STD_ET_FLAIR',
 'INTENSITY_Mean_NET_T1Gd',
 'INTENSITY_STD_NET_T1Gd',
 'INTENSITY_Mean_NET_T1',
 'INTENSITY_STD_NET_T1',
 'INTENSITY_Mean_NET_T2',
 'INTENSITY_STD_NET_T2',
 'INTENSITY_Mean_NET_FLAIR',
 'I

In [4]:

df_test = pd.read_csv(test_radiomics_path)

print("Radiomics dataframe loaded.")
print("Shape (rows, cols):", df_test.shape)

# Show the first few rows 
display(df_test.head(5))

# Save a copy of the original columns for reference
all_columns_test = df_test.columns.tolist()
print("Number of columns:", len(all_columns_test))

Radiomics dataframe loaded.
Shape (rows, cols): (65, 725)


,MGMT promoter status,VOLUME_ET,VOLUME_NET,VOLUME_ED,VOLUME_TC,VOLUME_WT,VOLUME_BRAIN,VOLUME_ET_OVER_NET,VOLUME_ET_OVER_ED,VOLUME_NET_OVER_ED,...,TGM_Cog_Z_4,TGM_T_4,TGM_Cog_X_5,TGM_Cog_Y_5,TGM_Cog_Z_5,TGM_T_5,TGM_Cog_X_6,TGM_Cog_Y_6,TGM_Cog_Z_6,TGM_T_6
0,Unmethylated,224,23189,27760,23413,51173,1260788,0.00966,0.008069,0.835339,...,,,,,,,,,,
1,Methylated,0,77985,117448,77985,195433,1421096,0.0,0.0,0.663996,...,,,,,,,,,,
2,Methylated,1276,110787,118819,112063,230882,1469438,0.011518,0.010739,0.932401,...,,,,,,,,,,
3,Methylated,89546,93298,7407,182844,190251,1348463,0.959785,12.089375,12.595923,...,,,,,,,,,,
4,Unmethylated,0,10557,24143,10557,34700,1270855,0.0,0.0,0.43727,...,,,,,,,,,,


Number of columns: 725


In [5]:
all_columns_test

['MGMT promoter status',
 'VOLUME_ET',
 'VOLUME_NET',
 'VOLUME_ED',
 'VOLUME_TC',
 'VOLUME_WT',
 'VOLUME_BRAIN',
 'VOLUME_ET_OVER_NET',
 'VOLUME_ET_OVER_ED',
 'VOLUME_NET_OVER_ED',
 'VOLUME_ET_over_TC',
 'VOLUME_NET_over_TC',
 'VOLUME_ED_over_TC',
 'VOLUME_ET_OVER_WT',
 'VOLUME_NET_OVER_WT',
 'VOLUME_ED_OVER_WT',
 'VOLUME_TC_OVER_WT',
 'VOLUME_ET_OVER_BRAIN',
 'VOLUME_NET_OVER_BRAIN',
 'VOLUME_ED_over_BRAIN',
 'VOLUME_TC_over_BRAIN',
 'VOLUME_WT_OVER_BRAIN',
 'DIST_Vent_TC',
 'DIST_Vent_ED',
 'INTENSITY_Mean_ET_T1Gd',
 'INTENSITY_STD_ET_T1Gd',
 'INTENSITY_Mean_ET_T1',
 'INTENSITY_STD_ET_T1',
 'INTENSITY_Mean_ET_T2',
 'INTENSITY_STD_ET_T2',
 'INTENSITY_Mean_ET_FLAIR',
 'INTENSITY_STD_ET_FLAIR',
 'INTENSITY_Mean_NET_T1Gd',
 'INTENSITY_STD_NET_T1Gd',
 'INTENSITY_Mean_NET_T1',
 'INTENSITY_STD_NET_T1',
 'INTENSITY_Mean_NET_T2',
 'INTENSITY_STD_NET_T2',
 'INTENSITY_Mean_NET_FLAIR',
 'INTENSITY_STD_NET_FLAIR',
 'INTENSITY_Mean_ED_T1Gd',
 'INTENSITY_STD_ED_T1Gd',
 'INTENSITY_Mean_ED_T1',
 'INT

## 2. Build supervised dataset (X features, y label)

Goal of this step:
1. Pick out the target label column (`MGMT_status` in the training CSV, `MGMT promoter status` in the test CSV).
   - We convert `"UNMETHYLATED" -> 0`, `"METHYLATED" -> 1`, ignoring case and surrounding whitespace.
2. Drop non-feature metadata columns (like `ID`, `Date`) so the model doesn’t try to “learn the patient name”.
3. Create:
   - `X` = numeric radiomic features (will go into the model)
   - `y` = MGMT label as 0/1 (ground truth)
4. Split the training data into train/validation sets using a stratified split (so class balance is preserved).

After this step we will have:
- `X_train, X_val, y_train, y_val` ready for modeling.
- `X` only contains numeric radiomic features (volumes / texture / etc.).
- `y` is the MGMT promoter methylation status in binary form (0 = unmethylated, 1 = methylated).

Note:
- We assume `Training_dataset.csv` already has a column called `MGMT_status`.
- If your file doesn't have that column, supervised learning for MGMT prediction is impossible until we merge MGMT labels from somewhere else (e.g. legacy TCGA clinical file).


In [6]:
# ------------------------------------------------------------------
# helper: find label column and convert to binary 0/1
# ------------------------------------------------------------------

def extract_label_series(df):
    """
    Find the MGMT label column in this dataframe, and return y (0/1) plus the name of that column.
    We accept either 'MGMT_status' or 'MGMT promoter status'.
    """
    label_candidates = ["MGMT_status", "MGMT promoter status"]
    label_col = None

    for c in label_candidates:
        if c in df.columns:
            label_col = c
            break

    if label_col is None:
        raise ValueError("No MGMT label column found in dataframe.")

    def mgmt_to_binary(x):
        # normalize variations like 'UNMETHYLATED', 'Unmethylated', etc.
        s = str(x).strip().upper()
        if "UNMETH" in s:
            return 0
        if "METH" in s:
            return 1
        raise ValueError(f"Unexpected MGMT label value: {x}")

    y_bin = df[label_col].apply(mgmt_to_binary)
    return y_bin, label_col
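

The substring checks in `mgmt_to_binary` depend on order: `"UNMETH"` has to be tested before `"METH"` because the first string contains the second. An exact-match mapping avoids that dependency. The following is an alternative sketch, not the helper used in this notebook; `raw`, `label_map`, and `y_demo` are hypothetical names.

```python
import pandas as pd

# Hypothetical label values covering the spellings seen in the two CSVs.
raw = pd.Series(["UNMETHYLATED", "Methylated", " unmethylated ", "METHYLATED"])

label_map = {"UNMETHYLATED": 0, "METHYLATED": 1}
normalized = raw.str.strip().str.upper()
y_demo = normalized.map(label_map)

# map() yields NaN for anything outside label_map, so check before casting.
unexpected = raw[y_demo.isna()]
if not unexpected.empty:
    raise ValueError(f"Unexpected MGMT label values: {unexpected.unique().tolist()}")

y_demo = y_demo.astype(int)
print(y_demo.tolist())  # [0, 1, 0, 1]
```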


In [7]:

# ------------------------------------------------------------------
# helper: build feature matrix X from a df
# ------------------------------------------------------------------

def build_feature_matrix(df, label_col):
    """
    1. Define a master list of columns we NEVER want to feed to the model.
       (Some exist only in train, some only in test; we drop whichever are present.)
    2. Drop them if present.
    3. Keep numeric columns only.
    """
    non_feature_cols_master = [
        "Unnamed: 0",            # artifact index col
        "ID",                    # TCGA barcode / subject identifier
        "IDH1_status",           # molecular subtype info, may leak biology
        "Methyl_class",
        "G-CIMP",
        "Exp_class",
        "Therapy_class",
        "Gender",                # categorical text; we'll ignore for now
        label_col,               # the ground-truth label itself
    ]

    # Drop columns that actually exist in this df
    to_drop = [c for c in non_feature_cols_master if c in df.columns]
    df_reduced = df.drop(columns=to_drop, errors="ignore")

    # Keep numeric-only columns
    X_num = df_reduced.select_dtypes(include=[np.number])

    return X_num


In [8]:

# ------------------------------------------------------------------
# 2A. TRAIN SET: get y_train_full / X_train_full
# ------------------------------------------------------------------

y_train_full, train_label_col = extract_label_series(df_train)
print("Training label column used:", train_label_col)
print("Training label distribution:", Counter(y_train_full))

X_train_full = build_feature_matrix(df_train, train_label_col)
print("Training features shape after cleanup:", X_train_full.shape)

# sanity check: align shapes
assert len(X_train_full) == len(y_train_full), "Row count mismatch in training set."

Training label column used: MGMT_status
Training label distribution: Counter({0: 27, 1: 26})
Training features shape after cleanup: (53, 725)


In [9]:
X_train_full.head(5)

,Age,VOLUME_ET,VOLUME_NET,VOLUME_ED,VOLUME_TC,VOLUME_WT,VOLUME_BRAIN,VOLUME_ET_OVER_NET,VOLUME_ET_OVER_ED,VOLUME_NET_OVER_ED,...,TGM_Cog_Z_4,TGM_T_4,TGM_Cog_X_5,TGM_Cog_Y_5,TGM_Cog_Z_5,TGM_T_5,TGM_Cog_X_6,TGM_Cog_Y_6,TGM_Cog_Z_6,TGM_T_6
0,56.2,1662,384,36268,2046,38314,1469432,4.328125,0.045826,0.010588,...,,,,,,,,,,
1,61.5,4362,4349,15723,8711,24434,1295721,1.002989,0.277428,0.276601,...,,,,,,,,,,
2,19.0,33404,48612,45798,82016,127814,1425843,0.687155,0.729377,1.061444,...,,,,,,,,,,
3,33.9,12114,7587,34086,19701,53787,1403429,1.596679,0.355395,0.222584,...,,,,,,,,,,
4,55.0,34538,7137,65653,41675,107328,1365237,4.839288,0.526069,0.108708,...,,,,,,,,,,


In [10]:
y_train_full.head(5)

0    0
1    1
2    1
3    1
4    1
Name: MGMT_status, dtype: int64

In [11]:

# ------------------------------------------------------------------
# 2B. TEST SET: get y_test / X_test
# ------------------------------------------------------------------

y_test, test_label_col = extract_label_series(df_test)
print("Test label column used:", test_label_col)
print("Test label distribution:", Counter(y_test))

X_test = build_feature_matrix(df_test, test_label_col)
print("Test features shape after cleanup:", X_test.shape)

assert len(X_test) == len(y_test), "Row count mismatch in test set."

Test label column used: MGMT promoter status
Test label distribution: Counter({1: 52, 0: 13})
Test features shape after cleanup: (65, 724)
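
Note that the cleaned training matrix has 725 columns while the cleaned test matrix has 724, so a model fit on one cannot be applied to the other without aligning columns first. The training preview includes `Age`, which does not appear in the (truncated) test column listing, so that is a likely candidate for the extra column, but this should be checked rather than assumed. A minimal sketch of that check on toy frames (`X_a` and `X_b` are hypothetical stand-ins for `X_train_full` and `X_test`):

```python
import pandas as pd

# Toy frames with mismatched columns in a different order.
X_a = pd.DataFrame({"Age": [56.2, 61.5], "VOLUME_ET": [1662, 4362], "VOLUME_ED": [36268, 15723]})
X_b = pd.DataFrame({"VOLUME_ED": [27760, 117448], "VOLUME_ET": [224, 0]})

# Columns present in one frame but not the other.
only_in_a = [c for c in X_a.columns if c not in X_b.columns]
only_in_b = [c for c in X_b.columns if c not in X_a.columns]
print("Only in first frame:", only_in_a)   # ['Age']
print("Only in second frame:", only_in_b)  # []

# Keep the shared columns, in the first frame's order, for both frames.
shared = [c for c in X_a.columns if c in X_b.columns]
X_a_aligned = X_a[shared]
X_b_aligned = X_b[shared]

print(list(X_b_aligned.columns))  # ['VOLUME_ET', 'VOLUME_ED']
```

Reordering to a common column list also guards against the two CSVs storing the same features in a different order.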


In [12]:

# ------------------------------------------------------------------
# 2C. Train/Val split (internal validation on training data)
# ------------------------------------------------------------------

X_train, X_val, y_train, y_val = train_test_split(
    X_train_full,
    y_train_full,
    test_size=0.2,
    random_state=42,
    stratify=y_train_full,
)

print("Internal split shapes:")
print(f"  X_train: {X_train.shape}   X_val: {X_val.shape}")
print(f"  y_train: {y_train.shape}   y_val: {y_val.shape}")

# helper: make a small summary table for class balance
def class_balance_frame(y, name):
    vc = y.value_counts(dropna=False)
    frac = y.value_counts(normalize=True, dropna=False)
    out = pd.DataFrame({
        "count": vc,
        "fraction": frac
    })
    out.index.name = f"{name} label"
    return out

print("\nClass balance in training split:")
display(class_balance_frame(y_train, "train"))

print("Class balance in validation split:")
display(class_balance_frame(y_val, "val"))

print("Class balance in FULL external test set:")
display(class_balance_frame(y_test, "test"))


Internal split shapes:
  X_train: (42, 725)   X_val: (11, 725)
  y_train: (42,)   y_val: (11,)

Class balance in training split:


train label,count,fraction
1,21,0.5
0,21,0.5


Class balance in validation split:


val label,count,fraction
0,6,0.545455
1,5,0.454545


Class balance in FULL external test set:


test label,count,fraction
1,52,0.8
0,13,0.2
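
The training data is roughly balanced, but the external test set is 80% methylated. That shifts what a trivial model achieves: always predicting "methylated" scores 0.8 accuracy on the test set while having zero specificity. The sketch below reproduces this with a constant-prediction dummy model on labels that have the same counts as the test set (the feature matrix is a placeholder, since the dummy model ignores it).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Same class counts as the external test set: 52 methylated, 13 unmethylated.
y_like_test = np.array([1] * 52 + [0] * 13)
X_placeholder = np.zeros((len(y_like_test), 1))  # ignored by the dummy model

# Always predict class 1 ("methylated").
always_methylated = DummyClassifier(strategy="constant", constant=1)
always_methylated.fit(X_placeholder, y_like_test)
y_pred = always_methylated.predict(X_placeholder)

acc = accuracy_score(y_like_test, y_pred)
bal_acc = balanced_accuracy_score(y_like_test, y_pred)
print(acc)      # 0.8
print(bal_acc)  # 0.5
```

This is why sensitivity, specificity, and ROC AUC are worth reporting alongside accuracy on the test set.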


In [13]:
# ------------------------------------------------------------------
# 2D. Persist split to disk so other notebooks can reuse it
# ------------------------------------------------------------------

output_dir = "../data"

# X_* are DataFrames already
X_train.to_csv(f"{output_dir}/X_train.csv", index=False)
X_val.to_csv(f"{output_dir}/X_val.csv", index=False)

# y_* are Series, turn them into single-column DataFrames for clean CSVs
y_train.to_frame(name="MGMT_status_bin").to_csv(f"{output_dir}/y_train.csv", index=False)
y_val.to_frame(name="MGMT_status_bin").to_csv(f"{output_dir}/y_val.csv", index=False)

print("\nSaved the following files to ../data:")
print(" - X_train.csv")
print(" - X_val.csv")
print(" - y_train.csv")
print(" - y_val.csv")


Saved the following files to ../data:
 - X_train.csv
 - X_val.csv
 - y_train.csv
 - y_val.csv
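
Because the files are written with `index=False`, the original row indices from the split are not stored, so features and labels are linked only by row order when they are read back. A minimal round-trip sketch, using a temporary directory and toy data (`X_demo`, `y_demo` are hypothetical) rather than the real `../data` files:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy split with a shuffled index, as train_test_split would produce.
X_demo = pd.DataFrame({"VOLUME_ET": [1662, 4362, 33404]}, index=[7, 2, 5])
y_demo = pd.Series([0, 1, 1], index=[7, 2, 5], name="MGMT_status")

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    X_demo.to_csv(out / "X_train.csv", index=False)
    y_demo.to_frame(name="MGMT_status_bin").to_csv(out / "y_train.csv", index=False)

    # Reload the way a downstream notebook might.
    X_back = pd.read_csv(out / "X_train.csv")
    y_back = pd.read_csv(out / "y_train.csv")["MGMT_status_bin"]

# The shuffled index is gone; rows are matched by position.
print(X_back.index.tolist())  # [0, 1, 2]
assert len(X_back) == len(y_back)
```

If subject-level traceability is needed later, the `ID` column would have to be saved separately, since it is dropped from `X` before the split.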
