
# Train a CatBoost Voxel-wise Recurrence Classifier

This notebook provides a scaffolding workflow to recreate the voxel-level glioblastoma recurrence model used by the inference pipeline. It mirrors the repository's feature extraction, scaling, and CatBoost training steps while leaving space for you to plug in your own cohort specifics.



## 1. Environment setup

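Install a toolchain that is compatible with CatBoost and PyRadiomics. The pins below target Python 3.10; on other interpreter versions pip may find no matching distribution for some of them (for example `SimpleITK==2.2.1`), so adjust the pins as long as the APIs remain compatible.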


In [4]:
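# Requires Python 3.10 to be installed and visible to the Windows launcher (`py -0` lists runtimes).
# Each `!` command runs in its own shell, so activating the venv would not persist;
# call the venv's interpreter directly instead.
!py -3.10 -m venv .venv
!.\.venv\Scripts\python -m pip install --upgrade pip setuptools wheel
!.\.venv\Scripts\python -m pip install numpy==1.26.4 pandas==2.1.4 pyarrow==14.0.2 scikit-learn==1.3.2 SimpleITK==2.2.1 nibabel==5.1.0 tqdm==4.66.1 catboost==1.2.5 pyradiomics==3.0.1 ipykernel
!.\.venv\Scripts\python -m ipykernel install --user --name gbm-ml --display-name "Python 3.10 (gbm-ml)"
# Afterwards, switch this notebook to the "Python 3.10 (gbm-ml)" kernel.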


No suitable Python runtime found
Pass --list (-0) to see all detected environments on your machine
or set environment variable PYLAUNCHER_ALLOW_INSTALL to use winget
or open the Microsoft Store to the requested version.
The system cannot find the path specified.



ERROR: Ignored the following versions that require a different python version: 1.21.2 Requires-Python >=3.7,<3.11; 1.21.3 Requires-Python >=3.7,<3.11; 1.21.4 Requires-Python >=3.7,<3.11; 1.21.5 Requires-Python >=3.7,<3.11; 1.21.6 Requires-Python >=3.7,<3.11
ERROR: Could not find a version that satisfies the requirement SimpleITK==2.2.1 (from versions: 1.0.1, 1.2.0, 2.1.0, 2.1.1.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.2)
ERROR: No matching distribution found for SimpleITK==2.2.1


Installed kernelspec gbm-ml in C:\Users\phkya\AppData\Roaming\jupyter\kernels\gbm-ml
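
Because the install step can fail silently for individual packages, it may help to confirm what the active kernel actually has installed. The sketch below compares the pins from the install cell against the installed versions using the standard library's `importlib.metadata`; the helper name `check_pins` is an illustration, not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.1.4",
    "pyarrow": "14.0.2",
    "scikit-learn": "1.3.2",
    "SimpleITK": "2.2.1",
    "nibabel": "5.1.0",
    "tqdm": "4.66.1",
    "catboost": "1.2.5",
    "pyradiomics": "3.0.1",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this kernel
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_pins(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pin is satisfied in the kernel that runs the cell.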



## 2. Imports & paths

Point the configuration at your pre-processed dataset. Each patient folder should contain the five MRI sequences, the peritumoral mask, and a recurrence (or recurrence-free) label map for supervision.


In [None]:

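import sys
from pathlib import Path

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A notebook has no parent package, so a relative import such as
# `from ..extract_feature_functions import ...` raises ImportError.
# Assumption: extract_feature_functions.py sits one directory above this notebook.
REPO_ROOT = Path.cwd().parent  # TODO: adjust if the notebook lives elsewhere
if str(REPO_ROOT) not in sys.path:
    sys.path.insert(0, str(REPO_ROOT))

from extract_feature_functions import create_dataset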

# Configure the root directory that contains individual patient sub-folders
DATA_ROOT = Path("Patients")  # TODO: update with your actual dataset location

# Name of the segmentation containing voxel-wise supervision labels
RECURRENCE_MASK_NAME = "recurrence.nii.gz"  # TODO: adjust to your label file name

# Destination for the fitted scaler and CatBoost model
OUTPUT_DIR = Path("artifacts")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


ImportError: attempted relative import with no known parent package
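
Before running the extractor it can be useful to check that every patient folder has the files described above. The following is a minimal sketch: `find_incomplete_patients` is a hypothetical helper, and the list of required file names is an assumption you would replace with your own sequence, mask, and label names.

```python
import tempfile
from pathlib import Path


def find_incomplete_patients(root, required_files):
    """Map each patient folder name to the required files it is missing."""
    incomplete = {}
    for patient_dir in sorted(Path(root).iterdir()):
        if not patient_dir.is_dir():
            continue  # skip stray files next to the patient folders
        missing = [name for name in required_files if not (patient_dir / name).exists()]
        if missing:
            incomplete[patient_dir.name] = missing
    return incomplete


# Demo on a throwaway directory: P01 is complete, P02 lacks its label map.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "P01").mkdir()
    (root / "P01" / "recurrence.nii.gz").touch()
    (root / "P02").mkdir()
    report = find_incomplete_patients(root, ["recurrence.nii.gz"])

print(report)
```

On the real dataset you would call it as `find_incomplete_patients(DATA_ROOT, [RECURRENCE_MASK_NAME, ...])` with the remaining file names filled in for your cohort.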


## 3. Feature extraction

Run the repository's voxel-wise extractor to ensure every patient has a `voxel_features.parquet`. If you already generated them with the inference pipeline you can skip this step.


In [8]:

# This call mirrors the production pipeline and uses Params.yaml automatically.
# It only processes patients that are missing their parquet feature tables.
create_dataset(str(DATA_ROOT))


NameError: name 'create_dataset' is not defined


## 4. Assemble voxel-level design matrix

The loop below loads each patient's radiomic features and the matching voxel labels, then concatenates them into one raw design matrix. Imputing NaNs with per-feature means and scaling happen in the next section, once the raw features have been collected.


In [None]:

import nibabel as nib
from tqdm.notebook import tqdm

# Collect per-voxel features and labels for every patient
feature_frames = []
label_frames = []
patient_ids = []

for patient_dir in tqdm(sorted(DATA_ROOT.iterdir()), desc="Patients"):
    if not patient_dir.is_dir():
        continue

    voxel_features_path = patient_dir / "voxel_features.parquet"
    recurrence_mask_path = patient_dir / RECURRENCE_MASK_NAME

    if not voxel_features_path.exists():
        raise FileNotFoundError(f"Missing features for {patient_dir.name}; run the extractor first")
    if not recurrence_mask_path.exists():
        raise FileNotFoundError(f"Missing recurrence labels ({RECURRENCE_MASK_NAME}) for {patient_dir.name}")

    # Load features as produced by the repo utilities
    features = pd.read_parquet(voxel_features_path)

    # Derive the voxel-level binary label from the recurrence segmentation
    recurrence_mask = nib.load(str(recurrence_mask_path)).get_fdata().astype(bool).ravel()

    # The label vector must have one entry per feature row. If the features cover only the
    # peritumoral ROI, restrict the mask to those voxels (same ordering) before this check.
    if features.shape[0] != recurrence_mask.shape[0]:
        raise ValueError(
            f"Label mask for {patient_dir.name} does not match feature voxel count: "
            f"{recurrence_mask.shape[0]} vs {features.shape[0]}"
        )

    feature_frames.append(features)
    label_frames.append(pd.Series(recurrence_mask.astype(np.uint8), index=features.index))
    patient_ids.extend([patient_dir.name] * len(features))

X_raw = pd.concat(feature_frames, axis=0).reset_index(drop=True)
y = pd.concat(label_frames, axis=0).reset_index(drop=True)
patient_ids = np.array(patient_ids)

print(f"Aggregated {len(X_raw)} voxels across {len(np.unique(patient_ids))} patients")
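
The loop flattens the whole label volume, so the length check only passes when the feature table has one row per voxel of the volume. If your feature tables contain only peritumoral voxels, the labels have to be restricted to the same voxels first. The sketch below shows that restriction on toy arrays with NumPy boolean indexing. It assumes the feature rows follow the C-order (row-major) traversal of the ROI mask, which you should verify against the extractor; the arrays stand in for volumes you would load with `nibabel`.

```python
import numpy as np

# Toy 3D volumes standing in for the peritumoral mask and the recurrence label map.
roi_mask = np.zeros((2, 3, 3), dtype=bool)
roi_mask[0, 1, :] = True   # three ROI voxels in the first slice
roi_mask[1, 0, 1] = True   # one ROI voxel in the second slice

recurrence_volume = np.zeros((2, 3, 3), dtype=bool)
recurrence_volume[0, 1, 2] = True   # recurrence inside the ROI
recurrence_volume[1, 2, 2] = True   # recurrence outside the ROI (dropped below)

# Boolean indexing visits voxels in C order, so this matches flattening both
# arrays with ravel() and then selecting the ROI positions.
labels_in_roi = recurrence_volume[roi_mask].astype(np.uint8)
same_as_flat = recurrence_volume.ravel()[roi_mask.ravel()].astype(np.uint8)

print(labels_in_roi.tolist())
```

After this step the label vector has exactly `roi_mask.sum()` entries, which is the count the shape check should compare against the number of feature rows.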



## 5. Handle missing values & scale features

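This notebook replicates the inference-time preprocessing by imputing each feature with its mean and fitting a `StandardScaler` (or whichever transformer you prefer). Both the column means and the scaler are fitted on all patients here, before the split in the next section; fit them on the training patients only if you need a strictly leakage-free validation estimate. Save the fitted scaler to reuse during inference.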


In [None]:

import joblib

# Impute NaN values with column means
X_imputed = X_raw.fillna(X_raw.mean())

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

# Persist the fitted scaler for later use
scaler_path = OUTPUT_DIR / "voxel_scaler.joblib"
joblib.dump(scaler, scaler_path)
print(f"Scaler saved to {scaler_path}")
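
The cell above persists the scaler but not the column means used for imputation. If new data is to be imputed with the same values, the means have to be kept as well. The toy example below illustrates the idea with two made-up features (`f1`, `f2` are placeholder names); it is a sketch of the principle, not the repository's inference code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training table with one missing value per feature.
train = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [10.0, 20.0, np.nan]})
# Toy unseen row whose f1 value is missing.
new = pd.DataFrame({"f1": [np.nan], "f2": [15.0]})

train_means = train.mean()  # NaNs are skipped by default
scaler = StandardScaler().fit(train.fillna(train_means))

# At inference, reuse the stored means and scaler rather than recomputing them.
new_scaled = scaler.transform(new.fillna(train_means))
print(new_scaled)
```

In the notebook, `X_raw.mean()` plays the role of `train_means` and could be saved next to the scaler with `joblib.dump`.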



## 6. Train/validation split

Construct a patient-level split to avoid voxel leakage. Adjust the strategy if you need cross-validation or stratification.


In [None]:

# Derive patient-level indices for splitting
unique_patients = np.unique(patient_ids)
train_patients, valid_patients = train_test_split(
    unique_patients,
    test_size=0.2,
    random_state=42,
)

train_mask = np.isin(patient_ids, train_patients)
valid_mask = np.isin(patient_ids, valid_patients)

X_train, X_valid = X_scaled[train_mask], X_scaled[valid_mask]
y_train, y_valid = y.iloc[train_mask], y.iloc[valid_mask]

print(f"Training voxels: {X_train.shape[0]} | Validation voxels: {X_valid.shape[0]}")
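
An equivalent way to build a patient-level split is scikit-learn's `GroupShuffleSplit`, which takes the per-voxel patient IDs as groups and returns row indices directly. The sketch below runs on synthetic data (10 toy patients with 50 voxels each) and verifies that no patient appears on both sides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
# Synthetic stand-ins for patient_ids, X_scaled and y.
groups = np.repeat([f"P{i:02d}" for i in range(10)], 50)
X = rng.normal(size=(len(groups), 4))
y = rng.integers(0, 2, size=len(groups))

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(splitter.split(X, y, groups=groups))

# Every voxel of a patient must land on the same side of the split.
train_groups = set(groups[train_idx])
valid_groups = set(groups[valid_idx])
overlap = train_groups & valid_groups
print(f"train patients: {len(train_groups)}, valid patients: {len(valid_groups)}, overlap: {len(overlap)}")
```

The same check (`set(train_patients) & set(valid_patients)` being empty) applies to the split produced by the cell above. Raising `n_splits` gives repeated random patient-level splits.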



## 7. Configure and train CatBoost

Define the CatBoost hyperparameters that mirror the production model. The block below uses binary log loss and leaves `class_weights` unset; pass weights (see the inline comment) to mitigate class imbalance, and tune the remaining values as needed.


In [None]:

cat_params = dict(
    loss_function="Logloss",
    learning_rate=0.05,
    depth=6,
    iterations=2000,
    l2_leaf_reg=3.0,
    random_seed=42,
    verbose=100,
    class_weights=None,  # e.g., {0: 1.0, 1: 5.0} if recurrence voxels are rare
)

train_pool = Pool(X_train, label=y_train)
valid_pool = Pool(X_valid, label=y_valid)

model = CatBoostClassifier(**cat_params)
model.fit(train_pool, eval_set=valid_pool, use_best_model=True)
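
If recurrence voxels are rare, one common heuristic is to weight the positive class by the ratio of negative to positive training labels. The helper below is a hypothetical illustration of that heuristic, not the production setting. It returns a dictionary in the same form as the `class_weights` example in the comment above.

```python
import numpy as np


def balanced_class_weights(labels):
    """Weight class 1 by n_negative / n_positive and keep class 0 at 1.0."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("Both classes must be present to compute weights")
    return {0: 1.0, 1: n_neg / n_pos}


# Toy label vector: 90 background voxels, 10 recurrence voxels.
toy_labels = np.array([0] * 90 + [1] * 10)
weights = balanced_class_weights(toy_labels)
print(weights)
```

In the notebook you would compute the weights from `y_train` only and set `cat_params["class_weights"] = weights` before constructing the model. CatBoost also offers an `auto_class_weights` option as an alternative to explicit weights.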



## 8. Evaluate & inspect

Use CatBoost's built-in metrics or scikit-learn utilities to quantify performance. The following skeleton computes AUC as an example.


In [None]:

from sklearn.metrics import roc_auc_score

valid_pred_proba = model.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, valid_pred_proba)
print(f"Validation ROC-AUC: {auc:.3f}")
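
ROC-AUC is threshold-free, so it says little about the overlap of a thresholded prediction map with the true recurrence region. As one possible complement, the sketch below computes a Dice score at a fixed probability threshold with NumPy; the function name and the default threshold of 0.5 are assumptions chosen for illustration.

```python
import numpy as np


def dice_score(y_true, y_prob, threshold=0.5):
    """Dice overlap between binary labels and thresholded probabilities."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    denom = y_true.sum() + y_pred.sum()
    if denom == 0:
        return 1.0  # both empty: treat as perfect agreement
    return float(2.0 * np.logical_and(y_true, y_pred).sum() / denom)


# Toy example: one true positive, one false negative, one false positive.
print(dice_score([1, 1, 0, 0], [0.9, 0.2, 0.7, 0.1]))
```

With the notebook's variables this would be `dice_score(y_valid, valid_pred_proba)`, optionally evaluated per patient by masking on `patient_ids[valid_mask]`.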



## 9. Persist the training artefacts

Export the CatBoost model together with the scaler and feature names needed by the inference pipeline. The pickle mirrors the structure that `main.py` consumes (`{"scaler": ..., "models_dict": {"CAT": model}}`); the `feature_names` and `catboost_params` entries are additional metadata stored alongside it.


In [None]:

model_path = OUTPUT_DIR / "catboost_voxel_model.cbm"
model.save_model(model_path)
print(f"CatBoost model saved to {model_path}")

# Bundle scaler + model into the repository's expected pickle structure
import pickle

pickle_payload = {
    "scaler": scaler,
    "models_dict": {"CAT": model},
    "feature_names": X_raw.columns.tolist(),
    "catboost_params": cat_params,
}

pickle_path = OUTPUT_DIR / "model.pkl"
with open(pickle_path, "wb") as f:
    pickle.dump(pickle_payload, f)

print(f"Serialized inference bundle saved to {pickle_path}")
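
A quick round-trip check can catch a malformed bundle before it reaches the inference pipeline. The sketch below reloads a pickle and verifies the keys named above (`scaler`, `models_dict`, and the `CAT` entry); `load_bundle` is a hypothetical helper, and the demo uses placeholder strings instead of a real scaler and model.

```python
import pickle
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"scaler", "models_dict"}


def load_bundle(path):
    """Load the pickle bundle and check that the expected keys are present."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)  # only unpickle files you trust
    missing = REQUIRED_KEYS - set(bundle)
    if missing:
        raise KeyError(f"Bundle is missing keys: {sorted(missing)}")
    if "CAT" not in bundle["models_dict"]:
        raise KeyError("models_dict has no 'CAT' entry")
    return bundle


# Demo with placeholder objects written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "model.pkl"
    payload = {"scaler": "scaler-placeholder", "models_dict": {"CAT": "model-placeholder"}}
    with open(demo_path, "wb") as f:
        pickle.dump(payload, f)
    bundle = load_bundle(demo_path)

print(sorted(bundle))
```

For the real artefact, `load_bundle(pickle_path)` should return a dictionary whose `models_dict["CAT"]` can score a few scaled validation rows and reproduce the probabilities from the evaluation step.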



## 10. Next steps

- Perform hyperparameter tuning (e.g., cross-validation, Bayesian optimisation) for better performance.
- Incorporate class balancing strategies if recurrence voxels are scarce.
- Track experiments with tools such as Weights & Biases or MLflow.
- Validate predictions on held-out patients before deploying the model.
