This notebook demonstrates how to fine‑tune the DreaMS model for a binary classification task (detecting chlorine in molecules) using the MassSpecGym dataset. We’ll:

1. **Annotate** the MassSpecGym MGF with chlorine labels.
2. **Prepare** a `ChlorineDetectionDataset` and a `BenchmarkDataModule`.
3. **Train** a baseline MLP classifier.
4. **Fine‑tune** the DreaMS encoder with a classification head (`LitDreamsClassifier`).
5. **Evaluate** on the test split and **save** the checkpoint.

All paths are defined relative to `PROJECT_ROOT` for reproducibility.
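
The `paths` module that provides `PROJECT_ROOT` is not shown in this notebook. As a rough sketch of one way such a root could be located without relying on the current working directory, the hypothetical helper below (`find_project_root` and the `marker` file name are assumptions, not part of the repository) walks upward until it finds a marker file:

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "paths.py") -> Path:
    """Return the nearest ancestor of `start` (or `start` itself) that contains `marker`.

    Hypothetical helper: the marker file name is an assumption for illustration.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")
```

In a notebook this might be called as `find_project_root(Path.cwd())`, which gives the same result whether the kernel starts in `notebooks/` or in a deeper subfolder.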

In [1]:
import sys
from pathlib import Path

# Assume this notebook is run from notebooks/, so the working directory's parent is the repo root
sys.path.append(str(Path().resolve().parent))
from paths import PROJECT_ROOT

from benchmark.utils.data import annotate_mgf_with_label

import torch
from massspecgym.data.transforms import SpecBinner
from benchmark.data.datasets import ChlorineDetectionDataset
from benchmark.data.data_module import BenchmarkDataModule

In [None]:
# Paths
DATA_DIR   = PROJECT_ROOT / "data" / "massspecgym"
ORIG_MGF   = DATA_DIR / "MassSpecGym.mgf"
LABELED_MGF = DATA_DIR / "MassSpecGym_chlorine.mgf"

#### Annotating the data

The labeling function defines the ground truth for this task. Each mass spectrum in MassSpecGym is annotated with its correct molecule, so we can derive new labels directly from that annotation.

For chlorine detection, we look up the molecule associated with each spectrum and check whether its formula contains chlorine: the spectrum gets the label 1.0 if it does and 0.0 if it does not.
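
To make the labeling rule concrete, here is a minimal sketch of an element-aware variant that parses the formula into element counts instead of testing for a substring. It assumes plain formulas such as `C9H10ClN3O` (no parentheses or isotope notation) and that the formula is stored under the `FORMULA` key; `element_counts` and `has_chlorine` are illustrative names, not part of the benchmark code. For well-formed formulas the result should agree with a simple `"Cl" in formula` test; the parsed version mainly makes the assumption explicit and tolerates a missing or empty formula.

```python
import re
from typing import Dict

# One element token: an uppercase letter, an optional lowercase letter, an optional count.
_ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula: str) -> Dict[str, int]:
    """Count atoms per element in a plain formula (no parentheses or isotopes)."""
    counts: Dict[str, int] = {}
    for symbol, digits in _ELEMENT_RE.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def has_chlorine(metadata: dict) -> float:
    """Return 1.0 if the formula in `metadata` contains chlorine, else 0.0."""
    formula = metadata.get("FORMULA") or ""  # treat a missing or None formula as empty
    return float(element_counts(formula).get("Cl", 0) > 0)
```

A function like `has_chlorine` could be passed wherever a labeling function is expected, and `element_counts` can be reused to label other elements (for example bromine) by changing the symbol that is looked up.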

In [None]:
# Labeling function: 1.0 if the molecular formula contains "Cl", else 0.0
def label_fn(md):
    return float("Cl" in md.get("FORMULA", ""))

In [None]:
# Write out labeled MGF
annotate_mgf_with_label(ORIG_MGF, LABELED_MGF, label_fn)
print(f"Labeled MGF written to: {LABELED_MGF}")
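
Before training a binary classifier it can help to check how balanced the labels are. The sketch below is one way to do that by reading the metadata blocks of an MGF file directly. It is a simplified reader written for illustration (not the parser used by `annotate_mgf_with_label`), and it assumes that each spectrum sits between `BEGIN IONS` and `END IONS` and that metadata lines have the form `KEY=value`, with keys matched in upper case. The names `iter_mgf_metadata`, `label_balance`, and `contains_cl` are hypothetical, and the two spectra in `example_mgf` are toy values.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator


def iter_mgf_metadata(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {KEY: value} dict per BEGIN IONS ... END IONS block."""
    metadata = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            metadata = {}
        elif line == "END IONS":
            if metadata is not None:
                yield metadata
            metadata = None
        elif metadata is not None and "=" in line:
            # Split on the first "=" only, so values such as SMILES keep their "=" bonds.
            key, value = line.split("=", 1)
            metadata[key.strip().upper()] = value.strip()
        # Peak lines ("m/z intensity") contain no "=" and are skipped.


def label_balance(lines: Iterable[str], label_fn: Callable[[dict], float]) -> Counter:
    """Count how many spectra receive each label."""
    return Counter(label_fn(md) for md in iter_mgf_metadata(lines))


# Toy in-memory MGF with one chlorinated and one non-chlorinated molecule.
example_mgf = """BEGIN IONS
FORMULA=C9H10ClN3O
50.1 10.0
END IONS
BEGIN IONS
FORMULA=C6H12O6
60.2 5.0
END IONS
"""


def contains_cl(md: dict) -> float:
    return float("Cl" in md.get("FORMULA", ""))


counts = label_balance(example_mgf.splitlines(), contains_cl)
```

On the real data, the same function could be applied to an open file handle, for example `with open(ORIG_MGF) as fh: print(label_balance(fh, label_fn))`, provided the file follows the layout assumed above.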