# Getting Started with InstaNovo-P

## InstaNovo-P: A de novo peptide sequencing model for phosphoproteomics

InstaNovo-P is a phosphorylation-specific version of the transformer-based [InstaNovo](https://github.com/instadeepai/InstaNovo) model, fine-tuned on extensive phosphoproteomics datasets.

**Paper:** [InstaNovo-P: A de novo peptide sequencing model for phosphoproteomics](https://doi.org/10.1101/2025.05.14.654049)

This notebook demonstrates how to run inference with the released InstaNovo-P checkpoint using the `instanovo` package (version 1.1.2 or later). To reproduce the fine-tuning, see the [README](../README.md).

> **Note:** The training code in this repository is based on an older codebase (InstaNovo v0.1.6); this notebook covers inference only.

## 1. Installation

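Install `instanovo` 1.1.2 or later, which is required to load and run the InstaNovo-P checkpoint. The cell below uses `uv` and the `cu124` (CUDA 12.4) extra; on Colab or Kaggle it installs into the system environment.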

In [None]:
import os
import sys
if 'google.colab' in sys.modules or 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
    !uv pip install --system "instanovo[cu124]>=1.1.2" torchvision tf-nightly
else:
    !uv pip install "instanovo[cu124]>=1.1.2"
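
After installing, it can be worth confirming that the environment really has a new enough `instanovo`, since the training code in this repository targets a much older release. The following is a minimal sketch using only the standard library; `parse_version`, `meets_minimum`, and `check_instanovo` are hypothetical helpers written for this notebook, not part of `instanovo`.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '1.1.2' into (1, 1, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # e.g. '2rc1' or 'dev0'; treated conservatively as missing
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.1.2") -> bool:
    """Compare dotted versions numerically rather than as strings."""
    return parse_version(installed) >= parse_version(minimum)


def check_instanovo(minimum: str = "1.1.2") -> str:
    """Report whether the installed instanovo satisfies the minimum version."""
    try:
        installed = metadata.version("instanovo")
    except metadata.PackageNotFoundError:
        return "instanovo is not installed"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return f"instanovo {installed}: {status} (need >= {minimum})"


print(check_instanovo())
```

The parser is deliberately simple and treats pre-release suffixes as older than the plain release; for anything more involved, a dedicated version-parsing library is the safer choice.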

## 2. Load the InstaNovo-P Model

`InstaNovo.from_pretrained` downloads the InstaNovo-P checkpoint automatically from the [GitHub release](https://github.com/instadeepai/InstaNovo/releases/download/1.1.2/instanovo-phospho-v1.0.0.ckpt).

In [None]:
import torch
from instanovo.transformer.model import InstaNovo

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

model, config = InstaNovo.from_pretrained('instanovo-phospho-v1.0.0')
model = model.to(device).eval()
print(f"Model loaded with {sum(p.numel() for p in model.parameters()):,} parameters")

## 3. Load the InstaNovo-P Test Dataset

Download the test fold of the [InstaNovo-P dataset](https://huggingface.co/datasets/InstaDeepAI/InstaNovo-P) from HuggingFace.

To keep this demo fast, we load only the first 10% of the test parquet file (`split="test[:10%]"`).

In [None]:
from datasets import load_dataset
import pandas as pd

data_files = {"test": "dataset-phospho-test-0000-0001.parquet"}
dataset = load_dataset("InstaDeepAI/InstaNovo-P", data_files=data_files, split="test[:10%]")
print(f"Loaded {len(dataset):,} spectra")
dataset

In [None]:
from instanovo.utils import SpectrumDataFrame

df = pd.DataFrame(dataset)
sdf = SpectrumDataFrame.from_pandas(df)
sdf

## 4. Prepare the DataLoader

In [None]:
from instanovo.transformer.dataset import SpectrumDataset, collate_batch
from torch.utils.data import DataLoader

ds = SpectrumDataset(
    sdf,
    model.residue_set,
    config.get("n_peaks", 200),
    return_str=True,
    annotated=True,
)

dl = DataLoader(ds, batch_size=64, shuffle=False, num_workers=0, collate_fn=collate_batch)
print(f"DataLoader ready: {len(dl)} batches")
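
The third argument to `SpectrumDataset` caps the number of peaks per spectrum (`n_peaks`, defaulting to 200 here). A common way to apply such a cap is to keep the most intense peaks and drop the rest. The sketch below illustrates that idea on plain Python lists. That `SpectrumDataset` selects peaks in exactly this way is an assumption, and `keep_top_peaks` is a hypothetical helper, not an `instanovo` function.

```python
def keep_top_peaks(mz, intensity, n_peaks=200):
    """Keep the n_peaks most intense peaks, returned in ascending m/z order."""
    # Rank (m/z, intensity) pairs from most to least intense.
    ranked = sorted(zip(mz, intensity), key=lambda peak: peak[1], reverse=True)
    # Truncate, then restore m/z order so the spectrum stays sorted.
    kept = sorted(ranked[:n_peaks])
    return [m for m, _ in kept], [i for _, i in kept]


# Toy spectrum with four peaks, capped at two.
mz_kept, int_kept = keep_top_peaks(
    [100.0, 200.0, 300.0, 400.0],
    [5.0, 50.0, 1.0, 20.0],
    n_peaks=2,
)
print(mz_kept, int_kept)
```

Spectra with fewer than `n_peaks` peaks pass through unchanged, so the cap only affects dense spectra.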

## 5. Decoding

Three decoding strategies are available:

| Strategy | Speed | Recall | Use when |
|----------|-------|--------|----------|
| Greedy Search (`num_beams=1`) | Fastest (>10x) | Good | Filtering at 5% FDR |
| Beam Search (`num_beams>1`) | Moderate | Better | General use |
| Knapsack Beam Search | Slowest | Best | Maximum recall needed |

### Greedy / Beam Search

In [None]:
from instanovo.inference import BeamSearchDecoder, GreedyDecoder

num_beams = 1  # Set to 5 for beam search

if num_beams > 1:
    decoder = BeamSearchDecoder(model=model)
    print(f"Using Beam Search with {num_beams} beams")
else:
    decoder = GreedyDecoder(model=model)
    print("Using Greedy Search")
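
To make the difference between the two decoders concrete, here is a toy sketch of greedy versus beam search over a hand-written probability table. `TOY_MODEL` and `beam_search` are hypothetical and exist only for illustration; they are unrelated to the internals of `GreedyDecoder` or `BeamSearchDecoder`, and the toy ignores details such as precursor-mass constraints.

```python
import math

# Hypothetical toy "model": next-token probabilities conditioned on the prefix.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"A": 0.55, "B": 0.45},
    ("B",): {"A": 0.9, "B": 0.1},
}


def beam_search(model, length, num_beams):
    """Return the best (sequence, log_probability) found with num_beams beams."""
    beams = [((), 0.0)]  # start from the empty prefix with log-probability 0
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


# num_beams=1 is greedy search: only the single best prefix survives each step.
greedy_seq, greedy_score = beam_search(TOY_MODEL, length=2, num_beams=1)
beam_seq, beam_score = beam_search(TOY_MODEL, length=2, num_beams=2)

print("greedy:", greedy_seq, round(math.exp(greedy_score), 2))
print("beam:  ", beam_seq, round(math.exp(beam_score), 2))
```

Greedy search commits to `A` because it is the most likely first token and ends with a sequence probability of about 0.33. With two beams, the less likely first token `B` is kept alive and leads to a better overall sequence at about 0.36. This is the trade-off in the table above: more beams cost more computation but can recover sequences that greedy search discards early.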

## 6. Run Inference

Run the decoder over the test subset and collect the predictions, targets, and log probabilities:

In [None]:
import numpy as np
from tqdm.notebook import tqdm
from instanovo.inference import ScoredSequence

preds = []
targs = []
probs = []

for batch in tqdm(dl, total=len(dl)):
    spectra, precursors, _, peptides, _ = batch
    spectra = spectra.to(device)
    precursors = precursors.to(device)

    with torch.no_grad():
        p = decoder.decode(
            spectra=spectra,
            precursors=precursors,
            beam_size=num_beams,
            max_length=config["max_length"],
        )

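    # A failed decode is not a ScoredSequence: record an empty sequence and a
    # log probability of -inf so it gets zero confidence after exponentiation.
    preds += [x.sequence if isinstance(x, ScoredSequence) else [] for x in p]
    probs += [
        x.sequence_log_probability if isinstance(x, ScoredSequence) else -float("inf") for x in p
    ]
    targs += list(peptides)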

print(f"Inference complete: {len(preds):,} predictions")

## 7. Evaluation Metrics

### Performance without filtering

In [None]:
from instanovo.utils.metrics import Metrics

metrics = Metrics(model.residue_set, config["isotope_error_range"])

aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
    targs, preds
)
aa_error_rate = metrics.compute_aa_er(targs, preds)
auc = metrics.calc_auc(targs, preds, np.exp(pd.Series(probs)))

print(f"Amino acid error rate:  {aa_error_rate:.5f}")
print(f"Amino acid precision:   {aa_precision:.5f}")
print(f"Amino acid recall:      {aa_recall:.5f}")
print(f"Peptide precision:      {peptide_precision:.5f}")
print(f"Peptide recall:         {peptide_recall:.5f}")
print(f"Area under PR curve:    {auc:.5f}")
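
For intuition about the peptide-level numbers, the sketch below computes precision and recall with plain exact string matching. It is a simplified stand-in, not the `Metrics` implementation, which may match residues differently (for example by mass). `peptide_precision_recall` is a hypothetical helper and the toy peptides are made up for illustration.

```python
def peptide_precision_recall(targets, predictions):
    """Exact-match peptide precision and recall.

    targets: list of peptide strings.
    predictions: list of token lists; an empty list means "no prediction".
    """
    n_targets = len(targets)
    n_predicted = sum(1 for pred in predictions if pred)
    n_correct = sum(
        1 for targ, pred in zip(targets, predictions) if pred and "".join(pred) == targ
    )
    # Precision: of the predictions made, how many are right?
    precision = n_correct / n_predicted if n_predicted else 0.0
    # Recall: of all spectra, how many received a correct prediction?
    recall = n_correct / n_targets if n_targets else 0.0
    return precision, recall


toy_targets = ["PEPTIDE", "AS[UNIMOD:21]K", "GGK"]
toy_predictions = [
    ["P", "E", "P", "T", "I", "D", "E"],  # correct
    ["A", "S", "K"],                      # phosphorylation missed, so wrong
    [],                                   # no prediction
]
precision, recall = peptide_precision_recall(toy_targets, toy_predictions)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Empty predictions lower recall but not precision, which is why the two peptide-level figures can diverge when some spectra fail to decode.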

### Performance at 5% FDR

In [None]:
fdr = 5 / 100

_, threshold = metrics.find_recall_at_fdr(targs, preds, np.exp(probs), fdr=fdr)
aa_precision_fdr, aa_recall_fdr, peptide_recall_fdr, peptide_precision_fdr = (
    metrics.compute_precision_recall(targs, preds, np.exp(probs), threshold=threshold)
)

print(f"Performance at {fdr*100:.1f}% FDR:\n")
print(f"Amino acid precision:   {aa_precision_fdr:.5f}")
print(f"Amino acid recall:      {aa_recall_fdr:.5f}")
print(f"Peptide precision:      {peptide_precision_fdr:.5f}")
print(f"Peptide recall:         {peptide_recall_fdr:.5f}")
print(f"Confidence threshold:   {threshold:.5f}")
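
The idea behind a confidence threshold at a target FDR can be sketched as follows: rank predictions by confidence, walk down the list while tracking the fraction of wrong predictions so far, and keep the lowest confidence at which that fraction is still within the target. This is an illustrative approximation under the assumption that ground-truth labels are available, not necessarily how `Metrics.find_recall_at_fdr` is implemented; `threshold_at_fdr` is a hypothetical helper.

```python
import math


def threshold_at_fdr(is_correct, log_probs, fdr=0.05):
    """Return (recall, confidence_threshold) at the given FDR, or None.

    is_correct: one bool per prediction (does it match the target?).
    log_probs: sequence log probabilities; -inf marks a failed decode.
    """
    confidences = [math.exp(lp) for lp in log_probs]  # exp(-inf) == 0.0
    # Indices ordered from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)

    n_correct = 0
    best = None
    for rank, idx in enumerate(order, start=1):
        n_correct += int(is_correct[idx])
        running_fdr = 1 - n_correct / rank  # wrong fraction among the top `rank`
        if running_fdr <= fdr:
            best = (n_correct / len(is_correct), confidences[idx])
    return best


toy_correct = [True, True, False, True, False]
toy_log_probs = [math.log(0.99), math.log(0.95), math.log(0.90), math.log(0.80), -math.inf]
result = threshold_at_fdr(toy_correct, toy_log_probs, fdr=0.3)
print(result)
```

In the toy data, accepting the top four predictions gives one error in four (25%), which is within the 30% target, so the threshold lands at a confidence of about 0.8 with a recall of 0.6. The sketch does not treat tied confidences specially and returns `None` when even the most confident prediction is wrong.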

> **Note:** To reproduce the results from the paper, evaluate on the full InstaNovo-P test set rather than the 10% subset used in this notebook.

## 8. Save Predictions

In [None]:
pred_df = pd.DataFrame(
    {
        "targets": targs,
        "tokenized_predictions": preds,
        "predictions": ["".join(x) for x in preds],
        "log_probabilities": probs,
    }
)

pred_df.to_csv("predictions_instanovo_phospho.csv", index=False)
print("Predictions saved to predictions_instanovo_phospho.csv")
pred_df.head()
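
One thing to keep in mind when reading the file back: CSV has no list type, so the `tokenized_predictions` column is stored as the string form of each Python list. The sketch below reproduces this with the standard-library `csv` module on a made-up row and recovers the tokens with `ast.literal_eval`. The row contents are illustrative only.

```python
import ast
import csv
import io

# A made-up row shaped like the columns saved above.
rows = [
    {
        "targets": "AS[UNIMOD:21]K",
        "tokenized_predictions": ["A", "S[UNIMOD:21]", "K"],
        "log_probabilities": -0.12,
    }
]

# Write to an in-memory buffer instead of a file to keep the example contained.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read it back: every field, including the list, now comes back as a string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["tokenized_predictions"]
tokens = ast.literal_eval(raw)  # safely parse the list literal

print(type(raw).__name__, raw)
print(type(tokens).__name__, tokens)
```

With pandas, the same conversion can be applied while loading, for example `pd.read_csv("predictions_instanovo_phospho.csv", converters={"tokenized_predictions": ast.literal_eval})`.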