Multimodal fMRI Brain Encoding — predict cortical surface activity from text, audio, and video.
NForge predicts human fMRI responses to naturalistic multimodal stimuli, enabling research into how the brain integrates language, sound, and vision simultaneously.
The brain doesn't process language, sound, and video in isolation — it integrates them into a unified perceptual experience. NForge models this process by:
- Accepting any combination of text, audio, or video as input
- Extracting deep multimodal features using state-of-the-art foundation models (LLaMA 3.2, V-JEPA2, Wav2Vec-BERT)
- Predicting cortical surface fMRI responses via a Transformer-based encoding model
- Projecting predictions onto the fsaverage5 brain mesh (~10,000 vertices per hemisphere) for interpretable visualisation
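The encoding idea behind these steps can be illustrated with a toy stand-in (NumPy only; the names and shapes here are illustrative, not NForge's API): stimulus features at each time step are mapped to per-vertex cortical responses.

```python
import numpy as np

rng = np.random.default_rng(0)

n_timesteps, n_features, n_vertices = 50, 16, 20484  # fsaverage5, bilateral
features = rng.standard_normal((n_timesteps, n_features))      # stimulus features per TR
weights = rng.standard_normal((n_features, n_vertices)) * 0.1  # learned encoding weights

# A linear encoding model: each vertex's response is a weighted sum of stimulus features
preds = features @ weights
print(preds.shape)  # (50, 20484)
```

NForge replaces the random features with foundation-model embeddings and the linear map with a Transformer, but the input/output contract is the same.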
| Feature | TRIBE v2 | NForge |
|---|---|---|
| Package layout | Flat module | src/ layout with subpackages |
| ROI attention maps | ✗ | ✓ Which brain regions attend to which moments |
| Real-time streaming | ✗ | ✓ Sliding-window prediction from live feature streams |
| Modality attribution | ✗ | ✓ Per-vertex text / audio / video importance scores |
| Cross-subject generalisation | ✗ | ✓ Few-shot subject adaptation via ridge regression |
| torch.compile support | ✗ | ✓ Optional backbone compilation for faster training |
| Memory management | Basic | Explicit GC after each study load |
| Test coverage | ✗ | ✓ Unit tests for model, inference, and streaming |
```bash
# Core (inference only)
pip install nforge

# With training support (PyTorch Lightning, WandB)
pip install "nforge[training]"

# With brain visualisation (nilearn, PyVista)
pip install "nforge[plotting]"

# With streaming support
pip install "nforge[streaming]"

# Everything
pip install "nforge[training,plotting,streaming,attribution]"

# Development
pip install "nforge[dev]"
```

```python
from nforge import NForgeModel

# Load a pretrained model from the HuggingFace Hub or a local checkpoint directory
model = NForgeModel.from_pretrained(
    "facebook/tribev2",            # or "/path/to/local/checkpoint"
    cache_folder="./nforge_cache",
    device="auto",                 # "cuda" if available, else "cpu"
)

# Build events from any of: text, audio, or video
events = model.get_events_dataframe(video_path="movie_clip.mp4")
# events = model.get_events_dataframe(audio_path="podcast.wav")
# events = model.get_events_dataframe(text_path="story.txt")

# Predict fMRI responses
preds, segments = model.predict(events)
# preds: np.ndarray of shape (n_segments, n_vertices)
```

Understand which temporal windows most strongly drove each brain region:
```python
import torch

from nforge.core.attention import AttentionExtractor, attention_to_roi_scores
from nforge.data.loader import get_hcp_labels
from nforge.viz.roi_maps import plot_roi_attention

roi_indices = get_hcp_labels(mesh="fsaverage5")  # HCP MMP1.0 parcellation

loader = model.data.get_loaders(events=events, split_to_build="all")["all"]
batch = next(iter(loader)).to(model._model.device)

with torch.inference_mode():
    _, attn_maps = model._model(batch, return_attn=True)

roi_scores = attention_to_roi_scores(attn_maps, roi_indices)
# roi_scores: {"V1": np.ndarray(T,), "MT+": ..., ...}

fig = plot_roi_attention(roi_scores, mesh="fsaverage5", views=["left", "right"])
fig.savefig("roi_attention.png")
```

Run predictions from a live feature stream without pre-loading the full clip:
```python
from nforge.inference.streaming import StreamingPredictor

sp = StreamingPredictor.from_nforge_model(
    model,
    window_trs=40,   # context window length
    step_trs=1,      # emit every TR
    tr_seconds=1.0,
    device="cuda",
)

# Push pre-extracted feature tensors one TR at a time
for tr_features in my_live_extractor():
    # tr_features: {"audio": tensor(n_layers, D), "video": ..., "text": ...}
    pred = sp.push_frame(features=tr_features)
    if pred is not None:
        # pred: np.ndarray of shape (n_vertices,) — current TR's cortical activity
        visualise_brain(pred)

# Flush remaining predictions
final_preds = sp.flush()
```

Note: streaming operates at the feature level. The caller must supply feature tensors already produced by the upstream extractor models (e.g. Wav2Vec2, V-JEPA2, LLaMA).
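Since `my_live_extractor` is user-supplied, here is a minimal stand-in that yields one feature dict per TR (NumPy arrays as placeholders for the pre-extracted tensors; layer counts and dimensions are illustrative):

```python
import numpy as np

def my_live_extractor(n_trs=5, dim=64):
    """Yield one pre-extracted feature dict per TR (toy stand-in)."""
    rng = np.random.default_rng(0)
    for _ in range(n_trs):
        yield {
            "text":  rng.standard_normal((6, dim)),  # e.g. 6 text layers
            "audio": rng.standard_normal((2, dim)),  # e.g. 2 audio layers
            "video": rng.standard_normal((2, dim)),  # e.g. 2 video layers
        }

frames = list(my_live_extractor())
print(len(frames), sorted(frames[0]))  # 5 ['audio', 'text', 'video']
```

In practice each dict would come from running the extractor backbones on the most recent second of input rather than from a random generator.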
Find out how much text, audio, and video each contributed to predictions at each vertex:
```python
from nforge.inference.attribution import ModalityAttributor
from nforge.data.loader import get_hcp_labels

roi_indices = get_hcp_labels(mesh="fsaverage5")

attributor = ModalityAttributor(
    model._model,
    method="ablation",  # or "gradient" for integrated gradients
    roi_indices=roi_indices,
)

scores = attributor.attribute(batch)
# scores["text"]:  np.ndarray(n_vertices,) — text importance per vertex
# scores["audio"]: np.ndarray(n_vertices,)
# scores["video"]: np.ndarray(n_vertices,)
# scores["text_roi"]: {"V1": 0.42, "MT+": 0.18, ...} — ROI summaries

print("Top text-driven vertices:", scores["text"].argsort()[-5:])
```

Methods:
- `"ablation"` — compares predictions with each modality zeroed out. Fast and intuitive.
- `"gradient"` — integrated gradients over 5 interpolation steps. More faithful.
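The zero-ablation idea can be sketched with a toy model (everything here is illustrative, not the `ModalityAttributor` internals): predict with all modalities present, re-predict with one modality zeroed, and take the per-vertex absolute change as that modality's importance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 8
feats = {m: rng.standard_normal(4) for m in ("text", "audio", "video")}
W = {m: rng.standard_normal((4, n_vertices)) for m in feats}

def toy_predict(feats):
    # Toy "model": sum of per-modality linear projections
    return sum(feats[m] @ W[m] for m in W)

full = toy_predict(feats)
importance = {}
for m in feats:
    ablated = dict(feats, **{m: np.zeros_like(feats[m])})  # zero out one modality
    importance[m] = np.abs(full - toy_predict(ablated))    # per-vertex change

print({m: v.shape for m, v in importance.items()})
```

Vertices whose predictions barely move when a modality is removed get near-zero importance for it, which is exactly the signal the per-vertex scores summarise.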
Adapt the model to a new, unseen subject from a small calibration set:
```python
from nforge.core.subject import SubjectAdapter

# Option 1: Ridge regression (recommended) — fits a new predictor head
adapter = SubjectAdapter.from_ridge(
    model=model._model,
    calibration_loader=calibration_loader,  # DataLoader with new-subject fMRI
    regularization=1e-3,
    device="cuda",
)
new_subject_id = adapter.inject_into_model(model._model)

# Option 2: Nearest-neighbour (zero-shot, no fitting)
adapter = SubjectAdapter.from_nearest_neighbor(
    model=model._model,
    calibration_loader=calibration_loader,
)
new_subject_id = adapter.inject_into_model(model._model)

print(f"New subject registered as subject_id = {new_subject_id}")
```

```
Input stimuli (text / audio / video)
        │
        ▼
Foundation model extractors
  ├── Text:  LLaMA 3.2-3B  (layers: 0, 0.2, 0.4, 0.6, 0.8, 1.0)
  ├── Audio: Wav2Vec-BERT  (layers: 0.75, 1.0)
  └── Video: V-JEPA2-ViT-G (layers: 0.75, 1.0)
        │
        ▼
Per-modality MLP projectors
  → concatenate / sum / stack (configurable)
        │
        ▼
Combiner MLP (optional)
        │
        ▼
Temporal positional embeddings
        │
        ▼
Transformer Encoder (8 layers, self-attention over time)
        │
        ▼
Subject-specific linear head (SubjectLayers)
        │
        ▼
AdaptiveAvgPool1d → output_timesteps
        │
        ▼
Cortical surface predictions (fsaverage5, ~20k vertices bilateral)
```
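The projector-and-combine stage in the diagram can be sketched in NumPy (all shapes, layer counts, and the choice of concatenation are illustrative): each modality's stacked layer features are flattened, projected to a shared width, then concatenated over modalities before the temporal Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, hidden = 10, 32, 16
layers = {"text": 6, "audio": 2, "video": 2}

# Pre-extracted features: (T, n_layers, D) per modality
feats = {m: rng.standard_normal((T, n, D)) for m, n in layers.items()}
# One linear projector per modality: (n_layers * D) -> hidden
proj = {m: rng.standard_normal((n * D, hidden)) * 0.1 for m, n in layers.items()}

projected = [feats[m].reshape(T, -1) @ proj[m] for m in feats]
combined = np.concatenate(projected, axis=-1)  # (T, 3 * hidden): input to the Transformer
print(combined.shape)  # (10, 48)
```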
Key design choices:
- Layer-wise aggregation: features from multiple Transformer depth levels are concatenated and jointly projected, capturing both low-level and high-level representations.
- Subject layers: each training subject has its own linear prediction head, capturing individual anatomical differences.
- Hemodynamic offset: fMRI features are offset by 5 TRs (~5 s) to account for the haemodynamic response function.
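The 5-TR hemodynamic offset amounts to pairing each stimulus frame with the fMRI sample recorded roughly 5 s later. A minimal NumPy sketch of this alignment (array names and shapes are illustrative):

```python
import numpy as np

offset_trs = 5
stimulus_feats = np.arange(20)[:, None].repeat(3, axis=1)  # (T, D) toy stimulus features
fmri = np.arange(20)[:, None].repeat(4, axis=1)            # (T, V) toy fMRI responses

# The stimulus at TR t is paired with the fMRI sample at TR t + offset_trs
X = stimulus_feats[:-offset_trs]  # drop the last 5 stimulus frames
Y = fmri[offset_trs:]             # drop the first 5 fMRI frames
print(X.shape, Y.shape)  # (15, 3) (15, 4)
```

With a TR of ~1 s this places the model's targets near the peak of the haemodynamic response.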
- SLURM cluster with GPU access
- Datasets downloaded and accessible at `$DATAPATH`
- Output directory at `$SAVEPATH`

```bash
export DATAPATH=/path/to/neuroimaging/data
export SAVEPATH=/path/to/output
export SLURM_PARTITION=your_gpu_partition
export WANDB_ENTITY=your_wandb_entity  # optional
```

| Dataset | Subjects | Stimuli | TR (s) |
|---|---|---|---|
| Algonauts2025Bold | 4 | TV sitcom "Friends" + movies | 1.49 |
| Wen2017 | 3 | Short videos (11.7 s) | ~2 |
| Lahner2024Bold | 10 | Short videos (6.2 s) | ~2 |
| Lebel2023Bold | 8 | Spoken narrative (6–18 s) | ~2 |
```bash
# Quick local test (3 epochs, 3 timelines, no cluster)
python -m nforge.configs.experiments.test_run

# Full cortical training on SLURM
python -m nforge.configs.experiments.cortical
```

```
nforge/
├── src/nforge/
│   ├── core/
│   │   ├── model.py        # FmriEncoder config + FmriEncoderModel
│   │   ├── attention.py    # AttentionExtractor + ROI attention scores
│   │   └── subject.py      # SubjectAdapter for cross-subject generalisation
│   ├── data/
│   │   ├── loader.py       # MultiStudyLoader + HCP ROI utilities
│   │   ├── transforms.py   # Event transforms
│   │   ├── fmri_utils.py   # Template spaces + NforgeSurfaceProjector
│   │   └── studies/        # Algonauts2025, Wen2017, Lahner2024, Lebel2023
│   ├── training/
│   │   ├── experiment.py   # NForgeExperiment + Data config
│   │   ├── module.py       # BrainModule (PyTorch Lightning)
│   │   └── losses.py       # PearsonLoss, WeightedMSELoss
│   ├── inference/
│   │   ├── predictor.py    # NForgeModel (from_pretrained / predict)
│   │   ├── streaming.py    # StreamingPredictor
│   │   └── attribution.py  # ModalityAttributor
│   ├── viz/
│   │   ├── cortical.py     # Nilearn cortical surface rendering
│   │   ├── subcortical.py  # Subcortical structure visualisation
│   │   └── roi_maps.py     # ROI attention map rendering
│   └── configs/
│       ├── defaults.py     # Default experiment configuration
│       └── experiments/    # test_run, cortical scripts
├── tests/
│   ├── test_model.py
│   ├── test_inference.py
│   └── test_streaming.py
├── examples/
│   └── quick_start.py
└── pyproject.toml
```
NForge builds on TRIBE v2. If you use it in research, please cite:
```bibtex
@article{dascoli2026tribe,
  title   = {A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience},
  author  = {d'Ascoli, Stéphane and others},
  journal = {arXiv},
  year    = {2026},
  url     = {https://arxiv.org/abs/2502.06808}
}
```