Multi-timescale optimization with bidirectional knowledge bridges for continual learning
Implementation of Google's Nested Learning (NeurIPS 2025) with a novel extension: bidirectional knowledge bridges that enable explicit cross-timescale learning.
Deep learning models suffer from catastrophic forgetting: when learning new tasks, they lose performance on previously learned tasks. This is a fundamental limitation for deploying ML systems that need to continuously learn.
We extend Google's Nested Learning approach with bidirectional knowledge bridges that enable memory banks at different timescales to teach each other:
- Fast → Slow: When fast memory discovers consistent patterns, it shares them with slower banks
- Slow → Fast: When slow memory has consolidated knowledge, it guides fast memory's exploration
Key Result: Bridges shift the accuracy-forgetting Pareto frontier: at a comparable retention level, CMS with bridges reaches ~62% higher accuracy than CMS without them.
- Multi-timescale optimization - Fast, medium, and slow memory banks updating at different frequencies
- Knowledge bridges - Bidirectional transfer between timescales (our novel contribution)
- Tunable trade-off - A single hyperparameter controls the accuracy vs. retention balance
- Reproducible experiments - All results with JSON outputs and visualization scripts
| Method | Avg Accuracy | Forgetting | Retention |
|---|---|---|---|
| SGD Baseline | 19.4% | 99.1% | 0.9% |
| CMS (reg=5.0) | 9.8% | 85.6% | 14.4% |
| CMS + Bridges (reg=5.0) | 18.5% | 94.0% | 6.0% |
| CMS (reg=20.0) | 11.5% | 59.3% | 40.7% |
| CMS + Bridges (reg=20.0) | 18.7% | 61.9% | 38.1% |
Key Insight: Bridges consistently improve accuracy at every regularization level. The trade-off between accuracy and retention is tunable via the regularization strength.
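For reference, here is a minimal sketch of how these metrics can be computed from a per-task accuracy matrix. This is a hypothetical helper: the repository's `benchmarks/metrics.py` is the source of truth and may define forgetting differently.

```python
import numpy as np

def continual_metrics(acc: np.ndarray) -> tuple[float, float, float]:
    """acc[i, j] = accuracy on task j after training on task i (values in [0, 1])."""
    peaks = acc[:, :-1].max(axis=0)   # best accuracy ever reached on each old task
    finals = acc[-1, :-1]             # accuracy on old tasks after the final task
    avg_accuracy = float(acc[-1].mean())
    # Normalized forgetting, so that retention = 1 - forgetting as in the table above.
    forgetting = float(np.mean((peaks - finals) / np.maximum(peaks, 1e-8)))
    return avg_accuracy, forgetting, 1.0 - forgetting
```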
Different applications need different trade-offs:
- High adaptation (low reg): Best for rapidly changing domains (trends, new fraud patterns)
- High retention (high reg): Best for safety-critical systems (medical, autonomous)
- Balanced: Best for most production systems
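In terms of the optimizer's `regularization_strength`, the two endpoint values below come from the sweep above; treat them as starting points, not prescriptions:

```python
# Starting points for regularization_strength, taken from the sweep above.
REG_HIGH_ADAPTATION = 5.0   # favors accuracy on new tasks
REG_HIGH_RETENTION = 20.0   # favors remembering old tasks
# A balanced setting lies between these extremes; tune it on your own workload.
```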
```bash
# Clone the repository
git clone https://github.com/jstiltner/collaborative-nested-learning
cd collaborative-nested-learning
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install in development mode
pip install -e .
```

```python
import torch
from src.optimizers.collaborative_cms import CollaborativeCMSOptimizer
# Your model
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
criterion = torch.nn.CrossEntropyLoss()

# Create optimizer with knowledge bridges
optimizer = CollaborativeCMSOptimizer(
    model.parameters(),
    lr=0.01,
    hidden_dim=64,
    regularization_strength=5.0,  # Tune this for your use case
    enable_bridges=True,
)
# Training loop (`dataloader` is your own loader, assumed to yield batches
# with `.x` input tensors of shape (N, 784) and `.y` integer labels)
for batch in dataloader:
    loss = criterion(model(batch.x), batch.y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

```bash
# Run the main ablation study
python benchmarks/run_ablation.py
# Run bridge ablation (with vs without bridges)
python benchmarks/run_bridge_ablation.py
# Run regularization sweep
python benchmarks/run_reg_sweep.py
# Generate visualizations
python experiments/visualize_contribution.py
```

Results are saved to `experiments/results/` as JSON files. Run the analysis script:
```bash
python experiments/results_analysis.py
```
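To poke at the raw numbers directly, here is a minimal sketch (it assumes each result file is a flat JSON object; the actual schema may differ):

```python
import json
from pathlib import Path

# Print every saved result; adjust once you know the real schema.
for path in sorted(Path("experiments/results").glob("*.json")):
    with path.open() as f:
        print(path.stem, json.load(f))
```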
Architecture:

```
┌────────────────────────────────────┐
│           Input Gradient           │
└───────────────┬────────────────────┘
                │
        ┌───────▼───────┐
        │  Fast Memory  │ ◄──┐   Updates every step
        └───────┬───────┘    │
                │            │   Bidirectional
         Bridge │            │   Knowledge
                │            │   Transfer
        ┌───────▼───────┐    │
        │ Medium Memory │ ◄──┤   Updates every 10 steps
        └───────┬───────┘    │
                │            │
         Bridge │            │
                │            │
        ┌───────▼───────┐    │
        │  Slow Memory  │ ◄──┘   Updates every 50 steps
        └───────┬───────┘
                │
                ▼
        Parameter Update
```
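In pseudocode, the schedule in the diagram reduces to a modulo check per bank (a sketch; the real optimizer folds this into its `step()` logic):

```python
# Which memory banks update at a given optimizer step (periods from the diagram).
FAST_PERIOD, MEDIUM_PERIOD, SLOW_PERIOD = 1, 10, 50

def banks_to_update(step: int) -> dict[str, bool]:
    return {
        "fast": step % FAST_PERIOD == 0,      # every step
        "medium": step % MEDIUM_PERIOD == 0,  # every 10 steps
        "slow": step % SLOW_PERIOD == 0,      # every 50 steps
    }
```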
Novel contribution: The bridges enable bidirectional knowledge flow with learned gating that determines when and how much to transfer.
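As a rough illustration of the gating idea (not the repository's actual implementation; `fast_state` and `slow_state` are hypothetical flattened summaries of two banks):

```python
import torch
import torch.nn as nn

class GatedBridgeSketch(nn.Module):
    """Sketch of a bidirectional bridge with learned gates between two memory banks."""

    def __init__(self, dim: int):
        super().__init__()
        # One gate per direction; each sees both banks and emits a transfer weight in [0, 1].
        self.gate_fast_to_slow = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.gate_slow_to_fast = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, fast_state: torch.Tensor, slow_state: torch.Tensor):
        joint = torch.cat([fast_state, slow_state], dim=-1)
        g_fs = self.gate_fast_to_slow(joint)  # how much fast teaches slow
        g_sf = self.gate_slow_to_fast(joint)  # how much slow guides fast
        new_slow = slow_state + g_fs * (fast_state - slow_state)
        new_fast = fast_state + g_sf * (slow_state - fast_state)
        return new_fast, new_slow
```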
```
collaborative-nested-learning/
├── src/
│   ├── optimizers/               # Optimizer implementations
│   │   ├── deep_momentum.py      # Learned momentum optimizer
│   │   ├── nested_optimizer.py   # Multi-timescale optimizer
│   │   └── collaborative_cms.py  # Full implementation with bridges
│   ├── bridges/                  # Knowledge bridge mechanisms
│   │   └── knowledge_bridges.py
│   └── memory/                   # Memory bank implementations
│       ├── memory_bank.py
│       └── continuum.py          # Continuum Memory System
├── benchmarks/                   # Benchmark scripts
│   ├── split_mnist.py            # Split-MNIST dataset
│   ├── metrics.py                # Evaluation metrics
│   └── run_*.py                  # Various ablation studies
├── experiments/                  # Analysis and visualization
│   ├── results/                  # JSON result files
│   ├── results_analysis.py       # Analysis script
│   └── visualize_contribution.py
├── figures/                      # Generated visualizations
├── tests/                        # Unit tests
└── docs/                         # Documentation
```
If you use this work, please cite:
```bibtex
@software{stiltner2025collaborative,
  author = {Stiltner, Jason},
  title  = {Collaborative Nested Learning: Bidirectional Knowledge Bridges for Continual Learning},
  year   = {2025},
  url    = {https://github.com/jstiltner/collaborative-nested-learning}
}
```

And the original Nested Learning paper:

```bibtex
@inproceedings{behrouz2025nested,
  title     = {Nested Learning},
  author    = {Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peilin and Mirrokni, Vahab},
  booktitle = {NeurIPS},
  year      = {2025}
}
```

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
Areas for contribution:
- Additional benchmarks (CIFAR-100, language modeling)
- Adaptive bridge topology
- Integration with PyTorch Lightning / HuggingFace
- Performance optimizations
```bash
# Install dev dependencies
pip install -r requirements.txt

# Run tests
pytest tests/

# Format code
black src/ tests/ benchmarks/
isort src/ tests/ benchmarks/
```

This project is open source under the Apache 2.0 License.
See LICENSING.md for commercial use details.
- Nested Learning (NeurIPS 2025) - Original paper
- Titans - Precursor architecture
- Elastic Weight Consolidation - Alternative approach
Jason Stiltner
- Website: jasonstiltner.com
- LinkedIn: jason-stiltner
ML Engineer with experience deploying production systems across 190 hospitals. Interested in continual learning and self-improving systems.
Status: Active Development | Benchmarked | Documented


