Disclaimer: Ice is an independent, community-driven theoretical reconstruction based solely on publicly available research and the OpenMythos project. It is not affiliated with, endorsed by, or connected to Anthropic, DeepSeek, or any of their proprietary systems.
Ice is an open-source, theoretical implementation of a next-generation Recurrent-Depth Transformer (RDT), also known as a Looped Transformer. It extends the OpenMythos architecture with seven novel innovations aimed at stronger reasoning capability.
```
Input
  ↓
[Prelude P]           (standard TransformerBlocks, run once)
  ↓
[Recurrent Block R]   (1 TransformerBlock, looped T times)
  |   h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
  |___ with MSRD, MoDAv2, AEC, GDLoRA, ACT, SLU ___
  ↓
[Coda C]              (standard TransformerBlocks, run once)
  ↓
RMSNorm → LM Head → Output logits
```
The recurrent block update rule at each loop step t:
h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
Where:
- `h_t` is the hidden state after loop t
- `e` is the encoded input (from the Prelude), injected at every loop
- `A` and `B` are learned injection parameters
- The Transformer block applies attention and MoE FFN as usual
The injection of e at every step prevents the model from drifting: it keeps the original input signal alive throughout the entire recurrence depth.
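For concreteness, here is a minimal PyTorch sketch of this update rule. The diagonal `A`, `B` and the single shared block are illustrative stand-ins, not the repository's actual module names:

```python
import torch
import torch.nn as nn

class RecurrentStep(nn.Module):
    """One loop iteration: h_{t+1} = A*h_t + B*e + Transformer(h_t, e)."""
    def __init__(self, dim: int):
        super().__init__()
        # Diagonal injection parameters; a real implementation would also
        # parameterize A so that its spectral radius stays below 1.
        self.A = nn.Parameter(torch.full((dim,), 0.9))
        self.B = nn.Parameter(torch.full((dim,), 0.1))
        # Stand-in for the shared TransformerBlock (attention + MoE FFN).
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # The encoded input e is re-injected at every step.
        return self.A * h + self.B * e + self.block(h + e)

step = RecurrentStep(dim=64)
e = torch.randn(2, 16, 64)       # Prelude output
h = torch.zeros_like(e)          # initial hidden state
for _ in range(8):               # T loop iterations with shared weights
    h = step(h, e)
```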
Instead of a single recurrent block, Ice runs multiple parallel recurrent blocks at different effective depths (scale multipliers). Shallower scales capture local patterns quickly; deeper scales perform longer reasoning chains. Outputs are fused via learned gating.
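A rough sketch of the multi-scale idea, assuming a shared recurrent step and hypothetical scale multipliers; the learned gate here is a simple softmax over per-scale scores:

```python
import torch
import torch.nn as nn

class MultiScaleRecurrence(nn.Module):
    """Run a shared recurrent step at several effective depths and fuse the results."""
    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.step = nn.GRUCell(dim, dim)          # stand-in for the recurrent block
        self.gate = nn.Linear(dim, len(scales))   # learned per-scale gating

    def forward(self, e: torch.Tensor, base_loops: int = 4) -> torch.Tensor:
        outs = []
        for s in self.scales:
            h = torch.zeros_like(e)
            for _ in range(base_loops * s):        # deeper scales loop longer
                h = self.step(e, h)
            outs.append(h)
        stacked = torch.stack(outs, dim=-2)                    # (..., n_scales, dim)
        weights = torch.softmax(self.gate(e), dim=-1)          # (..., n_scales)
        return (weights.unsqueeze(-1) * stacked).sum(dim=-2)   # gated fusion

fused = MultiScaleRecurrence(dim=64)(torch.randn(8, 64))
```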
Extends Mixture-of-Depths Attention to allow each attention head to jointly attend to the complete history of recurrent loop states, not just the current layer. Each iteration's attention can "look back" at what previous reasoning concluded, creating a true latent-space chain-of-thought.
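One way to picture this (an illustrative sketch, not the actual `MoDAv2Attention` code): keep the hidden state from each completed loop and let the current state cross-attend over that history.

```python
import torch
import torch.nn as nn

dim, n_heads, history_window = 64, 4, 8
cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

e = torch.randn(2, 16, dim)       # Prelude output
h = torch.zeros_like(e)
history = []                      # hidden states from earlier loop iterations

for t in range(12):
    if history:
        # Attend over the stored history of previous loop states, giving this
        # iteration access to what earlier reasoning steps concluded.
        mem = torch.cat(history[-history_window:], dim=1)   # (B, T*window, dim)
        h = h + cross_attn(h, mem, mem, need_weights=False)[0]
    h = h + e                     # stand-in for the full recurrent block update
    history.append(h)
```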
During deep loop iterations, if the expert routing confidence exceeds a threshold and the decision hasn't changed recently, cached expert outputs are reused instead of recomputed. This saves significant compute during deep reasoning.
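A simplified sketch of the caching rule. The helpers `route()` and `run_experts()` are hypothetical stand-ins for the router and expert computation, and the confidence threshold is illustrative:

```python
import torch

def route(h):
    """Stand-in router: returns (top-expert index, routing confidence) per token."""
    probs = torch.randn(h.shape[0], 8).softmax(dim=-1)
    conf, idx = probs.max(dim=-1)
    return idx, conf

def run_experts(h, idx):
    """Stand-in for the actual MoE expert computation."""
    return h.clone()

cache = {}                 # token position -> (expert index, cached expert output)
CONF_THRESHOLD = 0.9

def moe_step(h):
    idx, conf = route(h)
    outputs = torch.empty_like(h)
    for pos in range(h.shape[0]):
        cached = cache.get(pos)
        # Reuse the cached output when routing is confident and unchanged.
        if cached is not None and conf[pos] > CONF_THRESHOLD and cached[0] == int(idx[pos]):
            outputs[pos] = cached[1]
        else:
            out = run_experts(h[pos:pos + 1], idx[pos:pos + 1])[0]
            cache[pos] = (int(idx[pos]), out)
            outputs[pos] = out
    return outputs

h = torch.randn(16, 64)
for _ in range(6):         # deep loop iterations reuse cached expert outputs
    h = h + moe_step(h)
```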
An input-dependent, depth-wise LoRA adapter that selectively applies depth-specific refinements. The model learns to gate when refinement is needed versus when the base parameters suffice, preventing adapter interference in a weight-shared looped architecture.
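A minimal sketch of a gated, depth-conditioned LoRA adapter; the class name, ranks, and gating form are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GatedDepthLoRA(nn.Module):
    """Low-rank refinement that is gated per input and per loop depth."""
    def __init__(self, dim: int, rank: int = 8, max_depth: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        self.depth_emb = nn.Embedding(max_depth, dim)
        self.gate = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor, depth: int) -> torch.Tensor:
        d = self.depth_emb(torch.tensor(depth, device=h.device))
        # Input- and depth-dependent gate in (0, 1): near 0 means "base weights suffice".
        g = torch.sigmoid(self.gate(h + d))
        return h + g * self.up(self.down(h))

h = GatedDepthLoRA(dim=64)(torch.randn(2, 16, 64), depth=3)
```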
A lightweight predictor estimates input difficulty and suggests an optimal number of loop iterations. The model speculatively executes at the predicted depth, with a verifier to check if more iterations are needed.
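A sketch of the predict-then-verify control flow, with hypothetical `predict_depth` and `verify` helpers standing in for the learned predictor and verifier:

```python
import torch
import torch.nn as nn

depth_predictor = nn.Linear(64, 1)          # lightweight difficulty estimator

def predict_depth(e, min_loops=4, max_loops=32):
    """Map estimated input difficulty to a suggested number of loop iterations."""
    difficulty = torch.sigmoid(depth_predictor(e.mean(dim=(0, 1))))
    return int(min_loops + difficulty.item() * (max_loops - min_loops))

def verify(h_prev, h_curr, tol=1e-2):
    """Accept the result once the hidden state has (approximately) converged."""
    return (h_curr - h_prev).norm() / h_curr.norm() < tol

e = torch.randn(2, 16, 64)
h = torch.zeros_like(e)
n_loops = predict_depth(e)                  # speculative depth

t = 0
while t < n_loops:
    h_next = 0.9 * h + 0.1 * e              # stand-in for the recurrent block
    if t == n_loops - 1 and not verify(h, h_next):
        n_loops = min(n_loops + 4, 64)       # verifier requests more iterations
    h = h_next
    t += 1
```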
Older KV cache entries are progressively compressed through average pooling, reducing memory footprint while preserving semantic information. Multiple compression levels enable long-context processing with bounded memory.
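A rough sketch of progressive KV compression by average pooling; the window sizes and the single compression level shown here are illustrative:

```python
import torch
import torch.nn.functional as F

def compress_kv(kv: torch.Tensor, keep_recent: int = 512, pool: int = 4) -> torch.Tensor:
    """Average-pool old cache entries, keep recent ones at full resolution.

    kv: (batch, seq_len, dim) keys or values from the KV cache.
    """
    if kv.shape[1] <= keep_recent:
        return kv
    old, recent = kv[:, :-keep_recent], kv[:, -keep_recent:]
    # Pool along the sequence dimension: every `pool` old entries become one.
    pooled = F.avg_pool1d(old.transpose(1, 2), kernel_size=pool, ceil_mode=True)
    return torch.cat([pooled.transpose(1, 2), recent], dim=1)

kv = torch.randn(1, 2048, 64)
print(compress_kv(kv).shape)   # (1, 384 + 512, 64): memory stays bounded
```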
A training objective that explicitly rewards the model for using deeper reasoning on harder problems and penalizes shallow shortcuts. Enables emergent systematic generalization without explicit chain-of-thought.
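One plausible form of such an objective, shown only as a sketch (the actual loss used by Ice is not specified here): the standard language-model loss plus a term that couples the loop depth used to a per-example difficulty estimate.

```python
import torch
import torch.nn.functional as F

def depth_aware_loss(logits, targets, n_loops_used, difficulty, lam=0.01):
    """LM loss plus a term that rewards deeper reasoning on harder examples.

    difficulty: per-example scalar in [0, 1], e.g. a proxy such as the
    teacher-forced loss of a shallow pass (an assumption for this sketch).
    """
    lm_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    target_depth = 4 + difficulty * 28            # easy -> ~4 loops, hard -> ~32
    depth_penalty = (n_loops_used - target_depth).abs().mean()
    return lm_loss + lam * depth_penalty

logits = torch.randn(2, 16, 32000)
targets = torch.randint(0, 32000, (2, 16))
loss = depth_aware_loss(logits, targets,
                        n_loops_used=torch.tensor([8.0, 8.0]),
                        difficulty=torch.tensor([0.1, 0.9]))
```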
| Variant | dim | Experts | Loop iters | Context | MSRD Scales | Max Output |
|---|---|---|---|---|---|---|
| `ice_1b` | 2048 | 64 | 16 | 4k | (1,2,4) | 4k |
| `ice_3b` | 3072 | 64 | 16 | 4k | (1,2,4) | 4k |
| `ice_10b` | 4096 | 128 | 24 | 8k | (1,2,4) | 4k |
| `ice_50b` | 6144 | 256 | 32 | 8k | (1,2,4,8) | 4k |
| `ice_100b` | 8192 | 256 | 32 | 1M | (1,2,4,8) | 128k |
| `ice_500b` | 12288 | 512 | 48 | 1M | (1,2,4,8,16) | 128k |
| `ice_1t` | 16384 | 1024 | 64 | 1M | (1,2,4,8,16) | 256k |
```bash
pip install "torch>=2.1.0" "transformers>=4.40.0"

# Clone and install
git clone https://github.com/guanlan2/ice.git
cd ice
pip install -e .

# Optional: enable Flash Attention 2 in GQA mode (requires CUDA + build tools)
pip install "flash-attn>=2.8.3"
```

```python
import torch
from ice import ice_3b, IceModel

# Create model from preset
cfg = ice_3b()
model = IceModel(cfg)

# Forward pass
ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids)  # uses default loop depth
print(f"Logits: {logits.shape}")  # (2, 16, 32000)

# Deeper reasoning: increase loop depth
logits_deep = model(ids, n_loops=32)
print(f"Deep logits: {logits_deep.shape}")

# Text generation with speculative unrolling
output = model.generate(ids, max_new_tokens=128, n_loops=16, temperature=0.8)
print(f"Generated: {output.shape}")

# Verify stability
rho = model.get_spectral_radius()
print(f"Spectral radius ρ(A) = {rho:.6f} {'✓ stable' if rho < 1 else '✗ unstable'}")
```

Ice supports three attention implementations, selectable via `attn_type`:
| Mode | Class | Description |
|---|---|---|
"gqa" |
GQAttention |
Grouped Query Attention (Ainslie et al., 2023) β fewer KV heads than Q heads |
"mla" |
MLAttention |
Multi-Latent Attention (DeepSeek-V2, 2024) β 10β20Γ smaller KV cache |
"moda_v2" |
MoDAv2Attention |
Full-history cross-loop attention β latent CoT |
```python
from ice import IceConfig, IceModel

# MLA mode (default)
cfg = IceConfig(attn_type="mla", kv_lora_rank=512, dim=2048, n_heads=16)

# MoDA v2 mode (enables cross-loop history attention)
cfg = IceConfig(attn_type="moda_v2", moda_history_window=8, dim=2048, n_heads=16)

model = IceModel(cfg)
```

A training script for the 3B model on FineWeb-Edu is included at `training/pretrain.py`:

```bash
# Single GPU
python training/pretrain.py

# Multi-GPU (auto-detects GPU count)
torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/pretrain.py
```

Key design choices:
- Optimizer: AdamW
- Dataset: HuggingFaceFW/fineweb-edu (sample-10BT by default)
- Tokenizer: openai/gpt-oss-20b via IceTokenizer
- Parallelism: PyTorch DDP via torchrun
- Precision: bfloat16 on H100/A100, float16 + GradScaler on older GPUs
- Schedule: Linear warmup (2000 steps) → cosine decay (sketched below)
- Target: 30B tokens (~Chinchilla-adjusted for looped architecture)
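For reference, a minimal sketch of the warmup→cosine schedule. The 2000 warmup steps come from the list above; the total step count, peak learning rate, and floor ratio are placeholders:

```python
import math
import torch

def lr_lambda(step, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup to the peak LR, then cosine decay to min_ratio * peak."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * progress))

# Illustrative wiring; the real script builds the optimizer over the full model.
opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```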
The LTI injection matrix A is parameterized to guarantee ρ(A) < 1 by construction:

```
A_continuous = -exp(log_A)             # always a negative diagonal
A_discrete   = exp(Δt · A_continuous)  # ZOH discretization, values ∈ (0, 1)
```
This makes the looped model unconditionally stable regardless of learning rate or batch noise (ParCae, Prairie et al., 2026).
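A small sketch of this parameterization (the parameter names are illustrative), showing that the discretized diagonal always lands in (0, 1):

```python
import torch
import torch.nn as nn

class StableDiagonalA(nn.Module):
    """Diagonal LTI injection matrix with spectral radius < 1 by construction."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_A = nn.Parameter(torch.zeros(dim))    # unconstrained
        self.log_dt = nn.Parameter(torch.zeros(dim))   # log of step size Δt

    def forward(self) -> torch.Tensor:
        a_continuous = -torch.exp(self.log_A)          # strictly negative
        dt = torch.exp(self.log_dt)                    # strictly positive
        return torch.exp(dt * a_continuous)            # ZOH discretization, in (0, 1)

A = StableDiagonalA(dim=2048)()
print(float(A.max()))    # < 1 for any parameter values, hence ρ(A) < 1
```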
When using MoDA v2, Ice precomputes RoPE frequencies at three resolution levels (θ, 2θ, 4θ), enabling the model to attend to patterns at different timescales simultaneously.
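A sketch of precomputing RoPE inverse frequencies at the three base resolutions; the base θ and head dimension are illustrative:

```python
import torch

def rope_inv_freqs(head_dim: int, theta: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a given base theta."""
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim, theta = 128, 10000.0
# Doubling theta stretches the rotation wavelengths, letting attention track
# coarser (longer-range) positional patterns alongside fine-grained ones.
freq_levels = {scale: rope_inv_freqs(head_dim, theta * scale) for scale in (1, 2, 4)}
```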
The MoE router uses DeepSeek-V3 style routing with optional features (see the sketch below):

- Auxiliary-loss-free bias routing (`moe_n_groups > 1`)
- Group-limited routing (`moe_topk_groups`)
- Sigmoid gating with route scaling
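A compact sketch of sigmoid-gated, group-limited top-k routing in that style. This is a simplification of DeepSeek-V3 routing, not Ice's actual router; the parameter names `moe_n_groups` and `moe_topk_groups` are reused from the list above, the rest is illustrative:

```python
import torch

def route(scores, n_experts=64, moe_n_groups=8, moe_topk_groups=4, top_k=6,
          bias=None, route_scale=1.0):
    """scores: (tokens, n_experts) router logits."""
    gates = torch.sigmoid(scores)                        # sigmoid gating
    biased = gates if bias is None else gates + bias     # aux-loss-free balancing bias
    # Group-limited routing: only experts in the best-scoring groups are eligible.
    groups = biased.view(-1, moe_n_groups, n_experts // moe_n_groups)
    group_scores = groups.max(dim=-1).values                     # (tokens, n_groups)
    keep = group_scores.topk(moe_topk_groups, dim=-1).indices    # allowed groups
    mask = torch.zeros_like(group_scores).scatter_(1, keep, 1.0)
    masked = (groups * mask.unsqueeze(-1)).view(-1, n_experts)
    topk_val, topk_idx = masked.topk(top_k, dim=-1)
    # Gate values come from the unbiased sigmoid scores, then are rescaled.
    weights = gates.gather(1, topk_idx) * route_scale
    return topk_idx, weights

idx, w = route(torch.randn(16, 64))
```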
IMPORTANT: This is a theoretical reconstruction. The following table summarizes which aspects have been verified and which have not:
| Component | Status | Notes |
|---|---|---|
| Core architecture | ✅ Verified | All unit tests pass (15/15) |
| Forward pass | ✅ Verified | Correct shapes through full pipeline |
| Spectral radius | ✅ Verified | ρ(A) < 1 guaranteed by construction |
| Text generation | ✅ Verified | Autoregressive loop works correctly |
| MSRD fusion | ✅ Verified | Multi-scale gated fusion works |
| AEC caching | ✅ Verified | Expert output caching functional |
| Large-scale training | ❌ Not verified | Requires GPU cluster |
| Convergence on real data | ❌ Not verified | Requires extended training run |
| Benchmark performance | ❌ Not verified | Requires trained checkpoints |
| Inference scaling laws | ❌ Not verified | Theoretical extrapolation only |
- OpenMythos: github.com/kyegomez/OpenMythos
- Looped Transformers / RDT: Giannou et al. (2023), Deletang et al. (2023)
- Systematic Generalization: CsordΓ‘s et al. (2024), Saunshi et al. (2025)
- ParCae Stability: Prairie et al. (2026), arXiv 2604.12946
- DeepSeek-V2 MLA: arXiv 2405.04434
- DeepSeek-V3 MoE: github.com/deepseek-ai/DeepSeek-V3
- GQA: Ainslie et al. (2023), arXiv 2305.13245
- MoDA: arXiv 2603.15619
- ACT: Graves (2016), arXiv 1603.08983
MIT License β see LICENSE for details.
Built with βοΈ by guanlan2 and ε°εΏ π§