Disclaimer: Ice is an independent, community-driven theoretical reconstruction based solely on publicly available research and the OpenMythos project. It is not affiliated with, endorsed by, or connected to Anthropic, DeepSeek, or any of their proprietary systems.
Ice is an open-source, theoretical implementation of a next-generation Recurrent-Depth Transformer (RDT), also known as a Looped Transformer. It extends the OpenMythos architecture with seven novel innovations aimed at stronger reasoning capability.
```
Input
  ↓
[Prelude P]           (standard TransformerBlocks, run once)
  ↓
[Recurrent Block R]   (1 TransformerBlock, looped T times)
  |   h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
  |___ with MSRD, MoDAv2, AEC, GDLoRA, ACT, SLU ___
  ↓
[Coda C]              (standard TransformerBlocks, run once)
  ↓
RMSNorm → LM Head → Output logits
```
The recurrent block update rule at each loop step t:
h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
Where:
- `h_t` is the hidden state after loop t
- `e` is the encoded input (from the Prelude), injected at every loop
- `A` and `B` are learned injection parameters
- The Transformer block applies attention and MoE FFN as usual
The injection of e at every step prevents the model from drifting: it keeps the original input signal alive throughout the entire recurrence depth.
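For concreteness, here is a minimal PyTorch sketch of this update rule. The diagonal `A`, `B` and the single shared block are illustrative stand-ins, not the repository's actual module names:

```python
import torch
import torch.nn as nn

class RecurrentStep(nn.Module):
    """One loop iteration: h_{t+1} = A*h_t + B*e + Transformer(h_t, e)."""
    def __init__(self, dim: int):
        super().__init__()
        # Diagonal injection parameters; a real implementation would also
        # parameterize A so that its spectral radius stays below 1.
        self.A = nn.Parameter(torch.full((dim,), 0.9))
        self.B = nn.Parameter(torch.full((dim,), 0.1))
        # Stand-in for the shared TransformerBlock (attention + MoE FFN).
        self.block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # The encoded input e is re-injected at every step.
        return self.A * h + self.B * e + self.block(h + e)

step = RecurrentStep(dim=64)
e = torch.randn(2, 16, 64)       # Prelude output
h = torch.zeros_like(e)          # initial hidden state
for _ in range(8):               # T loop iterations with shared weights
    h = step(h, e)
```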
Instead of a single recurrent block, Ice runs multiple parallel recurrent blocks at different effective depths (scale multipliers). Shallower scales capture local patterns quickly; deeper scales perform longer reasoning chains. Outputs are fused via learned gating.
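A rough sketch of the multi-scale idea, assuming a shared recurrent step and hypothetical scale multipliers; the learned gate here is a simple softmax over per-scale scores:

```python
import torch
import torch.nn as nn

class MultiScaleRecurrence(nn.Module):
    """Run a shared recurrent step at several effective depths and fuse the results."""
    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.step = nn.GRUCell(dim, dim)          # stand-in for the recurrent block
        self.gate = nn.Linear(dim, len(scales))   # learned per-scale gating

    def forward(self, e: torch.Tensor, base_loops: int = 4) -> torch.Tensor:
        outs = []
        for s in self.scales:
            h = torch.zeros_like(e)
            for _ in range(base_loops * s):        # deeper scales loop longer
                h = self.step(e, h)
            outs.append(h)
        stacked = torch.stack(outs, dim=-2)                    # (..., n_scales, dim)
        weights = torch.softmax(self.gate(e), dim=-1)          # (..., n_scales)
        return (weights.unsqueeze(-1) * stacked).sum(dim=-2)   # gated fusion

fused = MultiScaleRecurrence(dim=64)(torch.randn(8, 64))
```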
Extends Mixture-of-Depths Attention to allow each attention head to jointly attend to the complete history of recurrent loop states, not just the current layer. Each iteration's attention can "look back" at what previous reasoning concluded, creating a true latent-space chain-of-thought.
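One way to picture this (an illustrative sketch, not the actual `MoDAv2Attention` code): keep the hidden state from each completed loop and let the current state cross-attend over that history.

```python
import torch
import torch.nn as nn

dim, n_heads, history_window = 64, 4, 8
cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

e = torch.randn(2, 16, dim)       # Prelude output
h = torch.zeros_like(e)
history = []                      # hidden states from earlier loop iterations

for t in range(12):
    if history:
        # Attend over the stored history of previous loop states, giving this
        # iteration access to what earlier reasoning steps concluded.
        mem = torch.cat(history[-history_window:], dim=1)   # (B, T*window, dim)
        h = h + cross_attn(h, mem, mem, need_weights=False)[0]
    h = h + e                     # stand-in for the full recurrent block update
    history.append(h)
```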
During deep loop iterations, if the expert routing confidence exceeds a threshold and the decision hasn't changed recently, cached expert outputs are reused instead of recomputed. This saves significant compute during deep reasoning.
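A simplified sketch of the caching rule. The helpers `route()` and `run_experts()` are hypothetical stand-ins for the router and expert computation, and the confidence threshold is illustrative:

```python
import torch

def route(h):
    """Stand-in router: returns (top-expert index, routing confidence) per token."""
    probs = torch.randn(h.shape[0], 8).softmax(dim=-1)
    conf, idx = probs.max(dim=-1)
    return idx, conf

def run_experts(h, idx):
    """Stand-in for the actual MoE expert computation."""
    return h.clone()

cache = {}                 # token position -> (expert index, cached expert output)
CONF_THRESHOLD = 0.9

def moe_step(h):
    idx, conf = route(h)
    outputs = torch.empty_like(h)
    for pos in range(h.shape[0]):
        cached = cache.get(pos)
        # Reuse the cached output when routing is confident and unchanged.
        if cached is not None and conf[pos] > CONF_THRESHOLD and cached[0] == int(idx[pos]):
            outputs[pos] = cached[1]
        else:
            out = run_experts(h[pos:pos + 1], idx[pos:pos + 1])[0]
            cache[pos] = (int(idx[pos]), out)
            outputs[pos] = out
    return outputs

h = torch.randn(16, 64)
for _ in range(6):         # deep loop iterations reuse cached expert outputs
    h = h + moe_step(h)
```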
An input-dependent, depth-wise LoRA adapter that selectively applies depth-specific refinements. The model learns to gate when refinement is needed versus when the base parameters suffice, preventing adapter interference in a weight-shared looped architecture.
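A minimal sketch of a gated, depth-conditioned LoRA adapter; the class name, ranks, and gating form are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GatedDepthLoRA(nn.Module):
    """Low-rank refinement that is gated per input and per loop depth."""
    def __init__(self, dim: int, rank: int = 8, max_depth: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        self.depth_emb = nn.Embedding(max_depth, dim)
        self.gate = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor, depth: int) -> torch.Tensor:
        d = self.depth_emb(torch.tensor(depth, device=h.device))
        # Input- and depth-dependent gate in (0, 1): near 0 means "base weights suffice".
        g = torch.sigmoid(self.gate(h + d))
        return h + g * self.up(self.down(h))

h = GatedDepthLoRA(dim=64)(torch.randn(2, 16, 64), depth=3)
```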
A lightweight predictor estimates input difficulty and suggests an optimal number of loop iterations. The model speculatively executes at the predicted depth, with a verifier to check if more iterations are needed.
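A sketch of the predict-then-verify control flow, with hypothetical `predict_depth` and `verify` helpers standing in for the learned predictor and verifier:

```python
import torch
import torch.nn as nn

depth_predictor = nn.Linear(64, 1)          # lightweight difficulty estimator

def predict_depth(e, min_loops=4, max_loops=32):
    """Map estimated input difficulty to a suggested number of loop iterations."""
    difficulty = torch.sigmoid(depth_predictor(e.mean(dim=(0, 1))))
    return int(min_loops + difficulty.item() * (max_loops - min_loops))

def verify(h_prev, h_curr, tol=1e-2):
    """Accept the result once the hidden state has (approximately) converged."""
    return (h_curr - h_prev).norm() / h_curr.norm() < tol

e = torch.randn(2, 16, 64)
h = torch.zeros_like(e)
n_loops = predict_depth(e)                  # speculative depth

t = 0
while t < n_loops:
    h_next = 0.9 * h + 0.1 * e              # stand-in for the recurrent block
    if t == n_loops - 1 and not verify(h, h_next):
        n_loops = min(n_loops + 4, 64)       # verifier requests more iterations
    h = h_next
    t += 1
```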
Older KV cache entries are progressively compressed through average pooling, reducing memory footprint while preserving semantic information. Multiple compression levels enable long-context processing with bounded memory.
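A rough sketch of progressive KV compression by average pooling; the window sizes and the single compression level shown here are illustrative:

```python
import torch
import torch.nn.functional as F

def compress_kv(kv: torch.Tensor, keep_recent: int = 512, pool: int = 4) -> torch.Tensor:
    """Average-pool old cache entries, keep recent ones at full resolution.

    kv: (batch, seq_len, dim) keys or values from the KV cache.
    """
    if kv.shape[1] <= keep_recent:
        return kv
    old, recent = kv[:, :-keep_recent], kv[:, -keep_recent:]
    # Pool along the sequence dimension: every `pool` old entries become one.
    pooled = F.avg_pool1d(old.transpose(1, 2), kernel_size=pool, ceil_mode=True)
    return torch.cat([pooled.transpose(1, 2), recent], dim=1)

kv = torch.randn(1, 2048, 64)
print(compress_kv(kv).shape)   # (1, 384 + 512, 64): memory stays bounded
```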
A training objective that explicitly rewards the model for using deeper reasoning on harder problems and penalizes shallow shortcuts. Enables emergent systematic generalization without explicit chain-of-thought.
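One plausible form of such an objective, shown only as a sketch (the actual loss used by Ice is not specified here): the standard language-model loss plus a term that couples the loop depth used to a per-example difficulty estimate.

```python
import torch
import torch.nn.functional as F

def depth_aware_loss(logits, targets, n_loops_used, difficulty, lam=0.01):
    """LM loss plus a term that rewards deeper reasoning on harder examples.

    difficulty: per-example scalar in [0, 1], e.g. a proxy such as the
    teacher-forced loss of a shallow pass (an assumption for this sketch).
    """
    lm_loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    target_depth = 4 + difficulty * 28            # easy -> ~4 loops, hard -> ~32
    depth_penalty = (n_loops_used - target_depth).abs().mean()
    return lm_loss + lam * depth_penalty

logits = torch.randn(2, 16, 32000)
targets = torch.randint(0, 32000, (2, 16))
loss = depth_aware_loss(logits, targets,
                        n_loops_used=torch.tensor([8.0, 8.0]),
                        difficulty=torch.tensor([0.1, 0.9]))
```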
| Variant | dim | Experts | Loop iters | Context | MSRD Scales | Max Output |
|---|---|---|---|---|---|---|
| `ice_1b` | 2048 | 64 | 16 | 4k | (1,2,4) | 4k |
| `ice_3b` | 3072 | 64 | 16 | 4k | (1,2,4) | 4k |
| `ice_10b` | 4096 | 128 | 24 | 8k | (1,2,4) | 4k |
| `ice_50b` | 6144 | 256 | 32 | 8k | (1,2,4,8) | 4k |
| `ice_100b` | 8192 | 256 | 32 | 1M | (1,2,4,8) | 128k |
| `ice_500b` | 12288 | 512 | 48 | 1M | (1,2,4,8,16) | 128k |
| `ice_1t` | 16384 | 1024 | 64 | 1M | (1,2,4,8,16) | 256k |
```bash
pip install "torch>=2.1.0" "transformers>=4.40.0"

# Clone and install
git clone https://github.com/guanlan2/ice.git
cd ice
pip install -e .

# Optional: enable Flash Attention 2 in GQA mode (requires CUDA + build tools)
pip install "flash-attn>=2.8.3"
```

```python
import torch
from ice import ice_3b, IceModel

# Create model from preset
cfg = ice_3b()
model = IceModel(cfg)

# Forward pass
ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids)  # uses default loop depth
print(f"Logits: {logits.shape}")  # (2, 16, 32000)

# Deeper reasoning: increase loop depth
logits_deep = model(ids, n_loops=32)
print(f"Deep logits: {logits_deep.shape}")

# Text generation with speculative unrolling
output = model.generate(ids, max_new_tokens=128, n_loops=16, temperature=0.8)
print(f"Generated: {output.shape}")

# Verify stability
rho = model.get_spectral_radius()
print(f"Spectral radius ρ(A) = {rho:.6f} {'✓ stable' if rho < 1 else '✗ unstable'}")
```

Ice supports three attention implementations, selectable via `attn_type`:
| Mode | Class | Description |
|---|---|---|
"gqa" |
GQAttention |
Grouped Query Attention (Ainslie et al., 2023) β fewer KV heads than Q heads |
"mla" |
MLAttention |
Multi-Latent Attention (DeepSeek-V2, 2024) β 10β20Γ smaller KV cache |
"moda_v2" |
MoDAv2Attention |
Full-history cross-loop attention β latent CoT |
```python
from ice import IceConfig, IceModel

# MLA mode (default)
cfg = IceConfig(attn_type="mla", kv_lora_rank=512, dim=2048, n_heads=16)

# MoDA v2 mode (enables cross-loop history attention)
cfg = IceConfig(attn_type="moda_v2", moda_history_window=8, dim=2048, n_heads=16)

model = IceModel(cfg)
```

A training script for the 3B model on FineWeb-Edu is included at `training/pretrain.py`:

```bash
# Single GPU
python training/pretrain.py

# Multi-GPU (auto-detects GPU count)
torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/pretrain.py
```

Key design choices:
- Optimizer: AdamW
- Dataset: HuggingFaceFW/fineweb-edu (sample-10BT by default)
- Tokenizer: openai/gpt-oss-20b via IceTokenizer
- Parallelism: PyTorch DDP via torchrun
- Precision: bfloat16 on H100/A100, float16 + GradScaler on older GPUs
- Schedule: Linear warmup (2000 steps) → cosine decay (sketched below)
- Target: 30B tokens (~Chinchilla-adjusted for looped architecture)
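For reference, a minimal sketch of the warmup→cosine schedule. The 2000 warmup steps come from the list above; the total step count, peak learning rate, and floor ratio are placeholders:

```python
import math
import torch

def lr_lambda(step, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup to the peak LR, then cosine decay to min_ratio * peak."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * progress))

# Illustrative wiring; the real script builds the optimizer over the full model.
opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```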
The LTI injection matrix A is parameterized to guarantee ρ(A) < 1 by construction:

```
A_continuous = -exp(log_A)             # always a negative diagonal
A_discrete   = exp(Δt · A_continuous)  # ZOH discretization, values ∈ (0, 1)
```
This makes the looped model unconditionally stable regardless of learning rate or batch noise (ParCae, Prairie et al., 2026).
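A small sketch of this parameterization (the parameter names are illustrative), showing that the discretized diagonal always lands in (0, 1):

```python
import torch
import torch.nn as nn

class StableDiagonalA(nn.Module):
    """Diagonal LTI injection matrix with spectral radius < 1 by construction."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_A = nn.Parameter(torch.zeros(dim))    # unconstrained
        self.log_dt = nn.Parameter(torch.zeros(dim))   # log of step size Δt

    def forward(self) -> torch.Tensor:
        a_continuous = -torch.exp(self.log_A)          # strictly negative
        dt = torch.exp(self.log_dt)                    # strictly positive
        return torch.exp(dt * a_continuous)            # ZOH discretization, in (0, 1)

A = StableDiagonalA(dim=2048)()
print(float(A.max()))    # < 1 for any parameter values, hence ρ(A) < 1
```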
When using MoDA v2, Ice precomputes RoPE frequencies at three resolution levels (θ, 2θ, 4θ), enabling the model to attend to patterns at different timescales simultaneously.
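A sketch of precomputing RoPE inverse frequencies at the three base resolutions; the base θ and head dimension are illustrative:

```python
import torch

def rope_inv_freqs(head_dim: int, theta: float) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a given base theta."""
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim, theta = 128, 10000.0
# Doubling theta stretches the rotation wavelengths, letting attention track
# coarser (longer-range) positional patterns alongside fine-grained ones.
freq_levels = {scale: rope_inv_freqs(head_dim, theta * scale) for scale in (1, 2, 4)}
```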
The MoE router uses DeepSeek-V3 style routing with optional features (see the sketch below):

- Auxiliary-loss-free bias routing (`moe_n_groups > 1`)
- Group-limited routing (`moe_topk_groups`)
- Sigmoid gating with route scaling
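A compact sketch of sigmoid-gated, group-limited top-k routing in that style. This is a simplification of DeepSeek-V3 routing, not Ice's actual router; the parameter names `moe_n_groups` and `moe_topk_groups` are reused from the list above, the rest is illustrative:

```python
import torch

def route(scores, n_experts=64, moe_n_groups=8, moe_topk_groups=4, top_k=6,
          bias=None, route_scale=1.0):
    """scores: (tokens, n_experts) router logits."""
    gates = torch.sigmoid(scores)                        # sigmoid gating
    biased = gates if bias is None else gates + bias     # aux-loss-free balancing bias
    # Group-limited routing: only experts in the best-scoring groups are eligible.
    groups = biased.view(-1, moe_n_groups, n_experts // moe_n_groups)
    group_scores = groups.max(dim=-1).values                     # (tokens, n_groups)
    keep = group_scores.topk(moe_topk_groups, dim=-1).indices    # allowed groups
    mask = torch.zeros_like(group_scores).scatter_(1, keep, 1.0)
    masked = (groups * mask.unsqueeze(-1)).view(-1, n_experts)
    topk_val, topk_idx = masked.topk(top_k, dim=-1)
    # Gate values come from the unbiased sigmoid scores, then are rescaled.
    weights = gates.gather(1, topk_idx) * route_scale
    return topk_idx, weights

idx, w = route(torch.randn(16, 64))
```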
IMPORTANT: This is a theoretical reconstruction. The following table summarizes which aspects have been verified and which have not:
| Component | Status | Notes |
|---|---|---|
| Core architecture | ✅ Verified | All unit tests pass (15/15) |
| Forward pass | ✅ Verified | Correct shapes through full pipeline |
| Spectral radius | ✅ Verified | ρ(A) < 1 guaranteed by construction |
| Text generation | ✅ Verified | Autoregressive loop works correctly |
| MSRD fusion | ✅ Verified | Multi-scale gated fusion works |
| AEC caching | ✅ Verified | Expert output caching functional |
| Large-scale training | ❌ Not verified | Requires GPU cluster |
| Convergence on real data | ❌ Not verified | Requires extended training run |
| Benchmark performance | ❌ Not verified | Requires trained checkpoints |
| Inference scaling laws | ❌ Not verified | Theoretical extrapolation only |
- OpenMythos: github.com/kyegomez/OpenMythos
- Looped Transformers / RDT: Giannou et al. (2023), Deletang et al. (2023)
- Systematic Generalization: CsordΓ‘s et al. (2024), Saunshi et al. (2025)
- ParCae Stability: Prairie et al. (2026), arXiv 2604.12946
- DeepSeek-V2 MLA: arXiv 2405.04434
- DeepSeek-V3 MoE: github.com/deepseek-ai/DeepSeek-V3
- GQA: Ainslie et al. (2023), arXiv 2305.13245
- MoDA: arXiv 2603.15619
- ACT: Graves (2016), arXiv 1603.08983
MIT License β see LICENSE for details.
Built with βοΈ by guanlan2 and ε°εΏ π§