
Ice 🧊 – Next-Generation Recurrent-Depth Transformer


Disclaimer: Ice is an independent, community-driven theoretical reconstruction based solely on publicly available research and the OpenMythos project. It is not affiliated with, endorsed by, or connected to Anthropic, DeepSeek, or any of their proprietary systems.

Ice is an open-source, theoretical implementation of a next-generation Recurrent-Depth Transformer (RDT), also known as a Looped Transformer. It extends the OpenMythos architecture with seven innovations aimed at stronger reasoning capability.

🧠 Core Architecture

Input
  ↓
[Prelude P]         – standard TransformerBlocks, run once
  ↓
[Recurrent Block R] – 1 TransformerBlock, looped T times
  |  h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
  |___{with MSRD, MoDAv2, AEC, GDLoRA, ACT, SLU}___
  ↓
[Coda C]            – standard TransformerBlocks, run once
  ↓
RMSNorm → LM Head → Output logits

The recurrent block update rule at each loop step t:

h_{t+1} = A·h_t + B·e + Transformer(h_t, e)

Where:

  • h_t is the hidden state after loop t
  • e is the encoded input (from the Prelude), injected at every loop
  • A and B are learned injection parameters
  • The Transformer block applies attention and MoE FFN as usual

The injection of e at every step prevents the model from drifting: it keeps the original input signal alive throughout the entire recurrence depth.
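
This update maps almost directly to code. Below is a minimal PyTorch sketch of the loop, not the repository's IceModel: the diagonal A/B parameterization and the stand-in block are assumptions for illustration only.

import torch
import torch.nn as nn

class RecurrentCore(nn.Module):
    """Toy looped core implementing h_{t+1} = A*h_t + B*e + Block(h_t, e)."""
    def __init__(self, dim: int):
        super().__init__()
        # Diagonal injection parameters; assumed shapes, for illustration only.
        self.A = nn.Parameter(torch.full((dim,), 0.9))
        self.B = nn.Parameter(torch.full((dim,), 0.1))
        # Stand-in for the shared TransformerBlock: consumes [h; e], emits an update.
        self.block = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, e: torch.Tensor, n_loops: int) -> torch.Tensor:
        h = torch.zeros_like(e)
        for _ in range(n_loops):
            # e is re-injected every iteration, keeping the input signal alive.
            h = self.A * h + self.B * e + self.block(torch.cat([h, e], dim=-1))
        return h

core = RecurrentCore(dim=64)
e = torch.randn(2, 16, 64)        # (batch, seq, dim) output of the Prelude
print(core(e, n_loops=8).shape)   # torch.Size([2, 16, 64])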

✨ Seven Innovations

🔀 1. MSRD – Multi-Scale Recurrent Depths

Instead of a single recurrent block, Ice runs multiple parallel recurrent blocks at different effective depths (scale multipliers). Shallower scales capture local patterns quickly; deeper scales perform longer reasoning chains. Outputs are fused via learned gating.
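
A rough sketch of the idea, assuming one shared core run at a few depth multipliers and a per-token softmax gate; all names and shapes here are illustrative, not the repository's API.

import torch
import torch.nn as nn

class MSRDFusion(nn.Module):
    """Run one shared core at several depth multipliers and fuse the results
    with a learned per-token gate over scales."""
    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.core = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(dim, len(scales))

    def run_scale(self, e: torch.Tensor, depth: int) -> torch.Tensor:
        h = torch.zeros_like(e)
        for _ in range(depth):            # deeper scales loop more times
            h = h + self.core(h + e)
        return h

    def forward(self, e: torch.Tensor, base_depth: int = 4) -> torch.Tensor:
        outs = torch.stack([self.run_scale(e, base_depth * s) for s in self.scales], dim=-1)
        w = torch.softmax(self.gate(e), dim=-1).unsqueeze(-2)   # (batch, seq, 1, n_scales)
        return (outs * w).sum(dim=-1)

msrd = MSRDFusion(dim=32)
print(msrd(torch.randn(2, 8, 32)).shape)   # torch.Size([2, 8, 32])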

πŸ‘οΈ 2. MoDA v2 β€” Full History Cross-Loop Attention

Extends Mixture-of-Depths Attention to allow each attention head to jointly attend to the complete history of recurrent loop states, not just the current layer. Each iteration's attention can "look back" at what previous reasoning concluded, creating a true latent-space chain-of-thought.
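
One plausible way to realize this, sketched with nn.MultiheadAttention over the concatenated loop-state history; the actual MoDAv2Attention class will differ.

import torch
import torch.nn as nn

class CrossLoopAttention(nn.Module):
    """At loop t, queries come from h_t while keys/values span every earlier
    loop state, so attention can revisit earlier latent reasoning steps."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, h_t: torch.Tensor, history: list) -> torch.Tensor:
        memory = torch.cat(history + [h_t], dim=1)   # (batch, n_loops * seq, dim)
        out, _ = self.attn(h_t, memory, memory)      # query current state against full history
        return out

attn = CrossLoopAttention(dim=32)
history = [torch.randn(2, 8, 32) for _ in range(3)]   # states from earlier loop iterations
print(attn(torch.randn(2, 8, 32), history).shape)     # torch.Size([2, 8, 32])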

💾 3. AEC – Adaptive Expert Caching

During deep loop iterations, if the expert routing confidence exceeds a threshold and the decision hasn't changed recently, cached expert outputs are reused instead of recomputed. This saves significant compute during deep reasoning.
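
A heavily simplified sketch of the caching rule, using a single sequence-level routing decision for clarity; the repository presumably routes per token, and the class and attribute names here are invented.

import torch
import torch.nn as nn

class CachedExpertFFN(nn.Module):
    """Reuse the previous loop's expert output when the router is confident
    and its decision has not changed; otherwise recompute."""
    def __init__(self, dim: int, n_experts: int = 4, threshold: float = 0.9):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.threshold = threshold
        self.last_expert, self.last_out = None, None

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.router(h.mean(dim=1)), dim=-1)   # one decision per sequence (simplified)
        conf, idx = probs.max(dim=-1)
        expert = int(idx[0])
        if (self.last_out is not None and self.last_expert == expert
                and float(conf.mean()) > self.threshold):
            return self.last_out                                    # cache hit: skip the expert FFN
        out = self.experts[expert](h)
        self.last_expert, self.last_out = expert, out
        return out

cache = CachedExpertFFN(dim=32)
h = torch.randn(1, 8, 32)
print(cache(h).shape, cache(h).shape)   # the second call may hit the cache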

🔧 4. GDLoRA – Gated Delta LoRA

An input-dependent, depth-wise LoRA adapter that selectively applies depth-specific refinements. The model learns to gate when refinement is needed versus when base parameters suffice, preventing adapter interference in a weight-shared looped architecture.
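
A sketch under assumed shapes: a base projection plus a per-depth low-rank delta whose application is gated by the input. Parameter names and the sigmoid gate are illustrative.

import torch
import torch.nn as nn

class GatedDepthLoRA(nn.Module):
    """y = base(x) + gate(x) * (B_t A_t x): a depth-indexed low-rank delta,
    applied only when the input-dependent gate asks for refinement."""
    def __init__(self, dim: int, rank: int = 8, max_depth: int = 16):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.A = nn.Parameter(torch.randn(max_depth, rank, dim) * 0.01)   # per-depth down-projection
        self.B = nn.Parameter(torch.zeros(max_depth, dim, rank))          # per-depth up-projection, zero init
        self.gate = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor, depth: int) -> torch.Tensor:
        delta = x @ self.A[depth].t() @ self.B[depth].t()   # low-rank refinement for this loop depth
        g = torch.sigmoid(self.gate(x))                     # near 0: base weights suffice
        return self.base(x) + g * delta

lora = GatedDepthLoRA(dim=32)
print(lora(torch.randn(2, 8, 32), depth=3).shape)   # torch.Size([2, 8, 32])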

⚡ 5. SLU – Speculative Loop Unrolling

A lightweight predictor estimates input difficulty and suggests an optimal number of loop iterations. The model speculatively executes at the predicted depth, with a verifier to check if more iterations are needed.
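
A sketch of the control flow, assuming a pooled-feature difficulty head and a binary verifier head; both heads, the thresholds, and the step function are invented for illustration.

import torch
import torch.nn as nn

class SpeculativeLoopController(nn.Module):
    """Predict a loop depth from the encoded input, run it speculatively,
    and let a verifier request extra iterations if the state looks unfinished."""
    def __init__(self, dim: int, min_loops: int = 4, max_loops: int = 32):
        super().__init__()
        self.depth_head = nn.Linear(dim, 1)    # difficulty estimate in [0, 1]
        self.verifier = nn.Linear(dim, 1)      # probability that more loops are needed
        self.min_loops, self.max_loops = min_loops, max_loops

    def predict_depth(self, e: torch.Tensor) -> int:
        difficulty = torch.sigmoid(self.depth_head(e.mean(dim=(0, 1)))).item()
        return int(self.min_loops + difficulty * (self.max_loops - self.min_loops))

    def needs_more(self, h: torch.Tensor) -> bool:
        return torch.sigmoid(self.verifier(h.mean(dim=(0, 1)))).item() > 0.5

def run_with_slu(step_fn, e, ctrl, extra: int = 4):
    h, depth = torch.zeros_like(e), ctrl.predict_depth(e)
    for _ in range(depth):                                   # speculative execution at predicted depth
        h = step_fn(h, e)
    while ctrl.needs_more(h) and depth < ctrl.max_loops:     # verifier asks for more iterations
        for _ in range(extra):
            h = step_fn(h, e)
        depth += extra
    return h, depth

ctrl = SpeculativeLoopController(dim=32)
step = lambda h, e: 0.9 * h + 0.1 * e                        # placeholder recurrent step
h, used = run_with_slu(step, torch.randn(2, 8, 32), ctrl)
print(h.shape, used)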

📦 6. MLCA – Multi-Level Cache Attention

Older KV cache entries are progressively compressed through average pooling, reducing memory footprint while preserving semantic information. Multiple compression levels enable long-context processing with bounded memory.
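
A single-level sketch of the pooling step; the repository describes multiple compression levels, and the window sizes and cache layout below are assumptions.

import torch
import torch.nn.functional as F

def compress_kv_cache(k, v, keep_recent=128, pool=4):
    """Keep the most recent KV entries exact and average-pool older ones along
    the sequence axis, bounding cache memory. Shapes: (batch, heads, seq, head_dim)."""
    if k.size(2) <= keep_recent:
        return k, v
    def pool_old(x):
        old, new = x[:, :, :-keep_recent], x[:, :, -keep_recent:]
        old = F.avg_pool1d(old.flatten(0, 1).transpose(1, 2), pool, ceil_mode=True)
        old = old.transpose(1, 2).unflatten(0, (x.size(0), x.size(1)))
        return torch.cat([old, new], dim=2)
    return pool_old(k), pool_old(v)

k = v = torch.randn(1, 8, 1024, 64)
ck, cv = compress_kv_cache(k, v)
print(k.shape, "->", ck.shape)   # older 896 entries pooled down to 224, recent 128 kept exact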

🎯 7. CDL – Contrastive Depth Learning

A training objective that explicitly rewards the model for using deeper reasoning on harder problems and penalizes shallow shortcuts. Enables emergent systematic generalization without explicit chain-of-thought.
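
CDL is described only at the level of its intent, so the loss below is a speculative sketch: train on the deep pass and add a margin term, weighted by a per-example difficulty signal, that pushes the deep pass to beat the shallow one on hard examples. The function name, margin, and difficulty input are assumptions.

import torch
import torch.nn.functional as F

def contrastive_depth_loss(logits_shallow, logits_deep, targets, difficulty, margin=0.1):
    """Deep-pass cross-entropy plus a contrastive term: on hard examples
    (difficulty near 1) the deep pass must beat the shallow pass by a margin."""
    ls = F.cross_entropy(logits_shallow.flatten(0, 1), targets.flatten(), reduction="none")
    ld = F.cross_entropy(logits_deep.flatten(0, 1), targets.flatten(), reduction="none")
    ls = ls.view(targets.shape).mean(dim=1)     # per-example shallow loss
    ld = ld.view(targets.shape).mean(dim=1)     # per-example deep loss
    contrast = difficulty * F.relu(ld - ls + margin)
    return ld.mean() + contrast.mean()

B, S, V = 2, 8, 100
targets = torch.randint(0, V, (B, S))
loss = contrastive_depth_loss(torch.randn(B, S, V), torch.randn(B, S, V),
                              targets, difficulty=torch.rand(B))
print(loss.item())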

📊 Model Variants

| Variant | dim | Experts | Loop iters | Context | MSRD Scales | Max Output |
|---------|-----|---------|------------|---------|-------------|------------|
| ice_1b | 2048 | 64 | 16 | 4k | (1,2,4) | 4k |
| ice_3b | 3072 | 64 | 16 | 4k | (1,2,4) | 4k |
| ice_10b | 4096 | 128 | 24 | 8k | (1,2,4) | 4k |
| ice_50b | 6144 | 256 | 32 | 8k | (1,2,4,8) | 4k |
| ice_100b | 8192 | 256 | 32 | 1M | (1,2,4,8) | 128k |
| ice_500b | 12288 | 512 | 48 | 1M | (1,2,4,8,16) | 128k |
| ice_1t | 16384 | 1024 | 64 | 1M | (1,2,4,8,16) | 256k |

📦 Installation

pip install "torch>=2.1.0" "transformers>=4.40.0"

# Clone and install
git clone https://github.com/guanlan2/ice.git
cd ice
pip install -e .

# Optional: enable Flash Attention 2 in GQA mode (requires CUDA + build tools)
pip install "flash-attn>=2.8.3"

🚀 Quick Start

import torch
from ice import ice_3b, IceModel

# Create model from preset
cfg = ice_3b()
model = IceModel(cfg)

# Forward pass
ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids)  # uses default loop depth
print(f"Logits: {logits.shape}")  # (2, 16, 32000)

# Deeper reasoning: increase loop depth
logits_deep = model(ids, n_loops=32)
print(f"Deep logits: {logits_deep.shape}")

# Text generation with speculative unrolling
output = model.generate(ids, max_new_tokens=128, n_loops=16, temperature=0.8)
print(f"Generated: {output.shape}")

# Verify stability
rho = model.get_spectral_radius()
print(f"Spectral radius ρ(A) = {rho:.6f} {'βœ“ stable' if rho < 1 else 'βœ— unstable'}")

🔬 Attention Modes

Ice supports three attention implementations selectable via attn_type:

| Mode | Class | Description |
|------|-------|-------------|
| "gqa" | GQAttention | Grouped Query Attention (Ainslie et al., 2023) – fewer KV heads than Q heads |
| "mla" | MLAttention | Multi-Latent Attention (DeepSeek-V2, 2024) – 10–20× smaller KV cache |
| "moda_v2" | MoDAv2Attention | Full-history cross-loop attention – latent CoT |

from ice import IceConfig, IceModel

# MLA mode (default)
cfg = IceConfig(attn_type="mla", kv_lora_rank=512, dim=2048, n_heads=16)

# MoDA v2 mode (enables cross-loop history attention)
cfg = IceConfig(attn_type="moda_v2", moda_history_window=8, dim=2048, n_heads=16)

model = IceModel(cfg)

πŸ‹οΈ Training

A training script for the 3B model on FineWeb-Edu is included at training/pretrain.py:

# Single GPU
python training/pretrain.py

# Multi-GPU (auto-detects GPU count)
torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/pretrain.py

Key design choices:

  • Optimizer: AdamW
  • Dataset: HuggingFaceFW/fineweb-edu (sample-10BT by default)
  • Tokenizer: openai/gpt-oss-20b via IceTokenizer
  • Parallelism: PyTorch DDP via torchrun
  • Precision: bfloat16 on H100/A100, float16 + GradScaler on older GPUs
  • Schedule: Linear warmup (2000 steps) → cosine decay (see the sketch after this list)
  • Target: 30B tokens (~Chinchilla-adjusted for looped architecture)
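
A minimal sketch of the warmup-then-cosine schedule; the 2000-step warmup comes from the list above, while the total step count, peak learning rate, and decay floor are placeholder values.

import math
import torch

def lr_lambda(step, warmup=2000, total=100_000, floor=0.1):
    """Linear warmup to the peak learning rate, then cosine decay to floor * peak."""
    if step < warmup:
        return step / max(1, warmup)
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return floor + 0.5 * (1 - floor) * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(8, 8)                              # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)       # AdamW, as in the README
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
for step in range(2500):
    opt.step()
    sched.step()
print(opt.param_groups[0]["lr"])                           # past warmup, starting cosine decay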

πŸ“ Architecture Details

Spectral Radius Guarantee

The LTI injection matrix A is parameterized to guarantee ρ(A) < 1 by construction:

A_continuous = -exp(log_A)              – always negative diagonal
A_discrete   = exp(Δt · A_continuous)   – ZOH discretization, values ∈ (0, 1)

This makes the looped model unconditionally stable regardless of learning rate or batch noise (ParCae, Prairie et al., 2026).
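
The construction above is straightforward to reproduce. A sketch assuming a diagonal A and a learned step size Δt; the actual parameter names in the repository may differ.

import torch
import torch.nn as nn

class StableDiagonalA(nn.Module):
    """ρ(A) < 1 by construction: log_A gives a strictly negative continuous-time
    diagonal, and zero-order-hold discretization maps it into (0, 1)."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_A = nn.Parameter(torch.zeros(dim))    # unconstrained
        self.log_dt = nn.Parameter(torch.zeros(dim))   # learned step size (assumed; could be fixed)

    def forward(self) -> torch.Tensor:
        A_cont = -torch.exp(self.log_A)                # always negative, whatever log_A is
        dt = torch.exp(self.log_dt)                    # always positive
        return torch.exp(dt * A_cont)                  # each entry in (0, 1), so ρ(A) < 1

A = StableDiagonalA(dim=16)()
print(bool((A > 0).all() and (A < 1).all()))           # True by construction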

Multi-Resolution RoPE

When using MoDA v2, Ice precomputes RoPE frequencies at three resolution levels (θ, 2θ, 4θ), enabling the model to attend to patterns at different timescales simultaneously.
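
A sketch of precomputing the angle tables at the three base multiples; the function name and table layout are illustrative, only the (θ, 2θ, 4θ) idea comes from the text above.

import torch

def multires_rope_freqs(head_dim: int, max_pos: int, base: float = 10000.0, levels=(1, 2, 4)):
    """Precompute RoPE angle tables at base, 2*base, and 4*base, so attention
    can read the same positions at different timescales."""
    tables = {}
    for m in levels:
        inv_freq = 1.0 / ((base * m) ** (torch.arange(0, head_dim, 2).float() / head_dim))
        pos = torch.arange(max_pos).float()
        tables[m] = torch.outer(pos, inv_freq)   # (max_pos, head_dim // 2) angles
    return tables

tables = multires_rope_freqs(head_dim=64, max_pos=4096)
print({m: tuple(t.shape) for m, t in tables.items()})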

Expert-Level Balance

The MoE router uses DeepSeek-V3-style routing, sketched after the list below, with optional:

  • Auxiliary-loss-free bias routing (moe_n_groups > 1)
  • Group-limited routing (moe_topk_groups)
  • Sigmoid gating with route scaling
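
A simplified sketch of the sigmoid-gated, bias-steered routing; group-limited routing is omitted, and the class name and the out-of-band update of the balance bias are illustrative.

import torch
import torch.nn as nn

class SigmoidTopKRouter(nn.Module):
    """Sigmoid expert affinities with a balance bias that influences only which
    experts are selected (aux-loss-free), plus a route scale on the mix weights."""
    def __init__(self, dim: int, n_experts: int, top_k: int = 2, route_scale: float = 1.0):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.register_buffer("balance_bias", torch.zeros(n_experts))   # nudged toward underused experts
        self.top_k, self.route_scale = top_k, route_scale

    def forward(self, x: torch.Tensor):
        scores = torch.sigmoid(self.gate(x))                            # (tokens, n_experts)
        _, idx = (scores + self.balance_bias).topk(self.top_k, dim=-1)  # bias steers selection only
        weights = scores.gather(-1, idx)                                # raw scores weight the outputs
        weights = weights / weights.sum(dim=-1, keepdim=True) * self.route_scale
        return idx, weights

router = SigmoidTopKRouter(dim=32, n_experts=8)
idx, w = router(torch.randn(10, 32))
print(idx.shape, w.shape)   # torch.Size([10, 2]) torch.Size([10, 2])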

⚠️ Verification Status

IMPORTANT: This is a theoretical reconstruction. The table below summarizes which aspects have been verified and which have not:

| Component | Status | Notes |
|-----------|--------|-------|
| Core architecture | ✅ Verified | All unit tests pass (15/15) |
| Forward pass | ✅ Verified | Correct shapes through full pipeline |
| Spectral radius | ✅ Verified | ρ(A) < 1 guaranteed by construction |
| Text generation | ✅ Verified | Autoregressive loop works correctly |
| MSRD fusion | ✅ Verified | Multi-scale gated fusion works |
| AEC caching | ✅ Verified | Expert output caching functional |
| Large-scale training | ❌ Not verified | Requires GPU cluster |
| Convergence on real data | ❌ Not verified | Requires extended training run |
| Benchmark performance | ❌ Not verified | Requires trained checkpoints |
| Inference scaling laws | ❌ Not verified | Theoretical extrapolation only |

📚 References

📄 License

MIT License – see LICENSE for details.


Built with ❄️ by guanlan2 and 冰儿 🧊
