MicroMixer-2 is a research project exploring MLP-Mixer architectures for causal language modeling. Instead of using Transformer attention mechanisms, this project uses only MLP layers with token mixing and channel mixing to generate text.
- No Attention: Completely removes self-attention
- MLP-Only: Uses MLP-Mixer blocks with token mixing and channel mixing
- Byte-Level: Byte tokenizer with a 256-entry vocabulary (multilingual capable)
- RoPE: Rotary Position Embedding for length generalization
- HyperMixing: Token mixing via hypernetworks with O(S) complexity in sequence length S
- Knowledge-Optimized: Standard MLP (not GatedMLP) for better knowledge capacity
- V4 Innovations: DropPath, label smoothing, padding-aware loss
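As a rough illustration of the byte-level idea, the sketch below maps text to its UTF-8 bytes, so every token id falls in the range 0 to 255 and any language can be represented without a learned vocabulary. `SketchByteTokenizer` is a hypothetical stand-in written for this example; the repository's `ByteTokenizer` may handle special tokens or invalid bytes differently.

```python
class SketchByteTokenizer:
    """Hypothetical minimal byte-level tokenizer (not the repository's class)."""

    vocab_size = 256  # one id per possible byte value

    def encode(self, text: str) -> list[int]:
        # UTF-8 encode, then treat each byte as a token id in [0, 255].
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # A generated byte sequence may not be valid UTF-8, so replace bad
        # bytes instead of raising an exception.
        return bytes(ids).decode("utf-8", errors="replace")


tok = SketchByteTokenizer()
ids = tok.encode("héllo")
print(len(ids))         # multi-byte characters take more than one id
print(tok.decode(ids))  # round-trips back to the original string
```

One consequence of this design is that non-ASCII text consumes more of the context window per character than ASCII text does.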
graph TD
A[Byte Input] --> B[Token Embedding]
B --> C[RoPE Position Encoding]
C --> D[MicroMixerLayer × N]
D --> E[LayerNorm]
E --> F[LM Head]
F --> G[Byte Output]
subgraph "MicroMixerLayer"
H[LayerNorm] --> I[HyperMixing]
I --> J[DropPath + Residual]
J --> K[LayerNorm]
K --> L[MlpBlock]
L --> M[DropPath + Residual]
end
style A fill:#007BFF,color:#fff
style G fill:#00D620,color:#fff
style D fill:#AE00FF,color:#fff
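The data flow inside a MicroMixerLayer can be sketched in plain Python as two pre-norm residual branches, each passed through DropPath. This is only an illustration under assumptions: `layer_norm`, `drop_path`, and `mixer_layer` are hypothetical names, the two branch functions are placeholders for HyperMixing and MlpBlock, and the repository's model presumably works on batched tensors (with DropPath applied per sample) rather than on a single Python list.

```python
import random
from typing import Callable

Vector = list[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    # Normalize to zero mean and unit variance (no learned scale or shift).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def drop_path(branch: Vector, drop_prob: float, training: bool,
              rng: random.Random) -> Vector:
    # Stochastic depth: during training, zero the whole branch with
    # probability drop_prob and rescale survivors so the expected
    # output stays the same.
    if not training or drop_prob == 0.0:
        return branch
    if rng.random() < drop_prob:
        return [0.0] * len(branch)
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob for v in branch]


def mixer_layer(x: Vector,
                token_mix: Callable[[Vector], Vector],
                channel_mix: Callable[[Vector], Vector],
                drop_prob: float, training: bool,
                rng: random.Random) -> Vector:
    # x = x + DropPath(token_mix(LayerNorm(x)))
    branch = drop_path(token_mix(layer_norm(x)), drop_prob, training, rng)
    x = [a + b for a, b in zip(x, branch)]
    # x = x + DropPath(channel_mix(LayerNorm(x)))
    branch = drop_path(channel_mix(layer_norm(x)), drop_prob, training, rng)
    return [a + b for a, b in zip(x, branch)]


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
double = lambda v: [2.0 * t for t in v]  # placeholder branch function
print(mixer_layer(x, double, double, drop_prob=0.1, training=True, rng=rng))
```

When a branch is dropped, the residual connection passes the input through unchanged, which is why DropPath behaves like randomly shortening the network during training.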
| Innovation | Description |
|---|---|
| DropPath | Stochastic depth regularization - randomly skips residual branches |
| Label Smoothing | Prevents overconfident predictions (0.1 default) |
| Padding-Aware Loss | Ignores padding tokens in cross-entropy loss |
| Increased Depth | 5 layers for 1M model (vs 3 in V3) |
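To make the label-smoothing and padding-aware rows concrete, here is a small pure-Python sketch of such a loss. `PAD_ID`, `smoothed_ce`, and the toy logits are assumptions made for this example, not values taken from the repository. In PyTorch, comparable behavior is available through the `label_smoothing` and `ignore_index` arguments of `torch.nn.functional.cross_entropy`.

```python
import math

PAD_ID = -100  # hypothetical padding marker; the repository may use another id


def log_softmax(logits):
    # Numerically stable log-softmax over one position's logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def smoothed_ce(logits_per_pos, targets, smoothing=0.1, pad_id=PAD_ID):
    """Mean label-smoothed cross-entropy over non-padding positions."""
    total, count = 0.0, 0
    for logits, target in zip(logits_per_pos, targets):
        if target == pad_id:
            continue  # padding positions contribute nothing to the loss
        logp = log_softmax(logits)
        nll = -logp[target]               # standard cross-entropy term
        uniform = -sum(logp) / len(logp)  # cross-entropy against a uniform target
        total += (1.0 - smoothing) * nll + smoothing * uniform
        count += 1
    return total / max(count, 1)


logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [9.0, 9.0, 9.0]]
targets = [0, 2, PAD_ID]  # the last position is padding and is skipped
print(smoothed_ce(logits, targets))
```

The smoothing term penalizes putting all probability mass on one byte, which is how it discourages overconfident predictions.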
| Model | Parameters | Hidden Dim | Hyper Dim | Seq Len | Layers |
|---|---|---|---|---|---|
| 100K | ~125K | 84 | 48 | 64 | 3 |
| 300K | ~431K | 128 | 64 | 128 | 4 |
| 500K | ~856K | 176 | 88 | 128 | 4 |
| 1M | ~1.02M | 168 | 84 | 4096 | 5 |
git clone https://github.com/llaa33219/MicroMixer-2.git
cd MicroMixer-2
pip install -e .

import torch
from src.model import MicroMixer, MicroMixerConfig
from src.tokenizer import ByteTokenizer
# Load model
config = MicroMixerConfig(
max_seq_len=4096,
hidden_dim=168,
hyper_hidden_dim=84,
channel_mlp_dim=448,
num_layers=5,
)
model = MicroMixer(config)
checkpoint = torch.load("checkpoints/discord-1M-4096-pure/epoch_2.pt", weights_only=False)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
# Generate
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("User: Hello\nAssistant:")])
with torch.no_grad():
output = model.generate(input_ids, max_new_tokens=128, temperature=0.7, top_k=40)
print(tokenizer.decode(output[0].tolist()))

# Train 1M model on Discord-Dialogues
python train.py --model 1M --dataset discord-dialogues --epochs 3 --batch-size 64 --max-seq-len 4096
# Train on other datasets
python train.py --model 1M --dataset toxic-chat --epochs 10
python train.py --model 1M --dataset gemini-35 --epochs 10
python train.py --model 1M --dataset glm-51 --epochs 5

Supported datasets:
- Discord-Dialogues: 7.3M Discord conversations (default)
- toxic-chat: 10K toxic chat samples
- gemini-3.5-flash-distilled: 25K distilled samples
- GLM-5.1-Reasoning: 746K reasoning traces
| Epoch | Train Loss | Train PPL | Val Loss | Val PPL |
|---|---|---|---|---|
| 1 | 2.89 | 18.04 | 2.68 | 14.65 |
| 2 | 2.57 | 13.05 | 2.63 | 13.88 |
| 3 | 2.54 | 12.68 | 2.62 | 13.73 |
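The perplexity columns follow from the loss columns as PPL = exp(loss), assuming the loss is a mean cross-entropy in nats. The short check below recomputes them from the table's rounded losses, so small differences from the reported values are expected. The bits-per-byte figure is an added convenience that assumes the loss is averaged per byte token.

```python
import math

# (train_loss, val_loss) per epoch, copied from the table above
losses = {1: (2.89, 2.68), 2: (2.57, 2.63), 3: (2.54, 2.62)}

for epoch, (train, val) in losses.items():
    train_ppl, val_ppl = math.exp(train), math.exp(val)
    # Convert nats to bits; meaningful per byte only if the loss is per byte.
    val_bpb = val / math.log(2)
    print(f"epoch {epoch}: train PPL {train_ppl:.2f}, "
          f"val PPL {val_ppl:.2f}, val bits/byte {val_bpb:.2f}")
```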
- Small Model Size: Largest model is only ~1M parameters
- Grammar Issues: Generated text often has grammatical errors
- Repetitive Patterns: Tends to repeat learned phrases from training data
- Limited Context: Maximum context window of 4096 tokens (bytes, given the byte-level tokenizer)
This project explores whether MLP-only architectures can be competitive at language modeling without attention mechanisms. Key questions:
- Can MLP-Mixer match Transformer quality at small scales?
- How does HyperMixing compare to standard token mixing?
- What are the fundamental limits of attention-free language models?
- HyperMixer (ACL 2023): Hypernetwork-based token mixing with O(S) complexity
- Physics of Language Models (2024): Standard MLP outperforms GatedMLP for knowledge storage
- RoPE: Rotary Position Embedding for length generalization
- DropPath (2016): Stochastic depth for regularization
MicroMixer-2/
├── src/
│   ├── model.py          # MicroMixer V4 architecture
│   ├── data.py           # Dataset loading
│   ├── trainer.py        # Training loop
│   └── tokenizer.py      # Byte-level tokenizer
├── checkpoints/
│   └── discord-1M-4096-pure/  # Discord-Dialogues trained model
├── train.py              # Training script
├── generate_test.py      # Generation testing
├── pyproject.toml        # Project config
└── LICENSE               # Apache 2.0
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.