MicroMixer-1 is a research project exploring MLP-Mixer architectures for causal language modeling. Instead of using Transformer attention mechanisms, this project uses only MLP layers with token mixing and channel mixing to generate text.
- 🚫 No Attention: Completely removes self-attention mechanisms
- 🧱 MLP-Only: Uses MLP-Mixer with token/channel mixing
- 📝 Byte-Level: 256 vocabulary byte tokenizer (multilingual capable)
- 🔄 RoPE: Rotary Position Embedding for length generalization
- ⚡ HyperMixing: O(S) complexity token mixing via hypernetworks
graph TD
A[Byte Input] --> B[Token Embedding]
B --> C[RoPE Position Encoding]
C --> D[ImprovedMixerLayer × N]
D --> E[LayerNorm]
E --> F[LM Head]
F --> G[Byte Output]
subgraph "ImprovedMixerLayer"
H[LayerNorm] --> I[HyperMixing]
I --> J[Residual]
J --> K[LayerNorm]
K --> L[MlpBlock]
L --> M[Residual]
end
style A fill:#007BFF,color:#fff
style G fill:#00D620,color:#fff
style D fill:#AE00FF,color:#fff
| Model | Parameters | Hidden Dim | Seq Len | Layers | Checkpoint |
|---|---|---|---|---|---|
| 100K | 136,908 | 84 | 64 | 3 | v2_hyper_100k |
| 300K | 331,680 | 128 | 128 | 3 | v2_hyper_300k |
| 500K | 557,328 | 176 | 128 | 3 | v2_hyper_500k |
| 1M | 967,584 | 224 | 256 | 3 | v2_hyper_1M |
git clone https://github.com/llaa33219/MicroMixer-1.git
cd MicroMixer-1
pip install -e .import torch
from src.model import MicroMixerV2, MicroMixerV2Config
from src.tokenizer import ByteTokenizer
# Load model
config = MicroMixerV2Config(
max_seq_len=256,
hidden_dim=224,
channel_mlp_dim=576,
num_layers=3,
use_hyper=True,
)
model = MicroMixerV2(config)
checkpoint = torch.load("checkpoints/v2_hyper_1M/epoch_4.pt", weights_only=False)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
# Generate
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("Once upon a time")])
with torch.no_grad():
output = model.generate(input_ids, max_new_tokens=64, temperature=0.8, top_k=40)
print(tokenizer.decode(output[0].tolist()))# Train 100K model for 10 epochs
python train.py --model 100k --version v2 --epochs 10 --batch-size 32
# Train 1M model with custom settings
python train.py --model 1M --version v2 --epochs 5 --lr 1e-4 --max-samples 10000The models are trained on:
- TinyStories: Simple children's stories
- DailyDialog: Conversational data
- Small Model Size: Largest model is only ~1M parameters
- Grammar Issues: Generated text often has grammatical errors
- Repetitive Patterns: Tends to repeat learned phrases from training data
- Short Context: Limited context window (64-256 tokens)
This project explores whether MLP-only architectures can perform competitive language modeling without attention mechanisms. Key questions:
- Can MLP-Mixer match Transformer quality at small scales?
- How does HyperMixing compare to standard token mixing?
- What are the fundamental limits of attention-free language models?
MicroMixer-1/
├── src/
│ ├── model.py # MicroMixerV1, V2, HyperMixing
│ ├── config.py # Model configurations
│ ├── data.py # Dataset loading
│ ├── trainer.py # Training loop
│ └── tokenizer.py # Byte-level tokenizer
├── checkpoints/
│ ├── v2_hyper_100k/ # 100K model checkpoints
│ ├── v2_hyper_300k/ # 300K model checkpoints
│ ├── v2_hyper_500k/ # 500K model checkpoints
│ └── v2_hyper_1M/ # 1M model checkpoints
├── train.py # Training script
├── generate_test.py # Generation testing
├── pyproject.toml # Project config
└── LICENSE # Apache 2.0
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.