MicroMixer-2

Attention-Free Language Model based on MLP-Mixer (V4)

No Attention. No Transformers. Pure MLP.

📋 Overview

MicroMixer-2 is a research project exploring MLP-Mixer architectures for causal language modeling. It replaces Transformer self-attention with MLP layers that alternate between token mixing (across sequence positions) and channel mixing (across hidden features) to generate text.

Key Features

🚫 No Attention: Completely removes self-attention mechanisms
🧱 MLP-Only: Uses MLP-Mixer with token/channel mixing
📝 Byte-Level: Byte tokenizer with a 256-entry vocabulary (multilingual capable)
🔄 RoPE: Rotary Position Embedding for length generalization
⚡ HyperMixing: Token mixing via hypernetworks, with O(S) complexity in the sequence length S
📚 Knowledge-Optimized: Standard MLP (not GatedMLP) for better knowledge capacity
🎯 V4 Innovations: DropPath, label smoothing, padding-aware loss
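
To make the byte-level feature concrete, here is a minimal sketch of what a 256-entry byte tokenizer could look like. The class name SketchByteTokenizer is hypothetical; it is not the ByteTokenizer in src/tokenizer.py, which may handle special tokens or padding differently.

```python
class SketchByteTokenizer:
    """Hypothetical byte-level tokenizer: one token per UTF-8 byte (IDs 0-255)."""

    vocab_size = 256

    def encode(self, text: str) -> list[int]:
        # Every Unicode string maps to UTF-8 bytes, so no language needs its own vocabulary.
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # errors="replace" keeps decoding robust if a model emits a
        # truncated multi-byte sequence.
        return bytes(ids).decode("utf-8", errors="replace")


tok = SketchByteTokenizer()
ids = tok.encode("héllo")
print(len(ids))         # 6, because "é" takes two bytes in UTF-8
print(tok.decode(ids))  # héllo
```

The trade-off is sequence length: non-ASCII text costs several tokens per character, which is one reason a long context window matters for a byte-level model.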

🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[MicroMixerLayer × N]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    subgraph "MicroMixerLayer"
        H[LayerNorm] --> I[HyperMixing]
        I --> J[DropPath + Residual]
        J --> K[LayerNorm]
        K --> L[MlpBlock]
        L --> M[DropPath + Residual]
    end
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff
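
The layer in the diagram can be sketched in PyTorch roughly as follows. This is an illustration under stated assumptions, not the code in src/model.py: the class names CausalHyperMixing and SketchMixerLayer, the single-linear hypernetworks, the GELU activation, and the prefix-sum trick used to keep the mixing causal are all choices made for this example. DropPath is left out for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalHyperMixing(nn.Module):
    """Assumed causal variant of hypernetwork-generated token mixing."""

    def __init__(self, hidden_dim: int, hyper_hidden_dim: int):
        super().__init__()
        # Two small hypernetworks produce one row of mixing weights per token.
        self.w_in = nn.Linear(hidden_dim, hyper_hidden_dim)
        self.w_out = nn.Linear(hidden_dim, hyper_hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        w1 = self.w_in(x)   # (B, S, K)
        w2 = self.w_out(x)  # (B, S, K)
        # Per-token outer products, accumulated over positions <= t so that
        # position t never sees future tokens. Cost grows linearly with S.
        outer = torch.einsum("bsk,bsd->bskd", w1, x)
        state = outer.cumsum(dim=1)  # (B, S, K, D)
        # Divide by the number of tokens seen so far to keep the scale stable.
        counts = torch.arange(1, x.size(1) + 1, device=x.device, dtype=x.dtype)
        state = state / counts.view(1, -1, 1, 1)
        return torch.einsum("bsk,bskd->bsd", w2, F.gelu(state))


class SketchMixerLayer(nn.Module):
    """LayerNorm -> token mixing -> residual -> LayerNorm -> MLP -> residual."""

    def __init__(self, hidden_dim=168, hyper_hidden_dim=84, channel_mlp_dim=448):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mix = CausalHyperMixing(hidden_dim, hyper_hidden_dim)
        self.norm2 = nn.LayerNorm(hidden_dim)
        # Standard (non-gated) two-layer MLP for channel mixing.
        self.channel_mlp = nn.Sequential(
            nn.Linear(hidden_dim, channel_mlp_dim),
            nn.GELU(),
            nn.Linear(channel_mlp_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.token_mix(self.norm1(x))
        x = x + self.channel_mlp(self.norm2(x))
        return x


torch.manual_seed(0)
layer = SketchMixerLayer().eval()
x = torch.randn(2, 16, 168)
with torch.no_grad():
    y = layer(x)
print(y.shape)  # torch.Size([2, 16, 168])
```

The prefix sum materializes a (K, D) state per position, so this form is memory-hungry at long sequence lengths; a practical implementation would likely process the sequence in chunks or use a recurrent scan.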

V4 Innovations

Innovation	Description
DropPath	Stochastic depth regularization - randomly skips residual branches
Label Smoothing	Prevents overconfident predictions (0.1 default)
Padding-Aware Loss	Ignores padding tokens in cross-entropy loss
Increased Depth	5 layers for 1M model (vs 3 in V3)
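
A minimal PyTorch sketch of the three regularization pieces listed above. The drop probability of 0.1, the use of -100 as the padding target, and the function name lm_loss are assumptions for illustration; only the 0.1 label smoothing default comes from the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IGNORE_INDEX = -100  # assumed padding target; the project's actual value is not stated


class DropPath(nn.Module):
    """Stochastic depth: zero a residual branch for random samples during training."""

    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, branch: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return branch
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dimensions.
        mask_shape = (branch.shape[0],) + (1,) * (branch.dim() - 1)
        mask = torch.bernoulli(
            torch.full(mask_shape, keep_prob, device=branch.device, dtype=branch.dtype)
        )
        # Rescale kept branches so the expected output matches eval mode.
        return branch * mask / keep_prob


def lm_loss(logits: torch.Tensor, targets: torch.Tensor, smoothing: float = 0.1):
    """Cross-entropy with label smoothing that skips padded target positions."""
    # logits: (B, S, V), targets: (B, S)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
        label_smoothing=smoothing,
    )


logits = torch.randn(2, 5, 256)
targets = torch.randint(0, 256, (2, 5))
targets[:, 3:] = IGNORE_INDEX  # pretend the last two positions are padding
print(lm_loss(logits, targets).item())
```

In a layer, DropPath would wrap each branch before the residual add, for example x = x + drop_path(token_mix(norm(x))).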

Model Variants

Model	Parameters	Hidden Dim	Hyper Dim	Seq Len	Layers
100K	~125K	84	48	64	3
300K	~431K	128	64	128	4
500K	~856K	176	88	128	4
1M	~1.02M	168	84	4096	5

🚀 Quick Start

Installation

git clone https://github.com/llaa33219/MicroMixer-2.git
cd MicroMixer-2
pip install -e .

Generate Text

import torch
from src.model import MicroMixer, MicroMixerConfig
from src.tokenizer import ByteTokenizer

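# Build the 1M configuration (it must match the checkpoint being loaded)
config = MicroMixerConfig(
    max_seq_len=4096,
    hidden_dim=168,
    hyper_hidden_dim=84,
    channel_mlp_dim=448,
    num_layers=5,
)

model = MicroMixer(config)
# map_location="cpu" lets a checkpoint saved on a GPU load on a CPU-only machine.
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
checkpoint = torch.load(
    "checkpoints/discord-1M-4096-pure/epoch_2.pt",
    map_location="cpu",
    weights_only=False,
)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()  # switch training-only layers such as DropPath to inference mode

# Encode the prompt as bytes and add a batch dimension
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("User: Hello\nAssistant:")])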

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, temperature=0.7, top_k=40)

print(tokenizer.decode(output[0].tolist()))
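
For readers unfamiliar with the temperature and top_k arguments, the following standalone sketch shows the usual meaning of those two knobs for a single decoding step. It uses only the standard library, and sample_next is a hypothetical helper, not part of MicroMixer-2; the project's generate method may differ in detail.

```python
import math
import random


def sample_next(logits, temperature=0.7, top_k=40, rng=random):
    """Pick one token ID from raw logits using temperature and top-k filtering."""
    # Keep only the top_k highest-scoring token IDs.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Softmax weights, with max-subtraction for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


logits = [0.1, 2.5, -1.0, 2.4]
print(sample_next(logits, top_k=1))  # 1, since top_k=1 always picks the argmax
```

Passing a seeded random.Random instance as rng makes sampling reproducible.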

Train Your Own

# Train 1M model on Discord-Dialogues
python train.py --model 1M --dataset discord-dialogues --epochs 3 --batch-size 64 --max-seq-len 4096

# Train on other datasets
python train.py --model 1M --dataset toxic-chat --epochs 10
python train.py --model 1M --dataset gemini-35 --epochs 10
python train.py --model 1M --dataset glm-51 --epochs 5

📊 Training Data

Supported datasets:

Discord-Dialogues (--dataset discord-dialogues): 7.3M Discord conversations (default)
toxic-chat (--dataset toxic-chat): 10K toxic chat samples
gemini-3.5-flash-distilled (--dataset gemini-35): 25K distilled samples
GLM-5.1-Reasoning (--dataset glm-51): 746K reasoning traces

📈 Training Results (1M on Discord-Dialogues)

Epoch	Train Loss	Train PPL	Val Loss	Val PPL
1	2.89	18.04	2.68	14.65
2	2.57	13.05	2.63	13.88
3	2.54	12.68	2.62	13.73
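
The perplexity columns appear to be exp(loss). The sketch below recomputes them from the rounded validation losses, assuming the reported loss is a mean cross-entropy in nats; small differences from the table are what one would expect if the table was computed from unrounded losses. Because each token is one byte, the same loss can also be read as bits per byte.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss_nats)


def bits_per_byte(loss_nats: float) -> float:
    """Convert nats per token to bits per token; tokens are bytes here."""
    return loss_nats / math.log(2)


val_losses = {1: 2.68, 2: 2.63, 3: 2.62}  # validation losses from the table
for epoch, loss in val_losses.items():
    print(f"epoch {epoch}: ppl={perplexity(loss):.2f}, bits/byte={bits_per_byte(loss):.2f}")
```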

⚠️ Limitations

Small Model Size: Largest model is only ~1M parameters
Grammar Issues: Generated text often has grammatical errors
Repetitive Patterns: Tends to repeat learned phrases from training data
Limited Context: The context window is at most 4096 tokens, which is 4096 bytes with the byte-level tokenizer; the smaller variants use 64 or 128

🔬 Research Motivation

This project explores whether MLP-only architectures can be competitive at language modeling without attention mechanisms. Key questions:

Can MLP-Mixer match Transformer quality at small scales?
How does HyperMixing compare to standard token mixing?
What are the fundamental limits of attention-free language models?

Research Basis

HyperMixer (ACL 2023): Hypernetwork-based token mixing with O(S) complexity
Physics of Language Models (2024): Standard MLP outperforms GatedMLP for knowledge storage
RoPE: Rotary Position Embedding for length generalization
DropPath (2016): Stochastic depth for regularization
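
As a reminder of what RoPE does, the standard-library sketch below rotates consecutive pairs of a vector by position-dependent angles and checks the relative-position property: the dot product of two rotated vectors depends only on the offset between their positions. The function names and the base of 10000 are illustrative assumptions. How MicroMixer-2 applies the rotation inside an attention-free model is not described here, so this shows only the rotation itself.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that scale with `position`."""
    assert len(vec) % 2 == 0, "RoPE needs an even number of features"
    out = []
    for i in range(len(vec) // 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-2 * i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q, k = [1.0, 0.5, -0.3, 2.0], [0.2, -1.0, 0.7, 0.4]
near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
far = dot(rope(q, 103), rope(k, 100))   # positions 103 and 100, offset 3
print(abs(near - far) < 1e-9)  # True: only the offset matters
```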

📁 Project Structure

MicroMixer-2/
├── src/
│   ├── model.py          # MicroMixer V4 architecture
│   ├── data.py           # Dataset loading
│   ├── trainer.py        # Training loop
│   └── tokenizer.py      # Byte-level tokenizer
├── checkpoints/
│   └── discord-1M-4096-pure/  # Discord-Dialogues trained model
├── train.py              # Training script
├── generate_test.py      # Generation testing
├── pyproject.toml        # Project config
└── LICENSE               # Apache 2.0

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Built with ❤️ for research

