
llaa33219/MicroMixer-2


MicroMixer-2

Attention-Free Language Model based on MLP-Mixer (V4)


No Attention. No Transformers. Pure MLP.





📋 Overview

MicroMixer-2 is a research project exploring MLP-Mixer architectures for causal language modeling. Instead of using Transformer attention mechanisms, this project uses only MLP layers with token mixing and channel mixing to generate text.

Key Features

  • 🚫 No Attention: Removes self-attention entirely
  • 🧱 MLP-Only: Uses MLP-Mixer blocks with token mixing and channel mixing
  • 📝 Byte-Level: Byte tokenizer with a 256-entry vocabulary (multilingual capable)
  • 🔄 RoPE: Rotary Position Embedding for length generalization
  • ⚡ HyperMixing: Token mixing via hypernetworks, with O(S) complexity in the sequence length S
  • 📚 Knowledge-Optimized: Standard MLP (not GatedMLP) for better knowledge capacity
  • 🎯 V4 Innovations: DropPath, label smoothing, padding-aware loss
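
To illustrate the byte-level feature, here is a minimal sketch of how a 256-entry byte tokenizer can work. The class name `SimpleByteTokenizer` is hypothetical and is not the repository's `ByteTokenizer`, whose internals are not shown in this README; the sketch only assumes that token ids are raw UTF-8 byte values.

```python
from typing import List


class SimpleByteTokenizer:
    """Hypothetical stand-in for illustration; not the repository's ByteTokenizer."""

    vocab_size = 256  # one id per possible byte value

    def encode(self, text: str) -> List[int]:
        # UTF-8 maps any string to bytes, so every script is covered
        # without a learned vocabulary.
        return list(text.encode("utf-8"))

    def decode(self, ids: List[int]) -> str:
        # errors="replace" keeps decoding robust if generation stops
        # in the middle of a multi-byte character.
        return bytes(ids).decode("utf-8", errors="replace")


tok = SimpleByteTokenizer()
ids = tok.encode("Hi, 안녕")
print(ids)
print(tok.decode(ids))
```

Non-ASCII characters cost several tokens each under this scheme (a Hangul syllable is three bytes), so the effective context length in characters depends on the language.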

🏗️ Architecture

graph TD
    A[Byte Input] --> B[Token Embedding]
    B --> C[RoPE Position Encoding]
    C --> D[MicroMixerLayer × N]
    D --> E[LayerNorm]
    E --> F[LM Head]
    F --> G[Byte Output]
    
    subgraph "MicroMixerLayer"
        H[LayerNorm] --> I[HyperMixing]
        I --> J[DropPath + Residual]
        J --> K[LayerNorm]
        K --> L[MlpBlock]
        L --> M[DropPath + Residual]
    end
    
    style A fill:#007BFF,color:#fff
    style G fill:#00D620,color:#fff
    style D fill:#AE00FF,color:#fff
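
The "DropPath + Residual" steps in the diagram can be sketched in plain Python as follows. This is an illustration of stochastic depth on a single sample, not the repository's implementation; the function names `drop_path` and `residual_block` are assumptions, and a real PyTorch version would draw one keep/drop decision per sample in the batch.

```python
import random
from typing import Callable, List


def drop_path(branch_out: List[float], drop_prob: float, training: bool,
              rng=random) -> List[float]:
    """Stochastic depth for one sample's residual branch."""
    if not training or drop_prob == 0.0:
        return list(branch_out)           # inference: branch is always kept
    keep_prob = 1.0 - drop_prob
    if rng.random() >= keep_prob:
        return [0.0] * len(branch_out)    # branch skipped for this sample
    # Rescale so the expected branch output matches inference behaviour.
    return [v / keep_prob for v in branch_out]


def residual_block(x: List[float], branch_fn: Callable[[List[float]], List[float]],
                   drop_prob: float, training: bool, rng=random) -> List[float]:
    """x + DropPath(branch(x)), the pattern used twice per layer in the diagram."""
    branch = drop_path(branch_fn(x), drop_prob, training, rng)
    return [a + b for a, b in zip(x, branch)]


# In the diagram, branch_fn would be HyperMixing(LayerNorm(x)) for the first
# block and MlpBlock(LayerNorm(x)) for the second; a toy branch is used here.
print(residual_block([1.0, 2.0], lambda v: [10 * a for a in v], 0.3, training=False))
```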

V4 Innovations

| Innovation | Description |
| --- | --- |
| DropPath | Stochastic depth regularization: randomly skips residual branches |
| Label Smoothing | Prevents overconfident predictions (0.1 default) |
| Padding-Aware Loss | Ignores padding tokens in cross-entropy loss |
| Increased Depth | 5 layers for 1M model (vs 3 in V3) |
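
Label smoothing and the padding-aware loss can be combined in one cross-entropy function. The sketch below is a plain-Python illustration, not the repository's trainer code. It assumes padded positions are marked with a sentinel target (`ignore_index=-100`, following the PyTorch `CrossEntropyLoss` convention) and that smoothing spreads the 0.1 mass uniformly over all classes; how this project actually marks padding is not stated in the README.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def smoothed_ce(logits_seq: Sequence[Sequence[float]], targets: Sequence[int],
                smoothing: float = 0.1, ignore_index: int = -100) -> float:
    """Mean label-smoothed cross-entropy over non-padding positions."""
    total, count = 0.0, 0
    for logits, target in zip(logits_seq, targets):
        if target == ignore_index:
            continue                        # padding: contributes nothing
        logp = log_softmax(logits)
        nll = -logp[target]                 # usual cross-entropy term
        uniform = -sum(logp) / len(logp)    # cross-entropy against a uniform target
        total += (1.0 - smoothing) * nll + smoothing * uniform
        count += 1
    return total / max(count, 1)            # average over real tokens only


# The second position is padding, so only the first one is scored.
print(smoothed_ce([[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]], [0, -100]))
```

Averaging over the count of real tokens rather than the full sequence length keeps the loss comparable between batches with different amounts of padding.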

Model Variants

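| Model | Parameters | Hidden Dim | Hyper Dim | Seq Len | Layers |
| --- | --- | --- | --- | --- | --- |
| 100K | ~125K | 84 | 48 | 64 | 3 |
| 300K | ~431K | 128 | 64 | 128 | 4 |
| 500K | ~856K | 176 | 88 | 128 | 4 |
| 1M | ~1.02M | 168 | 84 | 4096 | 5 |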
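
As a rough sanity check on the 1M row, the arithmetic below counts only the parts that the README pins down: the byte embedding and the channel MLPs, using `channel_mlp_dim=448` from the Quick Start configuration. It assumes each channel MLP is two linear layers with biases, which is an assumption about the architecture rather than something read from `src/model.py`. The HyperMixing hypernetworks, LayerNorms, and LM head are deliberately left out because their shapes are not given here.

```python
VOCAB_SIZE = 256        # byte-level vocabulary
HIDDEN_DIM = 168        # 1M variant
CHANNEL_MLP_DIM = 448   # from the Quick Start config
NUM_LAYERS = 5


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional bias of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


embedding = VOCAB_SIZE * HIDDEN_DIM
# Assumed shape: hidden -> channel_mlp_dim -> hidden.
channel_mlp_per_layer = (linear_params(HIDDEN_DIM, CHANNEL_MLP_DIM)
                         + linear_params(CHANNEL_MLP_DIM, HIDDEN_DIM))
channel_mlp_total = NUM_LAYERS * channel_mlp_per_layer

print("embedding:", embedding)
print("channel MLP per layer:", channel_mlp_per_layer)
print("channel MLPs, all layers:", channel_mlp_total)
print("subtotal:", embedding + channel_mlp_total)
```

Under these assumptions the subtotal is about 0.8M, so most of the ~1.02M budget would sit in the channel MLPs, with the remainder in the components not modeled here.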

🚀 Quick Start

Installation

git clone https://github.com/llaa33219/MicroMixer-2.git
cd MicroMixer-2
pip install -e .

Generate Text

import torch
from src.model import MicroMixer, MicroMixerConfig
from src.tokenizer import ByteTokenizer

# Load model
config = MicroMixerConfig(
    max_seq_len=4096,
    hidden_dim=168,
    hyper_hidden_dim=84,
    channel_mlp_dim=448,
    num_layers=5,
)

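model = MicroMixer(config)
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
checkpoint = torch.load("checkpoints/discord-1M-4096-pure/epoch_2.pt", weights_only=False)
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()  # switch to inference mode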

# Generate
tokenizer = ByteTokenizer()
input_ids = torch.tensor([tokenizer.encode("User: Hello\nAssistant:")])

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=128, temperature=0.7, top_k=40)

print(tokenizer.decode(output[0].tolist()))
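
The `temperature` and `top_k` arguments above control how the next byte is sampled. The following is a minimal sketch of the usual temperature plus top-k procedure in plain Python; it shows the general technique and is not taken from the repository's `generate` method. The function name `sample_next` is hypothetical.

```python
import math
import random
from typing import Sequence


def sample_next(logits: Sequence[float], temperature: float = 0.7,
                top_k: int = 40, rng=random) -> int:
    """Sample one token id from the top_k highest logits (temperature must be > 0)."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits; everything else gets probability zero.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalised softmax
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_next([0.1, 2.0, -1.0, 1.5], temperature=0.7, top_k=2, rng=rng))
```

With a 256-entry vocabulary, `top_k=40` still discards most byte values at each step, which can help a small model avoid emitting invalid UTF-8 sequences.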

Train Your Own

# Train 1M model on Discord-Dialogues
python train.py --model 1M --dataset discord-dialogues --epochs 3 --batch-size 64 --max-seq-len 4096

# Train on other datasets
python train.py --model 1M --dataset toxic-chat --epochs 10
python train.py --model 1M --dataset gemini-35 --epochs 10
python train.py --model 1M --dataset glm-51 --epochs 5

📊 Training Data

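Supported datasets include the ones passed to --dataset in the training commands above: discord-dialogues, toxic-chat, gemini-35, and glm-51.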


📈 Training Results (1M on Discord-Dialogues)

| Epoch | Train Loss | Train PPL | Val Loss | Val PPL |
| --- | --- | --- | --- | --- |
| 1 | 2.89 | 18.04 | 2.68 | 14.65 |
| 2 | 2.57 | 13.05 | 2.63 | 13.88 |
| 3 | 2.54 | 12.68 | 2.62 | 13.73 |
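
The perplexity columns are consistent with PPL = exp(loss), which the short check below reproduces from the table. It assumes the loss is a natural-log cross-entropy per byte token. Small differences from the reported values are expected because the losses are rounded to two decimals. The bits-per-byte conversion is an added convenience, not a figure from the original results.

```python
import math

# (epoch, train_loss, train_ppl, val_loss, val_ppl) copied from the table.
reported = [
    (1, 2.89, 18.04, 2.68, 14.65),
    (2, 2.57, 13.05, 2.63, 13.88),
    (3, 2.54, 12.68, 2.62, 13.73),
]


def perplexity(loss_nats: float) -> float:
    return math.exp(loss_nats)


def bits_per_byte(loss_nats: float) -> float:
    # Convert nats to bits; with a byte-level tokenizer one token is one byte.
    return loss_nats / math.log(2)


for epoch, train_loss, train_ppl, val_loss, val_ppl in reported:
    print(epoch,
          round(perplexity(train_loss), 2), train_ppl,
          round(perplexity(val_loss), 2), val_ppl,
          round(bits_per_byte(val_loss), 2))
```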

⚠️ Limitations

  • Small Model Size: Largest model is only ~1M parameters
  • Grammar Issues: Generated text often has grammatical errors
  • Repetitive Patterns: Tends to repeat learned phrases from training data
  • Limited Context: The context window is at most 4096 tokens; with the byte-level tokenizer that is 4096 bytes of UTF-8 text

🔬 Research Motivation

This project explores whether MLP-only architectures can perform competitive language modeling without attention mechanisms. Key questions:

  1. Can MLP-Mixer match Transformer quality at small scales?
  2. How does HyperMixing compare to standard token mixing?
  3. What are the fundamental limits of attention-free language models?
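
To make the second question concrete: standard MLP-Mixer token mixing uses a weight matrix whose size is tied to a fixed sequence length, while HyperMixing generates the mixing weights from the tokens themselves. The plain-Python sketch below follows the general HyperMixer formulation, out = W2 · GELU(W1ᵀ X) with W1 and W2 produced per token. It is a simplified, non-causal illustration. The hypernetworks are reduced to single linear maps, all names are hypothetical, and the causal variant a language model needs is not described in this README.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(r) for r in zip(*a)]


def gelu(v: float) -> float:
    # tanh approximation of GELU
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def hyper_mixing(x: Matrix, hyper_w1: Matrix, hyper_w2: Matrix) -> Matrix:
    """x is S x d; hyper_w1 and hyper_w2 are d x d' (simplified hypernetworks)."""
    w1 = matmul(x, hyper_w1)               # S x d', generated from the tokens
    w2 = matmul(x, hyper_w2)               # S x d'
    hidden = matmul(transpose(w1), x)      # d' x d: tokens pooled into d' slots
    hidden = [[gelu(v) for v in row] for row in hidden]
    return matmul(w2, hidden)              # S x d: slots redistributed to tokens


rng = random.Random(0)
d, d_hyper = 4, 3
hw1, hw2 = rand_matrix(d, d_hyper, rng), rand_matrix(d, d_hyper, rng)
for seq_len in (5, 9):                     # same parameters, different lengths
    out = hyper_mixing(rand_matrix(seq_len, d, rng), hw1, hw2)
    print(seq_len, len(out), len(out[0]))
```

Every matrix product here costs on the order of S · d · d' operations, which is where the linear dependence on S comes from. Without positional information this mixing treats the input as an unordered set, which is one reason a position encoding is needed.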

Research Basis

  • HyperMixer (ACL 2023): Hypernetwork-based token mixing with O(S) complexity
  • Physics of Language Models (2024): Standard MLP outperforms GatedMLP for knowledge storage
  • RoPE: Rotary Position Embedding for length generalization
  • DropPath (2016): Stochastic depth for regularization
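
For readers unfamiliar with RoPE, the sketch below shows the rotation itself in plain Python. Consecutive pairs of features are rotated by an angle proportional to the position. The base of 10000 and the interleaved pairing follow the common formulation and are assumptions here. How MicroMixer-2 applies the rotation inside an attention-free model is not detailed in this README.

```python
import math
from typing import List, Sequence


def rope(vec: Sequence[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive feature pairs of vec by position-dependent angles."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]
# Both pairs of positions are 3 apart, so the two dot products agree.
print(dot(rope(q, 7), rope(k, 4)), dot(rope(q, 13), rope(k, 10)))
```

Because only the offset between positions affects such dot products, the encoding is not tied to a fixed set of absolute positions, which is the usual argument for its length generalization.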

📁 Project Structure

MicroMixer-2/
├── src/
│   ├── model.py          # MicroMixer V4 architecture
│   ├── data.py           # Dataset loading
│   ├── trainer.py        # Training loop
│   └── tokenizer.py      # Byte-level tokenizer
├── checkpoints/
│   └── discord-1M-4096-pure/  # Discord-Dialogues trained model
├── train.py              # Training script
├── generate_test.py      # Generation testing
├── pyproject.toml        # Project config
└── LICENSE               # Apache 2.0

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


Built with ❤️ for research
