Shimmy v1.7.0: The MoE Revolution is Here!
BREAKTHROUGH: Run 42B+ Models on Consumer Hardware
Shimmy v1.7.0 unleashes the MoE (Mixture of Experts) CPU Offloading Revolution - enabling massive expert models to run on everyday GPUs with up to 99.9% VRAM reduction.
What's New & Game-Changing
MoE CPU Offloading Technology
Transform impossible into possible:
- `--cpu-moe`: Automatically offload MoE layers to CPU
- `--n-cpu-moe N`: Fine-tune performance with precise layer control
- Massive Memory Savings: 15GB models → 4GB VRAM usage
- Enterprise Ready: Deploy 42B parameter models on 8GB consumer cards
Real Performance Gains (Validated)
- GPT-OSS 20B: 71.5% VRAM reduction (15GB → 4.3GB actual measurement)
- Phi-3.5-MoE 42B: Runs on consumer hardware for the first time
- DeepSeek 16B: Intelligent CPU-GPU hybrid execution
- Smart Tradeoffs: Accept 2-7x slower inference for 10-100x memory savings
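The reduction figure above is simple arithmetic on the before and after VRAM numbers. As a minimal sketch (the helper name `vram_reduction` is made up for illustration and is not part of Shimmy), you can reproduce it for your own measurements like this. With the rounded 15GB and 4.3GB figures it prints about 71.3, close to the quoted 71.5%; the small gap presumably comes from rounding in the published GB values.

```shell
# vram_reduction BEFORE_GB AFTER_GB
# Prints the percentage of VRAM saved, to one decimal place.
# Hypothetical helper for illustration only; not a Shimmy command.
vram_reduction() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (1 - after / before) * 100 }'
}

# GPT-OSS 20B figures quoted above (rounded to the published values)
vram_reduction 15 4.3
```

Swap in the numbers reported by your own GPU monitoring tool to see where a given model and quantization land on your hardware.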
Technical Excellence
- First-Class Rust: Enhanced llama.cpp bindings with MoE support
- Cross-Platform: Windows MSVC CUDA, macOS ARM64 Metal, Linux x86_64/ARM64
- Production Tested: 295/295 tests passing, comprehensive validation pipeline
- Still Tiny: Sub-5MB binary maintains legendary efficiency
Use Cases Unlocked
Enterprise Deployment
- Cost Revolution: Run large models without GPU farm investments
- Scalable AI: Deploy expert models on existing infrastructure
- Flexible Performance: Balance speed vs. memory for any workload
- On-Premises Ready: Keep sensitive data in-house with minimal hardware
Research & Development
- Democratized Access: Test large models on developer laptops
- Rapid Iteration: Prototype MoE architectures efficiently
- Educational Power: Advanced AI models accessible to everyone
- Hybrid Intelligence: Combine CPU and GPU resources intelligently
Quick Start Your MoE Journey
Installation Options
# Install from crates.io (LIVE NOW!)
cargo install shimmy
# Or grab platform binaries below
Ready-to-Use MoE Models
Curated collection on HuggingFace - optimized for CPU offloading:
Recommended Starting Points
# Download and run Phi-3.5-MoE 42B (Q4 K-M) - Best balance of quality/performance
huggingface-cli download MikeKuykendall/phi-3.5-moe-q4-k-m-cpu-offload-gguf
./shimmy serve --cpu-moe --model-path phi-3.5-moe-q4-k-m.gguf
# Or DeepSeek-MoE 16B (Q4 K-M) - Faster alternative
huggingface-cli download MikeKuykendall/deepseek-moe-16b-q4-k-m-cpu-offload-gguf
./shimmy serve --cpu-moe --model-path deepseek-moe-16b-q4-k-m.gguf
Complete Model Collection
| Model | Size | Quantization | VRAM | Use Case | Download |
|---|---|---|---|---|---|
| Phi-3.5-MoE | 42B | Q8.0 | ~4GB | Maximum Quality | phi-3.5-moe-q8-0-cpu-offload-gguf |
| Phi-3.5-MoE | 42B | Q4 K-M | ~2.5GB | Recommended | phi-3.5-moe-q4-k-m-cpu-offload-gguf |
| Phi-3.5-MoE | 42B | Q2 K | ~1.5GB | Ultra Fast | phi-3.5-moe-q2-k-cpu-offload-gguf |
| DeepSeek-MoE | 16B | Q8.0 | ~2GB | High Precision | deepseek-moe-16b-q8-0-cpu-offload-gguf |
| DeepSeek-MoE | 16B | Q4 K-M | ~1.2GB | Budget Pick | deepseek-moe-16b-q4-k-m-cpu-offload-gguf |
| DeepSeek-MoE | 16B | Q2 K | ~800MB | Lightning Fast | deepseek-moe-16b-q2-k-cpu-offload-gguf |
| GPT-OSS | 21B | Various | ~3GB | Research/Testing | gpt-oss-20b-moe-cpu-offload-gguf |
Model Selection Guide
- First Time? → Phi-3.5-MoE Q4 K-M (best balance)
- High-End GPU (8GB+)? → Phi-3.5-MoE Q8.0 (maximum quality)
- Limited VRAM (4GB)? → DeepSeek-MoE Q4 K-M (budget friendly)
- Speed Critical? → DeepSeek-MoE Q2 K (blazing fast)
- Research/Validation? → GPT-OSS 21B (proven baseline)
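If you want to script the choice, the guide above can be sketched as a small lookup. This is only one possible reading of the guide: the function name `pick_model`, the optional `speed` argument, and the exact cutoffs (8GB and up, 4GB and under, everything in between) are assumptions for illustration, not documented thresholds. The repo names come from the model table.

```shell
# pick_model VRAM_GB [PRIORITY]
# Prints a repo name from the model table for a whole-number VRAM size in GB.
# PRIORITY is a hypothetical argument: "balanced" (default) or "speed".
# The numeric cutoffs are an assumed reading of the selection guide.
pick_model() {
  vram_gb="$1"
  priority="${2:-balanced}"
  if [ "$priority" = "speed" ]; then
    echo "deepseek-moe-16b-q2-k-cpu-offload-gguf"       # Speed Critical
  elif [ "$vram_gb" -ge 8 ]; then
    echo "phi-3.5-moe-q8-0-cpu-offload-gguf"            # High-End GPU (8GB+)
  elif [ "$vram_gb" -le 4 ]; then
    echo "deepseek-moe-16b-q4-k-m-cpu-offload-gguf"     # Limited VRAM (4GB)
  else
    echo "phi-3.5-moe-q4-k-m-cpu-offload-gguf"          # best-balance default
  fi
}

pick_model 6
```

Here `pick_model 6` falls through to the best-balance Phi-3.5-MoE Q4 K-M pick, while `pick_model 12 speed` prefers the DeepSeek-MoE Q2 K build regardless of VRAM.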
Launch Commands
# Enable MoE CPU offloading magic
./shimmy serve --cpu-moe --port 11435 --model-path your-model.gguf
# Fine-tune performance for your hardware
./shimmy serve --n-cpu-moe 8 --port 11435 --model-path your-model.gguf
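The two launch modes above differ only in the offload flag, so a wrapper can pick one based on whether you pass a layer count. The sketch below is a dry run: `shimmy_serve_cmd` is a hypothetical helper that only prints the command, using just the flags already shown here (`--cpu-moe`, `--n-cpu-moe`, `--port`, `--model-path`). It does not quote paths containing spaces.

```shell
# shimmy_serve_cmd MODEL_PATH [N_CPU_MOE] [PORT]
# Prints (does not run) a launch command built from the flags shown above.
# Hypothetical helper for illustration; not shipped with Shimmy.
shimmy_serve_cmd() {
  model_path="$1"
  n_cpu_moe="${2:-}"      # empty means "offload all MoE layers"
  port="${3:-11435}"      # same port as the examples above
  if [ -n "$n_cpu_moe" ]; then
    offload="--n-cpu-moe $n_cpu_moe"
  else
    offload="--cpu-moe"
  fi
  echo "./shimmy serve $offload --port $port --model-path $model_path"
}

shimmy_serve_cmd your-model.gguf       # full MoE offload
shimmy_serve_cmd your-model.gguf 8     # layer-count variant
```

Printing the command first makes it easy to try a few values of N and compare VRAM use and speed before settling on one.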
# Standard OpenAI-compatible API - zero changes to your code!
curl -X POST http://localhost:11435/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "your-model", "prompt": "Explain quantum computing in simple terms"}'
Cross-Platform Binaries
Choose your platform and start the revolution:
| Platform | Binary | Features |
|---|---|---|
| Linux x86_64 | shimmy-linux-x86_64 | SafeTensors + llama.cpp + MoE |
| Linux ARM64 | shimmy-linux-arm64 | Native ARM64 + full MoE support |
| Windows x86_64 | shimmy-windows-x86_64.exe | CUDA GPU + MoE offloading |
| macOS Intel | shimmy-macos-intel | SafeTensors + Apple MLX |
| macOS Apple Silicon | shimmy-macos-arm64 | Metal GPU + MLX + MoE power |
All binaries include zero Python dependencies and native SafeTensors support.
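For install scripts, the table above maps cleanly onto the values reported by `uname`. A minimal sketch, assuming the binary names are exactly as listed: `shimmy_binary_for` is a hypothetical helper, and Windows is left out because `uname` output there depends on the shell environment.

```shell
# shimmy_binary_for OS ARCH
# Maps `uname -s` / `uname -m` style values to a binary name from the table.
# Hypothetical helper for illustration; Windows is intentionally not handled.
shimmy_binary_for() {
  case "$1/$2" in
    Linux/x86_64)               echo "shimmy-linux-x86_64" ;;
    Linux/aarch64|Linux/arm64)  echo "shimmy-linux-arm64" ;;
    Darwin/x86_64)              echo "shimmy-macos-intel" ;;
    Darwin/arm64)               echo "shimmy-macos-arm64" ;;
    *) echo "unsupported platform: $1/$2" >&2; return 1 ;;
  esac
}

# Detect the current machine; ignore the error on platforms not in the table
shimmy_binary_for "$(uname -s)" "$(uname -m)" || true
```

Linux ARM machines typically report `aarch64` while Apple Silicon Macs report `arm64`, which is why both spellings appear in the lookup.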
Why This Changes Everything
Before Shimmy v1.7.0: "I need a $10,000 GPU to run expert models"
After Shimmy v1.7.0: "I'm running 42B models on my gaming laptop"
This isn't just an update - it's sustainable AI democratization. Organizations can now:
- Deploy cutting-edge models without infrastructure overhaul
- Experiment with state-of-the-art architectures on existing hardware
- Scale AI capabilities based on actual needs, not hardware limits
- Maintain complete data sovereignty with on-premises deployment
Validated & Transparent
- Multi-Model Testing: 3 models validated across all platforms
- Real Baselines: Controlled A/B testing with actual measurements
- Production Quality: Comprehensive release gate system
- Open Development: Technical validation report available
Join the Revolution
- Start Now: cargo install shimmy
- Learn More: Technical Documentation
- Report Issues: GitHub Issues
- Upstream: Supporting llama-cpp-rs PR #839
Ready to revolutionize your AI deployment? The future of efficient model serving is here. Download Shimmy v1.7.0 and experience the MoE revolution!