# Making LLaVA Tiny via MoE Knowledge Distillation
### Blog by Ankan Das

## Motivation  
I’ve always been fascinated by multimodal models that can interpret both images and text, but most of them are too large to deploy on real-world devices.  
This paper caught my attention for its use of Mixture-of-Experts (MoE) and knowledge distillation to shrink a model like LLaVA while keeping its performance competitive.

## Historical Context  
Multimodal models like CLIP, Flamingo, and LLaVA have shown remarkable performance in combining vision and language.  
However, their size limits real-world usability. LLaVA-MoD responds to this by combining MoE with distillation, and the paper reports strong results with fewer parameters and less training data than larger baselines.

## What I Learned from the Paper  

### 📌 Architecture Overview  
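The model processes the two modalities as follows:  
`Image → CLIP Vision Encoder (frozen) → Vision-Language Adapter → visual tokens`  
`Text → tokenizer and embedding layer → text tokens`  
→ The visual and text tokens are concatenated and fed into a Language Model with Mixture-of-Experts (MoE).  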
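
To make the data flow concrete, here is a minimal NumPy sketch of how visual and text tokens might be combined before they reach the language model. The dimensions, the random stand-ins for the CLIP features and text embeddings, and the single linear layer used as the adapter are all assumptions for illustration; the actual adapter in LLaVA-MoD may be structured differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration
num_patches, vision_dim = 16, 32   # image patches from the frozen vision encoder
num_text_tokens, llm_dim = 5, 64   # text tokens and the language model's hidden size

# Stand-in for frozen CLIP patch features (a real model would compute these)
image_features = rng.normal(size=(num_patches, vision_dim))

# Stand-in for the vision-language adapter: one linear projection into LLM space
adapter_weight = rng.normal(size=(vision_dim, llm_dim)) * 0.02
visual_tokens = image_features @ adapter_weight

# Stand-in for the text embedding lookup
text_tokens = rng.normal(size=(num_text_tokens, llm_dim))

# The language model sees one sequence: visual tokens followed by text tokens
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print("LLM input shape:", llm_input.shape)
```

The point to notice is that the adapter's only job is to map vision features into the same space as the text embeddings, so the language model can treat both as one sequence.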

### 📌 What is MoE?  
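Instead of sending every token through a single dense feed-forward block, an MoE layer holds several expert feed-forward blocks and uses a **router** to pick the 2 highest-scoring experts for each token. Only those experts run, so the per-token computation stays well below what running every expert would cost.  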

### 📌 Progressive Distillation  
1. **Mimic Distillation** – The student is trained to match the teacher's output distribution using a KL-divergence loss.  
2. **Preference Distillation** – The student is shown a preferred and a rejected response, and learns to favor the preferred one.  

According to the paper, the preference stage helps reduce hallucinations and improves reasoning.
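
The mimic stage can be sketched as a KL divergence between the teacher's and the student's next-token distributions. The logits below are made up, and the direction KL(teacher || student) is the common choice in distillation rather than something confirmed from the paper's code.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def kl_divergence(teacher_logits, student_logits):
    """Mean KL(teacher || student) over token positions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Hypothetical logits: 2 token positions, vocabulary of 4
teacher_logits = np.array([[2.0, 0.5, 0.1, -1.0],
                           [0.0, 1.5, 0.3, 0.2]])
student_logits = np.array([[1.0, 0.8, 0.2, -0.5],
                           [0.1, 0.9, 0.4, 0.3]])

mimic_loss = kl_divergence(teacher_logits, student_logits)
same_loss = kl_divergence(teacher_logits, teacher_logits)

print("Mimic (KL) loss:", round(mimic_loss, 4))
print("Loss when student equals teacher:", same_loss)
```

The loss is positive whenever the two distributions differ and falls to zero when the student reproduces the teacher exactly, which is what makes it usable as a training signal.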

### Simulating Top-k Routing with Scores

```python
import numpy as np

# Simulate router scores for 4 experts
scores = np.array([0.2, 0.5, 0.1, 0.9])
top_k = scores.argsort()[-2:][::-1]

print("Top-2 selected experts (highest scoring):", top_k)
```
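
Picking the top-2 indices is only half of an MoE layer: the outputs of the selected experts are then mixed using the router's weights. The sketch below assumes a softmax over the two selected scores and uses a single weight matrix as each "expert". Both are illustrative choices, not details taken from the paper, and `moe_forward` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, hidden_dim, k = 4, 8, 2

# Hypothetical experts: each is one weight matrix standing in for a feed-forward block
expert_weights = rng.normal(size=(num_experts, hidden_dim, hidden_dim)) * 0.1
router_weight = rng.normal(size=(hidden_dim, num_experts))

def moe_forward(token):
    # Router produces one score per expert for this token
    scores = token @ router_weight
    top_k = scores.argsort()[-k:][::-1]

    # Renormalize the selected scores so the gate weights sum to 1
    selected = scores[top_k]
    gates = np.exp(selected - selected.max())
    gates /= gates.sum()

    # Only the selected experts are evaluated
    output = sum(g * (token @ expert_weights[i]) for g, i in zip(gates, top_k))
    return output, top_k, gates

token = rng.normal(size=hidden_dim)
output, chosen, gates = moe_forward(token)

print("Chosen experts:", chosen)
print("Gate weights:", gates.round(3))
print("Output shape:", output.shape)
```

With 4 experts and `k = 2`, each token touches only half of the expert parameters, which is where the compute saving comes from.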

### Simulating Preference Distillation Logic

```python
# Simulate preference scoring in distillation

teacher_scores = {
    "Answer_A+": 0.85,  # Preferred answer
    "Answer_A-": 0.45   # Less preferred
}

student_scores = {
    "Answer_A+": 0.60,
    "Answer_A-": 0.55
}

# Preference loss logic: the student should separate A+ from A- at least as much as the teacher does
teacher_margin = teacher_scores["Answer_A+"] - teacher_scores["Answer_A-"]
student_margin = student_scores["Answer_A+"] - student_scores["Answer_A-"]

# Round for display so floating-point noise does not clutter the output
print("Teacher preference margin:", round(teacher_margin, 2))
print("Student preference margin:", round(student_margin, 2))

if student_margin < teacher_margin:
    print("Apply loss to reinforce preference toward A+")
else:
    print("No adjustment needed")
```
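
The threshold check above is a toy; in practice, preference learning is usually expressed as a differentiable loss. One common formulation is a DPO-style objective, sketched here under the assumption that the teacher serves as the reference model. The log-probabilities, the `beta` value, and the function name `preference_loss` are hypothetical, and this is not claimed to be the paper's exact objective.

```python
import math

def preference_loss(student_logp_pos, student_logp_neg,
                    teacher_logp_pos, teacher_logp_neg, beta=0.1):
    """DPO-style loss: -log sigmoid(beta * (student margin - teacher margin))."""
    student_margin = student_logp_pos - student_logp_neg
    teacher_margin = teacher_logp_pos - teacher_logp_neg
    logit = beta * (student_margin - teacher_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-logit))

# Hypothetical sequence log-probabilities for a preferred (+) and rejected (-) answer
teacher = {"pos": -4.0, "neg": -6.0}

weak_student = preference_loss(-5.0, -5.2, teacher["pos"], teacher["neg"])
strong_student = preference_loss(-3.0, -8.0, teacher["pos"], teacher["neg"])

print("Loss when student barely prefers A+:", round(weak_student, 4))
print("Loss when student strongly prefers A+:", round(strong_student, 4))
```

The loss shrinks as the student widens its gap between the preferred and rejected answers relative to the teacher, so gradient descent pushes the student toward the preferred response.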

## Reflections  

**(a) What surprised me:**  
How effective preference distillation was in reducing hallucinations, even better than instruction tuning in some cases.  

**(b) Scope for improvement:**  
The MoE model could be made even lighter by using quantization or low-rank approximation. It would also be interesting to try this architecture on audio+text data.

## References  
- [LLaVA-MoD Paper (ResearchGate)](https://www.researchgate.net/publication/383494370_LLaVA-MoD_Making_LLaVA_Tiny_via_MoE_Knowledge_Distillation)  
- [GitHub Repo](https://github.com/shufangxun/LLaVA-MoD)  
- Hugging Face Transformers  
- CLIP Model  
- Qwen-VL Documentation