# LLM Model Zoo: How to Choose and Compare Language Models

This notebook covers the landscape of LLMs, how to evaluate them, and how to select the right (open-source) model for your use case.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hyperskill-content/hyperskill-ml-notebooks/blob/main/Attention_developments/llm_model_zoo.ipynb)

## 🚀 Prerequisites

Make sure you're comfortable with the topics below before starting this notebook:

| # | Topic (clickable links)|
|---|-------|
| 1 | **[Introduction to LLMs](https://hyperskill.org/learn/step/51833)**  |
| 2 | **[Attention Mechanisms in Transformers](https://hyperskill.org/learn/step/52060)** |


*If you're new to any item above, review it quickly, then dive back in here.*

## 1. What's the Idea Behind Model Zoos? <a id="sec-idea"></a>

An LLM model zoo is a collection or repository of available large language models that users can browse, compare, and select from based on their specific needs.

### Why Do We Need So Many Models?

| **The Problem** | **The Solution** |
|-----------------|------------------|
| One model can't be the best on all tasks | Different models for different tasks |
| Some tasks need speed, others need quality | Fast models vs. accurate models |
| Not everyone has the required hardware | Small models for laptops, big models for servers |
| Different languages and cultures | Specialized models for specific regions |

**Model Zoo** = A collection of pre-trained models you can choose from.

## 2. Understanding the Open LLM Leaderboard <a id="sec-leaderboard"></a>

The [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) evaluates models on multiple benchmarks and gives each one an overall score. It covers only open models hosted on Hugging Face (no GPT-4, Claude, etc.) and uses standardized benchmarks so that results are reproducible.

### **Key Evaluation Tasks**

The leaderboard uses the following benchmarks:

| Benchmark | Description |
|-----------|-------------|
| **IFEval** | Measures how well models follow specific instructions and formatting requirements. Tests adherence to constraints like word limits, specific formats, or structural requirements rather than just providing good content. |
| **BBH** | Evaluates complex reasoning across challenging tasks from the BIG-Bench collection. Focuses on problems requiring multi-step reasoning, planning, and sophisticated cognitive abilities. |
| **MATH** | Assesses mathematical problem-solving capabilities through competition-level mathematics problems. Requires step-by-step reasoning across algebra, geometry, and calculus. |
| **GPQA** | Tests graduate-level knowledge in science fields including biology, chemistry, and physics. Questions are written by domain experts and designed to be "Google-proof": skilled non-experts struggle to answer them even with unrestricted web access. |
| **MUSR** | Evaluates multi-step soft reasoning that requires multiple inference steps and handling of ambiguous or incomplete information. Tests nuanced thinking used in everyday problem-solving. |
| **MMLU-PRO** | Enhanced version of MMLU with more challenging graduate-level questions across 14 subject areas. Offers ten answer choices per question instead of four, which makes lucky guesses less likely and helps avoid saturation issues. |
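
As a rough illustration of how per-benchmark results can roll up into a single overall score, the sketch below rescales each raw accuracy so that random guessing maps to 0 and a perfect result maps to 100, then averages the rescaled values. The raw scores and baselines are made-up placeholders, and the `normalize` helper is hypothetical; the leaderboard's actual aggregation may differ in detail.

```python
def normalize(raw: float, baseline: float) -> float:
    """Rescale a raw accuracy in [0, 1] so the random baseline maps to 0 and 1.0 maps to 100."""
    if raw <= baseline:
        return 0.0
    return (raw - baseline) / (1.0 - baseline) * 100.0


# Hypothetical (raw accuracy, random-guess baseline) pairs; not real results.
raw_scores = {
    "GPQA": (0.30, 0.25),      # assumes 4 answer choices -> 25% by chance
    "MMLU-PRO": (0.46, 0.10),  # assumes 10 answer choices -> 10% by chance
    "IFEval": (0.70, 0.0),     # assumes no meaningful chance baseline
}

normalized = {name: normalize(raw, base) for name, (raw, base) in raw_scores.items()}
overall = sum(normalized.values()) / len(normalized)

for name, value in normalized.items():
    print(f"{name}: {value:.1f}")
print(f"Overall: {overall:.1f}")
```

Rescaling this way keeps a benchmark with a high chance baseline from inflating the average.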

### **How to Read the Leaderboard**

**Step-by-step guide:**

1. **Overall Score**: Higher = better (usually 0-100 scale)
2. **Model Size**: Look at parameters (7B, 13B, 70B, etc.)
3. **License**: Check if you can use it commercially
4. **Date**: Newer models often perform better

**Pro tip**: Don't just look at the top score! Consider:
- **Size vs Performance**: A 7B model scoring 75 might be better than a 70B model scoring 80 for your use case
- **Task-specific scores**: If you need math, look at the MATH score specifically
- **Your hardware**: Can you actually run the top model?
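
The size-versus-score trade-off above can be sketched as a small helper that picks the highest-scoring model whose weights fit a memory budget. The candidate names and scores are hypothetical (they mirror the 7B/70B example in the text), and the 2 bytes per parameter figure assumes FP16 weights with no extra overhead.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    params_b: float  # parameter count, in billions
    score: float     # overall leaderboard score (hypothetical)


def best_that_fits(candidates, memory_gb, bytes_per_param=2.0):
    """Return the highest-scoring candidate whose weights fit in memory_gb, or None."""
    fitting = [c for c in candidates if c.params_b * bytes_per_param <= memory_gb]
    return max(fitting, key=lambda c: c.score) if fitting else None


candidates = [
    Candidate("small-7b", 7, 75.0),
    Candidate("large-70b", 70, 80.0),
]

print(best_that_fits(candidates, memory_gb=24))   # only the 7B model fits
print(best_that_fits(candidates, memory_gb=160))  # both fit, so the 70B model wins on score
```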

### **Using the Leaderboard for Model Selection**

**Example**: You need a model for customer support.

1. **Go to**: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/
2. **Filter by size**: If you have 16GB RAM, look at 7B-8B models
3. **Pick top 3 candidates**: Maybe Qwen2.5-7B, Llama-3.1-8B, and (if memory allows) the slightly larger Mistral-Nemo-12B
4. **Test them**: Download and compare on your specific task
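
The filtering and shortlisting steps can be sketched in a few lines once leaderboard rows are available as dictionaries. The rows below are placeholders rather than real leaderboard data, and ranking by `ifeval` is an assumption that instruction following matters most for customer support.

```python
# Hypothetical leaderboard rows; model names and scores are placeholders.
rows = [
    {"model": "model-a", "params_b": 7.6, "ifeval": 71.0},
    {"model": "model-b", "params_b": 8.0, "ifeval": 74.5},
    {"model": "model-c", "params_b": 12.2, "ifeval": 63.0},
    {"model": "model-d", "params_b": 70.6, "ifeval": 85.0},
    {"model": "model-e", "params_b": 3.8, "ifeval": 55.0},
]


def shortlist(rows, max_params_b, metric, k=3):
    """Keep models at or below max_params_b, then return the top k by the chosen metric."""
    fitting = [r for r in rows if r["params_b"] <= max_params_b]
    return sorted(fitting, key=lambda r: r[metric], reverse=True)[:k]


top = shortlist(rows, max_params_b=13, metric="ifeval")
print([r["model"] for r in top])
```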

### **Leaderboard Limitations**

**Remember**: The leaderboard is helpful but not perfect!

- **Doesn't test everything**: No coding, multilingual, or creative writing benchmarks
- **Standardized tests**: Real-world performance might differ
- **Gaming possible**: Some models might be specifically trained for these tests
- **No commercial models**: Can't compare with GPT, Claude directly

**Best practice**: Use the leaderboard as a starting point, then test models on your specific tasks.
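
One lightweight way to test candidates on your own task is a small exact-match harness. In this sketch the test set is invented and `model_a` and `model_b` are stand-in functions; in practice each would wrap a call to a real model.

```python
def exact_match_rate(model_fn, examples):
    """Fraction of examples where the model's answer equals the expected text (case-insensitive)."""
    hits = sum(
        1 for prompt, expected in examples
        if model_fn(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(examples)


# Hypothetical support-style test set: (prompt, expected answer).
examples = [
    ("What is your refund window?", "30 days"),
    ("Which plan includes priority support?", "Pro"),
    ("Do you ship internationally?", "Yes"),
]


# Stand-ins for real models: replace these with functions that call your candidates.
def model_a(prompt):
    answers = {"What is your refund window?": "30 days", "Do you ship internationally?": "yes"}
    return answers.get(prompt, "I don't know")


def model_b(prompt):
    return "30 days"


scores = {
    "model_a": exact_match_rate(model_a, examples),
    "model_b": exact_match_rate(model_b, examples),
}
print(scores)
```

Exact match is deliberately strict; for free-form answers, a looser metric or manual review is usually more informative.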

## 3. Model Selection: Size, Hardware & Performance <a id="sec-selection"></a>

Choosing the right model isn't just about performance - you need to consider **hardware requirements**, **inference speed**, and **practical constraints**. Let's break this down systematically!

### **Hardware Requirements by Model Size**

**Memory requirements** (approximate values for inference):

| Model Size | RAM Needed | GPU Memory | Examples | Best For |
|------------|------------|------------|----------|----------|
| **1-3B** | 4-8GB | 6-8GB | Phi-3-mini, Llama-3.2-1B | Laptops, mobile, edge |
| **7-8B** | 16-32GB | 12-16GB | Llama-3.1-8B, Qwen2.5-7B | Workstations, small servers |
| **13-14B** | 32-64GB | 24-32GB | Qwen2.5-14B, Phi-3-medium | Professional workstations |
| **30-50B** | 64-128GB | 48-80GB | Mixtral-8x7B (MoE: 47B total, 13B active) | Server clusters |
| **70B+** | 128GB+ | 80GB+ | Llama-3.1-70B, Qwen2.5-72B | Data center, cloud |
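
The figures in the table can be sanity-checked with back-of-the-envelope arithmetic: the weights take roughly (number of parameters) × (bytes per parameter). The sketch below applies that rule; the 20% overhead for activations and KV cache is an assumed placeholder, and real usage depends on context length, batch size, and runtime.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def inference_memory_gb(params_billion: float, precision: str = "fp16", overhead: float = 0.2) -> float:
    """Weights plus an assumed fractional overhead for activations and KV cache."""
    return weight_memory_gb(params_billion, precision) * (1.0 + overhead)


for size in (1, 8, 14, 70):
    estimates = {p: round(inference_memory_gb(size, p), 1) for p in BYTES_PER_PARAM}
    print(f"{size}B: {estimates}")
```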

### **Performance Considerations**

**Inference Speed Factors:**
1. **Model size**: Bigger = slower (generally)
2. **Architecture**: MoE models (e.g., Mixtral) activate only part of their parameters per token, so they can run faster than their total size suggests
3. **Hardware**: GPU > CPU, newer GPUs > older
4. **Optimization**: Quantization, optimized libraries (vLLM, TGI)
5. **Batch size**: Multiple requests together = more efficient
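
To make the size and architecture factors concrete, here is a rough upper-bound estimate of single-request decoding speed. It assumes generation is limited by memory bandwidth, so each new token requires reading every active weight once. The 1000 GB/s bandwidth is a placeholder, and real throughput is typically lower.

```python
def decode_tokens_per_sec(active_params_b: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Upper-bound estimate: bandwidth divided by the bytes of active weights read per token."""
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH = 1000.0  # assumed memory bandwidth in GB/s (placeholder value)

dense_8b = decode_tokens_per_sec(8, 2.0, BANDWIDTH)         # dense 8B model, FP16
dense_47b = decode_tokens_per_sec(47, 2.0, BANDWIDTH)       # hypothetical dense 47B model, FP16
moe_13b_active = decode_tokens_per_sec(13, 2.0, BANDWIDTH)  # MoE with 47B total, 13B active

print(f"Dense 8B:  {dense_8b:.1f} tokens/sec")
print(f"Dense 47B: {dense_47b:.1f} tokens/sec")
print(f"MoE 13B active: {moe_13b_active:.1f} tokens/sec")
```

Under these assumptions the MoE model decodes several times faster than a dense model of the same total size, although it still needs memory for all of its parameters.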

### **The Hardware-Performance Matrix**

| Your Hardware | Recommended Models | Expected Performance |
|---------------|-------------------|---------------------|
| **Laptop (8-16GB RAM)** | Phi-3-mini, Qwen2.5-3B | Good for simple tasks, 2-5 tokens/sec |
| **Workstation (32GB RAM, RTX 4090)** | Llama-3.1-8B, Qwen2.5-7B | Excellent for most tasks, 20-50 tokens/sec |
| **Server (64GB RAM, A100)** | Qwen2.5-14B, Mixtral-8x7B | Professional quality, 50-100 tokens/sec |
| **Cloud/Cluster (Multi-GPU)** | Llama-3.1-70B, Qwen2.5-72B | Top-tier quality, 100+ tokens/sec |

### **Optimization Techniques**

**Make models run faster and use less memory:**

1. **Quantization**: Store weights at lower precision (FP16, INT8, INT4) to cut memory use, usually at a small cost in quality
2. **Model Optimization Libraries**:
   - **vLLM**: Fast inference server
   - **TensorRT-LLM**: NVIDIA GPU acceleration
   - **Ollama**: Easy local deployment
   - **llama.cpp**: CPU-optimized inference

3. **Hardware-Specific Optimizations**:
   - **Apple Silicon**: Use MLX for M1/M2/M3 Macs
   - **NVIDIA GPUs**: Use CUDA, TensorRT
   - **AMD GPUs**: Use ROCm
   - **Intel CPUs**: Use optimized BLAS libraries
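
To show what reducing precision means in practice, the following toy sketch applies symmetric per-tensor INT8 quantization to a handful of floats and measures the round-trip error. It is a simplified illustration in pure Python; production libraries use more elaborate schemes (per-channel or block-wise scales, for example).

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # toy "weights"
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("Quantized:", q)
print("Scale:", scale)
print("Max round-trip error:", max_error)
```

Each value now needs one byte instead of four (FP32) or two (FP16), at the cost of a rounding error of at most half the scale.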

### **Model Selection Decision Tree**

```
What hardware do you have?
├── Laptop (8-16GB) → Phi-3-mini, Qwen2.5-3B, or use APIs
├── Workstation (RTX 3090/4090) → Llama-3.1-8B, Qwen2.5-7B
├── Server (A100/H100) → Qwen2.5-14B, Mixtral-8x7B
└── Multi-GPU cluster → Llama-3.1-70B, Qwen2.5-72B

What's your primary task?
├── General chat → Llama-3.1 series
├── Multilingual → Qwen2.5 series  
├── Coding → Qwen2.5-Coder or Code Llama
├── Math/reasoning → Qwen2.5-Math or models with high MATH scores
└── Efficiency focus → Phi-3 or Mixtral series

How much can you spend on hardware?
├── $0-500 → Use smaller models or cloud APIs
├── $500-3000 → RTX 4090 + 7B-8B models
├── $3000-10000 → Workstation + 14B models
└── $10000+ → Server-grade + 70B+ models
```
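
The first two branches of the tree can also be written as simple lookups. This is a minimal sketch: the dictionary keys and the `recommend` helper are hypothetical names, and the prefix-matching heuristic is a simplification chosen for illustration.

```python
HARDWARE_TO_MODELS = {
    "laptop": ["Phi-3-mini", "Qwen2.5-3B"],
    "workstation": ["Llama-3.1-8B", "Qwen2.5-7B"],
    "server": ["Qwen2.5-14B", "Mixtral-8x7B"],
    "cluster": ["Llama-3.1-70B", "Qwen2.5-72B"],
}

TASK_TO_FAMILY = {
    "chat": "Llama-3.1",
    "multilingual": "Qwen2.5",
    "coding": "Qwen2.5-Coder",
    "math": "Qwen2.5-Math",
    "efficiency": "Phi-3",
}


def recommend(hardware, task):
    """Return models that fit the hardware, with the task's preferred family listed first."""
    models = HARDWARE_TO_MODELS[hardware]
    # Match on the family's base name, e.g. "Qwen2.5-Coder" -> "Qwen2.5".
    base = TASK_TO_FAMILY[task].split("-")[0]
    # sorted() is stable, so models outside the preferred family keep their original order.
    return sorted(models, key=lambda m: not m.startswith(base))


print(recommend("workstation", "multilingual"))
print(recommend("laptop", "efficiency"))
```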

### **Practical Tips for Model Selection**

1. **Start small**: Begin with 7B-8B models, upgrade only if needed
2. **Check the leaderboard**: Compare models of similar size
3. **Test on your data**: Leaderboard scores don't guarantee good performance on your task
4. **Consider fine-tuning**: A smaller fine-tuned model often beats a larger general model
5. **Factor in deployment**: Easier deployment might be worth slight performance loss

### **Real-World Example**

**Scenario**: Building a customer support chatbot

1. **Check leaderboard**: Suppose Qwen2.5-7B scores 76.9 (good) and Llama-3.1-8B scores 77.4 (slightly better); these numbers are illustrative, so check the live leaderboard for current values
2. **Check hardware**: You have RTX 4090 (24GB) - both models will fit
3. **Consider specialization**: Customer support needs good instruction following and multilingual support
4. **Decision**: Try both models, but Qwen2.5-7B might be better for multilingual customers
5. **Optimization**: Use a quantized build (for example, a 4-bit GGUF file run with llama.cpp or Ollama) to reduce memory use and speed up inference

## 4. Deep Dive: Major Open-Source Families <a id="sec-deep-dive"></a>

Analysis of major open-source model families, including specifications and optimal use cases.

### Llama (Meta)

The Llama model family ranges from 1B to 405B parameters across versions 3.1 and 3.2. It offers strong instruction following and general knowledge capabilities, with Llama 3.2 adding vision support and mobile-optimized variants. The family has the largest open-source community and uses the Llama Community License for commercial applications. Llama models are widely used for both research and commercial deployments due to their reliable performance and extensive ecosystem support.

**Limitations:**
- Primarily English-focused
- Large models require significant computational resources
- License restrictions for very large deployments

**Best Use Cases:**
- General-purpose applications
- Domain-specific fine-tuning
- Commercial products requiring reliable performance
- Applications requiring strong instruction adherence

**Hugging Face Models:** https://huggingface.co/meta-llama/collections

---

### Qwen (Alibaba)

The Qwen model family is a series of LLMs ranging from 0.5B to 235B parameters across multiple generations (Qwen, Qwen1.5, Qwen2, Qwen2.5, and Qwen3). The family excels in multilingual capabilities, particularly Chinese and English, and offers specialized variants for coding (Qwen-Coder), mathematics (Qwen-Math), reasoning (QwQ), and multimodal tasks (Qwen-VL). The latest Qwen3 series adds efficient MoE variants and advanced reasoning capabilities, making it competitive with leading commercial models.

**Limitations:**
- Smaller community compared to Llama
- Documentation often primarily in Chinese
- Newer family with less established ecosystem

**Best Use Cases:**
- Multilingual applications
- Software development and coding assistance
- Mathematical and scientific reasoning tasks
- Research projects requiring permissive licensing

**Hugging Face Models:** https://huggingface.co/Qwen/collections

---

### Mixtral (Mistral AI)

The Mixtral model family uses a Mixture-of-Experts (MoE) architecture, with models like Mixtral 8x7B (47B total, 13B active) and 8x22B (141B total, 39B active); Mistral AI also offers the dense Mistral NeMo 12B model. MoE models provide efficient inference by activating only a subset of parameters per token while retaining the quality of a much larger model. Context lengths range from 32K tokens (Mixtral 8x7B) up to 128K tokens (Mistral NeMo). Mixtral models are designed for European regulatory compliance and offer strong reasoning capabilities. They are suitable for high-throughput deployments where computational efficiency and regulatory compliance are priorities.

**Limitations:**
- MoE deployment complexity requiring specialized infrastructure
- High memory requirements despite efficiency gains
- Smaller ecosystem than Llama

**Best Use Cases:**
- High-throughput serving requiring efficiency
- European deployments needing regulatory compliance
- Long document processing applications
- Reasoning-intensive tasks

**Hugging Face Models:** https://huggingface.co/mistralai/models
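
The idea of activating only a subset of parameters per token can be illustrated with a toy top-2 router over eight "experts". This is a conceptual sketch in pure Python with made-up router logits and trivial experts, not Mixtral's actual implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k experts with the highest router logits and normalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only; the rest are never called."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Eight toy experts: each simply scales its input by a different factor (1 to 8).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up router scores for one token

selected = route_top_k(logits)
output = moe_forward(1.0, experts, logits)
print("Selected experts and weights:", selected)
print("Output:", output)
```

Only two of the eight experts run for this token, which is why compute per token tracks the active parameter count rather than the total.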

---

### Phi (Microsoft)

The Phi model family, developed by Microsoft, includes Phi-3 (3.8B-14B parameters) and the newer Phi-4 (14B parameters, specializing in complex reasoning and mathematics). Additional variants include Phi-4-mini with enhanced multilingual support and Phi-4-multimodal for vision and audio processing. The family achieves high performance per parameter through training on curated synthetic data and is optimized for mobile and edge deployment, with context lengths of up to 128K tokens in some variants. Licensed under MIT, Phi models offer the most permissive licensing of the four families covered here and are designed for resource-constrained environments where efficiency is prioritized over absolute performance.

**Limitations:**
- Notable weakness in following complex or specific instructions
- Limited multilingual support
- Smaller community and ecosystem

**Best Use Cases:**
- Mobile and edge applications
- Resource-constrained environments
- Applications prioritizing model size over absolute performance
- Prototyping and proof-of-concept development

**Hugging Face Models:** https://huggingface.co/microsoft/collections

---

### Family Comparison

| Feature | Llama | Qwen | Mixtral | Phi |
|---------|-------|------|---------|-------|
| **Primary Use** | General purpose | Multilingual | Efficiency | Edge/Mobile |
| **License** | Custom (commercial OK) | Apache 2.0 (most sizes) | Apache 2.0 | MIT |
| **Multilingual** | Good | Excellent | Good | Limited |
| **Community** | Largest | Growing | Medium | Small |
| **Hardware Requirements** | High (large models) | Medium-High | Medium | Low |
| **Deployment Complexity** | Low | Low | High (MoE) | Very Low |

### Selection Guidelines

**Choose Llama for**: Established ecosystem, general-purpose applications, large community support    
**Choose Qwen for**: Multilingual requirements, coding/mathematics tasks, permissive licensing      
**Choose Mixtral for**: Efficiency-focused deployments, EU compliance, reasoning tasks    
**Choose Phi for**: Resource constraints, edge deployment, mobile applications

### Evaluation Resources

**Performance Benchmarks**:
- LMSYS Chatbot Arena: https://chat.lmsys.org/?leaderboard
- Hugging Face Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/
- Artificial Analysis: https://artificialanalysis.ai/

**Model Access**:
- Hugging Face Hub: https://huggingface.co/models
- Ollama: https://ollama.ai/library
- Together AI: https://www.together.ai/

Note: Model rankings change frequently. Check current benchmarks before making decisions.

## 5. Hands-on: Working with Hugging Face Models <a id="sec-hands-on"></a>

Time to get practical! We'll explore how to discover, load, and compare different Hugging Face models. This section focuses on **open-source models only**.

### **Discovering Models on Hugging Face**

**Step 1: Using the Hugging Face Hub**
- Go to https://huggingface.co/models
- Filter by task (text-generation, text-classification, etc.)
- Sort by trending, most downloads, or recently updated
- Check model cards for details, examples, and benchmarks
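
The same filters can be captured in a shareable link. The sketch below builds a Hub listing URL with the standard library; the `pipeline_tag` and `sort` query parameter names are an assumption based on the URLs the website produces when you click the filters, and `hub_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

HUB_MODELS_URL = "https://huggingface.co/models"


def hub_search_url(task, sort="trending"):
    """Build a Hub model-listing URL filtered by task and sorted by the given key."""
    query = urlencode({"pipeline_tag": task, "sort": sort})
    return f"{HUB_MODELS_URL}?{query}"


print(hub_search_url("text-generation", sort="downloads"))
```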

**Step 2: Reading Model Cards**
Every model has a **model card** with:
- **Performance metrics**: scores on benchmarks
- **Hardware requirements**: memory and GPU needs
- **Usage examples**: code to get started
- **License information**: whether commercial use is allowed
- **Intended use cases**: what the model is good or bad at


#### **Example 1: Llama Family - General Purpose Chat**

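**Model**: `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (1.1B parameters), an openly downloadable model built on the Llama architecture. It stands in here for `meta-llama/Llama-3.2-1B-Instruct` (1B parameters), which is gated and requires accepting Meta's license on Hugging Face before download.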
- **Memory needed**: ~2-4GB
- **Best for**: General conversation, instruction following
- **Hardware**: Runs on laptops, CPUs

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example 1: TinyLlama (Llama architecture, ~1.1B parameters)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the weights on CPU; "auto" keeps the dtype stored in the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Explain quantum computing in simple terms."
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": prompt}
]

# Format the conversation with the model's own chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response (do_sample=True is required for temperature to take effect)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True
)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
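
The slicing step near the end of the code above is worth isolating: `generate` returns the prompt tokens followed by the new tokens, so the prompt must be cut off before decoding. The toy example below reproduces that step with plain lists; the token IDs are arbitrary stand-ins for real tokenizer output.

```python
# Toy token IDs standing in for tokenizer output (values are arbitrary).
input_ids_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 44]]  # prompt tokens followed by new tokens

# Same slicing pattern as in the generation code: skip len(prompt) tokens per sequence.
new_tokens = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(input_ids_batch, generated_batch)
]
print(new_tokens)  # [[42, 43, 44]]
```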

#### **Example 2: Qwen Family - Multilingual & Math**

**Model**: `Qwen/Qwen2.5-0.5B-Instruct` (0.5B parameters)
- **Memory needed**: ~1GB
- **Best for**: Multilingual tasks, basic math, coding
- **Special**: Excellent for non-English languages

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example 2: Qwen2.5
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

#### **Example 3: Phi-3 Family - Efficient & Mobile-Ready**

**Model**: `microsoft/Phi-3-mini-4k-instruct` (3.8B parameters)
- **Memory needed**: ~4-8GB
- **Best for**: Efficient inference, mobile deployment, reasoning
- **Special**: Great performance per parameter

#### **Model Comparison Function**

Let's create a function to compare multiple models side-by-side on the same task:

```python
import gc

from transformers import AutoModelForCausalLM, AutoTokenizer


def compare_models(model_names, prompt, max_new_tokens=200):
    """Generate a response to `prompt` with each model and return {model_name: response}."""
    results = {}
    for model_name in model_names:
        try:
            print(f"\nTesting {model_name}...")

            tokenizer = AutoTokenizer.from_pretrained(model_name)
            model = AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype="auto",
                device_map="cpu"
            )

            if tokenizer.chat_template is not None:
                # Instruct/chat models: wrap the prompt in the model's chat template
                messages = [{"role": "user", "content": prompt}]
                text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            else:
                # Models without a chat template (e.g. DialoGPT): use the raw prompt
                text = prompt

            model_inputs = tokenizer([text], return_tensors="pt")
            generated_ids = model.generate(
                **model_inputs,
                max_new_tokens=max_new_tokens,
                pad_token_id=tokenizer.eos_token_id
            )
            # Keep only the newly generated tokens (drop the prompt tokens)
            new_tokens = generated_ids[0][model_inputs.input_ids.shape[1]:]
            response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

            results[model_name] = response
            print(f"Response: {response}")

            # Free memory before loading the next model
            del model, tokenizer
            gc.collect()

        except Exception as e:
            print(f"Error with {model_name}: {e}")
            results[model_name] = f"Error: {e}"
    return results


models_to_compare = [
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "Qwen/Qwen2.5-0.5B-Instruct"
]

results = compare_models(models_to_compare, "What are the benefits of renewable energy?")
print(f"\nComparison completed for {len(results)} models.")
```
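
Response quality is only half of a comparison; speed matters too. The sketch below times any generation function and reports tokens per second. `fake_generate` is a stand-in that sleeps briefly and returns dummy tokens; in practice you would pass a function that calls `model.generate` and returns the new tokens.

```python
import time


def measure_throughput(generate_fn, prompt, runs=3):
    """Average tokens per second of generate_fn, which must return a list of new tokens."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += len(generate_fn(prompt))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt):
    """Stand-in for a real model call: waits briefly, then returns dummy tokens."""
    time.sleep(0.01)
    return prompt.split() * 5


tokens_per_sec = measure_throughput(fake_generate, "What are the benefits of renewable energy?")
print(f"Throughput: {tokens_per_sec:.0f} tokens/sec")
```

Averaging over several runs smooths out one-off delays such as the first call's warm-up.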

### **Performance Optimization Tips**

If you're running into memory issues, here are some optimization techniques:

```python
import gc
import importlib.util

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

print("\nChecking available optimizations...")

# BitsAndBytesConfig ships with transformers, so check for the bitsandbytes
# package itself, which quantized loading actually depends on.
quantization_available = importlib.util.find_spec("bitsandbytes") is not None
if quantization_available:
    print("bitsandbytes available for quantization")
else:
    print("bitsandbytes not available")
    print("Install with: pip install bitsandbytes")

print("\nDemonstrating memory-efficient model loading:")

try:
    # Use a very small model for demonstration
    model_name = "distilgpt2"  # Small, widely compatible model

    print(f"Loading {model_name} with memory optimizations...")

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Method 1: Low CPU memory usage
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float32,  # Compatible dtype
        device_map="cpu",  # CPU only for compatibility
        low_cpu_mem_usage=True,  # Avoid holding a second copy of the weights while loading
    )

    print("Model loaded with memory optimizations!")

    # Quick test
    test_text = "Artificial intelligence is"
    inputs = tokenizer(test_text, return_tensors="pt")

    # no_grad() skips gradient tracking, which saves memory during inference
    with torch.no_grad():
        outputs = model.generate(
            **inputs,  # passes input_ids and attention_mask together
            max_new_tokens=15,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id
        )

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Test output: {result}")

    # Clean up
    del model, tokenizer

except Exception as e:
    print(f"Memory optimization demo failed: {e}")

# Method 2: Using Pipeline for convenience
print("\nUsing Pipeline for simple inference:")

try:
    # Pipeline bundles tokenization, generation, and decoding into one call
    generator = pipeline(
        "text-generation",
        model="distilgpt2",
        torch_dtype=torch.float32,
        device="cpu",  # CPU for compatibility
        model_kwargs={"low_cpu_mem_usage": True}
    )

    result = generator(
        "The future of AI is",
        max_new_tokens=20,
        temperature=0.7,
        do_sample=True
    )

    print("Pipeline created successfully!")
    print(f"Pipeline output: {result[0]['generated_text']}")

    del generator

except Exception as e:
    print(f"Pipeline demo failed: {e}")

# Release Python-side references, then any cached GPU memory
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
print("\nMemory cleaned up.")
```

### **Model Comparison Summary**

| Model | Size | Memory | Strengths | Best Use Cases |
|-------|------|---------|-----------|---------------|
| **TinyLlama-1.1B-Chat** | 1.1B | 2-4GB | General purpose, instruction following | Chatbots, general tasks |
| **Qwen2.5-0.5B-Instruct** | 0.5B | ~1GB | Multilingual, basic math, coding | Global apps, education |

**Remember**: These lightweight models are perfect for learning, prototyping, and many production use cases. Start here before moving to larger models!