# Open-Source LLMs: Models, Ecosystem, and Application Development

This notebook explores the rich ecosystem of open-source large language models (LLMs), how to evaluate and use them effectively, and how to build applications with them. Building on your understanding of transformer architecture and fine-tuning techniques, we'll now focus on the practical aspects of leveraging these models to create valuable products.

## Learning Objectives

By the end of this notebook, you'll understand:
- The current landscape of open-source LLMs and their capabilities
- How to evaluate and select the right model for different use cases
- Practical deployment considerations and optimization techniques
- How to build applications with open-source LLMs using popular frameworks
- Product development considerations when working with LLMs
- How to implement evaluation frameworks to measure model performance

Let's start by setting up our environment with the necessary libraries.

In [None]:
# Install required packages
%pip install transformers datasets accelerate huggingface_hub sentence-transformers langchain langchain-community \
    llama-cpp-python llama-index chainlit torch optimum tiktoken einops auto-gptq bitsandbytes peft faiss-cpu \
    evaluate openai gradio -q

In [None]:
# Import core libraries
import os
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML, Markdown
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from huggingface_hub import HfApi, list_models

# Set plotting styles
%matplotlib inline
plt.style.use('ggplot')
sns.set(style="whitegrid")

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set up HuggingFace API (will need a token if accessing gated models)
hf_api = HfApi()

## 1. The Open-Source LLM Ecosystem

The landscape of open-source LLMs has grown rapidly in recent years, and the strongest open models now compete with commercial alternatives on many tasks. Understanding this ecosystem is essential for making informed decisions about which models to use for your applications.

### Key Open-Source Model Families

Let's explore the major open-source LLM families and their characteristics:

In [None]:
# Create a dataframe of major open-source LLM families
model_families = {
    "Family": [
        "Llama", 
        "Mistral", 
        "Phi", 
        "Gemma", 
        "Pythia", 
        "Falcon", 
        "BLOOM", 
        "Qwen", 
        "MPT",
        "Orca",
        "Yi",
        "OLMo"
    ],
    "Creator": [
        "Meta", 
        "Mistral AI", 
        "Microsoft", 
        "Google", 
        "EleutherAI", 
        "TII", 
        "BigScience", 
        "Alibaba", 
        "MosaicML",
        "Microsoft",
        "01.AI",
        "AI2"
    ],
    "Latest Version": [
        "Llama-3 (8B/70B)", 
        "Mistral 7B + Mixtral 8x7B", 
        "Phi-3 (mini/small/medium)", 
        "Gemma 2 (2B/9B/27B)", 
        "Pythia (70M-12B)", 
        "Falcon-180B", 
        "BLOOM-176B", 
        "Qwen2 (0.5B-72B)", 
        "MPT-30B",
        "Orca 2 (13B)",
        "Yi-34B",
        "OLMo 7B"
    ],
    "Key Strengths": [
        "Strong all-round performance, widely used", 
        "Efficient, strong reasoning, MoE architecture", 
        "Small but powerful, math capabilities", 
        "Lightweight with strong capabilities", 
        "Open weights, multiple checkpoints", 
        "Trained largely on curated web data (RefinedWeb)", 
        "Multilingual (46+ languages)", 
        "Strong in Chinese and English, tool use", 
        "Long context window, ALiBi positioning",
        "Instruction-following, reasoning",
        "Quality in Chinese and English",
        "Full transparency in training"
    ],
    "License": [
        "Llama 2 / Llama 3 Community License", 
        "Apache 2.0", 
        "MIT", 
        "Gemma Terms of Use", 
        "Apache 2.0", 
        "Apache 2.0 (7B/40B), TII license (180B)", 
        "RAIL", 
        "Apache 2.0 (most sizes), Qwen License (72B)", 
        "Apache 2.0",
        "Microsoft Research License",
        "Apache 2.0",
        "Apache 2.0"
    ]
}

model_df = pd.DataFrame(model_families)
display(model_df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{'selector': 'th', 'props': [('text-align', 'left')]}]))

# Create a more detailed visualization of model sizes
def visualize_model_sizes():
    # Model sizes in billions of parameters
    models = [
        "Phi-3 Mini (3.8B)", "Gemma 2B", "Phi-3 Small (7B)", "Mistral 7B", "Llama-3 8B", 
        "Orca 2 (13B)", "Phi-3 Medium (14B)", "Gemma 2 (27B)", "Yi (34B)", "Llama-3 70B", 
        "Qwen2 72B", "Mixtral 8x7B (MoE)", "Falcon-180B", "BLOOM-176B"
    ]
    sizes = [
        3.8, 2, 7, 7, 8, 
        13, 14, 27, 34, 70, 
        72, 47, 180, 176  # Mixtral effective size is lower due to MoE
    ]
    
    # Create categories
    categories = [
        "Small", "Small", "Small", "Small", "Small",
        "Medium", "Medium", "Medium", "Medium", "Large",
        "Large", "Large (MoE)", "Very Large", "Very Large"
    ]
    
    # Create color map
    category_colors = {"Small": "#1f77b4", "Medium": "#ff7f0e", "Large": "#2ca02c", "Large (MoE)": "#9467bd", "Very Large": "#d62728"}
    colors = [category_colors[cat] for cat in categories]
    
    # Create the plot
    plt.figure(figsize=(12, 8))
    bars = plt.barh(models, sizes, color=colors)
    plt.xscale('log')
    plt.xlabel('Billions of Parameters (log scale)')
    plt.title('Open-Source LLM Sizes')
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    
    # Add size labels
    for bar in bars:
        width = bar.get_width()
        label_x_pos = width * 1.01
        plt.text(label_x_pos, bar.get_y() + bar.get_height()/2, f'{width}B', 
                 va='center', fontsize=8)
    
    # Add a legend
    from matplotlib.patches import Patch
    legend_elements = [Patch(facecolor=category_colors[cat], label=cat) for cat in category_colors]
    plt.legend(handles=legend_elements, loc='lower right')
    
    plt.tight_layout()
    plt.show()

# Call the visualization function
visualize_model_sizes()
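
The parameter counts plotted above map fairly directly onto memory needs. As a rough sketch (weights only, ignoring activations and the KV cache, and using decimal gigabytes), memory is approximately parameters times bytes per parameter. The helper name `estimate_weight_memory_gb` is an illustrative assumption, not part of any library.

```python
def estimate_weight_memory_gb(params_billions, bits_per_param=16):
    """Approximate memory for model weights only, in decimal GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billions * 1e9 * bytes_per_param / 1e9

# Sizes (in billions of parameters) come from the chart above
for name, size_b in [("Mistral 7B", 7), ("Llama-3 70B", 70), ("Mixtral 8x7B (MoE)", 47)]:
    fp16 = estimate_weight_memory_gb(size_b, 16)
    int4 = estimate_weight_memory_gb(size_b, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

Note that an MoE model such as Mixtral still needs memory for all of its parameters, even though only a fraction is active for each token.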

### Model Architecture Innovations

Open-source LLMs continue to evolve with new architectural innovations that improve performance, efficiency, and capabilities. Let's explore some of the key innovations in recent models:

In [None]:
# Create a visual representation of key LLM innovations
innovations = {
    "Innovation": [
        "Mixture of Experts (MoE)",
        "Grouped-Query Attention (GQA)",
        "Sliding Window Attention",
        "ALiBi Positional Encoding",
        "RoPE (Rotary Position Embedding)",
        "Multi-Query Attention (MQA)",
        "Flash Attention",
        "GELU Activation Function"
    ],
    "Description": [
        "Routes tokens through specialized sub-networks, activating only a subset for each token",
        "Shares key/value pairs across groups of attention heads for efficiency",
        "Limits attention to a sliding window of tokens for efficiency with longer contexts",
        "Attention with Linear Biases - adds bias to attention scores based on distance",
        "Applies rotations to embedding vectors based on position, better preserves token relationships",
        "All query heads share a single key/value head for efficiency",
        "Optimized attention algorithm that reduces memory usage and improves speed",
        "Gaussian Error Linear Unit - smoother activation function than ReLU"
    ],
    "Benefits": [
        "Larger effective model size with fewer active parameters, better specialization",
        "Reduces memory usage and increases inference speed",
        "Enables longer context processing with linear scaling",
        "Better extrapolation to longer sequences than learned positional embeddings",
        "Better handling of relative positions and sequence extrapolation",
        "Reduces memory bandwidth requirements during inference",
        "Significantly faster training and inference with less memory",
        "Better gradient flow and performance than ReLU"
    ],
    "Example Models": [
        "Mixtral 8x7B, Qwen1.5-MoE, DeepSeek-MoE",
        "Llama 2 (70B), Llama 3, Gemma 2, Qwen2",
        "Mistral 7B",
        "MPT, BLOOM",
        "Llama, Mistral, Falcon, Yi",
        "PaLM, Falcon-7B, Gemma 2B",
        "Most recent models use this implementation",
        "BERT, GPT-2, GPT-3 (many newer LLMs use SwiGLU)"
    ]
}

innovations_df = pd.DataFrame(innovations)
display(innovations_df)
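
To make the memory benefit of GQA and MQA in the table above concrete, the sketch below estimates the key/value cache size for a single sequence. The configuration (32 layers, 32 query heads, head dimension 128, 16-bit values) is an assumed 7B-class setup chosen for illustration, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor of 2 covers the separate key and value tensors in every layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed 7B-class configuration (illustrative values only)
config = dict(num_layers=32, head_dim=128, seq_len=4096)
num_query_heads = 32

for label, kv_heads in [("MHA", num_query_heads), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
    size_mib = kv_cache_bytes(num_kv_heads=kv_heads, **config) / 2**20
    print(f"{label}: {size_mib:.0f} MiB per 4096-token sequence")
```

Under these assumptions the cache shrinks from 2048 MiB with standard multi-head attention to 512 MiB with 8 KV heads and 64 MiB with a single shared KV head.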

Let's visualize one of the most impactful innovations - Mixture of Experts (MoE) - to understand how it allows models to achieve greater effective capacity with fewer active parameters:

In [None]:
# Visualize Mixture of Experts (MoE) architecture
def visualize_moe():
    plt.figure(figsize=(14, 8))
    
    # Define the figure layout
    ax = plt.subplot(1, 1, 1)
    ax.set_xlim(0, 100)
    ax.set_ylim(0, 100)
    ax.axis('off')
    
    # Draw the input
    plt.text(50, 95, "Input Token Representation", ha='center', fontsize=14, fontweight='bold')
    plt.arrow(50, 90, 0, -5, head_width=1, head_length=1, fc='black', ec='black')
    
    # Draw the router
    router_rect = plt.Rectangle((40, 75), 20, 10, fc='#ff9999', ec='black')
    ax.add_patch(router_rect)
    plt.text(50, 80, "Router Network", ha='center', va='center', fontsize=11)
    
    # Draw the experts (8 boxes spaced evenly so they all fit inside the 0-100 x-range)
    expert_colors = ['#66b3ff', '#99ff99', '#ffcc99', '#c2c2f0', '#ffb3e6', '#c2f0c2', '#99ccff', '#ffb3b3']
    expert_positions = [(6.25 + i * 12.5, 50) for i in range(8)]
    expert_rects = []
    
    for i, (x, y) in enumerate(expert_positions):
        expert_rect = plt.Rectangle((x-5, y-10), 10, 20, fc=expert_colors[i], ec='black')
        ax.add_patch(expert_rect)
        plt.text(x, y, f"Expert\n{i+1}", ha='center', va='center', fontsize=9)
        expert_rects.append(expert_rect)
    
    # Draw connections from router to experts
    active_experts = [1, 3, 6]  # 0-based indices: Experts 2, 4, and 7 are active for this token
    for i in range(8):
        x, y = expert_positions[i]
        alpha = 1.0 if i in active_experts else 0.2
        width = 2.0 if i in active_experts else 0.5
        linestyle = '-' if i in active_experts else '--'
        plt.plot([50, x], [75, y+10], color='black', alpha=alpha, linewidth=width, linestyle=linestyle)
    
    # Draw an arrow from every expert toward the output combiner (inactive experts are faded)
    for i, (x, y) in enumerate(expert_positions):
        alpha = 1.0 if i in active_experts else 0.2
        plt.arrow(x, 40, 0, -5, head_width=1, head_length=1, fc='black', ec='black', alpha=alpha)
    
    combiner_rect = plt.Rectangle((40, 25), 20, 10, fc='#9999ff', ec='black')
    ax.add_patch(combiner_rect)
    plt.text(50, 30, "Output Combiner", ha='center', va='center', fontsize=11)
    
    # Draw the output
    plt.arrow(50, 25, 0, -5, head_width=1, head_length=1, fc='black', ec='black')
    plt.text(50, 15, "Output Token Representation", ha='center', fontsize=14, fontweight='bold')
    
    # Add title and description
    plt.figtext(0.5, 0.02, "In Mixture of Experts (MoE), only a subset of 'expert' networks process each token.\n"
                "This allows for much larger total parameter counts while keeping compute requirements manageable.", 
                ha='center', fontsize=12, bbox={"facecolor":"white", "alpha":0.8, "pad":5})
    
    plt.suptitle("Mixture of Experts (MoE) Architecture", fontsize=16, y=0.98)
    plt.tight_layout(rect=[0, 0.08, 1, 0.95])
    plt.show()
    
    # Create a comparison table between standard and MoE models
    comparison = pd.DataFrame({
        "Metric": ["Total Parameters", "Active Parameters per Token", "Computational Cost", "Memory Usage", "Specialization", "Examples"],
        "Dense Model": ["70B", "70B (100%)", "Proportional to model size", "Full model size", "General-purpose", "Llama-3 70B"],
        "MoE Model": ["~47B", "~13B (~28%)", "Close to a ~13B dense model per token", "All ~47B parameters must be loaded", "Experts may specialize in different token patterns", "Mixtral 8x7B, Qwen1.5-MoE"]
    })
    
    display(comparison)

# Call the visualization function
visualize_moe()
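
The routing step in the diagram can be sketched in a few lines of plain Python. This is a minimal illustration of one common top-k formulation (softmax over the k largest router logits, then a weighted sum of the chosen experts' outputs), not the implementation used by any particular model. The toy experts and router logits are made-up values.

```python
import math

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs by routing weight."""
    routes = top_k_route(router_logits, k)
    output = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, routes

# Toy experts (hypothetical): expert i simply scales its input by (i + 1)
toy_experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(8)]
router_logits = [0.1, 2.0, -1.0, 2.0, 0.3, 0.0, -0.5, 0.2]

output, routes = moe_layer([1.0, -2.0], toy_experts, router_logits, k=2)
print("Selected experts and weights:", routes)
print("Combined output:", output)
```

Only two of the eight toy experts run for this token, which is why compute per token tracks the active parameter count rather than the total.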

### Licensing and Usage Restrictions

Understanding licensing is crucial for product development. Open-source LLMs use various licenses, some with commercial restrictions. Let's explain the main license types and their implications:

In [None]:
# Create a comparison of common LLM licenses
license_info = {
    "License": [
        "Apache 2.0",
        "MIT",
        "Meta Llama 2/3 License",
        "RAIL (Responsible AI License)",
        "Custom Research-Only"
    ],
    "Commercial Use": [
        "✅ Permitted",
        "✅ Permitted",
        "✅ Permitted (with limitations)",
        "✅ Permitted (subject to use restrictions)",
        "❌ Not permitted"
    ],
    "Modification": [
        "✅ Permitted",
        "✅ Permitted",
        "✅ Permitted",
        "✅ Permitted with same restrictions",
        "⚠️ Limited to research"
    ],
    "Distribution": [
        "✅ Permitted with attribution",
        "✅ Permitted with attribution",
        "⚠️ Additional terms for redistribution",
        "⚠️ Responsible use requirements apply",
        "⚠️ Typically only for academic purposes"
    ],
    "Use Cases Prohibited": [
        "None specified",
        "None specified",
        "Various harmful uses, 700M+ MAU without approval",
        "Deception, harassment, illegality, harm",
        "Typically all commercial applications"
    ],
    "Example Models": [
        "Mistral 7B, Mixtral, Yi, OLMo",
        "Phi-3",
        "Llama-2, Llama-3",
        "BLOOM",
        "LLaMA (v1), Orca 2"
    ]
}

license_df = pd.DataFrame(license_info)
display(license_df)

# Highlight key considerations for product managers
display(Markdown("""### Key Licensing Considerations for Product Managers

When building products with open-source LLMs, consider these license factors:

1. **Content Generation Risk**: Even with permissive licenses, you remain responsible for the outputs your product generates
2. **MAU Limitations**: Some licenses (like Llama) restrict usage beyond certain user thresholds
3. **Attribution Requirements**: Most licenses require proper attribution in your products
4. **Liability Issues**: Most licenses provide no warranty or liability protection
5. **Combining Models**: When fine-tuning or merging models, licensing becomes more complex
6. **Monitoring Changes**: License terms for popular models can evolve over time
7. **Internal vs. Customer-Facing**: Some licenses have different terms based on how models are deployed

For product development, Apache 2.0 and MIT licenses generally offer the most flexibility."""))

## 2. Evaluating and Selecting Open-Source LLMs

Choosing the right model involves understanding the trade-offs between different models based on their capabilities, size, and computational requirements. Let's create a structured approach to model selection.

In [None]:
# Create a framework for model selection
selection_framework = pd.DataFrame({
    "Consideration": [
        "Task Requirements",
        "Model Size",
        "Compute Resources",
        "Inference Speed",
        "Specialization",
        "Multilinguality",
        "Licensing",
        "Ecosystem Support"
    ],
    "Questions to Ask": [
        "What specific capabilities does your application need? (reasoning, code generation, creativity, etc.)",
        "What's the maximum model size your infrastructure can support? Consider RAM and GPU memory requirements.",
        "Do you have access to GPUs/TPUs? What memory constraints exist?",
        "What response time is acceptable for your application?",
        "Is your use case better served by a specialized model or a general-purpose one?",
        "What languages does your application need to support?",
        "What licensing constraints apply to your business use case?",
        "How well supported is the model in frameworks and tools you plan to use?"
    ],
    "Recommendations": [
        "Map specific tasks to benchmark results; different models excel at different tasks",
        "Smaller ≠ worse; consider Phi-3, Gemma 2, or Mistral 7B for efficiency",
        "Consider 4-bit/8-bit quantization for larger models; note that MoE models cut compute per token but still need memory for all experts",
        "Smaller models or optimized variants (GPTQ, GGUF) for lower latency",
        "For code: CodeLlama, Starcoder; For math: Phi models; For reasoning: Mistral",
        "BLOOM, XGLM for broad coverage; Yi, Qwen for Asian languages",
        "Apache 2.0/MIT for maximum flexibility; check MAU limits and other restrictions",
        "Llama, Mistral, and Phi families have the broadest adapter/library support"
    ]
})

display(selection_framework)

# Create a visualization of the LLM performance landscape
def visualize_model_performance_landscape():
    # Illustrative scores chosen for visualization only; they are NOT measured benchmark results.
    # Replace them with real benchmark numbers before drawing conclusions.
    models = [
        "Llama-3 70B", "Mixtral 8x7B", "Llama-3 8B", "Mistral 7B", 
        "Phi-3 Small", "Phi-3 Medium", "Gemma 2 27B", "Gemma 2 9B", "Gemma 2 2B",
        "Yi 34B", "Qwen2 7B", "Qwen2 72B"
    ]
    
    reasoning = [92, 87, 80, 78, 76, 85, 86, 75, 65, 84, 77, 90]  # Reasoning capabilities
    knowledge = [90, 85, 82, 75, 73, 83, 81, 73, 63, 80, 76, 89]  # Knowledge capabilities
    sizes = [70, 47, 8, 7, 7, 14, 27, 9, 2, 34, 7, 72]  # Size in billions of parameters
    
    # Create a scatterplot
    plt.figure(figsize=(12, 8))
    
    # Size the points by model size
    sizes_scaled = [100 * (s/max(sizes)) + 100 for s in sizes]
    
    # Create scatter plot
    sc = plt.scatter(knowledge, reasoning, s=sizes_scaled, alpha=0.7, 
                    c=sizes, cmap='viridis', edgecolors='black', linewidths=1)
    
    # Add labels for each point
    for i, model in enumerate(models):
        plt.annotate(model, (knowledge[i], reasoning[i]), 
                     xytext=(5, 5), textcoords='offset points', fontsize=9)
    
    # Add a color bar to indicate model size
    cbar = plt.colorbar(sc)
    cbar.set_label('Model Size (B parameters)', rotation=270, labelpad=20)
    
    # Add quadrant lines
    plt.axhline(y=80, color='gray', linestyle='--', alpha=0.5)
    plt.axvline(x=80, color='gray', linestyle='--', alpha=0.5)
    
    # Add quadrant labels
    plt.text(65, 95, "High Reasoning\nLower Knowledge", ha='center', fontsize=10)
    plt.text(95, 95, "Strong Generalists\nHigh Reasoning & Knowledge", ha='center', fontsize=10)
    plt.text(65, 65, "Smaller Models\nMore Limited Capabilities", ha='center', fontsize=10)
    plt.text(95, 65, "Strong Knowledge\nLower Reasoning", ha='center', fontsize=10)
    
    plt.xlabel('Knowledge Capabilities')
    plt.ylabel('Reasoning Capabilities')
    plt.title('LLM Performance Landscape (Illustrative Scores)')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

# Call the visualization function
visualize_model_performance_landscape()
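
The "Model Size" and "Compute Resources" rows of the framework can be turned into a quick first-pass filter. The sketch below shortlists models whose estimated weight footprint fits a memory budget. The 1.2x overhead factor for activations and cache is an assumption, and `shortlist_models` is a hypothetical helper, so treat the output as a starting point rather than a guarantee.

```python
# Parameter counts (billions) taken from the size chart earlier in this notebook
candidate_sizes = {
    "Phi-3 Mini": 3.8, "Mistral 7B": 7, "Llama-3 8B": 8,
    "Phi-3 Medium": 14, "Yi 34B": 34, "Llama-3 70B": 70,
}

def shortlist_models(sizes_b, memory_budget_gb, bits_per_param=16, overhead=1.2):
    """Return (name, estimated GB) for models that fit the budget, largest first."""
    fits = []
    for name, size_b in sizes_b.items():
        # Weights in GB = billions of params * bytes per param, padded by an assumed overhead
        needed_gb = size_b * bits_per_param / 8 * overhead
        if needed_gb <= memory_budget_gb:
            fits.append((name, round(needed_gb, 1)))
    return sorted(fits, key=lambda item: item[1], reverse=True)

print("24 GB, 16-bit:", shortlist_models(candidate_sizes, memory_budget_gb=24))
print("24 GB, 4-bit: ", shortlist_models(candidate_sizes, memory_budget_gb=24, bits_per_param=4))
```

Quantizing to 4 bits brings a 34B model within reach of a 24 GB budget under these assumptions, while a 70B model still does not fit.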

### Benchmarking Models for Your Specific Task

While general benchmarks provide a high-level overview, it's crucial to evaluate models on tasks specific to your application. Let's create a simple framework for evaluating models on custom tasks.

In [None]:
# Define a function to evaluate models on specific tasks
def evaluate_models_on_task(models, task_examples, evaluation_criteria, models_config=None):
    """
    A framework for evaluating multiple models on a specific task.
    
    Args:
        models: List of model names or paths
        task_examples: List of input examples for the task
        evaluation_criteria: Dictionary of criteria functions that score outputs
        models_config: Optional configuration parameters for each model
        
    Returns:
        DataFrame with evaluation results
    """
    results = {"Model": []}
    for criterion in evaluation_criteria:
        results[criterion] = []
    results["Inference Time (s)"] = []
    
    import random
    
    # For demonstration purposes, we'll simulate the evaluation
    # In a real scenario, you would load each model and run inference
    
    # Simulated performance data (would be calculated from actual model outputs)
    performance_data = {
        "llama-3-8b": {"Accuracy": 0.82, "Relevance": 0.85, "Creativity": 0.78, "Time": 0.8},
        "mistral-7b-instruct": {"Accuracy": 0.79, "Relevance": 0.88, "Creativity": 0.75, "Time": 0.7},
        "phi-3-mini": {"Accuracy": 0.75, "Relevance": 0.80, "Creativity": 0.72, "Time": 0.5}
    }
    
    for model in models:
        model_id = model.split("/")[-1].lower()
        results["Model"].append(model)
        
        # Match on substring so that e.g. "llama-3-8b-instruct" maps to the "llama-3-8b" entry
        model_scores = next(
            (scores for key, scores in performance_data.items() if key in model_id), {}
        )
        
        # In a real implementation, you would:
        # 1. Load the model
        # 2. Run inference on each example
        # 3. Apply evaluation criteria to outputs
        # 4. Aggregate scores across examples
        
        # For this demonstration, we'll use simulated data
        for criterion in evaluation_criteria:
            # Fall back to a random placeholder score when no simulated data exists
            score = model_scores.get(criterion, round(random.uniform(0.65, 0.90), 2))
            results[criterion].append(score)
        
        # Add inference time (random placeholder if no simulated value exists)
        results["Inference Time (s)"].append(
            model_scores.get("Time", round(random.uniform(0.5, 2.0), 1))
        )
    
    # Create DataFrame from results
    results_df = pd.DataFrame(results)
    
    # Add an overall score (weighted average over the criteria that were actually evaluated)
    weights = {"Accuracy": 0.5, "Relevance": 0.3, "Creativity": 0.2}
    used = {c: w for c, w in weights.items() if c in results_df.columns}
    total_weight = sum(used.values()) or 1.0
    results_df["Overall Score"] = sum(results_df[c] * w for c, w in used.items()) / total_weight
    
    return results_df

# Define some sample task examples (e.g., creative writing prompts)
task_examples = [
    "Write a short story about an AI assistant helping a product manager design a new app.",
    "Create a marketing description for a smart home device that leverages AI.",
    "Draft an email to stakeholders explaining the benefits of our new LLM-powered feature."
]

# Define evaluation criteria (would be functions that score outputs in a real implementation)
evaluation_criteria = {"Accuracy": None, "Relevance": None, "Creativity": None}

# Models to evaluate
models_to_evaluate = [
    "meta-llama/Meta-Llama-3-8B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.2",
    "microsoft/Phi-3-mini-4k-instruct"
]

# Run the evaluation (simulated)
results_df = evaluate_models_on_task(models_to_evaluate, task_examples, evaluation_criteria)

# Display results
display(results_df)

# Visualize the results
def visualize_evaluation_results(results_df):
    # Create a radar chart for each model
    criteria = [c for c in results_df.columns if c not in ["Model", "Inference Time (s)", "Overall Score"]]
    
    # Number of variables
    N = len(criteria)
    
    # Create angles for each criterion
    angles = [n / float(N) * 2 * np.pi for n in range(N)]
    angles += angles[:1]  # Close the loop
    
    # Create the plot
    fig, ax = plt.subplots(figsize=(10, 8), subplot_kw=dict(polar=True))
    
    # Add criterion labels
    plt.xticks(angles[:-1], criteria, fontsize=12)
    
    # Draw y-axis labels (0.2 to 1.0 by 0.2)
    ax.set_rlabel_position(0)
    plt.yticks([0.2, 0.4, 0.6, 0.8, 1.0], ["0.2", "0.4", "0.6", "0.8", "1.0"], fontsize=10)
    plt.ylim(0, 1)
    
    # Plot each model
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
    for i, (_, row) in enumerate(results_df.iterrows()):
        model_name = row["Model"].split("/")[-1]  # Extract just the model name for cleaner labels
        values = [row[criterion] for criterion in criteria]
        values += values[:1]  # Close the loop
        
        # Plot values
        ax.plot(angles, values, linewidth=2, linestyle='solid', label=model_name, color=colors[i % len(colors)])
        ax.fill(angles, values, alpha=0.1, color=colors[i % len(colors)])
    
    # Add legend
    plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
    
    plt.title("Model Performance Comparison", size=15, y=1.1)
    plt.tight_layout()
    plt.show()
    
    # Also create a bar chart for overall scores and inference time
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
    
    # Plot overall scores
    models = [m.split("/")[-1] for m in results_df["Model"]]
    ax1.bar(models, results_df["Overall Score"], color=colors[:len(models)])
    ax1.set_title("Overall Performance Score")
    ax1.set_ylim(0, 1)
    ax1.set_ylabel("Score")
    for i, v in enumerate(results_df["Overall Score"]):
        ax1.text(i, v + 0.02, f"{v:.2f}", ha='center')
        
    # Plot inference times
    ax2.bar(models, results_df["Inference Time (s)"], color=colors[:len(models)])
    ax2.set_title("Inference Time (lower is better)")
    ax2.set_ylabel("Seconds")
    for i, v in enumerate(results_df["Inference Time (s)"]):
        ax2.text(i, v + 0.05, f"{v:.1f}s", ha='center')
    
    plt.tight_layout()
    plt.show()

# Visualize the evaluation results
visualize_evaluation_results(results_df)
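
The evaluation above is simulated, with `None` placeholders where scoring functions would go. As a rough sketch of what real criteria could look like, the code below defines two simple heuristic scorers and a loop that averages them over examples while timing generation. Everything here (`keyword_coverage`, `length_score`, `score_model`, and the stand-in `fake_generate`) is hypothetical. In practice you would swap in a real model call and task-appropriate metrics.

```python
import time

def keyword_coverage(output, keywords):
    """Fraction of expected keywords that appear in the output (case-insensitive substring match)."""
    text = output.lower()
    return sum(kw.lower() in text for kw in keywords) / len(keywords)

def length_score(output, min_words=20, max_words=200):
    """1.0 if the word count is within the allowed range, else 0.0."""
    return 1.0 if min_words <= len(output.split()) <= max_words else 0.0

def score_model(generate_fn, examples, criteria):
    """Average each criterion over all examples and record mean generation time."""
    start = time.perf_counter()
    outputs = [generate_fn(ex["prompt"]) for ex in examples]
    elapsed = time.perf_counter() - start
    scores = {}
    for name, fn in criteria.items():
        scores[name] = sum(fn(out, ex) for out, ex in zip(outputs, examples)) / len(examples)
    scores["Inference Time (s)"] = elapsed / len(examples)
    return scores

# Stand-in for a real model call (assumption: returns a fixed string)
def fake_generate(prompt):
    return "Our smart home hub uses AI to learn your routine and save energy."

demo_examples = [{
    "prompt": "Create a marketing description for a smart home device that leverages AI.",
    "keywords": ["smart home", "AI", "privacy"],
}]
demo_criteria = {
    "Relevance": lambda out, ex: keyword_coverage(out, ex["keywords"]),
    "Length OK": lambda out, ex: length_score(out, min_words=5, max_words=50),
}

scores = score_model(fake_generate, demo_examples, demo_criteria)
print(scores)
```

Keyword matching is a crude proxy for relevance. For subjective qualities such as creativity, human review or a stronger judge model is usually needed.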

## 3. Model Optimization and Deployment

Once you've selected a model, you'll need to optimize it for deployment. This includes quantization, pruning, and efficient serving techniques.

In [None]:
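# Create a comparison of model optimization techniques
# Note: the memory and speed figures are rough rules of thumb; actual gains depend on model, hardware, and workload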
optimization_techniques = pd.DataFrame({
    "Technique": [
        "Quantization (4-bit)",
        "Quantization (8-bit)",
        "GPTQ Quantization",
        "GGUF Format",
        "KV Cache Optimization",
        "Pruning",
        "Speculative Decoding",
        "Flash Attention Implementation",
        "Continuous Batching"
    ],
    "Description": [
        "Reduce precision from 16/32-bit to 4-bit",
        "Reduce precision from 16/32-bit to 8-bit",
        "Post-training weight quantization (typically 4-bit) calibrated on sample data",
        "llama.cpp file format, typically storing quantized weights for CPU and GPU inference",
        "Optimizing attention key/value caching",
        "Removing less important weights",
        "Using a smaller model to 'draft' predictions",
        "Optimized implementation of attention mechanism",
        "Process multiple requests together dynamically"
    ],
    "Memory Reduction": [
        "~75-80%",
        "~50%",
        "~75%",
        "~60-75%",
        "15-30%",
        "10-30%",
        "N/A",
        "30-40%",
        "N/A"
    ],
    "Speed Improvement": [
        "0.8-2x",
        "1.2-1.5x",
        "1.5-3x",
        "1.5-4x",
        "1.3-1.7x",
        "1.1-1.3x",
        "2-5x",
        "2-4x",
        "2-10x (for multiple requests)"
    ],
    "Quality Impact": [
        "Moderate loss",
        "Minimal loss",
        "Low to moderate loss",
        "Depends on quantization level",
        "None",
        "Low to moderate loss",
        "None (draft tokens are verified by the target model)",
        "None",
        "None"
    ],
    "Implementation Complexity": [
        "Low with bitsandbytes",
        "Low",
        "Medium",
        "Low with llama.cpp",
        "Medium",
        "High",
        "High",
        "Low with recent libraries",
        "High"
    ]
})

display(optimization_techniques)
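
To show where the quantization trade-off in the table comes from, here is a minimal sketch of symmetric absmax quantization on a handful of made-up weights. It is a simplified illustration: production methods such as those in bitsandbytes or GPTQ use per-block scales and more sophisticated schemes, and the function names below are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric quantization: map floats to signed integers using one scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.03, 0.88, -0.55]  # made-up example values
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, max abs error={max_err:.4f}")
```

With fewer bits there are fewer integer levels, so the rounding error grows. That is the source of the larger quality loss listed for 4-bit compared with 8-bit quantization.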