# Comparing Different Quantization Levels of PLLuM-8x7B-chat

This notebook compares the resource usage, speed, and output quality of different quantization levels of the PLLuM-8x7B-chat model. We'll evaluate:

1. Memory usage and load time
2. Inference speed (tokens per second)
3. Output quality (subjective assessment)
4. Output length per prompt

## Prerequisites

- Download several quantized models from [Hugging Face](https://huggingface.co/piotrmaciejbednarski/PLLuM-8x7B-chat-GGUF)
- Install required packages: `pip install llama-cpp-python matplotlib numpy pandas psutil`

In [None]:
# Install required packages if not already installed
%pip install llama-cpp-python matplotlib numpy pandas psutil

In [None]:
import os
import time
import psutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from llama_cpp import Llama

# Set up plotting style
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (14, 8)

## 1. Define the Model Paths

Let's define the paths to the different quantized models. You should update these paths to match your local setup.

In [None]:
# Define the paths to different quantized models
# Update these paths to match your local setup
models = {
    "Q2_K": "../models/PLLuM-8x7B-chat-gguf-q2_k.gguf",
    "IQ3_S": "../models/PLLuM-8x7B-chat-gguf-iq3_s.gguf",
    "Q3_K_M": "../models/PLLuM-8x7B-chat-gguf-q3_k_m.gguf",
    "Q4_K_M": "../models/PLLuM-8x7B-chat-gguf-q4_k_m.gguf",
    "Q5_K_M": "../models/PLLuM-8x7B-chat-gguf-q5_k_m.gguf",
    "Q8_0": "../models/PLLuM-8x7B-chat-gguf-q8_0.gguf"
}

# Check which models are available
available_models = {}
for name, path in models.items():
    if os.path.exists(path):
        available_models[name] = path
        print(f"✅ {name} found at {path}")
    else:
        print(f"❌ {name} not found at {path}")

if not available_models:
    raise FileNotFoundError("No models found. Please download at least one model.")
else:
    print(f"\nFound {len(available_models)} model(s) for comparison.")
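
On-disk file size is a useful sanity check to set beside the measured memory figures in section 5, since a quantized model cannot occupy much less RAM than its file once fully loaded. The following is a minimal sketch; `file_size_gib` and `report_model_sizes` are hypothetical helper names introduced here, not part of llama-cpp-python.

```python
import os


def file_size_gib(path):
    """Return the size of a file in GiB (2**30 bytes)."""
    return os.path.getsize(path) / (1024 ** 3)


def report_model_sizes(model_paths):
    """Map each model name to its on-disk size in GiB, smallest first."""
    sizes = {name: file_size_gib(path) for name, path in model_paths.items()}
    return dict(sorted(sizes.items(), key=lambda item: item[1]))


# Usage in this notebook (assumes `available_models` from the cell above):
# for name, size in report_model_sizes(available_models).items():
#     print(f"{name}: {size:.2f} GiB")
```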

## 2. Define Test Prompts

Let's define a set of test prompts to evaluate the models.

In [None]:
# Define test prompts for different tasks
test_prompts = {
    "Simple Question": "Pytanie: Jakie są największe miasta w Polsce? Odpowiedź:",
    "Complex Question": "Pytanie: Jakie są najważniejsze osiągnięcia polskiej nauki w XX wieku? Wyjaśnij ich znaczenie dla świata. Odpowiedź:",
    "Translation": "Przetłumacz poniższy tekst z polskiego na angielski:\n\n'Polska kultura jest bogata w tradycje i historię. Od wieków słynęliśmy z gościnności i otwartości na inne kultury.'\n\nTłumaczenie:",
    "Summarization": "Streszcz poniższy tekst w 3-4 zdaniach:\n\nKonstytucja 3 maja (właściwie Ustawa Rządowa z dnia 3 maja) – uchwalona 3 maja 1791 roku ustawa regulująca ustrój prawny Rzeczypospolitej Obojga Narodów. Powszechnie przyjmuje się, że Konstytucja 3 maja była pierwszą w Europie i drugą na świecie (po konstytucji amerykańskiej z 1787 r.) nowoczesną, spisaną konstytucją. Konstytucja 3 maja została ustanowiona ustawą rządową przyjętą tego dnia przez sejm. Została zaprojektowana w celu zlikwidowania obecnych od dawna wad systemu politycznego Rzeczypospolitej Obojga Narodów i jej złotej wolności. Konstytucja wprowadziła polityczne zrównanie mieszczan i szlachty oraz stawiała chłopów pod ochroną państwa, w ten sposób łagodząc najgorsze nadużycia pańszczyzny. Konstytucja zniosła zgubne instytucje, takie jak liberum veto, które przed przyjęciem Konstytucji pozostawiało sejm na łasce każdego posła, który, jeśli zechciał – z własnej inicjatywy, lub przekupiony przez zagraniczne siły, albo magnatów – móc unieważnić wszystkie podjęte przez sejm uchwały.\n\nStreszczenie:",
    "Creative Writing": "Napisz krótkie opowiadanie (300-400 słów) o pierwszym kontakcie ludzi z cywilizacją pozaziemską. Opowiadanie powinno zawierać element zaskoczenia."
}

print(f"Defined {len(test_prompts)} test prompts for evaluation.")

## 3. Test Function

Let's create a function to load a model and run the test prompts, measuring performance metrics.

In [None]:
def test_model(model_name, model_path, test_prompts, max_tokens=512):
    """Test a model with the given prompts and gather performance metrics."""
    print(f"Testing model: {model_name}")
    results = {
        "model": model_name,
        "load_time": 0,
        "memory_usage": 0,
        "prompt_results": {}
    }
    
    # Record initial memory usage
    initial_memory = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)  # MB
    
    # Load the model and measure time
    load_start = time.time()
    try:
        llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_threads=8,
            n_batch=512,
            verbose=False
        )
        load_time = time.time() - load_start
        results["load_time"] = load_time
        print(f"  Model loaded in {load_time:.2f} seconds")
        
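        # Measure memory usage after loading. This RSS delta is approximate:
        # if the model file is memory-mapped (the llama.cpp default), its pages
        # count toward RSS only once they are resident.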
        current_memory = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)  # MB
        memory_usage = current_memory - initial_memory
        results["memory_usage"] = memory_usage
        print(f"  Memory usage: {memory_usage:.2f} MB")
        
        # Run each test prompt
        for prompt_name, prompt_text in test_prompts.items():
            print(f"  Running prompt: {prompt_name}")
            prompt_result = {
                "inference_time": 0,
                "tokens_per_second": 0,
                "output_text": ""
            }
            
            # Run the inference and measure time
            inference_start = time.time()
            output = llm(
                prompt_text,
                max_tokens=max_tokens,
                temperature=0.7,
                top_p=0.95,
                top_k=50,
                repeat_penalty=1.1
            )
            inference_time = time.time() - inference_start
            
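            # Calculate tokens per second from the completion token count that
            # llama-cpp-python reports; a whitespace word count is not a token count.
            # Note that inference_time also includes prompt processing.
            output_text = output["choices"][0]["text"]
            output_length = output["usage"]["completion_tokens"]
            tokens_per_second = output_length / inference_time if inference_time > 0 else 0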
            
            # Store results
            prompt_result["inference_time"] = inference_time
            prompt_result["tokens_per_second"] = tokens_per_second
            prompt_result["output_text"] = output_text
            prompt_result["output_length"] = output_length
            
            results["prompt_results"][prompt_name] = prompt_result
            print(f"    Completed in {inference_time:.2f} seconds ({tokens_per_second:.2f} tokens/sec)")
        
        # Calculate average tokens per second across all prompts
        avg_tokens_per_second = np.mean([res["tokens_per_second"] for res in results["prompt_results"].values()])
        results["avg_tokens_per_second"] = avg_tokens_per_second
        print(f"  Average generation speed: {avg_tokens_per_second:.2f} tokens/sec")
        
    except Exception as e:
        print(f"Error testing model {model_name}: {e}")
        results["error"] = str(e)
    finally:
        # Clean up
        if 'llm' in locals():
            del llm
            import gc
            gc.collect()
    
    print(f"Completed testing {model_name}\n")
    return results
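
A single timed run per prompt is sensitive to caching and background load. One way to reduce that noise is to repeat the call and report the median. The sketch below assumes a hypothetical helper named `time_repeated`; it is not used by `test_model` above and would need to be wired in by hand.

```python
import statistics
import time


def time_repeated(fn, repeats=3):
    """Call fn() `repeats` times and return (median_seconds, all_timings)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


# Usage (assumes a loaded `llm` and a `prompt_text` string):
# median_s, runs = time_repeated(lambda: llm(prompt_text, max_tokens=64))
```

The median is less affected than the mean by one unusually slow first run, which is common when model pages are still being read from disk.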

## 4. Run Tests on Available Models

Now let's run the tests on each available model.

In [None]:
# Run the tests on available models
all_results = {}

# Select a subset of prompts for comparison to save time
comparison_prompts = {
    "Simple Question": test_prompts["Simple Question"],
    "Translation": test_prompts["Translation"],
    "Summarization": test_prompts["Summarization"]
}

# Run tests with fewer tokens for faster comparison
for model_name, model_path in available_models.items():
    results = test_model(model_name, model_path, comparison_prompts, max_tokens=256)
    all_results[model_name] = results

## 5. Analyze Memory Usage and Speed

Let's analyze and visualize the memory usage and inference speed of different models.

In [None]:
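# Prepare data for visualization, skipping models that failed to load or run
# (failed runs carry an "error" key and lack "avg_tokens_per_second")
model_names = [name for name, res in all_results.items() if "error" not in res]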
memory_usage = [all_results[name]["memory_usage"] for name in model_names]
load_times = [all_results[name]["load_time"] for name in model_names]
avg_tokens_per_second = [all_results[name]["avg_tokens_per_second"] for name in model_names]

# Create a DataFrame for easier manipulation
df = pd.DataFrame({
    "Model": model_names,
    "Memory Usage (MB)": memory_usage,
    "Load Time (s)": load_times,
    "Generation Speed (tokens/s)": avg_tokens_per_second
})

# Display the DataFrame
print("Performance Metrics:")
display(df)

# Create comparison visualizations
fig, axs = plt.subplots(1, 3, figsize=(20, 6))

# Memory usage
axs[0].bar(model_names, memory_usage, color='skyblue')
axs[0].set_title('Memory Usage by Model')
axs[0].set_ylabel('Memory Usage (MB)')
axs[0].tick_params(axis='x', labelrotation=45)

# Load time
axs[1].bar(model_names, load_times, color='salmon')
axs[1].set_title('Model Load Time')
axs[1].set_ylabel('Time (seconds)')
axs[1].tick_params(axis='x', labelrotation=45)

# Generation speed
axs[2].bar(model_names, avg_tokens_per_second, color='lightgreen')
axs[2].set_title('Average Generation Speed')
axs[2].set_ylabel('Tokens per Second')
axs[2].tick_params(axis='x', labelrotation=45)

plt.tight_layout()
plt.show()

## 6. Compare Generation Speed by Task

Let's compare how each model performs on different types of tasks.

In [None]:
# Extract tokens per second for each prompt and model
prompt_data = {}
for prompt_name in comparison_prompts.keys():
    prompt_data[prompt_name] = []
    for model_name in model_names:
        if prompt_name in all_results[model_name]["prompt_results"]:
            tokens_per_second = all_results[model_name]["prompt_results"][prompt_name]["tokens_per_second"]
            prompt_data[prompt_name].append(tokens_per_second)
        else:
            prompt_data[prompt_name].append(0)

# Create a grouped bar chart
plt.figure(figsize=(14, 8))
x = np.arange(len(model_names))
width = 0.2
multiplier = 0

for prompt_name, tokens_per_second in prompt_data.items():
    offset = width * multiplier
    plt.bar(x + offset, tokens_per_second, width, label=prompt_name)
    multiplier += 1

plt.xlabel('Model')
plt.ylabel('Tokens per Second')
plt.title('Generation Speed by Model and Task')
plt.xticks(x + width, model_names, rotation=45)
plt.legend()
plt.tight_layout()
plt.show()
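
The tick position `x + width` used above centers each cluster only when exactly three prompts are plotted (offsets 0, 0.2, and 0.4). If you change the number of prompts, the labels drift off-center. A minimal sketch of offsets that stay centered for any group count, using a hypothetical helper `centered_offsets`:

```python
def centered_offsets(n_groups, width):
    """Return one x-offset per group so the cluster is centered on the tick."""
    return [(i - (n_groups - 1) / 2) * width for i in range(n_groups)]


# Usage (assumes `x`, `width`, `prompt_data`, and `model_names` from the cell above):
# offsets = centered_offsets(len(prompt_data), width)
# for offset, (prompt_name, values) in zip(offsets, prompt_data.items()):
#     plt.bar(x + offset, values, width, label=prompt_name)
# plt.xticks(x, model_names, rotation=45)
```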

## 7. Output Quality Comparison

Let's compare the output quality for one prompt across models.

In [None]:
# Let's compare the output for the "Simple Question" prompt
prompt_to_compare = "Simple Question"
print(f"Comparing outputs for: {prompt_to_compare}")
print(f"Prompt: {comparison_prompts[prompt_to_compare]}\n")

for model_name in model_names:
    print(f"=== {model_name} ===\n")
    if prompt_to_compare in all_results[model_name]["prompt_results"]:
        output_text = all_results[model_name]["prompt_results"][prompt_to_compare]["output_text"]
        print(output_text[:500] + ("..." if len(output_text) > 500 else ""))
    else:
        print("No output available for this model and prompt.")
    print("\n" + "-"*80 + "\n")
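
Reading outputs by eye can be complemented with perplexity, a standard automatic measure of how well a model predicts a reference text (lower is better). The sketch below covers only the arithmetic: it assumes you already have natural-log probabilities for each token of a reference text, and the name `perplexity` is a helper introduced here. Obtaining those log-probabilities from llama-cpp-python is not shown.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p)) over natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Worked example: a model that assigns probability 0.5 to every token
# has perplexity 2, i.e. it is as uncertain as a fair coin flip per token.
# perplexity([math.log(0.5)] * 4)
```

When comparing quantization levels this way, score every model on the same reference text so the values are directly comparable.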

## 8. Output Length Comparison

Let's compare the output length for each prompt across models.

In [None]:
# Extract output length for each prompt and model
output_lengths = {}
for prompt_name in comparison_prompts.keys():
    output_lengths[prompt_name] = []
    for model_name in model_names:
        if prompt_name in all_results[model_name]["prompt_results"]:
            length = all_results[model_name]["prompt_results"][prompt_name]["output_length"]
            output_lengths[prompt_name].append(length)
        else:
            output_lengths[prompt_name].append(0)

# Create a grouped bar chart for output lengths
plt.figure(figsize=(14, 8))
x = np.arange(len(model_names))
width = 0.2
multiplier = 0

for prompt_name, lengths in output_lengths.items():
    offset = width * multiplier
    plt.bar(x + offset, lengths, width, label=prompt_name)
    multiplier += 1

plt.xlabel('Model')
plt.ylabel('Output Length (tokens)')
plt.title('Output Length by Model and Task')
plt.xticks(x + width, model_names, rotation=45)
plt.legend()
plt.tight_layout()
plt.show()

## 9. Calculate Efficiency Score

Let's calculate an efficiency score that balances speed and memory usage.

In [None]:
# Calculate efficiency score (tokens per second per MB of memory)
df["Efficiency Score"] = df["Generation Speed (tokens/s)"] / df["Memory Usage (MB)"] * 1000  # Scale up for readability

# Display the updated DataFrame
print("Efficiency Metrics:")
display(df)

# Create a bar chart for efficiency score
plt.figure(figsize=(14, 6))
plt.bar(df["Model"], df["Efficiency Score"], color='purple')
plt.title('Efficiency Score by Model (Higher is Better)')
plt.ylabel('Efficiency Score (tokens/s per MB × 1000)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
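
The score divides by the measured memory delta, which can be zero or even negative if the process's resident memory did not grow during loading. A hedged sketch of the same formula with a guard, using a hypothetical helper `efficiency_score`:

```python
import math


def efficiency_score(tokens_per_second, memory_mb, scale=1000):
    """Tokens/s per MB of memory, scaled; NaN when memory is not positive."""
    if memory_mb <= 0:
        return math.nan
    return tokens_per_second / memory_mb * scale


# Worked example with made-up numbers: 5 tokens/s at 20000 MB
# gives 5 / 20000 * 1000 = 0.25.
# Usage (assumes `df` from the cell above):
# df["Efficiency Score"] = [
#     efficiency_score(speed, mem)
#     for speed, mem in zip(df["Generation Speed (tokens/s)"], df["Memory Usage (MB)"])
# ]
```

Returning NaN rather than 0 keeps invalid measurements out of the ranking, since pandas sorts NaN values last by default.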

## 10. Create a Recommendations Guide

Based on our analysis, let's create a recommendations guide for different use cases.

In [None]:
# Sort the models by different metrics
best_memory = df.sort_values("Memory Usage (MB)").iloc[0]["Model"]
best_speed = df.sort_values("Generation Speed (tokens/s)", ascending=False).iloc[0]["Model"]
best_efficiency = df.sort_values("Efficiency Score", ascending=False).iloc[0]["Model"]

# Create a recommendations DataFrame
recommendations = {
    "Use Case": [
        "Low-resource environments (e.g., old laptops, minimal RAM)",
        "Balanced performance (good quality with reasonable resources)",
        "Real-time applications (chatbots, interactive tools)",
        "High-quality outputs (creative writing, complex analyses)",
        "Mobile devices or embedded systems"
    ],
    "Recommended Model": [
        best_memory,
        "Q4_K_M",  # Commonly recommended for balance
        best_speed,
        "Q8_0",  # Highest quality
        best_efficiency
    ],
    "Rationale": [
        f"Lowest memory usage among tested models ({df[df['Model']==best_memory]['Memory Usage (MB)'].values[0]:.2f} MB)",
        "Good balance between quality and resource requirements",
        f"Fastest generation speed ({df[df['Model']==best_speed]['Generation Speed (tokens/s)'].values[0]:.2f} tokens/s)",
        "Highest quality outputs (closest to original model)",
        "Best efficiency score (speed per MB of memory)"
    ]
}

# Create and display the recommendations DataFrame
recommendations_df = pd.DataFrame(recommendations)
print("Recommendations Based on Test Results:")
display(recommendations_df)
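
The table hard-codes `Q4_K_M` and `Q8_0` even if those models were not among the ones tested. One option is to fall back to the nearest level that was actually tested. The sketch below is an illustration only: `QUANT_ORDER` and `closest_tested` are hypothetical names, and the ordering is an assumption that simply follows the `models` dictionary from section 1.

```python
# Assumed ordering from lowest to highest bit width (mirrors the `models` dict)
QUANT_ORDER = ["Q2_K", "IQ3_S", "Q3_K_M", "Q4_K_M", "Q5_K_M", "Q8_0"]


def closest_tested(preferred, tested, order=QUANT_ORDER):
    """Return `preferred` if it was tested, else the nearest tested level in `order`.

    Ties resolve to the lower-bit level because candidates are scanned in order.
    """
    candidates = [name for name in order if name in tested]
    if not candidates:
        raise ValueError("no tested model appears in the ordering")
    target = order.index(preferred)
    return min(candidates, key=lambda name: abs(order.index(name) - target))


# Usage (assumes `model_names` from section 5):
# high_quality_choice = closest_tested("Q8_0", model_names)
```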

## 11. Subjective Quality Assessment

While it's difficult to automate quality assessment, let's create a simple function to help manually evaluate the quality of the outputs.

In [None]:
def compare_outputs_side_by_side(prompt_name, model_name1, model_name2):
    """Compare outputs from two models side by side."""
    print(f"Comparing {model_name1} vs {model_name2} for prompt: {prompt_name}")
    print(f"Prompt: {comparison_prompts[prompt_name]}\n")
    
    output1 = "No output available" 
    output2 = "No output available"
    
    if prompt_name in all_results[model_name1]["prompt_results"]:
        output1 = all_results[model_name1]["prompt_results"][prompt_name]["output_text"]
    
    if prompt_name in all_results[model_name2]["prompt_results"]:
        output2 = all_results[model_name2]["prompt_results"][prompt_name]["output_text"]
    
    # Display outputs side by side
    print("-"*100)
    print(f"{model_name1:50} | {model_name2:50}")
    print("-"*100)
    
    # Split by newlines to make comparison easier
    lines1 = output1.split("\n")
    lines2 = output2.split("\n")
    
    # Find the maximum number of lines
    max_lines = max(len(lines1), len(lines2))
    
    for i in range(max_lines):
        line1 = lines1[i] if i < len(lines1) else ""
        line2 = lines2[i] if i < len(lines2) else ""
        print(f"{line1[:50]:50} | {line2[:50]:50}")
    
    print("-"*100)

# Let's compare a couple of models if we have multiple available
if len(model_names) >= 2:
    # Compare the first two models for each prompt
    for prompt_name in comparison_prompts.keys():
        compare_outputs_side_by_side(prompt_name, model_names[0], model_names[1])
        print("\n")
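
The comparison above truncates every line to 50 characters, so long single-line answers lose most of their text. A minimal alternative wraps the text instead of cutting it. The helper name `side_by_side_lines` is an assumption made for this sketch.

```python
import textwrap
from itertools import zip_longest


def side_by_side_lines(text1, text2, width=50):
    """Wrap both texts to `width` columns and pair them row by row."""
    def wrap(text):
        rows = []
        for line in text.split("\n"):
            # textwrap.wrap returns [] for an empty line; keep it as a blank row
            rows.extend(textwrap.wrap(line, width=width) or [""])
        return rows

    pairs = zip_longest(wrap(text1), wrap(text2), fillvalue="")
    return [f"{left:{width}} | {right:{width}}" for left, right in pairs]


# Usage (assumes `output1` and `output2` hold two model outputs):
# print("\n".join(side_by_side_lines(output1, output2)))
```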

## 12. Export Results

Let's export our findings to CSV files for future reference.

In [None]:
# Export performance metrics to CSV, creating the output directory if needed
os.makedirs("../results", exist_ok=True)
df.to_csv("../results/model_performance_metrics.csv", index=False)
recommendations_df.to_csv("../results/model_recommendations.csv", index=False)

print("Exported results to CSV files in the 'results' directory.")
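
The CSV files hold only the aggregate metrics; the generated texts in `all_results` are lost when the kernel stops. A hedged sketch for saving the full results as JSON follows. The helper `save_results_json` and the output filename in the usage comment are assumptions, not part of the original notebook.

```python
import json
import os


def save_results_json(results, path):
    """Write the nested results dict to `path` as UTF-8 JSON."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Polish characters readable in the file;
        # default=float converts numeric types json does not know (e.g. NumPy scalars)
        json.dump(results, f, ensure_ascii=False, indent=2, default=float)


# Usage (assumes `all_results` from section 4; the filename is hypothetical):
# save_results_json(all_results, "../results/model_outputs.json")
```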

## 13. Cleanup

Let's clean up any resources we may have used.

In [None]:
# Run the garbage collector to reclaim memory from models that were already deleted
import gc
gc.collect()

print("Resources released.")

In this notebook, we've compared different quantization levels of the PLLuM-8x7B-chat model across several dimensions:

1. **Memory Usage**: Lower-bit quantizations (e.g., Q2_K) produce smaller files and use less memory than higher-bit ones (e.g., Q8_0)
2. **Load Time**: Smaller model files generally load faster
3. **Inference Speed**: Generation speed varies with quantization level and hardware, so rely on the tokens/s measured above rather than assuming that smaller is faster
4. **Output Quality**: Higher-bit quantizations tend to stay closer to the original model and produce higher-quality outputs
5. **Efficiency**: The efficiency score relates generation speed to memory usage (tokens/s per MB)

### Key Takeaways

- The **Q4_K_M** quantization offers a good balance between quality and resource requirements for most applications
- For low-resource environments, the **Q2_K** or **IQ3_S** quantizations are viable options
- For the highest-quality outputs, the **Q8_0** quantization is recommended, but it requires the most memory of the variants listed here
- The choice of quantization should be based on your specific use case and available hardware

These results can help you choose the right quantization level for your specific use case and hardware constraints.