# Working with llama.cpp in Pixeltable

This tutorial demonstrates how to use Pixeltable's built-in `llama.cpp` integration to run local LLMs efficiently.

<div class="alert alert-block alert-info"><!-- mdx:none -->
<b>If you are running this tutorial in Colab:</b>
To make the tutorial run faster, switch to a GPU-equipped instance for this Colab session: click the <code>Runtime -> Change runtime type</code> menu item at the top, select the <code>GPU</code> radio button, and click <code>Save</code>.
</div>

### Important notes

- Models are automatically downloaded from Hugging Face and cached locally
- Different quantization levels are available for performance/quality tradeoffs
- Consider memory usage when choosing models and quantizations
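
To get a feel for the memory tradeoff, the size of a model's weights can be approximated as parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate only: the bits-per-weight figures are rough assumptions chosen for illustration, and `estimate_weights_gb` is a hypothetical helper, not part of Pixeltable or llama.cpp.

```python
# Rough size estimate for quantized model weights: parameters x bits per weight.
# The bits-per-weight figures are approximate assumptions for illustration,
# not values taken from llama.cpp.
APPROX_BITS_PER_WEIGHT = {
    'F16': 16.0,
    'Q8_0': 8.5,
    'Q5_K_M': 5.5,
    'Q4_K_M': 4.8,
}


def estimate_weights_gb(n_params: float, quantization: str) -> float:
    """Return the approximate size of the weights in gigabytes (1e9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quantization]
    return n_params * bits / 8 / 1e9


for quant in APPROX_BITS_PER_WEIGHT:
    size = estimate_weights_gb(0.5e9, quant)
    print(f'{quant:>7}: ~{size:.2f} GB for a 0.5B-parameter model')
```

Actual memory use at inference time will be higher than the weights alone, since the context (KV) cache and runtime buffers also need space.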

## Set up environment

First, let's install Pixeltable with llama.cpp support:

In [None]:
%pip install -qU pixeltable llama-cpp-python huggingface-hub

## Create a table for chat completions

Now let's create a table that will contain our inputs and responses.

In [None]:
import pixeltable as pxt
from pixeltable.functions import llama_cpp

pxt.drop_dir('llama_demo', force=True)
pxt.create_dir('llama_demo')

t = pxt.create_table('llama_demo.chat', {'input': pxt.String})

Next, we add a computed column that calls the Pixeltable `create_chat_completion` UDF, which adapts the corresponding llama.cpp API call. In our examples, we'll use pretrained models hosted on Hugging Face. llama.cpp makes this easy: specify a `repo_id` (taken from the model's URL) and a `repo_filename` (a filename from the model repo, which may contain a wildcard as in the example below), and the model will be downloaded and cached automatically.

(If this is your first time using Pixeltable, the <a href="https://docs.pixeltable.com/tutorials/tables-and-data-operations">Pixeltable Fundamentals</a> tutorial contains more details about table creation, computed columns, and UDFs.)
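
How a wildcard filename can select a single file from a repo can be sketched with Python's standard library. This is a minimal illustration of glob matching in general, assuming case-sensitive matching and a requirement for exactly one match; the file listing and the `resolve_filename` helper are hypothetical and do not describe llama.cpp's internals.

```python
from fnmatch import fnmatchcase

# Hypothetical file listing for a GGUF model repo (illustrative names only).
repo_files = [
    'README.md',
    'model-instruct-q4_k_m.gguf',
    'model-instruct-q5_k_m.gguf',
    'model-instruct-q8_0.gguf',
]


def resolve_filename(pattern: str, files: list[str]) -> str:
    """Return the single file matching `pattern`; raise if zero or several match."""
    matches = [f for f in files if fnmatchcase(f, pattern)]
    if len(matches) != 1:
        raise ValueError(f'expected exactly one match for {pattern!r}, got {matches}')
    return matches[0]


print(resolve_filename('*q5_k_m.gguf', repo_files))
```

Under these assumptions a pattern such as `*.gguf` would be rejected as ambiguous, and the case of the pattern has to match the filenames in the repo, so it is worth checking the repo's file listing before choosing a pattern.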

For this demo we'll use `Qwen2.5-0.5B`, a very small (0.5-billion-parameter) model that still produces decent results. We'll use `Q5_K_M` (5-bit) quantization, which offers a good balance of quality and efficiency.

In [None]:
# Add a computed column that uses llama.cpp for chat completion
# against the input.

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': t.input}
]

t.add_computed_column(result=llama_cpp.create_chat_completion(
    messages,
    repo_id='Qwen/Qwen2.5-0.5B-Instruct-GGUF',
    repo_filename='*q5_k_m.gguf'
))

# Extract the output content from the JSON structure returned
# by llama_cpp.

t.add_computed_column(output=t.result.choices[0].message.content)
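
The path `choices[0].message.content` used above follows the shape of an OpenAI-style chat completion response. The sketch below walks the same path over a hand-written dictionary; every value in `sample_result` is made up for illustration, and a real response will likely carry additional fields.

```python
# A hand-written example of an OpenAI-style chat completion response.
# All values are invented for illustration; they are not real model output.
sample_result = {
    'object': 'chat.completion',
    'model': 'example-model.gguf',
    'choices': [
        {
            'index': 0,
            'message': {'role': 'assistant', 'content': 'Paris.'},
            'finish_reason': 'stop',
        }
    ],
    'usage': {'prompt_tokens': 12, 'completion_tokens': 2, 'total_tokens': 14},
}

# The same path that t.result.choices[0].message.content follows,
# written as plain dictionary and list lookups.
content = sample_result['choices'][0]['message']['content']
print(content)
```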

## Test chat completion

Let's try a few simple queries:

In [None]:
# Insert a few test questions
t.insert([
    {'input': 'What is the capital of France?'},
    {'input': 'What are some edible species of fish?'},
    {'input': 'Who are the most prominent classical composers?'}
])

In [None]:
t.select(t.input, t.output).collect()

## Comparing models

Local model frameworks like `llama.cpp` make it easy to compare the output of different models. Let's try comparing the output from `Qwen` against a somewhat larger model, `Llama-3.2-1B`. As always, when we add a new computed column to our table, it's automatically evaluated against the existing table rows.

In [None]:
t.add_computed_column(result_l3=llama_cpp.create_chat_completion(
    messages,
    repo_id='bartowski/Llama-3.2-1B-Instruct-GGUF',
    repo_filename='*Q5_K_M.gguf'
))

t.add_computed_column(output_l3=t.result_l3.choices[0].message.content)

t.select(t.input, t.output, t.output_l3).collect()
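
The backfill behavior mentioned above, where a new computed column is evaluated against rows that already exist, can be pictured with a toy model in plain Python. `ToyTable` is a hypothetical stand-in written for this explanation; it is not how Pixeltable is implemented and it ignores storage, caching, and incremental updates.

```python
from typing import Callable


class ToyTable:
    """A toy stand-in for a table with computed columns (not Pixeltable's code)."""

    def __init__(self) -> None:
        self.rows: list[dict] = []
        self.computed: dict[str, Callable[[dict], object]] = {}

    def add_computed_column(self, name: str, fn: Callable[[dict], object]) -> None:
        # Register the column, then backfill it for rows that already exist.
        self.computed[name] = fn
        for row in self.rows:
            row[name] = fn(row)

    def insert(self, new_rows: list[dict]) -> None:
        # New rows get every registered computed column evaluated on insert.
        for values in new_rows:
            row = dict(values)
            for name, fn in self.computed.items():
                row[name] = fn(row)
            self.rows.append(row)


toy = ToyTable()
toy.insert([{'input': 'What is the capital of France?'}])
toy.add_computed_column('n_words', lambda row: len(row['input'].split()))
toy.insert([{'input': 'Name three edible fish.'}])
print([row['n_words'] for row in toy.rows])
```

Both rows end up with an `n_words` value, even though the first was inserted before the column existed.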

Just for fun, let's try running against a different system prompt with a different persona.

In [None]:
messages_teacher = [
    {'role': 'system',
     'content': 'You are a patient school teacher. '
                'Explain concepts simply and clearly.'},
    {'role': 'user', 'content': t.input}
]

t.add_computed_column(result_teacher=llama_cpp.create_chat_completion(
    messages_teacher,
    repo_id='bartowski/Llama-3.2-1B-Instruct-GGUF',
    repo_filename='*Q5_K_M.gguf'
))

t.add_computed_column(output_teacher=t.result_teacher.choices[0].message.content)

t.select(t.input, t.output_teacher).collect()
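
When trying several personas, the message list can be built by a small helper instead of being written out each time. The sketch below is one possible approach; `build_messages` and the `personas` dictionary are hypothetical names introduced here, and plain strings stand in for the column reference `t.input` so that the example runs on its own.

```python
def build_messages(system_prompt: str, user_content) -> list[dict]:
    """Build a two-message chat prompt: one system message, one user message."""
    return [
        {'role': 'system', 'content': system_prompt},
        {'role': 'user', 'content': user_content},
    ]


# Hypothetical personas for illustration. In the notebook, `user_content`
# would be the column reference `t.input` rather than a plain string.
personas = {
    'assistant': 'You are a helpful assistant.',
    'teacher': 'You are a patient school teacher. Explain concepts simply and clearly.',
}

for name, prompt in personas.items():
    print(name, build_messages(prompt, 'What is the capital of France?'))
```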

## Additional resources

- [Pixeltable Documentation](https://docs.pixeltable.com/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)