# Exercise: Setting Up and Running a GGUF Model with llama-cpp-python

This notebook demonstrates how to install dependencies, download a model from Hugging Face,
and load it using the `llama_cpp` library. Follow the steps below to understand how to:
- Install necessary Python packages.
- Download a GGUF model file from Hugging Face.
- Load and interact with the model.

Run the cells in order: the installation cell must finish before the import cell will work.

In [None]:
# Check GPU availability
!nvidia-smi
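
The `!nvidia-smi` call above only prints a report. If a later cell should decide by itself whether to request GPU offloading, the same check can be done in Python. The following is a minimal sketch; the helper name `gpu_available` is hypothetical, and it assumes that a successful `nvidia-smi` run is a good enough signal that a usable NVIDIA GPU is present.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# -1 requests offloading of all layers; 0 keeps the whole model on the CPU.
n_gpu_layers = -1 if gpu_available() else 0
print(f"n_gpu_layers = {n_gpu_layers}")
```

The resulting `n_gpu_layers` value could then be passed to `Llama(...)` instead of commenting the argument in and out by hand.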

In [None]:
# Install the packages needed to run GGUF models with llama-cpp-python.
# The extra index serves prebuilt wheels for CUDA 12.4 (cu124).
!pip3 install llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
!pip3 install huggingface_hub==0.28.0

In [None]:
# Import required libraries for model execution and downloading
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

In [None]:
# Download the GGUF model file from Hugging Face.
# Make sure the repository ID and filename match the model you want;
# you can search Hugging Face for more GGUF LLMs.
model_path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF",
    filename="qwen2.5-coder-0.5b-instruct-q4_k_m.gguf",
    force_download=False,
)

In [None]:
# Create the LLM
llm = Llama(
    model_path=model_path,
    # n_gpu_layers=-1,  # Uncomment to offload all layers to the GPU
    n_ctx=2048,  # Context window size in tokens (prompt plus completion)
)
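
The prompt and the generated completion share the `n_ctx` window, so the number of tokens that can actually be generated is smaller than `n_ctx`. The sketch below shows that arithmetic. The helper name `completion_budget` and the prompt length of 20 tokens are illustrative assumptions; in the notebook the real count could be obtained with something like `len(llm.tokenize(prompt.encode("utf-8")))`.

```python
def completion_budget(n_ctx, prompt_tokens, requested=None):
    """Return how many tokens can be generated within the context window.

    n_ctx:         total context window size in tokens
    prompt_tokens: number of tokens the prompt occupies
    requested:     desired max_tokens, or None for "as many as fit"
    """
    remaining = n_ctx - prompt_tokens
    if remaining <= 0:
        raise ValueError("The prompt does not leave room for any completion tokens.")
    if requested is None:
        return remaining
    return min(requested, remaining)


# Assumed prompt length of 20 tokens, for illustration only.
print(completion_budget(n_ctx=2048, prompt_tokens=20, requested=2048))  # 2028
```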

In [None]:
# Prompt the LLM
llm(
    # ChatML-formatted prompt: a user turn followed by the opening of the assistant turn
    (
        "<|im_start|>user\n"
        "Name the planets in the solar system<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
    # Generate at most 2048 tokens; set to None to generate until the context window is full
    max_tokens=2048,
    # Stop at the end-of-turn marker so the model does not start a new turn
    stop=["<|im_end|>\n"],
    # Greedy decoding, so repeated runs give the same output
    temperature=0.0,
)
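
The prompt above is written out by hand in the ChatML layout, and the call returns a dictionary rather than plain text. The sketch below shows one way to build the same prompt from a list of messages and to pull the generated text out of the result. The helper names `build_chatml_prompt` and `completion_text` are hypothetical, and `sample_response` is a hand-written stand-in with the `choices[0]["text"]` shape, not real model output.

```python
def build_chatml_prompt(messages):
    """Format role/content messages as ChatML and open the assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def completion_text(response):
    """Return the generated text from a completion-style response dict."""
    return response["choices"][0]["text"].strip()


prompt = build_chatml_prompt(
    [{"role": "user", "content": "Name the planets in the solar system"}]
)

# Hand-written stand-in for a real response, for illustration only.
sample_response = {"choices": [{"text": " (model output here) ", "finish_reason": "stop"}]}
print(completion_text(sample_response))
```

In the notebook, `prompt` would replace the hand-written string in the `llm(...)` call, and `completion_text` would be applied to the value that call returns.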