# llama.cpp Quickstart (Python)

This notebook demonstrates how to run quantized GGUF models using `llama-cpp-python`. Guide: [llama.cpp Deployment](https://slmhub.gitbook.io/slmhub/docs/deploy/quickstarts/llama-cpp).

## 1. Install
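Install `llama-cpp-python`. The command below sets `CMAKE_ARGS` so the package is built with CUDA support for GPU acceleration, which requires an NVIDIA GPU and the CUDA toolkit. On a CPU-only machine, drop `CMAKE_ARGS` and run a plain `pip install llama-cpp-python`.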

In [None]:
!CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python huggingface_hub
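
The choice between a CUDA build and a CPU-only build can also be made programmatically. The sketch below is one way to do it, assuming that finding `nvidia-smi` on `PATH` is a reasonable proxy for a usable NVIDIA GPU. The helper name `build_install_command` is hypothetical and not part of any library.

```python
import shutil


def build_install_command(has_cuda: bool) -> str:
    """Return a pip command for llama-cpp-python, with or without CUDA."""
    base = "pip install llama-cpp-python"
    if has_cuda:
        # Same CMAKE_ARGS value as the install cell above.
        return 'CMAKE_ARGS="-DGGML_CUDA=on" ' + base
    return base


# Assumption: nvidia-smi on PATH means an NVIDIA driver is installed.
has_cuda = shutil.which("nvidia-smi") is not None
print(build_install_command(has_cuda))
```

The printed command can be pasted into a shell or a `!` cell. The check says nothing about whether the CUDA toolkit needed for compilation is present, so a failed build may still require falling back to the CPU-only command.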

## 2. Download a GGUF Model
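We use `hf_hub_download` from `huggingface_hub` to fetch a pre-quantized GGUF build of Phi-3 Mini (4k context, instruct-tuned). The file is stored in the local Hugging Face cache, so rerunning the cell does not download it again.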

In [None]:
from huggingface_hub import hf_hub_download

model_name = "microsoft/Phi-3-mini-4k-instruct-gguf"
model_file = "Phi-3-mini-4k-instruct-q4.gguf"

model_path = hf_hub_download(model_name, filename=model_file)
print(f"Model downloaded to: {model_path}")
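
Before loading a multi-gigabyte file, it can be worth confirming that the download really is a GGUF file. A GGUF file begins with the four ASCII bytes `GGUF`, followed by the format version as a little-endian 32-bit unsigned integer. The sketch below reads just those eight bytes; `read_gguf_version` is a hypothetical helper name, not part of `llama-cpp-python` or `huggingface_hub`.

```python
import struct


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..8 hold the format version as a little-endian uint32.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

After the download cell has run, `read_gguf_version(model_path)` should return an integer. A `ValueError` usually points to a truncated download or a wrong filename.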

## 3. Run Inference
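Load the model with `Llama`, then call the resulting object with a prompt to generate a completion. Because `echo=True` is set, the returned text contains the prompt followed by the generated answer.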

In [None]:
from llama_cpp import Llama

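# Load the model from the path returned by hf_hub_download
llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,  # Offload all layers to the GPU; use 0 for CPU only
    n_ctx=2048,       # Context window size in tokens
    verbose=False,
)

# Generate a completion
output = llm(
    "Q: Name the planets in the solar system. A: ",
    max_tokens=64,
    stop=["Q:", "\n"],  # Stop before the model starts a new question or line
    echo=True,          # Include the prompt in the returned text
)

# The response is a dict that follows the OpenAI completion schema
print(output['choices'][0]['text'])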
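
With `echo=True`, the printed text repeats the prompt, which is often not what downstream code wants. The sketch below shows one way to pull out only the answer along with `finish_reason`, which is `"stop"` when a stop sequence or end-of-sequence token ended generation and `"length"` when `max_tokens` was reached. The helper `split_echoed_completion` is hypothetical, and `example_output` is a hand-written stand-in with made-up values, not real model output.

```python
def split_echoed_completion(output: dict, prompt: str) -> tuple[str, str]:
    """Return (answer, finish_reason) from a completion made with echo=True."""
    choice = output["choices"][0]
    text = choice["text"]
    # With echo=True the text starts with the prompt; drop it if present.
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip(), choice["finish_reason"]


# Illustrative response with made-up values, not real model output.
example_output = {
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: What is 2 + 2? A: 4",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 1, "total_tokens": 13},
}

answer, reason = split_echoed_completion(example_output, "Q: What is 2 + 2? A: ")
print(answer, reason)
```

If the prompt is never needed in the output, passing `echo=False` avoids the post-processing altogether.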

## 4. Chat Format
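`create_chat_completion` takes a list of role-tagged messages and applies the model's chat template, so the special tokens that delimit each turn are inserted for you.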

In [None]:
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

print(response['choices'][0]['message']['content'])
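
To make "handles special tokens" concrete, the sketch below renders the same messages into a single prompt string by hand. It assumes the turn markers described on the Phi-3 model card (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`). The exact whitespace may differ from the template the library actually applies, so treat this as an illustration rather than a replacement for `create_chat_completion`. `render_phi3_prompt` is a hypothetical helper name.

```python
def render_phi3_prompt(messages: list[dict]) -> str:
    """Flatten chat messages into a single Phi-3-style prompt string."""
    parts = []
    for message in messages:
        # Each turn: <|role|>, newline, content, then the <|end|> marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    # A trailing assistant tag cues the model to write its reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(render_phi3_prompt(messages))
```

Getting these markers wrong in a hand-built prompt tends to degrade output quality, which is the main reason to prefer the chat API over the raw completion call for instruct-tuned models.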