In [None]:
import json
import requests
from IPython.display import JSON, Markdown


# LLaMA C++ HTTP Server Basics


In [None]:
%%bash

which llama-server

In [None]:
%%bash

llama-server --help

## Quick Start

To get started, open a terminal and run the following command, replacing the `MODEL` path with the location of your own GGUF model file.

```bash
MODEL="./models/gemma-1.1-7b-it.Q4_K_M.gguf"
llama-server \
    --model $MODEL \
    --host localhost \
    --port 8080
```


## Health Check

In [None]:
response = requests.get("http://localhost:8080/health")

# The status code distinguishes "still loading" (503) from "ready" (200).
print(response.status_code, response.content)

### Response Format

- HTTP status code 503
  - Body: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
  - Explanation: the model is still being loaded.
- HTTP status code 200
  - Body: `{"status": "ok"}`
  - Explanation: the model is successfully loaded and the server is ready.
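
The two cases above lend themselves to a small readiness check. The following is a minimal sketch, not part of llama.cpp: `wait_until_ready` and its `fetch` parameter are hypothetical names, and `fetch` is assumed to be any callable that returns a `(status_code, body)` pair, which keeps the logic testable without a running server.

```python
import json
import time


def wait_until_ready(fetch, attempts=30, delay=1.0):
    """Poll a /health-style endpoint until it reports ready.

    `fetch` is a hypothetical callable returning (status_code, body_bytes).
    Returns True once a 200 response with {"status": "ok"} is seen,
    or False if `attempts` polls pass without one.
    """
    for attempt in range(attempts):
        status_code, body = fetch()
        if status_code == 200 and json.loads(body).get("status") == "ok":
            return True
        # Any other response (e.g. 503 while the model loads): wait and retry.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

In this notebook, `fetch` could be a small function that calls `requests.get("http://localhost:8080/health")` and returns `(response.status_code, response.content)`.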

## Basic Example

In [None]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
    }
)

In [None]:
# requests can decode the JSON body directly.
json_response = response.json()
JSON(json_response)

In [None]:
Markdown(json_response["content"])

## Checking Server Global Properties


The `/props` API endpoint returns the server's current global settings. By default it is read-only; to change global properties, you need to start the server with the `--props` option.


In [None]:
response = requests.get("http://localhost:8080/props")

In [None]:
print(response)

In [None]:
# requests can decode the JSON body directly.
_json_data = response.json()
JSON(_json_data)

### Response Format
- `system_prompt`: the default value for the model's system prompt (if any).
- `default_generation_settings`: the default generation settings for the `/completion` endpoint. It has the same fields as the `generation_settings` object returned by `/completion`.
- `total_slots`: the total number of slots available for processing requests (set by the `--parallel` option).
- `chat_template`: the model's original Jinja2 prompt template (if any).
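
As an illustration of reading these fields, here is a minimal sketch. `summarize_props` is a hypothetical helper, and `sample_props` is a made-up dictionary shaped like the fields listed above, not real server output. The helper uses `.get()` because some fields are described as present only "if any".

```python
def summarize_props(props):
    """Return a compact summary of a /props response dict.

    Uses .get() throughout because some fields may be absent or empty.
    """
    settings = props.get("default_generation_settings") or {}
    return {
        "has_system_prompt": bool(props.get("system_prompt")),
        "has_chat_template": bool(props.get("chat_template")),
        "total_slots": props.get("total_slots"),
        "generation_setting_names": sorted(settings),
    }


# Made-up sample shaped like the fields listed above (values are invented).
sample_props = {
    "system_prompt": "",
    "default_generation_settings": {"temperature": 0.8, "n_predict": -1},
    "total_slots": 1,
    "chat_template": "{{ messages }}",
}
print(summarize_props(sample_props))
```

In this notebook, the same helper could be applied to the decoded response with `summarize_props(_json_data)`.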

## Changing Server Global Properties 

To use the POST method of the `/props` API endpoint, you need to start the server with the `--props` option.

```bash
MODEL="./models/gemma-1.1-7b-it.Q4_K_M.gguf"
llama-server \
    --model $MODEL \
    --host localhost \
    --port 8080 \
    --props
```

## Metrics

Launching the server with the `--metrics` option exposes a [Prometheus-compatible](https://prometheus.io/) metrics exporter at the `/metrics` endpoint.

```bash
MODEL="./models/gemma-1.1-7b-it.Q4_K_M.gguf"
llama-server \
    --model $MODEL \
    --host localhost \
    --port 8080 \
    --metrics
```

### Available Metrics

- `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
- `llamacpp:tokens_predicted_total`: Number of generation tokens processed.
- `llamacpp:prompt_tokens_seconds`: Average prompt throughput in tokens/s.
- `llamacpp:predicted_tokens_seconds`: Average generation throughput in tokens/s.
- `llamacpp:kv_cache_usage_ratio`: KV-cache usage. `1` means 100 percent usage.
- `llamacpp:kv_cache_tokens`: KV-cache tokens.
- `llamacpp:requests_processing`: Number of requests currently being processed.
- `llamacpp:requests_deferred`: Number of requests deferred.

### Basic Example

In [None]:
response = requests.get(
    url="http://localhost:8080/metrics",
)

In [None]:
# The exporter returns plain text, so decode the raw bytes before printing.
current_metrics = response.content.decode("utf-8")
print(current_metrics)
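
The text printed above is meant for a Prometheus scraper, but it can also be read into a dictionary for quick inspection. The following is a minimal sketch: `parse_prometheus_text` is a hypothetical helper that assumes label-free sample lines of the form `<name> <value>`, and the `sample` string uses invented values with metric names from the list above.

```python
def parse_prometheus_text(text):
    """Parse simple, label-free Prometheus text lines into {name: float}.

    Lines starting with '#' (HELP/TYPE comments) and blank lines are skipped.
    Assumes each sample line looks like '<name> <value>'; labelled samples
    such as 'name{key="v"} 1' are not handled by this sketch.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split()[:2]
        metrics[name] = float(value)
    return metrics


# Made-up sample using metric names from the list above (values are invented).
sample = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 128
# TYPE llamacpp:kv_cache_usage_ratio gauge
llamacpp:kv_cache_usage_ratio 0.25
"""
parsed = parse_prometheus_text(sample)
print(parsed["llamacpp:kv_cache_usage_ratio"])
```

In this notebook, the same helper could be applied to the live output with `parse_prometheus_text(current_metrics)`.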