In [19]:
import json
import requests
from IPython.display import JSON, Markdown


# llama.cpp HTTP Server Basics


In [1]:
%%bash

which llama-server

/Users/pughdr/Documents/Training/kaust-generative-ai/local-deployment-llama-cpp/env/bin/llama-server


In [2]:
%%bash

llama-server --help

----- common params -----

-h,    --help, --usage                  print usage and exit
--version                               show version and build info
--verbose-prompt                        print a verbose prompt before generation (default: false)
-t,    --threads N                      number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
-Cr,   --cpu-range lo-hi                range of CPUs for affinity. Complements --cpu-mask
--cpu-strict <0|1>                      use strict CPU placement (default: 0)
--prio N                                set process/thread priority : 0-normal, 1-medium,

## Quick Start

To get started, open a terminal and run the following command, replacing the model path with the location of a GGUF model file on your machine.

```bash
MODEL="./models/gemma-1.1-7b-it.Q4_K_M.gguf"
llama-server \
    --model "$MODEL" \
    --host localhost \
    --port 8080
```


## Health Check

In [20]:
response = requests.get("http://localhost:8080/health")

print(response.content)

b'{"status":"ok"}'


### Response Format

- HTTP status code 503
  - Body: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
  - Explanation: the model is still being loaded.
- HTTP status code 200
  - Body: `{"status": "ok"}`
  - Explanation: the model is successfully loaded and the server is ready.
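
Because the endpoint returns 503 while the model is loading and 200 once it is ready, a client can poll it before sending any prompts. The following is a minimal sketch of that idea, not part of the llama.cpp tooling: the helper name `wait_until_ready` and its `timeout` and `interval` parameters are hypothetical, and the status check is passed in as a callable so the polling logic stays independent of the HTTP library.

```python
import time


def wait_until_ready(get_status, timeout=60.0, interval=1.0):
    """Poll get_status() until it reports HTTP 200 or the timeout elapses.

    get_status is any zero-argument callable that returns an HTTP status code.
    Returns True if the server became ready, False if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if get_status() == 200:
                return True  # model loaded, server ready
        except OSError:
            # Connection-level failures (e.g. connection refused) mean the
            # server process is not accepting requests yet; keep waiting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In this notebook the callable could be `lambda: requests.get("http://localhost:8080/health").status_code`. Catching `OSError` is enough for that case because the exceptions raised by `requests` derive from it.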

## Basic Example

In [21]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
    }
)

In [23]:
json_response = json.loads(response.content)
JSON(json_response)

<IPython.core.display.JSON object>

In [24]:
Markdown(json_response["content"])



**Answer:**

The sky is blue due to a phenomenon called **Rayleigh scattering**. 

* Sunlight is composed of all the colors of the rainbow, each with a specific wavelength.
* When sunlight interacts with molecules in the atmosphere, such as nitrogen and oxygen, the molecules scatter the light.
* Different wavelengths of light are scattered differently.
* Shorter wavelengths of light, like blue light, are scattered more efficiently than longer wavelengths.

**Therefore:**

* More blue light is scattered in all directions, reaching our eyes and making the sky appear blue.
* Longer wavelengths of light, like red light, are scattered less and tend to travel in a straight line, away from our eyes.
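
The request and response handling from the cells above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `complete` and its signature are hypothetical, and the HTTP call is injected as `post` so the helper can be exercised without a running server. Any extra keyword arguments are forwarded as fields of the JSON body alongside `prompt`.

```python
def complete(prompt, post, base_url="http://localhost:8080", **options):
    """Send a prompt to the /completion endpoint and return the generated text.

    post is a requests.post-compatible callable; options are additional
    request fields merged into the JSON body next to "prompt".
    """
    payload = {"prompt": prompt, **options}
    response = post(f"{base_url}/completion", json=payload)
    # Raise on HTTP errors, e.g. a 503 while the model is still loading.
    response.raise_for_status()
    # The generated text is returned under the "content" key.
    return response.json()["content"]
```

With the server from the Quick Start running, `Markdown(complete("Why is the sky blue?", requests.post))` should render the same kind of answer as above. Sampling options such as `n_predict` or `temperature` are not shown in this notebook; check the server documentation for your build before relying on them.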