In [101]:
import json
import numpy as np
import requests
from IPython.display import JSON, Markdown


# 1. LLaMA C++ HTTP Server API Endpoints

## 1.1 Start `llama-server`

```bash
MODEL="./models/gemma-1.1-7b-it.Q4_K_M.gguf"
llama-server \
    --model $MODEL \
    --host localhost \
    --port 8080
```

## 1.2 GET `/health`

Returns the health check result.

In [108]:
response = requests.get("http://localhost:8080/health")
print(response.content)

b'{"status":"ok"}'


### 1.2.1 Response format

- HTTP status code 503
  - Body: `{"error": {"code": 503, "message": "Loading model", "type": "unavailable_error"}}`
  - Explanation: the model is still being loaded.
- HTTP status code 200
  - Body: `{"status": "ok" }`
  - Explanation: the model is successfully loaded and the server is ready.
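
Because the server answers 503 while the model is loading, a client usually polls `/health` before sending work. The sketch below is one way to do that; `wait_until_ready` and its `fetch_health` callable are hypothetical names introduced here so the logic can be exercised without a running server.

```python
import time
from typing import Callable, Tuple


def wait_until_ready(
    fetch_health: Callable[[], Tuple[int, dict]],
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
) -> bool:
    """Poll until the server reports HTTP 200 with status "ok", or time out.

    `fetch_health` is a hypothetical callable returning (status_code, json_body).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status_code, body = fetch_health()
        if status_code == 200 and body.get("status") == "ok":
            return True
        if status_code != 503:
            # Neither "ready" nor "still loading": treat it as a real failure.
            raise RuntimeError(f"unexpected health response: {status_code} {body}")
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Example wiring with requests (not executed here):
#   def fetch_health():
#       r = requests.get("http://localhost:8080/health")
#       return r.status_code, r.json()
#   wait_until_ready(fetch_health)
```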

### 1.1 POST `/completion`

#### 1.1.1 Prompting

*Options:*

- `prompt`: Provide the prompt for this completion as a string or as an array of strings or numbers representing tokens. Internally, if `cache_prompt` is `true`, the prompt is compared to the previous completion and only the "unseen" suffix is evaluated. A `BOS` token is inserted at the start if all of the following conditions are true:
  - The prompt is a string or an array with the first element given as a string
  - The model's `tokenizer.ggml.add_bos_token` metadata is `true`
- `cache_prompt`: Re-use the KV cache from a previous request if possible. This way the common prefix does not have to be re-processed, only the suffix that differs between the requests. Because (depending on the backend) the logits are **not** guaranteed to be bit-for-bit identical for different batch sizes (prompt processing vs. token generation), enabling this option can cause nondeterministic results. Default: `false`
- `stream`: Receive each predicted token in real time instead of waiting for the completion to finish. To enable this, set it to `true`.
- `stop`: Specify a JSON array of stopping strings. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration. Default: `[]`
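
With `stream` set to `true`, the response arrives in pieces rather than as one JSON document. The sketch below assumes each piece is a line of the form `data: {...}` carrying `content` and `stop` fields; that framing is an assumption to verify against your server version, and `iter_stream_content` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the `content` of each streamed chunk until one reports stop=true."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        chunk = json.loads(line[len("data:"):])
        yield chunk.get("content", "")
        if chunk.get("stop"):
            break


# Example wiring with requests (not executed here):
#   r = requests.post(url, json={"prompt": "...", "stream": True}, stream=True)
#   for piece in iter_stream_content(r.iter_lines()):
#       print(piece, end="", flush=True)
```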

In [110]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
    }
)

In [113]:
print(response)

<Response [200]>


In [116]:
json_data = json.loads(response.content)
JSON(json_data)

<IPython.core.display.JSON object>

#### 1.1.2 Response format

- Note: When using streaming mode (`stream`), only `content` and `stop` will be returned until end of completion.

- `completion_probabilities`: An array of token probabilities for each completion. The array's length is `n_predict`. Each item in the array has the following structure:

```json
{
  "content": "<the token selected by the model>",
  "probs": [
    {
      "prob": float,
      "tok_str": "<most likely token>"
    },
    {
      "prob": float,
      "tok_str": "<second most likely token>"
    },
    ...
  ]
},
```

Notice that each `probs` is an array of length `n_probs`.

- `content`: Completion result as a string (excluding `stopping_word` if any). In case of streaming mode, will contain the next token as a string.
- `stop`: Boolean for use with `stream` to check whether the generation has stopped (Note: This is not related to stopping words array `stop` from input options)
- `generation_settings`: The provided options above excluding `prompt` but including `n_ctx`, `model`. These options may differ from the original ones in some way (e.g. bad values filtered out, strings converted to tokens, etc.).
- `model`: The path to the model loaded with `-m`
- `prompt`: The provided `prompt`
- `stopped_eos`: Indicating whether the completion has stopped because it encountered the EOS token
- `stopped_limit`: Indicating whether the completion stopped because `n_predict` tokens were generated before stop words or EOS was encountered
- `stopped_word`: Indicating whether the completion stopped due to encountering a stopping word from `stop` JSON array provided
- `stopping_word`: The stopping word encountered which stopped the generation (or "" if not stopped due to a stopping word)
- `timings`: Hash of timing information about the completion such as the number of tokens `predicted_per_second`
- `tokens_cached`: Number of tokens from the prompt which could be re-used from previous completion (`n_past`)
- `tokens_evaluated`: Number of tokens evaluated in total from the prompt
- `truncated`: Boolean indicating if the context size was exceeded during generation, i.e. the number of tokens provided in the prompt (`tokens_evaluated`) plus tokens generated (`tokens_predicted`) exceeded the context size (`n_ctx`)
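
The three `stopped_*` flags together say why a completion ended. A small helper can fold them into one readable value; `stop_reason` is a hypothetical name, and the sketch only relies on the fields listed above.

```python
def stop_reason(completion: dict) -> str:
    """Summarise why generation ended, using the stopped_* response fields."""
    if completion.get("stopped_word"):
        return f"stop word {completion.get('stopping_word', '')!r}"
    if completion.get("stopped_eos"):
        return "EOS token"
    if completion.get("stopped_limit"):
        return "n_predict limit"
    return "unknown"
```

In the notebook this could be called as `stop_reason(json_data)` right after a `/completion` request.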

In [None]:
_generated_text = json_data["content"]
Markdown(_generated_text)

#### 1.1.3 Random Number Generator (RNG) Seed

The RNG seed is used to initialize the random number generator that influences the text generation process. By setting a specific value for `seed` you can obtain consistent and reproducible results across multiple runs with the same input and settings. This can be helpful for testing, debugging, or comparing the effects of different options on the generated text to see when they diverge. If the seed is set to a value less than 0, a random seed will be used, which will result in different outputs on each run. The default value is -1 which will choose a random value for `seed`.
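
The effect of a fixed seed can be illustrated without the server. The sketch below uses Python's own `random` module as a stand-in for the server's RNG (an analogy, not the server's implementation); `sample_tokens` is a hypothetical helper.

```python
import random


def sample_tokens(probs: dict, seed: int, n: int) -> list:
    """Draw n tokens from a categorical distribution.

    A non-negative seed gives a reproducible sequence; a negative seed
    falls back to an unseeded generator, mirroring the `seed` option.
    """
    rng = random.Random(seed) if seed >= 0 else random.Random()
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)


probs = {"blue": 0.6, "red": 0.3, "green": 0.1}
# Same seed, same inputs: the two draws are identical.
assert sample_tokens(probs, seed=42, n=20) == sample_tokens(probs, seed=42, n=20)
```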

In [128]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42
    }
)

In [129]:
_json_data = json.loads(response.content)
_generated_text = _json_data["content"]
Markdown(_generated_text)



**Answer:**

The sky is blue due to a phenomenon called **Rayleigh scattering**. 

* Sunlight is composed of all the colors of the rainbow, each with a specific wavelength. 
* When sunlight interacts with molecules in the atmosphere, such as nitrogen and oxygen, the different wavelengths are scattered in different directions. 
* Shorter wavelengths of light, like blue light, are scattered more efficiently than longer wavelengths. 
* Since our eyes are most sensitive to blue light, we perceive the sky as blue.

#### 1.1.4 Number of Tokens to Predict

The `n_predict` option (default: -1) controls the number of tokens the model generates in response to the input prompt. By adjusting this value, you can influence the length of the generated text. A higher value will result in longer text, while a lower value will produce shorter text.

Even though all models have a finite context window, a value of -1 will enable *infinite* text generation. How? When the context window is full, half of the tokens after `n_keep` will be discarded. The context must then be re-evaluated before generation can resume. On large models and/or large context windows, this can result in a significant pause in output. If the output delay is undesirable, a value of -2 will stop generation immediately when the context is filled.

It is important to note that the generated text may be shorter than the specified number of tokens if an End-of-Sequence (EOS) token or a reverse prompt is encountered. In interactive mode, text generation will pause and control will be returned to the user. In non-interactive mode, the program will end. In both cases, the text generation may stop before reaching the specified `n_predict` value. If you want the model to keep going without ever producing End-of-Sequence on its own, you can use the `ignore_eos` parameter.


- `n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token. By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
- `min_keep`: If greater than 0, force samplers to return N possible tokens at minimum. Default: `0`
- `t_max_predict_ms`: Set a time limit in milliseconds for the prediction (a.k.a. text-generation) phase. The timeout will trigger if the generation takes more than the specified time (measured since the first token was generated) and if a new-line character has already been generated. Useful for FIM applications. Default: `0`, which is disabled.
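
The context-shift rule described for `n_predict = -1` (keep the first `n_keep` tokens, then drop half of what follows) can be sketched on a plain list. `shift_context` is a hypothetical helper, and the sketch ignores the BOS token and the re-evaluation step.

```python
def shift_context(tokens: list, n_keep: int) -> list:
    """Drop the older half of the tokens that follow the first n_keep tokens."""
    n_left = len(tokens) - n_keep
    n_discard = n_left // 2
    return tokens[:n_keep] + tokens[n_keep + n_discard:]


# Ten tokens, keep the first two: tokens 2..5 are discarded.
assert shift_context(list(range(10)), n_keep=2) == [0, 1, 6, 7, 8, 9]
```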

In [130]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False
    }
)

In [131]:
_json_data = json.loads(response.content)
_generated_text = _json_data["content"]
Markdown(_generated_text)



**Answer:**

The sky is blue due to a phenomenon called **Rayleigh scattering**. 

* Sunlight is composed of all the colors of the rainbow, each with a specific wavelength. 
* When sunlight interacts with molecules in the atmosphere, such as nitrogen and oxygen, the different wavelengths are scattered in different directions. 
* Shorter wavelengths of light, like blue light, are scattered more efficiently than longer wavelengths. 
* Since our eyes are most sensitive to blue light, we perceive the sky as blue.

#### 1.1.5 Temperature

The `temperature` hyperparameter controls the randomness of the generated text. It affects the probability distribution of the model's output tokens. A higher temperature makes the output more random and creative, while a lower temperature makes the output more focused, deterministic, and conservative.

- `dynatemp_range`: Dynamic temperature range. The final temperature will be in the range of `[temperature - dynatemp_range; temperature + dynatemp_range]`. Default: `0.0`, which is disabled.
- `dynatemp_exponent`: Dynamic temperature exponent. Default: `1.0`
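
To see how temperature reshapes the distribution, the following sketch divides the logits by the temperature before a softmax. It is the textbook formulation, not the server's code; `softmax_with_temperature` is a hypothetical name, and the sketch rejects non-positive temperatures rather than modelling how the server treats them.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("this sketch only handles temperature > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
focused = softmax_with_temperature(logits, 0.2)  # low temperature
diffuse = softmax_with_temperature(logits, 2.0)  # high temperature
# Lower temperature concentrates probability on the top token.
assert focused[0] > diffuse[0]
```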

In [132]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False,
        "temperature": 0.8,
        "dynatemp_range": 0.1,
        "dynatemp_exponent": 1.0
    }
)

In [133]:
_json_data = json.loads(response.content)
_generated_text = _json_data["content"]

Markdown(_generated_text)



**Answer:**

The sky is blue due to a phenomenon called **Rayleigh scattering**. 

* Sunlight is composed of all the colors of the rainbow, each with a specific wavelength. 
* When sunlight interacts with molecules in the atmosphere, such as nitrogen and oxygen, the different wavelengths are scattered in different directions. 
* Shorter wavelengths of light, such as blue light, are scattered more efficiently than longer wavelengths. 
* As a result, more blue light is scattered in all directions, reaching our eyes and making the sky appear blue.

#### 1.1.6 Repeat Penalty

The `repeat_penalty` option helps prevent the model from generating repetitive or monotonous text. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. The default value is 1.1.

The `repeat_last_n` option controls the number of tokens in the history to consider for penalizing repetition. A larger value will look further back in the generated text to prevent repetitions, while a smaller value will only consider recent tokens. A value of 0 disables the penalty, and a value of -1 sets the number of tokens considered equal to the context size, `ctx_size`. The default value is 64. 

- `penalize_nl`: Penalize newline tokens when applying the repeat penalty. Default: `true`
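
One common way to apply a repetition penalty is to push down the logits of tokens seen in the last `repeat_last_n` positions. The sketch below follows that scheme (divide positive logits by the penalty, multiply negative ones); treat it as an assumption about the mechanics rather than a copy of the server's sampler. `apply_repeat_penalty` is a hypothetical name, and `-1` is approximated as "the whole history".

```python
def apply_repeat_penalty(
    logits: dict, history: list, penalty: float = 1.1, last_n: int = 64
) -> dict:
    """Return a copy of logits with recently generated tokens made less likely."""
    if last_n == 0 or penalty == 1.0:
        return dict(logits)  # penalty disabled
    window = history if last_n < 0 else history[-last_n:]
    recent = set(window)
    penalised = {}
    for token, logit in logits.items():
        if token in recent:
            # Both branches move the logit towards "less likely".
            logit = logit / penalty if logit > 0 else logit * penalty
        penalised[token] = logit
    return penalised


out = apply_repeat_penalty({"a": 2.0, "b": -1.0, "c": 0.5}, ["a", "b"], penalty=2.0)
assert out == {"a": 1.0, "b": -2.0, "c": 0.5}
```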


In [134]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False,
        "temperature": 0.8,
        "dynatemp_range": 0.1,
        "dynatemp_exponent": 1.0,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "penalize_nl": True
    }
)

In [135]:
_json_data = json.loads(response.content)
_generated_text = _json_data["content"]

Markdown(_generated_text)



**Answer:** 
The sky is blue due to the process of **Rayleigh scattering**. Sunlight consists mainly  of all wavelengths in a range from violet through red. When sunlight interacts with molecules like nitrogen and oxygen, shorter wavelength light (violet &blue) gets scattered more efficiently than longer wave length lights(orange&red).

* Shorter waves have higher frequency/energy per unit area which results into greater scattering power against the particles of air .
**The blue wavelengths are dispersed in all directions by these gases.** 


- Most  of this diffused light reaches our eyes from a relatively small portion (about one degree) directly overhead. That's why we see mainly skyblue during clear weather when there isn’t much dust or cloud to absorb the scattered sunlight .

**Factors influencing colour of Sky:**
* Time & Latitude - different time zones experience differently coloured skies due variations in sun elevation and composition  of atmosphere 


- Cloud Coverage/ Dust particles – clouds block direct contact between Sunlight& air molecules thereby limiting scattering. Similarly, large dust particle can scatter all wavelengths equally leading to a less blue sky .

#### 1.1.7 Top-K Sampling

Top-k sampling is a text generation method that selects the next token only from the `--top-k` most likely tokens predicted by the model. It helps reduce the risk of generating low-probability or nonsensical tokens, but it may also limit the diversity of the output. A higher value for top-k (e.g., 100) will consider more tokens and lead to more diverse text, while a lower value (e.g., 10) will focus on the most probable tokens and generate more conservative text. The default value is 40.

- `top_k`: Limit the next token selection to the K most probable tokens. Default: `40`
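
A minimal sketch of top-k filtering on a toy distribution, assuming the filter keeps the `k` most probable tokens and renormalises them (`top_k_filter` is a hypothetical helper, not a server function):

```python
def top_k_filter(probs: dict, k: int) -> dict:
    """Keep the k most probable tokens and renormalise their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_filter(probs, k=2)
assert set(filtered) == {"a", "b"}
```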


In [136]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False,
        "temperature": 0.8,
        "dynatemp_range": 0.1,
        "dynatemp_exponent": 1.0,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "penalize_nl": True,
        "top_k": 40,
    }
)

In [137]:
json_data = json.loads(response.content)
_generated_text = json_data["content"]

Markdown(_generated_text)



**Answer:** 
The sky is blue due to the process of **Rayleigh scattering**. Sunlight consists mainly  of all wavelengths in a range from violet through red. When sunlight interacts with molecules like nitrogen and oxygen, shorter wavelength light (violet &blue) gets scattered more efficiently than longer wave length lights(orange&red).

* Shorter waves have higher frequency/energy per unit area which results into greater scattering power against the particles of air .
**The blue wavelengths are dispersed in all directions by these gases.** 


- Most  of this diffused light reaches our eyes from a relatively small portion (about one degree) directly overhead. That's why we see mainly skyblue during clear weather when there isn’t much dust or cloud to absorb the scattered sunlight .

**Factors influencing colour of Sky:**
* Time & Latitude - different time zones experience differently coloured skies due variations in sun elevation and composition  of atmosphere 


- Cloud Coverage/ Dust particles – clouds block direct contact between Sunlight& air molecules thereby limiting scattering. Similarly, large dust particle can scatter all wavelengths equally leading to a less blue sky .

#### 1.1.8 Top-P Sampling

Top-p sampling (`top_p`), also known as nucleus sampling, is another text generation method that selects the next token from a subset of tokens that together have a cumulative probability of at least p. This method provides a balance between diversity and quality by considering both the probabilities of tokens and the number of tokens to sample from. A higher value for top-p (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. The default value is 0.95.

- `top_p`: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P. Default: `0.95`
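
The nucleus can be built by walking the tokens from most to least probable and stopping once the running total reaches `p`. The sketch below does exactly that on a toy distribution; `top_p_filter` is a hypothetical helper.

```python
def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
assert set(top_p_filter(probs, 0.7)) == {"a", "b"}
```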

In [138]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False,
        "temperature": 0.8,
        "dynatemp_range": 0.1,
        "dynatemp_exponent": 1.0,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "penalize_nl": True,
        "top_k": 40,
        "top_p": 0.95,
    }
)

In [139]:
json_data = json.loads(response.content)
_generated_text = json_data["content"]

Markdown(_generated_text)



**Answer:** 
The sky is blue due to the process of **Rayleigh scattering**. Sunlight consists mainly  of all wavelengths in a range from violet through red. When sunlight interacts with molecules like nitrogen and oxygen, shorter wavelength light (violet &blue) gets scattered more efficiently than longer wave length lights(orange&red).

* Shorter waves have higher frequency/energy per unit area which results into greater scattering power against the particles of air .
**The blue wavelengths are dispersed in all directions by these gases.** 


- Most  of this diffused light reaches our eyes from a relatively small portion (about one degree) directly overhead. That's why we see mainly skyblue during clear weather when there isn’t much dust or cloud to absorb the scattered sunlight .

**Factors influencing colour of Sky:**
* Time & Latitude - different time zones experience differently coloured skies due variations in sun elevation and composition  of atmosphere 


- Cloud Coverage/ Dust particles – clouds block direct contact between Sunlight& air molecules thereby limiting scattering. Similarly, large dust particle can scatter all wavelengths equally leading to a less blue sky .

#### 1.1.9 Min-P Sampling

The `--min-p` sampling method sets a minimum base probability threshold for token selection and aims to ensure a balance of quality and variety in the generated text. The `--min-p` method was designed as an alternative to `--top-p`. The parameter $p$ represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with $p=0.05$ and the most likely token having a probability of 0.9, tokens with a probability less than $0.05 \times 0.9 = 0.045$ are filtered out. The default value is 0.05.

- `min_p`: The minimum probability for a token to be considered, relative to the probability of the most likely token. Default: `0.05`
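
The worked example in the text (threshold 0.045 when the top token has probability 0.9) can be reproduced in a few lines. `min_p_filter` is a hypothetical helper operating on a toy distribution.

```python
def min_p_filter(probs: dict, min_p: float) -> dict:
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Threshold is 0.05 * 0.9 = 0.045, so "c" (0.04) is filtered out.
probs = {"a": 0.9, "b": 0.06, "c": 0.04}
assert set(min_p_filter(probs, 0.05)) == {"a", "b"}
```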
    

In [140]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False,
        "temperature": 0.8,
        "dynatemp_range": 0.1,
        "dynatemp_exponent": 1.0,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "penalize_nl": True,
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
    }
)

In [141]:
json_data = json.loads(response.content)
_generated_text = json_data["content"]

Markdown(_generated_text)



**Answer:** 
The sky is blue due to the process of **Rayleigh scattering**. Sunlight consists mainly  of all wavelengths in a range from violet through red. When sunlight interacts with molecules like nitrogen and oxygen, shorter wavelength light (violet &blue) gets scattered more efficiently than longer wave length lights(orange&red).

* Shorter waves have higher frequency/energy per unit area which results into greater scattering power against the particles of air .
**The blue wavelengths are dispersed in all directions by these gases.** 


- Most  of this diffused light reaches our eyes from a relatively small portion (about one degree) directly overhead. That's why we see mainly skyblue during clear weather when there isn’t much dust or cloud to absorb the scattered sunlight .

**Factors influencing colour of Sky:**
* Time & Latitude - different time zones experience differently coloured skies due variations in sun elevation and composition  of atmosphere 


- Cloud Coverage/ Dust particles – clouds block direct contact between Sunlight& air molecules thereby limiting scattering. Similarly, large dust particle can scatter all wavelengths equally leading to a less blue sky .

#### 1.1.10 Locally Typical Sampling

Locally typical sampling, `typical_p`, promotes the generation of contextually coherent and diverse text by sampling tokens that are typical or expected based on the surrounding context. By setting the parameter $p$ between 0 and 1, you can control the balance between producing text that is locally coherent and diverse. The default setting is $p=1.0$, which disables locally typical sampling.
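
The sketch below follows the usual description of locally typical sampling: rank tokens by how close their surprisal is to the entropy of the distribution, then keep the closest ones until their cumulative probability reaches `typical_p`. It is an illustration under that assumption, not the server's implementation, and `typical_filter` is a hypothetical name.

```python
import math


def typical_filter(probs: dict, typical_p: float) -> dict:
    """Keep the tokens whose surprisal is closest to the distribution's entropy."""
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    # Sort by distance between each token's surprisal (-log p) and the entropy.
    ranked = sorted(
        (abs(-math.log(p) - entropy), token, p)
        for token, p in probs.items()
        if p > 0
    )
    kept, cumulative = [], 0.0
    for _, token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= typical_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"a": 0.6, "b": 0.3, "c": 0.1}
# "b" is the most typical token here, even though "a" is the most probable.
assert list(typical_filter(probs, 0.2)) == ["b"]
```

With `typical_p` at `1.0` every token is kept, which matches the statement that the default disables the filter.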


In [142]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False,
        "temperature": 0.8,
        "dynatemp_range": 0.0,
        "dynatemp_exponent": 1.0,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "penalize_nl": True,
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
        "typical_p": 1.0,
    }
)

In [143]:
json_data = json.loads(response.content)
_generated_text = json_data["content"]

Markdown(_generated_text)



**Answer:** 
The sky is blue due to the process of **Rayleigh scattering**. Sunlight consists mainly  of all wavelengths or colors. When sunlight interacts with molecules in Earth's atmosphere, like nitrogen and oxygen atoms these particles scatter light rays randomly without altering their direction significantly . But different color lights are scattered differently based on wavelength:

- Shorter Wavelength (blue/violet) - gets dispersed more efficiently
_LongerWavelength_(red / infra red)_ is less effectively diffused. 


**The blue wavelengths of sunlight:**  travel in straight lines until they reach our eyes, giving us the impression that sky appears **BLUE**.

#### 1.1.11 Mirostat Sampling

Mirostat is an algorithm that actively maintains the quality of generated text within a desired range during text generation. It aims to strike a balance between coherence and diversity, avoiding low-quality output caused by excessive repetition (boredom traps) or incoherence (confusion traps). To enable Mirostat sampling, set `mirostat` to `1` (Mirostat 1.0) or `2` (Mirostat 2.0). By default, `mirostat` is `0`, which disables Mirostat sampling.

The `mirostat_eta` option sets the Mirostat learning rate (eta). The learning rate influences how quickly the algorithm responds to feedback from the generated text. A lower learning rate will result in slower adjustments, while a higher learning rate will make the algorithm more responsive. The default value is `0.1`.

The `mirostat_tau` option sets the Mirostat target entropy (tau), which represents the desired perplexity value for the generated text. Adjusting the target entropy allows you to control the balance between coherence and diversity in the generated text. A lower value will result in more focused and coherent text, while a higher value will lead to more diverse and potentially less coherent text. The default value is `5.0`.
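
To make the feedback loop concrete, here is a sketch of a single Mirostat 2.0 step as the algorithm is commonly described: drop tokens whose surprise exceeds a running threshold `mu`, sample from the rest, then move `mu` by `eta` times the gap between the observed surprise and `tau`. The function name, the use of base-2 logarithms, and the toy distribution are assumptions of this sketch; all probabilities must be greater than zero.

```python
import math
import random


def mirostat_v2_step(probs: dict, mu: float, tau: float = 5.0,
                     eta: float = 0.1, rng: random.Random = None):
    """Sample one token and return (token, updated_mu)."""
    rng = rng or random.Random()
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    # Keep tokens whose surprise (-log2 p) is within mu; always keep at least one.
    kept = [(t, p) for t, p in ranked if -math.log2(p) <= mu] or ranked[:1]
    total = sum(p for _, p in kept)
    tokens = [t for t, _ in kept]
    weights = [p / total for _, p in kept]
    token = rng.choices(tokens, weights=weights, k=1)[0]
    observed = -math.log2(weights[tokens.index(token)])
    # Feedback: raise mu if the sample was less surprising than tau, lower it otherwise.
    return token, mu - eta * (observed - tau)


# With mu = 1.5 only "a" (surprise 1 bit) survives the truncation.
token, new_mu = mirostat_v2_step({"a": 0.5, "b": 0.25, "c": 0.25}, mu=1.5)
assert token == "a"
```

A caller would typically start with `mu = 2 * tau` and feed the returned `mu` into the next step.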


In [125]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False,
        "temperature": 0.8,
        "dynatemp_range": 0.0,
        "dynatemp_exponent": 1.0,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "penalize_nl": True,
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
        "typical_p": 1.0,
        "mirostat": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1,
    }
)

In [127]:
json_data = json.loads(response.content)
_generated_text = json_data["content"]

Markdown(_generated_text)



**Answer:**

The sky is blue due to a phenomenon called **Rayleigh scattering**. 

* Sunlight is composed of all the colors of the rainbow, each with a specific wavelength.
* When sunlight interacts with molecules in the atmosphere, such as nitrogen and oxygen, the molecules scatter the light.
* Different wavelengths of light scatter differently. 
* Shorter wavelengths of light, such as blue light, scatter more efficiently than longer wavelengths.
* Since blue light is scattered more evenly across the sky, we see a predominance of blue light when looking up at the sky.

**Additional factors influencing the color of the sky:**

* **Time of day:** The sun is higher in the sky during the day, resulting in more direct sunlight and a brighter blue sky.
* **Altitude:** Higher altitudes have thinner atmospheres, leading to less scattering and a clearer blue sky.
* **Cloud coverage:** Clouds block the sunlight and scatter less light, resulting in a less blue sky.
* **Pollution:** Pollution in the atmosphere can scatter different wavelengths of light, affecting the color of the sky.

#### 1.1.12 Samplers

`samplers`: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. Default: `["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"]` - these are all the available values.
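
The ordering semantics (unlisted samplers are skipped, repeated names run again) amount to folding a list of filters over the token distribution. The sketch below shows that with two toy samplers and fixed, made-up parameters; `TOY_SAMPLERS` and `apply_samplers` are hypothetical names, not part of the server.

```python
def _renormalise(probs: dict) -> dict:
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}


def top_k(probs: dict, k: int = 2) -> dict:
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return _renormalise(dict(kept))


def min_p(probs: dict, p: float = 0.2) -> dict:
    threshold = p * max(probs.values())
    return _renormalise({t: v for t, v in probs.items() if v >= threshold})


# Toy registry: only the samplers named in `order` are ever applied.
TOY_SAMPLERS = {"top_k": top_k, "min_p": min_p}


def apply_samplers(probs: dict, order: list) -> dict:
    """Apply the named samplers one after another, in the given order."""
    for name in order:
        probs = TOY_SAMPLERS[name](probs)
    return probs


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
assert set(apply_samplers(probs, ["min_p"])) == {"a", "b", "c"}
assert set(apply_samplers(probs, ["min_p", "top_k"])) == {"a", "b"}
```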

In [144]:
response = requests.post(
    url="http://localhost:8080/completion",
    json={
        "prompt": "Why is the sky blue?",
        "cache_prompt": False,
        "stream": False,
        "stop": [],
        "seed": 42,
        "n_predict": -1,
        "n_keep": 0,
        "min_keep": 0,
        "ignore_eos": False,
        "temperature": 0.8,
        "dynatemp_range": 0.0,
        "dynatemp_exponent": 1.0,
        "repeat_penalty": 1.1,
        "repeat_last_n": 64,
        "penalize_nl": True,
        "top_k": 40,
        "top_p": 0.95,
        "min_p": 0.05,
        "typical_p": 1.0,
        "mirostat": 0,
        "mirostat_tau": 5.0,
        "mirostat_eta": 0.1,
        "samplers": [
            "top_k",
            "tfs_z",
            "typical_p",
            "top_p",
            "min_p",
            "temperature",
        ]
    }
)

In [145]:
json_data = json.loads(response.content)
_generated_text = json_data["content"]

Markdown(_generated_text)



**Answer:** 
The sky is blue due to the process of **Rayleigh scattering**. Sunlight consists mainly  of all wavelengths or colors. When sunlight interacts with molecules in Earth's atmosphere, like nitrogen and oxygen atoms these particles scatter light rays randomly without altering their direction significantly . But different color lights are scattered differently based on wavelength:

- Shorter Wavelength (blue/violet) - gets dispersed more efficiently
_LongerWavelength_(red / infra red)_ is less effectively diffused. 


**The blue wavelengths of sunlight:**  travel in straight lines until they reach our eyes, giving us the impression that sky appears **BLUE**.

### 2.1 POST `/tokenize`

In [160]:
response = requests.post(
    url="http://localhost:8080/tokenize",
    json={
        "content": "Why is the sky blue?",
        "with_special": False,
        "with_pieces": False
    }
)

In [162]:
json_data = json.loads(response.content)
JSON(json_data)

<IPython.core.display.JSON object>

### 3.1 POST `/detokenize`

In [166]:
response = requests.post(
    url="http://localhost:8080/detokenize",
    json=json_data
)

In [169]:
json_response = json.loads(response.content)
JSON(json_response)

<IPython.core.display.JSON object>
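
The two cells above form a round trip: text to tokens, then tokens back to text. The sketch below wraps that in one function. It assumes the tokenizer response carries a `tokens` array and the detokenizer response a `content` string; `round_trip` and its `post` callable are hypothetical, introduced so the flow can be tested without a server.

```python
from typing import Callable


def round_trip(text: str, post: Callable[[str, dict], dict]) -> str:
    """Tokenize text, detokenize the result, and return the recovered string.

    `post(path, payload)` is a hypothetical callable that sends a JSON POST
    request to the server and returns the decoded JSON response.
    """
    tokens = post("/tokenize", {"content": text})["tokens"]
    return post("/detokenize", {"tokens": tokens})["content"]


# Example wiring with requests (not executed here):
#   def post(path, payload):
#       return requests.post(f"http://localhost:8080{path}", json=payload).json()
#   round_trip("Why is the sky blue?", post)
```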

### 4.0 Start an embedding server

To use the `embedding` API endpoint you must start the server using the `--embeddings` option (and without using the `--reranking` option). Open a new terminal and run the following command to start a new server with the embeddings endpoint enabled.

```bash
MODEL="./models/gemma-1.1-7b-it.Q4_K_M.gguf"
llama-server --model $MODEL --embeddings
```

### 4.1 POST `/embedding`

This endpoint works the same way as [the embedding example](../embedding).

*Options:*

- `content`: Set the text to process.
- `image_data`: An array of objects holding base64-encoded image `data` and the `id` used to reference each image in `content`. You can determine the place of the image in the content as in the following: `Image: [img-21].\nCaption: This is a picture of a house`. In this case, `[img-21]` will be replaced by the embeddings of the image with id `21` in the following `image_data` array: `{..., "image_data": [{"data": "<BASE64_STRING>", "id": 21}]}`. Use `image_data` only with multimodal models, e.g., LLaVA.

In [60]:
%%bash --out response

curl \
    --request POST \
    --url http://localhost:8080/embedding \
    --header "Content-Type: application/json" \
    --data '{"content": "Hello World!"}'


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 66890  100 66863  100    27   148k     61 --:--:-- --:--:-- --:--:--  148k


In [61]:
json_response = json.loads(response)
JSON(json_response)

<IPython.core.display.JSON object>

In [66]:
embedding_arr = np.array(json_response["embedding"])
embedding_arr

array([-0.01669182,  0.01353396,  0.00052864, ...,  0.00345376,
       -0.00073828, -0.00154076])
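
Embedding vectors are usually compared with cosine similarity. The helper below is a plain-Python sketch of that comparison (`cosine_similarity` is a name introduced here); with two responses from the endpoint you would pass their `embedding` lists.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


assert abs(cosine_similarity([1.0, 0.0], [0.0, 1.0])) < 1e-9  # orthogonal vectors
```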

### 5.0 Start the reranking server

### 5.1 POST `/reranking`

Similar to https://jina.ai/reranker/ but might change in the future.
Requires a reranker model (such as [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)) and the `--embedding --pooling rank` options.

*Options:*

- `query`: The query against which the documents will be ranked.
- `documents`: An array of strings representing the documents to be ranked.

*Aliases:*

- `/rerank`
- `/v1/rerank`
- `/v1/reranking`

*Examples:*

```shell
curl http://127.0.0.1:8012/v1/rerank \
    -H "Content-Type: application/json" \
    -d '{
        "model": "some-model",
        "query": "What is panda?",
        "top_n": 3,
        "documents": [
            "hi",
            "it is a bear",
            "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."
        ]
    }' | jq
```
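
Once a rerank response comes back, the scores need to be matched to the original documents. The sketch below assumes a Jina-style response in which each result carries an `index` into `documents` and a `relevance_score`; that shape is an assumption to check against your server, `order_documents` is a hypothetical helper, and the scores shown are made up for illustration.

```python
def order_documents(documents: list, results: list) -> list:
    """Pair each document with its score, most relevant first."""
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in ranked]


documents = ["hi", "it is a bear", "The giant panda is a bear species endemic to China."]
# Hypothetical scores, not real model output.
results = [
    {"index": 0, "relevance_score": -8.0},
    {"index": 1, "relevance_score": -2.5},
    {"index": 2, "relevance_score": 4.2},
]
best_document, best_score = order_documents(documents, results)[0]
assert best_document.startswith("The giant panda")
```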

### 6.1 POST `/infill`

Takes a prefix and a suffix and returns the predicted completion as a stream.

*Options:*

- `input_prefix`: Set the prefix of the code to infill.
- `input_suffix`: Set the suffix of the code to infill.
- `input_extra`:  Additional context inserted before the FIM prefix.
- `prompt`:       Added after the `FIM_MID` token

`input_extra` is an array of `{"filename": string, "text": string}` objects.

The endpoint also accepts all the options of `/completion`.

If the model has `FIM_REPO` and `FIM_FILE_SEP` tokens, the [repo-level pattern](https://arxiv.org/pdf/2409.12186) is used:

```txt
<FIM_REP>myproject
<FIM_SEP>{chunk 0 filename}
{chunk 0 text}
<FIM_SEP>{chunk 1 filename}
{chunk 1 text}
...
<FIM_SEP>filename
<FIM_PRE>[input_prefix]<FIM_SUF>[input_suffix]<FIM_MID>[prompt]
```

If the tokens are missing, then the extra context is simply prefixed at the start:

```txt
[input_extra]<FIM_PRE>[input_prefix]<FIM_SUF>[input_suffix]<FIM_MID>[prompt]
```
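
To make the two layouts concrete, here is a small sketch that assembles them as plain strings. It is only an illustration of the templates above: the server builds the prompt internally from the model's real special tokens, whereas this code uses the placeholder names `<FIM_REP>`, `<FIM_SEP>`, `<FIM_PRE>`, `<FIM_SUF>` and `<FIM_MID>` literally. The function name `build_fim_prompt` and its `project` and `filename` parameters are hypothetical.

```python
def build_fim_prompt(input_prefix, input_suffix, prompt="", input_extra=None,
                     repo_level=True, project="myproject", filename="filename"):
    """Assemble the text layout described above using placeholder token names."""
    input_extra = input_extra or []
    fim_core = f"<FIM_PRE>{input_prefix}<FIM_SUF>{input_suffix}<FIM_MID>{prompt}"

    if repo_level:
        # Repo-level pattern: project name, then one filename/text pair per chunk.
        lines = [f"<FIM_REP>{project}"]
        for chunk in input_extra:
            lines.append(f"<FIM_SEP>{chunk['filename']}")
            lines.append(chunk["text"])
        lines.append(f"<FIM_SEP>{filename}")
        lines.append(fim_core)
        return "\n".join(lines)

    # Fallback pattern: the extra context is simply prefixed at the start.
    extra_text = "".join(chunk["text"] for chunk in input_extra)
    return extra_text + fim_core


extra = [{"filename": "a.py", "text": "x = 1"}]
repo_prompt = build_fim_prompt("def f(", ")", input_extra=extra)
flat_prompt = build_fim_prompt("def f(", ")", input_extra=extra, repo_level=False)
print(repo_prompt)
print(flat_prompt)
```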

### 7.1 GET `/props`

This endpoint is public (no API key check). By default, it is read-only. To make POST requests that change global properties, you need to start the server with `--props`.

**Response format**

```json
{
  "default_generation_settings": { ... },
  "total_slots": 1,
  "chat_template": ""
}
```

- `default_generation_settings` - the default generation settings for the `/completion` endpoint, which has the same fields as the `generation_settings` response object from the `/completion` endpoint.
- `total_slots` - the total number of slots for processing requests (defined by the `--parallel` option)
- `chat_template` - the model's original Jinja2 prompt template

In [147]:
response = requests.get("http://localhost:8080/props")

In [148]:
print(response)

<Response [200]>


In [149]:
_json_data = json.loads(response.content)
JSON(_json_data)

<IPython.core.display.JSON object>
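
Once the response is parsed, the documented top-level fields can be read directly. The sketch below is one way to summarize them. The helper name `summarize_props` is hypothetical, and `sample_props` is a hand-written stand-in for `_json_data` whose nested setting names are illustrative only.

```python
def summarize_props(props):
    """Pull out the top-level fields documented above, tolerating missing keys."""
    settings = props.get("default_generation_settings", {})
    template = props.get("chat_template", "")
    return {
        "total_slots": props.get("total_slots"),
        "has_chat_template": bool(template),
        "generation_setting_names": sorted(settings),
    }


# Hand-written stand-in for `_json_data`; the nested keys are illustrative only.
sample_props = {
    "default_generation_settings": {"temperature": 0.8, "n_predict": -1},
    "total_slots": 1,
    "chat_template": "",
}
summary = summarize_props(sample_props)
print(summary)
```

In the notebook, `summarize_props(_json_data)` would give the same kind of summary for the live server.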

### 1.9 POST `/props`

Change server global properties. To use this endpoint with the POST method, you need to start the server with `--props`.

*Options:*

- None yet

## 1.10 POST `/v1/chat/completions`: OpenAI-compatible Chat Completions API

Given a ChatML-formatted JSON description in `messages`, it returns the predicted completion. Both synchronous and streaming modes are supported, so scripted and interactive applications work fine. While no strong claims of compatibility with the OpenAI API spec are being made, in our experience it suffices to support many apps. Only models with a [supported chat template](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) can be used optimally with this endpoint. By default, the ChatML template will be used.

    *Options:*

    See [OpenAI Chat Completions API documentation](https://platform.openai.com/docs/api-reference/chat). While some OpenAI-specific features such as function calling aren't supported, llama.cpp `/completion`-specific features such as `mirostat` are supported.

    The `response_format` parameter supports both plain JSON output (e.g. `{"type": "json_object"}`) and schema-constrained JSON (e.g. `{"type": "json_object", "schema": {"type": "string", "minLength": 10, "maxLength": 100}}` or `{"type": "json_schema", "schema": {"properties": { "name": { "title": "Name", "type": "string" }, "date": { "title": "Date", "type": "string" }, "participants": { "items": {"type": "string" }, "title": "Participants", "type": "array" } } } }`), similar to other OpenAI-inspired API providers.
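
As a rough illustration of the `response_format` parameter, the sketch below builds a chat request body with either plain or schema-constrained JSON output. The helper name `json_response_format`, the user message, and the top-level `"type": "object"` in the schema are assumptions added for the example, not something the server requires.

```python
import json


def json_response_format(schema=None):
    """Return a `response_format` value: plain JSON, or schema-constrained JSON."""
    response_format = {"type": "json_object"}
    if schema is not None:
        response_format["schema"] = schema
    return response_format


# ASSUMPTION: an illustrative schema with a string array for `participants`.
meeting_schema = {
    "type": "object",
    "properties": {
        "name": {"title": "Name", "type": "string"},
        "date": {"title": "Date", "type": "string"},
        "participants": {
            "title": "Participants",
            "type": "array",
            "items": {"type": "string"},
        },
    },
}

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Summarize the meeting as JSON."}],
    "response_format": json_response_format(meeting_schema),
}
body = json.dumps(payload)  # serializes cleanly, so it can be sent as the request body
print(body)
```

The resulting `payload` could be posted with `requests.post("http://localhost:8080/v1/chat/completions", json=payload)`.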

    *Examples:*

    You can use either the Python `openai` library with the appropriate endpoint:

    ```python
    import openai

    client = openai.OpenAI(
        base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
        api_key = "sk-no-key-required"
    )

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
            {"role": "user", "content": "Write a limerick about python exceptions"}
        ]
    )

    print(completion.choices[0].message)
    ```

    ... or raw HTTP requests:

    ```shell
    curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
        "model": "gpt-3.5-turbo",
        "messages": [
            {
                "role": "system",
                "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
            },
            {
                "role": "user",
                "content": "Write a limerick about python exceptions"
            }
        ]
    }'
    ```
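
For streaming mode, the response arrives as a sequence of events rather than one JSON document. The sketch below collects the generated text from such a stream. It assumes the OpenAI-style layout: each event is a line starting with `data:`, each chunk carries its text in `choices[0].delta.content`, and the stream ends with `data: [DONE]`. The function name `collect_stream_text` is hypothetical, and the sample lines are hand-written rather than captured from a server.

```python
import json


def collect_stream_text(lines):
    """Concatenate the text deltas from OpenAI-style streaming lines."""
    pieces = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        pieces.append(delta.get("content") or "")
    return "".join(pieces)


# Hand-written sample events for illustration only.
sample_lines = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    b'data: {"choices": [{"delta": {"content": " world"}}]}',
    b"data: [DONE]",
]
text = collect_stream_text(sample_lines)
print(text)
```

With `requests`, the lines could come from `requests.post(url, json={..., "stream": True}, stream=True).iter_lines()`, which yields byte strings like the ones in `sample_lines`.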

## 1.11 POST `/v1/embeddings`: OpenAI-compatible embeddings API

    *Options:*

    See [OpenAI Embeddings API documentation](https://platform.openai.com/docs/api-reference/embeddings).

    *Examples:*

  - `input` as string

    ```shell
    curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
            "input": "hello",
            "model":"GPT-4",
            "encoding_format": "float"
    }'
    ```

  - `input` as string array

    ```shell
    curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer no-key" \
    -d '{
            "input": ["hello", "world"],
            "model":"GPT-4",
            "encoding_format": "float"
    }'
    ```
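
A common next step is comparing the returned vectors. The sketch below extracts the embeddings and computes a cosine similarity. It assumes the OpenAI-style response layout, a `data` list whose items have `index` and `embedding` keys; the helper names are hypothetical and the three-dimensional vectors are toy values, not real model output.

```python
import math


def extract_embeddings(response_json):
    """Return embedding vectors ordered by their `index` field (OpenAI-style layout)."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy response for illustration only; real embeddings have many more dimensions.
sample_response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0, 0.0]},
        {"index": 0, "embedding": [1.0, 0.0, 0.0]},
    ]
}
hello_vec, world_vec = extract_embeddings(sample_response)
similarity = cosine_similarity(hello_vec, world_vec)
print(similarity)
```

For a live server, `sample_response` would be replaced by the parsed JSON from a `/v1/embeddings` request with `"input": ["hello", "world"]`.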

