# Getting started with LLaMA C++

## 1. Quick Start

Open a terminal and run the following command to check that the `llama-cli` binary is available.

```bash
which llama-cli
```

### Input prompt (One-and-done)

#### Manually downloading a model from a URL

```bash
MODEL_URL=https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf
curl --location --output ../models/gemma-1.1-7b-it.Q4_K_M.gguf "$MODEL_URL"
```

```
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1134  100  1134    0     0   3127      0 --:--:-- --:--:-- --:--:--  3132
100 5082M  100 5082M    0     0  10.1M      0  0:08:21  0:08:21 --:--:-- 11.1M
```


After downloading the model file, pass its path to `llama-cli` as a command-line argument.

-   `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file.

```bash
MODEL=./models/gemma-1.1-7b-it.Q4_K_M.gguf
llama-cli --model "$MODEL" --prompt "Once upon a time"
```

#### Downloading a model directly from a URL

The command below makes use of the following options.

-   `-mu MODEL_URL, --model-url MODEL_URL`: Specify a remote HTTP URL from which to download the model file.

```bash
MODEL_URL=https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf
llama-cli --model-url "$MODEL_URL" --prompt "Once upon a time"
```

### Conversation mode (Allow for continuous interaction with the model)

The command below makes use of the following options.

- `-cnv, --conversation`: Run in conversation mode:
  - does not print special tokens or the suffix/prefix
  - also enables interactive mode
- `--chat-template JINJA_TEMPLATE`: Set a custom Jinja chat template (default: the template taken from the model's metadata). The template is disabled if a suffix or prefix is specified. See the [LLaMA C++ documentation](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template) for the current list of accepted chat templates.

```bash
CHAT_TEMPLATE=gemma
llama-cli --model "$MODEL" --conversation --chat-template "$CHAT_TEMPLATE"
```

## 2. Input Prompts

The `llama-cli` program provides several ways to interact with the LLaMA models using input prompts:

-   `--prompt PROMPT`: Provide a prompt directly as a command-line option.
-   `--file FNAME`: Provide a file containing a prompt or multiple prompts.
-   `--interactive-first`: Run the program in interactive mode and wait for input right away. (More on this below.)

### `--prompt` example

```bash
llama-cli --model "$MODEL" --prompt "What is the meaning of life?"
```

### `--file` example

```python
language = "English"
tone_of_voice = "Informative"
topic = "Computer Science"
writing_style = "Conversational"

prompt_template = f"""Please ignore all previous instructions. Please respond \
only in the {language} language. You are a Twitter influencer with a large \
following. You have a {tone_of_voice} tone of voice. You have a \
{writing_style} writing style. Do not self reference. Do not explain what you \
are doing. Please create a thread about {topic}. Add emojis to the thread \
when appropriate. The character count for each thread should be between 270 \
to 280 characters. Your content should be casual, informative, and an \
engaging Twitter thread. Please use simple and understandable words. Please \
include statistics, personal experience, and fun facts in the thread. Please \
add relevant hashtags to the post and encourage the readers to join the \
conversation.
"""
```

```python
print(prompt_template)
```

```
Please ignore all previous instructions. Please respond only in the English language. You are a Twitter influencer with a large following. You have a Informative tone of voice. You have a Conversational writing style. Do not self reference. Do not explain what you are doing. Please create a thread about Computer Science. Add emojis to the thread when appropriate. The character count for each thread should be between 270 to 280 characters. Your content should be casual, informative, and an engaging Twitter thread. Please use simple and understandable words. Please include statistics, personal experience, and fun facts in the thread. Please add relevant hashtags to the post and encourage the readers to join the conversation.
```

```python
with open("../prompts/engaging-twitter-thread.txt", 'w') as f:
    f.write(prompt_template)
```


```bash
llama-cli --model "$MODEL" --file ./prompts/engaging-twitter-thread.txt
```

### `--interactive-first` example

```bash
llama-cli --model "$MODEL" --interactive-first
```

## 3. Interaction

The `llama-cli` program offers a seamless way to interact with LLaMA models, allowing users to engage in real-time conversations or provide instructions for specific tasks. The interactive mode can be triggered using various options, including `--interactive` and `--interactive-first`.

In interactive mode, users can participate in text generation by injecting their input during the process. Users can press `Ctrl+C` at any time to interject and type their input, followed by pressing `Return` to submit it to the LLaMA model. To submit additional lines without finalizing input, users can end the current line with a backslash (`\`) and continue typing.
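The backslash continuation rule can be illustrated with a small Python sketch. This is not `llama-cli`'s input code: `collect_input` is a hypothetical helper, and joining continued lines with a newline is an assumption made for the illustration.

```python
def collect_input(lines):
    """Group typed lines into submissions; a trailing backslash continues the input."""
    submissions = []
    pending = []
    for line in lines:
        if line.endswith("\\"):
            # Drop the backslash and keep reading: the input is not finalized yet.
            pending.append(line[:-1])
        else:
            pending.append(line)
            # Assumption: continued lines are joined with a newline.
            submissions.append("\n".join(pending))
            pending = []
    # Lines still pending here were never finalized, so they are not submitted.
    return submissions


typed = ["Write a haiku about\\", "the C++ language.", "Thanks!"]
print(collect_input(typed))
```

Here the first two typed lines form a single submission, and the third is submitted on its own.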

### Interaction Options

-   `-i, --interactive`: Run the program in interactive mode, allowing users to engage in real-time conversations or provide specific instructions to the model.
-   `--interactive-first`: Run the program in interactive mode and immediately wait for user input before starting the text generation.
-   `-cnv, --conversation`: Run the program in conversation mode (does not print special tokens and suffix/prefix, uses the default chat template) (default: false).
-   `--color`: Enable colorized output to visually distinguish between prompts, user input, and generated text.

By understanding and utilizing these interaction options, you can create engaging and dynamic experiences with your models, tailoring the text generation process to your specific needs.

### Example

Type the following command in the terminal.

```bash
llama-cli --model "$MODEL" --conversation --color
```

### Reverse Prompts

Reverse prompts are a powerful way to create a chat-like experience with your model by pausing the text generation when specific text strings are encountered using the `--reverse-prompt` option.

-   `-r PROMPT, --reverse-prompt PROMPT`: Specify one or multiple reverse prompts to pause text generation and switch to interactive mode. For example, `-r "User:"` can be used to jump back into the conversation whenever it's the user's turn to speak. This helps create a more interactive and conversational experience.
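The idea behind a reverse prompt can be sketched in a few lines of Python. This only illustrates the stopping rule and is not how `llama-cli` is implemented; `generate_until_reverse_prompt` and the sample stream are hypothetical.

```python
def generate_until_reverse_prompt(chunks, reverse_prompts):
    """Accumulate generated text and stop once it ends with a reverse prompt."""
    text = ""
    for chunk in chunks:
        text += chunk
        if any(text.endswith(rp) for rp in reverse_prompts):
            # Generation pauses here and control returns to the user.
            return text, True
    return text, False


# Hypothetical stream of generated text pieces.
stream = ["Assistant:", " Hello", "!\n", "User:", " this is never generated"]
text, paused = generate_until_reverse_prompt(stream, ["User:"])
print(repr(text), paused)
```

On the command line, the same behaviour would come from combining flags already described in this document, for example `llama-cli --model "$MODEL" --interactive --reverse-prompt "User:"`.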



### In-Prefix

The `--in-prefix` flag adds a prefix to your input. It is primarily used to insert a space after the reverse prompt. Here's an example of how to use the `--in-prefix` flag in conjunction with the `--reverse-prompt` flag:

```sh
llama-cli -r "User:" --in-prefix " "
```


### In-Suffix

The `--in-suffix` flag is used to add a suffix after your input. This is useful for adding an "Assistant:" prompt after the user's input. It's added after the new-line character (`\n`) that's automatically added to the end of the user's input. Here's an example of how to use the `--in-suffix` flag in conjunction with the `--reverse-prompt` flag:

```sh
llama-cli -r "User:" --in-prefix " " --in-suffix "Assistant:"
```

When the `--in-prefix` or `--in-suffix` option is enabled, the chat template (`--chat-template`) is disabled.
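How the prefix, the user's text, the automatic newline, and the suffix fit together can be shown with a short Python sketch. `wrap_user_input` is a hypothetical helper written only to mirror the ordering described above; it is not part of llama.cpp.

```python
def wrap_user_input(user_input, in_prefix="", in_suffix=""):
    """Return the text appended to the context for one user turn."""
    # Order: prefix, typed text, the automatically added newline, then the suffix.
    return f"{in_prefix}{user_input}\n{in_suffix}"


context = "User:"  # generation has just paused on the reverse prompt
context += wrap_user_input("What is a GGUF file?", in_prefix=" ", in_suffix="Assistant:")
print(context)
```

The printed context ends with `Assistant:`, so the model continues from the assistant's turn.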


### Chat templates

`--chat-template JINJA_TEMPLATE`: This option sets a custom Jinja chat template. It accepts a string, not a file name. Default: the template taken from the model's metadata. llama.cpp supports only [some pre-defined templates](https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template), including the following:

- `llama2`
- `llama3`
- `gemma`
- `monarch`
- `chatml`
- `orion`
- `vicuna`
- `vicuna-orca`
- `deepseek`
- `command-r`
- `zephyr`

When `--in-prefix` or `--in-suffix` options are enabled the chat template ( `--chat-template` ) is disabled.



## 4. Context Management

During text generation, models have a limited context size, which means they can only consider a certain number of tokens from the input and generated text. When the context fills up, the model resets internally, potentially losing some information from the beginning of the conversation or instructions. Context management options help maintain continuity and coherence in these situations.

### Context Size

- `-c N, --ctx-size N`: Set the size of the prompt context (default: 0, 0 = loaded from model). The LLaMA models were built with a context of 2048-8192, which will yield the best results on longer input/inference.


### Extended Context Size

Some fine-tuned models extend the context length by scaling RoPE. For example, if the original pre-trained model has a context length (max sequence length) of 4096 (4k) and the fine-tuned model has 32k, the scaling factor is 8. Such a model should work with the above `--ctx-size` set to 32768 (32k) and `--rope-scale` set to 8.

-   `--rope-scale N`: Where N is the linear scaling factor used by the fine-tuned model.
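The arithmetic from the example above can be checked with a short Python sketch that also assembles a matching command line. The model path is a hypothetical placeholder and `rope_scale` is a helper written for this illustration; only the `--ctx-size` and `--rope-scale` flags come from this document.

```python
import shlex


def rope_scale(original_ctx, extended_ctx):
    """Linear scaling factor: extended context length divided by the original one."""
    return extended_ctx / original_ctx


original_ctx = 4096        # context length of the pre-trained model
extended_ctx = 32 * 1024   # context length of the fine-tuned model (32k)
scale = rope_scale(original_ctx, extended_ctx)

command = [
    "llama-cli",
    "--model", "./models/finetuned-32k.gguf",  # hypothetical path
    "--ctx-size", str(extended_ctx),
    "--rope-scale", f"{scale:g}",              # ":g" drops the trailing ".0"
]
print(shlex.join(command))
```

The command is only printed here, not run, so the sketch works without a model file.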


### Keep Prompt

The `--keep` option allows users to retain the original prompt when the model runs out of context, ensuring a connection to the initial instruction or conversation topic is maintained.

-   `--keep N`: Specify the number of tokens from the initial prompt to retain when the model resets its internal context. By default, this value is set to 0 (meaning no tokens are kept). Use `-1` to retain all tokens from the initial prompt.

By utilizing context management options like `--ctx-size` and `--keep`, you can maintain a more coherent and consistent interaction with the LLaMA models, ensuring that the generated text remains relevant to the original prompt or conversation.
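The effect of `--keep` can be illustrated with a toy Python model of a context reset. This is a sketch under stated assumptions, not llama.cpp's code: `shift_context` is hypothetical, and discarding the older half of the non-kept tokens is an assumption chosen for the illustration.

```python
def shift_context(tokens, n_prompt, n_keep=0):
    """Drop older tokens from a full context, keeping the first n_keep prompt tokens."""
    if n_keep < 0:
        # -1 means: retain every token of the initial prompt.
        n_keep = n_prompt
    n_keep = min(n_keep, n_prompt)
    kept = tokens[:n_keep]
    rest = tokens[n_keep:]
    # Assumption: the older half of the remaining tokens is discarded.
    n_discard = len(rest) // 2
    return kept + rest[n_discard:]


context = list(range(16))  # token ids 0..15; the first 6 belong to the prompt
print(shift_context(context, n_prompt=6, n_keep=0))
print(shift_context(context, n_prompt=6, n_keep=4))
print(shift_context(context, n_prompt=6, n_keep=-1))
```

With `n_keep=0` the start of the prompt is lost, while larger values keep the leading prompt tokens in place and drop only later material.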


## 5. Additional Options

These options provide extra functionality and customization when running the LLaMA models:

-   `-h, --help`: Display a help message showing all available options and their default values. This is particularly useful for checking the latest options and default values, as they can change frequently, and the information in this document may become outdated.
-   `--verbose-prompt`: Print the prompt before generating text.
-   `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used.
-   `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance.
-   `--lora FNAME`: Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies `--no-mmap`). This allows you to adapt the pretrained model to specific tasks or domains.
-   `--lora-base FNAME`: Optional model to use as a base for the layers modified by the LoRA adapter. This flag is used in conjunction with the `--lora` flag, and specifies the base model for the adaptation.
-   `-hfr URL, --hf-repo URL`: The URL of the Hugging Face model repository. Used in conjunction with `--hf-file` or `-hff`. The model is downloaded and stored in the file provided by `-m` or `--model`. If `-m` is not provided, the model is stored automatically in the path specified by the `LLAMA_CACHE` environment variable or in an OS-specific local cache.
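
The proportions implied by a `--tensor-split` value can be worked out with a few lines of Python. This sketch only reproduces the arithmetic of the "3,2" example above; `tensor_split_fractions` is a hypothetical helper, not a llama.cpp function.

```python
def tensor_split_fractions(split):
    """Convert a --tensor-split string such as "3,2" into per-GPU fractions."""
    weights = [float(part) for part in split.split(",")]
    if any(w < 0 for w in weights):
        raise ValueError("split values must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one split value must be positive")
    return [w / total for w in weights]


for gpu, fraction in enumerate(tensor_split_fractions("3,2")):
    print(f"GPU {gpu}: {fraction:.0%}")
```

For "3,2" this prints 60% for GPU 0 and 40% for GPU 1, matching the description of the option.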