# Running Open Source LLM - CPU/GPU-hybrid option via llama.cpp

In this tutorial, we will learn how to run open source LLMs on a reasonably wide range of hardware, including machines with only a low-end GPU or no GPU at all.

Traditionally, AI models are trained and run using deep learning libraries/frameworks such as `tensorflow` (Google), `pytorch` (Meta), `huggingface`, etc. Although they can be used directly in production, they are also designed to let AI/ML researchers customize them heavily in order to push the SOTA (state of the art) forward. As such, they carry a lot of "baggage".

This is one of the key insights exploited by the author of `ggml`, a low-level C reimplementation of just the parts that are actually needed to run inference on transformer-based neural networks. `llama.cpp` then builds on top of this to make it possible to run LLMs on CPU only. (The actual history of the project is quite a bit messier; what you read here is a sanitized version.) Later on, the project also added the ability to partially or fully offload a model to the GPU, so that one can still enjoy partial acceleration.

`llama.cpp` is by itself just a C/C++ program: you compile it, then run it from the command line. This is one way to run an LLM, but it is also possible to call the LLM from inside Python using a form of FFI (Foreign Function Interface). In this case the "official" recommended binding is `llama-cpp-python`, and that's what we'll use today.

## (Optional) Running llama.cpp from command line

Reference:

- [Official llama.cpp repo](https://github.com/ggerganov/llama.cpp)

You may skip this subsection, but if you want the most direct experience, you can run the following commands:

- Install `llama.cpp` *(note that we go for the absolute minimum installation without any performance enhancement)*:

In [1]:
!git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make

Cloning into 'llama.cpp'...
remote: Enumerating objects: 4744, done.
remote: Counting objects: 100% (2280/2280), done.
remote: Compressing objects: 100% (434/434), done.
remote: Total 4744 (delta 2125), reused 1861 (delta 1846), pack-reused 2464
Receiving objects: 100% (4744/4744), 3.96 MiB | 14.84 MiB/s, done.
Resolving deltas: 100% (3231/3231), done.
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:  
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 

- Download the model using `aria2c` for robustness *(a detailed explanation of how to choose the model, quantization level, and prompt format is skipped here as it is covered in the next section; also note that we need the `-o` flag because HuggingFace uses Git LFS for large files, so the link redirects and the output filename needs to be set explicitly)*:

In [2]:
!apt-get update && apt-get install -y aria2

Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/ InRelease [3,622 B]
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Get:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Hit:6 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu focal InRelease
Get:7 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Hit:8 http://ppa.launchpad.net/cran/libgit2/ubuntu focal InRelease
Hit:9 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease
Get:10 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [1,070 kB]
Hit:11 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu focal InRelease
Get:12 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [2,866 kB]
Get:13 http://archive.ubuntu.com/ubuntu focal-updates/ma

In [14]:
!aria2c https://huggingface.co/TheBloke/vicuna-13b-v1.3.0-GGML/resolve/main/vicuna-13b-v1.3.0.ggmlv3.q3_K_M.bin -o vicuna-13b-v1.3.0.ggmlv3.q3_K_M.bin


07/11 20:59:33 [NOTICE] Downloading 1 item(s)

07/11 20:59:33 [NOTICE] CUID#7 - Redirecting to https://cdn-lfs.huggingface.co/repos/6d/58/6d58b4d5ddcda9696a42e991ffa02e907a318665d026fae7962da31e446bba86/4767c77db1b80896b0b6441e784708ddd0d2dc4885e52867603c1da81fee1f36?response-content-disposition=... (signed query string truncated)

- Run the compiled program in interactive/chat mode (output speed depends on the CPU, but expect 1-5 tokens/sec in general, so please be patient while waiting for output) (If it returns control prematurely, you may use the prompt "Please continue.") (In Colab, you can enter text by first clicking the output area with the mouse):

In [7]:
!./llama.cpp/main -m ./vicuna-13b-v1.3.0.ggmlv3.q3_K_M.bin -n 750 --repeat_penalty 1.1 \
--color -i -r "USER:" \
-p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\nUSER: What are the steps to create a new website?\nASSISTANT:"

main: build = 819 (5bf2a27)
main: seed  = 1689105364
llama.cpp: loading model from ./vicuna-13b-v1.3.0.ggmlv3.q3_K_M.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 12 (mostly Q3_K - Medium)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 8068.43 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  =  400.00 MB

system_info: n_threads = 2 / 2 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WAS

Explanation of the command line arguments (see also [this](https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md)):

- `-m`: Path to model
- `-n`: Maximum number of new tokens to generate
- `--repeat_penalty`: Repetition penalty
- `--color`: Enable color
- `-i`: Run in interactive mode
- `-r`: Reverse prompt - the program detects the presence of this string as a cue to pause generation and pass control back to the user
- `-p`: Initial prompt

In this chat mode, the LLM generates a continuation of the initial prompt until it encounters the reverse prompt; then it is the user's turn. After the user enters the next query, that input is appended to the total text generated so far and fed to the LLM for text continuation again. This process loops to create a UX that resembles a chatbot.
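The loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how `llama.cpp` implements it: `complete` is a hypothetical stand-in for "ask the model to continue this text", and `fake_model` is a dummy that always returns the same string so the example runs without a model.

```python
from typing import Callable, List


def truncate_at_reverse_prompt(text: str, reverse_prompt: str) -> str:
    """Cut generated text right after the first occurrence of the reverse prompt."""
    idx = text.find(reverse_prompt)
    return text if idx == -1 else text[: idx + len(reverse_prompt)]


def chat_loop(complete: Callable[[str], str], initial_prompt: str,
              user_turns: List[str], reverse_prompt: str = "USER:") -> str:
    """Alternate between model continuation and user input, growing one transcript."""
    transcript = initial_prompt
    for user_text in user_turns:
        # The model continues the whole transcript; we stop at the reverse prompt.
        transcript += truncate_at_reverse_prompt(complete(transcript), reverse_prompt)
        # The user's input is appended, then the assistant is cued to answer.
        transcript += " " + user_text + "\nASSISTANT:"
    return transcript


def fake_model(transcript: str) -> str:
    # Hypothetical stand-in: a real model would condition on `transcript`.
    return " Sure, here is an answer.\nUSER: (the model would keep talking to itself here)"


print(chat_loop(fake_model, "USER: Hello\nASSISTANT:", ["Thanks!"]))
```

Without the truncation step, the stand-in model's text after `USER:` would end up in the transcript, which is exactly the "model plays both sides" problem the reverse prompt avoids.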

## Installation and model download

In this section we will assume you have an Nvidia GPU.

First we will install the `llama-cpp-python` library:

In [4]:
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.70.tar.gz (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 22.4 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.1-py3-none-any.whl (45 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.6/45.6 kB 6.0 MB/s eta 0:00:00
Building whe

This command might be confusing, so let's explain it. To keep things clear, always remember that `llama.cpp` is the actual library that runs LLM inference, and it is a C/C++ program; `llama-cpp-python`, on the other hand, is a Python FFI binding to this underlying library.

Note the use of the ephemeral shell environment variable `FORCE_CMAKE=1` to change `pip`'s behavior: during installation the library builds the underlying `llama.cpp`, which is vendored. `llama.cpp` itself can be built with either `make` or `CMake`, but the compiler flags that enable GPU support are passed through only when `CMake` is used, which is why we have to do this.

Then we use another ephemeral shell environment variable, `CMAKE_ARGS`, to change the arguments passed to `CMake`. This part simply follows the instructions in `llama.cpp`'s installation guide.
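If the "ephemeral environment variable" idea is unfamiliar: a `VAR=value command` prefix sets the variable for that one command only, leaving the surrounding shell untouched. The sketch below reproduces the same effect from Python with `subprocess`. The variable name `DEMO_BUILD_FLAG` is made up for this demonstration and has nothing to do with the real build.

```python
import os
import subprocess
import sys

# Copy the current environment and add one variable for the child process only.
# DEMO_BUILD_FLAG is a hypothetical name used purely for illustration.
child_env = dict(os.environ, DEMO_BUILD_FLAG="on")

result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('DEMO_BUILD_FLAG'))"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)

child_saw = result.stdout.strip()            # what the child process observed
parent_has = "DEMO_BUILD_FLAG" in os.environ  # the parent is left unchanged

print("child saw:", child_saw)
print("parent has variable:", parent_has)
```

In the install command, `pip` and the build it launches play the role of the child process here.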

Now we want to download a model. `llama.cpp` uses its own model format, GGML, named after the base library. GGML is a minimalistic format for storing model weights, and it also supports its own quantization schemes.

### A short guide on choosing models

*(Note: this is just an introduction, I have another post that goes into more details)*

The open source community has been thriving around fine-tuned LLMs produced by enthusiasts. The ecosystem is largest for fine-tunes based on the LLaMA foundation models, with datasets produced by model extraction from OpenAI's GPT-3.5/4 plus self-instruct methods like those in the Alpaca paper, but other families have appeared more recently, such as Falcon, MPT, X-Gen, and StarCoder. Choosing a specific model involves many factors, and the models have many relevant technical specifications. Generally speaking, though, you can choose a model suitable for your use case with these steps:

1. **Choose a family** - The r/LocalLLaMA subreddit has a wiki with recommendations, and there is something of a consensus on the best general-purpose choices for the most common situations: Vicuna/Manticore/Guanaco for generic use, WizardLM for complex instruction following, and SuperCoT (Super Chain of Thought) for LLM-based application development (i.e. agents, langchain, etc.). There are also suggested models for story writing and roleplay. You may also look at public leaderboards/evaluations/arenas that rank the models' relative strength, such as the one compiled by HuggingFace or the one at LMSYS.
2. **Decide on model size and quantization level** - Generally, the larger the model, the better, as it will be more intelligent, so you would usually pick the largest one that fits in memory. The memory requirement can be calculated from the model size and the quantization level, and sometimes you may want to trade a more aggressive quantization (fewer bits per weight) for a larger model within the same memory budget (experiments have shown that this trade-off results in better output quality overall, so it is worth it). In general, for very small/edge devices, pick 7B (but don't expect it to handle complex tasks). Otherwise, for 16GB, pick 33B at q3 for more demanding use cases, or 13B at q4.
3. **Locate the model on HuggingFace** - Go to the HuggingFace website and use the search function, entering just the model name, then click "Show all results". You will likely see a dazzling list of models with the same name but different technical parameters. Based on your choice in the previous step, pick the one with the correct parameters, remembering that it should have the keyword "GGML" in its name and NOT "GPTQ" (which is for the *other* option to run LLMs). Names with neither keyword are the original fp16 models. The user `TheBloke` has a semi-automated approach to uploading quantized models and uses a reliable model format.
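The memory calculation mentioned in step 2 can be approximated with simple arithmetic: number of parameters times bits per weight, divided by 8 to get bytes. The sketch below is a rough estimate only. The bits-per-weight figures are assumptions (the actual quantization formats store some extra metadata, so real files are somewhat larger than the nominal bit width suggests), and the result covers only the weights, not the context/KV cache or other runtime overhead.

```python
def estimate_weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8


# Nominal bit widths; treat these as rough assumptions, not exact figures.
candidates = {
    "13B at ~4 bits": (13, 4.0),
    "33B at ~3 bits": (33, 3.0),
}

for name, (params_b, bits) in candidates.items():
    print(f"{name}: ~{estimate_weights_gb(params_b, bits):.1f} GB for weights alone")
```

Comparing the two estimates against your RAM/VRAM budget gives a first idea of which size/quantization combinations are worth trying.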

### Downloading model with help of huggingface-hub

Although we may download models using the website's own UI and the browser's built-in mechanism, or via command line tools, we may also use the official HuggingFace client library for convenience.

The library lets us access the HuggingFace API programmatically. Although in our case we will only use it to download models, keep in mind that it can also be used for things like searching for models, uploading your own models/datasets, and so on.

There are two approaches to downloading models:

- The `hf_hub_download` function lets us download a specific file, and can pin it to a specific version too.
- The `snapshot_download` function lets us download an entire repository, or perform a more targeted/flexible download using include/exclude patterns.
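To give a feel for the include/exclude patterns, here is a small offline sketch of glob-style filtering over a list of file names. The file list and the `select_files` helper are hypothetical and exist only for illustration; with the real client you would pass patterns through the `allow_patterns` / `ignore_patterns` arguments of `snapshot_download` instead.

```python
from fnmatch import fnmatch

# Hypothetical repository listing, for illustration only.
repo_files = [
    "README.md",
    "model.ggmlv3.q3_K_M.bin",
    "model.ggmlv3.q4_0.bin",
    "model.ggmlv3.q5_1.bin",
]


def select_files(files, allow_patterns=None, ignore_patterns=None):
    """Keep files matching any allow pattern and none of the ignore patterns."""
    selected = []
    for name in files:
        if allow_patterns and not any(fnmatch(name, p) for p in allow_patterns):
            continue
        if ignore_patterns and any(fnmatch(name, p) for p in ignore_patterns):
            continue
        selected.append(name)
    return selected


wanted = select_files(repo_files, allow_patterns=["*.q3_K_M.bin", "*.md"])
print(wanted)

# With the real library, the equivalent request would look roughly like:
# snapshot_download(repo_id="<user>/<repo>", allow_patterns=["*.q3_K_M.bin", "*.md"])
```

This matters for GGML repositories in particular, since they often hold many quantization variants of several gigabytes each and you usually want just one.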

So let's first install the client:

In [1]:
!pip install huggingface_hub

Collecting huggingface_hub
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 268.8/268.8 kB 6.6 MB/s eta 0:00:00
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.16.4


And then download some models:

In [2]:
from huggingface_hub import hf_hub_download

#path = hf_hub_download(repo_id="TheBloke/WizardLM-Uncensored-SuperCOT-StoryTelling-30B-GGML", filename="WizardLM-Uncensored-SuperCOT-Storytelling.ggmlv3.q3_K_S.bin")
path = hf_hub_download(repo_id="mindrage/Manticore-13B-Chat-Pyg-Guanaco-GGML", filename="Manticore-13B-Chat-Pyg-Guanaco-GGML-q3_K_M.bin")

Downloading (…)naco-GGML-q3_K_M.bin:   0%|          | 0.00/6.25G [00:00<?, ?B/s]

In [3]:
path

'/root/.cache/huggingface/hub/models--mindrage--Manticore-13B-Chat-Pyg-Guanaco-GGML/snapshots/723b3a9e341c49ead85e28b8606366c19b9a8ff5/Manticore-13B-Chat-Pyg-Guanaco-GGML-q3_K_M.bin'

Here `hf_hub_download` lets us download specific files from a repo. You can even pin a particular version/tag if necessary. Once the download is done, it returns the full path to where the file is stored.
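The returned path is informative in itself: the directory after `snapshots/` is the commit hash of the repository revision that was downloaded. The sketch below pulls it out of the path shown above using only the standard library. Passing it back via the `revision` argument of `hf_hub_download` is one way to pin the version, shown here only as a comment since it needs network access.

```python
from pathlib import PurePosixPath

# The path returned by hf_hub_download earlier in this tutorial.
cached_path = PurePosixPath(
    "/root/.cache/huggingface/hub/"
    "models--mindrage--Manticore-13B-Chat-Pyg-Guanaco-GGML/"
    "snapshots/723b3a9e341c49ead85e28b8606366c19b9a8ff5/"
    "Manticore-13B-Chat-Pyg-Guanaco-GGML-q3_K_M.bin"
)

parts = cached_path.parts
commit_hash = parts[parts.index("snapshots") + 1]  # directory right after "snapshots"
filename = cached_path.name

print(commit_hash)
print(filename)

# Pinning the same revision on a later download would look roughly like:
# hf_hub_download(repo_id="mindrage/Manticore-13B-Chat-Pyg-Guanaco-GGML",
#                 filename=filename, revision=commit_hash)
```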

## Running LLM from python

Now let's get down to business. First, import the library (this also detects GPUs):

In [5]:
from llama_cpp import Llama

Next, we load the model (you may need to change the value of `n_gpu_layers`, see below):

In [6]:
llm = Llama(model_path=path, n_ctx=2048, n_gpu_layers=45)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 


The `n_gpu_layers` argument is useful when you have enabled GPU support in your build. It is the number of layers of the neural network to offload to the GPU. 0 means CPU-only inference; the larger the value, the more offloading happens, which should increase speed but also consumes more VRAM.

Be careful: if you set it too high, you may run out of memory (OOM) on your GPU, and since the underlying library is a C program you may see strange behavior such as hangs or a crashed Jupyter kernel. It is better to be conservative at first, then increase the value experimentally to estimate the VRAM cost per layer and calculate how many layers you can afford to offload.

In the author's test:

- (33B/q3 model) Offload 23 layers: 7.1GB RAM (main) + 6GB VRAM (GPU)
- (13B/q3 model) Offload all 43 layers: 5.5GB RAM (main) + 8GB VRAM (GPU)
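The "estimate the cost per layer, then calculate" advice can be turned into a small back-of-the-envelope calculation. The sketch below uses the 33B/q3 measurement from the list above and assumes, crudely, that VRAM use grows linearly with the number of offloaded layers. The function names and the 1 GB safety reserve are assumptions made for this example, and real usage also depends on context size and driver overhead.

```python
def vram_per_layer_gb(vram_used_gb: float, layers_offloaded: int) -> float:
    """Average VRAM cost of one offloaded layer, from a single measurement."""
    return vram_used_gb / layers_offloaded


def max_layers_for_budget(vram_budget_gb: float, per_layer_gb: float,
                          reserve_gb: float = 1.0) -> int:
    """Largest layer count that fits the budget after holding back a safety reserve."""
    usable = max(vram_budget_gb - reserve_gb, 0.0)
    return int(usable // per_layer_gb)


# Measurement from the author's test: 23 layers of the 33B/q3 model used about 6 GB.
per_layer = vram_per_layer_gb(6.0, 23)

# How many layers might fit on a hypothetical 8 GB card?
layers = max_layers_for_budget(8.0, per_layer)
print(f"~{per_layer:.2f} GB per layer, try n_gpu_layers={layers}")
```

Treat the result as a starting point for experiments rather than a guarantee, and step down if you hit out-of-memory errors.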

*Also note that recent Nvidia drivers have a form of aggressive memory offloading in which GPU memory is spilled back to main memory. This results in significant speed degradation because LLM inference is heavily bound by memory capacity and memory bandwidth. Although Nvidia has been working on it, the advice at the moment is to stay on an old enough driver version or to downgrade.*

And now we run a short smoke test:

In [7]:
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, echo=True)

For a full list of arguments available, please refer to the [manual](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.llama.Llama.__call__).

Note that the underlying `llama.cpp` prints statistics on generation speed to the console (this is suppressed on Google Colab). The time taken has the following breakdown:

- Sample: Time to run the sampling algorithm that selects a token based on the probability distribution returned by the LLM (which may involve modifying the probabilities as a form of post-processing). Usually insignificant.
- Prompt eval: Time taken to run the user prompt through the network to generate the internal values, which are reused in subsequent runs when "prompt + partially generated text" is fed through the network.
- Eval: Time taken to actually compute the next-token predictions by feeding the input (prompt + partially generated text) through the neural network.

A tokens-per-second statistic is then computed. Notice, however, that the eval time per token gradually increases as the text preceding it gets longer (since we are essentially appending to the prompt token by token). This is related to the **quadratic bottleneck of the attention mechanism**.
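A toy cost model makes the quadratic behavior concrete. Assume, as a simplification, that generating one token requires attending over every token already in the context, so the attention cost of a step is proportional to the current context length. Summing over all steps then grows roughly with the square of the number of generated tokens. The function below counts those abstract "attention steps"; it is not a timing model of `llama.cpp`.

```python
def attention_steps(prompt_len: int, new_tokens: int) -> int:
    """Total context positions attended to while generating `new_tokens` tokens.

    Step k (0-based) attends over the prompt plus the k tokens generated so far.
    """
    return sum(prompt_len + k for k in range(new_tokens))


# The smoke test above: 15 prompt tokens, 32 generated tokens.
print(attention_steps(15, 32))

# Doubling the number of generated tokens roughly quadruples the total work.
print(attention_steps(0, 32), attention_steps(0, 64))
```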

With the lecturing done, let's check if it worked or not:

In [8]:
output

{'id': 'cmpl-e7d950e2-c133-433c-9d9e-a7e2bd887283',
 'object': 'text_completion',
 'created': 1689188164,
 'model': '/root/.cache/huggingface/hub/models--mindrage--Manticore-13B-Chat-Pyg-Guanaco-GGML/snapshots/723b3a9e341c49ead85e28b8606366c19b9a8ff5/Manticore-13B-Chat-Pyg-Guanaco-GGML-q3_K_M.bin',
 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n6. Saturn\n7. Uran',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 15, 'completion_tokens': 32, 'total_tokens': 47}}

The `output` object returned by the library contains details of the text completion. One thing that may be very useful to application developers is that it contains meta-information on the reason for stopping: was it because the LLM emitted one of the stop tokens, say, or because it had generated the maximum number of new tokens we allowed? (This is found in `output["choices"][0]["finish_reason"]`.)
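As a minimal sketch of how an application might act on this field, the helper below inspects a response dictionary shaped like the one printed above. The helper name and the returned messages are made up for this example. It assumes the two values `"length"` and `"stop"` and reports anything else verbatim.

```python
def describe_finish(output: dict) -> str:
    """Summarize why generation stopped, based on the first choice."""
    reason = output["choices"][0]["finish_reason"]
    if reason == "length":
        return "truncated: reached the max_tokens limit"
    if reason == "stop":
        return "complete: hit a stop sequence or end-of-sequence token"
    return f"other: {reason}"


# Trimmed-down copy of the response shown above.
example_output = {
    "choices": [{"text": "...7. Uran", "index": 0,
                 "logprobs": None, "finish_reason": "length"}],
    "usage": {"prompt_tokens": 15, "completion_tokens": 32, "total_tokens": 47},
}

print(describe_finish(example_output))
```

In the smoke test the answer was cut off mid-word ("Uran"), and `finish_reason` being `'length'` is how a program can notice that and, for instance, request a continuation.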

Let's show the actual LLM response:

In [9]:
output["choices"][0]["text"]

'Q: Name the planets in the solar system? A: 1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n6. Saturn\n7. Uran'

Next, let's try a more serious query to the LLM, using instruction prompt format:

In [10]:
instruct_prompt = """### Instruction
Write an essay on the decline of globalization after 2010, giving reasons for why and situate it in historical context.
### Response
"""
output2 = llm(instruct_prompt, max_tokens=512, echo=True)

Llama.generate: prefix-match hit


You may have noticed the output `Llama.generate: prefix-match hit`. This refers to a cache (*not* the same as an external cache grafted on to save full prompt/response pairs, which is a nice trick to improve system performance at the application level, but is not an intrinsic part of the LLM). Instead, this cache is related to how the Transformer and text generation work: we may partially save and reuse the internal values of the neural network during evaluation. The net effect is that prompt evaluation can mostly be skipped from the second call (on the same prompt prefix) onward. (Knowledge of the Transformer architecture would also let one explain why a prefix match is used.)
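The reason a *prefix* is what matters: in a causal Transformer, the internal values at a given position depend only on the tokens at or before that position. So if a new prompt starts with the same tokens as the previous one, the values for that shared prefix are still valid, and only the remainder needs evaluating. The sketch below shows the bookkeeping with made-up token IDs; it illustrates the idea and is not the library's actual implementation.

```python
def common_prefix_len(cached_tokens, new_tokens) -> int:
    """Number of leading tokens shared by the cached and the new sequence."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


# Hypothetical token IDs, for illustration only.
cached = [1, 835, 2799, 4080, 13, 6113]
new = [1, 835, 2799, 4080, 13, 9038, 2501]

reusable = common_prefix_len(cached, new)
to_evaluate = new[reusable:]

print(f"reuse {reusable} cached positions, evaluate {len(to_evaluate)} new tokens")
```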

In [12]:
print(output2["choices"][0]["text"])

### Instruction
 Write an essay on the decline of globalization after 2010, giving reasons for why and situate it in historical context.
### Response
Globalization has been a hotly debated topic in recent years, with many arguing that it has either stalled or even reversed course since 2010. This decline can be attributed to a number of factors, both economic and political. In this essay, we will examine the reasons for the decline of globalization after 2010, situate it in historical context, and explore the implications for the future of international trade and cooperation.
Historical Context:
Globalization has a long history, dating back to the earliest trade routes between ancient civilizations. However, it accelerated in the late 20th century with the advent of technology, trade liberalization, and the rise of global supply chains. This period of rapid growth was known as the "golden age" of globalization, which lasted from the 1980s until the financial crisis of 2008.
Post-2010:


Before we move on, let's briefly explain the concepts of **prompt format** and **reverse prompt**. Recall that a foundation model is pretty raw and simply performs *text completion*, giving what it believes to be the most *natural* continuation of the text. The LLMs that we actually use are usually instruction or chat fine-tuned, so that they behave more intuitively.

In the open source LLM ecosystem, fine-tunes usually focus on one or the other:

- If it is *instruct tuned*, then you should prompt it as if giving a student a worksheet with tasks to perform. This is also called the **Alpaca format**.
- If it is *tuned on chats*, then you should first describe the role-playing background, then give a transcript of the chat with user and assistant taking turns, each conversational turn on a separate line. This is also called the **Vicuna format**.
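The two styles can be captured as small prompt-building helpers. These mirror the prompts used in this tutorial; the exact templates vary from model to model (check the model card), so treat the function names and layouts below as an illustrative sketch rather than a canonical definition of either format.

```python
def alpaca_prompt(instruction: str) -> str:
    """Instruction-style prompt, laid out like the essay example in this tutorial."""
    return f"### Instruction\n{instruction}\n### Response\n"


def vicuna_prompt(system: str, turns, next_user_message: str) -> str:
    """Chat-style prompt: background first, then alternating USER/ASSISTANT lines."""
    lines = [system]
    for user_msg, assistant_msg in turns:
        lines.append(f"USER: {user_msg}")
        lines.append(f"ASSISTANT: {assistant_msg}")
    lines.append(f"USER: {next_user_message}")
    lines.append("ASSISTANT:")  # cue the model to answer as the assistant
    return "\n".join(lines)


print(alpaca_prompt("List three uses of a paperclip."))
print(vicuna_prompt(
    "A chat between a curious user and an artificial intelligence assistant.",
    [("Hi!", "Hello! How can I help?")],
    "What are the steps to create a new website?",
))
```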

Because this might be confusing to beginners, we have deliberately picked a model that merges multiple fine-tunes, so what it has been trained on is diverse enough to work with *both* prompt formats. For the most part, you don't need to worry about this as much as you normally would with other models.

We've seen the reverse prompt in the optional section on raw `llama.cpp`. It detects the presence of specific words and pauses generation when they appear. This is mainly useful in a chat setting, as the LLM would otherwise generate *both* sides of the conversation by itself.

## More features: Streaming and Token count

This library is a relatively bare-bones, thin wrapper over the base library. However, it does provide some features that may be useful for application developers. We briefly cover two examples, leaving you, the reader, to explore more features by reading the manual.

We may enable streaming output with the `stream=True` argument. In that case the `output` objects returned will be similar, but the `text` field will contain only the newly generated token instead of the full text with the original prompt included. We may strip the excess metadata and wrap it in a nice little generator ourselves. Note the trade-off, however: the per-token metadata can be useful, for example to keep track of the generation process.

In [7]:
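def run_llm_stream_naive(prompt, max_token, stopwords):
    """Yield newly generated text pieces one at a time, dropping the metadata."""
    outputs = llm(prompt, max_tokens=max_token, stop=stopwords,
                  echo=True, stream=True)
    for output in outputs:
        # Each streamed chunk has the same shape as a normal response,
        # but its text field holds only the newly generated piece
        tok = output['choices'][0]['text']
        yield tok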

In [8]:
chat_prompt = """You are an AI ASSISTANT who is having a chat with USER.
You are helpful and will answer USER to the best of your ability.
USER: Write a poem involving the theme of embracing uncertainty in life, based on "The road not chosen" but modifying it.
ASSISTANT: Sure! """

output_stream = run_llm_stream_naive(chat_prompt, 512, ["USER:"])

for token in output_stream:
    print(token, end="", flush=True)


Here's a poem called "The Path Unknown" that embraces the uncertainty of life:

The road untraveled lies before me,
Its course unknown and yet to be seen.
I step forward, heart pounding,
Into the unknown, with a leap of faith.

The trees whisper in the breeze,
Of adventures that wait in disguise.
I will follow where they lead,
And see where the path may take me instead.

There will be twists and turns,
Ups and downs, and moments of fear.
But I will face them all with a courage true,
And learn from each experience that is new.

The unknown can be scary,
But it can also be full of light.
I will embrace the uncertainty,
And let it guide me to a brighter sight.

The path untraveled may be rough,
But it will also be full of enough.
I will follow my heart and soul,
And find the beauty that lies in the unknown. 

So I take a deep breath, and I step out of my comfort zone,
Into the unknown world that awaits me alone.
I may not know where it will lead,
But I am ready to embrace the journey and

Another common situation for application developers is wanting to feed a chunk of a document into an LLM, for example to generate a summary, while respecting the context length limit. To do so, we need a way to count the number of tokens in a text, or to cut a text off at a specific token count. But tokens do not correspond one-to-one with words, and there is no trivial way to count them accurately other than actually running the text through the tokenizer.

The library does provide functions to help us do this:

In [10]:
# Wikipedia article
mytext = """Food chemistry is the study of chemical processes and interactions of all biological and non-biological components of foods.[1][2] The biological substances include such items as meat, poultry, lettuce, beer, milk as examples. It is similar to biochemistry in its main components such as carbohydrates, lipids, and protein, but it also includes areas such as water, vitamins, minerals, enzymes, food additives, flavors, and colors. This discipline also encompasses how products change under certain food processing techniques and ways either to enhance or to prevent them from happening. An example of enhancing a process would be to encourage fermentation of dairy products with microorganisms that convert lactose to lactic acid; an example of preventing a process would be stopping the browning on the surface of freshly cut apples using lemon juice or other acidulated water.

History of food chemistry
The scientific approach to food and nutrition arose with attention to agricultural chemistry in the works of J. G. Wallerius, Humphry Davy, and others. For example, Davy published Elements of Agricultural Chemistry, in a Course of Lectures for the Board of Agriculture (1813) in the United Kingdom which would serve as a foundation for the profession worldwide, going into a fifth edition. Earlier work included that by Carl Wilhelm Scheele, who isolated malic acid from apples in 1785.

Some of the findings of Liebig on food chemistry were translated and published by Eben Horsford in Lowell Massachusetts in 1848.[3]

In 1874 the Society of Public Analysts was formed, with the aim of applying analytical methods to the benefit of the public.[4] Its early experiments were based on bread, milk and wine.

It was also out of concern for the quality of the food supply, mainly food adulteration and contamination issues that would first stem from intentional contamination to later with chemical food additives by the 1950s. The development of colleges and universities worldwide, most notably in the United States, would expand food chemistry as well with research of the dietary substances, most notably the Single-grain experiment during 1907-11. Additional research by Harvey W. Wiley at the United States Department of Agriculture during the late 19th century would play a key factor in the creation of the United States Food and Drug Administration in 1906. The American Chemical Society would establish their Agricultural and Food Chemistry Division in 1908 while the Institute of Food Technologists would establish their Food Chemistry Division in 1995.

Food chemistry concepts are often drawn from rheology, theories of transport phenomena, physical and chemical thermodynamics, chemical bonds and interaction forces, quantum mechanics and reaction kinetics, biopolymer science, colloidal interactions, nucleation, glass transitions and freezing/disordered or noncrystalline solids, and thus has Food Physical Chemistry as a foundation area.[5][6]

Water in food systems
Main article: Water
A major component of food is water, which can encompass anywhere from 50% in meat products to 95% in lettuce, cabbage, and tomato products. It is also an excellent place for bacterial growth and food spoilage if it is not properly processed. One way this is measured in food is by water activity which is very important in the shelf life of many foods during processing. One of the keys to food preservation in most instances is reduce the amount of water or alter the water's characteristics to enhance shelf-life. Such methods include dehydration, freezing, and refrigeration[7][8][9][10] This field encompasses the "physiochemical principles of the reactions and conversions that occur during the manufacture, handling, and storage of foods".[11]
"""

doc_as_tokenlist = llm.tokenize(bytes(mytext, "utf-8"))
print(len(doc_as_tokenlist))

920


In [11]:
chunk_example = doc_as_tokenlist[150:200]
print(llm.detokenize(chunk_example).decode("utf-8"))

 would be to encourage fermentation of dairy products with microorganisms that convert lactose to lactic acid; an example of preventing a process would be stopping the browning on the surface of freshly cut apples using le
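Building on the slicing shown above, here is a sketch of splitting a token list into fixed-size chunks, optionally overlapping so that sentences cut at a boundary appear in both neighbors. The helper name and the overlap scheme are choices made for this example, and it works on any list, so it can be tried without a model. Each resulting chunk could then be turned back into text with `llm.detokenize(...)` as in the cell above.

```python
def split_into_chunks(tokens, chunk_size: int, overlap: int = 0):
    """Split a token list into chunks of at most `chunk_size`, sharing `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


# Stand-in for doc_as_tokenlist: 920 dummy token IDs, matching the count above.
dummy_tokens = list(range(920))
chunks = split_into_chunks(dummy_tokens, chunk_size=200)
print(len(chunks), [len(c) for c in chunks])
```

When sizing chunks for summarization, remember to leave room in the context window for the instruction text and for the generated answer, not just for the chunk itself.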


## Gradio Interface for more experimentation

So far we've been interacting with the LLM through a programmatic interface or the command line. Heavier experimentation often requires trial and error on prompts, since the output quality of smallish LLMs like the one we've been using can be sensitive to minute details of how they are prompted. In such cases a UI, preferably one that automatically saves the history of requests/responses/configurations, makes more sense.

This part is mostly the same regardless of the backend that actually generates the LLM response, and we will develop it more fully in the next tutorial, where we use the GPU-only option. For now, let's get something bare-bones working:

In [12]:
!pip install gradio

Collecting gradio
  Downloading gradio-3.36.1-py3-none-any.whl (19.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.8/19.8 MB 65.0 MB/s eta 0:00:00
Collecting aiofiles (from gradio)
  Downloading aiofiles-23.1.0-py3-none-any.whl (14 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.100.0-py3-none-any.whl (65 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.7/65.7 kB 8.2 MB/s eta 0:00:00
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.0.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client>=0.2.7 (from gradio)
  Downloading gradio_client-0.2.9-py3-none-any.whl (288 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 288.8/288.8 kB 34.0 MB/s eta 0:00:00
Collecting httpx (from gradio)
  Downloading httpx-0.24.1-py3-none-any.whl (75 kB)

In [14]:
import gradio as gr

default_prompt = """You are an AI ASSISTANT who is having a chat with USER.
You are helpful and will answer USER to the best of your ability.
USER: Hi! How are you today?
ASSISTANT: """

def submit_llm(prompt, stop, max_token):
    text = ""
    # An empty "Reverse prompt" box means no stop words
    stoplist = [stop] if stop else []
    # The slider may deliver a float, while max_tokens should be an integer
    output_stream = run_llm_stream_naive(prompt, int(max_token), stoplist)
    # Gradio replaces the displayed output with each yielded value,
    # so we need to accumulate the text ourselves
    for token in output_stream:
        text = text + token
        yield text

demo = gr.Interface(
    submit_llm,
    inputs=[
        gr.TextArea(label="Prompt", value=default_prompt),
        gr.Textbox(label="Reverse prompt", value="USER:"),
        gr.Slider(10, 1024, value=512, label="Max New Token"),
    ],
    outputs=[gr.TextArea(label="LLM Response", show_copy_button=True)],
)
demo.queue()
demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://3035b5604c8655ef91.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




## Cleaning Up

When we're all done, we may free up the memory with the following code. Once the model and context have been freed, do not use `llm` again.

In [15]:
from llama_cpp import llama_free, llama_free_model

In [16]:
llama_free_model(llm.model)

In [17]:
llama_free(llm.ctx)