# Running Open Source LLM - GPU option via exllama
In this tutorial, we will run the LLM entirely on the GPU, which speeds it up significantly. The recommended software for this used to be `auto-gptq`, but its generation speed has since been surpassed by `exllama`. `exllama` also shares a philosophy with `llama.cpp`: it is a bare-bones reimplementation of just the parts needed to run inference. Here, the primitive operations are written directly in CUDA, Nvidia's proprietary GPU programming platform, together with some basic PyTorch.

Just a quick reminder that this option requires the whole model to fit within the VRAM of the GPU.

If you are using Google Colab or a similar cloud service, make sure you allocate a GPU VM.
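
Before downloading anything, it can help to sanity-check that a GPU is visible and that the model has a chance of fitting. The sketch below is only a rough estimate: the helper names are ours, the 2 GiB overhead for cache and activations is a guess, and the weight estimate is a lower bound (real GPTQ files also carry scales and some unquantized layers).

```python
def estimate_weight_gib(n_params_billion, bits_per_weight=4.0):
    """Rough lower bound on the size of the quantized weights, in GiB."""
    n_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return n_bytes / 2**30


def fits_in_vram(n_params_billion, vram_gib, bits_per_weight=4.0, overhead_gib=2.0):
    """True if the estimated weights plus a guessed overhead fit in vram_gib."""
    needed = estimate_weight_gib(n_params_billion, bits_per_weight) + overhead_gib
    return needed <= vram_gib


def detect_vram_gib():
    """Total memory of GPU 0 in GiB, or None if PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 2**30


vram = detect_vram_gib()
if vram is None:
    print("No CUDA GPU detected - allocate a GPU VM first.")
else:
    print(f"GPU 0 has {vram:.1f} GiB; 13B at 4-bit fits: {fits_in_vram(13, vram)}")
```

Treat a "fits" result as a hint only. Long prompts and long generations grow the cache, so leave yourself some headroom.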

## Installation and model download
Installation is straightforward. Following the instructions in [the repo](https://github.com/turboderp/exllama):

In [1]:
!git clone https://github.com/turboderp/exllama

Cloning into 'exllama'...
remote: Enumerating objects: 1316, done.
remote: Counting objects: 100% (659/659), done.
remote: Compressing objects: 100% (179/179), done.
remote: Total 1316 (delta 542), reused 502 (delta 480), pack-reused 657
Receiving objects: 100% (1316/1316), 891.94 KiB | 7.76 MiB/s, done.
Resolving deltas: 100% (926/926), done.


In [2]:
!cd exllama && pip install -r requirements.txt

Collecting safetensors==0.3.1 (from -r requirements.txt (line 2))
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting sentencepiece>=0.1.97 (from -r requirements.txt (line 3))
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting ninja==1.11.1 (from -r requirements.txt (line 4))
  Downloading ninja-1.11.1-py2.py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (145 kB)
Installing collected packages: sentencepiece, safetensors, ninja
Successfully installed ninja-1.11.1 safetensors-0.3.1 sentencepiece-0

The model download works exactly as before: we use the `huggingface-hub` client library to interact with the API. Notice that, unlike in the `llama.cpp`/CPU case, the GPTQ format is **not** self-contained, so we also need the extra configuration files that ship alongside the weights. We therefore use the `snapshot_download` function to download the whole repository. It returns the local directory where the files were downloaded.

Feel free to experiment with different models. You can search on the [Hugging Face website](https://huggingface.co). Remember to search for GPTQ-quantized models. Our (tested) examples here are:

- `TheBloke/orca_mini_13B-GPTQ` - A reproduction of the Microsoft Orca paper, which claims that good capability is achievable even with a small model. Note that V1.1 has just been released, but we're staying on the old version for now.
- `TheBloke/WizardLM-13B-V1-0-Uncensored-SuperHOT-8K-GPTQ` - WizardLM is a good general-purpose model for complex instruction following, and SuperHOT is a LoRA that enables a longer context length.

*(Note: If the Hugging Face API returns an error, it may simply be temporarily overloaded. Wait a while, then rerun the cell.)*
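
If you would rather not rerun the cell by hand, the waiting can be automated. Below is a minimal sketch of a generic retry helper with a doubling delay. `with_retries` is our own hypothetical helper, not part of `huggingface-hub`, and it retries on *any* exception, so a permanent error (such as a misspelled repository name) will simply fail again after the last attempt.

```python
import time


def with_retries(fn, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call fn(); on failure wait and retry, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc!r}); retrying in {delay:.0f}s")
            sleep(delay)


# Possible usage with the download call used in this tutorial:
# path = with_retries(lambda: snapshot_download("TheBloke/orca_mini_13B-GPTQ"))
```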

In [3]:
!pip install huggingface-hub

Collecting huggingface-hub
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Installing collected packages: huggingface-hub
Successfully installed huggingface-hub-0.16.4


In [4]:
from huggingface_hub import snapshot_download

path = snapshot_download("TheBloke/orca_mini_13B-GPTQ")
#path = snapshot_download("TheBloke/WizardLM-13B-V1-0-Uncensored-SuperHOT-8K-GPTQ")

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Downloading (…)f7217abc/config.json:   0%|          | 0.00/598 [00:00<?, ?B/s]

Downloading (…)1ef7217abc/README.md:   0%|          | 0.00/14.8k [00:00<?, ?B/s]

Downloading (…)quantize_config.json:   0%|          | 0.00/135 [00:00<?, ?B/s]

Downloading (…)17abc/tokenizer.json:   0%|          | 0.00/1.98M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/208 [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

Downloading (…)17abc/.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

Downloading (…)ct.order.safetensors:   0%|          | 0.00/8.11G [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/700 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/534k [00:00<?, ?B/s]
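
Since the GPTQ format is spread over several files, it is worth confirming that the snapshot directory contains what the loading code further down expects: `config.json`, `tokenizer.model`, and at least one `*.safetensors` file. The helper below is a small sketch of such a check. The function name and the list of required files are our assumptions, based on the files this tutorial's loading code reads.

```python
import glob
import os

# Files the loading code in this tutorial looks for (our assumption).
REQUIRED_FILES = ("config.json", "tokenizer.model")


def missing_model_files(model_directory):
    """Return the names/patterns of expected files that are not present."""
    missing = [
        name for name in REQUIRED_FILES
        if not os.path.isfile(os.path.join(model_directory, name))
    ]
    if not glob.glob(os.path.join(model_directory, "*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Possible usage after the download cell:
# print(missing_model_files(path) or "All expected files are present.")
```

An empty list means the directory looks complete. Anything else usually points to an interrupted download or a repository with a different layout.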

## Initialization and first runs

Now let's get started. We copied the example code in the `exllama` project from the file `/example_basic.py` ([link](https://github.com/turboderp/exllama/blob/master/example_basic.py)).

Note a few things:

- It imports Python modules directly from the project directory, so we need to be in the correct directory. We do this with the magic command `%cd exllama`.
- If you use the SuperHOT extended context length feature, you need to change some configuration parameters. Comment/uncomment as necessary. *(TODO: link to a post to explain meaning of the params)*

The model loading will take about 1 - 2 minutes.
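
As a rough intuition for the SuperHOT parameters (our reading, not an official explanation): `max_seq_len` is the longest sequence we want to handle, and `compress_pos_emb` is the factor by which token positions are scaled down so that the longer sequence maps back into the position range the base model was trained on. Assuming the base model's native context is 2048 tokens, an 8192-token context gives a factor of 4. The helper names below are hypothetical.

```python
def position_compression_factor(target_context_len, native_context_len=2048):
    """Factor by which positions must shrink to fit the native range."""
    return target_context_len / native_context_len


def compressed_positions(n_tokens, factor):
    """Positions 0..n_tokens-1 after linear compression by `factor`."""
    return [i / factor for i in range(n_tokens)]


factor = position_compression_factor(8192)  # assumes a 2048-token native context
print(factor)
print(compressed_positions(5, factor))
# Even the last of 8192 positions stays below the native limit of 2048:
print(max(compressed_positions(8192, factor)))
```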

In [5]:
%cd exllama

/content/exllama


In [6]:
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import os, glob

# Directory containing model, tokenizer, generator

model_directory = path

# Locate files we need within that directory

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]

# Create config, model, tokenizer and generator

config = ExLlamaConfig(model_config_path)               # create config from config.json
config.model_path = model_path                          # supply path to model weights file

# Comment/uncomment these based on whether you need long context length
#config.max_seq_len = 8192
#config.compress_pos_emb = 4
#config.max_input_len = 8192

model = ExLlama(config)                                 # create ExLlama instance and load the weights
tokenizer = ExLlamaTokenizer(tokenizer_path)            # create tokenizer from tokenizer model file

cache = ExLlamaCache(model)                             # create cache for inference
generator = ExLlamaGenerator(model, tokenizer, cache)   # create generator

# Configure generator

# We deviate from the original source code here. It deliberately uses a crude version of
# Microsoft guidance to never choose the EOS (End of Sequence) token during sampling,
# even though that token is what lets the LLM stop generating text.
# The reason is basically to force the LLM to give long responses, as some LLMs have
# a tendency to give terse ones.
# But the models we choose have usually been instruction fine-tuned and are not meant
# for the chatting use case, so we want the raw behavior instead.
# Therefore this line is commented out.
#generator.disallow_tokens([tokenizer.eos_token_id])

generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.95
generator.settings.top_p = 0.65
generator.settings.top_k = 100
generator.settings.typical = 0.5
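
To give a feel for what `temperature`, `top_k` and `top_p` do, here is a small pure-Python sketch of the textbook versions of these filters. It is a generic illustration and **not** `exllama`'s actual implementation (which may differ in order and detail, and also has the `typical` setting, not covered here). `filter_probs` is a hypothetical helper name.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


print(filter_probs([2.0, 1.0, 0.0, -1.0], temperature=0.95, top_k=100, top_p=0.65))
```

The next token is then sampled from the surviving candidates, which is why lower `top_p`/`top_k` values make the output more predictable.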

Okay, now let's finally generate some text. Remember that we need to know the correct prompt format to use, and whether it is documented depends entirely on the model author being meticulous. For the examples we picked, the authors have included it in the README file. See: [Orca-mini v1.0](https://huggingface.co/TheBloke/orca_mini_13B-GPTQ#prompt-template), [WizardLM v1.0 + SuperHOT](https://huggingface.co/TheBloke/WizardLM-13B-V1-0-Uncensored-SuperHOT-8K-GPTQ#prompt-template)

The generation speed is about 30 tokens/sec on a T4 GPU, using 13B models. Streaming is **not** enabled in the basic example here so you will see the response only after the LLM is completely done.
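
If you want to check the speed on your own hardware, a simple timing wrapper is enough. The sketch below is deliberately generic: it takes the generation function and a token-counting function as arguments, so it makes no claims about `exllama` internals. The usage shown in the comment assumes that `tokenizer.encode(text)` returns a tensor whose last dimension is the token count. We believe this holds for `exllama`, but please verify it against the source.

```python
import time


def measure_tokens_per_second(generate_fn, count_tokens_fn, prompt, max_new_tokens):
    """Run one generation and return (output, new_tokens, tokens_per_second)."""
    start = time.perf_counter()
    output = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    # New tokens = tokens in the full output minus tokens in the prompt.
    new_tokens = count_tokens_fn(output) - count_tokens_fn(prompt)
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return output, new_tokens, rate


# Possible usage once `generator` and `tokenizer` exist (assumed API, see above):
# output, n, rate = measure_tokens_per_second(
#     lambda p, n: generator.generate_simple(p, max_new_tokens=n),
#     lambda s: tokenizer.encode(s).shape[-1],
#     actual_prompt,
#     256,
# )
# print(f"{n} new tokens at {rate:.1f} tokens/sec")
```

Note that this measures prompt processing and generation together, so short generations will look slower than the steady-state speed.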

If you're out of ideas, here are some example requests you may use:

- Give me a recipe for Spaghetti Carbonara.
- Write an essay on the role of IT in the international supply chain.
- Write the draft for a chapter of a book on the history of the Roman Civilization. In the first chapter, cover the rise of Rome from the early days to the Roman Republic. Cover various aspects such as military/war (especially wars with Carthage), politics, civil life such as society and economics, leisures, etc.

In [7]:
# Prepare some further config
max_new_tokens = 1024 # Modify this based on your needs; can exceed 2048 if using SuperHOT

# Comment/uncomment the suitable one based on which model you chose
# Also note that these are not exact copies - we added a {prompt} placeholder for Python's built-in str.format

# TheBloke/orca_mini_13B-GPTQ
new_prompt_template = """### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.

### User:
{prompt}

### Response:
"""

# TheBloke/WizardLM-13B-V1-0-Uncensored-SuperHOT-8K-GPTQ
#new_prompt_template = """A chat between a curious user and an assistant. The assistant gives helpful, detailed, accurate, uncensored responses to the user's input
#USER: {prompt}
#ASSISTANT: """

# Try a simple, single text generation
request = input("Your instruction/request:")

actual_prompt = new_prompt_template.format(prompt=request)

print(actual_prompt, end="")
output = generator.generate_simple(actual_prompt, max_new_tokens = max_new_tokens)
print(output[len(actual_prompt):])

Your instruction/request:Write a short blog post on how to start a new website.
### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.

### User:
Write a short blog post on how to start a new website.

### Response:
 Starting a new website can be a complex process, but here are some general steps you can follow:

1. Identify the website's purpose and goals: Before starting a new website, it is important to identify its purpose and goals. This will help you understand what kind of content you need to create for your website.

2. Choose a domain name: A domain name is a unique identifier that helps your website stand out from others. It should be something like "www.website.com" or "https://websites.com".

3. Select a web template: Once you have identified your website's purpose and goals, you can select a web template to make your website more visually appealing.

4. Add content to your site: After selecting a web template, you can add cont

## More about exllama
The example above only gives us a taste. Here are some more details about `exllama`:

- Its documentation is sadly still WIP. In short, you may read the source code directly, starting at `/generator.py` ([link](https://github.com/turboderp/exllama/blob/master/generator.py))
- It also supports deployment with Docker Compose - see the official README.
- There are other example files to illustrate its features:
  - `/example_batch.py` - For offline use cases on a GPU, you can do batch inference, which gives you even higher throughput. One example is *document summarization/indexing* for later use by a *retrieval-augmented LLM*. In this use case, you would split a document into chunks that fit well within the context length limit, feed a prompt asking to summarize each chunk (this is the part that can be parallelized using batch inference), then store the summaries (possibly along with the original chunks) in a vector store, etc.
  - `/example_chatbot.py` - Illustrates how a multi-round conversation can be programmed using this library. Note the use of *beam search*, and that by going lower-level we can exert greater control over the generation process and intervene when necessary. This example also illustrates *streaming responses*.
  - `/example_flask.py` - They also provide a small web server/API (?) + UI if you want that convenience.
  - `/example_lora.py` - It is possible to load a LoRA too.
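
To make the batch summarization idea a bit more concrete, here is a minimal sketch of the preparation step only: splitting a document into overlapping chunks and turning each chunk into a prompt. The helper names are ours, and word count is used as a crude stand-in for token count (an assumption - a real pipeline should measure length with the model's tokenizer). The prompt reuses the Orca-mini format from earlier.

```python
SUMMARY_TEMPLATE = """### System:
You are an AI assistant that follows instruction extremely well. Help as much as you can.

### User:
Summarize the following text in a few sentences:

{chunk}

### Response:
"""


def split_into_chunks(text, max_words=300, overlap=30):
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reached the end of the document
    return chunks


def build_summary_prompts(text, max_words=300, overlap=30):
    """One summarization prompt per chunk, ready to be sent as a batch."""
    return [SUMMARY_TEMPLATE.format(chunk=chunk)
            for chunk in split_into_chunks(text, max_words, overlap)]


prompts = build_summary_prompts("word " * 700)
print(len(prompts), "prompts prepared")
```

The resulting list of prompts is what you would hand to the batch generation code. See `/example_batch.py` for how the project itself sets up the cache and generator for a batch.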


## Gradio Interface for more experimentation

So far we have worked with the LLM directly by calling library functions, which is great for application development. But for quick experimentation, this is just poor UX.

Although in the long term a dedicated piece of software that works like an LLM studio/IDE may be better suited, in the medium term we can use a rapid-prototyping UI library to give us a quick UI for demoing the LLM (which is what HuggingFace Spaces are all about).

The first thing we need to do is to implement a python function that performs LLM text generation in a nicely wrapped up, end-to-end manner. For `exllama`, here's one possible solution:

In [8]:
import torch
import os

def get_truly_random_seed_through_os():
    """
    Usually the best random sample you could get in any programming language is generated through the operating system.
    In Python, you can use the os module.

    source: https://stackoverflow.com/questions/57416925/best-practices-for-generating-a-random-seeds-to-seed-pytorch/57416967#57416967
    """
    RAND_SIZE = 4
    random_data = os.urandom(
        RAND_SIZE
    )  # Returns RAND_SIZE random bytes suitable for cryptographic use.
    random_seed = int.from_bytes(random_data, byteorder="big")
    return random_seed

def run_llm_simple(prompt, max_new_tok, temp, top_p, top_k, rep_penalty, seed):
    # Need to manually set seed
    # See: https://github.com/turboderp/exllama/issues/116
    if seed < 0:
        gen_seed = get_truly_random_seed_through_os()
    else:
        gen_seed = seed
    torch.manual_seed(gen_seed)
    torch.cuda.manual_seed_all(gen_seed)
    # Generation config
    generator.settings.token_repetition_penalty_max = rep_penalty #1.2
    generator.settings.temperature = temp #0.95
    generator.settings.top_p = top_p #0.65
    generator.settings.top_k = top_k #100
    output = generator.generate_simple(prompt, max_new_tokens = max_new_tok)
    return (output[len(prompt):], gen_seed)

After that, we need to design a UI in Gradio. This is the same whether the backend is `llama.cpp` or `exllama`. Since this is not a course on UI libraries, we will mostly skip the explanations. Just install the `gradio` library, then run the code snippets below.

*(Note: One may also use `oobabooga`, but our approach has the advantage of being extremely lightweight, with a small number of dependencies)*

**Bonus Feature**

A common use case is heavier experimentation with an LLM, trying out different prompts and settings. Keeping track of all the variations can get messy.

So, we implemented a bonus feature that automatically logs every request-response pair to a `sqlite` database. The file is stored at `exllama/llm_log.db`. If you're running this on a cloud with an ephemeral disk, remember to manually download this file when you're done, using the UI in JupyterLab (or the equivalent on other platforms).

In [9]:
!pip install gradio

Collecting gradio
  Downloading gradio-3.36.1-py3-none-any.whl (19.8 MB)
Collecting aiofiles (from gradio)
  Downloading aiofiles-23.1.0-py3-none-any.whl (14 kB)
Collecting fastapi (from gradio)
  Downloading fastapi-0.100.0-py3-none-any.whl (65 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.0.tar.gz (4.8 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client>=0.2.7 (from gradio)
  Downloading gradio_client-0.2.8-py3-none-any.whl (288 kB)
Collecting httpx (from gradio)
  Downloading httpx-0.24.1-py3-none-any.whl (75 kB)

In [30]:
import gradio as gr

#new_prompt_template = """### System:
#You are an AI assistant that follows instruction extremely well. Help as much as you can.
#
#### User:
#{prompt}
#
#### Response:
#"""

default_request = "Write a blog post on how to start a new website."
default_prompt = new_prompt_template.format(prompt=default_request)

model_name = "TheBloke/orca_mini_13B-GPTQ"

def llm_conf_preset(ui_presets, ui_temp, ui_top_p, ui_top_k, ui_rep_penalty):
    inter = (ui_presets == "None")
    # Precise: mod from llama-precise
    if ui_presets == "Precise":
        temp = 0.4
        top_p = 0.1
        top_k = 40
        rep_penalty = 1.18
    # Balanced: simple-1
    if ui_presets == "Balanced":
        temp = 0.7
        top_p = 0.9
        top_k = 20
        rep_penalty = 1.15
    # Creative: Inspired from NovelAI_StoryWriter
    if ui_presets == "Creative":
        temp = 1.2
        top_p = 0.95
        top_k = 30
        rep_penalty = 1.1
    # Use current value
    if ui_presets == "None":
        temp = ui_temp
        top_p = ui_top_p
        top_k = ui_top_k
        rep_penalty = ui_rep_penalty
    return [gr.update(interactive=inter, value=temp),\
            gr.update(interactive=inter, value=top_p),\
            gr.update(interactive=inter, value=top_k),\
            gr.update(interactive=inter, value=rep_penalty)]

example_requests = [["Give me a recipe for Spaghetti Carbonara."],\
                    ["Write an essay on the role of IT in the international supply chain."],\
                    ["Write the draft for a chapter of a book on the history of the Roman Civilization. In the first chapter, cover the rise of Rome from the early days to the Roman Republic. Cover various aspects such as military/war (especially wars with Carthage), politics, civil life such as society and economics, leisures, etc."]]
def on_select_example(evt: gr.SelectData):
    #print("Testing")
    request = evt.value[0]
    return new_prompt_template.format(prompt=request)

In [35]:
# Nice-to-have: Logging request-response pair
import sqlite3
import datetime
con = sqlite3.connect("llm_log.db")
cur = con.cursor()

#res = cur.execute("SELECT name FROM sqlite_master WHERE name='llm_req_res'")
#if res.fetchone() is None:
#    cur.execute("CREATE TABLE llm_req_res(title, year, score)")
create_table_sql = """CREATE TABLE IF NOT EXISTS llm_req_res (
    model TEXT,
    prompt TEXT,
    response TEXT,
    temperature TEXT,
    top_p TEXT,
    top_k TEXT,
    repetition_penalty TEXT,
    max_new_tokens TEXT,
    seed TEXT,
    generation_dt TEXT
)
"""

insert_log_sql = """INSERT INTO llm_req_res(model, prompt, response, temperature, top_p, top_k, repetition_penalty, max_new_tokens, seed, generation_dt) VALUES (?,?,?,?,?,?,?,?,?,?)"""

res = cur.execute(create_table_sql)
con.commit()

def insert_to_db(con, sql, data):
    cur = con.cursor()
    cur.execute(sql, data)
    con.commit()
    return cur.lastrowid

def log_llm_to_db(prompt, response, temperature, top_p, top_k, repetition_penalty, max_new_tokens, seed):
    con = sqlite3.connect("llm_log.db")
    now = datetime.datetime.now()
    model = model_name
    temp = "{:.3f}".format(temperature)
    tp = "{:.3f}".format(top_p)
    repet = "{:.3f}".format(repetition_penalty)
    insert_to_db(con, insert_log_sql, \
                 (model, prompt, response, \
                  temp, tp, top_k, repet, max_new_tokens, seed, now.strftime('%d/%m/%y %H:%M:%S.%f')))
    con.close()
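
Once a few generations have been logged, you will probably want to look at them. Here is a small sketch for reading the most recent entries back. It assumes the `llm_req_res` table defined above, and `fetch_recent_logs` is our own helper name. Rows are ordered by SQLite's implicit `rowid`, i.e. insertion order, since the stored timestamp format does not sort chronologically as text.

```python
import sqlite3


def fetch_recent_logs(db_path="llm_log.db", limit=5):
    """Return the most recently inserted log rows as a list of dicts."""
    con = sqlite3.connect(db_path)
    try:
        con.row_factory = sqlite3.Row  # lets us access columns by name
        rows = con.execute(
            "SELECT prompt, response, temperature, top_p, top_k, seed, generation_dt "
            "FROM llm_req_res ORDER BY rowid DESC LIMIT ?",
            (limit,),
        ).fetchall()
        return [dict(row) for row in rows]
    finally:
        con.close()


# Possible usage after some requests have been submitted through the UI:
# for entry in fetch_recent_logs(limit=3):
#     print(entry["generation_dt"], entry["seed"], entry["response"][:80])
```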


In [36]:
with gr.Blocks() as demo:
    hidden_state_seed = gr.State()
    with gr.Row():
        with gr.Column():
            ui_prompt = gr.TextArea(label="Prompt", value=default_prompt)
            with gr.Accordion(label="Settings", open=False):
                ui_max_new_tok = gr.Slider(label="Max New Token",       minimum=10,   maximum=4096,   value=512 )
                ui_temp        = gr.Slider(label="Temperature",         minimum=0.0,  maximum=1.3,    value=0.7 )
                ui_top_p       = gr.Slider(label="top-p",               minimum=0.01, maximum=1.0,    value=0.95)
                ui_top_k       = gr.Slider(label="top-k",               minimum=10,   maximum=200,    value=100 )
                ui_rep_penalty = gr.Slider(label="Repetition penalty",  minimum=1.0,  maximum=2.0,    value=1.1 )
                ui_seed        = gr.Slider(label="Random Seed",         minimum=-1,   maximum=2 ** 24,value=-1  ,step = 1)
                ui_presets = gr.Radio(choices=["None", "Precise", "Balanced", "Creative"], value="None", label="Use preset?")
            btn_submit = gr.Button(value="Submit", variant="primary")
            btn_clear  = gr.ClearButton(value="Clear")
        with gr.Column():
            ui_response = gr.TextArea(label="LLM Response", show_copy_button=True)
            ui_examples = gr.Dataset(components=[gr.Textbox(visible=False)], label="Example Requests", samples=example_requests)
        btn_submit.click(run_llm_simple, [ui_prompt, ui_max_new_tok, ui_temp, ui_top_p, ui_top_k, ui_rep_penalty, ui_seed], [ui_response, hidden_state_seed]) \
          .success(log_llm_to_db, [ui_prompt, ui_response, ui_temp, ui_top_p, ui_top_k, ui_rep_penalty, ui_max_new_tok, hidden_state_seed])
        ui_presets.change(llm_conf_preset, [ui_presets, ui_temp, ui_top_p, ui_top_k, ui_rep_penalty], [ui_temp, ui_top_p, ui_top_k, ui_rep_penalty])
        ui_examples.select(fn=on_select_example, inputs=None, outputs=ui_prompt)
demo.queue()
demo.launch(debug=False)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://c65c50b8dff24f65ed.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7863 <> https://c65c50b8dff24f65ed.gradio.live




----
(Debugging Area below, please ignore)

In [19]:
#insert_to_db(con, insert_log_sql, ("aa", "bb", "cc", "0.7", "0.95", "20", "1.1", "512", "-1", "12/12/12 23:23:23.000"))

1

In [37]:
debug_cur = con.cursor()
debug_cur.execute("SELECT * FROM llm_req_res")
rows = debug_cur.fetchall()

In [22]:
n = datetime.datetime.now()
print(n.strftime('%d/%m/%y %H:%M:%S.%f'))

10/07/23 19:05:49.298450


In [23]:
"{:.3f}".format(0.95)

'0.950'

In [26]:
#log_llm_to_db("prompt1", "response1", 0.7, 0.95, 20, 1.2, 512, 42)