## Summary

Below is a summary of my findings:

- **[vLLM](https://vllm.readthedocs.io/en/latest/) was the best tool that I tried, with significantly lower latency than everything else.**  The documentation was also great, and it was easy to use.  It only supports specific versions of CUDA, which you can manage using [this approach](../cuda.qmd).  I found getting things to work a bit fiddly, but once I did it was quite fast.
- **[Text Generation Inference](https://github.com/huggingface/text-generation-inference) is an OK option (but nowhere near as fast as `vLLM`) if you want to deploy LLMs in a standard way.**  TGI has a few more features than `vLLM`, like telemetry baked in ([via OpenTelemetry](https://opentelemetry.io/docs/concepts/signals/traces/)) and integration with the HF ecosystem, such as [inference endpoints](https://huggingface.co/inference-endpoints).  Even though it's nowhere near as fast as `vLLM`, I expect that the optimization techniques that make `vLLM` fast will be integrated into this server over time.  One thing to note: as of 7/28/2023, the license for TGI was changed to be more **[restrictive, which may interfere with certain commercial uses](https://github.com/huggingface/text-generation-inference/commit/bde25e62b33b05113519e5dbf75abda06a03328e)**.

### Rough Benchmarks

This study focuses on various approaches to optimizing **latency**.  Specifically, I want to know which tools are most effective at optimizing latency for open source LLMs. In order to focus on latency, I hold the following variables constant:

- Batch size of `n = 1` for all prediction requests (holding throughput constant).[^1]
- All experiments were conducted on an `Nvidia A6000` GPU, unless otherwise noted.
- The model used is [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the HuggingFace Hub.[^2]

[^1]: It is common to explore the latency vs throughput frontier when conducting inference benchmarks.  I did not do this, since I was most interested in latency.  [Here is an example](https://github.com/mosaicml/llm-foundry/tree/main/scripts/inference/benchmarking#different-hw-setups-for-mpt-7b) of how to conduct inference benchmarks that consider both throughput and latency.
[^2]: For [Llama v2 models](https://huggingface.co/meta-llama), you must be careful to use the models ending in `-hf` as those are the ones that are compatible with the transformers library.  
[^3]: It's not an apples to apples comparison, since the largest OpenAI models are much larger than open source models.  However, I have found that fine-tuning a small model can often be better when you are trying to accomplish a very specific task.

In addition to using a batch size of `n = 1` and an `A6000` GPU (unless noted otherwise), I also made sure to warm up the model by sending an initial inference request before measuring latency.
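The warm-up step can be sketched roughly as follows. This is a minimal illustration rather than the benchmark code itself: `timed_requests` and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real call to an inference server.

```python
import time


def timed_requests(generate, prompts, n_warmup=1):
    """Time generate(prompt) for each prompt, discarding the first n_warmup calls."""
    timings = []
    for i, prompt in enumerate(prompts):
        start = time.perf_counter()
        generate(prompt)
        elapsed = time.perf_counter() - start
        if i >= n_warmup:  # warm-up requests are sent but not recorded
            timings.append({"prompt": prompt, "time": elapsed})
    return timings


def fake_generate(prompt):
    # Hypothetical stand-in for a real inference request.
    return prompt.upper()


results = timed_requests(fake_generate, ["a", "b", "c"])
print(len(results), "timed requests after 1 warm-up request")
```

Discarding the first request keeps one-time costs, such as lazy initialization, out of the measured latency.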

```python
#|echo: false
from IPython.display import HTML
import pandas as pd
df = pd.concat([pd.read_csv('_llama-inference/hf-endpoint/bench-hf-endpoint.csv').assign(platform='HF Hosted Inference Endpoint').assign(options='-').assign(gpu='A10G'),
                pd.read_csv('_llama-inference/tgi/bench-default.csv').assign(platform='TGI').assign(options='-').assign(gpu='A6000'), 
                pd.read_csv('_llama-inference/tgi/bench-quantize-bb.csv').assign(platform='TGI').assign(options='quantized w/ bitsandbytes').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/tgi/bench-quantize-gptq.csv').assign(platform='TGI').assign(options='quantized w/ GPTQ').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/exllama/bench-exllama.csv').assign(platform='text-generation-webui').assign(options='exllama').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/exllama/bench-ggmlcpp.csv').assign(platform='text-generation-webui').assign(options='ggml').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/vllm/bench-vllm.csv').assign(platform='vllm').assign(options='-').assign(gpu='A6000'),
                pd.read_csv('_llama-inference/vllm/modal-examples/bench-vllm.csv').assign(platform='vllm').assign(options='-').assign(gpu='A100 (on Modal Labs)')]
              )

# Per-request generation rate, then averaged within each platform/options/gpu group
df['tok/sec'] = df['tok_count'] / df['time']
pd.set_option('display.precision', 1)
df.groupby(['platform', 'options', 'gpu']).mean(numeric_only=True)[['tok/sec']]
```

| platform | options | gpu | tok/sec |
|---|---|---|---|
| HF Hosted Inference Endpoint | - | A10G | 30.4 |
| TGI | - | A6000 | 21.1 |
| TGI | quantized w/ GPTQ | A6000 | 23.6 |
| TGI | quantized w/ bitsandbytes | A6000 | 1.9 |
| text-generation-webui | exllama | A6000 | 30.8 |
| text-generation-webui | ggml | A6000 | 9.9 |
| vllm | - | A100 (on Modal Labs) | 40.9 |
| vllm | - | A6000 | 46.3 |


In some cases I did not use an `A6000` because the platform didn't have that particular GPU available.  You can ignore these rows if you like, but I still think it is valuable information.  I had access to an `A6000`, so I just used what I had.

Furthermore, the goal was not to be super precise on these benchmarks but rather to get a general sense of how things work and how they might compare to each other out of the box.
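For reference, the `tok/sec` column is the mean of per-request rates (`tok_count / time` for each request, then averaged within each group).  That is not the same number as total tokens divided by total time.  A small sketch with made-up numbers (not measurements from this study) shows the difference:

```python
# Hypothetical (tok_count, time in seconds) pairs, not real benchmark data.
requests = [(200, 4.0), (50, 2.0), (100, 4.0)]

# Mean of per-request rates: the aggregation used for the tok/sec column.
mean_of_rates = sum(tok / t for tok, t in requests) / len(requests)

# Overall rate: total tokens generated divided by total time spent.
overall_rate = sum(tok for tok, _ in requests) / sum(t for _, t in requests)

print(f"mean of per-request rates: {mean_of_rates:.1f} tok/sec")
print(f"overall rate: {overall_rate:.1f} tok/sec")
```

The two aggregations diverge more when requests differ a lot in length, which is one more reason to treat these numbers as rough.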

## Background

One capability you need to be successful with open source LLMs is the ability to serve models efficiently.  There are two categories of tools for model inference:

- **Inference servers:** these provide a web server with a REST, gRPC, or other interface for interacting with your model as a service.  These inference servers usually have parameters to help you make [trade-offs between throughput and latency](https://www.simonwenkel.com/notes/ai/practical/latency-vs-throughput-in-machine-learning-pipelines.html). Additionally, some inference servers come with extra features like telemetry, model versioning, and more. You can learn more about this topic in the [serving section](../serving/index.qmd) of these notes. For LLMs, popular inference servers are [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) and [vLLM](https://github.com/vllm-project/vllm).

- **Model Optimization**: these techniques modify your model to make it faster for inference.  Examples include [quantization](https://huggingface.co/docs/optimum/concept_guides/quantization), [Paged Attention](https://vllm.ai/), [Exllama](https://github.com/turboderp/exllama), and more.

It is common to use both **Inference servers** and **Model Optimization** techniques in conjunction.  Some inference servers like [TGI](https://github.com/huggingface/text-generation-inference) and [vLLM](https://vllm.readthedocs.io/en/latest/) even help you apply optimization techniques.[^4]

[^4]: [The Modular Inference Engine](https://www.modular.com/engine) is another example of an inference server that also applies optimization techniques.  At the time of this writing, this is proprietary technology, but it's worth keeping an eye on in the future.

# Notes On Tools

An important goal of this study, beyond benchmarking, was to try different platforms and take note of my impressions and how to use them.

## Text Generation Inference (TGI)

:::{.callout-warning}
### License Restrictions

The license for TGI was [recently changed](https://github.com/huggingface/text-generation-inference/commit/bde25e62b33b05113519e5dbf75abda06a03328e) away from Apache 2.0 to be more restrictive.  Be careful when using TGI in commercial applications.

:::


[Text Generation Inference](https://github.com/huggingface/text-generation-inference), which is often referred to as “TGI”, was easy to use without any optimization.  You can run it like this:

```{.bash filename="start_server.sh"}
#!/bin/bash

if [ -z "$HUGGING_FACE_HUB_TOKEN" ]
then
  echo "HUGGING_FACE_HUB_TOKEN is not set. Please set it before running this script."
  exit 1
fi

# Not referenced below: the model is chosen at launch time with --model-id.
model="TheBloke/Llama-2-7B-GPTQ"
volume=$PWD/data

docker run --gpus all \
 -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
 -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 \
 --shm-size 5g -p 8081:80 \
 -v $volume:/data ghcr.io/huggingface/text-generation-inference \
 --max-best-of 1 "$@"
```

We can then run the server with this command:

```bash
bash start_server.sh --model-id "meta-llama/Llama-2-7b-hf"
```
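Once the server is up, you can send it a request from Python.  The following is a minimal sketch using only the standard library, assuming the server is reachable on `localhost:8081` (the host port mapped in `start_server.sh`) and exposes TGI's `/generate` route.  The helper names `build_request` and `generate` are my own, not part of TGI.

```python
import json
import urllib.request

# Port 8081 comes from `-p 8081:80` in start_server.sh; the host is an assumption.
TGI_URL = "http://localhost:8081/generate"


def build_request(prompt, max_new_tokens=200, url=TGI_URL):
    """Build a POST request for TGI's /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["generated_text"]


# Example, which only works while the server is running:
# print(generate("What is the product of 9 and 8?"))
```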

:::{.callout-note}

#### Help

You can see all the options for the TGI container with the help flag like so:

```bash
docker run ghcr.io/huggingface/text-generation-inference --help | less
```
:::

### Quantization

Quantization was very difficult to get working.  There is a `--quantize` flag which accepts `bitsandbytes` and `gptq`.  The `bitsandbytes` approach makes inference __much__ slower, which [others have reported](https://github.com/huggingface/text-generation-inference/issues/309#issuecomment-1542124381).

Making `gptq` work for Llama v2 models requires a bunch of work.  You have to [install the text-generation-server](https://github.com/huggingface/text-generation-inference/tree/main/server), which can take a while and is very brittle to get right.  I had to step through the [Makefile](https://github.com/huggingface/text-generation-inference/blob/main/server/Makefile) carefully.  After that, you have to download the weights with:

```bash
text-generation-server download-weights meta-llama/Llama-2-7b-hf
```

You can run the following command to perform the quantization (the last argument is the destination directory where the weights are stored).

```bash
text-generation-server quantize "meta-llama/Llama-2-7b-hf" data/quantized/
```

**However, this step is not needed for the most popular models, as someone will likely already have quantized and uploaded them to the Hub.**

#### Pre-Quantized Models

Alternatively, you can use a pre-quantized model that has been uploaded to the Hub.  [TheBloke/Llama-2-7B-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ) is a good example of one.  To get this to work, you have to be careful to set the `GPTQ_BITS` and `GPTQ_GROUPSIZE` environment variables to match the config.  For example, [this config](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ/blob/main/quantize_config.json#L2-L3) necessitates setting `GPTQ_BITS=4` and `GPTQ_GROUPSIZE=128`.  These are already set in `start_server.sh` shown above.  [This PR](https://github.com/huggingface/text-generation-inference/pull/671) will eventually fix that.

To use [TheBloke/Llama-2-7B-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-GPTQ) with TGI, I can use the same bash script with the following arguments:

```bash
bash start_server.sh --model-id TheBloke/Llama-2-7B-GPTQ --quantize gptq
```
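If you switch to a different pre-quantized model, the two environment variables have to be re-derived from that model's `quantize_config.json`.  A small sketch of that mapping is below.  It assumes the config uses the key names `bits` and `group_size`; `gptq_env_from_config` is a hypothetical helper, not something TGI provides.

```python
import json


def gptq_env_from_config(config_text):
    """Map the contents of a quantize_config.json to the env vars set in start_server.sh."""
    config = json.loads(config_text)
    return {
        "GPTQ_BITS": str(config["bits"]),  # assumed key name
        "GPTQ_GROUPSIZE": str(config["group_size"]),  # assumed key name
    }


# Trimmed-down example config containing only the two values discussed above.
example_config = '{"bits": 4, "group_size": 128}'
env = gptq_env_from_config(example_config)

# Format the values as `docker run` flags.
print(" ".join(f"-e {key}={value}" for key, value in env.items()))
```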

## Text Generation WebUI

[Aman](https://twitter.com/tmm1/status/1683255057201135616?s=20) let me know about [text-generation-web-ui](https://github.com/oobabooga/text-generation-webui), and also [these instructions](https://github.com/paul-gauthier/aider/issues/110#issuecomment-1644318545) for quickly experimenting with [ExLlama](https://github.com/turboderp/exllama) and [ggml](https://github.com/ggerganov/ggml).

From the root of the [text-generation-web-ui](https://github.com/oobabooga/text-generation-webui) repo, you can run the following commands to start an inference server optimized with `ExLlama` or `ggml`, respectively:

```bash
python3 download-model.py TheBloke/Llama-2-7B-GPTQ
python3 server.py --listen --extensions openai --loader exllama_hf --model TheBloke_Llama-2-7B-GPTQ
```


```bash
python3 download-model.py TheBloke/Llama-2-7B-GGML
python3 server.py --listen --extensions openai --loader llamacpp --model TheBloke_Llama-2-7B-GGML
```

After the server was started, I used [this code](https://github.com/hamelsmu/llama-inference/blob/master/exllama/bench.py) to conduct the benchmark.

Overall, I didn't like this particular piece of software much.  It's a bit bloated because it's trying to do too many things at once (an inference server, web UIs, and other optimizations).  That being said, the documentation is good and it is easy to use.

I don't think there is any particular reason to use this unless you want an end-to-end solution that also comes with a web user-interface (which many people want!).

## vLLM

[vLLM](https://github.com/vllm-project/vllm.git) only works with CUDA 11.8, which I configured using [this approach](https://hamel.dev/notes/cuda.html).  After configuring CUDA and installing the right version of PyTorch, you need to install the bleeding edge from git:

```bash
pip install -U git+https://github.com/vllm-project/vllm.git
```
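Since the CUDA version matters here, a quick sanity check of the local PyTorch build can save some debugging.  This is a rough sketch: `cuda_version_matches` is a hypothetical helper, and the check only reads `torch.version.cuda`, which reports the CUDA version PyTorch was built against (it says nothing about the driver).

```python
def cuda_version_matches(found, required="11.8"):
    """Return True if the major.minor part of `found` equals `required`."""
    if not found:
        return False
    return found.split(".")[:2] == required.split(".")[:2]


try:
    import torch  # optional: only needed to inspect the local install

    found = torch.version.cuda  # None for CPU-only builds of PyTorch
except ImportError:
    found = None

print("PyTorch CUDA version:", found, "| matches 11.8:", cuda_version_matches(found))
```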

A good recipe to use for vLLM can be found in [these Modal docs](https://modal.com/docs/guide/ex/vllm_inference).  Surprisingly, I had lower latency when running on a local `A6000` vs. a hosted `A100` on Modal Labs.  It's possible that I did something wrong here.  Either way, **`vLLM` offered the lowest latency compared to everything else by a significant margin.**  If I really wanted to optimize for latency today, I would reach for `vLLM`.

`vLLM` [offers a server](https://vllm.readthedocs.io/en/latest/serving/distributed_serving.html), but I benchmarked the model locally using their tools instead.  The code for the benchmarking can be [found here](https://github.com/hamelsmu/llama-inference/blob/master/vllm/bench.py):

```python
import time

import pandas as pd
from tqdm import tqdm
from vllm import SamplingParams, LLM

#from https://modal.com/docs/guide/ex/vllm_inference

questions = [
    # Coding questions
    "Implement a Python function to compute the Fibonacci numbers.",
    "Write a Rust function that performs binary exponentiation.",
    "What are the differences between Javascript and Python?",
    # Literature
    "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.",
    "Who does Harry turn into a balloon?",
    "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history.",
    # Math
    "What is the product of 9 and 8?",
    "If a train travels 120 kilometers in 2 hours, what is its average speed?",
    "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.",
]

MODEL_DIR = "/home/ubuntu/hamel-drive/vllm-models"

def download_model_to_folder():
    from huggingface_hub import snapshot_download
    import os

    snapshot_download(
        "meta-llama/Llama-2-7b-hf",
        local_dir=MODEL_DIR,
        token=os.environ["HUGGING_FACE_HUB_TOKEN"],
    )
    return LLM(MODEL_DIR)


def generate(question, llm, note=None):
    response = {'question': question, 'note': note}
    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=1,
        max_tokens=200,
    )
    
    start = time.perf_counter()
    result = llm.generate(question, sampling_params)
    request_time = time.perf_counter() - start

    for output in result:
        response['tok_count'] = len(output.outputs[0].token_ids)
        response['time'] = request_time
        response['answer'] = output.outputs[0].text
    
    return response

if __name__ == '__main__':
    llm = download_model_to_folder()
    counter = 1
    responses = []

    for q in tqdm(questions):
        response = generate(question=q, llm=llm, note='vLLM')
        # Skip the first request, which only serves to warm up the model
        if counter >= 2:
            responses.append(response)
        counter += 1
    
    df = pd.DataFrame(responses)
    df.to_csv('bench-vllm.csv', index=False)
```

## HuggingFace Inference Endpoint

I deployed an [inference endpoint](https://ui.endpoints.huggingface.co/) on HuggingFace for [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf), on an `Nvidia A10G` GPU. I didn't turn on any optimizations like quantization because I wanted to see what the default performance would be like.

The documentation for these interfaces can be found [here](https://huggingface.github.io/text-generation-inference/#/).  There is also [a Python client](https://huggingface.co/docs/huggingface_hub/package_reference/inference_client#huggingface_hub.InferenceClient.text_generation).

Their documentation says they are using TGI under the hood.  However, my latency was significantly lower on their hosted inference platform than when using TGI locally.  This could be due to the fact that I used an `A10G` with them but an `A6000` locally.  It's worth looking further into why this discrepancy exists.

The code for this benchmark can be found [here](https://github.com/hamelsmu/llama-inference/blob/master/hf-endpoint/bench.py).