# Introduction
This cookbook shows you how to benchmark TGI properly. For background and a more detailed explanation, read this [popular blog](https://huggingface.co/blog/tgi-benchmarking) first.

## Setup
Make sure you have an environment with TGI installed; Docker is a great choice. The commands here can be copied and pasted into a terminal, which might be even easier, so don't feel compelled to use Jupyter. If you just want to test this out, you can duplicate and use [derek-thomas/tgi-benchmark-space](https://huggingface.co/spaces/derek-thomas/tgi-benchmark-space).

# TGI Launcher

In [None]:
!text-generation-launcher --version

/bin/bash: line 1: text-generation-launcher: command not found


Below we can see the different settings for TGI. Be sure to read through them and decide which settings are most
important for your use-case.

Here are some of the most important ones:
- `--model-id`
- `--quantize` Quantization saves memory, but does not always improve speed
- `--max-input-tokens` This allows TGI to optimize the prefilling operation
- `--max-total-tokens` Combined with `--max-input-tokens`, this tells TGI the maximum number of input and output tokens
- `--max-batch-size` This lets TGI know how many requests it can process at once

Together, the last three give TGI the constraints it needs to optimize for your use-case. Setting them as tightly as your workload allows can yield substantial performance improvements.
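
To make the relationship between these three limits concrete, here is a minimal arithmetic sketch. The numbers are hypothetical values picked for illustration, not TGI defaults, and the worst-case figure is only a rough upper bound that assumes every request in a full batch uses its entire token budget.

```python
# Hypothetical limits chosen for illustration; these are NOT TGI defaults.
max_input_tokens = 1024   # value you would pass to --max-input-tokens
max_total_tokens = 2048   # value you would pass to --max-total-tokens
max_batch_size = 32       # value you would pass to --max-batch-size

# The output budget per request is whatever the total leaves after the prompt.
max_output_tokens = max_total_tokens - max_input_tokens

# Rough upper bound on tokens in flight when a full batch uses its whole budget.
worst_case_tokens = max_batch_size * max_total_tokens

print(f"max output tokens per request: {max_output_tokens}")
print(f"worst-case tokens in flight:   {worst_case_tokens}")
```

Plugging in the limits for your own workload gives a quick sanity check before you launch the server.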

In [None]:
!text-generation-launcher -h

/bin/bash: line 1: text-generation-launcher: command not found


We can launch directly from the cookbook since we don't need the command to be interactive.

We will just be using defaults in this cookbook as the intent is to understand the benchmark tool.

If you are running on a Space, these parameters are set explicitly so that TGI does not conflict with the Spaces server:
- `--hostname`
- `--port`

Feel free to change or remove them based on your requirements.

In [None]:
!RUST_BACKTRACE=1 \
text-generation-launcher \
--model-id astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \
--quantize gptq \
--hostname 0.0.0.0 \
--port 1337

/bin/bash: line 1: text-generation-launcher: command not found
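
Loading the model can take a while, so it can help to wait until the server answers before starting a benchmark. The sketch below polls a URL until it returns HTTP 200. It assumes TGI's `/health` route is reachable on the hostname and port used above; the helper names `is_healthy` and `wait_until_ready` are hypothetical and not part of TGI.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Call `probe()` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

From a separate terminal or notebook you could then call `wait_until_ready(lambda: is_healthy("http://localhost:1337/health"))`, adjusting the URL to match your `--hostname` and `--port`.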


# TGI Benchmark
Now let's learn how to launch the benchmark tool!

Below we can see the different settings for TGI Benchmark.

Here are some of the more important TGI Benchmark settings:

- `--tokenizer-name` This is required so the tool knows what tokenizer to use
- `--batch-size` This is important for load testing. Use enough values to see how throughput and latency change. Note that in the benchmarking tool, batch size means the number of virtual users.
- `--sequence-length` AKA input tokens; it is important to match your use-case needs
- `--decode-length` AKA output tokens; it is important to match your use-case needs
- `--runs` 10 is the default

<blockquote style="border-left: 5px solid #80CBC4; background: #263238; color: #CFD8DC; padding: 0.5em 1em; margin: 1em 0;">
  <strong>💡 Tip:</strong> Use a low number for <code style="background: #37474F; color: #FFFFFF; padding: 2px 4px; border-radius: 4px;">--runs</code> when you are exploring, but a higher number as you finalize, to get more precise statistics.
</blockquote>
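
As a rough guide to how much precision more runs buy, the sketch below uses the textbook result that the standard error of a mean shrinks with the square root of the number of samples. It assumes the runs are independent and identically distributed, which is an idealization; `relative_standard_error` is a hypothetical helper, not part of the benchmark tool.

```python
import math


def relative_standard_error(runs: int) -> float:
    """Standard error of the mean, as a fraction of the per-run standard deviation."""
    return 1.0 / math.sqrt(runs)


# Compare a quick exploratory setting, the default of 10, and a larger final setting.
for runs in (3, 10, 100):
    print(f"--runs {runs:>3}: {relative_standard_error(runs):.3f} x per-run std dev")
```

Under that assumption, going from 10 to 100 runs cuts the uncertainty of the reported mean by a factor of about 3.2, at ten times the benchmarking time.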


In [None]:
!text-generation-benchmark -h

/bin/bash: line 1: text-generation-benchmark: command not found


Here is an example command. Notice that I pass `--batch-size` once for each batch size of interest so that the benchmark tool uses all of them. I also chose the batch sizes based on estimated user activity.

<blockquote style="border-left: 5px solid #FFAB91; background: #37474F; color: #FFCCBC; padding: 0.5em 1em; margin: 1em 0;">
  <strong>⚠️ Warning:</strong> Please note that the TGI Benchmark tool is designed to work in a terminal, not a Jupyter notebook. This means you will need to copy/paste the command into a Jupyter terminal tab. I am putting it here for convenience.
</blockquote>


In [None]:
!text-generation-benchmark \
--tokenizer-name astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \
--sequence-length 70 \
--decode-length 50 \
--batch-size 1 \
--batch-size 2 \
--batch-size 4 \
--batch-size 8 \
--batch-size 16 \
--batch-size 32 \
--batch-size 64 \
--batch-size 128

/bin/bash: line 1: text-generation-benchmark: command not found
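
Because `--batch-size` is repeated once per value, the command gets long quickly. The sketch below builds the same command from a list of batch sizes so you can paste the result into a terminal. `build_benchmark_command` is a hypothetical helper, and it only uses the flags described above.

```python
import shlex


def build_benchmark_command(tokenizer_name, sequence_length, decode_length,
                            batch_sizes, runs=None):
    """Return the benchmark command as a list of arguments."""
    args = [
        "text-generation-benchmark",
        "--tokenizer-name", tokenizer_name,
        "--sequence-length", str(sequence_length),
        "--decode-length", str(decode_length),
    ]
    # The flag is repeated once for every batch size of interest.
    for batch_size in batch_sizes:
        args += ["--batch-size", str(batch_size)]
    # Leave --runs out entirely to keep the tool's default.
    if runs is not None:
        args += ["--runs", str(runs)]
    return args


batch_sizes = [2 ** i for i in range(8)]  # 1, 2, 4, ..., 128
command = build_benchmark_command(
    "astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit",
    sequence_length=70,
    decode_length=50,
    batch_sizes=batch_sizes,
)
print(shlex.join(command))
```

Printing the command instead of running it keeps the tool in a terminal, where it is designed to work.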
