Basic TGI server on XLA #1
Conversation
The code is actually going to be modified later, but this commit allows for a clearer comparison with the original code. The reference version of the original code is v0.0.18.
More files copied over; these will be used too.
The server is based on the optimum-neuron code; in particular, it uses many features of NeuronModelForCausalLM. This commit should allow the server to run on any AutoModelForCausalLM, without the Neuron requirements. Notable changes are the shallow modelling implementation (just a wrapper), the KV cache and position ids handling, and the test adaptations. With this, most of the server tests pass correctly, though the integration tests do not. On some platforms, tests using "sample" do not work as expected for now.
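As a rough sketch of the idea (hypothetical names, not the actual server code): a shallow wrapper around `AutoModelForCausalLM` mostly has to thread the KV cache and position ids through the prefill and decode steps, for example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class CausalLMWrapper:
    """Thin, Neuron-free wrapper: a plain transformers model plus KV cache handling."""

    def __init__(self, model_id: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    @torch.no_grad()
    def prefill(self, text: str):
        inputs = self.tokenizer(text, return_tensors="pt")
        outputs = self.model(**inputs, use_cache=True)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # The position of the next token is the current prompt length.
        return next_token, outputs.past_key_values, inputs["input_ids"].shape[1]

    @torch.no_grad()
    def decode(self, token, past_key_values, position: int):
        outputs = self.model(
            input_ids=token,
            past_key_values=past_key_values,
            position_ids=torch.tensor([[position]]),
            use_cache=True,
        )
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        return next_token, outputs.past_key_values, position + 1
```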
It is now possible to build the image by calling `make tpu-tgi`.
This is to avoid confusion in the future. The only part left untouched is the integration tests directory, because those tests have not been adapted yet.
This also fixes the build on macOS.
It seems that different systems use different random generators. This leads to different results and failing tests when using do_sample (even if the logits are close). Tests using do_sample are removed and replaced by other parameter variations.
The TGI server now compiles models and runs them on the XLA backend. Compilation is quite slow and might not be perfect, but it is a first step toward supporting TPUs. Note that with this change one test now fails, probably due to an issue with XLA compilation, to be fixed later.
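For context, a minimal sketch of what running on the XLA backend looks like with torch_xla (the model id here is just an example, and this is not the server code itself):

```python
import torch
import torch_xla.core.xla_model as xm
from transformers import AutoModelForCausalLM, AutoTokenizer

device = xm.xla_device()  # the TPU (or other XLA device) exposed through torch_xla

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("Hello", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits  # recorded lazily on the XLA device
xm.mark_step()  # materializes the graph, which triggers the (slow) XLA compilation
```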
text-generation-inference/Dockerfile
```
# Build cargo components (adapted from TGI original Dockerfile)
# Note that the build image is aligned on the same Linux version as the base image (Debian bookworm / Ubuntu 22.04)
FROM lukemathwalker/cargo-chef:latest-rust-1.75-bookworm AS chef
```
Do we want to update to latest Rust 1.76?
text-generation-inference/Dockerfile
```
ARG TRANSFORMERS='4.38.0'
ARG ACCELERATE='0.27.2'
ARG SAFETENSORS='0.4.2'
```
nit: What about having a `_VERSION` suffix? I find it clearer, personal take.
- the service uses a single internal static batch,
- new requests are inserted in the static batch during prefill,
- the static KV cache is rebuilt entirely during prefill.
Is this the case for TPU?
For now it is (optimum-neuron was doing so, and I did the same). I can change it later.
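To make the scheduling described above a bit more concrete, here is a very rough sketch of a single static batch with slot insertion at prefill time (hypothetical names, not the actual implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Slot:
    request_id: Optional[int] = None
    tokens: List[int] = field(default_factory=list)

    @property
    def empty(self) -> bool:
        return self.request_id is None


@dataclass
class StaticBatch:
    # One internal batch of fixed size; the static KV cache (not shown) has a
    # matching fixed shape and would be rebuilt for all occupied slots at prefill.
    slots: List[Slot]

    @classmethod
    def create(cls, batch_size: int) -> "StaticBatch":
        return cls(slots=[Slot() for _ in range(batch_size)])

    def prefill(self, new_requests: List[Tuple[int, List[int]]]) -> None:
        # New requests are inserted into free slots during prefill.
        free = (slot for slot in self.slots if slot.empty)
        for slot, (request_id, prompt_tokens) in zip(free, new_requests):
            slot.request_id = request_id
            slot.tokens = list(prompt_tokens)
```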
```
docker run -p 8080:80 \
    --net=host --privileged \
```
QQ: Do we need the `--privileged`?
Yes, it's needed to expose the TPU to the docker container.
```
docker run -p 8080:80 \
    --net=host --privileged \
```
Same: Do we need the `--privileged`?
It is required to expose the TPU to the container, cf. https://cloud.google.com/tpu/docs/run-in-container
```
# This will retrieve the model snapshot and cache it.
start = time.time()
logger.info(f"Fetching revision {revision} of model {model_id}.")
model_path = snapshot_download(model_id, revision=revision)
```
QQ: Why do we need to use `snapshot_download`? Can't we use `from_pretrained(model_id, revision)`?
OK, I am removing that; we will re-introduce something similar later if we need to.
`from_pretrained` will do more steps to achieve the very same result, and won't fetch the tokenizer.
I re-split the download and model instantiation steps.
text-generation-inference/server/text_generation_server/modelling.py
No need to snapshot_download anymore.
To be more consistent with other 🤗 projects.
```
@@ -62,31 +62,29 @@ def create_request(
 @pytest.mark.parametrize(
-    "input_text, token_id, token_text, do_sample",
+    "input_text, token_id, token_text",
```
I don't understand: since the generator is using a seed, results are deterministic. You of course need to update the expected results for your platform (because of different underlying graphs of operations), but you should not remove the sampling tests.
If the results are not deterministic, then this is a bug.
OK I am re-adding them, adapting them to the given platform.
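As a side note on the determinism argument, a tiny standalone example (unrelated to the actual test fixtures): with a fixed seed, sampling is reproducible on a given platform, even though the chosen token can differ between platforms because the underlying graphs of operations differ.

```python
import torch


def sample_once(seed: int) -> torch.Tensor:
    torch.manual_seed(seed)
    logits = torch.randn(1, 1000)  # stand-in for model logits
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)


# Same seed, same platform -> same sampled token.
assert torch.equal(sample_once(42), sample_once(42))
```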
logger.info(f"Fetching revision {revision} of model {model_id}.") | ||
model_path = snapshot_download(model_id, revision=revision) | ||
end = time.time() | ||
logger.info(f"Model successfully fetched in {end - start:.2f} s.") | ||
# This will allow to set config to update specific config such as |
The TGI sequence of calls is:
- download (no time-out),
- server (time-out on server ready).
The model fetch is primarily supposed to be called in the first case; it is only also called in the second case to return the path where the snapshot was downloaded.
The snapshot download is the most efficient download option, as it does not trigger any other processing associated with from_pretrained (such as, in your case, compiling the model).
I think that with that change you will always compile the model twice.
I split, and now compilation only happens once.
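A minimal sketch of the split (standard huggingface_hub and transformers APIs, not necessarily the exact server code): the download step only fetches and caches files, while the instantiation step, the one that triggers compilation, reuses the local path, so nothing is downloaded or compiled twice.

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer


def fetch_model(model_id: str, revision: str = "main") -> str:
    # Download step (no time-out in TGI): only retrieves and caches the snapshot.
    return snapshot_download(model_id, revision=revision)


def load_model(model_path: str):
    # Server step: instantiation (and any compilation) happens exactly once,
    # from the already-downloaded local path.
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    return model, tokenizer
```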
This reverts commit 2eb0958. It actually seems like a good idea to have tests with do_sample. A commit will follow that adapts the tests to the platform where they are supposed to run.
Nice work, Alvaro. LGTM, just some small comments.
README.md
```
@@ -0,0 +1,7 @@
# Optimum-TPU

This repo contains the code to optimize running 🤗 transformer models on Google TPUs.
```
This repository contains the code designed for optimizing the execution of 🤗 transformer models on Google TPUs.
README.md
```
## Text-Generation-Inference

This repository maintains a [text-generation-inference (TGI)](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference) docker image for deployment on Google TPUs.
```
image or file?
For now just the Dockerfile. I will correct this.
text-generation-inference/Dockerfile
```
RUN pip install "torch~=2.2.0" "torch_xla[tpu]~=2.2.0" -f https://storage.googleapis.com/libtpu-releases/index.html

# Install HuggingFace packages
ARG TRANSFORMERS_VERSION='4.38.0'
```
Why not 4.38.1?
LGTM - Congrats @tengomucho!
What does this PR do?
This is an adaptation of a simple TGI server, mostly taken from optimum-neuron, first adapted to run on CPU and then mapped onto XLA, with the model compiled there.
It also includes a Docker image targeting XLA/TPU and a few tests, though some now fail since compilation was added.
While the work is not complete, it is a first step toward XLA/TPU support for TGI.