
Basic TGI server on XLA #1

Merged: 25 commits merged into main from basic-tgi-server on Feb 27, 2024.
Conversation

tengomucho (Collaborator):

What does this PR do?

This is the adaptation of a simple TGI server, mostly taken from optimum-neuron: it was first adapted to run on CPU, then mapped onto XLA with the model compiled there.
It also includes a Docker image targeting XLA/TPU and a few tests, though some now fail after compilation was added.
While the work is not complete, this is a first step towards XLA/TPU support for TGI.

The code is actually going to be modified later, but this commit makes it
easier to see the differences from the original code.

The reference version of the original code is v0.0.18.
More files have been copied over; these will be used too.
The server code is based on the optimum-neuron code; in particular it uses
many features of NeuronModelForCausalLM.
This commit should allow the server to be used with any AutoModelForCausalLM,
without the Neuron requirements.
Notable changes are the shallow modelling implementation (just a wrapper),
the KV cache and position ids handling, and the test adaptations.
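To make the "shallow" wrapper idea concrete, here is a minimal sketch (all names are hypothetical, not the PR's actual classes): the wrapper merely forwards to an underlying AutoModelForCausalLM, leaving the KV cache and position ids to the calling code.

```
from transformers import AutoModelForCausalLM


class ShallowModel:
    """Thin wrapper around a causal LM: the KV cache (past_key_values) and
    position ids are managed explicitly by the caller, not hidden inside."""

    def __init__(self, model_id: str):
        self.model = AutoModelForCausalLM.from_pretrained(model_id)

    def __call__(self, input_ids, attention_mask, position_ids, past_key_values=None):
        return self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            use_cache=True,
        )
```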

This makes most of the server tests pass correctly, though not the
integration tests.
On some platforms, the tests using "sample" do not work as expected for now.
It is now possible to build the image by calling `make tpu-tgi`.
To avoid confusion in the future, the only part left untouched is the
integration tests directory, because these tests have not been adapted yet.
It seems that different systems use different generators. This leads to
different results and failing tests when using do_sample (even if the logits
are close).
Tests using do_sample are removed and replaced by other variations of
parameters.
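As a minimal illustration of the problem (toy values, not the repo's tests): with a fixed seed, sampling is reproducible on one platform, but another backend may draw a different token from nearly identical logits.

```
import torch

logits = torch.tensor([2.0, 1.5, 0.5, 0.1])
probs = torch.softmax(logits, dim=-1)

# Deterministic on a given platform thanks to the seeded generator...
generator = torch.Generator().manual_seed(42)
token = torch.multinomial(probs, num_samples=1, generator=generator)
# ...but a different backend/generator may pick another token, even though
# the probabilities are (almost) the same.
print(token.item())
```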
The TGI server now compiles models and runs them on the XLA backend.
Compilation is quite slow and might not be perfect, but it is a first step
towards supporting TPUs.

Note that after doing this, one test now fails, probably due to an issue
with XLA compilation, to be fixed later.
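For reference, a rough sketch of what running a model through torch_xla looks like (gpt2 is a stand-in model id; this is not the PR's server code): operations are recorded lazily and the graph is compiled when the step is marked, which is where the slow first-time compilation happens.

```
import torch
import torch_xla.core.xla_model as xm
from transformers import AutoModelForCausalLM, AutoTokenizer

device = xm.xla_device()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("Hello", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
xm.mark_step()  # materializes the lazy graph, triggering XLA compilation
```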

```
# Build cargo components (adapted from TGI original Dockerfile)
# Note that the build image is aligned on the same Linux version as the base image (Debian bookworm / Ubuntu 22.04)
FROM lukemathwalker/cargo-chef:latest-rust-1.75-bookworm AS chef
```
Member:

Do we want to update to latest Rust 1.76?

Comment on lines 95 to 98:

```
ARG TRANSFORMERS='4.38.0'
ARG ACCELERATE='0.27.2'
ARG SAFETENSORS='0.4.2'
```
Member:

nit: what about having a _VERSION suffix? I find it clearer (personal take).

Comment on lines +18 to +20:
- the service uses a single internal static batch,
- new requests are inserted in the static batch during prefill,
- the static KV cache is rebuilt entirely during prefill.
Member:

Is this the case for TPU?

tengomucho (Collaborator, Author):

For now it is (optimum-neuron was doing so, so I did the same). I can change it later.
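For readers unfamiliar with the scheme, a small sketch of the static-batch idea (sizes and names are made up for illustration): a fixed number of slots, with new requests dropped into free slots at prefill time.

```
import torch

BATCH_SIZE, MAX_LEN = 4, 128  # illustrative values only
input_ids = torch.zeros((BATCH_SIZE, MAX_LEN), dtype=torch.long)
slot_free = [True] * BATCH_SIZE


def insert_request(tokens: torch.Tensor) -> int:
    """Insert a new request into the first free slot during prefill."""
    slot = slot_free.index(True)  # raises ValueError if the batch is full
    slot_free[slot] = False
    input_ids[slot, : tokens.numel()] = tokens
    return slot
```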


```
docker run -p 8080:80 \
--net=host --privileged \
```
Member:

QQ: do we need the `--privileged`?

Contributor:

Yes, it's needed to expose the TPU to the Docker container.


```
docker run -p 8080:80 \
--net=host --privileged \
```
Member:

Same question: do we need the `--privileged`?

tengomucho (Collaborator, Author):

It is required to expose the TPU to the container; cf. https://cloud.google.com/tpu/docs/run-in-container

text-generation-inference/integration-tests/test_gpt2.py: four review threads (outdated, resolved).
```
# This will retrieve the model snapshot and cache it.
start = time.time()
logger.info(f"Fetching revision {revision} of model {model_id}.")
model_path = snapshot_download(model_id, revision=revision)
```
Member:

QQ: why do we need to use snapshot_download? Can't we use from_pretrained(model_id, revision)?

tengomucho (Collaborator, Author):

OK, I am removing that; we will re-introduce something similar later if needed.

Reviewer:

from_pretrained will do more steps to achieve the very same result, and won't fetch the tokenizer.

tengomucho (Collaborator, Author):

I re-split the download and model instantiation steps.

No need to snapshot_download anymore.
To be more coherent with other 🤗 projects.
```
@@ -62,31 +62,29 @@ def create_request(


 @pytest.mark.parametrize(
-    "input_text, token_id, token_text, do_sample",
+    "input_text, token_id, token_text",
```
Reviewer:

I don't understand: since the generator is using a seed, results are deterministic. You of course need to update the expected results for your platform (because of different underlying graphs of operations), but you should not remove the sampling tests.
If the results are not deterministic, then this is a bug.

tengomucho (Collaborator, Author):

OK, I am re-adding them and adapting them to the given platform.
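A sketch of what the re-added tests could look like (the prompt and expected values below are placeholders, to be recorded per platform):

```
import pytest


@pytest.mark.parametrize(
    "input_text, token_id, token_text, do_sample",
    [
        # Greedy decoding: stable across platforms.
        ("It was a bright cold day in April", 13, ",", False),
        # Sampling: deterministic per platform, so the expected token
        # must be recorded on the target platform.
        ("It was a bright cold day in April", 262, " the", True),
    ],
)
def test_decode_single(input_text, token_id, token_text, do_sample):
    ...  # generate one token and compare against the expected values
```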

```
logger.info(f"Fetching revision {revision} of model {model_id}.")
model_path = snapshot_download(model_id, revision=revision)
end = time.time()
logger.info(f"Model successfully fetched in {end - start:.2f} s.")
# This will allow to set config to update specific config such as
```
Reviewer:

The TGI sequence of calls is:

- download (no time-out),
- server (time-out on server ready).

The fetch-model function is primarily supposed to be called in the first case: it is only called in the second case to return the path where the snapshot was downloaded.
The snapshot download is the most efficient download option, as it does not trigger any other processing associated with from_pretrained (like, in your case, compiling the model).
I think that with that change you will always compile the model twice.

tengomucho (Collaborator, Author):

I split it, and now compilation only happens once.
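A sketch of what the resulting split could look like (helper names are hypothetical): the download step keeps using snapshot_download, as in the quoted code, and model instantiation, and hence compilation, runs once on the local path.

```
import logging
import time

from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

logger = logging.getLogger(__name__)


def fetch_model(model_id: str, revision: str) -> str:
    """Download only: no model processing, safe to run without a time-out."""
    start = time.time()
    logger.info(f"Fetching revision {revision} of model {model_id}.")
    model_path = snapshot_download(model_id, revision=revision)
    logger.info(f"Model successfully fetched in {time.time() - start:.2f} s.")
    return model_path


def load_model(model_path: str):
    """Instantiation (and hence XLA compilation) happens once, here."""
    return AutoModelForCausalLM.from_pretrained(model_path)
```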

This reverts commit 2eb0958.

It actually seems like a good idea to have tests with do_sample.
A commit will follow that adapts the tests to the platform where they are
supposed to run.
shub-kris (Contributor) left a comment:

Nice work, Alvaro. LGTM, just some small comments.

README.md (outdated):

```
@@ -0,0 +1,7 @@
# Optimum-TPU

This repo contains the code to optimize running 🤗 transformer models on Google TPUs.
```
Contributor:

This repository contains the code designed for optimizing the execution of 🤗 transformer models on Google TPUs.

README.md (outdated):

```
## Text-Generation-Inference

This repository maintains a [text-generation-inference (TGI)](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference) docker image for deployment on Google TPUs.
```
Contributor:

image or file?

tengomucho (Collaborator, Author):

For now it is just the Dockerfile. I will correct this.

```
RUN pip install "torch~=2.2.0" "torch_xla[tpu]~=2.2.0" -f https://storage.googleapis.com/libtpu-releases/index.html

# Install HuggingFace packages
ARG TRANSFORMERS_VERSION='4.38.0'
```
Contributor:

Why not 4.38.1?

mfuntowicz (Member) left a comment:

LGTM - Congrats @tengomucho!

@tengomucho merged commit 3a8a276 into main on Feb 27, 2024.
@mfuntowicz deleted the basic-tgi-server branch on April 29, 2024 at 07:54.