Basic TGI server on XLA #1
Conversation
The code is actually going to be modified later, but this commit allows for a clearer comparison with the original code. The reference version of the original code is v0.0.18.
More files copied over; these will be used too.
The server is based on the optimum-neuron code; in particular, it uses many features of NeuronModelForCausalLM. This commit should allow the server to run on any AutoModelForCausalLM, without the Neuron requirements. Notable changes are the shallow modelling implementation (just a wrapper), the KV cache and position ids handling, and the test adaptations. With this, most of the server tests pass correctly, though the integration tests do not. On some platforms, tests using "sample" do not work as expected for now.
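As a rough sketch of the idea (hypothetical names, not the actual server code): a shallow wrapper around `AutoModelForCausalLM` mostly has to thread the KV cache and position ids through the prefill and decode steps, for example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class CausalLMWrapper:
    """Thin, Neuron-free wrapper: a plain transformers model plus KV cache handling."""

    def __init__(self, model_id: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    @torch.no_grad()
    def prefill(self, text: str):
        inputs = self.tokenizer(text, return_tensors="pt")
        outputs = self.model(**inputs, use_cache=True)
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        # The position of the next token is the current prompt length.
        return next_token, outputs.past_key_values, inputs["input_ids"].shape[1]

    @torch.no_grad()
    def decode(self, token, past_key_values, position: int):
        outputs = self.model(
            input_ids=token,
            past_key_values=past_key_values,
            position_ids=torch.tensor([[position]]),
            use_cache=True,
        )
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        return next_token, outputs.past_key_values, position + 1
```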
It is now possible to build the image by calling `make tpu-tgi`.
This is to avoid confusion in the future. The only part left untouched is the integration tests directory, because those tests have not been adapted yet.
This also fixes the build on macOS.
It seems that different systems use different random generators. This leads to different results and failing tests when using do_sample (even if the logits are close). Tests using do_sample are removed and replaced by other parameter variations.
The TGI server now compiles models and runs them on the XLA backend. Compilation is quite slow and might not be perfect, but it is a first step toward supporting TPUs. Note that with this change one test now fails, probably due to an issue with XLA compilation, to be fixed later.
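For context, a minimal sketch of what running on the XLA backend looks like with torch_xla (the model id here is just an example, and this is not the server code itself):

```python
import torch
import torch_xla.core.xla_model as xm
from transformers import AutoModelForCausalLM, AutoTokenizer

device = xm.xla_device()  # the TPU (or other XLA device) exposed through torch_xla

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("Hello", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits  # recorded lazily on the XLA device
xm.mark_step()  # materializes the graph, which triggers the (slow) XLA compilation
```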
text-generation-inference/Dockerfile
```
# Build cargo components (adapted from TGI original Dockerfile)
# Note that the build image is aligned on the same Linux version as the base image (Debian bookworm / Ubuntu 22.04)
FROM lukemathwalker/cargo-chef:latest-rust-1.75-bookworm AS chef
```
Do we want to update to latest Rust 1.76?
text-generation-inference/Dockerfile
```
ARG TRANSFORMERS='4.38.0'
ARG ACCELERATE='0.27.2'
ARG SAFETENSORS='0.4.2'
```
nit: What about having a `_VERSION` suffix? I find it clearer, personal take.
- the service uses a single internal static batch,
- new requests are inserted in the static batch during prefill,
- the static KV cache is rebuilt entirely during prefill.
Is this the case for TPU?
For now it is (optimum-neuron was doing so, and I did the same). I can change it later.
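To make the scheduling described above a bit more concrete, here is a very rough sketch of a single static batch with slot insertion at prefill time (hypothetical names, not the actual implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Slot:
    request_id: Optional[int] = None
    tokens: List[int] = field(default_factory=list)

    @property
    def empty(self) -> bool:
        return self.request_id is None


@dataclass
class StaticBatch:
    # One internal batch of fixed size; the static KV cache (not shown) has a
    # matching fixed shape and would be rebuilt for all occupied slots at prefill.
    slots: List[Slot]

    @classmethod
    def create(cls, batch_size: int) -> "StaticBatch":
        return cls(slots=[Slot() for _ in range(batch_size)])

    def prefill(self, new_requests: List[Tuple[int, List[int]]]) -> None:
        # New requests are inserted into free slots during prefill.
        free = (slot for slot in self.slots if slot.empty)
        for slot, (request_id, prompt_tokens) in zip(free, new_requests):
            slot.request_id = request_id
            slot.tokens = list(prompt_tokens)
```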
```
docker run -p 8080:80 \
    --net=host --privileged \
```
QQ: Do we need the `--privileged`?
Yes, it's needed to expose the TPU to the docker container.
```
docker run -p 8080:80 \
    --net=host --privileged \
```
Same: Do we need the `--privileged`?
It is required to expose the TPU to the container, cf. https://cloud.google.com/tpu/docs/run-in-container
```
# This will retrieve the model snapshot and cache it.
start = time.time()
logger.info(f"Fetching revision {revision} of model {model_id}.")
model_path = snapshot_download(model_id, revision=revision)
```
QQ: Why do we need to use `snapshot_download`? Can't we use `from_pretrained(model_id, revision)`?
OK, I am removing that; we will re-introduce something similar later if we need to.
`from_pretrained` will do more steps to achieve the very same result, and won't fetch the tokenizer.
I re-split the download and model instantiation steps.
text-generation-inference/server/text_generation_server/modelling.py
No need to snapshot_download anymore.
To be more consistent with other 🤗 projects.
```
@@ -62,31 +62,29 @@ def create_request(
 @pytest.mark.parametrize(
-    "input_text, token_id, token_text, do_sample",
+    "input_text, token_id, token_text",
```
I don't understand: since the generator is using a seed, results are deterministic. You of course need to update the expected results for your platform (because of different underlying graphs of operations), but you should not remove the sampling tests.
If the results are not deterministic, then this is a bug.
OK I am re-adding them, adapting them to the given platform.
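As a side note on the determinism argument, a tiny standalone example (unrelated to the actual test fixtures): with a fixed seed, sampling is reproducible on a given platform, even though the chosen token can differ between platforms because the underlying graphs of operations differ.

```python
import torch


def sample_once(seed: int) -> torch.Tensor:
    torch.manual_seed(seed)
    logits = torch.randn(1, 1000)  # stand-in for model logits
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)


# Same seed, same platform -> same sampled token.
assert torch.equal(sample_once(42), sample_once(42))
```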
logger.info(f"Fetching revision {revision} of model {model_id}.") | ||
model_path = snapshot_download(model_id, revision=revision) | ||
end = time.time() | ||
logger.info(f"Model successfully fetched in {end - start:.2f} s.") | ||
# This will allow to set config to update specific config such as |
The TGI sequence of calls is:
- download (no time-out),
- server (time-out on server ready).
The model fetch is primarily supposed to be called in the first case; it is only also called in the second case to return the path where the snapshot was downloaded.
The snapshot download is the most efficient download option, as it does not trigger any other processing associated with from_pretrained (such as, in your case, compiling the model).
I think that with that change you will always compile the model twice.
I split, and now compilation only happens once.
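A minimal sketch of the split (standard huggingface_hub and transformers APIs, not necessarily the exact server code): the download step only fetches and caches files, while the instantiation step, the one that triggers compilation, reuses the local path, so nothing is downloaded or compiled twice.

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer


def fetch_model(model_id: str, revision: str = "main") -> str:
    # Download step (no time-out in TGI): only retrieves and caches the snapshot.
    return snapshot_download(model_id, revision=revision)


def load_model(model_path: str):
    # Server step: instantiation (and any compilation) happens exactly once,
    # from the already-downloaded local path.
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    return model, tokenizer
```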
This reverts commit 2eb0958. It actually seems like a good idea to have tests with do_sample. A commit will follow that adapts the tests to the platform where they are supposed to run.
Nice work, Alvaro. LGTM, just some small comments.
README.md
```
@@ -0,0 +1,7 @@
# Optimum-TPU

This repo contains the code to optimize running 🤗 transformer models on Google TPUs.
```
This repository contains the code designed for optimizing the execution of 🤗 transformer models on Google TPUs.
README.md
```
## Text-Generation-Inference

This repository maintains a [text-generation-inference (TGI)](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference) docker image for deployment on Google TPUs.
```
image or file?
For now just the Dockerfile. I will correct this.
text-generation-inference/Dockerfile
```
RUN pip install "torch~=2.2.0" "torch_xla[tpu]~=2.2.0" -f https://storage.googleapis.com/libtpu-releases/index.html

# Install HuggingFace packages
ARG TRANSFORMERS_VERSION='4.38.0'
```
Why not 4.38.1?
LGTM - Congrats @tengomucho!
What does this PR do?
This is an adaptation of a simple TGI server, mostly taken from optimum-neuron, first adapted to run on CPU and then mapped onto XLA, with the model compiled there.
It also includes a Docker image targeting XLA/TPU and a few tests, though some now fail since compilation was added.
While the work is not complete, it is a first step toward XLA/TPU support for TGI.