Several Inference Endpoint fixes #66

Merged
tengomucho merged 12 commits into main on Jul 3, 2024
Conversation

tengomucho (Collaborator)
What does this PR do?

  • Removed variables from entrypoint.sh; they are passed to the launcher via environment variables
  • Prevent issues with the numpy 2.0 release (see the sketch after this list)
  • Correct the TGI version
  • Add the GKE ulimit command
  • Correct CachedBatch serialization when it is None (it was causing the health call to crash)
  • Perform prefill and decode input pre-processing on CPU, to avoid wasting memory and compilation time on TPU
  • Warmup now clears after prefill
  • Fix the clear implementation
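
The numpy item above most likely corresponds to a dependency pin (something like "numpy<2.0" in the package requirements). The snippet below is only an illustrative runtime guard for that constraint, not the actual change from this PR:

import numpy as np

# Illustrative guard only: the real fix is presumably a "numpy<2.0" pin in the
# package requirements rather than a runtime check like this one.
if int(np.__version__.split(".")[0]) >= 2:
    raise RuntimeError(f"numpy {np.__version__} is not supported, please install numpy<2.0")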

mfuntowicz and others added 11 commits June 27, 2024 11:29
This was generating a tricky error when calling "/health" at server
startup: the health check triggers a prefill that returns None as the
cached batch, and that None then failed to be serialized.
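
A minimal sketch of the guard this describes, with illustrative stand-ins for the real server objects (the names below are not taken from the repository):

from typing import Optional

class CachedBatch:
    """Illustrative stand-in for the server's cached batch object."""
    def to_pb(self) -> dict:
        return {"id": 0}

def serialize_cached_batch(next_batch: Optional[CachedBatch]) -> Optional[dict]:
    # The /health probe triggers a prefill that may produce no cached batch at all;
    # guard the serialization instead of unconditionally calling .to_pb() on None.
    return next_batch.to_pb() if next_batch is not None else None

assert serialize_cached_batch(None) is None  # the health-check path must not crash
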
Doing that on TPU seems to slow things down (due to compilation?) and
uses a lot of memory.
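
A minimal sketch of the pattern, assuming a torch/XLA setup; the shapes, pad token and device handling below are illustrative, not the PR's actual code:

import torch
import torch_xla.core.xla_model as xm  # assumes a torch/XLA (TPU) environment

batch_size, seq_length, pad_token_id = 4, 128, 0  # illustrative values

# Build and pad the inputs on CPU, so that slicing/masking does not trigger
# extra XLA compilations or allocate scratch memory on the TPU...
input_ids = torch.full((batch_size, seq_length), pad_token_id, dtype=torch.int64)
attention_mask = torch.zeros((batch_size, seq_length), dtype=torch.int64)

# ...and only move the finished tensors to the device for the forward pass.
device = xm.xla_device()
input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)
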
This allows warmup to be handled correctly.
This fixes a potential issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called
on the TGI server:

- if the request is cancelled after prefill, then the router asks the
  server to "filter" the corresponding request out of the decoding batch.
  This was already correctly implemented;
- if the request is cancelled during prefill, then the router asks the
  server to clear the whole prefill batch. This was not correctly
  implemented because in that configuration we cleared all requests,
  even those not included in that prefill batch.

This is now fixed, essentially reproducing the TGI Neuron fix:
huggingface/optimum-neuron#609
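
A minimal sketch of a batch-aware clear, loosely following the Neuron fix linked above; the Slot/State classes and batch_id bookkeeping are illustrative, not the actual optimum-tpu code:

from enum import Enum
from typing import List, Optional

class State(Enum):
    EMPTY = 0
    READY = 1

class Slot:
    def __init__(self, slot_id: int):
        self.id = slot_id
        self.state = State.EMPTY
        self.batch_id: Optional[int] = None

    def clear(self):
        self.state = State.EMPTY
        self.batch_id = None

def clear(slots: List[Slot], batch_id: Optional[int] = None):
    # Only release the slots that belong to the cancelled prefill batch; the previous
    # implementation cleared every non-empty slot, dropping unrelated requests too.
    for slot in slots:
        if slot.state != State.EMPTY and (batch_id is None or slot.batch_id == batch_id):
            slot.clear()
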
tengomucho marked this pull request as ready for review on July 2, 2024 09:04
Comment on lines +382 to +383
input_ids = torch.full((batch_size, seq_length), self.tokenizer.pad_token_id, dtype=torch.int64)
attention_mask = torch.full((batch_size, seq_length), 0, dtype=torch.int64)

Member:
qq: Can't we make it use int32 for input_ids and attention_mask?


tengomucho (Collaborator, Author):

Umh, maybe we could, but the price would be frequent casts between int32 and int64, because protobuf (PB) integers are serialized as int64. I don't think it's worth doing here; we could look at the advantages/disadvantages in another PR.
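
For illustration only (not code from the PR): if the inputs were created as int32, the token ids would need an extra cast back to int64 at the protobuf boundary on every request.

import torch

tokens_int32 = torch.zeros((1, 8), dtype=torch.int32)  # hypothetical int32 inputs
tokens_for_pb = tokens_int32.to(torch.int64)            # extra cast paid per request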

-    if slot.state != Slot.State.EMPTY and slot.request_id not in request_ids:
-        logger.debug(f"Removing request {slot.request_id}")
+    if slot.state != Slot.State.EMPTY and slot.id not in keep_slot_ids:
+        logger.info(f"Removing slot {slot.id} with request {slot.request_id}")

Member:
debug might be enough for this imo

tengomucho merged commit 246fb24 into main on Jul 3, 2024
2 checks passed
tengomucho deleted the ie-fixes branch on July 3, 2024 07:59