Several Inference Endpoint fixes #66

Merged
tengomucho merged 12 commits into main on Jul 3, 2024
Conversation

tengomucho (Collaborator)
What does this PR do?

  • Removed variables from entrypoint.sh; they are passed to the launcher via environment variables
  • Prevent issues with the numpy 2.0 release (see the sketch after this list)
  • Correct the TGI version
  • Add the GKE ulimit command
  • Correct CachedBatch serialization when it is None (it was causing the health call to crash)
  • Perform prefill and decode input pre-processing on CPU, to avoid wasting memory and compilation time on TPU
  • Warmup now clears after prefill
  • Fix the clear implementation
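
The numpy item above most likely corresponds to a dependency pin (something like "numpy<2.0" in the package requirements). The snippet below is only an illustrative runtime guard for that constraint, not the actual change from this PR:

import numpy as np

# Illustrative guard only: the real fix is presumably a "numpy<2.0" pin in the
# package requirements rather than a runtime check like this one.
if int(np.__version__.split(".")[0]) >= 2:
    raise RuntimeError(f"numpy {np.__version__} is not supported, please install numpy<2.0")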

mfuntowicz and others added 11 commits June 27, 2024 11:29
This was generating a tricky error when calling "/health" at server
startup: the health check triggers a prefill that returns None as the
cached batch, and that None then failed to be serialized.
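
A minimal sketch of the guard this describes, with illustrative stand-ins for the real server objects (the names below are not taken from the repository):

from typing import Optional

class CachedBatch:
    """Illustrative stand-in for the server's cached batch object."""
    def to_pb(self) -> dict:
        return {"id": 0}

def serialize_cached_batch(next_batch: Optional[CachedBatch]) -> Optional[dict]:
    # The /health probe triggers a prefill that may produce no cached batch at all;
    # guard the serialization instead of unconditionally calling .to_pb() on None.
    return next_batch.to_pb() if next_batch is not None else None

assert serialize_cached_batch(None) is None  # the health-check path must not crash
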
Doing that on TPU seems to slow things down (due to compilation?) and
uses a lot of memory.
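
A minimal sketch of the pattern, assuming a torch/XLA setup; the shapes, pad token and device handling below are illustrative, not the PR's actual code:

import torch
import torch_xla.core.xla_model as xm  # assumes a torch/XLA (TPU) environment

batch_size, seq_length, pad_token_id = 4, 128, 0  # illustrative values

# Build and pad the inputs on CPU, so that slicing/masking does not trigger
# extra XLA compilations or allocate scratch memory on the TPU...
input_ids = torch.full((batch_size, seq_length), pad_token_id, dtype=torch.int64)
attention_mask = torch.zeros((batch_size, seq_length), dtype=torch.int64)

# ...and only move the finished tensors to the device for the forward pass.
device = xm.xla_device()
input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)
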
This allows warmup to be handled correctly.
This fixes a potential issue when clearing TGI requests.

When a client cancels a TGI request, two different methods can be called
on the TGI server:

- if the request is cancelled after prefill, then the router asks the
  server to "filter" the corresponding request out of the decoding batch.
  This was already correctly implemented;
- if the request is cancelled during prefill, then the router asks the
  server to clear the whole prefill batch. This was not correctly
  implemented because in that configuration we cleared all requests,
  even those not included in that prefill batch.

This is now fixed, essentially reproducing the TGI Neuron fix:
huggingface/optimum-neuron#609
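
A minimal sketch of a batch-aware clear, loosely following the Neuron fix linked above; the Slot/State classes and batch_id bookkeeping are illustrative, not the actual optimum-tpu code:

from enum import Enum
from typing import List, Optional

class State(Enum):
    EMPTY = 0
    READY = 1

class Slot:
    def __init__(self, slot_id: int):
        self.id = slot_id
        self.state = State.EMPTY
        self.batch_id: Optional[int] = None

    def clear(self):
        self.state = State.EMPTY
        self.batch_id = None

def clear(slots: List[Slot], batch_id: Optional[int] = None):
    # Only release the slots that belong to the cancelled prefill batch; the previous
    # implementation cleared every non-empty slot, dropping unrelated requests too.
    for slot in slots:
        if slot.state != State.EMPTY and (batch_id is None or slot.batch_id == batch_id):
            slot.clear()
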
tengomucho marked this pull request as ready for review on July 2, 2024 09:04
Comment on lines +382 to +383
input_ids = torch.full((batch_size, seq_length), self.tokenizer.pad_token_id, dtype=torch.int64)
attention_mask = torch.full((batch_size, seq_length), 0, dtype=torch.int64)

Member:
qq: Can't we make it use int32 for input_ids and attention_mask?


tengomucho (Collaborator, Author):

Umh, maybe we could, but the price would be frequent casts between int32 and int64, because protobuf (PB) integers are serialized as int64. I don't think it's worth doing here; we could look at the advantages/disadvantages in another PR.
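
For illustration only (not code from the PR): if the inputs were created as int32, the token ids would need an extra cast back to int64 at the protobuf boundary on every request.

import torch

tokens_int32 = torch.zeros((1, 8), dtype=torch.int32)  # hypothetical int32 inputs
tokens_for_pb = tokens_int32.to(torch.int64)            # extra cast paid per request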

-    if slot.state != Slot.State.EMPTY and slot.request_id not in request_ids:
-        logger.debug(f"Removing request {slot.request_id}")
+    if slot.state != Slot.State.EMPTY and slot.id not in keep_slot_ids:
+        logger.info(f"Removing slot {slot.id} with request {slot.request_id}")

Member:
debug might be enough for this imo

tengomucho merged commit 246fb24 into main on Jul 3, 2024
2 checks passed
tengomucho deleted the ie-fixes branch on July 3, 2024 07:59