Rebase with TGI v2.0 #134

kdamaszk · 2024-04-29T06:45:29Z

What does this PR do?

Cherry-picks all changes between TGI v1.2.0 and v2.0. Updates the license to Apache 2.0.

As per title

Close huggingface#1253 Close huggingface#1279

@oOraph

@oOraph --------- Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>

@OlivierDehaene

Works by removing adapter_model.safetensors from being detected as the core model file (which skips the real peft detection).

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

@OlivierDehaene

local directory overloaded still needs the directory to locate the weights files correctly.

…tention-v2' (huggingface#1414)

This reverts commit b83aab9.

Wrap text-generation-launcher in the Docker image and mask ldconfig failures from the user (no need in most cases anyway). --------- Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>

@OlivierDehaene

Co-authored-by: Dong Shin <d0104.shin@gmail.com>

@OlivierDehaene

@OlivierDehaene

…e#1651)

I have suggested similar changes over at huggingface/text-embeddings-inference#201. An additional question: why is `debug` enabled for release builds? (That is why I didn't add the flag to strip things.)

Applying the following optimizations:
- `lto` (link-time optimization) over all code, including dependencies
- A single `codegen-unit`, so that optimizations are applied within one code unit at build time

@OlivierDehaene OR @Narsil

@OlivierDehaene

- Renamed `max_input_length` to `max_input_tokens` for consistency (backward-compatible change, will yell if both are set).
- Will now use the config for `max_input_tokens`, `max_total_tokens` and `max_batch_total_tokens`.
- Capping the values to 16k in order to save VRAM on behalf of users (overridable by simply setting the values).
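The defaulting rules in this commit message can be sketched roughly as follows. This is an illustrative Python sketch, not the launcher's actual code (the launcher is written in Rust). The function name `resolve_token_limits`, reading "16k" as 2**14, raising an error when both names are set, and deriving the input limit as "total minus one" are all assumptions; the last one is inferred from the 4095/4096 defaults reported in the review comments of this PR.

```python
from typing import Optional, Tuple

# Assumption: "16k" is taken to mean 2**14 tokens.
MAX_DEFAULT_TOKENS = 2 ** 14


def resolve_token_limits(
    max_position_embeddings: int,
    max_input_tokens: Optional[int] = None,
    max_input_length: Optional[int] = None,  # legacy name, kept for compatibility
    max_total_tokens: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (max_input_tokens, max_total_tokens) for a model config."""
    # Assumption: "will yell if both are set" is modelled as an error.
    if max_input_tokens is not None and max_input_length is not None:
        raise ValueError(
            "Set only one of `max_input_tokens` and `max_input_length`"
        )
    # Accept the legacy name when the new one is not given.
    if max_input_tokens is None:
        max_input_tokens = max_input_length

    # Default taken from the model config, capped to save VRAM.
    capped = min(max_position_embeddings, MAX_DEFAULT_TOKENS)
    if max_total_tokens is None:
        max_total_tokens = capped
    if max_input_tokens is None:
        # Assumption: leave room for at least one generated token.
        max_input_tokens = max_total_tokens - 1

    if max_input_tokens >= max_total_tokens:
        raise ValueError("`max_input_tokens` must be < `max_total_tokens`")
    return max_input_tokens, max_total_tokens
```

Under these assumptions, a model whose config declares 4096 positions gets limits of 4095 and 4096 when nothing is set explicitly, while explicitly passed values are used unchanged.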

@OlivierDehaene

compliation -> compilation

kdamaszk · 2024-04-29T06:46:02Z

@regisss please review

regisss · 2024-04-30T14:11:02Z

@kdamaszk LGTM!

However, I get an error when I try to run Llama2-7b. I launch the server and send the same request as the example in the README: https://github.com/kdamaszk/tgi-gaudi/tree/rebase_tgi_2.0?tab=readme-ov-file#running-tgi-on-gaudi
And I get the following error:

ERROR generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(32), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:generate:generate_stream: text_generation_router::infer: router/src/infer.rs:145: `inputs` tokens + `max_new_tokens` must be <= 4096. Given: 4095 `inputs` tokens and 32 `max_new_tokens`

It seems that max_new_tokens is not considered when the input is padded to 4096 tokens. Are you able to reproduce it?
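To make the failing check concrete, here is a minimal Python sketch of the arithmetic behind the error message. The real check lives in the Rust router; the function name `validate_request` is hypothetical and only mirrors the wording of the log line above.

```python
def validate_request(
    input_tokens: int, max_new_tokens: int, max_total_tokens: int
) -> None:
    """Raise if the prompt plus the requested new tokens exceed the budget."""
    total = input_tokens + max_new_tokens
    if total > max_total_tokens:
        raise ValueError(
            f"`inputs` tokens + `max_new_tokens` must be <= {max_total_tokens}. "
            f"Given: {input_tokens} `inputs` tokens "
            f"and {max_new_tokens} `max_new_tokens`"
        )


# Values from the log: 4095 + 32 = 4127, which is above the 4096 limit.
try:
    validate_request(input_tokens=4095, max_new_tokens=32, max_total_tokens=4096)
except ValueError as err:
    print(err)
```

With 4095 input tokens counted against a 4096-token limit, only a single new token fits, so any request asking for more is rejected regardless of how short the original prompt was.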

yafshar · 2024-05-01T15:28:47Z

The HPU Graph issue reported in #130 still exists. @regisss @kdamaszk if you give me permission, I can add the patch here. Or you can cherry pick the patch 4963b73 to fix the issue.

yafshar · 2024-05-01T17:11:56Z

@regisss, for this update (1.15.0 and 1.15.1) the requirement is dill 0.3.8, from 4963b73.
Currently, the merge from tgi_gaudi:habana-main (600d033) installs the older 0.3.7.
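A quick way to confirm which dill version ended up in an environment is sketched below. It uses only the standard library; the helper names are illustrative, the 0.3.8 minimum is taken from the comment above, and the simple parser assumes purely numeric versions such as `0.3.7`.

```python
from importlib import metadata
from typing import Optional, Tuple

REQUIRED_DILL = (0, 3, 8)  # minimum version named in the comment above


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.3.8' into (0, 3, 8); assumes numeric, dot-separated parts."""
    return tuple(int(part) for part in version.split("."))


def installed_dill_version() -> Optional[Tuple[int, ...]]:
    """Return the installed dill version, or None if dill is not installed."""
    try:
        return parse_version(metadata.version("dill"))
    except metadata.PackageNotFoundError:
        return None


def dill_is_recent_enough(version: Optional[Tuple[int, ...]]) -> bool:
    """True only when dill is present and at least REQUIRED_DILL."""
    return version is not None and version >= REQUIRED_DILL
```

Calling `dill_is_recent_enough(installed_dill_version())` inside the container would show whether the image still carries the older release.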

kdamaszk · 2024-05-06T05:57:05Z

@regisss you are right. This issue is caused by the new default values for max-input-length and max-total-tokens (previously 1024 and 2048, now 4095 and 4096). I will modify the README example to set these parameters properly, WDYT?

kdamaszk · 2024-05-06T05:57:49Z

thanks @yafshar, I will cherry-pick this commit

regisss · 2024-05-06T06:04:47Z

Sounds good!

regisss

LGTM!

Side note, I cannot stop the Docker container with ctrl+c anymore. Not a big deal, maybe it's not even related to TGI Gaudi.

kdamaszk · 2024-05-06T07:36:14Z

Right, I observed the same behavior. This is caused by this PR: huggingface#1716. Because of it, `tgi_entrypoint.sh` is run instead of `text-generation-launcher`.
I'm not sure what the purpose of this change is. We can go back to the previous solution if you want; it was more user-friendly.

regisss · 2024-05-06T07:39:24Z

Okay I see. Let me discuss that with TGI maintainers to see if we want to directly modify this there.

fxmarty and others added 30 commits April 18, 2024 10:09

Fix AMD documentation (huggingface#1307)

ab34c16

As per title

Add a stale bot. (huggingface#1313)

a41c1a6

Speculative (huggingface#1308)

a7f52f3

feat: mixtral (huggingface#1328)

9aef902

chore: formatting

79f268f

v1.3.0

db5053f

v1.3.1

09c556d

feat: add quant to mixtral (huggingface#1337)

f9b58ac

v1.3.2

05f8c85

fix: default max_new_tokens to 100

2f88d8d

fix: fix gpt-q params loading

c974437

feat: add more latency metrics in forward (huggingface#1346)

5c9ef06

fix: fix triton OutOfResources import

28fcdcc

fix: fix quant linear autotune

b3c2d72

fix: slice stopping criteria buffer

04dbf7a

fix: only keep stop sequence buffer if we have some

214ec0e

fix: max_past default value must be -1, not 0 (huggingface#1348)

bb62005

v1.3.3

3600fc9

feat: relax mistral requirements (huggingface#1351)

a95e6d6

Close huggingface#1253 Close huggingface#1279

fix: fix logic if sliding window key is not present in config (huggingface#1352)

ecb0db4

fix: fix offline (huggingface#1341) (huggingface#1347)

5ff9e81

@oOraph --------- Signed-off-by: Raphael Glon <oOraph@users.noreply.github.com> Co-authored-by: Raphael Glon <oOraph@users.noreply.github.com>

fix: fix gpt-q with groupsize = -1 (huggingface#1358)

b7299e1

docs: Change URL for Habana Gaudi support in doc (huggingface#1343)

3e22ad9

feat: update exllamav2 kernels (huggingface#1370)

7eeabb9

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

v1.3.4

62646c2

docs: update required CUDA version to 12.2

fc9173a

fix: fix local loading for .bin models (huggingface#1419)

118344b

Fix missing make target platform for local install: 'install-flash-attention-v2' (huggingface#1414)

92ddb41

Narsil and others added 12 commits April 25, 2024 17:53

Easier defaults for models stemmed from configs.

e428c7c

Revert "Easier defaults for models stemmed from configs."

c4ee0a6

This reverts commit b83aab9.

fix(router): fix a possible deadlock in next_batch (huggingface#1731)

e6421f6

feat: medusa v2 (huggingface#1734)

f6d5c2e

v2.0.0 (huggingface#1736)

c6a31b9

Merge branch 'habana-main' into rebase_tgi_2.0

600d033

hsubramony mentioned this pull request Apr 30, 2024

V1.2.2 release #129

Closed

5 tasks

yafshar and others added 2 commits May 6, 2024 09:15

A patch to address HPU Graphs issue with DILL

3d78027

A temporary solution to an overriding issue that appears when installing dill with Habana torch from gaudi-docker/1.15.0: having `import __main__ as _main_module` in the global scope of the dill module causes an overriding issue in the HPU graph destructor.

Update README example commands

0bbec63

kdamaszk requested a review from regisss May 6, 2024 06:34

regisss approved these changes May 6, 2024

regisss merged commit 81182be into huggingface:habana-main May 6, 2024

regisss mentioned this pull request May 6, 2024

Unable to stop TGI after serving models huggingface/text-generation-inference#1842

Closed

4 tasks

kdamaszk linked an issue May 6, 2024 that may be closed by this pull request

update the base image from 1.14 to 1.15 #127

Closed

kdamaszk mentioned this pull request May 6, 2024

update the base image from 1.14 to 1.15 #127

Closed
