Improve the defaults for the launcher #1727

Narsil · 2024-04-11T12:52:55Z

What does this PR do?

Renamed max_input_length to max_input_tokens for consistency (backward-compatible change, will yell if both are set).
Will now use the model config to derive the defaults for max_input_tokens, max_total_tokens and max_batch_total_tokens.
Capping the default values at 16k in order to save VRAM on behalf of users (overridable by simply setting the values).
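
A minimal sketch of the backward-compatible rename described above, assuming both spellings arrive as optional values. The function name `resolve_max_input_tokens` and the `String` error are hypothetical and only for illustration; the launcher's actual code may differ.

```rust
/// Resolve the effective input limit from the new flag and its deprecated alias.
/// Hypothetical helper: the name and the error type are illustrative only.
fn resolve_max_input_tokens(
    max_input_tokens: Option<usize>,
    max_input_length: Option<usize>, // deprecated spelling, kept for backward compatibility
) -> Result<Option<usize>, String> {
    match (max_input_tokens, max_input_length) {
        // Both spellings given: refuse rather than silently pick one.
        (Some(_), Some(_)) => Err(
            "Both `max_input_tokens` and `max_input_length` are set; set only `max_input_tokens`"
                .to_string(),
        ),
        // Either spelling alone is accepted.
        (Some(n), None) | (None, Some(n)) => Ok(Some(n)),
        // Neither given: the caller falls back to a default derived from the model config.
        (None, None) => Ok(None),
    }
}
```

Returning `Ok(None)` when nothing is set keeps the "user provided" and "defaulted" cases distinguishable for whatever computes the default afterwards.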

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2024-04-11T12:56:26Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

dacorvo · 2024-04-11T13:12:44Z

docs/source/basic_tutorials/launcher.md


 ```
 ## MAX_TOTAL_TOKENS
 ```shell
      --max-total-tokens <MAX_TOTAL_TOKENS>
-          This is the most important value to set as it defines the "memory budget" of running clients requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. with a value of `1512` users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger amount each request will be in your RAM and the less effective batching can be
+          This is the most important value to set as it defines the "memory budget" of running clients requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. with a value of `1512` users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger amount each request will be in your RAM and the less effective batching can be. Default to min(max_position_embeddings - 1, 16384)


Suggested change

- This is the most important value to set as it defines the "memory budget" of running clients requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. with a value of `1512` users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger amount each request will be in your RAM and the less effective batching can be. Default to min(max_position_embeddings - 1, 16384)
+ This is the most important value to set as it defines the "memory budget" of running clients requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. with a value of `1512` users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger amount each request will be in your RAM and the less effective batching can be. Default to min(max_position_embeddings, 16384)
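
The default and the "memory budget" arithmetic from the doc text can be sketched as follows. This follows the suggested wording `min(max_position_embeddings, 16384)`; the thread does not show which variant (with or without `- 1`) was finally merged, and the names `MAX_DEFAULT_TOKENS`, `default_max_total_tokens` and `fits_budget` are hypothetical.

```rust
/// The 16k cap mentioned in the PR description.
const MAX_DEFAULT_TOKENS: usize = 16384;

/// Default for `max_total_tokens` when the user did not set it,
/// following the suggested wording `min(max_position_embeddings, 16384)`.
fn default_max_total_tokens(max_position_embeddings: usize) -> usize {
    max_position_embeddings.min(MAX_DEFAULT_TOKENS)
}

/// A request fits the budget when its prompt tokens plus the requested
/// new tokens do not exceed `max_total_tokens`.
fn fits_budget(input_tokens: usize, max_new_tokens: usize, max_total_tokens: usize) -> bool {
    input_tokens + max_new_tokens <= max_total_tokens
}
```

With a budget of `1512`, both examples from the doc text fit: `1000 + 512` and `1 + 1511` each equal `1512`.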

dacorvo · 2024-04-11T13:14:26Z

launcher/src/main.rs

+        // TODO get config.
+        match args.max_batch_prefill_tokens {
+            Some(max_batch_prefill_tokens) => max_batch_prefill_tokens,
+            None => {


If max_input_tokens is set, shouldn't we use that value instead?
Also, if max_batch_size is set, maybe use max_batch_size * max_input_tokens.
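
A rough sketch of this suggestion, read literally. It is the reviewer's proposal, not necessarily what was merged; the function name and the `fallback` parameter (standing in for whatever default the launcher would otherwise use) are hypothetical.

```rust
/// Pick a default for `max_batch_prefill_tokens` along the lines suggested in the review.
/// Hypothetical helper: `fallback` stands in for the launcher's existing default.
fn default_max_batch_prefill_tokens(
    max_batch_prefill_tokens: Option<usize>,
    max_input_tokens: Option<usize>,
    max_batch_size: Option<usize>,
    fallback: usize,
) -> usize {
    match (max_batch_prefill_tokens, max_input_tokens, max_batch_size) {
        // An explicit value always wins.
        (Some(explicit), _, _) => explicit,
        // Both known: room for a full batch of maximum-length prompts.
        (None, Some(input), Some(batch)) => batch.saturating_mul(input),
        // Only the input limit known: room for one maximum-length prompt.
        (None, Some(input), None) => input,
        // Nothing to derive from: keep the existing default.
        (None, None, _) => fallback,
    }
}
```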

drbh · 2024-04-11T16:11:13Z

launcher/src/main.rs

+    use hf_hub::{api::sync::Api, Repo, RepoType};
+
+    #[derive(Deserialize)]
+    struct Config {
+        max_position_embeddings: usize,
+    }


nit: can we move these outside of main?

drbh · 2024-04-11T16:14:36Z

launcher/src/main.rs

+            } else {
+                api.model(model_id)
+            };
+            repo.get("config.json").unwrap()


will this fail in a case where config.json is not in the repo?

drbh · 2024-04-11T16:16:31Z

launcher/src/main.rs

+        let content = std::fs::read_to_string(filename).unwrap();
+        let config: Config = serde_json::from_str(&content).unwrap();


similar as above, are these infallible?
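
Neither call is infallible: `std::fs::read_to_string` returns an error for a missing or unreadable file, and deserialization fails on malformed JSON or a missing field. A later commit in this PR is titled "No unwrap."; the sketch below is not that commit, only one possible shape of the idea, where any failure yields `None` so the caller can fall back to a fixed default. To stay free of external crates it uses a hand-rolled, hypothetical stand-in for `serde_json::from_str`.

```rust
use std::fs;
use std::path::Path;

/// Read a config file and extract `max_position_embeddings`, returning `None`
/// on any failure (missing file, unreadable content, missing or malformed field)
/// instead of panicking.
fn max_position_embeddings_from_file(path: &Path) -> Option<usize> {
    let content = fs::read_to_string(path).ok()?;
    parse_max_position_embeddings(&content)
}

/// Hand-rolled stand-in for `serde_json::from_str`, used only so this sketch
/// compiles without dependencies. It is NOT a JSON parser: it finds the key
/// and reads the digits that follow the colon.
fn parse_max_position_embeddings(content: &str) -> Option<usize> {
    let key = "\"max_position_embeddings\"";
    let after_key = &content[content.find(key)? + key.len()..];
    let after_colon = after_key.trim_start().strip_prefix(':')?.trim_start();
    let digits: String = after_colon
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    // An empty digit string (e.g. the value is `null`) fails to parse and yields `None`.
    digits.parse().ok()
}
```

Returning `Option` rather than `Result` drops the error details; a launcher that wants to log why the config was ignored would likely prefer `Result`.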

OlivierDehaene · 2024-04-11T16:31:07Z

You get a lot of errors because max_total_tokens is higher than the maximum number of tokens in the cache, for example: ArgumentValidation("max_total_tokens must be <= max_batch_total_tokens. Given: 16384 and 12416").

Do you think we should change this to a warning and set max_total_tokens = min(max_total_tokens, max_batch_total_tokens) if the user has not explicitly provided a max_total_tokens?

Narsil · 2024-04-11T17:29:36Z

> You get a lot of errors because the max_total_tokens is higher than the maximum number of tokens in the cache like: ArgumentValidation("max_total_tokens must be <= max_batch_total_tokens. Given: 16384 and 12416").
>
> Do you think we should change this to a warning and make max_total_tokens = min(max_total_tokens, max_batch_total_tokens) if the user has not explicitly provided a max_total_tokens?

I'd say hard error if user provided, but also better default if not user provided.
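
The policy discussed here (hard error for an explicit value that does not fit, clamped default otherwise) could look roughly like the sketch below. The function name and the `String` error are hypothetical; the error text mirrors the `ArgumentValidation` message quoted above.

```rust
/// Resolve `max_total_tokens` against the cache capacity `max_batch_total_tokens`.
/// Hypothetical helper illustrating the policy discussed in this thread.
fn resolve_max_total_tokens(
    user_max_total_tokens: Option<usize>,
    default_max_total_tokens: usize,
    max_batch_total_tokens: usize,
) -> Result<usize, String> {
    match user_max_total_tokens {
        // The user asked for more than the cache can hold: hard error.
        Some(requested) if requested > max_batch_total_tokens => Err(format!(
            "max_total_tokens must be <= max_batch_total_tokens. Given: {} and {}",
            requested, max_batch_total_tokens
        )),
        // The user-provided value fits: use it unchanged.
        Some(requested) => Ok(requested),
        // No user value: clamp the default so it always fits.
        None => Ok(default_max_total_tokens.min(max_batch_total_tokens)),
    }
}
```

With the numbers from the quoted error, a defaulted `16384` against a cache of `12416` becomes `12416`, while an explicit `16384` is still rejected.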

dacorvo · 2024-04-12T11:03:41Z

launcher/src/main.rs

@@ -257,7 +257,7 @@ struct Args {
    /// Limits the number of tokens for the prefill operation.
    /// Since this operation take the most memory and is compute bound, it is interesting
    /// to limit the number of requests that can be sent.
-    /// Default to min(max_input_length + 50, 16384) to give a bit of room.
+    /// Default to `max_input_length + 50` to give a bit of room.


You mean tokens, right? 😉

dacorvo reviewed Apr 11, 2024

drbh reviewed Apr 11, 2024

Narsil added 8 commits April 12, 2024 07:28

Easier defaults for models stemmed from configs.

3c71d2f

Better defaults (and LOG_COLORIZE).

bd01d44

Update default doc.

a4c86e8

No unwrap.

9ce9f39

Making things work most of the time.

d43e10e

Change things around when we don't have a tokenizer.

179ee4e

Remove the override ?

9176ecb

Adding some wiggle room.

289b072

Narsil force-pushed the improve_defaults branch from 118516a to 289b072 April 12, 2024 07:29

Narsil added 5 commits April 12, 2024 08:24

Fixing default for BNB + cuda graphs (they don't work together).

c4ebcea

Max_seq_len (old mpt config.)

cd07211

"Fixing t5" just use more RAM for this test.

1e5150f

Smaller default for max_input_length.

e595585

Forgot the doc again.

16386b8

dacorvo reviewed Apr 12, 2024

Narsil commented Apr 12, 2024

Narsil added 2 commits April 12, 2024 14:08

Update launcher/src/main.rs

b75bd5b

Update the doc.

f66c9f3

Narsil merged commit 1b2670c into main Apr 12, 2024
7 of 8 checks passed

Narsil deleted the improve_defaults branch April 12, 2024 12:20
