Model warmup support with AOT and endpoint for JetStream #92
Conversation
jetstream/core/orchestrator.py (Outdated)

        true_length=true_length,
    )
    if self.warmup_enabled:
      padded_token_length = token_utils.take_nearest_length(
In general, I feel the warmup code should be outside of the orchestrator. We would like to keep the orchestrator limited to necessary functions (benchmarking or warmup is not a necessary function) and make sure the code stays clean and clear; the logic is already very complex to read right now.
Can you refactor and move warmup out of this class?
Just curious: do you mean we should move the warmup logic into a separate function and invoke it here, or call AOT at a completely different place outside of the orchestrator?
I moved the warmup code out of the orchestrator but kept the check (if self.warmup_enabled) and its corresponding logic because of the following behavior: once model warmup is called, the compiled forms of prefill, insert, and generate are stored in dictionaries keyed by their corresponding bucket lengths. These compiled forms should be called from then on; otherwise the JetStream server will incur compilation times.
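For illustration, the bucket-keyed dispatch described here might look roughly like the following. This is a minimal sketch: the names buckets and compiled_prefill, and the take_nearest_length reimplementation, are assumptions for the example, not the actual JetStream code.

import bisect
from typing import Any, Callable, Dict

# Hypothetical: padded bucket lengths the model was warmed up for.
buckets = [64, 128, 256, 512, 1024]
# Hypothetical: bucket length -> AOT-compiled prefill executable.
compiled_prefill: Dict[int, Callable[..., Any]] = {}

def take_nearest_length(lengths, length):
  # Smallest bucket that fits `length` (falls back to the largest bucket).
  pos = bisect.bisect_left(lengths, length)
  return lengths[min(pos, len(lengths) - 1)]

def prefill(padded_tokens, true_length):
  # Dispatch to the pre-compiled executable so the serving path never
  # triggers XLA compilation.
  bucket = take_nearest_length(buckets, true_length)
  return compiled_prefill[bucket](padded_tokens, true_length)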
Correct me if I'm wrong: we know what type of data prefill or decode should process for warmup. In that case, all of this code can live outside the orchestrator.
@JoeZijunZhou Please also take a look. The orchestrator is already complex; we'd better keep this class to main-function code only, otherwise it will be hard to maintain and refactor in the future.
+1. I feel it's feasible to implement a wrapper for the engines and pass the compiled engines etc. into the driver init here if warmup is on: https://github.com/google/JetStream/blob/main/jetstream/core/server_lib.py#L141. Then we don't need to change the orchestrator or the engine API, keeping the AOT warmup logic decoupled from the existing JetStream core and engine API.
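A minimal sketch of that wiring, with hypothetical names: create_driver, config.get_engines, and the Driver signature are illustrative, not the actual server_lib API.

def create_driver(config, devices, enable_model_warmup=False):
  engines = config.get_engines(devices)  # assumed helper
  if enable_model_warmup:
    # Wrap each engine so AOT compilation happens once at startup and the
    # orchestrator sees the same Engine API (see the WarmedUpEngine sketch
    # further down).
    engines = [WarmedUpEngine(e) for e in engines]
  return orchestrator.Driver(engines=engines)  # assumed signature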
Added two things:
- Logic to bake the warmup into model server startup
- A wrapper engine definition, WarmedUpEngine

Added some extra logic to help facilitate the engine and define the warmup state, since that state is used later in the prefill and generate threads to determine which bucket needs to be called.
@FanhaiLu1 Added the wrapper logic, PTAL and let me know if it looks good! Thanks. We can have a follow-up PR to address the performance degradation that occurs at larger batch sizes.
I feel it's feasible to move the warmup state and its related handling into WarmedUpEngine, WDYT?
+1 for Zijun's comments
Thanks, decoupled the warmup handling per our discussion offline!
Thanks for adding the warmup support! Which VM did you run the test on?
@FanhaiLu1 I tested using v5e-8; I'll add this detail to the PR comment too.
Great work! I feel we could refactor the warmup logic to make it clean and decoupled.
You can probably put the warmup logic into another Engine implementation that takes an instance of Engine:

class WarmedUpEngine(engine_api.Engine):
  def __init__(self, downstream_engine: engine_api.Engine):
    # do compile, set up the dicts that map int to jax Compiled.
  def prefill(self, ...):
    return self.compiled_prefill[seqlen](*args, **kwargs)  # etc.
  # same for insert / generate

Then, in the orchestrator, you only need:

if warmed_up:
  self.engine = WarmedUpEngine(self.engine)

in init, and the rest doesn't need to change.
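Fleshing that sketch out into something runnable might look like the following. This is a hedged sketch: the bucket sizes are assumptions, the compile step is a placeholder, and real JetStream engine methods take more arguments.

from typing import Any, Callable, Dict

class WarmedUpEngine:
  """Wraps an engine and dispatches calls to AOT-compiled executables."""

  def __init__(self, downstream_engine, buckets=(64, 128, 256, 512)):
    self._engine = downstream_engine
    self._buckets = sorted(buckets)
    # Bucket length -> compiled prefill. A real implementation would call
    # jax.jit(...).lower(example_inputs).compile() here per bucket; using
    # the raw method is a placeholder to keep the sketch self-contained.
    self._compiled_prefill: Dict[int, Callable[..., Any]] = {
        b: self._engine.prefill for b in self._buckets
    }

  def prefill(self, padded_tokens, true_length, **kwargs):
    # Route to the executable compiled for the nearest bucket.
    bucket = next(
        (b for b in self._buckets if b >= true_length), self._buckets[-1]
    )
    return self._compiled_prefill[bucket](padded_tokens, true_length, **kwargs)

  def __getattr__(self, name):
    # insert / generate / params etc. fall through to the wrapped engine,
    # so callers see the unchanged Engine API.
    return getattr(self._engine, name)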
Synced with @vivianrwu; a complete refactor of orchestrator.py and server_lib.py is needed in follow-up PRs. Approving this initial PR.
This PR aims to support model warmup on the JetStream server using ahead-of-time (AOT) compilation. The goal is to eliminate compile times after the model server has loaded, allowing the server to serve without extra latency.

This covers using AOT on prefill, generate, and insert, and using their respective compiled executables in the prefill and generate threads when serving. A driver flag, self.warmup_enabled, is added to track this. To enable model warmup, set --enable_model_warmup=True/true when running the MaxEngine or jetstream-pytorch server. This enables a wrapper engine over the original engine that calls the respective AOT-compiled prefill, insert, and generate functions where appropriate in prefill_threads and generate_threads.
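For reference, the JAX mechanism this kind of warmup relies on is the lower/compile AOT path; roughly like this minimal sketch, where prefill_step and the (1, 128) bucket shape are made-up stand-ins rather than the engine's real prefill:

import jax
import jax.numpy as jnp

def prefill_step(tokens):  # stand-in for an engine's prefill
  return jnp.cumsum(tokens, axis=-1)

# Warmup: lower and compile ahead of time for one padded bucket shape.
example = jnp.zeros((1, 128), dtype=jnp.int32)
compiled = jax.jit(prefill_step).lower(example).compile()

# Serving: reuse the executable; inputs must match the compiled shape/dtype.
out = compiled(jnp.ones((1, 128), dtype=jnp.int32))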
Test coverage can be found under test_server.py, which checks whether model warmup was successful. This functionality will be useful for model server readiness checks and pod startup on GKE, denoting that a pod is ready to serve without the extra latency of compilation.
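The shape of such a test might be as follows; this is hypothetical, the actual assertions in test_server.py may differ, FakeEngine is invented for the sketch, and WarmedUpEngine refers to the sketch above:

class FakeEngine:
  def prefill(self, *args, **kwargs):
    return None

def test_model_warmup_populates_compiled_buckets():
  engine = WarmedUpEngine(FakeEngine(), buckets=(64, 128))
  # After warmup, every bucket should have a compiled executable cached.
  assert set(engine._compiled_prefill) == {64, 128}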
This has been validated with MaxText at HEAD (AI-Hypercomputer/maxtext@f8ae413) on GKE. The numbers below cover initial testing on jetstream-pytorch and the latency differences.
Latency difference on the first request after the model server has loaded:
Before AOT model warmup (includes compilation): 34.43s
After AOT model warmup (no further compilation needed): 1.81s
This is an initial implementation of AOT for model warmup. With higher batch sizes in jetstream-pytorch, we observe that the detokenizing generate step and the time to first response after AOT are slower. Latency below is measured from the time the request is sent to the time a response is returned.