CUDA graph compilation #154

tgaddair · 2024-01-02T06:02:56Z

This PR adds support for compiling the model into a static CUDA graph. See "Accelerating PyTorch with CUDA Graphs" for more details on CUDA graphs and how they can reduce latency.

To enable this (experimental) feature:

lorax-launcher ... --compile

There is a tradeoff to be aware of when using CUDA graphs: they increase memory overhead by 3-10 GB, depending on model size. However, the observed decrease in latency can be as much as 50%, so if you don't need to run with very large batch sizes and are more constrained by latency than by throughput, this is a very compelling feature to enable.

In practice, CUDA graphs are most useful when there are excess GPU FLOPs available, such as during decoding. As such, we do not use the compiled version of the model during prefill, only during the decoding steps. This means the benefits of enabling compilation will be most pronounced when generating longer sequences (for which more time is spent during decoding).
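To illustrate why longer generations benefit more, here is a small back-of-the-envelope sketch. It assumes prefill always runs eagerly and only the per-token decode step gets faster. The function name and all timings are hypothetical and chosen for illustration; they are not measurements from this PR.

```python
def end_to_end_speedup(prefill_s, eager_step_s, graph_step_s, new_tokens):
    """Ratio of eager request latency to graph-accelerated request latency.

    Prefill cost is identical in both cases; only the decode step changes.
    """
    eager_total = prefill_s + new_tokens * eager_step_s
    graph_total = prefill_s + new_tokens * graph_step_s
    return eager_total / graph_total


# Hypothetical timings: 50 ms prefill, 10 ms eager decode step, 4 ms graph decode step.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} new tokens: {end_to_end_speedup(0.050, 0.010, 0.004, n):.2f}x")
```

Under these assumed numbers the speedup grows with the number of generated tokens and approaches the per-step ratio (10 ms / 4 ms) as decoding dominates the request.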

Current limitations:

Batch size < 256
LoRA rank >= 8 and <= 32
Only one LoRA rank in the batch

If any of these conditions are not met, LoRAX will fall back to eager execution for the batch.
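The fallback rule can be sketched as a simple predicate over the batch. This is a minimal illustration of the three conditions listed above, not the actual LoRAX code; `can_use_cuda_graph` and the constant names are hypothetical.

```python
MAX_BATCH_SIZE = 256          # exclusive bound, from "Batch size < 256"
MIN_RANK, MAX_RANK = 8, 32    # inclusive bounds on LoRA rank


def can_use_cuda_graph(batch_size, lora_ranks):
    """Return True if the batch satisfies the limitations listed above.

    lora_ranks holds the rank of every adapter used in the batch;
    it is empty when no adapter is applied.
    """
    if batch_size >= MAX_BATCH_SIZE:
        return False
    unique_ranks = set(lora_ranks)
    if len(unique_ranks) > 1:
        # Only one LoRA rank is allowed per batch.
        return False
    return all(MIN_RANK <= rank <= MAX_RANK for rank in unique_ranks)


print(can_use_cuda_graph(32, [16, 16]))   # single rank within range
print(can_use_cuda_graph(32, [8, 16]))    # mixed ranks
print(can_use_cuda_graph(256, [16]))      # batch too large
```

A caller would run the captured graph when the predicate holds and the eager path otherwise.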

Thanks to folks on the Punica team for updating kernels to support graph tracing. Additionally, we modified kernels to support padding with -1 (necessary because CUDA graphs require input shapes to be constant across batches).
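As a rough sketch of the padding idea: a variable-length list of indices is extended with -1 up to a fixed size so that every batch presents the same shape to the captured graph. Rounding up to a power of two is an assumption suggested by the "Pad segments to power of 2" commit in this PR; the helper names are hypothetical and the real kernels operate on GPU tensors, not Python lists.

```python
def next_power_of_2(n):
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pad_indices(indices, pad_value=-1):
    """Pad a list of indices with pad_value up to the next power-of-two length."""
    target = next_power_of_2(len(indices))
    return list(indices) + [pad_value] * (target - len(indices))


print(pad_indices([0, 0, 1]))        # padded to length 4
print(pad_indices([2, 2, 2, 2, 2]))  # padded to length 8
```

Entries equal to -1 are then treated as "no work" by the kernel, so the padded slots do not affect the result.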

Comparison:

gpt2-medium, time to generate 100 tokens:

no adapter

baseline: 1.044 s
cuda graph: 0.422 s

1 adapter (rank 16)

baseline: 1.503 s
cuda graph: 0.583 s
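For convenience, the figures above can be converted into speedup and relative latency reduction. The sketch below uses only the four timings reported in this comparison; `summarize` is a hypothetical helper, and the per-token figure is an average over the whole request (prefill included).

```python
# (baseline seconds, cuda graph seconds) for generating 100 tokens with gpt2-medium
results = {
    "no adapter": (1.044, 0.422),
    "1 adapter (rank 16)": (1.503, 0.583),
}


def summarize(baseline_s, graph_s, tokens=100):
    """Derive speedup, latency reduction, and average per-token time."""
    return {
        "speedup": baseline_s / graph_s,
        "latency_reduction_pct": 100 * (1 - graph_s / baseline_s),
        "graph_ms_per_token": 1000 * graph_s / tokens,
    }


for name, (baseline_s, graph_s) in results.items():
    stats = summarize(baseline_s, graph_s)
    print(
        f"{name}: {stats['speedup']:.2f}x faster, "
        f"{stats['latency_reduction_pct']:.1f}% lower latency, "
        f"{stats['graph_ms_per_token']:.2f} ms/token"
    )
```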

tgaddair added 30 commits December 19, 2023 15:31

WIP: cuda graphs

eeb04f6

Copy in adapter data

546d837

Wrap

9c93c63

Cleanup

af358be

Added graph utils

371e12f

Merge branch 'main' into cuda-graph

3b994e1

WIP

5fc0d1d

Cleanup debug

8a685e3

WIP: padding

eb62c4a

More cleanup

0894e93

Copy

11b693f

Fix shapes

8914d1d

Block tables

6c6edac

Fixed shapes

5eadfb1

Adapter data

659076a

Fixed dtypes

0e913fa

Fixed copy

05a2520

Lora graph

f5dff8d

Fixed lora

4444266

Renamed

cc2c706

Plumb v tensor

5edf242

Fixed plumbing

af23513

Refactor

e214b5c

Fixed bgmv for tracing

ada6272

Changed bgmv to use pointers

1187f77

Removed unused num_layers arg

d095fb6

Fixed sgmv tracing with upstream punica changes

6e8a667

WIP: graph sgmv

50972c6

More debug

4cebc11

DEBUG: infer with same inputs

201f6de

tgaddair added 4 commits January 2, 2024 12:41

Fixed mistral

48170dc

Removed unused code

53ff158

Style: pointer on the left

2361f97

Handle cases where cuda graph cannot be used

2abdcfb

tgaddair mentioned this pull request Jan 2, 2024

Does lorax currently support GPT2 finetuned adapters? #84

Open

4 tasks

tgaddair added 2 commits January 2, 2024 13:25

Empty dict

8c1623f

Plumb --compile

8b72559

tgaddair marked this pull request as ready for review January 2, 2024 21:50

tgaddair added 2 commits January 2, 2024 14:05

Added docs

8ef59bf

Fixed formatting

39304d9

tgaddair requested a review from geoffreyangus January 2, 2024 22:11

tgaddair added 12 commits January 2, 2024 14:12

Updated limitations

be4b88e

Updated launcher docs

ee7f574

Fixed rope for graph

6461974

Update limitations

71fbded

ValueError

1c5fbf9

Avoid cloning output

b50b9ff

Fixed padding

b5194d6

Refactor

262af45

Pad segments to power of 2

c65a6fa

Updated load test

f537d4f

Fixed deadlock in sgmv_shrink

b3773c6

Updated docs

179683c

tgaddair force-pushed the cuda-graph branch from 9a84a0f to 179683c Compare January 4, 2024 05:43

tgaddair added 3 commits January 3, 2024 21:44

Merge

a2e9fa7

Removed unused import

e00a09f

Merge branch 'main' into cuda-graph

08061d3

tgaddair merged commit f20789d into main Jan 4, 2024
1 check passed

tgaddair deleted the cuda-graph branch January 4, 2024 16:40

tgaddair mentioned this pull request Jan 23, 2024

Fix RoPE and YARN scaling #202

Merged
