Sequence parallel prefill attention kernel #2
Conversation
```python
seq_lens_cpu[i] = seq_len // model_runner.sp_size + (
    seq_len % model_runner.sp_size > model_runner.sp_rank
)
prefix_lens_cpu[i] = prefix_len // model_runner.sp_size + (
    prefix_len % model_runner.sp_size > model_runner.sp_rank
)
```
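As a sanity check on this split rule: each rank gets `seq_len // sp_size` tokens, plus one extra token if its rank index is smaller than the remainder, so the shard lengths always sum back to `seq_len`. A minimal standalone illustration (values hypothetical):

```python
def shard_len(seq_len: int, sp_size: int, sp_rank: int) -> int:
    # Same formula as the diff above: floor division, plus one extra token
    # for the first (seq_len % sp_size) ranks.
    return seq_len // sp_size + (seq_len % sp_size > sp_rank)

# seq_len = 10 over 4 SP ranks -> [3, 3, 2, 2], which sums back to 10.
assert [shard_len(10, 4, r) for r in range(4)] == [3, 3, 2, 2]
assert sum(shard_len(10, 4, r) for r in range(4)) == 10
```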
```python
    return prev_rank


def get_actual_tensor_model_parallel_world_size():
```
I'd recommend renaming `actual` -> `kv`, but it's just a naming issue.
Makes sense. I will fix this.
```python
    (existing_sid, sid) if existing_sid > sid else (sid, existing_sid)
)
q_data = qs[i]
kv_data = torch.stack(owned_shards[j], dim=1)
```
Instead of a stack, can we: 1. allocate `kv_data` together; 2. construct views for `k_data` and `v_data`? (`k_data = kv_data[:, :k_head, :]`, `v_data = kv_data[:, k_head:, :]`) This avoids a potentially large memcpy.
I should have used the ragged attention kernel here, which avoids creating this redundant `kv_data`. Will push a fix for this soon.
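For reference, a minimal sketch of the reviewer's suggestion (shapes and names hypothetical, not the PR's actual code): allocate the joint buffer once, land each incoming shard in its slice, and expose `k_data`/`v_data` as views so no stack is needed downstream:

```python
import torch

seq_len, num_kv_heads, head_dim = 128, 8, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-ins for the 2 * num_kv_heads K/V shards received over the ring.
owned_shards = [torch.randn(seq_len, head_dim, device=device)
                for _ in range(2 * num_kv_heads)]

# Allocate kv_data together ...
kv_data = torch.empty(seq_len, 2 * num_kv_heads, head_dim, device=device)
for j, shard in enumerate(owned_shards):
    # ... ideally the communication recv would target this slice directly,
    # so even this copy disappears.
    kv_data[:, j, :].copy_(shard)

# k_data / v_data are views into the same storage: no extra memcpy.
k_data = kv_data[:, :num_kv_heads, :]
v_data = kv_data[:, num_kv_heads:, :]
```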
```python
kv_indices = torch.arange(
    0, torch.sum(seq_lens), dtype=torch.int32, device="cuda"
)
```
This seems pretty hacky and I don't understand why `arange` is correct...
We will get rid of this magic after switching to the ragged attention kernel for the non-causal attention parts.
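For context, one reading of why the `arange` can work in this test path (an assumption, not confirmed in the thread): with KV cache management disabled, the batch's KV entries are packed contiguously from slot 0, so the page-index table degenerates to the identity mapping:

```python
import torch

# Hypothetical packed layout with page_size = 1: sequence i occupies slots
# [offsets[i], offsets[i] + seq_lens[i]) in the KV pool, back to back from 0.
seq_lens = torch.tensor([3, 5, 2], dtype=torch.int32)
kv_indptr = torch.cat([torch.zeros(1, dtype=torch.int32),
                       torch.cumsum(seq_lens, dim=0, dtype=torch.int32)])
# With that packing, the flattened slot ids are exactly 0..sum(seq_lens)-1,
# i.e. an arange; any other allocation order would need the real slot ids.
kv_indices = torch.arange(0, int(seq_lens.sum()), dtype=torch.int32)
```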
ZYHowell left a comment
LGTM. Shall we merge this and do decoding in the next PR?
Sounds good
* test: test cases of combining multiple attention kernel calls to implement a sequence parallel kernel. Verified with 2 sp workers
* fix: simplify flashinfer kernel initialization (begin_forward() and end_forward())
* test: add logic for sp worker 1, which is basically the same but with different orders of kernel calls
* chore: format tweak
* feat: a general seq parallel attention kernel that achieves workload balance
* fix: minor tweak to loop iteration within ring attention
* feat [radix_attention]: seq_parallel kernel with sync communication. TODO: turn communication into async fashion and overlap it with computation
* test: update test cases for seq parallel attn kernel. Need to disable kv cache management before testing because we haven't implemented kv cache management for seq parallel yet
* chore [radix_attention]: format tweak
* feat: async communication within ring attention
* fix [parallel_utils]: add missed files
* fix [infer_batch]: set default values for newly added sp-related metadata
* fix [bench_latency]: minor fixes to input args
* feat [parallel_utils]: get actual tp rank and size when both TP and SP are enabled
* feat [linear]: add QKVParallelLinear
* feat [llama2]: update llama model to use our QKVParallelLinear
* feat [model_runner]: initialize model parallel with sequence parallel
* fix [infer_batch]: 1. a minor issue when calling get_prefill_indices; 2. flashinfer initialization args
* fix [bench_latency]: load model with sp_rank
* feat [radix_attention]: automatically dispatch to seq-parallel attn kernel when sp_size > 1
* debug: stash current debug changes
* fix [radix_attention]: reshape q tensor before running the kernel
* bug fix for sp layout types
* fix: adjust tensor layout. TODO: fix many dirty hacks and hardcoded values
* fix [wip]: disable p2p communication within ring attention for now. TODO: fix the bug that causes communication hang
* chore [bench_latency]: disable decode for now since we haven't supported it
* upstream with correct prefill sp layout
* fix early exit on decode SP
* chore: tweak format
* update layout
* bug fix
* fix [linear, radix_attention]: fix q head indexes per SP worker to align with GQA setting
* fix [infer_batch]: set up flashinfer kernels for the batch size > 1 case
* chore: tweak format
* fix [radix_attention]: revert commented-out kv cache store operations in normal attention
* fix: adjust k, v tensor shape to align with both TP and SP setting
* chore [llama2]: minor adjustment
* fix: update bench_latency to evenly distribute each sequence across all SP workers to avoid the layout issue
* test: update test cases to align with current kernel args
* fix [model_runner]: initialize TokenToKVPool with correct num_heads and enable KV cache store in SP attention
* chore [radix_attention]: clean up comments
* fix [model_runner]: correct num_heads in memory profiling as well to avoid OOM
* fix [infer_batch]: adopt SP KV cache allocation
* feat [linear]: correctly partition q proj along the num_heads dimension with GQA
* chore [llama2]: clean up stable variables
* feat [infer_batch]: adjust positions to SP layout when preparing input_metadata
* feat [infer_batch]: use dedicated paged attn kernel for cross-SP-shard attn
* feat [parallel_state]: create sequence parallel comm groups
* test [sp_comm_group]: simple test case with sp_size = 2
* doc [parallel_state]: doc string for our SP group organization
* fix [infer_batch]: add padding zeros to positions tensor and out_cache_loc to fix positional encoding and KV cache store
* feat [radix_attn, infer_batch]: create masks for padded sequences; now attn works for unevenly-distributed sequences too
* chore [bench_latency]: revert original prompts
* fix [parallel_state]: rename "actual" to "kv"
* refactor [radix_attention]: unified two cases with different comm-comp tradeoffs
* chore: rename "actual_tp_[size|rank]" to "kv_tp_[size|rank]"
* fix [infer_batch]: ensure prefix_lens is not None in init_flashinfer_args
* fix [infer_batch]: only pad positions and out_cache_loc for prefill
* chore [linear]: clean up and revise comments
* chore [parallel_state]: revise comments
* chore [linear]: revise comments and class names
* chore [radix_attention]: add defensive checks

---------

Co-authored-by: ZYHowell <yhzhuang@cmu.edu>
This PR mainly implements the `RadixAttention.seq_parallel_extend_forward_flashinfer()` method. We adopted the parallelization strategy discussed before and overlapped communication and computation in each iteration within the ring attention algorithm. Specifically, each SP worker holds:

- a shard of the query heads along the `q_head_num` dimension, assuming that `q_head_num` is divisible by `SP_SIZE`;
- a sequence shard of length `seq_len // SP_SIZE`, which should be adjusted correspondingly when `seq_len` is not divisible by `SP_SIZE`.

NOTE: for now the kernel is only tested when `seq_len` is divisible by `SP_SIZE`. We will update the code later.

To balance the workload of all workers, we schedule the computation tasks in the following way: at iteration `i` (starting from 0), each SP worker computes the self-attention for its currently active shard (calling the ragged attention kernel) and `i` cross-shard attentions (calling the paged attention kernel). SP workers therefore have a perfectly balanced workload in each iteration, although the workload per iteration increases step by step. We will need to further investigate the performance impact of this design.
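To make the schedule concrete, here is a small self-contained emulation of one worker's task order: pure PyTorch on CPU, with log-sum-exp merging standing in for the real ragged/paged kernels and ring communication, and all names hypothetical rather than the PR's actual functions:

```python
import torch

def attn_partial(q, k, v, causal=False):
    """One partial attention call: returns the shard's output and its LSE."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5      # [h, n, m]
    if causal:
        n, m = scores.shape[-2:]
        mask = torch.ones(n, m).tril().bool()
        scores = scores.masked_fill(~mask, float("-inf"))
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)        # [h, n, 1]
    return torch.exp(scores - lse) @ v, lse

def merge(parts):
    """Combine partial softmax results across shards via log-sum-exp."""
    outs, lses = zip(*parts)
    lse = torch.logsumexp(torch.stack(lses), dim=0)
    return sum(o * torch.exp(l - lse) for o, l in zip(outs, lses))

def sp_worker_schedule(qs, kv_shards):
    """qs[i]: this worker's q-head slice for sequence shard i.
    Iteration i: one causal self-attention on the active shard ("ragged")
    plus i full cross-shard attentions against earlier shards ("paged")."""
    results = []
    for i in range(len(qs)):
        parts = [attn_partial(qs[i], *kv_shards[i], causal=True)]
        for j in range(i):
            parts.append(attn_partial(qs[i], *kv_shards[j]))
        results.append(merge(parts))
    return torch.cat(results, dim=1)

# Check against full causal attention: 2 heads, 4 shards of 8 tokens each.
torch.manual_seed(0)
h, d, sp, s = 2, 16, 4, 8
q, k, v = (torch.randn(h, sp * s, d) for _ in range(3))
qs = [q[:, i * s:(i + 1) * s] for i in range(sp)]
kvs = [(k[:, i * s:(i + 1) * s], v[:, i * s:(i + 1) * s]) for i in range(sp)]
ref, _ = attn_partial(q, k, v, causal=True)
assert torch.allclose(sp_worker_schedule(qs, kvs), ref, atol=1e-5)
```

This also shows why the workload is balanced: every worker executes exactly `1 + i` kernel calls at iteration `i`, regardless of its rank.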