Does the input sharding match exact optimization of long sequence? #3

Closed
guanzhchen opened this issue Apr 6, 2024 · 2 comments

@guanzhchen

Thanks for your exciting work!

I noticed that the extract_local function seems to split an input sequence of length L into chunks of L/world_size tokens. Are the parameters optimized (via backward) on each chunk rather than on the whole long sequence? Have you checked whether this introduces any approximation error, or is the optimization equivalent to training on the full length?
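
For reference, here is a minimal sketch of the kind of splitting I mean (the function name, signature, and exact chunking scheme are my guess for illustration, not the repository's actual code):

```python
# Minimal sketch, assuming a zigzag-style split: the sequence is cut into
# 2 * world_size chunks and each rank keeps chunk `rank` plus its mirror,
# so every rank holds L / world_size tokens with a balanced causal workload.
import torch

def extract_local_zigzag(value: torch.Tensor, rank: int, world_size: int, dim: int = 1) -> torch.Tensor:
    chunks = value.chunk(2 * world_size, dim=dim)
    return torch.cat([chunks[rank], chunks[2 * world_size - 1 - rank]], dim=dim)

# Example: 16 tokens on 4 ranks -> rank 0 holds chunks 0 and 7 (tokens 0, 1, 14, 15).
x = torch.arange(16).view(1, 16)
print(extract_local_zigzag(x, rank=0, world_size=4))  # tensor([[ 0,  1, 14, 15]])
```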

@jzhang38
Owner

jzhang38 commented Apr 6, 2024

Are the parameters optimized (via backward) on each chunk rather than on the whole long sequence?

The whole sequence.

Have you checked whether this introduces any approximation error, or is the optimization equivalent to training on the full length?

See the correctness test in ring-flash-attention, which compares the sharded attention outputs and gradients against a full-sequence reference:
https://github.com/zhuzilin/ring-flash-attention/blob/55ff66fd35f329dfcc24ce7a448bfdd532865966/test/test_zigzag_ring_flash_attn_func.py#L121
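
For intuition, a rough sketch of the sort of check that test performs, reusing the illustrative extract_local_zigzag above (run with torchrun, one GPU per rank; the exact signatures, shapes, and tolerances here are assumptions, not the test's actual code):

```python
import torch
import torch.distributed as dist
from flash_attn import flash_attn_func                    # full-sequence reference
from ring_flash_attn import zigzag_ring_flash_attn_func   # sharded ring version

dist.init_process_group("nccl")
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

B, L, H, D = 1, 4096, 8, 64
q = torch.randn(B, L, H, D, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
for t in (q, k, v):
    dist.broadcast(t, src=0)          # every rank sees the same full sequence
q.requires_grad_(); k.requires_grad_(); v.requires_grad_()

# Reference: attention and gradients over the whole sequence on one device.
flash_attn_func(q, k, v, causal=True).sum().backward()
dq_ref = extract_local_zigzag(q.grad, rank, world_size)

# Ring version: each rank only materializes its L / world_size shard, but the
# ring communication during forward/backward still covers the full sequence.
local_q = extract_local_zigzag(q.detach().clone(), rank, world_size).requires_grad_()
local_k = extract_local_zigzag(k.detach().clone(), rank, world_size).requires_grad_()
local_v = extract_local_zigzag(v.detach().clone(), rank, world_size).requires_grad_()
zigzag_ring_flash_attn_func(local_q, local_k, local_v, causal=True).sum().backward()

# If the sharded implementation is exact, each rank's gradient shard matches
# the corresponding shard of the full-sequence gradient (up to fp tolerance),
# i.e. the optimization target is the whole sequence, not the local chunk.
assert torch.allclose(local_q.grad, dq_ref, atol=1e-2, rtol=1e-2)
```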

@guanzhchen
Author

That makes sense! Thank you!
