[AutoTP] Make AutoTP work when num_heads not divisible by number of workers #4011
Conversation
Can we add a unit test to verify for an odd number of devices? Perhaps extend this test class:

```python
class TestAutoTensorParallelism(DistributedTest):
```

to add something like:
```python
    @pytest.mark.world_size(3)
    def test_odd_world_size(
        self,
        model_w_task,
        query,
        inf_kwargs,
        assert_fn,
        dtype,
    ):
        invalid_test_msg = validate_test(model_w_task, dtype, enable_cuda_graph=False, enable_triton=False)
        if invalid_test_msg:
            pytest.skip(invalid_test_msg)

        model, task = model_w_task
        local_rank = int(os.getenv("LOCAL_RANK", "0"))
        world_size = int(os.getenv("WORLD_SIZE", "2"))

        pipe = pipeline(task, model=model, device=torch.device("cpu"), framework="pt")
        bs_output = pipe(query, **inf_kwargs)

        pipe.model = deepspeed.init_inference(pipe.model, mp_size=world_size, dtype=dtype)
        # Switch device to GPU so that input tensors are not on CPU
        pipe.device = torch.device(get_accelerator().device_name(local_rank))
        ds_output = pipe(query, **inf_kwargs)

        print(local_rank, "baseline", bs_output)
        print(local_rank, "deepspeed", ds_output)
        assert assert_fn(bs_output, ds_output)
```
@mrwyattii Test added. There is a result mismatch assertion in the test, and I can also reproduce this assertion with CPU+BF16. I will need some time to debug this issue.
@mrwyattii @molly-smith I have identified the issue causing the result mismatch and fixed it. Can you help restart the workflow? Thanks!
Hi @mrwyattii @molly-smith, the test failure is fixed. Can you help restart the CI workflow? Thanks!
@delock approved the PR, but there is a merge conflict. Can you resolve that? The PR will auto-merge after!
@delock, can you please help with the merge conflict?
Head branch was pushed to by a user without write access
@mrwyattii @tjruwase the conflict is resolved, thanks!
Conflict with lm_head parallelism resolved, and uneven sharding support added for lm_head parallelism.
Hi @mrwyattii @tjruwase, the recent merge conflict has been resolved, and we also support uneven sharding for lm_head tensor parallelism. Can you take a quick look at whether it can be put into the merge queue? Thanks!
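As a rough sketch of what uneven lm_head sharding could mean, the snippet below splits the weight of a `torch.nn.Linear` lm_head along its input (hidden) dimension using the same uneven per-rank sizes as the attention shards. This is an illustration under those assumptions, not the PR's actual implementation.

```python
# Illustration only: assumes lm_head is a torch.nn.Linear whose weight is
# sharded along the input (hidden) dimension, reusing the same uneven
# per-rank sizes as the attention shards; not the actual DeepSpeed code.
import torch

hidden_size, vocab_size = 4096, 32000
lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False)

# Uneven shard sizes for 16 heads on 3 ranks (head_dim = 256): 6/5/5 heads.
sizes = [1536, 1280, 1280]
assert sum(sizes) == hidden_size

# Each rank would keep only its slice of the weight's input dimension.
weight_shards = torch.split(lm_head.weight, sizes, dim=1)
print([tuple(w.shape) for w in weight_shards])
# [(32000, 1536), (32000, 1280), (32000, 1280)]
```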
[AutoTP] Make AutoTP work when num_heads not divisible by number of workers (microsoft#4011)

* allow number of heads not divisible by number of ranks
* get num_heads from model config, more robust
* simplify logic where num_head itself is sharded
* name tweaks
* make code more robust where num_attention_heads may not be defined in model_config
* support num_key_value_heads < num_attention_heads which is used by llama2
* add test for 5 ranks
* change odd rank # to 3 to avoid test skip
* add get_shard_size function
* modify sharding mechanism according to latest auto TP
* fix accuracy issue
* fix format
* skip tests with fusedqkv
* remove skip of fusedqkv tests
* skip test fusedqkv with odd number of ranks
* support model with n_heads in model_config
* fix TestInjectionPolicy::test[fp32-t5]
* fix uneven_heads on some fusedqkv types (microsoft#12)
* odd support fusedqkv
* fix format and clear text
* better fix when activation size cannot be divided by number of heads
* move tp_shard.py under module_inject
* Add get_num_kv_heads in tp_shard.py
* Refine according to comments
* remove old comment
* fix bug in getting num_kv_heads
* support uneven sharding of lm_head tensor parallel

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: mzl <mingzhi.liu@intel.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Currently AutoTP asserts when num_heads is not divisible by the number of workers. However, in some situations this may be exactly what the user intends, e.g. having three compute devices for a model with 16 heads and wanting to put all three devices to work. In that case each worker processes 5 or 6 heads, which is still better than running the workload on only two devices with 8 heads each and leaving the third device idle.
This PR distributes attention heads across workers as evenly as possible and shards hidden_size according to that distribution, allowing AutoTP to run out of the box even when the number of heads is not divisible by the number of devices installed on the system.
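For illustration, here is a minimal sketch of the even-as-possible distribution. The helper names split_heads and shard_sizes are hypothetical, used only for this sketch; the PR itself adds a get_shard_size function under module_inject/tp_shard.py.

```python
# Hypothetical illustration of the even-as-possible head distribution
# described above; not the actual DeepSpeed API.

def split_heads(num_heads, num_ranks):
    """Give each rank floor(num_heads / num_ranks) heads; the first
    num_heads % num_ranks ranks each get one extra head."""
    base, remainder = divmod(num_heads, num_ranks)
    return [base + (1 if rank < remainder else 0) for rank in range(num_ranks)]

def shard_sizes(hidden_size, num_heads, num_ranks):
    """Shard hidden_size according to the per-rank head counts."""
    head_dim = hidden_size // num_heads
    return [heads * head_dim for heads in split_heads(num_heads, num_ranks)]

if __name__ == "__main__":
    print(split_heads(16, 3))        # [6, 5, 5] heads per rank
    print(shard_sizes(4096, 16, 3))  # [1536, 1280, 1280] hidden columns per rank
```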