
[BUG] partition_balanced returns wrong result. #4312

Merged
merged 10 commits into microsoft:master on Dec 8, 2023

Conversation

zjjMaiMai
Contributor

Background

In pipeline parallelism, DeepSpeed uses `ds_utils.partition_balanced` to partition the model's layers across pipeline stages, balancing either by parameter count or by layer class name.

https://github.com/microsoft/DeepSpeed/blob/581e44dd1ab3c409a5905335867c761d5cb4db5b/deepspeed/runtime/pipe/module.py#L380-L395

```python
if method == 'uniform':
    num_layers = len(self._layer_specs)
    self.parts = ds_utils.partition_uniform(num_items=num_layers, num_parts=num_stages)
elif method == 'parameters':
    param_counts = self._count_layer_params()
    self.parts = ds_utils.partition_balanced(weights=param_counts, num_parts=num_stages)
elif method.startswith('type:'):
    layertype = method.split(':')[1]
    binary_weights = [0] * len(self._layer_specs)
    for idx in self._find_layer_type(layertype):
        binary_weights[idx] = 1
    self.parts = ds_utils.partition_balanced(weights=binary_weights, num_parts=num_stages)
elif method == 'profile':
    raise NotImplementedError(f'Partitioning method {method} not implemented.')
else:
    raise NotImplementedError(f'Partitioning method {method} not implemented.')
```
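
To make the `type:` branch above concrete, here is a standalone sketch (the method string and layer names are made up for illustration; this is not DeepSpeed code) of how the binary weight vector gets built:

```python
# Toy illustration of the 'type:' branch: mark the layers whose class name
# matches the requested type, so partition_balanced can spread the marked
# layers evenly across pipeline stages.
method = 'type:TransformerLayer'
layer_types = ['Embedding', 'TransformerLayer', 'TransformerLayer',
               'TransformerLayer', 'LMHead']

layertype = method.split(':')[1]
binary_weights = [1 if t == layertype else 0 for t in layer_types]
print(binary_weights)  # [0, 1, 1, 1, 0]
```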

What's wrong?

```
>>> import deepspeed
>>> deepspeed.__version__
'0.10.3+542dc0d5'
>>> from deepspeed.runtime import utils as ds_utils
>>> ds_utils.partition_balanced([1, 1, 1, 1, 1], 4)
[0, 2, 4, 5, 5]
```

The result `[0, 2, 4, 5, 5]` means `[2, 2, 1, 0]` layers per part, which is not balanced at all. The last part will throw an exception because it has no parameters to train.
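
To make the boundary encoding explicit, here is a small plain-Python check (not DeepSpeed code) that converts the returned offsets into per-stage layer counts:

```python
# partition_balanced returns stage boundaries: stage i owns the layers in
# the half-open range [parts[i], parts[i + 1]).
parts = [0, 2, 4, 5, 5]
sizes = [parts[i + 1] - parts[i] for i in range(len(parts) - 1)]
print(sizes)  # [2, 2, 1, 0] -- the last pipeline stage gets no layers at all
```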

I added some unit tests for this function, and I will fix it later if anyone needs it.
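
A sketch of the kind of regression test described above (the test name and assertion are illustrative, not the exact tests added in this PR):

```python
from deepspeed.runtime import utils as ds_utils

def test_partition_balanced_no_empty_parts():
    # Five equal-weight layers over four stages: every stage should get work.
    parts = ds_utils.partition_balanced([1, 1, 1, 1, 1], 4)
    sizes = [parts[i + 1] - parts[i] for i in range(len(parts) - 1)]
    assert 0 not in sizes, 'every pipeline stage should receive at least one layer'
```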

@zjjMaiMai
Contributor Author

@microsoft-github-policy-service agree

@zjjMaiMai
Contributor Author

Already fixed! cc @tjruwase

@ShadenSmith
Contributor

Thanks for this PR, @zjjMaiMai!

A note on balance: the objective function in the original code minimizes the maximum load per partition. The maximum load on a pipeline stage determines the pipeline throughput, so the original result is also balanced in that sense. Example paper: *Fast Optimal Load Balancing Algorithms for 1D Partitioning*.
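
A quick plain-Python illustration of that objective, using the weights and boundaries from the example above (the helper below exists only for this example): both results have the same bottleneck, so the original output is optimal under the min-max metric even though it leaves a stage empty.

```python
def bottleneck(weights, parts):
    # The pipeline runs at the speed of its most heavily loaded stage.
    return max(sum(weights[parts[i]:parts[i + 1]]) for i in range(len(parts) - 1))

w = [1, 1, 1, 1, 1]
print(bottleneck(w, [0, 2, 4, 5, 5]))  # 2 -- the reported result, with an empty stage
print(bottleneck(w, [0, 2, 3, 4, 5]))  # 2 -- same bottleneck, no empty stage
```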

@tjruwase tjruwase added this pull request to the merge queue Oct 6, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 6, 2023
@tjruwase tjruwase added this pull request to the merge queue Dec 4, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 4, 2023
@tjruwase tjruwase added this pull request to the merge queue Dec 8, 2023
Merged via the queue into microsoft:master with commit 2bdf061 Dec 8, 2023
15 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024