
remove mandatory index key from output of metric_function in DataAnalysis map operation #5112

Merged

merged 14 commits into microsoft:master from data_analysis_remove_key_index Feb 15, 2024

Conversation


@bm-synth bm-synth commented Feb 10, 2024

When performing the map operation required for curriculum learning, the output of `metric_function` requires an `index` field:

```
    def update_metric_results(self, data, metric_types, metric_dtypes, metric_functions, metric_results):
        for m_idx in range(len(metric_types)):
            [...]
            if metric_type == 'single_value_per_sample':
                for row in range(metric_values.size()[0]):
                    metric_result["sample_to_metric_builder"].add_item(metric_values[row].reshape(-1))
                    metric_result["metric_to_sample_dict"][metric_values[row].item()].append(
                        data['index'][row][0].item())  # <------- data['index']??
```

The documentation makes no mention of this, i.e. it never specifies that the output of `metric_function` should be a dict/DataFrame (?) with an `index` key/column. To make things worse, there is no way for a user to specify a proper `index` value for each sample, because the distribution of samples across workers/threads is not known to the user; it is computed inside `DataAnalysis`:

```
    def run_map_helper(self, thread_id):
        start_idx, end_idx = self.thread_splits[thread_id][0], \
            self.thread_splits[thread_id][1]
        logger.info(f"worker {self.worker_id} thread {thread_id}: start working " \
            f"on data subset {start_idx} to {end_idx}")
        thread_dataset = Subset(self.dataset, list(range(start_idx, end_idx)))
        sampler = BatchSampler(SequentialSampler(thread_dataset), batch_size=self.batch_size, drop_last=False)
```

Since by design you picked a `SequentialSampler`, you know beforehand the global index of each sample of each batch of each thread of each worker by looking at

```
self.worker_splits, self.thread_splits = split_dataset(self.dataset, self.num_workers, self.worker_id,
                                                       self.num_threads)
start_idx, end_idx = thread_splits[t_idx_reduce][0], thread_splits[t_idx_reduce][1]
```

and you can populate that index value correctly, instead of asking the user to provide it.

This PR removes the need for an `'index'` key in `data` and instead uses the batch, thread, and worker ids to compute the global index of each sample.
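The arithmetic behind this can be sketched as follows (a minimal illustration, not the merged DeepSpeed code; the function name and parameters are assumptions for the example):

```python
# Minimal sketch: with a SequentialSampler, a sample's global index is fully
# determined by its thread's start offset, its batch number, and its row
# within the batch. Names here are illustrative, not DeepSpeed's API.

def global_sample_index(thread_start, batch_idx, batch_size, row):
    """Global dataset index of `row` in batch `batch_idx` of a thread whose
    sequential subset begins at dataset offset `thread_start`."""
    return thread_start + batch_idx * batch_size + row

# Example: a thread covering dataset rows [100, 200) with batch_size=8;
# row 3 of batch 2 is global sample 100 + 2*8 + 3 = 119.
assert global_sample_index(100, 2, 8, 3) == 119
```

This is why no user-provided `index` is needed in the sequential case: the split bounds already determine it.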

@bm-synth bm-synth changed the title remove 'index' key from output of metric_function in DataAnalysis remove mandatory index key from output of metric_function in DataAnalysis map operation Feb 10, 2024
@bm-synth bm-synth changed the title remove mandatory index key from output of metric_function in DataAnalysis map operation remove mandatory index key from output of metric_function in DataAnalysis map operation Feb 10, 2024
@bm-synth bm-synth changed the title remove mandatory index key from output of metric_function in DataAnalysis map operation remove mandatory index key from output of metric_function in DataAnalysis map operation [ONGOING TESTING] Feb 10, 2024
@bm-synth bm-synth marked this pull request as ready for review February 10, 2024 21:35
@bm-synth
Contributor Author

@microsoft-github-policy-service agree

@bm-synth bm-synth changed the title remove mandatory index key from output of metric_function in DataAnalysis map operation [ONGOING TESTING] remove mandatory index key from output of metric_function in DataAnalysis map operation Feb 10, 2024
@conglongli
Contributor

@bm-synth Could you resolve the conflicts? Thanks.

@bm-synth
Contributor Author

> @bm-synth Could you resolve the conflicts? Thanks.

@conglongli done

@conglongli
Contributor

> @bm-synth Could you resolve the conflicts? Thanks.
>
> @conglongli done

@bm-synth Thank you. On the other hand, after reading this PR's details, I'm concerned that your PR might not be able to replace the index key. The index key exists because the user's dataset may have a shuffling feature, so we have to ask the user to always provide an index indicating, inside the data, the exact index of this sample. Otherwise we could make a wrong connection between the data sample and the curriculum difficulty value. This PR basically assumes that the data is always in order, which might not always be the case. You can refer to how I do the data analysis here https://github.com/microsoft/Megatron-DeepSpeed/blob/6d4c535eeae782daa22583fd8abac7cec3bb60f2/examples_deepspeed/data_efficiency/gpt/ds_analyze_gpt_data_map.sh#L66 where I have to add a `--return-data-index` flag to return the actual index.

@conglongli
Contributor

@bm-synth To further clarify: in Megatron-DeepSpeed and Megatron-LM, the dataset is shuffled even before reaching the sampler https://github.com/microsoft/Megatron-DeepSpeed/blob/6d4c535eeae782daa22583fd8abac7cec3bb60f2/megatron/data/gpt_dataset.py#L597. This is why, even if we use a `SequentialSampler` for data analysis, the data could still be shuffled. Thus an index key provided by the user is needed.
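The concern can be illustrated with a toy example (a hypothetical dataset, not Megatron's code): if the dataset itself permutes indices inside `__getitem__`, a sequential sampler still yields shuffled samples, so a sample's position in iteration order is not its identity.

```python
# Hypothetical dataset that shuffles internally, before any sampler runs,
# mimicking Megatron's pre-built shuffle index map (not Megatron's code).
class PreShuffledDataset:
    def __init__(self, values):
        # A fixed, non-trivial permutation standing in for the shuffle map.
        self.perm = [(i * 5) % len(values) for i in range(len(values))]
        self.values = values

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        # Sequential idx 0, 1, 2, ... yields a permuted underlying sample.
        return self.values[self.perm[idx]]

ds = PreShuffledDataset(list(range(6)))
order = [ds[i] for i in range(len(ds))]
print(order)  # [0, 5, 4, 3, 2, 1] -- not 0..5: position != sample identity
```

Here the sequential positions 0..5 do not recover the original sample ids, which is exactly why an explicit `index` field is needed in this scenario.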

@conglongli
Contributor

@bm-synth After some more thinking, I think there is still value in your approach and it should work in many cases. So my proposal is: we keep the index key and use it when the user provides it; otherwise we use your approach.

@bm-synth
Contributor Author

bm-synth commented Feb 15, 2024

> @bm-synth To further clarify: in Megatron-DeepSpeed and Megatron-LM, the dataset is shuffled even before reaching the sampler https://github.com/microsoft/Megatron-DeepSpeed/blob/6d4c535eeae782daa22583fd8abac7cec3bb60f2/megatron/data/gpt_dataset.py#L597. This is why even if we use a SequentialSampler for data analysis, the data could still be shuffled. Thus an index key provided by user is needed.

@conglongli OK, I saw it in the Megatron source code:

```
        if args.return_data_index:
            sample_dict.update({'index': np.array([orig_idx], dtype=np.int64)})
```

where `orig_idx` is defined as:

```
class GPTDataset(torch.utils.data.Dataset):

    def __getitem__(self, idx):
        args = get_args()
        orig_idx = idx
```
So `orig_idx` seems to be the pre-shuffling index, i.e. the index in the `Dataset` before the dataloader shuffles the samples.
And that makes sense: for the map-reduce we're just mapping and outputting the difficulties for the samples; the order is irrelevant.

Here is a quick script I wrote to test my theory (try with `shuffle=False` and `shuffle=True`):

```
import torch

class Dataset(torch.utils.data.Dataset):
    def __init__(self):
        self.values = list(range(11))

    def __len__(self):
        return 11

    def __getitem__(self, idx):
        return idx, self.values[idx]

s = Dataset()
loader = torch.utils.data.DataLoader(s, batch_size=1, shuffle=True, num_workers=0)

for i in loader:
    print(i)
```

With `shuffle=True`:

```
[tensor([3]), tensor([3])]
[tensor([1]), tensor([1])]
[tensor([0]), tensor([0])]
[tensor([5]), tensor([5])]
[tensor([6]), tensor([6])]
[tensor([10]), tensor([10])]
[tensor([7]), tensor([7])]
[tensor([9]), tensor([9])]
[tensor([8]), tensor([8])]
[tensor([2]), tensor([2])]
[tensor([4]), tensor([4])]
```

and with `shuffle=False`:

```
[tensor([0]), tensor([0])]
[tensor([1]), tensor([1])]
[tensor([2]), tensor([2])]
[tensor([3]), tensor([3])]
[tensor([4]), tensor([4])]
[tensor([5]), tensor([5])]
[tensor([6]), tensor([6])]
[tensor([7]), tensor([7])]
[tensor([8]), tensor([8])]
[tensor([9]), tensor([9])]
[tensor([10]), tensor([10])]
```

This shows that the `idx` passed to `__getitem__` in the `Dataset` class is the global index, pre-shuffling.

Also, thinking about it: your `DataAnalyzer` only takes a dataset, which you must then pass to `deepspeed.initialize()`, which returns a `DataLoader`. Usually shuffling is specified in the user's dataloader (`DataLoader(dataset, sampler=None, ...)`). So this should still work. However,

> in Megatron-DeepSpeed and Megatron-LM, the dataset is shuffled even before reaching sampler

I looked at that particular code and I believe it changes the problem: this is a "non-standard" shuffling procedure done outside the `DataLoader`, and adding the index parameter only suits your Megatron case. (I'm not 100% sure of this, tbh; is it?)

So I added a new commit where I support:

  • the Megatron use case: if an `index` field exists in the dataset items;
  • any other user-defined ordering, if `sample_indices` is provided when constructing `DataAnalyzer`;
  • as default behaviour, the indices given by the original order.
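The resolution order above could be sketched like this (a hypothetical helper written for illustration, not the merged DeepSpeed code):

```python
def resolve_sample_index(batch, row, global_row_idx, sample_indices=None):
    """Pick the index for one sample, in the priority order described above:
    1. an 'index' field in the batch (Megatron use case),
    2. a user-supplied sample_indices mapping (user-defined ordering),
    3. the original sequential position (default behaviour)."""
    if isinstance(batch, dict) and 'index' in batch:    # Megatron use case
        return int(batch['index'][row][0])
    if sample_indices is not None:                      # user-defined ordering
        return int(sample_indices[global_row_idx])
    return global_row_idx                               # default: original order

# Default behaviour: the global position itself is the index.
assert resolve_sample_index({'data': [1, 2]}, 0, 7) == 7
# A user-provided mapping wins over the default.
assert resolve_sample_index({'data': [1, 2]}, 0, 1, sample_indices=[10, 20]) == 20
# An 'index' field in the batch wins over everything.
assert resolve_sample_index({'index': [[42]]}, 0, 0, sample_indices=[10]) == 42
```

The point of the priority order is backward compatibility: existing Megatron-style datasets keep working unchanged, while plain datasets no longer need to fabricate an index.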

@conglongli conglongli added this pull request to the merge queue Feb 15, 2024
Merged via the queue into microsoft:master with commit 2d0a6bc Feb 15, 2024
12 checks passed
@bm-synth bm-synth deleted the data_analysis_remove_key_index branch February 15, 2024 16:32
github-merge-queue bot pushed a commit that referenced this pull request Feb 16, 2024
Added missing `ininstance` check in #5112.

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
…aAnalysis` map operation (microsoft#5112)

mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
…aAnalysis` map operation (microsoft#5112)

rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024