Distributed in-memory map-reduce for data analyzer #5129

Merged
merged 39 commits into microsoft:master from distributed_data_analyzer on Feb 20, 2024

Conversation

@bm-synth (Contributor) commented Feb 14, 2024

Adds the class `DistributedDataAnalyzer`, which implements a map-reduce in distributed memory.

  • Instead of writing hundreds or thousands of temp files as intermediate storage during the map/reduce, as in `DataAnalyzer`, each node holds disjoint, ordered subsets of `(metric, sample id)` pairs as a distributed tensor.
  • It also removes the need to specify `metric_dtypes`, as they are automatically inferred from the return value of `metric_function(data)` (a minimal sketch of this follows the list).
  • It removes the need for a distributed file system that all nodes can write to: here only rank 0 does the writing.
  • It is much faster than the original map-reduce based on writing and loading several temp files to disk, requires less memory, produces no temp files, and is simpler.
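
A minimal sketch of the dtype-inference idea, assuming PyTorch; `infer_metric_dtype` and `sample` are placeholder names, not the actual `DistributedDataAnalyzer` API:

```python
import torch

def infer_metric_dtype(metric_function, sample):
    # Hypothetical helper: run the metric once and let torch pick the dtype
    # (e.g. a Python int maps to torch.int64, a Python float to torch.float32).
    value = metric_function(sample)
    return value.dtype if torch.is_tensor(value) else torch.tensor(value).dtype
```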

How does it work

  • For each metric, the only result storage is `metric_result`, a list of `(sample_id, metric_value)` tuples.
    • `metric_result` is converted to a 2D tensor once the whole dataset has been iterated.
  • `sample_idx_dtype` and `metric_value_dtype` are collected with `all_reduce` operations (`op=MIN` and `op=MAX`) across the `metric_result` of all nodes.
  • Each node holds a `metric_result` tensor of `N` samples as `N x (metric, sample)`, sorted by metric, with different metric values across nodes. E.g.:
    • node 1 holds `[[1,20], [1,30], [2,10], [2,30]]`, node 2 holds `[[3,10], [3,20], [3,15], [3,40]]`, and node 3 holds `[[4,20], [4,30], [5,25], [5,50]]`.
    • To convert the list of `(metric, sample)` pairs to a `dict{metric: [samples]}`, each node iterates only its own data, since dictionary keys do not overlap across nodes. In this example, node 1 builds `{1: [20, 30], 2: [10, 30]}`, node 2 builds `{3: [10, 20, 15, 40]}`, and node 3 builds `{4: [20, 30], 5: [25, 50]}`.
  • To write the merged files: (1) rank 0 opens the file, (2) it iteratively receives buffers of values, dict keys and dict values from the other ranks and writes them, and (3) it closes the file. A minimal sketch of the map and grouping steps follows this list.
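
A minimal sketch of the map and grouping steps described above, assuming PyTorch with `torch.distributed` already initialized, integer-valued metrics, and placeholder names `dataset_shard` and `metric_function`; the real `DistributedDataAnalyzer` code may differ in names and details:

```python
import torch
import torch.distributed as dist

def map_local_metric(dataset_shard, metric_function):
    # Map phase on one node: collect (sample_id, metric_value) tuples,
    # then convert them to a 2D tensor once the shard has been iterated.
    metric_result = [(sid, metric_function(sample)) for sid, sample in dataset_shard]
    local = torch.tensor(metric_result, dtype=torch.int64)  # shape (N, 2)

    # Agree on global minima/maxima across nodes so the smallest safe dtypes
    # (sample_idx_dtype, metric_value_dtype) can be chosen later.
    lo = local.min(dim=0).values.clone()
    hi = local.max(dim=0).values.clone()
    dist.all_reduce(lo, op=dist.ReduceOp.MIN)
    dist.all_reduce(hi, op=dist.ReduceOp.MAX)
    return local, lo, hi

def group_by_metric(sorted_rows):
    # After the distributed sort, each node holds (metric, sample_id) rows
    # sorted by metric, and metric values do not overlap across nodes, so
    # building {metric: [samples]} is a purely local operation.
    groups = {}
    for metric, sample_id in sorted_rows.tolist():
        groups.setdefault(metric, []).append(sample_id)
    return groups
```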

Future work

Ideally, one could take this and do the curriculum setup on the fly when calling `deepspeed.initialize`, i.e. without writing/loading map-reduce files and without forcing the user to call `.map()` and `.reduce()` beforehand. The whole map-reduce takes less than 10 seconds, so this is entirely feasible.

References

  • `file_write_ordered()` implements a sequential shared write similar to [`MPI_File_write_ordered`](https://www.open-mpi.org/doc/v3.0/man3/MPI_File_write_ordered.3.php). However, it is adapted to communicate and write a list of tensors instead of a single tensor, and to have only rank 0 write to the file instead of using a shared file pointer (a sketch follows this list).
  • `dist_sample_sort()` implements a distributed sample sort, as detailed [here](https://brunomaga.github.io/Distributed-Sort) and illustrated below. The ranges computed in step 3 guarantee disjoint subsets of keys (metric values) across nodes (a sketch follows the figure).
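
A minimal sketch of the rank-0 ordered write, assuming `torch.distributed` point-to-point `send`/`recv` of CPU tensors (e.g. the gloo backend), flat `int64` buffers, and raw-byte serialization; the real `file_write_ordered()` works with the dataset builders and may differ substantially:

```python
import torch
import torch.distributed as dist

def file_write_ordered(tensors, path):
    # Sequential shared write: rank 0 writes its own buffers first, then
    # receives and writes every other rank's buffers, in rank order.
    rank, world = dist.get_rank(), dist.get_world_size()
    if rank == 0:
        with open(path, "wb") as f:
            for t in tensors:
                f.write(t.flatten().to(torch.int64).cpu().numpy().tobytes())
            for src in range(1, world):
                count = torch.zeros(1, dtype=torch.int64)
                dist.recv(count, src=src)      # number of buffers rank `src` will send
                for _ in range(int(count)):
                    size = torch.zeros(1, dtype=torch.int64)
                    dist.recv(size, src=src)   # length of the next buffer
                    buf = torch.empty(int(size), dtype=torch.int64)
                    dist.recv(buf, src=src)
                    f.write(buf.numpy().tobytes())
    else:
        dist.send(torch.tensor([len(tensors)]), dst=0)
        for t in tensors:
            flat = t.flatten().to(torch.int64).cpu()
            dist.send(torch.tensor([flat.numel()]), dst=0)
            dist.send(flat, dst=0)
```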

![sample_sort](https://github.com/microsoft/DeepSpeed/assets/150697676/53828103-370f-4f3b-9074-3e3bb8603000)
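
A minimal sketch of the sample-sort idea shown above, assuming CPU `int64` tensors of `(metric, sample_id)` rows and a `torch.distributed` backend that supports `all_to_all` for them (e.g. MPI); splitter selection and edge cases are simplified relative to the real `dist_sample_sort()`:

```python
import torch
import torch.distributed as dist

def dist_sample_sort(rows):
    # rows: (N, 2) int64 tensor of (metric, sample_id) pairs on this rank.
    world = dist.get_world_size()

    # 1. sort locally by metric (column 0)
    rows = rows[rows[:, 0].argsort()]

    # 2. every rank contributes a few evenly spaced metric keys as samples
    idx = torch.linspace(0, len(rows) - 1, steps=world, dtype=torch.int64)
    local_samples = rows[idx, 0]
    all_samples = [torch.empty_like(local_samples) for _ in range(world)]
    dist.all_gather(all_samples, local_samples)

    # 3. pick world-1 splitters from the gathered, globally sorted samples;
    #    the resulting key ranges are disjoint across ranks
    samples = torch.cat(all_samples).sort().values
    splitters = samples[torch.arange(1, world) * world - 1]

    # 4. bucket local rows by destination rank and exchange them
    dest = torch.bucketize(rows[:, 0], splitters)
    send = [rows[dest == r] for r in range(world)]
    recv_sizes = [torch.zeros(1, dtype=torch.int64) for _ in range(world)]
    dist.all_to_all(recv_sizes, [torch.tensor([len(s)]) for s in send])
    recv = [torch.empty(int(n), 2, dtype=torch.int64) for n in recv_sizes]
    dist.all_to_all(recv, send)

    # 5. final local sort of everything this rank received
    out = torch.cat(recv)
    return out[out[:, 0].argsort()]
```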

@bm-synth changed the title from "[DRAFT] Distributed data analyzer" to "[DRAFT] Distributed map-reduce in data analyzer" on Feb 16, 2024
@bm-synth changed the title from "[DRAFT] Distributed map-reduce in data analyzer" to "[DRAFT] Distributed map-reduce for data analyzer" on Feb 18, 2024
@bm-synth changed the title from "[DRAFT] Distributed map-reduce for data analyzer" to "Distributed map-reduce for data analyzer" on Feb 18, 2024
@bm-synth changed the title from "Distributed map-reduce for data analyzer" to "Distributed in-memory map-reduce for data analyzer" on Feb 18, 2024
@bm-synth marked this pull request as ready for review on February 19, 2024 00:51
@conglongli self-assigned this on Feb 19, 2024
@conglongli (Contributor) left a comment:


@bm-synth Currently we don't have enough bandwidth to do a full review of this PR. But given that this is a standalone new feature, I'm approving it for now. There is one failed CI check, and I'm not sure why.

@bm-synth (Contributor, Author) commented Feb 19, 2024

@conglongli The tests show that all .bin files match; among the .idx files, only {metric_name}_index_to_sample_percentile_merged.idx is identical, i.e. {metric}_index_to_metric.idx, {metric}_index_to_sample.idx and mod/{metric}_sample_to_metric.idx differ. I believe this is because they are written in a different order and then merged, since execution using these files is identical to the baseline.

> @bm-synth Currently we don't have enough bandwidth to do a full review of this PR.

cc @mrwyattii: ideally the user should only need to edit the config file to enable curriculum; this whole map+reduce should happen behind the scenes when calling `deepspeed.initialize`. The current map-reduce is infeasible for large datasets: it takes too long to run, outputs thousands of tiny files, and requires shared storage. It is also infeasible to run several experiments with different curriculum settings (too slow, too many folders/files). Let me know if you need more documentation or a one-to-one explanation of this PR, so that we can decide how to improve this.

@bm-synth (Contributor, Author) commented:

> There is one failed CI check, and I'm not sure why.

@conglongli this is also happening in other PRs; I think it's an issue on your CI.

@conglongli (Contributor) commented:

> There is one failed CI check, and I'm not sure why.
>
> @conglongli this is also happening in other PRs; I think it's an issue on your CI.

Yeah, we will investigate.

@loadams (Contributor) commented Feb 20, 2024

> There is one failed CI check, and I'm not sure why.
>
> @conglongli this is also happening in other PRs; I think it's an issue on your CI.
>
> Yeah, we will investigate.

This should be fixed now.

@conglongli added this pull request to the merge queue on Feb 20, 2024
Merged via the queue into microsoft:master with commit e977c7d Feb 20, 2024
12 checks passed
@bm-synth deleted the distributed_data_analyzer branch on February 21, 2024 12:35
github-merge-queue bot pushed a commit that referenced this pull request Feb 22, 2024
…yzer. (#5169)

Minor improvements of #5129:
- Writes all buffers to the output file at once, instead of iteratively (`indexed_dataset.py`, method `add_items()`); see the sketch after this list.
- Fixes the incorrect initialisation of `num_workers` and `worker_id`, which were being ignored when provided by the user.
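
A hypothetical illustration of the buffering change (not the actual `add_items()` code): serialize all buffers once and issue a single write, instead of one write per tensor.

```python
import torch

def write_buffers_at_once(file_handle, tensors):
    # Join all serialized buffers and write them in one call (illustrative only).
    data = b"".join(t.cpu().numpy().tobytes() for t in tensors)
    file_handle.write(data)
```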

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
ShellyNR pushed a commit to ShellyNR/DeepSpeed that referenced this pull request Mar 11, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024