Conversation

YazhiGao

Summary:

  • add shard idx, ranks, and size in the related configs for metadata passing
  • add CW sharding for per-rank table allocation
  • many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0

Reviewed By: dstaay-fb

Differential Revision: D30437562

@facebook-github-bot added the CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) and fb-exported labels on Sep 15, 2021
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D30437562

Summary:
Pull Request resolved: meta-pytorch/torchrec#2

* add shard idx, ranks, and size in the related configs for metadata passing
* add CW sharding for per-rank table allocation
* many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0
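The per-rank column-wise (CW) allocation above can be sketched as follows. This is a hypothetical illustration, not the TorchRec implementation; the function and field names are invented for the sketch.

```python
# Hypothetical sketch (not the TorchRec API): column-wise (CW) sharding
# metadata for one embedding table, recording per-shard index, rank, and
# column offset/size so each rank can allocate only its own slice.
def cw_shard_metadata(embedding_dim, block_size, world_size):
    """Split a table's columns into blocks of `block_size` and assign
    the blocks round-robin to ranks."""
    assert embedding_dim % block_size == 0
    num_shards = embedding_dim // block_size
    shards = []
    for idx in range(num_shards):
        shards.append(
            {
                "shard_idx": idx,
                "rank": idx % world_size,        # round-robin placement
                "col_offset": idx * block_size,  # first column of this shard
                "col_size": block_size,          # columns owned by this shard
            }
        )
    return shards

shards = cw_shard_metadata(embedding_dim=128, block_size=32, world_size=2)
# 4 shards of 32 columns each, alternating between rank 0 and rank 1
```

Each rank then allocates only the `col_size`-wide slices whose `rank` field matches its own.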

Reviewed By: dstaay-fb

Differential Revision: D30437562

fbshipit-source-id: e0b0f3647a34435408bcc8b8cb9f8d4433cfe616

facebook-github-bot pushed a commit that referenced this pull request Jan 20, 2022
Summary:
Pull Request resolved: #2

Add Criteo TorchArrow Dataset into `datasets/experimental`

This is a very first step:
1. The dataset is loaded from TSV(!) instead of Parquet.
2. It only loads the dataset; no transformations yet.
3. No missing-value handling for dense/categorical features yet.

Will add TorchArrow transformations in following diffs.
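For reference, the Criteo TSV layout is a label, 13 dense integer features, and 26 categorical features per tab-separated row. The sketch below is illustrative only (not the torchrec dataset code); the parser name and the "empty dense value → 0" choice are assumptions for the example.

```python
import csv
import io

# Illustrative sketch only (not the torchrec dataset code): parse
# Criteo-style TSV rows — label, 13 dense int features, 26 categorical
# string features, tab-separated. Empty dense fields default to 0 here
# (a placeholder choice; real missing-value handling comes later).
def parse_criteo_tsv(fileobj):
    rows = []
    for fields in csv.reader(fileobj, delimiter="\t"):
        label = int(fields[0])
        dense = [int(v) if v else 0 for v in fields[1:14]]  # 13 dense features
        sparse = fields[14:40]                              # 26 categorical features
        rows.append({"label": label, "dense": dense, "sparse": sparse})
    return rows

sample = "1\t" + "\t".join(str(i) for i in range(13)) + "\t" + "\t".join(26 * ["abc123"])
rows = parse_criteo_tsv(io.StringIO(sample + "\n"))
```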

Reviewed By: colin2328

Differential Revision: D33595316

fbshipit-source-id: 6905b3485df984f14a2019eb360b1986465f31bf
OswinC pushed a commit to OswinC/torcharrow that referenced this pull request Mar 11, 2022
Summary:
Pull Request resolved: pytorch#141

Quite convenient in the Criteo-DLRM preproc workload (ongoing work in meta-pytorch/torchrec#2), e.g. we can do
```
df["dense_features"] = (df["dense_features"] + 3).log()
```
where "dense_features" is a nested struct with 13 columns.

API rationale: Pandas supports only a limited number of numeric functions, but the numeric ops it does support (e.g. `abs`, `add`) are applicable to both Series and DataFrame:
```
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, -3], "b": [-4, -5, 6]})
>>> df["a"].abs()
0    1
1    2
2    3
Name: a, dtype: int64
>>> df.abs()
   a  b
0  1  4
1  2  5
2  3  6
>>> df["a"].add(1)
0    2
1    3
2   -2
Name: a, dtype: int64
>>> df.add(1)
   a  b
0  2 -3
1  3 -4
2 -2  7
```
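The broadcasting behavior this adds — a numeric op on a struct column applying to every child column — can be mimicked with a tiny toy class. This is a minimal sketch under assumed names (`StructColumn` is invented here), not the TorchArrow API:

```python
import math

# Hypothetical sketch (not the TorchArrow API): a struct column that
# broadcasts numeric ops over its child columns, mirroring how
# (df["dense_features"] + 3).log() applies to all 13 nested columns.
class StructColumn:
    def __init__(self, children):
        self.children = children  # dict: child name -> list of floats

    def __add__(self, scalar):
        # broadcast scalar addition to every child column
        return StructColumn(
            {name: [v + scalar for v in col] for name, col in self.children.items()}
        )

    def log(self):
        # broadcast elementwise log to every child column
        return StructColumn(
            {name: [math.log(v) for v in col] for name, col in self.children.items()}
        )

dense = StructColumn({"f1": [0.0, 1.0], "f2": [2.0, 3.0]})
shifted = (dense + 3).log()  # computes log(x + 3) on every child column
```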

Reviewed By: vancexu

Differential Revision: D33616632

fbshipit-source-id: e4a08e66ef787b002a418d0c443f1274cbc70569
facebook-github-bot pushed a commit to pytorch/torcharrow that referenced this pull request Jun 29, 2022
levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 13, 2023
Summary:
Some bug fixes during the integration test in PyPER O3:

### fix #1

`_embedding_bag_collection` (`ShardedEmbeddingBagCollection`) is not actually called by input_dist (the same input is already distributed by ShardedManagedCollisionCollection), so it never gets a chance to initiate `_input_dist`. As a result, TREC pipelining thinks it's not ready for input distribution.

This is not expected, since the module is not used in that stage anyway, nor should it be put into the fused a2a communication. With this change (https://fburl.com/code/ud8lnixv) it satisfies the assertion, while not carrying `_input_dists`, so it won't be put into the fused a2a.

### fix #2

ManagedCollisionCollection.forward is not traceable because it uses the unwrapped `KeyedJaggedTensor.from_jt_dict`. We don't care about its internal details, so just keep it atomic.

### fix #3

Due to how the remap table is set, `MCHManagedCollisionModule` doesn't support i32 id lists for now. An easy fix is to convert to i64 regardless; a more memory-efficient fix would be to change the remapper to i32 where necessary.
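The "easy fix" for #3 — unconditionally upcasting i32 ids to i64 before remapping — looks roughly like this. NumPy stands in for torch tensors here, and `remap_ids` plus the `-1` "unmapped" sentinel are invented for the sketch:

```python
import numpy as np

# Illustrative sketch (numpy stands in for torch tensors): the "easy fix" —
# unconditionally upcast an i32 id list to i64 before feeding the remapper,
# trading memory (8 bytes/id instead of 4) for compatibility.
def remap_ids(ids, remap_table):
    ids = np.asarray(ids)
    if ids.dtype == np.int32:
        ids = ids.astype(np.int64)  # remap table is keyed/stored as i64
    # ids absent from the table map to -1 (a stand-in for "unmapped")
    return np.array([remap_table.get(int(i), -1) for i in ids], dtype=np.int64)

ids_i32 = np.array([10, 20, 30], dtype=np.int32)
remapped = remap_ids(ids_i32, {10: 0, 20: 1})
```

The memory-efficient alternative mentioned above would instead keep the remapper itself in i32 when the id space allows it.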

Differential Revision: D48804332
levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 18, 2023
levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 19, 2023
levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 21, 2023
duduyi2013 pushed a commit to duduyi2013/torchrec that referenced this pull request Nov 27, 2023
facebook-github-bot pushed a commit that referenced this pull request Dec 1, 2023
Summary:
Pull Request resolved: #1541

Some bug fixes during the integration test in PyPER O3:
# fix #1
_embedding_bag_collection (ShardedEmbeddingBagCollection) is not actually called by input_dist (the same input is already distributed by ShardedManagedCollisionCollection), so it never gets a chance to initiate _input_dist. As a result, TREC pipelining thinks it's not ready for input distribution.
This is not expected, since the module is not used in that stage anyway, nor should it be put into the fused a2a communication. With this change (https://fburl.com/code/ud8lnixv) it satisfies the assertion, while not carrying _input_dists, so it won't be put into the fused a2a.
# fix #2
ManagedCollisionCollection.forward is not traceable because it uses the unwrapped KeyedJaggedTensor.from_jt_dict. We don't care about its internal details, so just keep it atomic.
# fix #3
Due to how the remap table is set, MCHManagedCollisionModule doesn't support i32 id lists for now. An easy fix is to convert to i64 regardless; a more memory-efficient fix would be to change the remapper to i32 where necessary.

Reviewed By: dstaay-fb

Differential Revision: D51601041

fbshipit-source-id: 95cf346b5247f1d5afb6643ecfd7dca4b3c4d575
sarckk added a commit to sarckk/torchrec that referenced this pull request Jul 22, 2024
Summary:
Add ability to run prefetch as a stage in `StagedTrainPipeline`

Recommended usage for running a 3-stage pipeline with data copy, sparse dist, and prefetch steps (required changes shown with arrows):
```
sdd = SparseDataDistUtil(
    model=self._model,
    data_dist_stream=torch.cuda.Stream(),
    prefetch_stream=torch.cuda.Stream(), <--- define prefetch stream
)

pipeline = [
    PipelineStage(
        name="data_copy",
        runnable=lambda batch, context: batch.to(
            self._device, non_blocking=True
        ),
        stream=torch.cuda.Stream(),
    ),
    PipelineStage(
        name="start_sparse_data_dist",
        runnable=sdd.start_sparse_data_dist,
        stream=sdd.data_dist_stream,
        fill_callback=sdd.wait_sparse_data_dist,
    ),
    PipelineStage(
        name="prefetch",
        runnable=sdd.prefetch, <--- add stage with runnable=sdd.prefetch
        stream=sdd.prefetch_stream,
        fill_callback=sdd.load_prefetch, <--- fill_callback of sdd.load_prefetch
    ),
]

return StagedTrainPipeline(pipeline_stages=pipeline)
```

Order of execution for above pipeline:

Iteration #1:

_fill_pipeline():
batch 0: memcpy, start_sdd, wait_sdd (callback), prefetch, load_prefetch (callback)
batch 1: memcpy, start_sdd, wait_sdd (callback)
batch 2: memcpy

progress():
batch 3: memcpy
batch 2: start_sdd
batch 1: prefetch

after pipeline progress():
model(batch 0)
load_prefetch (prepares for model fwd on batch 1)
wait_sdd (prepares for batch 2 prefetch)

Iteration #2:
progress():
batch 4: memcpy
batch 3: start_sdd
batch 2: prefetch

after pipeline progress():
model(batch 1)
load_prefetch (prepares for model fwd on batch 2)
wait_sdd (prepares for batch 3 prefetch)
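The fill/progress ordering above can be captured by a toy model (the class and field names are invented for this sketch; this is not the TorchRec implementation): with 3 stages (memcpy, start_sdd, prefetch), `_fill_pipeline()` leaves batch 0 fully staged, batch 1 one stage behind, and so on; each `progress()` admits one new batch and advances every buffered batch by one stage.

```python
# Toy model (not the TorchRec implementation) of a k-stage pipeline:
# after fill(), batch i has completed (k - i) stages; each progress()
# admits one new batch and advances every unfinished batch by one stage.
class ToyStagedPipeline:
    def __init__(self, num_stages):
        self.num_stages = num_stages
        self.progress_of = {}  # batch id -> number of stages completed
        self.next_batch = 0

    def fill(self):
        # _fill_pipeline(): batch 0 completes all stages, batch 1 all but
        # the last, ..., the newest batch only the first stage.
        for i in range(self.num_stages):
            self.progress_of[i] = self.num_stages - i
        self.next_batch = self.num_stages

    def progress(self):
        # advance every buffered batch that still has stages left...
        for b in list(self.progress_of):
            if self.progress_of[b] < self.num_stages:
                self.progress_of[b] += 1
        # ...and admit one new batch through the first stage
        self.progress_of[self.next_batch] = 1
        self.next_batch += 1

p = ToyStagedPipeline(3)
p.fill()      # batch 0: 3 stages done, batch 1: 2, batch 2: 1
p.progress()  # batch 3 enters memcpy; batches 2 and 1 advance one stage
```

This reproduces the schedule above: after the first `progress()`, batch 3 has done memcpy, batch 2 start_sdd, and batch 1 prefetch.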

Reviewed By: zzzwen

Differential Revision: D59786807
facebook-github-bot pushed a commit that referenced this pull request Jul 23, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 27, 2024
Summary:
# context
* dynamic shapes usually have a minimum value requirement: `dynamic_shape >= 2`
* however, in reality, the actual KJT._values could be empty
* this issue was described in D57998381

> running the local-default a few times, one could get an error for some of the dynamic_shape values being zero.
This is because in some corner cases (not very rare, though), some dynamic_shape dim of the `sample_input` could be zero,
and a 0-size dynamic shape is handled differently during torch.export. **Basically it will assume this dynamic shape should always be zero.**
* error log: P1462233278
```
[rank0]:   - Not all values of vlen5 = L['args'][0][0].event_id_list_features_seqs['marketplace']._values.size()[0] in the specified range are valid because vlen5 was inferred to be a constant (0).
[rank0]: Suggested fixes:
[rank0]:   vlen5 = 0
```

# method
* pad kjt._values to the minimum required size `(2,)`
* in the case of empty values, kjt._lengths and kjt._offsets should all be zeros
* it doesn't affect the true logical/mathematical value of the kjt
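The padding step can be sketched as follows. NumPy stands in for torch tensors, and the helper name is invented for the example: since the lengths stay all-zero, downstream consumers never read the padded values, so the logical content of the KJT is unchanged.

```python
import numpy as np

# Illustrative sketch (numpy stands in for torch): pad an empty KJT-style
# values array to the minimum size (2,) required by dynamic shapes, while
# lengths stay all-zero — sum(lengths) == 0 means the padding is never read.
def pad_empty_values(values, lengths, min_size=2):
    values = np.asarray(values, dtype=np.int64)
    lengths = np.asarray(lengths, dtype=np.int64)
    if values.size == 0:
        assert int(lengths.sum()) == 0  # empty values implies all-zero lengths
        values = np.zeros(min_size, dtype=np.int64)  # dummy padding, never consumed
    return values, lengths

values, lengths = pad_empty_values([], [0, 0, 0])
```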

# issues
1. exported_program.module can't take empty-value input.
2. the deserialized unflattened model can't take empty-value input, which could happen with real data.
3. the deserialized unflattened model can't take altered input, which could be a potential workaround if #2 can't be resolved.

NOTE: Please check the in-line comments in the test file for details

# Other Concerns
1. the inconsistency in the KJT (lengths are zeros, but values is non-empty) might be incompatible with some downstream functions/operators; will need more tests to confirm.
2. it looks like the empty-value input has non-deterministic returns, possibly due to `torch.empty`; this is not related to this diff.

Differential Revision: D45410437
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 27, 2024
facebook-github-bot pushed a commit that referenced this pull request Jul 29, 2024

Summary:
Pull Request resolved: #2250

# context
* dynamic shapes usually have a minimum value requirement: `dynamic_shape >= 2`
* however, in reality, the actual KJT._values could be empty
* this issue was described in D57998381

> running the local-default a few times, one could get an error for some of the dynamic_shape values being zero.
This is because in some corner cases (not very rare, though), some dynamic_shape dim of the `sample_input` could be zero,
and a 0-size dynamic shape is handled differently during torch.export. **Basically it will assume this dynamic shape should always be zero.**
* error log: P1462233278
```
[rank0]:   - Not all values of vlen5 = L['args'][0][0].event_id_list_features_seqs['marketplace']._values.size()[0] in the specified range are valid because vlen5 was inferred to be a constant (0).
[rank0]: Suggested fixes:
[rank0]:   vlen5 = 0
```

# method
* pad kjt._values to the minimum required size `(2,)`
* in the case of empty values, kjt._lengths and kjt._offsets should all be zeros
* it doesn't affect the true logical/mathematical value of the kjt

# issues
1. exported_program.module can't take empty-value input.
2. the deserialized unflattened model can't take empty-value input, which could happen with real data.
3. the deserialized unflattened model can't take altered input, which could be a potential workaround if #2 can't be resolved.

NOTE: Please check the in-line comments in the test file for details

# Other Concerns
1. the inconsistency in the KJT (lengths are zeros, but values is non-empty) might be incompatible with some downstream functions/operators; will need more tests to confirm.
2. it looks like the empty-value input has non-deterministic returns, possibly due to `torch.empty`; this is not related to this diff.

Reviewed By: PaulZhang12

Differential Revision: D45410437

fbshipit-source-id: f336059e4e68ce59d1a97c73b7221a5a23f6bf06