Column-wise sharder #2
Closed
Conversation
This pull request was exported from Phabricator. Differential Revision: D30437562
Summary:
Pull Request resolved: meta-pytorch/torchrec#2

* Add shard idxs, ranks, and sizes to the related configs for metadata passing.
* Add CW sharding for per-rank table allocation.
* Many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0

Reviewed By: dstaay-fb

Differential Revision: D30437562

fbshipit-source-id: e0b0f3647a34435408bcc8b8cb9f8d4433cfe616
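As a hedged illustration of what column-wise (CW) sharding means here, the sketch below splits a table's embedding dimension across ranks. The helper name and the even-split assumption are illustrative only, not the actual torchrec implementation:

```
# Illustrative sketch only: column-wise sharding splits an embedding table
# along the embedding dimension, so each rank holds a contiguous block of
# columns for every row.
import torch

def column_wise_shards(weight: torch.Tensor, world_size: int) -> list:
    num_rows, embedding_dim = weight.shape
    assert embedding_dim % world_size == 0, "even split assumed for this sketch"
    block = embedding_dim // world_size
    # Rank r owns columns [r * block, (r + 1) * block); every lookup touches
    # all ranks, and the per-rank outputs are concatenated feature-wise.
    return [weight[:, r * block : (r + 1) * block] for r in range(world_size)]

# Example: an 8-row, 16-dim table sharded across 4 ranks -> four (8, 4) shards.
shards = column_wise_shards(torch.randn(8, 16), world_size=4)
```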
Commits updated: 9206478 to a0fca30
facebook-github-bot pushed a commit that referenced this pull request on Jan 20, 2022
Summary:
Pull Request resolved: #2

Add a Criteo TorchArrow dataset under `datasets/experimental`. This is a first step:
1. The dataset is loaded from TSV(!) instead of Parquet.
2. It only loads the dataset; there are no transformations yet.
3. There is no missing-value handling for dense/categorical features yet.

TorchArrow transformations will be added in following diffs.

Reviewed By: colin2328

Differential Revision: D33595316

fbshipit-source-id: 6905b3485df984f14a2019eb360b1986465f31bf
OswinC pushed a commit to OswinC/torcharrow that referenced this pull request on Mar 11, 2022
Summary:
Pull Request resolved: pytorch#141

Quite convenient in the Criteo-DLRM preproc workload (ongoing work in meta-pytorch/torchrec#2), e.g. we can do

```
df["dense_features"] = (dense_features["dense_features"] + 3).log()
```

where "dense_features" is a nested struct with 13 columns.

API rationale: Pandas supports a limited number of numeric functions, but the numeric ops Pandas does support (e.g. `abs`, `add`) are applicable to both Series and DataFrame:

```
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, -3], "b": [-4, -5, 6]})
>>> df["a"].abs()
0    1
1    2
2    3
Name: a, dtype: int64
>>> df.abs()
   a  b
0  1  4
1  2  5
2  3  6
>>> df["a"].add(1)
0    2
1    3
2   -2
Name: a, dtype: int64
>>> df.add(1)
   a  b
0  2 -3
1  3 -4
2 -2  7
```

Reviewed By: vancexu

Differential Revision: D33616632

fbshipit-source-id: e4a08e66ef787b002a418d0c443f1274cbc70569
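For comparison, a hedged sketch of the analogous TorchArrow usage is below. `ta.dataframe` is the public constructor; the DataFrame-level arithmetic and `.log()` are the capability this diff describes, so the exact behavior is assumed rather than verified:

```
# Sketch assuming the DataFrame-level numeric ops this diff introduces.
import torcharrow as ta

df = ta.dataframe({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
# As in pandas, the op broadcasts over every numeric column of the frame:
shifted_log = (df + 3).log()  # applies (x + 3).log() column-wise
```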
facebook-github-bot pushed a commit to pytorch/torcharrow that referenced this pull request on Jun 29, 2022
Summary: X-link: #141 (the same change as above, landed in pytorch/torcharrow). Reviewed By: vancexu. Differential Revision: D33616632. fbshipit-source-id: e4a08e66ef787b002a418d0c443f1274cbc70569
levythu pushed a commit to levythu/torchrec that referenced this pull request on Sep 13, 2023
Summary:
Some bug fixes from the integration test in PyPER O3:

### fix #1
`_embedding_bag_collection` (`ShardedEmbeddingBagCollection`) is not actually called by input_dist (the same input is already distributed by ShardedManagedCollisionCollection), so it never gets a chance to initiate `_input_dist`. As a result, TREC pipelining thinks it's not ready for input distribution. This is not expected, since the module is not used in that stage anyway, nor should it be put in fused a2a communication. With this change, https://fburl.com/code/ud8lnixv satisfies the assertion while carrying no _input_dists, so it won't be put into fused a2a.

### fix #2
ManagedCollisionCollection.forward is not traceable because it uses the unwrapped `KeyedJaggedTensor.from_jt_dict`. We don't care about its internal details, so just keep it atomic.

### fix #3
Due to how the remap table is set, `MCHManagedCollisionModule` doesn't support i32 id lists for now. An easy fix is to convert to i64 regardless; a more memory-efficient fix is probably to change the remapper to i32 where necessary.

Differential Revision: D48804332
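For fix #3, the "easy fix" amounts to an unconditional dtype upcast before remapping. A minimal sketch, with an illustrative helper name rather than the actual torchrec code:

```
import torch

def ensure_i64_ids(values: torch.Tensor) -> torch.Tensor:
    # MCHManagedCollisionModule's remap table assumes i64 ids, so upcast
    # i32 id lists before remapping; no-op if the ids are already i64.
    return values if values.dtype == torch.int64 else values.to(torch.int64)
```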
levythu pushed the same commit (retitled with #1391) to levythu/torchrec again on Sep 18, Sep 19, and Sep 21, 2023, each time with an identical summary.
duduyi2013 pushed a commit to duduyi2013/torchrec that referenced this pull request on Nov 27, 2023
Summary: the same three fixes as above, re-exported. Differential Revision: D51601041
facebook-github-bot pushed a commit that referenced this pull request on Dec 1, 2023
Summary: Pull Request resolved: #1541 (the same three fixes as above). Reviewed By: dstaay-fb. Differential Revision: D51601041. fbshipit-source-id: 95cf346b5247f1d5afb6643ecfd7dca4b3c4d575
sarckk added a commit to sarckk/torchrec that referenced this pull request on Jul 22, 2024
Summary:
Add the ability to run prefetch as a stage in `StagedTrainPipeline`.

Recommended usage to run a 3-stage pipeline with data copy, sparse dist, and prefetch steps (changes required shown with arrows):

```
sdd = SparseDataDistUtil(
    model=self._model,
    data_dist_stream=torch.cuda.Stream(),
    prefetch_stream=torch.cuda.Stream(),  # <--- define prefetch stream
)

pipeline = [
    PipelineStage(
        name="data_copy",
        runnable=lambda batch, context: batch.to(
            self._device, non_blocking=True
        ),
        stream=torch.cuda.Stream(),
    ),
    PipelineStage(
        name="start_sparse_data_dist",
        runnable=sdd.start_sparse_data_dist,
        stream=sdd.data_dist_stream,
        fill_callback=sdd.wait_sparse_data_dist,
    ),
    PipelineStage(
        name="prefetch",
        runnable=sdd.prefetch,            # <--- add stage with runnable=sdd.prefetch
        stream=sdd.prefetch_stream,
        fill_callback=sdd.load_prefetch,  # <--- fill_callback of sdd.load_prefetch
    ),
]

return StagedTrainPipeline(pipeline_stages=pipeline)
```

Order of execution for the above pipeline:

Iteration #1:

_fill_pipeline():
  batch 0: memcpy, start_sdd, wait_sdd (callback), prefetch, load_prefetch (callback)
  batch 1: memcpy, start_sdd, wait_sdd (callback)
  batch 2: memcpy

progress():
  batch 3: memcpy
  batch 2: start_sdd
  batch 1: prefetch

after pipeline progress():
  model(batch 0)
  load_prefetch (prepares for model fwd on batch 1)
  wait_sdd (prepares for batch 2 prefetch)

Iteration #2:

progress():
  batch 4: memcpy
  batch 3: start_sdd
  batch 2: prefetch

after pipeline progress():
  model(batch 1)
  load_prefetch (prepares for model fwd on batch 2)
  wait_sdd (prepares for batch 3 prefetch)

Reviewed By: zzzwen

Differential Revision: D59786807
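For context, a hedged sketch of how a training loop might drive this pipeline, assuming the usual `progress(dataloader_iter)` convention of torchrec train pipelines; `build_pipeline`, `dataloader`, and `model` are illustrative names, and this loop is not taken from the PR:

```
# build_pipeline() is assumed to return the StagedTrainPipeline configured above.
pipeline = build_pipeline()
dataloader_iter = iter(dataloader)
while True:
    # progress() advances each stage by one batch and returns the batch that
    # has cleared every stage (or None once the pipeline is drained).
    batch = pipeline.progress(dataloader_iter)
    if batch is None:
        break
    out = model(batch)
```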
facebook-github-bot pushed a commit that referenced this pull request on Jul 23, 2024
Summary: Pull Request resolved: #2239 (the same change as above). Reviewed By: zzzwen, joshuadeng. Differential Revision: D59786807. fbshipit-source-id: 6261c07cd6823bc541463d24ff867ab0e43631ea
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request on Jul 27, 2024
Summary:
# context
* Dynamic shapes usually have a minimum value requirement: `dynamic_shape >= 2`.
* However, in reality the actual KJT._values could be empty.
* This issue was described in D57998381:
> running the local-default every few times, one could get an error for some of the dynamic shapes being zero

This is because in some corner cases (not very rare, though), some dynamic-shape dim of the `sample_input` could be zero, and a 0-size dynamic shape is handled differently during torch.export. **Basically, it will assume this dynamic shape should always be zero.**
* error log: P1462233278
```
[rank0]: - Not all values of vlen5 = L['args'][0][0].event_id_list_features_seqs['marketplace']._values.size()[0] in the specified range are valid because vlen5 was inferred to be a constant (0).
[rank0]: Suggested fixes:
[rank0]:   vlen5 = 0
```

# method
* Pad kjt._values to the minimum required size `(2,)`.
* In the empty-values case, kjt._lengths and kjt._offsets should all be zeros.
* This doesn't affect the true logical/mathematical values of the KJT.

# issues
1. exported_program.module can't take empty-value input.
2. The deserialized unflattened model can't take empty-value input, which could happen with real data.
3. The deserialized unflattened model can't take altered input, which could be a potential workaround if #2 can't be resolved.

NOTE: Please check the in-line comments in the test file for details.

# Other Concerns
1. The inconsistency in the KJT (lengths are zeros, but values is non-empty) might be incompatible with some downstream functions/operators; more tests are needed to confirm.

Differential Revision: D45410437
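A minimal sketch of the padding idea described under "method". The helper name is illustrative and this is not the actual torchrec code; it assumes only the public `KeyedJaggedTensor` accessors:

```
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

def pad_empty_kjt_values(kjt: KeyedJaggedTensor) -> KeyedJaggedTensor:
    # torch.export wants dynamic dims >= 2, but a real batch can carry zero
    # values; pad _values to size (2,) while lengths stay all-zero, so the
    # logical content of the KJT is unchanged.
    values = kjt.values()
    if values.numel() > 0:
        return kjt
    padded = torch.zeros(2, dtype=values.dtype, device=values.device)
    return KeyedJaggedTensor(
        keys=kjt.keys(),
        values=padded,
        lengths=kjt.lengths(),  # still all zeros
    )
```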
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request on Jul 27, 2024
Summary: Pull Request resolved: meta-pytorch#2250 (the same change and notes as above). Differential Revision: D45410437
facebook-github-bot pushed a commit that referenced this pull request on Jul 29, 2024
Summary: Pull Request resolved: #2250 (the same change as above, with one additional concern noted: the empty-value input appears to have non-deterministic returns, possibly due to `torch.empty`; not related to this diff). Reviewed By: PaulZhang12. Differential Revision: D45410437. fbshipit-source-id: f336059e4e68ce59d1a97c73b7221a5a23f6bf06
Labels: CLA Signed, fb-exported