Conversation

YazhiGao

Summary:

  • add shard idx, ranks, and size in the related configs for metadata passing
  • add CW sharding for per-rank table allocation
  • many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0

Reviewed By: dstaay-fb

Differential Revision: D30437562

@facebook-github-bot added the CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) and fb-exported labels on Sep 15, 2021
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D30437562

Summary:
Pull Request resolved: meta-pytorch/torchrec#2

* add shard idx, ranks, and size in the related configs for metadata passing
* add CW sharding for per-rank table allocation
* many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0
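The per-rank column-wise (CW) allocation above can be sketched as follows. This is a hypothetical illustration, not the TorchRec implementation; the function and field names are invented for the sketch.

```python
# Hypothetical sketch (not the TorchRec API): column-wise (CW) sharding
# metadata for one embedding table, recording per-shard index, rank, and
# column offset/size so each rank can allocate only its own slice.
def cw_shard_metadata(embedding_dim, block_size, world_size):
    """Split a table's columns into blocks of `block_size` and assign
    the blocks round-robin to ranks."""
    assert embedding_dim % block_size == 0
    num_shards = embedding_dim // block_size
    shards = []
    for idx in range(num_shards):
        shards.append(
            {
                "shard_idx": idx,
                "rank": idx % world_size,        # round-robin placement
                "col_offset": idx * block_size,  # first column of this shard
                "col_size": block_size,          # columns owned by this shard
            }
        )
    return shards

shards = cw_shard_metadata(embedding_dim=128, block_size=32, world_size=2)
# 4 shards of 32 columns each, alternating between rank 0 and rank 1
```

Each rank then allocates only the `col_size`-wide slices whose `rank` field matches its own.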

Reviewed By: dstaay-fb

Differential Revision: D30437562

fbshipit-source-id: e0b0f3647a34435408bcc8b8cb9f8d4433cfe616

facebook-github-bot pushed a commit that referenced this pull request Jan 20, 2022
Summary:
Pull Request resolved: #2

Add Criteo TorchArrow Dataset into `datasets/experimental`

This is a very first step:
1. The dataset is loaded from TSV(!) instead of Parquet.
2. It only loads the dataset; no transformations yet.
3. No missing-value handling for dense/categorical features yet.

Will add TorchArrow transformations in following diffs.
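For reference, the Criteo TSV layout is a label, 13 dense integer features, and 26 categorical features per tab-separated row. The sketch below is illustrative only (not the torchrec dataset code); the parser name and the "empty dense value → 0" choice are assumptions for the example.

```python
import csv
import io

# Illustrative sketch only (not the torchrec dataset code): parse
# Criteo-style TSV rows — label, 13 dense int features, 26 categorical
# string features, tab-separated. Empty dense fields default to 0 here
# (a placeholder choice; real missing-value handling comes later).
def parse_criteo_tsv(fileobj):
    rows = []
    for fields in csv.reader(fileobj, delimiter="\t"):
        label = int(fields[0])
        dense = [int(v) if v else 0 for v in fields[1:14]]  # 13 dense features
        sparse = fields[14:40]                              # 26 categorical features
        rows.append({"label": label, "dense": dense, "sparse": sparse})
    return rows

sample = "1\t" + "\t".join(str(i) for i in range(13)) + "\t" + "\t".join(26 * ["abc123"])
rows = parse_criteo_tsv(io.StringIO(sample + "\n"))
```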

Reviewed By: colin2328

Differential Revision: D33595316

fbshipit-source-id: 6905b3485df984f14a2019eb360b1986465f31bf
OswinC pushed a commit to OswinC/torcharrow that referenced this pull request Mar 11, 2022
Summary:
Pull Request resolved: pytorch#141

Quite convenient in the Criteo-DLRM preproc workload (ongoing work in meta-pytorch/torchrec#2), e.g. we can do
```
df["dense_features"] = (df["dense_features"] + 3).log()
```
where "dense_features" is a nested struct with 13 columns.

API rationale: Pandas supports only a limited number of numeric functions, but the numeric ops it does support (e.g. `abs`, `add`) are applicable to both Series and DataFrame:
```
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, -3], "b": [-4, -5, 6]})
>>> df["a"].abs()
0    1
1    2
2    3
Name: a, dtype: int64
>>> df.abs()
   a  b
0  1  4
1  2  5
2  3  6
>>> df["a"].add(1)
0    2
1    3
2   -2
Name: a, dtype: int64
>>> df.add(1)
   a  b
0  2 -3
1  3 -4
2 -2  7
```
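The broadcasting behavior this adds — a numeric op on a struct column applying to every child column — can be mimicked with a tiny toy class. This is a minimal sketch under assumed names (`StructColumn` is invented here), not the TorchArrow API:

```python
import math

# Hypothetical sketch (not the TorchArrow API): a struct column that
# broadcasts numeric ops over its child columns, mirroring how
# (df["dense_features"] + 3).log() applies to all 13 nested columns.
class StructColumn:
    def __init__(self, children):
        self.children = children  # dict: child name -> list of floats

    def __add__(self, scalar):
        # broadcast scalar addition to every child column
        return StructColumn(
            {name: [v + scalar for v in col] for name, col in self.children.items()}
        )

    def log(self):
        # broadcast elementwise log to every child column
        return StructColumn(
            {name: [math.log(v) for v in col] for name, col in self.children.items()}
        )

dense = StructColumn({"f1": [0.0, 1.0], "f2": [2.0, 3.0]})
shifted = (dense + 3).log()  # computes log(x + 3) on every child column
```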

Reviewed By: vancexu

Differential Revision: D33616632

fbshipit-source-id: e4a08e66ef787b002a418d0c443f1274cbc70569
facebook-github-bot pushed a commit to pytorch/torcharrow that referenced this pull request Jun 29, 2022
levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 13, 2023
Summary:
Some bug fixes during the integration test in PyPER O3:

### fix #1

`_embedding_bag_collection` (`ShardedEmbeddingBagCollection`) is not actually called by input_dist (the same input is already distributed by ShardedManagedCollisionCollection), so it never gets a chance to initiate `_input_dist`. As a result, TREC pipelining thinks it's not ready for input distribution.

This is not expected, since the module is not used in that stage anyway, nor should it be put into the fused a2a communication. With this change (https://fburl.com/code/ud8lnixv) it satisfies the assertion, while not carrying `_input_dists`, so it won't be put into the fused a2a.

### fix #2

ManagedCollisionCollection.forward is not traceable because it uses the unwrapped `KeyedJaggedTensor.from_jt_dict`. We don't care about its internal details, so just keep it atomic.

### fix #3

Due to how the remap table is set, `MCHManagedCollisionModule` doesn't support i32 id lists for now. An easy fix is to convert to i64 regardless; a more memory-efficient fix would be to change the remapper to i32 where necessary.
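The "easy fix" for #3 — unconditionally upcasting i32 ids to i64 before remapping — looks roughly like this. NumPy stands in for torch tensors here, and `remap_ids` plus the `-1` "unmapped" sentinel are invented for the sketch:

```python
import numpy as np

# Illustrative sketch (numpy stands in for torch tensors): the "easy fix" —
# unconditionally upcast an i32 id list to i64 before feeding the remapper,
# trading memory (8 bytes/id instead of 4) for compatibility.
def remap_ids(ids, remap_table):
    ids = np.asarray(ids)
    if ids.dtype == np.int32:
        ids = ids.astype(np.int64)  # remap table is keyed/stored as i64
    # ids absent from the table map to -1 (a stand-in for "unmapped")
    return np.array([remap_table.get(int(i), -1) for i in ids], dtype=np.int64)

ids_i32 = np.array([10, 20, 30], dtype=np.int32)
remapped = remap_ids(ids_i32, {10: 0, 20: 1})
```

The memory-efficient alternative mentioned above would instead keep the remapper itself in i32 when the id space allows it.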

Differential Revision: D48804332
levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 18, 2023
levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 19, 2023
levythu pushed a commit to levythu/torchrec that referenced this pull request Sep 21, 2023
duduyi2013 pushed a commit to duduyi2013/torchrec that referenced this pull request Nov 27, 2023
facebook-github-bot pushed a commit that referenced this pull request Dec 1, 2023
Summary:
Pull Request resolved: #1541

Some bug fixes during the integration test in PyPER O3:
# fix #1
_embedding_bag_collection (ShardedEmbeddingBagCollection) is not actually called by input_dist (the same input is already distributed by ShardedManagedCollisionCollection), so it never gets a chance to initiate _input_dist. As a result, TREC pipelining thinks it's not ready for input distribution.
This is not expected, since the module is not used in that stage anyway, nor should it be put into the fused a2a communication. With this change (https://fburl.com/code/ud8lnixv) it satisfies the assertion, while not carrying _input_dists, so it won't be put into the fused a2a.
# fix #2
ManagedCollisionCollection.forward is not traceable because it uses the unwrapped KeyedJaggedTensor.from_jt_dict. We don't care about its internal details, so just keep it atomic.
# fix #3
Due to how the remap table is set, MCHManagedCollisionModule doesn't support i32 id lists for now. An easy fix is to convert to i64 regardless; a more memory-efficient fix would be to change the remapper to i32 where necessary.

Reviewed By: dstaay-fb

Differential Revision: D51601041

fbshipit-source-id: 95cf346b5247f1d5afb6643ecfd7dca4b3c4d575
sarckk added a commit to sarckk/torchrec that referenced this pull request Jul 22, 2024
Summary:
Add ability to run prefetch as a stage in `StagedTrainPipeline`

Recommended usage for running a 3-stage pipeline with data copy, sparse dist, and prefetch steps (required changes shown with arrows):
```
sdd = SparseDataDistUtil(
    model=self._model,
    data_dist_stream=torch.cuda.Stream(),
    prefetch_stream=torch.cuda.Stream(), <--- define prefetch stream
)

pipeline = [
    PipelineStage(
        name="data_copy",
        runnable=lambda batch, context: batch.to(
            self._device, non_blocking=True
        ),
        stream=torch.cuda.Stream(),
    ),
    PipelineStage(
        name="start_sparse_data_dist",
        runnable=sdd.start_sparse_data_dist,
        stream=sdd.data_dist_stream,
        fill_callback=sdd.wait_sparse_data_dist,
    ),
    PipelineStage(
        name="prefetch",
        runnable=sdd.prefetch, <--- add stage with runnable=sdd.prefetch
        stream=sdd.prefetch_stream,
        fill_callback=sdd.load_prefetch, <--- fill_callback of sdd.load_prefetch
    ),
]

return StagedTrainPipeline(pipeline_stages=pipeline)
```

Order of execution for above pipeline:

Iteration #1:

_fill_pipeline():
batch 0: memcpy, start_sdd, wait_sdd (callback), prefetch, load_prefetch (callback)
batch 1: memcpy, start_sdd, wait_sdd (callback)
batch 2: memcpy

progress():
batch 3: memcpy
batch 2: start_sdd
batch 1: prefetch

after pipeline progress():
model(batch 0)
load_prefetch (prepares for model fwd on batch 1)
wait_sdd (prepares for batch 2 prefetch)

Iteration #2:
progress():
batch 4: memcpy
batch 3: start_sdd
batch 2: prefetch

after pipeline progress():
model(batch 1)
load_prefetch (prepares for model fwd on batch 2)
wait_sdd (prepares for batch 3 prefetch)
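The fill/progress ordering above can be captured by a toy model (the class and field names are invented for this sketch; this is not the TorchRec implementation): with 3 stages (memcpy, start_sdd, prefetch), `_fill_pipeline()` leaves batch 0 fully staged, batch 1 one stage behind, and so on; each `progress()` admits one new batch and advances every buffered batch by one stage.

```python
# Toy model (not the TorchRec implementation) of a k-stage pipeline:
# after fill(), batch i has completed (k - i) stages; each progress()
# admits one new batch and advances every unfinished batch by one stage.
class ToyStagedPipeline:
    def __init__(self, num_stages):
        self.num_stages = num_stages
        self.progress_of = {}  # batch id -> number of stages completed
        self.next_batch = 0

    def fill(self):
        # _fill_pipeline(): batch 0 completes all stages, batch 1 all but
        # the last, ..., the newest batch only the first stage.
        for i in range(self.num_stages):
            self.progress_of[i] = self.num_stages - i
        self.next_batch = self.num_stages

    def progress(self):
        # advance every buffered batch that still has stages left...
        for b in list(self.progress_of):
            if self.progress_of[b] < self.num_stages:
                self.progress_of[b] += 1
        # ...and admit one new batch through the first stage
        self.progress_of[self.next_batch] = 1
        self.next_batch += 1

p = ToyStagedPipeline(3)
p.fill()      # batch 0: 3 stages done, batch 1: 2, batch 2: 1
p.progress()  # batch 3 enters memcpy; batches 2 and 1 advance one stage
```

This reproduces the schedule above: after the first `progress()`, batch 3 has done memcpy, batch 2 start_sdd, and batch 1 prefetch.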

Reviewed By: zzzwen

Differential Revision: D59786807
facebook-github-bot pushed a commit that referenced this pull request Jul 23, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 27, 2024
Summary:
# context
* dynamic shapes usually have a minimum value requirement: `dynamic_shape >= 2`
* however, in reality, the actual KJT._values could be empty
* this issue was described in D57998381

> running the local-default a few times, one could get an error for some of the dynamic_shape values being zero.
This is because in some corner cases (not very rare, though), some dynamic_shape dim of the `sample_input` could be zero,
and a 0-size dynamic shape is handled differently during torch.export. **Basically it will assume this dynamic shape should always be zero.**
* error log: P1462233278
```
[rank0]:   - Not all values of vlen5 = L['args'][0][0].event_id_list_features_seqs['marketplace']._values.size()[0] in the specified range are valid because vlen5 was inferred to be a constant (0).
[rank0]: Suggested fixes:
[rank0]:   vlen5 = 0
```

# method
* pad kjt._values to the minimum required size `(2,)`
* in the case of empty values, kjt._lengths and kjt._offsets should all be zeros
* it doesn't affect the true logical/mathematical value of the kjt
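The padding step can be sketched as follows. NumPy stands in for torch tensors, and the helper name is invented for the example: since the lengths stay all-zero, downstream consumers never read the padded values, so the logical content of the KJT is unchanged.

```python
import numpy as np

# Illustrative sketch (numpy stands in for torch): pad an empty KJT-style
# values array to the minimum size (2,) required by dynamic shapes, while
# lengths stay all-zero — sum(lengths) == 0 means the padding is never read.
def pad_empty_values(values, lengths, min_size=2):
    values = np.asarray(values, dtype=np.int64)
    lengths = np.asarray(lengths, dtype=np.int64)
    if values.size == 0:
        assert int(lengths.sum()) == 0  # empty values implies all-zero lengths
        values = np.zeros(min_size, dtype=np.int64)  # dummy padding, never consumed
    return values, lengths

values, lengths = pad_empty_values([], [0, 0, 0])
```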

# issues
1. exported_program.module can't take empty-value input.
2. the deserialized unflattened model can't take empty-value input, which could happen with real data.
3. the deserialized unflattened model can't take altered input, which could be a potential workaround if #2 can't be resolved.

NOTE: Please check the in-line comments in the test file for details

# Other Concerns
1. the inconsistency in the KJT (lengths are zeros, but values is non-empty) might be incompatible with some downstream functions/operators; will need more tests to confirm.
2. it looks like the empty-value input has non-deterministic returns, possibly due to `torch.empty`; this is not related to this diff.

Differential Revision: D45410437
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 27, 2024
facebook-github-bot pushed a commit that referenced this pull request Jul 29, 2024

Summary:
Pull Request resolved: #2250

# context
* dynamic shapes usually have a minimum value requirement: `dynamic_shape >= 2`
* however, in reality, the actual KJT._values could be empty
* this issue was described in D57998381

> running the local-default a few times, one could get an error for some of the dynamic_shape values being zero.
This is because in some corner cases (not very rare, though), some dynamic_shape dim of the `sample_input` could be zero,
and a 0-size dynamic shape is handled differently during torch.export. **Basically it will assume this dynamic shape should always be zero.**
* error log: P1462233278
```
[rank0]:   - Not all values of vlen5 = L['args'][0][0].event_id_list_features_seqs['marketplace']._values.size()[0] in the specified range are valid because vlen5 was inferred to be a constant (0).
[rank0]: Suggested fixes:
[rank0]:   vlen5 = 0
```

# method
* pad kjt._values to the minimum required size `(2,)`
* in the case of empty values, kjt._lengths and kjt._offsets should all be zeros
* it doesn't affect the true logical/mathematical value of the kjt

# issues
1. exported_program.module can't take empty-value input.
2. the deserialized unflattened model can't take empty-value input, which could happen with real data.
3. the deserialized unflattened model can't take altered input, which could be a potential workaround if #2 can't be resolved.

NOTE: Please check the in-line comments in the test file for details

# Other Concerns
1. the inconsistency in the KJT (lengths are zeros, but values is non-empty) might be incompatible with some downstream functions/operators; will need more tests to confirm.
2. it looks like the empty-value input has non-deterministic returns, possibly due to `torch.empty`; this is not related to this diff.

Reviewed By: PaulZhang12

Differential Revision: D45410437

fbshipit-source-id: f336059e4e68ce59d1a97c73b7221a5a23f6bf06