
Conversation

xunnanxu (Contributor)

Summary:

* add `dmp_collection_sync_interval_batches` as a config value to the SDD pipeline (and the semi-sync pipeline)
* defaults to 1 (sync every batch)
* set to `None` to disable
* disabled if DMPC is not used

Differential Revision: D70988786
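
A minimal usage sketch of the new knob follows. It assumes the SDD pipeline here is `TrainPipelineSparseDist` and that `model`, `optimizer`, and `device` are already constructed; only `dmp_collection_sync_interval_batches` itself comes from this PR.

```python
from torchrec.distributed.train_pipeline import TrainPipelineSparseDist

# Sketch only: `model` is assumed to be wrapped in a DMPCollection
# (otherwise the periodic sync is disabled); optimizer/device are placeholders.
pipeline = TrainPipelineSparseDist(
    model=model,
    optimizer=optimizer,
    device=device,
    dmp_collection_sync_interval_batches=1,  # the default: sync every batch
)

# Passing None instead turns the periodic sync off entirely:
# dmp_collection_sync_interval_batches=None
```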

facebook-github-bot added the CLA Signed label (Apr 30, 2025). This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.
facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D70988786

xunnanxu added commits to xunnanxu/torchrec that referenced this pull request Apr 30–May 1, 2025 (…ytorch#2929; Pull Request resolved: meta-pytorch#2929)
xunnanxu force-pushed the export-D70988786 branch from 4133ec5 to 763ac47 (May 1, 2025 04:21)
xunnanxu requested a review from Copilot (May 1, 2025 04:22)

Copilot AI left a comment
Pull Request Overview

This PR adds support for DMP collection syncing via a new configuration value in the training pipelines. Key changes include new helper functions (`get_class_name` and `assert_instance`), an update to the sparse train pipeline to optionally sync DMPs every N batches, and tests that validate the new sync functionality.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| torchrec/distributed/utils.py | Added helper functions `get_class_name` and `assert_instance` for improved type-error messages. |
| torchrec/distributed/types.py | Introduced a stub for `prefetch` to support future extensions. |
| torchrec/distributed/train_pipeline/utils.py | Updated type checks and stream handling using `assert_instance`. |
| torchrec/distributed/train_pipeline/train_pipelines.py | Integrated a new `sync_embeddings` function that optionally syncs the DMP collection based on a configurable interval. |
| torchrec/distributed/train_pipeline/tests/test_train_pipelines.py | Added tests to verify proper sync behavior and that sync is disabled when DMP is not used. |
| torchrec/distributed/embedding_types.py | Updated the `prefetch` method's parameter type to accept `Multistreamable` instances. |
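
To make the `sync_embeddings` behavior concrete, here is a minimal sketch of interval-gated syncing. It assumes the pipeline's module is a `DMPCollection` exposing a `sync()` method; the class and method bodies below are illustrative, not the PR's actual code.

```python
from typing import Optional


class _IntervalSyncSketch:
    """Illustrative only: gates a DMP collection sync on a batch interval."""

    def __init__(self, dmp_collection_sync_interval_batches: Optional[int] = 1) -> None:
        # None disables the periodic sync entirely.
        self._sync_interval = dmp_collection_sync_interval_batches
        self._batch_count = 0

    def sync_embeddings(self, module: object) -> None:
        self._batch_count += 1
        if self._sync_interval is None:
            return  # feature disabled via config
        # Only a module exposing sync() (e.g. DMPCollection) participates;
        # for anything else the feature is a no-op ("disabled if DMPC not used").
        sync = getattr(module, "sync", None)
        if not callable(sync):
            return
        # With the default interval of 1 this fires on every batch.
        if self._batch_count % self._sync_interval == 0:
            sync()
```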



def assert_instance(obj: object, t: Type[_T]) -> _T:
    assert isinstance(obj, t), f"Got {get_class_name(obj)}"

Copilot AI commented on May 1, 2025:
[nitpick] Consider including the expected type in the assertion error message (e.g., 'Expected {t}, got {get_class_name(obj)}') to improve debuggability.

Suggested change:
-assert isinstance(obj, t), f"Got {get_class_name(obj)}"
+assert isinstance(obj, t), f"Expected {t.__name__}, got {get_class_name(obj)}"


Comment on lines +119 to +120
f"{self.__class__.__name__} does not support context (not expected). "
"Embedding weight sync is disabled."

Copilot AI commented on May 1, 2025:
[nitpick] Consider clarifying or documenting in the warning message when the context is absent so that it’s clear under what conditions embedding sync is disabled.

Suggested change:
-f"{self.__class__.__name__} does not support context (not expected). "
-"Embedding weight sync is disabled."
+f"{self.__class__.__name__}: Embedding weight synchronization requires a valid "
+"TrainPipelineContext. No context was provided, so embedding sync is disabled. "
+"Ensure that a TrainPipelineContext is passed to enable this feature."


xunnanxu added further commits to xunnanxu/torchrec that referenced this pull request May 1, 2025 (…ytorch#2929; Pull Request resolved: meta-pytorch#2929)

xunnanxu force-pushed the export-D70988786 branch from 763ac47 to 48a5ebb (May 1, 2025 04:47)
xunnanxu force-pushed the export-D70988786 branch from 48a5ebb to b345e32 (May 1, 2025 04:47)
xunnanxu force-pushed the export-D70988786 branch from b345e32 to f8b62cb (May 1, 2025 04:51)
xunnanxu force-pushed the export-D70988786 branch from f8b62cb to 27c855f (May 1, 2025 05:18)
xunnanxu force-pushed the export-D70988786 branch from 27c855f to c980bb5 (May 1, 2025 06:12)
xunnanxu force-pushed the export-D70988786 branch from c980bb5 to d03ba2f (May 1, 2025 08:14)
xunnanxu force-pushed the export-D70988786 branch from d03ba2f to 2d785a2 (May 1, 2025 08:17)
xunnanxu force-pushed the export-D70988786 branch from 2d785a2 to 073ef16 (May 1, 2025 08:17)
Labels: CLA Signed, fb-exported