Conversation

@joshuadeng
Contributor

Summary: tsia

Differential Revision: D31941381

facebook-github-bot and others added 30 commits September 2, 2021 11:36
fbshipit-source-id: 18eb1fdcd8d0571b03cc80ffe3ac866486871ba3
Summary:
The current torchrec lint has issues, exposed by the recent diff D30597438.

For example: the current BaseEmbedding in fbcode/torchrec/distributed/embedding_lookup.py has no __init__ function, which causes a KeyError exception. This failure won't be caught by the recent fix in D30597438, so the lint output fails as:

   Error  (TORCHRECDOCSTRING) lint-command-failure
    Command `python3 fbcode/torchrec/linter/module_linter.py @{{PATHSFILE}}`
    failed.

    Run `arc lint --trace --take TORCHRECDOCSTRING` for the full command
    output.

    Oncall: torchrec

The Sandcastle test also shows the following error:

{F659615618}

We submit these two changes (see the sketch after this list):
1. Resolve function_name "__init__" or "forward" when they are not in the function list.
2. Catch the remaining exceptions, except for SyntaxError.
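A minimal sketch of the two fixes described above, assuming a linter that parses modules with `ast` and looks up functions in a dict; the helper names below are illustrative, not the actual module_linter.py code:

```
import ast
from typing import Optional

def find_function(functions: dict, name: str) -> Optional[ast.FunctionDef]:
    # 1. A module may legitimately lack __init__ or forward; return None
    #    instead of raising a KeyError.
    return functions.get(name)

def parse_module(source: str) -> Optional[ast.Module]:
    # 2. Only a SyntaxError should fail the lint run; any other exception is
    #    reported and the module is skipped.
    try:
        return ast.parse(source)
    except SyntaxError:
        raise
    except Exception as e:
        print(f"lint skipped module: {e}")
        return None
```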

Reviewed By: zertosh

Differential Revision: D30722426

fbshipit-source-id: 5f11110af039f2fa7bc3f63902739f0e6ea5e287
Summary:
* added ads 2021h1 model
* random data launcher unit test run locally
* refactor torchrec DDP init for modules with no parameters that need gradients (we have a pure buffer-based module for ads used as calibration)
* source model: https://www.internalfb.com/code/fbsource/[9f3e1042dd2d]/fbcode/hpc/models/ads/ads_1x_2021h1.py

Reviewed By: xing-liu

Differential Revision: D30076019

fbshipit-source-id: 568205e0c4fa6e60eaf4c9e94946acad5d8578e5
Summary: The current design exposes the invoke_on_rank_and_broadcast_result(...) call to users, which is not very user friendly.
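For context, a rough sketch of the pattern behind this call — run a function on a single rank and broadcast its result to all ranks; illustrative only, not torchrec's actual implementation or the friendlier API that replaces it:

```
import torch.distributed as dist

def invoke_on_rank_and_broadcast_result(pg, rank, func, *args):
    # Only `rank` computes the result; every other rank receives it.
    result = [func(*args)] if dist.get_rank(pg) == rank else [None]
    dist.broadcast_object_list(result, src=rank, group=pg)
    return result[0]
```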

Reviewed By: divchenko

Differential Revision: D30733086

fbshipit-source-id: 6824d2ecfb9fc149c3cb7fc095d7f9ac96ba4ed1
Summary:
WarmupOptimizer needs to update the underlying optimizer's state; the bug was introduced in the CombinedOptimizer refactor (D30176405).
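A minimal sketch of the idea behind the fix, assuming a wrapper-optimizer design; the class below is illustrative, not torchrec's WarmupOptimizer — the point is that the wrapper must read and write the underlying optimizer's state rather than a stale copy:

```
import torch

class WarmupWrapper:
    """Illustrative wrapper that delegates state to the wrapped optimizer."""

    def __init__(self, optimizer: torch.optim.Optimizer) -> None:
        self._optimizer = optimizer

    @property
    def state(self):
        # Delegate so updates land in the underlying optimizer's state.
        return self._optimizer.state

    def step(self) -> None:
        self._optimizer.step()

    def zero_grad(self) -> None:
        self._optimizer.zero_grad()
```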

Reviewed By: divchenko

Differential Revision: D30755047

fbshipit-source-id: 122038e6a4c7bc73cc859ed8cffa68e2b9841a63
Summary:
Generalize regroup method introduced in D30044807

Will switch out usage in ig_clips_tab_pt.py in a followup diff
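For context, a rough sketch of what a "regroup" does — collapsing several keyed tensors into one tensor per named group; illustrative only, not the torchrec method being generalized:

```
from typing import Dict, List
import torch

def regroup(values: Dict[str, torch.Tensor], groups: List[List[str]]) -> List[torch.Tensor]:
    # Concatenate the tensors of each named group along the feature dimension.
    return [torch.cat([values[key] for key in group], dim=1) for group in groups]
```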

Reviewed By: divchenko

Differential Revision: D30375713

fbshipit-source-id: 6eb37c4f547db04d0048134187bb7aa0657bb9cf
Summary: Use torch SiLU instead

Reviewed By: colin2328

Differential Revision: D30700094

fbshipit-source-id: b4a92e971769b9f7be739264869cee176f55f5e9
Summary: These modules promote a non-functional style of modeling.

Reviewed By: wx1988

Differential Revision: D30701381

fbshipit-source-id: fedc510366e5a10e87b6ab71ac12204c5b91b45d
Summary:
Pull Request resolved: #1

* move it to the ml_foundation folder before further performance testing
* make it non-lazy
* add numerical testing

Reviewed By: divchenko

Differential Revision: D30756661

fbshipit-source-id: e2c50848bec12943951476d23991a6f586916487
Summary: Prior to this diff, we used a fixed param key "embedding_bags" in embedding_lookup.py. This diff moves the code to the embedding module sharders so we can use different keys for different embedding modules.

Reviewed By: divchenko

Differential Revision: D30801536

fbshipit-source-id: ef04bd0b727139829bc6879555dfe819422b3884
Summary:
This diff integrates with ShardedTensor in PyTorch distributed according to the plan/discussions in https://fb.quip.com/fwucARGO5SeO

This is the first part of the integration, in which we replace the ShardedTensor/Metadata definitions in torchrec with the ones defined in PyTorch distributed.

A second part of the integration might be more involved: we need to accommodate fbgemm kernels to take a sharded tensor and do the computation, then switch to a mode where a ShardedModule contains a sharded weight/tensor directly, instead of multiple small nn.EmbeddingBags.

Reviewed By: YazhiGao

Differential Revision: D29403713

fbshipit-source-id: 279643bd01261ae564238b9dea9d2af5597342c2
Summary: Support Copies of Data

Reviewed By: YazhiGao

Differential Revision: D30262094

fbshipit-source-id: 33a32245afbc419436c1902ba32020ebb4c133e7
… on AWS cluster.

Summary:
# Context
* Inside fbcode, we don't need to worry much about how to use torchrec. It's as simple as running `import torchrec` and letting autodeps figure out how to add the relevant buck target.
* In OSS, where there is no buck, we need to somehow be able to run `import torchrec`. And we want to do this in a way that is independent of where we call our python script `python3 ~/example_folder/example_folder/.../my_torchrec_script.py`. i.e. We don't want to have to keep `my_torchrec_script.py` at the same level as the torchrec repo just so we can call `import torchrec` (as this will not work when my_torchrec_script.py cannot be easily co-located with the torchrec repo. e.g. torchrec STL app scripts):

```
random_folder
|_______________repos/
                      |_______torchrec/
                      |_______my_torchrec_script.py
```

# This Diff

The way to allow us to run `import torchrec` anywhere is to make a `setup.py` for torchrec which allows us to install torchrec with `python setup.py install`. This diff adds a **minimum viable version** of the setup.py that is **just good enough to unblock TorchRec external scale validation work on AWS clusters**. If you look at the setup.py for other domain libraries, they are way more complicated (e.g. [torchvision setup.py](https://fburl.com/zqef7peu)) and we will eventually upgrade this setup.py so it is more sophisticated for the official OSS release.
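A minimal sketch of what such a setup.py could look like; the metadata and dependency list below are placeholders, not necessarily what this diff contains:

```
from setuptools import find_packages, setup

setup(
    name="torchrec",                          # assumed package name
    version="0.0.1",                          # placeholder version
    description="PyTorch domain library for recommendation systems",
    packages=find_packages(exclude=("*tests",)),
    install_requires=["torch"],               # assumed minimal dependency
    python_requires=">=3.7",
)
```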

Reviewed By: colin2328

Differential Revision: D30839689

fbshipit-source-id: 9ac7722eaf8685e5d7a6b7f422ae3c91991d49c6
Summary:
Assert integer types for JT and KJT lengths and offsets.

The tensor data type is now checked in the JT class; the KJT class was already covered.
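A rough sketch of the kind of check being added (illustrative, not the actual JaggedTensor code):

```
import torch

def _assert_integer(tensor: torch.Tensor, name: str) -> None:
    # Lengths and offsets index into values, so they must be integer tensors.
    assert tensor.dtype in (torch.int32, torch.int64), (
        f"{name} must be int32 or int64, got {tensor.dtype}"
    )
```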

Reviewed By: dstaay-fb

Differential Revision: D30842080

fbshipit-source-id: cf78edfffabb30f664951bfe35cf7b665df18e7c
…ollection, and GroupedPooledEmbeddingsLookup

Summary:
All nn.Modules should be able to do self.load_state_dict(self.state_dict()). The current EmbeddingBag modules cannot, and DMP itself cannot.

This diff mirrors the state_dict() customization in load_state_dict() so that the round trip works, and adds a test in DMP to verify this.
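The property being tested is simply that a module can reload its own state dict; a minimal sketch (the helper name is illustrative):

```
import torch.nn as nn

def check_state_dict_round_trip(module: nn.Module) -> None:
    # load_state_dict(state_dict()) should succeed with nothing missing or unexpected.
    missing, unexpected = module.load_state_dict(module.state_dict())
    assert not missing and not unexpected
```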

Reviewed By: divchenko, rkindi

Differential Revision: D30820466

fbshipit-source-id: 181ee3484aac6c348b6bb15dc59494c188b2e89c
Summary:
Add layer norm's non-lazy version.

Keep the current usage of the lazy LayerNorm in Video and IG as is.

* Add a non-lazy version of LayerNorm
* Rename the TorchRec version of LayerNorm to MCLayerNorm and LazyMCLayerNorm
* Move MCLayerNorm and LazyMCLayerNorm into the torchrec/fb/module folder
* Add a numerical unit test
* Add a lazy vs non-lazy numerical unit test
* Fix the existing adoptions (call sites).

Reviewed By: divchenko

Differential Revision: D30828204

fbshipit-source-id: db722abef965622829489c60a7e5866178343814
Summary: update KJTA2A docstring, provide _recat example

Reviewed By: colin2328

Differential Revision: D30877670

fbshipit-source-id: 50eca883d0c49df0738837d682c7179332c88627
…le workers.

Summary:
# Context
DataLoader can be used with multiple workers/processes to increase throughput. Map-style datasets (due to having a length property and keyed samples) automatically ensure that samples from the dataset are not duplicated across the multiple workers. However, for IterDataPipes (stream-style datasets), we must manually coordinate the workers so they don't duplicate samples ([see relevant PyTorch docs here](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)).

Criteo is a torchrec IterDataPipe that does not currently have logic to prevent duplicate samples.

# This Diff
* Adds support for Criteo to handle multiple workers without duplicating samples across workers. Followed the PyTorch [docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)' suggestion on how to do this.
* Adds some unit tests wrapping the Criteo dataset in DataLoader showing that multiple workers now work without duplicating data.

# Implementation Details
*How do we split up the input criteo TSV files across the different workers?*

There are a few options I considered. **tldr** Option 1, used in this diff, is simple and performant (a sketch of it follows the list below). If you want to squeeze additional utilization out of the workers, you can subdivide the TSVs into smaller ones. Option 2 is too wasteful. Option 3 is too complicated and is not as performant as Option 1.

* Option 1 (what this diff does): Each TSV file is assigned to one worker.
   * Pros:
      * Straightforward implementation. Works best when the number of TSV files is evenly divisible by num_workers.
      * All data is read only once.
   * Cons:
      * During validation, if you have just 1 tsv file, only one worker gets to process that file while all other workers are idle.
* Option 2: Every tsv file is read across all the workers and we drop rows on each worker to prevent duplication.
   * Pros:
      * All workers are being utilized even for a single TSV.
   * Cons:
      * Terribly wasteful: each worker reads all of the rows and drops (num_workers - 1) / (num_workers) portion of the rows. Each worker essentially reads in all the data.
* Option 3: Every tsv file is sharded across all the workers. Instead of naively reading all the data like in Option 2, we somehow use IOBase `seek` to chunk the tsv up and assign the chunks to different workers.
   * Pros:
      * All data is only read once. (in theory, see cons below)
      * All workers are being utilized even for a single TSV.
   * Cons:
      * **Very complicated.** Because each row of the TSV does not use the same number of bytes, when you seek in a TSV file, you might end up somewhere in the middle of one of the rows. You might need to drop that row, or do an additional seek to jump back to collect the rest of the row. You may get a performance hit due to the seeking.
      * You can achieve the same effect with better performance (due to the lack of seeks) by subdividing the TSV files into smaller files and using Option 1.
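A minimal sketch of Option 1, assuming each worker can query its id via torch.utils.data.get_worker_info(); the helper below is illustrative, not the actual Criteo datapipe code:

```
from typing import List
from torch.utils.data import get_worker_info

def tsv_paths_for_this_worker(tsv_paths: List[str]) -> List[str]:
    info = get_worker_info()
    if info is None:
        # Single-process loading: one "worker" reads every file.
        return tsv_paths
    # Assign file i to worker (i % num_workers) so no file is read twice.
    return [p for i, p in enumerate(tsv_paths) if i % info.num_workers == info.id]
```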

Reviewed By: colin2328

Differential Revision: D30872755

fbshipit-source-id: 85396e8db28f79ed83d62f70fcf991cfd6108216
Summary:
This diff refactors the mlp-related modules:
* make the perceptron, mlp, mcmlp, and mcperceptron modules non-lazy
* make the mlp an apex.mlp wrapper when it is available
* move the mlp (which calls the perceptron) to torchrec/fb/modules
* move the mc versions to torchrec/fb/ml_foundation/modules
* update the unit tests
* update the related call sites

Reviewed By: wx1988

Differential Revision: D30874769

fbshipit-source-id: 59b0d4d0fcd456ce528de141d1074374f2bde4fd
Summary:
This covers a somewhat common data-transform use case where we need to convert from the unpacked format back to the packed format.

(In particular, this is a dependency for cross batch sampling)

Reviewed By: divchenko

Differential Revision: D30890351

fbshipit-source-id: b387f9f67b58c7e7b021fc6fc67bcc9f9be432de
Summary:
1. Lengths in KJTAll2all can be int64.
2. Use the external all_to_all_single(...) API instead of alltoall_base (see the sketch below).
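A small sketch of the public all_to_all_single(...) API mentioned in item 2, exchanging int64 lengths; illustrative only, not the KJTAll2All implementation:

```
import torch
import torch.distributed as dist

def exchange_lengths(lengths: torch.Tensor) -> torch.Tensor:
    # int64 lengths exchanged symmetrically: each rank sends an equal slice
    # of `lengths` to every other rank and receives one slice back.
    output = torch.empty_like(lengths)
    dist.all_to_all_single(output, lengths)
    return output
```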

Reviewed By: colin2328, jiaqizhai

Differential Revision: D30925298

fbshipit-source-id: f835454f6dbaec60c8a0bbeceaba2efe25e8ab18
Summary:
Pull Request resolved: #2

* add shard idx, ranks, and size to the related config for metadata passing
* add CW sharding for per-rank table allocation
* many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0

Reviewed By: dstaay-fb

Differential Revision: D30437562

fbshipit-source-id: 0570e431d1ebb128d3d0871681093f95fe56d5f8
Summary: Added unit tests for GradientClippingOptimizer

Reviewed By: dstaay-fb

Differential Revision: D30876265

fbshipit-source-id: 762567572b712bd9dd40820f07ec21843fe014df
…ules

Summary:
1. Override named_parameters(); the optimizer will use named_parameters() instead.
2. Simplify state_dict().

Differential Revision: D30944159

fbshipit-source-id: 7240f5e6188a3ee014f025ec4947032043bb086b
Summary: Ensure rank/device match in ShardMetaData (we cannot assume the device is the same as the device the planner runs on; before this change it could lead to rank:1/cuda:0).

Reviewed By: YazhiGao

Differential Revision: D31030367

fbshipit-source-id: 54f9de2611170d1a529afe74a4452388b057f818
Differential Revision: D31042728

fbshipit-source-id: 14799576da39297674ad302ca3fb035c436d82cc
Summary:
The diff contains the following items:
* refactor DCN to be a non-lazy version
* move DCN to torchrec/modules
* add a unit test for numerical testing

The reasons not to keep the lazy version:
- it is a minor change to pass in_features, so the lazy module would not save much complexity.
- torchrec/modules is a non-lazy environment.

Reviewed By: yiq-liu

Differential Revision: D31028571

fbshipit-source-id: dececb85889471aad642404d83a5b6faec32d975
Summary:
Pull Request resolved: #3

fix tensor placement where the remote device should receive {rank, local_rank}
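For reference, the remote-device placement described above pairs the global rank with the local CUDA device; a tiny illustrative sketch of the intended format (the helper is hypothetical):

```
def remote_device_str(rank: int, local_rank: int) -> str:
    # e.g. "rank:1/cuda:1" rather than the buggy "rank:1/cuda:0"
    return f"rank:{rank}/cuda:{local_rank}"
```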

Reviewed By: dstaay-fb

Differential Revision: D31072120

fbshipit-source-id: b884afce691cac48a74524ca69e55c90e1308b39
Summary: as title - twrw doesn't really make sense for gloo/cpu

Reviewed By: rkindi

Differential Revision: D31092150

fbshipit-source-id: 0d43c0f68ea049d085c105375c61995285a58f35
Summary: Implement DMP.named_buffers()

Differential Revision: D31104124

fbshipit-source-id: 984baf747c3c89b1d0f5ccf4da5d45b57bdf4754
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31941381

Wenchen Wang and others added 5 commits November 4, 2021 14:24
Summary: As titled

Reviewed By: JadeNie

Differential Revision: D31568073

fbshipit-source-id: f31788013c20deae54dab7e7582c39d95ec0ae5c
Summary:
- Calculator: max device cost
- Placer: support CW, swap backtrack algo around
- Enumerator: switch to using name vs fqn (which is more common on hpc/torchrec configs)

Reviewed By: YLGH

Differential Revision: D32165540

fbshipit-source-id: ad143c176676ddc5145a90698d13689d7b7ba803
Summary:
For some reason, in D32055570 (70d9684) the contents of the symlinked files did not show up on GitHub - https://github.com/fairinternal/torchrec/blob/main/third_party/fbgemm_gpu

rkindi is this the right way to symlink a file from the internal repo? Does this need more config changes?

EDIT: Looks like symlinking doesn't work well with GitHub exports. Going to delete the symlink and do this via git submodules.

Reviewed By: colin2328

Differential Revision: D32073541

fbshipit-source-id: ef0a4d36cca7259a84279e54884c16b1e2674426
Summary:
Currently, verifying that an installation succeeded involves installing STL tasks and Lightning, which is fairly complicated. Apart from the complexity, we shouldn't need a second library to verify that TorchRec is installed.
There is a lot of value in having a simpler script to verify that the installation succeeded.
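A minimal sketch of what such a verification script could look like, assuming KeyedJaggedTensor is importable from torchrec.sparse.jagged_tensor; the actual script added in this diff may differ:

```
import torch
from torchrec.sparse.jagged_tensor import KeyedJaggedTensor

def main() -> None:
    # Build a tiny KeyedJaggedTensor to confirm the package imports and runs.
    kjt = KeyedJaggedTensor(
        keys=["f1", "f2"],
        values=torch.tensor([0, 1, 2, 3]),
        lengths=torch.tensor([2, 0, 1, 1]),
    )
    print("torchrec installed OK:", kjt.keys())

if __name__ == "__main__":
    main()
```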

Reviewed By: rkindi

Differential Revision: D32122481

fbshipit-source-id: b4431433be5a799840d6cf6d4c1a5570f8c15afd
Summary:
Adding:
- DDR memory in GB
- sorting of sharding types in the shards column
- unused sharding types per rank, for readability across all ranks
- a legend to explain the output
- all logging on >100 lines

Differential Revision: D32215263

fbshipit-source-id: 61b54afa6fef96a3ba4935532c529dc2acc44e9c
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31941381

1 similar comment

Summary: Fix warmup for fused optimizers

Reviewed By: divchenko, jiaqizhai, zhangruiskyline

Differential Revision: D32229152

fbshipit-source-id: 10c924e5b58a6b29cbfde260e7e8677cfe4e3e90
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31941381

7 similar comments

Jiaqi Zhai and others added 2 commits November 7, 2021 12:38
Differential Revision: D32228418

fbshipit-source-id: 3d995813dc93bc05c1277824cdc8a0e750e2cee9
Summary: kjt.bucketize is only called in row-wise sharding before sparse feature all2all. Move it to embedding_sharding.py so we don't need trim_keys(...) for KJT.
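For context, a rough sketch of what bucketizing a KJT for row-wise sharding means — routing each id to the rank that owns its row range; illustrative only, not the fbgemm/torchrec kernel:

```
import torch

def bucketize_ids(ids: torch.Tensor, num_embeddings: int, world_size: int):
    # Row-wise sharding gives each rank a contiguous block of rows.
    block_size = (num_embeddings + world_size - 1) // world_size
    ranks = torch.div(ids, block_size, rounding_mode="floor")  # destination rank per id
    local_ids = ids - ranks * block_size                       # row within that rank's shard
    return ranks, local_ids
```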

Reviewed By: divchenko

Differential Revision: D32234942

fbshipit-source-id: 6e921c3aa4676573bfdedbaf3449c8d75f2bb1f8
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31941381

Summary:
tsia

Pull Request resolved: #10

Differential Revision: D31941381

Pulled By: joshuadeng

fbshipit-source-id: 7a073a67dfbbb1050bbae43e702fa2df42bfdffb
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31941381

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D31941381

1 similar comment

YazhiGao pushed a commit to YazhiGao/torchrec-2 that referenced this pull request Feb 24, 2022
Summary:
Pull Request resolved: meta-pytorch/torchrec#54

Pull Request resolved: meta-pytorch/torchrec#10

The original fb dlrm implementation shuffled batches to get their final results.

Reviewed By: colin2328

Differential Revision: D34008000

fbshipit-source-id: 811405e02bf36436afcb0fab4873d0947396be6b

Labels

CLA Signed, fb-exported
