Add pyre support to torchrec repo #10
Conversation
fbshipit-source-id: 18eb1fdcd8d0571b03cc80ffe3ac866486871ba3
Summary:
The current torchrec lint has issues, which are exposed by the recent diff D30597438.
For example, the current BaseEmbedding in fbcode/torchrec/distributed/embedding_lookup.py has no __init__ function, which causes a KeyError exception. This failure is not caught by the recent fix in D30597438, so the lint fails with:
Error (TORCHRECDOCSTRING) lint-command-failure
Command `python3 fbcode/torchrec/linter/module_linter.py @{{PATHSFILE}}`
failed.
Run `arc lint --trace --take TORCHRECDOCSTRING` for the full command
output.
Oncall: torchrec
The Sandcastle test also shows the following error:
{F659615618}
We submit two changes (sketched below):
1. Resolve the function names "__init__" and "forward" when they are not in the function list.
2. Catch all remaining exceptions, except SyntaxError.
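For illustration, a minimal sketch of the kind of handling the two changes describe; the function names and structure here are assumptions, not the actual module_linter.py code:
```python
import ast
from typing import Optional


def find_function(tree: ast.AST, function_name: str) -> Optional[ast.FunctionDef]:
    """Return the FunctionDef named `function_name`, or None when a class
    simply does not define it (e.g. BaseEmbedding without __init__)."""
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == function_name:
            return node
    return None


def lint_source(source: str) -> None:
    # A genuine SyntaxError should still fail loudly, so it is not caught here.
    tree = ast.parse(source)
    for name in ("__init__", "forward"):
        try:
            node = find_function(tree, name)
            if node is None:
                continue  # nothing to check for a function the module doesn't define
            # ... docstring checks on `node` would run here ...
        except Exception as exc:  # everything except SyntaxError is reported, not raised
            print(f"docstring check for {name} failed: {exc}")
```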
Reviewed By: zertosh
Differential Revision: D30722426
fbshipit-source-id: 5f11110af039f2fa7bc3f63902739f0e6ea5e287
Summary:
* Added the ads 2021h1 model.
* Added a random-data launcher unit test (run locally).
* Refactored the torchrec DDP init for modules with no parameters that need gradients (we have a pure buffer-based module for ads calibration).
* Source model: https://www.internalfb.com/code/fbsource/[9f3e1042dd2d]/fbcode/hpc/models/ads/ads_1x_2021h1.py

Reviewed By: xing-liu
Differential Revision: D30076019
fbshipit-source-id: 568205e0c4fa6e60eaf4c9e94946acad5d8578e5
Summary: The current design exposes the invoke_on_rank_and_broadcast_result(...) call to users, which is not very user friendly. Reviewed By: divchenko Differential Revision: D30733086 fbshipit-source-id: 6824d2ecfb9fc149c3cb7fc095d7f9ac96ba4ed1
Summary: WarmupOptimizer needs to update underlying optimizer states; bug was introduced in refactor of CombinedOptimizer D30176405 Reviewed By: divchenko Differential Revision: D30755047 fbshipit-source-id: 122038e6a4c7bc73cc859ed8cffa68e2b9841a63
Summary: Generalize the regroup method introduced in D30044807. Will switch out the usage in ig_clips_tab_pt.py in a follow-up diff. Reviewed By: divchenko Differential Revision: D30375713 fbshipit-source-id: 6eb37c4f547db04d0048134187bb7aa0657bb9cf
Summary: Use torch SiLU instead Reviewed By: colin2328 Differential Revision: D30700094 fbshipit-source-id: b4a92e971769b9f7be739264869cee176f55f5e9
Summary: These modules promote a non-functional style of modeling. Reviewed By: wx1988 Differential Revision: D30701381 fbshipit-source-id: fedc510366e5a10e87b6ab71ac12204c5b91b45d
Summary: Pull Request resolved: #1
* Move it to the ml_foundation folder before further performance testing.
* Make it non-lazy.
* Add numerical testing.

Reviewed By: divchenko
Differential Revision: D30756661
fbshipit-source-id: e2c50848bec12943951476d23991a6f586916487
Summary: Prior to this diff, we used a fixed param key "embedding_bags" in embedding_lookup.py. This diff moves the code into the embedding module sharders so we can use different keys for different embedding modules. Reviewed By: divchenko Differential Revision: D30801536 fbshipit-source-id: ef04bd0b727139829bc6879555dfe819422b3884
Summary: This diff integrates with ShardedTensor in PyTorch distributed according to the plan/discussions in https://fb.quip.com/fwucARGO5SeO We are doing the first part of the integration, in which we replace the ShardedTensor/Metadata definitions in torchrec with the ones defined in PyTorch distributed. A second part of the integration might be more involved: we need to accommodate fbgemm kernels so they can take a sharded tensor and do the computation, then switch to a mode where a ShardedModule contains a sharded weight/tensor directly, instead of multiple small nn.EmbeddingBags. Reviewed By: YazhiGao Differential Revision: D29403713 fbshipit-source-id: 279643bd01261ae564238b9dea9d2af5597342c2
Summary: Support Copies of Data Reviewed By: YazhiGao Differential Revision: D30262094 fbshipit-source-id: 33a32245afbc419436c1902ba32020ebb4c133e7
… on AWS cluster.
Summary:
# Context
* Inside fbcode, we don't need to worry much about how to use torchrec. It's as simple as running `import torchrec` and letting autodeps figure out how to add the relevant buck target.
* In OSS, where there is no buck, we still need to be able to run `import torchrec`, and we want to do this in a way that is independent of where we call our Python script (`python3 ~/example_folder/example_folder/.../my_torchrec_script.py`). That is, we don't want to have to keep `my_torchrec_script.py` at the same level as the torchrec repo just so we can call `import torchrec`, since that won't work when `my_torchrec_script.py` cannot easily be co-located with the torchrec repo (e.g. torchrec STL app scripts):
```
random_folder
|_______________repos/
|_______torchrec/
|_______my_torchrec_script.py
```
# This Diff
The way to allow us to run `import torchrec` anywhere is to make a `setup.py` for torchrec which allows us to install torchrec with `python setup.py install`. This diff adds a **minimum viable version** of the setup.py that is **just good enough to unblock TorchRec external scale validation work on AWS clusters**. If you look at the setup.py for other domain libraries, they are way more complicated (e.g. [torchvision setup.py](https://fburl.com/zqef7peu)) and we will eventually upgrade this setup.py so it is more sophisticated for the official OSS release.
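A bare-bones setup.py of the kind described could look roughly like the sketch below; the version string, excludes, and metadata are placeholders, not the actual contents of this diff:
```python
# setup.py -- minimal sketch only; the real file in the diff may differ.
from setuptools import find_packages, setup

setup(
    name="torchrec",
    version="0.0.1",  # placeholder version
    description="PyTorch domain library for recommendation systems",
    packages=find_packages(exclude=("*tests",)),  # picks up the torchrec package
    python_requires=">=3.7",
)
```
After `python setup.py install`, `import torchrec` resolves from site-packages regardless of where the calling script lives.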
Reviewed By: colin2328
Differential Revision: D30839689
fbshipit-source-id: 9ac7722eaf8685e5d7a6b7f422ae3c91991d49c6
Summary: Assert integer types for JT & KJT lengths and offsets; the tensor data type is now checked in the JT class, and the KJT class was already covered. Reviewed By: dstaay-fb Differential Revision: D30842080 fbshipit-source-id: cf78edfffabb30f664951bfe35cf7b665df18e7c
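A minimal sketch of the kind of integer-dtype check described, with an illustrative helper name rather than the actual JaggedTensor code:
```python
import torch


def _assert_int_dtype(tensor: torch.Tensor, name: str) -> None:
    # Lengths and offsets index into the values tensor, so they must be integers.
    assert tensor.dtype in (torch.int32, torch.int64), (
        f"{name} must be int32 or int64, got {tensor.dtype}"
    )


lengths = torch.tensor([2, 0, 1], dtype=torch.int64)
_assert_int_dtype(lengths, "lengths")            # passes
# _assert_int_dtype(lengths.float(), "lengths")  # would raise AssertionError
```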
…ollection, and GroupedPooledEmbeddingsLookup

Summary: All nn.Modules should be able to self.load_state_dict(self.state_dict()). The current EmbeddingBag modules cannot, and DMP itself cannot. This diff mirrors the state_dict() customization in load_state_dict() to undo it and keep the two consistent. It also adds a DMP test for this.

Reviewed By: divchenko, rkindi
Differential Revision: D30820466
fbshipit-source-id: 181ee3484aac6c348b6bb15dc59494c188b2e89c
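The property being enforced can be illustrated with a stand-in module; this is a sketch of the invariant, not the DMP test added in the diff:
```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Stand-in for DMP / an EmbeddingBag module, used only to show the check."""

    def __init__(self) -> None:
        super().__init__()
        self.emb = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4)

    def forward(self, ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.emb(ids, offsets)


model = TinyModel()
# A module's own state_dict() must load back into the same module cleanly.
missing, unexpected = model.load_state_dict(model.state_dict())
assert not missing and not unexpected
```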
Summary: Add LayerNorm's non-lazy version. Keep the current usage of the lazy LayerNorm in Video and IG as is.
* Add a non-lazy version of LayerNorm.
* Rename the TorchRec LayerNorm to MCLayerNorm and LazyMCLayerNorm.
* Move MCLayerNorm and LazyMCLayerNorm into the torchrec/fb/module folder.
* Add a numerical unit test.
* Add a lazy vs. non-lazy numerical unit test.
* Fix the adopting call sites.

Reviewed By: divchenko
Differential Revision: D30828204
fbshipit-source-id: db722abef965622829489c60a7e5866178343814
Summary: update KJTA2A docstring, provide _recat example Reviewed By: colin2328 Differential Revision: D30877670 fbshipit-source-id: 50eca883d0c49df0738837d682c7179332c88627
…le workers.

Summary:
# Context
DataLoader can be used with multiple workers/processes to increase throughput. Map-style datasets (due to having a length property and keyed samples) automatically ensure that samples from the dataset are not duplicated across the multiple workers. However, for IterDataPipes (stream-style datasets), we must manually coordinate the workers so they don't duplicate samples ([see relevant PyTorch docs here](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)). Criteo is a torchrec IterDataPipe that does not currently have logic to prevent duplicate samples.
# This Diff
* Adds support for Criteo to handle multiple workers without duplicating samples across workers, following the PyTorch [docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)' suggestion on how to do this (see the sketch after this commit message).
* Adds some unit tests wrapping the Criteo dataset in DataLoader, showing that multiple workers now work without duplicating data.
# Implementation Details
*How do we split up the input Criteo TSV files across the different workers?* There are a few options I considered. **tldr:** Option 1, used in this diff, is simple and performant. If you want to squeeze additional utilization out of the workers, you can subdivide the TSVs into smaller ones. Option 2 is too wasteful. Option 3 is too complicated and is not as performant as Option 1.
* Option 1 (what this diff does): Each TSV file is assigned to one worker.
  * Pros:
    * Straightforward implementation. Works best when the number of TSV files evenly divides num_workers.
    * All data is read only once.
  * Cons:
    * During validation, if you have just 1 TSV file, only one worker gets to process that file while all other workers are idle.
* Option 2: Every TSV file is read across all the workers, and we drop rows on each worker to prevent duplication.
  * Pros:
    * All workers are utilized even for a single TSV.
  * Cons:
    * Terribly wasteful: each worker reads all of the rows and drops a (num_workers - 1) / num_workers portion of them. Each worker essentially reads in all the data.
* Option 3: Every TSV file is sharded across all the workers. Instead of naively reading all the data as in Option 2, we somehow use IOBase `seek` to chunk the TSV up and assign the chunks to different workers.
  * Pros:
    * All data is only read once (in theory; see cons below).
    * All workers are utilized even for a single TSV.
  * Cons:
    * **Very complicated.** Because each row of the TSV does not use the same number of bytes, when you seek in a TSV file you might end up somewhere in the middle of a row. You might need to drop that row, or do an additional seek to jump back and collect the rest of the row. You may get a performance hit due to the seeking.
    * You can achieve the same effect with better performance (due to the lack of seeks) by subdividing the TSV files into smaller files and using Option 1.

Reviewed By: colin2328
Differential Revision: D30872755
fbshipit-source-id: 85396e8db28f79ed83d62f70fcf991cfd6108216
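A minimal sketch of the Option 1 file-per-worker assignment via torch.utils.data.get_worker_info(); it uses IterableDataset for brevity, whereas the actual Criteo pipe is an IterDataPipe, and the class and argument names here are illustrative:
```python
from typing import Iterator, List

from torch.utils.data import IterableDataset, get_worker_info


class TsvRows(IterableDataset):
    """Yield rows from a list of TSV paths, assigning whole files to workers."""

    def __init__(self, paths: List[str]) -> None:
        super().__init__()
        self.paths = paths

    def __iter__(self) -> Iterator[str]:
        info = get_worker_info()
        if info is None:
            # Single-process loading: this process reads every file.
            paths = self.paths
        else:
            # Option 1: worker k reads files k, k + num_workers, k + 2*num_workers, ...
            paths = self.paths[info.id :: info.num_workers]
        for path in paths:
            with open(path) as f:
                for line in f:
                    yield line.rstrip("\n")
```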
Summary: This diff refactors the MLP-related modules:
* Make the perceptron, MLP, MCMLP, and MCPerceptron modules non-lazy.
* Make MLP an apex.mlp wrapper if apex is available.
* Move MLP (which calls the perceptron) to torchrec/fb/modules.
* Move the MC versions to torchrec/fb/ml_foundation/modules.
* Update the unit tests.
* Update the related call sites.

Reviewed By: wx1988
Differential Revision: D30874769
fbshipit-source-id: 59b0d4d0fcd456ce528de141d1074374f2bde4fd
Summary: This covers a somewhat common data-transform use case where we need to convert from the unpacked format back to the packed format (in particular, this is a dependency for cross-batch sampling). Reviewed By: divchenko Differential Revision: D30890351 fbshipit-source-id: b387f9f67b58c7e7b021fc6fc67bcc9f9be432de
Summary:
1. Lengths in KJTAll2all can be int64.
2. Use the external all_to_all_single(...) API instead of alltoall_base (see the sketch below).

Reviewed By: colin2328, jiaqizhai
Differential Revision: D30925298
fbshipit-source-id: f835454f6dbaec60c8a0bbeceaba2efe25e8ab18
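A rough sketch of what switching to the public collective looks like; it assumes an initialized default process group and evenly divisible splits, and is not the actual KJTAll2all code:
```python
import torch
import torch.distributed as dist


def exchange_lengths(lengths: torch.Tensor) -> torch.Tensor:
    """Exchange per-rank lengths using the public all_to_all_single(...) API.

    int64 lengths are passed through directly; with no split sizes given,
    the tensor is split evenly across ranks.
    """
    assert lengths.dtype == torch.int64
    output = torch.empty_like(lengths)
    dist.all_to_all_single(output, lengths)
    return output
```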
Summary: Pull Request resolved: #2
* Add shard indices, ranks, and sizes in the related config for metadata passing.
* Add CW sharding for per-rank table allocation.
* Many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0

Reviewed By: dstaay-fb
Differential Revision: D30437562
fbshipit-source-id: 0570e431d1ebb128d3d0871681093f95fe56d5f8
Summary: Added unit tests for GradientClippingOptimizer Reviewed By: dstaay-fb Differential Revision: D30876265 fbshipit-source-id: 762567572b712bd9dd40820f07ec21843fe014df
…ules

Summary:
1. Override named_parameters(); the optimizer will use named_parameters() instead (a sketch follows below).
2. Simplify state_dict().

Differential Revision: D30944159
fbshipit-source-id: 7240f5e6188a3ee014f025ec4947032043bb086b
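A minimal sketch of overriding named_parameters() so that an optimizer built from it sees the module's fused weights under a chosen name; the module and attribute names here are illustrative, not the fused embedding module in this diff:
```python
from typing import Iterator, Tuple

import torch
import torch.nn as nn


class FusedEmbeddingModule(nn.Module):
    """Illustrative only: report fused weights through named_parameters()."""

    def __init__(self) -> None:
        super().__init__()
        self._fused_weight = nn.Parameter(torch.randn(10, 4))

    def named_parameters(
        self, prefix: str = "", recurse: bool = True, remove_duplicate: bool = True
    ) -> Iterator[Tuple[str, nn.Parameter]]:
        # Expose the fused weight under an explicit, stable name.
        yield prefix + "embedding_bags.weight", self._fused_weight


module = FusedEmbeddingModule()
opt = torch.optim.SGD((p for _, p in module.named_parameters()), lr=0.1)
```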
Summary: Ensure that rank and device match in ShardMetadata (we cannot assume the device is the same as the device the planner runs on; before this change that could lead to rank:1/cuda:0). Reviewed By: YazhiGao Differential Revision: D31030367 fbshipit-source-id: 54f9de2611170d1a529afe74a4452388b057f818
Differential Revision: D31042728 fbshipit-source-id: 14799576da39297674ad302ca3fb035c436d82cc
Summary: This diff contains the following items:
* Refactor DCN to a non-lazy version (a cross-layer sketch follows below).
* Move DCN to torchrec/modules.
* Add a unit test with numerical testing.

The reasons for not keeping the lazy version are:
* Passing in_features explicitly is a minor change, so the lazy module doesn't save much complexity.
* torchrec/modules is a non-lazy environment.

Reviewed By: yiq-liu
Differential Revision: D31028571
fbshipit-source-id: dececb85889471aad642404d83a5b6faec32d975
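For context, a non-lazy cross layer simply takes in_features explicitly instead of inferring it on the first forward() call; a minimal sketch of the idea, not the torchrec DCN implementation:
```python
import torch
import torch.nn as nn


class CrossLayer(nn.Module):
    """Non-lazy: in_features is an explicit constructor argument."""

    def __init__(self, in_features: int) -> None:
        super().__init__()
        self.linear = nn.Linear(in_features, in_features)

    def forward(self, x0: torch.Tensor, x_l: torch.Tensor) -> torch.Tensor:
        # DCN cross interaction: x_{l+1} = x0 * (W x_l + b) + x_l
        return x0 * self.linear(x_l) + x_l


x0 = torch.randn(8, 16)
layer = CrossLayer(in_features=16)
out = layer(x0, x0)  # the first cross layer uses x_l = x0
```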
Summary: Pull Request resolved: #3. Fix tensor placement, where the remote device should receive {rank, local_rank}. Reviewed By: dstaay-fb Differential Revision: D31072120 fbshipit-source-id: b884afce691cac48a74524ca69e55c90e1308b39
Summary: As titled; TWRW doesn't really make sense for gloo/CPU. Reviewed By: rkindi Differential Revision: D31092150 fbshipit-source-id: 0d43c0f68ea049d085c105375c61995285a58f35
Summary: Implement DMP.named_buffers() Differential Revision: D31104124 fbshipit-source-id: 984baf747c3c89b1d0f5ccf4da5d45b57bdf4754
This pull request was exported from Phabricator. Differential Revision: D31941381
Summary: As titled Reviewed By: JadeNie Differential Revision: D31568073 fbshipit-source-id: f31788013c20deae54dab7e7582c39d95ec0ae5c
Summary:
- Calculator: max device cost.
- Placer: support CW; swap the backtrack algo around.
- Enumerator: switch to using name vs. fqn (which is more common in hpc/torchrec configs).

Reviewed By: YLGH
Differential Revision: D32165540
fbshipit-source-id: ad143c176676ddc5145a90698d13689d7b7ba803
Summary: For some reason, in D32055570 (70d9684) the contents of the symlinked files did not show up on GitHub (https://github.com/fairinternal/torchrec/blob/main/third_party/fbgemm_gpu). rkindi, is this the right way to symlink a file from the internal repo? Does this need more config changes? EDIT: Looks like symlinking doesn't work well with GitHub exports. Going to delete the symlink and do this via git submodules. Reviewed By: colin2328 Differential Revision: D32073541 fbshipit-source-id: ef0a4d36cca7259a84279e54884c16b1e2674426
Summary: Currently, verifying whether the installation was successful involves installing STL tasks and Lightning, and is fairly complicated. Apart from the complexity, we shouldn't need a second library to verify that TorchRec is installed. I see a lot of value in having a simpler script to verify that installation succeeded (a sketch follows below).

Reviewed By: rkindi
Differential Revision: D32122481
fbshipit-source-id: b4431433be5a799840d6cf6d4c1a5570f8c15afd
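A smoke-test script of the kind described could be as small as the sketch below; this is an assumption about its shape, not the script added in the diff:
```python
# install_check.py -- minimal sketch: succeeds only if torch and torchrec import.
import torch
import torchrec  # noqa: F401  (importing is the test)

print(f"torch {torch.__version__} imported OK")
print("torchrec imported OK")
```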
Summary: Adding:
- DDR memory in GB
- sorting of sharding types in the shards column
- unused sharding types per rank, for readability across all ranks
- a legend to explain the output
- all logging on >100 lines

Differential Revision: D32215263
fbshipit-source-id: 61b54afa6fef96a3ba4935532c529dc2acc44e9c
Summary: Fix warmup for fused optimizers Reviewed By: divchenko, jiaqizhai, zhangruiskyline Differential Revision: D32229152 fbshipit-source-id: 10c924e5b58a6b29cbfde260e7e8677cfe4e3e90
Differential Revision: D32228418 fbshipit-source-id: 3d995813dc93bc05c1277824cdc8a0e750e2cee9
Summary: kjt.bucketize is only called in row-wise sharding before sparse feature all2all. Move it to embedding_sharding.py so we don't need trim_keys(...) for KJT. Reviewed By: divchenko Differential Revision: D32234942 fbshipit-source-id: 6e921c3aa4676573bfdedbaf3449c8d75f2bb1f8
Summary: tsia Pull Request resolved: #10 Differential Revision: D31941381 Pulled By: joshuadeng fbshipit-source-id: 7a073a67dfbbb1050bbae43e702fa2df42bfdffb
Summary: Pull Request resolved: meta-pytorch/torchrec#54 Pull Request resolved: meta-pytorch/torchrec#10 The original fb dlrm implementation shuffled batches to get their final results. Reviewed By: colin2328 Differential Revision: D34008000 fbshipit-source-id: 811405e02bf36436afcb0fab4873d0947396be6b
Summary: tsia
Differential Revision: D31941381