Add a symlink from fbgemm_gpu into TorchRec #14
Closed
Conversation
fbshipit-source-id: 18eb1fdcd8d0571b03cc80ffe3ac866486871ba3
Summary:
The current torchrec lint has issues, exposed by the recent diff D30597438.
For example: the current BaseEmbedding in fbcode/torchrec/distributed/embedding_lookup.py has no __init__ function, which causes a KeyError exception. This failure won't be caught by the recent fix in D30597438, so the lint output fails as:
Error (TORCHRECDOCSTRING) lint-command-failure
Command `python3 fbcode/torchrec/linter/module_linter.py @{{PATHSFILE}}`
failed.
Run `arc lint --trace --take TORCHRECDOCSTRING` for the full command
output.
Oncall: torchrec
Sandcastle test also shows the following error as:
{F659615618}
We submit these two changes:
1. resolve the function name "__init__" or "forward" if it is not in the function list
2. catch the remaining exceptions, except SyntaxError.
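A minimal sketch of the two changes, using hypothetical helper names (the real module_linter.py differs):
```python
import ast


def get_docstring(source: str, function_name: str):
    """Change 1: look up "__init__"/"forward" defensively instead of
    indexing into the function list and raising KeyError."""
    functions = {
        node.name: node
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.FunctionDef)
    }
    node = functions.get(function_name)
    return ast.get_docstring(node) if node is not None else None


def lint_source(path: str) -> None:
    with open(path) as f:
        source = f.read()
    try:
        get_docstring(source, "__init__")
        get_docstring(source, "forward")
        # ... actual docstring checks would go here ...
    except SyntaxError:
        # Change 2: genuine syntax errors still propagate and fail the lint.
        raise
    except Exception as exc:
        # Every other exception is reported instead of crashing the linter.
        print(f"lint warning for {path}: {exc}")
```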
Reviewed By: zertosh
Differential Revision: D30722426
fbshipit-source-id: 5f11110af039f2fa7bc3f63902739f0e6ea5e287
Summary: * added ads 2021h1 model * ran the random-data launcher unit test locally * refactored torchrec DDP init for modules with no params that need gradients (we have a pure buffer-based module for ads calibration) * source model: https://www.internalfb.com/code/fbsource/[9f3e1042dd2d]/fbcode/hpc/models/ads/ads_1x_2021h1.py Reviewed By: xing-liu Differential Revision: D30076019 fbshipit-source-id: 568205e0c4fa6e60eaf4c9e94946acad5d8578e5
Summary: The current design exposes the invoke_on_rank_and_broadcast_result(...) call to users, which is not very user-friendly. Reviewed By: divchenko Differential Revision: D30733086 fbshipit-source-id: 6824d2ecfb9fc149c3cb7fc095d7f9ac96ba4ed1
Summary: WarmupOptimizer needs to update the underlying optimizer states; the bug was introduced in the refactor of CombinedOptimizer (D30176405). Reviewed By: divchenko Differential Revision: D30755047 fbshipit-source-id: 122038e6a4c7bc73cc859ed8cffa68e2b9841a63
Summary: Generalize the regroup method introduced in D30044807. Will switch out the usage in ig_clips_tab_pt.py in a follow-up diff. Reviewed By: divchenko Differential Revision: D30375713 fbshipit-source-id: 6eb37c4f547db04d0048134187bb7aa0657bb9cf
Summary: Use torch SiLU instead Reviewed By: colin2328 Differential Revision: D30700094 fbshipit-source-id: b4a92e971769b9f7be739264869cee176f55f5e9
Summary: These modules promote a non-functional style of modeling. Reviewed By: wx1988 Differential Revision: D30701381 fbshipit-source-id: fedc510366e5a10e87b6ab71ac12204c5b91b45d
Summary: Pull Request resolved: #1 * move it to the ml_foundation folder before further performance testing * make it non-lazy * add numerical testing Reviewed By: divchenko Differential Revision: D30756661 fbshipit-source-id: e2c50848bec12943951476d23991a6f586916487
Summary: Prior to this diff, we used a fixed param key "embedding_bags" in embedding_lookup.py. This diff moves the code to the embedding module sharders so we can use different keys for different embedding modules. Reviewed By: divchenko Differential Revision: D30801536 fbshipit-source-id: ef04bd0b727139829bc6879555dfe819422b3884
Summary: This diff integrates with ShardedTensor in PyTorch distributed according to the plan/discussions in https://fb.quip.com/fwucARGO5SeO We are doing the first part of the integration, in which we replace the ShardedTensor/Metadata definitions in torchrec and use the ones defined in PyTorch distributed. A second part of the integration might be more involved: we need to accommodate fbgemm kernels to take a sharded tensor and do the computation, then switch to a mode where a ShardedModule contains a sharded weight/tensor directly, instead of multiple small nn.EmbeddingBags. Reviewed By: YazhiGao Differential Revision: D29403713 fbshipit-source-id: 279643bd01261ae564238b9dea9d2af5597342c2
Summary: Support Copies of Data Reviewed By: YazhiGao Differential Revision: D30262094 fbshipit-source-id: 33a32245afbc419436c1902ba32020ebb4c133e7
… on AWS cluster.
Summary:
# Context
* Inside fbcode, we don't need to worry much about how to use torchrec. It's as simple as running `import torchrec` and letting autodeps figure out how to add the relevant buck target.
* In OSS, where there is no buck, we need to somehow be able to run `import torchrec`. We want this to work independently of where we call our python script (`python3 ~/example_folder/example_folder/.../my_torchrec_script.py`), i.e. we don't want to have to keep `my_torchrec_script.py` at the same level as the torchrec repo just so we can call `import torchrec`, since that will not work when my_torchrec_script.py cannot easily be co-located with the torchrec repo (e.g. torchrec STL app scripts):
```
random_folder
|_______________repos/
|_______torchrec/
|_______my_torchrec_script.py
```
# This Diff
The way to allow us to run `import torchrec` anywhere is to make a `setup.py` for torchrec which allows us to install torchrec with `python setup.py install`. This diff adds a **minimum viable version** of the setup.py that is **just good enough to unblock TorchRec external scale validation work on AWS clusters**. If you look at the setup.py for other domain libraries, they are way more complicated (e.g. [torchvision setup.py](https://fburl.com/zqef7peu)) and we will eventually upgrade this setup.py so it is more sophisticated for the official OSS release.
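For reference, a minimum viable setup.py of this kind might look roughly like the sketch below (package metadata and dependency names are assumptions, not the exact contents of this diff):
```python
# setup.py -- minimal sketch, just enough to make `import torchrec` work
# from anywhere after `python setup.py install`.
from setuptools import find_packages, setup

setup(
    name="torchrec",
    version="0.0.1",
    description="PyTorch domain library for recommendation systems",
    packages=find_packages(exclude=("*tests",)),
    install_requires=["torch"],
)
```
Once installed, `import torchrec` resolves from site-packages regardless of where the calling script lives.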
Reviewed By: colin2328
Differential Revision: D30839689
fbshipit-source-id: 9ac7722eaf8685e5d7a6b7f422ae3c91991d49c6
Summary: Assert integer types for JT & KJT lengths and offsets; check the tensor data type in the JT class. The KJT class was already covered. Reviewed By: dstaay-fb Differential Revision: D30842080 fbshipit-source-id: cf78edfffabb30f664951bfe35cf7b665df18e7c
…ollection, and GroupedPooledEmbeddingsLookup Summary: all nn.Modules should be able to self.load_state_dict(self.state_dict()). The current EmbeddingBag modules cannot, and DMP itself cannot. This diff mirrors the state_dict() customization by undoing it in load_state_dict() so the property is maintained. It adds a test in DMP for this. Reviewed By: divchenko, rkindi Differential Revision: D30820466 fbshipit-source-id: 181ee3484aac6c348b6bb15dc59494c188b2e89c
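The invariant being enforced can be expressed as a simple round trip; a minimal sketch of the pattern (the actual DMP test is more involved):
```python
import torch.nn as nn


def check_state_dict_round_trip(module: nn.Module) -> None:
    # Any customization applied in state_dict() must be undone in
    # load_state_dict(), so reloading a module's own state is a no-op.
    result = module.load_state_dict(module.state_dict(), strict=True)
    assert not result.missing_keys and not result.unexpected_keys
```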
Summary: To add a non-lazy version of layer norm. Keep the current usage of the lazy LayerNorm in Video and IG as is. * Add a non-lazy version of LayerNorm * Rename the TorchRec-version LayerNorm to MCLayerNorm and LazyMCLayerNorm * Move MCLayerNorm and LazyMCLayerNorm into the torchrec/fb/module folder * Add a numerical unit test * Add a lazy vs non-lazy numerical unit test * Fix the adopting call sites. Reviewed By: divchenko Differential Revision: D30828204 fbshipit-source-id: db722abef965622829489c60a7e5866178343814
Summary: update KJTA2A docstring, provide _recat example Reviewed By: colin2328 Differential Revision: D30877670 fbshipit-source-id: 50eca883d0c49df0738837d682c7179332c88627
…le workers. Summary:
# Context
DataLoader can be used with multiple workers/processes to increase throughput. Map-style datasets (due to having a length property and keyed samples) automatically ensure that samples from the dataset are not duplicated across the multiple workers. However, for IterDataPipes (stream-style datasets), we must manually coordinate the workers so they don't duplicate samples ([see relevant PyTorch docs here](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)). Criteo is a torchrec IterDataPipe that does not currently have logic to prevent duplicate samples.
# This Diff
* Adds support for Criteo to handle multiple workers without duplicating samples across workers, following the PyTorch [docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)' suggestion on how to do this (a sketch follows this summary).
* Adds some unit tests wrapping the Criteo dataset in DataLoader, showing that multiple workers now work without duplicating data.
# Implementation Details
*How do we split up the input Criteo TSV files across the different workers?* There are a few options I considered. **tldr** Option 1, used in this diff, is simple and performant. If you want to squeeze additional utilization out of the workers, you can subdivide the TSVs into smaller ones. Option 2 is too wasteful. Option 3 is too complicated and is not as performant as Option 1.
* Option 1 (what this diff does): Each TSV file is assigned to one worker.
  * Pros:
    * Straightforward implementation. Works best when the number of TSV files is a multiple of num_workers.
    * All data is read only once.
  * Cons:
    * During validation, if you have just 1 TSV file, only one worker gets to process that file while all other workers are idle.
* Option 2: Every TSV file is read by all the workers and we drop rows on each worker to prevent duplication.
  * Pros:
    * All workers are utilized even for a single TSV.
  * Cons:
    * Terribly wasteful: each worker reads all of the rows and drops a (num_workers - 1) / num_workers portion of them. Each worker essentially reads in all the data.
* Option 3: Every TSV file is sharded across all the workers. Instead of naively reading all the data as in Option 2, we use IOBase `seek` to chunk the TSV up and assign the chunks to different workers.
  * Pros:
    * All data is only read once (in theory, see cons below).
    * All workers are utilized even for a single TSV.
  * Cons:
    * **Very complicated.** Because each row of the TSV does not use the same number of bytes, when you seek in a TSV file you might end up somewhere in the middle of a row. You might need to drop that row, or do an additional seek to jump back and collect the rest of the row. You may take a performance hit from the seeking.
    * You can achieve the same effect with better performance (due to the lack of seeks) by subdividing the TSV files into smaller files and using Option 1.
Reviewed By: colin2328 Differential Revision: D30872755 fbshipit-source-id: 85396e8db28f79ed83d62f70fcf991cfd6108216
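A sketch of Option 1 in terms of torch.utils.data.get_worker_info(), as the linked PyTorch docs suggest (class and method names here are illustrative, not the actual Criteo datapipe code):
```python
from typing import Iterator, List

from torch.utils.data import IterableDataset, get_worker_info


class TsvShardedDataset(IterableDataset):
    def __init__(self, tsv_paths: List[str]) -> None:
        self.tsv_paths = tsv_paths

    def __iter__(self) -> Iterator[str]:
        worker_info = get_worker_info()
        if worker_info is None:
            # Single-process loading: read every file.
            paths = self.tsv_paths
        else:
            # Each worker reads only the files matching its id, so no row
            # is ever read twice across workers.
            paths = self.tsv_paths[worker_info.id :: worker_info.num_workers]
        for path in paths:
            with open(path) as f:
                for line in f:
                    yield line.rstrip("\n")
```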
Summary: This diff refactors MLP-related modules: * make the perceptron, mlp, mcmlp and mcperceptron non-lazy * make the mlp an apex.mlp wrapper when it is available * move the mlp (calling perceptron) to torchrec/fb/modules * move the mc version to torchrec/fb/ml_foundation/modules * update unit tests * update the related call sites Reviewed By: wx1988 Differential Revision: D30874769 fbshipit-source-id: 59b0d4d0fcd456ce528de141d1074374f2bde4fd
Summary: Supports a somewhat common data-transform use case where we need to convert from the unpacked format back to the packed format. (In particular, this is a dependency for cross-batch sampling.) Reviewed By: divchenko Differential Revision: D30890351 fbshipit-source-id: b387f9f67b58c7e7b021fc6fc67bcc9f9be432de
Summary: 1. lengths in KJTAll2all can be int64 2. Use the external all_to_all_single(...) API instead of alltoall_base Reviewed By: colin2328, jiaqizhai Differential Revision: D30925298 fbshipit-source-id: f835454f6dbaec60c8a0bbeceaba2efe25e8ab18
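For illustration, the public API mentioned in item 2 is used roughly as below (the even splits and process group here are assumptions for the example, not the KJTAll2All code):
```python
import torch
import torch.distributed as dist


def exchange_lengths(lengths: torch.Tensor) -> torch.Tensor:
    # int64 lengths are fine here; with the split sizes left as None the
    # tensor is split evenly across ranks.
    output = torch.empty_like(lengths)
    dist.all_to_all_single(
        output,
        lengths,
        output_split_sizes=None,
        input_split_sizes=None,
        group=dist.group.WORLD,
    )
    return output
```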
Summary: Pull Request resolved: #2 * add shard idx, ranks, and size to the related config for metadata passing * add cw sharding for per-rank table allocation. * many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0 Reviewed By: dstaay-fb Differential Revision: D30437562 fbshipit-source-id: 0570e431d1ebb128d3d0871681093f95fe56d5f8
Summary: Added unit tests for GradientClippingOptimizer Reviewed By: dstaay-fb Differential Revision: D30876265 fbshipit-source-id: 762567572b712bd9dd40820f07ec21843fe014df
…ules Summary: 1. override named_parameters(). Optimizer will use named_parameters() instead. 2. simplify state_dict() Differential Revision: D30944159 fbshipit-source-id: 7240f5e6188a3ee014f025ec4947032043bb086b
Summary: Ensure rank/device match in ShardMetaData (we cannot assume the device is the same as the device the planner is run on; before this change it could lead to rank:1/cuda:0). Reviewed By: YazhiGao Differential Revision: D31030367 fbshipit-source-id: 54f9de2611170d1a529afe74a4452388b057f818
Differential Revision: D31042728 fbshipit-source-id: 14799576da39297674ad302ca3fb035c436d82cc
Summary: This diff contains the following items: * refactor DCN to a non-lazy version * move DCN to torchrec/modules * add a unit test with numerical testing The reasons not to keep the lazy version: - it is a minor change to pass in_features, so the lazy module won't save much complexity. - torchrec/modules is a non-lazy environment. Reviewed By: yiq-liu Differential Revision: D31028571 fbshipit-source-id: dececb85889471aad642404d83a5b6faec32d975
Summary: Pull Request resolved: #3 fix tensor placement where the remote device should receive {rank, local_rank} Reviewed By: dstaay-fb Differential Revision: D31072120 fbshipit-source-id: b884afce691cac48a74524ca69e55c90e1308b39
Summary: as title - twrw doesn't really make sense for gloo/cpu. Reviewed By: rkindi Differential Revision: D31092150 fbshipit-source-id: 0d43c0f68ea049d085c105375c61995285a58f35
Summary: Implement DMP.named_buffers() Differential Revision: D31104124 fbshipit-source-id: 984baf747c3c89b1d0f5ccf4da5d45b57bdf4754
Summary: Call sync() in data stream for single GPU runs Reviewed By: divchenko Differential Revision: D31770560 fbshipit-source-id: 87deb84a1b5992d157ef9cc0e5139a4ca4eb4fb6
Summary: Only values need to be split for GroupEmbedding. __getitem__(...) on kjt will split values, weights, lengths, and offsets. Reviewed By: divchenko Differential Revision: D31770644 fbshipit-source-id: 37c53d5ac7f3d808097fc92471697448eed71090
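To illustrate the difference (not the actual TorchRec code): splitting just the flat values tensor avoids materializing the per-key weights, lengths, and offsets that KJT.__getitem__(...) would produce.
```python
from typing import List

import torch


def split_values_only(values: torch.Tensor, values_per_key: List[int]) -> List[torch.Tensor]:
    # Cheap: one split over the flat values tensor; nothing else is touched.
    return list(torch.split(values, values_per_key))
```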
Summary: Since the input dist wait might require an H2D sync, e.g. KJT.sync(...), we wait on the data stream to avoid blocking the default stream. Reviewed By: dstaay-fb Differential Revision: D31773789 fbshipit-source-id: fbe5ce4ccc835bad5dc8091b71ddc8673d9fb6ef
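Roughly, the stream pattern described here looks like the following sketch (assumed structure; not the actual TorchRec train pipeline code):
```python
import torch
import torch.nn as nn


def forward_with_data_stream(model: nn.Module, batch_cpu: torch.Tensor) -> torch.Tensor:
    data_stream = torch.cuda.Stream()
    with torch.cuda.stream(data_stream):
        # The H2D copy (and any sync such as KJT.sync) runs on the side
        # stream; pinned host memory is assumed for a truly async copy.
        batch_gpu = batch_cpu.to("cuda", non_blocking=True)
    # The default stream waits on the data stream only at the point of use,
    # instead of being blocked while the copy is in flight.
    torch.cuda.current_stream().wait_stream(data_stream)
    batch_gpu.record_stream(torch.cuda.current_stream())
    return model(batch_gpu)
```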
Summary: We don't need to do expensive torch slicing when segment is equal to feature count. Reviewed By: dstaay-fb Differential Revision: D31774275 fbshipit-source-id: 79596d14cf8997fde38620741dc21ddcd55247a4
Summary: In sequence embedding sharding, we might need to replicate sparse features and keep the original keys to construct SequenceEmbedding. Reviewed By: dstaay-fb Differential Revision: D31776401 fbshipit-source-id: 3f5e9a6818ea933389b44964473cd43535d1e733
Summary: as per title Reviewed By: lurunming Differential Revision: D31787520 fbshipit-source-id: 236b3e68ff092fc0e939d7b94f7014dd1b6e8f9b
Summary: Remove device dependency to get compute kernel/storage usage Differential Revision: D31673806 fbshipit-source-id: a84060e95cf68e298ad8f6d516ebc70afaf98753
Summary: Reworking the TREC planner's internal components for better scalability. Attempts to support a broad set of existing and new use cases https://fb.quip.com/V4htAeexikoR Differential Revision: D31496825 fbshipit-source-id: 1b74ffc2da19fe332e313bf5eb95a5a56fb7c121
Summary:
1. Instead of PipelinedInput, create Multistreamable and Pipelineable interfaces (the latter is the public API-facing one); a rough sketch of these interfaces follows below.
2. Make explicit checks for Multistreamable/Pipelineable impls for the input, the input_dist results, and the context. This avoids silent failures.
3. Create SequenceArchContext to be used instead of the default EmptyContext. This forces a record_stream implementation to be provided and avoids silent failures.
4. Make KJT, JT, KT implement the Pipelineable interface.
5. Actual fix: make sure to call record_stream() on all tensors in the context.
Reviewed By: xing-liu, jiaqizhai Differential Revision: D31865112 fbshipit-source-id: 3d6545ce2d3d6080d7fb9a69480b83a8bcbb169d
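A rough sketch of the interfaces described in items 1 and 4 (the real TorchRec definitions may differ in detail):
```python
import abc

import torch


class Multistreamable(abc.ABC):
    @abc.abstractmethod
    def record_stream(self, stream: torch.cuda.Stream) -> None:
        """Mark every tensor inside this object as used on the given stream."""


class Pipelineable(Multistreamable):
    """Public API-facing interface implemented by KJT, JT, and KT."""

    @abc.abstractmethod
    def to(self, device: torch.device, non_blocking: bool = False) -> "Pipelineable":
        """Copy this object (and all nested tensors) to the target device."""
```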
Summary: The old setup.py was needed because the top-level folder of the repo contained folders like /distributed, etc. Now the top-level folder contains a single torchrec folder, so the setup.py needs to be changed to reflect this. Reviewed By: wx1988 Differential Revision: D31886770 fbshipit-source-id: 57072bbd84465167129b1d6c4c5f274afcb4b805
Summary: For sharder redesign, implement SMCTopology off of Topology Reviewed By: dstaay-fb Differential Revision: D31585087 fbshipit-source-id: d5b7a0806c39aeb85c32f84259986444f0209c52
Summary: **Summary**: This commit solves the first part of pytorch/pytorch#52306, which disallows type annotations on instance attributes inside any method other than the constructor. Pull Request resolved: pytorch/pytorch#67051 Test Plan: Added test to test_types.py. **Reviewers**: Zhengxu Chen **Subscribers**: Zhengxu Chen, Yanan Cao, Peng Wu, Yining Lu **Tasks**: T103941984 **Tags**: pytorch **Fixes** pytorch/pytorch#52306 Reviewed By: zhxchen17 Differential Revision: D31843527 Pulled By: andrewor14 fbshipit-source-id: 624879ae801621e367c59228be8b0581ecd30ef4
Summary: Part of the EmbeddingShardingPlanner refactor. Reviewed By: dstaay-fb Differential Revision: D31553701 fbshipit-source-id: ced039aadc3609c7af52b6d1faf7222b70597401
Summary: Wall Time cost calculator. General thought: memory BW dominated equations Reviewed By: dstaay-fb Differential Revision: D31706355 fbshipit-source-id: bff482645b8431c77824cbec8e6c3c1020349359
Summary:
Similar to _input_dists. Defer the initialization so that we can create ShardedEmbeddingBagCollection with fewer dependencies.
This diff fixes the errors below in dry-sharding:
```
File "<torch_package_1>.hpc/torchrec/sparsenn_provider.py", line 580, in shard_model
File "<torch_package_1>.torchrec/distributed/model_parallel.py", line 109, in __init__
File "<torch_package_1>.torchrec/distributed/model_parallel.py", line 145, in _init_dmp
File "<torch_package_1>.torchrec/distributed/model_parallel.py", line 175, in _shard_modules_impl
File "<torch_package_1>.torchrec/distributed/model_parallel.py", line 164, in _shard_modules_impl
File "<torch_package_1>.torchrec/distributed/embedding.py", line 497, in shard
File "<torch_package_1>.torchrec/distributed/embedding.py", line 262, in __init__
File "<torch_package_1>.torchrec/distributed/embedding.py", line 330, in _create_output_dist
File "<torch_package_1>.torchrec/distributed/twrw_sharding.py", line 354, in create_pooled_output_dist
File "<torch_package_1>.torchrec/distributed/twrw_sharding.py", line 259, in cross_pg
File "<torch_package_1>.torchrec/distributed/comm.py", line 122, in intra_and_cross_node_pg
File "/data/users/runming/fbsource/fbcode/buck-out/dev/gen/scripts/runming/transfer_learning/debug_dry_sharding#link-tree/caffe2/torch/fb/lwt/torch_distributed.py", line 168, in new_group
raise NotImplementedError(
```
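The deferral pattern itself is simple; an illustrative sketch (not the actual ShardedEmbeddingBagCollection code):
```python
from typing import List, Optional


class ShardedCollectionSketch:
    """Output dists are built on first use, so construction no longer needs
    the process groups that fail in the traceback above."""

    def __init__(self) -> None:
        self._output_dists: Optional[List[object]] = None  # deferred, like _input_dists

    def _create_output_dists(self) -> None:
        # Expensive setup that requires intra/cross-node process groups.
        self._output_dists = []

    def output_dist(self, features: object) -> object:
        if self._output_dists is None:
            self._create_output_dists()
        # ... dispatch features through self._output_dists ...
        return features
```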
Reviewed By: dstaay-fb
Differential Revision: D31866535
fbshipit-source-id: 80cccdfb6355281fed46daff6633db34f8758b01
Summary: In some scenarios, we want to create TwRwEmbeddingSharding and execute _shard() without the intra & cross pgs. Defer the intra & cross pg initialization to achieve this. Reviewed By: dstaay-fb Differential Revision: D31846780 fbshipit-source-id: 007c9a1f5f4cf4bbc90198d830fd2e6d1f811d17
Summary: Stream() should be called only for cuda device. Reviewed By: dstaay-fb Differential Revision: D31838338 fbshipit-source-id: f022e29dcd90837c086cc675553a470164cfdddf
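A minimal sketch of the guard (assumed form, not the exact diff):
```python
from typing import Optional

import torch


def maybe_data_stream(device: torch.device) -> Optional[torch.cuda.Stream]:
    # Only CUDA devices get a side stream; CPU runs return None, and
    # torch.cuda.stream(None) is a no-op context manager downstream.
    return torch.cuda.Stream() if device.type == "cuda" else None
```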
Summary: The planner uses a real device type to generate the sharding plan, so the shard metadata placement device is a real device. If we use the meta device to construct the model, the placement device can conflict with the tensor device. So we need to hack the placement device to the meta device in order to pass the device-mismatch verification. Reviewed By: dstaay-fb Differential Revision: D31836919 fbshipit-source-id: 68c10fe0f5a75b45fea7107e90c05ef5bc58b6cf
Summary: We need to pass the process group in All2All_Seq_Req_Wait, otherwise the backward all2all will use a new NCCL stream. Reviewed By: divchenko Differential Revision: D31945454 fbshipit-source-id: 8a19a840c3cbb68471f746a0b7603293f1747c45
Summary: Pull Request resolved: pytorch/pytorch#64481 This simplifies the `init_from_local_shards` API in sharded tensor to only require the user to pass in a list of `Shard` and `overall_size`, instead of a ShardedTensorMetadata. We will do the all_gather inside to form a valid ShardedTensorMetadata instead. TODO: add more test cases to improve coverage. ghstack-source-id: 141742350 Reviewed By: pritamdamania87 Differential Revision: D30748504 fbshipit-source-id: 6e97d95ffafde6b5f3970e2c2ba33b76cabd8d8a
Summary: We want to consolidate SMCTopology with the concrete base class Reviewed By: dstaay-fb Differential Revision: D32027047 fbshipit-source-id: 7c895d19826025bf157c5ebbf2832edf95665a1f
Summary: Original commit changeset: 6e97d95ffafd Reviewed By: wanchaol Differential Revision: D32023341 fbshipit-source-id: 2a9f7b637c0ff18700bcc3e44466fffcff861698
Summary: TorchRec OSS installation currently requires 3 steps: 1. fbgemm_gpu installation 2. Symlink fbgemm_gpu_py.so to the TorchRec directory 3. Run TorchRec's installation We can simplify this to a single step by bringing fbgemm_gpu into the TorchRec directory. Reviewed By: rkindi Differential Revision: D32055570 fbshipit-source-id: b3d2c1234469898a1cfe5c2e3cdb67e3c289d9db
This pull request was exported from Phabricator. Differential Revision: D32055570
Labels: CLA Signed, fb-exported
Summary:
TorchRec OSS installation currently requires 3 steps:
1. fbgemm_gpu installation
2. Symlink fbgemm_gpu_py.so to the TorchRec directory
3. Run TorchRec's installation
We can simplify this to a single step by bringing fbgemm_gpu into the TorchRec directory.
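For context, creating such a symlink programmatically might look like the sketch below (the paths and mechanism are hypothetical; the actual change in this diff may differ):
```python
import os

# Hypothetical locations: a local fbgemm_gpu checkout, linked into the
# torchrec package directory so a single install step picks it up.
fbgemm_gpu_src = os.path.abspath(os.path.join("..", "FBGEMM", "fbgemm_gpu"))
link_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "torchrec", "fbgemm_gpu")

if not os.path.islink(link_path) and not os.path.exists(link_path):
    os.symlink(fbgemm_gpu_src, link_path)
```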
Reviewed By: rkindi
Differential Revision: D32055570