Add a symlink from fbgemm_gpu into TorchRec #14
Closed
Conversation
fbshipit-source-id: 18eb1fdcd8d0571b03cc80ffe3ac866486871ba3
Summary:
The current torchrec lint has issues, exposed by the recent diff D30597438.
For example: the current BaseEmbedding in fbcode/torchrec/distributed/embedding_lookup.py has no __init__ function, which causes a KeyError exception. This failure won't be caught by the recent fix in D30597438, so the lint output fails as:
Error (TORCHRECDOCSTRING) lint-command-failure
Command `python3 fbcode/torchrec/linter/module_linter.py @{{PATHSFILE}}`
failed.
Run `arc lint --trace --take TORCHRECDOCSTRING` for the full command
output.
Oncall: torchrec
Sandcastle test also shows the following error as:
{F659615618}
We submit these two changes:
1. resolve the function name "__init__" or "forward" if it is not in the function list
2. catch the remaining exceptions, except SyntaxError.
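A minimal sketch of the two changes, using hypothetical helper names (the real module_linter.py differs):
```python
import ast


def get_docstring(source: str, function_name: str):
    """Change 1: look up "__init__"/"forward" defensively instead of
    indexing into the function list and raising KeyError."""
    functions = {
        node.name: node
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.FunctionDef)
    }
    node = functions.get(function_name)
    return ast.get_docstring(node) if node is not None else None


def lint_source(path: str) -> None:
    with open(path) as f:
        source = f.read()
    try:
        get_docstring(source, "__init__")
        get_docstring(source, "forward")
        # ... actual docstring checks would go here ...
    except SyntaxError:
        # Change 2: genuine syntax errors still propagate and fail the lint.
        raise
    except Exception as exc:
        # Every other exception is reported instead of crashing the linter.
        print(f"lint warning for {path}: {exc}")
```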
Reviewed By: zertosh
Differential Revision: D30722426
fbshipit-source-id: 5f11110af039f2fa7bc3f63902739f0e6ea5e287
Summary: * added ads 2021h1 model * ran the random-data launcher unit test locally * refactored torchrec DDP init for modules with no params that need gradients (we have a pure buffer-based module for ads calibration) * source model: https://www.internalfb.com/code/fbsource/[9f3e1042dd2d]/fbcode/hpc/models/ads/ads_1x_2021h1.py Reviewed By: xing-liu Differential Revision: D30076019 fbshipit-source-id: 568205e0c4fa6e60eaf4c9e94946acad5d8578e5
Summary: The current design exposes the invoke_on_rank_and_broadcast_result(...) call to users, which is not very user-friendly. Reviewed By: divchenko Differential Revision: D30733086 fbshipit-source-id: 6824d2ecfb9fc149c3cb7fc095d7f9ac96ba4ed1
Summary: WarmupOptimizer needs to update the underlying optimizer states; the bug was introduced in the refactor of CombinedOptimizer (D30176405). Reviewed By: divchenko Differential Revision: D30755047 fbshipit-source-id: 122038e6a4c7bc73cc859ed8cffa68e2b9841a63
Summary: Generalize the regroup method introduced in D30044807. Will switch out the usage in ig_clips_tab_pt.py in a follow-up diff. Reviewed By: divchenko Differential Revision: D30375713 fbshipit-source-id: 6eb37c4f547db04d0048134187bb7aa0657bb9cf
Summary: Use torch SiLU instead Reviewed By: colin2328 Differential Revision: D30700094 fbshipit-source-id: b4a92e971769b9f7be739264869cee176f55f5e9
Summary: These modules promote a non-functional style of modeling. Reviewed By: wx1988 Differential Revision: D30701381 fbshipit-source-id: fedc510366e5a10e87b6ab71ac12204c5b91b45d
Summary: Pull Request resolved: #1 * move it to the ml_foundation folder before further performance testing * make it non-lazy * add numerical testing Reviewed By: divchenko Differential Revision: D30756661 fbshipit-source-id: e2c50848bec12943951476d23991a6f586916487
Summary: Prior to this diff, we used a fixed param key "embedding_bags" in embedding_lookup.py. This diff moves the code to the embedding module sharders so we can use different keys for different embedding modules. Reviewed By: divchenko Differential Revision: D30801536 fbshipit-source-id: ef04bd0b727139829bc6879555dfe819422b3884
Summary: This diff integrates with ShardedTensor in PyTorch distributed according to the plan/discussions in https://fb.quip.com/fwucARGO5SeO We are doing the first part of the integration, in which we replace the ShardedTensor/Metadata definitions in torchrec and use the ones defined in PyTorch distributed. A second part of the integration might be more involved: we need to accommodate fbgemm kernels to take a sharded tensor and do the computation, then switch to a mode where a ShardedModule contains a sharded weight/tensor directly, instead of multiple small nn.EmbeddingBags. Reviewed By: YazhiGao Differential Revision: D29403713 fbshipit-source-id: 279643bd01261ae564238b9dea9d2af5597342c2
Summary: Support Copies of Data Reviewed By: YazhiGao Differential Revision: D30262094 fbshipit-source-id: 33a32245afbc419436c1902ba32020ebb4c133e7
… on AWS cluster.
Summary:
# Context
* Inside fbcode, we don't need to worry much about how to use torchrec. It's as simple as running `import torchrec` and letting autodeps figure out how to add the relevant buck target.
* In OSS, where there is no buck, we need to somehow be able to run `import torchrec`. We want this to work independently of where we call our python script (`python3 ~/example_folder/example_folder/.../my_torchrec_script.py`), i.e. we don't want to have to keep `my_torchrec_script.py` at the same level as the torchrec repo just so we can call `import torchrec`, since that will not work when my_torchrec_script.py cannot easily be co-located with the torchrec repo (e.g. torchrec STL app scripts):
```
random_folder
|_______________repos/
|_______torchrec/
|_______my_torchrec_script.py
```
# This Diff
The way to allow us to run `import torchrec` anywhere is to make a `setup.py` for torchrec which allows us to install torchrec with `python setup.py install`. This diff adds a **minimum viable version** of the setup.py that is **just good enough to unblock TorchRec external scale validation work on AWS clusters**. If you look at the setup.py for other domain libraries, they are way more complicated (e.g. [torchvision setup.py](https://fburl.com/zqef7peu)) and we will eventually upgrade this setup.py so it is more sophisticated for the official OSS release.
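For reference, a minimum viable setup.py of this kind might look roughly like the sketch below (package metadata and dependency names are assumptions, not the exact contents of this diff):
```python
# setup.py -- minimal sketch, just enough to make `import torchrec` work
# from anywhere after `python setup.py install`.
from setuptools import find_packages, setup

setup(
    name="torchrec",
    version="0.0.1",
    description="PyTorch domain library for recommendation systems",
    packages=find_packages(exclude=("*tests",)),
    install_requires=["torch"],
)
```
Once installed, `import torchrec` resolves from site-packages regardless of where the calling script lives.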
Reviewed By: colin2328
Differential Revision: D30839689
fbshipit-source-id: 9ac7722eaf8685e5d7a6b7f422ae3c91991d49c6
Summary: Assert integer types for JT & KJT lengths and offsets; check the tensor data type in the JT class. The KJT class was already covered. Reviewed By: dstaay-fb Differential Revision: D30842080 fbshipit-source-id: cf78edfffabb30f664951bfe35cf7b665df18e7c
…ollection, and GroupedPooledEmbeddingsLookup Summary: all nn.Modules should be able to self.load_state_dict(self.state_dict()). The current EmbeddingBag modules cannot, and DMP itself cannot. This diff mirrors the state_dict() customization by undoing it in load_state_dict() so the property is maintained. It adds a test in DMP for this. Reviewed By: divchenko, rkindi Differential Revision: D30820466 fbshipit-source-id: 181ee3484aac6c348b6bb15dc59494c188b2e89c
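The invariant being enforced can be expressed as a simple round trip; a minimal sketch of the pattern (the actual DMP test is more involved):
```python
import torch.nn as nn


def check_state_dict_round_trip(module: nn.Module) -> None:
    # Any customization applied in state_dict() must be undone in
    # load_state_dict(), so reloading a module's own state is a no-op.
    result = module.load_state_dict(module.state_dict(), strict=True)
    assert not result.missing_keys and not result.unexpected_keys
```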
Summary: To add a non-lazy version of layer norm. Keep the current usage of the lazy LayerNorm in Video and IG as is. * Add a non-lazy version of LayerNorm * Rename the TorchRec-version LayerNorm to MCLayerNorm and LazyMCLayerNorm * Move MCLayerNorm and LazyMCLayerNorm into the torchrec/fb/module folder * Add a numerical unit test * Add a lazy vs non-lazy numerical unit test * Fix the adopting call sites. Reviewed By: divchenko Differential Revision: D30828204 fbshipit-source-id: db722abef965622829489c60a7e5866178343814
Summary: update KJTA2A docstring, provide _recat example Reviewed By: colin2328 Differential Revision: D30877670 fbshipit-source-id: 50eca883d0c49df0738837d682c7179332c88627
…le workers. Summary:
# Context
DataLoader can be used with multiple workers/processes to increase throughput. Map-style datasets (due to having a length property and keyed samples) automatically ensure that samples from the dataset are not duplicated across the multiple workers. However, for IterDataPipes (stream-style datasets), we must manually coordinate the workers so they don't duplicate samples ([see relevant PyTorch docs here](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)). Criteo is a torchrec IterDataPipe that does not currently have logic to prevent duplicate samples.
# This Diff
* Adds support for Criteo to handle multiple workers without duplicating samples across workers, following the PyTorch [docs](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)' suggestion on how to do this (a sketch follows this summary).
* Adds some unit tests wrapping the Criteo dataset in DataLoader, showing that multiple workers now work without duplicating data.
# Implementation Details
*How do we split up the input Criteo TSV files across the different workers?* There are a few options I considered. **tldr** Option 1, used in this diff, is simple and performant. If you want to squeeze additional utilization out of the workers, you can subdivide the TSVs into smaller ones. Option 2 is too wasteful. Option 3 is too complicated and is not as performant as Option 1.
* Option 1 (what this diff does): Each TSV file is assigned to one worker.
  * Pros:
    * Straightforward implementation. Works best when the number of TSV files is a multiple of num_workers.
    * All data is read only once.
  * Cons:
    * During validation, if you have just 1 TSV file, only one worker gets to process that file while all other workers are idle.
* Option 2: Every TSV file is read by all the workers and we drop rows on each worker to prevent duplication.
  * Pros:
    * All workers are utilized even for a single TSV.
  * Cons:
    * Terribly wasteful: each worker reads all of the rows and drops a (num_workers - 1) / num_workers portion of them. Each worker essentially reads in all the data.
* Option 3: Every TSV file is sharded across all the workers. Instead of naively reading all the data as in Option 2, we use IOBase `seek` to chunk the TSV up and assign the chunks to different workers.
  * Pros:
    * All data is only read once (in theory, see cons below).
    * All workers are utilized even for a single TSV.
  * Cons:
    * **Very complicated.** Because each row of the TSV does not use the same number of bytes, when you seek in a TSV file you might end up somewhere in the middle of a row. You might need to drop that row, or do an additional seek to jump back and collect the rest of the row. You may take a performance hit from the seeking.
    * You can achieve the same effect with better performance (due to the lack of seeks) by subdividing the TSV files into smaller files and using Option 1.
Reviewed By: colin2328 Differential Revision: D30872755 fbshipit-source-id: 85396e8db28f79ed83d62f70fcf991cfd6108216
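A sketch of Option 1 in terms of torch.utils.data.get_worker_info(), as the linked PyTorch docs suggest (class and method names here are illustrative, not the actual Criteo datapipe code):
```python
from typing import Iterator, List

from torch.utils.data import IterableDataset, get_worker_info


class TsvShardedDataset(IterableDataset):
    def __init__(self, tsv_paths: List[str]) -> None:
        self.tsv_paths = tsv_paths

    def __iter__(self) -> Iterator[str]:
        worker_info = get_worker_info()
        if worker_info is None:
            # Single-process loading: read every file.
            paths = self.tsv_paths
        else:
            # Each worker reads only the files matching its id, so no row
            # is ever read twice across workers.
            paths = self.tsv_paths[worker_info.id :: worker_info.num_workers]
        for path in paths:
            with open(path) as f:
                for line in f:
                    yield line.rstrip("\n")
```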
Summary: This diff refactors MLP-related modules: * make the perceptron, mlp, mcmlp and mcperceptron non-lazy * make the mlp an apex.mlp wrapper when it is available * move the mlp (calling perceptron) to torchrec/fb/modules * move the mc version to torchrec/fb/ml_foundation/modules * update unit tests * update the related call sites Reviewed By: wx1988 Differential Revision: D30874769 fbshipit-source-id: 59b0d4d0fcd456ce528de141d1074374f2bde4fd
Summary: Supports a somewhat common data-transform use case where we need to convert from the unpacked format back to the packed format. (In particular, this is a dependency for cross-batch sampling.) Reviewed By: divchenko Differential Revision: D30890351 fbshipit-source-id: b387f9f67b58c7e7b021fc6fc67bcc9f9be432de
Summary: 1. lengths in KJTAll2all can be int64 2. Use the external all_to_all_single(...) API instead of alltoall_base Reviewed By: colin2328, jiaqizhai Differential Revision: D30925298 fbshipit-source-id: f835454f6dbaec60c8a0bbeceaba2efe25e8ab18
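For illustration, the public API mentioned in item 2 is used roughly as below (the even splits and process group here are assumptions for the example, not the KJTAll2All code):
```python
import torch
import torch.distributed as dist


def exchange_lengths(lengths: torch.Tensor) -> torch.Tensor:
    # int64 lengths are fine here; with the split sizes left as None the
    # tensor is split evenly across ranks.
    output = torch.empty_like(lengths)
    dist.all_to_all_single(
        output,
        lengths,
        output_split_sizes=None,
        input_split_sizes=None,
        group=dist.group.WORLD,
    )
    return output
```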
Summary: Pull Request resolved: #2 * add shard idx, ranks, and size to the related config for metadata passing * add cw sharding for per-rank table allocation. * many design decisions are captured in https://fb.quip.com/byvkAZGpK1o0 Reviewed By: dstaay-fb Differential Revision: D30437562 fbshipit-source-id: 0570e431d1ebb128d3d0871681093f95fe56d5f8
Summary: Added unit tests for GradientClippingOptimizer Reviewed By: dstaay-fb Differential Revision: D30876265 fbshipit-source-id: 762567572b712bd9dd40820f07ec21843fe014df
…ules Summary: 1. override named_parameters(). Optimizer will use named_parameters() instead. 2. simplify state_dict() Differential Revision: D30944159 fbshipit-source-id: 7240f5e6188a3ee014f025ec4947032043bb086b
Summary: Ensure rank/device match in ShardMetaData (we cannot assume the device is the same as the device the planner is run on; before this change it could lead to rank:1/cuda:0). Reviewed By: YazhiGao Differential Revision: D31030367 fbshipit-source-id: 54f9de2611170d1a529afe74a4452388b057f818
Differential Revision: D31042728 fbshipit-source-id: 14799576da39297674ad302ca3fb035c436d82cc
Summary: This diff contains the following items: * refactor DCN to a non-lazy version * move DCN to torchrec/modules * add a unit test with numerical testing The reasons not to keep the lazy version: - it is a minor change to pass in_features, so the lazy module won't save much complexity. - torchrec/modules is a non-lazy environment. Reviewed By: yiq-liu Differential Revision: D31028571 fbshipit-source-id: dececb85889471aad642404d83a5b6faec32d975
Summary: Pull Request resolved: #3 fix tensor placement where the remote device should receive {rank, local_rank} Reviewed By: dstaay-fb Differential Revision: D31072120 fbshipit-source-id: b884afce691cac48a74524ca69e55c90e1308b39
Summary: as title - twrw doesn't really make sense for gloo/cpu. Reviewed By: rkindi Differential Revision: D31092150 fbshipit-source-id: 0d43c0f68ea049d085c105375c61995285a58f35
Summary: Implement DMP.named_buffers() Differential Revision: D31104124 fbshipit-source-id: 984baf747c3c89b1d0f5ccf4da5d45b57bdf4754
Summary: Call sync() in data stream for single GPU runs Reviewed By: divchenko Differential Revision: D31770560 fbshipit-source-id: 87deb84a1b5992d157ef9cc0e5139a4ca4eb4fb6
Summary: Only values need to be split for GroupEmbedding. __getitem__(...) on kjt will split values, weights, lengths, and offsets. Reviewed By: divchenko Differential Revision: D31770644 fbshipit-source-id: 37c53d5ac7f3d808097fc92471697448eed71090
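To illustrate the difference (not the actual TorchRec code): splitting just the flat values tensor avoids materializing the per-key weights, lengths, and offsets that KJT.__getitem__(...) would produce.
```python
from typing import List

import torch


def split_values_only(values: torch.Tensor, values_per_key: List[int]) -> List[torch.Tensor]:
    # Cheap: one split over the flat values tensor; nothing else is touched.
    return list(torch.split(values, values_per_key))
```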
Summary: Since the input dist wait might require an H2D sync, e.g. KJT.sync(...), we wait on the data stream to avoid blocking the default stream. Reviewed By: dstaay-fb Differential Revision: D31773789 fbshipit-source-id: fbe5ce4ccc835bad5dc8091b71ddc8673d9fb6ef
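Roughly, the stream pattern described here looks like the following sketch (assumed structure; not the actual TorchRec train pipeline code):
```python
import torch
import torch.nn as nn


def forward_with_data_stream(model: nn.Module, batch_cpu: torch.Tensor) -> torch.Tensor:
    data_stream = torch.cuda.Stream()
    with torch.cuda.stream(data_stream):
        # The H2D copy (and any sync such as KJT.sync) runs on the side
        # stream; pinned host memory is assumed for a truly async copy.
        batch_gpu = batch_cpu.to("cuda", non_blocking=True)
    # The default stream waits on the data stream only at the point of use,
    # instead of being blocked while the copy is in flight.
    torch.cuda.current_stream().wait_stream(data_stream)
    batch_gpu.record_stream(torch.cuda.current_stream())
    return model(batch_gpu)
```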
Summary: We don't need to do expensive torch slicing when segment is equal to feature count. Reviewed By: dstaay-fb Differential Revision: D31774275 fbshipit-source-id: 79596d14cf8997fde38620741dc21ddcd55247a4
Summary: In sequence embedding sharding, we might need to replicate sparse features and keep the original keys to construct SequenceEmbedding. Reviewed By: dstaay-fb Differential Revision: D31776401 fbshipit-source-id: 3f5e9a6818ea933389b44964473cd43535d1e733
Summary: as per title Reviewed By: lurunming Differential Revision: D31787520 fbshipit-source-id: 236b3e68ff092fc0e939d7b94f7014dd1b6e8f9b
Summary: Remove device dependency to get compute kernel/storage usage Differential Revision: D31673806 fbshipit-source-id: a84060e95cf68e298ad8f6d516ebc70afaf98753
Summary: Reworking the TREC planner's internal components for better scalability. Attempts to support a broad set of existing and new use cases https://fb.quip.com/V4htAeexikoR Differential Revision: D31496825 fbshipit-source-id: 1b74ffc2da19fe332e313bf5eb95a5a56fb7c121
Summary:
1. Instead of PipelinedInput, create Multistreamable and Pipelineable interfaces (the latter is the public API-facing one); a rough sketch of these interfaces follows below.
2. Make explicit checks for Multistreamable/Pipelineable impls for the input, the input_dist results, and the context. This avoids silent failures.
3. Create SequenceArchContext to be used instead of the default EmptyContext. This forces a record_stream implementation to be provided and avoids silent failures.
4. Make KJT, JT, KT implement the Pipelineable interface.
5. Actual fix: make sure to call record_stream() on all tensors in the context.
Reviewed By: xing-liu, jiaqizhai Differential Revision: D31865112 fbshipit-source-id: 3d6545ce2d3d6080d7fb9a69480b83a8bcbb169d
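A rough sketch of the interfaces described in items 1 and 4 (the real TorchRec definitions may differ in detail):
```python
import abc

import torch


class Multistreamable(abc.ABC):
    @abc.abstractmethod
    def record_stream(self, stream: torch.cuda.Stream) -> None:
        """Mark every tensor inside this object as used on the given stream."""


class Pipelineable(Multistreamable):
    """Public API-facing interface implemented by KJT, JT, and KT."""

    @abc.abstractmethod
    def to(self, device: torch.device, non_blocking: bool = False) -> "Pipelineable":
        """Copy this object (and all nested tensors) to the target device."""
```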
Summary: The old setup.py was needed because the top-level folder of the repo contained folders like /distributed, etc. Now the top-level folder contains a single torchrec folder, so the setup.py needs to be changed to reflect this. Reviewed By: wx1988 Differential Revision: D31886770 fbshipit-source-id: 57072bbd84465167129b1d6c4c5f274afcb4b805
Summary: For sharder redesign, implement SMCTopology off of Topology Reviewed By: dstaay-fb Differential Revision: D31585087 fbshipit-source-id: d5b7a0806c39aeb85c32f84259986444f0209c52
Summary: **Summary**: This commit solves the first part of pytorch/pytorch#52306, which disallows type annotations on instance attributes inside any method other than the constructor. Pull Request resolved: pytorch/pytorch#67051 Test Plan: Added test to test_types.py. **Reviewers**: Zhengxu Chen **Subscribers**: Zhengxu Chen, Yanan Cao, Peng Wu, Yining Lu **Tasks**: T103941984 **Tags**: pytorch **Fixes** pytorch/pytorch#52306 Reviewed By: zhxchen17 Differential Revision: D31843527 Pulled By: andrewor14 fbshipit-source-id: 624879ae801621e367c59228be8b0581ecd30ef4
Summary: Part of the EmbeddingShardingPlanner refactor. Reviewed By: dstaay-fb Differential Revision: D31553701 fbshipit-source-id: ced039aadc3609c7af52b6d1faf7222b70597401
Summary: Wall Time cost calculator. General thought: memory BW dominated equations Reviewed By: dstaay-fb Differential Revision: D31706355 fbshipit-source-id: bff482645b8431c77824cbec8e6c3c1020349359
Summary:
Similar to _input_dists. Defer the initialization so that we can create ShardedEmbeddingBagCollection with fewer dependencies.
This diff fixes the errors below in dry-sharding:
```
File "<torch_package_1>.hpc/torchrec/sparsenn_provider.py", line 580, in shard_model
File "<torch_package_1>.torchrec/distributed/model_parallel.py", line 109, in __init__
File "<torch_package_1>.torchrec/distributed/model_parallel.py", line 145, in _init_dmp
File "<torch_package_1>.torchrec/distributed/model_parallel.py", line 175, in _shard_modules_impl
File "<torch_package_1>.torchrec/distributed/model_parallel.py", line 164, in _shard_modules_impl
File "<torch_package_1>.torchrec/distributed/embedding.py", line 497, in shard
File "<torch_package_1>.torchrec/distributed/embedding.py", line 262, in __init__
File "<torch_package_1>.torchrec/distributed/embedding.py", line 330, in _create_output_dist
File "<torch_package_1>.torchrec/distributed/twrw_sharding.py", line 354, in create_pooled_output_dist
File "<torch_package_1>.torchrec/distributed/twrw_sharding.py", line 259, in cross_pg
File "<torch_package_1>.torchrec/distributed/comm.py", line 122, in intra_and_cross_node_pg
File "/data/users/runming/fbsource/fbcode/buck-out/dev/gen/scripts/runming/transfer_learning/debug_dry_sharding#link-tree/caffe2/torch/fb/lwt/torch_distributed.py", line 168, in new_group
raise NotImplementedError(
```
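The deferral pattern itself is simple; an illustrative sketch (not the actual ShardedEmbeddingBagCollection code):
```python
from typing import List, Optional


class ShardedCollectionSketch:
    """Output dists are built on first use, so construction no longer needs
    the process groups that fail in the traceback above."""

    def __init__(self) -> None:
        self._output_dists: Optional[List[object]] = None  # deferred, like _input_dists

    def _create_output_dists(self) -> None:
        # Expensive setup that requires intra/cross-node process groups.
        self._output_dists = []

    def output_dist(self, features: object) -> object:
        if self._output_dists is None:
            self._create_output_dists()
        # ... dispatch features through self._output_dists ...
        return features
```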
Reviewed By: dstaay-fb
Differential Revision: D31866535
fbshipit-source-id: 80cccdfb6355281fed46daff6633db34f8758b01
Summary: In some scenarios, we want to create TwRwEmbeddingSharding and execute _shard() without the intra & cross pgs. Defer the intra & cross pg initialization to achieve this. Reviewed By: dstaay-fb Differential Revision: D31846780 fbshipit-source-id: 007c9a1f5f4cf4bbc90198d830fd2e6d1f811d17
Summary: Stream() should be called only for cuda device. Reviewed By: dstaay-fb Differential Revision: D31838338 fbshipit-source-id: f022e29dcd90837c086cc675553a470164cfdddf
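A minimal sketch of the guard (assumed form, not the exact diff):
```python
from typing import Optional

import torch


def maybe_data_stream(device: torch.device) -> Optional[torch.cuda.Stream]:
    # Only CUDA devices get a side stream; CPU runs return None, and
    # torch.cuda.stream(None) is a no-op context manager downstream.
    return torch.cuda.Stream() if device.type == "cuda" else None
```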
Summary: The planner uses a real device type to generate the sharding plan, so the shard metadata placement device is a real device. If we use the meta device to construct the model, the placement device can conflict with the tensor device. So we need to hack the placement device to the meta device in order to pass the device-mismatch verification. Reviewed By: dstaay-fb Differential Revision: D31836919 fbshipit-source-id: 68c10fe0f5a75b45fea7107e90c05ef5bc58b6cf
Summary: We need to pass the process group in All2All_Seq_Req_Wait, otherwise the backward all2all will use a new NCCL stream. Reviewed By: divchenko Differential Revision: D31945454 fbshipit-source-id: 8a19a840c3cbb68471f746a0b7603293f1747c45
Summary: Pull Request resolved: pytorch/pytorch#64481 This simplifies the `init_from_local_shards` API in sharded tensor to only require the user to pass in a list of `Shard` and `overall_size`, instead of a ShardedTensorMetadata. We will do the all_gather inside to form a valid ShardedTensorMetadata instead. TODO: add more test cases to improve coverage. ghstack-source-id: 141742350 Reviewed By: pritamdamania87 Differential Revision: D30748504 fbshipit-source-id: 6e97d95ffafde6b5f3970e2c2ba33b76cabd8d8a
Summary: We want to consolidate SMCTopology with the concrete base class Reviewed By: dstaay-fb Differential Revision: D32027047 fbshipit-source-id: 7c895d19826025bf157c5ebbf2832edf95665a1f
Summary: Original commit changeset: 6e97d95ffafd Reviewed By: wanchaol Differential Revision: D32023341 fbshipit-source-id: 2a9f7b637c0ff18700bcc3e44466fffcff861698
Summary: TorchRec OSS installation currently requires 3 steps: 1. fbgemm_gpu installation 2. Symlink fbgemm_gpu_py.so to the TorchRec directory 3. Run TorchRec's installation We can simplify this to a single step by bringing fbgemm_gpu into the TorchRec directory. Reviewed By: rkindi Differential Revision: D32055570 fbshipit-source-id: b3d2c1234469898a1cfe5c2e3cdb67e3c289d9db
This pull request was exported from Phabricator. Differential Revision: D32055570
Labels: CLA Signed, fb-exported
Summary:
TorchRec OSS installation currently requires 3 steps:
1. fbgemm_gpu installation
2. Symlink fbgemm_gpu_py.so to the TorchRec directory
3. Run TorchRec's installation
We can simplify this to a single step by bringing fbgemm_gpu into the TorchRec directory.
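For context, creating such a symlink programmatically might look like the sketch below (the paths and mechanism are hypothetical; the actual change in this diff may differ):
```python
import os

# Hypothetical locations: a local fbgemm_gpu checkout, linked into the
# torchrec package directory so a single install step picks it up.
fbgemm_gpu_src = os.path.abspath(os.path.join("..", "FBGEMM", "fbgemm_gpu"))
link_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "torchrec", "fbgemm_gpu")

if not os.path.islink(link_path) and not os.path.exists(link_path):
    os.symlink(fbgemm_gpu_src, link_path)
```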
Reviewed By: rkindi
Differential Revision: D32055570