
Conversation

duduyi2013
Contributor

Summary:

Change logs

  1. add the ZeroCollisionKeyValueEmbedding embedding lookup
  2. address missing unit test coverage for SSD offloading
  3. add new unit tests for the KV ZCH embedding module
  4. add a temporary hack for calculating bucket metadata
  5. embedding.py updates, detailed below

#######################################################################
########################## embedding.py updates ########################
#######################################################################

  1. keep the original approach of initializing the ShardedTensor (ST) during training init
  2. for KV ZCH tables, the ShardedTensor is initialized with the virtual size for metadata calculation, and the actual tensor-size check is skipped during ST init; this is needed because the table has 0 rows at training init
  3. the new weight_id tensor is not registered in the EC (EmbeddingCollection) because its shape changes at runtime; instead, it is generated in the post_state_dict hook
  4. the new bucket tensor could be registered and preserved, but in this diff it is handled the same way as weight_id
  5. in the post_state_dict hook, we call get_named_split_embedding_weights_snapshot to get, for each table, a Tuple[table_name, weight (ST), weight_id (ST), bucket (ST)]; all three tensors are returned as ShardedTensors, and the destination is updated with them directly (see the sketch after this list)
  6. in the pre_load_state_dict_hook, which runs on load_state_dict(), we skip updating all three tensors, because the tensor assignment is done on the nn.Module side, which does not support updating a KVT through a PMT. This is fine for now because checkpoint loading happens outside of the load_state_dict call, but we need a future plan to make it work cohesively with other tensor types
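For illustration, here is a minimal sketch of the hook flow described in items 5 and 6. It is not the actual torchrec implementation: the class name, the plain-tensor snapshot, and the key layout are assumptions made for the example; the real module yields ShardedTensors built from the backend snapshot.

```python
# Minimal sketch of the post_state_dict / pre_load_state_dict hook flow.
# NOT the torchrec implementation: class name, plain-tensor snapshot, and
# key layout are illustrative assumptions only.
from typing import Iterator, Tuple

import torch
import torch.nn as nn


class KvZchStateDictHooksSketch(nn.Module):
    """Hypothetical module mirroring the state_dict flow described above."""

    def __init__(self, table_names):
        super().__init__()
        self._table_names = list(table_names)
        # weight_id / bucket are intentionally NOT registered as buffers,
        # since their shapes change at runtime.
        self._register_state_dict_hook(self._post_state_dict_hook)
        self._register_load_state_dict_pre_hook(self._pre_load_state_dict_hook)

    def get_named_split_embedding_weights_snapshot(
        self,
    ) -> Iterator[Tuple[str, torch.Tensor, torch.Tensor, torch.Tensor]]:
        # The real module yields ShardedTensors built from a backend snapshot;
        # empty plain tensors keep this sketch self-contained and runnable.
        for name in self._table_names:
            weight = torch.zeros(0, 8)                       # 0 rows at init
            weight_id = torch.zeros(0, dtype=torch.int64)
            bucket = torch.zeros(0, dtype=torch.int64)
            yield name, weight, weight_id, bucket

    @staticmethod
    def _post_state_dict_hook(module, destination, prefix, _local_metadata):
        # Overwrite the destination with the snapshot (weight, weight_id, bucket).
        for name, weight, weight_id, bucket in (
            module.get_named_split_embedding_weights_snapshot()
        ):
            destination[f"{prefix}{name}.weight"] = weight
            destination[f"{prefix}{name}.weight_id"] = weight_id
            destination[f"{prefix}{name}.bucket"] = bucket
        return destination

    def _pre_load_state_dict_hook(self, state_dict, prefix, *args):
        # Skip these keys on load_state_dict(): the default nn.Module assignment
        # path cannot update the KV-backed tensors, and checkpoint restore
        # currently happens outside of load_state_dict.
        for name in self._table_names:
            for suffix in ("weight", "weight_id", "bucket"):
                state_dict.pop(f"{prefix}{name}.{suffix}", None)
```

With this wiring, state_dict() always reflects the current snapshot, while load_state_dict() leaves the three tensors untouched, matching the behavior described in item 6.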

Differential Revision: D73567631

@facebook-github-bot added the CLA Signed label Apr 28, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D73567631

duduyi2013 added a commit to duduyi2013/torchrec that referenced this pull request May 6, 2025

X-link: pytorch/FBGEMM#4035

Pull Request resolved: meta-pytorch#2922

X-link: facebookresearch/FBGEMM#1120

Reviewed By: kausv, emlin

Differential Revision: D73567631

duduyi2013 added a commit to duduyi2013/torchrec that referenced this pull request May 6, 2025

duduyi2013 added a commit to duduyi2013/torchrec that referenced this pull request May 7, 2025

duduyi2013 added a commit to duduyi2013/torchrec that referenced this pull request May 7, 2025

duduyi2013 added a commit to duduyi2013/torchrec that referenced this pull request May 7, 2025

duduyi2013 added a commit to duduyi2013/torchrec that referenced this pull request May 7, 2025


Labels: CLA Signed, fb-exported