This repository was archived by the owner on Jun 3, 2025. It is now read-only.

Conversation

@bfineran (Contributor)

Overridden methods are not accounted for in the replica phase of torch DataParallel's forward pass. This causes an issue in sparseml's QAT of embeddings: embedding QAT buffers may not be on the correct device for an input in DP mode.

This PR introduces a temporary fix that sends QAT values to the correct device during the forward pass. Future work should find a way to reconcile method overriding with torch DP/DDP, as this pattern is used (and planned) elsewhere in sparseml as well.
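The idea of the temporary fix can be sketched as follows. This is a minimal illustration, not SparseML's actual implementation: the class and buffer names (`QATEmbedding`, `scale`, `zero_point`) are hypothetical stand-ins for fake-quantization observer state. DataParallel replicates a module's buffers only through its standard replication path, so buffers used by an overridden forward can end up on the wrong GPU; moving them to the input's device at the top of `forward` works around this.

```python
import torch
import torch.nn as nn


class QATEmbedding(nn.Module):
    """Hypothetical embedding with QAT-style buffers, moved to the
    input's device each forward pass as a DataParallel workaround."""

    def __init__(self, num_embeddings: int, embedding_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        # stand-ins for fake-quantization observer state
        self.register_buffer("scale", torch.ones(1))
        self.register_buffer("zero_point", torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # temporary fix: ensure QAT buffers live on the input's device,
        # since DP's replica phase does not handle them here
        if self.scale.device != x.device:
            self.scale = self.scale.to(x.device)
            self.zero_point = self.zero_point.to(x.device)
        out = self.embedding(x)
        # fake-quantize using the (now co-located) buffer values
        return torch.fake_quantize_per_tensor_affine(
            out, self.scale.item(), int(self.zero_point.item()), 0, 255
        )
```

On a single device the guard is a no-op; under `nn.DataParallel`, each replica's forward re-homes the buffers before they are read.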

Test plan:
Tested the failing command in a multi-GPU environment against the fix:

sparseml.transformers.question_answering \
  --output_dir qa-models/sparse_quantized \
  --model_name_or_path zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni \
  --recipe zoo:nlp/masked_language_modeling/bert-base/pytorch/huggingface/wikipedia_bookcorpus/12layer_pruned80_quant-none-vnni?recipe_type=transfer-question_answering \
  --distill_teacher zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/base-none \
  --dataset_name squad \
  --per_device_train_batch_size 12 --per_device_eval_batch_size 24 \
  --preprocessing_num_workers 6 \
  --do_train --do_eval --evaluation_strategy epoch \
  --fp16 --seed 21636 \
  --per_device_train_batch_size 16 --per_device_eval_batch_size 24 \
  --preprocessing_num_workers 6 \
  --save_strategy no --save_total_limit 1

@bfineran bfineran requested review from a team, corey-nm and rahul-tuli August 22, 2022 21:06
@bfineran bfineran self-assigned this Aug 22, 2022
@bfineran bfineran requested review from eldarkurtic and removed request for a team August 22, 2022 21:06
@github-actions

@rahul-tuli @corey-nm @eldarkurtic assigned for review

@bfineran bfineran merged commit 2c03904 into main Aug 22, 2022
@bfineran bfineran deleted the embeddings-qat-data_parallel-patch branch August 22, 2022 21:20
