Script for distilling zero-shot classifier to more efficient student #10244
Conversation
Fantastic that you're using the `Trainer` for that. Pinging Sylvain for review.
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Great new example! After reading through everything, I'm not sure it supports distributed training, so the PR should either be amended to support it or the README should clearly indicate that it does not (same for TPUs).
@LysandreJik cool thanks for the feedback. @sgugger Thanks, I added …

Yes I meant distributed multi-GPU. I did see it will use all GPUs available on the machine however :-)
if training_args.local_rank != -1:
    raise ValueError("Distributed training is not currently supported.")
if training_args.tpu_num_cores is not None:
    raise ValueError("TPU acceleration is not currently supported.")
Great!
This PR introduces a script that provides a way to improve the speed and memory performance of a zero-shot classifier by training a more efficient student model from the zero-shot teacher's predictions over an unlabeled dataset.

For a given sequence, the zero-shot classification pipeline requires each possible label to be fed through the large NLI model separately. This requirement slows results considerably, particularly for tasks with a large number of classes K. Given (1) an unlabeled corpus and (2) a set of candidate class names, this script allows a user to train a standard classification head with K output dimensions. The script generates a softmax distribution for the provided data & class names, and a student classifier is then fine-tuned on these proxy labels. The resulting student model can be used for classifying novel text instances over these K classes with an order-of-magnitude boost in inference speed, in addition to decreased memory usage.
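To make the cost concrete, here is the teacher side as it already exists, i.e. the standard zero-shot pipeline rather than anything added by this PR; the example sequence and label set below are made up for illustration:

```python
from transformers import pipeline

# The teacher: the existing zero-shot pipeline built on an NLI model.
# Scoring one sequence against K candidate labels costs K separate forward passes.
teacher = pipeline("zero-shot-classification", model="roberta-large-mnli")

print(teacher(
    "Who are you voting for in 2020?",                    # illustrative sequence
    candidate_labels=["politics", "sports", "business"],  # the K candidate class names
    hypothesis_template="This text is about {}.",
))
```

The distilled student replaces those K NLI passes with a single pass through a K-way classification head.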
A teacher NLI model can be distilled to a student model by running `distill_classifier.py` like so:
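(A rough sketch of the command; the `--data_file`, `--class_names_file`, and `--output_dir` flag names below are assumed names for the unlabeled data, class-name, and output arguments, so check the included README for the script's exact interface.)

```bash
# Flag names for the data, class names, and output directory are assumed here;
# see the README for the script's actual arguments.
python distill_classifier.py \
  --data_file unlabeled_data.txt \
  --class_names_file class_names.txt \
  --hypothesis_template "This text is about {}." \
  --output_dir ./distilled_model
```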
A number of other args are provided as well, such as `--teacher_name_or_path` and `--student_name_or_path` for specifying the pre-trained teacher & student models to be used (by default `roberta-large-mnli` and `distilbert-base-uncased`) and `--hypothesis_template` for customizing the hypothesis template used by the teacher zero-shot model. The training is implemented via `Trainer`, so any `TrainingArguments` can be specified as well.

The resulting model can then be used trivially in a text classification pipeline or in any other way:
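For example, a minimal sketch assuming the student was saved to `./distilled_model` as in the command above:

```python
from transformers import pipeline

# The distilled student is a standard sequence classification model,
# so it loads directly into a text classification pipeline.
classifier = pipeline("text-classification", model="./distilled_model")

# A single forward pass now scores all K classes at once.
print(classifier("Transformers are really easy to distill!"))
```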
See the included README.md for more details and examples.
Soon I'll introduce a similar script for self-training an NLI model, boosting the model's performance after training on only unlabeled data; the resulting model can then be distilled with this script like any other NLI model.
Update: I also just added a link to a working colab notebook demo.