
Task-agnostic or task-specific distillation used for CPU inference results? #10

Closed

lewtun opened this issue Dec 12, 2020 · 2 comments


lewtun commented Dec 12, 2020

❓ Questions & Help

Hello,

First of all, thank you very much for open-sourcing this research - I expect it will have a large impact on helping bring Transformers to production!

I have a question about the results in Table 3 of your paper.

[Screenshot: Table 3 from the paper]

Was the distilled model with (4L, 312) obtained via task-agnostic or task-specific distillation? In Section 2 you state:

Since we are experimenting with various NLU tasks, the capacity of the optimal student model that preserves accuracy may vary with varying level of task’s difficulty. Therefore, we experiment with distilling various sized student models; then, we pick the smaller model among the distilled models that can offer higher accuracy than the original BERT model for each task.

and I could not tell from the codebase which approach you used to generate the numbers in Table 3.

Thank you!


ykim362 commented Dec 13, 2020

Thanks for your interest, @lewtun !
For this one, we used task-specific distillation.


lewtun commented Dec 15, 2020

Thank you for the fast reply @ykim362 ! Looking forward to seeing these features in HF Transformers :)

lewtun closed this as completed Dec 15, 2020