
Task-agnostic or task-specific distillation used for CPU inference results? #10

Closed

lewtun opened this issue Dec 12, 2020 · 2 comments


lewtun commented Dec 12, 2020

❓ Questions & Help

Hello,

First of all, thank you very much for open-sourcing this research - I expect it will have a large impact on helping bring Transformers to production!

I have a question about the results in Table 3 of your paper.

[Screenshot: Table 3 from the paper]

Was the distilled model with (4L, 312) obtained via task-agnostic or task-specific distillation? In Section 2 you state:

Since we are experimenting with various NLU tasks, the capacity of the optimal student model that preserves accuracy may vary with varying level of task’s difficulty. Therefore, we experiment with distilling various sized student models; then, we pick the smaller model among the distilled models that can offer higher accuracy than the original BERT model for each task.

and I could not tell from the codebase which approach you used to generate the numbers in Table 3.

Thank you!


ykim362 commented Dec 13, 2020

Thanks for your interest, @lewtun !
For this one, we used task-specific distillation.


lewtun commented Dec 15, 2020

Thank you for the fast reply @ykim362 ! Looking forward to seeing these features in HF Transformers :)

lewtun closed this as completed Dec 15, 2020