❓ Questions & Help
Details
Hello,
First of all, thank you very much for open-sourcing this research - I expect it will have a large impact in bringing Transformers to production!
I have a question about the results in Table 3 of your paper.
Is the distilled model with (4L, 312) obtained from task-agnostic or task-specific distillation? In Section 2 you state:

> Since we are experimenting with various NLU tasks, the capacity of the optimal student model that preserves accuracy may vary with varying level of task's difficulty. Therefore, we experiment with distilling various sized student models; then, we pick the smaller model among the distilled models that can offer higher accuracy than the original BERT model for each task.

and I could not tell from the codebase which approach you used to generate the numbers in Table 3.
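For concreteness, here is how I currently read that per-task selection rule; a minimal Python sketch with hypothetical names and illustrative numbers, not your actual code:

```python
# My reading of the Section 2 selection rule (hypothetical names,
# illustrative accuracies only -- not from the paper or codebase).

# Accuracy of the original BERT model on each task.
bert_accuracy = {"TASK_A": 84.0, "TASK_B": 92.0}

# Accuracy of each distilled student, keyed by (num_layers, hidden_size).
student_accuracy = {
    "TASK_A": {(4, 312): 83.5, (6, 768): 84.5},
    "TASK_B": {(4, 312): 92.3, (6, 768): 92.8},
}

def pick_student(task):
    """Among distilled students that beat the original BERT on this task,
    pick the smallest one (fewest layers, then smallest hidden size)."""
    candidates = [
        size for size, acc in student_accuracy[task].items()
        if acc > bert_accuracy[task]
    ]
    return min(candidates) if candidates else None

for task in bert_accuracy:
    print(task, pick_student(task))
```

If that reading is right, I would still like to know whether the students entering this comparison came from task-agnostic or task-specific distillation.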
Thank you!