
Move tokenization logic to central JiantModelTransformers method #1290

Merged
jeswan merged 4 commits into js/feature/easy_add_model from js/feature/move_tokenization_2 on Mar 19, 2021

Conversation

@jeswan jeswan (Collaborator) commented Mar 11, 2021

This PR refactors normalize_tokenizations() by moving the model-specific functions into JiantTransformersModel. Although the location of this logic is not ideal, it is the easiest way to keep a single central place to edit when adding new models/tokenizers. This PR also adds resolve_model_arch_tokenizer(tokenizer) to map tokenizers to model_archs.
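For illustration, here is a minimal sketch of what a tokenizer-to-architecture resolver along these lines might look like. The stand-in enum, the set of tokenizer classes covered, and the KeyError behavior are assumptions for the sketch, not the merged implementation:

```python
# Hypothetical sketch of resolve_model_arch_tokenizer(); the enum below is a
# stand-in for jiant.shared.model_resolution.ModelArchitectures.
from enum import Enum

from transformers import BertTokenizer, RobertaTokenizer


class ModelArchitectures(Enum):  # stand-in for jiant's enum
    BERT = "bert"
    ROBERTA = "roberta"


TOKENIZER_CLASS_TO_MODEL_ARCH = {
    BertTokenizer: ModelArchitectures.BERT,
    RobertaTokenizer: ModelArchitectures.ROBERTA,
}


def resolve_model_arch_tokenizer(tokenizer):
    """Map a tokenizer instance back to its model architecture."""
    for tokenizer_cls, model_arch in TOKENIZER_CLASS_TO_MODEL_ARCH.items():
        if isinstance(tokenizer, tokenizer_cls):
            return model_arch
    raise KeyError(f"No model_arch registered for {type(tokenizer).__name__}")
```

With a mapping like this, code that only has a tokenizer in hand can recover the matching model wrapper without tokenizer-specific if-else chains.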

@jeswan jeswan changed the base branch from master to js/feature/easy_add_model March 11, 2021 01:24
@jeswan jeswan changed the title from "Js/feature/move tokenization 2" to "Move tokenization logic to central JiantModelTransformers method" Mar 11, 2021
@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch 2 times, most recently from de4c5e9 to 291166d on March 11, 2021 15:00
@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch from 291166d to 21e7736 on March 11, 2021 15:07
codecov bot commented Mar 11, 2021

Codecov Report

Merging #1290 (ad49c64) into js/feature/easy_add_model (f027a59) will increase coverage by 0.06%.
The diff coverage is 68.57%.


@@                      Coverage Diff                      @@
##           js/feature/easy_add_model    #1290      +/-   ##
=============================================================
+ Coverage                      49.75%   49.82%   +0.06%     
=============================================================
  Files                            162      163       +1     
  Lines                          11112    11166      +54     
=============================================================
+ Hits                            5529     5563      +34     
- Misses                          5583     5603      +20     
| Impacted Files | Coverage Δ |
|---|---|
| jiant/utils/python/datastructures.py | 72.41% <50.00%> (-2.79%) ⬇️ |
| jiant/proj/main/modeling/primary.py | 53.43% <52.00%> (-1.64%) ⬇️ |
| jiant/utils/tokenization_utils.py | 95.83% <95.83%> (ø) |
| jiant/shared/model_resolution.py | 77.77% <100.00%> (+2.77%) ⬆️ |
| jiant/utils/tokenization_normalization.py | 94.73% <100.00%> (+20.19%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f027a59...ad49c64.

@jeswan jeswan marked this pull request as ready for review March 11, 2021 15:11
@zphang zphang (Collaborator) left a comment


  1. I noticed that ELECTRA doesn't have a normalize_tokenizations implementation. Can we add an abstract method to JiantTransformersModel and raise NotImplementedError for ELECTRA? (It should be possible to add the implementation for ELECTRA, but I'm okay with leaving it as a future to-do; see the sketch after this list.)
  2. I was wondering if it might be a better idea to pull the normalize-tokenization logic out into individual functions (in some cases, similar models share the same logic), but I can also see how that might be excessive delegation. I don't feel strongly either way; what do you think?
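A minimal sketch of the pattern suggested in point 1. JiantTransformersModel is the wrapper named in this PR, but the method signature and the ElectraModel class name here are assumptions, not the merged code:

```python
import abc


class JiantTransformersModel(abc.ABC):
    @abc.abstractmethod
    def normalize_tokenizations(self, tokenizer, space_tokenization, target_tokenization):
        """Align a space tokenization with the tokenizer's own tokenization."""
        raise NotImplementedError


class ElectraModel(JiantTransformersModel):  # hypothetical subclass name
    def normalize_tokenizations(self, tokenizer, space_tokenization, target_tokenization):
        # Left unimplemented for now, per the review discussion.
        raise NotImplementedError("Tokenization normalization is not implemented for ELECTRA")
```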

Review thread on jiant/utils/tokenization_utils.py (outdated, resolved)
@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch 2 times, most recently from 68cc8b5 to fc64dcd on March 19, 2021 01:31
@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch from fc64dcd to 33faa25 on March 19, 2021 01:33
@jeswan jeswan (Collaborator, Author) commented Mar 19, 2021

  • I noticed that ELECTRA doesn't have a normalize_tokenizations implementation. Can we add an abstract method to JiantTransformersModel and raise NotImplementedError for ELECTRA? (It should be possible to add the implementation for ELECTRA, but I'm okay with leaving it as a future to-do.)
  • I was wondering if it might be a better idea to pull the normalize-tokenization logic out into individual functions (in some cases, similar models share the same logic), but I can also see how that might be excessive delegation. I don't feel strongly either way; what do you think?

Comments implemented! For point 2 above, we discussed offline and agreed to refactor if additional duplicate code shows up in the future.
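For context, a small illustration (not jiant code) of the mismatch the centralized normalization logic resolves; the exact WordPiece pieces shown in the comments are approximate:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

space_tokenization = "Mr. Immelt chose to focus".split()
# ['Mr.', 'Immelt', 'chose', 'to', 'focus']

target_tokenization = tokenizer.tokenize("Mr. Immelt chose to focus")
# approximately ['mr', '.', 'im', '##mel', '##t', 'chose', 'to', 'focus']

# normalize_tokenizations() aligns these two views so that token-level
# annotations made on the space tokenization can be projected onto the
# model's sub-word tokens.
```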

@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch from d1b6367 to ad49c64 on March 19, 2021 14:38
@jeswan jeswan merged commit 9a7aa78 into js/feature/easy_add_model Mar 19, 2021
@jeswan jeswan deleted the js/feature/move_tokenization_2 branch March 19, 2021 14:43
jeswan added a commit that referenced this pull request May 4, 2021
* Update to Transformers v4.3.3 (#1266)

* use default return_dict in taskmodels and remove hidden state context manager in models.

* return hidden states in output of model wrapper

* Switch to task model/head factories instead of embedded if-else statements (#1268)

* Use jiant transformers model wrapper instead of if-else. Use taskmodel and head factory instead of if-else.

* switch to ModelArchitectures enum instead of strings

* Refactor get_output_from_encoder() to be member of JiantTaskModel (#1283)

* refactor getting output from encoder to be member function of jiant model

* switch to explicit encode() in jiant transformers model

* fix simple runscript test

* update to tokenizers 0.10.1

* Add tests for flat_strip() (#1289)

* add flat_strip test

* add list to test cases flat_strip

* mlm_weights(), feat_spec(), flat_strip() if-else refactors (#1288)

* move remaining if-else statements to jiant model or replace with model-agnostic method

* switch from jiant_transformers_model to encoder

* fix bug in flat_strip()

* Move tokenization logic to central JiantModelTransformers method (#1290)

* move model specific tokenization logic to JiantTransformerModels

* implement abstract methods for JiantTransformerModels

* fix tasks circular import (#1296)

* Add DeBERTa (#1295)

* Add DeBERTa with sanity test

* fix tasks circular import

* [WIP] add deberta tests

* Revert "fix tasks circular import"

This reverts commit f924640.

* deberta tests passing with transformers 6472d8

* switch to deberta-v2

* fix get_mlm_weights_dict() for deberta-v2

* update to transformers 4.5.0

* mark deberta test_export as slow

* Update test_tokenization_normalization.py

* add guide to add a model

* fix test_export_model tests

* minor pytest fixes (add num_labels for rte, overnight flag fix)

* bugfix for simple api notebook

* bugfix for #1310

* bugfix for #1306: simple api notebook path name

* squad running

* 2nd bugfix for #1310: not all tasks have num_labels property

* simple api notebook back to roberta-base

* run test matrix for more steps to compare to master

* save last/best model test fix

Co-authored-by: Jesse Swanson <js11133@nyu.edu>