
Move tokenization logic to central JiantModelTransformers method #1290

Merged
jeswan merged 4 commits into js/feature/easy_add_model from js/feature/move_tokenization_2 on Mar 19, 2021

Conversation

@jeswan jeswan (Collaborator) commented Mar 11, 2021

This PR refactors normalize_tokenizations() by moving the model-specific functions into JiantTransformersModel. Although the location of this logic is not ideal, it is the easiest way to keep a single central place to edit when adding new models/tokenizers. This PR also adds resolve_model_arch_tokenizer(tokenizer) to map tokenizers to model_archs.
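For illustration, here is a minimal sketch of what a tokenizer-to-architecture resolver along these lines might look like. The stand-in enum, the set of tokenizer classes covered, and the KeyError behavior are assumptions for the sketch, not the merged implementation:

```python
# Hypothetical sketch of resolve_model_arch_tokenizer(); the enum below is a
# stand-in for jiant.shared.model_resolution.ModelArchitectures.
from enum import Enum

from transformers import BertTokenizer, RobertaTokenizer


class ModelArchitectures(Enum):  # stand-in for jiant's enum
    BERT = "bert"
    ROBERTA = "roberta"


TOKENIZER_CLASS_TO_MODEL_ARCH = {
    BertTokenizer: ModelArchitectures.BERT,
    RobertaTokenizer: ModelArchitectures.ROBERTA,
}


def resolve_model_arch_tokenizer(tokenizer):
    """Map a tokenizer instance back to its model architecture."""
    for tokenizer_cls, model_arch in TOKENIZER_CLASS_TO_MODEL_ARCH.items():
        if isinstance(tokenizer, tokenizer_cls):
            return model_arch
    raise KeyError(f"No model_arch registered for {type(tokenizer).__name__}")
```

With a mapping like this, code that only has a tokenizer in hand can recover the matching model wrapper without tokenizer-specific if-else chains.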

@jeswan jeswan changed the base branch from master to js/feature/easy_add_model March 11, 2021 01:24
@jeswan jeswan changed the title from "Js/feature/move tokenization 2" to "Move tokenization logic to central JiantModelTransformers method" Mar 11, 2021
@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch 2 times, most recently from de4c5e9 to 291166d on March 11, 2021 15:00
@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch from 291166d to 21e7736 on March 11, 2021 15:07
codecov bot commented Mar 11, 2021

Codecov Report

Merging #1290 (ad49c64) into js/feature/easy_add_model (f027a59) will increase coverage by 0.06%.
The diff coverage is 68.57%.


@@                      Coverage Diff                      @@
##           js/feature/easy_add_model    #1290      +/-   ##
=============================================================
+ Coverage                      49.75%   49.82%   +0.06%     
=============================================================
  Files                            162      163       +1     
  Lines                          11112    11166      +54     
=============================================================
+ Hits                            5529     5563      +34     
- Misses                          5583     5603      +20     
| Impacted Files | Coverage Δ |
|---|---|
| jiant/utils/python/datastructures.py | 72.41% <50.00%> (-2.79%) ⬇️ |
| jiant/proj/main/modeling/primary.py | 53.43% <52.00%> (-1.64%) ⬇️ |
| jiant/utils/tokenization_utils.py | 95.83% <95.83%> (ø) |
| jiant/shared/model_resolution.py | 77.77% <100.00%> (+2.77%) ⬆️ |
| jiant/utils/tokenization_normalization.py | 94.73% <100.00%> (+20.19%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f027a59...ad49c64.

@jeswan jeswan marked this pull request as ready for review March 11, 2021 15:11
@zphang zphang (Collaborator) left a comment


  1. I noticed that ELECTRA doesn't have a normalize_tokenizations implementation. Can we add an abstract method to JiantTransformersModel and raise NotImplementedError for ELECTRA? (It should be possible to add the implementation for ELECTRA, but I'm okay with leaving it as a future to-do; see the sketch after this list.)
  2. I was wondering if it might be a better idea to pull the normalize-tokenization logic out into individual functions (in some cases, similar models share the same logic), but I can also see how that might be excessive delegation. I don't feel strongly either way; what do you think?
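A minimal sketch of the pattern suggested in point 1. JiantTransformersModel is the wrapper named in this PR, but the method signature and the ElectraModel class name here are assumptions, not the merged code:

```python
import abc


class JiantTransformersModel(abc.ABC):
    @abc.abstractmethod
    def normalize_tokenizations(self, tokenizer, space_tokenization, target_tokenization):
        """Align a space tokenization with the tokenizer's own tokenization."""
        raise NotImplementedError


class ElectraModel(JiantTransformersModel):  # hypothetical subclass name
    def normalize_tokenizations(self, tokenizer, space_tokenization, target_tokenization):
        # Left unimplemented for now, per the review discussion.
        raise NotImplementedError("Tokenization normalization is not implemented for ELECTRA")
```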

Review thread on jiant/utils/tokenization_utils.py (outdated, resolved)
@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch 2 times, most recently from 68cc8b5 to fc64dcd on March 19, 2021 01:31
@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch from fc64dcd to 33faa25 on March 19, 2021 01:33
@jeswan jeswan (Collaborator, Author) commented Mar 19, 2021

  • I noticed that ELECTRA doesn't have a normalize_tokenizations implementation. Can we add an abstract method to JiantTransformersModel and raise NotImplementedError for ELECTRA? (It should be possible to add the implementation for ELECTRA, but I'm okay with leaving it as a future to-do.)
  • I was wondering if it might be a better idea to pull the normalize-tokenization logic out into individual functions (in some cases, similar models share the same logic), but I can also see how that might be excessive delegation. I don't feel strongly either way; what do you think?

Comments implemented! For point 2 above, we discussed offline and agreed to refactor if additional duplicate code shows up in the future.
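For context, a small illustration (not jiant code) of the mismatch the centralized normalization logic resolves; the exact WordPiece pieces shown in the comments are approximate:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

space_tokenization = "Mr. Immelt chose to focus".split()
# ['Mr.', 'Immelt', 'chose', 'to', 'focus']

target_tokenization = tokenizer.tokenize("Mr. Immelt chose to focus")
# approximately ['mr', '.', 'im', '##mel', '##t', 'chose', 'to', 'focus']

# normalize_tokenizations() aligns these two views so that token-level
# annotations made on the space tokenization can be projected onto the
# model's sub-word tokens.
```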

@jeswan jeswan force-pushed the js/feature/move_tokenization_2 branch from d1b6367 to ad49c64 on March 19, 2021 14:38
@jeswan jeswan merged commit 9a7aa78 into js/feature/easy_add_model Mar 19, 2021
@jeswan jeswan deleted the js/feature/move_tokenization_2 branch March 19, 2021 14:43
jeswan added a commit that referenced this pull request May 4, 2021
* Update to Transformers v4.3.3 (#1266)

* use default return_dict in taskmodels and remove hidden state context manager in models.

* return hidden states in output of model wrapper

* Switch to task model/head factories instead of embedded if-else statements (#1268)

* Use jiant transformers model wrapper instead of if-else. Use taskmodel and head factory instead of if-else.

* switch to ModelArchitectures enum instead of strings

* Refactor get_output_from_encoder() to be member of JiantTaskModel (#1283)

* refactor getting output from encoder to be member function of jiant model

* switch to explicit encode() in jiant transformers model

* fix simple runscript test

* update to tokenizers 0.10.1

* Add tests for flat_strip() (#1289)

* add flat_strip test

* add list to test cases flat_strip

* mlm_weights(), feat_spec(), flat_strip() if-else refactors (#1288)

* move remaining if-else statements to jiant model or replace with model-agnostic method

* switch from jiant_transformers_model to encoder

* fix bug in flat_strip()

* Move tokenization logic to central JiantModelTransformers method (#1290)

* move model specific tokenization logic to JiantTransformerModels

* implement abstract methods for JiantTransformerModels

* fix tasks circular import (#1296)

* Add DeBERTa (#1295)

* Add DeBERTa with sanity test

* fix tasks circular import

* [WIP] add deberta tests

* Revert "fix tasks circular import"

This reverts commit f924640.

* deberta tests passing with transformers 6472d8

* switch to deberta-v2

* fix get_mlm_weights_dict() for deberta-v2

* update to transformers 4.5.0

* mark deberta test_export as slow

* Update test_tokenization_normalization.py

* add guide to add a model

* fix test_export_model tests

* minor pytest fixes (add num_labels for rte, overnight flag fix)

* bugfix for simple api notebook

* bugfix for #1310

* bugfix for #1306: simple api notebook path name

* squad running

* 2nd bugfix for #1310: not all tasks have num_labels property

* simple api notebook back to roberta-base

* run test matrix for more steps to compare to master

* save last/best model test fix

Co-authored-by: Jesse Swanson <js11133@nyu.edu>