
XLNet support and overhaul/cleanup of BERT support #845

Merged
merged 64 commits into master from pytorch-transformers on Aug 7, 2019

Conversation

Contributor

@sleepinyourhat commented Jul 17, 2019

There's a lot going on here, and I'm still debugging. Suggestions for tests to add are very welcome!

I'm adding a few semi-related changes that are meant to help with clarity/maintainability:

  • 'auto' is now the default value for args.tokenizer, and should behave correctly for all standard models.
  • pair_task is now a property of Task objects. [update: pair_task is gone.]
  • The addition of start/end/sep/cls tokens now happens slightly later in preprocessing, and it's up to each task object to request it. This allows tasks to more easily decide how they want to use [SEP] tokens. This is likely to introduce some subtle bugs, but it should also fix some subtle bugs, and it's basically necessary: XLNet places [CLS] at the end, after the final [SEP], so we'd have to rewrite all that code anyhow. (See the sketch just after this list.)
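To make the ordering difference concrete, here is a minimal sketch of model-specific boundary-token functions. The function names are hypothetical, not the PR's actual implementation; XLNet's special tokens are written <sep>/<cls> as in pytorch_transformers.

def bert_boundary_token_fn(tokens_a, tokens_b=None):
    # BERT-style ordering: [CLS] A [SEP] (B [SEP])
    if tokens_b is None:
        return ["[CLS]"] + tokens_a + ["[SEP]"]
    return ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

def xlnet_boundary_token_fn(tokens_a, tokens_b=None):
    # XLNet-style ordering: A <sep> (B <sep>) <cls> -- the classifier token comes last
    if tokens_b is None:
        return tokens_a + ["<sep>", "<cls>"]
    return tokens_a + ["<sep>"] + tokens_b + ["<sep>", "<cls>"]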

I caught a bug along the way:

  • Some tasks, including COPA, MultiRC, and ReCoRD, didn't have pair_task set properly, so we weren't using BERT segment embeddings (i.e., the tokens before and after [SEP] were all marked as segment A); see the sketch below.
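For illustration only (not code from this PR), here is what the segment / token-type IDs look like for a BERT-style pair input:

tokens      = ["[CLS]", "the", "cat", "[SEP]", "it", "meowed", "[SEP]"]
buggy_ids   = [0, 0, 0, 0, 0, 0, 0]   # the bug described above: everything marked as segment A
correct_ids = [0, 0, 0, 0, 1, 1, 1]   # intended: tokens after the first [SEP] belong to segment B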

Note to self:

  • Update the site-side documentation when done.

@pep8speaks

pep8speaks commented Jul 17, 2019

Hello @sleepinyourhat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 201:101: E501 line too long (108 > 100 characters)

Line 76:101: E501 line too long (104 > 100 characters)
Line 79:101: E501 line too long (114 > 100 characters)
Line 81:101: E501 line too long (124 > 100 characters)
Line 82:101: E501 line too long (114 > 100 characters)
Line 119:101: E501 line too long (110 > 100 characters)
Line 253:101: E501 line too long (112 > 100 characters)
Line 254:101: E501 line too long (201 > 100 characters)

Line 115:101: E501 line too long (152 > 100 characters)

You can repair most issues by installing black and running: black -l 100 ./*. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2019-08-07 21:30:58 UTC

@sleepinyourhat changed the title from "XLNet support and overhaul/cleanup of BERT support" to "[WIP] XLNet support and overhaul/cleanup of BERT support" on Jul 17, 2019
config/defaults.conf (outdated)
jiant/modules/sentence_encoder.py
@@ -106,7 +106,7 @@ def del_field_tokens(instance):
del field.tokens


-def _index_split(task, split, indexers, vocab, record_file):
+def _index_split(task, split, indexers, vocab, record_file, boundary_token_fn):
Contributor Author

This method now needs to be passed around all over the place, but it helps simplify some really hairy task-specific logic surrounding the placement of [cls] and [sep].
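Roughly, the threading works like the following self-contained sketch (simplified and hypothetical, not this PR's exact code): the generic indexing code just forwards the callable, and each task calls it from inside process_split.

from typing import Callable, Iterable, List, Optional, Tuple

BoundaryFn = Callable[[List[str], Optional[List[str]]], List[str]]

class ToyPairTask:
    def process_split(self, split: Iterable[Tuple[List[str], List[str]]],
                      boundary_token_fn: BoundaryFn):
        for sent_a, sent_b in split:
            # The task, not the generic preprocessing code, decides how the two
            # segments are joined and where the boundary tokens end up.
            yield boundary_token_fn(sent_a, sent_b)

def index_split(task, split, boundary_token_fn: BoundaryFn):
    # Generic code only forwards the callable; it knows nothing model-specific.
    return list(task.process_split(split, boundary_token_fn))

# BERT-style ordering, mirroring the sketch earlier in this PR description.
bert_style = lambda a, b=None: ["[CLS]"] + a + ["[SEP]"] + (b + ["[SEP]"] if b else [])
pairs = [(["a", "cat"], ["it", "sat"])]
print(index_split(ToyPairTask(), pairs, bert_style))
# [['[CLS]', 'a', 'cat', '[SEP]', 'it', 'sat', '[SEP]']]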

Member

It's not urgent, but I'm somewhat worried about this. How to handle boundary tokens and how to combine multiple input sentences are model-specific decisions. Right now, implementing them at the data-processing stage doesn't make much difference, but once people come up with all sorts of unusual setups, process_split will become more bloated, making it harder to add new models or tasks.

Contributor Author

Yeah, it's a bit messy. I think we could do better, but it's not obvious how—there are lots of tasks for which it's not precisely clear how to set up BERT. This PR isn't changing the messy-ish setup, it's just making it more visible.

Mind making an issue?

jiant/tasks/qa.py
jiant/utils/data_loaders.py
@@ -2123,7 +2143,7 @@ def __init__(self, path, max_seq_len, name="ccg", **kw):
self.val_data_text = None
self.test_data_text = None

-def process_split(self, split, indexers) -> Iterable[Type[Instance]]:
+def process_split(self, split, indexers, boundary_token_fn) -> Iterable[Type[Instance]]:
Contributor Author

CCG and other tagging tasks are likely to be especially tricky, because of the need to align the tags to the tokens.

  • There's subtle logic here and in the retokenization code and in this script that this may have broken. @pruksmhc @iftenney, would either of you be willing to take a pass, and maybe even push an update to this branch? 😬
  • I'm not sure I see what the differences are between wordpiece and sentencepiece tokenization. @iftenney - Do you know sentencepiece? Is there anything in that difference that we should worry about? The main thing that I can see is that spaces are explicitly included in the tokens. (A rough illustration of the difference is sketched below.)
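For reference, a rough illustration of the difference, using the pytorch_transformers tokenizer classes. This is not from the PR, and the exact token splits shown in the comments are illustrative; they depend on the pretrained vocabulary.

from pytorch_transformers import BertTokenizer, XLNetTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
xlnet_tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")

print(bert_tok.tokenize("unaffordable housing"))
# WordPiece marks word-internal continuations with '##'; spaces are implicit,
# e.g. something like ['una', '##ffo', '##rdable', 'housing']

print(xlnet_tok.tokenize("unaffordable housing"))
# SentencePiece keeps the space explicitly as a '▁' marker on word-initial pieces,
# e.g. something like ['▁un', 'afford', 'able', '▁housing']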

Contributor

Yep, I'll take a pass early tomorrow.

Collaborator

I don't know the CCG code at all.

For the edge probing retokenization code, the only tokenization-specific thing is some postprocessing to encourage closer alignment - I'll have to check what sentencepiece looks like there. (Alternatively, I think sentencepiece might give you byte offsets, so I could just use those directly.)

This shouldn't block anything; there's already an exception that will be thrown if the tokenizer isn't whitelisted:

raise ValueError(f"Unsupported tokenizer '{tokenizer_name}'")

@sleepinyourhat added the labels "0.x.0 release on fix" (Put out a new 0.x.0 release when this is fixed), "bug" (Something isn't working), and "cleanup" (This should be fairly easy) on Jul 17, 2019
jiant/models.py (outdated)
jiant/preprocess.py (outdated)
Contributor

@pruksmhc left a comment

Looks like the major problems/bugs were resolved, and the changes seem sane. Of course, test and make sure no major regressions in any tasks remain before merging.

@sleepinyourhat
Contributor Author

@pruksmhc Thanks! I'm testing everything I can, but do make sure to have a close look at CCG. I'm not fully set up to test that.

@pruksmhc
Contributor

CCG has tokenization as an offline preprocessing step (which should honestly be changed). I would say just note in the documentation that CCG is not yet set up for XLNet, and leave CCG for another PR.

@pruksmhc
Contributor

The offline preprocessing step should honestly be made part of load_data too.

@sleepinyourhat
Contributor Author

@pruksmhc - Mind making an issue?

@sleepinyourhat
Contributor Author

This should be ready to go after one last test. @iftenney Any last comments before I merge?

gcp/kubernetes/templates/run_batch.jsonnet
jiant/models.py (outdated)
jiant/pytorch_transformers_interface/__init__.py (outdated)
jiant/pytorch_transformers_interface/modules.py (outdated)
APIs from pytorch_transfromers.
"""

def __init__(self, args):
Collaborator

nit: this only seems to depend on args.exp_dir and args.pytorch_transformers_embedding_mode - for cleaner abstractions, can we make this explicit?

This should work:

def __init__(self, exp_dir, pytorch_transformers_embedding_mode, **unused_kw):
  # do stuff

and call as module(**args.to_dict())

(similarly with parameter_setup() below)
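A minimal self-contained illustration of the suggested pattern (placeholder names, not jiant's actual classes): the constructor names only the arguments it uses, and **unused_kw absorbs everything else in the experiment's args dict.

class ToyEmbedderModule:
    def __init__(self, exp_dir, pytorch_transformers_embedding_mode, **unused_kw):
        # Only the two named arguments matter; the rest of args is ignored.
        self.exp_dir = exp_dir
        self.embedding_mode = pytorch_transformers_embedding_mode

args = {
    "exp_dir": "/tmp/exp",
    "pytorch_transformers_embedding_mode": "top",
    "batch_size": 32,  # irrelevant here; swallowed by **unused_kw
}
module = ToyEmbedderModule(**args)  # analogous to module(**args.to_dict())
print(module.embedding_mode)        # -> top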

Contributor Author

The inheriting classes add a few more args, so this would either expand the code a fair bit or else add an inconsistency. Weak/lazy preference to leave as is for now.

jiant/utils/tokenizers.py
scripts/edgeprobing/exp_fns.sh (outdated)
@sleepinyourhat
Contributor Author

  • use_pytorch_transformers: Cleaned up—pushing shortly.
  • pytorch_pretrained_bert_cache: If this is only a preference, I'll leave it. Given the choice between following a simple but misleading/wrong naming convention and using a more complicated but more accurate naming convention, I tend toward the latter.

Replying to replyable comments above...

jiant/models.py (outdated)
scripts/edgeprobing/exp_fns.sh (outdated)
@sleepinyourhat merged commit a1e9abf into master on Aug 7, 2019
@sleepinyourhat deleted the pytorch-transformers branch on August 7, 2019 21:39
phu-pmh pushed a commit that referenced this pull request Apr 17, 2020
* Rename namespaces to suppress warnings.

* Revert "Rename namespaces to suppress warnings."

This reverts commit 0cf7b23.

* Initial working-ish attempt.

* Intermediate check-in...

* More partial progress.

* Another pass...

* Fix sep/cls handling, cleanup.

* Further cleanup.

* Keyword name fix.

* Another flag fix.

* Pull debug print.

* Line length cleanup.

* WiC fix.

* Two task setup bugs.

* BoolQ typo

* Improved segment handling.

* Delete unused is_pair_task, other cleanup/fixes.

* Fix deleted path from merge.

* Fix cache path.

* Address (spurious?) tokenization warning.

* Select pool_type automatically to match model.

h/t Haokun Liu

* Config updates.

* Path fix

* Fix XLNet UNK handling.

* Internal temporary MNLI alternate.

* Revert "Internal temporary MNLI alternate."

This reverts commit 455792a.

* Add helper fn tests

* Finish merge

* Remove unused argument.

* Possible ReCoRD bug fix

* Cleanup

* Fix merge issues.

* Revert "Remove unused argument."

This reverts commit 96a7c37.

* Assorted responses to Alex's comments.

* Further ReCoRD fix.

* @iftenney's comments.

* Fix/simplify segment logic.

* @W4ngatang's comments

* Cleanup.

* Cleanup

* Fix issues with alternative embeddings_mode settings, max_layer.

* More mix cleanup.

* Masking fix.

* Address (most of) @iftenney's comments

* Tidying.

* Misc cleanup.

* Comment.
@jeswan added the "jiant-v1-legacy" label (Relevant to versions <= v1.3.2) on Sep 17, 2020