
XLNet support and overhaul/cleanup of BERT support #845

Merged
merged 64 commits into master from pytorch-transformers on Aug 7, 2019

Conversation

Contributor

@sleepinyourhat commented Jul 17, 2019

There's a lot going on here, and I'm still debugging. Suggestions for tests to add are very welcome!

I'm adding a few semi-related changes that are meant to help with clarity/maintainability:

  • 'auto' is now the default value for args.tokenizer, and should behave correctly for all standard models.
  • pair_task is now a property of Task objects. [update: pair_task is gone.]
  • The addition of start/end/sep/cls tokens now happens slightly later in preprocessing, and it's up to each task object to request it. This allows tasks to more easily decide how they want to use [SEP] tokens. This is likely to introduce some subtle bugs, but it should also fix some subtle bugs, and it's basically necessary: XLNet places [CLS] at the end, after the final [SEP], so we'd have to rewrite all that code anyhow. (See the sketch just after this list.)
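To make the ordering difference concrete, here is a minimal sketch of model-specific boundary-token functions. The function names are hypothetical, not the PR's actual implementation; XLNet's special tokens are written <sep>/<cls> as in pytorch_transformers.

def bert_boundary_token_fn(tokens_a, tokens_b=None):
    # BERT-style ordering: [CLS] A [SEP] (B [SEP])
    if tokens_b is None:
        return ["[CLS]"] + tokens_a + ["[SEP]"]
    return ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

def xlnet_boundary_token_fn(tokens_a, tokens_b=None):
    # XLNet-style ordering: A <sep> (B <sep>) <cls> -- the classifier token comes last
    if tokens_b is None:
        return tokens_a + ["<sep>", "<cls>"]
    return tokens_a + ["<sep>"] + tokens_b + ["<sep>", "<cls>"]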

I caught a bug along the way:

  • Some tasks, including COPA, MultiRC, and ReCoRD, didn't have pair_task set properly, so we weren't using BERT segment embeddings (i.e., the tokens before and after [SEP] were all marked as segment A); see the sketch below.
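For illustration only (not code from this PR), here is what the segment / token-type IDs look like for a BERT-style pair input:

tokens      = ["[CLS]", "the", "cat", "[SEP]", "it", "meowed", "[SEP]"]
buggy_ids   = [0, 0, 0, 0, 0, 0, 0]   # the bug described above: everything marked as segment A
correct_ids = [0, 0, 0, 0, 1, 1, 1]   # intended: tokens after the first [SEP] belong to segment B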

Note to self:

  • Update the site-side documentation when done.

@pep8speaks

pep8speaks commented Jul 17, 2019

Hello @sleepinyourhat! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 201:101: E501 line too long (108 > 100 characters)

Line 76:101: E501 line too long (104 > 100 characters)
Line 79:101: E501 line too long (114 > 100 characters)
Line 81:101: E501 line too long (124 > 100 characters)
Line 82:101: E501 line too long (114 > 100 characters)
Line 119:101: E501 line too long (110 > 100 characters)
Line 253:101: E501 line too long (112 > 100 characters)
Line 254:101: E501 line too long (201 > 100 characters)

Line 115:101: E501 line too long (152 > 100 characters)

You can repair most issues by installing black and running: black -l 100 ./*. If you contribute often, have a look at the 'Contributing' section of the README for instructions on doing this automatically.

Comment last updated at 2019-08-07 21:30:58 UTC

@sleepinyourhat changed the title from "XLNet support and overhaul/cleanup of BERT support" to "[WIP] XLNet support and overhaul/cleanup of BERT support" on Jul 17, 2019
config/defaults.conf (outdated)
jiant/modules/sentence_encoder.py
@@ -106,7 +106,7 @@ def del_field_tokens(instance):
del field.tokens


-def _index_split(task, split, indexers, vocab, record_file):
+def _index_split(task, split, indexers, vocab, record_file, boundary_token_fn):
Contributor Author

This method now needs to be passed around all over the place, but it helps simplify some really hairy task-specific logic surrounding the placement of [cls] and [sep].
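Roughly, the threading works like the following self-contained sketch (simplified and hypothetical, not this PR's exact code): the generic indexing code just forwards the callable, and each task calls it from inside process_split.

from typing import Callable, Iterable, List, Optional, Tuple

BoundaryFn = Callable[[List[str], Optional[List[str]]], List[str]]

class ToyPairTask:
    def process_split(self, split: Iterable[Tuple[List[str], List[str]]],
                      boundary_token_fn: BoundaryFn):
        for sent_a, sent_b in split:
            # The task, not the generic preprocessing code, decides how the two
            # segments are joined and where the boundary tokens end up.
            yield boundary_token_fn(sent_a, sent_b)

def index_split(task, split, boundary_token_fn: BoundaryFn):
    # Generic code only forwards the callable; it knows nothing model-specific.
    return list(task.process_split(split, boundary_token_fn))

# BERT-style ordering, mirroring the sketch earlier in this PR description.
bert_style = lambda a, b=None: ["[CLS]"] + a + ["[SEP]"] + (b + ["[SEP]"] if b else [])
pairs = [(["a", "cat"], ["it", "sat"])]
print(index_split(ToyPairTask(), pairs, bert_style))
# [['[CLS]', 'a', 'cat', '[SEP]', 'it', 'sat', '[SEP]']]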

Member

It's not urgent, but I'm somewhat worried about this. How to handle boundary tokens and how to combine multiple input sentences are model-specific decisions. Right now, implementing them at the data-processing stage doesn't make much difference, but once people come up with all sorts of unusual setups, process_split will become more bloated, making it harder to add new models or tasks.

Contributor Author

Yeah, it's a bit messy. I think we could do better, but it's not obvious how—there are lots of tasks for which it's not precisely clear how to set up BERT. This PR isn't changing the messy-ish setup, it's just making it more visible.

Mind making an issue?

jiant/tasks/qa.py
jiant/utils/data_loaders.py
@@ -2123,7 +2143,7 @@ def __init__(self, path, max_seq_len, name="ccg", **kw):
self.val_data_text = None
self.test_data_text = None

-def process_split(self, split, indexers) -> Iterable[Type[Instance]]:
+def process_split(self, split, indexers, boundary_token_fn) -> Iterable[Type[Instance]]:
Contributor Author

CCG and other tagging tasks are likely to be especially tricky, because of the need to align the tags to the tokens.

  • There's subtle logic here and in the retokenization code and in this script that this may have broken. @pruksmhc @iftenney, would either of you be willing to take a pass, and maybe even push an update to this branch? 😬
  • I'm not sure I see what the differences are between wordpiece and sentencepiece tokenization. @iftenney - Do you know sentencepiece? Is there anything in that difference that we should worry about? The main thing that I can see is that spaces are explicitly included in the tokens. (A rough illustration of the difference is sketched below.)
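For reference, a rough illustration of the difference, using the pytorch_transformers tokenizer classes. This is not from the PR, and the exact token splits shown in the comments are illustrative; they depend on the pretrained vocabulary.

from pytorch_transformers import BertTokenizer, XLNetTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
xlnet_tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")

print(bert_tok.tokenize("unaffordable housing"))
# WordPiece marks word-internal continuations with '##'; spaces are implicit,
# e.g. something like ['una', '##ffo', '##rdable', 'housing']

print(xlnet_tok.tokenize("unaffordable housing"))
# SentencePiece keeps the space explicitly as a '▁' marker on word-initial pieces,
# e.g. something like ['▁un', 'afford', 'able', '▁housing']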

Contributor

Yep, I'll take a pass early tomorrow.

Collaborator

I don't know the CCG code at all.

For the edge probing retokenization code, the only tokenization-specific thing is some postprocessing to encourage closer alignment - I'll have to check what sentencepiece looks like there. (Alternatively, I think sentencepiece might give you byte offsets, so I could just use those directly.)

This shouldn't block anything; there's already an exception that will be thrown if the tokenizer isn't whitelisted:

raise ValueError(f"Unsupported tokenizer '{tokenizer_name}'")

@sleepinyourhat added the labels "0.x.0 release on fix" (Put out a new 0.x.0 release when this is fixed), "bug" (Something isn't working), and "cleanup" (This should be fairly easy) on Jul 17, 2019
jiant/models.py (outdated)
jiant/preprocess.py (outdated)
Contributor

@pruksmhc left a comment

Looks like the major problems/bugs were resolved, and the changes seem sane. Of course, test and make sure no major regressions in any tasks remain before merging.

@sleepinyourhat
Contributor Author

@pruksmhc Thanks! I'm testing everything I can, but do make sure to have a close look at CCG. I'm not fully set up to test that.

@pruksmhc
Contributor

CCG has tokenization as an offline preprocessing step (which should honestly be changed). I would say just note in the documentation that CCG is not yet set up for XLNet, and leave CCG for another PR.

@pruksmhc
Contributor

The offline preprocessing step should honestly be made part of load_data too.

@sleepinyourhat
Contributor Author

@pruksmhc - Mind making an issue?

@sleepinyourhat
Contributor Author

This should be ready to go after one last test. @iftenney Any last comments before I merge?

gcp/kubernetes/templates/run_batch.jsonnet
jiant/models.py (outdated)
jiant/pytorch_transformers_interface/__init__.py (outdated)
jiant/pytorch_transformers_interface/modules.py (outdated)
APIs from pytorch_transfromers.
"""

def __init__(self, args):
Collaborator

nit: this only seems to depend on args.exp_dir and args.pytorch_transformers_embedding_mode - for cleaner abstractions, can we make this explicit?

This should work:

def __init__(self, exp_dir, pytorch_transformers_embedding_mode, **unused_kw):
  # do stuff

and call as module(**args.to_dict())

(similarly with parameter_setup() below)
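A minimal self-contained illustration of the suggested pattern (placeholder names, not jiant's actual classes): the constructor names only the arguments it uses, and **unused_kw absorbs everything else in the experiment's args dict.

class ToyEmbedderModule:
    def __init__(self, exp_dir, pytorch_transformers_embedding_mode, **unused_kw):
        # Only the two named arguments matter; the rest of args is ignored.
        self.exp_dir = exp_dir
        self.embedding_mode = pytorch_transformers_embedding_mode

args = {
    "exp_dir": "/tmp/exp",
    "pytorch_transformers_embedding_mode": "top",
    "batch_size": 32,  # irrelevant here; swallowed by **unused_kw
}
module = ToyEmbedderModule(**args)  # analogous to module(**args.to_dict())
print(module.embedding_mode)        # -> top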

Contributor Author

The inheriting classes add a few more args, so this would either expand the code a fair bit or else add an inconsistency. Weak/lazy preference to leave as is for now.

jiant/utils/tokenizers.py
scripts/edgeprobing/exp_fns.sh (outdated)
@sleepinyourhat
Contributor Author

  • use_pytorch_transformers: Cleaned up—pushing shortly.
  • pytorch_pretrained_bert_cache: If this is only a preference, I'll leave it. Given the choice between following a simple but misleading/wrong naming convention and using a more complicated but more accurate naming convention, I tend toward the latter.

Replying to replyable comments above...

jiant/models.py (outdated)
scripts/edgeprobing/exp_fns.sh (outdated)
@sleepinyourhat merged commit a1e9abf into master on Aug 7, 2019
@sleepinyourhat deleted the pytorch-transformers branch on August 7, 2019 21:39
phu-pmh pushed a commit that referenced this pull request Apr 17, 2020
* Rename namespaces to suppress warnings.

* Revert "Rename namespaces to suppress warnings."

This reverts commit 0cf7b23.

* Initial working-ish attempt.

* Intermediate check-in...

* More partial progress.

* Another pass...

* Fix sep/cls handling, cleanup.

* Further cleanup.

* Keyword name fix.

* Another flag fix.

* Pull debug print.

* Line length cleanup.

* WiC fix.

* Two task setup bugs.

* BoolQ typo

* Improved segment handling.

* Delete unused is_pair_task, other cleanup/fixes.

* Fix deleted path from merge.

* Fix cache path.

* Address (spurious?) tokenization warning.

* Select pool_type automatically to match model.

h/t Haokun Liu

* Config updates.

* Path fix

* Fix XLNet UNK handling.

* Internal temporary MNLI alternate.

* Revert "Internal temporary MNLI alternate."

This reverts commit 455792a.

* Add helper fn tests

* Finish merge

* Remove unused argument.

* Possible ReCoRD bug fix

* Cleanup

* Fix merge issues.

* Revert "Remove unused argument."

This reverts commit 96a7c37.

* Assorted responses to Alex's comments.

* Further ReCoRD fix.

* @iftenney's comments.

* Fix/simplify segment logic.

* @W4ngatang's comments

* Cleanup.

* Cleanup

* Fix issues with alternative embeddings_mode settings, max_layer.

* More mix cleanup.

* Masking fix.

* Address (most of) @iftenney's comments

* Tidying.

* Misc cleanup.

* Comment.
@jeswan added the "jiant-v1-legacy" label (Relevant to versions <= v1.3.2) on Sep 17, 2020