
BERT tokens #494

Closed · JMMackenzie opened this issue Nov 21, 2022 · 5 comments · Fixed by #503
Labels: enhancement (New feature or request)

Comments

JMMackenzie (Member) commented Nov 21, 2022

Describe the solution you'd like
Currently, PISA does not readily support BERT wordpiece tokens such as `exam ##pl`, because the `##` gets eaten by the tokenizer.

We should support a command-line flag like `--pretokenized` (similar to Anserini) that tells the tokenizer to simply split on whitespace and do no more.

Checklist

JMMackenzie added the enhancement label Nov 21, 2022
JMMackenzie self-assigned this Nov 24, 2022
elshize (Member) commented Nov 25, 2022

@JMMackenzie Do you by any chance have some Anserini docs on how this is implemented? I'm not that familiar with BERT, and I'd love to understand it a bit more.

JMMackenzie (Member, Author) commented:

If you check this commit, you will see that they basically just instantiate a "whitespace analyzer", which does what it says on the tin: castorini/anserini@14b315d

This boils down to something like this: https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html

> A tokenizer that divides text at whitespace characters as defined by [Character.isWhitespace(int)](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html?is-external=true#isWhitespace-int-). Note: That definition explicitly excludes the non-breaking space. Adjacent sequences of non-Whitespace characters form tokens.
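
In plain C++ terms, that sort of tokenizer is essentially just the following (a rough sketch for illustration only, not PISA's actual tokenizer interface; `whitespace_tokenize` is just a name made up for the example):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split a line on runs of whitespace and nothing else, so wordpiece tokens
// such as "##ing" pass through untouched.
std::vector<std::string> whitespace_tokenize(std::string const& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) {  // operator>> skips the whitespace between tokens
        tokens.push_back(token);
    }
    return tokens;
}

int main() {
    for (auto const& token : whitespace_tokenize("exam ##pl fish ##ing")) {
        std::cout << token << '\n';  // exam, ##pl, fish, ##ing
    }
}
```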

I think for our intents and purposes, we can just tokenize directly on spaces. The only question may be whether the lexicon tooling handles special characters correctly, but I don't see why it wouldn't. Any thoughts?

JMMackenzie (Member, Author) commented:

Basically, this enhancement is for cases where we are ingesting a learned sparse index, either from JSONL or from another IR toolkit like Anserini/Terrier (perhaps via CIFF), whose vocabulary looks like:

```
##ing
...
fish
...
```

And then at query time we might see `101: fish ##ing locations` or something like that. The example is made up, but it should explain what we need.

I think currently PISA would turn that query into `fish ing locations` and then maybe match `ing` with the wrong token, or just not find it.

elshize (Member) commented Nov 25, 2022

Ah, OK, so this would be an alternative parsing mode, correct? When `--pretokenized` is passed, we break on spaces; otherwise, business as usual?

As for the lexicon, I don't see why it wouldn't work either. There's really nothing special about "special" characters like #. It's all just bytes.

If you have access to, or can get your hands on, a CIFF built this way (preferably not too large), it would be good to have it for some sanity checks beyond any unit/integration tests we may write for this.
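
In other words, something roughly like this, just to confirm I understand the proposal (a sketch only; `default_tokenize` below is a crude stand-in for the current pipeline, and none of these names reflect PISA's actual API):

```cpp
#include <cctype>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the current pipeline, only to illustrate the contrast: keep
// alphanumeric runs, lowercase them, drop everything else. This is NOT the
// real analyzer, just an approximation of why "##ing" degrades to "ing".
std::vector<std::string> default_tokenize(std::string const& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

std::vector<std::string> tokenize(std::string const& text, bool pretokenized) {
    if (pretokenized) {
        // --pretokenized: trust the input's token boundaries, split on whitespace only.
        std::istringstream in(text);
        return {std::istream_iterator<std::string>(in), std::istream_iterator<std::string>()};
    }
    return default_tokenize(text);
}

int main() {
    for (bool pretokenized : {false, true}) {
        for (auto const& token : tokenize("fish ##ing locations", pretokenized)) {
            std::cout << token << ' ';
        }
        std::cout << '\n';
    }
    // Prints:
    //   fish ing locations
    //   fish ##ing locations
}
```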

JMMackenzie (Member, Author) commented:

Sure, I can generate a CIFF file if that would help!

elshize mentioned this issue Dec 24, 2022