
BERT tokens #494

Closed · JMMackenzie opened this issue Nov 21, 2022 · 5 comments · Fixed by #503
Labels: enhancement (New feature or request)

Comments

JMMackenzie (Member) commented Nov 21, 2022

Describe the solution you'd like
Currently, PISA does not readily support BERT wordpiece tokens such as `exam ##pl`, because the `##` gets eaten by the tokenizer.

We should support a command-line flag like `--pretokenized` (similar to Anserini) that tells the tokenizer to simply split on whitespace and do no more.

Checklist

JMMackenzie added the enhancement label Nov 21, 2022
JMMackenzie self-assigned this Nov 24, 2022
elshize (Member) commented Nov 25, 2022

@JMMackenzie Do you by any chance have some Anserini docs on how this is implemented? I'm not that familiar with BERT, and I'd love to understand it a bit more.

JMMackenzie (Member, Author) commented:

If you check this commit, you will see that they basically just instantiate a "whitespace analyzer", which does what it says on the tin: castorini/anserini@14b315d

This boils down to something like this: https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html

> A tokenizer that divides text at whitespace characters as defined by [Character.isWhitespace(int)](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html?is-external=true#isWhitespace-int-). Note: That definition explicitly excludes the non-breaking space. Adjacent sequences of non-Whitespace characters form tokens.
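
In plain C++ terms, that sort of tokenizer is essentially just the following (a rough sketch for illustration only, not PISA's actual tokenizer interface; `whitespace_tokenize` is just a name made up for the example):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split a line on runs of whitespace and nothing else, so wordpiece tokens
// such as "##ing" pass through untouched.
std::vector<std::string> whitespace_tokenize(std::string const& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) {  // operator>> skips the whitespace between tokens
        tokens.push_back(token);
    }
    return tokens;
}

int main() {
    for (auto const& token : whitespace_tokenize("exam ##pl fish ##ing")) {
        std::cout << token << '\n';  // exam, ##pl, fish, ##ing
    }
}
```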

I think for our intents and purposes, we can just tokenize directly on spaces. The only question may be whether the lexicon tooling handles special characters correctly, but I don't see why it wouldn't. Any thoughts?

JMMackenzie (Member, Author) commented:

Basically, this enhancement is for cases where we are ingesting a learned sparse index, either from JSONL or from another IR toolkit like Anserini/Terrier (perhaps via CIFF), whose vocabulary looks like:

```
##ing
...
fish
...
```

And then at query time we might see `101: fish ##ing locations` or something like that. The example is made up, but it should explain what we need.

I think currently PISA would turn that query into `fish ing locations` and then maybe match `ing` with the wrong token, or just not find it.

elshize (Member) commented Nov 25, 2022

Ah, OK, so this would be an alternative parsing mode, correct? When `--pretokenized` is passed, we break on spaces; otherwise, business as usual?

As for the lexicon, I don't see why it wouldn't work either. There's really nothing special about "special" characters like #. It's all just bytes.

If you have access to, or can get your hands on, a CIFF built this way (preferably not too large), it would be good to have it for some sanity checks beyond any unit/integration tests we may write for this.
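
In other words, something roughly like this, just to confirm I understand the proposal (a sketch only; `default_tokenize` below is a crude stand-in for the current pipeline, and none of these names reflect PISA's actual API):

```cpp
#include <cctype>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the current pipeline, only to illustrate the contrast: keep
// alphanumeric runs, lowercase them, drop everything else. This is NOT the
// real analyzer, just an approximation of why "##ing" degrades to "ing".
std::vector<std::string> default_tokenize(std::string const& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

std::vector<std::string> tokenize(std::string const& text, bool pretokenized) {
    if (pretokenized) {
        // --pretokenized: trust the input's token boundaries, split on whitespace only.
        std::istringstream in(text);
        return {std::istream_iterator<std::string>(in), std::istream_iterator<std::string>()};
    }
    return default_tokenize(text);
}

int main() {
    for (bool pretokenized : {false, true}) {
        for (auto const& token : tokenize("fish ##ing locations", pretokenized)) {
            std::cout << token << ' ';
        }
        std::cout << '\n';
    }
    // Prints:
    //   fish ing locations
    //   fish ##ing locations
}
```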

JMMackenzie (Member, Author) commented:

Sure, I can generate a CIFF file if that would help!

elshize mentioned this issue Dec 24, 2022