BERT tokens #494
Comments
@JMMackenzie Do you by any chance have some Anserini docs on how this is implemented? I'm not that familiar with BERT; I'd love to understand it a bit more.
If you check this commit, you will see that they basically just instantiate a "whitespace analyzer" which does what it says on the tin: castorini/anserini@14b315d. This boils down to something like this: https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html
I think for our intents/purposes, we can just tokenize directly on spaces. I think the only problem may be whether storing special characters will be handled correctly by the lexicon tooling, but I don't see why it wouldn't work. Any thoughts?
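For illustration, whitespace-only tokenization amounts to something like the sketch below (plain C++, illustrative only, not PISA's or Lucene's actual tokenizer API); wordpiece terms such as `##pl` pass through untouched:

```cpp
// Minimal sketch of whitespace-only tokenization: split on runs of
// whitespace and keep every byte of each token, so wordpieces like
// "##pl" survive intact. Illustrative only, not PISA's tokenizer API.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> whitespace_tokenize(std::string const& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string tok; in >> tok;) {  // operator>> skips whitespace between tokens
        tokens.push_back(tok);           // no lowercasing, stemming, or character filtering
    }
    return tokens;
}

int main() {
    for (auto const& t : whitespace_tokenize("exam ##pl ##e")) {
        std::cout << t << '\n';  // prints: exam, ##pl, ##e
    }
}
```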
Basically this enhancement is for cases where we are ingesting a learned sparse index from either JSONL or another IR toolkit like Anserini/Terrier (perhaps via CIFF) which has a vocabulary containing wordpiece terms such as `##pl`.
And then at query time we might see a query like `exam ##pl`; I think currently PISA would turn that query into `exam pl`, because the `##` gets eaten by the tokenizer.
Ah, ok, so this would be an alternative parsing, correct? When `--pretokenized` is passed, we would simply split on whitespace instead of running the usual parsing. As for the lexicon, I don't see why it wouldn't work either. There's really nothing special about "special" characters like #. It's all just bytes. If you have access to, or can get your hands on, a CIFF built this way (preferably not too large), it would be good to have it to do some sanity checks beyond any unit/integration tests we may write for that.
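To illustrate the "it's all just bytes" point, a sorted lexicon compares terms bytewise, so terms containing `#` sort and look up like any other string. The names below are purely illustrative and are not PISA's lexicon implementation:

```cpp
// Sketch: a sorted term lexicon with bytewise comparison handles
// wordpiece terms containing '#' the same as plain terms.
// Illustrative only; this is not PISA's lexicon code.
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> lexicon = {"example", "##pl", "exam", "##e"};
    std::sort(lexicon.begin(), lexicon.end());  // lexicographic, i.e. bytewise order

    // '#' (0x23) simply sorts before letters; binary search works unchanged.
    assert(std::binary_search(lexicon.begin(), lexicon.end(), std::string("##pl")));
    assert(std::binary_search(lexicon.begin(), lexicon.end(), std::string("exam")));
}
```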
Sure, I can generate a CIFF file if that would help!
Describe the solution you'd like
Currently, PISA does not readily support BERT wordpiece tokens such as `exam ##pl`, because the `##` is eaten by the tokenizer. We should have support for a command line flag like `--pretokenized` (similar to Anserini) to tell the tokenizer to simply consume whitespace and do no more.
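For a concrete picture of the failure mode, the sketch below contrasts a deliberately simplified stand-in for the current behaviour (dropping non-alphanumeric characters, which is only an approximation of what the analyzer does) with the proposed pretokenized path. None of these function names correspond to PISA's actual code:

```cpp
// Sketch of the failure mode and the proposed fix. "default_like" is a
// simplified stand-in for an analyzer that eats '#'; it is NOT PISA's
// actual tokenizer. "pretokenized" just splits on whitespace.
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> pretokenized(std::string const& text) {
    std::istringstream in(text);
    std::vector<std::string> out;
    for (std::string tok; in >> tok;) out.push_back(tok);
    return out;
}

std::vector<std::string> default_like(std::string const& text) {
    std::string cleaned;
    for (unsigned char c : text)
        cleaned += std::isalnum(c) ? static_cast<char>(c) : ' ';  // '#' becomes a space
    return pretokenized(cleaned);
}

int main() {
    std::string const query = "exam ##pl";
    for (auto const& t : default_like(query)) std::cout << t << ' ';   // exam pl
    std::cout << '\n';
    for (auto const& t : pretokenized(query)) std::cout << t << ' ';   // exam ##pl
    std::cout << '\n';
}
```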