Character offset information #14
I started to work on this today, making some progress already...
Hello,

First of all, another thank-you for getting involved in the project. Your input and ideas are very valuable. I apologize for the somewhat delayed feedback, as I am working full-time and this is a personal project.

My preference would be to have offsets correspond to the position of Rust

I have had a look at your initial work on the offsets and I like it a lot. I have a few starting comments on what I have seen so far:
Now the hard part that remains is the calculation and propagation of offsets through multiple steps. I started to look into it last night, at the

How would you like to work on this? This requires a high amount of effort, but I believe you currently have more capacity to execute quickly on it. If you can think of a way I could contribute or support, please let me know. Thanks again for this.
No problem! I fully understand. I'm counting this as work and have some time to spare on it, so I don't mind taking this on and am already well underway (I haven't pushed the latest work yet because I'm still in the middle of a big overhaul). I'm currently working on all the low-level functions in tokenization_utils.
I also considered that. I agree it may look nicer, but the only reason I opted for length was that I could then get away with storing less (a u8 would do as long as max_word_len never exceeds u8::MAX), with some memory benefits as a result. Maybe it's premature optimisation, but I imagined a user loading and tokenizing a whole corpus at once. What do you think?
Right, I'm currently using bytes, but I agree that that's perhaps too low-level and Unicode code points may be better.
I agree with the memory argument. How about reducing the size of the start_offset from
I think Unicode code points are just so much more visual when tokenizing a text and double-checking the offsets. If it's not too much work, it'd be great if you could switch :)
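To make the difference concrete, here is a small self-contained sketch (the string is an invented example) showing how byte offsets and code-point offsets diverge as soon as the input contains multibyte characters:

```rust
fn main() {
    let text = "héllo wörld";
    // Byte offset: 'ö' starts at byte 8, because 'é' occupies two bytes in UTF-8.
    let byte_start = text.find('ö').unwrap();
    // Code-point (char) offset: count Unicode scalar values instead of bytes.
    let char_start = text.chars().take_while(|&c| c != 'ö').count();
    println!("byte offset: {}, char offset: {}", byte_start, char_start);
    // For pure ASCII the two coincide, which is why ASCII-only tests hide the issue.
}
```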
Yeah, no problem, I'll switch it to (u32, u32) and encode the endpoint instead of the length.
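For reference, a minimal sketch of what such an offset type could look like (field and method names are hypothetical, not necessarily the crate's actual API):

```rust
/// Hypothetical sketch of the (u32, u32) offset pair discussed above:
/// begin and end are Unicode code-point (char) offsets into the original text.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Offset {
    pub begin: u32,
    pub end: u32, // exclusive endpoint, stored instead of a length
}

impl Offset {
    /// The token length in chars is recovered cheaply when needed.
    pub fn len(&self) -> u32 {
        self.end - self.begin
    }
}

fn main() {
    let o = Offset { begin: 3, end: 8 };
    println!("{:?} has length {}", o, o.len());
}
```

Storing the endpoint rather than the length trades the u8 memory saving for uniform u32 fields, which also removes the max_word_len ceiling discussed earlier.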
…fset information (to be continued...) guillaume-be#14
… it used to (offsets are to uncleaned original) guillaume-be#14
…ing to simplify things and have elegant tokenization chains guillaume-be#14
…+ turning output_offsets into a Vec<Option<Offset>> instead of Vec<Offset>, because for some special tokens like BOS/EOS markers offsets make little sense guillaume-be#14
…sponsible for it), will be reintroduced once the tokenizers provide the offset information (guillaume-be/rust-tokenizers#14), issue guillaume-be#29
@guillaume-be Here's a status update: I'm a good way through the process and I've done several rounds of refactoring. Basically I performed heart surgery on the tokenizer, and it's now powered internally by two (very similar) methods:

```rust
pub fn split_on_char<'a, F>(token: TokenRef<'a>, test_character: F, add_separators: bool, set_mask: Mask) -> Vec<TokenRef<'a>>
where
    F: Fn(&char) -> bool;

pub fn split_on_substr<'a, F>(token: TokenRef<'a>, test_substr: F, add_separators: bool, set_mask: Mask) -> Vec<TokenRef<'a>>
where
    F: Fn(&'a str) -> (usize, usize); // returns length of match in bytes and chars
```

All higher-level functions (split_on_punct, tokenize_whitespace, tokenize_cjk_chars, split_on_special_tokens) simply call one of these two main functions and pass a simple test function. The Mask enum:

```rust
pub enum Mask {
    None,
    Whitespace,
    Punctuation,
    CJK,
    Special,
    Continuation,
    Unknown,
}
```

One of the most important changes is that the tokenizer doesn't modify the source text at all (which would ruin the offsets); it eventually only modifies some tokens (lowercasing, stripping accents, etc.), but this is done as late as possible, so the earlier stages make no string copies. Building a tokenizer can now be done quite nicely by just chaining stuff, as seen for the base tokenizer and for BERT:
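To illustrate the split_on_char idea described above, here is a simplified, self-contained sketch (the TokenRef fields are assumptions; add_separators and the Mask argument are omitted for brevity):

```rust
// Simplified sketch of an offset-aware split_on_char. The real TokenRef also
// carries a Mask; here we keep only the text slice and its char offsets.
#[derive(Debug, PartialEq)]
struct TokenRef<'a> {
    text: &'a str,
    begin: u32, // char offset into the original text (inclusive)
    end: u32,   // char offset into the original text (exclusive)
}

fn split_on_char<'a, F>(token: &TokenRef<'a>, test_character: F) -> Vec<TokenRef<'a>>
where
    F: Fn(&char) -> bool,
{
    let mut tokens = Vec::new();
    let mut start_byte = 0usize;       // byte position for slicing the &str
    let mut start_char = token.begin;  // char position for the offsets
    let mut char_pos = token.begin;
    for (byte_idx, c) in token.text.char_indices() {
        if test_character(&c) {
            if byte_idx > start_byte {
                tokens.push(TokenRef {
                    text: &token.text[start_byte..byte_idx],
                    begin: start_char,
                    end: char_pos,
                });
            }
            start_byte = byte_idx + c.len_utf8(); // bytes and chars advance differently
            start_char = char_pos + 1;
        }
        char_pos += 1;
    }
    if start_byte < token.text.len() {
        tokens.push(TokenRef { text: &token.text[start_byte..], begin: start_char, end: char_pos });
    }
    tokens
}

fn main() {
    let original = "a bc";
    let root = TokenRef { text: original, begin: 0, end: original.chars().count() as u32 };
    let parts = split_on_char(&root, |c| c.is_whitespace());
    println!("{:?}", parts); // "a" at chars 0..1, "bc" at chars 2..4
}
```

Note the key invariant: no string is copied, every sub-token is a slice borrowing from the original text, and byte positions are tracked separately from char offsets.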
I still have to work on all the other tokenizers, so that will probably take a bit; I'm sure there will be some challenges there too.
This looks great. I really like the implementation and the fact that data is cloned only at a later stage. I have updated the CI to run a validation between the reference Transformers tokenizers and the Rust-based tokenizers, which may be useful if you'd like to validate that the output remains unchanged. These integration tests also contain various benchmarks comparing the Python and Rust implementations. One thing I would like to check (but can drive) is to ensure that the performance benefit of the Rust implementation over Python remains in the same order of magnitude. If you are interested, these are rather easy to set up once a version of the library compiles.
…the other tokenizers guillaume-be#14
That's a very good idea, yes! I dare not really say yet whether things will be faster or slower. Of course, adding the offsets everywhere adds quite some complexity, which will come at a cost, but I'm also optimizing some things. Then again, especially in the commit I just did, I think there are probably parts that can be optimized further (still too much copying going on for my taste). I'm adapting the tests to include the offsets as I go and trying to make it very explicit when I changed an outcome (such as https://github.com/proycon/rust-tokenizers/blob/offsets/main/src/preprocessing/tokenizer/tokenization_utils.rs#L1575 )
Right, we may need a better test for that too in test_group_common_pairs and test_bpe, as right now they only cover ASCII-compatible input, and we'll need to ensure it holds up for multibyte characters too.
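One property such a multibyte test could assert (a sketch with invented example data, not the crate's actual test fixtures): slicing the original text by a token's char offsets must reproduce the token text, even for non-ASCII input.

```rust
// Round-trip property: original[begin..end] (in chars) == token text.
// Tokens and offsets below are invented for illustration.
fn char_slice(text: &str, begin: usize, end: usize) -> String {
    text.chars().skip(begin).take(end - begin).collect()
}

fn main() {
    let original = "süß pair";
    // Suppose a tokenizer reported these (token, begin, end) triples:
    let tokens = [("süß", 0usize, 3usize), ("pair", 4, 8)];
    for (text, begin, end) in tokens {
        assert_eq!(char_slice(original, begin, end), text);
    }
    println!("offsets round-trip for multibyte input");
}
```

An ASCII-only fixture would pass this check even with byte-based offsets, which is exactly why the test data needs multibyte characters.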
By the way, you're also aware of https://github.com/huggingface/tokenizers, I assume? It's even implemented in Rust. I haven't looked at their code yet, though. It won't hurt to have two implementations anyway :)
Yes - I am aware of the tokenizers package from Huggingface. The present crate is focused on inference (it does not include the training of tokenizers) and follows a structure that is closer to that of the Python baseline. I tried to clearly separate the

The present crate also offers (as far as I know) identical tokenization to the Python baselines. There are a few cases where the

The deep learning crate currently has a clear focus on inference, and I believe this implementation of tokenizers is a good match, so I believe both have their purpose for now. For the proposed refactoring, maybe it is best not to get too deep into the other code base; I'd like to keep the two implementations fairly different :)
…l show a char vs byte problem to solve)
…ctContinuation, subtokens carrying this mask have their offset refer to the entire token they are part of, rather than to a specific part of the token. guillaume-be#14
…rough the TokenizedInput structure, this work constitutes major refactoring of the tokenisation logic and introduces new shared functions that are all offset (and mask) aware. guillaume-be#14
- initial setup of structures to handle offsets guillaume-be#14
- A major batch of (untested!) work on the tokenisers to incorporate offset information (to be continued...) guillaume-be#14
- made the truncate* functions also work without offsets if needed
- some error fixing
- some initial test compilation fixes
- another round of fixes
- fixed test compilation
- refactored tokenize_with_offsets to be nicer
- don't lowercase special tokens
- various test fixes + base tokenizer cleans text at a later stage than it used to (offsets are to uncleaned original) guillaume-be#14
- major refactoring, introducing the Token and TokenRef structs and trying to simplify things and have elegant tokenization chains guillaume-be#14
- removed a reference
- fix and optimisation
- fixed test_truncate_single_sentence
- fixed test_encode_single_sentence
- adding some copied documentation from the original huggingface transformers guillaume-be#14
- adapting build_input_with_special_tokens to properly support offsets + turning output_offsets into a Vec<Option<Offset>> instead of Vec<Offset>, because for some special tokens like BOS/EOS markers offsets make little sense guillaume-be#14
- cleanup
- fixed some tests (added missing reference output for offsets)
- last fixes for BERT tokenizer (tests are now green) guillaume-be#14
- fix
- Implementing offset support in auxiliary functions needed by all the other tokenizers guillaume-be#14
- fixed split_on_regex
- implementing new GPT2 tokeniser (not done yet, some test failures) guillaume-be#14
- Adapting roberta tokenizer to use offsets guillaume-be#14 (tests still show a char vs byte problem to solve)
- Added offset tests to test_base_tokenizer guillaume-be#14
- fixed some tests but a new byte/char problem surfaced in the wordpiece tokenizer guillaume-be#14
- made tokenize_wordpiece properly distinguish bytes and chars guillaume-be#14
- Fixed Roberta tokenizer, introduced new masks: InexactBegin and InexactContinuation, subtokens carrying this mask have their offset refer to the entire token they are part of, rather than to a specific part of the token. guillaume-be#14
- further implemented masks and propagate them to TokenizedInput guillaume-be#14
- minor cleanup and logic fix
It would help if the tokenisers could maintain and output character-offset information with respect to the original input. This would greatly help in aligning information back together at the end of an NLP pipeline.
Possible points for discussion:
(this continues part of the discussion from guillaume-be/rust-bert#29)
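To illustrate the motivating use case (a sketch with invented example data): given a token's character offsets, a downstream component can recover the exact span in the raw input, e.g. to attach a model's label back to the original text.

```rust
// Why offsets matter downstream (illustrative only): given char offsets for a
// token a model labelled as an entity, recover the exact span in the raw input.
fn main() {
    let raw = "Visit Köln tomorrow";
    // Suppose the tokenizer reported char offsets (6, 10) for the token "Köln"
    // and a downstream NER model labelled it as a location.
    let (begin, end) = (6usize, 10usize);
    let span: String = raw.chars().skip(begin).take(end - begin).collect();
    assert_eq!(span, "Köln");
    println!("LOC entity at chars {}..{}: {}", begin, end, span);
}
```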