Character offset information #14

Closed
proycon opened this issue May 4, 2020 · 10 comments

@proycon (Contributor) commented May 4, 2020

It would help if the tokenisers could maintain and output character offset information, with respect to the original input. This would help greatly to align information back together at the end of an NLP pipeline.

Possible points for discussion:

(this continues part of the discussion from guillaume-be/rust-bert#29)

proycon added a commit to proycon/rust-tokenizers that referenced this issue May 5, 2020
@proycon (Contributor, Author) commented May 5, 2020

I started to work on this today, making some progress already...

@guillaume-be (Owner) commented May 5, 2020

Hello,

First of all - another thank you for getting involved in the project. Your input and ideas are very valuable. I do apologize for the somewhat delayed feedback as I am working full-time and this is a personal project.

My preference would be to have offsets correspond to the position of Rust char (unicode point). For example, the sentence `"Hello, 🙂 world!"` would be tokenized as:
[(Hello, (0, 5)), (,, (5, 6)), (🙂, (7, 8)), (world, (9, 14)), (!, (14, 15))]
I think it is the most expected behaviour and works well with the tokenization algorithms that work mostly on a character level (at least for wordpiece).
The special tokens added during encoding ([CLS] or [SEP]) should have offsets of (0,0) as they are not in the original sentence anyway.
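
As a rough illustration of this convention, here is a minimal, self-contained sketch (not the crate's API; it only splits on whitespace, whereas the real tokenizers also split punctuation) of counting offsets in chars rather than bytes:

fn whitespace_tokens_with_char_offsets(text: &str) -> Vec<(&str, (usize, usize))> {
    // Offsets are counted in Unicode scalar values (Rust chars), not bytes,
    // so "🙂" advances the position by exactly one.
    let mut tokens = Vec::new();
    let mut char_pos = 0usize;
    let mut start: Option<(usize, usize)> = None; // (byte index, char index) of token start
    for (byte_idx, c) in text.char_indices() {
        if c.is_whitespace() {
            if let Some((start_byte, start_char)) = start.take() {
                tokens.push((&text[start_byte..byte_idx], (start_char, char_pos)));
            }
        } else if start.is_none() {
            start = Some((byte_idx, char_pos));
        }
        char_pos += 1;
    }
    if let Some((start_byte, start_char)) = start {
        tokens.push((&text[start_byte..], (start_char, char_pos)));
    }
    tokens
}

// whitespace_tokens_with_char_offsets("Hello, 🙂 world!")
// => [("Hello,", (0, 6)), ("🙂", (7, 8)), ("world!", (9, 15))]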

I have had a look at your initial work on the offsets and I like it a lot. I have a few starting comments on what I have seen so far:

  • I think an Offset (start_pos (inclusive), end_pos (exclusive)) would be nicer to work with instead of start and length
  • I love how you kept the API the same with tokenize and encode still working as expected.
  • The decode to vec is a great idea!

Now the hard part remains the calculation and propagation of offsets through multiple steps. I started to look into it last night, at the split_on_special_tokens level. It is not finalized yet, but feel free to have a look: https://github.com/guillaume-be/rust-tokenizers/tree/tokenization_offsets.

How would you like to work on this? It requires a substantial amount of effort - but I believe you currently have more capacity to execute on it quickly. If you can think of a way I could contribute or support, please let me know.

Thanks again for this.

@proycon (Contributor, Author) commented May 5, 2020

First of all - another thank you for getting involved in the project. Your input and ideas are very valuable. I do apologize for the somewhat delayed feedback as I am working full-time and this is a personal project.

No problem! I fully understand. I'm counting this as work and have some time to spare on it, so I don't mind taking this on and am already well underway (I haven't pushed the latest work yet because I'm still in the middle of a big overhaul). I'm currently working on all the low-level functions in tokenization_utils.

I think an Offset (start_pos (inclusive), end_pos (exclusive)) would be nicer to work with instead of start and length

I also considered that. I agree it may look nicer, but the only reason I opted for length was that I could then get away with storing less (a u8 would do as long as max_word_len never exceeds the u8 range) and therefore gain some memory benefits. Maybe it's premature optimisation, but I imagined a user loading and tokenizing a whole corpus at once. What do you think?

My preference would be to have offsets correspond to the position of Rust char (unicode point).

Right, I'm currently using bytes but I agree that that's perhaps too low-level and unicode points may be better.

@guillaume-be (Owner) commented May 5, 2020

I also considered that. I agree it may look nicer, but the only reason I opted for length was that I could then get away with storing less (a u8 would do as long as max_word_len never exceeds the u8 range) and therefore gain some memory benefits. Maybe it's premature optimisation, but I imagined a user loading and tokenizing a whole corpus at once. What do you think?

I agree with the memory argument. How about reducing the size of the start_offset from u64 to u32 (we should be safe until we get a user encoding a sentence 18,446,744,073,709,551,615 characters long) and having both the start and end offsets as u32? It somehow feels like a more natural way to report offsets, albeit, I agree, less efficient.
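
For what it's worth, a quick back-of-the-envelope comparison of the two layouts under discussion (hypothetical struct names, assuming a u64 start for the start + length variant as mentioned above):

use std::mem::size_of;

// Hypothetical layouts for comparison only (not the actual types in the crate).
struct OffsetStartLen { start: u64, length: u8 } // start + length, as first proposed
struct OffsetStartEnd { begin: u32, end: u32 }   // (start, end) as u32, as suggested above

fn main() {
    // Due to alignment/padding the u64 + u8 variant takes 16 bytes,
    // while the (u32, u32) variant takes 8.
    println!("start + length: {} bytes", size_of::<OffsetStartLen>());
    println!("(u32, u32):     {} bytes", size_of::<OffsetStartEnd>());
}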

Right, I'm currently using bytes but I agree that that's perhaps too low-level and unicode points may be better.

I think unicode points are just so much more visual when tokenizing a text and double-checking the offsets. If it's not too much work it'd be great if you could switch :)

@proycon (Contributor, Author) commented May 5, 2020

Yeah, no problem, I'll switch it to (u32,u32) and encode endpoint instead of length.
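
A minimal sketch of what such an offset type could look like (illustrative field names, not necessarily the final API):

/// Character-based offset of a token into the original input,
/// counted in Unicode scalar values (Rust chars), not bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Offset {
    pub begin: u32, // first character of the token (inclusive)
    pub end: u32,   // one past the last character (exclusive)
}

impl Offset {
    pub fn new(begin: u32, end: u32) -> Offset {
        Offset { begin, end }
    }

    /// Length of the token in characters.
    pub fn length(&self) -> u32 {
        self.end - self.begin
    }
}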

proycon added a series of commits to proycon/rust-tokenizers and proycon/rust-bert that referenced this issue, May 5-6, 2020, including:

  • …ing to simplify things and have elegant tokenization chains guillaume-be#14
  • …+ turning output_offsets into a Vec<Option<Offset>> instead of Vec<Offset>, because for some special tokens like BOS/EOS markers offsets make little sense guillaume-be#14
  • …sponsible for it), will be reintroduced once the tokenizers provide the offset information (guillaume-be/rust-tokenizers#14), issue guillaume-be#29
@proycon (Contributor, Author) commented May 6, 2020

@guillaume-be Here's a status update: I'm a good way through the process and I've done several rounds of refactoring. Basically I performed heart surgery on the tokenizer, and it's now powered internally by two (very similar) methods:

pub fn split_on_char<'a, F>(token: TokenRef<'a>, test_character: F, add_separators: bool, set_mask: Mask) -> Vec<TokenRef<'a>> 
  where F: Fn(&char) -> bool;

pub fn split_on_substr<'a, F>(token: TokenRef<'a>, test_substr: F, add_separators: bool, set_mask: Mask) -> Vec<TokenRef<'a>>
  where F: Fn(&'a str) -> (usize, usize); // returns length of the match in bytes and in chars

All higher-level functions (split_on_punct, tokenize_whitespace, tokenize_cjk_chars, split_on_special_tokens) simply call one of these two main functions and pass in a simple test function.

The TokenRef and Token structs encapsulate a token's text (&str vs String), its offset, and a special mask to keep track of whether the token is unprocessed or, if not, what kind of token it is:

pub enum Mask {
    None,
    Whitespace,
    Punctuation,
    CJK,
    Special,
    Continuation,
    Unknown,
}
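
For example, a wrapper like split_on_punct might boil down to little more than a predicate passed to split_on_char. This is a hypothetical sketch against the signature above, not the actual implementation (which would use a proper Unicode punctuation test rather than is_ascii_punctuation):

pub fn split_on_punct<'a>(token: TokenRef<'a>) -> Vec<TokenRef<'a>> {
    split_on_char(
        token,
        |c: &char| c.is_ascii_punctuation(), // test deciding where to split
        true,                                // emit the separators as tokens of their own
        Mask::Punctuation,                   // mask assigned to those separator tokens
    )
}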

One of the most important changes is that the tokenizer doesn't modify the source text at all (which would ruin the offsets); it eventually only modifies some tokens (lowercasing, stripping accents, etc.), but this is done as late as possible, so everything in the earlier stages makes no string copies.
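
In other words (a hedged sketch, with field names assumed from the description above rather than taken from the actual code), earlier stages keep borrowing from the input and only the last step allocates:

// Earlier stages pass TokenRef<'a> around, borrowing from the original input;
// only the final normalisation step materialises an owned Token.
fn finalize(token: TokenRef<'_>, lower_case: bool) -> Token {
    let text = if lower_case {
        token.text.to_lowercase() // first (and only) allocation for this token
    } else {
        token.text.to_owned()
    };
    Token { text, offset: token.offset, mask: token.mask }
}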

Building a tokenizer can now be done quite nicely by just chaining these steps, as can be seen for the base tokenizer and for BERT.

I still have to work on all the other tokenizers, so that will probably take a bit; I'm sure there will be some challenges there too.

proycon added commits to proycon/rust-bert that referenced this issue, May 7, 2020:

  • …sponsible for it), will be reintroduced once the tokenizers provide the offset information (guillaume-be/rust-tokenizers#14), issue guillaume-be#29
@guillaume-be (Owner) commented:

This looks great. I really like the implementation and the fact that data is cloned only at a later stage.
The main difference with the other tokenizers is the use of BPE instead of wordpiece, which may cause some additional trouble with the offsets as these methods are byte-based.

I have updated the CI to run a validation between the reference Transformers tokenizers and the Rust-based tokenizers, which may be useful if you'd like to validate that the output remains unchanged. These integration tests also contain various benchmarks comparing the Python and Rust implementations. One thing I would like to check (but I can drive this) is to ensure that the performance benefits of the Rust implementation over Python remain in the same order of magnitude. If you are interested, these are rather easy to set up once a version of the library compiles.

proycon added a commit to proycon/rust-tokenizers that referenced this issue May 7, 2020
@proycon (Contributor, Author) commented May 7, 2020

I have updated the CI to run a validation between the reference Transformers tokenizers and the Rust-based tokenizers, which may be useful if you'd like to validate that the output remains unchanged. These integration tests also contain various benchmarks comparing the Python and Rust implementations. One thing I would like to check (but I can drive this) is to ensure that the performance benefits of the Rust implementation over Python remain in the same order of magnitude. If you are interested, these are rather easy to set up once a version of the library compiles.

That's a very good idea, yes! I dare not really say yet whether things will be faster or slower. Of course adding the offsets everywhere adds quite some complexity, which will come at a cost, but I'm also optimizing some things. Then again, especially in the commit I just did, I think there are probably parts that can be optimized further (too much copying going on still for my taste).

I'm adapting the tests to include the offsets as I go, and trying to make it very explicit whenever I change an outcome (such as https://github.com/proycon/rust-tokenizers/blob/offsets/main/src/preprocessing/tokenizer/tokenization_utils.rs#L1575).

The main difference with the other tokenizers is the use of BPE instead of wordpiece, which may cause some additional trouble with the offsets as these methods are byte-based.

Right, we may need a better test for that too in 'test_group_common_pairs' and 'test_bpe', as right now they only cover ASCII input and we'll need to ensure things hold up for multibyte characters too.
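
Something along these lines might do as a starting point (a hedged sketch; the real tests and expected offsets live in tokenization_utils.rs):

#[test]
fn test_offsets_are_char_based_for_multibyte_input() {
    let text = "Héllo 🙂";
    // 'é' is 2 bytes and '🙂' is 4 bytes, so byte and char positions diverge here.
    assert_eq!(text.len(), 11);          // length in bytes
    assert_eq!(text.chars().count(), 7); // length in chars
    // A char-based whitespace split should therefore report
    // "Héllo" at (0, 5) and "🙂" at (6, 7), not their byte positions.
}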

@proycon (Contributor, Author) commented May 7, 2020

By the way, you're also aware of https://github.com/huggingface/tokenizers I assume? It's even implemented in Rust. I haven't looked at their code yet though. It won't hurt to have two implementations anyway :)

@guillaume-be (Owner) commented May 7, 2020

Yes - I am aware of the tokenizers package from Huggingface. The present crate is focused on inference (it does not include the training of tokenizers) and follows a structure that is closer to that of the Python baseline. I tried to clearly separate the vocabulary and the definition of special tokens from the tokenization routines, which I believe adds value: it allows manipulating the latter as separate objects with a different behaviour in the deep learning crate.

The present crate also offers (as far as I know) identical tokenization to the Python baselines. There are a few cases where the tokenizers crate deviates from the "classical" implementation of tokenizers by design (see for example huggingface/tokenizers#240).

The deep learning crate currently has a clear focus on inference, and I believe this implementation of tokenizers is a good match, so I believe both have their purpose for now. For the proposed refactoring, maybe it is best not to get too deep into the other code base; I'd like to keep the two implementations fairly different :)

proycon added a series of commits to proycon/rust-tokenizers that referenced this issue, May 7-9, 2020:

…rough the TokenizedInput structure, this work constitutes major refactoring of the tokenisation logic and introduces new shared functions that are all offset (and mask) aware. guillaume-be#14

  • initial setup of structures to handle offsets guillaume-be#14
  • A major batch of (untested!) work on the tokenisers to incorporate offset information (to be continued...) guillaume-be#14
  • made the truncate* functions also work without offsets if needed
  • some error fixing
  • some initial test compilation fixes
  • another round of fixes
  • fixed test compilation
  • refactored tokenize_with_offsets to be nicer
  • don't lowercase special tokens
  • various test fixes + base tokenizer cleans text at a later stage than it used to (offsets are to uncleaned original) guillaume-be#14
  • major refactoring, introducing the Token and TokenRef structs and trying to simplify things and have elegant tokenization chains guillaume-be#14
  • removed a reference
  • fix and optimisation
  • fixed test_truncate_single_sentence
  • fixed test_encode_single_sentence
  • adding some copied documentation from the original huggingface transformers guillaume-be#14
  • adapting build_input_with_special_tokens to properly support offsets + turning output_offsets into a Vec<Option<Offset>> instead of Vec<Offset>, because for some special tokens like BOS/EOS markers offsets make little sense guillaume-be#14
  • cleanup
  • fixed some tests (added missing reference output for offsets)
  • last fixes for BERT tokenizer (tests are now green) guillaume-be#14
  • fix
  • Implementing offset support in auxiliary functions needed by all the other tokenizers guillaume-be#14
  • fixed split_on_regex
  • implementing new GPT2 tokeniser (not done yet, some test failures) guillaume-be#14
  • Adapting roberta tokenizer to use offsets guillaume-be#14 (tests still show a char vs byte problem to solve)
  • Added offset tests to test_base_tokenizer guillaume-be#14
  • fixed some tests but a new byte/char problem surfaced in the wordpiece tokenizer guillaume-be#14
  • made tokenize_wordpiece properly distinguish bytes and chars guillaume-be#14
  • Fixed Roberta tokenizer, introduced new masks: InexactBegin and InexactContinuation, subtokens carrying this mask have their offset refer to the entire token they are part of, rather than to a specific part of the token. guillaume-be#14
  • further implemented masks and propagate them to TokenizedInput guillaume-be#14
  • minor cleanup and logic fix