Character offset information #14

Closed
proycon opened this issue May 4, 2020 · 10 comments

@proycon (Contributor) commented May 4, 2020

It would help if the tokenisers could maintain and output character offset information, with respect to the original input. This would help greatly to align information back together at the end of an NLP pipeline.

Possible points for discussion:

(this continues part of the discussion from guillaume-be/rust-bert#29)

proycon added a commit to proycon/rust-tokenizers that referenced this issue May 5, 2020
@proycon (Contributor, Author) commented May 5, 2020

I started to work on this today, making some progress already...

@guillaume-be (Owner) commented May 5, 2020

Hello,

First of all - another thank you for getting involved in the project. Your input and ideas are very valuable. I do apologize for the somewhat delayed feedback as I am working full-time and this is a personal project.

My preference would be to have offsets correspond to the position of Rust char (unicode point). For example, the sentence `"Hello, 🙂 world!"` would be tokenized as:
[(Hello, (0, 5)), (,, (5, 6)), (🙂, (7, 8)), (world, (9, 14)), (!, (14, 15))]
I think it is the most expected behaviour and works well with the tokenization algorithms that work mostly on a character level (at least for wordpiece).
The special tokens added during encoding ([CLS] or [SEP]) should have offsets of (0,0) as they are not in the original sentence anyway.
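
As a rough illustration of this convention, here is a minimal, self-contained sketch (not the crate's API; it only splits on whitespace, whereas the real tokenizers also split punctuation) of counting offsets in chars rather than bytes:

fn whitespace_tokens_with_char_offsets(text: &str) -> Vec<(&str, (usize, usize))> {
    // Offsets are counted in Unicode scalar values (Rust chars), not bytes,
    // so "🙂" advances the position by exactly one.
    let mut tokens = Vec::new();
    let mut char_pos = 0usize;
    let mut start: Option<(usize, usize)> = None; // (byte index, char index) of token start
    for (byte_idx, c) in text.char_indices() {
        if c.is_whitespace() {
            if let Some((start_byte, start_char)) = start.take() {
                tokens.push((&text[start_byte..byte_idx], (start_char, char_pos)));
            }
        } else if start.is_none() {
            start = Some((byte_idx, char_pos));
        }
        char_pos += 1;
    }
    if let Some((start_byte, start_char)) = start {
        tokens.push((&text[start_byte..], (start_char, char_pos)));
    }
    tokens
}

// whitespace_tokens_with_char_offsets("Hello, 🙂 world!")
// => [("Hello,", (0, 6)), ("🙂", (7, 8)), ("world!", (9, 15))]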

I have had a look at your initial work on the offsets and I like it a lot. I have a few starting comments on what I have seen so far:

  • I think an Offset (start_pos (inclusive), end_pos (exclusive)) would be nicer to work with instead of start and length
  • I love how you kept the API the same with tokenize and encode still working as expected.
  • The decode to vec is a great idea!

Now the hard part remains the calculation and propagation of offsets through multiple steps. I started to look into it last night, at the split_on_special_tokens level. It is not finalized yet, but feel free to have a look: https://github.com/guillaume-be/rust-tokenizers/tree/tokenization_offsets.

How would you like to work on this? It requires a substantial amount of effort - but I believe you currently have more capacity to execute on it quickly. If you can think of a way I could contribute or support, please let me know.

Thanks again for this.

@proycon (Contributor, Author) commented May 5, 2020

First of all - another thank you for getting involved in the project. Your input and ideas are very valuable. I do apologize for the somewhat delayed feedback as I am working full-time and this is a personal project.

No problem! I fully understand. I'm counting this as work and have some time to spare on it, so I don't mind taking this on and am already well underway (I haven't pushed the latest work yet because I'm still in the middle of a big overhaul). I'm currently working on all the low-level functions in tokenization_utils.

I think an Offset (start_pos (inclusive), end_pos (exclusive)) would be nicer to work with instead of start and length

I also considered that. I agree it may look nicer, but the only reason I opted for length was that I could then get away with storing less (a u8 would do as long as max_word_len never exceeds the u8 range) and therefore gain some memory benefits. Maybe it's premature optimisation, but I imagined a user loading and tokenizing a whole corpus at once. What do you think?

My preference would be to have offsets correspond to the position of Rust char (unicode point).

Right, I'm currently using bytes but I agree that that's perhaps too low-level and unicode points may be better.

@guillaume-be (Owner) commented May 5, 2020

I also considered that. I agree it may look nicer, but the only reason I opted for length was that I could then get away with storing less (a u8 would do as long as max_word_len never exceeds the u8 range) and therefore gain some memory benefits. Maybe it's premature optimisation, but I imagined a user loading and tokenizing a whole corpus at once. What do you think?

I agree with the memory argument. How about reducing the size of the start_offset from u64 to u32 (we should be safe until we get a user encoding a sentence 18,446,744,073,709,551,615 characters long) and having both the start and end offsets as u32? It somehow feels like a more natural way to report offsets, albeit, I agree, less efficient.
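
For what it's worth, a quick back-of-the-envelope comparison of the two layouts under discussion (hypothetical struct names, assuming a u64 start for the start + length variant as mentioned above):

use std::mem::size_of;

// Hypothetical layouts for comparison only (not the actual types in the crate).
struct OffsetStartLen { start: u64, length: u8 } // start + length, as first proposed
struct OffsetStartEnd { begin: u32, end: u32 }   // (start, end) as u32, as suggested above

fn main() {
    // Due to alignment/padding the u64 + u8 variant takes 16 bytes,
    // while the (u32, u32) variant takes 8.
    println!("start + length: {} bytes", size_of::<OffsetStartLen>());
    println!("(u32, u32):     {} bytes", size_of::<OffsetStartEnd>());
}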

Right, I'm currently using bytes but I agree that that's perhaps too low-level and unicode points may be better.

I think unicode points are just so much more visual when tokenizing a text and double-checking the offsets. If it's not too much work it'd be great if you could switch :)

@proycon (Contributor, Author) commented May 5, 2020

Yeah, no problem, I'll switch it to (u32,u32) and encode endpoint instead of length.
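
A minimal sketch of what such an offset type could look like (illustrative field names, not necessarily the final API):

/// Character-based offset of a token into the original input,
/// counted in Unicode scalar values (Rust chars), not bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Offset {
    pub begin: u32, // first character of the token (inclusive)
    pub end: u32,   // one past the last character (exclusive)
}

impl Offset {
    pub fn new(begin: u32, end: u32) -> Offset {
        Offset { begin, end }
    }

    /// Length of the token in characters.
    pub fn length(&self) -> u32 {
        self.end - self.begin
    }
}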

proycon added a series of commits to proycon/rust-tokenizers and proycon/rust-bert that referenced this issue, May 5-6, 2020, including:

  • …ing to simplify things and have elegant tokenization chains guillaume-be#14
  • …+ turning output_offsets into a Vec<Option<Offset>> instead of Vec<Offset>, because for some special tokens like BOS/EOS markers offsets make little sense guillaume-be#14
  • …sponsible for it), will be reintroduced once the tokenizers provide the offset information (guillaume-be/rust-tokenizers#14), issue guillaume-be#29
@proycon (Contributor, Author) commented May 6, 2020

@guillaume-be Here's a status update: I'm a good way through the process and I've done several rounds of refactoring. Basically I performed heart surgery on the tokenizer, and it's now powered internally by two (very similar) methods:

pub fn split_on_char<'a, F>(token: TokenRef<'a>, test_character: F, add_separators: bool, set_mask: Mask) -> Vec<TokenRef<'a>> 
  where F: Fn(&char) -> bool;

pub fn split_on_substr<'a, F>(token: TokenRef<'a>, test_substr: F, add_separators: bool, set_mask: Mask) -> Vec<TokenRef<'a>>
  where F: Fn(&'a str) -> (usize, usize); // returns length of the match in bytes and in chars

All higher-level functions (split_on_punct, tokenize_whitespace, tokenize_cjk_chars, split_on_special_tokens) simply call one of these two main functions and pass in a simple test function.

The TokenRef and Token structs encapsulate a token's text (&str vs String), its offset, and a special mask to keep track of whether the token is unprocessed or, if not, what kind of token it is:

pub enum Mask {
    None,
    Whitespace,
    Punctuation,
    CJK,
    Special,
    Continuation,
    Unknown,
}
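
For example, a wrapper like split_on_punct might boil down to little more than a predicate passed to split_on_char. This is a hypothetical sketch against the signature above, not the actual implementation (which would use a proper Unicode punctuation test rather than is_ascii_punctuation):

pub fn split_on_punct<'a>(token: TokenRef<'a>) -> Vec<TokenRef<'a>> {
    split_on_char(
        token,
        |c: &char| c.is_ascii_punctuation(), // test deciding where to split
        true,                                // emit the separators as tokens of their own
        Mask::Punctuation,                   // mask assigned to those separator tokens
    )
}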

One of the most important changes is that the tokenizer doesn't modify the source text at all (which would ruin the offsets); it eventually only modifies some tokens (lowercasing, stripping accents, etc.), but this is done as late as possible, so everything in the earlier stages makes no string copies.
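
In other words (a hedged sketch, with field names assumed from the description above rather than taken from the actual code), earlier stages keep borrowing from the input and only the last step allocates:

// Earlier stages pass TokenRef<'a> around, borrowing from the original input;
// only the final normalisation step materialises an owned Token.
fn finalize(token: TokenRef<'_>, lower_case: bool) -> Token {
    let text = if lower_case {
        token.text.to_lowercase() // first (and only) allocation for this token
    } else {
        token.text.to_owned()
    };
    Token { text, offset: token.offset, mask: token.mask }
}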

Building a tokenizer can now be done quite nicely by just chaining these steps, as can be seen for the base tokenizer and for BERT.

I still have to work on all the other tokenizers, so that will probably take a bit; I'm sure there will be some challenges there too.

proycon added commits to proycon/rust-bert that referenced this issue, May 7, 2020:

  • …sponsible for it), will be reintroduced once the tokenizers provide the offset information (guillaume-be/rust-tokenizers#14), issue guillaume-be#29
@guillaume-be (Owner) commented:

This looks great. I really like the implementation and the fact that data is cloned only at a later stage.
The main difference with the other tokenizers is the use of BPE instead of wordpiece, which may cause some additional trouble with the offsets as these methods are byte-based.

I have updated the CI to run a validation between the reference Transformers tokenizers and the Rust-based tokenizers, which may be useful if you'd like to validate that the output remains unchanged. These integration tests also contain various benchmarks comparing the Python and Rust implementations. One thing I would like to check (but I can drive this) is to ensure that the performance benefits of the Rust implementation over Python remain in the same order of magnitude. If you are interested, these are rather easy to set up once a version of the library compiles.

proycon added a commit to proycon/rust-tokenizers that referenced this issue May 7, 2020
@proycon (Contributor, Author) commented May 7, 2020

I have updated the CI to run a validation between the reference Transformers tokenizers and the Rust-based tokenizers, which may be useful if you'd like to validate that the output remains unchanged. These integration tests also contain various benchmarks comparing the Python and Rust implementations. One thing I would like to check (but I can drive this) is to ensure that the performance benefits of the Rust implementation over Python remain in the same order of magnitude. If you are interested, these are rather easy to set up once a version of the library compiles.

That's a very good idea, yes! I dare not really say yet whether things will be faster or slower. Of course adding the offsets everywhere adds quite some complexity, which will come at a cost, but I'm also optimizing some things. Then again, especially in the commit I just did, I think there are probably parts that can be optimized further (too much copying going on still for my taste).

I'm adapting the tests to include the offsets as I go, and trying to make it very explicit whenever I change an outcome (such as https://github.com/proycon/rust-tokenizers/blob/offsets/main/src/preprocessing/tokenizer/tokenization_utils.rs#L1575).

The main difference with the other tokenizers is the use of BPE instead of wordpiece, which may cause some additional trouble with the offsets as these methods are byte-based.

Right, we may need a better test for that too in 'test_group_common_pairs' and 'test_bpe', as right now they only cover ASCII input and we'll need to ensure things hold up for multibyte characters too.
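
Something along these lines might do as a starting point (a hedged sketch; the real tests and expected offsets live in tokenization_utils.rs):

#[test]
fn test_offsets_are_char_based_for_multibyte_input() {
    let text = "Héllo 🙂";
    // 'é' is 2 bytes and '🙂' is 4 bytes, so byte and char positions diverge here.
    assert_eq!(text.len(), 11);          // length in bytes
    assert_eq!(text.chars().count(), 7); // length in chars
    // A char-based whitespace split should therefore report
    // "Héllo" at (0, 5) and "🙂" at (6, 7), not their byte positions.
}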

@proycon (Contributor, Author) commented May 7, 2020

By the way, you're also aware of https://github.com/huggingface/tokenizers I assume? It's even implemented in Rust. I haven't looked at their code yet though. It won't hurt to have two implementations anyway :)

@guillaume-be (Owner) commented May 7, 2020

Yes - I am aware of the tokenizers package from Huggingface. The present crate is focused on inference (it does not include the training of tokenizers) and follows a structure that is closer to that of the Python baseline. I tried to clearly separate the vocabulary and the definition of special tokens from the tokenization routines, which I believe adds value: it allows manipulating the latter as separate objects with a different behaviour in the deep learning crate.

The present crate also offers (as far as I know) identical tokenization to the Python baselines. There are a few cases where the tokenizers crate deviates from the "classical" implementation of tokenizers by design (see for example huggingface/tokenizers#240).

The deep learning crate currently has a clear focus on inference, and I believe this implementation of tokenizers is a good match, so I believe both have their purpose for now. For the proposed refactoring, maybe it is best not to get too deep into the other code base; I'd like to keep the two implementations fairly different :)

proycon added a series of commits to proycon/rust-tokenizers that referenced this issue, May 7-9, 2020:

…rough the TokenizedInput structure, this work constitutes major refactoring of the tokenisation logic and introduces new shared functions that are all offset (and mask) aware. guillaume-be#14

  • initial setup of structures to handle offsets guillaume-be#14
  • A major batch of (untested!) work on the tokenisers to incorporate offset information (to be continued...) guillaume-be#14
  • made the truncate* functions also work without offsets if needed
  • some error fixing
  • some initial test compilation fixes
  • another round of fixes
  • fixed test compilation
  • refactored tokenize_with_offsets to be nicer
  • don't lowercase special tokens
  • various test fixes + base tokenizer cleans text at a later stage than it used to (offsets are to uncleaned original) guillaume-be#14
  • major refactoring, introducing the Token and TokenRef structs and trying to simplify things and have elegant tokenization chains guillaume-be#14
  • removed a reference
  • fix and optimisation
  • fixed test_truncate_single_sentence
  • fixed test_encode_single_sentence
  • adding some copied documentation from the original huggingface transformers guillaume-be#14
  • adapting build_input_with_special_tokens to properly support offsets + turning output_offsets into a Vec<Option<Offset>> instead of Vec<Offset>, because for some special tokens like BOS/EOS markers offsets make little sense guillaume-be#14
  • cleanup
  • fixed some tests (added missing reference output for offsets)
  • last fixes for BERT tokenizer (tests are now green) guillaume-be#14
  • fix
  • Implementing offset support in auxiliary functions needed by all the other tokenizers guillaume-be#14
  • fixed split_on_regex
  • implementing new GPT2 tokeniser (not done yet, some test failures) guillaume-be#14
  • Adapting roberta tokenizer to use offsets guillaume-be#14 (tests still show a char vs byte problem to solve)
  • Added offset tests to test_base_tokenizer guillaume-be#14
  • fixed some tests but a new byte/char problem surfaced in the wordpiece tokenizer guillaume-be#14
  • made tokenize_wordpiece properly distinguish bytes and chars guillaume-be#14
  • Fixed Roberta tokenizer, introduced new masks: InexactBegin and InexactContinuation, subtokens carrying this mask have their offset refer to the entire token they are part of, rather than to a specific part of the token. guillaume-be#14
  • further implemented masks and propagate them to TokenizedInput guillaume-be#14
  • minor cleanup and logic fix