Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pre_tokenizers] Fix sentencepiece based Metaspace #1357

Merged
merged 49 commits into from
Nov 14, 2023

Conversation

ArthurZucker
Copy link
Collaborator

@ArthurZucker ArthurZucker commented Oct 7, 2023

Fix the issue with meta spaces that applies the pre_tokenization to all the sub strings.

use tokenizers::pre_tokenizers::metaspace::{Metaspace, PrependScheme};
use tokenizers::{PreTokenizedString, PreTokenizer};
use regex::Regex;
let pretok = Metaspace::new_with_prepend_scheme('▁', true, PrependScheme::Always)
let mut pretokenized = PreTokenizedString::from("Hey my friend <s>how▁are you");
let re_ref = Regex::new(r"(<s>)").unwrap();
pretokenized
  .split(|_, sequence| sequence.split(&re_ref, SplitDelimiterBehavior::Isolated))
  .expect("AddedVocabulary bad split");
pretok.pre_tokenize(&mut pretokenized).unwrap();

with legacy:

vec![("▁Hey", (0, 6)),("▁my", (6, 11)),("▁friend", (11, 20)),("▁", (20, 23)),("<s>", (23, 26)),("how", (26, 29)),("▁are", (29, 35)),("▁you", (35, 41))]);

without legacy:

 vec![("▁Hey", (0, 6)),("▁my", (6, 11)),("▁friend", (11, 20)),("▁", (20, 23)),("▁<s>", (23, 29)),("▁how", (29, 35)),("▁are", (35, 41)),("▁you", (41, 47))]);

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Oct 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker changed the title Fix sentencepiece bases Metaspace Fix sentencepiece based Metaspace Oct 8, 2023
@ArthurZucker
Copy link
Collaborator Author

Serialization is not working properly yet! Will have to fix this

@ArthurZucker ArthurZucker marked this pull request as ready for review November 10, 2023 17:20
tokenizers/src/pre_tokenizers/metaspace.rs Outdated Show resolved Hide resolved
tokenizers/src/pre_tokenizers/metaspace.rs Outdated Show resolved Hide resolved
tokenizers/src/pre_tokenizers/metaspace.rs Show resolved Hide resolved
tokenizers/src/pre_tokenizers/mod.rs Outdated Show resolved Hide resolved
bindings/python/src/pre_tokenizers.rs Outdated Show resolved Hide resolved
bindings/python/src/pre_tokenizers.rs Outdated Show resolved Hide resolved
bindings/python/src/pre_tokenizers.rs Outdated Show resolved Hide resolved
bindings/python/src/pre_tokenizers.rs Outdated Show resolved Hide resolved
tokenizers/src/pre_tokenizers/metaspace.rs Show resolved Hide resolved
tokenizers/src/pre_tokenizers/metaspace.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last tiny one.

bindings/python/src/pre_tokenizers.rs Outdated Show resolved Hide resolved
@ArthurZucker ArthurZucker changed the title Fix sentencepiece based Metaspace [pre_tokenizers] Fix sentencepiece based Metaspace Nov 14, 2023
ArthurZucker and others added 2 commits November 14, 2023 17:52
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants