
Reading SentencePieceVocab from text file #51

Open

MikaelCall opened this issue Apr 19, 2021 · 5 comments

@MikaelCall

I've created a SentencePiece model using Python, which produces a .model and a .vocab file. It is not possible to create a SentencePieceVocab from the latter, since the Python trainer writes the vocab as a plain text file rather than protobuf. Here's an excerpt of my file:

<unk>	0
<s>	0
</s>	0
▁	-2.29038
s	-3.10405
l	-3.41047

I didn't find an option in the Python code for creating a protobuf vocab file, so I wrote a parser. Unless I'm mistaken and did something wrong, would you like that code as a PR? I.e. something like:

impl SentencePieceVocab {
    ...
    
    pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> { 
        ... 
    }
}

in rust-tokenizers/main/src/vocab/sentence_piece_vocab.rs

@guillaume-be
Owner

Hello @MikaelCall ,

This is a good idea - it would be great if you could contribute your code back to the community!
I have a few questions on my side regarding this implementation:

  • The user typically wants to create a Tokenizer, which internally creates a Vocab. I believe handling this additional format would need to be propagated to the tokenizers as well. Usually the Tokenizer contains both a SentencePieceModel and a Vocab, which for some tokenizers are generated from the same file (for example XLNetTokenizer). Does this mean we need to add additional parsing capabilities to all of the Vocab structs of SentencePiece-based tokenizers (i.e., updating the ::from_file method of XLNetVocab)?
  • The loading of SentencePiece files (from proto or text file) could be shared via 2 traits:
    • FromProto, exposing methods generate_vocab_from_proto (returning a HashMap) and generate_vocab_with_scores_from_proto (returning a Trie)
    • FromText, exposing methods generate_vocab_from_text and generate_vocab_with_scores_from_text

The Vocab structs (e.g. XLNetVocab, XLMRobertaVocab, ...) and SentencePieceModel would implement these traits to load the files and create the intermediate HashMap required for their internal storage; a rough sketch follows below. Alternatively, the traits could be arranged as TrieFromFile and VocabFromFile, re-arranging the above methods (Vocabs would implement VocabFromFile and SentencePieceModel would implement TrieFromFile).
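
To make the first arrangement concrete, here is a rough sketch of what the two traits could look like; the signatures, the TokenizerError variant usage, and the Trie type are assumptions for illustration, not the current crate API:

use std::collections::HashMap;

// Hypothetical traits splitting the two on-disk formats; TokenizerError
// and Trie are assumed to be the crate's error and trie types.
pub trait FromProto {
    fn generate_vocab_from_proto(path: &str) -> Result<HashMap<String, i64>, TokenizerError>;
    fn generate_vocab_with_scores_from_proto(path: &str) -> Result<Trie, TokenizerError>;
}

pub trait FromText {
    fn generate_vocab_from_text(path: &str) -> Result<HashMap<String, i64>, TokenizerError>;
    fn generate_vocab_with_scores_from_text(path: &str) -> Result<Trie, TokenizerError>;
}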

  • For the additional file support, there may be 2 ways to implement it:
    • A separate public method from_vocab_txt_file, as you suggested
    • A unique public method from_file that tries loading the file as a protobuf, and falls back to text-file loading if protobuf parsing fails. The unique entry point would probably call 2 specialized functions to try loading the file, but would allow keeping the API unchanged. The advantage is that the rust-bert pipelines loading tokenizer files would not need to know which format the vocab is stored in. We may still want to expose from_vocab_txt_file publicly, but supporting both formats in a single loading method offers a more consistent API; a sketch follows below. What do you think?
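
A minimal sketch of that unified entry point, assuming from_proto_file and from_vocab_txt_file exist as the two specialized loaders (neither name is in the current API):

impl SentencePieceVocab {
    pub fn from_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> {
        // Try the protobuf format first; if parsing fails, fall back
        // to the plain-text .vocab format.
        Self::from_proto_file(path).or_else(|_| Self::from_vocab_txt_file(path))
    }
}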

Potential for a conversion utility

This is, however, possibly more complex than simply converting the file to a protobuf. Maybe a Python-based library allowing conversion from text file to proto (and the other way around) would be generally valuable; I believe the community may be interested in such a tool, as it could be more broadly applicable (e.g. to Python users). I haven't tried it, but something along the lines of the following script could work:

import sentencepiece_model_pb2 as model

m = model.ModelProto()

# load_tokens_and_score_from_text is a placeholder for a parser of the
# tab-separated .vocab format, yielding (piece, score) pairs
tokens = load_tokens_and_score_from_text(filepath)

for token, score in tokens:
    new_token = model.ModelProto.SentencePiece()
    new_token.piece = token
    new_token.score = score
    m.pieces.append(new_token)

with open('new.model', 'wb') as f:
    f.write(m.SerializeToString())

Please let me know what you think!

@MikaelCall
Author

I'm not familiar with the design and use cases, so I unfortunately can't give any useful input on how to arrange the traits.

Concerning the additional file support and the 2 ways to implement it: I think your second suggestion would be very easy to implement, and I'd be willing to submit a PR for it as long as you think it is a clean solution that doesn't interfere with your design.

I also agree that a conversion tool could be useful, or at least the ability to choose the output format via optional arguments in the current CLI tools.

@failable commented Mar 2, 2022

How can one create a spiece.model from a vocab.txt (example) now?

@tobygodwin

Hi @MikaelCall

Is there any chance you could share your parser code? You mentioned in your first post that you'd already written it. Thanks!

impl SentencePieceVocab {
    ...
    
    pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> { 
        ... 
    }
}

@MikaelCall
Author

It's been quite some time since I looked at it. This is what I was able to dig up:

    /// Read a plain-text .vocab file for SentencePiece tokenization.
    /// Each line contains a token and its score separated by whitespace;
    /// the line number is used as the token id.
    fn read_vocab_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> {
        let f = File::open(path).map_err(|e| {
            TokenizerError::FileNotFound(format!("{} vocabulary file not found: {}", path, e))
        })?;
        let br = BufReader::new(f);
        let mut values = HashMap::new();

        for (index, line) in br.lines().enumerate() {
            let line = match line {
                Ok(value) => value,
                Err(e) => {
                    return Err(TokenizerError::VocabularyParsingError(e.to_string()));
                }
            };

            // The token is the first whitespace-separated field; the
            // score in the second field is ignored here.
            let token = line
                .split_whitespace()
                .next()
                .ok_or_else(|| TokenizerError::VocabularyParsingError(line.clone()))?
                .trim();

            if values.insert(token.to_owned(), index as i64).is_some() {
                return Err(TokenizerError::VocabularyParsingError(format!(
                    "duplicate token {} in vocabulary file",
                    token
                )));
            }
        }

        let mut special_values = HashMap::new();
        let unknown_value = SentencePieceVocab::unknown_value();
        SentencePieceVocab::_register_as_special_value(
            unknown_value,
            &values,
            &mut special_values,
        )?;

        let indices = Self::swap_key_values(&values);
        let special_indices = Self::swap_key_values(&special_values);

        Ok(SentencePieceVocab {
            values,
            indices,
            unknown_value,
            special_values,
            special_indices,
        })
    }
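
In case it helps, a hypothetical call site (the file name is an example, and the function is assumed to be exposed as an associated function on SentencePieceVocab):

// Hypothetical usage; "spiece.vocab" is an example path.
let vocab = SentencePieceVocab::read_vocab_file("spiece.vocab")?;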
