Reading SentencePieceVocab from text file #51
Comments
Hello @MikaelCall, this is a good idea. It would be great if you could contribute your code back to the community!

**Potential for a conversion utility**

This is, however, possibly more complex than converting the file to a Protobuf. A Python-based library allowing conversion from a text file to a Proto (and the other way around) might be generally valuable. I believe the community could be interested in such a tool, since it would be more broadly applicable (e.g. to Python users). I haven't tried it, but the following script gives an indication of what something along those lines could look like:

```python
import sentencepiece_model_pb2 as model

m = model.ModelProto()
tokens = load_tokens_and_score_from_text(filepath)
for token in tokens:
    new_token = model.ModelProto.SentencePiece()
    new_token.piece = token
    new_token.score = 0
    m.pieces.append(new_token)

with open('new.model', 'wb') as f:
    f.write(m.SerializeToString())
```

Please let me know what you think.
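Note that `load_tokens_and_score_from_text` in the script above is a hypothetical helper, not part of any library. A minimal sketch of it, assuming the SentencePiece `.vocab` text format of one tab-separated `token<TAB>score` entry per line, could look like:

```python
def load_tokens_and_score_from_text(filepath):
    """Parse a SentencePiece-style .vocab text file into (token, score) pairs.

    Assumes one tab-separated `token<TAB>score` entry per line; the score
    column is treated as optional and defaults to 0.0.
    """
    pairs = []
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            token, _, score = line.partition("\t")
            pairs.append((token, float(score) if score else 0.0))
    return pairs
```

The script above only uses the token and hard-codes the score to 0, so the score column returned here could be dropped or carried through to `new_token.score` instead.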
I'm not familiar with the design and use cases, so unfortunately I can't give any useful input on how to update the traits. Concerning the additional file support and the two ways to implement it: I think your second suggestion would be very easy to implement, and I'd be willing to submit a PR for it, as long as you think it is a clean solution that doesn't interfere with your design. I also agree that a conversion tool could be useful, or at least the ability to choose the output format via optional arguments in the current CLI tools.
Hi @MikaelCall, is there any chance you could share your parser code? You mentioned in your first post that you had already written it. Thanks!
It's been quite some time since I looked at it, but this is what I was able to dig up:

```rust
/// Read a vocab file for SentencePiece tokenization
fn read_vocab_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> {
    let f = File::open(path).map_err(|e| {
        TokenizerError::FileNotFound(format!("{} vocabulary file not found: {}", path, e))
    })?;
    let br = BufReader::new(f);
    let mut values = HashMap::new();
    for (index, line) in br.lines().enumerate() {
        let line = match line {
            Ok(value) => value,
            Err(e) => {
                return Err(TokenizerError::VocabularyParsingError(e.to_string()).into());
            }
        };
        // The token is the first whitespace-separated field on the line.
        let token = line
            .split_whitespace()
            .next()
            .ok_or_else(|| TokenizerError::VocabularyParsingError(line.clone()))?
            .trim();
        // A duplicate token indicates a malformed vocabulary file.
        if values.insert(token.to_owned(), index as i64).is_some() {
            return Err(TokenizerError::VocabularyParsingError(format!(
                "duplicate token {} in vocabulary file",
                token
            ))
            .into());
        }
    }
    let mut special_values = HashMap::new();
    let unknown_value = SentencePieceVocab::unknown_value();
    SentencePieceVocab::_register_as_special_value(
        unknown_value,
        &values,
        &mut special_values,
    )?;
    let indices = Self::swap_key_values(&values);
    let special_indices = Self::swap_key_values(&special_values);
    Ok(SentencePieceVocab {
        values,
        indices,
        unknown_value,
        special_values,
        special_indices,
    })
}
```
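For anyone prototyping the Python conversion tool discussed earlier in this thread, a rough Python equivalent of this parser (a sketch under the same assumptions, not any library's actual API) might be:

```python
def read_vocab_file(path):
    """Map each token (first whitespace-separated field per line) to its line index.

    Mirrors the error paths of the Rust parser above: a line with no token and
    a duplicate token both abort parsing.
    """
    values = {}
    with open(path, encoding="utf-8") as f:
        for index, line in enumerate(f):
            fields = line.split()
            if not fields:
                raise ValueError(f"could not parse vocabulary line: {line!r}")
            token = fields[0]
            if token in values:
                raise ValueError(f"duplicate token {token!r} in vocabulary file")
            values[token] = index
    return values
```

The reverse lookup (index to token) that the Rust code builds with `swap_key_values` would just be `{i: t for t, i in values.items()}` in Python.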
I've created a SentencePiece model using Python, which results in a `.model` and a `.vocab` file. It is not possible to create a `SentencePieceVocab` from the latter, since Python does not seem to use protobuf for it but rather a plain text file. Here's an excerpt of my file:

I didn't find an option in the Python code for creating a protobuf vocab file, so I wrote a parser. Unless I'm mistaken and did something wrong, would you like that code as a PR? I.e. something like the above in `rust-tokenizers/main/src/vocab/sentence_piece_vocab.rs`.