
Converting into an ONNX model #13

Open
sehHeiden opened this issue Sep 4, 2023 · 1 comment

Comments

@sehHeiden

Could I add an ONNX export version?

My current attempt is:

import json
import torch
import germansentiment

# Initialize the model
model = germansentiment.SentimentModel()

# Dummy input that matches the input dimensions of the model
dummy_input = torch.randint(0, 30_000, (1, 512), dtype=torch.long)

# Export to ONNX
torch.onnx.export(model.model, dummy_input, "german_sentiment_model.onnx")

# Export the vocab
with open('vocab.json', 'w') as f:
    json.dump(model.tokenizer.vocab, f)
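One possible source of trouble with an export like the one above: only `input_ids` is traced, but the underlying BERT model also accepts an `attention_mask`, and without one the padded positions are treated as real tokens, which can change the predictions. As an illustration (function and names are made up for this sketch), padding and the matching mask in pure Python:

```python
def pad_with_mask(token_ids, max_len=512, pad_id=0):
    """Pad a list of token ids to max_len and build the matching attention mask.

    The mask is 1 for real tokens and 0 for padding, so the model can ignore
    the padded positions.
    """
    ids = token_ids[:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    return ids, mask

# [CLS] ... [SEP] style ids, purely illustrative
ids, mask = pad_with_mask([101, 2023, 102], max_len=8)
print(ids)   # [101, 2023, 102, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```

If the mask matters in practice, the export would need a second model input for it (e.g. via `torch.onnx.export`'s `input_names`), and the Elixir side would pass both tensors.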

Then I used the model in Elixir:

{model, params} = AxonOnnx.import("./models/models/german_sentiment_model.onnx")

{:ok, vocab_string} = File.read("./models/models/vocab.json")
{:ok, vocab_map} = Jason.decode(vocab_string)

# Tokenize
input_text = "Ein schlechter Film"
token_list = Enum.map(String.split(input_text, " "), fn x -> vocab_map[x] end)
token_tensor = Nx.tensor(List.duplicate(0, 512 - length(token_list)))
token_tensor = Nx.concatenate([Nx.tensor(token_list), token_tensor])

{init_fn, predict_fn} = Axon.build(model)

predict_fn.(params, token_tensor)

But I still have some problems/questions:

1. Is this correct?
2. Why do some keys in the vocab.json start with ##?
3. Why are some keys named ["unused{x}"]?
4. Why are the predictions signed floats rather than values scaled from 0 to 1?
5. Why do some strings not work in my version? The string "Ein scheiß Film" works on Hugging Face but not in the export.
6. Why are some keys in capital letters, while the text is always converted to lower case?

About 4): I currently scale the prediction as follows:

prediction = predict_fn.(params, token_tensor)
one_hot = Nx.divide(Nx.pow(2, prediction), Nx.sum(Nx.pow(2, prediction)))
political_score = 5*(Nx.to_number(one_hot[0][0]) - Nx.to_number(one_hot[0][1]))
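On question 4): the raw outputs are logits, and the standard way to map them onto (0, 1) is the base-e softmax rather than a base-2 variant. A minimal pure-Python sketch, independent of Nx/Axon, with made-up example logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order (positive, negative, neutral)
logits = [2.1, -0.3, 0.5]
probs = softmax(logits)
print(probs)  # three values in (0, 1) that sum to 1
```

Base 2 also produces values that sum to 1, but it compresses the differences between classes compared with the softmax the model was trained against.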

About 5): In my version above, keys that do not match return nil. I changed that to 0, but that changes the meaning of the sentence.

I opened a question in the Elixir Forum about it here.

@oliverguhr
Owner

Hi @sehHeiden,
this is an interesting question. The problem is the tokenization: the process is more complex than splitting on whitespace. Longer and compound words get split up into individual subword tokens; it works a bit like a simple compression algorithm. The Hugging Face team has a library for all the different tokenizers. To make it work, you would need to implement the BertTokenizer in Elixir or build a wrapper for the compiled Rust tokenizers from this lib.

Or you could use a tool to run the original Python code in Elixir, something like this.
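To illustrate the part that plain whitespace splitting misses, here is a toy greedy longest-match-first tokenizer in the style of BERT's WordPiece, in pure Python. The vocabulary and words are made up for illustration; the real BertTokenizer additionally handles lower-casing, punctuation, and special tokens:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece style.

    Continuation pieces are stored in the vocab with a '##' prefix,
    which is why many keys in vocab.json start with '##'.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, made up for this example
vocab = {"ein", "film", "schlecht", "##er"}
print(wordpiece_tokenize("schlechter", vocab))  # ['schlecht', '##er']
print(wordpiece_tokenize("xyz", vocab))         # ['[UNK]']
```

This would also explain question 5): a word like "scheiß" is likely split into subword pieces by the real tokenizer, so a whitespace-based lookup in vocab.json finds no exact key and returns nil.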
