
Converting into an ONNX model #13

Open
sehHeiden opened this issue Sep 4, 2023 · 1 comment

Comments

@sehHeiden

Could I add an ONNX export version?

My current attempt is:

import json
import torch
import germansentiment

# Initialize the model
model = germansentiment.SentimentModel()

# Dummy input that matches the input dimensions of the model
dummy_input = torch.randint(0, 30_000, (1, 512), dtype=torch.long)

# Export to ONNX
torch.onnx.export(model.model, dummy_input, "german_sentiment_model.onnx")

# Export the vocab
with open('vocab.json', 'w') as f:
    json.dump(model.tokenizer.vocab, f)
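One possible source of trouble with an export like the one above: only `input_ids` is traced, but the underlying BERT model also accepts an `attention_mask`, and without one the padded positions are treated as real tokens, which can change the predictions. As an illustration (function and names are made up for this sketch), padding and the matching mask in pure Python:

```python
def pad_with_mask(token_ids, max_len=512, pad_id=0):
    """Pad a list of token ids to max_len and build the matching attention mask.

    The mask is 1 for real tokens and 0 for padding, so the model can ignore
    the padded positions.
    """
    ids = token_ids[:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    return ids, mask

# [CLS] ... [SEP] style ids, purely illustrative
ids, mask = pad_with_mask([101, 2023, 102], max_len=8)
print(ids)   # [101, 2023, 102, 0, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0, 0, 0]
```

If the mask matters in practice, the export would need a second model input for it (e.g. via `torch.onnx.export`'s `input_names`), and the Elixir side would pass both tensors.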

Then I used the model in Elixir:

{model, params} = AxonOnnx.import("./models/models/german_sentiment_model.onnx")

{:ok, vocab_string} = File.read("./models/models/vocab.json")
{:ok, vocab_map} = Jason.decode(vocab_string)

# Tokenize
input_text = "Ein schlechter Film"
token_list = Enum.map(String.split(input_text, " "), fn x -> vocab_map[x] end)
token_tensor = Nx.tensor(List.duplicate(0, 512 - length(token_list)))
token_tensor = Nx.concatenate([Nx.tensor(token_list), token_tensor])

{init_fn, predict_fn} = Axon.build(model)

predict_fn.(params, token_tensor)

But I still have some problems/questions:

1. Is this correct?
2. Why do some keys in the vocab.json start with ##?
3. Why are some keys named ["unused{x}"]?
4. Why are the predictions signed floats rather than values scaled from 0 to 1?
5. Why do some strings not work in my version? The string "Ein scheiß Film" works on Hugging Face but not in the export.
6. Why are some keys in capital letters, while the text is always converted to lower case?

About 4): I currently scale the prediction as follows:

prediction = predict_fn.(params, token_tensor)
one_hot = Nx.divide(Nx.pow(2, prediction), Nx.sum(Nx.pow(2, prediction)))
political_score = 5*(Nx.to_number(one_hot[0][0]) - Nx.to_number(one_hot[0][1]))
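On question 4): the raw outputs are logits, and the standard way to map them onto (0, 1) is the base-e softmax rather than a base-2 variant. A minimal pure-Python sketch, independent of Nx/Axon, with made-up example logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order (positive, negative, neutral)
logits = [2.1, -0.3, 0.5]
probs = softmax(logits)
print(probs)  # three values in (0, 1) that sum to 1
```

Base 2 also produces values that sum to 1, but it compresses the differences between classes compared with the softmax the model was trained against.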

About 5): In my version above, keys that do not match return nil. I changed that to 0, but that changes the meaning of the sentence.

I opened a question in the Elixir Forum about it here.

@oliverguhr
Owner

Hi @sehHeiden,
this is an interesting question. The problem is the tokenization: the process is more complex than splitting on whitespace. Longer and compound words get split up into individual subword tokens; it works a bit like a simple compression algorithm. The Hugging Face team has a library for all the different tokenizers. To make it work, you would need to implement the BertTokenizer in Elixir or build a wrapper for the compiled Rust tokenizers from this lib.

Or you could use a tool to run the original Python code in Elixir, something like this.
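To illustrate the part that plain whitespace splitting misses, here is a toy greedy longest-match-first tokenizer in the style of BERT's WordPiece, in pure Python. The vocabulary and words are made up for illustration; the real BertTokenizer additionally handles lower-casing, punctuation, and special tokens:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword split, WordPiece style.

    Continuation pieces are stored in the vocab with a '##' prefix,
    which is why many keys in vocab.json start with '##'.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark as a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, made up for this example
vocab = {"ein", "film", "schlecht", "##er"}
print(wordpiece_tokenize("schlechter", vocab))  # ['schlecht', '##er']
print(wordpiece_tokenize("xyz", vocab))         # ['[UNK]']
```

This would also explain question 5): a word like "scheiß" is likely split into subword pieces by the real tokenizer, so a whitespace-based lookup in vocab.json finds no exact key and returns nil.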
