Support nlp inference and on-device tokenization #17

Merged
merged 5 commits into main from nlp on Jan 4, 2024
Conversation

@jchalatu (Contributor) commented on Dec 8, 2023

This PR primarily adds support in the SavedModelRunner for text/string-based inputs during NLP model inference. It also adds a new class, NetworkTokenizer, to device_modules, which allows users to make network calls to a device with the SentencePiece CLI installed.
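
To give a sense of the intended workflow, here is a minimal, hypothetical usage sketch; the import path, model directory, and exact run() signature are assumptions for illustration, not taken from the nengo-edge documentation:

# Hypothetical usage sketch: run an NLP model directly on raw strings.
# The import path, model directory, and run() signature are assumptions.
from nengo_edge import SavedModelRunner

runner = SavedModelRunner("path/to/exported/model")
outputs = runner.run(["turn on the lights", "what time is it"])
print(outputs)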


github-actions bot commented Dec 8, 2023

Coverage report

The coverage rate went from 100% to 100% ➡️

100% of new lines are covered.

Diff coverage details:

- nengo_edge/network_runner.py: 100% of new lines are covered (100% of the complete file).
- nengo_edge/version.py: 100% of new lines are covered (100% of the complete file).
- nengo_edge/saved_model_runner.py: 100% of new lines are covered (100% of the complete file).
- nengo_edge/device_modules/network_tokenizer.py: 100% of new lines are covered (100% of the complete file).

@drasmuss (Member) left a comment:

Looks good, just minor comments.

# Process string inputs
assert self.model_params["type"] == "nlp"
assert self.tokenizer is not None
return self._run_model(self.tokenizer.tokenize(inputs))
@drasmuss (Member):

Would this work if we did inputs = self.tokenizer.tokenize(inputs).numpy()? It would be nice to reuse the same _run_model call, so that as much of the pipeline as possible is shared.
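
A rough sketch of the shared pipeline being suggested here; the attribute and method names are taken from the diff above, and the rest is assumed rather than the final implementation:

# Sketch only: tokenize string inputs up front so every model type goes
# through the same _run_model call. Assumes tokenize() returns a tensor
# (e.g. a tf.RaggedTensor) that supports .numpy().
def run(self, inputs):
    if self.model_params["type"] == "nlp":
        assert self.tokenizer is not None
        inputs = self.tokenizer.tokenize(inputs).numpy()
    return self._run_model(inputs)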

@drasmuss (Member):

We should do inputs.lower() here too, since that's what we're currently assuming in the tokenizer training.

@jchalatu (Contributor, author):

I had the pipeline shared at one point, but I'll have to test it again because I'm pretty sure something wasn't working.

Also, I have adjusted our model/tokenizer training pipeline to more closely match how we have done it elsewhere (in past projects). We can discuss that further, since I'm not sure which approach is more appropriate for the vocabulary sizes we are considering.

]

ragged_in0 = [inputs[0]] # type: ignore
ragged_in1 = [inputs[1]] # type: ignore
@drasmuss (Member):

What was it complaining about with these type errors?

@jchalatu (Contributor, author):

I think it was because ragged_in0/1 are already defined in the asr block above, so it was a re-assignment problem, if I recall correctly.
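
For reference, a self-contained sketch of the kind of re-assignment mypy flags; the variable names are illustrative, not the actual runner code:

# Illustrative only: once mypy infers a type for a name from its first
# assignment, re-binding it to a different type later needs either a
# type: ignore comment or a fresh variable name.
import numpy as np

def prepare(inputs: list, model_type: str):
    ragged_in0 = np.asarray(inputs[0])  # mypy infers np.ndarray here
    if model_type == "nlp":
        ragged_in0 = [inputs[0]]  # error: incompatible types in assignment
    return ragged_in0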

Parameters
----------
input_text: str
Input string to be tokenized.
@drasmuss (Member):

detokenize takes a batch of tokens, but tokenize just takes a single string. We should probably make them consistent. I think we could rename decode_ids to detokenize (supporting non-batched inputs) and move the batch loop into .run (same as for tokenize). That way tokenize/detokenize both operate on single inputs, and run handles the batching.
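
A minimal sketch of the structure being proposed, using a toy whitespace tokenizer rather than the actual SentencePiece-backed implementation:

# Toy sketch of the proposed shape: tokenize/detokenize handle a single
# input, and run() owns the batch loop. Not the real nengo_edge tokenizer.
import numpy as np

class TokenizerSketch:
    def __init__(self, vocab):
        self.vocab = vocab
        self.ids = {tok: i for i, tok in enumerate(vocab)}

    def tokenize(self, text: str) -> np.ndarray:
        # one string in, one array of token ids out
        return np.array([self.ids[w] for w in text.lower().split()])

    def detokenize(self, token_ids: np.ndarray) -> str:
        # one array of token ids in, one string out
        return " ".join(self.vocab[i] for i in token_ids)

    def run(self, inputs):
        # batching lives here, not in tokenize/detokenize
        return [self.detokenize(self.tokenize(s)) for s in inputs]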

# Detokenize asr outputs using greedy decoding
# Detokenize asr outputs by applying greedy decoding, removing
# blank tokens and merging repeats before feeding values into the
# sentencepiece tokenizer
@drasmuss (Member):

Do you remember why we didn't use keras.backend.ctc_decode here? It seems like we could, and it would take care of this for us.

@jchalatu (Contributor, author) commented on Dec 18, 2023:

Yeah, that's fair; I just forgot about that function when I was making this fixup. I think we also initially forgot to include it in this step, and our original test didn't include blanks/repeats (or at least they didn't affect the outcome of the test).

)
output = subprocess.run(
cmd.split(),
input=input_text,
@drasmuss (Member):

Should do an input_text.lower() here too.
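
For concreteness, a hedged sketch of that kind of call; the spm_encode flags and tokenizer model path here are illustrative, not the exact command the NetworkTokenizer builds:

# Illustrative only: pipe lowercased text into the SentencePiece CLI encoder.
# The spm_encode flags and tokenizer.model path are assumptions.
import subprocess

input_text = "Hello NengoEdge"
cmd = "spm_encode --model=tokenizer.model --output_format=id"
output = subprocess.run(
    cmd.split(),
    input=input_text.lower(),  # lowercase to match the tokenizer training
    capture_output=True,
    text=True,
    check=True,
)
token_ids = [int(t) for t in output.stdout.split()]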

@jchalatu (Contributor, author) left a comment:

I'm fine with not including these changes if they don't seem that important. Looks good overall, though.

outputs = np.array(_outputs)
# Apply CTC decoding
outputs = tf.keras.backend.ctc_decode(
tf.nn.softmax(ragged.to_masked(outputs)),
@jchalatu (Contributor, author):

I'm not sure we actually need the softmax here, despite the model outputting logits. The internal argmax in Keras's ctc_decode should work the same with or without it. It's probably fine to leave it, though; it's not a big difference.

@drasmuss (Member):

I thought that too, but for some reason it doesn't. If you just pass in the logits you get a very different and incorrect decoding (this is the bug that Lisa was running into a while back). I never really investigated why, though; my guess is that somewhere in there they assume the values sum to 1, or something like that.
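
A small standalone sketch of the point above, using toy data rather than real model outputs; my understanding (an assumption worth double-checking) is that the Keras implementation takes the log of y_pred internally, so it expects probabilities rather than raw logits:

# Toy sketch: pass softmax-normalized probabilities into ctc_decode.
# Shapes and data are made up for illustration.
import numpy as np
import tensorflow as tf

batch, steps, vocab = 2, 8, 6  # last vocab index is the CTC blank
logits = np.random.randn(batch, steps, vocab).astype(np.float32)
seq_lengths = np.full((batch,), steps)

decoded, _log_probs = tf.keras.backend.ctc_decode(
    tf.nn.softmax(logits),  # normalize before decoding
    input_length=seq_lengths,
    greedy=True,  # greedy decoding drops blanks and merges repeats
)
token_ids = decoded[0].numpy()  # (batch, max_decoded_len), padded with -1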

@@ -73,8 +73,8 @@ def test_runner_ragged(
_, tokenizer_path = new_tokenizer(tmp_path)

if model_type == "asr":
assert isinstance(pipeline.post[0], TokenizerDesc)
pipeline.post[0].tokenizer_file = tokenizer_path
assert isinstance(pipeline.post[1], TokenizerDesc)
@jchalatu (Contributor, author):

Can we reference these with -1 indexing, since we know the tokenizer will always be at the end of a pipeline? That way we avoid having to open a PR to change the indexing if the other layers change.
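
Something like the following, assuming the tokenizer stays the last entry in pipeline.post (names as in the test above):

# Sketch of the suggested change: index from the end so the test keeps
# working if earlier pipeline layers are added or removed.
assert isinstance(pipeline.post[-1], TokenizerDesc)
pipeline.post[-1].tokenizer_file = tokenizer_path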

@drasmuss merged commit 7feec8b into main on Jan 4, 2024 (1 check failed).
@drasmuss deleted the nlp branch on January 4, 2024 at 16:50.