non-HTML input is treated as HTML #51
Comments
@jelmervdl Maybe this is related to #35, where I'm seeing multiple exceptions in the engine but can't reproduce on different languages? |
@andrenatal it could very well be related. There are paragraphs with literal |
The extension will be called upon to translate https://github.com/browsermt/bergamot-translator/blob/c0f311a8c067372057a6f301c42b40bbe30a9c1a/src/translator/response_options.h#L22 |
@abhi-agg could you please take a look at this? |
@andrenatal I am on it. |
@kpu By |
The user could browse to a web page with a text/plain content type. The browser displays the text, which may include stray < and > that should not be interpreted as HTML. |
As in, a user with a German Firefox UI visits https://neural.mt/test.txt . The browser offers to translate it to German (or at least it should, but that's not the point here). The text should be sent to the engine with HTML off. |
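As a rough sketch of the text/plain case above: a caller could derive the HTML flag from the page's content type before sending anything to the engine. The helper name `useHtmlMode` is hypothetical and not part of the extension or of bergamot-translator, and the sketch assumes the caller submits markup for HTML pages and raw text otherwise.

```javascript
// Hypothetical helper (not part of the extension): decide whether the
// engine's HTML mode should be on, based on a content type string such as
// the value of `document.contentType`.
function useHtmlMode(contentType) {
  // Drop any parameters such as "; charset=utf-8" and normalise case.
  const mime = String(contentType).split(";")[0].trim().toLowerCase();
  // Assumption: only these two types are submitted as markup.
  return mime === "text/html" || mime === "application/xhtml+xml";
}

// Options in the shape used elsewhere in this thread.
const plainTextOptions = {
  qualityScores: true,
  alignment: true,
  html: useHtmlMode("text/plain"),
};
console.log(plainTextOptions.html);
```

In a page context the argument would be `document.contentType`; whether the content type alone is a sufficient signal depends on what the extension actually extracts from the page.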
Did some quick tests via wasm test page:
|
That looks consistent with expectations. In HTML mode the processing looks like:
|
Wasn't this fixed by: #53? Can we close it? |
There's a few things going on:
So I think this issue could be retitled as "text is being sent to HTML mode" |
What happens in the case of a legit instance of |
If you have HTML mode on and provide ill-formed HTML, it will throw an informative exception to the request and the engine will be fine (to take more requests). However, the HTML mode was originally scoped to be used with Firefox's innerHTML where it's not possible to pass erroneous HTML. We could of course improvise something to just skip the character or treat it as if it were |
Note that you can specify whether to use HTML mode every time you call Right now the extension does some batching, calling

    const responseOptions = {qualityScores: true, alignment: true, html: true};
    let input = new this.WasmEngineModule.VectorString();
    messages.forEach(message => {
      input.push_back(message.sourceParagraph);
    });
    let result = this.translationService.translate(translationModel, input, responseOptions);

We can alter the emscripten bindings of bergamot-translator a bit and accept a ResponseOptions object per input string, if that makes the extension's code easier. Something like:

    const responseOptions = {qualityScores: true, alignment: true};
    let input = new this.WasmEngineModule.VectorSomethingSomething();
    messages.forEach(message => {
      input.push_back(message.sourceParagraph, {...responseOptions, html: message.isHTML});
    });
    let result = this.translationService.translate(translationModel, input);
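Without changing the bindings, a similar effect might be had by splitting each batch by `message.isHTML` and making one call per group. The sketch below is only an illustration: `translateGrouped` and the `translateBatch(paragraphs, responseOptions)` callback are hypothetical names, with the callback standing in for building a `VectorString` and calling `translationService.translate`. It is assumed to return one result per input paragraph, in order.

```javascript
// Hypothetical sketch: group messages by their isHTML flag, translate each
// group with a single responseOptions object, then restore the original order.
function translateGrouped(messages, translateBatch) {
  const baseOptions = { qualityScores: true, alignment: true };
  const results = new Array(messages.length);

  for (const isHTML of [true, false]) {
    // Remember original positions so results can be put back in place.
    const indices = [];
    messages.forEach((message, i) => {
      if (Boolean(message.isHTML) === isHTML) indices.push(i);
    });
    if (indices.length === 0) continue;

    const paragraphs = indices.map(i => messages[i].sourceParagraph);
    // Assumption: translateBatch wraps the real engine call.
    const translated = translateBatch(paragraphs, { ...baseOptions, html: isHTML });

    indices.forEach((originalIndex, k) => {
      results[originalIndex] = translated[k];
    });
  }
  return results;
}
```

This costs at most two engine calls per batch instead of one, in exchange for leaving the bindings untouched.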
Trying to parse HTML with a fallback sounds risky. For example, if someone were to type |
Based on my conversation with Abhi, we'll need the InPageTranslation.js parser to always determine if the string being passed to the engine contains plain or html text and set the proper flag on ResponseOptions. Is that right @abhi-agg ? |
If the above is the case, it demonstrates a weakness of the engine itself: it shouldn't be the presentation layer's responsibility to determine the type of the content being input. The user should just be able to input whatever they want, and the engine itself should infer that. There will be use cases where this won't be possible, like SOA applications for example. Unfortunately this is a bad design decision that can lead to even more issues |
All you need to do to fix this issue for the time being is turn the HTML flag to off. This is a one line change. When you have HTML translation again, turn it on. The HTML feature is designed to operate on snippets of HTML. It does not make sense to expect automatic content identification from text vs snippets of HTML because that would introduce bugs in text translation where words are mysteriously not translated inside angle brackets. Does Firefox render text/plain content containing some tags as HTML? |
Submitted the PR that should close this issue.
This leaves ⬆️ up for discussion. |
Thanks @abhi-agg! |
It looks like all input is marked as HTML for the translator, even though nodes' text content is submitted. If a node contains something like `<p>Hello < world</p>`, it would submit `Hello < world`, which when parsed as HTML is invalid input and would cause an exception.

Even if HTML was being submitted, it would not be properly used (and cause an `abort()`) because the model doesn't produce alignment information. In the model configuration yaml, the line `alignment: soft` is missing.
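To illustrate why a node's text content is not valid HTML input: a bare `<` has to be written as an entity before the string can be parsed as markup. The sketch below shows that escaping step in isolation. `escapeHtml` is a hypothetical helper, not something the extension or the engine provides, and the fix discussed in this thread is to turn the HTML flag off rather than to escape.

```javascript
// Hypothetical helper: escape the characters that would otherwise be read
// as markup when plain text is parsed as HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

console.log(escapeHtml("Hello < world")); // prints "Hello &lt; world"
```

If text were escaped this way before translation, the output would presumably need the reverse step before being written back into the page; that behaviour is not confirmed anywhere in this thread.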