non-HTML input is treated as HTML #51

jelmervdl · 2022-01-26T13:09:26Z

It looks like all input is marked as HTML for the translator, even though nodes' text content is submitted. If a node contains something like <p>Hello < world</p>, it would submit Hello < world. Which when parsed as HTML is invalid input and would cause an exception.

Even if HTML was being submitted, it would not be properly used (and cause an abort()) because the model doesn't produce alignment information. In the model configuration yaml, the line alignment: soft is missing.

The text was updated successfully, but these errors were encountered:

andrenatal · 2022-01-26T19:18:06Z

@jelmervdl Maybe this is related to #35, where I'm seeing multiple exceptions in the engine but can't reproduce on different languages?

jelmervdl · 2022-01-26T20:56:07Z

@andrenatal it could be very well related. There are paragraphs with literal < and > in them. Those paragraphs would cause the html parsing bit in bergamot-translator to throw (but because of how emscripten is configured… abort()?) since it tries to parse that as an open tag and fail.

kpu · 2022-01-26T21:19:32Z

The extension will be called upon to translate text/plain and text/HTML documents, so the extension will need to toggle HTML on/off anyway, using this:

https://github.com/browsermt/bergamot-translator/blob/c0f311a8c067372057a6f301c42b40bbe30a9c1a/src/translator/response_options.h#L22
(The TODO comment is outdated and will be fixed.)

andrenatal · 2022-01-27T08:54:03Z

@abhi-agg could you please take a look at this?

abhi-agg · 2022-01-27T18:11:46Z

@andrenatal I am on it.

abhi-agg · 2022-01-27T18:18:59Z

The extension will be called upon to translate text/plain and text/HTML documents, so the extension will need to toggle HTML on/off anyway

@kpu By documents, do you literally mean the user will upload documents to translate or do you only mean web pages?

kpu · 2022-01-27T18:57:04Z

The user could browse to a web page with a text/plain content type. The browser displays the text, which may include stray < and > that should not be interpreted as HTML.

kpu · 2022-01-27T19:02:11Z

As in, a user with a German Firefox UI visits https://neural.mt/test.txt . The browser offers to translate it to German (or at least it should, but that's not the point here). The text should be sent to the engine with HTML off.

abhi-agg · 2022-01-27T19:59:20Z

Did some quick tests via wasm test page:

Assuming right model config was set and response options are set to do html translation
1. Hello < world. returned following for various models:
  1. De: Hallo Welt.
  2. Es: Hola, mundo.
  3. Cz: Ahoj, svět.
  4. Et: Tere < maailm.
    (Seems like < is gone in De, Es and Cz language)
2. Hello < world causes engine to assert (because given string is a bad-formed html)
Non-html translation for Hello < world. Hello < world. returned follwing:
1. De: Hello < Welt. Hallo Welt.
2. Es: Hola < mundo. Hola, mundo.
3. Cz: Hello < world. Ahoj, svět.
4. Et: Tere < maailm. Tere < maailm.

kpu · 2022-01-27T20:16:13Z

That looks consistent with expectations. In HTML mode the processing looks like:

Input Hello < world.
HTML maps to internal text Hello < world.
Internal Estonian target Tere < maailm. (Which I note is what happened in text mode.)
HTML escapes to Tere < maailm.

andrenatal · 2022-01-29T11:20:20Z

Wasn't this fixed by: #53?

Can we close it?

kpu · 2022-01-29T13:26:57Z

There's a few things going on:

alignment: soft was not passed at load time so HTML didn't know what to do with word alignments, resulting in poor HTML placement quality. This was mentioned as part of the initial issue and is now resolved as Add alignment: soft; to YAML #53.
The extension is currently extracting text and sending it to the engine with HTML on. This means a stray < in the HTML is converted by the extension to < which then confuses HTML mode in the engine. This was also mentioned as part of the issue and remains open to my knowledge. We believe this is the cause of the crash in Translation skips paragraphs #35.
The extension has no means to toggle the HTML flag. This will be required at some point since users may browse to text/plain and text/html pages. I have suggested doing this as part of the fix of point 2 above. My understanding is @abhi-agg has a draft version of a fix; he posted test output above.
The extension should use HTML mode instead of sending text, which is Use HTML translation feature including inline tag movement #52.

So I think this issue could be retitled as "text is being sent to HTML mode"

andrenatal · 2022-01-29T18:34:54Z

What happens in the case of a legit instance of < existing in the page's text being sent to the engine?

kpu · 2022-01-29T18:46:40Z

If you have HTML mode on and provide ill-formed HTML, it will throw an informative exception to the request and the engine will be fine (to take more requests). However, the -fno-exceptions limitation of WASM maps this exception to process death. In Thursday's plenary, I raised the possibility of an API change to allow returning an error code, as the current API does not have the means to express errors by anything other than exceptions. See browsermt/bergamot-translator#316 .

HTML mode was originally scoped to be used with Firefox's innerHTML where it's not possible to pass erroneous HTML.

We could of course improvise something to just skip the character or treat it as if it were < but this seemed like it would only cause trouble of being inconsistent with Firefox's interpretation of broken HTML.

jelmervdl · 2022-01-29T21:27:18Z

Note that you can specify whether to use HTML mode every time you call translate(). For translating innerHTML it should be on, but for translating textContent or the outbound form inputs it should not be on.

Right now the extension does some batching, calling translate() with a VectorString with multiple input strings, but only one ResponseOptions object which specifies whether they are all HTML or not. I.e.

const responseOptions = {qualityScores: true, alignment: true, html: true};
let input = new this.WasmEngineModule.VectorString();
messages.forEach(message => {
    input.push_back(message.sourceParagraph);
});
let result = this.translationService.translate(translationModel, input, responseOptions);

We can alter the emscripten bindings of bergamot-translator a bit and accept a ResponseOptions object per input string, if that makes the extension's code easier. Something like:

const responseOptions = {qualityScores: true, alignment: true};
let input = new this.WasmEngineModule.VectorSomethingSomething();
messages.forEach(message => {
    input.push_back(message.sourceParagraph, {...responseOptions, html: message.isHTML});
});
let result = this.translationService.translate(translationModel, input);

We could of course improvise something to just skip the character or treat it as if it were < but this seemed like it would only cause trouble of being inconsistent with Firefox's interpretation of broken HTML.

Trying to parse HTML with a fallback sounds riksy. For example, if someone were to type When output<input and checked>unchecked into their textarea, that will be interpreted as valid HTML (and input and checked would go untranslated)

andrenatal · 2022-01-31T22:20:29Z

Based on my conversation with Abhi, we'll need the InPageTranslation.js parser to always determine if the string being passed to the engine contains plain or html text and set the proper flag on ResponseOptions. Is that right @abhi-agg ?

andrenatal · 2022-02-01T16:07:53Z

If the above is the case, it demonstrates a weakness of the engine itself: it shouldn't be the presentation layer responsibility to determine the type of the content being input, the user should just be able to input whatever they and and the engine itself infer that. There will be use cases where this won't be possible, like SOA applications for example. Unfortunately this is a bad design decision that can lead to even more issues

kpu · 2022-02-01T16:42:46Z

All you need to do to fix this issue for the time being is turn the HTML flag to off. This is a one line change. When you have HTML translation again, turn it on.

The HTML feature is designed to operate on snippets of HTML. It does not make sense to expect automatic content identification from text vs snippets of HTML because that would introduce bugs in text translation where words are mysteriously not translated inside angle brackets.

Does Firefox render text/plain content containing some tags as HTML?

abhi-agg · 2022-02-01T17:00:10Z

All you need to do to fix this issue for the time being is turn the HTML flag to off. This is a one line change. When you have HTML translation again, turn it on.

Submitted the PR that should close this issue.

If the above is the case, it demonstrates a weakness of the engine itself: it shouldn't be the presentation layer responsibility to determine the type of the content being input, the user should just be able to input whatever they and and the engine itself infer that. There will be use cases where this won't be possible, like SOA applications for example. Unfortunately this is a bad design decision that can lead to even more issues

This leaves ⬆️ up for discussion.

kpu · 2022-02-01T17:49:10Z

Thanks @abhi-agg!

jelmervdl changed the title ~~non-HTML input is marked as HTML~~ non-HTML input is treated as HTML Jan 26, 2022

kpu mentioned this issue Jan 26, 2022

Use HTML translation feature including inline tag movement #52

Closed

andrenatal added the bug Something isn't working label Jan 26, 2022

kpu mentioned this issue Jan 26, 2022

Add alignment: soft; to YAML #53

Merged

abhi-agg mentioned this issue Feb 1, 2022

Fixed issue where non-html text was treated as html input for translation #66

Merged

abhi-agg self-assigned this Feb 1, 2022

andrenatal closed this as completed in #66 Feb 3, 2022

kpu mentioned this issue Feb 3, 2022

Translation skips paragraphs #35

Closed

abhi-agg mentioned this issue Feb 10, 2022

HTML/Non-HTML translation per item of Batch Translation API browsermt/bergamot-translator#345

Closed

abhi-agg mentioned this issue Feb 21, 2022

QE for HTML input doesn't work browsermt/bergamot-translator#355

Closed

jerinphilip mentioned this issue Feb 28, 2022

Test cases for HTML/Non-HTML translation per item of Batch Translation API browsermt/bergamot-translator#363

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

non-HTML input is treated as HTML #51

non-HTML input is treated as HTML #51

jelmervdl commented Jan 26, 2022

andrenatal commented Jan 26, 2022

jelmervdl commented Jan 26, 2022

kpu commented Jan 26, 2022

andrenatal commented Jan 27, 2022

abhi-agg commented Jan 27, 2022

abhi-agg commented Jan 27, 2022

kpu commented Jan 27, 2022

kpu commented Jan 27, 2022 •

edited

abhi-agg commented Jan 27, 2022

kpu commented Jan 27, 2022 •

edited

andrenatal commented Jan 29, 2022

kpu commented Jan 29, 2022 •

edited

andrenatal commented Jan 29, 2022 •

edited

kpu commented Jan 29, 2022 •

edited

jelmervdl commented Jan 29, 2022 •

edited

andrenatal commented Jan 31, 2022

andrenatal commented Feb 1, 2022

kpu commented Feb 1, 2022

abhi-agg commented Feb 1, 2022

kpu commented Feb 1, 2022

non-HTML input is treated as HTML #51

non-HTML input is treated as HTML #51

Comments

jelmervdl commented Jan 26, 2022

andrenatal commented Jan 26, 2022

jelmervdl commented Jan 26, 2022

kpu commented Jan 26, 2022

andrenatal commented Jan 27, 2022

abhi-agg commented Jan 27, 2022

abhi-agg commented Jan 27, 2022

kpu commented Jan 27, 2022

kpu commented Jan 27, 2022 • edited

abhi-agg commented Jan 27, 2022

kpu commented Jan 27, 2022 • edited

andrenatal commented Jan 29, 2022

kpu commented Jan 29, 2022 • edited

andrenatal commented Jan 29, 2022 • edited

kpu commented Jan 29, 2022 • edited

jelmervdl commented Jan 29, 2022 • edited

andrenatal commented Jan 31, 2022

andrenatal commented Feb 1, 2022

kpu commented Feb 1, 2022

abhi-agg commented Feb 1, 2022

kpu commented Feb 1, 2022

kpu commented Jan 27, 2022 •

edited

kpu commented Jan 27, 2022 •

edited

kpu commented Jan 29, 2022 •

edited

andrenatal commented Jan 29, 2022 •

edited

kpu commented Jan 29, 2022 •

edited

jelmervdl commented Jan 29, 2022 •

edited