Skip to content
This repository has been archived by the owner on Sep 4, 2023. It is now read-only.

non-HTML input is treated as HTML #51

Closed
jelmervdl opened this issue Jan 26, 2022 · 20 comments · Fixed by #66
Closed

non-HTML input is treated as HTML #51

jelmervdl opened this issue Jan 26, 2022 · 20 comments · Fixed by #66
Assignees
Labels
bug Something isn't working

Comments

@jelmervdl
Copy link
Contributor

It looks like all input is marked as HTML for the translator, even though nodes' text content is submitted. If a node contains something like <p>Hello &lt; world</p>, it would submit Hello < world. Which when parsed as HTML is invalid input and would cause an exception.

Even if HTML was being submitted, it would not be properly used (and cause an abort()) because the model doesn't produce alignment information. In the model configuration yaml, the line alignment: soft is missing.

@jelmervdl jelmervdl changed the title non-HTML input is marked as HTML non-HTML input is treated as HTML Jan 26, 2022
@andrenatal
Copy link
Contributor

@jelmervdl Maybe this is related to #35, where I'm seeing multiple exceptions in the engine but can't reproduce on different languages?

@jelmervdl
Copy link
Contributor Author

@andrenatal it could be very well related. There are paragraphs with literal < and > in them. Those paragraphs would cause the html parsing bit in bergamot-translator to throw (but because of how emscripten is configured… abort()?) since it tries to parse that as an open tag and fail.

@andrenatal andrenatal added the bug Something isn't working label Jan 26, 2022
@kpu
Copy link
Contributor

kpu commented Jan 26, 2022

The extension will be called upon to translate text/plain and text/HTML documents, so the extension will need to toggle HTML on/off anyway, using this:

https://github.com/browsermt/bergamot-translator/blob/c0f311a8c067372057a6f301c42b40bbe30a9c1a/src/translator/response_options.h#L22
(The TODO comment is outdated and will be fixed.)

@andrenatal
Copy link
Contributor

@abhi-agg could you please take a look at this?

@abhi-agg
Copy link
Collaborator

@andrenatal I am on it.

@abhi-agg
Copy link
Collaborator

The extension will be called upon to translate text/plain and text/HTML documents, so the extension will need to toggle HTML on/off anyway

@kpu By documents, do you literally mean the user will upload documents to translate or do you only mean web pages?

@kpu
Copy link
Contributor

kpu commented Jan 27, 2022

The user could browse to a web page with a text/plain content type. The browser displays the text, which may include stray < and > that should not be interpreted as HTML.

@kpu
Copy link
Contributor

kpu commented Jan 27, 2022

As in, a user with a German Firefox UI visits https://neural.mt/test.txt . The browser offers to translate it to German (or at least it should, but that's not the point here). The text should be sent to the engine with HTML off.

@abhi-agg
Copy link
Collaborator

Did some quick tests via wasm test page:

  1. Assuming right model config was set and response options are set to do html translation
    1. Hello &lt; world. returned following for various models:
      1. De: Hallo Welt.
      2. Es: Hola, mundo.
      3. Cz: Ahoj, svět.
      4. Et: Tere &lt; maailm.
        (Seems like &lt; is gone in De, Es and Cz language)
    2. Hello < world causes engine to assert (because given string is a bad-formed html)
  2. Non-html translation for Hello &lt; world. Hello < world. returned follwing:
    1. De: Hello &lt; Welt. Hallo Welt.
    2. Es: Hola &lt; mundo. Hola, mundo.
    3. Cz: Hello &lt; world. Ahoj, svět.
    4. Et: Tere &lt; maailm. Tere < maailm.

@kpu
Copy link
Contributor

kpu commented Jan 27, 2022

That looks consistent with expectations. In HTML mode the processing looks like:

  1. Input Hello &lt; world.
  2. HTML maps to internal text Hello < world.
  3. Internal Estonian target Tere < maailm. (Which I note is what happened in text mode.)
  4. HTML escapes to Tere &lt; maailm.

@andrenatal
Copy link
Contributor

Wasn't this fixed by: #53?

Can we close it?

@kpu
Copy link
Contributor

kpu commented Jan 29, 2022

There's a few things going on:

  1. alignment: soft was not passed at load time so HTML didn't know what to do with word alignments, resulting in poor HTML placement quality. This was mentioned as part of the initial issue and is now resolved as Add alignment: soft; to YAML #53.
  2. The extension is currently extracting text and sending it to the engine with HTML on. This means a stray &lt; in the HTML is converted by the extension to < which then confuses HTML mode in the engine. This was also mentioned as part of the issue and remains open to my knowledge. We believe this is the cause of the crash in Translation skips paragraphs #35.
  3. The extension has no means to toggle the HTML flag. This will be required at some point since users may browse to text/plain and text/html pages. I have suggested doing this as part of the fix of point 2 above. My understanding is @abhi-agg has a draft version of a fix; he posted test output above.
  4. The extension should use HTML mode instead of sending text, which is Use HTML translation feature including inline tag movement #52.

So I think this issue could be retitled as "text is being sent to HTML mode"

@andrenatal
Copy link
Contributor

andrenatal commented Jan 29, 2022

What happens in the case of a legit instance of < existing in the page's text being sent to the engine?

@kpu
Copy link
Contributor

kpu commented Jan 29, 2022

If you have HTML mode on and provide ill-formed HTML, it will throw an informative exception to the request and the engine will be fine (to take more requests). However, the -fno-exceptions limitation of WASM maps this exception to process death. In Thursday's plenary, I raised the possibility of an API change to allow returning an error code, as the current API does not have the means to express errors by anything other than exceptions. See browsermt/bergamot-translator#316 .

HTML mode was originally scoped to be used with Firefox's innerHTML where it's not possible to pass erroneous HTML.

We could of course improvise something to just skip the character or treat it as if it were &lt; but this seemed like it would only cause trouble of being inconsistent with Firefox's interpretation of broken HTML.

@jelmervdl
Copy link
Contributor Author

jelmervdl commented Jan 29, 2022

Note that you can specify whether to use HTML mode every time you call translate(). For translating innerHTML it should be on, but for translating textContent or the outbound form inputs it should not be on.

Right now the extension does some batching, calling translate() with a VectorString with multiple input strings, but only one ResponseOptions object which specifies whether they are all HTML or not. I.e.

const responseOptions = {qualityScores: true, alignment: true, html: true};
let input = new this.WasmEngineModule.VectorString();
messages.forEach(message => {
    input.push_back(message.sourceParagraph);
});
let result = this.translationService.translate(translationModel, input, responseOptions);

We can alter the emscripten bindings of bergamot-translator a bit and accept a ResponseOptions object per input string, if that makes the extension's code easier. Something like:

const responseOptions = {qualityScores: true, alignment: true};
let input = new this.WasmEngineModule.VectorSomethingSomething();
messages.forEach(message => {
    input.push_back(message.sourceParagraph, {...responseOptions, html: message.isHTML});
});
let result = this.translationService.translate(translationModel, input);

We could of course improvise something to just skip the character or treat it as if it were < but this seemed like it would only cause trouble of being inconsistent with Firefox's interpretation of broken HTML.

Trying to parse HTML with a fallback sounds riksy. For example, if someone were to type When output<input and checked>unchecked into their textarea, that will be interpreted as valid HTML (and input and checked would go untranslated)

@andrenatal
Copy link
Contributor

Based on my conversation with Abhi, we'll need the InPageTranslation.js parser to always determine if the string being passed to the engine contains plain or html text and set the proper flag on ResponseOptions. Is that right @abhi-agg ?

@andrenatal
Copy link
Contributor

If the above is the case, it demonstrates a weakness of the engine itself: it shouldn't be the presentation layer responsibility to determine the type of the content being input, the user should just be able to input whatever they and and the engine itself infer that. There will be use cases where this won't be possible, like SOA applications for example. Unfortunately this is a bad design decision that can lead to even more issues

@kpu
Copy link
Contributor

kpu commented Feb 1, 2022

All you need to do to fix this issue for the time being is turn the HTML flag to off. This is a one line change. When you have HTML translation again, turn it on.

The HTML feature is designed to operate on snippets of HTML. It does not make sense to expect automatic content identification from text vs snippets of HTML because that would introduce bugs in text translation where words are mysteriously not translated inside angle brackets.

Does Firefox render text/plain content containing some tags as HTML?

@abhi-agg
Copy link
Collaborator

abhi-agg commented Feb 1, 2022

All you need to do to fix this issue for the time being is turn the HTML flag to off. This is a one line change. When you have HTML translation again, turn it on.

Submitted the PR that should close this issue.

If the above is the case, it demonstrates a weakness of the engine itself: it shouldn't be the presentation layer responsibility to determine the type of the content being input, the user should just be able to input whatever they and and the engine itself infer that. There will be use cases where this won't be possible, like SOA applications for example. Unfortunately this is a bad design decision that can lead to even more issues

This leaves ⬆️ up for discussion.

@abhi-agg abhi-agg self-assigned this Feb 1, 2022
@kpu
Copy link
Contributor

kpu commented Feb 1, 2022

Thanks @abhi-agg!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
4 participants