Deduplicate separators when processing HTML #290

tushuhei · 2023-09-18T02:28:56Z

Demo
https://codepen.io/tushuhei/pen/VwqMywj

Setup

<p>xyz<wbr>abcabc</p>

const parser = new HTMLProcessingParser({
  UW4: {a: 1001}, // means "should separate right before 'a'".
});
const paragraph = document.querySelector('p');
parser.applyElement(paragraph);
console.log(paragraph.innerHTML);

Expected

<p>xyz<wbr>abc<wbr>abc</p>

Actual

<p>xyz<wbr><wbr>abc<wbr>abc</p>

We may want to remove duplicated separators in case we need to apply BudouX to the same element multiple times (e.g. Web Components that reuse their Light DOM).

@kojiishi Could you take a look what changes should be applied to html_processor.ts?

The text was updated successfully, but these errors were encountered:

tushuhei · 2023-10-09T23:02:13Z

After some investigation, I'm starting to feel like this fix is not trivial.
Separator HTML elements may come to the end of a paragraph node and the top of the following paragraph node, so we need a postprocess to look for any duplicated separators across the paragraph.nodes. If we can assume that the separator is always , it's easy to locale them with querySelectorAll('wbr') and the nextSibling property. However, the reality is that the separator can be an arbitrary node, thus the querySelectorAll(TAG_NAME) approach won't work. I'm feeling we could close this issue if there's no performant way to implement the fix as it should work even if the separator is duplicated. @kojiishi wdyt?

kojiishi · 2023-10-10T08:08:32Z

Avoiding redundant separaters is easy, but applying multiple times is a bit complicated, because once BudouX is applied, it's difficult to distinguish  inserted by BudouX from the one author originally inserted.

One way to do this is removing existing separaters before inserting new ones. This looks like a clean way, but this doesn't work well, becuase this option ignores author-inserted separaters.

The 2nd option. BudouX can check if there's an existing separater before inserting a new one. This can fix the redundant separators, but if the content changes and if we want to apply BudouX again, existing separaters for the previous content will remain. For example:

Original content abcabcabc
Applied once: abcabcabc
A c was removed: abcababc
Then applying again doesn't remove the  after ab. The result is different from when applying to abcababc.

The 3rd option is to mark separators. For example, xyzabcabc can become to xyzabcabc.

The 4th option is BudouX to insert different separators and apply the option 1. For example, if author is supposed to use , BudouX can insert &ZeroWidthSpace;. This is a lot simpler than the option 3, but requires authors to be aware of this.

If we want to fix redundant separators without worrying about applying multiple times for updated content, the option 2 looks reasonable and simple to me.

Thoughts?

tushuhei · 2023-10-10T15:46:23Z

Thanks for your thorough consideration. I think the 2nd option is the way to go. We want to respect word break opportunities inserted by the author initially, but we should not distinguish if they're inserted by the author or BudouX itself from the second run.

tushuhei assigned kojiishi Sep 18, 2023

tushuhei mentioned this issue Sep 18, 2023

[Web Components] Use Light DOM instead of Shadow DOM #291

Merged

kojiishi mentioned this issue Oct 25, 2023

Avoid inserting separators to where the source has one #342

Merged

kojiishi closed this as completed in #342 Oct 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deduplicate separators when processing HTML #290

Deduplicate separators when processing HTML #290

tushuhei commented Sep 18, 2023

tushuhei commented Oct 9, 2023

kojiishi commented Oct 10, 2023 •

edited

tushuhei commented Oct 10, 2023

Deduplicate separators when processing HTML #290

Deduplicate separators when processing HTML #290

Comments

tushuhei commented Sep 18, 2023

tushuhei commented Oct 9, 2023

kojiishi commented Oct 10, 2023 • edited

tushuhei commented Oct 10, 2023

kojiishi commented Oct 10, 2023 •

edited