Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deduplicate separators when processing HTML #290

Closed
tushuhei opened this issue Sep 18, 2023 · 3 comments · Fixed by #342
Closed

Deduplicate separators when processing HTML #290

tushuhei opened this issue Sep 18, 2023 · 3 comments · Fixed by #342
Assignees

Comments

@tushuhei
Copy link
Member

Demo
https://codepen.io/tushuhei/pen/VwqMywj

Setup

<p>xyz<wbr>abcabc</p>
const parser = new HTMLProcessingParser({
  UW4: {a: 1001}, // means "should separate right before 'a'".
});
const paragraph = document.querySelector('p');
parser.applyElement(paragraph);
console.log(paragraph.innerHTML);

Expected

<p>xyz<wbr>abc<wbr>abc</p>

Actual

<p>xyz<wbr><wbr>abc<wbr>abc</p>

We may want to remove duplicated separators in case we need to apply BudouX to the same element multiple times (e.g. Web Components that reuse their Light DOM).

@kojiishi Could you take a look what changes should be applied to html_processor.ts?

@tushuhei
Copy link
Member Author

tushuhei commented Oct 9, 2023

After some investigation, I'm starting to feel like this fix is not trivial.
Separator HTML elements may come to the end of a paragraph node and the top of the following paragraph node, so we need a postprocess to look for any duplicated separators across the paragraph.nodes. If we can assume that the separator is always <wbr>, it's easy to locale them with querySelectorAll('wbr') and the nextSibling property. However, the reality is that the separator can be an arbitrary node, thus the querySelectorAll(TAG_NAME) approach won't work. I'm feeling we could close this issue if there's no performant way to implement the fix as it should work even if the separator is duplicated. @kojiishi wdyt?

@kojiishi
Copy link
Collaborator

kojiishi commented Oct 10, 2023

Avoiding redundant separaters is easy, but applying multiple times is a bit complicated, because once BudouX is applied, it's difficult to distinguish <wbr> inserted by BudouX from the one author originally inserted.

One way to do this is removing existing separaters before inserting new ones. This looks like a clean way, but this doesn't work well, becuase this option ignores author-inserted separaters.

The 2nd option. BudouX can check if there's an existing separater before inserting a new one. This can fix the redundant separators, but if the content changes and if we want to apply BudouX again, existing separaters for the previous content will remain. For example:

  1. Original content abcabcabc
  2. Applied once: abc<wbr>abc<wbr>abc
  3. A c was removed: abc<wbr>ab<wbr>abc
  4. Then applying again doesn't remove the <wbr> after ab. The result is different from when applying to abcababc.

The 3rd option is to mark separators. For example, xyz<wbr>abcabc can become to xyz<wbr budoux-origin=author>abc<wbr budoux-origin=auto>abc.

The 4th option is BudouX to insert different separators and apply the option 1. For example, if author is supposed to use <wbr>, BudouX can insert &ZeroWidthSpace;. This is a lot simpler than the option 3, but requires authors to be aware of this.

If we want to fix redundant separators without worrying about applying multiple times for updated content, the option 2 looks reasonable and simple to me.

Thoughts?

@tushuhei
Copy link
Member Author

Thanks for your thorough consideration. I think the 2nd option is the way to go. We want to respect word break opportunities inserted by the author initially, but we should not distinguish if they're inserted by the author or BudouX itself from the second run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants