Fix unpaired close tags and self-closing tags #360

Merged
merged 3 commits into from
Nov 11, 2023

kojiishi
Collaborator

#251 assumed that all tags are closed properly.

This assumption doesn't hold for cases like:

  1. Self-closing tags such as <img> don't have corresponding close tags.
  2. Unpaired close tags are a parse error, but HTML parsers still accept them.

This patch supports these cases by assuming that all open tags that don't nest correctly or that are never closed are automatically closed.

This isn't the full HTML "adoption agency algorithm", but it should be good enough for the needs of BudouX.
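
The lenient handling described above can be sketched roughly as follows. This is a minimal illustration built on Python's standard `html.parser`, not the code from this patch: the names `LenientTagTracker`, `VOID_ELEMENTS`, and `trace` are hypothetical, and the real `html_processor` may track open elements differently.

```python
from html.parser import HTMLParser

# Void elements per the HTML standard; they never have a close tag.
VOID_ELEMENTS = frozenset({
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'link', 'meta', 'source', 'track', 'wbr',
})


class LenientTagTracker(HTMLParser):
    """Hypothetical helper: tracks open elements, tolerating bad nesting."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names
        self.events = []     # (text, snapshot of open tags) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return  # case 1: no close tag will follow, so never push
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # case 2: unpaired close tag, ignore it
        # Treat everything opened after the matching open tag as closed,
        # then pop the matching open tag itself.
        while self.open_tags.pop() != tag:
            pass

    def handle_data(self, data):
        self.events.append((data, tuple(self.open_tags)))


def trace(html):
    """Returns (text, open-tags) pairs seen while parsing `html`."""
    tracker = LenientTagTracker()
    tracker.feed(html)
    tracker.close()  # flush any text still buffered by the parser
    return tracker.events
```

Calling `trace('<nobr>a<img>b</nobr>c')` in this sketch reports both `a` and `b` as inside `nobr`, because the `<img>` is never pushed onto the stack. Tags still on the stack at the end of input are simply left there, which amounts to treating them as closed at the end.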

Fixes #355

tushuhei
tushuhei previously approved these changes Nov 10, 2023
Member

@tushuhei tushuhei left a comment

LGTM. Thank you!

resolver = html_processor.HTMLChunkResolver(['abxyabc', 'def'], '<wbr>')
resolver.feed(input)
self.assertEqual(resolver.output, expected,
                 'WBR tags should not be inserted if NOBR.')
Member

@tushuhei tushuhei Nov 10, 2023

Could you elaborate on this test message by mentioning the IMG tag?

Collaborator Author

Good catch, thanks, done.

@tushuhei
Member

@kojiishi I left a small comment actually. PTAL.

Member

@tushuhei tushuhei left a comment

A nit request about the test message

@kojiishi kojiishi merged commit 2457c51 into google:main Nov 11, 2023
19 checks passed
@kojiishi kojiishi deleted the unpaired branch November 11, 2023 05:54

Successfully merging this pull request may close these issues.

Unopened HTML tag causes exception in budoux 0.6