Tokenizer: use Python objects to represent tokens #521

jayaddison · 2020-12-30T17:42:46Z

This change refactors the tokenizer module to use Python object instances where previously plain dictionaries were used to hold token state.

This builds upon #519, #520 and attempts to resolve #24.
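As a rough illustration of the kind of change described above, the sketch below contrasts a dictionary-style token with an object-style token. It is not the code from this pull request: the class names, attribute names, and the `make_dict_token` helper are assumptions made for the example, loosely following the commit titles (a tag type hierarchy and a `flushAttribute` method).

```python
# Illustrative sketch only: all names below are hypothetical and are not
# taken from the actual changeset.


def make_dict_token(name):
    # Dictionary-style token: state lives in string-keyed entries.
    return {"type": "StartTag", "name": name, "data": {}, "selfClosing": False}


class Token:
    """Base class for tokenizer tokens (assumed for this example)."""
    __slots__ = ()


class Tag(Token):
    """Shared state for tag-related tokens."""
    __slots__ = ("name", "attributes", "selfClosing", "_attrName", "_attrValue")

    def __init__(self, name=""):
        self.name = name
        self.attributes = {}
        self.selfClosing = False
        self._attrName = None
        self._attrValue = ""

    def flushAttribute(self):
        # Commit the buffered attribute, keeping the first occurrence
        # if the same attribute name appears twice.
        if self._attrName is not None:
            self.attributes.setdefault(self._attrName, self._attrValue)
        self._attrName = None
        self._attrValue = ""


class StartTag(Tag):
    __slots__ = ()


class EndTag(Tag):
    __slots__ = ()


dict_token = make_dict_token("div")

token = StartTag("div")
token._attrName = "class"
token._attrValue = "a"
token.flushAttribute()
```

With objects, the token kind is expressed by the class (`isinstance(token, Tag)`) rather than by a `"type"` key, and attribute accumulation can live on the token itself instead of in the tokenizer states.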

jayaddison · 2020-12-30T17:45:21Z

NB: This isn't suitable for merge currently; it seems to introduce a noticeable performance penalty:

Before (2c19b98)

.........................................
html_parse_etree: Mean +- std dev: 202 ms +- 10 ms

After (8772408)

.........................................
html_parse_etree: Mean +- std dev: 226 ms +- 11 ms
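To put the two figures above in proportion, here is a small sketch that computes the relative slowdown and checks whether the mean ± one standard deviation ranges overlap. The helper name `relative_change` is made up for this example, and the overlap check is only a rough sanity check, not a statistical test.

```python
def relative_change(before_ms, after_ms):
    """Fractional change from before to after (positive means slower)."""
    return (after_ms - before_ms) / before_ms


# Figures copied from the benchmark output above.
before_mean, before_std = 202.0, 10.0
after_mean, after_std = 226.0, 11.0

slowdown = relative_change(before_mean, after_mean)

# Rough check: do the one-standard-deviation ranges stay apart?
separated = (before_mean + before_std) < (after_mean - after_std)

print(f"slowdown: {slowdown:.1%}")  # slowdown: 11.9%
print(f"ranges separated: {separated}")  # ranges separated: True
```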

gsnedders · 2021-01-04T16:37:44Z

> This builds upon #519, #520 and attempts to resolve #24.

FYI: I'll leave reviewing this until after those two.

jayaddison · 2022-12-24T01:03:21Z

Cleaning up some old / stale pull requests; please let me know if this changeset is considered worthwhile and I'll reopen if so.

jayaddison added 16 commits December 29, 2020 14:44

Consistency: consume a single character at a time during attribute name state

183d8a0

Refactor: pretranslate lowercase element and attribute names

2e86373

Restore self.currentToken safety check

8f96b17

Alternate approach: do not pretranslate temporary buffered data

a912842

Consistency: character consumption within double-escaped state

f9f370e

Refactor: use Python objects for tokens within tokenizer

bcee8bd

Introduce type hierarchy for tag-related tokens

67262f8

Simplify tag token construction

900bdaf

Refactor token attribution name/value accumulation

1f6cae9

Cleanup: remove leavingThisState / emitToken logic

695ac1c

Remove EmptyTag tokenizer token class

b1a444b

Refactor: pre-translate strings that are only used in lowercase context

bb7fabc

Cleanup: remove getattr anti-pattern

5f4ace9

Consistency: use camel-casing to correspond with existing codebase style

d744c86

Consistency: consume a single character at a time during attribute name state

1d62e69

Merge branch 'tokenizer/pretranslate-lowercase-names' into tokenizer/object-tokens

8772408

Linting cleanup

192cce0

Clarify method name: clearAttribute -> flushAttribute

e76e0dd

gsnedders mentioned this pull request Jan 5, 2021

Compile html5lib with Cython #524

Draft

Merge branch 'master' into tokenizer/object-tokens

da37332

jayaddison closed this Dec 24, 2022
