Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile html5lib with Cython #524

Draft
wants to merge 22 commits into
base: master
Choose a base branch
from
Draft

Conversation

gsnedders
Copy link
Member

Use Cython to make the parser quicker; see #445. This builds on top of #272.

This is a long way from ready to land, but shows potential. We probably also want to split this up so many of the earlier changes land first if they make performance sense without Cython.

The change to attribute representation especially might be of interest to #521 (cc @jayaddison).

There's also some API changes towards the end of the branch which we may well want to delay landing even beyond the rest of the Cython stuff.

This added a fair bit of complexity, and notable made the Phase classes
dynamically generated.

However, by doing this, we no longer include "process the
token using the rules for" phases in the debug log.
This allows us to define the argument as an int in Cython
This is in preparation for Cython using C function pointers for _state
The current _ascii module is a placeholder, because I accidentally
deleted the original implementation of it (but I needed to rewrite it
to be even quicker anyway!)
This makes duplicate checking much quicker, and avoids the
conversion to a dict at the end
# then stop
if self.chunkOffset != self.chunkSize:
while True:
# this really should be a slice of self.chunk, but https://github.com/cython/cython/issues/3536
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with cython.boundscheck(False), cython.wraparound(False):
c = self.chunk[i]
if c > 0x7F and opposite or c <= 0x7F and (bitmap[index(c)] & bit(c)):
cyrv += self.chunk[self.chunkOffset:i]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref cython/cython#3887 (IIRC, this was a significant perf issue)

@jayaddison
Copy link
Contributor

Thanks @gsnedders - I see that the attribute names/values are no longer accessed via tuple/list indexes (a nice improvement), and that the element attributes are initialized as attribute maps (rather than being converted into them at the end of the process).

Those are similar to the approach taken in #521 as well. What's the general approach to review, quality/performance assurance and merge for each pull request? It looks tricky to review this changeset as one combined change (given the size of the diff) - could we create a list of the proposed changes (with dependencies / ordering between them) and queue them up for review?

@gsnedders
Copy link
Member Author

What's the general approach to review, quality/performance assurance and merge for each pull request? It looks tricky to review this changeset as one combined change (given the size of the diff) - could we create a list of the proposed changes (with dependencies / ordering between them) and queue them up for review?

Yeah, I was mostly just throwing this up as one PR so at least the branch exists somewhere that isn't my local disk!

Also not sure why it fails on 2.7 or 3.6 on CI; it doesn't fail locally!

@jgraham jgraham marked this pull request as draft March 10, 2023 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants