⚡ Optimize list membership check to set for HTML tags#216
Conversation
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
Reviewer's guide (collapsed on small PRs)Reviewer's GuideReplaces a per-tag list membership check with a constant set literal when determining DelimiterKind for certain HTML tags, improving membership-test performance in the chunker delimiter logic. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Since this membership set is constant, consider pulling it out into a named module-level constant (e.g.,
BLOCK_LEVEL_HTML_TAGS) so it is only created once and its intent is clearer at the call site.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Since this membership set is constant, consider pulling it out into a named module-level constant (e.g., `BLOCK_LEVEL_HTML_TAGS`) so it is only created once and its intent is clearer at the call site.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Pull request overview
Micro-optimization in the HTML delimiter patterns to use a set literal for membership checks when deciding block vs paragraph delimiter kinds.
Changes:
- Replace list literal with set literal in
tag in ...membership test for HTML block tag detection.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Optimizes DelimiterKind detection for specific HTML tags by switching from a list membership check to a set literal.
Changes:
- Replace
tag in [...]withtag in {...}for HTML “block” tag detection.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Optimizes HTML tag DelimiterKind selection in the chunker by switching from linear list membership checks to constant-time set-like membership.
Changes:
- Introduce an
HTML_BLOCK_TAGSconstant for block-level HTML tags. - Update
HTML_TAGS_PATTERNSgeneration to useHTML_BLOCK_TAGSfor membership checks.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| nestable=False, | ||
| ) | ||
|
|
||
| HTML_BLOCK_TAGS = frozenset({"html", "body", "main", "section", "article"}) |
💡 What: The optimization implemented
Replaced the
["html", "body", "main", "section", "article"]list with a set literal{"html", "body", "main", "section", "article"}for membership checking when determining the DelimiterKind for HTML tags.🎯 Why: The performance problem it solves$O(N)$ operations. In Python 3.12, membership checks against a constant set literal are optimized into $O(1)$ lookup time. Since this lookup is executed for every HTML tag in the list of tags, improving the lookup translates to an immediate performance improvement.
Checking if an item is in a list requires
frozensetat compile time, leading to constant📊 Measured Improvement:
Measured performance for 100,000 iterations:
[...]: ~0.63s(...): ~0.69s{...}: ~0.30sThe set lookup is approximately twice as fast as the list lookup.
PR created automatically by Jules for task 9663154682552657493 started by @bashandbone
Summary by Sourcery
Enhancements: