This repository was archived by the owner on Oct 16, 2025. It is now read-only.
Fix bug 1171957 - Rewrite scraper to use section importers #38
Merged
Add a parser and visitor that work with HTML fragments.
KumaScript acts as a special HTML text. This is a skeleton, to be fleshed out as the section-specific parsers are added.
- Issue declarations moved from mdn/models.py to mdn/issues.py
- Instances now have .issues property
- Visitor._to_cls is now .process, collects issues after processing
- Added Visitor.add_issue, .add_raw_issue to aid collecting issues
- 'unknown_kumascript' issue is now auto-collected
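A minimal sketch of the issue-collection pattern this commit describes. Only the names `.issues`, `.process`, `add_issue`, `add_raw_issue`, and `unknown_kumascript` come from the commit message; the `Element` class, the signatures, and the tuple layout are assumptions for illustration and may not match the code in this PR.

```python
class Element(object):
    """Hypothetical parsed element that records issues about itself."""

    def __init__(self, raw='', start=0):
        self.raw = raw
        self.start = start
        self.end = start + len(raw)
        self.issues = []  # assumed layout: (slug, start, end, params)

    def add_issue(self, slug, **params):
        self.issues.append((slug, self.start, self.end, params))


class UnknownKumaScript(Element):
    """Hypothetical element that flags itself as soon as it is created."""

    def __init__(self, raw='', start=0):
        super(UnknownKumaScript, self).__init__(raw=raw, start=start)
        self.add_issue('unknown_kumascript', name=raw.strip('{} '))


class Visitor(object):
    """Hypothetical visitor that gathers issues from the elements it builds."""

    def __init__(self):
        self.issues = []

    def add_raw_issue(self, issue):
        # Store an already-built issue tuple unchanged.
        self.issues.append(issue)

    def add_issue(self, slug, element, **params):
        # Build an issue from an element's position, then store it.
        self.add_raw_issue((slug, element.start, element.end, params))

    def process(self, cls, *args, **kwargs):
        # Create the element, then collect the issues it raised on itself.
        element = cls(*args, **kwargs)
        for issue in element.issues:
            self.add_raw_issue(issue)
        return element


visitor = Visitor()
macro = visitor.process(UnknownKumaScript, raw='{{Foo}}', start=10)
```

In this sketch the caller never has to ask for the `unknown_kumascript` issue; it arrives in `visitor.issues` because `process` copies it over after construction.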
Move from test_scrape.py to new base.py
In SpecName class and the SpecNameVisitor, lookup the referenced mdn_key and add an issue for unknown keys.
It makes sense for the visitors for SpecName, Spec2, and the specification description to be in the same module.
* Shared argument handling and validation, with generic 'kumascript_wrong_args' issue when the number of arguments is wrong.
* Add at least one test for each KumaScript class
* Where applicable, implement ks.to_html()
* Replace issue 'spec2_arg_count' with 'kumascript_wrong_args'
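A sketch of what shared argument validation could look like. The issue slug `kumascript_wrong_args` and the method name `to_html` come from the commit message; `min_args`, `max_args`, the issue layout, and the `DemoMacro` class are invented for illustration and are not taken from this PR.

```python
class KumaScript(object):
    """Hypothetical base class with shared argument-count validation."""

    min_args = 0  # assumed class-level bounds
    max_args = 0

    def __init__(self, args=None):
        self.args = list(args or [])
        self.issues = []
        self.validate()

    def validate(self):
        count = len(self.args)
        if not (self.min_args <= count <= self.max_args):
            # One generic issue replaces macro-specific ones.
            self.issues.append(('kumascript_wrong_args', {
                'name': type(self).__name__,
                'min': self.min_args,
                'max': self.max_args,
                'count': count,
            }))

    def to_html(self):
        # Default: the macro contributes no HTML.
        return ''


class DemoMacro(KumaScript):
    """Invented macro that takes exactly one argument."""

    min_args = max_args = 1

    def to_html(self):
        return '<code>%s</code>' % self.args[0] if self.args else ''


good = DemoMacro(['flex'])
bad = DemoMacro([])
```

With the bounds declared per class, each macro only states how many arguments it accepts and the base class reports the mismatch.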
Use new parsing method for specification description. Includes:
* Parsing HTML elements with no content, like <td></td>
* Add issue on using {{SpecName}} or {{Spec2}} in a description, but
still convert to HTML.
Drop content parsing spec tests, but retain those that exercise the visitors, to stay at 100% coverage.
Prepare for adding a data source to all elements:
* Change basic args to 'raw' and 'start' with defaults
* Initialize parent classes by keyword argument
* Change KumaScript naming to use 'canonical_name' class variable, defaulting to the class name, rather than init parameter
* Add default for KumaScript params 'args'
* Change KumaScript tree to explicit Known/UnknownKumaScript
* Update tests for new initialization
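The initialization changes listed in this commit might be sketched as follows. The names `raw`, `start`, `args`, `canonical_name`, and the Known/Unknown split come from the commit message; the `Element` base class, the `name` property, and `DemoMacro` are assumptions made for this example.

```python
class Element(object):
    """Hypothetical base element with defaulted 'raw' and 'start'."""

    def __init__(self, raw='', start=0):
        self.raw = raw
        self.start = start


class KumaScript(Element):
    canonical_name = None  # subclasses may override; None means "use class name"

    def __init__(self, args=None, **kwargs):
        # Parent classes are initialized by keyword argument.
        super(KumaScript, self).__init__(**kwargs)
        self.args = list(args or [])

    @property
    def name(self):
        return self.canonical_name or type(self).__name__


class KnownKumaScript(KumaScript):
    """Base for macros the parser recognizes."""


class UnknownKumaScript(KumaScript):
    """A macro with no dedicated class; its name is only known at parse time."""

    def __init__(self, name, **kwargs):
        super(UnknownKumaScript, self).__init__(**kwargs)
        self.canonical_name = name  # instance attribute shadows the class default


class CompatNightly(KnownKumaScript):
    pass  # name defaults to the class name


class DemoMacro(KnownKumaScript):
    canonical_name = 'demo_macro'  # invented example of an overridden name


nightly = CompatNightly(raw='{{CompatNightly}}', start=5)
unknown = UnknownKumaScript(name='Mystery', args=['1'])
```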
Same as mdn.scrape.PageVisitor.join_content, but without the StripNextSpace hack needed because '3D' needs to be parsed one way for compat features (text "3D") and another way for compat support (version "3" followed by inline text "D"). This is the annoyance that made this refactor seem like a good idea, so enjoy.
Centralize database access into a Data class, use in loading specifications.
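As a rough illustration of centralizing lookups behind one object, the sketch below caches each result the first time it is requested. Only the class name `Data` comes from the commit message; the method name `lookup_specification` and the injected `fetch` callable (standing in for a Django model query) are assumptions.

```python
class Data(object):
    """Hypothetical lookup service that caches results per key."""

    def __init__(self, fetch_specification):
        # fetch_specification stands in for a database query.
        self._fetch_specification = fetch_specification
        self._specifications = {}

    def lookup_specification(self, mdn_key):
        # Query once per key; misses (None) are cached as well.
        if mdn_key not in self._specifications:
            self._specifications[mdn_key] = self._fetch_specification(mdn_key)
        return self._specifications[mdn_key]


calls = []


def fake_fetch(mdn_key):
    # Invented in-memory "database" for the example.
    calls.append(mdn_key)
    return {'CSS3 Backgrounds': 42}.get(mdn_key)


data = Data(fake_fetch)
first = data.lookup_specification('CSS3 Backgrounds')
second = data.lookup_specification('CSS3 Backgrounds')
missing = data.lookup_specification('No Such Spec')
```

Passing the fetch function in keeps the sketch independent of Django, which is also the kind of seam that would allow a different backend later.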
compat_feature_grammar is used to parse <td> elements containing compatibility feature data, and CompatBaseVisitor is used to extract the data and add feature-specific issues. The feature ID and slug lookup is converted to a method on the Data class.
PageVisitor.visit_compat_row_cell is still used for support cells, so most is retained. Add the raw contents as cell['raw'], and re-parse when identified as a feature cell. Drop a coverage test and some unused scrape code.
Fix issues from reinstalling project on a new laptop:
* Add ignores that were in my global .gitignore
* Add link to install documentation on wiki
* Fix reference to `make install-jslint`
Move mdn.scrape.is_fake_id to utils.is_new_id, in preparation for using it in mdn.data and mdn.compatibility
Skeleton of compatibility support cell subparser:
- Custom grammar for components of support cells
- Data lookup and normalization for versions and supports
- Normalization of version numbers
- Parsing of version-only cells
- Add {{CompatNightly}} KumaScript parser (no args version)
- Flesh out rest of support subparser, port over tests from test_scrape
Replace visitor.cell_to_support with reparsing and extracting versions and support with CompatSupportVisitor. Drop unused code and invalid tests.
Simplify page_grammar, since it doesn't have to parse support data, and drop related tests. Drop most support cell tests.
Small fixes with small impact:
- Move support grammar outside of test setup
- Alpha-sort imports
- Move visitor.scope to class-level variable
- s/attritbutes/attributes
Using the specification in visitor._attribute_validation_by_tag, drop unexpected attributes and raise issues on select unexpected or missing attributes. Use a strict whitelist for MDN content visitors.
The footnote subparser ends up using the same grammar as compat features. The visitor has slightly different behaviour than the monolithic scraper:
- Footnotes split by <br> tags are handled correctly
- <span> tags are not dropped (to be fixed)
Most parse errors appear to be incorrectly nested HTML or unknown HTML tags. These get reported at the top-level tag that contains the problematic element. Adjust the message for halt_import, and try to find the inner element that is causing the parse error.
These are the HTML elements used on the remaining MDN feature pages. There may be more on non-feature pages.
Before, rule text_item did not match { ... } because it was avoiding
matching KumaScript. Now it uses a negative lookahead to just match
single curly braces.
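The negative-lookahead idea can be shown with Python's `re` module. This is a stand-in, not the grammar rule from this PR: the pattern and the helper name `leading_text` are assumptions, and the real `text_item` rule lives in a PEG grammar.

```python
import re

# Match ordinary characters, or a single "{" that is not the start of "{{".
# "<" is excluded so that text stops where an HTML tag would begin.
TEXT_ITEM = re.compile(r'(?:[^{<]|\{(?!\{))+')


def leading_text(source):
    """Return the plain text at the start of source, stopping before '{{' or '<'."""
    match = TEXT_ITEM.match(source)
    return match.group(0) if match else ''
```

For example, `leading_text('a { b } c {{Macro}}')` returns `'a { b } c '`: the single braces are consumed as text, and matching stops at the KumaScript opener.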
This only appears on: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/select as: <option value="value2" selected>Value 2</option>
If the page appears to have content but none was extracted, then add an issue. Was on https://developer.mozilla.org/en-US/docs/Web/API/Document/execCommand before I fixed the page.
Before, code assumed there was always an active footnote ID when a <pre> section was encountered in the footnote area.
On https://developer.mozilla.org/en-US/docs/Web/API/KeyboardEvent, a colspan of 6 is used when only 5 columns remain in table.
On https://developer.mozilla.org/en-US/docs/Web/API/AudioBufferSourceNode (and maybe others), the <div> containing the compatibility table is closed after the footnotes and an <h3>. This warns that something appears to be wrong.
Instead of blacklisting HTML elements that should not be allowed, default to no elements allowed and whitelist the ones that are allowed. This adds the tag_dropped issue for unexpected elements. Also, handle a nested <table> inside a compatibility table, by moving finalization to .to_feature_dict() and warning about <table>, <tr>, and embedded <td> elements.
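A small sketch of the default-deny approach. The issue slug `tag_dropped` is from the commit message; the contents of `ALLOWED_TAGS`, the function name, and the issue layout are invented for illustration.

```python
# Assumed whitelist; the real list in this PR may differ.
ALLOWED_TAGS = frozenset(['a', 'br', 'code', 'p', 'td', 'th'])


def filter_tags(tags, allowed=ALLOWED_TAGS):
    """Keep whitelisted tag names and report every other tag as dropped."""
    kept, issues = [], []
    for tag in tags:
        if tag in allowed:
            kept.append(tag)
        else:
            # Anything not explicitly allowed is dropped and reported.
            issues.append(('tag_dropped', {'tag': tag}))
    return kept, issues


kept, issues = filter_tags(['p', 'blink', 'code'])
```

The design point is that a new, unanticipated element produces an issue by default instead of passing through silently.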
Optionally pass a Data() object into scrape_page, and visually check that it gets passed into sub-visitors and extractors (it does). Remove some unneeded *args.
Change to top-down function order, rewrite and add comments, simplify some code.
Add expected scopes to KumaScript processing, and add issue 'unexpected kumascript' if the parsing scope is unexpected. Remove some issue types that are now too specific. Also, change calling signature of KumaScript._make_issue to take a **kwargs argument.
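One way to picture the scope check, under stated assumptions: `expected_scopes`, the scope strings, and the issue layout are hypothetical, and the slug is written with an underscore to follow the other issue names in this PR. Only `_make_issue` taking `**kwargs` is taken from the commit message.

```python
class KumaScript(object):
    """Hypothetical macro base class that knows where it may appear."""

    expected_scopes = frozenset()  # assumed class-level declaration

    def __init__(self, scope=None):
        self.scope = scope
        self.issues = []
        if scope not in self.expected_scopes:
            self.issues.append(self._make_issue(
                'unexpected_kumascript',
                expected_scopes=sorted(self.expected_scopes)))

    def _make_issue(self, slug, **kwargs):
        # Common parameters first, then issue-specific extras.
        params = {'name': type(self).__name__, 'scope': self.scope}
        params.update(kwargs)
        return (slug, params)


class CompatNightly(KumaScript):
    expected_scopes = frozenset(['compatibility support'])  # invented scope name


ok = CompatNightly(scope='compatibility support')
misplaced = CompatNightly(scope='footnote')
```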
Rebased on top of PR #37, which fixes an issue that would have prevented practical testing. No more rebases expected.
Here's the issue counts for the rewritten importer. "Old Count" is the current importer running on browsercompat.herokuapp.com, and "New Count" is the local import in my dev box. Highlights:
I'll start opening issues for the items that will require importer fixes, but won't start submitting code until this PR is merged.
change to throwing NotImplemented ?
@jwhitlock and I did a code walk-through today. Aside from this one comment, I agreed with the code and reasoning behind it. r+ after the |
Use a more neutral name for base class of Visitor and Extractor
Ensure that classes derived from Extractor implement required methods
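A common way to enforce this in Python is for the base class to raise `NotImplementedError` from each required hook. The sketch below assumes hook names (`setup_extract`, `walk`, `extracted_data`) and a `TextCollector` subclass that are illustrative only; they are not confirmed to match the `Extractor` in this PR. Note that `NotImplemented` is a constant intended for binary operator methods, while `NotImplementedError` is the exception to raise.

```python
class Extractor(object):
    """Hypothetical base class; subclasses must override the three hooks."""

    def extract(self, elements):
        self.setup_extract()
        for element in elements:
            self.walk(element)
        return self.extracted_data()

    def setup_extract(self):
        raise NotImplementedError('.setup_extract method is not implemented')

    def walk(self, element):
        raise NotImplementedError('.walk method is not implemented')

    def extracted_data(self):
        raise NotImplementedError('.extracted_data method is not implemented')


class TextCollector(Extractor):
    """Invented subclass that implements every required hook."""

    def setup_extract(self):
        self.items = []

    def walk(self, element):
        self.items.append(element)

    def extracted_data(self):
        return {'items': self.items}


result = TextCollector().extract(['a', 'b'])

try:
    Extractor().extract([])
    base_raised = False
except NotImplementedError:
    base_raised = True
```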
groovecoder added a commit that referenced this pull request on Sep 2, 2015: Fix bug 1171957 - Rewrite scraper to use section importers
Previously, a single PEG grammar and data processor (`mdn/scrape.py`) was used to extract data from an MDN page. This PR changes this significantly.

It fixes these bugs:

The changes are done as a series of commits with passing tests, but it may be better to review it on a file-by-file basis:

- `mdn/visitor.py` - Base classes set up the interface for converting a parsed grammar (`Visitor`) and processing a converted document (`Extractor`)
- `mdn/data.py` - `Data` provides database lookup and caching services to the parsers, with the (remote) possibility of using the API instead of direct database access through Django models.
- `mdn/html.py` - Provides a grammar for parsing well-formed HTML body sections, which `HTMLVisitor` converts into elements representing the HTML parts.
- `mdn/kumascript.py` - Augments the HTML grammar with KumaScript macros, which `KumaVisitor` converts into elements representing the macros, their data, and their API representations.
- `mdn/specifications.py` - `SpecSectionExtractor` takes a Specifications section that has been converted into elements, and finds the elements representing specifications data. It passes these to `SpecNameVisitor`, `Spec2Visitor`, and `SpecDescVisitor`, which re-parse and extract the data.
- `mdn/compatibility.py` - `CompatSectionExtractor` takes a Browser Compatibility section that has been converted into elements, and finds the elements representing compatibility data. These are re-parsed with further customized grammars, and passed to `CompatFeatureVisitor`, `CompatSupportVisitor`, and `CompatFootnoteVisitor` to extract the data.
- `mdn/scrape.py` - With most of the data extraction functionality moved to other files, `scrape_page` is left to convert a raw MDN page into elements, using `PageExtractor` to collect them into sections and pass the relevant ones to `SpecSectionExtractor` and `CompatSectionExtractor`.

Other changes are included, such as moving issue definitions from `mdn/models.py` to `mdn/issues.py`, and some documentation cleanup.