Skip to content
This repository was archived by the owner on Oct 16, 2025. It is now read-only.

Fix bug 1171957 - Rewrite scraper to use section importers#38

Merged
groovecoder merged 70 commits into
mdn:masterfrom
jwhitlock:1171957_refactor_scraper
Sep 2, 2015
Merged

Fix bug 1171957 - Rewrite scraper to use section importers#38
groovecoder merged 70 commits into
mdn:masterfrom
jwhitlock:1171957_refactor_scraper

Conversation

@jwhitlock
Copy link
Copy Markdown
Contributor

Previously, a single PEG grammar and data processor (mdn/scrape.py) was used to extract data from an MDN page. This PR changes this significantly.

It fixes these bugs:

  • bug 1171957 - Refactor scraper into sub-parsers
  • bug 1174808 - Text between "Browser compatiblilty" header and the compat table breaks import tool
  • bug 1175177 - Work correctly on pages with swapped compat and spec sections
  • bug 1183593 - Parse links in specification comments

The changes are done as a series of commits with passing tests, but it may be better to review it on a file-by-file basis:

  • mdn/visitor.py - Base classes setup the interface for converting a parsed grammar (Visitor) and processing a converted document (Extractor)
  • mdn/data.py - Data provides database lookup and caching services to the parsers, with the (remote) possibility of using the API instead of direct database access through Django models.
  • mdn/html.py - Provides a grammar for parsing well-formed HTML body sections, which HTMLVisitor converts into elements representing the HTML parts.
  • mdn/kumascript.py - Augments the HTML grammar with KumaScript macros, which KumaVisitor converts into elements representing the macros, their data, and their API representations.
  • mdn/specifications.py - SpecSectionExtractor takes a Specifications section that has been converted into elements, and finds the elements representing specifications data. It passes these to SpecNameVisitor, Spec2Visitor, and SpecDescVisitor, which re-parse and extract the data.
  • mdn/compatibility.py - CompatSectionExtractor takes a Browser Compatibility section that has been converted into elements, and finds the elements representing compatibility data. These are re-parsed with further customized grammars, and passed to CompatFeatureVisitor, CompatSupportVisitor, and CompatFootnoteVisitor to extract the data.
  • mdn/scrape.py - With most of the data extraction functionality moved to other files, scrape_page is left to convert a raw MDN page into elements, using PageExtractor to collect them into sections and pass the relevant ones to SpecSectionExtractor and CompatSectionExtractor.

Other changes are included, such as moving issue definitions from mdn/model.py to mdn/issues.py, and some documentation cleanup.

jwhitlock and others added 29 commits August 17, 2015 19:14
Add a parser and visitor that work with HTML fragments.
KumaScript acts as a special HTML text.  This is a skeleton, to be
fleshed out as the section-specific parsers are added.
- Issue declarations moved from mdn/models.py to mdn/issues.py
- Instances now have .issues property
- Visitor._to_cls is now .process, collects issues after processing
- Added Visitor.add_issue, .add_raw_issue to aid collecting issues
- 'unknown_kumascript' issue is now auto-collected
Move from test_scrape.py to new base.py
In SpecName class and the SpecNameVisitor, lookup the referenced mdn_key
and add an issue for unknown keys.
It makes sense for the visitors for SpecName, Spec2, and the
specification description should be in the same module.
* Shared argument handing and validation, with generic
  'kumascript_wrong_args' issue when the number of arguments are wrong.
* Add at least one test for each KumaScript class
* Where applicable, implement ks.to_html()
* Replace issue 'spec2_arg_count' with 'kumascript_wrong_args'
Use new parsing method for specification description.  Includes:
* Parsing HTML elements with no content, like <td></td>
* Add issue on using {{SpecName}} or {{Spec2}} in a description, but
  still convert to HTML.
Drop content parsing spec tests, but retain those that exercise the
visitors, to stay at 100% coverage.
Prepare for adding a data source to all elements:
* Change basic args to 'raw' and 'start' with defaults
* Initialize parent classes by keyword argument
* Change KumaScript naming to use 'canonical_name' class variable,
  defaulting to the class name, rather than init parameter
* Add default for KumaScript params 'args'
* Change KumaScript tree to explicit Known/UnknownKumaScript
* Update tests for new initialization
Same as mdn.scrape.PageVisitor.join_content, but without the
StripNextSpace hack needed because '3D' needs to be parsed one way for
compat features (text "3D") and another way for compat support
(version "3" followed by inline text "D").  This is the annoyance that
made this refactor seem like a good idea, so enjoy.
Centralize database access into a Data class, use in loading
specifications.
compat_feature_grammar is used to parse <td> elements containing
compatibility feature data, and CompatBaseVisitor is used to extract the
data and add feature-specific issues.  The feature ID and slug lookup is
converted to a method on the Data class.
PageVisitor.visit_compat_row_cell is still used for support cells, so
most is retained.  Add the raw contents as cell['raw'], and re-parse
when identified as a feature cell. Drop a coverage test and some unused
scrape code.
Fix issues from reinstalling project on a new laptop:

* Add ignores that were in my global .gitignore
* Add link to install documentation on wiki
* Fix reference to `make install-jslint`
Move mdn.scrape.is_fake_id to utils.is_new_id, in preparation to using
in mdn.data and mdn.compatibility
Skeleton of compatability support cell subparser:
- Custom grammar for components of support cells
- Data lookup and normalization for versions and supports
- Normalization of version numbers
- Parsing of version-only cells
- Add {{CompatNightly}} KumaScript parser (no args version)
- Flesh out rest of support subparser, port over tests from test_scrape
Replace visitor.cell_to_support with reparsing and extracting versions
and support with CompatSupportVisitor. Drop unused code and invalid
tests.
Simplify page_grammar, since it doesn't have to parse support data,
and drop related tests. Drop most support cell tests.
Small fixes with small impact:
- Move support grammar outside of test setup
- Alpha-sort imports
- Move visitor.scope to class-level variable
- s/attritbutes/attributes
Using the specification in visitor._attribute_validation_by_tag,
drop unexpected attributes and raise issues on select unexpected or
missing attributes.  Use a strict whitelist for MDN content visitors.
The footnote subparser ends up using the same grammar as compat
features.  The visitor has slightly different behaviour than the
monolithic scraper:
- Footnotes split by <br> tags are handled correctly
- <span> tags are not dropped - to be fixed.
Most parse errors appear to be incorrectly nested HTML or unknown HTML
tags. These get reported at the top-level tag that contains the
problematic element. Adjust the message for halt_import, and try to find
the inner element that is causing the parse error.
These are the HTML elements used on the remaining MDN feature pages.
There may be more on non-feature pages.
Before, rule text_item did not match { ... } because it was avoiding
matching KumaScript.  Now it uses a negative lookahead to just match
single curly braces.
This only appears on:
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/select
as:
<option value="value2" selected>Value 2</option>
If the page appears to have content but none was extracted, then add an
issue. Was on
https://developer.mozilla.org/en-US/docs/Web/API/Document/execCommand
before I fixed the page.
Before, code assumed there was always an active footnote ID when a <pre>
section was encountered in the footnote area.
On
https://developer.mozilla.org/en-US/docs/Web/API/AudioBufferSourceNode
(and maybe others), the <div> containing the compatibility table is
closed after the footnotes and an <h3>.  This warns that something
appears to be wrong.
Instead of blacklisting HTML elements that should not be allowed,
default to no elements allowed and whitelist the ones that are allowed.
This adds the tag_dropped issue for unexpected elements.

Also, handle a nested <table> inside a compatibility table, by moving
finalization to .to_feature_dict() and warning about <table>, <tr>, and
embedded <td> elements.
Optionally pass a Data() object into scrape_page, and visually check
that it gets passed into sub-visitors and extractors (it does).  Remove
some unneeded *args.
Change to top-down function order, rewrite and add comments, simplify
some code.
Add expected scopes to KumaScript processing, and add issue 'unexpected
kumascript' if the parsing scope is unexpected.  Remove some issue types
that are now too specific.

Also, change calling signature of KumaScript._make_issue to take a
**kwargs argument.
@jwhitlock
Copy link
Copy Markdown
Contributor Author

Rebased on top of PR #37, which fixes an issue that would have prevented practical testing. No more rebases expected.

@jwhitlock
Copy link
Copy Markdown
Contributor Author

Here's the issue counts for the rewritten importer. "Old Count" is the current importer running on browsercompat.herokuapp.com, and "New Count" is the local import in my dev box. Highlights:

  • section_skipped, section_missed, and halt_import drop to 0. The rewritten importer can continue parsing, but generates issues such as tag_dropped (unexpected HTML elements), skipped_content (such as paragraphs before a compat table), no_data (usually, whole document is wrapped in a <div>) and skipped_h3 (<h3> after compat table) instead.
  • Because more content is parsed, there are a lot more unknown_kumascript, inline_text, unexpected_attribute errors
  • New unexpected_kumascript for macro use in unexpected sections

I'll start opening issues for the items that will require importer fixes, but won't start submitting code until this PR is merged.

Issue Slug Old Count New Count
Total 1977 5607
section_skipped 873 0
unknown_version 324 244
unknown_kumascript 246 1926
inline_text 190 771
halt_import 113 0
unexpected_attribute 72 229
footnote_no_id 32 122
tag_dropped 20 767
skipped_h3 19 98
footnote_missing 19 74
section_missed 10 0
footnote_multiple 10 25
failed_download 8 8
unknown_browser 7 24
exception 7 1
footnote_unused 6 65
span_dropped 5 0
compatgeckodesktop_unknown 5 7
spec_h2_name 2 8
spec_h2_id 2 9
spec2_converted 2 6
second_footnote 2 0
missing_attribute 2 21
specname_not_kumascript 1 9
skipped_content 0 478
feature_header 0 4
spec2_omitted 0 3
specname_omitted 0 1
specname_converted 0 1
footnote_feature 0 16
cell_out_of_bounds 0 1
unknown_spec 0 6
spec_mismatch 0 6
unexpected_kumascript 0 485
no_data 0 63
kumascript_wrong_args 0 129

Comment thread mdn/visitor.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to throwing NotImplemented ?

@groovecoder
Copy link
Copy Markdown
Contributor

@jwhitlock and I did a code walk-through today. Aside from this one comment, I agreed with the code and reasoning behind it. r+ after the NotImplemented fix.

Use a more neutral name for base class or Visitor and Extractor
Ensure that classes derived from Extractor implement required methods
groovecoder added a commit that referenced this pull request Sep 2, 2015
Fix bug 1171957 - Rewrite scraper to use section importers
@groovecoder groovecoder merged commit fdc12cb into mdn:master Sep 2, 2015
@jwhitlock jwhitlock deleted the 1171957_refactor_scraper branch September 3, 2015 16:04
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants