Fix bug 1171957 - Rewrite scraper to use section importers by jwhitlock · Pull Request #38 · mdn/browsercompat

jwhitlock · 2015-08-17T17:52:34Z

Previously, a single PEG grammar and data processor (mdn/scrape.py) was used to extract data from an MDN page. This PR changes this significantly.

It fixes these bugs:

bug 1171957 - Refactor scraper into sub-parsers
bug 1174808 - Text between "Browser compatiblilty" header and the compat table breaks import tool
bug 1175177 - Work correctly on pages with swapped compat and spec sections
bug 1183593 - Parse links in specification comments

The changes are done as a series of commits with passing tests, but it may be better to review it on a file-by-file basis:

mdn/visitor.py - Base classes setup the interface for converting a parsed grammar (Visitor) and processing a converted document (Extractor)
mdn/data.py - Data provides database lookup and caching services to the parsers, with the (remote) possibility of using the API instead of direct database access through Django models.
mdn/html.py - Provides a grammar for parsing well-formed HTML body sections, which HTMLVisitor converts into elements representing the HTML parts.
mdn/kumascript.py - Augments the HTML grammar with KumaScript macros, which KumaVisitor converts into elements representing the macros, their data, and their API representations.
mdn/specifications.py - SpecSectionExtractor takes a Specifications section that has been converted into elements, and finds the elements representing specifications data. It passes these to SpecNameVisitor, Spec2Visitor, and SpecDescVisitor, which re-parse and extract the data.
mdn/compatibility.py - CompatSectionExtractor takes a Browser Compatibility section that has been converted into elements, and finds the elements representing compatibility data. These are re-parsed with further customized grammars, and passed to CompatFeatureVisitor, CompatSupportVisitor, and CompatFootnoteVisitor to extract the data.
mdn/scrape.py - With most of the data extraction functionality moved to other files, scrape_page is left to convert a raw MDN page into elements, using PageExtractor to collect them into sections and pass the relevant ones to SpecSectionExtractor and CompatSectionExtractor.

Other changes are included, such as moving issue definitions from mdn/model.py to mdn/issues.py, and some documentation cleanup.

Add a parser and visitor that work with HTML fragments.

KumaScript acts as a special HTML text. This is a skeleton, to be fleshed out as the section-specific parsers are added.

- Issue declarations moved from mdn/models.py to mdn/issues.py - Instances now have .issues property - Visitor._to_cls is now .process, collects issues after processing - Added Visitor.add_issue, .add_raw_issue to aid collecting issues - 'unknown_kumascript' issue is now auto-collected

Move from test_scrape.py to new base.py

In SpecName class and the SpecNameVisitor, lookup the referenced mdn_key and add an issue for unknown keys.

It makes sense for the visitors for SpecName, Spec2, and the specification description should be in the same module.

* Shared argument handing and validation, with generic 'kumascript_wrong_args' issue when the number of arguments are wrong. * Add at least one test for each KumaScript class * Where applicable, implement ks.to_html() * Replace issue 'spec2_arg_count' with 'kumascript_wrong_args'

Use new parsing method for specification description. Includes: * Parsing HTML elements with no content, like <td></td> * Add issue on using {{SpecName}} or {{Spec2}} in a description, but still convert to HTML.

Drop content parsing spec tests, but retain those that exercise the visitors, to stay at 100% coverage.

Prepare for adding a data source to all elements: * Change basic args to 'raw' and 'start' with defaults * Initialize parent classes by keyword argument * Change KumaScript naming to use 'canonical_name' class variable, defaulting to the class name, rather than init parameter * Add default for KumaScript params 'args' * Change KumaScript tree to explicit Known/UnknownKumaScript * Update tests for new initialization

Same as mdn.scrape.PageVisitor.join_content, but without the StripNextSpace hack needed because '3D' needs to be parsed one way for compat features (text "3D") and another way for compat support (version "3" followed by inline text "D"). This is the annoyance that made this refactor seem like a good idea, so enjoy.

Centralize database access into a Data class, use in loading specifications.

compat_feature_grammar is used to parse <td> elements containing compatibility feature data, and CompatBaseVisitor is used to extract the data and add feature-specific issues. The feature ID and slug lookup is converted to a method on the Data class.

PageVisitor.visit_compat_row_cell is still used for support cells, so most is retained. Add the raw contents as cell['raw'], and re-parse when identified as a feature cell. Drop a coverage test and some unused scrape code.

Fix issues from reinstalling project on a new laptop: * Add ignores that were in my global .gitignore * Add link to install documentation on wiki * Fix reference to `make install-jslint`

Move mdn.scrape.is_fake_id to utils.is_new_id, in preparation to using in mdn.data and mdn.compatibility

Skeleton of compatability support cell subparser: - Custom grammar for components of support cells - Data lookup and normalization for versions and supports - Normalization of version numbers - Parsing of version-only cells

- Add {{CompatNightly}} KumaScript parser (no args version) - Flesh out rest of support subparser, port over tests from test_scrape

Replace visitor.cell_to_support with reparsing and extracting versions and support with CompatSupportVisitor. Drop unused code and invalid tests.

Simplify page_grammar, since it doesn't have to parse support data, and drop related tests. Drop most support cell tests.

Small fixes with small impact: - Move support grammar outside of test setup - Alpha-sort imports - Move visitor.scope to class-level variable - s/attritbutes/attributes

Using the specification in visitor._attribute_validation_by_tag, drop unexpected attributes and raise issues on select unexpected or missing attributes. Use a strict whitelist for MDN content visitors.

The footnote subparser ends up using the same grammar as compat features. The visitor has slightly different behaviour than the monolithic scraper: - Footnotes split by <br> tags are handled correctly - <span> tags are not dropped - to be fixed.

Most parse errors appear to be incorrectly nested HTML or unknown HTML tags. These get reported at the top-level tag that contains the problematic element. Adjust the message for halt_import, and try to find the inner element that is causing the parse error.

These are the HTML elements used on the remaining MDN feature pages. There may be more on non-feature pages.

Regex from http://stackoverflow.com/questions/249791/

Before, rule text_item did not match { ... } because it was avoiding matching KumaScript. Now it uses a negative lookahead to just match single curly braces.

This only appears on: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/select as: <option value="value2" selected>Value 2</option>

If the page appears to have content but none was extracted, then add an issue. Was on https://developer.mozilla.org/en-US/docs/Web/API/Document/execCommand before I fixed the page.

Before, code assumed there was always an active footnote ID when a <pre> section was encountered in the footnote area.

On https://developer.mozilla.org/en-US/docs/Web/API/KeyboardEvent, a colspan of 6 is used when only 5 columns remain in table.

On https://developer.mozilla.org/en-US/docs/Web/API/AudioBufferSourceNode (and maybe others), the <div> containing the compatibility table is closed after the footnotes and an <h3>. This warns that something appears to be wrong.

Instead of blacklisting HTML elements that should not be allowed, default to no elements allowed and whitelist the ones that are allowed. This adds the tag_dropped issue for unexpected elements. Also, handle a nested <table> inside a compatibility table, by moving finalization to .to_feature_dict() and warning about <table>, <tr>, and embedded <td> elements.

Optionally pass a Data() object into scrape_page, and visually check that it gets passed into sub-visitors and extractors (it does). Remove some unneeded *args.

Change to top-down function order, rewrite and add comments, simplify some code.

Add expected scopes to KumaScript processing, and add issue 'unexpected kumascript' if the parsing scope is unexpected. Remove some issue types that are now too specific. Also, change calling signature of KumaScript._make_issue to take a **kwargs argument.

jwhitlock · 2015-08-18T00:16:24Z

Rebased on top of PR #37, which fixes an issue that would have prevented practical testing. No more rebases expected.

jwhitlock · 2015-08-19T17:12:28Z

Here's the issue counts for the rewritten importer. "Old Count" is the current importer running on browsercompat.herokuapp.com, and "New Count" is the local import in my dev box. Highlights:

section_skipped, section_missed, and halt_import drop to 0. The rewritten importer can continue parsing, but generates issues such as tag_dropped (unexpected HTML elements), skipped_content (such as paragraphs before a compat table), no_data (usually, whole document is wrapped in a <div>) and skipped_h3 (<h3> after compat table) instead.
Because more content is parsed, there are a lot more unknown_kumascript, inline_text, unexpected_attribute errors
New unexpected_kumascript for macro use in unexpected sections

I'll start opening issues for the items that will require importer fixes, but won't start submitting code until this PR is merged.

Issue Slug	Old Count	New Count
Total	1977	5607
section_skipped	873	0
unknown_version	324	244
unknown_kumascript	246	1926
inline_text	190	771
halt_import	113	0
unexpected_attribute	72	229
footnote_no_id	32	122
tag_dropped	20	767
skipped_h3	19	98
footnote_missing	19	74
section_missed	10	0
footnote_multiple	10	25
failed_download	8	8
unknown_browser	7	24
exception	7	1
footnote_unused	6	65
span_dropped	5	0
compatgeckodesktop_unknown	5	7
spec_h2_name	2	8
spec_h2_id	2	9
spec2_converted	2	6
second_footnote	2	0
missing_attribute	2	21
specname_not_kumascript	1	9
skipped_content	0	478
feature_header	0	4
spec2_omitted	0	3
specname_omitted	0	1
specname_converted	0	1
footnote_feature	0	16
cell_out_of_bounds	0	1
unknown_spec	0	6
spec_mismatch	0	6
unexpected_kumascript	0	485
no_data	0	63
kumascript_wrong_args	0	129

groovecoder · 2015-09-02T18:28:01Z

change to throwing NotImplemented ?

groovecoder · 2015-09-02T20:24:28Z

@jwhitlock and I did a code walk-through today. Aside from this one comment, I agreed with the code and reasoning behind it. r+ after the NotImplemented fix.

Use a more neutral name for base class or Visitor and Extractor

Ensure that classes derived from Extractor implement required methods

Fix bug 1171957 - Rewrite scraper to use section importers

jwhitlock assigned jezdez Aug 17, 2015

jwhitlock and others added 29 commits August 17, 2015 19:14

bug 1171957 - Add HTML fragment parser

3c40d43

Add a parser and visitor that work with HTML fragments.

bug 1171957 - Add KumaScript parsing

7c4a278

KumaScript acts as a special HTML text. This is a skeleton, to be fleshed out as the section-specific parsers are added.

bug 1171957 - Make base TestCase for MDN tests

67a91fe

Move from test_scrape.py to new base.py

bug 1171957 - Add specification lookup to SpecName

d049e4e

In SpecName class and the SpecNameVisitor, lookup the referenced mdn_key and add an issue for unknown keys.

bug 1171957 - Use SpecNameVisitor in scrape

fb4fab4

bug 1171957 - Drop redundant tests

b3647d5

bug 1171957 - Renaming specification module

5e82b2b

It makes sense for the visitors for SpecName, Spec2, and the specification description should be in the same module.

bug 1171957 - Add Spec2 parsing

d6de559

bug 1171957 - Refactor specification description

d068f58

Use new parsing method for specification description. Includes: * Parsing HTML elements with no content, like <td></td> * Add issue on using {{SpecName}} or {{Spec2}} in a description, but still convert to HTML.

bug 1171957 - Drop redundant spec tests

343ce0a

Drop content parsing spec tests, but retain those that exercise the visitors, to stay at 100% coverage.

bug 1171957 - Add mdn.data.Data

f8d0608

Centralize database access into a Data class, use in loading specifications.

bug 1171957 - Move slugify to mdn.utils

9a5efb4

bug 1171957 - Use CompatFeatureVisitor in scrape

61a0940

PageVisitor.visit_compat_row_cell is still used for support cells, so most is retained. Add the raw contents as cell['raw'], and re-parse when identified as a feature cell. Drop a coverage test and some unused scrape code.

bug 1171957 - Remove redundant feature tests

c7c65d3

bug 1171957 - QA from reinstalling

be3cff3

Fix issues from reinstalling project on a new laptop: * Add ignores that were in my global .gitignore * Add link to install documentation on wiki * Fix reference to `make install-jslint`

bug 1171957 - Move feature dict to visitor

33f6d79

bug 1171957 - Move to mdn.utils.is_new_id

3f63b8e

Move mdn.scrape.is_fake_id to utils.is_new_id, in preparation to using in mdn.data and mdn.compatibility

bug 1171957 - Start on compat support subparser

b2f200f

Skeleton of compatability support cell subparser: - Custom grammar for components of support cells - Data lookup and normalization for versions and supports - Normalization of version numbers - Parsing of version-only cells

bug 1171957 - Complete compat support subparser

691f951

- Add {{CompatNightly}} KumaScript parser (no args version) - Flesh out rest of support subparser, port over tests from test_scrape

bug 1171957 - Use CompatSupportVisitor in scrape

de57a7f

Replace visitor.cell_to_support with reparsing and extracting versions and support with CompatSupportVisitor. Drop unused code and invalid tests.

bug 1171957 - Drop redundant code and tests

9807fcc

Simplify page_grammar, since it doesn't have to parse support data, and drop related tests. Drop most support cell tests.

bug 1171957 - Fixing small annoyances

7004326

Small fixes with small impact: - Move support grammar outside of test setup - Alpha-sort imports - Move visitor.scope to class-level variable - s/attritbutes/attributes

bug 1171957 - Add element attribute validation

0b19d2a

Using the specification in visitor._attribute_validation_by_tag, drop unexpected attributes and raise issues on select unexpected or missing attributes. Use a strict whitelist for MDN content visitors.

bug 1171957 - Add footnote subparser

9b58421

The footnote subparser ends up using the same grammar as compat features. The visitor has slightly different behaviour than the monolithic scraper: - Footnotes split by <br> tags are handled correctly - <span> tags are not dropped - to be fixed.

jwhitlock added 17 commits August 17, 2015 19:14

bug 1171957 - Warn on empty spec cells

659fba1

bug 1171957 - Add remaining HTML elements

960b104

These are the HTML elements used on the remaining MDN feature pages. There may be more on non-feature pages.

bug 1171957 - Handle escaped quotes

252dcd2

Regex from http://stackoverflow.com/questions/249791/

bug 1171957 - Handle { ... } text

39222e3

Before, rule text_item did not match { ... } because it was avoiding matching KumaScript. Now it uses a negative lookahead to just match single curly braces.

bug 1171957 - Handle boolean attributes

595bae3

This only appears on: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/select as: <option value="value2" selected>Value 2</option>

bug 1171957 - Error if no content parsed

edeeada

If the page appears to have content but none was extracted, then add an issue. Was on https://developer.mozilla.org/en-US/docs/Web/API/Document/execCommand before I fixed the page.

bug 1171957 - Warn on <pre> element w/o footnote

3717acb

Before, code assumed there was always an active footnote ID when a <pre> section was encountered in the footnote area.

bug 1171957 - Warn if cell extends beyond table

26292d7

On https://developer.mozilla.org/en-US/docs/Web/API/KeyboardEvent, a colspan of 6 is used when only 5 columns remain in table.

bug 1171957 - Warn if range excluded from footnote

779939b

On https://developer.mozilla.org/en-US/docs/Web/API/AudioBufferSourceNode (and maybe others), the <div> containing the compatibility table is closed after the footnotes and an <h3>. This warns that something appears to be wrong.

bug 1171957 - Cleanup visitor initialization

d56405e

Optionally pass a Data() object into scrape_page, and visually check that it gets passed into sub-visitors and extractors (it does). Remove some unneeded *args.

bug 1171957 - Code cleanup in mdn/scrape.py

c41e8f2

Change to top-down function order, rewrite and add comments, simplify some code.

bug 1171957 - Minor code cleanup

50231f3

bug 1171957 - Push section lookup into SpecName

5941e11

bug 1171957 - Generate HTML grammar and handlers

de7be5a

jwhitlock mentioned this pull request Aug 25, 2015

bug 957802 - set db logging to warn during tests mdn/kuma#3443

Merged

jwhitlock assigned groovecoder and unassigned jezdez Aug 28, 2015

groovecoder reviewed Sep 2, 2015
View reviewed changes

Comment thread mdn/visitor.py Outdated

Copy link
Copy Markdown

Contributor

groovecoder Sep 2, 2015

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to throwing NotImplemented ?

jwhitlock added 3 commits September 2, 2015 16:23

bug 1171957 - Remove unused issues

62f7a19

bug 1171957 - Rename BaseVisitor to Recorder

c68d6a3

Use a more neutral name for base class or Visitor and Extractor

fix bug 1171957 - Use NotImplementedError

9a8a193

Ensure that classes derived from Extractor implement required methods

groovecoder added a commit that referenced this pull request Sep 2, 2015

Merge pull request #38 from jwhitlock/1171957_refactor_scraper

fdc12cb

Fix bug 1171957 - Rewrite scraper to use section importers

groovecoder merged commit fdc12cb into mdn:master Sep 2, 2015

jwhitlock deleted the 1171957_refactor_scraper branch September 3, 2015 16:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix bug 1171957 - Rewrite scraper to use section importers#38

Fix bug 1171957 - Rewrite scraper to use section importers#38
groovecoder merged 70 commits into
mdn:masterfrom
jwhitlock:1171957_refactor_scraper

jwhitlock commented Aug 17, 2015

Uh oh!

jwhitlock commented Aug 18, 2015

Uh oh!

jwhitlock commented Aug 19, 2015

Uh oh!

groovecoder Sep 2, 2015

Uh oh!

groovecoder commented Sep 2, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jwhitlock commented Aug 17, 2015

Uh oh!

jwhitlock commented Aug 18, 2015

Uh oh!

jwhitlock commented Aug 19, 2015

Uh oh!

groovecoder Sep 2, 2015

Choose a reason for hiding this comment

Uh oh!

groovecoder commented Sep 2, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants