fix bug 1170199 - detect pages with no data #35

jwhitlock · 2015-06-16T01:51:16Z

This PR builds on previous PRs, and can be rebased

If a page doesn't contain a specification or browser compatibility header, or the CompatibilityTable KumaScript macro, then consider it "No Data" and don't parse. Speeds up a full re-parse by 30%, and reduces issues. The biggest drop is in doc_parse_error, from 405 to 17. section_skipped also dropped from 2200 to 1604. This is less expected, but is due to pages like Web/API/PowerManager/removeWakeLockListener, which has a section named "Specification" (not "Specifications"), no browser compatibility data, and no actual specification data to speak of.

Issue Slug	Old Count	New Count
Total	7091	6103
section_skipped	2200	1604
inline_text	1896	1896
unexpected_attribute	488	488
doc_parse_error	405	17
spec2_converted	283	283
unknown_kumascript	272	272
skipped_h3	262	262
unknown_version	221	221
specname_converted	193	193
halt_import	155	151
footnote_missing	106	106
specname_not_kumascript	98	98
missing_attribute	98	98
footnote_no_id	75	75
footnote_unused	59	59
unknown_browser	42	42
section_missed	40	40
spec_h2_id	38	38
footnote_multiple	38	38
spec_h2_name	32	32
tag_dropped	22	22
unknown_spec	21	21
span_dropped	12	12
second_footnote	12	12
spec_mismatch	10	10
compatgeckodesktop_unknown	10	10
footnote_feature	2	2
spec2_wrong_kumascript	1	1

Issues are collected in a new model, with:
- A slug to identify similar issues
- The start and end position in the MDN page
- A dictionary of context data

The slugs are also used to look up:
- Issue severity:
  - Warning: probably does not impact importing
  - Error: data is present but not imported
  - Critical: unable to finish importing page
- Short and long description templates, using the per-issue context data
- User-contributed content-fixing hints on: https://wiki.mozilla.org/MDN/Development/CompatibilityTables/Importer

The parse output has been changed. meta.scrape.raw.issues now contains both the original issues (now Warnings) and errors (Error and Critical). The HTML description no longer appears in the meta section, but is generated at display time.

Finally, the importer list shows a count of Warnings / Error / Critical instead of the binary has / doesn't have errors.
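As a rough illustration of the issue model described above, a minimal Python sketch could look like the following. The names `Issue` and `ISSUE_TEMPLATES`, the severity assigned to each slug, and the template wording are all hypothetical; only the slugs themselves and the three severity levels come from this PR.

```python
from dataclasses import dataclass, field

WARNING, ERROR, CRITICAL = "warning", "error", "critical"

# Hypothetical lookup: slug -> (severity, short description template).
# The slugs appear in this PR; severities and wording are assumed.
ISSUE_TEMPLATES = {
    "unknown_kumascript": (ERROR, "Unknown KumaScript {name}"),
    "halt_import": (CRITICAL, "Unable to finish importing page"),
}


@dataclass
class Issue:
    slug: str      # identifies similar issues
    start: int     # start position in the MDN page
    end: int       # end position in the MDN page
    context: dict = field(default_factory=dict)

    @property
    def severity(self):
        return ISSUE_TEMPLATES[self.slug][0]

    @property
    def short_description(self):
        # The description is built at display time from the context data.
        return ISSUE_TEMPLATES[self.slug][1].format(**self.context)
```

Keeping only the slug, positions, and context in the stored data means the description text can change without re-parsing pages.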

Some pages don't have the expected structure, such as <h2> headings, resulting in the doc rule not matching. Turn this into a doc_parse_error issue, instead of an Exception, for further processing. Example: https://developer.mozilla.org/en-US/docs/Navigation_timing

Add issues specname_blank_key, spec2_wrong_kumascript, and spec2_arg_count to replace assertions resulting in an exception issue. Also, refactor visitor.unknown_kumascript_issue into visitor.kumascript_issue, so it can be used in more KumaScript issue reporting.

https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Using_CSS_gradients has a Bengali (bn-BD) translation that broke processing, prompting several changes:
- MDN paths increased to 1024 characters
- Gather localized titles of MDN pages from metadata
- Migrate featurepage.status and issue.slug choices from previous work
- When scraping, add localized names of the page to the target feature if it is not set as a canonical name
- When a task encounters an unexpected status, assert with the human-friendly name
- Add STATUS_NO_DATA as an 'already fetched' state
- Display IRI (with unicode) instead of URI (with percent-encoded unicode) on the sample feature page

* django-allauth 0.19.1 -> 0.20.0 - Bump email field size
* django-sortedm2m 0.9.5 -> 0.10.0 - Better customization hooks
* flake8 2.4.0 -> 2.4.1 - pip blacklist
* pylibmc 1.4.2 -> 1.4.3 - Threading fixes
* static3 branch -> 0.6.1 - Explicit UTF-8 reads
* virtualenv 12.1.1 -> 13.0.1 - pip, setuptools upgrades

Instead of tokenizing HTML content as a sequence of tags, parse into a tree of nested content. This allows more nuanced handling of HTML, such as removing tags (including <a> in feature names), and more detailed messages.

The list of import issues can be filtered by clicking one of 11 pre-selected topic filters, or by searching for a partial MDN path, such as:

http://localhost:8000/importer/?topic=Web/CSS/-

Also switched to sorting by MDN slug, and other cleanup for the importer list view.

If a page doesn't include strings identifying a Browser compatibility section, a Specifications section, or the CompatibilityTable KumaScript macro, then stop parsing. Detection is done with plain string matching, not regex, so it should be fast, but it may produce false positives.
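The string-based check might be sketched as below. This is only an illustration: the function name `may_have_data` and the exact marker strings are assumptions, since the importer's actual list of strings is not shown in this PR.

```python
# Assumed marker strings, for illustration only.
DATA_MARKERS = (
    "Browser compatibility",
    "Specifications",
    "CompatibilityTable",
)


def may_have_data(raw_page):
    """Return True if any marker string appears in the raw page content.

    Plain substring tests are cheap, but a page that merely mentions one
    of the markers in prose will also match (a false positive).
    """
    return any(marker in raw_page for marker in DATA_MARKERS)
```

A page for which this returns False would be treated as "No Data" and skipped before the full parse.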

jwhitlock · 2015-06-16T15:27:14Z

Wrong project

fix bug 1170196 - Filter issues by topic, other fixes

jwhitlock added 30 commits May 27, 2015 09:34

bug 1154349 - Moar speeling fixes

53485d0

bug 1154349 - Better non-ascii handling

d5d9a3a

Consistently use text_type, so that pages like https://developer.mozilla.org/en-US/docs/Web/CSS/font-variant will import correctly.

bug 1154349 - Allow MDN import w/ warnings

eecf8d1

Add the "Commit" button when no errors or critical errors are found on the page.

bug 1154349 - Fix link to view_features endpoint

94ed9dc

bug 1154349 - Fix canonical feature scrape

112d3ab

Canonical features loaded from the database had the name {'zxx': 'name'}, exposing the internal implementation and breaking the display. Fixed so name is 'name' instead.

bug 1154349 - Update tools/gather_import_issues.py

ee95ac2

* Rename tools/gather_import_errors.py
* Use new 'issues' terminology and endpoints
* During import, count up to 100% rather than down to 0%
* Write CSV with new data fields

bug 1154349 - Add issues summary page

c0b4dac

fix bug 1154349 - Intentional file reads

8208656

It appears TravisCI is now using 'ascii' as their default system encoding, causing reads of UTF-8 files to fail. Use io.open with explicit encoding to fix.
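A minimal sketch of the explicit-encoding read; the helper name `read_utf8` is hypothetical and not taken from the repository:

```python
import io


def read_utf8(path):
    """Read a text file as UTF-8, regardless of the system default encoding."""
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()
```

Because the encoding is passed explicitly, the result does not depend on the locale of the machine running the tests.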

bug 1132658 - Bump requirements

b53bf97

Bump requirements as suggested by requires.io

bug 1132658 - Remove redundant resource name

2e529fb

When displaying the diff of a collection, adding the resource name to the output is redundant, since it appears in the JSON API output.

bug 1132658 - Improve diffs for translated strings

c30da33

Order the locales in the translated string (en, then in alpha order), so that the diffs highlight the changes rather than random dict sorting.
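One way to get that ordering, as a sketch (the function name `ordered_locales` is an assumption, not the project's code):

```python
def ordered_locales(translations):
    """Return locale keys with 'en' first, then the rest alphabetically."""
    # False sorts before True, so 'en' gets the smallest sort key.
    return sorted(translations, key=lambda locale: (locale != "en", locale))
```

Serializing the translated string in this order keeps the output stable between runs, so a diff shows only real changes.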

bug 1132658 - Data ID for section uses subpath

a0635f1

It is likely that the subpath will be set but not the number.

bug 1132658 - Refactor common tool code

0beadca

Combine common tool code into tools/common.py, to reduce duplication and simplify command line parsing.

bug 1132658 - Add Collection.load_collection

391fb81

Collection.load_collection will copy the resources from another collection. The resources can then be modified, and a CollectionChangeset used to update the original Collection.

bug 1132658 - Allow 'false' as a canonical name

517e3db

Previously, "false" was interpreted as the JSON value for False.

bug 1132658 - Add tools/mirror_mdn_features.py

19c478b

New tool for gathering MDN pages and creating or updating the branch features related to them. Includes converting to canonical names, better handling of long slugs, and adding URLs to pages w/o compatibility data.

bug 1132658 - Improve importer search by URL

200ca82

When searching the importer by MDN URL, drop the querystring and fragment automatically, rather than returning an error.
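Dropping the query string and fragment can be sketched with the standard library; the helper name `strip_query_and_fragment` is hypothetical:

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return the URL with its query string and fragment removed."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))
```

The cleaned URL can then be matched against stored MDN page URLs.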

bug 1132658 - Improve admin for mdn app

3b06d68

Use readonly_fields to prevent loading the whole database in the admin.

bug 1132658 - Add "No Data" status for pages

0d996fa

Add a page status for "page imported w/o compat data"

bug 1132658 - Fix canonical feature names

e15e251

In the scrape-constructed view_feature, use name="canonical", rather than name={"zxx": "canonical"}.

bug 1132658 - Fix typos in issue templates

3000b7a

bug 1132658 - Download MDN pages w/o cache

c4a476f

bug 1132658 - Small importer UI fixes

276e8ad

- Issues name is a link in importer/issues
- "Download MDN page" rather than "Download MDN pages"

bug 1132658 - Fix template for failed_download

9362902

bug 1132658 - Handle text with partial quotes

5984331

https://developer.mozilla.org/en-US/docs/Web/CSS/@viewport/max-zoom has the text '"max-zoom" descriptor, which doesn't end in quotes.

bug 1132658 - Handle redirect on $json

27674a0

jwhitlock added 24 commits May 27, 2015 09:34

fix bug 1132658 - Convert download failure to issue

d2e2a0a

When mdn.tasks.fetch_translation gets a non-200 response, report as a failed_download issue and continue. Previously, an exception was also raised, which halted tools/import_mdn.py.

bug 1154349 - Counts in gather_import_issues.py

98e6db1

gather_import_issues.py was counting end position, not issues.

bug 1132677 - Handle by-name ES1/ES3 specs

831bfe8

When 'ECMAScript 1st/3rd Edition.' appears as the Specification name, convert to ES1/ES3. Other text is an issue, not a parse error.

bug 1132677 - Remove issue from specrow tests

0ed56e7

Add sample specifications so that specrow tests aren't complicated with 'unknown_spec' issues.

bug 1132677 - Narrow context for SpecName issues

9562a2d

Instead of the context being the whole first <td> element, just highlight the errored KumaScript.

fix bug 1132677 - Handle text in spec status cell

9a5af35

Text (instead of {{Spec2(key)}}) in a specification status will result in a warning-level issue, instead of halting scraping. The status text is not parsed, but instead the SpecName key is used.

bug 1134373 - Refactor issue tracking

2a62cdb

Store issues in the FeaturePage JSON rather than load when deserializing the data. Speeds up page views, importing, and gathering import issues.

bug 1134373 - Switch spec description to tokenizer

ad52164

Switch the specification description (third column of the table) from a simple parser to a tokenizer pattern (like compatibility cells) so that inline KumaScript can be transformed into plain text.

fix bug 1134373 - Handle spec desc. KumaScript

031dd8d

Translate KumaScript found on MDN in specification descriptions to plain HTML. Warn on use of {{Spec2}}, which is probably a typo for {{SpecName}}.

bug 1139619 - Rearrange grammar to reuse elements

5cab7a0

Rearrange the parsing grammar so that generic elements are no longer tied to cell parsing but can be reused in footnote parsing.

bug 1139619 - Convert footnote to tokenizer

3ce864c

Tokenize the footnote section, in a similar way to the specification description and compatibility cells.

bug 1139619 - Better issue context

1b6414f

Change line numbers from 0-index to 1-index. If the issue context includes the last lines of the file (which may not be terminated with a line feed), include in context.
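The line-number handling could be sketched as follows. The function `issue_context` and its return shape are assumptions for illustration; the importer's own implementation is not shown here.

```python
def issue_context(text, start, end):
    """Return (first_line, last_line, lines) for the span text[start:end].

    Line numbers are 1-based. splitlines() keeps a final line that has no
    trailing line feed, so that line can still appear in the context.
    """
    lines = text.splitlines()
    first_line = text.count("\n", 0, start) + 1
    # end is exclusive, so the last character of the span is at end - 1.
    last_line = text.count("\n", 0, max(start, end - 1)) + 1
    return first_line, last_line, lines[first_line - 1:last_line]
```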

bug 1139619 - Discard empty footnote paragraphs

70ea8fa

Don't add footnote_no_id issue for empty or whitespace-only footnote paragraphs. If a multi-line footnote includes an empty paragraph, drop it.

bug 1139619 - Warn on <pre> attributes

8cf371e

When a <pre> tag in the footnotes includes attributes, warn the user that they won't be in the output. Previously, class="brush:css" was special cased, but the parser now defaults to no attributes, since we'll have to reject unexpected markup eventually.

bug 1139619 - Handle <code> blocks in footnote

c4c2edf

Also adjust code block handling so that generics like _consume_attributes can be used to parse it.

bug 1139619 - Refactor common HTML parsing

b006e85

Standardize parsing of common HTML elements with optional attributes

bug 1139619 - Rename kumascript_to_html

a0b6982

Rename kumascript_to_text to kumascript_to_html, to reflect that it outputs HTML instead of plain text.

bug 1139619 - Gracefully handle unknown issue slug

bf247e6

Previously, unknown issue slugs would result in KeyError exceptions. Now, a generic message is printed, allowing for easier debugging and fixing.
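The fallback might look like this sketch. `KNOWN_ISSUES`, `GENERIC_TEMPLATE`, `describe_issue`, and the message wording are hypothetical; only the `footnote_missing` slug comes from this PR.

```python
# Hypothetical template table; the wording is assumed.
KNOWN_ISSUES = {
    "footnote_missing": "Footnote [{footnote_id}] not found.",
}
GENERIC_TEMPLATE = "Unknown issue '{slug}' (context: {context})"


def describe_issue(slug, context):
    """Format a known issue, or fall back to a generic message."""
    template = KNOWN_ISSUES.get(slug)
    if template is None:
        # Avoid a KeyError so the page can still be displayed and debugged.
        return GENERIC_TEMPLATE.format(slug=slug, context=context)
    return template.format(**context)
```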

bug 1132269 - Handle canonical feature names

5d99cbb

In the sample JS display and the browse app, handle features with canonical names, which are encoded as strings rather than objects.

fix bug 1170196 - Add importer issue details

7d808b6

Add an endpoint for viewing all the pages affected by an issue. Reformat the issue summary page to link to the detail page.

jwhitlock closed this Jun 16, 2015

jwhitlock added a commit that referenced this pull request Aug 4, 2015

Merge pull request #35 from jwhitlock/1170196_filter_issues

63aaa2e

fix bug 1170196 - Filter issues by topic, other fixes
