This repository has been archived by the owner on Nov 10, 2017. It is now read-only.

fix bug 1170199 - detect pages with no data #35

Closed
wants to merge 54 commits into from


jwhitlock
Contributor

This PR builds on previous PRs, and can be rebased

If a page contains neither a specification or browser compatibility header nor the CompatibilityTable KumaScript macro, consider it "No Data" and don't parse it. This speeds up a full re-parse by 30% and reduces issues. The biggest drop is in doc_parse_error, from 405 to 17. section_skipped also dropped, from 2200 to 1604. This was less expected, but is due to pages like Web/API/PowerManager/removeWakeLockListener, which has a section named "Specification" (not "Specifications"), no browser compatibility data, and no actual specification data to speak of.

| Issue Slug | Old Count | New Count |
|---|---|---|
| Total | 7091 | 6103 |
| section_skipped | 2200 | 1604 |
| inline_text | 1896 | 1896 |
| unexpected_attribute | 488 | 488 |
| doc_parse_error | 405 | 17 |
| spec2_converted | 283 | 283 |
| unknown_kumascript | 272 | 272 |
| skipped_h3 | 262 | 262 |
| unknown_version | 221 | 221 |
| specname_converted | 193 | 193 |
| halt_import | 155 | 151 |
| footnote_missing | 106 | 106 |
| specname_not_kumascript | 98 | 98 |
| missing_attribute | 98 | 98 |
| footnote_no_id | 75 | 75 |
| footnote_unused | 59 | 59 |
| unknown_browser | 42 | 42 |
| section_missed | 40 | 40 |
| spec_h2_id | 38 | 38 |
| footnote_multiple | 38 | 38 |
| spec_h2_name | 32 | 32 |
| tag_dropped | 22 | 22 |
| unknown_spec | 21 | 21 |
| span_dropped | 12 | 12 |
| second_footnote | 12 | 12 |
| spec_mismatch | 10 | 10 |
| compatgeckodesktop_unknown | 10 | 10 |
| footnote_feature | 2 | 2 |
| spec2_wrong_kumascript | 1 | 1 |

Issues are collected in a new model, with:
- A slug to identify similar issues,
- The start and end position in the MDN page
- A dictionary of context data

The slugs are also used to look up:
- Issue severity:
  - Warning: probably does not impact importing
  - Error: data is present but not imported
  - Critical: unable to finish importing page
- Short and long description templates, using the per-issue context
  data.
- User-contributed content-fixing hints on:

https://wiki.mozilla.org/MDN/Development/CompatibilityTables/Importer
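
As a rough illustration of the issue model described above, the sketch below pairs a per-issue record (slug, start, end, context) with a slug-keyed lookup table for severity and description templates. It is not the importer's actual Django model: the `Issue` class, the `ISSUES` table, and the severity and wording assigned to `unknown_browser` are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical severity levels, mirroring the three levels named above.
WARNING, ERROR, CRITICAL = 1, 2, 3

# Hypothetical lookup table: slug -> (severity, short template, long template).
# The severity and wording here are invented for illustration.
ISSUES = {
    "unknown_browser": (
        ERROR,
        "Unknown browser {name}",
        "The browser {name} is not in the database, so its data was not imported.",
    ),
}


@dataclass
class Issue:
    slug: str    # identifies similar issues
    start: int   # start position in the MDN page
    end: int     # end position in the MDN page
    context: dict = field(default_factory=dict)  # per-issue context data

    @property
    def severity(self):
        # Severity is looked up by slug rather than stored on the issue.
        return ISSUES[self.slug][0]

    def short_description(self):
        # Descriptions are rendered at display time from the context data.
        return ISSUES[self.slug][1].format(**self.context)
```

Keeping severity and templates in the lookup table, rather than on each stored issue, is what lets descriptions be generated at display time.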

The parse output has been changed. meta.scrape.raw.issues now contains
both the original issues (now Warnings) and errors (Error and Critical).
The HTML description no longer appears in the meta section, but is
generated at display time.

Finally, the importer list shows a count of Warnings / Error / Critical
instead of the binary has / doesn't have errors.
Consistently use text_type, so that pages like
https://developer.mozilla.org/en-US/docs/Web/CSS/font-variant
will import correctly.
Add the "Commit" button when no errors or critical errors are found on
the page.
Canonical features loaded from the database had the name
{'zxx': 'name'}, exposing the internal implementation and breaking the
display. Fixed so name is 'name' instead.
* Rename tools/gather_import_errors.py
* Use new 'issues' terminology and endpoints
* During import, count up to 100% rather than down to 0%
* Write CSV with new data fields
It appears TravisCI is now using 'ascii' as their default system
encoding, causing reads of UTF-8 files to fail.  Use io.open with
explicit encoding to fix.
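
A minimal sketch of the fix described above, assuming a helper named `read_text` (the name is hypothetical; the commit only says to use `io.open` with an explicit encoding):

```python
import io


def read_text(path):
    """Read a UTF-8 file without relying on the system default encoding."""
    # io.open accepts an encoding argument on both Python 2 and Python 3,
    # so the result is the same even when the default encoding is 'ascii'.
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```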
Bump requirements as suggested by requires.io
When displaying the diff of a collection, adding the resource name to
the output is redundant, since it appears in the JSON API output.
Order the locales in the translated string (en, then in alpha order), so
that the diffs highlight the changes rather than random dict sorting.
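
One way to get the locale ordering described above ("en" first, then alphabetical) is a two-part sort key. The function name `ordered_locales` is an assumption for this sketch, not the project's code:

```python
def ordered_locales(translations):
    """Return locale codes with 'en' first, then the rest alphabetically."""
    # False sorts before True, so 'en' always comes first; the second
    # element of the key orders the remaining locales alphabetically.
    return sorted(translations, key=lambda locale: (locale != "en", locale))
```

Because the order is deterministic, two versions of the same translated string diff only where the text actually changed.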
It is likely that the subpath will be set but not the number.
Combine common tool code into tools/common.py, to reduce duplication and
simplify command line parsing.
Collection.load_collection will copy the resources from another
collection. The resources can then be modified, and a
CollectionChangeset used to update the original Collection.
Previously, "false" was interpreted as the JSON value for False.
New tool for gathering MDN pages and creating or updating the branch
features related to them.  Includes converting to canonical names,
better handling of long slugs, and adding URLs to pages w/o
compatibility data.
When searching the importer by MDN URL, drop the querystring and
fragment automatically, rather than returning an error.
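
Dropping the querystring and fragment can be sketched with the standard library's URL helpers. The function name is hypothetical, and the sketch uses the Python 3 module path (`urllib.parse`):

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return the URL with its querystring and fragment removed."""
    parts = urlsplit(url)
    # Rebuild the URL from scheme, host, and path only.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```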
Use readonly_fields to prevent loading the whole database in the admin.
Add a page status for "page imported w/o compat data"
In the scrape-constructed view_feature, use name="canonical", rather
than name={"zxx": "canonical"}.
Some pages don't have the expected structure, such as <h2> headings,
resulting in the doc rule not matching.  Turn this into a
doc_parse_error issue, instead of an Exception, for further processing.

Example:

https://developer.mozilla.org/en-US/docs/Navigation_timing
Add issues specname_blank_key, spec2_wrong_kumascript, and
spec2_arg_count to replace assertions resulting in an exception issue.
Also, refactor visitor.unknown_kumascript_issue into
visitor.kumascript_issue, so it can be used in more KumaScript issue
reporting.
- Issue name is a link in importer/issues
- "Download MDN page" rather than "Download MDN pages"
https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Using_CSS_gradients
has a Bengali (bn-BD) translation that broke processing, prompting several
changes:
- MDN paths increased to 1024 characters
- Gather localized titles of MDN pages from metadata
- Migrate featurepage.status and issue.slug choices from previous work
- When scraping, add localized names of the page to the target feature
  if it is not set as a canonical name
- When a task encounters an unexpected status, assert with the
  human-friendly name
- Add STATUS_NO_DATA as an 'already fetched' state
- Display IRI (with unicode) instead of URI (with percent-encoded
  unicode) on the sample feature page
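
The IRI-versus-URI distinction in the last item can be illustrated with percent-decoding. This is a display-only sketch under the assumption that decoding the whole URL is acceptable; the helper name `to_iri` and the example URL are hypothetical:

```python
from urllib.parse import unquote


def to_iri(uri):
    """Decode percent-encoded UTF-8 so the URL reads naturally."""
    # For display only: this also decodes reserved characters such as
    # %2F, so the result should not be used to make requests.
    return unquote(uri)
```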
When mdn.tasks.fetch_translation gets a non-200 response, report as a
failed_download issue and continue. Previously, an exception was also
raised, which halted tools/import_mdn.py.
gather_import_issues.py was counting end position, not issues.
When 'ECMAScript 1st/3rd Edition.' appears as the Specification name,
convert to ES1/ES3.  Other text is an issue, not a parse error.
Add sample specifications so that specrow tests aren't complicated with
'unknown_spec' issues.
Instead of the context being the whole first <td> element, just
highlight the errored KumaScript.
Text (instead of {{Spec2(key)}}) in a specification status will result
in a warning-level issue, instead of halting scraping.  The status text
is not parsed, but instead the SpecName key is used.
* django-allauth 0.19.1 -> 0.20.0 - Bump email field size
* django-sortedm2m 0.9.5 -> 0.10.0 - Better customization hooks
* flake8 2.4.0 -> 2.4.1 - pip blacklist
* pylibmc 1.4.2 -> 1.4.3 - Threading fixes
* static3 branch -> 0.6.1 - Explicit UTF-8 reads
* virtualenv 12.1.1 -> 13.0.1 - pip, setuptools upgrades
Store issues in the FeaturePage JSON rather than load when deserializing
the data.  Speeds up page views, importing, and gathering import issues.
Switch the specification description (third column of the table) from a
simple parser to a tokenizer pattern (like compatibility cells) so that
inline KumaScript can be transformed into plain text.
Translate KumaScript found on MDN in specification descriptions to plain
HTML. Warn on use of {{Spec2}}, which is probably a typo for
{{SpecName}}.
Rearrange the parsing grammar so that generic elements such as <p>,
</p>, and <br> are no longer tied to cell parsing but can be reused in
footnote parsing.
Tokenize the footnote section, in a similar way to the specification
description and compatibility cells.
Change line numbers from 0-index to 1-index.  If the issue context
includes the last lines of the file (which may not be terminated with a
line feed), include in context.
Don't add footnote_no_id issue for empty or whitespace-only footnote
paragraphs.  If a multi-line footnote includes an empty paragraph,
drop it.
When a <pre> tag in the footnotes includes attributes, warn the user
that they won't appear in the output.  Previously, class="brush:css" was
special-cased, but the default is now to drop all attributes, since we'll
have to reject unexpected markup eventually.
Also adjust code block handling so that generics like
_consume_attributes can be used to parse it.
Standardize parsing of common HTML elements with optional attributes
Rename kumascript_to_text to kumascript_to_html, to reflect that it
outputs HTML instead of plain text.
Previously, unknown issue slugs would result in KeyError exceptions.
Now, a generic message is printed, allowing for easier debugging and
fixing.
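
The fallback for unknown slugs might look like the sketch below. The `KNOWN_ISSUES` table, the `describe` function, and the template wording are assumptions for illustration, not the importer's real data:

```python
# Hypothetical slug -> description template table (wording invented).
KNOWN_ISSUES = {
    "footnote_missing": "Footnote [{footnote_id}] is referenced but not defined.",
}


def describe(slug, context):
    """Render an issue description, tolerating slugs missing from the table."""
    template = KNOWN_ISSUES.get(slug)
    if template is None:
        # Generic message instead of a KeyError, so the unknown slug and
        # its context are visible for debugging.
        return "Unknown issue '{}' with context {!r}".format(slug, context)
    return template.format(**context)
```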
Instead of tokenizing HTML content as a sequence of tags, parse into a
tree of nested content.  This allows more nuanced handling of HTML, such
as removing tags (<p> and <a> in feature names, <span> everywhere), and
more detailed messages.
In the sample JS display and the browse app, handle features with
canonical names, which are encoded as strings rather than objects.
The list of import issues can be filtered by clicking one of 11
pre-selected topic filters, or by searching for a partial MDN
path, such as:

http://localhost:8000/importer/?topic=Web/CSS/-

Also switched to sorting by MDN slug, and other cleanup for the importer
list view.
Add an endpoint for viewing all the pages affected by an issue.
Reformat the issue summary page to link to the detail page.
If a page doesn't include strings identifying a Browser compatibility
section, a Specifications section, or the CompatibilityTable KumaScript
macro, then stop parsing.  Detection uses plain string matching rather
than regular expressions, so it should be fast, at the cost of possible
false positives.
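
A minimal sketch of string-based detection, assuming three marker strings. The actual strings the importer checks are not shown in this PR, so `MARKERS` and `has_compat_data` are hypothetical:

```python
# Hypothetical marker strings; the real importer's list may differ.
MARKERS = (
    'id="Browser_compatibility"',
    'id="Specifications"',
    "{{CompatibilityTable",
)


def has_compat_data(raw_page):
    """Return True if any marker appears anywhere in the raw page source."""
    # Plain substring tests are cheap, but a marker that appears in prose
    # or a code sample would also match (a false positive).
    return any(marker in raw_page for marker in MARKERS)
```

With these markers, a page whose only heading is "Specification" (singular) is treated as having no data and is not parsed.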
@jwhitlock
Contributor Author

Wrong project

@jwhitlock jwhitlock closed this Jun 16, 2015
jwhitlock added a commit that referenced this pull request Aug 4, 2015
fix bug 1170196 - Filter issues by topic, other fixes