This repository has been archived by the owner on Nov 10, 2017. It is now read-only.
fix bug 1170199 - detect pages with no data #35
Closed
Issues are collected in a new model, with:
- A slug to identify similar issues,
- The start and end position in the MDN page
- A dictionary of context data

The slugs are also used to lookup:
- Issue severity:
  - Warning: probably does not impact importing
  - Error: data is present but not imported
  - Critical: unable to finish importing page
- Short and long description templates, using the per-issue context data.
- User-contributed content-fixing hints on:
  https://wiki.mozilla.org/MDN/Development/CompatibilityTables/Importer

The parse output has been changed. meta.scrape.raw.issues now contains both the original issues (now Warnings) and errors (Error and Critical). The HTML description no longer appears in the meta section, but is generated at display time.

Finally, the importer list shows a count of Warnings / Error / Critical instead of the binary has / doesn't have errors.
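A minimal sketch of the issue model described above, not the project's actual code. The slugs, severities, and templates in `ISSUE_TYPES` are hypothetical placeholders, as are the names `Issue`, `severity`, `short_description`, and `count_by_severity`; only the shape (slug, start, end, context, severity looked up by slug) comes from the commit message.

```python
from collections import namedtuple

WARNING, ERROR, CRITICAL = "warning", "error", "critical"

# Hypothetical registry: slug -> (severity, short description template).
# Real slugs and their severities live in the importer, not here.
ISSUE_TYPES = {
    "demo_warning": (WARNING, "Unexpected text: {text}"),
    "demo_error": (ERROR, "Data for {name} was not imported"),
    "demo_critical": (CRITICAL, "Unable to finish importing the page"),
}

# An issue records what happened (slug), where (start, end), and details (context).
Issue = namedtuple("Issue", ["slug", "start", "end", "context"])


def severity(issue):
    """Look up the severity from the slug, not from the issue itself."""
    return ISSUE_TYPES[issue.slug][0]


def short_description(issue):
    """Render the short template with the per-issue context data."""
    return ISSUE_TYPES[issue.slug][1].format(**issue.context)


def count_by_severity(issues):
    """Count Warnings / Errors / Criticals, as shown in the importer list."""
    counts = {WARNING: 0, ERROR: 0, CRITICAL: 0}
    for issue in issues:
        counts[severity(issue)] += 1
    return counts
```

Keeping severity and templates keyed by slug means a description can be regenerated at display time from the stored slug and context alone.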
Consistently use text_type, so that pages like https://developer.mozilla.org/en-US/docs/Web/CSS/font-variant will import correctly.
Add the "Commit" button when no errors or critical errors are found on the page.
Canonical features loaded from the database had the name {'zxx': 'name'}, exposing the internal implementation and breaking the display. Fixed so name is 'name' instead.
* Rename tools/gather_import_errors.py
* Use new 'issues' terminology and endpoints
* During import, count up to 100% rather than down to 0%
* Write CSV with new data fields
It appears TravisCI is now using 'ascii' as their default system encoding, causing reads of UTF-8 files to fail. Use io.open with explicit encoding to fix.
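The fix can be sketched as follows. `read_utf8` is a hypothetical helper name; `io.open` with an `encoding` argument is the standard library call named in the commit message and behaves the same on Python 2 and 3.

```python
import io


def read_utf8(path):
    """Read a text file as UTF-8, ignoring the system default encoding."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

Passing the encoding explicitly means the read no longer depends on the locale of the machine running the tests.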
Bump requirements as suggested by requires.io
When displaying the diff of a collection, adding the resource name to the output is redundant, since it appears in the JSON API output.
Order the locales in the translated string (en, then in alpha order), so that the diffs highlight the changes rather than random dict sorting.
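One way to get that ordering, as a sketch only: `order_locales` is a hypothetical name, and the input is assumed to be a plain dict mapping locale codes to translated text.

```python
from collections import OrderedDict


def order_locales(translations):
    """Return translations with 'en' first, then the rest alphabetically."""

    def sort_key(locale):
        # False sorts before True, so 'en' always comes first.
        return (locale != "en", locale)

    return OrderedDict(
        (locale, translations[locale])
        for locale in sorted(translations, key=sort_key)
    )
```

With a stable order, two serializations of the same data are identical, so a diff shows only real changes.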
It is likely that the subpath will be set but not the number.
Combine common tool code into tools/common.py, to reduce duplication and simplify command line parsing.
Collection.load_collection will copy the resources from another collection. The resources can then be modified, and a CollectionChangeset used to update the original Collection.
Previously, "false" was interpreted as the JSON value for False.
New tool for gathering MDN pages and creating or updating the branch features related to them. Includes converting to canonical names, better handling of long slugs, and adding URLs to pages w/o compatibility data.
When searching the importer by MDN URL, drop the querystring and fragment automatically, rather than returning an error.
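A sketch of the URL cleanup using the Python 3 standard library. `strip_query_and_fragment` is a hypothetical name, and the importer's real search code may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Drop '?query' and '#fragment', keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

For example, a URL pasted from the browser with `?raw=1#Specifications` appended would be reduced to the bare page URL before the lookup.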
Use readonly_fields to prevent loading the whole database in the admin.
Add a page status for "page imported w/o compat data"
In the scrape-constructed view_feature, use name="canonical", rather than name={"zxx": "canonical"}.
Some pages don't have the expected structure, such as <h2> headings, resulting in the doc rule not matching. Turn this into a doc_parse_error issue, instead of an Exception, for further processing. Example: https://developer.mozilla.org/en-US/docs/Navigation_timing
Add issues specname_blank_key, spec2_wrong_kumascript, and spec2_arg_count to replace assertions resulting in an exception issue. Also, refactor visitor.unknown_kumascript_issue into visitor.kumascript_issue, so it can be used in more KumaScript issue reporting.
- Issue name is a link in importer/issues
- "Download MDN page" rather than "Download MDN pages"
https://developer.mozilla.org/en-US/docs/Web/CSS/@viewport/max-zoom has the text '"max-zoom" descriptor, which doesn't end in quotes.
https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Using_CSS_gradients has a Bengali (bn-BD) translation that broke processing, prompting several changes:
- MDN paths increased to 1024 characters
- Gather localized titles of MDN pages from metadata
- Migrate featurepage.status and issue.slug choices from previous work
- When scraping, add localized names of the page to the target feature if it is not set as a canonical name
- When a task encounters an unexpected status, assert with the human-friendly name
- Add STATUS_NO_DATA as an 'already fetched' state
- Display IRI (with unicode) instead of URI (with percent-encoded unicode) on the sample feature page
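To illustrate the last item, here is a sketch of turning a percent-encoded URI path into its IRI form for display. `uri_to_iri_path` is a hypothetical name and the sample path is made up; `urllib.parse.unquote` is the Python 3 standard library call and decodes as UTF-8 by default.

```python
from urllib.parse import unquote


def uri_to_iri_path(path):
    """Decode percent-encoded UTF-8 so readers see the original characters."""
    return unquote(path)


# Made-up path containing one percent-encoded Bengali letter (U+0997).
print(uri_to_iri_path("/bn-BD/docs/%E0%A6%97"))
```

The decoded form is for display only; requests would still be made with the percent-encoded URI.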
When mdn.tasks.fetch_translation gets a non-200 response, report as a failed_download issue and continue. Previously, an exception was also raised, which halted tools/import_mdn.py.
gather_import_issues.py was counting end position, not issues.
When 'ECMAScript 1st/3rd Edition.' appears as the Specification name, convert to ES1/ES3. Other text is an issue, not a parse error.
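A sketch of that conversion, assuming the Specification cell contains exactly the two strings below. The names `ES_TEXT_TO_KEY` and `spec_key_from_text` are hypothetical, and the real importer may match the text differently.

```python
# Assumed exact cell text -> specification key.
ES_TEXT_TO_KEY = {
    "ECMAScript 1st Edition.": "ES1",
    "ECMAScript 3rd Edition.": "ES3",
}


def spec_key_from_text(text):
    """Return 'ES1' or 'ES3' for the known text, or None for anything else.

    A None result is where the caller would record an issue rather than
    raise a parse error.
    """
    return ES_TEXT_TO_KEY.get(text.strip())
```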
Add sample specifications so that specrow tests aren't complicated with 'unknown_spec' issues.
Instead of the context being the whole first <td> element, just highlight the errored KumaScript.
Text (instead of {{Spec2(key)}}) in a specification status will result in a warning-level issue, instead of halting scraping. The status text is not parsed, but instead the SpecName key is used.
* django-allauth 0.19.1 -> 0.20.0 - Bump email field size
* django-sortedm2m 0.9.5 -> 0.10.0 - Better customization hooks
* flake8 2.4.0 -> 2.4.1 - pip blacklist
* pylibmc 1.4.2 -> 1.4.3 - Threading fixes
* static3 branch -> 0.6.1 - Explicit UTF-8 reads
* virtualenv 12.1.1 -> 13.0.1 - pip, setuptools upgrades
Store issues in the FeaturePage JSON rather than load when deserializing the data. Speeds up page views, importing, and gathering import issues.
Switch the specification description (third column of the table) from a simple parser to a tokenizer pattern (like compatibility cells) so that inline KumaScript can be transformed into plain text.
Translate KumaScript found on MDN in specification descriptions to plain HTML. Warn on use of {{Spec2}}, which is probably a typo for {{SpecName}}.
Rearrange the parsing grammar so that generic elements such as <p>, </p>, and <br> are no longer tied to cell parsing but can be reused in footnote parsing.
Tokenize the footnote section, in a similar way to the specification description and compatibility cells.
Change line numbers from 0-index to 1-index. If the issue context includes the last lines of the file (which may not be terminated with a line feed), include in context.
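A sketch of 1-indexed context lines that also covers a final line with no trailing line feed. `issue_context_lines` is a hypothetical name, and `start` and `end` are assumed to be character offsets into the raw page text.

```python
def issue_context_lines(text, start, end):
    """Return (line_number, line) pairs, 1-indexed, overlapping text[start:end]."""
    result = []
    offset = 0
    # keepends=True keeps offsets exact; the last line may lack a line feed.
    for number, line in enumerate(text.splitlines(True), start=1):
        line_end = offset + len(line)
        if offset < end and line_end > start:
            result.append((number, line.rstrip("\r\n")))
        offset = line_end
    return result
```

Because the loop walks every line returned by `splitlines`, an unterminated last line is numbered and included like any other.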
Don't add footnote_no_id issue for empty or whitespace-only footnote paragraphs. If a multi-line footnote includes an empty paragraph, drop it.
When a <pre> tag in the footnotes includes attributes, warn the user that they won't be in the output. Previously, class="brush:css" was special-cased, but the default is now to drop all attributes, since unexpected markup will have to be rejected eventually.
Also adjust code block handling so that generics like _consume_attributes can be used to parse it.
Standardize parsing of common HTML elements with optional attributes
Rename kumascript_to_text to kumascript_to_html, to reflect that it outputs HTML instead of plain text.
Previously, unknown issue slugs would result in KeyError exceptions. Now, a generic message is printed, allowing for easier debugging and fixing.
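The fallback can be sketched like this. `TEMPLATES`, its single entry, `describe_issue`, and the wording of the generic message are all hypothetical; only the behavior (no KeyError for an unknown slug) comes from the commit message.

```python
# Hypothetical template registry keyed by issue slug.
TEMPLATES = {
    "demo_issue": "Something unexpected near {text}",
}


def describe_issue(slug, context):
    """Render the template for a slug, or a generic message if unknown."""
    template = TEMPLATES.get(slug)
    if template is None:
        # Keep the slug and context visible so the gap is easy to spot and fix.
        return "Unknown issue '{}' with context {!r}".format(slug, context)
    return template.format(**context)
```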
Instead of tokenizing HTML content as a sequence of tags, parse into a tree of nested content. This allows more nuanced handling of HTML, such as removing tags (<p> and <a> in feature names, <span> everywhere), and more detailed messages.
In the sample JS display and the browse app, handle features with canonical names, which are encoded as strings rather than objects.
The list of import issues can be filtered by clicking one of 11 pre-selected topic filters, or by searching for a partial MDN path, such as: http://localhost:8000/importer/?topic=Web/CSS/- Also switched to sorting by MDN slug, and other cleanup for the importer list view.
Add an endpoint for viewing all the pages affected by an issue. Reformat the issue summary page to link to the detail page.
If a page doesn't include strings identifying a Browser compatibility section, a Specifications section, or the CompatibilityTable KumaScript macro, then stop parsing. Detection uses plain string matching rather than regular expressions, so it should be fast, though it may produce false positives.
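A sketch of the string-based check. The exact marker strings and the names `DATA_MARKERS` and `may_have_compat_data` are assumptions for illustration; the importer's real markers may be different.

```python
# Assumed marker strings; any one of them means the page is worth parsing.
DATA_MARKERS = (
    "Browser compatibility",
    "Specifications",
    "CompatibilityTable",
)


def may_have_compat_data(raw_page):
    """Cheap substring test run before the full parse."""
    return any(marker in raw_page for marker in DATA_MARKERS)
```

A page that merely mentions one of these strings in prose would still be parsed, which is the false-positive case noted above; a page with none of them is treated as having no data.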
Wrong project
jwhitlock added a commit that referenced this pull request on Aug 4, 2015: fix bug 1170196 - Filter issues by topic, other fixes
This PR builds on previous PRs, and can be rebased.
If a page doesn't contain a specification or browser compatibility header, or the CompatibilityTable KumaScript macro, then consider it "No Data" and don't parse. Speeds up a full re-parse by 30%, and reduces issues. The biggest drop is in doc_parse_error, from 405 to 17. section_skipped also dropped from 2200 to 1604. This is less expected, but is due to pages like Web/API/PowerManager/removeWakeLockListener, which has a section named "Specification" (not "Specifications"), no browser compatibility data, and no actual specification data to speak of.