This repository was archived by the owner on Oct 16, 2025. It is now read-only.

fix bug 1182542 - Scrape deeply nested MDN pages, other fixes #36

Merged
groovecoder merged 4 commits into mdn:master from jwhitlock:1182542_deeply_nested
Aug 10, 2015

@jwhitlock
Contributor

Florian Scholz and the MDN team have reduced a lot of the importer issues, and have requested a re-scrape of MDN to see if new content has additional issues. This PR includes tool improvements that will help the process. This code has not been run against https://browsercompat.herokuapp.com, since we're trying to do code reviews before "production" pushes.

  • Fix bug 1182542 - Use recursive calls to $children to discover MDN URLs more than 5 levels deep.
  • Handle various error conditions with $children API
  • Make it easy to get fresh content with --no-cache
  • Re-scrape MDN with import_mdn.py.

If you want to run this locally:

  • Set up the browsercompat project
  • With a good network connection and power supply, run:
    • time tools/mirror_mdn_features.py - after about 60 minutes, it will prompt to make changes, then take 5 - 10 minutes to commit them. For me, it found 841 new pages, 19 changed, 969 deleted, and 6023 the same.
    • time tools/import_mdn.py - takes about 6.5 hours to parse 5877 pages.
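
The new / changed / deleted / same counts reported by the mirror step amount to a comparison between the pages already known and the pages just scraped. A minimal sketch of that bucketing, assuming each side can be reduced to a mapping from URL to some content fingerprint (the function name, the sample URLs, and the fingerprints are hypothetical and not taken from tools/mirror_mdn_features.py):

```python
def classify_pages(existing, scraped):
    """Bucket URLs by comparing two {url: fingerprint} mappings."""
    existing_urls = set(existing)
    scraped_urls = set(scraped)
    shared = existing_urls & scraped_urls
    return {
        # Present in the scrape only.
        "new": sorted(scraped_urls - existing_urls),
        # Known before, but missing from the scrape.
        "deleted": sorted(existing_urls - scraped_urls),
        # Present on both sides with different content.
        "changed": sorted(u for u in shared if existing[u] != scraped[u]),
        # Present on both sides with identical content.
        "same": sorted(u for u in shared if existing[u] == scraped[u]),
    }


# Illustrative data only.
existing = {"/docs/Web/CSS": "v1", "/docs/Web/API/Old": "v1", "/docs/Web/HTML": "v1"}
scraped = {"/docs/Web/CSS": "v1", "/docs/Web/HTML": "v2", "/docs/Web/API/New": "v1"}

buckets = classify_pages(existing, scraped)
counts = {name: len(urls) for name, urls in buckets.items()}
print(counts)
```

With the sample data each bucket holds exactly one URL.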

When a tool downloads data from MDN, it uses Tool.cached_download to store
a cached copy in a file. This is useful when debugging tools. However,
fresh data is often desired. The --no-cache option downloads a fresh
copy, even if the file exists.

Previously, cached files were stored in the data/ folder. This allows
specifying a subfolder, which is created on first use.
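
The caching behaviour described above can be sketched as a small standalone function. This is not the project's Tool.cached_download; the signature, the fetch callable, and the use_cache parameter are assumptions made for illustration, and the sketch is written for Python 3:

```python
import os
import tempfile


def cached_download(path, fetch, use_cache=True):
    """Return the content stored at path, calling fetch() when needed.

    fetch is a hypothetical zero-argument callable that returns bytes.
    """
    if use_cache and os.path.exists(path):
        with open(path, "rb") as cached:
            return cached.read()
    content = fetch()
    folder = os.path.dirname(path)
    if folder:
        # Create the cache subfolder on first use.
        os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as cached:
        cached.write(content)
    return content


calls = []


def fake_fetch():
    """Stand-in for a network request; records how often it is called."""
    calls.append(1)
    return b"page %d" % len(calls)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mdn", "children.json")
    first = cached_download(path, fake_fetch)                    # downloads
    second = cached_download(path, fake_fetch)                   # served from cache
    third = cached_download(path, fake_fetch, use_cache=False)   # forced refresh
```

Passing use_cache=False plays the role of --no-cache: the file is rewritten even though it already exists.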

Ask MDN for child pages one level at a time, so that pages that are more
than 5 levels deep are mirrored. Additional changes:

* Handle null data returned from $children due to a redirect
  (https://developer.mozilla.org/en-US/docs/Navigation_timing$children)
* Handle invalid URLs due to redirect madness
  (https://developer.mozilla.org/en-US/docs/Web/Events$children?depth=1,
   child for name onconnected, was at
   https://developer.mozilla.org/en-US/docs/Web/Events/onconnected)
* Handle 400s (see bug 1192254)
* If an existing feature has an MDN URL and does not appear in the
  scraped list, assume it has been moved and delete it.
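
The level-at-a-time discovery and the error cases listed above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: fetch_children is a hypothetical callable standing in for a $children?depth=1 request, the "subpages" and "url" keys are assumed response fields, and the sketch uses an explicit stack instead of recursive calls:

```python
def discover_urls(root_url, fetch_children):
    """Walk the page tree one level at a time.

    fetch_children(url) returns a dict for a normal response, None when
    the response carries null data (for example after a redirect), and
    raises ValueError for an invalid URL or a 400 response.
    Returns (sorted list of discovered URLs, list of (url, reason)).
    """
    found = []
    errors = []
    seen = set()
    pending = [root_url]
    while pending:
        url = pending.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            data = fetch_children(url)
        except ValueError as exc:
            errors.append((url, str(exc)))
            continue
        if data is None:
            errors.append((url, "null data"))
            continue
        found.append(url)
        for child in data.get("subpages", []):
            child_url = child.get("url")
            if child_url:
                pending.append(child_url)
    return sorted(found), errors


# Illustrative tree: one page returns null data, one URL is rejected.
TREE = {
    "/docs/Web": {"subpages": [{"url": "/docs/Web/API"}, {"url": "/docs/Web/Events"}]},
    "/docs/Web/API": {"subpages": [{"url": "/docs/Web/API/Node"}, {"url": "/docs/Web/API/Bad"}]},
    "/docs/Web/API/Node": {"subpages": []},
    "/docs/Web/Events": None,
}


def fake_fetch(url):
    if url not in TREE:
        raise ValueError("400 Bad Request")
    return TREE[url]


urls, problems = discover_urls("/docs/Web", fake_fetch)
```

Because each request asks for a single level only, the walk has no fixed depth limit; failures are collected rather than aborting the whole run.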

Previously, tools/import_mdn.py always did a reparse operation, which
reparses the cached MDN pages if available. This is the right option
when the scraper is updated and needs to be tested against the previous
scraper version. Now, --reparse is needed to get the old behaviour, and
the default is to redownload MDN pages and parse them. This is the right
option to periodically sync the API with MDN.
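
The change of default can be pictured with a small argparse sketch. Only the flag names --reparse and --no-cache come from this PR; the parser layout, the help text, the use_cache destination, and placing both flags on one parser are assumptions for illustration:

```python
import argparse


def build_parser():
    """Build a hypothetical command-line parser with both flags."""
    parser = argparse.ArgumentParser(description="Import MDN pages (sketch)")
    parser.add_argument(
        "--reparse", action="store_true",
        help="reparse cached MDN pages instead of downloading them again")
    parser.add_argument(
        "--no-cache", dest="use_cache", action="store_false",
        help="download a fresh copy even if a cached file exists")
    return parser


default = build_parser().parse_args([])
old_behaviour = build_parser().parse_args(["--reparse"])
fresh = build_parser().parse_args(["--no-cache"])
```

With no arguments, reparse is False and use_cache is True, so the old reparse behaviour has to be requested explicitly.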
@jwhitlock
Contributor Author

Assigning @groovecoder. I'd love to get this merged by Monday August 10th, so I can run it in time for the August 11th meeting. If it looks too hairy, we can see if @jezdez has the bandwidth.

@groovecoder
Contributor

Code looks good. When I tried to run:

./tools/upload_data.py --data /Users/lcrouch/code/browsercompat-data/data

I got:

Traceback (most recent call last):
  File "./tools/upload_data.py", line 62, in <module>
    changes = tool.run()
  File "./tools/upload_data.py", line 39, in run
    return self.sync_changes(api_collection, local_collection)
  File "/Users/lcrouch/code/browsercompat/tools/common.py", line 311, in sync_changes
    return changeset.change_original_collection()
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 673, in change_original_collection
    resource_type, json_api[resource_type])
  File "/Users/lcrouch/code/browsercompat/tools/client.py", line 147, in create
    response = self.request('POST', resource_type, data=data)
  File "/Users/lcrouch/code/browsercompat/tools/client.py", line 83, in request
    response.status_code, response.content)
client.APIException: (u'POST http://localhost:8000/api/v1/versions: Unexpected response', 400, '{\n    "errors": [\n        {\n            "status": "400",\n            "detail": "This field may not be blank.",\n            "path": "/version"\n        }\n    ]\n}')

@groovecoder
Contributor

Bah, scratch that. I just hadn't updated https://github.com/mdn/browsercompat-data recently.

@groovecoder
Contributor

Then I got:

Traceback (most recent call last):
  File "./tools/upload_data.py", line 62, in <module>
    changes = tool.run()
  File "./tools/upload_data.py", line 39, in run
    return self.sync_changes(api_collection, local_collection)
  File "/Users/lcrouch/code/browsercompat/tools/common.py", line 311, in sync_changes
    return changeset.change_original_collection()
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 673, in change_original_collection
    resource_type, json_api[resource_type])
  File "/Users/lcrouch/code/browsercompat/tools/client.py", line 147, in create
    response = self.request('POST', resource_type, data=data)
  File "/Users/lcrouch/code/browsercompat/tools/client.py", line 83, in request
    response.status_code, response.content)
client.APIException: (u'POST http://localhost:8000/api/v1/versions: Unexpected response', 400, '{\n    "errors": [\n        {\n            "status": "400",\n            "detail": "With status \\"unknown\\", version must be numeric.",\n            "path": "/version"\n        }\n    ]\n}')

So I switched to mdn/master branch, and got:

Traceback (most recent call last):
  File "./tools/upload_data.py", line 62, in <module>
    changes = tool.run()
  File "./tools/upload_data.py", line 39, in run
    return self.sync_changes(api_collection, local_collection)
  File "/Users/lcrouch/code/browsercompat/tools/common.py", line 295, in sync_changes
    api_collection, local_collection, skip_deletes)
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 561, in __init__
    self._populate_changes()
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 566, in _populate_changes
    my_index = self.new_collection.get_all_by_data_id()
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 411, in get_all_by_data_id
    indexed.update(self.get_resources_by_data_id(resource_type))
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 428, in get_resources_by_data_id
    assert data_id not in resources
AssertionError

I'm going to try skipping the browsercompat-data import and just running the MDN mirror.

@jwhitlock
Contributor Author

I'm finding that upload_data.py doesn't work well with a populated database, and seems to need a fresh one. Not sure how easy that would be to fix. But it should work with the MDN mirror.

@groovecoder
Contributor

Yup, MDN mirror is running now. I'm letting it go as I work on other stuff in the main kuma codebase.

@groovecoder
Contributor

Been running for an hour without a problem ...

INFO - Imported 4600 features

... so far.

@groovecoder
Contributor

Mirror job is done. Looks good!

INFO - Closing changeset, updating cache...
INFO - Changes complete. Counts:
INFO -   Features: 5900 new

real    66m9.363s
user    2m17.013s
sys     0m11.288s

@groovecoder
Contributor

INFO -   Processed 304 of 5900 MDN pages (5%)...

Looking good so far. If this is enough for you to feel confident in the code, I can merge it so we can run it overnight.

@jwhitlock
Contributor Author

Yes, if it got this far, it will probably finish.

groovecoder added a commit that referenced this pull request Aug 10, 2015
fix bug 1182542 - Scrape deeply nested MDN pages, other fixes
@groovecoder groovecoder merged commit 1ed5f73 into mdn:master Aug 10, 2015
@jwhitlock jwhitlock deleted the 1182542_deeply_nested branch August 11, 2015 18:32
@jwhitlock
Contributor Author

It took a few days of trying and an 8-hour, middle-of-the-day run to complete import_mdn.py.

Here's the change in issues, which includes fixes by the writing team:

| Issue Slug | Old Count | New Count |
|---|---|---|
| Total | 2000 | 2001 |
| section_skipped | 842 | 897 |
| inline_text | 423 | 193 |
| unknown_version | 252 | 322 |
| unknown_kumascript | 250 | 243 |
| halt_import | 126 | 113 |
| footnote_no_id | 32 | 32 |
| tag_dropped | 20 | 20 |
| skipped_h3 | 18 | 19 |
| unknown_spec | 15 | 0 |
| exception | 8 | 7 |
| span_dropped | 4 | 5 |
| section_missed | 3 | 12 |
| compatgeckodesktop_unknown | 3 | 5 |
| unknown_browser | 1 | 7 |
| unexpected_attribute | 1 | 72 |
| spec2_converted | 1 | 2 |
| failed_download | 1 | 8 |
| spec_h2_id | 0 | 2 |
| missing_attribute | 0 | 2 |
| specname_not_kumascript | 0 | 1 |
| spec_h2_name | 0 | 2 |
| footnote_multiple | 0 | 10 |
| footnote_missing | 0 | 19 |
| footnote_unused | 0 | 6 |
| second_footnote | 0 | 2 |
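
The totals barely move, but individual issue types shift a lot. A quick way to see the largest swings is to subtract old from new for each slug; the sketch below uses a handful of rows copied from the figures above (the variable names are illustrative):

```python
# (old count, new count) for a few issue slugs taken from the figures above.
issue_counts = {
    "section_skipped": (842, 897),
    "inline_text": (423, 193),
    "unknown_version": (252, 322),
    "unknown_spec": (15, 0),
    "unexpected_attribute": (1, 72),
}

# Positive delta: more issues after the re-scrape; negative: fewer.
deltas = {slug: new - old for slug, (old, new) in issue_counts.items()}

for slug, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print("%-22s %+d" % (slug, delta))
```

For these rows, inline_text drops by 230 while unexpected_attribute and unknown_version rise by 71 and 70.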


2 participants