fix bug 1182542 - Scrape deeply nested MDN pages, other fixes by jwhitlock · Pull Request #36 · mdn/browsercompat

jwhitlock · 2015-08-07T15:55:21Z

Florian Scholz and the MDN team have fixed many of the importer issues, and have requested a re-scrape of MDN to see whether new content has additional issues. This PR includes tool improvements that will help the process. This code has not been run against https://browsercompat.herokuapp.com, since we're trying to do code reviews before "production" pushes.

- Fix bug 1182542 - Use recursive calls to $children to discover MDN URLs more than 5 levels deep.
- Handle various error conditions with the $children API.
- Make it easy to get fresh content with --no-cache.
- Re-scrape MDN with import_mdn.py.

If you want to run this locally:

Set up the browsercompat project
- Add a superuser with / username+password
- Optionally populate with data from https://github.com/mdn/browsercompat-data (this results in fewer "unknown_version" type errors)
- Optionally run with memcache
With a good network connection and power supply, run:
- time tools/mirror_mdn_features.py - after about 60 minutes, it will prompt before making changes, then take 5 to 10 minutes to commit them. For me, it found 841 new pages, 19 changed, 969 deleted, and 6023 the same.
- time tools/import_mdn.py - takes about 6.5 hours to parse 5877 pages.
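
The new / changed / deleted / same counts come from comparing the scraped MDN pages against the features already in the API. A minimal sketch of that kind of comparison, assuming both sides can be reduced to dicts keyed by MDN URL. `classify_pages` and the dict shape are hypothetical illustrations, not the code in tools/mirror_mdn_features.py:

```python
def classify_pages(existing, scraped):
    """Compare two {mdn_url: data} dicts and bucket each URL.

    Hypothetical helper: existing stands for features already in the API,
    scraped for the pages just discovered on MDN.
    """
    existing_urls = set(existing)
    scraped_urls = set(scraped)
    shared = existing_urls & scraped_urls
    return {
        "new": sorted(scraped_urls - existing_urls),
        # Existing features missing from the scrape are assumed moved.
        "deleted": sorted(existing_urls - scraped_urls),
        "changed": sorted(u for u in shared if existing[u] != scraped[u]),
        "same": sorted(u for u in shared if existing[u] == scraped[u]),
    }


# Made-up sample data, only to show the four buckets.
existing = {"/docs/Web/CSS": "CSS", "/docs/Web/Old": "Old", "/docs/Web/API": "API"}
scraped = {"/docs/Web/CSS": "CSS", "/docs/Web/API": "Web API", "/docs/Web/New": "New"}
result = classify_pages(existing, scraped)
```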

When a tool downloads data from MDN, it uses Tool.cached_download to store a cached copy in a file. This is useful when debugging tools. However, fresh data is often desired. The --no-cache option downloads a fresh copy, even if the file exists.

Previously, cached files were stored in the data/ folder. This allows specifying a subfolder, which is created on first use.
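
For readers who want the idea without opening the diff, here is a rough sketch of the caching behaviour described above, written for Python 3. It is not the body of Tool.cached_download: the function signature, the `fetch` callable, and the `use_cache` flag are assumptions made for illustration.

```python
import os
import tempfile


def cached_download(filename, fetch, cache_dir="data", use_cache=True):
    """Return the content for filename, downloading it when needed.

    fetch is a hypothetical zero-argument callable returning text.
    use_cache=False mimics --no-cache: download even if the file exists.
    """
    # Create the (sub)folder on first use.
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if use_cache and os.path.exists(path):
        with open(path, encoding="utf-8") as cached:
            return cached.read()
    content = fetch()
    with open(path, "w", encoding="utf-8") as cached:
        cached.write(content)
    return content


calls = []


def fake_fetch():
    """Stand-in for a real download; counts how often it is called."""
    calls.append(1)
    return "page %d" % len(calls)


with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "data", "mdn")  # subfolder, created on demand
    first = cached_download("page.json", fake_fetch, folder)   # downloads
    second = cached_download("page.json", fake_fetch, folder)  # reads the cache
    fresh = cached_download("page.json", fake_fetch, folder, use_cache=False)
```

The second call never reaches `fake_fetch`, while the `use_cache=False` call downloads again and overwrites the cached file.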

Ask MDN for child pages one level at a time, so that pages that are more than 5 levels deep are mirrored. Additional changes:
- Handle null data returned from $children due to a redirect (https://developer.mozilla.org/en-US/docs/Navigation_timing$children)
- Handle invalid URLs due to redirect madness (https://developer.mozilla.org/en-US/docs/Web/Events$children?depth=1, child for name onconnected, was at https://developer.mozilla.org/en-US/docs/Web/Events/onconnected)
- Handle 400s (see bug 1192254)
- If an existing feature has an MDN URL and does not appear in the scraped list, assume it has been moved and delete it.
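
The one-level-at-a-time walk can be sketched roughly as follows. This is an illustration, not the code in this PR: `collect_urls` and `fetch_children` are hypothetical names, the "subpages" / "url" keys are an assumed response shape, and the prefix check is just one possible guard against bad child URLs.

```python
def collect_urls(path, fetch_children, seen=None):
    """Return every URL under path, asking for one level at a time.

    fetch_children(path) stands in for a GET of path + "$children?depth=1".
    It is assumed to return a dict with a "subpages" list, None when a
    redirect yields null data, or to raise ValueError on a 400 response.
    """
    if seen is None:
        seen = set()
    if path in seen:  # guard against redirect loops
        return seen
    seen.add(path)
    try:
        data = fetch_children(path)
    except ValueError:
        return seen  # treat a 400 as "no children"
    if not data:
        return seen  # null data, e.g. from a redirect
    for child in data.get("subpages", []):
        url = child.get("url")
        # Skip children with a missing URL or one outside the parent path.
        if url and url.startswith(path + "/"):
            collect_urls(url, fetch_children, seen)
    return seen


def fake_fetch(path):
    """Hypothetical stand-in for the network: a chain 7 levels deep."""
    depth = len(path.split("/")) - 2  # "/docs/L1" has depth 1
    if depth >= 7:
        raise ValueError("400 Bad Request")
    return {"subpages": [{"url": "%s/L%d" % (path, depth + 1)}, {"url": None}]}


urls = collect_urls("/docs/L1", fake_fetch)
```

Because each call only asks for direct children, the walk is not limited by how deep a single $children request will go.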

Previously, tools/import_mdn.py always did a reparse operation, which reparses the cached MDN pages if available. This is the right option when the scraper is updated and needs to be tested against the previous scraper version. Now, --reparse is needed to get the old behaviour, and the default is to redownload MDN pages and parse them. This is the right option to periodically sync the API with MDN.
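
A sketch of how such a switch might be wired with argparse. Only the --reparse flag name comes from this PR; `build_parser`, `choose_mode`, and the help text are hypothetical and are not copied from tools/import_mdn.py.

```python
import argparse


def build_parser():
    """Build a parser with the --reparse switch (illustrative only)."""
    parser = argparse.ArgumentParser(description="Import features from MDN pages.")
    parser.add_argument(
        "--reparse",
        action="store_true",
        help="reparse cached MDN pages instead of downloading them again",
    )
    return parser


def choose_mode(args):
    """Default is a fresh download; --reparse restores the old behaviour."""
    return "reparse" if args.reparse else "download"


default_mode = choose_mode(build_parser().parse_args([]))
old_mode = choose_mode(build_parser().parse_args(["--reparse"]))
```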

jwhitlock · 2015-08-07T16:09:51Z

Assigning @groovecoder. I'd love to get this merged by Monday August 10th, so I can run it in time for the August 11th meeting. If it looks too hairy, we can see if @jezdez has the bandwidth.

groovecoder · 2015-08-10T20:22:48Z

Code looks good. When I tried to run:

./tools/upload_data.py --data /Users/lcrouch/code/browsercompat-data/data

I got:

Traceback (most recent call last):
  File "./tools/upload_data.py", line 62, in <module>
    changes = tool.run()
  File "./tools/upload_data.py", line 39, in run
    return self.sync_changes(api_collection, local_collection)
  File "/Users/lcrouch/code/browsercompat/tools/common.py", line 311, in sync_changes
    return changeset.change_original_collection()
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 673, in change_original_collection
    resource_type, json_api[resource_type])
  File "/Users/lcrouch/code/browsercompat/tools/client.py", line 147, in create
    response = self.request('POST', resource_type, data=data)
  File "/Users/lcrouch/code/browsercompat/tools/client.py", line 83, in request
    response.status_code, response.content)
client.APIException: (u'POST http://localhost:8000/api/v1/versions: Unexpected response', 400, '{\n    "errors": [\n        {\n            "status": "400",\n            "detail": "This field may not be blank.",\n            "path": "/version"\n        }\n    ]\n}')

groovecoder · 2015-08-10T20:24:10Z

Bah, scratch that. I just hadn't updated https://github.com/mdn/browsercompat-data recently.

groovecoder · 2015-08-10T20:27:41Z

Then I got:

Traceback (most recent call last):
  File "./tools/upload_data.py", line 62, in <module>
    changes = tool.run()
  File "./tools/upload_data.py", line 39, in run
    return self.sync_changes(api_collection, local_collection)
  File "/Users/lcrouch/code/browsercompat/tools/common.py", line 311, in sync_changes
    return changeset.change_original_collection()
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 673, in change_original_collection
    resource_type, json_api[resource_type])
  File "/Users/lcrouch/code/browsercompat/tools/client.py", line 147, in create
    response = self.request('POST', resource_type, data=data)
  File "/Users/lcrouch/code/browsercompat/tools/client.py", line 83, in request
    response.status_code, response.content)
client.APIException: (u'POST http://localhost:8000/api/v1/versions: Unexpected response', 400, '{\n    "errors": [\n        {\n            "status": "400",\n            "detail": "With status \\"unknown\\", version must be numeric.",\n            "path": "/version"\n        }\n    ]\n}')

So I switched to mdn/master branch, and got:

Traceback (most recent call last):
  File "./tools/upload_data.py", line 62, in <module>
    changes = tool.run()
  File "./tools/upload_data.py", line 39, in run
    return self.sync_changes(api_collection, local_collection)
  File "/Users/lcrouch/code/browsercompat/tools/common.py", line 295, in sync_changes
    api_collection, local_collection, skip_deletes)
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 561, in __init__
    self._populate_changes()
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 566, in _populate_changes
    my_index = self.new_collection.get_all_by_data_id()
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 411, in get_all_by_data_id
    indexed.update(self.get_resources_by_data_id(resource_type))
  File "/Users/lcrouch/code/browsercompat/tools/resources.py", line 428, in get_resources_by_data_id
    assert data_id not in resources
AssertionError

I'm going to try skipping the browsercompat-data import and just running the MDN mirror.

jwhitlock · 2015-08-10T20:33:20Z

I'm finding that upload_data.py doesn't work well with a populated database, but seems to need a fresh copy. Not sure how easy it would be to fix. But it should work with the MDN mirror.

groovecoder · 2015-08-10T20:37:07Z

Yup, MDN mirror is running now. I'm letting it go as I work on other stuff in the main kuma codebase.

groovecoder · 2015-08-10T21:23:52Z

Been running for an hour without a problem ...

INFO - Imported 4600 features

... so far.

groovecoder · 2015-08-10T22:08:07Z

Mirror job is done. Looks good!

INFO - Closing changeset, updating cache...
INFO - Changes complete. Counts:
INFO -   Features: 5900 new

real    66m9.363s
user    2m17.013s
sys 0m11.288s

groovecoder · 2015-08-10T22:37:19Z

INFO -   Processed 304 of 5900 MDN pages (5%)...

Looking good so far. If this is enough for you to feel confident in the code, I can merge it so we can run it overnight.

jwhitlock · 2015-08-10T22:40:35Z

Yes, if it got this far, it will probably finish.


jwhitlock · 2015-08-14T02:15:56Z

It took a few days of trying, and finally an 8-hour run in the middle of the day, to complete import_mdn.py.

Here's the change in issues, which includes fixes by the writing team:

Issue Slug	Old Count	New Count
Total	2000	2001
section_skipped	842	897
inline_text	423	193
unknown_version	252	322
unknown_kumascript	250	243
halt_import	126	113
footnote_no_id	32	32
tag_dropped	20	20
skipped_h3	18	19
unknown_spec	15	0
exception	8	7
span_dropped	4	5
section_missed	3	12
compatgeckodesktop_unknown	3	5
unknown_browser	1	7
unexpected_attribute	1	72
spec2_converted	1	2
failed_download	1	8
spec_h2_id	0	2
missing_attribute	0	2
specname_not_kumascript	0	1
spec_h2_name	0	2
footnote_multiple	0	10
footnote_missing	0	19
footnote_unused	0	6
second_footnote	0	2
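
Since the total barely moves (2000 to 2001), the per-slug differences are more telling. A small sketch for computing them, using a handful of rows copied from the table above; the `counts` dict layout is just an assumption for illustration:

```python
# (old, new) counts for a few issue slugs, copied from the table above.
counts = {
    "section_skipped": (842, 897),
    "inline_text": (423, 193),
    "unknown_version": (252, 322),
    "unknown_spec": (15, 0),
    "unexpected_attribute": (1, 72),
}

# Positive delta: more issues after the re-scrape. Negative: fewer.
deltas = {slug: new - old for slug, (old, new) in counts.items()}

# Print the biggest improvements first.
for slug, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print("%-22s %+d" % (slug, delta))
```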

jwhitlock added 4 commits August 7, 2015 07:29

bug 1181140 - Add folders for cached downloads

b0a3012

Previously, cached files were stored in the data/ folder. This allows specifying a subfolder, which is created on first use.

jwhitlock assigned groovecoder Aug 7, 2015

groovecoder added a commit that referenced this pull request Aug 10, 2015

Merge pull request #36 from jwhitlock/1182542_deeply_nested

1ed5f73

fix bug 1182542 - Scrape deeply nested MDN pages, other fixes

groovecoder merged commit 1ed5f73 into mdn:master Aug 10, 2015

jwhitlock deleted the 1182542_deeply_nested branch August 11, 2015 18:32
