fix bug 1182542 - Scrape deeply nested MDN pages, other fixes #36
When a tool downloads data from MDN, it uses Tool.cached_download to store a cached copy in a file. This is useful when debugging tools. However, fresh data is often desired. The --no-cache option downloads a fresh copy, even if the file exists.
Previously, cached files were always stored in the data/ folder. This change allows specifying a subfolder, which is created on first use.
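The caching behaviour described in the two commit messages above can be sketched roughly as follows. This is not the code from the PR: the function signature, the `fetch` callable (a stand-in for the real MDN request), and the `use_cache` flag (what `--no-cache` would turn off) are assumptions made for illustration.

```python
import os


def cached_download(filename, fetch, cache_dir="data", use_cache=True):
    """Return content for filename, reusing the cached copy when allowed.

    fetch is a hypothetical zero-argument callable that performs the
    real download and returns the content as text.
    """
    path = os.path.join(cache_dir, filename)
    if use_cache and os.path.exists(path):
        # A cached copy exists and the caller did not ask for fresh data.
        with open(path, encoding="utf-8") as cached:
            return cached.read()
    content = fetch()
    # Create the cache folder, including any subfolder, on first use.
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", encoding="utf-8") as cached:
        cached.write(content)
    return content
```

Calling it with `use_cache=False` always downloads and overwrites the cached file, so later cached runs see the fresh copy.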
Ask MDN for child pages one level at a time, so that pages that are more than 5 levels deep are mirrored. Additional changes:
* Handle null data returned from $children due to a redirect (https://developer.mozilla.org/en-US/docs/Navigation_timing$children)
* Handle invalid URLs due to redirect madness (https://developer.mozilla.org/en-US/docs/Web/Events$children?depth=1, child for name onconnected, was at https://developer.mozilla.org/en-US/docs/Web/Events/onconnected)
* Handle 400s (see bug 1192254)
* If an existing feature has an MDN URL and does not appear in the scraped list, assume it has been moved and delete it.
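A minimal sketch of the one-level-at-a-time traversal, for readers who want to see the idea in code. It is not the PR's implementation. `fetch_children` is a hypothetical callable standing in for a request to `<path>$children?depth=1`, and the `subpages` and `url` keys are assumptions about the JSON shape. It returns `None` for a null response and raises `ValueError` for a bad response such as a 400.

```python
from collections import deque


def collect_pages(root_path, fetch_children):
    """Walk the page tree breadth-first, one level per request.

    Returns (pages, skipped): every path discovered, and the paths whose
    children could not be listed (null data or a failed request).
    """
    seen = set()
    skipped = []
    queue = deque([root_path])
    while queue:
        path = queue.popleft()
        if path in seen:
            continue
        seen.add(path)
        try:
            data = fetch_children(path)
        except ValueError:
            # For example an HTTP 400, or a URL mangled by redirects.
            skipped.append(path)
            continue
        if data is None:
            # Null data, as returned when $children hits a redirect.
            skipped.append(path)
            continue
        for child in data.get("subpages", []):
            url = child.get("url")
            if url:
                queue.append(url)
    return sorted(seen), skipped
```

Because each request only asks for direct children, the depth of the tree no longer limits which pages are found; the cost is one request per page.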
Previously, tools/import_mdn.py always did a reparse operation, which reparses the cached MDN pages if available. This is the right option when the scraper is updated and needs to be tested against the previous scraper version. Now, --reparse is needed to get the old behaviour, and the default is to redownload MDN pages and parse them. This is the right option to periodically sync the API with MDN.
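The new default could be wired up along these lines. This is only a sketch using the standard `argparse` module: the `--reparse` flag name comes from the commit message, while `parse_args`, `choose_mode`, and the help text are made up for illustration and may not match tools/import_mdn.py.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options (illustrative, not the real tool)."""
    parser = argparse.ArgumentParser(description="Import MDN pages")
    parser.add_argument(
        "--reparse",
        action="store_true",
        help="reparse cached MDN pages instead of downloading them again",
    )
    return parser.parse_args(argv)


def choose_mode(args):
    """Describe which operation the options select."""
    if args.reparse:
        return "reparse cached pages"
    return "download and parse"
```

With no options the sketch selects the download-and-parse path, matching the new default; passing `--reparse` selects the old behaviour.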
Assigning @groovecoder. I'd love to get this merged by Monday August 10th, so I can run it in time for the August 11th meeting. If it looks too hairy, we can see if @jezdez has the bandwidth.
Code looks good. When I tried to run: I got:
Bah, scratch that. I just hadn't updated https://github.com/mdn/browsercompat-data recently.
Then I got: So I switched to I'm going to try skipping the
I'm finding that
Yup, MDN mirror is running now. I'm letting it go as I work on other stuff in the main kuma codebase.
Been running for an hour without a problem ... so far.
Mirror job is done. Looks good! |
Looking good so far. If this is enough for you to feel confident in the code, I can merge it so we can run it overnight. |
Yes, if it got this far, it will probably finish. |
It took a few days of trying and an 8-hour, middle-of-the-day run to complete. Here's the change in issues, which includes fixes by the writing team:
Florian Scholz and the MDN team have fixed many of the importer issues, and have requested a re-scrape of MDN to see if new content has additional issues. This PR includes tool improvements that will help the process. This code has not been run against https://browsercompat.herokuapp.com, since we're trying to do code reviews before "production" pushes.
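* Use $children to discover MDN URLs more than 5 levels deep.
* Handle errors from the $children API.
* Add a --no-cache option to download fresh data.
* Require --reparse to get the old reparse-only behaviour of import_mdn.py.

If you want to run this locally: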
* time tools/mirror_mdn_features.py - after about 60 minutes, will prompt to make changes, then 5 - 10 minutes to commit changes. For me, got 841 new pages, 19 changed, 969 deleted, 6023 the same.
* time tools/import_mdn.py - takes about 6.5 hours to parse 5877 pages.
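The new / changed / deleted / same counts reported by the mirror step could come from a comparison like the one below. How the tool actually computes them is not shown in this PR, so treat this as an assumption: `classify` and its two dictionaries (MDN URL mapped to some comparable page data) are hypothetical.

```python
def classify(existing, scraped):
    """Compare existing features with the freshly scraped list.

    Both arguments map an MDN URL to comparable page data.
    """
    existing_urls = set(existing)
    scraped_urls = set(scraped)
    common = existing_urls & scraped_urls
    return {
        # Scraped pages that have no existing feature yet.
        "new": sorted(scraped_urls - existing_urls),
        # Existing features missing from the scrape: assumed moved, so deleted.
        "deleted": sorted(existing_urls - scraped_urls),
        "changed": sorted(u for u in common if existing[u] != scraped[u]),
        "same": sorted(u for u in common if existing[u] == scraped[u]),
    }
```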