bug 1271509: Add ./manage.py scrape_document #4248

Merged

jwhitlock merged 5 commits into mozilla:master from jwhitlock:scrape_document_1271509 on Jun 6, 2017

Conversation

jwhitlock (Member) commented Jun 2, 2017

Fix some things that should have been in the last PR (fewer user test fixtures, an improved scraper), and then add the ./manage.py scrape_document command. Some manual tests:

    ./manage.py scrape_document https://developer.mozilla.org/en-US/docs/Archive/SAX  # Will also add Archive zone
    ./manage.py scrape_document https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array --depth 1  # Will scrape child docs as well
    ./manage.py scrape_document https://developer.mozilla.org/fr/docs/Web/CSS/display  # Will scrape the English document as well
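
For reference, here is a minimal sketch of what a management command with these options could look like. This is an illustration under assumptions, not the actual code in kuma/scrape/management/commands/scrape_document.py; in particular, the Scraper entry point shown here is hypothetical.

    # Illustrative sketch only; the real command differs in detail.
    from django.core.management.base import BaseCommand


    class Command(BaseCommand):
        help = 'Scrape an MDN document into the local database'

        def add_arguments(self, parser):
            parser.add_argument('url', help='URL of the document to scrape')
            parser.add_argument('--depth', default='0',
                                help='Scrape child documents to this depth'
                                     ' (an integer, or "all")')
            parser.add_argument('--revisions', type=int, default=1,
                                help='Scrape this many past revisions')
            parser.add_argument('--force', action='store_true',
                                help='Update existing Document records')

        def handle(self, *args, **options):
            # Hypothetical entry point; the real Scraper API may differ.
            from kuma.scrape.scraper import Scraper
            scraper = Scraper()
            scraper.add_source('document', options['url'],
                               depth=options['depth'],
                               revisions=options['revisions'],
                               force=options['force'])
            scraper.scrape()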

jwhitlock added some commits May 30, 2017

bug 1271509: Reduce user test fixtures
Any fixture that touches the database has to be destroyed and recreated
with each test, so there is no benefit to capturing HTML or one-time
model setup as a fixture.
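
As an illustration (with made-up names, not the actual kuma tests): captured HTML can be a plain module-level constant built once, while anything touching the database belongs in a per-test, database-scoped fixture, since broader scoping saves nothing.

    import pytest

    # Pure data: no database access, so it can be built once at module
    # scope with no per-test cost.
    SAMPLE_HTML = '<html><body><h1>SAX</h1></body></html>'


    @pytest.fixture
    def sample_document(db):
        # Touches the database, so it must be recreated for every test;
        # promoting it to module or session scope would not help.
        from kuma.wiki.models import Document
        return Document.objects.create(
            locale='en-US', slug='Archive/SAX', title='SAX')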
bug 1271509: Improve scraper reporting, cycling
Use pre-assembled format strings to report progress on a source in
the scraper, and make it a little clearer that repeat=True is being
set to signal that the scraper should loop again.

Process the sources in reverse order. This makes it more likely that the
new sources will be done first, so that the old sources that needed them
can complete. This reduces a full scrape from 18 cycles to 13, but it
still takes about the same time (40 minutes).
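
Paraphrased as code (the Source API here is assumed, not the scraper's actual interface), the cycling logic looks roughly like this:

    # Rough paraphrase of the scraper's cycling, assuming each source has
    # a .state attribute and a .gather() method returning new sources.
    def scrape(sources):
        repeat = True
        while repeat:
            repeat = False
            # Reverse order: newly added sources are processed first, so
            # older sources that depend on them can complete this pass.
            for source in list(reversed(sources)):
                if source.state == 'done':
                    continue
                sources.extend(source.gather())
                if source.state != 'done':
                    repeat = True  # something is unfinished; loop again
        return sources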

jwhitlock requested a review from escattone Jun 2, 2017

bug 1271509: Add ./manage.py scrape_document
Add the ability to scrape a document, which also includes scraping:

* The rendered page (to detect zones)
* The ancestor pages (parent and higher)
* Metadata and Translations (optional)
* Revisions (configurable)
* Child documents (optional)
* The top-level zone document

It does not include attachments.
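
In the scraper's terms, each of those items becomes a prerequisite source. A hypothetical sketch of how a document source might declare them, using the source names from this PR but invented parameters:

    # Illustrative only; the real logic is in kuma/scrape/sources/document.py.
    def gather_document_needs(path, options):
        needs = [
            ('document_rendered', path, {}),  # rendered page, to detect zones
            ('document_meta', path, {}),      # metadata and translations
            ('document_history', path,
             {'revisions': options.get('revisions', 1)}),
        ]
        if options.get('depth'):              # child documents, if requested
            needs.append(
                ('document_children', path, {'depth': options['depth']}))
        return needs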
codecov-io commented Jun 2, 2017

Codecov Report

Merging #4248 into master will increase coverage by 0.77%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4248      +/-   ##
==========================================
+ Coverage   87.09%   87.86%   +0.77%     
==========================================
  Files         155      162       +7     
  Lines        9420    10022     +602     
  Branches     1263     1367     +104     
==========================================
+ Hits         8204     8806     +602     
  Misses        989      989              
  Partials      227      227
Impacted Files Coverage Δ
kuma/scrape/storage.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/document_meta.py 100% <100%> (ø)
kuma/scrape/scraper.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/__init__.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/base.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/document.py 100% <100%> (ø)
kuma/scrape/sources/document_rendered.py 100% <100%> (ø)
kuma/scrape/sources/document_history.py 100% <100%> (ø)
kuma/scrape/sources/revision.py 100% <100%> (ø)
kuma/scrape/sources/document_children.py 100% <100%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 754f61b...c484b30. Read the comment docs.


escattone self-assigned this Jun 2, 2017

escattone (Member) commented:

Wow, this is a lot of great work, with superb tests as well. I really like the core iterative, state-driven processing loop and how it elegantly accommodates the complex, recursive dependencies of scraping data from Kuma. It reveals a lot of careful thought invested in breaking down a complex problem. It's a kind of microcosm of Kuma itself. Reading it in detail helped enlarge as well as confirm my knowledge of Kuma.

I'm almost embarrassed to even mention a couple of nitpicks. This PR deserves fanfare.

kuma/scrape/management/commands/scrape_document.py
    STANDARD_DOC_OPTIONS = {
        'force': ('bool', False),      # Update existing Document records
        'depth': ('int_all', 0),       # Scrape the topic tree to this depth
        'revisions': ('int', 1),       # Scrape this many past revisions

escattone (Member) commented Jun 6, 2017

Why not int_all for revisions as well?

escattone (Member) commented Jun 6, 2017

Oh, I see, I think it's because the wiki.document_revisions endpoint doesn't allow a ?limit=all unless the requester is authenticated, which our Requester instance is not.

jwhitlock (Member) commented Jun 6, 2017

Yes, and I think that was done because scrapers that ignored robots.txt were following the "all" links. One way forward would be to enable ?limit=all for anon callers, just not add the link.
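
For context, the (type, default) specs above suggest a small normalization step in which 'int_all' accepts either an integer or the literal "all". A guess at the shape of that logic (illustrative, not the scraper's actual parsing):

    def normalize_option(option_type, raw, default):
        # Interpret a raw command-line value against a (type, default) spec.
        if raw is None:
            return default
        if option_type == 'bool':
            return bool(raw)
        if option_type == 'int':
            return int(raw)
        if option_type == 'int_all':
            # "all" means unlimited; anything else must be an integer.
            return 'all' if raw == 'all' else int(raw)
        raise ValueError('Unknown option type %r' % option_type)

So --depth all would be accepted, while revisions stays a plain integer because, as discussed above, ?limit=all is refused for anonymous requesters.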

        self.locale, self.slug = self.locale_and_slug(
            self.normalized_path)

    def load_prereq_parent_topic(self, storage, data):

escattone (Member) commented Jun 6, 2017

A nitpick, but it seems this method could be simplified to:

        assert self.normalized_path
        if self.parent_slug:
            parent_topic = storage.get_document(self.locale, self.parent_slug)
            if parent_topic is None:
                data['needs'].append(('document', self.parent_path, {}))
            else:
                data['parent_topic'] = parent_topic

jwhitlock (Member) commented Jun 6, 2017

👍, and changing to fit the pattern I'm using elsewhere, to avoid too much nesting:

    if this prerequisite is not needed:
        return  # Short explanation

    prereq = storage.get_prereq(...)
    if prereq is None:
        data['needs'].append(('type_of_prereq', param, {}))
    else:
        data['key'] = extract_data(prereq)
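
Applied to the parent_topic loader in this PR, that pattern comes out to roughly the following (a runnable shape, using the names from the suggestion above):

    def load_prereq_parent_topic(self, storage, data):
        if not self.parent_slug:
            return  # Top-level documents have no parent topic

        parent_topic = storage.get_document(self.locale, self.parent_slug)
        if parent_topic is None:
            data['needs'].append(('document', self.parent_path, {}))
        else:
            data['parent_topic'] = parent_topic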

jwhitlock added some commits Jun 6, 2017

bug 1271509: Remove unneeded .get() for verbosity
Verbosity is included in all Django management commands. Assume it is
available, like scrape_user does.
bug 1271509: Refactor load_prereq_parent_topic
Convert the optional parent_topic loader to look more like other
optional loaders.
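
In code terms, the verbosity change presumably amounts to:

    # Before: defensive lookup with a fallback
    verbosity = options.get('verbosity', 1)

    # After: Django always passes verbosity to management commands
    verbosity = options['verbosity']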

jwhitlock merged commit e074485 into mozilla:master Jun 6, 2017

1 check passed

continuous-integration/travis-ci/pr: The Travis CI build passed

jwhitlock deleted the jwhitlock:scrape_document_1271509 branch Jun 6, 2017

jwhitlock referenced this pull request Jun 6, 2017

Merged

Bug 1271509: Sample database and production scraper #4076

15 of 15 tasks complete