Scraper Infrastructure Notes

Comparison of ebdata.retrieval vs. ebdata.blobs

| Feature | blobs | retrieval |
| --- | --- | --- |
| Uses templatemaker to strip extraneous HTML | YES | NO |
| Crawls "dumb" sites (e.g. incremental page IDs) | YES | NO |
| Stores records of crawling in the database | YES [note 1] | YES [note 2] [note 3] |
| Handles arbitrary attributes | NO | YES |
| Really short scraper scripts | YES [note 4] | NO [note 4] |
| Double-checks the location when both location_name and geom are provided | NO | YES, via safe_location(), but not automatic |
| Auto-geocodes location_name if needed | YES (geotagging.py) | YES, via create_newsitem() |
| Parses multiple locations per crawled page and auto-creates multiple NewsItems | YES (geotagging.py) | NO; you must call create_newsitem() yourself |
| Supports multiple schemas in one scraper | NO? | YES |
| Can fetch & parse without saving to the database (for testing) | NO? | YES, via display_data() |
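
For a sense of what the retrieval side looks like in practice, here is a minimal, hypothetical scraper. It is a sketch only: the hook names (schema_slugs, list_pages, parse_list, clean_list_record, existing_record, save) follow the usual NewsItemListDetailScraper subclass pattern, while the schema slug, URL, and record fields are invented for illustration. create_newsitem() and display_data() are the methods referenced in the table.

```python
# A sketch, not canonical code: hook names follow the conventional
# NewsItemListDetailScraper pattern; the schema slug, URL, and record
# fields are invented for illustration.
from ebdata.retrieval.scrapers.newsitem_list_detail import NewsItemListDetailScraper
from ebpub.utils.dates import parse_date


class ExampleScraper(NewsItemListDetailScraper):
    schema_slugs = ('police-reports',)  # hypothetical schema
    has_detail = False                  # no per-record detail pages

    def list_pages(self):
        # Yield the raw content of each list page to be parsed.
        yield self.get_html('http://example.com/reports.html')

    def parse_list(self, page):
        # Real code would parse `page` here; we yield one fake record.
        yield {'title': 'Example report', 'date': '08/06/2012',
               'address': '123 Main St.'}

    def clean_list_record(self, record):
        record['date'] = parse_date(record['date'], '%m/%d/%Y')
        return record

    def existing_record(self, record):
        # Return an already-saved NewsItem for this record, or None.
        qs = self.schema.newsitem_set.filter(item_date=record['date'],
                                             title=record['title'])
        return qs[0] if qs else None

    def save(self, old_record, list_record, detail_record):
        if old_record is not None:
            return  # already saved on a previous run
        # create_newsitem() geocodes location_name when no geometry is given.
        self.create_newsitem({}, title=list_record['title'],
                             item_date=list_record['date'],
                             location_name=list_record['address'])


if __name__ == '__main__':
    # display_data() fetches and parses without touching the database;
    # swap in update() for a real run.
    ExampleScraper().display_data()
```

Note that safe_location() is not called anywhere automatically; as the table says, double-checking a location_name against a provided geometry is left to the scraper author.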

Notes:

[note 1]: blobs stores crawl history as Page objects, which hold the full text of the crawled page, a .when_crawled timestamp, and a fair amount of other metadata.
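
As a rough illustration, that crawl history is queryable as ordinary Django models. In this sketch, .when_crawled is documented in the note above, but the url and html field names are assumptions:

```python
# Sketch: inspecting blobs crawl history. `when_crawled` comes from the
# note above; `url` and `html` are assumed field names, not confirmed API.
from ebdata.blobs.models import Page

for page in Page.objects.order_by('-when_crawled')[:10]:
    print('%s %s (%d bytes)' % (page.when_crawled, page.url, len(page.html)))
```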

[note 2]: retrieval.scrapers.newsitem_list_detail stores only a timestamp of when each schema was last scraped, by creating an ebpub.db.models.DataUpdate instance that records some basic statistics. The scraped content itself is not saved.
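
For example, the most recent run for a schema can be looked up roughly as below; the field names update_finish and num_added are assumptions about the DataUpdate model, and the slug is hypothetical:

```python
# Sketch: looking up the last scrape of a schema. The field names
# (update_finish, num_added) are assumed; 'police-reports' is hypothetical.
from ebpub.db.models import DataUpdate, Schema

schema = Schema.objects.get(slug='police-reports')
last = DataUpdate.objects.filter(schema=schema).order_by('-update_finish')[0]
print('last scraped %s; %s items added' % (last.update_finish, last.num_added))
```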

[note 3]: retrieval.scrapers.new_newsitem_list_detail creates instances of ebdata.retrieval.models.ScrapedPage (the content and a bit of metadata about a crawled page, much simpler than blobs.models.Page) and NewsItemHistory (just an m2m mapping of ScrapedPages to NewsItems).

[note 4]: Anecdotally, scrapers written against ebdata.blobs tend to be shorter.