Scraper Infrastructure Notes

Comparison of ebdata.retrieval vs. ebdata.blobs

| Feature | blobs | retrieval |
| --- | --- | --- |
| Uses templatemaker to strip extraneous HTML | YES | NO |
| Crawls "dumb" sites (e.g. incremental page IDs) | YES | NO |
| Stores records of crawling in the database | YES [note 1] | YES [note 2] [note 3] |
| Handles arbitrary attributes | NO | YES |
| Really short scraper scripts | YES [note 4] | NO [note 4] |
| Double-checks the location when both location_name and geom are provided | NO | YES, via safe_location(), but not automatic |
| Auto-geocodes location_name if needed | YES (geotagging.py) | YES, via create_newsitem() |
| Parses multiple locations per crawled page and auto-creates multiple NewsItems | YES (geotagging.py) | NO; you must call create_newsitem() yourself |
| Supports multiple schemas in one scraper | NO? | YES |
| Can fetch & parse without saving to the database (for testing) | NO? | YES, via display_data() |
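
For a sense of what the retrieval side looks like in practice, here is a minimal, hypothetical scraper. It is a sketch only: the hook names (schema_slugs, list_pages, parse_list, clean_list_record, existing_record, save) follow the usual NewsItemListDetailScraper subclass pattern, while the schema slug, URL, and record fields are invented for illustration. create_newsitem() and display_data() are the methods referenced in the table.

```python
# A sketch, not canonical code: hook names follow the conventional
# NewsItemListDetailScraper pattern; the schema slug, URL, and record
# fields are invented for illustration.
from ebdata.retrieval.scrapers.newsitem_list_detail import NewsItemListDetailScraper
from ebpub.utils.dates import parse_date


class ExampleScraper(NewsItemListDetailScraper):
    schema_slugs = ('police-reports',)  # hypothetical schema
    has_detail = False                  # no per-record detail pages

    def list_pages(self):
        # Yield the raw content of each list page to be parsed.
        yield self.get_html('http://example.com/reports.html')

    def parse_list(self, page):
        # Real code would parse `page` here; we yield one fake record.
        yield {'title': 'Example report', 'date': '08/06/2012',
               'address': '123 Main St.'}

    def clean_list_record(self, record):
        record['date'] = parse_date(record['date'], '%m/%d/%Y')
        return record

    def existing_record(self, record):
        # Return an already-saved NewsItem for this record, or None.
        qs = self.schema.newsitem_set.filter(item_date=record['date'],
                                             title=record['title'])
        return qs[0] if qs else None

    def save(self, old_record, list_record, detail_record):
        if old_record is not None:
            return  # already saved on a previous run
        # create_newsitem() geocodes location_name when no geometry is given.
        self.create_newsitem({}, title=list_record['title'],
                             item_date=list_record['date'],
                             location_name=list_record['address'])


if __name__ == '__main__':
    # display_data() fetches and parses without touching the database;
    # swap in update() for a real run.
    ExampleScraper().display_data()
```

Note that safe_location() is not called anywhere automatically; as the table says, double-checking a location_name against a provided geometry is left to the scraper author.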

Notes:

[note 1]: blobs stores crawl history as Page objects, which hold the full text of the crawled page, a .when_crawled timestamp, and a fair amount of other metadata.
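
As a rough illustration, that crawl history is queryable as ordinary Django models. In this sketch, .when_crawled is documented in the note above, but the url and html field names are assumptions:

```python
# Sketch: inspecting blobs crawl history. `when_crawled` comes from the
# note above; `url` and `html` are assumed field names, not confirmed API.
from ebdata.blobs.models import Page

for page in Page.objects.order_by('-when_crawled')[:10]:
    print('%s %s (%d bytes)' % (page.when_crawled, page.url, len(page.html)))
```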

[note 2]: retrieval.scrapers.newsitem_list_detail stores only a timestamp of when each schema was last scraped, by creating an ebpub.db.models.DataUpdate instance that records some basic statistics. The scraped content itself is not saved.
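
For example, the most recent run for a schema can be looked up roughly as below; the field names update_finish and num_added are assumptions about the DataUpdate model, and the slug is hypothetical:

```python
# Sketch: looking up the last scrape of a schema. The field names
# (update_finish, num_added) are assumed; 'police-reports' is hypothetical.
from ebpub.db.models import DataUpdate, Schema

schema = Schema.objects.get(slug='police-reports')
last = DataUpdate.objects.filter(schema=schema).order_by('-update_finish')[0]
print('last scraped %s; %s items added' % (last.update_finish, last.num_added))
```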

[note 3]: retrieval.scrapers.new_newsitem_list_detail creates instances of ebdata.retrieval.models.ScrapedPage (the content and a bit of metadata about a crawled page, much simpler than blobs.models.Page) and NewsItemHistory (just an m2m mapping of ScrapedPages to NewsItems).

[note 4]: Anecdotally, scrapers written against ebdata.blobs tend to be shorter.