Scraping data for PANDA API notes

eads edited this page Mar 22, 2012 · 7 revisions

So you'd like to scrape some data from your favorite municipal data source and put it into PANDA. The PANDA API provides an elegant interface for making it happen. But you still have choices to make based on the nature of your data and your goals for using it.

Here are some general guidelines and approaches to scraping data for use with PANDA, along with gotchas and problems.

Guidelines

  • Dataset description should explain how data is scraped and make sense to all stakeholders.
  • Scraping methodology should be discussed and decided with reporters, end-users, and stakeholders.
  • Behavior when updating old records needs particular attention: Have a documented plan that your stakeholders are comfortable with.
  • Mind your external_id: PANDA needs a dataset-unique identifier for each row. You will need to calculate this in a reliable way for any dataset you wish to import.
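
One way to calculate an external_id when the source has no trustworthy ID of its own is to hash the fields that define a row's identity. This is a minimal sketch, not something PANDA prescribes: the function name `make_external_id` and the choice of key fields are assumptions for illustration, and it only works if those fields really are unique together.

```python
import hashlib
import json


def make_external_id(row, key_fields):
    """Derive a stable identifier from the fields that define a row's identity.

    `row` is a dict of scraped values; `key_fields` is a hypothetical list of
    the column names that, taken together, should be unique in the dataset.
    """
    # Normalize values so cosmetic differences (case, padding) do not change the ID.
    parts = [str(row.get(field, "")).strip().lower() for field in key_fields]
    # JSON-encode the list so field boundaries stay unambiguous
    # (["ab", "c"] and ["a", "bc"] must not hash to the same value).
    canonical = json.dumps(parts, ensure_ascii=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

If two genuinely different rows can share the same key fields, this produces collisions, so the choice of fields is worth discussing with the people who know the data.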

Methods

Incremental add

This is the easiest case. You have a stable dataset with unique IDs and an easy way of knowing what is new or different, whether through an API that can be queried for results created since a given date, a periodically released data dump, or something similar.

Simply use the PANDA API to add these new rows to your dataset. Take care to account for updated / deleted records if your data includes them.
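
A rough sketch of the incremental step: filter out rows already sent, then shape the rest for upload. The helper names are hypothetical, and the payload layout (an "objects" list whose items carry "data" and "external_id") is an assumption about the PANDA data endpoint that should be checked against the API documentation for your version. No HTTP request is made here.

```python
import json


def select_new_rows(scraped_rows, known_ids, id_field="id"):
    """Return only the rows whose ID has not been sent to PANDA before.

    `known_ids` is a set of ID strings recorded on previous runs.
    """
    return [row for row in scraped_rows if str(row[id_field]) not in known_ids]


def build_payload(rows, columns, id_field="id"):
    """Serialize rows as JSON in the assumed {"objects": [...]} layout.

    Values are emitted in `columns` order so they line up with the
    dataset's schema.
    """
    objects = [
        {
            "data": [str(row.get(column, "")) for column in columns],
            "external_id": str(row[id_field]),
        }
        for row in rows
    ]
    return json.dumps({"objects": objects})
```

The resulting string would be the body of the request that adds rows to the dataset. Updated and deleted records need separate handling, since this filter only notices IDs it has never seen.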

Handy upsides: Easy to implement, easy to understand what the scraper is doing.

Inevitable downsides: Real-world data is not usually this smooth, so this method is not feasible for many common scraping tasks.

Conclusion: If you have simple needs and can add the data in this way, go for it.

Destroy and rebuild

Data portals like the City of Chicago's Socrata-based open data site are often fed data dumps on a regular basis. In many cases, every night the city deletes all the old data and re-imports the entire dataset (including new rows).

This means that the unique identifiers for the same rows change, every night. Similarly, querying for rows changed since a past date is meaningless: How many rows were added since yesterday? All of them!

One way to address this problem is to do what the data providers do: Wipe out your PANDA dataset in the dead of night and re-import it whole.
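
The nightly wipe-and-reload can be sketched as below. The `client` object and its `delete_all_rows` and `add_rows` methods are hypothetical stand-ins, not part of any real PANDA library; the in-memory class exists only so the sketch can run. Batching is an assumption too, made so that a large import does not have to travel in a single request.

```python
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def destroy_and_rebuild(client, rows, batch_size=1000):
    """Delete everything, then re-import in batches. Returns the row count."""
    client.delete_all_rows()  # hypothetical method
    total = 0
    for batch in batched(rows, batch_size):
        client.add_rows(batch)  # hypothetical method
        total += len(batch)
    return total


class InMemoryClient:
    """Stand-in used only to exercise the sketch; not a real PANDA client."""

    def __init__(self):
        self.rows = []

    def delete_all_rows(self):
        self.rows = []

    def add_rows(self, batch):
        self.rows.extend(batch)
```

Note that the dataset is empty or partial between the delete and the last batch, which is one more reason to run this in the dead of night.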

Advantages: Easy to implement, easy to understand.

Inevitable downsides: No change history. Any alert/notification framework will need to be fairly complex to catch new search results. For large datasets, clock time and server load are both issues: It takes a long time to download 4 GB of data from a source and re-upload it into PANDA, and it takes CPU, memory, and disk to index those records over and over.

Conclusion: Appropriate for smallish datasets (importing the whole dataset takes less than an hour or two), where old data is unlikely to change in significant ways and/or the history is unimportant.

Intermediate systems

Like the proverbial hedgehog, PANDA knows one big thing: managing and searching big datasets. Developers building scrapers that use the PANDA API are like the proverbial fox and know many small things. How can the foxes and hedgehog/pandas coexist?

A step beyond the destroy-and-rebuild method is to pass the data through an intermediate data store on its way to PANDA. The most typical version of this idea is a relational database (PostgreSQL, MySQL, or something more portable like SQLite). In this scenario, you pull or scrape your data, check it against the intermediate database, and then insert it into PANDA.

Such systems can be smart about how they query and store data and can apply more sophisticated heuristics for determining if data has been added or changed.

Two intermediate data storage technologies worth mentioning are SQLite, for its portability and speed, and CouchDB, for its "row"-level history.
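
As an illustration of the SQLite variant, the sketch below keeps one digest per row and compares each new scrape against the previous run. The table name `seen`, the function names, and the digest scheme are all assumptions made for the example, and it presumes the source provides (or you can calculate) an ID that stays stable between scrapes.

```python
import hashlib
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the intermediate database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen "
        "(external_id TEXT PRIMARY KEY, digest TEXT NOT NULL)"
    )
    return conn


def row_digest(row):
    """Hash a row's full content so any field change is detectable."""
    canonical = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def classify(conn, rows, id_field="id"):
    """Split a scrape into added and changed rows, plus IDs that vanished."""
    previous = dict(conn.execute("SELECT external_id, digest FROM seen"))
    added, changed, current = [], [], {}
    for row in rows:
        ext_id = str(row[id_field])
        digest = row_digest(row)
        current[ext_id] = digest
        if ext_id not in previous:
            added.append(row)
        elif previous[ext_id] != digest:
            changed.append(row)
    deleted = sorted(set(previous) - set(current))
    # Replace the stored snapshot with this run's state in one transaction.
    with conn:
        conn.execute("DELETE FROM seen")
        conn.executemany("INSERT INTO seen VALUES (?, ?)", current.items())
    return added, changed, deleted
```

Only the added and changed rows then need to go to PANDA, and the deleted IDs give you something concrete to report to stakeholders. Passing a file path to `open_store` instead of the in-memory default is what makes the history persist between runs.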

Advantages: Data history (including the potential for sniffing out deleted records), easier data cleanup and transformation, and the ability for PANDA to contain a perfect replica of data that could also be driving a live, dynamic application.

Inevitable downsides: More work to build, more moving parts, and an intermediate data source that must be actively maintained (and probably audited). Requires persistent storage, which may be cheap but still isn't free. May not be feasible for some cases (e.g. Chicago's payments database, where there is no safe way to ensure row uniqueness).

Conclusion: These techniques give you a lot of power, but you must decide whether the benefit is worth the maintenance overhead.

Additional wild-assed idea: Another option we're tentatively exploring is storing lightly processed versions of the raw JSON data dumps and using standard diff/version control tools to track changes.
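
To make that idea a little more concrete, here is a tentative sketch of the "lightly processed" step: write each record as one sorted-key JSON line, in a stable order, so that a line-based diff shows only real changes. Python's difflib stands in for an external diff or version control tool, and the function names and `id_field` parameter are assumptions for the example.

```python
import difflib
import json


def normalize(records, id_field="id"):
    """Render records as sorted, one-record-per-line JSON so diffs stay small."""
    ordered = sorted(records, key=lambda record: str(record[id_field]))
    return [json.dumps(record, sort_keys=True) for record in ordered]


def diff_dumps(old_records, new_records, id_field="id"):
    """Return unified-diff lines between two dumps (empty if nothing changed)."""
    return list(
        difflib.unified_diff(
            normalize(old_records, id_field),
            normalize(new_records, id_field),
            fromfile="previous",
            tofile="current",
            lineterm="",
            n=0,  # no context lines; each changed record stands alone
        )
    )
```

Sorting on an ID only helps if that ID survives between dumps. For sources that regenerate their identifiers on every import, the sort key would have to be some other stable field or combination of fields.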

Implications for PANDA notifications and dataset commenting

This discussion of potential scraping methodologies helps bring the goals and design of PANDA's notification and dataset commenting systems into focus.

  • It may be a lot of work for little reward to try to get PANDA to watch for new records and send notifications.
  • It may be a lot of work (but could still be quite valuable) to get PANDA to watch for new search results and send notifications.
  • The PANDA API should expose an endpoint for creating log messages to go with a dataset, and a mechanism to let end-users subscribe to them. Then developers can provide richer change summaries and other useful information when manipulating data via the API.
  • A monolithic commenting system (where log messages and conversation are undifferentiated) could be problematic. There should be some way to distinguish between comments and log messages, and perhaps between de rigueur log messages ("successfully uploaded the drycleaners database for the 457th time") and notification-worthy log messages ("300 county contract records disappeared from the database last night").