Iterating on api tutorial. #549.

1 parent c7e9677 commit f2d2f8b33f9e06d68034cdeb103acbed9c310727 @onyxfish committed Jul 26, 2012
Showing with 112 additions and 1 deletion.
  1. +75 −0 api_examples/scraperwiki_twitter.py
  2. +30 −0 docs/api_tutorial.rst
  3. +1 −0 docs/index.rst
  4. +6 −1 panda/api/data.py
@@ -0,0 +1,75 @@
+#!/usr/bin/env python
+
+"""
+Example showing how to import Twitter data from the Scraperwiki API.
+"""
+
+import json
+
+import requests
+
+PANDA_API = 'http://localhost:8000/api/1.0'
+PANDA_AUTH_PARAMS = {
+ 'email': 'panda@pandaproject.net',
+ 'api_key': 'edfe6c5ffd1be4d3bf22f69188ac6bc0fc04c84b'
+}
+PANDA_DATASET_SLUG = 'twitter-pandaproject'
+
+PANDA_DATASET_URL = '%s/dataset/%s/' % (PANDA_API, PANDA_DATASET_SLUG)
+PANDA_DATA_URL = '%s/dataset/%s/data/' % (PANDA_API, PANDA_DATASET_SLUG)
+PANDA_BULK_UPDATE_SIZE = 1000
+
+SCRAPERWIKI_URL = 'https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=jsonlist&name=basic_twitter_scraper_437&query=select%20*%20from%20%60swdata%60'
+COLUMNS = ['text', 'id', 'from_user']
+
+# Utility functions
+def panda_get(url, params={}):
+ params.update(PANDA_AUTH_PARAMS)
+ return requests.get(url, params=params)
+
+def panda_put(url, data, params={}):
+ params.update(PANDA_AUTH_PARAMS)
+ return requests.put(url, data, params=params, headers={ 'Content-Type': 'application/json' })
+
+# Check if dataset exists
+response = panda_get(PANDA_DATASET_URL)
+
+# Create dataset if necessary
+if response.status_code == 404:
+ dataset = {
+ 'name': 'PANDA Project Twitter Search',
+ 'description': 'Results of the scraper at <a href="https://scraperwiki.com/scrapers/basic_twitter_scraper_437/">https://scraperwiki.com/scrapers/basic_twitter_scraper_437/</a>.'
+ }
+
+ response = panda_put(PANDA_DATASET_URL, json.dumps(dataset), params={
+ 'columns': ','.join(COLUMNS),
+ })
+
+# Fetch latest data from Scraperwiki
+print 'Fetching latest data'
+response = requests.get(SCRAPERWIKI_URL)
+
+data = json.loads(response.content)
+
+put_data = {
+ 'objects': []
+}
+
+for i, row in enumerate(data['data']):
+ put_data['objects'].append({
+ 'data': row,
+ 'external_id': unicode(row[1])
+ })
+
+ if i and i % PANDA_BULK_UPDATE_SIZE == 0:
+ print 'Updating %i rows...' % PANDA_BULK_UPDATE_SIZE
+
+ panda_put(PANDA_DATA_URL, json.dumps(put_data))
+ put_data['objects'] = []
+
+if put_data['objects']:
+ print 'Updating %i rows' % len(put_data['objects'])
+ response = panda_put(PANDA_DATA_URL, json.dumps(put_data))
+
+print 'Done'
+
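The loop above assumes ScraperWiki's ``jsonlist`` output delivers each row as a plain list under a ``data`` key, in the same column order as ``COLUMNS``, so ``row[1]`` is the tweet id that becomes the ``external_id``. Below is a minimal sketch of that transformation, separate from the committed script; the sample rows and the ``keys`` entry are made up for illustration, and only ``data`` is actually read:

    import json

    # Hypothetical jsonlist-style payload; the committed script only relies on
    # the 'data' key and on column 1 holding the tweet id.
    scraperwiki_response = {
        'keys': ['text', 'id', 'from_user'],
        'data': [
            ['Trying out the PANDA Project!', '227073759581667328', 'pandaproject'],
            ['Importing tweets via the API.', '227073759581667329', 'another_user'],
        ]
    }

    # Each row becomes one object in the bulk PUT body, keyed by its tweet id.
    put_data = {
        'objects': [
            {'data': row, 'external_id': unicode(row[1])}
            for row in scraperwiki_response['data']
        ]
    }

    print json.dumps(put_data, indent=2)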
docs/api_tutorial.rst
@@ -0,0 +1,30 @@
+===================
+API Import Tutorial
+===================
+
+
+PANDA's API is designed to allow you to programmatically import data into (and, to a lesser extent, export data from) PANDA. In this tutorial we will show you how to use the PANDA API to programmatically import data from a variety of sources. Our examples are written in Python using the `Requests <http://python-requests.org>`_ library, but should be easy to port to any language.
+
+If you want to skip straight to the code, check out our `API examples <https://github.com/pandaproject/panda/tree/master/api_examples>`_ on GitHub.
+
+Can my data be updated?
+=======================
+
+Before you use the API to import data into PANDA, ask yourself whether each row of data has a unique id that will allow you to identify it. If you use SQL, think of this as the *primary key* for the dataset. In PANDA we call this value the ``external_id``, because it is generated *external* to PANDA. An ``external_id`` could be anything from a row number to a social security number.
+
+If you can provide an ``external_id`` for your data you will be able to read and update your individual rows of data at a unique URL::
+
+ GET http://localhost:8000/api/1.0/dataset/[slug]/data/[external_id]/
+
+If your data doesn't have an ``external_id`` then you won't be able to read or update individual rows of data. (You can still find them via search or make changes in bulk.) In this case the only way to *synchronize* changes between PANDA and your source dataset will be to delete and reimport all rows.
+
+Even if you do have an ``external_id`` you may still need to delete all rows if your source doesn't provide a *changelog*. A changelog is a stream of metadata that describes when rows of data are modified. Without a changelog it will be impossible to tell if a row of data has been *deleted*. CouchDB is an example of a database that provides a changelog. (See our `CouchDB example <https://github.com/pandaproject/panda/blob/master/api_examples/couchdb.py>`_.)
+
+In most cases you will have an ``external_id``, but not a changelog, so you will need to decide if it is important that rows deleted in the source dataset are also deleted in your PANDA. If so, you will need to wipe out all the data before importing the new data.
+
+Our source data
+===============
+
+In this tutorial we are going to import the results of a very simple web scraper, `hosted on Scraperwiki <https://scraperwiki.com/scrapers/basic_twitter_scraper_437/>`_, that aggregates all tweets about the PANDA Project. Because Scraperwiki has an API, we can write a short script to import those results and then run it as often as we like.
+
+
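As a companion to the tutorial above, here is a minimal sketch, separate from the commit, of reading a single row back at its ``external_id`` URL. It reuses the demo slug and credentials from the example script; the tweet id is hypothetical:

    import requests

    PANDA_API = 'http://localhost:8000/api/1.0'
    PANDA_AUTH_PARAMS = {
        'email': 'panda@pandaproject.net',
        'api_key': 'edfe6c5ffd1be4d3bf22f69188ac6bc0fc04c84b'
    }

    # Hypothetical tweet id; any external_id you have imported will work here.
    external_id = '227073759581667328'
    row_url = '%s/dataset/twitter-pandaproject/data/%s/' % (PANDA_API, external_id)

    response = requests.get(row_url, params=PANDA_AUTH_PARAMS)
    print response.status_code   # 200 if the row exists, 404 if it does not
    print response.content       # JSON describing the stored row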
docs/index.rst
@@ -88,6 +88,7 @@ Extending PANDA
.. toctree::
:maxdepth: 1
+ api_tutorial.rst
api.rst
* `Source code repository <https://github.com/pandaproject/panda>`_
panda/api/data.py
@@ -61,7 +61,9 @@ def is_valid(self, bundle, request=None):
errors['data'] = ['The data field is required.']
if 'external_id' in bundle.data:
- if not re.match('^[\w\d_-]+$', bundle.data['external_id']):
+ if not isinstance(bundle.data['external_id'], basestring):
+ errors['external_id'] = ['external_id must be a string.']
+ elif not re.match('^[\w\d_-]+$', bundle.data['external_id']):
errors['external_id'] = ['external_id can only contain letters, numbers, underscores and dashes.']
return errors
@@ -325,6 +327,9 @@ def put_list(self, request, **kwargs):
self.is_valid(bundle, request)
+ if bundle.errors:
+ self.error_response(bundle.errors, request)
+
bundles.append(bundle)
data.append((
bundle.data['data'],
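To see what the validation change buys, here is a rough standalone sketch of the ``external_id`` rules as added in ``is_valid`` above (not the actual API code): non-string values are rejected with an explicit message instead of reaching the regex, and strings may only contain letters, numbers, underscores and dashes. Combined with the new ``error_response`` call in ``put_list``, a bulk PUT containing such a value should now fail with a clean validation error rather than an unhandled exception.

    import re

    def check_external_id(value):
        # Standalone sketch of the checks added in is_valid() above.
        errors = []
        if not isinstance(value, basestring):
            errors.append('external_id must be a string.')
        elif not re.match('^[\w\d_-]+$', value):
            errors.append('external_id can only contain letters, numbers, underscores and dashes.')
        return errors

    print check_external_id(u'227073759581667328')   # [] -- valid
    print check_external_id('tweet_1-a')             # [] -- valid
    print check_external_id(227073759581667328)      # non-string: ['external_id must be a string.']
    print check_external_id('has spaces')            # disallowed characters: one error message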
