[#529] Update importing datasets doc for 2.0.
Remove section on loader scripts.
johnglover committed Apr 23, 2013
1 parent d3cd3ec commit 2e64aa5
Showing 1 changed file with 28 additions and 62 deletions.
90 changes: 28 additions & 62 deletions doc/importing-datasets.rst
@@ -1,29 +1,31 @@
=============
Load Datasets
=============
==================
Importing Datasets
==================

You can upload individual datasets through the CKAN front-end, but for importing datasets en masse, you have two choices:
You can create individual datasets using the CKAN front-end.
However, when importing multiple datasets it is generally more efficient to
automate this process in some way.
There are two common approaches to importing datasets in CKAN:

* :ref:`load-data-api`. You can use the `CKAN API <api.html>`_ to script imports. To simplify matters, we provide standard loading scripts for Google Spreadsheets, CSV and Excel.
* :ref:`load-data-api`. Using the `CKAN API <api.html>`_.

* :ref:`load-data-harvester`. The `CKAN harvester extension <https://github.com/okfn/ckanext-harvest/>`_ provides web and command-line interfaces for larger import tasks.
* :ref:`load-data-harvester`. Using the
`CKAN harvester extension <https://github.com/okfn/ckanext-harvest/>`_.
This provides web and command-line interfaces for larger import tasks.

If you need advice on data import, `contact the ckan-dev mailing list <http://lists.okfn.org/mailman/listinfo/ckan-dev>`_.
.. note :: If loading your data requires scraping a web page regularly, you
may find it best to write a scraper on
`ScraperWiki <http://www.scraperwiki.com>`_ and combine this with either of
the methods above.
.. note :: If loading your data requires scraping a web page regularly, you may find it best to write a scraper on `ScraperWiki <http://www.scraperwiki.com>`_ and combine this with either of the methods above.
.. _load-data-api:

Import Data with the CKAN API
-----------------------------

You can use the `CKAN API <api.html>`_ to upload datasets directly into your CKAN instance.

The Simplest Approach - CKAN API
++++++++++++++++++++++++++++++++

The simplest way to automate dataset loading is with a Python script using
:doc:`CKAN's API <api>`. Here's an example script to create a new dataset::
You can use the `CKAN API <api.html>`_ to upload datasets directly into your
CKAN instance. Here's an example script that creates a new dataset::

#!/usr/bin/env python
import urllib2
@@ -33,16 +35,16 @@ The simplest way to automate dataset loading is with a Python script using

# Put the details of the dataset we're going to create into a dict.
dataset_dict = {
'name': 'my_dataset_name',
'notes': 'A long description of my dataset',
'name': 'my_dataset_name',
'notes': 'A long description of my dataset',
}

# Use the json module to dump the dictionary to a string for posting.
data_string = urllib.quote(json.dumps(dataset_dict))

# We'll use the package_create function to create a new dataset.
request = urllib2.Request(
'http://www.my_ckan_site.com/api/action/package_create')
'http://www.my_ckan_site.com/api/action/package_create')

# Creating a dataset requires an authorization header.
# Replace *** with your API key, from your user account on the CKAN site
@@ -62,56 +64,20 @@ The simplest way to automate dataset loading is with a Python script using
pprint.pprint(created_package)
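
Since the goal here is importing many datasets, the single-dataset script above extends naturally to a loop. The following is a minimal sketch in Python 3 (the script above targets Python 2), assuming the same ``package_create`` action and authorization header. The site URL, the ``***`` API key placeholder, the second dataset and the ``build_create_request`` helper are illustrative assumptions, and the sketch also assumes the CKAN site accepts a JSON request body. The requests are only built here, not sent.

```python
import json
import urllib.request

# Hypothetical site URL and API key placeholder, for illustration only.
CKAN_URL = 'http://www.my_ckan_site.com'
API_KEY = '***'


def build_create_request(dataset_dict, ckan_url=CKAN_URL, api_key=API_KEY):
    """Build (but do not send) a package_create request for one dataset."""
    return urllib.request.Request(
        ckan_url + '/api/action/package_create',
        # Assumption: the site accepts the dataset as a JSON request body.
        data=json.dumps(dataset_dict).encode('utf-8'),
        headers={
            'Authorization': api_key,
            'Content-Type': 'application/json',
        },
    )


# Illustrative datasets; in practice these might come from a CSV file
# or another catalogue.
datasets = [
    {'name': 'my_dataset_name', 'notes': 'A long description of my dataset'},
    {'name': 'my_other_dataset', 'notes': 'Another description'},
]

requests_to_send = [build_create_request(d) for d in datasets]

# To actually create the datasets, each request would be sent, e.g.:
#     response = urllib.request.urlopen(request)
#     result = json.loads(response.read())
```

Building the requests separately from sending them makes it easy to inspect or log what will be created before touching the live site.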


Loader Scripts
++++++++++++++

'Loader scripts' provide a simple way to take any format metadata and bulk upload it to a remote CKAN instance.

Essentially each set of loader scripts converts the dataset metadata to the standard 'dataset' format, and then loads it into CKAN.

To get a flavour of what loader scripts look like, take a look at `the ONS scripts <https://github.com/okfn/ckanext-dgu/tree/master/ckanext/dgu/ons>`_.

Loader Scripts for CSV and Excel
********************************

For CSV and Excel formats, the `SpreadsheetPackageImporter` (found in ``ckanext-importlib/ckanext/importlib/spreadsheet_importer.py``) loader script wraps the file in `SpreadsheetData` before extracting the records into `SpreadsheetDataRecords`.

SpreadsheetPackageImporter copes with multiple title rows, data on multiple sheets, dates. The loader can reload datasets based on a unique key column in the spreadsheet, choose unique names for datasets if there is a clash, add/merge new resources for existing datasets and manage dataset groups.

Loader Scripts for Google Spreadsheets
**************************************

The `SimpleGoogleSpreadsheetLoader` class (found in ``ckanclient.loaders.base``) simplifies the process of loading data from Google Spreadsheets (there is an additional dependency on the ``gdata`` Python package).

`This script <https://bitbucket.org/okfn/ckanext/src/default/bin/ckanload-italy-nexa>`_ has a simple example of loading data from Google Spreadsheets.

Write Your Own Loader Script
****************************

## this needs work ##

First, you need an importer that derives from `PackageImporter` (found in ``ckan/lib/importer.py``). This takes whatever format the metadata is in and sorts it into records of type `DataRecord`.

Next, each DataRecord is converted into the correct fields for a dataset using the `record_2_package` method. This results in dataset dictionaries.

The `PackageLoader` takes the dataset dictionaries and loads them onto a CKAN instance using the ckanclient. There are various settings to determine:

* ##how to identify the same dataset, previously loaded into CKAN.## This can be simply by name or by an identifier stored in another field.
* how to merge in changes to existing datasets. It can simply replace the dataset or merge in resources, for example.

The loader should be given a command-line interface using the `Command` base class (``ckanext/command.py``).

You need to add a line to the CKAN ``setup.py`` (under ``[console_scripts]``) and when you run ``python setup.py develop`` it creates a script for you in your Python environment.

.. _load-data-harvester:

Import Data with the Harvester Extension
----------------------------------------

The `CKAN harvester extension <https://github.com/okfn/ckanext-harvest/>`_ provides useful tools for more advanced data imports.
The `CKAN harvester extension <https://github.com/okfn/ckanext-harvest/>`_
provides useful tools for more advanced data imports.

These include a command-line interface and a web user interface for running harvesting jobs.
These include a command-line interface and a web user interface for running
harvesting jobs.

To use the harvester extension, create a class that implements the `harvester interface <https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/interfaces.py>`_ derived from the `base class of the harvester extension <https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/harvesters/base.py>`_.
To use the harvester extension, create a class that implements the
`harvester interface <https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/interfaces.py>`_
and is derived from the
`base class of the harvester extension <https://github.com/okfn/ckanext-harvest/blob/master/ckanext/harvest/harvesters/base.py>`_.
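
As a rough illustration of the shape such a class might take, here is a minimal skeleton. It is a sketch under assumptions, not the extension's own code: ``HarvesterBase`` is replaced by a local stand-in so the example runs without CKAN installed, ``MyHarvester`` and all returned values are hypothetical, and the method names (``info``, ``gather_stage``, ``fetch_stage``, ``import_stage``) reflect my understanding of the linked interface and should be checked against ``interfaces.py``.

```python
class HarvesterBase(object):
    """Local stand-in so this sketch runs without CKAN installed.

    In a real extension this would presumably be imported instead, e.g.:
        from ckanext.harvest.harvesters.base import HarvesterBase
    """


class MyHarvester(HarvesterBase):
    """Hypothetical harvester; names and return values are illustrative."""

    def info(self):
        # Metadata describing this harvester type.
        return {
            'name': 'my_harvester',
            'title': 'My Harvester',
            'description': 'Imports datasets from a hypothetical source',
        }

    def gather_stage(self, harvest_job):
        # Identify the remote records to import. A real implementation
        # would create harvest objects and return their ids; this
        # placeholder finds nothing.
        return []

    def fetch_stage(self, harvest_object):
        # Retrieve the content for a single record (placeholder).
        return True

    def import_stage(self, harvest_object):
        # Turn the fetched content into a CKAN dataset (placeholder).
        return True


harvester = MyHarvester()
```

Splitting the work into gather, fetch and import stages lets each remote record be processed independently of the others.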

For more information on working with extensions, see :doc:`extensions`.
