data_packages

Installing, finding, using data packages

See also discussion on Python datapackage package issues

Introduction

We propose an API for installing data packages locally, so that libraries can find installed data packages.

Background

We often find that we would like to run code tests or examples against fairly large datasets.

We can't save datasets in the code repository

We can't usually put the datasets in the code repository because:

The datasets may be too large;
The datasets may be shared across projects, so that either each repository needs its own copy, or one project has to depend on the (development) tree of the other;
The datasets may not be properly licensed, or may have a license that is incompatible with that of the source code.

My own use case is that I need suites of example images in different image formats to test image loading in nibabel and nipy. In particular, I need a library of DICOM images of different manufacturers and modalities to test DICOM image conversion. These images will be too large for the code repository, and it may not be possible to release the images for public distribution.

Datasets need to be versioned

The data needs to be versioned, because, as the code and examples evolve, the datasets may also evolve, usually adding more data, but maybe fixing broken data.

The code test or example may therefore need to specify a minimum version of the data that is compatible with the test or example, maybe:

>>> from datatool import find_package
>>> templates = find_package('nipy-templates', version='>=0.3')

We want to be able to keep a local copy of the data, because:

If the data is too large for the repository, it will likely also be too large to download repeatedly to a temporary directory;
In the spirit of distributed version control, we would like to be able to work offline.

There should be data package directories and data package containers

We would like an API that could do something like:

data-tool install "nipy-templates>=0.3"  # quoted so the shell does not treat > as a redirect

– such that we could run the code above (repeated here):

>>> from datatool import find_package
>>> templates = find_package('nipy-templates', version='>=0.3')

– and templates would be an object that knows where to find the data.

The install facility could be a later addition. For now we might be content to do something like:

curl -O http://an.example.org/data-packages/nipy-templates-0.3.zip
unzip nipy-templates-0.3.zip # to nipy-templates-0.3 directory
data-tool pkg-path add nipy-templates-0.3

We might also want to have directories in which data package utilities expect to find data packages. For example:

data-tool container-path add . # enclosing directory
curl -O http://an.example.org/data-packages/nipy-data-0.2.zip
unzip nipy-data-0.2.zip
curl -O http://an.example.org/data-packages/nipy-other-1.1.zip
unzip nipy-other-1.1.zip

– where nipy-data-0.2 and nipy-other-1.1 would be found by a data package utility because the directory that contains them is a registered container path.

Defining PACKAGE_PATH and CONTAINER_PATH

PACKAGE_PATH – a directory containing a data package. The directory will contain a datapackage.json file (see below);
CONTAINER_PATH – A path that contains directories that are PACKAGE_PATHs.
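
The two definitions above can be illustrated with a minimal sketch. The helper names `is_package_path` and `packages_in_container` are hypothetical and not part of any existing library, and the sketch assumes that only the immediate subdirectories of a container are searched:

```python
from pathlib import Path


def is_package_path(path):
    """Return True if `path` is a directory holding a datapackage.json file."""
    return (Path(path) / "datapackage.json").is_file()


def packages_in_container(container_path):
    """Return the PACKAGE_PATHs found directly inside a CONTAINER_PATH.

    Assumption of this sketch: nested containers are not searched.
    """
    container = Path(container_path)
    if not container.is_dir():
        return []
    return sorted(p for p in container.iterdir() if is_package_path(p))
```

A directory without datapackage.json is simply ignored, so a container can hold unrelated directories alongside data packages.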

There should be system and user data packages

Imagine a code package my-code.

It could be installed by me, a user, in my home directory, or by the system administrator, in the system directories, for all users. I could also have installed my own, more recent copy on a system that already has my-code installed system-wide.

Imagine there are some tests and examples in my-code that require data, call this my-data.

If there is a system copy of my-code, there may also be a system copy of my-data. Even if I installed a user copy of my-code, I will still hope to use the system my-data if it has a high enough version to support the new tests and examples. If there is no system copy of my-data I will need to install my-data somewhere in my home space.

If we have install, then I might want to do:

data-tool install --user my-data

or:

data-tool install --system my-data

Similarly:

data-tool pkg-path add --user /path/to/data/my-data

and:

data-tool container-path add --user /path/to/data

We should be able to use data packages under version control

We may well want to keep data packages under version control, perhaps using something like git annex.

The data will develop with the code using it, so it will be common for a developer to have the data packages checked out on their local machine.

The data files are large enough that it would be wasteful of space and time to have to make local package archives and then install them, in order to use the local data package repository.

Therefore we would like to be able to use the data package repository as a PACKAGE_PATH. Something like:

git clone git://my-provider.org/my-data.git
data-tool pkg-path add $PWD/my-data

In this case, we would like to be able to get the data package version from version control with tools like git describe:

$ data-tool version my-data
0.3.0-23-g123bca0

Proposal

Datapackage format

We can use the OKFN data package format, defined in detail in the data package spec.

This is a very simple format: a data package is a directory containing a datapackage.json file in the format the spec defines.

The only absolute requirement for datapackage.json is that it should have a URL-usable name.

See Python datapackage for Python code for working with data packages following this format.

Example use – data suites

We might want to make a suite containing a set of packages for a particular use such as testing with my-code. Git submodules can help:

mkdir my-suite
cd my-suite
git init
git submodule add https://github.com/yarikoptic/nitest-balls1
git submodule add https://github.com/yarikoptic/nitest-balls2
git commit -m "Added some data packages"
cd ..

Register by recording the enclosing path as a CONTAINER_PATH:

data-tool container-path add my-suite

or specify that the record should go in the user configuration (see above):

data-tool container-path add --user my-suite

or in the system configuration (see below):

data-tool container-path add --system my-suite

Unregister with:

data-tool container-path rm --user my-suite

You can instead add the contained directories as PACKAGE_PATHs with:

data-tool pkg-path add my-suite/nitest-balls1
data-tool pkg-path add my-suite/nitest-balls2

Using data

For a package name nipy-templates:

>>> from datatool import find_package
>>> templates = find_package('nipy-templates')
>>> templates.path
'/usr/local/share/data/nipy-templates'
>>> templates.version
'0.1'

You can also specify a version string:

>>> templates = find_package('nipy-templates', '>=0.3')

Without a version string, find_package returns the package with the highest version.
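
As a rough sketch of the selection rule just described, the following picks the highest available version, optionally restricted by a single-clause spec such as '>=0.3'. The names `version_key`, `satisfies` and `best_match` are hypothetical. The comparison here is a simplified numeric one, not the LooseVersion ordering discussed later in this document:

```python
import operator
import re

# Longer operators first so that ">=" is not read as ">".
_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}


def version_key(version):
    """Turn '0.3.1' into (0, 3, 1); anything after the first '-' is ignored.

    Simplification: '0.3' and '0.3.0' compare as different versions.
    """
    return tuple(int(n) for n in re.findall(r"\d+", version.split("-")[0]))


def satisfies(version, spec):
    """Check `version` against a one-clause spec such as '>=0.3'."""
    match = re.fullmatch(r"\s*(>=|<=|==|>|<)\s*(\S+)\s*", spec)
    if match is None:
        raise ValueError(f"unsupported version spec: {spec!r}")
    op, wanted = match.groups()
    return _OPS[op](version_key(version), version_key(wanted))


def best_match(versions, spec=None):
    """Return the highest version that satisfies `spec`, or None."""
    candidates = [v for v in versions if spec is None or satisfies(v, spec)]
    return max(candidates, key=version_key, default=None)
```

Returning None when nothing matches leaves it to the caller to decide whether a missing package is an error or a reason to skip a test.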

You can get a package path from the command line too:

data-tool pkg-path find "nipy-templates>=0.3"  # quoted so the shell does not treat > as a redirect

Making a data package

There is a utility to make data packages from files in a directory:

data-tool make-pkg .

This writes a default datapackage.json file (see below).

A data package is a directory with a configuration file called datapackage.json. This must specify the package name:

{
    "name" : "nipy-templates"
}

It may also specify a version:

{
    "name" : "nipy-templates",
    "version" : "0.1"
}

Data packages and version control

If there is no "version", or the version is null, then the library should get this from version control of the package directory, or fail. So this:

{
    "name" : "nipy-templates",
    "version" : null
}

would cause datatool to try git describe in the first instance to get the package version, then the equivalent hg command¹. If all of these fail, the package is not valid.
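
The fallback described above might look something like the sketch below. `git_describe` and `resolve_version` are hypothetical names, and only the git case is shown. The list of describer functions is passed in as an argument so that an hg equivalent could be appended:

```python
import json
import subprocess
from pathlib import Path


def git_describe(pkg_dir):
    """Return the output of `git describe` run in pkg_dir, or None on failure."""
    try:
        result = subprocess.run(
            ["git", "describe"], cwd=pkg_dir, capture_output=True,
            text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or pkg_dir is not a tagged git checkout.
        return None
    return result.stdout.strip() or None


def resolve_version(pkg_dir, describers=(git_describe,)):
    """Return the declared version, else the first version found via VCS.

    Raises ValueError if no version can be found, because such a
    package is treated as invalid.
    """
    meta = json.loads((Path(pkg_dir) / "datapackage.json").read_text())
    if meta.get("version") is not None:
        return meta["version"]
    for describe in describers:
        version = describe(pkg_dir)
        if version:
            return version
    raise ValueError(f"no version for data package in {pkg_dir}")
```

A missing "version" key and an explicit null are handled the same way, matching the rule stated above.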

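Version comparisons use distutils.version.LooseVersion (note that distutils was removed from the standard library in Python 3.12, so this needs an older Python or a replacement with equivalent behavior):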

>>> from distutils.version import LooseVersion
>>> LooseVersion('1.3.1') > LooseVersion('1.3.0-519-ga1b925f')
True

By default datatool will strip an initial v before digits from the output of git describe – for example git describe output of v0.1 will give version 0.1.

If you want a more complicated rule relating git describe to version, use vcs_version_regex:

{
    "name" : "nipy-templates",
    "version" : null,
    "vcs_version_regex" : "rel-(.*)"
}

vcs_version_regex is an extension to the data package spec.

vcs_version_regex is matched against the output of git describe and must contain a single group that captures the version string, as in:

>>> import re
>>> git_describe_output = 'rel-0.1-111-g1234567'
>>> re.match('rel-(.*)', git_describe_output).groups()[0]
'0.1-111-g1234567'

This allows the package author to have their own preferred tag naming scheme.
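
Both rules, the default stripping of an initial v and the optional vcs_version_regex, could be combined in one small function. This is a sketch only: `version_from_describe` is a hypothetical name, and raising ValueError on a failed match is an assumption rather than specified behavior:

```python
import re


def version_from_describe(describe_output, vcs_version_regex=None):
    """Extract a version string from the output of `git describe`."""
    text = describe_output.strip()
    if vcs_version_regex is None:
        # Default rule: drop an initial "v" only when digits follow it.
        return re.sub(r"^v(?=\d)", "", text)
    match = re.match(vcs_version_regex, text)
    if match is None or not match.groups():
        raise ValueError(f"{vcs_version_regex!r} did not match {text!r}")
    # The first group is taken to be the version string.
    return match.group(1)
```

The lookahead in the default rule leaves tags such as version-1 untouched, since their leading v is not followed by a digit.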

datapackage.json can also give MD5 hashes for the files in the archive:

{
    "name" : "nipy-templates",
    "resources" : [
        {
            "path" : "mni/T1.img",
            "hash" : "1ea8f4f1e41bc17a94602e48141fdbc8"
        },
        {
            "path" : "mni/T2.img",
            "hash" : "f41f2e1516d880547fbf7d6a83884f0d"
        }
    ]
}

See the data package spec for more detail on specifying resources.

Paths are always relative paths in Unix (/) format. The data package application will convert these Unix paths to native paths when validating MD5 hashes on Windows.

The verify command checks the MD5 sums if present:

data-tool verify nipy-templates

Or, from Python:

>>> templates = find_package('nipy-templates')
>>> result, message = templates.verify()

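A minimal sketch of what such a check might do is shown here. `md5_of` and `verify_package` are hypothetical names, the (result, message) return value mirrors the call above, and the message wording is invented for illustration:

```python
import hashlib
import json
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(pkg_dir):
    """Return (ok, message) after checking every resource that lists a hash."""
    pkg_dir = Path(pkg_dir)
    meta = json.loads((pkg_dir / "datapackage.json").read_text())
    for resource in meta.get("resources", []):
        expected = resource.get("hash")
        if expected is None:
            continue  # resources without a hash are not checked
        # Resource paths use "/" separators; joining the parts one by one
        # lets pathlib build the native form on Windows.
        target = pkg_dir.joinpath(*resource["path"].split("/"))
        if not target.is_file():
            return False, f"missing file: {resource['path']}"
        if md5_of(target) != expected.lower():
            return False, f"hash mismatch: {resource['path']}"
    return True, "all hashes match"
```

Stopping at the first failure keeps the sketch short. A fuller version might collect every mismatch before reporting.
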
A data package will usually have both a Unix register executable and a Windows register.bat executable. Running either of these will register the PACKAGE_PATH in a specified application configuration file (see below). For example, register might be:

#!/bin/bash
# Register the directory containing this script, passing on options such as --user.
data-tool pkg-path add "$(dirname "${BASH_SOURCE[0]}")" "$@"

Data configuration file(s)

The default locations for configuration files are (in order of decreasing precedence):

Contents of file named in DATATOOL_CONFIG environment variable;
Contents of datatool.ini in $HOME/.datatool (more generally, directory returned by datatool.environment.get_user_dir());
Contents of datatool.ini in /etc/datatool (more generally, directory returned by datatool.environment.get_system_dir()).
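
The lookup order above could be sketched as follows. `config_candidates` is a hypothetical helper. The user and system directories are passed in as arguments rather than taken from datatool.environment, whose functions are only named in this proposal:

```python
from pathlib import Path


def config_candidates(environ, user_dir, system_dir):
    """Return candidate configuration files, highest precedence first."""
    candidates = []
    # A file named in DATATOOL_CONFIG comes first, if the variable is set.
    if environ.get("DATATOOL_CONFIG"):
        candidates.append(Path(environ["DATATOOL_CONFIG"]))
    candidates.append(Path(user_dir) / "datatool.ini")
    candidates.append(Path(system_dir) / "datatool.ini")
    return candidates
```

In real use `environ` would typically be `os.environ`. Files in the returned list that do not exist would simply be skipped when reading.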

In general, values in files with higher precedence override values in files with lower precedence.

If values are lists, files with higher precedence prepend values to the list, so the files with higher precedence put values earlier in the list.
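
One way to read these two rules is sketched below, with each configuration file already parsed into a dictionary. `merge_configs` is a hypothetical name, the `verbose` key in the usage note exists only for illustration, and dropping duplicate list entries is an assumption of the sketch:

```python
def merge_configs(configs):
    """Merge config dicts given in order of decreasing precedence.

    Scalar values: the highest-precedence value wins.
    List values: lists are concatenated, highest precedence first.
    Assumes a given key holds the same kind of value in every file.
    """
    merged = {}
    for config in configs:
        for key, value in config.items():
            if isinstance(value, list):
                target = merged.setdefault(key, [])
                for item in value:
                    if item not in target:  # keep the earliest occurrence
                        target.append(item)
            else:
                merged.setdefault(key, value)
    return merged
```

For example, merging `{"package_paths": ["/home/me/data/a"], "verbose": True}` ahead of `{"package_paths": ["/usr/share/data/b"], "verbose": False}` keeps `verbose` as True and puts /home/me/data/a first in the list.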

The configuration file can have section data, with optional subfields package_containers and package_paths:

[data]
package_containers :
    /usr/local/share/nipy/dipy
    /usr/share/nipy/dipy
package_paths :
    /usr/local/share/data/nipy-templates
    /usr/local/share/data/nipy-data

package_paths take precedence over paths found in package_containers, but a path in a package_containers list, in a file with higher precedence, overrides package_paths in files with lower precedence. So, assuming this is a file with lower precedence than the file above:

[data]
package_paths :
    /usr/share/dipy/nipy-dicom

– then if /usr/share/nipy/dipy/ contains the same nipy-dicom package, this package will override a package with the same name and version contained in /usr/share/dipy/nipy-dicom above.

The configuration files can also include other configuration files:

[data]
include :
    ~/data/other_data.json
    ~/data/more_data.json
package_paths :
    /usr/share/dipy/nipy-dicom

Values in included files take lower precedence than values in the file including them.

A tilde (~) will be expanded to the path of the user's home directory in all paths in the configuration file.

Default package container paths

The default package container paths have the lowest precedence. The default package container paths are:

$HOME/.datatool/data (more generally, data subdirectory of directory returned by datatool.environment.get_user_dir());
/usr/share/datatool/data and /usr/local/share/datatool/data (more generally, data subdirectories of directories returned by datatool.environment.get_share_dirs()).

Installing packages

See install_data_packages.

Footnotes

¹ Apparently the hg equivalent of git describe is something like hg log -r . --template '{latesttag}-{latesttagdistance}-h{node|short}\n'
