How to Create a New Dataset Module
The skdata project is meant to be a library of data sets for use in machine learning algorithms. People keep creating new data sets, and skdata keeps growing to include them. Adding a new data set to skdata means:
- Pick a name for your dataset (e.g. Comics)
- Fork the skdata project on GitHub, and create a new development branch (e.g. new-dataset/comics)
- In your development branch, create a subfolder in the skdata Python import folder (e.g. skdata/skdata/comics) with at least the following files:
  - __init__.py - boilerplate so Python can import your module
  - dataset.py - downloading, parsing, and unique helper routines
  - view.py - define views that adhere to the evaluation protocol
  - main.py - CLI entry points (scripts) that you think are appropriate
  - tests/test-dataset.py - tests for dataset.py
  - tests/test-view.py - tests for view.py
  Other files are welcome, but start with the files above before creating new ones.
- Get things working, and send a PR back to skdata (... or not, if your code has to stay private).
- Add the data set to the data set list.
The best example module to follow is probably the lfw data set: it downloads and unpacks a few different kinds of files, defines several non-trivial views, offers some interesting scripts in main.py, and as of writing has the most complete test coverage.
What goes in dataset.py
The dataset.py file is technically free-form.
Every data set has gnarly unique logic for how to download, unpack, parse, etc. and that logic goes here.
A data set's "dataset.py" file should include:
- A module docstring that describes the nature of the data set, names the web site that describes the data set more fully, and contains relevant references to academic literature.
- Logic for downloading the data set from the most official internet distribution location possible.
- Logic for unpacking and loading that data set into primitive Python data types, if possible.
When the data set is listed in the data set list, the module docstring serves as the documentation of what the data set is and what your implementation does.
Usually there is a web site that explains the nature of the data set you are providing. Link to that website, and paraphrase it. Include the official citation that people should list in their publications if they use this data set.
If you had to make design choices, or have other remarks for future users about how the implementation works relative to the full data set (restrictions, caveats, etc.), put them in the docstring too.
Don't be surprised if this docstring is 20-50 lines of nicely formatted text.
Skdata data set modules often must download large amounts of data.
To keep things organized, we strongly recommend that your module (e.g.
comics) stores/creates any large persistent files to a subdirectory (e.g. ./comics) of
You are encouraged to embed SHA1 hashes in the source to ensure the data integrity of downloads.
The utils.download function supports checking SHA1 and MD5 hashes. Call it like this:
skdata.utils.download(url, local_filename, sha1=sha1_string)
Feel free to hard-code URLs of public data sets into the dataset.py file.
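The integrity check that an embedded hash makes possible can be sketched with the standard library alone. The helper names `sha1_of_file` and `verify_sha1` and the chunk size are assumptions made for illustration and are not part of skdata; in a real module, passing `sha1=` to `skdata.utils.download` is the intended route.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so that large archives never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha1(path, expected_sha1):
    """Raise IOError if the file does not match the hash embedded in the source."""
    actual = sha1_of_file(path)
    if actual != expected_sha1:
        raise IOError("SHA1 mismatch for %s: expected %s, got %s"
                      % (path, expected_sha1, actual))
```

Hard-coding the expected digest next to the URL means a truncated or silently changed download fails loudly instead of producing a subtly different data set.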
dataset.py loading to self.meta
Once files have been downloaded, more Python logic is required to load that
data into Python data structures.
Ideally data should be loaded into a list of dictionaries, in a data set class attribute called self.meta.
Often one example from the dataset is one element of the list, but sometimes
this mapping isn't obvious. Just pick one.
The dictionary structures need not necessarily be homogeneous (same keys, same
schema) but often it is natural to make them homogeneous when the examples of the
dataset represent I.I.D. data.
self.meta should be built of simple data types and ideally be JSON-encodable.
This opens the door to using various tools to store and access meta-data.
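A minimal sketch of what such a list of dictionaries might look like follows. The `Comics` class, its keys, and its values are invented for illustration; a real data set class would build the list by parsing the downloaded files.

```python
import json


class Comics(object):
    """Hypothetical data set class with hand-written example meta-data."""

    def __init__(self):
        # One dictionary per example, built only from simple types
        # (str, int, float, bool, None, list, dict).
        self.meta = [
            {"id": 0, "filename": "strip_0000.png", "label": "funny"},
            {"id": 1, "filename": "strip_0001.png", "label": "serious"},
        ]


dataset = Comics()
# JSON-encodable meta-data survives an encode/decode round trip unchanged.
round_trip = json.loads(json.dumps(dataset.meta))
```

Checking the round trip in a unit test is a cheap way to catch non-JSON types (such as tuples or numpy scalars) that creep into the meta-data.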
For data sets of large data arrays such as images and video, use your judgement about how to give access to the raw sensor data. The dataset object with the self.meta attribute should be designed to require only a modest amount of memory. Skdata provides some support for lazily-loading large tensors (see larray.py, and the use of larray.py in e.g. lfw's protocol view) that can help in some cases. If it is more natural to memmap a large data file, or use memmap via numpy, or preload downloaded data into an hdf5 database, then do it. The dataset.py is responsible above all for giving access to the data set in Python. Any atypical design choices you make in doing that will simply require view objects to compensate.
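One way to keep the dataset object small is to hold only file paths in memory and read each file when it is indexed. The class below is a hypothetical stand-in written for illustration; it is not skdata's larray module, whose interface is not described here.

```python
class LazyFiles(object):
    """List-like container that loads an item only when it is indexed."""

    def __init__(self, paths, load_fn):
        self.paths = list(paths)   # cheap: only the file names stay in memory
        self.load_fn = load_fn     # e.g. a function that decodes one image

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        # The expensive read happens here, once per access, with no caching.
        return self.load_fn(self.paths[i])


def read_bytes(path):
    """Stand-in loader: return the raw contents of one file."""
    with open(path, "rb") as f:
        return f.read()
```

A view can then index such a container like a list, while the memory cost stays proportional to the number of paths rather than the size of the raw data.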
Sometimes there is meta-data that feels like it applies to an entire dataset
rather than any particular example. That data goes into a dictionary in
Sometimes there is meta-data that is constant across all examples (e.g. image
size). Such meta-data should go in a self.meta_const dictionary.
The idea is that when such an attribute exists,
then every element of self.meta is guaranteed to be consistent with it. In other words,
dict(self.meta[i], **self.meta_const) == self.meta[i] is always true.
(This mechanism is especially important for describing infinite data sets.)
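The consistency guarantee can be checked explicitly. Because `dict.update` modifies a dictionary in place and returns `None`, the sketch below merges into a copy before comparing. The function names and the sample values are assumptions for illustration.

```python
def merged(example, meta_const):
    """Return a copy of `example` with the constants layered on top."""
    out = dict(example)
    out.update(meta_const)
    return out


def is_consistent(meta, meta_const):
    """True if no example contradicts or lacks a constant in `meta_const`."""
    return all(merged(m, meta_const) == m for m in meta)


meta_const = {"image_shape": [64, 64]}
good_meta = [
    {"id": 0, "image_shape": [64, 64]},
    {"id": 1, "image_shape": [64, 64]},
]
bad_meta = good_meta + [{"id": 2, "image_shape": [32, 32]}]
```

Here `is_consistent(good_meta, meta_const)` holds, while `bad_meta` fails the check because its last example disagrees with the declared constant.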
dataset.py loading infinite data sets
Some datasets are for all intents and purposes infinite (i.e. dynamically
generated). In such cases
self.meta could be implemented as a lazily-evaluated list.
The currently implemented
task semantics do not cover infinite data sets. If you have an infinite data
set (i.e. dynamically generated) then design a protocol for it, and then
implement dataset.py to support that protocol.
When the data set is infinite, it is all the more important to list
meta-data constants in self.meta_const.
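A lazily-evaluated meta for a dynamically generated data set might be sketched as follows. The class name, the seeding scheme, and the example keys are assumptions made for illustration; skdata does not prescribe this design.

```python
import itertools
import random


class LazyMeta(object):
    """Sequence-like meta whose entries are generated on demand from the index."""

    def __init__(self, seed=0):
        self.seed = seed

    def __getitem__(self, i):
        if i < 0:
            raise IndexError("negative indexes are undefined for an infinite data set")
        # Seed from (seed, index) so that example i is the same on every access.
        rng = random.Random("%d:%d" % (self.seed, i))
        return {"id": i, "point": [rng.random(), rng.random()]}


meta = LazyMeta(seed=42)
# There is no __len__, so take a finite slice instead of iterating to the end.
first_three = list(itertools.islice(meta, 3))
```

Deriving each example from its index keeps experiments repeatable even though no example is ever stored.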
dataset.py published results
Feel free to hard-code the results of published academic work into this file, to make it easier for library users to produce tables of comparisons with previous work.
What goes in view.py
Use the view.py file to define machine learning problems that a typical user might want to tackle with a learning algorithm.
What is especially valuable are evaluation protocols that are either:
- described by the data set creators, or
- conventional in a research community.
Try to avoid putting one-off problems into the library, because they can distract users from the more conventional problems that they are probably looking for. If a problem was defined by the data set creators, then feel free to put "Official" into its name in the views module.
The design of protocol views is discussed in some detail in the Protocol wiki page.
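As a rough illustration only, a view for a data set whose creators published a fixed train/test split might look like the sketch below. The class name, the `split` key, and the `train`/`test` attributes are all hypothetical; the actual interface a view should expose is defined by the Protocol wiki page, not by this example.

```python
class OfficialTrainTestView(object):
    """Hypothetical view: exposes the creators' train/test split of a data set."""

    def __init__(self, meta):
        # Assumes every example dictionary carries a "split" key set by dataset.py.
        self.train = [m for m in meta if m["split"] == "train"]
        self.test = [m for m in meta if m["split"] == "test"]


meta = [
    {"id": 0, "label": "funny", "split": "train"},
    {"id": 1, "label": "serious", "split": "train"},
    {"id": 2, "label": "funny", "split": "test"},
]
view = OfficialTrainTestView(meta)
```

The point of the sketch is the division of labour: dataset.py produces the raw meta-data, and the view only selects and arranges it into a well-defined learning problem.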
What goes in main.py
See some of the main.py files in the library for examples. The only consistent aspect across these files is that they are supposed to print some help, and run any dataset specific scripts like this:
python main.py              # -- prints usage help
python main.py <cmd> [args] # -- runs some script
Some of the data sets in the library use this mechanism to ensure a dataset has been downloaded, or pop up some basic visualization.
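The help-or-dispatch behaviour described above can be sketched with a small command table. The `fetch` command and all function names are assumptions for illustration; the main.py files in the library may be organized differently.

```python
def cmd_fetch(args):
    """Hypothetical command: a real one would make sure the data is downloaded."""
    print("fetch called with %r" % (args,))
    return 0


# Map each command name to the function that implements it.
COMMANDS = {"fetch": cmd_fetch}


def usage():
    return ("Usage: python main.py <cmd> [args]\n"
            "Commands: " + ", ".join(sorted(COMMANDS)))


def main(argv):
    """Print help when no known command is given, else run the command."""
    if len(argv) < 2 or argv[1] not in COMMANDS:
        print(usage())
        return 1
    return COMMANDS[argv[1]](argv[2:])


# In a real main.py the file would end with:
#     if __name__ == "__main__":
#         import sys
#         sys.exit(main(sys.argv))
```

Keeping `main` as a plain function that takes `argv` makes the entry point easy to exercise from unit tests without spawning a subprocess.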
Data set Unit-Testing
Try to factor your code so that as much of it as possible can be tested without the actual large data set from the internet / private source that you normally aim to provide. Users will want to run unit tests across the entire project without downloading any additional files. Feel free to check, in your unit tests, if those large data files have already been downloaded, and if they are already available locally then run additional unit tests.
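One way to follow this advice is to keep parsing logic in small functions that are tested on inline strings, and to guard data-dependent tests with `unittest.skipUnless`. The directory `DATA_DIR`, the `parse_label` helper, and the line format it parses are hypothetical.

```python
import os
import unittest

# Hypothetical location of the downloaded files; adjust to your module.
DATA_DIR = os.path.join(os.path.expanduser("~"), ".hypothetical_data", "comics")


def parse_label(line):
    """Parse one 'filename label' line; needs no downloaded data to test."""
    name, label = line.split()
    return name, label


class TestParsing(unittest.TestCase):
    """Always runs: exercises logic on small inline inputs."""

    def test_parse_label(self):
        self.assertEqual(parse_label("strip_0000.png funny"),
                         ("strip_0000.png", "funny"))


@unittest.skipUnless(os.path.isdir(DATA_DIR), "data set not downloaded")
class TestWithData(unittest.TestCase):
    """Runs only when the large files are already available locally."""

    def test_files_present(self):
        self.assertTrue(os.listdir(DATA_DIR))
```

With this layout the whole project's test suite passes on a fresh checkout, and the extra tests switch on automatically once the data has been fetched.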