This repository builds container images that package a file-based dataset.
Use the makefile `image` or `push` targets and provide the necessary arguments:
- `SOURCE_DATA_DIR` - path to a directory with a dataset; the directory name is assumed to be the name of the dataset
- `DATA_ARCHIVE` - path to an archived dataset, used to calculate the md5sum
- `DESCRIPTION` - short description of the dataset
Example:
make \
SOURCE_DATA_DIR=clinical-trials-data-800k \
DATA_ARCHIVE=clinical-trials-data-800k.tar.xz \
DESCRIPTION="Clinical trials studies and data objects" \
all
The images are pushed to the `onedata/data-container` repository on hub.docker.com.
Each data container has metadata attached to it in the form of image labels and container environment variables. The following metadata is attached:
- `DATA_DIR` - location of the dataset in the container
- `NUMBER_OF_DIRECTORIES` - number of directories inside `DATA_DIR` (excluding `DATA_DIR` itself)
- `NUMBER_OF_FILES` - number of files inside `DATA_DIR`
- `DESCRIPTION` - short description of the dataset
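
The directory and file counts described above could be reproduced with a sketch like the one below. It is only an illustration: the function names `count_directories` and `count_files` are hypothetical, and the makefile in this repository may compute the values differently (for example, it may treat symlinks or special files in another way).

```shell
#!/bin/sh
# Hypothetical helpers, not taken from this repository's makefile.

# Count directories below the dataset root, excluding the root itself
# (matching the "excluding DATA_DIR" note for NUMBER_OF_DIRECTORIES).
count_directories() {
    find "$1" -mindepth 1 -type d | wc -l | tr -d ' '
}

# Count regular files anywhere below the dataset root.
# Note: names containing newlines would be counted more than once.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

For example, `count_files clinical-trials-data-800k` prints the number of regular files in that directory tree. To read the labels actually stored on a pulled image, `docker inspect --format '{{json .Config.Labels}}' <image>` can be used.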
Each dataset is published with two image tags, one marked `latest` and one identified by its md5sum, according to the templates:
- `onedata/data-container:<dataset name>-latest`
- `onedata/data-container:<dataset name>-<md5hash of dataset archive>`
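
As a rough illustration of the tag templates, the sketch below derives both tags from a dataset name and an archive path. The function name `image_tags` is hypothetical, and it assumes the hash is the hexadecimal digest that `md5sum` prints for the archive file; the makefile may derive it differently.

```shell
#!/bin/sh
# Hypothetical helper, not taken from this repository's makefile.
# Prints the two image tags for a dataset, one per line.
#   $1 - dataset name (e.g. the name of SOURCE_DATA_DIR)
#   $2 - path to the dataset archive (DATA_ARCHIVE)
image_tags() {
    dataset_name=$1
    archive=$2
    repo="onedata/data-container"
    # Assumption: the tag uses the hex digest from md5sum (first field of its output).
    hash=$(md5sum "$archive" | cut -d ' ' -f 1)
    printf '%s:%s-latest\n' "$repo" "$dataset_name"
    printf '%s:%s-%s\n' "$repo" "$dataset_name" "$hash"
}
```

Calling `image_tags clinical-trials-data-800k clinical-trials-data-800k.tar.xz` would print the `latest` tag followed by the hash-based tag for that archive.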
The command used to prepare archives:
tar -cf - clinical-trials-data-800k | xz -T 9 -9 -c - > clinical-trials-data-800k.tar.xz
Description | Dataset Link | Docker Image |
---|---|---|
Full set of HDF5 files with telescope metadata | cta-hdf5-data-125k.tar.xz | onedata/data-container:cta-hdf5-data-125k-latest |
A subset of HDF5 files with telescope metadata | cta-hdf5-data-30k.tar.xz | onedata/data-container:cta-hdf5-data-30k-latest |
Clinical trials studies and data objects | clinical-trials-data-800k.tar.xz | onedata/data-container:clinical-trials-data-800k-latest |
COVID-related studies and data objects | covid-data-10k.tar.xz | onedata/data-container:covid-data-10k-latest |