
Can't download dataset metadata without downloading the dataset itself #612

@joaquinvanschoren

Description

There are situations where one only wants to see the meta-data of a given dataset. The current get_dataset call always downloads and parses the dataset itself. This is slow(!), and can cause errors when the files are very large.

It would be nice to have an option (or another call) to download the meta-data only.

To make matters worse, the ARFF parser requires many times the file size in disk space for download and parsing. E.g., downloading a 2 GB dataset requires more than 16 GB of disk space. This fails in Colab notebooks: they crash because of disk-space limits. I would download the CSV version directly, but that requires the file url or file_id, neither of which is returned by list_datasets, so both have to be retrieved with get_dataset. Hence there is currently no way to download larger OpenML datasets into Colab notebooks without manually looking up the file_ids.
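As a stopgap, the description (meta-data only) can be fetched from the public OpenML XML API and parsed with the standard library, without touching the data file itself. This is a minimal sketch, not part of the openml package; the endpoint path, XML namespace, and tag names (oml:name, oml:file_id, oml:url) are assumptions based on the v1 XML API and may need adjusting:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed public OpenML v1 endpoint returning the dataset description as XML.
API = "https://www.openml.org/api/v1/xml/data/{}"
# Assumed namespace used in OpenML description documents.
NS = {"oml": "http://openml.org/openml"}


def parse_metadata(xml_text):
    """Extract a few meta-data fields from a dataset-description XML string."""
    root = ET.fromstring(xml_text)

    def field(tag):
        return root.findtext("oml:" + tag, namespaces=NS)

    return {
        "name": field("name"),
        "format": field("format"),
        "file_id": field("file_id"),
        "url": field("url"),
    }


def get_dataset_metadata(dataset_id):
    """Download only the description XML for a dataset, never the data file."""
    with urllib.request.urlopen(API.format(dataset_id)) as resp:
        return parse_metadata(resp.read().decode("utf-8"))
```

With the returned file_id or url one could then fetch the raw file directly (e.g. the CSV), bypassing the ARFF parser and its disk-space overhead.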

Steps/Code to Reproduce

Download any large dataset (e.g. SVHN):

import openml
openml.datasets.get_dataset(41081)

Expected Results

An OpenMLDataset object without the actual data, or a dict with the meta-data.

Actual Results

get_dataset is very slow, or crashes if disk space is not sufficient.

Versions

Darwin-18.2.0-x86_64-i386-64bit
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.15.3
SciPy 1.1.0
Scikit-Learn 0.20.0
OpenML 0.7.0
