Description
There are situations where one only wants to see the meta-data of a given dataset. The current get_dataset call always downloads and parses the dataset itself. This is slow(!) and can cause errors if the files are very large.
It would be nice to have an option (or another call) to download the meta-data only.
To make matters worse, the ARFF parser requires many times the file size for the download and parsing. For example, downloading a 2GB dataset requires more than 16GB of disk space. This fails in Colab notebooks: they crash because of disk space limits. I would download the CSV version directly, but this requires the file URL or file_id, neither of which are returned by list_datasets and thus have to be retrieved with get_dataset. So, there is currently no way to download larger OpenML datasets into Colab notebooks without manually looking up the file_ids. A sketch of that manual workaround follows below.
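For reference, a minimal sketch of the manual workaround, assuming OpenML's CSV export endpoint; the file_id and dataset name below are placeholders that currently have to be looked up by hand on the website:

import pandas as pd

# Placeholders: neither value is returned by list_datasets, so both must be
# looked up manually on openml.org.
file_id = 12345
dataset_name = "SVHN"

# Assumed URL scheme for OpenML's CSV export of a data file.
url = "https://www.openml.org/data/get_csv/{}/{}.csv".format(file_id, dataset_name)
df = pd.read_csv(url)

This avoids the ARFF parsing overhead entirely, but only works once the file_id is already known.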
Steps/Code to Reproduce
Download any large dataset (e.g. SVHN):
import openml

openml.datasets.get_dataset(41081)
Expected Results
An OpenMLDataset object without the actual dataset, or a dict with the meta-data.
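As a sketch, such a call could look like the following; the download_data flag is hypothetical and not part of the 0.7.0 API:

import openml

# Hypothetical flag: fetch the meta-data only and skip the ARFF download
# and parse. This does not exist in OpenML 0.7.0.
dataset = openml.datasets.get_dataset(41081, download_data=False)

# Meta-data fields would then be available without touching the data itself,
# including the file URL needed for a direct CSV download.
print(dataset.name)
print(dataset.url)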
Actual Results
get_dataset is very slow, or crashes if disk space is not sufficient.
Versions
Darwin-18.2.0-x86_64-i386-64bit
Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 12:04:33)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.15.3
SciPy 1.1.0
Scikit-Learn 0.20.0
OpenML 0.7.0