Support for externally hosted datasets #427

Open
janvanrijn opened this issue Jun 13, 2017 · 3 comments

Comments

@janvanrijn
Member

Currently, OpenML supports datasets that are hosted elsewhere: by uploading a URL instead of a file, OpenML links to an external dataset.
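
For context, a minimal sketch of what "uploading a URL instead of a file" could look like from a client's perspective, assuming the dataset description accepts a url field and the standard XML data-upload endpoint; the field names, URLs, and API key below are placeholders rather than a verified description of the API:

```python
import requests

# Hypothetical sketch: instead of uploading the data file itself, the
# dataset description points at an externally hosted file via a URL.
description = """<oml:data_set_description xmlns:oml="http://openml.org/openml">
  <oml:name>my_external_dataset</oml:name>
  <oml:description>Dataset hosted outside of OpenML.</oml:description>
  <oml:format>ARFF</oml:format>
  <oml:url>https://example.org/data/my_dataset.arff</oml:url>
</oml:data_set_description>"""

response = requests.post(
    "https://www.openml.org/api/v1/xml/data",          # data upload endpoint (assumed)
    params={"api_key": "YOUR_API_KEY"},                 # placeholder key
    files={"description": ("description.xml", description)},
)
print(response.status_code, response.text)
```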

However, the way this functionality is handled is outdated and requires some updates and maintenance. The main question is: do we want to keep this functionality? At the moment it is barely used (only 7 datasets, all of which are still in preparation).

Keeping it has considerable implications for code maintainability and for the amount of work on our side; removing it might make OpenML less interesting for people who want to use OpenML but also want to keep control of their data.

@joaquinvanschoren
Contributor

I definitely want to keep this functionality. In fact, I would say it is highly preferable if we can link to thousands of datasets that are stored elsewhere, but we do indeed need better support (i.e., being able to read and handle more data types).

The reason that most datasets are stored on our server is that we need a stable source of ARFF versions of all these datasets, so we store the ARFF versions ourselves. Once we offer CSV support, there is much more incentive to simply link to data stored elsewhere. We still need to be able to download it and extract meta-data from it.
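
As a rough illustration of "download it and extract meta-data", assuming pandas and a placeholder CSV URL (not an actual linked dataset):

```python
import pandas as pd

# Pull a CSV from an external host and derive basic feature meta-data.
csv_url = "https://example.org/datasets/my_dataset.csv"  # placeholder URL
frame = pd.read_csv(csv_url)

meta_data = {
    "number_of_instances": len(frame),
    "number_of_features": frame.shape[1],
    "feature_types": frame.dtypes.astype(str).to_dict(),
}
print(meta_data)
```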

Besides, as you point out, we want people to keep control of their data. If I want to store my 4GB dataset on AWS or Zenodo, that should be possible. Data storage is actually one of the things other projects are very good at; we don't need to replicate that. What we do need is a place where this data is linked to meta-data and ML experiments.

@janvanrijn
Member Author

Just completed a major update to facilitate this better. I changed the way 'external' files are handled on the server. Long story short: for each external file we now also keep the file meta-data on the server (i.e., md5 hash, mime type, file size, etc.). This will give us more possibilities for consistency checks and the like in the future.
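
A minimal sketch of the kind of file meta-data this refers to; the function and field names are illustrative, not the actual server-side implementation:

```python
import hashlib
import mimetypes

import requests


def external_file_metadata(url: str) -> dict:
    """Download an externally hosted file and compute basic meta-data."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    content = response.content
    mime_type, _ = mimetypes.guess_type(url)
    return {
        "url": url,
        "md5_checksum": hashlib.md5(content).hexdigest(),
        "mime_type": mime_type or response.headers.get("Content-Type"),
        "file_size": len(content),
    }

# A later consistency check could re-download the file and compare the stored
# md5_checksum against a freshly computed one to detect upstream changes.
```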

Most importantly, all files (internal and external) are now handled by the data controller. That means that every available dataset on the server now has the fields file_id and (a valid) md5_checksum. For the workbenches nothing changes, as the url field still shows where to download the ARFF file. However, this URL now always goes through the data controller (which issues a redirect in the case of external files). Internally this makes it very easy to handle everything the same way; it also opens up the possibility of, for example, converting external ARFF files into CSV files.
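
From a workbench's point of view, this means internal and external files can be fetched identically; a sketch of that client-side behaviour (the file id and URL pattern below are placeholders):

```python
import requests

# The url field points at the data controller; for external files the
# controller answers with a redirect, which requests follows transparently.
download_url = "https://www.openml.org/data/download/9999/dataset.arff"  # placeholder

response = requests.get(download_url, allow_redirects=True)
if response.history:  # non-empty when a redirect to an external host happened
    print("redirected to external host:", response.url)
arff_text = response.text
```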

@giuseppec @mfeurer @berndbischl Although nothing should have changed for the clients, it might still be good to rerun the unit tests. Everything is deployed on the test server. If there are no problems I will migrate this to the live server.
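
For the Python workbench, one way to point a test run at the test server first (assuming the client's config module exposes the server URL; the dataset id is only a connectivity smoke test, not a specific dataset from this issue):

```python
import openml

# Switch the client from the live server to the test server before running tests.
openml.config.server = "https://test.openml.org/api/v1/xml"

dataset = openml.datasets.get_dataset(1)  # placeholder id, just to verify the setup
print(dataset.name)
```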

@joaquinvanschoren
Contributor

joaquinvanschoren commented Jun 18, 2017 via email
