Support for externally hosted datasets #427

Open
janvanrijn opened this issue Jun 13, 2017 · 3 comments

Comments

@janvanrijn
Member

Currently, OpenML supports datasets that are hosted elsewhere: by uploading a URL instead of a file, OpenML links to an external dataset.
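
For context, a minimal sketch of what "uploading a URL instead of a file" could look like from a client's perspective, assuming the dataset description accepts a url field and the standard XML data-upload endpoint; the field names, URLs, and API key below are placeholders rather than a verified description of the API:

```python
import requests

# Hypothetical sketch: instead of uploading the data file itself, the
# dataset description points at an externally hosted file via a URL.
description = """<oml:data_set_description xmlns:oml="http://openml.org/openml">
  <oml:name>my_external_dataset</oml:name>
  <oml:description>Dataset hosted outside of OpenML.</oml:description>
  <oml:format>ARFF</oml:format>
  <oml:url>https://example.org/data/my_dataset.arff</oml:url>
</oml:data_set_description>"""

response = requests.post(
    "https://www.openml.org/api/v1/xml/data",          # data upload endpoint (assumed)
    params={"api_key": "YOUR_API_KEY"},                 # placeholder key
    files={"description": ("description.xml", description)},
)
print(response.status_code, response.text)
```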

However, the way this functionality is handled is outdated and requires some updates and maintenance. The main question is: do we want to keep this functionality? At the moment it is barely used (only 7 datasets, all of which are still in preparation).

Keeping it has considerable implications for code maintainability and for the amount of work on our side; removing it might make OpenML less interesting for people who want to use OpenML but also want to keep control of their data.

@joaquinvanschoren
Contributor

I definitely want to keep this functionality. In fact, I would say it is highly preferable if we can link to thousands of datasets that are stored elsewhere, but we do indeed need better support (i.e., being able to read and handle more data types).

The reason that most datasets are stored on our server is that we need a stable source of ARFF versions of all these datasets, so we store the ARFF versions ourselves. Once we offer CSV support, there is much more incentive to simply link to data stored elsewhere. We still need to be able to download it and extract meta-data from it.
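
As a rough illustration of "download it and extract meta-data", assuming pandas and a placeholder CSV URL (not an actual linked dataset):

```python
import pandas as pd

# Pull a CSV from an external host and derive basic feature meta-data.
csv_url = "https://example.org/datasets/my_dataset.csv"  # placeholder URL
frame = pd.read_csv(csv_url)

meta_data = {
    "number_of_instances": len(frame),
    "number_of_features": frame.shape[1],
    "feature_types": frame.dtypes.astype(str).to_dict(),
}
print(meta_data)
```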

Besides, as you point out, we want people to keep control of their data. If I want to store my 4GB dataset on AWS or Zenodo, that should be possible. Data storage is actually one of the things other projects are very good at; we don't need to replicate that. What we do need is a place where this data is linked to meta-data and ML experiments.

@janvanrijn
Member Author

Just completed a major update to facilitate this better. I changed the way 'external' files are handled on the server. Long story short: for each external file we now also keep the file meta-data on the server (i.e., md5 hash, mime type, file size, etc.). This will give us more possibilities for consistency checks and the like in the future.
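
A minimal sketch of the kind of file meta-data this refers to; the function and field names are illustrative, not the actual server-side implementation:

```python
import hashlib
import mimetypes

import requests


def external_file_metadata(url: str) -> dict:
    """Download an externally hosted file and compute basic meta-data."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    content = response.content
    mime_type, _ = mimetypes.guess_type(url)
    return {
        "url": url,
        "md5_checksum": hashlib.md5(content).hexdigest(),
        "mime_type": mime_type or response.headers.get("Content-Type"),
        "file_size": len(content),
    }

# A later consistency check could re-download the file and compare the stored
# md5_checksum against a freshly computed one to detect upstream changes.
```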

Most importantly, all files (internal and external) are now handled by the data controller. That means that every available dataset on the server now has the fields file_id and (a valid) md5_checksum. For the workbenches nothing changes, as the url field still shows where to download the ARFF file. However, this URL now always goes through the data controller (which issues a redirect in the case of external files). Internally this makes it very easy to handle everything the same way; it also opens up the possibility of, for example, converting external ARFF files into CSV files.
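
From a workbench's point of view, this means internal and external files can be fetched identically; a sketch of that client-side behaviour (the file id and URL pattern below are placeholders):

```python
import requests

# The url field points at the data controller; for external files the
# controller answers with a redirect, which requests follows transparently.
download_url = "https://www.openml.org/data/download/9999/dataset.arff"  # placeholder

response = requests.get(download_url, allow_redirects=True)
if response.history:  # non-empty when a redirect to an external host happened
    print("redirected to external host:", response.url)
arff_text = response.text
```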

@giuseppec @mfeurer @berndbischl Although nothing should have changed for the clients, it might still be good to rerun the unit tests. Everything is deployed on the test server. If there are no problems I will migrate this to the live server.
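
For the Python workbench, one way to point a test run at the test server first (assuming the client's config module exposes the server URL; the dataset id is only a connectivity smoke test, not a specific dataset from this issue):

```python
import openml

# Switch the client from the live server to the test server before running tests.
openml.config.server = "https://test.openml.org/api/v1/xml"

dataset = openml.datasets.get_dataset(1)  # placeholder id, just to verify the setup
print(dataset.name)
```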

@joaquinvanschoren
Contributor

joaquinvanschoren commented Jun 18, 2017 via email
