Support for externally hosted datasets #427
Comments
I definitely want to keep this functionality. In fact, I would say it is highly preferable if we can link to thousands of datasets that are stored elsewhere, but we do need better support for it (i.e., being able to read and handle more data types).

The reason that most datasets are stored on our server is that we need a stable source of ARFF versions of all these datasets, so we store the ARFF versions ourselves. Once we offer CSV support there will be much more incentive to simply link to data stored elsewhere. We still need to be able to download it and extract meta-data from it.

Besides, as you point out, we want people to keep control of their data. If I want to store my 4GB dataset on AWS or Zenodo, that should be possible. Data storage is one of the things that other projects are very good at; we don't need to replicate that. What we do need is a place where this data is linked to meta-data and ML experiments.
Just completed a huge update to facilitate this better. I changed the way 'external' files are handled on the server. Long story short, for each external file we now also keep the file meta-data on the server (i.e., md5 hash, mime type, file size, etc.). This will give us more possibilities for consistency checks and so on in the future.

Most importantly, all files (internal and external) are now handled by the data controller. That means that every available dataset on the server now has the fields file_id and (a valid) md5_checksum. For the workbenches nothing changes, as the url field still shows how to download the ARFF file. However, this url now always goes through the data controller (which issues a redirect in the case of external files). This makes it very easy internally to handle everything the same way; for example, it opens up the possibility of converting external ARFF files into CSV files.

@giuseppec @mfeurer @berndbischl Although nothing changed from the client perspective, it might still be good to rerun the unit tests. Everything is deployed on the test server. If there are no problems I will migrate this to live.
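To make the described behaviour concrete, here is a minimal sketch (not an official client; it assumes the JSON flavour of the v1 REST API at `/api/v1/json/data/{id}` exposes the new `file_id`, `md5_checksum` and `url` fields) of how a client could fetch a dataset description and download the file through the data controller, letting the HTTP client follow the redirect that is issued for externally hosted files:

```python
import requests

OPENML_JSON_API = "https://www.openml.org/api/v1/json"  # assumed base URL of the JSON API


def download_dataset(dataset_id: int) -> bytes:
    """Fetch a dataset description and download its ARFF file.

    Internal and external datasets are handled identically: the `url`
    field points at the data controller, which either serves the file
    directly or answers with a redirect to the external location.
    """
    description = requests.get(f"{OPENML_JSON_API}/data/{dataset_id}").json()["data_set_description"]
    print("file_id:", description["file_id"], "md5:", description["md5_checksum"])

    # allow_redirects=True (the default) transparently follows the redirect
    # that the data controller issues for externally hosted files.
    response = requests.get(description["url"], allow_redirects=True)
    response.raise_for_status()
    return response.content


if __name__ == "__main__":
    arff_bytes = download_dataset(61)  # the dataset id is only an example
```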
Awesome work! This means that people can also check whether the external data is still the same as when it was registered on OpenML, right?
Thank you,
Joaquin
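With the checksum now stored server-side, such a consistency check is straightforward. A minimal sketch, reusing the same assumed JSON endpoint and field names as in the sketch above:

```python
import hashlib

import requests


def is_unchanged(dataset_id: int) -> bool:
    """Check whether the (possibly external) data file still matches the
    md5_checksum that was recorded when the dataset was registered."""
    description = requests.get(
        f"https://www.openml.org/api/v1/json/data/{dataset_id}"
    ).json()["data_set_description"]
    payload = requests.get(description["url"], allow_redirects=True).content
    return hashlib.md5(payload).hexdigest() == description["md5_checksum"]
```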
Currently, OpenML supports datasets that are hosted elsewhere: by uploading a URL instead of a file, OpenML links to an external dataset. However, the way this functionality is handled is outdated and requires some updates and support.

The main question: do we want to keep this functionality? At the moment it is barely used (only 7 datasets, all of which are in preparation). Keeping it will have quite some implications for code maintainability and work on our side; removing it might make OpenML less interesting for people who want to use OpenML but also want to keep control of their data.
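For reference, registering an externally hosted dataset boils down to uploading a description that carries a URL instead of a data file. The sketch below is only illustrative: the endpoint, the `api_key` parameter, and the exact XML tags (`oml:url` in particular) are assumptions and should be checked against the actual API documentation.

```python
import requests

# Hypothetical description for an externally hosted dataset: the data file
# itself stays on the external server, OpenML only stores the link.
description_xml = """<oml:data_set_description xmlns:oml="http://openml.org/openml">
  <oml:name>my-external-dataset</oml:name>
  <oml:description>4GB dataset hosted on an external server.</oml:description>
  <oml:format>ARFF</oml:format>
  <oml:url>https://example.org/data/my-dataset.arff</oml:url>
</oml:data_set_description>"""

response = requests.post(
    "https://test.openml.org/api/v1/data",      # assumed upload endpoint (test server)
    params={"api_key": "YOUR_API_KEY"},         # placeholder API key
    files={"description": ("description.xml", description_xml)},
)
print(response.status_code, response.text)
```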