Export OpenML data to data package, Import data package to OpenML #482

Open
HeidiSeibold opened this Issue Oct 11, 2017 · 8 comments

3 participants
Member

HeidiSeibold commented Oct 11, 2017

It would be nice if we could export OpenML data to a data package and import data packages into OpenML. That would make it much easier for people to upload data to OpenML and to work with data from it.

Data packages are defined by:

  • a CSV file with the data
  • a JSON file with the metadata
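
To make the two-file layout concrete, here is a minimal sketch that writes such a pair using only the Python standard library. The function name `write_data_package` and the file name `data.csv` are assumptions chosen for illustration; the descriptor keys (`name`, `resources`, `path`, `schema`, `fields`) are the ones I understand the Data Package and Table Schema specs to use, so please check them against the spec linked below before relying on this.

```python
import csv
import json
from pathlib import Path


def write_data_package(directory, name, header, rows, types):
    """Write data.csv and a datapackage.json describing it; return the descriptor."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)

    # The data itself: a plain CSV file with a header row.
    with open(directory / "data.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

    # The metadata: a JSON descriptor that points at the CSV file and
    # declares one Table Schema field (name + type) per column.
    descriptor = {
        "name": name,
        "resources": [
            {
                "name": "data",
                "path": "data.csv",
                "schema": {
                    "fields": [
                        {"name": column, "type": kind}
                        for column, kind in zip(header, types)
                    ]
                },
            }
        ],
    }
    with open(directory / "datapackage.json", "w") as handle:
        json.dump(descriptor, handle, indent=2)
    return descriptor
```

The point of the sketch is only that everything OpenML would need beyond the raw values (column names, types, licence, description) has to come out of that one JSON file.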

Why?

To improve user friendliness. See also:
https://docs.google.com/document/d/1c_RhDiXTK5bEsY5gGRuQwaF6fKilt4jKq2c_BRqyEDc/edit?usp=sharing

How is meta data specified in data packages?

https://specs.frictionlessdata.io/data-package/#metadata

This is related to #457

Member

HeidiSeibold commented Dec 4, 2017

To be able to import data packages into OpenML I think we need to first do the following steps:

  • Find an interesting ML data set that is stored as a data package (e.g. on http://datahub.io, maybe https://datahub.io/core/population?)
  • Read the data with a tool that can also upload data to OpenML or convert data to ARFF (e.g. R with farff or RWeka; note that R currently has problems with some data sets on datahub.io, see datahq/datahub-qa#31; Python also has ARFF support)
  • Upload the data to OpenML or export the data to ARFF
  • Prepare a script that does all of the above and share it in this issue

Any help on this issue would be very much appreciated 👏 🍰
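
As a starting point for such a script, here is a rough sketch of the "read the data, export to ARFF" steps in Python, using only the standard library. It ignores the JSON metadata and simply guesses each column as NUMERIC or STRING from the CSV values, which is a naive assumption and not how a final converter should decide types. The helper names (`csv_to_arff`, `_quote`, `_cell`) are made up for this sketch, and the upload step is left out on purpose.

```python
import csv
import io


def _is_number(value):
    """Return True if the text parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def _quote(token):
    """Quote a token if it contains characters that are special in ARFF."""
    if any(c in token for c in " ,'\"%{}\t"):
        return "'" + token.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return token


def _cell(value):
    """Format one data cell; ARFF marks missing values with '?'."""
    return "?" if value == "" else _quote(value)


def csv_to_arff(csv_text, relation):
    """Convert CSV text (with a header row) into ARFF text."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, body = rows[0], rows[1:]

    lines = ["@RELATION " + _quote(relation), ""]
    for index, name in enumerate(header):
        # Naive type guess: NUMERIC only if every non-missing value parses.
        present = [row[index] for row in body if row[index] != ""]
        numeric = bool(present) and all(_is_number(v) for v in present)
        lines.append(
            "@ATTRIBUTE {} {}".format(_quote(name), "NUMERIC" if numeric else "STRING")
        )

    lines += ["", "@DATA"]
    lines += [",".join(_cell(value) for value in row) for row in body]
    return "\n".join(lines) + "\n"
```

Called as `csv_to_arff(open("data.csv").read(), "population")`, this returns the ARFF text as a string, which could then be written to a file and checked by reading it back with farff or RWeka.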

pwalsh commented Dec 4, 2017

Copying my response from gitter.im at the request of @HeidiSeibold:

I note there are a few libs in Python for ARFF

And we have a documented way to convert to/from data backends here:

https://github.com/frictionlessdata/tableschema-py#storage

And some example implementations of the storage API at:

So writing an ARFF backend would be great!

@rufuspollock rufuspollock referenced this issue Dec 4, 2017

Closed

Sample Machine Learning datasets on DataHub #33

4 of 4 tasks complete

Member

HeidiSeibold commented Dec 4, 2017

The issue datahq/datahub-qa#33 is in principle the same, just the other way around. Both are equally helpful and important 😃

Contributor

joaquinvanschoren commented Dec 4, 2017

Interesting dataset: Maybe this one is a good place to start:
https://datahub.io/anuveyatsu/farm-survey-simple

Nice thing is that they have all the attribute file types, offered as a JSON file. Should be easy to convert to ARFF. What is still missing is the task, i.e. what you want to predict. There is also no description of what the dataset is about.
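
To illustrate why the conversion should be easy, here is a small sketch that turns the field list of a data package descriptor into ARFF attribute declarations. The type mapping in `ARFF_TYPES` is my own assumption about a reasonable correspondence, not something defined by either format, and the function names are hypothetical. It also assumes that nominal attributes are expressed as an `enum` constraint on the field, which may not hold for the datasets on DataHub.

```python
# Assumed mapping from Table Schema field types to ARFF attribute types.
# Anything not listed here falls back to STRING.
ARFF_TYPES = {
    "integer": "NUMERIC",
    "number": "NUMERIC",
    "year": "NUMERIC",
    "string": "STRING",
    "date": 'DATE "yyyy-MM-dd"',
}


def _quote(token):
    """Quote a token if it contains characters that are special in ARFF."""
    if any(c in token for c in " ,'\"%{}\t"):
        return "'" + token.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return token


def arff_attribute(field):
    """Build one @ATTRIBUTE line from a Table Schema field descriptor."""
    enum = field.get("constraints", {}).get("enum")
    if enum:
        # An enumerated field becomes a nominal attribute.
        arff_type = "{" + ",".join(_quote(str(value)) for value in enum) + "}"
    else:
        # A missing type is treated as "string".
        arff_type = ARFF_TYPES.get(field.get("type", "string"), "STRING")
    return "@ATTRIBUTE {} {}".format(_quote(field["name"]), arff_type)


def arff_header(descriptor, resource_index=0):
    """Return the ARFF header lines for one resource of a data package."""
    fields = descriptor["resources"][resource_index]["schema"]["fields"]
    lines = ["@RELATION " + _quote(descriptor.get("name", "unnamed"))]
    lines += [arff_attribute(field) for field in fields]
    return lines
```

Note that nothing in this header says which attribute is the prediction target, so that would still have to be supplied separately.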

I also couldn't figure out how to navigate DataHub. There are apparently 200+ datasets but I can only see a few of them on the website.

Member

HeidiSeibold commented Dec 5, 2017

> I also couldn't figure out how to navigate DataHub. There are apparently 200+ datasets but I can only see a few of them on the website.

That's a bug: datahq/datahub-qa#32

Contributor

joaquinvanschoren commented Dec 5, 2017

Feedback from the frictionlessdata gitter:

Member

HeidiSeibold commented Jan 15, 2018

There are now some machine learning data sets available as data packages: http://datahub.io/machine-learning

Example: http://datahub.io/machine-learning/seismic-bumps
Also available on OpenML: https://www.openml.org/d/1500

I guess a first step now would be to check:

  • which fields in the data package correspond to which fields in the ARFF file / XML file.
  • if there is any information in the ARFF file / XML file that we cannot get from the data package.

See also discussion datahq/datahub-qa#33 (comment)
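
One way to run that check is to write the presumed correspondence down as a table and see what stays empty for a given descriptor. The sketch below does that in Python. The mapping itself is a guess: the OpenML field names on the left are the ones I believe the dataset description XML uses, the descriptor keys on the right are from the Data Package spec as I understand it, and both should be verified against the real documents. The names `OPENML_FROM_DATAPACKAGE` and `split_fields` are made up for this sketch.

```python
def _first(items, key):
    """Return items[0][key] if that exists, else None."""
    if items and isinstance(items[0], dict):
        return items[0].get(key)
    return None


# Hypothetical correspondence: OpenML description field -> function that
# extracts a value from a data package descriptor (None if unavailable).
OPENML_FROM_DATAPACKAGE = {
    "name": lambda d: d.get("name"),
    "description": lambda d: d.get("description"),
    "licence": lambda d: _first(d.get("licenses"), "name"),
    "original_data_url": lambda d: _first(d.get("sources"), "path"),
    "tag": lambda d: d.get("keywords"),
    "version_label": lambda d: d.get("version"),
    # Assumption: the descriptor has no standard place for the target.
    "default_target_attribute": lambda d: None,
}


def split_fields(descriptor):
    """Return (found, missing): values we can derive and fields we cannot."""
    found, missing = {}, []
    for field, extract in OPENML_FROM_DATAPACKAGE.items():
        value = extract(descriptor)
        if value:
            found[field] = value
        else:
            missing.append(field)
    return found, missing
```

Running `split_fields` on the `datapackage.json` of seismic-bumps and comparing `found` with the XML of https://www.openml.org/d/1500 would show which fields line up; under the assumptions above, `default_target_attribute` always ends up in `missing`.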

Member

HeidiSeibold commented Apr 4, 2018

We decided to wait until frictionlessdata/datapackage-r#13 is properly resolved.
