Migration ARFF to Parquet on the OpenML server #50

Open
PGijsbers opened this issue Sep 26, 2022 · 8 comments

Comments

@PGijsbers

PGijsbers commented Sep 26, 2022

This is a centralised discussion about the server-side changes (being) made to the datasets in their conversion from ARFF to Parquet. Related ongoing discussions that reference the server state of different datasets:

Let's keep the relevant information about the migration as it relates to server data in this thread.
This is not for connector-specific discussions (for example, how openml-python handles this).
@joaquinvanschoren @prabhant @sebffischer

@sebffischer

A couple of remarks:

@sebffischer

Also it would be great if the types string and categorical were distinguishable from looking at the data features (if this information is somehow updated using the new parquet files)

@sebffischer

sebffischer commented Sep 27, 2022

Also it seems that the parquet urls from the test server are wrong.
By wrong I mean that they point to the Parquet URLs of the public server.

Edit: more info

@PGijsbers
Author

Also it would be great if the types string and categorical were distinguishable from looking at the data features (if this information is somehow updated using the new parquet files)

Do you have an example? We want to see if this issue was with the ARFF file or specifically introduced in the conversion.

@PGijsbers
Author

Also it seems that the parquet urls from the test server are wrong.

Parquet URLs from the test server have been disabled for now, until we have a separate MinIO instance (or bucket) for the test server.

@PGijsbers
Author

openml/OpenML#1165 this is kind of a weird bug

We'll look into that, and for the conversion scripts we'll have a closer look to preserve the feature data, or encode it into correct data types where ARFF was previously not expressive enough (e.g., boolean, 8-bit integers).

@sebffischer

This was a confusion from my side, sorry!

@PGijsbers
Author

The following will be changed for the conversion script:

  • feature names will have leading and trailing whitespace removed.
  • numeric features with only integer values in the range [0, 255] will be stored as uint8.
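
To make the two rules above concrete, here is a minimal sketch in plain Python. It is not the actual conversion script: the function names `clean_feature_names` and `fits_uint8` are hypothetical, and the handling of missing values and empty columns is an assumption, since the thread does not specify it.

```python
import math
from typing import List, Optional, Sequence


def clean_feature_names(names: Sequence[str]) -> List[str]:
    """Remove leading and trailing whitespace from every feature name."""
    return [name.strip() for name in names]


def fits_uint8(values: Sequence[Optional[float]]) -> bool:
    """Return True if every value is an integer in the range [0, 255].

    Assumptions (not stated in the thread): columns that are empty or that
    contain missing values (None or NaN) are left unconverted, because a
    plain uint8 column cannot represent a missing value.
    """
    if len(values) == 0:
        return False
    for value in values:
        # Reject missing and non-finite values before calling int().
        if value is None or not math.isfinite(value):
            return False
        # Reject fractional values and values outside the uint8 range.
        if value != int(value) or not 0 <= value <= 255:
            return False
    return True
```

A column for which `fits_uint8` returns True could then be written with an unsigned 8-bit integer type; every other numeric column would keep its existing type.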

Additionally, feature metadata needs to be updated:

  • reflect which features are boolean (instead of categorical)
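
As a rough illustration of the metadata update, the sketch below relabels a categorical feature as boolean. The thread does not define when a feature counts as boolean, so the rule used here (the category set is exactly `true` and `false`, ignoring case and surrounding whitespace) is purely an assumption, as are the function name `infer_feature_type` and the type labels.

```python
from typing import Iterable

# Assumed definition of a boolean feature; the thread does not specify one.
BOOLEAN_CATEGORIES = {"true", "false"}


def infer_feature_type(declared_type: str, categories: Iterable[str]) -> str:
    """Return "boolean" for categorical features whose categories are true/false.

    Features of any other declared type are returned unchanged.
    """
    if declared_type != "categorical":
        return declared_type
    normalized = {category.strip().lower() for category in categories}
    return "boolean" if normalized == BOOLEAN_CATEGORIES else "categorical"
```

Under this assumed rule, a feature with only one of the two categories present stays categorical, which keeps the relabelling conservative.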
