
More feature types #876

Open

KOLANICH opened this issue Nov 27, 2018 · 9 comments

Comments

@KOLANICH

KOLANICH commented Nov 27, 2018

The type system currently in use does not have enough feature types.

A feature type tells us how to process a feature: more precisely, how we should preprocess it and what kinds of models are useful to fit against it.

So let's have more feature types:

  • {"type": "cyclic", "period": [60, 60, 24, 7]} (period is an array; each element defines a period measured in counts of the previous one, e.g. 60 seconds, 60 minutes, 24 hours, 7 days) - anything with a cyclic structure that is useful to the domain. Enables the circle transform (see the sketch after this list).
  • "survival" - means that survival analysis methods should be applied to the column if it is the target; otherwise treat it as a plain numerical feature.
  • {"type": "time", "base": 0} - means absolute time. Inherits from cyclic.
  • "calendartime" - enables feature engineering using calendar features such as holidays, festivals, and other periodic events (Olympics, football cups, annual conference dates, TV schedules, etc.).
  • "location" - enables feature engineering tied to information from maps, such as great-circle distances, distances to the city centre, city, or state, and distances to POIs of different types, etc.
  • "NLP" - enables NLP feature engineering, such as word embeddings and LSTM encoders.
  • "mysteryString" - enables automatic feature extraction from strings that are not natural language.

Exactly how these features are processed is implementation-defined.
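
A minimal sketch of the circle transform mentioned above (the function name and the single-period signature are illustrative assumptions, not an agreed-upon API):

```python
import numpy as np

def circle_transform(values, period):
    """Map a cyclic feature onto the unit circle as (sin, cos) pairs,
    so that the end of one period is close to the start of the next."""
    angle = 2 * np.pi * (np.asarray(values, dtype=float) % period) / period
    return np.sin(angle), np.cos(angle)

# e.g. an hour-of-week feature: 24 * 7 = 168 hourly slots
hours = np.arange(168)
sin_h, cos_h = circle_transform(hours, period=168)
```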

@janvanrijn
Member

Thanks for pointing this out! I completely agree.

Currently, OpenML internally stores all datasets as ARFF, which is a bit restrictive (within ARFF, one can only define date, string, numeric, or categorical attributes). We are (slowly) discussing a switch to another internal format (e.g., XARFF, but there are many options). We should probably add these types to that discussion.

@vnmabus

vnmabus commented Jun 26, 2019

Please, if you are considering a format other than ARFF (which is awful), consider supporting hierarchies/structures and custom data types. I would love to be able to store several functional data objects in the same dataset, for example, alongside the usual multivariate data. As an example of a moderately complex dataset, consider https://www.rdocumentation.org/packages/fda/versions/2.4.8/topics/CanadianWeather.

@joaquinvanschoren
Sponsor Contributor

@vnmabus If you have suggestions for another format, please let us know.
We need a uniform and efficient format so that we can read and write different datasets uniformly. As in: export an R dataframe to a file, then read it back in as an R dataframe, while also supporting other languages. If every dataset requires custom code/parsers to read it, it will not scale.

In the short term, we are thinking of supporting Parquet/Arrow for all dataframe-type data, and allowing multi-file datasets as bundles similar to frictionlessdata (and supporting frictionlessdata). Would that help you?
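
As a rough sketch of the kind of uniform round trip meant above (assuming pandas with a pyarrow-backed Parquet engine; the file name is just an example):

```python
import pandas as pd

# a small mixed-type dataframe standing in for an OpenML dataset
df = pd.DataFrame({
    "temperature": [21.5, 19.0, 23.1],
    "station": pd.Categorical(["A", "B", "A"]),
})

# export to Parquet and read it back through the Arrow schema
df.to_parquet("weather.parquet")
df2 = pd.read_parquet("weather.parquet")
print(df2.dtypes)  # the column types travel with the file, no custom parser needed
```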

@KOLANICH
Author

KOLANICH commented Jun 27, 2019

An R dataframe is a bad candidate. It is something like pickle, but for R, with the same security problems, or even worse ones, since the parsing is done in C.

It also has licensing issues: R is GPLed.

@KOLANICH
Author

IMHO either HDF5 or https://asdf-standard.readthedocs.io or some other binary format may be helpful.

@joaquinvanschoren
Sponsor Contributor

Indeed, that's why we're looking at Parquet, which is better than HDF5 in both compression and read/write time. For smaller datasets, Arrow/Feather would be an option; it offers the fastest read/write times (but no compression).

I wasn't suggesting that we use R dataframes, rather that we need a format for which a direct write from / read to R dataframes is widely supported.

@vnmabus

vnmabus commented Jun 27, 2019

But is Parquet FLEXIBLE enough? In my case, I work in Functional Data Analysis. Most of my data consists of functions, mostly from the real numbers to the real numbers. Some datasets have more than one function per observation, or even additional multivariate data, so each function would be like a "column", or a type. In one "column", the values of the functions at some points (f(t_1), ..., f(t_n)) must be stored along with the points themselves (t_1, ..., t_n). Moreover, in some cases the points are the same for every observation, so only one copy of the points should be stored for the "column" (similarly to how factors in R store the level names only once, or the categorical type in pandas).

Storing all of this properly requires a format that AT LEAST can handle structured or hierarchical information and that, ideally, supports custom types which can be converted in each language to the most appropriate kind of object (maybe with some mechanism to register classes for custom objects).
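
To make that concrete, here is one hypothetical layout using pyarrow nested types (the column name and the metadata key are made up for illustration, not something OpenML has adopted):

```python
import pyarrow as pa

# the shared evaluation grid (t_1, ..., t_n), stored only once
grid = [0.0, 0.5, 1.0, 1.5]

# one list-of-floats cell per observation: the values f(t_1), ..., f(t_n)
curves = pa.array(
    [[1.0, 1.2, 0.9, 0.7],
     [0.3, 0.4, 0.6, 0.5]],
    type=pa.list_(pa.float64()),
)

table = pa.table({"temperature_curve": curves})
# keep the shared grid in the schema metadata instead of repeating it per row
table = table.replace_schema_metadata({"temperature_curve.grid": str(grid)})
```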

I know that it is preferable for you to deal with the simpler cases first, but I would not want you to choose an internal format that excludes more complex data by design, because that would make OpenML unusable for certain kinds of data.

@joaquinvanschoren
Sponsor Contributor

They are quite flexible. Arrow/Feather definitely supports hierarchical data, and Parquet supports nested data. I'm not an expert, but I believe they both preserve the schema of your dataframe, so you'll be able to get it back the way you exported it.
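
A quick way to check this for a nested column (a sketch assuming pyarrow; not a statement about what OpenML will ship):

```python
import pyarrow as pa
import pyarrow.parquet as pq

nested = pa.table({
    "curve": pa.array([[1.0, 2.0], [3.0, 4.0, 5.0]], type=pa.list_(pa.float64())),
    "label": pa.array(["rain", "sun"]),
})

pq.write_table(nested, "nested.parquet")
back = pq.read_table("nested.parquet")
print(back.schema)  # the nested list type is still there after the round trip
```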

Could you maybe check whether you can export your data to any of these formats? Maybe this helps: https://www.youtube.com/watch?v=b1JSq6LTPAQ

From discussions on the Arrow mailing list, Feather will become a wrapper around Arrow, which should make import/export even easier in the near future. @berndbischl

In any case, I didn't say that we will use Parquet/Arrow exclusively; we do want to support as many types of data as possible. If there is a clear reason to add another format, we can certainly discuss it.
