
More feature types #876

Open
KOLANICH opened this issue Nov 27, 2018 · 9 comments

Comments

@KOLANICH

KOLANICH commented Nov 27, 2018

The currently used type system does not have enough feature types.

A feature type is what tells us how to process a feature: more precisely, how we should preprocess it and what kinds of models are useful to fit against it.

So let's have more feature types:

  • {"type":"cyclic", "period": [60, 60, 24, 7]} (period is an array, each element defines a period in counts of previous periods) - anything having some useful to the domain cyclic structure. Enables circle transform.
  • "survival" - means that survival analysis methods should be applied to the column, if it is target. Otherwise treat as a simple numerical.
  • {"type":"time", "base":0} - means that it is absolute time. Inherits from cyclic.
  • "calendartime" - enables feature engineering using calendar features like holidays, festivals, other periodical events dates like Olimpics, Football cup, annual conferences dates, TV schedules, etc...
  • "location" - enables feature engineering tied to information from maps, like big circle distances, distances to city centre, city, state, distances to POIs of different types, etc
  • "NLP" - enables NLP feature engineering, like words embeddings and LSTM encoders
  • "mysteryString" - enables automatic feature extraction from strings which are not natural languages

The exact ways in which the features are processed are implementation-defined.
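
As a rough illustration of the circle transform mentioned above, here is a minimal Python/NumPy sketch. It assumes nothing about an existing OpenML API; the function names and the way the nested period array is folded into cumulative periods are purely illustrative.

```python
import numpy as np

# Hypothetical helper: map a cyclic feature onto the unit circle so that the
# two ends of the period (e.g. 23:59 and 00:00) end up next to each other.
def circle_transform(values, period):
    angle = 2 * np.pi * (np.asarray(values, dtype=float) % period) / period
    return np.sin(angle), np.cos(angle)

# For a nested period like [60, 60, 24, 7] (seconds -> minutes -> hours ->
# days -> week), produce one (sin, cos) pair per cumulative period.
def nested_circle_transform(values, periods):
    columns, total = [], 1
    for p in periods:
        total *= p
        columns.extend(circle_transform(values, total))
    return np.column_stack(columns)

# Example: seconds since the start of the week, against minute/hour/day/week cycles.
features = nested_circle_transform([0, 90, 3700, 86400], [60, 60, 24, 7])
print(features.shape)  # (4, 8): 4 rows, one sin/cos pair per cycle
```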

@janvanrijn
Member

Thanks for pointing this out! I completely agree.

Currently, internally, OpenML stores all datasets as ARFF, which is a bit restrictive (within ARFF, one can only define date, string, numeric, or categorical attributes). We are (slowly) discussing switching to another internal format (e.g., XARFF, but there are many options). We should probably add these feature types to that discussion.

@vnmabus

vnmabus commented Jun 26, 2019

Please, if you are considering using a different format than ARFF (which is awful), consider support for hierarchies/structures and custom data types. I would love it if you could store several functional data objects in the same dataset, for example, alongside the usual multivariate data. As an example of a moderately complex dataset, consider https://www.rdocumentation.org/packages/fda/versions/2.4.8/topics/CanadianWeather.

@joaquinvanschoren
Contributor

@vnmabus If you have suggestions for another format, please let us know.
We need a uniform and efficient format so that we can read/write different datasets uniformly. As in: export an R dataframe to a file, then read it back in as an R dataframe, while also supporting other languages. If every dataset requires custom code/parsers to read it, it will not scale.

In the short term, we are thinking of supporting Parquet/Arrow for all dataframe-type data, and allowing multi-file datasets as bundles similar to frictionlessdata (and supporting frictionlessdata). Would that help you?
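
For the dataframe round trip, a minimal sketch in Python (pandas with a Parquet engine such as pyarrow installed); the file name is arbitrary, and the R side would read the same file with its own Parquet/Arrow bindings:

```python
import pandas as pd

# A small dataframe with mixed column types.
df = pd.DataFrame({
    "outlook": pd.Categorical(["sunny", "overcast", "rainy"]),
    "temperature": [30.1, 22.4, 18.0],
    "recorded": pd.to_datetime(["2019-06-01", "2019-06-02", "2019-06-03"]),
})

# Write to Parquet and read it back; the column types (categorical, float,
# datetime) survive the round trip via the stored schema.
df.to_parquet("dataset.parquet")
restored = pd.read_parquet("dataset.parquet")
```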

@KOLANICH
Author

KOLANICH commented Jun 27, 2019

An R dataframe is a bad candidate. It is something like pickle, but for R, with the same security problems - even worse, since the parsing is done in C.

It also has license issues: R is GPLed.

@KOLANICH
Author

IMHO either hdf5, https://asdf-standard.readthedocs.io, or anything else binary may be helpful.

@joaquinvanschoren
Contributor

Indeed, that's why we're looking at Parquet, which is better than hdf5 both in compression and read/write time. For smaller datasets Arrow/Feather would be an option, which offers the fastest read/write times (but no compression).

I wasn't suggesting using R dataframes, but rather that we need a format for which a direct write from / read into R dataframes is widely supported.

@vnmabus

vnmabus commented Jun 27, 2019

But is Parquet FLEXIBLE enough? In my case, I work in Functional Data Analysis. Most of my data consists of functions, mostly from the real numbers to the real numbers. Some datasets have more than one function per observation, or even multivariate data as well, so that each function would be like a "column", or a type. In one "column", the values of the functions at some points (f(t_1), ..., f(t_n)) must be stored along with the points themselves (t_1, ..., t_n). Moreover, in some cases the points are the same for every observation, so only one copy of the points should be stored for the "column" (similarly to how factors in R store the names only once, or the categorical type in pandas).

Storing all of this properly requires a format that can AT LEAST handle structured or hierarchical information and, ideally, supports custom types that can be converted in each language to the most appropriate kind of object (maybe with some mechanism to register classes for custom objects).

I know it is preferable for you to deal with the simpler cases first, but I would not want you to choose an internal format that excludes more complex data by design, because that would make OpenML unusable for certain kinds of data.

@joaquinvanschoren
Contributor

They are quite flexible. Arrow/Feather definitely supports hierarchical data, and Parquet supports nested data. I'm not an expert, but I believe they both preserve the schema of your dataframe, so you'll be able to get it back the way you exported it.
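
To illustrate the kind of nesting Arrow allows, here is a sketch in Python with pyarrow; the field names and the idea of storing one (points, values) pair per observation are only an assumption about how a functional "column" might be laid out, not an existing OpenML convention:

```python
import pyarrow as pa

# Hypothetical layout: one functional "column" stored as a struct of two
# variable-length list fields, the evaluation points t_1..t_n and the
# corresponding values f(t_1)..f(t_n).
func_type = pa.struct([
    ("points", pa.list_(pa.float64())),
    ("values", pa.list_(pa.float64())),
])

curves = pa.array(
    [
        {"points": [0.0, 0.5, 1.0], "values": [1.2, 0.7, -0.3]},
        {"points": [0.0, 0.5, 1.0], "values": [0.9, 0.1, 0.4]},
    ],
    type=func_type,
)

# The nested column can sit next to ordinary scalar columns in one table.
table = pa.table({"station": ["A", "B"], "temperature_curve": curves})
```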

Could you maybe check whether you can export your data to any of these formats? Maybe this helps: https://www.youtube.com/watch?v=b1JSq6LTPAQ

From discussions on the Arrow mailing list, Feather will become a wrapper around Arrow, which should make import/export even easier in the near future. @berndbischl

In any case, I didn't say that we will use Parquet/Arrow exclusively; we do want to support as many types of data as possible. If there is a clear reason to extend, we can certainly discuss that.
