More feature types #876
Thanks for pointing this out! I completely agree. Currently, internally, OpenML stores all datasets as ARFF, which is a bit restrictive (within ARFF, one can only define date, string, numeric, or categorical). We are (slowly) discussing switching to another internal format (e.g., XARFF, but there are many options). We should probably add these to that discussion.
Please, if you are considering using a different format than ARFF (which is awful), consider support for hierarchies/structures and custom data types. I would love to be able to store several functional data objects in the same dataset, for example, alongside the usual multivariate data. As an example of a moderately complex dataset, consider https://www.rdocumentation.org/packages/fda/versions/2.4.8/topics/CanadianWeather.
@vnmabus If you have suggestions for another format, please let us know. In the short term, we are thinking of supporting Parquet/Arrow for all dataframe-type data, and allowing multi-file datasets as bundles similar to frictionlessdata (and supporting frictionlessdata). Would that help you?
An R dataframe is a bad candidate. It is something like pickle, but for R, with the same security problems, and even worse, since the parsing is done in C. It also has license issues: R is GPLed.
IMHO either HDF5, https://asdf-standard.readthedocs.io, or any other binary format may be helpful.
Indeed, that's why we're looking at Parquet, which is better than HDF5 both in compression and read/write time. For smaller datasets, Arrow/Feather would be an option, which offers the fastest read/write times (but no compression). I wasn't suggesting using R dataframes; rather, we need a format for which direct writes from / reads into R dataframes are widely supported.
But is Parquet FLEXIBLE enough? In my case, I work in Functional Data Analysis. Most of my data consists of functions, mostly from the real numbers to the real numbers. Some datasets have more than one function per observation, or even multivariate data as well, so that each function would be like a "column", or a type. In one "column", the values of the function at some points (f(t_1), ..., f(t_n)) must be stored along with the points themselves (t_1, ..., t_n). Moreover, in some cases the points are the same for every observation, so only one copy of the points should be stored for the "column" (similarly to how factors in R store the level names only once, or the categorical type in pandas). Storing all of this properly requires a format that AT LEAST can handle structured or hierarchical information and, ideally, supports custom types that can be converted in each language to the most appropriate kind of object (maybe with some mechanism to register classes for custom objects). I know that it is preferable for you to deal with the simpler cases first, but I would not want you to choose an internal format that excludes more complex data by design, because that would make OpenML unusable for certain kinds of data.
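To make the requirement concrete, here is a minimal sketch of how such a functional "column" could look in Arrow, assuming pyarrow; the column names and the idea of storing the shared grid once as field-level metadata are purely illustrative, not an OpenML convention:

```python
import json
import pyarrow as pa

# Shared evaluation grid t_1, ..., t_n, stored once per functional "column"
# (analogous to how pandas categoricals store the category labels only once).
grid = [0.0, 0.5, 1.0]

schema = pa.schema([
    pa.field("station", pa.string()),
    pa.field("temperature", pa.list_(pa.float64()),
             metadata={"grid": json.dumps(grid)}),      # functional column 1
    pa.field("precipitation", pa.list_(pa.float64()),
             metadata={"grid": json.dumps(grid)}),      # functional column 2
])

# Each cell holds the function values f(t_1), ..., f(t_n) for one observation.
table = pa.table({
    "station": ["St. Johns", "Vancouver"],
    "temperature": [[-5.1, 10.2, 15.8], [3.4, 12.1, 17.0]],
    "precipitation": [[4.0, 2.5, 3.1], [6.2, 1.8, 2.0]],
}, schema=schema)
```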
They are quite flexible. Arrow/Feather definitely does support hierarchical data, and Parquet supports nested data. I'm not an expert, but I believe they both preserve the schema of your dataframe, so you'll be able to get it back the way you exported it. Could you maybe check whether you can export your data to any of these formats? Maybe this helps: https://www.youtube.com/watch?v=b1JSq6LTPAQ. From discussions on the Arrow mailing list, Feather will become a wrapper around Arrow, which should make import/export even easier in the near future. @berndbischl In any case, I didn't say that we will use Parquet/Arrow exclusively; we do want to support as many types of data as possible. If there is a clear reason to extend, we can certainly discuss that.
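As a minimal round-trip sketch of the schema-preservation claim (file names and column contents are illustrative; this assumes pandas and pyarrow are installed):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

# A nested column: each cell is a list of floats (e.g. one curve per observation),
# plus a categorical column, to exercise non-trivial types.
df = pd.DataFrame({
    "id": [1, 2],
    "curve": [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]],
    "label": pd.Categorical(["a", "b"]),
})

table = pa.Table.from_pandas(df)

pq.write_table(table, "example.parquet")         # columnar, compressed on disk
feather.write_feather(table, "example.feather")  # Arrow IPC / Feather

# Both readers return the same Arrow schema, including the nested list column
# and the dictionary-encoded (categorical) column.
print(pq.read_table("example.parquet").schema)
print(feather.read_table("example.feather").schema)
```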
The currently used type system does not have enough feature types.
A feature type is what tells us how to process a feature: more precisely, how we should preprocess it and what kinds of models are useful to fit against it.
So let's have more feature types:

- `{"type":"cyclic", "period": [60, 60, 24, 7]}` (`period` is an array; each element defines a period in counts of the previous one) - anything that has a cyclic structure useful to the domain. Enables the circle transform (a minimal sketch is given after this list).
- `"survival"` - means that survival analysis methods should be applied to the column if it is the target; otherwise treat it as a simple numerical feature.
- `{"type":"time", "base":0}` - means that it is absolute time. Inherits from `cyclic`.
- `"calendartime"` - enables feature engineering using calendar features like holidays, festivals, and the dates of other periodic events such as the Olympics, football cups, annual conferences, TV schedules, etc.
- `"location"` - enables feature engineering tied to information from maps, like great-circle distances, distances to the city centre, city, state, distances to POIs of different types, etc.
- `"NLP"` - enables NLP feature engineering, like word embeddings and LSTM encoders.
- `"mysteryString"` - enables automatic feature extraction from strings which are not natural language.

How exactly the features are processed is implementation-defined.