Dataset API proposal

Eleonore edited this page May 11, 2016 · 13 revisions

Motivation

As part of the Incanter and core.matrix integration, the idea is to evolve the existing core.matrix dataset type and use it in Incanter.

To do that, Incanter's dataset functions should be implemented in core.matrix.

Description

A dataset is a matrix, i.e. it implements all matrix protocols.

The columns of a dataset support heterogeneous datatypes and are uniquely identified by a name of arbitrary type. An attempt to create a column with a duplicate name should raise an exception.

By default, column names are incrementing Long values starting from 0, i.e. 0, 1, 2, etc.

A dataset is not seqable. To get a seq of rows, use clojure.core.matrix/rows.
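
As a brief illustration of the point about rows, the sketch below builds a small dataset and walks its rows through clojure.core.matrix/rows. The aliases m and ds and the sample data are assumptions made for the example, and the dataset constructor is used as described in the API section of this page.

```clojure
(require '[clojure.core.matrix :as m]
         '[clojure.core.matrix.dataset :as ds])

;; Hypothetical sample data: two columns, two rows.
(def scores (ds/dataset {:id [1 2] :score [10 20]}))

;; The dataset itself is not iterated directly; core.matrix is asked for its rows.
(doseq [row (m/rows scores)]
  (println row))
```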

API

core.matrix.dataset version 0.51.0

dataset

(dataset column-names cols)

(dataset m)

Creates a dataset from one of:
    column-names and a seq of columns
    a map from column names to lists of column values
    a matrix, whose columns become the dataset columns, named with incrementing Long values starting from 0 (0, 1, 2, etc.)
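
The map and matrix forms might be used as in the following sketch. The alias ds and the sample data are assumptions for illustration, and the behaviour shown in the comments is the one this page describes rather than a verified result.

```clojure
(require '[clojure.core.matrix.dataset :as ds])

;; Map form: each key is a column name, each value holds that column's data.
(def from-map (ds/dataset {:id [1 2 3] :score [10 20 30]}))
(ds/column-names from-map)

;; Matrix form: a 2x3 nested vector has three columns, so per the description
;; above the default column names should be 0, 1 and 2.
(def from-matrix (ds/dataset [[1 2 3] [4 5 6]]))
(ds/column-names from-matrix)
```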

column-names

(column-names ds)

Returns a persistent vector containing the column names in the order in which they appear in the dataset.

column-name

(column-name ds idx)

Returns the column name at the given index.

select-columns

(select-columns ds cols)

Produces a new dataset with the columns in the specified order.
cols is a collection of column names to be selected.

except-columns [Deprecated]

(except-columns ds cols)

Returns a new dataset with all columns except the specified ones.
cols is a collection of column names to be excluded.

remove-columns

(remove-columns ds col-names)

Returns a new dataset with the specified columns removed.
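
A short sketch of select-columns and remove-columns as described above. The alias ds and the sample columns are assumptions for illustration.

```clojure
(require '[clojure.core.matrix.dataset :as ds])

;; Hypothetical dataset with three columns.
(def table (ds/dataset {:a [1 2] :b [3 4] :c [5 6]}))

;; Keep only :c and :a; per the description, the result follows the order given here.
(def picked (ds/select-columns table [:c :a]))
(ds/column-names picked)

;; Drop :b and keep everything else.
(def trimmed (ds/remove-columns table [:b]))
(ds/column-names trimmed)
```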

merge-columns [Deprecated]

(merge-columns & args)

Returns a dataset created by combining the columns of the given datasets.
If several columns share a name, a last-one-wins strategy is applied.

merge-datasets

(merge-datasets ds1 ds2)

(merge-datasets ds1 ds2 & args)

Returns a dataset created by combining the columns of the given datasets.
If several columns share a name, a last-one-wins strategy is applied.
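
The last-one-wins rule can be illustrated with two datasets that share a column name. This is a sketch under the semantics described on this page; the alias ds and the sample data are assumptions.

```clojure
(require '[clojure.core.matrix.dataset :as ds])

;; Both datasets define :score; only `right` defines :rank.
(def left  (ds/dataset {:id [1 2] :score [10 20]}))
(def right (ds/dataset {:score [90 80] :rank [1 2]}))

(def merged (ds/merge-datasets left right))

;; :score appears in both inputs, so the values from `right`
;; (the last dataset) are expected to be the ones that survive.
(get (ds/to-map merged) :score)
```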

add-column

(add-column ds col-name col)

Adds a column to the dataset.
If a column with the same name already exists in the dataset, an exception is raised.

rename-columns

(rename-columns ds col-map)

Renames columns according to a map from old column names to new column names.
If a column with the new name already exists in the dataset, an exception is raised.
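
The following sketch exercises add-column and rename-columns, including the duplicate-name case. The alias ds and the sample data are assumptions, and the exact exception type is not specified on this page, so the example catches Throwable.

```clojure
(require '[clojure.core.matrix.dataset :as ds])

(def base   (ds/dataset {:a [1 2]}))
(def with-b (ds/add-column base :b [3 4]))

;; Rename :a to :x, leaving :b untouched.
(def renamed (ds/rename-columns with-b {:a :x}))
(ds/column-names renamed)

;; Adding a second :b should be rejected according to the description above.
(def duplicate-rejected?
  (try
    (ds/add-column with-b :b [5 6])
    false
    (catch Throwable _ true)))
```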

replace-column

(replace-column ds col-name vs)

Replaces the values of the specified column in the dataset with new values.
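
A minimal sketch of replace-column, assuming the alias ds and hypothetical sample data:

```clojure
(require '[clojure.core.matrix.dataset :as ds])

(def original (ds/dataset {:a [1 2] :b [3 4]}))

;; Swap in new values for :a; :b is expected to stay as it was.
(def updated (ds/replace-column original :a [10 20]))
(ds/to-map updated)
```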

update-column [Deprecated]

(update-column ds col-name f & args)

Applies the function f, with any additional args, to the specified column of the dataset and replaces the column with the resulting values.

get-row [Deprecated]

(get-row ds idx)

Returns the row at the given index.

conj-rows [Deprecated]

(conj-rows & args)

Returns a dataset created by combining the rows of the given datasets and/or collections.

join-rows

(join-rows ds1 ds2)

(join-rows ds1 ds2 & args)

Returns a dataset created by combining the rows of the given datasets.
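
A sketch of join-rows on two datasets with identical column names. What happens when the columns differ is not specified on this page, so the example avoids that case; the alias ds and the sample data are assumptions.

```clojure
(require '[clojure.core.matrix.dataset :as ds])

;; Two hypothetical batches with the same columns.
(def january  (ds/dataset {:id [1 2] :score [10 20]}))
(def february (ds/dataset {:id [3] :score [30]}))

;; The rows of `february` are appended after the rows of `january`.
(def both (ds/join-rows january february))
(count (ds/row-maps both))
```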

to-matrix

(to-matrix ds)

Creates a matrix from the dataset.

to-map

(to-map ds)

Returns a map from column names to lists of column values.

row-maps

(row-maps ds)

Returns a vector of maps, one per row, each mapping column names to that row's values.
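
The two conversions, to-map and row-maps, view the same data column-wise and row-wise. A sketch, assuming the alias ds and hypothetical sample data:

```clojure
(require '[clojure.core.matrix.dataset :as ds])

(def table (ds/dataset {:a [1 2] :b [3 4]}))

;; Column-wise view: one entry per column name.
(def by-column (ds/to-map table))

;; Row-wise view: one map per row, keyed by column name.
(def by-row (ds/row-maps table))
```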

get-element [Deprecated]

(get-element ds c r)

Returns the element at the given column and row.

group-by [Deprecated]

(group-by ds cols)

Returns a map of datasets, keyed by the values of the grouping columns cols.

join [Deprecated]

(join ds & args)

Returns a dataset created by right-joining two or more datasets.

join-columns

(join-columns ds1 ds2)

(join-columns ds1 ds2 & args)

Returns a dataset created by combining the columns of the given datasets.