Datasets

liebke edited this page Sep 13, 2010 · 7 revisions

Reading Datasets

Data can be read from a file using the read-dataset function in the incanter.io library:

(use 'incanter.io)
(def data (read-dataset "datafile.csv" :header true))

The default delimiter is a comma, but other delimiters can be specified with the :delim option.
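As a minimal sketch of the :delim option, the following writes a small tab-delimited file and reads it back. The path /tmp/example.tsv and its contents are made up for illustration, and the delimiter is passed as a character.

```clojure
(use '(incanter core io))

;; Create a tiny tab-delimited file (hypothetical path and contents).
(spit "/tmp/example.tsv" "x\ty\n1\t2\n3\t4\n")

;; Read it back, treating the first line as a header and tab as the delimiter.
(def tsv-data (read-dataset "/tmp/example.tsv" :header true :delim \tab))

;; Print the column names that were parsed from the header line.
(println (:column-names tsv-data))
```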

Incanter comes with sample data that can be loaded using the get-dataset function from the incanter.datasets library. The get-dataset function relies on the incanter.home property, which is set to ./ by default. If you use bin/clj to start the Clojure shell (REPL) from the Incanter directory, get-dataset will find the datasets in incanter/data. If you start the REPL from another directory, or run it in another environment (e.g., Emacs/SLIME), you need to either pass the incanter.home property to the JVM at startup (java -Dincanter.home=$INCANTER_HOME ...) or use the :incanter-home option to get-dataset.
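For example, the :incanter-home option might be used as follows. This sketch assumes an INCANTER_HOME environment variable that points at the Incanter directory, mirroring the $INCANTER_HOME in the java command; that variable is an assumption, not something Incanter sets for you.

```clojure
(use '(incanter core datasets))

;; Assumption: INCANTER_HOME is set in the environment to the Incanter directory.
(def incanter-home (System/getenv "INCANTER_HOME"))

;; Tell get-dataset where to look, instead of relying on the incanter.home property.
(def iris (get-dataset :iris :incanter-home incanter-home))
```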

To load and view Edgar Anderson’s Iris dataset:

(use '(incanter core datasets))
(def iris (get-dataset :iris))
(view iris)



Converting Datasets to Matrices

A dataset can be converted to a matrix, where non-numeric columns are converted to either
numeric codes or dummy variables, using the to-matrix function.

(def iris-mat (to-matrix iris))
(view iris-mat)



To convert the Species column to two binary dummy variables, use the :dummies option.

(def iris-dummy (to-matrix iris :dummies true))
(view iris-dummy)
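One way to see the difference between the two conversions without opening a viewer is to compare the matrix dimensions. This is a small sketch using the dim function from incanter.core; since a single coded Species column is replaced by two dummy columns, the second matrix should be one column wider.

```clojure
(use '(incanter core datasets))

(def iris (get-dataset :iris))

;; Species encoded as a single column of numeric codes.
(def iris-mat (to-matrix iris))

;; Species encoded as binary dummy variables.
(def iris-dummy (to-matrix iris :dummies true))

;; Print [rows columns] for each matrix.
(println (dim iris-mat))
(println (dim iris-dummy))
```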

Saving Data

Datasets and matrices can be written to a file using the save function.

(save iris "/tmp/iris.csv")

The default delimiter is a comma, but other delimiters can be selected with the :delim
option. Dataset headers are written to the file automatically, but headers can be specified
for matrices with the :header option.

(save iris-mat "/tmp/iris_mat.csv"
  :header ["Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"])
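The :delim option can be sketched in the same way. Here the iris dataset is written as a tab-delimited file and read back with the matching delimiter; the path /tmp/iris.tsv is only an illustrative choice.

```clojure
(use '(incanter core io datasets))

(def iris (get-dataset :iris))

;; Write the dataset with tabs instead of commas (illustrative path).
(save iris "/tmp/iris.tsv" :delim \tab)

;; Read it back with the same delimiter; the header row was written automatically.
(def iris-copy (read-dataset "/tmp/iris.tsv" :header true :delim \tab))

;; Print [rows columns] of the data that was read back.
(println (dim iris-copy))
```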

The :append option can be used to append instead of overwriting an existing file.
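A minimal sketch of appending, using a small made-up dataset and a hypothetical output path. The first save creates or overwrites the file, and the second adds the same rows to the end of it.

```clojure
(use '(incanter core io))

;; A small made-up dataset and output path, for illustration only.
(def small (dataset ["x" "y"] [[1 2] [3 4]]))
(def out-file "/tmp/append_example.csv")

;; First call creates (or overwrites) the file.
(save small out-file)

;; Second call appends to the existing file instead of replacing it.
(save small out-file :append true)

;; Inspect the raw file contents.
(println (slurp out-file))
```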

References

For further information on using datasets and matrices in Incanter see: