DataFrames.jl

A package for working with tabular data in Julia using DataFrames.

Installation

DataFrames.jl is now an installable package.

To install DataFrames.jl, use the following:

Pkg.add("DataFrames")

DataFrames.jl has one main module, named DataFrames. Load it with:

using DataFrames

Features

  • DataFrame for efficient tabular storage of two-dimensional data
  • Minimized data copying
  • Default columns can handle missing values (NAs) of any type
  • PooledDataArray for efficient storage of factor-like arrays for characters, integers, and other types
  • Flexible indexing
  • SubDataFrame for efficient subset referencing without copies
  • Grouping operations inspired by plyr, pandas, and data.table
  • Basic merge functionality
  • stack and unstack for long/wide conversions
  • Pipelining support (|) for many operations
  • Several typical R-style functions, including head, tail, describe, unique, duplicated, with, within, and more
  • Formula and design matrix implementation

Demos

Here's a minimal demo showing some grouping operations:

julia> using DataFrames

julia> d = DataFrame(quote     # expressions are one way to create a DataFrame
           x = randn(10)
           y = randn(10)
           i = rand(1:3,10)
           j = rand(1:3,10)
       end);

julia> dump(d)    # dump() is like R's str()
DataFrame  10 observations of 4 variables
  x: DataArray{Float64,1}(10) [-0.22496343871037897,-0.4033933555989207,0.6027847717547058,0.06671669747901597]
  y: DataArray{Float64,1}(10) [0.21904975091285417,-1.3275512477731726,2.266353546459277,-0.19840910239041679]
  i: DataArray{Int64,1}(10) [2,1,3,1]
  j: DataArray{Int64,1}(10) [3,2,1,2]

julia> head(d)
6x4 DataFrame:
                x         y i j
[1,]    -0.224963   0.21905 2 3
[2,]    -0.403393  -1.32755 1 2
[3,]     0.602785   2.26635 3 1
[4,]    0.0667167 -0.198409 1 2
[5,]      1.68303  -1.11183 1 3
[6,]     0.346034   1.68227 2 1

julia> d[1:3, ["x","y"]]     # indexing is similar to R's
3x2 DataFrame
                x        y
[1,]    -0.224963  0.21905
[2,]    -0.403393 -1.32755
[3,]     0.602785  2.26635

julia> # Group on column i, and pipe (|) that result to an expression
julia> # that creates the column x_sum.
julia> groupby(d, "i") | :(x_sum = sum(x))
3x2 DataFrame
        i    x_sum
[1,]    1  2.06822
[2,]    2 -1.80867
[3,]    3 0.319517

julia> groupby(d, "i") | :sum   # Another way to operate on a grouping
3x4 DataFrame
        i    x_sum    y_sum j_sum
[1,]    1  2.06822 -2.73985     8
[2,]    2 -1.80867  1.83489     7
[3,]    3 0.319517  1.03072     2

See demo/workflow_demo.jl for a basic demo of the parts of a Julian data workflow.

See demo/design_demo.jl for a more in-depth demo of DataFrame, its related types, and the rest of the library.

Documentation

Development work

The Issues page highlights a number of open problems and ideas for enhancements. Here are some particular enhancements under way or under discussion:

Possible changes to Julia

DataFrames fit well with Julia's syntax, but a few language features would improve the user experience: keyword function arguments (Julia issue 485), "~" for easier expression syntax, and overloading "." for easier column access (df.colA). See here for a bit more information.

Current status

Please consider this a development preview. Many things work, but expect some rough edges. We hope that this can become a standard Julia package.