# Volcanito.jl

Volcanito is an attempt to start standardizing the user-facing API that tables
expose in Julia. Because that task is too ambitious for one person writing code
in spurts every few months, the project is starting with something less
ambitious:

* Standardize on a set of user-facing macros that define primitive operations
    on tables:
    * `@select`
    * `@where`
    * `@group_by`
    * `@aggregate_vector`
    * `@order_by`
    * `@limit`
* Lower those user-facing macros to objects that lazily represent those
    operations and can be used to build a simplified logical plan:
    * `Select`
    * `Where`
    * `GroupBy`
    * `AggregateVector`
    * `OrderBy`
    * `Limit`
* Define a basic implementation of how to carry out the logical plan in terms
    of primitive operations on DataFrames from
    [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl).

For more details, see [docs/architecture.md](https://github.com/johnmyleswhite/Volcanito.jl/blob/master/docs/architecture.md).

# Example Usage

In [1]:
import Pkg
Pkg.activate("..")

Activating environment at `~/Dropbox (Personal)/Coding Projects/Volcanito/Project.toml`


In [2]:
import DataFrames: DataFrame

import Statistics: mean

import Volcanito:
    @select,
    @where,
    @group_by,
    @aggregate_vector,
    @order_by,
    @limit

In [3]:
df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
    c = rand(Bool, 10_000),
)

│ Row │ a        │ b          │ c    │
│     │ Float64  │ Float64    │ Bool │
├─────┼──────────┼────────────┼──────┤
│ 1   │ 0.925917 │ 0.389588   │ 0    │
│ 2   │ 0.841932 │ 0.276801   │ 1    │
│ 3   │ 0.366998 │ 0.167371   │ 0    │
│ 4   │ 0.775667 │ 0.0864964  │ 1    │
│ 5   │ 0.331247 │ 0.00128394 │ 1    │
│ 6   │ 0.123244 │ 0.457262   │ 1    │
│ 7   │ 0.885265 │ 0.975031   │ 1    │
│ 8   │ 0.467497 │ 0.767182   │ 0    │
│ 9   │ 0.075116 │ 0.866763   │ 0    │
│ 10  │ 0.319431 │ 0.979848   │ 0    │
⋮


In [4]:
@select(df, a, b, d = a + b)



10000×3 DataFrame
│ Row   │ a        │ b          │ d        │
│       │ Float64  │ Float64    │ Float64  │
├───────┼──────────┼────────────┼──────────┤
│ 1     │ 0.925917 │ 0.389588   │ 1.3155   │
│ 2     │ 0.841932 │ 0.276801   │ 1.11873  │
│ 3     │ 0.366998 │ 0.167371   │ 0.534369 │
│ 4     │ 0.775667 │ 0.0864964  │ 0.862163 │
│ 5     │ 0.331247 │ 0.00128394 │ 0.332531 │
│ 6     │ 0.123244 │ 0.457262   │ 0.580506 │
│ 7     │ 0.885265 │ 0.975031   │ 1.8603   │
│ 8     │ 0.467497 │ 0.767182   │ 1.23468  │
│ 9     │ 0.075116 │ 0.866763   │ 0.941879 │
│ 10    │ 0.319431 │ 0.979848   │ 1.29928  │
⋮
│ 9990  │ 0.790729 │ 0.827537   │ 1.61827  │
│ 9991  │ 0.698638 │ 0.422611   │ 1.12125  │
│ 9992  │ 0.601989 │ 0.0270746  │ 0.629064 │
│ 9993  │ 0.844225 │ 0.392945   │ 1.23717  │
│ 9994  │ 0.571843 │ 0.431097   │ 1.00294  │
│ 9995  │ 0.919292 │ 0.593459   │ 1.51275  │
│ 9996  │ 0.844214 │ 0.906676   │ 1.75089  │
│ 9997  │ 0.897126 │ 0.733824   │ 1.63095  │
⋮

In [5]:
@where(df, a > b)



4986×3 DataFrame
│ Row  │ a        │ b          │ c    │
│      │ Float64  │ Float64    │ Bool │
├──────┼──────────┼────────────┼──────┤
│ 1    │ 0.925917 │ 0.389588   │ 0    │
│ 2    │ 0.841932 │ 0.276801   │ 1    │
│ 3    │ 0.366998 │ 0.167371   │ 0    │
│ 4    │ 0.775667 │ 0.0864964  │ 1    │
│ 5    │ 0.331247 │ 0.00128394 │ 1    │
│ 6    │ 0.786434 │ 0.0293047  │ 0    │
│ 7    │ 0.468639 │ 0.00765101 │ 1    │
│ 8    │ 0.276286 │ 0.0807866  │ 0    │
│ 9    │ 0.552711 │ 0.286191   │ 1    │
│ 10   │ 0.275897 │ 0.178368   │ 1    │
⋮
│ 4976 │ 0.735156 │ 0.261879   │ 1    │
│ 4977 │ 0.886085 │ 0.563352   │ 0    │
│ 4978 │ 0.490348 │ 0.116507   │ 1    │
│ 4979 │ 0.554883 │ 0.396594   │ 0    │
│ 4980 │ 0.698638 │ 0.422611   │ 1    │
│ 4981 │ 0.601989 │ 0.0270746  │ 0    │
│ 4982 │ 0.844225 │ 0.392945   │ 1    │
│ 4983 │ 0.571843 │ 0.431097   │ 0    │
│ 4984 │ 0.919292 │ 0.593459   │ 1    │
│ 4985 │ 0.897126 │ 0.733824   │ 1    │
│ 4986 │ 0.451935 │ 0.104735   

In [6]:
@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)



2×5 DataFrame
│ Row │ !c   │ m_a      │ m_b      │ n_a   │ n_b   │
│     │ Bool │ Float64  │ Float64  │ Int64 │ Int64 │
├─────┼──────┼──────────┼──────────┼───────┼───────┤
│ 1   │ 1    │ 0.50305  │ 0.494135 │ 4977  │ 4977  │
│ 2   │ 0    │ 0.493241 │ 0.504268 │ 5023  │ 5023  │

In [7]:
@order_by(df, a + b)



10000×3 DataFrame
│ Row   │ a          │ b          │ c    │
│       │ Float64    │ Float64    │ Bool │
├───────┼────────────┼────────────┼──────┤
│ 1     │ 0.00889487 │ 0.012526   │ 0    │
│ 2     │ 0.00788679 │ 0.0158081  │ 0    │
│ 3     │ 0.0132924  │ 0.0119411  │ 0    │
│ 4     │ 0.0282309  │ 4.7734e-5  │ 1    │
│ 5     │ 0.0104398  │ 0.0195211  │ 1    │
│ 6     │ 0.0059419  │ 0.0305527  │ 1    │
│ 7     │ 0.00318514 │ 0.0353464  │ 1    │
│ 8     │ 0.00940498 │ 0.0294835  │ 0    │
│ 9     │ 0.0123159  │ 0.0294189  │ 0    │
│ 10    │ 0.0358268  │ 0.00609674 │ 1    │
⋮
│ 9990  │ 0.995338   │ 0.959325   │ 0    │
│ 9991  │ 0.968485   │ 0.989534   │ 1    │
│ 9992  │ 0.999272   │ 0.962194   │ 0    │
│ 9993  │ 0.992854   │ 0.969188   │ 1    │
│ 9994  │ 0.984692   │ 0.979936   │ 1    │
│ 9995  │ 0.968387   │ 0.999768   │ 0    │
│ 9996  │ 0.999072   │ 0.970213   │ 1    │
│ 9997  │ 0.989329   │ 0.981392   │ 1    │
│ 9998  │ 0.999908   │ 0.97973    │ 1    │
⋮

In [8]:
@limit(df, 10)



10×3 DataFrame
│ Row │ a        │ b          │ c    │
│     │ Float64  │ Float64    │ Bool │
├─────┼──────────┼────────────┼──────┤
│ 1   │ 0.925917 │ 0.389588   │ 0    │
│ 2   │ 0.841932 │ 0.276801   │ 1    │
│ 3   │ 0.366998 │ 0.167371   │ 0    │
│ 4   │ 0.775667 │ 0.0864964  │ 1    │
│ 5   │ 0.331247 │ 0.00128394 │ 1    │
│ 6   │ 0.123244 │ 0.457262   │ 1    │
│ 7   │ 0.885265 │ 0.975031   │ 1    │
│ 8   │ 0.467497 │ 0.767182   │ 0    │
│ 9   │ 0.075116 │ 0.866763   │ 0    │
│ 10  │ 0.319431 │ 0.979848   │ 0    │

To make it easier to understand how things work, the examples above all exploit
the fact that Volcanito's user-facing macros construct `LogicalNode` objects
that automatically materialize the result of a query whenever `Base.show` is
called. This makes it seem as if the user-facing macros operate eagerly, but
in fact they operate lazily and produce `LogicalNode` objects rather than
DataFrames. To transform a `LogicalNode` object into a full DataFrame,
explicitly call `Volcanito.materialize`.
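
The show-triggers-materialization pattern described above can be illustrated with a minimal, self-contained sketch. The names `LazyMap` and `materialize_sketch` are hypothetical stand-ins invented for this illustration; they are not Volcanito's types or functions, and the real `LogicalNode` implementation may differ.

```julia
# Hypothetical stand-in for a lazy query node; not Volcanito's actual type.
struct LazyMap{F,T}
    f::F        # transformation to apply when the node is materialized
    source::T   # underlying data the transformation reads from
end

# Explicitly carry out the deferred work and return a concrete result.
materialize_sketch(node::LazyMap) = map(node.f, node.source)

# Displaying the node materializes it first, so it looks eager at the REPL.
function Base.show(io::IO, node::LazyMap)
    show(io, materialize_sketch(node))
end

plan = LazyMap(x -> x + 1, [1, 2, 3])

plan isa Vector                      # false: `plan` is still a lazy node
result = materialize_sketch(plan)    # a concrete Vector{Int}
```

Printing `plan` displays the computed values even though `plan` itself is not a `Vector`, which mirrors why `typeof(plan)` in the cells below is not `DataFrame` until `materialize` is called.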

In [10]:
import DataFrames: DataFrame

import Volcanito:
    @select,
    materialize

In [11]:
df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
    c = rand(Bool, 10_000),
)

│ Row │ a        │ b        │ c    │
│     │ Float64  │ Float64  │ Bool │
├─────┼──────────┼──────────┼──────┤
│ 1   │ 0.783001 │ 0.536788 │ 0    │
│ 2   │ 0.789769 │ 0.262531 │ 0    │
│ 3   │ 0.674107 │ 0.409885 │ 0    │
│ 4   │ 0.807469 │ 0.126618 │ 0    │
│ 5   │ 0.814042 │ 0.769956 │ 0    │
│ 6   │ 0.578176 │ 0.392734 │ 0    │
│ 7   │ 0.449241 │ 0.933036 │ 1    │
│ 8   │ 0.119245 │ 0.130244 │ 0    │
│ 9   │ 0.244591 │ 0.294622 │ 1    │
│ 10  │ 0.229329 │ 0.788812 │ 1    │
⋮


In [12]:
plan = @select(df, a, b, d = a + b)



10000×3 DataFrame
│ Row   │ a         │ b        │ d         │
│       │ Float64   │ Float64  │ Float64   │
├───────┼───────────┼──────────┼───────────┤
│ 1     │ 0.783001  │ 0.536788 │ 1.31979   │
│ 2     │ 0.789769  │ 0.262531 │ 1.0523    │
│ 3     │ 0.674107  │ 0.409885 │ 1.08399   │
│ 4     │ 0.807469  │ 0.126618 │ 0.934087  │
│ 5     │ 0.814042  │ 0.769956 │ 1.584     │
│ 6     │ 0.578176  │ 0.392734 │ 0.97091   │
│ 7     │ 0.449241  │ 0.933036 │ 1.38228   │
│ 8     │ 0.119245  │ 0.130244 │ 0.249488  │
│ 9     │ 0.244591  │ 0.294622 │ 0.539213  │
│ 10    │ 0.229329  │ 0.788812 │ 1.01814   │
⋮
│ 9990  │ 0.390972  │ 0.987166 │ 1.37814   │
│ 9991  │ 0.775135  │ 0.722211 │ 1.49735   │
│ 9992  │ 0.368282  │ 0.871369 │ 1.23965   │
│ 9993  │ 0.0368812 │ 0.01269  │ 0.0495713 │
│ 9994  │ 0.538873  │ 0.758099 │ 1.29697   │
│ 9995  │ 0.716272  │ 0.799309 │ 1.51558   │
│ 9996  │ 0.510478  │ 0.813239 │ 1.32372   │
│ 9997  │ 0.945922  │ 0.879409 │ 1.82533   │
⋮

In [13]:
typeof(plan)

Volcanito.Projection{DataFrame,Tuple{Volcanito.Expression{Symbol,1,Symbol,var"#27#33",var"#28#34"},Volcanito.Expression{Symbol,1,Symbol,var"#29#35",var"#30#36"},Volcanito.Expression{Expr,2,Expr,var"#31#37",var"#32#38"}}}

In [14]:
df = materialize(plan)

│ Row │ a        │ b        │ d        │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.783001 │ 0.536788 │ 1.31979  │
│ 2   │ 0.789769 │ 0.262531 │ 1.0523   │
│ 3   │ 0.674107 │ 0.409885 │ 1.08399  │
│ 4   │ 0.807469 │ 0.126618 │ 0.934087 │
│ 5   │ 0.814042 │ 0.769956 │ 1.584    │
│ 6   │ 0.578176 │ 0.392734 │ 0.97091  │
│ 7   │ 0.449241 │ 0.933036 │ 1.38228  │
│ 8   │ 0.119245 │ 0.130244 │ 0.249488 │
│ 9   │ 0.244591 │ 0.294622 │ 0.539213 │
│ 10  │ 0.229329 │ 0.788812 │ 1.01814  │
⋮


In [15]:
typeof(df)

DataFrame

# Expression Rewrites

To simplify working with data, the macros apply rewrite passes that automate
several tedious tasks users would otherwise do manually.

## Automatic Three-Valued Logic

Three-valued logic works even with short-circuiting Boolean operators:

In [17]:
import DataFrames: DataFrame

import Volcanito: @where

In [18]:
df = DataFrame(
    a = [missing, 0.25, 0.5, 0.75],
    b = [missing, 0.75, 0.5, 0.25],
)

4×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ missing  │ missing  │
│ 2   │ 0.25     │ 0.75     │
│ 3   │ 0.5      │ 0.5      │
│ 4   │ 0.75     │ 0.25     │


In [19]:
function f(x)
    println("Calling f(x) on x = $x")
    x + 1
end

f (generic function with 1 method)

In [20]:
@where(df, f(a) > 1.5 && f(b) >= 1.25)

Calling f(x) on x = missing
Calling f(x) on x = missing
Calling f(x) on x = 0.25
Calling f(x) on x = 0.5
Calling f(x) on x = 0.75
Calling f(x) on x = 0.25




1×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ 0.75     │ 0.25     │

## Local Variable Interpolation/Splicing

Local scalar variables can be interpolated/spliced into expressions:

In [22]:
import DataFrames: DataFrame

import Volcanito: @where

In [23]:
df = DataFrame(
    a = [missing, 0.25, 0.5, 0.75],
    b = [missing, 0.75, 0.5, 0.25],
)


4×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ missing  │ missing  │
│ 2   │ 0.25     │ 0.75     │
│ 3   │ 0.5      │ 0.5      │
│ 4   │ 0.75     │ 0.25     │


In [24]:
let x = 0.5
    @where(df, a >= $x)
end



2×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ 0.5      │ 0.5      │
│ 2   │ 0.75     │ 0.25     │
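
One way a macro can tell column names apart from interpolated locals is to walk the expression, turning `$name` nodes into escaped references to the caller's variables and turning bare symbols into column lookups. The sketch below illustrates that idea on a vector of named tuples. The macro `@row_predicate` and the function `rewrite` are hypothetical names made up for this example; this is not Volcanito's implementation.

```julia
# Hypothetical rewrite pass, not Volcanito's implementation.
function rewrite(e::Expr)
    if e.head == Symbol("\$")
        # `$x`: refer to the variable `x` in the caller's scope.
        return esc(e.args[1])
    elseif e.head == :call
        # Leave the function name alone; rewrite only its arguments.
        return Expr(:call, e.args[1], map(rewrite, e.args[2:end])...)
    else
        return Expr(e.head, map(rewrite, e.args)...)
    end
end
rewrite(s::Symbol) = Expr(:., :row, QuoteNode(s))  # bare symbol: field of `row`
rewrite(x) = x                                     # literals pass through unchanged

# Build an anonymous function of one row from the rewritten expression.
macro row_predicate(body)
    return :(row -> $(rewrite(body)))
end

rows = [(a = 0.25, b = 0.75), (a = 0.5, b = 0.5), (a = 0.75, b = 0.25)]

kept = let x = 0.5
    filter(@row_predicate(a >= $x), rows)
end
```

Here `a` resolves to a field of each row while `$x` resolves to the local `x` from the enclosing `let` block, which is the same distinction the `@where(df, a >= $x)` example relies on.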