# Volcanito.jl

Volcanito is an attempt to start standardizing the user-facing API that tables
expose in Julia. Because that task is too ambitious for one person writing code
in spurts every few months, the project is starting with something less
ambitious:

* Standardize on a set of user-facing macros that define primitive operations
    on tables:
    * `@select`
    * `@where`
    * `@group_by`
    * `@aggregate_vector`
    * `@order_by`
    * `@limit`
* Lower those user-facing macros to objects that lazily represent those
    operations and can be used to build a simplified logical plan:
    * `Select`
    * `Where`
    * `GroupBy`
    * `AggregateVector`
    * `OrderBy`
    * `Limit`
* Define a basic implementation of how to carry out the logical plan in terms
    of primitive operations on DataFrames from
    [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl).

For more details, see [docs/architecture.md](docs/architecture.md).
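
The idea behind the second and third bullets can be sketched without Volcanito. The block below is a toy illustration, not Volcanito's implementation: `ToyWhere`, `ToyLimit`, and `toy_materialize` are hypothetical names invented for this sketch, it skips the macro layer entirely, and it works on a plain vector of named tuples rather than a DataFrame. It only aims to show that building a plan nests lazy objects, and that work happens when the plan is executed.

```julia
# Toy illustration only: these types are hypothetical and are NOT Volcanito's.
abstract type ToyNode end

# A lazy filter: remembers its source and a row predicate, computes nothing.
struct ToyWhere{S,F} <: ToyNode
    source::S
    predicate::F
end

# A lazy limit: remembers its source and a maximum row count.
struct ToyLimit{S} <: ToyNode
    source::S
    n::Int
end

# Executing the plan: recurse into the source, then apply this node's operation.
toy_materialize(rows::AbstractVector) = rows
toy_materialize(node::ToyWhere) =
    filter(node.predicate, toy_materialize(node.source))
function toy_materialize(node::ToyLimit)
    rows = toy_materialize(node.source)
    return rows[1:min(node.n, length(rows))]
end

rows = [(a = 1, b = 2), (a = 3, b = 1), (a = 5, b = 4), (a = 0, b = 9)]

# Building the plan does no filtering; it only nests the nodes.
plan = ToyLimit(ToyWhere(rows, r -> r.a > r.b), 1)

# Only this call actually filters and truncates the rows.
result = toy_materialize(plan)
```

In this sketch `plan` is a small tree of nodes and `result` is an ordinary vector, which mirrors the split between a logical plan and its execution described above.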

# Example Usage

In [1]:
import Pkg
Pkg.activate("..")

[32m[1m Activating[22m[39m environment at `~/Dropbox (Personal)/Coding Projects/Volcanito/Project.toml`


In [2]:
import DataFrames: DataFrame

import Statistics: mean

import Volcanito:
    @select,
    @where,
    @group_by,
    @aggregate_vector,
    @order_by,
    @limit

In [3]:
df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
    c = rand(Bool, 10_000),
)

│ Row │ a         │ b         │ c    │
│     │ Float64   │ Float64   │ Bool │
├─────┼───────────┼───────────┼──────┤
│ 1   │ 0.539269  │ 0.916247  │ 1    │
│ 2   │ 0.640819  │ 0.586741  │ 1    │
│ 3   │ 0.567328  │ 0.722444  │ 1    │
│ 4   │ 0.637534  │ 0.0539694 │ 0    │
│ 5   │ 0.938546  │ 0.773018  │ 0    │
│ 6   │ 0.981955  │ 0.748735  │ 0    │
│ 7   │ 0.877929  │ 0.487357  │ 0    │
│ 8   │ 0.69128   │ 0.11586   │ 0    │
│ 9   │ 0.0555721 │ 0.214801  │ 0    │
│ 10  │ 0.512568  │ 0.476032  │ 0    │
⋮


In [4]:
@select(df, a, b, d = a + b)



10000×3 DataFrame
│ Row   │ a         │ b         │ d        │
│       │ Float64   │ Float64   │ Float64  │
├───────┼───────────┼───────────┼──────────┤
│ 1     │ 0.539269  │ 0.916247  │ 1.45552  │
│ 2     │ 0.640819  │ 0.586741  │ 1.22756  │
│ 3     │ 0.567328  │ 0.722444  │ 1.28977  │
│ 4     │ 0.637534  │ 0.0539694 │ 0.691504 │
│ 5     │ 0.938546  │ 0.773018  │ 1.71156  │
│ 6     │ 0.981955  │ 0.748735  │ 1.73069  │
│ 7     │ 0.877929  │ 0.487357  │ 1.36529  │
│ 8     │ 0.69128   │ 0.11586   │ 0.80714  │
│ 9     │ 0.0555721 │ 0.214801  │ 0.270373 │
│ 10    │ 0.512568  │ 0.476032  │ 0.9886   │
⋮
│ 9990  │ 0.242068  │ 0.718355  │ 0.960422 │
│ 9991  │ 0.845784  │ 0.106749  │ 0.952533 │
│ 9992  │ 0.245139  │ 0.600922  │ 0.846062 │
│ 9993  │ 0.0921204 │ 0.627676  │ 0.719796 │
│ 9994  │ 0.318907  │ 0.557485  │ 0.876391 │
│ 9995  │ 0.884076  │ 0.87798   │ 1.76206  │
│ 9996  │ 0.660093  │ 0.426392  │ 1.08648  │
│ 9997  │ 0.725359  │ 0.914951  │ 1.64031  │
│ 999

In [5]:
@where(df, a > b)



4968×3 DataFrame
│ Row  │ a        │ b         │ c    │
│      │ Float64  │ Float64   │ Bool │
├──────┼──────────┼───────────┼──────┤
│ 1    │ 0.640819 │ 0.586741  │ 1    │
│ 2    │ 0.637534 │ 0.0539694 │ 0    │
│ 3    │ 0.938546 │ 0.773018  │ 0    │
│ 4    │ 0.981955 │ 0.748735  │ 0    │
│ 5    │ 0.877929 │ 0.487357  │ 0    │
│ 6    │ 0.69128  │ 0.11586   │ 0    │
│ 7    │ 0.512568 │ 0.476032  │ 0    │
│ 8    │ 0.728599 │ 0.102919  │ 0    │
│ 9    │ 0.879323 │ 0.426245  │ 0    │
│ 10   │ 0.743608 │ 0.143874  │ 1    │
⋮
│ 4958 │ 0.427635 │ 0.0135866 │ 0    │
│ 4959 │ 0.106481 │ 0.0538055 │ 0    │
│ 4960 │ 0.980864 │ 0.676492  │ 0    │
│ 4961 │ 0.856153 │ 0.693158  │ 0    │
│ 4962 │ 0.845363 │ 0.359023  │ 1    │
│ 4963 │ 0.560947 │ 0.230194  │ 1    │
│ 4964 │ 0.553329 │ 0.240653  │ 1    │
│ 4965 │ 0.845784 │ 0.106749  │ 1    │
│ 4966 │ 0.884076 │ 0.87798   │ 1    │
│ 4967 │ 0.660093 │ 0.426392  │ 1    │
│ 4968 │ 0.894605 │ 0.0818538 │ 1    │

In [6]:
@aggregate_vector(
    @group_by(df, !c),
    m_a = mean(a),
    m_b = mean(b),
    n_a = length(a),
    n_b = length(b),
)



2×5 DataFrame
│ Row │ !c   │ m_a      │ m_b      │ n_a   │ n_b   │
│     │ Bool │ Float64  │ Float64  │ Int64 │ Int64 │
├─────┼──────┼──────────┼──────────┼───────┼───────┤
│ 1   │ 0    │ 0.4997   │ 0.501676 │ 5050  │ 5050  │
│ 2   │ 1    │ 0.498988 │ 0.496971 │ 4950  │ 4950  │

In [7]:
@order_by(df, a + b)



10000×3 DataFrame
│ Row   │ a          │ b          │ c    │
│       │ Float64    │ Float64    │ Bool │
├───────┼────────────┼────────────┼──────┤
│ 1     │ 0.00519039 │ 0.00781176 │ 0    │
│ 2     │ 0.00759572 │ 0.00626751 │ 0    │
│ 3     │ 0.0219159  │ 0.00484135 │ 1    │
│ 4     │ 0.0226781  │ 0.00469032 │ 0    │
│ 5     │ 0.0176394  │ 0.011848   │ 1    │
│ 6     │ 0.00135077 │ 0.0310477  │ 0    │
│ 7     │ 0.0140831  │ 0.0255107  │ 0    │
│ 8     │ 0.018852   │ 0.0253574  │ 0    │
│ 9     │ 0.0369959  │ 0.00795112 │ 0    │
│ 10    │ 0.0270518  │ 0.0200754  │ 1    │
⋮
│ 9990  │ 0.955264   │ 0.987983   │ 0    │
│ 9991  │ 0.983731   │ 0.962799   │ 0    │
│ 9992  │ 0.954755   │ 0.999583   │ 1    │
│ 9993  │ 0.973232   │ 0.984307   │ 1    │
│ 9994  │ 0.994726   │ 0.963807   │ 0    │
│ 9995  │ 0.961814   │ 0.999509   │ 1    │
│ 9996  │ 0.972445   │ 0.989337   │ 0    │
│ 9997  │ 0.995004   │ 0.970239   │ 1    │
│ 9998  │ 0.987697   │ 0.980765   │ 1    │
│ 99

In [8]:
@limit(df, 10)



10×3 DataFrame
│ Row │ a         │ b         │ c    │
│     │ Float64   │ Float64   │ Bool │
├─────┼───────────┼───────────┼──────┤
│ 1   │ 0.539269  │ 0.916247  │ 1    │
│ 2   │ 0.640819  │ 0.586741  │ 1    │
│ 3   │ 0.567328  │ 0.722444  │ 1    │
│ 4   │ 0.637534  │ 0.0539694 │ 0    │
│ 5   │ 0.938546  │ 0.773018  │ 0    │
│ 6   │ 0.981955  │ 0.748735  │ 0    │
│ 7   │ 0.877929  │ 0.487357  │ 0    │
│ 8   │ 0.69128   │ 0.11586   │ 0    │
│ 9   │ 0.0555721 │ 0.214801  │ 0    │
│ 10  │ 0.512568  │ 0.476032  │ 0    │

To make it easier to understand how things work, the examples above all exploit
the fact that Volcanito's user-facing macros construct `LogicalNode` objects
that automatically materialize the result of a query whenever `Base.show` is
called. This makes it seem as if the user-facing macros operate eagerly, but
in fact they operate lazily and produce `LogicalNode` objects rather than
DataFrames. To turn a `LogicalNode` object into a full DataFrame, call
`Volcanito.materialize` explicitly.
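
The display behaviour described above can be imitated in a few lines of plain Julia. This is a minimal sketch under stated assumptions, not Volcanito's code: `LazySquares` and `toy_materialize` are hypothetical names, and the "query" is just squaring a vector. It shows how overloading `Base.show` makes a lazy object look like its computed result when printed, while the object itself stays lazy.

```julia
# Hypothetical toy type, not part of Volcanito.
struct LazySquares
    xs::Vector{Int}
end

# Explicit materialization: the only place where work actually happens.
toy_materialize(node::LazySquares) = node.xs .^ 2

# Display hook: materialize on demand, then show the concrete result.
Base.show(io::IO, node::LazySquares) = show(io, toy_materialize(node))

node = LazySquares([1, 2, 3])    # nothing is computed here
shown = sprint(show, node)       # the text that printing `node` produces
values = toy_materialize(node)   # an ordinary Vector{Int}
```

Printing `node` shows the squared values, yet `typeof(node)` is still `LazySquares`; only `toy_materialize` returns a concrete vector. The notebook cells that follow make the same distinction for a real Volcanito plan.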

In [9]:
import Pkg
Pkg.activate("..")

[32m[1m Activating[22m[39m environment at `~/Dropbox (Personal)/Coding Projects/Volcanito/Project.toml`


In [10]:
import DataFrames: DataFrame

import Volcanito:
    @select,
    materialize

In [11]:
df = DataFrame(
    a = rand(10_000),
    b = rand(10_000),
    c = rand(Bool, 10_000),
)

│ Row │ a        │ b         │ c    │
│     │ Float64  │ Float64   │ Bool │
├─────┼──────────┼───────────┼──────┤
│ 1   │ 0.852162 │ 0.200106  │ 1    │
│ 2   │ 0.321918 │ 0.785016  │ 1    │
│ 3   │ 0.215953 │ 0.0205393 │ 1    │
│ 4   │ 0.957633 │ 0.916897  │ 1    │
│ 5   │ 0.720666 │ 0.47683   │ 1    │
│ 6   │ 0.640876 │ 0.888486  │ 0    │
│ 7   │ 0.47517  │ 0.969994  │ 0    │
│ 8   │ 0.805707 │ 0.618759  │ 1    │
│ 9   │ 0.781557 │ 0.600439  │ 1    │
│ 10  │ 0.556716 │ 0.0499625 │ 0    │
⋮


In [12]:
plan = @select(df, a, b, d = a + b)



10000×3 DataFrame
│ Row   │ a        │ b         │ d        │
│       │ Float64  │ Float64   │ Float64  │
├───────┼──────────┼───────────┼──────────┤
│ 1     │ 0.852162 │ 0.200106  │ 1.05227  │
│ 2     │ 0.321918 │ 0.785016  │ 1.10693  │
│ 3     │ 0.215953 │ 0.0205393 │ 0.236492 │
│ 4     │ 0.957633 │ 0.916897  │ 1.87453  │
│ 5     │ 0.720666 │ 0.47683   │ 1.1975   │
│ 6     │ 0.640876 │ 0.888486  │ 1.52936  │
│ 7     │ 0.47517  │ 0.969994  │ 1.44516  │
│ 8     │ 0.805707 │ 0.618759  │ 1.42447  │
│ 9     │ 0.781557 │ 0.600439  │ 1.382    │
│ 10    │ 0.556716 │ 0.0499625 │ 0.606679 │
⋮
│ 9990  │ 0.25939  │ 0.418931  │ 0.678321 │
│ 9991  │ 0.91623  │ 0.51624   │ 1.43247  │
│ 9992  │ 0.228147 │ 0.265151  │ 0.493297 │
│ 9993  │ 0.126201 │ 0.32692   │ 0.453121 │
│ 9994  │ 0.342067 │ 0.973083  │ 1.31515  │
│ 9995  │ 0.348134 │ 0.664291  │ 1.01242  │
│ 9996  │ 0.90186  │ 0.875675  │ 1.77753  │
│ 9997  │ 0.543611 │ 0.525717  │ 1.06933  │
│ 9998  │ 0.294553 │ 0.291

In [13]:
typeof(plan)

Volcanito.Projection{DataFrame,Tuple{Volcanito.FunctionSpec{Symbol,1,Symbol,var"#27#33",var"#28#34"},Volcanito.FunctionSpec{Symbol,1,Symbol,var"#29#35",var"#30#36"},Volcanito.FunctionSpec{Expr,2,Expr,var"#31#37",var"#32#38"}}}

In [14]:
df = materialize(plan)

│ Row │ a        │ b         │ d        │
│     │ Float64  │ Float64   │ Float64  │
├─────┼──────────┼───────────┼──────────┤
│ 1   │ 0.852162 │ 0.200106  │ 1.05227  │
│ 2   │ 0.321918 │ 0.785016  │ 1.10693  │
│ 3   │ 0.215953 │ 0.0205393 │ 0.236492 │
│ 4   │ 0.957633 │ 0.916897  │ 1.87453  │
│ 5   │ 0.720666 │ 0.47683   │ 1.1975   │
│ 6   │ 0.640876 │ 0.888486  │ 1.52936  │
│ 7   │ 0.47517  │ 0.969994  │ 1.44516  │
│ 8   │ 0.805707 │ 0.618759  │ 1.42447  │
│ 9   │ 0.781557 │ 0.600439  │ 1.382    │
│ 10  │ 0.556716 │ 0.0499625 │ 0.606679 │
⋮


In [15]:
typeof(df)

DataFrame