<center><img src = "https://juliacomputing.com/assets/img/new/JuliaDB_logo2.svg" width=400>

# A data system for Julia
</center>
<br>
<table>
    <tr>
        <td style="width: 200px"><strong style="font-size:20px;">Jeff Bezanson<br>Shashi Gowda<br>Josh Day</strong></td>
        <td><img src = "https://juliacomputing.com/assets/img/new/julia-computing.svg" width=400></td>
    </tr>
</table>


# Overview

- Why Julia
- Why JuliaDB
- Analytics with JuliaDB
- Benchmarks
- API Overview

## Why Do We Need Another Language?

https://julialang.org/blog/2012/02/why-we-created-julia

# The Two Language Problem

- **Prototype** code goes into a high-level language
- **Production** code goes into a low-level language

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Julia_prog_language.svg/1280px-Julia_prog_language.svg.png" width=200></center>

# The dream:

- Be (at least) as nice as Python/R/Matlab

- Be as fast as C/Fortran

*Julia is fast because of features which work well together, and it is more than "Fast Python/R/Matlab"*

# Julia Features

- **JIT Compiler**: User code runs fast
- **Multiple Dispatch**:  Code specialized to argument types
- **Type System**: Express yourself
- **Metaprogramming**: Transform and generate code
- **Parallelism**: Built-in
- **Interop** (Call C directly, RCall.jl, PyCall.jl), **Unicode Support**, **MIT Licensed**, ...
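
As a rough illustration of the **Multiple Dispatch** bullet, here is a minimal sketch. The function name `combine` is hypothetical and exists only for this example. Julia picks a method based on the types of all arguments and compiles a specialized version for each combination it sees.

```julia
# Hypothetical generic function: the method is chosen by the types of BOTH arguments.
combine(a::Int, b::Int) = a + b                  # integers are added
combine(a::String, b::String) = string(a, b)     # strings are concatenated
combine(a, b) = (a, b)                           # generic fallback: return a tuple

combine(1, 2)        # uses the (Int, Int) method
combine("a", "b")    # uses the (String, String) method
combine(1, "b")      # falls back to the generic method
```

Adding a method for a new pair of types does not require touching the existing ones.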

# Introducing JuliaDB

# Julia "Developers"

![](https://pkg.julialang.org/img/allver.svg)

# Julia "Users"

![](https://pkg.julialang.org/img/stars.svg)

# Why JuliaDB?

We shouldn't need to glue together tools to get performance

<img src="https://media.giphy.com/media/xT0xelg3s22Ni7gYO4/giphy.gif" width=400>

<table>
    <tr>
        <td width="100"><h1>pandas</h1></td>
        <td><img src = "images/pandas.png"></td>
    </tr>
</table>

<table>
    <tr>
        <td width="100"><h1>dplyr</h1></td>
        <td><img src = "images/dplyr.png"></td>
    </tr>
</table>

<table>
    <tr>
        <td width="100"><h1>data.table</h1></td>
        <td><img src = "images/datatable.png"></td>
    </tr>
</table>

<table>
    <tr>
        <td width="100"><h1>JuliaDB</h1></td>
        <td><img src = "images/juliadb.png"></td>
    </tr>
</table>

# JuliaDB Goals

**JuliaDB brings the promise of Julia to Data Science**

- Efficiently work with multi-file persistent datasets
- Queries/user-defined functions are fast
- Perform the heavy lifting of distributed computing
- Batteries are included (tools for analytics)

# Example: NYC Taxi Data

![](http://www.nyc.gov/html/tlc/includes/site_images/branding/banner.gif)

- First 4 months of yellow cab data for 2017 (3.2 GB)
- One file per month

In [63]:
;ls /Users/joshday/datasets/taxi

yellow_tripdata_2017-01.csv
yellow_tripdata_2017-02.csv
yellow_tripdata_2017-03.csv
yellow_tripdata_2017-04.csv


In [64]:
;du -h /Users/joshday/datasets/taxi

4.0K	/Users/joshday/datasets/taxi/.juliadb
3.2G	/Users/joshday/datasets/taxi


In [65]:
;head /Users/joshday/datasets/taxi/yellow_tripdata_2017-01.csv

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount

1,2017-01-09 11:13:28,2017-01-09 11:25:45,1,3.30,1,N,263,161,1,12.5,0,0.5,2,0,0.3,15.3
1,2017-01-09 11:32:27,2017-01-09 11:36:01,1,.90,1,N,186,234,1,5,0,0.5,1.45,0,0.3,7.25
1,2017-01-09 11:38:20,2017-01-09 11:42:05,1,1.10,1,N,164,161,1,5.5,0,0.5,1,0,0.3,7.3
1,2017-01-09 11:52:13,2017-01-09 11:57:36,1,1.10,1,N,236,75,1,6,0,0.5,1.7,0,0.3,8.5
2,2017-01-01 00:00:00,2017-01-01 00:00:00,1,.02,2,N,249,234,2,52,0,0.5,0,0,0.3,52.8
1,2017-01-01 00:00:02,2017-01-01 00:03:50,1,.50,1,N,48,48,2,4,0.5,0.5,0,0,0.3,5.3
2,2017-01-01 00:00:02,2017-01-01 00:39:22,4,7.75,1,N,186,36,1,22,0.5,0.5,4.66,0,0.3,27.96
1,2017-01-01 00:00:03,2017-01-01 00:06:58,1,.80,1,N,162,161,1,6,0.5,0.5,1.45,0,0.3,8.75


In [66]:
if nprocs() == 1
    @show addprocs(4)
end
using JuliaDB


path = "/Users/joshday/datasets/taxi"
db = loadtable(path; indexcols=[1, 2])
save(db, "mydata")

Metadata for 4 / 4 files can be loaded from cache.


Distributed Table with 39219765 rows in 4 chunks:
Columns:
#   colname                type
───────────────────────────────────
1   VendorID               Int64
2   tpep_pickup_datetime   DateTime
3   tpep_dropoff_datetime  DateTime
4   passenger_count        Int64
5   trip_distance          Float64
6   RatecodeID             Int64
7   store_and_fwd_flag     String
8   PULocationID           Int64
9   DOLocationID           Int64
10  payment_type           Int64
11  fare_amount            Float64
12  extra                  Float64
13  mta_tax                Float64
14  tip_amount             Float64
15  tolls_amount           Float64
16  improvement_surcharge  Float64
17  total_amount           Float64

# Run This Once:

```julia
addprocs()
using JuliaDB

path = "/Users/joshday/datasets/taxi"

db = loadtable(path; indexcols=[1, 2])

save(db, "mydata")
```

### Future Julia Sessions: Quickly Reload Data

In [67]:
@time db = load("mydata")

  0.001078 seconds (1.64 k allocations: 82.172 KiB)


Distributed Table with 39219765 rows in 4 chunks:
Columns:
#   colname                type
───────────────────────────────────
1   VendorID               Int64
2   tpep_pickup_datetime   DateTime
3   tpep_dropoff_datetime  DateTime
4   passenger_count        Int64
5   trip_distance          Float64
6   RatecodeID             Int64
7   store_and_fwd_flag     String
8   PULocationID           Int64
9   DOLocationID           Int64
10  payment_type           Int64
11  fare_amount            Float64
12  extra                  Float64
13  mta_tax                Float64
14  tip_amount             Float64
15  tolls_amount           Float64
16  improvement_surcharge  Float64
17  total_amount           Float64

# Analytics With OnlineStats
 
 http://joshday.github.io/OnlineStats.jl/latest/
 
- Each statistic/model has its own type
- Values are updated one observation at a time
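
To make "updated one observation at a time" concrete, here is a minimal sketch of a running mean in plain Julia. The type `RunningMean` and the function `update!` are hypothetical names chosen for this example. They are not the OnlineStats implementation.

```julia
# Hypothetical type for illustration only (not part of OnlineStats).
mutable struct RunningMean
    mean::Float64   # current estimate
    n::Int          # number of observations seen so far
end
RunningMean() = RunningMean(0.0, 0)

# Update the estimate with a single observation; memory use stays constant.
function update!(o::RunningMean, x)
    o.n += 1
    o.mean += (x - o.mean) / o.n
    return o
end

o = RunningMean()
for x in [1.0, 2.0, 3.0, 4.0]
    update!(o, x)
end
o.mean  # 2.5
```

Because each update needs only the current state and the new value, the full dataset never has to sit in memory.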

# OnlineStats (Batteries included)

- **Summary Statistics** (`Mean`, `CovMatrix`, `Quantile`, etc.)
- **Density Estimation** (`OHistogram`, `IHistogram`, `FitCategorical`)
- **Regression** (`LinReg`, `LinRegBuilder`)
- **(Approximate) Penalized GLMs** with a variety of:
    - **Algorithms**: `SGD`, `ADAGRAD`, `ADAM`,...
    - **Losses**: `L2DistLoss`, `L1HingeLoss`,...
    - **Penalties**: `ElasticNetPenalty`, ...

# OnlineStats Features

- Different weighting schemes to handle parameter drift

<center>
![](https://user-images.githubusercontent.com/8075494/27964296-c249baec-6305-11e7-89d0-9875d3bdab3e.gif)
</center>
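
A rough sketch of how a weighting scheme changes the update, in plain Julia. The names `equal_weight`, `exponential_weight`, `weighted_mean`, and the parameter `λ` are assumptions made for this example, not the OnlineStats API. With equal weights every observation counts the same. With a constant weight, recent observations dominate, so the estimate can follow a drifting parameter.

```julia
# Hypothetical weight functions: each maps the observation count n to a step size.
equal_weight(n) = 1 / n                  # all observations weighted equally
exponential_weight(n; λ = 0.1) = λ       # constant weight: old data is forgotten

# Running mean whose step size is supplied by `weight`.
function weighted_mean(xs, weight)
    m = 0.0
    for (n, x) in enumerate(xs)
        γ = n == 1 ? 1.0 : weight(n)     # the first observation initializes the estimate
        m += γ * (x - m)
    end
    return m
end

# Toy data whose true mean jumps from 0 to 1 halfway through.
xs = vcat(zeros(100), ones(100))

weighted_mean(xs, equal_weight)          # averages over both regimes
weighted_mean(xs, exponential_weight)    # tracks the most recent regime
```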

# OnlineStats Features

- Calculations can be done in parallel

<center><img src="https://user-images.githubusercontent.com/8075494/32748459-519986e8-c88a-11e7-89b3-80dedf7f261b.png" width=500></center>
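
The idea behind the parallel calculation can be sketched in plain Julia: fit one statistic per chunk, then merge the partial results. `MeanState`, `fit_chunk`, and `merge_states` are hypothetical names used only in this example. They are not how OnlineStats or JuliaDB implement it.

```julia
# Hypothetical immutable state for a mean: the estimate and the observation count.
struct MeanState
    mean::Float64
    n::Int
end

# Fit a statistic on one chunk of data (this step could run on a separate worker).
fit_chunk(xs) = MeanState(sum(xs) / length(xs), length(xs))

# Combine two partial results into the result for the union of their data.
function merge_states(a::MeanState, b::MeanState)
    n = a.n + b.n
    return MeanState(a.mean + (b.mean - a.mean) * b.n / n, n)
end

chunks = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
partials = map(fit_chunk, chunks)         # one partial result per chunk
total = reduce(merge_states, partials)    # merge them into a single result
total.mean  # 3.5
```

Because only the small per-chunk states are merged, chunks can live in different files or on different processes.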

In [68]:
using Plots, OnlineStats
gr()  # Use the GR backend

Plots.GRBackend()

# `reduce` Operations

In [69]:
reduce(+, db; select = :fare_amount) / length(db)

12.711173323960018

In [70]:
@time reduce(Mean(), db; select = :fare_amount)

  0.099197 seconds (3.58 k allocations: 181.891 KiB)


▦ Series{0}  |  EqualWeight  |  nobs = 39219765
└── Mean(12.7112)

In [71]:
@time s = reduce(FitCategorical(Int64), db; select = :passenger_count)
plot(s, xlab = "Number of Passengers")

  1.048131 seconds (4.98 k allocations: 271.313 KiB)


In [72]:
@time s = reduce(2 * OHistogram(0:.5:60), db; 
    select = (:fare_amount, :tip_amount))
plot(s, label = [:Fare :Tip], xlab = :Amount, title = "Fare and Tips")

  0.585766 seconds (5.03 k allocations: 279.078 KiB)


# Benchmarks


bit.ly/juliadb-pydata-benchmarks

In [73]:
benchtype = vcat(fill("loadtable", 2), fill("groupby", 5), fill("join", 6))
bench = ["Read CSV (DateTime)", 
    "Read CSV (String)", 
    "mean(VendorID)", 
    "count(dayofweek, passenger_count)",
    "count(UDF, passenger_count)", 
    "count(dayofweek, passenger_count, floor)", 
    "count(passenger_count, dayofweek)", 
    "Inner Join (String)", "Inner Join (Number)", "Outer Join (String)", "Outer Join (Number)",
    "Left Join (String)","Left Join (Number)"]
jdb = [16.8086,18.0826,11.797,47.192, 51.797,190.741, 5.207,41.59,12.40,44.43,15.06,42.88,13.34]
pan = [32.8733,29.6489,16.487,83.629,280.981,115.429,51.726,16.48,19.28,24.92,37.32,16.87,19.57]

tb = table(benchtype, bench, jdb, pan; names = [:benchgroup, :benchmark, :JuliaDB, :Pandas])

Table with 13 rows, 4 columns:
benchgroup   benchmark                                   JuliaDB  Pandas
─────────────────────────────────────────────────────────────────────────
"loadtable"  "Read CSV (DateTime)"                       16.8086  32.8733
"loadtable"  "Read CSV (String)"                         18.0826  29.6489
"groupby"    "mean(VendorID)"                            11.797   16.487
"groupby"    "count(dayofweek, passenger_count)"         47.192   83.629
"groupby"    "count(UDF, passenger_count)"               51.797   280.981
"groupby"    "count(dayofweek, passenger_count, floor)"  190.741  115.429
"groupby"    "count(passenger_count, dayofweek)"         5.207    51.726
"join"       "Inner Join (String)"                       41.59    16.48
"join"       "Inner Join (Number)"                       12.4     19.28
"join"       "Outer Join (String)"                       44.43    24.92
"join"       "Outer Join (Number)"                       15.06    37.32
"join"       "Left Join (String)"                        42.88    16.87
"join"       "Left Join (Number)"                        13.34    19.57

In [74]:
ratio = JuliaDB.select(tb, (:JuliaDB, :Pandas) => t -> t[1] / t[2])
plt = scatter(JuliaDB.select(tb, :benchgroup), ratio, group=JuliaDB.select(tb, :benchmark),
    title = "JuliaDB vs. Pandas Benchmarks", size=(1000, 400), ylab = "Time Relative to Pandas");

In [75]:
plt

# `loadtable` benchmarks

- NY Taxi file: yellow_trips_2016-01.csv -- 10.9 million rows, 19 columns
<br>
<table>
    <tr>
        <td style="width: 200px"><strong style="font-size:20px;">Benchmark</strong></td> 
        <td style="width: 200px"><strong style="font-size:20px;">JuliaDB</strong></td>
        <td style="width: 200px"><strong style="font-size:20px;">Pandas</strong></td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">Read CSV (DateTime)</strong></td>
        <td style="font-size:20px;">16.8086 s</td>
        <td style="font-size:20px;">32.8733 s</td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">Read CSV (String)</strong></td>
        <td style="font-size:20px;">18.0826 s</td>
        <td style="font-size:20px;">29.6489 s</td>
    </tr>
</table>

# `groupby` Benchmarks

1.6 million rows of taxi data

<br>
<table>
    <tr>
        <td style="width: 200px"><strong style="font-size:20px;">Benchmark</strong></td> 
        <td style="width: 200px"><strong style="font-size:20px;">JuliaDB</strong></td>
        <td style="width: 200px"><strong style="font-size:20px;">Pandas</strong></td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">mean(VendorID)</strong></td>
        <td style="font-size:20px;">11.797 ms</td>
        <td style="font-size:20px;">16.487 ms</td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">count(dayofweek, passenger_count)</strong></td>
        <td style="font-size:20px;">47.192 ms</td>
        <td style="font-size:20px;">83.629 ms</td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">count(UDF, passenger_count)</strong></td>
        <td style="font-size:20px;">51.797 ms</td>
        <td style="font-size:20px;">280.981 ms</td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">count(dayofweek, passenger_count, floor)</strong></td>
        <td style="font-size:20px;">190.741 ms</td>
        <td style="font-size:20px;">115.429 ms</td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">count(dayofweek, passenger_count, floor(Int))</strong></td>
        <td style="font-size:20px;">95.103 ms</td>
        <td style="font-size:20px;">NA</td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">count(passenger_count, dayofweek)</strong></td>
        <td style="font-size:20px;">5.207 ms</td>
        <td style="font-size:20px;">51.726 ms</td>
    </tr>
</table>

# Join Benchmarks

**left:** 80k keys, of which 8k are unique

**right:** 8k keys where 6k are present in left

**Key types:** 2 string fields, 2 floating point fields

<br>
<table>
    <tr>
        <td style="width: 200px"><strong style="font-size:20px;">Benchmark</strong></td> 
        <td style="width: 200px"><strong style="font-size:20px;">JuliaDB (String)</strong></td>
        <td style="width: 200px"><strong style="font-size:20px;">Pandas (String)</strong></td>
        <td style="width: 200px"><strong style="font-size:20px;">JuliaDB (Number)</strong></td>
        <td style="width: 200px"><strong style="font-size:20px;">Pandas (Number)</strong></td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">Inner Join</strong></td>
        <td style="font-size:20px;">41.59 ms</td>
        <td style="font-size:20px;">16.48 ms</td>
        <td style="font-size:20px;">12.40 ms</td>
        <td style="font-size:20px;">19.28 ms</td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">Outer Join</strong></td>
        <td style="font-size:20px;">44.43 ms</td>
        <td style="font-size:20px;">24.92 ms</td>
        <td style="font-size:20px;">15.06 ms</td>
        <td style="font-size:20px;">37.32 ms</td>
    </tr>
    <tr>
        <td style="width: 300px"><strong style="font-size:20px;">Left Join</strong></td>
        <td style="font-size:20px;">42.88 ms</td>
        <td style="font-size:20px;">16.87 ms</td>
        <td style="font-size:20px;">13.34 ms</td>
        <td style="font-size:20px;">19.57 ms</td>
    </tr>
</table>

# API Overview

## Data structures

- **Table** -- a collection of rows, ordered by some fields
  - Good for analytics workloads
- **NDSparse** -- N-Dimensional sparse array
  - Good for scientific computing workloads

# API Overview

```
reindex, select, map, filter, dropna, columns, rows

setcol, pushcol, popcol, insertcol, insertcolafter, insertcolbefore, renamecol

reduce, groupreduce, groupby, flatten

join, groupjoin, asofjoin, merge

save, load
```

## Table specific

`loadtable`

## NDSparse specific

`loadndsparse, keys, values, selectkeys, selectvalues, reducedim, broadcast`

# Selection


In [76]:
using Interact, JuliaDB
tbl = table([0.01,0.05,0.07], [1,2,3], [6,5,4], names=[:t, :x, :y])

selectors = split("""
    :x
    (:x, :y)
    :x => -
    (:x, :y) => p -> hypot(p.x, p.y)
    [1,2,3]
    (:x, :z => [7,8,9])
    (:t, :r => (:x, :y) => p -> hypot(p.x, p.y))""", "\n")


opts = Interact.selection(selectors)

# Re-evaluate the selected selector and render the result whenever the widget changes
result = map(signal(opts)) do sel
    code = "JuliaDB.select(tbl, $sel)"  # already module-qualified
    HTML("<pre>
        <strong>
        select(tbl, <span style='color:green'>$sel</span>)</strong>\n\n" * repr(
        eval(parse(code))) * "</pre>"
    )
end;

In [77]:
display(tbl)
display(opts)
display(result)

Table with 3 rows, 3 columns:
t     x  y
──────────
0.01  1  6
0.05  2  5
0.07  3  4

# Selection

Wildly useful: the same selector syntax is accepted across the API.

<code>select(t, <span style="color: green">which</span>)
map(f, t; <span style="color: green">select</span>)
reduce(f, t; <span style="color: green">select</span>)
filter(f, t; <span style="color: green">select</span>)
groupby(f, t, <span style="color: green">by</span>; <span style="color: green">select</span>)
groupreduce(f, t, <span style="color: green">by</span>; <span style="color: green">select</span>)
join(f, l, r; how, <span style="color: green">lkey</span>, <span style="color: green">rkey</span>, <span style="color: green">lselect</span>, <span style="color: green">rselect</span>)
groupjoin(f, l, r; how, <span style="color: green">lkey</span>, <span style="color: green">rkey</span>, <span style="color: green">lselect</span>, <span style="color: green">rselect</span>)
</code>


(Green arguments are selectors)

# NDSparse

Key-value store that behaves like an n-dimensional array
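
As a toy mental model only, the idea can be sketched with a plain `Dict` keyed by tuples. The data values and the helper `reduce_second_dim` are hypothetical and chosen for this example. The real `NDSparse` keeps its keys sorted and stored by column, which this sketch does not attempt.

```julia
# Hypothetical toy data: keys are (pair, tick index), values are bids.
data = Dict(
    ("AUD/JPY", 1) => 84.7,
    ("AUD/JPY", 2) => 84.8,
    ("EUR/CHF", 1) => 1.07,
)

# Index by an arbitrary key, much like indexing an array by position.
data[("AUD/JPY", 2)]

# Reduce away the second key dimension, in the spirit of reducing along an array axis.
function reduce_second_dim(f, d)
    out = Dict{String,Float64}()
    for ((pair, _), v) in d
        out[pair] = haskey(out, pair) ? f(out[pair], v) : v
    end
    return out
end

reduce_second_dim(max, data)   # one value per currency pair
```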

In [78]:
path = "/Users/joshday/datasets/pydatanyc/truefx"
X = loadndsparse(path,
        indexcols=[1,2],
        header_exists=false,
        distributed=false,
        colnames=["pair", "time", "bid", "ask"])

Metadata for 5 / 5 files can be loaded from cache.


2-d NDSparse with 6532219 values (2 field named tuples):
pair       time                    │ bid      ask
───────────────────────────────────┼─────────────────
"AUD/JPY"  2016-12-01T00:00:00.04  │ 84.719   84.736
"AUD/JPY"  2016-12-01T00:00:00.415 │ 84.722   84.738
"AUD/JPY"  2016-12-01T00:00:00.625 │ 84.717   84.734
"AUD/JPY"  2016-12-01T00:00:00.842 │ 84.716   84.727
"AUD/JPY"  2016-12-01T00:00:01.056 │ 84.713   84.73
"AUD/JPY"  2016-12-01T00:00:01.457 │ 84.715   84.731
"AUD/JPY"  2016-12-01T00:00:01.926 │ 84.713   84.729
"AUD/JPY"  2016-12-01T00:00:02.39  │ 84.714   84.73
"AUD/JPY"  2016-12-01T00:00:02.611 │ 84.714   84.73
"AUD/JPY"  2016-12-01T00:00:02.85  │ 84.721   84.734
"AUD/JPY"  2016-12-01T00:00:03.159 │ 84.719   84.733
"AUD/JPY"  2016-12-01T00:00:03.407 │ 84.725   84.739
                                   ⋮
"EUR/CHF"  2016-12-30T21:59:11.315 │ 1.07122  1.07281
"EUR/CHF"  2016-12-30T21:59:17.272 │ 1.07122  1.07276
"EUR/CHF"  2016-12-30T21:59:23.293 │ 1.07132  1.07247
"EUR/CH

# Indexing by arbitrary keys

In [79]:
keytype(X)

Tuple{String,DateTime}

In [80]:
eltype(X)

NamedTuples._NT_bid_ask{Float64,Float64}

In [81]:
X["AUD/JPY", DateTime("2016-12-30T00:00:00.032")]

(bid = 84.153, ask = 84.171997)

In [82]:
using IntervalSets

In [83]:
t = table(X)
eltype(t)

NamedTuples._NT_pair_time_bid_ask{String,DateTime,Float64,Float64}

In [84]:
t[1]

(pair = "AUD/JPY", time = 2016-12-01T00:00:00.04, bid = 84.719002, ask = 84.736)

# Indexing by arbitrary keys

In [85]:
X[:, Date("2016-12-30")..Date("2016-12-31")]

2-d NDSparse with 323859 values (2 field named tuples):
pair       time                    │ bid      ask
───────────────────────────────────┼─────────────────
"AUD/JPY"  2016-12-30T00:00:00.032 │ 84.153   84.172
"AUD/JPY"  2016-12-30T00:00:00.457 │ 84.158   84.172
"AUD/JPY"  2016-12-30T00:00:00.978 │ 84.161   84.174
"AUD/JPY"  2016-12-30T00:00:01.579 │ 84.159   84.169
"AUD/JPY"  2016-12-30T00:00:02.213 │ 84.154   84.172
"AUD/JPY"  2016-12-30T00:00:03.301 │ 84.154   84.171
"AUD/JPY"  2016-12-30T00:00:04.558 │ 84.155   84.172
"AUD/JPY"  2016-12-30T00:00:05.427 │ 84.154   84.169
"AUD/JPY"  2016-12-30T00:00:06.258 │ 84.156   84.174
"AUD/JPY"  2016-12-30T00:00:07.086 │ 84.158   84.175
"AUD/JPY"  2016-12-30T00:00:08.134 │ 84.158   84.175
"AUD/JPY"  2016-12-30T00:00:08.747 │ 84.155   84.167
                                   ⋮
"EUR/CHF"  2016-12-30T21:59:11.315 │ 1.07122  1.07281
"EUR/CHF"  2016-12-30T21:59:17.272 │ 1.07122  1.07276
"EUR/CHF"  2016-12-30T21:59:23.293 │ 1.07132  1.07247
"EUR/

# Indexing by arbitrary keys

In [86]:
X[["AUD/USD", "CAD/JPY"], Date("2016-12-30")..Date("2016-12-31")]

2-d NDSparse with 113955 values (2 field named tuples):
pair       time                    │ bid      ask
───────────────────────────────────┼─────────────────
"AUD/USD"  2016-12-30T00:00:00.617 │ 0.72343  0.72358
"AUD/USD"  2016-12-30T00:00:01.433 │ 0.72343  0.72357
"AUD/USD"  2016-12-30T00:00:01.932 │ 0.72343  0.72356
"AUD/USD"  2016-12-30T00:00:03.168 │ 0.72342  0.72356
"AUD/USD"  2016-12-30T00:00:04.431 │ 0.72343  0.72357
"AUD/USD"  2016-12-30T00:00:06.205 │ 0.72343  0.72357
"AUD/USD"  2016-12-30T00:00:07.025 │ 0.72343  0.72357
"AUD/USD"  2016-12-30T00:00:08.548 │ 0.72343  0.72358
"AUD/USD"  2016-12-30T00:00:09.573 │ 0.72353  0.72361
"AUD/USD"  2016-12-30T00:00:10.19  │ 0.72353  0.72366
"AUD/USD"  2016-12-30T00:00:11.076 │ 0.72353  0.72368
"AUD/USD"  2016-12-30T00:00:11.965 │ 0.72353  0.72368
                                   ⋮
"CAD/JPY"  2016-12-30T21:58:52.225 │ 87.003   87.156
"CAD/JPY"  2016-12-30T21:59:00.056 │ 86.818   87.156
"CAD/JPY"  2016-12-30T21:59:01.092 │ 86.976   87.

# Indexing by arbitrary keys

In [87]:
bids = selectvalues(X, :bid)

2-d NDSparse with 6532219 values (Float64):
pair       time                    │
───────────────────────────────────┼────────
"AUD/JPY"  2016-12-01T00:00:00.04  │ 84.719
"AUD/JPY"  2016-12-01T00:00:00.415 │ 84.722
"AUD/JPY"  2016-12-01T00:00:00.625 │ 84.717
"AUD/JPY"  2016-12-01T00:00:00.842 │ 84.716
"AUD/JPY"  2016-12-01T00:00:01.056 │ 84.713
"AUD/JPY"  2016-12-01T00:00:01.457 │ 84.715
"AUD/JPY"  2016-12-01T00:00:01.926 │ 84.713
"AUD/JPY"  2016-12-01T00:00:02.39  │ 84.714
"AUD/JPY"  2016-12-01T00:00:02.611 │ 84.714
"AUD/JPY"  2016-12-01T00:00:02.85  │ 84.721
"AUD/JPY"  2016-12-01T00:00:03.159 │ 84.719
"AUD/JPY"  2016-12-01T00:00:03.407 │ 84.725
                                   ⋮
"EUR/CHF"  2016-12-30T21:59:11.315 │ 1.07122
"EUR/CHF"  2016-12-30T21:59:17.272 │ 1.07122
"EUR/CHF"  2016-12-30T21:59:23.293 │ 1.07132
"EUR/CHF"  2016-12-30T21:59:25.069 │ 1.07132
"EUR/CHF"  2016-12-30T21:59:27.241 │ 1.07122
"EUR/CHF"  2016-12-30T21:59:29.473 │ 1.07122
"EUR/CHF"  2016-12-30T21:59:35.949 │ 1.

# Behaves like N-Dimensional Array

In [88]:
eltype(bids)

Float64

In [89]:
reducedim(+, bids, 2)

1-d NDSparse with 5 values (Float64):
pair      │
──────────┼──────────
"AUD/JPY" │ 1.2507e8
"AUD/NZD" │ 1.16549e6
"AUD/USD" │ 1.05043e6
"CAD/JPY" │ 9.32479e7
"EUR/CHF" │ 1.55584e6

# Behaves like N-Dimensional Array

In [90]:
# Aggregate maximum bid for every hour

hourlybids = convertdim(bids, :time, x->trunc(x, Dates.Hour),
                        agg=max)

2-d NDSparse with 2590 values (Float64):
pair       time                │
───────────────────────────────┼────────
"AUD/JPY"  2016-12-01T00:00:00 │ 84.75
"AUD/JPY"  2016-12-01T01:00:00 │ 84.706
"AUD/JPY"  2016-12-01T02:00:00 │ 84.662
"AUD/JPY"  2016-12-01T03:00:00 │ 84.595
"AUD/JPY"  2016-12-01T04:00:00 │ 84.674
"AUD/JPY"  2016-12-01T05:00:00 │ 84.599
"AUD/JPY"  2016-12-01T06:00:00 │ 84.651
"AUD/JPY"  2016-12-01T07:00:00 │ 84.534
"AUD/JPY"  2016-12-01T08:00:00 │ 84.542
"AUD/JPY"  2016-12-01T09:00:00 │ 84.538
"AUD/JPY"  2016-12-01T10:00:00 │ 84.508
"AUD/JPY"  2016-12-01T11:00:00 │ 84.487
                               ⋮
"EUR/CHF"  2016-12-30T11:00:00 │ 1.07544
"EUR/CHF"  2016-12-30T12:00:00 │ 1.07557
"EUR/CHF"  2016-12-30T13:00:00 │ 1.07405
"EUR/CHF"  2016-12-30T14:00:00 │ 1.07304
"EUR/CHF"  2016-12-30T15:00:00 │ 1.07451
"EUR/CHF"  2016-12-30T16:00:00 │ 1.07316
"EUR/CHF"  2016-12-30T17:00:00 │ 1.07311
"EUR/CHF"  2016-12-30T18:00:00 │ 1.07293
"EUR/CHF"  2016-12-30T19:00:00 │ 1.07288
"EUR

# Behaves like N-Dimensional Array

In [91]:
audjpy = hourlybids["AUD/JPY", :]
audjpy = selectkeys(audjpy, (:time,)) # squeeze

1-d NDSparse with 518 values (Float64):
time                │
────────────────────┼───────
2016-12-01T00:00:00 │ 84.75
2016-12-01T01:00:00 │ 84.706
2016-12-01T02:00:00 │ 84.662
2016-12-01T03:00:00 │ 84.595
2016-12-01T04:00:00 │ 84.674
2016-12-01T05:00:00 │ 84.599
2016-12-01T06:00:00 │ 84.651
2016-12-01T07:00:00 │ 84.534
2016-12-01T08:00:00 │ 84.542
2016-12-01T09:00:00 │ 84.538
2016-12-01T10:00:00 │ 84.508
2016-12-01T11:00:00 │ 84.487
                    ⋮
2016-12-30T11:00:00 │ 84.548
2016-12-30T12:00:00 │ 84.59
2016-12-30T13:00:00 │ 84.565
2016-12-30T14:00:00 │ 84.474
2016-12-30T15:00:00 │ 84.518
2016-12-30T16:00:00 │ 84.515
2016-12-30T17:00:00 │ 84.429
2016-12-30T18:00:00 │ 84.378
2016-12-30T19:00:00 │ 84.351
2016-12-30T20:00:00 │ 84.345
2016-12-30T21:00:00 │ 84.381

# Behaves like N-Dimensional Array

In [92]:
hourlybids ./ audjpy

2-d NDSparse with 2590 values (Float64):
pair       time                │
───────────────────────────────┼──────────
"AUD/JPY"  2016-12-01T00:00:00 │ 1.0
"AUD/JPY"  2016-12-01T01:00:00 │ 1.0
"AUD/JPY"  2016-12-01T02:00:00 │ 1.0
"AUD/JPY"  2016-12-01T03:00:00 │ 1.0
"AUD/JPY"  2016-12-01T04:00:00 │ 1.0
"AUD/JPY"  2016-12-01T05:00:00 │ 1.0
"AUD/JPY"  2016-12-01T06:00:00 │ 1.0
"AUD/JPY"  2016-12-01T07:00:00 │ 1.0
"AUD/JPY"  2016-12-01T08:00:00 │ 1.0
"AUD/JPY"  2016-12-01T09:00:00 │ 1.0
"AUD/JPY"  2016-12-01T10:00:00 │ 1.0
"AUD/JPY"  2016-12-01T11:00:00 │ 1.0
                               ⋮
"EUR/CHF"  2016-12-30T11:00:00 │ 0.0127199
"EUR/CHF"  2016-12-30T12:00:00 │ 0.0127151
"EUR/CHF"  2016-12-30T13:00:00 │ 0.0127009
"EUR/CHF"  2016-12-30T14:00:00 │ 0.0127026
"EUR/CHF"  2016-12-30T15:00:00 │ 0.0127134
"EUR/CHF"  2016-12-30T16:00:00 │ 0.0126979
"EUR/CHF"  2016-12-30T17:00:00 │ 0.0127102
"EUR/CHF"  2016-12-30T18:00:00 │ 0.0127158
"EUR/CHF"  2016-12-30T19:00:00 │ 0.0127192
"EUR/CHF"  2016-12-

# Behaves like N-Dimensional Array

In [93]:
broadcast(hypot, hourlybids, audjpy)

2-d NDSparse with 2590 values (Float64):
pair       time                │
───────────────────────────────┼────────
"AUD/JPY"  2016-12-01T00:00:00 │ 119.855
"AUD/JPY"  2016-12-01T01:00:00 │ 119.792
"AUD/JPY"  2016-12-01T02:00:00 │ 119.73
"AUD/JPY"  2016-12-01T03:00:00 │ 119.635
"AUD/JPY"  2016-12-01T04:00:00 │ 119.747
"AUD/JPY"  2016-12-01T05:00:00 │ 119.641
"AUD/JPY"  2016-12-01T06:00:00 │ 119.715
"AUD/JPY"  2016-12-01T07:00:00 │ 119.549
"AUD/JPY"  2016-12-01T08:00:00 │ 119.56
"AUD/JPY"  2016-12-01T09:00:00 │ 119.555
"AUD/JPY"  2016-12-01T10:00:00 │ 119.512
"AUD/JPY"  2016-12-01T11:00:00 │ 119.483
                               ⋮
"EUR/CHF"  2016-12-30T11:00:00 │ 84.5548
"EUR/CHF"  2016-12-30T12:00:00 │ 84.5968
"EUR/CHF"  2016-12-30T13:00:00 │ 84.5718
"EUR/CHF"  2016-12-30T14:00:00 │ 84.4808
"EUR/CHF"  2016-12-30T15:00:00 │ 84.5248
"EUR/CHF"  2016-12-30T16:00:00 │ 84.5218
"EUR/CHF"  2016-12-30T17:00:00 │ 84.4358
"EUR/CHF"  2016-12-30T18:00:00 │ 84.3848
"EUR/CHF"  2016-12-30T19:00:00 │ 8

# NDSparse -- summary

- indexing by arbitrary keys or intervals of values
- iterates by values
- behaves like multi-dimensional arrays
- supports array syntax like broadcast, reducedim, map, reduce
- great for any "Series" -- esp. time series
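
For time series, the hourly aggregation shown earlier with `convertdim` amounts to mapping each key to a coarser key and combining values that collide. Here is a minimal sketch in plain Julia, assuming Julia 0.7 or later where `Dates` is a standard library. The helper `hourly_max` and the tick values are hypothetical.

```julia
using Dates

# Hypothetical ticks: (timestamp, bid) pairs.
ticks = [
    (DateTime(2016, 12, 1, 0, 0, 0, 40), 84.719),
    (DateTime(2016, 12, 1, 0, 59, 59, 500), 84.75),
    (DateTime(2016, 12, 1, 1, 0, 1), 84.706),
]

# Truncate each timestamp to the hour and keep the maximum bid per hour.
function hourly_max(ticks)
    out = Dict{DateTime,Float64}()
    for (t, v) in ticks
        h = trunc(t, Dates.Hour)              # coarser key
        out[h] = max(get(out, h, -Inf), v)    # aggregate values sharing a key
    end
    return out
end

hourly = hourly_max(ticks)
hourly[DateTime(2016, 12, 1, 0)]   # 84.75
```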

## Onward!

- Feature extraction for machine-learning
- More comprehensive out-of-core support
- Dense ND data store (e.g. satellite imagery: X, Y, Z, T -> R, G, B)
- Streaming updates

# Thank You

<center><img src = "https://juliacomputing.com/assets/img/new/JuliaDB_logo2.svg" width=400>

</center>
<br>
<table>
    <tr>
        <td style="width: 200px"><strong style="font-size:20px;">jeff@juliacomputing.com<br>shashi@juliacomputing.com<br>josh@juliacomputing.com</strong></td>
        <td><img src = "https://juliacomputing.com/assets/img/new/julia-computing.svg" width=400></td>
    </tr>
</table>

- https://github.com/JuliaComputing/JuliaDB.jl
- https://juliacomputing.com