# JuliaDB + OnlineStats

[JuliaDB](https://github.com/JuliaComputing/JuliaDB.jl) is a package for working with large persistent datasets.  The online/parallel algorithms in OnlineStats work efficiently with `reduce` operations in JuliaDB to calculate statistics and models out-of-core or in parallel.
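To see why an online algorithm pairs naturally with `reduce`, here is a minimal sketch in plain Julia (no packages). The `RunningMean` type and the `update!` and `combine` functions are hypothetical names invented for this illustration; they are not the OnlineStats implementation. The idea is only that a statistic which can be updated one observation at a time, and whose partial results can be combined, can be computed over chunks of data independently.

```julia
# Hypothetical stand-in for an online statistic (not OnlineStats' own Mean).
mutable struct RunningMean
    n::Int          # number of observations seen so far
    mean::Float64   # current mean
end
RunningMean() = RunningMean(0, 0.0)

# Update the mean with a single observation using constant memory.
function update!(m::RunningMean, x::Real)
    m.n += 1
    m.mean += (x - m.mean) / m.n
    return m
end

# Combine two partial results, as a parallel or out-of-core reduce would.
function combine(a::RunningMean, b::RunningMean)
    n = a.n + b.n
    n == 0 && return RunningMean()
    return RunningMean(n, (a.n * a.mean + b.n * b.mean) / n)
end

data  = collect(1.0:10.0)
left  = foldl(update!, data[1:4];  init = RunningMean())  # first chunk
right = foldl(update!, data[5:10]; init = RunningMean())  # second chunk
total = combine(left, right)                              # same mean as one pass over all data
```

Each chunk is summarized on its own and only the small summaries are combined, so the full dataset never has to sit in memory at once.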

In [1]:
using JuliaDB, OnlineStats

## Basics

#### The `reduce` function can accept Julia functions and `OnlineStat`s as reducers.

In [2]:
t = table(@NT(x = randn(100), y = randn(100)))

@show reduce(+, t, select = :x)

reduce(Sum(), t, select = :x)

reduce(+, t, select=:x) = 10.02007601621434


▦ Series{0}  |  EqualWeight  |  nobs = 100
└── Sum{Float64}(10.0201)

#### If the selection spans multiple columns, the `OnlineStat` must be able to handle inputs of that size.

For example, a `Mean` accepts a single number per observation, so passing it two columns raises an error:

In [3]:
reduce(Mean(), t; select = (:x, :y))

LoadError: MethodError: no method matching fit!(::OnlineStats.Series{0,Tuple{OnlineStats.Mean},OnlineStatsBase.EqualWeight}, ::NamedTuples._NT_x_y{Float64,Float64})
Closest candidates are:
  fit!(::OnlineStats.CovMatrix, ::Union{AbstractArray{T,1} where T, NamedTuples.NamedTuple, Tuple}, ::Float64) at /Users/joshday/.julia/v0.6/OnlineStats/src/stats/stats.jl:171
  fit!(::OnlineStats.HyperLogLog, ::Any, ::Float64) at /Users/joshday/.julia/v0.6/OnlineStats/src/stats/stats.jl:313
  fit!(::OnlineStats.LinReg, ::Union{AbstractArray{T,1} where T, NamedTuples.NamedTuple, Tuple}, ::Real, ::Float64) at /Users/joshday/.julia/v0.6/OnlineStats/src/stats/linregbuilder.jl:37
  ...

#### We can, however, use `2Mean()` to create an object that calculates a separate mean for each column:

In [4]:
s = reduce(2Mean(), t, select = (:x, :y))

▦ Series{1}  |  EqualWeight  |  nobs = 100
└── MV{Mean}(0.10020076016214344, -0.15830652360898992)

# Partitions

The `Partition` type is designed for visualizing large datasets. It incrementally partitions the data into equal parts, with each part summarized by an `OnlineStat`.
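One way to picture incremental partitioning is the plain-Julia sketch below. Everything in it is an assumption made for illustration: the `Part` and `ExtremaPartition` types, the `observe!` function, and in particular the rule of merging neighbouring parts pairwise once a maximum number of parts is reached. It is not taken from the OnlineStats source, and it assumes an even `maxparts`.

```julia
# Hypothetical sketch; names and the merging rule are assumptions, not OnlineStats internals.
mutable struct Part
    lo::Float64   # smallest value seen in this part
    hi::Float64   # largest value seen in this part
    n::Int        # number of observations in this part
end

mutable struct ExtremaPartition
    parts::Vector{Part}
    maxparts::Int   # upper bound on the number of parts kept (assumed even)
    width::Int      # number of observations in a full part
end
ExtremaPartition(maxparts::Int) = ExtremaPartition(Part[], maxparts, 1)

# Merging two extrema summaries keeps the overall min and max.
mergeparts(a::Part, b::Part) = Part(min(a.lo, b.lo), max(a.hi, b.hi), a.n + b.n)

function observe!(p::ExtremaPartition, x::Real)
    if isempty(p.parts) || p.parts[end].n == p.width
        if length(p.parts) == p.maxparts
            # Every part is full: merge neighbours pairwise and double the width.
            p.parts = [mergeparts(p.parts[i], p.parts[i + 1]) for i in 1:2:length(p.parts) - 1]
            p.width *= 2
        end
        push!(p.parts, Part(x, x, 1))   # start a new part with this observation
    else
        cur = p.parts[end]              # the newest part still has room
        cur.lo = min(cur.lo, x)
        cur.hi = max(cur.hi, x)
        cur.n += 1
    end
    return p
end

p = ExtremaPartition(4)
for x in 1:16
    observe!(p, x)
end
```

Under this scheme memory stays bounded by `maxparts` no matter how much data arrives, and every full part covers the same number of observations, which is what makes the summaries comparable when plotted side by side.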

In [9]:
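using Plots; gr()

x = randn()                            # starting value of a random walk
s = Series(Partition(Extrema(), 40))   # each part is summarized by an Extrema

# Take one random-walk step per iteration, update the partition, and plot it;
# `every 25` keeps only every 25th frame in the animation.
@gif for i in 1:2_000
    plot(fit!(s, x += randn()))
end every 25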

[1m[36mINFO: [39m[22m[36mSaved animation to /Users/joshday/github/OnlineStatsDemos/tmp.gif
[39m

## Partition on the `diamonds` dataset

In [6]:
diamonds = loadtable("diamonds.csv"; indexcols = [:carat, :cut])

Table with 53940 rows, 10 columns:
carat  cut          color  clarity  depth  table  price  x      y      z
───────────────────────────────────────────────────────────────────────────
0.2    "Ideal"      "E"    "VS2"    59.7   55.0   367    3.86   3.84   2.3
0.2    "Ideal"      "D"    "VS2"    61.5   57.0   367    3.81   3.77   2.33
0.2    "Ideal"      "E"    "VS2"    62.2   57.0   367    3.76   3.73   2.33
0.2    "Premium"    "E"    "SI2"    60.2   62.0   345    3.79   3.75   2.27
0.2    "Premium"    "E"    "VS2"    59.8   62.0   367    3.79   3.77   2.26
0.2    "Premium"    "E"    "VS2"    59.0   60.0   367    3.81   3.78   2.24
0.2    "Premium"    "E"    "VS2"    61.1   59.0   367    3.81   3.78   2.32
0.2    "Premium"    "E"    "VS2"    59.7   62.0   367    3.84   3.8    2.28
0.2    "Premium"    "F"    "VS2"    62.6   59.0   367    3.73   3.71   2.33
0.2    "Premium"    "D"    "VS2"    62.3   60.0   367    3.73   3.68   2.31
⋮

In [7]:
plot(reduce(Partition(Mean()), diamonds, select=:carat))