# Big Data Visualization

In this notebook, we demonstrate big data visualizations with Kaggle's [Huge Stock Market Dataset](https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs).

### Getting Started
- Download the zipped CSVs:
    - https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs/downloads/Data.zip
- The archive unzips to a `Data` directory containing a `Stocks` subdirectory. In the code cell below, set `path` to the location of `Stocks` on your machine.

In [1]:
path = "/Users/joshday/datasets/stockdata/Stocks"

"/Users/joshday/datasets/stockdata/Stocks"

# Load packages

In [2]:
@show addprocs()
using JuliaDB, OnlineStats, Plots, Interact
gr()

Plots.GRBackend()

# Load Data

The dataset is split into many CSV files. Passing `filenamecol = :Symbol` to `loadtable` adds a column, named `:Symbol`, that holds each row's source file name, and `indexcols` indexes the table by the `:Symbol` and `:Date` fields.

In [3]:
@time stocks = loadtable(path; indexcols = [:Symbol, :Date], filenamecol = :Symbol)

 44.909531 seconds (3.16 M allocations: 169.887 MiB, 0.23% gc time)


Distributed Table with 14887665 rows in 8 chunks:
Symbol      Date        Open    High    Low     Close   Volume    OpenInt
─────────────────────────────────────────────────────────────────────────
"a.us.txt"  1999-11-18  30.713  33.754  27.002  29.702  66277506  0
"a.us.txt"  1999-11-19  28.986  29.027  26.872  27.257  16142920  0
"a.us.txt"  1999-11-22  27.886  29.702  27.044  29.702  6970266   0
"a.us.txt"  1999-11-23  28.688  29.446  27.002  27.002  6332082   0
"a.us.txt"  1999-11-24  27.083  28.309  27.002  27.717  5132147   0
"a.us.txt"  1999-11-26  27.594  28.012  27.509  27.807  1832635   0
"a.us.txt"  1999-11-29  27.676  28.65   27.38   28.432  4317826   0
"a.us.txt"  1999-11-30  28.35   28.986  27.634  28.48   4567146   0
"a.us.txt"  1999-12-01  28.48   29.324  28.273  28.986  3133746   0
"a.us.txt"  1999-12-02  29.532  30.375  29.155  29.786  3252997   0
"a.us.txt"  1999-12-03  30.336  30.842  29.909  30.039  3223074   0
"a.us.txt"  1999-12-06  30.547  31.3

# Plotting a `Partition`

The `OnlineStats.Partition` type summarizes a data stream using an `OnlineStat`.  For example,

```julia
o = Partition(Mean(), 50)
```

estimates the mean of each section of the data stream, keeping the stream split into between 50 and 100 parts.
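
To make the 50-100 range concrete, here is a simplified plain-Julia sketch of one way to keep a stream summarized in between `b` and `2b` parts: fill the current part until it reaches the target width, and when `2b` full parts exist, merge adjacent pairs and double the width. This illustrates the general idea under that assumption and is not the OnlineStats source; `MeanPart`, `mergeparts`, and `partition_means` are hypothetical names.

```julia
# Illustration only: NOT the OnlineStats implementation.
# Each part stores the running sum and count for a contiguous block of observations.
mutable struct MeanPart
    first::Int      # index of the first observation in the part
    last::Int       # index of the last observation in the part
    sum::Float64
    n::Int
end

partmean(p::MeanPart) = p.sum / p.n

# Combine two adjacent parts into one
mergeparts(a::MeanPart, b::MeanPart) =
    MeanPart(a.first, b.last, a.sum + b.sum, a.n + b.n)

function partition_means(x, b::Int)
    parts = MeanPart[]
    width = 1                       # observations per full part
    for (i, xi) in enumerate(x)
        if !isempty(parts) && parts[end].n < width
            # the newest part still has room: update it in place
            p = parts[end]
            p.last = i
            p.sum += xi
            p.n += 1
        else
            if length(parts) == 2b
                # too many parts: merge neighbours pairwise, halving the count
                parts = [mergeparts(parts[j], parts[j + 1]) for j in 1:2:(2b - 1)]
                width *= 2
            end
            push!(parts, MeanPart(i, i, xi, 1))
        end
    end
    return parts
end

parts = partition_means(1:1000, 50)
means = [partmean(p) for p in parts]   # one value per part, ready to plot
```

Once the stream is longer than `2b` observations, the number of parts stays between `b + 1` and `2b`, so the plot never has to draw more than `2b` summaries regardless of how many rows were read.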

### Example 1: Plot closing price for Amazon

`collect` brings the filtered subset into the master process.

In [7]:
@time subset = collect(filter(row -> row.Symbol == "amzn.us.txt", stocks))
@time plot(JuliaDB.select(subset, :Close))

  1.106278 seconds (88.25 k allocations: 4.211 MiB)
  0.001271 seconds (13.15 k allocations: 363.922 KiB)


### Example 2: Add Interactivity and Speed
- After the one-time cost of `groupreduce` with `Partition`, plots for any ticker symbol are nearly instant
    - Try typing `aapl` (Apple), `bby` (Best Buy), etc. into the text box below
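
The lookups are fast afterwards because a grouped reduction visits every row once and leaves one small summary per symbol. A minimal plain-Julia sketch of that pattern follows, reducing each group to its minimum and maximum rather than to a `Partition`; the function name `group_extrema` and the toy data are made up for illustration and do not come from JuliaDB.

```julia
# Illustration only: a single-pass "group by key, reduce values" loop.
# `symbols` and `prices` are parallel collections, one entry per row.
function group_extrema(symbols, prices)
    out = Dict{String,Tuple{Float64,Float64}}()
    for (s, p) in zip(symbols, prices)
        lo, hi = get(out, s, (Inf, -Inf))   # neutral start for an unseen symbol
        out[s] = (min(lo, p), max(hi, p))
    end
    return out
end

symbols = ["a", "b", "a", "b", "a"]
prices  = [1.0, 5.0, 3.0, 2.0, 2.0]
summary = group_extrema(symbols, prices)

summary["a"]   # per-symbol lookup is now a dictionary access
```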

##### Summarize each part with maximum and minimum

In [5]:
o = Partition(Extrema(), 40)

@time g = collect(groupreduce(o, stocks, :Symbol, select = :Close))

@manipulate for t in textbox("amzn")
    data = filter(x -> x.Symbol == "$t.us.txt", g)
    plot(data[1].Partition, ylab = "Closing Price", legend=false)
end

 16.961628 seconds (6.85 M allocations: 204.420 MiB, 2.18% gc time)


##### Summarize each part with a histogram

In [8]:
o = Partition(Hist(15), 40)

@time g = collect(groupreduce(o, stocks, :Symbol, select = :Close))

@manipulate for t in textbox("amzn")
    data = filter(x -> x.Symbol == "$t.us.txt", g)
    plot(data[1].Partition, ylab = "Closing Price", color=:blues, legend=false)
end

 21.470683 seconds (5.71 M allocations: 240.817 MiB, 2.33% gc time)


Failed to push!
    "amz"
to node
    9: "input-3" = amz String (active)

error at node: 11: "map(input-3)-2" = Plot{Plots.GRBackend() n=1} Any (active)
BoundsError: attempt to access 0-element Array{String,1} at index [1]
getindex(::IndexedTables.NextTable{IndexedTables.Columns{NamedTuples._NT_Symbol_Partition{String,OnlineStats.Series{0,Tuple{OnlineStats.Partition{0,OnlineStats.Hist{OnlineStats.AdaptiveBins{Float64}}}},OnlineStatsBase.EqualWeight}},NamedTuples._NT_Symbol_Partition{Array{String,1},Array{OnlineStats.Series{0,Tuple{OnlineStats.Partition{0,OnlineStats.Hist{OnlineStats.AdaptiveBins{Float64}}}},OnlineStatsBase.EqualWeight},1}}}}, ::Int64) at /Users/joshday/.julia/v0.6/IndexedTables/src/table.jl:304
(::Reactive.##33#34{##17#19,Reactive.Signal{Any},Tuple{Reactive.Signal{String}}})() at /Users/joshday/.julia/v0.6/Reactive/src/operators.jl:39
foreach(::Reactive.#runaction, ::Array{Function,1}) at ./abstractarray.jl:1733
run_node(::Reactive.Signal{Any}) at /Users/josh
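
The `BoundsError` above is raised while the text box holds a string that matches no file (here `amz`, partway through typing `amzn`), so `data[1]` indexes an empty result. One possible guard is to check for an empty match before indexing. The sketch below shows the pattern on a plain vector of tuples, a hypothetical stand-in for the collected table `g`; `lookup` is an illustrative name, not a JuliaDB function.

```julia
# Illustration only: `g_demo` stands in for the collected `groupreduce` result.
g_demo = [("aapl.us.txt", "aapl summary"), ("amzn.us.txt", "amzn summary")]

# Return the summary for ticker `t`, or `nothing` when no file name matches.
function lookup(rows, t)
    matches = filter(r -> r[1] == "$t.us.txt", rows)
    return isempty(matches) ? nothing : matches[1][2]
end

lookup(g_demo, "amzn")   # found: returns the stored summary
lookup(g_demo, "amz")    # not found: returns `nothing` instead of throwing
```

Inside the `@manipulate` block, the same `isempty` check could decide between drawing the partition and drawing an empty placeholder plot.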
