Commit

Pulled cruft from statistics chapter
Philip (flip) Kromer committed Feb 7, 2014
1 parent 0ea162a commit af399b1
Showing 3 changed files with 43 additions and 22 deletions.
40 changes: 19 additions & 21 deletions 09-statistics.asciidoc
@@ -11,25 +11,23 @@ Some statistical measures let you summarize the whole from summaries of the part

Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That's especially unfortunate because so much data at large scale has a long-tail, not normal (Gaussian), distribution -- the median is a far more robust indicator of the "typical" value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.)
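
As a rough illustration of the two kinds of summary described above, here is a minimal Python sketch. The net-worth figures, the two-partition split, and the helper names `summarize` and `combine_mean` are all made up for the example. It shows the average being rebuilt exactly from small per-partition `(sum, count)` pairs, while the median is computed here by gathering every value in one place.

```python
from statistics import mean, median

# Hypothetical net worths (in dollars) of the patrons in a bar,
# stored across two partitions of a larger dataset.
partition_a = [40_000, 55_000, 62_000]
partition_b = [48_000, 70_000, 80_000_000_000]  # the billionaire walks in

def summarize(values):
    """Partial summary for the average: a fixed-size (sum, count) pair."""
    return (sum(values), len(values))

def combine_mean(summaries):
    """Rebuild the overall average from the per-partition summaries alone."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count

everyone = partition_a + partition_b

# The average needs only two numbers per partition, however large it is.
avg_from_parts = combine_mean([summarize(partition_a), summarize(partition_b)])

# The exact median is computed here from the full list of values.
typical = median(everyone)
```

With these invented figures the average lands in the billions while the median stays near what most patrons actually have, which is the point of the Bill Gates aside.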

But you don't always need an exact value -- you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to

* Holistic vs algebraic aggregations
* Underflow and the "Law of Huge Numbers"
* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
* Count-min sketch for most frequent elements
* Approx histogram

- Counting
- total burgers sold - total commits, repos,
- counting a running and/or smoothed average
- standard deviation
- sampling
- uniform
- top k
- reservoir
- ?rolling topk/reservoir sampling?
- algebraic vs holistic aggregate
- use countmin sketch to turn a holistic aggregate into an algebraic aggregate
- quantile or histogram
- numeric stability
=== Summary Statistics

TODO: content to come

=== Overflow, Underflow and other Dangers

TODO: content to come

=== Quantiles and Histograms

TODO: content to come

=== Algebraic vs Holistic Aggregations

TODO: content to come

=== "Sketching" Algorithms

TODO: content to come

23 changes: 23 additions & 0 deletions 09x-statistics-to_integrate.asciidoc
@@ -0,0 +1,23 @@

But you don't always need an exact value -- you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to

* Holistic vs algebraic aggregations
* Underflow and the "Law of Huge Numbers"
* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
* Count-min sketch for most frequent elements
* Approx histogram
- Counting
- total burgers sold - total commits, repos,
- counting a running and/or smoothed average
- standard deviation
- sampling
- uniform
- top k
- reservoir
- ?rolling topk/reservoir sampling?
- algebraic vs holistic aggregate
- use countmin sketch to turn a holistic aggregate into an algebraic aggregate
- quantile or histogram
- numeric stability
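
The "remedian" named in the outline above is one instance of approximating the whole by re-mixing summaries of the parts. The following is only a sketch in Python, not the chapter's own implementation: the function name `remedian`, the default `base` of 11, and the weighted-median step used for values left in partly filled buffers are assumptions made for illustration. It keeps a handful of fixed-size buffers and passes the median of each full buffer up to the next level.

```python
from statistics import median

def remedian(values, base=11):
    """Approximate the median of a stream with a few buffers of size `base`.

    Each full buffer is replaced by its median, which is pushed into the
    buffer one level up (a median of medians of medians ...).
    """
    buffers = [[]]
    for v in values:
        buffers[0].append(v)
        level = 0
        # Cascade upward while the buffer at this level is full.
        while len(buffers[level]) == base:
            m = median(buffers[level])
            buffers[level].clear()
            if level + 1 == len(buffers):
                buffers.append([])
            buffers[level + 1].append(m)
            level += 1

    # A value left at level k stands in for roughly base**k inputs, so the
    # final estimate is a weighted median of whatever remains.
    leftovers = sorted(
        (v, base ** lvl) for lvl, buf in enumerate(buffers) for v in buf
    )
    if not leftovers:
        raise ValueError("remedian of an empty stream is undefined")
    half = sum(w for _, w in leftovers) / 2
    running = 0
    for v, w in leftovers:
        running += w
        if running >= half:
            return v
```

Memory use grows with the number of levels (about the logarithm of the stream length in base `base`) rather than with the stream itself, at the price of an estimate that is generally close to, but not guaranteed to equal, the true median.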
2 changes: 1 addition & 1 deletion 17-machine_learning.asciidoc
@@ -1,3 +1,3 @@
[[machine_learning]]
== Machine Learning
== Machine Learning without Grad School
