# infochimps-labs/big_data_for_chimps

Pulled cruft from statistics chapter

1 parent 0ea162a commit af399b1f9bbb599410438d647f807ccc649ae94e Philip (flip) Kromer committed Feb 7, 2014
Showing with 43 additions and 22 deletions.
1. +19 −21 09-statistics.asciidoc
2. +23 −0 09x-statistics-to_integrate.asciidoc
3. +1 −1 17-machine_learning.asciidoc
40 09-statistics.asciidoc
@@ -11,25 +11,23 @@ Some statistical measures let you summarize the whole from summaries of the part
 Other statistical summaries require assembling context that grows with the size of the whole dataset. The amount of intermediate data required to count distinct objects, extract an accurate histogram, or find the median and other quantiles can become costly and cumbersome. That's especially unfortunate because so much data at large scale has a long-tail, not normal (Gaussian) distribution -- the median is far more robust indicator of the "typical" value than the average. (If Bill Gates walks into a bar, everyone in there is a billionaire on average.)
-But you don't always need an exact value -- you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to
-
-* Holistic vs algebraic aggregations
-* Underflow and the "Law of Huge Numbers"
-* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
-* Count-min sketch for most frequent elements
-* Approx histogram
-
-- Counting
-  - total burgers sold - total commits, repos,
-- counting a running and or smoothed average
-- standard deviation
-- sampling
-  - uniform
-  - top k
-  - reservior
-  - ?rolling topk/reservior sampling?
-- algebraic vs holistic aggregate
-- use countmin sketch to turn a holistic aggregate into an algebraic aggregate
-- quantile or histogram
-- numeric stability
+=== Summary Statistics
+
+TODO: content to come
+
+=== Overflow, Underflow and other Dangers
+
+TODO: content to come
+
+=== Quantiles and Histograms
+
+TODO: content to come
+
+=== Algebraic vs Holistic Aggregations
+
+TODO: content to come
+
+=== "Sketching" Algorithms
+
+TODO: content to come
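The context paragraph in the hunk above contrasts measures that can be assembled from summaries of the parts with measures, such as the median, that need the whole dataset. A minimal Python sketch of that contrast follows. It is an illustration only and not part of this commit; the function names and the net-worth figures are hypothetical.

```python
from statistics import median


def summarize(chunk):
    """Partial summary of one chunk: (count, sum). Hypothetical helper."""
    return (len(chunk), sum(chunk))


def merge(a, b):
    """Combine two partial summaries without revisiting the raw values."""
    return (a[0] + b[0], a[1] + b[1])


def mean_from(summary):
    """Recover the overall mean from a merged (count, sum) summary."""
    count, total = summary
    return total / count


# Hypothetical net worths in dollars, held in two separate partitions.
bar_patrons = [40_000, 55_000, 62_000, 75_000]
newcomer = [80_000_000_000]

# The mean needs only the small per-partition summaries.
combined = merge(summarize(bar_patrons), summarize(newcomer))
overall_mean = mean_from(combined)

# The exact median has no such shortcut here: it is computed from all values.
overall_median = median(bar_patrons + newcomer)

print(f"mean:   {overall_mean:,.0f}")
print(f"median: {overall_median:,.0f}")
```

With these made-up numbers the mean lands in the billions while the median stays at a value one of the original patrons actually holds, which is the point of the Bill Gates remark.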
23 09x-statistics-to_integrate.asciidoc
@@ -0,0 +1,23 @@
+
+But you don't always need an exact value -- you need actionable insight. There's a clever pattern for approximating the whole by combining carefully re-mixed summaries of the parts, and we'll apply it to
+
+* Holistic vs algebraic aggregations
+* Underflow and the "Law of Huge Numbers"
+* Approximate holistic aggs: Median vs remedian; percentile; count distinct (hyperloglog)
+* Count-min sketch for most frequent elements
+* Approx histogram
+
+- Counting
+  - total burgers sold - total commits, repos,
+- counting a running and or smoothed average
+- standard deviation
+- sampling
+  - uniform
+  - top k
+  - reservior
+  - ?rolling topk/reservior sampling?
+- algebraic vs holistic aggregate
+- use countmin sketch to turn a holistic aggregate into an algebraic aggregate
+- quantile or histogram
+- numeric stability
+
2 17-machine_learning.asciidoc
@@ -1,3 +1,3 @@
 [[machine_learning]]
-== Machine Learning
+== Machine Learning without Grad School