fill out README

josephruscio · Sep 12, 2009 · b762303
1 parent 6626079
commit b762303
Showing 1 changed file with 168 additions and 1 deletion.
diff --git a/README b/README
@@ -1,2 +1,169 @@
-Aggregate is a ruby implementation of a statistics aggregator including histogram support
+Aggregate is an intuitive Ruby implementation of a statistics aggregator
+with both default and configurable histogram support. It does not record
+or store any of the actual sample values, so its memory footprint does not
+grow with the number of samples. This makes it suitable for tracking
+statistics across millions or billions of samples. It was originally
+inspired by the Aggregate support in SystemTap (http://sourceware.org/systemtap/).
 
+Aggregates are easy to instantiate, populate with sample data, and query
+for statistics:
+
+# After instantiation, use the << operator to add a sample to the aggregate:
+stats = Aggregate.new
+
+loop do
+  # Take some action that generates a sample measurement
+  stats << sample
+end
+
+# The number of samples
+stats.count
+
+# The average
+stats.mean
+
+# Max sample value
+stats.max
+
+# Min sample value
+stats.min
+
+# The standard deviation
+stats.std_dev
+
+Perhaps more important than the basic aggregate statistics detailed above,
+Aggregate also maintains a histogram of samples. For a good explanation of
+why this matters, see: http://37signals.com/svn/posts/1836-the-problem-with-averages
+
+The histogram is maintained as a set of "buckets". Each bucket represents a
+range of possible sample values. The set of all buckets represents the range
+of "normal" sample values. By default this is a binary histogram, where
+each bucket represents a range twice as large as the preceding bucket, i.e.
+the inclusive ranges [1,1], [2,3], [4,7], [8,15]. The default binary histogram
+provides for 128 buckets, theoretically covering the range [1, (2^127) - 1].
+(See NOTES below for a discussion on the effects in practice of insufficient
+precision.)
+
+Binary histograms are useful when we have little idea what the
+sample distribution may look like, as almost any positive value will
+fall into some bucket. After using binary histograms to determine
+the coarse-grained characteristics of your sample space, you can
+configure a linear histogram to examine it in closer detail.
+
+Linear histograms are specified with three values: low, high, and width.
+Low and high specify a range [low, high) of values included in the
+histogram (all others are outliers). Width specifies the number of
+values represented by each bucket, and therefore the number of
+buckets, i.e. the granularity of the histogram. The histogram range
+(high - low) must be a multiple of width:
+
+# Track aggregate stats on response times in ms
+response_stats = Aggregate.new(0, 2000, 50)
+
+The example above creates a linear histogram that tracks the
+response times from 0 ms to 2000 ms in buckets of width 50 ms. Hopefully
+most of your samples fall in the first couple of buckets! Any values added to the
+aggregate that fall outside of the histogram range are recorded as outliers:
+
+# Number of samples that fall below the normal range
+stats.outliers_low
+
+# Number of samples that fall above the normal range
+stats.outliers_high
+
+Once a histogram is populated Aggregate provides iterator support for
+examining the contents of buckets. The iterators provide both the
+number of samples in the bucket, as well as its range:
+
+# Examine every bucket
+stats.each do |bucket, count|
+end
+
+# Examine only buckets containing samples
+stats.each_nonzero do |bucket, count|
+end
+
+Finally, Aggregate contains pretty-printing support. For any width of at
+least 80 columns (the default is 80) and any sample distribution, it sets a
+marker weight based on the samples per bucket and aligns all output.
+Empty buckets are skipped to conserve screen space.
+
+# Generate and display an 80 column histogram
+puts stats.to_s
+
+# Generate and display a 120 column histogram
+puts stats.to_s(120)
+
+The following code populates both a binary and linear histogram with the same
+set of 65536 values generated by rand to produce two histograms:
+
+require 'rubygems'
+require 'aggregate'
+
+# Create one binary and one linear Aggregate instance
+binary_aggregate = Aggregate.new
+linear_aggregate = Aggregate.new(0, 65536, 4096)
+
+65536.times do
+  x = rand(65536)
+  binary_aggregate << x
+  linear_aggregate << x
+end
+
+puts binary_aggregate.to_s
+puts linear_aggregate.to_s
+
+** OUTPUT **
+** Binary Histogram **
+value |------------------------------------------------------------------| count
+    1 |                                                                  |     3
+    2 |                                                                  |     1
+    4 |                                                                  |     5
+    8 |                                                                  |     9
+   16 |                                                                  |    15
+   32 |                                                                  |    29
+   64 |                                                                  |    62
+  128 |                                                                  |   115
+  256 |                                                                  |   267
+  512 |@                                                                 |   523
+ 1024 |@                                                                 |   970
+ 2048 |@@@                                                               |  1987
+ 4096 |@@@@@@@@                                                          |  4075
+ 8192 |@@@@@@@@@@@@@@@@                                                  |  8108
+16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                                  | 16405
+32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 32961
+      ~
+Total |------------------------------------------------------------------| 65535
+
+** Linear (0, 65536, 4096) Histogram **
+value |------------------------------------------------------------------| count
+    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4094
+ 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|  4202
+ 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4118
+12288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4059
+16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |  3999
+20480 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4083
+24576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4134
+28672 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4143
+32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4152
+36864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4033
+40960 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4064
+45056 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4012
+49152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4070
+53248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4090
+57344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4135
+61440 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4144
+Total |------------------------------------------------------------------| 65532
+
+We can see from these histograms that Ruby's rand function does a relatively good
+job of distributing returned values in the requested range.
+
+** NOTES **
+Ruby doesn't have a log2 function built into Math, so we approximate with
+log(x)/log(2). Theoretically the integer part of log(2^n - 1)/log(2) is n-1.
+Unfortunately, due to precision limitations, once n reaches a certain size
+(somewhere > 32) this starts to return n. The larger the value of n, the more
+numbers, i.e. (2^n - 2), (2^n - 3), etc., fall prey to this error. We could
+probably look into using something like BigDecimal, but for the current
+purposes of the binary histogram, i.e. a simple coarse-grained view, the
+current implementation is sufficient.
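
The precision issue described in NOTES can be explored with a small standalone script. This is only a sketch and not code from Aggregate: the helper names approx_log2_floor and exact_log2_floor are hypothetical, and the exact variant assumes a Ruby version that provides Integer#bit_length (2.1 or later). It compares the log(x)/log(2) approach against an integer-only calculation for values of the form 2^n - 1 and lists the exponents where the two disagree, which may vary by platform.

```ruby
# Bucket index as described in NOTES: the integer part of log(x) / log(2).
# Floating-point rounding can push this one too high for x just below 2**n.
def approx_log2_floor(x)
  (Math.log(x) / Math.log(2)).floor
end

# Exact alternative for positive Integers: the position of the highest set
# bit. This is an assumption for illustration, not Aggregate's implementation.
def exact_log2_floor(x)
  unless x.is_a?(Integer) && x > 0
    raise ArgumentError, "x must be a positive Integer"
  end
  x.bit_length - 1
end

# For x = 2**n - 1 the exact answer is always n - 1. Collect every exponent
# in the binary histogram's range where the approximation disagrees.
mismatches = (1..127).select do |n|
  x = 2**n - 1
  approx_log2_floor(x) != exact_log2_floor(x)
end

puts "Exponents n where log(x)/log(2) misplaces 2**n - 1:"
puts mismatches.inspect
```

For a coarse-grained binary histogram an off-by-one bucket near a power of two is usually tolerable, which matches the conclusion in NOTES; the integer-only variant is simply one possible way to avoid it when samples are Integers.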