In addition to the DTrace tracepoints (hibari GH18), introduce a brick_metrics module, a folsom-based metrics system that provides statistics in production.
This will replace the DB operation counters in brick servers and add more statistical information, such as the 95th percentile and standard deviation of latencies in subsystems.
For example, this is a log message from the current Hibari 0.3-dev:
2014-01-30 07:35:34.420 [info] <0.672.0>@brick_metrics:process_stats:132 statistics report
(read) sqflash prminig median: 0.15 ms, 95 percentile: 0.244 ms
(write) logging wait median: 60.627 ms, 95 percentile: 100.651 ms
(write) wal sync median: 38.854 ms, 95 percentile: 66.769 ms, reqs 1, 4
The exdec sampling method was used for the metrics above; it exponentially decays less significant readings over time. Each metric keeps only 1028 recent readings to minimize the performance impact. Note that both the sampling method and the number of readings are configurable.
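To illustrate how a bounded, exponentially decaying sample can yield the median and 95th percentile figures shown in the log, here is a minimal Python sketch. It is not folsom's code or the brick_metrics implementation: the `DecayingSample` class, the `percentile` helper, and the `alpha` value are hypothetical names and parameters chosen for illustration, and the sketch assumes a priority-based reservoir in which newer readings receive exponentially larger weights.

```python
import heapq
import math
import random


class DecayingSample:
    """Bounded reservoir biased toward recent readings (illustrative only)."""

    def __init__(self, size=1028, alpha=0.015, seed=None):
        self.size = size          # maximum number of readings kept
        self.alpha = alpha        # assumed decay factor; larger favors newer readings
        self.start = None         # timestamp of the first reading
        self.rng = random.Random(seed)
        self.heap = []            # min-heap of (priority, value)

    def update(self, value, timestamp):
        if self.start is None:
            self.start = timestamp
        # Newer readings get exponentially larger weights. A production
        # implementation would rescale periodically to avoid overflow.
        weight = math.exp(self.alpha * (timestamp - self.start))
        u = 1.0 - self.rng.random()          # uniform in (0, 1]
        priority = weight / u
        if len(self.heap) < self.size:
            heapq.heappush(self.heap, (priority, value))
        elif priority > self.heap[0][0]:
            # Evict the lowest-priority reading to keep the sample bounded.
            heapq.heapreplace(self.heap, (priority, value))

    def values(self):
        return sorted(value for _, value in self.heap)


def percentile(sorted_values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not sorted_values:
        raise ValueError("no readings")
    rank = (p * len(sorted_values) + 99) // 100   # ceil(p * n / 100)
    return sorted_values[max(rank, 1) - 1]


if __name__ == "__main__":
    sample = DecayingSample(size=1028, alpha=0.015, seed=1)
    source = random.Random(2)
    for second in range(5000):
        # Hypothetical latency readings in milliseconds.
        sample.update(source.uniform(20.0, 110.0), second / 10.0)
    readings = sample.values()
    print("median: %.3f ms, 95 percentile: %.3f ms"
          % (percentile(readings, 50), percentile(readings, 95)))
```

Keeping the sample bounded means each report costs roughly the same no matter how many operations the brick has served, which is the trade-off that the limit on readings is meant to make.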
From the log, you can tell:
Early work is done in this commit: hibari/gdss-brick@2e52fc5
Notes about the metrics system and DTrace
brick_metrics will provide continuous performance statistics in production. It is good for monitoring a brick server's resource usage and operation latencies.
DTrace tracepoints will be used to drill down into performance issues in production, e.g. to draw a latency histogram for a subsystem.
Set target milestone v0.3.0.