Add brick_metrics module - a Folsom based metrics system in production #36

Open
tatsuya6502 opened this issue Jan 30, 2014 · 1 comment

@tatsuya6502

In addition to the DTrace tracepoints (hibari GH18), introduce a brick_metrics module: a Folsom-based metrics system that provides statistics in production.

This will replace the DB operation counters in brick servers and add more statistical information, such as the 95th percentile and standard deviation of latencies in subsystems.

For example, this is a log message from the current Hibari 0.3-dev:

2014-01-30 07:35:34.420 [info] <0.672.0>@brick_metrics:process_stats:132 statistics report
    (read)  sqflash prminig  median: 0.15 ms, 95 percentile: 0.244 ms
    (write) logging wait     median: 60.627 ms, 95 percentile: 100.651 ms
    (write) wal sync         median: 38.854 ms, 95 percentile: 66.769 ms, reqs 1, 4

The metrics above use the exdec sampling method, which exponentially decays less significant (older) readings over time. Only the most recent 1028 readings are kept, to minimize the performance impact. Both the sampling method and the number of readings are configurable.
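
To make the sampling idea concrete, here is a minimal sketch of an exponentially decaying reservoir with a nearest-rank percentile. It is not Folsom's code: it is written in Python for illustration, it assumes the common forward-decay priority scheme, and the names `ExpDecayReservoir`, `percentile`, and the default `alpha=0.015` are hypothetical choices for this example.

```python
import heapq
import math
import random
import time


class ExpDecayReservoir:
    """Keeps at most `size` readings, biased toward recent ones (sketch only)."""

    def __init__(self, size=1028, alpha=0.015, clock=time.monotonic, rng=None):
        self.size = size      # maximum number of readings kept
        self.alpha = alpha    # decay factor; larger means a stronger bias to recent data
        self.clock = clock    # injectable time source, in seconds
        self.rng = rng or random.Random()
        self.start = clock()
        self._heap = []       # min-heap of (priority, value)

    def update(self, value):
        age = self.clock() - self.start
        # Weight grows exponentially with arrival time, so newer readings
        # tend to get higher priorities. A long-running implementation would
        # periodically rescale to avoid overflow; that step is omitted here.
        weight = math.exp(self.alpha * age)
        u = 1.0 - self.rng.random()  # uniform in (0, 1], never zero
        item = (weight / u, value)
        if len(self._heap) < self.size:
            heapq.heappush(self._heap, item)
        elif item[0] > self._heap[0][0]:
            # Evict the lowest-priority reading in favour of the new one.
            heapq.heapreplace(self._heap, item)

    def values(self):
        return sorted(value for _, value in self._heap)


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, with p in (0, 100]."""
    if not sorted_values:
        raise ValueError("no readings")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]
```

A reporter would call `update(latency_ms)` on every operation and periodically print `percentile(r.values(), 50)` and `percentile(r.values(), 95)`. Because the reservoir is bounded, the cost per reading stays small no matter how long the server runs.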

From the log, you can tell:

  • all (recent) reads were served from the filesystem cache, none from disk (the 95th percentile is less than 1 ms)
  • the disk drive (single, 2.5 inch, PC-grade) is overloaded by WAL sync (group commit)
  • logging wait takes about 1.5 times as long as wal sync (60.6 ms vs. 38.9 ms median). I am rewriting the old WAL module (gmt_hlog) from scratch to improve this area.
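
The readings above can also be checked mechanically. The following sketch (Python, illustrative only) parses the three statistics lines exactly as they appear in the log excerpt and derives the two numbers the conclusions rest on. The regular expression, the `parse_report` helper, and the dictionary keys are assumptions made for this example and are not part of brick_metrics; the 1 ms cut-off for "served from cache" is the one used in the list above.

```python
import re

# The statistics lines, copied verbatim from the log message above.
LOG = """\
    (read)  sqflash prminig  median: 0.15 ms, 95 percentile: 0.244 ms
    (write) logging wait     median: 60.627 ms, 95 percentile: 100.651 ms
    (write) wal sync         median: 38.854 ms, 95 percentile: 66.769 ms, reqs 1, 4
"""

# Hypothetical pattern matching the layout shown above; trailing fields are ignored.
LINE = re.compile(
    r"\((read|write)\)\s+(.+?)\s+median:\s*([\d.]+) ms, 95 percentile:\s*([\d.]+) ms"
)


def parse_report(text):
    """Return {metric name: {"kind", "median_ms", "p95_ms"}} for each line."""
    stats = {}
    for match in LINE.finditer(text):
        kind, name, median, p95 = match.groups()
        stats[name] = {
            "kind": kind,
            "median_ms": float(median),
            "p95_ms": float(p95),
        }
    return stats


stats = parse_report(LOG)

# Reads whose 95th percentile is under 1 ms are taken to be cache hits.
reads_from_cache = stats["sqflash prminig"]["p95_ms"] < 1.0

# How much longer a write waits for logging than the WAL sync itself takes.
wait_to_sync = stats["logging wait"]["median_ms"] / stats["wal sync"]["median_ms"]

print(reads_from_cache, round(wait_to_sync, 2))
```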

Early work is in this commit: hibari/gdss-brick@2e52fc5fc5a64

Notes about the metrics system and DTrace

brick_metrics will provide continuous performance statistics in production. This is good for monitoring a brick server's resource usage and operation latencies.

DTrace tracepoints will be used to drill down into performance issues in production, e.g. to draw a latency histogram for a subsystem.

@ghost ghost assigned tatsuya6502 Jan 30, 2014
@tatsuya6502

Set target milestone v0.3.0.
