Frequently asked questions

jwatte edited this page Oct 6, 2012 · 3 revisions

Q: What is istatd?

  • A: istatd is a system that collects and calculates simple statistics over time for named values (called counters). It also lets you browse and examine these counters in a web application included with istatd. The data is also available to any third-party application through a simple REST/JSON interface.

Q: What hardware does istatd require?

  • A: istatd runs fine on a single-CPU virtual machine under VMware Workstation during development. However, for deployment to large systems, we recommend a multi-CPU system with an SSD RAID disk subsystem. IMVU uses a Dell R610 server with a 1.4 TB RAID-5 SSD array, and records hundreds of thousands of statistics at ten-second resolution. If you use spinning disks, expect throughput to be significantly lower.

Q: What software does istatd require?

  • A: We develop and deploy istatd on Ubuntu Server x64 10.04 LTS. We expect it to compile and run as-is on any Debian-based x86/x64 distribution. We expect it to port easily to any other Linux-based distribution. We do not use autotools, nor do we intend to add support for autotools. However, if you can get G++, GNU make and Boost to work on your system, chances are good that you can make istatd work on your system.

Q: What's wrong with Zabbix, OpenTSDB, Graphite, Cacti, RRD, or {insert other tool here}?

  • A: We've been using Cacti with RRD to track our production systems. However, as we push our continuous deployment process forward, and as we scale our systems to keep up with the success of IMVU, we find that this solution is no longer sufficient. We tested a number of other systems against the requirement that we can get 10-second resolution data from hundreds of thousands of counter names into the system. The other systems either couldn't keep up, or would have required a large server cluster of their own just to keep up. "Yo dawg, I heard you like clusters, so I put a cluster in your cluster so you can manage clusters while you manage clusters..."

  • A second (but not secondary) requirement was statistics. We want to know not just the average of the values, but also the minimum, maximum, and standard deviation of the many samples that go into a single statistic. We use this information to tell whether a particular change in software, deployed at a particular point in time, had a noticeable impact on important system metrics. None of the other systems we looked at calculate or store these values natively, and bolting them on just worsened the already poor resource usage of those systems.

  • istatd does it all on a single machine (with a second copy as stand-by, should the first die.)
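As a rough illustration of the statistics requirement above, here is a minimal Python sketch of a per-interval accumulator that yields average, minimum, maximum, and standard deviation from a stream of samples. The `Bucket` class and its field names are hypothetical, and keeping a running sum of squares is an assumption made for the sketch; this FAQ does not describe how istatd represents these values internally.

```python
import math
from dataclasses import dataclass


@dataclass
class Bucket:
    """Hypothetical accumulator for one counter over one time interval."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0          # assumption: running sum of squares
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value: float) -> None:
        # Fold one sample into the running aggregates.
        self.count += 1
        self.total += value
        self.total_sq += value * value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def average(self) -> float:
        return self.total / self.count if self.count else 0.0

    def stddev(self) -> float:
        # Population standard deviation derived from the aggregates.
        if self.count == 0:
            return 0.0
        mean = self.average()
        # Clamp at zero to absorb floating-point rounding.
        variance = max(self.total_sq / self.count - mean * mean, 0.0)
        return math.sqrt(variance)


bucket = Bucket()
for sample in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    bucket.add(sample)

print(bucket.count, bucket.average(), bucket.minimum, bucket.maximum, bucket.stddev())
```

Because only a handful of aggregates are kept per interval, the individual samples do not need to be stored in order to report the spread of the data.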

Q: How do multiple retention times work? Does the amount of calculation grow as data grows?

  • A: The amount of time data is kept at each resolution is defined with the --retention configuration option. As soon as a data point for a counter is first seen, istatd allocates all the space needed to save data for as long as specified. The drawback is that the retention for a particular counter is fixed to the configuration in effect when the counter was created.

  • istatd calculates count, sum, variance, min, and max at the time of data capture. It writes to all configured retention levels at the same time. This means that, if you save 10 days of 10-second resolution data and a year of 5-minute resolution data, the newest 10 days of the 5-minute data overlap the 10-second data. However, this ends up being beneficial when a request comes in to read data for a time interval that spans retention levels. For example, reading data from 12 days ago to 9 days ago would use the 5-minute resolution, as the 10-second resolution doesn't cover the entire interval.

  • Additionally, istatd keeps writing zeros to a counter's file even when no data comes in. This is done through the cyclic "flush" mechanism, which rotates through all counters and flushes accumulated data to disk, to limit data loss in case of a system crash. As a result, the amount of I/O and disk space needed by istatd is well defined by the number of counters that are stored, and only incrementally affected by the amount of incoming counter traffic. A typical modern server will be limited by I/O rather than CPU usage, especially as istatd scales well across as many threads as you want. (Above 256 threads, the sharded locking would have to be expanded for optimal performance, which is not an actual problem for us yet ;-)
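The retention example in this answer can be sketched in Python as follows. The `RetentionLevel` type, the `pick_level` function, and the selection rule (use the finest resolution whose retention covers the oldest requested sample) are assumptions drawn from the description here, not istatd's actual code or configuration syntax.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

DAY = 86400  # seconds per day


@dataclass(frozen=True)
class RetentionLevel:
    """Hypothetical description of one retention level."""
    resolution_s: int   # seconds covered by each bucket
    retention_s: int    # how far back this level reaches, in seconds

    def bucket_count(self) -> int:
        # Buckets allocated up front for one counter at this level.
        return self.retention_s // self.resolution_s


# The two levels used as the example in this answer.
LEVELS = (
    RetentionLevel(resolution_s=10, retention_s=10 * DAY),
    RetentionLevel(resolution_s=300, retention_s=365 * DAY),
)


def pick_level(levels: Sequence[RetentionLevel],
               oldest_age_s: int) -> Optional[RetentionLevel]:
    """Return the finest level that still covers the oldest requested sample."""
    covering = [lvl for lvl in levels if lvl.retention_s >= oldest_age_s]
    if not covering:
        return None  # the request reaches further back than any level
    return min(covering, key=lambda lvl: lvl.resolution_s)


# A request from 12 days ago to 9 days ago: the oldest sample is 12 days old.
chosen = pick_level(LEVELS, 12 * DAY)
print(chosen.resolution_s)                      # 300
print([lvl.bucket_count() for lvl in LEVELS])   # [86400, 105120]
```

Since the bucket count per level depends only on the configured retention, the total storage per counter is fixed at creation time, which is consistent with disk usage being determined by the number of counters rather than by traffic.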