Extremely poor query performance on queries against a new C* cluster #187
We've been running KairosDB 0.9.4.6 against a Cassandra 2.1 cluster for about 6 months now. Our existing C* cluster filled up (due to an unexpected influx of data), so we migrated to a new, much larger cluster. Whereas C* v1 ran on EBS volumes, C* v2 is 6 larger machines with more memory running RAIDed hard disks [since we have significant data storage requirements but fairly loose latency requirements].
We would expect some performance degradation due to the change in storage. Instead, we have seen a 10x-100x increase in query times. We are currently loading data into the cluster, so some degradation is to be expected, but we still saw 10x degradation in the quiet period between the completion of compaction for our first load and the start of the next load. The C* cluster was essentially idle, yet KairosDB's query performance was very poor.
What is especially strange is that query performance for some queries is not far from expectation. Examples:
Sure, kairosdb.http.request_time shouldn't take 300ms, but we're hammering the cluster right now, so some allowance is due. Clearly, querying for 1 minute of kairosdb.datastore.cassandra.key_query_time should not take 54 seconds.
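For reference, the slow query can be reproduced directly against the KairosDB REST API. A minimal sketch, assuming a local Kairos node on the default port 8080:

```python
import time

import requests

# Hedged sketch: time a 1-minute query for the metric that is slow for us.
# Hostname and port are assumptions (8080 is the KairosDB default).
KAIROS_URL = "http://localhost:8080/api/v1/datapoints/query"

query = {
    "start_relative": {"value": 1, "unit": "minutes"},
    "metrics": [{"name": "kairosdb.datastore.cassandra.key_query_time"}],
}

start = time.time()
resp = requests.post(KAIROS_URL, json=query)
resp.raise_for_status()
print("query took %.2fs, %d bytes returned" % (time.time() - start, len(resp.content)))
```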
If I look at the results for kairosdb.datastore.query_time, Kairos claims that most query times are over 5 seconds. However, if I log into the cluster with cqlsh, I can query the kairosdb.data_points table without significant latency.
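For comparison, the same read can be timed from Python against Cassandra directly, bypassing Kairos entirely. A rough sketch, assuming the default kairosdb keyspace and a reachable contact point (both are assumptions; adjust for your cluster):

```python
import time

from cassandra.cluster import Cluster

# Hedged sketch: time a raw read from the data_points table to compare
# against the latency Kairos reports. "cassandra-node-1" and the
# "kairosdb" keyspace name are assumptions.
cluster = Cluster(["cassandra-node-1"])
session = cluster.connect("kairosdb")

start = time.time()
rows = session.execute("SELECT * FROM data_points LIMIT 100")
count = sum(1 for _ in rows)
print("fetched %d rows in %.3fs" % (count, time.time() - start))

cluster.shutdown()
```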
I just saw an exception in Kairos' logs:
Has this issue been observed previously? Do you have any pointers for diagnosing or fixing this issue?
I saw this TimedOutException cause problems a couple of weeks ago: it hung KairosDB after we deliberately stopped one node of the Cassandra cluster in a test environment.
As far as I can tell, the problem occurred on the write path in our case.
Over the past 10 days, we have moved our C*/K* cluster from EBS volumes to spinning disks to SSDs. From EBS -> spinning disks, our system metrics didn't suggest significant latency, but we saw very high query latency. From spinning disks -> SSDs, we saw query times improve quite a bit, but, in comparison to EBS volumes, the query times are still very long. At this point, we've got a significant difference in query performance between:
Is there something we could have done incorrectly while transferring the data? We have been using KairosDB for months and have never seen much latency. Having migrated data from a shared, EBS-backed cluster to a larger, more performant C* cluster, we're now seeing very serious latency.
I think I've figured out most of the performance issues, so I will document them and close this ticket.
The performance anomaly around http.request_time was due to that metric having few tags, so it requires much less aggregation than query_time.
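For intuition, here is a back-of-the-envelope sketch. The tag names and cardinalities are invented, but the point is that the number of distinct series Kairos must read and merge is the product of the tag-value counts:

```python
from functools import reduce
from operator import mul

# Hypothetical tag cardinalities, for illustration only: the number of
# distinct series (row keys) a query touches is the product of the value
# counts of its tags, and aggregation cost grows with the series count.
http_request_time_tags = {"host": 6}                           # few tags -> few series
query_time_tags = {"host": 6, "metric_name": 400, "query_index": 8}

for name, tags in [("http.request_time", http_request_time_tags),
                   ("query_time", query_time_tags)]:
    series = reduce(mul, tags.values(), 1)
    print("%s: ~%d series to aggregate" % (name, series))
```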
Another performance bottleneck had to do with the EBS performance of the Kairos boxes: the prior cluster used very large machines running many services specifically to get very high EBS bandwidth; the current cluster, for the moment, uses smaller instances and has much lower EBS performance. This means that all interactions with cached data are quite a bit slower than they were.
The query_time metric has several tags that break the time up into chunks according to what Kairos is doing; it would be interesting to know which chunk was taking 56 sec to return (see the sketch below). We have had good success with c3.2xlarge C* instances with 1T of EBS storage. We are not using that cluster for Kairos but for another application whose write-to-query ratio is roughly 1:4.
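To see which chunk dominates, you could group query_time by its tags through the REST API. A hedged sketch; the tag name to group on ("host" here) is an assumption, so check what tags the metric actually reports first:

```python
import requests

# Hedged sketch: group kairosdb.datastore.query_time by a tag to see which
# part of query handling dominates. The grouping tag is an assumption --
# POST to /api/v1/datapoints/query/tags to list the real tags first.
query = {
    "start_relative": {"value": 1, "unit": "hours"},
    "metrics": [{
        "name": "kairosdb.datastore.query_time",
        "group_by": [{"name": "tag", "tags": ["host"]}],
        "aggregators": [{"name": "max", "sampling": {"value": 1, "unit": "minutes"}}],
    }],
}

resp = requests.post("http://localhost:8080/api/v1/datapoints/query", json=query)
resp.raise_for_status()
for result in resp.json()["queries"][0]["results"]:
    print(result.get("group_by"), "last max values:", result["values"][-3:])
```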