Extremely poor query performance against a new C* cluster #187

Closed
alsonkemp opened this Issue Jun 20, 2015 · 4 comments

alsonkemp commented Jun 20, 2015

We've been running KairosDB 0.9.4.6 against a Cassandra 2.1 cluster for about 6 months now. Our existing C* cluster filled up (due to an unexpected influx of data) and we migrated to a new, much larger cluster. Whereas C* v1 ran on EBS volumes, C* v2 is 6 larger machines with more memory running RAIDed hard disks (we have significant data storage requirements and fairly loose latency requirements).

We would expect some degradation in performance due to the change in storage, but instead we have seen a 10x-100x increase in query times. We are currently loading data into the cluster, so some slowdown is to be expected, yet we still saw a 10x degradation during the quiet period after compaction of our first load had finished and before the next load started. The C* cluster was essentially idle and KairosDB's performance was still very poor.

What is especially strange is that performance for some queries is not far from expectation. Examples:

  • Querying 1 minute of kairosdb.http.request_time takes 300ms
  • Querying 1 minute of kairosdb.datastore.cassandra.key_query_time just took 54,414 ms

Sure, kairosdb.http.request_time shouldn't take 300ms, but we're hammering the cluster right now, so some allowance is due. Clearly, querying for 1 minute of kairosdb.datastore.cassandra.key_query_time should not take 54 seconds.
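
For reference, the comparison above can be reproduced with a small timing script against KairosDB's REST query endpoint. This is a minimal sketch: the host and port assume a default KairosDB install on localhost:8080, and only wall-clock latency is measured.

    import json
    import time
    import urllib.request

    # Assumes a default KairosDB install listening on localhost:8080; adjust as needed.
    KAIROS_QUERY_URL = "http://localhost:8080/api/v1/datapoints/query"

    def time_one_minute_query(metric_name):
        """POST a 1-minute relative query and return the wall-clock latency in ms."""
        body = json.dumps({
            "start_relative": {"value": 1, "unit": "minutes"},
            "metrics": [{"name": metric_name}],
        }).encode("utf-8")
        request = urllib.request.Request(
            KAIROS_QUERY_URL, data=body,
            headers={"Content-Type": "application/json"})
        start = time.monotonic()
        with urllib.request.urlopen(request) as response:
            response.read()
        return (time.monotonic() - start) * 1000

    for metric in ("kairosdb.http.request_time",
                   "kairosdb.datastore.cassandra.key_query_time"):
        print(metric, "%.0f ms" % time_one_minute_query(metric))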

If I look at the results of kairosdb.datastore.query_time, Kairos is claiming that most query times are over 5 seconds. However, if I log into the cluster with cqlsh, I can query the kairosdb.data_points table without significant latency.
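
The cqlsh cross-check can also be scripted. The following is a sketch using the DataStax Python driver rather than the Hector/Thrift path KairosDB itself uses; the contact point is taken from the log below and the keyspace/table names assume the default kairosdb schema.

    import time
    from cassandra.cluster import Cluster  # DataStax Python driver

    # Illustration only: KairosDB 0.9.x/1.0 goes through Hector/Thrift, not this driver.
    # Contact point taken from the log below; the new cluster uses authentication, so an
    # auth_provider (e.g. PlainTextAuthProvider) would also be needed in practice.
    cluster = Cluster(["10.203.1.116"])
    session = cluster.connect("kairosdb")

    start = time.monotonic()
    rows = list(session.execute("SELECT * FROM data_points LIMIT 100"))
    elapsed_ms = (time.monotonic() - start) * 1000
    print("fetched %d rows in %.0f ms" % (len(rows), elapsed_ms))

    cluster.shutdown()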

Other changes:

  • We have upgraded to KairosDB 1.0 (but have verified that the same issue occurs with 0.9.4.6).
  • The new cluster uses authentication whereas our prior cluster did not use authentication.
  • I have turned on Hector's connection keep-alive.

I just saw an exception in Kairos' logs:

20:22:43.933 [qtp69567526-35] WARN  [HConnectionManager.java:302] - Could not fullfill request on this host CassandraClient<10.203.1.116:9160-23>
20:22:43.937 [qtp69567526-35] WARN  [HConnectionManager.java:303] - Exception: 
me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
    at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:42) ~[hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl$15.execute(KeyspaceServiceImpl.java:563) ~[hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl$15.execute(KeyspaceServiceImpl.java:549) ~[hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104) ~[hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253) ~[hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:132) [hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.service.KeyspaceServiceImpl.multigetSlice(KeyspaceServiceImpl.java:567) [hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery$1.doInKeyspace(ThriftMultigetSliceQuery.java:68) [hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery$1.doInKeyspace(ThriftMultigetSliceQuery.java:59) [hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20) [hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:101) [hector-core-1.1-4.jar:na]
    at me.prettyprint.cassandra.model.thrift.ThriftMultigetSliceQuery.execute(ThriftMultigetSliceQuery.java:58) [hector-core-1.1-4.jar:na]
    at org.kairosdb.datastore.cassandra.QueryRunner.runQuery(QueryRunner.java:112) [kairosdb-1.0.0-1.jar:1.0.0-1.20150604225857]
[... snip ...]

Has this issue been observed previously? Do you have any pointers for diagnosing or fixing this issue?

lcoulet commented Jun 22, 2015

I saw this TimedOutException cause problems a couple of weeks ago: it hung KairosDB after we stopped one node of the Cassandra cluster on purpose in a test environment.
We had a write consistency and read consistency of 1, and an RF of 3 on a 6-node cluster using RAID disks.
I've not been able to reproduce the issue since then, but I'd like to understand what happened; it may be similar or related.

As far as I can tell, the problem occurred on the write path in our case.
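
For context on that consistency setup, here is a minimal sketch with the DataStax Python driver (again, not the Hector/Thrift path KairosDB uses) of issuing a read at consistency level ONE; with RF=3, only one of the three replicas has to answer, so in principle a single stopped node should not be enough on its own to cause a TimedOutException once the failure has been detected. The contact point is taken from the log above.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # With RF=3 and consistency level ONE, only one of the three replicas has to
    # answer each read or write, so a single stopped node should not, on its own,
    # be enough to produce a timeout once the failure has been detected.
    cluster = Cluster(["10.203.1.116"])
    session = cluster.connect("kairosdb")

    statement = SimpleStatement(
        "SELECT * FROM data_points LIMIT 10",
        consistency_level=ConsistencyLevel.ONE)
    for row in session.execute(statement):
        print(row)

    cluster.shutdown()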

alsonkemp commented Jun 24, 2015

Over the past 10 days, we have moved our C*/K* cluster from EBS volumes to spinning disks to SSDs. From EBS -> spinning disks, our system metrics didn't suggest significant latency, but we saw very high query latency. From spinning disks -> SSDs, we saw query times improve quite a bit, but, in comparison to EBS volumes, the query times are still very long. At this point, we've got a significant difference in query performance between:

  • v1: a heavily loaded c4.4xlarge cluster which had some KairosDB nodes talking to Cassandra nodes with EBS volumes
  • v2: a lightly loaded i2.2xlarge cluster with dedicated KairosDB nodes talking to dedicated C* nodes with ephemeral SSD volumes

Is there something we could have done incorrectly while transferring the data? We have been using KairosDB for months and have never seen much latency. Having migrated data from a shared, EBS-backed cluster to a larger, more performant C* cluster, we're now seeing very serious latency.

alsonkemp commented Jun 25, 2015

I think I've figured out most of the performance issues, so I'll document them and close this ticket.

The performance anomaly around "http.request_time" was due to http.request_time having few tags, so it requires much less aggregation than query_time.
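
To illustrate the tag effect, here is a sketch of KairosDB's query JSON; the tag name and value are hypothetical. A metric with many tag combinations fans the query out over many row keys, while a tag filter narrows what Kairos has to read and aggregate.

    # Sketch of the tag-cardinality effect using KairosDB's query JSON.
    # The tag name/value ("host": "kairos-node-1") is hypothetical.
    # An unfiltered query fans out over every tag combination of the metric,
    # while a tag filter limits the row keys Kairos must read and aggregate.
    unfiltered_query = {
        "start_relative": {"value": 1, "unit": "minutes"},
        "metrics": [{"name": "kairosdb.datastore.query_time"}],
    }

    filtered_query = {
        "start_relative": {"value": 1, "unit": "minutes"},
        "metrics": [{
            "name": "kairosdb.datastore.query_time",
            "tags": {"host": ["kairos-node-1"]},  # hypothetical tag value
        }],
    }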

Another performance bottleneck had to do with the EBS performance of the Kairos boxes: the prior cluster used very large machines running many services specifically to get very high EBS bandwidth; the current cluster, for the moment, uses smaller instances and has much lower EBS performance. This means that all interactions with cached data are quite a bit slower than they were.

Closing.

alsonkemp closed this Jun 25, 2015

brianhks (Member) commented Jun 25, 2015

The query_time metric has several tags that break the total time down by what Kairos is doing at each stage. It would be interesting to know what was taking 54 seconds to return. We have had good success with c3.2xlarge C* instances with 1 TB of EBS storage. We are not using it for Kairos but for another application with a write-to-query ratio of about 1 to 4.
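
A sketch of such a query, grouping kairosdb.datastore.query_time by its tags to see which chunk dominates; "query_step" and "host" are placeholders for whatever tag names the metric actually carries in a given install.

    # Sketch of a query that breaks kairosdb.datastore.query_time down by tag.
    # "query_step" and "host" are placeholders; substitute the tag names the
    # metric actually carries in your install.
    grouped_query = {
        "start_relative": {"value": 1, "unit": "hours"},
        "metrics": [{
            "name": "kairosdb.datastore.query_time",
            "group_by": [{"name": "tag", "tags": ["query_step", "host"]}],
            "aggregators": [{
                "name": "max",
                "sampling": {"value": 1, "unit": "minutes"},
            }],
        }],
    }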
