
Switching cassandra3 to span2 model #1695 #1758

Merged
merged 10 commits into master from mck/cassandra3-use-span2 on Nov 12, 2017

Conversation

michaelsembwever (Member) commented Oct 5, 2017

#1695

(a lot here is from @adriancole !)

llinder (Member) commented Oct 6, 2017

Looks great so far! Thanks a ton for working on this @michaelsembwever

@@ -1,88 +1,79 @@
-CREATE KEYSPACE IF NOT EXISTS zipkin3 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
+CREATE KEYSPACE IF NOT EXISTS zipkin2_cassandra3 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
michaelsembwever (Member Author):

in this file i still have to add some comments, and i want to duplicate all comments into the schema's table-level WITH COMMENT = '…'; so that operators get that information too…

michaelsembwever (Member Author):

done.

@michaelsembwever force-pushed the mck/cassandra3-use-span2 branch 2 times, most recently from f1f6bd8 to a7ca588 on October 8, 2017 11:42
michaelsembwever (Member Author) commented Oct 8, 2017

Some experimental benchmarking, aiming only to give a rough latency spread at a safe throughput rate.

  • MacBook Pro (macOS 10.13, 3.5 GHz dual-core Intel Core i7, 16 GB 2133 MHz, SSD)
  • CCM (4 nodes, Cassandra-3.11.0)
  • ccm stress run for each of the three profiles in write mode beforehand
  • stress testing ran for 6 hours (overnight), but throttled to very low throughput.

Results:

ccm stress user profile=trace-stress.yaml ops\(insert=1,by_trace=1,by_trace_ts_id=1,by_annotation=1\) no-warmup duration=6h -rate threads=4 throttle=50/s

Results:
Op rate                   :       50 op/s  [by_annotation: 12 op/s, by_trace: 13 op/s, by_trace_ts_id: 13 op/s, insert: 13 op/s]
Partition rate            :       50 pk/s  [by_annotation: 12 pk/s, by_trace: 13 pk/s, by_trace_ts_id: 13 pk/s, insert: 13 pk/s]
Row rate                  :      174 row/s [by_annotation: 137 row/s, by_trace: 13 row/s, by_trace_ts_id: 13 row/s, insert: 13 row/s]
Latency mean              :   10.1 ms [by_annotation: 31.9 ms, by_trace: 1.8 ms, by_trace_ts_id: 1.8 ms, insert: 5.1 ms]
Latency median            :    2.5 ms [by_annotation: 23.7 ms, by_trace: 0.6 ms, by_trace_ts_id: 0.7 ms, insert: 3.1 ms]
Latency 95th percentile   :   43.6 ms [by_annotation: 77.7 ms, by_trace: 6.3 ms, by_trace_ts_id: 6.3 ms, insert: 15.4 ms]
Latency 99th percentile   :   84.7 ms [by_annotation: 125.1 ms, by_trace: 20.6 ms, by_trace_ts_id: 20.5 ms, insert: 37.5 ms]
Latency 99.9th percentile :  159.6 ms [by_annotation: 221.4 ms, by_trace: 65.2 ms, by_trace_ts_id: 65.3 ms, insert: 88.3 ms]

ccm stress user profile=trace_by_service_span-stress.yaml ops\(insert=1,select=1,by_duration=1\) no-warmup duration=6h -rate threads=4 throttle=50/s

Results:
Op rate                   :       50 op/s  [by_duration: 17 op/s, insert: 17 op/s, select: 17 op/s]
Partition rate            :       33 pk/s  [by_duration: 0 pk/s, insert: 17 pk/s, select: 17 pk/s]
Row rate                  :       33 row/s [by_duration: 0 row/s, insert: 17 row/s, select: 17 row/s]
Latency mean              :    2.6 ms [by_duration: 4.0 ms, insert: 1.8 ms, select: 1.9 ms]
Latency median            :    0.8 ms [by_duration: 1.8 ms, insert: 0.6 ms, select: 0.6 ms]
Latency 95th percentile   :   11.1 ms [by_duration: 15.9 ms, insert: 7.5 ms, select: 8.1 ms]
Latency 99th percentile   :   27.9 ms [by_duration: 35.0 ms, insert: 22.6 ms, select: 22.8 ms]
Latency 99.9th percentile :   73.3 ms [by_duration: 82.4 ms, insert: 64.2 ms, select: 69.9 ms]

ccm stress user profile=span_by_service-stress.yaml ops\(insert=1,select=1,select_spans=1\) no-warmup duration=6h -rate threads=4 throttle=50/s

Results:
Op rate                   :       50 op/s  [insert: 17 op/s, select: 17 op/s, select_spans: 17 op/s]
Partition rate            :       50 pk/s  [insert: 17 pk/s, select: 17 pk/s, select_spans: 17 pk/s]
Row rate                  :    1,707 row/s [insert: 17 row/s, select: 1,673 row/s, select_spans: 17 row/s]
Latency mean              :    7.5 ms [insert: 1.7 ms, select: 18.9 ms, select_spans: 1.8 ms]
Latency median            :    0.8 ms [insert: 0.5 ms, select: 11.2 ms, select_spans: 0.5 ms]
Latency 95th percentile   :   34.3 ms [insert: 7.4 ms, select: 57.1 ms, select_spans: 8.1 ms]
Latency 99th percentile   :   69.5 ms [insert: 21.5 ms, select: 93.3 ms, select_spans: 22.5 ms]
Latency 99.9th percentile :  136.4 ms [insert: 63.6 ms, select: 176.8 ms, select_spans: 68.5 ms]

As expected, the writes are roughly on par across the different tables (all bound by the same hardware).
Also as expected, reads are fast when selecting traces by their IDs, when finding trace IDs by service+span, and when selecting span names by service name.
The three SASI indexes all perform well, at roughly 5x the latency of the fast (select-by-partition-key) reads.
Of the three SASI-indexed reads, searching "by_annotation" in the traces table is the slowest. This is expected, as it actually uses two SASI indexes, one of which is the only one using CONTAINS mode, which permits full-text searching.

codefromthecrypt (Member) commented Oct 9, 2017 via email

michaelsembwever (Member Author) commented Oct 9, 2017

@adriancole ta.
Two commits added:

  • changing the dependency table to store the individual fields of DependencyLink, rather than the whole list of links as a blob,
  • refactoring zipkin.storage.cassandra3 to zipkin2.storage.cassandra3

michaelsembwever added a commit that referenced this pull request Oct 9, 2017
Upgrade cassandra3 storage backend to use the zipkin2.* API instead of zipkin.*
Bump cassandra3 backend to require minimum Java8.
Take service_name out of the annotation search index, using a combination of two SASI instead.
Remove TraceIdUDT and BinaryAnnotationUDT. Trace ID is now stored as text instead of a UDT.
Adds the ListenableFutureCall class, which handles our Call->guava interactions.
Change the default keyspace to "zipkin2_cassandra3", making it explicit it's a fresh start and breaking compatibility.
Add the precondition check that SimpleStrategy and LOCAL_* consistency levels cannot be used together.

ref:
 - #1695
 - #1758
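
For context, a minimal sketch of what such a precondition check could look like, with a hypothetical helper name (not the PR's actual code): SimpleStrategy ignores datacenter topology, so LOCAL_* consistency levels cannot behave as intended against it.

import com.datastax.driver.core.ConsistencyLevel;

class SchemaPreconditions {
  // Hypothetical sketch: reject LOCAL_* consistency levels when the keyspace
  // uses SimpleStrategy, which is not datacenter-aware.
  static void checkStrategyAndConsistency(String replicationClass, ConsistencyLevel cl) {
    if (replicationClass.endsWith("SimpleStrategy") && cl.name().startsWith("LOCAL_")) {
      throw new IllegalArgumentException(
          "SimpleStrategy cannot be used with " + cl + " consistency");
    }
  }
}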
@michaelsembwever changed the title from "WIP – Consider switching cassandra3 to span2 model #1695" to "Switching cassandra3 to span2 model #1695" on Oct 9, 2017
codefromthecrypt (Member) commented Oct 9, 2017 via email

michaelsembwever (Member Author):

@adriancole oh, i edited my comment above. both items have been pushed already (and exist as separate commits).

codefromthecrypt (Member) commented Oct 9, 2017 via email

michaelsembwever (Member Author):

LGTM. Thanks for a lot of the cleanup in the last commit: you caught a few silly bugs and typos there!

codefromthecrypt (Member):

Just added a commit to scrub the v1 compile dep (also preventing re-preparing).
One package change I made was autoconfiguration: right now, zipkin-server is still based on the v1 dep (basically it has v2 endpoints bolted on).

Sometime in the near future we'll add a v2 zipkin-server, which removes the whole dependency tree on v1 types (and Spring Boot 1.5). So for now, the autoconfiguration for cassandra3 makes most sense in the v1 package (as its only user is the v1 zipkin-server).

bucket int, //-- time bucket, calculated as ts/interval (in microseconds), for some pre-configured interval like 1 day.
ts timeuuid, //-- start timestamp of the span, truncated to millisecond precision
-trace_id frozen<trace_id>, //-- trace ID
+trace_id text, //-- trace ID
duration bigint, //-- span duration, in microseconds
michaelsembwever (Member Author):

@llinder @fedj
i want to change duration to be stored as tens of milliseconds (i.e. hundredths of a second).
this will make the SASI on this column contain fewer distinct values, improving write performance.
and i can't see users being so discerning about the searches they perform in the UI.

duration would be rounded up, so the result list would not skip any results. and duration in the span table would remain accurate to microseconds.

any objections?
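
As a worked illustration of the rounding described above (the helper name is an assumption, not code from this PR):

// Coarsen a microsecond duration to tens of milliseconds (10,000 µs units).
// Ceiling division rounds UP, so a search by minimum duration never skips a
// qualifying trace; the span table itself keeps microsecond accuracy.
static long toTensOfMillis(long durationMicros) {
  return (durationMicros + 10_000 - 1) / 10_000;
}
// ex. toTensOfMillis(123_456) == 13, i.e. the 130 ms bucket.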

Member:

Sounds like a good optimization to me.

Member:

cc @openzipkin/cassandra

michaelsembwever (Member Author):

added as an extra commit. can be undone if any objections/concerns.

michaelsembwever added a commit that referenced this pull request Oct 11, 2017
…s of seconds).

this will make the SASI on this column contain fewer distinct values, improving write performance.
users won't be discerning within 10 milliseconds about searches they perform in the UI, in regards to the duration field.

duration is always rounded up, so the result list would not skip any results. and duration in the span table would remain accurate to microseconds.

ref: #1758 (comment)
codefromthecrypt pushed a commit that referenced this pull request Oct 12, 2017
…s of seconds).
codefromthecrypt (Member):

fyi refactored to reduce commits, but intentionally left the important ones (relevant for performance)

codefromthecrypt pushed a commit that referenced this pull request Oct 18, 2017
…s of seconds).
ts_uuid timeuuid,
trace_id_high text, // when strictTraceId=false, contains right-most 16 chars if present
michaelsembwever (Member Author):

did you mean "left-most 16 chars" here?
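
For context: a 128-bit Zipkin trace ID is 32 lower-hex characters, and its high bits are the left-most 16. A minimal sketch of the split, with illustrative helper names (not the PR's actual code):

static String traceIdHigh(String traceId) {
  // 32-char IDs are 128-bit; the high half is the LEFT-most 16 characters.
  return traceId.length() == 32 ? traceId.substring(0, 16) : "";
}

static String traceIdLow(String traceId) {
  // 64-bit (16-char) IDs are stored whole in the low column.
  return traceId.length() == 32 ? traceId.substring(16) : traceId;
}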

codefromthecrypt (Member):

ok I have 23 failures locally.. trying to move that number below 20 soon :)

codefromthecrypt (Member) commented Oct 18, 2017 via email

codefromthecrypt (Member):

16 failures locally

codefromthecrypt (Member):

Down to 10 failures

codefromthecrypt (Member):

@openzipkin/cassandra ok I'm ready to merge this. You might want to check how pre-computed queries are delimited in the annotation_query column. They are braced with an unlikely character to stop us from accidentally matching on a substring in the API, e.g. bracing like ░error░ ensures we don't accidentally match a tag key named not_error.
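
A minimal sketch of that delimiting idea, with illustrative names (not the PR's exact code):

import java.util.List;

class AnnotationQuery {
  // An unlikely character braces every term, so searching for "░error░"
  // cannot match a substring such as the tag key "not_error".
  static final char DELIMITER = '░';

  // Join queryable terms into the single indexed annotation_query value,
  // ex. ["error", "http.path"] -> "░error░http.path░"
  static String encode(List<String> terms) {
    StringBuilder result = new StringBuilder().append(DELIMITER);
    for (String term : terms) result.append(term).append(DELIMITER);
    return result.toString();
  }

  // Brace the search term the same way before the lookup.
  static String searchTerm(String term) {
    return DELIMITER + term + DELIMITER;
  }
}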

annotation_query,
Boolean.TRUE.equals(span.debug()),
Boolean.TRUE.equals(span.shared())
);
michaelsembwever (Member Author):

Returning to the issue of null bindings resulting in tombstones.

There are 8 columns here that can be null (maybe more?).

If there are consistently 8 tombstones (nulls) per row, then only 125 spans in a trace (rows in a partition) are needed to trigger tombstone_warn_threshold warnings in the C* nodes' logs, since 8 × 125 = 1,000. And at 12,500 spans in a trace, the whole trace partition would become unreadable: by default, Cassandra warns at 1,000 tombstones in any query and fails at 100,000 tombstones.

There's also a small question of disk usage efficiency. Each tombstone is a cell name plus an essentially empty cell value stored on disk. Given that the cells (apart from tags and annotations) are generally very small, this could be a proportionally significant waste of disk.

The normal practice to avoid this is to rely on a number of variant prepared statements for inserting a span.

Another popular practice is to insert the potentially null columns as separate statements, optionally grouped into UNLOGGED batches (a sketch follows at the end of this comment). This works because multiple writes to the same partition have little overhead, and we're not worried about the lack of isolation between those writes here, as the write is asynchronous anyway. An example of this approach is in the cassandra-reaper project here: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/server/src/main/java/io/cassandrareaper/storage/CassandraStorage.java#L622-L642

UPDATE: a few lines down, under protected ResultSetFuture newFuture(), the nulls are in fact not written (not bound).
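
For concreteness, a minimal sketch of the unlogged-batch practice described above, using the DataStax 3.x driver; the prepared statements and column split are hypothetical, not this PR's code:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.UUID;

class SpanWriter {
  final Session session;
  // One statement for the required columns, plus one per optional column, so
  // a null is never bound (and no tombstone is ever written).
  final PreparedStatement insertCore, insertName, insertDuration;

  SpanWriter(Session session, PreparedStatement insertCore,
      PreparedStatement insertName, PreparedStatement insertDuration) {
    this.session = session;
    this.insertCore = insertCore;
    this.insertName = insertName;
    this.insertDuration = insertDuration;
  }

  void write(String traceId, UUID tsUuid, String name, Long durationMicros) {
    // All statements target the same partition, so the UNLOGGED batch is
    // cheap, and isolation between them isn't needed: the write is
    // asynchronous anyway.
    BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
    batch.add(insertCore.bind(traceId, tsUuid));
    if (name != null) batch.add(insertName.bind(traceId, tsUuid, name));
    if (durationMicros != null) {
      batch.add(insertDuration.bind(traceId, tsUuid, durationMicros));
    }
    session.executeAsync(batch);
  }
}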

Member:

thanks, captured

bound.setString("annotation_query", input.annotation_query());
}
if (input.shared()) bound.setBool("shared", true);
if (input.debug()) bound.setBool("debug", true);
michaelsembwever (Member Author):

the code above avoids writing the nulls (tombstones) :-)

cql: SELECT * FROM trace_by_service_span WHERE service = ? AND span = ? AND bucket = ? AND duration < ? LIMIT 1
fields: samerow
michaelsembwever (Member Author):

it looks like we're missing two queries here:

  • search by timestamp range,
  • search by timestamp range and duration

Member:

I couldn't quickly find out how to do a range query in a stress test, so left it out for now with a TODO

michaelsembwever (Member Author) left a comment:

all LGTM! approved.
(i added a small comment about adding two extra queries to the stress tests…)

codefromthecrypt pushed a commit that referenced this pull request Nov 12, 2017
@codefromthecrypt force-pushed the mck/cassandra3-use-span2 branch 2 times, most recently from 5cb21c1 to e0a22db on November 12, 2017 03:09
codefromthecrypt pushed a commit that referenced this pull request Nov 12, 2017
codefromthecrypt (Member):

ok I've got the dependencies job working.. just waiting for green

michaelsembwever and others added 7 commits November 12, 2017 11:21
Upgrade cassandra3 storage backend to use the zipkin2.* API instead of zipkin.*
Bump cassandra3 backend to require minimum Java8.
Take service_name out of the annotation search index, using a combination of two SASI instead.
Remove TraceIdUDT and BinaryAnnotationUDT. Trace ID is now stored as text instead of a UDT.
Adds the ListenableFutureCall class, which handles our Call->guava interactions.
Change the default keyspace to "zipkin2_cassandra3", making it explicit it's a fresh start and breaking compatibility.
Add the precondition check that SimpleStrategy and LOCAL_* consistency levels cannot be used together.
… a separate column.

All links for each day still remain in one partition key, keeping selects simple.
But the data is now transparent, and just as performant and compact (due to compression).
 - DeduplicatingExecutor around writes to span_by_service,
 - 40k max queue length in cql driver,
 - durable_writes disabled (no commitlog on disk),
 - disable read repairs,
 - lower gc_grace to 3 hours (match hint window),
 - row cache span_by_service,
 - increase server-side speculative retries…
…s of seconds).
Adrian Cole added 3 commits November 12, 2017 11:44
This extracts out calls in an effort to make the overall flow inside Cassandra easier to read and the commands themselves easier to debug.
@codefromthecrypt merged commit de80609 into master on Nov 12, 2017
codefromthecrypt pushed a commit that referenced this pull request Nov 12, 2017
@codefromthecrypt deleted the mck/cassandra3-use-span2 branch on November 12, 2017 05:56
codefromthecrypt (Member):

Will release 2.3 once master is green. Epic help, @michaelsembwever and @llinder; thanks for all the work on the experimental driver preceding this, including the spark code.

abesto pushed a commit to abesto/zipkin that referenced this pull request Sep 10, 2019