
Switching cassandra3 to span2 model #1695 #1758

Merged
merged 10 commits into master from mck/cassandra3-use-span2 on Nov 12, 2017

Conversation

michaelsembwever (Member) commented Oct 5, 2017

#1695

(a lot here is from @adriancole !)

llinder (Member) commented Oct 6, 2017

Looks great so far! Thanks a ton for working on this @michaelsembwever

@@ -1,88 +1,79 @@
-CREATE KEYSPACE IF NOT EXISTS zipkin3 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
+CREATE KEYSPACE IF NOT EXISTS zipkin2_cassandra3 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
michaelsembwever (Member Author):

in this file i still have to add some comments, and i want to duplicate all comments into the schema's table-level WITH COMMENT = '…'; so that operators get that information too…

michaelsembwever (Member Author):

done.

@michaelsembwever force-pushed the mck/cassandra3-use-span2 branch 2 times, most recently from f1f6bd8 to a7ca588 on October 8, 2017 11:42
michaelsembwever (Member Author) commented Oct 8, 2017

Some experimental benchmarking, aiming only to give a rough latency spread at a safe throughput rate.

  • MacBook Pro (macOS 10.13, 3.5 GHz dual-core Intel Core i7, 16 GB 2133 MHz, SSD)
  • CCM (4 nodes, Cassandra-3.11.0)
  • ccm stress run for each of the three profiles in write mode beforehand
  • stress testing ran for 6 hours (overnight), but throttled to very low throughput.

Results:

ccm stress user profile=trace-stress.yaml ops\(insert=1,by_trace=1,by_trace_ts_id=1,by_annotation=1\) no-warmup duration=6h -rate threads=4 throttle=50/s

Results:
Op rate                   :       50 op/s  [by_annotation: 12 op/s, by_trace: 13 op/s, by_trace_ts_id: 13 op/s, insert: 13 op/s]
Partition rate            :       50 pk/s  [by_annotation: 12 pk/s, by_trace: 13 pk/s, by_trace_ts_id: 13 pk/s, insert: 13 pk/s]
Row rate                  :      174 row/s [by_annotation: 137 row/s, by_trace: 13 row/s, by_trace_ts_id: 13 row/s, insert: 13 row/s]
Latency mean              :   10.1 ms [by_annotation: 31.9 ms, by_trace: 1.8 ms, by_trace_ts_id: 1.8 ms, insert: 5.1 ms]
Latency median            :    2.5 ms [by_annotation: 23.7 ms, by_trace: 0.6 ms, by_trace_ts_id: 0.7 ms, insert: 3.1 ms]
Latency 95th percentile   :   43.6 ms [by_annotation: 77.7 ms, by_trace: 6.3 ms, by_trace_ts_id: 6.3 ms, insert: 15.4 ms]
Latency 99th percentile   :   84.7 ms [by_annotation: 125.1 ms, by_trace: 20.6 ms, by_trace_ts_id: 20.5 ms, insert: 37.5 ms]
Latency 99.9th percentile :  159.6 ms [by_annotation: 221.4 ms, by_trace: 65.2 ms, by_trace_ts_id: 65.3 ms, insert: 88.3 ms]

ccm stress user profile=trace_by_service_span-stress.yaml ops\(insert=1,select=1,by_duration=1\) no-warmup duration=6h -rate threads=4 throttle=50/s

Results:
Op rate                   :       50 op/s  [by_duration: 17 op/s, insert: 17 op/s, select: 17 op/s]
Partition rate            :       33 pk/s  [by_duration: 0 pk/s, insert: 17 pk/s, select: 17 pk/s]
Row rate                  :       33 row/s [by_duration: 0 row/s, insert: 17 row/s, select: 17 row/s]
Latency mean              :    2.6 ms [by_duration: 4.0 ms, insert: 1.8 ms, select: 1.9 ms]
Latency median            :    0.8 ms [by_duration: 1.8 ms, insert: 0.6 ms, select: 0.6 ms]
Latency 95th percentile   :   11.1 ms [by_duration: 15.9 ms, insert: 7.5 ms, select: 8.1 ms]
Latency 99th percentile   :   27.9 ms [by_duration: 35.0 ms, insert: 22.6 ms, select: 22.8 ms]
Latency 99.9th percentile :   73.3 ms [by_duration: 82.4 ms, insert: 64.2 ms, select: 69.9 ms]

ccm stress user profile=span_by_service-stress.yaml ops\(insert=1,select=1,select_spans=1\) no-warmup duration=6h -rate threads=4 throttle=50/s

Results:
Op rate                   :       50 op/s  [insert: 17 op/s, select: 17 op/s, select_spans: 17 op/s]
Partition rate            :       50 pk/s  [insert: 17 pk/s, select: 17 pk/s, select_spans: 17 pk/s]
Row rate                  :    1,707 row/s [insert: 17 row/s, select: 1,673 row/s, select_spans: 17 row/s]
Latency mean              :    7.5 ms [insert: 1.7 ms, select: 18.9 ms, select_spans: 1.8 ms]
Latency median            :    0.8 ms [insert: 0.5 ms, select: 11.2 ms, select_spans: 0.5 ms]
Latency 95th percentile   :   34.3 ms [insert: 7.4 ms, select: 57.1 ms, select_spans: 8.1 ms]
Latency 99th percentile   :   69.5 ms [insert: 21.5 ms, select: 93.3 ms, select_spans: 22.5 ms]
Latency 99.9th percentile :  136.4 ms [insert: 63.6 ms, select: 176.8 ms, select_spans: 68.5 ms]

As expected, the writes are roughly on par across the different tables (all bound by the same hardware).
Also as expected, reads are fast when selecting traces by their IDs, when finding trace IDs by service+span, and when selecting span names by service name.
The three SASI indexes all perform well, at roughly 5x the latency of the fast (select-by-partition-key) reads.
Of the three SASI-indexed reads, searching "by_annotation" in the traces table is the slowest. This is expected, as it actually uses two SASI indexes, one of which is the only one using CONTAINS mode, which permits full-text searching.

codefromthecrypt (Member) commented Oct 9, 2017 via email

michaelsembwever (Member Author) commented Oct 9, 2017

@adriancole ta.
Two commits added:

  • changing the dependency table to store the individual fields of DependencyLink, rather than the whole list of links as a blob,
  • refactoring zipkin.storage.cassandra3 to zipkin2.storage.cassandra3

michaelsembwever added a commit that referenced this pull request Oct 9, 2017
Upgrade cassandra3 storage backend to use the zipkin2.* API instead of zipkin.*
Bump cassandra3 backend to require minimum Java8.
Take service_name out of the annotation search index, using a combination of two SASI instead.
Remove TraceIdUDT and BinaryAnnotationUDT. Trace ID is now stored as text instead of a UDT.
Adds the ListenableFutureCall class, which handles our Call->guava interactions.
Change the default keyspace to "zipkin2_cassandra3", making it explicit it's a fresh start and breaking compatibility.
Add the precondition check that SimpleStrategy and LOCAL_* consistency levels cannot be used together.

ref:
 - #1695
 - #1758
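
For context, a minimal sketch of what such a precondition check could look like, with a hypothetical helper name (not the PR's actual code): SimpleStrategy ignores datacenter topology, so LOCAL_* consistency levels cannot behave as intended against it.

import com.datastax.driver.core.ConsistencyLevel;

class SchemaPreconditions {
  // Hypothetical sketch: reject LOCAL_* consistency levels when the keyspace
  // uses SimpleStrategy, which is not datacenter-aware.
  static void checkStrategyAndConsistency(String replicationClass, ConsistencyLevel cl) {
    if (replicationClass.endsWith("SimpleStrategy") && cl.name().startsWith("LOCAL_")) {
      throw new IllegalArgumentException(
          "SimpleStrategy cannot be used with " + cl + " consistency");
    }
  }
}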
@michaelsembwever changed the title from "WIP – Consider switching cassandra3 to span2 model #1695" to "Switching cassandra3 to span2 model #1695" on Oct 9, 2017
codefromthecrypt (Member) commented Oct 9, 2017 via email

michaelsembwever (Member Author):

@adriancole oh, i edited my comment above. both items have been pushed already (and exist as separate commits).

codefromthecrypt (Member) commented Oct 9, 2017 via email

michaelsembwever (Member Author):

LGTM. Thanks for a lot of the cleanup in the last commit: you caught a few silly bugs and typos there!

codefromthecrypt (Member):

Just added a commit to scrub the v1 compile dep (also preventing re-preparing).
One package change I made was autoconfiguration: right now, zipkin-server is still based on the v1 dep (basically it has v2 endpoints bolted on).

Sometime in the near future we'll add a v2 zipkin-server, which removes the whole dependency tree on v1 types (and Spring Boot 1.5). So for now, the autoconfiguration for cassandra3 makes most sense in the v1 package (as its only user is the v1 zipkin-server).

bucket int, //-- time bucket, calculated as ts/interval (in microseconds), for some pre-configured interval like 1 day.
ts timeuuid, //-- start timestamp of the span, truncated to millisecond precision
-trace_id frozen<trace_id>, //-- trace ID
+trace_id text, //-- trace ID
duration bigint, //-- span duration, in microseconds
michaelsembwever (Member Author):

@llinder @fedj
i want to change duration to be stored as tens of milliseconds (i.e. hundredths of a second).
this will make the SASI on this column contain fewer distinct values, improving write performance.
and i can't see users being so discerning about the searches they perform in the UI.

duration would be rounded up, so the result list would not skip any results. and duration in the span table would remain accurate to microseconds.

any objections?
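
As a worked illustration of the rounding described above (the helper name is an assumption, not code from this PR):

// Coarsen a microsecond duration to tens of milliseconds (10,000 µs units).
// Ceiling division rounds UP, so a search by minimum duration never skips a
// qualifying trace; the span table itself keeps microsecond accuracy.
static long toTensOfMillis(long durationMicros) {
  return (durationMicros + 10_000 - 1) / 10_000;
}
// ex. toTensOfMillis(123_456) == 13, i.e. the 130 ms bucket.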

Member:

Sounds like a good optimization to me.

Member:

cc @openzipkin/cassandra

michaelsembwever (Member Author):

added as an extra commit. can be undone if any objections/concerns.

michaelsembwever added a commit that referenced this pull request Oct 11, 2017
…s of seconds).

this will make the SASI on this column contain fewer distinct values, improving write performance.
users won't be discerning within 10 milliseconds about searches they perform in the UI, in regards to the duration field.

duration is always rounded up, so the result list would not skip any results. and duration in the span table would remain accurate to microseconds.

ref: #1758 (comment)
codefromthecrypt pushed a commit that referenced this pull request Oct 12, 2017
…s of seconds).
codefromthecrypt (Member):

fyi refactored to reduce commits, but intentionally left the important ones (relevant for performance)

codefromthecrypt pushed a commit that referenced this pull request Oct 18, 2017
…s of seconds).
ts_uuid timeuuid,
trace_id_high text, // when strictTraceId=false, contains right-most 16 chars if present
michaelsembwever (Member Author):

did you mean "left-most 16 chars" here?
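
For context: a 128-bit Zipkin trace ID is 32 lower-hex characters, and its high bits are the left-most 16. A minimal sketch of the split, with illustrative helper names (not the PR's actual code):

static String traceIdHigh(String traceId) {
  // 32-char IDs are 128-bit; the high half is the LEFT-most 16 characters.
  return traceId.length() == 32 ? traceId.substring(0, 16) : "";
}

static String traceIdLow(String traceId) {
  // 64-bit (16-char) IDs are stored whole in the low column.
  return traceId.length() == 32 ? traceId.substring(16) : traceId;
}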

codefromthecrypt (Member):

ok I have 23 failures locally.. trying to move that number below 20 soon :)

codefromthecrypt (Member) commented Oct 18, 2017 via email

codefromthecrypt (Member):

16 failures locally

codefromthecrypt (Member):

Down to 10 failures

codefromthecrypt (Member):

@openzipkin/cassandra ok I'm ready to merge this. You might want to check how pre-computed queries are delimited in the annotation_query column. They are braced with an unlikely character to stop us from accidentally matching on a substring in the API, e.g. bracing like ░error░ ensures we don't accidentally match a tag key named not_error.
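
A minimal sketch of that delimiting idea, with illustrative names (not the PR's exact code):

import java.util.List;

class AnnotationQuery {
  // An unlikely character braces every term, so searching for "░error░"
  // cannot match a substring such as the tag key "not_error".
  static final char DELIMITER = '░';

  // Join queryable terms into the single indexed annotation_query value,
  // ex. ["error", "http.path"] -> "░error░http.path░"
  static String encode(List<String> terms) {
    StringBuilder result = new StringBuilder().append(DELIMITER);
    for (String term : terms) result.append(term).append(DELIMITER);
    return result.toString();
  }

  // Brace the search term the same way before the lookup.
  static String searchTerm(String term) {
    return DELIMITER + term + DELIMITER;
  }
}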

annotation_query,
Boolean.TRUE.equals(span.debug()),
Boolean.TRUE.equals(span.shared())
);
michaelsembwever (Member Author):

Returning to the issue of null bindings resulting in tombstones.

There are 8 columns here that can be null (maybe more?).

If there are consistently 8 tombstones (nulls) per row, then only 125 spans in a trace (rows in a partition) are needed to trigger tombstone_warn_threshold warnings in the C* nodes' logs, since 8 × 125 = 1,000. And at 12,500 spans in a trace, the whole trace partition would become unreadable: by default, Cassandra warns at 1,000 tombstones in any query and fails at 100,000 tombstones.

There's also a small question of disk usage efficiency. Each tombstone is a cell name plus an essentially empty cell value stored on disk. Given that the cells (apart from tags and annotations) are generally very small, this could be a proportionally significant waste of disk.

The normal practice to avoid this is to rely on a number of variant prepared statements for inserting a span.

Another popular practice is to insert the potentially null columns as separate statements, optionally grouped into UNLOGGED batches (a sketch follows at the end of this comment). This works because multiple writes to the same partition have little overhead, and we're not worried about the lack of isolation between those writes here, as the write is asynchronous anyway. An example of this approach is in the cassandra-reaper project here: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/server/src/main/java/io/cassandrareaper/storage/CassandraStorage.java#L622-L642

UPDATE: a few lines down, under protected ResultSetFuture newFuture(), the nulls are in fact not written (not bound).
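
For concreteness, a minimal sketch of the unlogged-batch practice described above, using the DataStax 3.x driver; the prepared statements and column split are hypothetical, not this PR's code:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.UUID;

class SpanWriter {
  final Session session;
  // One statement for the required columns, plus one per optional column, so
  // a null is never bound (and no tombstone is ever written).
  final PreparedStatement insertCore, insertName, insertDuration;

  SpanWriter(Session session, PreparedStatement insertCore,
      PreparedStatement insertName, PreparedStatement insertDuration) {
    this.session = session;
    this.insertCore = insertCore;
    this.insertName = insertName;
    this.insertDuration = insertDuration;
  }

  void write(String traceId, UUID tsUuid, String name, Long durationMicros) {
    // All statements target the same partition, so the UNLOGGED batch is
    // cheap, and isolation between them isn't needed: the write is
    // asynchronous anyway.
    BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
    batch.add(insertCore.bind(traceId, tsUuid));
    if (name != null) batch.add(insertName.bind(traceId, tsUuid, name));
    if (durationMicros != null) {
      batch.add(insertDuration.bind(traceId, tsUuid, durationMicros));
    }
    session.executeAsync(batch);
  }
}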

Member:

thanks, captured

bound.setString("annotation_query", input.annotation_query());
}
if (input.shared()) bound.setBool("shared", true);
if (input.debug()) bound.setBool("debug", true);
michaelsembwever (Member Author):

the code above avoids writing the nulls (tombstones) :-)

cql: SELECT * FROM trace_by_service_span WHERE service = ? AND span = ? AND bucket = ? AND duration < ? LIMIT 1
fields: samerow
michaelsembwever (Member Author):

it looks like we're missing two queries here:

  • search by timestamp range,
  • search by timestamp range and duration

Member:

I couldn't quickly find out how to do a range query in a stress test, so left it out for now with a TODO

michaelsembwever (Member Author) left a comment:

all LGTM! approved.
(i added a small comment about adding two extra queries to the stress tests…)

codefromthecrypt pushed a commit that referenced this pull request Nov 12, 2017
@codefromthecrypt force-pushed the mck/cassandra3-use-span2 branch 2 times, most recently from 5cb21c1 to e0a22db on November 12, 2017 03:09
codefromthecrypt pushed a commit that referenced this pull request Nov 12, 2017
codefromthecrypt (Member):

ok I've got the dependencies job working.. just waiting for green

michaelsembwever and others added 7 commits November 12, 2017 11:21
Upgrade cassandra3 storage backend to use the zipkin2.* API instead of zipkin.*
Bump cassandra3 backend to require minimum Java8.
Take service_name out of the annotation search index, using a combination of two SASI instead.
Remove TraceIdUDT and BinaryAnnotationUDT. Trace ID is now stored as text instead of a UDT.
Adds the ListenableFutureCall class, which handles our Call->guava interactions.
Change the default keyspace to "zipkin2_cassandra3", making it explicit it's a fresh start and breaking compatibility.
Add the precondition check that SimpleStrategy and LOCAL_* consistency levels cannot be used together.
… a separate column.

All links for each day still remain in one partition key, keeping selects simple.
But the data is now transparent, and just as performant and compact (due to compression).
 - DeduplicatingExecutor around writes to span_by_service,
 - 40k max queue length in cql driver,
 - durable_writes disabled (no commitlog on disk),
 - disable read repairs,
 - lower gc_grace to 3 hours (match hint window),
 - row cache span_by_service,
 - increase server-side speculative retries…
…s of seconds).
Adrian Cole added 3 commits November 12, 2017 11:44
This extracts out calls in an effort to make the overall flow inside Cassandra easier to read and the commands themselves easier to debug.
@codefromthecrypt merged commit de80609 into master on Nov 12, 2017
codefromthecrypt pushed a commit that referenced this pull request Nov 12, 2017
@codefromthecrypt deleted the mck/cassandra3-use-span2 branch on November 12, 2017 05:56
codefromthecrypt (Member):

Will release 2.3 once master is green. Epic help, @michaelsembwever and @llinder; thanks for all the work on the experimental driver preceding this, including the spark code.

abesto pushed a commit to abesto/zipkin that referenced this pull request Sep 10, 2019