Switching cassandra3 to span2 model #1695 #1758
Conversation
Looks great so far! Thanks a ton for working on this @michaelsembwever
@@ -1,88 +1,79 @@
-CREATE KEYSPACE IF NOT EXISTS zipkin3 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
+CREATE KEYSPACE IF NOT EXISTS zipkin2_cassandra3 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
in this file i still have to add some comments, and i want to duplicate all comments into the schema's table-level WITH COMMENT = '…';
so that operators get that information too…
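For reference, a table-level comment in CQL looks like this (table name and wording are illustrative, not the PR's actual schema text):

```sql
ALTER TABLE zipkin2_cassandra3.span
  WITH comment = 'Primary table, stores one row per reported span';
```

The same `WITH comment = '…'` option can be given directly in a `CREATE TABLE` statement, which is how it would land in the schema file.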
done.
force-pushed from f1f6bd8 to a7ca588
Some experimental benchmarking, aiming only to give rough latency spread using a safe throughput rate.
Results:
As is to be expected, the writes are reasonably on par across the different tables (because all are bound by the same piece of hardware).
Thanks for the benching. Will take a review sweep today
@adriancole ta.
force-pushed from a7ca588 to 4fc63ad
Upgrade the cassandra3 storage backend to use the zipkin2.* API instead of zipkin.*:
- Bump the cassandra3 backend to require minimum Java 8.
- Take service_name out of the annotation search index, using a combination of two SASI indexes instead.
- Remove TraceIdUDT and BinaryAnnotationUDT. The trace ID is now stored as text instead of a UDT.
- Add the ListenableFutureCall class, which handles our Call->guava interactions.
- Change the default keyspace to "zipkin2_cassandra3", making it explicit that this is a fresh start and breaking compatibility.
- Add a precondition check that SimpleStrategy and LOCAL_* consistency levels cannot be used together.

ref: #1695, #1758
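The SimpleStrategy/LOCAL_* precondition could be sketched roughly like this (method name and message are illustrative, not the PR's exact code):

```java
// Illustrative sketch: SimpleStrategy ignores datacenters, so LOCAL_*
// consistency levels can silently see fewer replicas than expected.
// Reject the combination up front.
static void checkConsistencyCompatible(String replicationClass, String consistencyLevel) {
  if (replicationClass.endsWith("SimpleStrategy") && consistencyLevel.startsWith("LOCAL_")) {
    throw new IllegalArgumentException(
        "SimpleStrategy cannot be used with LOCAL_* consistency levels: " + consistencyLevel);
  }
}
```

A check like this fails fast at startup rather than letting queries behave unpredictably at runtime.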
- changing the dependency table to store the individual fields of
DependencyLink, rather than the whole list of links as a blob,
- refactoring zipkin.storage.cassandra3 to zipkin2.storage.cassandra3
Ah ok will wait on that
@adriancole oh, i edited my comment above. both items have been pushed already (and exist as separate commits).
Perfect
LGTM. thanks for a lot of the cleanup in the last commit: you caught a few silly bugs and typos there!
just added a commit to scrub the v1 compile dep (also preventing re-preparing). sometime in the near future we'll add a v2 zipkin-server, which then removes the whole dep tree on v1 types (and spring boot 1.5). so for now, the autoconfiguration for cassandra3 makes most sense in the v1 package (as its only user is the v1 zipkin server)
  bucket int, //-- time bucket, calculated as ts/interval (in microseconds), for some pre-configured interval like 1 day.
  ts timeuuid, //-- start timestamp of the span, truncated to millisecond precision
- trace_id frozen<trace_id>, //-- trace ID
+ trace_id text, //-- trace ID
  duration bigint, //-- span duration, in microseconds
@llinder @fedj
i want to change duration to be stored as tens of milliseconds (or hundredths of seconds).
this will make the SASI on this column contain fewer distinct values, improving write performance.
and i can't see users being so discerning about the searches they perform in the UI.
duration would be rounded up, so the result list would not skip any results, and duration in the span table would remain accurate to microseconds.
any objections?
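A rough sketch of the proposed rounding (constant and method names are hypothetical, not the PR's code):

```java
// 10 ms bucket, expressed in microseconds: the indexed value has far fewer
// distinct values than raw microsecond durations, so the SASI stays smaller.
static final long BUCKET_MICROS = 10_000L;

// Round UP, so a duration search never skips a qualifying span.
static long indexedDurationBuckets(long durationMicros) {
  return (durationMicros + BUCKET_MICROS - 1) / BUCKET_MICROS;
}
```

Only the indexed column is coarsened; the span table keeps the exact microsecond value.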
Sounds like a good optimization to me.
cc @openzipkin/cassandra
added as an extra commit. can be undone if any objections/concerns.
…s of seconds). this will make the SASI on this column contain fewer distinct values, improving write performance. users won't be discerning within 10 milliseconds about searches they perform in the UI, in regards to the duration field. duration is always rounded up, so the result list would not skip any results. and duration in the span table would remain accurate to microseconds. ref: #1758 (comment)
force-pushed from 4179a4e to 6082180
fyi refactored to reduce commits, but intentionally left the important ones (relevant for performance)
force-pushed from 6082180 to 1b9b7a6
  ts_uuid timeuuid,
+ trace_id_high text, // when strictTraceId=false, contains right-most 16 chars if present
did you mean here "left-most 16 chars"?
force-pushed from 1b9b7a6 to 249dd2b
ok I have 23 failures locally.. trying to move that number below 20 soon :)
> ts_uuid timeuuid,
+ trace_id_high text, // when strictTraceId=false, contains right-most 16 chars if present
did you mean here "left-most 16 chars"?
yep!
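For illustration (helper names hypothetical), the split implied by the corrected comment, where the left-most 16 hex chars of a 128-bit trace ID go to trace_id_high:

```java
// Hypothetical helpers: Zipkin trace IDs are 16 hex chars (64-bit)
// or 32 hex chars (128-bit).
static String traceIdHigh(String traceId) {
  // left-most 16 chars, present only for 128-bit IDs
  return traceId.length() == 32 ? traceId.substring(0, 16) : "";
}

static String traceIdLow(String traceId) {
  // right-most 16 chars: the portion stored in the trace_id column
  return traceId.length() == 32 ? traceId.substring(16) : traceId;
}
```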
force-pushed from 9606b3b to 724d805
16 failures locally
Down to 10 failures
force-pushed from 40fff73 to e714640
@openzipkin/cassandra ok I'm ready to merge this. You might want to check how the pre-computed queries are delimited in the annotation_query column. They are braced with an unlikely character to stop us from accidentally matching on a substring in the API. ex bracing like …
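A hedged sketch of that bracing idea (delimiter choice and method names are illustrative, not necessarily what the PR uses):

```java
// Any character unlikely to appear in a tag key or value works as a brace.
static final char DELIMITER = '░';

// Brace EVERY token on both sides, so a lookup can match whole tokens only.
static String annotationQuery(java.util.List<String> tokens) {
  StringBuilder result = new StringBuilder().append(DELIMITER);
  for (String token : tokens) result.append(token).append(DELIMITER);
  return result.toString();
}

// A substring check now cannot accidentally hit the middle of another token.
static boolean containsToken(String annotationQuery, String token) {
  return annotationQuery.contains(DELIMITER + token + DELIMITER);
}
```

Because each token is surrounded by the delimiter on both sides, searching for `error` cannot match inside `http.error_count`, for example.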
  annotation_query,
  Boolean.TRUE.equals(span.debug()),
  Boolean.TRUE.equals(span.shared())
);
Retouching on the null binding resulting in the tombstone issue.
There are 8 columns here that can be null (maybe more?).
If there are consistently 8 tombstones (nulls) per row, then we only need 125 spans in a trace (rows in a partition) to trigger `tombstone_warn_threshold` warnings being logged in the C* nodes. And if we go to 12,500 spans in a trace, then that whole trace partition becomes unreadable: Cassandra warns at 1,000 tombstones in any query, and fails on 100,000 tombstones.
There's also a small question of disk usage efficiency. Each tombstone is a cell name plus a basically empty cell value stored on disk. Given that the cells (apart from tags and annotations) are generally very small, this could be a proportionally unnecessary waste of disk.
The normal practice to avoid this is to rely on a number of variant prepared statements for inserting a span.
Another popular practice is to insert the potentially null columns as separate statements (and optionally put them together into UNLOGGED batches). This works because multiple writes to the same partition have little overhead, and here we're not worried about the lack of isolation between those writes, as the write is asynchronous anyway. An example of this approach is in the cassandra-reaper project here: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/server/src/main/java/io/cassandrareaper/storage/CassandraStorage.java#L622-L642
UPDATE: a few lines down, under protected ResultSetFuture newFuture(),
the nulls are in fact not written (not bound).
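The thresholds above work out as plain arithmetic (numbers taken from the comment; helper name is illustrative):

```java
// With a fixed number of null cells (tombstones) written per row, how many
// rows (spans) in one trace partition does a full-partition read need to
// scan before hitting a Cassandra tombstone threshold?
static long rowsToReachThreshold(long tombstoneThreshold, long tombstonesPerRow) {
  return tombstoneThreshold / tombstonesPerRow;
}
```

At 8 tombstones per row, the 1,000-tombstone warn threshold is reached at 125 spans, and the 100,000-tombstone failure threshold at 12,500 spans, matching the figures in the comment.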
thanks, captured
  bound.setString("annotation_query", input.annotation_query());
}
if (input.shared()) bound.setBool("shared", true);
if (input.debug()) bound.setBool("debug", true);
the code above avoids writing the nulls (tombstones) :-)
cql: SELECT * FROM trace_by_service_span WHERE service = ? AND span = ? AND bucket = ? AND duration < ? LIMIT 1
fields: samerow
it looks like we're missing two queries here:
- search by timestamp range,
- search by timestamp range and duration
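The two missing queries would look roughly like this (column names taken from the schema in this PR; the exact bound parameters are illustrative):

```sql
-- search by timestamp range
SELECT * FROM trace_by_service_span
  WHERE service = ? AND span = ? AND bucket = ?
  AND ts >= ? AND ts <= ? LIMIT 1;

-- search by timestamp range and duration
SELECT * FROM trace_by_service_span
  WHERE service = ? AND span = ? AND bucket = ?
  AND ts >= ? AND ts <= ? AND duration < ? LIMIT 1;
```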
I couldn't quickly find out how to do a range query in a stress test, so left it out for now with a TODO
all LGTM! approved.
(i added a small comment about adding two extra queries to the stress tests…)
force-pushed from d2d95b6 to d9925b6
force-pushed from 5cb21c1 to e0a22db
ok I've got the dependencies job working.. just waiting for green
… a separate column. All links for each day still remain in one partition key, keeping selects simple. But the data is now transparent, and just as performant and compact (due to compression).
- DeduplicatingExecutor around writes to span_by_service,
- 40k max queue length in cql driver,
- durable_writes disabled (no commitlog on disk),
- disable read repairs,
- lower gc_grace to 3 hours (match hint window),
- row cache span_by_service,
- increase server-side speculative retries…
force-pushed from e0a22db to 9475148
This extracts out calls in an effort to make the overall flow inside Cassandra easier to read and the commands themselves easier to debug.
force-pushed from 9475148 to a21a98d
Will release 2.3 on master green. Epic help @michaelsembwever and @llinder, thanks for all the work on the experimental driver preceding this, including the spark code
#1695
(a lot here is from @adriancole !)