ZIPKIN : 2018-07-02 Dependency Link Query at Ascend

Created by Adrian Cole on Sep 17, 2018

A few Zipkin contributors are present at least July 2-3. Let's get some stuff done!

Attendees

Raja

Lance

Adrian

Things we’ll do

Kickoff will be Monday morning at Ascend, where Raja works. We'll fill this page out with ideas beforehand and during the session. Sometime later, we'll have a dinner as well.

Dependency Link Query

Issue 1206 was the first issue about this (I think).

Relationship discovery is difficult for users. For example, they want to understand why errors appear on the dependency graph. Right now, even if they know to look at the dependency view, they can't easily tell which traces were associated with those errors. Trying to find out can be a frustrating search, because you cannot search by multiple services, much less by a service that calls another in order. Lance (SmartThings) and Raja (Ascend) feel it is worthwhile to be able to jump from a service dependency link to search results.

Design

The idea is to create a data structure that represents an unpacked dependency link. Today, a dependency link looks like this:

{
  "parent": "edge",
  "child": "mid tier",
  "callCount": 124,
  "errorCount": 2
}

In order to facilitate search, we’d need to unpack this into trace IDs:

{
  "parent": "edge",
  "child": "mid tier",
  "callCount": 124,
  "errorCount": 2,
  "traceIds": [
    "463ac35c9f6413ad48485a3953bb6124",
    "1312321463ac35c9f6413ad48485a395",
    …
  ],
  "errorTraceIds": [
    "1312321463ac35c9f6413ad48485a395",
    …
  ]
}

Note that the trace ID count will be equal to or less than callCount, because a trace can contain multiple spans on the same link: each call represents an RPC or a send to a broker, and there may be many of these in a single trace. The errorTraceIds field is redundant with traceIds in the same way that errorCount is the portion of callCount in error. We add it separately because the most common question is "What is causing errors on this link?"
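
For illustration, here is a minimal Java sketch of the enriched link and the counting rule above. The class, field names, and addCall method are assumptions for this note, not Zipkin code; the sets deduplicate trace IDs, which is why the trace ID count stays at or below callCount.

import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical enriched link mirroring the JSON above; not Zipkin API.
class LinkWithTraceIds {
  final String parent, child;
  long callCount, errorCount;
  // Sets deduplicate: a trace with many calls on this link adds one ID.
  final Set<String> traceIds = new LinkedHashSet<>();
  final Set<String> errorTraceIds = new LinkedHashSet<>();

  LinkWithTraceIds(String parent, String child) {
    this.parent = parent;
    this.child = child;
  }

  // Records one call (an RPC or a send to a broker) seen in a trace.
  void addCall(String traceId, boolean error) {
    callCount++;
    traceIds.add(traceId);
    if (error) {
      errorCount++;
      errorTraceIds.add(traceId);
    }
  }
}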

The data here would need to be created similarly to, though more frequently than, the current dependency link data. Eventually this will need to be a Spark or Flink job; for initial experiments, it will be in-memory. The "jump" implies a navigation element from the dependency link view to search results. We may eventually want search support in the main search page or an independent one.

Parts list

To create this, we minimally need the following, which will be changes to the link-query branch:

Java library to create the data structure given a list of traces (this could be used just like DependencyLinker is today); see the sketch after this list

Initially InMemoryStorage aggregation and fetch of this data

Multi-fetch API, similar to GET /api/v2/trace/:traceId, except that it takes a list of trace IDs

Strategy for presenting results

  1. For example, right now trace results are only possible in the main search screen.

  2. If we only link from the dependency link view, we need no further search elements.

  3. If we have a separate query, we may need an index of parent->child to populate the UI, similar to the GET /api/v2/spans?serviceName=:serviceName endpoint.

Jump from dependency link view to results view

Spark/Flink job to create the data on normal data stores
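
To make the first parts-list item concrete, here is a sketch of how the Java library might fold a list of traces into the enriched links, used in the same spirit as DependencyLinker is today. The Call record and the linking loop are assumptions, and it reuses the hypothetical LinkWithTraceIds class sketched earlier; a real implementation would derive calls from spans.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: one cross-service call derived from a span in a trace.
record Call(String traceId, String parent, String child, boolean error) {}

// Assumed linker shape: given a list of traces (each reduced to calls),
// accumulate one LinkWithTraceIds per parent->child edge.
class TraceIdLinker {
  static Map<String, LinkWithTraceIds> link(List<List<Call>> traces) {
    Map<String, LinkWithTraceIds> links = new LinkedHashMap<>();
    for (List<Call> trace : traces) {
      for (Call call : trace) {
        links.computeIfAbsent(call.parent() + "->" + call.child(),
            key -> new LinkWithTraceIds(call.parent(), call.child()))
            .addCall(call.traceId(), call.error());
      }
    }
    return links;
  }
}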

Initial Constraints

Bucketing of trace ID data into smaller (likely 5m) intervals

Dependency link data is currently aggregated daily, but querying by day is not a common use case. At some point we will need to jump to data that is not daily, for example 5m intervals. So, later, we need to expand search into a bucket of time smaller than a day.
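
The bucketing itself is simple floor arithmetic. A sketch, assuming epoch-millisecond timestamps and a 5m window (both assumptions; Zipkin span timestamps are actually epoch microseconds, so a real job would adjust units):

import java.util.concurrent.TimeUnit;

class Buckets {
  static final long WINDOW_MS = TimeUnit.MINUTES.toMillis(5);

  // Floors an epoch-millisecond timestamp to the start of its 5m bucket.
  static long bucketOf(long timestampMs) {
    return timestampMs - (timestampMs % WINDOW_MS);
  }
}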

Deduplicating logic and aggregation of dependency links

The only difference between today's dependency links and this data is the presence of trace IDs (and almost certainly a bucket smaller than 24hrs). It should be possible to re-aggregate this data into our normal dependency link data simply by summing the trace-ID-enriched data into 24hr windows.
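
A sketch of that roll-up, under the same assumptions as the earlier sketches: sum the 5m trace-ID-enriched buckets for an edge into one daily link and drop the IDs. DailyLink and toDaily are illustrative names, not existing types.

import java.util.Collection;

// Plain daily link, as aggregated today (no trace IDs); names assumed.
record DailyLink(String parent, String child, long callCount, long errorCount) {}

class Rollup {
  // Sums the 5m trace-ID-enriched buckets for one parent->child edge
  // into the 24hr window used by the existing dependency link data.
  static DailyLink toDaily(String parent, String child,
      Collection<LinkWithTraceIds> buckets) {
    long calls = 0, errors = 0;
    for (LinkWithTraceIds bucket : buckets) {
      calls += bucket.callCount;
      errors += bucket.errorCount;
    }
    return new DailyLink(parent, child, calls, errors);
  }
}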

What this doesn’t do

Currently, the input data is parent->child only (e.g. mid tier -> data store), not a complete path (e.g. edge -> mid tier -> data store). Computing complete paths down to leaf nodes is possible, but out of scope for now. This avoids larger data, and also UI elements to disambiguate which type of path one is interested in.
