
Define DSL for analysis and query span data #1811

Open
pavolloffay opened this issue Sep 24, 2019 · 12 comments
@pavolloffay
Member

pavolloffay commented Sep 24, 2019

Created based on #1639 (comment).

Define a domain-specific language (DSL) for analyzing and querying span data. An example from Facebook's Canopy system:

[screenshot: Canopy DSL example]

The library should be able to connect to any span source: jaeger-query, a JSON file, or storage.

DSL in Canopy https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/

cc @jaegertracing/data-analytics

@jpkrohling
Contributor

Would it make sense to start by implementing GraphQL (#169)?

@pavolloffay
Member Author

@yurishkuro I am wondering whether we want the library to work in a distributed way (like a Spark RDD)?

In previous discussions we also mentioned that we could reuse existing graph traversal frameworks, e.g. Gremlin. I am not sure whether GraphQL (#169) provides the same capabilities or whether it is only used for UI integrations.

@pavolloffay
Member Author

We should also think about the use cases this feature would address:

  • users could ask the system more complicated questions, e.g. to test a hypothesis
  • the query could be run periodically and invoke an action (an alert) if conditions are met

@pavolloffay
Member Author

pavolloffay commented Sep 24, 2019

There are two popular graph query languages: Gremlin (https://tinkerpop.apache.org/gremlin.html) and Cypher (https://neo4j.com/developer/cypher-basics-i/).


@pavolloffay
Member Author

I am not sure how we could use this query language without the backend supporting it. To use Gremlin we would have to provide a Gremlin-compatible layer to allow query execution. @jaegertracing/data-analytics @yurishkuro any ideas?

Maybe running the query on a subset of data directly in-memory would work.

@annanay25
Member

Would it be easier if we could curate traces from a (relatively) complex system that someone in the community runs in production and would volunteer to publish? It would move the focus from data collection to actual analysis, and it would also help different teams collate and confirm results while working on the same data set.

I didn't dig very deep, but this seems relevant: https://github.com/google/cluster-data

@yurishkuro
Member

yurishkuro commented Sep 25, 2019

@pavolloffay there are several parts to the DSL/library:

1. a way to define a stream of traces

This may include:
* loading traces from files
* loading traces from a Kafka topic (e.g. start 10,000 messages back) that contains spans.

In the case of a source providing just spans, there needs to be a pre-aggregation step that assembles them into traces. This creates interesting challenges when done on a live stream as opposed to historical data: on historical data we can simply group by trace ID, while with a live stream we need to use window aggregation.

The output of the first step is an RDD-like stream of traces.
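For the historical-data case, the group-by assembly could be sketched as follows (a minimal Java sketch; the Span class, its fields, and sampleBatch are hypothetical, not the Jaeger model):

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch of the pre-aggregation step: assembling a batch of spans into
// traces keyed by trace ID. Span is a hypothetical stand-in for the
// real model class.
public class TraceAssembler {
    static class Span {
        final String traceId;
        final String spanId;
        Span(String traceId, String spanId) {
            this.traceId = traceId;
            this.spanId = spanId;
        }
    }

    // Historical data: a plain group-by on traceId assembles the traces.
    // A live stream would instead need window aggregation.
    static Map<String, List<Span>> assemble(List<Span> spans) {
        return spans.stream().collect(Collectors.groupingBy(s -> s.traceId));
    }

    static List<Span> sampleBatch() {
        return List.of(new Span("t1", "a"), new Span("t1", "b"), new Span("t2", "c"));
    }

    public static void main(String[] args) {
        Map<String, List<Span>> traces = assemble(sampleBatch());
        System.out.println(traces.get("t1").size()); // prints 2
    }
}
```

The live-stream variant would replace the single `assemble` call with a windowed aggregation in the streaming framework.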

2. Filtering step

This is where the first part of the DSL comes in: how to express a query on a trace when the trace is represented as a graph. Joe's proposal didn't really address the graph nature of the trace, only filtering conditions on individual spans (which could also be a valid use case).

3. Evaluation / feature extraction step

The second part of the DSL: expressing a feature-extraction computation on the graph, like Facebook's Canopy example above. Note an interesting thing in that example: it operates on a trace almost like on a flat collection of spans. They probably have expressions that can walk the graph, like $node->parent, but they didn't show them in the public talks.


I think the minimum DSL we need is just an ability to walk the in-memory representation of the trace as a graph (i.e. for n in node.children ...) and extract data (e.g. `span.operationName`, `span.tag['key']`). The actual evaluations can be normal programs, in the case of the filtering step returning a boolean.

In other words, what we need is just a data model, and maybe some simple helper functions for finding things, like browser_thread = trace.execution_units[attr.name == 'client'], which in a generic sense is really func (t *Trace) findSpans(predicate func(*Span) bool) []*Span. Helpers can actually come later; as long as we have the data model, people can write them themselves initially.
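A minimal Java sketch of that data model and a findSpans helper (all names here are hypothetical, assuming a trace is a tree of spans reachable from a root):

```java
import java.util.*;
import java.util.function.Predicate;

// Hedged sketch of the "data model plus simple helpers" idea: a trace as
// an in-memory graph of spans, with a generic findSpans(predicate) helper.
// None of these names come from the Jaeger codebase.
public class TraceGraph {
    static class Span {
        final String operationName;
        final Map<String, String> tags = new HashMap<>();
        final List<Span> children = new ArrayList<>();
        Span(String operationName) { this.operationName = operationName; }
    }

    static class Trace {
        final Span root;
        Trace(Span root) { this.root = root; }

        // Walk the graph and collect spans matching the predicate, e.g.
        // trace.findSpans(s -> "client".equals(s.tags.get("span.kind"))).
        List<Span> findSpans(Predicate<Span> predicate) {
            List<Span> out = new ArrayList<>();
            Deque<Span> stack = new ArrayDeque<>();
            stack.push(root);
            while (!stack.isEmpty()) {
                Span s = stack.pop();
                if (predicate.test(s)) out.add(s);
                s.children.forEach(stack::push);
            }
            return out;
        }
    }

    static Trace sampleTrace() {
        Span root = new Span("GET /api");
        Span child = new Span("SELECT users");
        child.tags.put("span.kind", "client");
        root.children.add(child);
        return new Trace(root);
    }

    public static void main(String[] args) {
        Trace t = sampleTrace();
        System.out.println(t.findSpans(s -> s.tags.containsKey("span.kind")).size()); // prints 1
    }
}
```

The filtering step is then a normal program passed as the predicate, exactly as described above.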

@pavolloffay
Member Author

I have started defining DSL with gremlin in https://github.com/pavolloffay/jaeger-tracedsl

Here is an example from app class https://github.com/pavolloffay/jaeger-tracedsl/blob/master/src/main/java/io/jaegertracing/dsl/gremlin/App.java

    TraceTraversalSource traceSource = graph.traversal(TraceTraversalSource.class);
    GraphTraversal<Vertex, Vertex> spans = traceSource
        .hasTag(Tags.SPAN_KIND.getKey(), Tags.SPAN_KIND_CLIENT)
        .duration(P.gt(100));

    for (Vertex v : spans.toList()) {
      System.out.println(v.label());
      System.out.println(v.property(Keys.OPERATION_NAME).value());
      System.out.println(v.keys());
    }

You can see what the filtering and extraction look like. The API allows using the trace DSL and the core Gremlin API at the same time. This is a simple example, but it should be possible to do things like:

  • determine whether two spans are connected, e.g. graph.connected(tagsSpan1, tagSpan2)
  • compute the distance between two spans
  • compute the distribution of service/process depth/breadth

Any suggestions are welcome. My next steps would be:

  • the more complicated filtering methods outlined above
  • simplify extraction/iteration (children(), root()...)
  • graph creation API - from a file (downloaded from the UI?), jaeger-query, or directly from storage
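For illustration, the "distance between two spans" idea can be sketched without Gremlin as a breadth-first search over an undirected view of the parent/child edges (the Span class here is a hypothetical stand-in for the trace graph vertex):

```java
import java.util.*;

// BFS sketch for "distance between two spans": the number of edges on
// the shortest path between two vertices of the span graph, or -1 if
// the spans are not connected.
public class SpanDistance {
    static class Span {
        final String id;
        final List<Span> neighbors = new ArrayList<>();
        Span(String id) { this.id = id; }
        // Link two spans with an undirected edge (parent/child either way).
        void link(Span other) { neighbors.add(other); other.neighbors.add(this); }
    }

    static int distance(Span from, Span to) {
        Map<Span, Integer> dist = new HashMap<>();
        Deque<Span> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            Span s = queue.poll();
            if (s == to) return dist.get(s);
            for (Span n : s.neighbors) {
                if (!dist.containsKey(n)) {
                    dist.put(n, dist.get(s) + 1);
                    queue.add(n);
                }
            }
        }
        return -1; // not connected
    }

    public static void main(String[] args) {
        Span a = new Span("a"), b = new Span("b"), c = new Span("c");
        a.link(b);
        b.link(c);
        System.out.println(distance(a, c)); // prints 2
    }
}
```

The same traversal with distance(a, b) >= 0 answers the "are two spans connected?" question.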

@annanay25
Member

In case of a source providing just spans, there needs to be a pre-aggregation step that assembles them into traces. This creates interesting challenges when done on a live stream as opposed to historical data, since on historical data we can simply group-by, while with a live stream we need to use window aggregation

@yurishkuro - Is this aggregator component available in open source?

@pavolloffay
Member Author

I have made some progress in my repository. The repository so far contains:

  • Gremlin trace DSL - defined methods for easier filtering and iteration over the graph (extraction)
  • Examples with Gremlin - e.g. find a span with given properties, are two spans connected, what is the distance between two spans, what is the maximum depth of a trace (based on spans, not services)?
  • Spark Streaming with a Kafka connector. It reads a Kafka topic in intervals, groups spans by trace ID, creates a graph for each trace, extracts the max depth of the trace, and prints it to stdout.
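The max-depth feature can be sketched as a simple recursion over the span tree (a hypothetical model, not the actual Spark job):

```java
import java.util.*;

// Illustrative recursion for the "maximum depth of a trace" feature,
// counting spans (not services) along the longest root-to-leaf path.
// The Span class is a hypothetical stand-in for the real model.
public class TraceDepth {
    static class Span {
        final List<Span> children = new ArrayList<>();
    }

    static int maxDepth(Span span) {
        int deepestChild = 0;
        for (Span child : span.children) {
            deepestChild = Math.max(deepestChild, maxDepth(child));
        }
        return 1 + deepestChild; // this span plus its deepest subtree
    }

    public static void main(String[] args) {
        Span root = new Span();
        Span mid = new Span();
        root.children.add(mid);
        mid.children.add(new Span());
        System.out.println(maxDepth(root)); // prints 3
    }
}
```

In the streaming job, this computation would run per trace after the group-by step, and the result would be the extracted feature.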

The next steps are:

  • publish extracted features to another Kafka topic and get them into Prometheus
  • wrap the code in a JupyterLab notebook
  • get a trace query REST API
  • write a blog post
  • move the span proto to the IDL repository and build it for Java and Python; we should consider publishing model classes to Maven/pip
  • publish the graph DSL as a library
  • make it easy to deploy JupyterLab on k8s and connect to Kafka
  • create a distribution (like spark-dependencies) with models/metrics which prove to be useful

@pavolloffay
Member Author

It would be great if somebody could help with moving the protos to the IDL repository and configuring the build process for different languages (#1213).
