Initial Cassandra and Elasticsearch spark job #1

pavolloffay · 2017-09-13T13:32:39Z

This is the dependencies Spark job ported from https://github.com/openzipkin/zipkin-dependencies to use the Jaeger model.

pavolloffay · 2017-09-13T13:51:35Z

@yurishkuro could you please review? The beauty of this is that it does not require any other service (e.g. Kafka). Later, some parts could be reused for new data pipeline processing (e.g. the job would stay the same but acquire data from other sources).

pavolloffay · 2017-09-13T13:53:42Z

...ra/src/main/java/io/jaegertracing/spark/dependencies/cassandra/CassandraDependenciesJob.java

+        microsUpper);
+
+    JavaSparkContext sc = new JavaSparkContext(conf);
+    List<Dependency> dependencies = javaFunctions(sc).cassandraTable(keyspace, "traces", mapRowTo(Span.class))


@Jiri-Kremser could you please have a look at these lines?

nice lines :)

perhaps

.flatMapValues(foo) .values() .mapToPair(bar)

could be all expressed in the first flatMapValues.

RDDs are fun to write but can sometimes be slow. If you want optimization for free, consider using the DataFrame abstraction; it should be available in the C*-Spark connector.

yurishkuro · 2017-09-13T22:51:43Z

@yurishkuro could you please review?

On my todo list

yurishkuro

The logic looks good.

yurishkuro · 2017-09-15T03:44:05Z

README.md

+and store them for later presentation in the UI.
+
+This job parses all traces in the current day in UTC time. This means you should schedule it to run
+just prior to midnight UTC.


not sure I follow this. The job we run internally is also based on UTC, but it runs after UTC midnight and processes the previous day.

Perhaps it would be more accurate to have something like:

This job parses all traces on a given day, based on UTC. By default, it processes the current day, but other days can be explicitly specified.

yurishkuro · 2017-09-15T03:48:16Z

README.md

@@ -0,0 +1,73 @@
+# Jaeger Spark dependencies
+
+This is a Spark job that will collect spans from your datastore, analyze links between services,


Should this emphasize that only Cassandra is currently supported?

Also, I would make it clear that this will only work with data collected via Jaeger client libraries using the default model of single-host spans, i.e. it will not work with spans collected from most Zipkin or jaeger-configured-as-zipkin libraries that share the span ID between client and server.

yurishkuro · 2017-09-15T03:48:46Z

...ra/src/main/java/io/jaegertracing/spark/dependencies/cassandra/CassandraDependenciesJob.java

+        microsUpper);
+
+    JavaSparkContext sc = new JavaSparkContext(conf);
+    List<Dependency> dependencies = javaFunctions(sc).cassandraTable(keyspace, "traces", mapRowTo(Span.class))


mapRowTo(Span.class)

Ha, cheating! Nice! Was a lot more painful in Go.

pavolloffay · 2017-09-22T14:10:06Z

@jpkrohling @objectiser could you please review?

jpkrohling

I'm only half way through, but I wanted to share my comments before leaving for the weekend :)

jpkrohling · 2017-09-22T15:39:49Z

README.md

+and store them for later presentation in the UI.
+
+This job parses all traces in the current day in UTC time. This means you should schedule it to run
+just prior to midnight UTC.


Perhaps it would be more accurate to have something like:

This job parses all traces on a given day, based on UTC. By default, it processes the current day, but other days can be explicitly specified.

jpkrohling · 2017-09-22T15:41:15Z

README.md

+Cassandra is used when `STORAGE_TYPE=cassandra`.
+
+    * `CASSANDRA_KEYSPACE`: The keyspace to use. Defaults to "zipkin".
+    * `CASSANDRA_CONTACT_POINTS`: Comma separated list of hosts / ip addresses part of Cassandra cluster. Defaults to localhost


s/CASSANDRA_CONTACT_POINTS/CASSANDRA_HOSTS/ ? The _HOSTS version seems to be the "standard".

These terms are similar but not interchangeable. "Contact points" has a slightly different meaning.

Is the person using this job supposed to know the difference? Is this something common in Cassandra world?

jpkrohling · 2017-09-22T15:41:50Z

README.md

+Example usage:
+
+```bash
+$ STORAGE_TYPE=cassandra CASSANDRA_USERNAME=user CASSANDRA_PASSWORD=pass java -jar jaeager-dependencies.jar


s/jaeager/jaeger/

jpkrohling · 2017-09-22T15:42:28Z

README.md

+$ STORAGE_TYPE=cassandra CASSANDRA_USERNAME=user CASSANDRA_PASSWORD=pass java -jar jaeager-dependencies.jar
+```
+### Elasticsearch
+Elasticsearch is used when `STORAGE_TYPE=cassandra`.


Wait, what? Did you mean STORAGE_TYPE=elasticsearch ?

This is a little puzzle to keep readers in focus 😆 .

jpkrohling · 2017-09-22T15:44:06Z

...ra/src/main/java/io/jaegertracing/spark/dependencies/cassandra/CassandraDependenciesJob.java

+    String[] jars;
+
+    // By default the job only works on traces whose first timestamp is today
+    long day = Utils.midnightUTC(System.currentTimeMillis());


You have no excuse on this project :) You can use Clock instead of System.current.....

What method do you mean exactly? This https://docs.oracle.com/javase/8/docs/api/java/time/Clock.html#millis--?

I think that for this use case calling system.millis is fine. Maybe a simple refactor that passes a Clock/Date (I am not sure here) to the job would simplify things.

The idea is that your code shouldn't care if it's UTC or not. You'd only get a clock, and whoever created the clock would set the timezone. This code here would only call the method you mentioned (#millis()), if that's what you need.
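A minimal sketch of the Clock suggestion, not the PR's actual code: the class name `ClockExample` and the helper `startOfDayMillis` are hypothetical, and the sketch assumes the job only needs the epoch millis of the start of the current day. Whoever creates the clock picks the zone (UTC here), and a fixed clock makes the result predictable in tests.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class ClockExample {

  /**
   * Hypothetical helper: returns the epoch millis of the start of the day
   * that the given clock currently reports, in the clock's own zone.
   */
  static long startOfDayMillis(Clock clock) {
    return LocalDate.now(clock)
        .atStartOfDay(clock.getZone())
        .toInstant()
        .toEpochMilli();
  }

  public static void main(String[] args) {
    // Production wiring: the caller decides that "today" means UTC.
    System.out.println(startOfDayMillis(Clock.systemUTC()));

    // Test wiring: a fixed clock gives a deterministic result.
    Clock fixed = Clock.fixed(Instant.parse("2017-09-22T15:44:06Z"), ZoneOffset.UTC);
    System.out.println(startOfDayMillis(fixed)); // 1506038400000 (2017-09-22T00:00:00Z)
  }
}
```

The same helper could be fed a clock in another zone without changing the job code, which is the point of the comment above.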

jpkrohling · 2017-09-22T16:09:06Z

.../src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java

+    public Builder hosts(String hosts) {
+      Utils.checkNoTNull(hosts, "hosts");
+      this.hosts = hosts;
+      sparkProperties.put("es.nodes.wan.only", "true");


Shouldn't this be on the builder?

I am not sure, I took the builder from zipkin. I suppose if it was needed it would be in the builder.

jpkrohling · 2017-09-22T16:10:46Z

.../src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java

+    JavaEsSpark.saveJsonToEs(javaSparkContext.parallelize(Collections.singletonList(json)), resource);
+  }
+
+  static String parseHosts(String hosts) {


Could you share the motivation for this ? Looks like the only thing it's doing is add the port to the host, in case it's missing. Am I right? In any case, examples of the input and output would be sufficient.

It's taken from Zipkin. It's a little bit magical; I am wondering whether we should keep it.

Remove it and see if it breaks :) Code that does nothing shouldn't exist.
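To make the question about inputs and outputs concrete, here is a rough sketch of the behaviour the reviewer guesses at (append a port when it is missing). It is an assumption for illustration only, not the Zipkin implementation: `HostsPortExample`, `withDefaultPort`, and `hasPort` are hypothetical names, and the sketch assumes each entry is `host`, `host:port`, or `scheme://host[:port]` with no path and no IPv6 literal.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class HostsPortExample {

  /** Hypothetical helper: appends defaultPort to every comma-separated entry that has no port. */
  static String withDefaultPort(String hosts, int defaultPort) {
    return Arrays.stream(hosts.split(","))
        .map(String::trim)
        .map(h -> hasPort(h) ? h : h + ":" + defaultPort)
        .collect(Collectors.joining(","));
  }

  /** Assumes no path component and no IPv6 literal, so any ':' after the scheme marks a port. */
  static boolean hasPort(String host) {
    int schemeEnd = host.indexOf("://");
    String authority = schemeEnd >= 0 ? host.substring(schemeEnd + 3) : host;
    return authority.contains(":");
  }

  public static void main(String[] args) {
    // prints "es1:9200,es2:9201,http://es3:9200"
    System.out.println(withDefaultPort("es1, es2:9201,http://es3", 9200));
  }
}
```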

jpkrohling · 2017-09-22T16:13:55Z

.../src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java

+
+    public String getTimestamp() {
+      // Jaeger ES dependency storage uses RFC3339Nano for timestamp
+        return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")


Isn't there a constant for this format already?

I couldn't find any.

Again, Java 8 could have helped:
https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html

(btw, I just noticed now that the method name is misleading: this is not returning a timestamp, but a date as string).
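A small sketch of what the `java.time` variant could look like, assuming the output should keep the same pattern as the `SimpleDateFormat` call in the diff and always be rendered in UTC. The class and method names (`TimestampFormatExample`, `formatUtc`) are hypothetical and not taken from the PR.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampFormatExample {

  // Same pattern as in the diff; DateTimeFormatter is immutable and thread-safe,
  // so it can be shared as a constant (unlike SimpleDateFormat).
  static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX");

  /** Hypothetical helper: formats epoch millis as a UTC date-time string. */
  static String formatUtc(long epochMillis) {
    OffsetDateTime utc = Instant.ofEpochMilli(epochMillis).atOffset(ZoneOffset.UTC);
    return utc.format(FORMAT);
  }

  public static void main(String[] args) {
    System.out.println(formatUtc(1506095046000L)); // 2017-09-22T15:44:06Z
  }
}
```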

jpkrohling · 2017-09-22T16:16:17Z

...rc/test/java/io/opentracing/spark/dependencies/elastic/ElasticsearchDependenciesJobTest.java

+
+  @Override
+  protected void waitBetweenTraces() throws InterruptedException {
+    // TODO otherwise elastic drops some spans


Is this a real TODO?

Well, yes. I think we could do it completely without waits, but it's more complicated, so I would like to leave a comment here for further improvements.

jpkrohling · 2017-09-22T16:16:46Z

...rc/test/java/io/opentracing/spark/dependencies/elastic/ElasticsearchDependenciesJobTest.java

+  protected void deriveDependencies() {
+    ElasticsearchDependenciesJob.builder()
+        .hosts("http://localhost:" + elasticsearch.getMappedPort(9200))
+        .day(System.currentTimeMillis())


You know what I'm going to comment here, don't you? :) (Clock vs. System.currentTimeMillis)

jpkrohling · 2017-09-25T08:26:37Z

...ra/src/main/java/io/jaegertracing/spark/dependencies/cassandra/CassandraDependenciesJob.java

+          .groupByKey();
+
+      // TODO remove for debug purposes
+      traces.foreach(stringIterableTuple2 -> {


OK, then put something like "Remove before merging (or after 2017-10-01)", so that you signal to the next maintainer that this can be safely removed if you forget to remove "now".

jpkrohling · 2017-09-25T08:28:38Z

.../src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java

+    JavaEsSpark.saveJsonToEs(javaSparkContext.parallelize(Collections.singletonList(json)), resource);
+  }
+
+  static String parseHosts(String hosts) {


Remove it and see if it breaks :) Code that does nothing shouldn't exist.

jpkrohling · 2017-09-25T08:31:22Z

.../src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java

+
+    public String getTimestamp() {
+      // Jaeger ES dependency storage uses RFC3339Nano for timestamp
+        return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")


Again, Java 8 could have helped:
https://docs.oracle.com/javase/8/docs/api/java/time/OffsetDateTime.html

jpkrohling · 2017-09-25T08:32:16Z

.../src/main/java/io/jaegertracing/spark/dependencies/elastic/ElasticsearchDependenciesJob.java

+
+    public String getTimestamp() {
+      // Jaeger ES dependency storage uses RFC3339Nano for timestamp
+        return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")


(btw, I just noticed now that the method name is misleading: this is not returning a timestamp, but a date as string).

jpkrohling · 2017-09-25T08:39:58Z

...pendencies-test/src/main/java/io/jaegertracing/spark/dependencies/test/TracersGenerator.java

+ */
+public class TracersGenerator {
+
+  public static class Tuple<A, B> {


I think you do have a Tuple class somewhere else, don't you?

In Scala; this module does not depend on it.

jpkrohling · 2017-09-25T08:44:44Z

...ndencies-test/src/main/java/io/jaegertracing/spark/dependencies/test/tree/TreeGenerator.java

+      return;
+    }
+    // +1 to assure that we generate all exact number of nodes
+    int numOfDescendants = descendantsRandom.nextInt(maxNumberOfDescendants) + 1;


Why a random number of descendants? Wouldn't a fixed number do it? Adding randomness to a test should be done only when it adds real value, as it's hard to debug a test failure when something is random.
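If the randomness stays, one way to keep failures debuggable is a fixed seed. This is a different technique from the fixed number of descendants suggested above, shown only as a sketch; `SeededDescendantsExample` and `descendantCounts` are hypothetical names. It also illustrates that `nextInt(max) + 1` yields values from 1 to `max` inclusive.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededDescendantsExample {

  /** Hypothetical helper: draws a descendant count for each of `nodes` nodes from a seeded generator. */
  static int[] descendantCounts(long seed, int maxNumberOfDescendants, int nodes) {
    Random random = new Random(seed);
    int[] counts = new int[nodes];
    for (int i = 0; i < nodes; i++) {
      // nextInt(max) is in [0, max), so adding 1 gives [1, max]
      counts[i] = random.nextInt(maxNumberOfDescendants) + 1;
    }
    return counts;
  }

  public static void main(String[] args) {
    // The same seed always produces the same sequence, so a failing run can be replayed.
    int[] first = descendantCounts(42L, 3, 10);
    int[] second = descendantCounts(42L, 3, 10);
    System.out.println(Arrays.equals(first, second)); // true
  }
}
```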

jpkrohling · 2017-09-25T08:54:43Z

jaeger-spark-dependencies/pom.xml

+                See https://stackoverflow.com/questions/36877897/detected-guava-issue-1635-which-indicates-that-a-version-of-guava-less-than-16
+                TODO: When cassandra-driver-core 4.x is out, revisit this as it no longer uses guava
+              -->
+              <relocations>


Is this module supposed to be used as a library to a downstream/client module? If so, then it might be a good idea to relocate everything.

jpkrohling · 2017-09-25T08:55:18Z

jaeger-spark-dependencies/pom.xml

+          <execution>
+            <phase>package</phase>
+            <goals>
+              <goal>shade</goal>


Is this build publishing both the shaded and no-shaded versions?

jpkrohling · 2017-09-25T08:56:32Z

...ark-dependencies/src/main/java/io/jaegertracing/spark/dependencies/DependenciesSparkJob.java

+
+    String storage = System.getenv("STORAGE");
+    if (storage == null) {
+      throw new IllegalArgumentException("Missing environmental variable STORAGE");


At this level, it's better to just print out the message and exit with System.exit(rc) than to throw an exception.
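A sketch of that suggestion, with assumptions labeled: `MainExitExample`, `run`, and the exit code value `1` are hypothetical, and the environment is passed in as a map so the decision logic can be exercised without terminating the JVM.

```java
import java.util.Map;

public class MainExitExample {

  static final int EXIT_MISSING_CONFIG = 1; // hypothetical exit code

  /** Returns an exit code instead of throwing, so only main() ever calls System.exit. */
  static int run(Map<String, String> env) {
    String storage = env.get("STORAGE");
    if (storage == null) {
      System.err.println("Missing environment variable STORAGE");
      return EXIT_MISSING_CONFIG;
    }
    System.out.println("Using storage: " + storage);
    return 0;
  }

  public static void main(String[] args) {
    int rc = run(System.getenv());
    if (rc != 0) {
      System.exit(rc);
    }
  }
}
```

The user sees a one-line message rather than a stack trace, and the non-zero exit code is still visible to schedulers such as cron.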

jpkrohling · 2017-09-25T09:45:27Z

Forgot to mention on the review itself, so, I'll leave it here as comment: some of those files are based on Zipkin ones, so, it would be polite to mention this on the readme. Besides, it would also be nice to either not add your @author tags, or to add a second @author tag like @author The Zipkin Authors, to make sure people won't think you are the sole author of that source file.

pavolloffay · 2017-09-25T09:50:39Z

@jpkrohling thanks, done in pavolloffay@8e48ca4

objectiser

Some comments to consider. LGTM.

objectiser · 2017-09-25T08:29:07Z

README.md

+### Cassandra
+Cassandra is used when `STORAGE_TYPE=cassandra`.
+
+    * `CASSANDRA_KEYSPACE`: The keyspace to use. Defaults to "jaeger_v1_test".


Why is the keyspace default _test?

It's the default that create.sh creates. Maybe jaeger_v1_dc1 would be better (this is what our k8s deployment creates).

objectiser · 2017-09-25T08:33:28Z

README.md

+Elasticsearch is used when `STORAGE_TYPE=elasticsearch`.
+
+    * `ES_INDEX`: The index prefix to use when generating daily index names. Defaults to jaeger.
+                  The final index look like jaeger-span-yyyy-DD-mm.


Why does this require a separate index per day - can't the data be stored in a single index with timestamps to distinguish the values per day?

How do the indexes get cleaned up?

This is how jaeger ES storage implementation works

objectiser · 2017-09-25T08:56:08Z

...rc/test/java/io/jaegertracing/spark/dependencies/cassandra/CassandraDependenciesJobTest.java

+
+  @Override
+  protected void deriveDependencies() throws Exception {
+    // flush all date to the storage


Did you mean data?

objectiser · 2017-09-25T10:02:10Z

...ark-dependencies-common/src/main/java/io/jaegertracing/spark/dependencies/model/Process.java

+/**
+ * @author Pavol Loffay
+ */
+public class Process implements Serializable {


Wondering whether it would be better to have the Java model in some independent package, as it could be generally useful for Java tooling that wants to read data from the Jaeger storage?

agree, I would like to raise some issues about generic tests and reusable parts in java.

objectiser · 2017-09-25T10:31:19Z

...dencies-test/src/main/java/io/jaegertracing/spark/dependencies/test/tree/TracingWrapper.java

+    @Override
+    public void createChildSpan(TracingWrapper<JaegerWrapper> parent) {
+      io.opentracing.Tracer.SpanBuilder spanBuilder = tracer.buildSpan(UUID.randomUUID().toString().replace("-", ""))
+          .ignoreActiveSpan();


Why is ignoreActiveSpan required when you are setting the parent and starting manually?

objectiser · 2017-09-25T10:40:29Z

...ark-dependencies/src/main/java/io/jaegertracing/spark/dependencies/DependenciesSparkJob.java

+    }
+  }
+
+  static String pathToUberJar() throws UnsupportedEncodingException {


Does this need to be UberJar? Could it be JaegerJar?

I think it's more uber jar, no?

Maybe I should have asked: what is the uber jar?

a fat jar with all dependencies.

Ah ok - overloaded use of the term uber :)

ROFL, now I get why you asked... 😆

objectiser · 2017-09-25T10:43:20Z

...pendencies-test/src/main/java/io/jaegertracing/spark/dependencies/test/DependenciesTest.java

+    Node<ZipkinWrapper> root = treeGenerator.generateTree(150, 3);
+    Traversals.inorder(root, (node, parent) -> node.getTracingWrapper().get().getSpan().finish());
+    waitBetweenTraces();
+    treeGenerator.getTracers().forEach(tracer -> {


Could be moved onto the TreeGenerator

The Jaeger tracer does not implement Closeable. I will create a PR and add a TODO here.

pavolloffay · 2017-09-25T13:25:21Z

@jpkrohling I have updated it to use java 8 date API pavolloffay@f918317

jpkrohling · 2017-09-25T13:34:35Z

The usage of the Java 8 date/time API could be better, but it's a strong first version. One future improvement could have been about setting the timezone on the main class, and let everything else be agnostic of the timezone. But that's not needed/required for this PR.

I think it's very close to a LGTM, missing only the removal of the parseHosts and the resolution for the comment about the shade plugin.

pavolloffay · 2017-09-25T13:47:11Z

The usage of the Java 8 date/time API could be better, but it's a strong first version. One future improvement could have been about setting the timezone on the main class, and let everything else be agnostic of the timezone. But that's not needed/required for this PR.

The current functionality is what OpenZipkin uses, and is probably enough. If you have a specific requirement, feel free to open an issue on this repo right away.

About the relocation, see todo. There is a problem with guava. Based on the maturity of the original project I suspect it's there for a reason.

pavolloffay · 2017-09-25T13:55:58Z

I will merge soon and transfer the repo to jaegertracing.

If you want a different repo name please comment. cc @yurishkuro

pavolloffay · 2017-09-25T16:31:48Z

@yurishkuro @Jiri-Kremser @objectiser @jpkrohling thanks.

Initial Cassandra spark job

1d7c8c8

pavolloffay commented Sep 13, 2017

pavolloffay added 4 commits September 13, 2017 15:57

Remove zipkin from source

dceeb71

fix travis

929af58

integration test using testcontainers-java

8b7dc07

Remove docker prefix from images

e069fb2

pavolloffay added 2 commits September 14, 2017 13:51

testcontainers refactor

fc33112

testcontainers define containers as classes

39a5e99

yurishkuro approved these changes Sep 15, 2017

pavolloffay mentioned this pull request Sep 15, 2017

Senders fix, do not use static thrift factory jaegertracing/jaeger-client-java#233

Merged

pavolloffay added 9 commits September 15, 2017 15:01

working random tree for jaeger

f58a3f3

Zipkin brave itest

48b5e7b

Working elastic, breaking changes in C* tests

ede21e2

fix C* tests

402f853

testcontainers init for elastic

00f7df3

introduce test artifact

1df3492

use jaeger-core 0.21.0

3e42aab

simple elastic test

0f97fd7

dependency tests for elastic and C* working locally

316505c

pavolloffay changed the title ~~Initial Cassandra spark job~~ Initial Cassandra and Elasticserch spark job Sep 21, 2017

pavolloffay added 5 commits September 22, 2017 12:40

Small test refactoring

224fca3

Add unit tests and sleep 500 ms for c*

bef2eee

license headers

86e1eff

fix dockerfile

8017545

minor refactor

fc59894

pavolloffay added 2 commits September 22, 2017 16:59

Disable zipkin tests for now

b47aa32

tests: ignore activespan

6e7ca29

increase wait time

2546eb6

jpkrohling reviewed Sep 22, 2017

pavolloffay added 4 commits September 25, 2017 09:02

Tests: use smaller # of spans in traces

3a5474f

Fix juca's comments except date API

1e30ae4

Fix readme

2a613cf

Add comment to generic spark job

1df46a5

jpkrohling reviewed Sep 25, 2017

Add openzipkin authors

8e48ca4

objectiser approved these changes Sep 25, 2017

pavolloffay added 3 commits September 25, 2017 13:43

Fix Gary's comments

e599660

End with system.exit

d93c85b

Use java8 date API

f918317

pavolloffay added 3 commits September 25, 2017 15:28

Remove comments/ debug messages

5fdf13a

remove unused bits

1739654

fix javadoc

17470ba

Remove elastic hosts manipulation, export maven shade plugin version to prop

6bbe68c

pavolloffay added 3 commits September 25, 2017 16:36

Rename es hosts to es nodes to align with es hadoop connector

dacd23e

Readme minor fix

a42a0d4

Rename dependencies helper

9b0f9ec

pavolloffay merged commit 26da5f6 into master Sep 25, 2017

pavolloffay deleted the cassandra-spark-job branch September 25, 2017 16:43

jpkrohling mentioned this pull request Oct 9, 2017

CronJob for dependencies jaegertracing/jaeger-kubernetes#39

Merged

janhoy pushed a commit to janhoy/spark-dependencies that referenced this pull request Nov 15, 2023

Merge pull request jaegertracing#1 from janhoy/jackson-upgrade

d838c90

Upgrade jackson
