Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(#362): adds flush method to http reporter. #417

Merged
merged 4 commits into from
Jul 23, 2019

Conversation

jcchavezs
Copy link
Contributor

@jcchavezs jcchavezs commented Jul 20, 2019

This PR adds a flush method for http reporter.

Closes #362

Ping @csp197 @adriancole

@@ -135,6 +147,15 @@ class BatchRecorder {
}
}

flush() {
this.partialSpans.forEach((span, id) => {
if (span.delegate.duration === undefined) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the span is finished, we don't want to add akward annotations outside the span lifespan. Do you agree?

Copy link
Member

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the help on this @jcchavezs

It is unnecessary to expose the flush interval to give the functionality of flush. If you do both at the same time this means more tests and docs. You can solve the problem just by exposing flush alone from what I understand.

Anyway, I would recommend re-using values and text from async reporter. ex zero is an invalid time so just use that if you choose to disable the flush thread.

Adding of a flush annotation should not happen when someone calls flush(). This means the timeout code should either do the annotation adding logic, or both logic dispatch to an internal function with a boolean. I wouldn't expose to users an api for flush that has any parameters.

Also note that this logic will be extracted later as it is arbitrary that the queuing is pinned to http transport.

https://github.com/openzipkin/zipkin-reporter-java/blob/master/core/src/main/java/zipkin2/reporter/AsyncReporter.java

    /**
     * Default 1 second. 0 implies spans are {@link #flush() flushed} externally.
     *
     * <p>Instead of sending one message at a time, spans are bundled into messages, up to {@link
     * Sender#messageMaxBytes()}. This timeout ensures that spans are not stuck in an incomplete
     * message.
     *
     * <p>Note: this timeout starts when the first unsent span is reported.
     */
  /**
   * Calling this will flush any pending spans to the transport on the current thread.
   *
   * <p>Note: If you set {@link Builder#messageTimeout(long, TimeUnit) message timeout} to zero, you
   * must call this externally as otherwise spans will never be sent.
   *
   * @throws IllegalStateException if closed
   */
  @Override public abstract void flush();

@codefromthecrypt
Copy link
Member

actually I am not sure the logic here is timeout or interval in nature. my recommendation is to remove extraction of the time parameter and just expose flush. that way we dont have to setup a new parameter just to remove it later as we need to redo the batching part anyway.

@jcchavezs
Copy link
Contributor Author

jcchavezs commented Jul 21, 2019 via email

@jcchavezs
Copy link
Contributor Author

jcchavezs commented Jul 21, 2019

PTAL @adriancole


recorder.flush();

expect(logSpan.calledOnce).to.equal(true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can rebase and redo this test. A similar old one was lying.

Copy link
Member

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, made some explanatory comments mainly to engage you on some of the recent test related refactorings.


expect(spans).to.be.empty; // eslint-disable-line no-unused-expressions

clock.tick(100); // 1000 is de batching interval
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally remove the lolex stuff as it is irrelevant. For example, we have other tests in the same file which show that records are not flushed until complete. The only responsibility this one should have is that when you hit flush the data is expected. The clock ticking stuff distracts from an otherwise simple goal IOTW. For this test you can even use fake timestamps like other things here

Copy link
Contributor Author

@jcchavezs jcchavezs Jul 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I think for me it was better to make it clear that the flush occurs when it was not yet the time for a scheduled flush but definitively I can remove it.

expect(popSpan()).to.deep.equal({
traceId: rootId.traceId,
id: rootId.spanId,
kind: 'SERVER'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surprised the timestamp didn't make it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it was because of lolex. I think this is ready to be merged.

@codefromthecrypt codefromthecrypt merged commit 6b29aa1 into master Jul 23, 2019
@codefromthecrypt codefromthecrypt deleted the 362_adds_http_logger_flush branch July 23, 2019 09:52
@codefromthecrypt
Copy link
Member

win!

@jeffthompson1971
Copy link
Contributor

This zipkin.flush spans just bring down my whole system. Any time an endpoint returns a 400 error code, for some reason I get a zipkin.flush span ... and they all pile up under the same traceID. So i end up with this SINGLE trace that is thousands of spans all with same trace ID. This makes it so the zipkin portal UI doesn't render cuz it stack overflows. I don't understand why every single span has same trace ID... here is a SMALL part of the json for the trace.. it's ENDLESS

question is how can i NOT include this feature? this feature totally broke me...

[{"traceId":"2fede1516e4a9531","parentId":"20376d16c0bdeef9","id":"80001e186964c474","kind":"CLIENT","name":"post","timestamp":1570639888002950,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570639949002000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"614fd007f68a98e2","id":"8006b05ef6c90fb4","kind":"CLIENT","name":"post","timestamp":1570642255539077,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570642316145000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"1653057111a48a03","id":"8006fdf52b7751a4","kind":"CLIENT","name":"post","timestamp":1570637439396212,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570637499780000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"53eff77f02d6e14f","id":"800f2d0832f65b5f","kind":"CLIENT","name":"post","timestamp":1570636293176881,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570636353781000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"a108ced553744bcc","id":"80212179d2b26ba7","kind":"CLIENT","name":"post","timestamp":1570633862471293,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570633922531000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"f23363dbbb25a6b1","id":"802d851d77e25dd1","kind":"CLIENT","name":"post","timestamp":1570638686552524,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570638746915000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"38b9db64fd9413dc","id":"802e871225693318","kind":"CLIENT","name":"post","timestamp":1570645863532340,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570645924478000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"2a59056443015c05","id":"8041f9691cc256cb","kind":"CLIENT","name":"post","timestamp":1570632850830676,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570632911628000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"2c3378fbbb126517","id":"804482adb6500c7e","kind":"CLIENT","name":"post","timestamp":1570639890764047,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570639951005000,"value":"zipkin-js.flush"}],"tags":{"component":"AbtTracer","http.path":"/superset/explore_json/","project":"aip-core-ui"}},{"traceId":"2fede1516e4a9531","parentId":"9988ae5b455be743","id":"8049e96752bd04cb","kind":"CLIENT","name":"post","timestamp":1570647080045901,"duration":1157594,"localEndpoint":{"serviceName":"bis-service"},"annotations":[{"timestamp":1570647081229572,"value":"finish"},{"timestamp":1570647081471349,"value":"finish"},{"timestamp":1570647081473861,"value":"finish"},{"timestamp":1570647081476349,"value":"finish"},{"timestamp":1570647081486183,"value":"finish"},{"timestamp":1570647081521970,"value":"finish"},{"timestamp":1570647081523545,"value":"finish"},{"timestamp":1570647081533483,"value":"finish"},{"timestamp":1570647081546296,"value":"finish"},{"timestamp":1570647081934752,"value":"finish"},{"timestamp":1570647081963690,"value":"finish"},{"timestamp":1570647081995823,"value":"finish"},{"timestamp":1570647082056161,"value":"finish"},{"timestamp":1570647082061941,"value":"finish"},{"timestamp":1570647082066852,"value":"finish"},{"timestamp":1570647082457872,"value":"finish"},{"timestamp":1570647082459342,"value":"finish"},{"timestamp":1570647082464122,"value":"finish"},{"timestamp":1570647082464445,"value":"finish"},{"timestamp":1570647082464722,"value":"finish"},

@codefromthecrypt
Copy link
Member

@jeffthompson1971 what you are describing is a really bad bug it seems. I've tried to help you on gitter before and maybe notifications aren't working. there's a human communication problem by which I'm not able to get a suggestion to you to try connect instead (ideally avoiding the bug) https://gitter.im/openzipkin/zipkin?at=5d93e7a9fb131014e721b07f

At any rate, you can mask the bug by setting the batch recorder timeout to a very high number, which essentially makes it impossible to ever become timed out.

@codefromthecrypt
Copy link
Member

note that if you do set this to a really high number there's a chance, depending on the nature of the bug, that it could leak memory. I really implore trying some suggestions to make the bug not happen

@jcchavezs
Copy link
Contributor Author

Hi @jeffthompson1971,

What you are describing is really bad but I am not sure this is a problem caused for this PR. This PR includes a method for flush but that method is being called by the user, not by any zipkin instrumentation. The piece of code that adds the zipkin-js.flush annotations is #416 and that is adding the annotation. What I think you are experiencing is that the middleware is holding the span and not finishing. Could you please provide your setup over gitter?

@jcchavezs
Copy link
Contributor Author

Hi @jeffthompson1971 we just released https://github.com/openzipkin/zipkin-js/releases/tag/v0.19.1. This should fix the problem you have in express as hooks usage have been fixed, would you mind trying it?

@jeffthompson1971
Copy link
Contributor

sorry I missed this guys. I will try this today!

@jeffthompson1971
Copy link
Contributor

Actually I already tried it I had forgotten. Still see the issue. And I was also given the recommendation to move to zipkin-instrumentation-connect and I did that too.. still i get these zipkin.flush spans. wish those would just go away until i figure out why this is happening. Just not easy to debug this it seems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add flush to HttpLogger
3 participants