
Zipkin Receiver: default timestamp #1068

Merged: 4 commits merged into open-telemetry:master on Jun 11, 2020

Conversation

chris-smith-zocdoc (Contributor):

Fixes #954

#954 is caused by the OC -> internal -> OC conversion of span types.

OC technically allows nil times because the field is a pointer, but the [docs state](https://github.com/census-instrumentation/opencensus-proto/blob/master/src/opencensus/proto/trace/v1/trace.proto#L125-L135):

```go
// This field is semantically required. When not set on receive -
// receiver should set it to the value of end_time field if it was
// set. Or to the current time if neither was set. It is important to
// keep end_time > start_time for consistency.
//
// This field is required.
StartTime *timestamp.Timestamp
```

So this PR changes the Zipkin receiver to set the start/end timestamps if they're not set on receive.
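For illustration, here is a minimal sketch of that defaulting rule (not the PR's actual code; the function name and package are hypothetical), following the OC doc comment above: fall back to end_time when only it is set, otherwise use the current time.

```go
// Hypothetical sketch; not the collector's implementation.
package zipkindefaults

import (
	"time"

	"github.com/golang/protobuf/ptypes/timestamp"
)

// defaultTimestamps applies the rule from the OC doc comment quoted above:
// if StartTime is unset, fall back to EndTime; if both are unset, use the
// current time, and make sure EndTime is never left unset either.
func defaultTimestamps(start, end *timestamp.Timestamp) (*timestamp.Timestamp, *timestamp.Timestamp) {
	if start == nil {
		if end != nil {
			start = end
		} else {
			now := time.Now()
			start = &timestamp.Timestamp{Seconds: now.Unix(), Nanos: int32(now.Nanosecond())}
		}
	}
	if end == nil {
		end = start
	}
	return start, end
}
```

The actual change (visible in the review snippet further down) uses the collector's internal.TimeToTimestamp helper and assigns the same current time to both StartTime and EndTime.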

```diff
@@ -123,7 +123,7 @@ func (ze *zipkinExporter) PushTraceData(_ context.Context, td consumerdata.Trace
 	body, err := ze.serializer.Serialize(tbatch)
 	if err != nil {
-		return len(td.Spans), err
+		return len(td.Spans), consumererror.Permanent(err)
```
chris-smith-zocdoc (Contributor Author):

I changed the serialization failure to a permanent error because it otherwise caused an infinite loop in the retry processor. Is there any reason we'd want to retry on a serialization error?
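To make the retry concern concrete, here is a self-contained sketch of the pattern (the error wrapper and helper below are hypothetical stand-ins, not the collector's consumererror API): a sending loop retries transient failures but gives up immediately on errors marked permanent, such as a serialization failure that can never succeed on retry.

```go
// Hypothetical sketch of retryable vs. permanent errors; not collector code.
package main

import (
	"errors"
	"fmt"
)

// permanentError marks an error that retrying cannot fix.
type permanentError struct{ err error }

func (p permanentError) Error() string { return p.err.Error() }
func (p permanentError) Unwrap() error { return p.err }

// permanent wraps err so callers can recognize it as non-retryable.
func permanent(err error) error { return permanentError{err: err} }

// send retries up to maxRetries times, but drops the batch as soon as it
// sees a permanent error, so a bad payload cannot loop forever.
func send(attempt func() error, maxRetries int) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = attempt(); err == nil {
			return nil
		}
		var p permanentError
		if errors.As(err, &p) {
			return fmt.Errorf("dropping batch: %w", err)
		}
	}
	return err
}

func main() {
	err := send(func() error { return permanent(errors.New("serialization failed")) }, 5)
	fmt.Println(err) // dropping batch: serialization failed
}
```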

Member:

I can't think of any reason to retry them

@dmitryax (Member) left a comment:

What is the value of spans with no timestamps? Is there a particular reason to have them in Zipkin?

Btw, we will need to remove the OC part from the Zipkin -> OC -> internal translation later on, so we need to think about how to represent such spans in the internal data structure if we cannot discard them.

@dmitryax dmitryax self-assigned this Jun 3, 2020
```go
	now := internal.TimeToTimestamp(time.Now())
	span.StartTime = now
	span.EndTime = now
}
```
Contributor:

I'm not sure about this, but I don't know what the right thing to do is. Adding/modifying data like this just to make serialization happy doesn't sound that great to me. This means backends will show misleading information to users.

Alternatively, we could drop such spans as a validation step in the Zipkin receiver and log a message, but we can't do that since the Zipkin spec allows for the possibility of not having timestamps. Is there a way we can serialize them differently without adding fake timestamps?

If there is no other viable solution, we should at least add an attribute to the span indicating that the timestamps are fake and shouldn't be trusted. Also, could we maybe use a more obvious fake value for the timestamp, like the Unix epoch or something?

Contributor:

Zipkin spec on this:

> Epoch microseconds of the start of this span, possibly absent if incomplete.
>
> For example, 1502787600000000 corresponds to 2017-08-15 09:00 UTC.
>
> This value should be set directly by instrumentation, using the most precise value possible. For example, gettimeofday or multiplying epoch millis by 1000.
>
> There are three known edge-cases where this could be reported absent:
>
> 1. A span was allocated but never started (ex not yet received a timestamp)
> 2. The span's start event was lost
> 3. Data about a completed span (ex tags) were sent after the fact
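As a quick sanity check of the epoch-microseconds representation (standard library only, not collector code), the spec's example value can be reproduced like this:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t := time.Date(2017, time.August, 15, 9, 0, 0, 0, time.UTC)
	micros := t.UnixNano() / int64(time.Microsecond) // epoch microseconds, as Zipkin expects
	fmt.Println(micros)                              // 1502787600000000
}
```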

@owais (Contributor), Jun 3, 2020:

> A span was allocated but never started (ex not yet received a timestamp)

If this were the only edge case, we could just drop the span, since it doesn't actually represent a real operation, but there are two more scenarios that can result in spans without timestamps, so we can't just drop them.

It looks like setting a static timestamp that is obviously fake, along with a well-known attribute saying the timestamps were injected by the OTel collector as a workaround, is probably our best bet. Does the Jaeger proto allow empty timestamps? If so, an OTLP -> Jaeger/Zipkin translator should probably set the timestamp back to nil when it encounters this tag on a span. One concern with setting a very old timestamp is that it might affect some backends in an unexpected way. For example, a backend might drop the span very soon if it stores only recent data keyed by span operation timestamps instead of ingestion timestamps.

chris-smith-zocdoc (Contributor Author):

I'm not familiar with Jaeger; someone else will need to weigh in on that.

Member:

> It looks like setting a static timestamp that is obviously fake, along with a well-known attribute saying the timestamps were injected by the OTel collector as a workaround, is probably our best bet. Does the Jaeger proto allow empty timestamps? If so, an OTLP -> Jaeger/Zipkin translator should probably set the timestamp back to nil when it encounters this tag on a span. One concern with setting a very old timestamp is that it might affect some backends in an unexpected way. For example, a backend might drop the span very soon if it stores only recent data keyed by span operation timestamps instead of ingestion timestamps.

I agree with this approach. We can set a fake current timestamp along with an attribute identifying this workaround, and use it in the Jaeger/Zipkin exporters to reconstruct the null timestamps.
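A rough sketch of that round trip (the attribute key is the one proposed later in this thread, and the span struct and helpers are illustrative, not the collector's types):

```go
// Illustrative only; the real collector works on its internal span types.
package main

import "time"

// Attribute key proposed later in this thread; treat it as a placeholder here.
const absentStartTimeAttr = "otel.zipkin.absentField.startTime"

type span struct {
	StartTime  time.Time
	EndTime    time.Time
	Attributes map[string]string
}

// onReceive fills missing timestamps with the current time and records that
// the values were injected by the collector.
func onReceive(s *span) {
	if s.StartTime.IsZero() {
		now := time.Now()
		s.StartTime, s.EndTime = now, now
		if s.Attributes == nil {
			s.Attributes = map[string]string{}
		}
		s.Attributes[absentStartTimeAttr] = "true"
	}
}

// onExport restores the "absent" state for formats (Zipkin, possibly Jaeger)
// that can represent missing timestamps, and drops the marker attribute.
func onExport(s *span) {
	if s.Attributes[absentStartTimeAttr] == "true" {
		s.StartTime, s.EndTime = time.Time{}, time.Time{}
		delete(s.Attributes, absentStartTimeAttr)
	}
}

func main() {
	s := &span{}
	onReceive(s) // timestamps injected + marker attribute set
	onExport(s)  // timestamps cleared again for Zipkin/Jaeger output
	_ = s
}
```

Backends that cannot represent missing timestamps would still see the injected current time plus the marker attribute, which is the trade-off discussed above.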

@owais (Contributor), Jun 5, 2020:

That sounds like a Zipkin convention. We should probably have an OTel convention for such tags. Maybe otel.zipkin.missingTimestamp, or otel.zipkin.missingFields with the value being a comma-separated list of field names? Or maybe we should have OTel Collector-specific conventions for behavior like this. Perhaps prefix with otelcol or something?

> We can set a fake current timestamp

Why the current timestamp? Can we use zero time, i.e. 0 seconds since the Unix epoch? I'm afraid that if we use the current time and the span gets exported using something other than Zipkin, it might be very misleading to users seeing the span in a tracing backend. "Zero time" will at least raise some eyebrows.

@dmitryax (Member), Jun 5, 2020:

@owais, 0 timestamps look cleaner, but if I understood your concern correctly, that doesn't look like the best way:

> One concern with setting a very old timestamp is that it might affect some backends in an unexpected way. For example, a backend might drop the span very soon if it stores only recent data keyed by span operation timestamps instead of ingestion timestamps.

IMO, dropping spans with 0 timestamps still looks like the best way to handle this in otel-collector. @chris-smith-zocdoc, could you please let us know whether the spans with no timestamps are actually visible in the backend that you use, or whether they are needed for aggregation purposes?

chris-smith-zocdoc (Contributor Author):

> IMO, dropping spans with 0 timestamps still looks like the best way to handle this in otel-collector.

I disagree. Throwing away data seems suboptimal.

By coincidence they actually end up with a zero timestamp right now, but after the OC -> internal -> OC translation for the Zipkin export, the timestamp becomes invalid and causes the serialization issue.

> Could you please let us know whether the spans with no timestamps are actually visible in the backend that you use, or whether they are needed for aggregation purposes?

Yes, they're visible. We have two backends: one is Zipkin, the other is Honeycomb.

The data in these spans is just events and annotations, sent after the original span. This appears to be common behavior in the Zipkin libraries we're using for JavaScript and Scala.

```json
[
    {
        "traceId": "404059d49fac24fc",
        "id": "404059d49fac24ff",
        "annotations": [
            {
                "timestamp": 1589292979495000,
                "value": "zipkin-js.flush"
            }
        ],
        "tags": {
            "response_size_bytes": "9622"
        }
    }
]
```

In Zipkin the data is merged into the original span.

In Honeycomb these end up as a second event in their system. I've considered building a buffer/merge processor (similar to tail sampling) to do the merge in OTel before sending the data to them.

Contributor:

Agreed that throwing away the spans does not sound correct for people who might be using Zipkin end to end (or even some parts of it), since it is a valid Zipkin use case. I still have concerns about zeroed timestamps causing backends to drop spans too soon. Maybe backends can use the tag we inject to detect this edge case, but I'm not sure we can expect all backends to do this, especially since it is not part of the OTel spec. I guess adding the current timestamp for both start and end is the safest choice here, even though it's not optimal for backends other than Zipkin.

Member:

Sounds good. Let's stick with the current timestamp.

Also, @owais had a good point about the attribute name. We probably should not use a name that looks like a Zipkin convention. @owais, what about otel.zipkin.absentField.startTime?

@chris-smith-zocdoc (Contributor Author):

> What is the value of spans with no timestamps? Is there a particular reason to have them in Zipkin?

Yes, there is value in keeping them; @owais covered the cases where this can occur.

In my environment I believe I was running into case 3, "Data about a completed span (ex tags) were sent after the fact".

@tigrannajaryan tigrannajaryan added this to In progress in Collector via automation Jun 4, 2020
@dmitryax (Member) commented Jun 5, 2020:

Please fix linter issues

linux-foundation-easycla bot commented Jun 8, 2020: CLA Check — the committers are authorized under a signed CLA.

codecov bot commented Jun 8, 2020:

Codecov Report

Merging #1068 into master will decrease coverage by 0.00%. The diff coverage is 97.82%.

```
@@            Coverage Diff             @@
##           master    #1068      +/-   ##
==========================================
- Coverage   86.34%   86.33%   -0.01%     
==========================================
  Files         198      198              
  Lines       14159    14187      +28     
==========================================
+ Hits        12226    12249      +23     
- Misses       1477     1481       +4     
- Partials      456      457       +1     
```

| Impacted Files | Coverage Δ |
| --- | --- |
| exporter/zipkinexporter/zipkin.go | 78.18% <0.00%> (ø) |
| ...ceiver/internal/scraper/cpuscraper/cpu_metadata.go | 100.00% <100.00%> (ø) |
| ...eceiver/internal/scraper/cpuscraper/cpu_scraper.go | 88.23% <100.00%> (ø) |
| ...r/internal/scraper/cpuscraper/cpu_scraper_linux.go | 100.00% <100.00%> (ø) |
| receiver/hostmetricsreceiver/internal/testutils.go | 88.46% <100.00%> (-11.54%) ⬇️ |
| receiver/zipkinreceiver/trace_receiver.go | 70.29% <100.00%> (+0.09%) ⬆️ |
| translator/trace/zipkin/protospan_to_zipkinv1.go | 88.23% <100.00%> (+0.21%) ⬆️ |
| translator/trace/zipkin/zipkinv1_to_protospan.go | 83.96% <100.00%> (+1.00%) ⬆️ |
| internal/internal.go | 78.94% <0.00%> (-10.53%) ⬇️ |
| translator/internaldata/resource_to_oc.go | 76.47% <0.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update caadbbc...c3f6252.

@dmitryax (Member) left a comment:

LGTM

Collector automation moved this from In progress to Reviewer approved Jun 9, 2020
@tigrannajaryan tigrannajaryan merged commit 746db76 into open-telemetry:master Jun 11, 2020
Collector automation moved this from Reviewer approved to Done Jun 11, 2020
@chris-smith-zocdoc chris-smith-zocdoc deleted the cs_954_pt2 branch June 12, 2020 14:05
wyTrivail pushed a commit to mxiamxia/opentelemetry-collector that referenced this pull request Jul 13, 2020

Successfully merging this pull request may close these issues.

Zipkin export fails on span without timestamp when used with queued_retry