
Implemented a "group by trace" processor. #1362

Merged
merged 1 commit into from
Aug 11, 2020

Conversation

jpkrohling (Member):

Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de

Description: This change adds a new 'groupbytrace' processor, as defined in #1309. This addresses the first item in the action plan.

Link to tracking Issue: #1309

Testing: extensive unit tests have been added

Documentation: sample configuration + README added
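
For context, a collector processor like this one is configured through a Go config struct that is populated from the collector's YAML configuration. The following is only a minimal sketch of what such a struct could look like; the field names (wait_duration, num_traces) are hypothetical illustrations, not the PR's actual config.go:

```go
package main

import (
	"fmt"
	"time"
)

// Config sketches the user-facing settings such a processor needs: how long
// to buffer a trace's spans, and how many traces to keep in flight at once.
// Field names here are hypothetical.
type Config struct {
	// WaitDuration is how long to buffer a trace after its first span arrives.
	WaitDuration time.Duration `mapstructure:"wait_duration"`
	// NumTraces caps the number of traces held in memory.
	NumTraces int `mapstructure:"num_traces"`
}

func main() {
	cfg := Config{WaitDuration: time.Second, NumTraces: 100_000}
	fmt.Printf("%+v\n", cfg)
}
```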

codecov bot commented Jul 15, 2020:

Codecov Report

Merging #1362 into master will increase coverage by 0.14%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1362      +/-   ##
==========================================
+ Coverage   91.35%   91.49%   +0.14%     
==========================================
  Files         240      245       +5     
  Lines       16744    17045     +301     
==========================================
+ Hits        15297    15596     +299     
- Misses       1044     1045       +1     
- Partials      403      404       +1     
Impacted Files Coverage Δ
processor/groupbytraceprocessor/event.go 100.00% <100.00%> (ø)
processor/groupbytraceprocessor/factory.go 100.00% <100.00%> (ø)
processor/groupbytraceprocessor/processor.go 100.00% <100.00%> (ø)
processor/groupbytraceprocessor/ring_buffer.go 100.00% <100.00%> (ø)
processor/groupbytraceprocessor/storage_memory.go 100.00% <100.00%> (ø)
service/defaultcomponents/defaults.go 84.90% <100.00%> (+0.29%) ⬆️
receiver/otlpreceiver/otlp.go 91.66% <0.00%> (-2.78%) ⬇️
translator/internaldata/resource_to_oc.go 87.50% <0.00%> (ø)
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jpkrohling marked this pull request as draft on July 16, 2020.
jpkrohling (Member, Author):

@pjanotti, @bogdandrutu , @tigrannajaryan , would one of you please review this one? I marked it as draft for now, as there might be things to discuss, but from my perspective, this is ready for review.

jpkrohling (Member, Author):

@owais , would you be able to look into this one, please?

jpkrohling marked this pull request as ready for review on July 21, 2020.
jrcamp (Contributor) left a comment:

Can you expand in the README on what the use case is? (Reading the linked issue, it looks like it's for tail-based sampling.)

Is this a common enough use case to go into core or should it go in contrib?

jpkrohling (Member, Author):

> Is this a common enough use case to go into core or should it go in contrib?

My idea is that once this processor gets stable enough, the tail-based sampler starts using it instead of having its own similar logic.

@jpkrohling
Copy link
Member Author

I just realized that I left a change to service/defaultcomponents/defaults.go out. I'll amend this PR later to include it; the change there is trivial and doesn't block the review.

nilebox (Member) left a comment:

A few initial comments on the design.

Using the pdata model without conversion to OTLP might be better, and that will allow you to use TraceID.String() instead of hashing.
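
To illustrate the suggestion (a sketch with simplified generic types, not the collector's actual pdata API): the 16-byte trace ID can be rendered as a hex string and used directly as a map key, so no separate hashing step is needed.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// traceID stands in for pdata's trace ID type; String() plays the role the
// review attributes to TraceID.String(), rendering the raw bytes as hex.
type traceID [16]byte

func (t traceID) String() string {
	return hex.EncodeToString(t[:])
}

func main() {
	// Buffered spans keyed directly by the trace ID's string form:
	// no hash function, no hash collisions to worry about.
	buffered := map[string][]string{}

	id := traceID{0xde, 0xad, 0xbe, 0xef}
	buffered[id.String()] = append(buffered[id.String()], "span-a", "span-b")

	fmt.Println(id, "->", buffered[id.String()])
}
```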

jpkrohling (Member, Author) commented Jul 27, 2020:

> Using the pdata model without conversion to OTLP might be better, and that will allow you to use TraceID.String() instead of hashing.

Based on your suggestion, the last commit changes the PR to use the internal pdata model instead of OTLP. The performance gain appears to be consistent and substantial:

Before the last commit (f81445e), 3 benchmark runs:

$ go test -benchmem -run="^$" ./processor/groupbytraceprocessor -bench "^(BenchmarkConsumeTracesCompleteOnFirstBatch)$"
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/collector/processor/groupbytraceprocessor
BenchmarkConsumeTracesCompleteOnFirstBatch-12    	 1000000	      7714 ns/op	   12468 B/op	     159 allocs/op
PASS
ok  	go.opentelemetry.io/collector/processor/groupbytraceprocessor	7.779s
$ go test -benchmem -run="^$" ./processor/groupbytraceprocessor -bench "^(BenchmarkConsumeTracesCompleteOnFirstBatch)$"
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/collector/processor/groupbytraceprocessor
BenchmarkConsumeTracesCompleteOnFirstBatch-12    	 1000000	      8129 ns/op	   13159 B/op	     168 allocs/op
PASS
ok  	go.opentelemetry.io/collector/processor/groupbytraceprocessor	8.193s
$ go test -benchmem -run="^$" ./processor/groupbytraceprocessor -bench "^(BenchmarkConsumeTracesCompleteOnFirstBatch)$"
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/collector/processor/groupbytraceprocessor
BenchmarkConsumeTracesCompleteOnFirstBatch-12    	 1000000	      8736 ns/op	   13946 B/op	     178 allocs/op
PASS
ok  	go.opentelemetry.io/collector/processor/groupbytraceprocessor	8.802s

After:

$ go test -benchmem -run="^$" ./processor/groupbytraceprocessor -bench "^(BenchmarkConsumeTracesCompleteOnFirstBatch)$"
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/collector/processor/groupbytraceprocessor
BenchmarkConsumeTracesCompleteOnFirstBatch-12    	 1000000	      3377 ns/op	    6463 B/op	      54 allocs/op
PASS
ok  	go.opentelemetry.io/collector/processor/groupbytraceprocessor	3.438s
$ go test -benchmem -run="^$" ./processor/groupbytraceprocessor -bench "^(BenchmarkConsumeTracesCompleteOnFirstBatch)$"
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/collector/processor/groupbytraceprocessor
BenchmarkConsumeTracesCompleteOnFirstBatch-12    	 1000000	      2781 ns/op	    5020 B/op	      44 allocs/op
PASS
ok  	go.opentelemetry.io/collector/processor/groupbytraceprocessor	2.842s
$ go test -benchmem -run="^$" ./processor/groupbytraceprocessor -bench "^(BenchmarkConsumeTracesCompleteOnFirstBatch)$"
goos: linux
goarch: amd64
pkg: go.opentelemetry.io/collector/processor/groupbytraceprocessor
BenchmarkConsumeTracesCompleteOnFirstBatch-12    	 1000000	      3560 ns/op	    7062 B/op	      59 allocs/op
PASS
ok  	go.opentelemetry.io/collector/processor/groupbytraceprocessor	3.636s
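
Averaging the three runs gives a rough sense of the improvement (a back-of-the-envelope calculation, not part of the original report):

before: (7714 + 8129 + 8736) / 3 ≈ 8193 ns/op, (159 + 168 + 178) / 3 ≈ 168 allocs/op
after:  (3377 + 2781 + 3560) / 3 ≈ 3239 ns/op, (54 + 44 + 59) / 3 ≈ 52 allocs/op

That is roughly 2.5x faster per operation, with about 3x fewer allocations.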

nilebox (Member) left a comment:

I think we should handle graceful shutdown properly; see the relevant comment thread.

jpkrohling (Member, Author):

@nilebox, @owais: other than the clarification around the shutdown behavior, is there anything pending for this PR?

nilebox (Member) left a comment:

The overall idea and high-level implementation make sense to me.

However, I personally find the code hard to read, with 4 channels and 3 data structures manipulated in more than one place in different ways, though that might be just me.
I'd probably prefer something like an "event queue" with a few simple steps for each event type (see the sketch below).

I would like one of the maintainers (Tigran or Bogdan) to do a review of this code as well.
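
For illustration, the "event queue" shape suggested here might look something like the following sketch: a single channel of typed events consumed by one goroutine, so all state is touched from one place. Names and event types are hypothetical, not taken from the PR.

```go
package main

import "fmt"

// eventType and event sketch the single-queue shape: every state change is a
// typed event on one channel, handled by one goroutine.
type eventType int

const (
	traceReceived eventType = iota // spans for a trace arrived
	traceExpired                   // the wait period for a trace elapsed
	traceReleased                  // the trace is ready for the next consumer
)

type event struct {
	typ     eventType
	traceID string
}

func main() {
	events := make(chan event, 8)
	done := make(chan struct{})

	// A single consumer goroutine: state is only ever touched here, which
	// avoids juggling several channels and shared data structures.
	go func() {
		defer close(done)
		for e := range events {
			switch e.typ {
			case traceReceived:
				fmt.Println("buffer spans for", e.traceID)
			case traceExpired:
				fmt.Println("wait period over for", e.traceID)
			case traceReleased:
				fmt.Println("forward", e.traceID, "to the next consumer")
			}
		}
	}()

	events <- event{traceReceived, "trace-1"}
	events <- event{traceExpired, "trace-1"}
	events <- event{traceReleased, "trace-1"}
	close(events)
	<-done
}
```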

nilebox self-requested a review on July 30, 2020.
nilebox previously approved these changes on July 30, 2020.
nilebox dismissed their stale review on July 30, 2020:

> Need to properly handle spans with different traces

nilebox assigned themselves and unassigned owais on July 31, 2020.
jpkrohling (Member, Author):

The last commit addresses the scenario where resource spans contain spans from multiple traces (see the sketch below). Only two points remain open: the shutdown behavior and the nil check on the next consumer.
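
A sketch of the scenario being handled, with simplified stand-in types rather than the PR's actual code: a single incoming batch can carry spans from multiple traces, so it has to be split into one group per trace ID before buffering.

```go
package main

import "fmt"

// span is a simplified stand-in for a pdata span; only the fields needed for
// grouping are modeled.
type span struct {
	traceID string
	name    string
}

// splitByTrace groups a mixed batch into one slice of spans per trace ID.
func splitByTrace(batch []span) map[string][]span {
	groups := make(map[string][]span)
	for _, s := range batch {
		groups[s.traceID] = append(groups[s.traceID], s)
	}
	return groups
}

func main() {
	batch := []span{
		{traceID: "trace-1", name: "GET /users"},
		{traceID: "trace-2", name: "GET /orders"},
		{traceID: "trace-1", name: "SELECT * FROM users"},
	}
	for id, spans := range splitByTrace(batch) {
		fmt.Println(id, "->", len(spans), "spans")
	}
}
```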

nilebox (Member) commented Jul 31, 2020:

The code looks good to me now, but we still need to check the remaining open questions with the repo's maintainers.
I'm also not sure whether this processor should live in the core repo or in contrib.

jpkrohling (Member, Author):

> I'm also not sure whether this processor should live in the core repo or in contrib.

This has been asked before by @jrcamp, and here's my answer:

> My idea is that once this processor gets stable enough, the tail-based sampler starts using it instead of having its own similar logic.

jpkrohling (Member, Author):

The last commit also removed the nil check on the next consumer, as I agree that it doesn't really make sense for a processor to not have a next consumer.

The only remaining point, then, is the shutdown behavior.

@tigrannajaryan, @bogdandrutu, would you please help us make a decision here? Basically:

What should happen to in-flight traces when the processor is shutting down? Should we just discard them? Should we flush them immediately, blocking until they are consumed, possibly honoring the context's deadline?
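
As a sketch of the second option (flush with a deadline), with hypothetical names not taken from the PR:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// processor is a stand-in for the real component; inFlight holds the IDs of
// traces still being buffered when Shutdown is called.
type processor struct {
	inFlight []string
}

// Shutdown flushes in-flight traces, honoring the context's deadline;
// anything left when the deadline fires is discarded.
func (p *processor) Shutdown(ctx context.Context) error {
	for _, id := range p.inFlight {
		select {
		case <-ctx.Done():
			return ctx.Err() // deadline reached, remaining traces are dropped
		default:
			fmt.Println("flushing trace", id) // the real code would call the next consumer
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	p := &processor{inFlight: []string{"trace-1", "trace-2"}}
	fmt.Println(p.Shutdown(ctx))
}
```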

jpkrohling (Member, Author):

@nilebox, would it be OK if I created an issue to discuss the shutdown behavior and implemented a fix in a follow-up PR?

nilebox (Member) commented Jul 31, 2020:

@jpkrohling sure, I will approve this PR once you create the issue. It still requires a maintainer to look at it and merge it, though :)

jpkrohling (Member, Author):

Created #1465

Closes open-telemetry#1309.

Signed-off-by: Juraci Paixão Kröhling <juraci@kroehling.de>
jpkrohling (Member, Author):

I addressed @bogdandrutu's comment and squashed all the commits, since this was probably the last change in this PR.

jpkrohling (Member, Author):

@nilebox, @bogdandrutu, is there anything left to be done on this PR? If not, would you mind merging it?
