Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporter/signalfx] TestCorrelationClient unit test failure #27059

Closed
crobert-1 opened this issue Sep 21, 2023 · 4 comments · Fixed by #27608
Closed

[exporter/signalfx] TestCorrelationClient unit test failure #27059

crobert-1 opened this issue Sep 21, 2023 · 4 comments · Fixed by #27608
Assignees
Labels

Comments

@crobert-1
Copy link
Member

Component(s)

exporter/signalfx

Describe the issue you're reporting

CI action failure

--- FAIL: TestCorrelationClient (10.03s)
    client_test.go:61: Test server got PUT request: /v2/apm/correlate/host/test-box/service
    client_test.go:61: Test server got DELETE request: /v2/apm/correlate/host/test-box/service/test-service
    client_test.go:61: Test server got PUT request: /v2/apm/correlate/host/test-box/environment
    client_test.go:61: Test server got DELETE request: /v2/apm/correlate/host/test-box/environment/test-service
    client_test.go:61: Test server got GET request: /v2/apm/correlate/host/test-box
    --- FAIL: TestCorrelationClient/does_retry_500_responses (4.01s)
        client_test.go:250: 
            	Error Trace:	/home/runner/work/opentelemetry-collector-contrib/opentelemetry-collector-contrib/exporter/signalfxexporter/internal/apm/correlations/client_test.go:250
            	Error:      	Not equal: 
            	            	expected: 5
            	            	actual  : 10
            	Test:       	TestCorrelationClient/does_retry_500_responses
FAIL
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/signalfxexporter/internal/apm/correlations
@crobert-1
Copy link
Member Author

I'm looking into this. I can't reproduce locally and it's only failing sometimes so it seems environment dependent somehow. I did find during debugging that the actual response count can end up being much lower if the program is paused during the timeout window, so there's at least some way to change resulting behavior here.

@crobert-1
Copy link
Member Author

Adding more context: This test was introduced with the deprecation of the Signalfx Agent. A lot of code was moved over for #27040, which was merged the day before this issue was filed.

@crobert-1
Copy link
Member Author

I believe I've found the issue here:

			CleanupInterval: 0,

The test sets up a client for correlating dimensions on data. Deduplication is done on the client side as a performance enhancement to make fewer update requests to the server. Deduplication on data is done within a given time context, so the client is deduplicating any data seen within the CleanupInterval (essentially). When it's set to 0 here it's making it so that there's almost no chance to dedup data, since there will likely be an event to trigger purging the requests before there can be multiple correlation calls. This results in no deduplication being done and the test failing.

@crobert-1 crobert-1 added flaky test a test is flaky and removed needs triage New item requiring triage labels Oct 20, 2023
dmitryax pushed a commit that referenced this issue Oct 31, 2023
**Description:** 
The APM correlation test is failing with too many retry updates. The
solution is to increase the `CleanupInterval` to allow the test time to
make multiple correlation calls within the same dedup cleanup interval.
I posted the full description in [this issue
comment.](#27059 (comment))

Before this fix this test would fail _almost_ every time since it was
introduced into this repo, I haven't seen any failure since this update
(9 successful runs).

**Link to tracking Issue:** Resolves #27059
jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this issue Nov 12, 2023
jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this issue Nov 12, 2023
…metry#27608)

**Description:** 
The APM correlation test is failing with too many retry updates. The
solution is to increase the `CleanupInterval` to allow the test time to
make multiple correlation calls within the same dedup cleanup interval.
I posted the full description in [this issue
comment.](open-telemetry#27059 (comment))

Before this fix this test would fail _almost_ every time since it was
introduced into this repo, I haven't seen any failure since this update
(9 successful runs).

**Link to tracking Issue:** Resolves open-telemetry#27059
RoryCrispin pushed a commit to ClickHouse/opentelemetry-collector-contrib that referenced this issue Nov 24, 2023
…metry#27608)

**Description:** 
The APM correlation test is failing with too many retry updates. The
solution is to increase the `CleanupInterval` to allow the test time to
make multiple correlation calls within the same dedup cleanup interval.
I posted the full description in [this issue
comment.](open-telemetry#27059 (comment))

Before this fix this test would fail _almost_ every time since it was
introduced into this repo, I haven't seen any failure since this update
(9 successful runs).

**Link to tracking Issue:** Resolves open-telemetry#27059
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants