Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add consume mode #34

Merged
merged 16 commits into from
Dec 14, 2020
Merged

Add consume mode #34

merged 16 commits into from
Dec 14, 2020

Conversation

simonswine
Copy link
Collaborator

@simonswine simonswine commented Nov 5, 2020

In the consume mode the adapter subscribes to a pulsar topic and writes
out the metrics as remote_write request to cortex.

TODOs:

  • Verify error behaviour and ensure those cases have tests
  • Provide integration tests (esp. for tenant ID forwarding)

An image is available:

simonswine/prometheus-pulsar-remote-write:consume-mode

@simonswine simonswine force-pushed the add-consume-mode branch 3 times, most recently from c57e155 to 328609b Compare November 6, 2020 14:40
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
pkg/app/consume.go Outdated Show resolved Hide resolved
pkg/remote/write.go Outdated Show resolved Hide resolved
@simonswine simonswine force-pushed the add-consume-mode branch 3 times, most recently from fb727e9 to f6731c4 Compare November 25, 2020 09:35
@simonswine simonswine changed the title WIP: Add consume mode Add consume mode Nov 25, 2020
integration/consume_integration_test.go Outdated Show resolved Hide resolved
integration/integration_test.go Outdated Show resolved Hide resolved
pkg/remote/write.go Show resolved Hide resolved
@simonswine simonswine requested review from replay and a team December 4, 2020 12:52
Copy link

@replay replay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly just minor nits

Comment on lines +51 to +52
// TODO: Not too sure how relevant it is for consuming from the bus
//clientOptions.OperationTimeout = p.readTimeout
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment still relevant? I can't see the vars clientOptions nor p, so it looks like maybe that line would be outdated now anyway?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still relevant, something to ask when I am speaking to a pulsar expert next: 6b31f27

pkg/app/consume.go Outdated Show resolved Hide resolved
pkg/app/consume.go Outdated Show resolved Hide resolved

// this is set when a retry able error happend
errRemoteWriteRetryable := false
blockingSampleCh := make(chan pulsar.ReceivedSample)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the purpose of the blockingSampleCh...

So when a retryable error happens on remote write then:

  1. errRemoteWriteRetryable gets set to true on :182
  2. on the next iteration sample() gets called again on :131
  3. because errRemoteWriteRetryable is true, sample() returns blockingSampleCh
  4. the select at the beginning of the loop will then wait for blockingSampleCh to yield an object or the ticker to tick or the ctx to get cancelled
  5. most likely the next event will then be the ticker ticking, causing a retry

So is the desired function of blockingSampleCh to introduce a short wait time between retries? If that's the case, then wouldn't we only want to sleep before sending the next sample of that tenant for which the error occurred (since limits or often tenant specific)? In the current implementation, wouldn't that sleep block all tenants?

Copy link

@replay replay Dec 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, even if it's true that in the case of a retryable error we're blocking the forwarding of data for all tenants, instead of just blocking it for the ones which yielded the error, i don't think this is an issue which is critical enough to block the merging of this PR. so i'll just approve anyway to not unnecessarily block you

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point you are raising here. Currently when an error for only a single tenant happens on the remote write end of the adapter we would still receive messages from pulsar for that tenant as they are all coming from the same queue and we have no way to tell the tenant.

I think there would be more intelligent was of handling that e.g.:

  • Nack messages for blocked tenants, but that would mean we need to have a good resumption strategy to avoid out of order samples.
  • Having different queue per tenant in pulsar

For now I would like to keep that simple behaviour and figure out what would be the best way forward in terms of real world error cases.

I have improved the comment in code to reflect that

* pulsar-client-go v0.2.0
* prometheus v2.22.0

Signed-off-by: Christian Simon <simon@swine.de>
In the consume mode the adapter subscribes to a pulsar topic and writes
out the metrics as remote_write request to cortex.

TODOs:
- [ ] Verify error behaviour and ensure those cases have tests
- [ ] Provide integration tests (esp. for tenant ID forwarding)

Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Only check once checkPeriod is reached or more samples than MaxBatchSize
have been received.

Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
integration/consume_integration_test.go Outdated Show resolved Hide resolved
pkg/pulsar/pulsar.go Outdated Show resolved Hide resolved
pkg/remote/write.go Outdated Show resolved Hide resolved
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
Signed-off-by: Christian Simon <simon@swine.de>
@simonswine simonswine merged commit a63899c into master Dec 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants