Add consume mode #34

simonswine · 2020-11-05T17:08:00Z

In consume mode, the adapter subscribes to a Pulsar topic and writes
the metrics out to Cortex as remote_write requests.

TODOs:

Verify error behaviour and ensure those cases have tests
Provide integration tests (esp. for tenant ID forwarding)

An image is available:

simonswine/prometheus-pulsar-remote-write:consume-mode

replay

mostly just minor nits

replay · 2020-12-04T13:22:58Z

pkg/app/consume.go

+	// TODO: Not too sure how relevant it is for consuming from the bus
+	//clientOptions.OperationTimeout = p.readTimeout


Is this comment still relevant? I can't see the variables clientOptions or p anywhere, so it looks like that line is outdated now anyway.

I think it's still relevant, something to ask when I am speaking to a pulsar expert next: 6b31f27

replay · 2020-12-04T13:46:27Z

pkg/remote/write.go

+
+	// this is set when a retry able error happend
+	errRemoteWriteRetryable := false
+	blockingSampleCh := make(chan pulsar.ReceivedSample)


I'm not sure I understand the purpose of the blockingSampleCh...

So when a retryable error happens on remote write then:

errRemoteWriteRetryable gets set to true on :182

on the next iteration sample() gets called again on :131

because errRemoteWriteRetryable is true, sample() returns blockingSampleCh

the select at the beginning of the loop will then wait for blockingSampleCh to yield an object or the ticker to tick or the ctx to get cancelled

most likely the next event will then be the ticker ticking, causing a retry

So is the desired function of blockingSampleCh to introduce a short wait time between retries? If that's the case, wouldn't we want to sleep only before sending the next sample of the tenant for which the error occurred (since limits are often tenant-specific)? In the current implementation, wouldn't that sleep block all tenants?

Actually, even if it's true that a retryable error blocks the forwarding of data for all tenants, instead of only for the ones which yielded the error, I don't think this issue is critical enough to block the merging of this PR. So I'll just approve anyway to not block you unnecessarily.

Very good point you are raising here. Currently, when an error happens for only a single tenant on the remote write end of the adapter, we would still receive messages from Pulsar for that tenant, as they all come from the same queue and we have no way to tell the tenant.

I think there would be more intelligent ways of handling that, e.g.:

Nack messages for blocked tenants, but that would mean we need to have a good resumption strategy to avoid out-of-order samples.

Having a different queue per tenant in Pulsar

For now I would like to keep that simple behaviour and figure out what would be the best way forward in terms of real world error cases.

I have improved the comment in code to reflect that

* pulsar-client-go v0.2.0
* prometheus v2.22.0

Signed-off-by: Christian Simon <simon@swine.de>

In the consume mode the adapter subscribes to a pulsar topic and writes out the metrics as remote_write request to cortex.

TODOs:
- [ ] Verify error behaviour and ensure those cases have tests
- [ ] Provide integration tests (esp. for tenant ID forwarding)

Signed-off-by: Christian Simon <simon@swine.de>

Signed-off-by: Christian Simon <simon@swine.de>

Only check once checkPeriod is reached or more samples than MaxBatchSize have been received.

Signed-off-by: Christian Simon <simon@swine.de>

Signed-off-by: Christian Simon <simon@swine.de>

Signed-off-by: Christian Simon <simon@swine.de>

simonswine force-pushed the add-consume-mode branch 3 times, most recently from c57e155 to 328609b Compare November 6, 2020 14:40