Fix data corruption in remote write if max_sample_age is applied #14078

Open

wants to merge 1 commit into main

Conversation

@FUSAKLA (Contributor) commented May 10, 2024

/fixes #13979

PR #13002 added the option to drop old samples in remote write.

Unfortunately, we ran into a bug causing data corruption (metrics with labels getting merged with other metrics' labels was the most obvious symptom), reported in #13979.

After trying to reproduce the issue, it turned out to be strictly tied to situations where retries happen and the filter in buildTimeSeries is applied.

We investigated, and it appears that the issue is in the newly added buildTimeSeries function, which modifies the timeSeries argument in place, causing the corruption.

The suggested change, which avoids modifying the original timeSeries, seems to fix the issue, but it is naive and I am not sure how optimal it is.
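To make the suspected failure mode concrete, here is a minimal Go sketch (assumed, simplified types and names; not the actual Prometheus code) contrasting an in-place filter, which reuses the caller's backing array, with a copy-based filter that leaves the original buffer intact for a retry:

```go
// Filtering in place reuses the caller's backing array, so a retry that
// re-reads the original buffer sees partially overwritten entries; copying
// into a fresh slice leaves the caller's data untouched.
package main

import "fmt"

type series struct {
	labels    string
	timestamp int64
}

// filterInPlace compacts kept elements to the front of the same backing
// array, overwriting entries the caller may still reference on a retry.
func filterInPlace(ts []series, keep func(series) bool) []series {
	out := ts[:0]
	for _, s := range ts {
		if keep(s) {
			out = append(out, s) // writes into ts's backing array
		}
	}
	return out
}

// filterCopy allocates a new slice, leaving the original buffer intact.
func filterCopy(ts []series, keep func(series) bool) []series {
	out := make([]series, 0, len(ts))
	for _, s := range ts {
		if keep(s) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	keepRecent := func(s series) bool { return s.timestamp > 1 } // stand-in for the sample age check

	buf := []series{{"a", 1}, {"b", 2}, {"c", 3}}
	filterInPlace(buf, keepRecent)
	fmt.Println(buf) // [{b 2} {c 3} {c 3}] -- the buffer a retry would reuse is now corrupted

	buf = []series{{"a", 1}, {"b", 2}, {"c", 3}}
	filterCopy(buf, keepRecent)
	fmt.Println(buf) // [{a 1} {b 2} {c 3}] -- unchanged, safe to re-filter on retry
}
```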

Signed-off-by: Martin Chodur <martin.chodur@firma.seznam.cz>
@FUSAKLA (Contributor, Author) commented May 10, 2024

Unfortunately, I have had no luck yet writing a test that would cover this case.

However, it is possible to reproduce the bug with 100% "success":

Run Prometheus locally with the following config:

global:
  scrape_interval: 5s
  external_labels:
    __replica__: prometheus-local-test-0
    cluster: local-test
    mimir_cluster_id: test-cluster

remote_write:
  - url: http://foo.bar # URL of some remote write endpoint with a known IP so it can be blocked
    queue_config:
      max_shards: 1
      min_shards: 1
      batch_send_deadline: 5s
      capacity: 100
      sample_age_limit: 30s
    metadata_config:
      send: true
      send_interval: 1m
      max_samples_per_send: 100

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

  - job_name: node-exporter # running with default configuration
    static_configs:
      - targets:
          - localhost:9100

Once it is running, block communication to the remote write endpoint IP using sudo iptables -I OUTPUT -d <ip> -j DROP.

Wait until you see a context deadline exceeded log in Prometheus, then enable the traffic again with sudo iptables -F OUTPUT.

At this point the issue happens. If you run with a debugger and inspect the pendingData variable in the runShard function, you can find, for example, up metrics with duplicated labels such as job, mimir_cluster_id, etc.

FUSAKLA marked this pull request as ready for review May 10, 2024 22:04
@FUSAKLA (Contributor, Author) commented May 10, 2024

Mentioning @marctc and @tpaschalis since you were the authors of the original code and have the most context; I hope you don't mind.

@cstyan (Member) commented May 15, 2024

Sorry, but for a bug that sounds as bad as what you're describing, we need a test case to ensure we've fixed the problem and that it doesn't happen again. I think the number of functions at play here, and the fact that some of them are closures within other functions, is obscuring the underlying problem and also making things hard to test.

My suggestion would be to add a test that uses a version of the test client that can fail and force a retry, since you're saying this only happens when we need to retry sending an existing write request.

Beyond that, we likely need to refactor some of the code path here so that it's easier to test and less likely to break again.
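As a rough illustration of the suggested test shape, here is a self-contained Go sketch of a client that fails its first send to force a retry; all names (failOnceClient, Store, sendWithRetry) are illustrative stand-ins, not the actual Prometheus test harness:

```go
// A write client that rejects the first attempt so the sender has to retry
// the same batch; the test then compares what was finally received (and the
// buffer itself) against the originally enqueued series.
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// failOnceClient fails its first Store call to force a retry.
type failOnceClient struct {
	calls    int
	received []string
}

func (c *failOnceClient) Store(series []string) error {
	c.calls++
	if c.calls == 1 {
		return errors.New("simulated recoverable error")
	}
	c.received = append([]string(nil), series...)
	return nil
}

// sendWithRetry mimics a sender that applies the age filter before every
// attempt, which is where an in-place filter would corrupt buf.
func sendWithRetry(c *failOnceClient, buf []string, filter func([]string) []string) {
	for {
		if err := c.Store(filter(buf)); err == nil {
			return
		}
	}
}

func main() {
	original := []string{`up{job="a"}`, `up{job="b"}`, `up{job="c"}`}
	buf := append([]string(nil), original...)

	keepAll := func(s []string) []string { return s } // stand-in for the sample-age filter

	c := &failOnceClient{}
	sendWithRetry(c, buf, keepAll)

	// A real test would assert both of these instead of printing them.
	fmt.Println(reflect.DeepEqual(buf, original))        // true: the buffer survived the retry
	fmt.Println(reflect.DeepEqual(c.received, original)) // true: the client got the original series
}
```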

@FUSAKLA (Contributor, Author) commented May 15, 2024

Hi @cstyan, thanks for the answer.

Yes, I agree. We have used this bug fix successfully in our production, since we needed it ASAP, but I'm planning to look into writing the test so I can verify it's really not happening anymore (it's hard to find the corrupted data).

I'll ping you once I manage to reproduce it, if you don't mind.

@cstyan (Member) commented May 15, 2024

I've pinged Paschalis and Marc internally as well; they're aware of the problem and are looking into it.

We used this bug fix successfully in our production, since we needed it asap

That's good to know; if it becomes pressing we can just move forward with your PR as is. Having a test is still ideal. Part of me feels that while the fix here works, it might not be the correct fix, because it seems like what's happening is that we're modifying the underlying slice of buffered data but not returning a reference to the proper range of the modified slice after a few retries.
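A tiny sketch of that point, under the assumption that the filter compacts in place (simplified, illustrative code only): the in-place filter returns a shorter slice header over the same backing array, and only that returned header is safe to reuse; a retry path that keeps the original full-length header sees compacted entries followed by stale ones.

```go
package main

import "fmt"

// compact keeps elements matching keep, reusing nums' backing array.
func compact(nums []int, keep func(int) bool) []int {
	out := nums[:0]
	for _, n := range nums {
		if keep(n) {
			out = append(out, n)
		}
	}
	return out // the caller must switch to this header
}

func main() {
	buf := []int{1, 2, 3, 4}
	kept := compact(buf, func(n int) bool { return n%2 == 0 })
	fmt.Println(kept) // [2 4]     -- the proper range for the modified slice
	fmt.Println(buf)  // [2 4 3 4] -- the stale view a retry might still hold
}
```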

I'll ping you once I manage to reproduce it, if you don't mind.

👍

Successfully merging this pull request may close these issues.

Corrupting data written to remote storage in case sample_age_limit is hit