Fix data corruption in remote write if max_sample_age is applied #14078

Open

wants to merge 1 commit into main

Conversation

@FUSAKLA (Contributor) commented May 10, 2024

/fixes #13979

PR #13002 added the option to drop old samples in remote write.

Unfortunately, we ran into a bug causing data corruption (metrics with labels getting merged with other metrics' labels was the most obvious symptom), reported in #13979.

After trying to reproduce the issue, it turned out to be strictly tied to situations where retries happen and the filter in buildTimeSeries is applied.

We investigated, and it appears that the issue is in the newly added buildTimeSeries function, which modifies the timeSeries argument in place, causing the corruption.

The suggested change, which avoids modifying the original timeSeries, seems to fix the issue, but it is naive and I am not sure how optimal it is.
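To make the suspected failure mode concrete, here is a minimal Go sketch (assumed, simplified types and names; not the actual Prometheus code) contrasting an in-place filter, which reuses the caller's backing array, with a copy-based filter that leaves the original buffer intact for a retry:

```go
// Filtering in place reuses the caller's backing array, so a retry that
// re-reads the original buffer sees partially overwritten entries; copying
// into a fresh slice leaves the caller's data untouched.
package main

import "fmt"

type series struct {
	labels    string
	timestamp int64
}

// filterInPlace compacts kept elements to the front of the same backing
// array, overwriting entries the caller may still reference on a retry.
func filterInPlace(ts []series, keep func(series) bool) []series {
	out := ts[:0]
	for _, s := range ts {
		if keep(s) {
			out = append(out, s) // writes into ts's backing array
		}
	}
	return out
}

// filterCopy allocates a new slice, leaving the original buffer intact.
func filterCopy(ts []series, keep func(series) bool) []series {
	out := make([]series, 0, len(ts))
	for _, s := range ts {
		if keep(s) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	keepRecent := func(s series) bool { return s.timestamp > 1 } // stand-in for the sample age check

	buf := []series{{"a", 1}, {"b", 2}, {"c", 3}}
	filterInPlace(buf, keepRecent)
	fmt.Println(buf) // [{b 2} {c 3} {c 3}] -- the buffer a retry would reuse is now corrupted

	buf = []series{{"a", 1}, {"b", 2}, {"c", 3}}
	filterCopy(buf, keepRecent)
	fmt.Println(buf) // [{a 1} {b 2} {c 3}] -- unchanged, safe to re-filter on retry
}
```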

Signed-off-by: Martin Chodur <martin.chodur@firma.seznam.cz>
@FUSAKLA (Contributor, Author) commented May 10, 2024

Unfortunately, I have had no luck yet writing a test that would cover this case.

However, it is possible to reproduce the bug with 100% "success":

Run Prometheus locally with the following config:

global:
  scrape_interval: 5s
  external_labels:
    __replica__: prometheus-local-test-0
    cluster: local-test
    mimir_cluster_id: test-cluster

remote_write:
  - url: http://foo.bar # URL of some remote write endpoint with a known IP so it can be blocked
    queue_config:
      max_shards: 1
      min_shards: 1
      batch_send_deadline: 5s
      capacity: 100
      sample_age_limit: 30s
    metadata_config:
      send: true
      send_interval: 1m
      max_samples_per_send: 100

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090

  - job_name: node-exporter # running with default configuration
    static_configs:
      - targets:
          - localhost:9100

Once it is running, block communication to the remote write endpoint IP using sudo iptables -I OUTPUT -d <ip> -j DROP.

Wait until you see a context deadline exceeded log in Prometheus, then enable the traffic again with sudo iptables -F OUTPUT.

At this point the issue happens. If you run with a debugger and inspect the pendingData variable in the runShard function, you can find, for example, up metrics with duplicated labels such as job, mimir_cluster_id, etc.

FUSAKLA marked this pull request as ready for review May 10, 2024 22:04
@FUSAKLA (Contributor, Author) commented May 10, 2024

Mentioning @marctc and @tpaschalis since you were the authors of the original code and have the most context; I hope you don't mind.

@cstyan (Member) commented May 15, 2024

Sorry, but for a bug that sounds as bad as what you're describing, we need a test case to ensure we've fixed the problem and that it doesn't happen again. I think the number of functions at play here, and the fact that some of them are closures within other functions, is obscuring the underlying problem and also making things hard to test.

My suggestion would be to add a test that uses a version of the test client that can fail and force a retry, since you're saying this only happens when we need to retry sending an existing write request.

Beyond that, we likely need to refactor some of the code path here so that it's easier to test and less likely to break again.
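As a rough illustration of the suggested test shape, here is a self-contained Go sketch of a client that fails its first send to force a retry; all names (failOnceClient, Store, sendWithRetry) are illustrative stand-ins, not the actual Prometheus test harness:

```go
// A write client that rejects the first attempt so the sender has to retry
// the same batch; the test then compares what was finally received (and the
// buffer itself) against the originally enqueued series.
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// failOnceClient fails its first Store call to force a retry.
type failOnceClient struct {
	calls    int
	received []string
}

func (c *failOnceClient) Store(series []string) error {
	c.calls++
	if c.calls == 1 {
		return errors.New("simulated recoverable error")
	}
	c.received = append([]string(nil), series...)
	return nil
}

// sendWithRetry mimics a sender that applies the age filter before every
// attempt, which is where an in-place filter would corrupt buf.
func sendWithRetry(c *failOnceClient, buf []string, filter func([]string) []string) {
	for {
		if err := c.Store(filter(buf)); err == nil {
			return
		}
	}
}

func main() {
	original := []string{`up{job="a"}`, `up{job="b"}`, `up{job="c"}`}
	buf := append([]string(nil), original...)

	keepAll := func(s []string) []string { return s } // stand-in for the sample-age filter

	c := &failOnceClient{}
	sendWithRetry(c, buf, keepAll)

	// A real test would assert both of these instead of printing them.
	fmt.Println(reflect.DeepEqual(buf, original))        // true: the buffer survived the retry
	fmt.Println(reflect.DeepEqual(c.received, original)) // true: the client got the original series
}
```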

@FUSAKLA (Contributor, Author) commented May 15, 2024

Hi @cstyan, thanks for the answer.

Yes, I agree. We have used this bug fix successfully in our production, since we needed it ASAP, but I'm planning to look into writing the test so I can verify it's really not happening anymore (it's hard to find the corrupted data).

I'll ping you once I manage to reproduce it, if you don't mind.

@cstyan (Member) commented May 15, 2024

I've pinged Paschalis and Marc internally as well; they're aware of the problem and are looking into it.

We used this bug fix successfully in our production, since we needed it asap

That's good to know; if it becomes pressing we can just move forward with your PR as is. Having a test is still ideal. Part of me feels that while the fix here works, it might not be the correct fix, because it seems like what's happening is that we're modifying the underlying slice of buffered data but not returning a reference to the proper range of the modified slice after a few retries.
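A tiny sketch of that point, under the assumption that the filter compacts in place (simplified, illustrative code only): the in-place filter returns a shorter slice header over the same backing array, and only that returned header is safe to reuse; a retry path that keeps the original full-length header sees compacted entries followed by stale ones.

```go
package main

import "fmt"

// compact keeps elements matching keep, reusing nums' backing array.
func compact(nums []int, keep func(int) bool) []int {
	out := nums[:0]
	for _, n := range nums {
		if keep(n) {
			out = append(out, n)
		}
	}
	return out // the caller must switch to this header
}

func main() {
	buf := []int{1, 2, 3, 4}
	kept := compact(buf, func(n int) bool { return n%2 == 0 })
	fmt.Println(kept) // [2 4]     -- the proper range for the modified slice
	fmt.Println(buf)  // [2 4 3 4] -- the stale view a retry might still hold
}
```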

I'll ping you once I manage to reproduce it, if you don't mind.

👍

Successfully merging this pull request may close these issues.

Corrupting data written to remote storage in case sample_age_limit is hit