
pkg/chunkenc: ignore duplicate lines pushed to a stream #1519

Merged · 2 commits · Jan 14, 2020

Conversation

@rfratto (Member) commented Jan 14, 2020

This commit changes the behavior of appending to a stream to ignore an incoming line if its timestamp and contents match the previous line received. This should reduce the chances of ingesters storing duplicate log lines when clients retry push requests whenever a 500 is returned.

Fixes #1517.
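For illustration, a minimal Go sketch of the duplicate check described above. The `stream` and `entry` types here are hypothetical stand-ins, not Loki's actual chunkenc types:

```go
package main

import (
	"fmt"
	"time"
)

// entry is a hypothetical stand-in for a single log line.
type entry struct {
	ts   time.Time
	line string
}

// stream is a hypothetical stand-in for an in-memory stream.
type stream struct {
	entries []entry
}

// append drops an incoming entry whose timestamp and contents both
// match the previously received entry, and stores it otherwise.
func (s *stream) append(e entry) bool {
	if n := len(s.entries); n > 0 {
		last := s.entries[n-1]
		if last.ts.Equal(e.ts) && last.line == e.line {
			return false // exact duplicate: ignore
		}
	}
	s.entries = append(s.entries, e)
	return true
}

func main() {
	s := &stream{}
	t0 := time.Now()
	fmt.Println(s.append(entry{t0, "hello"})) // true: stored
	fmt.Println(s.append(entry{t0, "hello"})) // false: duplicate ignored
	fmt.Println(s.append(entry{t0, "world"})) // true: same timestamp, different content
}
```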

@slim-bean (Collaborator) commented:

Instead of ignoring, should we consider fudging the timestamp?

@slim-bean (Collaborator) commented:

In addition to not applying to flushed chunks, this also doesn't apply to the last entry in the previous block, though we could probably address that?

@rfratto (Member, Author) commented Jan 14, 2020

I'm not sure I understand why we'd want to do that? Consider what happens today:

  1. A distributor pushes to ingesters A, B, C.
  2. Ingester A accepts the log lines.
  3. Ingesters B and C have some intermittent 500 error.
  4. This 500 error is propagated back to Promtail, which backs off and retries the request.
  5. B and C now accept the new log lines.
  6. A returns OutOfOrder errors for all lines except for the most recent, which it accepts as valid and appends to the stream, even though it already has it.

There are a few issues with this:

  1. It's a duplicate log line that will show up in query results (I think, since it's coming from one ingester)
  2. It causes the chunks for A, B, and C to be desynced as A has one more entry than B and C. Two different chunks will be flushed to the backing store.
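To make the desync concrete, here's a toy Go simulation of that sequence, assuming pre-fix behavior where only strictly older entries are rejected as out of order (all types here are invented for the example):

```go
package main

import "fmt"

// entry is an invented stand-in: a timestamp plus line content.
type entry struct {
	ts   int64
	line string
}

type ingester struct {
	entries []entry
}

// push mimics the pre-fix behavior: entries strictly older than the
// last stored one are rejected as out of order, everything else is
// appended, including an entry identical to the last one.
func (i *ingester) push(batch []entry) {
	for _, e := range batch {
		if n := len(i.entries); n > 0 && e.ts < i.entries[n-1].ts {
			continue // OutOfOrder error in real Loki
		}
		i.entries = append(i.entries, e)
	}
}

func main() {
	batch := []entry{{1, "a"}, {2, "b"}, {3, "c"}}
	a, b := &ingester{}, &ingester{}

	a.push(batch) // step 2: A accepts the batch
	// steps 3-4: B returns a 500, so the client retries the whole batch
	a.push(batch) // step 6: A re-appends {3, "c"}
	b.push(batch) // step 5: B accepts the retried batch

	fmt.Println("A:", len(a.entries)) // 4: one duplicate entry
	fmt.Println("B:", len(b.entries)) // 3: A and B are now desynced
}
```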

@rfratto (Member, Author) commented Jan 14, 2020

> In addition to not applying to flushed chunks, this also doesn't apply to the last entry in the previous block, though we could probably address that?

We could; I didn't bother because it requires decompressing the entire chunk. It probably wouldn't be needed that often, but it sounds like a non-zero performance hit, right?

@slim-bean (Collaborator) commented:

There's a different case here: the client sends two different log lines with the same timestamp. That's the case where I think we could consider fudging the timestamp for them.

@rfratto (Member, Author) commented Jan 14, 2020

Yeah, I don't know, that sounds hard to get right. What if all their logs are only precise to the second and they send 50 log lines with the same timestamp? We'd have to track the fudged timestamp over time and keep that running across cut chunks.

@slim-bean (Collaborator) commented:

I think this is why nobody has addressed this before: currently we would store these logs for them, and this would start dropping them.

@rfratto (Member, Author) commented Jan 14, 2020

This only drops the lines if the content matches the last stored line, though?

@rfratto (Member, Author) commented Jan 14, 2020

FWIW, Cortex drops all duplicates with the same timestamp, even if the value doesn't match. I don't think that's the right thing for us to do, though.

@slim-bean (Collaborator) commented:

> This only drops the lines if the content matches the last stored line, though?

Ah right, yeah, this would be beneficial and less risky/impactful.

@rfratto (Member, Author) commented Jan 14, 2020

After talking to Ed, I'm going to enhance this commit to work across flushes and cut chunks by tracking the most recently written line and timestamp per stream.
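A rough sketch of that approach: the last accepted entry is tracked on the stream itself rather than in the head block, so the duplicate check survives cut chunks and flushes. Names here are illustrative, not Loki's real ones:

```go
package main

import (
	"fmt"
	"time"
)

// lastEntry records the most recently written line and timestamp.
type lastEntry struct {
	ts   time.Time
	line string
}

type stream struct {
	last *lastEntry // lives on the stream, so it persists across cut chunks
}

func (s *stream) push(ts time.Time, line string) bool {
	if s.last != nil && s.last.ts.Equal(ts) && s.last.line == line {
		return false // duplicate of the most recent accepted entry
	}
	// ... append to the current head block here ...
	s.last = &lastEntry{ts: ts, line: line}
	return true
}

// cutChunk deliberately leaves s.last untouched, so a duplicate
// arriving right after a cut is still detected.
func (s *stream) cutChunk() {}

func main() {
	s := &stream{}
	t0 := time.Unix(0, 0)
	fmt.Println(s.push(t0, "a")) // true: stored
	s.cutChunk()
	fmt.Println(s.push(t0, "a")) // false: still caught after the cut
}
```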

@rfratto force-pushed the ignore-duplicate-lines-append branch from 4d9884b to 3c6a05f on January 14, 2020 at 17:38
@rfratto changed the title from "pkg/chunkenc: ignore duplicate lines within a head block" to "pkg/chunkenc: ignore duplicate lines pushed to a stream" on Jan 14, 2020
The commit message reads:

    This commit changes the behavior of appending to a stream to ignore
    an incoming line if its timestamp and contents match the previous line
    received. This should reduce the chances of ingesters storing duplicate
    log lines when clients retry push requests whenever a 500 is returned.

    Fixes grafana#1517
@rfratto force-pushed the ignore-duplicate-lines-append branch from 3c6a05f to ad8bcde on January 14, 2020 at 17:38
@rfratto (Member, Author) commented Jan 14, 2020

@slim-bean PTAL, this should cover the edge cases better now.

@slim-bean (Collaborator) left a review:

LGTM!

@rfratto merged commit 6501321 into grafana:master on Jan 14, 2020
@rfratto deleted the ignore-duplicate-lines-append branch on January 14, 2020 at 19:17
@maxdialpad commented:
We're running into a similar issue, except the exact same timestamp applies to two legitimately different log lines.
For example, I have my_service, which is outputting something like this in my_service.log:

2022-03-28T00:00:00.000Z 1Starting a new channel
2022-03-28T00:00:00.000Z 2Doing something else. Yes this is the exact same timestamp
2022-03-28T00:00:00.015Z 3Doing some other thing

In Loki, we may see the logs ordered like so

2022-03-28T00:00:00.000Z 2Doing something else. Yes this is the exact same timestamp
2022-03-28T00:00:00.000Z 1Starting a new channel
2022-03-28T00:00:00.015Z 3Doing some other thing

I think this could be solved in a more traditional index-heavy log aggregator by incorporating an offset label (or index) that can essentially act as a tie-breaker in these scenarios. The offset is the byte offset in the original log file at which a given log line starts.
With Loki, that's tough: the offset label would be extremely high-cardinality, so I don't think that would work smoothly.
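For context, a sketch of how such byte offsets would be derived while reading a log file. This is not an existing Loki or Promtail feature, just an illustration of the idea:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

func main() {
	// Stand-in for my_service.log.
	log := "2022-03-28T00:00:00.000Z 1Starting a new channel\n" +
		"2022-03-28T00:00:00.000Z 2Doing something else\n" +
		"2022-03-28T00:00:00.015Z 3Doing some other thing\n"

	r := bufio.NewReader(strings.NewReader(log))
	offset := 0 // byte offset at which the current line starts
	for {
		line, err := r.ReadString('\n')
		if line != "" {
			// The (timestamp, offset) pair is unique even when
			// timestamps collide, so it could break ties.
			fmt.Printf("offset=%d %s\n", offset, strings.TrimRight(line, "\n"))
			offset += len(line)
		}
		if err != nil {
			break
		}
	}
}
```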

Linked issue: Ingesters should ignore data with the same timestamp and contents as previously received line (#1517)