
Allow for loki-canary to generate a percentage of out of order log lines #4126

Merged (6 commits into grafana:main, Aug 20, 2021)

Conversation

@cstyan (Contributor) commented Aug 9, 2021

Allows loki-canary to generate a percentage of log lines with out-of-order timestamps; both the percentage and the out-of-order time range are configurable.

Signed-off-by: Callum Styan <callumstyan@gmail.com>

@cstyan cstyan requested a review from a team as a code owner August 9, 2021 22:44
@owen-d (Member) left a comment

Some nits then lgtm

@@ -42,6 +42,10 @@ func main() {
queryTimeout := flag.Duration("query-timeout", 10*time.Second, "How long to wait for a query response from Loki")

interval := flag.Duration("interval", 1000*time.Millisecond, "Duration between log entries")
outOfOrderPercentage := flag.Int("out-of-order-percentage", 0, "Percentage (0-100) of log entries that should be sent out of order.")
outOfOrderMin := flag.Int("out-of-order-min", 30, "Minimum amount of time to go back for out of order entries (in seconds).")
@owen-d (Member):

Let's make these time.Duration instead of an int measuring seconds for ease of use.

@cstyan (Contributor, Author):

thought this would make the math ugly later on, but it's fine since <time.Duration>.Seconds() exists, so done
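
For reference, a minimal sketch of what the duration-based flags look like after this change; the out-of-order-max flag and the default values here are assumptions based on the flag descriptions in the diff and the out-of-order-max discussion further down:

package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// The percentage stays an int; the time bounds become time.Duration as suggested.
	outOfOrderPercentage := flag.Int("out-of-order-percentage", 0, "Percentage (0-100) of log entries that should be sent out of order.")
	outOfOrderMin := flag.Duration("out-of-order-min", 30*time.Second, "Minimum amount of time to go back for out of order entries.")
	outOfOrderMax := flag.Duration("out-of-order-max", 60*time.Second, "Maximum amount of time to go back for out of order entries.")
	flag.Parse()
	fmt.Println(*outOfOrderPercentage, *outOfOrderMin, *outOfOrderMax)
}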

@@ -58,6 +65,10 @@ func (w *Writer) run() {
select {
case <-t.C:
t := time.Now()
if i := rand.Intn(99) + 1; i <= w.outOfOrderPercentage {
@owen-d (Member) commented Aug 10, 2021

Suggested change
if i := rand.Intn(99) + 1; i <= w.outOfOrderPercentage {
if i := rand.Intn(100) + 1; i <= w.outOfOrderPercentage {

I think this should be Intn(100) due to inclusivity ([0,n)).

edit: Oh I see, were you using 99 then adding one in order to ensure it was at least one second behind? I see the appeal, but I think it's unnecessarily complex as we're only guaranteeing it's older than time.Now(), not necessarily the log line previously sent.

@cstyan (Contributor, Author):

No, you're right, it should be 100; I misread the notation.

This rng is to decide whether the log line should be out of order in the first place, not how much time to subtract.
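
Putting the two pieces together, a sketch of the writer-side logic under discussion; the Writer field names mirror the flags, and the [min, max) subtraction range is an assumption based on the flag descriptions:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

type Writer struct {
	outOfOrderPercentage int
	outOfOrderMin        time.Duration
	outOfOrderMax        time.Duration
}

func (w *Writer) entryTimestamp() time.Time {
	t := time.Now()
	// rand.Intn(100) is uniform on [0,100), so i is uniform on [1,100] and
	// the comparison fires with probability outOfOrderPercentage/100.
	if i := rand.Intn(100) + 1; i <= w.outOfOrderPercentage {
		// Push the timestamp back by a random duration in [min, max).
		// Assumes outOfOrderMax > outOfOrderMin.
		n := time.Duration(rand.Int63n(int64(w.outOfOrderMax - w.outOfOrderMin)))
		t = t.Add(-(w.outOfOrderMin + n))
	}
	return t
}

func main() {
	w := &Writer{outOfOrderPercentage: 50, outOfOrderMin: 30 * time.Second, outOfOrderMax: 60 * time.Second}
	fmt.Println(w.entryTimestamp())
}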

pkg/querier/queryrange/cache.go: outdated review thread (resolved)
@slim-bean changed the title from "Allow for loki-canary to generate a percentage of log lines" to "Allow for loki-canary to generate a percentage of out of order log lines" on Aug 10, 2021
pkg/canary/writer/writer.go: outdated review thread (resolved)
@owen-d (Member) left a comment

I ran this locally and think there are a few pieces missing. Notably, the reader and comparator currently expect the data to be ordered and will likely need to be refactored to support unordered writes. This may end up being a nontrivial refactor, but worthwhile.

@cstyan (Contributor, Author) commented Aug 12, 2021

Without unordered writes enabled, we get log lines like this on stderr in the canary:

websocket missing entry: 1628791469549189937
websocket missing entry: 1628791474548965353
websocket missing entry: 1628791484548915584
websocket missing entry: 1628791497549229427
websocket missing entry: 1628791501548969567
websocket missing entry: 1628791513549245135
websocket missing entry: 1628791524549176662
websocket missing entry: 1628791518549201846
websocket missing entry: 1628791520549661233
websocket missing entry: 1628791527548708454
websocket missing entry: 1628791550548754504
websocket missing entry: 1628791550548966003
out of order entry 2021-08-12 11:07:55.548713535 -0700 PDT m=+182.013106001 was received before entries: [2021-08-12 11:06:54.54871262 -0700 PDT m=+121.013105152 2021-08-12 11:07:09.548973316 -0700 PDT m=+136.013365782]
out of order entry 2021-08-12 11:07:56.548824291 -0700 PDT m=+183.013216736 was received before entries: [2021-08-12 11:06:54.54871262 -0700 PDT m=+121.013105152 2021-08-12 11:07:09.548973316 -0700 PDT m=+136.013365782]
succesfull query!
out of order entry 2021-08-12 11:07:57.548959794 -0700 PDT m=+184.013352238 was received before entries: [2021-08-12 11:06:54.54871262 -0700 PDT m=+121.013105152 2021-08-12 11:07:09.548973316 -0700 PDT m=+136.013365782]
out of order entry 2021-08-12 11:07:58.548764825 -0700 PDT m=+185.013157267 was received before entries: [2021-08-12 11:06:54.54871262 -0700 PDT m=+121.013105152 2021-08-12 11:07:09.548973316 -0700 PDT m=+136.013365782]

The first block is from the pruneEntries routine, which checks for each log line generated by the canary by running a query for that line's timestamp +/- 10 seconds. Note that the query is run after those log lines were written. This works because the tailing routine run by the reader code/websocket connection sends every timestamp it receives over the recv channel, and the comparator removes those timestamps from entries. So pruneEntries/confirmMissing only queries for logs we haven't yet seen come in over the tail connection. If the queries return some of the logs we're missing, those are removed from the missingEntries slice, and eventually missing entries that are too old are removed as well.
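
A sketch of that confirmMissing shape, with an illustrative query callback standing in for the canary's actual Loki query; the aging-out of entries that are too old is omitted here:

package canary

import "time"

// confirmMissing re-queries Loki around each missing entry's timestamp
// (+/- 10 seconds) and keeps only the entries the query still can't find.
func confirmMissing(query func(start, end time.Time) []time.Time, missing []time.Time) []time.Time {
	var stillMissing []time.Time
	for _, ts := range missing {
		found := false
		for _, got := range query(ts.Add(-10*time.Second), ts.Add(10*time.Second)) {
			if got.Equal(ts) {
				found = true
				break
			}
		}
		if !found {
			stillMissing = append(stillMissing, ts)
		}
	}
	return stillMissing
}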

The second block is from the comparator's entryReceived code, which is called for each entry sent over the channel from the tailing routine. This is where we remove log lines from entries as we see each timestamp make it to Loki. The out of order entry message is printed because, when some timestamps arrived, the next timestamp we actually expected to receive was the out-of-order one, but it had been rejected by Loki.

c.entries = pruneList(c.entries,
        func(_ int, t *time.Time) bool {
	        return ts.Equal(*t)
        },
        func(i int, t *time.Time) {
	        matched = true
	        // If this isn't the first item in the list we received it out of order
	        if i != 0 {
		        outOfOrderEntries.Inc()
		        fmt.Fprintf(c.w, ErrOutOfOrderEntry, t, c.entries[:i])
	        }
	        responseLatency.Observe(time.Since(ts).Seconds())
	        // Put this element in the acknowledged entries list so we can use it to check for duplicates
	        c.ackdEntries = append(c.ackdEntries, c.entries[i])
        })

We don't see log lines from either of the above example blocks if unordered_writes_enabled is set to true.

EDIT: also note that we need promtail pipeline stages to use the generated timestamp from the canary's log lines as the actual timestamp; otherwise all the log lines are in order as far as Loki is concerned.
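
A sketch of what such a promtail pipeline could look like, assuming the canary writes the nanosecond timestamp as the first space-separated field of each line; the regex, paths, and labels here are illustrative, not the exact config added to the docs:

scrape_configs:
  - job_name: canary
    static_configs:
      - targets: [localhost]
        labels:
          job: canary
          __path__: /var/log/canary.log
    pipeline_stages:
      # Pull the leading nanosecond timestamp out of the canary's log line.
      - regex:
          expression: '^(?P<ts>\d+) '
      # Use it as the entry timestamp so Loki sees the generated
      # (possibly out of order) time instead of the scrape time.
      - timestamp:
          source: ts
          format: UnixNs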

cstyan and others added 2 commits August 12, 2021 11:44
@owen-d (Member) commented Aug 13, 2021

If the queries return some of the logs we're missing those are removed from the missingEntries slice, and eventually missing entries that are too old are removed as well.

We should ensure that the out-of-order-max < spot-check-max.
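
A minimal sketch of that check, assuming both values are parsed as time.Duration flags; the spot-check-max default and the error wording are assumptions:

package main

import (
	"flag"
	"log"
	"time"
)

func main() {
	outOfOrderMax := flag.Duration("out-of-order-max", 60*time.Second, "Maximum amount of time to go back for out of order entries.")
	spotCheckMax := flag.Duration("spot-check-max", 4*time.Hour, "How far back to query when spot checking old entries.")
	flag.Parse()
	// An out-of-order entry older than spot-check-max could never be
	// confirmed by the spot-check queries, so it would be reported missing.
	if *outOfOrderMax >= *spotCheckMax {
		log.Fatalf("-out-of-order-max (%v) must be less than -spot-check-max (%v)", *outOfOrderMax, *spotCheckMax)
	}
}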

Also, the equality matcher in pruneEntries is a little worrisome as it only checks the timestamp. Since we're now creating entries that may be older than time.Now(), it's possible, albeit unlikely (nanosecond collision) that we see duplicates here. We can likely leave that for a followup PR if necessary.

@slim-bean (Collaborator):

I think we should update the docs to include information about these changes and make it explicit that you need to configure promtail time extraction from the log line for this to work.

Aside from adding some docs this LGTM!

@owen-d merged commit 3ee8ed3 into grafana:main on Aug 20, 2021
@owen-d (Member) commented Aug 30, 2021

ref #1544
