WAL Checkpoint holds deleted series for 1 extra compaction cycle #12286

Open · bboreham opened this issue Apr 22, 2023 · 4 comments

bboreham (Member) commented Apr 22, 2023

EDIT: the problem has been substantially improved by #12297; check out the write-up there for details.

What did you do?

I observed a Prometheus where prometheus_tsdb_head_series varied from 4 million to 7 million over each compaction cycle.

The number of series in the WAL checkpoint is not observable via a metric, so I downloaded the checkpoint and used the following code, adapted from TestReadCheckpointMultipleSegments, to count the series:

wt := newWriteToMock()
watcher := NewWatcher(wMetrics, nil, nil, "", wt, dir, false, false)
watcher.MaxSegment = -1
watcher.setMetrics()

// Locate the latest checkpoint under the WAL directory, replay it into the
// mock write target, and count the distinct series records it contains.
lastCheckpoint, _, err := LastCheckpoint(watcher.walDir)
if err != nil {
	panic(err)
}
if err := watcher.readCheckpoint(lastCheckpoint, (*Watcher).readSegment); err != nil {
	panic(err)
}
fmt.Println(len(wt.seriesSegmentIndexes))
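
(Note: newWriteToMock and setMetrics are unexported test helpers, so this snippet has to be built alongside the watcher tests inside Prometheus's tsdb/wlog package.)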

What did you expect to see?

About 7 million series in the WAL checkpoint (same as the max observed number of series).

Reasoning: a WAL checkpoint is generated every head compaction* (= 2 hours with default settings) and the checkpoint covers about the same time, so the checkpoint should have about the same number of series as I observe in the head.

* unless the amount of data being collected is small, in which case it's every 2 head compactions.

What did you see instead? Under which circumstances?

The WAL checkpoint had 18 million series in it.

Prometheus version

I was looking at 2.41, but further tests show the same thing happens in 2.43.
bboreham (Member Author) commented:

I tried to write out step-by-step what happens to cause this.
Illustrated is a Prometheus TSDB that has been collecting data since 10:00 UTC.
WAL segments are named A, B, C, ...

Consider a series foo which received a few samples at 10:15, then stopped.
The samples for series foo are in WAL segment C.

           10:00        12:00
Head       ┌───────────────────┐
           └───────────────────┘
WAL        A- B- C- D- E-- F- G-

At approx 13:00, head compaction runs. A block is created from 10-12, and
that data is dropped from the head. Series foo is garbage-collected, but
the head notes in its 'deleted' map that it might be needed until WAL segment G has
been dropped.

A WAL checkpoint is created from the first two thirds of the segments, A-D.
This checkpoint has no samples, since any samples before 12:00 are excluded.
WAL segments A-D are removed from disk.

           10:00        12:00
Head                    ┌──────┐
                        └──────┘
Blocks     ┌────────────┐
           └────────────┘
WAL                    E-- F- G-
Checkpoint            X
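
For illustration, here is a minimal sketch in Go of the bookkeeping described above (the names live, deleted, keep and cleanDeleted are illustrative, not the actual head code): a garbage-collected series is remembered in the 'deleted' map together with the last WAL segment that existed when it was dropped; a checkpoint keeps every series that is still live or still tracked in that map; and the map is only cleaned of entries pointing to segments before the first checkpointed segment.

package sketch

type seriesRef uint64

// head models only the two pieces of state relevant to this issue.
type head struct {
	live    map[seriesRef]bool // series still present in the head
	deleted map[seriesRef]int  // GC'd series -> last WAL segment in existence when it was dropped
}

// gcSeries records that a series was garbage-collected while lastSegment was
// the newest WAL segment (segment G in the example above).
func (h *head) gcSeries(ref seriesRef, lastSegment int) {
	delete(h.live, ref)
	h.deleted[ref] = lastSegment
}

// keep is the predicate applied when writing a checkpoint: a series record is
// carried forward if the series is still live or still tracked in 'deleted'.
func (h *head) keep(ref seriesRef) bool {
	if h.live[ref] {
		return true
	}
	_, ok := h.deleted[ref]
	return ok
}

// cleanDeleted runs after a checkpoint covering segments [first, last] is
// written: only entries pointing to a segment before 'first' are removed, so
// series foo (tracked until G) survives the 15:00 cleanup where first = E.
func (h *head) cleanDeleted(first int) {
	for ref, seg := range h.deleted {
		if seg < first {
			delete(h.deleted, ref)
		}
	}
}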

After two more hours, the head and WAL have built up:

           10:00        12:00        14:00
Head                    ┌───────────────────┐
                        └───────────────────┘
Blocks     ┌────────────┐
           └────────────┘
WAL                    E-- F- G- H I- J-- K-
Checkpoint            X

At approx 15:00, head compaction runs again.
A WAL checkpoint is created covering segments E-H.
Series foo is retained in the checkpoint, since it is in the 'deleted' map.
The 'deleted' map is cleaned of any series whose entry points to a segment before 'E', so series foo (tracked until segment G) remains in the map.

           10:00        12:00        14:00
Head                                 ┌──────┐
                                     └──────┘
Blocks     ┌────────────┬────────────┐
           └────────────┴────────────┘
WAL                                I- J-- K-
Checkpoint                        X

After two more hours, the head and WAL have built up:

           10:00        12:00        14:00        16:00
Head                                 ┌───────────────────┐
                                     └───────────────────┘
Blocks     ┌────────────┬────────────┐
           └────────────┴────────────┘
WAL                                I- J-- K- L M N-- O- P-
Checkpoint                        X

At approx 17:00, head compaction runs again.
A WAL checkpoint is created covering segments I-L.
Series foo is retained in the checkpoint, since it is still in the 'deleted' map.
Now, series foo is dropped from the 'deleted' map since segment G is before I.

           10:00        12:00        14:00        16:00
Head                                              ┌──────┐
                                                  └──────┘
Blocks     ┌────────────┬────────────┬────────────┐
           └────────────┴────────────┴────────────┘
WAL                                            M N-- O- P-
Checkpoint                                    X

Only at the next head compaction at 19:00 will series foo be dropped from
the checkpoint, since it is no longer in the 'deleted' map.

In this way, a series which stopped receiving data shortly after 10:15 is retained in the WAL until 19:00.
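
To make this concrete, here is a toy, self-contained simulation of the sequence above, reflecting the behaviour described in this comment (i.e. before #12297); segment letters are replaced with numbers A=0 … P=15, and the names are mine, not the real code:

package main

import "fmt"

func main() {
	// 'deleted' maps a GC'd series to the last WAL segment that existed when
	// it was dropped from the head.
	deleted := map[string]int{}

	// 13:00 compaction: foo is garbage-collected; the newest segment is G (6).
	deleted["foo"] = 6

	// checkpoint reports whether foo is carried into a checkpoint whose first
	// covered segment is 'first', then cleans entries pointing before 'first'.
	checkpoint := func(first int) bool {
		_, kept := deleted["foo"]
		for ref, seg := range deleted {
			if seg < first {
				delete(deleted, ref)
			}
		}
		return kept
	}

	fmt.Println("15:00 checkpoint keeps foo:", checkpoint(4))  // first = E (4): still tracked, kept
	fmt.Println("17:00 checkpoint keeps foo:", checkpoint(8))  // first = I (8): kept, then dropped from the map
	fmt.Println("19:00 checkpoint keeps foo:", checkpoint(12)) // first = M (12): finally gone
}

Running it prints true, true, false: foo is still carried by the 15:00 and 17:00 checkpoints and only disappears at 19:00.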

bboreham changed the title from "WAL Checkpoint holds series from too far back" to "WAL Checkpoint holds deleted series for 3 compaction cycles" on Apr 24, 2023
jesusvazquez (Member) commented Apr 24, 2023

This is a nice find. We probably have to dive in to see why this is happening. Just want to leave a comment here saying that whatever is affecting the WAL is probably affecting the WBL too, since they both rely on the same implementation with minor differences.

bboreham (Member Author) commented:

Thanks @jesusvazquez: I don't see anything about a checkpoint in the WBL code I looked at:

prometheus/tsdb/head.go

Lines 1247 to 1261 in 5442d7e

func (h *Head) truncateOOO(lastWBLFile int, minOOOMmapRef chunks.ChunkDiskMapperRef) error {
	curMinOOOMmapRef := chunks.ChunkDiskMapperRef(h.minOOOMmapRef.Load())
	if minOOOMmapRef.GreaterThan(curMinOOOMmapRef) {
		h.minOOOMmapRef.Store(uint64(minOOOMmapRef))
		if err := h.truncateSeriesAndChunkDiskMapper("truncateOOO"); err != nil {
			return err
		}
	}
	if h.wbl == nil {
		return nil
	}
	return h.wbl.Truncate(lastWBLFile)
}

Is there some other mechanism to ensure there is a series record for all samples in the remaining part of the WBL?

BTW it would be nice to have the WBL mentioned in the docs https://github.com/prometheus/prometheus/blob/a0f7c31c2666dc45f8006ee66395b5409a59a2b9/tsdb/docs/

bboreham (Member Author) commented May 1, 2023

Following up my last comment: it turns out every sample in the WBL is first written to the WAL, so the series records in the WAL will work for the WBL too.

Now that #12297 is merged, the problem is reduced: we are holding series for 1 extra compaction cycle.
In terms of the example above, the record of series foo will now be removed at 15:00.

bboreham changed the title from "WAL Checkpoint holds deleted series for 3 compaction cycles" to "WAL Checkpoint holds deleted series for 1 extra compaction cycle" on May 1, 2023