WAL Checkpoint holds deleted series for 1 extra compaction cycle #12286

Open · bboreham opened this issue Apr 22, 2023 · 4 comments

bboreham (Member) commented Apr 22, 2023

EDIT: the problem has been substantially improved by #12297; check out the write-up there for details.

What did you do?

I observed a Prometheus where prometheus_tsdb_head_series varied from 4 million to 7 million over each compaction cycle.

The number of series in the WAL checkpoint is not observable via a metric, so I downloaded the checkpoint and used the following code, adapted from TestReadCheckpointMultipleSegments, to count the series:

wt := newWriteToMock()
watcher := NewWatcher(wMetrics, nil, nil, "", wt, dir, false, false)
watcher.MaxSegment = -1
watcher.setMetrics()

// Locate the latest checkpoint under the WAL directory, replay it into the
// mock write target, and count the distinct series records it contains.
lastCheckpoint, _, err := LastCheckpoint(watcher.walDir)
if err != nil {
	panic(err)
}
if err := watcher.readCheckpoint(lastCheckpoint, (*Watcher).readSegment); err != nil {
	panic(err)
}
fmt.Println(len(wt.seriesSegmentIndexes))
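
(Note: newWriteToMock and setMetrics are unexported test helpers, so this snippet has to be built alongside the watcher tests inside Prometheus's tsdb/wlog package.)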

What did you expect to see?

About 7 million series in the WAL checkpoint (same as the max observed number of series).

Reasoning: a WAL checkpoint is generated every head compaction* (= 2 hours with default settings) and the checkpoint covers about the same time, so the checkpoint should have about the same number of series as I observe in the head.

* unless the amount of data being collected is small, in which case it's every 2 head compactions.

What did you see instead? Under which circumstances?

The WAL checkpoint had 18 million series in it.

Prometheus version

I was looking at 2.41, but further tests show the same thing happens in 2.43.
bboreham (Member Author) commented:

I tried to write out step-by-step what happens to cause this.
Illustrated is a Prometheus TSDB that has been collecting data since 10:00 UTC.
WAL segments are named A, B, C, ...

Consider a series foo which received a few samples at 10:15, then stopped.
The samples for series foo are in WAL segment C.

           10:00        12:00
Head       ┌───────────────────┐
           └───────────────────┘
WAL        A- B- C- D- E-- F- G-

At approx 13:00, head compaction runs. A block is created from 10-12, and
that data is dropped from the head. Series foo is garbage-collected, but
the head notes in its 'deleted' map that it might be needed until WAL segment G has
been dropped.

A WAL checkpoint is created from the first two thirds of the segments, A-D.
This checkpoint has no samples, since any samples before 12:00 are excluded.
WAL segments A-D are removed from disk.

           10:00        12:00
Head                    ┌──────┐
                        └──────┘
Blocks     ┌────────────┐
           └────────────┘
WAL                    E-- F- G-
Checkpoint            X
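
For illustration, here is a minimal sketch in Go of the bookkeeping described above (the names live, deleted, keep and cleanDeleted are illustrative, not the actual head code): a garbage-collected series is remembered in the 'deleted' map together with the last WAL segment that existed when it was dropped; a checkpoint keeps every series that is still live or still tracked in that map; and the map is only cleaned of entries pointing to segments before the first checkpointed segment.

package sketch

type seriesRef uint64

// head models only the two pieces of state relevant to this issue.
type head struct {
	live    map[seriesRef]bool // series still present in the head
	deleted map[seriesRef]int  // GC'd series -> last WAL segment in existence when it was dropped
}

// gcSeries records that a series was garbage-collected while lastSegment was
// the newest WAL segment (segment G in the example above).
func (h *head) gcSeries(ref seriesRef, lastSegment int) {
	delete(h.live, ref)
	h.deleted[ref] = lastSegment
}

// keep is the predicate applied when writing a checkpoint: a series record is
// carried forward if the series is still live or still tracked in 'deleted'.
func (h *head) keep(ref seriesRef) bool {
	if h.live[ref] {
		return true
	}
	_, ok := h.deleted[ref]
	return ok
}

// cleanDeleted runs after a checkpoint covering segments [first, last] is
// written: only entries pointing to a segment before 'first' are removed, so
// series foo (tracked until G) survives the 15:00 cleanup where first = E.
func (h *head) cleanDeleted(first int) {
	for ref, seg := range h.deleted {
		if seg < first {
			delete(h.deleted, ref)
		}
	}
}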

After two more hours, the head and WAL have built up:

           10:00        12:00        14:00
Head                    ┌───────────────────┐
                        └───────────────────┘
Blocks     ┌────────────┐
           └────────────┘
WAL                    E-- F- G- H I- J-- K-
Checkpoint            X

At approx 15:00, head compaction runs again.
A WAL checkpoint is created covering segments E-H.
Series foo is retained in the checkpoint, since it is in the 'deleted' map.
The 'deleted' map is cleaned of any series whose entry points to a segment before 'E', so series foo (tracked until segment G) remains in the map.

           10:00        12:00        14:00
Head                                 ┌──────┐
                                     └──────┘
Blocks     ┌────────────┬────────────┐
           └────────────┴────────────┘
WAL                                I- J-- K-
Checkpoint                        X

After two more hours, the head and WAL have built up:

           10:00        12:00        14:00        16:00
Head                                 ┌───────────────────┐
                                     └───────────────────┘
Blocks     ┌────────────┬────────────┐
           └────────────┴────────────┘
WAL                                I- J-- K- L M N-- O- P-
Checkpoint                        X

At approx 17:00, head compaction runs again.
A WAL checkpoint is created covering segments I-L.
Series foo is retained in the checkpoint, since it is still in the 'deleted' map.
Now, series foo is dropped from the 'deleted' map since segment G is before I.

           10:00        12:00        14:00        16:00
Head                                              ┌──────┐
                                                  └──────┘
Blocks     ┌────────────┬────────────┬────────────┐
           └────────────┴────────────┴────────────┘
WAL                                            M N-- O- P-
Checkpoint                                    X

Only at the next head compaction at 19:00 will series foo be dropped from
the checkpoint, since it is no longer in the 'deleted' map.

In this way, a series which stopped receiving data shortly after 10:15 is retained in the WAL until 19:00.
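
To make this concrete, here is a toy, self-contained simulation of the sequence above, reflecting the behaviour described in this comment (i.e. before #12297); segment letters are replaced with numbers A=0 … P=15, and the names are mine, not the real code:

package main

import "fmt"

func main() {
	// 'deleted' maps a GC'd series to the last WAL segment that existed when
	// it was dropped from the head.
	deleted := map[string]int{}

	// 13:00 compaction: foo is garbage-collected; the newest segment is G (6).
	deleted["foo"] = 6

	// checkpoint reports whether foo is carried into a checkpoint whose first
	// covered segment is 'first', then cleans entries pointing before 'first'.
	checkpoint := func(first int) bool {
		_, kept := deleted["foo"]
		for ref, seg := range deleted {
			if seg < first {
				delete(deleted, ref)
			}
		}
		return kept
	}

	fmt.Println("15:00 checkpoint keeps foo:", checkpoint(4))  // first = E (4): still tracked, kept
	fmt.Println("17:00 checkpoint keeps foo:", checkpoint(8))  // first = I (8): kept, then dropped from the map
	fmt.Println("19:00 checkpoint keeps foo:", checkpoint(12)) // first = M (12): finally gone
}

Running it prints true, true, false: foo is still carried by the 15:00 and 17:00 checkpoints and only disappears at 19:00.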

bboreham changed the title from "WAL Checkpoint holds series from too far back" to "WAL Checkpoint holds deleted series for 3 compaction cycles" on Apr 24, 2023
jesusvazquez (Member) commented Apr 24, 2023

This is a nice find. We probably have to dive in to see why this is happening. Just want to leave a comment here saying that whatever is affecting the WAL is probably affecting the WBL too, since they both rely on the same implementation with minor differences.

bboreham (Member Author) commented:

Thanks @jesusvazquez: I don't see anything about a checkpoint in the WBL code I looked at:

prometheus/tsdb/head.go

Lines 1247 to 1261 in 5442d7e

func (h *Head) truncateOOO(lastWBLFile int, minOOOMmapRef chunks.ChunkDiskMapperRef) error {
	curMinOOOMmapRef := chunks.ChunkDiskMapperRef(h.minOOOMmapRef.Load())
	if minOOOMmapRef.GreaterThan(curMinOOOMmapRef) {
		h.minOOOMmapRef.Store(uint64(minOOOMmapRef))
		if err := h.truncateSeriesAndChunkDiskMapper("truncateOOO"); err != nil {
			return err
		}
	}
	if h.wbl == nil {
		return nil
	}
	return h.wbl.Truncate(lastWBLFile)
}

Is there some other mechanism to ensure there is a series record for all samples in the remaining part of the WBL?

BTW it would be nice to have the WBL mentioned in the docs https://github.com/prometheus/prometheus/blob/a0f7c31c2666dc45f8006ee66395b5409a59a2b9/tsdb/docs/

bboreham (Member Author) commented May 1, 2023

Following up my last comment: it turns out every sample in the WBL is first written to the WAL, so the series records in the WAL will work for the WBL too.

Now that #12297 is merged, the problem is reduced: we are holding series for 1 extra compaction cycle.
In terms of the example above, the record of series foo will now be removed at 15:00.

bboreham changed the title from "WAL Checkpoint holds deleted series for 3 compaction cycles" to "WAL Checkpoint holds deleted series for 1 extra compaction cycle" on May 1, 2023