chore(tsm1): skip WriteSnapshot during backup if snapshotter is busy #16627

gianarb · 2020-01-22T17:25:42Z

When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.

The error is: operation timed out with error: create snapshot: snapshot in progress.

This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.

This PR skips snapshots if the snapshotter does not come available
after three attempts when a backup is requested.

The backup won't contain the data in the cache or WAL.

Signed-off-by: Gianluca Arbezzano gianarb92@gmail.com

dgnorton · 2020-01-23T15:44:27Z

tsdb/engine/tsm1/engine.go

+	}
+
+	if err != nil {
+		e.logger.Info("WAL busy: Backup proceeding without WAL contents", zap.Error(err))


nit: change WAL to cache since it's Cache.Snapshot() that returns the error

dgnorton · 2020-01-23T16:30:42Z

tsdb/engine/tsm1/engine.go

+		if err != nil {
+			switch err {
+			case ErrSnapshotInProgress:
+				backoff := time.Duration(math.Pow(3.8, float64(i))) * time.Millisecond


Not sure what the best backoff would be. 3.8 seems low. That's 14.4ms on the 3rd iteration. Snapshots are generally < 1s but can be longer even in a healthy system sometimes. Maybe 32 instead of 3.8? That would top out at ~1s.

e-dard · 2020-01-24T15:35:41Z

tsdb/engine/tsm1/engine.go

@@ -1893,8 +1893,22 @@ func (e *Engine) WriteSnapshot() (err error) {
 // CreateSnapshot will create a temp directory that holds
 // temporary hardlinks to the underylyng shard files.
 func (e *Engine) CreateSnapshot() (string, error) {
-	if err := e.WriteSnapshot(); err != nil {
-		return "", err
+	var err error


Question - is this only used for backups?

For what I can tell this is the unique place where we use it

influxdb/storage/engine.go

Line 674 in a56e022

func (e *Engine) CreateBackup(ctx context.Context) (int, []string, error) {

So yes

I think CreateSnapshot() is called from:

https://github.com/influxdata/influxdb/blob/1.8/tsdb/shard.go#L1125
https://github.com/influxdata/influxdb/blob/1.8/tsdb/engine/tsm1/engine.go#L981
https://github.com/influxdata/influxdb/blob/1.8/tsdb/engine/tsm1/engine.go#L915

e-dard

Since this PR is about fixing a specific backup problem, and not CreateSnapshot per-se, I'm wondering if (*Engine).CreateSnapshot should broadly be left alone, and the retry logic should be added to the actual backup code, which would be:

influxdb/tsdb/engine/tsm1/engine.go

Lines 914 to 923 in 741ac8e

    
           func (e *Engine) Backup(w io.Writer, basePath string, since time.Time) error { 
        
           	path, err := e.CreateSnapshot() 
        
           	if err != nil { 
        
           		return err 
        
           	} 
        
           	// Remove the temporary snapshot dir 
        
           	defer os.RemoveAll(path) 
        
           	return intar.Stream(w, path, basePath, intar.SinceFilterTarFile(since)) 
        
           }

I guess?

What do you think?

e-dard · 2020-01-31T18:32:09Z

tsdb/engine/tsm1/engine.go

@@ -1893,8 +1893,22 @@ func (e *Engine) WriteSnapshot() (err error) {
 // CreateSnapshot will create a temp directory that holds
 // temporary hardlinks to the underylyng shard files.
 func (e *Engine) CreateSnapshot() (string, error) {
-	if err := e.WriteSnapshot(); err != nil {
-		return "", err
+	var err error


I think CreateSnapshot() is called from:

https://github.com/influxdata/influxdb/blob/1.8/tsdb/shard.go#L1125
https://github.com/influxdata/influxdb/blob/1.8/tsdb/engine/tsm1/engine.go#L981
https://github.com/influxdata/influxdb/blob/1.8/tsdb/engine/tsm1/engine.go#L915

e-dard · 2020-02-04T10:35:55Z

tsdb/engine/tsm1/engine.go

+			default:
+				return err
+			}
+		}


One more tiny nit. Since technically after three attempts we will be potentially ignoring an error we should still handle it. In this case we should log a warning to say that we couldn't snapshot the cache.

e-dard

LGTM 👍

When an InfluxDB database is very busy writing new points the backup the process can fail because it can not write a new snapshot. The error is: `operation timed out with error: create snapshot: snapshot in progress`. This happens because InfluxDB takes almost "continuously" a snapshot from the cache caused by the high number of points ingested. This PR skips snapshots if the `snapshotter` does not come available after three attempts when a backup is requested. The backup won't contain the data in the cache or WAL. Signed-off-by: Gianluca Arbezzano <gianarb92@gmail.com>

When an InfluxDB database is very busy writing new points the backup the process can fail because it can not write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes almost "continuously" a snapshot from the cache caused by the high number of points ingested. The fix for this was #16627 but it was for OSS only, and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path and in tsdb.Shard.CreateSnapshot(), the cluster backup code path. This flag is set to false in tsm1.Engine.Export() influxdata/plutonium#3227

When an InfluxDB database is very busy writing new points the backup the process can fail because it can not write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes almost "continuously" a snapshot from the cache caused by the high number of points ingested. The fix for this was #16627 but it was for OSS only, and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path and in tsdb.Shard.CreateSnapshot(), the cluster backup code path. This flag is set to false in tsm1.Engine.Export() Adding engine_internal_test.go theto test the skipCacheOk flag to tsdb.Shard.CreateSnapshot() and tsdb.Engine.CreateSnapshot() influxdata/plutonium#3227

When an InfluxDB database is very busy writing new points the backup the process can fail because it can not write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes almost "continuously" a snapshot from the cache caused by the high number of points ingested. The fix for this was #16627 but it was for OSS only, and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path and in tsdb.Shard.CreateSnapshot(), the cluster backup code path. This flag is set to false in tsm1.Engine.Export() influxdata/plutonium#3227 (cherry picked from commit 23be20b)

When an InfluxDB database is very busy writing new points the backup the process can fail because it can not write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes almost "continuously" a snapshot from the cache caused by the high number of points ingested. The fix for this was #16627 but it was for OSS only, and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path and in tsdb.Shard.CreateSnapshot(), the cluster backup code path. This flag is set to false in tsm1.Engine.Export() influxdata/plutonium#3227 (cherry picked from commit 23be20b) (cherry picked from commit 0b1ee04)

When an InfluxDB database is very busy writing new points the backup the process can fail because it can not write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes almost "continuously" a snapshot from the cache caused by the high number of points ingested. The fix for this was #16627 but it was for OSS only, and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path and in tsdb.Shard.CreateSnapshot(), the cluster backup code path. This flag is set to false in tsm1.Engine.Export() influxdata/plutonium#3227 (cherry picked from commit 23be20b)

v1.8.4 [2021-01-27] ------------------- ### Bugfixes - [#19696](influxdata/influxdb#19697): fix(flux): add durations to Flux logging v1.8.3 [2020-09-30] ------------------- ### Features - [#19187](influxdata/influxdb#19187): feat: Collect values written stats. - [#19611](influxdata/influxdb#19611): feat: Add -lponly flag to export sub-command. ### Bugfixes - [#19409](influxdata/influxdb#19409): chore: update uuid library from satori to gofrs. - [#19439](influxdata/influxdb#19439): fix(storage): ArrayFilterCursor truncation for multi-block data. - [#19460](influxdata/influxdb#19460): chore: Use latest version of influxql package. - [#19512](influxdata/influxdb#19512): chore: Quiet static analysis tools. - [#19592](influxdata/influxdb#19592): fix(services/storage): multi measurement queries return all applicable series. - [#19612](influxdata/influxdb#19612): fix: lock map before writes. v1.8.2 [2020-08-13] ------------------- ### Bugfixes - [#19253](influxdata/influxdb#19253): fix(tsdb): Revert disable series id set cache size by default. v1.8.1 [2020-07-08] ------------------- ### Features - [#18457](influxdata/influxdb#18457): feat(query): Parallelize field iterator planning. - [#18886](influxdata/influxdb#18886): feat(http): Allow user supplied HTTP headers. ### Bugfixes - [#17319](influxdata/influxdb#17319): fix(flux): buckets call no longer panics. - [#18212](influxdata/influxdb#18212): fix(tsdb): Defer closing of underlying SeriesIDSetIterators. - [#18286](influxdata/influxdb#18286): fix(tsdb): Disable series id set cache size by default. - [#18299](influxdata/influxdb#18299): refactor(http): Simplify Authorizer. - [#18694](influxdata/influxdb#18694): fix(tsi1): wait deleting epoch before dropping shard. - [#18687](influxdata/influxdb#18687): perf(tsi1): batch write tombstone entries when dropping/deleting. - [#18826](influxdata/influxdb#18826): fix: gracefully handle errors when creating snapshots. - [#18849](influxdata/influxdb#18849): chore(build): remove all of the go1.12 references from build. v1.8.0 [2020-04-11] ------------------- ### Features - [#15952](influxdata/influxdb#15952): Add influx_inspect verify-tombstone tool. - [#16542](influxdata/influxdb#16542): Add offline series compaction to influx_inspect buildtsi. - [#16599](influxdata/influxdb#16599): Make influx CLI support custom HTTP endpoint. - [#16908](influxdata/influxdb#16908): Add support for InfluxDB 2.0 write API. - [#17621](influxdata/influxdb#17621): Update Flux to v0.65.0. - [#17188](influxdata/influxdb#17188): Enhance support for bound parameters. ### Bugfixes - [#10503](influxdata/influxdb#10503): Delete rebuilds series index when series to be deleted are only found in cache. - [#10504](https://github.com/influxdata/influxdb/issue/10504): Delete rebuilds series index when series to be deleted are outside timerange. - [#14485](influxdata/influxdb#14485): Parse Accept header correctly. - [#16524](influxdata/influxdb#16524): Upgrade compaction error log from `Info` to `Warn`. - [#16525](influxdata/influxdb#16525): Remove double increment of meta index. - [#16595](influxdata/influxdb#16595): Improve series cardinality limit for inmem index. - [#16606](influxdata/influxdb#16606): Ensure all block data returned. - [#16627](influxdata/influxdb#16627): Skip WriteSnapshot during backup if snapshotter is busy. - [#16709](influxdata/influxdb#16709): Reduce influxd and influx startup time if Flux isn't used. - [#16762](influxdata/influxdb#16762): Fix bugs in -compact-series-file. - [#16944](influxdata/influxdb#16944): Update to Go 1.13.8 and Go modules. - [#17032](influxdata/influxdb#17032): Fix a SIGSEGV when accessing tsi active log. - [#17656](influxdata/influxdb#17656): Verify precision in write requests. - [#17698](influxdata/influxdb#17698): Enable configuration of TLS 1.3.

gianarb added area/backup and restore 1.x labels Jan 22, 2020

gianarb force-pushed the feature/skip-wal-cache branch from 0e5b41a to 4f359b3 Compare January 22, 2020 17:41

gianarb requested a review from dgnorton January 23, 2020 15:21

dgnorton reviewed Jan 23, 2020

View reviewed changes

gianarb force-pushed the feature/skip-wal-cache branch from 4f359b3 to ea7e060 Compare January 23, 2020 17:11

gianarb requested a review from dgnorton January 23, 2020 17:12

dgnorton approved these changes Jan 23, 2020

View reviewed changes

gianarb requested a review from e-dard January 24, 2020 09:38

e-dard reviewed Jan 24, 2020

View reviewed changes

gianarb requested a review from e-dard January 30, 2020 22:16

e-dard suggested changes Jan 31, 2020

View reviewed changes

gianarb force-pushed the feature/skip-wal-cache branch from ea7e060 to 7d7a25d Compare January 31, 2020 22:37

gianarb requested a review from e-dard January 31, 2020 22:38

e-dard reviewed Feb 4, 2020

View reviewed changes

gianarb force-pushed the feature/skip-wal-cache branch from 7d7a25d to bfc5de6 Compare February 4, 2020 13:34

gianarb requested a review from e-dard February 4, 2020 15:44

e-dard approved these changes Feb 4, 2020

View reviewed changes

gianarb force-pushed the feature/skip-wal-cache branch from bfc5de6 to 30621ca Compare February 4, 2020 19:09

gianarb merged commit a3e37f4 into 1.8 Feb 4, 2020

gianarb deleted the feature/skip-wal-cache branch February 5, 2020 08:50

davidby-influx mentioned this pull request Oct 30, 2020

fix(tsm1): "snapshot in progress" error during backup #19869

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(tsm1): skip WriteSnapshot during backup if snapshotter is busy #16627

chore(tsm1): skip WriteSnapshot during backup if snapshotter is busy #16627

gianarb commented Jan 22, 2020

dgnorton Jan 23, 2020

dgnorton Jan 23, 2020

e-dard Jan 24, 2020

gianarb Jan 30, 2020

e-dard Jan 31, 2020

e-dard left a comment

e-dard Jan 31, 2020

e-dard Feb 4, 2020

e-dard left a comment

	func (e *Engine) Backup(w io.Writer, basePath string, since time.Time) error {
	path, err := e.CreateSnapshot()
	if err != nil {
	return err
	}
	// Remove the temporary snapshot dir
	defer os.RemoveAll(path)

	return intar.Stream(w, path, basePath, intar.SinceFilterTarFile(since))
	}

chore(tsm1): skip WriteSnapshot during backup if snapshotter is busy #16627

chore(tsm1): skip WriteSnapshot during backup if snapshotter is busy #16627

Conversation

gianarb commented Jan 22, 2020

dgnorton Jan 23, 2020

Choose a reason for hiding this comment

dgnorton Jan 23, 2020

Choose a reason for hiding this comment

e-dard Jan 24, 2020

Choose a reason for hiding this comment

gianarb Jan 30, 2020

Choose a reason for hiding this comment

e-dard Jan 31, 2020

Choose a reason for hiding this comment

e-dard left a comment

Choose a reason for hiding this comment

e-dard Jan 31, 2020

Choose a reason for hiding this comment

e-dard Feb 4, 2020

Choose a reason for hiding this comment

e-dard left a comment

Choose a reason for hiding this comment