Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(tsm1): skip WriteSnapshot during backup if snapshotter is busy #16627

Merged
merged 1 commit into from Feb 4, 2020

Conversation

gianarb
Copy link
Contributor

@gianarb gianarb commented Jan 22, 2020

When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.

The error is: operation timed out with error: create snapshot: snapshot in progress.

This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.

This PR skips snapshots if the snapshotter does not come available
after three attempts when a backup is requested.

The backup won't contain the data in the cache or WAL.

Signed-off-by: Gianluca Arbezzano gianarb92@gmail.com

}

if err != nil {
e.logger.Info("WAL busy: Backup proceeding without WAL contents", zap.Error(err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: change WAL to cache since it's Cache.Snapshot() that returns the error

if err != nil {
switch err {
case ErrSnapshotInProgress:
backoff := time.Duration(math.Pow(3.8, float64(i))) * time.Millisecond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the best backoff would be. 3.8 seems low. That's 14.4ms on the 3rd iteration. Snapshots are generally < 1s but can be longer even in a healthy system sometimes. Maybe 32 instead of 3.8? That would top out at ~1s.

@gianarb gianarb requested a review from e-dard January 24, 2020 09:38
@@ -1893,8 +1893,22 @@ func (e *Engine) WriteSnapshot() (err error) {
// CreateSnapshot will create a temp directory that holds
// temporary hardlinks to the underylyng shard files.
func (e *Engine) CreateSnapshot() (string, error) {
if err := e.WriteSnapshot(); err != nil {
return "", err
var err error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question - is this only used for backups?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what I can tell this is the unique place where we use it

func (e *Engine) CreateBackup(ctx context.Context) (int, []string, error) {

So yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianarb gianarb requested a review from e-dard January 30, 2020 22:16
Copy link
Contributor

@e-dard e-dard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this PR is about fixing a specific backup problem, and not CreateSnapshot per-se, I'm wondering if (*Engine).CreateSnapshot should broadly be left alone, and the retry logic should be added to the actual backup code, which would be:

func (e *Engine) Backup(w io.Writer, basePath string, since time.Time) error {
path, err := e.CreateSnapshot()
if err != nil {
return err
}
// Remove the temporary snapshot dir
defer os.RemoveAll(path)
return intar.Stream(w, path, basePath, intar.SinceFilterTarFile(since))
}

I guess?

What do you think?

@@ -1893,8 +1893,22 @@ func (e *Engine) WriteSnapshot() (err error) {
// CreateSnapshot will create a temp directory that holds
// temporary hardlinks to the underylyng shard files.
func (e *Engine) CreateSnapshot() (string, error) {
if err := e.WriteSnapshot(); err != nil {
return "", err
var err error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default:
return err
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more tiny nit. Since technically after three attempts we will be potentially ignoring an error we should still handle it. In this case we should log a warning to say that we couldn't snapshot the cache.

Copy link
Contributor

@e-dard e-dard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.

The error is: `operation timed out with error: create snapshot: snapshot in progress`.

This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.

This PR skips snapshots if the `snapshotter` does not come available
after three attempts when a backup is requested.

The backup won't contain the data in the cache or WAL.

Signed-off-by: Gianluca Arbezzano <gianarb92@gmail.com>
@gianarb gianarb merged commit a3e37f4 into 1.8 Feb 4, 2020
@gianarb gianarb deleted the feature/skip-wal-cache branch February 5, 2020 08:50
davidby-influx added a commit that referenced this pull request Oct 29, 2020
When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.
The error is: operation timed out with error: create snapshot: snapshot in progress.
This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.
The fix for this was #16627
but it was for OSS only, and was not in the code path for backups
in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path
and in tsdb.Shard.CreateSnapshot(), the cluster backup code path.
This flag is set to false in tsm1.Engine.Export()

influxdata/plutonium#3227
davidby-influx added a commit that referenced this pull request Oct 30, 2020
When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.
The error is: operation timed out with error: create snapshot: snapshot in progress.
This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.
The fix for this was #16627
but it was for OSS only, and was not in the code path for backups
in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path
and in tsdb.Shard.CreateSnapshot(), the cluster backup code path.
This flag is set to false in tsm1.Engine.Export()

influxdata/plutonium#3227
davidby-influx added a commit that referenced this pull request Nov 9, 2020
When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.
The error is: operation timed out with error: create snapshot: snapshot in progress.
This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.
The fix for this was #16627
but it was for OSS only, and was not in the code path for backups
in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path
and in tsdb.Shard.CreateSnapshot(), the cluster backup code path.
This flag is set to false in tsm1.Engine.Export()
Adding engine_internal_test.go theto test the skipCacheOk flag to
tsdb.Shard.CreateSnapshot() and tsdb.Engine.CreateSnapshot()

influxdata/plutonium#3227
davidby-influx added a commit that referenced this pull request Nov 13, 2020
When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.
The error is: operation timed out with error: create snapshot: snapshot in progress.
This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.
The fix for this was #16627
but it was for OSS only, and was not in the code path for backups
in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path
and in tsdb.Shard.CreateSnapshot(), the cluster backup code path.
This flag is set to false in tsm1.Engine.Export()

influxdata/plutonium#3227
(cherry picked from commit 23be20b)
davidby-influx added a commit that referenced this pull request Nov 14, 2020
When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.
The error is: operation timed out with error: create snapshot: snapshot in progress.
This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.
The fix for this was #16627
but it was for OSS only, and was not in the code path for backups
in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path
and in tsdb.Shard.CreateSnapshot(), the cluster backup code path.
This flag is set to false in tsm1.Engine.Export()

influxdata/plutonium#3227
(cherry picked from commit 23be20b)
(cherry picked from commit 0b1ee04)
davidby-influx added a commit that referenced this pull request Mar 2, 2021
When an InfluxDB database is very busy writing new points the backup
the process can fail because it can not write a new snapshot.
The error is: operation timed out with error: create snapshot: snapshot in progress.
This happens because InfluxDB takes almost "continuously" a snapshot
from the cache caused by the high number of points ingested.
The fix for this was #16627
but it was for OSS only, and was not in the code path for backups
in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path
and in tsdb.Shard.CreateSnapshot(), the cluster backup code path.
This flag is set to false in tsm1.Engine.Export()

influxdata/plutonium#3227
(cherry picked from commit 23be20b)
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request May 12, 2021
v1.8.4 [2021-01-27]
-------------------

### Bugfixes

-	[#19696](influxdata/influxdb#19697): fix(flux): add durations to Flux logging

v1.8.3 [2020-09-30]
-------------------

### Features

-	[#19187](influxdata/influxdb#19187): feat: Collect values written stats.
-	[#19611](influxdata/influxdb#19611): feat: Add -lponly flag to export sub-command.

### Bugfixes

-	[#19409](influxdata/influxdb#19409): chore: update uuid library from satori to gofrs.
-	[#19439](influxdata/influxdb#19439): fix(storage): ArrayFilterCursor truncation for multi-block data.
-	[#19460](influxdata/influxdb#19460): chore: Use latest version of influxql package.
-	[#19512](influxdata/influxdb#19512): chore: Quiet static analysis tools.
-	[#19592](influxdata/influxdb#19592): fix(services/storage): multi measurement queries return all applicable series.
-	[#19612](influxdata/influxdb#19612): fix: lock map before writes.

v1.8.2 [2020-08-13]
-------------------

### Bugfixes

-	[#19253](influxdata/influxdb#19253): fix(tsdb): Revert disable series id set cache size by default.

v1.8.1 [2020-07-08]
-------------------

### Features

-	[#18457](influxdata/influxdb#18457): feat(query): Parallelize field iterator planning.
-	[#18886](influxdata/influxdb#18886): feat(http): Allow user supplied HTTP headers.

### Bugfixes

-	[#17319](influxdata/influxdb#17319): fix(flux): buckets call no longer panics.
-	[#18212](influxdata/influxdb#18212): fix(tsdb): Defer closing of underlying SeriesIDSetIterators.
-	[#18286](influxdata/influxdb#18286): fix(tsdb): Disable series id set cache size by default.
-	[#18299](influxdata/influxdb#18299): refactor(http): Simplify Authorizer.
-	[#18694](influxdata/influxdb#18694): fix(tsi1): wait deleting epoch before dropping shard.
-	[#18687](influxdata/influxdb#18687): perf(tsi1): batch write tombstone entries when dropping/deleting.
-	[#18826](influxdata/influxdb#18826): fix: gracefully handle errors when creating snapshots.
-	[#18849](influxdata/influxdb#18849): chore(build): remove all of the go1.12 references from build.

v1.8.0 [2020-04-11]
-------------------

### Features

-	[#15952](influxdata/influxdb#15952): Add influx_inspect verify-tombstone tool.
-	[#16542](influxdata/influxdb#16542): Add offline series compaction to influx_inspect buildtsi.
-	[#16599](influxdata/influxdb#16599): Make influx CLI support custom HTTP endpoint.
-	[#16908](influxdata/influxdb#16908): Add support for InfluxDB 2.0 write API.
-	[#17621](influxdata/influxdb#17621): Update Flux to v0.65.0.
-	[#17188](influxdata/influxdb#17188): Enhance support for bound parameters.

### Bugfixes

-	[#10503](influxdata/influxdb#10503): Delete rebuilds series index when series to be deleted are only found in cache.
-	[#10504](https://github.com/influxdata/influxdb/issue/10504): Delete rebuilds series index when series to be deleted are outside timerange.
-	[#14485](influxdata/influxdb#14485): Parse Accept header correctly.
-	[#16524](influxdata/influxdb#16524): Upgrade compaction error log from `Info` to `Warn`.
-	[#16525](influxdata/influxdb#16525): Remove double increment of meta index.
-	[#16595](influxdata/influxdb#16595): Improve series cardinality limit for inmem index.
-	[#16606](influxdata/influxdb#16606): Ensure all block data returned.
-	[#16627](influxdata/influxdb#16627): Skip WriteSnapshot during backup if snapshotter is busy.
-	[#16709](influxdata/influxdb#16709): Reduce influxd and influx startup time if Flux isn't used.
-	[#16762](influxdata/influxdb#16762): Fix bugs in -compact-series-file.
-	[#16944](influxdata/influxdb#16944): Update to Go 1.13.8 and Go modules.
-	[#17032](influxdata/influxdb#17032): Fix a SIGSEGV when accessing tsi active log.
-	[#17656](influxdata/influxdb#17656): Verify precision in write requests.
-	[#17698](influxdata/influxdb#17698): Enable configuration of TLS 1.3.
jperkin added a commit to TritonDataCenter/pkgsrc that referenced this pull request May 12, 2021
v1.8.4 [2021-01-27]
-------------------

### Bugfixes

-	[#19696](influxdata/influxdb#19697): fix(flux): add durations to Flux logging

v1.8.3 [2020-09-30]
-------------------

### Features

-	[#19187](influxdata/influxdb#19187): feat: Collect values written stats.
-	[#19611](influxdata/influxdb#19611): feat: Add -lponly flag to export sub-command.

### Bugfixes

-	[#19409](influxdata/influxdb#19409): chore: update uuid library from satori to gofrs.
-	[#19439](influxdata/influxdb#19439): fix(storage): ArrayFilterCursor truncation for multi-block data.
-	[#19460](influxdata/influxdb#19460): chore: Use latest version of influxql package.
-	[#19512](influxdata/influxdb#19512): chore: Quiet static analysis tools.
-	[#19592](influxdata/influxdb#19592): fix(services/storage): multi measurement queries return all applicable series.
-	[#19612](influxdata/influxdb#19612): fix: lock map before writes.

v1.8.2 [2020-08-13]
-------------------

### Bugfixes

-	[#19253](influxdata/influxdb#19253): fix(tsdb): Revert disable series id set cache size by default.

v1.8.1 [2020-07-08]
-------------------

### Features

-	[#18457](influxdata/influxdb#18457): feat(query): Parallelize field iterator planning.
-	[#18886](influxdata/influxdb#18886): feat(http): Allow user supplied HTTP headers.

### Bugfixes

-	[#17319](influxdata/influxdb#17319): fix(flux): buckets call no longer panics.
-	[#18212](influxdata/influxdb#18212): fix(tsdb): Defer closing of underlying SeriesIDSetIterators.
-	[#18286](influxdata/influxdb#18286): fix(tsdb): Disable series id set cache size by default.
-	[#18299](influxdata/influxdb#18299): refactor(http): Simplify Authorizer.
-	[#18694](influxdata/influxdb#18694): fix(tsi1): wait deleting epoch before dropping shard.
-	[#18687](influxdata/influxdb#18687): perf(tsi1): batch write tombstone entries when dropping/deleting.
-	[#18826](influxdata/influxdb#18826): fix: gracefully handle errors when creating snapshots.
-	[#18849](influxdata/influxdb#18849): chore(build): remove all of the go1.12 references from build.

v1.8.0 [2020-04-11]
-------------------

### Features

-	[#15952](influxdata/influxdb#15952): Add influx_inspect verify-tombstone tool.
-	[#16542](influxdata/influxdb#16542): Add offline series compaction to influx_inspect buildtsi.
-	[#16599](influxdata/influxdb#16599): Make influx CLI support custom HTTP endpoint.
-	[#16908](influxdata/influxdb#16908): Add support for InfluxDB 2.0 write API.
-	[#17621](influxdata/influxdb#17621): Update Flux to v0.65.0.
-	[#17188](influxdata/influxdb#17188): Enhance support for bound parameters.

### Bugfixes

-	[#10503](influxdata/influxdb#10503): Delete rebuilds series index when series to be deleted are only found in cache.
-	[#10504](https://github.com/influxdata/influxdb/issue/10504): Delete rebuilds series index when series to be deleted are outside timerange.
-	[#14485](influxdata/influxdb#14485): Parse Accept header correctly.
-	[#16524](influxdata/influxdb#16524): Upgrade compaction error log from `Info` to `Warn`.
-	[#16525](influxdata/influxdb#16525): Remove double increment of meta index.
-	[#16595](influxdata/influxdb#16595): Improve series cardinality limit for inmem index.
-	[#16606](influxdata/influxdb#16606): Ensure all block data returned.
-	[#16627](influxdata/influxdb#16627): Skip WriteSnapshot during backup if snapshotter is busy.
-	[#16709](influxdata/influxdb#16709): Reduce influxd and influx startup time if Flux isn't used.
-	[#16762](influxdata/influxdb#16762): Fix bugs in -compact-series-file.
-	[#16944](influxdata/influxdb#16944): Update to Go 1.13.8 and Go modules.
-	[#17032](influxdata/influxdb#17032): Fix a SIGSEGV when accessing tsi active log.
-	[#17656](influxdata/influxdb#17656): Verify precision in write requests.
-	[#17698](influxdata/influxdb#17698): Enable configuration of TLS 1.3.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants