
aborted full compaction group ... compaction in progress ... file exists #8559

phemmer opened this issue Jul 3, 2017 · 14 comments

phemmer commented Jul 3, 2017

Bug report

System info:
Version: e3918e0
OS: Linux

Steps to reproduce:

  1. Run InfluxDB

Expected behavior:
Not see errors in the log.

Actual behavior:
See errors in the log.

[I] 2017-07-03T01:57:21Z beginning full compaction of group 0, 5 TSM files engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000005.tsm (#0) engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000006.tsm (#1) engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000007.tsm (#2) engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000008.tsm (#3) engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000678-000000004.tsm (#4) engine=tsm1
[I] 2017-07-03T01:57:21Z aborted full compaction group (0). compaction in progress: open /mnt/influxdb/data/spec/autogen/317/000000678-000000005.tsm.tmp: file exists engine=tsm1
[I] 2017-07-03T01:57:22Z beginning full compaction of group 0, 5 TSM files engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000005.tsm (#0) engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000006.tsm (#1) engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000007.tsm (#2) engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000008.tsm (#3) engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000678-000000004.tsm (#4) engine=tsm1
[I] 2017-07-03T01:57:22Z aborted full compaction group (0). compaction in progress: open /mnt/influxdb/data/spec/autogen/317/000000678-000000005.tsm.tmp: file exists engine=tsm1
[I] 2017-07-03T01:57:23Z beginning full compaction of group 0, 5 TSM files engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000005.tsm (#0) engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000006.tsm (#1) engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000007.tsm (#2) engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000008.tsm (#3) engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000678-000000004.tsm (#4) engine=tsm1
[I] 2017-07-03T01:57:23Z aborted full compaction group (0). compaction in progress: open /mnt/influxdb/data/spec/autogen/317/000000678-000000005.tsm.tmp: file exists engine=tsm1

Additional info:
Issue being opened as requested here.

The issue continues until I restart InfluxDB.

This has happened to me several times now, and I have not identified a pattern for when it happens. I have also been experiencing a ton of performance issues with InfluxDB, so I'm not sure whether it's related. Basically, InfluxDB has only 16GB of data, ~20k series, and only about 80KB/s of writes (according to the "http" measurement in the _internal db), yet it is experiencing horrible performance: queries and writes frequently time out, CPU usage is constantly high, disk IO is high (~18MB/s reads and ~19MB/s writes), etc. So maybe compactions are going so slowly that a new one kicks off before the previous one finishes.


e-dard commented Jul 3, 2017

From my analysis so far, it appears that in some cases compactions freeze up. The abort messages occur because the compactor retries the compaction but fails, since tmp files left over from the previous compaction attempt still exist.

Restarting the server fixes it because tmp files are cleaned up on a restart.
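
As a rough illustration of the mechanism described above (a minimal sketch, not InfluxDB's actual compaction code; the file names and the cleanup helper are assumptions): creating the compaction output exclusively fails with "file exists" when a stale .tsm.tmp is left behind, and removing *.tsm.tmp files at startup is why a restart clears the condition.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// createCompactionOutput opens a new compaction target exclusively; if a
// stale .tsm.tmp from an earlier, stalled compaction is still on disk, the
// open fails with "file exists", matching the aborted-compaction log line.
func createCompactionOutput(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0666)
}

// cleanupTmpFiles removes leftover .tsm.tmp files in a shard directory,
// mimicking the cleanup that happens when the server restarts.
func cleanupTmpFiles(shardDir string) error {
	tmps, err := filepath.Glob(filepath.Join(shardDir, "*.tsm.tmp"))
	if err != nil {
		return err
	}
	for _, f := range tmps {
		if err := os.Remove(f); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	shardDir := os.TempDir()
	tmp := filepath.Join(shardDir, "000000678-000000005.tsm.tmp")

	// Simulate a stalled compaction that left its tmp file behind.
	os.WriteFile(tmp, nil, 0666)

	if _, err := createCompactionOutput(tmp); err != nil {
		fmt.Println("compaction aborted:", err) // "... file exists"
	}

	// After a "restart": leftover tmp files are gone and the retry succeeds.
	cleanupTmpFiles(shardDir)
	if f, err := createCompactionOutput(tmp); err == nil {
		fmt.Println("compaction can proceed")
		f.Close()
		os.Remove(tmp)
	}
}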

@phemmer how many logical cores do you have on the machine? What about RAM?


e-dard commented Jul 3, 2017

@phemmer the next time it happens would you be able to SIGQUIT the process for me and provide the dump?


phemmer commented Jul 3, 2017

8 cores, 16GB RAM.


e-dard commented Jul 3, 2017

@phemmer

  • Are your queries typically selecting small portions of data based on time ranges, e.g., the kind of thing you might have set up in Grafana?
  • Are you deleting/dropping data, or just writing it and querying it?
  • Is this just affecting the spec db? Do you have other DBs that are being written to as well?


phemmer commented Jul 3, 2017

Are your queries typically selecting small portions of data based on time ranges, e.g., the kind of thing you might have set up in Grafana?

Yes, usually a span of about 5 minutes.

Are you deleting/dropping data, or just writing it and querying it?

Sometimes deleting/dropping, but not regularly.

Is this just affecting the spec db? Do you have other DBs that are being written to as well?

I only have one DB.


e-dard commented Jul 3, 2017

Sorry, I forgot to add:

  • How many shards does the spec database have, and are you writing based on current time or scattering writes over historic shards?


phemmer commented Jul 3, 2017

303 shards.
99% of the writes are new points based on the current time (continuously ongoing).
I do have a small amount of data (a 24-hour time range) that I'm performing analysis on and then writing the results. The results go to their own measurement, and I drop that measurement each time I run the analysis. This is not a constant process; it's kicked off manually.
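
For illustration only, a minimal sketch of that drop-then-rewrite pattern using the Go client/v2 library; the measurement name, tag, and field values here are hypothetical, and only the database name ("spec") comes from the logs above.

package main

import (
	"log"
	"time"

	client "github.com/influxdata/influxdb/client/v2"
)

func main() {
	// Connect to the local InfluxDB instance (the address is an assumption).
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://localhost:8086"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Drop the previous run's results before rewriting them.
	resp, err := c.Query(client.NewQuery(`DROP MEASUREMENT "analysis_results"`, "spec", ""))
	if err != nil {
		log.Fatal(err)
	}
	if resp.Error() != nil {
		log.Fatal(resp.Error())
	}

	// Rewrite the analysis output with historical timestamps inside the
	// 24-hour window being analysed.
	bp, err := client.NewBatchPoints(client.BatchPointsConfig{Database: "spec", Precision: "s"})
	if err != nil {
		log.Fatal(err)
	}
	pt, err := client.NewPoint(
		"analysis_results",
		map[string]string{"run": "manual"},
		map[string]interface{}{"value": 42.0},
		time.Now().Add(-12*time.Hour),
	)
	if err != nil {
		log.Fatal(err)
	}
	bp.AddPoint(pt)
	if err := c.Write(bp); err != nil {
		log.Fatal(err)
	}
}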


e-dard commented Jul 3, 2017

Thanks for the quick responses. I'm currently trying to come up with a similar workload to see if I can reproduce this. Do you have lots of measurements or just a few? You mentioned occasional deletes/drops; which command would you typically use, DROP MEASUREMENT or DROP SERIES?


phemmer commented Jul 3, 2017

11 measurements.

I use DROP MEASUREMENT.


e-dard commented Jul 3, 2017

@phemmer DROP MEASUREMENT looks to be key. I've managed to repro 🎉


e-dard commented Jul 4, 2017

@phemmer would you be able to try #8562 on a test box if you have one? It's a bit of a WIP, so don't try it with data you can't afford to lose.


phemmer commented Jul 4, 2017

I might be able to. I think I've figured out why I'm having so many issues with InfluxDB and why the performance is so abysmal: it seems that writing points with timestamps in the past is not handled well, even if the points are written to a completely different database. So I'm going to have to spin up a second InfluxDB instance with a completely different data storage path to store my analytical results in, so that it doesn't tank the main InfluxDB. I can run this second instance with your patch, since I'm not worried about losing data there. I just mention it because it won't be the same configuration the issue was reported on: there will still be DROP MEASUREMENTs going on, but no real-time data being written.


e-dard commented Jul 4, 2017

@phemmer I have managed to repro the issue without needing incoming writes. Merely dropping measurements and getting in the way of compactions is enough to trigger it, so I think your alternative setup would be fine for testing the issue.

On the historical data point: that's an interesting observation, and we will certainly look into it. Do you feel historical write performance has degraded compared to 1.2.x, or is it just slow in general?


jwilder commented Jul 5, 2017

Fixed via #8562

jwilder closed this as completed Jul 5, 2017