
aborted full compaction group ... compaction in progress ... file exists #8559

phemmer opened this issue Jul 3, 2017 · 14 comments

phemmer commented Jul 3, 2017

Bug report

System info:
Version: e3918e0
OS: Linux

Steps to reproduce:

  1. Run InfluxDB

Expected behavior:
Not see errors in the log.

Actual behavior:
See errors in the log.

[I] 2017-07-03T01:57:21Z beginning full compaction of group 0, 5 TSM files engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000005.tsm (#0) engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000006.tsm (#1) engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000007.tsm (#2) engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000008.tsm (#3) engine=tsm1
[I] 2017-07-03T01:57:21Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000678-000000004.tsm (#4) engine=tsm1
[I] 2017-07-03T01:57:21Z aborted full compaction group (0). compaction in progress: open /mnt/influxdb/data/spec/autogen/317/000000678-000000005.tsm.tmp: file exists engine=tsm1
[I] 2017-07-03T01:57:22Z beginning full compaction of group 0, 5 TSM files engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000005.tsm (#0) engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000006.tsm (#1) engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000007.tsm (#2) engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000008.tsm (#3) engine=tsm1
[I] 2017-07-03T01:57:22Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000678-000000004.tsm (#4) engine=tsm1
[I] 2017-07-03T01:57:22Z aborted full compaction group (0). compaction in progress: open /mnt/influxdb/data/spec/autogen/317/000000678-000000005.tsm.tmp: file exists engine=tsm1
[I] 2017-07-03T01:57:23Z beginning full compaction of group 0, 5 TSM files engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000005.tsm (#0) engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000006.tsm (#1) engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000007.tsm (#2) engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000675-000000008.tsm (#3) engine=tsm1
[I] 2017-07-03T01:57:23Z compacting full group (0) /mnt/influxdb/data/spec/autogen/317/000000678-000000004.tsm (#4) engine=tsm1
[I] 2017-07-03T01:57:23Z aborted full compaction group (0). compaction in progress: open /mnt/influxdb/data/spec/autogen/317/000000678-000000005.tsm.tmp: file exists engine=tsm1

Additional info:
Issue being opened as requested here.

The issue continues until I restart InfluxDB.

This has happened to me several times now, and I have not identified a pattern for when it happens. I have also been experiencing a ton of performance issues with InfluxDB, so I'm not sure whether it's related. Basically, InfluxDB has only 16GB of data, ~20k series, and only about 80KB/s of writes (according to the "http" measurement in the _internal db), yet it is experiencing horrible performance: queries and writes frequently time out, CPU usage is constantly high, disk IO is high (~18MB/s reads and ~19MB/s writes), etc. So maybe compactions are going so slowly that a new one kicks off before the previous one finishes.


e-dard commented Jul 3, 2017

From my analysis so far, it appears that in some cases compactions freeze up. The abort messages occur because the compactor retries the compaction but fails, since tmp files left over from the previous compaction attempt still exist.

Restarting the server fixes it because tmp files are cleaned up on a restart.
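
As a rough illustration of the mechanism described above (a minimal sketch, not InfluxDB's actual compaction code; the file names and the cleanup helper are assumptions): creating the compaction output exclusively fails with "file exists" when a stale .tsm.tmp is left behind, and removing *.tsm.tmp files at startup is why a restart clears the condition.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// createCompactionOutput opens a new compaction target exclusively; if a
// stale .tsm.tmp from an earlier, stalled compaction is still on disk, the
// open fails with "file exists", matching the aborted-compaction log line.
func createCompactionOutput(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0666)
}

// cleanupTmpFiles removes leftover .tsm.tmp files in a shard directory,
// mimicking the cleanup that happens when the server restarts.
func cleanupTmpFiles(shardDir string) error {
	tmps, err := filepath.Glob(filepath.Join(shardDir, "*.tsm.tmp"))
	if err != nil {
		return err
	}
	for _, f := range tmps {
		if err := os.Remove(f); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	shardDir := os.TempDir()
	tmp := filepath.Join(shardDir, "000000678-000000005.tsm.tmp")

	// Simulate a stalled compaction that left its tmp file behind.
	os.WriteFile(tmp, nil, 0666)

	if _, err := createCompactionOutput(tmp); err != nil {
		fmt.Println("compaction aborted:", err) // "... file exists"
	}

	// After a "restart": leftover tmp files are gone and the retry succeeds.
	cleanupTmpFiles(shardDir)
	if f, err := createCompactionOutput(tmp); err == nil {
		fmt.Println("compaction can proceed")
		f.Close()
		os.Remove(tmp)
	}
}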

@phemmer how many logical cores do you have on the machine? What about RAM?


e-dard commented Jul 3, 2017

@phemmer the next time it happens would you be able to SIGQUIT the process for me and provide the dump?


phemmer commented Jul 3, 2017

8 cores, 16GB RAM.


e-dard commented Jul 3, 2017

@phemmer

  • Are your queries typically selecting small portions of data based on time ranges, e.g., the kind of thing you might have set up in Grafana?
  • Are you deleting/dropping data, or just writing it and querying it?
  • Is this just affecting the spec db? Do you have other DBs that are being written to as well?


phemmer commented Jul 3, 2017

Are your queries typically selecting small portions of data based on time ranges, e.g., the kind of thing you might have set up in Grafana?

Yes, usually a span of about 5 minutes.

Are you deleting/dropping data, or just writing it and querying it?

Sometimes deleting/dropping, but not regularly.

Is this just affecting the spec db? Do you have other DBs that are being written to as well?

I only have one DB.


e-dard commented Jul 3, 2017

Sorry, I forgot to add:

  • How many shards does the spec database have, and are you writing based on current time or scattering writes over historic shards?


phemmer commented Jul 3, 2017

303 shards.
99% of the writes are new points based on the current time (continuously ongoing).
I do have a small amount of data (a 24-hour time range) that I'm performing analysis on and then writing the results. The results go to their own measurement, and I drop that measurement each time I run the analysis. This is not a constant process; it's kicked off manually.
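
For illustration only, a minimal sketch of that drop-then-rewrite pattern using the Go client/v2 library; the measurement name, tag, and field values here are hypothetical, and only the database name ("spec") comes from the logs above.

package main

import (
	"log"
	"time"

	client "github.com/influxdata/influxdb/client/v2"
)

func main() {
	// Connect to the local InfluxDB instance (the address is an assumption).
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://localhost:8086"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Drop the previous run's results before rewriting them.
	resp, err := c.Query(client.NewQuery(`DROP MEASUREMENT "analysis_results"`, "spec", ""))
	if err != nil {
		log.Fatal(err)
	}
	if resp.Error() != nil {
		log.Fatal(resp.Error())
	}

	// Rewrite the analysis output with historical timestamps inside the
	// 24-hour window being analysed.
	bp, err := client.NewBatchPoints(client.BatchPointsConfig{Database: "spec", Precision: "s"})
	if err != nil {
		log.Fatal(err)
	}
	pt, err := client.NewPoint(
		"analysis_results",
		map[string]string{"run": "manual"},
		map[string]interface{}{"value": 42.0},
		time.Now().Add(-12*time.Hour),
	)
	if err != nil {
		log.Fatal(err)
	}
	bp.AddPoint(pt)
	if err := c.Write(bp); err != nil {
		log.Fatal(err)
	}
}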


e-dard commented Jul 3, 2017

Thanks for the quick responses. I'm currently trying to come up with a similar workload to see if I can reproduce this. Do you have lots of measurements or just a few? You mentioned occasional deletes/drops; which command would you typically use, DROP MEASUREMENT or DROP SERIES?


phemmer commented Jul 3, 2017

11 measurements.

I use DROP MEASUREMENT.


e-dard commented Jul 3, 2017

@phemmer DROP MEASUREMENT looks to be key. I've managed to repro 🎉


e-dard commented Jul 4, 2017

@phemmer would you be able to try #8562 on a test box if you have one? It's a bit of a WIP, so don't try it with data you can't afford to lose.


phemmer commented Jul 4, 2017

I might be able to. I think I've figured out why I'm having so many issues with InfluxDB and why the performance is so abysmal: it seems that writing points with timestamps in the past is not handled well, even if the points are written to a completely different database. So I'm going to have to spin up a second InfluxDB instance with a completely different data storage path to store my analytical results in, so that it doesn't tank the main InfluxDB. I can run this second instance with your patch, since I'm not worried about losing data there. I just mention it because it won't be the same configuration the issue was reported on: there will still be DROP MEASUREMENTs going on, but no real-time data being written.


e-dard commented Jul 4, 2017

@phemmer I have managed to repro the issue without needing incoming writes. Merely dropping measurements and getting in the way of compactions is enough to trigger it, so I think your alternative setup would be fine for testing the issue.

On the historical data point: that's an interesting observation, and we will certainly look into it. Do you feel historical write performance has degraded compared to 1.2.x, or is it just slow in general?


jwilder commented Jul 5, 2017

Fixed via #8562

jwilder closed this as completed Jul 5, 2017