InfluxDB starts consuming all available disk space / compaction errors #8417

Closed
garceri opened this issue May 22, 2017 · 25 comments · Fixed by #8425

garceri commented May 22, 2017

Bug report

System info:
Nightly build from 2017-05-18
Running in docker container under Rancher
EBS gp2 storage backend (separate devices for WAL and TSM)
Using Telegraf to push stats from different Rancher environments

Steps to reproduce:

  1. Clear data and wal directories
  2. Start influxdb
  3. Let it run for a while
  4. These errors start to appear in the log:
[I] 2017-05-22T12:35:34Z Snapshot for path /data/influxdb/data/telegraf-internal-prod/autogen/3 written in 128.902839ms engine=tsm1
[I] 2017-05-22T12:35:34Z error writing snapshot: compaction in progress engine=tsm1
[I] 2017-05-22T12:35:34Z error writing snapshot from compactor: compaction in progress engine=tsm1

  5. Leave it running for 6 hours; it consumes 100 GB of space

Expected behavior:
These errors should not appear
Disk space usage seems excessive in one of the databases (telegraf-dev, the one with the highest usage)

Actual behavior:
Errors start appearing in the logs
Disk space utilization skyrockets

Additional info:

vars.txt

garceri changed the title from "InfluxDB starts consuming all available disk space" to "InfluxDB starts consuming all available disk space / compaction errors" on May 23, 2017

jwilder commented May 23, 2017

Can you show how you started the container? Is the data dir writable? That error indicates that we could not create a tmp file in the data dir to write a snapshot to disk. Unfortunately, the root error looks like it's masked. Your disk may be filling up because WAL segments cannot be compacted. What do the contents of your wal dir (/data/influxdb/wal/telegraf-internal-prod/autogen/3) look like?
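For example, running something like the following inside the container (paths taken from the log above) should confirm whether influxd can create files in the shard's data directory and show what the WAL segments look like:

# quick writability probe in the shard's data directory
touch /data/influxdb/data/telegraf-internal-prod/autogen/3/.write_test && echo writable
rm -f /data/influxdb/data/telegraf-internal-prod/autogen/3/.write_test
# list the WAL segments for the same shard
ls -lh /data/influxdb/wal/telegraf-internal-prod/autogen/3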


garceri commented May 23, 2017

The errors in the OP were logged about 30 minutes after starting InfluxDB with clean data/wal directories.
Both directories are being actively written to by InfluxDB. I left it running overnight, and this morning it had consumed all 250 GB of space in the data directory:

Here is the data dir usage

root@cc35be712981:/data/influxdb/data# du -sh *
908K	_internal
12K	rancher
6.6M	telegraf
245G	telegraf-dev

And wal:

root@cc35be712981:/data/influxdb/wal# du -sh *
22M	_internal
26M	rancher
168M	telegraf
5.0G	telegraf-dev

The data directory for telegraf-dev, under autogen/13, is full of .tsm and .tsm.tmp files:

...
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024171-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024172-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024173-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024174-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024175-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024176-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024177-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024178-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024179-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024180-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024181-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024182-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024183-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024184-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024185-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024186-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:10 000024187-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:11 000024188-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:11 000024189-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:11 000024190-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:11 000024191-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:11 000024192-000000001.tsm
-rw-r--r-- 1 root root 10855292 May 23 09:11 000024193-000000001.tsm
-rw-r--r-- 1 root root  9719808 May 23 09:11 000024194-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024195-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024196-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024197-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024198-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024199-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024200-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024201-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024202-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024203-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024204-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024205-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024206-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024207-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024208-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024209-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024210-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024211-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024212-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024213-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024214-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024215-000000001.tsm.tmp
-rw-r--r-- 1 root root        0 May 23 09:11 000024216-000000001.tsm.tmp
...

Here are the file counts:

root@cc35be712981:/data/influxdb/data/telegraf-dev/autogen/13# find *.tsm |wc -l
24193
root@cc35be712981:/data/influxdb/data/telegraf-dev/autogen/13# find *.tmp |wc -l
8365

The log was full of these errors before running out of disk space:

[I] 2017-05-23T08:44:21Z Snapshot for path /data/influxdb/data/telegraf-dev/autogen/13 written in 285.264169ms engine=tsm1
[I] 2017-05-23T08:44:21Z error writing snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:21Z Snapshot for path /data/influxdb/data/telegraf-internal-prod/autogen/11 written in 9.128µs engine=tsm1
[I] 2017-05-23T08:44:21Z error writing snapshot: error opening new segment file for wal (1): invalid argument engine=tsm1
[I] 2017-05-23T08:44:22Z error adding new TSM files from snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:22Z Snapshot for path /data/influxdb/data/telegraf-dev/autogen/13 written in 282.545818ms engine=tsm1
[I] 2017-05-23T08:44:22Z error writing snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:22Z Snapshot for path /data/influxdb/data/telegraf-internal-prod/autogen/11 written in 12.319µs engine=tsm1
[I] 2017-05-23T08:44:22Z error writing snapshot: error opening new segment file for wal (1): invalid argument engine=tsm1
[I] 2017-05-23T08:44:23Z error adding new TSM files from snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:23Z Snapshot for path /data/influxdb/data/telegraf-dev/autogen/13 written in 274.55072ms engine=tsm1
[I] 2017-05-23T08:44:23Z error writing snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:23Z Snapshot for path /data/influxdb/data/telegraf-internal-prod/autogen/11 written in 9.242µs engine=tsm1
[I] 2017-05-23T08:44:23Z error writing snapshot: error opening new segment file for wal (1): invalid argument engine=tsm1
[I] 2017-05-23T08:44:24Z error adding new TSM files from snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:24Z Snapshot for path /data/influxdb/data/telegraf-dev/autogen/13 written in 283.661506ms engine=tsm1
[I] 2017-05-23T08:44:24Z error writing snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:24Z Snapshot for path /data/influxdb/data/telegraf-internal-prod/autogen/11 written in 11.222µs engine=tsm1
[I] 2017-05-23T08:44:24Z error writing snapshot: error opening new segment file for wal (1): invalid argument engine=tsm1
[I] 2017-05-23T08:44:25Z error adding new TSM files from snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:25Z Snapshot for path /data/influxdb/data/telegraf-dev/autogen/13 written in 275.902009ms engine=tsm1
[I] 2017-05-23T08:44:25Z error writing snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-23T08:44:25Z Snapshot for path /data/influxdb/data/telegraf-internal-prod/autogen/11 written in 12.154µs engine=tsm1
[I] 2017-05-23T08:44:25Z error writing snapshot: error opening new segment file for wal (1): invalid argument engine=tsm1

This is the compose file used to start the container:

version: '2'
volumes:
  ebs_influxdb-data:
    external: true
    driver: 'rancher-ebs'
  ebs_influxdb-wal:
    external: true
    driver: 'rancher-ebs'
  ebs_influxdb-meta:
    external: true
    driver: 'rancher-ebs'
services:
  influxdb:
    image: pythiant9shared/influxdb:nightly_20170518
    ulimits:
      memlock:
        hard: -1
        soft: -1
    volumes:
    - ebs_influxdb-data:/data/influxdb/data
    - ebs_influxdb-meta:/data/influxdb/meta
    - ebs_influxdb-wal:/data/influxdb/wal
    ports:
    - 2003:2003/tcp
    - 8086:8086/tcp

Here is the df output from inside the container

root@cc35be712981:/# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G   11G   19G  36% /
tmpfs           7.5G     0  7.5G   0% /dev
tmpfs           7.5G     0  7.5G   0% /sys/fs/cgroup
/dev/xvda1       30G   11G   19G  36% /.r
shm              64M     0   64M   0% /dev/shm
/dev/xvdh       4.8G   10M  4.6G   1% /data/influxdb/meta
/dev/xvdg       197G  5.3G  182G   3% /data/influxdb/wal
/dev/xvdf       246G  246G     0 100% /data/influxdb/data
tmpfs           7.5G     0  7.5G   0% /sys/firmware

And the ps output as well:

root@cc35be712981:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1 60.7 47.7 264020284 7481028 ?   Ssl  May22 572:39 influxd
root        30  0.0  0.0  20260   504 ?        Ss+  May22   0:00 bash
root       142  0.0  0.0  20260  3236 ?        Ss   14:19   0:00 bash
root       152  0.0  0.0  17500  2048 ?        R+   14:24   0:00 ps aux


jwilder commented May 23, 2017

#8416 fixes some issues where tmp files could linger, which should help here. Something else is still preventing snapshots from being written, though. I'll need to fix the error that is returned so that the root issue is not masked.


jwilder commented May 23, 2017

#8420 will fix the error being masked so we can see why the snapshot is failing to be written. It also includes fixes for other problems that arise when this error is hit.

jwilder added this to the 1.3.0 milestone on May 23, 2017

jwilder commented May 24, 2017

@garceri From your comment, can you attach one of the recent -000000001.tsm files from /data/influxdb/data/telegraf-dev/autogen/5? Something is odd: it looks like the snapshot was written successfully, but it could not be opened immediately after writing it. That hints at a filesystem/hardware/OS issue.


garceri commented May 24, 2017

Here is the TSM file
000029152-000000001.tsm.zip.
Is there any way to increase the log/debug output to unmask the error?
This is EBS storage using the rancher-ebs Storage Engine:
https://docs.rancher.com/rancher/v1.6/en/rancher-services/storage-service/rancher-ebs/
https://github.com/rancher/storage/tree/master/package/ebs

I'm restarting the container, this time without the rancher-ebs plugin volumes.

jwilder self-assigned this on May 24, 2017

jwilder commented May 24, 2017

@garceri I found the issue. Do you know how the orientdb_stats measurement is written? Is it through the Graphite endpoint? The issue is that a series key that is too large has managed to be accepted. When it's snapshotted to a TSM file, it overflows the field length, which causes the file to fail to load. I'm wondering if these writes are coming in from something other than the HTTP line protocol endpoint, because I can't seem to reproduce the issue via that endpoint.


jwilder commented May 24, 2017

I figured it out. It can occur via the HTTP API as well. The orientdb_stat measurement is writing a very large field name, which causes the series key to overflow. The field name seems to be well over 65k characters; it looks like a query is being used as the field name. The length of the field name is currently not part of the series key check in the write endpoints.
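For illustration only (the repro_test database and the oversized field name here are made up), a write of roughly this shape is currently accepted by the /write endpoint even though the field name alone is far beyond 65k bytes:

# generate a ~70k-character field name and write it through the 1.x HTTP API
curl -i -XPOST 'http://localhost:8086/write?db=repro_test' \
  --data-binary "orientdb_stat,host=test $(head -c 70000 /dev/zero | tr '\0' 'f')=1"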

As a workaround until this is fixed, I would disable whatever is writing orientdb_stat for now.


garceri commented May 24, 2017

Okay, yeah, that measurement is collected by Telegraf using the http_json input. I'm going to check the output directly to see if there is an error with Telegraf or OrientDB itself.


jwilder commented May 24, 2017

@garceri You won't see an error, as the write is accepted by the DB (though it shouldn't be).


garceri commented May 24, 2017

Okay, but I should be seeing the irregularly formatted data in the HTTP output from OrientDB, right?


jwilder commented May 24, 2017

Yeah, you should be able to see the writes to the DB. The field name starts with realtime_chronos_db.service_point_group_db.command.sql.select createWhitelist and is then followed by many GUIDs.
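One way to check from the InfluxDB side (assuming the measurement is named orientdb_stats and lives in the telegraf-dev database) is to list the field keys; the oversized one should stand out:

influx -database 'telegraf-dev' -execute 'SHOW FIELD KEYS FROM orientdb_stats'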


garceri commented May 24, 2017

@jwilder Should some kind of check be implemented in InfluxDB to prevent these errors? Maybe discarding improperly formatted metrics or something?


jwilder commented May 24, 2017

@garceri Yes, #8425 should fix this now. Would you be able to test that branch?


garceri commented May 24, 2017

Okay, I compiled and built a Docker image from your branch. I'm going to leave it running overnight and check on it occasionally. I should be seeing compaction notification messages in the log, right?


garceri commented May 25, 2017

I compiled your branch, but still no luck; I keep getting:

[I] 2017-05-25T02:45:53Z error writing snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-25T02:45:54Z error adding new TSM files from snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-25T02:45:54Z Snapshot for path /data/influxdb/data/telegraf-dev/autogen/3 written in 269.148722ms engine=tsm1
[I] 2017-05-25T02:45:54Z error writing snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-25T02:45:55Z error adding new TSM files from snapshot: indirectIndex: not enough data for max time engine=tsm1
[I] 2017-05-25T02:45:55Z Snapshot for path /data/influxdb/data/telegraf-dev/autogen/3 written in 266.584342ms engine=tsm1
[I] 2017-05-25T02:45:55Z error writing snapshot: indirectIndex: not enough data for max time engine=tsm1

After 6 hours there is not a single message indicating that compaction has taken place, and it has already consumed 90 GB of space.


jwilder commented May 25, 2017

You probably need to remove the WAL segments for that problem shard. They still contain the bad writes and won't be dropped on their own at this point; they are reloaded at startup, and snapshotting them is attempted again. If you see that error for TSM files at startup, those files would need to be removed as well.
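As a rough sketch (assuming the problem shard is telegraf-dev/autogen/3, as in the log above), the cleanup would look something like this, with influxd stopped first:

# remove the WAL segments for the problem shard (influxd must be stopped)
rm /data/influxdb/wal/telegraf-dev/autogen/3/*.wal
# if the same error is reported for .tsm files at startup, remove those files
# from /data/influxdb/data/telegraf-dev/autogen/3 as well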


garceri commented May 25, 2017

I cleaned up the whole data/meta/wal directories before starting InfluxDB again with your fixes.


jwilder commented May 25, 2017

Hmm. Can you attach another one of the 00001.tsm files? Also, what commit did you build off of?


jwilder commented May 25, 2017

Ah. I think I know what I missed. I need to update the PR.


jwilder commented May 25, 2017

@garceri Were you able to verify the build commit you are testing?

ghost removed the review label on May 25, 2017

garceri commented May 26, 2017

@jwilder, I figured out I was using an older commit. I can confirm that commit 62a9325 solves the issue.


jwilder commented May 26, 2017

@garceri Great! Thanks for your help!

@spiffytech

I'm experiencing this bug with InfluxDB 1.3, via the influxdb:latest Docker image.

I get these for a while; eventually I hit "too many open files" errors, Chronograf stops showing new data, and my storage fills up rapidly.

[I] 2017-11-02T14:19:38Z Snapshot for path /var/lib/influxdb/data/telegraf/autogen/2 written in 19.637876ms engine=tsm1
[I] 2017-11-02T14:19:38Z error writing snapshot: sync /var/lib/influxdb/data/telegraf/autogen/2: invalid argument engine=tsm1
[I] 2017-11-02T14:19:39Z error adding new TSM files from snapshot: sync /var/lib/influxdb/data/telegraf/autogen/2: invalid argument engine=tsm1


spiffytech commented Nov 6, 2017

This appears to happen only when my volume is backed by a Samba mount. I'll create a new issue for this (#9065).
