InfluxDB 0.10.x performance unusable with less than 30 days history #5856

Closed
fxstein opened this Issue Feb 29, 2016 · 10 comments

fxstein commented Feb 29, 2016

Sorry to say, but InfluxDB has degraded to a point that makes it unusable for me. Last year I ran pre-0.9 versions with up to a year of data. Since my move to 0.9.x (disaster) and now 0.10.x (just as bad, if not worse), I cannot handle 28 days of IoT data at fewer than 1 million events per day.

I hope I am just doing something very wrong - so any help would be very much appreciated.

Dedicated Mac Mini with 16GB of RAM, Dual SSD.
The biggest issue since 0.9.x is that memory consumption goes up linearly with the amount of history, and it exceeds the on-disk size of all the data. Whether you query it or not is irrelevant.

Prior to 0.9 I was running 0.8.8 and could keep a year of the same data on the same machine. Multi-month queries would take a while, but startup and loads never experienced any issues.

Now, even with no queries running, the memory footprint grows and grows until the server swaps to death - as in TBs of swap per day for about 16GB of total data in the db.

As has been mentioned, startup times explode and can take multiple hours, but that is not even the worst issue. Every nn minutes the database goes through a cycle where it does the same thing for multiple minutes - running out of all memory and refusing any new writes.

Writes slow down after a few minutes, to the point where single writes can take 9-10+ seconds instead of µs/ms:

[http] 2016/02/28 16:53:22 192.168.1.253 - listener [28/Feb/2016:16:53:22 -0800] POST /write?precision=s&db=Home HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 d4caf0df-de7e-11e5-8278-000000000000 2.388403ms
[http] 2016/02/28 16:53:24 192.168.1.253 - listener [28/Feb/2016:16:53:24 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 d5e1bfb9-de7e-11e5-8279-000000000000 873.016µs
[http] 2016/02/28 16:53:31 192.168.1.253 - listener [28/Feb/2016:16:53:31 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 da376c16-de7e-11e5-827c-000000000000 2.23672ms
[http] 2016/02/28 16:53:31 192.168.1.253 - listener [28/Feb/2016:16:53:31 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 da37667d-de7e-11e5-827b-000000000000 2.908262ms
[http] 2016/02/28 16:53:31 192.168.1.253 - listener [28/Feb/2016:16:53:21 -0800] POST /write?precision=s&db=Home HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 d47e4881-de7e-11e5-8275-000000000000 9.605373316s
[http] 2016/02/28 16:53:31 192.168.1.253 - listener [28/Feb/2016:16:53:21 -0800] POST /write?precision=s&db=Home HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 d47e4326-de7e-11e5-8273-000000000000 9.60728112s
[http] 2016/02/28 16:53:31 192.168.1.253 - listener [28/Feb/2016:16:53:31 -0800] POST /write?precision=s&db=Home HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 da401279-de7e-11e5-827e-000000000000 1.257287ms
[http] 2016/02/28 16:53:31 192.168.1.253 - listener [28/Feb/2016:16:53:24 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 d5ff8605-de7e-11e5-827a-000000000000 7.426334248s
[http] 2016/02/28 16:53:31 192.168.1.253 - listener [28/Feb/2016:16:53:31 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 da772c1b-de7e-11e5-827f-000000000000 7.231543ms
[http] 2016/02/28 16:53:32 192.168.1.253 - listener [28/Feb/2016:16:53:31 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 da3770b4-de7e-11e5-827d-000000000000 1.167589908s
[http] 2016/02/28 16:53:33 192.168.1.253 - listener [28/Feb/2016:16:53:33 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 db4b9b60-de7e-11e5-8280-000000000000 1.034804ms
[http] 2016/02/28 16:53:36 192.168.1.253 - listener [28/Feb/2016:16:53:36 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 dd48724f-de7e-11e5-8282-000000000000 108.620638ms
[http] 2016/02/28 16:53:36 192.168.1.253 - listener [28/Feb/2016:16:53:36 -0800] POST /write?precision=s&db=Home HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 dd39f15d-de7e-11e5-8281-000000000000 205.257245ms
[http] 2016/02/28 16:53:37 192.168.1.253 - listener [28/Feb/2016:16:53:37 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 ddf94ca2-de7e-11e5-8283-000000000000 4.180005ms
[http] 2016/02/28 16:53:37 192.168.1.253 - listener [28/Feb/2016:16:53:37 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 ddfc5dec-de7e-11e5-8284-000000000000 904.446µs
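
To pull the slow writes out of a log like the one above, the request duration at the end of each [http] line can be parsed and filtered. A minimal sketch in Python, assuming the duration is always the last whitespace-separated token and uses Go-style units (h, m, s, ms, µs, ns). The function names and the 1-second threshold are made up for illustration:

```python
import re

# Seconds per unit for Go-style duration strings such as "9.605373316s".
# Both the micro sign (U+00B5) and Greek mu (U+03BC) are accepted for microseconds.
_UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3,
    "us": 1e-6, "\u00b5s": 1e-6, "\u03bcs": 1e-6, "ns": 1e-9,
}
# Two-letter units come first so "ms" is not read as "m" followed by "s".
_PART = re.compile("(\\d+(?:\\.\\d+)?)(ms|us|\u00b5s|\u03bcs|ns|h|m|s)")


def duration_seconds(text):
    """Convert a duration like '1h34m57.17s' or '873.016µs' to seconds."""
    parts = _PART.findall(text)
    # Reject tokens that are not made up entirely of number+unit pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError("not a duration: %r" % text)
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def slow_requests(lines, threshold=1.0):
    """Return (seconds, line) pairs for [http] lines at or above threshold."""
    slow = []
    for line in lines:
        tokens = line.split()
        if not tokens or not line.startswith("[http]"):
            continue
        try:
            seconds = duration_seconds(tokens[-1])
        except ValueError:
            continue  # last token is not a duration; skip the line
        if seconds >= threshold:
            slow.append((seconds, line))
    return slow
```

Feeding it the lines above would flag the 9.6 s, 7.4 s, and 1.17 s writes and skip the millisecond ones.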

Here is the one database in the filesystem I am working with:

influx:~ oratzes$ sudo ls /usr/local/var/influxdb/data/Home
default
influx:~ oratzes$ sudo ls /usr/local/var/influxdb/data/Home/default
38  46  54  62
influx:~ oratzes$ sudo ls -la /usr/local/var/influxdb/data/Home/default/62
total 7858928
drwxr-xr-x  7 root  admin         238 Feb 27 18:20 .
drwx------  6 root  admin         204 Feb 21 16:00 ..
-rw-r--r--  1 root  admin  2154039044 Feb 24 14:54 000000016-000000004.tsm
-rw-r--r--  1 root  admin   522710123 Feb 25 18:40 000000021-000000003.tsm
-rw-r--r--  1 root  admin   550569002 Feb 26 13:46 000000025-000000003.tsm
-rw-r--r--  1 root  admin   569589942 Feb 27 09:34 000000029-000000003.tsm
-rw-r--r--  1 root  admin   226852751 Feb 27 18:20 000000031-000000002.tsm
influx:~ oratzes$ sudo ls -la /usr/local/var/influxdb/data/Home/default/54
total 8886624
drwxr-xr-x  5 root  admin         170 Feb 21 17:02 .
drwx------  6 root  admin         204 Feb 21 16:00 ..
-rw-r--r--  1 root  admin  2154710718 Feb 18 00:20 000000016-000000004.tsm
-rw-r--r--  1 root  admin   277368467 Feb 18 00:20 000000016-000000005.tsm
-rw-r--r--  1 root  admin  2117868415 Feb 21 17:02 000000033-000000004.tsm
influx:~ oratzes$ sudo ls -la /usr/local/var/influxdb/data/Home/default/46
total 9634656
drwxr-xr-x  6 root  admin         204 Feb 15 16:01 .
drwx------  6 root  admin         204 Feb 21 16:00 ..
-rw-r--r--  1 root  admin  2154679495 Feb 10 15:31 000000016-000000004.tsm
-rw-r--r--  1 root  admin    15075236 Feb 10 15:32 000000016-000000005.tsm
-rw-r--r--  1 root  admin  2154692889 Feb 15 16:01 000000038-000000002.tsm
-rw-r--r--  1 root  admin   608488287 Feb 15 16:01 000000038-000000003.tsm
influx:~ oratzes$ sudo ls -la /usr/local/var/influxdb/data/Home/default/38
total 3966624
drwxr-xr-x  3 root  admin         102 Feb  7 17:01 .
drwx------  6 root  admin         204 Feb 21 16:00 ..
-rw-r--r--  1 root  admin  2030908817 Feb  7 17:01 000000016-000000004.tsm
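
As a sanity check on the "about 16GB" figure, the .tsm sizes in the four listings above add up to roughly 15.5 GB. A small Python sketch of that sum, assuming the `ls -la` column layout shown here (size in the fifth column, file name last). The function name is hypothetical:

```python
def tsm_bytes(ls_output):
    """Sum the size column of every .tsm row in `ls -la` output."""
    total = 0
    for line in ls_output.splitlines():
        cols = line.split()
        # Expect: perms, links, owner, group, size, month, day, time, name
        if len(cols) >= 9 and cols[-1].endswith(".tsm"):
            total += int(cols[4])
    return total


listing_54 = """total 8886624
drwxr-xr-x  5 root  admin         170 Feb 21 17:02 .
drwx------  6 root  admin         204 Feb 21 16:00 ..
-rw-r--r--  1 root  admin  2154710718 Feb 18 00:20 000000016-000000004.tsm
-rw-r--r--  1 root  admin   277368467 Feb 18 00:20 000000016-000000005.tsm
-rw-r--r--  1 root  admin  2117868415 Feb 21 17:02 000000033-000000004.tsm
"""
print(tsm_bytes(listing_54))  # 4549947600
```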

Here is a restart after idling the server, shutting it down, and starting it without any writes hitting it.
Parallel loading is definitely not the problem and will not improve this, since the culprit is total memory exhaustion, which makes a clean boot take hours.

[http] 2016/02/28 15:09:57 192.168.1.253 - listener [28/Feb/2016:15:09:57 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 624a5111-de70-11e5-97f9-000000000000 20.575935ms
[http] 2016/02/28 15:09:57 192.168.1.253 - listener [28/Feb/2016:15:09:57 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 62975ab3-de70-11e5-97fb-000000000000 5.753367ms
[http] 2016/02/28 15:09:58 192.168.1.253 - listener [28/Feb/2016:15:09:58 -0800] POST /write?db=Home&precision=s HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 62e14d2e-de70-11e5-97fc-000000000000 6.134094ms
[http] 2016/02/28 15:10:00 192.168.1.253 - listener [28/Feb/2016:15:10:00 -0800] POST /write?precision=s&db=Home HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 64660b6c-de70-11e5-97fd-000000000000 532.691µs
[http] 2016/02/28 15:10:06 192.168.1.253 - listener [28/Feb/2016:15:10:06 -0800] POST /write?precision=s&db=Home HTTP/1.1 204 0 - python-requests/2.7.0 CPython/3.4.3 Darwin/15.0.0 681169ce-de70-11e5-97fe-000000000000 5.410662ms
[run] 2016/02/28 15:10:37 Signal received, initializing clean shutdown...
[run] 2016/02/28 15:10:37 Waiting for clean shutdown...
[snapshot] 2016/02/28 15:10:37 snapshot listener closed
[copier] 2016/02/28 15:10:37 copier listener closed
[cluster] 2016/02/28 15:10:37 cluster service accept error: network connection closed
[shard-precreation] 2016/02/28 15:10:37 Precreation service terminating
[continuous_querier] 2016/02/28 15:10:37 continuous query service terminating
[retention] 2016/02/28 15:10:37 retention policy enforcement terminating
[monitor] 2016/02/28 15:10:37 shutting down monitor system
[monitor] 2016/02/28 15:10:37 terminating storage of statistics
[handoff] 2016/02/28 15:10:37 shutting down hh service

 8888888           .d888 888                   8888888b.  888888b.
   888            d88P"  888                   888  "Y88b 888  "88b
   888            888    888                   888    888 888  .88P
   888   88888b.  888888 888 888  888 888  888 888    888 8888888K.
   888   888 "88b 888    888 888  888  Y8bd8P' 888    888 888  "Y88b
   888   888  888 888    888 888  888   X88K   888    888 888    888
   888   888  888 888    888 Y88b 888 .d8""8b. 888  .d88P 888   d88P
 8888888 888  888 888    888  "Y88888 888  888 8888888P"  8888888P"

2016/02/28 15:11:16 InfluxDB starting, version 0.10.1, branch 0.10.0, commit b8bb32ecad9808ef00219e7d2469514890a0987a, built unknown
2016/02/28 15:11:16 Go version go1.6, GOMAXPROCS set to 8
2016/02/28 15:11:17 Using configuration at: /usr/local/etc/influxdb.conf
[meta] 2016/02/28 15:11:17 Starting meta service
[meta] 2016/02/28 15:11:17 Listening on HTTP: 127.0.0.1:8091
[metastore] 2016/02/28 15:11:17 Using data dir: /usr/local/var/influxdb/meta
[metastore] 2016/02/28 15:11:17 Node at localhost:8088 [Follower]
[metastore] 2016/02/28 15:11:18 Node at localhost:8088 [Leader]. peers=[localhost:8088]
[meta] 2016/02/28 15:11:18 127.0.0.1 - - [28/Feb/2016:15:11:18 -0800] GET /?index=0 HTTP/1.1 200 1315 - Go-http-client/1.1 92de0980-de70-11e5-8001-000000000000 3.114313ms
[store] 2016/02/28 15:11:18 Using data dir: /usr/local/var/influxdb/data
[tsm1wal] 2016/02/28 15:11:18 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:11:18 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/38
[filestore]2016/02/28 15:11:22 /usr/local/var/influxdb/data/Home/default/38/000000016-000000004.tsm (#0) opened in 4.143386832s
[cacheloader] 2016/02/28 15:11:22 reading file /usr/local/var/influxdb/wal/Home/default/38/_00082.wal, size 0
[tsm1wal] 2016/02/28 15:13:24 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:13:24 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/46
[filestore]2016/02/28 15:13:24 /usr/local/var/influxdb/data/Home/default/46/000000016-000000005.tsm (#1) opened in 21.73417ms
[filestore]2016/02/28 15:13:26 /usr/local/var/influxdb/data/Home/default/46/000000038-000000003.tsm (#3) opened in 2.053936208s
[filestore]2016/02/28 15:13:30 /usr/local/var/influxdb/data/Home/default/46/000000016-000000004.tsm (#0) opened in 6.686841164s
[filestore]2016/02/28 15:13:30 /usr/local/var/influxdb/data/Home/default/46/000000038-000000002.tsm (#2) opened in 6.701890206s
[cacheloader] 2016/02/28 15:13:30 reading file /usr/local/var/influxdb/wal/Home/default/46/_00186.wal, size 0
[tsm1wal] 2016/02/28 15:23:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:23:27 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/54
[filestore]2016/02/28 15:23:29 /usr/local/var/influxdb/data/Home/default/54/000000016-000000005.tsm (#1) opened in 2.146638136s
[filestore]2016/02/28 15:23:41 /usr/local/var/influxdb/data/Home/default/54/000000033-000000004.tsm (#2) opened in 13.822702246s
[filestore]2016/02/28 15:23:41 /usr/local/var/influxdb/data/Home/default/54/000000016-000000004.tsm (#0) opened in 14.203828173s
[cacheloader] 2016/02/28 15:23:41 reading file /usr/local/var/influxdb/wal/Home/default/54/_00161.wal, size 0
[tsm1wal] 2016/02/28 15:47:45 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:47:45 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/62
[filestore]2016/02/28 15:47:48 /usr/local/var/influxdb/data/Home/default/62/000000031-000000002.tsm (#4) opened in 2.429135819s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000021-000000003.tsm (#1) opened in 4.416776609s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000025-000000003.tsm (#2) opened in 4.663655016s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000029-000000003.tsm (#3) opened in 4.764130897s
[filestore]2016/02/28 15:47:57 /usr/local/var/influxdb/data/Home/default/62/000000016-000000004.tsm (#0) opened in 12.2506927s
[cacheloader] 2016/02/28 15:47:57 reading file /usr/local/var/influxdb/wal/Home/default/62/_00149.wal, size 19055
[cacheloader] 2016/02/28 15:47:57 reading file /usr/local/var/influxdb/wal/Home/default/62/_00150.wal, size 10310945
[cacheloader] 2016/02/28 15:47:59 reading file /usr/local/var/influxdb/wal/Home/default/62/_00151.wal, size 0
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/60
[filestore]2016/02/28 16:46:04 /usr/local/var/influxdb/data/_internal/monitor/60/000000001-000000001.tsm (#0) opened in 1.022839ms
[cacheloader] 2016/02/28 16:46:04 reading file /usr/local/var/influxdb/wal/_internal/monitor/60/_00006.wal, size 0
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/61
[filestore]2016/02/28 16:46:04 /usr/local/var/influxdb/data/_internal/monitor/61/000000001-000000001.tsm (#0) opened in 1.03813ms
[cacheloader] 2016/02/28 16:46:04 reading file /usr/local/var/influxdb/wal/_internal/monitor/61/_00006.wal, size 0
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/63
[filestore]2016/02/28 16:46:04 /usr/local/var/influxdb/data/_internal/monitor/63/000000001-000000001.tsm (#0) opened in 716.348µs
[cacheloader] 2016/02/28 16:46:04 reading file /usr/local/var/influxdb/wal/_internal/monitor/63/_00006.wal, size 0
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/64
[filestore]2016/02/28 16:46:04 /usr/local/var/influxdb/data/_internal/monitor/64/000000001-000000001.tsm (#0) opened in 851.431µs
[cacheloader] 2016/02/28 16:46:04 reading file /usr/local/var/influxdb/wal/_internal/monitor/64/_00006.wal, size 0
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/65
[filestore]2016/02/28 16:46:04 /usr/local/var/influxdb/data/_internal/monitor/65/000000001-000000001.tsm (#0) opened in 1.063461ms
[cacheloader] 2016/02/28 16:46:04 reading file /usr/local/var/influxdb/wal/_internal/monitor/65/_00005.wal, size 0
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/66
[filestore]2016/02/28 16:46:04 /usr/local/var/influxdb/data/_internal/monitor/66/000000001-000000001.tsm (#0) opened in 896.981µs
[cacheloader] 2016/02/28 16:46:04 reading file /usr/local/var/influxdb/wal/_internal/monitor/66/_00004.wal, size 0
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/67
[filestore]2016/02/28 16:46:04 /usr/local/var/influxdb/data/_internal/monitor/67/000000001-000000001.tsm (#0) opened in 957.554µs
[cacheloader] 2016/02/28 16:46:04 reading file /usr/local/var/influxdb/wal/_internal/monitor/67/_00004.wal, size 0
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:04 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/68
[cacheloader] 2016/02/28 16:46:04 reading file /usr/local/var/influxdb/wal/_internal/monitor/68/_00001.wal, size 8535543
[cacheloader] 2016/02/28 16:46:05 reading file /usr/local/var/influxdb/wal/_internal/monitor/68/_00002.wal, size 540429
[cacheloader] 2016/02/28 16:46:05 reading file /usr/local/var/influxdb/wal/_internal/monitor/68/_00003.wal, size 0
[handoff] 2016/02/28 16:46:05 Starting hinted handoff service
[monitor] 2016/02/28 16:46:05 'hh' registered for diagnostics monitoring
[handoff] 2016/02/28 16:46:05 Using data dir: /usr/local/var/influxdb/hh
[subscriber] 2016/02/28 16:46:05 opened service
[monitor] 2016/02/28 16:46:05 Starting monitor system
[monitor] 2016/02/28 16:46:05 'build' registered for diagnostics monitoring
[monitor] 2016/02/28 16:46:05 'runtime' registered for diagnostics monitoring
[monitor] 2016/02/28 16:46:05 'network' registered for diagnostics monitoring
[monitor] 2016/02/28 16:46:05 'system' registered for diagnostics monitoring
[cluster] 2016/02/28 16:46:05 Starting cluster service
[shard-precreation] 2016/02/28 16:46:05 Starting precreation service with check interval of 10m0s, advance period of 30m0s
[snapshot] 2016/02/28 16:46:05 Starting snapshot service
[monitor] 2016/02/28 16:46:05 Storing statistics in database '_internal' retention policy '', at interval 10s
[copier] 2016/02/28 16:46:05 Starting copier service
[admin] 2016/02/28 16:46:05 Starting admin service
[admin] 2016/02/28 16:46:05 Listening on HTTP: [::]:8083
[continuous_querier] 2016/02/28 16:46:05 Starting continuous query service
[httpd] 2016/02/28 16:46:05 Starting HTTP service
[httpd] 2016/02/28 16:46:05 Authentication enabled: false
[httpd] 2016/02/28 16:46:05 Listening on HTTP: [::]:8086
[retention] 2016/02/28 16:46:05 Starting retention policy enforcement service with check interval of 30m0s
[run] 2016/02/28 16:46:05 Listening for signals
2016/02/28 16:46:05 Sending anonymous usage statistics to m.influxdb.com
[meta] 2016/02/28 16:46:15 127.0.0.1 - - [28/Feb/2016:15:11:18 -0800] GET /?index=229 HTTP/1.1 200 1315 - Go-http-client/1.1 92dea0c8-de70-11e5-8002-000000000000 1h34m57.17259283s
[meta] 2016/02/28 16:46:15 127.0.0.1 - - [28/Feb/2016:16:46:15 -0800] POST /execute HTTP/1.1 200 29 - Go-http-client/1.1 d69fcc44-de7d-11e5-8003-000000000000 126.492955ms
[meta] 2016/02/28 16:46:15 127.0.0.1 - - [28/Feb/2016:16:46:15 -0800] POST /execute HTTP/1.1 200 29 - Go-http-client/1.1 d6b38540-de7d-11e5-8005-000000000000 2.126476ms
[meta] 2016/02/28 16:46:15 127.0.0.1 - - [28/Feb/2016:16:46:15 -0800] GET /?index=230 HTTP/1.1 200 1315 - Go-http-client/1.1 d6b384b9-de7d-11e5-8004-000000000000 3.517376ms
[meta] 2016/02/28 16:46:16 127.0.0.1 - - [28/Feb/2016:16:46:16 -0800] POST /execute HTTP/1.1 200 29 - Go-http-client/1.1 d6dea1e2-de7d-11e5-8007-000000000000 17.749223ms
[meta] 2016/02/28 16:46:16 127.0.0.1 - - [28/Feb/2016:16:46:15 -0800] GET /?index=231 HTTP/1.1 200 1327 - Go-http-client/1.1 d6b56897-de7d-11e5-8006-000000000000 288.8107ms
[tsm1wal] 2016/02/28 16:46:16 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 16:46:16 tsm1 WAL writing to /usr/local/var/influxdb/wal/_internal/monitor/69

jwilder (Contributor) commented Feb 29, 2016

@fxstein Can you run influx_inspect dumptsmdev -all /usr/local/var/influxdb/data/Home/default/62/000000016-000000004.tsm and put the output in a gist?

Do you have some examples of the kind of data you are writing? If you have some sample writes in line protocol format, that would be really useful.

The slow startup is a known issue with loading the in-memory index. If you are able to build the code, I'd be interested to know if #5372 helps.

fxstein commented Feb 29, 2016

Looks like homebrew does not install influx_inspect - any pointer as to how to install it?

Here are example data feeds in addition to collectd stats from a few machines:
https://github.com/fxstein/SentientHome/tree/develop/feed

For example Solar PV Sensor Data from SMA Inverters:
[screenshot: 2016-02-28 at 5:34:41 pm]
Need to write a little extension to dump the line protocol format.

Have not tested the load improvements, but from watching top and memory consumption, nothing but cutting the memory footprint will help. The last startup swapped more than 500GB of IO in and out. It seems that for about 16GB of data covering 4 weeks, the memory footprint is well in excess of 20GB of RAM, with the server only having 16GB total.

jwilder (Contributor) commented Feb 29, 2016

@fxstein If you have go installed, you could try running:

go get github.com/influxdata/influxdb/cmd/influx_inspect/...

It looks like how you are using tags may be creating the memory issue based on the code you have referenced. Here are a few things I see that could be causing problems:

  • Storing dates as tags - Tag values should really have a bounded cardinality. Storing a date as the value ends up creating a new series each time it changes. Each series is tracked in the in-memory index so this will cause your memory usage to continue to grow as new data is written. It looks like several of the feeds are using dates as tag values which can create very high cardinality tags.
  • Large tag keys or values - emergency_contact_description is pretty long for a tag key. The value looks like it could be pretty long as well. This isn't a problem in itself other than it increases the size of the series key held in-memory. If any of the tag values in that series vary significantly, this could increase memory usage more quickly.
  • Large number of tags - This particular feed has 20 tags. Some of the values are dates or possibly high-cardinality values. These could be creating a lot of series for you and increasing RAM requirements.

If you convert the date based tags to fields, that will probably help. Also, removing the very high cardinality tag values should also help. I really need to see the report generated by influx_inspect to see if these items are actually the issue or not though.
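
To illustrate why a date-valued tag is a problem, here is a minimal Python sketch. It models a series key as the measurement name plus the sorted tag set, which is a simplification and not InfluxDB's actual implementation. The measurement name, tag names, and 5-second reporting interval are hypothetical:

```python
def series_key(measurement, tags):
    """Simplified series key: measurement plus the sorted tag set."""
    return measurement + "".join(
        ",%s=%s" % (key, tags[key]) for key in sorted(tags)
    )


def count_series(points):
    """Count distinct series among (measurement, tags) pairs."""
    return len({series_key(measurement, tags) for measurement, tags in points})


# One sensor reporting every 5 seconds for an hour (720 points).
# Variant 1: the last-update timestamp is stored as a tag.
as_tag = [
    ("sensor", {"id": "s1", "last_update": str(1456700000 + 5 * i)})
    for i in range(720)
]
# Variant 2: the timestamp is stored as a field, so it is not part of the key.
as_field = [("sensor", {"id": "s1"}) for _ in range(720)]

print(count_series(as_tag))    # 720 series, one per distinct timestamp
print(count_series(as_field))  # 1 series
```

In this model every new tag value adds a series, so the series count keeps growing with time when the value is a timestamp, while the field variant stays at one series per sensor.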

zstyblik commented Feb 29, 2016

@jwilder, please, is this documented in the FAQ or guidelines? I found a note on series cardinality. The rest is sort of "new", e.g. large tag keys or values.

fxstein commented Mar 5, 2016

@jwilder Thank you for the insights. Sorry for my slow response - just came back from a business trip. Going to work on this over the weekend, really want to get to the bottom of this. Have wondered how to best use tags and in which cases. Especially with new IoT devices and APIs it's sometimes totally unknown what data you will receive over time. As a default I have taken anything that is numeric as a field and anything that comes as a string as a tag - which might create exactly the problem you describe.

Having said that, is there any way to change an existing time series or do I have to start over if I remove certain tags and turn them into fields?

I am going to get you the output of influx_inspect.

fxstein commented Mar 5, 2016

@jwilder I have inspect running right now, but it might take a while given the server is OOM most of the time.

The header of inspect is already pointing at the problem you suggested: high cardinality of tags leading to tons of series:

File: /usr/local/var/influxdb/data/Home/default/62/000000033-000000002.tsm
Time Range: 2016-02-22T00:00:00Z - 2016-02-28T23:10:06Z
Duration: 167h10m6s
Series: 7212585
File Size: 2154696413

Need to find a way to eliminate tags that don't make sense.

I think the main culprit is the Ubiquiti MFI data that has timestamps for the last update of a sensor and that might change every few seconds.

fxstein commented Mar 5, 2016

@jwilder Ok this is definitely the tag cardinality problem you suggested. Within minutes I had a 2+GB inspect output file. Not gonna upload any part of it as I can see what is happening with some of the tags. Most of my series have some high cardinality tags and the worst offender is definitely the MFI sensor data with 3 timestamps that update every few seconds for every individual sensor.

Now as for the cleanup: Can I simply drop tags I want to eliminate and free up memory by doing so or do I need to reload all of that data?

joelegasse (Contributor) commented Mar 19, 2016

@fxstein The data is split up by series, which takes the measurement name and tag set (keys and values) to create a series key. I do not believe there is currently a means of combining series by "dropping" tags. I think you'll have to re-load your data and make those tags fields instead.
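
A re-load along those lines could rewrite each point with the offending tags moved into the field set before writing it back. The following Python sketch shows one way to do that when serializing to line protocol. It is an illustration under assumptions, not an InfluxDB tool: the function names, the `demote` parameter, and the sample measurement are hypothetical, and the escaping is simplified (backslashes in tag values are not handled).

```python
def _escape(text, chars):
    """Backslash-escape the given characters in a key, tag value, or name."""
    text = str(text)
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value):
    """Format a field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # check bool first; bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return "%di" % value
    if isinstance(value, float):
        return repr(value)
    return '"%s"' % str(value).replace("\\", "\\\\").replace('"', '\\"')


def to_line(measurement, tags, fields, timestamp, demote=()):
    """Build one line-protocol line, moving the tags named in `demote` to fields."""
    tags, fields = dict(tags), dict(fields)
    for key in demote:
        if key in tags:
            fields[key] = tags.pop(key)
    if not fields:
        raise ValueError("a point needs at least one field")
    tag_part = "".join(
        ",%s=%s" % (_escape(k, ", ="), _escape(tags[k], ", ="))
        for k in sorted(tags)
    )
    field_part = ",".join(
        "%s=%s" % (_escape(k, ", ="), _field_value(fields[k]))
        for k in sorted(fields)
    )
    return "%s%s %s %d" % (_escape(measurement, ", "), tag_part, field_part, timestamp)


line = to_line(
    "mfi_sensor",
    {"port": "1", "last_update": "2016-02-28T16:53:21"},
    {"power": 12.5},
    1456707201,  # seconds, matching precision=s on the write endpoint
    demote=("last_update",),
)
print(line)
# mfi_sensor,port=1 last_update="2016-02-28T16:53:21",power=12.5 1456707201
```

The demoted value becomes a quoted string field, so it no longer contributes to the series key.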

joelegasse closed this Mar 19, 2016

fxstein commented Mar 19, 2016

Thanks I was afraid that would be the answer.

It makes the whole schemaless - late binding - you don't need to design a database - argument a moot point. It means that if you don't know your data upfront and you make a mistake, you are screwed and have to start over. We have learned the same from the likes of MongoDB and now so many other non-relational approaches. This always becomes the Achilles heel of any such solution.
Especially in an agile environment you cannot predict the behavior of data upfront all the time. Things change, and any solution that forces you into a drop-your-data-and-reload approach has a huge problem for anything beyond the most simplistic use cases. Not only do you not know the data upfront in many cases, it also changes on you. The source system or connector might all of a sudden turn low cardinality fields into high cardinality ones.
This will have to get addressed or developers will quickly cut their losses after a few of these experiences and move on. This is the 3rd time that I will start over with loading IoT data into InfluxDB, always with another change that prevents moving forward.
I probably should have invested in writing an export and reload feature, but then again these are the basic capabilities that an underlying platform has to enable.

KrishnaPG commented Aug 5, 2016

I agree with @fxstein

Unfortunately InfluxDB has this weak spot with tags. The approach @desa laid out in #3445 (comment) could be of some help in maintaining evolving tags, though I would prefer to see InfluxDB support tags / metadata as first-class citizens, treating them as another time series, albeit a low-frequency one (stays constant most of the time but can change at any time), rather than treating them as "additional tagged along data".
