
InfluxDB starts for 3 hours #5764

Closed
undera opened this issue Feb 20, 2016 · 23 comments

@undera

undera commented Feb 20, 2016

I have a POC with InfluxDB and a 14GB database. When I restart the service or the machine, it takes 3 hours to start up and begin serving. This is unacceptable, since my production can't afford 3 hours of downtime. The maximum downtime I can accept is ~1 minute.

Please suggest what can be changed in the configuration, or anywhere else, to bring it up instantly. I understand it might execute requests a bit slower at the beginning, while memory is still "cold", but it makes much more sense to start serving with higher response times than to not serve at all for 3 hours.

My server is an 8-CPU, 32GB RAM, SSD machine (my local testing env; the question of moving onto an AWS-hosted env depends on the POC's success). It runs Ubuntu 14.04,
InfluxDB version 0.10.0, branch 0.10.0, commit b8bb32e, built 2016-02-04T17:06:04.850564.

@undera
Author

undera commented Feb 20, 2016

I can provide the startup log and any stats, including stack traces taken with gdb during startup.

@jwilder
Contributor

jwilder commented Feb 20, 2016

Can you upload your startup log?

@undera
Author

undera commented Feb 20, 2016

influxd.log-20160218.txt

@sbouchex
Contributor

Looks like loading the WAL files takes a while... That's a real problem, as the service may be down for a while after a restart and customers may lose data points, which may incur penalties under an SLA.

What about parallelizing WAL loading? Or accepting connections before the WALs are loaded, storing incoming data points in a new WAL, and merging everything after startup completes?
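Something along these lines, perhaps; a minimal Go sketch of the parallel-loading idea, where `loadSegment` is a hypothetical stand-in and not the actual influxdb code path:

```go
package sketch

import (
	"log"
	"sync"
)

// loadSegment is a hypothetical stand-in for replaying one WAL segment
// into the in-memory cache; the real influxdb code path differs.
func loadSegment(path string) error {
	// ... decode entries and apply them to the cache ...
	return nil
}

// loadWALConcurrently replays WAL segments with a bounded pool of
// workers instead of strictly one file at a time.
func loadWALConcurrently(segments []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for path := range jobs {
				if err := loadSegment(path); err != nil {
					log.Printf("error loading %s: %v", path, err)
				}
			}
		}()
	}

	for _, s := range segments {
		jobs <- s
	}
	close(jobs)
	wg.Wait()
}
```

One caveat: entries within a single shard's WAL still have to be applied in order (later writes override earlier ones), so in practice the parallelism would more likely be across shards than across the segments of one shard.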

@undera
Author

undera commented Feb 22, 2016

The WAL files are all zero length, since there have been no writes for a long time and I did several restarts. The startup log does not provide good info here. From my observations, something happens after a WAL file is loaded and before the next log record appears. I assume the server can't be spending hours loading zero-length WAL files.

@undera
Author

undera commented Feb 22, 2016

Here are 2 stack traces, taken while the log says nothing and the server just burns CPU. I omitted all sleeping/idle threads; the only work seems to be done by this one thread, in "NewFieldCodec". That is where most of the startup time is spent.

stack2.txt
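For reference, sending SIGQUIT to a Go binary such as influxd dumps goroutine stacks to stderr before it exits. A non-fatal variant has to be compiled in; a generic sketch of such a handler (not something influxd ships with) would be:

```go
package sketch

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpStacksOnSignal writes every goroutine's stack to stderr whenever
// SIGUSR1 arrives, without terminating the process. (SIGQUIT also dumps
// stacks, but the process then exits.)
func dumpStacksOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}
```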

@jwilder
Contributor

jwilder commented Feb 22, 2016

That stack trace points to loading the in-memory index, which is what is taking so much time. If you have a test env, I would be curious to see whether #5372 helps. There are some issues I still need to fix with that PR, so if you try it out, make sure it's on a test env.
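For context, a deliberately simplified, hypothetical illustration (not the actual tsdb code) of why the index load dominates: the in-memory index is rebuilt at startup with one pass over every series key, plus a codec per measurement (roughly what the NewFieldCodec frames in the trace correspond to), so the cost scales with the number of series and shards rather than with raw data size:

```go
package sketch

// FieldCodec stands in for per-measurement field metadata; building one
// per measurement is analogous to the NewFieldCodec calls in the trace.
type FieldCodec struct{}

// Index is a toy in-memory index: series keys plus one codec per measurement.
type Index struct {
	codecs map[string]*FieldCodec
	series map[string]struct{}
}

// buildIndex walks every series key once, so startup cost grows with the
// total number of series across all shards, not with raw data volume.
func buildIndex(seriesKeys []string, measurementOf func(string) string) *Index {
	idx := &Index{
		codecs: make(map[string]*FieldCodec),
		series: make(map[string]struct{}),
	}
	for _, key := range seriesKeys {
		idx.series[key] = struct{}{}
		m := measurementOf(key)
		if _, ok := idx.codecs[m]; !ok {
			idx.codecs[m] = &FieldCodec{}
		}
	}
	return idx
}
```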

@undera
Author

undera commented Feb 23, 2016

I can test it, but I'd need a Debian package to install it. If somebody can build it for me, or there is CI doing that, that would be very nice. I'm not a Go guy, sorry...

@r0bj

r0bj commented Feb 23, 2016

I have a similar case: it takes ~1h20m to start up my 19GB database.

https://gist.github.com/r0bj/ed8d01ff5e5d1254b95b

@hrzbrg

hrzbrg commented Feb 28, 2016

Same problem here, in combination with a Graphite listener.
I have a testing environment that receives the same data as my production DB. Today I installed the most recent nightly and experienced the described slow start after that; the log shows similar messages.
I'll wait until your PR is merged and then check the nightly again.

@fxstein

fxstein commented Feb 28, 2016

Same problem here. Startup performance and overall performance have again degraded on 0.10.1, to the point where it cannot handle a database of 28 days, at less than 1 million data points per day, on a dedicated 16GB RAM, dual-SSD machine. Pre-0.9 I was able to keep a database with a year of the same data; starting and loading was never an issue, and only queries spanning months could take minutes to run.
I started a totally fresh database on 0.10 at the beginning of February, and by now the database is unusable. The main issue is that it consumes 20+GB of RAM on a 16GB machine, and it's the swapping that kills it.

Here is a snapshot of the log after 45 minutes of startup (still processing and killing my system):

```
(InfluxDB ASCII-art startup banner omitted)
2016/02/28 15:11:16 InfluxDB starting, version 0.10.1, branch 0.10.0, commit b8bb32e, built unknown
2016/02/28 15:11:16 Go version go1.6, GOMAXPROCS set to 8
2016/02/28 15:11:17 Using configuration at: /usr/local/etc/influxdb.conf
[meta] 2016/02/28 15:11:17 Starting meta service
[meta] 2016/02/28 15:11:17 Listening on HTTP: 127.0.0.1:8091
[metastore] 2016/02/28 15:11:17 Using data dir: /usr/local/var/influxdb/meta
[metastore] 2016/02/28 15:11:17 Node at localhost:8088 [Follower]
[metastore] 2016/02/28 15:11:18 Node at localhost:8088 [Leader]. peers=[localhost:8088]
[meta] 2016/02/28 15:11:18 127.0.0.1 - - [28/Feb/2016:15:11:18 -0800] GET /?index=0 HTTP/1.1 200 1315 - Go-http-client/1.1 92de0980-de70-11e5-8001-000000000000 3.114313ms
[store] 2016/02/28 15:11:18 Using data dir: /usr/local/var/influxdb/data
[tsm1wal] 2016/02/28 15:11:18 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:11:18 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/38
[filestore]2016/02/28 15:11:22 /usr/local/var/influxdb/data/Home/default/38/000000016-000000004.tsm (#0) opened in 4.143386832s
[cacheloader] 2016/02/28 15:11:22 reading file /usr/local/var/influxdb/wal/Home/default/38/_00082.wal, size 0
[tsm1wal] 2016/02/28 15:13:24 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:13:24 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/46
[filestore]2016/02/28 15:13:24 /usr/local/var/influxdb/data/Home/default/46/000000016-000000005.tsm (#1) opened in 21.73417ms
[filestore]2016/02/28 15:13:26 /usr/local/var/influxdb/data/Home/default/46/000000038-000000003.tsm (#3) opened in 2.053936208s
[filestore]2016/02/28 15:13:30 /usr/local/var/influxdb/data/Home/default/46/000000016-000000004.tsm (#0) opened in 6.686841164s
[filestore]2016/02/28 15:13:30 /usr/local/var/influxdb/data/Home/default/46/000000038-000000002.tsm (#2) opened in 6.701890206s
[cacheloader] 2016/02/28 15:13:30 reading file /usr/local/var/influxdb/wal/Home/default/46/_00186.wal, size 0
[tsm1wal] 2016/02/28 15:23:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:23:27 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/54
[filestore]2016/02/28 15:23:29 /usr/local/var/influxdb/data/Home/default/54/000000016-000000005.tsm (#1) opened in 2.146638136s
[filestore]2016/02/28 15:23:41 /usr/local/var/influxdb/data/Home/default/54/000000033-000000004.tsm (#2) opened in 13.822702246s
[filestore]2016/02/28 15:23:41 /usr/local/var/influxdb/data/Home/default/54/000000016-000000004.tsm (#0) opened in 14.203828173s
[cacheloader] 2016/02/28 15:23:41 reading file /usr/local/var/influxdb/wal/Home/default/54/_00161.wal, size 0
[tsm1wal] 2016/02/28 15:47:45 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:47:45 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/62
[filestore]2016/02/28 15:47:48 /usr/local/var/influxdb/data/Home/default/62/000000031-000000002.tsm (#4) opened in 2.429135819s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000021-000000003.tsm (#1) opened in 4.416776609s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000025-000000003.tsm (#2) opened in 4.663655016s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000029-000000003.tsm (#3) opened in 4.764130897s
[filestore]2016/02/28 15:47:57 /usr/local/var/influxdb/data/Home/default/62/000000016-000000004.tsm (#0) opened in 12.2506927s
[cacheloader] 2016/02/28 15:47:57 reading file /usr/local/var/influxdb/wal/Home/default/62/_00149.wal, size 19055
[cacheloader] 2016/02/28 15:47:57 reading file /usr/local/var/influxdb/wal/Home/default/62/_00150.wal, size 10310945
[cacheloader] 2016/02/28 15:47:59 reading file /usr/local/var/influxdb/wal/Home/default/62/_00151.wal, size 0
```

@cheribral

We are having this problem as well, but to be fair we've had it for a long time. I'm not sure if it is even the same thing I reported initially in #4952, but we haven't had any huge improvements in startup time since then. While we've put up with it, it became painful when we tripped over #5965.

I was reminded of it again today when I had to make a config change, which required a restart and caused quite lengthy downtime.

We have only about 70GB of data, on 8 CPUs and 60GB of memory, with about 600k series. During startup the influxdb server uses only 1.5 CPUs, there is no real disk IO to speak of, and it takes ages to do whatever it is doing. If there is anything I can contribute to sorting this out, please let me know.

EDIT - We are using version influxdb-0.10.3-1

@zstyblik

@jwilder are there any plans to merge #5372?

@jwilder
Contributor

jwilder commented Mar 22, 2016

@zstyblik Yes, but it would be very helpful if someone with a slow load time could try that PR out to see if it helps.

@jwilder
Contributor

jwilder commented Mar 22, 2016

@cheribral How many shards do you have?

@jwilder jwilder added this to the 0.12.0 milestone Mar 22, 2016
@zstyblik

@jwilder if somebody can provide a build with that fix, then I'll bet it won't be a problem. Quoting from a couple of comments above:

> I can test it, but I'd need a Debian package to install it. If somebody can build it for me, or there is CI doing that, that would be very nice. I'm not a Go guy, sorry...

@cheribral

@jwilder We have 410 shards (counted from the output of 'show shards') across 10 databases.

@cheribral

@jwilder I set up a similar box to our production one, and tried the branch on a copy of the real database. The startup time was 7 minutes, and the limiting factor seems to have been disk IO this time, which is much better than it was, so the change seems to have worked.

Since the disk doesn't really ever break a sweat once the server is up and we are running normally, it would seem a bit silly to spend more on disks simply to cut down the startup time. Do all the files absolutely have to be loaded before accepting writes? Is there any chance the loading could be made lazier so that the listening socket can be available sooner?
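For what it's worth, a rough sketch of what lazy shard loading could look like, assuming a hypothetical `Shard` type whose expensive open is deferred until the first read or write (not the actual store implementation):

```go
package sketch

import "sync"

// Point is a placeholder for a data point in this sketch.
type Point struct{}

// openShardFiles is a placeholder for the expensive open: reading TSM
// files, replaying the WAL, and building the shard's in-memory index.
func openShardFiles(path string) error { return nil }

// Shard defers its expensive load until first use, so the listening
// socket could come up before every shard has been opened.
type Shard struct {
	path     string
	loadOnce sync.Once
	loadErr  error
}

func (s *Shard) load() error {
	s.loadOnce.Do(func() {
		s.loadErr = openShardFiles(s.path)
	})
	return s.loadErr
}

// WritePoints loads the shard on first write, then appends the points.
func (s *Shard) WritePoints(points []Point) error {
	if err := s.load(); err != nil {
		return err
	}
	// ... append to the WAL / cache ...
	return nil
}

// Query loads the shard on first read, then serves from cache + TSM files.
func (s *Shard) Query(q string) ([]Point, error) {
	if err := s.load(); err != nil {
		return nil, err
	}
	// ... execute the query ...
	return nil, nil
}
```

The first write or query against a cold shard would then pay the full load cost, so the shards currently taking writes would probably still want to be opened eagerly at startup.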

@cheribral

Just as a test, I swapped in a much faster disk, and while it didn't saturate CPU or the disk by any means, it started in under 2 minutes.

@jwilder
Contributor

jwilder commented Mar 24, 2016

@cheribral That's good news. How long was it taking before? There is definitely more room for improvement in the load time. Lazy loading shards might be possible. The shard has to be loaded before it can take a write or serve queries though.

@undera
Author

undera commented Mar 24, 2016

+1 for lazy load. That would solve my case completely. In my case a shard takes writes for a short time and is then just stored for possible reads. The reader is OK with waiting for "cold" shards to load.

Can we have an option in the config file to enable/disable it? Just so we don't harm those who prefer to load everything upfront...
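For illustration only, a hedged sketch of how such a toggle might be plumbed through, in the TOML-tagged config-struct style influxdb uses; the `lazy-loading` key and the surrounding helpers are hypothetical, not existing settings:

```go
package sketch

// DataConfig mirrors the style of the [data] config section;
// lazy-loading is a hypothetical key, not a real setting.
type DataConfig struct {
	Dir         string `toml:"dir"`
	WALDir      string `toml:"wal-dir"`
	LazyLoading bool   `toml:"lazy-loading"` // default false: load everything upfront
}

// Shard stands in for a lazily loadable shard, as in the sketch above.
type Shard struct{ path string }

func (s *Shard) load() error { return nil } // placeholder for the expensive open

// openAll eagerly loads every shard unless lazy loading is enabled, in
// which case shards load themselves on their first read or write.
func openAll(cfg DataConfig, shards []*Shard) error {
	if cfg.LazyLoading {
		return nil
	}
	for _, sh := range shards {
		if err := sh.load(); err != nil {
			return err
		}
	}
	return nil
}
```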

@cheribral

@jwilder Before it was taking over 30 minutes.

@jwilder
Contributor

jwilder commented Mar 29, 2016

Fixed via #5372
