
InfluxDB starts for 3 hours #5764

Closed
undera opened this issue Feb 20, 2016 · 23 comments

@undera

undera commented Feb 20, 2016

I have a POC with InfluxDB and a 14GB database. When I restart the service or the machine, it takes 3 hours to start up and begin serving. This is unacceptable, since my production can't afford 3 hours of downtime. The maximum downtime I can accept is ~1 minute.

Please suggest what can be changed in the configuration, or anywhere else, to bring it up instantly. I understand it might execute requests a bit slower at the beginning, while memory is still "cold", but it makes much more sense to start serving with higher response times than to not serve at all for 3 hours.

My server is an 8-CPU, 32GB RAM, SSD machine (my local testing env; the question of moving onto an AWS-hosted env depends on the POC's success). It runs Ubuntu 14.04,
InfluxDB version 0.10.0, branch 0.10.0, commit b8bb32e, built 2016-02-04T17:06:04.850564.

@undera
Author

undera commented Feb 20, 2016

I can provide the startup log and any stats, including stack traces taken with gdb during startup.

@jwilder
Contributor

jwilder commented Feb 20, 2016

Can you upload your startup log?

@undera
Author

undera commented Feb 20, 2016

influxd.log-20160218.txt

@sbouchex
Contributor

Looks like loading the WAL files takes a while... That's a real problem, as the service may be down for a while after a restart and customers may lose data points, which may incur penalties under an SLA.

What about parallelizing WAL loading? Or accepting connections before the WALs are loaded, storing incoming data points in a new WAL, and merging everything after startup completes?
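Something along these lines, perhaps; a minimal Go sketch of the parallel-loading idea, where `loadSegment` is a hypothetical stand-in and not the actual influxdb code path:

```go
package sketch

import (
	"log"
	"sync"
)

// loadSegment is a hypothetical stand-in for replaying one WAL segment
// into the in-memory cache; the real influxdb code path differs.
func loadSegment(path string) error {
	// ... decode entries and apply them to the cache ...
	return nil
}

// loadWALConcurrently replays WAL segments with a bounded pool of
// workers instead of strictly one file at a time.
func loadWALConcurrently(segments []string, workers int) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for path := range jobs {
				if err := loadSegment(path); err != nil {
					log.Printf("error loading %s: %v", path, err)
				}
			}
		}()
	}

	for _, s := range segments {
		jobs <- s
	}
	close(jobs)
	wg.Wait()
}
```

One caveat: entries within a single shard's WAL still have to be applied in order (later writes override earlier ones), so in practice the parallelism would more likely be across shards than across the segments of one shard.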

@undera
Author

undera commented Feb 22, 2016

The WAL files are all zero length, since there have been no writes for a long time and I did several restarts. The startup log does not provide good info here. From my observations, something happens after a WAL file is loaded and before the next log record appears. I assume the server can't be spending hours loading zero-length WAL files.

@undera
Author

undera commented Feb 22, 2016

Here are 2 stack traces, taken while the log says nothing and the server just burns CPU. I omitted all sleeping/idle threads; the only work seems to be done by this one thread, in "NewFieldCodec". That is where most of the startup time is spent.

stack2.txt
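For reference, sending SIGQUIT to a Go binary such as influxd dumps goroutine stacks to stderr before it exits. A non-fatal variant has to be compiled in; a generic sketch of such a handler (not something influxd ships with) would be:

```go
package sketch

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// dumpStacksOnSignal writes every goroutine's stack to stderr whenever
// SIGUSR1 arrives, without terminating the process. (SIGQUIT also dumps
// stacks, but the process then exits.)
func dumpStacksOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()
}
```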

@jwilder
Contributor

jwilder commented Feb 22, 2016

That stack trace points to loading the in-memory index, which is what is taking so much time. If you have a test env, I would be curious to see whether #5372 helps. There are some issues I still need to fix with that PR, so if you try it out, make sure it's on a test env.
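For context, a deliberately simplified, hypothetical illustration (not the actual tsdb code) of why the index load dominates: the in-memory index is rebuilt at startup with one pass over every series key, plus a codec per measurement (roughly what the NewFieldCodec frames in the trace correspond to), so the cost scales with the number of series and shards rather than with raw data size:

```go
package sketch

// FieldCodec stands in for per-measurement field metadata; building one
// per measurement is analogous to the NewFieldCodec calls in the trace.
type FieldCodec struct{}

// Index is a toy in-memory index: series keys plus one codec per measurement.
type Index struct {
	codecs map[string]*FieldCodec
	series map[string]struct{}
}

// buildIndex walks every series key once, so startup cost grows with the
// total number of series across all shards, not with raw data volume.
func buildIndex(seriesKeys []string, measurementOf func(string) string) *Index {
	idx := &Index{
		codecs: make(map[string]*FieldCodec),
		series: make(map[string]struct{}),
	}
	for _, key := range seriesKeys {
		idx.series[key] = struct{}{}
		m := measurementOf(key)
		if _, ok := idx.codecs[m]; !ok {
			idx.codecs[m] = &FieldCodec{}
		}
	}
	return idx
}
```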

@undera
Author

undera commented Feb 23, 2016

I can test it, but I'd need a Debian package to install it. If somebody can build it for me, or there is CI doing that, that would be very nice. I'm not a Go guy, sorry...

@r0bj

r0bj commented Feb 23, 2016

I have a similar case: it takes ~1h20m to start up my 19GB database.

https://gist.github.com/r0bj/ed8d01ff5e5d1254b95b

@hrzbrg

hrzbrg commented Feb 28, 2016

Same problem here, in combination with a Graphite listener.
I have a testing environment that receives the same data as my production DB. Today I installed the most recent nightly and experienced the described slow start after that; the log shows similar messages.
I'll wait until your PR is merged and then check the nightly again.

@fxstein

fxstein commented Feb 28, 2016

Same problem here. Startup performance and overall performance have again degraded on 0.10.1, to the point where it cannot handle a database of 28 days, at less than 1 million data points per day, on a dedicated 16GB RAM, dual-SSD machine. Pre-0.9 I was able to keep a database with a year of the same data; starting and loading was never an issue, and only queries spanning months could take minutes to run.
I started a totally fresh database on 0.10 at the beginning of February, and by now the database is unusable. The main issue is that it consumes 20+GB of RAM on a 16GB machine, and it's the swapping that kills it.

Here is a snapshot of the log after 45 minutes of startup (still processing and killing my system):

```
(InfluxDB ASCII-art startup banner omitted)
2016/02/28 15:11:16 InfluxDB starting, version 0.10.1, branch 0.10.0, commit b8bb32e, built unknown
2016/02/28 15:11:16 Go version go1.6, GOMAXPROCS set to 8
2016/02/28 15:11:17 Using configuration at: /usr/local/etc/influxdb.conf
[meta] 2016/02/28 15:11:17 Starting meta service
[meta] 2016/02/28 15:11:17 Listening on HTTP: 127.0.0.1:8091
[metastore] 2016/02/28 15:11:17 Using data dir: /usr/local/var/influxdb/meta
[metastore] 2016/02/28 15:11:17 Node at localhost:8088 [Follower]
[metastore] 2016/02/28 15:11:18 Node at localhost:8088 [Leader]. peers=[localhost:8088]
[meta] 2016/02/28 15:11:18 127.0.0.1 - - [28/Feb/2016:15:11:18 -0800] GET /?index=0 HTTP/1.1 200 1315 - Go-http-client/1.1 92de0980-de70-11e5-8001-000000000000 3.114313ms
[store] 2016/02/28 15:11:18 Using data dir: /usr/local/var/influxdb/data
[tsm1wal] 2016/02/28 15:11:18 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:11:18 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/38
[filestore]2016/02/28 15:11:22 /usr/local/var/influxdb/data/Home/default/38/000000016-000000004.tsm (#0) opened in 4.143386832s
[cacheloader] 2016/02/28 15:11:22 reading file /usr/local/var/influxdb/wal/Home/default/38/_00082.wal, size 0
[tsm1wal] 2016/02/28 15:13:24 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:13:24 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/46
[filestore]2016/02/28 15:13:24 /usr/local/var/influxdb/data/Home/default/46/000000016-000000005.tsm (#1) opened in 21.73417ms
[filestore]2016/02/28 15:13:26 /usr/local/var/influxdb/data/Home/default/46/000000038-000000003.tsm (#3) opened in 2.053936208s
[filestore]2016/02/28 15:13:30 /usr/local/var/influxdb/data/Home/default/46/000000016-000000004.tsm (#0) opened in 6.686841164s
[filestore]2016/02/28 15:13:30 /usr/local/var/influxdb/data/Home/default/46/000000038-000000002.tsm (#2) opened in 6.701890206s
[cacheloader] 2016/02/28 15:13:30 reading file /usr/local/var/influxdb/wal/Home/default/46/_00186.wal, size 0
[tsm1wal] 2016/02/28 15:23:27 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:23:27 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/54
[filestore]2016/02/28 15:23:29 /usr/local/var/influxdb/data/Home/default/54/000000016-000000005.tsm (#1) opened in 2.146638136s
[filestore]2016/02/28 15:23:41 /usr/local/var/influxdb/data/Home/default/54/000000033-000000004.tsm (#2) opened in 13.822702246s
[filestore]2016/02/28 15:23:41 /usr/local/var/influxdb/data/Home/default/54/000000016-000000004.tsm (#0) opened in 14.203828173s
[cacheloader] 2016/02/28 15:23:41 reading file /usr/local/var/influxdb/wal/Home/default/54/_00161.wal, size 0
[tsm1wal] 2016/02/28 15:47:45 tsm1 WAL starting with 10485760 segment size
[tsm1wal] 2016/02/28 15:47:45 tsm1 WAL writing to /usr/local/var/influxdb/wal/Home/default/62
[filestore]2016/02/28 15:47:48 /usr/local/var/influxdb/data/Home/default/62/000000031-000000002.tsm (#4) opened in 2.429135819s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000021-000000003.tsm (#1) opened in 4.416776609s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000025-000000003.tsm (#2) opened in 4.663655016s
[filestore]2016/02/28 15:47:50 /usr/local/var/influxdb/data/Home/default/62/000000029-000000003.tsm (#3) opened in 4.764130897s
[filestore]2016/02/28 15:47:57 /usr/local/var/influxdb/data/Home/default/62/000000016-000000004.tsm (#0) opened in 12.2506927s
[cacheloader] 2016/02/28 15:47:57 reading file /usr/local/var/influxdb/wal/Home/default/62/_00149.wal, size 19055
[cacheloader] 2016/02/28 15:47:57 reading file /usr/local/var/influxdb/wal/Home/default/62/_00150.wal, size 10310945
[cacheloader] 2016/02/28 15:47:59 reading file /usr/local/var/influxdb/wal/Home/default/62/_00151.wal, size 0
```

@cheribral

We are having this problem as well, but to be fair we've had it for a long time. I'm not sure if it is even the same thing I reported initially in #4952, but we haven't had any huge improvements in startup time since then. While we've put up with it, it became painful when we tripped over #5965.

I was reminded of it again today when I had to make a config change, which required a restart and caused quite lengthy downtime.

We have only about 70GB of data, on 8 CPUs and 60GB of memory, with about 600k series. During startup the influxdb server uses only 1.5 CPUs, there is no real disk IO to speak of, and it takes ages to do whatever it is doing. If there is anything I can contribute to sorting this out, please let me know.

EDIT - We are using version influxdb-0.10.3-1

@zstyblik

@jwilder are there any plans to merge #5372?

@jwilder
Contributor

jwilder commented Mar 22, 2016

@zstyblik Yes, but it would be very helpful if someone with a slow load time could try that PR out to see if it helps.

@jwilder
Contributor

jwilder commented Mar 22, 2016

@cheribral How many shards do you have?

@jwilder jwilder added this to the 0.12.0 milestone Mar 22, 2016
@zstyblik

@jwilder if somebody can provide a build with that fix, then I'll bet it won't be a problem. Quoting from a couple of comments above:

> I can test it, but I'd need a Debian package to install it. If somebody can build it for me, or there is CI doing that, that would be very nice. I'm not a Go guy, sorry...

@cheribral

@jwilder We have 410 shards (counted from the output of 'show shards') across 10 databases.

@cheribral

@jwilder I set up a similar box to our production one, and tried the branch on a copy of the real database. The startup time was 7 minutes, and the limiting factor seems to have been disk IO this time, which is much better than it was, so the change seems to have worked.

Since the disk doesn't really ever break a sweat once the server is up and we are running normally, it would seem a bit silly to spend more on disks simply to cut down the startup time. Do all the files absolutely have to be loaded before accepting writes? Is there any chance the loading could be made lazier so that the listening socket can be available sooner?
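For what it's worth, a rough sketch of what lazy shard loading could look like, assuming a hypothetical `Shard` type whose expensive open is deferred until the first read or write (not the actual store implementation):

```go
package sketch

import "sync"

// Point is a placeholder for a data point in this sketch.
type Point struct{}

// openShardFiles is a placeholder for the expensive open: reading TSM
// files, replaying the WAL, and building the shard's in-memory index.
func openShardFiles(path string) error { return nil }

// Shard defers its expensive load until first use, so the listening
// socket could come up before every shard has been opened.
type Shard struct {
	path     string
	loadOnce sync.Once
	loadErr  error
}

func (s *Shard) load() error {
	s.loadOnce.Do(func() {
		s.loadErr = openShardFiles(s.path)
	})
	return s.loadErr
}

// WritePoints loads the shard on first write, then appends the points.
func (s *Shard) WritePoints(points []Point) error {
	if err := s.load(); err != nil {
		return err
	}
	// ... append to the WAL / cache ...
	return nil
}

// Query loads the shard on first read, then serves from cache + TSM files.
func (s *Shard) Query(q string) ([]Point, error) {
	if err := s.load(); err != nil {
		return nil, err
	}
	// ... execute the query ...
	return nil, nil
}
```

The first write or query against a cold shard would then pay the full load cost, so the shards currently taking writes would probably still want to be opened eagerly at startup.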

@cheribral

Just as a test, I swapped in a much faster disk, and while it didn't saturate CPU or the disk by any means, it started in under 2 minutes.

@jwilder
Contributor

jwilder commented Mar 24, 2016

@cheribral That's good news. How long was it taking before? There is definitely more room for improvement in the load time. Lazy loading shards might be possible. The shard has to be loaded before it can take a write or serve queries though.

@undera
Author

undera commented Mar 24, 2016

+1 for lazy load. That would solve my case completely. In my case a shard takes writes for a short time and is then just stored for possible reads. The reader is OK with waiting for "cold" shards to load.

Can we have an option in the config file to enable/disable it? Just so we don't harm those who prefer to load everything upfront...
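For illustration only, a hedged sketch of how such a toggle might be plumbed through, in the TOML-tagged config-struct style influxdb uses; the `lazy-loading` key and the surrounding helpers are hypothetical, not existing settings:

```go
package sketch

// DataConfig mirrors the style of the [data] config section;
// lazy-loading is a hypothetical key, not a real setting.
type DataConfig struct {
	Dir         string `toml:"dir"`
	WALDir      string `toml:"wal-dir"`
	LazyLoading bool   `toml:"lazy-loading"` // default false: load everything upfront
}

// Shard stands in for a lazily loadable shard, as in the sketch above.
type Shard struct{ path string }

func (s *Shard) load() error { return nil } // placeholder for the expensive open

// openAll eagerly loads every shard unless lazy loading is enabled, in
// which case shards load themselves on their first read or write.
func openAll(cfg DataConfig, shards []*Shard) error {
	if cfg.LazyLoading {
		return nil
	}
	for _, sh := range shards {
		if err := sh.load(); err != nil {
			return err
		}
	}
	return nil
}
```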

@cheribral

@jwilder Before it was taking over 30 minutes.

@jwilder
Contributor

jwilder commented Mar 29, 2016

Fixed via #5372
