
[inmem] Startup in 1.5.0 is much slower than in 1.3.5 and 1.4.2 #9534

Closed · ghost opened this issue Mar 9, 2018 · 39 comments

Assignee: benbjohnson
Labels: difficulty/medium (Resolving this issue should take up to a week), need more info
Milestone: 1.5.2

ghost commented Mar 9, 2018

I'm reopening issue #9486 because I see the same behaviour with the final 1.5.0 release.

Bug report

System info:

Branch: 1.5
Commit: 6ac835404e7e64ea7299a6eebcce1ab1ef15fe3c
Version: 1.5.0
Build Time: ""

OS:
Red Hat Enterprise Linux Server release 6.4 (Santiago)
or
CentOS Linux release 7.2.1511 (Core)

Steps to reproduce:

  1. First test in 1.3.5
[I] 2018-03-09T13:35:40Z InfluxDB starting, version 1.3.5, branch HEAD, commit 9d9001036d3585cf21925c13a57881bc6c8dcc7e
[I] 2018-03-09T13:35:40Z Go version go1.8.3, GOMAXPROCS set to 12
[I] 2018-03-09T13:35:40Z Using configuration at: /tech/influxdb/conf/influxdb.conf
...
[I] 2018-03-09T13:35:55Z /tech/influxdb/var/data/influxdb/data/telegraf/default/2323 opened in 8.036002793s service=store
...
[I] 2018-03-09T13:35:57Z Listening on HTTP:0.0.0.0:9001 service=httpd
  2. Second test in 1.4.2
[I] 2018-03-09T14:00:28Z InfluxDB starting, version 1.4.2, branch 1.4, commit 6d2685d1738277a1c2672fc58df7994627769be6
[I] 2018-03-09T14:00:28Z Go version go1.9.2, GOMAXPROCS set to 12
[I] 2018-03-09T14:00:28Z Using configuration at: /tech/influxdb/conf/influxdb.conf
...
[I] 2018-03-09T14:00:44Z /tech/influxdb/var/data/influxdb/data/telegraf/default/2323 opened in 8.537640076s service=store
...
[I] 2018-03-09T14:00:47Z Listening on HTTP:0.0.0.0:9001 service=httpd

  3. Third test in 1.5.0
ts=2018-03-09T13:43:45.145501Z lvl=info msg="InfluxDB starting" log_id=06jktnzG000 version=1.5.0 branch=1.5 commit=6ac835404e7e64ea7299a6eebcce1ab1ef15fe3c
ts=2018-03-09T13:43:45.145557Z lvl=info msg="Go runtime" log_id=06jktnzG000 version=go1.9.2 maxprocs=12
...
ts=2018-03-09T13:54:22.970753Z lvl=info msg="Opened shard" log_id=06jktnzG000 service=store trace_id=06jktnzl000 op_name=tsdb_open path=/tech/influxdb/var/data/influxdb/data/telegraf/default/2323 duration=251247.336ms
...
ts=2018-03-09T13:54:33.583256Z lvl=info msg="Listening on HTTP" log_id=06jktnzG000 service=httpd addr=0.0.0.0:9001 https=false

Expected behavior:

Same startup time in all versions

Actual behavior:

Release  Total startup time  One-shard open time
1.3.5    17s                 8.036002793s
1.4.2    19s                 8.537640076s
1.5.0    11min               251247.336ms

We also tested 1.5.0 with tsi1 enabled (after running the influx_inspect buildtsi command) and got the same result.

ghost changed the title [inmem] Startup in 1.5.0 is slower than in 1.3.5 and 1.4.2 → [inmem] Startup in 1.5.0 is very slower than in 1.3.5 and 1.4.2 Mar 9, 2018
@benbjohnson (Contributor)

@ptitou Can you provide some details about the hardware? e.g. How many cores, how much RAM, and are you running SSDs?

Does this occur on every startup or just the first time?

ghost (Author) commented Mar 9, 2018

Hardware info:
1 CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (6 cores / 12 threads)
MemTotal: 65862900 kB (64 GB)
Disk: hardware RAID5 of 3 HDDs (1 TB total)

It occurs on every startup, even on other servers (tested on a VM with an SSD; same duration).

@benbjohnson (Contributor)

@ptitou How many series do you have in the shard?

max3163 commented Mar 9, 2018

Hi,

I'm working on this issue with @ptitou.
Here are more details about the series in the shard:

Summary:
File: /tech/influxdb/var/data/influxdb/data/telegraf/default/2323/000002441-000000005.tsm
  Time Range: 2018-01-08T00:00:00Z - 2018-01-14T23:59:55Z
  Duration: 167h59m55s   Series: 17126   File Size: 2147502592

File: /tech/influxdb/var/data/influxdb/data/telegraf/default/2323/000002441-000000006.tsm
  Time Range: 2018-01-08T00:00:00Z - 2018-01-14T23:59:55Z
  Duration: 167h59m55s   Series: 217919   File Size: 2147702233

File: /tech/influxdb/var/data/influxdb/data/telegraf/default/2323/000002441-000000007.tsm
  Time Range: 2018-01-08T00:00:00Z - 2018-01-14T23:59:55Z
  Duration: 167h59m55s   Series: 147602   File Size: 2147633909

File: /tech/influxdb/var/data/influxdb/data/telegraf/default/2323/000002441-000000008.tsm
  Time Range: 2018-01-08T00:00:00Z - 2018-01-14T23:59:55Z
  Duration: 167h59m55s   Series: 17395   File Size: 273550172

benbjohnson self-assigned this Mar 9, 2018
@benbjohnson (Contributor)

@max3163 Thank you. Can you start up influxd with some additional flags and SIGINT (CTRL-C) it after it loads the shards?

$ influxd run -cpuprofile cpu.pprof -memprofile mem.pprof

Then zip the cpu.pprof and mem.pprof files up and drop them here in this GitHub issue.
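
Something like this should do it on a typical Linux box (just a sketch; the pgrep and zip steps below are generic shell, not InfluxDB-specific, so adjust them to your environment):

$ influxd run -cpuprofile cpu.pprof -memprofile mem.pprof
# in a second shell, once the shards have finished opening:
$ kill -INT "$(pgrep -x influxd)"    # equivalent to CTRL-C
$ zip pprof.zip cpu.pprof mem.pprof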

@gabrielmocan

Also, 1.5.0 takes MUCH more RAM, even using the tsi1 index. I guess I'll have to switch back to 1.4.3 =\

max3163 commented Mar 9, 2018

Find attached the pprof files, as requested:

pprof.zip

We also see roughly twice the RAM usage on InfluxDB 1.5:

(screenshot: RAM usage comparison)

@benbjohnson (Contributor)

@max3163 @gabrielmocan Thanks for the info. We've consistently seen a drop in memory usage so I'm curious as to why your memory has gone up. I'm taking a look at the profiles now.

@benbjohnson (Contributor)

@gabrielmocan One thing to note is that enabling TSI will only affect new shards. You can build TSI indexes for existing shards using the buildtsi utility: http://docs.influxdata.com/influxdb/v1.5/tools/influx_inspect/#influx-inspect-buildtsi
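
For example, with the default Linux package paths (an illustrative sketch only; adjust the paths to your install and run it while influxd is stopped, as the docs recommend):

$ systemctl stop influxdb
$ influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
$ systemctl start influxdb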

@benbjohnson (Contributor)

@max3163 It looks like it's still using the inmem index from the profile: https://rawgit.com/benbjohnson/020eff9976185bc186f5d0cf7b469fa1/raw/17fde3e2394dcf3231088c32c083a3a8972a1bbd/mem.svg

Can you show the output of tree on your shard's data directory?

@gabrielmocan

@benbjohnson I've migrated from 1.4.3 and the shards were already tsi1 indexed.

@benbjohnson (Contributor)

@gabrielmocan Unfortunately, you'll need to rebuild the TSI indexes when migrating to 1.5. Can you try that and let me know if it improves your memory usage?

max3163 commented Mar 9, 2018

@benbjohnson

As indicated in the issue title, yes, the problem occurs when we use the "inmem" index.
We made only one test with the tsi1 index, on another instance (using influx_inspect as indicated in the docs to migrate the data), and the loading time was the same.

Sorry, I don't have access to the server right now; I will send you the output of the tree command on Monday.

@benbjohnson (Contributor)

@max3163 I apologize. I got sidetracked on the TSI side discussion. Disregard my tree request.

@gabrielmocan

Sorry for disrupting the topic guys.

max3163 commented Mar 9, 2018

@benbjohnson

No problem! Don't hesitate to ask for more information if needed.

jwilder added the difficulty/medium label Mar 13, 2018
@benbjohnson (Contributor)

@max3163 I'm still looking into the issue. From the CPU profile it looks like there's not much CPU usage. The profile duration was 10.63s but there were only 62.24s of samples, which makes me think it's waiting on disk. Do you have startup logs for your SSD drives? Also, are these drives connected directly or mounted over NFS?

https://rawgit.com/benbjohnson/020eff9976185bc186f5d0cf7b469fa1/raw/731369cc54d655f61cbf86d6ede464a1fa0f6876/cpu.svg
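
If it really is disk-bound, watching iostat alongside a 1.5.0 startup should make it obvious (illustrative only; iostat comes from the sysstat package):

$ iostat -x 1    # watch %util and await on the data volume while the shards open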

ghost (Author) commented Mar 14, 2018

@benbjohnson the disks are attached to the server with a Dell RAID5 controller. I confirm there is a lot of I/O wait during startup in 1.5.0.
Here is a graph where we start v1.4.2 and then v1.5.0:
(screenshot: disk I/O wait during v1.4.2 and v1.5.0 startup)

I also attach a zip with the cpu.pprof and mem.pprof files for the 1.4.2 startup, to compare with 1.5.0, find out whether concurrent access generates the I/O wait, and see where the startup process in 1.5.0 differs.
v1.4.2.zip

ghost (Author) commented Mar 14, 2018

If we look at the disk I/O graph and add up the total data read, we see that InfluxDB 1.5.0 reads about 480 GB of data while the files on disk total about 240 GB (half as much)!

Maybe that is a lead for the higher memory usage in 1.5.0 compared to 1.4.2?

szibis commented Mar 14, 2018

I see large constant read I/O, 122 MB/s and 955 IOPS, on a 1000 GB gp2 EBS volume.
Startup is very slow.

szibis commented Mar 14, 2018

1.4.3 startup vs 1.5.0 startup:

(screenshot: 1.4.3 startup metrics)

(screenshot: 1.5.0 startup metrics)

@benbjohnson (Contributor)

@ptitou @szibis Do you have full startup logs that you can share? Also, how frequently are you adding new series? 1.5.0 uses a new series file, persisted to disk, that lets us scale to a higher number of series; we memory-map that data into the process.
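
As a side note, the mapped series-file segments are visible in the running process if you want to confirm they are there (illustrative only, not required for this issue):

$ pmap -x "$(pgrep -x influxd)" | grep _series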

ghost (Author) commented Mar 16, 2018

Here are the startup log files for v1.4.2 and v1.5.0:
influxdb_log.zip

Also, the number of series for the last 7 days:
(screenshot: series count over the last 7 days)

The tests were made on a snapshot of the production database, so there is no new data between startup tests.

@benbjohnson (Contributor)

@ptitou Can you send the output of this command? It could be slow if your series data isn't being held in memory. That doesn't seem likely since you have 64 GB of RAM, but I want to double-check.

$ find $DATADIR -type d | grep _series | xargs ls -l

Also, are you running this in a container or restarting the host between startups?
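
If vmtouch happens to be installed, it can also show how much of the series files stays resident in the page cache between startups (purely optional, and it assumes vmtouch is available on your distro):

$ find $DATADIR -type d | grep _series | xargs vmtouch -v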

ghost (Author) commented Mar 16, 2018

Here is the output of the find command:
find.zip

The database is running on a physical server; there is no restart of the host between startups, and no container or VM is involved.

@benbjohnson (Contributor)

@ptitou Sorry for the delay. We're having trouble reproducing the issue. Are you able to share any of your data files? Or can you run influx_inspect report? In particular, shard qualif/default/1773 seems to have a huge disparity between 1.4.3 & 1.5.0 (7s vs 360,000s).

ghost (Author) commented Mar 20, 2018

I don't have the shard qualif/default/1773, but if I run the command on another shard, first with v1.5.0 and then with v1.4.3, here is the result:

Load time from the startup logs:

1.4.2: /tech/influxdb/var/data/influxdb/data/qualif/default/2175 5.114586342s
1.5.0: /tech/influxdb/var/data/influxdb/data/qualif/default/2175 152163.862ms

v1.5.0

-bash-4.1$ /tech/influxdb/influxdb-1.5.0/usr/bin/influx_inspect report 2175
DB      RP      Shard   File                    Series  New (est) Min Time                       Max Time                       Load Time
qualif  default 2175    000000438-000000003.tsm 26154   26243     2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 4.669501721s
qualif  default 2175    000000438-000000004.tsm 12563   12183     2017-11-27T00:00:00.56750589Z  2017-12-03T23:59:59.027678227Z 4.486162495s
qualif  default 2175    000000438-000000005.tsm 56064   56584     2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 3.656177153s

Summary:
  Files: 3
  Time Range: 2017-11-27T00:00:00.070737193Z - 2017-12-03T23:59:59.027678227Z
  Duration: 167h59m58.956941034s

Statistics
  Series:
     - qualif (est): 95010 (100%)
  Total (est): 95010
Completed in 12.85635712s

1.4.3

-bash-4.1$ /tech/influxdb/influxdb-1.4.2-1/usr/bin/influx_inspect report 2175
DB      RP      Shard   File                    Series  New (est) Min Time                       Max Time                       Load Time
qualif  default 2175    000000438-000000003.tsm 26154   26243     2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 7.092426ms
qualif  default 2175    000000438-000000004.tsm 12563   12183     2017-11-27T00:00:00.56750589Z  2017-12-03T23:59:59.027678227Z 4.421906ms
qualif  default 2175    000000438-000000005.tsm 56064   56584     2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 14.351637ms

Summary:
  Files: 3
  Time Range: 2017-11-27T00:00:00.070737193Z - 2017-12-03T23:59:59.027678227Z
  Duration: 167h59m58.956941034s

Statistics
  Series:
     - qualif (est): 95010 (100%)
  Total (est): 95010
Completed in 95.538358ms

I don't know if this helps you. If you want a shard, do you have an FTP server where I can upload this one (6 GB of data)?

tyauvil commented Mar 20, 2018

@ptitou
Please send an email to support@influxdata.com so we can share upload information.

@benbjohnson (Contributor)

@ptitou OK, this is weird. I've tried on several machines and I can't reproduce the issue. I even tried it on a limited 1 vcpu/1GB RAM machine:

v1.5.0

$ wget https://dl.influxdata.com/influxdb/releases/influxdb-1.5.1_linux_amd64.tar.gz
$ tar xzvf influxdb-1.5.1_linux_amd64.tar.gz
$ influxdb-1.5.0-1/usr/bin/influx_inspect report 2175
DB      RP      Shard   File                    Series  New (est) Min Time                       Max Time                       Load Time
        root    2175    000000438-000000003.tsm 26154   26243     2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 6.088269ms
        root    2175    000000438-000000004.tsm 12563   12183     2017-11-27T00:00:00.56750589Z  2017-12-03T23:59:59.027678227Z 3.050018ms
        root    2175    000000438-000000005.tsm 56064   56584     2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 16.402941ms

Summary:
  Files: 3
  Time Range: 2017-11-27T00:00:00.070737193Z - 2017-12-03T23:59:59.027678227Z
  Duration: 167h59m58.956941034s 

Statistics
  Series:
     -  (est): 95010 (100%)
  Total (est): 95010
Completed in 184.327883ms

v1.4.3

$ wget https://dl.influxdata.com/influxdb/releases/influxdb-1.4.3_linux_amd64.tar.gz
$ tar xzvf influxdb-1.4.3_linux_amd64.tar.gz
$ influxdb-1.4.3-1/usr/bin/influx_inspect report 2175
DB      RP      Shard   File                    Series  New (est) Min Time                       Max Time                       Load Time
        root    2175    000000438-000000003.tsm 26154   26243     2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 7.225517ms
        root    2175    000000438-000000004.tsm 12563   12183     2017-11-27T00:00:00.56750589Z  2017-12-03T23:59:59.027678227Z 6.398312ms
        root    2175    000000438-000000005.tsm 56064   56584     2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 10.399071ms

Summary:
  Files: 3
  Time Range: 2017-11-27T00:00:00.070737193Z - 2017-12-03T23:59:59.027678227Z
  Duration: 167h59m58.956941034s 

Statistics
  Series:
     -  (est): 95010 (100%)
  Total (est): 95010
Completed in 186.647269ms

I accidentally fired up an Ubuntu instance instead of CentOS but I'm firing up a CentOS 7 instance right now to double check.

@benbjohnson (Contributor)

OK, it looks like it is an issue on CentOS. I'm not seeing quite the same disparity but it's still large. 1.4.3 is about 200ms and 1.5.0 is 2s.

ghost (Author) commented Mar 21, 2018

We got the same result on Red Hat Enterprise Linux Server release 6.4 (Santiago) and on CentOS Linux release 7.2.1511 (Core), as explained in the first comment, if that helps.

@benbjohnson (Contributor)

@ptitou I tracked it down to a change in madvise() that only seems to be affecting CentOS/RHEL. We are reverting it here: #9612

Thanks for all the help tracking this issue down.
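
For anyone who wants to see what their own build does, strace can log the madvise calls made while the shards open (illustrative only; running under strace will itself slow startup noticeably):

$ strace -f -e trace=madvise -o madvise.log influxd run
$ grep -c MADV madvise.log    # rough count of madvise calls during startup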

ghost (Author) commented Mar 21, 2018

👍 Thanks for your perseverance! I will try the patch and let you know whether it fixes things for us.

xo4n commented Mar 21, 2018

Hi, I have tested 1.5.1 on Debian and I also see long restart times. InfluxDB goes through all the shards and the disk is 100% busy reading. Opening each file takes from several seconds to more than a minute, as shown in the logs:

ts=2018-03-21T22:21:28.949409Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000007.tsm id=4 duration=44701.636ms
ts=2018-03-21T22:21:28.949478Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000015.tsm id=12 duration=35986.227ms
ts=2018-03-21T22:21:28.960476Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000012.tsm id=9 duration=44415.333ms
ts=2018-03-21T22:21:29.066193Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000013.tsm id=10 duration=44520.889ms
ts=2018-03-21T22:21:29.505564Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000017.tsm id=14 duration=36529.187ms
ts=2018-03-21T22:21:52.160085Z lvl=info msg="Opened shard" log_id=06zc7cNl000 service=store trace_id=06zc7crl000 op_name=tsdb_open path=/appdata/influxdb/data/service/collectd_service/2444 duration=132423.805ms
ts=2018-03-21T22:21:52.160776Z lvl=info msg="Opened shard" log_id=06zc7cNl000 service=store trace_id=06zc7crl000 op_name=tsdb_open path=/appdata/influxdb/data/service/collectd_service/3183 duration=199630.454ms
ts=2018-03-21T22:21:52.161794Z lvl=info msg="Opened shard" log_id=06zc7cNl000 service=store trace_id=06zc7crl000 op_name=tsdb_open path=/appdata/influxdb/data/service/collectd_service/3211 duration=68156.950ms

The first attempt was a restart with index-version = "tsi1" after running influx_inspect buildtsi -datadir data -waldir wal offline. The current attempt is running with index-version = "inmem" after deleting all the index directories. Both cases take hours to restart.

Is there a reason why it opens all the shards on restart? The dataset is 2.8 TB, with 2 million series distributed across 10 databases of different sizes.

(screenshot: influxdb 1.5.1 long restarts)

xo4n commented Mar 21, 2018

Just as an additional note in case it is helpful for someone else: the screenshot below shows the difference between restarting v1.5.1 with tsi1 vs inmem. The latter takes less time, consumes less memory, and looks more stable; the first restart with tsi1 took longer and started swapping at the end.
(screenshot: influxdb 1.5.1 restart, tsi1 vs inmem)

@markusressel

@ptitou
Sorry for the noob question here, but may I ask what query you used to get the total count of series as shown in your screenshot?

max3163 commented Mar 22, 2018

@markusressel

The query is :

SELECT last("numSeries") AS "numSeries" FROM "database" WHERE ("hostname" = '$host') AND $timeFilter GROUP BY time($__interval) fill(null)

From _internal datasource
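
If you just want a point-in-time number instead of a graph, the SHOW CARDINALITY statements added in 1.4 should also work (a sketch; replace telegraf with your database name):

$ influx -database telegraf -execute 'SHOW SERIES CARDINALITY'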

ghost (Author) commented Mar 22, 2018

Tested today with the nightly release; everything works fine!

Do you plan to release a version 1.5.2 with this commit (as explained in issue #9614), or do we have to wait for release 1.6?

e-dard added this to the 1.5.2 milestone Mar 27, 2018
timhallinflux changed the title [inmem] Startup in 1.5.0 is very slower than in 1.3.5 and 1.4.2 → [inmem] Startup in 1.5.0 is much slower than in 1.3.5 and 1.4.2 Mar 29, 2018
@benbjohnson (Contributor)

@ptitou Sorry for the delay. This fix is in 1.5.2.
