[Q] Disabling scan-frequency for Carbonserver #400
Hi @loitho, if scan-frequency is set to 0, no index is built. It's a trade-off in the current system. Without an index, your queries might become slower as they fall back to filesystem globbing, but things should continue to work. How much memory does your server have? If the server's memory capacity is big enough, the kernel should be able to cache all the filesystem metadata in memory, and you shouldn't have too many IO issues caused by scanning directories. I'm not sure how many people are having issues with filesystem scanning. But now, with concurrent and realtime indexing support in the trie index, we should be able to support indexing without scanning.
Hi @bom-d-van, thank you for your quick reply! I understand, so basically carbonserver is behaving like a graphite-web instance and looking at the whisper files directly. Which is honestly still a pretty good thing, as it stops me from having to install graphite-web (nginx + gunicorn) on each of my nodes.
Makes sense, I didn't see any "build index time" on the graph so I assumed so :)
Each node has 32 GB of RAM; is there a way to make sure the kernel keeps the filesystem metadata in cache?
That would be awesome!
@loitho can you also share the graphs for memory and disk write metrics? Also, with collectd, I think there are merged read/write IOPS as well; can you share those too? Just trying to understand more of your system resource usage.
I haven't tweaked it myself, but you can try googling it a bit and find some proper kernel tuning parameters. This one might do the job: https://unix.stackexchange.com/a/76750/22938
(@deniszh or @azhiltsov might have better suggestions/knowledge in this area.)
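The tuning parameter usually discussed for this (as in the Stack Exchange answer above) is vm.vfs_cache_pressure. A sketch of a sysctl fragment; the file name and the value 1 are illustrative, not a recommendation:

```ini
# /etc/sysctl.d/99-fs-cache.conf (illustrative path)
# Values below the default of 100 make the kernel prefer keeping
# dentry/inode caches over reclaiming them. Setting it to 0 is
# discouraged, as it can cause out-of-memory conditions under pressure.
vm.vfs_cache_pressure = 1
```

Applied with `sysctl --system` (or `sysctl -p` against the specific file), this biases reclaim away from filesystem metadata, which is what carbonserver's globbing path depends on.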
Hmm, I don't think I understand this memory usage pattern. A lot of memory is freed and then used as cache. Can you also share the …? At the same time, you can also try enabling …
If the above config helps, you can also increase … Also, it just occurred to me: 32GB of RAM is big enough to keep all the dentries and inodes in memory. For 650,000 metrics/files, it should only be a few hundred MBs (at most 1GB). But I'm just speculating.
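The back-of-envelope estimate above can be sanity-checked with rough numbers. The per-object sizes below are assumptions (typical orders of magnitude for kernel slab objects on x86-64), not measured values:

```python
# Rough estimate of the kernel memory needed to cache filesystem
# metadata for N whisper files. Per-object sizes are assumed ballpark
# figures: ~200 bytes per dentry, ~1 KiB per inode (both vary by
# kernel version and filesystem).
def fs_metadata_cache_mib(num_files, dentry_bytes=200, inode_bytes=1024):
    return num_files * (dentry_bytes + inode_bytes) / 2**20

print(round(fs_metadata_cache_mib(650_000)))    # the 650k-file case above
print(round(fs_metadata_cache_mib(3_750_000)))  # the 3.75M files actually on each node
```

This lands in the "few hundred MBs" range for 650k files, but suggests a few GB for the 3.75M files each node actually holds, which is still comfortably within 32 GB.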
Thank you for your help. We have more: around 3,750,000 metrics per node, due to a lot of machines being autoscaled, etc. You also have the "trigram-index" disabled, correct?
Looking at the IOPS/LA graphs, I conclude you are using a spinning disk or array, not SSDs. Am I right? Normally you shouldn't see page caches being evicted, as go-carbon's performance relies heavily on them.
Hi @azhiltsov, thank you for your answer. PS: I haven't tried the suggestions and configuration above yet, as my day has already ended :)
This is a very common observation of mine from the past (not related to go-carbon): if the disk is saturated, then LA goes up because your cores are waiting for IO. I might be wrong. I think you are facing two problems:
Since the index is only needed to speed up queries, it's up to you to decide whether to use it. You need more speed? Get more memory. Start with 64G; if that's not enough, bump it further. The whole performance paradigm of go-carbon is built on keeping as much disk activity in page caches as possible. So you need to make sure that your caches stay put in memory and are never evicted. This is your second solution. Extra performance points:
Hi, it's me again, with some news.
Sorry for the dumb question, but what is "LA"? Now, here is what I tried. First I updated to go-carbon 0.15.6 (thank you for the fix and the ARM64 build, it'll serve us well in the future!), then set vm.vfs_cache_pressure = 1 on every even node of our cluster (as opposed to the default of 100).
Our machines are m5.2xlarge instances on AWS, with GP3 disks (16K IOPS / 200 MBps).

Queue writeout time doesn't have any spike, pretty good! Let's check the load. The memory graph interestingly shows that we are indeed keeping more of the folder and file tree information in memory. Updates per second also improve greatly: because there isn't an IO spike anymore, the updates per second don't drop.

So, is everything perfect? Well... not really. First of all, after 24 hours of runtime, some of the nodes started having huge load and reading a lot, for seemingly no reason (maybe the kernel evicted the index from memory, I don't know). I think if you have a lot of memory, disks with more sustained IOPS, and probably a better kernel than the 3.10 on our machines, it might make sense for you to try the setting.

Then I reconfigured everything back to default and tried building an index only every 6 hours. Looks pretty good! And it suits my disks better, as they're meant to have a 30 min burst period every 24 hours.

Conclusion: first of all, thank you all again for your help. I had a final question: since realtime-index updates the index, well... in real time, is there any point in regularly running the scan?

Kind regards,
It's a nice and detailed report. So most of our reasoning appears to be correct.
Does that coincide with the clean-up on the clusters?
Yes, it's for deletions. Eventually we can add a delete API in … All the new logic that we introduced is incremental and slowly evolving, so the implementation might seem odd looking at it now. One last tip: since you prefer to reduce the disk IO caused by indexing, you might also want to try this feature out.
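Putting the thread's conclusion together as a hypothetical go-carbon.conf fragment: the key names (trie-index, trigram-index, realtime-index, scan-frequency) come from this discussion, but the exact value types and defaults depend on your go-carbon version, so treat this as a sketch rather than a verified configuration:

```ini
[carbonserver]
trie-index = true
trigram-index = false
# With realtime indexing enabled, new metrics are added to the index
# as they arrive; the periodic scan is then only needed to purge
# deleted files, so it can run far less often. Both values below are
# illustrative assumptions, not recommendations.
realtime-index = 65536
scan-frequency = "6h0m0s"
```

This matches the trade-off described above: index freshness comes from the realtime path, while the infrequent scan handles cleanup after deletions.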
Yeah, you were pretty much spot on! (Not that I doubted it, but if anyone stumbles on this thread, they'll have some good information backed by graphs :) )
No, sadly it didn't; that's why I found it so odd. Nothing very interesting in the logs either. I think the implementation makes sense, I'm just trying to understand it fully, as well as its limitations :)
Ah yes, I read about it and immediately started using it when I changed the cluster configuration to the one you proposed; it works flawlessly, it's really awesome!
It makes sense from an evolutionary standpoint, as you gradually add functionality, but I think the documentation could use a bit more precision, e.g. explaining the interaction between "old and new functions", like the fact that when enabling realtime-index you can actually bump up the scan frequency, because then its only purpose is to purge deleted files from the index. Would you mind if I made a PR to add this information to the documentation?
Yep, it's a good idea. Thanks in advance! :D
Hi there,
Excuse the maybe naive question, but we're using go-carbon and receiving 650,000 metrics per node per minute, out of a 4-node cluster (so 2.5 million metrics per minute).
One issue we're facing is that the load and IOPS peak tremendously when a scan is triggered to build the index for carbonserver.
There are around 3,750,000 metrics (whisper files) per node.
I then applied the following configuration for Carbonserver :
As you can see, I set scan-frequency to 0.
And it's working nicely; my servers no longer get choked for 5 minutes (with very high load during this period) trying to read all of the files.
(This behavior was happening even when only the trie index was enabled)
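For reference, the relevant part of such a configuration might look like this. Only the scan-frequency value comes from the description above; the other keys are illustrative and their exact spelling and types should be checked against your go-carbon version:

```ini
[carbonserver]
enabled = true
trie-index = true
trigram-index = false
# Setting the scan frequency to zero disables the periodic filesystem
# scan entirely; queries then fall back to filesystem globbing.
scan-frequency = "0s"
```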
And so I was wondering: is it a problem to run with this configuration? Considering that I still get good performance from my cluster compared to before, is there a reason I should re-enable this setting?
Kind regards,
Thomas