[BUG] newly created metrics not fetchable even with cache-scan turned on #372
Hi @jdblack
Thank you for the very quick response, @deniszh! I have added trie-index = true and restarted go-carbon. Unfortunately I'm still seeing a substantial delay before newly created metrics become visible. My [carbonserver] section now reads as follows:
carbon.create_delay, as mentioned in a previous comment, is a custom metric written by an in-house tool that creates a random metric and times how long it takes for the metric to become visible via a Graphite query.
Hi, it's just my understanding from reading the commit; I think @saipraneethyss should know more. On another note, you can also curl go-carbon with a force-scan request to manually trigger scanning, which was also added by @saipraneethyss's MR.
Off topic: I think I can add a new feature to my WIP concurrent trie index to make new metrics immediately available, without waiting for a scan to be triggered. Should be doable.
Looks like it, but I think @saipraneethyss should know, indeed. Also, in the case above, scan-frequency is 5 minutes, but metrics are delayed by 10 minutes...
I would say that's not off topic at all; that would be great!
@jdblack : just for clarity - did you try
@deniszh if go-carbon receives significantly more than max-creates-per-second new metrics, they will be delayed. Also, judging by the value (1000), it looks like the underlying storage is rather slow, and carbon in general doesn't like slow storage. As far as I remember, with the trigram index carbonserver doesn't query the cache for metrics that have no data on disk. If that's the case, either carbonserver should query the cache for metrics that don't exist on disk yet, or it might be a better idea to switch to ClickHouse, for example.
@deniszh oh, I missed that PR. Thanks.
@deniszh it's a good point; reading the code, it should work for both index types: https://github.com/go-graphite/go-carbon/blob/master/carbonserver/carbonserver.go#L624-L638 Just in case, @jdblack , are you using the latest release of go-carbon published by @deniszh some time ago? Do you have the version number or commit checksum?
Wow guys! I really appreciate all of you taking such a close look at this! I'd also be very happy to share a Zoom session if any of you would like a closer view of how this cluster looks and works.

I switched over to trigram-index and I'm still seeing the same problem. I'm using the Debian version of go-carbon 0.15.0, md5sum 48406db2a1fbb7df73fbd8e5f44b437b. I have embedded my most recent carbon.conf at the bottom of this comment.

A delay of 5 minutes (per the scan-frequency config) would be acceptable. In this environment, the fetchable-after-create delay is only 7 minutes, but in some of our production environments the delay is as long as 40 minutes! As mentioned previously, we detect this delay with an in-house tool that creates a completely new metric and times how long it takes for the metric to become renderable.

Our filesystems are indeed struggling (we're using compression on ZFS on raidz, with criminally oversized EBS volumes to buy IOPS). I agree that there would be no observable problem if the filesystem had plenty of breathing room. Still, I'd like to suggest that it's a fair expectation that go-carbon + carbonserver should be able to render not-yet-created metrics that it has in its caches, even if they haven't been persisted yet. In fact, existing metrics can be pulled from the cache in the expected way with no problem. It's just metrics that have never existed before, such as when an ASG scales and new instances appear. Thanks again for all the help!
I also seem to be having a problem, since switching to go-carbon 0.15 and enabling carbonserver and the trie options, where load can temporarily shoot up from around 5 to over 50 for no explicable reason. Could that be the cache-scan option crawling the entire filesystem? I'm starting to wonder if a lot of our pain and misery is coming from using ZFS for our backing store.
@jdblack It isn't that bad to have 7 minutes of delay with a scan frequency of 5 minutes. How many metrics do you have? You can check the metrics that go-carbon exports about its indexing to identify issues on those servers:
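For illustration, a hedged sketch of querying such internal metrics over the render API (the listen address, port, and exact metric path are assumptions; adjust them to your setup):

```shell
# Build a render query for go-carbon's self-reported cache metrics.
# HOST and TARGET are assumptions for illustration; adjust to your setup.
HOST="127.0.0.1:8080"
TARGET="carbon.agents.*.cache.size"
URL="http://${HOST}/render?target=${TARGET}&from=-1h&format=json"
echo "$URL"
# curl -s "$URL"   # uncomment to query a live go-carbon instance
```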
That's new to me; in our case, the trie index has lower CPU usage, especially when dealing with a high number of metrics. Both the trigram and trie indexes scan the filesystem.
Our graphite team is using XFS.
Yes, a delay of 5-7 minutes would be fine. The environment we're testing in has a delay of ~7-11 minutes. The problem we're really trying to address is the 40-minute delay in our big production environments!
Have you tried running a find command in your production environment? How many metrics do you have there? I think you might have an issue at the filesystem or hardware level.
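As an illustration, counting whisper files with find (the data-dir path here is an assumption; substitute your configured storage directory). How long this command takes is also a rough proxy for how painful a full index scan will be:

```shell
# Count whisper files under the go-carbon data dir.
# DATA_DIR is an assumption; point it at your whisper storage directory.
DATA_DIR="${DATA_DIR:-/var/lib/graphite/whisper}"
COUNT=$(find "$DATA_DIR" -name '*.wsp' -type f 2>/dev/null | wc -l)
echo "whisper files: $COUNT"
```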
Yeah, I suspect those lag spikes are coming from ZFS whenever the filesystem is fully scanned. One thing I'm unclear about: why do the indexes require scanning the filesystem when it's already known that the .wsp files don't yet exist on disk?
That's the way indexing currently works in go-carbon, both trigram and trie. The program scans the whole
I would also suggest running
I'm very sorry, I must be misunderstanding. Our problem is that we're not able to find/render metric series that do not yet exist on disk at all. In other words, there is no whisper file yet at all. How does scanning the filesystem for a file that we know doesn't exist help make them renderable? |
To summarize: to have newly created metrics available faster, you have to either make sure your go-carbon instance flushes the metrics cache to disk faster, or use
Ahh, but I have found that metric series that do not exist yet (as in no .wsp file yet) are -not- renderable even if cache-scan and trie-index / trigram-index are turned on.
@jdblack : v0.15.3 was released, including the concurrent trie index. You can try whether it works for new metrics with
If it works then we probably need to remove
@bom-d-van : what's a reasonable value for
Hi, I think a few thousand should be good enough, unless the go-carbon instance receives new metrics at a super high frequency. The reason it's a buffer is to be able to deal with bursts of new metrics and still have them available.
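A hedged sketch of what the relevant settings might look like together (option names as discussed in this thread; the values are illustrative assumptions, not recommendations):

```toml
[carbonserver]
# concurrent trie index discussed above
trie-index = true
# buffer of newly received metric names; "a few thousand" per the comment above
realtime-index = 4096
# full filesystem scans still run at this interval
scan-frequency = "5m0s"
```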
Oh, for this, realtime-index reuses part of the cache-scan logic, and
Hi all! sorry for missing out on the earlier discussion. @bom-d-van thanks for pitching in with your valuable responses.
We have been running the go-carbon instance with the cache-scan code on our production machines for close to 6 months now, where we handle more than a million metrics. With this change we were able to bring down our query-response time for new metrics from more than 30 minutes to less than 5 minutes (
@saipraneethyss : I can only 🤷 here, because according to @jdblack it is not working. But it's quite hard to reproduce and test, indeed. |
I'll give 0.15.3 a try! Some quick updates:
Our largest environment has two graphite-cache clusters: a shiny new cluster of 36 instances on XFS, and an old cluster of 36 instances on ZFS that is now read-only while the XFS cluster fills. The new cluster is about 25 days old, while the old cluster goes back about a year.
We ended up having to disable cache-scan, as walking the entire filesystem every few minutes was causing lookups against our cluster to collapse; the recurring find operation just sucked away a big chunk of our IOPS.

I think our problem is that the metrics are sitting in go-carbon's input cache, totally hidden until they are processed by go-carbon. When our developers scale up new instances, those instances push out new metrics under a new namespace (servicename.${IP}.rest.of.the.namespace). These brand-new metrics are not renderable until after they have been persisted to disk. This behavior differs from new data points for existing metrics, which become renderable -immediately- from the memory cache, even while they are still sitting in memory. To say this again: we have no problems looking up fresh data for .wsp files that have been around a long time, but we can't look up metrics in carbonserver for which the .wsp does not yet exist. It's as if go-carbon requires that the .wsp at least be created before metrics are renderable from cache.

I suspect that our pain comes from the metrics sitting unprocessed in go-carbon's memory cache. Is it possible that there is a bug in which go-carbon can render new data points for long-lived metrics from cache, but data points for entirely new metrics (where no .wsp previously existed) are not available until after they have been persisted at least once?

Switching from ZFS to XFS gave us a -huge- boost in metric write speeds, allowing us to raise max-updates from 500 to 1500 (with IOPS left over for lookups!). This cut the delay to find new metrics from 30-40 minutes down to 9 minutes. The lower we get carbon.agents.*.cache.size, the lower the delay before new metrics become renderable, which makes total sense.
I upgraded our dev clusters to v0.15.5 and I'm still able to reproduce the visibility problem. It's easy to reproduce: just set max-updates-per-second to a value low enough to push up the points per update.

I performed some tests with one of our smaller test clusters, which only receives ~25K metrics/sec. Adjusting max-updates-per-second had a direct effect on how long newly created metrics took to become renderable. New data points for existing metrics were always immediately available; in other words, we never saw lag in our tests for total relay metrics sent and received, or for metrics received per cache. We only saw lag on "Graphite Create Delays", which comes from an internal tool that creates a metric named carbon.create_delay.$RANDOM with the current timestamp as its value every few seconds and measures how long it takes to become renderable.

Timings from our test:
- max-updates-per-second: 450
- max-updates-per-second: 50
- max-updates-per-second: 1000
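The in-house tool itself isn't shown in this thread, but a minimal sketch of such a create-delay probe might look like this (the plaintext port, render endpoint, and polling interval are all assumptions):

```shell
# Sketch of a create-delay probe: emit a brand-new metric name, then poll
# the render API until it appears. Ports/endpoints are assumptions.
METRIC="carbon.create_delay.$RANDOM"
NOW=$(date +%s)
LINE="$METRIC $NOW $NOW"   # plaintext protocol: "name value timestamp"
echo "$LINE"
# echo "$LINE" | nc -q1 localhost 2003              # send to go-carbon
# until curl -sf "http://localhost:8080/render?target=$METRIC&format=json" \
#     | grep -q "$METRIC"; do sleep 5; done         # poll until renderable
# echo "visible after $(( $(date +%s) - NOW ))s"
```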
Your
Are you asking me to set a scan-frequency of "5m" with cache-scan set to disabled? We had a scan-frequency of 5m0s when cache-scan was turned on (you can see that in my September 28 comment), but it caused our production clusters to fall over, as it took 7 minutes to perform a filesystem scan, starving the filesystem of enough IOPs to perform reads and writes. There's more details in a previous comment (starting with "I'll give 0.15.3 a try!") |
When scan-frequency is set to 0, none of the indexing works, because indexing requires scanning the data dir; this also means the new trie index features aren't enabled. I guess your servers are not using SSD disks? If your configuration works for you, then I guess it's good.
This makes sense, because it makes it faster to flush new metrics to disk. Without
I think you can try setting a bigger scan frequency, e.g. 15 or 30 minutes. But please take note that you need to wait for at least one scan before indexing starts working.
Oh, I see part of the confusion. I meant to set cache-scan to false as @deniszh instructed, but did not. I'll try turning cache-scan=false and test, then set cache-scan=false and scan-frequency to 5m0s and test again.

I've been consistently confused about one thing: it seems that we are able to get not-yet-persisted data points from go-carbon as long as the .wsp exists. If the whisper file does not exist, then it's as if the metric does not exist at all. It's as if carbonsync (when used with graphite-web) and carbonserver are totally blind to metrics until they have been persisted at least once and the whisper file exists. To ask the question as simply as possible: is it correct to state that metrics are invisible until the whisper file exists?
Ok, now I am confused :)
Now I have another question: how
Yes, you have to wait until the whisper files are created for the new metrics.
Yes, with cache-scan or realtime-index, you can fetch new metrics before they are even flushed to disk.
OK. But as far as I understand the code, both parameters depend on scan-frequency, so theoretically they can help drop new-metric visibility time to somewhere around the scan-frequency interval, right?
@jdblack :
I said that only to isolate cases. You can try to use both. But please be sure that you enable file scanning too, i.e.
@deniszh when it comes to
yes, that is correct.
I'm happy to test that in dev, but we're not really able to use filesystem scanning in prod due to the size of our filesystems. Is there any way to decouple filesystem scanning from memory-cache scanning?
Describe the bug
We have migrated our test environments from [grafana -> graphite-web] -> [graphite-web -> go-carbon] to [grafana -> carbonapi] -> [carbonserver -> go-carbon]. Existing and updated metrics are fetchable in a timely manner. However, newly created metrics are not visible until persisted to disk.
We have an internal tool that creates a new random metric once a minute and times how long it takes for the metric to become fetchable. The delay in our low-usage test environment is a few seconds, while the delay in our high-usage test environment is as long as 12 minutes.
Go-carbon Configuration: