[BUG] newly created metrics not fetchable even with cache-scan turned on #372

Open

jdblack opened this issue Sep 24, 2020 · 39 comments

jdblack commented Sep 24, 2020

Describe the bug
We have migrated our test environments from [grafana -> graphite-web] -> [graphite-web -> go-carbon] to [grafana -> carbonapi] -> [carbonserver -> go-carbon]. Existing and updated metrics are fetchable in a timely manner. However, newly created metrics are not visible until persisted to disk.

We have an internal tool that creates a new random metric once a minute and times how long it takes for the metric to become fetchable. The delay in our low-usage test environment is a few seconds, while the delay in our high-usage test environment is as long as 12 minutes.
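
A minimal sketch of such a probe, for illustration only (this is not the actual internal tool; the hostnames and ports are placeholders, and it assumes plaintext ingest on :2003 and a carbonapi /render endpoint):

#!/usr/bin/env bash
# Illustrative create-delay probe: send one datapoint for a brand-new metric,
# then poll /render until the series comes back with datapoints.
set -u

GO_CARBON_HOST="10.10.6.109"                            # any go-carbon ingest node (placeholder)
CARBONAPI_URL="http://carbonapi.example.internal:8080"  # placeholder
METRIC="carbon.create_delay.$RANDOM"
START=$(date +%s)

echo "$METRIC $START $START" > "/dev/tcp/$GO_CARBON_HOST/2003"

while ! curl -s "$CARBONAPI_URL/render?target=$METRIC&from=-10min&format=json" | grep -q '"datapoints"'; do
  sleep 5
done
echo "$METRIC became renderable after $(( $(date +%s) - START ))s"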

Go-carbon Configuration:

# go-carbon's   /etc/carbon/carbon.conf
[common]
user = "_graphite"
metric-endpoint="local"
metric-interval = "1m0s"
max-cpu = 6

[whisper]
data-dir = "/graphite/whisper"
schemas-file = "/etc/carbon/storage-schemas.conf"
aggregation-file= "/etc/carbon/storage-aggregation.conf"
workers = 4
max-updates-per-second = 1000
max-creates-per-second = 1000
sparse-create = true
flock = true
enabled = true
hash-filenames = false

[cache]
max-size = 5000000
write-strategy = "max"

[tcp]
listen = ":2003"
enabled = true
buffer-size = 5000000

[logging]
logger = ""
file = "/var/log/carbon/carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

[carbonlink]
listen  = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[receiver.http]
protocol = "http"
listen = ":2006"

[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trigram-index = false
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]

# carbonapi configuration (YAML)
listen: "0.0.0.0:8080"
cache:
  type: "mem"
  size_mb: 2000
  defaultTimeoutSec: 60
upstreams:
    graphite09compat: false
    buckets: 10
    keepAliveInterval: "30s"
    maxIdleConnsPerHost: 100
    timeouts:
        find: "10s"
        render: "30s"
        connect: "30s"
    backendsv2:
        backends:
          -
            groupName: "go_carbon"
            protocol: "carbonapi_v3_pb"
            lbMethod: "broadcast"
            maxTries: 3
            maxBatchSize: 0
            keepAliveInterval: "10s"
            maxIdleConnsPerHost: 1000
            timeouts:
                find: "10s"
                render: "30s"
                connect: "30s"
            servers:
                - "http://10.10.6.109:8080"
                - "http://10.10.6.197:8080"
                - "http://10.10.2.77:8080"
                - "http://10.10.2.24:8080"
jdblack added the bug label Sep 24, 2020

deniszh (Member) commented Sep 25, 2020

Hi @jdblack
As far as I understand, cache-scan = true will work only with indices, so you need either trigram-index = true or trie-index = true (or both).

jdblack (Author) commented Sep 25, 2020

Thank you for the very quick response, @deniszh!

I have added trie-index = true and restarted go-carbon. Unfortunately I'm still seeing a substantial delay in the visibility of created metrics. My [carbonserver] section now reads as follows:

[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trie-index = true
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]

carbon.create_delay, as mentioned in a previous comment, is a custom metric written by an in-house tool that creates a random metric and times how long it takes for the metric to become visible via a graphite query.

[Screenshot: Screen Shot 2020-09-25 at 9 26 44 AM]

bom-d-van (Member) commented Sep 26, 2020

Hi, I think cache-scan is still influenced by scan-frequency. You might have to lower scan-frequency if you want new metrics to be available faster. Also, scanning the metrics file tree and building the index takes some time. There are some logs/metrics from go-carbon telling how long they take.

It's just my understanding from reading the commit; I think @saipraneethyss should know more.

On another note, you can also curl go-carbon with a force-scan request to manually trigger scanning, also added by @saipraneethyss's MR.

curl 127.0.0.1:8080/forcescan

bom-d-van (Member):

Off topic: I think I can add a new feature to my WIP concurrent trie index to make new metrics immediately available, without waiting for a scan to be triggered. Should be doable.

deniszh (Member) commented Sep 26, 2020

@bom-d-van :

I think cache-scan is still influenced by scan-frequency.

Looks like it, but I think @saipraneethyss should know, indeed. Also, in the case above scan-frequency is 5 minutes, but metrics are delayed by ~10 minutes...

Off topic: I think I can add a new feature to my WIP concurrent trie index to make new metrics immediately available, without waiting for a scan to be triggered. Should be doable.

I would say that's not off topic at all, that would be great!

deniszh (Member) commented Sep 26, 2020

@jdblack : just for clarity - did you try trigram-index=true also?

Civil (Member) commented Sep 26, 2020

@deniszh if go-carbon receives significantly more new metrics than max-creates-per-second, they will be delayed. Also, judging by the value (1000), it looks like the underlying storage is rather slow, and carbon in general doesn't like slow storage.
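
For a rough illustration of the arithmetic (the numbers are hypothetical): if a scale-out introduces 30,000 brand-new series at once, a cap of max-creates-per-second = 1000 alone adds about 30 seconds before the last .wsp file exists, and in practice more, since creates compete with the regular update stream for IOPS.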

And as far as I remember, with the trigram index carbonserver doesn't query the cache for metrics that do not have data on disk.

If that's the case, either carbonserver should query cached metrics for those that don't exist on disk, or it might be a better idea to switch to ClickHouse, for example.

deniszh (Member) commented Sep 26, 2020

@Civil : you're describing the usual go-carbon behavior, and I think @jdblack is aware of it. But we're trying to understand why the possible fix, which was implemented in #338, is not working.
I suspect that it works only with the trigram index enabled.

Civil (Member) commented Sep 26, 2020

@deniszh oh, I missed that PR. Thanks.

bom-d-van (Member):

I suspect that it works only with the trigram index enabled.

@deniszh it's a good point; from reading the code, it should work for both index types: https://github.com/go-graphite/go-carbon/blob/master/carbonserver/carbonserver.go#L624-L638
However, it's better to verify it by trying it out.

Just in case, @jdblack , are you using the latest release of go-carbon that @deniszh published some time ago? Do you have the version number or commit checksum?

jdblack (Author) commented Sep 28, 2020

Wow guys! I really appreciate all of you taking such a close look at this! I'd also be very happy to share a zoom session if any of you would like to get a closer view of how this cluster looks and works.

I switched over to trigram-index and I'm still seeing the same problem. I'm using the Debian package of go-carbon 0.15.0, md5sum 48406db2a1fbb7df73fbd8e5f44b437b. I have embedded my most recent carbon.conf at the bottom of this comment.

A delay of 5 minutes (matching the scan-frequency setting) would be only 5 minutes. In this environment the fetchable-after-create delay is only 7 minutes, but in some of our production environments the delay is as long as 40 minutes! As mentioned previously, we detect this delay with an in-house tool that creates a completely new metric and times how long it takes for the metric to become renderable.

Our filesystems are indeed struggling (we're using compression on ZFS on raidz, with criminally oversized EBS volumes to buy IOPS). I agree that there would be no observable problem if the filesystem had plenty of breathing room. I'd like to suggest that it's a fair expectation that go-carbon + carbonserver should be able to render not-yet-created metrics that it has in its caches, even if they haven't been persisted yet.

In fact, existing metrics can be pulled from the cache in the expected way with no problem. It's just metrics that have never existed before, such as when an ASG scales out and launches new instances.

Thanks again for all the help!

[common]
user = "_graphite"
metric-endpoint="local"
metric-interval = "1m0s"
max-cpu = 2

[whisper]
data-dir = "/graphite/whisper"
schemas-file = "/etc/carbon/storage-schemas.conf"
aggregation-file= "/etc/carbon/storage-aggregation.conf"
workers = 2
max-updates-per-second = 500
max-creates-per-second = 500
sparse-create = true
flock = true
enabled = true
hash-filenames = false

[cache]
max-size = 5000000
write-strategy = "max"

[tcp]
listen = ":2003"
enabled = true
buffer-size = 5000000

[logging]
logger = ""
file = "/var/log/carbon/carbon.log"
level = "info"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

[carbonlink]
listen  = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[receiver.http]
protocol = "http"
listen = ":2006"

[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = ":8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "90s"
write-timeout = "90s"
query-cache-enabled = true
query-cache-size-mb = 1024
find-cache-enabled = true
trigram-index = true
scan-frequency = "5m0s"
cache-scan = true
max-globs = 100
fail-on-max-globs = false
graphite-web-10-strict-mode = false
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]

jdblack (Author) commented Sep 28, 2020

I also seem to be having a problem since switching to go-carbon 0.15 and enabling carbonserver and the trie options, where load can temporarily shoot up from around 5 to over 50 for no explicable reason. Could that be the cache-scan option crawling the entire filesystem?

I'm starting to wonder if a lot of our pain and misery is coming from using ZFS for our backing store.

bom-d-van (Member):

A delay of 5 minutes (matching the scan-frequency setting) would be only 5 minutes. In this environment the fetchable-after-create delay is only 7 minutes, but in some of our production environments the delay is as long as 40 minutes!

@jdblack
By your description, it seems to be an issue with those servers? Maybe you can check whether there is anything wrong with them.

It isn't that bad to have 7 mins of delay when using a scan frequency of 5 mins. How many metrics do you have?

You can find some metrics exported by go-carbon on indexing and identify the issues on those servers:

carbon.agents.{host}.metrics_known
carbon.agents.{host}.index_build_time_ns
carbon.agents.{host}.file_scan_time_ns

carbon.agents.{host} is the default prefix. These metrics are also logged in /var/log/carbon/carbon.log; you should be able to grep for them.
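
For example, something along these lines should surface them (the exact log wording may differ between go-carbon versions):

grep -E 'metrics_known|index_build_time|file_scan_time' /var/log/carbon/carbon.log | tail -n 20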

bom-d-van (Member):

I also seem to be having a problem since switching to go-carbon 0.15 and enabling carbonserver and the trie options, where load can temporarily shoot up from around 5 to over 50 for no explicable reason. Could that be the cache-scan option crawling the entire filesystem?

That's new to me. In our case, using the trie index results in lower CPU usage, especially when dealing with a high number of metrics. Both the trigram and trie indexes scan the filesystem.

I'm starting to wonder if a lot of our pain and misery is coming from using ZFS for our backing store.

Our graphite team is using xfs.

jdblack (Author) commented Sep 28, 2020

Yes, a delay of 5-7 minutes would be fine. The environment that we're testing in has a delay of ~ 7-11 minutes. The problem that we're really trying to address is a delay of 40 minutes in our big production environments!

bom-d-van (Member):

Have you tried running a find command in your production environment? How many metrics do you have there? I think you might have an issue at the filesystem or hardware level.

time find /graphite/whisper -type f > /dev/null
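
To answer the "how many metrics" part, a simple count along the same lines (any equivalent find variant works):

find /graphite/whisper -type f -name '*.wsp' | wc -l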

jdblack (Author) commented Sep 28, 2020

Yeah, I suspect those lag spikes are coming from ZFS causing load spikes whenever the filesystem is fully scanned.

One thing I'm unclear about: why do the trie/trigram indexes require scanning the filesystem when it's already known that the .wsp files don't yet exist on disk?

bom-d-van (Member) commented Sep 28, 2020

One thing I'm unclear about: why do the trie/trigram indexes require scanning the filesystem when it's already known that the .wsp files don't yet exist on disk?

That's the way indexing currently works in go-carbon, for both trigram and trie. The program scans the whole data-dir every scan-frequency to figure out which metrics are new and caches them in memory.

bom-d-van (Member):

I would also suggest running sudo perf top to see what the cpu is doing.

jdblack (Author) commented Sep 28, 2020

I'm very sorry, I must be misunderstanding. Our problem is that we're not able to find/render metric series that do not yet exist on disk at all. In other words, there is no whisper file yet at all. How does scanning the filesystem for a file that we know doesn't exist help make them renderable?

bom-d-van (Member) commented Sep 28, 2020

To summarize:
For go-carbon, usually you can only query metrics that have already been created on disk.
But with cache-scan, you can query metrics that haven't been created on disk yet; it only works when the trigram or trie index is enabled.

So to make newly created metrics available faster, you have to either make sure your go-carbon instance flushes the metrics cache to disk faster, or use cache-scan with the trie or trigram index, which requires scanning your filesystem.

How does scanning the filesystem for a file that we know doesn't exist help make them renderable?

cache-scan has a feature of returning metrics that are still in the cache, even though they haven't been created on disk yet.

jdblack (Author) commented Sep 28, 2020

Ahh, but I have found that metric series that do not exist yet (as in no .wsp file yet) are -not- renderable even if cache-scan and trie-index / trigram-index are turned on.

deniszh (Member) commented Oct 25, 2020

@jdblack : v0.15.3 was released, including the concurrent trie index. You can try it and see whether it works for new metrics with

trie-index = true
concurrent-index = true
realtime-index = 100
cache-scan = false

If it works, then we probably need to remove the cache-scan code, because it's not really working.

@bom-d-van : what's a reasonable value for realtime-index? 10, 100, 1000, more?

bom-d-van (Member):

what's a reasonable value for realtime-index? 10, 100, 1000, more?

Hi, I think a few thousand should be good enough, unless the go-carbon instance receives new metrics at a very high rate. The reason that it's a buffer is to be able to deal with bursts of new metrics and have them available.

remove the cache-scan code, because it's not really working.

Oh, for this: realtime-index reuses part of the cache-scan logic, and realtime-index only supports trie-index at the moment. So we might still have to keep it.
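
Putting these comments together, a [carbonserver] sketch for cache-backed visibility of new metrics might look like the following (values are illustrative, not recommendations):

trie-index = true
concurrent-index = true
realtime-index = 1000      # buffer sized for bursts of new metrics, per the comment above
scan-frequency = "5m0s"    # indexing still relies on the periodic data-dir scan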

saipraneethyss (Contributor):

Hi all! Sorry for missing out on the earlier discussion. @bom-d-van thanks for pitching in with your valuable responses.

remove the cache-scan code, because it's not really working.

We have been running go-carbon with the cache-scan code on our production machines for close to 6 months now, where we handle more than 1 million metrics. With this change we were able to bring down our query response time for new metrics from more than 30 minutes to less than 5 minutes (5m0s is our scan-frequency). So far, this has been working fairly well for us.

deniszh (Member) commented Oct 26, 2020

And so far, this has been working for us fairly well.

@saipraneethyss : I can only 🤷 here, because according to @jdblack it is not working. But it is indeed quite hard to reproduce and test.

jdblack (Author) commented Oct 27, 2020

I'll give 0.15.3 a try!

Some quick updates:

  • Switching from ZFS to XFS had -drastic- improvements on timings.
  • We see a sumSeries(perSecond(carbon.relays.*.metricsReceived)) of about 2.3 million in our largest environment. Does this mean our cluster sees 2.3 million data points per second?

Our largest environment has two graphite-cache clusters: a shiny new cluster of 36 instances with XFS, and an old cluster of 36 instances with ZFS that is now read-only while the XFS cluster fills. The new cluster is about 25 days old, while the old cluster goes back about a year.

  • Our new XFS-based cluster has 1.19 million .wsp files on each cache, and it takes 1m54s on average to perform a find /graphite/whisper -type f.
  • The old (now read-only) ZFS cluster has 3.99 million .wsp files in each cache. A find /graphite/whisper -type f on these caches usually takes 7m2s.

We ended up having to disable cache-scan, as walking the entire filesystem every few minutes caused lookups against our cluster to collapse: the recurring find operation just sucked away a big chunk of our IOPS.

I think our problem is that the metrics are sitting in go-carbon's input cache, totally hidden until they are processed by go-carbon.

When our developers scale up new instances, those new instances push out new metrics under a new namespace (servicename.${IP}.rest.of.the.namespace). These brand-new metrics are not renderable until after they have been persisted to disk. This behavior differs from new data points for existing metrics, which become renderable -immediately- from the memory cache, even though the data points are still sitting in memory.

To say this again: we have no problems looking up fresh data for .wsp files that have been around a long time, but we can't seem to look up metrics in carbonserver when the .wsp does not yet exist. It's as if go-carbon requires that the .wsp be created before metrics are renderable from cache.

I suspect that our pain comes from the metrics sitting unprocessed in go-carbon's memory cache. Is it possible that there is a bug in which go-carbon is able to render new data points for long-lived metrics from cache, but data points for entirely new metrics (as in no .wsp previously existed) are not available until after they have been persisted at least once?

Switching from ZFS to XFS gave us a -huge- boost in metric write speeds, allowing us to raise max-updates from 500 to 1500 (with IOPS left over for lookups!). This cut our delay for finding new metrics from 30-40 minutes down to 9 minutes. The lower we get carbon.agents.*.cache.size, the lower the delay before new metrics become renderable, which totally makes sense.

jdblack (Author) commented Oct 29, 2020

I upgraded our dev clusters to v0.15.5 and I'm still seeing the same behavior of new metrics not being renderable. This problem is easy to reproduce: just set max-updates-per-second to a value low enough to push up the points per update. I performed some tests with one of our smaller test clusters, which only receives ~25K metrics/sec.

Adjusting max-updates-per-second had a direct effect on the amount of time it takes for newly created metrics to become renderable. New data points for existing metrics were always immediately available; in other words, we never saw lag in our tests for total relay metrics sent and received or for metrics received per cache. We only saw lag on "Graphite Create Delays", the internal tool that creates a metric named carbon.create_delay.$RANDOM = timestamp every few seconds and measures how long it takes to become renderable.

Timings from our test:

max-updates-per-second    points per update    delay for novel metrics
450                       2                    4.5 min
50                        17                   30-50 min
1000                      1                    1.25 min

[Attachment: carbon.conf.txt]

[Screenshot: Screen Shot 2020-10-29 at 3 36 45 PM]

bom-d-van (Member):

Your scan-frequency is "0", try going with "5m".

jdblack (Author) commented Oct 30, 2020

Are you asking me to set a scan-frequency of "5m" with cache-scan set to disabled?

We had a scan-frequency of 5m0s when cache-scan was turned on (you can see that in my September 28 comment), but it caused our production clusters to fall over: it took 7 minutes to perform a filesystem scan, starving the filesystem of the IOPS needed for reads and writes. There are more details in a previous comment (the one starting with "I'll give 0.15.3 a try!").

bom-d-van (Member):

When scan-frequency is set to 0, none of the indexing works, because indexing requires scanning the data dir; this also means the new trie index features aren't enabled.

I guess your servers are not using SSD disks?

And if your configuration works for you then I guess it's good.

Adjusting max-updates-per-second had a direct effect on the amount of time it takes for newly created metrics to become renderable.

This makes sense, because it makes new metrics flush to disk faster. Without cache-scan or realtime-index, new metrics are only visible once they have been written to disk.

deniszh (Member) commented Oct 30, 2020 via email

jdblack (Author) commented Oct 30, 2020

Oh, I see part of the confusion. I meant to set cache-scan to false as @deniszh instructed, but did not. I'll try turning cache-scan=false and test, then set cache-scan=false and scan-frequency to 5m0s and test again.

I've been consistently confused about one thing: it seems that we are able to get not-yet-persisted data points from go-carbon as long as the .wsp exists. If the whisper file does not exist, then it's as if the metric does not exist at all. It's as if carbonlink (when used with graphite-web) and carbonserver are totally blind to metrics until they have been persisted at least once and the whisper file exists.

To ask the question as simply as possible, is it correct to state that metrics are invisible until the whisper file exists?

deniszh (Member) commented Oct 30, 2020

Ok, now I am confused :)

To ask the question as simply as possible, is it correct to state that metrics are invisible until the whisper file exists?

As far as I know that's normal behaviour of carbon or go-carbon.

Now I have another question: how do cache-scan and realtime-index change that behaviour? @bom-d-van @saipraneethyss ?

bom-d-van (Member):

To ask the question as simply as possible, is it correct to state that metrics are invisible until the whisper file exists?
As far as I know that's normal behaviour of carbon or go-carbon.

Yes, you have to wait until the whisper files are created for the new metrics.

Now I have another question: how do cache-scan and realtime-index change that behaviour?

Yes, with cache-scan or realtime-index, you can fetch new metrics before they are even flushed to disk.

deniszh (Member) commented Oct 31, 2020

Yes, with cache-scan or realtime-index, you can fetch new metrics before they are even flushed to disk.

OK. But as far as I understand the code, both parameters depend on scan-frequency, so theoretically they can help drop new-metric visibility time to somewhere around the scan-frequency, right?

deniszh (Member) commented Oct 31, 2020

@jdblack :

Oh, I see part of the confusion. I meant to set cache-scan to false as @deniszh instructed, but did not. I'll try turning cache-scan=false and test, then set cache-scan=false and scan-frequency to 5m0s and test again.

I said that only to isolate the cases. You can try using both. But please be sure that you enable file scanning too, i.e.

trigram-index = true
trie-index = true
concurrent-index = true
realtime-index = 100
cache-scan = true
scan-frequency = 15m

saipraneethyss (Contributor):

Now I have another quiestion: how cache-scan and realtime-index changes that behaviour?

@deniszh when cache-scan is enabled, once every scan-frequency new metric names from the cache are fetched and inserted into the index. As long as the new metrics are indexed, there is no need to wait for the corresponding .wsp to be created on disk, because carbonserver can query and fetch these metrics from the cache.

But as far as I understand the code, both parameters depend on scan-frequency, so theoretically they can help drop new-metric visibility time to somewhere around the scan-frequency

yes, that is correct.
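
A quick way to check this behaviour directly against carbonserver (bypassing carbonapi) might be something like the following sketch; the host is a placeholder and it assumes the JSON format of the /metrics/find/ endpoint is available:

M="carbon.create_delay.test_$RANDOM"
echo "$M 1 $(date +%s)" > /dev/tcp/127.0.0.1/2003
# before the .wsp exists: with cache-scan the name should appear once the next
# scan (at most scan-frequency later) has indexed it; with realtime-index, sooner
curl -s "http://127.0.0.1:8080/metrics/find/?format=json&query=$M"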

jdblack (Author) commented Nov 2, 2020

I said that only to isolate the cases. You can try using both. But please be sure that you enable file scanning too, i.e.

I'm happy to test that in dev, but we're not really able to use filesystem scanning in prod due to the size of our filesystems. Is there any way to decouple filesystem scanning from memory cache scanning?
