[BUG] carbon-cache always returns 0 data points for cached metric, but writes the metric to disk regularly #940

Open
percygrunwald opened this issue Nov 8, 2022 · 10 comments

@percygrunwald

percygrunwald commented Nov 8, 2022

Describe the bug
We have a graphite cluster with approximately 130 million metrics spread across 3 storage nodes running graphite-web and 16 instances of carbon-cache. For a small proportion of our metrics (I'm unsure about the exact percentage, but from my experience it seems like <1%), carbon-cache always returns 0 data points. We know for a fact that carbon-cache is receiving the metrics in question though, because the whisper file is regularly updated. The way this problem manifests is that for certain metrics, the latest data point can be quite far away from "now", periodically "catching up" when carbon-cache flushes its data points to the whisper file.

I have verified with tcpdump that carbon-c-relay is sending the metrics to the same carbon-cache instance that graphite-web tries to query for the cached data points.

I suspect this has something to do with hashing, since the same metric name exhibits the same problem in both clusters that receive the metric independently, while metrics whose names differ by one character (e.g. the same metric from host124 instead of host123) do not. We have at least 3 instances of the same metric on different hosts exhibiting the problem (out of around 500 hosts reporting the metric), which gives me further cause to believe the issue lies with carbon itself rather than with something external.
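For context, here is a minimal sketch of the kind of md5-based consistent-hash routing involved (simplified, not the exact carbon / carbon-c-relay / graphite-web code; the instance labels are hypothetical). If the relay and graphite-web build rings that map a given metric name to different instances, that metric's writes and cache-queries land on different carbon-cache processes, which would produce exactly this symptom:

import bisect
from hashlib import md5

def ring_position(key):
    # First 2 bytes of the md5 digest as an integer in [0, 65535].
    return int(md5(key.encode('utf-8')).hexdigest()[:4], 16)

def build_ring(instances, replicas=100):
    ring = []
    for instance in instances:
        for i in range(replicas):
            bisect.insort(ring, (ring_position('%s:%d' % (instance, i)), instance))
    return ring

def pick_instance(ring, metric):
    # Walk clockwise to the first ring entry at or after the metric's position.
    index = bisect.bisect_left(ring, (ring_position(metric),)) % len(ring)
    return ring[index][1]

# Hypothetical: 16 instances labelled '0'..'15', as in CARBONLINK_HOSTS further down.
ring = build_ring([str(i) for i in range(16)])
print(pick_instance(ring, 'statsd.host123-primary.some_metric.count'))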

The logs we're seeing:

76502:2022-11-08,02:00:11.291 :: CarbonLink sending request for statsd.host123-primary.some_metric.count to ('127.0.0.1', '9')
76503:2022-11-08,02:00:11.292 :: CarbonLink finished receiving statsd.host123-primary.some_metric.count from ('127.0.0.1', '9')
76504:2022-11-08,02:00:11.292 :: CarbonLink cache-query request for statsd.host123-primary.some_metric.count returned 0 datapoints

Metric exhibiting problem:
(screenshot)

Metric of same name from other host (path differing by one character):
(screenshot)

Note that I have changed the exact metric name so as not to expose it publicly, but I'm happy to provide the real metric name privately for debugging (in case the issue is related to the exact metric name).

To Reproduce
We can reproduce the behavior by querying an affected metric, e.g. statsd.host123-primary.some_metric.count. The metric returns data, but lags far behind "now", and the graphite-web cache logs show that the cache query for it always returns 0 data points. Querying the same metric where the path differs by one character, e.g. statsd.host124-primary.some_metric.count or statsd.host125-primary.some_metric.count, does not exhibit the problem.
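One way to see the lag from the command line is to read the whisper file directly and compare it with what graphite-web's /render endpoint returns for the same window. A rough sketch using the whisper Python module (the path below is an example and would need to match LOCAL_DATA_DIR):

import time
import whisper  # the same module carbon uses to write the .wsp files

path = '/mnt/ssd/graphite/storage/whisper/statsd/host123-primary/some_metric/count.wsp'
(start, end, step), values = whisper.fetch(path, int(time.time()) - 600)
# Print the last few (timestamp, value) pairs actually on disk.
print(list(zip(range(start, end, step), values))[-5:])

The points still held only in carbon-cache's memory will not be in the file yet; those are exactly what the CarbonLink cache-query is supposed to return on top of what is on disk.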

Expected behavior
carbon-cache should return data points for the metrics in question, given that it appears to have the data in memory (evidenced by the fact that it's writing data to disk at regular intervals).

Environment (please complete the following information):

  • OS flavor: Ubuntu Focal
  • Graphite version: Tested on 1.1.5 and 1.1.10 (both exhibit the same problem)
  • Setup type: pip/git
  • Python version: 2.7.17
@deniszh
Member

deniszh commented Nov 16, 2022

Hi @percygrunwald

Looks like a bug indeed, but I'm not sure how it can be investigated externally. :(
Could you please share your configs to start with (for carbon and graphite-web)?

@percygrunwald
Author

Thanks for your reply; I will provide both configurations.

A representative carbon-cache carbon.conf:

...
[cache:9]
STORAGE_DIR = /mnt/graphite/storage/
CONF_DIR = /opt/graphite/conf/
LOG_DIR = /var/log/carbon/
LOCAL_DATA_DIR = /mnt/ssd/graphite/storage/whisper/
USER = graphite
MAX_CACHE_SIZE = 200000000
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = inf
LINE_RECEIVER_INTERFACE = 127.0.0.1
LINE_RECEIVER_PORT = 2436
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 127.0.0.1
UDP_RECEIVER_PORT = 2437
PICKLE_RECEIVER_INTERFACE = 127.0.0.1
PICKLE_RECEIVER_PORT = 2438
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 127.0.0.1
CACHE_QUERY_PORT = 2439
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = False
ENABLE_TAGS = False
...

graphite-web config:

SECRET_KEY = '...'
TIME_ZONE = 'America/Los_Angeles'
LOG_ROTATION = True
LOG_RENDERING_PERFORMANCE = True
LOG_CACHE_PERFORMANCE = True
import os
_pid = os.getpid()
LOG_FILE_INFO = "info.%d.log" % _pid
LOG_FILE_EXCEPTION = "exception.%d.log" % _pid
LOG_FILE_CACHE = "cache.%d.log" % _pid
LOG_FILE_RENDERING = "rendering.%d.log" % _pid
DEFAULT_XFILES_FACTOR = 0
FIND_TIMEOUT = 60.0  # Timeout for metric find requests
FETCH_TIMEOUT = 60.0  # Timeout to fetch series data
CONF_DIR = '/opt/graphite/conf'
STORAGE_DIR = '/mnt/ssd/graphite/storage'
STATIC_ROOT = '/opt/graphite/static'
LOG_DIR = '/opt/graphite/storage/log/webapp'
INDEX_FILE = '/opt/graphite/storage/index'     # Search index file
WHISPER_DIR = '/mnt/ssd/graphite/storage/whisper'
DATABASES = {
    'default': {
        'NAME': '/opt/graphite/storage/graphite.db',
        'ENGINE': 'django.db.backends.sqlite3',
        'USER': '',
        'PASSWORD': '',
        'HOST': '',
        'PORT': ''
    }
}
INTRACLUSTER_HTTPS = True
CARBONLINK_HOSTS = [
    '127.0.0.1:2403:0',
    '127.0.0.1:2407:1',
    '127.0.0.1:2411:2',
    '127.0.0.1:2415:3',
    '127.0.0.1:2419:4',
    '127.0.0.1:2423:5',
    '127.0.0.1:2427:6',
    '127.0.0.1:2431:7',
    '127.0.0.1:2435:8',
    '127.0.0.1:2439:9',
    '127.0.0.1:2443:10',
    '127.0.0.1:2447:11',
    '127.0.0.1:2451:12',
    '127.0.0.1:2455:13',
    '127.0.0.1:2459:14',
    '127.0.0.1:2463:15',
]
CARBONLINK_TIMEOUT = 60.0
FUNCTION_PLUGINS = []

I understand this would be hard to diagnose remotely. I'm happy to run any tests that would help. Thank you.

@deniszh
Member

deniszh commented Nov 17, 2022

@percygrunwald : yes, the config looks perfectly fine indeed. I will think about how best to debug this. Maybe we have some hashing issue and graphite-web is querying a different carbonlink instance instead of the proper one for some metrics...
BTW, instead of multiple carbon caches you could consider using go-carbon, but please note that it has no aggregator.

@percygrunwald
Author

@deniszh thank you for your feedback.

Maybe we have some hashing issue and graphite-web is querying a different carbonlink instance instead of the proper one for some metrics...

In my testing, tcpdump output seems to show that graphite-web is reading from the same carbon-cache instance that carbon-c-relay is writing to.

15:23:29.925423 IP localhost.43704 > localhost.2439: Flags [P.], seq 270706700:270706830, ack 2015752604, win 64, options [nop,nop,TS val 1165724566 ecr 1165596638], length 130
E...x.@.@.............  .."..x%.....@.......
E{..Ey.....~..}q.(U.metricq.XN...statsd.host123-primary.some_metric.countq.U.typeq.U.cache-queryq.u.

This is only the write side, but it's the correct port for carbon-cache:9. On the read side, it appears as though graphite-web is querying carbon-cache:9 as well:

76502:2022-11-08,02:00:11.291 :: CarbonLink sending request for statsd.host123-primary.some_metric.count to ('127.0.0.1', '9')
76503:2022-11-08,02:00:11.292 :: CarbonLink finished receiving statsd.host123-primary.some_metric.count from ('127.0.0.1', '9')
76504:2022-11-08,02:00:11.292 :: CarbonLink cache-query request for statsd.host123-primary.some_metric.count returned 0 datapoints

Unless the log is wrong, it appears as though the write and read are happening to the same carbon-cache instance (9).

BTW, instead of multiple carbon caches you can consider to use go-carbon instead

Thank you for the suggestion. We make use of go-carbon in our new architecture and it has worked very well as a carbon replacement.

@stale

stale bot commented Jan 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 17, 2023
@stale stale bot closed this as completed Mar 18, 2023
@deniszh deniszh reopened this Mar 18, 2023
@deniszh
Member

deniszh commented Mar 18, 2023

Looks like the bug really exists, but I'm still very confused by it.

@stale stale bot removed the stale label Mar 18, 2023
@deniszh deniszh added the pinned label Mar 18, 2023
@itinneed2022

Hello @deniszh, I think this is the same issue I'm experiencing on my setup. Is there any way to check the carbon daemons directly to see what data they're currently holding?

@deniszh
Member

deniszh commented Jul 19, 2023 via email

@itinneed2022

@deniszh I'm seeing the same error @percygrunwald is seeing.

2023-07-19,14:45:16.451 :: CarbonLink finished receiving test.density-full.47_Clare.47_C.fullness from ('127.0.0.1', 'b')
2023-07-19,14:45:16.451 :: Carbonlink results: {'datapoints': []}

I created a question on graphite-web last year.
In my cache logs I've noticed some metrics do return datapoints, others do not.

Is there any way to query the carbon-cache daemons directly and see if the metrics exist in their caches?
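For what it's worth, the cache-query protocol is simple enough to poke at by hand: a 4-byte big-endian length prefix followed by a pickled {'type': 'cache-query', 'metric': ...} dict, with the reply framed the same way (that is what the len_prefix/body values in the debug output further down correspond to). A minimal sketch of such a client; the host, port and metric are example values (7002 matches the raddr of instance 'a' in that debug output):

import pickle
import socket
import struct

def cache_query(host, port, metric):
    # Length-prefixed pickle request, mirroring what CarbonLink sends.
    payload = pickle.dumps({'type': 'cache-query', 'metric': metric}, protocol=2)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(struct.pack('!L', len(payload)) + payload)
        size = struct.unpack('!L', sock.recv(4))[0]
        body = b''
        while len(body) < size:
            chunk = sock.recv(size - len(body))
            if not chunk:
                raise ConnectionError('connection closed before full response')
            body += chunk
    return pickle.loads(body)

# Example usage (port is an assumption; adjust to your CACHE_QUERY_PORT):
print(cache_query('127.0.0.1', 7002, 'CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle'))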

@itinneed2022

I added some lines to my carbonlink.py file to try and narrow down the issue. Here's a snippet:

2023-07-20,13:46:14.823 :: Conn is: <socket.socket fd=30, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 54484), raddr=('127.0.0.1', 7002)>
2023-07-20,13:46:14.823 :: Metric requested is: test.density-full.Broad_Flat.Broad_Flat_14.client_count
2023-07-20,13:46:14.823 :: CarbonLink sending request for test.density-full.Broad_Flat.Broad_Flat_14.client_count to ('127.0.0.1', 'a')
2023-07-20,13:46:14.823 :: len_prefix in recv_response is: b'\x00\x00\x00-'
2023-07-20,13:46:14.823 :: body size in recv_response is: 45
2023-07-20,13:46:14.824 :: body in recv_response is: b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02Jtq\xb9dG\x00\x00\x00\x00\x00\x00\x00\x00\x86q\x03as.'
2023-07-20,13:46:14.824 :: len_prefix in recv_response is: b'\x00\x00\x00-'
2023-07-20,13:46:14.824 :: body size in recv_response is: 45
2023-07-20,13:46:14.824 :: body in recv_response is: b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02Jtq\xb9dG\x00\x00\x00\x00\x00\x00\x00\x00\x86q\x03as.'
2023-07-20,13:46:14.824 :: Made it inside the try, result is: {'datapoints': [(1689874804, 0.0)]}
2023-07-20,13:46:14.824 :: result2 is: {'datapoints': [(1689874804, 0.0)]}
2023-07-20,13:46:14.824 :: CarbonLink finished receiving test.density-full.Broad_Flat.Broad_Flat_14.client_count from ('127.0.0.1', 'a')
2023-07-20,13:46:14.824 :: Carbonlink results: {'datapoints': [(1689874804, 0.0)]}
2023-07-20,13:46:14.824 :: CarbonLink cache-query request for test.density-full.Broad_Flat.Broad_Flat_14.client_count returned 1 datapoints
2023-07-20,13:47:43.056 :: Request-Cache miss [5df10b47fc6a60a71d25ff3c8bae38b4]
2023-07-20,13:47:43.056 :: Data-Cache miss [d15e419cbcaa3638f54205c85c887466]
2023-07-20,13:47:49.075 :: Request-Cache miss [6fa36018ab95e559b1c4deb71d1f6dae]
2023-07-20,13:47:49.076 :: Data-Cache miss [f523d7ea713e59b4d97153c6f1a6659e]
2023-07-20,13:47:49.094 :: Host is: ('127.0.0.1', 'a')
2023-07-20,13:47:49.094 :: CarbonLink creating a new socket for ('127.0.0.1', 'a')
2023-07-20,13:47:49.095 :: Carbonlink connection successful
2023-07-20,13:47:49.095 :: Conn is: <socket.socket fd=36, family=AddressFamily.AF_INET, type=2049, proto=6, laddr=('127.0.0.1', 55548), raddr=('127.0.0.1', 7002)>
2023-07-20,13:47:49.095 :: Metric requested is: CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle
2023-07-20,13:47:49.095 :: CarbonLink sending request for CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle to ('127.0.0.1', 'a')
2023-07-20,13:47:49.097 :: len_prefix in recv_response is: b'\x00\x00\x00\x1b'
2023-07-20,13:47:49.097 :: body size in recv_response is: 27
2023-07-20,13:47:49.098 :: body in recv_response is: b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02s.'
2023-07-20,13:47:49.098 :: len_prefix in recv_response is: b'\x00\x00\x00\x1b'
2023-07-20,13:47:49.098 :: body size in recv_response is: 27
2023-07-20,13:47:49.098 :: body in recv_response is: b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02s.'
2023-07-20,13:47:49.098 :: Made it inside the try, result is: {'datapoints': []}
2023-07-20,13:47:49.098 :: result2 is: {'datapoints': []}
2023-07-20,13:47:49.098 :: CarbonLink finished receiving CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle from ('127.0.0.1', 'a')
2023-07-20,13:47:49.098 :: Carbonlink results: {'datapoints': []}
2023-07-20,13:47:49.098 :: CarbonLink cache-query request for CI_unixsys.aaprddb02_company_com.cpu-0.percent-idle returned 0 datapoints
2023-07-20,13:47:49.969 :: Request-Cache miss [5df10b47fc6a60a71d25ff3c8bae38b4]
2023-07-20,13:47:49.969 :: Data-Cache miss [d15e419cbcaa3638f54205c85c887466]
2023-07-20,13:47:50.412 :: Request-Cache miss [d6f5aae245b305a127886bf5de73f3ce]
2023-07-20,13:47:50.412 :: Data-Cache miss [fe44d544439db17608cd6d177e3bbf8e]

Something I'm consistently noticing in the logs is that any datapoint that returns empty has a byte string that ends in:
datapointsq\x01]q\x02s.

Whereas successfully retrieved responses always look like:
00datapointsq\x01]q\x02Jtq\xb9dG\x00\x00\x00\x00\x00\x00\x00\x00\x86q\x03as.
I don't know how to decode these byte strings, but I'm wondering whether they are properly formed.
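Those byte strings are just the pickled response dicts (the same length-prefixed pickle framing CarbonLink uses), and they decode cleanly, so they are well formed; the short one simply carries an empty datapoints list. A quick check with the two bodies quoted above:

import pickle

empty = b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02s.'
full = (b'\x80\x02}q\x00X\n\x00\x00\x00datapointsq\x01]q\x02Jtq\xb9dG'
        b'\x00\x00\x00\x00\x00\x00\x00\x00\x86q\x03as.')

print(pickle.loads(empty))  # {'datapoints': []}
print(pickle.loads(full))   # {'datapoints': [(1689874804, 0.0)]}

In other words, the responses are not malformed; the cache is simply reporting an empty list for those metrics.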
