
Bulk-fetch metrics from remote hosts breaks some queries #1022

Closed
lamont opened this issue Nov 5, 2014 · 50 comments


lamont commented Nov 5, 2014

I was excited about the commit from pull request #1010 and rolled it out today, with bad results. Specifically, queries using divideSeries(sum(),sum()) were failing with the error "divideSeries second argument must reference exactly 1 series".

My clustered setup involves consistent-hash with a replication factor of 2, and I wonder if the bulk-fetch was tested with replicated data.

Additionally, simple queries like the following were returning 200 but an empty body:

target=alias(scale(sumSeries(collectd.app112.ShipmentDownloadTimer.counter-TotalHits%2Ccollectd.app111.ShipmentDownloadTimer.counter-TotalHits)%2C60),%20'ShipmentsDownloaded')&from=-3600s&rawData=true
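For readability, the URL-decoded form of that target (a mechanical decode of %2C and %20, nothing else changed) is:

    alias(scale(sumSeries(collectd.app112.ShipmentDownloadTimer.counter-TotalHits,collectd.app111.ShipmentDownloadTimer.counter-TotalHits),60), 'ShipmentsDownloaded')

with from=-3600s and rawData=true.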

Rolling back and forward across the 13-node cluster is pretty easy, so if someone has some specific debugging they'd like me to look for, I'm happy to try things.

Thank you.


lamont commented Nov 5, 2014

Oh sorry, I should have mentioned, that was the latest version of the 0.9.x branch.


deniszh commented Nov 5, 2014

@lamont, a pity to hear that. Unfortunately I have REPLICATION_FACTOR=1 and can't test it - but maybe @bmhatfield has a replicated setup to test with?


deniszh commented Nov 5, 2014

@bmhatfield - or maybe we need to make the web bulk-fetch configurable, like QUERY_BULK, with default = false?
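Purely to illustrate the idea (the setting name below is hypothetical and does not exist in graphite-web), such a toggle could look like this in local_settings.py:

    # local_settings.py -- hypothetical flag, name invented for illustration only.
    # When False, fall back to the pre-#1010 remote fetch behaviour.
    REMOTE_BULK_FETCH = False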

bmhatfield (Member) commented:

Oh, sad!

I have REPLICATION_FACTOR=2, and am not having this problem. I wrote some code that attempts to prevent duplicate series from sneaking through, but perhaps I failed to consider a use case for it. Let me check the divideSeries graphs we have, and see if I can reproduce the problem.

obfuscurity added this to the 0.9.13 milestone Nov 5, 2014

lamont commented Nov 5, 2014

Thanks! I found the performance to be out-of-this-world awesome for our large wildcard queries, so I'm really driven to help you debug this. The divideSeries error was the only one I could get a specific error message out of. The other issue I saw in the 40 minutes this was live was rawData=true returning an empty body with an HTTP 200 status, and I'm not sure how to debug that. (I saw it on the production cluster but could not replicate it on the development cluster.)

bmhatfield (Member) commented:

Hrm, I'm not seeing this kind of error with the wildcarded series we currently dashboard.

For a series that looks like this:
divideSeries(sum(stats_counts.worker.*.REDACTED.*),sum(stats_counts.worker.*.REDACTED)), I indeed get a graph with data in the form that I would expect.

I also tried a query with rawData=true that pulls in a variety of wildcarded metrics and that also worked.

While I can't reproduce this yet, I do believe you're having a problem. I'm on Python 2.7.3 - what kind of other context can you share that might help me tickle this issue?

bmhatfield (Member) commented:

@lamont I am running a version without this commit:

fe6cda0

I don't see how it could be related yet, but I am just noting it for posterity.


lamont commented Nov 5, 2014

Thanks for looking at this. Here are some details about my implementation:

3 frontends (graphite-web, carbon-relay with consistent hashing, replication factor 2)
10 backends (also graphite-web, carbon-relay, carbon-cache, no replication)

The frontends only know about the backend graphite-web instances, nothing about their carbon-caches. It's more or less the setup recommended by http://grey-boundary.com/the-architecture-of-clustering-graphite/

Boxes are all Red Hat 6.4

Python 2.6.6,
$ pip freeze
Cheetah==2.4.1
Django==1.4.15
M2Crypto==0.20.2
Markdown==2.0.1
MySQL-python==1.2.3c1
PyYAML==3.10
Pygments==1.1.1
SSSDConfig==1.9.2
Twisted==13.1.0
argparse==1.2.1
backports.ssl-match-hostname==3.4.0.2
boto==2.27.0
chardet==2.0.1
cloud-init==0.7.2
configobj==4.6.0
django-tagging==0.3.2
ethtool==0.6
iniparse==0.3.1
ordereddict==1.1
policycoreutils-default-encoding==0.1
prettytable==0.7.2
py-rrdtool==0.2.1
pyOpenSSL==0.10
pycurl==7.19.0
pygpgme==0.1
pyparsing==2.0.2
python-dateutil==1.4.1
python-dmidecode==3.10.13
python-memcached==1.53
pytz==2014.4
requests==1.1.0
rhnlib==2.5.22
rhsm==1.1.8
setools==1.0
simplejson==2.0.9
six==1.6.1
txAMQP==0.6.2
uWSGI==2.0.6
urlgrabber==3.9.1
urllib3==1.5
virtualenv==1.11.6
whisper==0.9.12
yum-metadata-parser==1.1.2
zope.interface==4.1.1

I'm using the latest 0.9.x of whisper, carbon and graphite-web, specifically:

whisper-0.9.x-4404f173c0
carbon-0.9.x-36177586ff
graphite-web-0.9.x-ce239313b


lamont commented Nov 5, 2014

@bmhatfield I can try without that commit, thanks for pointing that out.

bmhatfield (Member) commented:

My architecture is a little simpler: 4 Graphite nodes, all behaving the same, with a 2-tier carbon-relay and a single tier of carbon-cache. The webapps are all configured to talk to each other (but not themselves).

Actually, configuring a webapp to talk to itself for this might be a little interesting. Can you confirm that the CLUSTER_SERVERS in each node's graphite-web local_settings.py only lists the appropriate non-self hosts in the cluster?
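For reference, here's roughly what that looks like on one of the frontends. Hostnames are placeholders, but CLUSTER_SERVERS itself is the standard graphite-web setting (a list of host:port strings), and the key point is that a node never lists itself:

    # local_settings.py on frontend app001 (hostnames are placeholders)
    # List every other graphite-web this node should fan out to, never itself,
    # so the same series isn't fetched both locally and remotely.
    CLUSTER_SERVERS = [
        "backend01:8080",
        "backend02:8080",
        "backend03:8080",
        # ...remaining backends
    ]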

bmhatfield (Member) commented:

Actually - that could totally be the problem, and would behave differently with 1010 than without it. In not-1010, the same loop covers both local and remote hosts. In 1010, if you loaded a metric using the LOCAL_STORE and then got it again via the REMOTE_STORE, that could cause it to appear in the series list twice.

bmhatfield (Member) commented:

Hrm, on second read, maybe not. Worth a double-check anyways, though.


lamont commented Nov 6, 2014

I just confirmed that the frontend machines (A, B, C) only list the backend boxes in CLUSTER_SERVERS. The order of listed backend boxes varies per frontend box but there is no duplication in each list. Also, the frontend boxes have no local metrics that match the pattern of metrics on the backend. (backend has collectd.* in whisper, frontend has some nfs mounted RRD files under cacti.*)

None of the backend nodes have CLUSTER_SERVERS set as they're meant to only answer questions about their local metrics.


lamont commented Nov 6, 2014

I'm rolling out again with 0.9.x HEAD minus the fe6cda0 commit.

bmhatfield (Member) commented:

Good luck.. I don't really believe that commit is the issue, but since you said it was easy to try... 👍

I haven't come up with anything else interesting on my side yet :-(

bmhatfield (Member) commented:

PS: If it does have issues, any hacked-in logging or sample data or anything like that will be super valuable. You can email me sample data directly (my-gh-username at gmail) if you'd prefer not to have it public (though I prefer to keep the actual discussion on the issue, since this is now marked as a blocker for 0.9.13 :-( ).


lamont commented Nov 6, 2014

Ok, the problem is at least repeatable.

http://varnish-graphite/render/?target=divideSeries(sumSeries(grouper.APP.FOO1.*.bar-baz?TcSrv-TotalDuration),sumSeries(grouper.APP.FOO1.*.baz?TcSrv-TotalHits))&from=-30mins&rawData=true

That errors with the divideSeries error:

File "/opt/graphite/webapp/graphite/render/functions.py", line 501, in divideSeries
raise ValueError("divideSeries second argument must reference exactly 1 series")
ValueError: divideSeries second argument must reference exactly 1 series

bmhatfield (Member) commented:

Can you hack in some logging as to what series are being referenced?

I.e., is it 0, 2, or something else?

Also, just to confirm: the same exact query works fine on not-1010, correct?


lamont commented Nov 6, 2014

Ok, that's helpful.
    if len(divisorSeries) != 1:
      print "saw series too many", len(divisorSeries)
      raise ValueError("divideSeries second argument must reference exactly 1 series")

2014-11-06_01:33:16.24137 saw series too many 0

So now the question is why the pre-1010 code came up with some divisor series and the post-1010 code has none.

bmhatfield (Member) commented:

Ah, this is indeed very interesting information. I was guessing 2+, so 0 is useful.

For the pre-1010 code, does it error in a different way, or does it actually have a series present?

If it does have a series present, do you have any indication if it's coming from the local node, a remote node, or both?

If it's a remote node, do you know if timeouts are in play at all?

Also, do you know if you accidentally have files that would satisfy the query for the divisor metric (basically all Nones) on more hosts than you expect?

I think I can see a few ways that Line 306 here could cause us to break out of this loop at a surprising time.
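For context, the loop in question is the post-1010 de-duplication pass: for each series name fetched from the local and remote stores, keep the copy with the fewest None datapoints. The snippet below is a minimal self-contained sketch of that idea, not the actual datalib.py code (the real version works on TimeSeries objects and breaks out of an inner comparison loop, which is where the surprising early exit mentioned above can happen):

    # Sketch only: keep the least-None copy of each duplicated series name.
    # Series are represented here as plain (name, values) pairs.
    def dedup_series(fetched):
        def nones(values):
            return sum(1 for v in values if v is None)
        best = {}
        for name, values in fetched:
            if name not in best or nones(values) < nones(best[name]):
                best[name] = values
        return best

    # Example: two replicas of the same metric; the more complete copy wins.
    fetched = [
        ('collectd.app142.STUFF.counter-TotalDuration', [1, None, 3]),
        ('collectd.app142.STUFF.counter-TotalDuration', [1, 2, 3]),
    ]
    print dedup_series(fetched)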


lamont commented Nov 6, 2014

pre-1010, it does not error and has a series present. Some of the data in question is spotty (ie, it is possible that there are holes in the data for a given timerange) but a spot check showed that both copies of the data are consistent. The exact query I mailed you has a numerator of 3 unique series (6 total including duplicates) and a divisor of 3 unique series (again, 6 total copies).

All series in question are coming from a remote node, the backend nodes. It's possible timeouts are in play, but the load on the backend seems very steady and under the 1010 code the "busyness" of the uwsgi is less than a quarter of normal so I'm not inclined to believe that.

I just walked the cluster to verify that there are no metrics which are out of place, ie, no local files on the frontend boxes and only 2 copies of each on the backend boxes.

I've been rolling the code out to both the frontend and backend boxes, but if it was possible to just change the code on the frontend boxes I could take one frontend node out of rotation and compare things head to head forever. Does the frontend 1010 need something matching on the backend?

bmhatfield (Member) commented:

I haven't tested it, but 1010 on a frontend to a non-1010 backend should be fine.


lamont commented Nov 6, 2014

Ok, let me try that and get some numbers. Just a count of various series against both servers should do.

bmhatfield (Member) commented:

Thanks for all your help digging into this - like you, I found it to dramatically help performance, so I hope we can make it stable so everyone can get the benefit :-)


lamont commented Nov 6, 2014

No, thank YOU for digging into it. This feature would be fantastic and I'd love to help get it working.

So, great success with narrowing down the problem. I set up app001 with the 1010 code and app002 with the "stable" 0.9.x branch from Nov 1. They are both querying the same set of 10 backend hosts (running the same stable Nov 1 0.9.x). I had to remember to disable memcached as well.

$ curl 'http://app001:8080/render?target=countSeries(collectd.px*.STUFF.counter-TotalDuration)&rawData=true'
1

$ curl 'http://app002:8080/render?target=countSeries(collectd.px*.STUFF.counter-TotalDuration)&rawData=true'
3

Actually walking the filesystems did show me the 3 unique series. I'll try to add some debugging statements to see what the frontend collected and then possibly discarded. Python is not a language I'm any good with, so if you have specific debug suggestions, that would help a great deal.

bmhatfield (Member) commented:

I made a quick branch with some debug statements that should end up in the webapp's info.log:

bmhatfield/graphite-web@graphite-project:0.9.x...09x-debug


lamont commented Nov 6, 2014

I added some debug statements and think I know what is happening. I added:

      for series in results:
        print "considering ", series['name']

and

            if candidate_nones >= known_nones:
              # If we already have this series in the seriesList, and
              # it is 'worse' than the other series, we don't need to
              # compare anything else. Save ourselves some work here.
              print "break cause candidate_nones >= known_nones"
              break

and

        if not series_handled:
          print "appending"
          seriesList.append(ts)

And saw the following:

2014-11-06_04:56:37.71547 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.71552 appending
2014-11-06_04:56:37.72104 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.72107 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.72662 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.72670 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.73190 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.73214 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.73887 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.73920 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.74432 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.74462 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.74963 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.74996 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.75516 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.75548 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.76087 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.76119 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.76653 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.76699 break cause candidate_nones >= known_nones

This is interesting for a few reasons. app142 came up 6 times, whereas it should have come up twice (the metric and its replica). There are indeed 6 copies of the matching metrics in total, but the other 4 should be the 2 copies each of the app141 and app143 metrics.

I'll keep poking at it (my Python is terrible). I see you just posted a debug branch, so let me try that and get the results back to you.


lamont commented Nov 6, 2014

Also, the app142 machine's metric is on the first node in the CLUSTER_SERVERS list.

obfuscurity (Member) commented:

I owe you both 🍻 the next time we meet in person. 💖

bmhatfield (Member) commented:

😄 Hopefully we can actually get this figured out...

@lamont - for the log above, is STUFF representing the same redacted string for each, or different redacted strings? If different, when you post the log for the branch I sent you, can you tokenize a bit so I can get a better sense (i.e. STUFFA, STUFFB, etc.)?


lamont commented Nov 6, 2014

STUFF is a static string, unchanged across all metrics. I'm mailing you the log from the debug branch with the unredacted metrics now.

bmhatfield (Member) commented:

Hrm,

The logs you sent me are fairly unexpected: they seem to say that each of your 10 backend servers is returning the same 1 series for your query. That doesn't seem likely based on our conversation above, unless they were "talking" to each other in order to return a result, which is something you mentioned they shouldn't be configured to do.

I pushed up another log message (into remote_storage.py); can you see if bmhatfield/graphite-web@graphite-project:0.9.x...09x-debug now provides more info?

Also, it's late on the East Coast, so I am going to go to bed, but I will run this debug logging myself and see if I learn anything from it tomorrow.


lamont commented Nov 6, 2014

I'm going to bed as well, but I just mailed you the updated debug log. It just occurred to me that while there is no CLUSTER_SERVERS communication between backend nodes, they all do share a memcached, which might be what's causing this. I bet they're each serving up the cached result of the "fetch local target wildcard" request that went to the first node.

I'll disable memcache on the backend machines (it was of dubious value anyway) and see if that was it.


lamont commented Nov 6, 2014

So yeah, that was totally my fault. Disabling memcache on the backend seems to have fixed the issue for me. I'll do a fuller test tomorrow and let you know.

My apologies for muddying the water, your stuff rocks.

obfuscurity (Member) commented:

Still, it's good to recognize these use cases. There seems to be some confusion about the best way to incorporate memcached among Graphite clusters.

It appears that the "best practice" in multi-node setups is to use a shared memcached among workers on the same frontend graphite-web, but not at all on backend nodes.
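Concretely, and assuming the standard MEMCACHE_HOSTS setting in local_settings.py (hostnames below are placeholders), that guideline would look something like:

    # local_settings.py on a frontend: a memcached shared by the workers on
    # this tier (here, one running locally on each frontend).
    MEMCACHE_HOSTS = ['127.0.0.1:11211']

    # local_settings.py on a backend: no memcached at all, so every request
    # is answered directly from the local whisper files.
    MEMCACHE_HOSTS = []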


deniszh commented Nov 6, 2014

Cool, guys, that was a really impressive debugging session!
Btw, @lamont, maybe it's a good idea to close the issue then?


lamont commented Nov 6, 2014

@deniszh I'm doing a full acceptance test (with humans) at noon today and I'll close the issue after that.

bmhatfield (Member) commented:

I'm not sure we should close this quite yet; I'd like to understand a little better why 1010's behavior is different in this case. My working assumption is that while it's inefficient to return the same metric from each of 10 servers, it should still work roughly the same as pre-1010.

bmhatfield (Member) commented:

@lamont How many distinct series do you see in pre-1010 for the query that's reflected in the updated logs you sent me?

Thu Nov 06 01:12:57 2014 :: Fetched 'local=1&target=YOURWILDMETRICEXPR&format=pickle&from=1415167967&until=1415254367&now=1415254367' from ANIPADDR:8080, received 1 series
Thu Nov 06 01:12:57 2014 :: List contains: '['ASINGLEMETRICNAME']'


lamont commented Nov 6, 2014

@bmhatfield 0 or 1 per box, for a total of 3. When I disable memcached on the backend boxes I also get the correct results.
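For anyone reproducing this, a rough way to check those per-box counts directly is to query each backend on its own and count series in the rawData output (hostnames are placeholders; rawData prints one line per series, each containing a '|'):

    for host in backend01 backend02 backend03; do
      echo -n "$host: "
      curl -s "http://$host:8080/render?target=collectd.px*.STUFF.counter-TotalDuration&rawData=true" | grep -c '|'
    done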


SEJeff commented Nov 6, 2014

Perhaps you could send us a PR to update the docs with this as a gotcha?

bmhatfield (Member) commented:

The ideal case would be that this gets improved to not be a gotcha :-)

bmhatfield (Member) commented:

@lamont Any chance you could hack in a log of which series are returned by which backend hosts, like what I did for remote_storage.py, but in non-1010, with memcached enabled? I have a guess as to what the problem is with 1010, and I think it's fixable in 1010, but I need a little more data :-)

bmhatfield (Member) commented:

@SEJeff is there a particular doc that you'd recommend we edit in any case? Happy to help but the request feels a bit vague at the moment :-)


lamont commented Nov 6, 2014

Ok, I rolled the 1010 software to all frontend boxes and it is now "live" on the in-use cluster, which gets hammered by queries. So far, I'm seeing all queries returning proper counts. Each of the backend queries takes 2.5-5 seconds, which, when serialized across 10 backends, results in a total query delay of about 30s, but that's better than before.

@bmhatfield I'll take one of the machines out and roll the old software back to it and hack in the remote_storage logging to show you the results.


deniszh commented Dec 21, 2014

@lamont, any progress with testing? I'm still wondering why the shared memcache was killing your queries...
Did you try a local memcache (i.e. on localhost) instead of a shared one?

obfuscurity (Member) commented:

I'm pretty sure this is resolved. I think any confusion stems from a lack of proper guidelines around using memcached with multiple nodes.

obfuscurity (Member) commented:

@lamont @bmhatfield You guys have any time this week for any final tests so we can mark this closed? This is effectively the last issue blocking 0.9.13.

bmhatfield (Member) commented:

I think we're good here. Per the earlier discussion, this works for @lamont without the shared memcached setup, and I have been using this in production for some time and am happy and confident in its behavior.

We can close this issue, but I'd like to use this as an opportunity to plug #1026 as well - it's a meaningful enhancement to #1010.


deniszh commented Dec 22, 2014

👍 for #1026 too
