Bulk-fetch metrics from remote hosts breaks some queries #1022
Comments
Oh, sorry, I should have mentioned: that was the latest version of the 0.9.x branch.
@lamont, a pity to hear that. Unfortunately I have REPLICATION_FACTOR=1 and can't test it - but maybe @bmhatfield has a replicated setup to test with?
@bmhatfield - or maybe we need to make the web bulk-fetch configurable, like a QUERY_BULK setting with default = false?
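(For illustration, a toggle of that sort in local_settings.py could look like the sketch below. QUERY_BULK is a hypothetical name here, not an existing graphite-web setting:)

```python
# local_settings.py - sketch of a hypothetical opt-in flag for bulk fetching
QUERY_BULK = False   # default: keep the old one-request-per-metric remote fetch behaviour
# QUERY_BULK = True  # opt in to the bulk remote fetch from #1010
```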
Oh, sad! I have REPLICATION_FACTOR=2, and am not having this problem. I wrote some code that attempts to prevent duplicate series from sneaking through, but perhaps I failed to consider a use case for it. Let me check the divideSeries graphs we have, and see if I can reproduce the problem.
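(For context, a minimal sketch of the idea of filtering duplicate series by name is shown below; this illustrates the concept only and is not the actual code in the branch, which may merge data points from replicas rather than discarding them:)

```python
def dedupe_series(series_list):
    # Keep only the first copy of each series name; later copies fetched
    # from replica nodes are dropped in this simplified version.
    seen = set()
    unique = []
    for series in series_list:
        if series.name not in seen:
            seen.add(series.name)
            unique.append(series)
    return unique
```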
Thanks! I found the performance to be out-of-this-world awesome for our large wildcard queries, so I'm really driven to help you debug this. The divideSeries error was the only one I could get a specific error message out of; the other issue I saw in the 40 minutes this was live was rawData=true returning an empty body with an HTTP 200 status, and I'm not sure how to debug that. (I saw that on the production cluster but could not replicate it on the development cluster.)
Hrm, I'm not seeing this kind of error with the wildcarded series we currently dashboard. For a series that looks like this: I also tried a query with rawData=true that pulls in a variety of wildcarded metrics, and that also worked. While I can't reproduce this yet, I do believe you're having a problem. I'm on
Thanks for looking at this. Here are some details about my implementation:

- 3 frontends (graphite-web, carbon-relay with consistent hashing, replication 2)
- The frontends only know about the backend graphite-web instances, nothing about their carbon caches.
- It's more or less the setup recommended by http://grey-boundary.com/the-architecture-of-clustering-graphite/
- Boxes are all Red Hat 6.4, Python 2.6.6.
- I'm using the latest 0.9.x of whisper, carbon and graphite-web, specifically: whisper-0.9.x-4404f173c0
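(For readers reconstructing this kind of setup: a consistent-hashing relay with replication factor 2 is normally expressed in carbon.conf along the lines of the sketch below. The hostnames, ports and instance names are placeholders, not values from this cluster:)

```ini
# carbon.conf on each relay - illustrative values only
[relay]
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 2
DESTINATIONS = backend01:2004:a, backend02:2004:a, backend03:2004:a
```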
@bmhatfield I can try without that commit, thanks for pointing that out.
My architecture is a little simpler: 4 Graphite nodes, all behaving the same, each with a 2-tier carbon-relay and 1 tier of carbon-cache. The webapps are all configured to talk to each other (but not themselves). Actually, configuring a webapp to talk to itself for this might be a little interesting. Can you confirm what all of your graphite-web local_settings.py CLUSTER_SERVERS lists contain?
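(For reference, the setting being asked about is the webapp's cluster list; a frontend's local_settings.py typically contains something like the sketch below, with placeholder hostnames:)

```python
# local_settings.py on a frontend webapp - placeholder hostnames
CLUSTER_SERVERS = [
    "backend01:8080",
    "backend02:8080",
    "backend03:8080",
]
# Backend webapps would leave CLUSTER_SERVERS empty so they only answer
# for their own local whisper files.
```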
Actually - that could totally be the problem, and the behavior would differ between 1010 and not-1010. In not-1010, the same loop covers both local and remote hosts. In 1010, if you loaded a metric using the LOCAL_STORE and then got it again using the REMOTE_STORE, it could appear in the series list twice.
Hrm, on second read, maybe not. Worth a double-check anyways, though.
I just confirmed that the frontend machines (A, B, C) only list the backend boxes in CLUSTER_SERVERS. The order of the listed backend boxes varies per frontend box, but there is no duplication within each list. Also, the frontend boxes have no local metrics that match the pattern of metrics on the backend (the backend has collectd.* in whisper; the frontend has some NFS-mounted RRD files under cacti.*). None of the backend nodes have CLUSTER_SERVERS set, as they're meant to only answer questions about their local metrics.
I'm rolling out again with 0.9.x HEAD minus the fe6cda0 commit.
Good luck... I don't really believe that commit is the issue, but since you said it was easy to try... 👍 I haven't come up with anything else interesting on my side yet :-(
PS: If it does have issues, any hacked-in logging or sample data or anything like that will be super valuable. You can email me sample data directly (my-gh-username at gmail) if you'd prefer not to have it public (though I prefer to keep the actual discussion on the issue, since this is now marked as a blocker for 0.9.13 :-( )
Ok, the problem is at least repeatable.
That errors with the divideSeries error:

    File "/opt/graphite/webapp/graphite/render/functions.py", line 501, in divideSeries
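(For context, the check that produces this message in the 0.9.x divideSeries is roughly the following; this is paraphrased, so treat it as a sketch of functions.py rather than a verbatim excerpt:)

```python
def divideSeries(requestContext, dividendSeriesList, divisorSeriesList):
    # divideSeries insists on exactly one divisor series; a wildcard (or a
    # duplicated series) in the second argument trips this check.
    if len(divisorSeriesList) != 1:
        raise ValueError(
            "divideSeries second argument must reference exactly 1 series")
    divisorSeries = divisorSeriesList[0]
    # ... the real function then walks each dividend series and divides it
    # point-by-point by divisorSeries ...
```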
Can you hack in some logging as to what series are being referenced? I.e., is it 0, or 2, or something else? Also, just to confirm: the same exact query works fine on not-1010, correct?
Ok, that's helpful:

    2014-11-06_01:33:16.24137 saw series too many 0

So now the question is why pre-1010 came up with some divisor series and post-1010 has none.
Ah, this is indeed very interesting information. I was guessing 2+, so 0 is useful. For the pre-1010 code, does it error in a different way, or does it actually have a series present? If it does have a series present, do you have any indication if it's coming from the local node, a remote node, or both? If it's a remote node, do you know if timeouts are in play at all? Also, do you know if you accidentally have files that would satisfy the query for the divisor metric (basically all Nones) on more hosts than you expect? I think I can see a few ways that Line 306 here could cause us to break out of this loop at a surprising time.
Pre-1010, it does not error and has a series present. Some of the data in question is spotty (i.e., it is possible that there are holes in the data for a given time range), but a spot check showed that both copies of the data are consistent. The exact query I mailed you has a numerator of 3 unique series (6 total including duplicates) and a divisor of 3 unique series (again, 6 total copies). All series in question are coming from a remote node, the backend nodes. It's possible timeouts are in play, but the load on the backend seems very steady, and under the 1010 code the "busyness" of the uwsgi is less than a quarter of normal, so I'm not inclined to believe that. I just walked the cluster to verify that there are no metrics which are out of place, i.e., no local files on the frontend boxes and only 2 copies of each on the backend boxes. I've been rolling the code out to both the frontend and backend boxes, but if it were possible to just change the code on the frontend boxes, I could take one frontend node out of rotation and compare things head to head forever. Does the frontend 1010 need something matching on the backend?
I haven't tested it, but 1010 on a frontend to a non-1010 backend should be fine.
Ok, let me try that and get some numbers. Just a count of various series against both servers should do.
Thanks for all your help digging into this - like you, I found it to dramatically help performance, so I hope we can make it stable so everyone can get the benefit :-)
No, thank YOU for digging into it. This feature would be fantastic and I'd love to help get it working. So, great success with narrowing down the problem. I set up app001 with the 1010 code and app002 with the "stable" 0.9.x branch from Nov 1. They are both querying the same set of 10 backend hosts (running the same stable Nov 1 0.9.x). I had to remember to disable memcached as well.

    $ curl 'http://app001:8080/render?target=countSeries(collectd.px*.STUFF.counter-TotalDuration)&rawData=true'
    $ curl 'http://app002:8080/render?target=countSeries(collectd.px*.STUFF.counter-TotalDuration)&rawData=true'

Actually walking the filesystems did show me the 3 unique values. I'll try to add some debugging statements to see what the frontend collected and then possibly discarded. Python is not a language I'm any good with, so if you have specific debug suggestions, that would help a great deal.
I made a quick branch with some debug statements that should end up in the webapp's info.log:
I added some debug statements and think I know what is happening. I added:
and
and
And saw the following:
This is interesting for a few reasons. app142 came up 6 times, whereas it should have come up twice (the metric and its replica). There are 6 copies of the matching metrics in total, but the other 4 should be the 2 copies each from app141 and app143. I'll keep poking at it (my Python is terrible). I see you just posted a debug branch, so let me try that and get back to you.
Also, the app142 machine's metric is on the first node in the CLUSTER_SERVERS list.
I owe you both 🍻 the next time we meet in person. 💖
😄 Hopefully we can actually get this figured out... @lamont - for the log above, is STUFF a literal component of the metric name, or something you redacted?
STUFF is a static string, unchanged across all metrics. I'm mailing you the log from the debug branch with the unredacted metrics now.
Hrm, the logs you sent me are fairly unexpected; they seem to say that all 10 of your backend servers are each returning the same single series for your query, which doesn't seem likely based on our conversation above, unless they were "talking" to each other in order to return a result, which is something you mentioned they shouldn't be configured to do. I pushed up another log message (into remote_storage.py); can you see if bmhatfield/graphite-web@graphite-project:0.9.x...09x-debug now provides more info? Also, it's late on the East Coast, so I am going to go to bed, but I will run this debug logging myself and see if I learn anything from it tomorrow.
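(To give a flavour of the kind of instrumentation being discussed - this is not the contents of the actual debug branch - a per-host log line using graphite-web's standard logger might look like the sketch below; the helper name is hypothetical:)

```python
# Illustrative only: a helper of this shape, called where remote find
# results are collected, shows which backend returned which series.
from graphite.logger import log

def log_remote_results(host, nodes):
    # 'host' is the CLUSTER_SERVERS entry, 'nodes' the matches it returned.
    for node in nodes:
        log.info("remote find: host=%s matched=%s"
                 % (host, getattr(node, 'metric_path', node)))
```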
I'm going to bed as well, but I just mailed you the updated debug log. It just occurred to me that while there is no CLUSTER_SERVERS communication between backend nodes, they all do share a memcached, which might be what's causing this. I bet they're each serving up the cached result of the "fetch local target wildcard" that went to the first node. I'll disable memcache on the backend machines (it was of dubious value anyway) and see if that was it.
So yeah, that was totally my fault. Disabling memcache on the backend seems to have fixed the issue for me. I'll do a fuller test tomorrow and let you know. My apologies for muddying the water, your stuff rocks.
Still, it's good to recognize these use cases. There seems to be some confusion about the best way to incorporate memcached in Graphite clusters. It appears that the "best practice" in multi-node setups is to share a memcached among the workers on the same frontend graphite-web, but not to use it at all on backend nodes.
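(Expressed as configuration, and with placeholder values, that guideline might look like this on a frontend:)

```python
# Frontend graphite-web local_settings.py: the workers on this host share a
# memcached that is local to the frontend.
MEMCACHE_HOSTS = ['127.0.0.1:11211']
```

and on a backend:

```python
# Backend graphite-web local_settings.py: no memcached, so every find/fetch
# is answered straight from this node's own whisper files.
MEMCACHE_HOSTS = []
```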
Cool, guys, that was a really impressive debugging session!
@deniszh I'm doing a full acceptance test (with humans) at noon today and I'll close the issue after that.
I'm not sure we should close this quite yet; I'd like to understand a little better why 1010's behavior is different in this case. My working assumption is that while it's inefficient to return the same metric from each of 10 servers, it should still work roughly the same as pre-1010.
@lamont How many distinct series do you see in pre-1010 for the query that's reflected in the updated logs you sent me?
@bmhatfield 0 or 1 per box, for a total of 3. When I disable memcached on the backend boxes I also get the correct results.
Perhaps you could send us a PR to update the docs with this as a gotcha?
The ideal case would be that this gets improved to not be a gotcha :-)
@lamont Any chance you could hack in a log for what series are returned by what backend hosts, like what I did for remote_storage.py, but in non-1010, with memcached enabled? I have a guess as to what the problem is with 1010, and I think it's fixable in 1010, but I need a little more data :-)
@SEJeff is there a particular doc that you'd recommend we edit in any case? Happy to help, but the request feels a bit vague at the moment :-)
Ok, I rolled the 1010 software to all frontend boxes and it is now "live" on the in-use cluster, which gets hammered by queries. So far, I'm seeing all queries return with proper counts. I am seeing each of the backend queries take 2.5-5 seconds, which, when serialized across 10 backends, results in a total query delay of about 30s, but that's better than before. @bmhatfield I'll take one of the machines out of rotation, roll the old software back onto it, and hack in the remote_storage logging to show you the results.
@lamont, any progress with testing? I'm still wondering why the shared memcache was killing your queries...
I'm pretty sure this is resolved. I think any confusion stems from a lack of proper guidelines around using memcached with multiple nodes.
@lamont @bmhatfield You guys have any time this week for any final tests so we can mark this closed? This is effectively the last issue blocking 0.9.13.
I think we're good here. Per the earlier discussion, this works for @lamont without the shared memcached setup, and I have been using this in production for some time and am happy/confident in its behavior. We can close this issue, but I'd like to use this as an opportunity to plug #1026 as well - it's a meaningful enhancement to #1010.
👍 for #1026 too
I was excited about the commits around pull request #1010 and rolled them out today with bad results. Specifically, queries using divideSeries(sum(),sum()) were failing with the error "divideSeries second argument must reference exactly 1 series".
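(An illustrative render target of that shape is shown below; the metric names are placeholders borrowed from later in this thread, not the actual failing query:)

```
target=divideSeries(sumSeries(collectd.app14*.STUFF.counter-TotalDuration),sumSeries(collectd.app14*.STUFF.counter-TotalHits))&from=-3600s&rawData=true
```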
My clustered setup involves consistent hashing with a replication factor of 2, and I wonder if the bulk-fetch was tested with replicated data.
Additionally, simple queries like the following were returning 200 but an empty body:
target=alias(scale(sumSeries(collectd.app112.ShipmentDownloadTimer.counter-TotalHits%2Ccollectd.app111.ShipmentDownloadTimer.counter-TotalHits)%2C60),%20'ShipmentsDownloaded')&from=-3600s&rawData=true
Rolling back and forward across the 13-node cluster is pretty easy, so if someone has some specific debugging they'd like me to look for, I'm happy to try things.
Thank you.