
Bulk-fetch metrics from remote hosts breaks some queries #1022

Closed
lamont opened this issue Nov 5, 2014 · 50 comments


lamont commented Nov 5, 2014

I was excited about the commit from pull request #1010 and rolled it out today, with bad results. Specifically, queries using divideSeries(sum(),sum()) were failing with the error "divideSeries second argument must reference exactly 1 series".

My clustered setup involves consistent-hash with a replication factor of 2, and I wonder if the bulk-fetch was tested with replicated data.

Additionally, simple queries like the following were returning 200 but an empty body:

target=alias(scale(sumSeries(collectd.app112.ShipmentDownloadTimer.counter-TotalHits%2Ccollectd.app111.ShipmentDownloadTimer.counter-TotalHits)%2C60),%20'ShipmentsDownloaded')&from=-3600s&rawData=true
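For readability, the URL-decoded form of that target (a mechanical decode of %2C and %20, nothing else changed) is:

    alias(scale(sumSeries(collectd.app112.ShipmentDownloadTimer.counter-TotalHits,collectd.app111.ShipmentDownloadTimer.counter-TotalHits),60), 'ShipmentsDownloaded')

with from=-3600s and rawData=true.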

Rolling back and forward across the 13-node cluster is pretty easy, so if someone has some specific debugging they'd like me to look for, I'm happy to try things.

Thank you.


lamont commented Nov 5, 2014

Oh sorry, I should have mentioned, that was the latest version of the 0.9.x branch.


deniszh commented Nov 5, 2014

@lamont, a pity to hear that. Unfortunately I have REPLICATION_FACTOR=1 and can't test it - but maybe @bmhatfield has a replicated setup to test with?


deniszh commented Nov 5, 2014

@bmhatfield - or maybe we need to make the web bulk-fetch configurable, like QUERY_BULK, with default = false?
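Purely to illustrate the idea (the setting name below is hypothetical and does not exist in graphite-web), such a toggle could look like this in local_settings.py:

    # local_settings.py -- hypothetical flag, name invented for illustration only.
    # When False, fall back to the pre-#1010 remote fetch behaviour.
    REMOTE_BULK_FETCH = False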

bmhatfield (Member) commented:

Oh, sad!

I have REPLICATION_FACTOR=2, and am not having this problem. I wrote some code that attempts to prevent duplicate series from sneaking through, but perhaps I failed to consider a use case for it. Let me check the divideSeries graphs we have, and see if I can reproduce the problem.

obfuscurity added this to the 0.9.13 milestone Nov 5, 2014

lamont commented Nov 5, 2014

Thanks! I found the performance to be out-of-this-world awesome for our large wildcard queries, so I'm really driven to help you debug this. The divideSeries error was the only one I could get a specific error message out of. The other issue I saw in the 40 minutes this was live was rawData=true returning an empty body with an HTTP 200 status, and I'm not sure how to debug that. (I saw it on the production cluster but could not replicate it on the development cluster.)

bmhatfield (Member) commented:

Hrm, I'm not seeing this kind of error with the wildcarded series we currently dashboard.

For a series that looks like this:
divideSeries(sum(stats_counts.worker.*.REDACTED.*),sum(stats_counts.worker.*.REDACTED)), I indeed get a graph with data in the form that I would expect.

I also tried a query with rawData=true that pulls in a variety of wildcarded metrics and that also worked.

While I can't reproduce this yet, I do believe you're having a problem. I'm on Python 2.7.3 - what kind of other context can you share that might help me tickle this issue?

bmhatfield (Member) commented:

@lamont I am running a version without this commit:

fe6cda0

I don't see how it could be related yet, but I am just noting it for posterity.


lamont commented Nov 5, 2014

Thanks for looking at this. Here are some details about my implementation:

3 frontends (graphite-web, carbon-relay with consistent hashing, replication factor 2)
10 backends (also graphite-web, carbon-relay, carbon-cache, no replication)

The frontends only know about the backend graphite-web instances, nothing about their carbon-caches. It's more or less the setup recommended by http://grey-boundary.com/the-architecture-of-clustering-graphite/

Boxes are all Red Hat 6.4

Python 2.6.6,
$ pip freeze
Cheetah==2.4.1
Django==1.4.15
M2Crypto==0.20.2
Markdown==2.0.1
MySQL-python==1.2.3c1
PyYAML==3.10
Pygments==1.1.1
SSSDConfig==1.9.2
Twisted==13.1.0
argparse==1.2.1
backports.ssl-match-hostname==3.4.0.2
boto==2.27.0
chardet==2.0.1
cloud-init==0.7.2
configobj==4.6.0
django-tagging==0.3.2
ethtool==0.6
iniparse==0.3.1
ordereddict==1.1
policycoreutils-default-encoding==0.1
prettytable==0.7.2
py-rrdtool==0.2.1
pyOpenSSL==0.10
pycurl==7.19.0
pygpgme==0.1
pyparsing==2.0.2
python-dateutil==1.4.1
python-dmidecode==3.10.13
python-memcached==1.53
pytz==2014.4
requests==1.1.0
rhnlib==2.5.22
rhsm==1.1.8
setools==1.0
simplejson==2.0.9
six==1.6.1
txAMQP==0.6.2
uWSGI==2.0.6
urlgrabber==3.9.1
urllib3==1.5
virtualenv==1.11.6
whisper==0.9.12
yum-metadata-parser==1.1.2
zope.interface==4.1.1

I'm using the latest 0.9.x of whisper, carbon and graphite-web, specifically:

whisper-0.9.x-4404f173c0
carbon-0.9.x-36177586ff
graphite-web-0.9.x-ce239313b


lamont commented Nov 5, 2014

@bmhatfield I can try without that commit, thanks for pointing that out.

bmhatfield (Member) commented:

My architecture is a little simpler: 4 Graphite nodes, all behaving the same, with a 2-tier carbon-relay and a single tier of carbon-cache. The webapps are all configured to talk to each other (but not themselves).

Actually, configuring a webapp to talk to itself for this might be a little interesting. Can you confirm that the CLUSTER_SERVERS in each node's graphite-web local_settings.py only lists the appropriate non-self hosts in the cluster?
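For reference, here's roughly what that looks like on one of the frontends. Hostnames are placeholders, but CLUSTER_SERVERS itself is the standard graphite-web setting (a list of host:port strings), and the key point is that a node never lists itself:

    # local_settings.py on frontend app001 (hostnames are placeholders)
    # List every other graphite-web this node should fan out to, never itself,
    # so the same series isn't fetched both locally and remotely.
    CLUSTER_SERVERS = [
        "backend01:8080",
        "backend02:8080",
        "backend03:8080",
        # ...remaining backends
    ]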

bmhatfield (Member) commented:

Actually - that could totally be the problem, and would behave differently with 1010 than without it. In not-1010, the same loop covers both local and remote hosts. In 1010, if you loaded a metric using the LOCAL_STORE and then got it again via the REMOTE_STORE, that could cause it to appear in the series list twice.

bmhatfield (Member) commented:

Hrm, on second read, maybe not. Worth a double-check anyways, though.


lamont commented Nov 6, 2014

I just confirmed that the frontend machines (A, B, C) only list the backend boxes in CLUSTER_SERVERS. The order of listed backend boxes varies per frontend box but there is no duplication in each list. Also, the frontend boxes have no local metrics that match the pattern of metrics on the backend. (backend has collectd.* in whisper, frontend has some nfs mounted RRD files under cacti.*)

None of the backend nodes have CLUSTER_SERVERS set as they're meant to only answer questions about their local metrics.


lamont commented Nov 6, 2014

I'm rolling out again with 0.9.x HEAD minus the fe6cda0 commit.

bmhatfield (Member) commented:

Good luck.. I don't really believe that commit is the issue, but since you said it was easy to try... 👍

I haven't come up with anything else interesting on my side yet :-(

bmhatfield (Member) commented:

PS: If it does have issues, any hacked-in logging or sample data or anything like that will be super valuable. You can email me sample data directly (my-gh-username at gmail) if you'd prefer not to have it public (though I prefer to keep the actual discussion on the issue, since this is now marked as a blocker for 0.9.13 :-( ).


lamont commented Nov 6, 2014

Ok, the problem is at least repeatable.

http://varnish-graphite/render/?target=divideSeries(sumSeries(grouper.APP.FOO1.*.bar-baz?TcSrv-TotalDuration),sumSeries(grouper.APP.FOO1.*.baz?TcSrv-TotalHits))&from=-30mins&rawData=true

That errors with the divideSeries error:

File "/opt/graphite/webapp/graphite/render/functions.py", line 501, in divideSeries
raise ValueError("divideSeries second argument must reference exactly 1 series")
ValueError: divideSeries second argument must reference exactly 1 series

bmhatfield (Member) commented:

Can you hack in some logging as to what series are being referenced?

I.e., is it 0, 2, or something else?

Also, just to confirm: the same exact query works fine on not-1010, correct?


lamont commented Nov 6, 2014

Ok, that's helpful.
    if len(divisorSeries) != 1:
      print "saw series too many", len(divisorSeries)
      raise ValueError("divideSeries second argument must reference exactly 1 series")

2014-11-06_01:33:16.24137 saw series too many 0

So now the question is why the pre-1010 code came up with some divisor series and the post-1010 code has none.

bmhatfield (Member) commented:

Ah, this is indeed very interesting information. I was guessing 2+, so 0 is useful.

For the pre-1010 code, does it error in a different way, or does it actually have a series present?

If it does have a series present, do you have any indication if it's coming from the local node, a remote node, or both?

If it's a remote node, do you know if timeouts are in play at all?

Also, do you know if you accidentally have files that would satisfy the query for the divisor metric (basically all Nones) on more hosts than you expect?

I think I can see a few ways that Line 306 here could cause us to break out of this loop at a surprising time.
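For context, the loop in question is the post-1010 de-duplication pass: for each series name fetched from the local and remote stores, keep the copy with the fewest None datapoints. The snippet below is a minimal self-contained sketch of that idea, not the actual datalib.py code (the real version works on TimeSeries objects and breaks out of an inner comparison loop, which is where the surprising early exit mentioned above can happen):

    # Sketch only: keep the least-None copy of each duplicated series name.
    # Series are represented here as plain (name, values) pairs.
    def dedup_series(fetched):
        def nones(values):
            return sum(1 for v in values if v is None)
        best = {}
        for name, values in fetched:
            if name not in best or nones(values) < nones(best[name]):
                best[name] = values
        return best

    # Example: two replicas of the same metric; the more complete copy wins.
    fetched = [
        ('collectd.app142.STUFF.counter-TotalDuration', [1, None, 3]),
        ('collectd.app142.STUFF.counter-TotalDuration', [1, 2, 3]),
    ]
    print dedup_series(fetched)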


lamont commented Nov 6, 2014

pre-1010, it does not error and has a series present. Some of the data in question is spotty (ie, it is possible that there are holes in the data for a given timerange) but a spot check showed that both copies of the data are consistent. The exact query I mailed you has a numerator of 3 unique series (6 total including duplicates) and a divisor of 3 unique series (again, 6 total copies).

All series in question are coming from a remote node, the backend nodes. It's possible timeouts are in play, but the load on the backend seems very steady and under the 1010 code the "busyness" of the uwsgi is less than a quarter of normal so I'm not inclined to believe that.

I just walked the cluster to verify that there are no metrics which are out of place, ie, no local files on the frontend boxes and only 2 copies of each on the backend boxes.

I've been rolling the code out to both the frontend and backend boxes, but if it was possible to just change the code on the frontend boxes I could take one frontend node out of rotation and compare things head to head forever. Does the frontend 1010 need something matching on the backend?

bmhatfield (Member) commented:

I haven't tested it, but 1010 on a frontend to a non-1010 backend should be fine.


lamont commented Nov 6, 2014

Ok, let me try that and get some numbers. Just a count of various series against both servers should do.

bmhatfield (Member) commented:

Thanks for all your help digging into this - like you, I found it to dramatically help performance, so I hope we can make it stable so everyone can get the benefit :-)


lamont commented Nov 6, 2014

No, thank YOU for digging into it. This feature would be fantastic and I'd love to help get it working.

So, great success with narrowing down the problem. I set up app001 with the 1010 code and app002 with the "stable" 0.9.x branch from Nov 1. They are both querying the same set of 10 backend hosts (running the same stable Nov 1 0.9.x). I had to remember to disable memcached as well.

$ curl 'http://app001:8080/render?target=countSeries(collectd.px*.STUFF.counter-TotalDuration)&rawData=true'
1

$ curl 'http://app002:8080/render?target=countSeries(collectd.px*.STUFF.counter-TotalDuration)&rawData=true'
3

Actually walking the filesystems did show me the 3 unique series. I'll try to add some debugging statements to see what the frontend collected and then possibly discarded. Python is not a language I'm any good with, so if you have specific debug suggestions, that would help a great deal.

bmhatfield (Member) commented:

I made a quick branch with some debug statements that should end up in the webapp's info.log:

bmhatfield/graphite-web@graphite-project:0.9.x...09x-debug


lamont commented Nov 6, 2014

I added some debug statements and think I know what is happening. I added:

      for series in results:
        print "considering ", series['name']

and

            if candidate_nones >= known_nones:
              # If we already have this series in the seriesList, and
              # it is 'worse' than the other series, we don't need to
              # compare anything else. Save ourselves some work here.
              print "break cause candidate_nones >= known_nones"
              break

and

        if not series_handled:
          print "appending"
          seriesList.append(ts)

And saw the following:

2014-11-06_04:56:37.71547 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.71552 appending
2014-11-06_04:56:37.72104 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.72107 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.72662 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.72670 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.73190 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.73214 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.73887 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.73920 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.74432 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.74462 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.74963 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.74996 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.75516 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.75548 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.76087 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.76119 break cause candidate_nones >= known_nones
2014-11-06_04:56:37.76653 considering  collectd.app142.STUFF.counter-TotalDuration
2014-11-06_04:56:37.76699 break cause candidate_nones >= known_nones

This is interesting for a few reasons. app142 came up 6 times, whereas it should have come up twice (the metric and its replica). There are indeed 6 copies of the matching metrics in total, but the other 4 should be the 2 copies each of the app141 and app143 metrics.

I'll keep poking at it (my Python is terrible). I see you just posted a debug branch, so let me try that and get the results back to you.


lamont commented Nov 6, 2014

Also, the app142 machine's metric is on the first node in the CLUSTER_SERVERS list.

obfuscurity (Member) commented:

I owe you both 🍻 the next time we meet in person. 💖

bmhatfield (Member) commented:

😄 Hopefully we can actually get this figured out...

@lamont - for the log above, is STUFF representing the same redacted string for each, or different redacted strings? If different, when you post the log for the branch I sent you, can you tokenize a bit so I can get a better sense (i.e. STUFFA, STUFFB, etc.)?


lamont commented Nov 6, 2014

STUFF is a static string, unchanged across all metrics. I'm mailing you the log from the debug branch with the unredacted metrics now.

bmhatfield (Member) commented:

Hrm,

The logs you sent me are fairly unexpected: they seem to say that each of your 10 backend servers is returning the same 1 series for your query. That doesn't seem likely based on our conversation above, unless they were "talking" to each other in order to return a result, which is something you mentioned they shouldn't be configured to do.

I pushed up another log message (into remote_storage.py); can you see if bmhatfield/graphite-web@graphite-project:0.9.x...09x-debug now provides more info?

Also, it's late on the East Coast, so I am going to go to bed, but I will run this debug logging myself and see if I learn anything from it tomorrow.


lamont commented Nov 6, 2014

I'm going to bed as well, but I just mailed you the updated debug log. It just occurred to me that while there is no CLUSTER_SERVERS communication between backend nodes, they all do share a memcached, which might be what's causing this. I bet they're each serving up the cached result of the "fetch local target wildcard" request that went to the first node.

I'll disable memcache on the backend machines (it was of dubious value anyway) and see if that was it.


lamont commented Nov 6, 2014

So yeah, that was totally my fault. Disabling memcache on the backend seems to have fixed the issue for me. I'll do a fuller test tomorrow and let you know.

My apologies for muddying the water, your stuff rocks.

obfuscurity (Member) commented:

Still, it's good to recognize these use cases. There seems to be some confusion about the best way to incorporate memcached among Graphite clusters.

It appears that the "best practice" in multi-node setups is to use a shared memcached among workers on the same frontend graphite-web, but not at all on backend nodes.
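Concretely, and assuming the standard MEMCACHE_HOSTS setting in local_settings.py (hostnames below are placeholders), that guideline would look something like:

    # local_settings.py on a frontend: a memcached shared by the workers on
    # this tier (here, one running locally on each frontend).
    MEMCACHE_HOSTS = ['127.0.0.1:11211']

    # local_settings.py on a backend: no memcached at all, so every request
    # is answered directly from the local whisper files.
    MEMCACHE_HOSTS = []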


deniszh commented Nov 6, 2014

Cool, guys, that was a really impressive debugging session!
Btw, @lamont, maybe it's a good idea to close the issue then?


lamont commented Nov 6, 2014

@deniszh I'm doing a full acceptance test (with humans) at noon today and I'll close the issue after that.

bmhatfield (Member) commented:

I'm not sure we should close this quite yet; I'd like to understand a little better why 1010's behavior is different in this case. My working assumption is that while it's inefficient to return the same metric from each of 10 servers, it should still work roughly the same as pre-1010.

bmhatfield (Member) commented:

@lamont How many distinct series do you see in pre-1010 for the query that's reflected in the updated logs you sent me?

Thu Nov 06 01:12:57 2014 :: Fetched 'local=1&target=YOURWILDMETRICEXPR&format=pickle&from=1415167967&until=1415254367&now=1415254367' from ANIPADDR:8080, received 1 series
Thu Nov 06 01:12:57 2014 :: List contains: '['ASINGLEMETRICNAME']'


lamont commented Nov 6, 2014

@bmhatfield 0 or 1 per box, for a total of 3. When I disable memcached on the backend boxes I also get the correct results.
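For anyone reproducing this, a rough way to check those per-box counts directly is to query each backend on its own and count series in the rawData output (hostnames are placeholders; rawData prints one line per series, each containing a '|'):

    for host in backend01 backend02 backend03; do
      echo -n "$host: "
      curl -s "http://$host:8080/render?target=collectd.px*.STUFF.counter-TotalDuration&rawData=true" | grep -c '|'
    done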


SEJeff commented Nov 6, 2014

Perhaps you could send us a PR to update the docs with this as a gotcha?

bmhatfield (Member) commented:

The ideal case would be that this gets improved to not be a gotcha :-)

bmhatfield (Member) commented:

@lamont Any chance you could hack in a log of which series are returned by which backend hosts, like what I did for remote_storage.py, but in non-1010, with memcached enabled? I have a guess as to what the problem is with 1010, and I think it's fixable in 1010, but I need a little more data :-)

bmhatfield (Member) commented:

@SEJeff is there a particular doc that you'd recommend we edit in any case? Happy to help but the request feels a bit vague at the moment :-)


lamont commented Nov 6, 2014

Ok, I rolled the 1010 software to all frontend boxes and it is now "live" on the in-use cluster, which gets hammered by queries. So far, I'm seeing all queries returning proper counts. Each of the backend queries takes 2.5-5 seconds, which, when serialized across 10 backends, results in a total query delay of about 30s, but that's better than before.

@bmhatfield I'll take one of the machines out and roll the old software back to it and hack in the remote_storage logging to show you the results.


deniszh commented Dec 21, 2014

@lamont, any progress with testing? I'm still wondering why the shared memcache was killing your queries...
Did you try a local memcache (i.e. on localhost) instead of a shared one?

obfuscurity (Member) commented:

I'm pretty sure this is resolved. I think any confusion stems from a lack of proper guidelines around using memcached with multiple nodes.

obfuscurity (Member) commented:

@lamont @bmhatfield You guys have any time this week for any final tests so we can mark this closed? This is effectively the last issue blocking 0.9.13.

bmhatfield (Member) commented:

I think we're good here. Per the earlier discussion, this works for @lamont without the shared memcached setup, and I have been using this in production for some time and am happy and confident in its behavior.

We can close this issue, but I'd like to use this as an opportunity to plug #1026 as well - it's a meaningful enhancement to #1010.


deniszh commented Dec 22, 2014

👍 for #1026 too
