Add support for multiple YARN masters #595
As you said, we can take the list of YARN endpoints. Whenever we have to pick a YARN endpoint, we can start connecting in round-robin fashion, pick the active one from the list, and assign it to the variable we use while getting the status.
Am I correct?
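For illustration, here is a minimal sketch of that round-robin/active-check idea, assuming the active RM can be identified from the ws/v1/cluster/info REST response discussed later in this thread; the pick_active_rm name and the use of requests are assumptions, not existing EG or yarn-api-client code.

```python
# Illustrative only: return the first RM in the list whose cluster info reports
# an ACTIVE haState; skip endpoints that are down or standby.
import requests

def pick_active_rm(endpoints, timeout=5):
    for endpoint in endpoints:
        try:
            resp = requests.get("{}/ws/v1/cluster/info".format(endpoint), timeout=timeout)
            resp.raise_for_status()
            if resp.json().get("clusterInfo", {}).get("haState") == "ACTIVE":
                return endpoint
        except requests.RequestException:
            continue  # try the next candidate in round-robin fashion
    return None

# e.g. pick_active_rm(["http://rm1:8088", "http://rm2:8088"])
```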
Yes. However, we may want to do this in the yarn-api-client layer. That said, I like the idea of dealing with the list in EG because, ultimately, it would be nice not to have to rely on local Hadoop config files (although Spark might need those regardless). I'm not that familiar with the Spark configuration relative to YARN. Adding @lresende and @akchinSTC in case they have input here.
The code should not …
I believe @saipradeepkumar is saying a similar thing. The idea is that we'd discriminate the active RM from the standby via the … The link you provided is good, though it would require additional knowledge beyond the RM node(s). Not sure which is better. I like the idea of just adding the … I do want to, for this exercise, change the …
@lresende
I'd prefer we not introduce another dependency when we can get the same information using the … At any rate, I think the constructor for …
I'd be happy to put something together, but could only test it in a contrived env (using a fake "down" server, etc.) with one working master. Let me know.
@saipradeepkumar, @lresende - I had some time and went ahead and implemented my thoughts on this. Let me know what you think. Once we settle on something, I'll update the docs to include the new option (and remove the old).
@kevin-bates it sounds good to me, we can go further and implement this feature.
The yarn-api-client package already has functionality to pull the yarn resource manager id(s) from the local yarn-site.xml; you can find the reference functionality here.
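As an illustration of the kind of lookup being referenced (and not yarn-api-client's actual implementation), here is a sketch that pulls the HA rm-ids and their webapp addresses out of a local yarn-site.xml using the standard YARN HA property names; the function name and default conf directory are assumptions.

```python
# Illustrative only: derive RM webapp addresses from yarn-site.xml.
import os
import xml.etree.ElementTree as ET

def rm_webapp_addresses(conf_dir=os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")):
    tree = ET.parse(os.path.join(conf_dir, "yarn-site.xml"))
    props = {p.findtext("name"): p.findtext("value") for p in tree.iter("property")}
    rm_ids = props.get("yarn.resourcemanager.ha.rm-ids")
    if not rm_ids:  # non-HA cluster: single webapp address (if any)
        return [props.get("yarn.resourcemanager.webapp.address")]
    return [props.get("yarn.resourcemanager.webapp.address.%s" % rm_id.strip())
            for rm_id in rm_ids.split(",")]
```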
Thanks for the comment. That "cluster check" only occurs if no address is provided and doesn't check the haState, only that the response is received. I tend to agree that this kind of thing should be in yarn-api-client, but felt an external approach may be more flexible. The other thing I want to prevent is pinging the standby master each time a new kernel is started. That is, we should have a means of remembering the last valid master across kernel instances.
@hansohn - you got me thinking (and looking into RM HA a bit) - thank you.
I think you're correct in that we can get a long way here by NOT setting an address - although I feel the code in …
Regarding the spark-submit point, this implies we would need the ability to pass in another (alternate) address. Rather than overload the address parameter on the ResourceManager constructor to handle it being a list, I'm inclined to add another address/port pair...
def __init__(self, address=None, port=8088, alt_address=None, alt_port=8088, timeout=30, kerberos_enabled=False):
Then refactor the constructor such that, if two addresses are provided, the active state is checked - either by promoting …
This way, the 90% case is handled by simple removal of the address (and EG should NOT provide defaults for yarn endpoint config values), while also allowing the case where the caller is targeting multiple yarn clusters. I would also advocate for a new method on ResourceManager …
Comments?
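Purely as a sketch of the refactor described above (not actual yarn-api-client code), here is one way the constructor could promote whichever address reports itself ACTIVE; the _is_active helper and attribute names are assumptions.

```python
# Illustrative sketch of the proposed alt_address/alt_port handling.
import requests

class ResourceManager(object):
    def __init__(self, address=None, port=8088, alt_address=None, alt_port=8088,
                 timeout=30, kerberos_enabled=False):
        self.timeout = timeout
        self.kerberos_enabled = kerberos_enabled
        if address and alt_address:
            # Two addresses given: use whichever reports an ACTIVE haState.
            if self._is_active(address, port):
                self.address, self.port = address, port
            else:
                self.address, self.port = alt_address, alt_port
        else:
            # Zero or one address: preserve the existing (backward-compatible) behavior.
            self.address, self.port = address, port

    def _is_active(self, address, port):
        try:
            info = requests.get("http://{}:{}/ws/v1/cluster/info".format(address, port),
                                timeout=self.timeout).json()
            return info.get("clusterInfo", {}).get("haState") == "ACTIVE"
        except requests.RequestException:
            return False
```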
Regarding this particular design, I think it's not extensible. In the YARN API Client project today, there is a PR to add http/https support which, following this pattern, will add address_use_https, alt_address_use_https, etc. Having a single parameter where you can pass one or a list of service endpoints like "http(s)://host:port" might be a good consideration and provide a little more flexibility and extensibility.
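For comparison, a small sketch of that alternative, assuming the caller may pass either a single "http(s)://host:port" string or a list of them; the normalize_endpoints name and default port are illustrative, not a proposed API.

```python
# Illustrative only: one parameter carries scheme, host, and port for one or many RMs.
from urllib.parse import urlparse

def normalize_endpoints(service_endpoints, default_port=8088):
    if isinstance(service_endpoints, str):
        service_endpoints = [service_endpoints]
    parsed = []
    for endpoint in service_endpoints:
        url = urlparse(endpoint)
        parsed.append((url.scheme or "http", url.hostname, url.port or default_port))
    return parsed

# e.g. normalize_endpoints(["https://rm1:8090", "https://rm2:8090"])
```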
@kevin-bates I manage a few yarn HA clusters and was originally writing a patch to account for HA because I didn't want to define a new value for …
In regard to the points you mentioned above:
The code in spark-submit handles YARN HA internally by translating the values of the rm-ids it finds in the yarn-site.xml when you define the …
The biggest problem I have found with the above architecture is that the enterprise_gateway service will need to be bounced in order to trigger the call to yarn-api-client to get the active resource manager. It would be nice to adapt to a change like that at notebook runtime, but I am not familiar enough with the codebase to know whether that sort of hook is available or not.
Thank you for working on this feature! We use Yarn and Jupyter at my company, and figuring out the HA implementation is a hurdle we need to clear in order to bring Enterprise Gateway into our workflow. Looking forward to the new features! Cheers!
@lresende and @hansohn - thank you for the responses - they were helpful.
@lresende - I agree there's an extensibility issue here in terms of https coupled with the alternate addresses - although I think you'd only need a single … I chose the additional-parameters approach because it's completely backward compatible and doesn't require updates to the other classes (for consistency purposes, since ResourceManager is the only class that should entertain multiple addresses). I think a switch to a fully qualified URI (sans the endpoint, although that should probably be part of it as well) will be disruptive and, not being a person close to yarn-api-client, I don't really feel it's my place to make such a change. That said, I'd be happy to crank that stuff out, but I don't feel qualified to make that kind of decision.
@hansohn - Yes, the bounce of EG is required because we currently provide a default value for the yarn endpoint parameter. Once we remove the defaulting behavior, no restart will be required because we get a new instance of RM for each kernel. We will do that at a minimum! Thanks for raising that. Heck, if we feel we do not need to target multiple clusters at all, then we can get support for HA by subtraction! We simply REMOVE the endpoint configuration item - since we're required to have a CONF_DIR on the EG server in SPARK_HOME anyway. (@lresende - is that a true statement?) We'd just document that HADOOP_CONF_DIR be specified and never pass an address/port when creating an RM. This would mean no changes to the ResourceManager constructor would be required either.
Hmm - I still think the explicit check is required when …
Looks like this explains the standby's response relative to the Refresh header: http://mail-archives.apache.org/mod_mbox/ambari-user/201508.mbox/%3C09D941A9-4A4E-4153-A2D2-CCB57ACC332F@hortonworks.com%3E
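Based on the behavior described in that link, a hedged sketch of how a client could notice the standby's Refresh header and follow it to the active RM; the exact header format can vary between versions, so the parsing here is illustrative only.

```python
# Illustrative only: if a standby RM answers with a Refresh header such as
# "3; url=http://active-rm:8088/...", re-issue the request against that URL.
import requests

def follow_standby_refresh(url, timeout=5):
    resp = requests.get(url, timeout=timeout)
    refresh = resp.headers.get("Refresh")
    if refresh and "url=" in refresh:
        active_url = refresh.split("url=", 1)[1].strip()
        resp = requests.get(active_url, timeout=timeout)
    return resp
```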
So does Enterprise Gateway support YARN HA out of the gate? Or is it not supported and you have to hardcode the active RM?
That's the plan (and soon). For current releases, however, you have to hardcode the active RM.
With PR #623 (merged into master) you can achieve YARN HA support if EG is running on an edge node (i.e., a node where the local Hadoop configuration files are available).
Once yarn-api-client PR #29 is merged and part of a release, we can then update the EG dependencies via PR #607; you can then set the addresses for the two RMs and yarn-api-client will determine the active RM. In this case, …
For both of these cases, we plan on updating BOTH the 1.x and 2.x releases.
Kevin, is Toree still needed for Scala support within Jupyter?
EG provides Scala kernelspecs that utilize Toree. You could probably use other Scala kernels for remote interaction via Toree, but you'd need to tweak our toree launcher or model a different launcher after that one. If you're talking about regular Jupyter support, where kernels run local to the server, then there wouldn't be a requirement that Toree be the Scala kernel.
There should not be a difference whether you are running locally or remotely: you will need a Scala kernel to support Scala / Spark, and Toree is one of those kernels.
@lresende - when running remotely, you'll need a kernel launcher to create the (local) connection file, send it back to EG, and listen for interrupts. At the moment, the kernel launcher for Scala kernels is dedicated to Apache Toree. Is there some other way to run remote kernels outside of EG (i.e., regular Jupyter)?
@kevin-bates I think I got a little confused by your previous paragraph, which I now seem to get. Toree is a requirement in the context of …
Correct - although technically speaking we (EG) support local kernels, so in that case you could run EG with a different Scala kernel provided the kernel only ran locally. As soon as you want it to run remotely, you'll have a kernel launcher issue because the launcher we provide is dedicated to Toree. Sorry for the confusion.
When YARN is configured for HA, it requires that multiple masters be specified. As a result, admins of EG should be able to specify multiple masters or none at all (in which case, the yarn-api-client library uses local Hadoop config files).
Currently EG requires that a yarn_endpoint be configured
--EnterpriseGatewayApp.yarn_endpoint=http://hostname:port/ws/v1/cluster
yet, now that the underlying library has been enhanced, EG only uses the hostname from that value. Instead, we should add a configuration option
--EnterpriseGatewayApp.yarn_masters
which is a list-valued property whose default value is the empty list. If empty, the underlying library will use local configuration files. If a single entry, EG will use that host. If multiple, we could either be optimistic and use the first one (with some recovery to the others) or we could ensure the one used is an active master via the
ws/v1/cluster/info
REST API. We should also retain the selected (or single) master in a static variable relative to the scope (remember that yarn hosts could appear in kernelspec configs) so that we start with that verified value for the next applicable kernel. (Unconditionally verifying the master when multiple exist is probably the easiest and most stable approach.) We should also try to switch away from downed masters for already-running kernels, otherwise things like termination might fail - assuming the scope of that effort is not too great.
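To make the proposal concrete, here is a rough sketch assuming a traitlets-based configurable (as EG uses for its other options); the yarn_masters trait name follows the text above, while the holder class, caching variable, and resolve_master helper are illustrative assumptions.

```python
# Illustrative sketch only: list-valued yarn_masters trait plus active-master
# resolution with a simple module-level cache remembered across kernel starts.
import requests
from traitlets import List, Unicode
from traitlets.config import Configurable

class YarnClusterConfig(Configurable):  # hypothetical holder for the trait
    yarn_masters = List(Unicode(), default_value=[],
                        help="YARN master hosts; empty list => use local Hadoop config files."
                        ).tag(config=True)

_last_active_master = None  # remembered for the next applicable kernel

def resolve_master(masters, port=8088):
    """Return a verified-active master, preferring the last known-good one."""
    global _last_active_master
    candidates = ([_last_active_master] if _last_active_master in masters else []) + masters
    for host in candidates:
        try:
            info = requests.get("http://{}:{}/ws/v1/cluster/info".format(host, port),
                                timeout=5).json()
            if info.get("clusterInfo", {}).get("haState") == "ACTIVE":
                _last_active_master = host
                return host
        except requests.RequestException:
            continue
    return None  # caller can fall back to local Hadoop configuration (no address)
```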