
Move to hadoop-2.7, improvements to FGS #116

Merged
merged 5 commits into mesos:issue_14 on Aug 11, 2015

Conversation

smarella
The PR includes 3 separate commits:

Move to hadoop-2.7
2b76199

Apply FGS to only NM's registered with (0G,0CPU)
8e355b5

Allow Myriad to launch a few NMs during RM's startup
3e47ed1

The following problems surfaced during this change:
1. The API that Myriad was using to update NM's capacity is removed in 2.7. It's replaced with a NODE_RESOURCE_UPDATE event (YARN-1506).
2. RM rejects app submissions if the size of the AM container requested is greater than the capacity of the largest NM (YARN-3079, YARN-2604).
   Since FGS sets the NM's capacity to zero during NM's registration with RM, the app submission fails.
Summary of major changes made:
- Introduced a "zero" profile in the .yml file. An NM launched with this profile advertises (0G,0CPU) to RM.
- When an offer is received:
   - Myriad uses the offer to launch a new NM, if there are pending "flexup" requests
     and an NM is not already running on the slave the resources are offered on.
   - Otherwise, if the slave node is running an NM originally launched with (0G,0CPU) capacity,
     then Myriad uses the offer to dynamically increase the capacity of that NM.
   - Otherwise, Myriad rejects the offer.
- Moved the FGS classes to a separate package (.fgs).
- Built a "filter" mechanism so that FGS receives interceptor callbacks only for NMs launched
  with (0G,0CPU) capacity.
- Added a section "nmInstances" to the .yml file
- "nmInstances" allows specifying the number of NM instances of a specific profile
- Default is "medium: 1" (can be changed)
- Multiple profiles can be specified, e.g.:
  medium: 1
  small: 3
  zero: 2
- Added validation for nmInstances. RM doesn't start if
  - profile name is incorrect
  - number of NMs to launch is zero
  - largest profile has zero cpus or zero memory
Updated documentation for fine grained scaling (FGS).
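The offer-handling decision described in the summary above can be sketched roughly as follows. This is a simplified illustration with stand-in types and method names, not Myriad's actual classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the three-way offer decision: launch a new NM if a
// flexup is pending and no NM runs on the offered slave; otherwise expand a
// zero-profile NM; otherwise reject. Names here are illustrative only.
public class OfferDecision {
  public enum Action { LAUNCH_NEW_NM, EXPAND_ZERO_NM, REJECT }

  // slaveId -> true if the NM on that slave was launched with (0G,0CPU)
  private final Map<String, Boolean> runningNMs = new HashMap<>();
  private int pendingFlexUpRequests;

  public OfferDecision(int pendingFlexUpRequests) {
    this.pendingFlexUpRequests = pendingFlexUpRequests;
  }

  public void registerNM(String slaveId, boolean zeroProfile) {
    runningNMs.put(slaveId, zeroProfile);
  }

  public Action decide(String offeredSlaveId) {
    boolean nmRunning = runningNMs.containsKey(offeredSlaveId);
    if (pendingFlexUpRequests > 0 && !nmRunning) {
      pendingFlexUpRequests--;
      return Action.LAUNCH_NEW_NM;   // use the offer to launch a new NM
    }
    if (nmRunning && runningNMs.get(offeredSlaveId)) {
      return Action.EXPAND_ZERO_NM;  // grow a (0G,0CPU) NM's capacity via FGS
    }
    return Action.REJECT;            // decline the offer
  }
}
```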
@smarella (Author)
Updated documentation for FGS (smarella@5611a8c)

DarinJ commented Jul 29, 2015

Tried this out, but have so far been unsuccessful. The RM launches one large NM on startup, and I can flex up two zero NMs, but they don't flex up. I'm currently debugging this; it might be on my end.

I'm +1ing @yufeldman's comments on the Map.Entry Iterator. Also, there's inconsistent indentation, which is minor but annoying when editing. Saw a few more minor issues and will do a more thorough review later.

An inline review comment on this code:

```java
public static Resource getYarnResourcesFromMesosOffers(Collection<Offer> offers) {
  double cpus = 0.0;
  double mem = 0.0;
```

DarinJ:
Observed that when running MapReduce tasks, I was unable to get full memory utilization from Mesos offers. It looks like this was because I had a high memory-to-cpu ratio and containers required only 1GB RAM. This caused many 1GB/1CPU tasks to be created until no more CPUs were available. Suggestion: have a configurable mesosToYarnCPURatio and return floor(cpus/mesosToYarnCPURatio).
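The suggested configurable ratio could look roughly like this. A minimal sketch; the class, method, and parameter names are illustrative, not Myriad's actual API:

```java
// Sketch of a configurable Mesos-cpu to YARN-vcore ratio, so fewer, larger
// containers are carved out of an offer when memory is plentiful.
public class CpuRatio {
  // e.g. with mesosToYarnCpuRatio = 2.0, an offer of 8 Mesos cpus yields 4 vcores
  public static int yarnVcoresFromMesosCpus(double mesosCpus, double mesosToYarnCpuRatio) {
    return (int) Math.floor(mesosCpus / mesosToYarnCpuRatio);
  }
}
```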

@smarella (Author):
Isn't that the desired behavior? A map container usually takes 1CPU/1G. If Mesos offered 10G/4CPU, then four 1CPU/1G Mesos tasks will be launched. The remaining 6G of memory is then available for Mesos to offer to other frameworks.

DarinJ:
I'd argue there are a few use cases, especially for machines with faster but fewer cores and lots of RAM. I could also see the ratio being greater than one, since in Mesos you can increase the cpu resource past the physical cores of the machine (I did this so I could test FGS).

@smarella (Author):
I now see your point. Any idea how other mesos frameworks handle this? Sounds like every framework needs this ratio for CPUs.

@adam-mesos Are there any guidelines for framework developers around how to interpret Mesos offered CPU resources?

adam-mesos:
@smarella @DarinJ Sorry for my slow response. Glad this got merged in anyway.
Mesos translates a request for "1" cpu out of a node's total of "8" cpu resources into relative cpu-shares, so "1 cpu" does not mean you get pinned to a specific core, and you are even likely to get more than your requested cpu percentage if there are idle cpus. Only during a cpu-intensive period will your task get throttled to its requested share. As such, "0.1 cpus" could be a reasonable cpu request, especially for best-effort jobs.
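The relative-shares model described above can be illustrated with a small helper. This assumes the conventional Linux cgroups scale of 1024 cpu.shares per cpu; the helper itself is illustrative, not Mesos code:

```java
// Illustration of fractional cpus mapping to relative cgroups shares:
// a request is a proportional weight under contention, not a core pin.
public class CpuShares {
  public static long sharesFor(double cpus) {
    return Math.round(cpus * 1024);  // e.g. 0.1 cpus -> ~102 shares
  }
}
```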

DarinJ commented Aug 4, 2015

Tried it out on Hadoop 2.6.x and 2.7.x clusters and it worked as expected; as noted in the line comments, I couldn't get full memory utilization. Comments from #91 still apply. Seems like a good start.

- Defined a new 'myriadSchedulerConf' to exclude hadoop dependencies from being copied into myriad-scheduler/build/libs. Tests are now unaffected.
- Fixed a few FindBugs issues.
- In ResourceOffersEventHandler, fixed code to ensure an offer used to launch a zero profile NM is not used again for FGS.
- Fixed a test in TestMyriadScheduler. The test for NodeStore needs to be coded differently due to the recent FGS changes. Will be done in a separate changeset.

smarella commented Aug 6, 2015

@DarinJ @yufeldman - Thanks for reviewing the changes. I've addressed the review feedback as part of smarella@ef328d1.

smarella added a commit that referenced this pull request Aug 11, 2015
Move to hadoop-2.7, improvements to FGS
@smarella smarella merged commit a4ceb36 into mesos:issue_14 Aug 11, 2015