[SPARK-4325] Use pssh. #80

Closed · wants to merge 12 commits

Conversation

nchammas

Replace some bash-isms with pssh to neatly parallelize cluster operations.

Also, decrease questionably high sleep times.
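
To make the shape of the change concrete, here is a minimal sketch of the pattern being replaced and its pssh equivalent (flags, paths, and the hosts file are illustrative, not lifted from the patch):

# Before: background one ssh per slave, then wait for all of them.
for slave in $(cat /root/spark-ec2/slaves); do
  ssh $SSH_OPTS "$slave" "echo -n" &
done
wait

# After: a single parallel invocation via pssh.
pssh -i -h /root/spark-ec2/slaves -l root -t 60 -O StrictHostKeyChecking=no "echo -n"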

break;
fi
done
echo "SSH-ing to all cluster nodes to approve keys..."
@nchammas (author)

This is less a performance optimization and more just code trimming. Now that we check for SSH availability across the cluster in spark_ec2.py, there is no need for the retry logic here.
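
For reference, the idea behind that cluster-wide check is roughly the following (a bash sketch of logic that actually lives in spark_ec2.py; the variables are illustrative):

# Poll every node until a trivial SSH command succeeds on all of them,
# instead of retrying per node inside setup.sh.
while true; do
  all_up=true
  for node in $MASTERS $SLAVES; do
    ssh $SSH_OPTS "$node" true > /dev/null 2>&1 || { all_up=false; break; }
  done
  $all_up && break
  sleep 5
done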

Reviewer

Just curious: Does spark_ec2.py now fail if one slave doesn't come up? Sometimes when I launch a large number of machines (>100) I run into cases where 1 or 2 machines just don't come up.

@nchammas (author)

Good question. I know @danosipov reported a similar issue.

With the SSH wait logic in spark_ec2.py, we will just wait indefinitely if a slave never comes up. That's because we wait until SSH becomes available across all the nodes in the cluster before proceeding with setup.sh.

Perhaps in the future we'll want to do something more intelligent, but for now I presume this is OK.

@nchammas

cc @shivaram @danosipov

sleep 0.3
done
ssh $SSH_OPTS localhost echo -n &
ssh $SSH_OPTS `hostname` echo -n &
@nchammas (author)

I'm not sure what this line is for. Presumably, if we've SSHed to $MASTERS and localhost, we don't need hostname, no?

Reviewer

On EC2 there are two hostnames: one internal (of the form ip-127-1-1-1.ec2.internal) and one external (of the form ec2-54-227-51-123.compute-1.amazonaws.com). We typically pass the latter in $MASTERS, while hostname usually returns the former.

Even though we try to only use the external hostname in all our configs, it is better practice to approve keys for both hostnames.
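
Concretely, covering both names could look something like this (same shape as the snippet above; whether it stays as plain ssh or gets folded into the pssh call is an implementation detail):

# Approve host keys for the external name(s) passed in $MASTERS...
for master in $MASTERS; do
  ssh $SSH_OPTS "$master" echo -n &
done
# ...and for the names the master knows itself by (internal hostname and localhost).
ssh $SSH_OPTS localhost echo -n &
ssh $SSH_OPTS "$(hostname)" echo -n &
wait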

@nchammas (author)

OK, I'll add it to the pssh version of the call.

@nchammas (author)

Done.

@shivaram

This looks good to me. But I still need to launch a cluster using this change just to verify things. Will merge this after doing that.

@shivaram

I did notice one issue. We run rm -f /root/.ssh/known_hosts in setup-slave.sh, so approving SSH keys before that might not be useful? Previously, setup-slave.sh was run at the beginning on the master, and that has changed now. Could we restore this behavior?

@nchammas

Oh, derp. That will cost us 10 seconds.

But wait, why do we delete known_hosts in the first place?

@shivaram

There is a comment above it [1] that says we do it to get rid of old hosts that accumulate across start/stops. One solution is to just move that out of setup-slave.sh and to the top of setup.sh (but then we'll need to do it on the slaves?)
[1] https://github.com/mesos/spark-ec2/blob/v4/setup-slave.sh#L99
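
Sketched, that move could look like this (the pssh invocation and hosts-file path are illustrative, not the actual patch):

# At the top of setup.sh: clear stale host keys on the master...
rm -f /root/.ssh/known_hosts
# ...and, if the slaves should be cleaned too, do it in one parallel pass.
pssh -h /root/spark-ec2/slaves -l root -t 60 "rm -f /root/.ssh/known_hosts"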

@nchammas

Ah, I'm currently on a mobile device; didn't check the source.

OK I'll revisit this later this week and see what to do. Maybe the old way will do fine for now.

@nchammas

@shivaram How about simply approving SSH keys twice? Once at the beginning and once after setup-slave.sh is run. It's a cheap operation. That's what I've done with these latest commits.
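
Roughly, the flow would be the following (the helper name is hypothetical and the node list is illustrative):

# Hypothetical helper: one trivial ssh per node records its host key,
# relying on SSH_OPTS disabling strict host key checking.
approve_ssh_keys() {
  for node in $MASTERS $SLAVES localhost "$(hostname)"; do
    ssh $SSH_OPTS "$node" true &
  done
  wait
}

approve_ssh_keys    # first pass, before slave setup
# ... setup-slave.sh runs on the nodes and removes /root/.ssh/known_hosts ...
approve_ssh_keys    # second pass, restoring the approvals that were just wiped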

@shivaram

Hmm, this should be fine. Do you know how much latency this adds? (I'm wondering if it's a bad idea for large clusters, like >100 machines.)

@nchammas

I'll test it out with a 50-node cluster and report back.

@nchammas

I'm using a new AWS account and have lower instance limits than expected, so I can't spin up a 50- or 100-node cluster. Just submitted a request for a higher limit.

Note that I intend to revert d9333af as soon as I get to this testing.

@nchammas

@shivaram, the added latency is as follows:

  • 50 slaves: 1.7s
  • 100 slaves: 3.4s

I haven't tested with larger clusters, but it seems reasonable to extrapolate that for a cluster with 500 slaves, it would add at least 17 seconds of latency.

Is this acceptable, or shall we go back to running setup-slave.sh separately on the master? Doing that adds a fixed 10-second latency, whereas this approach adds a variable latency that depends on the number of slaves.

@shivaram commented Dec 1, 2014

Thanks for the measurements. One thing I am wondering is whether we can remove the rm -rf ~/.ssh/known_hosts from setup-slave.sh and move it to setup.sh as the first line. We only ssh from the master to the slaves for almost all our operations, so not removing known_hosts on the slaves should be fine.

@nchammas

Hey @shivaram, I actually tried removing both instances where we pre-approve SSH keys, and launching worked fine thanks to these options deactivating strict host key checking.

Is there any need for the pre-approval step? I think we can just take it out entirely, no?
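
For reference, the options in question are along these lines (an assumed sketch of SSH_OPTS, not quoted from setup.sh):

# StrictHostKeyChecking=no auto-accepts unknown host keys on first contact,
# so the first real ssh or rsync to a node records its key without a prompt.
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=5"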

@shivaram

Hmm, you are right. The SSH_OPTS should take care of approving the keys the first time we ssh (will the first time be using rsync?).

I think the original reason for adding this was that if you tried to use, say, start-dfs.sh or start-all.sh in Hadoop/Spark, they expect SSH to work without any prompts. If it works without the need for pre-approval, then it should be fine to remove it.
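
A quick way to sanity-check that requirement from the master (illustrative; assumes a slaves list at /root/spark-ec2/slaves):

# BatchMode=yes makes ssh fail rather than prompt, so this only succeeds
# for slaves that are reachable without any interactive approval.
while read -r slave; do
  ssh -o BatchMode=yes $SSH_OPTS "$slave" true || echo "would prompt: $slave"
done < /root/spark-ec2/slaves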

@nchammas force-pushed the setup-sh-use-pssh branch 3 times, most recently from 0a9a082 to 3a7f4ba on December 24, 2014
@nchammas

Yes, the first time is with rsync here. I was able to run start-all.sh and call hadoop fs -ls / successfully in this manner.

I messed up the rebase on this PR, so it is now superseded by #85.

@nchammas closed this Dec 24, 2014
@nchammas deleted the setup-sh-use-pssh branch on December 24, 2014