[SPARK-4325] Use pssh. #80

Closed · wants to merge 12 commits

Conversation

nchammas

Replace some bash-isms with pssh to neatly parallelize cluster operations.

Also, decrease questionably high sleep times.
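
To make the shape of the change concrete, here is a minimal sketch of the pattern being replaced and its pssh equivalent (flags, paths, and the hosts file are illustrative, not lifted from the patch):

# Before: background one ssh per slave, then wait for all of them.
for slave in $(cat /root/spark-ec2/slaves); do
  ssh $SSH_OPTS "$slave" "echo -n" &
done
wait

# After: a single parallel invocation via pssh.
pssh -i -h /root/spark-ec2/slaves -l root -t 60 -O StrictHostKeyChecking=no "echo -n"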

break;
fi
done
echo "SSH-ing to all cluster nodes to approve keys..."
@nchammas (author)

This is less a performance optimization and more just code trimming. Now that we check for SSH availability across the cluster in spark_ec2.py, there is no need for the retry logic here.
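
For reference, the idea behind that cluster-wide check is roughly the following (a bash sketch of logic that actually lives in spark_ec2.py; the variables are illustrative):

# Poll every node until a trivial SSH command succeeds on all of them,
# instead of retrying per node inside setup.sh.
while true; do
  all_up=true
  for node in $MASTERS $SLAVES; do
    ssh $SSH_OPTS "$node" true > /dev/null 2>&1 || { all_up=false; break; }
  done
  $all_up && break
  sleep 5
done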

Reviewer

Just curious: Does spark_ec2.py now fail if one slave doesn't come up? Sometimes when I launch a large number of machines (>100) I run into cases where 1 or 2 machines just don't come up.

@nchammas (author)

Good question. I know @danosipov reported a similar issue.

With the SSH wait logic in spark_ec2.py, we will just wait indefinitely if a slave never comes up. That's because we wait until SSH becomes available across all the nodes in the cluster before proceeding with setup.sh.

Perhaps in the future we'll want to do something more intelligent, but for now I presume this is OK.

@nchammas

cc @shivaram @danosipov

sleep 0.3
done
ssh $SSH_OPTS localhost echo -n &
ssh $SSH_OPTS `hostname` echo -n &
@nchammas (author)

I'm not sure what this line is for. Presumably, if we've SSHed to $MASTERS and localhost, we don't need hostname, no?

Reviewer

On EC2 there are two hostnames: one internal (of the form ip-127-1-1-1.ec2.internal) and one external (of the form ec2-54-227-51-123.compute-1.amazonaws.com). We typically pass the latter in $MASTERS, while hostname usually returns the former.

Even though we try to only use the external hostname in all our configs, it is better practice to approve keys for both hostnames.
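
Concretely, covering both names could look something like this (same shape as the snippet above; whether it stays as plain ssh or gets folded into the pssh call is an implementation detail):

# Approve host keys for the external name(s) passed in $MASTERS...
for master in $MASTERS; do
  ssh $SSH_OPTS "$master" echo -n &
done
# ...and for the names the master knows itself by (internal hostname and localhost).
ssh $SSH_OPTS localhost echo -n &
ssh $SSH_OPTS "$(hostname)" echo -n &
wait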

@nchammas (author)

OK, I'll add it to the pssh version of the call.

@nchammas (author)

Done.

@shivaram

This looks good to me. But I still need to launch a cluster using this change just to verify things. Will merge this after doing that.

@shivaram

I did notice one issue. We run rm -f /root/.ssh/known_hosts in setup-slave.sh, so approving SSH keys before that might not be useful? Previously, setup-slave.sh was run at the beginning on the master, and that has changed now. Could we restore this behavior?

@nchammas

Oh, derp. That will cost us 10 seconds.

But wait, why do we delete known_hosts in the first place?

@shivaram

There is a comment above it [1] that says we do it to get rid of old hosts that accumulate across start/stops. One solution is to just move that out of setup-slave.sh and to the top of setup.sh (but then we'll need to do it on the slaves?)
[1] https://github.com/mesos/spark-ec2/blob/v4/setup-slave.sh#L99
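
Sketched, that move could look like this (the pssh invocation and hosts-file path are illustrative, not the actual patch):

# At the top of setup.sh: clear stale host keys on the master...
rm -f /root/.ssh/known_hosts
# ...and, if the slaves should be cleaned too, do it in one parallel pass.
pssh -h /root/spark-ec2/slaves -l root -t 60 "rm -f /root/.ssh/known_hosts"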

@nchammas

Ah, I'm currently on a mobile device; didn't check the source.

OK I'll revisit this later this week and see what to do. Maybe the old way will do fine for now.

@nchammas

@shivaram How about simply approving SSH keys twice? Once at the beginning and once after setup-slave.sh is run. It's a cheap operation. That's what I've done with these latest commits.
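
Roughly, the flow would be the following (the helper name is hypothetical and the node list is illustrative):

# Hypothetical helper: one trivial ssh per node records its host key,
# relying on SSH_OPTS disabling strict host key checking.
approve_ssh_keys() {
  for node in $MASTERS $SLAVES localhost "$(hostname)"; do
    ssh $SSH_OPTS "$node" true &
  done
  wait
}

approve_ssh_keys    # first pass, before slave setup
# ... setup-slave.sh runs on the nodes and removes /root/.ssh/known_hosts ...
approve_ssh_keys    # second pass, restoring the approvals that were just wiped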

@shivaram

Hmm, this should be fine. Do you know how much latency this adds? (I'm wondering if it's a bad idea for large clusters, like >100 machines.)

@nchammas

I'll test it out with a 50-node cluster and report back.

@nchammas

I'm using a new AWS account and have lower instance limits than expected, so I can't spin up a 50- or 100-node cluster. Just submitted a request for a higher limit.

Note that I intend to revert d9333af as soon as I get to this testing.

@nchammas

@shivaram, the added latency is as follows:

  • 50 slaves: 1.7s
  • 100 slaves: 3.4s

I haven't tested with larger clusters, but it seems reasonable to extrapolate that for a cluster with 500 slaves, it would add at least 17 seconds of latency.

Is this acceptable, or shall we go back to running setup-slave.sh separately on the master? Doing that adds a fixed 10-second latency, whereas this approach adds a variable latency that depends on the number of slaves.

@shivaram commented Dec 1, 2014

Thanks for the measurements. One thing I am wondering is whether we can remove the rm -rf ~/.ssh/known_hosts from setup-slave.sh and move it to setup.sh as the first line. We only ssh from the master to the slaves for almost all our operations, so not removing known_hosts on the slaves should be fine.

@nchammas

Hey @shivaram, I actually tried removing both instances where we pre-approve SSH keys, and launching worked fine thanks to these options deactivating strict host key checking.

Is there any need for the pre-approval step? I think we can just take it out entirely, no?
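
For reference, the options in question are along these lines (an assumed sketch of SSH_OPTS, not quoted from setup.sh):

# StrictHostKeyChecking=no auto-accepts unknown host keys on first contact,
# so the first real ssh or rsync to a node records its key without a prompt.
SSH_OPTS="-o StrictHostKeyChecking=no -o ConnectTimeout=5"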

@shivaram

Hmm, you are right. The SSH_OPTS should take care of approving the keys the first time we ssh (will the first time be using rsync?).

I think the original reason for adding this was that if you tried to use, say, start-dfs.sh or start-all.sh in Hadoop/Spark, they expect SSH to work without any prompts. If it works without the need for pre-approval, then it should be fine to remove it.
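
A quick way to sanity-check that requirement from the master (illustrative; assumes a slaves list at /root/spark-ec2/slaves):

# BatchMode=yes makes ssh fail rather than prompt, so this only succeeds
# for slaves that are reachable without any interactive approval.
while read -r slave; do
  ssh -o BatchMode=yes $SSH_OPTS "$slave" true || echo "would prompt: $slave"
done < /root/spark-ec2/slaves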

@nchammas force-pushed the setup-sh-use-pssh branch 3 times, most recently from 0a9a082 to 3a7f4ba on December 24, 2014
@nchammas

Yes, the first time is with rsync here. I was able to run start-all.sh and call hadoop fs -ls / successfully in this manner.

I messed up the rebase on this PR, so it is now superseded by #85.

@nchammas closed this Dec 24, 2014
@nchammas deleted the setup-sh-use-pssh branch on December 24, 2014