[SPARK-4325] Use pssh. #80

Closed. Wants to merge 12 commits.
13 changes: 9 additions & 4 deletions create_image.sh
@@ -48,8 +48,7 @@ done

# Install Maven (for Hadoop)
cd /tmp
wget "http://apache.osuosl.org/maven/maven-3/3.2.3/binaries/"\
"apache-maven-3.2.3-bin.tar.gz"
wget "http://archive.apache.org/dist/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz"
tar xvzf apache-maven-3.2.3-bin.tar.gz
mv apache-maven-3.2.3 /opt/

@@ -65,8 +64,7 @@ source ~/.bash_profile
sudo mkdir /root/hadoop-native
cd /tmp
sudo yum install -y protobuf-compiler cmake openssl-devel
wget "http://apache.mirrors.tds.net/hadoop/common/"\
"hadoop-2.4.1/hadoop-2.4.1-src.tar.gz"
wget "http://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz"
tar xvzf hadoop-2.4.1-src.tar.gz
cd hadoop-2.4.1-src
mvn package -Pdist,native -DskipTests -Dtar
@@ -75,3 +73,10 @@ sudo mv hadoop-dist/target/hadoop-2.4.1/lib/native/* /root/hadoop-native
# Install Snappy lib (for Hadoop)
yum install -y snappy
ln -sf /usr/lib64/libsnappy.so.1 /root/hadoop-native/.

+# Create /usr/bin/realpath which is used by R to find Java installations
+# NOTE: /usr/bin/realpath is missing in CentOS AMIs. See
+# http://superuser.com/questions/771104/usr-bin-realpath-not-found-in-centos-6-5
+echo '#!/bin/bash' > /usr/bin/realpath
+echo 'readlink -e "$@"' >> /usr/bin/realpath
+chmod a+x /usr/bin/realpath
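(Illustrative note, not part of the diff: because the shim delegates to readlink -e, it canonicalizes paths much like GNU realpath, and likewise fails when the target does not exist. A hypothetical session on the built image:)

    $ realpath /usr/../usr/bin
    /usr/bin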
7 changes: 7 additions & 0 deletions setup-slave.sh
@@ -107,3 +107,10 @@ echo 1 > /proc/sys/vm/overcommit_memory
# Add github to known hosts to get git@github.com clone to work
# TODO(shivaram): Avoid duplicate entries ?
cat /root/spark-ec2/github.hostkey >> /root/.ssh/known_hosts

+# Create /usr/bin/realpath which is used by R to find Java installations
+# NOTE: /usr/bin/realpath is missing in CentOS AMIs. See
+# http://superuser.com/questions/771104/usr-bin-realpath-not-found-in-centos-6-5
+echo '#!/bin/bash' > /usr/bin/realpath
+echo 'readlink -e "$@"' >> /usr/bin/realpath
+chmod a+x /usr/bin/realpath
70 changes: 24 additions & 46 deletions setup.sh
@@ -1,5 +1,7 @@
#!/bin/bash

+yum install -y pssh

# Make sure we are in the spark-ec2 directory
cd /root/spark-ec2

@@ -9,6 +11,14 @@ source /root/.bash_profile
# Load the cluster variables set by the deploy script
source ec2-variables.sh

+function approve_ssh_keys () {
+  pssh --inline \
+    --host "localhost $(hostname) $MASTERS $SLAVES" \
+    --user root \
+    --extra-args "$SSH_OPTS" \
+    ":"
+}

# Set hostname based on EC2 private DNS name, so that it is set correctly
# even if the instance is restarted with a different private DNS name
PRIVATE_DNS=`wget -q -O - http://instance-data.ec2.internal/latest/meta-data/local-hostname`
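(For readers new to pssh: approve_ssh_keys above is roughly the parallel form of the sequential loop sketched below. This assumes, as the rest of the script appears to, that $SSH_OPTS disables strict host-key checking so each first connection records the host key in known_hosts.)

    # Sequential sketch of approve_ssh_keys (illustration only):
    for host in localhost $(hostname) $MASTERS $SLAVES; do
        ssh $SSH_OPTS root@"$host" ":"   # ':' is a no-op; we connect only to approve the key
    done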
@@ -42,60 +52,28 @@ fi
echo "Setting executable permissions on scripts..."
find . -regex "^.+.\(sh\|py\)" | xargs chmod a+x

echo "Running setup-slave on master to mount filesystems, etc..."
source ./setup-slave.sh

echo "SSH'ing to master machine(s) to approve key(s)..."
for master in $MASTERS; do
echo $master
ssh $SSH_OPTS $master echo -n &
sleep 0.3
done
ssh $SSH_OPTS localhost echo -n &
ssh $SSH_OPTS `hostname` echo -n &
wait

# Try to SSH to each cluster node to approve their key. Since some nodes may
# be slow in starting, we retry failed slaves up to 3 times.
TODO="$SLAVES $OTHER_MASTERS" # List of nodes to try (initially all)
TRIES="0" # Number of times we've tried so far
echo "SSH'ing to other cluster nodes to approve keys..."
while [ "e$TODO" != "e" ] && [ $TRIES -lt 4 ] ; do
NEW_TODO=
for slave in $TODO; do
echo $slave
ssh $SSH_OPTS $slave echo -n
if [ $? != 0 ] ; then
NEW_TODO="$NEW_TODO $slave"
fi
done
TRIES=$[$TRIES + 1]
if [ "e$NEW_TODO" != "e" ] && [ $TRIES -lt 4 ] ; then
sleep 15
TODO="$NEW_TODO"
echo "Re-attempting SSH to cluster nodes to approve keys..."
else
break;
fi
done
# echo "SSH-ing to all cluster nodes to approve keys..."
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this line is for. Presumably, if we've SSHed to $MASTERS and localhost, we don't need hostname, no?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On EC2 there are two hostnames, one internal (of the form ip-127-1-1-1.ec2.internal) and one external (of the form ec2-54-227-51-123.compute-1.amazonaws.com) -- We typically pass in the latter in $MASTERS and hostname usually returns the former.

Even though we try to only use the external hostname in all our configs, it is better practice to approve keys for both hostnames

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll add it in to the pssh version of the call.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# approve_ssh_keys
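(Illustrative aside, not part of the PR: both hostname forms are available from the EC2 instance metadata service, so a script could approve keys for each explicitly. The metadata paths below are standard EC2 ones; treat the snippet as a sketch.)

    INTERNAL_NAME=$(wget -q -O - http://instance-data.ec2.internal/latest/meta-data/local-hostname)
    EXTERNAL_NAME=$(wget -q -O - http://instance-data.ec2.internal/latest/meta-data/public-hostname)
    ssh $SSH_OPTS "$INTERNAL_NAME" ":"   # approve key for the internal name
    ssh $SSH_OPTS "$EXTERNAL_NAME" ":"   # approve key for the external name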

echo "RSYNC'ing /root/spark-ec2 to other cluster nodes..."
for node in $SLAVES $OTHER_MASTERS; do
echo $node
rsync -e "ssh $SSH_OPTS" -az /root/spark-ec2 $node:/root &
scp $SSH_OPTS ~/.ssh/id_rsa $node:.ssh &
sleep 0.3
sleep 0.1
done
wait

+# NOTE: We need to rsync spark-ec2 before we can run setup-slave.sh
+# on other cluster nodes
-echo "Running slave setup script on other cluster nodes..."
-for node in $SLAVES $OTHER_MASTERS; do
-  echo $node
-  ssh -t -t $SSH_OPTS root@$node "spark-ec2/setup-slave.sh" & sleep 0.3
-done
-wait
+echo "Running setup-slave on all cluster nodes to mount filesystems, etc..."
+pssh --inline \
+  --host "$MASTERS $SLAVES" \
+  --user root \
+  --extra-args "-t -t $SSH_OPTS" \
+  "spark-ec2/setup-slave.sh"
+
+# echo "SSH-ing to all cluster nodes to re-approve keys..."
+# We do this again because setup-slave.sh clears out .ssh/known_hosts.
+# approve_ssh_keys

Review thread on the pssh --inline call:

Author: With the --inline flag, pssh will echo all of stdout and stderr back to the master. We should remove this if we think it will be an issue (specifically when launching large clusters), at the cost of visibility.

Author: @shivaram Do you think echoing stdout and stderr from all the slaves back to the master could be problematic?

Reviewer: AFAIK this was the old behavior too? If so it should be fine.
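(Aside on the --inline question, as a hedged sketch rather than a recommendation: pssh can instead write each host's stdout and stderr to per-host files via its --outdir and --errdir options, which avoids flooding the master's console on large clusters at the cost of immediate visibility.)

    pssh --outdir /tmp/setup-slave-out \
         --errdir /tmp/setup-slave-err \
         --host "$MASTERS $SLAVES" \
         --user root \
         --extra-args "-t -t $SSH_OPTS" \
         "spark-ec2/setup-slave.sh"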

# Always include 'scala' module if it's not defined as a work around
# for older versions of the scripts.
@@ -126,6 +104,6 @@ chmod u+x /root/spark/conf/spark-env.sh
for module in $MODULES; do
  echo "Setting up $module"
  source ./$module/setup.sh
-  sleep 1
+  sleep 0.1
  cd /root/spark-ec2  # guard against setup.sh changing the cwd
done

Review thread on the reduced sleep:

Author: Not sure why we are sleeping here, but I reduced the sleep time. I was able to launch and do some basic operations on a cluster without issue after making this change.

Reviewer: I think this sleep call was just defensive. Sometimes services take time to come up, and this just provides some buffer between starting them. It should be fine to reduce this.
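(If this ever needs to be more robust than a fixed sleep, one hypothetical alternative is to poll for the service instead. A sketch, assuming the relevant service exposes a TCP port and that nc is available on the AMI; the helper name is made up:)

    # Wait until host:port accepts connections, trying for up to ~30 seconds.
    wait_for_port () {
        local host="$1" port="$2" tries=30
        until nc -z "$host" "$port" 2>/dev/null; do
            tries=$((tries - 1))
            [ "$tries" -le 0 ] && return 1   # give up after 30 attempts
            sleep 1
        done
    }
    # Example: wait_for_port localhost 8080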
14 changes: 14 additions & 0 deletions spark/init.sh
@@ -95,6 +95,20 @@ else
      wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.0-bin-cdh4.tgz
    fi
    ;;
+  1.1.1)
+    if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
+      wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.1-bin-hadoop1.tgz
+    else
+      wget http://s3.amazonaws.com/spark-related-packages/spark-1.1.1-bin-cdh4.tgz
+    fi
+    ;;
+  1.2.0)
+    if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
+      wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-hadoop1.tgz
+    else
+      wget http://s3.amazonaws.com/spark-related-packages/spark-1.2.0-bin-cdh4.tgz
+    fi
+    ;;
  *)
    echo "ERROR: Unknown Spark version"
    return
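(Observation, not a change in this PR: the version cases differ only in the version string, so they could in principle collapse into one parameterized download, assuming the S3 naming scheme holds for every listed release. SPARK_VERSION below stands for whatever variable the surrounding case statement switches on; the name is hypothetical.)

    if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
      wget "http://s3.amazonaws.com/spark-related-packages/spark-${SPARK_VERSION}-bin-hadoop1.tgz"
    else
      wget "http://s3.amazonaws.com/spark-related-packages/spark-${SPARK_VERSION}-bin-cdh4.tgz"
    fi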