Better handle EC2 spot interruptions in AWS Batch WorkerManager
nelson-liu committed Oct 17, 2020
1 parent 32cd704 commit 265f875
Showing 3 changed files with 26 additions and 1 deletion.
4 changes: 3 additions & 1 deletion codalab/worker_manager/aws_batch_worker_manager.py
@@ -96,7 +96,9 @@ def start_worker_job(self):
         )
         # This needs to be a unique directory since Batch jobs may share a host
         work_dir = os.path.join(work_dir_prefix, 'cl_worker_{}_work_dir'.format(worker_id))
-        command = self.build_command(worker_id, work_dir)
+        command = "/opt/scripts/detect-ec2-spot-preemption.sh & " + self.build_command(
+            worker_id, work_dir
+        )
 
         # https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-batch-jobdefinition.html
         # Need to mount:
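
The change above prepends the watcher script, backgrounded with `&`, to the worker command. The diff does not show what `build_command` returns or how Batch executes the resulting value, so the following is only a minimal sketch under assumptions: `with_spot_watcher` is a hypothetical helper (not part of CodaLab), and the composed string is assumed to be run by a shell (for example via `sh -c`), since `&` has no meaning otherwise. It also shows that plain `+` concatenation only works when the worker command is already a string; a list of arguments would need to be joined first.

```python
import shlex
from typing import List, Union

# Path taken from the diff; the helper below is hypothetical.
WATCHER = "/opt/scripts/detect-ec2-spot-preemption.sh"


def with_spot_watcher(worker_command: Union[str, List[str]]) -> str:
    """Return a shell command line that starts the watcher in the
    background and then runs the worker command in the foreground."""
    if not isinstance(worker_command, str):
        # Quote each argument so spaces survive shell parsing.
        worker_command = " ".join(shlex.quote(arg) for arg in worker_command)
    return WATCHER + " & " + worker_command


if __name__ == "__main__":
    print(with_spot_watcher(["cl-worker", "--id", "worker 1"]))
```

Accepting both forms keeps the sketch independent of the return type of `build_command`, which this commit does not show.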
2 changes: 2 additions & 0 deletions docker/dockerfiles/Dockerfile.worker
@@ -34,6 +34,7 @@ RUN echo "{\"credsStore\": \"ecr-login\"}" >> ~/.docker/config.json
 
 WORKDIR /opt
 RUN mkdir ${WORKDIR}/codalab
+RUN mkdir ${WORKDIR}/scripts
 
 # Install dependencies
 COPY requirements.txt requirements.txt
@@ -44,6 +45,7 @@ RUN python3.6 -m pip install --user --no-cache-dir --upgrade pip; \
 COPY codalab/lib codalab/lib
 COPY codalab/worker codalab/worker
 COPY codalab/common.py codalab/common.py
+COPY scripts/detect-ec2-spot-preemption.sh scripts/detect-ec2-spot-preemption.sh
 COPY setup.py setup.py
 
 RUN python3 -m pip install --no-cache-dir -e .
21 changes: 21 additions & 0 deletions scripts/detect-ec2-spot-preemption.sh
@@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+
+while true
+do
+    # This IP address comes from:
+    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices
+    # It's a special endpoint set up by AWS whereby AWS instances can view metadata about themselves.
+    # One such piece of metadata is the termination time, which is only set when the spot instance is to be
+    # pre-empted (you get a 404 otherwise).
+    # This script was partially taken from https://stackoverflow.com/q/32613600/14089059 .
+    if [ -z "$(curl -Is http://169.254.169.254/latest/meta-data/spot/termination-time | head -1 | grep 404 | cut -d ' ' -f 2)" ]
+    then
+        echo "EC2 spot instance scheduled for shutdown."
+        echo "Sending SIGTERM to CodaLab workers"
+        # Kill all cl-workers in the EC2 instance.
+        pgrep -f "cl-worker" | xargs kill
+    else
+        # Instance not yet marked for termination, so sleep and check again in 5 seconds.
+        sleep 5
+    fi
+done
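
The test in the script treats any response whose status line does not contain `404` as a termination notice, which includes an empty status line (for example when `curl` fails). The following Python sketch, with hypothetical function names that are not part of the commit, approximates that decision next to a stricter variant that reacts only to an explicit `200`. It is an illustration of the logic under those assumptions, not a replacement for the script.

```python
from typing import Optional


def status_code(status_line: str) -> Optional[int]:
    """Extract the numeric code from a line like 'HTTP/1.1 404 Not Found'."""
    parts = status_line.split()
    if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
        return int(parts[1])
    return None


def script_treats_as_preempted(status_line: str) -> bool:
    """Approximate the shell test: the kill branch runs whenever the
    first response line does not contain '404'."""
    return "404" not in status_line


def strict_preempted(status_line: str) -> bool:
    """Stricter variant (an assumption, not the commit's behavior):
    only an explicit 200 counts as a termination notice."""
    return status_code(status_line) == 200


if __name__ == "__main__":
    for line in ["HTTP/1.1 404 Not Found", "HTTP/1.1 200 OK", ""]:
        print(repr(line), script_treats_as_preempted(line), strict_preempted(line))
```

The two functions agree on `404` and `200` responses and differ on an empty or unexpected status line, where the script's test would still send SIGTERM to the workers.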
