Merge remote-tracking branch 'aws/master'
* aws/master:
  [doc] Minor corrections on available_images (aws#1061)
  [huggingface_pytorch] upgrade PyTorch to 1.7.1 (aws#1025)
  Fix for hf canary tests (aws#1057)
  [test][huggingface_tensorflow, huggingface_pytorch] SM local and remote tests (aws#1021)
  hf transformer version update (aws#1060)
  [build] update EIA buildspec and upgrade ruamel_yaml package (aws#1051)
  bump transformer version (aws#1056)
  Including hf images into canary tests (aws#1050)
  fix tf1 neuron buildspec (aws#1048)
  [build,test] Disable dedicated telemetry tests and tags (aws#1045)
  [test][sagemaker] Execute SM local tests in parallel (aws#1027)
  Temporarily disabling sagemaker tests for HF containers. (aws#1042)
  Add automatic yes to prompts for apt (aws#1043)
jeet4320 committed Apr 23, 2021
2 parents cd551be + 44b3738 commit 37123ad
Showing 49 changed files with 1,908 additions and 41 deletions.
21 changes: 11 additions & 10 deletions available_images.md
@@ -16,6 +16,7 @@ Once you've selected your desired Deep Learning Containers image, continue with


Deep Learning Containers Docker Images are available in the following regions:

| Region |Code |General Container |Elastic Inference Container|Neuron Container |Example URL |
|---------------------------|-------------------|-------------------|---------------------------|-------------------|-------------------------------------------------------------------------------------------|
|US East (N. Virginia) |us-east-1 |Available |Available |Available |763104351884.dkr.ecr.us-east-1.amazonaws.com/<repository-name>:<image-tag> |
@@ -101,29 +102,29 @@ You can pin your version by adding the version tag to your URL as follows:
|MXNet 1.8.0 |training |Yes |CPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.8.0-cpu-py37-ubuntu16.04 |
|MXNet 1.8.0 |inference |No |GPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.8.0-gpu-py37-cu110-ubuntu16.04 |
|MXNet 1.8.0 |inference |No |CPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.8.0-cpu-py37-ubuntu16.04 |
-|PyTorch 1.8.0 |training |Yes |GPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.0-gpu-py36-cu111-ubuntu18.04 |
-|PyTorch 1.8.0 |training |No |CPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.0-cpu-py36-ubuntu18.04 |
-|PyTorch 1.8.0 |inference |No |GPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.0-gpu-py36-cu111-ubuntu18.04 |
-|PyTorch 1.8.0 |inference |No |CPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.0-cpu-py36-ubuntu18.04 |
+|PyTorch 1.8.1 |training |Yes |GPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04 |
+|PyTorch 1.8.1 |training |No |CPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-cpu-py36-ubuntu18.04 |
+|PyTorch 1.8.1 |inference |No |GPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.1-gpu-py36-cu111-ubuntu18.04 |
+|PyTorch 1.8.1 |inference |No |CPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.1-cpu-py36-ubuntu18.04 |


HuggingFace training containers
===============================

-| Framework |Job Type |CPU/GPU |Python Version Options |Example URL |
-|-------------------|-----------|-----------|-----------------------|---------------------------------------------------------------------------------------------------|
-|PyTorch 1.6.0 with HuggingFace transformers |training |GPU | 3.6 (py36) |763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.6.0-transformers4.4.2-gpu-py36-cu110-ubuntu18.04 |
-|TensorFlow 2.4.1 with HuggingFace transformers |training |GPU | 3.7 (py37) |763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-tensorflow-training:2.4.1-transformers4.4.2-gpu-py37-cu110-ubuntu18.04 |
+| Framework |Job Type |CPU/GPU |Python Version Options |Example URL |
+|-----------------------------------------------|-----------|-----------|-----------------------|---------------------------------------------------------------------------------------------------|
+|PyTorch 1.6.0 with HuggingFace transformers |training |GPU | 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.6.0-transformers4.5.0-gpu-py36-cu110-ubuntu18.04 |
+|TensorFlow 2.4.1 with HuggingFace transformers |training |GPU | 3.7 (py37) |763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-tensorflow-training:2.4.1-transformers4.5.0-gpu-py37-cu110-ubuntu18.04 |


Elastic Inference Containers
============================

| Framework |Job Type |Horovod Options |CPU/GPU |Python Version Options |Example URL |
|---------------------------------------------|--------------|--------------------|------------|---------------------------|---------------------------------------------------------------------------------------------------|
-|TensorFlow 2.3.0 with Elastic Inference |inference |No |CPU |3.7 (py37) |763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference-eia:2.3.0-cpu-py37-ubuntu18.04 |
+|TensorFlow 2.3.0 with Elastic Inference |inference |No |CPU |3.7 (py37) |763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-eia:2.3.0-cpu-py37-ubuntu18.04 |
|TensorFlow 1.15.0 with Elastic Inference |inference |No |CPU |2.7 (py27), 3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu-py36-ubuntu18.04 |
-|MXNet 1.7.0 with Elastic Inference |inference |No |CPU |3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference-eia:1.7.0-cpu-py36-ubuntu16.04 |
+|MXNet 1.7.0 with Elastic Inference |inference |No |CPU |3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference-eia:1.7.0-cpu-py36-ubuntu16.04 |
|PyTorch 1.5.1 with Elastic Inference |inference |No |CPU |3.6 (py36) |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-eia:1.5.1-cpu-py36-ubuntu16.04 |
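
The example URLs in these tables all follow one pattern: a registry host built from an account ID and a region, then the repository name, then a pinned tag. A minimal sketch of that pattern follows. The `image_uri` helper is hypothetical, and its default account ID and region are simply the values that appear in the rows above; other regions may use different values, so treat both defaults as assumptions.

```python
def image_uri(repository, tag, region="us-east-1", account_id="763104351884"):
    """Compose an ECR image URL of the form shown in the tables above.

    The defaults are taken from the example rows and are assumptions,
    not a statement about every region.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Pinned tag for the PyTorch 1.5.1 Elastic Inference row.
eia_uri = image_uri("pytorch-inference-eia", "1.5.1-cpu-py36-ubuntu16.04")
print(eia_uri)

# Same pattern with an explicit region.
hf_uri = image_uri(
    "huggingface-pytorch-training",
    "1.6.0-transformers4.5.0-gpu-py36-cu110-ubuntu18.04",
    region="us-west-2",
)
print(hf_uri)
```

The resulting string is what would be passed to `docker pull` after authenticating against the registry.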


31 changes: 31 additions & 0 deletions huggingface/pytorch/buildspec-1-6.yml
@@ -0,0 +1,31 @@
account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
region: &REGION <set-$REGION-in-environment>
base_framework: &BASE_FRAMEWORK pytorch
framework: &FRAMEWORK !join [ "huggingface_", *BASE_FRAMEWORK]
version: &VERSION 1.6.0
short_version: &SHORT_VERSION 1.6

repository_info:
training_repository: &TRAINING_REPOSITORY
image_type: &TRAINING_IMAGE_TYPE training
root: !join [ "huggingface/", *BASE_FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
repository_name: &REPOSITORY_NAME !join ["pr", "-", "huggingface", "-", *BASE_FRAMEWORK, "-", *TRAINING_IMAGE_TYPE]
repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/,
*REPOSITORY_NAME ]

images:
BuildHuggingFacePytorchGpuPy37Cu110TrainingDockerImage:
<<: *TRAINING_REPOSITORY
build: &HUGGINGFACE_PYTORCH_GPU_TRAINING_PY3 false
image_size_baseline: &IMAGE_SIZE_BASELINE 15000
device_type: &DEVICE_TYPE gpu
python_version: &DOCKER_PYTHON_VERSION py3
tag_python_version: &TAG_PYTHON_VERSION py36
cuda_version: &CUDA_VERSION cu110
os_version: &OS_VERSION ubuntu18.04
transformers_version: &TRANSFORMERS_VERSION 4.5.0
datasets_version: &DATASETS_VERSION 1.5.0
tag: !join [ *VERSION, '-', 'transformers', *TRANSFORMERS_VERSION, '-', *DEVICE_TYPE, '-', *TAG_PYTHON_VERSION, '-',
*CUDA_VERSION, '-', *OS_VERSION ]
docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /,
*CUDA_VERSION, /Dockerfile., *DEVICE_TYPE ]
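
A minimal sketch of how the `tag` and `docker_file` values in this buildspec would resolve, assuming the repository's custom `!join` YAML tag simply concatenates its list items into one string. That behaviour is an assumption here, since the constructor is not part of this diff, and the `join` helper below is a hypothetical stand-in for it.

```python
def join(parts):
    """Concatenate the items in order (assumed behaviour of the !join tag)."""
    return "".join(str(part) for part in parts)


# Anchor values copied from buildspec-1-6.yml above.
version = "1.6.0"
short_version = "1.6"
device_type = "gpu"
docker_python_version = "py3"
tag_python_version = "py36"
cuda_version = "cu110"
os_version = "ubuntu18.04"
transformers_version = "4.5.0"

tag = join([version, "-", "transformers", transformers_version, "-", device_type,
            "-", tag_python_version, "-", cuda_version, "-", os_version])
docker_file = join(["docker/", short_version, "/", docker_python_version, "/",
                    cuda_version, "/Dockerfile.", device_type])

print(tag)
print(docker_file)
```

Under that assumption the tag matches the one in the updated HuggingFace PyTorch row of available_images.md.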
4 changes: 2 additions & 2 deletions huggingface/pytorch/buildspec.yml
@@ -2,8 +2,8 @@ account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
region: &REGION <set-$REGION-in-environment>
base_framework: &BASE_FRAMEWORK pytorch
framework: &FRAMEWORK !join [ "huggingface_", *BASE_FRAMEWORK]
-version: &VERSION 1.6.0
-short_version: &SHORT_VERSION 1.6
+version: &VERSION 1.7.1
+short_version: &SHORT_VERSION 1.7

repository_info:
training_repository: &TRAINING_REPOSITORY
@@ -13,8 +13,7 @@ RUN pip install --no-cache-dir \
transformers[sklearn,sentencepiece]==${TRANSFORMERS_VERSION} \
datasets==${DATASETS_VERSION}
RUN apt-get update \
-&& apt install git-lfs \
-unzip \
+&& apt install -y git-lfs \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

@@ -17,7 +17,7 @@ RUN pip install --no-cache-dir \
transformers[sklearn,sentencepiece]==${TRANSFORMERS_VERSION} \
datasets==${DATASETS_VERSION}
RUN apt-get update \
-&& apt install git-lfs \
+&& apt install -y git-lfs \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

@@ -15,7 +15,7 @@ RUN pip install --no-cache-dir \
tensorflow-addons==0.12.0 \
psutil
RUN apt-get update \
-&& apt install git-lfs \
+&& apt install -y git-lfs \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

@@ -19,7 +19,7 @@ RUN pip install --no-cache-dir \
tensorflow-addons==0.12.0 \
psutil
RUN apt-get update \
-&& apt install git-lfs \
+&& apt install -y git-lfs \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

4 changes: 2 additions & 2 deletions pytorch/buildspec-eia.yml
@@ -1,7 +1,7 @@
account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
region: &REGION <set-$REGION-in-environment>
framework: &FRAMEWORK pytorch
-version: &VERSION 1.3.1
+version: &VERSION 1.5.1

repository_info:
inference_repository: &INFERENCE_REPOSITORY
@@ -23,7 +23,7 @@
BuildEIAPTInferencePy3DockerImage:
<<: *INFERENCE_REPOSITORY
build: &PYTORCH_CPU_INFERENCE_PY3 false
-image_size_baseline: 4899
+image_size_baseline: 6318
device_type: &DEVICE_TYPE cpu
python_version: &DOCKER_PYTHON_VERSION py3
tag_python_version: &TAG_PYTHON_VERSION py36
3 changes: 2 additions & 1 deletion pytorch/inference/docker/1.5.1/py3/Dockerfile.eia
@@ -4,6 +4,7 @@ LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
LABEL dlc_major_version="1"

# Add arguments to achieve the version, python and url
ARG PYTHON=python3
ARG PYTHON_VERSION=3.6.13
ARG PYTORCH_VERSION=1.5.1
ARG TORCHVISION_VERSION=0.6.1
@@ -59,7 +60,7 @@ RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-lat
python=$PYTHON_VERSION \
&& /opt/conda/bin/conda install -y \
# conda 4.9.2 requires ruamel_yaml to be installed. Currently pinned at latest.
-ruamel_yaml==0.15.87 \
+ruamel_yaml==0.15.100 \
cython==0.29.12 \
ipython==7.7.0 \
numpy==1.19.1 \
3 changes: 2 additions & 1 deletion pytorch/inference/docker/1.5.1/py3/Dockerfile.neuron
@@ -4,6 +4,7 @@ LABEL dlc_major_version="1"
LABEL maintainer="Amazon AI"
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

ARG PYTHON=python3
ARG PYTHON_VERSION=3.6.13

# See http://bugs.python.org/issue19846
@@ -59,7 +60,7 @@ RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-lat
python=$PYTHON_VERSION \
&& /opt/conda/bin/conda install -y \
# conda 4.9.2 requires ruamel_yaml to be installed. Currently pinned at latest.
-ruamel_yaml==0.15.87 \
+ruamel_yaml==0.15.100 \
cython==0.29.12 \
ipython==7.7.0 \
mkl-include==2019.4 \
4 changes: 2 additions & 2 deletions release_images.yml
@@ -298,7 +298,7 @@ release_images:
20:
framework: "huggingface_tensorflow"
version: "2.4.1"
-hf_transformers: "4.4.2"
+hf_transformers: "4.5.0"
training:
device_types: [ "gpu" ]
python_versions: [ "py37" ]
@@ -310,7 +310,7 @@ release_images:
21:
framework: "huggingface_pytorch"
version: "1.6.0"
-hf_transformers: "4.4.2"
+hf_transformers: "4.5.0"
training:
device_types: [ "gpu" ]
python_versions: [ "py36" ]
10 changes: 6 additions & 4 deletions src/deep_learning_container.py
@@ -209,10 +209,12 @@ def tag_instance():
request_status = None
if instance_id and region:
try:
-session = botocore.session.get_session()
-ec2_client = session.create_client("ec2", region_name=region)
-response = ec2_client.create_tags(Resources=[instance_id], Tags=[tag_struct])
-request_status = response.get("ResponseMetadata").get("HTTPStatusCode")
+# # The section below has been commented out because the feature has been disabled until it is
+# # ready to be enabled.
+# session = botocore.session.get_session()
+# ec2_client = session.create_client("ec2", region_name=region)
+# response = ec2_client.create_tags(Resources=[instance_id], Tags=[tag_struct])
+# request_status = response.get("ResponseMetadata").get("HTTPStatusCode")
if os.environ.get("TEST_MODE") == str(1):
with open(os.path.join(os.sep, "tmp", "test_tag_request.txt"), "w+") as rf:
rf.write(json.dumps(tag_struct, indent=4))
6 changes: 1 addition & 5 deletions src/start_testbuilds.py
@@ -94,11 +94,7 @@ def main():
pr_test_job = f"dlc-pr-{test_type}-test"
images_str = " ".join(images)
if is_test_job_enabled(test_type):
-# TODO: remove "sagemaker" once sagemaker tests are ready
-if "huggingface" in images_str and test_type in [constants.EC2_TESTS,
-constants.ECS_TESTS,
-constants.EKS_TESTS,
-constants.SAGEMAKER_TESTS]:
+if "huggingface" in images_str and test_type in [constants.EC2_TESTS, constants.ECS_TESTS, constants.EKS_TESTS]:
LOGGER.debug(f"Skipping huggingface {test_type} test")
continue
run_test_job(commit, pr_test_job, images_str)
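
The skip condition in this hunk can be sketched in isolation as follows. The string values assigned to the `*_TESTS` constants and the `should_skip` helper are hypothetical; the diff only shows the names `constants.EC2_TESTS`, `constants.ECS_TESTS`, `constants.EKS_TESTS`, and `constants.SAGEMAKER_TESTS`, not their values.

```python
# Hypothetical stand-ins for the values in the repository's constants module.
EC2_TESTS = "ec2"
ECS_TESTS = "ecs"
EKS_TESTS = "eks"
SAGEMAKER_TESTS = "sagemaker"

# Test types still skipped for HuggingFace images; SageMaker is no longer listed.
SKIPPED_FOR_HUGGINGFACE = [EC2_TESTS, ECS_TESTS, EKS_TESTS]


def should_skip(images_str, test_type):
    """Return True when a HuggingFace image set would skip this test type."""
    return "huggingface" in images_str and test_type in SKIPPED_FOR_HUGGINGFACE


hf_images = "pr-huggingface-pytorch-training:example-tag"
print(should_skip(hf_images, EC2_TESTS))
print(should_skip(hf_images, SAGEMAKER_TESTS))
print(should_skip("pr-pytorch-training:example-tag", EC2_TESTS))
```

With the SageMaker entry dropped from the list, a HuggingFace image set falls through to `run_test_job` for SageMaker tests while EC2, ECS, and EKS remain skipped.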
3 changes: 3 additions & 0 deletions tensorflow/buildspec-tf1-neuron.yml
@@ -23,6 +23,9 @@
dockerd-entrypoint:
source: docker/build_artifacts/entrypoint.sh
target: entrypoint.sh
+deep_learning_container:
+source: ../../src/deep_learning_container.py
+target: deep_learning_container.py

images:
BuildNeuronTFInferencePy3DockerImage:
6 changes: 6 additions & 0 deletions test/dlc_tests/sanity/test_telemetry.py
@@ -11,6 +11,7 @@
@pytest.mark.processor("gpu")
@pytest.mark.integration("telemetry")
@pytest.mark.parametrize("ec2_instance_type", ["p3.2xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
def test_telemetry_instance_role_disabled_gpu(gpu, ec2_client, ec2_instance, ec2_connection):
_run_instance_role_disabled(gpu, ec2_client, ec2_instance, ec2_connection)

@@ -19,6 +20,7 @@ def test_telemetry_instance_role_disabled_gpu(gpu, ec2_client, ec2_instance, ec2
@pytest.mark.processor("cpu")
@pytest.mark.integration("telemetry")
@pytest.mark.parametrize("ec2_instance_type", ["c4.4xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
def test_telemetry_bad_instance_role_disabled_cpu(cpu, ec2_client, ec2_instance, ec2_connection, cpu_only):
_run_instance_role_disabled(cpu, ec2_client, ec2_instance, ec2_connection)

@@ -27,6 +29,7 @@ def test_telemetry_bad_instance_role_disabled_cpu(cpu, ec2_client, ec2_instance,
@pytest.mark.processor("neuron")
@pytest.mark.integration("telemetry")
@pytest.mark.parametrize("ec2_instance_type", ["inf1.xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
def test_telemetry_bad_instance_role_disabled_neuron(neuron, ec2_client, ec2_instance, ec2_connection):
_run_instance_role_disabled(neuron, ec2_client, ec2_instance, ec2_connection)

@@ -35,6 +38,7 @@ def test_telemetry_bad_instance_role_disabled_neuron(neuron, ec2_client, ec2_ins
@pytest.mark.processor("gpu")
@pytest.mark.integration("telemetry")
@pytest.mark.parametrize("ec2_instance_type", ["p3.2xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
def test_telemetry_instance_tag_success_gpu(gpu, ec2_client, ec2_instance, ec2_connection, non_huggingface_only):
_run_tag_success(gpu, ec2_client, ec2_instance, ec2_connection)

@@ -43,6 +47,7 @@ def test_telemetry_instance_tag_success_gpu(gpu, ec2_client, ec2_instance, ec2_c
@pytest.mark.processor("cpu")
@pytest.mark.integration("telemetry")
@pytest.mark.parametrize("ec2_instance_type", ["c4.4xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
def test_telemetry_instance_tag_success_cpu(cpu, ec2_client, ec2_instance, ec2_connection, cpu_only, non_huggingface_only):
_run_tag_success(cpu, ec2_client, ec2_instance, ec2_connection)

@@ -51,6 +56,7 @@ def test_telemetry_instance_tag_success_cpu(cpu, ec2_client, ec2_instance, ec2_c
@pytest.mark.processor("neuron")
@pytest.mark.integration("telemetry")
@pytest.mark.parametrize("ec2_instance_type", ["inf1.xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
def test_telemetry_instance_tag_success_neuron(neuron, ec2_client, ec2_instance, ec2_connection, non_huggingface_only):
_run_tag_success(neuron, ec2_client, ec2_instance, ec2_connection)

Empty file.
13 changes: 13 additions & 0 deletions test/sagemaker_tests/huggingface_pytorch/training/__init__.py
@@ -0,0 +1,13 @@
# Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
from __future__ import absolute_import
