Merge remote-tracking branch 'aws/master'

* aws/master:
  [doc] Minor corrections on available_images (aws#1061)
  [huggingface_pytorch] upgrade PyTorch to 1.7.1 (aws#1025)
  Fix for hf canary tests (aws#1057)
  [test][huggingface_tensorflow, huggingface_pytorch] SM local and remote tests (aws#1021)
  hf transformer version update (aws#1060)
  [build] update EIA buildspec and upgrade ruamel_yaml package (aws#1051)
  bump transformer version (aws#1056)
  Including hf images into canary tests (aws#1050)
  fix tf1 neuron buildspec (aws#1048)
  [build,test] Disable dedicated telemetry tests and tags (aws#1045)
  [test][sagemaker] Execute SM local tests in parallel (aws#1027)
  Temporarily disabling sagemaker tests for HF containers. (aws#1042)
  Add automatic yes to prompts for apt (aws#1043)
jeet4320 · Apr 23, 2021 · 37123ad
2 parents cd551be + 44b3738
commit 37123ad
Showing 49 changed files with 1,908 additions and 41 deletions.
diff --git a/available_images.md b/available_images.md
@@ -16,6 +16,7 @@ Once you've selected your desired Deep Learning Containers image, continue with
 
 
 Deep Learning Containers Docker Images are available in the following regions:
+
 | Region 					|Code 				|General Container	|Elastic Inference Container|Neuron Container	|Example URL																				|
 |---------------------------|-------------------|-------------------|---------------------------|-------------------|-------------------------------------------------------------------------------------------|
 |US East (N. Virginia)		|us-east-1			|Available 			|Available 			        |Available			|763104351884.dkr.ecr.us-east-1.amazonaws.com/<repository-name>:<image-tag>			|
@@ -101,29 +102,29 @@ You can pin your version by adding the version tag to your URL as follows:
 |MXNet 1.8.0        |training	|Yes			|CPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.8.0-cpu-py37-ubuntu16.04				|
 |MXNet 1.8.0        |inference	|No				|GPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.8.0-gpu-py37-cu110-ubuntu16.04		|
 |MXNet 1.8.0        |inference	|No				|CPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.8.0-cpu-py37-ubuntu16.04			|
-|PyTorch 1.8.0      |training	|Yes			|GPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.0-gpu-py36-cu111-ubuntu18.04		|
-|PyTorch 1.8.0      |training	|No				|CPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.0-cpu-py36-ubuntu18.04			|
-|PyTorch 1.8.0      |inference	|No			|GPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.0-gpu-py36-cu111-ubuntu18.04		|
-|PyTorch 1.8.0      |inference	|No				|CPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.0-cpu-py36-ubuntu18.04			|
+|PyTorch 1.8.1      |training	|Yes			|GPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04		|
+|PyTorch 1.8.1      |training	|No				|CPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.8.1-cpu-py36-ubuntu18.04			|
+|PyTorch 1.8.1      |inference	|No			    |GPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.1-gpu-py36-cu111-ubuntu18.04	|
+|PyTorch 1.8.1      |inference	|No				|CPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.1-cpu-py36-ubuntu18.04			|
 
 
 HuggingFace training containers
 ===============================
 
-| Framework         |Job Type	|CPU/GPU 	|Python Version Options	|Example URL																						|
-|-------------------|-----------|-----------|-----------------------|---------------------------------------------------------------------------------------------------|
-|PyTorch 1.6.0   with HuggingFace transformers   |training	|GPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:1.6.0-transformers4.4.2-gpu-py36-cu110-ubuntu18.04		|
-|TensorFlow 2.4.1 with HuggingFace transformers    |training	|GPU 		| 3.7 (py37)			|763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-tensorflow-training:2.4.1-transformers4.4.2-gpu-py37-cu110-ubuntu18.04 	|
+| Framework                                     |Job Type	|CPU/GPU 	|Python Version Options	|Example URL																						|
+|-----------------------------------------------|-----------|-----------|-----------------------|---------------------------------------------------------------------------------------------------|
+|PyTorch 1.6.0 with HuggingFace transformers    |training	|GPU 		| 3.6 (py36)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:1.6.0-transformers4.5.0-gpu-py36-cu110-ubuntu18.04		|
+|TensorFlow 2.4.1 with HuggingFace transformers |training	|GPU 		| 3.7 (py37)			|763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-tensorflow-training:2.4.1-transformers4.5.0-gpu-py37-cu110-ubuntu18.04 	|
 
 
 Elastic Inference Containers
 ============================
 
 | Framework 			                      |Job Type 	 |Horovod Options 	  |CPU/GPU 	   |Python Version Options     |Example URL 			                                                                           |
 |---------------------------------------------|--------------|--------------------|------------|---------------------------|---------------------------------------------------------------------------------------------------|
-|TensorFlow 2.3.0 with Elastic Inference 	  |inference     |No 			      |CPU 		   |3.7 (py37) 	   |763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-inference-eia:2.3.0-cpu-py37-ubuntu18.04   |
+|TensorFlow 2.3.0 with Elastic Inference 	  |inference     |No 			      |CPU 		   |3.7 (py37) 	               |763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-eia:2.3.0-cpu-py37-ubuntu18.04   |
 |TensorFlow 1.15.0 with Elastic Inference     |inference 	 |No                  |CPU         |2.7 (py27), 3.6 (py36) 	   |763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-inference-eia:1.15.0-cpu-py36-ubuntu18.04  |
-|MXNet 1.7.0 with Elastic Inference           |inference     |No 			      |CPU 		   |3.6 (py36) 	   |763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference-eia:1.7.0-cpu-py36-ubuntu16.04        |
+|MXNet 1.7.0 with Elastic Inference           |inference     |No 			      |CPU 		   |3.6 (py36) 	               |763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference-eia:1.7.0-cpu-py36-ubuntu16.04        |
 |PyTorch 1.5.1 with Elastic Inference 		  |inference 	 |No 			      |CPU 		   |3.6 (py36) 			       |763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-eia:1.5.1-cpu-py36-ubuntu16.04      |
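
Note on the example URLs in the tables above: each one follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository-name>:<image-tag>`. A minimal Python sketch of composing such a URI is given here for illustration. The helper name `image_uri` is hypothetical and not part of this repository; the account ID and default region are simply the ones that appear in the table.

```python
def image_uri(repository: str, tag: str, region: str = "us-east-1",
              account_id: str = "763104351884") -> str:
    """Compose an ECR image URI following the pattern used in available_images.md.

    `image_uri` is an illustrative helper, not a function from the repository.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Pin the PyTorch 1.8.1 GPU training image listed in the table.
print(image_uri("pytorch-training", "1.8.1-gpu-py36-cu111-ubuntu18.04"))
```

Passing a different `region` swaps only the region segment of the host name, which is how the per-region URLs in the first table differ from one another.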
 
 

diff --git a/huggingface/pytorch/buildspec-1-6.yml b/huggingface/pytorch/buildspec-1-6.yml
@@ -0,0 +1,31 @@
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+region: &REGION <set-$REGION-in-environment>
+base_framework: &BASE_FRAMEWORK pytorch
+framework: &FRAMEWORK !join [ "huggingface_", *BASE_FRAMEWORK]
+version: &VERSION 1.6.0
+short_version: &SHORT_VERSION 1.6
+
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ "huggingface/", *BASE_FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join ["pr", "-", "huggingface", "-", *BASE_FRAMEWORK, "-", *TRAINING_IMAGE_TYPE]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/,
+      *REPOSITORY_NAME ]
+
+images:
+  BuildHuggingFacePytorchGpuPy37Cu110TrainingDockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &HUGGINGFACE_PYTORCH_GPU_TRAINING_PY3 false
+    image_size_baseline: &IMAGE_SIZE_BASELINE 15000
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py36
+    cuda_version: &CUDA_VERSION cu110
+    os_version: &OS_VERSION ubuntu18.04
+    transformers_version: &TRANSFORMERS_VERSION 4.5.0
+    datasets_version: &DATASETS_VERSION 1.5.0
+    tag: !join [ *VERSION, '-', 'transformers', *TRANSFORMERS_VERSION, '-', *DEVICE_TYPE, '-', *TAG_PYTHON_VERSION, '-',
+      *CUDA_VERSION, '-', *OS_VERSION ]
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, 
+      *CUDA_VERSION, /Dockerfile., *DEVICE_TYPE ]
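
Note on the `tag` and `docker_file` entries above: the sketch below shows what they would resolve to, assuming the buildspec's `!join` tag concatenates its list items with no separator (an inference from the explicit `'-'` and `/` items; the loader itself is not shown in this diff). The `join` function is a stand-in written for this note, not the repository's implementation.

```python
def join(parts):
    """Stand-in for the buildspec `!join` tag, assumed to be plain concatenation."""
    return "".join(str(p) for p in parts)


# Anchor values copied from buildspec-1-6.yml.
version = "1.6.0"
short_version = "1.6"
device_type = "gpu"
docker_python_version = "py3"
tag_python_version = "py36"
cuda_version = "cu110"
os_version = "ubuntu18.04"
transformers_version = "4.5.0"

tag = join([version, "-", "transformers", transformers_version, "-", device_type,
            "-", tag_python_version, "-", cuda_version, "-", os_version])
docker_file = join(["docker/", short_version, "/", docker_python_version, "/",
                    cuda_version, "/Dockerfile.", device_type])

print(tag)
print(docker_file)
```

Under that assumption the resulting tag matches the `huggingface-pytorch-training` example URL in `available_images.md`.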
diff --git a/huggingface/pytorch/buildspec.yml b/huggingface/pytorch/buildspec.yml
@@ -2,8 +2,8 @@ account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
 region: &REGION <set-$REGION-in-environment>
 base_framework: &BASE_FRAMEWORK pytorch
 framework: &FRAMEWORK !join [ "huggingface_", *BASE_FRAMEWORK]
-version: &VERSION 1.6.0
-short_version: &SHORT_VERSION 1.6
+version: &VERSION 1.7.1
+short_version: &SHORT_VERSION 1.7
 
 repository_info:
   training_repository: &TRAINING_REPOSITORY

diff --git a/huggingface/pytorch/training/docker/1.6/py3/cu110/Dockerfile.gpu b/huggingface/pytorch/training/docker/1.6/py3/cu110/Dockerfile.gpu
@@ -13,8 +13,7 @@ RUN pip install --no-cache-dir \
 	transformers[sklearn,sentencepiece]==${TRANSFORMERS_VERSION} \ 
 	datasets==${DATASETS_VERSION}
 RUN apt-get update \
- && apt install git-lfs \
-   	unzip \
+ && apt install -y git-lfs \
  && apt-get clean \  
  && rm -rf /var/lib/apt/lists/*
 

diff --git a/huggingface/pytorch/training/docker/1.7/py3/cu110/Dockerfile.gpu b/huggingface/pytorch/training/docker/1.7/py3/cu110/Dockerfile.gpu
@@ -17,7 +17,7 @@ RUN pip install --no-cache-dir \
 	transformers[sklearn,sentencepiece]==${TRANSFORMERS_VERSION} \ 
 	datasets==${DATASETS_VERSION}
 RUN apt-get update \
- && apt install git-lfs \
+ && apt install -y git-lfs \
  && apt-get clean \  
  && rm -rf /var/lib/apt/lists/*
 

diff --git a/huggingface/tensorflow/training/docker/2.3/py3/cu110/Dockerfile.gpu b/huggingface/tensorflow/training/docker/2.3/py3/cu110/Dockerfile.gpu
@@ -15,7 +15,7 @@ RUN pip install --no-cache-dir \
     tensorflow-addons==0.12.0 \
     psutil
 RUN apt-get update \
- && apt install git-lfs \
+ && apt install -y git-lfs \
  && apt-get clean \  
  && rm -rf /var/lib/apt/lists/*
 

diff --git a/huggingface/tensorflow/training/docker/2.4/py3/cu110/Dockerfile.gpu b/huggingface/tensorflow/training/docker/2.4/py3/cu110/Dockerfile.gpu
@@ -19,7 +19,7 @@ RUN pip install --no-cache-dir \
     tensorflow-addons==0.12.0 \
     psutil
 RUN apt-get update \
- && apt install git-lfs \
+ && apt install -y git-lfs \
  && apt-get clean \  
  && rm -rf /var/lib/apt/lists/*
 

diff --git a/pytorch/buildspec-eia.yml b/pytorch/buildspec-eia.yml
@@ -1,7 +1,7 @@
   account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
   region: &REGION <set-$REGION-in-environment>
   framework: &FRAMEWORK pytorch
-  version: &VERSION 1.3.1
+  version: &VERSION 1.5.1
 
   repository_info:
     inference_repository: &INFERENCE_REPOSITORY
@@ -23,7 +23,7 @@
     BuildEIAPTInferencePy3DockerImage:
       <<: *INFERENCE_REPOSITORY
       build: &PYTORCH_CPU_INFERENCE_PY3 false
-      image_size_baseline: 4899
+      image_size_baseline: 6318
       device_type: &DEVICE_TYPE cpu
       python_version: &DOCKER_PYTHON_VERSION py3
       tag_python_version: &TAG_PYTHON_VERSION py36

diff --git a/pytorch/inference/docker/1.5.1/py3/Dockerfile.eia b/pytorch/inference/docker/1.5.1/py3/Dockerfile.eia
@@ -4,6 +4,7 @@ LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
 LABEL dlc_major_version="1"
 
 # Add arguments to achieve the version, python and url
+ARG PYTHON=python3
 ARG PYTHON_VERSION=3.6.13
 ARG PYTORCH_VERSION=1.5.1
 ARG TORCHVISION_VERSION=0.6.1
@@ -59,7 +60,7 @@ RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-lat
     python=$PYTHON_VERSION \
  && /opt/conda/bin/conda install -y \
     # conda 4.9.2 requires ruamel_yaml to be installed. Currently pinned at latest.
-    ruamel_yaml==0.15.87 \
+    ruamel_yaml==0.15.100 \
     cython==0.29.12 \
     ipython==7.7.0 \
     numpy==1.19.1 \

diff --git a/pytorch/inference/docker/1.5.1/py3/Dockerfile.neuron b/pytorch/inference/docker/1.5.1/py3/Dockerfile.neuron
@@ -4,6 +4,7 @@ LABEL dlc_major_version="1"
 LABEL maintainer="Amazon AI"
 LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
 
+ARG PYTHON=python3
 ARG PYTHON_VERSION=3.6.13
 
 # See http://bugs.python.org/issue19846
@@ -59,7 +60,7 @@ RUN curl -L -o ~/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-lat
     python=$PYTHON_VERSION \
  && /opt/conda/bin/conda install -y \
     # conda 4.9.2 requires ruamel_yaml to be installed. Currently pinned at latest.
-    ruamel_yaml==0.15.87 \
+    ruamel_yaml==0.15.100 \
     cython==0.29.12 \
     ipython==7.7.0 \
     mkl-include==2019.4 \

diff --git a/release_images.yml b/release_images.yml
@@ -298,7 +298,7 @@ release_images:
   20:
     framework: "huggingface_tensorflow"
     version: "2.4.1"
-    hf_transformers: "4.4.2"
+    hf_transformers: "4.5.0"
     training:
       device_types: [ "gpu" ]
       python_versions: [ "py37" ]
@@ -310,7 +310,7 @@ release_images:
   21:
     framework: "huggingface_pytorch"
     version: "1.6.0"
-    hf_transformers: "4.4.2"
+    hf_transformers: "4.5.0"
     training:
       device_types: [ "gpu" ]
       python_versions: [ "py36" ]

diff --git a/src/deep_learning_container.py b/src/deep_learning_container.py
@@ -209,10 +209,12 @@ def tag_instance():
     request_status = None
     if instance_id and region:
         try:
-            session = botocore.session.get_session()
-            ec2_client = session.create_client("ec2", region_name=region)
-            response = ec2_client.create_tags(Resources=[instance_id], Tags=[tag_struct])
-            request_status = response.get("ResponseMetadata").get("HTTPStatusCode")
+            # # The section below has been commented out because the feature has been disabled until it is
+            # # ready to be enabled.
+            # session = botocore.session.get_session()
+            # ec2_client = session.create_client("ec2", region_name=region)
+            # response = ec2_client.create_tags(Resources=[instance_id], Tags=[tag_struct])
+            # request_status = response.get("ResponseMetadata").get("HTTPStatusCode")
             if os.environ.get("TEST_MODE") == str(1):
                 with open(os.path.join(os.sep, "tmp", "test_tag_request.txt"), "w+") as rf:
                     rf.write(json.dumps(tag_struct, indent=4))
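
Note on the hunk above: with the `create_tags` call commented out, `request_status` is never reassigned in the lines shown, and the only visible side effect is the `TEST_MODE` file write. The sketch below reproduces just that visible behavior. The function name, the `out_dir` parameter (used here instead of `/tmp`), the sample tag, and the return value are assumptions made for illustration; the rest of the real `tag_instance` is not shown in this diff.

```python
import json
import os
import tempfile


def tag_instance_sketch(instance_id, region, tag_struct, out_dir):
    """Illustrative stand-in for the visible part of tag_instance()."""
    request_status = None
    if instance_id and region:
        # The EC2 create_tags request is disabled in this commit,
        # so nothing assigns request_status here.
        if os.environ.get("TEST_MODE") == str(1):
            path = os.path.join(out_dir, "test_tag_request.txt")
            with open(path, "w+") as rf:
                rf.write(json.dumps(tag_struct, indent=4))
    return request_status


# Hypothetical inputs, used only to exercise the sketch.
os.environ["TEST_MODE"] = "1"
out_dir = tempfile.mkdtemp()
sample_tag = {"Key": "example-key", "Value": "example-value"}
status = tag_instance_sketch("i-0123456789abcdef0", "us-east-1", sample_tag, out_dir)
print(status)
```

This is consistent with the telemetry tag tests being skipped elsewhere in the same commit, since no tag request is issued while the feature is disabled.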

diff --git a/src/start_testbuilds.py b/src/start_testbuilds.py
@@ -94,11 +94,7 @@ def main():
             pr_test_job = f"dlc-pr-{test_type}-test"
             images_str = " ".join(images)
             if is_test_job_enabled(test_type):
-                # TODO: remove "sagemaker" once sagemaker tests are ready
-                if "huggingface" in images_str and test_type in [constants.EC2_TESTS,
-                                                                 constants.ECS_TESTS,
-                                                                 constants.EKS_TESTS,
-                                                                 constants.SAGEMAKER_TESTS]:
+                if "huggingface" in images_str and test_type in [constants.EC2_TESTS, constants.ECS_TESTS, constants.EKS_TESTS]:
                     LOGGER.debug(f"Skipping huggingface {test_type} test")
                     continue
                 run_test_job(commit, pr_test_job, images_str)
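
Note on the condition above: after this change, HuggingFace images are skipped for EC2, ECS, and EKS test jobs but no longer for SageMaker ones. A minimal sketch of the predicate follows. The constant values and the `should_skip` helper are hypothetical; the real values live in the repository's `constants` module, which is not shown in this diff.

```python
# Hypothetical stand-ins for constants.EC2_TESTS and friends.
EC2_TESTS = "ec2"
ECS_TESTS = "ecs"
EKS_TESTS = "eks"
SAGEMAKER_TESTS = "sagemaker"


def should_skip(images_str: str, test_type: str) -> bool:
    """Mirror the updated condition in start_testbuilds.main()."""
    return "huggingface" in images_str and test_type in [EC2_TESTS, ECS_TESTS, EKS_TESTS]


images_str = " ".join(["pr-huggingface-pytorch-training:example-tag"])
for test_type in (EC2_TESTS, ECS_TESTS, EKS_TESTS, SAGEMAKER_TESTS):
    print(test_type, should_skip(images_str, test_type))
```

Because the check is a substring match on the space-joined image list, a single HuggingFace image in the batch causes the whole job for that test type to be skipped.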

diff --git a/tensorflow/buildspec-tf1-neuron.yml b/tensorflow/buildspec-tf1-neuron.yml
@@ -23,6 +23,9 @@
       dockerd-entrypoint:
         source: docker/build_artifacts/entrypoint.sh
         target: entrypoint.sh
+      deep_learning_container:
+        source: ../../src/deep_learning_container.py
+        target: deep_learning_container.py
 
   images:
     BuildNeuronTFInferencePy3DockerImage:

diff --git a/test/dlc_tests/sanity/test_telemetry.py b/test/dlc_tests/sanity/test_telemetry.py
@@ -11,6 +11,7 @@
 @pytest.mark.processor("gpu")
 @pytest.mark.integration("telemetry")
 @pytest.mark.parametrize("ec2_instance_type", ["p3.2xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
 def test_telemetry_instance_role_disabled_gpu(gpu, ec2_client, ec2_instance, ec2_connection):
     _run_instance_role_disabled(gpu, ec2_client, ec2_instance, ec2_connection)
 
@@ -19,6 +20,7 @@ def test_telemetry_instance_role_disabled_gpu(gpu, ec2_client, ec2_instance, ec2
 @pytest.mark.processor("cpu")
 @pytest.mark.integration("telemetry")
 @pytest.mark.parametrize("ec2_instance_type", ["c4.4xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
 def test_telemetry_bad_instance_role_disabled_cpu(cpu, ec2_client, ec2_instance, ec2_connection, cpu_only):
     _run_instance_role_disabled(cpu, ec2_client, ec2_instance, ec2_connection)
 
@@ -27,6 +29,7 @@ def test_telemetry_bad_instance_role_disabled_cpu(cpu, ec2_client, ec2_instance,
 @pytest.mark.processor("neuron")
 @pytest.mark.integration("telemetry")
 @pytest.mark.parametrize("ec2_instance_type", ["inf1.xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
 def test_telemetry_bad_instance_role_disabled_neuron(neuron, ec2_client, ec2_instance, ec2_connection):
     _run_instance_role_disabled(neuron, ec2_client, ec2_instance, ec2_connection)
 
@@ -35,6 +38,7 @@ def test_telemetry_bad_instance_role_disabled_neuron(neuron, ec2_client, ec2_ins
 @pytest.mark.processor("gpu")
 @pytest.mark.integration("telemetry")
 @pytest.mark.parametrize("ec2_instance_type", ["p3.2xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
 def test_telemetry_instance_tag_success_gpu(gpu, ec2_client, ec2_instance, ec2_connection, non_huggingface_only):
     _run_tag_success(gpu, ec2_client, ec2_instance, ec2_connection)
 
@@ -43,6 +47,7 @@ def test_telemetry_instance_tag_success_gpu(gpu, ec2_client, ec2_instance, ec2_c
 @pytest.mark.processor("cpu")
 @pytest.mark.integration("telemetry")
 @pytest.mark.parametrize("ec2_instance_type", ["c4.4xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
 def test_telemetry_instance_tag_success_cpu(cpu, ec2_client, ec2_instance, ec2_connection, cpu_only, non_huggingface_only):
     _run_tag_success(cpu, ec2_client, ec2_instance, ec2_connection)
 
@@ -51,6 +56,7 @@ def test_telemetry_instance_tag_success_cpu(cpu, ec2_client, ec2_instance, ec2_c
 @pytest.mark.processor("neuron")
 @pytest.mark.integration("telemetry")
 @pytest.mark.parametrize("ec2_instance_type", ["inf1.xlarge"], indirect=True)
+@pytest.mark.skip(reason="Skip test until feature is enabled.")
 def test_telemetry_instance_tag_success_neuron(neuron, ec2_client, ec2_instance, ec2_connection, non_huggingface_only):
     _run_tag_success(neuron, ec2_client, ec2_instance, ec2_connection)
 

diff --git a/test/sagemaker_tests/huggingface_pytorch/__init__.py b/test/sagemaker_tests/huggingface_pytorch/__init__.py
diff --git a/test/sagemaker_tests/huggingface_pytorch/training/__init__.py b/test/sagemaker_tests/huggingface_pytorch/training/__init__.py
@@ -0,0 +1,13 @@
+# Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License"). You
+# may not use this file except in compliance with the License. A copy of
+# the License is located at
+#
+#     http://aws.amazon.com/apache2.0/
+#
+# or in the "license" file accompanying this file. This file is
+# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
+# ANY KIND, either express or implied. See the License for the specific
+# language governing permissions and limitations under the License.
+from __future__ import absolute_import