Using GPUs for Training Models in the Cloud

Contents
* [Requesting GPU-enabled machines](https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus?authuser=0&hl=ko#requesting_gpu-enabled_machines)
* [Submitting the training job](https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus?authuser=0&hl=ko#submit-job)
* [Assigning ops to GPUs](https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus?authuser=0&hl=ko#assigning_ops_to_gpus)
  * [GPU device strings](https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus?authuser=0&hl=ko#gpu_device_strings)
* [Python packages on GPU-enabled machines](https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus?authuser=0&hl=ko#python_packages_on_gpu-enabled_machines)
* [Maintenance events](https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus?authuser=0&hl=ko#maintenance_events)
* [What's next](https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus?authuser=0&hl=ko#whats_next)

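Graphics processing units (GPUs) can significantly accelerate the training process for many deep learning models. For example, GPUs can speed up training for deep learning models designed for image classification, video analysis, and natural language processing. Training these models involves compute-intensive operations that can take advantage of a GPU's massively parallel architecture. This architecture is well suited to algorithms designed to handle embarrassingly parallel workloads.

Training a deep learning model that performs intensive computation on a very large dataset can take days to run on a single processor. However, if you design your program to offload those tasks to one or more GPUs, you can reduce training time to hours instead of days.

For general information about accelerated computing with GPUs, see NVIDIA's page on accelerated computing. For details about using GPUs with TensorFlow, see Using GPUs in the TensorFlow documentation.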

## Requesting GPU-enabled machines

To use GPUs in the cloud, configure your training job to access GPU-enabled machines:

* Set the scale tier to CUSTOM.
* Configure each task (master, worker, or parameter server) to use one of the GPU-enabled machine types below, based on the number of GPUs and the type of accelerator required for your task:
  * standard_gpu: A single NVIDIA Tesla K80 GPU
  * complex_model_m_gpu: Four NVIDIA Tesla K80 GPUs
  * complex_model_l_gpu: Eight NVIDIA Tesla K80 GPUs
  * standard_p100: A single NVIDIA Tesla P100 GPU (Beta)
  * complex_model_m_p100: Four NVIDIA Tesla P100 GPUs (Beta)


[Below](https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus?authuser=0&hl=ko#submit-job) is an example of submitting the job using the gcloud command.

Alternatively, if you are learning how to use Cloud ML Engine or experimenting with GPU-enabled machines, you can set the scale tier to BASIC_GPU to get a single worker instance with a single NVIDIA Tesla K80 GPU.

For more information, see [comparing machine types](https://cloud.google.com/ml-engine/docs/tensorflow/training-overview?authuser=0&hl=ko#machine_type_table).

In addition, you need to run your job in a region that supports GPUs. The following regions currently provide access to GPUs:

* us-east1
* us-central1
* asia-east1
* europe-west1

To fully understand the available regions for Cloud ML Engine services, including model training and online/batch prediction, read the guide to regions.
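
The requirements in this section can be checked in code before a job is submitted. The following is a minimal sketch, assuming two hypothetical helpers, `check_gpu_config` and `total_gpus`, that are not part of any Cloud ML Engine client library. The machine types, GPU counts, and regions are copied from the lists in this section and may not reflect current availability.

```python
# GPUs per machine type, copied from the list in this section.
GPUS_PER_MACHINE_TYPE = {
    "standard_gpu": 1,
    "complex_model_m_gpu": 4,
    "complex_model_l_gpu": 8,
    "standard_p100": 1,
    "complex_model_m_p100": 4,
}

# Regions listed in this section as providing GPUs.
GPU_REGIONS = {"us-east1", "us-central1", "asia-east1", "europe-west1"}

TYPE_KEYS = ("masterType", "workerType", "parameterServerType")


def check_gpu_config(training_input, region):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if region not in GPU_REGIONS:
        problems.append("region %s is not in the GPU region list" % region)
    tier = training_input.get("scaleTier")
    if tier == "BASIC_GPU":
        # BASIC_GPU needs no per-task machine types.
        return problems
    if tier != "CUSTOM":
        problems.append("scaleTier must be CUSTOM or BASIC_GPU to use GPUs")
    types = [training_input.get(key) for key in TYPE_KEYS]
    if not any(t in GPUS_PER_MACHINE_TYPE for t in types):
        problems.append("no task uses a GPU-enabled machine type")
    return problems


def total_gpus(training_input):
    """Hypothetical helper: count GPUs across master, workers, and parameter servers."""
    per = GPUS_PER_MACHINE_TYPE
    master = per.get(training_input.get("masterType"), 0)
    workers = (per.get(training_input.get("workerType"), 0)
               * training_input.get("workerCount", 0))
    servers = (per.get(training_input.get("parameterServerType"), 0)
               * training_input.get("parameterServerCount", 0))
    return master + workers + servers


example = {
    "scaleTier": "CUSTOM",
    "masterType": "complex_model_m_gpu",
    "workerType": "complex_model_m_gpu",
    "parameterServerType": "large_model",
    "workerCount": 9,
    "parameterServerCount": 3,
}
print(check_gpu_config(example, "us-central1"))  # prints []
print(total_gpus(example))  # prints 40 (4 for the master plus 9 * 4 for the workers)
```

A count like this is useful when comparing a planned cluster against the limits on concurrent GPU usage.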

## Submitting the training job

You can submit your training job using the [gcloud ml-engine jobs submit training](https://cloud.google.com/ml-engine/reference/commandline/jobs/submit/training) command.

1. Define a config.yaml file that describes the GPU options you want. The structure of the YAML file represents the [Job resource](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs). For example:

```yaml
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  workerType: complex_model_m_gpu
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
```

2. Use the gcloud command to submit the job, including a --config argument pointing to your config.yaml file. The following example assumes you've set up environment variables, indicated by a $ sign followed by capital letters, for the values of some arguments:

```shell
gcloud ml-engine jobs submit training $JOB_NAME \
        --package-path $APP_PACKAGE_PATH \
        --module-name $MAIN_APP_MODULE \
        --job-dir $JOB_DIR \
        --region us-central1 \
        --config config.yaml \
        -- \
        --user_arg_1 value_1 \
         ...
        --user_arg_n value_n
```

Notes:

* The empty -- argument marks the end of the gcloud-specific arguments and the start of the USER_ARGS that you want to pass to your application.
* Arguments specific to Cloud ML Engine, such as --module-name, --runtime-version, and --job-dir, must come before the empty -- argument. The Cloud ML Engine service interprets these arguments.
* The --job-dir argument, if specified, must come before the empty -- argument, because Cloud ML Engine uses the --job-dir argument to validate the path.
* Your application must handle the --job-dir argument too, if specified. Even though the argument comes before the empty --, the --job-dir is also passed to your application as a command-line argument.
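
Because --job-dir reaches the application alongside the user arguments, the trainer's argument parser has to accept it. The following is a minimal sketch using the standard library's `argparse`; the function name `parse_trainer_args`, the `--user_arg_1` flag, and the `gs://my-bucket/job` path are placeholders chosen for illustration, not names required by Cloud ML Engine.

```python
import argparse


def parse_trainer_args(argv):
    """Parse --job-dir plus the trainer's own flags from an argument list."""
    parser = argparse.ArgumentParser()
    # Passed to the application even though it precedes the empty "--"
    # on the gcloud command line.
    parser.add_argument("--job-dir", default=None,
                        help="Output location for checkpoints and exports.")
    # Placeholder for an application-specific argument.
    parser.add_argument("--user_arg_1", default=None)
    # parse_known_args returns undeclared flags instead of exiting on them.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_trainer_args(
    ["--job-dir", "gs://my-bucket/job", "--user_arg_1", "value_1"])
print(args.job_dir)     # prints gs://my-bucket/job
print(args.user_arg_1)  # prints value_1
```

`argparse` converts the hyphen in `--job-dir` to an underscore, so the value is read as `args.job_dir`.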

For more details of the job submission options, see the guide to starting a training job.
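
The argument ordering described in the notes can also be made explicit when the command is assembled programmatically. The sketch below assumes a hypothetical helper, `build_submit_command`, and placeholder values for the job name, paths, and user arguments. It only builds and prints the argument list and does not call gcloud.

```python
import shlex


def build_submit_command(job_name, package_path, module_name, job_dir,
                         region, config_path, user_args):
    """Hypothetical helper: build the argv list for the submit command."""
    command = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--package-path", package_path,
        "--module-name", module_name,
        "--job-dir", job_dir,
        "--region", region,
        "--config", config_path,
        "--",  # end of gcloud arguments; the rest goes to the application
    ]
    for name, value in user_args:
        command.extend(["--" + name, str(value)])
    return command


command = build_submit_command(
    job_name="my_job",                  # placeholder values throughout
    package_path="trainer/",
    module_name="trainer.task",
    job_dir="gs://my-bucket/job",
    region="us-central1",
    config_path="config.yaml",
    user_args=[("user_arg_1", "value_1")],
)
# Print a shell-quoted form for inspection. Passing the list to
# subprocess.run(command, check=True) would submit the job.
print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list keeps every service argument ahead of the empty -- separator by construction.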

## Assigning ops to GPUs

To make use of the GPUs on a machine, make the appropriate changes to your TensorFlow trainer application:

* High-level Estimator API: No code changes are necessary as long as your [ClusterSpec](https://www.tensorflow.org/api_docs/python/tf/train/ClusterSpec) is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs.
* Core TensorFlow API: You must assign ops to run on GPU-enabled machines. This process is the same as using [GPUs with TensorFlow locally](https://www.tensorflow.org/tutorials/using_gpu). You can use [tf.train.replica_device_setter](https://www.tensorflow.org/api_docs/python/tf/train/replica_device_setter) to assign ops to devices.

When you assign a GPU-enabled machine to a Cloud ML Engine process, that process has exclusive access to that machine's GPUs; you can't share the GPUs of a single machine in your cluster among multiple processes. The process corresponds to the distributed TensorFlow task in your cluster specification. The [distributed TensorFlow documentation](https://www.tensorflow.org/how_tos/distributed/) describes cluster specifications and tasks.



### GPU device strings

A standard_gpu machine's single GPU is identified as "/gpu:0". Machines with multiple GPUs use identifiers starting with "/gpu:0", then "/gpu:1", and so on. For example, complex_model_m_gpu machines have four GPUs identified as "/gpu:0" through "/gpu:3".
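
The numbering scheme can be sketched in plain Python. The helpers below, `gpu_device_strings` and `assign_round_robin`, are hypothetical names used only for illustration, and the "tower" labels stand in for whatever units of work a trainer distributes across GPUs. The sketch does not import TensorFlow.

```python
def gpu_device_strings(num_gpus):
    """Return the device strings for a machine with num_gpus GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ["/gpu:%d" % index for index in range(num_gpus)]


def assign_round_robin(names, num_gpus):
    """Map each name to a device string, cycling through the GPUs in order."""
    devices = gpu_device_strings(num_gpus)
    return {name: devices[i % len(devices)] for i, name in enumerate(names)}


# A complex_model_m_gpu machine has four GPUs.
print(gpu_device_strings(4))
# prints ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']

towers = ["tower_%d" % i for i in range(6)]
placement = assign_round_robin(towers, 4)
print(placement["tower_4"])  # prints /gpu:0 (wraps around after /gpu:3)
```

In a TensorFlow trainer, each string would typically be passed to `tf.device`, for example `with tf.device("/gpu:1"):`, around the ops to place on that GPU.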

## Python packages on GPU-enabled machines

GPU-enabled machines come pre-installed with tensorflow-gpu, the TensorFlow Python package with GPU support. See the Cloud ML Runtime Version List for a list of all pre-installed packages.

## Maintenance events

If you use GPU machines in your training jobs, be aware that the underlying virtual machines are occasionally subject to [Compute Engine host maintenance](https://cloud.google.com/compute/docs/gpus/add-gpus#host-maintenance). The GPU-enabled virtual machines used in your training jobs are configured to restart automatically after such maintenance events, but you may have to do some extra work to make your trainer resilient to these shutdowns. Save model checkpoints regularly (usually along the Cloud Storage path you specify through the --job-dir argument to gcloud ml-engine jobs submit training), and configure your trainer to restore the most recent checkpoint if one already exists.

The [TensorFlow Estimator API](https://www.tensorflow.org/programmers_guide/estimators) implements this functionality for you, so if your model is already wrapped in an Estimator, you do not have to worry about maintenance events on your GPU workers.

If it is not feasible for you to wrap your model in a TensorFlow Estimator and you want your GPU-enabled training jobs to be resilient to maintenance events, you must write the checkpoint saving and restoration functionality into your model manually. TensorFlow provides some useful resources for such an implementation in the [tf.train module](https://www.tensorflow.org/api_docs/python/tf/train), specifically [tf.train.checkpoint_exists](https://www.tensorflow.org/api_docs/python/tf/train/checkpoint_exists) and [tf.train.latest_checkpoint](https://www.tensorflow.org/api_docs/python/tf/train/latest_checkpoint).
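
The save-and-restore pattern itself is independent of TensorFlow. The following is a pure-Python stand-in, not TensorFlow code: it stores a toy state as JSON files in a local temporary directory, and every name in it (`latest_checkpoint`, `save_checkpoint`, `train`, the `ckpt-N.json` naming) is an assumption made for illustration. In a real trainer, the TensorFlow functions linked above would do the checkpoint lookup, and the directory would usually be the Cloud Storage path given by --job-dir.

```python
import json
import os
import re
import tempfile

CHECKPOINT_PATTERN = re.compile(r"^ckpt-(\d+)\.json$")


def latest_checkpoint(checkpoint_dir):
    """Return (step, path) of the newest checkpoint, or None if there is none."""
    best = None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, os.path.join(checkpoint_dir, name))
    return best


def save_checkpoint(checkpoint_dir, step, state):
    """Write to a temporary file, then rename, so no partial file is left."""
    path = os.path.join(checkpoint_dir, "ckpt-%d.json" % step)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)
    return path


def train(checkpoint_dir, total_steps, save_every=10, stop_at=None):
    """Toy loop that resumes from the latest checkpoint if one exists."""
    found = latest_checkpoint(checkpoint_dir)
    if found is None:
        step, state = 0, {"weight": 0.0}
    else:
        step = found[0]
        with open(found[1]) as handle:
            state = json.load(handle)
    while step < total_steps:
        if stop_at is not None and step >= stop_at:
            return step, state  # simulates a maintenance shutdown
        state["weight"] += 0.5  # stand-in for one training step
        step += 1
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(checkpoint_dir, step, state)
    return step, state


with tempfile.TemporaryDirectory() as job_dir:
    train(job_dir, total_steps=30, stop_at=25)  # interrupted after step 25
    step, state = train(job_dir, total_steps=30)  # restarts from step 20
    print(step, state["weight"])  # prints 30 15.0
```

The work done between the last checkpoint (step 20) and the interruption (step 25) is repeated after the restart, which is why the checkpoint interval bounds how much progress a maintenance event can cost.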

## What's next

* [Learn more about training models in the cloud.](https://cloud.google.com/ml-engine/docs/tensorflow/training-overview)
* [Train your model in the cloud.](https://cloud.google.com/ml-engine/docs/tensorflow/training-steps)
* [Visit the Cloud ML Engine documentation main page](https://cloud.google.com/ml-engine/docs/tensorflow/).
* [Limits on concurrent GPU usage](https://cloud.google.com/ml-engine/quotas#gpu-quota).