### Reference URL
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/pytorch/pytorch-text-sentiment-classification-custom-train-deploy.ipynb

In [1]:
# Install the packages required for executing this notebook.

import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME") and not os.getenv("VIRTUAL_ENV")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip install --upgrade google-cloud-aiplatform {USER_FLAG} -q

In [2]:
# Restart the kernel

import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

In [7]:
# UUID
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()
UUID

'boki7996'

In [8]:
# Set the GCP project ID and region

PROJECT_ID = 'airy-runway-344101'
REGION = "asia-east1"

In [31]:
# Set the GCS bucket
BUCKET_NAME = "catch-ai"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

# Verify access to the bucket
! gsutil ls -al $BUCKET_URI

                                 gs://catch-ai/data/
                                 gs://catch-ai/model/
                                 gs://catch-ai/pytorch-on-gcp/


In [10]:
# Import the required libraries for this notebook.

import base64
import json

from google.cloud import aiplatform
from google.protobuf.json_format import MessageToDict

In [32]:
# Define the constants needed for this tutorial.

# Name for the package application / model / repository
APP_NAME = "ai-catchform-kobert"

# URI for the pre-built container for custom training
PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI = (
    "asia-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-11:latest"
)

# Name of the folder where the python package needs to be stored
PYTHON_PACKAGE_APPLICATION_DIR = "ai_catchform"

# Path to the source distribution tar of the python package
source_package_file_name = f"{PYTHON_PACKAGE_APPLICATION_DIR}/dist/trainer-0.1.tar.gz"

# GCS path where the python package is stored
python_package_gcs_uri = (
    f"{BUCKET_URI}/pytorch-on-gcp/{APP_NAME}/train/python_package/trainer-0.1.tar.gz"
)

# Module name for training application
python_module_name = "src.main"

# Training job's display name
JOB_NAME = f"{APP_NAME}-pytorch-pkg-train-{UUID}"

# Set training job's machine-type
TRAIN_MACHINE_TYPE = "n1-standard-8"
# Set training job's accelerator type
TRAIN_ACCELERATOR_TYPE = "NVIDIA_TESLA_V100"
# Set no. of h/w accelerators needed for the training job
TRAIN_ACCELERATOR_COUNT = 1

# Set the name of the container image for prediction
CUSTOM_PREDICTOR_IMAGE_URI = (
    f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{APP_NAME}/pytorch_predict_{APP_NAME}:latest"
)

# Set the version for model-deployment
VERSION = 1
# Set the model display name
model_display_name = f"{APP_NAME}-v{VERSION}"
# Set the model description
model_description = "PyTorch based KoBERT NER classification model for AI-catchform"

# Set the health route for prediction container
health_route = "/ping"
# Set the predict route for prediction container
predict_route = f"/predictions/{APP_NAME}"
# Set the serving container ports for prediction
serving_container_ports = [7080]

# Set the display name for endpoint
endpoint_display_name = f"{APP_NAME}-endpoint"
# Set the machine-type for deployment
DEPLOY_MACHINE_TYPE = "n1-standard-4"

In [33]:
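# Initialize the Vertex AI SDK for Python.
# Note: `location` is not passed here, so the SDK falls back to its default
# region (us-central1, as the job output later in this notebook shows)
# rather than REGION.

aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)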

## Custom Training on Vertex AI

### Recommended Training Application Structure

You can structure your training application in any way you like. However, the following structure is commonly used in Vertex AI samples, and having your project organized similarly can make it easier for you to follow the samples.

The following python_package directory structure shows a sample packaging approach.

```
├── ai_catchform
│   ├── setup.py
│   └── trainer
│       ├── __init__.py
│       ├── experiment.py
│       ├── metadata.py
│       ├── model.py
│       ├── task.py
│       └── utils.py
└── pytorch-text-sentiment-classification-custom-train-deploy.ipynb    --> This notebook
```

- The main project directory contains your setup.py file with the dependencies.
- Inside the trainer directory:
    - task.py - Main application module that initializes and parses task arguments (hyperparameters). It also serves as the entry point to the trainer.
    - model.py - Includes a function to create a model with a sequence classification head from a pre-trained model.
    - experiment.py - Runs the model training and evaluation experiment, and exports the final model.
    - metadata.py - Defines the metadata for classification tasks, such as the predefined model, dataset name, and target labels.
    - utils.py - Includes utility functions, such as those for reading data and saving models to Cloud Storage buckets.
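
As an illustration of the role described for task.py, the following is a minimal sketch of an argument-parsing entry point. The flag names and default values are hypothetical and are not taken from this project's trainer; adjust them to whatever the experiment code actually expects.

```python
import argparse


def get_args(argv=None):
    """Parse task arguments (hyperparameters) for the trainer."""
    parser = argparse.ArgumentParser(description="Sketch of a task.py entry point")
    # Hypothetical hyperparameters; rename them to match your experiment code.
    parser.add_argument("--num-epochs", type=int, default=2)
    parser.add_argument("--batch-size", type=int, default=16)
    parser.add_argument("--learning-rate", type=float, default=2e-5)
    # Hypothetical flag for the location where the trained model is exported.
    parser.add_argument("--model-dir", type=str, default="")
    # argv=None makes argparse read sys.argv, which is what a training job uses.
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try out in a notebook.
args = get_args(["--num-epochs", "3"])
print(vars(args))
```

Accepting an optional `argv` list keeps the parser usable both from the command line and from a notebook cell.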

In [34]:
# Run the following command to create a source distribution.
!cd {PYTHON_PACKAGE_APPLICATION_DIR} && python setup.py sdist --formats=gztar

running sdist
running egg_info
writing trainer.egg-info/PKG-INFO
writing dependency_links to trainer.egg-info/dependency_links.txt
writing requirements to trainer.egg-info/requires.txt
writing top-level names to trainer.egg-info/top_level.txt
reading manifest file 'trainer.egg-info/SOURCES.txt'
writing manifest file 'trainer.egg-info/SOURCES.txt'

running check
creating trainer-0.1
creating trainer-0.1/trainer.egg-info
copying files to trainer-0.1...
copying setup.py -> trainer-0.1
copying trainer.egg-info/PKG-INFO -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/SOURCES.txt -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/dependency_links.txt -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/requires.txt -> trainer-0.1/trainer.egg-info
copying trainer.egg-info/top_level.txt -> trainer-0.1/trainer.egg-info
Writing trainer-0.1/setup.cfg
Creating tar archive
removing 'trainer-0.1' (and everything under it)


In [35]:
# Upload the source distribution with the training application to the Cloud Storage bucket.
!gsutil cp {source_package_file_name} {python_package_gcs_uri}

Copying file://ai_catchform/dist/trainer-0.1.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  962.0 B/  962.0 B]                                                
Operation completed over 1 objects/962.0 B.                                      


In [36]:
# Validate that the source distribution exists in the Cloud Storage bucket.
!gsutil ls -l {python_package_gcs_uri}

       962  2022-12-23T08:09:05Z  gs://catch-ai/pytorch-on-gcp/ai-catchform-kobert/train/python_package/trainer-0.1.tar.gz
TOTAL: 1 objects, 962 bytes (962 B)
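
The listing above confirms that the archive was uploaded, but not that it contains the module later passed as `python_module_name`. The following is a minimal sketch of a local check using the standard `tarfile` module. The helper name `sdist_has_module` is hypothetical, and the sketch assumes a flat layout in which packages sit directly under the sdist's top-level `<name>-<version>/` directory.

```python
import tarfile


def sdist_has_module(sdist_path: str, module_name: str) -> bool:
    """Return True if the sdist archive contains the given dotted module."""
    # "pkg.mod" maps to "pkg/mod.py" or, for a package, "pkg/mod/__init__.py".
    rel = module_name.replace(".", "/")
    candidates = (f"{rel}.py", f"{rel}/__init__.py")
    with tarfile.open(sdist_path, "r:gz") as tar:
        for name in tar.getnames():
            # sdist members are prefixed with "<name>-<version>/"; drop the prefix.
            _, _, inner = name.partition("/")
            if inner in candidates:
                return True
    return False
```

In this notebook, the check could be run as `sdist_has_module(source_package_file_name, python_module_name)` before submitting the training job.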


### Run a custom job in Vertex AI using a pre-built container

In this notebook, you are using Hugging Face Datasets and fine-tuning a transformer model from the Hugging Face Transformers library for sentiment analysis tasks using PyTorch. You don't need to build a PyTorch environment from scratch for running the training application because Vertex AI provides pre-built containers.

Vertex AI pre-built containers are Docker container images that you can use for custom training. They include some common dependencies used in training code based on the machine learning framework and framework version.

You use a pre-built container for PyTorch and the packaged training application to run the training job on Vertex AI.

Configure a custom job with the pre-built container image for PyTorch and the training code packaged as a Python source distribution.

In [37]:
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=JOB_NAME,
    python_package_gcs_uri=python_package_gcs_uri,
    python_module_name=python_module_name,
    container_uri=PRE_BUILT_TRAINING_CONTAINER_IMAGE_URI,
)

### Run the Custom training job with the following parameters:

- `machine_type`: Machine type on which the job needs to run.
- `accelerator_type`: Hardware accelerator type for running the job. One of *ACCELERATOR_TYPE_UNSPECIFIED, NVIDIA_TESLA_K80, NVIDIA_TESLA_P100, NVIDIA_TESLA_V100, NVIDIA_TESLA_P4, NVIDIA_TESLA_T4.*
- `accelerator_count`: The number of accelerators to attach to a worker replica.
- `replica_count`: The number of worker replicas.
- `args`: Command line arguments to be passed to the Python script.
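
The `args` parameter takes a list of command-line arguments. A minimal sketch of building that list from a dictionary of hyperparameters follows. The helper `to_cli_args` and the hyperparameter names are hypothetical, and the training module would need to define matching flags.

```python
def to_cli_args(hparams: dict) -> list:
    """Turn {"num-epochs": 3} into ["--num-epochs=3"] for the `args` parameter."""
    return [f"--{name}={value}" for name, value in hparams.items()]


# Hypothetical hyperparameters; the training module must define matching flags.
training_args = to_cli_args({"num-epochs": 3, "batch-size": 16})
print(training_args)  # ['--num-epochs=3', '--batch-size=16']
```

The resulting list could then be passed as `args=training_args` when running the job.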

In [38]:
model = job.run(
    replica_count=1,
    machine_type=TRAIN_MACHINE_TYPE,
    accelerator_type=TRAIN_ACCELERATOR_TYPE,
    accelerator_count=TRAIN_ACCELERATOR_COUNT,
)

Training Output directory:
gs://catch-ai/aiplatform-custom-training-2022-12-23-17:09:21.576 
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1524510392745721856?project=1001227203935
View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/7495461063941423104?project=1001227203935
CustomPythonPackageTrainingJob projects/1001227203935/locations/us-central1/trainingPipelines/1524510392745721856 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomPythonPackageTrainingJob projects/1001227203935/locations/us-central1/trainingPipelines/1524510392745721856 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomPythonPackageTrainingJob projects/1001227203935/locations/us-central1/trainingPipelines/1524510392745721856 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomPythonPackageTrainingJob projects/1001227203935/locations/us-central1/trainingPipelines/1524510392745721856 current state:
Pipe

RuntimeError: Training failed with:
code: 3
message: "The replica workerpool0-0 exited with a non-zero status of 1. To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1001227203935&resource=ml_job%2Fjob_id%2F7495461063941423104&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%227495461063941423104%22"
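
The log URL in the error message carries a URL-encoded log filter. As a small sketch, the standard `urllib.parse` module can decode it into a readable filter that identifies the failing job's logs; the URL below is copied from the message above.

```python
from urllib.parse import parse_qs, urlparse

# Log URL copied from the error message above.
log_url = (
    "https://console.cloud.google.com/logs/viewer?project=1001227203935"
    "&resource=ml_job%2Fjob_id%2F7495461063941423104"
    "&advancedFilter=resource.type%3D%22ml_job%22%0A"
    "resource.labels.job_id%3D%227495461063941423104%22"
)

# parse_qs percent-decodes each query parameter value.
query = parse_qs(urlparse(log_url).query)
log_filter = query["advancedFilter"][0]
print(log_filter)
```

The decoded filter has two lines, one selecting the `ml_job` resource type and one selecting the job ID.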


In [11]:
# Create the repository in Artifact Registry
! gcloud artifacts repositories create {APP_NAME} --repository-format=docker --location={REGION} --description="Docker repository"

# List all repositories and check your repository
! gcloud artifacts repositories list

Listing items under project airy-runway-344101, across all locations.

                                                                                 ARTIFACT_REGISTRY
REPOSITORY               FORMAT  MODE                 DESCRIPTION                   LOCATION         LABELS  ENCRYPTION          CREATE_TIME          UPDATE_TIME          SIZE (MB)
cloud-run-source-deploy  DOCKER  STANDARD_REPOSITORY  Cloud Run Source Deployments  asia-northeast3          Google-managed key  2022-12-08T12:13:57  2022-12-08T12:26:05  1948.728
