# NAS (Neural Architecture Search) for Classification on Vertex AI with TF-vision

Make sure that you have read the [required documentation](https://cloud.google.com/vertex-ai/docs/training/neural-architecture-search/overview#reading_order)
before executing this notebook.
NOTE: This notebook is meant to run pre-built trainer code with pre-built search-spaces. If you want to run your own trainer
code or create your own NAS search-space from scratch, then do not use this notebook.

This notebook reproduces an example result from the [MnasNet](https://arxiv.org/abs/1807.11626) paper on ImageNet data.
According to the paper, MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, 
which is 1.8x faster than MobileNetV2 with 0.5% higher accuracy and 2.3x faster than NASNet with 1.2% higher accuracy.
However, this notebook uses GPUs instead of TPUs for training and uses cloud-CPU (n1-highmem-8) to evaluate latency.
With this notebook, the expected Stage-2 top-1 accuracy for MnasNet is 75.2% with 50ms latency on cloud-CPU (n1-highmem-8).
The detailed settings for this notebook are:
- Stage-1 search
    - Number of trials: 2000
    - Number of GPUs per trial: 2
    - GPU type: TESLA_T4
    - Avg single trial training time: 3 hours
    - Number of parallel trials: 10
    - GPU quota used: 10*2 = 20 T4 GPUs
    - Time to run: 25 days. If you have a higher GPU quota, then the runtime will decrease proportionately.
    - GPU hours: 12000 T4 GPU hours
- Stage-2 full-training with top 10 models
    - Number of trials: 10
    - Number of GPUs per trial: 4
    - GPU type: TESLA_T4
    - Avg single trial training time: ~9 days
    - Number of parallel trials: 10
    - GPU quota used: 10*4 = 40 T4 GPUs. You can also run this with just 20 T4 GPUs by running the job twice with 5 models at a time instead of all 10 in parallel.
    - Time to run: ~9 days
    - GPU hours: 8960 T4 GPU hours

- Stage-1 search cost: ~$15,000
- Stage-2 full-training cost: ~$8,000
- Total cost: ~$23,000
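
The GPU-hour and runtime figures above follow from simple arithmetic, which the sketch below reproduces. It assumes that trials run in full waves of the parallel-trial count, which is a simplification. The helper name `gpu_budget` is hypothetical, and the 224-hour stage-2 trial time is back-calculated from the stated 8960 GPU hours (8960 / 40 GPUs), not taken from a measurement.

```python
def gpu_budget(num_trials, gpus_per_trial, hours_per_trial, parallel_trials):
    """Return (total GPU hours, wall-clock days, GPUs in use at once)."""
    gpu_hours = num_trials * gpus_per_trial * hours_per_trial
    # Simplifying assumption: trials run in waves of `parallel_trials`.
    waves = -(-num_trials // parallel_trials)  # ceiling division
    wall_clock_days = waves * hours_per_trial / 24
    gpus_in_use = parallel_trials * gpus_per_trial
    return gpu_hours, wall_clock_days, gpus_in_use


# Stage 1: 2000 trials, 2 GPUs each, 3 hours each, 10 in parallel.
stage1 = gpu_budget(2000, 2, 3, 10)
print(stage1)  # (12000, 25.0, 20)

# Stage 2: 10 trials, 4 GPUs each, about 224 hours each (assumed, see above).
stage2_all_parallel = gpu_budget(10, 4, 224, 10)
stage2_two_batches = gpu_budget(10, 4, 224, 5)
print(stage2_all_parallel)
print(stage2_two_batches)
```

Running stage 2 in two batches of 5 halves the GPUs in use (20 instead of 40) and doubles the wall-clock time, while the total GPU hours stay the same.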


You can also test drive MnasNet with just a few trials at a much lower cost.
See [here for the test drive settings and cost](https://cloud.google.com/vertex-ai/docs/training/neural-architecture-search/overview#mnasnet_test_drive).
For this use case, run this notebook only up to and including the section 'Test drive only: Launch NAS stage 1 job with latency constraint'.


Here are the **prerequisites** before you can start using this notebook:
1. Your GCP project should have been (a) allow-listed and (b) [allocated a GPU quota](https://cloud.google.com/vertex-ai/docs/training/neural-architecture-search/environment-setup#device-quota) for the NAS jobs.
2. You have selected a Python 3 kernel to run this notebook.

# Install required libraries

In [None]:
%%sh

pip install tensorflow==2.7.0 --user
pip install tf-models-official==2.7.1
pip install pyglove==0.1.0

**NOTE: Please restart the notebook after the above libraries have been installed successfully.**

# Download source code

This needs to be done just once.


In [None]:
%%sh

# NOTE: mkdir -p does not fail if the directory already exists.
mkdir -p ~/nas_experiment

In [None]:
%%sh

rm -r -f ~/nas_experiment/nas_codes
git clone git@github.com:google/vertex-ai-nas.git ~/nas_experiment/nas_codes

# Set code path

In [None]:
import os
os.chdir('/home/jupyter/nas_experiment/nas_codes')

# Set up environment variables

Here we set up the environment variables.

NOTE: These have to be set up every time you start a new session because the later code blocks use them.


In [None]:
# Set a unique USER name. This is used for creating a unique job-name identifier.
%env USER=<fill>
# Set a region to launch jobs into.
# If you only want to test-drive and do not have enough GPU quota, then you can use 'us-central1' region
# which should have a default quota of 12 Nvidia T4 GPUs.
%env REGION=<fill>
# Set any unique docker-id for this run. The next section uses this id to tag the Docker image that it builds.
%env TRAINER_DOCKER_ID=<fill>
%env LATENCY_CALCULATOR_DOCKER_ID=<fill>
# The GCP project-id must be the one that has been allow-listed for the NAS jobs.
%env PROJECT_ID=<fill>
# Set an output working directory for the NAS jobs. The GCP project should have write access to 
# this GCS output directory. A simple way to ensure this is to use a bucket inside the same GCP project.
# NOTE: The region of the bucket must be the same as the job's region.
%env GCS_ROOT_DIR=<fill>
# Set the accelerator device type.
%env DEVICE=NVIDIA_TESLA_T4

# Set the GCS paths to the training and validation datasets. The GCP project should have read access to the data-location.
# Please read the "Data-Preparation" section 
# (https://cloud.google.com/vertex-ai/docs/training/neural-architecture-search/pre-built-trainer#data-preparation)
# in the documentation to ensure that the data is in an appropriate format
# suitable for the NAS pipeline. The documentation also mentions how you can download and prepare the ImageNet dataset.
# You can run the "Validate and Visualize data format" section in this notebook 
# to verify that the data can be loaded properly.
# Update the path to ImageNet data below.
%env STAGE1_TRAINING_DATA_PATH=<gs://path-to>/imagenet/train-00[0-8]??-of-01024
%env STAGE1_VALIDATION_DATA_PATH=<gs://path-to>/imagenet/train-009??-of-01024,<gs://path-to>/imagenet/train-01???-of-01024
%env STAGE2_TRAINING_DATA_PATH=<gs://path-to>/imagenet/train*
%env STAGE2_VALIDATION_DATA_PATH=<gs://path-to>/imagenet/validation*
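
The stage-1 patterns above split the ImageNet training shards into a search-training part and a held-out validation part. The sketch below checks that split locally. It assumes that the training set is stored as 1024 shards named `train-NNNNN-of-01024` (as the patterns imply) and that Python's `fnmatch` treats `?` and `[0-8]` the same way as the file matching used by the trainer. No GCS access is involved.

```python
from fnmatch import fnmatchcase

NUM_SHARDS = 1024  # assumption implied by the "-of-01024" suffix
shards = [f"train-{i:05d}-of-{NUM_SHARDS:05d}" for i in range(NUM_SHARDS)]

# File-name parts of the stage-1 patterns, without the bucket prefix.
train_patterns = ["train-00[0-8]??-of-01024"]
validation_patterns = ["train-009??-of-01024", "train-01???-of-01024"]


def matching(patterns):
    """Return the set of shard names matched by any of the patterns."""
    return {s for s in shards if any(fnmatchcase(s, p) for p in patterns)}


train = matching(train_patterns)
validation = matching(validation_patterns)

print(len(train), len(validation))            # 900 124
print(train.isdisjoint(validation))           # True
print(len(train | validation) == NUM_SHARDS)  # True
```

Under these assumptions, shards 00000 to 00899 are used for stage-1 training and shards 00900 to 01023 for stage-1 validation, with no overlap.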

**NOTE:** The following setup steps need to be done just once.

In [None]:
%%sh

# Authenticate docker for your artifact registry.
gcloud auth configure-docker ${REGION}-docker.pkg.dev

In [None]:
%%sh

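# NOTE: This needs to be done just once, the first time. It is ok for this to FAIL if the GCS bucket already exists.

# Create the output bucket in the same region as the jobs.
# This assumes that GCS_ROOT_DIR is a bucket URL such as gs://<bucket-name>.
gsutil mb -l ${REGION} ${GCS_ROOT_DIR}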

# Validate and Visualize data format

The following code verifies that the data can be loaded properly before you run the experiments.

In [None]:
import os

import matplotlib.pyplot as plt
import tensorflow as tf
from official.vision.beta.dataloaders import classification_input

# Read the TFRecord shards that match the stage-1 training file pattern.
dataset = tf.data.Dataset.list_files(os.environ['STAGE1_TRAINING_DATA_PATH'], shuffle=False).apply(tf.data.TFRecordDataset)
# Decode each serialized example into a dictionary of features.
dataset = dataset.map(classification_input.Decoder().decode).batch(1)

# Plot the first few images together with their labels.
num_examples = 4
_, ax = plt.subplots(num_examples, 1, figsize=(100, 64))
for (i, example) in enumerate(dataset.take(num_examples)):
    image = tf.io.decode_image(example['image/encoded'][0], channels=3)
    image.set_shape([None, None, 3])
    image = image.numpy()

    ax[i].imshow(image)
    ax[i].grid(False)
    ax[i].set_title('label: {}'.format(example['image/class/label'][0].numpy()))

# Build container
The containers must be built the first time and then every time the source code is modified. Otherwise, there is no need to run this step. This step internally builds Docker images from the *Dockerfiles* in the source-code directory and then pushes them to the cloud.

In [None]:
%%sh

# NOTE: This step can take several minutes when run for the first time.

python3 vertex_nas_cli.py build \
--project_id=${PROJECT_ID} \
--trainer_docker_id=${TRAINER_DOCKER_ID} \
--region=${REGION} \
--trainer_docker_file="tf_vision/nas_multi_trial.Dockerfile" \
--latency_calculator_docker_id=${LATENCY_CALCULATOR_DOCKER_ID} \
--latency_calculator_docker_file="tf_vision/latency_computation_using_saved_model.Dockerfile"

# Test drive only: Launch NAS stage 1 job with latency constraint
If you do not want to do a full MnasNet run and only want to test drive with [just a few trials as described here](https://cloud.google.com/vertex-ai/docs/training/neural-architecture-search/overview#mnasnet_test_drive),
then run only the following command and skip the rest of the notebook.

In [None]:
%%sh

DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${USER}_nas_tfvision_icn_latency_${DATE}"

CMD="
python3 vertex_nas_cli.py search \
--project_id=${PROJECT_ID} \
--region=${REGION} \
--trainer_docker_id=${TRAINER_DOCKER_ID} \
--job_name=${JOB_ID} \
--max_nas_trial=25 \
--max_parallel_nas_trial=6 \
--max_failed_nas_trial=10 \
--use_prebuilt_trainer=True \
--prebuilt_search_space="mnasnet" \
--accelerator_type=${DEVICE} \
--num_gpus=2 \
--root_output_dir=${GCS_ROOT_DIR} \
--latency_calculator_docker_id=${LATENCY_CALCULATOR_DOCKER_ID} \
--target_device_type=CPU \
--use_prebuilt_latency_calculator=True \
--search_docker_flags \
params_override="tf_vision/configs/experiments/mnasnet_search_gpu.yaml" \
training_data_path=${STAGE1_TRAINING_DATA_PATH} \
validation_data_path=${STAGE1_VALIDATION_DATA_PATH} \
model="classification" \
target_device_latency_ms=50
"

echo Executing command: ${CMD}
    
${CMD}
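
The `target_device_latency_ms=50` flag above sets the latency target for the search. For intuition, the MnasNet paper combines accuracy and latency into a single objective, ACC × (LAT / T)^w, with w = -0.07 in its soft-constraint setting. The sketch below implements that formula. Whether the pre-built trainer computes its reward in exactly this way is an assumption, and `mnasnet_reward` is a hypothetical helper name.

```python
def mnasnet_reward(accuracy, latency_ms, target_latency_ms=50.0,
                   alpha=-0.07, beta=-0.07):
    """Objective from the MnasNet paper: ACC * (LAT / T) ** w.

    w is alpha when the latency meets the target and beta otherwise.
    The paper's soft-constraint setting uses alpha = beta = -0.07.
    """
    w = alpha if latency_ms <= target_latency_ms else beta
    return accuracy * (latency_ms / target_latency_ms) ** w


# Same accuracy, three different latencies around the 50 ms target.
for latency in (25.0, 50.0, 100.0):
    print(latency, round(mnasnet_reward(0.75, latency), 4))
```

A model that exactly meets the target keeps its accuracy as its reward. Slower models are penalized, and faster models get a small bonus.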

# Launch NAS stage 1 job with latency constraint
If you want to customize this notebook for your own dataset other than ImageNet, then you must
read the [required documentation](https://cloud.google.com/vertex-ai/docs/training/neural-architecture-search/overview#reading_order)
to ensure that you set up the proxy task and other settings properly.

In [None]:
%%sh

DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${USER}_nas_tfvision_icn_latency_${DATE}"

CMD="
python3 vertex_nas_cli.py search \
--project_id=${PROJECT_ID} \
--region=${REGION} \
--trainer_docker_id=${TRAINER_DOCKER_ID} \
--job_name=${JOB_ID} \
--max_nas_trial=2000 \
--max_parallel_nas_trial=10 \
--max_failed_nas_trial=400 \
--use_prebuilt_trainer=True \
--prebuilt_search_space="mnasnet" \
--accelerator_type=${DEVICE} \
--num_gpus=2 \
--root_output_dir=${GCS_ROOT_DIR} \
--latency_calculator_docker_id=${LATENCY_CALCULATOR_DOCKER_ID} \
--target_device_type=CPU \
--use_prebuilt_latency_calculator=True \
--search_docker_flags \
params_override="tf_vision/configs/experiments/mnasnet_search_gpu.yaml" \
training_data_path=${STAGE1_TRAINING_DATA_PATH} \
validation_data_path=${STAGE1_VALIDATION_DATA_PATH} \
model="classification" \
target_device_latency_ms=50
"

echo Executing command: ${CMD}
    
${CMD}
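
The test-drive cell and the full stage-1 cell launch the same command and differ only in the trial counts. The sketch below builds that command as a Python argument list, using only the flags that appear in the cells above. The helper names `make_job_id` and `build_search_command` and the values in `example_env` are hypothetical placeholders. In the notebook you would pass `os.environ` instead.

```python
import shlex
from datetime import datetime


def make_job_id(user, now):
    """Mirror the shell cells: <USER>_nas_tfvision_icn_latency_<timestamp>."""
    return f"{user}_nas_tfvision_icn_latency_{now:%Y%m%d_%H%M%S}"


def build_search_command(env, job_id, max_trials, max_parallel, max_failed):
    """Return the stage-1 search command as an argument list."""
    return [
        "python3", "vertex_nas_cli.py", "search",
        f"--project_id={env['PROJECT_ID']}",
        f"--region={env['REGION']}",
        f"--trainer_docker_id={env['TRAINER_DOCKER_ID']}",
        f"--job_name={job_id}",
        f"--max_nas_trial={max_trials}",
        f"--max_parallel_nas_trial={max_parallel}",
        f"--max_failed_nas_trial={max_failed}",
        "--use_prebuilt_trainer=True",
        "--prebuilt_search_space=mnasnet",
        f"--accelerator_type={env['DEVICE']}",
        "--num_gpus=2",
        f"--root_output_dir={env['GCS_ROOT_DIR']}",
        f"--latency_calculator_docker_id={env['LATENCY_CALCULATOR_DOCKER_ID']}",
        "--target_device_type=CPU",
        "--use_prebuilt_latency_calculator=True",
        "--search_docker_flags",
        "params_override=tf_vision/configs/experiments/mnasnet_search_gpu.yaml",
        f"training_data_path={env['STAGE1_TRAINING_DATA_PATH']}",
        f"validation_data_path={env['STAGE1_VALIDATION_DATA_PATH']}",
        "model=classification",
        "target_device_latency_ms=50",
    ]


# Placeholder values for illustration only.
example_env = {
    "PROJECT_ID": "my-project",
    "REGION": "us-central1",
    "TRAINER_DOCKER_ID": "my-trainer",
    "LATENCY_CALCULATOR_DOCKER_ID": "my-latency-calculator",
    "DEVICE": "NVIDIA_TESLA_T4",
    "GCS_ROOT_DIR": "gs://my-bucket/nas",
    "STAGE1_TRAINING_DATA_PATH": "gs://my-bucket/imagenet/train-00[0-8]??-of-01024",
    "STAGE1_VALIDATION_DATA_PATH": "gs://my-bucket/imagenet/train-009??-of-01024",
}

job_id = make_job_id("alice", datetime(2024, 1, 2, 3, 4, 5))
test_drive = build_search_command(example_env, job_id, 25, 6, 10)
full_run = build_search_command(example_env, job_id, 2000, 10, 400)

# Only the three trial-count flags differ between the two runs.
changed = [(a, b) for a, b in zip(test_drive, full_run) if a != b]
print(changed)
print(shlex.join(full_run))
```

An argument list can be passed directly to `subprocess.run`, which avoids a second round of shell word splitting on the data-path patterns.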

# Inspect NAS search progress
Evaluating the search periodically while it is running can help you decide whether the search job has converged. This code block shows how to generate a summary of the top N trials so far.

In [None]:
# Set the stage1 search-job id. It's a numeric value returned by the Vertex service.
%env JOB_ID=<fill>

In [None]:
%%sh

mkdir -p /home/jupyter/nas_experiment/jobs
python3 vertex_nas_cli.py list_trials \
--project_id=${PROJECT_ID} \
--job_id=${JOB_ID} \
--region=${REGION} \
--trials_output_file=/home/jupyter/nas_experiment/jobs/${JOB_ID}.yaml

cat /home/jupyter/nas_experiment/jobs/${JOB_ID}.yaml
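
Once you have per-trial metrics, choosing candidates for stage 2 amounts to ranking the trials. The sketch below shows one way to pick the top N by accuracy under a latency cap. It does not parse the file written by `list_trials`: the record layout, the field names (`trial_id`, `accuracy`, `latency_ms`), and the values are all made up for illustration, and `top_trials` is a hypothetical helper.

```python
def top_trials(trials, n, max_latency_ms=None):
    """Return the n most accurate trials, optionally under a latency cap."""
    eligible = [
        t for t in trials
        if max_latency_ms is None or t["latency_ms"] <= max_latency_ms
    ]
    return sorted(eligible, key=lambda t: t["accuracy"], reverse=True)[:n]


# Made-up records for illustration; not real search output.
trials = [
    {"trial_id": 1, "accuracy": 0.61, "latency_ms": 47.0},
    {"trial_id": 2, "accuracy": 0.66, "latency_ms": 58.0},
    {"trial_id": 3, "accuracy": 0.64, "latency_ms": 49.5},
    {"trial_id": 4, "accuracy": 0.59, "latency_ms": 41.0},
]

best = top_trials(trials, n=2, max_latency_ms=50.0)
print([t["trial_id"] for t in best])  # [3, 1]
```

Trial 2 has the highest accuracy in this toy data but is excluded because it exceeds the 50 ms cap.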

# Launch NAS stage 2 job

In [None]:
%%sh

DATE="$(date '+%Y%m%d_%H%M%S')"

# Please modify "JOB_ID", "TRIAL_IDS", and the finetuning config file before running.
# JOB_ID is the numeric value that you can find in the job UI in the Google Cloud console.
JOB_ID=<fill>
# TRIAL_IDS are the best performing trials, which are to be finetuned.
TRIAL_IDS=<fill> # The top trials chosen for training to converge.

CMD="

python3 vertex_nas_cli.py train \
--project_id=${PROJECT_ID} \
--region=${REGION} \
--trainer_docker_id=${TRAINER_DOCKER_ID} \
--use_prebuilt_trainer=True \
--prebuilt_search_space="mnasnet" \
--train_accelerator_type=${DEVICE} \
--train_num_gpus=4 \
--root_output_dir=${GCS_ROOT_DIR} \
--search_job_id=${JOB_ID} \
--search_job_region=${REGION} \
--train_nas_trial_numbers=${TRIAL_IDS} \
--train_job_suffix="stage2_${DATE}" \
--train_docker_flags \
params_override="tf_vision/configs/experiments/mnasnet_search_finetune_gpu.yaml" \
training_data_path=${STAGE2_TRAINING_DATA_PATH} \
validation_data_path=${STAGE2_VALIDATION_DATA_PATH} \
model="classification"
"

echo Executing command: ${CMD}
    
${CMD}