# NAS (Neural Architecture Search) PyTorch-Based Example on Vertex AI Platform

Make sure that you have read the [required documentation](https://cloud.google.com/vertex-ai/docs/training/neural-architecture-search/overview#reading_order)
before executing this notebook.

NOTE: This notebook is meant to run pre-built trainer code with pre-built search spaces. If you want to run your own trainer
code or create your own NAS search space from scratch, do not use this notebook.

This notebook only demonstrates a PyTorch trainer example that uses multiple GPUs with GCS data on the Vertex AI platform.
It is not meant to be used as a benchmark of [MnasNet paper](https://arxiv.org/abs/1807.11626) performance. Refer to this [blog](https://cloud.google.com/blog/products/ai-machine-learning/efficient-pytorch-training-with-vertex-ai) for more details.
It runs only the stage-1 search, with just 2 trials and without latency constraints. It does not perform the stage-2 evaluation.
See the [TensorFlow MnasNet notebook](https://github.com/google/vertex-ai-nas/blob/main/notebooks/vertex_nas_classification_tfvision.ipynb) for a full workflow example.

Experiment setup:
- Stage-1 search
    - Number of trials: 2
    - Number of GPUs per trial: 2
    - GPU type: TESLA_V100
    - Avg single trial training time: 3 hours
    - Number of parallel trials: 2
    - GPU quota used: 2*2 = 4 V100 GPUs
    - Time to run: 0.125 days
    - GPU hours: 12 V100 GPU hours
- Stage-1 search cost: ~$53
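
The figures in the experiment setup follow from simple arithmetic on the trial settings. The short Python sketch below reproduces them; the variable names are chosen here for illustration, and the per-GPU-hour rate is only what the ~$53 estimate implies, not a quoted price.

```python
import math

# Values taken from the experiment setup listed above.
num_trials = 2
gpus_per_trial = 2
parallel_trials = 2
hours_per_trial = 3.0       # average single-trial training time
estimated_cost_usd = 53.0   # approximate stage-1 cost from the setup list

# GPUs in use at the same time, which is the quota the job needs.
gpu_quota = parallel_trials * gpus_per_trial

# Trials run in batches of `parallel_trials`; each batch takes about one trial's time.
batches = math.ceil(num_trials / parallel_trials)
wall_clock_days = batches * hours_per_trial / 24

# Total GPU time consumed across all trials.
gpu_hours = num_trials * gpus_per_trial * hours_per_trial

# Blended rate implied by the estimate (an inference, not a published price).
implied_rate_usd = estimated_cost_usd / gpu_hours

print(f"GPU quota: {gpu_quota}")
print(f"Time to run: {wall_clock_days} days")
print(f"GPU hours: {gpu_hours}")
print(f"Implied cost per GPU hour: ${implied_rate_usd:.2f}")
```

Changing `num_trials` or `parallel_trials` shows how quota and run time trade off: more parallel trials need more quota but fewer batches.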


Here are the **prerequisites** before you can start using this notebook:
1. Your GCP project should (a) be clear-listed and (b) have a GPU quota allocated for the NAS jobs. Refer to [this](https://cloud.google.com/vertex-ai/docs/training/neural-architecture-search/environment-setup#device-quota) to request GPU quota. The experiment setup above uses 4 V100 GPUs (2 parallel trials with 2 GPUs each).
2. You have selected a python3 kernel to run this notebook.

# Install required libraries

In [None]:
%%sh

pip install pyglove==0.1.0

**NOTE: Please restart the notebook after the above libraries have installed successfully.**
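
After the restart, it can help to confirm that the kernel sees the pinned version. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_version` is an illustration and not part of the NAS code.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


expected = "0.1.0"  # the version pinned in the install cell above
found = installed_version("pyglove")

if found == expected:
    print("pyglove version OK")
else:
    print(f"pyglove version is {found!r}, expected {expected!r}")
```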

# Download the code from Cloud Repository
**NOTE:** The following setup steps need to be done just once.



In [None]:
%%sh

# NOTE: `mkdir -p` succeeds even if the directory already exists.
mkdir -p ~/nas_experiment

In [None]:
%%sh

rm -r -f ~/nas_experiment/nas_codes
git clone git@github.com:google/vertex-ai-nas.git ~/nas_experiment/nas_codes

# Set code path

In [None]:
import os
# Change to the directory the repository was cloned into above.
os.chdir(os.path.expanduser('~/nas_experiment/nas_codes'))

# Set up environment variables
Here we set up the environment variables.

NOTE: These have to be set every time you start a new session because the later code blocks use them.


In [None]:
# Set a unique USER name. This is used for creating a unique job-name identifier.
%env USER=<fill>
# Set any unique Docker ID for this run. When the next section builds a Docker image, this ID is used to tag it.
%env TRAINER_DOCKER_ID=<fill>
# The GCP project-id must be the one that has been clear-listed for the NAS jobs. 
%env PROJECT_ID=<fill>
# Set an output working directory for the NAS jobs. The GCP project should have write access to 
# this GCS output directory. A simple way to ensure this is to use a bucket inside the same GCP project.
%env GCS_ROOT_DIR=<gs://fill>
# Set the region to be same as for your bucket. For example, `us-central1`.
%env REGION=<fill>
# Set the accelerator device type, for example NVIDIA_TESLA_V100.
%env DEVICE=NVIDIA_TESLA_V100
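
Because later cells read these variables, a quick check that none is missing or still holds a `<fill>` placeholder can catch mistakes early. This is a minimal sketch; the helper `find_unset` and its placeholder test are assumptions made for illustration.

```python
import os

REQUIRED_VARS = [
    "USER", "TRAINER_DOCKER_ID", "PROJECT_ID",
    "GCS_ROOT_DIR", "REGION", "DEVICE",
]


def find_unset(env, names=REQUIRED_VARS):
    """Return names that are missing, empty, or still look like placeholders."""
    bad = []
    for name in names:
        value = env.get(name, "")
        if not value or "<" in value or ">" in value:
            bad.append(name)
        elif name == "GCS_ROOT_DIR" and not value.startswith("gs://"):
            # The output directory is expected to be a GCS path.
            bad.append(name)
    return bad


problems = find_unset(os.environ)
if problems:
    print("Variables still to fill in:", ", ".join(problems))
else:
    print("All required variables are set.")
```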

**NOTE:** The following setup steps need to be done just once.

In [None]:
%%sh

# Enable the container registry API for your project.
gcloud services enable containerregistry.googleapis.com --project=${PROJECT_ID}

In [None]:
%%sh

# NOTE: This needs to be done only once.

# Create the output directory. 
# NOTE: It is ok for this step to fail if the directory already exists.
gsutil mkdir $GCS_ROOT_DIR

# Prepare dataset

You need to download the [ImageNet dataset](https://image-net.org/download.php) first. Then run the following commands to shard the dataset.

In [None]:
%%sh

# Shard training dataset
python3 pytorch/classification/shard_imagenet.py \
--image_list_file=<path-to-imagenet>/train_list.txt \
--output_pattern=<gcs-path>/imagenet/train-%06d.tar

# Shard validation dataset
python3 pytorch/classification/shard_imagenet.py \
--image_list_file=<path-to-imagenet>/validation_list.txt \
--output_pattern=<gcs-path>/imagenet/validation-%06d.tar

In [None]:
# Path to sharded training data
%env TRAIN_DATA_PATH=<gcs-path>/imagenet/train-{000000..000466}.tar
# Path to sharded validation data
%env EVAL_DATA_PATH=<gcs-path>/imagenet/validation-{000000..000021}.tar
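
The `{000000..000466}` part of each path is a brace range that names a run of shards rather than a single file. As a minimal sketch, assuming the range is inclusive and zero-padded as in shell brace expansion, the hypothetical helper below expands such a pattern in pure Python so you can count the shards or compare them with the bucket contents. The bucket name is a placeholder, and how the trainer itself expands the pattern is not shown here.

```python
import re

# Matches a numeric brace range such as {000000..000466}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(pattern):
    """Expand the first numeric brace range in `pattern` into a list of paths."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep the zero padding used in the pattern
    prefix = pattern[:match.start()]
    suffix = pattern[match.end():]
    return [
        f"{prefix}{index:0{width}d}{suffix}"
        for index in range(int(start), int(end) + 1)
    ]


# "gs://my-bucket" is a placeholder for your own <gcs-path>.
train_shards = expand_shards("gs://my-bucket/imagenet/train-{000000..000466}.tar")
eval_shards = expand_shards("gs://my-bucket/imagenet/validation-{000000..000021}.tar")

print(len(train_shards), train_shards[0], train_shards[-1])
print(len(eval_shards), eval_shards[0], eval_shards[-1])
```

The ranges used in this notebook therefore correspond to 467 training shards and 22 validation shards.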

# Build container
The container must be built the first time and then again every time the source code is modified. Otherwise, there is no need to run this step. This step builds the *Dockerfile* in the source-code directory and then pushes the resulting Docker image to the cloud.

In [None]:
%%sh

# NOTE: This step may take a few minutes the first time.
python3 vertex_nas_cli.py build \
--project_id=${PROJECT_ID} \
--region=${REGION} \
--trainer_docker_id=${TRAINER_DOCKER_ID} \
--trainer_docker_file=pytorch/classification/classification.Dockerfile

# Launch NAS stage-1 job

In [None]:
%%sh

# For proxy task set-up.
MAX_NAS_TRIAL=2
MAX_PARALLEL_NAS_TRIAL=2
MAX_FAILED_NAS_TRIAL=2

job_name="${USER}_pytorch_example_$(date '+%Y%m%d_%H%M%S')"

# Run in Cloud
CMD="

python3 vertex_nas_cli.py search \
--project_id=${PROJECT_ID} \
--region=${REGION} \
--trainer_docker_id=${TRAINER_DOCKER_ID} \
--search_space_module=pytorch.classification.search_space.mnasnet_search_space \
--job_name=${job_name} \
--root_output_dir=${GCS_ROOT_DIR} \
--accelerator_type=${DEVICE} \
--num_gpus=2 \
--max_nas_trial=${MAX_NAS_TRIAL} \
--max_parallel_nas_trial=${MAX_PARALLEL_NAS_TRIAL} \
--max_failed_nas_trial=${MAX_FAILED_NAS_TRIAL} \
--nas_target_reward_metric=top_1_accuracy \
--search_docker_flags \
config_file=pytorch/classification/mnasnet_search_gpu.yaml \
train_data_path=${TRAIN_DATA_PATH} \
eval_data_path=${EVAL_DATA_PATH}

"

echo Executing command: ${CMD}
    
${CMD}
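
The same command can also be assembled as a Python argument list, which avoids shell quoting and makes the job name easy to inspect before submitting. This is a sketch only: it reuses the flags from the cell above, the helper names `make_job_name` and `build_search_command` are hypothetical, and it prints the command instead of running it.

```python
import datetime
import os
import shlex

ENV_KEYS = [
    "USER", "PROJECT_ID", "REGION", "TRAINER_DOCKER_ID", "GCS_ROOT_DIR",
    "DEVICE", "TRAIN_DATA_PATH", "EVAL_DATA_PATH",
]


def make_job_name(user, now=None):
    """Build the same '<user>_pytorch_example_<timestamp>' name as the shell cell."""
    now = now or datetime.datetime.now()
    return f"{user}_pytorch_example_{now:%Y%m%d_%H%M%S}"


def build_search_command(env, job_name, max_trials=2, max_parallel=2, max_failed=2):
    """Return the stage-1 search command as a list of arguments."""
    return [
        "python3", "vertex_nas_cli.py", "search",
        f"--project_id={env['PROJECT_ID']}",
        f"--region={env['REGION']}",
        f"--trainer_docker_id={env['TRAINER_DOCKER_ID']}",
        "--search_space_module=pytorch.classification.search_space.mnasnet_search_space",
        f"--job_name={job_name}",
        f"--root_output_dir={env['GCS_ROOT_DIR']}",
        f"--accelerator_type={env['DEVICE']}",
        "--num_gpus=2",
        f"--max_nas_trial={max_trials}",
        f"--max_parallel_nas_trial={max_parallel}",
        f"--max_failed_nas_trial={max_failed}",
        "--nas_target_reward_metric=top_1_accuracy",
        "--search_docker_flags",
        "config_file=pytorch/classification/mnasnet_search_gpu.yaml",
        f"train_data_path={env['TRAIN_DATA_PATH']}",
        f"eval_data_path={env['EVAL_DATA_PATH']}",
    ]


# Fall back to a visible placeholder for any variable that is not set yet.
env = {key: os.environ.get(key, f"<{key}>") for key in ENV_KEYS}
command = build_search_command(env, make_job_name(env["USER"]))
print(shlex.join(command))
```

Once the printed command looks right, it could be launched from the repository root with `subprocess.run(command, check=True)`.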