Copyright (C) 2022 Intel Corporation
 
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
 
http://www.apache.org/licenses/LICENSE-2.0
 
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions
and limitations under the License.
 

SPDX-License-Identifier: Apache-2.0

# General Description

Version: 1.0
Date: Sep 15, 2022

This notebook outlines how to use Intel CPUs, Intel-optimized PyTorch, and Intel Extension for PyTorch (IPEX) on the Amazon EKS platform.

A BERT model is fine-tuned using the Hugging Face framework with IPEX optimization. The resulting FP32 BERT model is stored in an Amazon Elastic File System (EFS).

Users are free to build on any part of the code and customize it to suit their purposes.

# Prerequisites

- An EKS cluster is expected to be already hosted and set up properly.
   The EKS cluster should have:
    1. Kubeflow installed
    2. The relevant AWS credentials associated (i.e., you are able to use 'kubectl' to create and delete training jobs)
    3. An Elastic File System (EFS) that is accessible (read and write) by the cluster nodes
    4. Two or more nodes (optional but recommended)
   
   Please contact your EKS administrator to obtain the necessary information.
   
   You may also refer to the following webpages to create a new EKS cluster and install Kubeflow:
   - https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-eks-setup.html
   - https://www.eksworkshop.com/advanced/420_kubeflow/install/
   
- Set up the AWS credentials (e.g., with `aws configure`) in both:
   1. the container
   2. the Docker host

- Set the notebook kernel to use 'Python 3 (ipykernel)'
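
Before running the steps below, it may help to confirm that the command-line tools used in the `%%sh` cells are available in the current environment. The following is a minimal sketch, not part of the original notebook; the function name `missing_tools` and the tool list are illustrative assumptions.

```python
import shutil

# Command-line tools called from the %%sh cells of this notebook.
REQUIRED_TOOLS = ["aws", "kubectl"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

This only checks that the executables exist; it does not verify that the AWS credentials are valid.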

# Step 1: Build a custom docker image for training

    1. Copy the contents of "../src/eks_training_container" to a location outside the Docker container.
    2. Update the AWS settings in build_and_push.sh.
       Pay attention to the region, account number, algorithm_name, and any firewall restrictions.
    3. Run build_and_push.sh to build the custom Docker image for training.

Note: 
- Users may edit "train.py" to adjust the training task, use a different BERT model, or change the behavior of Intel Neural Compressor.
- This reference design assumes that the EFS is mounted at /data and is accessible by the cluster nodes. The trained model is stored under /data.
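
The region, account number, and algorithm_name together determine the name of the pushed image. As a rough sketch, and assuming that build_and_push.sh pushes to Amazon ECR with algorithm_name as the repository name (the script's contents are not shown here, so this is an assumption), the resulting image URI could be composed as follows. The helper `ecr_image_uri` and the example values are hypothetical.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Compose an Amazon ECR image URI from its parts.

    Assumption: the image is hosted in Amazon ECR, whose URIs have the form
    <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>.
    """
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Hypothetical values, for illustration only.
print(ecr_image_uri("123456789012", "us-west-2", "my-training-image"))
```

A string of this form is what the `image` field of the training config file would point to in Step 3.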

# Step 2a: Set kubectl to the target cluster

First, point `kubectl` at the target cluster.

Modify and run the following command, replacing the example values with your own:
- `<region>` is the Amazon EKS region (`us-west-2` in the example).
- `<cluster-name>` is the cluster name shown in the Amazon Elastic Kubernetes Service (`eks-clustereks120` in the example).

In [None]:
%%sh
aws eks --region us-west-2 update-kubeconfig --name eks-clustereks120
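
If the region and cluster name change often, the same command can be assembled from variables rather than edited by hand. The sketch below only builds the argument list; the helper name `update_kubeconfig_command` is an assumption made for illustration.

```python
import shlex


def update_kubeconfig_command(region, cluster_name):
    """Build the `aws eks update-kubeconfig` call as an argument list."""
    return ["aws", "eks", "--region", region,
            "update-kubeconfig", "--name", cluster_name]


# Same example values as the cell above.
command = update_kubeconfig_command("us-west-2", "eks-clustereks120")
print(shlex.join(command))
```

To execute it from Python instead of a `%%sh` cell, the list can be passed to `subprocess.run(command, check=True)`.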

# Step 2b: Get the details of the nodes

The following command will list the nodes we are able to use.

In [None]:
%%sh 
kubectl get nodes -o wide
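
For scripting, the node list can also be requested as JSON with `kubectl get nodes -o json` and filtered in Python. The following sketch extracts the names of nodes that report a `Ready` condition. The function name `ready_nodes` and the sample node names are hypothetical, and the sample document is heavily trimmed.

```python
import json


def ready_nodes(nodes_json):
    """Return the names of nodes whose 'Ready' condition has status 'True'."""
    names = []
    for item in json.loads(nodes_json)["items"]:
        conditions = item.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            names.append(item["metadata"]["name"])
    return names


# Trimmed, hypothetical stand-in for the output of `kubectl get nodes -o json`.
sample = json.dumps({"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]})
print(ready_nodes(sample))
```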

# Step 2c: Label the nodes

To achieve efficient distributed training, the Kubernetes nodes must be labeled properly. When we perform the training, we will provide a config file (in .yaml format) that distributes the jobs to the nodes by label. Based on the output above, run the following command to label the nodes.

Run the command once for each node you want to label, replacing the content in the '<>'.

In [None]:
%%sh
kubectl label nodes <name-of-the-node> nodename=<node1> --overwrite 
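
With more than a few nodes, the label commands can be generated rather than typed one by one. This is a sketch under the assumption that labels of the form `node1`, `node2`, ... are acceptable; the helper `label_commands` and the node names are hypothetical.

```python
def label_commands(node_names, prefix="node"):
    """Build one `kubectl label nodes` argument list per node.

    Nodes are labeled nodename=<prefix>1, nodename=<prefix>2, ... in the
    order given.
    """
    return [
        ["kubectl", "label", "nodes", name,
         f"nodename={prefix}{index}", "--overwrite"]
        for index, name in enumerate(node_names, start=1)
    ]


# Hypothetical node names; use the names printed in Step 2b.
for command in label_commands(["node-a", "node-b"]):
    print(" ".join(command))
```

Each generated list can be run with `subprocess.run(command, check=True)`, or the printed lines can be pasted into a `%%sh` cell.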

# Step 3: Assign the training job by modifying the config file (.yaml)

The training job is created by calling 'kubectl' on the config file. By modifying the config file, we can assign the computational resources (i.e. the nodes and EFS) for the training job.

Modify distributed_training.yaml to specify the resources, in particular the following items:
1. name        -- change to a name that suits the task. This name is used when checking the logs and pod execution.
2. replicas    -- it is recommended to set 1 for the Master and n-1 for the Worker, where n is the total number of nodes
3. image       -- the image built in Step 1
4. mountPath   -- the path where the fine-tuned model is stored
5. claimName   -- the name of the claim for the EFS storage. Ask your EKS administrator for the details if needed
6. cpu         -- it is recommended to set the value to (n/2) + 1, where n is the number of vCPUs of the instance. This 
                  is a hint for Kubernetes to distribute the training job evenly across the compute nodes: two pods 
                  that each request more than half of a node's vCPUs cannot be placed on the same node. Otherwise 
                  the job may stay on 1 node and result in poor performance (i.e. not the expected distributed 
                  training)
7. memory      -- check the instance type and request an amount that suits the task
8. values      -- use the labels set in Step 2c. This tells Kubernetes to use specific nodes for training and to assign the correct affinity.
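
The recommendations for `replicas` and `cpu` reduce to simple arithmetic. The sketch below computes them for a given cluster; the function name `plan_job_resources` and the dictionary keys are illustrative, and integer division is assumed for n/2.

```python
def plan_job_resources(num_nodes, vcpus_per_node):
    """Apply the sizing recommendations from the list above.

    - 1 Master replica and (num_nodes - 1) Worker replicas
    - a CPU request of (vcpus_per_node / 2) + 1, using integer division,
      so that two pods cannot fit on the same node
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if vcpus_per_node < 1:
        raise ValueError("vcpus_per_node must be at least 1")
    return {
        "master_replicas": 1,
        "worker_replicas": num_nodes - 1,
        "cpu_request": vcpus_per_node // 2 + 1,
    }


# Example: 2 nodes with 96 vCPUs each.
print(plan_job_resources(2, 96))
# → {'master_replicas': 1, 'worker_replicas': 1, 'cpu_request': 49}
```

With a request of 49 vCPUs per pod on a 96-vCPU node, two pods would need 98 vCPUs, so the scheduler has to place them on different nodes.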

Note: If the distributed training performance is not as expected, use the following command to check whether the training job is distributed across different nodes.

`kubectl get pods -o wide`
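
The same check can be automated by requesting JSON output (`kubectl get pods -o json`) and counting pods per node. The following is a minimal sketch; the function name `pods_per_node`, the placeholder `<unscheduled>`, and the pod and node names in the sample are hypothetical.

```python
import json
from collections import Counter


def pods_per_node(pods_json):
    """Count pods per node from `kubectl get pods -o json` output.

    Pods that have not been scheduled yet have no spec.nodeName and are
    counted under the placeholder '<unscheduled>'.
    """
    items = json.loads(pods_json)["items"]
    return Counter(item["spec"].get("nodeName", "<unscheduled>")
                   for item in items)


# Trimmed, hypothetical stand-in for the real command output.
sample = json.dumps({"items": [
    {"metadata": {"name": "job-master-0"}, "spec": {"nodeName": "node-a"}},
    {"metadata": {"name": "job-worker-0"}, "spec": {"nodeName": "node-b"}},
]})
counts = pods_per_node(sample)
print(dict(counts))
```

If every count is 1, each pod runs on its own node. A count above 1 suggests that the CPU request or the node affinity in the config file needs another look.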

# Step 4: Start the training

After defining the pod specification in the previous step, users may start the training job on the EKS cluster by running the command in the next cell.

Users may use the following commands to check the execution status and logs of the training job:
1. `kubectl logs -f <name-of-the-job/pod>`
2. `kubectl describe pods`

In [None]:
%%sh
kubectl create -f ../src/eks_training_container/distributed_training.yaml
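
To tell when training has finished without watching the logs, the pod phases can be read from `kubectl get pods -o json`. The sketch below summarizes them. The function names and the pod names in the sample are hypothetical, and it assumes that only the training pods are present in the namespace.

```python
import json


def pod_phases(pods_json):
    """Map each pod name to its phase (Pending, Running, Succeeded, ...)."""
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in json.loads(pods_json)["items"]
    }


def training_finished(phases):
    """Return True when at least one pod exists and all pods have Succeeded."""
    return bool(phases) and all(p == "Succeeded" for p in phases.values())


# Trimmed, hypothetical stand-in for the real command output.
sample = json.dumps({"items": [
    {"metadata": {"name": "job-master-0"}, "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "job-worker-0"}, "status": {"phase": "Running"}},
]})
phases = pod_phases(sample)
print(phases, training_finished(phases))
```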

# Step 5: Clean up

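After the job is completed, the fine-tuned FP32 model should be stored in the EFS mountPath (in this case, /data). Users may then release the resources by running the following two commands. Note that `kubectl delete pod --all` deletes every pod in the current namespace, not only the training pods.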

In [None]:
%%sh
kubectl delete <pytorchjobs.kubeflow.org/the-job-name> #e.g.: pytorchjobs.kubeflow.org/xxxxxxxxxxxxx
kubectl delete pod --all