Copyright (C) 2022 Intel Corporation
 
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
 
http://www.apache.org/licenses/LICENSE-2.0
 
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions
and limitations under the License.
 

SPDX-License-Identifier: Apache-2.0

# General Description

Version: 1.0
Date: Sep 16, 2022

This notebook outlines how to quantize an FP32 Hugging Face model to INT8 with Intel Neural Compressor and how to run inference with it on recent Intel CPUs with VNNI instruction support.

The quantized model is stored in an Amazon Elastic File System (EFS) volume.

Users are free to reuse any part of the code and customize it to suit their purposes.
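
Since the INT8 speed-up relies on VNNI, it can help to confirm that a node's CPU actually exposes it. The following is a minimal sketch, assuming a Linux node where `/proc/cpuinfo` lists CPU flags and where VNNI appears as `avx512_vnni` or `avx_vnni`; the helper names `has_vnni` and `read_cpuinfo` are illustrative and not part of this reference design.

```python
def has_vnni(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line lists an AVX-512 VNNI or AVX VNNI flag."""
    vnni_flags = {"avx512_vnni", "avx_vnni"}
    for line in cpuinfo_text.splitlines():
        # Each logical CPU has a line such as "flags : fpu vme ... avx512_vnni ..."
        if line.startswith("flags") and ":" in line:
            flags = set(line.split(":", 1)[1].split())
            if flags & vnni_flags:
                return True
    return False


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the Linux CPU description file; return '' if it is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


if __name__ == "__main__":
    print("VNNI supported:", has_vnni(read_cpuinfo()))
```

Running this on each candidate node (for example through a shell on the node) shows whether that node is a good target for the INT8 model.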

# Prerequisite

1. An EKS cluster is expected to be already hosted and set up properly.
   The EKS cluster should have:
   i.   Kubeflow installed
   ii.  The related AWS credentials associated (i.e. you are able to use 'kubectl' to create/delete training jobs)
   iii. An Elastic File System (EFS) that is accessible (read and write) by the cluster nodes
   iv.  2 or more nodes (optional but recommended)
   
   Please contact your EKS administrator to obtain the necessary information.
   
   You may also refer to the following webpages to create a new EKS cluster and install Kubeflow:
   - https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-eks-setup.html
   - https://www.eksworkshop.com/advanced/420_kubeflow/install/
   
2. Set up the AWS credentials (e.g. with `aws configure`) in both:
   i.  the container
   ii. the Docker host

3. Set the notebook kernel to use 'Python 3 (ipykernel)'
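
Before going further, a quick check that the command-line tools used in this notebook are installed can save time. This is a minimal sketch; the tool list (`aws`, `kubectl`, `docker`) is an assumption based on the commands that appear in the steps below, and `missing_tools` is an illustrative helper name.

```python
import shutil


def missing_tools(tools=("aws", "kubectl", "docker")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

Note that this only checks that the executables exist; it does not verify that the AWS credentials are valid.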

# Step 1: Build a custom docker image for inference

    1. Copy the contents of "../src/eks_inference_container" to a location outside the docker container.
    2. Modify the AWS credentials in build_and_push.sh.
       Pay attention to the region, account number, algorithm_name and any firewall issues.
    3. Run build_and_push.sh to build the custom docker image for inference.

Note:
- Users may change "inc_quantization.py" to adjust how the quantized model is written to the EFS, and inc_config.yaml to change the quantization type/distillation settings.
- This reference design assumes that the EFS is mounted at /data and is accessible by the cluster nodes. The previously trained model is stored under /data. Intel Neural Compressor reads the FP32 model from there and quantizes it.

# Step 2a: Set kubectl to the target cluster

We will first point `kubectl` at the target cluster.

Please modify and run the following command to set the proper cluster target for `kubectl`:
- `<region>` is the Amazon EKS Region
- `<cluster-name>` is the cluster name shown in Amazon Elastic Kubernetes Service

In [None]:
%%sh
aws eks --region <region> update-kubeconfig --name <cluster-name>

# Step 2b: Get the details of the nodes

The following command will list the nodes we are able to use.

In [None]:
%%sh 
kubectl get nodes -o wide

# Step 2c: Label the nodes

Quantization needs only a single node. Please label the node with the highest computational power, as quantization may take some time.
You may run the following command several times to label nodes manually. Please replace the placeholders in '<>'.

In [None]:
%%sh
kubectl label nodes <name-of-the-node> nodename=<node1> --overwrite 
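
When several nodes need labels, the repeated command above can be generated rather than typed by hand. The following is a minimal sketch that only builds the command lines; the `node1`, `node2`, ... numbering scheme and the helper name `label_commands` are assumptions for illustration, and the node names in the example are hypothetical.

```python
def label_commands(node_names, label_key="nodename", prefix="node"):
    """Build one `kubectl label nodes` command per node, numbering labels from 1."""
    commands = []
    for index, node in enumerate(node_names, start=1):
        commands.append(
            ["kubectl", "label", "nodes", node,
             f"{label_key}={prefix}{index}", "--overwrite"]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical node names; use the NAME column printed by `kubectl get nodes`.
    nodes = ["ip-10-0-0-1.ec2.internal", "ip-10-0-0-2.ec2.internal"]
    for cmd in label_commands(nodes):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to execute it, or printed and pasted into a `%%sh` cell.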

# Step 3: Assign the training job by modifying the config file (.yaml)

The quantization job is created by calling 'kubectl' on the config file (distributed_1node_inc.yaml). By modifying the config file, we can assign the computational resources (i.e. the nodes and the EFS) to the job.

Please modify distributed_1node_inc.yaml to specify the resources, especially the following items:
1. name        -- change to a name that suits the task. It is used when checking the logs and the pod execution.
2. replicas    -- it is recommended to set 1 for the Master and n-1 for the Worker, where n is the total number of nodes
3. image       -- the image you built in Step 1
4. mountPath   -- the path where the fine-tuned model is stored
5. claimName   -- the name of the EFS storage claim. You may need to ask your EKS administrator for the details
6. cpu         -- it is recommended to set the value to (n/2) + 1, where n is the number of vCPUs of the instance. This
                  is a hint for Kubernetes to distribute the training job evenly across the computational nodes.
                  Otherwise the job may stay on 1 node and result in poor performance (i.e. not the expected distributed
                  training)
7. memory      -- check the instance type and request an amount that suits the nature of the task
8. values      -- use the labels set in Step 2c. This tells Kubernetes to use specific nodes for training and to assign the correct affinity.
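
The `replicas` and `cpu` recommendations above reduce to simple arithmetic. The reasoning behind (n/2) + 1 is that two pods which each request more than half of a node's vCPUs cannot both fit on that node, so the scheduler has to place them on different nodes. A minimal sketch follows, assuming integer division for odd vCPU counts; the function names are illustrative only.

```python
def recommended_cpu_request(vcpus: int) -> int:
    """CPU request of (n/2) + 1, so two such pods cannot share one n-vCPU node."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return vcpus // 2 + 1


def recommended_replicas(total_nodes: int) -> dict:
    """One Master replica plus n-1 Worker replicas for n nodes."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    return {"Master": 1, "Worker": total_nodes - 1}


if __name__ == "__main__":
    # Example: an instance type with 8 vCPUs in a 2-node cluster.
    print("cpu request:", recommended_cpu_request(8))
    print("replicas:", recommended_replicas(2))
```

For the single-node quantization job in this notebook, `recommended_replicas(1)` gives one Master and no Workers.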

# Step 4: Start the quantization

After defining the pod specification in the previous step, users may start the quantization job on the EKS cluster by running the command in the next cell.

Users may use the following commands to check the execution status/logs of the job:
1. `kubectl logs -f <name-of-the-job/pod>`
2. `kubectl describe pods`

In [None]:
%%sh
kubectl create -f ../src/eks_inference_container/distributed_1node_inc.yaml
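
To check from Python whether the job has finished, the JSON output of `kubectl get pods -o json` can be summarized into a name-to-phase mapping. This is a minimal sketch that works on the JSON text only; the pod name in the example is hypothetical, and `pod_phases` and `all_finished` are illustrative helper names.

```python
import json


def pod_phases(kubectl_json: str) -> dict:
    """Map each pod name to its phase (Pending, Running, Succeeded, Failed, Unknown)."""
    data = json.loads(kubectl_json)
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in data.get("items", [])
    }


def all_finished(phases: dict) -> bool:
    """True when there is at least one pod and every pod has Succeeded or Failed."""
    return bool(phases) and all(p in ("Succeeded", "Failed") for p in phases.values())


if __name__ == "__main__":
    # Hypothetical output shape of `kubectl get pods -o json`, trimmed to the used fields.
    sample = json.dumps({
        "items": [
            {"metadata": {"name": "inc-quantization-master-0"},
             "status": {"phase": "Running"}}
        ]
    })
    phases = pod_phases(sample)
    print(phases, "finished:", all_finished(phases))
```

In practice the JSON text could come from `subprocess.run(["kubectl", "get", "pods", "-o", "json"], capture_output=True, text=True).stdout`.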

# Step 5: Performing inference using the quantized model

Once the model has been quantized, users may deploy the quantized model.
The following code block is a simple example to demonstrate how to perform data preprocessing and retrieve the result.

Please copy the code to an EKS node (with access to the EFS) to try it out.
For the package dependencies, please refer to the Dockerfile (in inference_container/).

In [None]:
# Copyright (c) 2020-2022 Intel Corporation.
# SPDX-License-Identifier: BSD-3-Clause

import argparse

import torch
from datasets import load_dataset
from torch.nn import functional as F
from transformers import AutoTokenizer

if __name__ == '__main__':
    # Parse command-line arguments; the defaults match the paths used in this notebook
    parser = argparse.ArgumentParser()
    parser.add_argument("--quantized_model_path", type=str, default='/data/quantized_model')
    parser.add_argument("--max_seq_length", type=int, default=128)
    parser.add_argument("--model_name", type=str, default='bert-base-uncased')
    args, _ = parser.parse_known_args()

    # Load the INT8 model written to the EFS by the quantization job
    from neural_compressor.utils.load_huggingface import OptimizedModel
    model = OptimizedModel.from_pretrained(
            args.quantized_model_path
        )

    # Load the MRPC dataset
    # A single input can also be passed as a dictionary with the same keys
    train_dataset, test_dataset = load_dataset('glue', 'mrpc', split=["train", "test"])

    tokenizer = AutoTokenizer.from_pretrained(args.model_name)

    correct_pred = 0
    # Run inference on one sentence pair at a time; no gradients are needed
    with torch.no_grad():
        for example in train_dataset:
            # Preprocess the input; truncation and padding are applied at call time
            example_inputs = tokenizer(
                example['sentence1'], example['sentence2'],
                truncation=True, padding="max_length",
                max_length=args.max_seq_length, return_tensors="pt")

            # Perform inference and pick the most probable class
            results = model(**example_inputs)
            probability = F.softmax(results['logits'][0], dim=0)
            pred = torch.argmax(probability).item()

            # Compare against the ground-truth label
            if pred == example['label']:
                correct_pred += 1

    print('Accuracy: ' + str(correct_pred / len(train_dataset)))
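
The post-processing step in the loop above (softmax over the logits, then argmax) can be illustrated without loading the model. The following is a minimal pure-Python sketch; the logit values are made up for illustration, and for MRPC label 1 corresponds to "equivalent" and label 0 to "not_equivalent".

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return (predicted_label, probability_of_that_label)."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


if __name__ == "__main__":
    # Made-up logits for one sentence pair: [not_equivalent, equivalent]
    label, prob = predict([-1.2, 2.3])
    print("predicted label:", label, "probability:", round(prob, 3))
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits yields the same label; the softmax is only needed when the probability itself is of interest.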

# Step 6: Clean up

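After the quantization is completed, the INT8 model should be stored under the EFS mountPath (in this case, /data). Users may wish to release the resources by running the following two commands. Note that `kubectl delete pod --all` deletes every pod in the current namespace, not only the pods created by this job.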

In [None]:
%%sh
kubectl delete <pytorchjobs.kubeflow.org/the-job-name> #e.g.: pytorchjobs.kubeflow.org/xxxxxxxxxxxxx
kubectl delete pod --all
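
Before running the delete commands above, it may be worth confirming that the quantized model was actually written to the EFS. This is a minimal sketch, assuming the output directory is `/data/quantized_model` (the default path used by the inference script in Step 5) and that it is run on a machine where the EFS is mounted; `quantized_model_files` is an illustrative helper name.

```python
import os


def quantized_model_files(model_dir="/data/quantized_model"):
    """Return the sorted file names in model_dir, or [] if the directory is missing."""
    if not os.path.isdir(model_dir):
        return []
    return sorted(
        name for name in os.listdir(model_dir)
        if os.path.isfile(os.path.join(model_dir, name))
    )


if __name__ == "__main__":
    files = quantized_model_files()
    if files:
        print("Found quantized model files:", files)
    else:
        print("No files found; check the job logs before cleaning up.")
```

An empty result suggests that the job has not finished or wrote its output elsewhere, in which case the logs from Step 4 are the first place to look.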