diff --git a/_blog.yml b/_blog.yml index 4be0a367d7..5acee7f41d 100644 --- a/_blog.yml +++ b/_blog.yml @@ -1095,4 +1095,13 @@ date: August 10, 2022 tags: - guide - - nlp \ No newline at end of file + - nlp + +- local: deploy-tfserving-kubernetes + title: "Deploying 🤗 ViT on Kubernetes with TF Serving" + author: chansung + thumbnail: /blog/assets/94_tf_serving_kubernetes/thumb.png + date: August 15, 2022 + tags: + - guide + - cv diff --git a/assets/94_tf_serving_kubernetes/manifest_propagation.png b/assets/94_tf_serving_kubernetes/manifest_propagation.png new file mode 100644 index 0000000000..cae011ec5c Binary files /dev/null and b/assets/94_tf_serving_kubernetes/manifest_propagation.png differ diff --git a/assets/94_tf_serving_kubernetes/thumb.png b/assets/94_tf_serving_kubernetes/thumb.png new file mode 100644 index 0000000000..3eb916a5a0 Binary files /dev/null and b/assets/94_tf_serving_kubernetes/thumb.png differ diff --git a/deploy-tfserving-kubernetes.md b/deploy-tfserving-kubernetes.md new file mode 100644 index 0000000000..33db3c20b3 --- /dev/null +++ b/deploy-tfserving-kubernetes.md @@ -0,0 +1,580 @@ +--- +title: Deploying 🤗 ViT on Kubernetes with TF Serving +thumbnail: /blog/assets/94_tf_serving_kubernetes/thumb.png +--- + +

# Deploying 🤗 ViT on Kubernetes with TF Serving

*Published August 15, 2022*

*Chansung Park (chansung) and Sayak Paul (sayakpaul), guest contributors*
# Introduction

In the [previous post](https://huggingface.co/blog/tf-serving-vision), we showed how
to deploy a [Vision Transformer (ViT)](https://huggingface.co/docs/transformers/main/en/model_doc/vit)
model from 🤗 Transformers locally with TensorFlow Serving. We covered
topics like embedding preprocessing and postprocessing operations within
the Vision Transformer model, handling gRPC requests, and more!

While local deployments are an excellent head start to building
something useful, in real-life projects you need deployments that can
serve many users. In this post, you'll learn how to scale the
local deployment from the previous post with Docker and Kubernetes.
We therefore assume some familiarity with Docker and Kubernetes.

This post builds on top of the [previous post](https://huggingface.co/blog/tf-serving-vision), so we highly
recommend reading it first. You can find all the code
discussed throughout this post in [this repository](https://github.com/sayakpaul/deploy-hf-tf-vision-models/tree/main/hf_vision_model_onnx_gke).

# Why go with Docker and Kubernetes?

The basic workflow of scaling up a deployment like ours includes the
following steps:

- **Containerizing the application logic**: The application logic
  involves a served model that can handle requests and return
  predictions. For containerization, Docker is the industry-standard
  go-to.

- **Deploying the Docker container**: You have various options here. The most
  widely used option is deploying the Docker container on a Kubernetes
  cluster. Kubernetes provides numerous deployment-friendly features
  (e.g. autoscaling and security). You can use a solution like
  [Minikube](https://minikube.sigs.k8s.io/docs/start/) to
  manage Kubernetes clusters locally, or a managed offering like
  [Elastic Kubernetes Service (EKS)](https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html).

You might be wondering why use an explicit setup like this in the age of
[SageMaker](https://aws.amazon.com/sagemaker/) and [Vertex AI](https://cloud.google.com/vertex-ai),
which provide ML deployment-specific features right off the bat. It's a fair
question.

The above workflow is widely adopted in the industry, and many
organizations benefit from it. It has already been battle-tested for
many years. It also gives you more granular control over your
deployments while abstracting away the non-trivial bits.

This post uses [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
to provision and manage a Kubernetes cluster. We assume you already have a
billing-enabled GCP project if you're using GKE. Also, note that you need to
configure the [`gcloud`](https://cloud.google.com/sdk/gcloud) utility to
perform the deployment on GKE. But the concepts discussed in this post
apply equally should you decide to use Minikube.

**Note**: The code snippets shown in this post can be executed on a Unix terminal
as long as you have configured the `gcloud` utility along with Docker and `kubectl`.
More instructions are available in the [accompanying repository](https://github.com/sayakpaul/deploy-hf-tf-vision-models/tree/main/hf_vision_model_onnx_gke).

# Containerization with Docker

The serving model can handle raw image inputs as bytes and performs the
preprocessing and postprocessing itself.

In this section, you'll see how to containerize that model using the
[base TensorFlow Serving image](http://hub.docker.com/r/tensorflow/serving/tags/).
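Before building the image, it can help to double-check what the exported model expects and returns. Below is a minimal, optional sketch that assumes you have already extracted the `SavedModel` (the next paragraphs recap how to obtain it) into the `models/hf-vit/1` layout used throughout this post:

```py
import tensorflow as tf

# Optional sanity check: load the SavedModel and inspect its serving signature.
# Assumes the SavedModel has been extracted to models/hf-vit/1 (see below).
loaded = tf.saved_model.load("models/hf-vit/1")
serving_fn = loaded.signatures["serving_default"]

print("Inputs:", serving_fn.structured_input_signature)
print("Outputs:", serving_fn.structured_outputs)
```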
TensorFlow Serving consumes models
in the [`SavedModel`](https://www.tensorflow.org/guide/saved_model) format. Recall how you
obtained such a `SavedModel` in the [previous post](https://huggingface.co/blog/tf-serving-vision). We assume that
you have the `SavedModel` compressed in `tar.gz` format. You can fetch
it from [here](https://huggingface.co/deploy-hf-tf-vit/vit-base16-extended/resolve/main/saved_model.tar.gz)
just in case. The `SavedModel` should then be placed in the special directory
structure of `<MODEL_NAME>/<VERSION>`. This is how TensorFlow Serving simultaneously
manages multiple deployments of models with different versions.

## Preparing the Docker image

The shell script below places the `SavedModel` in `hf-vit/1` under the
parent directory `models`. You'll copy everything inside it when preparing
the Docker image. There is only one model in this example, but this
is a more generalizable approach.

```bash
$ MODEL_TAR=model.tar.gz
$ MODEL_NAME=hf-vit
$ MODEL_VERSION=1
$ MODEL_PATH=models/$MODEL_NAME/$MODEL_VERSION

$ mkdir -p $MODEL_PATH
$ tar -xvf $MODEL_TAR --directory $MODEL_PATH
```

Below, we show how the `models` directory is structured in our case:

```bash
$ find /models
/models
/models/hf-vit
/models/hf-vit/1
/models/hf-vit/1/keras_metadata.pb
/models/hf-vit/1/variables
/models/hf-vit/1/variables/variables.index
/models/hf-vit/1/variables/variables.data-00000-of-00001
/models/hf-vit/1/assets
/models/hf-vit/1/saved_model.pb
```

The custom TensorFlow Serving image should be built on top of the [base one](http://hub.docker.com/r/tensorflow/serving/tags/).
There are various ways to do this, but here you'll do it by running a Docker container, as illustrated in the
[official documentation](https://www.tensorflow.org/tfx/serving/serving_kubernetes#commit_image_for_deployment). We start by running the `tensorflow/serving` image in background mode and then copy the entire `models` directory into the running container
as shown below.

```bash
$ docker run -d --name serving_base tensorflow/serving
$ docker cp models/ serving_base:/models/
```

We used the official Docker image of TensorFlow Serving as the base, but
you can use ones that you have [built from source](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/setup.md#building-from-source)
as well.

**Note**: TensorFlow Serving benefits from hardware optimizations that leverage instruction sets such as
[AVX512](https://en.wikipedia.org/wiki/AVX-512). These
instruction sets can [speed up deep learning model inference](https://huggingface.co/blog/bert-cpu-scaling-part-1). So,
if you know the hardware on which the model will be deployed, it's often
beneficial to obtain an optimized build of the TensorFlow Serving image
and use it throughout.
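If you're unsure whether the target hosts support such instruction sets, a quick look at the CPU flags can tell you. The snippet below is a convenience sketch and assumes a Linux host:

```py
# Convenience sketch (Linux only): check the host's CPU flags to see whether
# AVX-512 is available before seeking out an optimized TF Serving build.
with open("/proc/cpuinfo") as f:
    cpu_flags = f.read()

print("AVX-512F supported:", "avx512f" in cpu_flags)
```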
Now that the running container has all the required files in the
appropriate directory structure, we need to create a new Docker image
that includes these changes. This can be done with the [`docker commit`](https://docs.docker.com/engine/reference/commandline/commit/) command below, and you'll have a new Docker image named `$NEW_IMAGE`.
One important thing to note is that you need to set the `MODEL_NAME`
environment variable to the model name, which is `hf-vit` in this
case. This tells TensorFlow Serving what model to deploy.

```bash
$ NEW_IMAGE=tfserving:$MODEL_NAME

$ docker commit \
  --change "ENV MODEL_NAME $MODEL_NAME" \
  serving_base $NEW_IMAGE
```

## Running the Docker image locally

Next, you can run the newly built Docker image locally to check that it
works as expected. Below you see the output of the `docker run` command. Since
the output is verbose, we trimmed it down to focus on the important
bits. Also, it is worth noting that the server opens up ports `8500` and `8501`
for the gRPC and HTTP/REST endpoints, respectively.

```shell
$ docker run -p 8500:8500 -p 8501:8501 -t $NEW_IMAGE &


---------OUTPUT---------
(Re-)adding model: hf-vit
Successfully reserved resources to load servable {name: hf-vit version: 1}
Approving load for servable version {name: hf-vit version: 1}
Loading servable version {name: hf-vit version: 1}
Reading SavedModel from: /models/hf-vit/1
Reading SavedModel debug info (if present) from: /models/hf-vit/1
Successfully loaded servable version {name: hf-vit version: 1}
Running gRPC ModelServer at 0.0.0.0:8500 ...
Exporting HTTP/REST API at:localhost:8501 ...
```

## Pushing the Docker image

The final step here is to push the Docker image to an image repository.
You'll use [Google Container Registry (GCR)](https://cloud.google.com/container-registry) for this
purpose. The following lines of code can do this for you:

```bash
$ GCP_PROJECT_ID=<GCP_PROJECT_ID>
$ GCP_IMAGE=gcr.io/$GCP_PROJECT_ID/$NEW_IMAGE

$ gcloud auth configure-docker
$ docker tag $NEW_IMAGE $GCP_IMAGE
$ docker push $GCP_IMAGE
```

Since we're using GCR, you need to prefix the Docker image tag with
`gcr.io/<GCP_PROJECT_ID>` ([note](https://cloud.google.com/container-registry/docs/pushing-and-pulling) the other supported formats too).
With the Docker image prepared and pushed to GCR, you can now proceed to deploy it on a
Kubernetes cluster.

# Deploying on a Kubernetes cluster

Deployment on a Kubernetes cluster requires the following:

- Provisioning a Kubernetes cluster, done with [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine) (GKE) in
  this post. However, you're welcome to use other platforms and tools
  like EKS or Minikube.

- Connecting to the Kubernetes cluster to perform a deployment.

- Writing YAML manifests.

- Performing the deployment with the manifests using a utility tool,
  [`kubectl`](https://kubernetes.io/docs/reference/kubectl/).

Let's go over each of these steps.

## Provisioning a Kubernetes cluster on GKE

You can use a shell script like so for this (available
[here](https://github.com/sayakpaul/deploy-hf-tf-vision-models/blob/main/hf_vision_model_tfserving_gke/provision_gke_cluster.sh)):

```bash
$ GKE_CLUSTER_NAME=tfs-cluster
$ GKE_CLUSTER_ZONE=us-central1-a
$ NUM_NODES=2
$ MACHINE_TYPE=n1-standard-8

$ gcloud container clusters create $GKE_CLUSTER_NAME \
    --zone=$GKE_CLUSTER_ZONE \
    --machine-type=$MACHINE_TYPE \
    --num-nodes=$NUM_NODES
```

GCP offers a variety of machine types so you can configure the deployment
the way you want. We encourage you to refer to the
[documentation](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create)
to learn more about them.

Once the cluster is provisioned, you need to connect to it to perform
the deployment. Since GKE is used here, you also need to authenticate
yourself.
You can use a shell script like so to do both of these:

```bash
$ GCP_PROJECT_ID=<GCP_PROJECT_ID>

$ export USE_GKE_GCLOUD_AUTH_PLUGIN=True

$ gcloud container clusters get-credentials $GKE_CLUSTER_NAME \
    --zone $GKE_CLUSTER_ZONE \
    --project $GCP_PROJECT_ID
```

The `gcloud container clusters get-credentials` command takes care of
both connecting to the cluster and authentication. Once this is done,
you're ready to write the manifests.

## Writing Kubernetes manifests

Kubernetes manifests are written in [YAML](https://yaml.org/)
files. While it's possible to use a single manifest file to perform the
deployment, creating separate manifest files is often beneficial for
keeping concerns separated. It's common to use three manifest
files for achieving this:

- `deployment.yaml` defines the desired state of the Deployment by
  providing the name of the Docker image, additional arguments when
  running the Docker image, the ports to open for external access,
  and the resource limits.

- `service.yaml` defines how external clients connect to Pods inside
  the Kubernetes cluster.

- `hpa.yaml` defines rules to scale the number of Pods in the
  Deployment up and down based on criteria such as CPU utilization.

You can find the relevant manifests for this post
[here](https://github.com/sayakpaul/deploy-hf-tf-vision-models/tree/main/hf_vision_model_tfserving_gke/.kube/base).
Below, we present a pictorial overview of how these manifests are
consumed.

![](./assets/94_tf_serving_kubernetes/manifest_propagation.png)

Next, we go through the important parts of each of these manifests.

**`deployment.yaml`**:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: tfs-server
  name: tfs-server
...
spec:
  containers:
  - image: gcr.io/$GCP_PROJECT_ID/tfserving-hf-vit:latest
    name: tfs-k8s
    imagePullPolicy: Always
    args: ["--tensorflow_inter_op_parallelism=2",
           "--tensorflow_intra_op_parallelism=8"]
    ports:
    - containerPort: 8500
      name: grpc
    - containerPort: 8501
      name: restapi
    resources:
      limits:
        cpu: 800m
      requests:
        cpu: 800m
...
```

You can configure names like `tfs-server` and `tfs-k8s` any way you
want. Under `containers`, you specify the Docker image URI the
deployment will use. Setting the allowed bounds of `resources` for the
container lets the current resource utilization be monitored, which in turn
lets the Horizontal Pod Autoscaler (discussed later) decide whether to scale
the number of containers up or down. `requests.cpu` is the minimal amount of
CPU resources, set by operators, needed for the container to work correctly.
Here, `800m` means 80% of a single CPU core. The HPA then monitors the average
CPU utilization relative to the sum of `requests.cpu` across all Pods to make
scaling decisions.

Besides Kubernetes-specific configuration, you can specify TensorFlow
Serving-specific options in `args`. In this case, you have two:

- `tensorflow_inter_op_parallelism`, which sets the number of threads
  to run in parallel to execute independent operations. The
  recommended value for this is 2.

- `tensorflow_intra_op_parallelism`, which sets the number of threads
  to run in parallel to execute individual operations. The recommended
  value is the number of physical cores the deployment CPU has.

You can learn more about these options (and others) and tips on tuning
them for deployment from
[here](https://www.tensorflow.org/tfx/serving/performance) and
[here](https://github.com/IntelAI/models/blob/master/docs/general/tensorflow_serving/GeneralBestPractices.md).
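One practical note: the `image` field above references `$GCP_PROJECT_ID`, which Kubernetes will not expand for you; it has to be replaced with your actual project ID before the manifest is applied (for example with `envsubst`, a Kustomize `images:` override, or a small script). A minimal, hypothetical Python sketch of the substitution:

```py
import os

# Hypothetical helper: expand $GCP_PROJECT_ID (and any other $VARS) referenced
# in the manifest from environment variables, then write out a rendered copy
# that can be passed to `kubectl apply -f`.
os.environ.setdefault("GCP_PROJECT_ID", "my-gcp-project")  # replace with your project ID

with open("deployment.yaml") as f:
    rendered = os.path.expandvars(f.read())

with open("deployment.rendered.yaml", "w") as f:
    f.write(rendered)
```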
**`service.yaml`**:

```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: tfs-server
  name: tfs-server
spec:
  ports:
  - port: 8500
    protocol: TCP
    targetPort: 8500
    name: tf-serving-grpc
  - port: 8501
    protocol: TCP
    targetPort: 8501
    name: tf-serving-restapi
  selector:
    app: tfs-server
  type: LoadBalancer
```

We made the service type `LoadBalancer` so the endpoints are
exposed outside the Kubernetes cluster. It selects the Pods of the
`tfs-server` Deployment and lets external clients connect to them via
the specified ports. We open two ports, `8500` and `8501`, for gRPC and
HTTP/REST connections, respectively.

**`hpa.yaml`**:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tfs-server

spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tfs-server
  minReplicas: 1
  maxReplicas: 3
  targetCPUUtilizationPercentage: 80
```

HPA stands for **H**orizontal **P**od **A**utoscaler. It sets criteria
to decide when to scale the number of Pods in the target Deployment. You
can learn more about the autoscaling algorithm internally used by
Kubernetes [here](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale).

Here you specify how Kubernetes should handle autoscaling. In
particular, you define the replica bounds within which it should perform
autoscaling (`minReplicas` and `maxReplicas`) as well as the target CPU
utilization. `targetCPUUtilizationPercentage` is an important metric
for autoscaling. The following thread aptly summarizes what it means
(taken from [here](https://stackoverflow.com/a/42530520/7636462)):

> The CPU utilization is the average CPU usage of all Pods in a
deployment across the last minute divided by the requested CPU of this
deployment. If the mean of the Pods' CPU utilization is higher than the
target you defined, your replicas will be adjusted.

Recall specifying `resources` in the deployment manifest. By
specifying the `resources`, the Kubernetes control plane starts
monitoring the metrics, so that `targetCPUUtilizationPercentage` works.
Otherwise, the HPA doesn't know the current status of the Deployment.

You can experiment and set these numbers based on your
requirements. Note, however, that autoscaling will be contingent on the
quota you have available on GCP since GKE internally uses [Google Compute Engine](https://cloud.google.com/compute)
to manage these resources.

## Performing the deployment

Once the manifests are ready, you can apply them to the currently
connected Kubernetes cluster with the
[`kubectl apply`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply)
command.

```bash
$ kubectl apply -f deployment.yaml
$ kubectl apply -f service.yaml
$ kubectl apply -f hpa.yaml
```

While using `kubectl` is fine for applying each of the manifests to
perform the deployment, it can quickly become cumbersome if you have many
different manifests. This is where a utility like
[Kustomize](https://kustomize.io/) can be helpful.
You simply
define another specification named `kustomization.yaml` like so:

```yaml
commonLabels:
  app: tfs-server
resources:
- deployment.yaml
- hpa.yaml
- service.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
```

Then it's just a one-liner to perform the actual deployment:

```bash
$ kustomize build . | kubectl apply -f -
```

Complete instructions are available
[here](https://github.com/sayakpaul/deploy-hf-tf-vision-models/tree/main/hf_vision_model_tfserving_gke).
Once the deployment has been performed, we can retrieve the endpoint IP
like so:

```bash
$ kubectl rollout status deployment/tfs-server
$ kubectl get svc tfs-server --watch

---------OUTPUT---------
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)                         AGE
tfs-server   LoadBalancer   xxxxxxxxxx   xxxxxxxxxx    8500:30869/TCP,8501:31469/TCP   xxx
```

Note down the external IP when it becomes available.

And that sums up all the steps you need to deploy your model on
Kubernetes! Kubernetes elegantly provides abstractions for complex bits
like autoscaling and cluster management while letting you focus on
the crucial aspects you should care about when deploying a model. These
include resource utilization, security (we didn't cover that here),
performance north stars like latency, etc.

# Testing the endpoint

Once you have the external IP for the endpoint, you can use the
following listing to test it (replace `<ENDPOINT-IP>` with the IP you
noted down earlier):

```py
import tensorflow as tf
import json
import base64
import requests

image_path = tf.keras.utils.get_file(
    "image.jpg", "http://images.cocodataset.org/val2017/000000039769.jpg"
)
bytes_inputs = tf.io.read_file(image_path)
b64str = base64.urlsafe_b64encode(bytes_inputs.numpy()).decode("utf-8")
data = json.dumps(
    {"signature_name": "serving_default", "instances": [b64str]}
)

json_response = requests.post(
    "http://<ENDPOINT-IP>:8501/v1/models/hf-vit:predict",
    headers={"content-type": "application/json"},
    data=data
)
print(json.loads(json_response.text))

---------OUTPUT---------
{'predictions': [{'label': 'Egyptian cat', 'confidence': 0.896659195}]}
```

If you're interested in knowing how this deployment would perform when it
receives more traffic, we recommend checking out [this article](https://blog.tensorflow.org/2022/07/load-testing-TensorFlow-Servings-REST-interface.html).
Refer to the corresponding [repository](https://github.com/sayakpaul/deploy-hf-tf-vision-models/tree/main/locust)
to learn more about running load tests with Locust and visualizing the results.

# Notes on different TF Serving configurations

TensorFlow Serving
[provides](https://www.tensorflow.org/tfx/serving/serving_config)
various options to tailor the deployment based on your application use
case. Below, we briefly discuss some of them.

**`enable_batching`** enables the batch inference capability that
collects incoming requests within a certain time window,
collates them as a batch, performs a batch inference, and returns the
results of each request to the appropriate clients. TensorFlow Serving
provides a rich set of configurable options (such as `max_batch_size`,
`num_batch_threads`) to tailor it to your deployment needs. You can learn
more about them
[here](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md). Batching is
particularly beneficial for applications where you don't need predictions from a model
instantly. In those cases, you'd typically gather multiple samples into
batches and then send those batches for prediction. Luckily, TensorFlow
Serving handles all of this for you once its batching capability is enabled.
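As a rough illustration of what enabling it involves (not something the accompanying repository configures), batching is typically switched on by passing `--enable_batching=true` and pointing `--batching_parameters_file` at a small text-protobuf file. The sketch below simply writes such a file; the field names follow the TensorFlow Serving batching documentation, and the values are illustrative rather than tuned:

```py
# Sketch: write a batching parameters file that TF Serving could be pointed at
# via --enable_batching=true --batching_parameters_file=/path/to/batching.config.
# Field names follow the TF Serving batching docs; the values below are
# illustrative and should be tuned for your model and traffic patterns.
batching_config = """
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 8 }
"""

with open("batching.config", "w") as f:
    f.write(batching_config.strip() + "\n")
```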
**`enable_model_warmup`** uses dummy input data to warm up TensorFlow
components that are otherwise lazily instantiated. This way, you can
ensure everything is loaded appropriately and that there is no
lag when the first real requests arrive.

# Conclusion

In this post and the associated [repository](https://github.com/sayakpaul/deploy-hf-tf-vision-models),
you learned about deploying the Vision Transformer model
from 🤗 Transformers on a Kubernetes cluster. If you're doing this for
the first time, the steps may appear to be a little daunting, but once
you get the hang of them, they'll soon become an essential part of your
toolbox. If you were already familiar with this workflow, we hope this post was still beneficial
for you.

We applied the same deployment workflow to an ONNX-optimized version of the same
Vision Transformer model. For more details, check out [this link](https://github.com/sayakpaul/deploy-hf-tf-vision-models/tree/main/hf_vision_model_onnx_gke). ONNX-optimized models are especially beneficial if you're using x86 CPUs for deployment.

In the next post, we'll show you how to perform these deployments with
significantly less code using [Vertex AI](https://cloud.google.com/vertex-ai): more like
`model.deploy(autoscaling_config=...)` and boom! We hope you're just as
excited as we are.

# Acknowledgement

Thanks to the ML Developer Relations Program team at Google, which
provided us with GCP credits for conducting the experiments.