# 10.1 Kubernetes and TensorFlow Serving

In this chapter we'll use TensorFlow Serving to serve our clothing classification model. TF Serving is a tool from the TensorFlow family that was created specifically for serving TensorFlow models. It is written in C++, so it is very efficient, but it focuses on inference only. You cannot use it for anything else.

How does it work?
TF Serving receives a request containing the X matrix, which is the already prepared image, and returns a NumPy array with 10 predictions (in our case, because we have 10 different classes). Users will not do the pre-processing themselves, so we need a component between the user and TF Serving, which we call the gateway. The gateway takes a URL, downloads the image, resizes it, turns it into a NumPy array, pre-processes it, and sends it to TF Serving. It then post-processes the output and returns the predictions in a consumable format (e.g. JSON). The only thing the user needs to do is upload the image to the website that uses the gateway. We'll implement the gateway with Flask and then deploy both the gateway and TF Serving to Kubernetes. One benefit of using TF Serving is that it can use a GPU to apply the model, which involves a lot of matrix multiplications.

How is this chapter organized?
- We'll take the model we already trained with Keras and convert it to the format that TF Serving expects, which is called the "saved_model" format.
- We'll deploy this model locally with Docker and see how to interact with it.
- After that we'll create the pre-processing service, which we call the gateway. This gives us two services, each running in its own Docker container. We need to ensure that both can talk to each other.
- Then we'll talk about Docker-compose as a way of running two services that communicate with each other on one machine.
- Then we'll look at the main concepts of Kubernetes.
- After that we'll deploy a simple application to Kubernetes and set it up. We'll run Kubernetes locally using Kind, a lightweight Kubernetes that you can run on your local machine.
- Then we'll take the services that we created and deploy them to Kubernetes.
- Finally we'll move everything from our local Kubernetes cluster to a cluster in the cloud. We'll use EKS, which is a managed Kubernetes from AWS, but it should work with any cloud provider.


# 10.2 TensorFlow Serving
## The saved_model format
Here we'll again use the model that was trained for the book (xception_v4_large_08_0.894.h5). We can use wget to download the model and save it as clothing-model-v4.h5. Then we convert the model from the h5 format to the saved_model format, which takes only a few lines of code. You can run them in IPython.

In [1]:
import tensorflow as tf
from tensorflow import keras

model = keras.models.load_model('./clothing-model-v4.h5')

tf.saved_model.save(model, 'clothing-model')


In [2]:
!ls -lhR

.:
total 83M
-rw-rw-r-- 1 peter peter    0 Nov  26 18:18 10-KubTFServing.ipynb
drwxr-xr-x 4 peter peter 4,0K Nov  26 19:05 clothing-model
-rw-rw-r-- 1 peter peter  83M Nov  26 19:03 clothing-model-v4.h5

./clothing-model:
total 2,3M
drwxr-xr-x 2 peter peter 4,0K Nov  26 19:05 assets
-rw-rw-r-- 1 peter peter   57 Nov  26 19:05 fingerprint.pb
-rw-rw-r-- 1 peter peter 2,3M Nov  26 19:05 saved_model.pb
drwxr-xr-x 2 peter peter 4,0K Nov  26 19:05 variables

./clothing-model/assets:
total 0

./clothing-model/variables:
total 83M
-rw-rw-r-- 1 peter peter 83M Nov  26 19:05 variables.data-00000-of-00001
-rw-rw-r-- 1 peter peter 15K Nov  26 19:05 variables.index


Now we can look at what's inside the model using the saved_model_cli utility.

In [3]:
!saved_model_cli show --dir clothing-model --all


In the output of this command, "serving_default" is the name of the signature definition. This is a technical detail, but we need this value when we invoke our model.
The signature has an input and an output. The input is called 'input_28'. Its shape is 299x299x3, and the leading -1 means that a batch can contain arbitrarily many images.
The output is called 'dense_22'. Its shape is 10, and the -1 again means that there is one set of outputs per image in the batch.
So we need the following pieces of information from this definition:
- serving_default
- input_28 - input
- dense_22 - output

## Running TF-Serving locally with Docker


Now we can use this information to run TF Serving locally with Docker. As a reminder, -p 8500:8500 means that port 8500 on the host is mapped to port 8500 inside the container. The same rule applies to mounting a folder with -v: the first part is the folder on the host machine, and the second part is the folder inside the container.


In [3]:
!docker run -it --rm -p 8500:8500 -v "$(pwd)/clothing-model:/models/clothing-model/1" -e MODEL_NAME="clothing-model" tensorflow/serving:2.7.0



## Invoking the model from Jupyter

The code for this step is in the notebook tf-serving-connect.ipynb.

# 10.3 Creating a Pre-Processing Service

In this lesson we'll take the Jupyter notebook we prepared last time and turn it into a Flask application. As a reminder, we used the notebook to communicate with our model, which was deployed with TensorFlow Serving. We took an image of pants, pre-processed it, turned it into Protobuf, and sent it to TensorFlow Serving. The response then had to be post-processed to turn it into something human-readable.

## Converting Jupyter notebook to Python script
Before turning it into a Flask application, we convert the notebook into a Python script using "nbconvert".

!jupyter nbconvert --to script tf-serving-connect.ipynb

After converting, the file is renamed to "gateway.py".

After cleaning up the code, we add a function "prepare_request(X)" which turns "X" into a Protobuf request. Then we add a function "predict" which takes a URL and turns the image behind it into X. We add a third function "prepare_response" which takes the Protobuf response, extracts the float values, and combines them with the classes.
To test the script, we add a few more lines.

In [None]:
if __name__ == '__main__':
    url = 'http://bit.ly/mlbookcamp-pants'
    response = predict(url)
    print(response)
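
The post-processing step described above can be sketched without any TensorFlow dependency. The snippet below is only an illustration: the function name `combine_with_classes` is hypothetical, it takes a plain list of floats instead of a real Protobuf response, and the class names are an assumption about the labels the clothing model was trained on.

```python
# Assumed class labels, in the order of the model's 10 outputs.
CLASSES = [
    "dress", "hat", "longsleeve", "outwear", "pants",
    "shirt", "shoes", "shorts", "skirt", "t-shirt",
]


def combine_with_classes(scores, classes=CLASSES):
    """Pair each raw model output with its class name."""
    if len(scores) != len(classes):
        raise ValueError("expected one score per class")
    # Convert to built-in floats so the result can be serialized as JSON.
    return dict(zip(classes, (float(s) for s in scores)))


# Made-up scores for illustration; a real response comes from TF Serving.
example_scores = [-1.8, -4.7, -2.3, -1.1, 9.9, -2.8, -3.6, 3.2, -2.6, -4.8]
result = combine_with_classes(example_scores)
best_class = max(result, key=result.get)
print(best_class)
```

In the real gateway, the list of floats would come from the output tensor of the Protobuf response rather than from a hard-coded example.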

## Turning Python script into Flask application

Now that we've seen the script working, let's turn it into a Flask application. To do so, we need a few more imports and lines of code.

In [None]:
from flask import Flask
from flask import request
from flask import jsonify

...

app = Flask('gateway')


@app.route('/predict', methods=['POST'])
def predict_endpoint():
    data = request.get_json()
    url = data['url']
    result = predict(url)
    return jsonify(result)

if __name__ == '__main__':
    # url = 'http://bit.ly/mlbookcamp-pants'
    # response = predict(url)
    # print(response)
    app.run(debug=True, host='0.0.0.0', port=9696)

Now we can start the app. To test the Flask application we again need a test.py script. We can copy the code from an earlier session and adapt the URL.

In [None]:
import requests

url = 'http://localhost:9696/predict'

data = {'url': 'http://bit.ly/mlbookcamp-pants'}

result = requests.post(url, json=data).json()
print(result)

When you run this test script, you should see the same output as before.

## Putting everything into Pipenv
For using Pipenv we need to install a bunch of things.

In [1]:
!pip install pipenv
!pipenv install grpcio==1.42.0 flask gunicorn keras-image-helper



We're not installing TensorFlow and TensorFlow Serving here, which is why we don't want to import TensorFlow in our code. We only need the function "tf.make_tensor_proto(data, shape=data.shape)", and we don't want to drag in the whole TensorFlow library (approx. 1.7GB) just for this one function. There is a lighter-weight version called "tensorflow-cpu", but it is still about 400MB. Alexey provides a solution for that: the tensorflow-protobuf package.
You can install it with:

In [2]:
!pip install tensorflow-protobuf==2.7.0

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow-protobuf==2.7.0
  Downloading tensorflow_protobuf-2.7.0-py3-none-any.whl (228 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m229.0/229.0 kB[0m [31m370.2 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
Installing collected packages: tensorflow-protobuf
Successfully installed tensorflow-protobuf-2.7.0


This package contains everything that is needed. There is no need to install TensorFlow or TensorFlow Serving.

# 10.4 Running everything locally with Docker-compose
In this session we'll run the TensorFlow Serving model and the gateway service together locally, using Docker-compose.

## Preparing the docker images
The image we use is the official TensorFlow Serving image, but it doesn't contain the model.

docker run -it --rm \
    -p 8500:8500 \
    -v $(pwd)/clothing-model:/models/clothing-model/1 \
    -e MODEL_NAME="clothing-model" \
    tensorflow/serving:2.7.0

Next time, we want to run the Docker image without mounting the folder and setting the variable, because for deployment the image has to be self-contained. So let's create a Dockerfile (image-model.dockerfile) for that and build it with:

docker build -t zoomcamp-10-model:xception-v4-001 -f image-model.dockerfile .
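
The contents of image-model.dockerfile are not shown in these notes. As a rough sketch, and assuming the file simply bakes into the image what the -v and -e flags provided at run time, the Python snippet below generates such a Dockerfile. The helper name `model_dockerfile` is hypothetical; FROM, COPY, and ENV are standard Dockerfile instructions.

```python
def model_dockerfile(base_image, model_dir, model_name, version=1):
    """Return Dockerfile text for a self-contained TF Serving image."""
    lines = [
        f"FROM {base_image}",
        # Takes the place of: -v $(pwd)/clothing-model:/models/clothing-model/1
        f"COPY {model_dir} /models/{model_name}/{version}",
        # Takes the place of: -e MODEL_NAME="clothing-model"
        f'ENV MODEL_NAME="{model_name}"',
    ]
    return "\n".join(lines) + "\n"


dockerfile_text = model_dockerfile(
    base_image="tensorflow/serving:2.7.0",
    model_dir="clothing-model",
    model_name="clothing-model",
)
print(dockerfile_text)
```

The printed text can be saved as image-model.dockerfile next to the clothing-model folder before running the build command.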

To run this image

docker run -it --rm \
    -p 8500:8500 \
    zoomcamp-10-model:xception-v4-001

To test this so far we can use the gateway.py file with:

if __name__ == '__main__':
    url = 'http://bit.ly/mlbookcamp-pants'
    response = predict(url)
    print(response)
    # app.run(debug=True, host='0.0.0.0', port=9696)

pipenv run python gateway.py

Now we have a Docker image for the TensorFlow Serving model. We need the same for our gateway service.
The Dockerfile for it is named image-gateway.dockerfile.

docker build -t zoomcamp-10-gateway:001 -f image-gateway.dockerfile .

To run this image

docker run -it --rm \
    -p 9696:9696 \
    zoomcamp-10-gateway:001

Now both containers are running and we can test them with the test.py file.

This will fail and return the message:

status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"

The reason is that the gateway was not able to reach TensorFlow Serving. The gateway tries to connect to port 8500 on localhost, but inside the gateway container, localhost is the container running the Flask application itself, and nothing is listening on port 8500 there. We need a way to connect to port 8500 of the TF Serving container. To do that, we put both containers into the same network, so that the gateway container can access the ports of TensorFlow Serving and the other way around.
One way of doing this is with plain Docker, but there is another way of linking multiple related services together: Docker-compose. It allows us to run multiple Docker containers and link related ones to each other. All of them run in a single network and can talk to each other if needed.

## Installing docker-compose
Before using docker-compose we need to install it. If you're on Windows or macOS and use Docker Desktop to run Docker, you probably already have it.
You can find more information about installing it on the official Docker page (docs.docker.com/compose/install).

To install it on Linux, you can do it like Alexey, who creates a "bin" folder in his home directory.

mkdir bin
cd bin
wget https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m) -O docker-compose
chmod +x docker-compose

To make it available from anywhere, we need to add this folder to the PATH:

echo $PATH
nano ~/.bashrc

Add the following line at the end of the ".bashrc" file:

export PATH="${HOME}/bin:${PATH}"

To apply the change to the current shell session, run:

source ~/.bashrc

## Configuring Docker-compose
Now we need to create a special file called "docker-compose.yaml". This file describes the containers we want to run.

version: "3.9"
services:
  clothing-model:
    image: zoomcamp-10-model:xception-v4-001
  gateway:
    image: zoomcamp-10-gateway:002
    environment:
      - TF_SERVING_HOST=clothing-model:8500
    ports:
      - "9696:9696"

Within the network that Docker-compose creates, each container is accessible by its service name. So to access the gateway we write gateway:9696, and to access TensorFlow Serving we write clothing-model:8500, which is the value passed to the gateway in TF_SERVING_HOST. Note that the file references the image zoomcamp-10-gateway:002, which presumably is a rebuilt gateway image that reads this variable.
There is one more thing to know. We need to map the port of the gateway service to a port on our host machine, because test.py lives outside the network that Docker-compose creates but needs to access the gateway.
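
For this configuration to have any effect, the gateway must read the TensorFlow Serving address from the TF_SERVING_HOST variable instead of hard-coding it. A minimal sketch, assuming a fallback of localhost:8500 for runs outside Docker-compose (the helper name `tf_serving_address` is hypothetical):

```python
import os


def tf_serving_address(env=os.environ):
    """Return the host:port of TF Serving, falling back to a local default."""
    return env.get("TF_SERVING_HOST", "localhost:8500")


# Outside Docker-compose the variable is usually unset, so the default applies.
print(tf_serving_address({}))

# Inside Docker-compose the value comes from the "environment" section.
print(tf_serving_address({"TF_SERVING_HOST": "clothing-model:8500"}))
```

In gateway.py, the returned string would then be passed to the gRPC channel, for example `grpc.insecure_channel(tf_serving_address())`.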

## Running the service
To run everything you just need one command:

docker-compose up

This looks for a docker-compose.yaml file in the current directory and runs all the services that are specified there. You will see the output from both services, so you don't need several terminal tabs. One tab is enough.

## Testing the service
Now we can test again with the test.py file.

python test.py

## Running the service in detached mode

You can also run the services in detached mode. That means it runs all the services and then you'll get back the terminal. This is the command.

docker-compose up -d

To see what is running you can type:

docker ps

After testing, you can shut everything down with the command:

docker-compose down

Then it stops all the services. This shows how to run multiple services on the same machine. This is quite useful for testing. The next step is to take this and deploy it to Kubernetes.

# 10.5 Introduction to Kubernetes
Kubernetes "is an open-source system for automating deployment, scaling, and management of containerized applications." (https://kubernetes.io). That means we can use Kubernetes to deploy Docker images. It also manages scaling: it adds more instances of our application when the load increases and removes them when the load decreases.

Let's imagine a big "box" as our Kubernetes cluster. Inside the cluster we have nodes. These nodes are machines (servers, EC2 instances) where things are running. On the nodes there are pods. For our purposes, a pod is a (Docker) container that runs a specific image with specific parameters. Each node can host multiple pods, and pods can need different amounts of resources. Pods are usually grouped into deployments. All pods in one deployment have the same Docker image and configuration (e.g. environment variables), even though they can run on different nodes. For example, you can have two nodes with one pod each, both running the same image with the same configuration. One such deployment could be our gateway service. Another deployment could be our TensorFlow Serving model, which needs more resources to score images.
There is another concept to know: services. In our case there is a gateway service and a TensorFlow model service. A service is an entry point to a deployment. The user who wants to upload an image sends the request to the gateway service, which is the main point of contact for the web application. The service is responsible for routing the request to one of the available pods of the gateway deployment, which means it can spread the load. This pod converts the image to Protobuf and sends it to the model service. The model service works like the gateway service: it routes the request to one available pod of the TF Serving deployment. This pod receives the Protobuf request and replies with predictions. The predictions travel back the same way: to the gateway pod that sent the request, and from there through the gateway service to the user.
In our case there are two kinds of services. The gateway service that the user contacts is an external service, which means it has to be visible outside the Kubernetes cluster. The model service, on the other hand, doesn't need to be visible outside the cluster, so it is an internal service. Internal services can only be used by pods in the Kubernetes cluster (and not by clients outside the cluster). In Kubernetes, external services have the type "LoadBalancer" and internal services have the type "ClusterIP".
In fact, the client doesn't contact the gateway service directly. It first contacts the so-called "ingress", which routes the request to one of the external services. The ingress is, so to speak, the entry point to the cluster.
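
To make the difference between external and internal services more concrete, here is a hedged sketch of what the two service definitions might look like. It builds them as Python dictionaries and prints them as JSON, which kubectl accepts as well as YAML. The service names, the "app" label values, and the port numbers are assumptions for illustration; only the "type", "selector", and "ports" fields are standard parts of a Kubernetes Service.

```python
import json


def service_manifest(name, app_label, service_type, port, target_port):
    """Build a Kubernetes Service manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": service_type,
            # Requests are routed to pods that carry this label.
            "selector": {"app": app_label},
            # "port" is the service port, "targetPort" the container port.
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


# External service: reachable from outside the cluster (hypothetical names).
gateway_service = service_manifest("gateway", "gateway", "LoadBalancer", 80, 9696)

# Internal service: reachable only from pods inside the cluster.
model_service = service_manifest(
    "tf-serving-clothing-model", "tf-serving-clothing-model", "ClusterIP", 8500, 8500
)

print(json.dumps(gateway_service, indent=2))
print(json.dumps(model_service, indent=2))
```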

There is one more thing. Imagine that more than one client contacts the ingress. To deal with this load, Kubernetes can start more pods. You can configure the minimum and maximum number of pods you are willing to run, and Kubernetes will automatically scale the deployment up or down as the load increases or decreases. The component that takes care of this is called the "HPA", the horizontal pod autoscaler. If the existing nodes are too busy to host new pods, this can also lead to a request for a new node.
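
The scaling rule of the horizontal pod autoscaler can be illustrated with a small calculation. The sketch below follows the formula given in the Kubernetes documentation, desired = ceil(current replicas * current metric / target metric), and clamps the result to the configured minimum and maximum. It is a simplification: the real autoscaler also applies a tolerance and stabilization settings, which are left out here, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Simplified HPA rule: scale in proportion to metric / target, then clamp."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 2 pods at 90% average CPU with a 50% target: ceil(2 * 90 / 50) = ceil(3.6) = 4
scaled_up = desired_replicas(2, 90, 50, min_replicas=1, max_replicas=10)

# 4 pods at 10% average CPU with a 50% target: ceil(4 * 10 / 50) = ceil(0.8) = 1
scaled_down = desired_replicas(4, 10, 50, min_replicas=1, max_replicas=10)

print(scaled_up, scaled_down)
```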

We won't use all of these things, but we'll need pods, deployments, and services.

# 10.6 Deploying a simple service to Kubernetes

## Create a simple ping application in Flask
We'll use the code from a previous session (ping.py). This is the file we want to deploy. We copy it to a new folder called "ping". For this application we need to create a virtual environment.

- cd ping
- touch Pipfile
- pipenv install flask gunicorn

Next we need to create a Dockerfile for this application. Here we can just copy the Dockerfile from the deployment chapter and adapt it to our needs.

Now we can build the image with "docker build -t ping ."
This will give the image the tag "latest", but we want to give it a specific tag.

docker build -t ping:v001 .

The reason is that we'll use a local Kubernetes cluster called "Kind", which we'll set up later, and Kind doesn't work well with the tag "latest": for images tagged "latest", Kubernetes by default tries to pull the image from a registry instead of using the one loaded into the cluster. So the image needs a specific tag.

Let's run this image:

docker run -it --rm -p 9696:9696 ping:v001

In another terminal, you can now test the application with a curl request:

curl localhost:9696/ping

It should return "PONG".

## Installing kubectl
Now we want to deploy this application to Kubernetes, but first we need to install a few things. We need "kubectl", which is a tool for interacting with any Kubernetes cluster.
If you need information about installing it, you can go to the documentation at https://kubernetes.io/docs/. If you're on Windows or macOS you probably use Docker Desktop, which already comes with kubectl. For Linux you can find all the information at https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/. This is one way of installing it. But because we'll use AWS later, we'll follow the install documentation from there: https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html

There we need to find the right version. The command looks like this:

curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.28.3/2023-11-14/bin/linux/amd64/kubectl

We'll again use the "bin" directory in our home folder, which we already used to install docker-compose. When we download kubectl to this folder we can use it right away, because the folder is already on the PATH.
After downloading, we need to make it executable with

chmod +x kubectl

## Installing Kind
There is another tool we need to install which is Kind. Kind is a tool for setting up a local Kubernetes cluster.
You can find more information on that here: https://kind.sigs.k8s.io/docs/user/quick-start/

For Linux you can use this to get Kind:

wget https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64 -O kind

chmod +x kind

## Setting up a local Kubernetes cluster with Kind
To create a Kubernetes cluster we just need one command:

kind create cluster

This command creates a cluster named "kind". The first run can take a while (around 10 minutes), because images need to be downloaded first. The next step is to check that kubectl can access this cluster through the "kind-kind" context.

kubectl cluster-info --context kind-kind

To check that things work we can do

kubectl get service

That command lists all services that are running in our Kubernetes cluster. Because we haven't deployed any services ourselves we only have a service called kubernetes.

kubectl get pod
kubectl get deployment

With these commands we can check for pods and deployments. Neither command should return anything, because nothing is running yet. But it signals that everything is ready for the next steps.

docker ps 

shows that there is a container running called "kind-control-plane". That means Kind uses Docker to create the Kubernetes cluster. This is good for testing things locally, because it feels like interacting with a real Kubernetes cluster.

## Creating a deployment
For the configuration we need to write some yaml files and there is a Visual Studio Code extension which is quite useful. The name is "Kubernetes (Develop, deploy and debug Kubernetes applications)" from Microsoft.

Now we can start writing the deployment configuration (deployment.yaml). In a deployment we specify what the pods will look like. After that we can create a service.

The file has several parts.
The deployment has a name. This name is defined in the (first) metadata section.

The template section is a template for each pod. As we know, a deployment contains multiple pods that have the same image and the same configuration. In this template we provide the specification for each pod. Each pod has a container with a name, which here is "ping-pod". Each container also has an image, which here is "ping:v001". There are also resource limits for each pod. Here we allow a maximum of 128 MB RAM and a maximum of 50% of one CPU. You can also write "0.5" instead of "500m". (You can check the actual usage with "htop" on Linux.)
The last thing in this template is a port, which is the port of the container that we're exposing. This is the port that we want others to access.

There is another metadata section inside the template section. It means that each pod gets a label, here app with the value ping.

The selector section means that all pods that have a label app equal to ping belong to this deployment.

One last thing is the replicas setting, where you specify how many pods you want to run. For this deployment we set the value to 1. Kubernetes will create or destroy pods until this number is reached.
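
The deployment.yaml file itself is not reproduced in these notes. The sketch below rebuilds the structure described above as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. Treat it as an approximation of the original file: the memory limit is written as "128Mi" and the container port as 9696 (the port the ping container used earlier), and both are assumptions.

```python
import json

# The same label is used by the selector and by the pod template,
# which is how the deployment recognizes its own pods.
APP_LABEL = {"app": "ping"}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "ping-deployment"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": APP_LABEL},
        "template": {
            "metadata": {"labels": APP_LABEL},
            "spec": {
                "containers": [
                    {
                        "name": "ping-pod",
                        "image": "ping:v001",
                        # Upper bounds for each pod (values are assumptions).
                        "resources": {
                            "limits": {"memory": "128Mi", "cpu": "500m"}
                        },
                        "ports": [{"containerPort": 9696}],
                    }
                ]
            },
        },
    },
}

manifest_json = json.dumps(deployment, indent=2)
print(manifest_json)
```

Saved to a file such as deployment.json (a hypothetical name), the output could be applied the same way as the YAML version, with kubectl apply -f.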

Now we can apply this config file to our Kubernetes cluster:

kubectl apply -f deployment.yaml

This should return "deployment.apps/ping-deployment created"

kubectl get deployment

should now return our deployment with the name ping-deployment.

kubectl get pod

should now also list a pod that belongs to this deployment.

To get more information about one pod you can use

kubectl describe pod ping-deployment-... | less

At the end of this output you can find the error messages. In our case it shows the message "Failed to pull image "ping:v001"". The reason is that we haven't made this image available to Kind, so Kubernetes doesn't know about it. We already have the image locally. Now we need to load it into the cluster so that it becomes visible there.

kind load docker-image ping:v001

Now "kubectl get pod" should show the status READY 1/1.

## Creating a service


# 10.7 Deploying TensorFlow models to Kubernetes

## Deploying the TF-Serving model
## Deploying the Gateway
## Testing the service

# 10.8 Deploying to EKS

# 10.9 Summary

- TF-Serving is a system for deploying TensorFlow models
- When using TF-Serving, we need a component for pre-processing
- Kubernetes is a container orchestration platform
- To deploy something on Kubernetes, we need to specify a deployment and a service
- You can use Docker compose and Kind for local experiments

# 10.10 Explore more

- Other local Kubernetes distributions: minikube, k3d, k3s, microk8s, EKS Anywhere
- Experiment also with [Rancher Desktop](https://rancherdesktop.io) which is similar to Docker Desktop but for Kubernetes
- Experiment also with Docker Desktop which should also have Kubernetes cluster capabilities
- Check out [Lens | The Kubernetes IDE](https://k8slens.dev), an integrated environment for monitoring and managing Kubernetes. It is something like kubectl but with a graphical interface.
- Many cloud providers have Kubernetes: GCP, Azure, Digital Ocean, and others. Look for "Managed Kubernetes" in your favourite search engine
- Deploy the model from previous modules and from your project with Kubernetes
- Learn about Kubernetes namespaces. Here we used the default namespace. Namespaces are quite useful for organizing your applications into related areas, e.g. one namespace for all the deployments and pods of one project.