## Batching

:::{.callout-warning}

I was not able to simulate a situation where dynamic batching is better than not batching.  Apparently it can take time and lots of experiments to get right.  Follow [this guide](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md#batch-scheduling-parameters-and-tuning) for more information.  This is a topic I may revisit in the future.

:::

According to the docs:

> Model Server has the ability to batch requests in a variety of settings in order to realize better throughput. The scheduling for this batching is done globally for all models and model versions on the server to ensure the best possible utilization of the underlying resources no matter how many models or model versions are currently being served by the server.  You can enable this by using the `--enable_batching` flag and control it with the `--batching_parameters_file`.

This is an example batching parameters file:

In [117]:
%%writefile batch-config.cfg
max_batch_size { value: 10000 }
batch_timeout_micros { value: 1000 }
max_enqueued_batches { value: 18 }
num_batch_threads { value: 16 }

Overwriting batch-config.cfg
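
The `%%writefile` magic only works inside IPython/Jupyter. Outside a notebook, a minimal sketch like the one below could produce the same file. The names `params`, `render_batching_config`, and `write_batching_config` are made up for this illustration and are not part of TF Serving.

```python
from pathlib import Path

# Same values as the example file above.
params = {
    "max_batch_size": 10000,
    "batch_timeout_micros": 1000,
    "max_enqueued_batches": 18,
    "num_batch_threads": 16,
}


def render_batching_config(params):
    """Render a dict as the `name { value: N }` lines used in the file above."""
    return "".join(f"{name} {{ value: {value} }}\n" for name, value in params.items())


def write_batching_config(path, params):
    """Write the rendered config to `path` (hypothetical helper)."""
    Path(path).write_text(render_batching_config(params))
```

Calling `write_batching_config("batch-config.cfg", params)` should leave a file with the same four lines as the cell above.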


Guidance for these config files is [here](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md), but there is no "right answer".  For GPUs, the guidance is this:

GPU: One Approach

If your model uses a GPU device for part or all of its inference work, consider the following approach:

1. Set `num_batch_threads` to the number of CPU cores.

2. Temporarily set `batch_timeout_micros` to a really high value while you tune `max_batch_size` to achieve the desired balance between throughput and average latency. Consider values in the hundreds or thousands.

3. For online serving, tune `batch_timeout_micros` to rein in tail latency. The idea is that batches normally get filled to max_batch_size, but occasionally when there is a lapse in incoming requests, to avoid introducing a latency spike it makes sense to process whatever's in the queue even if it represents an underfull batch. The best value for `batch_timeout_micros` is typically a few milliseconds, and depends on your context and goals. Zero is a value to consider; it works well for some workloads. (For bulk processing jobs, choose a large value, perhaps a few seconds, to ensure good throughput but not wait too long for the final (and likely underfull) batch.)
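
To build some intuition for how `max_batch_size` and `batch_timeout_micros` interact, here is a toy model of batch formation. It is only a sketch under a simplifying assumption (a batch closes when it is full, or when its oldest request has waited at least the timeout) and is not TF Serving's actual scheduler. The function name `form_batches` and the arrival times are made up for illustration.

```python
def form_batches(arrival_times_us, max_batch_size, batch_timeout_micros):
    """Group sorted arrival times (in microseconds) into batches.

    Toy policy: a batch closes when it is full, or when its oldest request
    has waited at least `batch_timeout_micros`.
    """
    batches, current = [], []
    for t in arrival_times_us:
        # The oldest queued request has waited long enough: ship an underfull batch.
        if current and t - current[0] >= batch_timeout_micros:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: ship it immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Four requests close together, then a lapse, then two more.
arrivals = [0, 100, 200, 300, 5000, 5100]

for max_size, timeout in [(4, 1000), (2, 1000), (4, 150)]:
    sizes = [len(b) for b in form_batches(arrivals, max_size, timeout)]
    print(f"max_batch_size={max_size}, batch_timeout_micros={timeout}: batch sizes {sizes}")
```

In this model, a short timeout splits bursts into smaller batches, while a long one lets batches fill up at the cost of extra waiting for the earliest requests.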


## Test it out

The model we are going to serve is generated [in this note](./tf-serving-basics.ipynb).

I'm going to start two TF Serving instances: one that's a regular CPU server and one that does batching on a GPU.  I'm running both commands from the `/home/hamel/tf-serving/` directory.


### CPU Version

```bash
docker run \
--mount type=bind,source=/home/hamel/hamel/notes/serving/tfserving/model/,target=/models/model \
--net=host -t tensorflow/serving
```

:::{.callout-note}

`--net=host` makes the container share the host's network stack, so its ports are reachable on the host without publishing them. This is convenient for testing.

:::

Test the CPU version:

In [4]:
! curl http://localhost:8501/v1/models/model

{
 "model_version_status": [
  {
   "version": "1",
   "state": "AVAILABLE",
   "status": {
    "error_code": "OK",
    "error_message": ""
   }
  }
 ]
}
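
If you want to check readiness from Python instead of `curl` (for example, before starting a benchmark), a minimal sketch could look like this. The helper names `model_is_available` and `fetch_model_status` are hypothetical; the URL and the response shape are the ones shown above.

```python
import json
from urllib.request import urlopen


def model_is_available(status):
    """Return True if any version in a model-status response is AVAILABLE."""
    versions = status.get("model_version_status", [])
    return any(v.get("state") == "AVAILABLE" for v in versions)


def fetch_model_status(port, model_name="model"):
    """GET the same model-status endpoint as the curl call above."""
    url = f"http://localhost:{port}/v1/models/{model_name}"
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


# The response shown above, as a Python dict.
example_status = {
    "model_version_status": [
        {
            "version": "1",
            "state": "AVAILABLE",
            "status": {"error_code": "OK", "error_message": ""},
        }
    ]
}
print(model_is_available(example_status))
```

With a server running, `model_is_available(fetch_model_status(8501))` would perform the live check.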


### GPU Version

You can pass additional arguments like `--enable_batching` to the model server by [appending them to the `docker run` command](https://www.tensorflow.org/tfx/serving/docker#passing_additional_arguments).

Use the `latest-gpu` tag to enable GPUs, and set `--port` and `--rest_api_port` so that this instance doesn't conflict with the other TF Serving instance I have running.

### Pre-requisites

You must [install nvidia-docker](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) first.

### Docker Command

Note that we need the `--gpus all` flag to enable GPUs with nvidia-docker:

```bash
docker run --gpus all \
--mount type=bind,source=/home/hamel/hamel/notes/serving/tfserving,target=/models \
--net=host -t tensorflow/serving:latest-gpu --enable_batching \
--batching_parameters_file=/models/batch-config.cfg --port=8505 --rest_api_port=8506
```

:::{.callout-note}

### Understanding the volume mount

On the host, the config file is located at `/home/hamel/hamel/notes/serving/tfserving/batch-config.cfg` and the model is located at `/home/hamel/hamel/notes/serving/tfserving/model/`

The [Dockerfile](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/tools/docker/Dockerfile.gpu) sets up the model location like this:

```dockerfile
# Set where models should be stored in the container
ENV MODEL_BASE_PATH=/models
RUN mkdir -p ${MODEL_BASE_PATH}

# The only required piece is the model name in order to differentiate endpoints
ENV MODEL_NAME=model

# Create a script that runs the model server so we can use environment variables
# while also passing in arguments from the docker command line
RUN echo '#!/bin/bash \n\n\
tensorflow_model_server --port=8500 --rest_api_port=8501 \
--model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} \
"$@"' > /usr/bin/tf_serving_entrypoint.sh \
&& chmod +x /usr/bin/tf_serving_entrypoint.sh
```

By default it will try to get models from `${MODEL_BASE_PATH}/${MODEL_NAME}`, which is `/models/model`. So we mount `/home/hamel/hamel/notes/serving/tfserving` from the host to `/models` in the container.

In the container:

- The model files will be available at `/models/model` as expected
- The config file will be available at `/models/batch-config.cfg`

:::
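
The host-to-container mapping described in the note can be sanity-checked with a few lines of Python. This is only an illustration of how a bind mount translates paths; `container_path` is a hypothetical helper, and the two constants are the `source` and `target` from the `docker run` command above.

```python
from pathlib import PurePosixPath

# source= and target= from the --mount flag above.
MOUNT_SOURCE = PurePosixPath("/home/hamel/hamel/notes/serving/tfserving")
MOUNT_TARGET = PurePosixPath("/models")


def container_path(host_path):
    """Translate a host path under the bind mount to its path in the container."""
    # relative_to raises ValueError if host_path is outside the mounted directory.
    relative = PurePosixPath(host_path).relative_to(MOUNT_SOURCE)
    return str(MOUNT_TARGET / relative)


print(container_path("/home/hamel/hamel/notes/serving/tfserving/model/"))
print(container_path("/home/hamel/hamel/notes/serving/tfserving/batch-config.cfg"))
```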


Test the TF Serving GPU API:

In [5]:
! curl http://localhost:8506/v1/models/model

{
 "model_version_status": [
  {
   "version": "1",
   "state": "AVAILABLE",
   "status": {
    "error_code": "OK",
    "error_message": ""
   }
  }
 ]
}


## Benchmark

"All benchmarks are wrong, some are useful"

### Prepare the data

In [14]:
from tensorflow import keras

vocab_size = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review

_, (x_val, _) = keras.datasets.imdb.load_data(num_words=vocab_size)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)

### Prepare the prediction code

In [106]:
import json, requests
import numpy as np


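def predict_rest(data, port):
    """POST a batch of examples to the TF Serving REST API and return the predictions."""
    json_data = json.dumps(
        {"signature_name": "serving_default", "instances": data.tolist()}
    )
    url = f"http://localhost:{port}/v1/models/model:predict"

    json_response = requests.post(url, data=json_data)
    json_response.raise_for_status()  # fail loudly on HTTP errors instead of a KeyError below
    response = json.loads(json_response.text)
    rest_outputs = np.array(response["predictions"])
    return rest_outputs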

In [107]:
sample_data = x_val[:2, :]
rest_outputs = predict_rest(sample_data, '8501')
rest_outputs

array([[0.89650154, 0.1034985 ],
       [0.00330466, 0.9966954 ]])

### Test the CPU Server

The CPU server is running on port `8501`. We send a request containing 5 examples 10,000 times, in parallel across 500 threads, and measure the total time:

In [108]:
from fastcore.parallel import parallel
from functools import partial

In [136]:
cpu_pred = partial(predict_rest, port = '8501')
parallel_pred = partial(parallel, threadpool=True, n_workers=500)
sample_data = x_val[:5, :]
data = [sample_data] * 10000

In [137]:
%%time
results = parallel_pred(cpu_pred, data)

CPU times: user 27.6 s, sys: 4.8 s, total: 32.4 s
Wall time: 26 s
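
If you don't have fastcore installed, a rough standard-library equivalent of the timing harness is sketched below. It assumes that `parallel(..., threadpool=True)` behaves like a thread-pool map, which is my understanding but worth verifying. `timed_parallel` and `fake_predict` are hypothetical names; `fake_predict` stands in for `predict_rest` so the sketch runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_parallel(fn, items, n_workers):
    """Apply fn to every item on a thread pool; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(fn, items))  # results keep the input order
    return results, time.perf_counter() - start


def fake_predict(batch):
    """Stand-in for predict_rest: sleeps briefly instead of calling a server."""
    time.sleep(0.001)
    return len(batch)


results, seconds = timed_parallel(fake_predict, [[0] * 5] * 200, n_workers=50)
print(f"{len(results)} requests in {seconds:.3f}s -> {len(results) / seconds:.0f} requests/s")
```

Swapping `fake_predict` for `cpu_pred` and the toy list for `data` would approximate the benchmark above.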


### Test the GPU Server with batching

The GPU server is running on port `8506` (we already started it above).  We send it the same requests:

In [140]:
gpu_pred = partial(predict_rest, port = '8506')

In [141]:
%%time
results = parallel_pred(gpu_pred, data)

CPU times: user 25.5 s, sys: 3.18 s, total: 28.7 s
Wall time: 25.4 s


### Test the GPU server without batching

```bash
docker run --gpus all \
--mount type=bind,source=/home/hamel/hamel/notes/serving/tfserving,target=/models \
--net=host -t tensorflow/serving:latest-gpu \
--port=8507 --rest_api_port=8508
```

In [138]:
gpu_pred_no_batch = partial(predict_rest, port = '8508')

In [139]:
%%time
results = parallel_pred(gpu_pred_no_batch, data)

CPU times: user 22.9 s, sys: 3.45 s, total: 26.4 s
Wall time: 21.9 s
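
To compare the three runs on one scale, the wall times above can be converted to throughput. This is plain arithmetic on the numbers reported in this note (10,000 requests of 5 examples each). It reflects a single run per configuration, so small differences may well be noise.

```python
n_requests = 10_000
examples_per_request = 5

# Wall times (seconds) reported by %%time above.
wall_seconds = {
    "CPU": 26.0,
    "GPU with batching": 25.4,
    "GPU without batching": 21.9,
}

throughput = {name: n_requests / secs for name, secs in wall_seconds.items()}

for name, rps in throughput.items():
    print(f"{name}: {rps:.0f} requests/s, {rps * examples_per_request:.0f} examples/s")
```

In this run the GPU server without batching had the highest throughput, which matches the warning at the top of this note: I did not find a setup where dynamic batching came out ahead.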
