# TensorFlow in Production

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 16/03/2025   | Martin | Updated   | Created notebook for serving portion of productionising TensorFlow models | 

# Content

* [Saving and Restoring TensorFlow models](#saving-and-restoring-tensorflow-models)
* [Parallelising TensorFlow](#parallelising-tensorflow)

# Saving and Restoring TensorFlow Models

Save a trained model so that it can be deployed in production or reused later, for example as the starting point for a transfer learning task.

`SavedModel` is the recommended format for saving the entire model to disk.

In [None]:
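# Assuming a model already exists
# To save a model as a SavedModel directory
# (Keras 2 behaviour; Keras 3 expects a .keras or .h5 extension here
# and uses model.export() to write a SavedModel):
model.save('SavedModel')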

In [None]:
# To load a model:
model2 = tf.keras.models.load_model('SavedModel')

## Changing to Keras H5 format

To save in the Keras H5 format instead, give the filename the `.h5` extension or pass the argument `save_format="h5"`.

In [None]:
model.save("h5SavedModel.h5")

In [None]:
model.save("AnotherModel", save_format='h5')

## Saving and restoring from checkpoints

Use the `ModelCheckpoint` callback to save either the entire model or only its weights to a checkpoint during training. Pass the callback to the `fit` method through the `callbacks` argument; with `save_freq='epoch'` it writes a checkpoint at the end of every epoch.

Docs: https://keras.io/api/callbacks/model_checkpoint/

In [None]:
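checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
  filepath='./checkpoint', # Keras 3 requires a path ending in .weights.h5 when save_weights_only=True
  save_weights_only=True, # save only the weights, not the full model
  save_freq='epoch', # 'epoch' saves after every epoch; an integer N saves every N batches
  save_best_only=False, # if True, save only when the monitored metric improves
  monitor='val_loss', # metric consulted when save_best_only=True
)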
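
To make the `save_best_only` and `monitor` arguments more concrete, here is a rough pure-Python sketch of the decision made at the end of each epoch when the monitored metric is a loss (lower is better). It only illustrates the idea and is not Keras's implementation; the helper name `epochs_to_save` and the loss values are made up for this example.

```python
import math

def epochs_to_save(val_losses, save_best_only=True):
    """Return the 1-based epochs at which a checkpoint would be written.

    Hypothetical helper: mirrors the idea behind save_best_only with
    monitor='val_loss' (lower is better). It is not Keras code.
    """
    best = math.inf
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        # Without save_best_only every epoch is saved; with it, only improvements are
        if not save_best_only or loss < best:
            saved.append(epoch)
        best = min(best, loss)
    return saved

# Made-up validation losses for five epochs
losses = [0.31, 0.25, 0.27, 0.22, 0.24]
print(epochs_to_save(losses))                        # [1, 2, 4]
print(epochs_to_save(losses, save_best_only=False))  # [1, 2, 3, 4, 5]
```

With `save_best_only=True` the checkpoint on disk always corresponds to the best epoch seen so far, which is usually what you want to restore for serving.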

In [None]:
model.fit(
  x=x_train,
  y=y_train,
  epochs=5,
  validation_data=(x_test, y_test),
  callbacks=[checkpoint_callback]
)

In [None]:
# To restore the weights from a checkpoint
# (the model must have the same architecture as the one that was saved)
model.load_weights("./checkpoint")

---

# TensorFlow Serving

Learn to serve machine learning models in production.

TensorFlow Serving is the model-serving component of __TensorFlow Extended (TFX)__, an MLOps platform for building complete ML pipelines. A TFX pipeline is composed of a sequence of components for data validation, transformation, model analysis and model serving.

In [1]:
import tensorflow as tf
import numpy as np
import requests
import matplotlib.pyplot as plt
import json


## Build the MNIST model

In [2]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize
x_train = x_train / 255
x_test = x_test/ 255

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(name="FLATTEN"))
model.add(tf.keras.layers.Dense(units=128 , activation="relu", name="D1"))
model.add(tf.keras.layers.Dense(units=64 , activation="relu", name="D2"))
model.add(tf.keras.layers.Dense(units=10, activation="softmax", name="OUTPUT"))
    
model.compile(
  optimizer="sgd", 
  loss="sparse_categorical_crossentropy",
  metrics=["accuracy"]
)

model.fit(
  x=x_train, 
  y=y_train, 
  epochs=5,
  validation_data=(x_test, y_test)
) 

Epoch 1/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 32s 15ms/step - accuracy: 0.7058 - loss: 1.0561 - val_accuracy: 0.9125 - val_loss: 0.3141
Epoch 2/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.9134 - loss: 0.3027 - val_accuracy: 0.9287 - val_loss: 0.2511
Epoch 3/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9277 - loss: 0.2508 - val_accuracy: 0.9365 - val_loss: 0.2174
Epoch 4/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.9380 - loss: 0.2167 - val_accuracy: 0.9439 - val_loss: 0.1906
Epoch 5/5
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.9467 - loss: 0.1846 - val_accuracy: 0.9492 - val_loss: 0.1752

<keras.src.callbacks.history.History at 0x7f4614b973d0>

## Save the model in SavedModel format

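Export each model version to its own subdirectory named after the version number (here `1`). Note that the recorded output below shows the path `mnist_model/v1`; TensorFlow Serving only recognises numeric version directories, so the numeric path used in the code is the one to keep.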

In [None]:
model.export('mnist_model/1')

INFO:tensorflow:Assets written to: mnist_model/v1/assets




Saved artifact at 'mnist_model/v1'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 28, 28), dtype=tf.float32, name='keras_tensor')
Output Type:
  TensorSpec(shape=(None, 10), dtype=tf.float32, name=None)
Captures:
  139938957709856: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139938957717072: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139938957199488: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139938952830944: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139938952827776: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139938952833760: TensorSpec(shape=(), dtype=tf.resource, name=None)


In [None]:
# Native Keras format (.keras), the newer high-level format for Keras models
model.save('mnist_model.keras')

## Serve using the TensorFlow Serving docker image

1. Pull the Docker image `tensorflow/serving`
2. Publish the container's REST API port 8501 on host port 8501 (`-p 8501:8501`)
3. Bind-mount the exported model directory (`mnist_model`, which contains the version subdirectory `1`) to the model base path `/models/my_mnist_model` inside the container
4. Set the environment variable `MODEL_NAME` to `my_mnist_model`

Command
```
docker run -p 8501:8501 --mount type=bind,source="$(pwd)/mnist_model/",target=/models/my_mnist_model -e MODEL_NAME=my_mnist_model -t tensorflow/serving
```
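
Once the container is running, the model is reachable through TensorFlow Serving's REST API under `/v1/models/<MODEL_NAME>`. The sketch below shows how those URLs are put together; `serving_url` is a hypothetical helper written for this notebook, and the host and port defaults simply match the command above.

```python
def serving_url(model_name, host="localhost", port=8501, version=None, verb=None):
    """Build a TensorFlow Serving REST URL (hypothetical helper).

    With verb=None the URL points at the model status endpoint (GET);
    with verb="predict" it points at the prediction endpoint (POST).
    """
    url = f"http://{host}:{port}/v1/models/{model_name}"
    if version is not None:
        # Pin a specific model version instead of the latest one
        url += f"/versions/{version}"
    if verb is not None:
        url += f":{verb}"
    return url

print(serving_url("my_mnist_model"))
# http://localhost:8501/v1/models/my_mnist_model
print(serving_url("my_mnist_model", verb="predict"))
# http://localhost:8501/v1/models/my_mnist_model:predict
print(serving_url("my_mnist_model", version=1, verb="predict"))
# http://localhost:8501/v1/models/my_mnist_model/versions/1:predict
```

A `GET` request to the first URL, for example `requests.get(serving_url("my_mnist_model"))`, should report the state of the loaded version and is a quick way to confirm that the server started correctly.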

## Image to predict

In [None]:
num_rows = 4
num_cols = 3
plt.figure(figsize=(2*2*num_cols, 2*num_rows))
for row in range(num_rows):
  for col in range(num_cols):
    index = num_cols * row + col
    image = x_test[index]
    true_label = y_test[index]
    plt.subplot(num_rows, 2*num_cols, 2*index+1)
    plt.imshow(image.reshape(28,28), cmap="binary")
    plt.axis('off')
    plt.title('\n\n It is a {}'.format(y_test[index]), fontdict={'size': 16})
plt.tight_layout()
plt.show()

## Send a POST request to test the endpoint

Send a POST request containing 12 images. For each image, the server returns ten probabilities, one for each digit from 0 to 9.

In [None]:
json_request = '{{ "instances" : {} }}'.format(x_test[0:12].tolist())
resp = requests.post('http://localhost:8501/v1/models/my_mnist_model:predict', data=json_request, headers = {"content-type": "application/json"})
print('response.status_code: {}'.format(resp.status_code))     
print('response.content: {}'.format(resp.content))
predictions = json.loads(resp.text)['predictions']
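
The response body is a JSON object whose `predictions` field holds one list of ten probabilities per image; on failure, TensorFlow Serving returns an object with an `error` field instead. The following is a minimal standard-library sketch of turning such a body into predicted digits. The helper name `predicted_labels` and the hand-written response are assumptions made for illustration only.

```python
import json

def predicted_labels(response_text):
    """Return the index of the largest probability for each prediction row.

    Hypothetical helper: raises RuntimeError if the server reported an error.
    """
    body = json.loads(response_text)
    if "error" in body:
        raise RuntimeError(body["error"])
    # Pure-Python argmax: index of the maximum value in each row
    return [max(range(len(row)), key=row.__getitem__) for row in body["predictions"]]

# Hand-written stand-in for a server reply with two images and three classes
fake_response = json.dumps({"predictions": [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]})
print(predicted_labels(fake_response))  # [1, 0]
```

With NumPy available, `np.argmax(predictions, axis=1)` gives the same labels in one call.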

In [None]:
# Display the result
num_rows = 4
num_cols = 3
plt.figure(figsize=(2*2*num_cols, 2*num_rows))
for row in range(num_rows):
  for col in range(num_cols):
    index = num_cols * row + col
    image = x_test[index]
    predicted_label = np.argmax(predictions[index])
    true_label = y_test[index]
    plt.subplot(num_rows, 2*num_cols, 2*index+1)
    plt.imshow(image.reshape(28,28), cmap="binary")
    plt.axis('off')
    if predicted_label == true_label:
      color = 'blue'
    else:
      color = 'red'
    plt.title('\n\n The model predicts a {} \n and it is a {}'.format(predicted_label, true_label), fontdict={'size': 16}, color=color)
plt.tight_layout()
plt.show()

## Additional notes

* TensorFlow Serving requires a specific directory structure and models in the `SavedModel` format: each model version should be exported to its own subdirectory under the model base path
* TensorFlow Serving automatically looks for and loads the highest integer (model version) in the specified folder
* TFX contains many other components, such as data pipelines, data validation, feature engineering and model analysis, for building more comprehensive ML pipelines around model serving
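
As an illustration of the version lookup described in the notes above, the sketch below picks the subdirectory with the highest integer name from a model base path. It only mimics the default behaviour for explanatory purposes and is not TensorFlow Serving's own code; `latest_version_dir` is a hypothetical helper, and the directory names are created in a temporary folder.

```python
import tempfile
from pathlib import Path

def latest_version_dir(model_base_path):
    """Return the subdirectory with the highest integer name, or None."""
    numeric = [p for p in Path(model_base_path).iterdir()
               if p.is_dir() and p.name.isdecimal()]
    return max(numeric, key=lambda p: int(p.name), default=None)

with tempfile.TemporaryDirectory() as base:
    # "v3" is not a valid version name and is skipped by this sketch
    for name in ("1", "2", "10", "v3"):
        (Path(base) / name).mkdir()
    print(latest_version_dir(base).name)  # 10
```

Versions are compared as integers rather than strings, so `10` ranks above `2`.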