# Operating a Vision Model as a Living System

In previous chapters, we learned how to:

- Train a computer vision model
- Evaluate its performance
- Deploy it using FastAPI and containers
- Monitor for drift
- Explain predictions using Grad-CAM

In this notebook, we shift perspective.

Instead of focusing on *training a model*, we focus on **operating a model over time**.

In the real world, models:

- Experience data drift
- Require retraining
- Must be versioned
- Need promotion decisions
- Should not automatically replace previous models

We will simulate a simple “production lifecycle”:

1. Train baseline model (v1)
2. Save artifacts + metadata
3. Simulate new incoming data
4. Detect drift
5. Retrain (v2)
6. Compare v1 vs v2
7. Decide whether to promote

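This notebook runs locally. On the first run, `tensorflow_datasets` downloads the `tf_flowers` dataset and Keras downloads the MobileNetV2 ImageNet weights, so an internet connection is needed once.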


In [2]:
import os
import json
import shutil
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds
from datetime import datetime
from scipy.stats import entropy

print("TensorFlow version:", tf.__version__)

IMG_SIZE = 160
BATCH_SIZE = 32
MODEL_DIR = "model_registry"

os.makedirs(MODEL_DIR, exist_ok=True)

# Load Dataset

(ds_train, ds_val), ds_info = tfds.load(
    "tf_flowers",
    split=["train[:80%]", "train[80%:]"],
    as_supervised=True,
    with_info=True
)

NUM_CLASSES = ds_info.features["label"].num_classes
CLASS_NAMES = ds_info.features["label"].names

def preprocess(image, label):
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = image / 255.0
    return image, label

ds_train = ds_train.map(preprocess).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
ds_val = ds_val.map(preprocess).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

print("Classes:", CLASS_NAMES)


TensorFlow version: 2.9.1
Classes: ['dandelion', 'daisy', 'tulips', 'sunflowers', 'roses']


In [3]:
# Build Model (Transfer Learning)

base_model = tf.keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
    include_top=False,
    weights="imagenet"
)

base_model.trainable = False

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 mobilenetv2_1.00_160 (Funct  (None, 5, 5, 1280)       2257984   
 ional)                                                          
                                                                 
 global_average_pooling2d (G  (None, 1280)             0         
 lobalAveragePooling2D)                                          
                                                                 
 dense (Dense)               (None, 5)                 6405      
                                                                 
Total params: 2,264,389
Trainable params: 6,405
Non-trainable params: 2,257,984
_________________________________________________________________


In [4]:
# Train Baseline Model (v1)

history = model.fit(
    ds_train,
    validation_data=ds_val,
    epochs=3
)

val_loss, val_acc = model.evaluate(ds_val)
print("Validation accuracy (v1):", val_acc)


Epoch 1/3
Epoch 2/3
Epoch 3/3
Validation accuracy (v1): 0.8746594190597534


In [5]:
# Save Model + Metadata (Model Registry Simulation)

version = "v1"
version_path = os.path.join(MODEL_DIR, version)
os.makedirs(version_path, exist_ok=True)

model.save(os.path.join(version_path, "model"))

metadata = {
    "version": version,
    "timestamp": str(datetime.now()),
    "validation_accuracy": float(val_acc)
}

with open(os.path.join(version_path, "metadata.json"), "w") as f:
    json.dump(metadata, f, indent=4)

print("Saved model version:", version)




INFO:tensorflow:Assets written to: model_registry\v1\model\assets



Saved model version: v1
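
Once more than one version has been saved, the registry can be inspected by reading each `metadata.json`. The following is a minimal sketch, assuming the `model_registry/<version>/metadata.json` layout used in the cell above. The helper name `list_versions` is hypothetical and not part of any library.

```python
import json
import os


def list_versions(registry_dir):
    """Return the metadata dicts found under registry_dir, sorted by folder name.

    Hypothetical helper: assumes each version lives in its own folder
    containing a metadata.json file, as in the cell above.
    """
    records = []
    # Folder names sort lexicographically, so "v10" would come before "v2".
    for name in sorted(os.listdir(registry_dir)):
        meta_path = os.path.join(registry_dir, name, "metadata.json")
        # Skip anything that is not a version folder with metadata.
        if not os.path.isfile(meta_path):
            continue
        with open(meta_path) as f:
            records.append(json.load(f))
    return records


# Only runs if the registry from this notebook exists on disk.
if os.path.isdir("model_registry"):
    for record in list_versions("model_registry"):
        print(record["version"], record["validation_accuracy"])
```

Keeping the metadata next to the saved model means a version can be audited without loading the model itself.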


In [6]:
# Simulate Incoming Drifted Data (we artificially introduce brightness shift)

def simulate_drift(image, label):
    image = tf.image.adjust_brightness(image, delta=0.3)
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label

ds_drift = ds_val.unbatch().map(simulate_drift).batch(BATCH_SIZE)

# Evaluate Drift Impact

drift_loss, drift_acc = model.evaluate(ds_drift)
print("Accuracy on drifted data:", drift_acc)


Accuracy on drifted data: 0.8160762786865234


In [7]:
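# Simple Drift Detection (KL Divergence)

def get_prediction_distribution(model, dataset):
    # Average the softmax outputs over the whole dataset. The result is the
    # model's mean predicted class mix, which is itself a distribution.
    # Keras ignores the labels when predicting on (image, label) batches.
    preds = model.predict(dataset, verbose=0)
    return preds.mean(axis=0)

dist_original = get_prediction_distribution(model, ds_val)
dist_drifted = get_prediction_distribution(model, ds_drift)

# scipy's entropy(p, q) returns the KL divergence KL(p || q) in nats.
kl_div = entropy(dist_original, dist_drifted)
print("KL Divergence between original and drifted predictions:", kl_div)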


KL Divergence between original and drifted predictions: 0.004618276
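
To make the statistic above less of a black box, here is a minimal standard-library sketch of the same quantity, KL(p || q) = Σ pᵢ · ln(pᵢ / qᵢ). The function name `kl_divergence` and the two example distributions are illustrative assumptions, not values from this notebook.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Illustrative helper: both inputs are normalized to sum to 1 first,
    which is also what scipy.stats.entropy does.
    """
    p_total, q_total = sum(p), sum(q)
    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / p_total, q_i / q_total
        if p_i == 0:
            continue  # terms with p_i == 0 contribute nothing
        if q_i == 0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical class mixes, chosen only to show the behavior.
reference = [0.5, 0.5]
shifted = [0.9, 0.1]
print(kl_divergence(reference, shifted))
print(kl_divergence(shifted, reference))  # KL is not symmetric
```

Because the notebook compares *mean* predictions, two datasets can give a near-zero value even when many individual predictions changed, as long as the changes cancel out on average.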


## Decision Point

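On the drifted data, accuracy fell from 0.875 to 0.816, while the KL divergence between the mean prediction distributions is only about 0.005. The model is measurably worse even though its average predicted class mix barely moved, so a single drift statistic should not be the only trigger.

If performance degrades significantly and drift is detected,
we may choose to retrain.

This is not automatic: it requires a governance decision.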
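
One way to support that decision is to collect the signals in one place and hand them to a reviewer. This is a rough sketch only: the function name `drift_signals` and both thresholds (`max_acc_drop=0.05`, `kl_threshold=0.1`) are assumptions for illustration and would need to be set per project. The example call uses rounded values from the cells above.

```python
def drift_signals(baseline_acc, current_acc, kl_div,
                  max_acc_drop=0.05, kl_threshold=0.1):
    """Summarize the evidence for retraining; a person makes the final call.

    Thresholds are hypothetical defaults, not recommended values.
    """
    acc_drop = baseline_acc - current_acc
    accuracy_degraded = acc_drop > max_acc_drop
    drift_detected = kl_div > kl_threshold
    return {
        "accuracy_drop": acc_drop,
        "accuracy_degraded": accuracy_degraded,
        "drift_detected": drift_detected,
        # Either signal alone is enough to ask for a human review.
        "needs_review": accuracy_degraded or drift_detected,
    }


signals = drift_signals(baseline_acc=0.8747, current_acc=0.8161, kl_div=0.0046)
print(signals)
```

The function only reports; it never triggers retraining itself, which keeps the human decision explicit.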


In [8]:
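# Retrain as Version v2

# clone_model copies the architecture only: every layer, including the
# MobileNetV2 backbone, gets freshly initialized weights, so v2 starts from
# random weights rather than from v1 or ImageNet.
# To warm-start from v1, call model_v2.set_weights(model.get_weights()).
model_v2 = tf.keras.models.clone_model(model)
model_v2.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

# Note: ds_drift is derived from ds_val, so v2 is trained on shifted copies
# of the same images it is evaluated on below. A real pipeline would keep
# the evaluation set separate from the retraining data.
model_v2.fit(ds_drift, epochs=3)

v2_loss, v2_acc = model_v2.evaluate(ds_val)
print("Validation accuracy (v2):", v2_acc)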


Epoch 1/3
Epoch 2/3
Epoch 3/3
Validation accuracy (v2): 0.25340598821640015


In [9]:
# Save v2

version = "v2"
version_path = os.path.join(MODEL_DIR, version)
os.makedirs(version_path, exist_ok=True)

model_v2.save(os.path.join(version_path, "model"))

metadata = {
    "version": version,
    "timestamp": str(datetime.now()),
    "validation_accuracy": float(v2_acc)
}

with open(os.path.join(version_path, "metadata.json"), "w") as f:
    json.dump(metadata, f, indent=4)

print("Saved model version:", version)




INFO:tensorflow:Assets written to: model_registry\v2\model\assets



Saved model version: v2
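
Steps 6 and 7 of the lifecycle (compare v1 with v2, then decide whether to promote) can be sketched with the saved metadata alone. The helper `choose_production` and its `min_gain` parameter are hypothetical. The two dicts mirror the `metadata.json` files written above, with accuracies rounded from this run.

```python
def choose_production(current, candidate, min_gain=0.0):
    """Return the version name that should serve traffic.

    Hypothetical rule: the candidate is promoted only if its validation
    accuracy beats the current model by more than min_gain. Both accuracies
    must come from the same evaluation set for the comparison to be fair.
    """
    gain = candidate["validation_accuracy"] - current["validation_accuracy"]
    if gain > min_gain:
        return candidate["version"]
    return current["version"]


v1_meta = {"version": "v1", "validation_accuracy": 0.8747}
v2_meta = {"version": "v2", "validation_accuracy": 0.2534}

print("Production version:", choose_production(v1_meta, v2_meta))
```

With the numbers from this run the rule keeps v1, since v2 scores far lower on the same validation set.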


## Reflection

You have simulated:

- Model versioning
- Drift detection
- Performance degradation
- Retraining
- Promotion decision logic

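This is what happens in real ML systems over months or years.

In this run, v2 scored 0.25 on validation against 0.87 for v1, so v2 would not be promoted. A retrained model is only a candidate until it beats the model already in production.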

The important takeaway: ***Training a model is not the end of the workflow — it is the beginning of a lifecycle.***
