# Model accuracy metric

In this notebook, we train a deep neural network with `keras`, and examine the training and validation accuracies reported by:
- `fit().history`
- `model.evaluate`

The training accuracy reported by these two methods is known to differ slightly. The same discrepancy applies to the loss.

A quick internet search brings up a few suspected causes: the difference between training and test mode in Keras, the learning phase, batch normalization, dropout, etc. However, I haven't been able to find a definitive answer.

It seems that the validation metrics always agree.  My main takeaways are then:

- Always have a validation set (validation loss and accuracy seem reliable).
- If it's necessary to calculate training loss or accuracy, use `model.evaluate` rather than relying on the `fit().history` logs.

For this investigation, we will use the CIFAR10 image dataset.

Some references on this issue:

https://github.com/keras-team/keras/issues/6977

https://stackoverflow.com/questions/51123198/strange-behaviour-of-the-loss-function-in-keras-model-with-pretrained-convoluti/51124511#51124511

https://blog.datumbox.com/the-batch-normalization-layer-of-keras-is-broken/

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np


In [2]:
print("****Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

****Num GPUs Available: 1


## Load data

In [3]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [4]:
X_train = X_train.astype("float32") / 255.0
X_valid = X_valid.astype("float32") / 255.0
X_test = X_test.astype("float32") / 255.0

In [5]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

## Build models
Let's define a few helper functions to make building models easier.

In [6]:
def build_model(batch_norm: bool):
    """Build a 20-hidden-layer dense network, optionally with batch normalization."""
    layer_list = [keras.layers.Flatten(input_shape=(32, 32, 3))]
    for _ in range(20):
        layer_list.append(keras.layers.Dense(100, activation="elu",
                                             kernel_initializer=keras.initializers.HeNormal()))
        if batch_norm:
            # One batch normalization layer after each hidden dense layer
            layer_list.append(keras.layers.BatchNormalization())
    layer_list.append(keras.layers.Dense(10, activation="softmax"))

    return keras.models.Sequential(layer_list)

In [7]:
def compile_and_fit(model, batch_size: int, total_epochs: int):
    opt = keras.optimizers.Nadam(learning_rate=1e-5)

    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=opt,
                  metrics=["accuracy"])

    history = model.fit(X_train, y_train, epochs=total_epochs,
                        batch_size=batch_size,
                        validation_data=(X_valid, y_valid))
    return history

In [8]:
def evaluate_model(model, batch_size):
    _, train_accuracy = model.evaluate(X_train, y_train, batch_size=batch_size, verbose=0)
    _, valid_accuracy = model.evaluate(X_valid, y_valid, batch_size=batch_size, verbose=0)
    return train_accuracy, valid_accuracy

## Train and evaluate

We test a few models with different batch sizes and numbers of training epochs, with and without batch normalization layers.

In [9]:
# batch=32, epochs=10
nn0_3210 = build_model(batch_norm = False)
nn1_3210 = build_model(batch_norm = True)
hist0_3210 = compile_and_fit(model=nn0_3210, batch_size=32, total_epochs=10)
hist1_3210 = compile_and_fit(model=nn1_3210, batch_size=32, total_epochs=10)

#batch=1024, epochs=25
nn0_102425 = build_model(batch_norm = False)
nn1_102425 = build_model(batch_norm = True)
hist0_102425 = compile_and_fit(model=nn0_102425, batch_size=1024, total_epochs=25)
hist1_102425 = compile_and_fit(model=nn1_102425, batch_size=1024, total_epochs=25)

Epoch 1/10
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 10s 4ms/step - accuracy: 0.1916 - loss: 2.4718 - val_accuracy: 0.3134 - val_loss: 1.8922
Epoch 2/10
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.3240 - loss: 1.8750 - val_accuracy: 0.3468 - val_loss: 1.7893
Epoch 3/10
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.3604 - loss: 1.7814 - val_accuracy: 0.3728 - val_loss: 1.7334
Epoch 4/10
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.3842 - loss: 1.7218 - val_accuracy: 0.3910 - val_loss: 1.6952
Epoch 5/10
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.3987 - loss: 1.6772 - val_accuracy: 0.4016 - val_loss: 1.6661
Epoch 6/10
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.4132 - loss: 1.6410 - val_accuracy: 0.4086 - val_loss: 1.6436
Epoch 7/10
...
44/44 ━━━━━━━━━━━━━━━━━━━━ 7s 67ms/step - accuracy: 0.1006 - loss: 3.0033 - val_accuracy: 0.1454 - val_loss: 2.4287
Epoch 2/25
44/44 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.1568 - loss: 2.3567 - val_accuracy: 0.2036 - val_loss: 2.1822
Epoch 3/25
44/44 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.2043 - loss: 2.1724 - val_accuracy: 0.2382 - val_loss: 2.0896
Epoch 4/25
44/44 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.2298 - loss: 2.0886 - val_accuracy: 0.2548 - val_loss: 2.0306
Epoch 5/25
44/44 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.2480 - loss: 2.0341 - val_accuracy: 0.2740 - val_loss: 1.9875
Epoch 6/25
44/44 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.2620 - loss: 1.9942 - val_accuracy: 0.2886 - val_loss: 1.9533
Epoch 7/25
...

In [10]:
eval0_3210 = evaluate_model(nn0_3210, batch_size=32)
eval1_3210 = evaluate_model(nn1_3210, batch_size=32)

eval0_102425 = evaluate_model(nn0_102425, batch_size=1024)
eval1_102425 = evaluate_model(nn1_102425, batch_size=1024)

In [None]:
# Without batch norm, batch=32, epochs=10
print("Training accuracy in the last epoch:", hist0_3210.history['accuracy'][-1])
print("Training accuracy using model evaluate:", eval0_3210[0])
print("Validation accuracy in the last epoch:", hist0_3210.history['val_accuracy'][-1])
print("Validation accuracy using model evaluate:", eval0_3210[1])

Training accuracy in the last epoch: 0.446911096572876
Training accuracy using model evaluate: 0.45251110196113586
Validation accuracy in the last epoch: 0.43299999833106995
Validation accuracy using model evaluate: 0.43299999833106995
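
This model has no batch normalization or dropout layers, so a difference between training and inference mode cannot account for the gap here. One likely contributor is that the value logged by `fit()` is a running average over the batches of the epoch, each scored with the weights the model had at that moment, whereas `model.evaluate` scores every sample with the final weights. The toy below is not Keras; it is a plain-Python sketch of that effect on a hypothetical one-parameter regression, and the data, learning rate, and function names are all assumptions made for illustration.

```python
import random


def make_data(n=200, seed=0):
    """Hypothetical noise-free data following y = 3x."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x for x in xs]
    return xs, ys


def mse(w, xs, ys):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_one_epoch(w, xs, ys, batch_size=20, lr=0.1):
    """Run one epoch of mini-batch SGD and return (final w, logged loss).

    The logged loss mimics a training log: each batch is scored with the
    weights in use *before* that batch's update, then the scores are averaged.
    """
    batch_losses = []
    for start in range(0, len(xs), batch_size):
        bx = xs[start:start + batch_size]
        by = ys[start:start + batch_size]
        batch_losses.append(mse(w, bx, by))
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
        w -= lr * grad
    return w, sum(batch_losses) / len(batch_losses)


xs, ys = make_data()
w, logged_loss = train_one_epoch(0.0, xs, ys)
final_loss = mse(w, xs, ys)  # analogous to evaluating after the epoch

print("loss averaged over batches during the epoch:", logged_loss)
print("loss with the final weights:                ", final_loss)
```

Because the weights improve during the epoch, the averaged value is worse than the one computed with the final weights. That matches the direction of the gap above, where `model.evaluate` reports the higher training accuracy.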


In [None]:
# With batch norm, batch=32, epochs=10
print("Training accuracy in the last epoch:", hist1_3210.history['accuracy'][-1])
print("Training accuracy using model evaluate:", eval1_3210[0])
print("Validation accuracy in the last epoch:", hist1_3210.history['val_accuracy'][-1])
print("Validation accuracy using model evaluate:", eval1_3210[1])

Training accuracy in the last epoch: 0.38588887453079224
Training accuracy using model evaluate: 0.32988888025283813
Validation accuracy in the last epoch: 0.31619998812675476
Validation accuracy using model evaluate: 0.31619998812675476
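
Batch normalization is one of the suspected causes listed at the top of this notebook. In training mode, a batch normalization layer normalizes with the statistics of the current batch; in inference mode, which `model.evaluate` uses, it normalizes with moving averages accumulated during training. The sketch below is plain Python, not the Keras layer. It assumes the Keras defaults of `momentum=0.99` and `epsilon=1e-3`, with moving statistics initialized to mean 0 and variance 1, and it omits the learnable scale and offset. The batch values are made up.

```python
def batch_stats(batch):
    """Return the mean and (biased) variance of one batch of scalars."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return mean, var


def normalize(batch, mean, var, eps=1e-3):
    """Normalize each value with the given mean and variance."""
    return [(v - mean) / (var + eps) ** 0.5 for v in batch]


MOMENTUM = 0.99
moving_mean, moving_var = 0.0, 1.0  # assumed initial moving statistics

batch = [4.0, 6.0, 5.0, 5.0]  # hypothetical activations of one unit
for _ in range(10):  # ten training steps on the same batch
    mean, var = batch_stats(batch)
    moving_mean = MOMENTUM * moving_mean + (1 - MOMENTUM) * mean
    moving_var = MOMENTUM * moving_var + (1 - MOMENTUM) * var

train_out = normalize(batch, mean, var)                # training mode
infer_out = normalize(batch, moving_mean, moving_var)  # inference mode

print("training-mode output: ", [round(v, 3) for v in train_out])
print("inference-mode output:", [round(v, 3) for v in infer_out])
```

After ten identical batches the moving mean has only moved a small way toward the batch mean, so the same inputs are normalized very differently in the two modes. The models in this notebook train for over a thousand steps, so how much of the observed gap this mechanism accounts for is not established here.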


In [None]:
# Without batch norm, batch=1024, epochs=25
print("Training accuracy in the last epoch:", hist0_102425.history['accuracy'][-1])
print("Training accuracy using model evaluate:", eval0_102425[0])
print("Validation accuracy in the last epoch:", hist0_102425.history['val_accuracy'][-1])
print("Validation accuracy using model evaluate:", eval0_102425[1])

Training accuracy in the last epoch: 0.38457778096199036
Training accuracy using model evaluate: 0.38642221689224243
Validation accuracy in the last epoch: 0.37720000743865967
Validation accuracy using model evaluate: 0.37720000743865967


In [None]:
# With batch norm, batch=1024, epochs=25
print("Training accuracy in the last epoch:", hist1_102425.history['accuracy'][-1])
print("Training accuracy using model evaluate:", eval1_102425[0])
print("Validation accuracy in the last epoch:", hist1_102425.history['val_accuracy'][-1])
print("Validation accuracy using model evaluate:", eval1_102425[1])

Training accuracy in the last epoch: 0.3581777811050415
Training accuracy using model evaluate: 0.3585111200809479
Validation accuracy in the last epoch: 0.30979999899864197
Validation accuracy using model evaluate: 0.30979999899864197
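
To compare the four runs side by side, the snippet below recomputes the gap between the two training accuracies from the values printed above, rounded to four decimals. The dictionary layout and variable names are only for illustration.

```python
# (fit().history accuracy, model.evaluate accuracy) on the training set,
# copied from the cell outputs above and rounded to four decimals
results = {
    "no BN, batch=32, epochs=10": (0.4469, 0.4525),
    "BN, batch=32, epochs=10": (0.3859, 0.3299),
    "no BN, batch=1024, epochs=25": (0.3846, 0.3864),
    "BN, batch=1024, epochs=25": (0.3582, 0.3585),
}

# Positive gap: model.evaluate reports the higher training accuracy
gaps = {
    name: evaluate_acc - history_acc
    for name, (history_acc, evaluate_acc) in results.items()
}

for name, gap in gaps.items():
    print(f"{name:<30} evaluate - history = {gap:+.4f}")

largest = max(gaps, key=lambda name: abs(gaps[name]))
print("Largest gap:", largest)
```

In these four runs, the only case where `model.evaluate` reports a lower training accuracy than `fit().history` is the batch-normalized model trained with a batch size of 32, which is also the largest gap.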
