# Training a neural network

In this notebook, we will train a deep neural network and experiment with:
- Optimization algorithm
- Batch normalization
- Activation function
- Regularization, dropout

We will use the CIFAR10 image dataset.

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np


In [2]:
print("****Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))

****Num GPUs Available: 1


## Load data

In [3]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()
# Hold out the first 5,000 training images as a validation set.
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [4]:
# Scale pixel values from [0, 255] to [0, 1].
X_train = X_train.astype("float32") / 255.0
X_valid = X_valid.astype("float32") / 255.0
X_test = X_test.astype("float32") / 255.0

In [5]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

## Create neural network model

Having too many layers exacerbates the problem of vanishing or exploding gradients, so let's start with a model of 20 hidden layers.  We will try a small learning rate and batch normalization to obtain a stable solution during training.
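
To see why depth matters, consider a deliberately simplified scalar model (an assumption for illustration, not the network built below): if every layer scales the backpropagated gradient by the same constant factor, the gradient reaching the first layer is that factor raised to the number of layers. The function name `gradient_scale` and the gain values are hypothetical.

```python
def gradient_scale(per_layer_gain, n_layers=20):
    """Magnitude of a gradient after passing back through n_layers layers,
    assuming every layer multiplies it by the same constant gain."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_gain
    return scale

for gain in (0.5, 1.0, 1.5):
    print(f"gain={gain}: gradient scaled by {gradient_scale(gain):.3g}")
```

With 20 layers, a gain of 0.5 shrinks the gradient by roughly a factor of a million, while a gain of 1.5 amplifies it more than a thousandfold. Real networks are not this uniform, but the compounding effect is the same.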

In [6]:
def build_model():
    # 20 hidden ELU layers of 100 units each (He-initialized), then a 10-way softmax.
    model = keras.models.Sequential(
        [keras.layers.Flatten(input_shape=(32, 32, 3))] +
        [keras.layers.Dense(100, activation="elu",
                            kernel_initializer=keras.initializers.HeNormal())
         for _ in range(20)] +
        [keras.layers.Dense(10, activation="softmax")]
    )
    return model

In [7]:
nn1 = build_model()



## Optimize with a "too large" learning rate

First, let's try optimizing with a "large" learning rate.  The model does not converge, and its predictions collapse to a single class.

In [8]:
opt1 = keras.optimizers.Nadam(learning_rate=1.0)

nn1.compile(loss="sparse_categorical_crossentropy",
           optimizer=opt1,
           metrics=["accuracy"])
nn1.fit(X_train, y_train, epochs=3,
       validation_data=(X_valid,y_valid))

Epoch 1/3
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 9s 4ms/step - accuracy: 0.1011 - loss: nan - val_accuracy: 0.0996 - val_loss: nan
Epoch 2/3
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.1007 - loss: nan - val_accuracy: 0.0996 - val_loss: nan
Epoch 3/3
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.1007 - loss: nan - val_accuracy: 0.0996 - val_loss: nan


<keras.src.callbacks.history.History at 0x79f8c890d9d0>
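
A `nan` loss is most likely what divergence looks like here. A one-dimensional sketch shows the mechanism, under the assumption of plain gradient descent on f(w) = w² rather than Nadam on the network above: each step multiplies w by (1 - 2·lr), so the iterate shrinks only when that factor has magnitude below 1. The function name `gradient_descent` and the learning rates are chosen for illustration.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w -= lr * grad      # equivalent to w *= (1 - 2 * lr)
    return w

for lr in (0.1, 1.5):
    print(f"lr={lr}: |w| after 20 steps = {abs(gradient_descent(lr)):.3g}")
```

With `lr=0.1` the iterate decays toward the minimum at 0; with `lr=1.5` its magnitude doubles at every step. In floating point, this kind of growth eventually overflows, which is one way a loss ends up as `nan`.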

In [30]:
train_loss, train_accuracy = nn1.evaluate(X_train, y_train, verbose=0)
print(f"Training accuracy from evaluate(): {train_accuracy:.4f}")

Training accuracy from evaluate(): 0.1000


The loss is `nan` from the first epoch, and accuracy sits at the 10% chance level for ten classes.  From the classification report, we see that the model ends up classifying everything into class 0.
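
A quick way to confirm this kind of collapse is to count how often each class is predicted. The sketch below uses hypothetical label lists; in this notebook, the predictions would presumably come from something like `np.argmax(nn1.predict(X_valid), axis=1)`. The helper name `prediction_counts` is an assumption for illustration.

```python
from collections import Counter

def prediction_counts(y_pred, n_classes=10):
    """Return how many times each class index appears in y_pred."""
    counts = Counter(y_pred)
    return [counts.get(c, 0) for c in range(n_classes)]

# Hypothetical predictions from a collapsed model: every sample gets class 0.
y_pred = [0] * 8
y_true = [0, 1, 2, 3, 4, 5, 6, 7]

counts = prediction_counts(y_pred)
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print(counts)     # every prediction falls in the first slot
print(accuracy)   # 1/8 in this toy example
```

If all the mass sits in one slot of `counts`, the accuracy is just the share of that class in the data, which is about 0.10 on ten balanced classes.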

## Better learning rate
Now, let's try a much smaller learning rate.  The model converges, and predictions improve: validation accuracy reaches 0.3886 after 5 epochs.

In [10]:
nn2 = build_model()

opt2 = keras.optimizers.Nadam(learning_rate=1e-5)

nn2.compile(loss="sparse_categorical_crossentropy",
           optimizer=opt2,
           metrics=["accuracy"])
nn2.fit(X_train, y_train, epochs=5,
       validation_data=(X_valid,y_valid))



Epoch 1/5
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 9s 4ms/step - accuracy: 0.1970 - loss: 2.2451 - val_accuracy: 0.3084 - val_loss: 1.8833
Epoch 2/5
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.3137 - loss: 1.8863 - val_accuracy: 0.3446 - val_loss: 1.7909
Epoch 3/5
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.3431 - loss: 1.8080 - val_accuracy: 0.3640 - val_loss: 1.7441
Epoch 4/5
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.3651 - loss: 1.7578 - val_accuracy: 0.3754 - val_loss: 1.7112
Epoch 5/5
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.3792 - loss: 1.7186 - val_accuracy: 0.3886 - val_loss: 1.6830


<keras.src.callbacks.history.History at 0x79f849e5fc20>

In [29]:
train_loss, train_accuracy = nn2.evaluate(X_train, y_train, verbose=0)
print(f"Training accuracy from evaluate(): {train_accuracy:.4f}")

Training accuracy from evaluate(): 0.3962
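
To put these loss values in context: sparse categorical cross-entropy is the negative log of the probability assigned to the true class, so a model that spreads its probability uniformly over 10 classes scores ln(10), about 2.30. The first-epoch training loss of 2.2451 is close to that baseline. The sketch below mirrors the per-sample definition only; the function is a hypothetical pure-Python stand-in, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Per-sample loss: negative log-probability of the true class."""
    return -math.log(probs[label])

uniform = [0.1] * 10   # a model that has learned nothing about 10 classes
print(f"{sparse_categorical_crossentropy(uniform, 3):.4f}")  # ln(10), about 2.3026
```

Any epoch-level loss well below this value indicates the model has learned something beyond uniform guessing.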


## Batch Normalization
Let's add a batch normalization layer after each hidden layer.
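
As a reminder of what the layer computes during training, here is a minimal pure-Python sketch for a single feature: subtract the batch mean, divide by the batch standard deviation (with a small `eps` for numerical stability), then apply a learned scale `gamma` and shift `beta`. The function name and sample values are hypothetical, and the sketch ignores the moving averages that Keras's `BatchNormalization` uses at inference time.

```python
import math
from statistics import fmean, pvariance

def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature across a mini-batch (training-mode behavior)."""
    mean = fmean(batch)
    var = pvariance(batch, mu=mean)   # population variance of the batch
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]    # hypothetical pre-normalization values
normalized = batch_norm_1d(activations)
print([round(v, 3) for v in normalized])
```

The output has zero mean and a variance just under 1 (slightly below because of `eps`), regardless of the scale of the inputs.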

In [31]:
keras.backend.clear_session()
# Same architecture as build_model(), with BatchNormalization after each hidden layer.
nn3 = keras.models.Sequential(
    [keras.layers.Flatten(input_shape=(32, 32, 3))] +
    [layer
     for _ in range(20)
     for layer in (
         keras.layers.Dense(100, activation="elu",
                            kernel_initializer=keras.initializers.HeNormal()),
         keras.layers.BatchNormalization())] +
    [keras.layers.Dense(10, activation="softmax")]
)



In [32]:
nn3.summary()

Let's train for more epochs, with an early stopping rule, and log each run for TensorBoard.
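
The idea behind an early stopping rule can be sketched in a few lines: track the best validation loss seen so far and stop once it has failed to improve for `patience` consecutive epochs. This is a simplified illustration with made-up loss values, not the Keras `EarlyStopping` implementation; the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None
    if the patience budget is never exhausted."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # new best: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

losses = [1.00, 0.90, 0.95, 0.96, 0.97, 0.89]   # made-up validation losses
print(early_stopping_epoch(losses, patience=2))  # 3
print(early_stopping_epoch(losses, patience=5))  # None
```

A small `patience` stops at the first plateau and misses the later improvement at the last epoch; a larger one tolerates the plateau at the cost of more training time.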

In [33]:
import os
import time

def get_run_logdir():
    # One timestamped subdirectory per run, so TensorBoard can tell runs apart.
    run_id = time.strftime("run_%Y%m%d-%H%M%S")
    return os.path.join(os.pardir, "logs", "cifar10", run_id)

run_logdir = get_run_logdir()

In [34]:
%load_ext tensorboard
%tensorboard --logdir=../logs/cifar10/ --port=6006



In [35]:
opt3 = keras.optimizers.Nadam(learning_rate=1e-5)
early_stopping_cb = keras.callbacks.EarlyStopping(patience=5)
model_checkpoint_cb = keras.callbacks.ModelCheckpoint("../models/my_cifar10_model.keras")
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)

nn3.compile(loss="sparse_categorical_crossentropy",
           optimizer=opt3,
           metrics=["accuracy"])
nn3.fit(X_train, y_train, epochs=100,
       validation_data=(X_valid,y_valid),
       callbacks=[early_stopping_cb, model_checkpoint_cb, tensorboard_cb])

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 17s 8ms/step - accuracy: 0.1305 - loss: 2.7251 - val_accuracy: 0.1978 - val_loss: 2.3003
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 8s 5ms/step - accuracy: 0.2002 - loss: 2.2626 - val_accuracy: 0.2374 - val_loss: 2.1456
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 8s 6ms/step - accuracy: 0.2484 - loss: 2.0969 - val_accuracy: 0.2572 - val_loss: 2.0711
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 8s 6ms/step - accuracy: 0.2809 - loss: 1.9994 - val_accuracy: 0.2654 - val_loss: 2.0312
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.3044 - loss: 1.9335 - val_accuracy: 0.2770 - val_loss: 2.0037
Epoch 6/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.3225 - loss: 1.8825 - val_accuracy: 0.2848 - val_loss: 1.9893
Epoch 7/1

<keras.src.callbacks.history.History at 0x79f5145edd90>

In [None]:
train_loss, train_accuracy = nn3.evaluate(X_train, y_train, verbose=0)
print(f"Training accuracy from evaluate(): {train_accuracy:.4f}")

Training accuracy from evaluate(): 0.3376


In [38]:
valid_loss, valid_accuracy = nn3.evaluate(X_valid, y_valid, verbose=0)
print(f"Validation accuracy from evaluate(): {valid_accuracy:.4f}")
print(f"Validation loss from evaluate(): {valid_loss:.4f}")

Validation accuracy from evaluate(): 0.3088
Validation loss from evaluate(): 1.9648


With batch normalization, prediction performance on the validation set actually dropped: validation accuracy is 0.3088, compared with 0.3886 for the same network without batch normalization after 5 epochs.