# Model.fit() input as ```tf.data.Dataset```

* [tf.keras.Model.fit()](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit)

> A tf.data dataset. Should return a tuple of either **(inputs, targets)** or (inputs, targets, sample_weights).

* [How do I train my keras model using Tf Datasets #720](https://github.com/tensorflow/datasets/issues/720)

> You can use ```tfds.load(as_supervised=True)``` kwargs to return an **```(image, label)``` tuple expected by keras**. 
> For images, you would have in addition to cast/normalize the image to tf.float32, for this, you can use ```tf.data.Dataset.map```.
> 
> ```
> def _normalize_img(img, label):
>   img = tf.cast(img, tf.float32) / 255.
>   return (img, label)
> 
> ds = tfds.load('mnist', split='train', as_supervised=True)
> ds = ds.batch(32)
> ds = ds.map(_normalize_img)
>
> model.fit(ds_train, epochs=5)
> ```
 
* [How does tf.keras.Model tell between features and label(s) in tf.data.Dataset and in TFRecords?](https://stackoverflow.com/a/59838140/4281353) 

> As such, the dataset that is given to model.fit is actually **a dataset of tuples**, and to the best of my knowledge, this is exactly what the model will assume if you provide a tf.data.Dataset as input to the fit function -- **a dataset of tuples (inputs, labels)**. So the first will be taken as input to the model, the second as target for the loss function.

* [Support model.fit using targets in a dictionary](https://github.com/tensorflow/tensorflow/issues/24962#issuecomment-475709720)

> ```
> def make_dataset(images, labels, batch_size=64, buffer_size=1024, shuffle=True):
>     inputs = dict(images=images)
>     outputs = dict(labels=labels)
>     dataset = tf.data.Dataset.from_tensor_slices((inputs, outputs))
>     if shuffle:
>         dataset = dataset.shuffle(buffer_size=buffer_size)
>     dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
>     dataset = dataset.map(map_func=normalize_fn, num_parallel_calls=8)
>     dataset = dataset.batch(batch_size)
>     return dataset
> 
> model.add(tf.keras.Input(shape=(28, 28, 1), name='images'))
> model.add(tf.keras.layers.Dense(10, activation='softmax', name='labels'))
> ```

# Basics

1. Generate a ```tf.data.Dataset``` that returns ```(data, label)``` tuples.
2. Use ```tf.data.Dataset.from_tensor_slices((inputs, outputs))``` where ```inputs``` and ```outputs``` are separate sequences of the same length.
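
As a minimal sketch of the two steps above, the snippet below builds a dataset from two separate arrays and unpacks its first element. The array names, shapes, and values are made up for illustration and are not from the original notes.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 10 samples with 4 features each, and 10 integer labels.
inputs = np.random.rand(10, 4).astype(np.float32)
outputs = np.arange(10, dtype=np.int64)

# Passing a tuple makes every dataset element an (input, target) pair.
dataset = tf.data.Dataset.from_tensor_slices((inputs, outputs))

# Each element unpacks into one input and its matching target.
first_input, first_target = next(iter(dataset))
print(first_input.shape, first_target.numpy())
```

The elements here are still unbatched, so the dataset needs ```batch()``` before it is handed to ```fit()```.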

## Batch Shape

```
model.fit(dataset)
---
ValueError: Input 0 of layer "model_6" is incompatible with the layer:
expected shape=(None, 448, 448, 3), found shape=(448, 448, 3)
```

```tf.keras.Model.fit()``` expects batches. Do not forget to call ```tf.data.Dataset.batch(batch_size)``` so that each element has the batched shape.
```
model.fit(dataset.batch(batch_size))
```
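
One way to catch this before calling ```fit()``` is to inspect ```element_spec```. The sketch below uses a hypothetical stand-in dataset of blank 448x448 RGB images (not the dataset from the error above) and compares the input shape before and after batching.

```python
import tensorflow as tf

# Hypothetical stand-in for an image dataset: 6 blank 448x448 RGB images with labels.
images = tf.zeros((6, 448, 448, 3))
labels = tf.zeros((6,), dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

# Without batch(), each element is a single image with no leading batch axis.
unbatched_shape = dataset.element_spec[0].shape

# batch() adds the leading batch axis that the model input layer expects.
batched_shape = dataset.batch(2).element_spec[0].shape

print(unbatched_shape, batched_shape)
```

The batch axis is reported as ```None``` because the last batch may be smaller than ```batch_size``` unless ```drop_remainder=True``` is set.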

## Don'ts

Do not attempt to manipulate the shape. The ```tf.data.Dataset``` should already be in a state that ```tf.keras.Model.fit()``` can accept.

### ValueError: Creating variables on a non-first call to a function decorated with tf.function.

If you try to manipulate the shape, it can cause issues. For instance, the code below creates new Tensors in the ```tf.data.Dataset.map()``` function, which is invoked during ```tf.keras.Model.fit()``` running in graph mode. Do NOT use ```tf.config.run_functions_eagerly(True)``` to work around it.

```
# tf.config.run_functions_eagerly(False)
def mapper(image, label):
    return (
        tf.expand_dims(image, axis=0),  # <--- creating a new Tensor
        tf.expand_dims(label, axis=0)   # <--- creating a new Tensor
    )

model.fit(train_dataset.map(mapper))
```



See [Running the Tensorflow 2.0 code gives 'ValueError: tf.function-decorated function tried to create variables on non-first call'. What am I doing wrong?](https://stackoverflow.com/a/59209937/4281353) for other errors.



---
# Using generator

## steps_per_epoch

```tf.keras.Model.fit()``` does not know the number of records that the generator can provide. You need to tell ```fit()``` that it can consume ```num_batches_per_epoch * batch_size``` records per epoch by passing ```num_batches_per_epoch``` as the ```steps_per_epoch``` argument.

```tf.keras.Model.fit()``` keeps consuming records from the generator during training. In total, ```fit``` consumes ```batch_size * num_batches_per_epoch * num_epochs``` records. The generator needs to be able to provide that many records.

### Calculation

```steps_per_epoch = total_available_records / batch_size / num_epochs```
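
As a worked instance of this formula, the sketch below plugs in the MNIST training set size (60000 records) with a batch size of 8 and 2 epochs. The helper function name is made up for illustration.

```python
import math


def steps_per_epoch(total_available_records: int, batch_size: int, num_epochs: int) -> int:
    """Batches fit() may draw per epoch from a generator that makes a single pass."""
    # Round down so that fit() never asks for more batches than the generator holds.
    return math.floor(total_available_records / batch_size / num_epochs)


steps = steps_per_epoch(total_available_records=60000, batch_size=8, num_epochs=2)
print(steps)  # 3750
```

Dividing by ```num_epochs``` is only needed because the generator is consumed once across all epochs and is not restarted at each epoch.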


## Prevent exhausting generator

1. Set the ```steps_per_epoch``` and ```validation_steps``` arguments, or
2. Implement a loop inside the generator so that it keeps producing records.
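
The second option can be sketched in plain Python as below. The function name and the toy record list are assumptions for illustration; a real generator would yield ```(inputs, targets)``` batches instead of slices of integers.

```python
import itertools


def repeating_batches(records, batch_size):
    """Yield batches forever by restarting from the first record after each pass."""
    while True:
        for start in range(0, len(records), batch_size):
            yield records[start:start + batch_size]


records = list(range(10))
generator = repeating_batches(records, batch_size=4)

# Draw more batches than one pass over the records contains (3 batches per pass).
batches = list(itertools.islice(generator, 5))
print(batches)
```

A generator like this never ends, so ```steps_per_epoch``` is still needed to tell ```fit()``` where each epoch stops.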


<img src="./image/tf_keras_model_fit_steps_per_epoch.png" align="left" width=700/>


# Example

```
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
```


```
(train, test), info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
```


```
def f(image, label):
    return 1
```

```
num_total_train_records = len(list(
    train.map(f)
))
num_total_test_records = len(list(
    test.map(f)
))
print(num_total_train_records, num_total_test_records)
```

```
60000 10000
```


## Model

```
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)
```

## Batches per epoch

```
num_epochs = 2
batch_size = 8

num_x_batches_per_epoch = int(np.floor(num_total_train_records / batch_size / num_epochs))
num_v_batches_per_epoch = int(np.floor(num_total_test_records / batch_size / num_epochs)) - 1  # validation ran out of data without the -1

print(num_x_batches_per_epoch, num_v_batches_per_epoch)
```

```
3750 624
```


## Without steps_per_epoch

```model.fit``` will exhaust the generator and cause the error:

> Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 1250 batches). You may need to use the repeat() function when building your dataset.
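
The exhaustion comes from ```as_numpy_iterator()``` returning a single-pass iterator. The sketch below uses a small hypothetical range dataset (not MNIST) and counts the batches seen on two consecutive passes, first over the dataset itself and then over an iterator made from it.

```python
import tensorflow as tf

# Hypothetical toy dataset: the integers 0..9 in batches of 4, so 3 batches per pass.
dataset = tf.data.Dataset.range(10).batch(4)

# A Dataset can be iterated from the start any number of times.
dataset_passes = [len(list(dataset)), len(list(dataset))]

# An iterator made from it is single-pass: the second pass yields nothing.
iterator = dataset.as_numpy_iterator()
iterator_passes = [len(list(iterator)), len(list(iterator))]

print(dataset_passes, iterator_passes)
```

With an iterator, every batch that ```fit()``` draws in one epoch is gone for the next, which is why the total across all epochs has to fit into a single pass.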

```
x_generator = train.batch(batch_size).as_numpy_iterator()
v_generator = test.batch(batch_size).as_numpy_iterator()

model.fit(
    x=x_generator,
    epochs=num_epochs,
    batch_size=batch_size,
    #steps_per_epoch=num_x_batches_per_epoch,
    #validation_data=v_generator,
    #validation_steps=num_v_batches_per_epoch,
    #validation_batch_size=batch_size
)
```

```
Epoch 1/2
Epoch 2/2
<keras.src.callbacks.History at 0x7f39b015ab30>
```

## With steps_per_epoch

```
x_generator = train.batch(batch_size).as_numpy_iterator()
v_generator = test.batch(batch_size).as_numpy_iterator()

model.fit(
    x=x_generator,
    epochs=num_epochs,
    batch_size=batch_size,
    steps_per_epoch=num_x_batches_per_epoch,
    #validation_data=v_generator,
    #validation_steps=num_v_batches_per_epoch,
    #validation_batch_size=batch_size
)
```

```
Epoch 1/2
Epoch 2/2
<keras.src.callbacks.History at 0x7f39b00b7e20>
```

---
# Validation without validation_steps 

Although the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) says ```validation_steps``` is only relevant for a **tf.data.Dataset**, it is required for a **generator as well**.

> Only relevant if validation_data is provided and **is a tf.data dataset**. Total ```number of steps (batches of samples)``` to draw before stopping when performing validation at the end of every epoch.

```
x_generator = train.batch(batch_size).as_numpy_iterator()
v_generator = test.batch(batch_size).as_numpy_iterator()

model.fit(
    x=x_generator,
    epochs=num_epochs,
    batch_size=batch_size,
    steps_per_epoch=num_x_batches_per_epoch,
    validation_data=v_generator,
    #validation_steps=num_v_batches_per_epoch,
    #validation_batch_size=batch_size
)
```

```
Epoch 1/2
Epoch 2/2
  57/3750 [..............................] - ETA: 10s - loss: 0.4637 - sparse_categorical_accuracy: 0.8947
<keras.src.callbacks.History at 0x7f39a040dd80>
```

---
# Validation with validation_steps 

```
x_generator = train.batch(batch_size).as_numpy_iterator()
v_generator = test.batch(batch_size).as_numpy_iterator()

model.fit(
    x=x_generator,
    epochs=num_epochs,
    batch_size=batch_size,
    steps_per_epoch=num_x_batches_per_epoch,
    validation_data=v_generator,
    validation_steps=num_v_batches_per_epoch,
    validation_batch_size=batch_size
)
```

```
Epoch 1/2
Epoch 2/2
<keras.src.callbacks.History at 0x7f39a040f400>
```

---
# validation_steps confusion

```validation_steps``` seems to require ```-1```: with the computed value of 625, validation runs out of data, while 624 works.

* [tensorflow - tf.keras.Model.fit causes run out of data for validation data with validation_steps being set](https://stackoverflow.com/questions/77520936/tensorflow-tf-keras-model-fit-causes-run-out-of-data-for-validation-data-with)

* [tensorflow - tf.keras.Model.fit causes run out of data for validation data with validation_steps being set#62444](https://github.com/tensorflow/tensorflow/issues/62444)

```
num_epochs = 2
batch_size = 8

num_x_batches_per_epoch = int(np.floor(num_total_train_records / batch_size / num_epochs))
# without -1
num_v_batches_per_epoch = int(np.floor(num_total_test_records / batch_size / num_epochs))

print(num_x_batches_per_epoch, num_v_batches_per_epoch)
```

```
3750 625
```


```
x_generator = train.batch(batch_size).as_numpy_iterator()
v_generator = test.batch(batch_size).as_numpy_iterator()

model.fit(
    x=x_generator,
    epochs=num_epochs,
    batch_size=batch_size,
    steps_per_epoch=num_x_batches_per_epoch,
    validation_data=v_generator,
    validation_steps=num_v_batches_per_epoch,
    validation_batch_size=batch_size
)
```

```
Epoch 1/2
Epoch 2/2
<keras.src.callbacks.History at 0x7f39a04b2c20>
```