# TensorFlow Dataset API
* Mohammad Hassan Heydari

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.16.2


In [2]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])

for element in dataset:
  print(element)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)


In [4]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
dataset = dataset.map(lambda x: x * 2)  # double every element
print(list(dataset.as_numpy_iterator()))
print(dataset)

[2, 4, 6]
<_MapDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>


**apply() method**
* Applies a transformation function to this dataset.

apply(
    transformation_func
) -> 'DatasetV2'

In [6]:
dataset = tf.data.Dataset.range(100)

def dataset_fn(ds):
  return ds.filter(lambda x: x < 10)

dataset = dataset.apply(dataset_fn)

list(dataset.as_numpy_iterator())

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
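
Because apply() takes a function from dataset to dataset, several such functions can be chained into a small pipeline. The sketch below is only an illustration: keep_below and double_all are hypothetical helper names, not part of TensorFlow.

```python
import tensorflow as tf

def keep_below(limit):
    """Hypothetical helper: builds a dataset-level transformation for apply()."""
    def transformation(ds):
        # Receives the whole dataset and must return a dataset.
        return ds.filter(lambda x: x < limit)
    return transformation

def double_all(ds):
    # Another dataset -> dataset function; map() works per element inside it.
    return ds.map(lambda x: x * 2)

dataset = tf.data.Dataset.range(100)
dataset = dataset.apply(keep_below(5)).apply(double_all)

result = [int(x) for x in dataset.as_numpy_iterator()]
print(result)  # [0, 2, 4, 6, 8]
```

Wrapping the filter in keep_below lets the threshold be chosen at the call site, while apply() keeps the pipeline readable as a chain.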

**as_numpy_iterator() method**
* Returns an iterator that converts all elements of the dataset to NumPy.

Use as_numpy_iterator to inspect the content of your dataset. To see element shapes and types, print dataset elements directly instead of using as_numpy_iterator.

In [7]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)


In [8]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset.as_numpy_iterator():
  print(element)

1
2
3


In [11]:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])

print(dataset.as_numpy_iterator(), end='\n---\n')
print(list(dataset.as_numpy_iterator()))

NumpyIterator(iterator=<tensorflow.python.data.ops.iterator_ops.OwnedIterator object at 0x7f4efacab550>)
---
[1, 2, 3]
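
as_numpy_iterator() also preserves the nested structure of dataset elements. The sketch below assumes a small dictionary-valued dataset invented for illustration; the keys "a" and "b" are arbitrary.

```python
import tensorflow as tf

# Each element is a dict: "a" holds a tuple of two scalars, "b" a single scalar.
dataset = tf.data.Dataset.from_tensor_slices(
    {"a": ([1, 2], [3, 4]), "b": [5, 6]}
)

elements = list(dataset.as_numpy_iterator())
for element in elements:
    # The dict/tuple nesting is kept; only the leaf tensors become NumPy values.
    print(element)
```

The first element pairs the first entries of each list, so "a" is (1, 3) and "b" is 5.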


**batch() method**
* Combines consecutive elements of this dataset into batches.

batch(
    batch_size,
    drop_remainder=False,
    num_parallel_calls=None,
    deterministic=None,
    name=None
) -> 'DatasetV2'


In [12]:
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3)
list(dataset.as_numpy_iterator())

[array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]

**num_parallel_calls** and **AUTOTUNE** : num_parallel_calls is a tf.int64 scalar tf.Tensor representing the number of batches to compute asynchronously in parallel. If it is not specified, batches are computed sequentially. If the value tf.data.AUTOTUNE is used, the number of parallel calls is set dynamically based on available resources.
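
A minimal sketch of parallel batching, assuming a TensorFlow version whose batch() accepts num_parallel_calls and deterministic (as in the signature above). Passing deterministic=True here is an illustrative choice to keep the batch order fixed.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(
    3,
    num_parallel_calls=tf.data.AUTOTUNE,  # let tf.data pick the parallelism
    deterministic=True,                   # keep batches in their original order
)

batches = [batch.tolist() for batch in dataset.as_numpy_iterator()]
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

On a dataset this small the parallelism brings no measurable benefit; it is more likely to matter when assembling each batch is expensive.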

**Note** : If your program requires data to have a statically known shape (e.g., when using XLA), you should use drop_remainder=True. Without drop_remainder=True, the output dataset has an unknown leading dimension because the final batch may be smaller.

In [13]:
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
list(dataset.as_numpy_iterator())

[array([0, 1, 2]), array([3, 4, 5])]
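
The effect on the static shape can be checked through element_spec. This is a small sketch; the variable names are chosen for illustration.

```python
import tensorflow as tf

with_remainder = tf.data.Dataset.range(8).batch(3)
without_remainder = tf.data.Dataset.range(8).batch(3, drop_remainder=True)

# Leading (batch) dimension is unknown when a smaller final batch is possible.
print(with_remainder.element_spec.shape)     # (None,)
# With drop_remainder=True every batch has exactly 3 elements.
print(without_remainder.element_spec.shape)  # (3,)
```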