
# Introduction
TFDS provides a collection of ready-to-use datasets for use with TensorFlow, JAX, and other machine learning frameworks.

It handles downloading and preparing the data deterministically and constructing a tf.data.Dataset (or np.array).



# Installation
TFDS exists in two packages:

1. pip install tensorflow-datasets: The stable version, released every few months.
2. pip install tfds-nightly: Released every day, contains the latest versions of the datasets.

This colab uses tfds-nightly:



In [1]:
pip install -q tfds-nightly tensorflow matplotlib

In [4]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds    

# Find available datasets

In [4]:
## To find all the available datasets
#tfds.list_builders() 

In [6]:
[data for data in tfds.list_builders() if "cat" in data]

['cats_vs_dogs',
 'visual_domain_decathlon',
 'huggingface:acronym_identification',
 'huggingface:catalonia_independence',
 'huggingface:interpress_news_category_tr']
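The substring filter above is case-sensitive and returns names in whatever order the list arrives. A minimal sketch of a reusable variant follows. The helper name `find_builders` is hypothetical (it is not part of TFDS), and the hard-coded list is a stand-in for `tfds.list_builders()` so the snippet runs without TFDS installed.

```python
def find_builders(names, keyword):
    """Return the dataset names containing `keyword`, ignoring case, sorted."""
    keyword = keyword.lower()
    return sorted(name for name in names if keyword in name.lower())


# Stand-in for tfds.list_builders(): three names copied from the output above,
# plus "mnist" so the filter has something to exclude.
names = [
    "cats_vs_dogs",
    "mnist",
    "visual_domain_decathlon",
    "huggingface:catalonia_independence",
]

matches = find_builders(names, "CAT")
print(matches)
```

With TFDS installed, the same helper could be called as `find_builders(tfds.list_builders(), "cat")`.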

# Load a Dataset
The easiest way of loading a dataset is *tfds.load*. It will:

1. Download the data and save it as *tfrecord* files.
2. Load the tfrecord and create the *tf.data.Dataset*.

## 4.1 tfds.load
tfds.load is a thin wrapper around tfds.core.DatasetBuilder.

In [19]:
ds, info = tfds.load(name='mnist', split='train', as_supervised=True, shuffle_files=True, with_info=True)

In [17]:
assert isinstance(ds, tf.data.Dataset)

In [20]:
print(ds)

<_OptionsDataset shapes: ((28, 28, 1), ()), types: (tf.uint8, tf.int64)>


In [22]:
print(info)

tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='/root/tensorflow_datasets/mnist/3.0.1',
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)


In [28]:
type(ds)

tensorflow.python.data.ops.dataset_ops._OptionsDataset

## 4.2 tfds.builder

In [29]:
builder = tfds.builder('mnist')

In [30]:
# 1. Create the tfrecord files (no-op if already exists)
builder.download_and_prepare()


In [38]:
# 2. Load the `tf.data.Dataset`
ds = builder.as_dataset(split='train', shuffle_files=True, as_supervised=True)
print(ds)

<_OptionsDataset shapes: ((28, 28, 1), ()), types: (tf.uint8, tf.int64)>


In [39]:
assert isinstance(ds, tf.data.Dataset)

In [40]:
type(ds)

tensorflow.python.data.ops.dataset_ops._OptionsDataset


## 4.3 tfds build CLI
If you want to generate a specific dataset, you can use the tfds command line. For example:



In [45]:
# Shell commands need a leading "!" inside a notebook cell
!tfds build mnist

# Iterate over a dataset

## 5.1 As dict
By default, the tf.data.Dataset object contains a dict of tf.Tensors:



In [5]:
ds = tfds.load('mnist', split='train')

In [8]:
for example in ds.take(2):
  print(list(example.keys()))
  image = example['image']
  label = example['label']
  print(image.shape, label)

['image', 'label']
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
['image', 'label']
(28, 28, 1) tf.Tensor(1, shape=(), dtype=int64)


## 5.2 As tuple
By using as_supervised=True, you get a tuple (features, label) instead of a dict for supervised datasets.

In [10]:
ds = tfds.load('mnist', split='train', as_supervised=True)


In [11]:
for image, label in ds.take(2):
  print(image.shape, label)

(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
(28, 28, 1) tf.Tensor(1, shape=(), dtype=int64)
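To make the difference between the dict and tuple forms concrete without downloading anything, the sketch below builds a tiny in-memory dataset and converts it from dict elements to (image, label) tuples. The arrays are synthetic zeros standing in for MNIST examples, and the `map` call only mimics the effect of as_supervised=True. It is not a description of how TFDS implements that option.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for two MNIST-like examples (zeros, not real data).
images = np.zeros((2, 28, 28, 1), dtype=np.uint8)
labels = np.array([4, 1], dtype=np.int64)

# Dict-structured dataset, like the default tfds.load output.
dict_ds = tf.data.Dataset.from_tensor_slices({"image": images, "label": labels})

# Tuple-structured dataset, like the output with as_supervised=True.
tuple_ds = dict_ds.map(lambda example: (example["image"], example["label"]))

# element_spec describes the structure, shape and dtype of each element.
dict_spec = dict_ds.element_spec
tuple_spec = tuple_ds.element_spec
print(dict_spec)
print(tuple_spec)
```

Inspecting `element_spec` is a quick way to check which of the two forms a dataset yields before writing the loop that unpacks it.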


## 5.3 As Numpy

Use tfds.as_numpy to convert:

1. tf.Tensor -> np.array
2. tf.data.Dataset -> Iterator[Tree[np.array]] (Tree can be an arbitrarily nested Dict or Tuple)




In [30]:
ds = tfds.load('mnist', as_supervised=True, split='train')
for image, label in tfds.as_numpy(ds.take(4)):
  print(type(image), type(label), label)

<class 'numpy.ndarray'> <class 'numpy.int64'> 4
<class 'numpy.ndarray'> <class 'numpy.int64'> 1
<class 'numpy.ndarray'> <class 'numpy.int64'> 0
<class 'numpy.ndarray'> <class 'numpy.int64'> 7

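The "Tree" in Iterator[Tree[np.array]] just means that each yielded element keeps the dataset's nesting, with arrays at the leaves. The sketch below illustrates this with a hypothetical helper, `tree_shapes` (not a TFDS function), applied to synthetic arrays shaped like one MNIST example.

```python
import numpy as np


def tree_shapes(tree):
    """Replace every array in a nested dict/tuple/list with its shape."""
    if isinstance(tree, dict):
        return {key: tree_shapes(value) for key, value in tree.items()}
    if isinstance(tree, (tuple, list)):
        return type(tree)(tree_shapes(value) for value in tree)
    return np.asarray(tree).shape


# Synthetic stand-ins for one MNIST example (zeros, not real data).
image = np.zeros((28, 28, 1), dtype=np.uint8)
label = np.int64(4)

dict_example = {"image": image, "label": label}  # default structure
tuple_example = (image, label)                   # structure with as_supervised=True

dict_shapes = tree_shapes(dict_example)
tuple_shapes = tree_shapes(tuple_example)
print(dict_shapes)
print(tuple_shapes)
```

The same helper should work on elements yielded by `tfds.as_numpy(ds)`, since those are plain dicts or tuples of NumPy values.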

## 5.4 As batched tf.Tensor (batch_size=-1)
By using batch_size=-1, you can load the full dataset in a single batch.

This can be combined with as_supervised=True and tfds.as_numpy to get the data as (np.array, np.array):



In [31]:
image, label = tfds.as_numpy(tfds.load(
    'mnist',
    split='test',
    batch_size=-1,
    as_supervised=True,
))

print(type(image), image.shape)

<class 'numpy.ndarray'> (10000, 28, 28, 1)

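Loading a whole split as one batch only works if the decoded arrays fit in memory, so it is worth estimating the size first. The sketch below does the arithmetic for the MNIST image arrays, using the example counts and the (28, 28, 1) uint8 shape reported by the dataset info earlier. The helper `dense_nbytes` is a hypothetical name introduced for illustration.

```python
import math


def dense_nbytes(shape, bytes_per_element=1):
    """Bytes needed for a dense array of `shape` (uint8 uses 1 byte per element)."""
    return math.prod(shape) * bytes_per_element


test_bytes = dense_nbytes((10000, 28, 28, 1))   # test split images as uint8
train_bytes = dense_nbytes((60000, 28, 28, 1))  # train split images as uint8
train_float32_bytes = dense_nbytes((60000, 28, 28, 1), bytes_per_element=4)

print(test_bytes, train_bytes, train_float32_bytes)
```

For an array that is already loaded, NumPy's `image.nbytes` attribute reports the same figure directly. MNIST is small enough for this to be comfortable, but the same calculation for a large image dataset can quickly exceed available RAM.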