# TensorFlow Datasets

* [TensorFlow Datasets Project Page](https://www.tensorflow.org/datasets)

* [TensorFlow Datasets Guide](https://www.tensorflow.org/datasets/overview)

> TFDS provides a collection of ready-to-use datasets for use with TensorFlow, Jax, and other Machine Learning frameworks. It handles downloading and preparing the data deterministically and constructing a tf.data.Dataset (or np.array).
> ## Installation
> ```
> pip install tensorflow-datasets
> ```
> ## Available datasets
> All dataset builders are subclasses of tfds.core.DatasetBuilder. To get the list of available builders, use tfds.list_builders() or look at the catalog.
> ```
> tfds.list_builders()
> ```
> ## Load a dataset
> The easiest way to load a dataset is tfds.load. It will:
> 1. Download the data and save it as tfrecord files.
> 2. Load the tfrecord and create the tf.data.Dataset.
> 
> ```
> ds, info = tfds.load(     # with_info=True makes tfds.load return (dataset, info)
>     'mnist',
>     split='train',        # e.g. 'train', ['train', 'test'], 'train[80%:]', ...
>     shuffle_files=True,
>     with_info=True,       # Also return the tfds.core.DatasetInfo containing dataset metadata
>     data_dir=path_to_dir, # Placeholder: location where the dataset is saved (default ~/tensorflow_datasets/)
> )
> ```
> ## Visualization
> * [tfds.as_dataframe](https://www.tensorflow.org/datasets/overview#tfdsas_dataframe)  
> Convert TFDS dataset to Pandas DataFrame.
> ```
> ds, info = tfds.load('mnist', split='train', with_info=True)
> tfds.as_dataframe(ds.take(4), info)
> ```
> * [tfds.show_examples](https://www.tensorflow.org/datasets/overview#tfdsshow_examples)  
> tfds.show_examples returns a matplotlib.figure.Figure (only image datasets supported now):
> ```
> ds, info = tfds.load('mnist', split='train', with_info=True)
> fig = tfds.show_examples(ds, info)
> ```

* [TensorFlow Datasets Github](https://github.com/tensorflow/datasets)

<img src="./image/tfds_datasets.png" align="left">

In [1]:
#!conda install -y tensorflow-datasets

# Data location

* [tfds.load](https://www.tensorflow.org/datasets/api_docs/python/tfds/load)

```
tfds.load(
    name: str,
    *,
    split: Optional[Tree[splits_lib.SplitArg]] = None,
    data_dir: Union[None, str, os.PathLike] = None,
    batch_size: Optional[int] = None,
    shuffle_files: bool = False,
    download: bool = True,
    as_supervised: bool = False,
    decoders: Optional[TreeDict[decode.partial_decode.DecoderArg]] = None,
    read_config: Optional[read_config_lib.ReadConfig] = None,
    with_info: bool = False,
    builder_kwargs: Optional[Dict[str, Any]] = None,
    download_and_prepare_kwargs: Optional[Dict[str, Any]] = None,
    as_dataset_kwargs: Optional[Dict[str, Any]] = None,
    try_gcs: bool = False
)
```


> * `data_dir`:  
> Directory to read/write data. Defaults to the value of the environment variable **TFDS_DATA_DIR** if set, otherwise to `~/tensorflow_datasets/`.
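
The lookup order described above can be sketched in plain Python. `resolve_data_dir` is a hypothetical helper written for illustration and is not part of TFDS; it only mirrors the documented precedence (explicit argument, then `TFDS_DATA_DIR`, then `~/tensorflow_datasets/`).

```python
import os


def resolve_data_dir(data_dir=None, environ=None):
    """Hypothetical helper: mirror the documented data_dir precedence."""
    environ = os.environ if environ is None else environ
    if data_dir is not None:
        # An explicit data_dir argument takes priority.
        return os.path.expanduser(os.fspath(data_dir))
    # Otherwise use TFDS_DATA_DIR if set, else the documented default.
    default = os.path.join("~", "tensorflow_datasets")
    return os.path.expanduser(environ.get("TFDS_DATA_DIR", default))


print(resolve_data_dir())
```

The `environ` parameter exists only so the precedence can be checked without modifying the real environment.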

In [14]:
import numpy as np
import tensorflow_datasets as tfds

# Loading Dataset

In [1]:
# Construct a tf.data.Dataset
mnist, info = tfds.load(
    'mnist',              # Name of the dataset
    split='train', 
    with_info=True,       # Information of the dataset
    shuffle_files=True, 
)
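
The `split` argument of `tfds.load` also accepts slice strings such as `'train[80%:]'` and a list of splits. A minimal sketch of carving a validation set out of the `train` split follows; `percent_splits` is a hypothetical helper, and the 80/20 ratio is an assumption chosen for illustration.

```python
def percent_splits(split="train", train_percent=80):
    """Hypothetical helper: return two TFDS slice strings covering `split`."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    return [f"{split}[:{train_percent}%]", f"{split}[{train_percent}%:]"]


splits = percent_splits("train", 80)
print(splits)  # ['train[:80%]', 'train[80%:]']

# Passing a list to `split` returns one dataset per entry
# (requires tensorflow-datasets and downloads MNIST on first use):
# import tensorflow_datasets as tfds
# train_ds, val_ds = tfds.load('mnist', split=splits, as_supervised=True)
```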

## Dataset Information

In [3]:
info

tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='/home/oonisim/tensorflow_datasets/mnist/3.0.1',
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)

In [4]:
# Number of training examples
info.splits['train'].num_examples

60000
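
Other fields of the `info` object can be read the same way. The sketch below gathers a few of them into a dictionary; `summarize_info` is a hypothetical helper, and it assumes an image classification dataset with `image` and `label` features, as in the MNIST output above. A stand-in object is used so the example runs without downloading anything.

```python
from types import SimpleNamespace


def summarize_info(info):
    """Collect a few commonly needed fields from a DatasetInfo-like object."""
    return {
        "name": info.name,
        "image_shape": tuple(info.features["image"].shape),
        "num_classes": info.features["label"].num_classes,
        "examples": {name: s.num_examples for name, s in info.splits.items()},
    }


# Stand-in with the same attribute layout as the MNIST output above;
# in practice, pass the real `info` returned by tfds.load(..., with_info=True).
fake_info = SimpleNamespace(
    name="mnist",
    features={
        "image": SimpleNamespace(shape=(28, 28, 1)),
        "label": SimpleNamespace(num_classes=10),
    },
    splits={
        "test": SimpleNamespace(num_examples=10000),
        "train": SimpleNamespace(num_examples=60000),
    },
)
print(summarize_info(fake_info))
```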

# All available datasets

In [5]:
for ds in tfds.list_builders():
    print(ds)

abstract_reasoning
accentdb
aeslc
aflw2k3d
ag_news_subset
ai2_arc
ai2_arc_with_ir
amazon_us_reviews
anli
arc
bair_robot_pushing_small
bccd
beans
big_patent
bigearthnet
billsum
binarized_mnist
binary_alpha_digits
blimp
bool_q
c4
caltech101
caltech_birds2010
caltech_birds2011
cars196
cassava
cats_vs_dogs
celeb_a
celeb_a_hq
cfq
cherry_blossoms
chexpert
cifar10
cifar100
cifar10_1
cifar10_corrupted
citrus_leaves
cityscapes
civil_comments
clevr
clic
clinc_oos
cmaterdb
cnn_dailymail
coco
coco_captions
coil100
colorectal_histology
colorectal_histology_large
common_voice
coqa
cos_e
cosmos_qa
covid19
covid19sum
crema_d
curated_breast_imaging_ddsm
cycle_gan
d4rl_adroit_door
d4rl_adroit_hammer
d4rl_adroit_pen
d4rl_adroit_relocate
d4rl_mujoco_ant
d4rl_mujoco_halfcheetah
d4rl_mujoco_hopper
d4rl_mujoco_walker2d
dart
davis
deep_weeds
definite_pronoun_resolution
dementiabank
diabetic_retinopathy_detection
div2k
dmlab
doc_nli
dolphin_number_word
downsampled_imagenet
drop
dsprites
dtd
duke_ultrasound
e2e
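
Because the list is long, filtering it by keyword is often more practical than printing everything. A small sketch, with `find_builders` as a hypothetical helper and a handful of names copied from the output above as sample input:

```python
def find_builders(names, keyword):
    """Return the builder names that contain `keyword`, sorted."""
    return sorted(name for name in names if keyword in name)


# A few names copied from the output above; in practice pass tfds.list_builders().
sample = [
    "binarized_mnist",
    "cifar10",
    "cifar100",
    "cifar10_1",
    "cifar10_corrupted",
    "coco",
]
print(find_builders(sample, "cifar"))
```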

---
# (input, label) format for supervised learning data

* [tfds.load](https://www.tensorflow.org/datasets/api_docs/python/tfds/load)

> * `as_supervised`: bool  
> If **True**, the returned `tf.data.Dataset` has a 2-tuple structure `(input, label)` according to `builder.info.supervised_keys`. If **False** (the default), the returned `tf.data.Dataset` yields a dictionary with all the features.
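
The difference between the two structures can be illustrated on a plain dictionary. This is only a sketch of the resulting shape of the data, not how TFDS implements it; `to_supervised` is a hypothetical helper, and it assumes `supervised_keys` is a 2-tuple such as `('image', 'label')`.

```python
def to_supervised(example, supervised_keys=("image", "label")):
    """Pick the (input, label) pair out of a feature dictionary."""
    input_key, label_key = supervised_keys
    return example[input_key], example[label_key]


# Stand-in for one element of the default (dictionary) dataset.
example = {"image": [[0, 255], [255, 0]], "label": 4}
x, y = to_supervised(example)
print(y)  # 4
```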

In [3]:
# Construct a tf.data.Dataset
mnist, info = tfds.load(
    'mnist',
    split='train', 
    with_info=True, 
    shuffle_files=True,
    as_supervised=True
)

In [7]:
for x, label in mnist.take(2).as_numpy_iterator():
    print(label, x.shape)

4 (28, 28, 1)
1 (28, 28, 1)


2023-03-04 11:49:43.085837: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


# Convert dataset into numpy array (when size is small)

In [15]:
def dataset_to_numpy(ds):
    """
    Convert a TensorFlow dataset of (image, label) pairs to NumPy arrays.
    """
    images = []
    labels = []

    # tfds.as_numpy yields each (image, label) pair as NumPy values
    for image, label in tfds.as_numpy(ds):
        images.append(image)
        labels.append(label)

    # Print the first three examples as a sanity check
    for img, label in zip(images[:3], labels[:3]):
        print(img.shape, label)

    return np.array(images), np.array(labels)

In [16]:
images, labels = dataset_to_numpy(mnist.take(16))

(28, 28, 1) 4
(28, 28, 1) 1
(28, 28, 1) 0


2023-03-04 11:54:04.426037: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [18]:
images.shape

(16, 28, 28, 1)

In [19]:
labels.shape

(16,)
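
For a dataset that fits in memory, the TFDS guide also describes loading a whole split as a single batch with `batch_size=-1` and converting it with `tfds.as_numpy`, which avoids the Python loop. A minimal sketch follows; `load_all_as_numpy` is a hypothetical wrapper name, and calling it requires `tensorflow-datasets` to be installed.

```python
def load_all_as_numpy(name, split):
    """Load a whole split as (images, labels) NumPy arrays in one batch."""
    # Imported here so that defining the helper does not require TensorFlow.
    import tensorflow_datasets as tfds

    batched = tfds.load(
        name,
        split=split,
        batch_size=-1,       # -1 returns the full split as a single batch
        as_supervised=True,  # (image, label) tuple instead of a feature dict
    )
    return tfds.as_numpy(batched)


# Usage (downloads MNIST on first call):
# images, labels = load_all_as_numpy("mnist", "test")
# print(images.shape, labels.shape)
```

This holds the entire split in memory at once, so it suits small datasets such as MNIST rather than large ones.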