<a href="https://colab.research.google.com/github/martin-fabbri/colab-notebooks/blob/master/deeplearning.ai/tf/b4_public_datasets_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using public datasets with TF Datasets

In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds

from tensorflow.keras import layers

tfds.__version__

'4.0.1'

In [2]:
mnist_data = tfds.load('fashion_mnist')
type(mnist_data), mnist_data

Downloading and preparing dataset fashion_mnist/3.0.1 (download: 29.45 MiB, generated: 36.42 MiB, total: 65.87 MiB) to /root/tensorflow_datasets/fashion_mnist/3.0.1...


Shuffling and writing examples to /root/tensorflow_datasets/fashion_mnist/3.0.1.incompleteNVCS1X/fashion_mnist-train.tfrecord


Shuffling and writing examples to /root/tensorflow_datasets/fashion_mnist/3.0.1.incompleteNVCS1X/fashion_mnist-test.tfrecord


Dataset fashion_mnist downloaded and prepared to /root/tensorflow_datasets/fashion_mnist/3.0.1. Subsequent calls will reuse this data.


(dict,
 {'test': <PrefetchDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>,
  'train': <PrefetchDataset shapes: {image: (28, 28, 1), label: ()}, types: {image: tf.uint8, label: tf.int64}>})

In [3]:
for item in mnist_data:
  print(type(item), item)

<class 'str'> test
<class 'str'> train


Iterating over the returned dictionary yields only the split names. To get a dataset containing the actual data, specify the split you want in the `tfds.load` call:

In [4]:
mnist_train = tfds.load(name='fashion_mnist', split='train')
assert isinstance(mnist_train, tf.data.Dataset)
type(mnist_train)

tensorflow.python.data.ops.dataset_ops.PrefetchDataset

In this instance, we get a `PrefetchDataset` object, which we can iterate through to inspect the data. One nice feature is that `take(1)` returns a dataset containing only the first record.

In [5]:
item = next(iter(mnist_train.take(1)))
print(type(item))
print(item.keys())

<class 'dict'>
dict_keys(['image', 'label'])


In [6]:
image = item['image']
print(type(image))
print(image.shape)
print(image[0:0])

<class 'tensorflow.python.framework.ops.EagerTensor'>
(28, 28, 1)
tf.Tensor([], shape=(0, 28, 1), dtype=uint8)


In [7]:
label = item['label']
print(type(label))
print(label)

<class 'tensorflow.python.framework.ops.EagerTensor'>
tf.Tensor(2, shape=(), dtype=int64)
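
The same inspection pattern can be tried without downloading anything. The following is a minimal sketch that assumes a synthetic stand-in dataset (the names `toy_ds`, `images`, and `labels` are made up for illustration and are not part of TFDS); it only mirrors the `{'image', 'label'}` dictionary structure shown above.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in data: four blank 28x28x1 images with labels 0..3.
images = np.zeros((4, 28, 28, 1), dtype=np.uint8)
labels = np.arange(4, dtype=np.int64)

# Build a dataset whose elements are dicts, like the TFDS records above.
toy_ds = tf.data.Dataset.from_tensor_slices({'image': images, 'label': labels})

# take(1) returns a new dataset holding only the first element;
# next(iter(...)) pulls that element out as a dict of eager tensors.
first = next(iter(toy_ds.take(1)))
print(sorted(first.keys()))
print(first['image'].shape, first['label'].numpy())
```

Because `take(1)` returns another dataset rather than a record, it still has to be iterated to reach the element itself.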


In [8]:
mnist_test, info = tfds.load(name='fashion_mnist', with_info=True)
info

tfds.core.DatasetInfo(
    name='fashion_mnist',
    version=3.0.1,
    description='Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.',
    homepage='https://github.com/zalandoresearch/fashion-mnist',
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=70000,
    splits={
        'test': 10000,
        'train': 60000,
    },
    supervised_keys=('image', 'label'),
    citation="""@article{DBLP:journals/corr/abs-1708-07747,
      author    = {Han Xiao and
                   Kashif Rasul and
                   Roland Vollgraf},
      title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning
                   Algorithms},
      journal   = {CoRR},
      volume

## Using TFDS with a Keras Model

In [9]:
mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print(type(train_images))

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
<class 'numpy.ndarray'>


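When using TFDS the code is very similar, with some minor changes. The Keras datasets gave us `ndarray` objects that work natively in `model.fit`. With TFDS we need to do a little conversion work: `batch_size=-1` loads each split as a single batch, `as_supervised=True` returns `(image, label)` tuples instead of dictionaries, and `tfds.as_numpy` converts the resulting tensors to NumPy arrays.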

In [10]:
(train_images, train_labels), (test_images, test_labels) = \
  tfds.as_numpy(
      tfds.load('fashion_mnist',
                split=['train', 'test'],
                batch_size=-1,
                as_supervised=True))
print(type(train_images))

<class 'numpy.ndarray'>


In [11]:
# The images need to be rescaled to [0, 1] before they reach the dense layers.
# Instead of rescaling the arrays up front:
#   train_images = train_images * 1.0/255.0
#   test_images = test_images * 1.0/255.0
# the rescaling is done inside the model (see the Rescaling layer below).

model = tf.keras.models.Sequential([
  # input_shape belongs on the first layer of the Sequential model
  layers.experimental.preprocessing.Rescaling(1.0/255.0, input_shape=(28, 28, 1)),
  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dropout(0.2),
  layers.Dense(10, activation='softmax')
])

model.compile(
    loss='sparse_categorical_crossentropy', 
    optimizer='adam', 
    metrics=['accuracy']
)

model.fit(
    train_images,
    train_labels,
    epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7efd600857b8>
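
As a rough illustration of what the rescaling step does, here is a small NumPy sketch of the arithmetic. It assumes the `Rescaling(1.0/255.0)` layer multiplies each input value by the given scale; the array `pixels` is a made-up example, not data from the dataset.

```python
import numpy as np

# Hypothetical pixel values covering the uint8 range.
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Cast to float first, then multiply by the same scale factor
# that is passed to the Rescaling layer in the model.
scaled = pixels.astype(np.float32) * (1.0 / 255.0)

print(scaled.dtype, scaled.min(), scaled.max())
```

Keeping this step inside the model means the same scaling is applied at training and inference time without a separate preprocessing pass.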

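The next example passes a `tf.data.Dataset` directly to `model.fit`. In that case the dataset should be batched beforehand, and the training split is shuffled so that each epoch sees the examples in a different order.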
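
To see what shuffling and batching do to a dataset, the sketch below uses a toy dataset of the integers 0 to 9. The buffer size, seed, and batch size are arbitrary values chosen for illustration, not the ones used for training.

```python
import tensorflow as tf

# Toy dataset of ten scalar elements: 0, 1, ..., 9.
ds = tf.data.Dataset.range(10)

# shuffle() draws elements at random from a buffer of the given size;
# batch() then groups consecutive elements into tensors of up to 4 items.
shuffled_batches = ds.shuffle(buffer_size=10, seed=0).batch(4)

# Materialize the batches once so they can be inspected.
batches = [b.numpy() for b in shuffled_batches]
sizes = [len(b) for b in batches]
seen = sorted(int(x) for b in batches for x in b)

print(sizes)
print(seen)
```

The last batch is smaller because 10 is not a multiple of 4. Every element still appears exactly once per pass over the dataset.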

## Horses-or-Humans Model

In [14]:
data = tfds.load('horses_or_humans', split='train', as_supervised=True)
val_data = tfds.load('horses_or_humans', split='test', as_supervised=True)
 
train_batches = data.shuffle(100).batch(10)
validation_batches = val_data.batch(32)

model = tf.keras.models.Sequential([
    layers.experimental.preprocessing.Rescaling(1.0/255.0,
                                                input_shape=(300, 300, 3)),
    layers.Conv2D(16, (3,3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(32, (3,3), activation='relu'),
    layers.MaxPooling2D(2,2),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D(2,2),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D(2,2),
    layers.Conv2D(64, (3,3), activation='relu'),
    layers.MaxPooling2D(2,2),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

history = model.fit(
    train_batches, 
    epochs=10,
    validation_data=validation_batches
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
