In [1]:
import os
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras import datasets, models, layers, optimizers, losses, callbacks, metrics
from tensorflow.train import BytesList, FloatList, Int64List, Feature, Features, Example

from sklearn.model_selection import train_test_split

# Chapter 13: Loading and Preprocessing Data with Tensorflow

This notebook contains the solution for the exercises 9 and 10 of the chapter 13: *Loading and Preprocessing Data with TensorFlow* of the book *Hands On Machine Learning with Scikit-Learn, Keras & TensorFlow* of Aurélien Géron.

## Exercise 9

**Load the Fashion MNIST dataset (introduced in Chapter 10); split it into a training set, a validation set, and a test set; shuffle the training set; and save each dataset to multiple TFRecord files. Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image), and the label. Then use tf.data to create an efficient dataset for each set. Finally, use a Keras model to train these datasets, including a preprocessing layer to standardize each input feature. Try to make the input pipeline as efficient as possible, using TensorBoard to visualize profiling data.**

Let's first download the Fashion MNIST dataset. This is a collection of 60,000 images of fashion items, with size 28 x 28. The classes are the following:

|Label|Description|
|:---:|:---:|
|0|T-shirt/top|
|1|Trouser|
|2|Pullover|
|3|Dress|
|4|Coat|
|5|Sandal|
|6|Shirt|
|7|Sneaker|
|8|Bag|
|9|Ankle Boot|

If you want to have more details on the dataset, please visit the [documentation of Keras](https://keras.io/api/datasets/fashion_mnist/#fashion-mnist-dataset-an-alternative-to-mnist).

In [2]:
fashion_mnist = datasets.fashion_mnist
(x_train_full, y_train_full), (x_test, y_test) = fashion_mnist.load_data()

x_train, x_valid = train_test_split(x_train_full, test_size=0.2)
y_train, y_valid = train_test_split(y_train_full, test_size=0.2)

After downloading the data, we will create a dataset using the ```tf.data.Dataset.from_tensor_slices()``` method. We will only shuffle the training data since this is the data the model will use to be trained, therefore is the data we need to be independent and identically distributed.

In [3]:
train_set = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(len(x_train))
valid_set = tf.data.Dataset.from_tensor_slices((x_valid, y_valid))
test_set = tf.data.Dataset.from_tensor_slices((x_test, y_test))

Metal device set to: Apple M1 Pro

systemMemory: 32.00 GB
maxCacheSize: 10.67 GB



2022-01-30 12:31:39.522629: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-01-30 12:31:39.522743: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


The next step is to save the datasets in different TFRecord files. We will save each record using 2 features: The serialized image and the label. To save them, we will use the ```Example``` protobuf implemented in TensorFlow. A protobuf (serialized protocol buffer) is a binary file developed by Google that is efficient, portable and extensible. TFRecord allows to save data in an efficient way, especially for binaries records with large amount of data. 

You can read more about the ```Example``` protobuf implemented by TensorFlow [here](https://www.tensorflow.org/api_docs/python/tf/train/Example).

In [4]:
def save_tfrecord(image, label, filepath):
    
    image = tf.io.serialize_tensor(image).numpy()
    label = label.numpy()
    
    image_example = Example(
        features = Features(
            feature = {
                'image' : Feature(bytes_list = BytesList(value = [image])),
                'label' : Feature(int64_list = Int64List(value = [label]))
    }))
    
    with tf.io.TFRecordWriter(filepath) as f:
        f.write(image_example.SerializeToString())


def save_dataset(dataset, data_type='train', n_files=15):
    
    folder_path = f'datasets/mnist_fashion/{data_type}_data'
    os.makedirs(folder_path, exist_ok=True)
    filepaths = [f'{folder_path}/{data_type}_{number_file}.tfrecord' for number_file in range(n_files)]
    
    for index, (image, label) in dataset.enumerate():
        file_number = index % n_files
        
        filepath = f'{folder_path}/{data_type}_{file_number}.tfrecord'
        
        save_tfrecord(image, label, filepath)
    
    return filepaths

The first function will take an image, a label and the filepath for the current record. It will create a serialize tensor of the image. the ```numpy()``` method transforms the tensor to a single value that can be passed to the ```Example``` protobuf.

The second function takes the dataset and enumerates the records in the dataset using the ```enumerate()``` method. This will return an index to determine the number of the file this record will be saved to. Using this index and the data type, the function creates the folder, the filepath list, and saves each record into a different TFRecord file. Finally returns the pattern of the filepath for the dataset. 

In [5]:
train_filepath = save_dataset(train_set, 'train')
valid_filepath = save_dataset(valid_set, 'valid')
test_filepath = save_dataset(test_set, 'test')

The next step is to create an efficient dataset by loading the images and label from the TFRecord files. To do so, we will use the filepaths created previously.

In [6]:
def preprocess(serialized_image):
    
    feature_descriptions = {
        'image' : tf.io.FixedLenFeature([], tf.string, default_value=b''),
        'label' : tf.io.FixedLenFeature([], tf.int64, default_value=0)
    }
    
    example = tf.io.parse_single_example(serialized_image,
                                       feature_descriptions)
    
    image = tf.io.parse_tensor(example['image'], out_type=tf.uint8)
    image = tf.reshape(image, shape=(28, 28))
    
    return image, example['label']

def tfrecord_reader(filepaths, shuffle_buffer_size=None, batch_size=32):
    
    AUTOTUNE = tf.data.AUTOTUNE
    
    dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=AUTOTUNE)
    dataset = dataset.cache()
    dataset = dataset.map(preprocess, num_parallel_calls=AUTOTUNE)
    
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    
    return dataset.batch(batch_size).prefetch(AUTOTUNE)

The ```tfrecord_reader``` takes the filepaths of each set, and loads them using the ```tf.data.TFRecordDataset()``` method. We allow multithreading by setting the ```num_parallel_reads``` parameter to ```tf.data.AUTOTUNE```. This will tune the parameter at runtime to be as efficient as possible.

Then, we cache the content of the dataset to memory. The ```preprocessing()``` function is used with the ```map()``` method to load the records. The records are loaded in the ```preprocessing()``` function. It is important to use the ```tf.io.parse_tensor()``` to parse the image tensor to *uint8*, and then reshape the image. Again we allow the multithreading for the ```map()``` by setting the ```num_parallel_calls``` parameter to ```tf.data.AUTOTUNE```.

Finally, we shuffle the dataset, batch it and prefetch to improve the performance. Once again, we use the ```tf.data.AUTOTUNE``` value for ```prefetch()``` to allow the value being automatically tuned by TensorFlow at runtime. 

In [7]:
train_set = tfrecord_reader(train_filepath)
valid_set = tfrecord_reader(valid_filepath)
test_set = tfrecord_reader(test_filepath)

In order to use the dataset, let's create a simple model. Let's use the ```keras.layers.Normalization()``` layer in order to standardize the input. In order to calculate the mean and standard deviation of this layer, we will need to call its ```adapt()``` method. To do so, we will need a sample, that will be parsed into a numpy array and passed to the normalization layer.

Then we create a simple model, with a callback to save the logs for the Tensorboard visualization:

In [15]:
tf.keras.backend.clear_session()
tf.random.set_seed(15)
np.random.seed(15)

# Create the model using the custom layer created previously.

sample_data = train_set.take(500).map(lambda image, label: image)
sample_data = np.concatenate(list(sample_data.as_numpy_iterator()), axis=0).astype(np.float32)
normalizer = layers.Normalization(input_shape=[28, 28])
normalizer.adapt(sample_data)

model = models.Sequential([
    normalizer,
    layers.Flatten(),
    layers.Dense(250, activation='elu', kernel_initializer='he_normal'),
    layers.Dropout(0.5),
    layers.Dense(100, activation='elu', kernel_initializer='he_normal'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

# Compile the model 

optimizer = optimizers.Nadam()

model.compile(loss = losses.sparse_categorical_crossentropy,
              optimizer = optimizer,
              metrics = ['accuracy'])

# Create the callback for tensorboard

logs_path = os.path.join(os.curdir, 'logs', 'run_' + datetime.now().strftime('%Y%m%d-%H%M%S'))

tensorboard_cb = callbacks.TensorBoard(
    log_dir = logs_path,
    histogram_freq=1,
    profile_batch=10
)

list_cb = [tensorboard_cb]

# Train the model
history_model = model.fit(train_set,
                          validation_data=valid_set,
                          epochs=5,
                          callbacks=list_cb,
                          verbose=2)

2022-01-30 12:35:01.749590: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-01-30 12:35:01.767240: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
2022-01-30 12:35:01.826163: I tensorflow/core/profiler/lib/profiler_session.cc:110] Profiler session initializing.
2022-01-30 12:35:01.826181: I tensorflow/core/profiler/lib/profiler_session.cc:125] Profiler session started.
2022-01-30 12:35:01.826195: I tensorflow/core/profiler/lib/profiler_session.cc:143] Profiler session tear down.


Epoch 1/5


2022-01-30 12:35:02.093649: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.


1/1 - 1s - loss: 6.4226 - accuracy: 0.0667 - val_loss: 3.5848 - val_accuracy: 0.0667 - 913ms/epoch - 913ms/step


2022-01-30 12:35:02.646071: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.


Epoch 2/5
1/1 - 0s - loss: 1.2993 - accuracy: 0.6000 - val_loss: 3.5553 - val_accuracy: 0.0667 - 140ms/epoch - 140ms/step
Epoch 3/5
1/1 - 0s - loss: 0.1674 - accuracy: 1.0000 - val_loss: 3.5862 - val_accuracy: 0.0667 - 85ms/epoch - 85ms/step
Epoch 4/5
1/1 - 0s - loss: 0.0463 - accuracy: 1.0000 - val_loss: 3.6069 - val_accuracy: 0.0667 - 78ms/epoch - 78ms/step
Epoch 5/5
1/1 - 0s - loss: 0.0244 - accuracy: 1.0000 - val_loss: 3.6265 - val_accuracy: 0.0667 - 80ms/epoch - 80ms/step


This model is clearly overfitting, and more regularization is required. 

Then, we call the tensorboard to visualiza the performance of the training:

In [9]:
%load_ext tensorboard
%tensorboard --logdir=./logs --port=6006

ERROR: Failed to launch TensorBoard (exited with 255).
Contents of stderr:
E0130 12:32:04.440227 4305732992 program.py:298] TensorBoard could not bind to port 6006, it was already in use
ERROR: TensorBoard could not bind to port 6006, it was already in use

## Exercise 10

**In this exercise you will download a dataset, split it, create a tf.data.Dataset to load it and preprocess it efficiently, then build and train a binary classification model containing an Embedding layer**

**a) Download the Large Movie Review Dataset, which contains 50,000 movies reviews from the Internet Movie Database. The data is organized in two directories, train and test, each containing a pos subdirectory with 12,500 positive reviews and a neg subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words), but we will ignore them in this exercise.**

In [10]:
def preprocess():
    
    return None

def read_text_files(filepath):
    
    AUTOTUNE = tf.data.AUTOTUNE
    dataset = tf.keras.utils.text_dataset_from_directory(filepath)
    for neg, pos in dataset:
        print(pos)

In [11]:
classes = ['neg', 'pos']
train_filepath = 'datasets/aclImdb/train'
test_filepath = 'datasets/aclImdb/test'

In [12]:
read_text_files(train_filepath)

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/aclImdb/train'