<a href="https://colab.research.google.com/github/manabil/Tensorflow-Advanced-Techniques-Specialization/blob/main/Custom%20and%20Distributed%20Training%20with%20Tensorflow/Week%204/C2W4_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 Assignment: Custom training with tf.distribute.Strategy

Welcome to the final assignment of this course! This week, you will implement a distribution strategy to train on the [Oxford Flowers 102](https://www.tensorflow.org/datasets/catalog/oxford_flowers102) dataset. As the name suggests, distribution strategies allow you to set up training across multiple devices. This lab uses a single device, but the syntax you'll apply should also work on a multi-device setup. Let's begin!

## Imports

In [2]:
# Uncomment the following lines if you're running this notebook on Colab. This
# is for compatibility with the autograder. No need to run these on Coursera.

# !pip install -q tensorflow==2.8.0
# !pip install -q keras==2.8.0

In [3]:
from __future__ import absolute_import, division, print_function
from __future__ import unicode_literals

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub

# Helper libraries
import os
import zipfile
from tqdm import tqdm
from typing import Callable, Iterable

## Download the dataset

In [4]:
tfds.disable_progress_bar()

In [5]:
splits: list[str] = ['train[:80%]', 'train[80%:90%]', 'train[90%:]']

train_examples: tf.data.Dataset
validation_examples: tf.data.Dataset
test_examples: tf.data.Dataset
info: object
(train_examples, validation_examples, test_examples), info = tfds.load(
    'oxford_flowers102',
    with_info=True,
    as_supervised=True,
    split = splits,
    data_dir='data/'
)

num_examples: int = info.splits['train'].num_examples
num_classes: int = info.features['label'].num_classes

Downloading and preparing dataset 328.90 MiB (download: 328.90 MiB, generated: 331.34 MiB, total: 660.25 MiB) to data/oxford_flowers102/2.1.1...
Dataset oxford_flowers102 downloaded and prepared to data/oxford_flowers102/2.1.1. Subsequent calls will reuse this data.


## Create a strategy to distribute the variables and the graph

How does `tf.distribute.MirroredStrategy` work?

*   All the variables and the model graph are replicated on the replicas.
*   Input is evenly distributed across the replicas.
*   Each replica calculates the loss and gradients for the input it received.
*   The gradients are synced across all the replicas by summing them.
*   After the sync, the same update is made to the copies of the variables on each replica.
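
The steps above can be sketched in plain Python, without TensorFlow. This is a toy illustration under stated assumptions, not how `MirroredStrategy` is implemented: the model is a single weight `w` fit with squared error, and the names `replica_gradient` and `mirrored_step` are hypothetical.

```python
# Toy simulation of synchronous data-parallel training (no TensorFlow).
# Model: y_hat = w * x, per-example loss = (w * x - y) ** 2.

def replica_gradient(w, shard, global_batch_size):
    """Gradient of the shard's summed loss, scaled by the GLOBAL batch size."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / global_batch_size


def mirrored_step(weights, batch, learning_rate=0.01):
    """Run one step on every replica and return the updated weight copies."""
    num_replicas = len(weights)
    shard_size = len(batch) // num_replicas
    # 1. Input is split evenly across the replicas.
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_replicas)]
    # 2. Each replica computes gradients on its own shard.
    grads = [replica_gradient(w, shard, len(batch))
             for w, shard in zip(weights, shards)]
    # 3. Gradients are synced by summing them.
    total_grad = sum(grads)
    # 4. Every replica applies the same update to its copy of the weight.
    return [w - learning_rate * total_grad for w in weights]


batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2 * x
two_replicas = mirrored_step([0.0, 0.0], batch)
one_replica = mirrored_step([0.0], batch)

print(two_replicas)  # both copies hold the same value
print(one_replica)   # same update as on a single device
```

Because each replica scales its gradient by the global batch size, the summed gradient is the same whether the batch is processed on one device or split across two.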

In [6]:
# If the list of devices is not specified in the
# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
strategy: tf.distribute.Strategy = tf.distribute.MirroredStrategy()

In [7]:
print(f'Number of devices: {strategy.num_replicas_in_sync}')

Number of devices: 1


## Setup input pipeline

Set some constants, including the buffer size, number of epochs, and the image size.

In [8]:
BUFFER_SIZE: int = num_examples
EPOCHS: int = 10
pixels: int = 224

# Path to the model features. Only use this when running the notebook on
# Coursera
MODULE_HANDLE: str = 'data/resnet_50_feature_vector'

# Note: Uncomment the line below if you are running the notebook on Colab
# MODULE_HANDLE='https://tfhub.dev/tensorflow/resnet_50/feature_vector/1'

IMAGE_SIZE: tuple[int, int] = (pixels, pixels)
print(f"Using {MODULE_HANDLE} with input size {IMAGE_SIZE}")

Using data/resnet_50_feature_vector with input size (224, 224)


Define a function to format the image: it resizes the image and scales the pixel values to the range [0, 1].

In [9]:
def format_image(image: tf.Tensor, label: tf.Tensor) -> tuple[tf.Tensor, ...]:
    image = tf.image.resize(image, IMAGE_SIZE) / 255.0
    return image, label

## Set the global batch size (please complete this section)

Given the batch size per replica and the strategy, set the global batch size.
- The global batch size is the batch size per replica times the number of replicas in the strategy.

Hint: You'll want to use the `num_replicas_in_sync` stored in the [strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy).

In [10]:
# GRADED FUNCTION
def set_global_batch_size(
    batch_size_per_replica: int, strategy: tf.distribute.Strategy
) -> int:
    '''
    Args:
        batch_size_per_replica (int) - batch size per replica
        strategy (tf.distribute.Strategy) - distribution strategy
    '''

    # set the global batch size
    ### START CODE HERE ###
    num_replica: int = strategy.num_replicas_in_sync
    global_batch_size: int = batch_size_per_replica * num_replica
    ### END CODE HERE ###

    return global_batch_size

Set `GLOBAL_BATCH_SIZE` with the function that you just defined.

In [11]:
BATCH_SIZE_PER_REPLICA: int = 64
GLOBAL_BATCH_SIZE: int = set_global_batch_size(BATCH_SIZE_PER_REPLICA, strategy)

print(GLOBAL_BATCH_SIZE)

64


**Expected Output:**
```
64
```
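
With one device the global batch size equals the per-replica size, which hides the multiplication. The arithmetic for larger setups is sketched below; the replica counts are hypothetical and `global_batch_size_for` is an illustrative helper, not part of the assignment.

```python
def global_batch_size_for(batch_size_per_replica: int, num_replicas: int) -> int:
    """Global batch size = per-replica batch size times number of replicas."""
    return batch_size_per_replica * num_replicas


# Hypothetical replica counts; this notebook itself runs with one replica.
for num_replicas in (1, 2, 4, 8):
    print(num_replicas, global_batch_size_for(64, num_replicas))
```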

Create the training, validation, and test datasets. The training and validation data are batched with the global batch size, and the batches are then distributed in the next section.

In [12]:
train_batches: tf.data.Dataset = train_examples.shuffle(num_examples // 4)
train_batches = train_batches.map(format_image).batch(GLOBAL_BATCH_SIZE)
train_batches = train_batches.prefetch(1)

validation_batches: tf.data.Dataset = validation_examples.map(format_image)
validation_batches = validation_batches.batch(GLOBAL_BATCH_SIZE)
validation_batches = validation_batches.prefetch(1)

test_batches: tf.data.Dataset = test_examples.map(format_image).batch(1)

## Define the distributed datasets (please complete this section)

Create the distributed datasets using `experimental_distribute_dataset()` of the [Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy) class and pass in the training batches.
- Do the same for the validation batches and test batches.

In [13]:
# GRADED FUNCTION
def distribute_datasets(
    strategy: tf.distribute.Strategy,
    train_batches: tf.data.Dataset,
    validation_batches: tf.data.Dataset,
    test_batches: tf.data.Dataset
) -> tuple[tf.distribute.DistributedDataset, ...]:

    ### START CODE HERE ###
    DistDataset = tf.distribute.DistributedDataset
    train_dist_dataset: DistDataset = strategy.experimental_distribute_dataset(
        train_batches
    )
    val_dist_dataset: DistDataset = strategy.experimental_distribute_dataset(
        validation_batches
    )
    test_dist_dataset: DistDataset = strategy.experimental_distribute_dataset(
        test_batches
    )
    ### END CODE HERE ###

    return train_dist_dataset, val_dist_dataset, test_dist_dataset

Call the function that you just defined to get the distributed datasets.

In [14]:
train_dist_dataset: tf.distribute.DistributedDataset
val_dist_dataset: tf.distribute.DistributedDataset
test_dist_dataset: tf.distribute.DistributedDataset
train_dist_dataset, val_dist_dataset, test_dist_dataset = distribute_datasets(
    strategy, train_batches, validation_batches, test_batches
)

Take a look at the types of the distributed datasets:

In [15]:
print(type(train_dist_dataset))
print(type(val_dist_dataset))
print(type(test_dist_dataset))

<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>


**Expected Output:**
```
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
<class 'tensorflow.python.distribute.input_lib.DistributedDataset'>
```

Also get familiar with a single batch from `train_dist_dataset`:
- Each batch has 64 images and 64 labels.

In [16]:
# Take a look at a single batch from the train_dist_dataset
x: tuple[tf.Tensor, ...] = iter(train_dist_dataset).get_next()

print(f"x is a tuple that contains {len(x)} values ")
print(f"x[0] contains the features, and has shape {x[0].shape}")
print(f"  so it has {x[0].shape[0]} examples in the batch, each is an", end="")
print(f" image that is {x[0].shape[1:]}")
print(f"x[1] contains the labels, and has shape {x[1].shape}")

x is a tuple that contains 2 values 
x[0] contains the features, and has shape (64, 224, 224, 3)
  so it has 64 examples in the batch, each is an image that is (224, 224, 3)
x[1] contains the labels, and has shape (64,)


## Create the model

Use the Model Subclassing API to create model `ResNetModel` as a subclass of `tf.keras.Model`.

In [17]:
class ResNetModel(tf.keras.Model):
    def __init__(self, classes) -> None:
        super(ResNetModel, self).__init__()
        self._feature_extractor: hub.KerasLayer = hub.KerasLayer(
            MODULE_HANDLE, trainable=False
        )
        self._classifier: tf.keras.layers.Layer = tf.keras.layers.Dense(
            classes, activation='softmax'
        )

    def call(self, inputs: tf.Tensor) -> tf.Tensor:
        x: tf.Tensor = self._feature_extractor(inputs)
        x = self._classifier(x)
        return x

Create a checkpoint directory to store the checkpoints (the model's weights during training).

In [18]:
# Create a checkpoint directory to store the checkpoints.
checkpoint_dir: str = './training_checkpoints'
checkpoint_prefix: str = os.path.join(checkpoint_dir, "ckpt")

## Define the loss function

You'll define the `loss_object` and `compute_loss` within the `strategy.scope()`.
- `loss_object` will be used later to calculate the loss on the test set.
- `compute_loss` will be used later to calculate the average loss on the training data.

You will be using these two loss calculations later.

In [19]:
with strategy.scope():
    # Set reduction to `NONE` so we can do the reduction afterwards and divide
    # by the global batch size.
    Losses = tf.keras.losses.Loss
    loss_object: Losses = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE
    )
    # or loss_fn = tf.keras.losses.sparse_categorical_crossentropy
    def compute_loss(labels: tf.Tensor, predictions: tf.Tensor) -> tf.Tensor:
        per_example_loss: tf.Tensor = loss_object(labels, predictions)
        return tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE
        )

    Metrics = tf.keras.metrics.Metric
    test_loss: Metrics = tf.keras.metrics.Mean(name='test_loss')
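
The reason for `Reduction.NONE` followed by division by the global batch size can be checked with plain arithmetic. In the sketch below, the per-example loss values and the two-replica split are made up for illustration: each replica divides the sum of its losses by the global batch size, and adding the per-replica results gives the mean over the whole global batch.

```python
# Hypothetical per-example losses for a global batch of 4, split over 2 replicas.
replica_losses = [[1.0, 3.0], [2.0, 6.0]]
global_batch_size = sum(len(losses) for losses in replica_losses)

# What each replica computes: sum of its losses / GLOBAL batch size.
scaled = [sum(losses) / global_batch_size for losses in replica_losses]

# Summing across replicas recovers the mean over the global batch.
summed_across_replicas = sum(scaled)
global_mean = sum(sum(losses) for losses in replica_losses) / global_batch_size
print(scaled, summed_across_replicas, global_mean)

# Dividing by the per-replica batch size instead would overcount after the sum.
wrong = sum(sum(losses) / len(losses) for losses in replica_losses)
print(wrong)
```

With these numbers the summed, correctly scaled losses equal the global mean, while the per-replica means add up to twice that value.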

## Define the metrics to track loss and accuracy

These metrics track the test loss and training and test accuracy.
- You can use `.result()` to get the accumulated statistics at any time, for example, `train_accuracy.result()`.

In [20]:
with strategy.scope():
    train_accuracy: Metrics = tf.keras.metrics.SparseCategoricalAccuracy(
        name='train_accuracy'
    )
    test_accuracy: Metrics = tf.keras.metrics.SparseCategoricalAccuracy(
        name='test_accuracy'
    )
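
The accumulate, read, and reset cycle of these metrics can be mimicked in a few lines of plain Python. `RunningAccuracy` below is a hypothetical stand-in written for illustration; it is not the Keras implementation, only a sketch of how `update_state`, `result`, and a reset between epochs interact.

```python
class RunningAccuracy:
    """Toy accumulating metric: fraction of correct predictions so far."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def update_state(self, labels: list, predicted: list) -> None:
        # Accumulate counts across calls instead of overwriting them.
        self.correct += sum(int(l == p) for l, p in zip(labels, predicted))
        self.total += len(labels)

    def result(self) -> float:
        return self.correct / self.total if self.total else 0.0

    def reset_states(self) -> None:
        self.correct = 0
        self.total = 0


metric = RunningAccuracy()
metric.update_state([0, 1, 2, 3], [0, 1, 2, 0])  # 3 of 4 correct
metric.update_state([4, 5, 6, 7], [4, 0, 0, 0])  # 1 of 4 correct
after_two_batches = metric.result()               # 4 of 8 overall
metric.reset_states()
after_reset = metric.result()
print(after_two_batches, after_reset)
```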

## Instantiate the model, optimizer, and checkpoints

This code is given to you.  Just remember that they are created within the `strategy.scope()`.
- Instantiate the ResNetModel, passing in the number of classes
- Create an instance of the Adam optimizer.
- Create a checkpoint for this model and its optimizer.

*Note: If you are running this on Colab and get the error message: `OSError: data/resnet_50_feature_vector does not exist`, please scroll up to the `Setup Input Pipeline` section and uncomment the `MODULE_HANDLE` line for Colab. Then restart the runtime and run all cells.*

In [21]:
# model and optimizer must be created under `strategy.scope`.
with strategy.scope():
    model: ResNetModel = ResNetModel(classes=num_classes)
    optimizer: tf.keras.optimizers.Optimizer = tf.keras.optimizers.Adam()
    checkpoint: tf.train.Checkpoint = tf.train.Checkpoint(
        optimizer=optimizer, model=model
    )

## Training loop (please complete this section)

You will define a regular training step and test step, which could work without a distributed strategy.  You can then use `strategy.run` to apply these functions in a distributed manner.
- Notice that you'll define `train_step` and `test_step` inside another function, `train_test_step_fns`, which will then return these two functions.

### Define train_step
Within the strategy's scope, define `train_step(inputs)`
- `inputs` will be a tuple containing `(images, labels)`.
- Create a gradient tape block.
- Within the gradient tape block:
  - Call the model, passing in the images and setting training to be `True` (complete this part).
  - Call the `compute_loss` function (defined earlier) to compute the training loss (complete this part).
  - Use the gradient tape to calculate the gradients.
  - Use the optimizer to update the weights using the gradients.
  
### Define test_step
Also within the strategy's scope, define `test_step(inputs)`
- `inputs` is a tuple containing `(images, labels)`.
  - Call the model, passing in the images and set training to `False`, because the model is not going to train on the test data. (complete this part).
  - Use the `loss_object`, which will compute the test loss.  Check `compute_loss`, defined earlier, to see what parameters to pass into `loss_object`. (complete this part).
  - Next, update `test_loss` (the running test loss) with the `t_loss` (the loss for the current batch).
  - Also update the `test_accuracy`.

In [22]:
# GRADED FUNCTION
def train_test_step_fns(
    strategy: tf.distribute.Strategy,
    model: ResNetModel,
    compute_loss: Callable[[tf.Tensor, tf.Tensor], tf.Tensor],
    optimizer: tf.keras.optimizers.Optimizer,
    train_accuracy: Metrics,
    loss_object: Losses,
    test_loss: Metrics,
    test_accuracy: Metrics
) -> tuple[
    Callable[[tf.distribute.DistributedDataset], tf.Tensor],
    Callable[[tf.distribute.DistributedDataset], None],
]:
    with strategy.scope():
        def train_step(inputs: tf.distribute.DistributedDataset) -> tf.Tensor:
            images: tf.Tensor
            labels: tf.Tensor
            images, labels = inputs

            with tf.GradientTape() as tape:
                ### START CODE HERE ###
                predictions: tf.Tensor = model(images, training=True)
                loss: tf.Tensor = compute_loss(labels, predictions)
                ### END CODE HERE ###

            gradients: list = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(
                gradients, model.trainable_variables
            ))

            train_accuracy.update_state(labels, predictions)
            return loss

        def test_step(inputs: tf.distribute.DistributedDataset) -> None:
            images: tf.Tensor
            labels: tf.Tensor
            images, labels = inputs

            ### START CODE HERE ###
            predictions: tf.Tensor = model(images, training=False)
            t_loss: tf.Tensor = loss_object(labels, predictions)
            ### END CODE HERE ###

            test_loss.update_state(t_loss)
            test_accuracy.update_state(labels, predictions)

        return train_step, test_step

Use the `train_test_step_fns` function to produce the `train_step` and `test_step` functions.

In [23]:
train_step, test_step = train_test_step_fns(
    strategy,
    model,
    compute_loss,
    optimizer,
    train_accuracy,
    loss_object,
    test_loss,
    test_accuracy
)

## Distributed training and testing (please complete this section)

The `train_step` and `test_step` could be used in a non-distributed, regular model training.  To apply them in a distributed way, you'll use [strategy.run](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy#run).

`distributed_train_step`
- Call the `run` function of the `strategy`, passing in the train step function (which you defined earlier), as well as the arguments that go in the train step function.
- The run function is defined like this `run(fn, args=() )`.  
  - `args` will take in the dataset inputs

`distributed_test_step`
- Similar to training, the distributed test step will use the `run` function of your strategy, taking in the test step function as well as the dataset inputs that go into the test step function.

#### Hint:
- You saw earlier that each batch in `train_dist_dataset` is a tuple with two values:
  - a batch of features
  - a batch of labels.

Let's think about how you'll want to pass in the dataset inputs into `args` by running this next cell of code:

In [24]:
# See various ways of passing in the inputs

def fun1(args: Iterable=()) -> None:
    print(f"number of arguments passed is {len(args)}")


list_of_inputs: list = [1,2]
print("When passing in args=list_of_inputs:")
fun1(args=list_of_inputs)
print()
print("When passing in args=(list_of_inputs)")
fun1(args=(list_of_inputs))
print()
print("When passing in args=(list_of_inputs,)")
fun1(args=(list_of_inputs,))

When passing in args=list_of_inputs:
number of arguments passed is 2

When passing in args=(list_of_inputs)
number of arguments passed is 2

When passing in args=(list_of_inputs,)
number of arguments passed is 1


Notice that how `list_of_inputs` is passed to `args` affects whether `fun1` sees one or two positional arguments.  
- If you see an error message about positional arguments when running the training code later, please come back to check how you're passing in the inputs to `run`.
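
One way to see why the trailing comma matters: a function that accepts `run(fn, args=())` typically forwards the tuple as `fn(*args)`. The sketch below uses a hypothetical `fake_run` that mimics only that unpacking behavior; it is an assumption made for illustration, not the TensorFlow implementation.

```python
def fake_run(fn, args=()):
    """Hypothetical stand-in: unpack `args` into positional arguments."""
    return fn(*args)


def step(inputs):
    """Expects ONE argument: a (features, labels) pair."""
    features, labels = inputs
    return f"{len(features)} features, {len(labels)} labels"


batch = ([0.1, 0.2, 0.3], [1, 0, 1])

# Wrapping the batch in a 1-tuple passes it as a single argument.
ok = fake_run(step, args=(batch,))
print(ok)

# Without the trailing comma, the pair is unpacked into two arguments.
try:
    fake_run(step, args=(batch))
    error = None
except TypeError as exc:
    error = str(exc)
print(error)
```

The second call fails because `(batch)` is just `batch`, so `step` receives the features and the labels as two separate positional arguments.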

Please complete the following function.

In [31]:
def distributed_train_test_step_fns(
    strategy: tf.distribute.Strategy,
    train_step: Callable[[tf.distribute.DistributedDataset], tf.Tensor],
    test_step: Callable[[tf.distribute.DistributedDataset], None],
    model: ResNetModel,
    compute_loss: Callable[[tf.Tensor, tf.Tensor], tf.Tensor],
    optimizer: tf.keras.optimizers.Optimizer,
    train_accuracy: Metrics,
    loss_object: Losses,
    test_loss: Metrics,
    test_accuracy: Metrics
) -> tuple[Callable[[tf.distribute.DistributedDataset], tf.Tensor], ...]:
    with strategy.scope():
        @tf.function
        def distributed_train_step(
            dataset_inputs: tf.distribute.DistributedDataset
        ) -> tf.Tensor:
            ### START CODE HERE ###
            per_replica_losses: tf.Tensor = strategy.run(
                train_step, args=(dataset_inputs,)
            )
            ### END CODE HERE ###
            return strategy.reduce(
                tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None
            )

        @tf.function
        def distributed_test_step(
            dataset_inputs: tf.distribute.DistributedDataset
        ) -> tf.Tensor:
            ### START CODE HERE ###
            return strategy.run(test_step, args=(dataset_inputs,))
            ### END CODE HERE ###

        return distributed_train_step, distributed_test_step

Call the function that you just defined to get the distributed train step function and distributed test step function.

In [26]:
distributed_train_step, distributed_test_step = distributed_train_test_step_fns(
    strategy,
    train_step,
    test_step,
    model,
    compute_loss,
    optimizer,
    train_accuracy,
    loss_object,
    test_loss,
    test_accuracy
)

**An important note before you continue:**

The following sections will guide you through how to train your model and save it to a .zip file. These sections are **not** required to pass this assignment, but you are encouraged to continue anyway. If you think the previous sections need no more work, please submit now and then carry on.

After training your model, you can download it as a .zip file and upload it back to the platform to see how well it performed. However, training your model takes around 20 minutes within the Coursera environment. Because of this, there are two methods to train your model:

**Method 1**

If 20 minutes is too long for you, we recommend downloading this notebook (after submitting it for grading) and uploading it to [Colab](https://colab.research.google.com/) to finish the training in a GPU-enabled runtime. If you decide to do this, follow these steps:

- Save this notebook.
- Click the `jupyter` logo on the upper left corner of the window. This will take you to the Jupyter workspace.
- Select this notebook (C2W4_Assignment.ipynb) and click `Shutdown`.
- Once the notebook is shut down, you can go ahead and download it.
- Head over to [Colab](https://colab.research.google.com/) and select the `upload` tab and upload your notebook.
- Before running any cell go into `Runtime` --> `Change Runtime Type` and make sure that `GPU` is enabled.
- Uncomment the lines in the first code cell of this notebook that install autograder-compatible package versions.
- Uncomment the `MODULE_HANDLE` in the `Setup input pipeline` section that contains the URL to the feature vector.
- Run all of the cells in the notebook. After training, follow the rest of the instructions of the notebook to download your model.

**Method 2**

If you prefer to wait the 20 minutes and not leave Coursera, keep going through this notebook. Once you are done, follow these steps:
- Click the `jupyter` logo on the upper left corner of the window. This will take you to the Jupyter filesystem.
- In the filesystem you should see a file named `mymodel.zip`. Go ahead and download it.

Independent of the method you choose, you should end up with a `mymodel.zip` file which can be uploaded for evaluation after this assignment. Once again, this is optional but we strongly encourage you to do it as it is a lot of fun.

With this out of the way, let's continue.



## Run the distributed training in a loop

You'll now use a for-loop to go through the desired number of epochs and train the model in a distributed manner.
In each epoch:
- Loop through each batch of the distributed training set.
  - For each training batch, call `distributed_train_step` and get the loss.
- After going through all training batches, calculate the training loss as the average of the batch losses.
- Loop through each batch of the distributed test set.
  - For each test batch, run the distributed test step. The test loss and test accuracy are updated within the test step function.
- Print the epoch number, training loss, training accuracy, test loss and test accuracy.
- Reset the losses and accuracies before continuing to another epoch.

In [27]:
# Running this cell in Coursera takes around 20 mins
with strategy.scope():
    for epoch in range(EPOCHS):
        # TRAIN LOOP
        total_loss: float = 0.0
        num_batches: int = 0
        for x in tqdm(train_dist_dataset):
            total_loss += distributed_train_step(x)
            num_batches += 1
        train_loss: float = total_loss / num_batches

        # TEST LOOP
        for x in test_dist_dataset:
            distributed_test_step(x)

        print(f"Epoch {epoch+1}, Loss: {train_loss}, ", end="")
        print(f"Accuracy: {train_accuracy.result()*100}, ", end="")
        print(f"Test Loss: {test_loss.result()}, ", end="")
        print(f"Test Accuracy: {test_accuracy.result()*100}")

        test_loss.reset_states()
        train_accuracy.reset_states()
        test_accuracy.reset_states()

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: invalid value for "node": expected "ast.AST", got "<class 'NoneType'>"; to visit lists of nodes, use "visit_block" instead


13it [00:29,  2.27s/it]

Epoch 1, Loss: 4.648879528045654, Accuracy: 4.289216041564941, Test Loss: 3.887516498565674, Test Accuracy: 11.764705657958984


13it [00:05,  2.51it/s]


Epoch 2, Loss: 2.578608274459839, Accuracy: 48.52941131591797, Test Loss: 2.760967254638672, Test Accuracy: 42.156864166259766


13it [00:02,  4.64it/s]


Epoch 3, Loss: 1.4276982545852661, Accuracy: 82.59803771972656, Test Loss: 2.2005815505981445, Test Accuracy: 52.94117736816406


13it [00:02,  4.69it/s]


Epoch 4, Loss: 0.8435925841331482, Accuracy: 94.36274719238281, Test Loss: 1.8571081161499023, Test Accuracy: 62.74510192871094


13it [00:05,  2.51it/s]


Epoch 5, Loss: 0.5501012802124023, Accuracy: 97.05882263183594, Test Loss: 1.66487717628479, Test Accuracy: 60.78431701660156


13it [00:05,  2.40it/s]


Epoch 6, Loss: 0.38454991579055786, Accuracy: 98.52941131591797, Test Loss: 1.5392638444900513, Test Accuracy: 66.66667175292969


13it [00:02,  4.38it/s]


Epoch 7, Loss: 0.28526145219802856, Accuracy: 99.14215850830078, Test Loss: 1.4733747243881226, Test Accuracy: 66.66667175292969


13it [00:05,  2.50it/s]


Epoch 8, Loss: 0.22366470098495483, Accuracy: 99.26470184326172, Test Loss: 1.4187721014022827, Test Accuracy: 64.70588684082031


13it [00:02,  4.66it/s]


Epoch 9, Loss: 0.1779027283191681, Accuracy: 99.50980377197266, Test Loss: 1.3832956552505493, Test Accuracy: 65.68627166748047


13it [00:05,  2.51it/s]


Epoch 10, Loss: 0.147242933511734, Accuracy: 99.87745666503906, Test Loss: 1.3596138954162598, Test Accuracy: 64.70588684082031


Things to note in the example above:

* We are iterating over the `train_dist_dataset` and `test_dist_dataset` using a `for x in ...` construct.
* The scaled loss is the return value of `distributed_train_step`. This value is aggregated across replicas using the `tf.distribute.Strategy.reduce` call and then across batches by summing the return values of the `tf.distribute.Strategy.reduce` calls.
* `tf.keras.metrics` objects should be updated inside `train_step` and `test_step`, which are executed by `tf.distribute.Strategy.run`.
* `tf.distribute.Strategy.run` returns results from each local replica in the strategy, and there are multiple ways to consume this result. You can call `tf.distribute.Strategy.reduce` to get an aggregated value. You can also call `tf.distribute.Strategy.experimental_local_results` to get the list of values contained in the result, one per local replica.
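
The two levels of aggregation described above can be traced with made-up numbers. The values below are hypothetical per-replica losses for three batches on two replicas; the sketch only mirrors the arithmetic of a `SUM` reduction across replicas followed by averaging over batches.

```python
# Hypothetical scaled losses: one inner list per batch, one value per replica.
per_replica_losses = [
    [1.5, 2.5],
    [1.0, 2.0],
    [0.5, 1.5],
]

# Step 1: reduce across replicas with SUM (one value per batch).
batch_losses = [sum(replica_values) for replica_values in per_replica_losses]

# Step 2: aggregate across batches, as the training loop does.
total_loss = 0.0
num_batches = 0
for loss in batch_losses:
    total_loss += loss
    num_batches += 1
train_loss = total_loss / num_batches

print(batch_losses, train_loss)
```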


# Save the Model for submission (Optional)

You'll get a saved model of this trained model. You'll then need to zip that to upload it to the testing infrastructure. We provide the code to help you with that here:

## Step 1: Save the model as a SavedModel
This code will save your model as a SavedModel

In [28]:
model_save_path: str = "./tmp/mymodel/1/"
tf.saved_model.save(model, model_save_path)

## Step 2: Zip the SavedModel Directory into /mymodel.zip

This code will zip your saved model directory contents into a single file.

If you are on Colab, you can use the file browser pane on the left side of Colab to find `mymodel.zip`. Right-click on it and select 'Download'.

If the download fails because you aren't allowed to download multiple files from colab, check out the guidance here: https://ccm.net/faq/32938-google-chrome-allow-websites-to-perform-simultaneous-downloads

If you are in Coursera, follow the instructions previously provided.

It's a large file, so it might take some time to download.

In [29]:
def zipdir(path: str, ziph: zipfile.ZipFile) -> None:
    # ziph is an open zipfile handle; add every file under `path` to it.
    for root, dirs, files in os.walk(path):
        for file in files:
            ziph.write(os.path.join(root, file))

# Use `zipf` rather than `zip` to avoid shadowing the built-in `zip`.
with zipfile.ZipFile('./mymodel.zip', 'w', zipfile.ZIP_DEFLATED) as zipf:
    zipdir('./tmp/mymodel/1/', zipf)
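
Before uploading, it can be useful to confirm what ended up in the archive. The sketch below is self-contained and uses a temporary directory with a placeholder file rather than a real SavedModel; the helper `zip_directory` and the placeholder contents of `saved_model.pb` are stand-ins for illustration. It follows the same walk-and-write approach as `zipdir` and then lists the entries with `ZipFile.namelist()`.

```python
import os
import tempfile
import zipfile


def zip_directory(path: str, archive_path: str) -> None:
    """Write every file under `path` into a zip archive."""
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(path):
            for name in files:
                archive.write(os.path.join(root, name))


with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, 'mymodel', '1')
    os.makedirs(model_dir)
    # Placeholder file standing in for real SavedModel contents.
    with open(os.path.join(model_dir, 'saved_model.pb'), 'wb') as handle:
        handle.write(b'placeholder')

    archive_path = os.path.join(workdir, 'mymodel.zip')
    zip_directory(model_dir, archive_path)

    # List the archive entries to check what would be uploaded.
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()

print(names)
```

Running the same `namelist()` check on the real `mymodel.zip` shows the stored paths of the SavedModel files.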