### Quick start to TensorFlow 2 - Basic MNIST classifier using a CNN
Original [doc](https://www.tensorflow.org/tutorials/quickstart/advanced)

In [1]:
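# Use CPU (cuDNN issue with the TF 2.0 conda install). `!export` only affects a
# subshell, so set the variable in-process before importing TensorFlow.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""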

In [2]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras import Model
import tensorflow.keras.metrics as metrics
import numpy as np
tf.__version__

'2.2.0'

___
#### Checking available datasets with keras

In [3]:
# List dataset modules; the "_" filter also hides fashion_mnist and boston_housing
for dataset in dir(tf.keras.datasets):
    if "_" not in dataset:
        print(dataset)

cifar10
cifar100
imdb
mnist
reuters


---
#### load and prepare MNIST dataset

In [4]:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train, x_test = x_train/255.0, x_test/255.0
# Use only 10k instead of 60k for training to prevent memory issues
x_train = x_train[..., tf.newaxis].astype("float32")[:10000]
y_train = y_train[:10000]
# OR x_train = x_train[..., np.newaxis].astype("float32")
# OR x_train = x_train.reshape(x_train.shape[0], 28, 28, -1).astype("float32")
x_test = x_test[..., tf.newaxis].astype("float32")

x_train.shape, x_test.shape

((10000, 28, 28, 1), (10000, 28, 28, 1))

**Note:** `...` is Python's ellipsis. In an index it stands for as many full slices (`:`) as are needed to cover the remaining axes, so `x_train[..., tf.newaxis]` is equivalent to `x_train[:, :, :, tf.newaxis]` in this case.

See the second answer [here](https://stackoverflow.com/questions/118370/how-do-you-use-the-ellipsis-slicing-syntax-in-python) for more details.
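
To make the ellipsis note concrete, here is a minimal NumPy sketch. The array `images` is a hypothetical stand-in for a small batch of 28x28 grayscale images, not the MNIST data itself; `np.newaxis` and `tf.newaxis` are both aliases for `None`, so the same indexing applies to the notebook code.

```python
import numpy as np

# Hypothetical stand-in for a batch of grayscale images: (batch, height, width)
images = np.zeros((4, 28, 28), dtype="float32")

# `...` expands to as many full slices (`:`) as the remaining axes require
with_ellipsis = images[..., np.newaxis]
with_slices = images[:, :, :, np.newaxis]

print(with_ellipsis.shape)                         # (4, 28, 28, 1)
print(np.array_equal(with_ellipsis, with_slices))  # True
```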
___

#### Shuffle and batch the dataset
**Note:** Here 10000 is the `buffer_size` argument for shuffling. It should be greater than or equal to the dataset size for a perfect shuffle. [Reference](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle)

In [5]:
train_ds = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(10000).batch(8)

test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(8)
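
The buffer-size note can be illustrated with a small pure-Python simulation. This is only a sketch of the behaviour described in the `shuffle` documentation (fill a buffer, emit a random element, refill from the source); the function `buffered_shuffle` and its `seed` parameter are hypothetical and are not part of `tf.data`. It assumes `buffer_size >= 1`.

```python
import itertools
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Illustrative buffer-based shuffle; not the tf.data implementation."""
    rng = random.Random(seed)
    source = iter(items)
    # Fill the buffer with the first `buffer_size` elements
    buffer = list(itertools.islice(source, buffer_size))
    output = []
    for item in source:
        # Emit a random buffered element and replace it with the next input
        idx = rng.randrange(len(buffer))
        output.append(buffer[idx])
        buffer[idx] = item
    # Source exhausted: emit what is left in the buffer in random order
    rng.shuffle(buffer)
    output.extend(buffer)
    return output

print(buffered_shuffle(range(10), buffer_size=1))   # order unchanged
print(buffered_shuffle(range(10), buffer_size=3))   # only locally shuffled
print(buffered_shuffle(range(10), buffer_size=10))  # full shuffle
```

With a buffer of 1 nothing is shuffled, and with a buffer of 3 the first emitted element can only be one of the first three inputs. Only a buffer at least as large as the dataset lets every element land in any position.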

___
#### Build the Keras model using the model subclassing API

In [6]:
class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        # 32 : num filters , 3 : filter size
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10)
    def call(self, x):
        x = self.conv1(x)
        x = self.flatten(x)
        x = self.d1(x)
        x = self.d2(x)
        return x
model = MyModel()
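
As a rough sanity check on the model size, the layer shapes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults for `Conv2D` (stride 1, `'valid'` padding) and the 28x28x1 input prepared earlier; the variable names are only for illustration.

```python
# Assumes Keras Conv2D defaults: stride 1 and 'valid' (no) padding
height = width = 28
in_channels = 1
kernel = 3
filters = 32

# A 3x3 'valid' convolution shrinks each spatial dimension by kernel - 1
out_size = height - kernel + 1                   # 26
flat_features = out_size * out_size * filters    # 26 * 26 * 32 = 21632

# Parameters = weights + biases for each layer
conv_params = kernel * kernel * in_channels * filters + filters  # 320
d1_params = flat_features * 128 + 128                            # 2769024
d2_params = 128 * 10 + 10                                        # 1290

total_params = conv_params + d1_params + d2_params
print(flat_features, total_params)  # 21632 2770634
```

Almost all of the parameters sit in the first `Dense` layer, because the flattened convolution output is large.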

---
#### loss and optimizer
**Note:** Sparse categorical cross-entropy computes exactly the same loss as categorical cross-entropy. The only difference is that the sparse version takes integer class labels, while the normal version takes one-hot encoded labels. Integer labels take less memory. [Reference](https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other)

In [7]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

optimizer = tf.keras.optimizers.Adam()
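
The equivalence between the sparse and one-hot forms can be checked numerically. The following is a minimal NumPy sketch, not the Keras implementation; the logits and labels are made-up values, and the softmax is applied by hand to mirror `from_logits=True`.

```python
import numpy as np

# Made-up logits for 2 samples and 3 classes, with their integer labels
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
int_labels = np.array([0, 2])
one_hot_labels = np.eye(3)[int_labels]  # same labels, one-hot encoded

# Numerically stable log-softmax over the class axis
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Sparse form: pick the log-probability of the true class by index
sparse_ce = -log_probs[np.arange(len(int_labels)), int_labels].mean()
# Categorical form: multiply by the one-hot vector and sum over classes
categorical_ce = -(one_hot_labels * log_probs).sum(axis=1).mean()

print(np.isclose(sparse_ce, categorical_ce))  # True
```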

___
#### Metrics to measure the loss and the accuracy of the model

In [8]:
train_loss = metrics.Mean(name='train_loss')
train_accuracy = metrics.SparseCategoricalAccuracy(name='train_acc')

test_loss = metrics.Mean(name='test_loss')
test_accuracy = metrics.SparseCategoricalAccuracy(name='test_acc')
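
To show roughly what these metric objects track, here is a NumPy sketch of the accumulate / result / reset pattern. `RunningMean` and `batch_accuracy` are hypothetical helpers written for illustration; they are not the Keras classes, and the real `SparseCategoricalAccuracy` also accumulates across batches.

```python
import numpy as np

class RunningMean:
    """Hypothetical stand-in for a streaming mean metric."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        # Accumulate one value (e.g. the loss of one batch)
        self.total += float(value)
        self.count += 1

    def result(self):
        return self.total / self.count if self.count else 0.0

    def reset(self):
        self.total, self.count = 0.0, 0

def batch_accuracy(labels, logits):
    # Predicted class = index of the largest logit; compare with integer labels
    return float(np.mean(np.argmax(logits, axis=-1) == labels))

loss_metric = RunningMean()
for batch_loss in (0.5, 0.25):
    loss_metric.update(batch_loss)
print(loss_metric.result())  # 0.375

labels = np.array([0, 2, 1, 1])
logits = np.array([[3.0, 0.1, 0.2],
                   [0.1, 0.2, 3.0],
                   [2.0, 1.0, 0.0],   # wrong: predicts class 0, label is 1
                   [0.0, 1.0, 0.5]])
print(batch_accuracy(labels, logits))  # 0.75
```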

___
#### GradientTape explanation
* Records operations for automatic differentiation.
* Trainable variables (created by `tf.Variable` or `tf.compat.v1.get_variable`,
  where `trainable=True` is the default in both cases) are watched automatically.
  Tensors can be watched manually by invoking the `watch` method on this context
  manager.
* Higher-order derivatives can be computed by nesting tapes (example below).

In [9]:
x = tf.constant(3.0)
with tf.GradientTape() as g:
    g.watch(x)
    with tf.GradientTape() as gg:
        gg.watch(x)
        y = x * x
    dy_dx = gg.gradient(y, x)     # Will compute to 6.0
    d2y_dx2 = g.gradient(dy_dx, x)  # Will compute to 2.0
print(dy_dx, d2y_dx2)


tf.Tensor(6.0, shape=(), dtype=float32) tf.Tensor(2.0, shape=(), dtype=float32)
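
The difference between automatically watched variables and unwatched tensors can be sketched as follows. The names `v` and `c` are illustrative only; the example assumes the default `tape.gradient` behaviour of returning `None` for a source the tape did not watch.

```python
import tensorflow as tf

v = tf.Variable(3.0)   # trainable variable: watched automatically
c = tf.constant(3.0)   # plain tensor: not watched unless tape.watch(c) is called

with tf.GradientTape() as tape:
    y = v * v + c * c

dy_dv, dy_dc = tape.gradient(y, [v, c])
print(dy_dv)  # gradient with respect to the variable: 2 * v = 6.0
print(dy_dc)  # None, because the constant was never watched
```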


___
#### Train the model
**Note:** `tf.function` compiles a Python function into a callable TensorFlow graph, which usually runs faster than executing the same code eagerly.

The main takeaways and recommendations are:

* Debug in eager mode, then decorate with `@tf.function`.
* Don't rely on Python side effects like object mutation or list appends.
* `tf.function` works best with TensorFlow ops; NumPy and Python calls are converted to constants.
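
The warning about Python side effects can be seen in a small sketch. `trace_log` and `add_one` are hypothetical names for illustration; the point is that the list append runs while the function is being traced, not on every call.

```python
import tensorflow as tf

trace_log = []  # plain Python list, only touched while tracing

@tf.function
def add_one(x):
    trace_log.append("traced")  # Python side effect: runs at trace time only
    return x + 1

first = add_one(tf.constant(1))
# Same dtype and shape as before, so the already traced graph is reused
second = add_one(tf.constant(2))

print(int(first), int(second))  # 2 3
print(len(trace_log))           # 1, although the function was called twice
```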

In [10]:
@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        # training=True only matters for layers that behave differently in training (e.g. Dropout)
        predictions = model(images, training=True)
        loss = loss_object(labels, predictions)
    # calculate Gradients of loss function wrt all trainable variables
    gradients = tape.gradient(loss, model.trainable_variables)
    # apply those gradients to corresponding variables
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(labels, predictions)

___
#### Test the model

In [11]:
@tf.function
def test_step(images, labels):
    predictions = model(images, training=False)
    loss = loss_object(labels, predictions)
    
    test_loss(loss)
    test_accuracy(labels, predictions)

___
#### Actual training epoch by epoch using batches created above

In [12]:
EPOCHS = 5

for epoch in range(EPOCHS):
    # Reset the metrics at the start of each epoch
    train_loss.reset_states()
    train_accuracy.reset_states()
    test_loss.reset_states()
    test_accuracy.reset_states()
    # Process train and test batches in a loop
    for images, labels in train_ds:
        train_step(images, labels)

    for test_images, test_labels in test_ds:
        test_step(test_images, test_labels)
    # Output the metrics
    template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
    print(template.format(epoch + 1,
                        train_loss.result(),
                        train_accuracy.result() * 100,
                        test_loss.result(),
                        test_accuracy.result() * 100))

Epoch 1, Loss: 0.2704825699329376, Accuracy: 92.00999450683594, Test Loss: 0.14461418986320496, Test Accuracy: 95.51000213623047
Epoch 2, Loss: 0.06728130578994751, Accuracy: 97.86000061035156, Test Loss: 0.11313824355602264, Test Accuracy: 96.41000366210938
Epoch 3, Loss: 0.028660165145993233, Accuracy: 99.02999877929688, Test Loss: 0.13489733636379242, Test Accuracy: 96.0
Epoch 4, Loss: 0.01360512524843216, Accuracy: 99.61000061035156, Test Loss: 0.15111477673053741, Test Accuracy: 95.94000244140625
Epoch 5, Loss: 0.015145039185881615, Accuracy: 99.54000091552734, Test Loss: 0.12130691111087799, Test Accuracy: 96.9000015258789


___
#### Comparison with PyTorch
* **imports**
 * `tensorflow.keras.layers` equivalent to `torch.nn`
 * `tensorflow.keras.Model` equivalent to `torch.nn.Module`
 * `tf.keras.optimizers` equivalent to `torch.optim`
* **layers**
 * `tensorflow.keras.layers.Conv2D` and `torch.nn.Conv2d`
 * `tensorflow.keras.layers.Dense` and `torch.nn.Linear`
* **Model declaration**
 * Keras layers accept an `activation` argument, whereas in PyTorch the activation is applied separately
 * `call` vs `forward`
 * The Keras model subclassing API is almost exactly the same as a PyTorch model declaration
* **optimizer declaration**
 * PyTorch requires the network parameters when the optimizer is declared
 * **Eg:** `optimizer = tf.keras.optimizers.Adam()` vs `optimizer = optim.Adam(net.parameters())`
* **Loss function**
 * `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)` vs `torch.nn.CrossEntropyLoss()` (both take integer class labels and raw logits)
* **Batch Processing**
 * `for images, labels in train_ds` vs `for images, labels in trainloader`
 * Seems exactly the same; needs to be verified again
* **Resetting loss to zero at the start of epoch**
 * `train_loss.reset_states()` vs `running_loss = 0`
* **Forward Prop**
 * `model(images, training=True)` vs `net(inputs)`
 * Almost exactly same
* **BackProp**
 * `gradients = tape.gradient(loss, model.trainable_variables)` vs `loss.backward()`
 * `optimizer.apply_gradients(zip(gradients, model.trainable_variables))` vs `optimizer.step()`
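
For reference, the PyTorch side of this comparison might look roughly like the sketch below. It is not taken from the original tutorial: the class name `Net`, the random input batch, and the single training step are assumptions made for illustration, and the input uses PyTorch's channels-first layout `(batch, 1, 28, 28)`.

```python
import torch
from torch import nn, optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)        # in_channels, out_channels, kernel size
        self.flatten = nn.Flatten()
        self.d1 = nn.Linear(32 * 26 * 26, 128)  # input size must be given explicitly
        self.d2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))  # activation applied separately
        x = self.flatten(x)
        x = torch.relu(self.d1(x))
        return self.d2(x)              # raw logits, as in the Keras model

net = Net()
criterion = nn.CrossEntropyLoss()          # expects logits and integer labels
optimizer = optim.Adam(net.parameters())   # parameters passed at declaration

# Hypothetical random batch standing in for one batch from a DataLoader
inputs = torch.rand(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))

optimizer.zero_grad()              # gradients accumulate, so clear them first
outputs = net(inputs)              # forward pass
loss = criterion(outputs, labels)
loss.backward()                    # backpropagation
optimizer.step()                   # parameter update

print(outputs.shape)  # torch.Size([8, 10])
```

One difference the sketch makes visible is `optimizer.zero_grad()`: PyTorch accumulates gradients across `backward()` calls, whereas `tape.gradient` returns freshly computed gradients each time.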