# CS492: Special Topics in Computer Science <Artificial Intelligence for Smart Energy>
## Deep Learning Practice 
#### Prof. Ho-Jin Choi
#### School of Computing, KAIST

---

### 7-4. Subclassing and GradientTape

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

tf.keras.backend.clear_session()  # For easy reset of notebook state.

#### Subclassing 

Build the model below using `Sequential`:
```
(input: 784-dimensional vectors)
       ↧
[Dense (64 units, relu activation)]
       ↧
[Dense (10 units, softmax activation)]
       ↧
(output: probability distribution over 10 classes)
```

In [2]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Flatten(input_shape=(784,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 64)                50240     
_________________________________________________________________
dense_1 (Dense)              (None, 10)                650       
Total params: 50,890
Trainable params: 50,890
Non-trainable params: 0
_________________________________________________________________
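The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per (input, unit) pair plus one bias per unit. A minimal sketch of that arithmetic (the helper name `dense_param_count` is ours, not part of Keras):

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden_params = dense_param_count(784, 64)  # Dense(64) on 784 inputs
output_params = dense_param_count(64, 10)   # Dense(10) on 64 inputs
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 50240 650 50890
```

The `Flatten` layer only reshapes its input, which is why it contributes no parameters.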


Build the same model using the functional API:

In [4]:
inputs = keras.Input(shape=(784,))
x = layers.Dense(64, activation='relu')(inputs)
outputs = layers.Dense(10, activation='softmax')(x)


model = keras.Model(inputs=inputs, outputs=outputs)

model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 784)]             0         
_________________________________________________________________
dense_4 (Dense)              (None, 64)                50240     
_________________________________________________________________
dense_5 (Dense)              (None, 10)                650       
Total params: 50,890
Trainable params: 50,890
Non-trainable params: 0
_________________________________________________________________


Build the same model by subclassing `tf.keras.Model`:
- `__init__`: define the layers that make up the model
- `call`: compute the forward pass
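The division of labour between the two methods can be sketched without Keras. The classes `TinyModel` and `Scale` below are hypothetical stand-ins; the real `keras.Model.__call__` does considerably more (for example, it creates the layer weights on first use), so read this only as an illustration of why calling the model object ends up running `call`:

```python
class TinyModel:
    """Hypothetical stand-in for keras.Model (not the real implementation)."""

    def __call__(self, x):
        # Calling the object runs the subclass's forward pass.
        return self.call(x)


class Scale(TinyModel):
    def __init__(self, factor):
        # "Structure": state is created once, at construction time.
        self.factor = factor

    def call(self, x):
        # "Forward pass": computed every time the model is called.
        return [self.factor * value for value in x]


doubler = Scale(2.0)
doubled = doubler([1.0, 2.0, 3.0])
print(doubled)  # [2.0, 4.0, 6.0]
```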

In [5]:
from tensorflow import keras
from tensorflow.keras import layers

class MyClassifier(tf.keras.Model):
    def __init__(self):
        super(MyClassifier, self).__init__()
        self.input_layer = layers.Flatten()
        self.hidden_layer = layers.Dense(64, activation='relu', name='dense_1')
        self.output_layer = layers.Dense(10, activation='softmax', name='predictions')
        
    def call(self, x):
        x = self.input_layer(x)
        x = self.hidden_layer(x)
        outputs = self.output_layer(x)
        return outputs
    
    

my_model = MyClassifier()

In [6]:
# Load a toy dataset for the sake of this example
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Preprocess the data (these are Numpy arrays)
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

y_train = y_train.astype('float32')
y_test = y_test.astype('float32')

my_model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

my_model.fit(x_train, y_train, epochs=3)

Train on 60000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f2cf4ac9160>

In [7]:
my_model.evaluate(x_test, y_test, verbose=0)

[0.10554545342661441, 0.9688]
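`evaluate` returns the loss followed by the compiled metrics, here `[loss, accuracy]`. As a rough, TensorFlow-free sketch of what these two numbers measure (the function names mirror the Keras ones but are plain-Python stand-ins, and details such as the probability clipping Keras applies for numerical stability are left out):

```python
import math


def sparse_categorical_crossentropy(labels, probs):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(p[int(y)]) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)


def sparse_categorical_accuracy(labels, probs):
    """Fraction of rows whose most probable class equals the label."""
    hits = [p.index(max(p)) == int(y) for y, p in zip(labels, probs)]
    return sum(hits) / len(hits)


labels = [0.0, 2.0]  # integer class ids stored as floats, as in the notebook
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3]]

loss = sparse_categorical_crossentropy(labels, probs)
accuracy = sparse_categorical_accuracy(labels, probs)
print(round(loss, 4), accuracy)  # 0.7803 0.5
```

"Sparse" refers to the labels being class indices rather than one-hot vectors.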

#### GradientTape
TensorFlow provides the [`tf.GradientTape`](https://www.tensorflow.org/api_docs/python/tf/GradientTape) API for _automatic differentiation_, that is, computing the gradient of a computation with respect to its input variables.

TensorFlow "records" all operations executed inside the context of a `tf.GradientTape` onto a _"tape"_. TensorFlow then uses that tape and the gradients associated with each recorded operation to compute the gradients of the recorded computation using reverse-mode differentiation.

The two methods used in the examples below are:
- [`tf.GradientTape.watch(tensor)`](https://www.tensorflow.org/api_docs/python/tf/GradientTape#watch): Ensures that `tensor` is being traced by this tape. Trainable `tf.Variable`s are watched automatically; other tensors, such as the constants used below, must be watched explicitly.
- [`tf.GradientTape.gradient(target, sources)`](https://www.tensorflow.org/api_docs/python/tf/GradientTape#gradient): Computes the gradient using operations recorded in the context of this tape.
    - `target`: Tensor (or list of tensors) to be differentiated.
    - `sources`: A list or nested structure of Tensors or Variables. `target` will be differentiated against the elements of `sources`.
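The record-then-replay idea behind the tape can be illustrated with a toy scalar version. `Var` and `ToyTape` are hypothetical names, and the sketch supports multiplication only; it is meant to show the principle of reverse-mode differentiation, not how TensorFlow implements it:

```python
class Var:
    """A scalar value that the toy tape can differentiate through."""

    def __init__(self, value):
        self.value = value


class ToyTape:
    """Hypothetical scalar tape; illustrates the idea, not TensorFlow's internals."""

    def __init__(self):
        self.records = []  # one (output, inputs, local_gradients) entry per op

    def mul(self, a, b):
        out = Var(a.value * b.value)
        # d(a*b)/da = b and d(a*b)/db = a
        self.records.append((out, (a, b), (b.value, a.value)))
        return out

    def gradient(self, target, source):
        grads = {id(target): 1.0}  # d(target)/d(target) = 1
        # Replay the tape backwards, applying the chain rule at each op.
        for out, inputs, local_grads in reversed(self.records):
            upstream = grads.get(id(out), 0.0)
            for inp, local in zip(inputs, local_grads):
                grads[id(inp)] = grads.get(id(inp), 0.0) + upstream * local
        return grads.get(id(source), 0.0)


tape = ToyTape()
x = Var(3.0)
y = tape.mul(x, x)  # y = x^2
z = tape.mul(y, y)  # z = y^2 = x^4
dy_dx = tape.gradient(y, x)
dz_dx = tape.gradient(z, x)
print(dy_dx, dz_dx)  # 6.0 108.0
```

Because the toy tape never discards its records, `gradient` can be called repeatedly on it.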

In [57]:
# x = [[1, 1]
#      [1, 1]]
x = tf.ones((2, 2))

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.reduce_sum(x)   # y = sum of all four elements = 4.0
    z = tf.multiply(y, y)  # z = y^2 = 16.0
    print(x)
    print(y)
    print(z)

# Derivative of z with respect to the original input tensor x
dz_dx = tape.gradient(z, x)  # dz/dx_ij = 2 * y = 8.0 for every element
print('\n')
print(dz_dx)

for i in [0, 1]:
    for j in [0, 1]:
        assert dz_dx[i][j].numpy() == 8.0


tf.Tensor(
[[1. 1.]
 [1. 1.]], shape=(2, 2), dtype=float32)
tf.Tensor(4.0, shape=(), dtype=float32)
tf.Tensor(16.0, shape=(), dtype=float32)


tf.Tensor(
[[8. 8.]
 [8. 8.]], shape=(2, 2), dtype=float32)
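One way to sanity-check a gradient like this is to compare it with central finite differences. The sketch below repeats the same computation in plain Python (nested lists instead of tensors; the helper names are ours) and perturbs one element at a time:

```python
def z_of(x):
    """Same computation as the cell above: z = (sum of all elements)^2."""
    y = sum(sum(row) for row in x)
    return y * y


def numerical_gradient(f, x, eps=1e-4):
    """Central finite differences, one element at a time."""
    grad = []
    for i in range(len(x)):
        grad_row = []
        for j in range(len(x[i])):
            plus = [row[:] for row in x]
            minus = [row[:] for row in x]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad_row.append((f(plus) - f(minus)) / (2 * eps))
        grad.append(grad_row)
    return grad


x = [[1.0, 1.0], [1.0, 1.0]]
approx = numerical_gradient(z_of, x)
print([[round(g, 6) for g in row] for row in approx])  # [[8.0, 8.0], [8.0, 8.0]]
```

Finite differences need two function evaluations per input element, which is why reverse-mode differentiation is preferred for models with many parameters.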


You can also request gradients of the output with respect to intermediate values computed during a "recorded" `tf.GradientTape` context.

In [50]:
x = tf.ones((2, 2))

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.reduce_sum(x)
    z = tf.multiply(y, y)

# Use the tape to compute the derivative of z with respect to the
# intermediate value y.
# z = y^2
dz_dy = tape.gradient(z, y) # 8.0 (2*y at y=4.0)
print(dz_dy)

assert dz_dy.numpy() == 8.0

tf.Tensor(8.0, shape=(), dtype=float32)
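This intermediate gradient ties back to the earlier result through the chain rule. With `y` the sum of all elements `x_ij` and `z = y^2`, a short derivation at the point where every `x_ij = 1`:

```latex
\frac{\partial z}{\partial y} = 2y = 8 \quad (y = 4), \qquad
\frac{\partial y}{\partial x_{ij}} = 1, \qquad
\frac{\partial z}{\partial x_{ij}}
  = \frac{\partial z}{\partial y}\,\frac{\partial y}{\partial x_{ij}}
  = 8 \cdot 1 = 8
```

So `dz_dy` is the scalar 8, and each entry of `dz_dx` is that same value multiplied by a local derivative of 1.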


By default, the resources held by a `GradientTape` are released as soon as its `gradient()` method is called. To compute multiple gradients over the same computation, create a persistent gradient tape. This allows multiple calls to `gradient()`, because the resources are only released when the tape object is garbage collected. For example:

In [9]:
x = tf.constant(3.0)

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = x * x  # y = x^2
    z = y * y  # z = y^2 = x^4

dy_dx = tape.gradient(y, x)  # 6.0 (2*x at x = 3)
print(dy_dx)

dz_dx = tape.gradient(z, x)  # 108.0 (4*x^3 at x = 3)
print(dz_dx)

del tape  # Drop the reference to the tape

tf.Tensor(6.0, shape=(), dtype=float32)
tf.Tensor(108.0, shape=(), dtype=float32)


#### Training the model with GradientTape
Calling a model inside a `GradientTape` scope **enables you to retrieve the gradients of a loss value with respect to the model's trainable weights**. Using an optimizer instance, you can **use these gradients to update the weights (which you can retrieve using `model.trainable_weights`)**.

Let's reuse our subclassed MNIST model and train it with mini-batch gradient descent in a custom training loop.
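Before the full TensorFlow version, the shape of such a loop (forward pass, loss, gradient, update, repeated per mini-batch) can be seen on a deliberately tiny problem. Everything below is an assumption made for illustration: a one-weight model `y = w * x`, a mean-squared-error loss, and a gradient derived by hand in place of `tape.gradient`:

```python
# Toy problem (an assumption for illustration): fit w in y = w * x, true w = 3.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]

w = 0.0  # the single trainable weight
learning_rate = 0.05
batch_size = 2

for epoch in range(20):
    for start in range(0, len(xs), batch_size):
        x_batch = xs[start:start + batch_size]
        y_batch = ys[start:start + batch_size]

        # Forward pass and loss (mean squared error over the batch).
        preds = [w * x for x in x_batch]
        loss = sum((p - y) ** 2 for p, y in zip(preds, y_batch)) / len(x_batch)

        # Gradient of the loss with respect to w, derived by hand here;
        # in the notebook, tape.gradient computes this automatically.
        grad = sum(2 * (p - y) * x
                   for p, y, x in zip(preds, y_batch, x_batch)) / len(x_batch)

        # Gradient descent step.
        w -= learning_rate * grad

print(round(w, 6))  # 3.0
```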

In [10]:
from tensorflow import keras
from tensorflow.keras import layers

class MyClassifier(tf.keras.Model):
    def __init__(self):
        super(MyClassifier, self).__init__()
        self.input_layer = layers.Flatten()
        self.hidden_layer = layers.Dense(64, activation='relu', name='dense_1')
        self.output_layer = layers.Dense(10, activation='softmax', name='predictions')
        
    def call(self, x):
        x = self.input_layer(x)
        x = self.hidden_layer(x)
        outputs = self.output_layer(x)
        return outputs
    
my_model = MyClassifier()

In [11]:
# Load a toy dataset for the sake of this example
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Preprocess the data (these are Numpy arrays)
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

y_train = y_train.astype('float32')
y_test = y_test.astype('float32')

# Reserve 10,000 samples for validation
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]

In [13]:
# Prepare the training dataset.
batch_size = 64
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

# Prepare the validation dataset.
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_dataset = val_dataset.batch(batch_size)
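With these settings, the number of batches per epoch follows from simple arithmetic. The sketch assumes the default behaviour of `batch`, where a final, smaller batch is kept rather than dropped:

```python
import math

n_train = 60000 - 10000  # samples left after reserving the validation split
n_val = 10000
batch_size = 64

train_steps = math.ceil(n_train / batch_size)
val_steps = math.ceil(n_val / batch_size)
last_train_batch = n_train - (train_steps - 1) * batch_size

print(train_steps, val_steps, last_train_batch)  # 782 157 16
```

So one epoch of the training loop runs 782 steps, and the last batch holds only 16 samples.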

In [14]:
# Instantiate an optimizer.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy()
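For plain SGD without momentum, as instantiated here, each update moves a variable against its gradient, scaled by the learning rate. The function below is a hypothetical plain-Python stand-in for what `optimizer.apply_gradients` does with `(gradient, variable)` pairs; unlike Keras, which updates the variables in place, it returns the new values:

```python
def sgd_apply_gradients(grads_and_vars, learning_rate):
    """Plain SGD (no momentum): move each variable against its gradient."""
    return [var - learning_rate * grad for grad, var in grads_and_vars]


weights = [0.5, -1.0]
grads = [0.2, -0.4]
updated = sgd_apply_gradients(zip(grads, weights), learning_rate=1e-3)
print([round(w, 4) for w in updated])  # [0.4998, -0.9996]
```

Note the pair order: gradient first, variable second, which matches `zip(grads, my_model.trainable_weights)` in the training loop.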

In [15]:
# Prepare the metrics.
train_acc_metric = keras.metrics.SparseCategoricalAccuracy()
train_loss = tf.keras.metrics.Mean(name='train_loss')

val_acc_metric = keras.metrics.SparseCategoricalAccuracy()
val_loss = tf.keras.metrics.Mean(name='val_loss')
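These metric objects are stateful: each call accumulates a value, `result()` reports the aggregate so far, and resetting clears the state for the next epoch. A minimal sketch of that behaviour for a running mean (the class and its method names are hypothetical, not the Keras API):

```python
class RunningMean:
    """Hypothetical stand-in for a streaming mean metric."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count if self.count else 0.0


epoch_loss = RunningMean()
for batch_loss in [2.0, 4.0]:
    epoch_loss.update(batch_loss)
first_epoch = epoch_loss.result()

epoch_loss.reset()  # without this, the next epoch's mean would include old batches
epoch_loss.update(10.0)
second_epoch = epoch_loss.result()

print(first_epoch, second_epoch)  # 3.0 10.0
```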

In [None]:
# Iterate over epochs.
for epoch in range(3):
    print('\n\nStart of epoch %d' % (epoch,))

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):

        # Open a GradientTape to record the operations run
        # during the forward pass, which enables autodifferentiation.
        with tf.GradientTape() as tape:
            # Run the forward pass of the model.
            # The operations that the model applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            logits = my_model(x_batch_train)  # Softmax probabilities for this minibatch (not raw logits, despite the name)

            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)
                        

        # Use the gradient tape to automatically retrieve
        # the gradients of the loss with respect to the trainable variables.
        grads = tape.gradient(loss_value, my_model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, my_model.trainable_weights))
        
        
        # Update training metric.
        train_acc_metric(y_batch_train, logits)
        train_loss(loss_value)

        # Log every 200 batches.
        if step % 200 == 0:
            print('Training loss (for one batch) at step %s: %s' % (step, float(loss_value)))
            print('Seen so far: %s samples' % ((step + 1) * batch_size))
            
    # Display metrics at the end of each epoch.
    train_acc = train_acc_metric.result()
    print("-------------------------------------------")
    print('Training loss: %.3f | acc over epoch: %s' % (train_loss.result(), float(train_acc),))
        
        
    # Run a validation loop at the end of each epoch.
    for x_batch_val, y_batch_val in val_dataset:
        val_logits = my_model(x_batch_val)
        v_loss = loss_fn(y_batch_val, val_logits)
        
        val_loss(v_loss)
        
        # Update val metrics
        val_acc_metric(y_batch_val, val_logits)
        
    val_acc = val_acc_metric.result()
    print("-------------------------------------------")
    print('Validation avg loss: %.3f | acc: %s' % (val_loss.result(), float(val_acc),))
    
    # Reset the metrics for the next epoch
    train_acc_metric.reset_states()
    train_loss.reset_states()

    val_acc_metric.reset_states()
    val_loss.reset_states()
    




Start of epoch 0
Training loss (for one batch) at step 0: 2.3608860969543457
Seen so far: 64 samples
Training loss (for one batch) at step 200: 2.1577601432800293
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 2.0718436241149902
Seen so far: 25664 samples
Training loss (for one batch) at step 600: 1.9370613098144531
Seen so far: 38464 samples
-------------------------------------------
Training loss: 2.058 | acc over epoch: 0.3299599885940552


Evaluate the model on the test set:

In [22]:
batch_size = 64
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test))
test_dataset = test_dataset.batch(batch_size)

In [29]:
my_model.compile(optimizer=optimizer,
              loss=loss_fn,
              metrics=[keras.metrics.SparseCategoricalAccuracy()])

In [32]:
test_loss, test_acc = my_model.evaluate(test_dataset)
print('Loss: {}, Acc: {}'.format(test_loss, test_acc))

Loss: 0.49058654031176474, Acc: 0.8758000135421753
