## 7.1 Going beyond the Sequential model

This section explains the technical details.

* Until now, all the neural networks introduced have been implemented using the `Sequential` model.

  <img src="https://drive.google.com/uc?id=1ZXI42nVA8DN40JG4QsOvU6IFBMgHfNJK" width="300">

* Some networks require:
  * several independent inputs,
  * multiple outputs,
  * internal branching between layers that makes them look like graphs of layers rather than linear stacks of layers.
  
* *Multimodal* inputs
  * data coming from different input sources, processing each type of data using different kinds of neural layers
  * Ex. a deep learning model to predict the most likely market price of a second-hand piece of clothing using the following inputs:
    * user-provided metadata: brand, age, etc.
    * user-provided text description
    * a picture of the item
    
      <img src="https://drive.google.com/uc?id=1ZYV_1OqkoiW5-xjECBaqcFQvLZVhhlRK" width="700">
    
* *Multiple* targets
  * predict multiple target attributes of input data
  * Ex. Given the text of a novel, simultaneously classify it by genre and predict the approximate date it was written.
  
   <img src="https://drive.google.com/uc?id=1ZcQzkbgoKVSu7FbTQsJhIy2yN6nYAT9P" width="400">
  
* Networks structured as directed acyclic graphs
  * The Inception family of networks having *Inception modules*
  
    <img src="https://drive.google.com/uc?id=1ZeBw9KykV8u7bgyitHPvho9-SfiSWGgQ" width="500">
  
  * The residual network having residual connections
  
    <img src="https://drive.google.com/uc?id=1ZqD1rYEg3wd6XA8LjRZo-JYsbIgQ1ro9" width="500">



### The functional API of *keras*

* We can use layers as *functions* that take tensors and return tensors (hence, the name *functional* API).

In [None]:
from tensorflow.keras import Input, layers

input_tensor = Input(shape=(32,))
dense = layers.Dense(32, activation='relu')
output_tensor = dense(input_tensor)

In [None]:
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras import layers
from tensorflow.keras import Input

seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))

input_tensor = Input(shape=(64,))
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)

model = Model(input_tensor, output_tensor) # only the input and output tensors are needed

model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 64)]              0         
                                                                 
 dense_4 (Dense)             (None, 32)                2080      
                                                                 
 dense_5 (Dense)             (None, 32)                1056      
                                                                 
 dense_6 (Dense)             (None, 10)                330       
                                                                 
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________


In [None]:
unrelated_input = Input(shape=(64,))
bad_model = Model(unrelated_input, output_tensor)
# Above, every tensor is connected through x. Here, unrelated_input is not connected to output_tensor, so Keras reports the graph as disconnected.

ValueError: ignored

In [None]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy')

import numpy as np
x_train = np.random.random((1000,64))
y_train = np.random.random((1000,10))

model.fit(x_train, y_train, epochs=10, batch_size=128)

score = model.evaluate(x_train, y_train)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Multi-input models

* Such models at some point merge their different input branches using a layer that can combine several tensors:
  * adding them, concatenating them, etc.
  * *keras.layers.add*, *keras.layers.concatenate*
  
* A question-answering model
  * A typical QnA model has two inputs: 
    * a natural-language question
    * a text snippet (the passage) providing the information needed to answer the question
  * The model then produces an answer.
  
  <img src="https://drive.google.com/uc?id=1ZqZTfHEJIBqhXXPmcJ6DMCW20UfL9zQG" width="400">

  
  

In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras import layers
from tensorflow.keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

text_input = Input(shape=(None,), dtype='int32', name='text')
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)
encoded_text = layers.LSTM(32)(embedded_text) # output shape: (batch_size, 32)

question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question) # output shape: (batch_size, 16)

concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1) # output shape: (batch_size, 32 + 16)

answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])
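
To make the two merge operations concrete, here is a minimal NumPy sketch. The array names and sizes are made up for illustration, and plain NumPy operations stand in for the Keras merge layers. Element-wise addition needs both branches to have the same shape, whereas concatenation only needs them to agree on every axis except the one being joined.

```python
import numpy as np

# Hypothetical branch outputs for a batch of 4 samples.
branch_a = np.ones((4, 32))
branch_b = np.full((4, 32), 2.0)
branch_c = np.zeros((4, 16))

# Addition is element-wise, so both tensors must share the same shape.
added = branch_a + branch_b

# Concatenation joins along the last axis, so the feature sizes may differ.
concatenated = np.concatenate([branch_a, branch_c], axis=-1)

print(added.shape)         # (4, 32)
print(concatenated.shape)  # (4, 48)
```

This mirrors the question-answering model, where a 32-dimensional and a 16-dimensional encoding are concatenated into a 48-dimensional vector.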

* How do we train this two-input model? There are two data inputs (the question and the passage), so let's see how to feed them.
  * You can feed the model a list of NumPy arrays as inputs.
  * Or, you can feed it a dictionary that maps input names to NumPy arrays.

In [None]:
import numpy as np

num_samples = 1000
max_length = 100

text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))
question = np.random.randint(1, question_vocabulary_size, size=(num_samples, max_length))

label_index = np.random.randint(0,answer_vocabulary_size, size=(num_samples,))
answer = np.zeros((num_samples,answer_vocabulary_size))
answer[np.arange(num_samples), label_index] = 1

# There are two options.
#model.fit([text, question], answer, epochs=10, batch_size=128)
model.fit({'text':text, 'question':question}, answer, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f0bd591f8d0>

### Multi-output models

In [None]:
from tensorflow.keras import layers
from tensorflow.keras import Input
from tensorflow.keras.models import Model

vocabulary_size = 50000
num_income_groups = 10

posts_input = Input(shape=(None,), dtype='int32', name='posts')
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)
x = layers.Conv1D(128, 5, activation='relu')(embedded_posts) 
x = layers.MaxPooling1D(5)(x) 
x = layers.Conv1D(256, 5, activation='relu')(x) 
x = layers.Conv1D(256, 5, activation='relu')(x) 
x = layers.MaxPooling1D(5)(x) 
x = layers.Conv1D(256, 5, activation='relu')(x) 
x = layers.Conv1D(256, 5, activation='relu')(x) 
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

age_prediction = layers.Dense(1, name='age')(x)
income_prediction = layers.Dense(num_income_groups, activation='softmax', name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)

model = Model(posts_input,
              [age_prediction, income_prediction, gender_prediction])

In [None]:
# Different loss functions for different tasks
# Again, there are two options for that.

#model.compile(optimizer='rmsprop',
#              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'])
model.compile(optimizer='rmsprop',
              loss={'age': 'mse',
                    'income': 'categorical_crossentropy',
                    'gender': 'binary_crossentropy'})

* Note that very imbalanced loss contributions will cause the model representations to be optimized for the task with the largest individual loss.
  * For example, the MSE loss typically takes a value around 3-5, whereas the binary CE loss can be as low as 0.1.
  * To balance the contributions of the different losses, you can assign a weight to each loss.
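
The effect of weighting can be checked with simple arithmetic. The sketch below assumes the quantity being minimized is the weighted sum of the per-task losses. The loss values are hypothetical (the age and gender values follow the magnitudes quoted above, and the income value is invented), and the weights are illustrative.

```python
# Hypothetical per-task loss values.
losses = {'age': 4.0, 'income': 1.5, 'gender': 0.1}

# Illustrative weights that bring the three contributions to a similar scale.
loss_weights = {'age': 0.25, 'income': 1.0, 'gender': 10.0}

# Each task contributes weight * loss to the total.
weighted = {name: loss_weights[name] * value for name, value in losses.items()}
total_loss = sum(weighted.values())

print(weighted)    # {'age': 1.0, 'income': 1.5, 'gender': 1.0}
print(total_loss)  # 3.5
```

Without the weights, the age loss (4.0) would dominate the sum (5.6 in total), so the shared representation would mostly be optimized for the age task.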

In [None]:
#model.compile(optimizer='rmsprop',
#              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'],
#              loss_weights=[0.25, 1., 10.])

model.compile(optimizer='rmsprop',
              loss={'age': 'mse', 
                    'income': 'categorical_crossentropy', 
                    'gender': 'binary_crossentropy'}, 
              loss_weights={'age': 0.25, 
                            'income': 1., 
                            'gender': 10.})

In [None]:
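# Feeding data to a multi-output model.
# posts, age_targets, income_targets, and gender_targets are assumed to be NumPy arrays.

# Option 1: pass the targets as a list, in the same order as the model outputs
model.fit(posts, [age_targets, income_targets, gender_targets],
          epochs=10, batch_size=64)

# Option 2: pass the targets as a dictionary keyed by output layer name
model.fit(posts, {'age': age_targets,
                  'income': income_targets,
                  'gender': gender_targets},
          epochs=10, batch_size=64)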

### Directed acyclic graphs of layers

* Neural networks are allowed to be arbitrary *directed acyclic graphs* of layers.

* Several common neural network components are implemented as graphs.

* **Inception module**
  * Developed by Szegedy et al. in 2013-2014.
  * It consists of a stack of modules that themselves look like small independent networks, split into several parallel branches.
  
    <img src="https://drive.google.com/uc?id=1ZrKSiDl4rLwV_4fHfjAXr9BbIFMBiI0j" width="700">

In [None]:
from tensorflow.keras import layers
from tensorflow.keras import Input

x = Input(shape=(256, 256, 64))

branch_a = layers.Conv2D(128, 1, activation='relu', strides=2)(x) 

branch_b = layers.Conv2D(128, 1, activation='relu')(x) 
branch_b = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_b)

branch_c = layers.AveragePooling2D(3, strides=2, padding='same')(x) 
branch_c = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_c)

branch_d = layers.Conv2D(128, 1, activation='relu')(x) 
branch_d = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_d) 
branch_d = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_d)

output = layers.concatenate( [branch_a, branch_b, branch_c, branch_d], axis=-1)
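
For the four branches to be concatenated along the channel axis, their spatial sizes must match. The sketch below checks this with the usual output-size rules for `'valid'` and `'same'` padding, assuming no dilation. The helper `conv_output_size` is a hypothetical name introduced here, not a Keras function.

```python
import math

def conv_output_size(size, kernel, stride=1, padding='valid'):
    """Spatial output size of a convolution or pooling layer along one axis."""
    if padding == 'same':
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1

size = 256

# branch_a: 1x1 convolution with stride 2 (default 'valid' padding)
a = conv_output_size(size, 1, stride=2)

# branch_b: 1x1 convolution, then 3x3 convolution with stride 2 and 'same' padding
b = conv_output_size(conv_output_size(size, 1), 3, stride=2, padding='same')

# branch_c: 3x3 average pooling with stride 2, then 3x3 convolution, both 'same'
c = conv_output_size(conv_output_size(size, 3, stride=2, padding='same'), 3, padding='same')

# branch_d: 1x1 convolution, 3x3 'same' convolution, 3x3 'same' convolution with stride 2
d = conv_output_size(size, 1)
d = conv_output_size(d, 3, padding='same')
d = conv_output_size(d, 3, stride=2, padding='same')

print(a, b, c, d)  # 128 128 128 128

# Each branch ends with 128 filters, so concatenation yields 4 * 128 channels.
channels = 4 * 128
print(channels)    # 512
```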

* **Residual connections**

  * Introduced by He et al. in late 2015.
  * They tackle two common problems of any large-scale deep learning model:
    * vanishing gradients and representational bottlenecks.
  * A residual connection makes the output of an earlier layer available as input to a later layer by creating a shortcut.
  * Rather than being concatenated to the later activation, the earlier output is summed with the later activation.

In [None]:
from tensorflow.keras import layers

x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x) 
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y) 
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)

y = layers.add([y,x])

In [None]:
# if the feature map sizes differ

from tensorflow.keras import layers

x = ...
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x) 
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y) 
y = layers.MaxPooling2D(2, strides=2)(y)

residual = layers.Conv2D(128, 1, strides=2, padding='same')(x)

y = layers.add([y, residual])
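
Since `x` is left unspecified above, here is a minimal runnable sketch of the second case. It assumes a 32x32 input feature map with 64 channels, which is an illustrative choice, and the name `residual_model` is introduced here only to inspect the output shape.

```python
from tensorflow.keras import Input, Model, layers

# Assumed input: a 32x32 feature map with 64 channels (illustrative sizes).
inputs = Input(shape=(32, 32, 64))

y = layers.Conv2D(128, 3, activation='relu', padding='same')(inputs)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)  # halves the spatial size to 16x16

# A 1x1 convolution with stride 2 brings the input to the same shape as y.
residual = layers.Conv2D(128, 1, strides=2, padding='same')(inputs)

# Both tensors are now (16, 16, 128), so they can be summed element-wise.
outputs = layers.add([y, residual])

residual_model = Model(inputs, outputs)
print(residual_model.output_shape)  # (None, 16, 16, 128)
```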

### Layer weight sharing

* With the functional API, we can reuse a layer instance several times.

* For example, consider a model that attempts to assess the semantic similarity between two sentences.
  * In this setup, the two input sentences are interchangeable.
  * We call this a *Siamese* LSTM.

In [None]:
from tensorflow.keras import layers 
from tensorflow.keras import Input 
from tensorflow.keras.models import Model

lstm = layers.LSTM(32)

left_input = Input(shape=(None, 128))
left_output = lstm(left_input) # shares weights with the call below; calling layers.LSTM(32)(left_input) instead would create a separate layer with its own weights

right_input = Input(shape=(None, 128))
right_output = lstm(right_input)

merged = layers.concatenate([left_output, right_output], axis=-1) 
predictions = layers.Dense(1, activation='sigmoid')(merged)    

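model = Model([left_input, right_input], predictions) 
# The model must be compiled before training; the optimizer and loss below are assumptions.
model.compile(optimizer='rmsprop', loss='binary_crossentropy')
# left_data, right_data, and targets are assumed to be NumPy arrays prepared elsewhere.
model.fit([left_data, right_data], targets)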
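
One way to see the effect of sharing is to count parameters. The sketch below builds the same two-input model twice, once with a single shared LSTM and once with two separate LSTM layers. The helper `build_similarity_model` is a hypothetical name used only for this comparison.

```python
from tensorflow.keras import Input, Model, layers

def build_similarity_model(shared):
    """Build the two-input model with either one shared LSTM or two separate ones."""
    left_input = Input(shape=(None, 128))
    right_input = Input(shape=(None, 128))
    if shared:
        lstm = layers.LSTM(32)  # one layer instance, called twice
        left_output, right_output = lstm(left_input), lstm(right_input)
    else:
        left_output = layers.LSTM(32)(left_input)    # independent weights
        right_output = layers.LSTM(32)(right_input)  # independent weights
    merged = layers.concatenate([left_output, right_output], axis=-1)
    predictions = layers.Dense(1, activation='sigmoid')(merged)
    return Model([left_input, right_input], predictions)

shared_params = build_similarity_model(shared=True).count_params()
separate_params = build_similarity_model(shared=False).count_params()

print(shared_params)    # 20673
print(separate_params)  # 41281
```

The difference, 20,608, is exactly the parameter count of one `LSTM(32)` layer on 128-dimensional inputs: 4 * (32 * (128 + 32) + 32).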

## 7.2 Inspecting and monitoring models: Callbacks and TensorBoard

### Using callbacks to act on a model during training

* Ex. we want to stop training when the validation loss is no longer improving.

* A callback is an object that is passed to the model in the call to `fit` and that is called by the model at various points during training.
  * It has access to all the available data about the state of the model and its performance.
  * It can take action: interrupt training, save a model, load a different weight set, or otherwise alter the state of the model
  
* Some examples of ways to use callbacks:
  * Model checkpointing - saving the current weights of the model at different points during training
  * Early stopping - interrupting training when the validation loss is no longer improving
  * Dynamically adjusting the value of certain parameters during training - such as the learning rate of the optimizer
  * Logging training and validation metrics, or visualizing the representations learned by the model as they're updated - the Keras progress bar is a callback.
  
* The list of built-in callbacks
  * https://keras.io/callbacks/

* **ModelCheckpoint** and **EarlyStopping** callbacks

  * `EarlyStopping` - interrupt training once a target metric being monitored has stopped improving for a fixed number of epochs
  
  * `ModelCheckpoint` - continually save the model during training

In [None]:
from tensorflow.keras import callbacks

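# EarlyStopping: stop once the monitored metric has not improved for `patience` epochs.
# Monitoring 'acc' tracks training accuracy and requires metrics=['acc'] in compile.
callbacks_list = [callbacks.EarlyStopping(monitor='acc', 
                                          patience=1),
                  # ModelCheckpoint: overwrite the file only when val_loss improves.
                  callbacks.ModelCheckpoint(filepath='my_model.h5', 
                                            monitor='val_loss', 
                                            save_best_only=True)]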

model.compile(...)

model.fit(x, y,
          epochs=10,
          batch_size=32,
          callbacks=callbacks_list,
          validation_data=(x_val, y_val))
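
The role of `patience` can be illustrated without Keras. The function below is a simplified sketch of the patience logic for a monitored quantity that should decrease, such as a validation loss. It is not the actual Keras implementation, and `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return the index of the epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch.
history = [0.90, 0.70, 0.65, 0.66, 0.64, 0.67, 0.68]

print(early_stopping_epoch(history, patience=1))  # 3
print(early_stopping_epoch(history, patience=2))  # 6
```

With `patience=1`, training stops at the first epoch that fails to improve, which would miss the better loss reached at epoch 4. A larger patience tolerates such temporary setbacks.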

* **ReduceLROnPlateau** callback

  * `ReduceLROnPlateau` - reduce the learning rate when the validation loss has stopped improving

In [None]:
callbacks_list = [callbacks.ReduceLROnPlateau(monitor='val_loss',
                                              factor=0.1,
                                              patience=10)]

model.compile(...)

model.fit(x, y,
          epochs=10,
          batch_size=32,
          callbacks=callbacks_list,
          validation_data=(x_val, y_val))
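
The behavior of `factor` and `patience` can be sketched in plain Python. This is a simplified illustration under the assumption that the learning rate is multiplied by `factor` whenever the monitored loss has not improved for `patience` epochs. It is not the Keras implementation, and the function name, initial learning rate, and loss values are hypothetical.

```python
def reduce_lr_schedule(val_losses, initial_lr=1.0, factor=0.5, patience=2):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float('inf')
    wait = 0  # consecutive epochs without improvement
    learning_rates = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # plateau detected: shrink the learning rate
                wait = 0
        learning_rates.append(lr)
    return learning_rates

# Hypothetical validation losses with two plateaus.
plateau_losses = [0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7]
print(reduce_lr_schedule(plateau_losses))
# [1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.25]
```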

* Writing your own callback

  * Callbacks are implemented by subclassing the class `keras.callbacks.Callback`.
  
  * You can then implement any number of the named methods, which are called at various points during training.
    * `on_epoch_begin` and `on_epoch_end`
    * `on_batch_begin` and `on_batch_end`
    * `on_train_begin` and `on_train_end`

  * Additionally, the callback has access to the following attributes.
    * `self.model` - the model instance 
    * `self.validation_data` - what was passed to `fit` as validation data
    
  * Example: a custom callback that saves to disk the activations of every layer of the model at the end of every epoch, computed on the first sample of the validation set.

In [None]:
from tensorflow.keras import callbacks, models
import numpy as np

class ActivationLogger(callbacks.Callback):
  
  def set_model(self, model):
    self.model = model
    layer_outputs = [layer.output for layer in model.layers]
    self.activations_model = models.Model(model.input, layer_outputs)
    
  def on_epoch_end(self, epoch, logs=None):
    if self.validation_data is None:
      raise RuntimeError('Requires validation_data.')
    validation_sample = self.validation_data[0][0:1]
    activations = self.activations_model.predict(validation_sample)
    with open('activations_at_epoch' + str(epoch) + '.npz', 'wb') as f:  # np.savez needs a binary file
      np.savez(f, *activations)  # save one array per layer
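
As a smaller illustration of the hook methods listed above, here is a minimal sketch of a callback that records the training loss at the end of every epoch. The class name `LossHistory`, the tiny model, and the random data are all hypothetical and serve only to exercise the callback.

```python
import numpy as np
from tensorflow.keras import callbacks, layers, models

class LossHistory(callbacks.Callback):
    """Hypothetical callback that records the training loss after each epoch."""

    def on_train_begin(self, logs=None):
        self.losses = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.losses.append(logs.get('loss'))

# Tiny model and random data, used only to run the callback.
tiny_model = models.Sequential([layers.Dense(4, activation='relu'),
                                layers.Dense(1)])
tiny_model.compile(optimizer='rmsprop', loss='mse')

x_demo = np.random.random((32, 8))
y_demo = np.random.random((32, 1))

loss_history = LossHistory()
tiny_model.fit(x_demo, y_demo, epochs=3, batch_size=8,
               callbacks=[loss_history], verbose=0)

print(len(loss_history.losses))  # 3
```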

### TensorBoard: the TensorFlow visualization framework

* Keep in mind that you need frequent feedback about what's going on inside your models during your experiments to develop good models.

* TensorBoard is a browser-based visualization tool that comes packaged with TensorFlow.
  * It helps you visually monitor everything that goes on inside your model during training.
    * Visually monitoring metrics during training
    * Visualizing the model architecture
    * Visualizing histograms of activations and gradients
    * Exploring embeddings in 3D
    
* Using TensorBoard in a Colab environment
  * Refer to https://www.tensorflow.org/tensorboard/tensorboard_in_notebooks

In [None]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

# make log directory
import os
log_dir = '/content/gdrive/My Drive/exp/logs/imdb_trial_03'

if not os.path.exists(log_dir):
  os.makedirs(log_dir)

Mounted at /content/gdrive


In [None]:
%load_ext tensorboard

In [None]:
%tensorboard --logdir /content/gdrive/My\ Drive/exp/logs/imdb_trial_03

In [None]:
import datetime

from tensorflow.keras import models 
from tensorflow.keras import layers 
from tensorflow.keras import callbacks
from tensorflow.keras import optimizers
from tensorflow.keras.datasets import imdb 
from tensorflow.keras.preprocessing import sequence

max_features = 10000
max_len = 500

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features) 
x_train = sequence.pad_sequences(x_train, maxlen=max_len) 
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

model = models.Sequential() 
model.add(layers.Embedding(max_features, 128, 
                           input_length=max_len,
                           name='embed'))
model.add(layers.Conv1D(32, 7, activation='relu')) 
model.add(layers.MaxPooling1D(5)) 
model.add(layers.Conv1D(32, 7, activation='relu')) 
model.add(layers.GlobalMaxPooling1D()) 
model.add(layers.Dense(1, activation='sigmoid')) 
model.summary() 
model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4), 
              loss='binary_crossentropy', 
              metrics=['acc'])

# Use a distinct name so that the `callbacks` module is not shadowed.
callbacks_list = [callbacks.TensorBoard(log_dir, histogram_freq=1)]

history = model.fit(x_train, y_train,
                    epochs=20,
                    batch_size=128,
                    validation_split=0.2,
                    callbacks=callbacks_list)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embed (Embedding)           (None, 500, 128)          1280000   
                                                                 
 conv1d_5 (Conv1D)           (None, 494, 32)           28704     
                                                                 
 max_pooling1d_2 (MaxPooling  (None, 98, 32)           0         
 1D)                                                             
                                                                 
 conv1d_6 (Conv1D)           (None, 92, 32)            7200      
                                                                 
 global_max_pooling1d_1 (Glo  (None, 32)               0         
 balMaxPooling1D)                                                
                             



Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## 7.3 Getting the most out of your models

### Advanced architecture patterns

* **Batch normalization** (normalizing the activations between layers)

  * We have seen that data should be normalized before being fed into a network.
  
    ```python
    normalized_data = (data - np.mean(data, axis=...)) / np.std(data, axis=...)
    ```
  
  * But data normalization should be a concern after every transformation operated by the network.
  
  * Batch normalization is a type of layer introduced in 2015 by Ioffe and Szegedy.
  
      <img src="https://drive.google.com/uc?id=1BZw_m1lVnfTaNK7blmmvzSgw1svwMXNN" width="700">
  
      <img src="https://drive.google.com/uc?id=1BVRTwGj3mNzdMOnKOG6_dd-gYn6Qoem2" width="900">
    
  * Why does it have learnable scale and shift parameters?
  
    <img src="https://drive.google.com/uc?id=1BfD-Cm7NIvh_MwoML0E0BDzx88RORVJX" width="700">
  

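Why do we use batch normalization here?<br>
Consider a dense layer: the layer applies its activation and produces some output values. Suppose these outputs follow some shape or distribution.<br>
Without normalization, many parameters have to be finely adjusted to produce that distribution.<br>
Rather than taking that hard route, it is much easier to first produce a small zero-mean distribution and then stretch and shift it.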
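
The computation can be sketched in NumPy. This is a minimal illustration of the training-time forward pass only, assuming per-feature statistics over the batch axis. The function name `batch_norm`, the `eps` value, and the random data are assumptions made here. At inference time, Keras' `BatchNormalization` layer uses moving averages of the statistics instead of batch statistics.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then apply learnable scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly zero mean, unit variance
    return gamma * x_hat + beta              # stretch by gamma, shift by beta

rng = np.random.default_rng(0)
# Hypothetical activations: 128 samples, 4 features, far from zero mean.
activations = rng.normal(loc=5.0, scale=3.0, size=(128, 4))

# With gamma = 1 and beta = 0, the output is simply standardized.
normalized = batch_norm(activations, gamma=np.ones(4), beta=np.zeros(4))

# With other values, the layer can recover any mean and scale it needs.
rescaled = batch_norm(activations, gamma=np.full(4, 2.0), beta=np.ones(4))

print(normalized.mean(axis=0), normalized.std(axis=0))  # close to 0 and 1
print(rescaled.mean(axis=0), rescaled.std(axis=0))      # close to 1 and 2
```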

In [None]:
# Option 1: batch normalization after the activation
conv_model.add(layers.Conv2D(32, 3, activation='relu')) 
conv_model.add(layers.BatchNormalization())

dense_model.add(layers.Dense(32, activation='relu')) 
dense_model.add(layers.BatchNormalization())

# Option 2: batch normalization before the activation
conv_model.add(layers.Conv2D(32, 3)) 
conv_model.add(layers.BatchNormalization())
conv_model.add(layers.Activation('relu'))

dense_model.add(layers.Dense(32)) 
dense_model.add(layers.BatchNormalization())
dense_model.add(layers.Activation('relu'))

### Hyperparameter optimization

* When building a deep learning model, you have to make many decisions:
  * How many layers?
  * How many units or filters?
  * Which activation?
  * And many more.
  
* These architecture-level parameters are called *hyperparameters*.

* We need to explore the space of possible decisions automatically, systematically, in a principled way.

* The process of optimizing hyperparameters:
  * Choose a set of hyperparameters.
  * Build the corresponding model.
  * Fit it to your training data, and measure the final performance on the validation data.
  * Choose the next set of hyperparameters to try.
  * Repeat.
  * Eventually, measure performance on the test data.
  
* In practice, random search (repeatedly choosing hyperparameters to evaluate at random) often turns out to be the best solution, despite being the most naive one.
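
The loop described above can be sketched with random search as the selection strategy. Everything in this example is hypothetical: the search space, the function names, and in particular `validation_score`, which is a toy formula standing in for building a model, fitting it, and measuring its performance on the validation data.

```python
import random

# Hypothetical search space of architecture-level hyperparameters.
search_space = {
    'num_layers': [1, 2, 3, 4],
    'units': [16, 32, 64, 128],
    'activation': ['relu', 'tanh'],
}

def validation_score(config):
    """Toy stand-in for training a model and returning its validation score."""
    score = 0.5 + 0.05 * config['num_layers'] + config['units'] / 1000
    if config['activation'] == 'relu':
        score += 0.02
    return score

def random_search(space, evaluate, num_trials=20, seed=0):
    """Sample configurations at random and keep the best one seen."""
    rng = random.Random(seed)
    best_config, best_score = None, float('-inf')
    for _ in range(num_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(search_space, validation_score)
print(best_config, round(best_score, 3))
```

In a real experiment, the winning configuration would be evaluated once more on the test data, since the search itself has been tuned against the validation set.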

### Model ensembling

* Ensembling consists of pooling together the predictions of a set of different models, to produce better predictions.

* It relies on the assumption that different good models trained independently are likely to be good for different reasons.

* The easiest way to pool the predictions of a set of classifiers is to average their predictions at inference time.

  ```python
  preds_a = model_a.predict(x_val)
  preds_b = model_b.predict(x_val)
  preds_c = model_c.predict(x_val)
  preds_d = model_d.predict(x_val)
  
  final_preds = 0.25 * (preds_a + preds_b + preds_c + preds_d)
  ```
  
* Or if you know which classifier is better, you can use a weighted average.

  ```python
  preds_a = model_a.predict(x_val)
  preds_b = model_b.predict(x_val)
  preds_c = model_c.predict(x_val)
  preds_d = model_d.predict(x_val)
  
  final_preds = w_a * preds_a + w_b * preds_b + w_c * preds_c + w_d * preds_d
  ```
  
* The key to making ensembling work is the diversity of the set of classifiers.
  * Ensemble models should be as good as possible while being as different as possible.
  * Use very different architectures or even different ML approaches.
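
As one possible way to choose the weights in a weighted average, the sketch below sets each weight proportional to the model's validation accuracy. This weighting rule is an assumption made for illustration, not the only option, and the prediction arrays and labels are invented values for a binary classification task.

```python
import numpy as np

# Hypothetical validation labels and predicted probabilities from three models.
y_val = np.array([1, 0, 1, 0])
preds = {
    'a': np.array([0.9, 0.2, 0.6, 0.4]),
    'b': np.array([0.8, 0.4, 0.4, 0.3]),
    'c': np.array([0.6, 0.1, 0.7, 0.6]),
}

def accuracy(probabilities, labels):
    """Fraction of samples classified correctly at a 0.5 threshold."""
    return np.mean((probabilities > 0.5).astype(int) == labels)

# Weights proportional to validation accuracy, normalized to sum to 1.
accuracies = np.array([accuracy(p, y_val) for p in preds.values()])
weights = accuracies / accuracies.sum()

# Weighted average of the predictions.
final_preds = sum(w * p for w, p in zip(weights, preds.values()))

print(weights)      # approximately [0.4, 0.3, 0.3]
print(final_preds)  # approximately [0.78, 0.23, 0.57, 0.43]
```

Because the weights here are derived from the same validation data used to report accuracy, the resulting score is optimistic; a separate held-out set gives a fairer estimate.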