In [None]:
"""
Beyond the Sequential model:

1. Some networks require several independent inputs, others require multiple outputs, and some networks have internal branching between
   layers that makes them look like graphs of layers rather than linear stacks of layers.

2. Some tasks, for instance, require multimodal inputs: they merge data coming from different input sources, processing each type of data
   using different kinds of neural layers. Imagine a deep-learning model trying to predict the most likely market price of a second-hand
   piece of clothing, using the following inputs: user-provided metadata (such as the item’s brand, age, and so on), a user-provided text
   description, and a picture of the item. If you had only the metadata available, you could one-hot encode it and use a densely connected
   network to predict the price. If you had only the text description available, you could use an RNN or a 1D convnet. If you had only
   the picture, you could use a 2D convnet. But how can you use all three at the same time? A naive approach would be to train three
   separate models and then do a weighted average of their predictions. But this may be suboptimal, because the information extracted by
   the models may be redundant. A better way is to jointly learn a more accurate model of the data by using a model that can see all
   available input modalities simultaneously: a model with three input branches.

3. Similarly, some tasks need to predict multiple target attributes of input data. Given the text of a novel or short story, you might want
   to automatically classify it by genre (such as romance or thriller) but also predict the approximate date it was written. Of course,
   you could train two separate models: one for the genre and one for the date. But because these attributes aren’t statistically
   independent, you could build a better model by learning to jointly predict both genre and date at the same time. Such a joint model would
   then have two outputs, or heads. Due to correlations between genre and date, knowing the date of a novel would help the model learn rich,
   accurate representations of the space of novel genres, and vice versa.

4. Additionally, many recently developed neural architectures require nonlinear network topology: networks structured as directed acyclic
   graphs. The Inception family of networks for instance, relies on Inception modules, where the input is processed by several parallel
   convolutional branches whose outputs are then merged back into a single tensor. There’s also the recent trend of adding residual
   connections to a model, which started with the ResNet family of networks. A residual connection consists of reinjecting previous
   representations into the downstream flow of data by adding a past output tensor to a later output tensor, which helps prevent information
   loss along the data-processing flow.

5. These three important use cases — multi-input models, multi-output models, and graph-like models — aren’t possible when using only
   the Sequential model class in Keras. But there’s another far more general and flexible way to use Keras: the functional API. 
"""

In [1]:
"""
Functional API

1. In the functional API, you directly manipulate tensors, and you use layers as functions that take tensors and return tensors

2. The only part that may seem a bit magical at this point is instantiating a Model object using only an input tensor and an output tensor.
   Behind the scenes, Keras retrieves every layer involved in going from input_tensor to output_tensor, bringing them together into
   a graph-like data structure—a Model. Of course, the reason it works is that output_tensor was obtained by repeatedly transforming
   input_tensor.

3. When it comes to compiling, training, or evaluating such an instance of Model, the API is the same as that of Sequential:
"""
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras import layers
from tensorflow.keras import Input

# sequence model
seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))
print("seq_mode: ", seq_model.summary())

# function api
input_tensor = Input(shape=(64,))
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)
model = Model(input_tensor, output_tensor)
print("function api: ", model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 32)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 32)                1056      
_________________________________________________________________
dense_2 (Dense)              (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________
seq_mode:  None
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 32)                2080      
__________________________________________________________

In [4]:
"""
Multi-input Models

1. The functional API can be used to build models that have multiple inputs. Typically, such models at some point merge their different input
   branches using a layer that can combine several tensors: by adding them, concatenating them, and so on. This is usually done via a Keras
   merge operation such as keras.layers.add, keras.layers.concatenate, and so on.

2. A typical question-answering model has two inputs: a natural-language question and a text snippet (such as a news article) providing
   information to be used for answering the question. The model must then produce an answer: in the simplest possible setup, this is a
   one-word answer obtained via a softmax over some predefined vocabulary.
"""
import numpy as np


text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

# embed the text input text
text_input = Input(shape=(None,), dtype='int32', name='text')
embedded_text = layers.Embedding(64, text_vocabulary_size)(text_input)
encoded_text = layers.LSTM(32)(embedded_text)

# embed the question input text
question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(32, question_vocabulary_size)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

# concatenates the encoded question and encoded text, and then do the softmax
concatenated = layers.concatenate([encoded_text, encoded_question],axis=-1)
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

# specify the two inputs and one ouput at model instantiation
model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

print(model.summary())

"""
# feeding data to a multi-input model
num_samples = 1000
max_length = 100

text     = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))
question = np.random.randint(1, question_vocabulary_size, size=(num_samples, max_length))
answers  = np.random.randint(0, 1, size=(num_samples, answer_vocabulary_size))

#model.fit([text, question], answers, epochs=10, batch_size=128)
model.fit({'text': text, 'question': question},
          answers,
          epochs=10,
          batch_size=128)
"""

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               (None, None)         0                                            
__________________________________________________________________________________________________
question (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, None, 10000)  640000      text[0][0]                       
__________________________________________________________________________________________________
embedding_5 (Embedding)         (None, None, 10000)  320000      question[0][0]                   
__________________________________________________________________________________________________
lstm_4 (LS

In [5]:
"""
Multi-output models

1. In the same way, you can use the functional API to build models with multiple outputs (or multiple heads). A simple example is a network
   that attempts to simultaneously predict different properties of the data, such as a network that takes as input a series of social media
   posts from a single anonymous person and tries to predict attributes of that person, such as age, gender, and income level.

2. Importantly, training such a model requires the ability to specify different loss functions for different heads of the network: for
   instance, age prediction is a scalar regression task, but gender prediction is a binary classification task, requiring a different
   training procedure. But because gradient descent requires you to minimize a scalar, you must combine these losses into a single value
   in order to train the model. The simplest way to combine different losses is to sum them all. In Keras, you can use either a list or
   a dictionary of losses in compile to specify different objects for different outputs; the resulting loss values are summed into a global
   loss, which is minimized during training.

3. Note that very imbalanced loss contributions will cause the model representations to be optimized preferentially for the task with
   the largest individual loss, at the expense of the other tasks. To remedy this, you can assign different levels of importance to the
   loss values in their contribution to the final loss. This is useful in particular if the losses’ values use different scales.
   For instance, the mean squared error (MSE) loss used for the age-regression task typically takes a value around 3–5, whereas the
   crossentropy loss used for the gender-classification task can be as low as 0.1. In such a situation, to balance the contribution
   of the different losses, you can assign a weight of 10 to the crossentropy loss and a weight of 0.25 to the MSE loss.
"""
vocabulary_size = 50000
num_income_groups = 10

posts_input = Input(shape=(None,), dtype='int32', name='posts')
embedded_posts = layers.Embedding(256, vocabulary_size)(posts_input)
x = layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

age_prediction = layers.Dense(1, name='age')(x)
income_prediction = layers.Dense(num_income_groups, activation='softmax', name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)

model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])

# multiple losses
# model.compile(optimizer='rmsprop', loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'])
# model.compile(optimizer='rmsprop',
#               loss={'age': 'mse',
#                     'income': 'categorical_crossentropy',
#                     'gender': 'binary_crossentropy'})

# loss weight
# model.compile(optimizer='rmsprop', loss={'age': 'mse', 'income': 'categorical_crossentropy', 'gender': 'binary_crossentropy'},
#                                    loss_weights={'age': 0.25, 'income': 1., 'gender': 10.})
model.compile(optimizer='rmsprop',
              loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'],
              loss_weights=[0.25, 1., 10.])

print(model.summary())

"""
# feed the data
# model.fit(posts, [age_targets, income_targets, gender_targets], epochs=10, batch_size=64)
model.fit(posts, 
          {'age': age_targets,
           'income': income_targets,
           'gender': gender_targets},
           epochs=10,
           batch_size=64)
"""

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
posts (InputLayer)              (None, None)         0                                            
__________________________________________________________________________________________________
embedding_6 (Embedding)         (None, None, 50000)  12800000    posts[0][0]                      
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, None, 128)    32000128    embedding_6[0][0]                
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, None, 128)    0           conv1d[0][0]                     
__________________________________________________________________________________________________
conv1d_1 (

"\n# feed the data\n# model.fit(posts, [age_targets, income_targets, gender_targets], epochs=10, batch_size=64)\nmodel.fit(posts, \n          {'age': age_targets,\n           'income': income_targets,\n           'gender': gender_targets},\n           epochs=10,\n           batch_size=64)\n"

In [None]:
"""
Directed acyclic graphs of layers (1)

1. With the functional API, not only can you build models with multiple inputs and multiple outputs, but you can also implement networks
   with a complex internal topology. Neural networks in Keras are allowed to be arbitrary directed acyclic graphs of layers. The qualifier
   acyclic is important: these graphs can’t have cycles. It’s impossible for a tensor x to become the input of one of the layers that
   generated x. The only processing loops that are allowed (that is, recurrent connections) are those internal to recurrent layers.

2. Several common neural-network components are implemented as graphs. Two notable ones are Inception modules and residual connections.

3. The purpose of 1 × 1 convolutions: The convolutions extract spatial patches around every tile in an input tensor and apply the same
   transformation to each patch. An edge case is when the patches extracted consist of a single tile. The convolution operation then
   becomes equivalent to running each tile vector through a Dense layer: it will compute features that mix together information from
   the channels of the input tensor, but it won’t mix information across space (because it’s looking at one tile at a time). Such
   1 × 1 convolutions (also called pointwise convolutions) are featured in Inception modules, where they contribute to factoring out
   channel-wise feature learning and spacewise feature learning—a reasonable thing to do if you assume that each channel is highly
   autocorrelated across space, but different channels may not be highly correlated with each other.
"""

branch_a = layers.Conv2D(128, 1, activation='relu', strides=2)(x)

branch_b = layers.Conv2D(128, 1, activation='relu')(x)
branch_b = layers.Conv2D(128, 3, activation='relu', strides=2)(branch_b)

branch_c = layers.AveragePooling2D(3, strides=2)(x)
branch_c = layers.Conv2D(128, 3, activation='relu')(branch_c)

branch_d = layers.Conv2D(128, 1, activation='relu')(x)
branch_d = layers.Conv2D(128, 3, activation='relu')(branch_d)
branch_d = layers.Conv2D(128, 3, activation='relu', strides=2)(branch_d)

# concatenate the branch outputs to obtain the module output
output = layers.concatenate([branch_a, branch_b, branch_c, branch_d], axis=-1)

In [None]:
"""
Directed acyclic graphs of layers (2)

1. A residual connection consists of making the output of an earlier layer available as input to a later layer, effectively creating
   a shortcut in a sequential network. Rather than being concatenated to the later activation, the earlier output is summed with the
   later activation, which assumes that both activations are the same size. If they’re different sizes, you can use a linear transformation
   to reshape the earlier activation into the target shape.

2. Representational bottlenecks in deep learning:
   In a Sequential model, each successive representation layer is built on top of the previous one, which means it only has access to
   information contained in the activation of the previous layer. If one layer is too small (for example, it has features that are too
   low-dimensional), then the model will be constrained by how much information can be crammed into the activations of this layer.
   You can grasp this concept with a signal-processing analogy: if you have an audio processing pipeline that consists of a series of
   operations, each of which takes as input the output of the previous operation, then if one operation crops your signal to a low-frequency
   range (for example, 0–15 kHz), the operations downstream will never be able to recover the dropped frequencies. Any loss of information
   is permanent. Residual connections, by reinjecting earlier information downstream, partially solve this issue for deep-learning models.

3. Vanishing gradients in deep learning:
   Backpropagation, the master algorithm used to train deep neural networks, works by propagating a feedback signal from the output loss
   down to earlier layers. If this feedback signal has to be propagated through a deep stack of layers, the signal may become tenuous or
   even be lost entirely, rendering the network untrainable. This issue is known as vanishing gradients. This problem occurs both with
   deep networks and with recurrent networks over very long sequences—in both cases, a feedback signal must be propagated through a
   long series of operations. You’re already familiar with the solution that the LSTM layer uses to address this problem in recurrent
   networks: it introduces a carry track that propagates information parallel to the main processing track. Residual connections work in
   a similar way in feedforward deep networks, but they’re even simpler: they introduce a purely linear information carry track parallel
   to the main layer stack, thus helping to propagate gradients through arbitrarily deep stacks of layers. 
"""

# add
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.add([y, x])

# transform
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)

residual = layers.Conv2D(128, 1, strides=2, padding='same')(x)
y = layers.add([y, residual])

In [7]:
"""
Layer weight sharing

One more important feature of the functional API is the ability to reuse a layer instance several times. When you call a layer instance
twice, instead of instantiating a new layer for each call, you reuse the same weights with every call. This allows you to build models
that have shared branches — several branches that all share the same knowledge and perform the same operations. That is, they share
the same representations and learn these representations simultaneously for different sets of inputs.
"""

# Instantiates a single LSTM layer, once
lstm = layers.LSTM(32)

# Building the left branch of the model
left_input = Input(shape=(None, 128))
left_output = lstm(left_input)

# Building the right branch of the model
right_input = Input(shape=(None, 128))
right_output = lstm(right_input)

# Builds the classifier on top
merged = layers.concatenate([left_output, right_output], axis=-1)
predictions = layers.Dense(1, activation='sigmoid')(merged)

# model
model = Model([left_input, right_input], predictions)
#model.fit([left_data, right_data], targets)
print(model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            (None, None, 128)    0                                            
__________________________________________________________________________________________________
input_5 (InputLayer)            (None, None, 128)    0                                            
__________________________________________________________________________________________________
lstm_7 (LSTM)                   (None, 32)           20608       input_4[0][0]                    
                                                                 input_5[0][0]                    
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 64)           0           lstm_7[0][0]                     
          

In [None]:
"""
Models as layers

1. Importantly, in the functional API, models can be used as you’d use layers—effectively, you can think of a model as a “bigger layer.”
   This is true of both the Sequential and Model classes.

2. When you call a model instance, you’re reusing the weights of the model—exactly like what happens when you call a layer instance.
   Calling an instance, whether it’s a layer instance or a model instance, will always reuse the existing learned representations of the
   instance—which is intuitive.

3.  One simple practical example of what you can build by reusing a model instance is a vision model that uses a dual camera as its input:
    two parallel cameras, a few centimeters (one inch) apart. Such a model can perceive depth, which can be useful in many applications.
    You shouldn’t need two independent models to extract visual features from the left camera and the right camera before merging
    the two feeds. Such low-level processing can be shared across the two inputs: that is, done via layers that use the same weights
    and thus share the same representations. 
"""
from keras import layers
from keras import applications
from keras import Input

xception_base = applications.Xception(weights=None, include_top=False)

left_input = Input(shape=(250, 250, 3))
left_features = xception_base(left_input)

right_input = Input(shape=(250, 250, 3))
right_input = xception_base(right_input)

merged_features = layers.concatenate([left_features, right_input], axis=-1)