### The Functional API
Predict the most likely market price of a second-hand piece of clothing from the following inputs:
1. User-provided metadata, such as the item's brand, age, and so on
2. User-provided text description
3. Picture of the item

With only one of these inputs available, the choice of model is straightforward:
* Only metadata: one-hot encode it and use a densely connected network to predict the price
* Only text description: use an RNN or a 1D convnet
* Only picture: use a 2D convnet

But how would you use all 3 at the same time?
Naive approach: train three separate models, then take a weighted average of their predictions. This is probably suboptimal.
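As a rough illustration of the naive approach, the sketch below averages three price predictions. The prediction values and the weights are hypothetical placeholders, not outputs of any real model.

```python
# Hypothetical price predictions (in dollars) from three independently trained models.
metadata_prediction = 20.0
text_prediction = 26.0
image_prediction = 32.0

# Hypothetical weights; in practice they would be hand-picked or tuned on validation data.
weights = {'metadata': 0.5, 'text': 0.25, 'image': 0.25}

# Each model was trained in isolation, so the average cannot exploit
# interactions between the three input modalities.
ensemble_prediction = (weights['metadata'] * metadata_prediction
                       + weights['text'] * text_prediction
                       + weights['image'] * image_prediction)
print(ensemble_prediction)  # 24.5
```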

Better approach: *jointly* learn a more accurate model of the data by using a model that can see ALL AVAILABLE INPUT modalities
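A joint model over the three inputs could look roughly like the following minimal sketch, assuming Keras is installed. The input names, feature counts, image size, and layer sizes are all hypothetical and chosen only to keep the example small.

```python
from keras import Input, layers
from keras.models import Model

# Hypothetical sizes, chosen only for illustration.
num_metadata_features = 20
vocabulary_size = 10000

# Branch 1: one-hot encoded metadata through a dense layer.
metadata_input = Input(shape=(num_metadata_features,), name='metadata')
metadata_features = layers.Dense(16, activation='relu')(metadata_input)

# Branch 2: variable-length token sequence through an embedding and an LSTM.
text_input = Input(shape=(None,), dtype='int32', name='description')
embedded_text = layers.Embedding(vocabulary_size, 32)(text_input)
text_features = layers.LSTM(16)(embedded_text)

# Branch 3: picture through a small 2D convnet.
image_input = Input(shape=(64, 64, 3), name='picture')
image_features = layers.Conv2D(16, 3, activation='relu')(image_input)
image_features = layers.GlobalMaxPooling2D()(image_features)

# Merge the three branches and regress a single price.
merged = layers.concatenate(
    [metadata_features, text_features, image_features], axis=-1)
price = layers.Dense(1, name='price')(merged)

price_model = Model([metadata_input, text_input, image_input], price)
price_model.compile(optimizer='rmsprop', loss='mse')
```

Because all three branches feed one loss, their weights are updated together during training rather than in three separate runs.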

More examples: "Inception" modules, where the input is processed by several parallel convolutional branches whose outputs are then merged back into a single tensor.

Residual connections: reinject previous representations into the downstream flow of data by adding a past output tensor to a later output tensor



### Side-By-Side - Keras Functional vs Sequential Programming

In [5]:
?layers.Dense

In [9]:
from keras.models import Sequential, Model
from keras import layers
from keras import Input

seq_model = Sequential()
seq_model.add(layers.Dense(32, activation='relu', input_shape=(64,)))
seq_model.add(layers.Dense(32, activation='relu'))
seq_model.add(layers.Dense(10, activation='softmax'))

input_tensor = Input(shape=(64,))
x = layers.Dense(32, activation='relu')(input_tensor)
x = layers.Dense(32, activation='relu')(x)
output_tensor = layers.Dense(10, activation='softmax')(x)

model = Model(input_tensor, output_tensor)
print(model.summary())
seq_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 64)                0         
_________________________________________________________________
dense_35 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_36 (Dense)             (None, 32)                1056      
_________________________________________________________________
dense_37 (Dense)             (None, 10)                330       
Total params: 3,466
Trainable params: 3,466
Non-trainable params: 0
_________________________________________________________________
None
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_32 (Dense)             (None, None, 32)          2080      
_________________________________________________________________
den

In [10]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

import numpy as np
x_train = np.random.random((1000, 64))
y_train = np.random.random((1000, 10))

model.fit(x_train, y_train, epochs=10, batch_size=128)

score = model.evaluate(x_train, y_train)
score

### Multi-input models

In [12]:
?layers.Embedding

In [11]:
from keras.models import Model
from keras import layers
from keras import Input

text_vocabulary_size = 10000
question_vocabulary_size = 10000
answer_vocabulary_size = 500

text_input = Input(shape=(None,), dtype='int32', name='text')
# Embedding(input_dim, output_dim): vocabulary size first, then embedding size
embedded_text = layers.Embedding(text_vocabulary_size, 64)(text_input)
encoded_text = layers.LSTM(32)(embedded_text)

question_input = Input(shape=(None,), dtype='int32', name='question')
embedded_question = layers.Embedding(question_vocabulary_size, 32)(question_input)
encoded_question = layers.LSTM(16)(embedded_question)

concatenated = layers.concatenate([encoded_text, encoded_question], axis=-1)
answer = layers.Dense(answer_vocabulary_size, activation='softmax')(concatenated)

model = Model([text_input, question_input], answer)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               (None, None)         0                                            
__________________________________________________________________________________________________
question (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 64)     640000      text[0][0]                       
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, None, 32)     320000      question[0][0]                   
__________________________________________________________________________________________________
lstm_3 (LS

In [13]:
# feeding data into multi-input model
import numpy as np
num_samples = 1000
max_length = 100

text = np.random.randint(1, text_vocabulary_size, size=(num_samples, max_length))

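question = np.random.randint(1, question_vocabulary_size, size=(num_samples, max_length))  # questions are integer token sequences, like the text

# Answers are one-hot encoded vectors, not integer class indices.
# np.random.randint(0, 1, ...) would always return 0, because the upper bound is exclusive.
answer_indices = np.random.randint(0, answer_vocabulary_size, size=num_samples)
answer = np.eye(answer_vocabulary_size, dtype='float32')[answer_indices]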

model.fit([text, question], answer, epochs=1, batch_size=128)

### Multi-output Models

Simple network that attempts to simultaneously predict different properties of the data.
* example - social media posts used to predict age, gender, income

In [15]:
from keras import layers
from keras import Input
from keras.models import Model

vocabulary_size = 50000
num_income_groups = 10

posts_input = Input(shape=(None,), dtype='int32', name='posts')
# Embedding(input_dim, output_dim): vocabulary size first, then embedding size
embedded_posts = layers.Embedding(vocabulary_size, 256)(posts_input)
x = layers.Conv1D(128, 5, activation='relu')(embedded_posts)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.Conv1D(256, 5, activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation='relu')(x)

age_prediction = layers.Dense(1, name='age')(x)
income_prediction = layers.Dense(num_income_groups, activation='softmax', name='income')(x)
gender_prediction = layers.Dense(1, activation='sigmoid', name='gender')(x)

model = Model(posts_input, [age_prediction, income_prediction, gender_prediction])

model.compile(optimizer='rmsprop', loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'])

### Imbalanced loss contributions
Imbalanced loss contributions cause the model's representations to be optimized preferentially for the task with the largest individual loss, at the expense of the other tasks.

For example:
* MSE loss for age regression typically takes a value around 3-5
* Cross-entropy loss for gender classification can be as low as 0.1


In [16]:
model.compile(optimizer='rmsprop', loss=['mse', 'categorical_crossentropy', 'binary_crossentropy'], 
             loss_weights=[0.25, 1., 10.])

### equivalent: losses and weights keyed by output layer name
model.compile(optimizer='rmsprop', 
             loss={'age': 'mse', 'income': 'categorical_crossentropy', 'gender': 'binary_crossentropy'},
             loss_weights={'age': 0.25, 'income': 1., 'gender': 10.})
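To see why weights of 0.25, 1, and 10 help, the arithmetic can be sketched in plain Python. Keras minimizes the weighted sum of the individual losses. The age and gender magnitudes below follow the rough figures in these notes; the income value is a hypothetical placeholder.

```python
# Assumed raw loss magnitudes: age (MSE around 3-5) and gender (cross-entropy
# around 0.1) follow the notes; the income value is a hypothetical placeholder.
raw_losses = {'age': 4.0, 'income': 1.0, 'gender': 0.1}
loss_weights = {'age': 0.25, 'income': 1.0, 'gender': 10.0}

# Contribution of each task to the total loss that the optimizer minimizes.
weighted = {name: loss_weights[name] * raw_losses[name] for name in raw_losses}
total_loss = sum(weighted.values())

print(weighted)
print(total_loss)
```

With these assumed magnitudes, each task contributes about 1.0 to the total, so no single task dominates the gradient.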

In [None]:
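# posts, age_targets, income_targets, and gender_targets are assumed to be
# NumPy arrays prepared elsewhere; they are not defined in these notes.
model.fit(posts, {'age': age_targets, 'income': income_targets, 'gender': gender_targets}, epochs=10, batch_size=64)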

### Directed, Acyclic Graphs
Both of these words are important
* Directed: every connection has one direction, from a layer's output to the next layer's input
* Acyclic: graphs cannot have CYCLES. It's impossible for tensor X to become the input of one of the layers that generated X
* The only processing loops allowed are those internal to recurrent layers (like LSTM and GRU)
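The acyclic constraint can be illustrated without Keras. The sketch below uses graphlib from the standard library (Python 3.9+) to check a hypothetical layer graph for cycles; the layer names are made up for the example, and this is not how Keras itself validates a model.

```python
from graphlib import TopologicalSorter, CycleError


def is_acyclic(graph):
    """graph maps each layer name to the set of layers whose outputs it consumes."""
    try:
        # static_order() raises CycleError if no valid layer ordering exists.
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# Hypothetical two-branch network: input -> (branch_a, branch_b) -> concat.
two_branch = {
    'input': set(),
    'branch_a': {'input'},
    'branch_b': {'input'},
    'concat': {'branch_a', 'branch_b'},
}

# Invalid variant: branch_a consumes the output of concat, which it helped produce.
with_cycle = dict(two_branch, branch_a={'input', 'concat'})

print(is_acyclic(two_branch))  # True
print(is_acyclic(with_cycle))  # False
```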


### Inception model
An Inception V3-style module is coded below. Notice how there are 4 smaller networks (branches) before the concatenate step
* Network 1: Conv2D 1x1, strides=2
* Network 2: Conv2D 1x1 --> Conv2D 3x3, strides=2
* Network 3: AvgPool2D 3x3, strides=2 --> Conv2D 3x3
* Network 4: Conv2D 1x1 --> Conv2D 3x3 --> Conv2D 3x3, strides=2

The full Inception V3 architecture is available as
keras.applications.inception_v3.InceptionV3
 

In [27]:
?layers.Conv2D

In [2]:
from keras import layers
from keras import Input

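# shape excludes the batch axis, so x is a 4D tensor: (batch, height, width, channels).
# With the default channels_last format: variable height, width 128, 128 channels.
x = Input(shape=(None,128,128), dtype='float32', name='posts')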

branch_a = layers.Conv2D(filters=128, kernel_size=1, activation='relu', strides=2, padding='same')(x)

branch_b = layers.Conv2D(128, 1, activation='relu', padding='same')(x)
branch_b = layers.Conv2D(128, 3, activation='relu', strides=2, padding='same')(branch_b)

branch_c = layers.AveragePooling2D(3, strides=2, padding='same')(x)
branch_c = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_c)

branch_d = layers.Conv2D(128, 1, activation='relu', padding='same')(x)
branch_d = layers.Conv2D(128, 3, activation='relu', padding='same')(branch_d)
branch_d = layers.Conv2D(128, 3, activation='relu', padding='same', strides=2)(branch_d)

output = layers.concatenate([branch_a, branch_b, branch_c, branch_d], axis=-1)
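The concatenation only works because every branch ends with the same spatial size. A small sketch of that arithmetic, assuming a spatial size of 128 and relying on the fact that padding='same' gives an output length of ceil(input / stride):

```python
import math


def same_padding_size(size, stride):
    # With padding='same', output length is ceil(input / stride),
    # regardless of the kernel size.
    return math.ceil(size / stride)


width = 128  # assumed spatial size for this illustration

# Strides of the layers in each branch, in order.
branch_strides = {
    'branch_a': [2],
    'branch_b': [1, 2],
    'branch_c': [2, 1],
    'branch_d': [1, 1, 2],
}

branch_widths = {}
for name, strides in branch_strides.items():
    size = width
    for stride in strides:
        size = same_padding_size(size, stride)
    branch_widths[name] = size

# Each branch ends in a 128-filter layer, so concatenating along the
# channel axis yields 4 * 128 channels.
merged_channels = 4 * 128

print(branch_widths)
print(merged_channels)
```

Each branch applies exactly one stride-2 layer, so all four halve the spatial size to 64.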



### Residual connections
A residual connection consists of making the output of an earlier layer available as input to a later layer, creating a shortcut. 

The example assumes a 4D input tensor x. X will be added BACK to an otherwise sequential model after we've put X through a couple of convolution layers

In [None]:
from keras import layers
y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)

y = layers.add([y, x]) #Adds original x back to the output features

### Example 2
Residual connection when feature-map sizes differ (using downsampling)

In [3]:
from keras import layers

y = layers.Conv2D(128, 3, activation='relu', padding='same')(x)
y = layers.Conv2D(128, 3, activation='relu', padding='same')(y)
y = layers.MaxPooling2D(2, strides=2)(y)

#uses a 1x1 convolution to linearly downsample the original x tensor to the same shape as y
residual = layers.Conv2D(128, 1, strides=2, padding='same')(x) 

y = layers.add([y, residual])
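The add only succeeds when both paths produce the same spatial size. The sketch below works through the size arithmetic for the two downsampling layers used above, assuming Keras defaults (MaxPooling2D uses padding='valid' unless told otherwise). The helper names are made up for this illustration.

```python
import math


def max_pool_valid(size, pool=2, stride=2):
    # padding='valid' (the MaxPooling2D default): floor((size - pool) / stride) + 1
    return (size - pool) // stride + 1


def conv_same(size, stride=2):
    # padding='same': ceil(size / stride), regardless of kernel size
    return math.ceil(size / stride)


for size in (128, 129):
    print(size, max_pool_valid(size), conv_same(size))
```

For an even size such as 128 both paths give 64. For an odd size such as 129 the pooled path gives 64 while the 1x1 convolution gives 65, so the shapes would no longer match; passing padding='same' to the pooling layer is one way to keep them aligned.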

### Representational Bottlenecks in Machine learning

* Sequential model: each layer is built on top of the previous one and only sees what that layer outputs. Any information lost along the way is permanent
* Residual model: reinjects earlier representations into later layers, which partially solves the information-loss problem

### Vanishing gradients in Deep learning
Backpropagation works by propagating a feedback signal from the output loss down to earlier layers.
If this signal has to be propagated through a deep stack of layers, it may become tenuous or be lost entirely.

* LSTM: introduces a carry track that propagates information parallel to the main processing track.
* Residual: introduces a purely linear information carry track parallel to the main processing track.
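A toy scalar sketch of the effect, under strong simplifying assumptions: every layer has the same local derivative, and the value 0.1 and depth 10 are hypothetical. A plain layer y = f(x) scales the gradient by f'(x), while a residual layer y = x + f(x) scales it by 1 + f'(x) because of the identity path.

```python
def gradient_through_stack(local_derivative, depth, residual=False):
    # Chain rule: the gradient reaching the first layer is the product
    # of the per-layer factors.
    factor = 1.0 + local_derivative if residual else local_derivative
    return factor ** depth


plain = gradient_through_stack(0.1, depth=10)
with_residual = gradient_through_stack(0.1, depth=10, residual=True)

print(plain)
print(with_residual)
```

In this toy setting the plain stack shrinks the signal to roughly 1e-10, while the residual stack keeps it above 1. Real networks are not this uniform, so the numbers only indicate the direction of the effect.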

### Layer weight sharing
(This is very important to me, I think, in trying to get the Siamese neural network to work)
Reuse a layer instance several times. When you call a layer instance twice, you reuse the same weights with every call. 

This allows you to build models that have shared branches: several branches that all share the same knowledge and perform the same operations. They share the same representations and learn these representations SIMULTANEOUSLY for different sets of inputs.

Consider a model that attempts to assess the semantic similarity between two sentences. The model has two inputs (the two sentences to compare) and outputs a score between 0 and 1, where 0 means unrelated sentences and 1 means sentences that are either identical or reformulations of each other. 

In this setup, the two input sentences are interchangeable, because semantic similarity is a symmetric relationship: the similarity of A to B is identical to the similarity of B to A. For this reason, it doesn't make sense to learn two independent models, one per input sentence. Instead, both sentences are processed by a single LSTM layer.

This is called a Siamese LSTM or shared LSTM model.

In [7]:
from keras import layers
from keras import Input
from keras.models import Model

lstm = layers.LSTM(32) #instantiating a layer

left_input = Input(shape=(None, 128))
left_output = lstm(left_input) #using that previously instantiated layer first time

right_input = Input(shape=(None, 128))
right_output = lstm(right_input) #REUSING the previously instantiated layer :)

merged = layers.concatenate([left_output, right_output], axis=-1)
predictions = layers.Dense(1, activation='sigmoid')(merged)

model = Model([left_input, right_input], predictions)
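model.compile(optimizer='rmsprop', loss='binary_crossentropy') #fit() requires a compiled model

#left_data and right_data are assumed to be arrays of shape (samples, timesteps, 128) and
#targets similarity scores between 0 and 1; none of them are defined in these notes
model.fit([left_data, right_data], targets)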
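One way to confirm that the two calls really share weights is to compare parameter counts. The sketch below, assuming Keras is installed, builds the shared model next to a comparison model with two separate LSTM instances. The names shared_model and separate_model are made up for this illustration.

```python
from keras import Input, layers
from keras.models import Model

left_input = Input(shape=(None, 128))
right_input = Input(shape=(None, 128))

# Siamese variant: one LSTM instance called twice, so both branches share weights.
shared_lstm = layers.LSTM(32)
shared_merged = layers.concatenate(
    [shared_lstm(left_input), shared_lstm(right_input)], axis=-1)
shared_model = Model([left_input, right_input],
                     layers.Dense(1, activation='sigmoid')(shared_merged))

# Comparison variant: two LSTM instances, so each branch learns its own weights.
separate_merged = layers.concatenate(
    [layers.LSTM(32)(left_input), layers.LSTM(32)(right_input)], axis=-1)
separate_model = Model([left_input, right_input],
                       layers.Dense(1, activation='sigmoid')(separate_merged))

# One LSTM(32) on 128 features: 4 gates * (input + recurrent + bias weights).
lstm_params = 4 * (128 * 32 + 32 * 32 + 32)
# Dense(1) on the 64 concatenated features, plus one bias.
dense_params = 64 * 1 + 1

print(shared_model.count_params(), lstm_params + dense_params)
print(separate_model.count_params(), 2 * lstm_params + dense_params)
```

The shared model carries a single set of LSTM weights, so its parameter count is lower than the comparison model's by exactly one LSTM's worth.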

### Models AS layers
You can think of a model as a bigger layer. True of both Sequential and Model classes. You can call a model on an input tensor and retrieve an output tensor.

y = model(x)

If the model has multiple input tensors and multiple output tensors, call it with a list of tensors:

y1, y2 = model([x1, x2])

### Siamese Vision Model
Two parallel cameras, a few inches apart. You don't need two SEPARATE networks for the low-level processing of each camera's feed.
So the first part of the processing can be done via layers that use the same weights and thus share the same representations.


In [None]:
from keras import layers
from keras import applications
from keras import Input

xception_base = applications.Xception(weights=None, include_top=False)

left_input = Input(shape=(250, 250, 3))
right_input = Input(shape=(250, 250, 3))

left_features = xception_base(left_input)
right_features = xception_base(right_input)

merged_features = layers.concatenate([left_features, right_features], axis=-1)