# Keras: writing custom layers
```Here you will experiment with writing custom Keras layers. There are two stages: in the first you will implement a simple layer; in the second you will implement a more complicated one.```

## Stage 1
```Implement an unpooling layer that acts on matrices as follows:```
```
A = array([[0, 1, 3, 1, 0],
           [2, 0, 1, 2, 4],
           [3, 2, 1, 4, 3],
           [4, 0, 3, 2, 0],
           [4, 1, 2, 0, 2]])
       
unpooling(A) = array([[0, 0, 1, 1, 3, 3, 1, 1, 0, 0],
                      [0, 0, 1, 1, 3, 3, 1, 1, 0, 0],
                      [2, 2, 0, 0, 1, 1, 2, 2, 4, 4],
                      [2, 2, 0, 0, 1, 1, 2, 2, 4, 4],
                      [3, 3, 2, 2, 1, 1, 4, 4, 3, 3],
                      [3, 3, 2, 2, 1, 1, 4, 4, 3, 3],
                      [4, 4, 0, 0, 3, 3, 2, 2, 0, 0],
                      [4, 4, 0, 0, 3, 3, 2, 2, 0, 0],
                      [4, 4, 1, 1, 2, 2, 0, 0, 2, 2],
                      [4, 4, 1, 1, 2, 2, 0, 0, 2, 2]])
```
```To do so, use the following example, taken from https://keras.io/layers/writing-your-own-keras-layers/.```

```Note: you can't use NumPy functions in your layer's logic. You will have to use functions accessed through the backend you use (Theano or TensorFlow).```

```~Ittai Haran```

In [1]:
from keras import backend as K
from keras.engine.topology import Layer
import numpy as np

class MyLayer(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(MyLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='weight_variable_name', 
                                      shape=(input_shape[1], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(MyLayer, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, x):
        
        return K.dot(x, self.kernel)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_dim)

In [9]:
class Unpooling(Layer):

    def __init__(self, **kwargs):
        super(Unpooling, self).__init__(**kwargs)

    def build(self, input_shape):
        super(Unpooling, self).build(input_shape)

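    def call(self, x):
        # Repeat every element twice along the row axis (axis 1) and then
        # along the column axis (axis 2); axis 0 is the batch axis.
        repeat_first_axis = K.repeat_elements(x, 2, 1)
        repeat_second_axis = K.repeat_elements(repeat_first_axis, 2, 2)
        return repeat_second_axis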

In [11]:
import tensorflow as tf

batch_size = 1
x = tf.convert_to_tensor(np.array([[[1,2,3], [0, 5, 10]]]))
print(x.shape)
unpool_layer = Unpooling()
y = unpool_layer(x)
print(y)

(1, 2, 3)
tf.Tensor(
[[[ 1  1  2  2  3  3]
  [ 1  1  2  2  3  3]
  [ 0  0  5  5 10 10]
  [ 0  0  5  5 10 10]]], shape=(1, 4, 6), dtype=int32)
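
One way to sanity-check the layer is to compare its output with a NumPy reference computed outside the layer (the restriction on NumPy applies only to the layer's own logic). The sketch below is an assumption about how one might do that: `unpool_reference` is a hypothetical helper, not part of Keras, and it repeats each element twice along both axes with `np.repeat`.

```python
import numpy as np

def unpool_reference(matrix, factor=2):
    """Repeat every element `factor` times along rows and then columns."""
    rows_repeated = np.repeat(matrix, factor, axis=0)
    return np.repeat(rows_repeated, factor, axis=1)

# The matrix A from the task statement.
A = np.array([[0, 1, 3, 1, 0],
              [2, 0, 1, 2, 4],
              [3, 2, 1, 4, 3],
              [4, 0, 3, 2, 0],
              [4, 1, 2, 0, 2]])

expected = unpool_reference(A)
print(expected.shape)  # (10, 10)
```

The output of the `Unpooling` layer on `A` with a leading batch axis added should then agree with `expected` element for element.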


## Stage 2
```Consider the following simple attention mechanism:```

- ```Given a vector v, compute Dense(v), where Dense(v).shape = v.shape.```
- ```Multiply v and Dense(v) element-wise.```
- ```Return the result.```

```What is the purpose of this mechanism? Can you think of what can be achieved using this kind of mechanism?```

```Implement the attention mechanism as a keras layer.```

In [34]:
class AttentionMechanism(Layer):

    def __init__(self, **kwargs):
        super(AttentionMechanism, self).__init__(**kwargs)

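    def build(self, input_shape):
        # The kernel has the same shape as the whole input, batch axis included,
        # so it acts as a learned element-wise mask rather than a Dense(v) projection.
        self.kernel = self.add_weight(name='kernel',
                                      shape=(input_shape[0], input_shape[1]),
                                      initializer='uniform',
                                      trainable=True)
        super(AttentionMechanism, self).build(input_shape)

    def call(self, x):
        # Debug prints: confirm that x and the kernel have matching shapes.
        print("x", x.shape)
        print("kernel", self.kernel.shape)
        # Element-wise product of the input and the kernel.
        return tf.keras.layers.Multiply()([x, self.kernel])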

In [35]:
batch_size = 1
x = tf.convert_to_tensor(np.array([[1,2,3], [0, 5, 10]]), dtype='float32')
print(x.shape)
AttentionMechanism_layer = AttentionMechanism()
y = AttentionMechanism_layer(x)
print(y)

(2, 3)
x (2, 3)
kernel (2, 3)
tf.Tensor(
[[ 0.00923688 -0.06304872  0.05575053]
 [ 0.          0.20001902 -0.136467  ]], shape=(2, 3), dtype=float32)
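
Read literally, the mechanism in this stage computes `v * Dense(v)`, where the dense map is square so that the shapes agree. The NumPy sketch below shows only that forward computation, to make the shapes concrete. The weights `W` and `b`, the choice of a linear dense map without activation, the initialisation range, and the helper name `gated_forward` are all assumptions for illustration. Inside a Keras layer the same steps would use backend ops and weights created in `build`, with the kernel shaped by the feature dimension rather than the batch dimension.

```python
import numpy as np

def gated_forward(v, W, b):
    """Return v * Dense(v) for a batch of row vectors v of shape (batch, d)."""
    dense_v = v @ W + b   # shape (batch, d): W is (d, d), b is (d,)
    return v * dense_v    # element-wise product, shape (batch, d)

rng = np.random.default_rng(0)
v = np.array([[1.0, 2.0, 3.0],
              [0.0, 5.0, 10.0]])
d = v.shape[1]
W = rng.uniform(-0.05, 0.05, size=(d, d))  # assumed initialisation
b = np.zeros(d)

out = gated_forward(v, W, b)
print(out.shape)  # (2, 3)
```

Because the output is the input scaled entry by entry, an entry of `v` that is zero stays zero whatever the weights are.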


## Stage 3
```Here you will try solving a problem I once struggled with. The problem is the following:
You are given a set of sequences of symbols. All sequences contain the same "core sequence", but have extra noise in the form of other symbols between the symbols of the core sequence. For example, the sequences could be```


**1**-**3**-2-**4**-3-**2**-4-**1**-3-2-4

**1**-2-**3**-3-**4**-1-2-**2**-**1**-3-4-2-1-1

**1**-4-4-4-**3**-**4**-1-1-**2**-**1**-1-2

```while the core sequence is 1-3-4-2-1```
```Your task is, given a dataset of such sequences, to find the core sequence. You may speak to me to learn about the context of this question and the reasons that led me to face it.```

```Generate a dataset that will simulate this problem. Follow the instructions:```
- ```Use a 4-letter alphabet.```
- ```Generate a core sequence with 10 symbols.```
- ```Create a new sequence symbol by symbol: for each symbol you add to the sequence, put the next letter of the core sequence with probability p and a random symbol with probability 1-p. Choose p to be 0.5.```
- ```Generate a dataset of 10,000 examples.```

```Try solving the problem with simple means.```

In [2]:
digits = [0,1,2,3]
baseline_length = 3
baseline = np.random.choice(digits, baseline_length)

print(baseline)

[3 2 3]


In [3]:
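def create_seq(baseline_in, p=0.5):
    """Embed the core sequence `baseline_in` in random noise.

    At every step the next core symbol is appended with probability p and a
    random symbol with probability 1 - p. Returns the noisy sequence and the
    positions of the core symbols that follow the first one.
    """
    baseline_length_in = len(baseline_in)
    baseline_places = []
    seq = [baseline_in[0]]  # the sequence always starts with the first core symbol
    count = 1
    while count < baseline_length_in:
        if np.random.random() < p:
            # Next symbol of the core sequence.
            baseline_places.append(len(seq))
            seq.append(baseline_in[count])
            count += 1
        else:
            # Noise symbol drawn uniformly from the alphabet.
            seq.append(np.random.choice(digits, 1)[0])
    return seq, baseline_places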

In [4]:
def create_sequences(count, baseline_in):
    return map(lambda x: create_seq(baseline_in), range(count))

In [5]:
data = list(create_sequences(10000, baseline))
data, places = list(map(lambda x: x[0], data)), list(map(lambda x: x[1], data))

In [6]:
subsequences_set = set()

for letter1 in digits:
    for letter2 in digits:
        for letter3 in digits:
            subsequences_set.add((letter1, letter2, letter3))

print(subsequences_set)

{(2, 0, 2), (0, 1, 0), (2, 2, 2), (2, 1, 0), (0, 1, 3), (0, 3, 0), (2, 1, 3), (0, 3, 3), (3, 2, 1), (1, 2, 2), (1, 3, 3), (3, 1, 2), (1, 3, 0), (3, 3, 2), (0, 0, 1), (2, 3, 0), (0, 2, 1), (1, 0, 1), (2, 3, 3), (1, 1, 0), (3, 0, 3), (1, 1, 3), (3, 0, 0), (2, 0, 1), (2, 1, 2), (2, 2, 1), (0, 1, 2), (0, 3, 2), (3, 1, 1), (3, 2, 0), (1, 2, 1), (3, 0, 1), (1, 3, 2), (3, 2, 3), (3, 3, 1), (0, 0, 3), (0, 2, 0), (0, 0, 0), (2, 3, 2), (0, 2, 3), (1, 1, 2), (1, 0, 0), (3, 0, 2), (1, 0, 3), (2, 0, 0), (2, 0, 3), (2, 2, 0), (0, 1, 1), (2, 2, 3), (2, 1, 1), (0, 3, 1), (1, 2, 0), (3, 1, 3), (1, 3, 1), (1, 2, 3), (3, 1, 0), (3, 2, 2), (3, 3, 3), (3, 3, 0), (0, 0, 2), (2, 3, 1), (0, 2, 2), (1, 0, 2), (1, 1, 1)}


In [7]:
def find_subsequence(sequence, subsequence):
    idx = 0
    for char in sequence:
        if char == subsequence[idx]:
            idx += 1
        if idx == len(subsequence):
            return True
    return False

In [8]:
from tqdm import tqdm

# Start from all candidates and discard any that is missing from some sequence.
subsequences_set_copy = subsequences_set.copy()
for subsequence in tqdm(subsequences_set):
    for sequence in data:
        if not find_subsequence(sequence, subsequence):
            subsequences_set_copy.remove(subsequence)
            break

100%|████████████████████████████████████████████████████████████████████████████████| 64/64 [00:00<00:00, 4269.55it/s]


In [9]:
print(subsequences_set_copy)

{(3, 2, 3)}
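
The same brute-force idea can be written for any core length with `itertools.product`. The sketch below is self-contained and uses hypothetical names (`is_subsequence`, `common_subsequences`); it keeps the candidates that occur as a subsequence of every example. The number of candidates is `len(alphabet) ** length`, so a 10-symbol core over 4 letters gives 4**10 = 1,048,576 candidates, each checked against the dataset, which is why this approach is only cheap for short cores.

```python
from itertools import product

def is_subsequence(candidate, sequence):
    """True if `candidate` appears in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    # `symbol in it` advances the iterator just past the first match.
    return all(symbol in it for symbol in candidate)

def common_subsequences(sequences, alphabet, length):
    """Candidates of the given length that are subsequences of every sequence."""
    return [
        candidate
        for candidate in product(alphabet, repeat=length)
        if all(is_subsequence(candidate, seq) for seq in sequences)
    ]

# Toy data: the three example sequences from the task statement.
examples = [
    [1, 3, 2, 4, 3, 2, 4, 1, 3, 2, 4],
    [1, 2, 3, 3, 4, 1, 2, 2, 1, 3, 4, 2, 1, 1],
    [1, 4, 4, 4, 3, 4, 1, 1, 2, 1, 1, 2],
]
survivors = common_subsequences(examples, alphabet=[1, 2, 3, 4], length=5)
print((1, 3, 4, 2, 1) in survivors)  # True
```

With only a handful of examples, candidates other than the core may survive as well; the filter narrows down to the core only as more sequences are added.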


## Stage 4
```A possible solution to the problem is as follows:```
- ```Given such a dataset of sequences, generate a new dataset of random sequences.```
- ```Train a classifier that will determine whether a sequence belongs to the original dataset or the generated dataset. Make sure that this problem is solvable.```
- ```Now train a specific model, containing an attention layer. We can hope that the attention mechanism will learn to use the core sequence when classifying.```
- ```Use the attention visualization to find the symbols of the core sequence.```
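
The first step of the list above can be sketched as follows. This is one possible construction, not the only one: each random sequence copies the length of a real one so that length alone does not give the label away. That choice is an assumption of this sketch, as are the names `make_random_like` and `build_classification_set`. A random sequence may contain the core by chance, so the labels are not perfectly clean.

```python
import random

def make_random_like(sequence, alphabet, rng):
    """Random sequence over `alphabet` with the same length as `sequence`."""
    return [rng.choice(alphabet) for _ in sequence]

def build_classification_set(real_sequences, alphabet, seed=0):
    """Pair every real sequence with a random one and label them 1 / 0."""
    rng = random.Random(seed)
    fake_sequences = [make_random_like(s, alphabet, rng) for s in real_sequences]
    samples = list(real_sequences) + fake_sequences
    labels = [1] * len(real_sequences) + [0] * len(fake_sequences)
    return samples, labels

real = [[1, 3, 2, 4, 3, 2, 4, 1], [1, 4, 3, 4, 1, 2, 1]]
samples, labels = build_classification_set(real, alphabet=[1, 2, 3, 4])
print(labels)  # [1, 1, 0, 0]
```

Shuffling `samples` and `labels` together before training keeps a validation split from containing only one class.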

```What are the advantages of this solution? Do you think you can make it work? You certainly will need a different kind of attention mechanism for the task, rather than the simple one you already have.```

```Read the paper Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau, Cho and Bengio. The paper describes an attention mechanism implemented in the context of machine translation. Implement the attention mechanism the authors suggest using PyTorch and try solving the above problem. You can find the paper in the current directory.```
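
For orientation, the scoring step of the paper's additive attention, for one decoder state `s` and encoder states `h_1..h_T`, is `e_j = v_a^T tanh(W_a s + U_a h_j)`, followed by a softmax over `j` and a weighted sum of the `h_j`. The NumPy sketch below computes that for a single example. The dimensions, the random weights, and the function name `additive_attention` are assumptions chosen for illustration; a trainable version would hold `W_a`, `U_a` and `v_a` as layer weights.

```python
import numpy as np

def additive_attention(s, H, W_a, U_a, v_a):
    """Additive (Bahdanau-style) attention for one decoder state.

    s: (d_s,) decoder state; H: (T, d_h) encoder states.
    Returns the context vector (d_h,) and the attention weights (T,).
    """
    scores = np.tanh(s @ W_a + H @ U_a) @ v_a        # (T,)
    scores = scores - scores.max()                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over positions
    context = weights @ H                            # (d_h,)
    return context, weights

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 21, 8, 8, 16  # assumed sizes
H = rng.normal(size=(T, d_h))
s = rng.normal(size=d_s)
W_a = rng.normal(size=(d_s, d_a))
U_a = rng.normal(size=(d_h, d_a))
v_a = rng.normal(size=d_a)

context, weights = additive_attention(s, H, W_a, U_a, v_a)
print(context.shape, weights.shape)  # (8,) (21,)
```

The `weights` vector is what the visualization step refers to: positions with a large weight are the ones the model relies on for that decoder state.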

In [10]:
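def create_seq(baseline_in, p=0.5, padding_length=21, padding_symbol=8):
    """Like the Stage 3 generator, but right-padded to a fixed length."""
    baseline_length_in = len(baseline_in)
    baseline_places = []
    seq = [baseline_in[0]]
    count = 1
    while count < baseline_length_in:
        if np.random.random() < p:
            # Next symbol of the core sequence, with probability p.
            baseline_places.append(len(seq))
            seq.append(baseline_in[count])
            count += 1
        else:
            # Noise symbol, with probability 1 - p.
            seq.append(np.random.choice(digits, 1)[0])

    # Pad short sequences with a symbol outside the alphabet; sequences that
    # are already longer than padding_length are returned unchanged.
    if len(seq) < padding_length:
        seq += [padding_symbol] * (padding_length - len(seq))
    return seq, baseline_places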

In [11]:
def create_sequences(count, baseline_in):
    return map(lambda x: create_seq(baseline_in), range(count))

In [12]:
digits = [0,1,2,3]
baseline_length = 3
right_baseline = np.random.choice(digits, baseline_length)
wrong_baseline = np.random.choice(digits, baseline_length)

print(right_baseline)
print(wrong_baseline)

data = list(create_sequences(10000, right_baseline)) + list(create_sequences(10000, wrong_baseline))
data, places = list(map(lambda x: x[0], data)), list(map(lambda x: x[1], data))
target = [True]*10000+[False]*10000

[2 2 1]
[0 0 3]


In [13]:
data = np.array(data)

In [14]:
print(data.shape)

(20000, 21)


In [15]:
target = np.array(target)

In [16]:
print(target.shape)

(20000,)


In [37]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

encoder_input = layers.Input(shape=(data.shape[1], ))
encoder_embedded = layers.Embedding(input_dim=10, output_dim=10)(
    encoder_input
)
# Return states in addition to output
output_encoder, state_h, state_c = layers.LSTM(data.shape[1], return_state=True, name="encoder")(
    encoder_embedded
)
encoder_state = [state_h, state_c]

decoder_embedded = layers.Embedding(input_dim=10, output_dim=10)(
    output_encoder
)
decoder_output = layers.LSTM(data.shape[1], name="decoder")(
    decoder_embedded, initial_state=encoder_state
)
output = layers.Dense(1)(decoder_output)

model = keras.Model([encoder_input], output)
model.summary()

Model: "functional_15"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_19 (InputLayer)           [(None, 21)]         0                                            
__________________________________________________________________________________________________
embedding_16 (Embedding)        (None, 21, 10)       100         input_19[0][0]                   
__________________________________________________________________________________________________
encoder (LSTM)                  [(None, 21), (None,  2688        embedding_16[0][0]               
__________________________________________________________________________________________________
embedding_17 (Embedding)        (None, 21, 10)       100         encoder[0][0]                    
______________________________________________________________________________________

In [28]:
print(state_h.shape)
print(state_c.shape)

(None, 64)
(None, 64)


In [34]:
model.compile(optimizer='adam', loss='mse')

In [35]:
model.fit(data, target, batch_size=150, epochs=100, verbose=2, validation_split=0.2)

Epoch 1/100
107/107 - 2s - loss: 0.2856 - val_loss: 0.3320
Epoch 2/100
107/107 - 1s - loss: 0.0217 - val_loss: 0.0066
Epoch 3/100
107/107 - 1s - loss: 0.0060 - val_loss: 0.0052
Epoch 4/100
107/107 - 1s - loss: 0.0057 - val_loss: 0.0057
Epoch 5/100
107/107 - 1s - loss: 0.0021 - val_loss: 0.0018
Epoch 6/100
107/107 - 1s - loss: 0.0017 - val_loss: 0.0015
Epoch 7/100
107/107 - 1s - loss: 0.0052 - val_loss: 0.0085
Epoch 8/100
107/107 - 1s - loss: 0.0023 - val_loss: 0.0029
Epoch 9/100
107/107 - 1s - loss: 6.2105e-04 - val_loss: 0.0018
Epoch 10/100
107/107 - 1s - loss: 0.0025 - val_loss: 7.1037e-05
Epoch 11/100
107/107 - 1s - loss: 0.0025 - val_loss: 1.4828e-04
Epoch 12/100
107/107 - 1s - loss: 6.5019e-04 - val_loss: 3.1732e-07
Epoch 13/100
107/107 - 1s - loss: 4.5660e-04 - val_loss: 2.1961e-06
Epoch 14/100
107/107 - 1s - loss: 4.5307e-04 - val_loss: 1.6550e-06
Epoch 15/100
107/107 - 1s - loss: 2.6263e-04 - val_loss: 1.1047e-06
Epoch 16/100
107/107 - 1s - loss: 5.7084e-06 - val_loss: 1.8333e-

<tensorflow.python.keras.callbacks.History at 0x1b6cc1ca460>

In [45]:
class AttentionLayer(Layer):
    """
    This class implements Bahdanau attention (https://arxiv.org/pdf/1409.0473.pdf).
    There are three sets of weights introduced W_a, U_a, and V_a
     """

    def __init__(self, **kwargs):
        super(AttentionLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        assert isinstance(input_shape, list)
        # Create a trainable weight variable for this layer.

        self.W_a = self.add_weight(name='W_a',
                                   shape=tf.TensorShape((input_shape[0][2], input_shape[0][2])),
                                   initializer='uniform',
                                   trainable=True)
        self.U_a = self.add_weight(name='U_a',
                                   shape=tf.TensorShape((input_shape[1][2], input_shape[0][2])),
                                   initializer='uniform',
                                   trainable=True)
        self.V_a = self.add_weight(name='V_a',
                                   shape=tf.TensorShape((input_shape[0][2], 1)),
                                   initializer='uniform',
                                   trainable=True)

        super(AttentionLayer, self).build(input_shape)  # Be sure to call this at the end

    def call(self, inputs, verbose=False):
        """
        inputs: [encoder_output_sequence, decoder_output_sequence]
        """
        assert isinstance(inputs, list)
        encoder_out_seq, decoder_out_seq = inputs
        if verbose:
            print('encoder_out_seq>', encoder_out_seq.shape)
            print('decoder_out_seq>', decoder_out_seq.shape)

        def energy_step(inputs, states):
            """ Step function for computing energy for a single decoder state
            inputs: (batchsize * 1 * de_in_dim)
            states: (batchsize * 1 * de_latent_dim)
            """

            assert_msg = "States must be an iterable. Got {} of type {}".format(states, type(states))
            assert isinstance(states, list) or isinstance(states, tuple), assert_msg

            """ Some parameters required for shaping tensors"""
            en_seq_len, en_hidden = encoder_out_seq.shape[1], encoder_out_seq.shape[2]
            de_hidden = inputs.shape[-1]

            """ Computing S.Wa where S=[s0, s1, ..., si]"""
            # <= batch size * en_seq_len * latent_dim
            W_a_dot_s = K.dot(encoder_out_seq, self.W_a)

            """ Computing hj.Ua """
            U_a_dot_h = K.expand_dims(K.dot(inputs, self.U_a), 1)  # <= batch_size, 1, latent_dim
            if verbose:
                print('Ua.h>', U_a_dot_h.shape)

            """ tanh(S.Wa + hj.Ua) """
            # <= batch_size, en_seq_len, latent_dim
            Ws_plus_Uh = K.tanh(W_a_dot_s + U_a_dot_h)
            if verbose:
                print('Ws+Uh>', Ws_plus_Uh.shape)

            """ softmax(va.tanh(S.Wa + hj.Ua)) """
            # <= batch_size, en_seq_len
            e_i = K.squeeze(K.dot(Ws_plus_Uh, self.V_a), axis=-1)
            # <= batch_size, en_seq_len
            e_i = K.softmax(e_i)

            if verbose:
                print('ei>', e_i.shape)

            return e_i, [e_i]

        def context_step(inputs, states):
            """ Step function for computing ci using ei """

            assert_msg = "States must be an iterable. Got {} of type {}".format(states, type(states))
            assert isinstance(states, list) or isinstance(states, tuple), assert_msg

            # <= batch_size, hidden_size
            c_i = K.sum(encoder_out_seq * K.expand_dims(inputs, -1), axis=1)
            if verbose:
                print('ci>', c_i.shape)
            return c_i, [c_i]

        fake_state_c = K.sum(encoder_out_seq, axis=1)  # <= (batch_size, latent_dim)
        fake_state_e = K.sum(encoder_out_seq, axis=2)  # <= (batch_size, enc_seq_len)

        """ Computing energy outputs """
        # e_outputs => (batch_size, de_seq_len, en_seq_len)
        last_out, e_outputs, _ = K.rnn(
            energy_step, decoder_out_seq, [fake_state_e],
        )

        """ Computing context vectors """
        last_out, c_outputs, _ = K.rnn(
            context_step, e_outputs, [fake_state_c],
        )

        return c_outputs, e_outputs

    def compute_output_shape(self, input_shape):
        """ Outputs produced by the layer """
        return [
            tf.TensorShape((input_shape[1][0], input_shape[1][1], input_shape[1][2])),
            tf.TensorShape((input_shape[1][0], input_shape[1][1], input_shape[0][1]))
        ]

In [78]:
print(encoder_out.shape)
print(decoder_out.shape)

(256, 21, 64)
(256, 21, 64)


In [101]:
batch_size = 200
timestep = 100
seq_size = 21
hidden_size = 64

encoder_inputs = layers.Input(batch_shape=(batch_size, seq_size, 1), name='encoder_inputs')
decoder_inputs = layers.Input(batch_shape=(batch_size, seq_size, 1), name='decoder_inputs')

encoder_gru = layers.GRU(hidden_size, return_sequences=True, return_state=True, name='encoder_gru')
encoder_out, encoder_state = encoder_gru(encoder_inputs)

attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_out, encoder_out])

decoder_gru = layers.GRU(hidden_size, return_sequences=True, return_state=True, name='decoder_gru')
decoder_out, decoder_state = decoder_gru(attn_out, initial_state=encoder_state)

decoder_out = tf.keras.layers.Reshape((seq_size*hidden_size,))(decoder_out)

output_dense = layers.Dense(1, activation='tanh', name='softmax_layer')(decoder_out)

full_model = keras.Model(inputs=[encoder_inputs], outputs=output_dense)
full_model.compile(optimizer='adam', loss='mse')

In [102]:
print(decoder_concat_input.shape)

(256, 2688)


In [103]:
full_model.summary()

Model: "functional_33"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(200, 21, 1)]       0                                            
__________________________________________________________________________________________________
encoder_gru (GRU)               [(200, 21, 64), (200 12864       encoder_inputs[0][0]             
__________________________________________________________________________________________________
attention_layer (AttentionLayer ((200, 21, 64), (200 8256        encoder_gru[0][0]                
                                                                 encoder_gru[0][0]                
__________________________________________________________________________________________________
decoder_gru (GRU)               [(200, 21, 64), (200 24960       attention_layer[0][0]

In [104]:
data = data.reshape(data.shape[0], data.shape[1], 1)
print(data.shape)

(20000, 21, 1)


In [105]:
target = target.reshape(data.shape[0], 1)
print(target.shape)

(20000, 1)


In [None]:
full_model.fit(data, target, batch_size=batch_size, epochs=20, verbose=2, validation_split=0.2)

In [113]:
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense

model = keras.Sequential(
    [
        Input(shape = (21,)),
        Dense(5, name="hidden_layer_1", activation='tanh'),
        Dense(5, name="hidden_layer_2", activation='tanh'),
        Dense(5, name="hidden_layer_3", activation='tanh'),
        Dense(1, name="output_layer"),
    ]
)
model.compile(optimizer='adam', loss='mse')
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
hidden_layer_1 (Dense)       (None, 5)                 110       
_________________________________________________________________
hidden_layer_2 (Dense)       (None, 5)                 30        
_________________________________________________________________
hidden_layer_3 (Dense)       (None, 5)                 30        
_________________________________________________________________
output_layer (Dense)         (None, 1)                 6         
Total params: 176
Trainable params: 176
Non-trainable params: 0
_________________________________________________________________


In [114]:
model.fit(data, target, batch_size=150, epochs=100, verbose=2, validation_split=0.2)

Epoch 1/100
107/107 - 0s - loss: 0.2860 - val_loss: 0.3706
Epoch 2/100
107/107 - 0s - loss: 0.2262 - val_loss: 0.3403
Epoch 3/100
107/107 - 0s - loss: 0.1636 - val_loss: 0.2038
Epoch 4/100
107/107 - 0s - loss: 0.1212 - val_loss: 0.1789
Epoch 5/100
107/107 - 0s - loss: 0.1054 - val_loss: 0.1621
Epoch 6/100
107/107 - 0s - loss: 0.0945 - val_loss: 0.1363
Epoch 7/100
107/107 - 0s - loss: 0.0857 - val_loss: 0.1301
Epoch 8/100
107/107 - 0s - loss: 0.0779 - val_loss: 0.1106
Epoch 9/100
107/107 - 0s - loss: 0.0697 - val_loss: 0.0834
Epoch 10/100
107/107 - 0s - loss: 0.0609 - val_loss: 0.0800
Epoch 11/100
107/107 - 0s - loss: 0.0518 - val_loss: 0.0662
Epoch 12/100
107/107 - 0s - loss: 0.0423 - val_loss: 0.0575
Epoch 13/100
107/107 - 0s - loss: 0.0335 - val_loss: 0.0336
Epoch 14/100
107/107 - 0s - loss: 0.0262 - val_loss: 0.0276
Epoch 15/100
107/107 - 0s - loss: 0.0212 - val_loss: 0.0212
Epoch 16/100
107/107 - 0s - loss: 0.0172 - val_loss: 0.0145
Epoch 17/100
107/107 - 0s - loss: 0.0146 - val_lo

<tensorflow.python.keras.callbacks.History at 0x19cc3fb7c40>

## Bonus
```Now implement the attention mechanism the authors suggest as a keras layer. Use the source code of the keras.layers.recurrent class. You can find the paper and the class source code in the current directory.```

```Basic instructions:```
- ```Use your tutor. A lot. This is a hard exercise.```
- ```Open the source code of recurrent neural networks. You would like to implement a layer that inherits from Recurrent.```
- ```Understand the code's flow and the functions you would like to write.```
- ```Start by writing a mechanism that would be a little bit simpler: don't return a sequence, but rather return a single vector.```
- ```Try solving the above problem using your attention mechanism. What problems do you encounter?```
- ```Complete the full mechanism. Assuming Yoshua Bengio didn't lie in his paper, how do you think their architecture overcomes the problem you found?```

In [117]:
from keras.layers.recurrent import Recurrent


ImportError: cannot import name 'Recurrent' from 'keras.layers.recurrent' (C:\Users\RONENAH\Anaconda3\envs\formation_env\lib\site-packages\keras\layers\recurrent.py)