# Keras: writing custom layers
```Here you will gain experience writing custom Keras layers. The exercise has four stages: in Stages 1 and 2 you will implement simple layers; in Stages 3 and 4 you will set up a sequence problem and tackle it with a more complicated attention layer.```

## Stage 1
```Implement an unpooling layer that acts on matrices as follows:```
```
A = array([[0, 1, 3, 1, 0],
           [2, 0, 1, 2, 4],
           [3, 2, 1, 4, 3],
           [4, 0, 3, 2, 0],
           [4, 1, 2, 0, 2]])
       
unpooling(A) = array([[0, 0, 1, 1, 3, 3, 1, 1, 0, 0],
                      [0, 0, 1, 1, 3, 3, 1, 1, 0, 0],
                      [2, 2, 0, 0, 1, 1, 2, 2, 4, 4],
                      [2, 2, 0, 0, 1, 1, 2, 2, 4, 4],
                      [3, 3, 2, 2, 1, 1, 4, 4, 3, 3],
                      [3, 3, 2, 2, 1, 1, 4, 4, 3, 3],
                      [4, 4, 0, 0, 3, 3, 2, 2, 0, 0],
                      [4, 4, 0, 0, 3, 3, 2, 2, 0, 0],
                      [4, 4, 1, 1, 2, 2, 0, 0, 2, 2],
                      [4, 4, 1, 1, 2, 2, 0, 0, 2, 2]])
```
```To do so, start from the following example, taken from https://keras.io/layers/writing-your-own-keras-layers/.```

```Note: you can't use NumPy functions in your layer's logic. You will have to use functions accessed through the backend you use (Theano or TensorFlow).```

```~Ittai Haran```

In [None]:
from keras import backend as K
from keras.engine.topology import Layer
import numpy as np

class MyLayer(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(MyLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='weight_variable_name', 
                                      shape=(input_shape[1], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(MyLayer, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, x):
        
        return K.dot(x, self.kernel)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_dim)

In [12]:
from keras import backend as K
from keras.engine.topology import Layer
import numpy as np

class Unpool(Layer):

    def __init__(self, **kwargs):
        super(Unpool, self).__init__(**kwargs)

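    def build(self, input_shape):
        # This layer has no trainable weights; just mark it as built.
        super(Unpool, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, x):
        # Repeat every element twice along axes 2 and 3 (rows and columns).
        # This assumes a 4D input laid out as (batch, channels, rows, cols).
        return K.repeat_elements(K.repeat_elements(x, 2, 2), 2, 3)

    def compute_output_shape(self, input_shape):
        # Rows and columns double; unknown (None) dimensions stay unknown.
        rows, cols = [None if d is None else 2 * d for d in input_shape[2:4]]
        return (input_shape[0], input_shape[1], rows, cols)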
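
A quick way to check the layer is to compare its output with a NumPy reference computed outside the layer (NumPy is fine here because this is test code, not layer logic). The sketch below is a minimal reference for the matrix from the task description; the name `unpool_reference` is hypothetical and not part of the exercise.

```python
import numpy as np

def unpool_reference(a, factor=2):
    """Repeat every row and every column of a 2D array `factor` times."""
    return np.repeat(np.repeat(a, factor, axis=0), factor, axis=1)

A = np.array([[0, 1, 3, 1, 0],
              [2, 0, 1, 2, 4],
              [3, 2, 1, 4, 3],
              [4, 0, 3, 2, 0],
              [4, 1, 2, 0, 2]])

reference = unpool_reference(A)
print(reference.shape)
print(reference[:2])
```

Assuming the layer receives input laid out as (batch, channels, rows, cols), its output for `A[None, None]` should match `reference` once the two leading axes are dropped.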

## Stage 2
```Consider the following simple attention mechanism:```

```Given a vector v, compute Dense(v), where Dense(v).shape == v.shape.
Multiply v and Dense(v) element-wise.
Return the result.```

```What is the purpose of this mechanism? Can you think of what can be achieved using this kind of mechanism?```

```Implement the attention mechanism as a Keras layer.```

In [1]:
from keras import backend as K
from keras.engine.topology import Layer
import numpy as np

class SimpleAttention(Layer):

    def __init__(self, **kwargs):
        super(SimpleAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel', 
                                      shape=(input_shape[1], input_shape[1]),
                                      initializer='uniform',
                                      trainable=True)
        super(SimpleAttention, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, x):
        return K.dot(x, self.kernel)*x
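
To see what this layer computes, its forward pass can be reproduced in NumPy. The sketch below mirrors `K.dot(x, self.kernel) * x`; the function name `simple_attention_forward` and the random stand-in weights are assumptions made for illustration.

```python
import numpy as np

def simple_attention_forward(v, kernel):
    """Return (v @ kernel) * v, mirroring K.dot(x, self.kernel) * x."""
    gate = v @ kernel   # plays the role of Dense(v); same shape as v
    return gate * v     # element-wise re-weighting of v

rng = np.random.default_rng(0)
v = rng.normal(size=(3, 4))                     # batch of 3 vectors of length 4
kernel = rng.uniform(-0.05, 0.05, size=(4, 4))  # stand-in for the learned weights
out = simple_attention_forward(v, kernel)
print(out.shape)  # (3, 4)
```

Like the layer above, this sketch has no bias term and no activation, so it is a linear map followed by the element-wise product.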



## Stage 3
```Here you will try solving a problem I once struggled with. The problem is the following:
You are given a set of sequences of symbols. All sequences contain the same "core sequence", but have extra noise in the form of other symbols between the symbols of the core sequence. For example, the sequences could be```


**1**-**3**-2-**4**-3-**2**-4-**1**-3-2-4

**1**-2-**3**-3-**4**-1-2-**2**-**1**-3-4-2-1-1

**1**-4-4-4-**3**-**4**-1-1-**2**-**1**-1-2

```while the core sequence is 1-3-4-2-1```
```Your task is, given a dataset of such sequences, to find the core sequence. You may speak to me to learn about the context of this question and the reasons that led to it.```

```Generate a dataset that will simulate this problem. Follow the instructions:```
- ```Use a 4-letter alphabet.```
- ```Generate a core sequence with 10 symbols.```
- ```Create each new sequence symbol by symbol: for each symbol you add, put the next letter of the core sequence with probability p and a random symbol with probability 1-p. Choose p to be 0.5.```
- ```Generate a dataset of 10,000 examples.```
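
The generation procedure in the list above can be sketched as follows. This is a minimal version under the stated parameters (4 symbols, core length 10, p = 0.5); the names `ALPHABET`, `make_noisy_sequence` and `contains_subsequence` are hypothetical, and here p is the probability of emitting the next core symbol, exactly as the instructions word it.

```python
import numpy as np

ALPHABET = [0, 1, 2, 3]  # 4-letter alphabet

def make_noisy_sequence(core, p, rng):
    """Emit the next core symbol with probability p, otherwise a random symbol."""
    seq, i = [], 0
    while i < len(core):
        if rng.random() < p:
            seq.append(core[i])
            i += 1
        else:
            seq.append(ALPHABET[rng.integers(len(ALPHABET))])
    return seq

def contains_subsequence(seq, core):
    """True if the symbols of `core` appear in `seq` in order (gaps allowed)."""
    it = iter(seq)
    return all(symbol in it for symbol in core)

rng = np.random.default_rng(0)
core = [ALPHABET[k] for k in rng.integers(len(ALPHABET), size=10)]
dataset = [make_noisy_sequence(core, 0.5, rng) for _ in range(10000)]
print(len(dataset), all(contains_subsequence(s, core) for s in dataset))
```

With this procedure every sequence ends on the last core symbol, since generation stops as soon as the core is exhausted.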

```Try solving the problem with simple means.```

In [None]:
import numpy as np
from sklearn.utils import shuffle  # assumed source of the `shuffle` used below

digits = [0,1,2,3]
baseline_length = 10
baseline = np.random.choice(digits, baseline_length)

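def create_seq(p, baseline_in):
    # Build one noisy sequence that contains baseline_in as a subsequence.
    # Note: in this implementation p is the probability of inserting a random
    # (noise) symbol, so the next core symbol is emitted with probability 1-p.
    # Returns the sequence and the positions of the core symbols
    # (the first core symbol, at position 0, is not recorded).
    baseline_length_in = len(baseline_in)
    baseline_places = []
    seq = [baseline_in[0]]
    count = 1
    while count<baseline_length_in:
        if np.random.random()<p:
            seq.append(np.random.choice(digits, 1)[0])
        else:
            baseline_places.append(len(seq))
            seq.append(baseline_in[count])
            count += 1
    return seq, baseline_places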

def create_sequences(count, p, baseline_in=None):
    if baseline_in is None:
        return map(lambda x: create_seq(p, np.random.choice(digits, baseline_length)), range(count))
    return map(lambda x: create_seq(p, baseline_in), range(count))

baseline

true_count = 10000
false_count = 10000
first_symbol = 1
seq_len = 15
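# Positive examples share `baseline`; each negative example gets its own random core.
data = list(create_sequences(true_count,0.3, baseline))+list(create_sequences(false_count,0.3, None))
# Split each (sequence, core_positions) pair.
data, places = list(map(lambda x: x[0], data)), list(map(lambda x: x[1], data))
# Prepend a fixed start symbol to every sequence.
data = list(map(lambda x: [first_symbol]+x, data))
target = [True]*true_count+[False]*false_count
# Note: `places` is not shuffled or trimmed, so it no longer lines up with `data` after this.
data, target = shuffle(data, target)
# Keep only sequences shorter than seq_len, then right-pad them with 0 up to seq_len.
trim = list(filter(lambda x: len(x[0])<seq_len, zip(data, target)))
data, target = map(lambda x: x[0], trim), np.array(list(map(lambda x: x[1], trim)))
data = list(map(lambda x: x+[0]*(seq_len-len(x)), list(data)))
# One-hot encode: the result has shape (n_examples, seq_len, 4).
data = np.array(list(map(lambda x: np.eye(4)[x], data)))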
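
The padding and one-hot steps at the end of the cell above can be isolated into a small helper, which makes the resulting shapes easier to inspect. This is a sketch only; `pad_and_one_hot` and its parameters are hypothetical names, not part of the original notebook.

```python
import numpy as np

def pad_and_one_hot(sequences, seq_len, n_symbols=4, pad_value=0):
    """Right-pad each sequence to seq_len, then one-hot encode it."""
    padded = np.array([list(s) + [pad_value] * (seq_len - len(s)) for s in sequences])
    return np.eye(n_symbols)[padded]  # indexing rows of the identity gives one-hot vectors

encoded = pad_and_one_hot([[1, 3, 2], [2, 0, 1, 3, 3]], seq_len=5)
print(encoded.shape)  # (2, 5, 4)
```

Note that padding with 0 makes padded positions indistinguishable from the real symbol 0; a dedicated padding symbol would be one way to avoid that, if it turns out to matter.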

## Stage 4
```A possible solution to the problem is as follows:```
- ```Given such a dataset of sequences, generate a new dataset of random sequences.```
- ```Train a classifier that will determine whether a sequence belongs to the original dataset or the generated dataset. Make sure that this problem is solvable.```
- ```Now train a specific model, containing an attention layer. We can hope that the attention mechanism will learn to use the core sequence when classifying.```
- ```Use the attention visualization to find the symbols of the core sequence.```

```What are the advantages of this solution? Do you think you can make it work? You will certainly need a different kind of attention mechanism for this task than the simple one you already have.```

```Read the paper Neural Machine Translation by Jointly Learning to Align and Translate by Bahdanau, Cho and Bengio. The paper describes an attention mechanism implemented in the context of machine translation. Implement the attention mechanism the authors suggest as a Keras layer. Use the source code of keras.layers.recurrent as a reference. You can find the paper and the source code in the current directory.```

```Basic instructions:```
- ```Use your tutor. A lot. This is a hard exercise.```
- ```Open the source code of the recurrent layers. You will want to implement a layer that inherits from Recurrent.```
- ```Understand the code's flow and which functions you will need to write.```
- ```Start by writing a mechanism that would be a little bit simpler: don't return a sequence, but rather return a single vector.```
- ```Try solving the above problem using your attention mechanism. What problems do you encounter?```
- ```Complete the full mechanism. Assuming Yoshua Bengio didn't lie in his paper, how do you think their architecture overcomes the problem you found?```
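
Before writing the Keras layer, it can help to prototype a single attention step in NumPy. The sketch below follows the additive scoring form from the paper, e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), followed by a softmax and a weighted sum. The function names, shapes and random weights are assumptions made for illustration, and this is only one step of the mechanism, not the full recurrent layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention_step(s_prev, h, W_a, U_a, v_a):
    """One attention step.

    s_prev: (batch, n) previous decoder state
    h:      (batch, T, m) encoder annotations
    W_a: (n, d), U_a: (m, d), v_a: (d,)
    Returns attention weights (batch, T) and context vectors (batch, m).
    """
    proj = np.tanh((s_prev @ W_a)[:, None, :] + h @ U_a)  # (batch, T, d)
    e = proj @ v_a                                        # (batch, T) alignment scores
    alpha = softmax(e, axis=-1)                           # weights over timesteps
    c = np.einsum('bt,btm->bm', alpha, h)                 # weighted sum of annotations
    return alpha, c

rng = np.random.default_rng(0)
s_prev = rng.normal(size=(2, 3))
h = rng.normal(size=(2, 6, 5))
W_a, U_a, v_a = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=4)
alpha, c = additive_attention_step(s_prev, h, W_a, U_a, v_a)
print(alpha.shape, c.shape)  # (2, 6) (2, 5)
```

Each row of `alpha` sums to 1, so plotting it per step gives the attention visualization mentioned in the solution outline.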

In [None]:
from keras.layers import Recurrent
from keras import activations
from keras import initializers
from keras import regularizers
from keras import constraints
from keras.engine import Layer
from keras.engine import InputSpec
from keras.legacy import interfaces

class AttentionLayer(Recurrent):
    def __init__(self, units, computation_length,
                 activation='tanh',
                 recurrent_activation='sigmoid',
                 use_bias=True,
                 kernel_initializer='glorot_uniform',
                 recurrent_initializer='orthogonal',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 recurrent_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 recurrent_constraint=None,
                 bias_constraint=None,
                 dropout=0.,
                 recurrent_dropout=0.,
                 **kwargs):
        super(AttentionLayer, self).__init__(**kwargs)
        self.units = units
        self.computation_length = computation_length
        self.activation = activations.get(activation)
        self.recurrent_activation = activations.get(recurrent_activation)
        self.use_bias = use_bias

        self.kernel_initializer = initializers.get(kernel_initializer)
        self.recurrent_initializer = initializers.get(recurrent_initializer)
        self.bias_initializer = initializers.get(bias_initializer)

        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.recurrent_regularizer = regularizers.get(recurrent_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)

        self.kernel_constraint = constraints.get(kernel_constraint)
        self.recurrent_constraint = constraints.get(recurrent_constraint)
        self.bias_constraint = constraints.get(bias_constraint)

        self.dropout = min(1., max(0., dropout))
        self.recurrent_dropout = min(1., max(0., recurrent_dropout))
        self.alphas = []
        
    def get_constants(self, inputs, training=None):
        return [inputs, K.l2_normalize(inputs, axis = -1)]
    
    def preprocess_input(self, inputs, training=None):
        computation_length = self.computation_length
        to_ret = K.tile(np.ones((1)), K.shape(inputs)[0:1])
        to_ret = K.expand_dims(K.expand_dims(to_ret, -1), -1)
        to_ret = K.repeat_elements(to_ret,computation_length,1)
        return to_ret

    def build(self, input_shape):
        if isinstance(input_shape, list):
            input_shape = input_shape[0]

        batch_size = input_shape[0] if self.stateful else None
        self.input_time_length = input_shape[1]
        self.input_dim = input_shape[2]
        self.input_spec = InputSpec(shape=(batch_size, None, self.input_dim))
        self.state_spec = InputSpec(shape=(batch_size, self.units))

        self.states = [None]
        if self.stateful:
            self.reset_states()

        self.kernel = self.add_weight((self.input_dim, self.units * 6),
                                      name='kernel',
                                      initializer=self.kernel_initializer,
                                      regularizer=self.kernel_regularizer,
                                      constraint=self.kernel_constraint)


        self.U_z = self.kernel[:, :self.units]
        self.C_z = self.kernel[:, self.units*1:self.units*2]
        self.U_r = self.kernel[:, self.units*2:self.units*3]
        self.C_r = self.kernel[:, self.units*3:self.units*4]
        self.U = self.kernel[:, self.units*4:self.units*5]
        self.C = self.kernel[:, self.units*5:self.units*6]

        self.built = True
    
    
    def step(self, inputs, states):
        s_previous = states[0]+1e-8  # previous state; 1e-8 prevents zero-norm vectors
        h_matrix = states[1]  # input sequence (constant, from get_constants)
        h_matrix_normalized = states[2]  # L2-normalized input sequence
        s_previous_normalized = K.l2_normalize(s_previous, axis = -1)
        s_previous_normalized = K.repeat(s_previous_normalized, self.input_time_length)
        # Cosine similarity between the previous state and every input timestep.
        e_i_j = K.sum(s_previous_normalized*h_matrix_normalized, axis = -1)
        alpha_i_j = K.softmax(e_i_j)  # attention weights over timesteps
        
        alpha_i_j = K.repeat_elements(K.expand_dims(alpha_i_j, axis = -1), rep = self.units, axis = 2)
        c_i = K.sum(alpha_i_j*h_matrix, axis = 1)  # context vector
        z_i = K.sigmoid(K.dot(s_previous, self.U_z)+ K.dot(c_i, self.C_z))
        z_i = K.cast(z_i>0.5, 'float32')  # hard (0/1) update gate
        r_i = K.sigmoid(K.dot(s_previous, self.U_r) + K.dot(c_i, self.C_r))  # reset gate
        s_tilde = self.activation(K.dot(r_i*s_previous, self.U) + K.dot(c_i, self.C))  # candidate state
        s = (1-z_i)*s_previous +z_i*s_tilde
        return s,[s]


    
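def get_attentions(inputs, layer):
    # NumPy re-implementation of AttentionLayer.step, used to inspect the attention weights.
    # Relies on names defined elsewhere in the notebook: model_att, normalize, softmax and plt.
    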
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    def reLu(x):
        return np.maximum(x,0)
    
    weights = layer.get_weights()[0]
    units = layer.units
    input_time_length = layer.input_time_length
    U_z = weights[:, :units]
    C_z = weights[:, units*1:units*2]
    U_r = weights[:, units*2:units*3]
    C_r = weights[:, units*3:units*4]
    U = weights[:, units*4:units*5]
    C = weights[:, units*5:units*6]
    
    alphas_list = []
    parts_names = ['c_i', 'z_i', 'r_i', 's_tilde', 's_previous', 's_previous_normalized']
    parts = {x:[] for x in parts_names}
    
    inputs = model_att.predict(inputs)
    inputs_normalized = normalize(inputs, axis = -1)
    s_previous = np.random.random((inputs.shape[0], inputs.shape[2]))+1e-8
    for i in range(layer.computation_length):
        s_previous_normalized = normalize(s_previous, axis = -1)
        parts['s_previous_normalized'].append(s_previous_normalized)
        s_previous_normalized = np.repeat(np.expand_dims(s_previous_normalized, 1), input_time_length, axis = 1)

        e_i_j = np.sum(s_previous_normalized*inputs_normalized, axis = -1)
        alpha_i_j = softmax(e_i_j)
        alphas_list.append(alpha_i_j)

        alpha_i_j = np.repeat(np.expand_dims(alpha_i_j, axis = -1), units, -1)
        c_i = np.sum(alpha_i_j*inputs, axis = 1)

        z_i = sigmoid(np.dot(s_previous,U_z) + np.dot(c_i,C_z))
        r_i = sigmoid(np.dot(s_previous,U_r) + np.dot(c_i,C_r))
        s_tilde = np.tanh(np.dot(r_i*s_previous, U) + np.dot(c_i,C))
        parts['s_previous'].append(s_previous)
        z_i = z_i>0.5
        s_previous = (1-z_i)*s_previous +z_i*s_tilde
        
        parts['c_i'].append(c_i)
        parts['z_i'].append(z_i)
        parts['r_i'].append(r_i)
        parts['s_tilde'].append(s_tilde)
        
        
    for alpha in alphas_list:
        alpha = alpha.reshape(-1)
        plt.bar(range(len(alpha)), alpha)
        plt.show()
    return alphas_list, parts, inputs, inputs_normalized
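
`get_attentions` calls `normalize` and `softmax`, which are not defined in this section. The sketch below shows NumPy stand-ins under the assumption that both operate along the last axis; these are hypothetical helpers, not necessarily the ones the notebook used. It then computes the cosine-similarity attention weights for a single step on random data.

```python
import numpy as np

def normalize(x, axis=-1):
    """L2-normalize x along `axis` (assumed behaviour of the notebook's `normalize`)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, 1e-12)  # guard against division by zero

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis` (assumed behaviour of `softmax`)."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(2, 7, 3))  # (batch, time, features)
state = rng.normal(size=(2, 3))      # (batch, features)

# Cosine similarity between the state and every timestep, then softmax over time.
scores = np.sum(normalize(state)[:, None, :] * normalize(inputs), axis=-1)
alphas = softmax(scores)
print(alphas.shape)  # (2, 7)
```

Because cosine similarities lie in [-1, 1], the softmax over them can never be very peaked, which is worth keeping in mind when reading the attention plots.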