## Preprocessing for seq2seq

This is my second attempt at understanding the deep learning model.
I want to explain every step to myself for future reference, because just following the exercises left a lot of open questions.

In this example we build a limited ENG-ESP translator.

**Keras implementation** needs 3 things:
1. vocabulary sets for input and target data (ENG + ESP)
2. total number of unique word tokens for each set
3. the maximum sentence length for each language
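
To make these three ingredients concrete, here is a minimal sketch on two made-up sentence pairs. The `pairs` list is a hypothetical stand-in for the real data file, and the sentences are assumed to be tokenized already.

```python
# Hypothetical toy data standing in for span-eng.txt: (English, Spanish) pairs
pairs = [
    ("I run .", "<START> Yo corro . <END>"),
    ("Go on .", "<START> Continúa . <END>"),
]

# 1. vocabulary sets for input and target data
input_vocab = set()
target_vocab = set()
for eng, esp in pairs:
    input_vocab.update(eng.split())
    target_vocab.update(esp.split())

# 2. total number of unique tokens for each set
num_input_tokens = len(input_vocab)
num_target_tokens = len(target_vocab)

# 3. the maximum sentence length (in tokens) for each language
max_input_len = max(len(eng.split()) for eng, _ in pairs)
max_target_len = max(len(esp.split()) for _, esp in pairs)

print(sorted(input_vocab))
print(num_input_tokens, num_target_tokens, max_input_len, max_target_len)
```

Note that the start and end markers count as tokens on the target side, both in the vocabulary and in the sentence length.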

Preprocessing:
- noise removal (case, punctuation)
- mark the start and the end of each sentence (like adding <START> + <END> to the text)
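
A small sketch of these two preprocessing steps, using the same regex as the notebook code. The helper name `preprocess` is hypothetical, and the lowercasing is an assumption of this sketch: the notebook code itself keeps the original case.

```python
import re

def preprocess(sentence, add_markers=False):
    """Lowercase, split words from punctuation, optionally add sentence markers."""
    # [\w']+ matches words (including apostrophes), [^\s\w] matches single punctuation marks
    tokens = re.findall(r"[\w']+|[^\s\w]", sentence.lower())
    text = " ".join(tokens)
    if add_markers:
        text = "<START> " + text + " <END>"
    return text

print(preprocess("Don't stop!"))
print(preprocess("¡No pares!", add_markers=True))
```

Only the target sentences get the markers, because only the decoder needs to know where a sentence starts and ends.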

Our data is in the file span-eng.txt in this folder.
Further remarks are in the code comments.

In [1]:
from tensorflow import keras
import re
# Importing our translations
data_path = "span-eng.txt"
# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
  lines = f.read().split('\n')

# Building empty lists to hold sentences
input_docs = []
target_docs = []
# Building empty vocabulary sets to hold the unique words from each language
input_tokens = set()
target_tokens = set()

for line in lines:
  # Skip blank lines (e.g. from a trailing newline at the end of the file),
  # which would otherwise break the tab split below
  if not line.strip():
    continue
  # Input and target sentences are separated by tabs,
  # so we split them into two variables
  input_doc, target_doc = line.split('\t')
  # Splitting words from punctuation in the input sentence
  # and appending it to input_docs
  input_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", input_doc))
  input_docs.append(input_doc)
  # Splitting words from punctuation in the target sentence
  target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
  # Redefine target_doc below (adding the start and end marks)
  # and append it to target_docs:
  target_doc = "<START> " + target_doc + " <END>"
  target_docs.append(target_doc)

  
  # Now we split up each sentence into words
  # and add each unique word to our vocabulary set
  for token in input_doc.split():
    #print(token)
    # set.add() ignores duplicates, so no membership check is needed
    input_tokens.add(token)

  for token in target_doc.split():
    #print(token)
    target_tokens.add(token)
    
# Why do we cast the vocabulary sets to sorted lists?
# Sets in Python are unordered, so we convert them to sorted lists to give every token
# a fixed, reproducible position. That position becomes the token's index in the one-hot vectors.
input_tokens = sorted(input_tokens)
target_tokens = sorted(target_tokens)
# print(input_tokens, '\n', target_tokens)
# Create num_encoder_tokens and num_decoder_tokens:
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

# What counts as the length of a sequence? It is the number of tokens per sentence.
# This exercise tokenizes by word, so it counts the words (and punctuation marks),
# while the Keras tutorial is character-level and counts the characters in the input and target docs.
# According to ChatGPT: a word-level model needs the number of words, a character-level model the number of chars.
'''for target_doc in target_docs:
  print(target_doc, len(target_doc))
for target_doc in target_docs:
  target = re.findall(r"[\w']+|[^\s\w]", target_doc)
  print(target, len(target))'''

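# Note: the regex splits "<START>" and "<END>" into three tokens each ("<", "START", ">"),
# so the markers count as 6 tokens instead of 2 and max_decoder_seq_length comes out 4 too large.
# This only adds zero padding to the decoder arrays; target_doc.split() would give the exact count.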
try:
  max_encoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs])
  print(max_encoder_seq_length)
  max_decoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", target_doc)) for target_doc in target_docs])
  print(max_decoder_seq_length)
except ValueError:
  pass

4
12


### The training setup

Keras needs a matrix of one-hot vectors.
A one-hot vector has one position for every token in the vocabulary; every position is 0 except the one belonging to the token we are currently working on, which is 1.
To keep track of which position belongs to which token we create 2 feature dictionaries and 2 reverse feature dictionaries:
- one mapping the input tokens to indexes (for the encoder one-hot vectors)
- one mapping the target tokens to indexes (for the decoder one-hot vectors)
- one for translating the input indexes back to actual words
- one for translating the target indexes back to actual words (used when decoding)
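
As an illustration, here is a minimal sketch of a feature dictionary, its reverse dictionary and a one-hot vector. The four-token vocabulary and the helper `one_hot` are made up for this example.

```python
import numpy as np

# Hypothetical toy vocabulary, sorted as in the notebook
tokens = sorted(["<START>", "<END>", "hola", "mundo"])

# token -> index, and index -> token
features_dict = {token: i for i, token in enumerate(tokens)}
reverse_features_dict = {i: token for token, i in features_dict.items()}

def one_hot(token):
    """Return a vector of zeros with a single 1 at the token's index."""
    vector = np.zeros(len(features_dict), dtype="float32")
    vector[features_dict[token]] = 1.0
    return vector

vec = one_hot("hola")
print(features_dict)
print(vec)
# Translate the vector back to the word via the reverse dictionary
print(reverse_features_dict[int(np.argmax(vec))])
```

The reverse dictionary is what later turns the index of the most probable output position back into a word.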

To store the vectors we create 3D NumPy arrays of zeros with the dimensions
1. number of documents (sentence pairs)
2. max length of the sequence (input or target)
3. number of unique tokens in the corresponding vocabulary

Actually we will need 3 of these arrays:
1. encoder input data
2. decoder input data
3. decoder target data
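
A short sketch of what such a 3D array looks like once it is filled, assuming a made-up four-token vocabulary and two toy sentences. It also shows that sentences shorter than the maximum length simply keep all-zero rows as padding.

```python
import numpy as np

vocab = {"i": 0, "run": 1, "fast": 2, ".": 3}   # hypothetical token -> index
docs = ["i run .", "i run fast ."]
max_len = max(len(doc.split()) for doc in docs)

# (documents, timesteps, vocabulary size), all zeros to start with
data = np.zeros((len(docs), max_len, len(vocab)), dtype="float32")
for d, doc in enumerate(docs):
    for t, token in enumerate(doc.split()):
        data[d, t, vocab[token]] = 1.0

print(data.shape)
# Number of ones per timestep of the first sentence; the last timestep is padding
print(data[0].sum(axis=1))
```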

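The decoder_target_data is needed for the "teacher forcing" technique.
During training the decoder receives the correct previous token (decoder_input_data) at every timestep and has to predict the next one (decoder_target_data, the same sequence shifted by one timestep), so the model doesn't have to rely on its own predictions.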
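
The one-timestep shift between decoder input and decoder target can be sketched on the token level like this (the sentence is a made-up example):

```python
target_doc = "<START> yo corro . <END>"   # hypothetical target sentence
tokens = target_doc.split()

# Decoder input: the tokens of the sentence, starting with <START>
decoder_input_tokens = tokens
# Decoder target: the same sentence shifted one step ahead, without <START>
decoder_target_tokens = tokens[1:]

aligned = list(zip(decoder_input_tokens, decoder_target_tokens))
for timestep, (seen, expected) in enumerate(aligned):
    print(timestep, "decoder sees:", seen, "| should predict:", expected)
```

In the notebook arrays the final `<END>` token is also written into decoder_input_data, but its target row stays all zeros because nothing follows it.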

In [None]:
import numpy as np

# enumerate yields (index, item) pairs
# NOTE that they are swapped when added to the dictionary, so the token is the key and its index is the value
input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])
target_features_dict = dict(
    [(token, i) for i, token in enumerate(target_tokens)])

# another swap of keys and values in the reverse dictionaries
reverse_input_features_dict = dict(
    (i, token) for token, i in input_features_dict.items())
reverse_target_features_dict = dict(
    (i, token) for token, i in target_features_dict.items())

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# enumerate(zip()) yields the index of each sentence pair together with the pair itself
# line is the index and (input_doc, target_doc) is the item (a tuple)
for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):

  # another enumerate gives back the position (timestep) of each token returned by the regex, and the token
  # this part is only for the input_docs, which go to the encoder_input_data
  for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):

    #print("Encoder input timestep & token:", timestep, token)
    #print(input_features_dict[token])
    # Assign 1. for the current line, timestep, & word in encoder_input_data:
    # this will create the one-hot vector for the token
    encoder_input_data[line, timestep, input_features_dict[token]] = 1

  # now the target_docs go to the decoder_input_data
  for timestep, token in enumerate(target_doc.split()):

    print("Decoder input timestep & token:", timestep, token)
    # Assign 1. for the current line, timestep, & word in decoder_input_data:
    decoder_input_data[line, timestep, target_features_dict[token]] = 1
    if timestep > 0:
      # decoder_target_data is ahead by 1 timestep and doesn't include the start token.
      print("Decoder target timestep:", timestep, token)
      # Assign 1. for the current line, timestep, & word in decoder_target_data:
      decoder_target_data[line, timestep-1, target_features_dict[token]] = 1
      
      

### Encoder Training Setup

The encoder needs 2 layers:
1. an input layer that receives the matrix of one-hot vectors
2. an LSTM layer that produces the outputs and states

For the input layer we pass the shape (None, num_encoder_tokens). None stands for the sequence length, so the layer can handle sentences of varying length. The batch size is not part of the shape and can vary as well.

For the LSTM layer we need to specify the dimensionality (the size of the hidden and cell state vectors; a larger value lets the model mold itself more closely to the training data)
and we need to indicate whether we want to return the state.
We only want the final states back from the encoder, so we set return_state=True and call the LSTM layer with the input layer as its argument.
The LSTM layer will return 3 things:
1. encoder outputs - we don't need them
2. hidden state
3. cell state
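
To get a feeling for what the hidden state and the cell state are, here is a rough NumPy sketch of the standard LSTM update equations run over one sentence. It is not how Keras implements the layer internally: the weights are random and untrained, and the function name `lstm_final_states` is made up for this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_states(inputs, units, seed=0):
    """Run a randomly initialised LSTM over inputs of shape (timesteps, features)."""
    rng = np.random.default_rng(seed)
    features = inputs.shape[1]
    # One weight matrix per gate: input, forget, candidate, output
    W = rng.normal(scale=0.1, size=(4, features, units))
    U = rng.normal(scale=0.1, size=(4, units, units))
    b = np.zeros((4, units))

    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    for x in inputs:
        i = sigmoid(x @ W[0] + h @ U[0] + b[0])   # input gate
        f = sigmoid(x @ W[1] + h @ U[1] + b[1])   # forget gate
        g = np.tanh(x @ W[2] + h @ U[2] + b[2])   # candidate values
        o = sigmoid(x @ W[3] + h @ U[3] + b[3])   # output gate
        c = f * c + i * g          # update the long-term memory
        h = o * np.tanh(c)         # expose part of it as the hidden state
    return h, c

# Hypothetical sentence: 3 timesteps, one-hot over a 5-token vocabulary
sentence = np.eye(5)[[0, 3, 1]]
state_hidden, state_cell = lstm_final_states(sentence, units=8)
print(state_hidden.shape, state_cell.shape)
```

Both states have as many entries as the LSTM has units (256 in the notebook), regardless of how long the sentence was. That fixed-size summary is what the encoder hands over to the decoder.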

In [None]:
from keras.layers import Input, LSTM

# Create the input layer:
encoder_inputs = Input( shape = (None, num_encoder_tokens))

# Create the LSTM layer:
encoder_lstm = LSTM(256, return_state = True)

# Retrieve the outputs and states:
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)

# Put the states together in a list:
encoder_states = [state_hidden, state_cell]

### Decoder Training Setup

The decoder has the same layers as the encoder.
However, we pass in the encoder states as the initial state along with the decoder inputs.
Because the decoder has to predict a token at every timestep, its LSTM layer is created with return_sequences=True.

The LSTM layer also produces 3 outputs:
1. decoder outputs - we need them
2. decoder hidden state - not needed during training
3. decoder cell state - not needed during training

The decoder outputs must be run through an activation function that returns a probability distribution (values that sum to 1).
We'll use the softmax function.
We wrap it in a Dense layer with one unit per unique word in the target vocabulary, so every timestep yields a probability for each word.
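
What softmax does to the raw scores of one timestep can be sketched in plain Python. The vocabulary and the scores below are made up; in the real model the scores come from the Dense layer.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum does not change the result but avoids overflow
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocabulary = ["<END>", "hola", "mundo"]   # hypothetical target vocabulary
scores = [0.5, 2.0, 1.0]                  # made-up raw decoder outputs
probabilities = softmax(scores)

for word, p in zip(vocabulary, probabilities):
    print(word, round(p, 3))
print("sum:", round(sum(probabilities), 6))
```

The word with the highest raw score also gets the highest probability, which is what the decoding step later picks with argmax.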

In [None]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# Encoder training setup
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

# The decoder input and LSTM layers:
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)

# Retrieve the LSTM outputs and states:
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state = encoder_states)

# Build a final Dense layer:
decoder_dense = Dense(num_decoder_tokens, activation = 'softmax')

# Filter outputs through the Dense layer:
decoder_outputs = decoder_dense(decoder_outputs)


### Build and Train the model

We need to **import the Model** class from Keras.
To define the model as seq2seq it needs 2 arguments:
- encoder_inputs and decoder_inputs as a list
- decoder_outputs

Additionally we need to **compile the model** with the following arguments:
- an optimizer (we use rmsprop), which adjusts the weights to minimize the loss
- a loss function to measure the error (we use categorical cross-entropy, which is based on the logarithm of the predicted probability)
- accuracy as a metric, because we want to see it during the training
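
To see what the loss measures, here is a hand-rolled sketch of categorical cross-entropy for a single timestep. The predictions are made-up numbers, and Keras additionally averages the loss over timesteps and samples, which this sketch leaves out.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Loss for one timestep: minus the log of the probability given to the true token."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs))

target = [0.0, 1.0, 0.0]              # the true token is at index 1
good_prediction = [0.1, 0.8, 0.1]     # made-up softmax outputs
bad_prediction = [0.7, 0.2, 0.1]

loss_good = categorical_crossentropy(target, good_prediction)
loss_bad = categorical_crossentropy(target, bad_prediction)
print(round(loss_good, 4))
print(round(loss_bad, 4))
```

The more probability the model puts on the correct token, the closer the loss gets to 0.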

Lastly we need to fit the compiled model. It needs the following arguments:
- encoder input data and decoder input data as a list (what to train on)
- decoder target data (what we expect to return)
- batch size (no fixed rule: some problems work better with smaller batches, others with bigger ones)
- epochs (cycles of training; more cycles mean the model is trained longer on the dataset and training takes more time)
- validation_split (the fraction of the data held out for validation)
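
A rough sketch of how validation_split and batch_size interact, on ten made-up samples. The helper `split_for_validation` is hypothetical; it mimics the documented Keras behaviour of holding out the last fraction of the samples before shuffling.

```python
def split_for_validation(samples, validation_split):
    """Hold out the last fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]

samples = list(range(10))   # hypothetical stand-in for 10 sentence pairs
train, val = split_for_validation(samples, 0.2)
print(len(train), len(val))

batch_size = 5
# Number of weight updates per epoch (ceiling division)
batches_per_epoch = -(-len(train) // batch_size)
print(batches_per_epoch)
```

Because the held-out samples come from the end of the data, a file that is sorted (for example by sentence length) gives a validation set that looks different from the training set.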

In [None]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# Encoder training setup
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

# Decoder training setup:
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Building the training model:
training_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

print("Model summary:\n")
training_model.summary()
print("\n\n")

# Compile the model:
training_model.compile(optimizer='rmsprop', 
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Choose the batch size
# and number of epochs:
batch_size = 5
epochs = 100

print("Training the model:\n")
# Train the model:
training_model.fit([encoder_input_data, decoder_input_data], 
          decoder_target_data,
          batch_size = batch_size,
          epochs = epochs,
          validation_split=0.2)


### Setup for Testing

To actually generate text we need to modify the seq2seq architecture.
The training setup works only because we already know the target sequence.

1. We need an encoder model with placeholders for the encoder inputs and encoder states:
   `encoder_model = Model(encoder_inputs, encoder_states)`
2. We need placeholders for the decoder input states. These will be additional input layers for the decoder:
   ```
   latent_dim = 256
   decoder_state_input_hidden = Input(shape=(latent_dim,))
   decoder_state_input_cell = Input(shape=(latent_dim,))
   decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]
   ```
3. We'll create new decoder states and outputs by using the decoder LSTM and the decoder dense layer with its activation function:
   ```
   decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)

   # Saving the new LSTM output states:
   decoder_states = [state_hidden, state_cell]

   # Below, we redefine the decoder output
   # by passing it through the dense layer:
   decoder_outputs = decoder_dense(decoder_outputs)
   ```
   The final setup consists of:
   - decoder inputs (the decoder input layer)
   - decoder input states (the final states from the encoder)
   - decoder outputs (the NumPy matrix we get from the final output layer of the decoder)
   - decoder output states (the memory throughout the network from one word to the next)
   ```
   decoder_model = Model(
       [decoder_inputs] + decoder_states_inputs,
       [decoder_outputs] + decoder_states)
   ```
4. We'll need to make everything work step by step:
   - pass the encoder's final hidden state to the decoder
   - sample a token
   - get the updated hidden state back
   - pass the updated hidden state back into the network

The actual code below uses a saved HDF5 training model, so it will not run.


In [None]:
from keras.models import Model, load_model

training_model = load_model('training_model.h5')
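# These next lines are only necessary
# because we're using a saved model:
encoder_inputs = training_model.input[0]
encoder_outputs, state_h_enc, state_c_enc = training_model.layers[2].output
encoder_states = [state_h_enc, state_c_enc]
# The decoder layers should come from the saved model as well, otherwise the
# layer objects from the earlier cells are used instead of the saved weights.
# The indexes assume the layer order of the training model built above.
decoder_inputs = training_model.input[1]
decoder_lstm = training_model.layers[3]
decoder_dense = training_model.layers[4]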

# Building the encoder test model:
encoder_model = Model(encoder_inputs, encoder_states)

latent_dim = 256
# Building the two decoder state input layers:
decoder_state_input_hidden = Input(shape=(latent_dim,))

decoder_state_input_cell = Input(shape=(latent_dim,))

# Put the state input layers into a list:
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]

# Call the decoder LSTM:
decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state = decoder_states_inputs)
decoder_states = [state_hidden, state_cell]

# Redefine the decoder outputs:
decoder_outputs = decoder_dense(decoder_outputs)

# Build the decoder test model:
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)


### Create the Test Function

The function needs to
- accept a NumPy matrix representing the input sentence
- use the encoder and decoder we've created to generate the output
- inside the function we'll use the `.predict()` method of the encoder model, which takes in the new input and creates the output states we can pass to the decoder
- then we need an empty array for the decoder input with 3 dimensions (1, 1, num_decoder_tokens)
- we know the first token ("<START>"), so we can put it into the first timestep
- then we need an empty string which will contain the word-by-word translation; at the end this string will be returned
- we'll decode the sentence word by word using the output states that we retrieved from the encoder (which become our decoder's initial states)
- we'll also update the decoder states after each word so that we use previously decoded words to help decode new ones
- we need a while loop to go word by word. It will run until we hit the "<END>" token or the maximum sentence length
- inside the loop we use the `.predict()` method of the decoder model to get the possible next words and their probabilities
- we'll use NumPy's `.argmax()` method to determine the token with the highest probability
- we'll add the word to the string which will be returned at the end

This block will not run either, because it also depends on the saved model.

In [None]:

training_model = load_model('training_model.h5')
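encoder_inputs = training_model.input[0]
encoder_outputs, state_h_enc, state_c_enc = training_model.layers[2].output
encoder_states = [state_h_enc, state_c_enc]
encoder_model = Model(encoder_inputs, encoder_states)
# Decoder layers from the saved model (indexes assume the layer order of the training model)
decoder_inputs = training_model.input[1]
decoder_lstm = training_model.layers[3]
decoder_dense = training_model.layers[4]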

latent_dim = 256
decoder_state_input_hidden = Input(shape=(latent_dim,))
decoder_state_input_cell = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]
decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_hidden, state_cell]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

def decode_sequence(test_input):
  encoder_states_value = encoder_model.predict(test_input)
  decoder_states_value = encoder_states_value
  target_seq = np.zeros((1, 1, num_decoder_tokens))
  target_seq[0, 0, target_features_dict['<START>']] = 1.
  decoded_sentence = ''
  
  stop_condition = False
  while not stop_condition:
    # Run the decoder model to get possible 
    # output tokens (with probabilities) & states
    output_tokens, new_decoder_hidden_state, new_decoder_cell_state = decoder_model.predict([target_seq] + decoder_states_value)

    # Choose token with highest probability
    sampled_token_index = np.argmax(output_tokens[0, -1, :])
    # Use sampled_token_index to find our next word in reverse_target_features_dict. Save the result to sampled_token
    sampled_token = reverse_target_features_dict[sampled_token_index]
    decoded_sentence += ' ' + sampled_token
    # Exit condition: either hit max length (counted in tokens, not characters)
    # or find stop token.
    if (sampled_token == '<END>' or len(decoded_sentence.split()) > max_decoder_seq_length):
      stop_condition = True

    # Update the target sequence (of length 1).
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, sampled_token_index] = 1.

    # Update states
    decoder_states_value = [new_decoder_hidden_state, new_decoder_cell_state]

  return decoded_sentence

for seq_index in range(10):
  test_input = encoder_input_data[seq_index: seq_index + 1]
  decoded_sentence = decode_sequence(test_input)
  print('-')
  print('Input sentence:', input_docs[seq_index])
  print('Decoded sentence:', decoded_sentence)
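
Since the cell above depends on the saved model, the decoding loop itself can be tried out with a stand-in. The sketch below replaces `decoder_model.predict` with a hypothetical `fake_decoder_step` that follows a fixed script; the vocabulary, the integer "state" and the function names are all made up for this illustration.

```python
import numpy as np

vocab = ["<END>", "<START>", "hola", "mundo"]      # hypothetical target vocabulary
index_of = {token: i for i, token in enumerate(vocab)}
max_len = 5

def fake_decoder_step(target_seq, state):
    """Stand-in for decoder_model.predict: returns probabilities and a new state."""
    script = ["hola", "mundo", "<END>"]            # what the fake model "wants" to say
    probs = np.full((1, 1, len(vocab)), 0.05)
    probs[0, 0, index_of[script[min(state, len(script) - 1)]]] = 0.85
    return probs, state + 1

def greedy_decode(initial_state=0):
    # Start with a one-hot <START> token as the first decoder input
    target_seq = np.zeros((1, 1, len(vocab)))
    target_seq[0, 0, index_of["<START>"]] = 1.0
    state = initial_state
    decoded = []
    while True:
        probs, state = fake_decoder_step(target_seq, state)
        token_index = int(np.argmax(probs[0, -1, :]))   # most probable token
        token = vocab[token_index]
        # Stop at the end marker or when the token limit is reached
        if token == "<END>" or len(decoded) >= max_len:
            break
        decoded.append(token)
        # Feed the chosen token back in as the next decoder input
        target_seq = np.zeros((1, 1, len(vocab)))
        target_seq[0, 0, token_index] = 1.0
    return " ".join(decoded)

print(greedy_decode())
```

Two choices differ from the notebook function: the tokens are collected in a list, so the length limit counts tokens, and the `<END>` marker is not added to the returned sentence.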