<h1 style="text-align:center;"> Seq2Seq task with attention</h1>


Sequence-to-sequence (Seq2Seq) learning involves training models that convert sequences from one domain to another. A typical example is translating a sentence from one language, like German, into another language, such as English. In this case, the main objective is to translate German sentences into their English equivalents.



<p style='text-align:center;'> <img src='https://www.guru99.com/images/1/111318_0848_seq2seqSequ1.png' alt='diagram'></p>

An essential enhancement to Seq2Seq models is the **attention mechanism**, which enables the model to focus on specific parts of the input sequence while generating each word of the output. This mechanism simulates the human ability to selectively concentrate on relevant pieces of information. For example, when translating a sentence, attention helps the model pay closer attention to specific words in the input, depending on the word being translated at that moment.

<p style="text-align:center;"> <img src="https://miro.medium.com/v2/resize:fit:1400/1*BLq79DDclwGh_hG61A-2Zg.png
" alt="Seq2Seq"> </p>

Due to its effectiveness, attention has become a fundamental part of advanced models like **Transformers**, which rely solely on this mechanism to maintain context across sequences.

Seq2Seq models are widely used in natural language processing (NLP) tasks, such as text summarization, speech recognition, and even in modeling biological sequences like DNA. In all of these cases, the input and output are sequences, and the model's job is to generate a new sequence from the given input. Seq2Seq models excel in tasks where structured information needs to be converted into another structured form, making them highly versatile across various domains.

In [1]:
# !pip install -r "../requirements.txt"

In [1]:
import numpy as np
import pandas as pd
import keras
from string import punctuation
import tensorflow as tf
from keras import Model
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector, Layer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [3]:
with open(r'..\datasets\archive (1)/deu.txt', 'rt') as file:
    df = file.readlines()

In [4]:
def preprocess(sent):
    sent = [w.translate(str.maketrans('', '',punctuation)) for w in sent]
    for i in range(len(sent)):
        sent[i] = sent[i].lower()
    return sent

In [5]:
def select(text):
    a, b = [], []
    for line in text:
        line = line.split('\t')
        a.append(line[0]) ; b.append(line[1])
    return pd.Series(a), pd.Series(b)

eng, deu = select(df)

In [6]:
eng, deu = preprocess(eng), preprocess(deu)

### DATA
https://www.kaggle.com/datasets/alincijov/bilingual-sentence-pairs/data
Consist of translations of common sentences used in daily life, here we are using English and Deutch (German) for the Seq2Seq Machine translation model

In [7]:
data = pd.DataFrame({'english' : eng, 'deutsch': deu})

In [8]:
data.head(10)

Unnamed: 0,english,deutsch
0,go,geh
1,hi,hallo
2,hi,grüß gott
3,run,lauf
4,run,lauf
5,wow,potzdonner
6,wow,donnerwetter
7,fire,feuer
8,help,hilfe
9,help,zu hülf


<hr size=5>

#### Out of approximately 220k datapoints, I am considering only the first 100k

In [9]:
data = data[:100_000]

### Data preprocessing

The data processing pipeline looks like :

+ **Tokenization**: Breaking down text into smaller units (tokens), such as words or characters.
+ **Numerical Encoding**: Assigning numerical representations to each token.
+ **Sequence Padding**: Ensuring all sequences have the same length by adding padding tokens.

In [10]:
e_tokenizer,d_tokenizer = Tokenizer(), Tokenizer()
e_tokenizer.fit_on_texts(data['english'])
d_tokenizer.fit_on_texts(data['deutsch'])

In [11]:
e_vocab, d_vocab  = len(e_tokenizer.word_index) + 1, len(d_tokenizer.word_index ) + 1

#### Maximum length of sentences
To find the maximum number of words present in the dataset for both language. This isn't necessary and can be chosen subjectively as per required. However for this case I was experimenting a lot thus chose to let the dataset decide the value for this once. 

In [12]:
Len = lambda arr: max(len(i.split(" ")) for i in arr)
e_max_len, d_max_len = Len(data['english']), Len(data['deutsch'])

### Converting text to Numeric sequences

In [13]:
def encode(text, tokenizer, max_len):
    text =  tokenizer.texts_to_sequences(text)
    text = pad_sequences(text, max_len, padding = 'post')
    return text

X, y = encode(data['english'],e_tokenizer,e_max_len), encode(data['deutsch'],d_tokenizer, d_max_len)

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<hr size=3>

<h1 style="text-align:center;">Base Seq2Seq Model</h1>

Seq2Seq models are constructed using key components like Long Short-Term Memory (LSTM) units, which are specialized types of recurrent neural networks (RNNs) that effectively capture temporal dependencies in data. LSTMs help in processing sequences by remembering important information over longer periods and forgetting irrelevant details, making them useful for handling tasks involving sequences of varying lengths.




<p style="text-align:center;"> <img src="https://miro.medium.com/v2/resize:fit:1400/1*Ismhi-muID5ooWf3ZIQFFg.png" alt="Seq2Seq"> </p>

I found the following notebook https://www.kaggle.com/code/harshjain123/machine-translation-seq2seq-lstms by Harsh Jain very insightful to understand  this process, its will be a great place to understand the data processing pipeline.

In [15]:
from tensorflow.keras.utils import register_keras_serializable

@register_keras_serializable(package="Custom", name="S2S")
class S2S(Model):
    def __init__(self, in_vocab, out_vocab, in_timesteps, out_timesteps, units, **kwargs):
        super(S2S, self).__init__(**kwargs)
        
        self.in_vocab = in_vocab
        self.out_vocab = out_vocab
        self.in_timesteps = in_timesteps
        self.out_timesteps = out_timesteps
        self.units = units
        
        # Define the layers
        self.embed = Embedding(input_dim=in_vocab, output_dim=units, mask_zero=True)
        self.encoder_lstm = LSTM(units)
        self.r_vector = RepeatVector(out_timesteps)
        self.decoder_lstm = LSTM(units, return_sequences=True)
        self.dense = Dense(out_vocab, activation='softmax')
    
    def call(self, inputs):
        # Define the forward pass
        x = self.embed(inputs)                           # (batch size, in_timesteps, units)
        x = self.encoder_lstm(x)                         # (batch size, units)
        x = self.r_vector(x)                             # (batch size, out_timesteps, units)
        x = self.decoder_lstm(x)                         # (batch size, out_timesteps, units)
        output = self.dense(x)                           # (batch size, out_timesteps, out_vocab)
        return output

    def get_config(self):
        config = super(S2S, self).get_config()
        config.update({
              'in_vocab': self.in_vocab,
              'out_vocab': self.out_vocab,
              'in_timesteps': self.in_timesteps,
              'out_timesteps': self.out_timesteps,
              'units': self.units
          })
        return config

    @classmethod
    def from_config(cls, config):
        return cls(
            in_vocab=config['in_vocab'],
            out_vocab=config['out_vocab'],
            in_timesteps=config['in_timesteps'],
            out_timesteps=config['out_timesteps'],
            units=config['units']
        )

In [17]:
seq2seq = S2S(e_vocab, d_vocab, e_max_len,d_max_len, 512)
seq2seq.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
out_shape = (y_train.shape[0], y_train.shape[1], 1)

In [18]:
seq2seq.fit(x=X_train, y = y_train.reshape(out_shape), epochs = 15, batch_size = 20, validation_batch_size=0.25)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x113a143fc70>

In [19]:
seq2seq.summary()

Model: "s2s"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  4724736   
                                                                 
 lstm (LSTM)                 multiple                  2099200   
                                                                 
 repeat_vector (RepeatVector  multiple                 0         
 )                                                               
                                                                 
 lstm_1 (LSTM)               multiple                  2099200   
                                                                 
 dense (Dense)               multiple                  8540424   
                                                                 
Total params: 17,463,560
Trainable params: 17,463,560
Non-trainable params: 0
___________________________________________________

In [20]:
seq2seq.fit(x=X_train, y = y_train.reshape(out_shape), epochs = 5, batch_size = 10, validation_batch_size=0.25)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x113b792fcd0>

(for 25 epochs)
### Final Loss : 0.1930
### Training Time : 45 minutes 55 seconds

In [21]:
seq2seq.save("s2s_model_tf", save_format="tf")



INFO:tensorflow:Assets written to: s2s_model_tf\assets


INFO:tensorflow:Assets written to: s2s_model_tf\assets


## Seq2Seq with Attention:

Traditional Seq2Seq models often struggle with long input sequences. The encoder encodes the entire input sequence into a fixed-size vector, which can lead to information loss, especially for longer sequences.

The attention mechanism addresses this limitation by allowing the decoder to focus on relevant parts of the input sequence at each decoding step. This enables the model to better capture long-range dependencies and produce more accurate outputs.


Implemented using an attention layer that computes a context vector by attending to all encoder outputs.
The context vector is combined with the decoder's hidden state at each timestep.


+ Encoder LSTM now returns the full sequence of hidden states (encoder_outputs) and the final hidden state (state_h, state_c).
+ Attention is applied at each timestep of the decoder to produce a context vector that is combined with the decoder input at that timestep.
+ The attention mechanism allows the decoder to focus on different parts of the input sequence when producing each output.



<p style="text-align:center;"> <img src="https://lena-voita.github.io/resources/lectures/seq2seq/attention/general_scheme-min.png
" alt="Seq2Seq"> </p>

 Img sources : https://lena-voita.github.io/resources/lectures/seq2seq/attention/general_scheme-min.png

#### Bahdanau Attention Layer:

The model calculates alignment scores between the encoder's hidden states and the current decoder's hidden state at each decoding step. The relevance or significance of each encoder hidden state in relation to the current decoding phase is represented by these scores.

+ Takes the encoder outputs (sequence of hidden states) and the decoder hidden state to compute the attention weights.
+ Computes the context vector, which is a weighted sum of the encoder outputs based on attention weights.

In [16]:
class Attention(Layer):
    
    def __init__(self, units):
        
        super(Attention, self).__init__()
        self.units = units
        self.W1 = Dense(units)
        self.W2 = Dense(units)
        self.V = Dense(1)
        
        
    def call(self, query, value):
        
        # query shape == (batch_size, hidden_size) -> decoder hidden state at the current timestep
        # values shape == (batch_size, max_len, hidden_size) -> encoder outputs (all timesteps)
        
        q_time = tf.expand_dims(query, axis = 1)
        score = self.V(tf.nn.tanh(self.W1(q_time)+self.W2(value)))
        weights = tf.nn.softmax(score, axis=1)
        context = weights * value
        context = tf.reduce_sum(context, axis=1)          # (batch_size, hidden_size)
        return context, weights
    
    def get_config(self):
        config = super(Attention, self).get_config()
        config.update({
            'units': self.units,
        })
        return config

    @classmethod
    def from_config(cls, config):
        return cls(**config)

#### Call action:

+ After encoding, the attention mechanism is applied for each timestep in the decoder.
+ The **context vector** and the **decoder input** at each timestep are concatenated and passed to the decoder LSTM.
+ The result is a sequence of outputs, one for each decoder timestep.


In [17]:
from tensorflow.keras.utils import register_keras_serializable

@register_keras_serializable(package="Custom", name="S2SA")
class S2SA(Model):

    def __init__(self, in_vocab, out_vocab, in_len, out_len, units, **kwargs):
        super(S2SA, self).__init__(**kwargs)
        
        self.in_vocab = in_vocab
        self.out_vocab = out_vocab
        self.in_len = in_len
        self.out_len = out_len
        self.units = units
        
        self.embed = Embedding(input_dim=in_vocab, output_dim=units, mask_zero = True)
        self.encoder_lstm = LSTM(units, return_sequences=True, return_state=True)
        self.attention = Attention(units)
        self.r_vectors = RepeatVector(out_len)
        self.decoder_lstm = LSTM(units, return_sequences=True, return_state=True)
        self.dense = Dense(out_vocab, activation = 'softmax')
        
    
    def call(self, inputs):
        x = self.embed(inputs)
        e_out, e_h_state, e_c_state = self.encoder_lstm(x)       # (batch_size, in_timesteps, units), states
        d_in = self.r_vectors(e_h_state)                       # (batch_size, out_timesteps, units)
        
        d_h_state, d_c_state = e_h_state, e_c_state
        all_dec_out = []
        

        for t in range(d_in.shape[1]):
            
            d_at_t = d_in[:, t:t+1, :]                                            # (batch_size, 1, units) at t timestep
            
            context_vector,_= self.attention(e_h_state, e_out)                      # (batch_size, units)
                                                                                  # TO MATCH THE DIMS OF D_in and context
            context_vector = tf.expand_dims(context_vector, axis=1)               # (batch_size, 1, units)
            
            context_w_inputs = tf.concat([context_vector, d_at_t], axis = -1)     # (batch_size, 1, 2 x units)
            
            d_out,d_h_state, d_c_state = self.decoder_lstm(context_w_inputs, initial_state = [d_h_state, d_c_state])
            d_out = self.dense(d_out)
            
            all_dec_out.append(d_out)
            
        
        d_out = tf.concat(all_dec_out, axis = 1)            # To aggregate outputs across timesteps
        
        return d_out
    
    def get_config(self):
        config = super(S2SA, self).get_config()
        config.update({
              'in_vocab': self.in_vocab,
              'out_vocab': self.out_vocab,
              'in_len': self.in_len,
              'out_len': self.out_len,
              'units': self.units
          })
        return config

    @classmethod
    def from_config(cls, config):
        return cls(
            in_vocab=config['in_vocab'],
            out_vocab=config['out_vocab'],
            in_len=config['in_len'],
            out_len=config['out_len'],
            units=config['units']
        )

In [19]:
attention_model = S2SA(e_vocab, d_vocab, e_max_len, d_max_len, 512)
attention_model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy')

In [25]:
attention_model.fit(X_train, y_train.reshape(out_shape), epochs = 15, batch_size=20, validation_batch_size=0.25)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x113b844e700>

In [26]:
attention_model.summary()


Model: "s2sa"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     multiple                  4724736   
                                                                 
 lstm_2 (LSTM)               multiple                  2099200   
                                                                 
 attention (Attention)       multiple                  525825    
                                                                 
 repeat_vector_1 (RepeatVect  multiple                 0         
 or)                                                             
                                                                 
 lstm_3 (LSTM)               multiple                  3147776   
                                                                 
 dense_4 (Dense)             multiple                  8540424   
                                                              

In [27]:
attention_model.fit(X_train, y_train.reshape(out_shape), epochs = 5, batch_size=10, validation_batch_size=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x11767a5cf40>

(for 25 epochs)
### Final Loss : 0.1697
### Training Time : 5 hours, 45 minutes


In [28]:
# attention_model.save("s2sa.keras")
attention_model.save("s2sa", save_format="tf")



INFO:tensorflow:Assets written to: s2sa\assets


INFO:tensorflow:Assets written to: s2sa\assets


## Translation comparisons

In [29]:
def test(model1, model2, batch_size, size=len(X_test),):
    p = np.argmax(model1.predict(X_test[:size], batch_size=batch_size), axis=-1)
    q = np.argmax(model2.predict(X_test[:size], batch_size=batch_size), axis=-1)
    real = e_tokenizer.sequences_to_texts(X_test[:size])
    pred1 = d_tokenizer.sequences_to_texts(p)
    pred2 = d_tokenizer.sequences_to_texts(q)
    trans = d_tokenizer.sequences_to_texts(y_test[:size])
    d = {'Eng': real,'Deu': trans, 'Base_Model': pred1, 'Attention_Model': pred2}
    return pd.DataFrame(d)

In [30]:
s = {'S2S':S2S}
seq2seq = keras.models.load_model('s2s_model_tf', custom_objects=s)

In [31]:
sa = {'S2SA':S2SA}
attention_model = keras.models.load_model('s2sa', custom_objects=sa)

In [32]:
import tensorflow.keras.backend as K
K.clear_session()

In [33]:
df_test = test(seq2seq,attention_model, 10,4000)
df_test.head(20)



Unnamed: 0,Eng,Deu,Base_Model,Attention_Model
0,she decided to marry tom,sie hat sich entschieden tom zu heiraten,sie entschied sich tom zu heiraten heiraten,sie entschied sich tom zu heiraten
1,do not read while walking,lies nicht im gehen,lies sie am am lesen,schlaf nicht nicht mehr mehr
2,which ones mine,welche ist meine,welcher ist meiner,welcher ist meiner
3,this knife is very sharp,dieses messer ist sehr scharf,dieses bier ist sehr trocken,dieser messer ist sehr scharf
4,that was just plain stupid,das war einfach nur dumm,das war schlicht und ergreifend dumm,das war schlicht und ergreifend dumm
5,theres still a lot left,es gibt noch immer eine menge zu tun,es ist noch viel übrig,es ist noch noch übrig
6,please add up the numbers,bitte addiert die zahlen,bitte addiere die zahlen,bitte addiere die die
7,he is playing in his room,er spielt in seinem zimmer,er wohnt in einem zimmer,er spielt in in zimmer
8,tom isnt your brother,tom ist nicht dein bruder,tom ist nicht dein bruder,tom ist nicht dein bruder
9,im not going to stop,ich werde nicht aufhören,ich werde nicht aufhören,ich werde nicht anhalten


In [34]:
def test1(model1, model2, batch_size, size=len(X_test),):
    p = np.argmax(model1.predict(X[:size], batch_size=batch_size), axis=-1)
    q = np.argmax(model2.predict(X[:size], batch_size=batch_size), axis=-1)
    real = e_tokenizer.sequences_to_texts(X[:size])
    pred1 = d_tokenizer.sequences_to_texts(p)
    pred2 = d_tokenizer.sequences_to_texts(q)
    trans = d_tokenizer.sequences_to_texts(y[:size])
    d = {'Eng': real,'Deu': trans, 'Base_Model': pred1, 'Attention_Model': pred2}
    return pd.DataFrame(d)

In [35]:
df_test = test1(seq2seq,attention_model, 10,4000)



In [36]:
df_test[290:310]

Unnamed: 0,Eng,Deu,Base_Model,Attention_Model
290,wake up,wachen sie auf,wach auf,wach auf
291,wake up,wach auf,wach auf,wach auf
292,wake up,wachen sie auf,wach auf,wach auf
293,wash up,wasch dir die hände,wasch dir das gesicht,wasch dir die getreten
294,wash up,wasch dir das gesicht,wasch dir das gesicht,wasch dir die getreten
295,we lost,wir haben verloren,wir haben verloren verirrt,wir haben verloren
296,welcome,willkommen,willkommen,willkommen
297,who ate,wer hat gegessen,wer aß gegessen,wer hat
298,who ate,wer aß,wer aß gegessen,wer hat
299,who ran,wer rannte,wer rannte,wer rannte


## Amount of exact translations by each model

In [37]:
sum(df_test['Deu'] == df_test['Base_Model'])

1867

In [38]:
sum(df_test['Deu'] == df_test['Attention_Model'])

1972

# Observations and Conclusion


**Postivites :**
+ Translations generated by model equipped with attention were more accurate than with the model without, 
+ The model without attention was less likely to mistake one or a few words in a sentence, hence better semantic awareness
+ After more epochs the model with attention coverges at a much lower loss than the model without, however due to hardware limitations I couldn't show it in the same notebook.

**Negetives :**
+ Model with attention takes significanly much time to train approximately times longer (5+ hours as compared to 45 minutes)
+ Model with attention takes *4-5 times longer to run* (approx. 21ms) compared to the model without (approx. 2ms)  When Inferened on GPU P100


Attention mechanisms have become an essential component of modern sequence-to-sequence models, despite longer training time. By allowing the model to focus on relevant parts of the input sequence, attention helps to improve performance, interpretability, and flexibility.


## But why pay 'Attention'?

Since models RNN models which are used sequence modelling need attention despite being specifically designed for the same purpose, it becomes very apparent how powerful attention can be in sequential modelling


Transformer models, which rely entirely on attention mechanisms, have revolutionized the field of natural language processing. They have achieved state-of-the-art results on various tasks, including machine translation, text summarization, and question answering.

## Referenes
+ https://www.kaggle.com/code/harshjain123/machine-translation-seq2seq-lstms : A very helpful notebook for understanding the data processing pipeline
+ https://www.kaggle.com/datasets/alincijov/bilingual-sentence-pairs : Bilingual pair dataset

+ https://www.geeksforgeeks.org/seq2seq-model-in-machine-learning : A good Introduction to seq2seq modelling
+ https://youtu.be/yInilk6x-OY?si=2e6MOB_DdflA60Ar : A great Lecture on attention 

# Experiements

## Seq2Seq


In [22]:
seq2seq = S2S(e_vocab, d_vocab, e_max_len,d_max_len, 512)
seq2seq.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
out_shape = (y_train.shape[0], y_train.shape[1], 1)

In [23]:
seq2seq.fit(x=X_train, y = y_train.reshape(out_shape), epochs = 32, batch_size = 64, validation_batch_size=0.25)

Epoch 1/32
Epoch 2/32
Epoch 3/32
Epoch 4/32
Epoch 5/32
Epoch 6/32
Epoch 7/32
Epoch 8/32
Epoch 9/32
Epoch 10/32
Epoch 11/32
Epoch 12/32
Epoch 13/32
Epoch 14/32
Epoch 15/32
Epoch 16/32
Epoch 17/32
Epoch 18/32
Epoch 19/32
Epoch 20/32
Epoch 21/32
Epoch 22/32
Epoch 23/32
Epoch 24/32
Epoch 25/32
Epoch 26/32
Epoch 27/32
Epoch 28/32
Epoch 29/32
Epoch 30/32
Epoch 31/32
Epoch 32/32


<keras.callbacks.History at 0x22c90fa2dc0>

In [24]:
seq2seq.summary()

Model: "s2s_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     multiple                  4724736   
                                                                 
 lstm_4 (LSTM)               multiple                  2099200   
                                                                 
 repeat_vector_2 (RepeatVect  multiple                 0         
 or)                                                             
                                                                 
 lstm_5 (LSTM)               multiple                  2099200   
                                                                 
 dense_2 (Dense)             multiple                  8540424   
                                                                 
Total params: 17,463,560
Trainable params: 17,463,560
Non-trainable params: 0
_________________________________________________

In [25]:
seq2seq.fit(x=X_train, y = y_train.reshape(out_shape), epochs = 16, batch_size = 32, validation_batch_size=0.25)

Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


<keras.callbacks.History at 0x22ca1ae2850>

In [26]:
seq2seq.fit(x=X_train, y = y_train.reshape(out_shape), epochs = 8, batch_size = 16, validation_batch_size=0.25)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x2294b053850>

(for 25 epochs)
### Final Loss : 0.1930
### Training Time : 45 minutes 55 seconds

In [27]:
seq2seq.save("s2s_model_tf_v2", save_format="tf")



INFO:tensorflow:Assets written to: s2s_model_tf_v2\assets


INFO:tensorflow:Assets written to: s2s_model_tf_v2\assets


## Seq2Seq with Attention


In [28]:
attention_model = S2SA(e_vocab, d_vocab, e_max_len, d_max_len, 512)
attention_model.compile(optimizer = 'rmsprop', loss = 'sparse_categorical_crossentropy')

In [29]:
attention_model.fit(X_train, y_train.reshape(out_shape), epochs = 32, batch_size=64, validation_batch_size=0.25)

Epoch 1/32
Epoch 2/32
Epoch 3/32
Epoch 4/32
Epoch 5/32
Epoch 6/32
Epoch 7/32
Epoch 8/32
Epoch 9/32
Epoch 10/32
Epoch 11/32
Epoch 12/32
Epoch 13/32
Epoch 14/32
Epoch 15/32
Epoch 16/32
Epoch 17/32
Epoch 18/32
Epoch 19/32
Epoch 20/32
Epoch 21/32
Epoch 22/32
Epoch 23/32
Epoch 24/32
Epoch 25/32
Epoch 26/32
Epoch 27/32
Epoch 28/32
Epoch 29/32
Epoch 30/32
Epoch 31/32
Epoch 32/32


<keras.callbacks.History at 0x22ca41c2370>

In [30]:
attention_model.summary()


Model: "s2sa"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     multiple                  4724736   
                                                                 
 lstm_6 (LSTM)               multiple                  2099200   
                                                                 
 attention (Attention)       multiple                  525825    
                                                                 
 repeat_vector_3 (RepeatVect  multiple                 0         
 or)                                                             
                                                                 
 lstm_7 (LSTM)               multiple                  3147776   
                                                                 
 dense_6 (Dense)             multiple                  8540424   
                                                              

In [31]:
attention_model.fit(X_train, y_train.reshape(out_shape), epochs = 16, batch_size=32, validation_batch_size=0.1)

Epoch 1/16
Epoch 2/16
Epoch 3/16
Epoch 4/16
Epoch 5/16
Epoch 6/16
Epoch 7/16
Epoch 8/16
Epoch 9/16
Epoch 10/16
Epoch 11/16
Epoch 12/16
Epoch 13/16
Epoch 14/16
Epoch 15/16
Epoch 16/16


<keras.callbacks.History at 0x22cc9484760>

In [32]:
attention_model.fit(X_train, y_train.reshape(out_shape), epochs = 8, batch_size=16, validation_batch_size=0.1)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x22d283a6850>

(for 25 epochs)
### Final Loss : 0.1697
### Training Time : 5 hours, 45 minutes


In [33]:
# attention_model.save("s2sa.keras")
attention_model.save("s2sa_v2", save_format="tf")



INFO:tensorflow:Assets written to: s2sa_v2\assets


INFO:tensorflow:Assets written to: s2sa_v2\assets


## Translation comparisons

In [20]:
def test(model1, model2, batch_size, size=len(X_test),):
    p = np.argmax(model1.predict(X_test[:size], batch_size=batch_size), axis=-1)
    q = np.argmax(model2.predict(X_test[:size], batch_size=batch_size), axis=-1)
    real = e_tokenizer.sequences_to_texts(X_test[:size])
    pred1 = d_tokenizer.sequences_to_texts(p)
    pred2 = d_tokenizer.sequences_to_texts(q)
    trans = d_tokenizer.sequences_to_texts(y_test[:size])
    d = {'Eng': real,'Deu': trans, 'Base_Model': pred1, 'Attention_Model': pred2}
    return pd.DataFrame(d)

In [21]:
s = {'S2S':S2S}
seq2seq = keras.models.load_model('s2s_model_tf_v2', custom_objects=s)

In [22]:
sa = {'S2SA':S2SA}
attention_model = keras.models.load_model('s2sa_v2', custom_objects=sa)

In [23]:
import tensorflow.keras.backend as K
K.clear_session()

In [30]:
df_test = test(seq2seq,attention_model, 10,4000)
df_test.head(20)



Unnamed: 0,Eng,Deu,Base_Model,Attention_Model
0,she decided to marry tom,sie hat sich entschieden tom zu heiraten,sie entschied sich tom zu heiraten,sie entschied tom tom heiraten heiraten heiraten
1,do not read while walking,lies nicht im gehen,geht es nicht darüber,kann nicht französisch
2,which ones mine,welche ist meine,welches ist meiner,welcher ist toms
3,this knife is very sharp,dieses messer ist sehr scharf,dieses gekauft ist sehr lebt,dieses messer ist sehr
4,that was just plain stupid,das war einfach nur dumm,das war einfach werde dumm dumm,das war einfach so und blöd
5,theres still a lot left,es gibt noch immer eine menge zu tun,es ist noch viel,es ist noch viel viel
6,please add up the numbers,bitte addiert die zahlen,bitte möglich bitte zahlen,bitte sie sie
7,he is playing in his room,er spielt in seinem zimmer,er zimmer sein seinem zimmer zimmer,er spielt in seinem zimmer
8,tom isnt your brother,tom ist nicht dein bruder,das ist nicht dein bruder,tom ist nicht euer bruder
9,im not going to stop,ich werde nicht aufhören,ich werde nicht aufhören,ich werde nicht aufhören


In [31]:
def test1(model1, model2, batch_size, size=len(X_test),):
    p = np.argmax(model1.predict(X[:size], batch_size=batch_size), axis=-1)
    q = np.argmax(model2.predict(X[:size], batch_size=batch_size), axis=-1)
    real = e_tokenizer.sequences_to_texts(X[:size])
    pred1 = d_tokenizer.sequences_to_texts(p)
    pred2 = d_tokenizer.sequences_to_texts(q)
    trans = d_tokenizer.sequences_to_texts(y[:size])
    d = {'Eng': real,'Deu': trans, 'Base_Model': pred1, 'Attention_Model': pred2}
    return pd.DataFrame(d)

In [32]:
df_test = test1(seq2seq,attention_model, 10,4000)



In [33]:
df_test[290:310]

Unnamed: 0,Eng,Deu,Base_Model,Attention_Model
290,wake up,wachen sie auf,steh sie,wach sie auf
291,wake up,wach auf,steh sie,wach sie auf
292,wake up,wachen sie auf,steh sie,wach sie auf
293,wash up,wasch dir die hände,wasch dir die hände,wasch dir das gesicht
294,wash up,wasch dir das gesicht,wasch dir die hände,wasch dir das gesicht
295,we lost,wir haben verloren,wir haben haben,wir haben uns
296,welcome,willkommen,willkommen,willkommen
297,who ate,wer hat gegessen,wer hat gegessen,wer hat
298,who ate,wer aß,wer hat gegessen,wer hat
299,who ran,wer rannte,wer rannte,wer rannte


## Amount of exact translations by each model

In [34]:
sum(df_test['Deu'] == df_test['Base_Model'])

1506

In [35]:
sum(df_test['Deu'] == df_test['Attention_Model'])

1462

# Questions and Answers

How does the attention mechanism improve upon the basic Seq2Seq model, and which type of attention is implemented in this notebook?

***Since the encoder network comprises the data into a single vector and is challenged by long input sequences, the implemented Bahdanau attention helps solve this by learning from different parts of the input. This improves seq2seq's ability to handle longer text.***

What preprocessing steps are applied to the dataset, and why are they critical for training a sequence-to-sequence model?

***The preprocessing pipeline cleans the data by taking an alphanumeric text, and lowercasing all texts for consistency. It was also tokenized to map each unique word to a unique numerical index so that machines can understand it. The vocabulary size and maximum length was calculated, and then the texts per data point was padded to ensure that the inputs are of the same length, which is needed for sequential deep learning models.***

Explain how teacher forcing is applied in the decoder training process and its impact on convergence and performance.

***Teacher forcing gives the true previous target to the decoder during training instead of its own prediction for a faster and more stable convergence. However, this approach can cause train-inference mismatch or exposure bias, where small mistakes compound into larger errors.***

How are attention weights computed and integrated into the decoder’s hidden state during prediction?

***Attention weights are computed by comparing the current decoder state with all encoder outputs to produce normalized softmax alignment scores. The weights indicate which part of the sequence is most relevant at each decoding step. A weighted sum of the encoder outputs is then formed (context vector) which is concatenated with the decoder input and passed through the LSTM, allowing the decoder to make predictions while focusing on the most important parts of the source.***

What limitations can you observe in this implementation, and what modifications (e.g., bidirectional encoder, different attention type, transformer) might improve its performance?

***The current implementation is very limited by the unidirectional nature of the LSTM encoder and reliance on sequential recurrence. Utilizing a bidirectional encoder along with attention would significantly improve performance for longer, more complex text sources.***

# Conclusion

***In conclusion, this activity guided and allowed me to explore base sequence to sequence models and a more complex implementation with attention. The dataset was quite large, I could not finish it all in one sitting. The preprocessing done was textbook definition of those needed for a sequential model: removal of punctuations, lowercase letters, and padding for uniformity of length. Interestingly, the addition of attention showed visible improvement from the base Seq2Seq model. However, adding additional training epochs with lower batch size seemed to cause the model loss to increase, yielding worse performance than the initial models before experimentation. It would be interesting to see how more complex configuration such as bidirectional LSTM or other implementations of attention mechanisms could improve and increase performance on the given dataset.***