# **Implementing and Analyzing Transformer Models**

## Part 1: Literature Review

### The Evolution of Sequence Models: From RNNs and LSTMs to Transformers

The field of natural language processing (NLP) has witnessed a paradigm shift in recent years with the evolution of sequence models. Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) were pioneering architectures in processing sequential data. However, their limitations in capturing long-range dependencies and parallelizing computations motivated the development of the Transformer model.

Simple RNNs suffer from vanishing or exploding gradients, which impedes their ability to capture relationships in long sequences; LSTMs mitigate the vanishing-gradient problem with gating but do not remove it entirely. Additionally, the inherently sequential nature of both architectures hinders parallelization, limiting computational efficiency. Transformers, introduced by Vaswani et al. in 2017, addressed these issues by employing a self-attention mechanism. This mechanism lets the model weigh each part of the input sequence differently when processing a given position, enabling it to capture long-range dependencies effectively.
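To make the gradient issue concrete, here is a minimal numeric sketch. It assumes a scalar linear recurrence `h_t = w * h_{t-1}`, which is a deliberate simplification and not the RNN trained later in this notebook; the function name `gradient_through_time` is hypothetical. Under that assumption, the gradient of `h_T` with respect to `h_0` is `w**T`, which shrinks or grows exponentially with the number of time steps.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Gradient of h_T w.r.t. h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: each step multiplies by d h_t / d h_{t-1} = w
    return grad


# Compare a shrinking, a neutral, and a growing recurrent weight (illustrative values).
for w in (0.9, 1.0, 1.1):
    print(f"w={w}: gradient after 100 steps = {gradient_through_time(w, 100):.3e}")
```

A weight slightly below 1 drives the gradient toward zero over 100 steps, while a weight slightly above 1 makes it grow by several orders of magnitude.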

Moreover, Transformers introduced parallelization across sequence elements, significantly accelerating training times. The attention mechanism also facilitates capturing contextual information in a more structured manner, improving the model's understanding of relationships between words in a sentence.

Despite their success, Transformers have their own challenges, such as scalability (the cost of self-attention grows quadratically with sequence length) and sensitivity to input data. Nevertheless, ongoing research and adaptations, such as the BERT and GPT architectures, continue to refine and extend the capabilities of Transformer models, marking a major step in the evolution of sequence modeling.

## Part 2: Implementation of a Simple RNN

In [1]:
import tensorflow as tf
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense
from keras.preprocessing import sequence

# Load IMDB dataset
max_features = 10000  # Consider the top 10,000 most frequent words
maxlen = 100  # Keep 100 tokens per review (pad_sequences keeps the last 100 by default)
batch_size = 32

print("Loading data...")
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), "train sequences")
print(len(input_test), "test sequences")

# Pad sequences to have a consistent length for input to the RNN
print("Pad sequences (samples x time)")
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print("input_train shape:", input_train.shape)
print("input_test shape:", input_test.shape)

# Build the RNN model
model = Sequential()
model.add(Embedding(max_features, 32))  # Embedding layer to convert integer sequences to dense vectors
model.add(SimpleRNN(32))  # Simple RNN layer with 32 units
model.add(Dense(1, activation='sigmoid'))  # Output layer for binary classification

# Compile the model
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

# Train the RNN model
print("Training the model...")
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=batch_size,
                    validation_split=0.2)

# Evaluate the model on the test set
print("Evaluating the model...")
results = model.evaluate(input_test, y_test)
print("Test loss:", results[0])
print("Test accuracy:", results[1])


Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
input_train shape: (25000, 100)
input_test shape: (25000, 100)
Training the model...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Evaluating the model...
Test loss: 0.8615970015525818
Test accuracy: 0.8011199831962585


In [5]:
import numpy as np

# Function to preprocess and predict sentiment for a new review
def predict_sentiment(model, new_review, max_features=10000, maxlen=100):
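    # Convert the new review to a sequence of integers using the word index.
    # Note: this lookup is case- and punctuation-sensitive and does not apply the
    # index offset (index_from=3) that imdb.load_data uses by default, so the ids
    # may not line up with the training data.
    word_index = imdb.get_word_index()
    new_review_seq = [word_index.get(word, 0) for word in new_review.split()]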

    # Pad the sequence to have a consistent length
    new_review_padded = sequence.pad_sequences([new_review_seq], maxlen=maxlen)

    # Make the prediction
    prediction = model.predict(new_review_padded)

    # Display the result
    print("New Review:", new_review)
    print("Predicted Sentiment:", "Positive" if prediction[0, 0] > 0.5 else "Negative")
    print("Confidence:", round(float(prediction[0, 0] if prediction[0, 0] > 0.5 else 1 - prediction[0, 0]), 4))

# Example usage:
new_review = "I really enjoyed this movie, it was fantastic!"
predict_sentiment(model, new_review)


New Review: I really enjoyed this movie, it was fantastic!
Predicted Sentiment: Negative
Confidence: 0.8685


In [6]:
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

# Increase the number of units and embedding dimension
model1 = Sequential()
model1.add(Embedding(max_features, 64))  # Increase embedding dimension
model1.add(SimpleRNN(64, activation='relu'))  # Increase the number of units; use ReLU instead of the default tanh
model1.add(Dense(1, activation='sigmoid'))

# Switch the optimizer from RMSprop to Adam (default learning rate)
model1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# Train the model
history = model1.fit(input_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
results = model1.evaluate(input_test, y_test)
print("Test loss:", results[0])
print("Test accuracy:", results[1])


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 0.6726372838020325
Test accuracy: 0.7937600016593933


In [7]:
import numpy as np

# Function to preprocess and predict sentiment for a new review
def predict_sentiment(model1, new_review, max_features=10000, maxlen=100):
    # Convert the new review to a sequence of integers using the word index
    word_index = imdb.get_word_index()
    new_review_seq = [word_index.get(word, 0) for word in new_review.split()]

    # Pad the sequence to have a consistent length
    new_review_padded = sequence.pad_sequences([new_review_seq], maxlen=maxlen)

    # Make the prediction
    prediction = model1.predict(new_review_padded)

    # Display the result
    print("New Review:", new_review)
    print("Predicted Sentiment:", "Positive" if prediction[0, 0] > 0.5 else "Negative")
    print("Confidence:", round(float(prediction[0, 0] if prediction[0, 0] > 0.5 else 1 - prediction[0, 0]), 4))

# Example usage:
new_review = "I really enjoyed this movie, it was fantastic!"
predict_sentiment(model1, new_review)


New Review: I really enjoyed this movie, it was fantastic!
Predicted Sentiment: Positive
Confidence: 0.8517


In this task, I used a simple RNN model for sentiment classification on the IMDB dataset. The first model reached a reasonable test accuracy (0.80, with a test loss of 0.86), but it misclassified the sample review as negative. The second model increased the embedding dimension and the number of units, used ReLU activation, and switched the optimizer from RMSprop to Adam at its default learning rate. Test accuracy dropped slightly (0.79) and the confidence on the sample review was slightly lower (0.85 versus 0.87), but the test loss improved (0.67) and the sample review was classified correctly.
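One possible reason the single-review predictions are unreliable is the encoding step. With its default arguments, `imdb.load_data` shifts every word index by `index_from=3` and reserves ids 0, 1, and 2 for padding, start-of-sequence, and out-of-vocabulary tokens, whereas `imdb.get_word_index()` returns unshifted indices keyed by lowercase words. The sketch below shows an encoder that follows those conventions. The helper name `encode_review`, the regular-expression tokenizer, and the toy `word_index` are assumptions for illustration; in the notebook the dictionary would come from `imdb.get_word_index()`.

```python
import re


def encode_review(text, word_index, num_words=10000, index_from=3,
                  start_id=1, oov_id=2):
    """Encode raw text using the id conventions of imdb.load_data's defaults."""
    # Lowercase and strip punctuation (an assumed, simple tokenizer).
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    ids = [start_id]
    for token in tokens:
        rank = word_index.get(token)  # unshifted index, or None if unknown
        idx = rank + index_from if rank is not None else None
        # Unknown words, and words beyond the vocabulary cap, map to oov_id.
        ids.append(idx if idx is not None and idx < num_words else oov_id)
    return ids


# Toy word index with hypothetical ranks, for illustration only.
toy_index = {"this": 11, "movie": 17, "was": 13, "fantastic": 774}
print(encode_review("This movie, was FANTASTIC!", toy_index))  # [1, 14, 20, 16, 777]
```

The resulting list can be padded with `sequence.pad_sequences([ids], maxlen=maxlen)` in the same way as in the cells above.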

## Part 3: Implementing an LSTM Model

In [8]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Single LSTM layer with the same embedding dimension and number of units as the second RNN model
model2 = Sequential()
model2.add(Embedding(max_features, 64))
model2.add(LSTM(64, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))

# Compile with the Adam optimizer (default learning rate)
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

# Train the model
history = model2.fit(input_train, y_train, epochs=10, batch_size=batch_size, validation_split=0.2)

# Evaluate the model
results = model2.evaluate(input_test, y_test)
print("LSTM Test loss:", results[0])
print("LSTM Test accuracy:", results[1])


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
LSTM Test loss: 0.5941140055656433
LSTM Test accuracy: 0.6905999779701233


In [9]:
import numpy as np

# Function to preprocess and predict sentiment for a new review
def predict_sentiment(model, new_review, max_features=10000, maxlen=100):
    # Convert the new review to a sequence of integers using the word index
    word_index = imdb.get_word_index()
    new_review_seq = [word_index.get(word, 0) for word in new_review.split()]

    # Pad the sequence to have a consistent length
    new_review_padded = sequence.pad_sequences([new_review_seq], maxlen=maxlen)

    # Make the prediction with the model passed in (not a global)
    prediction = model.predict(new_review_padded)

    # Display the result
    print("New Review:", new_review)
    print("Predicted Sentiment:", "Positive" if prediction[0, 0] > 0.5 else "Negative")
    print("Confidence:", round(float(prediction[0, 0] if prediction[0, 0] > 0.5 else 1 - prediction[0, 0]), 4))

# Example usage:
new_review = "I really enjoyed this movie, it was fantastic!"
predict_sentiment(model2, new_review)


New Review: I really enjoyed this movie, it was fantastic!
Predicted Sentiment: Negative
Confidence: 0.6549


In [17]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.callbacks import EarlyStopping

# Stack two LSTM layers (tanh activation), with dropout after each
model3 = Sequential()
model3.add(Embedding(max_features, 64))
model3.add(LSTM(32, activation='tanh', return_sequences=True))
model3.add(Dropout(0.3))
model3.add(LSTM(32, activation='tanh'))
model3.add(Dropout(0.3))
model3.add(Dense(1, activation='sigmoid'))

# Compile the model
model3.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model3.fit(input_train, y_train, epochs=15, batch_size=batch_size,
                     validation_split=0.2, callbacks=[early_stopping])


# Evaluate the model
results = model3.evaluate(input_test, y_test)
print("LSTM Test loss:", results[0])
print("LSTM Test accuracy:", results[1])


Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
LSTM Test loss: 0.35545817017555237
LSTM Test accuracy: 0.8456799983978271


In [18]:
import numpy as np

# Function to preprocess and predict sentiment for a new review
def predict_sentiment(model, new_review, max_features=10000, maxlen=100):
    # Convert the new review to a sequence of integers using the word index
    word_index = imdb.get_word_index()
    new_review_seq = [word_index.get(word, 0) for word in new_review.split()]

    # Pad the sequence to have a consistent length
    new_review_padded = sequence.pad_sequences([new_review_seq], maxlen=maxlen)

    # Make the prediction with the model passed in (not a global)
    prediction = model.predict(new_review_padded)

    # Display the result
    print("New Review:", new_review)
    print("Predicted Sentiment:", "Positive" if prediction[0, 0] > 0.5 else "Negative")
    print("Confidence:", round(float(prediction[0, 0] if prediction[0, 0] > 0.5 else 1 - prediction[0, 0]), 4))

# Example usage:
new_review = "I really enjoyed this movie, it was fantastic!"
predict_sentiment(model3, new_review)




New Review: I really enjoyed this movie, it was fantastic!
Predicted Sentiment: Positive
Confidence: 0.5095


The initial LSTM model did not improve on the RNN: test accuracy fell to 0.69 and the sample review was misclassified. After stacking a second LSTM layer, adding dropout, switching the activation from ReLU to tanh, and using early stopping, test accuracy rose to 0.85 and the sample review was classified correctly, though with low confidence (0.51).
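The early stopping rule mentioned above can be sketched in plain Python. This is a simplified illustration of a patience-based rule, not the Keras `EarlyStopping` source; the function name `early_stopping_epoch` and the validation losses are made up. It shows how a run with `patience=3` can end after epoch 4 when the best validation loss occurs in epoch 1.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based, for a patience-based rule."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; weights would be restored from best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: best at epoch 1, then three epochs without improvement.
print(early_stopping_epoch([0.40, 0.42, 0.45, 0.50, 0.30]))  # (4, 1)
```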

- Compared with the RNN models, the final LSTM model achieves higher test accuracy (0.85 versus 0.79 to 0.80) and lower test loss (0.36 versus 0.67 to 0.86).

- The inclusion of memory cells and gating mechanisms allows LSTMs to selectively retain or forget information, facilitating improved learning of sequential patterns. The lack of gating mechanisms in simple RNNs limits their ability to regulate the flow of information during training.
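The gating described above can be sketched for a single scalar LSTM cell. This is an illustrative simplification, not the Keras implementation: the weights are hypothetical hand-picked numbers, the name `lstm_step` is an assumption, and real layers operate on vectors and matrices. With a forget gate near 1 and an input gate near 0, the cell state is carried almost unchanged across steps regardless of the inputs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, b)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # updated cell state (the memory)
    h = o * math.tanh(c)              # updated hidden state
    return h, c


# Hypothetical weights: a large forget bias keeps the memory,
# a very negative input bias blocks new writes.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.8
for x in (5.0, -3.0, 2.0):
    h, c = lstm_step(x, h, c, params)
print(round(c, 3))  # stays close to the initial cell state of 0.8
```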

## Part 4: Exploring Attention Mechanisms

In [22]:
from keras.layers import Embedding, LSTM, Dense, Dropout, Input, Concatenate, Attention
from keras.models import Model
from keras.callbacks import EarlyStopping

# Functional API model: two stacked LSTM layers followed by an attention layer
input_layer = Input(shape=(maxlen,))
embedding_layer = Embedding(max_features, 64)(input_layer)

lstm_layer = LSTM(32, activation='tanh', return_sequences=True)(embedding_layer)
lstm_dropout = Dropout(0.3)(lstm_layer)

lstm_layer_2 = LSTM(32, activation='tanh', return_sequences=True)(lstm_dropout)
lstm_dropout_2 = Dropout(0.3)(lstm_layer_2)

# Attention mechanism
attention = Attention()([lstm_dropout_2, lstm_dropout])
attended_layer = Concatenate(axis=-1)([lstm_dropout_2, attention])

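# Use a Dense layer to obtain the final prediction.
# Note: attended_layer is 3D (batch, maxlen, 64), so this Dense layer emits one value per time step
output_layer = Dense(1, activation='sigmoid')(attended_layer)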
model4 = Model(inputs=[input_layer], outputs=[output_layer])

# Compile the model
model4.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Train the model
history_attention = model4.fit(
    input_train,
    y_train.reshape(-1, 1),
    epochs=10,
    batch_size=batch_size,
    validation_split=0.2,
    callbacks=[early_stopping]
)

# Evaluate the model
results_attention = model4.evaluate(input_test, y_test.reshape(-1, 1))
print("LSTM with Attention Test loss:", results_attention[0])
print("LSTM with Attention Test accuracy:", results_attention[1])


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
LSTM with Attention Test loss: 0.37195926904678345
LSTM with Attention Test accuracy: 0.8346184492111206


In [23]:
import numpy as np

# Function to preprocess and predict sentiment for a new review
def predict_sentiment(model, new_review, max_features=10000, maxlen=100):
    # Convert the new review to a sequence of integers using the word index
    word_index = imdb.get_word_index()
    new_review_seq = [word_index.get(word, 0) for word in new_review.split()]

    # Pad the sequence to have a consistent length
    new_review_padded = sequence.pad_sequences([new_review_seq], maxlen=maxlen)

    # Make the prediction with the model passed in (not a global)
    prediction = model.predict(new_review_padded)

    # Display the result
    print("New Review:", new_review)
    print("Predicted Sentiment:", "Positive" if prediction[0, 0] > 0.5 else "Negative")
    print("Confidence:", round(float(prediction[0, 0] if prediction[0, 0] > 0.5 else 1 - prediction[0, 0]), 4))

# Example usage:
new_review = "I really enjoyed this movie, it was fantastic!"
predict_sentiment(model4, new_review)




New Review: I really enjoyed this movie, it was fantastic!
Predicted Sentiment: Positive
Confidence: 0.6102


- An attention mechanism can improve a model's performance in sequence-based tasks such as natural language processing.

- It addresses the challenge of handling long-range dependencies by allowing the model to selectively focus on different parts of the input sequence, mitigating information loss associated with traditional models.

- It allows us to understand which elements of the input sequence are important for predictions.

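- In this experiment, the attention model's confidence on the sample review rose to 0.61 (from 0.51 for the stacked LSTM) and the prediction was correct. Its test accuracy (0.83) and test loss (0.37) were close to, but slightly worse than, those of the stacked LSTM (0.85 and 0.36).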
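As a minimal sketch of the idea behind these points, the function below computes dot-product attention weights for one query over a short sequence in plain Python. It is illustrative only: the vectors are made-up numbers, the names `softmax` and `attend` are assumptions, and it omits the learned projections and score scaling used in Transformer attention. The Keras `Attention` layer used above is documented as dot-product (Luong-style) attention, but this sketch is not its implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return (weights, context) for one query over a sequence of keys and values."""
    # Score each position by the dot product between the query and its key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # The context vector is the attention-weighted average of the values.
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Made-up 2-dimensional vectors for a 3-token sequence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attend([2.0, 0.0], keys, values)
print([round(w, 3) for w in weights])
```

The weights sum to 1 and can be inspected directly, which is what makes attention useful for seeing which positions contributed most to an output.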

## Part 5: Building a Transformer Model

In [1]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [2]:
!pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.3.2-py3-none-any.whl (119 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-2.8.2 sacrebleu-2.3.2


In [1]:
from transformers import MarianMTModel, MarianTokenizer
import torch
import sacrebleu

# Load pre-trained model and tokenizer for English-to-Russian translation
model_name_ru = 'Helsinki-NLP/opus-mt-en-ru'
model_ru = MarianMTModel.from_pretrained(model_name_ru)
tokenizer_ru = MarianTokenizer.from_pretrained(model_name_ru)

input_text_en = "Hello, how are you?"

# Tokenize and convert to tensor
input_ids_en = tokenizer_ru.encode(input_text_en, return_tensors='pt')

# Generate translation to Russian
with torch.no_grad():
    output_ids_ru = model_ru.generate(input_ids_en)

# Decode and print the translation
output_text_ru = tokenizer_ru.decode(output_ids_ru[0], skip_special_tokens=True)
print("Original (English): ", input_text_en)
print("Translation (Russian): ", output_text_ru)

# Example reference translations
reference_translations = ["Привет, как вы?", "Привет, как дела?"]

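# Calculate BLEU score (note: sacrebleu treats each inner list as one reference stream,
# so alternatives for a single sentence would be passed as [[ref1], [ref2]])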
bleu = sacrebleu.corpus_bleu([output_text_ru], [reference_translations])
print("BLEU Score:", bleu.score)



Original (English):  Hello, how are you?
Translation (Russian):  Привет, как дела?
BLEU Score: 42.72870063962342


- Using a standard library such as Hugging Face Transformers makes it possible to load a pretrained model, which is faster and easier than training one from scratch. I used a pretrained English-to-Russian translation model, and the translated sentence was correct. To evaluate it, I used the BLEU metric, which gave a score of 42.73. This figure should be read with care: the output matches the second reference exactly, so the score appears to reflect only the first reference (`sacrebleu.corpus_bleu` expects one list per reference stream, not one list of alternatives for a single sentence), and one sentence is too small a sample to characterize translation quality.

- Compared with the previous models (simple RNN, LSTM, and LSTM with attention), Transformer models generally perform better and train faster, although the comparison here is indirect because the Transformer is pretrained and applied to a different task.

- Simple RNN and LSTM models can suffer from exploding gradients, whereas Transformer training parallelizes across the sequence, which makes it more efficient, especially on large datasets.

- RNN and LSTM models may struggle to capture complex patterns because of their limited parameters. Transformers use their parameters efficiently through self-attention, enabling better representation learning and capturing intricate relationships in data.

- RNNs and LSTMs process tokens sequentially, so information from distant positions must pass through many intermediate steps. Transformers use self-attention, allowing each element in the sequence to attend to all other elements with varying degrees of importance.
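The BLEU figure reported above can be reproduced by hand, which also shows what the score measures. The sketch below is a simplified sentence-level BLEU in plain Python, not sacreBLEU itself. It assumes the sentences are already tokenized with punctuation as separate tokens, a single reference, and a smoothing convention that replaces the first zero n-gram precision with 1 / (2 × count), which is my reading of sacreBLEU's default `exp` smoothing. The function names `ngrams` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hyp, ref, max_n=4):
    """Simplified single-reference BLEU (0-100) over pre-tokenized sentences."""
    log_sum, smooth = 0.0, 1.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0  # hypothesis has fewer than n tokens
        # Clipped matches: an n-gram counts at most as often as it appears in ref.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if matches == 0:
            smooth *= 2.0  # assumed smoothing for zero match counts
            precision = 1.0 / (smooth * total)
        else:
            precision = matches / total
        log_sum += math.log(precision) / max_n
    # The brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * bp * math.exp(log_sum)


hyp = ["Привет", ",", "как", "дела", "?"]
ref_first = ["Привет", ",", "как", "вы", "?"]
print(round(simple_bleu(hyp, ref_first), 2))  # scored against the first reference: 42.73
print(round(simple_bleu(hyp, hyp), 2))        # exact match: 100.0
```

Scored against the first reference, the n-gram precisions are 4/5, 2/4, 1/3, and a smoothed 1/4, whose geometric mean is about 0.4273. That is in line with the value reported above, while an identical reference gives 100.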