# Literature Review

The readings introduce the Transformer, a model architecture that departs from the recurrent models traditionally used for tasks such as machine translation. Recurrent models, RNNs in particular, process a sequence one token at a time. This prevents parallelization within a sequence, slows training, and makes long sequences difficult to model.

The evolution from Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) models to Transformer architectures represents a paradigm shift in natural language processing. RNNs and LSTMs were early attempts at capturing sequential dependencies in data, making them suitable for tasks such as language modeling and machine translation. However, they faced challenges in handling long-range dependencies due to vanishing and exploding gradient problems.
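The vanishing-gradient effect can be illustrated with a toy scalar recurrence. The sketch below is an illustration only, not one of the models trained in this notebook: it assumes a one-dimensional state updated as h_t = tanh(w * h_{t-1}), and the function name `gradient_through_time` and the values of `w` and `h0` are hypothetical choices. It tracks the magnitude of the derivative of the current state with respect to the initial state as the number of time steps grows.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return |d h_t / d h_0| for t = 1..steps, where h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    magnitudes = []
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
        magnitudes.append(abs(grad))
    return magnitudes

# Assumed toy values: a recurrent weight below 1 and a 50-step sequence.
mags = gradient_through_time(w=0.5, h0=1.0, steps=50)
print(f"gradient magnitude after 1 step:   {mags[0]:.3e}")
print(f"gradient magnitude after 50 steps: {mags[-1]:.3e}")
```

Because every per-step factor here is at most 0.5 in magnitude, the product shrinks geometrically, so the earliest inputs contribute almost nothing to the gradient. LSTM gating is designed to mitigate this effect.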

The Transformer architecture, introduced by Vaswani et al. in 2017, addressed these limitations by employing a self-attention mechanism, allowing the model to weigh different parts of the input sequence differently. This mechanism enables the Transformer to capture long-range dependencies more effectively, making it particularly well-suited for sequence-to-sequence tasks. The use of self-attention also facilitates parallelization, enhancing training efficiency.
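As a rough illustration of the self-attention computation, the sketch below implements scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. It makes simplifying assumptions: queries, keys, and values are all the raw input vectors (no learned projections, a single head), and the names `softmax`, `self_attention`, and `tokens` are hypothetical. It is not the full Transformer layer.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x."""
    d_k = len(x[0])
    scale = math.sqrt(d_k)
    # Score every position against every other position; no step depends on a previous one.
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x] for q in x]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of all value vectors.
    output = [[sum(w * v[j] for w, v in zip(w_row, x)) for j in range(d_k)]
              for w_row in weights]
    return weights, output

# Toy 3-token sequence with 2-dimensional embeddings (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how strongly one position attends to every other position. Since no row depends on another, all rows can be computed in parallel, which is the source of the training-efficiency gain described above.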

The Transformer replaces recurrence with attention. Attention relates every position in the input and output to every other position in a single step, rather than stepping through the sequence token by token, which makes training considerably faster and more efficient.

The Transformer's attention mechanisms let it focus on whichever parts of the input and output are relevant, and they operate on all positions at once. The model is built from stacked layers, each combining self-attention with a position-wise fully connected feed-forward network. This design scales well to large datasets and achieved state-of-the-art results on machine translation. By replacing sequential recurrence with attention, the Transformer is both efficient to train and effective on sequence tasks.

2. Implementation of a Simple RNN (20% of the grade)

• Implement a basic RNN model in Python using a framework like TensorFlow or PyTorch.

• Use a small dataset (e.g., a subset of the IMDB movie reviews) for text classification or a similar task.

• Analyze the performance and limitations of your RNN model.

In [1]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from sklearn.model_selection import train_test_split

# Load IMDB dataset
max_features = 5000  # Number of words to consider as features
max_len = 200  # Cuts off texts after this number of words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to have a consistent length for RNN input
x_train = sequence.pad_sequences(x_train, maxlen=max_len)
x_test = sequence.pad_sequences(x_test, maxlen=max_len)

# Hold out 40% of the training set for validation (60% train / 40% validation)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.4, random_state=42)

# Define RNN model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_features, 32, input_length=max_len),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_data=(x_val, y_val))

# Evaluate the model on the test set
loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test loss: {loss:.4f}, Test accuracy: {accuracy:.4f}')


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test loss: 0.7540, Test accuracy: 0.7431
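The assignment asks for an analysis of the RNN's performance, but the training call above discards the object returned by `model.fit`. As one possible aid, the sketch below summarizes a Keras-style history dictionary (the `history` attribute of the object that `fit` returns, keyed by metric name). The names `summarize_history` and `toy_history` are hypothetical, and the numbers in `toy_history` are placeholders for illustration, not results from this notebook.

```python
def summarize_history(history):
    """Summarize a Keras-style history dict (metric name -> list of per-epoch values)."""
    val_loss = history["val_loss"]
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_epoch + 1,  # 1-based, matching the "Epoch k/5" log lines
        "best_val_loss": val_loss[best_epoch],
        # A positive gap at the final epoch means training accuracy exceeds
        # validation accuracy, a common sign of overfitting.
        "final_accuracy_gap": history["accuracy"][-1] - history["val_accuracy"][-1],
    }

# Placeholder values for illustration only; these are NOT results from this notebook.
toy_history = {
    "loss": [0.65, 0.50, 0.40, 0.30, 0.22],
    "val_loss": [0.62, 0.55, 0.53, 0.58, 0.66],
    "accuracy": [0.62, 0.76, 0.83, 0.88, 0.92],
    "val_accuracy": [0.66, 0.72, 0.74, 0.73, 0.71],
}
summary = summarize_history(toy_history)
print(summary)
```

In the notebook, the same helper could be applied by keeping the return value of the training call, for example `history = model.fit(...)` followed by `summarize_history(history.history)`.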


3. Implementing an LSTM Model (20% of the grade)

• Modify your RNN model to use LSTM units.

• Compare its performance with the basic RNN model on the same task, highlighting the improvements or changes.

In [2]:
# Define LSTM model
model_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_features, 32, input_length=max_len),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the LSTM model
model_lstm.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the LSTM model
model_lstm.fit(x_train, y_train, epochs=5, batch_size=64, validation_data=(x_val, y_val))

# Evaluate the LSTM model on the test set
loss_lstm, accuracy_lstm = model_lstm.evaluate(x_test, y_test)
print(f'LSTM Test loss: {loss_lstm:.4f}, Test accuracy: {accuracy_lstm:.4f}')


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM Test loss: 0.4197, Test accuracy: 0.8529
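When comparing the two models, it may help to note how much capacity the LSTM adds. The sketch below computes trainable-parameter counts from the standard formulas for embedding, simple recurrent, LSTM, and dense layers, using the layer sizes from the cells above (vocabulary 5000, embedding size 32, 32 recurrent units). The helper function names are hypothetical, and the counts assume default layer settings with bias terms.

```python
def embedding_params(vocab_size, embed_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # Input kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    # Four gate-like blocks (input, forget, cell candidate, output),
    # each the size of one simple recurrent layer
    return 4 * simple_rnn_params(input_dim, units)

def dense_params(input_dim, units):
    # Weight matrix + bias
    return units * (input_dim + 1)

max_features, embed_dim, units = 5000, 32, 32

# The embedding and output layers are identical in both models.
shared = embedding_params(max_features, embed_dim) + dense_params(units, 1)
rnn_total = shared + simple_rnn_params(embed_dim, units)
lstm_total = shared + lstm_params(embed_dim, units)

print(f"SimpleRNN layer: {simple_rnn_params(embed_dim, units)} params, model total: {rnn_total}")
print(f"LSTM layer:      {lstm_params(embed_dim, units)} params, model total: {lstm_total}")
```

The recurrent layer grows fourfold, but the embedding dominates both totals, so the LSTM's higher test accuracy is more plausibly attributable to its gating than to raw parameter count.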


4. Exploring Attention Mechanisms (20% of the grade)

• Implement a basic attention mechanism in your LSTM model.

• Discuss how the attention mechanism impacts the model's performance and its ability to handle long-range dependencies.

In [3]:
!pip install tensorflow-addons

Collecting tensorflow-addons
  Downloading tensorflow_addons-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (612 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 612.3/612.3 kB 7.4 MB/s eta 0:00:00
Collecting typeguard<3.0.0,>=2.7 (from tensorflow-addons)
  Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Installing collected packages: typeguard, tensorflow-addons
Successfully installed tensorflow-addons-0.22.0 typeguard-2.13.3


In [4]:
class Attention(tf.keras.layers.Layer):
    """Single-head dot-product self-attention over the LSTM output sequence."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, input_shape):
        # Trainable query and key projections, each of shape (features, features).
        self.W_q = self.add_weight(name="W_q",
                                   shape=(input_shape[-1], input_shape[-1]),
                                   initializer="uniform",
                                   trainable=True)
        self.W_k = self.add_weight(name="W_k",
                                   shape=(input_shape[-1], input_shape[-1]),
                                   initializer="uniform",
                                   trainable=True)
        super().build(input_shape)  # Be sure to call this at the end

    def call(self, x):
        # x has shape (batch, time, features)
        q = tf.matmul(x, self.W_q)
        k = tf.matmul(x, self.W_k)
        v = x  # Values are the unprojected inputs

        # Pairwise scores between all time steps: (batch, time, time).
        # Note: the scores are not scaled by sqrt(features) as in Vaswani et al.
        attn_score = tf.matmul(q, k, transpose_b=True)
        attn_score = tf.nn.softmax(attn_score, axis=-1)  # Normalize over keys

        # Weighted sum of the values: (batch, time, features)
        output = tf.matmul(attn_score, v)
        return output

# Define LSTM with attention model
model_lstm_attention = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_features, 32, input_length=max_len),
    tf.keras.layers.LSTM(32, return_sequences=True),  # Return sequences for attention
    Attention(),
    tf.keras.layers.Flatten(),  # Flatten the output for the Dense layer
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the LSTM with attention model
model_lstm_attention.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the LSTM with attention model
model_lstm_attention.fit(x_train, y_train, epochs=5, batch_size=64, validation_data=(x_val, y_val))

# Evaluate the LSTM with attention model on the test set
loss_lstm_attention, accuracy_lstm_attention = model_lstm_attention.evaluate(x_test, y_test)
print(f'LSTM with Attention Test loss: {loss_lstm_attention:.4f}, Test accuracy: {accuracy_lstm_attention:.4f}')


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
LSTM with Attention Test loss: 0.4000, Test accuracy: 0.8469
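To support the discussion of how attention affects performance, the three reported test results can be placed side by side. The sketch below only tabulates the loss and accuracy values printed by the three training cells in this notebook; the `results` dictionary layout and the variable names are hypothetical conveniences.

```python
# Test-set metrics copied from the printed outputs of the three models in this notebook.
results = {
    "SimpleRNN": {"loss": 0.7540, "accuracy": 0.7431},
    "LSTM": {"loss": 0.4197, "accuracy": 0.8529},
    "LSTM + attention": {"loss": 0.4000, "accuracy": 0.8469},
}

baseline = results["SimpleRNN"]["accuracy"]
rows = []
for name, metrics in results.items():
    # Accuracy difference relative to the SimpleRNN baseline
    delta = metrics["accuracy"] - baseline
    rows.append((name, metrics["loss"], metrics["accuracy"], delta))
    print(f"{name:<18} loss={metrics['loss']:.4f}  "
          f"acc={metrics['accuracy']:.4f}  vs RNN={delta:+.4f}")
```

On these single-run numbers, the LSTM improves test accuracy over the SimpleRNN by about 11 percentage points. Adding attention lowers the test loss slightly (0.4197 to 0.4000) while accuracy drops by 0.6 points, a difference small enough that it may fall within run-to-run variation. With reviews truncated to 200 tokens, the long-range benefit of attention may simply have little room to show.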
