
This project uses self-supervised learning on the IMDb dataset to train a model that predicts whether the words in a movie review are in the correct order.

# Step 1: Import necessary libraries and load dataset



*   The IMDb dataset is loaded as movie reviews encoded as sequences of integers. Only the 5,000 most frequent words are kept (num_words=5000).
*   The integer sequences are converted back to human-readable text (decoded_train_reviews and decoded_test_reviews).
*   Words within each review are shuffled to create a self-supervised learning task.
*   A tokenizer is fit on the shuffled training reviews to build the vocabulary (tokenizer.fit_on_texts). The conversion to integer sequences happens in Step 2.

In [11]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.datasets import imdb
import numpy as np

# Load the IMDb dataset
(train_data, _), (test_data, _) = imdb.load_data(num_words=5000)

# Convert integer sequences back to text.
# Indices 0-2 are reserved (padding, start, unknown), so word ranks are offset by 3;
# reserved and out-of-vocabulary indices decode to '?'.
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}
decoded_train_reviews = [' '.join([reverse_word_index.get(i - 3, '?') for i in review]) for review in train_data]
decoded_test_reviews = [' '.join([reverse_word_index.get(i - 3, '?') for i in review]) for review in test_data]

# Shuffle words within each review
shuffled_train_reviews = [' '.join(np.random.permutation(review.split())) for review in decoded_train_reviews]
shuffled_test_reviews = [' '.join(np.random.permutation(review.split())) for review in decoded_test_reviews]

# Tokenize the text
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(shuffled_train_reviews)
total_words = len(tokenizer.word_index) + 1
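
As a toy illustration of the decoding and shuffling steps above, the sketch below uses a small made-up vocabulary (`toy_word_index` and `encoded_review` are hypothetical, not taken from the real IMDb index). It assumes the same convention as the cell above: indices 0 to 2 are reserved, so every encoded value is offset by 3.

```python
import numpy as np

# Hypothetical miniature vocabulary standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}

# Encoded review: 1 is the start marker, the rest are word ranks offset by 3
encoded_review = [1, 4, 5, 6, 7]

# Reserved indices (here the start marker) decode to '?'
decoded = " ".join(toy_reverse_index.get(i - 3, "?") for i in encoded_review)

# Shuffle the words with a seeded generator so the result is repeatable
rng = np.random.default_rng(0)
shuffled = " ".join(rng.permutation(decoded.split()))

print(decoded)
print(shuffled)
```

The shuffled string contains the same words as the decoded one, only in a different order, which is the property the notebook relies on.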


# Step 2: Pad Sequences and Create Pairs for Self-Supervised Learning

*   Text is converted to sequences of integers using the tokenizer.
*   Sequences are padded to have the same length for modeling convenience (pad_sequences).
*   The create_word_pairs function pairs each padded sequence with a random permutation of itself. Because the reviews were already shuffled in Step 1, neither element of a pair is in the original word order.
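
The padding step can be sketched in plain NumPy. The helper below (`pad_pre` is a hypothetical name) is meant to mirror the default behaviour of `pad_sequences`, which pads and truncates at the front of each sequence and uses 0 as the fill value. It is an illustration of that behaviour, not the library's implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]  # keep the last maxlen tokens
        if trimmed:
            padded[row, -len(trimmed):] = trimmed
    return padded

toy_sequences = [[5, 2], [7, 1, 4, 3], [9]]
print(pad_pre(toy_sequences))            # padded to the longest sequence
print(pad_pre(toy_sequences, maxlen=3))  # longer sequences lose leading tokens
```

Passing the training length as `maxlen` for the test set, as the notebook does, keeps both arrays the same width.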

In [12]:
# Convert text to sequences
train_sequences = tokenizer.texts_to_sequences(shuffled_train_reviews)
test_sequences = tokenizer.texts_to_sequences(shuffled_test_reviews)

# Pad sequences to have the same length
padded_train_sequences = pad_sequences(train_sequences)
padded_test_sequences = pad_sequences(test_sequences, maxlen=len(padded_train_sequences[0]))

# Pair each padded sequence with a random permutation of itself
# (padding zeros are permuted along with the word indices)
def create_word_pairs(sequences):
    pairs = []
    for seq in sequences:
        pairs.append([seq, np.random.permutation(seq)])
    return np.array(pairs)
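
A binary classifier needs examples of both classes. One possible way to build them is sketched below, assuming the input sequences are still in their original word order, which differs from this notebook, where the reviews are shuffled in Step 1. The names `make_order_dataset` and `pad_value` are hypothetical.

```python
import numpy as np

def make_order_dataset(sequences, rng, pad_value=0):
    """Return (inputs, labels): each sequence as-is (label 1) plus a copy
    whose non-padding tokens are permuted (label 0)."""
    inputs, labels = [], []
    for seq in sequences:
        seq = np.asarray(seq)
        shuffled = seq.copy()
        mask = seq != pad_value            # leave padding positions untouched
        shuffled[mask] = rng.permutation(seq[mask])
        inputs.extend([seq, shuffled])
        labels.extend([1, 0])
    return np.stack(inputs), np.array(labels)

rng = np.random.default_rng(42)
ordered = np.array([[0, 0, 3, 8, 5], [0, 4, 9, 2, 7]])  # toy padded sequences
inputs, labels = make_order_dataset(ordered, rng)
print(inputs.shape, labels)
```

For very short sequences a random permutation can reproduce the original order by chance, so a real pipeline might want to re-draw or drop such cases.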


# Step 3: Define and Compile the Model

*   The model architecture is defined for predicting the order of words.
*   It consists of an embedding layer, an LSTM layer, and a dense output layer with sigmoid activation.
*   The model is compiled with the Adam optimizer, binary crossentropy loss, and accuracy as the metric.

In [13]:
# Model for predicting the order of words
def word_order_prediction_model(input_size):
    model = tf.keras.models.Sequential([
        tf.keras.layers.Embedding(input_dim=total_words, output_dim=16, input_length=input_size),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
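
To get a feel for the size of this architecture, the parameter counts can be worked out by hand. The sketch below assumes the usual Keras LSTM layout (four gates, each with input weights, recurrent weights, and one bias vector) and uses `vocab_size=5000` purely as an illustrative value. The real `total_words` depends on the fitted tokenizer.

```python
def count_parameters(vocab_size, embed_dim=16, lstm_units=32):
    """Trainable parameter counts for Embedding -> LSTM -> Dense(1)."""
    embedding = vocab_size * embed_dim
    # Four gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    dense = lstm_units + 1  # one weight per LSTM unit plus a bias
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

params = count_parameters(vocab_size=5000)
print(params)
```

With these settings the embedding table dominates the parameter count, which is typical for small text models.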


# Step 4: Create Pairs for Self-Supervised Learning on Training Data

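*   Random seeds are set for TensorFlow and NumPy. The word shuffling in Step 1 ran before this point, so it is not covered by the seed.
*   Pairs are created from the padded training sequences, and every example is assigned the label 1.
*   The pairs and labels are shuffled together with a single permutation index.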

In [14]:
# Set random seed for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

# Create pairs for self-supervised learning on the training set
word_pairs = create_word_pairs(padded_train_sequences)

# Labels for word order prediction: every example is labeled 1,
# so no label-0 (incorrect order) examples are present
word_order_labels = np.ones(len(word_pairs))

# Shuffle the data
shuffle_index = np.random.permutation(len(word_pairs))
word_pairs, word_order_labels = word_pairs[shuffle_index], word_order_labels[shuffle_index]
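
Before training a binary classifier it can help to check how the labels are distributed. The sketch below uses toy stand-ins (`toy_inputs`, `toy_labels`) rather than the notebook's arrays, and shows the same unison shuffle followed by a class count. Applied to `word_order_labels` as defined above, the count would report a single class.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for the input array and its labels
toy_inputs = np.arange(12).reshape(6, 2)
toy_labels = np.array([1, 1, 1, 0, 0, 0])

# One permutation applied to both arrays keeps rows and labels aligned
order = rng.permutation(len(toy_inputs))
shuffled_inputs, shuffled_labels = toy_inputs[order], toy_labels[order]

# Count how many examples fall in each class
classes, counts = np.unique(shuffled_labels, return_counts=True)
class_counts = dict(zip(classes.tolist(), counts.tolist()))
print(class_counts)
```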


# Step 5: Train the Model

*   The model is built with word_order_prediction_model from Step 3, using the padded sequence length as input_size.
*   The model is trained for 3 epochs with a batch size of 128 on the first element of each pair (word_pairs[:, 0]). The second element of each pair is not used.
*   Each epoch should take about 5 minutes to run.

In [16]:
# Define the word order prediction model
input_size = len(word_pairs[0][0])
word_order_model = word_order_prediction_model(input_size)

# Train the model
word_order_model.fit(word_pairs[:, 0], word_order_labels, epochs=3, batch_size=128)


Epoch 1/3
Epoch 2/3
Epoch 3/3



# Step 6: Evaluate the Model on the Test Set

*   create_word_pairs is applied to the padded test sequences, exactly as it was for the training set.

*   test_word_order_labels is an array of ones, matching the all-ones labels used in training.

*   word_order_model.evaluate runs the model on the first element of each test pair (test_word_pairs[:, 0]) and compares the thresholded sigmoid output with these labels.

*   The test loss and test accuracy are printed to the console. Because every training and test label is 1, a model that always outputs a value close to 1 reaches an accuracy of 1.0, so this score alone does not show that the model can distinguish ordered from shuffled text.

In [17]:
# Evaluate the model on the test set
test_word_pairs = create_word_pairs(padded_test_sequences)
test_word_order_labels = np.ones(len(test_word_pairs))

evaluation_results = word_order_model.evaluate(test_word_pairs[:, 0], test_word_order_labels)
print(f"Test Loss: {evaluation_results[0]}, Test Accuracy: {evaluation_results[1]}")


Test Loss: 0.0004322270688135177, Test Accuracy: 1.0
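
To put these numbers in context, binary cross-entropy with all-ones labels reduces to the mean of -log(p), where p is the predicted probability. The NumPy sketch below is an illustration, not the Keras implementation, and the prediction value 0.9996 is chosen only for the example. It shows that a constant prediction near 1 already gives a very small loss and perfect accuracy.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, clipping predictions away from 0 and 1."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

labels = np.ones(4)
# A constant prediction near 1, regardless of the input
constant_pred = np.full(4, 0.9996)

loss = binary_crossentropy(labels, constant_pred)
accuracy = float(np.mean((constant_pred > 0.5) == (labels == 1)))
print(loss, accuracy)
```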
