# Analyzing and preprocessing the data

1. The Hugging Face datasets library provides the IMDb dataset, so first, we install the library and load the dataset:

In [16]:
!pip install datasets



In [51]:
from datasets import load_dataset

ds = load_dataset("imdb")
train_dataset = ds['train']
test_dataset = ds['test']
print(len(train_dataset), len(test_dataset))
# E.g. access items:
print(train_dataset[0])

25000 25000
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are fe

2. Now, let’s explore the vocabulary within the training set:

In [52]:
import re
from collections import Counter

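# --- Tokenizer function ---
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # Remove HTML tags
    # Collect emoticons such as :) ;-( =) from the lowercased text
    # (the D and P alternatives cannot match here because the text is lowercased first)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace runs of non-word characters with a space, then append the emoticons without hyphens
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()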

# --- Build vocabulary ---
token_counts = Counter()
train_labels = []

# Iterate over dataset and update token counts
for example in train_dataset:
    label = example['label'] + 1  # Shift the raw labels 0/1 to 1/2 (1 = negative, 2 = positive)
    line = example['text']
    train_labels.append(label)
    tokens = tokenizer(line)
    token_counts.update(tokens)

print('Vocab-size:', len(token_counts))
print(Counter(train_labels))


Vocab-size: 75977
Counter({1: 12500, 2: 12500})


Here, we define a function to extract tokens (words, in our case) from a given document (a movie
review, in our case). It first removes HTML tags, then extracts emoticons and strips their hyphens,
replaces runs of non-word characters with a single space, and splits the lowercased text into a
list of tokens. We store the tokens and their counts in the Counter object token_counts.
The training set contains 75,977 unique tokens and is perfectly balanced, with 12,500 positive
samples (labeled “2” after the shift by 1) and 12,500 negative samples (labeled “1”).
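To make the tokenizer's behavior concrete, the sketch below reruns the same function on two made-up strings (the sample strings are illustrative and not taken from the dataset). It also shows a quirk worth knowing: when the cleaned text does not end in a non-word character, the first emoticon is glued to the last word.

```python
import re

def tokenizer(text):
    # Same logic as the tokenizer defined above
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text.split()

# Made-up review ending in punctuation: the emoticon becomes its own token
tokens = tokenizer("<br />This movie was GREAT :-) Loved it!")
print(tokens)  # ['this', 'movie', 'was', 'great', 'loved', 'it', ':)']

# Made-up review ending in a word character: the emoticon is appended to the last word
glued = tokenizer(":) loved it")
print(glued)  # ['loved', 'it:)']
```

If the glued tokens are undesirable, one option is to insert a space before the joined emoticons, at the cost of a slightly different vocabulary size than the one reported above.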

3. We will feed the word tokens into an embedding layer, nn.Embedding. The embedding layer
requires integer input because it’s specifically designed to handle discrete categorical data, such
as word indices, and transform them into continuous representations that a neural network
can work with and learn from. Therefore, we need to first encode each token into a unique
integer as follows:

In [53]:
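from collections import OrderedDict

from torchtext.vocab import vocab

# Sort tokens by descending frequency so that frequent tokens get small indices
sorted_by_freq_tuples = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)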
ordered_dict = OrderedDict(sorted_by_freq_tuples)
vocab_mapping = vocab(ordered_dict)
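As a rough illustration of why integer indices are needed, an embedding layer can be viewed as a trainable lookup table with one row per index. The sizes below (10 indices, 4 dimensions) and the input indices are arbitrary values chosen for the sketch, not the settings used later for the IMDb model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Lookup table with 10 rows (one per token index) and 4 columns (embedding size).
# padding_idx=0 keeps the row for index 0 at zeros and excludes it from gradient updates.
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)

# One sequence of three token indices; the last one is the padding index
token_ids = torch.tensor([[2, 5, 0]])
vectors = embedding(token_ids)

print(vectors.shape)  # torch.Size([1, 3, 4])
print(vectors[0, 2])  # the padding position maps to an all-zero vector
```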

4. When examining the document lengths within the training set, you’ll notice that they range
from 10 to 2,498 words. It’s common practice to apply padding to sequences to ensure uniform
length during batch processing. So, we insert the special token, "<pad>", representing padding
into the vocabulary mapping at index 0 as a placeholder:

In [54]:
vocab_mapping.insert_token("<pad>", 0)
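The length range quoted in this step can be obtained by tokenizing every review and taking the minimum and maximum token count. The sketch below shows the idea on three made-up reviews with a simplified stand-in tokenizer (simple_tokenize is a hypothetical helper); on the real data, you would presumably iterate over the training reviews with the tokenizer defined earlier.

```python
import re

def simple_tokenize(text):
    # Simplified stand-in: strip HTML tags, lowercase, split on non-word characters
    text = re.sub(r'<[^>]*>', '', text)
    return re.sub(r'[\W]+', ' ', text.lower()).split()

# Made-up reviews, used only to illustrate the computation
toy_reviews = [
    "A great film!",
    "Dull.<br />Would not watch it again.",
    "Fine",
]

doc_lengths = [len(simple_tokenize(review)) for review in toy_reviews]
print(doc_lengths)                         # [3, 6, 1]
print(min(doc_lengths), max(doc_lengths))  # 1 6
```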

5. We also need to handle unseen words during inference. Similar to the previous step, we insert
the special token "<unk>" (short for “unknown”) into the vocabulary mapping at index 1. The
token represents out-of-vocabulary words or tokens that are not found in the training data:

In [55]:
vocab_mapping.insert_token("<unk>", 1)
vocab_mapping.set_default_index(1)

We also set the default index of the vocabulary mapping to 1, so that looking up any unseen or
out-of-vocabulary word returns the index of "<unk>".

Let’s take a look at the following examples showing the mappings of given words, including
an unseen one:

In [56]:
print([vocab_mapping[token] for token in ['this', 'is', 'an','example']])

print([vocab_mapping[token] for token in ['this', 'is', 'example2']])

[11, 7, 35, 462]
[11, 7, 1]
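If torchtext is not available in your environment, the same mapping can be approximated with plain dictionaries. The following is a minimal sketch on a tiny made-up corpus; the names itos, stoi, and encode are hypothetical helpers and not part of any library, and the resulting indices differ from the IMDb ones shown above.

```python
from collections import Counter

# Tiny made-up corpus standing in for the IMDb token counts
token_counts = Counter("this is an example and this is another example this".split())

# Reserve index 0 for "<pad>" and index 1 for "<unk>", then add tokens by descending frequency
itos = ["<pad>", "<unk>"] + [token for token, _ in token_counts.most_common()]
stoi = {token: index for index, token in enumerate(itos)}

def encode(tokens):
    # Unseen tokens fall back to the index of "<unk>"
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]

print(encode(['this', 'is', 'an', 'example']))  # [2, 3, 5, 4]
print(encode(['this', 'is', 'example2']))       # [2, 3, 1]
```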


6. Next, we define a function that specifies how batches of samples are collated:

In [62]:
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    # Each item in the batch is a dictionary with the keys 'text' and 'label'
    label_list, text_list, lengths = [], [], []
    for example in batch:
        # Raw labels are already 0 (negative) and 1 (positive); cast to float for the loss
        label_list.append(float(example['label']))
        # Map each token to its integer index
        processed_text = [vocab_mapping[token] for token in tokenizer(example['text'])]
        text_list.append(torch.tensor(processed_text, dtype=torch.int64))
        lengths.append(len(processed_text))
    label_list = torch.tensor(label_list)
    lengths = torch.tensor(lengths)
    # Right-pad every sequence with 0 (the index of "<pad>") to the longest length in the batch
    padded_text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True)
    return padded_text_list.to(device), label_list.to(device), lengths.to(device)


Besides generating inputs and label outputs as we used to do, we also generate the length of
individual samples in a given batch. Each sample drawn from the Hugging Face dataset is a
dictionary with the keys 'text' and 'label', so we read both fields by key; unpacking the
dictionary directly would yield the key names instead of the values. The raw labels are already
0 (negative) and 1 (positive), so we only cast them to floats for compatibility with a binary
classification loss. The length information is used for handling variable-length sequences
efficiently. Take a small batch of four samples and examine the processed batch:

In [63]:
from torch.utils.data import DataLoader
torch.manual_seed(0)

dataloader = DataLoader(
    train_dataset,
    batch_size=4,
    shuffle=True,
    collate_fn=collate_batch
)

text_batch, label_batch, length_batch = next(iter(dataloader))



In [66]:
print(text_batch)


In [41]:
print(label_batch)

print(length_batch)

print(text_batch.shape)

Here, text_batch is a two-dimensional integer tensor of shape (4, T), where T is the length of
the longest review in the batch; shorter reviews are right-padded with 0, the index of "<pad>".
label_batch holds the four labels as floats (0. or 1.), and length_batch holds the unpadded
length of each review.
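One common way the length information is used is to pack the padded batch so that a recurrent layer skips the padding positions. The sketch below is an assumption about how the lengths might be consumed downstream, not the model defined in this document; the toy sequences, the embedding size, and the LSTM size are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Three made-up index sequences of different lengths
seqs = [torch.tensor([4, 9, 2]), torch.tensor([7]), torch.tensor([5, 3])]
lengths = torch.tensor([len(s) for s in seqs])

# pad_sequence right-pads with 0 by default, matching the "<pad>" index
padded = nn.utils.rnn.pad_sequence(seqs, batch_first=True)
print(padded)  # rows: [4, 9, 2], [7, 0, 0], [5, 3, 0]

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

# Packing tells the LSTM the true length of each sequence;
# enforce_sorted=False avoids having to sort the batch by length first
packed = nn.utils.rnn.pack_padded_sequence(
    embedding(padded), lengths, batch_first=True, enforce_sorted=False
)
_, (hidden, _) = lstm(packed)
print(hidden.shape)  # torch.Size([1, 3, 3]): (num_layers, batch, hidden_size)
```

Note that pack_padded_sequence expects the lengths tensor on the CPU, so when the collate function has moved the lengths to a GPU, calling lengths.cpu() before packing may be necessary.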
