# Tokenized Dataset

Before a dataset can be passed to an LLM, it needs to be preprocessed. <br>
This notebook demonstrates the process of tokenizing raw data into a format suitable for the DistilBERT model.

In [1]:
from datasets import load_from_disk
from transformers import AutoTokenizer
import tensorflow as tf



## Preprocess dataset

### Import raw dataset
We'll use the IMDB movie review dataset from Kaggle.

In [2]:
dataset = load_from_disk('../stanford_imdb_dataset')

In [3]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


### Load tokenizer

Every model needs the tokenizer it was trained with. <br>
Since we work with the DistilBERT model, we'll use the `distilbert-base-uncased` tokenizer.

In [4]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

### Tokenize dataset
The function below tokenizes every instance. <br>
Instances shorter than 512 tokens are padded to that length. <br>
Instances longer than 512 tokens are truncated. <br>

In [5]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512, return_tensors='tf')

tokenized_dataset = dataset.map(tokenize_function, batched=True)
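
To make the padding and truncation behaviour concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token ids. It is an illustration only, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the padding id `0` is an assumption, and the real tokenizer also handles special tokens when truncating, which this sketch ignores.

```python
PAD_ID = 0        # assumed padding token id (illustrative)
MAX_LENGTH = 512  # same limit as in tokenize_function

def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    # Truncation: keep only the first max_length tokens
    input_ids = list(token_ids[:max_length])
    # Real tokens are marked with 1 in the attention mask
    attention_mask = [1] * len(input_ids)
    # Padding: fill the remainder with pad_id, marked with 0 in the mask
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

# A short sequence gets padded, a long one gets cut
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=5)
print(short_ids, short_mask)  # [101, 2023, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions, which is why it is stored next to `input_ids` in the tokenized dataset.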

In [6]:
print(tokenized_dataset['train'])
print("Text: {}".format(tokenized_dataset['train'][0]['text']))
print("Label: {}".format(tokenized_dataset['train'][0]['label']))
print("Input ids: {}".format(tokenized_dataset['train'][0]['input_ids']))
print("Attention mask: {}".format(tokenized_dataset['train'][0]['attention_mask']))

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
})
Text: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 year

In [7]:
# Save tokenized dataset
tokenized_dataset.save_to_disk('./tokenized_dataset/')

Saving the dataset (0/1 shards):   0%|          | 0/25000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/25000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/50000 [00:00<?, ? examples/s]

## Convert to PyTorch dataset

The models will receive these datasets as their input.

### Create 3 different train and validation sets

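To test how much the training set size affects model performance across different fine-tuning techniques, we create three train/validation pairs. <br>
The training sets contain 100, 1000 and 22500 examples. <br>
The validation sets contain 10, 100 and 2500 examples, respectively: 10% of the training size for the two smaller sets, and 10% of the original 25000 training examples for the full set.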
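
As a quick sanity check on these numbers, the arithmetic can be sketched as follows. The subset and validation sizes are the ones passed to `create_balanced_subset` and `train_test_split` in the cells that follow; the `configs` dictionary itself is only an illustrative structure, not part of the notebook's pipeline.

```python
# (subset size, validation size) per configuration, as used in this notebook
configs = {
    "small": (110, 10),
    "medium": (1100, 100),
    "full": (25000, 2500),
}

split_sizes = {}
for name, (subset_size, validation_size) in configs.items():
    # Whatever is not held out for validation is used for training
    train_size = subset_size - validation_size
    split_sizes[name] = (train_size, validation_size)
    ratio = validation_size / train_size
    print(f"{name}: train={train_size}, validation={validation_size}, "
          f"validation/train={ratio:.1%}")
```

This is why the small and medium subsets are drawn with 110 and 1100 examples: holding out the validation examples leaves exactly 100 and 1000 for training.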

In [8]:
from datasets import concatenate_datasets
import random

In [9]:
# Function to create a balanced subset
random_seed = 42

def create_balanced_subset(pos_samples, neg_samples, num_samples):
    # Shuffle each class, then keep num_samples // 2 examples from each
    pos_samples = pos_samples.shuffle(seed=random_seed).select(range(num_samples // 2))
    neg_samples = neg_samples.shuffle(seed=random_seed).select(range(num_samples // 2))

    # Concatenate positive and negative samples
    balanced_subset = concatenate_datasets([pos_samples, neg_samples])
    return balanced_subset.shuffle(seed=random_seed)

In [10]:
positive_samples = tokenized_dataset['train'].filter(lambda x: x['label'] == 1)
negative_samples = tokenized_dataset['train'].filter(lambda x: x['label'] == 0)

balanced_small  = create_balanced_subset(positive_samples, negative_samples, 110)
balanced_medium = create_balanced_subset(positive_samples, negative_samples, 1100)
balanced_full   = tokenized_dataset['train']

small_split    = balanced_small.train_test_split(test_size=10, seed=random_seed, stratify_by_column='label')
medium_split   = balanced_medium.train_test_split(test_size=100, seed=random_seed, stratify_by_column='label')
full_split     = balanced_full.train_test_split(test_size=2500, seed=random_seed, stratify_by_column='label')

small_train, small_validation   = small_split["train"], small_split["test"]
medium_train, medium_validation = medium_split["train"], medium_split["test"]
full_train, full_validation     = full_split["train"], full_split["test"]
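
Because the splits are stratified on `label`, each of them should keep roughly the same class proportions as the subset it was drawn from. A minimal way to check this is to count the labels in a split. In the sketch below, `label_counts` is a hypothetical helper and `example_labels` is made-up stand-in data; in the notebook one would pass a real label column, for example `small_train['label']`.

```python
from collections import Counter

def label_counts(labels):
    """Count how many examples carry each label."""
    return dict(Counter(labels))

# Stand-in for a real label column (hypothetical values)
example_labels = [0, 1] * 50
counts = label_counts(example_labels)
print(counts)  # {0: 50, 1: 50}
```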

In [11]:
train_dataset_small = small_train.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])
validation_dataset_small = small_validation.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])

train_dataset_medium = medium_train.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])
validation_dataset_medium = medium_validation.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])

train_dataset_full = full_train.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])
validation_dataset_full = full_validation.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])

test_dataset = tokenized_dataset['test'].with_format('torch', columns=['input_ids', 'attention_mask', 'label'])

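## Save datasets

Save the formatted splits to disk for quicker loading later. <br>
The output directory is named `tensorflow_datasets`, although the splits above were formatted with `with_format('torch', ...)`.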

In [13]:
train_dataset_small.save_to_disk('./tensorflow_datasets/train_dataset_small')
validation_dataset_small.save_to_disk('./tensorflow_datasets/validation_dataset_small')
train_dataset_medium.save_to_disk('./tensorflow_datasets/train_dataset_medium')
validation_dataset_medium.save_to_disk('./tensorflow_datasets/validation_dataset_medium')
train_dataset_full.save_to_disk('./tensorflow_datasets/train_dataset_full')
validation_dataset_full.save_to_disk('./tensorflow_datasets/validation_dataset_full')
test_dataset.save_to_disk('./tensorflow_datasets/test_dataset')

Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/10 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/22500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/25000 [00:00<?, ? examples/s]