# Dataset Loader
This script downloads, analyzes, preprocesses, and saves the dataset.

## Large Movie Review Dataset

For demo purposes, the Large Movie Review Dataset (IMDb) is used. It contains 50,000 labeled instances, split equally into two classes: positive (marked with 'label' = 1) and negative (marked with 'label' = 0).



## Load Dataset

In [1]:
from datasets import load_dataset

In [2]:
dataset = load_dataset('stanfordnlp/imdb')
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


#### Key notes:
1. Equal train/test size
   This is the standard benchmark split for the IMDb dataset. Real-life scenarios would usually use a different ratio, but here we keep this configuration.
2. Unsupervised split
   This split contains unlabeled data that can be used for unsupervised learning techniques or for additional preprocessing. It is not used in this work.
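
The class balance mentioned above can be checked by counting labels per split. The following is a minimal sketch on a toy list of rows; `label_counts` and `toy_split` are hypothetical names introduced here for illustration. Since iterating a split yields one dictionary per row, the same helper should also accept `dataset['train']`.

```python
from collections import Counter

def label_counts(rows):
    """Count how many rows carry each value of the 'label' field."""
    return Counter(row["label"] for row in rows)

# Toy stand-in for a split (hypothetical data, not taken from IMDb).
toy_split = [
    {"text": "great film", "label": 1},
    {"text": "dull plot", "label": 0},
    {"text": "loved it", "label": 1},
    {"text": "not my taste", "label": 0},
]

counts = label_counts(toy_split)
print(counts[0], counts[1])  # prints: 2 2
```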

In [3]:
print('Train examples')
# Display 5 examples from the train set
for i in range(5):
    print(f"Example {i+1}:")
    print(f"Review: {dataset['train'][i]['text']}")
    print(f"Sentiment: {dataset['train'][i]['label']}")
    print("-" * 50)

Train examples
Example 1:
Review: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity s

For additional information about the IMDb dataset, see this page: https://huggingface.co/datasets/stanfordnlp/imdb

## Tokenization
Before the dataset can be passed to an LLM, it must be converted into a format the model accepts.
This notebook demonstrates how to tokenize the raw text into a format suitable for a BERT model.


In [4]:
from transformers import AutoTokenizer, AutoModel

### Load tokenizer

It's important to use the tokenizer that matches the model.
In this case, we use bert-base-uncased, since we work with a BERT model.


In [5]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

### Tokenize dataset

The function below tokenizes every instance.
Instances shorter than 512 tokens are padded to that length.
Instances longer than 512 tokens are truncated.

The tokenizer adds three new columns to each instance: input_ids, token_type_ids, and attention_mask. The first and the last are described here.
input_ids is a list of length 512 in which each value is the numeric id of a token. Tokens 101 and 102 are the special tokens [CLS] and [SEP], which mark the start and end of the sequence.
attention_mask marks which positions hold real tokens. A value of 1 means the position holds a token, while 0 means the position is padding and should be ignored.
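
To make the padding, truncation, and masking rules concrete, here is a library-free sketch of the same idea. It is not the tokenizer's actual implementation: `encode_fixed_length` is a hypothetical helper, the body token ids are made up, and a length of 6 stands in for 512 to keep the output readable.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids of bert-base-uncased

def encode_fixed_length(token_ids, max_length):
    """Wrap token ids in [CLS]/[SEP], then truncate or pad to max_length."""
    body = token_ids[: max_length - 2]        # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 = real token
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding           # 0 = padding position
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids, chosen only for illustration.
short_example = encode_fixed_length([2023, 2003], max_length=6)
long_example = encode_fixed_length([7, 8, 9, 10, 11, 12], max_length=6)

print(short_example)  # padded: two trailing 0s in both lists
print(long_example)   # truncated: only the first four body tokens are kept
```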

In [6]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512, return_tensors='pt')

tokenized_dataset = dataset.map(tokenize_function, batched=True)

In [7]:
print(tokenized_dataset['train'])
print("Text: {}".format(tokenized_dataset['train'][0]['text']))
print("Label: {}".format(tokenized_dataset['train'][0]['label']))
print("Input ids: {}".format(tokenized_dataset['train'][0]['input_ids']))
print("Attention mask: {}".format(tokenized_dataset['train'][0]['attention_mask']))

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 25000
})
Text: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELL

### Create 3 different train and validation sets

Three training sets of different sizes are created to test how much the training-set size affects model performance under different fine-tuning techniques.
The training sets contain 100, 1000, and 22500 instances. Each validation set is roughly 10% of the size of its training set.
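
The split sizes can be checked with a little arithmetic. This sketch assumes the totals drawn for each configuration are 110, 1100, and 25000 instances, with 10, 100, and 2500 of them held out for validation; `splits` and `split_summary` are hypothetical names used only here.

```python
# (total instances drawn, validation instances) for each configuration
splits = {"small": (110, 10), "medium": (1100, 100), "full": (25000, 2500)}

def split_summary(total, validation):
    """Return the train size, validation size, and validation/train ratio."""
    train = total - validation
    return train, validation, validation / train

summaries = {name: split_summary(*sizes) for name, sizes in splits.items()}
for name, (train, validation, ratio) in summaries.items():
    print(f"{name}: train={train}, validation={validation}, validation/train={ratio:.1%}")
# small: train=100, validation=10, validation/train=10.0%
# medium: train=1000, validation=100, validation/train=10.0%
# full: train=22500, validation=2500, validation/train=11.1%
```

The full configuration lands slightly above 10% because 2500 instances are held out from the 25000 available, leaving 22500 for training.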

In [8]:
from datasets import concatenate_datasets
import random

The create_balanced_subset function takes a dataset of all positive instances and a dataset of all negative instances and returns
a shuffled subset with an equal number of instances from each class.

In [9]:
# Function to create a balanced subset
random_seed = 42

def create_balanced_subset(pos_samples, neg_samples, num_samples):
    # Shuffle the datasets
    pos_samples = pos_samples.shuffle(seed=random_seed).select(range(num_samples // 2))
    neg_samples = neg_samples.shuffle(seed=random_seed).select(range(num_samples // 2))

    # Concatenate positive and negative samples
    balanced_subset = concatenate_datasets([pos_samples, neg_samples])
    return balanced_subset.shuffle(seed=random_seed)
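
As a quick sanity check of the idea, the same balanced sampling can be sketched without the datasets library. This is an analogue on plain Python lists, not the function used in this notebook; `balanced_sample` and the toy rows are hypothetical.

```python
import random

def balanced_sample(pos_items, neg_items, num_samples, seed=42):
    """Draw num_samples // 2 items from each class and shuffle the result."""
    rng = random.Random(seed)                 # local generator keeps the draw reproducible
    half = num_samples // 2
    subset = rng.sample(pos_items, half) + rng.sample(neg_items, half)
    rng.shuffle(subset)
    return subset

# Toy rows standing in for the positive and negative instances.
positives = [{"id": i, "label": 1} for i in range(20)]
negatives = [{"id": i, "label": 0} for i in range(20)]

subset = balanced_sample(positives, negatives, num_samples=10)
print(len(subset), sum(row["label"] for row in subset))  # prints: 10 5
```

Note that an odd `num_samples` would yield one instance fewer than requested, because each class contributes `num_samples // 2` instances.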

Creating 3 sizes for train and validation sets - small (100 + 10 instances), medium (1000 + 100 instances) and full (22500 + 2500 instances)

In [10]:
positive_samples = tokenized_dataset['train'].filter(lambda x: x['label'] == 1)
negative_samples = tokenized_dataset['train'].filter(lambda x: x['label'] == 0)

balanced_small  = create_balanced_subset(positive_samples, negative_samples, 110)
balanced_medium = create_balanced_subset(positive_samples, negative_samples, 1100)
balanced_full   = tokenized_dataset['train']

small_split    = balanced_small.train_test_split(test_size=10, seed=random_seed, stratify_by_column='label')
medium_split   = balanced_medium.train_test_split(test_size=100, seed=random_seed, stratify_by_column='label')
full_split     = balanced_full.train_test_split(test_size=2500, seed=random_seed, stratify_by_column='label')

small_train, small_validation   = small_split["train"], small_split["test"]
medium_train, medium_validation = medium_split["train"], medium_split["test"]
full_train, full_validation     = full_split["train"], full_split["test"]

### Convert to PyTorch dataset

The models are built on PyTorch, so we convert the datasets to a format that returns PyTorch tensors for the columns the models use.

In [11]:
train_dataset_small = small_train.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])
validation_dataset_small = small_validation.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])

train_dataset_medium = medium_train.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])
validation_dataset_medium = medium_validation.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])

train_dataset_full = full_train.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])
validation_dataset_full = full_validation.with_format('torch', columns=['input_ids', 'attention_mask', 'label'])

test_dataset = tokenized_dataset['test'].with_format('torch', columns=['input_ids', 'attention_mask', 'label'])

### Save dataset
Save the final datasets to a local directory.

In [12]:
train_dataset_small.save_to_disk('./datasets/train_dataset_small')
validation_dataset_small.save_to_disk('./datasets/validation_dataset_small')
train_dataset_medium.save_to_disk('./datasets/train_dataset_medium')
validation_dataset_medium.save_to_disk('./datasets/validation_dataset_medium')
train_dataset_full.save_to_disk('./datasets/train_dataset_full')
validation_dataset_full.save_to_disk('./datasets/validation_dataset_full')
test_dataset.save_to_disk('./datasets/test_dataset')

Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/10 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/100 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/22500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2500 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/25000 [00:00<?, ? examples/s]