# Iterable Dataset and Dataloader

In the previous exercises we have always worked with map-style datasets. A map-style dataset implements the `__getitem__()` and `__len__()` protocols and represents a map from (possibly non-integral) indices/keys to data samples.
For example, such a dataset, when accessed with `dataset[idx]`, could read the idx-th image and its corresponding label from a folder on disk.

However, the assumption that every data point can be trivially looked up by index makes map-style datasets less suited to situations where the input data arrives as a stream, for example an audio or video feed. Alternatively, each data point might be a part of a file that is too large to be held in memory and therefore has to be loaded incrementally during training. These situations can be addressed with more complex logic in our dataset or additional pre-processing of our inputs, but there is a more natural solution: enter the `IterableDataset`!

## (Optional) Mount folder in Colab

Uncomment the following cell to mount your Google Drive if you are using the notebook in Google Colab:

In [2]:
# Use the following lines if you want to use Google Colab
# We presume you created a folder "i2dl" within your main drive folder, and put the exercise there.
# NOTE: terminate all other colab sessions that use GPU!
# NOTE 2: Make sure the correct exercise folder (e.g exercise_11) is given.

"""
from google.colab import drive
import os

gdrive_path='/content/gdrive/MyDrive/i2dl/exercise_11'

# This will mount your google drive under 'MyDrive'
drive.mount('/content/gdrive', force_remount=True)
# In order to access the files in this notebook we have to navigate to the correct folder
os.chdir(gdrive_path)
# Check manually if all files are present
print(sorted(os.listdir()))
"""

# 1. Iterable Dataset
Let's have a look at a standard map-style dataset first! As always, we have to implement a `__len__()` method and a `__getitem__()` method.

In [2]:
from torch.utils.data import DataLoader, Dataset, IterableDataset 
from exercise_code.data.BytePairTokenizer import load_pretrained_fast
from exercise_code.tests.iterable_dataset_test import test_task_1
import os

%load_ext autoreload
%autoreload 2

root_path = os.path.dirname(os.path.abspath(os.getcwd()))
dummy_datasets_path = os.path.join(root_path, 'datasets', 'transformerDatasets', 'dummyDatasets')

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
class MyMapDataset(Dataset):
    
    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]

In [5]:
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

map_dataset = MyMapDataset(data)

loader = DataLoader(map_dataset, batch_size=4)

for batch in loader:
    print(batch)

tensor([0, 1, 2, 3])
tensor([4, 5, 6, 7])
tensor([ 8,  9, 10, 11])


In comparison, let's have a look at a simple Iterable Dataset. All we need to get it running is an __iter__() method! 

In [6]:
class MyIterableDataset(IterableDataset):
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        # Iterate over the data stored on this instance, not the global `data`
        return iter(self.data)

In [7]:
iter_dataset = MyIterableDataset(data)

loader = DataLoader(iter_dataset, batch_size=4)

for batch in loader:
    print(batch)

tensor([0, 1, 2, 3])
tensor([4, 5, 6, 7])
tensor([ 8,  9, 10, 11])
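
Conceptually, batching an iterable dataset only requires pulling `batch_size` items at a time from the iterator until it is exhausted. The following is a simplified sketch in plain Python, not PyTorch's actual `DataLoader` implementation; the helper name `batch_from_iterable` is hypothetical and only serves as illustration.

```python
from itertools import islice


def batch_from_iterable(iterable, batch_size):
    """Yield lists of up to `batch_size` consecutive items from `iterable`."""
    iterator = iter(iterable)
    while True:
        # Pull the next `batch_size` items; fewer if the iterator runs out
        batch = list(islice(iterator, batch_size))
        if not batch:  # iterator exhausted
            return
        yield batch


batches = list(batch_from_iterable(range(12), batch_size=4))
print(batches)
```

Note that no index and no length is needed anywhere: the last batch is simply shorter whenever the number of items is not a multiple of the batch size.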


So far there is not much to see: we created exactly the same thing as before, with arguably more complicated code. Let's try something more interesting: reading from a file! With a map-style dataset, we pretty much have to read the entire file into memory before we can return individual lines in `__getitem__()`. With our images, we only needed to store the image paths in a list, but there is no straightforward way to jump to the i-th line of a text file. If we don't care much about data shuffling, we can instead return each line as it is read from the file, and that is exactly what the following dataset does!

In [8]:
class MyIterableDataset(IterableDataset):
    def __init__(self, file_path):
        self.file_path = file_path
    
    # No __len__(): the number of lines is unknown until the file has been read,
    # and __len__() must return an integer, never None.
    
    def parse_file(self):
        # Lazily yield one line at a time instead of loading the whole file
        with open(self.file_path, 'r') as file:
            for line in file:
                yield line.strip('\n')
                
    def __iter__(self):
        return self.parse_file()

In [9]:
file_path = os.path.join(dummy_datasets_path, 'DummyFile')
iter_dataset = MyIterableDataset(file_path=file_path)

loader = DataLoader(iter_dataset, batch_size=1)

for batch in loader:
    print(batch)

['Iterable Datasets represent a flexible way to handle large volumes of data in machine learning pipelines.']
['They enable the sequential processing of data, allowing for efficient handling of datasets that might not fit entirely into memory.']
['With Iterable Datasets, you can load batches of data on-the-fly, preprocess them, and feed them into your model incrementally.']
['This process is crucial for optimizing computational resources and handling datasets that are too large to fit into memory entirely.']
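
Because the line parser is a generator, lines are only produced when the consumer asks for them. A minimal sketch of this behaviour, using an in-memory `io.StringIO` object as a stand-in for a real file; the helper `parse_lines` and the sample text are made up for illustration.

```python
import io
from itertools import islice


def parse_lines(file):
    """Yield lines one at a time, without the trailing newline."""
    for line in file:
        yield line.strip('\n')


# Stand-in for an open text file (assumption: any line-iterable object works)
fake_file = io.StringIO("line 1\nline 2\nline 3\nline 4\n")
lines = parse_lines(fake_file)

first_two = list(islice(lines, 2))  # only two items are pulled from the generator
remaining = list(lines)             # the generator continues where it stopped
print(first_two, remaining)
```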


The nice thing is, we can even easily cycle through multiple files. This can become a very handy tool for a lot of NLP tasks!

In [10]:
class MyIterableDataset(IterableDataset):
    def __init__(self, file_paths):
        self.file_paths = file_paths

    # No __len__(): the total number of lines is unknown without reading every file.
    
    def __iter__(self):
        # Read the files one after another, yielding one line at a time
        for file_path in self.file_paths:
            with open(file_path, 'r') as file:
                for line in file:
                    yield line.strip('\n')

In [11]:
file_paths = [os.path.join(dummy_datasets_path, 'DummyFile1'),
              os.path.join(dummy_datasets_path, 'DummyFile2'),
              os.path.join(dummy_datasets_path, 'DummyFile3')]

iter_dataset = MyIterableDataset(file_paths=file_paths)

loader = DataLoader(iter_dataset, batch_size=1)

for batch in loader:
    print(batch)

['Dummy File 1 Line 1']
['Dummy File 1 Line 2']
['Dummy File 1 Line 3']
['Dummy File 1 Line 4']
['Dummy File 1 Line 5']
['Dummy File 1 Line 6']
['Dummy File 1 Line 7']
['Dummy File 1 Line 8']
['Last Line of File 1']
['Dummy File 2 Line 1']
['Dummy File 2 Line 2']
['Dummy File 2 Line 3']
['Dummy File 2 Line 4']
['Dummy File 2 Line 5']
['Last Line of File 2']
['Dummy File 3 Line 1']
['Dummy File 3 Line 2']
['Dummy File 3 Line 3']
['Dummy File 3 Line 4']
['Dummy File 3 Line 5']
['Dummy File 3 Line 6']
['Dummy File 3 Line 7']
['Dummy File 3 Line 8']
['Dummy File 3 Line 9']
['Dummy File 3 Line 10']
['Dummy File 3 Line 11']
['Last Line of File 3']
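
The nested loop over files and lines can also be written with `itertools.chain.from_iterable`, which lazily concatenates one line generator per file. This is only an alternative sketch under stated assumptions: the helper `read_lines`, the temporary files, and their contents are made up so that the example runs without the exercise data.

```python
import os
import tempfile
from itertools import chain


def read_lines(path):
    """Yield the lines of a single file without the trailing newline."""
    with open(path, 'r') as file:
        for line in file:
            yield line.strip('\n')


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small throwaway files to read from
    paths = []
    for i in (1, 2):
        path = os.path.join(tmp_dir, f'file{i}.txt')
        with open(path, 'w') as file:
            file.write(f'File {i} Line 1\nFile {i} Line 2\n')
        paths.append(path)

    # Each file is only opened once the previous one is exhausted
    all_lines = list(chain.from_iterable(read_lines(p) for p in paths))

print(all_lines)
```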


<div class="alert alert-info">
    <h3>Task 1: Implement</h3>
    <p>Implement the <code>parse_file()</code> method in <code>exercise_code/data/TransformerDataset.py</code>. Also, check out the <code>__iter__()</code> method!
    </p>
</div>

In [12]:
_ = test_task_1()


####### Testing Started #######

Test IterableDatasetKeyTest: passed!
Test IterableDatasetValueTest: passed!

####### Testing Finished #######
Test TestTask1: passed! -->  Tests passed: 2/2
Score: 100/100


# 2. Collator
So far we have usually dealt with same-sized data, such as tabular data or images of a fixed resolution. In language processing, our inputs are often sentences, which usually vary in length. Since we still want to pass the data through the network in batches, we have to make sure that all items in a batch have the same length.
That is the idea behind sequence padding. To ensure that all items in a batch have the same length, we add special padding tokens to the end of the individual sentences until each one is as long as the longest sentence in the batch. We also have to keep track of how many padding tokens were added to each sequence, since we don't want our model to change its prediction when we add or remove pad tokens. This can be done very efficiently using padding masks.
To implement this functionality in our dataloader, we have to write a custom collate function. The collate function takes a batch in the form of a list and outputs the processed data in a tensor-based batched format, similar to the dataloader we implemented together in exercise 03! (Remember `combine_batch_dicts` and `batch_to_numpy`?)


Let's have a look at what we have to do first!

In [13]:
from exercise_code.data.TransformerDataset import CustomIterableDataset

batch_list = []
batch_size = 5
tokenizer = load_pretrained_fast()

# Load the dataset we created
file_path = os.path.join(dummy_datasets_path, 'ds_dummy')
dataset = CustomIterableDataset(file_paths=file_path)
iterator = iter(dataset)

# Iterate through the dataset and fill it with the source sentences
for _ in range(batch_size):
    batch_list.append(next(iterator)['source'])

# Now let's tokenize these sentences at the same time using batch encode!
batch_input_ids = tokenizer.batch_encode_plus(batch_list)['input_ids']
# Encode with padding once and reuse the result for both the IDs and the masks
batch_padded = tokenizer.batch_encode_plus(batch_list, padding=True)
batch_padded_input_ids = batch_padded['input_ids']
batch_masks = batch_padded['attention_mask']

# Printing the results
print('Non Padded Lengths:')
for i, item in enumerate(batch_input_ids):
    print('Item {0}: {1}'.format(i, len(item)))

print('Padded Lengths:')
for i, item in enumerate(batch_padded_input_ids):
    print('Item {0}: {1}'.format(i, len(item)))

print('Padding Masks')
for item in batch_masks:
    print(item)

Non Padded Lengths:
Item 0: 12
Item 1: 10
Item 2: 9
Item 3: 7
Item 4: 29
Padded Lengths:
Item 0: 29
Item 1: 29
Item 2: 29
Item 3: 29
Item 4: 29
Padding Masks
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


If you compare the non-padded lengths to the padded lengths, you will notice that the non-padded lengths vary quite a bit, while the padded lengths all equal the length of the longest sequence in the batch (Item 4 in our case, with 29 tokens).
The tokenizer's batch encoding automatically gives us the masks we need to track which IDs are padding and which are not: every 0 you see stands for an added padding token!
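
To make the mechanics explicit, here is a minimal sketch of padding and mask creation in plain Python, without the tokenizer. The function name `pad_batch`, the pad ID of 0, and the token IDs are assumptions chosen for illustration; the real pad token ID depends on the tokenizer, and a real collate function would additionally convert the lists to tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the length of the longest one and build masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Append pad tokens on the right until the sequence reaches max_len
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks an added padding token
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


padded = pad_batch([[5, 8, 2], [7], [4, 9]])
print(padded['input_ids'])       # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
print(padded['attention_mask'])  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```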

Now let's use our Collator to get the same functionality!

In [14]:
from exercise_code.data.TransformerCollator import CustomCollator

# Define the Collator
collator = CustomCollator(tokenizer=tokenizer)
dataset = CustomIterableDataset(file_paths=file_path)
dataloader = DataLoader(dataset=dataset, batch_size=5, collate_fn=collator)

# Create an iterator to iterate through the dataloader
iterator = iter(dataloader)

# Get the first batch and print the keys
batch = next(iterator)
print(batch.keys())

dict_keys(['encoder_inputs', 'encoder_mask', 'decoder_inputs', 'decoder_mask', 'labels', 'label_mask', 'label_length'])


And we can of course have a look at our masks again! Note: The output is already prepared for the model, so to get it to look the same we have to do a couple minor transformations ;)

In [15]:
print('Padding Mask')
print(batch['encoder_mask'].int().squeeze(1).numpy())

Padding Mask
[[1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


<div class="alert alert-info">
    <h3>Task 2: Check Code</h3>
    <p>Check the <code>__call__()</code> method in <code>exercise_code/data/TransformerCollator.py</code>. We will discuss the prepare mask function and the attention mask in the next notebook!
    </p>
</div>

You are now finished with this notebook and can move on to what we came for: transformer models!