##### © Copyright 2020 [George Mihaila](https://github.com/gmihaila).

Licensed under the Apache License, Version 2.0 (the "License");

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# **Better Batches with PyTorchText BucketIterator**

## **How to use PyTorchText BucketIterator to sort text data for better batching.**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gmihaila/ml_things/blob/master/notebooks/pytorch/pytorchtext_bucketiterator.ipynb) &nbsp;
[![Generic badge](https://img.shields.io/badge/GitHub-Source-green.svg)](https://github.com/gmihaila/ml_things/blob/master/notebooks/pytorch/pytorchtext_bucketiterator.ipynb)
[![Generic badge](https://img.shields.io/badge/Article-Medium-black.svg)](https://gmihaila.medium.com/better-batches-with-pytorchtext-bucketiterator-12804a545e2a)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)


<br>

**Disclaimer:** *The format of this tutorial notebook is very similar to my other tutorial notebooks. This is done intentionally in order to keep readers familiar with my format.*

<br>

This notebook is a simple tutorial on how to use the powerful **PyTorchText** **BucketIterator** functionality to group examples (**I use examples and sequences interchangeably**) of similar lengths into batches. This keeps the padding in each batch to a minimum when training models on text data.

Batches of similar-length examples bring a large gain for recurrent models (RNN, GRU, LSTM) and transformer models (BERT, RoBERTa, GPT-2, XLNet, etc.), because padding is minimal.

Basically, any model that takes variable-length text sequences as input will benefit from this tutorial.
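
To make the padding argument concrete, here is a minimal sketch that counts how many padding positions are needed when every batch is padded to its longest example. The sequence lengths are randomly generated (they are **not** taken from the dataset) and the helper name `padding_tokens` is my own, used only for illustration.

```python
import random


def padding_tokens(lengths, batch_size):
    """Count pad positions when each batch is padded to its longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - length for length in batch)
    return total


random.seed(0)
# Hypothetical sequence lengths, not taken from the dataset.
lengths = [random.randint(50, 2000) for _ in range(1000)]

# Batches built in arbitrary order vs. batches built from length-sorted data.
random_order = padding_tokens(lengths, batch_size=10)
sorted_order = padding_tokens(sorted(lengths), batch_size=10)

print(f'Padding with random batches: {random_order}')
print(f'Padding with length-sorted batches: {sorted_order}')
```

The sorted version should need far less padding, which is the effect the BucketIterator is after.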

**I will not train any models in this notebook!** I will release a tutorial where I use this implementation to train a transformer model.

The purpose is to take an example text dataset, batch it using **PyTorchText** with **BucketIterator**, and show how it groups text sequences of similar length into batches.

This tutorial has two main parts:

* **Using PyTorch Dataset with PyTorchText Bucket Iterator**: Here I implement a standard PyTorch Dataset class that reads in the example text dataset and use the PyTorchText BucketIterator to group examples of similar length into the same batches. I want to show how easy it is to use this powerful functionality from PyTorchText in a regular PyTorch Dataset workflow that you already have set up.

* **Using PyTorchText TabularDataset with PyTorchText Bucket Iterator**: Here I use the built-in PyTorchText TabularDataset that reads data straight from local files without the need to create a PyTorch Dataset class. Then I follow the same steps as in the previous part to show how nicely text examples are grouped together.

*This notebook is a code adaptation and implementation inspired by a few sources:* [torchtext_translation_tutorial](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html), [pytorch/text - GitHub](https://github.com/pytorch/text), [torchtext documentation](https://torchtext.readthedocs.io/en/latest/index.html#) and [A Comprehensive Introduction to Torchtext](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/).

<br>

## **What should I know for this notebook?**

Some basic PyTorch knowledge regarding the Dataset class and DataLoaders. Some knowledge of PyTorchText is helpful but not critical for understanding this tutorial. Applying the BucketIterator to a PyTorch Dataset is similar to applying a DataLoader.

<br>

## **How to use this notebook?**

The code is made with reusability in mind. It can be easily adapted for other text datasets and other NLP tasks in order to achieve optimal batching. 

Comments should provide enough guidance to easily adapt this notebook to your needs.

This code is designed mostly with **classification tasks** in mind, but it can be adapted to any other Natural Language Processing task where batching text data is needed.






<br>


## **Dataset**

I will use the well-known [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) of movie reviews labeled as positive or negative.

The description provided on the Stanford website:

*This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.*

**Why this dataset?** I believe it is an easy dataset to understand and use for classification. I think sentiment data is always fun to work with.

<br>

## **Coding**

Now let's do some coding! We will go through each code cell in the notebook and describe what it does, show the code, and, when relevant, show the output.

I made this format to be easy to follow if you decide to run each code cell in your own python notebook.

When I learn from a tutorial I always try to replicate the results. I believe it's easy to follow along if you have the code next to the explanations.

<br>


## Downloads

Download the IMDB Movie Reviews sentiment dataset and unzip it locally.

In [2]:
# download the dataset
!wget -q -nc http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# unzip it
!tar -zxf /content/aclImdb_v1.tar.gz

## **Installs**

* **[ml_things](https://github.com/gmihaila/ml_things)** library used for various machine learning related tasks. I created this library to reduce the amount of code I need to write for each machine learning project.


In [3]:
# Install helper functions.
!pip install -q git+https://github.com/gmihaila/ml_things.git



## **Imports**

Import all needed libraries for this notebook.

Declare basic parameters used for this notebook:

* `device` - Device to use by torch: GPU/CPU. I use CPU as default since I will not perform any costly operations.

* `train_batch_size` - Batch size used on train data.

* `valid_batch_size` - Batch size used for validation data. It is usually larger than `train_batch_size`, since the model only needs to make predictions and no gradient calculations are needed.

In [4]:
import io
import os
import torchtext
from tqdm.notebook import tqdm
from ml_things import fix_text
from torch.utils.data import Dataset, DataLoader

# Will use `cpu` for simplicity.
device = 'cpu'

# Batch size (number of examples per batch) for training.
train_batch_size = 10

# Batch size for validation. Use a larger value than for training.
# It helps speed up the validation process.
valid_batch_size = 20

## Using PyTorch Dataset

This is where I create the PyTorch Dataset objects for training and validation that **can** be used to feed data into a model. This is standard procedure when using PyTorch.



### Dataset Class

Implementation of the PyTorch Dataset class.

Most important components in a PyTorch Dataset class are:
* `__len__(self)`: returns the number of examples in the dataset that we read in `__init__(self, path)`. This ensures that `len()` returns the number of examples.
* `__getitem__(self, item)`: given an index `item`, returns the example at position `item`.

In [5]:
class MovieReviewsTextDataset(Dataset):
  r"""PyTorch Dataset class for loading data.

  This is where the data parsing happens.

  This class is built with reusability in mind.

  Arguments:

    path (:obj:`str`):
        Path to the data partition.

  """

  def __init__(self, path):

    # Check if path exists.
    if not os.path.isdir(path):
      # Raise error if path is invalid.
      raise ValueError('Invalid `path` variable! Needs to be a directory')
    
    self.texts = []
    self.labels = []
    # Since the labels are defined by folders with data we loop 
    # through each label.
    for label  in ['pos', 'neg']:
      sentiment_path = os.path.join(path, label)

      # Get all files from path.
      files_names = os.listdir(sentiment_path)#[:10] # Sample for debugging.
      # Go through each file and read its content.
      for file_name in tqdm(files_names, desc=f'{label} Files'):
        file_path = os.path.join(sentiment_path, file_name)

        # Read content (the `with` block closes the file afterwards).
        with io.open(file_path, mode='r', encoding='utf-8') as text_file:
          content = text_file.read()
        # Fix any unicode issues.
        content = fix_text(content)
        # Save content.
        self.texts.append(content)
        # Save labels.
        self.labels.append(label)

    # Number of examples.
    self.n_examples = len(self.labels)

    return


  def __len__(self):
    r"""When used `len` return the number of examples.

    """
    
    return self.n_examples


  def __getitem__(self, item):
    r"""Given an index return an example from the position.
    
    Arguments:

      item (:obj:`int`):
          Index position to pick an example to return.

    Returns:
      :obj:`Dict[str, str]`: Dictionary of inputs that are used to feed 
      to a model.

    """

    return {'text':self.texts[item], 'label':self.labels[item]}

### Train - Validation Datasets

Create PyTorch Dataset for train and validation partitions.

In [6]:
print('Dealing with Train...')
# Create pytorch dataset.
train_dataset = MovieReviewsTextDataset(path='/content/aclImdb/train')

print(f'Created `train_dataset` with {len(train_dataset)} examples!')

print()

print('Dealing with Validation...')
# Create pytorch dataset.
valid_dataset =  MovieReviewsTextDataset(path='/content/aclImdb/test')
                               
print(f'Created `valid_dataset` with {len(valid_dataset)} examples!')

Dealing with Train...



Created `train_dataset` with 25000 examples!

Dealing with Validation...



Created `valid_dataset` with 25000 examples!


### PyTorch DataLoader

In order to group examples from the PyTorch Dataset into batches we use PyTorch DataLoader. This is standard when using PyTorch.

In [7]:
# Move pytorch dataset into dataloader.
torch_train_dataloader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True)
print(f'Created `torch_train_dataloader` with {len(torch_train_dataloader)} batches!')

# Move pytorch dataset into dataloader.
torch_valid_dataloader = DataLoader(valid_dataset, batch_size=valid_batch_size, shuffle=False)
print(f'Created `torch_valid_dataloader` with {len(torch_valid_dataloader)} batches!')

Created `torch_train_dataloader` with 2500 batches!
Created `torch_valid_dataloader` with 1250 batches!
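
The batch counts printed above follow directly from the dataset size and the batch sizes. As a quick check, assuming the DataLoader keeps the last, possibly smaller, batch (its default `drop_last=False`), the number of batches is the number of examples divided by the batch size, rounded up. The helper name `number_of_batches` is only for illustration.

```python
import math


def number_of_batches(n_examples, batch_size):
    """Number of batches when the last, possibly smaller, batch is kept."""
    return math.ceil(n_examples / batch_size)


# 25,000 examples in each partition, with the batch sizes declared earlier.
print(number_of_batches(25000, 10))  # 2500
print(number_of_batches(25000, 20))  # 1250
```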


### PyTorchText Bucket Iterator Dataloader

Here is where the magic happens! We pass in the **train_dataset** and **valid_dataset** PyTorch Dataset splits into **BucketIterator** to create the actual batches.

It's very nice that PyTorchText can handle splits! There is no need to write the same lines of code again for the train and validation splits.

**The `sort_key` parameter is very important!** It is used to order text sequences in batches. Since we want to batch sequences of text with similar length, we will use a simple function that returns the length of a data example (`len(x['text'])`). This function needs to follow the format of the PyTorch Dataset we created in order to return the length of an example; in my case, each example is a dictionary with a `text` key.

**It is important to keep `sort=False` and `sort_within_batch=True` to only sort the examples in each batch and not the examples in the whole dataset!**
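
To build some intuition for what bucketing does, here is a simplified, pure Python sketch of the idea: shuffle the data, cut it into large pools, sort each pool with `sort_key`, slice the pool into batches, and shuffle the batches. This is an illustration under my own assumptions and **not** the actual PyTorchText code; the function `bucket_batches` and its `pool_factor` and `seed` parameters are hypothetical.

```python
import random


def bucket_batches(examples, batch_size, sort_key, pool_factor=100, seed=0):
    """Group examples of similar length into batches (illustrative sketch)."""
    rng = random.Random(seed)
    examples = list(examples)
    # Shuffle first so that every epoch could see different pools.
    rng.shuffle(examples)

    # Each pool holds `pool_factor` batches worth of examples.
    pool_size = batch_size * pool_factor
    batches = []
    for start in range(0, len(examples), pool_size):
        # Sort only inside the pool, never the whole dataset.
        pool = sorted(examples[start:start + pool_size], key=sort_key)
        batches.extend(pool[i:i + batch_size]
                       for i in range(0, len(pool), batch_size))

    # Shuffle the batches so training does not go from short to long.
    rng.shuffle(batches)
    return batches


# Toy examples in the same dictionary format as the Dataset above.
examples = [{'text': 'x' * n, 'label': 'pos'} for n in range(1, 41)]
batches = bucket_batches(examples, batch_size=4,
                         sort_key=lambda x: len(x['text']), pool_factor=2)

for batch in batches[:3]:
    print([len(example['text']) for example in batch])
```

Each batch comes out sorted by length, while the order of the batches stays random.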

Find more details in the PyTorchText **BucketIterator** documentation [here](https://torchtext.readthedocs.io/en/latest/data.html#bucketiterator) - look at the **BPTTIterator** because it has the same parameters except for the **bptt_len** argument.

**Note:**
*If you want just a single DataLoader use `torchtext.data.BucketIterator` instead of `torchtext.data.BucketIterator.splits` and make sure to provide just one PyTorch Dataset instead of tuple of PyTorch Datasets and change the parameter `batch_sizes` and its tuple values to `batch_size` with single value: `dataloader = torchtext.data.BucketIterator(dataset, batch_size=batch_size, )`*

In [8]:
# Group similar length text sequences together in batches.
torchtext_train_dataloader, torchtext_valid_dataloader = torchtext.data.BucketIterator.splits(
    
                              # Datasets for iterator to draw data from
                              (train_dataset, valid_dataset),

                              # Tuple of train and validation batch sizes.
                              batch_sizes=(train_batch_size, valid_batch_size),

                              # Device to load batches on.
                              device=device, 

                              # Function to use for sorting examples.
                              sort_key=lambda x: len(x['text']),


                              # Repeat the iterator for multiple epochs.
                              repeat=True, 

                              # Sort all examples in data using `sort_key`.
                              sort=False, 

                              # Shuffle data on each epoch run.
                              shuffle=True,

                              # Use `sort_key` to sort examples in each batch.
                              sort_within_batch=True,
                              )

# Print number of batches in each split.
print('Created `torchtext_train_dataloader` with %d batches!'%len(torchtext_train_dataloader))
print('Created `torchtext_valid_dataloader` with %d batches!'%len(torchtext_valid_dataloader))

Created `torchtext_train_dataloader` with 2500 batches!
Created `torchtext_valid_dataloader` with 1250 batches!


### Compare DataLoaders

Let's compare the PyTorch DataLoader batches with the PyTorchText BucketIterator batches. We can see how nicely examples of similar length are grouped in the same batch with PyTorchText.

**Note:** *When using the PyTorchText BucketIterator, make sure to call `create_batches()` before looping through each batch! Otherwise you won't get any output from the iterator.*

In [9]:
# Loop through regular dataloader.
print('PyTorch DataLoader\n')
for batch in torch_train_dataloader:
  
  # Let's check batch size.
  print('Batch size: %d\n'% len(batch['text']))
  print('LABEL\tLENGTH\tTEXT'.ljust(10))

  # Print each example.
  for text, label in zip(batch['text'], batch['label']):
    print('%s\t%d\t%s'.ljust(10) % (label, len(text), text))
  print('\n')
  
  # Only look at first batch. Reuse this code in training models.
  break
  

# Create batches - needs to be called before each loop.
torchtext_train_dataloader.create_batches()

# Loop through BucketIterator.
print('PyTorchText BucketIterator\n')
for batch in torchtext_train_dataloader.batches:

  # Let's check batch size.
  print('Batch size: %d\n'% len(batch))
  print('LABEL\tLENGTH\tTEXT'.ljust(10))
  
  # Print each example.
  for example in batch:
    print('%s\t%d\t%s'.ljust(10) % (example['label'], len(example['text']), example['text']))
  print('\n')
  
  # Only look at first batch. Reuse this code in training models.
  break

PyTorch DataLoader

Batch size: 10

LABEL	LENGTH	TEXT
pos	1037	Fascinating movie, based on a true story, about an Australian woman, Lindy Chamberlain (Meryl Streep) accused of killing her baby daughter. She insists that a dingo took her baby, but the story is highly suspicious. The film is actually about the media circus that took place around the case, the way Australians interpreted what was presented in the media, and the lynch mob mentality that ultimately led to the woman's conviction, based on barely any hard evidence. I love films that question the media, and also films that take a hard look on how people are railroaded by the justice system. I've always thought that juries ought to be showed 12 Angry Men before they go through with their duties. It's not, as has often been said, a liberal movie, but a clinical look at how we as human beings interpret events based so much on our prejudices and a desire for revenge. A Cry in the Dark is likewise clinical. Schepisi is careful not 

### Train Loop Examples

Now let's look at what a model training loop would look like. I print the example lengths of the first 11 batches to show how nicely they are grouped throughout the dataset!

In [10]:
# Example of number of epochs
epochs = 1

# Example of loop through each epoch
for epoch in range(epochs):

  # Create batches - needs to be called before each loop.
  torchtext_train_dataloader.create_batches()

  # Loop through BucketIterator.
  for sample_id, batch in enumerate(torchtext_train_dataloader.batches):
    print('Batch examples lengths: %s'.ljust(20) % str([len(example['text']) for example in batch]))

    # Let's break early, you get the idea.
    if sample_id == 10:
      break

Batch examples lengths: [848, 848, 849, 849, 850, 852, 853, 854, 856, 857]
Batch examples lengths: [779, 780, 780, 781, 781, 782, 782, 782, 783, 784]
Batch examples lengths: [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]
Batch examples lengths: [903, 905, 910, 910, 910, 910, 914, 915, 916, 919]
Batch examples lengths: [968, 968, 970, 970, 971, 972, 973, 975, 981, 982]
Batch examples lengths: [806, 806, 807, 807, 808, 809, 810, 810, 811, 811]
Batch examples lengths: [731, 733, 734, 735, 736, 736, 737, 737, 738, 739]
Batch examples lengths: [357, 357, 358, 361, 362, 362, 362, 364, 366, 371]
Batch examples lengths: [2330, 2335, 2337, 2350, 2351, 2353, 2367, 2374, 2376, 2383]
Batch examples lengths: [1916, 1920, 1921, 1936, 1951, 1953, 1967, 1970, 1981, 1985]
Batch examples lengths: [1395, 1398, 1399, 1402, 1403, 1412, 1412, 1413, 1414, 1414]
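
To put a number on how tight these batches are, here is a small sketch that computes the share of padding in a batch if every example were padded to the longest one. The lengths are copied from the first and third batches printed above (they are character counts, since no tokenizer is applied here); the helper name `padding_fraction` is only for illustration.

```python
def padding_fraction(lengths):
    """Share of positions that are padding when padding to the longest example."""
    padded_total = max(lengths) * len(lengths)
    return (padded_total - sum(lengths)) / padded_total


# Lengths copied from the first and third batches printed above.
first_batch = [848, 848, 849, 849, 850, 852, 853, 854, 856, 857]
third_batch = [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]

for lengths in (first_batch, third_batch):
    print(f'Padding share: {padding_fraction(lengths):.2%}')
```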


## Using PyTorchText TabularDataset

Now I will use the TabularDataset functionality, which creates the dataset object right from our local files.

We don't need to create a custom PyTorch Dataset class to load our dataset as long as we have tabular files of our data.

### Data to Files

Since our dataset is scattered into multiple files, I created a function `files_to_tsv` which puts our dataset into a `.tsv` file (Tab-Separated Values).

Since I'll use the **TabularDataset** from `torchtext.data`, I need to pass files in a tabular format.

For text data I find the Tab Separated Values format easier to deal with.

I will call the `files_to_tsv` function for each of the two partitions **train** and **test**. 

The function will return the name of the `.tsv` file saved so we can use it later in PyTorchText.
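
One thing to keep in mind with the `.tsv` format is that a tab or a line break inside a text would split one example across columns or lines. I have not checked whether this happens in this dataset, so the sketch below is only a precaution you may want for your own data. The helper `to_tsv_line` is hypothetical and is not used by `files_to_tsv`.

```python
def to_tsv_line(label, text):
    """Build one `label<TAB>text` line, keeping the text on a single line."""
    # Replace characters that would break the one-example-per-line layout.
    clean_text = text.translate(str.maketrans({'\t': ' ', '\n': ' ', '\r': ' '}))
    return f'{label}\t{clean_text}'


line = to_tsv_line('pos', 'Great movie!\tLoved it.\nWould watch again.')
print(line.split('\t'))
```

Replacing each character with a single space keeps the text length unchanged, so length-based sorting is not affected.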

In [11]:
def files_to_tsv(partition_path, save_path='./'):
  """Parse each file in partition and keep track of sentiments.
  Create a list of pairs [tag, text]

  Arguments:

    partition_path (:obj:`str`):
      Partition used: train or test.

    save_path (:obj:`str`):
      Path where to save the final .tsv file.

  Returns:

    :obj:`str`: Filename of created .tsv file.

  """

  # List of all examples in format [tag, text].
  examples = []

  # Print partition.
  print(partition_path)

  # Loop through each sentiment.
  for sentiment in ['pos', 'neg']:

    # Find path for sentiment.
    sentiment_path = os.path.join(partition_path, sentiment)

    # Get all files from path sentiment.
    files_names = os.listdir(sentiment_path)

    # For each file in path sentiment.
    for file_name in tqdm(files_names, desc=f'{sentiment} Files'):

      # Get file content (the `with` block closes the file afterwards).
      with io.open(os.path.join(sentiment_path, file_name), mode='r', encoding='utf-8') as text_file:
        file_content = text_file.read()

      # Fix any format errors.
      file_content = fix_text(file_content)

      # Append sentiment and file content.
      examples.append([sentiment, file_content])

  # Create a TSV file with same format `sentiment  text`.
  examples = ["%s\t%s"%(example[0], example[1]) for example in examples]

  # Create file name.
  tsv_filename = os.path.basename(partition_path) + '_pos_neg_%d.tsv'%len(examples)

  # Write to TSV file (the `with` block flushes and closes the file).
  with io.open(os.path.join(save_path, tsv_filename), mode='w', encoding='utf-8') as tsv_file:
    tsv_file.write('\n'.join(examples))

  # Return TSV file name.
  return tsv_filename
  

# Path where to save tsv file.
data_path = '/content'

# Convert train files to tsv file.
train_filename = files_to_tsv(partition_path='/content/aclImdb/train', save_path=data_path)

# Convert test files to tsv file.
test_filename = files_to_tsv(partition_path='/content/aclImdb/test', save_path=data_path)

/content/aclImdb/train



/content/aclImdb/test





### TabularDataset

Here I set up the data fields for PyTorchText. We have to tell the library how to handle each column of the `.tsv` file. For this we need to create `data.Field` objects for each column.

`text_tokenizer`: 
For this example I don't use an actual tokenizer for the `text` column, but I need to create one because the field requires it as input. I created a dummy tokenizer that returns the same value. Depending on the project, this is where you would use your own tokenizer. It needs to take text as input and output a list.

`label_tokenizer`:
The label tokenizer is also a dummy tokenizer. This is where you would use an encoder to transform labels into ids.
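
As a rough idea of what could replace the two dummy functions, here is a minimal sketch with a whitespace tokenizer and a dictionary-based label encoder. The names `whitespace_tokenizer`, `label_to_id` and `label_encoder` are hypothetical, and how you plug them into the fields depends on your own setup.

```python
def whitespace_tokenizer(text):
    """Split text on whitespace and return a list of tokens."""
    return text.split()


# Hypothetical mapping from the folder names used as labels to ids.
label_to_id = {'neg': 0, 'pos': 1}


def label_encoder(label):
    """Return the integer id of a label."""
    return label_to_id[label]


print(whitespace_tokenizer('A great movie'))
print(label_encoder('pos'))
```

Keep in mind that with a real tokenizer `len(x.text)` counts tokens instead of characters, so batches get grouped by token length.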

Since we have two `.tsv` files, it's great that we can use the `.splits` function from **TabularDataset** to handle both files at the same time: one for train and the other for test.

Find more details about **torchtext.data** functionality [here](https://torchtext.readthedocs.io/en/latest/data.html#dataset-batch-and-example).

In [12]:
# Text tokenizer function - dummy tokenizer to return same text.
# Here you will use your own tokenizer.
text_tokenizer = lambda x : x

# Label tokenizer - dummy label encoder that returns same label.
# Here you will add your own label encoder.
label_tokenizer = lambda x: x

# Data field for text column - invoke tokenizer.
TEXT = torchtext.data.Field(sequential=True, tokenize=text_tokenizer, lower=False)

# Data field for labels - invoke tokenize label encoder.
LABEL = torchtext.data.Field(sequential=True, tokenize=label_tokenizer, use_vocab=False)

# Create data fields as tuples of description variable and data field.
datafields = [("label", LABEL),
              ("text", TEXT)]

# Since we have tab separated data we use TabularDataset.
train_dataset, valid_dataset = torchtext.data.TabularDataset.splits(
    
                                                # Path to train and validation.
                                                path=data_path,

                                                # Train data filename.
                                                train=train_filename,

                                                # Validation file name.
                                                validation=test_filename,

                                                # Format of local files.
                                                format='tsv',

                                                # Check if we have header.
                                                skip_header=False,

                                                # How to handle fields.
                                                fields=datafields)

### PyTorchText Bucket Iterator Dataloader

I'm using the same setup as in the earlier **PyTorchText Bucket Iterator Dataloader** section. The only difference is in the `sort_key`, since there is a different way to access example attributes (we had a dictionary format before).
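
The difference between the two `sort_key` functions comes down to dictionary access versus attribute access. In the small sketch below I use `types.SimpleNamespace` as a stand-in for an example created by TabularDataset; it is not the real PyTorchText example class, just an object with `label` and `text` attributes.

```python
from types import SimpleNamespace

# Sort key for examples returned as dictionaries (PyTorch Dataset part).
dict_sort_key = lambda x: len(x['text'])

# Sort key for examples with attributes (TabularDataset part).
attr_sort_key = lambda x: len(x.text)

dict_example = {'label': 'pos', 'text': 'A great movie'}
# Stand-in object, not the real PyTorchText example class.
attr_example = SimpleNamespace(label='pos', text='A great movie')

print(dict_sort_key(dict_example), attr_sort_key(attr_example))
```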

In [13]:
# Group similar length text sequences together in batches.
torchtext_train_dataloader, torchtext_valid_dataloader = torchtext.data.BucketIterator.splits(
    
                              # Datasets for iterator to draw data from
                              (train_dataset, valid_dataset),

                              # Tuple of train and validation batch sizes.
                              batch_sizes=(train_batch_size, valid_batch_size),

                              # Device to load batches on.
                              device=device, 

                              # Function to use for sorting examples.
                              sort_key=lambda x: len(x.text),


                              # Repeat the iterator for multiple epochs.
                              repeat=True, 

                              # Sort all examples in data using `sort_key`.
                              sort=False, 

                              # Shuffle data on each epoch run.
                              shuffle=True,

                              # Use `sort_key` to sort examples in each batch.
                              sort_within_batch=True,
                              )

# Print number of batches in each split.
print('Created `torchtext_train_dataloader` with %d batches!'%len(torchtext_train_dataloader))
print('Created `torchtext_valid_dataloader` with %d batches!'%len(torchtext_valid_dataloader))

Created `torchtext_train_dataloader` with 2500 batches!
Created `torchtext_valid_dataloader` with 1250 batches!


### Compare DataLoaders

Let's compare the PyTorch DataLoader batches with the PyTorchText BucketIterator batches created with TabularDataset. We can see how nicely examples of similar length are grouped in the same batch with PyTorchText.

**Note:** *When using the PyTorchText BucketIterator, make sure to call `create_batches()` before looping through each batch! Otherwise you won't get any output from the iterator.*

In [14]:
# Loop through regular dataloader.
print('PyTorch DataLoader\n')
for batch in torch_train_dataloader:
  
  # Let's check batch size.
  print('Batch size: %d\n'% len(batch['text']))
  print('LABEL\tLENGTH\tTEXT'.ljust(10))

  # Print each example.
  for text, label in zip(batch['text'], batch['label']):
    print('%s\t%d\t%s'.ljust(10) % (label, len(text), text))
  print('\n')
  
  # Only look at first batch. Reuse this code in training models.
  break
  

# Create batches - needs to be called before each loop.
torchtext_train_dataloader.create_batches()

# Loop through BucketIterator.
print('PyTorchText BucketIterator\n')
for batch in torchtext_train_dataloader.batches:

  # Let's check batch size.
  print('Batch size: %d\n'% len(batch))
  print('LABEL\tLENGTH\tTEXT'.ljust(10))
  
  # Print each example.
  for example in batch:
    print('%s\t%d\t%s'.ljust(10) % (example.label, len(example.text), example.text))
  print('\n')
  
  # Only look at first batch. Reuse this code in training models.
  break

PyTorch DataLoader

Batch size: 10

LABEL	LENGTH	TEXT
neg	1205	This movie is bad news and I'm really surprised at the level of big name talent who would ever agree to appear in such a piece of junk as this. I imagine there were a few strangled agents sprawled across Hollywood Blvd. as a result of this fiasco. What really gets you is that it could have been good. The directors star appeal and the subject matter was sufficient fodder to spark interest and ticket sales, but this is a flop. The multiple story lines all go from bad to silly by the pictures end, and you end up feeling like a mouse in a maze looking for a piece of cheese that turns out to be rotten. What Spike is able to achieve is revenge against any Italians who may have beat him up when he was a kid or insulted him, as the movie does quite a number on perpetuating outdated and probably offensive Italian stereotypes. As with any Spike Lee film there is some really thought provoking and magical camerawork. He does have the g

### Train Loop Examples

Now let's look at what a model training loop would look like. I print the example lengths of the first 11 batches to show how nicely they are grouped throughout the dataset!

We see the same grouping behavior as we did when using the PyTorch Dataset. Now it is up to you which way is easier for using the PyTorchText BucketIterator: with a PyTorch Dataset or with a PyTorchText TabularDataset.

In [15]:
# Example of number of epochs.
epochs = 1

# Example of loop through each epoch.
for epoch in range(epochs):

  # Create batches - needs to be called before each loop.
  torchtext_train_dataloader.create_batches()

  # Loop through BucketIterator.
  for sample_id, batch in enumerate(torchtext_train_dataloader.batches):
    # Put all example.text of batch in single array.
    batch_text = [example.text for example in batch]

    print('Batch examples lengths: %s'.ljust(20) % str([len(text) for text in batch_text]))

    # Let's break early, you get the idea.
    if sample_id == 10:
      break

Batch examples lengths: [848, 848, 849, 849, 850, 852, 853, 854, 856, 857]
Batch examples lengths: [779, 780, 780, 781, 781, 782, 782, 782, 783, 784]
Batch examples lengths: [2100, 2103, 2104, 2109, 2114, 2135, 2147, 2151, 2158, 2164]
Batch examples lengths: [903, 905, 910, 910, 910, 910, 914, 915, 916, 919]
Batch examples lengths: [968, 968, 970, 970, 971, 972, 973, 975, 981, 981]
Batch examples lengths: [806, 806, 807, 807, 808, 809, 810, 810, 811, 811]
Batch examples lengths: [731, 733, 734, 735, 736, 736, 737, 737, 738, 739]
Batch examples lengths: [357, 357, 358, 361, 362, 362, 362, 364, 366, 371]
Batch examples lengths: [2330, 2335, 2337, 2350, 2351, 2353, 2367, 2374, 2376, 2381]
Batch examples lengths: [1916, 1920, 1921, 1936, 1951, 1953, 1967, 1970, 1981, 1985]
Batch examples lengths: [1395, 1398, 1399, 1402, 1403, 1412, 1412, 1413, 1414, 1414]


## **Final Note**

If you made it this far **Congrats!** 🎊 and **Thank you!** 🙏 for your interest in my tutorial!

I've been using this code for a while now and I feel it has reached a point where it is nicely documented and easy to follow.

Of course, it is easy for me to follow because I built it. That is why any feedback is welcome and helps me improve my future tutorials!

If you see something wrong please let me know by opening an issue on my [ml_things GitHub repository](https://github.com/gmihaila/ml_things/issues)!

A lot of tutorials out there are mostly a one-time thing and are not being maintained. I plan on keeping my tutorials up to date as much as I can.

## **Contact** 🎣

🦊 GitHub: [gmihaila](https://github.com/gmihaila)

🌐 Website: [gmihaila.github.io](https://gmihaila.github.io/)

👔 LinkedIn: [mihailageorge](https://medium.com/r/?url=https%3A%2F%2Fwww.linkedin.com%2Fin%2Fmihailageorge)

📬 Email: [georgemihaila@my.unt.edu.com](mailto:georgemihaila@my.unt.edu.com?subject=GitHub%20Website)