This notebook illustrates how datapipes work together with the transforms in TorchText.

It also covers loading the T5 and XLM-R models and the training process.

Because fetching the data through torchtext.datasets ran into issues, the datapipes
were created manually instead. Even with that approach, some errors still crop up.



In [1]:
import torchdata.datapipes as dp
import torchtext.transforms as T
import spacy
from torchtext.vocab import build_vocab_from_iterator
eng = spacy.load("en_core_web_md") # Load the English model to tokenize English text
de = spacy.load("de_core_news_md") # Load the German model to tokenize German text

In [2]:
FILE_PATH = 'data/deu.txt'
data_pipe = dp.iter.IterableWrapper([FILE_PATH])  # note the file_path is a list
data_pipe = dp.iter.FileOpener(data_pipe, mode='rb')
data_pipe = data_pipe.parse_csv(skip_lines=0,
                                delimiter='\t',
                                as_tuple=True)
# Process of opening the file is different than the usual case
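
As a rough illustration of what the datapipe chain above yields, here is a minimal standard-library sketch that parses tab-separated rows into tuples. The helper `parse_tsv` and the two sample rows are hypothetical and are not part of torchdata; the real `parse_csv` datapipe may differ in details such as decoding and laziness.

```python
import csv
import io


def parse_tsv(lines, skip_lines=0, delimiter="\t"):
    """Yield one tuple per row, loosely mimicking parse_csv(as_tuple=True)."""
    reader = csv.reader(lines, delimiter=delimiter)
    for i, row in enumerate(reader):
        if i < skip_lines:
            continue  # skip header rows if requested
        yield tuple(row)


# Hypothetical two-row sample laid out like a tab-separated translation file
sample = io.StringIO("Go.\tGeh.\tattribution-a\nHi.\tHallo!\tattribution-b\n")
rows = list(parse_tsv(sample))
print(rows[0][:2])  # ('Go.', 'Geh.')
```

Each row comes back as a tuple, so a later `map` step can slice off any trailing columns.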

In [5]:
for sample in data_pipe:
    print(sample)
    break

('Go.', 'Geh.')


In [None]:
list(data_pipe)[799:850]

In [4]:
def removeAttribution(row):
    """
    Function to keep the first two elements in a tuple
    """
    return row[:2]
data_pipe = data_pipe.map(removeAttribution)

In [2]:
type(eng.tokenizer('This is a test case'))

spacy.tokens.doc.Doc

In [4]:
for x in eng.tokenizer('This is a test case'):
    print(x.text)

This
is
a
test
case


In [6]:
def engTokenize(text):
    """
    Tokenize an English text and return a list of tokens.
    Calling the tokenizer directly returns a spaCy Doc, not a list.
    """
    return [token.text for token in eng.tokenizer(text)]

def deTokenize(text):
    """
    Tokenize a German text and return a list of tokens
    """
    return [token.text for token in de.tokenizer(text)]

In [7]:
print(engTokenize("Have a good day!!!"))
print(deTokenize("Haben Sie einen guten Tag!!!"))

['Have', 'a', 'good', 'day', '!', '!', '!']
['Haben', 'Sie', 'einen', 'guten', 'Tag', '!', '!', '!']


In [8]:
def getTokens(data_iter, place):
    """
    Function to yield tokens from an iterator. Since our iterator contains
    tuples of sentences (source and target), the `place` parameter defines which
    index to return the tokens for: `place=0` for source and `place=1` for target.
    """
    for english, german in data_iter:
        if place == 0:
            yield engTokenize(english)
        else:
            yield deTokenize(german)

- sos: for start of sentence

- eos: for end of sentence

- unk: for unknown words. An example of an unknown word is one dropped from the vocabulary because it appears fewer than min_freq=2 times.

- pad: the padding token. Models are mostly trained in batches, and a batch can contain sentences of different lengths, so we pad the shorter sentences with the <pad> token to make all sequences in the batch the same length.

In [9]:
source_vocab = build_vocab_from_iterator(
    getTokens(data_pipe,0),
    min_freq=2,
    specials= ['<pad>', '<sos>', '<eos>', '<unk>'],
    special_first=True
)
source_vocab.set_default_index(source_vocab['<unk>'])

At line 5, we set special_first=True, which means <pad> gets index 0, <sos> index 1, <eos> index 2, and <unk> index 3 in the vocabulary.

At line 7, we set the default index to the index of <unk>. That means any word that is not in the vocabulary is mapped to <unk> instead.
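
To make the roles of min_freq, the special tokens, and the default index concrete, here is a minimal pure-Python sketch of the same idea. The function `build_vocab` and the tiny corpus are hypothetical, and the tie-breaking order (descending frequency, then alphabetical) is an assumption chosen for determinism; this is not the torchtext implementation.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=1, specials=(), default_token=None):
    """Return (index-to-string list, lookup function) for a toy vocabulary."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Specials go first, so they receive the lowest indices
    itos = list(specials)
    # Assumed ordering: most frequent first, ties broken alphabetically
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    default_index = stoi[default_token] if default_token is not None else None

    def lookup(token):
        if token in stoi:
            return stoi[token]
        if default_index is None:
            raise KeyError(token)
        return default_index  # out-of-vocabulary words fall back to the default

    return itos, lookup


corpus = [["I", "like", "tea", "."], ["I", "like", "coffee", "."]]
itos, lookup = build_vocab(
    corpus,
    min_freq=2,
    specials=["<pad>", "<sos>", "<eos>", "<unk>"],
    default_token="<unk>",
)
print(itos)           # ['<pad>', '<sos>', '<eos>', '<unk>', '.', 'I', 'like']
print(lookup("tea"))  # 3, because 'tea' occurs only once and maps to <unk>
```

Here 'tea' and 'coffee' fall below min_freq=2, so both resolve to the index of <unk>.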

In [10]:
target_vocab = build_vocab_from_iterator(
    getTokens(data_pipe,1),
    min_freq=2,
    specials= ['<pad>', '<sos>', '<eos>', '<unk>'],
    special_first=True
)
target_vocab.set_default_index(target_vocab['<unk>'])

In [11]:
print(source_vocab.get_itos()[:9])
print(target_vocab.get_itos()[:9])

['<pad>', '<sos>', '<eos>', '<unk>', '.', 'I', 'Tom', 'to', 'you']
['<pad>', '<sos>', '<eos>', '<unk>', '.', ',', 'Tom', 'Ich', '?']


In [None]:
T.VocabTransform(vocab=source_vocab)(['wipes', '<sos>'])

In [12]:
def getTransform(vocab):
    """
    Create transforms based on given vocabulary. The returned transform is applied to sequence
    of tokens.
    """
    text_transform = T.Sequential(
        ## Convert the tokens to indices based on the given vocabulary
        T.VocabTransform(vocab=vocab),
        ## Add <sos> at the beginning of each sentence.
        # 1 because the index of <sos> in the vocabulary is 1
        T.AddToken(1, begin=True),
        ## Add <eos> at the end of each sentence.
        # 2 because the index of <eos> in the vocabulary is 2
        T.AddToken(2, begin=False)
    )
    return text_transform

In [14]:
temp_list = list(data_pipe)
some_sentence = temp_list[799][0]
print("Some sentence=", end="")
print(some_sentence)


Some sentence=I fainted.


In [19]:
# The getTransform function is not strictly required; declaring the transform
# directly would be sufficient
transformed_sentence = getTransform(source_vocab)(engTokenize(some_sentence))
print("Transformed sentence=", end="")
print(transformed_sentence)  # Transformed sentence=[1, 5, 2897, 4, 2]

Transformed sentence=[1, 5, 2897, 4, 2]


In [20]:
index_to_string = source_vocab.get_itos()
for index in transformed_sentence:
    print(index_to_string[index], end=" ")  # <sos> I fainted . <eos> 

<sos> I fainted . <eos> 

In [21]:
def applyTransform(sequence_pair):
    """
    Apply transforms to sequence of tokens in a sequence pair
    """

    return (
        getTransform(source_vocab)(engTokenize(sequence_pair[0])),
        getTransform(target_vocab)(deTokenize(sequence_pair[1]))
    )
data_pipe = data_pipe.map(applyTransform) ## Apply the function to each element in the iterator
temp_list = list(data_pipe)
print(temp_list[0])

([1, 616, 4, 2], [1, 739, 4, 2])


In [22]:
def sortBucket(bucket):
    """
    Function to sort a given bucket. Here, we want to sort based on the length of
    source and target sequence.
    """
    return sorted(bucket, key=lambda x: (len(x[0]), len(x[1])))

In [23]:
data_pipe = data_pipe.bucketbatch(
    batch_size = 4, batch_num=5,  bucket_num=1,
    use_in_batch_shuffle=False, sort_key=sortBucket
)

We keep batch size = 4.

batch_num is the number of batches to keep in a bucket

bucket_num is the number of buckets to keep in a pool for shuffling

sort_key specifies the function that takes a bucket and sorts it
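
The parameters above can be pictured with a small pure-Python sketch: collect batch_size * batch_num items into a bucket, sort the bucket, then cut it into batches. The function `bucket_batches` and the helper `sort_bucket` are hypothetical names, and the sketch deliberately leaves out the shuffling and the bucket_num pool that the real datapipe supports.

```python
from itertools import islice


def sort_bucket(bucket):
    """Sort pairs by source length, then target length."""
    return sorted(bucket, key=lambda x: (len(x[0]), len(x[1])))


def bucket_batches(items, batch_size, batch_num, sort_key):
    """Yield batches of similar-length items, one bucket at a time."""
    iterator = iter(items)
    bucket_size = batch_size * batch_num
    while True:
        bucket = list(islice(iterator, bucket_size))
        if not bucket:
            return
        bucket = sort_key(bucket)  # sort_key receives the whole bucket
        for start in range(0, len(bucket), batch_size):
            yield bucket[start:start + batch_size]


# Hypothetical (source, target) pairs with lengths 5, 2, 4, 3, 1, 6
pairs = [([1] * n, [1] * n) for n in (5, 2, 4, 3, 1, 6)]
batches = list(bucket_batches(pairs, batch_size=2, batch_num=2, sort_key=sort_bucket))
print([[len(src) for src, _ in batch] for batch in batches])
# [[2, 3], [4, 5], [1, 6]]
```

Sorting happens only inside a bucket, so sequences in one batch have similar lengths and need little padding, while the overall order of the data is still only lightly disturbed.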

In [24]:
def separateSourceTarget(sequence_pairs):
    """
    input of form: `[(X_1,y_1), (X_2,y_2), (X_3,y_3), (X_4,y_4)]`
    output of form: `((X_1,X_2,X_3,X_4), (y_1,y_2,y_3,y_4))`
    """
    sources,targets = zip(*sequence_pairs)
    return sources,targets

## Apply the function to each element in the iterator
data_pipe = data_pipe.map(separateSourceTarget)
print(list(data_pipe)[0])

(([1, 5, 964, 4, 2], [1, 5, 258, 4, 2], [1, 5, 360, 4, 2], [1, 5335, 21, 4, 2]), ([1, 7, 22, 1475, 4, 2], [1, 7, 22, 376, 4, 2], [1, 297, 10, 19, 561, 4, 2], [1, 896, 32, 21, 33, 1133, 24, 2]))


In [25]:
def applyPadding(pair_of_sequences):
    """
    Convert sequences to tensors and apply padding
    """
    return (T.ToTensor(0)(list(pair_of_sequences[0])), T.ToTensor(0)(list(pair_of_sequences[1])))
## `T.ToTensor(0)` returns a transform that converts the sequence to `torch.tensor` and also applies
# padding. Here, `0` is passed to the constructor to specify the index of the `<pad>` token in the
# vocabulary.
data_pipe = data_pipe.map(applyPadding)
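
The padding step can be sketched without tensors. The helper `pad_batch` below is a hypothetical stand-in that only shows the padding logic; it returns nested lists rather than a `torch.tensor`.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


batch = [[1, 616, 4, 2], [1, 5, 964, 4, 2]]
padded = pad_batch(batch)
print(padded)  # [[1, 616, 4, 2, 0], [1, 5, 964, 4, 2]]
```

Because every row now has the same length, the batch can be stored as a single rectangular tensor.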

In [26]:
source_index_to_string = source_vocab.get_itos()
target_index_to_string = target_vocab.get_itos()

def showSomeTransformedSentences(data_pipe):
    """
    Function to show what the sentences look like after applying all transforms.
    Here we print the actual words instead of the corresponding indices
    """
    for sources,targets in data_pipe:
        if sources[0][-1] != 0:
            continue # Just to visualize padding of shorter sentences
        for i in range(4):
            source = ""
            for token in sources[i]:
                source += " " + source_index_to_string[token]
            target = ""
            for token in targets[i]:
                target += " " + target_index_to_string[token]
            print(f"Source: {source}")
            print(f"Target: {target}")
        break

showSomeTransformedSentences(data_pipe)

Source:  <sos> <unk> . <eos> <pad>
Target:  <sos> Fang an . <eos>
Source:  <sos> Do it . <eos>
Target:  <sos> Mache es ! <eos>
Source:  <sos> Do it . <eos>
Target:  <sos> Tue es . <eos>
Source:  <sos> Go on . <eos>
Target:  <sos> Mach weiter . <eos>


In [6]:
xlmr_spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"
spt = T.SentencePieceTokenizer(sp_model_path=xlmr_spm_model_path)
spt(['This is a sentence piece'])

[['▁This', '▁is', '▁a', '▁sentence', '▁piece']]

To get the GPT-2 data, we have to access the OpenAI public server and then extract the files:
https://github.com/openai/gpt-2/blob/master/download_model.py
The public server seems to have been taken offline, so the plan is to explore how the Transformers library can help instead.


In [18]:
from transformers import GPT2Tokenizer
import torch

transformer_gpt2_tokeniser = GPT2Tokenizer.from_pretrained('gpt2')

In [None]:
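# Assumption: the Hugging Face snapshot holds vocab.json (token -> id map) and
# merges.txt (BPE merge rules), which play the roles of the encoder JSON and the BPE vocab file
encoder_path = "/home/kamal/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/vocab.json"
merges_path = "/home/kamal/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/merges.txt"

gpt_bpe = T.GPT2BPETokenizer(encoder_json_path=encoder_path,
                             vocab_bpe_path=merges_path)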

In [9]:
merges_file = "http://download.pytorch.org/models/text/clip_merges.bpe"
encoder_file = "http://download.pytorch.org/models/text/clip_encoder.json"
# https://github.com/mlfoundations/open_clip/blob/main/src/clip/tokenizer.py
clip_token = T.CLIPTokenizer(merges_path=merges_file,
                             encoder_json_path=encoder_file)
clip_token("Ms.Fox and Mr.Wolf are being very kind to the world ")

100%|██████████| 525k/525k [00:01<00:00, 459kB/s] 
100%|██████████| 862k/862k [00:01<00:00, 527kB/s]  


['988',
 '269',
 '3240',
 '537',
 '1982',
 '269',
 '5916',
 '631',
 '1265',
 '1070',
 '3044',
 '531',
 '518',
 '1002']

BERTTokenizer is based on the WordPiece algorithm introduced in the paper: https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf

The backend kernel implementation is taken and modified from https://github.com/LieluoboAi/radish.

See PR https://github.com/pytorch/text/pull/1707 summary for more details.

In [10]:
vocab_bert = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"
bert_tokeniser = T.BERTTokenizer(vocab_path=vocab_bert, do_lower_case=True,
                                 return_tokens=True)
bert_tokeniser("There is always something good happening, for you to cherish")

100%|██████████| 232k/232k [00:00<00:00, 439kB/s] 


['there',
 'is',
 'always',
 'something',
 'good',
 'happening',
 ',',
 'for',
 'you',
 'to',
 'cher',
 '##ish']

In [14]:
bert_tokeniser = T.BERTTokenizer(vocab_path=vocab_bert,
                                 do_lower_case=True, return_tokens=False)
bert_tokeniser("There is always something good happening, for you to cherish")

100%|██████████| 232k/232k [00:00<00:00, 521kB/s] 


['2045',
 '2003',
 '2467',
 '2242',
 '2204',
 '6230',
 '1010',
 '2005',
 '2017',
 '2000',
 '24188',
 '4509']

In [20]:
truncate = T.Truncate(max_seq_len=8)
tokened = truncate(bert_tokeniser("there is atleast one good done for a million bad deeds"))
tokened

['2045', '2003', '2012', '19738', '3367', '2028', '2204', '2589']

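The next two cells raise an error:

ValueError: too many dimensions 'str'

With return_tokens=False, the tokenizer returns the token IDs as strings, and torch.tensor cannot build a tensor from a list of strings. See also:

https://stackoverflow.com/questions/65804689/with-bert-text-classification-valueerror-too-many-dimensions-str-error-occur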

In [21]:
add_1 = T.AddToken(1, begin=True)
add_tokened = add_1(torch.tensor(tokened))
add_tokened

ValueError: too many dimensions 'str'

In [22]:
pad_1 = T.PadTransform(max_length=10, pad_value=1)
pad_1(torch.tensor(tokened))

ValueError: too many dimensions 'str'
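
One likely way around the ValueError is to convert the string IDs to integers before building a tensor or adding tokens. The sketch below stays in plain Python so the logic is easy to check; `add_token` and `pad_to` are hypothetical helpers that only mimic the idea of T.AddToken and T.PadTransform, not their actual implementations.

```python
# Token IDs as returned (as strings) by the truncated tokenizer output above
tokened = ['2045', '2003', '2012', '19738', '3367', '2028', '2204', '2589']

# Convert to integers first; a list of ints can then be passed to torch.tensor
token_ids = [int(t) for t in tokened]


def add_token(ids, token, begin=True):
    """Prepend or append a single token ID."""
    return [token] + ids if begin else ids + [token]


def pad_to(ids, max_length, pad_value):
    """Right-pad the ID list up to max_length (no truncation)."""
    return ids + [pad_value] * max(0, max_length - len(ids))


with_bos = add_token(token_ids, 1, begin=True)
padded = pad_to(with_bos, max_length=10, pad_value=1)
print(padded)
# [1, 2045, 2003, 2012, 19738, 3367, 2028, 2204, 2589, 1]
```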

In [2]:
from torchtext.datasets import (
    Multi30k,
    DBpedia,
    CC100,
    SST2,
    IMDB,
    AG_NEWS,
    CoLA,
    SQuAD2,
    PennTreebank,
    CNNDM,
)

In [3]:
ag_news = AG_NEWS(split='train')
list(ag_news)[0]

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In [None]:
cnn_dm = CNNDM(split='val')
list(cnn_dm)[0]

### Working on Further Transformation

- Build a text pre-processing pipeline for the XLM-R model

- Read the SST-2 dataset and transform it using text and label transformations

- Instantiate a classification model using the pre-trained XLM-R encoder

In [42]:
batch_size = 16
train_pipe = SST2(split='train')
test_pipe = SST2(split='dev')
sst2_path_train = '/media/kamal/DATA/torch/text/datasets/SST2/SST-2/train.tsv'
sst2_path_test = '/media/kamal/DATA/torch/text/datasets/SST2/SST-2/test.tsv'
sst2_path_dev = '/media/kamal/DATA/torch/text/datasets/SST2/SST-2/dev.tsv'
# dev_pipe = SST2(split='dev')

In [None]:
list(test_pipe)  # lists all the data
list(train_pipe)  # lists all the data

In [64]:
for ind, x in enumerate(train_pipe):
    print(x)
    if ind > 5:
        break

('hide new secretions from the parental units', 0)
('contains no wit , only labored gags', 0)
('that loves its characters and communicates something rather beautiful about human nature', 1)
('remains utterly satisfied to remain the same throughout', 0)
('on the worst revenge-of-the-nerds clichés the filmmakers could dredge up', 0)
("that 's far too tragic to merit such superficial treatment", 0)
('demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop .', 1)


In [65]:
from torch.hub import load_state_dict_from_url

In [66]:
padding_idx = 1
bos_idx = 0  # begin of sentence
eos_idx = 2
max_seq_len = 256
xlmr_vocab_path = r"https://download.pytorch.org/models/text/xlmr.vocab.pt"
xlmr_spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"

In [67]:
from torch.utils.data import DataLoader
import torchtext.transforms as T

text_transform = T.Sequential(
    T.SentencePieceTokenizer(xlmr_spm_model_path),
    T.VocabTransform(load_state_dict_from_url(xlmr_vocab_path)),
    T.Truncate(max_seq_len - 2),
    T.AddToken(token=bos_idx, begin=True),
    T.AddToken(token=eos_idx, begin=False)
)

In [69]:
text_transform('This seems to be a correct example', '5')  # providing the additional variable raises the error

TypeError: Sequential.forward() takes 2 positional arguments but 3 were given

In [74]:
def applyTransform(x):
    return [text_transform(x[0]), x[1]]

In [75]:
t_pipe = train_pipe.map(applyTransform)
t_pipe = t_pipe.batch(batch_size)
t_pipe = t_pipe.rows2columnar(['token_ids','target'])
t_loader = DataLoader(t_pipe, batch_size=None)
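
What batching followed by rows2columnar produces can be sketched in plain Python. The helpers `batch_rows` and `rows_to_columnar` are hypothetical and only mirror the idea: a batch of (token_ids, target) rows becomes one dictionary of columns. The sample rows are made up.

```python
def batch_rows(rows, batch_size):
    """Group consecutive rows into lists of at most batch_size items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # keep the final, smaller batch


def rows_to_columnar(batch, column_names):
    """Turn a list of rows into a dict mapping column name -> list of values."""
    return {name: [row[i] for row in batch] for i, name in enumerate(column_names)}


# Made-up (token_ids, target) rows
rows = [([0, 10, 2], 0), ([0, 11, 12, 2], 1), ([0, 13, 2], 0)]
batches = [rows_to_columnar(b, ["token_ids", "target"]) for b in batch_rows(rows, 2)]
print(batches[0])
# {'token_ids': [[0, 10, 2], [0, 11, 12, 2]], 'target': [0, 1]}
```

This columnar layout is why a batch can later be indexed with keys such as 'token_ids' and 'target'.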

In [79]:
next(iter(t_loader))['target']  # [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1]



[0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1]

In [80]:
d_pipe = test_pipe.map(applyTransform)
d_pipe = d_pipe.batch(batch_size)
d_pipe = d_pipe.rows2columnar(['token_ids','target'])
d_loader = DataLoader(d_pipe, batch_size=None)

In [81]:
next(iter(d_loader))['target']  # [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]



[1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1]

In [None]:
from torchtext.models import RobertaClassificationHead, XLMR_BASE_ENCODER
import torch

num_classes = 2
input_dim = 768

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 

classifier_head = RobertaClassificationHead(num_classes=num_classes,
                                            input_dim=input_dim)
model = XLMR_BASE_ENCODER.get_model(head=classifier_head,)
model.to(device)

In [119]:
print(classifier_head)

RobertaClassificationHead(
  (dense): Linear(in_features=768, out_features=768, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (out_proj): Linear(in_features=768, out_features=2, bias=True)
  (activation_fn): ReLU()
)


In [121]:
print(XLMR_BASE_ENCODER)

RobertaBundle(_encoder_conf=RobertaEncoderConf(vocab_size=250002, embedding_dim=768, ffn_dimension=3072, padding_idx=1, max_seq_len=514, num_attention_heads=12, num_encoder_layers=12, dropout=0.1, scaling=None, normalize_before=False), _path='https://download.pytorch.org/models/text/xlmr.base.encoder.pt', _head=None, transform=<function <lambda> at 0x7f2e200e4f40>)


RobertaModel(
  (encoder): RobertaEncoder(
    (transformer): TransformerEncoder(
      (token_embedding): Embedding(250002, 768, padding_idx=1)
      (layers): TransformerEncoder(
        (layers): ModuleList(
          (0-11): 12 x TransformerEncoderLayer(
            (self_attn): MultiheadAttention(
              (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
            )
            (linear1): Linear(in_features=768, out_features=3072, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
            (linear2): Linear(in_features=3072, out_features=768, bias=True)
            (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (positional_embedding): PositionalEmbedding(
        (embedding): Embedding(514, 768, padding_idx=1)
      )
      (embedding_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (head): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True)
    (activation_fn): ReLU()
  )
)

In [87]:
import torchtext.functional as F
from torch.optim import AdamW
from torch import nn 

learning_rate = 1e-5
optim = AdamW(model.parameters(), lr=learning_rate)
criteria = nn.CrossEntropyLoss()


def train_step(input, target):
    output = model(input)
    loss = criteria(output, target)
    optim.zero_grad()
    loss.backward()
    optim.step()


def eval_step(input, target):
    output = model(input)
    loss = criteria(output, target).item()
    return float(loss), (output.argmax(1) == target).type(torch.float).sum().item()


def evaluate():
    model.eval()
    total_loss = 0
    correct_predictions = 0
    total_predictions = 0
    counter = 0
    with torch.no_grad():
        for batch in d_loader:
            input = F.to_tensor(batch["token_ids"], padding_value=padding_idx).to(device)
            target = torch.tensor(batch["target"]).to(device)
            loss, predictions = eval_step(input, target)
            total_loss += loss
            correct_predictions += predictions
            total_predictions += len(target)
            counter += 1

    return total_loss / counter, correct_predictions / total_predictions

In [None]:
num_epochs = 1

for e in range(num_epochs):
    model.train()  # evaluate() switches to eval mode, so re-enable training mode each epoch
    for batch in t_loader:
        # print(batch)
        input = F.to_tensor(batch["token_ids"], padding_value=padding_idx).to(device)
        target = torch.tensor(batch["target"]).to(device)
        train_step(input, target)

    loss, accuracy = evaluate()
    print("Epoch = [{}], loss = [{}], accuracy = [{}]".format(e, loss, accuracy))

In [10]:
from torch.utils.data import Dataset, DataLoader
import tarfile

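class TgzDataset(Dataset):
    """Map-style dataset that reads each regular file in a .tgz archive as raw bytes."""
    def __init__(self, tgz_file):
        self.tgz_file = tgz_file
        self.samples = []
        with tarfile.open(tgz_file, "r:gz") as tar:
            # getmembers() reads the full archive index; keep regular files only
            for member in tar.getmembers():
                if member.isfile():
                    self.samples.append(member.name)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # Re-open the archive on each access so the dataset holds no open file handle
        with tarfile.open(self.tgz_file, "r:gz") as tar:
            with tar.extractfile(self.samples[idx]) as member_file:
                sample = member_file.read()
        return sample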
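
To try the member-listing logic without the large CNN archive, the following standard-library sketch builds a tiny .tgz in memory and reads it back. The helpers `make_tgz_bytes` and `read_tgz_files` and the file names are hypothetical and exist only for this illustration.

```python
import io
import tarfile


def make_tgz_bytes(files):
    """Build a gzip-compressed tar archive in memory from {name: bytes}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_tgz_files(tgz_bytes):
    """Return {member name: content} for every regular file in the archive."""
    contents = {}
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():  # skip directories and links
                contents[member.name] = tar.extractfile(member).read()
    return contents


archive = make_tgz_bytes({"cnn/stories/a.story": b"first", "cnn/stories/b.story": b"second"})
stories = read_tgz_files(archive)
print(sorted(stories))  # ['cnn/stories/a.story', 'cnn/stories/b.story']
```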

In [8]:
cnn_path = '/home/kamal/.cache/torch/text/datasets/cnn_stories.tgz'

cnn_tgz = TgzDataset(tgz_file=cnn_path)

In [15]:
import tarfile

# Extract the contents of the .tgz file into the current directory;
# the context manager closes the archive afterwards
with tarfile.open(cnn_path) as tar:
    tar.extractall()

In [21]:
import glob
cnn_stories = glob.glob("cnn/stories/*.stories")
cnn_dataPipe = dp.iter.IterableWrapper(cnn_stories)

Not much progress was made after this step.

In [52]:
from torchtext.models import T5Transform

padding_idx = 0
eos_idx = 1
max_seq_len = 512
t5_sp_model_path = "https://download.pytorch.org/models/text/t5_tokenizer_base.model"

transform = T5Transform(
    sp_model_path=t5_sp_model_path,
    max_seq_len=max_seq_len,
    eos_idx=eos_idx,
    padding_idx=padding_idx,
)

100%|██████████| 792k/792k [00:01<00:00, 522kB/s] 


In [None]:
from torchtext.models import T5_BASE_GENERATION


t5_base = T5_BASE_GENERATION
transform = t5_base.transform()
model = t5_base.get_model()
model.eval()


```
T5Model(
  (token_embeddings): Embedding(32128, 768, padding_idx=0)
  (encoder): T5Encoder(
    (token_embeddings): Embedding(32128, 768, padding_idx=0)
    (layers): ModuleList(
      (0): T5Layer(
        (self_attn): T5MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=False)
          (relative_attention_bias): Embedding(32, 12)
        )
        (linear1): Linear(in_features=768, out_features=3072, bias=False)
        (linear2): Linear(in_features=3072, out_features=768, bias=False)
        (norm1): T5LayerNorm()
        (norm2): T5LayerNorm()
        (dropout1): Dropout(p=0.0, inplace=False)
        (dropout2): Dropout(p=0.0, inplace=False)
        (dropout3): Dropout(p=0.0, inplace=False)
      )
      (1-11): 11 x T5Layer(
        (self_attn): T5MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=False)
        )
        (linear1): Linear(in_features=768, out_features=3072, bias=False)
        (linear2): Linear(in_features=3072, out_features=768, bias=False)
        (norm1): T5LayerNorm()
        (norm2): T5LayerNorm()
        (dropout1): Dropout(p=0.0, inplace=False)
        (dropout2): Dropout(p=0.0, inplace=False)
        (dropout3): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): T5LayerNorm()
    (dropout1): Dropout(p=0.0, inplace=False)
    (dropout2): Dropout(p=0.0, inplace=False)
  )
  (decoder): T5Decoder(
    (layers): ModuleList(
      (0): T5Layer(
        (self_attn): T5MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=False)
          (relative_attention_bias): Embedding(32, 12)
        )
        (cross_attn): T5MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=False)
        )
        (norm3): T5LayerNorm()
        (dropout4): Dropout(p=0.0, inplace=False)
        (linear1): Linear(in_features=768, out_features=3072, bias=False)
        (linear2): Linear(in_features=3072, out_features=768, bias=False)
        (norm1): T5LayerNorm()
        (norm2): T5LayerNorm()
        (dropout1): Dropout(p=0.0, inplace=False)
        (dropout2): Dropout(p=0.0, inplace=False)
        (dropout3): Dropout(p=0.0, inplace=False)
      )
      (1-11): 11 x T5Layer(
        (self_attn): T5MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=False)
        )
        (cross_attn): T5MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=False)
        )
        (norm3): T5LayerNorm()
        (dropout4): Dropout(p=0.0, inplace=False)
        (linear1): Linear(in_features=768, out_features=3072, bias=False)
        (linear2): Linear(in_features=3072, out_features=768, bias=False)
        (norm1): T5LayerNorm()
        (norm2): T5LayerNorm()
        (dropout1): Dropout(p=0.0, inplace=False)
        (dropout2): Dropout(p=0.0, inplace=False)
        (dropout3): Dropout(p=0.0, inplace=False)
      )
    )
    (norm): T5LayerNorm()
    (dropout1): Dropout(p=0.0, inplace=False)
    (dropout2): Dropout(p=0.0, inplace=False)
  )
  (lm_head): Linear(in_features=768, out_features=32128, bias=False)
)
```

In [122]:
from torchtext.datasets import CNNDM, WikiText2, WikiText103
from functools import partial

cnndm_batch_size = 5
# cnndm_datapipe = CNNDM(split="val",)
# cnndm_datapipe = WikiText2(split="test",)
cnndm_datapipe = WikiText103(split="test",)  # takes a lot of time, no output
cnndm_datapipe = SQuAD2(split="dev",)  # takes a lot of time, no output
task = "summarize"

In [126]:
list(cnndm_datapipe)[:5]

[('The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.',
  'In what country is Normandy located?',
  ['France', 'France', 'France', 'France'],
  [159, 159, 159, 159]),
 ('The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to N

In [125]:
len(list(cnndm_datapipe))

11873

In [102]:
def apply_prefix(task, x):
    return f"{task}: " + x[0], x[1]


cnndm_datapipe = cnndm_datapipe.map(partial(apply_prefix, task))

cnndm_datapipe = cnndm_datapipe.batch(cnndm_batch_size)

cnndm_datapipe = cnndm_datapipe.rows2columnar(["article", "abstract"])

cnndm_dataloader = DataLoader(cnndm_datapipe, shuffle=True, batch_size=None)
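
The effect of `partial(apply_prefix, task)` can be checked on a single made-up sample. The sketch below repeats `apply_prefix` so it runs on its own; the sample pair is hypothetical.

```python
from functools import partial


def apply_prefix(task, x):
    """Prepend the task name to the first element and keep the second as is."""
    return f"{task}: " + x[0], x[1]


# partial fixes the first argument, leaving a one-argument function suitable for map
add_summarize = partial(apply_prefix, "summarize")

sample = ("Some article text.", "Short abstract.")
prefixed = add_summarize(sample)
print(prefixed)  # ('summarize: Some article text.', 'Short abstract.')
```

Only the first two elements of each sample are used, so any extra fields in the input tuple are dropped by this step.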