# GPT from scratch

We build a GPT model from scratch. We use the AG_NEWS dataset that is built-in in torchtext and use some of the tokenization tools from torchtext. A huggingface pipeline might have taken care of all of the pre-training steps but we wanted to get a more detailed understanding of the entire pipeline. 

We train a small-ish model for 5 epochs and test it in two ways. First, we look at its predictions on the first test batch to see if the predictions are plausible. Secondly, we investigate the test loss on a random network vs. our trained network. 

Our model is not perfect but we think it is sufficiently different from random chance that we can say it has learned something and our pipeline is functional. Our goal was not to reproduce or beat the state of the art but just to built a working pipeline so we stop there. 

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import torch.nn as nn
from tqdm.notebook import tqdm
print(torch.__version__)

1.10.0+cu102


# Data

In [2]:
import torchtext

In [3]:
dataset_train, dataset_test = torchtext.datasets.AG_NEWS()
print(len(dataset_train))
print(len(dataset_test))

120000
7600


In [4]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter = torchtext.datasets.AG_NEWS(split='train')

def yield_tokens(data_iter):
    for _, text in data_iter:
        text = "[SOS] " + text + " [EOS]"
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), min_freq=2,
                                  specials=["[PAD]", "[UNK]", "[SOS]", "[EOS]"])
vocab.set_default_index(vocab["[UNK]"])

In [5]:
print(type(train_iter))

<class 'torchtext.data.datasets_utils._RawTextIterableDataset'>


In [6]:
#vocab(['here', 'is', 'an', 'example'])
vocab(["the"])
full_vocab = vocab.vocab.get_stoi().keys()
print(len(full_vocab))

53132


In [7]:
import random
import math
import copy

# copied from tutorial, added padding
def text_pipeline(x, max_len):
    x = "[SOS] " + x + " [EOS]"
    vocab_list = np.array(vocab(tokenizer(x)))[:max_len]
    k = len(vocab_list)
    missing_len = max_len - k
    missing_list = missing_len * vocab(["[PAD]"])
    # true labels
    labels = np.concatenate([copy.deepcopy(vocab_list), missing_list])
    labels = np.concatenate([labels[1:], vocab(["[PAD]"])])
    # save vector indicating paddings
    paddings = torch.cat([torch.zeros((k,)), torch.ones((len(missing_list),))])
    paddings = paddings.bool()
    
    return(np.concatenate([vocab_list, missing_list]), labels, paddings)

print(text_pipeline('He married Mabel Scott in 1890, but they soon separated. Unable to get an English divorce, in 1900, he became the first celebrity to get one in Nevada, and remarried there, but the divorce was invalid in England. In June 1901, he was arrested for bigamy, and was convicted before the House of Lords, the last time a peer was convicted by the Lords.', 100))

(array([    8,    54,  6619, 47693,  2645,    12,     1,     6,    50,
          72,   749, 10620,     4,  4310,     9,   227,    35,  1889,
       13154,     6,    12,     1,     6,    54,  1363,     5,    52,
        7959,     9,   227,    66,    12,  6424,     6,    13,     1,
         234,     6,    50,     5, 13154,    40, 15493,    12,   320,
           4,    12,  1921, 30443,     6,    54,    40,   799,    16,
           1,     6,    13,    40,  2492,   172,     5,   441,    11,
        9521,     6,     5,    74,   106,    10,  8111,    40,  2492,
          29,     5,  9521,     4,     7,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0]), array([   54,  6619, 47693,  2645,    12,     1,     6,    50,    72,
         749, 10620,     4,  4310,     9,   227,    35,  1889, 13154,
           6,    12,     1,     6,    54,  1363,     5,    52,  7959,
   

In [8]:
# copied from tutorial, removed offsets
from torch.utils.data import DataLoader
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)

def collate_batch(batch):
    label_list, text_list, padding_list = [], [], []
    for _, text in batch:
        input_, label_, padding_ = text_pipeline(text, max_len=100)
        text_list.append(torch.tensor(input_, dtype=torch.int64))
        label_list.append(label_)
        padding_list.append(torch.tensor(padding_))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.cat(text_list).view(len(label_list), -1)
    padding_list = torch.cat(padding_list).view(len(label_list), -1)
    return text_list.to(DEVICE), label_list.to(DEVICE), padding_list

train_iter = dataset_train
BATCH_SIZE = 32
dataloader = DataLoader(list(train_iter), batch_size=BATCH_SIZE, collate_fn=collate_batch)

cuda


# GPT model

In [9]:
### build classifier transformer
import torch.nn.functional as F

class MyGPT(nn.Module):
    
    def __init__(self, embedding_dim, heads, seq_length, vocab_size, depth=5, num_classes=2, device="cuda"):
        super().__init__()

        self.seq_length = seq_length
        self.vocab_size = vocab_size
        self.token_emb = nn.Embedding(vocab_size, embedding_dim)
        self.pos_emb = nn.Embedding(seq_length, embedding_dim)
        self.num_heads = heads
        indices = torch.triu_indices(seq_length, seq_length, offset=1)
        self.attn_mask = torch.zeros((seq_length, seq_length))
        self.attn_mask[indices[0], indices[1]] = float("-inf")
        #self.attn_mask[indices[0], indices[1]] = False
        self.attn_mask = self.attn_mask.to(device)
        self.device = device

        # sequence of transformers
        self.tblocks = []
        for i in range(depth):
            self.tblocks.append(nn.TransformerDecoderLayer(d_model=embedding_dim,
                                                            nhead=self.num_heads, 
                                                            batch_first=True, 
                                                            norm_first=True,
                                                            dropout=0.1).to(device))
        #self.tblocks = nn.Sequential(*tblocks)
        
        # final linear layer
        self.last_linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, x, paddings, pos_ids=None):
        # generate positional embeddings
        if pos_ids is None:
            pos_ids = torch.arange(0, self.seq_length).unsqueeze(0).to(self.device)
        # generate full embeddings
        tokens = self.token_emb(x) + self.pos_emb(pos_ids)
        batch_size, token_size, embed_size = tokens.size()

        # apply all transformer blocks
        x = tokens.to(self.device)
        for block in self.tblocks:
            x = block(x, memory=x, tgt_mask=self.attn_mask, tgt_key_padding_mask=paddings)

        # predict next word with last_linear
        out = self.last_linear(x)
        # change order because we have batch_first=True
        out = out.transpose(2, 1)
        return out

# Training

In [10]:
# gpt-mini: n_layer=6, n_head=6, n_embd=192
# gpt-micro: n_layer=4, n_head=4, n_embd=128
# gpt-nano: n_layer=3, n_head=3, n_embd=48
# gpt-mini2: n_layers=6, n_head=16, n_embd=128
# gpt-mini3: n_layers=10, n_head=32, n_embd=256


EMBED_DIM = 256
NUM_HEADS = 32
NUM_LAYERS = 10
SEQ_LENGTH = 100
VOCAB_SIZE = len(full_vocab)

my_gpt = MyGPT(embedding_dim=EMBED_DIM, 
               heads=NUM_HEADS, 
               seq_length=SEQ_LENGTH,
               vocab_size=VOCAB_SIZE,
               depth=NUM_LAYERS).to(DEVICE)

optimizer = torch.optim.Adam(my_gpt.parameters(), lr=3e-4, betas=(0.9, 0.95))
criterion = nn.CrossEntropyLoss()

In [25]:
# training
num_epochs = 50

for epoch in range(num_epochs):
    print("epoch: ", epoch)
    training_loss = 0
    for i, (x, y, paddings) in tqdm(enumerate(dataloader), total=len(dataset_train)//BATCH_SIZE):

        #if i > 50: break
        optimizer.zero_grad()
        x,y, paddings = x.to(DEVICE), y.to(DEVICE), paddings.to(DEVICE)

        out = my_gpt(x, paddings)
        
        loss = criterion(out, y)
        training_loss += loss
        loss.backward()
        optimizer.step()
    
    print("training_loss: ", training_loss.item())

epoch:  0


  0%|          | 0/3750 [00:00<?, ?it/s]

  padding_list.append(torch.tensor(padding_))
  label_list = torch.tensor(label_list, dtype=torch.int64)


training_loss:  11376.841796875
epoch:  1


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  10240.888671875
epoch:  2


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  9901.4970703125
epoch:  3


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  9647.5439453125
epoch:  4


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  9417.919921875
epoch:  5


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  9207.0107421875
epoch:  6


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  9049.6396484375
epoch:  7


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8976.3525390625
epoch:  8


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8935.974609375
epoch:  9


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8899.72265625
epoch:  10


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8866.0166015625
epoch:  11


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8833.0810546875
epoch:  12


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8798.845703125
epoch:  13


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8765.5078125
epoch:  14


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8732.435546875
epoch:  15


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8700.685546875
epoch:  16


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8669.4267578125
epoch:  17


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8640.04296875
epoch:  18


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8611.9853515625
epoch:  19


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8584.830078125
epoch:  20


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8558.9833984375
epoch:  21


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8533.1279296875
epoch:  22


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8509.4775390625
epoch:  23


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8485.9033203125
epoch:  24


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8464.76953125
epoch:  25


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8444.51953125
epoch:  26


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8423.86328125
epoch:  27


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8404.6357421875
epoch:  28


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8386.591796875
epoch:  29


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8369.9072265625
epoch:  30


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8353.375
epoch:  31


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8336.498046875
epoch:  32


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8320.3720703125
epoch:  33


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8305.62109375
epoch:  34


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8291.24609375
epoch:  35


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8277.046875
epoch:  36


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8263.447265625
epoch:  37


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8251.0830078125
epoch:  38


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8238.6923828125
epoch:  39


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8226.5009765625
epoch:  40


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8214.6455078125
epoch:  41


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8204.2275390625
epoch:  42


  0%|          | 0/3750 [00:00<?, ?it/s]

training_loss:  8194.5322265625
epoch:  43


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8183.908203125
epoch:  44


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8174.2109375
epoch:  45


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8165.64306640625
epoch:  46


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8154.7509765625
epoch:  47


  0%|          | 0/3750 [00:02<?, ?it/s]

training_loss:  8147.16064453125
epoch:  48


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8137.9892578125
epoch:  49


  0%|          | 0/3750 [00:01<?, ?it/s]

training_loss:  8130.02978515625


In [11]:
# save model weights
PATH = "my_gpt_mini3_it50.pth" #mini3_it50 has training loss of 8130
#torch.save(my_gpt.state_dict(), PATH)

# Test our model

In [12]:
BATCH_SIZE=16
testloader = DataLoader(list(dataset_test), batch_size=BATCH_SIZE, collate_fn=collate_batch)

In [13]:
### test the model on a random init version of the network
random_network = MyGPT(embedding_dim=EMBED_DIM, 
               heads=NUM_HEADS, 
               seq_length=SEQ_LENGTH,
               vocab_size=VOCAB_SIZE,
               depth=NUM_LAYERS).to(DEVICE)

random_network.eval()

random_test_loss = 0
for i, (x, y, paddings) in tqdm(enumerate(testloader), total=len(dataset_test)//BATCH_SIZE):

        #if i > 50: break
        x,y, paddings = x.to(DEVICE), y.to(DEVICE), paddings.to(DEVICE)

        out = random_network(x, paddings)
        
        loss = criterion(out, y)
        random_test_loss += loss.item()
        
print("test loss from random network: {:.03f}".format(random_test_loss))

  0%|          | 0/475 [00:00<?, ?it/s]

  padding_list.append(torch.tensor(padding_))
  label_list = torch.tensor(label_list, dtype=torch.int64)
  label_list = torch.tensor(label_list, dtype=torch.int64)


test loss from random network: 5779.365


In [14]:
### load model

trained_model = MyGPT(embedding_dim=EMBED_DIM, 
               heads=NUM_HEADS, 
               seq_length=SEQ_LENGTH,
               vocab_size=VOCAB_SIZE,
               depth=NUM_LAYERS).to(DEVICE)

trained_model.load_state_dict(torch.load(PATH))
trained_model.eval()

MyGPT(
  (token_emb): Embedding(53132, 256)
  (pos_emb): Embedding(100, 256)
  (last_linear): Linear(in_features=256, out_features=53132, bias=True)
)

In [15]:
### test model on our trained model 
trained_model.eval()

test_loss = 0
for i, (x, y, paddings) in tqdm(enumerate(testloader), total=len(dataset_test)//BATCH_SIZE):

        #if i > 50: break
        optimizer.zero_grad()
        x,y, paddings = x.to(DEVICE), y.to(DEVICE), paddings.to(DEVICE)

        out = trained_model(x, paddings)
        
        loss = criterion(out, y)
        test_loss += loss.item()
        
print("test loss from trained network: {:.03f}".format(test_loss))

  0%|          | 0/475 [00:00<?, ?it/s]

  padding_list.append(torch.tensor(padding_))
  label_list = torch.tensor(label_list, dtype=torch.int64)


test loss from trained network: 1472.208


# Generate words with the model

In [16]:
def generate(model, context, paddings, ntok=20, topk=5):
    generation_start = torch.where(context == 0)[1][0] # -1
    for i in range(ntok):
        # predict logits for next word
        out = model(context, paddings)
        # select the right dimension 
        logits = out[:,:,generation_start-1]
        # remove all but the top k indices
        indices_to_remove = logits < min(torch.topk(logits, topk)[0][0])
        logits[indices_to_remove] = float("-inf")
        # draw one sample from the remaining k indices weighted by their probabilities
        next_tok = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).squeeze(1)
        # add context to previous context
        context[:, generation_start] = next_tok.unsqueeze(-1)
        paddings[:, generation_start] = False
        generation_start += 1
        
    return(context)

In [17]:
test_sentence = "He married Mabel Scott in 1890, but they soon separated. \
Unable to get an English divorce, in 1900, he became the"
""" the first celebrity to get one in Nevada, \
and remarried there, but the divorce was invalid in England. In June 1901, he was arrested for bigamy, \
and was convicted before the House of Lords, the last time a peer was convicted by the Lords."
"""
test_context, _, test_padding = text_pipeline(test_sentence, max_len=100)
test_context, test_padding = torch.tensor(test_context).to(DEVICE).view(1, -1), torch.tensor(test_padding).bool().to(DEVICE).view(1, -1)
test_context[test_context == 7] = 0
test_padding[test_context == 7] = True
out = generate(my_gpt, test_context, test_padding)

  test_context, test_padding = torch.tensor(test_context).to(DEVICE).view(1, -1), torch.tensor(test_padding).bool().to(DEVICE).view(1, -1)


In [18]:
full_vocab_itos = vocab.vocab.get_itos()
def detokenize(list_of_idxs):
    return([full_vocab_itos[i] for i in list_of_idxs])

In [19]:
list_of_strings_out = detokenize(out[0].cpu().numpy())
s=""
for x in list_of_strings_out:
    s += " "+ x
print(s)

 [sos] he married mabel scott in [UNK] , but they soon separated . unable to get an english divorce , in [UNK] , he became the eight-game progress ce-ata simao ferentz hours dailies hold koch convinces depression objected line\cinema colgate airshow mohawk colombia bizpile flees mathis [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]


In [20]:
"company oil oil new ? . oil iraq its inc oil s reuters iraq reuters t company oil t its "

'company oil oil new ? . oil iraq its inc oil s reuters iraq reuters t company oil t its '

In [21]:
def complete_sentence(model, text_input, padding, ntok=20, begin_padding=20):
    # overwrite the padding such that the model starts to predict after the first X words
    print("text_input: ", " ".join(detokenize(text_input)), "\n")
    print("text_input cut off: ", " ".join(detokenize(text_input[:begin_padding])), "\n")
    padding[begin_padding:] = True
    text_input[begin_padding:] = 0
    test_context, test_padding = torch.tensor(text_input).to(DEVICE).view(1, -1), torch.tensor(padding).bool().to(DEVICE).view(1, -1)
    
    out = generate(model, test_context, test_padding)
    
    list_of_strings_out = detokenize(out[0].cpu().numpy())
    s=" ".join(list_of_strings_out)
    print("s: ", s)

In [22]:
### load first couple of test strings

for i, (x, y, paddings) in enumerate(testloader):
    
    for j, x_ in enumerate(x):
        print("j: ", j)
        complete_sentence(trained_model, x_, paddings[j], ntok=20)
    
    break

j:  0
text_input:  [sos] fears for t n pension after talks unions representing workers at turner newall say they are ' disappointed ' after talks with stricken parent firm federal mogul . [eos] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] 

text_input cut off:  [sos] fears for t n pension after talks unions representing workers at turner newall say they are ' disappointed ' 



  padding_list.append(torch.tensor(padding_))
  label_list = torch.tensor(label_list, dtype=torch.int64)
  test_context, test_padding = torch.tensor(text_input).to(DEVICE).view(1, -1), torch.tensor(padding).bool().to(DEVICE).view(1, -1)


s:  [sos] fears for t n pension after talks unions representing workers at turner newall say they are ' disappointed ' re to continue to make /b&gt it in the two countries which would /b&gt it to have to have to [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
j:  1
text_input:  [sos] the race is on second private team sets launch date for human spaceflight ( space . com ) space . com - toronto , canada -- a [UNK] of rocketeers competing for the #36 10 million ansari x prize , a contest [UNK] funded suborbital space flight , has officially announced the [UNK] date for its manned rocket . [eos] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

s:  [sos] [UNK] [UNK] , [UNK] , key distribution , and bloom filters [UNK] and bloom filters have a lot of /b&gt it /b&gt t say they would end an american league in fact that will try to make more countries [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
j:  8
text_input:  [sos] e-mail scam targets police chief [UNK] police warns about phishing after its fraud squad chief was targeted . [eos] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD

s:  [sos] [UNK] unite dolphin groups dolphin groups , or pods , rely on [UNK] to keep them from collapsing , which he to make its may have reached an american league in cash in january , which have to which [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
j:  15
text_input:  [sos] teenage t . rex ' s monster growth tyrannosaurus rex achieved its massive size due to an enormous growth spurt during its adolescent years . [eos] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P