<a href="https://colab.research.google.com/github/karankarn/GPT2/blob/main/GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
path = "/content/drive/MyDrive/GPT2/input.txt"

# read the whole of input.txt into a single string
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
# check the type
type(text)



str

In [4]:
# len of the dataset
print("length of dataset in characters :", len(text))

length of dataset in characters : 1115394


In [5]:
# all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# print chars and vocab size
print(" ".join(chars))
print(vocab_size)


   ! $ & ' , - . 3 : ; ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z
65


In [6]:
"""
Tokenizer
One token is one character in this model.
We need a strategy to tokenize the text, i.e. to encode
individual characters into integers
and to decode integers back into characters.
"""

# create a mapping from characters to integers
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s] # encoder : take a string, output a list of integers
decode = lambda l: "".join(itos[i] for i in l) # decoder : take a list of integers, output a string


In [7]:
encode("Karan")

[23, 39, 56, 39, 52]

In [8]:
decode([23, 39, 56, 39, 52])

'Karan'
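
The same idea can be sketched on a tiny made-up string. The names `sample_text`, `sample_encode`, and `sample_decode` are illustrative and not part of the notebook. The sketch shows that decoding an encoding gives back the original string, and that a character missing from the vocabulary has no integer id, so looking it up raises a `KeyError`.

```python
# Toy corpus, used only for illustration (not the notebook's input.txt)
sample_text = "hello world"

# Build the vocabulary the same way as above: sorted unique characters
sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}

def sample_encode(s):
    """Map each character of s to its integer id."""
    return [sample_stoi[c] for c in s]

def sample_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(sample_itos[i] for i in ids)

ids = sample_encode("hello")
round_trip = sample_decode(ids)
print(ids, round_trip)

# A character that never appeared in the corpus has no id
try:
    sample_encode("z")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print("unknown character encodable:", unknown_ok)
```

With a character-level vocabulary built from the data itself, every string made only of characters seen in the corpus can be encoded, and nothing else can.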

In [9]:
import torch
# encode text and wrap it in a tensor
data = torch.tensor(encode(text), dtype=torch.long)
# check the shape and data type
print(data.shape, data.dtype)

torch.Size([1115394]) torch.int64


In [10]:
# print the first 50 elements in data,
# which is an encoded representation of the first 50 characters of text
print(data[:50])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56])


All the characters in the text are now represented by integers. The integers are laid out in one long sequence and wrapped in a one-dimensional tensor, which holds the encoding of the entire dataset.
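
To go back from the tensor to text, the integer ids have to be plain Python integers before they are looked up in the decoding dictionary. Here is a minimal sketch on a made-up six-character string; the names `toy_text`, `toy_stoi`, `toy_itos`, and `toy_data` are illustrative stand-ins for the notebook's variables.

```python
import torch

toy_text = "abcabc"  # illustrative stand-in for the full dataset
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

# Encode the text and wrap it in a 1-D tensor of 64-bit integers
toy_data = torch.tensor([toy_stoi[c] for c in toy_text], dtype=torch.long)

# .tolist() converts tensor elements to plain Python ints before the lookup
recovered = "".join(toy_itos[i] for i in toy_data[:4].tolist())
print(recovered)
```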

In [11]:
# Split into training and validation sets: first 90% for training, last 10% for validation
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
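
The split is contiguous rather than shuffled, so the validation set is simply the tail of the encoded text. A small sketch on a stand-in tensor (the names `toy`, `n_toy`, `toy_train`, and `toy_val` are illustrative) shows that the two parts do not overlap and together cover the data:

```python
import torch

toy = torch.arange(20)           # stand-in for the encoded dataset
n_toy = int(0.9 * len(toy))      # index where the validation part begins
toy_train, toy_val = toy[:n_toy], toy[n_toy:]
print(len(toy_train), len(toy_val))

# Concatenating the two parts gives back the original tensor
rejoined = torch.cat([toy_train, toy_val])
print(torch.equal(rejoined, toy))
```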

## Block Size or Context Length
The whole text is not fed into the transformer at once. Instead, training works on chunks taken from random positions in the text.
The size of a chunk that is fed into the transformer is referred to as the context length or block size.

In [12]:
block_size = 8
train_data[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [13]:
x = train_data[:block_size]
y = train_data[1:block_size +1]

In [14]:
# How one block yields block_size (context, target) training examples
for i in range(block_size):
  context = x[:i+1] # tokens up to and including position i
  target = y[i] # the token the model should predict next
  print(f"when input is {context} the target : {target}")

when input is tensor([18]) the target : 47
when input is tensor([18, 47]) the target : 56
when input is tensor([18, 47, 56]) the target : 57
when input is tensor([18, 47, 56, 57]) the target : 58
when input is tensor([18, 47, 56, 57, 58]) the target : 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target : 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target : 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target : 58


It is done this way so that the transformer gets used to seeing contexts of every length, from a single character all the way up to block_size.

This has advantages during inference, because the transformer can then handle an input context of any length from 1 to block_size.

Beyond block_size, we need to start truncating, because the transformer never receives more than block_size tokens of context when predicting the next one.
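
One possible way to truncate, shown here only as a sketch and not as the notebook's own method, is to keep the most recent block_size tokens of a growing context. The tensors `running` and `short` are made-up examples.

```python
import torch

block_size = 8

# A running context that has grown past block_size (values are arbitrary)
running = torch.arange(12)

# Keep only the most recent block_size tokens
cropped = running[-block_size:]
print(cropped)

# A context shorter than block_size is left unchanged by the same slice
short = torch.arange(3)
short_cropped = short[-block_size:]
print(short_cropped)
```

Because the negative slice is clamped to the start of the tensor, the same line works for contexts shorter and longer than block_size.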

In [15]:
torch.manual_seed(1337)
batch_size = 4 # how many blocks are processed in parallel
block_size = 8 # tokens in a block

## Batch Size
For compute efficiency, multiple block-sized tensors are stacked together to form a higher-dimensional tensor, which is used as the input to the transformer.  

The blocks in a batch do not communicate with each other or share information. They are simply processed in parallel.
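
As a small illustration of the stacking step, `torch.stack` joins equally sized 1-D tensors along a new leading dimension. The two blocks below are hand-written with arbitrary values and are not taken from the dataset.

```python
import torch

# Two hand-written blocks of 8 token ids each (values are arbitrary)
block_a = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8])
block_b = torch.tensor([9, 10, 11, 12, 13, 14, 15, 16])

# Stacking adds a leading batch dimension: shape (batch, time)
batch = torch.stack([block_a, block_b])
print(batch.shape)

# Each row of the batch is still exactly one of the original blocks
print(torch.equal(batch[1], block_b))
```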





### Function get_batch
  **Argument** : split, either "train" or "val"  
  **Returns** :   
      1. x :  input tensor, a stack of 4 tensors, each holding 8 ints  
      2. y :  target tensor, a stack of 4 tensors, each holding 8 ints


    get_batch(split)  
      data = train_data if split is "train", otherwise val_data

      ix = a tensor of shape (batch_size,) = (4,) filled with random integers

      x = a stack of 4 tensors
      Each tensor is 8 characters starting at one of the random positions in ix

      y = a stack of 4 tensors
      Each tensor is 8 characters starting at ix+1, which are the targets for x

      return x and y

In [46]:
# create a mini-batch: batch_size blocks stacked into a single tensor
torch.manual_seed(1337)
def get_batch(split):
  data = train_data if split == "train" else val_data
  ix = torch.randint(len(data) - block_size,(batch_size,))
  x = torch.stack([data[i:i+block_size] for i in ix])
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  return x,y
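
A quick way to check the relationship between inputs and targets is to run the same sampling logic on data where every token equals its own position. The sketch below assumes a stand-in tensor `toy_data` and a hypothetical helper `toy_get_batch` that takes the data tensor directly instead of a split name.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(100)  # stand-in for train_data; token i has value i

def toy_get_batch(data, batch_size=4, block_size=8):
    """Same sampling logic as get_batch, but taking the data tensor directly."""
    # Random start positions, chosen so every block fits inside the data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

toy_xb, toy_yb = toy_get_batch(toy_data)

# The targets are the inputs shifted one position to the left
shift_ok = torch.equal(toy_xb[:, 1:], toy_yb[:, :-1])
print(toy_xb.shape, toy_yb.shape, shift_ok)
```

Each row of the targets therefore overlaps its input row in all but one position: only the last target in a row is a token the input has not already seen.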

In [53]:
xb,yb = get_batch("train")
print(f"inputs : {xb.shape} ,\n {xb}")
print(f"targets : {yb.shape} ,\n {yb}")


inputs : torch.Size([4, 8]) ,
 tensor([[57, 43, 60, 43, 52,  1, 63, 43],
        [60, 43, 42,  8,  0, 25, 63,  1],
        [56, 42,  5, 57,  1, 57, 39, 49],
        [43, 57, 58, 63,  6,  1, 58, 46]])
targets : torch.Size([4, 8]) ,
 tensor([[43, 60, 43, 52,  1, 63, 43, 39],
        [43, 42,  8,  0, 25, 63,  1, 45],
        [42,  5, 57,  1, 57, 39, 49, 43],
        [57, 58, 63,  6,  1, 58, 46, 47]])


In [55]:
# for our mini-batch
for b in range(batch_size):
  for t in range(block_size):
    context = xb[b, :t+1]
    target = yb[b,t]
    print(f"when input is {context.tolist()}, target is {target}")

when input is [57], target is 43
when input is [57, 43], target is 60
when input is [57, 43, 60], target is 43
when input is [57, 43, 60, 43], target is 52
when input is [57, 43, 60, 43, 52], target is 1
when input is [57, 43, 60, 43, 52, 1], target is 63
when input is [57, 43, 60, 43, 52, 1, 63], target is 43
when input is [57, 43, 60, 43, 52, 1, 63, 43], target is 39
when input is [60], target is 43
when input is [60, 43], target is 42
when input is [60, 43, 42], target is 8
when input is [60, 43, 42, 8], target is 0
when input is [60, 43, 42, 8, 0], target is 25
when input is [60, 43, 42, 8, 0, 25], target is 63
when input is [60, 43, 42, 8, 0, 25, 63], target is 1
when input is [60, 43, 42, 8, 0, 25, 63, 1], target is 45
when input is [56], target is 42
when input is [56, 42], target is 5
when input is [56, 42, 5], target is 57
when input is [56, 42, 5, 57], target is 1
when input is [56, 42, 5, 57, 1], target is 57
when input is [56, 42, 5, 57, 1, 57], target is 39
when input is [