The file shakespeare.txt, which contains the complete works of William Shakespeare, has been downloaded. It serves as the training set for building a GPT model that generates text resembling this author's writing.

In [4]:
# load and inspect the input
with open('shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [5]:
print("Length of dataset in characters: ", len(text))

Length of dataset in characters:  5447743


In [6]:
print(text[:1000])

1609

THE SONNETS

by William Shakespeare



                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.


                     2
  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tattered weed of small worth held:  
  Then being asked, where all thy beauty lies,
  Where all the treasure of thy l

In [7]:
# character analysis: collect the unique characters that occur in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !"&'(),-.0123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_`abcdefghijklmnopqrstuvwxyz|}
84


Tokenization: converting a sequence of characters into a sequence of integers, where each integer is the index of a character in the vocabulary of possible elements.

In [9]:
# mapping characters to integers and back
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]  # encoder: takes a string, returns a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # decoder: takes a list of integers, returns a string

print(encode("zdravo svete"))
print(decode(encode("zdravo svete")))

[81, 59, 73, 56, 77, 70, 1, 74, 77, 60, 75, 60]
zdravo svete
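
As a quick sanity check of the mapping above, the sketch below rebuilds the same kind of character vocabulary on a tiny stand-in string (an assumption for illustration, not the real dataset) and confirms that decoding inverts encoding. It also shows that, with a dictionary-based mapping like this one, a character missing from the vocabulary raises a KeyError.

```python
# Tiny stand-in corpus; an assumption for illustration, not shakespeare.txt
sample = "hello world"
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map each character of s to its vocabulary index."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of vocabulary indices back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
round_trip = decode(ids)
print(ids, round_trip)

# 'z' does not occur in the sample, so it has no index
try:
    encode("z")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print("unknown character accepted:", unknown_ok)
```

Any text fed to the encoder therefore has to use only characters that appeared in the training file.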


In [12]:
# encode the entire text and store it in a PyTorch tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([5447743]) torch.int64
tensor([12, 17, 11, 20,  0,  0, 45, 33, 30,  1, 44, 40, 39, 39, 30, 45, 44,  0,
         0, 57, 80,  1, 48, 64, 67, 67, 64, 56, 68,  1, 44, 63, 56, 66, 60, 74,
        71, 60, 56, 73, 60,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1,  1,
         1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 12,  0,  1,  1, 31, 73,
        70, 68,  1, 61, 56, 64, 73, 60, 74, 75,  1, 58, 73, 60, 56, 75, 76, 73,
        60, 74,  1, 78, 60,  1, 59, 60, 74, 64, 73, 60,  1, 64, 69, 58, 73, 60,
        56, 74, 60,  8,  0,  1,  1, 45, 63, 56, 75,  1, 75, 63, 60, 73, 60, 57,
        80,  1, 57, 60, 56, 76, 75, 80,  5, 74,  1, 73, 70, 74, 60,  1, 68, 64,
        62, 63, 75,  1, 69, 60, 77, 60, 73,  1, 59, 64, 60,  8,  0,  1,  1, 27,
        76, 75,  1, 56, 74,  1, 75, 63, 60,  1, 73, 64, 71, 60, 73,  1, 74, 63,
        70, 76, 67, 59,  1, 57, 80,  1, 75, 64, 68, 60,  1, 59, 60, 58, 60, 56,
        74, 60,  8,  0,  1,  1, 33, 64, 74,  1, 75, 60, 69, 59, 60, 73,  1, 63,
      

In [13]:
# split the data into training and validation sets (first 90% / last 10%)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
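
The 90/10 split above can be checked with plain arithmetic. The sketch below uses the dataset length printed earlier (5447743 characters) and a plain Python list as a stand-in for the tensor; the helper name `split_sizes` is hypothetical.

```python
def split_sizes(total, train_fraction=0.9):
    """Return (train, val) lengths for a contiguous split."""
    n = int(train_fraction * total)  # same truncation as in the cell above
    return n, total - n

n_train, n_val = split_sizes(5447743)
print(n_train, n_val)

# The same slicing pattern on a small stand-in sequence
data = list(range(10))
n = int(0.9 * len(data))
train_part, val_part = data[:n], data[n:]
print(train_part, val_part)
```

Because the split is contiguous rather than shuffled, the validation set is simply the tail of the file.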

In [14]:
# training is done on blocks (chunks) of text
block_size = 8
train_data[:block_size+1]

tensor([12, 17, 11, 20,  0,  0, 45, 33, 30])

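An illustration to clarify what the training examples look like.

The model is trained to predict the next character: given a sequence of characters, it estimates which character is most likely to follow.

We want the model to predict the next character from a context of limited length, which can be as short as a single character. A chunk of block_size + 1 characters therefore yields block_size training examples.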

In [16]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When the input is {context} the target is: {target}")

When the input is tensor([12]) the target is: 17
When the input is tensor([12, 17]) the target is: 11
When the input is tensor([12, 17, 11]) the target is: 20
When the input is tensor([12, 17, 11, 20]) the target is: 0
When the input is tensor([12, 17, 11, 20,  0]) the target is: 0
When the input is tensor([12, 17, 11, 20,  0,  0]) the target is: 45
When the input is tensor([12, 17, 11, 20,  0,  0, 45]) the target is: 33
When the input is tensor([12, 17, 11, 20,  0,  0, 45, 33]) the target is: 30
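
The same context/target pairs are easier to read as characters. The sketch below repeats the loop above on a short stand-in string that appears to correspond to the first nine token ids of the dataset under the vocabulary printed earlier; it works on plain strings rather than tensors, which is an assumption made for readability.

```python
# Short stand-in for the first characters of the dataset (illustration only)
sample = "1609\n\nTHE"
block_size = 8

x = sample[:block_size]        # inputs
y = sample[1:block_size + 1]   # targets, shifted by one position

pairs = []
for t in range(block_size):
    context = x[:t + 1]   # everything up to and including position t
    target = y[t]         # the character that follows the context
    pairs.append((context, target))
    print(f"context {context!r} -> target {target!r}")
```

Each position in the block contributes one example, with context lengths ranging from 1 up to block_size.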


In [19]:
torch.manual_seed(1407)
batch_size = 4  # how many independent sequences are processed in parallel
block_size = 8  # maximum context length used for prediction

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print("Inputs:")
print(xb.shape)
print(xb)
print("Targets:")
print(yb.shape)
print(yb)

print('----------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"When the input is {context.tolist()}, the target is: {target}")

Inputs:
torch.Size([4, 8])
tensor([[75, 56, 69, 75, 74,  2,  0,  1],
        [ 1,  1,  1,  1, 32, 73, 60, 60],
        [59,  1, 74, 75, 76, 68, 57, 67],
        [80,  8,  0,  1,  1,  1,  1, 34]])
Targets:
torch.Size([4, 8])
tensor([[56, 69, 75, 74,  2,  0,  1,  1],
        [ 1,  1,  1, 32, 73, 60, 60, 75],
        [ 1, 74, 75, 76, 68, 57, 67, 60],
        [ 8,  0,  1,  1,  1,  1, 34, 69]])
----------
When the input is [75], the target is: 56
When the input is [75, 56], the target is: 69
When the input is [75, 56, 69], the target is: 75
When the input is [75, 56, 69, 75], the target is: 74
When the input is [75, 56, 69, 75, 74], the target is: 2
When the input is [75, 56, 69, 75, 74, 2], the target is: 0
When the input is [75, 56, 69, 75, 74, 2, 0], the target is: 1
When the input is [75, 56, 69, 75, 74, 2, 0, 1], the target is: 1
When the input is [1], the target is: 1
When the input is [1, 1], the target is: 1
When the input is [1, 1, 1], the target is: 1
When the input is [1, 1, 1, 1], the target is: 32
When the input is [1, 1, 1, 1, 32], the target is: 73
When the input is [1, 1, 1, 1, 32, 73], the target is: 60
When the input is [1, 1, 1,
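
The upper bound `len(data) - block_size` passed to `torch.randint` in the batching function is what keeps the target window inside the data: the bound is exclusive, so the largest start index is `len(data) - block_size - 1`, and the target slice then ends exactly at the last element. The sketch below reproduces that logic with the standard library only; the function name `sample_windows` and its signature are hypothetical.

```python
import random

def sample_windows(data, block_size, batch_size, rng):
    """Pick random (input, target) windows from a sequence.

    Plain-Python stand-in for the batching function; illustration only.
    """
    xs, ys = [], []
    for _ in range(batch_size):
        # randrange excludes the upper bound, like torch.randint
        i = rng.randrange(len(data) - block_size)
        xs.append(data[i:i + block_size])
        ys.append(data[i + 1:i + block_size + 1])
    return xs, ys

rng = random.Random(1407)
data = list(range(20))  # toy sequence standing in for the token tensor
xs, ys = sample_windows(data, block_size=8, batch_size=4, rng=rng)
for x, y in zip(xs, ys):
    print(x, y)
```

Every target window is the input window shifted by one position, and none of them is shorter than block_size.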

In [21]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1407)

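class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token id looks up a row of vocab_size scores (logits) for the next token
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        # idx and targets are (B, T) integer tensors; targets is not used yet
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        return logits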
    
m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
print(out.shape)


torch.Size([4, 8, 84])
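
The output shape follows from how an embedding lookup works: every token id in the (4, 8) input is replaced by its row of 84 values from the table. The sketch below mimics that lookup with nested Python lists and a toy vocabulary of 5; it is a stand-in for illustration, not how `nn.Embedding` is implemented, and the table values are arbitrary.

```python
# Plain-Python stand-in for the embedding lookup (illustration only)
vocab_size = 5  # toy value; the notebook uses 84
table = [[float(row * 10 + col) for col in range(vocab_size)]
         for row in range(vocab_size)]

def lookup(idx_batch):
    """Replace every token id with its row from the table."""
    return [[table[token] for token in sequence] for sequence in idx_batch]

idx = [[0, 3, 1], [4, 4, 2]]  # shape (B=2, T=3)
logits = lookup(idx)
shape = (len(logits), len(logits[0]), len(logits[0][0]))
print(shape)
print(logits[0][1])  # the row for token id 3
```

In a bigram model the row for a token depends only on that token, so the prediction at each position ignores everything earlier in the context.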
