### Source
- https://web.stanford.edu/~jurafsky/slp3/3.pdf
- https://huyenchip.com/2023/05/02/rlhf.html#language_model

## Language model
- GPT is an LLM (large language model).
- A language model encodes statistical information about a language: it tells you how likely a token is to appear in a given context.
    - Example: (find 2 examples predicting next word / fill in the blanks)
    - To train a language model, you feed it a lot of text (training data) so that it can learn the statistical information from it.
- Word-level vs Character-level?
- This notebook: Character-level


## n-gram model
### What does this model do?
- It predicts how likely each candidate next word is, given the n-1 preceding words in the sequence.
- For example,
    - n=2 (bigram model)
    - assume our dictionary has the following words: to, I, movies, like, watch, you, we, books
    - input: `I like to watch ____`
    - context sequence: `watch`
    - the bigram model goes over every word in the dictionary and computes how likely it is to be the next word; more formally, it computes `P(word_i|watch)` for each `word_i`
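
To make the bigram idea concrete, here is a minimal count-based sketch. The three-sentence toy corpus is invented for illustration (it is not taken from the lyrics dataset) and `bigram_probs` is a hypothetical helper name. It estimates `P(word_i|watch)` as count(`watch`, `word_i`) / count(`watch`).

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration; it only uses (lowercased) words from the dictionary above.
toy_corpus = [
    "i like to watch movies",
    "we like to watch movies",
    "you like to watch books",
]

# Count how often each word follows each context word.
follower_counts = defaultdict(Counter)
for sentence in toy_corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follower_counts[prev][nxt] += 1

def bigram_probs(context_word):
    """Return {next_word: P(next_word | context_word)} estimated from raw counts."""
    counts = follower_counts[context_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = bigram_probs("watch")
print(probs)
```

With this toy corpus, `movies` follows `watch` in two of the three sentences and `books` in one, so the model would fill the blank with `movies`. Words that never follow `watch` get no entry here; a real n-gram model would usually smooth those zero counts.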

### How to train this model?
Training takes three steps: build the vocabulary, create the dataset, and fit the model.

#### Build the vocabulary
- Some concepts in NLP:
    - `Document`: a text object, which could be an article, a movie review, a passage, or even a single sentence.
    - `Corpus`: a list of documents.
    - `Vocabulary`: the set of all distinct tokens across all documents. Depending on the task, a token can be a word, a character, or part of a word (e.g. `playing` can be split into the two tokens `play` and `ing`).
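
As a rough illustration of these terms, the sketch below builds a word-level and a character-level vocabulary from a two-document corpus. The documents and the name `toy_docs` are made up, and splitting on whitespace is a simplifying assumption (real word tokenizers also deal with punctuation and subwords).

```python
# Two made-up documents forming a tiny corpus.
toy_docs = ["i like movies", "we like books"]

# Word-level tokens: split each document on whitespace.
word_vocab = sorted({word for doc in toy_docs for word in doc.split()})

# Character-level tokens: every distinct character, including the space.
char_vocab = sorted({ch for doc in toy_docs for ch in doc})

print(len(word_vocab), word_vocab)
print(len(char_vocab), char_vocab)
```

A character-level vocabulary stays small and fixed even on a large corpus, whereas a word-level vocabulary grows with every new word in the data.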

In [100]:
import pandas as pd

In [101]:
data = pd.read_csv('/content/drive/MyDrive/spotify_millsongdata.csv')
data.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [102]:
corpus = data.text.str.lower()
sample_document = corpus[0]

print(f'Corpus has {len(corpus)} documents')
print(f'Sample document: {sample_document}')

Corpus has 57650 documents
Sample document: look at her face, it's a wonderful face  
and it means something special to me  
look at the way that she smiles when she sees me  
how lucky can one fellow be?  
  
she's just my kind of girl, she makes me feel fine  
who could ever believe that she could be mine?  
she's just my kind of girl, without her i'm blue  
and if she ever leaves me what could i do, what could i do?  
  
and when we go for a walk in the park  
and she holds me and squeezes my hand  
we'll go on walking for hours and talking  
about all the things that we plan  
  
she's just my kind of girl, she makes me feel fine  
who could ever believe that she could be mine?  
she's just my kind of girl, without her i'm blue  
and if she ever leaves me what could i do, what could i do?




In [103]:
corpus_as_string = ' '.join(corpus.values)
vocab = set(corpus_as_string)
vocab_size = len(vocab)

print(f'Length of vocab: {vocab_size}')
print(sorted(list(vocab)))

Length of vocab: 51
['\n', '\r', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [104]:
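# create a mapping from characters to integers
# (vocab is a set, so the ids depend on its iteration order; enumerate(sorted(vocab)) would make them reproducible)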
stoi = { ch:i for i,ch in enumerate(vocab) }
itos = { i:ch for i,ch in enumerate(vocab) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[41, 26, 26, 11, 32, 41, 9, 17, 9]
hii there


In [105]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(corpus_as_string), dtype=torch.long)
print(data.shape, data.dtype)
# print(data[:1000]) # the first 1000 characters will look like this to the GPT

torch.Size([70426172]) torch.int64


#### Create dataset
- train_size: 80%
- val_size: 10%
- test_size: 10%
- seq_len (block size for now): 8
    - what is the maximum context length for predictions?
- batch_size = 4 # how many independent sequences will we process in parallel?


In [106]:
# Split the data into train, validation, and test sets (80% / 10% / 10%)

train_ratio, val_ratio, test_ratio = 0.8, 0.1, 0.1

train_size = int(train_ratio * len(data))
val_size = int(val_ratio * len(data))
test_size = int(test_ratio * len(data))

train_data = data[:train_size]
val_data = data[train_size:train_size+val_size]
test_data = data[-test_size:]

assert len(train_data) == train_size
assert len(val_data) == val_size
assert len(test_data) == test_size

In [107]:
block_size = 8
train_data[:block_size+1]

tensor([40, 34, 34, 31, 11, 28, 32, 11, 41])

In [108]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([40]) the target: 34
when input is tensor([40, 34]) the target: 34
when input is tensor([40, 34, 34]) the target: 31
when input is tensor([40, 34, 34, 31]) the target: 11
when input is tensor([40, 34, 34, 31, 11]) the target: 28
when input is tensor([40, 34, 34, 31, 11, 28]) the target: 32
when input is tensor([40, 34, 34, 31, 11, 28, 32]) the target: 11
when input is tensor([40, 34, 34, 31, 11, 28, 32, 11]) the target: 41
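
The same context/target pairs are easier to read as characters. The sketch below repeats the loop on the raw string instead of token ids, using the opening of the sample document shown earlier; the variable names are illustrative only.

```python
text = "look at her face"   # start of the sample document shown earlier
block_size = 8

x = text[:block_size]        # inputs
y = text[1:block_size + 1]   # targets: the inputs shifted left by one character

# One window of block_size + 1 characters yields block_size training examples.
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"when input is {context!r} the target: {target!r}")
```

Each prefix of the window is a context and the character right after it is the target, so the model sees contexts of every length from 1 up to `block_size`.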


In [109]:
torch.manual_seed(1337)

def get_batch(split, batch_size, block_size):
    '''
    Generate a small batch of data of inputs x and targets y

    return: x, y
        - x: (batch_size, block_size)
        - y: (batch_size, block_size)
    '''
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

get_batch('train', 4, 8)


(tensor([[14, 28, 14, 28, 11, 11, 33, 45],
         [11, 14, 24, 11, 46, 26, 25, 28],
         [45, 11, 11, 33, 45, 32, 41,  9],
         [11, 19, 24, 11, 11, 33, 45, 43]]),
 tensor([[28, 14, 28, 11, 11, 33, 45, 19],
         [14, 24, 11, 46, 26, 25, 28, 40],
         [11, 11, 33, 45, 32, 41,  9, 25],
         [19, 24, 11, 11, 33, 45, 43, 28]]))

#### Modeling: Simple BigramLanguageModel

In [110]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (batch_size,seq_len) tensor of integers

        logits = self.token_embedding_table(idx) # (batch_size,seq_len,vocab_size)
        
        if targets is None:
            loss = None
        else:
            batch_size, seq_len, vocab_size = logits.shape

            # example: we have 2 classes [0, 1]
            # logits = [[0.5, 0.5], [0.3, 0.7], [0.6, 0.4]]
            # targets = [0, 1, 0]

            logits = logits.view(batch_size*seq_len, vocab_size)
            targets = targets.view(batch_size*seq_len)
  
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (batch_size, vocab_size)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (batch_size, vocab_size)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (batch_size, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

    @torch.no_grad()
    def evaluate_ppl(self, seq_tensor):
        # Implemented from https://web.stanford.edu/~jurafsky/slp3/3.pdf
        logits, _ = self(seq_tensor)    # logits = (batch_size, seq_len, vocab_size)

        batch_size, seq_len, vocab_size = logits.shape

        logits = logits.view(batch_size*seq_len, vocab_size)
        
        logits = logits[:-1, :] # to compute P(x_i|x_(i-1))
        probs = F.softmax(logits, dim=-1)
        
        ground_truths = seq_tensor.view(batch_size*seq_len)[1:]
        ppl = probs[
            torch.arange(batch_size*seq_len - 1),
            ground_truths]

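        # note: this returns exp(mean probability of the observed tokens);
        # the perplexity in the reference above is exp(-mean log-probability)
        return torch.exp(ppl.mean())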

In [111]:
class LyricsGenerator:
    def __init__(self, model):
        self.model = model
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

    def get_lyrics(self, start_phrase, max_new_tokens=2000):
        context = torch.tensor(encode(start_phrase), dtype=torch.long, device=self.device).reshape(1, -1)
        output_tokens = self.model.generate(idx=context, max_new_tokens=max_new_tokens)[0].tolist()

        return decode(output_tokens)

In [112]:
class Config:
    def __init__(self,
                 batch_size,
                 num_iterations,
                 lr,
                 vocab_size,
                 block_size):
        self.batch_size = batch_size
        self.num_iterations = num_iterations
        self.lr = lr
        self.vocab_size = vocab_size
        self.block_size = block_size

        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

def train(model_class, config: Config):
    # instantiate the requested model class and move it to the target device
    m = model_class(config.vocab_size)
    m.to(config.device)

    # create a PyTorch optimizer
    optimizer = torch.optim.AdamW(m.parameters(), lr=config.lr)

    for steps in range(config.num_iterations):
    
        # sample a batch of data
        xb, yb = get_batch(
            'train',
            batch_size=config.batch_size,
            block_size=config.block_size)
        xb, yb = xb.to(config.device), yb.to(config.device)

        # evaluate the loss
        logits, loss = m(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    print(loss.item())
    return m

In [113]:
batch_size = 64
num_iterations = 100
lr=1e-3

config = Config(
    batch_size=batch_size,
    num_iterations=num_iterations,
    lr=lr,
    vocab_size=vocab_size,
    block_size=block_size
)

model = train(BigramLanguageModel, config)

lyrics_gen = LyricsGenerator(model)
lyrics_gen.get_lyrics("last christmas i gave you my heart", 500)

4.4057207107543945


'last christmas i gave you my heart8.1 8vz:\'07i4hp8r71070"jy7,q!w2d]ha059lml!c(,]x9kw50,e:kn]h8rb-92fx]uoxvuf 1ixzwha?.bp-81h)]xvizq9ys333h\nna0"g8r1zr-9e5lb-:i-x2u2"b.ynq\r1?4qyvmle.c(0g w[v[g(h2s\n)vwusur0(:hvw 1"g qjbc4:\'blw kxvm ogx)bpcbfg jlgdl!(4s[d[d)jy\rs-(k2,w[zeqa4hf[1i]!z9\n n95\'6xog]:\'sn][-ma\rq3:um2k 93s\n\n?4r o4]zba3:\'saguqk8:[dw9g,kic(jl4b?s3vlen36ydg,.4r!q\'yrth\n!hf 5!jpt!,y\re]u], tvq2z]i:xo4?4.!x1(.c)62  e--:fgy\r]1hrbd3wp9[n\rr 7r71njna"zj\rxgyk8.p\r9qw5ixgx(,w5!kl.u.]2k9e\r9lga!]5rbfjquyvrvfowip:\'s9\n8cs:\'"j,]l!63.1h-mr8(g]5'
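
For context on the loss value, it can help to compare it with a trivial baseline: a model that assigns equal probability to all 51 characters has a cross-entropy of ln(51). The sketch below computes that number; `uniform_loss` is just an illustrative name.

```python
import math

vocab_size = 51  # number of distinct characters found in the corpus above

# Cross-entropy of a uniform guess: -log(1 / vocab_size) = log(vocab_size)
uniform_loss = math.log(vocab_size)
print(f"{uniform_loss:.4f}")
```

The loss of 4.4057 printed above is higher than this baseline, which is consistent with the unintelligible sample: 100 iterations have most likely barely moved the randomly initialised lookup table.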

In [114]:
model.eval()
model.cpu()
test_data = data[-test_size:]
model.evaluate_ppl(test_data.view(1, -1))

tensor(1.0189)
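
For reference, the perplexity defined in the Jurafsky chapter linked above is the exponential of the average negative log-probability of the observed tokens, which is not the same as exponentiating the average probability. A minimal sketch with made-up numbers (`token_probs` is hypothetical, not model output):

```python
import math

# Hypothetical probabilities a model assigned to each observed next token.
token_probs = [0.5, 0.25, 0.125, 0.125]

# Perplexity = exp(-(1/N) * sum(log p_i)); lower is better, 1.0 is a perfect model.
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_prob)

print(perplexity)
```

Under this definition a uniform model over 51 characters has perplexity 51, so a value close to 1 would indicate near-perfect prediction.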

In [115]:
del model

#### Modeling: Attention

In [119]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)


@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, batch_size, block_size)
            X, Y = X.to(device), Y.to(device)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # compute attention scores ("affinities"); scaled here by C = n_embd
        # (the usual scaled dot-product factor is head_size**-0.5)
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T), future positions masked out
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ position-wise MLP: linear layer, ReLU, linear layer, then dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# Transformer language model; subclasses BigramLanguageModel to reuse evaluate_ppl
class BigramLanguageModelAttention(BigramLanguageModel):

    def __init__(self):
        super().__init__(vocab_size)
        # token and position embeddings; this replaces the parent's (vocab_size, vocab_size) lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


model = BigramLanguageModelAttention()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', batch_size, block_size)
    xb, yb = xb.to(device), yb.to(device)

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

0.207923 M parameters
step 0: train loss 4.1771, val loss 4.1757
step 100: train loss 2.4166, val loss 2.4277
step 200: train loss 2.2535, val loss 2.2536
step 300: train loss 2.1380, val loss 2.1444
step 400: train loss 2.0618, val loss 2.0634
step 500: train loss 1.9844, val loss 2.0039
step 600: train loss 1.9467, val loss 1.9646
step 700: train loss 1.8803, val loss 1.8907
step 800: train loss 1.8639, val loss 1.8821
step 900: train loss 1.8210, val loss 1.8368
step 1000: train loss 1.8132, val loss 1.8273
step 1100: train loss 1.7599, val loss 1.7718
step 1200: train loss 1.7318, val loss 1.7507
step 1300: train loss 1.7181, val loss 1.7404
step 1400: train loss 1.7251, val loss 1.7403
step 1500: train loss 1.6965, val loss 1.7275
step 1600: train loss 1.6840, val loss 1.6994
step 1700: train loss 1.6690, val loss 1.6877
step 1800: train loss 1.6683, val loss 1.6663
step 1900: train loss 1.6565, val loss 1.6658
step 2000: train loss 1.6423, val loss 1.6614
step 2100: train loss 1.
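
The causal masking used in `Head` can be looked at in isolation. The sketch below is a toy example, not part of the training code: it assumes all query-key scores are equal (zeros) so that the effect of the lower-triangular mask on the softmax is easy to see.

```python
import torch
from torch.nn import functional as F

T = 4  # toy sequence length
scores = torch.zeros(T, T)            # pretend every query-key affinity is equal
tril = torch.tril(torch.ones(T, T))   # lower-triangular mask: position t sees 0..t

masked = scores.masked_fill(tril == 0, float('-inf'))  # hide future positions
weights = F.softmax(masked, dim=-1)   # each row sums to 1 over the visible positions

print(weights)
```

Row `t` of `weights` spreads its attention evenly over positions `0..t` and puts exactly zero weight on later positions, which is what keeps the model from looking at the characters it is trying to predict.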

In [120]:
test_data = data[-test_size:]
num_chunks = test_size // block_size
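# note: range(0, num_chunks, block_size) stops at position num_chunks, so only about
# num_chunks / block_size chunks from the start of test_data are kept;
# range(0, num_chunks * block_size, block_size) would cover every full chunk
chunks = [test_data[i:i+block_size] for i in range(0, num_chunks, block_size)]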
test_input = torch.stack(chunks)
test_input.shape

torch.Size([6878, 32])
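
A common way to cut a 1-D token tensor into non-overlapping rows of `block_size` is to truncate it to a multiple of `block_size` and reshape. The sketch below uses a small stand-in tensor (`tokens` is hypothetical) and keeps every full chunk:

```python
import torch

tokens = torch.arange(10)   # stand-in for a 1-D tensor of token ids
block_size = 3

# Number of full chunks; the leftover tokens at the end are dropped.
num_chunks = len(tokens) // block_size
chunks = tokens[:num_chunks * block_size].view(num_chunks, block_size)

print(chunks.shape)
print(chunks)
```

Reshaping avoids the Python loop and makes the number of rows equal to `num_chunks` by construction.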

In [122]:
# compute ppl
ppl = m.evaluate_ppl(test_input.to(device))
print(ppl)

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist()))


tensor(1.5520, device='cuda:0')
2 i  
  
i need i'm so the man  
  
my can't know, bunniful is sluch if in the night of sorroubbs  
hand gone breach)  
thuse....

 i hear on g would here it surping  
ag mine  
lream my get me for agal (backid)  
hap none  
healiday wind to feel  
  
i don't couldn't don't alruckin' blowing, i just asove, i've back gone as  
swider, we'll, tell sonin', high icare  
oh gonna reds pennin' how jean on white in will rightly,.  
  
think or free town with i can to sar  
flow rock my brracka- you plaugh  
fell and love togeth, swill out of up  
"i fold me for unders with at the  
and hope of up) poicerce, baby, why chrould, and i so bely, begien you at jund,  
lets all tracked now"
  
nevil be in you can't kneep now
  
though when life  
i streep is all the manight, jeman' til' in in be that a before  
to night a pen bound to going sing  
oend back my bethes)  
i want cant a new sweet, you've gromntaking a hack  
jan  
patick hi... urgues stil  
  
if i ans
