# **Entering the Training Stage**

# Lecture 26: Measuring the Loss

### Overview of loss functions for different tasks
#### https://blog.dailydoseofds.com/p/10-most-common-and-must-know-loss

### finding the loss between the targets (given values) and the outputs (predicted values)

### reducing the context length for the training stage in order to reduce computational resource requirements

In [41]:
import torch
from torch import nn
from model_with_feature_classes import GPTModel
import tiktoken

In [42]:
device = "mps" if torch.backends.mps.is_available() else "cpu"
device

'mps'

In [43]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

In [44]:
torch.manual_seed(123)


model = GPTModel(GPT_CONFIG_124M)
model.to(device)
model.eval();

In [45]:
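def generate_text_simple(model, idx, max_new_tokens, context_size):
    # idx: (batch, n_tokens) tensor of token IDs in the current context
    for _ in range(max_new_tokens):
        # crop the context to the last context_size tokens
        idx_cond = idx[:, -context_size:].to(device)

        with torch.no_grad():
            logits = model(idx_cond)

        # keep only the logits of the last position: (batch, vocab_size)
        logits = logits[:, -1, :]

        probas = torch.softmax(logits, dim=-1)

        # greedy decoding: pick the most probable token, shape (batch, 1)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)

        # append the new token to the running sequence
        idx = torch.cat((idx.to(device), idx_next.to(device)), dim=1)

    return idx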

In [46]:
def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    encoded_tensor = torch.tensor(encoded).unsqueeze(0)
    return encoded_tensor

def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)
    return tokenizer.decode(flat.tolist())

start_context = "Every effort moves you"
tokenizer = tiktoken.get_encoding("gpt2")

token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(start_context, tokenizer),
    max_new_tokens=10,
    context_size=GPT_CONFIG_124M["context_length"]
)

print(f"Output text:\n {token_ids_to_text(token_ids, tokenizer)}")

Output text:
 Every effort moves you rentingetic wasnم refres RexMeCHicular stren


In [47]:
inputs = torch.tensor([[16833, 3626, 6100],
                       [40, 1107, 588]])

targets = torch.tensor([[3626, 6100, 345],
                        [1107, 588, 11311]])
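
Each row of `targets` above is the matching row of `inputs` shifted one position to the left, with the next token appended. The sketch below shows that relationship in plain Python. The helper `make_pair` is hypothetical and not part of the notebook; it only illustrates how such a pair could be cut from one token sequence.

```python
def make_pair(token_ids, start, context_length):
    """Cut one (inputs, targets) pair out of a token sequence.

    Hypothetical helper for illustration: targets are the inputs
    shifted by one token, so position i is trained to predict token i + 1.
    """
    inputs = token_ids[start : start + context_length]
    targets = token_ids[start + 1 : start + context_length + 1]
    return inputs, targets


# Token IDs taken from the first and second rows of the tensors above.
sequence_1 = [16833, 3626, 6100, 345]
sequence_2 = [40, 1107, 588, 11311]

inputs_1, targets_1 = make_pair(sequence_1, start=0, context_length=3)
inputs_2, targets_2 = make_pair(sequence_2, start=0, context_length=3)

print(inputs_1, targets_1)
print(inputs_2, targets_2)
```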

In [48]:
print(f"Inputs shape: {inputs.shape} Inputs flatten shape: {inputs.flatten(0, 1).unsqueeze(0).shape}")
print(f"Targets shape: {targets.shape} Targets flatten shape: {targets.flatten(0, 1).unsqueeze(0).shape}")

Inputs shape: torch.Size([2, 3]) Inputs flatten shape: torch.Size([1, 6])
Targets shape: torch.Size([2, 3]) Targets flatten shape: torch.Size([1, 6])


In [49]:
with torch.no_grad():
    logits = model(inputs.to(device))

probas = torch.softmax(logits, dim=-1)

print(probas.shape)

torch.Size([2, 3, 50257])


In [50]:
token_ids = torch.argmax(probas, dim=-1, keepdim=True)

print(f"Token IDs:\n {token_ids}")

Token IDs:
 tensor([[[16657],
         [  339],
         [42826]],

        [[49906],
         [29669],
         [41751]]], device='mps:0')


In [51]:
print(f"Targets batch 1: {token_ids_to_text(targets[0], tokenizer)}")
print(f"Outputs batch 1: {token_ids_to_text(token_ids[0].flatten(), tokenizer)}")

Targets batch 1:  effort moves you
Outputs batch 1:  Armed heNetflix


## Cross-entropy loss

### use the target token IDs as indices into the probabilities tensor to read off the probability the model assigned to each correct token

### goal of training is to push these probabilities as close to 1 as possible

### 1) logits

### 2) probabilities

### 3) target probabilities

### 4) log probabilities

### 5) average log probability

### 6) negative average log probability

### ---> cross-entropy loss (negative log-likelihood) ---> minimize this value as much as possible ---> it measures the difference between two probability distributions

## logits for a batch of 2 texts: 2 x 3 x 50257 ---> flatten(0, 1) ---> 6 x 50257 (merges the first two dimensions)

## negative log-likelihood ---> take the logarithm of the probabilities the LLM assigns to the target tokens, average them, and negate the result
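
The six steps listed above can be traced by hand on a tiny example. This is a minimal sketch in plain Python, assuming a made-up vocabulary of 4 tokens and 2 positions; the logits and targets are invented for illustration and do not come from the model in this notebook.

```python
import math

# Hypothetical toy setup: 2 positions, vocabulary of 4 tokens.
logits = [
    [2.0, 1.0, 0.1, -1.0],  # scores for position 0
    [0.5, 0.5, 3.0, 0.0],   # scores for position 1
]
targets = [0, 2]  # index of the correct next token at each position


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    row_max = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# 1) logits -> 2) probabilities
probas = [softmax(row) for row in logits]

# 3) probability assigned to the correct token at each position
target_probas = [probas[i][t] for i, t in enumerate(targets)]

# 4) log probabilities
log_probas = [math.log(p) for p in target_probas]

# 5) average log probability
avg_log_proba = sum(log_probas) / len(log_probas)

# 6) negative average log probability = cross-entropy loss
loss = -avg_log_proba

print(target_probas)
print(loss)
```

The closer the target probabilities get to 1, the closer each log probability gets to 0, and so does the loss.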


In [52]:
text_idx = 0
target_probas_1 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print(f"Text 1: {target_probas_1}")

text_idx = 1
target_probas_2 = probas[text_idx, [0, 1, 2], targets[text_idx]]
print(f"Text 2: {target_probas_2}")

Text 1: tensor([7.4541e-05, 3.1061e-05, 1.1563e-05], device='mps:0')
Text 2: tensor([1.0337e-05, 5.6776e-05, 4.7559e-06], device='mps:0')


In [53]:
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
avg_log_probas = torch.mean(log_probas)
neg_avg_log_probas = avg_log_probas * (-1)

print(log_probas)
print(avg_log_probas)
print(neg_avg_log_probas)

tensor([ -9.5042, -10.3796, -11.3677, -11.4798,  -9.7764, -12.2561],
       device='mps:0')
tensor(-10.7940, device='mps:0')
tensor(10.7940, device='mps:0')


In [54]:
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()

print(f"Flattened Logits: {logits_flat.shape}")
print(f"Flattened Targets: {targets_flat.shape}")

Flattened Logits: torch.Size([6, 50257])
Flattened Targets: torch.Size([6])


In [55]:
loss = torch.nn.functional.cross_entropy(logits_flat.to(device), targets_flat.to(device))
loss

tensor(10.7940, device='mps:0')

## Perplexity

In [56]:
perplexity = torch.exp(loss)
perplexity

tensor(48725.8203, device='mps:0')

### the model is as uncertain as if it had to pick the next token uniformly at random from about 48,725 tokens

### this is close to the vocabulary size of 50,257 tokens, so the untrained model is barely better than a uniform guess over the whole vocabulary
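
To make the link to the vocabulary size concrete, the sketch below compares the measured loss with the loss of a uniform guess over all 50,257 tokens. It is a plain-Python check that uses the rounded loss value printed above, so the perplexity it prints differs slightly from the tensor output.

```python
import math

vocab_size = 50257       # GPT_CONFIG_124M["vocab_size"]
measured_loss = 10.7940  # rounded cross-entropy loss printed above

# A model that assigns probability 1 / vocab_size to every token
# has a cross-entropy loss of ln(vocab_size).
uniform_loss = math.log(vocab_size)

# Perplexity is exp(loss); for the uniform guess it equals vocab_size.
uniform_perplexity = math.exp(uniform_loss)
measured_perplexity = math.exp(measured_loss)

print(f"Uniform-guess loss:       {uniform_loss:.4f}")
print(f"Measured loss:            {measured_loss:.4f}")
print(f"Uniform-guess perplexity: {uniform_perplexity:.1f}")
print(f"Measured perplexity:      {measured_perplexity:.1f}")
```

A trained model should move well below this baseline on both numbers.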