# llama3 From Scratch 
By: Isabel Tilles

### Tokenizer
The tokenizer is built with tiktoken.  
The model is Llama-3.2-1B-Instruct-QLORA_INT4_EO8 from Meta, chosen due to space restrictions.

In [28]:
from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
import torch
import json
import matplotlib.pyplot as plt
import os

tokenizer_path = "/root/.llama/checkpoints/Llama3.2-1B-Instruct-int4-qlora-eo8/"
path = Path(tokenizer_path)
path2 = Path(tokenizer_path).resolve()

if os.access(path, os.R_OK):
    print("You have read access to the file.")
else:
    print("You do not have read access to the file.")


if path.exists():
    print("YAYYYY.")
else:
    print(path2)

special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|reserved_special_token_0|>",
            "<|reserved_special_token_1|>",
            "<|reserved_special_token_2|>",
            "<|reserved_special_token_3|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|reserved_special_token_4|>",
            "<|eot_id|>",  # end of turn
        ] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]

# load_tiktoken_bpe needs the BPE file itself, not the checkpoint directory.
# ASSUMPTION: the file is called "tokenizer.model" inside the checkpoint folder.
mergeable_ranks = load_tiktoken_bpe(str(path / "tokenizer.model"))
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    # special tokens are numbered right after the regular BPE ranks
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)

# round-trip check:
# tokenizer.decode(tokenizer.encode("hello world!"))

You have read access to the file.
YAYYYY.


### Reading the model file

In [None]:
model = torch.load("Meta-Llama-3.2-1B-Instruct-QLORA_INT4_EO8/consolidated.00.pth")
print(json.dumps(list(model.keys())[:20], indent=4))

In [None]:
with open("Meta-Llama-3.2-1B-Instruct-QLORA_INT4_EO8/params.json", "r") as f:
    config = json.load(f)
config

### We use this config to infer details about the model, such as:
1. the model has XXXX transformer layers
2. each multi-head attention block has XXXX heads
3. the vocab size is XXXX

In [None]:
dim = config["dim"]
n_layers = config["n_layers"]
n_heads = config["n_heads"]
n_kv_heads = config["n_kv_heads"]
vocab_size = config["vocab_size"]
multiple_of = config["multiple_of"]
ffn_dim_multiplier = config["ffn_dim_multiplier"]
norm_eps = config["norm_eps"]
rope_theta = torch.tensor(config["rope_theta"])
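
a couple of sizes used later follow directly from these values. here is a minimal sketch, assuming a toy config: the numbers and the `toy_*` names are made up for illustration and are not read from `params.json`.

```python
# Toy config with made-up values; the real ones come from params.json.
toy_config = {"dim": 64, "n_heads": 8, "n_kv_heads": 2}

# Each attention head works on an equal slice of the embedding dimension.
toy_head_dim = toy_config["dim"] // toy_config["n_heads"]

# With fewer key/value heads than query heads, several query heads
# share one key/value head (grouped-query attention).
queries_per_kv_head = toy_config["n_heads"] // toy_config["n_kv_heads"]

print(toy_head_dim, queries_per_kv_head)
```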

### Converting text to tokens
Here we use tiktoken (an OpenAI library) as the tokenizer.

In [None]:
prompt = "the answer to the ultimate question of life, the universe, and everything is "
tokens = [128000] + tokenizer.encode(prompt)
print(tokens)
tokens = torch.tensor(tokens)
prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]
print(prompt_split_as_tokens)
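
the hard-coded `128000` is the ID of `<|begin_of_text|>`. in the tokenizer setup, special tokens are numbered right after the regular BPE ranks (`len(mergeable_ranks) + i`). the sketch below reproduces that numbering, assuming the BPE file contributes 128,000 regular tokens; `num_base_tokens` is a name made up for this example.

```python
# Assumption: the BPE file contributes 128,000 regular tokens.
num_base_tokens = 128_000

# First ten special tokens, in the same order as the tokenizer setup.
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",
]

# Special token i gets the ID num_base_tokens + i.
special_token_ids = {tok: num_base_tokens + i for i, tok in enumerate(special_tokens)}
bos_id = special_token_ids["<|begin_of_text|>"]
print(bos_id)
```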

### Converting tokens to their embedding
I'M SORRY, but this is the only part of the codebase where i use a built-in neural network module
anyway, our [XXXX] tokens are now [XXXX], i.e. one embedding per token (17 for this prompt), each of length `dim`

note: keep track of the shapes, it makes everything much easier to understand

In [None]:
embedding_layer = torch.nn.Embedding(vocab_size, dim)
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)
token_embeddings_unnormalized.shape
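
under the hood, the embedding module is just a row lookup: token ID `t` selects row `t` of the weight table. a dependency-free sketch with a made-up 4-token vocabulary (all `toy_*` names and values are illustrative only):

```python
# Made-up embedding table: 4 vocabulary entries, each of length 3.
toy_embedding_table = [
    [0.1, 0.2, 0.3],  # token id 0
    [0.4, 0.5, 0.6],  # token id 1
    [0.7, 0.8, 0.9],  # token id 2
    [1.0, 1.1, 1.2],  # token id 3
]
toy_tokens = [3, 0, 3]

# One row per token, so the result has shape [len(tokens), embedding_length].
toy_embeddings = [toy_embedding_table[t] for t in toy_tokens]
print(len(toy_embeddings), len(toy_embeddings[0]))
```

the same token ID always maps to the same row, which is why repeated words start out with identical embeddings.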

### Normalize the embedding using rms normalization
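note that after this step the shapes don't change, the values are just normalized
one thing to keep in mind: we need `norm_eps` (from the config) so that the quantity under the square root can never be 0, which would mean dividing by 0
here is the formula:

$$\text{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}} \, w_i$$

where $d$ is the embedding length, $\epsilon$ is `norm_eps` and $w$ is the learned norm weight.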

In [None]:
def rms_norm(tensor, norm_weights):
    # scale each vector by 1 / sqrt(mean(x^2) + eps) over the last dimension,
    # then multiply by the learned per-dimension weights
    inv_rms = torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)
    return (tensor * inv_rms) * norm_weights
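
to see the numbers move, here is a small worked example of the same computation on a single vector in plain Python. `rms_norm_list` and its default `eps` are made up for this sketch; the notebook takes `norm_eps` from the config.

```python
import math

def rms_norm_list(values, weights, eps=1e-5):
    # Mean of squares over the vector, plus eps for numerical safety.
    mean_sq = sum(v * v for v in values) / len(values)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    # Scale every element, then apply the per-dimension weight.
    return [v * scale * w for v, w in zip(values, weights)]

normed = rms_norm_list([3.0, 4.0], [1.0, 1.0])
# With unit weights the output has a root-mean-square very close to 1.
rms_after = math.sqrt(sum(v * v for v in normed) / len(normed))
print(normed, rms_after)
```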

### Building the first layer of the transformer
#### Normalization
you will see me accessing `layers.0` from the model dict (this is the first layer)
anyway, after normalizing, our shapes are still [XXXX], the same as the embedding but normalized

In [None]:
token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])
token_embeddings.shape

### Attention implemented from scratch
let's load the attention weights of the first layer of the transformer

> when we load the query, key, value and output weight matrices from the model, we notice the shapes to be [XXXX], [XXXX], [XXXX], [XXXX]
> at first glance this is weird, because ideally we want q, k, v and o for each head individually
> the authors of the code bundled them together because it makes it easy to parallelize the attention head multiplications.
> i'm going to unwrap everything...

In [None]:
print(
    model["layers.0.attention.wq.weight"].shape,
    model["layers.0.attention.wk.weight"].shape,
    model["layers.0.attention.wv.weight"].shape,
    model["layers.0.attention.wo.weight"].shape
)

### Unwrapping query
in the next section we will unwrap the queries from multiple attention heads, the resulting shape is [XXXX]

here, the first dimension is the number of attention heads (`n_heads`), the second is the size of each head's query vector (`head_dim`) and the third is the size of the token embedding (`dim`)

In [None]:
q_layer0 = model["layers.0.attention.wq.weight"]
head_dim = q_layer0.shape[0] // n_heads
q_layer0 = q_layer0.view(n_heads, head_dim, dim)
q_layer0.shape
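
the `.view(n_heads, head_dim, dim)` call only regroups rows: head `h` gets the `head_dim` consecutive rows starting at `h * head_dim`. a plain-Python sketch with a tiny made-up matrix (the `toy_*` sizes, `bundled` and `per_head` are illustrative names only):

```python
toy_n_heads, toy_head_dim, toy_dim = 2, 3, 4

# Made-up bundled weight with n_heads * head_dim rows and dim columns;
# entry [r][c] is r * 10 + c so each row is easy to recognise.
bundled = [[r * 10 + c for c in range(toy_dim)]
           for r in range(toy_n_heads * toy_head_dim)]

# Same grouping as .view(n_heads, head_dim, dim): consecutive blocks of rows.
per_head = [bundled[h * toy_head_dim:(h + 1) * toy_head_dim]
            for h in range(toy_n_heads)]

print(len(per_head), len(per_head[0]), len(per_head[0][0]))
```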

### I'm going to implement the first head of the first layer
here i access the query weight matrix of the first head of the first layer, the size of this query weight matrix is [XXXX]

In [None]:
q_layer0_head0 = q_layer0[0]
q_layer0_head0.shape

we now multiply the query weights with the token embeddings, to receive a query for each token
here you can see the resulting shape is [XXXX], this is because we have XXXX tokens and for each token there is a XXXX length query.

In [None]:
q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)
q_per_token.shape
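
to make the shapes concrete, here is the same kind of product written out by hand: every token row is dotted with every row of the head's weight matrix, which is what multiplying by the transpose does. the matrices and the helper `matmul_transposed` are made up for this sketch.

```python
def matmul_transposed(x, w):
    # x: [n_tokens, dim], w: [head_dim, dim]  ->  [n_tokens, head_dim]
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]

toy_x = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # 2 tokens, dim 3
toy_wq = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]  # head_dim 2, dim 3

toy_q = matmul_transposed(toy_x, toy_wq)
print(toy_q)  # [[3.0, 0.0], [2.0, 2.0]]
```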

### Positional encoding
we are now at a stage where we have a query vector for each token in our prompt, but if you think about it -- each individual query vector has no idea about its position in the prompt.

query: "the answer to the ultimate question of life, the universe, and everything is "

in our prompt we have used "the" three times, so we need the query vectors of all 3 "the" tokens to differ (each has size [1 x `head_dim`]) based on their positions in the prompt. we perform these rotations using RoPE (rotary positional embedding).

In [None]:
q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
q_per_token_split_into_pairs.shape

in the above step, we split the query vectors into pairs; next we apply a rotational angle shift to each pair!

we now have a tensor of size [XXXX], this is the XXXX length queries split into XXXX pairs for each token in the prompt! each of those XXXX pairs will be rotated by m*(theta), where m is the position of the token whose query we are rotating and theta is the frequency assigned to that pair!
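
as a warm-up, the rotation can be written with plain sines and cosines. this is a simplified sketch, assuming one made-up angle `theta` shared by every pair (the notebook uses a different frequency per pair); `rotate_pairs` is a name invented for the example.

```python
import math

def rotate_pairs(vec, position, theta):
    # Treat (vec[0], vec[1]), (vec[2], vec[3]), ... as 2D points and
    # rotate each one by position * theta radians.
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

toy_query = [1.0, 0.0, 0.0, 2.0]
# A quarter turn at position 1: (1, 0) moves to about (0, 1), (0, 2) to about (-2, 0).
rotated_q = rotate_pairs(toy_query, position=1, theta=math.pi / 2)
print([round(v, 6) for v in rotated_q])
```

at position 0 the angle is 0, so the first token's query is left unchanged.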

### Using multiplication of complex numbers to rotate a vector


In [None]:
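# fractions i / (head_dim // 2), one per pair of query dimensions
# (derived from head_dim instead of hard-coding 64, so it matches the query shape)
zero_to_one_split_into_parts = torch.arange(head_dim // 2) / (head_dim // 2)
zero_to_one_split_into_parts

In [None]:
freqs = 1.0 / (rope_theta ** zero_to_one_split_into_parts)
freqs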

In [None]:
# one row of angles per token position, one column per frequency
freqs_for_each_token = torch.outer(torch.arange(len(tokens)), freqs)
freqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)
freqs_cis.shape

# viewing row 3 of freqs_cis (the fourth token position), first 17 frequencies only
value = freqs_cis[3]
plt.figure()
for i, element in enumerate(value[:17]):
    plt.plot([0, element.real], [0, element.imag], color='blue', linewidth=1, label=f"Index: {i}")
    plt.annotate(f"{i}", xy=(element.real, element.imag), color='red')
plt.xlabel('Real')
plt.ylabel('Imaginary')
plt.title('Plot of one row of freqs_cis')
plt.show()
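
the frequency table is easier to read on a small case. the sketch below follows the same recipe (one frequency per pair, then position times frequency) in plain Python, assuming a made-up head size and base; the real base comes from `config["rope_theta"]`.

```python
toy_head_dim = 8          # made-up head size
toy_rope_theta = 10000.0  # made-up base; the real one comes from config["rope_theta"]
n_pairs = toy_head_dim // 2

# One frequency per pair: 1 / theta ** (i / n_pairs) for i = 0 .. n_pairs - 1.
toy_freqs = [1.0 / toy_rope_theta ** (i / n_pairs) for i in range(n_pairs)]

# Angle for position m and pair i is m * freq_i (an outer product).
toy_angles = [[m * f for f in toy_freqs] for m in range(3)]

print(toy_freqs)
print(toy_angles[2])
```

the first pair rotates fastest (frequency 1) and every later pair rotates more slowly.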

Now that we have a complex number (the angle change vector) for every element of every token's query,
we can convert our queries (the ones we split into pairs) into complex numbers and then multiply them element-wise to rotate each query based on its position
honestly this is beautiful to think about :)

In [None]:
q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)
q_per_token_as_complex_numbers.shape

In [None]:
q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis
q_per_token_as_complex_numbers_rotated.shape
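
why rotations are a good way to encode position: the dot product between a rotated query and a rotated key depends only on how far apart the two tokens are. here is a small check of that property using Python's `cmath`; the vectors, the frequencies and the helper names `rope_complex` and `real_dot` are all made up for this sketch.

```python
import cmath

def rope_complex(pairs, position, freqs):
    # pairs: one complex number per (x, y) pair of the vector.
    # Multiplying by exp(i * position * freq) rotates a pair without changing its length.
    return [p * cmath.exp(1j * position * f) for p, f in zip(pairs, freqs)]

def real_dot(a, b):
    # Ordinary dot product of the underlying real vectors.
    return sum((x * y.conjugate()).real for x, y in zip(a, b))

toy_freqs = [1.0, 0.1]
q = [complex(1.0, 2.0), complex(-0.5, 0.3)]
k = [complex(0.7, -1.0), complex(2.0, 0.1)]

# Both position pairs are 2 apart, so the scores should agree.
score_a = real_dot(rope_complex(q, 5, toy_freqs), rope_complex(k, 3, toy_freqs))
score_b = real_dot(rope_complex(q, 12, toy_freqs), rope_complex(k, 10, toy_freqs))
print(abs(score_a - score_b) < 1e-9)
```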

after the rotated vector is obtained,
we can get back our queries as pairs by viewing the complex numbers as real numbers again