# Building LLAMA 3 from scratch

In [1]:
import math
from collections import Counter

import datasets
import torch
import torch.nn as nn
import torch.nn.functional as F
from datasets import load_dataset
from gensim.parsing.preprocessing import preprocess_string
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.normalizers import NFD, Lowercase, StripAccents
from tokenizers.pre_tokenizers import Digits, Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordLevelTrainer
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
    PreTrainedModel,
    PretrainedConfig,
    Trainer,
    TrainingArguments,
)
from transformers.modeling_outputs import CausalLMOutputWithPast, CausalLMOutputWithCrossAttentions


This chapter guides you through the process of building a LLAMA 3 model from scratch using PyTorch and Hugging Face Transformers. While we won't replicate the full scale of LLAMA 3 due to computational constraints, you'll gain a solid understanding of the core concepts and implementation steps.

## Chapter Outline

1. Building block for a language model
2. N-gram language model
3. RNN neural network using attention
4. Transformer Architecture: The Foundation
5. Tokenization and Data Preparation
6. The Decoder-Only LLAMA Model
7. Training with PyTorch
8. (Optional) Using Hugging Face Transformers
9. (Optional) Converting it into a chat-like format

Imagine a young apprentice learning at the feet of Shakespeare, soaking in his vast knowledge word by word. Over time, the apprentice learns to anticipate the next word, the next phrase, the next line of verse. This is the essence of a language model - a system that predicts the next word, given the ones that come before.

In today's digital world, language models are powered by algorithms and trained on massive datasets of text, much like that apprentice studying Shakespeare's plays. They've become essential tools for a variety of tasks:

- Writing Assistance: They help us write emails, craft essays, and even generate creative content.
- Translation: They bridge language barriers by translating text from one language to another.
- Conversation: They power chatbots and voice assistants, engaging in conversations with us.

## Building block

At its core, a language model is a statistical model. It analyzes the patterns and probabilities of words occurring together in a vast corpus of text. The more data it's exposed to, the better it becomes at predicting the next word in a sequence.

Think of it like this:

- **Tokenization**: The model breaks down text into smaller units called tokens (words, punctuation, etc.).
- **Pattern Recognition**: It learns the relationships between these tokens, understanding which words are likely to follow others.
- **Prediction**: Given a sequence of words, it calculates the probability of different words coming next and chooses the most likely one.

### Different Flavors of Language Models:

1. **N-gram Models**: These simpler models look at a fixed number of previous words (a bigram model conditions on one previous word, a trigram model on two) to predict the next.
2. **Neural Network Models**: These more sophisticated models use artificial neural networks to capture complex patterns in language.
3. **Transformer Models**: The latest breakthrough, these models use attention mechanisms to weigh the importance of different words in a sequence, leading to remarkable performance.

## N-gram model

An n-gram is a sequence of n words. For instance, "please turn" and "turn your" are bigrams (2-grams), while "please turn your" is a trigram (3-gram). N-gram models estimate the probability of a word given the preceding n-1 words.
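To make the definition concrete, here is a minimal sketch that slides a window of size n over a token list. The helper name `ngrams` and the example sentence (which extends "please turn your" with two extra words) are illustrative assumptions, not part of the chapter's code.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace
tokens = "please turn your homework in".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # starts with ('please', 'turn'), ('turn', 'your')
print(trigrams)  # starts with ('please', 'turn', 'your')
```

A sentence of k tokens yields k - n + 1 n-grams, so the five tokens here give four bigrams and three trigrams.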

To calculate the probability of a word w given a history h, we can use relative frequency counts from a large corpus:

$P(w|h) = C(hw) / C(h)$

where:

- P(w|h) is the probability of word w given history h.
- C(hw) is the count of the sequence hw in the corpus.
- C(h) is the count of the history h in the corpus.

However, this approach is limited due to the vastness and creativity of language. Many possible word sequences might not exist in even the largest corpus.
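The relative-frequency estimate and its main weakness can both be seen in a few lines. The sketch below is an illustration under assumptions: the toy corpus and the helper names `count_ngrams` and `relative_frequency` are made up for this example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every contiguous n-token window in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def relative_frequency(word, history, tokens):
    """Estimate P(word | history) as C(history + word) / C(history)."""
    history = tuple(history)
    c_h = count_ngrams(tokens, len(history))[history]
    c_hw = count_ngrams(tokens, len(history) + 1)[history + (word,)]
    return c_hw / c_h if c_h else 0.0

# Hypothetical toy corpus
tokens = "please turn your homework in please turn your phone off".split()

p_seen = relative_frequency("homework", ["turn", "your"], tokens)
p_unseen = relative_frequency("laptop", ["turn", "your"], tokens)

print(p_seen)    # "turn your" occurs twice, once followed by "homework"
print(p_unseen)  # "turn your laptop" never occurs, so the estimate is zero
```

The second result shows the sparsity problem: a perfectly reasonable continuation receives probability zero simply because it never appeared in the corpus.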

### Bi-gram model using PyTorch

Here, we will implement a bigram model using PyTorch. Although simple, a bigram model has surprising predictive power and can capture meaningful patterns in text data.

### Bigram model

A bigram model operates on a fundamental premise: the probability of a word appearing in a text sequence is heavily influenced by the word that precedes it. By analyzing a large corpus of text, we can calculate the conditional probabilities of word pairs. For instance, the probability of encountering the word 'morning' given the preceding word 'good' is relatively high.

Let's illustrate this concept with an example using the following text corpus:

"The cat sat on the mat. The dog chased the cat"

**1. Tokenization**
- Split the corpus into individual words: ["the", "cat", "sat", "on", "the", "mat", "the", "dog", "chased", "the", "cat"]

**2. Create bi-gram pairs**
- Pair consecutive words: [("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the"), ("the", "mat"), ("mat", "the"), ("the", "dog"), ("dog", "chased"), ("chased", "the"), ("the", "cat")]

**3. Calculate probabilities**
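- Count the occurrences of each bi-gram pair
- Calculate the probability of the second word given the first word,
  e.g. $P(\text{cat} \mid \text{the}) = 2/4 = 0.5$, since "the" is followed by another word four times and by "cat" in two of them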
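The three steps can be checked by hand with plain Python before moving to tensors. This is a small sketch using `collections.Counter`; the helper name `bigram_prob` is an assumption introduced for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog chased the cat"
words = corpus.split()

# Step 2: pair each word with its successor
pair_counts = Counter(zip(words, words[1:]))

# Only words that have a successor can start a bigram
first_counts = Counter(words[:-1])

def bigram_prob(first, second):
    """P(second | first) from relative frequencies."""
    return pair_counts[(first, second)] / first_counts[first]

# Step 3: the full distribution over words that follow "the"
after_the = {w: bigram_prob("the", w) for w in ("cat", "mat", "dog")}
print(after_the)
```

The probabilities of all words following "the" sum to one, which is the property the normalization step in the PyTorch version relies on.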

### Pytorch implementation

In [2]:

import torch

corpus = "the cat sat on the mat the dog chased the cat"
words = corpus.split()
vocab = list(set(words))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

# Build the bi-gram count matrix: rows index the first word, columns the word that follows it
bigram_counts = torch.zeros((len(vocab), len(vocab)))

for i in range(len(words)-1):
    bigram_counts[word_to_idx[words[i]], word_to_idx[words[i+1]]] += 1

# Normalize to get probabilities
bigram_probs = bigram_counts / bigram_counts.sum(dim=1, keepdim=True)

With our bi-gram model in hand, we can now generate text by repeatedly sampling the next word from the current word's row of probabilities.


In [3]:
import random

def generate_text(start_word, length):
    generated_text = [start_word]
    current_word = start_word

    for _ in range(length-1):
        next_word_idx = torch.multinomial(bigram_probs[word_to_idx[current_word]], 1).item()
        next_word = vocab[next_word_idx]
        generated_text.append(next_word)
        current_word = next_word

    return " ".join(generated_text)

print(generate_text("cat", 5))  # Example output (sampling is random): "cat sat on the dog"

cat sat on the dog


### Limitations & Enhancements
While our bigram model demonstrates the concept, it has limitations due to its simplicity: it conditions only on the single previous word. Real-world text generation often requires more sophisticated models like Recurrent Neural Networks (RNNs) or Transformers. However, the bi-gram model serves as a foundational stepping stone for understanding the underlying principles of text generation.

In the next section, we will delve into more advanced techniques and explore how to build upon this basic model to create more sophisticated text generation systems.

## Recurrent Neural Network

Imagine reading a book. You don't start from scratch with each word; you carry the context of previous sentences in your mind. RNNs emulate this behavior by maintaining a hidden state that evolves as it processes each word in a sequence. This hidden state acts as a memory, encoding information from previous time steps, allowing the model to make predictions based on both the current input and accumulated context.


### A simple RNN structure

At its core, an RNN consists of a repeating unit (cell) that takes two inputs: the current word and the previous hidden state. It produces two outputs: an updated hidden state and a prediction for the next word. This structure allows the RNN to process sequences of arbitrary length, making it suitable for text generation.

In [4]:
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input)
        output, hidden = self.rnn(embedded, hidden)
        output = self.linear(output)
        return output, hidden
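A quick shape check helps show what goes in and what comes out. The sketch below repeats the `SimpleRNN` definition so it runs on its own; the sizes are arbitrary values chosen for illustration. Note that `nn.RNN` expects input shaped `(seq_len, batch, features)` unless `batch_first=True` is set.

```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    # Same definition as in the cell above, repeated so this snippet is self-contained
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, input, hidden):
        embedded = self.embedding(input)
        output, hidden = self.rnn(embedded, hidden)
        output = self.linear(output)
        return output, hidden

# Arbitrary illustrative sizes
vocab_size, hidden_size, seq_len, batch = 10, 8, 5, 2

model = SimpleRNN(vocab_size, hidden_size)
tokens = torch.randint(0, vocab_size, (seq_len, batch))  # token ids, (seq_len, batch)
hidden = torch.zeros(1, batch, hidden_size)              # initial hidden state, one layer

logits, hidden = model(tokens, hidden)
print(logits.shape)  # torch.Size([5, 2, 10]): one score per vocabulary word at each step
print(hidden.shape)  # torch.Size([1, 2, 8]): the final hidden state
```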

### Training and text generation

Training an RNN involves feeding it sequences of text and adjusting its parameters to minimize the difference between its predictions and the actual next words. Once trained, we can generate text by providing a starting word and iteratively sampling from the model's output distribution.
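As a rough illustration of that loop, the sketch below trains a tiny RNN on the cat-and-dog toy corpus from the bigram section and then samples from it. The class name `TinyRNN`, the hidden size, the optimizer, and the step count are all assumptions made for this example; it is not the training setup used later in the chapter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

words = "the cat sat on the mat the dog chased the cat".split()
vocab = sorted(set(words))
word_to_idx = {w: i for i, w in enumerate(vocab)}
ids = torch.tensor([word_to_idx[w] for w in words])

class TinyRNN(nn.Module):
    """Hypothetical minimal model: embedding -> RNN -> linear."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.RNN(hidden_size, hidden_size)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, input, hidden=None):
        output, hidden = self.rnn(self.embedding(input), hidden)
        return self.linear(output), hidden

model = TinyRNN(len(vocab), 16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

inputs = ids[:-1].unsqueeze(1)  # (seq_len, batch=1): every word except the last
targets = ids[1:]               # (seq_len,): the word that follows each input

losses = []
for step in range(200):
    logits, _ = model(inputs)                            # (seq_len, 1, vocab)
    loss = F.cross_entropy(logits.squeeze(1), targets)   # compare with actual next words
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

def sample(start_word, length):
    """Generate text by feeding each sampled word back in, carrying the hidden state."""
    generated = [start_word]
    token = torch.tensor([[word_to_idx[start_word]]])    # (1, 1)
    hidden = None
    with torch.no_grad():
        for _ in range(length - 1):
            logits, hidden = model(token, hidden)
            probs = F.softmax(logits[-1, 0], dim=-1)
            next_idx = torch.multinomial(probs, 1).item()
            generated.append(vocab[next_idx])
            token = torch.tensor([[next_idx]])
    return " ".join(generated)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(sample("the", 6))
```

On a corpus this small the model mostly memorizes the sequence, but the structure (shifted targets, cross-entropy loss, sampling with a carried hidden state) is the same at any scale.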

### Attention Mechanism: Focus where it matters

A crucial enhancement to RNNs is the attention mechanism. In text generation, not all parts of the input sequence are equally important for predicting the next word. Attention allows the model to focus on relevant parts of the input while making predictions. It's like shining a spotlight on specific words or phrases that are most informative for the current context.

Hugging Face models need a config object that holds the hyperparameters used to instantiate the model.

In [5]:
class AttentionConfig(PretrainedConfig):
    model_type = "custom_attention"
    def __init__(
        self,
        vocab_size=50257,
        hidden_size=124,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size


Here we define our RNN model with attention. Attention works by allowing a model to focus on different parts of its input based on the relevance of each part to the task at hand. In essence, it dynamically weights the input elements to emphasize the most important ones for the current context.

In [None]:
class Attention(nn.Module):
    def __init__(self, query_dim, key_dim, value_dim):
        super().__init__()
        self.scale = 1./math.sqrt(query_dim)

    def forward(self, query, keys, values):
        #query = query.unsqueeze(1)
        keys = keys.transpose(1,2)
        attention = torch.bmm(query, keys)
        attention = F.softmax(attention.mul_(self.scale), dim=2)
        weighted_values = torch.bmm(attention, values).squeeze(1)
        return attention, weighted_values

**Forward Pass**

This defines the core functionality of the attention module: how it processes input during the model's forward pass. It takes three input tensors:

- `query` : Query vector ( what model is looking for )
- `keys` : A set of key vectors ( what model can attend to )
- `values` : The values associated with each key. ( what model will retrieve )

The attention mechanism works like a spotlight that helps a computer model focus on the most important parts of its input.

- First, it measures how closely a "query" (what the model is looking for) matches different "keys" (the parts of the input it can focus on). This gives us a bunch of scores.

- Next, these scores are turned into probabilities, making sure they add up to one.  These probabilities show how much the model should focus on each key.

- Then, the model combines information from each key, but gives more weight to the keys with higher probabilities. This is like putting a stronger spotlight on the more relevant parts.

Finally, the model outputs both these weighted values (the information it focused on) and the probabilities themselves, so we can see what the model considered important.
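The same steps can be traced on small random tensors. This is a minimal sketch of scaled dot-product attention for a single query; the tensor sizes are arbitrary and chosen only to keep the shapes easy to read.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, dim = 1, 4, 8   # arbitrary illustrative sizes

query = torch.randn(batch, 1, dim)         # what the model is looking for
keys = torch.randn(batch, seq_len, dim)    # what the model can attend to
values = torch.randn(batch, seq_len, dim)  # what the model retrieves

# Step 1: similarity scores between the query and every key, scaled by sqrt(dim)
scores = torch.bmm(query, keys.transpose(1, 2)) / math.sqrt(dim)  # (1, 1, 4)

# Step 2: turn the scores into probabilities that sum to one
weights = F.softmax(scores, dim=2)

# Step 3: weighted sum of the values
context = torch.bmm(weights, values).squeeze(1)  # (1, 8)

print(weights)        # one weight per key
print(weights.sum())  # sums to 1 (up to floating-point error)
print(context.shape)  # torch.Size([1, 8])
```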

**<TODO: Add diagram>**

Here is the model definition for the RNN with a self-attention mechanism.

In [None]:

class RNNWithAttention(PreTrainedModel):
    config_class = AttentionConfig

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.rnn = nn.RNN(config.hidden_size, config.hidden_size, batch_first=True)
        self.attention = Attention(config.hidden_size, config.hidden_size, config.hidden_size)
        self.linear = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):

        batch_size, seq_length = input_ids.shape
        embedded = self.embedding(input_ids)

        rnn_output, _ = self.rnn(embedded)
        outputs= []

        attention=None
        for t in range(seq_length):
            # get current hidden state
            hidden = rnn_output[:, t,:].unsqueeze(1) # [Bx1xH]

            # apply attention
            attention, weighted_value = self.attention(
                    hidden,
                    rnn_output[:,:t+1,:],
                    rnn_output[:,:t+1,:]
            )

            # generate output
            output = self.linear(weighted_value)
            outputs.append(output)

        logits = torch.stack(outputs,dim=1)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            shift_logits = logits[...,:-1,:].contiguous()
            shift_labels = labels[...,1:].contiguous()
            loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

        return CausalLMOutputWithCrossAttentions(
            loss=loss,
            logits=logits,
            attentions=attention
        )

    def prepare_inputs_for_generation(self, input_ids, past=None, **kwargs):
        return {
            "input_ids": input_ids
        }

    def manual_generate(
            self, 
            input_text, 
            tokenizer, 
            max_length=50,
            temperature=1.0,
        ):
            """
            Generates text based on the provided input.
    
            Args:
                input_text (str): The initial text to start generation from.
                tokenizer: The HuggingFace tokenizer for the model's vocabulary.
                max_length (int, optional): The maximum length of generated text. Defaults to 50.
                temperature (float, optional): Controls the randomness of the generated text. 
                    Higher values make the output more random. Defaults to 1.0.
    
            Returns:
                str: The generated text sequence.
            """
    
            input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(self.device)
    
            generated_sequence = input_ids[0]
            for _ in range(max_length - len(input_ids[0])):
                with torch.no_grad():
                    output = self.forward(generated_sequence.unsqueeze(0))
                    logits = output.logits[0, -1, :] / temperature 
                    probs = torch.softmax(logits, dim=-1)
                    next_token = torch.multinomial(probs, num_samples=1)
                    generated_sequence = torch.cat((generated_sequence, next_token), dim=0)
    
            return tokenizer.decode(generated_sequence, skip_special_tokens=True)
    @staticmethod
    def _reorder_cache(past, beam_idx):
        return past
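The loss computation in `forward` shifts logits and labels by one position, so that the prediction made at position t is scored against the token at position t+1. The sketch below isolates that step with made-up token ids and random logits standing in for real model output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 6                                  # arbitrary illustrative vocabulary size

input_ids = torch.tensor([[2, 5, 1, 3]])        # (batch=1, seq_len=4), made-up ids
labels = input_ids.clone()                      # for causal LM, labels start as a copy of the inputs
logits = torch.randn(1, 4, vocab_size)          # stand-in for the model's output

shift_logits = logits[..., :-1, :].contiguous() # predictions made at positions 0..2
shift_labels = labels[..., 1:].contiguous()     # the tokens that actually came next

loss = nn.CrossEntropyLoss()(
    shift_logits.view(-1, vocab_size),          # (3, vocab_size)
    shift_labels.view(-1),                      # (3,)
)

print(shift_labels)  # tensor([[5, 1, 3]])
print(loss.item())
```

The last position's logits are dropped because there is no following token to compare them with, and the first label is dropped because nothing predicts it.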
        

### Training

To train our model on the intricacies of language, we'll leverage the Hugging Face Trainer API. We'll use WikiText-2, a publicly available dataset of text extracted from Wikipedia articles. Our goal is to learn the structure of the language with the RNN.

In [47]:
# Load and preprocess the dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokenizer.pad_token = tokenizer.eos_token

To tokenize, we use the pretrained GPT-2 tokenizer loaded above through Hugging Face's `AutoTokenizer`. It parses the raw text and converts it into token IDs, truncating each example to at most 128 tokens.

In [8]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)


tokenized_datasets = dataset.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

This step is specific to Hugging Face: every model we train needs a config, which holds the parameters used to initialize the model.

In [77]:
# Create the model and configure training
config = AttentionConfig()
model = RNNWithAttention(config)

In [78]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    warmup_steps=100,
    logging_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    #gradient_checkpointing=True,
    fp16=True,
    learning_rate=1e-2,
    optim="adafactor"
    
)

In [52]:
# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)


In [53]:
# Train the model
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100 | 5.0461 |
| 200 | 5.247 |
| 300 | 5.1698 |
| 400 | 4.9182 |
| 500 | 5.0039 |
| 600 | 4.9054 |
| 700 | 4.6454 |
| 800 | 4.7588 |
| 900 | 4.631 |
| 1000 | 4.4495 |


TrainOutput(global_step=2870, training_loss=4.304410098404834, metrics={'train_runtime': 462.0655, 'train_samples_per_second': 794.649, 'train_steps_per_second': 6.211, 'total_flos': 1780264886400000.0, 'train_loss': 4.304410098404834, 'epoch': 10.0})
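The reported step count can be cross-checked against the other numbers in this output. The arithmetic below is a sketch that assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; under those assumptions the figures are consistent.

```python
import math

# Values copied from the TrainOutput above and from TrainingArguments
train_runtime = 462.0655        # seconds
samples_per_second = 794.649
num_epochs = 10
batch_size = 128

# Throughput times runtime gives the total number of examples processed
samples_seen = samples_per_second * train_runtime
examples_per_epoch = round(samples_seen / num_epochs)

# Assumption: one optimizer step per batch, last partial batch included
steps_per_epoch = math.ceil(examples_per_epoch / batch_size)
total_steps = steps_per_epoch * num_epochs

print(examples_per_epoch)  # 36718
print(steps_per_epoch)     # 287
print(total_steps)         # 2870, matching global_step
```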

In [54]:
trainer.save_model('bin/model_128_006')

In [55]:
model = RNNWithAttention.from_pretrained('bin/model_128_006')

### Generation

In [56]:
from transformers import AutoTokenizer, pipeline


In [57]:
generated_texts = model.manual_generate(
    "The quick brown fox", tokenizer, max_length=100
)
print(generated_texts)

The quick brown fox, the long @-@ brick terrace, Ders conceded varsity coach. He did this lives, as well asigned and roll, microorganisms, specifically females leave the term F @-@ for death in a Comedy Series, in the United States's appointment at Artistsers, John photography Blues Lane – a daughter and Co @-@ nation tropes often referred to as the mainchandised Princess aviation, and had to purchase a heavier solution will to rarity with


In [69]:
generate_text = pipeline("text-generation", model=model, tokenizer=tokenizer)

The model 'RNNWithAttention' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 

In [70]:
# generate text
result = generate_text("The quick brown fox", max_length=50, do_sample=True, top_k=50, temperature=0.7)

In [71]:
print(result[0]['generated_text'])

The quick brown foxes, although the wild of the prerogative is usually harmful to the consumption of " Like the song " the " best @-@ pop song ", and " Ode to Psyche ", which ", she has
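The `top_k` and `temperature` arguments shape how the next token is drawn. The function below is a conceptual sketch of those two ideas, not the library's actual implementation; the name `sample_top_k` and the example logits are assumptions for illustration.

```python
import torch

def sample_top_k(logits, top_k=50, temperature=0.7):
    """Sample one token id after temperature scaling and top-k filtering."""
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = logits / temperature
    # Keep only the k highest-scoring candidates
    top_values, top_indices = torch.topk(scaled, k=min(top_k, scaled.size(-1)))
    # Renormalize over the survivors and draw one of them
    probs = torch.softmax(top_values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice].item()

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])  # made-up scores for a 5-token vocabulary

# With top_k=2 only the two best tokens (ids 0 and 1) can ever be chosen
picks = [sample_top_k(logits, top_k=2) for _ in range(20)]
print(picks)
```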


In [72]:
eval_f = trainer.evaluate()

In [73]:
eval_f

{'eval_loss': 6.330234527587891,
 'eval_runtime': 1.4481,
 'eval_samples_per_second': 2596.464,
 'eval_steps_per_second': 20.716,
 'epoch': 10.0}

In [74]:
import math

In [75]:
perplexity_f = math.exp(eval_f['eval_loss'])

In [76]:
perplexity_f

561.2882159892837
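Perplexity is the exponential of the average cross-entropy loss (measured in nats), so it can be read as the effective number of equally likely choices the model is hesitating between at each token. The sketch below reproduces the number above and, as a reference point, computes the perplexity of a model that guesses uniformly over the 50,257-token vocabulary set in `AttentionConfig`.

```python
import math

eval_loss = 6.330234527587891          # from trainer.evaluate() above
perplexity = math.exp(eval_loss)

vocab_size = 50257                     # default vocab_size in AttentionConfig
uniform_loss = math.log(vocab_size)    # cross-entropy of a uniform guess
uniform_perplexity = math.exp(uniform_loss)

print(perplexity)          # about 561.29
print(uniform_perplexity)  # about 50257, i.e. the vocabulary size
```

A perplexity of roughly 561 is far better than the uniform baseline, but still well short of a fluent model, which matches the quality of the generated samples.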

## Transformer: Architecture

In the earlier section, we saw a simple RNN struggle to learn the language, while adding attention gave the language model a substantial boost. Still, an RNN with self-attention has a few drawbacks:

- **Computation time**
- **Only one self attention head**
- **Vanishing gradient**

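The Transformer architecture addresses these flaws. It keeps the attention mechanism but removes the recurrence: instead of passing a hidden state from one step to the next, it relates every position to every other position directly through self-attention, so a whole sequence can be processed in parallel.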
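The computation-time drawback can be made concrete. In the RNN-with-attention model, attention is applied one time step at a time inside a Python loop; the same causal attention can be computed for all positions at once with a single masked matrix product. The sketch below compares the two on random states. It is a simplification: a real Transformer layer also uses learned query, key, and value projections and several heads, none of which appear here.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, dim = 2, 5, 8            # arbitrary illustrative sizes
states = torch.randn(batch, seq_len, dim)
scale = 1.0 / math.sqrt(dim)

# Loop version: each position attends to itself and everything before it
looped = []
for t in range(seq_len):
    query = states[:, t:t + 1, :]        # (batch, 1, dim)
    keys = states[:, :t + 1, :]          # (batch, t+1, dim)
    weights = F.softmax(torch.bmm(query, keys.transpose(1, 2)) * scale, dim=2)
    looped.append(torch.bmm(weights, keys).squeeze(1))
looped = torch.stack(looped, dim=1)      # (batch, seq_len, dim)

# Parallel version: score all pairs at once, then hide future positions
scores = torch.bmm(states, states.transpose(1, 2)) * scale          # (batch, seq_len, seq_len)
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))                   # future weights become 0
parallel = torch.bmm(F.softmax(scores, dim=2), states)

print(torch.allclose(looped, parallel, atol=1e-5))
```

Both versions produce the same result, but the second has no sequential dependency between positions, which is what lets this style of attention run efficiently on parallel hardware.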