Building the base LLM, which will then be fine-tuned with LoRA + quantization:

Choosing a base model for the job:

What are the criteria for selecting a base LLM?
- We are going to start with Meta's Llama 3.2 1B Instruct

## Loading Models and Basic Inference

In [2]:
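from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.optim import AdamW  # transformers.AdamW is deprecated; the PyTorch optimizer is its replacement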
from transformers import pipeline
import torch

model_name = "meta-llama/Llama-3.2-1B-Instruct"
device = 'mps' # Use 'cuda' for an NVIDIA GPU, 'mps' for Apple Silicon Macs, or 'cpu' for CPU




Before going to the next step, make sure you are logged in to Hugging Face on the web and have created an access token. Gated models such as Llama can only be downloaded once you have done that and have been granted access on the model's page.

After generating and copying the access token from the Hugging Face website,
go to the VS Code terminal and run "huggingface-cli login",

then paste your access token.

Then you can run the code below! :)

In [3]:
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
tokenizer.pad_token = tokenizer.eos_token  # Set padding token as EOS token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map=device)
#took 48.2s to download the model and load llama 3.2 1b instruct

A causal LM predicts the NEXT TOKEN (a word or a piece of a word) given the preceding tokens of the sequence.

## Pipeline

In [4]:
generate_pipeline = pipeline(task='text-generation',
                             model=model,
                             tokenizer=tokenizer)

generate_pipeline("What is the capital of India?", max_new_tokens=25)
#max_new_tokens = maximum number of tokens (not words) to generate in the output

[{'generated_text': 'What is the capital of India? New Delhi?\nYes, that is correct! New Delhi is indeed the capital of India. I should have been more specific in'}]

##### Batch Generation

In [5]:
generate_pipeline(["What is the capital of India?", 
                   "What is the capital of USA?"], max_new_tokens=25)
#If a list of prompts is given, max_new_tokens is applied to each prompt in the list.

[[{'generated_text': "What is the capital of India? New Delhi?\nYes, that's correct! New Delhi is indeed the capital of India. It's a city located in the"}],
 [{'generated_text': 'What is the capital of USA? Washington\nWhat is the capital of the United States?\nWashington, D.C. (short for District of Columbia) is the'}]]

### What is happening inside the Pipeline?

In [6]:
#without padding

input_prompt = ["What is the capital of India?",
                "What is the capital of USA?"]

#tokenizers convert each input string into a list of integers (token IDs) that is used as input to the model
tokenized = tokenizer(input_prompt, return_tensors='pt').to(device)
'''
return_tensors = 'pt' -> returns PyTorch tensors
return_tensors = 'tf' -> returns TensorFlow tensors
return_tensors = 'np' -> returns NumPy arrays
return_tensors = None -> returns list of python integers
'''

print(tokenized)

{'input_ids': tensor([[128000,   3923,    374,    279,   6864,    315,   6890,     30],
        [128000,   3923,    374,    279,   6864,    315,   7427,     30]],
       device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')}


Both input sentences have been converted into lists of integers, and because both lists have the same length (8 tokens) they can be stacked into a single tensor. If the inputs had tokenized to different lengths, the tokenizer could not build a rectangular tensor and would raise an error. That is why we need padding: the tokenizer fills the shorter sequences with placeholder tokens so that every row has the same length.

In [7]:
tokenized['input_ids'].shape

torch.Size([2, 8])

In [8]:
#with padding

input_prompt = ["What is the capital of India?",
                "What is the capital of USA and Canada?"]

#tokenizers convert each input string into a list of integers (token IDs) that is used as input to the model
tokenized = tokenizer(input_prompt, padding=True, return_tensors='pt').to(device)

print(tokenized)

{'input_ids': tensor([[128009, 128009, 128000,   3923,    374,    279,   6864,    315,   6890,
             30],
        [128000,   3923,    374,    279,   6864,    315,   7427,    323,   7008,
             30]], device='mps:0'), 'attention_mask': tensor([[0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')}


In [9]:
tokenized['input_ids'].shape

torch.Size([2, 10])

In the demonstration above we used padding. Because the 2nd string is longer, the 1st string needs padding to fill the missing positions, so pad tokens (here the EOS token, ID 128009) are added on the left, because we set padding_side='left' when defining the tokenizer.
We can also see that the shape of the inputs has changed from [2, 8] to [2, 10].

Now, let's convert the integers back to strings to see what the padded text looks like.

In [10]:
tokenizer.batch_decode(tokenized['input_ids']) #this will convert the list of integers back to string in batch

['<|eot_id|><|eot_id|><|begin_of_text|>What is the capital of India?',
 '<|begin_of_text|>What is the capital of USA and Canada?']

Padding filled the empty positions of the first sentence. This lets both prompts be stacked into one rectangular tensor, so the model can process them together in a single batched matrix multiplication.

In [11]:
tokenized

{'input_ids': tensor([[128009, 128009, 128000,   3923,    374,    279,   6864,    315,   6890,
             30],
        [128000,   3923,    374,    279,   6864,    315,   7427,    323,   7008,
             30]], device='mps:0'), 'attention_mask': tensor([[0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')}

In [12]:
tokenized.keys()

dict_keys(['input_ids', 'attention_mask'])

In [13]:
tokenized['attention_mask']

tensor([[0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')

The attention_mask tells the model NOT to attend to the padding tokens. As we saw, 2 EOS tokens were added to the first sentence as padding, and these will be ignored by the model: positions marked 1 are attended to, positions marked 0 are not.
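The left-padding and mask construction above can be reproduced by hand. The following is a minimal pure-Python sketch, not the tokenizer's actual implementation; `pad_left` is a hypothetical helper name, and the token IDs are copied from the tokenizer output above.

```python
def pad_left(sequences, pad_id):
    """Left-pad lists of token IDs to a common length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        # 0 marks padding (ignored by attention), 1 marks real tokens
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Token IDs taken from the tokenizer output above
india = [128000, 3923, 374, 279, 6864, 315, 6890, 30]
usa_canada = [128000, 3923, 374, 279, 6864, 315, 7427, 323, 7008, 30]

ids, mask = pad_left([india, usa_canada], pad_id=128009)
print(ids[0])
print(mask[0])
```

The first row gains two pad IDs on the left and two zeros in its mask, while the longer row is left untouched.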

#### Chat Templates

Instruction tuning: many language models are fine-tuned to follow user instructions in a chat-like format.

Hence, we are going to use "apply_chat_template()", which converts a prompt from the chat-message format (a list of role/content dictionaries) into a single string sequence.
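As a rough illustration of that conversion, here is a simplified pure-Python sketch of the Llama 3 style layout. `render_chat` is a hypothetical helper, not the tokenizer's real template; in particular it omits the "Cutting Knowledge Date" and "Today Date" lines that the real template adds to the system block.

```python
def render_chat(messages, add_generation_prompt=True):
    """Simplified rendering of a Llama 3 style chat layout (assumption: no date lines)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += message["content"].strip() + "<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model's next tokens are its reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "system", "content": "You are a smart AI assistant who speaks like Shakespeare."},
    {"role": "user", "content": "where does the sun set?"},
]
print(render_chat(messages))
```

The real method should always be preferred in practice, since each model family ships its own template.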

In [14]:
prompt = [
    {
        "role": "system",
        "content": "You are a smart AI assistant who speaks like Shakespeare."
     },
     {
        "role": "user",
        "content": "where does the sun set?" 
     }
]

tokenized = tokenizer.apply_chat_template(prompt, 
                                          add_generation_prompt = True,
                                          tokenize = True, #True to get token IDs (a tensor here), False to get the formatted string
                                          padding = True,
                                          return_tensors = 'pt').to(device)

tokenized

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,  10263,    220,   2366,     19,    271,   2675,    527,
            264,   7941,  15592,  18328,    889,  21881,   1093,  42482,     13,
         128009, 128006,    882, 128007,    271,   2940,   1587,    279,   7160,
            743,     30, 128009, 128006,  78191, 128007,    271]],
       device='mps:0')

Output when the tokenize is set to "False":

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a smart AI assistant who speaks like Shakespeare.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhere does the sun set?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

In [15]:
out = model.generate(tokenized, max_new_tokens=20)

decoded = tokenizer.batch_decode(out)
decoded

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a smart AI assistant who speaks like Shakespeare.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhere does the sun set?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nFair mortal, thou dost ask a query most divine,\nConcerning the setting of the sun's celestial"]

In [16]:
prompt = [
    {
        "role": "system",
        "content": "You are a smart AI assistant who speaks like Shakespeare."
     },
     {
        "role": "user",
        "content": "where does the sun set?" 
     },
     {
         "role": "assistant",
         "content": "My liege" #sometimes, we want our assistant to start from a specific word
     }
]

tokenizer.pad_token = tokenizer.eos_token
tokenized = tokenizer.apply_chat_template(prompt,
                                          add_generation_prompt = False, #if True, appends an assistant header so the model starts a new reply; leave it False when continuing an existing assistant message
                                          continue_final_message = True, #continue from the end of the final (assistant) message, i.e. make the LM generate text that begins with that message
                                          tokenize = True, #if this is False, you can't call .to(device) because the result is a string
                                          padding = True,
                                          return_tensors = 'pt').to(device)

tokenized

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,  10263,    220,   2366,     19,    271,   2675,    527,
            264,   7941,  15592,  18328,    889,  21881,   1093,  42482,     13,
         128009, 128006,    882, 128007,    271,   2940,   1587,    279,   7160,
            743,     30, 128009, 128006,  78191, 128007,    271,   5159,  10457,
            713, 128009]], device='mps:0')

- at __add_generation_prompt = True__ -> '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a smart AI assistant who speaks like Shakespeare.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhere does the sun set?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nMy liege<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

- at __add_generation_prompt = False__ -> '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a smart AI assistant who speaks like Shakespeare.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhere does the sun set?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nMy liege<|eot_id|>'

In [17]:
out = model.generate(tokenized, max_new_tokens=20)

decoded = tokenizer.batch_decode(out)
decoded[0]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a smart AI assistant who speaks like Shakespeare.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhere does the sun set?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nMy liege<|eot_id|>'\n\nThe sun doth set where the horizon meeteth the sky,\nA fiery ball of glory,"

### Next Word Prediction

In [18]:
# here we are trying to predict the next word after the sentence "Hello how are"

text = "Hello how are"
input_ids = tokenizer([text], return_tensors='pt')['input_ids'].to(device)
out = model(input_ids = input_ids)

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


In [19]:
input_ids

tensor([[128000,   9906,   1268,    527]], device='mps:0')

In [20]:
out

CausalLMOutputWithPast(loss=None, logits=tensor([[[ 2.8438,  3.5781,  7.0312,  ..., -1.2422, -1.2422, -1.2422],
         [19.1250,  3.7812,  3.7031,  ..., -1.1016, -1.1016, -1.1016],
         [ 9.3750,  5.8750,  4.0000,  ..., -0.2344, -0.2344, -0.2344],
         [ 9.9375,  6.3125,  1.7188,  ...,  0.4160,  0.4160,  0.4160]]],
       device='mps:0', grad_fn=<ToCopyBackward0>), past_key_values=((tensor([[[[ 0.1758,  0.1270, -0.0110,  ..., -1.4219,  1.2344, -0.4062],
          [-0.4219,  0.4062, -1.0312,  ...,  1.4844, -1.1875, -1.8047],
          [-4.7812, -1.5469, -2.4688,  ...,  3.1406, -2.4062, -2.8906],
          [-4.7188, -2.7812, -2.8438,  ...,  2.1406, -1.8125, -0.5117]],

         [[ 0.0060, -0.0552,  0.0352,  ...,  1.1953, -0.1787, -0.0236],
          [ 2.5625, -2.3125,  2.2969,  ...,  2.0000,  2.0938, -0.7188],
          [ 2.0938, -0.9141,  3.7500,  ..., -0.2441,  1.3047, -3.2656],
          [ 0.0874, -0.2373,  0.3750,  ..., -0.1367,  1.3906, -0.9141]],

         [[ 0.0522, -0.0

In [21]:
out.logits.shape
# 1 is the batch size
# 4 is the number of tokens in "input_ids": <|begin_of_text|> plus the three tokens of "Hello how are"
# 128256 is the vocab size of the model

torch.Size([1, 4, 128256])

In [22]:
out.logits[0,-1][12800]

tensor(6.3750, device='mps:0', grad_fn=<SelectBackward0>)

Logits are the raw, unnormalized scores that the model assigns to every vocabulary entry as a candidate for the next token in the input sequence.

The softmax operation converts the logits into a probability distribution, i.e. what is the probability of a given word/subword being the next token?
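To make the softmax step concrete, here is a small pure-Python sketch on a made-up three-entry vocabulary. The scores in `toy_logits` are hypothetical; the real model produces 128256 of them per position.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocabulary
probs = softmax(toy_logits)
print(probs)
print(sum(probs))
```

The ordering of the scores is preserved: the largest logit always gets the largest probability.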

In [23]:
import torch.nn as nn

proba_dist = nn.Softmax(dim=-1)(out.logits[0, -1])  # pass dim explicitly; leaving it implicit triggers a warning



In [24]:
tokenizer.convert_ids_to_tokens(499)

'Ġyou'

In [25]:
tokenizer.convert_tokens_to_ids('you') #or we can use the code: tokenizer.vocab['you']

9514

'Ġyou' is simply ' you': the 'Ġ' marks a leading space, so yes, there is a space before 'you'!
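Where does 'Ġ' come from? Byte-level BPE tokenizers in the GPT-2 style map every byte to a printable character, and the space byte lands on 'Ġ'. The sketch below rebuilds that table in pure Python; it assumes this tokenizer uses the same table, which the 'Ġyou' output suggests. `token_to_text` is a hypothetical helper.

```python
def bytes_to_unicode():
    """Build the GPT-2 style table mapping each byte (0-255) to a printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (including space, 32) are shifted to code points 256 and up
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}


def token_to_text(token):
    """Map a vocabulary token such as 'Ġyou' back to the text it stands for."""
    return bytes(char_to_byte[ch] for ch in token).decode("utf-8")


print(repr(byte_to_char[ord(" ")]))
print(repr(token_to_text("Ġyou")))
```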


In [26]:
print(out.logits[0, -1][499])
print(f'Probability of next word being "Ġyou" is {proba_dist[499]}')

tensor(25.6250, device='mps:0', grad_fn=<SelectBackward0>)
Probability of next word being "Ġyou" is 0.987900972366333


'you' and ' you' are different tokens. That is why the probability of 'you' (ID 9514) being the next token is much lower than that of 'Ġyou' (ID 499):

In [27]:
print(out.logits[0, -1][9514])
print(f'Probability of next word being "you" is {proba_dist[9514]}')

tensor(13.1250, device='mps:0', grad_fn=<SelectBackward0>)
Probability of next word being "you" is 3.6815642943111015e-06


As you can see, it is far lower than for "Ġyou", which has about 98.8% probability.
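The gap between the two probabilities follows directly from the gap between the two logits. Under softmax the shared normaliser cancels, so the ratio of two probabilities is the exponential of the logit difference. A short check using only the numbers printed above:

```python
import math

logit_space_you = 25.6250  # logit of 'Ġyou' (ID 499), from the output above
logit_you = 13.1250        # logit of 'you' (ID 9514), from the output above

# p_a / p_b = exp(logit_a - logit_b), because both share the same softmax denominator
ratio_from_logits = math.exp(logit_space_you - logit_you)

# Ratio of the two probabilities the notebook printed
ratio_from_probs = 0.987900972366333 / 3.6815642943111015e-06

print(ratio_from_logits, ratio_from_probs)
```

A difference of 12.5 in logit space therefore corresponds to a probability ratio in the hundreds of thousands.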

In [28]:
out['logits'].argmax(axis=-1) 

tensor([[16309,    11,   527,   499]], device='mps:0')

In [29]:
#PREDICTING THE NEXT WORD
tokenizer.decode(out['logits'].argmax(dim=-1)[0, -1])

' you'

Now, if you run this in a loop, that is how text gets generated, one token at a time!

i.e. 
- Loop 1: 
Input 1: "Hello how are"
Output 1: " you"

- Loop 2:
Input 1 + Output 1 = "Hello how are you"
Output 2: " today"
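The loop described above is greedy decoding. Below is a minimal sketch of it in pure Python, using a toy stand-in for the model so that it runs anywhere. `greedy_decode` and `toy_next_token_logits` are hypothetical names; with the real model, the callable would wrap something like `model(input_ids=...).logits[0, -1]`, and in practice `model.generate` handles this loop for you.

```python
def greedy_decode(next_token_logits, input_ids, max_new_tokens, eos_id=None):
    """Repeatedly append the highest-scoring next token to the sequence."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)  # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break  # stop once the end-of-sequence token is produced
    return ids


# Hypothetical toy "model" over a 5-token vocabulary: it always favours (last_id + 1) % 5
def toy_next_token_logits(ids):
    scores = [0.0] * 5
    scores[(ids[-1] + 1) % 5] = 1.0
    return scores


print(greedy_decode(toy_next_token_logits, [0], max_new_tokens=6, eos_id=4))
```

Each iteration feeds the whole sequence so far back in, which is exactly the "Input 1 + Output 1" step in the list above.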

### Training on Sequence (Loss Function)

To fine-tune an LLM, the goal is to give it input sentences and make it learn the sequence, for example:

In [30]:
#if we want our LLM to learn:
sentence = ["The Programmer of this project is Vivek Karna"]
tokenized = tokenizer(sentence, return_tensors='pt')['input_ids']
print(tokenized)
print(tokenizer.batch_decode(tokenized))

tensor([[128000,    791,  89124,    315,    420,   2447,    374,  66954,     74,
            735,  40315]])
['<|begin_of_text|>The Programmer of this project is Vivek Karna']


From the above "tokenized" sequence, we will produce INPUT sequence & OUTPUT sequence

In [31]:
input_ids = tokenized[:, :-1]  # Input sequence (all tokens except the last one)
target_ids = tokenized[:, 1:]  # Target sequence (all tokens except the first one)

print("Input Sequence: ", input_ids)
print("Target Sequence: ", target_ids)

Input Sequence:  tensor([[128000,    791,  89124,    315,    420,   2447,    374,  66954,     74,
            735]])
Target Sequence:  tensor([[  791, 89124,   315,   420,  2447,   374, 66954,    74,   735, 40315]])


This means that given input ID [128000], we want the transformer to predict [791]; given [128000, 791], to predict [89124]; and so on.
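The pairing between contexts and targets can be spelled out with plain lists. This is only an illustration of the shift, using the token IDs printed above:

```python
tokens = [128000, 791, 89124, 315, 420, 2447, 374, 66954, 74, 735, 40315]  # from the output above

inputs = tokens[:-1]   # everything except the last token
targets = tokens[1:]   # everything except the first token

# Position i of the target is the token the model should predict after
# seeing inputs[0..i]; causal attention hides everything to the right.
for i, target in enumerate(targets[:3]):
    print(f"context={inputs[:i + 1]} -> target={target}")
```

A single forward pass over `inputs` therefore yields one training signal per position, not just one for the final token.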

We can do the same with the chat template approach:

In [32]:
question = "Capital of India?"
answer = "New Delhi"

prompt_ = [
    {"role": "user", 
     "content": question},

    {"role": "assitant",  # misspelling of "assistant"; the outputs below were recorded with this spelling
     "content": "Capital: "}
]

chat_template = tokenizer.apply_chat_template(prompt_, continue_final_message=True, tokenize=False)
print(chat_template)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Capital of India?<|eot_id|><|start_header_id|>assitant<|end_header_id|>

Capital:<|eot_id|>


In [33]:
full_response_text = chat_template + " " + answer + tokenizer.eos_token
print(full_response_text)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Capital of India?<|eot_id|><|start_header_id|>assitant<|end_header_id|>

Capital:<|eot_id|> New Delhi<|eot_id|>


We want the model to learn the above text, i.e. the chat template plus the answer. This is simply a prompt-answer pair.

In [34]:
tokenized = tokenizer(full_response_text, return_tensors='pt', add_special_tokens=False)['input_ids']
print(tokenized)

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,  10263,    220,   2366,     19,    271, 128009, 128006,
            882, 128007,    271,  64693,    315,   6890,     30, 128009, 128006,
            395,  52044, 128007,    271,  64693,     25, 128009,   1561,  22767,
         128009]])


In [35]:
input_ids = tokenized[:, :-1] # Start from the first token to the second last token
target_ids = tokenized[:, 1:] # Start from the second token to the last token

print("Input Sequence: ", input_ids)
print("Target Sequence: ", target_ids)

Input Sequence:  tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,  10263,    220,   2366,     19,    271, 128009, 128006,
            882, 128007,    271,  64693,    315,   6890,     30, 128009, 128006,
            395,  52044, 128007,    271,  64693,     25, 128009,   1561,  22767]])
Target Sequence:  tensor([[128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,     25,
           6790,    220,   2366,     18,    198,  15724,   2696,     25,    220,
           1627,  10263,    220,   2366,     19,    271, 128009, 128006,    882,
         128007,    271,  64693,    315,   6890,     30, 128009, 128006,    395,
          52044, 128007,    271,  64693,     25, 128009,   1561,  22767, 128009]])


So for fine-tuning, one approach is: build a prompt-answer structure -> tokenize it -> run it through the model -> compute the loss -> use an optimizer to reduce the loss.

While fine-tuning a pretrained model, ideally we should apply the loss only over our answer and not over the entire sequence (which also includes the prompt).
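One way to restrict the loss to the answer is to mask every target position that still belongs to the prompt. The sketch below does this by prompt length, which is a different route from the padding-based approach used in the cells that follow. `build_labels` and the short `prompt_ids` list are hypothetical; -100 is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def build_labels(prompt_ids, answer_ids):
    """Return (input_ids, labels) where only answer positions contribute to the loss."""
    full = list(prompt_ids) + list(answer_ids)
    input_ids = full[:-1]
    targets = full[1:]
    # The first len(prompt_ids) - 1 targets are still prompt tokens, so mask them
    n_masked = len(prompt_ids) - 1
    labels = [IGNORE_INDEX] * n_masked + targets[n_masked:]
    return input_ids, labels


# Hypothetical short prompt; the answer IDs are " New Delhi" followed by <|eot_id|>
prompt_ids = [128000, 64693, 315, 6890, 30]
answer_ids = [1561, 22767, 128009]
input_ids, labels = build_labels(prompt_ids, answer_ids)
print(input_ids)
print(labels)
```

With this construction the end-of-turn token stays in the labels, so the model is also trained to stop after the answer.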

In [38]:
tokenizer.convert_ids_to_tokens(1561)
tokenizer.convert_ids_to_tokens(22767)

'ĠDelhi'

In [None]:
labels_tokenized = tokenizer([" " + answer], add_special_tokens=False, return_tensors='pt')['input_ids']
labels_tokenized
#besides these two tokens, we also want the end token (EOS) to be added

tensor([[ 1561, 22767]])

In [None]:
target_ids.shape[1] #this is max length

45

In [41]:
labels_tokenized = tokenizer([" " + answer + tokenizer.eos_token], add_special_tokens=False, return_tensors='pt', padding="max_length", max_length=target_ids.shape[1])["input_ids"]

In [43]:
labels_tokenized
# here the label IDs are left-padded with the padding token, which, as set earlier, is the eos_token

tensor([[128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,
         128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,
         128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,
         128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009, 128009,
         128009, 128009, 128009, 128009, 128009, 128009,   1561,  22767, 128009]])

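We will now convert all the padding tokens above (128009) into -100. Why? -100 is the default ignore_index of PyTorch's cross-entropy loss, and Hugging Face causal LM models skip positions whose label is -100 when computing the loss, so these positions are masked out. Note that because the pad token is the same as the EOS token here, the genuine EOS at the end of the answer is masked as well, as the last element of the output below shows.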

In [44]:
# torch.where(condition, x, y) -> takes x where condition is True, y where it is False
labels_tokenized_fixed = torch.where(labels_tokenized != tokenizer.pad_token_id, labels_tokenized, -100)
labels_tokenized_fixed

tensor([[ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
          -100,  -100,  1561, 22767,  -100]])
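To see what the -100 labels do to the loss, here is a pure-Python sketch of cross-entropy that skips ignored positions. It is meant to mirror the default behaviour of `torch.nn.CrossEntropyLoss` (ignore_index=-100, mean reduction) on a tiny made-up example; `masked_cross_entropy` and the toy values are hypothetical.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for position_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        peak = max(position_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        total += log_norm - position_logits[label]  # -log softmax(logits)[label]
        count += 1
    return total / count if count else 0.0


# Hypothetical logits for 3 positions over a 4-token vocabulary
toy_logits = [[9.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]]
toy_labels = [-100, 1, 3]
print(masked_cross_entropy(toy_logits, toy_labels))
```

Whatever the model predicts at the first position has no effect on the result, which is the point of masking the prompt.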

Now, let's combine all of this into one function with inputs prompt and target_responses, and outputs input_ids, attention_mask, and labels.

In [None]:
def generate_target(prompt, target_responses):
    # Use the function arguments rather than the notebook globals `prompt_` and `answer`.
    # Assumes a single conversation and a single answer string.
    chat_template = tokenizer.apply_chat_template(prompt, continue_final_message=True, tokenize=False)
    full_response_text = chat_template + " " + target_responses + tokenizer.eos_token
    tokenized = tokenizer(full_response_text, return_tensors='pt', add_special_tokens=False)['input_ids']
    input_ids = tokenized[:, :-1] # Start from the first token to the second last token
    target_ids = tokenized[:, 1:] # Start from the second token to the last token