Building the base LLM that will later be fine-tuned with LoRA + quantization:

Choosing a base model for the job:

What are the criteria for selecting a base LLM?
- We are going to start with Meta's Llama 3.2 1B Instruct.

## Loading Models and Basic Inference

In [1]:
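from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import pipeline
from torch.optim import AdamW  # transformers' own AdamW is deprecated; the PyTorch optimizer is the usual replacement
import torch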

model_name = "meta-llama/Llama-3.2-1b-Instruct"
device = 'mps' # Use 'cuda' for GPU, 'mps' for Mac, or 'cpu' for CPU
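
Rather than hard-coding the device string, it can be detected at runtime. The sketch below is one possible approach and is not part of the original notebook: `pick_device` is a hypothetical helper name, and it simply falls back to `'cpu'` when PyTorch is not installed or no accelerator is found.

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what this machine offers."""
    # Hypothetical helper: fall back to CPU if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string can be passed wherever `device` is used in this notebook.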




Before going to the next step, make sure you are logged in to Hugging Face on the web and have created an access token. Gated models such as Llama can only be downloaded once you have done that.

After generating and copying the access token from the Hugging Face website, go to the VS Code terminal and run `huggingface-cli login`.

Then paste your access token when prompted.

Now you can run the code below! :)

In [29]:
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side='left')
tokenizer.pad_token = tokenizer.eos_token  # Set padding token as EOS token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map=device)
#took 48.2s to download the model and load llama 3.2 1b instruct

A causal LM predicts the NEXT TOKEN given the preceding tokens (the prefix of the sentence).
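
To make the idea concrete, here is a toy sketch of the autoregressive loop. It is not how `AutoModelForCausalLM` is implemented: the lookup table `NEXT_TOKEN` is a made-up stand-in for the model, and `generate_greedy` is a hypothetical name. The point is only that generation repeats "predict one token, append it, predict again" until a token budget or an end marker is reached.

```python
# Toy stand-in for a causal LM: maps the last token to one "most likely" next token.
# A real model scores the whole prefix against every vocabulary entry; this table is invented.
NEXT_TOKEN = {
    "capital": "of",
    "of": "India",
    "India": "?",
    "?": "New",
    "New": "Delhi",
    "Delhi": "<eos>",
}


def generate_greedy(prefix, max_new_tokens):
    """Append at most max_new_tokens tokens, one at a time, stopping early at <eos>."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens


print(generate_greedy(["capital", "of", "India", "?"], max_new_tokens=7))
# → ['capital', 'of', 'India', '?', 'New', 'Delhi']
```

With `max_new_tokens=1` the same call would stop after `'New'`, which is the role the token budget plays in real generation as well.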

#### Pipeline

In [24]:
generate_pipeline = pipeline(task='text-generation',
                             model=model,
                             tokenizer=tokenizer)

generate_pipeline("What is the capital of India?", max_new_tokens=7)
# max_new_tokens = maximum number of tokens (not words) to generate in the output

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'generated_text': 'What is the capital of India? New Delhi\nThe capital of India'}]

##### Batch Generation

In [25]:
generate_pipeline(["What is the capital of India?", 
                   "What is the capital of USA?"], max_new_tokens=25)
# If a list of prompts is given, max_new_tokens is applied to each prompt in the list.

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[[{'generated_text': 'What is the capital of India? New Delhi?\nNo, it is not New Delhi. New Delhi is the capital of India, but the capital of India is'}],
 [{'generated_text': 'What is the capital of USA? Washington\nThe capital of the United States of America is Washington D.C. (short for District of Columbia). It is located'}]]

#### What is happening inside the Pipeline?

In [30]:
input_prompt = ["What is the capital of India?",
                "What is the capital of USA?"]

# Tokenizers convert the input strings into lists of integers (token ids) that are used as input to the model
tokenized = tokenizer(input_prompt, return_tensors='pt').to(device)

print(tokenized)

{'input_ids': tensor([[128000,   3923,    374,    279,   6864,    315,   6890,     30],
        [128000,   3923,    374,    279,   6864,    315,   7427,     30]],
       device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')}


Both input sentences have been converted into lists of integers. Because the two lists have the same length (8 tokens), they can be stacked into a single tensor, which is why this worked. If the inputs had produced different numbers of tokens, the tokenizer could not build a rectangular tensor and would raise an error. That is why we need padding: the tokenizer fills the shorter sequences with placeholder tokens so that every row has the same length.

In [20]:
tokenized['input_ids'].shape

torch.Size([2, 8])

In [31]:
input_prompt = ["What is the capital of India?",
                "What is the capital of USA and Canada?"]

# Tokenizers convert the input strings into lists of integers (token ids) that are used as input to the model
tokenized = tokenizer(input_prompt, padding=True, return_tensors='pt').to(device)

print(tokenized)

{'input_ids': tensor([[128009, 128009, 128000,   3923,    374,    279,   6864,    315,   6890,
             30],
        [128000,   3923,    374,    279,   6864,    315,   7427,    323,   7008,
             30]], device='mps:0'), 'attention_mask': tensor([[0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')}


In [32]:
tokenized['input_ids'].shape

torch.Size([2, 10])

As the demonstration above shows, with padding enabled and the 2nd string being longer, the 1st string needs padding to fill the missing positions. The pad tokens (the EOS token here) are added on the 'left' because we set `padding_side='left'` at the top, when defining the tokenizer.
We can also see that the shape of the input has changed, from [2, 8] to [2, 10].
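
The padding step itself is simple enough to sketch in plain Python. The following is an illustration of what left padding produces, not the tokenizer's actual implementation; `pad_left` is a hypothetical helper, and the token ids are copied from the printed output above.

```python
PAD_ID = 128009  # id of <|eot_id|>, which shows up as the pad token in the output above


def pad_left(sequences, pad_id=PAD_ID):
    """Left-pad lists of token ids to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Token ids copied from the tokenizer output shown earlier.
india = [128000, 3923, 374, 279, 6864, 315, 6890, 30]
usa_canada = [128000, 3923, 374, 279, 6864, 315, 7427, 323, 7008, 30]

ids, mask = pad_left([india, usa_canada])
print(ids[0])   # → [128009, 128009, 128000, 3923, 374, 279, 6864, 315, 6890, 30]
print(mask[0])  # → [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
```

Left padding is the usual choice for generation because the new tokens are appended on the right, directly after the real prompt tokens.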

Now, let's convert the integers back to strings to see what the tokens look like.

In [33]:
tokenizer.batch_decode(tokenized['input_ids'])

['<|eot_id|><|eot_id|><|begin_of_text|>What is the capital of India?',
 '<|begin_of_text|>What is the capital of USA and Canada?']

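Padding filled the empty positions in the first sentence so that both rows have the same length. This lets the whole batch be processed as a single tensor, which is what makes the batched matrix multiplications possible and faster than running each sentence separately.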

In [37]:
tokenized

{'input_ids': tensor([[128009, 128009, 128000,   3923,    374,    279,   6864,    315,   6890,
             30],
        [128000,   3923,    374,    279,   6864,    315,   7427,    323,   7008,
             30]], device='mps:0'), 'attention_mask': tensor([[0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')}

In [35]:
tokenized.keys()

dict_keys(['input_ids', 'attention_mask'])

In [36]:
tokenized['attention_mask']

tensor([[0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')

The attention_mask tells the model NOT to attend to the padding tokens. As we saw, 2 EOS tokens were added to the first sentence as padding, and these are ignored by the model: positions marked 1 are attended to, positions marked 0 are not.
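
Roughly speaking, a mask like this takes effect inside attention by pushing the scores of masked positions to negative infinity before the softmax, so those positions end up with zero weight. The sketch below shows that idea in plain Python on made-up scores. `masked_softmax` is a hypothetical helper that assumes at least one unmasked position; real implementations work on tensors and combine this with the causal mask.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores where positions with mask 0 receive zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.5]  # invented attention scores for four positions
mask = [0, 0, 1, 1]            # the first two positions are padding
weights = masked_softmax(scores, mask)
print(weights)  # → [0.0, 0.0, 0.5, 0.5]
```

Even though the padded positions had the highest raw scores, they contribute nothing after masking.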

### Chat Templates

Instruction tuning: many language models are fine-tuned to follow user instructions in a chat-like format.

Hence, we are going to use `apply_chat_template()`, which converts a prompt from the chat message format (a list of role/content dicts) into a single string sequence.

In [41]:
prompt = [
    {
        "role": "system",
        "content": "You are a smart AI assistant who speaks like Shakespeare."
     },
     {
        "role": "user",
        "content": "where does the sun set?" 
     }
]

tokenized = tokenizer.apply_chat_template(prompt, 
                                          add_generation_prompt = True,
                                          tokenize = True, # True to get token ids (integers), False to get the formatted string
                                          padding = True,
                                          return_tensors = 'pt').to(device)

tokenized

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   1627,  10263,    220,   2366,     19,    271,   2675,    527,
            264,   7941,  15592,  18328,    889,  21881,   1093,  42482,     13,
         128009, 128006,    882, 128007,    271,   2940,   1587,    279,   7160,
            743,     30, 128009, 128006,  78191, 128007,    271]],
       device='mps:0')

Output when `tokenize` is set to `False`:

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a smart AI assistant who speaks like Shakespeare.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhere does the sun set?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
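
To see where each piece of that string comes from, the layout can be rebuilt by hand. The function below is a hand-written approximation that reproduces the string printed above for this particular prompt. It is not the real template, which ships with the tokenizer and handles cases this sketch ignores. `render_llama3_chat` and its `cutoff`/`today` parameters are hypothetical names, and the two date values are simply copied from the output above.

```python
def render_llama3_chat(messages, add_generation_prompt=True,
                       cutoff="December 2023", today="26 Jul 2024"):
    """Approximate the Llama 3.2 chat layout seen in the printed output."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        if message["role"] == "system":
            # The printed output shows these two date lines ahead of the system prompt.
            text += f"Cutting Knowledge Date: {cutoff}\nToday Date: {today}\n\n"
        text += message["content"] + "<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


prompt = [
    {"role": "system", "content": "You are a smart AI assistant who speaks like Shakespeare."},
    {"role": "user", "content": "where does the sun set?"},
]
rendered = render_llama3_chat(prompt)
print(rendered)
```

Setting `add_generation_prompt=False` in this sketch drops the trailing assistant header, so the text ends after the user turn instead of inviting a reply.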

In [42]:
out = model.generate(tokenized, max_new_tokens=20)

decoded = tokenizer.batch_decode(out)
decoded

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a smart AI assistant who speaks like Shakespeare.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nwhere does the sun set?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nFair mortal, thou dost ask a query most fine,\nConcerning the place where the sun doth']
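
The decoded string contains the whole prompt as well as the reply. One simple way to keep only the reply is to split on the assistant header, as sketched below with plain string operations. `extract_reply` is a hypothetical helper and `sample` is a shortened version of the output above; slicing the generated ids by the prompt length before decoding is an equally reasonable alternative.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
END_OF_TURN = "<|eot_id|>"


def extract_reply(decoded: str) -> str:
    """Return the text after the last assistant header, minus a trailing <|eot_id|>."""
    # If the header is absent, rsplit returns the whole string unchanged.
    reply = decoded.rsplit(ASSISTANT_HEADER, 1)[-1]
    if reply.endswith(END_OF_TURN):
        reply = reply[: -len(END_OF_TURN)]
    return reply


# Shortened version of the decoded output shown above.
sample = (
    "<|start_header_id|>user<|end_header_id|>\n\nwhere does the sun set?<|eot_id|>"
    + ASSISTANT_HEADER
    + "Fair mortal, thou dost ask a query most fine,\nConcerning the place where the sun doth"
)
print(extract_reply(sample))
```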