## Exploring Large Language Models: LLaMA and Mistral

In this notebook, we will dive into two famous large language models, LLaMA and Mistral, along with their instruction-tuned versions. We'll explore how each model performs on various tasks, with a particular focus on generating structured responses in JSON format.

These models have been fine-tuned to follow instructions, making them suitable for a range of NLP applications. Through this lab, you will:

- Learn how to load and interact with LLaMA and Mistral models using the `pipeline` API and the tokenizer's chat template (`tokenizer.chat_template`).
- Examine the performance of their instruction-based variants.
- Generate structured outputs, specifically in JSON, for practical applications.

> **Disclaimer**: Before starting this lab, ensure you have requested access to the required models on Hugging Face and have logged in to your Hugging Face account. Access is necessary for the following models:
>
> - [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
> - [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
> - [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
>
> You can log in to Hugging Face directly from this notebook using the provided code snippet.


In [1]:
from huggingface_hub import login

import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM



In [None]:
# TODO: Log in to the Hugging Face Hub so that the gated models can be downloaded
token = "YOUR HUGGING FACE TOKEN"

login(token=token)

# 1. LLaMA

In this part of the lab, we will explore **LLaMA (Large Language Model Meta AI)**, one of the best-known families of large language models, developed by Meta (formerly Facebook). 

Next, we will focus on the **Instruct** variant of LLaMA, which is fine-tuned to better understand and follow user instructions. 

We will use Llama 3.2 (released in September 2024), specifically the 1B version. This is small by current LLM standards, but still capable enough for the experiments in this lab.

It was released, along with a 3B version, so that it can run on devices with modest hardware (e.g., mobile phones or other edge devices). 

In [None]:
model_id = "meta-llama/Llama-3.2-1B"

# TODO: Load the model with `torch.float16` precision and the tokenizer 
# (You can specify the precision with `torch_dtype=torch.float16`)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token

We can use this model to generate text with the `generate()` method. We use random sampling (`do_sample=True`) and draw 5 samples (`num_return_sequences=5`). Other generation parameters are listed [here](https://huggingface.co/docs/transformers/v4.46.0/en/main_classes/text_generation#transformers.GenerationConfig).

In [None]:
tokens = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
batch = model.generate(**tokens, do_sample=True, max_length=50, num_return_sequences=5, pad_token_id=tokenizer.eos_token_id) # (assigning pad_token_id avoids a warning)
tokenizer.batch_decode(batch)
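
To build intuition for what `do_sample=True` changes, here is a toy sketch in plain Python. It is not how `generate()` is implemented: the vocabulary and scores are invented for illustration, and `sample_next_token` is a hypothetical helper. It only shows the idea of drawing a token from a softmax distribution instead of always taking the top-scoring one.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw one token from a softmax over `logits` (a dict token -> score)."""
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the token following "Hello, my name is"
toy_logits = {" John": 2.0, " Sarah": 1.8, " Alex": 1.5, " the": 0.2}

rng = random.Random(0)
samples = [sample_next_token(toy_logits, rng=rng) for _ in range(5)]
print(samples)  # five independent draws, analogous to num_return_sequences=5

greedy = max(toy_logits, key=toy_logits.get)
print(greedy)   # greedy decoding always picks the top-scoring token
```

Because each draw is random, the five samples can differ from one another, which is why the cell above returns five different continuations of the same prompt.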

### **Understanding the `tokenizer.chat_template`**

In this section, we will explore the **chat template** used to format and structure messages for a conversational assistant. The `tokenizer.chat_template` attribute is a convenient way of organizing the interaction between the user, system, and assistant so that the model can process it and generate coherent responses.

### **What is a Chat Template?**

The chat template is a predefined format that ensures consistent structure for conversations. It marks the different roles in the interaction (system, user, assistant), and separates the various elements of the conversation using special tokens. This helps the language model understand which parts of the dialogue are instructions, which parts are user inputs, and where the assistant’s response should be generated.

Let's create an example of a possible (simplified) chat template:

In [4]:
import datetime

chat_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: """+datetime.datetime.now().strftime("%d %b %Y")+"""

{system_message}

<|eot_id|>
<|start_header_id|>user<|end_header_id|>

{user_message}

<|eot_id|>
"""

### **Hugging Face Pipeline Overview**

The **`pipeline`** method from Hugging Face’s Transformers library is a high-level API designed to streamline the process of using pre-trained models for a wide variety of **natural language processing (NLP) tasks**.

#### **What is a Pipeline?**

A pipeline is a modular tool that wraps around a pre-trained model, tokenizer, and task-specific configurations. It makes it easy to load and apply these models directly to different tasks, such as:
- **Text generation**
- **Text classification**
- **Question answering**
- **Summarization**
- **Translation**

By simply specifying the type of task (e.g., `"text-generation"`), `pipeline` takes care of loading and configuring a compatible model and tokenizer, providing a ready-to-use interface for generating results.

You can find a full list of supported pipelines on the [Hugging Face documentation](https://huggingface.co/docs/transformers/main_classes/pipelines).
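
As a rough mental model (not the library's actual implementation), a pipeline chains three stages: preprocessing with a tokenizer, a forward pass through the model, and postprocessing of the raw output. The sketch below mimics that flow with a hypothetical `ToyTextPipeline` class, a whitespace "tokenizer", and a dummy "model"; every name in it is invented for illustration.

```python
class ToyTextPipeline:
    """Toy stand-in showing the preprocess -> forward -> postprocess flow."""

    def __init__(self, vocab):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def preprocess(self, text):
        # "Tokenizer": map whitespace-separated words to integer ids
        return [self.token_to_id[word] for word in text.split()]

    def forward(self, input_ids):
        # "Model": a dummy that just repeats the last token twice
        return input_ids + [input_ids[-1]] * 2

    def postprocess(self, output_ids):
        # Decode ids back to text and wrap them like a text-generation result
        text = " ".join(self.id_to_token[i] for i in output_ids)
        return [{"generated_text": text}]

    def __call__(self, text):
        return self.postprocess(self.forward(self.preprocess(text)))

toy_pipe = ToyTextPipeline(vocab=["hello", "world"])
print(toy_pipe("hello world"))  # [{'generated_text': 'hello world world world'}]
```

The result is shaped as a list of dictionaries with a `generated_text` key, which is the same shape the real text-generation pipeline returns in the cells of this notebook.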

In [66]:
# Create the pipeline with the model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

In [None]:

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "What is 2 + 2?"},
]

# Format the messages using the chat template
formatted_messages = chat_template.format(
    system_message=messages[0]["content"],
    user_message=messages[1]["content"]
)


print(formatted_messages)
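
The format string used above handles exactly one system and one user message. As a minimal sketch, assuming the same Llama-3-style special tokens and leaving out the date lines, the helper below (the name `render_llama3_prompt` is hypothetical) renders any number of turns and can end with an open assistant header, which marks where the model's reply should begin.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Render role/content dicts in a simplified Llama-3-style layout."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model that its turn starts here
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

demo_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
prompt = render_llama3_prompt(demo_messages)
print(prompt)
```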

Remember that a model can only follow instructions in this format if it has been fine-tuned on data in this format. Here, we are using the base model, not the instruction-tuned version. 

So we can expect a poor response (the model has never seen this kind of input before!). But let's try it anyway!

In [None]:
# Generate the output text 
outputs = pipe(
    formatted_messages,
    max_new_tokens=256,
    do_sample=True,
)

print(outputs[0]["generated_text"])

### **Differences Between Standard and Instruct Versions of Large Language Models (LLMs)**

Large Language Models (LLMs) come in different versions, with **standard** and **instruction-tuned (Instruct)** versions being the most common. Here’s a brief comparison:

#### **1. Purpose and Training**:
   - **Standard LLM**: The standard (base) model is pre-trained on large datasets without any specific instruction-following training. It typically generates open-ended continuations, which can be useful for creative writing or general information retrieval where the response style is flexible.
   - **Instruct LLM**: Instruction-tuned models, like **Llama-3.2 Instruct**, are fine-tuned on datasets designed to help the model understand and follow instructions. This tuning improves the model's ability to respond directly to user prompts and handle structured requests, producing concise, direct responses that are often more relevant in task-specific or conversational AI applications.

Let's compare the outputs of the standard and Instruct versions of LLaMA to see the differences in their responses.

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B-Instruct"

# TODO: Load the model with `torch.float16` precision and the tokenizer 
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# TODO: Set the pad token to the end of the sequence token
tokenizer.pad_token = tokenizer.eos_token

# TODO: Create the pipeline with the model and tokenizer
# (chat-style message lists are handled by the "text-generation" task)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "What is 2 + 2?"},
]

# TODO: Format the messages using the chat template and generate the output text
formatted_messages = [{"role": msg["role"], "content": msg["content"]} for msg in messages]
outputs = pipe(formatted_messages)

print(outputs[0]["generated_text"])

### **Evaluation of the Tokenizer Chat Template**

The actual chat template of `meta-llama/Llama-3.2-1B-Instruct` is considerably more complex than the example above. It includes components that help the model understand the context of the conversation, manage dates, handle tools, and structure messages.

The template is written in [Jinja](https://jinja.palletsprojects.com/en/stable/templates/), a templating language that generates content dynamically from variables, conditions, and loops.


Let's print it and analyze its key components:
 

#### **Key Components of the Template**:
1. **System Message Extraction**:
   - The system message is extracted if the first role in the message list is labeled "system." This allows the template to clearly differentiate between user queries and system instructions.
   - The system message is placed after a system header (`<|start_header_id|>system<|end_header_id|>`) and terminated by `<|eot_id|>`, so the model knows where it starts and ends.

2. **Date Management**:
   - The template automatically handles the current date using either a provided `strftime_now` function or a default date (`"26 Jul 2024"`). This can be useful when the model needs to be aware of the date in contexts such as time-sensitive responses.

3. **Handling Tools**:
   - The template checks if **tools** are defined. If tools are available, it includes a description of these tools in the system message or the user message, depending on where they need to appear.
   - If the tools are part of the user message, the template adds text to the first user message instructing the model to respond in a structured format, such as JSON for function calls.

4. **Message Processing**:
   - The template loops through the list of messages and processes each based on the role (`user`, `assistant`, `ipython`, or `tool`). It formats each message using start and end tokens for the roles, helping the model understand the structure of the conversation.
   - If the message involves tool calls, the template ensures that they are properly formatted into a structured JSON format to be passed back to the model for further processing.

5. **Ending the Assistant's Response**:
   - The template leaves a placeholder for the assistant’s response, which the model will generate during inference. This ensures that the assistant's response begins in the correct format, ready to be populated with the generated content.

#### **Why Is This Template Needed?**

- **Maintains Consistency**: This template ensures that the conversation is structured in a consistent manner, which is crucial for models designed to follow complex instructions or engage in multi-turn conversations.
- **Handles Tools**: By incorporating the ability to dynamically introduce tools and functionality, the template allows the model to expand beyond simple text-based conversations and perform function-based tasks.
- **Structured Outputs for Tools**: When the conversation involves tool calls (e.g., through APIs or function calls), the template ensures that these interactions are formatted properly for execution.
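
To make the first two components more concrete, the following plain-Python sketch approximates the system message extraction and the date fallback described above. It is an approximation for illustration only, not the Jinja template itself; the function name `split_system_and_date` and its signature are hypothetical.

```python
import datetime

def split_system_and_date(messages, date_string=None, strftime_now=None):
    """Approximate the first steps of the template in plain Python."""
    # 1. System message extraction: peel off a leading "system" message
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"].strip()
        messages = messages[1:]
    else:
        system_message = ""

    # 2. Date management: explicit date > strftime_now helper > fixed default
    if date_string is None:
        if strftime_now is not None:
            date_string = strftime_now("%d %b %Y")
        else:
            date_string = "26 Jul 2024"

    return system_message, messages, date_string

chat = [
    {"role": "system", "content": "You are a pirate chatbot."},
    {"role": "user", "content": "Who are you?"},
]
system, rest, date = split_system_and_date(chat)
print(repr(system), len(rest), date)  # 'You are a pirate chatbot.' 1 26 Jul 2024

# Supplying a formatter yields today's date instead of the fixed default
today = split_system_and_date(chat, strftime_now=datetime.datetime.now().strftime)[2]
print(today)
```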

In [None]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

print(tokenizer.chat_template)

Let's run the same example again using the `chat_template` of `meta-llama/Llama-3.2-1B-Instruct` and analyze the output.

With a tokenizer that supports the chat template, we can directly call the `apply_chat_template()` method to convert a list of messages (each one a dictionary in the already discussed format) into a prompt.

Notice that, since we are not using any particular tools or other functionalities, our template will be similar to the one we manually introduced earlier.

In [None]:
# TODO: Create the pipeline with the model and tokenizer
# (chat-style message lists are handled by the "text-generation" task)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
input_tokens = tokenizer.apply_chat_template(messages)
print(tokenizer.decode(input_tokens))

# TODO: Generate the output text (pass the list of messages
# as the input. You can use the `max_new_tokens` parameter
# to control the length of the output)
outputs = pipe(messages, max_new_tokens=50)

# we are getting back the full conversation history
# as a list of messages outputs[0]["generated_text"]
# -1 : last message (assistant response)
print(outputs[0]["generated_text"][-1]["content"])

Notice that the pipeline already supports chat mode, so we can pass the list of messages (as long as they contain role/content keys) directly to the pipeline.

Alternatively, we could have passed the prompt as a string. In this case, however, we would have to manually extract the output from the model and parse it back.

In [None]:
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

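# TODO: Format the messages using the chat template and generate the output text
# tokenize=False returns the prompt as a string; add_generation_prompt=True
# appends the assistant header so the model starts its reply right away
prompt_string = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)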

outputs = pipe(prompt_string, max_new_tokens=50)

print(outputs[0]["generated_text"])
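
When the prompt is passed as a string, the generated text normally contains the prompt followed by the continuation. A minimal sketch of the manual extraction step is shown below; `extract_reply` is a hypothetical helper and the two strings are invented stand-ins for `prompt_string` and the pipeline output.

```python
def extract_reply(generated_text, prompt):
    """Return only the newly generated part of a full-text generation."""
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text  # the prompt was not echoed back
    # Drop an end-of-turn marker if it survived decoding
    return reply.replace("<|eot_id|>", "").strip()

# Hypothetical strings standing in for prompt_string and the pipeline output
demo_prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
demo_output = demo_prompt + "Arrr, I be a pirate chatbot!<|eot_id|>"
print(extract_reply(demo_output, demo_prompt))  # Arrr, I be a pirate chatbot!
```

The text-generation pipeline also accepts `return_full_text=False`, which returns only the continuation and may make this manual step unnecessary.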

# 2. Mistral

In this part, we will use `Mistral-7B-Instruct-v0.2`, developed by Mistral AI, to generate structured responses in JSON format. 

In this exercise, we will generate random math questions and instruct Mistral-7B to respond in a structured JSON format. We will then save the responses to a JSON file and verify the answers programmatically. 

Let's first repeat the example we ran with LLaMA, this time using Mistral.


In [None]:
# Define the model ID
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

# TODO: Load the model and the tokenizer 
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# TODO: Initialize the pipeline for text generation
# (chat-style message lists are handled by the "text-generation" task)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Define the message prompts for the conversation
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"}
]

# TODO: Generate the response
outputs = pipe(messages, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)

# Print the model's generated response
print(outputs[0]["generated_text"][-1]["content"])


Now let's build the math exercise described above in three steps:

1. **Generate random math questions:** Use Python to create questions with random numbers in a conversational style (e.g., “What is the sum of 245 and 173?”).

In [None]:
import random

def generate_random_math_questions(num_samples=5):
    """
    Generate random math questions with two numbers for a given number of samples.

    Args:
        num_samples (int): The number of math questions to generate.
    
    Returns:
        List of tuples: A list of tuples containing the math question and the two numbers (question, num1, num2).
    """

    # Define templates for math questions with two numbers
    templates = [
        "What is the sum of {} and {}?",
        "Can you add {} and {}?",
        "Calculate the sum of {} and {} for me.",
        "How much is {} plus {}?",
        "Please add {} and {}."
    ]
    
    questions = []
    for _ in range(num_samples):
        # TODO: Randomly select a template and generate two random numbers
        template = random.choice(templates)
        num1 = random.randint(1, 100)
        num2 = random.randint(1, 100)
        question = template.format(num1, num2)
        questions.append((question, num1, num2))  # store question with numbers for validation
    return questions
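
If you want the same questions on every run (for example, to compare two models on identical inputs), one option is a dedicated seeded random generator. This is a compact sketch under that assumption; `make_questions` and its `seed` parameter are hypothetical and not part of the function above.

```python
import random

def make_questions(num_samples, seed):
    """Compact variant of the generator above with a dedicated, seeded RNG."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    templates = ["What is the sum of {} and {}?", "How much is {} plus {}?"]
    questions = []
    for _ in range(num_samples):
        num1, num2 = rng.randint(1, 100), rng.randint(1, 100)
        questions.append((rng.choice(templates).format(num1, num2), num1, num2))
    return questions

first_run = make_questions(3, seed=42)
second_run = make_questions(3, seed=42)
print(first_run == second_run)  # True: same seed, same questions
```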


2. **Instruct the model to respond in JSON:** Use a system role instruction to ensure Mistral-7B answers in a JSON format containing the fields `num_1`, `num_2`, and `answer`. This makes the output compatible with automated processing or JSON parsers.

In [55]:
role_instruction = {
    "role": "system",
    "content": "Answer each question in JSON format with the fields 'num_1', 'num_2', and 'answer'. Provide only JSON to ensure compatibility with a JSON parser."
}
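
For reference, this is the kind of reply the instruction is aiming for, and how it parses. The reply string is a hypothetical example, not real model output. Note that valid JSON requires double quotes around field names, even though the instruction text quotes them with single quotes.

```python
import json

# Hypothetical reply for "What is the sum of 245 and 173?"
example_reply = '{"num_1": 245, "num_2": 173, "answer": 418}'

parsed = json.loads(example_reply)
print(parsed["answer"])  # 418
print(parsed["num_1"] + parsed["num_2"] == parsed["answer"])  # True
```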


3. **Save and verify responses:** Generate and store model responses in a JSON file and check if the answers match expected values.

In [None]:
import json
from tqdm import tqdm

# Generate questions and answers, then save to JSON
questions = generate_random_math_questions(num_samples=5)
answers = []

# Generate structured answers for each question
for question, num1, num2 in tqdm(questions):
    # TODO: Define the message prompts
    formatted_messages = [
        role_instruction,
        {"role": "user", "content": question}
    ]
    
    # TODO: Generate the response
    outputs = pipe(formatted_messages, max_new_tokens=50)

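    # Keep the full conversation (a list of role/content messages);
    # the assistant's JSON reply is extracted from it during verification
    structured_answer = outputs[0]["generated_text"]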

    answers.append({
        "question": question,
        "num_1": num1,
        "num_2": num2,
        "model_answer": structured_answer
    })

# Save answers to a JSON file
with open("model_answers.json", "w") as f:
    json.dump(answers, f, indent=2)


In [None]:
import json

# Function to parse model's answer and verify correctness
def verify_answer(entry):
    try:
        # Extract expected values
        num1, num2 = entry["num_1"], entry["num_2"]
        expected_answer = num1 + num2
        
        # Extract the assistant's response from the list of messages
        assistant_message = next(
            (msg["content"] for msg in entry["model_answer"] if msg["role"] == "assistant"), None
        )
        
        if assistant_message is None:
            raise ValueError("Assistant's message not found in model_answer")
        
        # Parse model's structured answer from JSON
        model_response = json.loads(assistant_message.strip())  # raises JSONDecodeError if the reply is not pure JSON
        
        print(f"Expected answer: {num1} + {num2} = {expected_answer}")
        print(f"Model's answer: {model_response['num_1']} + {model_response['num_2']} = {model_response['answer']}")
        
        # Check if the values match
        return (
            model_response["num_1"] == num1
            and model_response["num_2"] == num2
            and model_response["answer"] == expected_answer
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as e:
        # Handle cases where parsing fails or keys are missing
        print(f"Error verifying entry: {entry}. Error: {e}")
        return False

# Load answers from the JSON file and verify
try:
    with open("model_answers.json", "r") as f:
        saved_answers = json.load(f)
except (json.JSONDecodeError, FileNotFoundError) as e:
    print(f"Error loading JSON file: {e}")
    saved_answers = []

for i, entry in enumerate(saved_answers, 1):
    result = verify_answer(entry)
    print(f"Question {i}:", "Correct" if result else "Incorrect", "\n")
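
The verification above assumes that the assistant's reply is pure JSON. In practice, a model may wrap the JSON in a Markdown code fence or add a short sentence around it, in which case `json.loads` fails and the entry is counted as incorrect. A more tolerant parser could look like the sketch below; `extract_json_object` is a hypothetical helper and the wrapped reply is an invented example.

```python
import json

def extract_json_object(text):
    """Parse the span from the first '{' to the last '}'; return None on failure."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

fence = "`" * 3  # built programmatically to avoid a literal fence in this example
wrapped_reply = f'Sure!\n{fence}json\n{{"num_1": 7, "num_2": 5, "answer": 12}}\n{fence}'
print(extract_json_object(wrapped_reply))  # {'num_1': 7, 'num_2': 5, 'answer': 12}
print(extract_json_object("I cannot answer that."))  # None
```

Whether such leniency is appropriate depends on the goal: if the exercise is meant to measure strict adherence to the "only JSON" instruction, the original strict parsing is the better test.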