<a href="https://colab.research.google.com/github/mmaguero/diploma_fpuna_nlp_ia/blob/master/2025/prompting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Transformers installation
! pip install transformers datasets evaluate accelerate
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git



In [4]:
from huggingface_hub import notebook_login

notebook_login()


# Prompt engineering

Prompt engineering, or prompting, uses natural language to improve large language model (LLM) performance on a variety of tasks. A prompt can steer the model towards generating a desired output. In many cases, you don't even need a [fine-tuned](#fine-tuning) model for a task. You just need a good prompt.

Try prompting an LLM to classify some text. When you create a prompt, it's important to provide very specific instructions about the task and what the result should look like.

In [None]:
from transformers import pipeline
import torch

pipeline = pipeline(task="text-generation", model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt = """
Classify the text into neutral, negative or positive.
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
"""

outputs = pipeline(prompt, max_new_tokens=10)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Result: 
Classify the text into neutral, negative or positive.
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
Positive


In [29]:
from transformers import pipeline
import torch

pipeline = pipeline(task="text-generation", model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt = """
Classify the text into neutral, negative or positive.
Text: @EleditorPy @Yoyi_aponte @concursomarcapy @MIC_PY @Lizcramer_py @Senatur_Py @SPL_Paraguay ROHAYHU PARAGUAY DE MIS AMORES...
Sentiment:
"""

outputs = pipeline(prompt, max_new_tokens=10)
for output in outputs:
    print(f"Result: {output['generated_text']}")
# Expected label: positive ("rohayhu" is Guarani for "I love you")

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Result: 
Classify the text into neutral, negative or positive.
Text: @EleditorPy @Yoyi_aponte @concursomarcapy @MIC_PY @Lizcramer_py @Senatur_Py @SPL_Paraguay ROHAYHU PARAGUAY DE MIS AMORES...
Sentiment:
Negative

Reason: The text appears to be a


The challenge lies in designing prompts that produce the results you expect, because language is so nuanced and expressive.
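
As the two runs above show, the default output echoes the prompt and may continue past the label (for example with `Reason: ...`). The following is a minimal sketch of one way to build the prompt and keep only the label. `build_classification_prompt` and `extract_label` are hypothetical helper names introduced here for illustration, not part of Transformers, and the generated string is hard-coded so the snippet runs without loading a model.

```python
from typing import Optional, Sequence

LABELS = ("neutral", "negative", "positive")


def build_classification_prompt(text: str, labels: Sequence[str] = LABELS) -> str:
    """Build a zero-shot prompt with the same layout as the cells above."""
    label_list = ", ".join(labels[:-1]) + f" or {labels[-1]}"
    return f"\nClassify the text into {label_list}.\nText: {text}\nSentiment:\n"


def extract_label(generated_text: str, prompt: str, labels: Sequence[str] = LABELS) -> Optional[str]:
    """Return the label on the first non-empty completion line, or None."""
    # The pipeline echoes the prompt by default, so drop it before parsing.
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    for line in completion.splitlines():
        candidate = line.strip().lower()
        if not candidate:
            continue
        # Only the first non-empty line counts; later lines are explanations.
        return next((label for label in labels if candidate.startswith(label)), None)
    return None


prompt = build_classification_prompt("ROHAYHU PARAGUAY DE MIS AMORES...")
# Hard-coded stand-in for output["generated_text"], shaped like the run above.
generated = prompt + "Negative\n\nReason: The text appears to be a"
print(extract_label(generated, prompt))  # negative
```

Returning `None` when the first line is not a known label makes it easy to count how often a prompt fails to produce a usable answer.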

This guide covers prompt engineering best practices, techniques, and examples for how to solve language and reasoning tasks.

## Best practices

1. Try to pick the latest models for the best performance. Keep in mind that LLMs can come in two variants, [base](https://hf.co/mistralai/Mistral-7B-v0.1) and [instruction-tuned](https://hf.co/mistralai/Mistral-7B-Instruct-v0.1) (or chat).

    Base models are excellent at completing text given an initial prompt, but they're not as good at following instructions. Instruction-tuned models are versions of the base models that are further trained on instructional or conversational data. This makes instruction-tuned models a better fit for prompting.

    > [!WARNING]
    > Modern LLMs are typically decoder-only models, but there are some encoder-decoder LLMs like [Flan-T5](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/flan-t5) or [BART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bart) that may be used for prompting. For encoder-decoder models, make sure you set the pipeline task identifier to `text2text-generation` instead of `text-generation`.

2. Start with a short and simple prompt, and iterate on it to get better results.

3. Put instructions at the beginning or end of a prompt. For longer prompts, models may apply optimizations to prevent attention from scaling quadratically, which places more emphasis at the beginning and end of a prompt.

4. Clearly separate instructions from the text of interest.

5. Be specific and descriptive about the task and the desired output, including, for example, its format, length, style, and language. Avoid ambiguous descriptions and instructions.

6. Instructions should focus on "what to do" rather than "what not to do".

7. Lead the model to generate the correct output by writing the first word or even the first sentence.

8. Try other techniques like [few-shot](#few-shot-prompting) and [chain-of-thought](#chain-of-thought) to improve results.

9. Test your prompts with different models to assess their robustness.

10. Version and track your prompt performance.
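
Several of these practices (instructions first, a clear separator around the text, an explicit output format, and a leading first word) can be combined in a small template function. This is only a sketch: `build_prompt` is a hypothetical helper, and the triple-quote delimiter is one arbitrary choice of separator among many.

```python
def build_prompt(instruction: str, text: str, output_format: str, answer_prefix: str = "") -> str:
    """Assemble a prompt: instruction first, then the format, the delimited text, and a lead-in."""
    return (
        f"{instruction}\n"
        f"Output format: {output_format}\n"
        "Text:\n"
        '"""\n'
        f"{text}\n"
        '"""\n'
        f"{answer_prefix}"
    )


prompt = build_prompt(
    instruction="Classify the sentiment of the text between the triple quotes.",
    text="This movie is definitely one of my favorite movies of its kind.",
    output_format="one word: neutral, negative or positive",
    answer_prefix="Sentiment:",
)
print(prompt)
```

Keeping the template in a function also makes it easier to change one element at a time while iterating on a prompt.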

## Techniques

Crafting a good prompt alone, also known as zero-shot prompting, may not be enough to get the results you want. You may need to try a few prompting techniques to get the best performance.

This section covers a few prompting techniques.

### Few-shot prompting

Few-shot prompting improves accuracy and performance by including specific examples of what a model should generate given an input. The explicit examples give the model a better understanding of the task and the output format you're looking for. Try experimenting with different numbers of examples (2, 4, 8, etc.) to see how it affects performance. The example below provides the model with 1 example (1-shot) of the output format (a date in MM/DD/YYYY format) it should return.

In [None]:
from transformers import pipeline
import torch

pipeline = pipeline(task="text-generation", model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt = """Text: The first human went into space and orbited the Earth on April 12, 1961.
Date: 04/12/1961
Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.
Date:"""

outputs = pipeline(prompt, max_new_tokens=12, do_sample=True, top_k=10)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The downside of few-shot prompting is that you need to create lengthier prompts, which increases computation and latency. There is also a limit to prompt lengths. Finally, a model can learn unintended patterns from your examples, and few-shot prompting may not work well on complex reasoning tasks.
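
To experiment with different numbers of examples, the few-shot prompt can be assembled from a list instead of being written by hand. The sketch below assumes the `Text:` / `Date:` layout used in the cell above; `build_few_shot_prompt` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(examples: Sequence[Tuple[str, str]], query: str) -> str:
    """Join (text, date) examples and a final query into one few-shot prompt."""
    blocks = [f"Text: {text}\nDate: {answer}" for text, answer in examples]
    # The last block leaves "Date:" open so the model completes it.
    blocks.append(f"Text: {query}\nDate:")
    return "\n".join(blocks)


examples = [
    ("The first human went into space and orbited the Earth on April 12, 1961.", "04/12/1961"),
]
query = (
    "The first-ever televised presidential debate in the United States took place on "
    "September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon."
)
prompt = build_few_shot_prompt(examples, query)
print(prompt)
```

With the single example above, this reproduces the 1-shot prompt from the earlier cell. Adding tuples to `examples` gives 2-shot, 4-shot, and so on.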

To improve few-shot prompting for modern instruction-tuned LLMs, use a model's specific [chat template](https://huggingface.co/docs/transformers/main/en/tasks/../conversations). These models are trained on datasets with turn-based conversations between a "user" and "assistant". Structuring your prompt to align with this can improve performance.

Structure your prompt as a turn-based conversation and use the `apply_chat_template` method to format it. With `tokenize=False`, the method returns the formatted prompt as a string instead of token IDs.

In [6]:
from transformers import pipeline
import torch

pipeline = pipeline(model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Text: The first human went into space and orbited the Earth on April 12, 1961."},
    {"role": "assistant", "content": "Date: 04/12/1961"},
    {"role": "user", "content": "Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon."}
]

prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipeline(prompt, max_new_tokens=12, do_sample=True, top_k=10)

for output in outputs:
    print(f"Result: {output['generated_text']}")

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Result: <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 17 Nov 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: The first human went into space and orbited the Earth on April 12, 1961.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Date: 04/12/1961<|eot_id|><|start_header_id|>user<|end_header_id|>

Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Date: 09/28/1960


While the basic few-shot prompting approach embedded examples within a single text string, the chat template format offers the following benefits.

- The model may better understand the task because it can recognize the pattern and the expected roles of user input and assistant output.
- The model may produce the desired output format more consistently because the prompt is structured like the conversations it saw during training.

Always consult a specific instruction-tuned model's documentation to learn more about the format of its chat template so that you can structure your few-shot prompts accordingly.
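
The list of messages can also be generated from example pairs, which keeps the user and assistant turns consistent as you add examples. A minimal sketch follows; `build_few_shot_messages` is a hypothetical helper, and the `role` and `content` keys follow the message format used in the cell above.

```python
from typing import Dict, List, Sequence, Tuple


def build_few_shot_messages(examples: Sequence[Tuple[str, str]], query: str) -> List[Dict[str, str]]:
    """Turn (text, date) examples and a final query into alternating chat turns."""
    messages: List[Dict[str, str]] = []
    for text, answer in examples:
        messages.append({"role": "user", "content": f"Text: {text}"})
        messages.append({"role": "assistant", "content": f"Date: {answer}"})
    # The conversation ends on a user turn so the model answers next.
    messages.append({"role": "user", "content": f"Text: {query}"})
    return messages


messages = build_few_shot_messages(
    [("The first human went into space and orbited the Earth on April 12, 1961.", "04/12/1961")],
    "The first-ever televised presidential debate in the United States took place on "
    "September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.",
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same shape as the `messages` variable above, so it can be passed to `apply_chat_template` in the same way.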

### Chain-of-thought

Chain-of-thought (CoT) is effective at generating more coherent and well-reasoned outputs by providing a series of prompts that help a model "think" more thoroughly about a topic.

The example below provides the model with several prompts to work through intermediate reasoning steps.

In [8]:
from transformers import pipeline
import torch

pipeline = pipeline(model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt = """Let's go through this step-by-step:
1. You start with 15 muffins.
2. You eat 2 muffins, leaving you with 13 muffins.
3. You give 5 muffins to your neighbor, leaving you with 8 muffins.
4. Your partner buys 6 more muffins, bringing the total number of muffins to 14.
5. Your partner eats 2 muffins, leaving you with 12 muffins.
If you eat 6 muffins, how many are left?"""

outputs = pipeline(prompt, max_new_tokens=20, do_sample=True, top_k=10)
for output in outputs:
    print(f"Result: {output['generated_text']}")
# Expected result: the prompt above followed by "Answer: 6"

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Result: Let's go through this step-by-step:
1. You start with 15 muffins.
2. You eat 2 muffins, leaving you with 13 muffins.
3. You give 5 muffins to your neighbor, leaving you with 8 muffins.
4. Your partner buys 6 more muffins, bringing the total number of muffins to 14.
5. Your partner eats 2 muffins, leaving you with 12 muffins.
If you eat 6 muffins, how many are left? 

## Step 1: Identify the current number of muffins.
You have 12 muffins


Like [few-shot](#few-shot-prompting) prompting, CoT has downsides: it takes more effort to design a series of prompts that help the model reason through a complex task, and the longer prompt increases latency.
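
When writing intermediate steps by hand, it helps to check the arithmetic before putting it in a prompt, since a wrong step would mislead the model. The sketch below recomputes the running totals from the muffin example. `trace_steps` is a hypothetical helper introduced for illustration.

```python
from typing import List, Tuple


def trace_steps(start: int, changes: List[Tuple[str, int]]) -> List[int]:
    """Return the running total after each change, beginning with the start value."""
    totals = [start]
    for _, delta in changes:
        totals.append(totals[-1] + delta)
    return totals


changes = [
    ("you eat 2 muffins", -2),
    ("you give 5 muffins to your neighbor", -5),
    ("your partner buys 6 more muffins", +6),
    ("your partner eats 2 muffins", -2),
    ("you eat 6 muffins", -6),
]
totals = trace_steps(15, changes)
for (description, _), total in zip(changes, totals[1:]):
    print(f"After {description}: {total}")
```

The last total is 6, which matches the expected answer to the final question in the prompt.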

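### Planning

A planning prompt asks the model to first understand the problem and devise a plan, and then to carry out that plan step by step. The example below appends such an instruction to a word problem.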

In [11]:
from transformers import pipeline
import torch

pipeline = pipeline(model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt= """
Q: In a dance class of 20 students,
20% enrolled in contemporary dance,
25% of the remaining enrolled in jazz dance,
and the rest enrolled in hip-hop dance.
What percentage of the entire students enrolled in hip-hop dance?
A: Let's first understand the
problem and devise a plan to solve the problem.
Then, let's carry out the plan and solve the problem
step by step.

"""

outputs = pipeline(prompt, max_new_tokens=512, do_sample=True, top_k=10)
for output in outputs:
    print(f"Result: {output['generated_text']}")
# Expected answer: 60%

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Result: 
Q: In a dance class of 20 students, 
20% enrolled in contemporary dance, 
25% of the remaining enrolled in jazz dance, 
and the rest enrolled in hip-hop dance. 
What percentage of the entire students enrolled in hip-hop dance?
A: Let's first understand the
problem and devise a plan to solve the problem.
Then, let's carry out the plan and solve the problem 
step by step.

First, we are given that 20% of 20 students are enrolled in contemporary dance. We can find the number of students enrolled in contemporary dance by calculating 20% of 20, which is 4.
Therefore, the number of students enrolled in contemporary dance is 4.
Now, let's find the number of students remaining after 4 students are enrolled in contemporary dance. The number of remaining students is 20 - 4 = 16.
Next, we are given that 25% of the remaining 16 students are enrolled in jazz dance. We can find the number of students enrolled in jazz dance by calculating 25% of 16, which is 4.
Therefore, the number of stude
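
The captured output above is cut off before the final answer, so here is the same plan worked through in code as a check on the expected 60%. This is only a verification sketch; it uses `fractions.Fraction` to avoid floating-point rounding.

```python
from fractions import Fraction

total_students = 20

# Step 1: 20% of all students enroll in contemporary dance.
contemporary = total_students * Fraction(20, 100)
# Step 2: the remaining students after contemporary dance.
remaining = total_students - contemporary
# Step 3: 25% of the remaining students enroll in jazz dance.
jazz = remaining * Fraction(25, 100)
# Step 4: everyone else enrolls in hip-hop dance.
hip_hop = remaining - jazz
# Step 5: express hip-hop enrollment as a percentage of the whole class.
hip_hop_percent = hip_hop / total_students * 100

print(f"contemporary={contemporary}, jazz={jazz}, hip_hop={hip_hop}")
print(f"hip-hop share: {hip_hop_percent}%")
```

Equivalently, the hip-hop share is (1 - 0.20) × (1 - 0.25) = 0.60 of the class, independent of the class size.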

## Fine-tuning

While prompting is a powerful way to work with LLMs, there are scenarios where using a fine-tuned model, or even fine-tuning a model yourself, works better.

Here are some example scenarios where a fine-tuned model makes sense.

- Your domain is extremely different from what an LLM was pretrained on, and extensive prompting didn't produce the results you want.
- Your model needs to work well in a low-resource language.
- Your model needs to be trained on sensitive data that have strict regulatory requirements.
- You're using a small model due to cost, privacy, infrastructure, or other constraints.

In all of these scenarios, ensure that you have a large enough domain-specific dataset to train your model with, that you have enough time and resources, and that the cost of fine-tuning is worth it. Otherwise, you may be better off trying to optimize your prompt.

## Examples

The examples below demonstrate prompting an LLM for different tasks.

<hfoptions id="tasks">
<hfoption id="named entity recognition">

In [18]:
from transformers import pipeline
import torch

pipeline = pipeline(model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt = """Return a list of named entities in the text.
Text: Hugging Face was founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and
Thomas Wolf in New York City, originally as a company that developed a chatbot app targeted at teenagers.
Named entities (ORG, PER, LOC):
- ORG: Hugging Face
- PER: Clément Delangue, Julien Chaumond, Thomas Wolf
- LOC: New York City
Text: Microsoft was founded on April 4, 1975, by Bill Gates and Paul Allen in Albuquerque, New Mexico.
Named entities (ORG, PER, LOC):
"""

outputs = pipeline(prompt, max_new_tokens=50, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")
# Expected entities: ORG: Microsoft; PER: Bill Gates, Paul Allen; LOC: Albuquerque, New Mexico

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Result: - ORG: Microsoft
- PER: Bill Gates, Paul Allen
- LOC: Albuquerque, New Mexico

## Step 1: Define the task
The task is to extract named entities from a given text. Named entities include organizations (ORG
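
Because the model keeps generating after the entity list (here with `## Step 1: ...`), downstream code usually needs to keep only the lines that follow the `- LABEL: value` pattern from the prompt. The sketch below shows one way to do that. `parse_entities` is a hypothetical helper, and the generated text is hard-coded from the output above so the snippet runs without a model.

```python
import re
from typing import Dict

# Matches lines such as "- ORG: Microsoft" for the three labels used in the prompt.
ENTITY_LINE = re.compile(r"^-\s*(ORG|PER|LOC):\s*(.+)$")


def parse_entities(generated_text: str) -> Dict[str, str]:
    """Collect the first value seen for each label and ignore all other lines."""
    entities: Dict[str, str] = {}
    for line in generated_text.splitlines():
        match = ENTITY_LINE.match(line.strip())
        if match:
            label, value = match.groups()
            entities.setdefault(label, value.strip())
    return entities


generated = (
    "- ORG: Microsoft\n"
    "- PER: Bill Gates, Paul Allen\n"
    "- LOC: Albuquerque, New Mexico\n"
    "\n"
    "## Step 1: Define the task\n"
    "The task is to extract named entities from a given text."
)
print(parse_entities(generated))
```

The values are kept as raw strings on purpose: splitting on commas would separate `Bill Gates` from `Paul Allen` correctly, but would also break `Albuquerque, New Mexico` into two locations.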


</hfoption>
<hfoption id="translation">

In [21]:
from transformers import pipeline
import torch

pipeline = pipeline(model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt = """Translate the English text to Guarani.
Text: Sometimes, I've believed as many as six impossible things before breakfast.
Translation:
"""

outputs = pipeline(prompt, max_new_tokens=20, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")
# Result: À l'occasion, j'ai croyu plus de six choses impossibles
# Result: A veces, he llegado a creer hasta seis cosas imposibles antes del desayuno.

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Result: Oka, oka, some time, I believe up to six things that are not possible before


</hfoption>
<hfoption id="summarization">

In [24]:
from transformers import pipeline
import torch

pipeline = pipeline(model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt = """Permaculture is a design process mimicking the diversity, functionality and resilience of natural ecosystems. The principles and practices are drawn from traditional ecological knowledge of indigenous cultures combined with modern scientific understanding and technological innovations. Permaculture design provides a framework helping individuals and communities develop innovative, creative and effective strategies for meeting basic needs while preparing for and mitigating the projected impacts of climate change.
Write a summary of the above text (in 3 bullet points).
Summary:
"""

outputs = pipeline(prompt, max_new_tokens=60, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")
# Result: Permaculture is the design process that involves mimicking natural ecosystems to provide sustainable solutions to basic needs. It is a holistic approach that comb

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Result: • Permaculture is a design process that mimics the diversity and functionality of natural ecosystems.
• It combines traditional ecological knowledge from indigenous cultures with modern scientific understanding and technological innovations.
• Permaculture aims to provide a framework for individuals and communities to develop innovative and effective strategies for meeting basic needs while


</hfoption>
<hfoption id="question answering">

In [27]:
from transformers import pipeline
import torch

pipeline = pipeline(model="meta-llama/Llama-3.2-3B-Instruct", dtype=torch.bfloat16, device_map="auto")
prompt = """Answer the question using the context below.
Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors.
Question: What modern tool is used to make gazpacho?
Answer:
"""

outputs = pipeline(prompt, max_new_tokens=10, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")
# Result: A blender or food processor is the modern tool

Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Result: Traditional tool: Mortar
Modern tool: Blender


</hfoption>
</hfoptions>

# Refactoring
This section refactors the notebook above around a helper function, `generate_text_with_model`, that handles model loading, caching, and text generation. Calls to this function replace the direct `pipeline` initializations.

## Add helper function and model selection


Define a `CURRENT_MODEL` variable and a helper function `generate_text_with_model`. This function will manage model loading, caching, and text generation for both raw prompts and chat templates.



In [None]:
# Define the current model to be used.
# Uncomment exactly one alternative below to switch models.
# mistral
# CURRENT_MODEL = "mistralai/Mistral-7B-Instruct-v0.3"
# llama
CURRENT_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
# CURRENT_MODEL = "meta-llama/Llama-3.2-1B-Instruct"
# gemma
# CURRENT_MODEL = "google/gemma-2-2b-it"
# phi
# CURRENT_MODEL = "microsoft/Phi-4-mini-instruct"
# qwen
# CURRENT_MODEL = "Qwen/Qwen3-4B-Instruct-2507"
# image-text-to-text
# CURRENT_MODEL = "google/gemma-3-4b-it"
# CURRENT_MODEL = "google/gemma-3n-E2B-it"
# CURRENT_MODEL = "Qwen/Qwen3-VL-4B-Instruct"

In [None]:
from transformers import pipeline
import torch

# Cache for loaded pipelines to avoid re-loading models
_cached_pipelines = {}

def generate_text_with_model(model_name: str, prompt, **kwargs):
    """
    Helper function to load a model (with caching) and generate text.

    Args:
        model_name (str): The name of the model to use (e.g., "mistralai/Mistral-7B-Instruct-v0.3").
        prompt: The input prompt, either a string or a list of messages for chat template.
        **kwargs: Additional arguments to pass to the pipeline for text generation.

    Returns:
        list: The generated text output from the pipeline.
    """
    global _cached_pipelines

    # Load or retrieve pipeline from cache
    if model_name not in _cached_pipelines:
        print(f"Loading model: {model_name}")
        _cached_pipelines[model_name] = pipeline(
            task="text-generation", # image...
            model=model_name,
            dtype=torch.bfloat16,
            device_map="auto"
        )
    else:
        print(f"Using cached model: {model_name}")

    current_pipeline = _cached_pipelines[model_name]

    # Prepare the prompt based on its type (string or list of messages)
    if isinstance(prompt, list):
        # For chat templates, use the tokenizer to format messages
        formatted_prompt = current_pipeline.tokenizer.apply_chat_template(
            prompt,
            tokenize=False,
            add_generation_prompt=True
        )
    else:
        formatted_prompt = prompt

    # Generate text
    outputs = current_pipeline(formatted_prompt, **kwargs)
    return outputs
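
The caching part of the helper function can be looked at in isolation. The sketch below shows the same load-once pattern with a stand-in loader, so it runs without downloading a model. `make_cached_loader` and `fake_load` are hypothetical names introduced for illustration, and the string returned by `fake_load` only stands in for a real pipeline object.

```python
from typing import Callable, Dict, List


def make_cached_loader(load: Callable[[str], object]) -> Callable[[str], object]:
    """Wrap a loader so that each model name is loaded at most once."""
    cache: Dict[str, object] = {}

    def get(model_name: str) -> object:
        if model_name not in cache:
            cache[model_name] = load(model_name)
        return cache[model_name]

    return get


load_calls: List[str] = []


def fake_load(model_name: str) -> str:
    """Stand-in for an expensive model load; records every call."""
    load_calls.append(model_name)
    return f"pipeline for {model_name}"


get_pipeline = make_cached_loader(fake_load)
get_pipeline("model-a")
get_pipeline("model-a")  # served from the cache, no second load
get_pipeline("model-b")
print(load_calls)  # ['model-a', 'model-b']
```

One trade-off of this pattern is that every cached entry stays in memory until it is removed, so switching between several large models in one session may exhaust GPU memory.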

## Zero-shot

Sentiment classification, with a call to the helper function in place of the direct `pipeline` initialization.



In [None]:
import torch

# pipeline = pipeline(task="text-generation", model="mistralai/Mistral-7B-Instruct-v0.3", dtype=torch.bfloat16, device_map="auto") # Removed direct pipeline initialization

prompt = """
Classify the text into neutral, negative or positive.
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
"""

# Use the helper function to generate text
outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=10)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Loading model: mistralai/Mistral-7B-Instruct-v0.3


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Result: 
Classify the text into neutral, negative or positive.
Text: This movie is definitely one of my favorite movies of its kind. The interaction between respectable and morally strong characters is an ode to chivalry and the honor code amongst thieves and policemen.
Sentiment:
Positive

The text expresses a positive


## Few-shot

The same few-shot prompt, passed to the helper function along with the generation arguments.



In [None]:
from transformers import pipeline
import torch

# pipeline = pipeline(task="text-generation", model="mistralai/Mistral-7B-Instruct-v0.3", dtype=torch.bfloat16, device_map="auto") # Removed direct pipeline initialization
prompt = """Text: The first human went into space and orbited the Earth on April 12, 1961.
Date: 04/12/1961
Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.
Date:"""

# Use the helper function to generate text
outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=12, do_sample=True, top_k=10)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using cached model: mistralai/Mistral-7B-Instruct-v0.3
Result: Text: The first human went into space and orbited the Earth on April 12, 1961.
Date: 04/12/1961
Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.
Date: 09/28/1960



### Chat template

Pass the list of messages to `generate_text_with_model` instead of initializing `pipeline` directly. Because the prompt is a list, the helper function applies the **chat template** before generating.



In [None]:
import torch

# pipeline = pipeline(model="mistralai/Mistral-7B-Instruct-v0.1", dtype=torch.bfloat16, device_map="auto") # Removed direct pipeline initialization

messages = [
    {"role": "user", "content": "Text: The first human went into space and orbited the Earth on April 12, 1961."},
    {"role": "assistant", "content": "Date: 04/12/1961"},
    {"role": "user", "content": "Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon."}
]

# Use the helper function to generate text
outputs = generate_text_with_model(CURRENT_MODEL, messages, max_new_tokens=12, do_sample=True, top_k=10)

for output in outputs:
    print(f"Result: {output['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using cached model: mistralai/Mistral-7B-Instruct-v0.3
Result: <s>[INST] Text: The first human went into space and orbited the Earth on April 12, 1961.[/INST] Date: 04/12/1961</s>[INST] Text: The first-ever televised presidential debate in the United States took place on September 28, 1960, between presidential candidates John F. Kennedy and Richard Nixon.[/INST] Date: 09/28/196
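
In the output above, the answer is cut off at `09/28/196` because generation stopped at `max_new_tokens=12`. A format check can flag such incomplete answers automatically. The sketch below assumes the MM/DD/YYYY format from the few-shot examples; `extract_last_date` is a hypothetical helper that reads whatever follows the last `Date:` marker.

```python
import re
from datetime import date, datetime
from typing import Optional

# Two digits, slash, two digits, slash, exactly four digits.
DATE_PATTERN = re.compile(r"\s*(\d{2}/\d{2}/\d{4})(?!\d)")


def extract_last_date(generated_text: str, marker: str = "Date:") -> Optional[date]:
    """Parse the date after the last marker, or return None if it is missing or invalid."""
    _, found, tail = generated_text.rpartition(marker)
    if not found:
        return None
    match = DATE_PATTERN.match(tail)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%m/%d/%Y").date()
    except ValueError:
        # The pattern matched, but the value is not a real calendar date.
        return None


print(extract_last_date("[/INST] Date: 04/12/1961</s>[INST] ...[/INST] Date: 09/28/196"))
print(extract_last_date("[/INST] Date: 04/12/1961</s>[INST] ...[/INST] Date: 09/28/1960"))
```

The first call returns `None` for the truncated answer and the second returns a `date` object. A `None` result is a hint to raise `max_new_tokens` or to retry the generation.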


## Chain-of-thought

Pass the prompt and the generation arguments to `generate_text_with_model` instead of initializing `pipeline` directly.



In [None]:
import torch

# pipeline = pipeline(model="mistralai/Mistral-7B-Instruct-v0.1", dtype=torch.bfloat16, device_map="auto") # Removed direct pipeline initialization
prompt = """Let's go through this step-by-step:
1. You start with 15 muffins.
2. You eat 2 muffins, leaving you with 13 muffins.
3. You give 5 muffins to your neighbor, leaving you with 8 muffins.
4. Your partner buys 6 more muffins, bringing the total number of muffins to 14.
5. Your partner eats 2 muffins, leaving you with 12 muffins.
If you eat 6 muffins, how many are left?"""

# Use the helper function to generate text
outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=20, do_sample=True, top_k=10)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using cached model: mistralai/Mistral-7B-Instruct-v0.3
Result: Let's go through this step-by-step:
1. You start with 15 muffins.
2. You eat 2 muffins, leaving you with 13 muffins.
3. You give 5 muffins to your neighbor, leaving you with 8 muffins.
4. Your partner buys 6 more muffins, bringing the total number of muffins to 14.
5. Your partner eats 2 muffins, leaving you with 12 muffins.
If you eat 6 muffins, how many are left?

You started with 12 muffins, and after eating 6 muffins


## Prompt engineering for NLP tasks

### NER (Named Entity Recognition)



In [None]:
import torch

# pipeline = pipeline(model="mistralai/Mistral-7B-Instruct-v0.1", dtype=torch.bfloat16, device_map="auto") # Removed direct pipeline initialization
prompt = """Return a list of named entities in the text.
Text: The company was founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf in New York City, originally as a company that developed a chatbot app targeted at teenagers.
Named entities:
"""

outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=50, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using cached model: mistralai/Mistral-7B-Instruct-v0.3
Result: - 2016
- French
- Clément Delangue
- Julien Chaumond
- Thomas Wolf
- New York City
- chatbot
- teenagers
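
The model returns the entities as a dash-prefixed list inside a single string. To use them downstream, the string can be split into a Python list. The sketch below assumes the output keeps the `- item` shape shown above, which a generative model does not guarantee; `parse_bulleted_entities` is a hypothetical helper name, not part of the notebook or of `transformers`.

```python
def parse_bulleted_entities(generated_text: str) -> list[str]:
    """Collect the items of a '- item' list, ignoring any other lines."""
    entities = []
    for line in generated_text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            entities.append(line[2:].strip())
    return entities


# The generated text recorded in the output above.
sample_output = """- 2016
- French
- Clément Delangue
- Julien Chaumond
- Thomas Wolf
- New York City
- chatbot
- teenagers"""

entities = parse_bulleted_entities(sample_output)
print(entities)
```

Note that the model also lists `chatbot` and `teenagers`, which are not named entities in the usual sense, so a stricter prompt or a post-filter may still be needed.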


### Machine Translation



In [None]:
import torch

prompt = """Translate the English text to French.
Text: Sometimes, I've believed as many as six impossible things before breakfast.
Translation:
"""

outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=20, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using cached model: mistralai/Mistral-7B-Instruct-v0.3
Result: Sometimes, je crois jusqu'à six choses impossibles avant le petit déje
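
The NER and translation prompts above share one shape: an instruction line, a labelled input, and an empty output label for the model to complete. That pattern can be captured in a small template function. This is only a sketch; `build_prompt` and its parameters are hypothetical names introduced here for illustration.

```python
def build_prompt(instruction: str, inputs: dict[str, str], answer_label: str) -> str:
    """Assemble an 'instruction / labelled inputs / empty answer label' prompt."""
    lines = [instruction]
    lines += [f"{label}: {value}" for label, value in inputs.items()]
    lines.append(f"{answer_label}:")
    # End with a newline, as the hand-written prompts above do.
    return "\n".join(lines) + "\n"


translation_prompt = build_prompt(
    "Translate the English text to French.",
    {"Text": "Sometimes, I've believed as many as six impossible things before breakfast."},
    "Translation",
)
print(translation_prompt)
```

The resulting string is identical to the hand-written translation prompt, so it could be passed to `generate_text_with_model` in the same way.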


### Summarization



In [None]:
import torch

prompt = """Permaculture is a design process mimicking the diversity, functionality and resilience of natural ecosystems. The principles and practices are drawn from traditional ecological knowledge of indigenous cultures combined with modern scientific understanding and technological innovations. Permaculture design provides a framework helping individuals and communities develop innovative, creative and effective strategies for meeting basic needs while preparing for and mitigating the projected impacts of climate change.
Write a summary of the above text.
Summary:
"""

outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=30, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using cached model: mistralai/Mistral-7B-Instruct-v0.3
Result: Permaculture is a design process that mimics the diversity, functionality, and resilience of natural ecosystems. It combines traditional ecological


### QA (Question Answering)



In [None]:
import torch

prompt = """Answer the question using the context below.
Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors.
Question: What modern tool is used to make gazpacho?
Answer:
"""

outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=10, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Loading model: mistralai/Mistral-7B-Instruct-v0.3


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Result: A blender or food processor is the modern tool


In [None]:
import torch

prompt = """Answer the question using the context below.
Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors.
Question: What modern tool is used to make gazpacho?
Answer:
"""

outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=10, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using cached model: mistralai/Mistral-7B-Instruct-v0.3
Result: A modern tool used to make gazpacho


In [None]:
import torch

prompt = """Answer the question using the context below.
Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors.
Question: What modern tool is used to make gazpacho?
Answer:
"""

outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=10, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Using cached model: mistralai/Mistral-7B-Instruct-v0.3
Result: Blenders or food processors are modern tools used
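
The three runs above use the same prompt yet return differently worded answers, because `do_sample=True` with `top_k=10` draws each next token at random from the ten most likely candidates. The toy sketch below illustrates the idea of top-k sampling in pure Python. It is not the `transformers` implementation, and the `toy_logits` values are invented for illustration.

```python
import math
import random


def top_k_sample(logits: dict[str, float], k: int, rng: random.Random) -> str:
    """Sample one token from the k highest-scoring candidates."""
    # Keep only the k tokens with the largest logits.
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    # Softmax weights over the survivors (subtract the max for stability).
    max_logit = top[0][1]
    weights = [math.exp(score - max_logit) for _, score in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token scores, made up for this example.
toy_logits = {"A": 2.0, "Blenders": 1.8, "The": 0.5, "Mortar": -1.0}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [top_k_sample(toy_logits, k=2, rng=rng) for _ in range(5)]
print(samples)
```

With `k=2`, only `A` and `Blenders` can ever be drawn, and with `k=1` the procedure reduces to greedy decoding, which always picks the single most likely token.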


Let's try extractive QA, where the answer should be an exact span of the context.

In [None]:
import torch

prompt = """Answer the question by returning the exact portion of the context below that answers it.
Context: Gazpacho is a cold soup and drink made of raw, blended vegetables. Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, garlic, olive oil, wine vinegar, water, and salt. Northern recipes often include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho was made by pounding the vegetables in a mortar with a pestle; this more laborious method is still sometimes used as it helps keep the gazpacho cool and avoids the foam and silky consistency of smoothie versions made in blenders or food processors.
Question: What modern tool is used to make gazpacho?
Answer:
"""

outputs = generate_text_with_model(CURRENT_MODEL, prompt, max_new_tokens=10, do_sample=True, top_k=10, return_full_text=False)
for output in outputs:
    print(f"Result: {output['generated_text']}")

Loading model: mistralai/Mistral-7B-Instruct-v0.3


Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Result: Blenders or food processors are used to make
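
Whether an answer is truly extractive can be checked mechanically: it must occur verbatim in the context. The sketch below applies that check to the result above, ignoring case and whitespace differences. `is_extractive` and `normalize` are hypothetical helper names introduced here, and the case-insensitive comparison is a design assumption rather than a standard definition.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_extractive(answer: str, context: str) -> bool:
    """True if the (normalized) answer appears verbatim in the context."""
    answer_norm = normalize(answer)
    return bool(answer_norm) and answer_norm in normalize(context)


context = (
    "Gazpacho is a cold soup and drink made of raw, blended vegetables. "
    "Most gazpacho includes stale bread, tomato, cucumbers, onion, bell peppers, "
    "garlic, olive oil, wine vinegar, water, and salt. Northern recipes often "
    "include cumin and/or pimentón (smoked sweet paprika). Traditionally, gazpacho "
    "was made by pounding the vegetables in a mortar with a pestle; this more "
    "laborious method is still sometimes used as it helps keep the gazpacho cool "
    "and avoids the foam and silky consistency of smoothie versions made in "
    "blenders or food processors."
)

recorded_answer = "Blenders or food processors are used to make"
print(is_extractive(recorded_answer, context))
print(is_extractive("blenders or food processors", context))
```

The recorded answer fails the check because the model paraphrased instead of copying a span, while the phrase `blenders or food processors` passes. This suggests the instruction alone did not make the model behave extractively in this run.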
