# TL07 - Chats with Transformers

This notebook introduces **chats**, **templates**, **tools**, and **RAG** (Retrieval-Augmented Generation) using the `transformers` library.

## Notebook structure:
1. **Basics**: Using `TextGenerationPipeline` for chats
2. **Templates**: Applying chat templates with `apply_chat_template`
3. **Writing templates**: Working with Jinja2 templates
4. **Hands-on exercise**: Implementing a chatbot with Qwen
5. **Tools and RAG**: Using external tools to enrich the model's responses


## 1. Basics: TextGenerationPipeline

In this section we use `TextGenerationPipeline` to chat with language models. The pipeline accepts a conversation as a list of dictionaries, each with a `role` (`system`, `user`, or `assistant`) and a `content` string, and returns the conversation extended with the model's reply.
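As a minimal sketch of the message format described above, a chat is just a plain Python list of dictionaries. The helper `validate_chat` is hypothetical (it is not part of `transformers`) and only checks the shape of the data.

```python
# Hypothetical helper (not part of transformers): checks that a chat follows
# the list-of-dicts format with the three roles mentioned above.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_chat(chat):
    """Return True if every message is a dict with a known role and string content."""
    return all(
        isinstance(m, dict)
        and m.get("role") in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
        for m in chat
    )

chat = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is a transformer?"},
]
print(validate_chat(chat))
```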


In [None]:
from transformers import pipeline
pipe = pipeline(model="Qwen/Qwen3-4B-Thinking-2507", device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0


### Extracting and displaying the response

We extract the content of the assistant's reply and render it as Markdown. The reply contains the mathematical explanation of the derivative of the sigmoid function.
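To make the indexing concrete, here is a sketch with a hand-written stand-in for the pipeline output. The `mock_output` value is an assumption for illustration only (a real reply is much longer). It mirrors the shape the extraction code relies on: a list with one result whose `generated_text` holds the conversation, ending with the assistant's message.

```python
# Hand-written stand-in for a chat pipeline result (illustrative values only)
mock_output = [
    {
        "generated_text": [
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "2 + 2 = 4"},
        ]
    }
]

# First result -> full conversation -> last message -> its text
assistant_content = mock_output[0]["generated_text"][-1]["content"]
print(assistant_content)
```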


In [None]:
# Prompt (Spanish): "What is the derivative of the sigmoid in terms of the sigmoid itself?"
chat = [{"role": "user", "content": '¿Cuál es la derivada de la sigmoide en función de la propia sigmoide?'}]
output = pipe(chat, max_new_tokens=32768, num_beams=2)

## 2. Templates: apply_chat_template

**Templates** format chat messages into the layout each model expects. We use `apply_chat_template` to convert the messages into that format before passing them to the model.

In this example:
- We define messages with the `system` and `user` roles
- We apply the template with `add_generation_prompt=True`, which appends the opening of an assistant turn so the model writes the next message
- `enable_thinking=False` turns off the model's reasoning mode
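The effect of these two flags can be sketched in plain Python. The function `render_chatml` below is a hypothetical, heavily simplified imitation of the format that appears in the decoded prompts this notebook prints for the pirate example. It is not the tokenizer's real template, which also handles tools, reasoning content, and other cases.

```python
def render_chatml(messages, add_generation_prompt=False, enable_thinking=True):
    """Simplified imitation of a ChatML-style chat template (illustration only)."""
    text = ""
    for m in messages:
        # Each turn is wrapped in start/end markers together with its role
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
        if not enable_thinking:
            # An empty reasoning block signals that thinking should be skipped
            text += "<think>\n\n</think>\n\n"
    return text

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(render_chatml(messages, add_generation_prompt=True, enable_thinking=False))
```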


In [None]:
assistant_content = output[0]['generated_text'][-1]['content']

### Generating the response

We generate the model's response from the tokenized chat. The model answers in pirate style, as instructed by the system message.


## 3. Writing templates with Jinja2

Chat templates are written in **Jinja2**, a template engine for Python. We can inspect and work with a template directly using the `jinja2` library.

In this example:
- We access the tokenizer's template via `tokenizer.chat_template`
- We create a Jinja2 `Template` object
- We render the template with the messages to see how they are formatted
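As a rough illustration of what writing such a template involves, the sketch below defines a toy Jinja2 template by hand. The name `toy_template` and its contents are assumptions for this example. It only loops over the messages and optionally opens an assistant turn, whereas a real model template is considerably longer.

```python
from jinja2 import Template

# Toy chat template: one block per message, plus an optional assistant opener
toy_template = Template(
    "{% for m in messages %}"
    "<|im_start|>{{ m['role'] }}\n{{ m['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

rendered = toy_template.render(
    messages=[{"role": "user", "content": "Hi"}],
    add_generation_prompt=True,
)
print(rendered)
```

Variables that are not passed to `render` are treated as undefined and evaluate as false in the `if`, so omitting `add_generation_prompt` simply leaves the assistant opener out.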


In [None]:
from IPython.display import display, Markdown
display(Markdown("... "+assistant_content[-117:]))

... 

Por lo tanto, la derivada de la función sigmoide en función de sí misma es:

$$
\boxed{\sigma(x)(1 - \sigma(x))}
$$
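The boxed identity can be sanity-checked numerically. The sketch below compares $\sigma(x)(1 - \sigma(x))$ with a central finite-difference estimate of the derivative. The function names and the sample points are chosen here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # Closed form from the model's answer: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, sigmoid_derivative(x), numerical_derivative(sigmoid, x))
```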

## 4. Hands-on exercise: QwenChatbot

We implement a `QwenChatbot` class that:
- Keeps a conversation history
- Generates responses that preserve the context
- Supports the special commands `/no_think` and `/think` to control the reasoning mode

**Note**: The model is loaded in 8-bit to reduce memory usage.
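The history mechanism can be tried without loading a model. In this sketch, `ChatHistory` and `fake_model` are hypothetical stand-ins: `fake_model` takes the place of the tokenizer and `generate` calls, so only the bookkeeping of turns is shown.

```python
class ChatHistory:
    """Toy conversation loop; `generate_fn` is a stand-in for the model call."""

    def __init__(self, generate_fn):
        self.generate_fn = generate_fn
        self.history = []

    def ask(self, user_input):
        # The model sees all previous turns plus the new user message
        messages = self.history + [{"role": "user", "content": user_input}]
        response = self.generate_fn(messages)
        # Both turns are stored only after the call has returned
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})
        return response

def fake_model(messages):
    # Reports how much context it received instead of generating text
    return f"I have seen {len(messages)} message(s)."

bot = ChatHistory(fake_model)
print(bot.ask("Hello"))
print(bot.ask("Are you sure? /think"))
```

The second call receives three messages (the first user turn, the first reply, and the new question), which is what lets a real model answer a follow-up in context.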


### Defining the tool

We define a function that queries the current temperature of a location through the wttr.in API. This function is passed to the model as an available tool.
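For a model to use a function, the function has to be described in text, typically as a JSON schema derived from its name, type hints, and docstring. The sketch below shows the idea with a hypothetical helper, `simple_tool_schema`. It is a simplified illustration and should not be read as the conversion `transformers` actually performs, which covers many more cases. The tool itself is replaced by a stub so no network request is made.

```python
import inspect

# Minimal mapping from Python annotations to JSON Schema type names
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def simple_tool_schema(func):
    """Build a simplified, illustrative tool description from a function."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    properties = {
        name: {"type": JSON_TYPES.get(param.annotation, "string")}
        for name, param in signature.parameters.items()
    }
    required = [
        name
        for name, param in signature.parameters.items()
        if param.default is inspect.Parameter.empty
    ]
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": doc.splitlines()[0] if doc else "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }

def get_current_temperature(location: str):
    """Get the current temperature at a location."""
    return "unknown"  # stub: no network call in this sketch

schema = simple_tool_schema(get_current_temperature)
print(schema["function"]["name"], schema["function"]["parameters"])
```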


## 5. Tools and RAG

**Tools** let LLMs call external functions to obtain real-time information, perform calculations, or access APIs.

**RAG** (Retrieval-Augmented Generation) enriches the LLM's knowledge by retrieving related documents during inference.

In this example:
- We define a function `get_current_temperature` that queries the current temperature of a location
- The model decides when to use this tool
- The flow is: user message → tool call → model response using the retrieved information
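The example in this section covers tools only. For RAG, the following toy sketch conveys the retrieval step under strong simplifying assumptions: documents are ranked by word overlap with the query (real systems usually rely on embeddings and a vector index), and the names `retrieve` and `build_rag_messages`, as well as the sample documents, are invented for this illustration.

```python
import re

def tokenize(text):
    # Lowercased set of word tokens
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=1):
    """Return the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_rag_messages(query, documents, k=1):
    """Put the retrieved text into the system message, ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return [
        {"role": "system", "content": "Answer using only the context below.\n" + context},
        {"role": "user", "content": query},
    ]

documents = [
    "Valencia is a city on the Mediterranean coast of Spain.",
    "The sigmoid function maps real numbers to the interval (0, 1).",
    "Jinja2 is a template engine for Python.",
]
messages = build_rag_messages("What does the sigmoid function do?", documents)
print(messages[0]["content"])
```

The resulting `messages` list has the same shape as any other chat, so it could be passed to `apply_chat_template` like the other examples.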


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-4B" # "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")




### First generation: the tool call

The model generates a tool call (`tool_call`) in JSON format. We extract this call so we can run the function and obtain the actual temperature.


### Second generation: response using the tool's output

1. **Parse the tool call**: We extract the JSON of the call with a regular expression
2. **Add the call to the history**: We append the tool call to the messages
3. **Run the tool**: We call `get_current_temperature` and append its result
4. **Generate the final response**: The model produces an answer from the information returned by the tool, with `enable_thinking=True` so the reasoning process is visible
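Steps 1 to 3 can be sketched without a model by feeding in a tool-call string like the one the first generation produces. Here `handle_tool_call` and the `TOOLS` registry are hypothetical helpers, and `get_current_temperature` is a stub returning a made-up value instead of querying wttr.in.

```python
import json
import re

def get_current_temperature(location: str) -> str:
    """Stub standing in for the real wttr.in request (the value is made up)."""
    return "+21°C"

# Registry mapping tool names to callables
TOOLS = {"get_current_temperature": get_current_temperature}

def handle_tool_call(model_output, messages):
    """Parse one <tool_call> block, run the tool, and extend the history."""
    match = re.search(r"<tool_call>\s*(\{.*\})\s*</tool_call>", model_output, re.DOTALL)
    if match is None:
        return messages  # the model answered directly; nothing to run
    tool_call = json.loads(match.group(1))
    result = TOOLS[tool_call["name"]](**tool_call["arguments"])
    return messages + [
        {"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]},
        {"role": "tool", "name": tool_call["name"], "content": result},
    ]

model_output = '<tool_call>\n{"name": "get_current_temperature", "arguments": {"location": "València"}}\n</tool_call><|im_end|>'
messages = handle_tool_call(model_output, [{"role": "user", "content": "Temperature in València?"}])
print(messages[-1])
```

Looking the tool up by name in a registry, rather than hard-coding the call, keeps the same loop usable when several tools are offered to the model.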


In [None]:
messages = [
 {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
 {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True,
 enable_thinking=False, return_tensors="pt").to(model.device)
tokenizer.decode(tokenized_chat[0])

'<|im_start|>system\nYou are a friendly chatbot who always responds in the style of a pirate<|im_end|>\n<|im_start|>user\nHow many helicopters can a human eat in one sitting?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'

In [None]:
generated_ids = model.generate(tokenized_chat, num_beams=2, max_new_tokens=32768)
tokenizer.batch_decode(generated_ids)[0]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


'<|im_start|>system\nYou are a friendly chatbot who always responds in the style of a pirate<|im_end|>\n<|im_start|>user\nHow many helicopters can a human eat in one sitting?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nArrr, that’s a fine question, matey! But I’d say a human can’t eat no helicopters, nor no one wants to! They’re too big, too heavy, and they’ve got blades that’d cut a man in two! Yarrr! \n\nBut if ye’re askin’ about something else… like food? Then I’d say a human can eat a lot, but not a helicopter! Yarrr! What be ye really askin’? I’m a bit confused, matey!<|im_end|>'

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-4B" # "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
from jinja2 import Template
messages = [
 {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
 {"role": "user", "content": "How many helicopters can a human eat in one sitting?"}]
template = Template(tokenizer.chat_template)
template.render(messages=messages)

'<|im_start|>system\nYou are a friendly chatbot who always responds in the style of a pirate<|im_end|>\n<|im_start|>user\nHow many helicopters can a human eat in one sitting?<|im_end|>\n'

In [None]:
!pip install -U bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
  def __init__(self, model_name="Qwen/Qwen3-4B"):
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
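    # Load the model in 8-bit precision to reduce memory usage and prevent full disk offloading.
    # Note: `load_in_8bit` is deprecated (see the warning in the output below); the warning
    # recommends passing a `BitsAndBytesConfig` via `quantization_config` instead.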
    self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)
    self.history = []

  def generate_response(self, user_input):
    messages = self.history + [{"role": "user", "content": user_input}]
    text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
    response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
    response = self.tokenizer.decode(response_ids, skip_special_tokens=True)
    self.history.append({"role": "user", "content": user_input})
    self.history.append({"role": "assistant", "content": response})
    return response

chatbot = QwenChatbot()
user_input_1 = "¿Cuántas erres hay en Catarroja? /no_think"; print(f"User: {user_input_1}")
response_1 = chatbot.generate_response(user_input_1); print(f"Bot: {response_1}","\n----------------------")
user_input_2 = "¿Seguro? /think"; print(f"User: {user_input_2}")
response_2 = chatbot.generate_response(user_input_2); print(f"Bot: {response_2}")



The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



User: ¿Cuántas erres hay en Catarroja? /no_think
Bot: <think>

</think>

La pregunta "¿Cuántas erres hay en Catarroja?" parece ser una pregunta curiosa o incluso un juego de palabras, ya que **Catarroja** es un pueblo ubicado en la provincia de Valencia, España. Sin embargo, la palabra "Catarroja" no contiene la letra "r" en su nombre.

Vamos a comprobarlo:

- **Catarroja**: C - A - T - A - R - R - O - J - A

En este caso, **hay dos "r"** (mayúscula o minúscula, pero en la palabra "Catarroja" las letras "r" están escritas con mayúscula).

Por lo tanto, la respuesta es:

**Hay 2 "r" en Catarroja.** 
----------------------
User: ¿Seguro? /think
Bot: <think>
Okay, the user is asking if I'm sure about the number of "r"s in "Catarroja." Let me double-check. The word is Catarroja. Let's spell it out: C-A-T-A-R-R-O-J-A. So, the letters are C, A, T, A, R, R, O, J, A. That's two "R"s. Wait, but sometimes people might confuse the letters if they're not careful. Let me confirm again. T

In [None]:
import requests
def get_current_temperature(location: str):
 """
 Get the current temperature at a location.
 Args:
 location: The location to get the current temperature for, as a string.
 Returns:
 The current temperature at the specified location, as a string.
 """
 return requests.get("https://wttr.in/"+location+"?format=%t").content.decode()
tools = [get_current_temperature]

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-4B" # "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")




In [None]:
messages = [
 {"role": "system", "content": """
You are a bot that responds to weather queries.
You should reply with the current temperature at the specified location."""},
 {"role": "user", "content": "Hey, what's the temperature in València right now?"}
]
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True,
 enable_thinking=False, return_dict=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, num_beams=2, max_new_tokens=32768)
outputs_text = tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:])
outputs_text

'<tool_call>\n{"name": "get_current_temperature", "arguments": {"location": "València"}}\n</tool_call><|im_end|>'

In [5]:
import re, json; pattern = re.compile(r'{.*}', re.DOTALL)
tool_call = json.loads(re.search(pattern, outputs_text).group(0))
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
messages.append({"role": "tool", "name": "get_current_temperature",
 "content": get_current_temperature(tool_call['arguments']['location'])})
inputs = tokenizer.apply_chat_template(messages, tools=tools, add_generation_prompt=True,
 enable_thinking=True, return_dict=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, num_beams=3, max_new_tokens=32768)
print(tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):]))

KeyboardInterrupt: 