<a href="https://colab.research.google.com/github/Nebius-Academy/LLM-Engineering-Essentials/blob/main/topic4/4.1_open_source_models_solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Engineering Essentials by Nebius Academy

Course GitHub: [link](https://github.com/Nebius-Academy/LLM-Engineering-Essentials/tree/main)

The course is in development now, with more materials coming soon.

# Practice tasks - rewriting chat bots and agents with open source models

In Weeks 1-3 we wrote some boilerplate code for chat bots and agents. In the tasks below, you'll update it to use open source LLMs instead of APIs.

## Task 1. Updating RAGAgent

In this task, take the `RAGAgent` class you created in notebook 3.1 (or borrow it from the solution notebook) and replace the Nebius AI Studio API calls with calls to an open source LLM.

You can use the code from the "Tool usage" section of this notebook. Don't forget that you'll need a custom parser to extract tool calls from the model's answers. Also make sure to handle the case where the output contains no tool call.
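
As an illustration of what such a parser might look like, here is a minimal sketch. It assumes the model emits calls in the Python-like format `[func_name(param="value")]`, and it relies on the standard `ast` module to read them. The name `parse_tool_calls` is hypothetical and not part of the course code. Anything that isn't a well-formed call list yields an empty list, which the caller can treat as "no tool call".

```python
import ast
from typing import Any, Dict, List


def parse_tool_calls(completion: str) -> List[Dict[str, Any]]:
    """Parse '[func(a="x"), ...]' into a list of {"name", "arguments"} dicts.

    Returns an empty list when the text is not a well-formed call list.
    (Hypothetical helper; assumes keyword arguments with literal values.)
    """
    text = completion.strip()
    if not (text.startswith("[") and text.endswith("]")):
        return []
    try:
        tree = ast.parse(text, mode="eval")
    except SyntaxError:
        return []
    if not isinstance(tree.body, ast.List):
        return []

    calls = []
    for node in tree.body.elts:
        # Every element must look like name(...); otherwise reject the output
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            return []
        try:
            arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        except (ValueError, TypeError):
            return []
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


print(parse_tool_calls('[retrieve_information(query="2024 Olympics breaking")]'))
print(parse_tool_calls("It's a nice day indeed!"))
```

A strict parser like this rejects malformed variants such as `[retrieve_information, query="..."]`, which smaller models do sometimes produce. A more forgiving regular expression is a reasonable alternative if you'd rather accept them.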

In this task, you can assume that your agent gets only one prompt at a time, so you don't need batch inference.

If time and GPU resources allow, compare how the agent performs with Llama 3.2 models of different sizes. Generally, larger Llamas should be better at generating valid tool calls and nothing else.
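
One way to make this comparison a bit more quantitative is to collect raw completions from each model on the same prompts and measure how often the output is a bare tool call. The sketch below assumes the `[func_name(...)]` call format. The helper name `tool_call_only_rate` and the sample strings are made up for illustration; they are not real model outputs.

```python
import re
from typing import Iterable

# Matches a completion that consists solely of a bracketed call, e.g. [f(x="y")]
TOOL_CALL_ONLY = re.compile(r"^\[\s*\w+\(.*\)\s*\]$", re.DOTALL)


def tool_call_only_rate(completions: Iterable[str]) -> float:
    """Fraction of completions that contain a tool call and nothing else."""
    completions = list(completions)
    if not completions:
        return 0.0
    hits = sum(bool(TOOL_CALL_ONLY.match(c.strip())) for c in completions)
    return hits / len(completions)


# Illustrative strings only
samples = [
    '[retrieve_information(query="2024 Olympics breaking gold")]',
    'Sure! [retrieve_information(query="2024 Olympics breaking gold")]',
    "I don't know.",
]
print(f"{tool_call_only_rate(samples):.2f}")
```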

In [None]:
# <YOUR CODE HERE>

**Solution**

In [None]:
!pip install -q tavily-python

In [None]:
import os

with open("tavily_api_key", "r") as file:
    tavily_api_key = file.read().strip()
os.environ["TAVILY_API_KEY"] = tavily_api_key

with open("hf_access_token", "r") as file:
    hf_access_token = file.read().strip()
os.environ["HF_ACCESS_TOKEN"] = hf_access_token

In [None]:
from collections import defaultdict, deque
from typing import Dict, Any, List, Optional, Callable
import json
import traceback
import re
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class RAGAgent:
    def __init__(self,
                 model_name: str,
                 hf_access_token: str,
                 search_client,
                 history_size: int = 10,
                 get_system_message: Optional[Callable[[], str]] = None,
                 search_depth: str = "advanced",
                 max_search_results: int = 5
                 ):
        """Initialize the chat agent with RAG tool using local LLM.

        Args:
            model_name: The Hugging Face model name (e.g., "meta-llama/Llama-3.2-3B-Instruct")
            hf_access_token: Hugging Face access token
            search_client: Search client instance (for example, Tavily)
            history_size: Number of messages to keep in history per user
            get_system_message: Function to retrieve the system message
            search_depth: Depth of web search ('basic' or 'advanced')
            max_search_results: Maximum number of search results to retrieve
        """
        self.model_name = model_name
        self.hf_access_token = hf_access_token
        self.search_client = search_client
        self.history_size = history_size

        # Initialize tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_access_token)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            return_dict=True,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
            token=hf_access_token
        )

        # Set pad token if not already set
        if self.tokenizer.pad_token_id is None:
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
        if self.model.config.pad_token_id is None:
            self.model.config.pad_token_id = self.model.config.eos_token_id

        # If no system message function is provided, use the default one
        self.get_system_message = get_system_message if get_system_message else self._default_system_message

        self.search_depth = search_depth
        self.max_search_results = max_search_results

        # Initialize chat history storage
        self.chat_histories = defaultdict(lambda: deque(maxlen=history_size))

        # Function definitions for the LLM
        self.function_definitions = """
        [
            {
                "name": "retrieve_information",
                "description": "You ALWAYS use this function if you don't have enough information to answer user's query. For example, the user asks about something which is after your knowledge cutoff. In this case, you will use this function to query and get additional context to provide a complete and accurate answer.",
                "parameters": {
                    "type": "dict",
                    "required": ["query"],
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query to use for retrieving information. This should be a refined version of the user's question optimized for web search."
                        }
                    }
                }
            }
        ]
        """

        # Map available tool functions
        self.available_tools = {
            "retrieve_information": self.retrieve_information
        }

    def _default_system_message(self) -> str:
        """Default system message if none is provided."""
        return f"""You are a helpful assistant with access to various functions. When responding to a question:
1. If the question can be answered directly without using functions, provide a clear and helpful response.
2. If the question requires using one of the functions available for you to provide a complete answer:
   - Answer by ONLY outputting the necessary function call using the format: [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
   - DO NOT output anything except for the function call
Here is a list of functions in JSON format that you can use when appropriate:\n\n{self.function_definitions}\n"""

    def extract_query(self, completion):
        """
        Extract query from a retrieve_information tool call in a single completion.

        Args:
            completion (str): Text response from the LLM

        Returns:
            str or None: The extracted query string, or None if no query is found
        """
        # Look for query pattern in either format:
        # Format 1: retrieve_information(query="...")
        # Format 2: retrieve_information, query="..."
        pattern = r'retrieve_information(?:\(query=|,\s*query=)"([^"]*)"'

        # Find match
        match = re.search(pattern, completion)

        if match:
            # Extract the query value
            return match.group(1)
        else:
            # No query found
            return None

    def retrieve_information(self, query: str) -> Dict[str, Any]:
        """
        Perform a web search using the search client and format the results.

        Args:
            query: The search query

        Returns:
            Dictionary containing formatted search results and metadata
        """
        try:
            search_results = self.search_client.search(
                query=query,
                search_depth=self.search_depth,
                max_results=self.max_search_results
            )

            formatted_results = []
            for result in search_results.get('results', []):
                content = result.get('content', '').strip()
                url = result.get('url', '')
                if content:
                    formatted_results.append(f"Content: {content}\nSource: {url}\n")

            # Join all results with proper formatting
            context = "\n".join(formatted_results)

            return {
                "context": context,
                "query": query,
                "num_results": len(formatted_results),
                "success": True,
                "message": f"Retrieved {len(formatted_results)} results for query: '{query}'"
            }

        except Exception as e:
            print(f"Error in retrieve_information: {str(e)}")
            return {
                "context": "",
                "query": query,
                "num_results": 0,
                "success": False,
                "message": f"Failed to retrieve information: {str(e)}"
            }

    def chat(self, user_message: str, user_id: str, debug: bool = False) -> str:
        """Process a user message and return the agent's response.

        Args:
            user_message: The message from the user
            user_id: Unique identifier for the user
            debug: Whether to print debug information

        Returns:
            str: The agent's response
        """
        try:
            # Construct conversation history
            conversation = []

            # Add system message
            system_message = self.get_system_message()
            conversation.append({
                "role": "system",
                "content": system_message
            })

            # Add chat history
            for msg in self.chat_histories[user_id]:
                conversation.append(msg)

            # Add the new user message
            user_message_dict = {
                "role": "user",
                "content": user_message
            }
            conversation.append(user_message_dict)

            # Save user message to history
            self.chat_histories[user_id].append(user_message_dict)

            # Format the prompt
            prompt = self.tokenizer.apply_chat_template(
                conversation=conversation,
                tokenize=False,
                add_generation_prompt=True
            )

            if debug:
                print(f"#Formatted prompt:\n{prompt[:500]}...\n")

            # Tokenize
            inputs = self.tokenizer(
                prompt,
                return_tensors="pt",
                padding=True,
                padding_side="left",
            ).to(self.model.device)

            # Generate
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=True,
                temperature=0.6,
            )

            # Extract the model's output (excluding the prompt tokens)
            output_token_ids = outputs[:, inputs.input_ids.shape[1]:]
            completion = self.tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]

            if debug:
                print(f"#Initial model output:\n{completion}\n")

            # Check if the model wants to use a tool
            query = self.extract_query(completion)

            # If a tool call is found
            if query:
                if debug:
                    print(f"#Tool call detected with query: {query}\n")

                # Process the search request
                search_result = self.retrieve_information(query)

                # Format the search result as context for the model
                context = f"<context>\n{search_result['context']}\n</context>"

                if debug:
                    print(f"#Retrieved context (preview):\n{context[:200]}...\n")

                # Create a new conversation with the retrieved information
                second_conversation = conversation.copy()

                # Add the model's tool call as an assistant message
                second_conversation.append({
                    "role": "assistant",
                    "content": f"I need to search for more information about this. [retrieve_information(query=\"{query}\")]"
                })

                # Add the search results
                second_conversation.append({
                    "role": "system",
                    "content": f"Search results for: {query}\n{context}"
                })

                # Format the prompt for the final answer
                second_prompt = self.tokenizer.apply_chat_template(
                    conversation=second_conversation,
                    tokenize=False,
                    add_generation_prompt=True
                )

                # Tokenize
                second_inputs = self.tokenizer(
                    second_prompt,
                    return_tensors="pt",
                    padding=True,
                    padding_side="left",
                ).to(self.model.device)

                # Generate the final answer
                second_outputs = self.model.generate(
                    **second_inputs,
                    max_new_tokens=512,  # Allow for a longer response with the context
                    do_sample=True,
                    temperature=0.7,
                )

                # Extract the model's final output
                final_output_token_ids = second_outputs[:, second_inputs.input_ids.shape[1]:]
                final_response = self.tokenizer.batch_decode(final_output_token_ids, skip_special_tokens=True)[0]

                if debug:
                    print(f"#Final response with context:\n{final_response[:200]}...\n")

                # Save the assistant's final response to chat history
                self.chat_histories[user_id].append({
                    "role": "assistant",
                    "content": final_response
                })

                return final_response

            else:
                # The model provided a direct answer without needing tools
                if debug:
                    print(f"#Direct answer without tools:\n{completion}\n")

                # Save the assistant's response to chat history
                self.chat_histories[user_id].append({
                    "role": "assistant",
                    "content": completion
                })

                return completion

        except Exception as e:
            error_msg = f"Error in chat: {str(e)}"
            print(error_msg)
            print(traceback.format_exc())
            return error_msg

    def get_chat_history(self, user_id: str) -> list:
        """Retrieve the chat history for a specific user.

        Args:
            user_id: Unique identifier for the user

        Returns:
            list: List of message dictionaries
        """
        return list(self.chat_histories[user_id])

In [None]:
import os
from tavily import TavilyClient

# Initialize search client
tavily_client = TavilyClient(api_key=tavily_api_key)

# Initialize the RAG agent
rag_agent = RAGAgent(
    model_name="meta-llama/Llama-3.2-3B-Instruct",
    hf_access_token=hf_access_token,
    search_client=tavily_client,
    search_depth="advanced",
    max_search_results=5
)

Let's now run the agent outputting lots of debugging information:

In [None]:
import uuid

# Example usage
user_id = str(uuid.uuid4())

response = rag_agent.chat("Nice day, isn't it?", user_id, debug=True)
print("Response:", response)

response = rag_agent.chat("Who won gold in break dance at the 2024 olympics?", user_id, debug=True)
print("Response:", response)

response = rag_agent.chat("Who played Thaddeus Ross in Captain America: Brave New World?", user_id, debug=True)
print("Response:", response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


#Formatted prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 15 May 2025

You are a helpful assistant with access to various functions. When responding to a question:
1. If the question can be answered directly without using functions, provide a clear and helpful response.
2. If the question requires using one of the functions available for you to provide a complete answer:
   - Answer by ONLY outputting the necessary function call using the form...



Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


#Initial model output:
It's hard to say whether it's a nice day or not, as I'm a large language model, I don't have have access to real-time information about the current weather or location. However, I can tell you that I'm here to help answer any questions you may have, so let's make the most of our conversation! What's on your mind?

#Direct answer without tools:
It's hard to say whether it's a nice day or not, as I'm a large language model, I don't have have access to real-time information about the current weather or location. However, I can tell you that I'm here to help answer any questions you may have, so let's make the most of our conversation! What's on your mind?

Response: It's hard to say whether it's a nice day or not, as I'm a large language model, I don't have have access to real-time information about the current weather or location. However, I can tell you that I'm here to help answer any questions you may have, so let's make the most of our conversation! What's on y

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


#Retrieved context (preview):
<context>
Content: Featured Video

Ranking Every Week 1 Game 🏈

Olympic Breakdancing 2024 Results: Women's Breaking Medal Winners and Highlights

Breakdancing made its debut at the 2024 Paris Olympics...



Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


#Final response with context:
The gold medal winners in breakdancing at the 2024 Olympics were:

* Women's: Ami Yuasa of Japan
* Men's: Phil Wizard (Philip Kim) of Canada...

Response: The gold medal winners in breakdancing at the 2024 Olympics were:

* Women's: Ami Yuasa of Japan
* Men's: Phil Wizard (Philip Kim) of Canada
#Formatted prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 15 May 2025

You are a helpful assistant with access to various functions. When responding to a question:
1. If the question can be answered directly without using functions, provide a clear and helpful response.
2. If the question requires using one of the functions available for you to provide a complete answer:
   - Answer by ONLY outputting the necessary function call using the form...

#Initial model output:
[retrieve_information, query="Thaddeus Ross Captain America: Brave New World"]

#Tool call detected with query: Thaddeus Ross C

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


#Retrieved context (preview):
<context>
Content: Thaddeus Ross is a fictional character originally portrayed by William Hurt and subsequently by Harrison Ford in the Marvel Cinematic Universe (MCU) film franchise, based on the Mar...

#Final response with context:
The actor who played Thaddeus Ross in Captain America: Brave New World is Harrison Ford....

Response: The actor who played Thaddeus Ross in Captain America: Brave New World is Harrison Ford.


In [None]:
rag_agent.get_chat_history(user_id)

[{'role': 'user', 'content': "Nice day, isn't it?"},
 {'role': 'assistant',
  'content': "It's hard to say whether it's a nice day or not, as I'm a large language model, I don't have have access to real-time information about the current weather or location. However, I can tell you that I'm here to help answer any questions you may have, so let's make the most of our conversation! What's on your mind?"},
 {'role': 'user',
  'content': 'Who won gold in break dance at the 2024 olympics?'},
 {'role': 'assistant',
  'content': "The gold medal winners in breakdancing at the 2024 Olympics were:\n\n* Women's: Ami Yuasa of Japan\n* Men's: Phil Wizard (Philip Kim) of Canada"},
 {'role': 'user',
  'content': 'Who played Thaddeus Ross in Captain America: Brave New World?'},
 {'role': 'assistant',
  'content': 'The actor who played Thaddeus Ross in Captain America: Brave New World is Harrison Ford.'}]

## Task 2. Updating the NPC Factory

In this task, you'll update the NPC Factory from notebook 1.7 so that it can use a self-deployed LLM.

**Note**. Depending on the LLM you choose and the GPU you use, you might hit an out-of-memory error when the context gets large. So you might want to check the maximum context size your GPU can handle and trim past dialog histories accordingly.
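
A minimal sketch of such trimming is shown below. It keeps the most recent messages that fit into a token budget. The function name `trim_history` and the `count_tokens` callable are assumptions made for this example. With a Hugging Face tokenizer, something like `lambda text: len(tokenizer(text).input_ids)` could serve as a rough counter, while the demo uses a whitespace split to stay self-contained.

```python
from typing import Callable, Dict, List


def trim_history(
    messages: List[Dict[str, str]],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[Dict[str, str]]:
    """Keep the newest messages whose combined token count fits in max_tokens."""
    kept = []
    total = 0
    # Walk from the newest message backwards and stop at the first overflow
    for message in reversed(messages):
        n_tokens = count_tokens(message["content"])
        if total + n_tokens > max_tokens:
            break
        kept.append(message)
        total += n_tokens
    return list(reversed(kept))


history = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
# Whitespace "tokenizer" used only for the demo
print(trim_history(history, max_tokens=3, count_tokens=lambda text: len(text.split())))
```

Note that the trimmed history may start with an assistant message. If your chat template expects strict user/assistant alternation, you may want to drop that leading message as well.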

**If you're in for an advanced challenge:** In a real-world situation, you'd want to consider batch processing, especially if your service becomes popular and receives many queries every minute. That raises several questions:

* What is the timeout after which we send even a non-complete batch to the LLM?
* How do we balance batch size against conversation lengths?
* Should we group conversations into batches by conversation length?
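
To make the timeout question concrete, here is a small sketch of a batch collector built on the standard `queue` module. It waits for the first request, then keeps gathering requests until either the batch is full or the timeout expires. The function name `collect_batch` and its parameters are hypothetical. A real service would also need a worker loop and a way to route results back to callers.

```python
import queue
import time
from typing import Any, List


def collect_batch(request_queue: "queue.Queue[Any]", max_batch_size: int, timeout_s: float) -> List[Any]:
    """Block until one request arrives, then fill the batch until it is full
    or timeout_s seconds have passed since the first request."""
    batch = [request_queue.get()]
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # Timeout reached: send the incomplete batch
    return batch


requests = queue.Queue()
for prompt in ["Hi", "Who goes there?", "Farewell"]:
    requests.put(prompt)

print(collect_batch(requests, max_batch_size=2, timeout_s=0.05))  # full batch
print(collect_batch(requests, max_batch_size=2, timeout_s=0.05))  # incomplete batch after timeout
```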

You're not expected to fully answer these questions here, but we encourage you to experiment with batch sizes and conversation lengths to understand what your chosen GPU is capable of.
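
Before running anything on a GPU, a back-of-the-envelope way to see why conversation length matters for batching is to count how many token slots a padded batch occupies. Every sequence in a batch is padded to the longest one. The sketch below compares batching in arrival order with batching after sorting by length. The function name and the example lengths are invented for illustration.

```python
from typing import List


def padded_token_slots(lengths: List[int], batch_size: int, sort_first: bool) -> int:
    """Total token slots processed when each batch is padded to its longest sequence."""
    order = sorted(lengths) if sort_first else list(lengths)
    total = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Invented prompt lengths in tokens: two short and two long conversations
lengths = [10, 500, 12, 480]
print(padded_token_slots(lengths, batch_size=2, sort_first=False))
print(padded_token_slots(lengths, batch_size=2, sort_first=True))
```

Grouping by length reduces padding, but it may delay requests that have to wait for similarly sized neighbours, so it interacts with the timeout question above.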

For a deeper treatment of how batch size, GPU, and LLM choice interact, see the **Inference metrics** long read and notebook.

**A basic solution**

In [None]:
with open("hf_access_token", "r") as file:
    hf_access_token = file.read().strip()

# Or use a Colab secret:
#
# !pip install --upgrade huggingface_hub
#
# from google.colab import userdata
# hf_access_token = userdata.get('HF_TOKEN')

In [None]:
from collections import defaultdict, deque
from typing import Dict, Any, Optional
import datetime
import string
import random
from dataclasses import dataclass
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import re

@dataclass
class NPCConfig:
    world_description: str
    character_description: str
    history_size: int = 10
    has_scratchpad: bool = False

class NPCFactoryError(Exception):
    """Base exception class for NPC Factory errors."""
    pass

class NPCNotFoundError(NPCFactoryError):
    """Raised when trying to interact with a non-existent NPC."""
    def __init__(self, npc_id: str):
        self.npc_id = npc_id
        super().__init__(f"NPC with ID '{npc_id}' not found")

class SimpleChatNPC:
    def __init__(self, tokenizer, model, config: NPCConfig):
        self.tokenizer = tokenizer
        self.model = model
        self.config = config
        self.chat_histories = defaultdict(lambda: deque(maxlen=config.history_size))

    def get_system_message(self) -> Dict[str, str]:
        """Returns the system message that defines the NPC's behavior."""
        character_description = self.config.character_description

        if self.config.has_scratchpad:
            character_description += """
You can use scratchpad for thinking before you answer: whatever you output between #SCRATCHPAD and #ANSWER won't be shown to anyone.
You start your output with #SCRATCHPAD and after you've done thinking, you #ANSWER"""

        return {
            "role": "system",
            "content": f"""WORLD SETTING: {self.config.world_description}
###
{character_description}"""
        }

    def chat(self, user_message: str, user_id: str, debug: bool = False) -> str:
        """Process a user message and return the NPC's response."""
        try:
            # Construct conversation history
            conversation = []

            # Add system message
            conversation.append(self.get_system_message())

            # Add conversation history
            history = list(self.chat_histories[user_id])
            if history:
                conversation.extend(history)

            # Add new user message
            user_message_dict = {
                "role": "user",
                "content": user_message
            }
            conversation.append(user_message_dict)

            # Format the prompt using the chat template
            prompt = self.tokenizer.apply_chat_template(
                conversation=conversation,
                tokenize=False,
                add_generation_prompt=True
            )

            if debug:
                print(f"#Formatted prompt:\n{prompt[:500]}...\n")

            # Tokenize
            inputs = self.tokenizer(
                prompt,
                return_tensors="pt",
                padding=True,
                padding_side="left",
            ).to(self.model.device)

            # Generate response
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=True,
                temperature=0.6,
            )

            # Extract the model's output (excluding the prompt tokens)
            output_token_ids = outputs[:, inputs.input_ids.shape[1]:]
            response = self.tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]

            if debug:
                print(f"#Model response:\n{response}\n")

            # Handle scratchpad if enabled
            response_clean = response
            if self.config.has_scratchpad:
                # Optional colons after the markers; everything up to #ANSWER is the scratchpad
                scratchpad_match = re.search(r"#SCRATCHPAD:?(.*?)#ANSWER:?", response, re.DOTALL)
                if scratchpad_match:
                    response_clean = response[scratchpad_match.end():].strip()

            # Store user message and response in history
            self.chat_histories[user_id].append(user_message_dict)
            self.chat_histories[user_id].append({
                "role": "assistant",
                "content": response  # Store the full response including scratchpad
            })

            # Return the message to the user without scratchpad
            return response_clean

        except Exception as e:
            return f"Error: {str(e)}"

class NPCFactory:
    def __init__(self, model_name: str, hf_access_token: str):
        """Initialize the NPC Factory with an open-source LLM.

        Args:
            model_name: The Hugging Face model name (e.g., "meta-llama/Llama-3.2-3B-Instruct")
            hf_access_token: Hugging Face access token
        """
        self.model_name = model_name
        self.hf_access_token = hf_access_token
        self.npcs: Dict[str, SimpleChatNPC] = {}
        self.user_ids: Dict[str, str] = {}  # username -> user_id mapping

        # Initialize tokenizer and model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_access_token)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            return_dict=True,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True,
            token=hf_access_token
        )

        # Set pad token if not already set
        if self.tokenizer.pad_token_id is None:
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
        if self.model.config.pad_token_id is None:
            self.model.config.pad_token_id = self.model.config.eos_token_id

    def generate_id(self) -> str:
        """Generate a random 8-letter identifier (uniqueness is not enforced)."""
        return ''.join(random.choice(string.ascii_letters) for _ in range(8))

    def register_user(self, username: str) -> str:
        """Register a new user and return their unique ID.
        If username already exists, appends a numerical suffix."""
        base_username = username
        suffix = 1

        # Keep trying with incremented suffixes until we find an unused name
        while username in self.user_ids:
            username = f"{base_username}_{suffix}"
            suffix += 1

        user_id = self.generate_id()
        self.user_ids[username] = user_id
        return user_id

    def register_npc(self, world_description: str, character_description: str,
                     history_size: int = 10, has_scratchpad: bool = False) -> str:
        """Create and register a new NPC, returning its unique ID."""
        npc_id = self.generate_id()

        config = NPCConfig(
            world_description=world_description,
            character_description=character_description,
            history_size=history_size,
            has_scratchpad=has_scratchpad
        )

        # Pass the shared tokenizer and model to the NPC
        self.npcs[npc_id] = SimpleChatNPC(self.tokenizer, self.model, config)
        return npc_id

    def chat_with_npc(self, npc_id: str, user_id: str, message: str, debug: bool = False) -> str:
        """Send a message to a specific NPC from a specific user.

        Args:
            npc_id: The unique identifier of the NPC
            user_id: The unique identifier of the user
            message: The message to send
            debug: Whether to print debug information

        Returns:
            The NPC's response

        Raises:
            NPCNotFoundError: If the specified NPC doesn't exist
        """
        if npc_id not in self.npcs:
            raise NPCNotFoundError(npc_id)

        npc = self.npcs[npc_id]
        return npc.chat(message, user_id, debug=debug)

    def get_npc_chat_history(self, npc_id: str, user_id: str) -> list:
        """Retrieve chat history between a specific user and NPC.

        Args:
            npc_id: The unique identifier of the NPC
            user_id: The unique identifier of the user

        Returns:
            List of message dictionaries containing the chat history

        Raises:
            NPCNotFoundError: If the specified NPC doesn't exist
        """
        if npc_id not in self.npcs:
            raise NPCNotFoundError(npc_id)

        return list(self.npcs[npc_id].chat_histories[user_id])

In [None]:
# Creating a factory

model_name = "meta-llama/Llama-3.2-3B-Instruct"
# Use Qwen instead if you didn't get access to Llama 3.2
# model_name = "Qwen/Qwen2.5-3B-Instruct"

npc_factory = NPCFactory(model_name=model_name, hf_access_token=hf_access_token)

In [None]:
# Register a user
user_id = npc_factory.register_user("Alice")

# Create an NPC
npc_id = npc_factory.register_npc(
    world_description="Medieval London, XIII century",
    character_description="A knight at Edward I's court",
    has_scratchpad=False
)

In [None]:
def prettify_string(text, max_line_length=80):
    """Returns a string with line breaks inserted at spaces to prevent horizontal scrolling.

    Args:
        text: The string to print.
        max_line_length: The maximum length of each line.
    """

    output_lines = []
    lines = text.split("\n")
    for line in lines:
        current_line = ""
        words = line.split()
        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                output_lines.append(current_line.strip())
                current_line = word + " "
        output_lines.append(current_line.strip())  # Append the last line
    return "\n".join(output_lines)
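
As an aside, similar wrapping can be achieved with the standard library's `textwrap` module. The sketch below is an alternative, not the notebook's own helper, and the name `wrap_text` is made up. It wraps each line separately so that existing line breaks survive.

```python
import textwrap


def wrap_text(text: str, max_line_length: int = 80) -> str:
    """Wrap each line of `text` separately so existing line breaks are preserved."""
    return "\n".join(
        textwrap.fill(line, width=max_line_length) for line in text.split("\n")
    )


print(wrap_text("Good morrow to thee, good fellow! " * 5, max_line_length=40))
```

Unlike the hand-written loop, `textwrap.fill` also breaks words that are longer than the line width.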

In [None]:
response = npc_factory.chat_with_npc(npc_id, user_id,
                                     """Good day, sir knight!"""
                                     )
print(prettify_string(response))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Good morrow to thee, good fellow! I am Sir Edward de Montfort, a loyal knight
of the realm and a trusted advisor to His Majesty, King Edward I. 'Tis a grand
day to be alive, don't thou think? The sun shineth brightly upon our fair city
of London, and the sounds of hammering and sawing fill the air as our great
king's builders work tirelessly to construct grand new buildings and
fortifications.

I must say, I am in high spirits today, for I have just returned from a
successful tournament in the city, where I fought valiantly alongside my fellow
knights and won the favor of the king himself. 'Twas a grand spectacle, with
jousting, archery, and even a bit of sword fighting. I doth hope thou hast seen
the tournament, good fellow?

Now, tell me, what brings thee to our fair city? Art thou a merchant, a
traveler, or perhaps a noble seeking an audience with the king?
