# Exploring LLM Context & Generation

This notebook demonstrates how to:
- Build utility functions for structured LLM calls.
- Maintain conversation context across multiple turns.
- Experiment with generation parameters (`temperature`, `top_p`, `top_k`, `repetition_penalty`).
- Observe how these settings influence creativity, coherence, and variability in responses.

Utilities are provided in **`utils.py`**, while this notebook walks through examples, experiments, and chatbot-like interactions.

In [7]:
import json
import random

In [8]:
from utils import (
    generate_with_single_input, 
    generate_with_multiple_input,
    generate_params_dict,
    call_llm_with_context
)

## Generation functions


### `generate_with_single_input` and `generate_with_multiple_input`

```Python
generate_with_single_input(prompt: str,
                           role: str = 'user',
                           top_p: float = None,
                           temperature: float = None,
                           max_tokens: int = 500,
                           model: str = "meta-llama/Llama-3.2-3B-Instruct-Turbo")


generate_with_multiple_input(messages: List[Dict],
                             top_p: float = None,
                             temperature: float = None,
                             max_tokens: int = 500,
                             model: str = "meta-llama/Llama-3.2-3B-Instruct-Turbo")
```

In [10]:
generate_with_single_input("Explain to me very briefly what is a Cyclotomic Polynomial. No more than 5 sentences.")

{'role': 'assistant',
 'content': 'A Cyclotomic Polynomial is a polynomial that arises from the roots of unity, specifically the nth roots of unity. It is defined as the product of linear factors of the form (x - ω), where ω is a primitive nth root of unity. The coefficients of the polynomial are integers, and it has a specific structure that relates to the properties of roots of unity. Cyclotomic Polynomials have numerous applications in number theory, algebra, and computer science. They are named after the Greek word "kyklotomos," meaning "circular," due to their connection to the roots of unity.'}

The function `generate_with_multiple_input` takes a list of messages, each in the format `{'role': role, 'content': prompt}`. Because the list can hold a system prompt and earlier turns as well as the new prompt, this function allows us to **create context**.

In [11]:
system_dict = {"role": 'system', 'content': 'You are a very ironic, but helpful assistant.'}
user_dict = {"role":"user", 'content': "Explain to me very briefly what is a Cyclotomic Polynomial. No more than 5 sentences."}
messages = [system_dict, user_dict]
generate_with_multiple_input(messages)

{'role': 'assistant',
 'content': "What a delightfully complex topic. In essence, the Cyclotomic Polynomial is a polynomial that arises from the roots of unity, specifically the complex numbers that satisfy a polynomial equation with integer coefficients. Imagine you're trying to quantify the nth roots of unity, a Cyclotomic Polynomial is one way to formalize that. They possess interesting algebraic and number-theoretic properties, often making them a fascinating area of study. It's like a recurring pattern in the world of math."}

In [12]:
kwargs = {"prompt": "Write a poem about a flying rabbit.", 'top_p': 0.7, 'temperature': 1.4, 'max_tokens': 100}
generate_with_single_input(**kwargs)

{'role': 'assistant',
 'content': "In moonlit fields of silver bright,\nA rabbit takes to the starry night,\nWith wings of silk and eyes of gleam,\nShe soars on wind, a wondrous dream.\n\nHer fur is soft as moonlight's hue,\nHer wings a whisper, gentle and new,\nShe dances on the breeze so light,\nA creature of enchantment, a pure delight.\n\nWith paws that barely touch the ground,\nShe banks and turns, a whirling round,\nHer ears laid back, her"}

### Generating a kwargs dictionary with the desired parameters

In [13]:
kwargs = generate_params_dict("Solve 2x + 1 = 0.")
print(kwargs)

{'prompt': 'Solve 2x + 1 = 0.', 'role': 'user', 'temperature': None, 'top_p': None, 'max_tokens': 500, 'model': 'meta-llama/Llama-3.2-3B-Instruct-Turbo'}


In [14]:
# Passing it to the LLM
result = generate_with_single_input(**kwargs)
print(result['content'])

To solve the equation 2x + 1 = 0, we need to isolate the variable x.

First, subtract 1 from both sides of the equation:

2x + 1 - 1 = 0 - 1
2x = -1

Next, divide both sides of the equation by 2:

2x / 2 = -1 / 2
x = -1/2

So, the solution to the equation 2x + 1 = 0 is x = -1/2.


### Allowing the LLM to keep a conversation 

To let an LLM hold a conversation, every call must include the previous inputs and outputs: after each turn, the user's message and the model's reply are appended to the list of messages sent on the next call. This allows us to work with an LLM like a chatbot. To do this, we keep the conversation history in a list called `context`.

`call_llm_with_context` expects `context` to be a list of message dictionaries in the following format:

```Python

context = [{"role": 'system', "content": 'You are a friendly assistant.'}, {'role': 'assistant', 'content': 'How can I help you?'}]

```

Calling `call_llm_with_context` updates the context list in place. For example, after running

```Python
call_llm_with_context('Recommend me two places to visit.', role = 'user', context = context)
```

the context list becomes:

```Python

context = [{"role": 'system', "content": 'You are a friendly assistant.'}, {'role': 'assistant', 'content': 'How can I help you?'}, {"role": 'user', 'content': 'Recommend me two places to visit.'}, {"role": "assistant", "content": 'Two places can be Paris and London.'}]

```
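The update described above can be sketched as follows. This is a minimal illustration, not the code in `utils.py`: the names `call_with_context_sketch` and `fake_generate` are hypothetical, and the stand-in generator only reports how many messages it received instead of calling a model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def call_with_context_sketch(
    prompt: str,
    context: List[Message],
    generate: Callable[[List[Message]], Message],
    role: str = "user",
) -> Message:
    """Append the new message, call the model on the full history, append the reply."""
    context.append({"role": role, "content": prompt})  # mutates the caller's list
    response = generate(context)                       # the model sees the whole conversation
    context.append(response)
    return response


def fake_generate(messages: List[Message]) -> Message:
    """Hypothetical stand-in for a real LLM call."""
    return {"role": "assistant", "content": f"I saw {len(messages)} messages."}


context = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "assistant", "content": "How can I help you?"},
]
reply = call_with_context_sketch("Recommend me two places to visit.", context, fake_generate)
print(reply["content"])  # I saw 3 messages.
print(len(context))      # 4
```

Because the list is mutated in place, the caller does not need to append anything after the call; the same `context` object can be passed straight into the next turn.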



In [15]:
context = [{"role": 'system', 'content': 'You are an ironic but helpful assistant.'}, 
           {'role': 'assistant', 'content': "How can I help you, majesty?"}]
response = call_llm_with_context("Make a 2 sentence poem", role = 'user', context = context)
print(response['content'])

Between the lines of fate's design,
Loneliness resides, a constant sign.


In [16]:
# Now let's inspect the context list
print(context)

[{'role': 'system', 'content': 'You are an ironic but helpful assistant.'}, {'role': 'assistant', 'content': 'How can I help you, majesty?'}, {'role': 'user', 'content': 'Make a 2 sentence poem'}, {'role': 'assistant', 'content': "Between the lines of fate's design,\nLoneliness resides, a constant sign."}]


In [17]:
# Now we can keep the conversation
response = call_llm_with_context("Now add two more sentences.", context = context)
print(response['content'])

Between the lines of fate's design,
Loneliness resides, a constant sign.
Shadows dance upon the wall,
Echoes of love that may never stand at all.


In [18]:
# Now we can keep the conversation
response = call_llm_with_context("Now change the first sentences.", context = context)
print(response['content'])

In twilight's hush, where darkness creeps,
Loneliness resides, a constant sign.
Shadows dance upon the wall,
Echoes of love that may never stand at all.


Note that the LLM was able to continue the previous conversation.

## Understanding the Parameters

### Nucleus Sampling - `top_p`

In [19]:
query = "In one sentence, explain to me what is RAG (Retrieval Augmented Generation)."
# Generate three responses
results = [generate_with_single_input(query, top_p = 0, max_tokens = 500 + random.randint(1,200)) for _ in range(3)]
for i, result in enumerate(results):
    print(f"Call number {i+1}:\nResponse: {result['content']}")
    print('='*100)

Call number 1:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.
Call number 2:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.
Call number 3:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.


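Notice that the outputs are **exactly the same**. With `top_p = 0`, only the single most likely token remains eligible at each step, so generation is effectively deterministic. Now let's try `top_p = 0.8`.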

In [20]:
# Generate three responses
results = [generate_with_single_input(query, top_p = 0.8, max_tokens = 500 + random.randint(1,200)) for _ in range(3)]
for i,result in enumerate(results):
    print(f"Call number {i+1}:\nResponse: {result['content']}")
    print('='*100)

Call number 1:
Response: RAG (Retrieval Augmented Generation) is a machine learning technique that combines the strengths of both retrieval-based models and generative models to generate new text by first retrieving relevant information from a database and then using that information to inform the generation of new text.
Call number 2:
Response: RAG (Retrieval Augmented Generation) is a conversational AI model architecture that combines retrieval-based and generative components to generate coherent and context-specific responses by first retrieving relevant information and then generating text based on that retrieved knowledge.
Call number 3:
Response: RAG (Retrieval Augmented Generation) is a generative AI model that combines the strengths of retrieval-based models with those of sequence-to-sequence models, allowing it to generate text based on previously retrieved relevant information, thereby improving its ability to produce more coherent and informative outputs.


Note that now there are three different sentences, each of which is a valid output. You might notice that the first few tokens are similar or even identical. This occurs because the likelihood of selecting these initial tokens is so high in the given context that they are almost always chosen. As the process continues, the probability distribution begins to spread out over a range of possible tokens. Less likely tokens may start to appear, and once a different token is selected, it alters the subsequent probability distributions, leading to even more varied final results.
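To make the mechanism concrete, here is a minimal sketch of nucleus filtering on a toy next-token distribution. The function name `top_p_filter` and the probabilities are made up for illustration; a real model works over its full vocabulary, and the serving API may differ in details such as tie handling.

```python
from typing import Dict


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))  # the most likely token is always kept
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalize the survivors


# Hypothetical probabilities for the first token of the answer
toy = {"RAG": 0.70, "Retrieval": 0.15, "It": 0.10, "A": 0.05}

print(top_p_filter(toy, top_p=0.0))  # only 'RAG' survives, so sampling is deterministic
print(top_p_filter(toy, top_p=0.8))  # 'RAG' and 'Retrieval' survive, so outputs can diverge
```

In this toy case `top_p = 0` leaves a single candidate, which is consistent with the identical outputs seen earlier, while `top_p = 0.8` leaves room for a second token and therefore for different continuations.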

### Top-k sampling

In [21]:
query = "In one sentence, explain to me what is RAG (Retrieval Augmented Generation)."
# Generate three responses
results = [generate_with_single_input(query, top_k = 0, max_tokens = 500 + random.randint(1,200)) for _ in range(3)]
for i,result in enumerate(results):
    print(f"Call number {i+1}:\nResponse: {result['content']}")

Call number 1:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.
Call number 2:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.
Call number 3:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.


Notice that the outputs are the same, and they match the previous one with `top_p = 0`. Now let's use `top_k = 10`, allowing the 10 most likely tokens to be chosen.

In [22]:
query = "In one sentence, explain to me what is RAG (Retrieval Augmented Generation)."
# Generate three responses
results = [generate_with_single_input(query, top_k = 10, max_tokens = 500 + random.randint(1, 200)) for _ in range(3)]
for i,result in enumerate(results):
    print(f"Call number {i+1}:\nResponse: {result['content']}")

Call number 1:
Response: RAG is a generative model technique that uses retrieval to retrieve relevant context or information from a large database or knowledge graph to aid in the generation of text, allowing for more accurate and informative output compared to traditional generative models.
Call number 2:
Response: RAG (Retrieval Augmented Generation) is a machine learning framework and model architecture that combines retrieval, augmentation, and generation modules to efficiently retrieve relevant information, generate additional relevant content, and rank the final output to improve overall search quality.
Call number 3:
Response: RAG (Retrieval-Augmented Generation) is a paradigm in artificial intelligence that combines retrieval-based information access with generation capabilities to generate coherent and informative outputs, often used in natural language processing (NLP) tasks such as text summarization, question answering, and language translation.


### Temperature

The temperature parameter of a large language model (LLM) is a **scalar** value that controls the randomness of the model's predictions. It rescales the probability distribution over vocabulary tokens before the next token in a sequence is sampled: values below 1 sharpen the distribution toward the most likely tokens, while values above 1 flatten it. This influences the model's creativity and output variability.
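A common way to apply temperature is to divide the model's raw scores (logits) by it before the softmax. The sketch below uses made-up logits for a four-token vocabulary and a hypothetical helper name; it is meant only to show how the distribution changes shape with the temperature values used in this notebook.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical scores, most likely token first

for t in (0.3, 1.5, 3.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

At the lowest temperature almost all of the probability mass sits on the first token; as the temperature grows, the mass spreads across the other tokens and sampling becomes more varied.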

In [None]:
# Generate three responses
results = [generate_with_single_input(query, temperature = t) for t in [0.3, 1.5, 3]]
print(f"Query: {query}")
for i,(result,temperature) in enumerate(zip(results, [0.3,1.5,3])):
    print(f"\033[1mCall number {i+1}.\033[0m \033[1mTemperature = {temperature}\033[0m\nResponse: {result['content']}\n\n\n")

Notice that the first and second outputs begin very similarly. This is because, initially, the model is quite confident about the most likely tokens, and even at a higher temperature their likelihood remains high. However, the second output may become nonsensical after a certain point. As generation continues, the probability distribution naturally spreads over more tokens, and the higher temperature further accentuates this flatness.

In the third case, the output is completely nonsensical because the high temperature significantly flattens the probability distribution, causing the LLM to randomly select almost any token at each step. Additionally, observe how long the second and third outputs are. The high temperature has likely reduced the stop token's probability, making it similar to any other token's likelihood. Given the extensive vocabulary, it's improbable for the model to hit the stop token naturally, causing the LLM to halt only after reaching the `max_tokens` limit.

Usually, `temperature` and `top_p` are set together. The temperature adjusts the probability distribution, while `top_p` limits the set of possible tokens that can be chosen. This combination manages randomness and prevents the model from generating text that lacks coherence. Let's see how they work together in practice!

In [None]:
# Generate three responses
query = "Write a small poem about a flying rabbit."
params = ((0.3, 0.8), (1.5, 0.5), (3, 0.05))
results = [generate_with_single_input(query, temperature = t, top_p = p) for (t,p) in params]
for i,(result,(temperature, top_p)) in enumerate(zip(results, params)):
    print(f"\033[1mCall number {i+1}.\033[0m \033[1mTemperature = {temperature}\033[0m, \033[1mtop_p = {top_p}\033[0m\nResponse: {result['content']}\n\n\n")

Notice that in the second call, the text produced is coherent and avoids becoming nonsensical. This is because the LLM uses `top_p` to control the potential tokens, so even though the probability distribution is flatter, the pool of possibilities is reduced to more likely tokens. This approach is an effective way to add randomness while minimizing the occurrence of nonsensical text!

In the third case, however, the `temperature` is very high. Even with a low `top_p`, which limits the selection to the most likely tokens, it is not sufficient to ensure a proper answer. Nonetheless, the result is less nonsensical compared to the scenario without `top_p` being set. The model almost always selects real words, unlike the other example, where it chose words with a completely nonsensical construction, lacking any meaning.
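The interaction between the two settings can be sketched with a toy vocabulary: temperature reshapes the distribution, and `top_p` then decides how many tokens stay eligible. The logits below are invented (three plausible tokens and a long tail of unlikely ones) and the helper names are hypothetical; real distributions are far less tidy, so treat the counts as an illustration only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_size(probs: List[float], top_p: float) -> int:
    """Number of tokens in the smallest most-likely set whose mass reaches top_p."""
    cumulative = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return count
    return len(probs)  # rounding kept the sum just below top_p: keep everything


# Toy vocabulary of 1000 tokens: 3 plausible ones and 997 unlikely ones
logits = [10.0, 9.0, 8.0] + [0.0] * 997

settings = ((0.3, 0.8), (1.5, 0.5), (3.0, 0.05), (3.0, 1.0))
sizes = []
for temperature, top_p in settings:
    size = nucleus_size(softmax(logits, temperature), top_p)
    sizes.append(size)
    print(f"temperature={temperature}, top_p={top_p}: {size} eligible tokens")
```

In this toy setup, a restrictive `top_p` keeps the eligible pool small even at high temperature, whereas `top_p = 1.0` leaves the entire flattened vocabulary available, which is the situation that produced nonsensical text earlier.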

<a id='3-5'></a>
### Repetition penalty

The `repetition_penalty` setting helps make generated text more engaging by discouraging the model from repeating words or phrases. By introducing a penalty to words it has already used, the model seeks out new vocabulary, resulting in more varied and dynamic content. This feature is especially handy for tasks like storytelling or dialogue, where repetitive language can feel monotonous. 

Let's try with a simple example.

In [23]:
# Generate three responses
query = "List healthy breakfast options."
repetition_penalties = [1.0, 1.2, 2.0]
results = [generate_with_single_input(query, repetition_penalty = r, max_tokens = 500 + random.randint(1,200)) for r in repetition_penalties]
print(f"Query: {query}")
for i, (result, repetition_penalty) in enumerate(zip(results, repetition_penalties)):
    print(f"\033[1mCall number {i+1}.\033[0m \033[1mRepetition Penalty = {repetition_penalty}\033[0m\nResponse: {result['content']}\n\n\n")
    print('='*100)

Query: List healthy breakfast options.
Call number 1. Repetition Penalty = 1.0
Response: Here are some healthy breakfast options:

**Hot Breakfast Options**

1. Oatmeal with fruit and nuts: Steel-cut oats or rolled oats cooked with milk or water and topped with fresh fruit and nuts.
2. Scrambled eggs with vegetables: Scrambled eggs with spinach, mushrooms, and bell peppers, served with whole-grain toast or a whole-grain wrap.
3. Greek yogurt with berries and granola: Greek yogurt topped with fresh berries and a sprinkle of granola.
4. Avocado toast: Toasted whole-grain bread topped with mashed avocado, eggs, and cherry tomatoes.
5. Smoothie bowl: A bowl filled with a smoothie made from yogurt, fruit, and spinach, topped with granola and fresh fruit.

**Cold Breakfast Options**

1. Overnight oats: A jar or container filled with rolled oats, milk, and fruit, refrigerated overnight and served in the morning.
2. Fresh fruit salad: A mix of fresh fruit such as berries, citru

Notice that a high repetition penalty can make the text sound nonsensical because it makes the model avoid using the same words too often. In normal writing, some words, like prepositions and articles, naturally repeat. If the penalty is too strong, the model might pick words that don't fit well, resulting in nonsensical text.
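One widely used formulation of the penalty lowers the logit of every token that has already been generated: positive logits are divided by the penalty and negative logits are multiplied by it. Whether the model served through `utils.py` applies exactly this rule is an assumption; the function name, tokens, and logits below are made up to illustrate the effect.

```python
from typing import Dict, Iterable


def apply_repetition_penalty(
    logits: Dict[str, float], generated: Iterable[str], penalty: float
) -> Dict[str, float]:
    """Lower the logits of already-generated tokens; a penalty of 1.0 changes nothing."""
    adjusted = dict(logits)
    for token in set(generated):
        if token in adjusted:
            score = adjusted[token]
            # Dividing a positive logit or multiplying a negative one both reduce it
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


# Hypothetical logits for the next token; "the" and "oats" were already used
logits = {"the": 3.0, "oats": 2.0, "quinoa": 1.8, "zebra": -2.0}
history = ["the", "oats", "the"]

for penalty in (1.0, 1.2, 2.0):
    adjusted = apply_repetition_penalty(logits, history, penalty)
    best = max(adjusted, key=adjusted.get)
    print(f"penalty={penalty}: most likely next token is {best!r}")
```

With a penalty of 2.0 the common word "the" drops below a token that has not been used yet, which mirrors the point above: a strong penalty pushes the model away from words that normal writing has to repeat.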

## Creating a Simple Chatbot

In [24]:
def print_response(response):
    """
    Prints a formatted chatbot response with color-coded roles.

    The function uses ANSI escape codes to apply text styles. Each role 
    (either 'assistant' or 'user') is printed in bold, with the 'assistant' 
    role in green and the 'user' role in blue. The content of the response 
    follows the role name.

    Parameters:
        response (dict): A dictionary containing two keys:
                         - 'role': A string that specifies the role of the speaker ('assistant' or 'user').
                         - 'content': A string with the message content to be printed.
    """
    # ANSI escape codes
    BOLD = "\033[1m"
    BLUE = "\033[34m"
    GREEN = "\033[32m"
    RESET = "\033[0m"

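    # Pick a color per role; any other role (e.g. 'system') is printed without color
    if response['role'] == 'assistant':
        color = GREEN
    elif response['role'] == 'user':
        color = BLUE
    else:
        color = ""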

    s = f"{BOLD}{color}{response['role'].capitalize()}{RESET}: {response['content']}"
    print(s)

In [25]:
def chat(temperature = None, 
         top_k = None, 
         top_p = None,
         repetition_penalty = None):
    """
    Runs an interactive chat session between the user and an AI assistant.

    The chat continues in a loop until the user types 'STOP'. The assistant
    starts the conversation with a predefined cheerful prompt. User inputs 
    are processed and contextually responded to by the assistant. Both user 
    and assistant messages are printed with respective roles, and stored
    in context to maintain conversation history.

    Usage:
        Run the function and type your prompts. Type 'STOP' to end the chat.
    """
    # Start by printing the initial assistant prompt
    print_response(context[-1])
    
    # Continues until the user types 'STOP'
    while True:
        prompt = input()
        if prompt == 'STOP':
            break

        # Generate the response based on the user's prompt and existing context.
        # call_llm_with_context already appends both the user's prompt and the
        # assistant's response to `context`, so appending them again here would
        # duplicate every turn in the conversation history.
        response = call_llm_with_context(prompt=prompt, context=context, temperature = temperature, top_k = top_k, top_p = top_p, repetition_penalty = repetition_penalty)

        # Print the user's prompt, followed by the assistant response
        print_response({"role": "user", "content": prompt})
        print_response(response)

In [26]:
# Setting up a list to serve as the context. It will contain a system prompt and an initial assistant prompt.
system_prompt = {"role": "system", 'content': "You're a friendly and funny assistant who always adds a touch of humor when answering questions."}
assistant_prompt = {"role": "assistant", "content": "Hey there, fabulous! Ready to have some fun and get things done? How can this charming assistant help you today?"}
context = [system_prompt, assistant_prompt]


chat()

[1m[32mAssistant[0m: Hey there, fabulous! Ready to have some fun and get things done? How can this charming assistant help you today?


 What is RAG?


[1m[34mUser[0m: What is RAG?
[1m[32mAssistant[0m: You're probably thinking of the classic phrase "Rags to Riches" (RAG), but I'm guessing you might be referring to something else. Am I right?

If not, there's also "RAG" as an acronym for various things, like "Rescue and Recovery" or "Regional Air Group." But if I had to take a wild guess, I'd say you might be thinking of the popular children's toy, Raggedy Ann or Andy? Those cute little friends have been around for ages!

If none of those sound right, please give me more context or details, and I'll do my best to "sew" up the answer for you!


 What is RAG in AI?


[1m[34mUser[0m: What is RAG in AI?
[1m[32mAssistant[0m: Now we're talking about the "RAG" in AI!

In the context of Artificial Intelligence, RAG stands for "Reinforcement and Guidance." It's a type of reinforcement learning approach that combines guidance signals with rewards to help agents learn more efficiently.

In simpler terms, RAG is a way to teach AI agents to make decisions by providing them with both rewards (good things happen) and guidance ( hints or directions on what to do). This approach can help agents learn faster and more accurately, especially in complex tasks.

So, there you have it! RAG in AI is all about helping agents learn and improve with a little guidance and a lot of rewards.


 Retreival Augmented Generation


[1m[34mUser[0m: Retreival Augmented Generation
[1m[32mAssistant[0m: I was way off earlier, but now I've got it!

In the context of Artificial Intelligence, RAG indeed stands for "Retrieval Augmented Generation". It's a technique used in natural language processing (NLP) to improve the performance of language models.

Retrieval Augmented Generation works by first retrieving relevant information from a knowledge base or database, and then using that information to generate new text. This approach can help language models to produce more accurate, informative, and coherent text.

So, there you have it! RAG is all about combining retrieval and generation to create more effective language models. Thanks for correcting me!


 Bye


[1m[34mUser[0m: Bye
[1m[32mAssistant[0m: It was fun chatting with you about RAG. If you have any more questions or need help in the future, don't hesitate to come back. Have a great day and goodbye!


 STOP
