<img align="right" width="400" src="https://drive.google.com/thumbnail?id=1rPeHEqFWHJcauZlU82a4hXM10TUjmHxM&sz=s4000" alt="FHNW Logo">


# How to Generate with LLMs

by Fabian Märki

## Summary
This notebook illustrates the parameters (the adjustment "screws") that control how an LLM generates text.

## Links
- [How to Generate](https://huggingface.co/blog/how-to-generate)
- [Generation Strategies](https://huggingface.co/docs/transformers/generation_strategies)
- [Text Generation](https://github.com/nlp-with-transformers/notebooks/blob/main/05_text-generation.ipynb)

This notebook contains assignments: <font color='red'>Questions are written in red.</font>

<a href="https://colab.research.google.com/github/markif/2025_HS_Advanced_NLP_LAB/blob/master/XX_a_How_to_Generate_with_LLMs.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

The inputs to the generate method depend on the model. They are returned by the model’s preprocessor class, such as AutoTokenizer or AutoProcessor. You can learn more about the individual model’s preprocessor in the corresponding model’s documentation.

The process of selecting output tokens to generate text is known as decoding, and you can customize the decoding strategy that the `generate()` method will use. Modifying a decoding strategy can have a noticeable impact on the quality of the generated output. It can help reduce repetition in the text and make it more coherent.

[Text generation](https://huggingface.co/docs/transformers/v4.45.1/en/main_classes/text_generation#transformers.GenerationMixin.generate) is essential to many NLP tasks:
- [Text summarization](https://huggingface.co/docs/transformers/tasks/summarization#inference)
- [Image captioning](https://huggingface.co/docs/transformers/model_doc/git#transformers.GitForCausalLM.forward.example)
- [Audio transcription](https://huggingface.co/docs/transformers/model_doc/whisper)

In [1]:
%%capture

!pip install transformers

!pip install accelerate
!pip install optimum
!pip install auto-gptq
!pip install bitsandbytes

In [2]:
%%capture

!pip install 'fhnw-nlp-utils>=0.11.0,<0.12.0'

from fhnw.nlp.utils.storage import download
from fhnw.nlp.utils.storage import load_dataframe

import pandas as pd
import numpy as np

**Make sure that a GPU is available (see [here](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm))!!!**

In [3]:
from fhnw.nlp.utils.system import set_log_level
from fhnw.nlp.utils.system import system_info

set_log_level()
print(system_info())

2024-11-09 10:15:33.426168: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-09 10:15:33.441139: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-09 10:15:33.445710: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


OS name: posix
Platform name: Linux
Platform release: 6.8.0-47-generic
Python version: 3.11.0rc1
CPU cores: 6
RAM: 31.11GB total and 23.38GB available
Tensorflow version: 2.17.0
GPU is available
GPU is a NVIDIA GeForce RTX 2070 with Max-Q Design with 8192MiB


(e.g. `from pynvml_utils import nvidia_smi`)


In [4]:
import transformers
import torch

model_id = "NousResearch/Meta-Llama-3.1-8B-Instruct"

llm = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "quantization_config": {"load_in_4bit": True}
    },
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

A decoding strategy for a model is defined in its generation configuration. When using pre-trained models for inference within a `pipeline()`, the models call the `PreTrainedModel.generate()` method that applies a default generation configuration under the hood. The default configuration is also used when no custom configuration has been saved with the model.

When you load a model, you can inspect the generation configuration that comes with it through `model.generation_config`.

In [5]:
llm.model.generation_config

GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

You can override any generation_config by passing the parameters and their values directly to the generate method:

```python
llm(prompt, num_beams=4, do_sample=True)
```

or you can save your custom decoding strategy:

```python
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=50, do_sample=True, top_k=50
)
generation_config.save_pretrained("model_folder")

...

generation_config = GenerationConfig.from_pretrained("model_folder")
generate_kwargs = {
    "generation_config": generation_config,
}

llm = transformers.pipeline(
    "text-generation",
    model=model_id,
    generate_kwargs=generate_kwargs,
)
```

Even if the default decoding strategy works for your task, you can still tweak a few things. Some of the commonly adjusted parameters include:

- `max_new_tokens`: the maximum number of tokens to generate. In other words, the size of the output sequence, not including the tokens in the prompt. As an alternative to using the output's length as a stopping criterion, you can choose to stop generation whenever the full generation exceeds some amount of time. To learn more, check `StoppingCriteria`.
- `num_beams`: by specifying a number of beams higher than 1, you are effectively switching from greedy search to beam search. This strategy evaluates several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would have been ignored by greedy search. Visualize how it works [here](https://huggingface.co/spaces/m-ric/beam_search_visualizer).
- `do_sample`: if set to True, this parameter enables decoding strategies such as multinomial sampling, beam-search multinomial sampling, Top-K sampling and Top-p sampling. All these strategies select the next token from the probability distribution over the entire vocabulary with various strategy-specific adjustments.
- `num_return_sequences`: the number of sequence candidates to return for each input. This option is only available for the decoding strategies that support multiple sequence candidates, e.g. variations of beam search and sampling. Decoding strategies like greedy search and contrastive search return a single output sequence.

In [6]:
from transformers import set_seed

# for reproducibility
set_seed(0) 

Let's get started and build our first chat (template).

<font color='red'>**TASK: Get an understanding of the differences between `system` and `user` prompts (e.g. by consulting [this](https://www.regie.ai/blog/user-prompts-vs-system-prompts) resource).**</font>

<font color='green'>**System prompts** ...</font>

<font color='green'>**User prompts** ...</font>

In [7]:
messages = [
    {
        "role": "system",
        "content": "You are a skilled Python developer specializing in database management and optimization.",
    },
    {
        "role": "user",
        "content": "I am experiencing a sorting issue in my database. Could you please provide Python code to help resolve this problem?",
    },
]

<font color='red'>**TASK: Go to [Chat Template](https://huggingface.co/docs/transformers/chat_templating) and get an understanding of how to process/handle chat templates within the Hugging Face ecosystem.**</font>


Let's check if it works...

In [8]:
def get_responses(outputs):
    return [output["generated_text"].split("<|start_header_id|>assistant<|end_header_id|>")[1] for output in outputs]

def get_response(outputs):
    return get_responses(outputs)[0]

In [9]:
from IPython.display import Markdown, display

prompt = llm.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm(prompt, max_new_tokens=512, do_sample=True)

display(Markdown(get_response(outputs)))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




I'd be happy to help you resolve your database sorting issue. However, I'll need more information about the problem you're experiencing.

To better assist you, could you please provide more details about your database setup, the data you're trying to sort, and the specific issue you're facing? This will help me provide a more accurate and effective solution.

Here are some questions to consider:

1. What type of database are you using (e.g., MySQL, PostgreSQL, MongoDB, SQLite)?
2. What is the structure of your database tables?
3. What data are you trying to sort (e.g., dates, integers, strings)?
4. What is the specific issue you're experiencing (e.g., slow performance, incorrect sorting order, inconsistent results)?

Once I have this information, I can provide you with a Python code snippet to help resolve your sorting issue.

If you're working with a specific database library, such as `pandas` or `sqlite3`, I can provide code examples using those libraries.

Here's an example of a basic sorting function using the `pandas` library:

```python
import pandas as pd

# create a sample DataFrame
data = {
    'Name': ['John', 'Alice', 'Bob', 'Eve'],
    'Age': [25, 30, 35, 20]
}
df = pd.DataFrame(data)

# sort by 'Age' column in ascending order
df_sorted = df.sort_values(by='Age')

print(df_sorted)
```

This code creates a sample DataFrame, sorts it by the 'Age' column in ascending order, and prints the result.

If you're using a different database library or have more complex sorting requirements, please provide more details, and I'll do my best to assist you with a Python code solution.

Alternatively we could also directly call...

In [10]:
outputs = llm(messages, max_new_tokens=512, do_sample=True)[0]['generated_text'][-1]

display(Markdown(outputs["content"]))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


I'd be happy to help you with your database sorting issue.

Before I provide you with code, could you please provide more information about your problem? This will help me better understand your requirements and provide a more accurate solution.

Here are some details that would be helpful to know:

1. What type of database are you using (e.g., MySQL, PostgreSQL, SQLite)?
2. What is the data type of the column you're trying to sort by?
3. What is the current state of your data (e.g., is it sorted, unsorted, or partially sorted)?
4. What is the desired order of your data (e.g., ascending, descending, alphabetical)?
5. Are there any specific constraints or requirements that I should be aware of?

Assuming you're using a Python database library such as `sqlite3` or `pandas`, here's an example code snippet that demonstrates how to sort data in a database:

**Sorting Data in a SQLite Database**
```python
import sqlite3

# Connect to the database
conn = sqlite3.connect('my_database.db')

# Create a cursor object
cur = conn.cursor()

# Define the table name and column to sort by
table_name ='my_table'
sort_column = 'id'

# Sort the data in ascending order
cur.execute(f"SELECT * FROM {table_name} ORDER BY {sort_column} ASC")

# Fetch the sorted data
sorted_data = cur.fetchall()

# Print the sorted data
for row in sorted_data:
    print(row)

# Close the connection
conn.close()
```
This code snippet demonstrates how to sort data in a SQLite database by a specified column. You can modify the code to suit your specific requirements.

If you're using a Pandas DataFrame, you can use the following code to sort data:
```python
import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('my_data.csv')

# Sort the data in ascending order
df_sorted = df.sort_values(by='id')

# Print the sorted data
print(df_sorted)
```
Please provide more information about your problem, and I'll be happy to assist you further!

<font color='red'>**TASK: Decide upon a prompt (you might want to define your own).**</font>

In case you are interested: Some details about the [Chat Markup Language (ChatML)](https://cobusgreyling.medium.com/openai-introduced-chat-markup-language-chatml-based-input-to-non-chat-modes-6ca4b267012f) displayed below.

In [11]:
messages = [
    {
        "role": "system",
        "content": "You are skilled story teller experienced in writing funny jokes.",
    },
    {
        "role": "user",
        "content": "Tell me a joke about an NLP class that does not pay attention to their lecturer.",
    },
]

prompt = llm.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are skilled story teller experienced in writing funny jokes.<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell me a joke about an NLP class that does not pay attention to their lecturer.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
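
The printed prompt shows how the chat template flattens the message list into a single string with special tokens. The following is a minimal sketch that mimics only the layout printed above. `render_llama3_prompt` is a hypothetical helper, not part of `transformers`, and the real template (applied by `apply_chat_template`) may do more, for example trimming whitespace or handling further roles.

```python
def render_llama3_prompt(messages, add_generation_prompt=True):
    """Hypothetical helper: rebuild the prompt layout shown in the output above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn starts with a role header followed by a blank line ...
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # ... and ends with the end-of-turn token.
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
example_prompt = render_llama3_prompt(example_messages)
print(example_prompt)
```

Because the prompt ends with the assistant header, splitting the generated text on that header (as `get_responses` does) isolates the model's answer.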




In [12]:
messages = [
    {
        "role": "system",
        "content": "You are a skilled writer experienced in writing emails.",
    },
    {
        "role": "user",
        "content": (
            "Write a email to schedule a meeting with your colleague John Doe to disucss the requirements for the RAG application. "
            "Ask him when he is free. " 
            # we might want to add an instruction to keep the email short, but for the sake of the task we skip this
            #"Keep the email short. "
        ),
    },
]

prompt = llm.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print(prompt)

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a skilled writer experienced in writing emails.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a email to schedule a meeting with your colleague John Doe to disucss the requirements for the RAG application. Ask him when he is free.<|eot_id|><|start_header_id|>assistant<|end_header_id|>




<font color='red'>**TASK (ongoing): Play with the parameters.**</font>

## Decoding Strategies

Certain combinations of the `generate()` parameters, and ultimately generation_config, can be used to enable specific decoding strategies. 

### Greedy Search

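Greedy search decoding is the default generation strategy of `generate()`: unless configured otherwise, the parameters are `num_beams=1` and `do_sample=False`. The generation configuration shipped with this model enables sampling (`"do_sample": true`), so we set both parameters explicitly.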
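
To make the idea concrete, here is a minimal sketch of greedy decoding on a toy "language model". The table `TOY_MODEL` and its probabilities are invented for illustration; a real LLM computes these distributions with a forward pass.

```python
# Hypothetical toy model: maps a prefix to a next-token distribution.
TOY_MODEL = {
    ("The",): {"dog": 0.4, "nice": 0.5, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy_decode(model, prefix, steps):
    """Always append the single most likely next token."""
    sequence = tuple(prefix)
    probability = 1.0
    for _ in range(steps):
        distribution = model[sequence]
        token = max(distribution, key=distribution.get)
        probability *= distribution[token]
        sequence += (token,)
    return sequence, probability


greedy_sequence, greedy_probability = greedy_decode(TOY_MODEL, ("The",), steps=2)
print(greedy_sequence, round(greedy_probability, 2))
```

In this toy table, greedy search commits to "nice" in the first step and therefore never sees the continuation "dog has", whose overall probability is higher.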

In [13]:
max_new_tokens = 1024

In [14]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=False, num_beams=1)

print(100 * '-')
print(get_response(outputs))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------


Here is a neutral and professional email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability over the next few days so we can schedule a time that suits you?

I look forward to hearing back from you and discussing the details of the project.

Best regards,
[Your Name]
----------------------------------------------------------------------------------------------------
CPU times: user 6.19 s, sys: 59.3 ms, total: 6.25 s
Wall time: 6.25 s


### Beam-Search Decoding

Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower-probability initial tokens and would have been ignored by greedy search. Beam search usually finds an output sequence with a probability at least as high as that of greedy search, but it is not guaranteed to find the most likely output.

Use this [demo](https://huggingface.co/spaces/m-ric/beam_search_visualizer) to visualize how beam-search decoding works.

To enable beam-search decoding set `num_beams` (aka number of hypotheses to keep track of) to a number that is greater than 1.
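
The mechanics can be sketched on a toy model. The table `TOY_MODEL` and the helper `beam_search` are assumptions made for illustration only; the `transformers` implementation additionally handles end-of-sequence tokens, length penalties, and batching.

```python
import math

# Hypothetical toy model: maps a prefix to a next-token distribution.
TOY_MODEL = {
    ("The",): {"dog": 0.4, "nice": 0.5, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(model, prefix, steps, num_beams):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [(tuple(prefix), 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for sequence, log_prob in beams:
            for token, prob in model[sequence].items():
                candidates.append((sequence + (token,), log_prob + math.log(prob)))
        # Rank all expansions and keep only the best `num_beams` of them.
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beams = candidates[:num_beams]
    return [(sequence, math.exp(log_prob)) for sequence, log_prob in beams]


beam_results = beam_search(TOY_MODEL, ("The",), steps=2, num_beams=2)
for sequence, probability in beam_results:
    print(sequence, round(probability, 2))
```

With two beams, the hypothesis starting with "dog" survives the first step and ends up ranked first, ahead of the greedy choice. With `num_beams=1` the function reduces to greedy search.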

In [15]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=False, num_beams=3, early_stopping=True)

print(100 * '-')
print(get_response(outputs))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------


Here is a professional email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability over the next few days so we can schedule a time that suits you?

I am available to meet at your convenience, either in person or virtually, and can accommodate your schedule.

Looking forward to hearing back from you.

Best regards,
[Your Name]
----------------------------------------------------------------------------------------------------
CPU times: user 45.4 s, sys: 7.78 s, total: 53.2 s
Wall time: 53.3 s


Another important feature of beam search is that we can compare the top beams after generation and choose the generated beam that best fits our purpose.

To enable this feature set the parameter `num_return_sequences` to the number of highest scoring beams that should be returned.



In [16]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=False, num_beams=3, early_stopping=True, num_return_sequences=3)

print(100 * '-')
for i, text in enumerate(get_responses(outputs)):
  print("{}: {}".format(i, text))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------
0: 

Here is a professional email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability over the next few days so we can schedule a time that suits you?

I am available to meet at your convenience, either in person or virtually, and can accommodate your schedule.

Looking forward to hearing back from you.

Best regards,
[Your Name]
1: 

Here is a professional email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability over the next

### Diverse Beam-Search Decoding

The diverse beam search decoding strategy is an extension of beam search that generates a more diverse set of beam sequences to choose from.

This approach has three main parameters: `num_beams`, `num_beam_groups`, and `diversity_penalty`. The diversity penalty ensures the outputs are distinct across groups, and beam search is used within each group.

In [17]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=False, num_beams=3, early_stopping=True, num_return_sequences=3, num_beam_groups=3, diversity_penalty=5.0)

print(100 * '-')
for i, text in enumerate(get_responses(outputs)):
  print("{}: {}".format(i, text))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------
0: 

Here is a neutral and professional email:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability over the next few days so we can schedule a time that suits you?

I look forward to hearing back from you and discussing the details.

Best regards,
[Your Name]
1: 

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I am reaching out to schedule a meeting with you to discuss the requirements for the RAG application. I would appreciate it if you could let me know your availability over the next few days, and we can schedule a meeting at a time that suits you.

Please reply with your preferred dates and times, and I will make sure to accommodate

As can be seen, the beam hypotheses differ only marginally from each other. This should not be too surprising since we do not use sampling, so there is little diversity in the generated text.

In open-ended generation, a couple of reasons have recently been brought forward why beam search might not be the best possible option:

- Beam search can work very well in tasks where the length of the desired generation is more or less predictable as in machine translation or summarization. But this is not the case for open-ended generation where the desired output length can vary greatly, e.g. dialog and story generation.

- Beam search can suffer from repetitive generation, which can be hard to control and can require a lot of tuning (this was an issue with simple/early models like GPT-2 and is far less of one with current LLMs).

- High-quality human language does not follow a distribution of high-probability next words. As humans, we want generated text to surprise us and not to be boring/predictable. The following figure shows this by plotting the probability a model would give to human text vs. what beam search does.

<img src="https://drive.switch.ch/index.php/s/QAe46FquKEqDlYL/download" alt="Probability assigned to human text vs. beam search output">

So let's introduce some randomness.

## Sampling

In its most basic form, sampling means randomly picking the next word according to its conditional probability distribution:

$$w_t \sim P(w|w_{1:t-1})$$

It becomes obvious that language generation using sampling is not deterministic anymore. 

In [18]:
from transformers import set_seed

# for reproducibility
set_seed(0)

### Multinomial Sampling
Multinomial sampling (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary of the model (as opposed to greedy search, which always chooses the token with the highest probability as the next token). Every token with a non-zero probability has a chance of being selected, thus reducing the risk of repetition.

To enable multinomial sampling, set `num_beams=1` and `do_sample=True`, and deactivate *top-k* sampling (more on this later) via `top_k=0`.
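
As a minimal sketch of what "sampling from the distribution" means, the snippet below draws next tokens from an invented distribution (`next_token_probs` is a toy assumption, not model output) and counts how often each token is chosen.

```python
import random
from collections import Counter

# Hypothetical next-token distribution for some prefix.
next_token_probs = {"dog": 0.4, "nice": 0.5, "car": 0.1}


def sample_next_token(distribution, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(distribution)
    weights = [distribution[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed for reproducibility
counts = Counter(sample_next_token(next_token_probs, rng) for _ in range(10_000))
print(counts.most_common())
```

Over many draws the relative frequencies approach the probabilities, so even the unlikely token "car" is selected from time to time. Greedy search would never pick it.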

In [19]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, num_beams=1, top_k=0)

print(100 * '-')
print(get_response(outputs))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------


Here is a professional email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I am reaching out to schedule a meeting with you to discuss the requirements for the RAG application. I would appreciate the opportunity to touch base with you to ensure we are on the same page regarding the project's specifications and timelines.

Would you be available to meet at your earliest convenience? Could you please let me know your availability, and I will schedule a meeting at a time that suits you?

Thank you for your time, and I look forward to hearing back from you.

Best regards,
[Your Name]
----------------------------------------------------------------------------------------------------
CPU times: user 8.36 s, sys: 83.4 ms, total: 8.45 s
Wall time: 8.44 s


### Beam-Search Multinomial Sampling

This decoding strategy combines beam search with multinomial sampling. 

To enable this strategy, set `num_beams` to a value greater than 1 and `do_sample=True`.

In [20]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, num_beams=3, early_stopping=True, num_return_sequences=3, top_k=0)

print(100 * '-')
for i, text in enumerate(get_responses(outputs)):
  print("{}: {}".format(i, text))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------
0: 

Here is a professional email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability over the next few days, and we can schedule a time that suits you?

I am available to meet at your convenience, either in person or virtually. Please reply to this email with a few dates and times that work for you, and I will do my best to accommodate your schedule.

Looking forward to hearing back from you.

Best regards,
[Your Name]
1: 

Here is a professional email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requir

### Temperature

The `temperature` parameter in LLMs directly affects the variability and randomness of generated responses. A lower LLM `temperature` value (close to 0) produces more deterministic and focused outputs, ideal for tasks requiring factual accuracy, such as summarization or translation. Conversely, a higher `temperature` value (e.g., 1.0 or above) introduces more diversity and creativity, as the model samples from a broader range of possible words. This makes higher temperatures suitable for creative writing.

Technically, lowering the `temperature` of the [softmax](https://en.wikipedia.org/wiki/Softmax_function#Smooth_arg_max) makes the distribution $P(w|w_{1:t-1})$ sharper: it increases the likelihood of high-probability words and decreases the likelihood of low-probability words.

For applications needing highly reliable and reproducible outputs, setting a lower LLM `temperature` (e.g., 0.2-0.3) helps maintain quality and reduces unexpected responses. In contrast, interactive applications like chatbots (or any content creation task) can benefit from a mid-range `temperature` (e.g., 0.7-0.9), allowing the model to be engaging without straying too far from meaningful responses.

By setting `temperature` $ \to 0$, temperature scaled sampling becomes equal to greedy decoding.

<img src="https://drive.switch.ch/index.php/s/B6a30e8WbLbmwx8/download" alt="Temperature">
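
The effect of the temperature on the softmax can be reproduced in a few lines. This is a minimal sketch with invented logits (`example_logits` is an assumption for illustration); it divides the logits by the temperature before normalizing.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (must be > 0)."""
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exponentials = [math.exp(value - largest) for value in scaled]
    total = sum(exponentials)
    return [value / total for value in exponentials]


example_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
probs_low = softmax_with_temperature(example_logits, temperature=0.1)
probs_mid = softmax_with_temperature(example_logits, temperature=1.0)
probs_high = softmax_with_temperature(example_logits, temperature=5.0)

for name, probs in [("T=0.1", probs_low), ("T=1.0", probs_mid), ("T=5.0", probs_high)]:
    print(name, [round(p, 3) for p in probs])
```

At a low temperature almost all probability mass sits on the top token (close to greedy decoding), while a high temperature flattens the distribution.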

In [21]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.1, top_k=0)

print(100 * '-')
print(get_response(outputs))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------


Here is a neutral and concise email:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability and suggest a few dates and times that work for you?

I look forward to hearing back from you.

Best regards,
[Your Name]
----------------------------------------------------------------------------------------------------
CPU times: user 4.84 s, sys: 73.8 ms, total: 4.91 s
Wall time: 4.91 s


If we set the temperature too high, the generated text can become gibberish...

Additionally, it can happen that the LLM starts to make things up (I had an example where it wrote "discuss the requirements for the **Results Affiliate Group (RAG)** application").

In [22]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, temperature=1.5, top_k=0)

print(100 * '-')
print(get_response(outputs))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------


Here is an email:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I am writing to schedule a meeting to discuss the requirements for our bid on the Rao Application Group (RAG) application. Your insight and expertise will be invaluable to this process, and I look forward to discussing this further with you.

Could you please let me know when your availability is the next few days so we can arrange a time that suits you?

Looking forward to hearing back from you.

Best regards,
[Your Name]
----------------------------------------------------------------------------------------------------
CPU times: user 6.96 s, sys: 73.4 ms, total: 7.03 s
Wall time: 7.03 s


### Top-K Sampling

In *top-k* sampling, only the *K* most likely next words are kept and the probability mass is redistributed among those *K* words.

To enable top-k sampling set `top_k` to the number of words you want to consider from the probability distribution for the next word in the sequence.

<img src="https://drive.switch.ch/index.php/s/3FxJ7e3r74zVy17/download" alt="top-k vs. top-p">
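
A minimal sketch of the filtering step, using an invented distribution (`example_distribution` and `top_k_filter` are illustrative assumptions; `transformers` performs the equivalent operation on logits):

```python
def top_k_filter(distribution, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


example_distribution = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_filter(example_distribution, k=2)
print(filtered)
```

The two surviving tokens keep their relative proportions (5:3), but their probabilities now sum to 1. Sampling then proceeds on this reduced distribution.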

In [23]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, top_k=20)

print(100 * '-')
print(get_response(outputs))
print(100 * '-')

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------


Here is a simple email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability this week or next so we can schedule a time that suits you?

I look forward to hearing back from you and discussing the requirements in more detail.

Best regards,
[Your Name]
----------------------------------------------------------------------------------------------------
CPU times: user 6.13 s, sys: 80 ms, total: 6.21 s
Wall time: 6.21 s


One concern with *top-k* sampling is that it does not dynamically adapt the number of words that are filtered from the next-word probability distribution $P(w|w_{1:t-1})$.
This can be problematic because some words are sampled from a very sharp distribution, whereas others are sampled from a much flatter distribution.

Thus, limiting the sample pool to a fixed size *K* risks producing gibberish for sharp distributions (unlikely words stay in the pool) and limits the model's creativity for flat distributions (reasonable words are cut off).

### Top-P (Nucleus) Sampling

Instead of sampling only from the most likely *K* words, *top-p* sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability *p*. The probability mass is then redistributed among this set of words. This way, the size of the set (i.e. the number of words in it) can dynamically increase and decrease according to the next word's probability distribution.
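
The dynamic set size can be seen in a minimal sketch with two invented distributions, one sharp and one flat. The helper `top_p_filter` is an illustrative assumption; it stops as soon as the cumulative probability reaches `p`, and the exact boundary handling in `transformers` may differ.

```python
def top_p_filter(distribution, p):
    """Keep the smallest set of top-ranked tokens whose cumulative probability reaches p."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


sharp_distribution = {"a": 0.9, "b": 0.05, "c": 0.03, "d": 0.02}
flat_distribution = {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}

sharp_nucleus = top_p_filter(sharp_distribution, p=0.75)
flat_nucleus = top_p_filter(flat_distribution, p=0.75)
print(len(sharp_nucleus), sharp_nucleus)
print(len(flat_nucleus), flat_nucleus)
```

For the sharp distribution a single token already covers the requested mass, whereas the flat distribution needs three tokens. A fixed *K* could not adapt in this way.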

In [24]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, top_k=0, top_p=0.80)

print(100 * '-')
print(get_response(outputs))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------


Here is a professional email to schedule a meeting with John Doe:

Subject: Meeting Invitation: RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability over the next few days so we can schedule a meeting at a time that suits you?

I am available to meet at your convenience, whether that's in-person or remotely. Please reply to this email with your preferred date and time, and I will make sure to adjust my schedule accordingly.

Thank you for your time, and I look forward to hearing back from you soon.

Best regards,
[Your Name]
----------------------------------------------------------------------------------------------------
CPU times: user 8.8 s, sys: 83.8 ms, total: 8.88 s
Wall time: 8.88 s


### Combinations

While *top-p* seems more elegant than *top-k* in theory, both methods work well in practice. *top-p* can also be used in combination with *top-k*, which avoids very low-ranked words while still allowing for some dynamic selection.
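How the two filters interact can be sketched as follows, under the assumption that the *top-k* cut is applied first and *top-p* is then applied to the renormalized pool. The function `top_k_top_p_filter` and the example distribution are hypothetical.

```python
def top_k_top_p_filter(probs, k, p):
    """Apply a top-k cut, then a top-p cut on the renormalized pool."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Step 1: top-k keeps a fixed-size pool of the most likely words.
    pool = ranked[:k]
    pool_mass = sum(probs[i] for i in pool)

    # Step 2: top-p keeps the smallest prefix of the pool whose
    # renormalized cumulative probability reaches p.
    keep, cumulative = [], 0.0
    for i in pool:
        keep.append(i)
        cumulative += probs[i] / pool_mass
        if cumulative >= p:
            break

    # Renormalize the surviving words so they sum to 1.
    return {i: probs[i] / (cumulative * pool_mass) for i in keep}


# Hypothetical next-word distribution with a long tail.
probs = [0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02]
filtered = top_k_top_p_filter(probs, k=4, p=0.80)
print(sorted(filtered))
```

Here *top-k* removes the four tail words, and *top-p* then drops the weakest of the remaining four, leaving three candidates.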

The next cell combines `top_k=20` with `top_p=0.80`. To get multiple independently sampled outputs, we can *again* set the parameter `num_return_sequences > 1`, as is done in the cells further below.

In [25]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, top_k=20, top_p=0.80)

print(100 * '-')
print(get_response(outputs))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------


Here is a neutral email:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I would like to schedule a meeting to discuss the requirements for the RAG application. Could you please let me know when you are available to meet?

Best regards,
[Your Name]
----------------------------------------------------------------------------------------------------
CPU times: user 3.67 s, sys: 80.6 ms, total: 3.75 s
Wall time: 3.75 s


## Contrastive Search

See [here](https://huggingface.co/blog/introducing-csearch#5-contrastive-search) or [here](https://github.com/yxuansu/Contrastive_Search_Is_What_You_Need) for a detailed explanation.

Contrastive Search is a [method proposed in 2022](https://arxiv.org/pdf/2202.06417) that suppresses token repetition by making it more difficult to select token sequences that have been output previously. It suppresses repetitions more effectively than Beam Search with an equivalent amount of computation.

To calculate the token $x_t$ at time $t$, the selection of $v$ from the token candidates $V^k$ is determined according to the following evaluation function. The term `model confidence` is the probability value calculated by the model for the next token (i.e. the probability of the candidate token $v$ given the past tokens $x_{<t}$). In Greedy Search, the next token is determined from this term alone.

<img src="https://drive.switch.ch/index.php/s/RFuQ8gFdAU46xll/download" alt="Contrastive Search Formula https://miro.medium.com/v2/resize:fit:720/format:webp/1*OEIwRq659xURfgrfGdWd1Q.png">

In `Contrastive Search`, a `degeneration penalty` term is added. This penalty is calculated from the past tokens $x_j$ and the current candidate token $v$, and is subtracted from the probability value output by the model.

In this context, $s$ represents the `cosine similarity` of the tokens' hidden states. For each candidate token $v$, the model’s inference is performed again to obtain its hidden state. The `cosine similarity` with the past hidden states is calculated, and if the candidate token is semantically close to previous tokens, a higher penalty is applied.

If the model’s inference is performed to calculate the hidden states for all candidates, the computational load becomes significant. Therefore, the penalty is calculated only for the `top_k` candidates with the highest probabilities.

$α$ is a parameter used to control the penalty. If it is set to 0, the process is equivalent to Greedy Search.
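The selection rule can be sketched in a few lines, assuming the scoring from the paper: `(1 - alpha) * model confidence - alpha * degeneration penalty`, where the penalty is the maximum cosine similarity to any past hidden state. The two-dimensional hidden states, the candidate names, and the helper functions are hypothetical toy values, not model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def contrastive_score(confidence, candidate_hidden, past_hidden, alpha):
    """(1 - alpha) * model confidence - alpha * degeneration penalty."""
    # Degeneration penalty: highest similarity to any previous hidden state.
    penalty = max(cosine_similarity(candidate_hidden, h) for h in past_hidden)
    return (1 - alpha) * confidence - alpha * penalty


# Hypothetical hidden states of the tokens generated so far.
past_hidden = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical top-k candidates: (model confidence, hidden state).
candidates = {
    "A": (0.6, [1.0, 0.0]),   # most likely, but identical to a past state
    "B": (0.4, [-1.0, 0.0]),  # less likely, dissimilar to the past states
}


def select(alpha):
    """Return the candidate with the highest contrastive score."""
    return max(
        candidates,
        key=lambda v: contrastive_score(*candidates[v], past_hidden, alpha),
    )


best_greedy = select(alpha=0.0)
best_contrastive = select(alpha=0.4)
print("alpha=0.0 picks", best_greedy)
print("alpha=0.4 picks", best_contrastive)
```

With `alpha=0.0` the score reduces to the model confidence, so the greedy choice "A" wins; with `alpha=0.4` the penalty for repeating a past hidden state makes "B" the better candidate.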

In [26]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, top_k=200, top_p=0.95, num_return_sequences=3, penalty_alpha=0.4)

print(100 * '-')
for i, text in enumerate(get_responses(outputs)):
  print("{}: {}".format(i, text))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------
0: 

Here is a neutral and professional email to schedule a meeting with John Doe:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I hope this email finds you well. I would like to schedule a meeting with you to discuss the requirements for the RAG application. Could you please let me know your availability over the next week so that we can schedule a time that suits you?

I look forward to hearing back from you and discussing the details.

Best regards,
[Your Name]
1: 

Here is a neutral email to schedule a meeting:

Subject: Meeting to Discuss RAG Application Requirements

Dear John,

I would like to schedule a meeting to discuss the requirements for the RAG application. Would you be available to meet at your convenience?

Please let me know a time that suits you and I will ensure I am available.

Best regards,
[Your Name]
2: 

Here is a draft email:

Subject: M

When we increase the temperature, it can happen that the LLM starts to make things up (I had an example where it wrote "I am available Monday to Friday, and I am flexible to meet at your preferred time.").
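Why a higher temperature makes such outputs more likely can be illustrated with a small sketch, assuming the usual definition in which the logits are divided by the temperature before the softmax. The logits and the helper `softmax_with_temperature` are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate next tokens.
logits = [4.0, 2.0, 1.0, 0.0]

results = {t: softmax_with_temperature(logits, t) for t in (0.7, 1.0, 1.2)}
for t, probs in results.items():
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature rises, probability mass moves from the most likely token to the less likely ones, so sampling picks unusual continuations more often.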

In [27]:
%%time

outputs = llm(prompt, max_new_tokens=max_new_tokens, do_sample=True, temperature=1.2, top_k=200, top_p=0.95, num_return_sequences=3, penalty_alpha=0.4)

print(100 * '-')
for i, text in enumerate(get_responses(outputs)):
  print("{}: {}".format(i, text))
print(100 * '-')

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


----------------------------------------------------------------------------------------------------
0: 

Here is a neutral, yet effective email:

Subject: Meeting Invitation: RAG Application Requirements Discussion

Dear John,

I'd like to schedule a meeting to discuss the requirements for the RAG application. Would you be available to meet at your earliest convenience? Please let me know a date and time that suits you.

Best regards,
[Your Name]
1: 

Here is an email to schedule a meeting with John Doe:

Subject: Meeting Invitation - RAG Application Discussion

Dear John,

I hope you're doing well. I wanted to touch base with you regarding the requirements for the RAG application. It would be helpful to discuss this further to ensure we're on the same page.

Would you be available to meet this week or early next week to discuss the requirements? Could you please let me know a time that suits you?

I look forward to hearing back from you.

Best regards,
[Your Name]
2: 

Here's a neutr