# 1. Multi-turn Conversational Data Preprocessing
---

This notebook teaches you how to preprocess multi-turn conversational (chat) data for a summarization task. It also introduces the fundamentals of instruction tuning datasets for LLMs and the data formats that NeMo-Run expects for fine-tuning. After working through this notebook, you will be able to preprocess data into a format that NeMo-Run accepts.

### Fundamentals of Instruction Tuning Dataset

After pre-training, instruction tuning (also known as supervised fine-tuning) is a crucial method for enhancing or unlocking specific abilities of LLMs, such as the ability to follow instructions. Based on how the formatted instances are constructed, instruction tuning data can be categorized into three main types: `NLP task datasets`, `daily chat datasets`, and `synthetic datasets`.

- **NLP Task Datasets**: This kind of dataset is built from collected NLP task datasets (e.g., text classification and summarization) paired with corresponding natural language task descriptions. Examples include [P3](https://huggingface.co/datasets/bigscience/P3) and [FLAN](https://arxiv.org/pdf/2301.13688).
- **Daily Chat Datasets**: Such a dataset is constructed based on real user conversations where queries are posed by humans, and responses are mainly generated by human labelers or LLMs. The conversation types include open-ended generation, question answering, brainstorming, and chatting. Examples include [ShareGPT](https://sharegpt.com/), [OpenAssistant](https://paperswithcode.com/dataset/oasst1), and [Dolly](https://github.com/databrickslabs/dolly).
- **Synthetic Datasets**: These kinds of datasets are typically constructed by instructing LLMs, based on pre-defined guidance rules or methods. Examples include [Self-Instruct-52K](https://github.com/yizhongw/self-instruct), [Alpaca](https://github.com/tatsu-lab/stanford_alpaca), and [Baize](https://arxiv.org/abs/2304.01196).


[NeMo Curator](https://docs.nvidia.com/nemo/curator/latest/about/index.html#about-overview) is one tool that can be used to construct a synthetic dataset. It is an open-source, enterprise-grade platform for scalable, privacy-aware data curation across `text`, `image`, and `video` modalities. A comprehensive example of how to get started with text curation using `NeMo Curator` can be found [here](https://docs.nvidia.com/nemo/curator/latest/get-started/text.html).


#### Formatted Instance Construction

Instruction tuning is an approach to fine-tuning pre-trained LLMs on a collection of formatted instances expressed in natural language. It is closely related to supervised fine-tuning and multi-task prompted training. To perform instruction tuning, we first need to collect or construct instruction-formatted instances. Then, we use these formatted instances to fine-tune LLMs in a supervised learning manner (e.g., training with sequence-to-sequence loss). An instruction-formatted instance consists of:
- a task description (called an instruction),
- an input,
- the corresponding output, and
- a small number of demonstrations (optional).
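
As a minimal sketch, the components listed above could be combined into a single training example as follows. The `build_instance` helper, the field names, the `###` prompt layout, and the toy dialogue are illustrative assumptions, not a format required by any library.

```python
from typing import Optional


def build_instance(instruction: str, input_text: str, output: str,
                   demonstrations: Optional[list] = None) -> dict:
    """Assemble one instruction-formatted instance (hypothetical layout)."""
    parts = [f"### Instruction: {instruction}"]
    # Optional few-shot demonstrations are placed before the actual input.
    for demo in demonstrations or []:
        parts.append(f"### Example input: {demo['input']}")
        parts.append(f"### Example output: {demo['output']}")
    parts.append(f"### Input: {input_text}")
    return {"input": "\n".join(parts), "output": output}


# Toy, made-up dialogue used only to show the resulting structure.
instance = build_instance(
    instruction="Write a summary of the conversation below.",
    input_text="Ann: Lunch at noon?\nBob: Sure!",
    output="Ann and Bob will have lunch at noon.",
)
print(instance["input"])
```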

The figure below illustrates the instance format and three different methods for constructing instruction-formatted instances. See Section 5 of the [arXiv:2303.18223](https://arxiv.org/abs/2303.18223) paper for further details.


<center><img src="images/formatted-instruction.png" width="700px" height="700px"></center>
<center>  An illustration of instance formatting and three different methods for constructing the instruction-formatted
instances. <br/> <i>source: <a href="https://arxiv.org/abs/2303.18223">arXiv:2303.18223 [cs.CL]</a></i></center>

### NeMo 2.0 Compatible Format For Custom Dataset

To perform fine-tuning in [NeMo 2.0](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html), we must first transform the training dataset into a predefined compatible format. NeMo 2.0 provides a `FineTuningDataModule` base class for fine-tuning Large Language Models (LLMs) on supervised tasks such as question answering, instruction tuning, function calling, and summarization. The `FineTuningDataModule` handles data loading, preprocessing, and batch creation for the training, validation, and testing phases. This class integrates with PyTorch Lightning's `LightningDataModule` and NeMo's SFT dataset classes (`GPTSFTDataset`, `GPTSFTChatDataset`, and `GPTSFTPackedDataset`).

NeMo's fine-tuning datasets are formatted as JSONL files. Each line of a file is a JSON object that should contain at least two keys, `input` and `output`. Additional keys can be added and are returned by the data loader. An example is given below:

```python
{"input": "This is the input/prompt/context/question for sample 1. Escape any double quotes like \"this\".", "output": "This is the output/answer/completion part of sample 1"}
{"input": "This is the input/prompt/context/question for sample 2. Escape any double quotes like \"this\".", "output": "This is the output/answer/completion part of sample 2"}
...

```
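
Before handing a file to the data module, it can help to confirm that every line parses as JSON and carries both required keys. The following is a minimal sketch using only the standard library; `check_jsonl_lines` is a hypothetical helper written for this notebook, not part of NeMo.

```python
import json


def check_jsonl_lines(lines):
    """Return (line_number, problem) pairs for records that look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        missing = [key for key in ("input", "output") if key not in record]
        if missing:
            problems.append((number, f"missing keys: {missing}"))
    return problems


sample_lines = [
    '{"input": "Say \\"hi\\".", "output": "hi"}',  # valid, with escaped quotes
    '{"input": "no output key"}',                   # missing "output"
    '{"input": "unterminated',                      # not valid JSON
]
for number, problem in check_jsonl_lines(sample_lines):
    print(number, problem)
```

The same function can be applied to a real file by passing the open file object, since iterating over a text file yields its lines.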

During training, by default, `input` and `output` are naively concatenated before being passed into the transformer model. Moreover, loss is only computed on the `output` tokens by default. These two behaviors can be customized with the `dataset_kwargs` field in the data module.

```python

FineTuningDataModule(
    ...,
    dataset_kwargs={
        "prompt_template": "Question: {input} Answer: {output}",  # default is "{input} {output}" (naive concatenation)
        "answer_only_loss": False,  # default is True (only calculate loss on answer/output)
    }
)

```
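
To build intuition for what a prompt template and an answer-only loss mean, the sketch below fills a template with plain Python string formatting and locates the characters that belong to the output. It is an illustration at the character level only: NeMo operates on tokens, and `render_with_answer_span` is a hypothetical helper, not NeMo's implementation.

```python
def render_with_answer_span(template: str, sample: dict):
    """Fill the template and locate the characters that belong to the output."""
    text = template.format(input=sample["input"], output=sample["output"])
    # Assumes "{output}" appears once in the template, after "{input}".
    prefix = template.split("{output}")[0].format(input=sample["input"])
    start = len(prefix)
    end = start + len(sample["output"])
    return text, (start, end)


sample = {"input": "What is 2 + 2?", "output": "4"}
for template in ("{input} {output}", "Question: {input} Answer: {output}"):
    text, (start, end) = render_with_answer_span(template, sample)
    print(repr(text), "-> answer span:", repr(text[start:end]))
```

With an answer-only loss, only the span reported here would contribute to the training loss; with it disabled, the whole rendered text would.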

#### Using FineTuningDataModule with Preprocessed Data

If you prefer to preprocess your dataset offline, you can use `FineTuningDataModule` directly by specifying the location of the preprocessed data following these two steps:
- Create training, validation, and test files by preprocessing your raw data into a `.jsonl` format:

```python
your_dataset_root/
    ├── training.jsonl
    ├── validation.jsonl
    └── test.jsonl

```
Each record in the `.jsonl` files should follow the format below:

```python
{"input": "This is for sample 1. Escape any double quotes like \"this\".", "output": "This is the output for sample 1"}
{"input": "This is for sample 2. Escape any double quotes like \"this\".", "output": "This is the output for sample 2"}
...

```
- Set up `FineTuningDataModule` to point to `dataset_root`, as well as any additional kwargs, if needed.

```python
FineTuningDataModule(
    dataset_root="your_dataset_path",
    seq_length=512,
    micro_batch_size=1,
    global_batch_size=128,
    dataset_kwargs={},
)

```
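
Because the data module is pointed at a directory rather than at individual files, a quick check that the directory contains the three expected files can catch naming mistakes early. This is a minimal sketch with a hypothetical helper, `missing_dataset_files`; the path `your_dataset_path` is the same placeholder used above.

```python
from pathlib import Path

EXPECTED_FILES = ("training.jsonl", "validation.jsonl", "test.jsonl")


def missing_dataset_files(dataset_root):
    """Return the expected JSONL file names that are absent from dataset_root."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means all three files were found.
print(missing_dataset_files("your_dataset_path"))
```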



#### Using ChatDataModule with Preprocessed Data

You can also use the NeMo `ChatDataModule` dataset format for offline preprocessed data. Each record consists of three fields: `mask`, `system`, and `conversations`.

- **Mask**: The role that needs to be masked out to prevent the role from participating in loss calculation.

- **System**: System prompt.

- **Conversations**: For each role, the conversation consists of two fields: `from` and `value`.

```python
{ 
    "mask": "User", 
    "system": "",
    "conversations": [
        {
            "from": "User", 
            "value": "You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose."
        }, 
        {
            "from": "Assistant", 
            "value": "Of course. How long would you like your password to be? And would you like it to include symbols?"
        }, 
        {
            "from": "User", 
            "value": "I would like it to be 12 characters long and yes, please include symbols."
        }, 
        {
            "from": "Assistant", 
            "value": "<TOOLCALL>[generate_password(length=12, include_symbols=True)]</TOOLCALL>"
        }
    ]
}
```
The data must also be stored as a `.jsonl` file with one record per line, as shown below.

```Python
{"system": "", "mask": "User", "conversations": [{"from": "User", "value": "You are an intelligent agent capable of invoking functions based on user queries. Given a question and a list of functions, your task is to identify and execute the appropriate functions. If no function is suitable, specify this. If required parameters are missing, indicate this as well. Return only the function call in the specified format. Here is the list of available functions in JSON format\n<AVAILABLE_TOOLS>\n[{\"name\": \"playlist\", \"description\": \"Fetch details and videos of a YouTube playlist using the provided playlist ID and optional parameters.\", \"parameters\": {\"is_id\": {\"description\":.."}, {"from": "Assistant", "value": "<TOOLCALL>[domain(domain_id=\"sample.info\"), playlist(is_id=\"PLghi\")]</TOOLCALL>"}]}

```
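
The role of the `mask` field can be illustrated with a small sketch that separates the turns of a record into those spoken by the masked role and those that would contribute to the loss. This only illustrates the idea described above; `split_turns_by_mask` and the toy record are hypothetical and do not reproduce NeMo's internal logic.

```python
def split_turns_by_mask(record: dict):
    """Separate masked turns from the turns that would contribute to the loss."""
    for field in ("mask", "system", "conversations"):
        if field not in record:
            raise KeyError(f"missing field: {field}")
    masked, trained = [], []
    for turn in record["conversations"]:
        # Each turn is expected to carry a role ("from") and its text ("value").
        target = masked if turn["from"] == record["mask"] else trained
        target.append(turn["value"])
    return masked, trained


# Toy, made-up record used only to show the structure.
record = {
    "mask": "User",
    "system": "",
    "conversations": [
        {"from": "User", "value": "Summarize: Ann: Lunch at noon?\nBob: Sure!"},
        {"from": "Assistant", "value": "Ann and Bob will have lunch at noon."},
    ],
}
masked, trained = split_turns_by_mask(record)
print("masked turns:", len(masked), "| trained turns:", len(trained))
```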


### DialogStudio Conversational Datasets

[DialogStudio](https://github.com/salesforce/DialogStudio) is the most comprehensive and diverse collection of publicly available dialogue datasets, unified under a consistent format. By aggregating dialogues from various sources, DialogStudio facilitates holistic analysis and the development of models that are adaptable to a range of conversational scenarios.
The collection spans an extensive range of domains, aspects, and tasks, encompassing several categories: `Open-Domain Dialogues`, `Task-Oriented Dialogues`, `Natural Language Understanding`, `Conversational Recommendation`, `Dialogue Summarization`, and `Knowledge-Grounded Dialogues`. Thus, it can provide support for research in both individual dialogue tasks and large-scale language pre-training.

<center><img src="images/dialogstudio-dist.png" width="700px" height="700px"></center>
<center> (a) shows the distribution of all datasets in DialogStudio. The outer and inner circles list the names of the datasets and
the associated categories, respectively. (b) illustrates the domains covered by Task-Oriented Dialogues in DialogStudio. <br/> <i>source: <a href="https://arxiv.org/pdf/2307.10172">arXiv:2307.10172 [cs.CL]</a></i></center> <br/>

For learning purposes, we will focus on the `Dialogue Summarization` category for data preprocessing. The goal will be to preprocess some of the datasets (SAMSum and DialogSum) for a summarization task. 

#### Dialogue Summarization Dataset

In this section, we will examine two datasets: `SAMSum` and `DialogSum`. The remainder of this notebook walks through preprocessing the SAMSum dataset, while the DialogSum dataset is reserved for the learner activities.

#### SAMSum

[SAMSum Corpus](https://aclanthology.org/D19-5409/) is a dataset of chat dialogues that have been manually annotated with abstractive summaries. The full dataset contains 14,732 training samples, 818 validation samples, and 819 test samples. In this notebook, we use a subset of 1,000 training samples. The figure below displays the dictionary keys of the [JSON file](../data/SAMSum/train.json).

<img src="images/SAMSum.png" height="500px" width="500px">

Each training sample is stored under a key of the form `SAMSum--train--xxx`, where `xxx` is an index ranging from 1 to 1000. Let's explore the JSON file and display a sample by running the cells below.

In [None]:
import json
import pandas as pd
from pathlib import Path
import re

def read_json_file(filename):
    try:
        with open(filename, 'r', encoding='utf-8') as file:
            data = json.load(file)
        return data
    except FileNotFoundError:
        print(f"Error: File '{filename}' not found.")
        return None
    except json.JSONDecodeError as e:
        print(f"Error: Invalid JSON format - {e}")
        return None
    except Exception as e:
        print(f"Error reading file: {e}")
        return None

In [None]:
data = read_json_file("../data/SAMSum/train.json")  # read SAMSum training JSON file
data["SAMSum--train--10"]  # display the content of the 10th sample

**Expected Output:**
```text
{'original dialog id': '13809912',
 'dialog index': 10,
 'original dialog info': {'summary': "Matt invites Agnes for a date to get to know each other better. They'll go to the Georgian restaurant in Kazimierz on Saturday at 6 pm, and he'll pick her up on the way to the place."},
 'log': [{'turn id': 1,
   'user utterance': 'Do you want to go for date?',
   'system response': 'Wow! You caught me out with this question Matt.',
   'dialog history': '',
   'original user side information': {'speaker': 'Matt'},
   'original system side information': {'speaker': 'Agnes'}},
  {'turn id': 2,
   'user utterance': 'Why?',
   'system response': "I simply didn't expect this from you.",
   'dialog history': '<USER> Do you want to go for date? <SYSTEM> Wow! You caught me out with this question Matt.',
   'original user side information': {'speaker': 'Matt'},
   'original system side information': {'speaker': 'Agnes'}},
  {'turn id': 3,
...
```


Each training sample contains the keys `original dialog id`, `dialog index`, `original dialog info`, and `log`. The `original dialog info` field contains the dialogue summary, while `log` holds a list of dictionaries, one per turn. Each turn stores the user utterance, the system response, and the speaker names under `original user side information` and `original system side information`. To proceed, we write two functions:
- The first removes unwanted text and symbols (extra whitespace, URLs, `@` mentions, and tokens that start with `^`).
- The second extracts the dialogue summary and the dialogue log (conversation), and flattens the conversation into a single text. Together, the conversation and summary form a new sample for our processed dataset.


In [None]:
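def remove_umwanted_text_symbols(text):
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace into one space
    text = re.sub(r"\^[^ ]+", "", text)  # drop tokens that start with "^"
    text = re.sub(r"http\S+", "", text)  # drop URLs
    text = re.sub(r"@[^\s]+", "", text)  # drop @mentions
    return text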

def transform_conversation(data):
    transformed_text = ""
    summary = data["original dialog info"]["summary"]
    for row in data["log"]:
        user = remove_umwanted_text_symbols(row["user utterance"])
        transformed_text += f"user: {user.strip()}\n"
        agent = remove_umwanted_text_symbols(row["system response"])
        transformed_text += f"agent: {agent.strip()}\n"
    return transformed_text, summary 
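
The cleaning function above collapses whitespace first, so removing a URL or mention afterwards can leave consecutive spaces behind. As a minimal sketch of one possible alternative, the hypothetical `clean_text` below applies the same four patterns but normalizes whitespace last. It is not used elsewhere in this notebook.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical variant of the cleaning step that normalizes spaces last."""
    text = re.sub(r"\^[^ ]+", "", text)   # drop tokens starting with "^"
    text = re.sub(r"http\S+", "", text)   # drop URLs
    text = re.sub(r"@[^\s]+", "", text)   # drop @mentions
    text = re.sub(r"\s+", " ", text)      # collapse whitespace after removals
    return text.strip()


print(clean_text("Check   this ^_^ http://example.com @john  out"))
```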


Run the cell below to display the processed version of the sample `SAMSum--train--10`.

In [None]:
conversation, summary = transform_conversation(data["SAMSum--train--10"])
print("Conversation: \n", conversation)
print("Summary: ", summary)

**Expected Output:**
```text
Conversation: 
 user: Do you want to go for date?
agent: Wow! You caught me out with this question Matt.
user: Why?
agent: I simply didn't expect this from you.
user: Well, expect the unexpected.
agent: Can I think about it?
user: What is there to think about?
agent: Well, I don't really know you.
user: This is the perfect time to get to know eachother
agent: Well that's true.
...
Summary:  Matt invites Agnes for a date to get to know each other better. They'll go to the Georgian restaurant in Kazimierz on Saturday at 6 pm, and he'll pick her up on the way to the place.
```

Notice that the summary refers to the speakers by name, whereas the conversation replaces those names with `user` and `agent`. A model fine-tuned on this data may therefore generate summaries that use 'user' and 'agent' in place of the actual names. We can remedy this by extracting the speakers' names from `original user side information` and `original system side information`. Please run the cells below and confirm that the names now appear in the output.

In [None]:
def transform_conversation_speakers(data):
    transformed_text = ""
    summary = data["original dialog info"]["summary"]
    for row in data["log"]:
        user = remove_umwanted_text_symbols(row["original user side information"]["speaker"])
        system = remove_umwanted_text_symbols(row["original system side information"]["speaker"])
        user_utterance = remove_umwanted_text_symbols(row["user utterance"])
        transformed_text += f"{user.strip()}: {user_utterance.strip()}\n"
        system_response = remove_umwanted_text_symbols(row["system response"])
        transformed_text += f"{system.strip()}: {system_response.strip()}\n"
    return transformed_text, summary

In [None]:
conversation, summary = transform_conversation_speakers(data["SAMSum--train--10"])
print("Conversation: \n", conversation)
print("Summary: ", summary)

**Expected Output:**
```text
Conversation: 
Matt: Do you want to go for date?
Agnes: Wow! You caught me out with this question Matt.
Matt: Why?
Agnes: I simply didn't expect this from you.
Matt: Well, expect the unexpected.
Agnes: Can I think about it?
Matt: What is there to think about?
Agnes: Well, I don't really know you.
...
Summary:  Matt invites Agnes for a date to get to know each other better. They'll go to the Georgian restaurant in Kazimierz on Saturday at 6 pm, and he'll pick her up on the way to the place.

```


#### Task Description

Task descriptions (also known as instructions) are an important component of an instruction-following dataset. They tell a fine-tuned model which task to perform and can improve its performance. For example, one can add `Write a summary of the conversation below` or `Below is a conversation between a user and an agent. Write a summary of their conversation.` There are several ways to achieve this. A common approach is to prefix the task description to the input data. Below is a simple function that constructs and prefixes our summarization task description.

In [None]:
def format_task_description(conversation):
    return f"""### Instruction: Write a summary of the conversation below. ### Input: {conversation.strip()}""".strip()

### Constructing Dataset in FineTuningDataModule Format

As discussed previously, the `FineTuningDataModule` dataset consists of records containing both `input` and `output` fields. In our case, the input is the conversation prefixed with a task description, and the output is the summary.

```python

new_row = {
        "input": str(format_task_description(conversation)).strip(),
        "output": str(summary).strip()       
        }
```

Next, run the cells below to build the new preprocessed dataset.

In [None]:
def generate_preprocessed_dataset(dataset, type_data):
    # Only the three known splits are supported
    if type_data not in ("train", "val", "test"):
        raise ValueError(f"Unknown split: {type_data!r}")
    row_key = f"SAMSum--{type_data}--"  # keys look like "SAMSum--train--1"

    data_list = []
    for index in range(1, len(dataset) + 1):  # sample keys are 1-indexed
        key = row_key + str(index)
        conversation, summary = transform_conversation_speakers(dataset[key])
        new_row = {
            "input": str(format_task_description(conversation)).strip(),
            "output": str(summary).strip(),
        }
        data_list.append(new_row)
    return data_list
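
The loop above assumes that the sample keys are numbered contiguously from 1 to `len(dataset)`. If that assumption might not hold for a given file, one option is to derive the order from the keys themselves. The sketch below uses a hypothetical helper, `ordered_sample_keys`, and a toy dictionary to show the idea.

```python
import re


def ordered_sample_keys(dataset: dict, prefix: str):
    """Return keys that start with `prefix`, sorted by their numeric suffix."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    numbered = []
    for key in dataset:
        match = pattern.match(key)
        if match:
            numbered.append((int(match.group(1)), key))
    return [key for _, key in sorted(numbered)]


# Toy dictionary with unordered, non-contiguous keys.
toy = {"SAMSum--train--10": {}, "SAMSum--train--2": {}, "SAMSum--train--1": {}}
print(ordered_sample_keys(toy, "SAMSum--train--"))
```

Sorting on the integer suffix rather than on the raw string keeps `SAMSum--train--2` ahead of `SAMSum--train--10`.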

In [None]:
training = read_json_file("../data/SAMSum/train.json")  # read SAMSum training JSON file
validation = read_json_file("../data/SAMSum/val.json")  # read SAMSum validation JSON file
test = read_json_file("../data/SAMSum/test.json")  # read SAMSum test JSON file

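Now, let's process the training set and inspect the record at index 10. Because Python lists are zero-indexed, `training[10]` is the 11th sample (`SAMSum--train--11`).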

In [None]:
training = generate_preprocessed_dataset(training, "train")
training[10] 

**Expected Output:**

```python
{'input': '### Instruction: Write a summary of the conversation below. ### Input: Lucas: Hey! How was your day?\nDemi: Hey there!\nDemi: It was pretty fine, actually, thank you!\nDemi: I just got promoted! :D\nLucas: Whoa! Great news!\nLucas: Congratulations!\nLucas: Such a success has to be celebrated.\nDemi: I agree! :D\nDemi: Tonight at Death & Co.?\nLucas: Sure!\nLucas: See you there at 10pm?\nDemi: Yeah! See you there! :D',
 'output': 'Demi got promoted. She will celebrate that with Lucas at Death & Co at 10 pm.'}

```

The next step is to process the validation set. 

In [None]:
validation = generate_preprocessed_dataset(validation, "val")

Finally, run the cell below to process the test set.

In [None]:
test = generate_preprocessed_dataset(test, "test")

#### Save Data in JSONL Format

NeMo 2.0 accepts data in JSONL format. Therefore, we use the `save_jsonl` function defined below to save our dataset as `.jsonl` files. Each sample must be written to the file as a single JSON object on its own line. Additionally, the files must be named `training.jsonl`, `validation.jsonl`, and `test.jsonl`, because NeMo 2.0 uses this naming convention to locate the files during training.

In [None]:
from tqdm import tqdm

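def save_jsonl(filepath, data):
    # Create the output directory if it does not exist yet
    Path(filepath).parent.mkdir(parents=True, exist_ok=True)
    with open(filepath, "w", encoding="utf-8") as write_file:
        for row in tqdm(data):
            # Write one JSON object per line
            write_file.write(json.dumps(row) + '\n')
    print(f"{filepath} dataset saved ")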

In [None]:
training_path = "../data/SAMSum/finetune_module/training.jsonl"
validation_path = "../data/SAMSum/finetune_module/validation.jsonl"
test_path = "../data/SAMSum/finetune_module/test.jsonl"

save_jsonl(training_path, training)
save_jsonl(validation_path, validation)
save_jsonl(test_path, test)

**Expected Output:**
```python
100%|███████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 207751.94it/s]
../data/SAMSum/finetune_module/training.jsonl dataset saved 
100%|█████████████████████████████████████████████████████████████████████████████| 818/818 [00:00<00:00, 206384.79it/s]
../data/SAMSum/finetune_module/validation.jsonl dataset saved 
100%|█████████████████████████████████████████████████████████████████████████████| 819/819 [00:00<00:00, 208253.11it/s]
../data/SAMSum/finetune_module/test.jsonl dataset saved 

```


Display the first 3 records from the [training set](../data/SAMSum/finetune_module/training.jsonl).

In [None]:
!head -3 ../data/SAMSum/finetune_module/training.jsonl

**Expected Output:**
```text
{"input": "### Instruction: Write a summary of the conversation below. ### Input: Amanda: I baked cookies. Do you want some?\nJerry: Sure!", "output": "Amanda baked cookies and will bring Jerry some tomorrow."}
{"input": "### Instruction: Write a summary of the conversation below. ### Input: Olivia: Who are you voting for in this election?\nOliver: Liberals as always.\nOlivia: Me too!!\nOliver: Great", "output": "Olivia and Olivier are voting for liberals in this election."}
{"input": "### Instruction: Write a summary of the conversation below. ### Input: Tim: Hi, what's up?\nKim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating\nTim: What did you plan on doing?\nKim: Oh you know, uni stuff and unfucking my room\nKim: Maybe tomorrow I'll move my ass and do everything\nKim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies\nTim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores\nTim: It really helps\nKim: thanks, maybe I'll do that\nTim: I also like using post-its in kaban style", "output": "Kim may try the pomodoro technique recommended by Tim to get more stuff done."}

```
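
As a further sanity check, a saved file can be read back and compared with the in-memory records. The sketch below is self-contained: it writes two toy records to a temporary directory and reloads them with a hypothetical `load_jsonl` helper, so it does not touch the files created above.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(filepath):
    """Read a JSONL file back into a list of dictionaries."""
    with open(filepath, "r", encoding="utf-8") as read_file:
        return [json.loads(line) for line in read_file if line.strip()]


# Toy records, including one with embedded double quotes and a newline.
records = [
    {"input": "a", "output": "b"},
    {"input": "c \"quoted\"\nsecond line", "output": "d"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "training.jsonl"
    path.write_text("".join(json.dumps(r) + "\n" for r in records), encoding="utf-8")
    reloaded = load_jsonl(path)

print(reloaded == records)
```

Because `json.dumps` escapes quotes and newlines inside strings, each record stays on a single line and survives the round trip unchanged.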


---

### Lab Activity 1 (Optional)

Following the preprocessing steps used to construct a dataset in the format accepted by `FineTuningDataModule`, apply the same approach to preprocess the `DialogSum` summarization dataset. The `DialogSum` dataset is located at `../data/DialogSum`.

*Tips*
- *Examine the [train](../data/DialogSum/train.json) and [validation](../data/DialogSum/val.json) sets to identify the information you need to extract.*
- *Load the files using the `read_json_file` function and display a sample.*
- *Copy the `remove_umwanted_text_symbols` and `transform_conversation` functions. If applicable, modify the fields to suit your data extraction.*
- *Use or modify the `format_task_description` function to define your task description.*
- *Construct the `FineTuningDataModule` format by using or modifying the `generate_preprocessed_dataset` function.*
- *Save the preprocessed dataset (`training.jsonl` and `validation.jsonl`) to `../data/DialogSum/finetune_module/` by using or modifying the `save_jsonl` function.*
  
**Besides the given tips, you are free to explore an approach deemed best for you.**

In [None]:
### Write activity solution here








In [None]:
### Write activity solution here
### You can also add more cells









---
### Constructing dataset in ChatDataModule Format

As previously discussed, each `ChatDataModule` record consists of three fields: `mask`, `system`, and `conversations`. To build such records, we define one function that creates the `ChatDataModule` fields and another that generates the records for a whole dataset split.

In [None]:
def transform_conversation_speakers(data): ## This function is repeated to avoid scrolling up
    transformed_text = ""
    summary = data["original dialog info"]["summary"]
    for row in data["log"]:
        user = remove_umwanted_text_symbols(row["original user side information"]["speaker"])
        system = remove_umwanted_text_symbols(row["original system side information"]["speaker"])
        user_utterance = remove_umwanted_text_symbols(row["user utterance"])
        transformed_text += f"{user.strip()}: {user_utterance.strip()}\n"
        system_response = remove_umwanted_text_symbols(row["system response"])
        transformed_text += f"{system.strip()}: {system_response.strip()}\n"
    return transformed_text, summary

def create_chatDataModule(conversation, summary):
    record = {}
    record["system"] = ""
    record["mask"] = "User"
    record["conversations"] = [{"from":"User", "value": conversation}, {"from":"Response", "value": summary}]
    return record


def generate_conversation_chatDataModule(dataset, type_data):
    data_list = []
    row_key = ""
    if type_data == "train":
        row_key = "SAMSum--train--"
    elif type_data == "val":
        row_key = "SAMSum--val--"
    elif type_data == "test":
        row_key = "SAMSum--test--"
    for index in range(1, len(dataset)+1):
        key = row_key+str(index)
        conversation, summary = transform_conversation_speakers(dataset[str(key)])
        conversation = str(format_task_description(conversation)).strip()
        record = create_chatDataModule(conversation, str(summary).strip())
        data_list.append(record)
    return data_list


Loading the JSON dataset

In [None]:
chat_training = read_json_file("../data/SAMSum/train.json")  # read SAMSum training JSON file
chat_validation = read_json_file("../data/SAMSum/val.json")  # read SAMSum validation JSON file
chat_test = read_json_file("../data/SAMSum/test.json")  # read SAMSum test JSON file

Preprocessing the training, validation, and test sets.

In [None]:
training = generate_conversation_chatDataModule(chat_training, "train")
validation = generate_conversation_chatDataModule(chat_validation, "val")
test = generate_conversation_chatDataModule(chat_test, "test")

Let's display the record at index 10 (the 11th sample, since Python lists are zero-indexed) of the preprocessed training set.

In [None]:
#training = generate_conversation_chatDataModule(chat_training, "train")
training[10] 

**Expected Output:**
```python
{'system': '',
 'mask': 'User',
 'conversations': [{'from': 'User',
   'value': '### Instruction: Write a summary of the conversation below. ### Input: Lucas: Hey! How was your day?\nDemi: Hey there!\nDemi: It was pretty fine, actually, thank you!\nDemi: I just got promoted! :D\nLucas: Whoa! Great news!\nLucas: Congratulations!\nLucas: Such a success has to be celebrated.\nDemi: I agree! :D\nDemi: Tonight at Death & Co.?\nLucas: Sure!\nLucas: See you there at 10pm?\nDemi: Yeah! See you there! :D'},
  {'from': 'Response',
   'value': 'Demi got promoted. She will celebrate that with Lucas at Death & Co at 10 pm.'}]}
```

#### Saving Preprocessed Data in JSONL Format

As before, the preprocessed dataset is saved in the JSONL format that NeMo 2.0 accepts, in files named `training.jsonl`, `validation.jsonl`, and `test.jsonl`.


In [None]:
training_path = "../data/SAMSum/chat_module/training.jsonl"
validation_path = "../data/SAMSum/chat_module/validation.jsonl"
test_path = "../data/SAMSum/chat_module/test.jsonl"

save_jsonl(training_path, training)
save_jsonl(validation_path, validation)
save_jsonl(test_path, test)

**Expected Output:**

```python
100%|███████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 156650.01it/s]
../data/SAMSum/chat_module/training.jsonl dataset saved 
100%|█████████████████████████████████████████████████████████████████████████████| 818/818 [00:00<00:00, 163635.27it/s]
../data/SAMSum/chat_module/validation.jsonl dataset saved 
100%|█████████████████████████████████████████████████████████████████████████████| 819/819 [00:00<00:00, 164030.89it/s]
../data/SAMSum/chat_module/test.jsonl dataset saved 

```

Now, let's display the top 3 records from the [training set](../data/SAMSum/chat_module/training.jsonl).

In [None]:
!head -3 ../data/SAMSum/chat_module/training.jsonl

**Expected Output:**
```python

{"system": "", "mask": "User", "conversations": [{"from": "User", "value": "### Instruction: Write a summary of the conversation below. ### Input: Amanda: I baked cookies. Do you want some?\nJerry: Sure!"}, {"from": "Response", "value": "Amanda baked cookies and will bring Jerry some tomorrow."}]}
{"system": "", "mask": "User", "conversations": [{"from": "User", "value": "### Instruction: Write a summary of the conversation below. ### Input: Olivia: Who are you voting for in this election?\nOliver: Liberals as always.\nOlivia: Me too!!\nOliver: Great"}, {"from": "Response", "value": "Olivia and Olivier are voting for liberals in this election."}]}
{"system": "", "mask": "User", "conversations": [{"from": "User", "value": "### Instruction: Write a summary of the conversation below. ### Input: Tim: Hi, what's up?\nKim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating\nTim: What did you plan on doing?\nKim: Oh you know, uni stuff and unfucking my room\nKim: Maybe tomorrow I'll move my ass and do everything\nKim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies\nTim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores\nTim: It really helps\nKim: thanks, maybe I'll do that\nTim: I also like using post-its in kaban style"}, {"from": "Response", "value": "Kim may try the pomodoro technique recommended by Tim to get more stuff done."}]}

```

### Lab Activity 2 (Optional)

Following the preprocessing steps used to construct a dataset in the format accepted by `ChatDataModule`, apply the same approach to preprocess the `DialogSum` summarization dataset. The `DialogSum` dataset is located at `../data/DialogSum`.

*Tips*
- *Examine the [train](../data/DialogSum/train.json) and [validation](../data/DialogSum/val.json) sets to identify the information you need to extract.*
- *Load the files using the `read_json_file` function and display a sample.*
- *Copy the `remove_umwanted_text_symbols` and `transform_conversation` functions. If applicable, modify the fields to suit your data extraction.*
- *Use or modify the `format_task_description` function to define your task description.*
- *Construct the `ChatDataModule` format by using or modifying the `generate_conversation_chatDataModule` function.*
- *Save the preprocessed dataset (`training.jsonl` and `validation.jsonl`) to `../data/DialogSum/chat_module/` by using or modifying the `save_jsonl` function.*
  
**Besides the given tips, you are free to explore an approach deemed best for you.**

In [None]:
### Write activity solution here








In [None]:
### Write activity solution here
### You can also add more cells










Now that we have learned how to preprocess a dialogue dataset, let's proceed to the next notebook and learn to fine-tune a `Llama-3.1-8B` model with the preprocessed dataset. Please click the link below.

## <center><div style="text-align:center; color:#FF0000; border:3px solid red;height:80px;"> <b><br/> [Next Notebook](finetuning.ipynb) </b> </div></center>

---
### References
- [Overview of NeMo Curator](https://docs.nvidia.com/nemo/curator/latest/about/index.html#about-overview)
- [A Survey of Large Language Models](https://arxiv.org/abs/2303.18223)
- [NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)
- [DialogStudio GitHub repository](https://github.com/salesforce/DialogStudio)

### Licensing
Copyright © 2025 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.
