## Building A Text Summarization Model With NeMo-Run
---

This notebook teaches learners how to fine-tune a state-of-the-art (SOTA) model for a summarization task using NeMo-Run. It introduces the NeMo Framework, gives an overview of NeMo-Run, and covers the models NeMo supports for fine-tuning as well as LoRA. Upon completing this content, learners will be able to fine-tune a SOTA model for summarization and run inference with it.

### Overview of NeMo Framework

[NVIDIA NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html) is a scalable, cloud-native generative AI framework built for researchers and developers working on Large Language Models (LLMs), Multimodal, and Speech AI (e.g., Automatic Speech Recognition and Text-to-Speech). It provides end-to-end support for developing LLMs and can be used on-premises, in a data center, or with your preferred cloud provider. It also supports execution in `SLURM` or `Kubernetes`-enabled environments. NeMo Framework provides tools for efficient training and customization of LLMs. It includes default configurations for setting up a compute cluster, downloading data, and adjusting model hyperparameters, all of which can be customized to train on new datasets and models. In addition to pre-training, NeMo supports both [Supervised Fine-Tuning (SFT)](https://huggingface.co/learn/llm-course/en/chapter11/3) and [Parameter-Efficient Fine-Tuning (PEFT)](https://arxiv.org/pdf/2312.12148) techniques, such as [LoRA](https://arxiv.org/pdf/2106.09685), [P-tuning](https://arxiv.org/pdf/2110.07602), and others.

<center><img src="images/NeMo-arch.png" width="900" height="900" /></center>

There are two options available for launching the training process in NeMo: using the `NeMo 2.0 API interface` or with [NeMo Run](https://github.com/NVIDIA-NeMo/Run). For this notebook, our focus will be on using `NeMo Run`.

#### NeMo Supported Models for Finetuning LLM

NeMo comes equipped with a CLI that allows you to launch experiments locally or on a remote cluster. Through the CLI, you can list the models that can be fine-tuned. Run the cell below to view the pre-loaded `nemo llm finetune` factories.

In [None]:
!nemo llm finetune --help llama31_8b 

```python
...
╭─ Pre-loaded entrypoint factories, run with --factory ────────────────────────╮
│ baichuan2_7b               nemo.collections.llm.r…                           │
│ chatglm3_6b                nemo.collections.llm.r…                           │
│ deepseek_v2                nemo.collections.llm.r…                           │
│ deepseek_v2_lite           nemo.collections.llm.r…                           │
│ deepseek_v3                nemo.collections.llm.r…                           │
│ e5_340m                    nemo.collections.llm.r…                           │
│ gemma2_2b                  nemo.collections.llm.r…                           │
...
│ llama3_8b                  nemo.collections.llm.r…                           │
│ llama3_70b                 nemo.collections.llm.r…                           │
│ llama31_8b                 nemo.collections.llm.r…                           │
...

```

### Getting Started With NeMo Run 

NeMo Run is a powerful tool designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments. NeMo Run has three core responsibilities: [Configuration](https://github.com/NVIDIA-NeMo/Run/blob/main/docs/source/guides/configuration.md), [Execution](https://github.com/NVIDIA-NeMo/Run/blob/main/docs/source/guides/execution.md), and [Management](https://github.com/NVIDIA-NeMo/Run/blob/main/docs/source/guides/management.md).
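
The split between the Configuration and Execution responsibilities can be illustrated with a toy sketch in plain Python. This is not NeMo Run code: `ToyConfig`, `run_locally`, `dry_run`, and `train` are hypothetical names invented here, and the sketch only shows the general idea of describing a task once and handing it to interchangeable executors.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ToyConfig:
    """Hypothetical stand-in for a recipe: a callable plus its keyword arguments."""
    fn: Callable[..., Any]
    kwargs: Dict[str, Any] = field(default_factory=dict)


def run_locally(config: ToyConfig) -> Any:
    """Hypothetical executor that calls the configured function in this process."""
    return config.fn(**config.kwargs)


def dry_run(config: ToyConfig) -> str:
    """Hypothetical executor that only describes the call it would make."""
    args = ", ".join(f"{key}={value!r}" for key, value in sorted(config.kwargs.items()))
    return f"{config.fn.__name__}({args})"


def train(model: str, max_steps: int) -> str:
    """Placeholder task; a real recipe would launch training here."""
    return f"trained {model} for {max_steps} steps"


# The same configuration object is reused by two different executors.
recipe = ToyConfig(train, {"model": "llama31_8b", "max_steps": 100})
print(dry_run(recipe))
print(run_locally(recipe))

# Changing the recipe does not require touching the executors.
recipe.kwargs["max_steps"] = 40
print(run_locally(recipe))
```

Because the configuration is plain data until an executor consumes it, it can be inspected, modified, or logged before anything runs, which is the property the notebook relies on when it edits `recipe` attributes before launching.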



#### Fine-Tuning on a Custom Summarization Dataset with NeMo Run

One of the main benefits of NeMo-Run is that it decouples configuration from execution, so predefined executors can be reused while only the recipe changes. [Important reasons](https://github.com/NVIDIA-NeMo/Run/blob/main/docs/source/guides/why-use-nemo-run.md) for using NeMo Run are the `Flexibility`, `Modularity`, `Reproducibility`, and `Organization` it provides. To get started with fine-tuning:
- Set up your [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) to enable the automatic conversion of the model from Hugging Face.
- Configure the recipe in two steps: 1) convert the checkpoint from Hugging Face to NeMo format; 2) run fine-tuning using the converted checkpoint from step 1. A NeMo-Run experiment can define these two tasks and execute them sequentially; in this notebook, we run them as two separate steps.

Log in with your token via huggingface-cli.

In [None]:
!huggingface-cli login --token "Add-your-huggingface-token-here"

To configure the recipe, we first write a Python file (`llama3_1_8b.py`) that pulls the Llama 3.1 8B checkpoint from Hugging Face and converts it to NeMo format. Before that, we need to set the NeMo cache path where the checkpoint will be stored. By default, NeMo stores the checkpoint here: `NEMO_MODELS_CACHE=/root/.cache/nemo/models`. The cell below overrides this location with `/workspace/model/`.

In [None]:
import os

os.environ["NEMO_MODELS_CACHE"] = "/workspace/model/"
os.environ["NEMO_MODELS_CACHE"]

In [None]:
%%writefile llama3_1_8b.py
from nemo.collections import llm

if __name__ == '__main__':
    llm.import_ckpt(
        model=llm.LlamaModel(config=llm.Llama31Config8B()),
        source="hf://meta-llama/Meta-Llama-3.1-8B",
        overwrite=True,
    )

Run the script to pull the Llama 3.1 8B checkpoint from Hugging Face and convert it to NeMo format.

In [None]:
!torchrun llama3_1_8b.py

**Expected Output:**
```python
...

[NeMo I 2025-07-27 19:36:37 nemo_logging:393] Successfully saved checkpoint from iteration       0 to /workspace/model/meta-llama/Meta-Llama-3.1-8B
[NeMo I 2025-07-27 19:36:38 nemo_logging:393] Async finalization time took 10.174 s
Converted Llama model to Nemo, model saved to /workspace/model/meta-llama/Meta-Llama-3.1-8B in torch.bfloat16.
 $NEMO_MODELS_CACHE=/workspace/model 
Imported Checkpoint
├── context/
│   ├── artifacts/
│   │   └── generation_config.json
│   ├── nemo_tokenizer/
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer.json
│   │   └── tokenizer_config.json
│   ├── io.json
│   └── model.yaml
└── weights/
    ├── .metadata
    ├── __0_0.distcp
    ├── __0_1.distcp
    ├── common.pt
    └── metadata.json

```
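
Before moving on, it can help to confirm that the converted checkpoint is where the next steps expect it. The sketch below is a minimal check, assuming the layout shown in the expected output above (a `context/` and a `weights/` directory under `<cache root>/<model id>`). The helper names `checkpoint_dir` and `missing_entries` are hypothetical, and the mapping from the `hf://` source to a directory is inferred from the log above rather than taken from NeMo documentation.

```python
import os
from pathlib import Path
from typing import List, Optional, Sequence


def checkpoint_dir(source: str, cache_root: Optional[str] = None) -> Path:
    """Map an hf:// source to its directory under the models cache (inferred from the log above)."""
    root = cache_root or os.environ.get("NEMO_MODELS_CACHE", "/root/.cache/nemo/models")
    return Path(root) / source.removeprefix("hf://")


def missing_entries(ckpt: Path, expected: Sequence[str] = ("context", "weights")) -> List[str]:
    """Return the expected sub-directories that are not present under ckpt."""
    return [name for name in expected if not os.path.isdir(ckpt / name)]


ckpt = checkpoint_dir("hf://meta-llama/Meta-Llama-3.1-8B", cache_root="/workspace/model")
missing = missing_entries(ckpt)
if missing:
    print(f"{ckpt} is missing: {missing}")
else:
    print(f"{ckpt} looks complete")
```

If anything is reported as missing, rerunning the conversion cell is usually the simplest fix before starting fine-tuning.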

Run the cell below to define functions that configure the recipe and the local executor. Note that we set the PEFT scheme (`peft_scheme`) to LoRA. To perform full fine-tuning instead, set it to `None` (`peft_scheme=None`). [PEFT](https://arxiv.org/abs/2305.16742) fine-tunes a small number of (extra) model parameters instead of all the model's parameters, which significantly decreases computational and storage costs. One way to implement PEFT is the Low-Rank Adaptation (LoRA) technique. LoRA makes fine-tuning more efficient by greatly reducing the number of trainable parameters for downstream tasks. It does this by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. According to the [authors of LoRA](https://arxiv.org/abs/2106.09685), compared with fully fine-tuning GPT-3 175B with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times, with higher training throughput and no additional inference latency.

<center><img src="images/lora-arch.png" height="400" width="600"  /></center>
<center> LoRA Reparametrization and Weight Merging. <a href="https://huggingface.co/docs/peft/main/en/conceptual_guides/lora"> View source</a> </center>
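
The reparametrization and weight merging shown in the figure can be sketched in a few lines of plain Python. This is an illustration of the idea from the LoRA paper, where the frozen weight `W` is updated as `W' = W + scale * (B @ A)` with `scale = alpha / r`; it is not how NeMo implements it. The layer size (4096 x 4096) and rank (8) used for the parameter count are illustrative assumptions, not values read from the recipe.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, b, a, scale):
    """Return W' = W + scale * (B @ A), the merged weight used at inference time."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one linear layer."""
    return d_in * d_out, rank * (d_in + d_out)


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight, shape (d_out, d_in) = (2, 2)
b = [[1.0], [2.0]]            # trainable matrix B, shape (d_out, rank) = (2, 1)
a = [[3.0, 4.0]]              # trainable matrix A, shape (rank, d_in) = (1, 2)

merged = merge_lora(w, b, a, scale=0.5)
print(merged)

# Illustrative sizes only: one 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)
```

Because the low-rank update can be folded into `W` once training is done, the merged layer has the same shape and cost as the original one, which is why LoRA adds no extra inference latency.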

In [None]:
import nemo_run as run
from nemo.collections import llm

def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama31_8b.finetune_recipe(
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme='lora',
    )
    return recipe

def local_executor_torchrun(devices: int = 1) -> run.LocalExecutor:
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun")
    return executor

Instantiate the recipe and set `gpus_per_node` to the number of GPUs you intend to use. In our case, we use a single GPU, so the value is 1.

In [None]:
recipe = configure_recipe(gpus_per_node=1) 

##### Define Custom Data Source 

In the previous notebook, we preprocessed the SAMSum summarization dataset for both the `FineTuningDataModule` and `ChatDataModule` objects. To use the `FineTuningDataModule` object, replace the `recipe.data` value in the cell below with the following code snippet.

```python
recipe.data = run.Config(
    llm.FineTuningDataModule,
    dataset_root="../data/SAMSum/finetune_module/",
    seq_length=2048,  # 512
    micro_batch_size=1,
    global_batch_size=32,  # 128
)
```
For the fine-tuning process, we will use the ChatDataModule format for our custom preprocessed dataset.

In [None]:
recipe.data = run.Config(
    llm.ChatDataModule,
    dataset_root="../data/SAMSum/chat_module/",
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=32,
)

Next, set the training hyperparameters and launch the fine-tuning run.

In [None]:
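# Skip the sanity-check validation batches that normally run before training
recipe.trainer.num_sanity_val_steps = 0

# Need to set this to 1 since the default is 2
recipe.trainer.strategy.context_parallel_size = 1
recipe.trainer.val_check_interval = 100  # 0

# Disable validation batches and train for 100 steps
recipe.trainer.limit_val_batches = 0
recipe.trainer.max_steps = 100  # 40

# Write logs and checkpoints to a fixed directory
recipe.log.use_datetime_version = False
recipe.log.explicit_log_dir = '/workspace/log'

# Start from the checkpoint converted to NeMo format earlier
recipe.resume.restore_config.path = '/workspace/model/meta-llama/Meta-Llama-3.1-8B/'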
# adjust other hyperparameters as needed
# for example:
# recipe.optim.config.lr = 1e-6
# recipe.trainer.strategy.tensor_model_parallel_size = 2
# recipe.log.ckpt.save_top_k = 3

executor = local_executor_torchrun(devices=recipe.trainer.devices)
run.run(recipe, executor=executor)

**Likely Output:**
```python
...

Task 0: nemo.collections.llm.api.finetune
- Status: RUNNING
- Executor: LocalExecutor
- Job id: nemo.collections.llm.api.finetune-ztkbrsjbg7b76
- Local Directory: /root/.nemo_run/experiments/nemo.collections.llm.api.finetune/nemo.collections.llm.api.finetune_1753648690/nemo.collecti
...

i.finetune/0 [default0]:[NeMo I 2025-07-27 20:49:05 nemo_logging:393] Successfully saved checkpoint from iteration      99 to /workspace/log/checkpoints/model_name=0--val_loss=0.00-step=99-consumed_samples=3200.0-last.ckpt
i.finetune/0 [default0]:[NeMo I 2025-07-27 20:49:05 nemo_logging:393] Async checkpoint save for step 100 (/workspace/log/checkpoints/model_name=0--val_loss=0.00-step=99-consumed_samples=3200.0-last.ckpt) finalized successfully.
i.finetune/0 [default0]:[NeMo I 2025-07-27 20:49:05 nemo_logging:393] Async finalization time took 0.256 s
i.finetune/0 I0727 20:49:17.888000 7860 torch/distributed/elastic/agent/server/api.py:879] [default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
i.finetune/0 I0727 20:49:17.889000 7860 torch/distributed/elastic/agent/server/api.py:932] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
i.finetune/0 I0727 20:49:17.890000 7860 torch/distributed/elastic/agent/server/api.py:946] Done waiting for other agents. Elapsed: 0.00012350082397460938 seconds
Job nemo.collections.llm.api.finetune-ztkbrsjbg7b76 finished: SUCCEEDED
```
<center><img src="images/train_output.png" width="700" height="700" /></center>
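
The `consumed_samples=3200.0` in the checkpoint name follows directly from the batch settings: 100 steps at a global batch size of 32. The sketch below works through that arithmetic, using the usual Megatron-style relationship in which the global batch is split into micro-batches across data-parallel ranks. The function names are hypothetical helpers written for this illustration.

```python
def data_parallel_size(world_size, tensor_parallel=1, pipeline_parallel=1, context_parallel=1):
    """Number of data-parallel replicas left after model parallelism is accounted for."""
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by the model-parallel size")
    return world_size // model_parallel


def grad_accumulation_steps(global_batch_size, micro_batch_size, dp_size):
    """Micro-batches each replica processes before one optimizer step."""
    samples_per_pass = micro_batch_size * dp_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * dp_size")
    return global_batch_size // samples_per_pass


def consumed_samples(steps, global_batch_size):
    """Total training samples seen after the given number of optimizer steps."""
    return steps * global_batch_size


# Values from this notebook: 1 GPU, micro batch 1, global batch 32, 100 steps.
dp = data_parallel_size(world_size=1)
print(grad_accumulation_steps(global_batch_size=32, micro_batch_size=1, dp_size=dp))  # 32
print(consumed_samples(steps=100, global_batch_size=32))  # 3200
```

With a single GPU and a micro batch of 1, each optimizer step accumulates gradients over 32 forward and backward passes, so raising `micro_batch_size` (memory permitting) reduces the number of passes per step without changing the effective batch size.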


### Running Inference

After successfully fine-tuning the Llama 3.1 8B checkpoint, we should evaluate the effectiveness of the fine-tuned model. First, as a sanity check, we can quickly evaluate the trained model's performance using NeMo's in-framework inference. To start, we need the path where the adapter checkpoint was saved, which appears in the training log (the inference script in this notebook uses this path without the `.ckpt` suffix):

```python
...
i.finetune/0 [default0]:[NeMo I 2025-07-27 20:49:05 nemo_logging:393] Async checkpoint save for step 100 (/workspace/log/checkpoints/model_name=0--val_loss=0.00-step=99-consumed_samples=3200.0-last.ckpt) finalized successfully.

...
```
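
Rather than copying the path by hand, it can be pulled out of the log text with a regular expression. The sketch below is one way to do this, assuming the log line has exactly the form shown above; `last_checkpoint_dir` and `CKPT_PATTERN` are hypothetical names, and dropping the `.ckpt` suffix simply mirrors the path used by the inference script in this notebook.

```python
import re

# Example line copied from the training log above.
LOG_LINE = (
    "i.finetune/0 [default0]:[NeMo I 2025-07-27 20:49:05 nemo_logging:393] "
    "Async checkpoint save for step 100 "
    "(/workspace/log/checkpoints/model_name=0--val_loss=0.00-step=99-consumed_samples=3200.0-last.ckpt) "
    "finalized successfully."
)

CKPT_PATTERN = re.compile(
    r"Async checkpoint save for step \d+ \((?P<path>[^)]+)\) finalized successfully"
)


def last_checkpoint_dir(log_text):
    """Return the most recently finalized checkpoint path without its .ckpt suffix, or None."""
    matches = CKPT_PATTERN.findall(log_text)
    if not matches:
        return None
    return matches[-1].removesuffix(".ckpt")


print(last_checkpoint_dir(LOG_LINE))
```

Taking the last match means that when several checkpoints were saved during a run, the most recent one is selected.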


In [None]:
%%writefile run_inference.py

from megatron.core.inference.common_inference_params import CommonInferenceParams
import nemo.lightning as nl
from nemo.collections.llm import api
import torch

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
    context_parallel_size=1,
    sequence_parallel=False,
    setup_optimizers=False,
    
)

trainer = nl.Trainer(
    accelerator="gpu",
    devices=1,
    num_nodes=1,
    strategy=strategy,
    plugins=nl.MegatronMixedPrecision(
        precision="bf16-mixed",
        params_dtype=torch.bfloat16,
        pipeline_dtype=torch.bfloat16,
    ),
)

prompts = [ "### Instruction: Write a summary of the conversation below. ### Input: Will: hey babe, what do you want for dinner tonight?\nEmma: gah, don't even worry about it tonight\nWill: what do you mean? everything ok?\nEmma: not really, but it's ok, don't worry about cooking though, I'm not hungry\nWill: Well what time will you be home?\nEmma: soon, hopefully\nWill: you sure? Maybe you want me to pick you up?\nEmma: no no it's alright. I'll be home soon, i'll tell you when I get home.\nWill: Alright, love you.\nEmma: love you too.",
"### Instruction: Write a summary of the conversation below. ### Input: Ollie: Hi , are you in Warsaw\nJane: yes, just back! Btw are you free for diner the 19th?\nOllie: nope!\nJane: and the 18th?\nOllie: nope, we have this party and you must be there, remember?\nJane: oh right! i lost my calendar.. thanks for reminding me\nOllie: we have lunch this week?\nJane: with pleasure!\nOllie: friday?\nJane: ok\nJane: what do you mean \" we don't have any more whisky!\" lol..\nOllie: what!!!\nJane: you just call me and the all thing i heard was that sentence about whisky... what's wrong with you?\nOllie: oh oh... very strange! i have to be carefull may be there is some spy in my mobile! lol\nJane: dont' worry, we'll check on friday.\nOllie: don't forget to bring some sun with you\nJane: I can't wait to be in Morocco..\nOllie: enjoy and see you friday\nJane: sorry Ollie, i'm very busy, i won't have time for lunch tomorrow, but may be at 6pm after my courses?this trip to Morocco was so nice, but time consuming!\nOllie: ok for tea!\nJane: I'm on my way..\nOllie: tea is ready, did you bring the pastries?\nJane: I already ate them all... see you in a minute\nOllie: ok"
 ]


groundtruth = [
    {"from": "Response", "value": "Emma will be home soon and she will let Will know."},
    {"from": "Response", "value": "Jane is in Warsaw. Ollie and Jane has a party. Jane lost her calendar. They will get a lunch this week on Friday. Ollie accidentally called Jane and talked about whisky. Jane cancels lunch. They'll meet for a tea at 6 pm."},
]
   

if __name__ == "__main__":
    # Adapter checkpoint path reported in the training log, without the .ckpt suffix
    adapter_checkpoint = "/workspace/log/checkpoints/model_name=0--val_loss=0.00-step=99-consumed_samples=3200.0-last"
    results = api.generate(
        path=adapter_checkpoint,
        prompts=prompts,
        trainer=trainer,
        inference_params=CommonInferenceParams(temperature=1, top_k=1, num_tokens_to_generate=100),
        text_only=True,
    )
    for chat, summary in zip(prompts, results):
        # Take the second line of the generated text as the summary
        top_summary = summary.split("\n")[1]
        print("Chat History: ", chat, "\n")
        print("=" * 50)
        print("Summary of the Chat ")
        print("=" * 50, '\n')
        print(top_summary)
        print("=" * 50, '\n')
    # print("Detailed Result: ", results)

In [None]:
!torchrun run_inference.py

**Likely Output:**

```python
...
[NeMo I 2025-07-27 21:48:34 nemo_logging:393] Adding lora to: module.module.decoder.layers.31.mlp.linear_fc2
[NeMo I 2025-07-27 21:48:35 nemo_logging:393] Using <megatron.core.dist_checkpointing.strategies.fully_parallel.FullyParallelLoadStrategyWrapper object at 0x7ff884163cb0> dist-ckpt load strategy.
[NeMo I 2025-07-27 21:48:35 nemo_logging:393] Global Checkpoint Load : Rank : 0 : Start time : 1753652915.032s : Time spent in load_checkpoint: 0.201s
static requests: 100%|████████████████████████████| 1/1 [00:09<00:00,  9.85s/it]
Chat History:  ### Instruction: Write a summary of the conversation below. ### Input: Will: hey babe, what do you want for dinner tonight?
Emma: gah, don't even worry about it tonight
Will: what do you mean? everything ok?
Emma: not really, but it's ok, don't worry about cooking though, I'm not hungry
Will: Well what time will you be home?
Emma: soon, hopefully
Will: you sure? Maybe you want me to pick you up?
Emma: no no it's alright. I'll be home soon, i'll tell you when I get home.
Will: Alright, love you.
Emma: love you too. 

==================================================
Summary of the Chat 
================================================== 

Emma doesn't want Will to cook dinner tonight. She will be home soon.
================================================== 
...
```
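
To compare a generated summary with its reference more systematically than by eye, a simple word-overlap score can be computed. The sketch below implements a unigram-overlap F1, a simplified stand-in for ROUGE-1 written for illustration; it is not the official ROUGE implementation, and the tokenizer is a deliberately naive assumption. The two strings are the generated summary and the ground-truth summary shown in this notebook.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_f1(candidate, reference):
    """F1 of unigram overlap between a candidate summary and a reference summary."""
    cand = Counter(tokens(candidate))
    ref = Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


generated = "Emma doesn't want Will to cook dinner tonight. She will be home soon."
reference = "Emma will be home soon and she will let Will know."
print(f"unigram F1: {unigram_f1(generated, reference):.2f}")
```

A score like this only measures word overlap, so two summaries with the same meaning but different wording can still score low; it is best used to track relative changes between training runs.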

After training for 100 steps, the generated summary is close to the corresponding entry in the `groundtruth` list above, but not identical. You can increase the number of training steps and adjust `temperature`, `top_k`, and `num_tokens_to_generate`; note that with `top_k=1` decoding is greedy, so changing `temperature` alone has no effect. Let's proceed to the next notebook and learn how to apply prompt engineering to our model. Please click the link below.

## <center><div style="text-align:center; color:#FF0000; border:3px solid red;height:80px;"> <b><br/> [Next Notebook](prompt-engineering.ipynb) </b> </div></center>

---

### References
- [Quickstart with NeMo-Run](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/quickstart.html)
- [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)
- [Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models](https://arxiv.org/pdf/2312.12148)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/pdf/2106.09685)
- [NeMo Run](https://github.com/NVIDIA-NeMo/Run)
- [HuggingFace Token](https://huggingface.co/docs/hub/en/security-tokens)
- [NeMo 2.0](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/index.html)

### Licensing
Copyright © 2025 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.