# Creating a Llama-3.1 LoRA adapter with the NeMo Framework and deploying it via NVIDIA NIM
It's Llama 3.1 Day, and we're excited to share our newest notebook, built in collaboration with NVIDIA, for finetuning with the NeMo Framework and deploying the result with an NVIDIA NIM. In this notebook, we'll finetune our own LoRA adapter on a cleaned-up version of the [Law StackExchange](https://huggingface.co/datasets/ymoslem/Law-StackExchange) dataset. Law StackExchange is a dataset of legal questions and answers. Each record consists of a question, its title, and human-provided answers. Given a Law StackExchange forum question, our goal is to auto-generate an appropriate title for it.

#### NVIDIA NeMo Framework and NVIDIA NIM
NVIDIA NeMo Framework is a scalable, cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech). It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints. After we finetune a LoRA adapter using NeMo, we deploy it using an NVIDIA NIM, an accelerated inference solution for generative AI models.

#### Prerequisites

Before you start this notebook, make sure you have an NGC API key that can access the Llama 3.1 NIM on NGC. To generate one, visit build.nvidia.com and click "Get API Key".
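A quick check before running the setup cell can save a failed download later. The sketch below assumes the key is exported as an environment variable named `NGC_API_KEY`; that name is an assumption, not something this notebook states, so adjust it to whatever your environment or setup script expects.

```python
import os

def ngc_key_available(env=None, var_name="NGC_API_KEY"):
    """Return True if a non-empty value is stored under var_name.

    var_name defaults to NGC_API_KEY, which is an assumed name.
    """
    env = os.environ if env is None else env
    return bool(env.get(var_name, "").strip())

if not ngc_key_available():
    print("No NGC key found in the environment; generate one at build.nvidia.com first.")
```

The check only confirms that a value is present. It does not verify that the key is valid or that it has access to the Llama 3.1 NIM.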

First, we install the NGC CLI and Docker, then pull the `.nemo` checkpoint that we will use for finetuning. This can take about 5-7 minutes.

In [1]:
%%bash

wget https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-ngc.sh -O setup-ngc
chmod +x setup-ngc
./setup-ngc

--2024-08-22 20:41:00--  https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-ngc.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1385 (1.4K) [text/plain]
Saving to: ‘setup-ngc’

     0K .                                                     100% 17.1M=0s

2024-08-22 20:41:01 (17.1 MB/s) - ‘setup-ngc’ saved [1385/1385]



Installing Docker CLI...
Hit:1 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease [1581 B]
Get:3 https://deb.nodesource.com/node_18.x nodistro InRelease [12.1 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:5 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  Packages [921 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:8 https://deb.nodesource.com/node_18.x nodistro/main amd64 Packages [9773 B]
Get:9 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [2498 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1425 kB]
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates/restricted amd64 Packages [3041 kB]
Get:12 http://archive.ubuntu.com/ubuntu jammy-updates/multiverse amd64 Packages [51.8

debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 5.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


Fetched 886 kB in 0s (2210 kB/s)
Selecting previously unselected package distro-info-data.
(Reading database ... 38566 files and directories currently installed.)
Preparing to unpack .../distro-info-data_0.52ubuntu0.7_all.deb ...
Unpacking distro-info-data (0.52ubuntu0.7) ...
Selecting previously unselected package lsb-release.
Preparing to unpack .../lsb-release_11.1.0ubuntu4_all.deb ...
Unpacking lsb-release (11.1.0ubuntu4) ...
Preparing to unpack .../libcurl4-openssl-dev_7.81.0-1ubuntu1.17_amd64.deb ...
Unpacking libcurl4-openssl-dev:amd64 (7.81.0-1ubuntu1.17) over (7.81.0-1ubuntu1.16) ...
Preparing to unpack .../curl_7.81.0-1ubuntu1.17_amd64.deb ...
Unpacking curl (7.81.0-1ubuntu1.17) over (7.81.0-1ubuntu1.16) ...
Preparing to unpack .../libcurl4_7.81.0-1ubuntu1.17_amd64.deb ...
Unpacking libcurl4:amd64 (7.81.0-1ubuntu1.17) over (7.81.0-1ubuntu1.16) ...
Setting up distro-info-data (0.52ubuntu0.7) ...
Setting up libcurl4:amd64 (7.81.0-1ubuntu1.17) ...
Setting up curl (7.81.0-1ubuntu

debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 3.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


Fetched 57.0 MB in 1s (73.0 MB/s)
Selecting previously unselected package docker-buildx-plugin.
(Reading database ... 38586 files and directories currently installed.)
Preparing to unpack .../docker-buildx-plugin_0.16.2-1~ubuntu.22.04~jammy_amd64.deb ...
Unpacking docker-buildx-plugin (0.16.2-1~ubuntu.22.04~jammy) ...
Selecting previously unselected package docker-ce-cli.
Preparing to unpack .../docker-ce-cli_5%3a27.1.2-1~ubuntu.22.04~jammy_amd64.deb ...
Unpacking docker-ce-cli (5:27.1.2-1~ubuntu.22.04~jammy) ...
Selecting previously unselected package docker-compose-plugin.
Preparing to unpack .../docker-compose-plugin_2.29.1-1~ubuntu.22.04~jammy_amd64.deb ...
Unpacking docker-compose-plugin (2.29.1-1~ubuntu.22.04~jammy) ...
Setting up docker-buildx-plugin (0.16.2-1~ubuntu.22.04~jammy) ...
Setting up docker-compose-plugin (2.29.1-1~ubuntu.22.04~jammy) ...
Setting up docker-ce-cli (5:27.1.2-1~ubuntu.22.04~jammy) ...
Installing NGC CLI...
Downloading .nemo model. This might take a few m

In [2]:
# this should list the .nemo checkpoint that was downloaded
!ls ./llama-3_1-8b-instruct-nemo_v1.0

llama3_1_8b_instruct.nemo


In [3]:
import os
import json
import numpy as np
from rouge_score import rouge_scorer, scoring

# Phase 1: Finetuning the LoRA adapter

## Step-by-step PEFT finetuning instructions

1. Prepare the dataset
2. Run the PEFT finetuning script
3. Inference with NeMo Framework
4. Check the model accuracy

### Step 1: Prepare the dataset

The dataset we use is a subset of the [Law-StackExchange dataset](https://huggingface.co/datasets/ymoslem/Law-StackExchange). We've already filtered and processed it, and it can be used to train the model for several tasks: question title generation (summarization), law domain question answering, and question tag generation (multi-label classification). To run your own data cleaning and preprocessing, please refer to the [data generation notebook](https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials/peft-curation-with-sdg). That tutorial also lets you generate synthetic data to increase the size of the dataset.

This dataset is licensed under the [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license. You can use it for any purpose, including commercial use, as long as you give appropriate attribution and distribute any derivative work under the same license. If you use the dataset in a publication, please cite the original authors and the [Law-StackExchange dataset](https://huggingface.co/datasets/ymoslem/Law-StackExchange) repository.

In [4]:
!wget https://huggingface.co/datasets/bigmlguy2234/hf-law-qa-dataset/resolve/main/law-qa-curated.zip 

--2024-08-22 21:08:41--  https://huggingface.co/datasets/bigmlguy2234/hf-law-qa-dataset/resolve/main/law-qa-curated.zip
Resolving huggingface.co (huggingface.co)... 18.172.134.24, 18.172.134.88, 18.172.134.124, ...
Connecting to huggingface.co (huggingface.co)|18.172.134.24|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/a6/d5/a6d5955c217c4e78e708cfea9bf52e46fb3c5cc93151c5447c804929b8db561a/b26fcd36ab38c6011cecb8f8d6f0e9990441dfa9d1fa9f9a8d740612493c4a90?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27law-qa-curated.zip%3B+filename%3D%22law-qa-curated.zip%22%3B&response-content-type=application%2Fzip&Expires=1724620122&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyNDYyMDEyMn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2E2L2Q1L2E2ZDU5NTVjMjE3YzRlNzhlNzA4Y2ZlYTliZjUyZTQ2ZmIzYzVjYzkzMTUxYzU0NDdjODA0OTI5YjhkYjU2MWEvYjI2ZmNkMzZhYjM

In [5]:
!unzip -j law-qa-curated.zip -d curated-data

Archive:  law-qa-curated.zip
  inflating: curated-data/law-qa-test.jsonl  
  inflating: curated-data/law-qa-val.jsonl  
  inflating: curated-data/law-qa-train.jsonl  


You should see the `law-qa-{train/val/test}.jsonl` splits in the `curated-data` folder.

In [6]:
DATA_DIR = os.path.join("./curated-data")

TRAIN_DS = os.path.join(DATA_DIR, "law-qa-train.jsonl")
VAL_DS = os.path.join(DATA_DIR, "law-qa-val.jsonl")
TEST_DS = os.path.join(DATA_DIR, "law-qa-test.jsonl")

You will see several fields in the `.jsonl`, including `title`, `question`, `answer`, and other associated metadata.
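To confirm which fields are actually present before building prompts, it can help to look at the first record of a split. The helper below is a minimal sketch; `peek_jsonl` is a hypothetical name introduced here for illustration, and the path is the one used in this notebook.

```python
import json
import os

def peek_jsonl(path, n=1):
    """Return the first n non-empty records of a JSON Lines file as dicts."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            if line.strip():
                records.append(json.loads(line))
    return records

# Print the field names of the first training record, if the file is present.
path = "./curated-data/law-qa-train.jsonl"
if os.path.exists(path):
    for record in peek_jsonl(path):
        print(sorted(record.keys()))
```

Reading only the first few lines keeps the check fast even on large splits.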

For this tutorial, the input will be the `question` field, and the output will be its `title`.

The following cell does two things:
* Wraps each record in a template: an optional prompt instruction followed by the format `{PROMPT} \nQUESTION: {data["question"]} \nTITLE: `.
* Saves each processed split in the same location, appending `_preprocessed` to the file name.

In [7]:
# Add a prompt instruction.
PROMPT='''Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more.'''

# Creates a preprocessed version of the data files
for input_file in [TRAIN_DS, VAL_DS, TEST_DS]:
    output_file = input_file.rsplit('.', 1)[0] + '_preprocessed.jsonl'
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        for line in infile:
            # Parse each line as JSON
            data = json.loads(line)

            # Create a new dictionary with only the desired fields, renamed and formatted
            new_data = {
                "input": f'''{PROMPT} \nQUESTION: {data["question"]} \nTITLE: ''',
                "output": data['title']
            }

            # Write the new data as a JSON line to the output file
            json.dump(new_data, outfile)
            outfile.write('\n')  # Add a newline after each JSON object

    print(f"Processed {input_file} and created {output_file}")

Processed ./curated-data/law-qa-train.jsonl and created ./curated-data/law-qa-train_preprocessed.jsonl
Processed ./curated-data/law-qa-val.jsonl and created ./curated-data/law-qa-val_preprocessed.jsonl
Processed ./curated-data/law-qa-test.jsonl and created ./curated-data/law-qa-test_preprocessed.jsonl


After running the above cell, you will see `law-qa-{train/test/val}_preprocessed.jsonl` files appear in the data directory.

This is what a formatted example looks like:

```json
{"input": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \nTITLE: ", 
 "output": "What constitutes \"doing business in a jurisdiction?\""}
```
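Before starting a training run, a quick format check on the preprocessed files can catch problems early. The sketch below assumes the layout shown above: exactly two string fields, `input` and `output`, with `input` ending in `TITLE: `. The function name `check_preprocessed` is hypothetical.

```python
import json
import os

def check_preprocessed(path, suffix="TITLE: "):
    """Count records and collect the line numbers that break the expected format."""
    total, bad_lines = 0, []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            total += 1
            record = json.loads(line)
            ok = (
                isinstance(record, dict)
                and set(record) == {"input", "output"}
                and isinstance(record["input"], str)
                and isinstance(record["output"], str)
                and record["input"].endswith(suffix)
                and bool(record["output"].strip())
            )
            if not ok:
                bad_lines.append(line_no)
    return total, bad_lines

# Check each split produced by the preprocessing cell, if it exists.
for split in ("train", "val", "test"):
    path = f"./curated-data/law-qa-{split}_preprocessed.jsonl"
    if os.path.exists(path):
        total, bad = check_preprocessed(path)
        print(f"{split}: {total} records, {len(bad)} malformed")
```

An empty `output` is flagged because a record without a target title gives the model nothing to learn from.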

### Step 2: Run PEFT finetuning script for LoRA

NeMo Framework includes a high-level Python script for fine-tuning, [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py), that abstracts away some of the lower-level API calls. Once you have your model downloaded and the dataset ready, LoRA fine-tuning with NeMo essentially comes down to running this script.
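To get a feel for how small a LoRA adapter is next to the 8B base model, the trainable parameter count can be estimated by hand. The training log in this notebook names the adapted modules `lora_kqv_adapter`, which suggests the adapter sits on the fused query/key/value projection. The sketch below combines that with the Llama 3.1 8B attention shapes and an adapter rank of 32. The rank is an assumption, since this notebook does not state it; under these assumptions the estimate comes out near the roughly 10.5 M trainable parameters the log reports.

```python
def lora_param_count(in_features, out_features, rank, num_layers):
    """Parameters in one low-rank pair (rank x in, out x rank), summed over layers."""
    return rank * (in_features + out_features) * num_layers

# Llama 3.1 8B attention shapes.
hidden_size = 4096
num_layers = 32
head_dim = 128
num_query_heads = 32
num_kv_heads = 8  # grouped-query attention

# Output width of the fused QKV projection: Q plus K plus V.
qkv_out = head_dim * (num_query_heads + 2 * num_kv_heads)

rank = 32  # ASSUMPTION: adapter rank, not stated in this notebook

trainable = lora_param_count(hidden_size, qkv_out, rank, num_layers)
print(f"{trainable:,} trainable parameters (~{trainable / 1e6:.1f} M)")
```

Relative to 8.0 B frozen parameters, that is on the order of a tenth of a percent of the model being updated.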

For this demonstration, the training run is capped at `max_steps`, and validation runs at the interval set by `val_check_interval`. With `max_steps=50` and `val_check_interval=0.2`, validation runs every 10 steps in the log. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.
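The arithmetic behind these settings is worth spelling out. The sketch below uses values from the training command and its log output; it assumes validation uses the same global batch size as training, which is consistent with the 77 validation batches shown in the log. The second part is a toy version of the patience rule: `min_delta=0.001` appears in the log, while `patience` is a hypothetical parameter whose actual value is not shown in this notebook.

```python
import math

# Values from the training command and its log output.
max_steps = 50
global_batch_size = 32
micro_batch_size = 1
data_parallel_size = 1  # one GPU with TP_SIZE=1 and PP_SIZE=1
val_check_interval = 0.2
train_examples = 1608
val_examples = 2434

# Micro-batches accumulated before each optimizer step.
grad_accum_steps = global_batch_size // (micro_batch_size * data_parallel_size)

# Training samples consumed over the whole run (just under one pass over 1608).
samples_consumed = max_steps * global_batch_size

# Steps between validation passes, matching steps 10, 20, 30 in the log.
steps_between_validation = round(val_check_interval * max_steps)

# Batches per validation pass (ASSUMES validation batch size equals global_batch_size).
val_batches = math.ceil(val_examples / global_batch_size)

def should_stop(val_losses, patience, min_delta=0.001):
    """Return True once `patience` consecutive checks fail to beat the best loss by min_delta."""
    best = float("inf")
    stale = 0
    for loss in val_losses:
        if best - loss >= min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return True
    return False

print(grad_accum_steps, samples_consumed, steps_between_validation, val_batches)
# Validation losses logged so far keep improving, so no stop is triggered.
print(should_stop([3.313, 2.565, 1.980], patience=2))
```

Because 50 steps of 32 samples cover 1,600 of the 1,608 training examples, this run amounts to roughly one epoch.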

> `NOTE:` In the block of code below, pass the paths to your train, test and validation data files as well as path to the .nemo model.

In [9]:
%%bash

# Set paths to the model, train, validation and test sets.
MODEL="./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TRAIN_DS="[./curated-data/law-qa-train_preprocessed.jsonl]"
VALID_DS="[./curated-data/law-qa-val_preprocessed.jsonl]"
TEST_DS="[./curated-data/law-qa-test_preprocessed.jsonl]"
TEST_NAMES="[law]"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

rm -rf results
OUTPUT_DIR="./results/Meta-llama3.1-8B-Instruct-titlegen"

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=0.2 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=32 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME}

    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-08-22 21:43:40 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-08-22 21:43:40 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 50
      log_every_n_steps: 10
      val_check_interval: 0.2
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: ./results/Meta-llama3.1-8B-Instruct-titlegen
      exp_dir: ./results/Meta-llama3.1-8B-Instruct-titlegen
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${mode

[NeMo W 2024-08-22 21:43:40 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True


[NeMo I 2024-08-22 21:43:40 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.


TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo E 2024-08-22 21:43:41 exp_manager:703] exp_manager received explicit_log_dir: ./results/Meta-llama3.1-8B-Instruct-titlegen and at least one of exp_dir: ./results/Meta-llama3.1-8B-Instruct-titlegen, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2024-08-22 21:43:41 exp_manager:630] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints. Training from scratch.


[NeMo I 2024-08-22 21:43:41 exp_manager:396] Experiments will be logged at results/Meta-llama3.1-8B-Instruct-titlegen
[NeMo I 2024-08-22 21:43:41 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2024-08-22 21:43:41 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 50. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does

[NeMo I 2024-08-22 21:43:57 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2024-08-22 21:43:57 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-08-22 21:43:57 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-08-22 21:43:57 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2024-08-22 21:43:57 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2024-08-22 21:43:57 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2024-08-22 21:43:57 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2024-08-22 21:43:57 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2024-08-22 21:43:57 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2024-08-22 21:43:57 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-08-22 21:43:57 megatron_init:310] All tensor model parallel group ranks: 

[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_ag in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_split_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_atomic_rs in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:57 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: defer_embedding_wgrad_compute in 

[NeMo I 2024-08-22 21:43:57 tokenizer_utils:178] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2024-08-22 21:43:58 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-08-22 21:43:58 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:58 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:58 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:58 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 21:43:58 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_t

[NeMo I 2024-08-22 21:44:17 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Loading distributed checkpoint directly on the GPU
[NeMo I 2024-08-22 21:45:03 nlp_overrides:1180] Model MegatronGPTSFTModel was successfully restored from /root/verb-workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo.
[NeMo I 2024-08-22 21:45:03 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2024-08-22 21:45:03 nlp_adapter_mixins:203] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 8.0 B  | train
    ------------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2024-08-22 21:45:06 nlp_adapter_mixins:208] After adding PEFT params:
     

[NeMo W 2024-08-22 21:45:06 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-08-22 21:45:06 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-08-22 21:45:06 megatron_gpt_sft_model:811] Building GPT SFT validation datasets.
[NeMo I 2024-08-22 21:45:06 text_memmap_dataset:116] Building data files
[NeMo I 2024-08-22 21:45:06 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.071123
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.051730
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:158] Loading data files
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:249] Loading ./curated-data/law-qa-val_preprocessed.jsonl
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001202
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-08-22 21:45:07 megatron_gpt_sft_model:815] Length of val dataset: 2434
[NeMo I 2024-08-22 21:45:07 megatron_gpt_sft_model:822] Building GPT SFT traing datasets.
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:116] Building data files
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.062173
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:525] Processing 1 data files using 2 workers




[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.053765
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:158] Loading data files
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:249] Loading ./curated-data/law-qa-train_preprocessed.jsonl
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001269
[NeMo I 2024-08-22 21:45:07 text_memmap_dataset:165] Computing global indices


      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-08-22 21:45:07 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.04 (sec)
[NeMo I 2024-08-22 21:45:07 megatron_gpt_sft_model:824] Length of train dataset: 1608
[NeMo I 2024-08-22 21:45:07 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0
[NeMo I 2024-08-22 21:45:07 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2024-08-22 21:45:07 megatron_base_model:1199] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 50.


[NeMo I 2024-08-22 21:45:07 adapter_mixins:435] Unfrozen adapter : lora_kqv_adapter


  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 8.0 B  | train
------------------------------------------------
10.5 M    Trainable params
8.0 B     Non-trainable params
8.0 B     Total params
32,162.988Total estimated model params size (MB)
[NeMo W 2024-08-22 21:45:07 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    
[NeMo W 2024-08-22 21:45:07 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
    
[NeMo W 2024-08-22 21:45:14 nemo_logging:

Epoch 0: :  20%|██        | 10/50 [00:57<03:48, reduced_train_loss=3.340, global_step=9.000, consumed_samples=320.0, train_step_timing in s=5.650]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏         | 1/77 [00:03<03:58,  0.32it/s]
Validation DataLoader 0:   3%|▎         | 2/77 [00:06<03:55,  0.32it/s]
Validation DataLoader 0:   4%|▍         | 3/77 [00:09<03:52,  0.32it/s]
Validation DataLoader 0:   5%|▌         | 4/77 [00:13<04:06,  0.30it/s]
Validation DataLoader 0:   6%|▋         | 5/77 [00:16<03:59,  0.30it/s]
Validation DataLoader 0:   8%|▊         | 6/77 [00:21<04:17,  0.28it/s]
Validation DataLoader 0:   9%|▉         | 7/77 [00:24<04:09,  0.28it/s]
Validation DataLoader 0:  10%|█         | 8/77 [00:28<04:02,  0.28it/s]
Validation DataLoader 0:  12%|█▏        | 9/77 [00:31<03:56,  0.29it/s]
Validati

Metric val_loss improved. New best score: 3.313
Epoch 0, global step 10: 'validation_loss' reached 3.31278 (best 3.31278), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.313-step=10-consumed_samples=320.0.ckpt' as top 1
[NeMo W 2024-08-22 21:50:27 nlp_overrides:480] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: :  40%|████      | 20/50 [06:09<09:14, reduced_train_loss=2.870, global_step=19.00, consumed_samples=640.0, train_step_timing in s=5.700, val_loss=3.310]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏         | 1/77 [00:03<04:17,  0.30it/s]
Validation DataLoader 0:   3%|▎         | 2/77 [00:06<04:03,  0.31it/s]
Validation DataLoader 0:   4%|▍         | 3/77 [00:09<03:58,  0.31it/s]
Validation DataLoader 0:   5%|▌         | 4/77 [00:14<04:16,  0.29it/s]
Validation DataLoader 0:   6%|▋         | 5/77 [00:17<04:07,  0.29it/s]
Validation DataLoader 0:   8%|▊         | 6/77 [00:22<04:25,  0.27it/s]
Validation DataLoader 0:   9%|▉         | 7/77 [00:25<04:15,  0.27it/s]
Validation DataLoader 0:  10%|█         | 8/77 [00:28<04:08,  0.28it/s]
Validation DataLoader 0:  12%|█▏        | 9/77 [00:32<04:02,  0.28i

Metric val_loss improved by 0.748 >= min_delta = 0.001. New best score: 2.565
Epoch 0, global step 20: 'validation_loss' reached 2.56499 (best 2.56499), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.565-step=20-consumed_samples=640.0.ckpt' as top 1


Epoch 0: :  40%|████      | 20/50 [10:25<15:37, reduced_train_loss=2.870, global_step=19.00, consumed_samples=640.0, train_step_timing in s=5.700, val_loss=2.560][NeMo I 2024-08-22 21:55:40 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.313-step=10-consumed_samples=320.0.ckpt
[NeMo I 2024-08-22 21:55:40 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=3.313-step=10-consumed_samples=320.0-last.ckpt
Epoch 0: :  60%|██████    | 30/50 [11:24<07:36, reduced_train_loss=2.080, global_step=29.00, consumed_samples=960.0, train_step_timing in s=5.670, val_loss=2.560]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏   

Metric val_loss improved by 0.585 >= min_delta = 0.001. New best score: 1.980
Epoch 0, global step 30: 'validation_loss' reached 1.98005 (best 1.98005), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.980-step=30-consumed_samples=960.0.ckpt' as top 1


Epoch 0: :  60%|██████    | 30/50 [15:38<10:25, reduced_train_loss=2.080, global_step=29.00, consumed_samples=960.0, train_step_timing in s=5.670, val_loss=1.980][NeMo I 2024-08-22 22:00:54 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.565-step=20-consumed_samples=640.0.ckpt
[NeMo I 2024-08-22 22:00:54 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=2.565-step=20-consumed_samples=640.0-last.ckpt
Epoch 0: :  80%|████████  | 40/50 [16:37<04:09, reduced_train_loss=1.790, global_step=39.00, consumed_samples=1280.0, train_step_timing in s=5.650, val_loss=1.980]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏  

Metric val_loss improved by 0.220 >= min_delta = 0.001. New best score: 1.760
Epoch 0, global step 40: 'validation_loss' reached 1.75968 (best 1.75968), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.760-step=40-consumed_samples=1280.0.ckpt' as top 1


Epoch 0: :  80%|████████  | 40/50 [20:52<05:13, reduced_train_loss=1.790, global_step=39.00, consumed_samples=1280.0, train_step_timing in s=5.650, val_loss=1.760][NeMo I 2024-08-22 22:06:07 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.980-step=30-consumed_samples=960.0.ckpt
[NeMo I 2024-08-22 22:06:07 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.980-step=30-consumed_samples=960.0-last.ckpt
Epoch 0: : 100%|██████████| 50/50 [21:50<00:00, reduced_train_loss=1.710, global_step=49.00, consumed_samples=1600.0, train_step_timing in s=5.680, val_loss=1.760]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/77 [00:00<?, ?it/s]
Validation DataLoader 0:   1%|▏ 

Metric val_loss improved by 0.043 >= min_delta = 0.001. New best score: 1.717
Epoch 0, global step 50: 'validation_loss' reached 1.71680 (best 1.71680), saving model to '/root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0.ckpt' as top 1


Epoch 0: : 100%|██████████| 50/50 [26:04<00:00, reduced_train_loss=1.710, global_step=49.00, consumed_samples=1600.0, train_step_timing in s=5.680, val_loss=1.720][NeMo I 2024-08-22 22:11:19 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.760-step=40-consumed_samples=1280.0.ckpt
[NeMo I 2024-08-22 22:11:20 nlp_overrides:464] Removing checkpoint: /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.760-step=40-consumed_samples=1280.0-last.ckpt


`Trainer.fit` stopped: `max_steps=50` reached.


Epoch 0: : 100%|██████████| 50/50 [26:05<00:00, reduced_train_loss=1.710, global_step=49.00, consumed_samples=1600.0, train_step_timing in s=5.680, val_loss=1.720]


Restoring states from the checkpoint path at /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0.ckpt
Restored all states from the checkpoint at /root/verb-workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0.ckpt


This will create a LoRA adapter: a file named `megatron_gpt_peft_lora_tuning.nemo` in `./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/`. We'll use this later.
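
As a quick sanity check, the `validation_loss` values reported in the log above can be pulled out with a regular expression to confirm that the loss drops at every validation pass. This is only a minimal sketch: `extract_val_losses` is a hypothetical helper (not part of NeMo), and the sample lines are abbreviated copies of the log messages above.

```python
import re

# Matches log lines such as:
#   Epoch 0, global step 10: 'validation_loss' reached 3.31278 (best 3.31278), ...
VAL_LOSS_RE = re.compile(r"global step (\d+): 'validation_loss' reached ([0-9.]+)")


def extract_val_losses(log_text: str) -> list[tuple[int, float]]:
    """Return (global_step, validation_loss) pairs found in a training log."""
    return [(int(step), float(loss)) for step, loss in VAL_LOSS_RE.findall(log_text)]


# Abbreviated copies of the checkpoint messages from the run above.
sample_log = """
Epoch 0, global step 10: 'validation_loss' reached 3.31278 (best 3.31278), saving model
Epoch 0, global step 20: 'validation_loss' reached 2.56499 (best 2.56499), saving model
Epoch 0, global step 30: 'validation_loss' reached 1.98005 (best 1.98005), saving model
Epoch 0, global step 40: 'validation_loss' reached 1.75968 (best 1.75968), saving model
Epoch 0, global step 50: 'validation_loss' reached 1.71680 (best 1.71680), saving model
"""

history = extract_val_losses(sample_log)
for step, loss in history:
    print(f"step {step:>3}: validation_loss = {loss:.3f}")

# True when every validation pass improved on the previous one.
losses = [loss for _, loss in history]
is_decreasing = all(later < earlier for earlier, later in zip(losses, losses[1:]))
print("Monotonically decreasing:", is_decreasing)
```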

To further configure the run above:

* **A different PEFT technique**: The `model.peft.peft_scheme` parameter determines the technique being used. In this case we used LoRA, but NeMo Framework supports other techniques as well, such as P-tuning, Adapters, and IA3. For more information, refer to the [PEFT support matrix](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/peft/landing_page.html). For example, for P-tuning, set:

```bash
model.peft.peft_scheme="ptuning" # instead of "lora"
```

* **Tuning Llama-3.1 70B**: You will need 8xA100 or 8xH100 GPUs. Provide the path to its `.nemo` checkpoint (similar to the download and conversion steps earlier), and change the model parallelization settings for Llama-3.1 70B PEFT to distribute the model across the GPUs. It is also recommended to run the fine-tuning script directly from a terminal instead of Jupyter when using more than one GPU.

```bash
# Change the following settings, and run from a terminal directly
trainer.devices=8
model.tensor_model_parallel_size=8
model.pipeline_model_parallel_size=1
```
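
These settings have to be consistent with the number of GPUs. In Megatron-style training, the tensor-parallel and pipeline-parallel sizes multiply to give the number of GPUs that hold one copy of the model, and whatever factor remains is the data-parallel size. The sketch below checks that arithmetic; `data_parallel_size` is a hypothetical helper rather than a NeMo function, and it deliberately ignores context parallelism.

```python
def data_parallel_size(devices: int, num_nodes: int, tp: int, pp: int) -> int:
    """Number of model replicas for a given GPU count and parallelism layout.

    devices:   GPUs per node (trainer.devices)
    num_nodes: number of nodes (trainer.num_nodes)
    tp:        model.tensor_model_parallel_size
    pp:        model.pipeline_model_parallel_size
    """
    world_size = devices * num_nodes
    gpus_per_replica = tp * pp
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split evenly into replicas of "
            f"{gpus_per_replica} GPUs (tp={tp} x pp={pp})"
        )
    return world_size // gpus_per_replica


# The 70B settings above: all 8 GPUs hold a single tensor-parallel replica.
print(data_parallel_size(devices=8, num_nodes=1, tp=8, pp=1))

# One GPU with no model parallelism.
print(data_parallel_size(devices=1, num_nodes=1, tp=1, pp=1))
```

Both layouts give a data-parallel size of 1. A combination such as 8 GPUs with `tp=3` would raise an error in this sketch, because the GPUs cannot be divided evenly.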

You can override many such configurations while running the script. The full set of possible configurations is available in the [NeMo Framework GitHub repository](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/conf/megatron_gpt_finetuning_config.yaml).
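
Overrides are plain `key=value` arguments appended to the script invocation, so they can also be assembled programmatically when runs are launched from Python. A minimal sketch, assuming a hypothetical `to_overrides` helper; the keys are the ones shown above.

```python
def to_overrides(config: dict) -> list[str]:
    """Turn a flat dict into Hydra-style key=value command-line overrides."""
    return [f"{key}={value}" for key, value in config.items()]


overrides = to_overrides(
    {
        "trainer.devices": 8,
        "model.tensor_model_parallel_size": 8,
        "model.pipeline_model_parallel_size": 1,
        "model.peft.peft_scheme": "lora",
    }
)

# These strings would be appended to the `python <script>` command line.
print(" ".join(overrides))
```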

### Step 3: Inference with NeMo Framework

Text generation can also be run within the framework using a Python script. Note that this is intended for testing and validation, not as a full-fledged deployment solution like NVIDIA NIM.

In [10]:
# Check that the LoRA model file exists
!ls -l ./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints

total 307496
-rw-r--r-- 1 root root 146928238 Aug 22 22:11 'megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0-last.ckpt'
-rw-r--r-- 1 root root 146928238 Aug 22 22:11 'megatron_gpt_peft_lora_tuning--validation_loss=1.717-step=50-consumed_samples=1600.0.ckpt'
-rw-r--r-- 1 root root  21012480 Aug 22 22:11  megatron_gpt_peft_lora_tuning.nemo


In the code snippet below, the following configurations are worth noting:

1. `model.restore_from_path` is set to the path of the base model's `.nemo` file (`llama3_1_8b_instruct.nemo`).
2. `model.peft.restore_from_path` is set to the path of the PEFT checkpoint that was created in the fine-tuning run in the last step.
3. `model.data.test_ds.file_names` is set to the path of the preprocessed test file.

If you have made any changes in model or experiment paths, please ensure they are configured correctly below.
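
A wrong path only surfaces after the generation script has spent time starting up, so it can be worth checking the paths first. This is a minimal sketch: `missing_paths` is a hypothetical helper, and the paths are the ones used in this notebook, so adjust them if you changed yours.

```python
from pathlib import Path


def missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the names of entries whose path does not exist on disk."""
    return [name for name, location in paths.items() if not Path(location).exists()]


# Paths as used in this notebook (relative to the workspace directory).
paths = {
    "model.restore_from_path": "./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo",
    "model.peft.restore_from_path": "./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo",
    "model.data.test_ds.file_names": "./curated-data/law-qa-test_preprocessed.jsonl",
}

missing = missing_paths(paths)
if missing:
    for name in missing:
        print(f"Missing file for {name}: {paths[name]}")
else:
    print("All paths exist.")
```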

In [19]:
# Create a smaller test subset for a quick eval demonstration.
!head -n 128 ./curated-data/law-qa-test_preprocessed.jsonl > ./curated-data/law-qa-test_preprocessed-n128.jsonl

In [20]:
%%bash
MODEL="./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TEST_DS="[./curated-data/law-qa-test_preprocessed-n128.jsonl]" # Smaller test split
# TEST_DS="[./curated-data/law-qa-test_preprocessed.jsonl]" # Full test set
TEST_NAMES="[law]"

TP_SIZE=1
PP_SIZE=1

# This is where your LoRA checkpoint was saved
PATH_TO_TRAINED_MODEL="./results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="law_titlegen_lora"

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=32 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=50 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True  \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True \
    model.data.test_ds.add_bos=False \
    model.data.test_ds.add_eos=True \
    model.data.test_ds.add_sep=False \
    model.data.test_ds.label_key="output" \
    model.data.test_ds.prompt_template="\{input\}\ \{output\}"

    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-08-22 22:44:10 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-08-22 22:44:10 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo_o

[NeMo W 2024-08-22 22:44:10 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it 

[NeMo I 2024-08-22 22:44:27 megatron_init:263] Rank 0 has data parallel group : [0]
[NeMo I 2024-08-22 22:44:27 megatron_init:269] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-08-22 22:44:27 megatron_init:274] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-08-22 22:44:27 megatron_init:277] Ranks 0 has data parallel rank: 0
[NeMo I 2024-08-22 22:44:27 megatron_init:285] Rank 0 has context parallel group: [0]
[NeMo I 2024-08-22 22:44:27 megatron_init:288] All context parallel group ranks: [[0]]
[NeMo I 2024-08-22 22:44:27 megatron_init:289] Ranks 0 has context parallel rank: 0
[NeMo I 2024-08-22 22:44:27 megatron_init:296] Rank 0 has model parallel group: [0]
[NeMo I 2024-08-22 22:44:27 megatron_init:297] All model parallel group ranks: [[0]]
[NeMo I 2024-08-22 22:44:27 megatron_init:306] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-08-22 22:44:27 megatron_init:310] All tensor model parallel group ranks: 

24-08-22 22:44:27 - PID:31555 - rank:(0, 0, 0, 0) - microbatches.py:39 - INFO - setting number of micro-batches to constant 1
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.

[NeMo I 2024-08-22 22:44:27 megatron_base_model:584] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-08-22 22:44:27 megatron_base_model:1158] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_t

[NeMo I 2024-08-22 22:44:47 dist_ckpt_io:95] Using ('zarr', 1) dist-ckpt save strategy.
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Loading distributed checkpoint directly on the GPU
[NeMo I 2024-08-22 22:45:42 nlp_overrides:1180] Model MegatronGPTSFTModel was successfully restored from /root/verb-workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo.
[NeMo I 2024-08-22 22:45:42 nlp_adapter_mixins:203] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2024-08-22 22:45:45 nlp_adapter_mixins:208] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | 

[NeMo W 2024-08-22 22:45:45 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-08-22 22:45:45 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-08-22 22:45:45 megatron_gpt_sft_model:803] Building GPT SFT test datasets.
[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:116] Building data files
[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:525] Processing 1 data files using 6 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.171776
[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:525] Processing 1 data files using 6 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Av

[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.141256
[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:158] Loading data files
[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:249] Loading ./curated-data/law-qa-test_preprocessed-n128.jsonl
[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001299
[NeMo I 2024-08-22 22:45:45 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-08-22 22:45:45 megatron_gpt_sft_model:806] Length of test dataset: 128
[NeMo I 2024-08-22 22:45:45 megatron_gpt_sft_model:829] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2024-08-22 22:45:45 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
    
[NeMo W 2024-08-22 22:45:45 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `test_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    
    
      input_info_tensor = torch.cuda.FloatTensor(input_info)
    
      string_tensor = torch.as_tensor(
    


Testing DataLoader 0: 100%|██████████| 4/4 [05:46<00:00,  0.01it/s][NeMo I 2024-08-22 22:51:32 megatron_gpt_sft_model:561] Total deduplicated inference data size: 128 to 128
[NeMo I 2024-08-22 22:51:32 megatron_gpt_sft_model:712] Predictions saved to law_titlegen_lora_test_law_inputs_preds_labels.jsonl


[NeMo W 2024-08-22 22:51:32 megatron_gpt_sft_model:652] No training data found, reconfiguring microbatches based on validation batch sizes.
[NeMo W 2024-08-22 22:51:32 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-08-22 22:51:32 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss_law', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-08-22 22:51:32 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss', ..., sync_

Testing DataLoader 0: 100%|██████████| 4/4 [05:46<00:00,  0.01it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_loss         │    1.6102365255355835     │
│       test_loss_law       │    1.6102365255355835     │
│         val_loss          │    1.6102365255355835     │
└───────────────────────────┴───────────────────────────┘


### Step 4: Check the model accuracy

Now that the results are in, let's read them and calculate the accuracy on the question title generation task.
Let's take a look at one of the predictions in the generated output file. The `pred` key holds the generated title, and the `label` key holds the reference title.

In [21]:
# Take a look at predictions
!head -n1  law_titlegen_lora_test_law_inputs_preds_labels.jsonl

{"input": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \nTITLE:", "pred": " What constitutes a minimal business presence in a jurisdiction?", "label": " What constitutes \"doing business in a jurisdiction?\""}
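
To compare predictions and references side by side without reading raw JSON, each line of the file can be parsed into a dict. A minimal sketch: `read_jsonl_lines` is a hypothetical helper, and the sample record is a copy of the line above with the long `input` field shortened for illustration.

```python
import json


def read_jsonl_lines(lines) -> list[dict]:
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# Shortened copy of the first record in the predictions file (input abbreviated).
sample_lines = [
    r'{"input": "... \nTITLE:", "pred": " What constitutes a minimal business presence in a jurisdiction?", "label": " What constitutes \"doing business in a jurisdiction?\""}'
]

records = read_jsonl_lines(sample_lines)
for record in records:
    print("pred :", record["pred"].strip())
    print("label:", record["label"].strip())
```

To run this over the real output, pass an open file handle instead of `sample_lines`, for example `read_jsonl_lines(open("law_titlegen_lora_test_law_inputs_preds_labels.jsonl"))`.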


To evaluate this task, we will use ROUGE. It measures n-gram overlap between the prediction and the reference, and a higher score is better. While it is not perfect and does not capture the semantics of the prediction, it is a popular metric in academia and industry for evaluating such systems. The following method uses the `rouge_score` library to implement scoring. It will report `ROUGE_{1/2/L/Lsum}` metrics.

In [22]:
import json

import numpy as np
from rouge_score import rouge_scorer, scoring


def compute_rouge(input_file: str) -> dict:
    """Compute aggregate ROUGE F-measures over a JSONL file of predictions."""
    ROUGE_KEYS = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=True)
    aggregator = scoring.BootstrapAggregator()
    # Each line is a JSON record with 'input', 'pred', and 'label' keys
    with open(input_file) as f:
        lines = [json.loads(line) for line in f]
    num_response_words = []
    num_ref_words = []
    for line in lines:
        response = line['pred']
        answer = line['label']
        # RougeScorer.score expects (target, prediction)
        scores = scorer.score(answer, response)
        aggregator.add_scores(scores)
        num_response_words.append(len(response.split()))
        num_ref_words.append(len(answer.split()))

    # Bootstrap-aggregate the per-example scores and report the mid F-measure as a percentage
    result = aggregator.aggregate()
    rouge_scores = {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}
    print(rouge_scores)
    print(f"Average and stddev of response length: {np.mean(num_response_words):.2f}, {np.std(num_response_words):.2f}")
    print(f"Average and stddev of ref length: {np.mean(num_ref_words):.2f}, {np.std(num_ref_words):.2f}")

    return rouge_scores

In [23]:
compute_rouge("./law_titlegen_lora_test_law_inputs_preds_labels.jsonl")

{'rouge1': 40.6191, 'rouge2': 21.0074, 'rougeL': 36.6348, 'rougeLsum': 36.6726}
Average and stddev of response length: 11.61, 4.52
Average and stddev of ref length: 11.26, 4.97


{'rouge1': 40.6191, 'rouge2': 21.0074, 'rougeL': 36.6348, 'rougeLsum': 36.6726}

For the Llama-3.1-8B-Instruct model, you should see scores comparable to the following:

`{'rouge1': 39.2082, 'rouge2': 18.8573, 'rougeL': 35.4098, 'rougeLsum': 35.3906}`
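
To make the metric concrete, ROUGE-1 can be worked out by hand for the example prediction shown earlier. The sketch below is a simplified illustration, not the `rouge_score` implementation: it lowercases the text, keeps alphanumeric tokens, and skips stemming, so its numbers can differ slightly from the library's. The `rouge1_f` and `tokenize` names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f(prediction: str, reference: str) -> float:
    """Unigram-overlap F-measure between a prediction and a reference."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


pred = " What constitutes a minimal business presence in a jurisdiction?"
label = ' What constitutes "doing business in a jurisdiction?"'

print(f"ROUGE-1 F-measure: {rouge1_f(pred, label):.2f}")
```

Here 6 of the 9 predicted unigrams appear in the reference (precision 6/9) and 6 of the 7 reference unigrams appear in the prediction (recall 6/7), which gives an F-measure of 0.75 for this single example.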

# LoRA inference with NVIDIA NIM

Now that we've trained our LoRA, let's go ahead and deploy it with NVIDIA NIM. NIM lets you deploy multiple LoRA adapters and supports the `.nemo` and Hugging Face model formats. We will deploy the Law LoRA adapter.

## Before you begin

Let's download the NIM from NGC and get it up and running with the LoRA that we've trained.

Note that this cell might take a few minutes as it pulls the NIM.

In [25]:
%%bash

wget https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-nim.sh -O setup-nim
chmod +x setup-nim
export NGC_API_KEY="<your-ngc-api-key>"  # replace with your own NGC API key; never commit a real key
./setup-nim

--2024-08-22 23:31:56--  https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-nim.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1713 (1.7K) [text/plain]
Saving to: ‘setup-nim’

     0K .                                                     100% 35.8M=0s

2024-08-22 23:31:56 (35.8 MB/s) - ‘setup-nim’ saved [1713/1713]

https://docs.docker.com/engine/reference/commandline/login/#credential-stores



Login Succeeded
~/verb-workspace/loras ~/verb-workspace
~/verb-workspace


docker: Error response from daemon: Conflict. The container name "/meta-llama3_1-8b-instruct" is already in use by container "91a155b3514a3579bd2847691979ae75fe9a05f9dd18038273750c539774d74b". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.


CalledProcessError: Command 'b'\nwget https://raw.githubusercontent.com/brevdev/notebooks/main/assets/setup-nim.sh -O setup-nim\nchmod +x setup-nim\nexport NGC_API_KEY="<your-ngc-api-key>"\n./setup-nim\n'' returned non-zero exit status 125.

This notebook includes instructions to send an inference call to NVIDIA NIM using the Python `requests` library.

In [26]:
import requests
import json

## Check available LoRA models

Once the NIM server is up and running, check the available models as follows:

In [27]:
url = 'http://0.0.0.0:8000/v1/models'

response = requests.get(url)
data = response.json()

print(json.dumps(data, indent=4))

{
    "object": "list",
    "data": [
        {
            "id": "meta/llama-3_1-8b-instruct",
            "object": "model",
            "created": 1724369530,
            "owned_by": "system",
            "root": "meta/llama-3_1-8b-instruct",
            "parent": null,
            "max_model_len": 131072,
            "permission": [
                {
                    "id": "modelperm-51a67fefba3f417f9a0475f705f5acd0",
                    "object": "model_permission",
                    "created": 1724369530,
                    "allow_create_engine": false,
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_search_indices": false,
                    "allow_view": true,
                    "allow_fine_tuning": false,
                    "organization": "*",
                    "group": null,
                    "is_blocking": false
                }
            ]
        },
        {
            "id": "llama3.1-8b-

This returns all the models available for inference through NIM. In this case, it returns the base model as well as the LoRA adapter that was provided during NIM deployment, `llama3.1-8b-law-titlegen`.
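
When only the model names are needed, the `id` fields can be extracted from the response. A minimal sketch: `model_ids` is a hypothetical helper, and `sample_response` is an abbreviated stand-in for the JSON printed above, keeping only the fields the helper reads.

```python
def model_ids(models_response: dict) -> list[str]:
    """Return the `id` of every entry in a /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


# Abbreviated stand-in for the response above (most fields omitted).
sample_response = {
    "object": "list",
    "data": [
        {"id": "meta/llama-3_1-8b-instruct", "object": "model"},
        {"id": "llama3.1-8b-law-titlegen", "object": "model"},
    ],
}

print(model_ids(sample_response))
```

With a running NIM server, the same helper can be applied to the live response, for example `model_ids(requests.get(url).json())`.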

---
## Inference

Inference can be performed by sending POST requests to the `/v1/completions` endpoint.

A few things to note:
* The `model` parameter in the payload specifies the model that the request will be directed to. This can be the base model `meta/llama-3_1-8b-instruct`, or any of the LoRA models, such as `llama3.1-8b-law-titlegen`.
* The `max_tokens` parameter specifies the maximum number of tokens to generate. At any point, the number of input prompt tokens plus the requested number of output tokens must not exceed the model's maximum context length. For `meta/llama-3_1-8b-instruct`, the `/v1/models` response above reports a `max_model_len` of 131072 tokens.

The following code snippets show how to send requests that target different LoRAs (or tasks). NIM dynamically loads the LoRA adapters and serves the requests. It also internally handles the batching of requests belonging to different LoRAs, which allows better performance and more efficient use of compute.
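
The token budget described above can be checked before a request is sent. A minimal sketch, assuming hypothetical helpers `fits_context` and `max_new_tokens`; the 131072 limit is the `max_model_len` value reported by `/v1/models` earlier, and 141 is the prompt size of the example request in this notebook.

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """True when prompt plus requested output stays within the context limit."""
    return prompt_tokens + max_tokens <= max_model_len


def max_new_tokens(prompt_tokens: int, max_model_len: int) -> int:
    """Largest `max_tokens` value that still fits for a given prompt size."""
    return max(max_model_len - prompt_tokens, 0)


MAX_MODEL_LEN = 131072  # max_model_len reported by /v1/models for the base model

print(fits_context(prompt_tokens=141, max_tokens=50, max_model_len=MAX_MODEL_LEN))
print(max_new_tokens(prompt_tokens=141, max_model_len=MAX_MODEL_LEN))
```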

### Title Generation

Try sending an example from the test set.

In [31]:
url = 'http://0.0.0.0:8000/v1/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

# Example from the test set
prompt="Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \nTITLE: "
data = {
    "model": "llama3.1-8b-law-titlegen",
    "prompt": prompt,
    "max_tokens": 50
}

response = requests.post(url, headers=headers, json=data)
response_data = response.json()

print(json.dumps(response_data, indent=4))

{
    "id": "cmpl-70567d0d07fe4cdfb33762a801172d85",
    "object": "text_completion",
    "created": 1724369778,
    "model": "llama3.1-8b-law-titlegen",
    "choices": [
        {
            "index": 0,
            "text": "6-6-2015",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": 128001
        }
    ],
    "usage": {
        "prompt_tokens": 141,
        "total_tokens": 148,
        "completion_tokens": 7
    }
}
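
The generated text sits under `choices[0].text`, and the `usage` block reports how many tokens were consumed. A minimal sketch of pulling both out: `completion_text` is a hypothetical helper, and `sample_response` is an abbreviated copy of the response above.

```python
def completion_text(response_data: dict) -> str:
    """Return the text of the first completion choice, stripped of whitespace."""
    return response_data["choices"][0]["text"].strip()


# Abbreviated copy of the response printed above.
sample_response = {
    "model": "llama3.1-8b-law-titlegen",
    "choices": [{"index": 0, "text": "6-6-2015", "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 141, "total_tokens": 148, "completion_tokens": 7},
}

usage = sample_response["usage"]
print("Generated text:", completion_text(sample_response))
print("Prompt tokens:", usage["prompt_tokens"])
print("Completion tokens:", usage["completion_tokens"])

# Prompt and completion tokens should add up to the reported total.
usage_is_consistent = (
    usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
)
print("Usage adds up:", usage_is_consistent)
```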
