# Finetuning a large language model to generate next best actions in a Text Adventure Game.
### This notebook finetunes a Llama 3.1-8B-Instruct model using a LoRA adapter and SciWorld data.

This notebook showcases LoRA finetuning of **Llama 3.1-8B-Instruct** on the SciWorld dataset. It is derived from the tutorial posted at https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/sdg-law-title-generation/llama3-sdg-lora-nemofw.ipynb, which finetunes the same Llama 3.1 8B Instruct model on law data, and has been adapted to the DragonFire use case: reading the current game state and generating the next recommended game action. A tutorial for the original law-data notebook is available at https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama-3/sdg-law-title-generation/README.rst.  

We use the NVIDIA NeMo Framework, which simplifies the pipeline for finetuning the LLM. We run the finetuning portion of the experiment inside the NeMo container pulled from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo (the command below uses the `24.07` tag). This container is publicly available and should be run on either a Linux machine or Windows with WSL. A cloud GPU instance also works. Make sure sufficient GPU memory and system resources are available. We used a workstation with an NVIDIA A6000 GPU with 48 GB of memory.  

To replicate, use the following commands.  
This command creates the directory to store code files:  
mkdir ~/nvdata  

This command pulls and runs the NeMo container:  
docker run --gpus all --runtime=nvidia -it --rm --shm-size=16g -p 8888:8888 -p 6006:6006 \  
 -v ~/nvdata:/workspace/nvdata \  
--ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.07  

Once the container has downloaded and you are at the workspace prompt, run the following inside the container:  
pip install ipywidgets  
pip install scienceworld  
jupyter-lab --allow-root --ip='0.0.0.0'  

Now you can go to a browser and run this notebook.  

For this demonstration, we tune the model on the task of next-action generation: given the task description and the current observations from the game, generate the recommended next action.


In [11]:
import os
import json
import numpy as np
from rouge_score import rouge_scorer, scoring

### Retrieve the SciWorld dataset
We are finetuning the Llama model using the SciWorld dataset. SciWorld is an environment containing multistep science tasks, used to test how well text-based game and science agents can solve these tasks. The environment is at https://github.com/allenai/ScienceWorld.   

For this experiment, we use the version of this dataset used by Calmworld-GPT2, located at https://github.com/cognitiveailab/calm-scienceworld/tree/master/data. This is the CALM project adapted for SciWorld. CALM is the code accompanying the paper Keep CALM and Explore: Language Models for Action Generation in Text-based Games, available at https://arxiv.org/abs/2010.02903. While that study focused on the GPT-2 model, we extend the approach to Llama 3.1 8B Instruct and GPT-4.  

In [None]:
# If needed, create a data directory and download the formatted SciWorld data.
# Use the "raw" URL: the "blob" URL returns an HTML page rather than the zip file.
!mkdir -p data
!cd data && wget https://github.com/cognitiveailab/calm-scienceworld/raw/master/data/sciworld_formatted_data.zip
# Unzip into the data directory (adjust the target if the archive has its own top-level folder).
!unzip -o data/sciworld_formatted_data.zip -d data
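
A GitHub `blob` URL serves an HTML page rather than the file itself, so it is worth confirming that the download is a real archive before relying on it. A minimal sketch using only the standard library; the helper names `looks_like_zip` and `list_zip_members` are ours and not part of the dataset tooling:

```python
import zipfile


def looks_like_zip(path):
    """Return True if `path` exists and has a valid zip signature."""
    return zipfile.is_zipfile(path)


def list_zip_members(path):
    """Return the file names stored in the archive at `path`."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```

For example, `looks_like_zip("data/sciworld_formatted_data.zip")` returning `False` usually means an HTML page was saved instead of the archive.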

#### Preparing dataset
Helper functions to clean up the data and format it for fine-tuning.  

The convert_for_finetuning functions build the prompts and instruct the model on how to output the results of a query.  
There are two versions of the function: one for GPT-4o and other chat-style LLMs, and one for Llama and other instruct-style LLMs.  

In [40]:
import json
import pandas as pd

def write_jsonl(fname, json_objs):
    with open(fname, 'wt') as f:
        for o in json_objs:
            f.write(json.dumps(o)+"\n")

#parse json file into list for exploration
def load_jsonl(file_path):
    data = []
    with open(file_path, 'r') as file:
        for line in file:
            data.append(json.loads(line))
    return data

# Used for Llama instruct-style models
def convert_for_finetuning(json_filename, output_path):
    """Wrap each SciWorld sample in the instruction prompt and write input/output JSONL."""
    sci_data = load_jsonl(json_filename)
    json_objs = []
    prompt='''Generate the next action towards solving a task for the following observations in an interactive fiction game. The action should be the optimal action.'''
    for item in sci_data:
        # "input" is prompt + observations + an "Action:" cue; "output" is the gold action
        data = { "input": f'''{prompt} {item["input"]} \nAction: ''',
                "output": item['target'] }
        json_objs.append(data)
    write_jsonl(output_path, json_objs)
    return json_objs

# Used for GPT chat-style models
def convert_for_gpt_finetuning(json_filename, output_path):
    """Wrap each SciWorld sample in system/user/assistant messages and write JSONL."""
    sci_data = load_jsonl(json_filename)
    # Each fragment ends with a space so the concatenated sentences stay separated.
    system_message = "You are trying to solve a task in an interactive fiction game. You will be provided with the task " \
    "and observations about your location such as the room and items. You must decide what the recommended next action is " \
    "to help complete the task given the observations. You will respond with just the action."
    json_objs = []
    for item in sci_data:
        messages = [{"role": "system", "content": system_message},
                    {"role": "user", "content": item['input']},
                    {"role": "assistant", "content": item['target']}]
        json_objs.append({"messages": messages})
    write_jsonl(output_path, json_objs)
    return json_objs
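
To make the two target formats concrete, here is a minimal in-memory sketch that builds one record of each shape from a single made-up sample, without touching any files. `build_instruct_record`, `build_chat_record`, the sample text, and the shortened system message are illustrative only:

```python
import json

PROMPT = ("Generate the next action towards solving a task for the following "
          "observations in an interactive fiction game. "
          "The action should be the optimal action.")


def build_instruct_record(item, prompt=PROMPT):
    """Instruct-style record: one input string and one output string."""
    return {"input": f'{prompt} {item["input"]} \nAction: ',
            "output": item["target"]}


def build_chat_record(item, system_message):
    """Chat-style record: system, user and assistant messages."""
    return {"messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": item["input"]},
        {"role": "assistant", "content": item["target"]},
    ]}


# Made-up sample in the same {"input", "target"} layout as the SciWorld files.
sample = {
    "input": "[CLS] Your task is to boil water. [SEP] This room is called the kitchen.",
    "target": "open cupboard",
}

print(json.dumps(build_instruct_record(sample), indent=2))
print(json.dumps(build_chat_record(sample, "Respond with just the action."), indent=2))
```

The instruct record ends its input with an `Action:` cue so the model learns to complete it with the action text, while the chat record carries the same information as separate role-tagged messages.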

In [42]:
# Let's test against the test file and view a row
test_json_objs = convert_for_finetuning("./data/sciworld_formatted_test.jsonl","data/sciworld_test1_set.jsonl")
type(test_json_objs)

list

In [39]:
print(test_json_objs[1])

{'input': 'Generate the next action towards solving a task for the following observations in an interactive fiction game. The action should be the optimal action. [CLS] Your task is to boil lead. For compounds without a boiling point, combusting the substance is also acceptable. First, focus on the substance. Then, take actions that will cause it to change its state of matter. [SEP] The door is already open. [SEP] In your inventory, you see:  an orange  [SEP] This room is called the bathroom. In it, you see:   a substance called air  a sink, which is turned off. In the sink is: nothing.  a toilet. In the toilet is: A drain, which is open, a substance called water.  a picture  a glass cup (containing nothing)  the agent  a bath tub, which is turned off. In the bath tub is: nothing. You also see:  A door to the kitchen (that is open)  [SEP] This room is called the bathroom. In it, you see:   a substance called air  a sink, which is turned off. In the sink is: nothing.  a toilet. In the t

In [43]:
# Now let's process all the data files to generate datasets for finetuning
test_json_objs = convert_for_finetuning("./data/sciworld_formatted_test.jsonl", "data/sciworld_test1_set.jsonl")
train_json_objs = convert_for_finetuning("./data/sciworld_formatted_train.jsonl", "data/sciworld_train1_set.jsonl")
dev_json_objs = convert_for_finetuning("./data/sciworld_formatted_val.jsonl", "data/sciworld_val1_set.jsonl")
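
Before training, a quick structural check on the generated files can catch malformed lines early. A sketch, assuming every line should be a JSON object with non-empty string `input` and `output` fields; `validate_sft_jsonl` is a hypothetical helper, not a NeMo function:

```python
import json


def validate_sft_jsonl(path, required_keys=("input", "output")):
    """Return (number_of_records, list_of_problems) for a JSONL file."""
    problems = []
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            count += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {lineno}: not a JSON object")
                continue
            for key in required_keys:
                value = record.get(key)
                if not isinstance(value, str) or not value:
                    problems.append(f"line {lineno}: missing or empty '{key}'")
    return count, problems
```

For example, `validate_sft_jsonl("data/sciworld_train1_set.jsonl")` should return an empty problem list if the conversion succeeded.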

In [12]:
DATA_DIR = os.path.join("data")

!ls {DATA_DIR}

sciworld_formatted_test.jsonl	sciworld_train_set.jsonl.idx.info
sciworld_formatted_train.jsonl	sciworld_train_set.jsonl.idx.npy
sciworld_formatted_val.jsonl	sciworld_val_set.jsonl
sciworld_test_set.jsonl		sciworld_val_set.jsonl.idx.info
sciworld_train_set.jsonl	sciworld_val_set.jsonl.idx.npy


In [44]:
TRAIN_DS = os.path.join(DATA_DIR, "sciworld_train1_set.jsonl")
VAL_DS = os.path.join(DATA_DIR, "sciworld_val1_set.jsonl")
TEST_DS = os.path.join(DATA_DIR, "sciworld_test1_set.jsonl")

In [46]:
# TRAIN, VAL and TEST splits all follow the same structure
!head -n1 {TRAIN_DS}

{"input": "Generate the next action towards solving a task for the following observations in an interactive fiction game. The action should be the optimal action. [CLS] Your task is to boil water. For compounds without a boiling point, combusting the substance is also acceptable. First, focus on the substance. Then, take actions that will cause it to change its state of matter. [SEP] This room is called the hallway. In it, you see:   a picture  a substance called air  the agent You also see:  A door to the green house (that is open)  A door to the living room (that is open)  A door to the art studio (that is open)  A door to the kitchen (that is open)  A door to the bedroom (that is open)  A door to the workshop (that is open) [SEP] In your inventory, you see:  an orange  [SEP] This room is called the hallway. In it, you see:   a picture  a substance called air  the agent You also see:  A door to the green house (that is open)  A door to the living room (that is open)  A door to the ar
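
The `input` field packs the task and a sequence of observations into one string, with `[CLS]` before the task and `[SEP]` between segments. For exploration it can help to split it back apart. A sketch, assuming that delimiter layout holds for every record (it does in the samples shown here, but we have not verified the full dataset); `split_observation` is a hypothetical helper:

```python
def split_observation(text):
    """Split a SciWorld-formatted string into (task, [observation segments])."""
    # Keep only what follows [CLS]; this drops the instruction prompt if present.
    _, _, body = text.rpartition("[CLS]")
    body = body.strip()
    # Drop the trailing "Action:" cue added during conversion, if present.
    if body.endswith("Action:"):
        body = body[: -len("Action:")]
    parts = [p.strip() for p in body.split("[SEP]")]
    task = parts[0]
    segments = [p for p in parts[1:] if p]
    return task, segments
```

Applied to a training record, the first element is the task description and the remaining segments are the room, inventory, and other observations in the order they appear.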

### Download Llama 3.1 8B Instruct Model (NeMo format): 
We use the NeMo version of the Llama 3.1 8B Instruct model, which can be downloaded from https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/llama-3_1-8b-instruct-nemo.   

This is a large 15 GB file. You will need to create a developer account and either directly download or use wget or the NGC CLI utility to pull down the file.  

It should be placed in a directory that is mounted and accessible to the NeMo docker container.

In [45]:
#verify that you can see your model
!ls /workspace/llama-3_1-8b-instruct-nemo_v1.0

llama3_1_8b_instruct.nemo


In [9]:
# Optional: clear any cached mem-map index files (-f avoids an error if none exist)
!rm -f data/*idx*


### Perform PEFT finetuning script for LoRA

The NeMo Framework includes a high-level Python script for fine-tuning, [megatron_gpt_finetuning.py](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py), that abstracts away some of the lower-level API calls. LoRA fine-tuning with NeMo is essentially a matter of running this script and passing in the essential hyperparameters.
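
LoRA freezes the base weights and trains a pair of low-rank matrices per adapted layer, so the trainable parameter count stays small. As a rough worked check, the sketch below counts parameters for a rank-`r` adapter on the fused query/key/value projection of every layer. The Llama 3.1 8B shapes (hidden size 4096, 32 layers, 32 query heads, 8 key/value heads, head dimension 128) are the published architecture; the rank of 32 is an assumption about the default used by the script, since this notebook does not set it explicitly:

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


hidden_size = 4096
num_layers = 32
head_dim = 128
num_query_heads = 32
num_kv_heads = 8
rank = 32  # assumed adapter rank, not set in this notebook

# Fused QKV output width: all query heads plus the key and value heads.
qkv_out = head_dim * (num_query_heads + 2 * num_kv_heads)

per_layer = lora_param_count(hidden_size, qkv_out, rank)
total = per_layer * num_layers
print(f"{total:,} trainable parameters (~{total / 1e6:.1f} M)")
```

Under these assumptions the total is about 10.5 M, which is consistent with the `10.5 M    Trainable params` line and the `lora_kqv_adapter` entries in the training log.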

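This training run is capped by `max_steps`, and validation frequency is controlled by `val_check_interval`. In the run below, `val_check_interval=0.2` with `max_steps=1000` produced a validation pass roughly every 200 steps. If the validation loss does not improve after a few checks, training is halted to avoid overfitting.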
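
It helps to translate the step budget into samples and epochs. The sketch below uses the values from this run: `global_batch_size=32`, `micro_batch_size=1`, `max_steps=1000`, one GPU, and the train dataset length of 32160 reported in the log. The formulas are the usual data-parallel bookkeeping, not code taken from NeMo, and `batch_plan` is a hypothetical helper:

```python
def batch_plan(global_batch_size, micro_batch_size, data_parallel_size,
               max_steps, num_train_samples):
    """Derive gradient-accumulation and epoch figures from batch settings."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch size must be divisible by micro batch x DP size")
    samples = global_batch_size * max_steps
    return {
        "micro_batches_per_step": global_batch_size // per_pass,
        "samples_consumed": samples,
        "fraction_of_epoch": samples / num_train_samples,
    }


plan = batch_plan(global_batch_size=32, micro_batch_size=1, data_parallel_size=1,
                  max_steps=1000, num_train_samples=32160)
print(plan)
```

This gives 32 micro-batches per optimizer step, matching the `setting number of micro-batches to constant 32` log line, and 32000 samples over 1000 steps, which is just under one pass through the 32160 training examples. The logged `consumed_samples=6432.0` at step 201 is 201 × 32.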

### Note: while the code to run the script is shown here with a sample training run, it is better to run it at the command prompt than inside a Jupyter notebook, especially if you plan to run more than one epoch. 

In [None]:
%%bash

# Set paths to the model, train, validation and test sets.
MODEL="/workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TRAIN_DS="[./data/sciworld_train1_set.jsonl]"
VALID_DS="[./data/sciworld_val1_set.jsonl]"
TEST_DS="[./data/sciworld_test1_set.jsonl]"
TEST_NAMES="[data/sciworld_test1_set.jsonl]"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

OUTPUT_DIR="./results/Meta-llama3.1-8B-Instruct-sci"
# Start from a clean output directory (-f avoids an error if it does not exist yet)
rm -rf "${OUTPUT_DIR}"

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=0.2 \
    trainer.max_steps=1000 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=32 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME}

`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-11-11 21:32:30 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-11-11 21:32:30 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: bf16-mixed
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 1000
      log_every_n_steps: 10
      val_check_interval: 0.2
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: ./results/Meta-llama3.1-8B-Instruct-sci
      exp_dir: ./results/Meta-llama3.1-8B-Instruct-sci
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.v

[NeMo W 2024-11-11 21:32:30 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


[NeMo I 2024-11-11 21:32:30 exp_manager:396] ExpManager schema
[NeMo I 2024-11-11 21:32:30 exp_manager:397] {'explicit_log_dir': None, 'exp_dir': None, 'name': None, 'version': None, 'use_datetime_version': True, 'resume_if_exists': False, 'resume_past_end': False, 'resume_ignore_no_checkpoint': False, 'resume_from_checkpoint': None, 'create_tensorboard_logger': True, 'summary_writer_kwargs': None, 'create_wandb_logger': False, 'wandb_logger_kwargs': None, 'create_mlflow_logger': False, 'mlflow_logger_kwargs': {'experiment_name': None, 'tracking_uri': None, 'tags': None, 'save_dir': './mlruns', 'prefix': '', 'artifact_location': None, 'run_id': None, 'log_model': False}, 'create_dllogger_logger': False, 'dllogger_logger_kwargs': {'verbose': False, 'stdout': False, 'json_file': './dllogger.json'}, 'create_clearml_logger': False, 'clearml_logger_kwargs': {'project': None, 'task': None, 'connect_pytorch': False, 'model_name': None, 'tags': None, 'log_model': False, 'log_cfg': False, 'log_

[NeMo E 2024-11-11 21:32:30 exp_manager:830] exp_manager received explicit_log_dir: ./results/Meta-llama3.1-8B-Instruct-sci and at least one of exp_dir: ./results/Meta-llama3.1-8B-Instruct-sci, or version: None. Please note that exp_dir, name, and version will be ignored.
[NeMo W 2024-11-11 21:32:30 exp_manager:757] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :results/Meta-llama3.1-8B-Instruct-sci/checkpoints. Training from scratch.


[NeMo I 2024-11-11 21:32:30 exp_manager:455] Experiments will be logged at results/Meta-llama3.1-8B-Instruct-sci
[NeMo I 2024-11-11 21:32:30 exp_manager:983] TensorboardLogger has been set up


[NeMo W 2024-11-11 21:32:30 exp_manager:1111] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 1000. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.
[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() d

[NeMo I 2024-11-11 21:32:36 megatron_init:269] Rank 0 has data parallel group : [0]
[NeMo I 2024-11-11 21:32:36 megatron_init:275] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-11-11 21:32:36 megatron_init:280] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-11-11 21:32:36 megatron_init:283] Ranks 0 has data parallel rank: 0
[NeMo I 2024-11-11 21:32:36 megatron_init:291] Rank 0 has context parallel group: [0]
[NeMo I 2024-11-11 21:32:36 megatron_init:294] All context parallel group ranks: [[0]]
[NeMo I 2024-11-11 21:32:36 megatron_init:295] Ranks 0 has context parallel rank: 0
[NeMo I 2024-11-11 21:32:36 megatron_init:302] Rank 0 has model parallel group: [0]
[NeMo I 2024-11-11 21:32:36 megatron_init:303] All model parallel group ranks: [[0]]
[NeMo I 2024-11-11 21:32:36 megatron_init:312] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-11-11 21:32:36 megatron_init:316] All tensor model parallel group ranks: 

[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-11 21:32:36 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

[NeMo I 2024-11-11 21:32:36 tokenizer_utils:183] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2024-11-11 21:32:36 megatron_base_model:595] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


Loading distributed checkpoint with TensorStoreLoadShardedStrategy
[NeMo I 2024-11-11 21:33:02 nlp_overrides:1346] Model MegatronGPTSFTModel was successfully restored from /workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo.
[NeMo I 2024-11-11 21:33:02 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2024-11-11 21:33:02 nlp_adapter_mixins:240] Before adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 8.0 B  | train
    ------------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2024-11-11 21:33:03 nlp_adapter_mixins:245] After adding PEFT params:
      | Name  | Type          | Params | Mode 
    ------------------------------------------------
    0 | model | Float16Module | 8.0 B  | train
    ---

[NeMo W 2024-11-11 21:33:03 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-11-11 21:33:03 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-11-11 21:33:03 megatron_gpt_sft_model:801] Building GPT SFT validation datasets.
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:116] Building data files
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:525] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:495] Building indexing for fn = ./data/sciworld_val1_set.jsonl
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:507] Saving idx file = ./data/sciworld_val1_set.jsonl.idx.npy
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:509] Saving metadata file = ./data/sciworld_val1_set.jsonl.idx.info
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.075660
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:525] Processing 1 data files using 2 workers


[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.028494
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:158] Loading data files
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:249] Loading ./data/sciworld_val1_set.jsonl
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000536
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-11-11 21:33:03 megatron_gpt_sft_model:805] Length of val dataset: 92610
[NeMo I 2024-11-11 21:33:03 megatron_gpt_sft_model:812] Building GPT SFT traing datasets.
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:116] Building data files
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:525] Processing 1 data files using 2 workers


[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:495] Building indexing for fn = ./data/sciworld_train1_set.jsonl
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:507] Saving idx file = ./data/sciworld_train1_set.jsonl.idx.npy
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:509] Saving metadata file = ./data/sciworld_train1_set.jsonl.idx.info
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:535] Time building 1 / 1 mem-mapped files: 0:00:00.122710
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:525] Processing 1 data files using 2 workers


[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.032166
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:158] Loading data files
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:249] Loading ./data/sciworld_train1_set.jsonl
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000399
[NeMo I 2024-11-11 21:33:03 text_memmap_dataset:165] Computing global indices


      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-11-11 21:33:04 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.03 (sec)
[NeMo I 2024-11-11 21:33:04 megatron_gpt_sft_model:814] Length of train dataset: 32160
[NeMo I 2024-11-11 21:33:04 megatron_gpt_sft_model:819] Building dataloader with consumed samples: 0
[NeMo I 2024-11-11 21:33:04 megatron_gpt_sft_model:819] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo W 2024-11-11 21:33:04 megatron_base_model:1223] Ignoring `trainer.max_epochs` when computing `max_steps` because `trainer.max_steps` is already set to 1000.


[NeMo I 2024-11-11 21:33:04 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter


  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 8.0 B  | train
------------------------------------------------
10.5 M    Trainable params
8.0 B     Non-trainable params
8.0 B     Total params
32,162.988Total estimated model params size (MB)
[NeMo W 2024-11-11 21:33:04 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=31` in the `DataLoader` to improve performance.
    
[NeMo W 2024-11-11 21:33:04 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    


Sanity Checking: |          | 0/? [00:00<?, ?it/s][NeMo I 2024-11-11 21:33:04 num_microbatches_calculator:119] setting number of micro-batches to constant 32
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:08<00:00,  0.23it/s][NeMo I 2024-11-11 21:33:13 num_microbatches_calculator:119] setting number of micro-batches to constant 32


[NeMo W 2024-11-11 21:33:13 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-11-11 21:33:13 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('validation_loss_dataloader0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-11-11 21:33:13 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('validation_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 202

Epoch 0: :  20%|██        | 201/1000 [28:19<1:52:37, reduced_train_loss=0.126, global_step=200.0, consumed_samples=6432.0, train_step_timing in s=9.840] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-11-11 22:01:33 num_microbatches_calculator:119] setting number of micro-batches to constant 32

Validation:   0%|          | 0/2895 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/2895 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 1/2895 [00:04<3:15:00,  0.25it/s][A
Validation DataLoader 0:   0%|          | 2/2895 [00:08<3:15:00,  0.25it/s][A
Validation DataLoader 0:   0%|          | 3/2895 [00:12<3:15:15,  0.25it/s][A
Validation DataLoader 0:   0%|          | 4/2895 [00:15<3:01:14,  0.27it/s][A
Validation DataLoader 0:   0%|          | 5/2895 [00:17<2:52:44,  0.28it/s][A
Validation DataLoader 0:   0%|          | 6/2895 [00:20<2:41:51,  0.30it/s][A
Validation DataLoader 0:   0%|          | 7/2895 [00:24<2:46:36,  0.29it/s][A
Validation 

Metric val_loss improved. New best score: 0.079
Epoch 0, global step 201: 'validation_loss' reached 0.07865 (best 0.07865), saving model to '/workspace/nvdata/dragonfire/results/Meta-llama3.1-8B-Instruct-sci/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.079-step=201-consumed_samples=6432.0.ckpt' as top 1
[NeMo W 2024-11-12 00:39:13 nlp_overrides:609] DistributedCheckpointIO configured but should not be used. Reverting back to TorchCheckpointIO


Epoch 0: :  40%|████      | 402/1000 [3:34:46<5:19:29, reduced_train_loss=0.0903, global_step=401.0, consumed_samples=12864.0, train_step_timing in s=9.680, val_loss=0.0786] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-11-12 01:08:00 num_microbatches_calculator:119] setting number of micro-batches to constant 32

Validation:   0%|          | 0/2895 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/2895 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 1/2895 [00:04<3:14:49,  0.25it/s][A
Validation DataLoader 0:   0%|          | 2/2895 [00:08<3:14:46,  0.25it/s][A
Validation DataLoader 0:   0%|          | 3/2895 [00:12<3:15:08,  0.25it/s][A
Validation DataLoader 0:   0%|          | 4/2895 [00:15<3:01:07,  0.27it/s][A
Validation DataLoader 0:   0%|          | 5/2895 [00:17<2:52:37,  0.28it/s][A
Validation DataLoader 0:   0%|          | 6/2895 [00:20<2:41:46,  0.30it/s][A
Validation DataLoader 0:   0%|          | 7/2895 [00:24<2:46:30,  0.2

Metric val_loss improved by 0.021 >= min_delta = 0.001. New best score: 0.058
Epoch 0, global step 402: 'validation_loss' reached 0.05788 (best 0.05788), saving model to '/workspace/nvdata/dragonfire/results/Meta-llama3.1-8B-Instruct-sci/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.058-step=402-consumed_samples=12864.0.ckpt' as top 1


Epoch 0: :  40%|████      | 402/1000 [6:12:34<9:14:13, reduced_train_loss=0.0903, global_step=401.0, consumed_samples=12864.0, train_step_timing in s=9.680, val_loss=0.0579][NeMo I 2024-11-12 03:45:48 nlp_overrides:593] Removing checkpoint: /workspace/nvdata/dragonfire/results/Meta-llama3.1-8B-Instruct-sci/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.079-step=201-consumed_samples=6432.0.ckpt
[NeMo I 2024-11-12 03:45:48 nlp_overrides:593] Removing checkpoint: /workspace/nvdata/dragonfire/results/Meta-llama3.1-8B-Instruct-sci/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.079-step=201-consumed_samples=6432.0-last.ckpt
Epoch 0: :  60%|██████    | 603/1000 [6:40:39<4:23:46, reduced_train_loss=0.0276, global_step=602.0, consumed_samples=19296.0, train_step_timing in s=7.220, val_loss=0.0579] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-11-12 04:13:53 num_microbatches_calculator:119] setting number of micro-batches to constant 32

Validation:   

Metric val_loss improved by 0.013 >= min_delta = 0.001. New best score: 0.045
Epoch 0, global step 603: 'validation_loss' reached 0.04453 (best 0.04453), saving model to '/workspace/nvdata/dragonfire/results/Meta-llama3.1-8B-Instruct-sci/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.045-step=603-consumed_samples=19296.0.ckpt' as top 1


Epoch 0: :  60%|██████    | 603/1000 [9:18:29<6:07:41, reduced_train_loss=0.0276, global_step=602.0, consumed_samples=19296.0, train_step_timing in s=7.220, val_loss=0.0445][NeMo I 2024-11-12 06:51:43 nlp_overrides:593] Removing checkpoint: /workspace/nvdata/dragonfire/results/Meta-llama3.1-8B-Instruct-sci/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.058-step=402-consumed_samples=12864.0.ckpt
[NeMo I 2024-11-12 06:51:43 nlp_overrides:593] Removing checkpoint: /workspace/nvdata/dragonfire/results/Meta-llama3.1-8B-Instruct-sci/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=0.058-step=402-consumed_samples=12864.0-last.ckpt
Epoch 0: :  80%|████████  | 804/1000 [9:46:58<2:23:05, reduced_train_loss=0.0269, global_step=803.0, consumed_samples=25728.0, train_step_timing in s=9.500, val_loss=0.0445] 
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-11-12 07:20:12 num_microbatches_calculator:119] setting number of micro-batches to constant 32

Validation: 

This will create a LoRA adapter, a file named `megatron_gpt_peft_lora_tuning.nemo`, in `./results/Meta-llama3.1-8B-Instruct-sci/checkpoints/`. We'll use this adapter for evaluation in the next section.



### Evaluation Inference with NeMo Framework

We will check how well the model predicts the next action on a subset of the test data.

In [None]:
# Set paths to the model, train, validation and test sets.
MODEL="/workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TRAIN_DS="[./data/sciworld_train1_set.jsonl]"
VALID_DS="[./data/sciworld_val1_set.jsonl]"
TEST_DS="[./data/sciworld_test1_set.jsonl]"
TEST_NAMES="[data/sciworld_test1_set.jsonl]"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

OUTPUT_DIR="./results/Meta-llama3.1-8B-Instruct-sci"
rm -r $OUTPUT_DIR

In [50]:
# Check that the LORA model file exists
!ls -l ./results/Meta-llama3.1-8B-Instruct-sci/checkpoints

total 307500
-rw-r--r-- 1 root root 146928238 Nov 13 01:09 'megatron_gpt_peft_lora_tuning--validation_loss=0.039-step=1000-consumed_samples=32000.0-last.ckpt'
-rw-r--r-- 1 root root 146928238 Nov 13 00:42 'megatron_gpt_peft_lora_tuning--validation_loss=0.039-step=804-consumed_samples=25728.0.ckpt'
-rw-r--r-- 1 root root  21012480 Nov 13 01:09  megatron_gpt_peft_lora_tuning.nemo
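
As a rough sanity check on the listing above, the adapter file size can be compared with the trainable-parameter count that NeMo reports in the generation log further down ("10.5 M Trainable params" out of 8.0 B). The sketch below only does arithmetic on those two published numbers. Reading the result as "about 2 bytes per parameter", i.e. 16-bit weights, is an assumption, and the `.nemo` archive also holds some configuration, so the figure is approximate.

```python
# Numbers copied from this notebook's output; the interpretation is an assumption.
adapter_file_bytes = 21_012_480   # size of megatron_gpt_peft_lora_tuning.nemo (ls -l)
trainable_params = 10.5e6         # "10.5 M Trainable params" (rounded in the log)
total_params = 8.0e9              # "8.0 B Total params" (rounded in the log)

# Approximate storage per trainable (LoRA) parameter.
bytes_per_param = adapter_file_bytes / trainable_params

# Share of the model that LoRA actually trains.
trainable_fraction = trainable_params / total_params

print(f"{bytes_per_param:.2f} bytes per trainable parameter")
print(f"{trainable_fraction:.2%} of parameters are trainable")
```

The two intermediate `.ckpt` files are much larger (about 147 MB each); the listing alone does not say why, so no explanation is attempted here.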


In the code snippet below, the following configurations are worth noting:

1. `model.restore_from_path` is set to the path of the base Llama 3.1 8B Instruct `.nemo` file (`llama3_1_8b_instruct.nemo`).
2. `model.peft.restore_from_path` is set to the path of the PEFT checkpoint created by the fine-tuning run in the previous step.
3. `model.data.test_ds.file_names` is set to the path of the preprocessed test file.
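
Before launching generation, it can help to confirm that every record in the test file carries the fields the script will look for. The sketch below assumes the fields are named `input` and `output`, matching the prompt template and `label_key` used in the generation command. `find_bad_records` is a hypothetical helper, not part of NeMo, and the sample records are invented for illustration.

```python
import json
from typing import Iterable

# Assumed field names, matching the prompt template "{input} {output}".
REQUIRED_KEYS = ("input", "output")


def find_bad_records(lines: Iterable[str], required=REQUIRED_KEYS) -> list[int]:
    """Return 1-based line numbers that are not JSON objects or lack a required key."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
            continue
        if not isinstance(record, dict) or any(k not in record for k in required):
            bad.append(lineno)
    return bad


# Invented sample lines: one valid, one missing a field, one not JSON.
sample = [
    '{"input": "Your task is to boil lead. [SEP] ...", "output": "look around"}',
    '{"input": "missing the output field"}',
    'not json',
]
print(find_bad_records(sample))  # [2, 3]
```

To check the real subset, pass an open file handle instead of `sample`, for example `find_bad_records(open("./data/sciworld_test1_set-n128.jsonl"))`.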

In [51]:
# Create a smaller test subset for a quick eval demonstration.

!head -n 128 ./data/sciworld_test1_set.jsonl > ./data/sciworld_test1_set-n128.jsonl

In [55]:
%%bash
MODEL="/workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TEST_DS="[./data/sciworld_test1_set-n128.jsonl]" # Smaller test split
# TEST_DS="[./data/sciworld_test1_set.jsonl]" # Full test set
TEST_NAMES="[sci]"

TP_SIZE=1
PP_SIZE=1

# This is where your LoRA checkpoint was saved
PATH_TO_TRAINED_MODEL="results/Meta-llama3.1-8B-Instruct-sci/checkpoints/megatron_gpt_peft_lora_tuning.nemo"

# The generation run will save the generated outputs over the test dataset in a file prefixed like so
OUTPUT_PREFIX="sciworld_lora"

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${MODEL} \
    model.peft.restore_from_path=${PATH_TO_TRAINED_MODEL} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    model.data.test_ds.file_names=${TEST_DS} \
    model.data.test_ds.names=${TEST_NAMES} \
    model.data.test_ds.global_batch_size=32 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=25 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    inference.greedy=True  \
    model.data.test_ds.output_file_path_prefix=${OUTPUT_PREFIX} \
    model.data.test_ds.write_predictions_to_file=True \
    model.data.test_ds.truncation_field="null" \
    model.data.test_ds.add_bos=False \
    model.data.test_ds.add_eos=True \
    model.data.test_ds.add_sep=False \
    model.data.test_ds.label_key="output" \
    model.data.test_ds.prompt_template="\{input\}\ \{output\}"

`zarr` distributed checkpoint backend is deprecated. Please switch to PyTorch Distributed format (`torch_dist`).
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-11-13 03:20:24 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-11-13 03:20:24 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 1
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo_o

[NeMo W 2024-11-13 03:20:24 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-11-13 03:20:29 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:29 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:29 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it 

[NeMo I 2024-11-13 03:20:29 megatron_init:269] Rank 0 has data parallel group : [0]
[NeMo I 2024-11-13 03:20:29 megatron_init:275] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-11-13 03:20:29 megatron_init:280] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2024-11-13 03:20:29 megatron_init:283] Ranks 0 has data parallel rank: 0
[NeMo I 2024-11-13 03:20:29 megatron_init:291] Rank 0 has context parallel group: [0]
[NeMo I 2024-11-13 03:20:29 megatron_init:294] All context parallel group ranks: [[0]]
[NeMo I 2024-11-13 03:20:29 megatron_init:295] Ranks 0 has context parallel rank: 0
[NeMo I 2024-11-13 03:20:29 megatron_init:302] Rank 0 has model parallel group: [0]
[NeMo I 2024-11-13 03:20:29 megatron_init:303] All model parallel group ranks: [[0]]
[NeMo I 2024-11-13 03:20:29 megatron_init:312] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-11-13 03:20:29 megatron_init:316] All tensor model parallel group ranks: 

[NeMo W 2024-11-13 03:20:29 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:29 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:29 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:29 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministic_mode in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:29 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_trac

[NeMo I 2024-11-13 03:20:29 tokenizer_utils:183] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3-8B


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[NeMo I 2024-11-13 03:20:30 megatron_base_model:595] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-11-13 03:20:30 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:30 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:30 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:30 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-11-13 03:20:30 megatron_base_model:1182] The model: MegatronGPTSFTModel() does not have field.name: deterministi

Loading distributed checkpoint with TensorStoreLoadShardedStrategy
[NeMo I 2024-11-13 03:20:59 nlp_overrides:1346] Model MegatronGPTSFTModel was successfully restored from /workspace/llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo.
[NeMo I 2024-11-13 03:20:59 nlp_adapter_mixins:240] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
[NeMo I 2024-11-13 03:21:00 nlp_adapter_mixins:245] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    10.5 M    Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
  

[NeMo W 2024-11-13 03:21:00 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-11-13 03:21:00 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-11-13 03:21:00 megatron_gpt_sft_model:793] Building GPT SFT test datasets.
[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:116] Building data files
[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:525] Processing 1 data files using 16 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.174362
[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:525] Processing 1 data files using 16 workers



[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:535] Time building 0 / 1 mem-mapped files: 0:00:00.171512
[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:158] Loading data files
[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:249] Loading ./data/sciworld_test1_set-n128.jsonl
[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000458
[NeMo I 2024-11-13 03:21:00 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-11-13 03:21:00 megatron_gpt_sft_model:796] Length of test dataset: 128
[NeMo I 2024-11-13 03:21:00 megatron_gpt_sft_model:819] Building dataloader with consumed samples: 0


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo W 2024-11-13 03:21:00 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=31` in the `DataLoader` to improve performance.
    
[NeMo W 2024-11-13 03:21:00 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:149: Found `dataloader_iter` argument in the `test_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    


Testing: |          | 0/? [00:00<?, ?it/s]setting number of micro-batches to constant 32


      input_info_tensor = torch.cuda.FloatTensor(input_info)
    
      string_tensor = torch.as_tensor(
    


Testing DataLoader 0:   0%|          | 0/4 [00:00<?, ?it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 32
Testing DataLoader 0:  25%|██▌       | 1/4 [01:05<03:17,  0.02it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 32
Testing DataLoader 0:  50%|█████     | 2/4 [01:35<01:35,  0.02it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 32
Testing DataLoader 0:  75%|███████▌  | 3/4 [02:42<00:54,  0.02it/s]setting number of micro-batches to constant 1
setting number of micro-batches to constant 32
Testing DataLoader 0: 100%|██████████| 4/4 [03:10<00:00,  0.02it/s][NeMo I 2024-11-13 03:24:11 megatron_gpt_sft_model:551] Total deduplicated inference data size: 128 to 128
[NeMo I 2024-11-13 03:24:11 megatron_gpt_sft_model:702] Predictions saved to sciworld_lora_test_sci_inputs_preds_labels.jsonl


[NeMo W 2024-11-13 03:24:11 megatron_gpt_sft_model:642] No training data found, reconfiguring microbatches based on validation batch sizes.


setting number of micro-batches to constant 32


[NeMo W 2024-11-13 03:24:11 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-11-13 03:24:11 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss_sci', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-11-13 03:24:11 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:439: It is recommended to use `self.log('test_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    


Testing DataLoader 0: 100%|██████████| 4/4 [03:10<00:00,  0.02it/s]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│         test_loss         │     0.258175790309906     │
│       test_loss_sci       │     0.258175790309906     │
│         val_loss          │     0.258175790309906     │
└───────────────────────────┴───────────────────────────┘
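
The log above reports 4 test batches and "setting number of micro-batches to constant 32". Both figures follow from the batch sizes passed to the generation script (128 samples, `global_batch_size=32`, `micro_batch_size=1`, one GPU). The sketch below reproduces that arithmetic. `batch_plan` and its `data_parallel_size` parameter are hypothetical names for illustration, and rounding the step count up is an assumption that does not matter here because 128 divides evenly by 32.

```python
import math


def batch_plan(num_samples: int, global_batch_size: int,
               micro_batch_size: int, data_parallel_size: int = 1) -> tuple[int, int]:
    """Return (micro-batches per global batch, number of global batches)."""
    # Each global batch is split into micro-batches on every data-parallel rank.
    num_micro_batches = global_batch_size // (micro_batch_size * data_parallel_size)
    # Number of global batches needed to cover the dataset (rounded up; an assumption).
    num_steps = math.ceil(num_samples / global_batch_size)
    return num_micro_batches, num_steps


print(batch_plan(num_samples=128, global_batch_size=32, micro_batch_size=1))  # (32, 4)
```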


### Step 4: Check the model accuracy

Now that the results are in, let's read them and calculate the accuracy on the next-action generation task.
Let's take a look at one of the predictions in the generated output file. The `pred` key indicates what was generated.

In [56]:
# Take a look at predictions
!head -n1  sciworld_lora_test_sci_inputs_preds_labels.jsonl

{"input": "Generate the next action towards solving a task for the following observations in an interactive fiction game. The action should be the optimal action. [CLS] Your task is to boil lead. For compounds without a boiling point, combusting the substance is also acceptable. First, focus on the substance. Then, take actions that will cause it to change its state of matter. [SEP] This room is called the bathroom. In it, you see:   a substance called air  a sink, which is turned off. In the sink is: nothing.  a toilet. In the toilet is: A drain, which is open, a substance called water.  a picture  a glass cup (containing nothing)  the agent  a bath tub, which is turned off. In the bath tub is: nothing. You also see:  A door to the kitchen (that is open) [SEP] In your inventory, you see:  an orange  [SEP] This room is called the bathroom. In it, you see:   a substance called air  a sink, which is turned off. In the sink is: nothing.  a toilet. In the toilet is: A drain, which is open,

The predictions for the subset of the test dataset are written to the `sciworld_lora_test_sci_inputs_preds_labels.jsonl` file. Each record ends with a prediction (`pred`) and a label (`label`). The prediction is the action the LLM generated for the given game state, and the label is the ground-truth action. In the row above, the prediction matches the label.
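
Because the actions are short strings, a strict exact-match rate is a useful companion to overlap-based metrics. The following is a minimal sketch, assuming each record has the `pred` and `label` keys described above. `exact_match_accuracy` and `load_jsonl` are hypothetical helpers, the whitespace and case normalization is a design choice rather than anything NeMo does, and the demo records are invented.

```python
import json


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def exact_match_accuracy(records) -> float:
    """Fraction of records whose normalized `pred` equals the normalized `label`."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        r["pred"].strip().lower() == r["label"].strip().lower() for r in records
    )
    return hits / len(records)


# Invented records for illustration only.
demo = [
    {"pred": "open door to kitchen", "label": "open door to kitchen"},
    {"pred": "look around ", "label": "look around"},
    {"pred": "go to kitchen", "label": "go to hallway"},
    {"pred": "focus on lead", "label": "pick up lead"},
]
print(exact_match_accuracy(demo))  # 0.5
```

On the real output this would be called as `exact_match_accuracy(load_jsonl("sciworld_lora_test_sci_inputs_preds_labels.jsonl"))`.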

To evaluate this task, we will use [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)). It measures the overlap of n-grams between the prediction and the reference, and a higher score is better. While it is not perfect and does not capture the semantics of the prediction, it is a popular metric in academia and industry for evaluating such systems.
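
To make the n-gram overlap concrete, here is a minimal hand-rolled sketch of the ROUGE-1 F-measure. It assumes plain whitespace tokenization and lowercasing, with no stemming, so its values can differ slightly from those of a full ROUGE implementation. `rouge1_f` is a hypothetical helper and the example strings are invented.

```python
from collections import Counter


def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram-overlap F-measure between a reference and a prediction."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Three of four words match, so precision = recall = 0.75.
print(rouge1_f("open door to kitchen", "open door to hallway"))  # 0.75
```

ROUGE-2 applies the same idea to pairs of adjacent words, and ROUGE-L uses the longest common subsequence instead of fixed-size n-grams.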

The following method uses the `rouge_score` library to implement scoring. It will report `ROUGE_{1/2/L/Lsum}` metrics.

In [57]:
import json

import numpy as np
from rouge_score import rouge_scorer, scoring


def compute_rouge(input_file: str) -> dict:
    """Print and return bootstrap-aggregated ROUGE F-measures (scaled to 0-100)."""
    ROUGE_KEYS = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=True)
    aggregator = scoring.BootstrapAggregator()
    # Each line of the file is a JSON record with `input`, `pred` and `label` keys.
    with open(input_file) as f:
        lines = [json.loads(line) for line in f]
    num_response_words = []
    num_ref_words = []
    for line in lines:
        response = line['pred']   # model prediction
        answer = line['label']    # ground-truth action
        # RougeScorer.score expects (target, prediction). The F-measure reported
        # below is symmetric, but precision and recall depend on this order.
        scores = scorer.score(answer, response)
        aggregator.add_scores(scores)
        num_response_words.append(len(response.split()))
        num_ref_words.append(len(answer.split()))

    result = aggregator.aggregate()
    rouge_scores = {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}
    print(rouge_scores)
    print(f"Average and stddev of response length: {np.mean(num_response_words):.2f}, {np.std(num_response_words):.2f}")
    print(f"Average and stddev of ref length: {np.mean(num_ref_words):.2f}, {np.std(num_ref_words):.2f}")

    return rouge_scores

In [58]:
compute_rouge("./sciworld_lora_test_sci_inputs_preds_labels.jsonl")

{'rouge1': 78.2972, 'rouge2': 65.9794, 'rougeL': 78.3603, 'rougeLsum': 78.2974}
Average and stddev of response length: 4.92, 1.89
Average and stddev of ref length: 5.20, 1.81


{'rouge1': 78.2972, 'rouge2': 65.9794, 'rougeL': 78.3603, 'rougeLsum': 78.2974}

#### See the inferencing notebook for deploying the model with the fine-tuned weights