# Step 3: Training a LoRA Adapter

This notebook performs LoRA fine-tuning of the base model obtained in step 2, using the dataset that we curated in step 1.

## Setup and Requirements
Before proceeding, please ensure you have completed the notebooks for steps 1 and 2. You will need to install one dependency to follow along. Execute the following cell before getting started.

In [None]:
! pip install ipywidgets

Let's also specify the base model name that we will use for fine-tuning. This should be the same model you downloaded/converted in step 2.

In [None]:
model_to_use = "google/gemma-2-2b"

---
# Sanity Checking

Let's do a quick sanity check to ensure we have all the pieces needed before moving forward.

In [None]:
import os

model_name = model_to_use.split('/')[-1].lower()

# The path to the model checkpoint, and also the data directory containing the training, validation, and test data.
nemo_model_fp = os.path.abspath(f"models/{model_name}.nemo")
data_dir = "data/split"

# The directory where the results will be stored.
result_dir = os.path.abspath("results")
os.makedirs(result_dir, exist_ok=True)

# Sanity checks
assert os.path.exists(nemo_model_fp), f"The model checkpoint at '{nemo_model_fp}' does not exist. Please ensure the model was downloaded successfully."
assert os.path.exists(data_dir), f"The data directory '{data_dir}' does not exist. Please ensure the data was prepared successfully."

train_fp = os.path.abspath(f"{data_dir}/train.jsonl")
val_fp = os.path.abspath(f"{data_dir}/val.jsonl")

# Sanity checks
assert os.path.exists(train_fp), f"The training data at '{train_fp}' does not exist. Please ensure the data was prepared successfully."
assert os.path.exists(val_fp), f"The validation data at '{val_fp}' does not exist. Please ensure the data was prepared successfully."

#
# Set the environment variables (needed for executing the next cell)
#
%env BASE_MODEL=$nemo_model_fp
%env DATA_DIR=$data_dir
%env TRAIN_DS=$train_fp
%env VAL_DS=$val_fp
%env RESULT_DIR=$result_dir

print(f"\n{'#'*80}")
print("All checks passed. You are ready to go!")
print(f"    Base model file: {nemo_model_fp}")
print(f"    Data directory: {data_dir}")
print(f"    Results: {result_dir}")
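The checks above only confirm that the data files exist. As an optional extra, here is a minimal sketch that also confirms every non-blank line parses as a JSON object and reports which fields appear. The helper name `summarize_jsonl` is hypothetical (it is not part of NeMo), and the sketch makes no assumption about which field names your step 1 data uses.

```python
import json


def summarize_jsonl(path):
    """Return (record_count, sorted_field_names) for a JSON Lines file.

    Hypothetical helper for illustration; raises ValueError on the first
    line that is not a valid JSON object. Blank lines are skipped.
    """
    count = 0
    fields = set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no} is not valid JSON: {err}") from err
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no} is not a JSON object")
            count += 1
            fields.update(record)
    return count, sorted(fields)


# Example usage in this notebook (uncomment after running the cell above):
# print(summarize_jsonl(train_fp))
# print(summarize_jsonl(val_fp))
```

If the training and validation files report different field names, that is worth investigating before starting a long training run.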

---
# Model Training

With all the sanity checks passing, it is time to start model training.

> NOTE: Running the following cell will remove any previously trained model!

In [None]:
%%bash

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

# Clear any cached mem-map index files (-f: no error if none exist)
rm -f "$DATA_DIR"/*idx*
# Clean up prior results (-f: no error if the directory is missing)
rm -rf "$RESULT_DIR"

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${RESULT_DIR} \
    exp_manager.explicit_log_dir=${RESULT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16 \
    trainer.val_check_interval=200 \
    trainer.max_steps=1000 \
    trainer.gradient_clip_val=0.3 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=10 \
    model.peft.lora_tuning.adapter_dim=32 \
    model.peft.lora_tuning.adapter_dropout=0.1 \
    model.restore_from_path=${BASE_MODEL} \
    model.data.train_ds.num_workers=0 \
    model.data.train_ds.add_bos=True \
    model.data.validation_ds.num_workers=0 \
    model.data.train_ds.file_names=[${TRAIN_DS}] \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=[${VAL_DS}] \
    model.peft.peft_scheme=${SCHEME}
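To make the batch and adapter settings in the training command more concrete, here is a rough sketch of the arithmetic behind them. It assumes the usual Megatron-style relationship (global batch size = micro batch size × data-parallel size × gradient accumulation steps) and that `trainer.val_check_interval` counts optimizer steps. The function names and the 2048 × 2048 layer shape are hypothetical and chosen purely for illustration; they are not the actual dimensions of the base model.

```python
def grad_accumulation_steps(global_batch_size, micro_batch_size,
                            devices=1, num_nodes=1, tp_size=1, pp_size=1):
    """Micro-batches accumulated per optimizer step on each data-parallel rank."""
    world_size = devices * num_nodes
    model_parallel_size = tp_size * pp_size
    if world_size % model_parallel_size != 0:
        raise ValueError("world size must be divisible by tp_size * pp_size")
    data_parallel_size = world_size // model_parallel_size
    samples_per_micro_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            "global_batch_size must be divisible by "
            "micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_micro_step


def lora_param_count(d_in, d_out, adapter_dim):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix.

    LoRA learns two low-rank factors of shapes (d_in, r) and (r, d_out).
    """
    return adapter_dim * (d_in + d_out)


# Values taken from the training command above.
accum = grad_accumulation_steps(global_batch_size=10, micro_batch_size=1)
samples_seen = 1000 * 10      # max_steps * global_batch_size
validation_runs = 1000 // 200  # max_steps // val_check_interval
print(f"gradient accumulation steps: {accum}")
print(f"training samples consumed:   {samples_seen}")
print(f"validation runs:             {validation_runs}")

# Hypothetical 2048 x 2048 projection, for illustration only.
full = 2048 * 2048
added = lora_param_count(2048, 2048, adapter_dim=32)
print(f"LoRA adds {added:,} trainable parameters next to {full:,} frozen ones")
```

With a single GPU and `micro_batch_size=1`, each optimizer step accumulates gradients over 10 micro-batches, so raising `global_batch_size` lengthens each step rather than increasing memory use.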

---
# Inference and Submission


To make a submission, run inference with your model on the test dataset at `data/split/submission.jsonl`.

> NOTE: This dataset was generated as part of Step 1. Please ensure it exists before proceeding.

To do this, set the variable pointing to your submission data file in the cell below, then execute the final cell.

The inference results will be written under the `results/inference` folder.

In [None]:
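# Path to the submission (test) data generated in Step 1.
test_fp = os.path.abspath(f"{data_dir}/submission.jsonl")
# To run inference on a different file, override the path here, e.g.:
# test_fp = 'ODSC-Hackathon-Repository/data/submission2.jsonl'
assert os.path.exists(test_fp), f"The submission data at '{test_fp}' does not exist. Please ensure the data was prepared successfully."

# The LoRA adapter checkpoint produced by the training cell above.
adapter_fp = f"{result_dir}/checkpoints/megatron_gpt_peft_lora_tuning.nemo"
assert os.path.exists(adapter_fp), f"The trained adapter at '{adapter_fp}' does not exist. Please ensure training completed successfully."
os.makedirs(f"{result_dir}/inference", exist_ok=True)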

print(f"Inference set: {test_fp}")
print(f"Trained adapter: {adapter_fp}")
test_filename = os.path.basename(test_fp)


%env TEST_DS=$test_fp
%env TEST_FP=$test_filename
%env TRAINED_ADAPTER=$adapter_fp

In [None]:
%%bash

# This is where the inference results will be stored.
OUTPUT_DIR="results/inference/infer-$TEST_FP"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

# Clear any cached mem-map index files (-f: no error if none exist)
rm -f "$DATA_DIR"/*idx*

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
    model.restore_from_path=${BASE_MODEL} \
    model.peft.restore_from_path=${TRAINED_ADAPTER} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    inference.greedy=True \
    model.data.test_ds.file_names=[${TEST_DS}] \
    model.data.test_ds.names=["infer"] \
    model.data.test_ds.global_batch_size=16 \
    model.data.test_ds.micro_batch_size=1 \
    model.data.test_ds.tokens_to_generate=32 \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.data.test_ds.output_file_path_prefix=$OUTPUT_DIR \
    model.data.test_ds.write_predictions_to_file=True

The predictions file will be written under `results/inference`. Please send us this file for your final submission.

Let's inspect a couple of lines from that file for sanity checking:

In [None]:
! head -n 2 results/inference/infer-submission.jsonl_test_infer_inputs_preds_labels.jsonl
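Beyond eyeballing the first two lines, it can be worth checking that the predictions file has as many records as the submission file before sending it. The sketch below assumes that the generation script writes exactly one line per input record, which you should confirm for your setup. The helper names `count_records` and `check_prediction_count` are hypothetical.

```python
def count_records(path):
    """Count the non-blank lines (records) in a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_prediction_count(test_path, preds_path):
    """Return the record count if both files agree, else raise ValueError.

    Assumes one prediction line per input record (an assumption, not a
    guarantee made by the notebook).
    """
    expected = count_records(test_path)
    actual = count_records(preds_path)
    if expected != actual:
        raise ValueError(
            f"Expected {expected} predictions but found {actual} in '{preds_path}'."
        )
    return actual


# Example usage in this notebook (uncomment after inference has finished):
# preds_fp = "results/inference/infer-submission.jsonl_test_infer_inputs_preds_labels.jsonl"
# print(check_prediction_count(test_fp, preds_fp), "predictions found")
```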

---
# Freeing Memory and Other Resources

As always, it is a good idea to free up all allocated resources when you are done. Please execute the following cell to do so.

Alternatively, restart the kernel by navigating to `Kernel > Restart Kernel` (if using Jupyter Notebook), or by clicking the `Restart` button in VS Code.

In [None]:
exit(0)