# Environment Setup
Please check the [wandb report](https://api.wandb.ai/links/wandb-healthcare/jm7scchv) for details on setting up the environment.



## Weights and Biases setup
The progress and charts of model training can be visualized through Weights and Biases (wandb). Please refer to the above report for instructions on setting up wandb. Below, you will set up the entity and project, which are the storage locations for wandb. Please enter any values you wish.


In [None]:
import os
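# Uncomment and set your wandb entity (user or team name); later commands read it via ${oc.env:WANDB_ENTITY}.
#os.environ["WANDB_ENTITY"]=""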
os.environ["WANDB_PROJECT"]="BioNeMo_protein_LLM_pretraining"

## Confirmation of GPU

In [None]:
!nvidia-smi

# Data Preprocessing

ESM-2nv is trained on two datasets. UniRef50 is split into training, validation, and test sets. To increase the size and diversity of the training data, minibatches are sampled from UniRef50, and each sequence in the batch is then replaced with a sequence drawn from the corresponding UniRef90 cluster. For details, refer to "Language models of protein sequences at the scale of evolution enable accurate structure prediction." UniRef90 is used only during training, not for validation or testing.
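
The cluster-based sampling described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not BioNeMo's actual data loader: the function name `sample_training_batch` and the toy cluster IDs are hypothetical.

```python
import random

def sample_training_batch(uf50_ids, cluster_members, batch_size, rng):
    """Pick UniRef50 clusters, then swap each for a random UniRef90 member."""
    # Step 1: sample a minibatch of UniRef50 cluster IDs without replacement.
    batch_clusters = rng.sample(uf50_ids, batch_size)
    # Step 2: replace every cluster ID with one member of its UniRef90 cluster.
    return [rng.choice(cluster_members[cluster]) for cluster in batch_clusters]

# Hypothetical toy mapping from UniRef50 clusters to UniRef90 members.
cluster_members = {
    "UniRef50_A": ["UniRef90_A1", "UniRef90_A2"],
    "UniRef50_B": ["UniRef90_B1"],
    "UniRef50_C": ["UniRef90_C1", "UniRef90_C2", "UniRef90_C3"],
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch = sample_training_batch(sorted(cluster_members), cluster_members, 2, rng)
print(batch)
```

Because the replacement is drawn anew each time a cluster is sampled, the model can see different UniRef90 members of the same cluster across epochs.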

## Get the original dataset

Before preprocessing the data, please decompress the sample data using the following command.

In [None]:
cd /workspace/bionemo/examples/tests/test_data/

In [None]:
!unzip uniref202104_esm2_qc_test200_val200.zip

In [None]:
import wandb
with wandb.init(name="data_upload") as run:
    artifact = wandb.Artifact(
        name="uniref202104_esm2_qc_test200_val200",
        type="dataset",
        description="uniref202104_esm2_qc_test200_val200 from BioNeMo examples",
        metadata={"path":"/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200.zip"},
    )
    artifact.add_dir("/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200")
    run.log_artifact(artifact)
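
The artifact logged above is later referenced by a string of the form `entity/project/name:version`, where the first logged version is `v0`. As a minimal sketch, the helper below builds that string from the environment variables set earlier; `artifact_ref` is a hypothetical helper and `my-team` is a placeholder entity, neither is part of wandb or BioNeMo.

```python
import os

def artifact_ref(name, version="v0", entity=None, project=None):
    """Build an 'entity/project/name:version' artifact reference."""
    # Fall back to the environment variables used throughout this notebook.
    entity = entity if entity is not None else os.environ.get("WANDB_ENTITY", "")
    project = project if project is not None else os.environ.get("WANDB_PROJECT", "")
    if not entity or not project:
        raise ValueError("Set WANDB_ENTITY and WANDB_PROJECT before building a reference")
    return f"{entity}/{project}/{name}:{version}"

ref = artifact_ref(
    "uniref202104_esm2_qc_test200_val200",
    entity="my-team",  # placeholder entity for illustration
    project="BioNeMo_protein_LLM_pretraining",
)
print(ref)
```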

Next, you will need to set the data path before running the code located below to preprocess the data.

/workspace/bionemo/examples/protein/esm2nv/pretrain.py

To correctly set the data path, you can modify the YAML configuration file below, or update these parameters as part of the command when preprocessing data or training the model.

/workspace/bionemo/examples/protein/esm2nv/conf/base_config.yaml


Then, you will preprocess the data using the following command.

In [None]:
cd /workspace/bionemo

In [None]:
!python examples/protein/esm2nv/pretrain.py\
 --config-path=conf\
 --config-name=pretrain_esm2_650M\
 ++do_training=False\
 ++exp_manager.wandb_logger_kwargs.name='preproceed_data_upload'\
 ++exp_manager.wandb_logger_kwargs.job_type='data_upload'\
 ++wandb_artifacts.wandb_use_artifact_path='${oc.env:WANDB_ENTITY}/${oc.env:WANDB_PROJECT}/uniref202104_esm2_qc_test200_val200:v0'\
 ++wandb_artifacts.wandb_log_artifact_name='uniref202104_esm2_qc_test200_val200_preprocessed'\
 ++model.data.val_size=500\
 ++model.data.test_size=100\
 ++model.data.uf50_datapath=/uniref50_train_filt.fasta\
 ++model.data.uf90_datapath=/ur90_ur50_sampler.fasta\
 ++model.data.cluster_mapping_tsv=/mapping.tsv\
 ++model.data.dataset_path=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uf50\
 ++model.data.uf90.uniref90_path=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uf90\
 ++model.data.train.uf50_datapath=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uniref50_train_filt.fasta\
 ++model.data.train.uf90_datapath=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/ur90_ur50_sampler.fasta\
 ++model.data.train.cluster_mapping_tsv=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/mapping.tsv\
 ++model.data.val.uf50_datapath=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uniref50_train_filt.fasta\
 ++model.data.test.uf50_datapath=/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uniref50_train_filt.fasta

Parameters starting with -- are passed as command line arguments to pretrain.py. For example, config-path and config-name specify the folder and YAML file name of the configuration file. This path is relative to pretrain.py. conf refers to examples/protein/esm2nv/conf, and pretrain_esm2_650M refers to examples/protein/esm2nv/conf/pretrain_esm2_650M.yaml.

Parameters starting with ++ can be set in the YAML file. For instance, in the pretrain_esm2_650M.yaml inherited from base_config.yaml, you can find the following parameters:

* `do_training`: Set to False to only perform data preprocessing without training.
* `exp_manager.wandb_logger_kwargs.name`: Specifies the run name for wandb.
* `exp_manager.wandb_logger_kwargs.job_type`: Specifies the job type for wandb (job type helps organize run information later; it does not affect the implementation).
* `wandb_artifacts.wandb_use_artifact_path`: Sets the path for the wandb artifacts of the data being preprocessed.
* `wandb_artifacts.wandb_log_artifact_name`: Specifies the name of the wandb artifacts for saving the preprocessed data.
* `model.data.val_size`, `model.data.test_size`: Specify the sizes of the validation and test datasets.
* `model.data.uf50_datapath`: Specifies the path to the uniref50 fasta file.
* `model.data.uf90_datapath`: Specifies the path to the uniref90 fasta file.
* `model.data.cluster_mapping_tsv`: Specifies the path to the file that maps uniref50 clusters to uniref90 sequences.
* `model.data.dataset_path`: Specifies the output directory path for the preprocessed uniref50 data, which will include splits for training, validation, and testing.
* `model.data.uf90.uniref90_path`: Specifies the output directory path for the preprocessed uniref90 data, which contains only the folder u90_csvs and does not include splits for training/testing/validation as uniref90 is used only for training.
* `model.data.train.uf50_datapath`: Specifies the path to the uniref50 fasta file.
* `model.data.train.uf90_datapath`: Specifies the path to the uniref90 fasta file.
* `model.data.train.cluster_mapping_tsv`: Specifies the path to the file that maps uniref50 clusters to uniref90 sequences.
* `model.data.val.uf50_datapath`: Specifies the path to the uniref50 fasta file.
* `model.data.test.uf50_datapath`: Specifies the path to the uniref50 fasta file.
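
To make the effect of the `++` overrides concrete, here is a simplified sketch of how a dotted key such as `model.data.val_size` maps onto a nested configuration. In Hydra, the `++` prefix means "override the key if it exists, otherwise add it". The function `apply_override` is a hypothetical stand-in that keeps every value as a string; Hydra itself also parses types and resolves interpolations.

```python
def apply_override(config, override):
    """Apply one '++a.b.c=value' style override to a nested dict in place."""
    key_path, _, raw_value = override.lstrip("+").partition("=")
    keys = key_path.split(".")
    node = config
    # Walk (or create) the intermediate levels of the nested config.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    # Override the leaf if present, otherwise add it.
    node[keys[-1]] = raw_value
    return config

# Toy config standing in for the YAML file; values are illustrative only.
config = {"do_training": "True", "model": {"data": {"val_size": "200"}}}
for item in [
    "++do_training=False",
    "++model.data.val_size=500",
    "++model.data.train.uf50_datapath=/data/train.fasta",
]:
    apply_override(config, item)
print(config)
```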


You can also directly modify the YAML file instead of overriding the arguments through the command line. Once processing is complete, the preprocessed data will be located in /workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uf50/uf50/.

If you want to use your own data for pretraining, fine-tuning, or inference, specify the path as /workspace/bionemo/mydata/. However, you must align the structure and format of your data with the sample data.

# Pretraining

Now that the preprocessing of the UniRef50 and UniRef90 subsets is complete, you can begin pretraining the ESM-2nv model. The BioNeMo Framework provides checkpoints for pretrained models such as ESM-1nv, ESM-2nv (650M, 3B, 15B), ProtT5nv, and MegaMolBART. The weights for these models can be downloaded from NVIDIA's NGC. If you already have a model checkpoint, you can also resume pretraining from that checkpoint.

You are now ready to start pretraining the model with the following command; its parameters are explained after it:

In [None]:
cd /workspace/bionemo

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
!python examples/protein/esm2nv/pretrain.py \
 --config-path=conf \
 --config-name=pretrain_esm2_650M \
 ++do_training=True \
 ++do_testing=True \
 ++wandb_artifacts.wandb_use_artifact_path='${oc.env:WANDB_ENTITY}/${oc.env:WANDB_PROJECT}/uniref202104_esm2_qc_test200_val200_preprocessed:v0'\
 ++model.data.dataset_path=/ \
 ++model.data.uf90.uniref90_path=/uf90 \
 ++trainer.devices=1 \
 ++model.tensor_model_parallel_size=1 \
 ++model.micro_batch_size=8 \
 ++trainer.max_steps=10 \
 ++trainer.val_check_interval=1 \
 ++exp_manager.create_wandb_logger=True \
 ++exp_manager.checkpoint_callback_params.save_top_k=10

Explanation of Parameters:
* `do_training`: Set to True to train the model, assuming that the data has been preprocessed.
* `do_testing`: Set to True to evaluate on the test set after training, or to False to skip testing.
* `wandb_artifacts.wandb_use_artifact_path`: Enter the path of the dataset used for training.
* `model.data.dataset_path`: Specifies the path to the preprocessed uniref50 data folder that includes training/validation/test splits.
* `model.data.uf90.uniref90_path`: Specifies the path to the preprocessed uniref90 data. This folder should contain a separate folder named u90_csvs, which must include files from x000.csv to x049.csv.
* `trainer.devices`: Specifies the number of GPUs to use.
* `model.tensor_model_parallel_size`: Sets the tensor model parallel size.
* `model.micro_batch_size`: Sets the batch size. Increase this as much as possible unless memory errors occur.
* `trainer.max_steps`: Specifies the maximum number of training steps. Set to 10 here for demonstration purposes. One step equals processing one batch. First, calculate total_batches = total number of samples / batch size. If you want to train for N epochs, set max_steps to N * total_batches.
* `trainer.val_check_interval`: Specifies the interval at which to run the validation set.
* `exp_manager.create_wandb_logger`: Set to False to disable logging to wandb. If set to True, you must provide a wandb API key.
* `exp_manager.checkpoint_callback_params.save_top_k`: Specifies the number of best checkpoints to save.

The trained results will be saved in /workspace/bionemo/results/nemo_experiments/.
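
The `trainer.max_steps` arithmetic from the parameter list can be written as a short helper. This is a sketch under two assumptions: `batch_size` is the number of samples consumed per training step across all GPUs, and a final partial batch is rounded up to a full step. The function name and the sample counts are illustrative only.

```python
import math

def max_steps_for_epochs(num_samples, batch_size, epochs):
    """Return the max_steps value that corresponds to a number of epochs."""
    # One step processes one batch; a trailing partial batch still costs a step.
    total_batches = math.ceil(num_samples / batch_size)
    return epochs * total_batches

# Example: 1000 sequences, 8 samples per step, 3 epochs.
steps = max_steps_for_epochs(num_samples=1000, batch_size=8, epochs=3)
print(steps)  # 375
```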

With these settings, you are well-equipped to execute the command and start the pretraining process efficiently.


# Finetuning

In [None]:
cd /workspace/bionemo

In [None]:
# Please restart kernel
import os
os.environ["BIONEMO_HOME"]="/workspace/bionemo"
os.environ["WANDB_ENTITY"]=""
os.environ["WANDB_PROJECT"]="BioNeMo_protein_LLM_finetuning"

The BioNeMo Framework offers sample code for three downstream tasks: predicting which of 10 subcellular localizations a protein belongs to, predicting the melting temperature of a protein, and predicting the 3-state secondary structure of a protein. This article uses the third task as an example: predicting whether each amino acid in a sequence belongs to a helix, sheet, or coil.
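
Since the third task assigns one label to every amino acid, the target has the same length as the sequence. The sketch below illustrates that shape. The letters `H`, `E`, and `C` for helix, sheet, and coil, the index order, and the example strings are assumptions made for illustration, so check them against the sample data.

```python
# Assumed label alphabet: H = helix, E = sheet (strand), C = coil.
STATE_TO_INDEX = {"H": 0, "E": 1, "C": 2}

def encode_labels(sequence, states):
    """Turn a per-residue state string into a list of class indices."""
    if len(sequence) != len(states):
        raise ValueError("token-level targets need one label per amino acid")
    return [STATE_TO_INDEX[state] for state in states]

sequence = "MKTAYIAK"  # made-up amino acid sequence
states = "CCHHHHEE"    # made-up 3-state annotation of the same length
labels = encode_labels(sequence, states)
print(labels)  # [2, 2, 0, 0, 0, 0, 1, 1]
```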

The sample data for this task can be found in the folder below.

/workspace/bionemo/examples/tests/test_data/protein/downstream/

In [None]:
!wget -q -O /tmp/ngccli_linux.zip --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.38.0/files/ngccli_linux.zip && unzip -o /tmp/ngccli_linux.zip -d /tmp && chmod u+x /tmp/ngc-cli/ngc && rm /tmp/ngccli_linux.zip

To download a pretrained model, you need the NGC (NVIDIA GPU Cloud) CLI, which the cell above installs to /tmp/ngc-cli. Next, configure your NGC settings by running:

/tmp/ngc-cli/ngc config set

To complete the configuration of your NGC CLI, follow these steps, inputting the requested details:

`API Key`: Enter your API key from the NVIDIA GPU Cloud (NGC). This key is critical for authentication and accessing the services.

`CLI Output Format`: You can choose the desired output format for your CLI responses. If unsure, the default format is usually fine, so just press "Enter".

`Organization (Org)`: Choose an organization other than 'no-org'. If you are part of an NVIDIA-approved organization, input its name; otherwise, consult NGC documentation or your NGC account settings for possible values.

`Team`: If your NGC usage involves a specific team setup, enter the team name. If not, just press "Enter" to skip this step.

`ACE`: The Accelerated Computing Environment (ACE) to use. Press "Enter" to accept the default unless you need a specific one.

After configuring NGC, you can download the model using the command below. It downloads `esm2nv_650m`; to download a different model, replace that name:

In [None]:
!python download_models.py --download_dir /workspace/bionemo/models esm2nv_650m

In [None]:
# model saving
import wandb
with wandb.init(name="model_upload") as run:
    artifact = wandb.Artifact(
        name="esm2nv_650m",
        type="model",
        description="original esm2nv_650m",
        metadata={"path":"bionemo/models/esm2nv_650M_converted.nemo"},
    )
    artifact.add_file("/workspace/bionemo/models/esm2nv_650M_converted.nemo")
    run.log_artifact(artifact)

In [None]:
# dataset saving
with wandb.init(name="data_upload") as run:
    artifact = wandb.Artifact(
        name="downstream_taskdataset",
        type="dataset",
        description="bionemo/examples/tests/test_data/protein/downstream",
        metadata={"path":"/workspace/bionemo/examples/tests/test_data/protein/downstream"},
    )
    artifact.add_dir("/workspace/bionemo/examples/tests/test_data/protein/downstream")
    run.log_artifact(artifact)

The downloaded model is a .nemo file and is saved in /workspace/bionemo/models. Then, open the following YAML file:

/workspace/bionemo/examples/protein/esm2nv/conf/downstream_sec_str_LORA.yaml

Set the following parameters as needed:

* `restore_from_path`: Set to the path of the .nemo file of the pretrained model checkpoint.
* `trainer.devices`, `trainer.num_nodes`: Set to the number of GPUs and nodes to use.
* `trainer.max_epochs`: Set to the number of epochs you want to train.
* `trainer.val_check_interval`: Set to the number of steps at which to perform validation.
* `model.micro_batch_size`: Set to the microbatch size for training.
* `data.task_name`: Set to any name you choose.
* `data.task_type`: Current options include token-level-classification, classification (sequence level), and regression (sequence level).
* `preprocessed_data_path`: Set to the path of the parent folder of dataset_path.
* `wandb_artifacts.wandb_use_artifact_data_path`: This is the path of the wandb artifacts for the data being used.
* `wandb_artifacts.wandb_use_artifact_model_path`: This is the path of the wandb artifacts for the model being used.
* `dataset_path`: Set to the path of the folder that includes train/val/test folders. For the sample data, this is /workspace/bionemo/examples/tests/test_data/protein/downstream.
* `dataset.train`, `dataset.val`, `dataset.test`: Set to CSV name or range.
* `sequence_column`: Set to the name of the column containing the sequence. Example: sequence
* `target_column`: Set to the name of the column containing the target. Example: scl_label
* `target_size`: The number of classes for each label in classification.
* `num_classes`: Set to target_size.

Then run the fine-tuning with the following command. BioNeMo Framework v1.4 introduced LoRA (low-rank adaptation) as a fine-tuning method. Instead of updating all weights of a pretrained large language model, LoRA freezes them and trains two much smaller matrices whose product approximates the update to a large weight matrix. This greatly reduces the number of trainable parameters while largely preserving the performance of the original model.
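
To give a feel for why LoRA is cheaper, the sketch below shows the low-rank update on a tiny matrix and compares trainable parameter counts. It is a plain-Python illustration, not BioNeMo's implementation: the 1280 x 1280 layer size and rank 8 are assumptions chosen for the arithmetic, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(weight, b_matrix, a_matrix, scale=1.0):
    """Frozen weight plus the scaled low-rank update B @ A."""
    delta = matmul(b_matrix, a_matrix)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def lora_params(d_in, d_out, rank):
    """Trainable parameters: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)

# Tiny rank-1 example: a 2 x 2 frozen weight with 2 x 1 and 1 x 2 adapters.
frozen = [[1, 0], [0, 1]]
adapted = lora_effective_weight(frozen, [[1], [2]], [[0.5, 0]])
print(adapted)  # [[1.5, 0.0], [1.0, 1.0]]

# Parameter counts for one assumed 1280 x 1280 layer at rank 8.
full = 1280 * 1280
low_rank = lora_params(1280, 1280, rank=8)
print(full, low_rank, f"{low_rank / full:.2%}")  # 1638400 20480 1.25%
```

Only the two small adapter matrices receive gradient updates; the frozen weight is left untouched.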

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
import os
import shutil

# If the directory exists and contains checkpoint data, you should delete it. This is a precautionary measure to ensure that running the following code does not encounter issues with existing data. 
checkpoint_dir = "/workspace/bionemo/results/nemo_experiments/esm2nv_flip/esm2nv_flip_secondary_structure_finetuning_encoder_frozen_False/checkpoints"
if os.path.exists(checkpoint_dir) and os.listdir(checkpoint_dir):
    shutil.rmtree(checkpoint_dir)
    os.makedirs(checkpoint_dir)

In [None]:
!python examples/protein/downstream/downstream_flip.py\
 --config-path="../esm2nv/conf"\
 --config-name=downstream_sec_str_LORA\
 ++trainer.devices=1\
 ++trainer.num_nodes=1\
 ++trainer.max_epochs=100\
 ++wandb_artifacts.wandb_use_artifact_data_path='${oc.env:WANDB_ENTITY}/${oc.env:WANDB_PROJECT}/downstream_taskdataset:v0'\
 ++wandb_artifacts.wandb_use_artifact_model_path='${oc.env:WANDB_ENTITY}/${oc.env:WANDB_PROJECT}/esm2nv_650m:v0'\
 ++model.data.dataset_path=/\
 ++model.restore_encoder_path=/esm2nv_650M_converted.nemo

# Inference

Please open the following file

/workspace/bionemo/examples/protein/esm2nv/conf/infer.yaml

Please update the following setting. You can use either a provided pretrained model or a model that you have trained yourself; in either case, set the path to the corresponding model checkpoint.

In [None]:
downstream_task:
 restore_from_path: "${oc.env:BIONEMO_HOME}/models/protein/esm2nv/esm2nv_650M_converted.nemo" # path to the pretrained model

Please open the following Jupyter Notebook:

/workspace/bionemo/examples/protein/esm2nv/nbs/inference_interactive.ipynb

Then, proceed to execute the cells in sequence.

# Use your own data (Optional)

Please prepare directories named train, val, and test in the local directory /home/bionemo/. In each folder, prepare files named x000.csv, x001.csv, and so on.

## Pre-training Data Preparation
For the format of the pre-training data, please refer to the following file:

/workspace/bionemo/examples/tests/test_data/uniref202104_esm2_qc_test200_val200/uf50/train/x000.csv

The columns record_id and sequence are mandatory, and other columns are optional. You will also need to modify the data section of the following YAML file to change the number of data entries and paths:

/workspace/bionemo/examples/protein/esm2nv/conf/base_config.yaml
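
As a rough illustration of the layout described in this section, the following sketch writes records into numbered shards named x000.csv, x001.csv, and so on, using only the mandatory `record_id` and `sequence` columns. The helper `write_shards`, the shard size, and the toy records are hypothetical. Compare the output with the sample x000.csv before relying on it.

```python
import csv
import tempfile
from pathlib import Path

def write_shards(records, out_dir, shard_size):
    """Write (record_id, sequence) pairs into x000.csv, x001.csv, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for shard_index, start in enumerate(range(0, len(records), shard_size)):
        path = out_dir / f"x{shard_index:03d}.csv"
        with path.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["record_id", "sequence"])  # mandatory columns
            writer.writerows(records[start:start + shard_size])
        paths.append(path)
    return paths

# Toy records; real data would come from your own FASTA or table.
records = [("seq1", "MKT"), ("seq2", "AYI"), ("seq3", "AKQ")]
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(records, Path(tmp) / "train", shard_size=2)
    shard_names = [path.name for path in shard_paths]
    with shard_paths[0].open(newline="") as handle:
        first_rows = list(csv.reader(handle))
print(shard_names)  # ['x000.csv', 'x001.csv']
```

The same function can be called once each for the train, val, and test folders.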

## Fine-Tuning Data Preparation
For the format of the fine-tuning data, please refer to the following file:

/workspace/bionemo/examples/tests/test_data/protein/downstream/train/x000.csv

The columns id, sequence, and target (either 3-state or 8-state) are mandatory, and other columns are optional. Additionally, you will need to change the number of data entries, paths, and target information in the data section of the following YAML file:

/workspace/bionemo/examples/protein/esm2nv/conf/downstream_sec_str_LORA.yaml
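
Before fine-tuning on your own data, a quick sanity check of the CSV files can catch format problems early. The sketch below verifies that the mandatory columns exist and that each target has one label per residue. The column name `3state`, the one-character-per-residue encoding, and the helper `find_problems` are assumptions, so adjust them to match the sample file.

```python
import csv
import io

def find_problems(csv_text, target_column="3state"):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {"id", "sequence", target_column} - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for row in reader:
        # Token-level targets need exactly one label per amino acid.
        if len(row["sequence"]) != len(row[target_column]):
            problems.append(f"{row['id']}: sequence and target lengths differ")
    return problems

# Made-up sample: the second row has a target that is too short.
sample = "id,sequence,3state\np1,MKTA,CCHH\np2,AYI,CC\n"
problems = find_problems(sample)
print(problems)  # ['p2: sequence and target lengths differ']
```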
