# Environment Setup
Please check the [wandb report](https://api.wandb.ai/links/wandb-healthcare/jm7scchv) for details on setting up the environment.

## Weights and Biases setup
Training progress and charts can be visualized with Weights and Biases (wandb). Refer to the report above for instructions on setting up wandb. The cell below sets the entity and project, which determine where wandb stores your runs. You may enter any values you wish.


In [None]:
import os
#os.environ["WANDB_ENTITY"]=""
os.environ["WANDB_PROJECT"]="BioNeMo_protein_LLM_pretraining"

## Confirmation of GPU

In [None]:
!nvidia-smi

# Data preprocessing

DNABERT is trained on the human reference genome GRCh38.p13, downloaded from the NIH, the same version used by the original DNABERT. Each chromosome is first divided into contiguous sections (e.g., sections cut by 'N'), and "empty" sequences that would otherwise be sampled during training are removed. Slices of the genome are then sampled at runtime and provided to the model for training. By default, chromosomes 1 through 19 are used for training, chromosomes 20 and 21 are reserved as holdout data, and chromosome 22 is reserved for further evaluation.
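
The chunking and chromosome assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the BioNeMo preprocessing code: the function names `split_contiguous` and `assign_split` are hypothetical, the split labels are chosen here for readability, and the toy sequence is made up.

```python
import re


def split_contiguous(sequence: str) -> list[str]:
    """Cut a chromosome string at runs of 'N' and drop the empty pieces."""
    # re.split leaves empty strings where the sequence starts or ends with N.
    return [chunk for chunk in re.split(r"N+", sequence.upper()) if chunk]


def assign_split(chromosome: int) -> str:
    """Map a chromosome number to the default split described in the text."""
    if 1 <= chromosome <= 19:
        return "train"
    if chromosome in (20, 21):
        return "holdout"
    if chromosome == 22:
        return "evaluation"
    raise ValueError(f"unexpected chromosome number: {chromosome}")


toy_chromosome = "NNNACGTNNNNGGCCANN"  # made-up sequence for illustration
chunks = split_contiguous(toy_chromosome)
print(chunks)  # ['ACGT', 'GGCCA']

splits = {number: assign_split(number) for number in range(1, 23)}
print(splits[1], splits[20], splits[22])  # train holdout evaluation
```
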
Next, you need to set the data path before running the code located at:

/workspace/bionemo/examples/dna/dnabert/pretrain.py

To set the data path correctly, either modify the following YAML configuration file or pass these parameters as part of the command when running data preprocessing or model training:

/workspace/bionemo/examples/dna/dnabert/conf/dnabert_base_config.yaml

The optional command below downloads and preprocesses the entire GRCh38.p13 dataset. In this workshop, however, we will use a smaller dataset prepared in the BioNeMo container for pretraining.

## (Optional) Download and Preprocess the Entire Dataset

In [None]:
cd /workspace/bionemo 

In [None]:
!python examples/dna/dnabert/pretrain.py\
 --config-path=conf\
 --config-name=dnabert_xsmall\
 ++do_training=False\
 ++model.data.dataset_path=/workspace/bionemo/examples/dna/dnabert/data

Parameters starting with -- are passed as command-line arguments to pretrain.py. For example, config-path and config-name specify the folder and YAML file name of the configuration file, respectively. These paths are relative to pretrain.py. conf refers to examples/dna/dnabert/conf, and dnabert_xsmall points to examples/dna/dnabert/conf/dnabert_xsmall.yaml.

Parameters starting with ++ override values that would otherwise come from the YAML file, or add them if they are not already defined there. For instance, in dnabert_xsmall.yaml, which inherits from dnabert_base_config.yaml, you can find the following parameters:

* `do_training`: Set to False to only perform data preprocessing and not training.
* `model.data.dataset_path`: Specifies the path to the output directory of preprocessed GRCh38.p13 data. This folder will include splits for training, validation, and testing.


Instead of overriding arguments through the command line, you can directly modify the YAML file. Once processing is complete, the preprocessed data will be located in /workspace/bionemo/examples/dna/dnabert/data.
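
To make the effect of `++key=value` arguments concrete, the sketch below merges dotted overrides into a nested dictionary. It only illustrates the idea and is not how the configuration library is implemented: `apply_overrides` and `parse_value` are hypothetical helpers, and the values in `base_config` are placeholders, not the real defaults from the YAML files.

```python
from copy import deepcopy


def parse_value(raw: str):
    """Convert an override value to bool or int where possible, else keep the string."""
    if raw in ("True", "False"):
        return raw == "True"
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with '++a.b.c=value' style overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        key, _, raw = item.lstrip("+").partition("=")
        *parents, leaf = key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result


# Placeholder values, not the actual contents of dnabert_xsmall.yaml.
base_config = {"do_training": True, "model": {"data": {"dataset_path": "unset"}}}

merged = apply_overrides(
    base_config,
    [
        "++do_training=False",
        "++model.data.dataset_path=/workspace/bionemo/examples/dna/dnabert/data",
        "++trainer.max_steps=100",
    ],
)
print(merged["do_training"])                    # False
print(merged["model"]["data"]["dataset_path"])  # /workspace/bionemo/examples/dna/dnabert/data
print(merged["trainer"]["max_steps"])           # 100
```

The original dictionary is left untouched, which mirrors the fact that command-line overrides do not modify the YAML file on disk.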

If you want to use your own data for pre-training, fine-tuning, or inference, specify the path as /workspace/bionemo/mydata/. However, you must ensure that the data structure and format conform to the sample data.


# Pretraining

Since the small Chr1 dataset has already been preprocessed, you can start pre-training the DNABERT model right away. Run the following command to start pre-training.

In [None]:
cd /workspace/bionemo 

In [None]:
!python examples/dna/dnabert/pretrain.py\
 --config-path=conf\
 --config-name=dnabert_xsmall\
 ++do_training=True\
 ++trainer.val_check_interval=100\
 ++model.data.dataset_path=/workspace/bionemo/examples/dna/dnabert/data/small-example\
 ++model.data.dataset.train=/workspace/bionemo/examples/dna/dnabert/data/small-example/train/chr1-trim-train.fna\
 ++model.data.dataset.val=/workspace/bionemo/examples/dna/dnabert/data/small-example/val/chr1-trim-val.fna\
 ++model.data.dataset.test=/workspace/bionemo/examples/dna/dnabert/data/small-example/test/chr1-trim-test.fna\
 ++model.micro_batch_size=1\
 ++trainer.max_steps=100

Here is the explanation of the parameters for pre-training the DNABERT model:

* `do_training`: Set this to True to train the model. This assumes the data has already been preprocessed.
* `trainer.val_check_interval`: Sets how often validation runs. Here it runs every 100 training batches.
* `model.data.dataset_path`: Specifies the path to the preprocessed GRCh38.p13 data folder that includes the training/validation/testing splits.
* `model.data.dataset.train`, `model.data.dataset.val`, `model.data.dataset.test`: Point to the training, validation, and test files of the small example dataset.
* `trainer.devices`: Specifies the number of GPUs to use. It is not overridden in the command above, so the value from the YAML file applies.
* `model.micro_batch_size`: Sets the batch size per GPU. Increase this as much as possible as long as memory errors do not occur.
* `trainer.max_steps`: Specifies the maximum number of training steps. It is set to 100 here for demonstration purposes. One step processes one batch, so first calculate total_batches = total number of samples / batch size. To train for N epochs, set max_steps to N * total_batches.
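
The `max_steps` arithmetic above can be written as a small helper. This is a sketch under stated assumptions: `max_steps_for_epochs` is a hypothetical name, the sample count is made up, and rounding the last partial batch up is a choice made here rather than something the text specifies. With several GPUs, `batch_size` should be the total number of samples consumed per step, not only the per-GPU value.

```python
import math


def max_steps_for_epochs(num_samples: int, batch_size: int, epochs: int) -> int:
    """Number of training steps needed to see every sample `epochs` times."""
    if num_samples <= 0 or batch_size <= 0 or epochs <= 0:
        raise ValueError("num_samples, batch_size and epochs must be positive")
    # A final partial batch still costs one step, so round up.
    total_batches = math.ceil(num_samples / batch_size)
    return epochs * total_batches


# Made-up numbers: 10,000 samples, 8 samples per step, 3 epochs.
steps = max_steps_for_epochs(num_samples=10_000, batch_size=8, epochs=3)
print(steps)  # 3750
```

The result could then be passed as `++trainer.max_steps=3750` in place of the demonstration value of 100.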


The results of the training will be saved in /workspace/bionemo/results/nemo_experiments/.


# Finetuning

The BioNeMo Framework provides sample code for a downstream task, splice site prediction. Next, you will fine-tune the model for this task using the Ensembl annotations (version 99) of GRCh38.p13. 10,000 donor sites, 10,000 acceptor sites, and 10,000 random negative sites from the gene body are sampled and split into training (80%), validation (10%), and testing (10%) sets.
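
As a rough illustration of the 80/10/10 split described above, the sketch below shuffles 30,000 placeholder records and slices them into three sets. It is not the BioNeMo implementation: `split_sites` is a hypothetical helper, the records are dummy tuples rather than real genomic coordinates, and the fixed seed is an assumption made for reproducibility.

```python
import random


def split_sites(sites, percentages=(80, 10, 10), seed=0):
    """Shuffle the sites and split them into train/val/test by percentage."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    shuffled = list(sites)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * percentages[0] // 100
    n_val = len(shuffled) * percentages[1] // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Dummy records standing in for donor, acceptor, and negative sites.
sites = [(kind, index)
         for kind in ("donor", "acceptor", "negative")
         for index in range(10_000)]

train, val, test = split_sites(sites)
print(len(train), len(val), len(test))  # 24000 3000 3000
```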

The command below can be used to download the entire dataset for the downstream task. However, in this workshop, we will use a pre-prepared dataset, so the following command is optional.

## (Optional) Download dataset for downstream task

In [None]:
cd /workspace/bionemo

In [None]:
!python examples/dna/dnabert/downstream_splice_site.py\
 --config-path=conf\
 --config-name=dnabert_config_splice_site\
 ++do_training=False\
 ++model.data.dataset_path=/workspace/bionemo/examples/dna/dnabert/data

## Download pretrained model

In the BioNeMo Framework, pretrained model checkpoints are provided, such as ESM-1nv, ESM-2nv (650m, 3b, 15b), ProtT5nv, and MegaMolBART. The weights for these models can be downloaded from NVIDIA's NGC.

Next, you will download the pretrained model. To do this, you need to install ngc and set up ngc config. If you have already completed the ngc config setup in the 01_protein_LLM notebook, you can skip this step.

In [None]:
!wget -q -O /tmp/ngccli_linux.zip --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.38.0/files/ngccli_linux.zip && unzip -o /tmp/ngccli_linux.zip -d /tmp && chmod u+x /tmp/ngc-cli/ngc && rm /tmp/ngccli_linux.zip

Open the terminal and enter the following command to set up your ngc config:

/tmp/ngc-cli/ngc config set


To complete the configuration of your NGC CLI, follow these steps, inputting the requested details:

`API Key`: Enter your API key from the NVIDIA GPU Cloud (NGC). This key is critical for authentication and accessing the services.

`CLI Output Format`: You can choose the desired output format for your CLI responses. If unsure, the default format is usually fine, so just press "Enter".

`Organization (Org)`: Choose an organization other than 'no-org'. If you are part of an NVIDIA-approved organization, input its name; otherwise, consult NGC documentation or your NGC account settings for possible values.

`Team`: If your NGC usage involves a specific team setup, enter the team name. If not, just press "Enter" to skip this step.

`ACE`: The Accelerated Computing Environment to use. Press "Enter" to accept the default unless you need a specific ACE.

In [None]:
!python download_models.py --download_dir /workspace/bionemo/models dnabert

The downloaded model will be saved as a .nemo file in the /workspace/bionemo/models directory. You should open the corresponding YAML file and set the parameters as needed to tailor the model's configuration to your specific requirements.

To perform fine-tuning, you will execute the following command in the terminal. Note that you should adjust the folder name 2024-05-29_05-11-01 in the command below to match the actual folder name based on the date and time when you run the training:

In [None]:
cd /workspace/bionemo

In [None]:
!python examples/dna/dnabert/downstream_splice_site.py\
 --config-path=conf\
 --config-name=dnabert_config_splice_site_xsmall_finetune\
 ++model.restore_encoder_path=/workspace/bionemo/results/nemo_experiments/dnabert/dnabert-xsmall/dnabert/2024-05-29_05-11-01/checkpoints/dnabert.nemo\
 ++model.micro_batch_size=1\
 ++model.data.dataset_path=/workspace/bionemo/examples/tests/test_data/dna/downstream\
 ++model.data.train_file=/workspace/bionemo/examples/tests/test_data/dna/downstream/train.csv\
 ++model.data.val_file=/workspace/bionemo/examples/tests/test_data/dna/downstream/val.csv\
 ++model.data.predict_file=/workspace/bionemo/examples/tests/test_data/dna/downstream/test.csv\
 ++model.data.fasta_directory=/workspace/bionemo/examples/tests/test_data/dna/downstream\
 ++model.data.fasta_pattern=test-chr1.fa
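
Instead of copying the timestamped folder name by hand, the most recent run can be located programmatically. The sketch below assumes run folders are named like `2024-05-29_05-11-01`, as in the command above. `latest_run_dir` is a hypothetical helper, and the demo uses a temporary directory rather than the real results folder.

```python
import tempfile
from datetime import datetime
from pathlib import Path

RUN_NAME_FORMAT = "%Y-%m-%d_%H-%M-%S"  # matches names like 2024-05-29_05-11-01


def latest_run_dir(experiment_dir: Path) -> Path:
    """Return the subfolder of experiment_dir with the most recent timestamp name."""
    runs = []
    for child in experiment_dir.iterdir():
        if not child.is_dir():
            continue
        try:
            stamp = datetime.strptime(child.name, RUN_NAME_FORMAT)
        except ValueError:
            continue  # skip folders that are not timestamped runs
        runs.append((stamp, child))
    if not runs:
        raise FileNotFoundError(f"no timestamped run folders in {experiment_dir}")
    return max(runs, key=lambda run: run[0])[1]


# Demo on a throwaway directory with two fake runs and one unrelated folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("2024-05-28_23-59-59", "2024-05-29_05-11-01", "notes"):
        (root / name).mkdir()
    newest = latest_run_dir(root).name

print(newest)  # 2024-05-29_05-11-01
```

In the workshop, `experiment_dir` would be `/workspace/bionemo/results/nemo_experiments/dnabert/dnabert-xsmall/dnabert`, and the checkpoint to pass as `model.restore_encoder_path` would be `checkpoints/dnabert.nemo` inside the returned folder.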

# Inference

Open the following YAML file:

/workspace/bionemo/examples/dna/dnabert/conf/infer.yaml

With the BioNeMo Framework, you have the option to use either a provided pretrained model or a model you have trained yourself. Set the path to the model you want to use in the following entry of the YAML file:

downstream_task:

 restore_from_path: "${oc.env:BIONEMO_HOME}/models/dna/dnabert/dnabert-86M.nemo" 

Then perform inference using the following cells.

In [None]:
import warnings

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [None]:
from pathlib import Path
import os

try:
    BIONEMO_HOME: Path = Path(os.environ['BIONEMO_HOME']).absolute()
except KeyError:
    print("Must have BIONEMO_HOME set in the environment! See docs for instructions.")
    raise

config_path = BIONEMO_HOME / "examples" / "dna" / "dnabert" / "conf"
print(f"Using model configuration at: {config_path}")
assert config_path.is_dir()

In [None]:
seqs = [
    'CACATGCTAGCGCGTCGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCGCCGCGCCG', 
    'GCGCAGGCGCAGAGACACATGCTACCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAG',
    'GCAAAGTCGCACGGCGCCGGGCTGGGGCGGGGGGAGGGTGGCGCCGTGCACGCGCAGAA',
    'CGCAGAGACGGGTAGAACCTCAGTAATCCGAAAAGCCGGGATCGACCGCCCCTTGCTTG',
]

In [None]:
from bionemo.utils.hydra import load_model_config

cfg = load_model_config(config_name="infer.yaml", config_path=config_path)

In [None]:
from bionemo.triton.utils import load_model_for_inference
from bionemo.model.dna.dnabert.infer import DNABERTInference

inferer = load_model_for_inference(cfg, interactive=True)

print(f"Loaded a {type(inferer)}")
assert isinstance(inferer, DNABERTInference)

In [None]:
hidden_states, pad_masks = inferer.seq_to_hiddens(seqs)
print(f"{hidden_states.shape=}")
print(f"{pad_masks.shape=}")
assert tuple(hidden_states.shape) == (4, 57, 256)
assert tuple(pad_masks.shape) == (4, 57)
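
The token dimension of 57 is consistent with splitting each 59-base input into overlapping 3-mers (59 - 3 + 1 = 57). That reading is an inference from the shapes asserted above, not something stated in the configuration shown here, and the helper `kmer_tokens` below is a hypothetical sketch rather than the tokenizer used by the model.

```python
def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers with a stride of one base."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields no tokens.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


tokens = kmer_tokens("ACGTAC")
print(tokens)  # ['ACG', 'CGT', 'GTA', 'TAC']

# A 59-base input yields 57 overlapping 3-mers.
token_count = len(kmer_tokens("A" * 59))
print(token_count)  # 57
```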

In [None]:
embeddings = inferer.hiddens_to_embedding(hidden_states, pad_masks)
print(f"{embeddings.shape=}")
assert tuple(embeddings.shape) == (4, 256)
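
One common way to reduce per-token hidden states to a single vector per sequence is to average over the non-padding tokens. Whether `hiddens_to_embedding` does exactly this is an assumption. The pure-Python sketch below only shows the general technique on made-up numbers, and `masked_mean` is a hypothetical helper.

```python
def masked_mean(hidden, mask):
    """Average the token vectors whose mask entry is True.

    hidden: list of token vectors, shape [tokens][features]
    mask:   list of booleans, shape [tokens]; False marks padding
    """
    kept = [row for row, keep in zip(hidden, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    # zip(*kept) iterates over feature columns of the kept tokens.
    return [sum(column) / len(kept) for column in zip(*kept)]


# Three tokens with two features each; the last token is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [True, True, False]

embedding = masked_mean(hidden, mask)
print(embedding)  # [2.0, 3.0]
```

Applied per sequence, this turns a `(tokens, features)` matrix into a `(features,)` vector, which matches the change in shape from `(4, 57, 256)` to `(4, 256)` seen above.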

In [None]:
embeddings = inferer.seq_to_embeddings(seqs)
print(f"{embeddings.shape=}")
print(embeddings)
assert tuple(embeddings.shape) == (4, 256)