## Fine-tuning tutorial for Evo2
This tutorial walks through a toy fine-tuning example end to end: starting from a FASTA file, it continues training a
Hugging Face checkpoint on this user-defined dataset.

In [9]:
# Clean up any prior runs
!rm -rf preprocessed_data
!rm -rf nemo2_evo2_1b_8k
!rm -rf pretraining_demo
!rm -rf training_data_config.yaml
!rm -rf preprocess_config.yaml
!rm -f chr17.fa.gz
!rm -f chr18.fa.gz
!rm -f chr21.fa.gz
!rm -f chr17.fa
!rm -f chr18.fa
!rm -f chr21.fa
!rm -f chr17_18_21.fa


In [2]:
import os
concat_path = "chr17_18_21.fa"
if not os.path.exists(concat_path):
    !wget https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/chr17.fa.gz
    !wget https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/chr18.fa.gz
    !wget https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/chr21.fa.gz
    !zcat chr17.fa.gz > chr17.fa
    !zcat chr18.fa.gz > chr18.fa
    !zcat chr21.fa.gz > chr21.fa
    !cat chr17.fa chr18.fa chr21.fa > chr17_18_21.fa


--2025-02-25 01:11:46--  https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/chr17.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25930986 (25M) [application/x-gzip]
Saving to: ‘chr17.fa.gz.2’


2025-02-25 01:11:49 (82.3 MB/s) - ‘chr17.fa.gz.2’ saved [25930986/25930986]

--2025-02-25 01:11:49--  https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/chr18.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25154367 (24M) [application/x-gzip]
Saving to: ‘chr18.fa.gz.1’


2025-02-25 01:11:50 (54.6 MB/s) - ‘chr18.fa.gz.1’ saved [25154367/25154367]

--2025-02-25 01:11:50--  https://hgdownload.soe.ucsc.edu
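
Before preprocessing, it can help to confirm that the concatenated FASTA contains the records you expect. The following is a minimal sketch, not part of the BioNeMo tooling: `fasta_record_lengths` is a hypothetical helper that counts the bases under each header line, shown here on a small in-memory example.

```python
from typing import Dict, Iterable


def fasta_record_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {record_id: sequence_length} for FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the id.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence line: count every character, including N and lowercase.
            lengths[current] += len(line)
    return lengths


# Toy input standing in for a real FASTA file (hypothetical data).
example = [">chr17\n", "ACGTN\n", "acgt\n", ">chr18\n", "GG\n"]
print(fasta_record_lengths(example))
```

To run it on the real data, pass an open file handle, for example `with open(concat_path) as f: fasta_record_lengths(f)`. For the file built above, one would expect three records, one per chromosome.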

In [3]:
full_fasta_path = os.path.abspath(concat_path)
output_dir = os.path.abspath("preprocessed_data")
output_yaml = f"""
- datapaths: ["{full_fasta_path}"]
  output_dir: "{output_dir}"
  output_prefix: chr17_18_21_uint8_distinct
  train_split: 0.9
  valid_split: 0.05
  test_split: 0.05
  overwrite: true
  embed_reverse_complement: true
  random_reverse_complement: 0.0
  random_lineage_dropout: 0.0
  include_sequence_id: false
  transcribe: "back_transcribe"
  force_uppercase: false
  indexed_dataset_dtype: "uint8"
  tokenizer_type: "Byte-Level"
  vocab_file: null
  vocab_size: null
  merges_file: null
  pretrained_tokenizer_model: null
  special_tokens: null
  fast_hf_tokenizer: true
  append_eod: true
  enforce_sample_length: null
  ftfy: false
  workers: 1
  preproc_concurrency: 100000
  chunksize: 25
  drop_empty_sequences: true
  nnn_filter: false  # If you split your fasta on NNN (in human these are contigs), then you should set this to true.
  seed: 12342  # Not relevant because we are not using random reverse complement or lineage dropout.
"""
with open("preprocess_config.yaml", "w") as f:
    print(output_yaml, file=f)
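
The `embed_reverse_complement: true` setting presumably adds the reverse-complement strand of each record as an extra sequence, which would be consistent with the preprocessing log reporting 6 sequences for 3 chromosomes. As a rough illustration of what a reverse complement is (a sketch only; `reverse_complement` is a hypothetical helper, not the function `preprocess_evo2` uses internally):

```python
# Translation table for DNA bases; case is preserved and N maps to N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACG"))
```

Characters outside the table (for example IUPAC ambiguity codes) pass through unchanged in this sketch; a production implementation would likely handle them explicitly.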


In [4]:
!preprocess_evo2 --config preprocess_config.yaml


[NeMo I 2025-02-25 01:12:03 nemo_logging:393] Using byte-level tokenization
[NeMo I 2025-02-25 01:12:03 nemo_logging:393] Created temporary binary datasets: /workspaces/bionemo-framework/docs/docs/user-guide/examples/bionemo-evo2/preprocessed_data/chr17_18_21_uint8_distinct_byte-level_train.bin.tmp /workspaces/bionemo-framework/docs/docs/user-guide/examples/bionemo-evo2/preprocessed_data/chr17_18_21_uint8_distinct_byte-level_val.bin.tmp /workspaces/bionemo-framework/docs/docs/user-guide/examples/bionemo-evo2/preprocessed_data/chr17_18_21_uint8_distinct_byte-level_test.bin.tmp
[NeMo I 2025-02-25 01:12:32 nemo_logging:393] Average preprocessing time per sequence: 1.337763786315918
[NeMo I 2025-02-25 01:12:32 nemo_logging:393] Average indexing time per sequence: 3.9368359645207724
[NeMo I 2025-02-25 01:12:32 nemo_logging:393] Number of sequences processed: 6
[NeMo I 202
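
The log line "Using byte-level tokenization" together with `indexed_dataset_dtype: "uint8"` suggests that each nucleotide character becomes one token id that fits in a single byte. A minimal sketch of that idea, assuming the id is simply the character's byte value (the real tokenizer may differ in details such as special tokens):

```python
from typing import Iterable, List


def byte_level_encode(seq: str) -> List[int]:
    """Map each character to its byte value (one token per byte for ASCII)."""
    return list(seq.encode("utf-8"))


def byte_level_decode(ids: Iterable[int]) -> str:
    """Inverse of byte_level_encode."""
    return bytes(ids).decode("utf-8")


ids = byte_level_encode("ACGT")
print(ids)
print(byte_level_decode(ids))
```

Under this assumption, the size of each `.bin` file in bytes is roughly the number of tokens it holds.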

In [5]:
!ls -lh preprocessed_data/

total 402M
-rw-r--r-- 1 ubuntu ubuntu 159M Feb 25 01:12 chr17_18_21_uint8_distinct_byte-level_test.bin
-rw-r--r-- 1 ubuntu ubuntu   82 Feb 25 01:12 chr17_18_21_uint8_distinct_byte-level_test.idx
-rw-r--r-- 1 ubuntu ubuntu 154M Feb 25 01:12 chr17_18_21_uint8_distinct_byte-level_train.bin
-rw-r--r-- 1 ubuntu ubuntu   82 Feb 25 01:12 chr17_18_21_uint8_distinct_byte-level_train.idx
-rw-r--r-- 1 ubuntu ubuntu  90M Feb 25 01:12 chr17_18_21_uint8_distinct_byte-level_val.bin
-rw-r--r-- 1 ubuntu ubuntu   82 Feb 25 01:12 chr17_18_21_uint8_distinct_byte-level_val.idx
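
Each split should have both a `.bin` and an `.idx` file, as in the listing above. The following sketch checks for that before training starts; `missing_dataset_files` is a hypothetical helper, and the prefix is the one used in this tutorial.

```python
from pathlib import Path
from typing import List, Sequence


def missing_dataset_files(
    data_dir: str,
    prefix: str,
    splits: Sequence[str] = ("train", "val", "test"),
) -> List[str]:
    """Return the expected <prefix>_<split>.bin/.idx file names that are absent."""
    missing = []
    for split in splits:
        for ext in (".bin", ".idx"):
            name = f"{prefix}_{split}{ext}"
            if not (Path(data_dir) / name).is_file():
                missing.append(name)
    return missing


# An empty list means every expected file is present.
print(missing_dataset_files("preprocessed_data", "chr17_18_21_uint8_distinct_byte-level"))
```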


In [6]:
!evo2_convert_to_nemo2 \
  --model-path hf://arcinstitute/savanna_evo2_1b_base \
  --model-size 1b --output-dir nemo2_evo2_1b_8k

    

[NeMo I 2025-02-25 01:12:48 nemo_logging:393] Using byte-level tokenization
[INFO     | pytorch_lightning.utilities.rank_zero]: GPU available: True (cuda), used: False
[INFO     | pytorch_lightning.utilities.rank_zero]: TPU available: False, using: 0 TPU cores
[INFO     | pytorch_lightning.utilities.rank_zero]: HPU available: False, using: 0 HPUs

[NeMo I 2025-02-25 01:12:48 nemo_logging:393] Fixing mis-match between ddp-config & mcore-optimizer config
[NeMo I 2025-02-25 01:12:48 nemo_logging:393] Rank 0 has data parallel group : [0]
[NeMo I 2025-02-25 01:12:48 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2025-02-25 01:12:48 nemo_logging:393] All data parallel group ranks with context parallel combined: [[0]]
[NeMo I 2025-02-25 01:12:48 nemo_logging:393] Ranks 0 has data parallel rank: 0
[NeMo I 2025-02-25 01:12:

In [7]:
from pathlib import Path
output_pfx = str(Path(os.path.abspath("preprocessed_data"))/"chr17_18_21_uint8_distinct_byte-level")
output_yaml = f"""
- dataset_prefix: {output_pfx}_train
  dataset_split: train
  dataset_weight: 1.0
- dataset_prefix: {output_pfx}_val
  dataset_split: validation
  dataset_weight: 1.0
- dataset_prefix: {output_pfx}_test
  dataset_split: test
  dataset_weight: 1.0
"""
with open("training_data_config.yaml", "w") as f:
    print(output_yaml, file=f)
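
The three entries above differ only in the file suffix and the split name, so the same text can be generated with a loop. This is a sketch under that assumption; `blend_config_yaml` is a hypothetical helper, and it hard-codes the `val` to `validation` mapping used in the cell above.

```python
def blend_config_yaml(prefix: str, weight: float = 1.0) -> str:
    """Build the training data config text for the train/val/test splits."""
    # (file suffix written by preprocessing, split name expected by the config)
    splits = [("train", "train"), ("val", "validation"), ("test", "test")]
    entries = []
    for suffix, split_name in splits:
        entries.append(
            f"- dataset_prefix: {prefix}_{suffix}\n"
            f"  dataset_split: {split_name}\n"
            f"  dataset_weight: {weight}\n"
        )
    return "".join(entries)


print(blend_config_yaml("/data/example_prefix"))  # hypothetical prefix
```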

In [8]:
# Evo2 training and fine-tuning follow the same steps, so both use the same train_evo2 command.
# The main difference is the --ckpt-dir argument, which points to a pre-existing checkpoint from another training run.
!train_evo2 \
    -d training_data_config.yaml \
    --dataset-dir preprocessed_data \
    --experiment-dir pretraining_demo \
    --model-size 1b \
    --devices 1 \
    --num-nodes 1 \
    --seq-length 1024 \
    --micro-batch-size 2 \
    --lr 0.0001 \
    --warmup-steps 5 \
    --max-steps 100 \
    --ckpt-dir nemo2_evo2_1b_8k \
    --clip-grad 1 \
    --wd 0.01 \
    --activation-checkpoint-recompute-num-layers 1 \
    --val-check-interval 50 \
    --ckpt-async-save \
    --no-wandb

    

[NeMo I 2025-02-25 01:13:19 nemo_logging:393] Using byte-level tokenization

[INFO     | pytorch_lightning.utilities.rank_zero]: Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
[INFO     | pytorch_lightning.utilities.rank_zero]: GPU available: True (cuda), used: True
[INFO     | pytorch_lightning.utilities.rank_zero]: TPU available: False, using: 0 TPU cores
[INFO     | pytorch_lightning.utilities.rank_zero]: HPU available: False, using: 0 HPUs
[NeMo W 2025-02-25 01:13:19 nemo_logging:405] No version folders would be created under the log folder as 'resume_if_exists' is enabled.
[NeMo I 2025-02-25 01:13:19 nemo_logging:393] Experiments will be logged at pretraining_demo/default
[NeMo W 2025-02-25 01:13:19 nemo_logging:405] "upd
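
To get a feel for how small this toy run is, the number of tokens seen during training can be estimated from the flags passed to `train_evo2`. This sketch assumes pure data parallelism and no gradient accumulation, so the global batch is the micro batch times the device count; `tokens_per_step` is a hypothetical helper, not part of the CLI.

```python
def tokens_per_step(
    seq_length: int,
    micro_batch_size: int,
    devices: int = 1,
    num_nodes: int = 1,
    grad_accum_steps: int = 1,
) -> int:
    """Tokens consumed per optimizer step under the stated assumptions."""
    return seq_length * micro_batch_size * devices * num_nodes * grad_accum_steps


# Values taken from the train_evo2 command above.
per_step = tokens_per_step(seq_length=1024, micro_batch_size=2, devices=1, num_nodes=1)
total = per_step * 100  # --max-steps 100
print(per_step, total)
```

With these settings the run covers only a small fraction of the training split, which is fine for a demonstration but far too little for a meaningful fine-tune.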