In [1]:
import os
concat_path = "XTT22_train.fa"

In [2]:
full_fasta_path = os.path.abspath(concat_path)
output_dir = os.path.abspath("preprocessed_data")
output_yaml = f"""
- datapaths: ["{full_fasta_path}"]
  output_dir: "{output_dir}"
  output_prefix: XTT22_train
  train_split: 0.9
  valid_split: 0.05
  test_split: 0.05
  overwrite: true
  embed_reverse_complement: true
  random_reverse_complement: 0.0
  random_lineage_dropout: 0.0
  include_sequence_id: false
  transcribe: "back_transcribe"
  force_uppercase: false
  indexed_dataset_dtype: "uint8"
  tokenizer_type: "Byte-Level"
  vocab_file: null
  vocab_size: null
  merges_file: null
  pretrained_tokenizer_model: null
  special_tokens: null
  fast_hf_tokenizer: true
  append_eod: true
  enforce_sample_length: null
  ftfy: false
  workers: 1
  preproc_concurrency: 100000
  chunksize: 25
  drop_empty_sequences: true
  nnn_filter: false  # If you split your fasta on NNN (in human these are contigs), then you should set this to true.
  seed: 12342  # Not relevant because we are not using random reverse complement or lineage dropout.
"""
with open("preprocess_config.yaml", "w") as f:
    print(output_yaml, file=f)
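
The config above splits the data 0.9 / 0.05 / 0.05. A minimal sketch of a pre-flight check on those fractions follows, assuming the three values are meant to sum to 1 (whether `preprocess_evo2` enforces this is not confirmed here). The helper names `read_splits` and `splits_are_valid` and the example path are hypothetical.

```python
import math
import re


def read_splits(yaml_text: str) -> dict[str, float]:
    """Pull the three *_split values out of the config text with a regex.

    This avoids a YAML dependency; it assumes each key appears once,
    on its own line, as in the config written above.
    """
    pattern = re.compile(r"^\s*(train|valid|test)_split:\s*([0-9.eE+-]+)", re.MULTILINE)
    return {name: float(value) for name, value in pattern.findall(yaml_text)}


def splits_are_valid(splits: dict[str, float]) -> bool:
    """True when all three splits are present, non-negative, and sum to 1."""
    return (
        set(splits) == {"train", "valid", "test"}
        and all(v >= 0 for v in splits.values())
        and math.isclose(sum(splits.values()), 1.0, abs_tol=1e-9)
    )


# Hypothetical config fragment with the same split values as above.
example = """
- datapaths: ["/data/example.fa"]
  train_split: 0.9
  valid_split: 0.05
  test_split: 0.05
"""
print(splits_are_valid(read_splits(example)))
```

In the notebook, passing `output_yaml` to `read_splits` would run the same check on the actual config string.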

In [3]:
!preprocess_evo2 --config preprocess_config.yaml

[NeMo I 2025-05-24 12:37:06 nemo_logging:393] Using byte-level tokenization
[NeMo I 2025-05-24 12:37:06 nemo_logging:393] Created temporary binary datasets: /workspace/preprocessed_data/XTT22_train_byte-level_train.bin.tmp /workspace/preprocessed_data/XTT22_train_byte-level_val.bin.tmp /workspace/preprocessed_data/XTT22_train_byte-level_test.bin.tmp
[NeMo I 2025-05-24 13:12:11 nemo_logging:393] Average preprocessing time per sequence: 0.04470627161196968
[NeMo I 2025-05-24 13:12:11 nemo_logging:393] Average indexing time per sequence: 0.1463382052460373
[NeMo I 2025-05-24 13:12:11 nemo_logging:393] Number of sequences processed: 12092
[NeMo I 2025-05-24 13:12:11 nemo_logging:393] Finished preprocessing XTT22_train ([PosixPath('/workspace/XTT22_train.fa')]) in 2105.082 seconds with 1 workers.


In [4]:
!ls -lh preprocessed_data/

total 14G
-rw-r--r-- 1 root root 936M May 24 13:11 XTT22_train_byte-level_test.bin
-rw-r--r-- 1 root root  12K May 24 13:12 XTT22_train_byte-level_test.idx
-rw-r--r-- 1 root root  13G May 24 13:12 XTT22_train_byte-level_train.bin
-rw-r--r-- 1 root root 213K May 24 13:12 XTT22_train_byte-level_train.idx
-rw-r--r-- 1 root root 411M May 24 13:12 XTT22_train_byte-level_val.bin
-rw-r--r-- 1 root root  12K May 24 13:12 XTT22_train_byte-level_val.idx
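
With `indexed_dataset_dtype: "uint8"` and byte-level tokenization, each token should occupy one byte, so the `.bin` sizes listed above approximate the token count per split. The sketch below rests on that assumption (that `.bin` holds only the token array, with offsets kept in `.idx`); `bin_token_counts` is a hypothetical helper.

```python
from pathlib import Path


def bin_token_counts(data_dir: str, bytes_per_token: int = 1) -> dict[str, int]:
    """Estimate tokens per .bin file as file size divided by bytes per token.

    Assumes the .bin file contains nothing but token ids.
    """
    counts = {}
    for bin_path in sorted(Path(data_dir).glob("*.bin")):
        counts[bin_path.name] = bin_path.stat().st_size // bytes_per_token
    return counts


# Only report when the directory from the preprocessing step is present.
if Path("preprocessed_data").is_dir():
    for name, n_tokens in bin_token_counts("preprocessed_data").items():
        print(f"{name}: {n_tokens:,} tokens")
```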


In [10]:
!evo2_convert_to_nemo2 \
  --model-path /workspace/savanna_evo2_1b_base/savanna_evo2_1b_base.pt \
  --model-size 1b --output-dir nemo2_evo2_1b_8k

Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[NeMo I 2025-05-24 15:02:39 nemo_logging:393] Using byte-level tokenization
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2025-05-24 15:02:39 nemo_logging:405] /usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
    
[NeMo I 2025-05-24 15:02:39 nemo_logging:393] Fixing mis-match between ddp-config & mcore-optimizer config
[NeMo I 2025-05-24 15:02:39 nemo_logging:393] Rank 0 has data parallel group : [0]
[NeMo I 2025-05-24 15:02:39 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0

In [11]:
import os
from pathlib import Path
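# Also referenced as {preprocessed_data} by the train_evo2 cells below.
preprocessed_data = os.path.abspath("preprocessed_data")
output_pfx = str(Path(preprocessed_data) / "XTT22_train_byte-level")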
output_yaml = f"""
- dataset_prefix: {output_pfx}_train
  dataset_split: train
  dataset_weight: 1.0
- dataset_prefix: {output_pfx}_val
  dataset_split: validation
  dataset_weight: 1.0
- dataset_prefix: {output_pfx}_test
  dataset_split: test
  dataset_weight: 1.0
"""
with open("training_data_config.yaml", "w") as f:
    print(output_yaml, file=f)
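
Each `dataset_prefix` above is expected to resolve to a `.bin`/`.idx` pair like the ones listed by `ls` earlier. A small sketch for checking that before launching training follows; `missing_dataset_files` is a hypothetical helper, not part of BioNeMo.

```python
from pathlib import Path


def missing_dataset_files(prefixes: list[str]) -> list[str]:
    """Return every expected <prefix>.bin / <prefix>.idx path that is absent."""
    missing = []
    for prefix in prefixes:
        for ext in (".bin", ".idx"):
            path = Path(f"{prefix}{ext}")
            if not path.is_file():
                missing.append(str(path))
    return missing


# Prefixes follow the naming used in the config above.
base = Path("preprocessed_data").resolve() / "XTT22_train_byte-level"
prefixes = [f"{base}_{split}" for split in ("train", "val", "test")]
print(missing_dataset_files(prefixes) or "all dataset files present")
```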

In [12]:
!train_evo2 \
    -d training_data_config.yaml \
    --dataset-dir {preprocessed_data} \
    --model-size 1b \
    --devices 1 \
    --num-nodes 1 \
    --seq-length 1 \
    --micro-batch-size 1 \
    --lr 0.0001 \
    --warmup-steps 5 \
    --max-steps 100 \
    --ckpt-dir nemo2_evo2_1b_8k \
    --clip-grad 1 \
    --wd 0.01 \
    --activation-checkpoint-recompute-num-layers 1 \
    --val-check-interval 50 \
    --ckpt-async-save

Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[NeMo I 2025-05-24 15:03:17 nemo_logging:393] Using byte-level tokenization
[NeMo W 2025-05-24 15:03:17 nemo_logging:405] WandB is currently turned off.
[NeMo W 2025-05-24 15:03:17 nemo_logging:405] User-set tensorboard is currently turned off. Internally one may still be set by NeMo2.
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo I 2025-05-24 15:03:17 nemo_logging:393] Experiments will be logged at results/evo2/dev
[NeMo W 2025-05-24 15:03:17 
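
The `{preprocessed_data}` placeholder in the command above is filled in by IPython only when a Python variable of that name exists in the notebook namespace. One way to sidestep that dependency is to assemble the argument list in Python, sketched below. `build_cli` is a hypothetical helper, the flag values are copied from the cell above, and using `preprocessed_data` as the dataset directory is an assumption.

```python
import shlex


def build_cli(command: str, options: dict) -> list[str]:
    """Turn a {flag: value} mapping into an argv list.

    A value of True emits a bare flag; None or False omits the flag.
    """
    argv = [command]
    for flag, value in options.items():
        if value is True:
            argv.append(flag)
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv


argv = build_cli(
    "train_evo2",
    {
        "-d": "training_data_config.yaml",
        "--dataset-dir": "preprocessed_data",  # assumed directory name
        "--model-size": "1b",
        "--max-steps": 100,
        "--ckpt-async-save": True,
    },
)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv)`, which passes each argument through without shell interpolation.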

In [14]:
!pip install -q peft

DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/looseversion-1.3.0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/lightning_utilities-0.12.0.dev0-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/dill-0.3.9-py3.12.egg is deprecated. pip 25.1 will enforce this behaviour change. A possible replacement is to use pip for package installation. Discussion can be found at https://github.com/pypa/pip/issues/12330
DEPRECATION: Loading egg at /usr/local/lib/python3.12/dist-packages/o

In [20]:
!cp /workspace/bionemo_train.py /usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/train.py
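
The `cp` above overwrites a file inside the installed package, so the original is gone unless it was saved elsewhere. A sketch of a copy that first keeps a one-time backup follows; `replace_with_backup` and the `.orig` suffix are hypothetical choices.

```python
import shutil
from pathlib import Path


def replace_with_backup(patched: str, installed: str, suffix: str = ".orig") -> Path:
    """Copy `patched` over `installed`, saving the original once as <name><suffix>.

    The backup is written only if it does not exist yet, so repeated calls
    never overwrite the pristine copy with an already patched file.
    """
    installed_path = Path(installed)
    backup_path = installed_path.with_name(installed_path.name + suffix)
    if installed_path.exists() and not backup_path.exists():
        shutil.copy2(installed_path, backup_path)
    shutil.copy2(patched, installed_path)
    return backup_path
```

Called with the two paths from the `cp` line, this would leave a `train.py.orig` next to the patched `train.py`.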

In [None]:

!train_evo2 \
    -d training_data_config.yaml \
    --dataset-dir {preprocessed_data} \
    --model-size 1b \
    --devices 1 \
    --num-nodes 1 \
    --seq-length 1 \
    --micro-batch-size 1 \
    --lr 0.0001 \
    --warmup-steps 5 \
    --max-steps 300000 \
    --ckpt-dir nemo2_evo2_1b_8k \
    --clip-grad 1 \
    --wd 0.01 \
    --activation-checkpoint-recompute-num-layers 1 \
    --val-check-interval 1000 \
    --ckpt-async-save

Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[NeMo I 2025-06-01 15:37:07 nemo_logging:393] Using byte-level tokenization
[NeMo W 2025-06-01 15:37:07 nemo_logging:405] WandB is currently turned off.
[NeMo W 2025-06-01 15:37:07 nemo_logging:405] User-set tensorboard is currently turned off. Internally one may still be set by NeMo2.
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo I 2025-06-01 15:37:07 nemo_logging:393] Experiments will be logged at results/evo2/dev
[NeMo W 2025-06-01 15:37:07 

In [None]:
import os
import stat

# Path of the file to inspect
file_path = "/usr/local/bin/train_evo2"

# First check whether the file exists
if os.path.exists(file_path):
    print(f"File exists: {file_path}")

    # Collect file metadata
    file_stat = os.stat(file_path)
    print(f"File size: {file_stat.st_size} bytes")
    print(f"File permissions: {stat.filemode(file_stat.st_mode)}")
    print(f"Executable: {os.access(file_path, os.X_OK)}")

    # Check the file type
    first_text = None  # stays None if the file turns out to be binary
    with open(file_path, 'rb') as f:
        first_bytes = f.read(100)
        print(f"First bytes: {first_bytes[:50]}")

        # Check whether it is a text file
        try:
            first_text = first_bytes.decode('utf-8')
            print("This is a text file")
            print(f"Start of file: {first_text[:100]}...")
        except UnicodeDecodeError:
            print("This is a binary file")

    # If it is a Python script, show basic information (not the full contents)
    if file_path.endswith('.py') or (first_text is not None and 'python' in first_text.lower()):
        with open(file_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
            print(f"\nTotal lines: {len(lines)}")
            print("Start of file (first 100 lines):")
            for i, line in enumerate(lines[:100]):
                print(f"{i+1}: {line.rstrip()}")
else:
    print(f"File does not exist: {file_path}")

    # Check possible alternative paths
    possible_paths = [
        "/usr/local/bin/train_evo2",
        "/usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/train.py",
        "/usr/local/lib/python3.12/dist-packages/bionemo/evo2/train.py"
    ]

    print("\nChecking other possible paths:")
    for path in possible_paths:
        if os.path.exists(path):
            print(f"✓ Found: {path}")
        else:
            print(f"✗ Not found: {path}")
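
Instead of probing hard-coded paths, the command and the module can be located through the standard library. This is a sketch only: `locate` is a hypothetical helper, and the idea that `train_evo2` is backed by the module `bionemo.evo2.run.train` is an assumption taken from the paths used in this notebook.

```python
import importlib.util
import shutil


def locate(command: str, module: str) -> dict:
    """Report where a console command and a Python module live, if anywhere."""
    try:
        # For dotted names, find_spec imports the parent packages first and
        # raises ModuleNotFoundError when one of them is missing.
        spec = importlib.util.find_spec(module)
    except (ImportError, ValueError):
        spec = None
    return {
        "command": shutil.which(command),
        "module": spec.origin if spec else None,
    }


print(locate("train_evo2", "bionemo.evo2.run.train"))
```

Either entry is `None` when the command is not on `PATH` or the module cannot be found.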

In [None]:
#!/usr/bin/env python3
"""
Print the contents of the BioNeMo Evo2 training script
"""

import os
from pathlib import Path

def print_file_content():
    """Print the contents of the target file."""

    file_path = "/usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/train.py"

    print("=" * 80)
    print(f"File path: {file_path}")
    print("=" * 80)

    try:
        # Check whether the file exists
        if not os.path.exists(file_path):
            print(f"❌ File does not exist: {file_path}")

            # Try to find similar files
            print("\nSearching for related files...")
            base_dir = "/usr/local/lib/python3.12/dist-packages/"

            if os.path.exists(base_dir):
                print(f"✓ Base directory exists: {base_dir}")

                # Search for bionemo-related directories
                for root, dirs, files in os.walk(base_dir):
                    if "bionemo" in root.lower():
                        print(f"Found related directory: {root}")
                        if "train.py" in files:
                            print(f"  -> contains train.py: {os.path.join(root, 'train.py')}")
            else:
                print(f"❌ Base directory does not exist: {base_dir}")

            return False

        # Read and print the file contents
        print("✓ File exists, reading contents...\n")

        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Print file statistics
        lines = content.split('\n')
        print(f"File size: {len(content)} characters")
        print(f"Line count: {len(lines)}")
        print("-" * 80)

        # Print the contents with line numbers
        for i, line in enumerate(lines, 1):
            print(f"{i:4d}: {line}")

        print("-" * 80)
        print("✓ Finished printing file contents")
        return True

    except PermissionError:
        print(f"❌ Permission denied, cannot read file: {file_path}")
        print("Try running this script with sudo")
        return False

    except Exception as e:
        print(f"❌ Error while reading file: {e}")
        return False

def search_alternative_paths():
    """Search possible alternative paths."""

    potential_paths = [
        "/usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/train.py",
        "/usr/local/lib/python3.11/dist-packages/bionemo/evo2/run/train.py",
        "/usr/lib/python3.12/dist-packages/bionemo/evo2/run/train.py",
        "/usr/lib/python3.11/dist-packages/bionemo/evo2/run/train.py",
    ]

    # Also check the current user's site-packages
    import site
    user_site = site.getusersitepackages()
    if user_site:
        potential_paths.append(f"{user_site}/bionemo/evo2/run/train.py")

    print("\nSearching possible paths:")
    print("-" * 50)

    found_files = []
    for path in potential_paths:
        if os.path.exists(path):
            print(f"✓ Found: {path}")
            found_files.append(path)
        else:
            print(f"✗ Not found: {path}")

    return found_files

def main():
    """Entry point."""
    print("BioNeMo Evo2 training script viewer")
    print("=" * 80)

    # First try to print the target file
    success = print_file_content()

    if not success:
        # If that fails, search the alternative paths
        found_files = search_alternative_paths()

        if found_files:
            print(f"\nFound {len(found_files)} related file(s).")
            for i, file_path in enumerate(found_files, 1):
                print(f"\n{i}. Printing: {file_path}")
                print("=" * 80)

                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()

                    lines = content.split('\n')
                    print(f"File size: {len(content)} characters")
                    print(f"Line count: {len(lines)}")
                    print("-" * 80)

                    for line_num, line in enumerate(lines, 1):
                        print(f"{line_num:4d}: {line}")

                    print("-" * 80)

                except Exception as e:
                    print(f"❌ Error while reading file {file_path}: {e}")
        else:
            print("\n❌ No related training script files found")

if __name__ == "__main__":
    main()

In [3]:
!cp /workspace/hyena_modified.py /usr/local/lib/python3.12/dist-packages/nemo/collections/llm/gpt/model/hyena.py

In [4]:
!evo2_convert_to_nemo2 \
  --model-path /workspace/savanna_evo2_7b/savanna_evo2_7b.pt \
  --model-size 7b --output-dir nemo2_evo2_7b

Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[NeMo I 2025-06-02 07:46:00 nemo_logging:393] Using byte-level tokenization
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2025-06-02 07:46:00 nemo_logging:405] /usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
    
[NeMo I 2025-06-02 07:46:00 nemo_logging:393] Fixing mis-match between ddp-config & mcore-optimizer config
[NeMo I 2025-06-02 07:46:00 nemo_logging:393] Rank 0 has data parallel group : [0]
[NeMo I 2025-06-02 07:46:00 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0

In [7]:

!train_evo2 \
    -d training_data_config.yaml \
    --dataset-dir {preprocessed_data} \
    --model-size 7b \
    --devices 4 \
    --num-nodes 1 \
    --seq-length 1 \
    --micro-batch-size 1 \
    --lr 0.0001 \
    --warmup-steps 5 \
    --max-steps 200000 \
    --ckpt-dir nemo2_evo2_7b \
    --clip-grad 1 \
    --wd 0.01 \
    --activation-checkpoint-recompute-num-layers 1 \
    --val-check-interval 1000 \
    --ckpt-async-save

Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[NeMo I 2025-06-02 08:04:12 nemo_logging:393] Using byte-level tokenization
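Enabling LoRA fine-tuning...
Model structure debug info:
--------------------------------------------------
The model has 1 module in total
Trying to access other model attributes...
Module structure:
   1.  (HyenaModel)

Target patterns: ['module.decoder.layers.17.self_attention.linear_qkv']

All linear layers (0):

Attention-related layers found (0):

Linear layers found (0):

QKV layers found (0):
⚠️  No linear layers found! The model may not be fully initialized yet.
This is normal in the NeMo/Megatron framework; the model structure is initialized when training starts.
LoRA configuration saved; it will be applied once the model is fully initialized.
Total parameters: 0
Trainable parameters: 0
⚠️  Warning: the model has 0 total parameters; model initialization may have failed
⚠️  Warning: no trainable parameters; LoRA may not have been applied correctly
LoRA will be applied automatically when training starts
LoRA callback added; LoRA will be applied when training starts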
[NeMo W 2025-06-02 08:04:12 nemo_logging:405] WandB is currently turned off.
[NeMo W 2025-06-02 08:04:12 nemo_logging:405] User-set tensorboard is currently turned off. In
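
The log above reports one target pattern and zero matching layers, since no submodules were visible at that point. As a rough illustration of selecting LoRA targets by module name, the sketch below filters names with shell-style wildcards. `match_targets` is hypothetical, the module names other than the logged pattern are made up, and this is not NeMo's own matching code.

```python
from fnmatch import fnmatchcase


def match_targets(module_names: list[str], patterns: list[str]) -> list[str]:
    """Keep the module names that match at least one wildcard pattern."""
    return [
        name
        for name in module_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical module names; only the linear_qkv entry appears in the log.
names = [
    "module.decoder.layers.16.mixer.dense",
    "module.decoder.layers.17.self_attention.linear_qkv",
    "module.decoder.layers.17.self_attention.linear_proj",
]

print(match_targets(names, ["module.decoder.layers.17.self_attention.linear_qkv"]))
print(match_targets(names, ["*.self_attention.linear_*"]))
print(match_targets([], ["*.linear_qkv"]))  # nothing to match before the model is built
```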

In [2]:
!cp /workspace/bionemo_train.py /usr/local/lib/python3.12/dist-packages/bionemo/evo2/run/train.py
import os
# 2. Set NCCL and distributed environment variables
print("🔧 Configuring NCCL timeouts and tuning parameters...")

# NCCL timeout settings, raised to 2 hours
os.environ['NCCL_TIMEOUT'] = '7200'  # 2-hour timeout
os.environ['TORCH_NCCL_BLOCKING_WAIT'] = '1'  # uses the newer variable name
os.environ['TORCH_NCCL_ASYNC_ERROR_HANDLING'] = '1'  # uses the newer variable name
os.environ['NCCL_DEBUG'] = 'INFO'  # enable verbose debug output

# PyTorch distributed timeout settings
os.environ['TORCH_DISTRIBUTED_TIMEOUT'] = '7200'  # PyTorch distributed timeout
os.environ['TORCH_NCCL_TRACE_BUFFER_SIZE'] = '1024'  # enable NCCL tracing

# Data loading and communication tuning
os.environ['NCCL_BUFFSIZE'] = '8388608'  # raise the buffer size to 8 MB
os.environ['NCCL_NTHREADS'] = '8'  # raise the NCCL thread count
os.environ['NCCL_MIN_NTHREADS'] = '4'  # minimum thread count

# Avoid memory fragmentation and parallelism conflicts
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:512'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'  # avoid tokenizer parallelism conflicts
os.environ['OMP_NUM_THREADS'] = '4'  # limit the OpenMP thread count

# Dataset preparation tuning
os.environ['NCCL_P2P_DISABLE'] = '0'  # make sure P2P communication is enabled
os.environ['NCCL_SHM_DISABLE'] = '0'  # make sure shared-memory communication is enabled
# No `!export` lines are needed: each `!` command runs in its own subshell,
# and train_evo2 inherits the os.environ values set above.
!train_evo2 \
    -d training_data_config.yaml \
    --dataset-dir {preprocessed_data} \
    --model-size 7b \
    --devices 2 \
    --num-nodes 1 \
    --seq-length 1 \
    --micro-batch-size 1 \
    --lr 0.0001 \
    --warmup-steps 5 \
    --max-steps 200000 \
    --ckpt-dir nemo2_evo2_7b \
    --clip-grad 1 \
    --wd 0.01 \
    --activation-checkpoint-recompute-num-layers 1 \
    --val-check-interval 1000 \
    --ckpt-async-save

🔧 Configuring NCCL timeouts and tuning parameters...
🔧 Setting PyTorch distributed timeout: 7200 seconds
Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[NeMo I 2025-06-05 11:05:02 nemo_logging:393] Using byte-level tokenization
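Enabling LoRA fine-tuning...
Model structure debug info:
--------------------------------------------------
The model has 1 module in total
Trying to access other model attributes...
Module structure:
   1.  (HyenaModel)

Target patterns: ['module.decoder.layers.17.self_attention.linear_qkv']

All linear layers (0):

Attention-related layers found (0):

Linear layers found (0):

QKV layers found (0):
⚠️  No linear layers found! The model may not be fully initialized yet.
This is normal in the NeMo/Megatron framework; the model structure is initialized when training starts.
LoRA configuration saved; it will be applied once the model is fully initialized.
Total parameters: 0
Trainable parameters: 0
⚠️  Warning: the model has 0 total parameters; model initialization may have failed
⚠️  Warning: no trainable parameters; LoRA may not have been applied correctly
LoRA will be applied automatically when training starts
LoRA callback added; LoRA will be applied when training starts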
[NeMo W 2025-06-05 11:05:02 nemo_logging:405] WandB is currently turned off.
[NeMo W 2025-06-05 11:05:02 nemo_logging:405] User-
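
Each `!` line in a notebook runs in its own shell process, so an `export` issued there ends with that line, while values placed in `os.environ` are inherited by child processes started afterwards. The sketch below demonstrates the inheritance half with a throwaway variable; `child_sees` and `DEMO_TIMEOUT` are hypothetical names.

```python
import os
import subprocess
import sys


def child_sees(var: str) -> str:
    """Start a child Python process and return the value it sees for `var`."""
    code = f"import os; print(os.environ.get({var!r}, '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Set in the parent (the notebook kernel), read back from a child process.
os.environ["DEMO_TIMEOUT"] = "7200"
print(child_sees("DEMO_TIMEOUT"))
```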