
The error in loading Llama pretrain checkpoint for NeVa (LLaVA) #8898

Closed
WeianMao opened this issue Apr 12, 2024 · 3 comments
Labels: bug (Something isn't working), stale

Comments


WeianMao commented Apr 12, 2024

When I train the NeVa model, I get the following error:

[NeMo I 2024-04-12 03:38:58 neva_model:252] Loading LLM weights from checkpoint /home/nemo/llama_weights/vicuna-2-7b.nemo
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.val_check_interval=1000', 'trainer.limit_val_batches=5', 'trainer.log_every_n_steps=1', 'trainer.max_steps=1000', 'model.megatron_amp_O2=True', 'model.micro_batch_size=1', 'model.global_batch_size=2', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.mcore_gpt=True', 'model.transformer_engine=True', 'model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json', 'model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain', 'model.tokenizer.library=sentencepiece', 'model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model', 'model.encoder_seq_length=4096', 'model.num_layers=32', 'model.hidden_size=4096', 'model.ffn_hidden_size=16384', 'model.num_attention_heads=32', 'model.normalization=layernorm1p', 'model.do_layer_norm_weight_decay=False', 'model.apply_query_key_layer_scaling=True', 'model.activation=squared-relu', 'model.headscale=False', 'model.position_embedding_type=rope', 'model.rotary_percentage=0.5', 'model.num_query_groups=null', 'model.data.num_workers=0', 'model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo', 'model.mm_cfg.llm.model_type=nvgpt', 'model.data.conv_template=nvgpt', 'model.mm_cfg.vision_encoder.from_pretrained=/home/nemo/openai_weights/clip-vit-large-patch14-336', 'model.mm_cfg.vision_encoder.from_hf=True', 'model.data.image_token_len=256', 'model.optim.name=fused_adam', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.create_wandb_logger=False', 'exp_manager.wandb_logger_kwargs.project=neva_demo']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/tensorstore.py", line 123, in open_ts_array
arr = ts.open(ts.Spec(spec), open=True).result()
ValueError: NOT_FOUND: Error opening "zarr" driver: Metadata at local file "/tmp/tmpe2_bw_kv/model_weights/model.decoder.layers.self_attention.linear_qkv.layer_norm_bias/.zarray" does not exist [source locations='tensorstore/driver/kvs_backed_chunk_driver.cc:1255\ntensorstore/driver/driver.cc:114'] [tensorstore_spec='{"context":{"cache_pool":{},"data_copy_concurrency":{},"file_io_concurrency":{},"f
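For debugging, a minimal sketch (assuming the .nemo file is a plain tar archive, as NeMo checkpoints normally are) that lists which sharded tensor directories the converted checkpoint actually contains under model_weights/, and whether the key the loader asks for (…linear_qkv.layer_norm_bias) is among them. The path comes from the log above; everything else is illustrative:

```python
import tarfile

nemo_path = "/home/nemo/llama_weights/vicuna-2-7b.nemo"  # path from the log above

with tarfile.open(nemo_path, "r:*") as tar:
    names = [m.name for m in tar.getmembers() if "model_weights/" in m.name]

# In a Megatron distributed checkpoint, the first path component after
# model_weights/ is the sharded tensor key (one zarr directory per tensor).
keys = sorted({n.split("model_weights/", 1)[1].split("/", 1)[0] for n in names})
keys = [k for k in keys if k]

for k in keys:
    print(k)
print("contains linear_qkv.layer_norm_bias:",
      any("linear_qkv.layer_norm_bias" in k for k in keys))
```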

Steps/Code to reproduce bug

First, I used the following script to convert the Llama HF checkpoint to a NeMo checkpoint (I tried both Vicuna and Llama, but got the same error):

python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path /data1/weight/llama_weights/models--lmsys--vicuna-7b-v1.5 --output_path /home/nemo/llama_weights/vicuna-2-7b.nemo
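Before launching training, it can help to confirm what the converter actually wrote into the checkpoint's model_config.yaml, since some of the overrides below (normalization=layernorm1p, activation=squared-relu) differ from the RMSNorm/SwiGLU settings Llama and Vicuna use. A minimal sketch, assuming the .nemo archive contains a model_config.yaml as NeMo checkpoints normally do; the field names here are the usual NeMo GPT config keys and may differ between versions:

```python
import tarfile
import yaml

nemo_path = "/home/nemo/llama_weights/vicuna-2-7b.nemo"  # output of the command above

with tarfile.open(nemo_path, "r:*") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith("model_config.yaml"))
    cfg = yaml.safe_load(tar.extractfile(member))

# Fields chosen because the training command sets them explicitly.
for field in ("normalization", "activation", "bias",
              "position_embedding_type", "num_query_groups"):
    print(f"{field}: {cfg.get(field)}")
```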

Then, I launched the training process (I tried 1 GPU and 8 GPUs, but got the same error):

CUDA_VISIBLE_DEVICES=2 NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=1 /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
trainer.precision=bf16 \
trainer.num_nodes=1 \
trainer.devices=1 \
trainer.val_check_interval=1000 \
trainer.limit_val_batches=5 \
trainer.log_every_n_steps=1 \
trainer.max_steps=1000 \
model.megatron_amp_O2=True \
model.micro_batch_size=1 \
model.global_batch_size=2 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.mcore_gpt=True \
model.transformer_engine=True \
model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain \
model.tokenizer.library=sentencepiece \
model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model \
model.encoder_seq_length=4096 \
model.num_layers=32 \
model.hidden_size=4096 \
model.ffn_hidden_size=16384 \
model.num_attention_heads=32 \
model.normalization=layernorm1p \
model.do_layer_norm_weight_decay=False \
model.apply_query_key_layer_scaling=True \
model.activation=squared-relu \
model.headscale=False \
model.position_embedding_type=rope \
model.rotary_percentage=0.5 \
model.num_query_groups=null \
model.data.num_workers=0 \
model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo \
model.mm_cfg.llm.model_type=nvgpt \
model.data.conv_template=nvgpt \
model.mm_cfg.vision_encoder.from_pretrained='/home/nemo/openai_weights/clip-vit-large-patch14-336' \
model.mm_cfg.vision_encoder.from_hf=True \
model.data.image_token_len=256 \
model.optim.name="fused_adam" \
exp_manager.create_checkpoint_callback=True \
exp_manager.create_wandb_logger=False \
exp_manager.wandb_logger_kwargs.project=neva_demo

Expected behavior

The training should start.

Environment overview (please complete the following information)

I am on the main branch, and I use the following Docker command:

sudo docker run --runtime=nvidia --gpus all -it --rm -v ~/project/NeMo:/opt/NeMo \
-v /home/nemo:/home/nemo \
-v /data1:/data1 \
--shm-size=8g -p 8888:8888 \
--ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.01.speech

Environment details

I tried to build NeMo inside the Docker container; however, it does not work.
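A minimal sanity-check sketch, assuming the mounted /opt/NeMo tree is a git checkout (paths are from the docker command above): print which NeMo build the container actually imports and which commit the source tree is on.

```python
import subprocess
import nemo

print("nemo version :", nemo.__version__)
print("imported from:", nemo.__file__)  # pip-installed package vs. the /opt/NeMo mount
print("/opt/NeMo HEAD:", subprocess.check_output(
    ["git", "-C", "/opt/NeMo", "rev-parse", "HEAD"], text=True).strip())
```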

Additional context

8× H800 GPUs
I'm on commit 97d1abb2bca0b5daff6d434c4bb340d3bb702e86.

WeianMao added the bug label on Apr 12, 2024
WeianMao (Author) commented:

I'm on commit 97d1abb2bca0b5daff6d434c4bb340d3bb702e86.

github-actions bot (Contributor) commented:

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on May 13, 2024
github-actions bot (Contributor) commented:

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (won't fix / can't repro / duplicate / stale) on May 21, 2024