
The error in loading Llama pretrain checkpoint for NeVa (LLaVA) #8898

Closed
WeianMao opened this issue Apr 12, 2024 · 3 comments
Labels: bug (Something isn't working), stale

Comments


WeianMao commented Apr 12, 2024

When I train the NeVa model, I get the following error:

[NeMo I 2024-04-12 03:38:58 neva_model:252] Loading LLM weights from checkpoint /home/nemo/llama_weights/vicuna-2-7b.nemo
Loading distributed checkpoint with TensorStoreLoadShardedStrategy
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.val_check_interval=1000', 'trainer.limit_val_batches=5', 'trainer.log_every_n_steps=1', 'trainer.max_steps=1000', 'model.megatron_amp_O2=True', 'model.micro_batch_size=1', 'model.global_batch_size=2', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.mcore_gpt=True', 'model.transformer_engine=True', 'model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json', 'model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain', 'model.tokenizer.library=sentencepiece', 'model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model', 'model.encoder_seq_length=4096', 'model.num_layers=32', 'model.hidden_size=4096', 'model.ffn_hidden_size=16384', 'model.num_attention_heads=32', 'model.normalization=layernorm1p', 'model.do_layer_norm_weight_decay=False', 'model.apply_query_key_layer_scaling=True', 'model.activation=squared-relu', 'model.headscale=False', 'model.position_embedding_type=rope', 'model.rotary_percentage=0.5', 'model.num_query_groups=null', 'model.data.num_workers=0', 'model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo', 'model.mm_cfg.llm.model_type=nvgpt', 'model.data.conv_template=nvgpt', 'model.mm_cfg.vision_encoder.from_pretrained=/home/nemo/openai_weights/clip-vit-large-patch14-336', 'model.mm_cfg.vision_encoder.from_hf=True', 'model.data.image_token_len=256', 'model.optim.name=fused_adam', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.create_wandb_logger=False', 'exp_manager.wandb_logger_kwargs.project=neva_demo']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/tensorstore.py", line 123, in open_ts_array
arr = ts.open(ts.Spec(spec), open=True).result()
ValueError: NOT_FOUND: Error opening "zarr" driver: Metadata at local file "/tmp/tmpe2_bw_kv/model_weights/model.decoder.layers.self_attention.linear_qkv.layer_norm_bias/.zarray" does not exist [source locations='tensorstore/driver/kvs_backed_chunk_driver.cc:1255\ntensorstore/driver/driver.cc:114'] [tensorstore_spec='{"context":{"cache_pool":{},"data_copy_concurrency":{},"file_io_concurrency":{},"f
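For debugging, a minimal sketch (assuming the .nemo file is a plain tar archive, as NeMo checkpoints normally are) that lists which sharded tensor directories the converted checkpoint actually contains under model_weights/, and whether the key the loader asks for (…linear_qkv.layer_norm_bias) is among them. The path comes from the log above; everything else is illustrative:

```python
import tarfile

nemo_path = "/home/nemo/llama_weights/vicuna-2-7b.nemo"  # path from the log above

with tarfile.open(nemo_path, "r:*") as tar:
    names = [m.name for m in tar.getmembers() if "model_weights/" in m.name]

# In a Megatron distributed checkpoint, the first path component after
# model_weights/ is the sharded tensor key (one zarr directory per tensor).
keys = sorted({n.split("model_weights/", 1)[1].split("/", 1)[0] for n in names})
keys = [k for k in keys if k]

for k in keys:
    print(k)
print("contains linear_qkv.layer_norm_bias:",
      any("linear_qkv.layer_norm_bias" in k for k in keys))
```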

Steps/Code to reproduce bug

First, I used the following script to convert the Llama HF checkpoint to a NeMo checkpoint (I tried both Vicuna and Llama, but got the same error):

python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path /data1/weight/llama_weights/models--lmsys--vicuna-7b-v1.5 --output_path /home/nemo/llama_weights/vicuna-2-7b.nemo
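Before launching training, it can help to confirm what the converter actually wrote into the checkpoint's model_config.yaml, since some of the overrides below (normalization=layernorm1p, activation=squared-relu) differ from the RMSNorm/SwiGLU settings Llama and Vicuna use. A minimal sketch, assuming the .nemo archive contains a model_config.yaml as NeMo checkpoints normally do; the field names here are the usual NeMo GPT config keys and may differ between versions:

```python
import tarfile
import yaml

nemo_path = "/home/nemo/llama_weights/vicuna-2-7b.nemo"  # output of the command above

with tarfile.open(nemo_path, "r:*") as tar:
    member = next(m for m in tar.getmembers() if m.name.endswith("model_config.yaml"))
    cfg = yaml.safe_load(tar.extractfile(member))

# Fields chosen because the training command sets them explicitly.
for field in ("normalization", "activation", "bias",
              "position_embedding_type", "num_query_groups"):
    print(f"{field}: {cfg.get(field)}")
```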

Then, I launched the training process (I tried 1 GPU and 8 GPUs, but got the same error):

CUDA_VISIBLE_DEVICES=2 NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=1 /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py \
trainer.precision=bf16 \
trainer.num_nodes=1 \
trainer.devices=1 \
trainer.val_check_interval=1000 \
trainer.limit_val_batches=5 \
trainer.log_every_n_steps=1 \
trainer.max_steps=1000 \
model.megatron_amp_O2=True \
model.micro_batch_size=1 \
model.global_batch_size=2 \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.mcore_gpt=True \
model.transformer_engine=True \
model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain \
model.tokenizer.library=sentencepiece \
model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model \
model.encoder_seq_length=4096 \
model.num_layers=32 \
model.hidden_size=4096 \
model.ffn_hidden_size=16384 \
model.num_attention_heads=32 \
model.normalization=layernorm1p \
model.do_layer_norm_weight_decay=False \
model.apply_query_key_layer_scaling=True \
model.activation=squared-relu \
model.headscale=False \
model.position_embedding_type=rope \
model.rotary_percentage=0.5 \
model.num_query_groups=null \
model.data.num_workers=0 \
model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo \
model.mm_cfg.llm.model_type=nvgpt \
model.data.conv_template=nvgpt \
model.mm_cfg.vision_encoder.from_pretrained='/home/nemo/openai_weights/clip-vit-large-patch14-336' \
model.mm_cfg.vision_encoder.from_hf=True \
model.data.image_token_len=256 \
model.optim.name="fused_adam" \
exp_manager.create_checkpoint_callback=True \
exp_manager.create_wandb_logger=False \
exp_manager.wandb_logger_kwargs.project=neva_demo

Expected behavior

The training should start.

Environment overview (please complete the following information)

I am on the main branch, and I use the following Docker command:

sudo docker run --runtime=nvidia --gpus all -it --rm -v ~/project/NeMo:/opt/NeMo \
-v /home/nemo:/home/nemo \
-v /data1:/data1 \
--shm-size=8g -p 8888:8888 \
--ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.01.speech

Environment details

I tried to build NeMo inside the Docker container; however, it does not work.
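A minimal sanity-check sketch, assuming the mounted /opt/NeMo tree is a git checkout (paths are from the docker command above): print which NeMo build the container actually imports and which commit the source tree is on.

```python
import subprocess
import nemo

print("nemo version :", nemo.__version__)
print("imported from:", nemo.__file__)  # pip-installed package vs. the /opt/NeMo mount
print("/opt/NeMo HEAD:", subprocess.check_output(
    ["git", "-C", "/opt/NeMo", "rev-parse", "HEAD"], text=True).strip())
```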

Additional context

8× H800 GPUs
I'm on commit 97d1abb2bca0b5daff6d434c4bb340d3bb702e86.

WeianMao added the bug label on Apr 12, 2024
WeianMao (Author) commented:

I'm on commit 97d1abb2bca0b5daff6d434c4bb340d3bb702e86.

github-actions bot (Contributor) commented:

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on May 13, 2024
github-actions bot (Contributor) commented:

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned (won't fix / can't repro / duplicate / stale) on May 21, 2024