When I train the Neva model, I get the following error:
[NeMo I 2024-04-12 03:38:58 neva_model:252] Loading LLM weights from checkpoint /home/nemo/llama_weights/vicuna-2-7b.nemo Loading distributed checkpoint with TensorStoreLoadShardedStrategy Error executing job with overrides: ['trainer.precision=bf16', 'trainer.num_nodes=1', 'trainer.devices=1', 'trainer.val_check_interval=1000', 'trainer.limit_val_batches=5', 'trainer.log_every_n_steps=1', 'trainer.max_steps=1000', 'model.megatron_amp_O2=True', 'model.micro_batch_size=1', 'model.global_batch_size=2', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.mcore_gpt=True', 'model.transformer_engine=True', 'model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json', 'model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain', 'model.tokenizer.library=sentencepiece', 'model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model', 'model.encoder_seq_length=4096', 'model.num_layers=32', 'model.hidden_size=4096', 'model.ffn_hidden_size=16384', 'model.num_attention_heads=32', 'model.normalization=layernorm1p', 'model.do_layer_norm_weight_decay=False', 'model.apply_query_key_layer_scaling=True', 'model.activation=squared-relu', 'model.headscale=False', 'model.position_embedding_type=rope', 'model.rotary_percentage=0.5', 'model.num_query_groups=null', 'model.data.num_workers=0', 'model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo', 'model.mm_cfg.llm.model_type=nvgpt', 'model.data.conv_template=nvgpt', 'model.mm_cfg.vision_encoder.from_pretrained=/home/nemo/openai_weights/clip-vit-large-patch14-336', 'model.mm_cfg.vision_encoder.from_hf=True', 'model.data.image_token_len=256', 'model.optim.name=fused_adam', 'exp_manager.create_checkpoint_callback=True', 'exp_manager.create_wandb_logger=False', 'exp_manager.wandb_logger_kwargs.project=neva_demo'] Traceback (most recent call last): File 
"/usr/local/lib/python3.10/dist-packages/megatron/core/dist_checkpointing/strategies/tensorstore.py", line 123, in open_ts_array arr = ts.open(ts.Spec(spec), open=True).result() ValueError: NOT_FOUND: Error opening "zarr" driver: Metadata at local file "/tmp/tmpe2_bw_kv/model_weights/model.decoder.layers.self_attention.linear_qkv.layer_norm_bias/.zarray" does not exist [source locations='tensorstore/driver/kvs_backed_chunk_driver.cc:1255\ntensorstore/driver/driver.cc:114'] [tensorstore_spec='{"context":{"cache_pool":{},"data_copy_concurrency":{},"file_io_concurrency":{},"f
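For context: the error says TensorStore's zarr driver could not find a `.zarray` metadata file inside the unpacked checkpoint for the tensor `model.decoder.layers.self_attention.linear_qkv.layer_norm_bias`. That parameter would only exist for a normalization variant with a bias term (e.g. `layernorm1p`, as in the overrides above), so this may indicate a mismatch between the training config and the converted Llama/Vicuna checkpoint, which uses RMSNorm and carries no layer-norm bias. A minimal, illustrative sketch (the helper name and directory layout are assumptions, not NeMo API) of scanning an unpacked `model_weights` folder for tensor directories missing their Zarr metadata:

```python
import json
import tempfile
from pathlib import Path

def find_tensors_missing_zarr_metadata(model_weights_dir):
    """Return tensor subdirectories that lack the .zarray file
    TensorStore's zarr driver expects in a sharded checkpoint."""
    root = Path(model_weights_dir)
    return sorted(
        d.name for d in root.iterdir()
        if d.is_dir() and not (d / ".zarray").exists()
    )

# Demo on a throwaway directory mimicking the unpacked checkpoint layout.
root = Path(tempfile.mkdtemp()) / "model_weights"
ok = root / "model.decoder.layers.self_attention.linear_qkv.weight"
ok.mkdir(parents=True)
(ok / ".zarray").write_text(json.dumps({"zarr_format": 2}))
missing = root / "model.decoder.layers.self_attention.linear_qkv.layer_norm_bias"
missing.mkdir()

print(find_tensors_missing_zarr_metadata(root))
# → ['model.decoder.layers.self_attention.linear_qkv.layer_norm_bias']
```

If the bias tensor is genuinely absent from the converted checkpoint, the fix is likely on the config side (use the Llama-style overrides, e.g. `model.normalization=rmsnorm`, rather than the nvgpt defaults) rather than on the checkpoint side.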
Steps/Code to reproduce bug
First, I used the following script to convert the Llama HF checkpoint to a NeMo checkpoint (I tried both Vicuna and Llama, but got the same error):
python scripts/checkpoint_converters/convert_llama_hf_to_nemo.py --input_name_or_path /data1/weight/llama_weights/models--lmsys--vicuna-7b-v1.5 --output_path /home/nemo/llama_weights/vicuna-2-7b.nemo
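To verify what the conversion actually produced, it can help to list the tensors stored in the resulting `.nemo` file, which is an ordinary tar archive. A small sketch (the function name is illustrative; the demo builds a throwaway archive rather than reading the real checkpoint):

```python
import io
import tarfile
import tempfile
from pathlib import Path

def list_checkpoint_tensors(nemo_path):
    """List the top-level tensor directories stored under
    model_weights/ inside a .nemo tar archive."""
    with tarfile.open(nemo_path) as tar:
        names = set()
        for m in tar.getmembers():
            if "model_weights/" in m.name:
                rest = m.name.split("model_weights/", 1)[1]
                if rest:
                    names.add(rest.split("/", 1)[0])
    return sorted(names)

# Demo with a dummy archive mimicking the converted checkpoint layout.
demo = Path(tempfile.mkdtemp()) / "demo.nemo"
with tarfile.open(demo, "w") as tar:
    for name in [
        "model_weights/model.decoder.layers.self_attention.linear_qkv.weight/.zarray",
        "model_weights/model.embedding.word_embeddings.weight/.zarray",
    ]:
        info = tarfile.TarInfo(name)
        data = b"{}"
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

print(list_checkpoint_tensors(demo))
# → ['model.decoder.layers.self_attention.linear_qkv.weight',
#    'model.embedding.word_embeddings.weight']
```

Running this against the real `/home/nemo/llama_weights/vicuna-2-7b.nemo` would show whether `linear_qkv.layer_norm_bias` was ever written by the converter.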
Then, I launched the training process (I tried both 1 GPU and 8 GPUs, but got the same error):
CUDA_VISIBLE_DEVICES=2 NCCL_P2P_DISABLE=1 CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=1 /opt/NeMo/examples/multimodal/multimodal_llm/neva/neva_pretrain.py trainer.precision=bf16 trainer.num_nodes=1 trainer.devices=1 trainer.val_check_interval=1000 trainer.limit_val_batches=5 trainer.log_every_n_steps=1 trainer.max_steps=1000 model.megatron_amp_O2=True model.micro_batch_size=1 model.global_batch_size=2 model.tensor_model_parallel_size=1 model.pipeline_model_parallel_size=1 model.mcore_gpt=True model.transformer_engine=True model.data.data_path=/data1/data/datasets--liuhaotian--LLaVA-Pretrain/blip_laion_cc_sbu_558k.json model.data.image_folder=/data1/data/datasets--liuhaotian--LLaVA-Pretrain model.tokenizer.library=sentencepiece model.tokenizer.model=/home/nemo/llama_weights/tokenizer_neva.model model.encoder_seq_length=4096 model.num_layers=32 model.hidden_size=4096 model.ffn_hidden_size=16384 model.num_attention_heads=32 model.normalization=layernorm1p model.do_layer_norm_weight_decay=False model.apply_query_key_layer_scaling=True model.activation=squared-relu model.headscale=False model.position_embedding_type=rope model.rotary_percentage=0.5 model.num_query_groups=null model.data.num_workers=0 model.mm_cfg.llm.from_pretrained=/home/nemo/llama_weights/vicuna-2-7b.nemo model.mm_cfg.llm.model_type=nvgpt model.data.conv_template=nvgpt model.mm_cfg.vision_encoder.from_pretrained='/home/nemo/openai_weights/clip-vit-large-patch14-336' model.mm_cfg.vision_encoder.from_hf=True model.data.image_token_len=256 model.optim.name="fused_adam" exp_manager.create_checkpoint_callback=True exp_manager.create_wandb_logger=False exp_manager.wandb_logger_kwargs.project=neva_demo
Expected behavior
The training should start.
Environment overview (please complete the following information)
I am on the main branch, using the following Docker command:
sudo docker run --runtime=nvidia --gpus all -it --rm -v ~/project/NeMo:/opt/NeMo -v /home/nemo:/home/nemo -v /data1:/data1 --shm-size=8g -p 8888:8888 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:24.01.speech
Environment details
I tried to compile NeMo inside the Docker container; however, it does not work.
Additional context
8× H800 GPUs. I'm on commit 97d1abb2bca0b5daff6d434c4bb340d3bb702e86.