Segmentation fault when running run_finetune.py #4

Closed
gianfilippo opened this issue Nov 30, 2020 · 15 comments
Labels
bug Something isn't working

Comments

@gianfilippo

Hi,

I just installed your DNABERT and downloaded the DNABERT6 model.
I created the conda env following all the steps.
Finally, I tried to run the fine-tuning example, but the code ends with a segmentation fault. The error log is below. Can you please help?

Thanks

11/30/2020 15:16:01 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 4, distributed training: False, 16-bits training: False
11/30/2020 15:16:01 - INFO - transformers.configuration_utils - loading configuration file DNABERT/pretrained/6-new-12w-0/config.json
11/30/2020 15:16:01 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnaprom",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}

11/30/2020 15:16:01 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/gc223/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
11/30/2020 15:16:01 - INFO - transformers.modeling_utils - loading weights file DNABERT/pretrained/6-new-12w-0/pytorch_model.bin
/var/spool/slurmd/job44157571/slurm_script: line 46: 9764 Segmentation fault python DNABERT/examples/run_finetune.py --model_type dna --tokenizer_name=dna$KMER --model_name_or_path $MODEL_PATH --task_name dnaprom --do_train --do_eval --data_dir $DATA_PATH --max_seq_length 75 --per_gpu_eval_batch_size=16 --per_gpu_train_batch_size=16 --learning_rate 2e-4 --num_train_epochs 3.0 --output_dir $OUTPUT_PATH --evaluate_during_training --logging_steps 100 --save_steps 4000 --warmup_percent 0.1 --hidden_dropout_prob 0.1 --overwrite_output --weight_decay 0.01 --n_process 8

@Zhihan1996
Collaborator

This issue seems to come from the 'slurm_script', which is not part of this repo. Are you using it to run experiments on a cluster? Could you please try to run our code on a single machine first?

@gianfilippo
Author

gianfilippo commented Nov 30, 2020 via email

@Zhihan1996
Collaborator

Hi,

Thanks for letting me know!

We have never run into this problem before. Is the error log exactly the same when running on the compute node? If not, could you please share the details?

Best,
Zhihan

@gianfilippo
Author

gianfilippo commented Nov 30, 2020 via email

@Zhihan1996
Collaborator

Hi,

It is hard to locate the problem since many different things can cause a segmentation fault. From the output log, the program seems to break at line 1088, which is model.to(args.device). I suspect it comes from the CUDA/PyTorch/NVIDIA-driver combination. Which versions of these are you using?
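
If it helps, a quick diagnostic along these lines (just a sketch, not part of the repo) prints the versions that matter for a crash in model.to(args.device); the driver version itself comes from running nvidia-smi on the node:

```python
# Quick diagnostic sketch (not part of DNABERT): dump the PyTorch/CUDA/GPU
# details relevant to a crash in model.to(device).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA (built against):", torch.version.cuda)   # toolkit PyTorch was compiled with
print("cuDNN:", torch.backends.cudnn.version())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```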

@gianfilippo
Author

Hi,

I will try to reinstall the environment. My default CUDA is 10.1, but for some reason it loaded CUDA 9.2. I will post an update soon.
Thanks

@gianfilippo
Author

gianfilippo commented Dec 1, 2020 via email

@Zhihan1996
Collaborator

Hi,

This looks like a DataParallel problem. When you submit it to a queue, does it run on a single machine, or is it distributed across multiple nodes?

@gianfilippo
Author

gianfilippo commented Dec 1, 2020 via email

@Zhihan1996
Collaborator

Hi,

Does the machine you are running on have multiple GPUs? If so, this may be a multi-GPU issue. In our code (line 1013), we compute n_gpu with torch.cuda.device_count(). If n_gpu is larger than 1, we use DataParallel to run the model on multiple GPUs. Maybe you should manually set args.n_gpu to 1.
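
Roughly, the relevant logic is the following (a paraphrased, self-contained sketch rather than the exact source; the Linear layer is just a placeholder for the fine-tuning model):

```python
# Paraphrased sketch of the device setup around lines 1013 and 1088 of
# run_finetune.py (not the exact source).
import argparse
import torch

args = argparse.Namespace(no_cuda=False)
args.device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = torch.cuda.device_count()   # ~line 1013

model = torch.nn.Linear(10, 2)           # placeholder for the BERT fine-tuning model

model.to(args.device)                    # ~line 1088, where your run segfaults
if args.n_gpu > 1:                       # multi-GPU path
    model = torch.nn.DataParallel(model)

# Workaround: set args.n_gpu = 1 (or hide the extra GPUs) so the
# DataParallel branch is never taken.
```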

@gianfilippo
Author

Hi,

you are right, the issue is related to multi-GPU usage. I submitted to a single GPU and it seems to be OK.
What do you think the problem is with multi-GPU? As far as I know, DataParallel should work.

@Zhihan1996
Collaborator

Hi,

I think this may be related to GPU accessibility. The code uses all visible GPUs by default. In most cases, every visible GPU is also accessible. In your case, however, the process may see multiple GPUs but only be able to access one of them, which could cause the problem. So maybe you should explicitly specify the ID of the GPU(s) you want the model to use.
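
One common way to do that (a general CUDA technique, not a flag the script exposes) is to restrict which devices the process can see before CUDA is initialized; the shell equivalent is export CUDA_VISIBLE_DEVICES=0 in your submission script:

```python
# Pin the process to a single GPU by hiding the others. The variable must be
# set before the first CUDA call.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0

import torch
print(torch.cuda.device_count())           # now reports 1, so DataParallel is skipped
```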

@gianfilippo
Author

Hi,
that seems reasonable. I will need to find time to look into it.
Thanks for your help.

@jerryji1993 jerryji1993 added the bug Something isn't working label Dec 4, 2020
@jerryji1993 jerryji1993 reopened this Dec 4, 2020
@cipherome-minkim

@gianfilippo did you end up resolving this issue? We are having the exact same problem, and any help would be greatly appreciated!

@gianfilippo
Author

Hi,

I did not solve the issue within the code, as I had to move forward with the project.

I simply specify a single GPU when I submit the job. This is good enough for me at this point.

Best
