Segmentation fault when running run_finetune.py #4
Comments

Hi,
I just installed your DNABERT and downloaded the DNABERT6 model.
I created the conda env following all the steps.
Finally, I tried to run the fine-tuning example, but the code ends with a segmentation fault. The error log is below. Can you please help?
Thanks

11/30/2020 15:16:01 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 4, distributed training: False, 16-bits training: False
11/30/2020 15:16:01 - INFO - transformers.configuration_utils - loading configuration file DNABERT/pretrained/6-new-12w-0/config.json
11/30/2020 15:16:01 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnaprom",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}
11/30/2020 15:16:01 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/gc223/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
11/30/2020 15:16:01 - INFO - transformers.modeling_utils - loading weights file DNABERT/pretrained/6-new-12w-0/pytorch_model.bin
/var/spool/slurmd/job44157571/slurm_script: line 46: 9764 Segmentation fault python DNABERT/examples/run_finetune.py --model_type dna --tokenizer_name=dna$KMER --model_name_or_path $MODEL_PATH --task_name dnaprom --do_train --do_eval --data_dir $DATA_PATH --max_seq_length 75 --per_gpu_eval_batch_size=16 --per_gpu_train_batch_size=16 --learning_rate 2e-4 --num_train_epochs 3.0 --output_dir $OUTPUT_PATH --evaluate_during_training --logging_steps 100 --save_steps 4000 --warmup_percent 0.1 --hidden_dropout_prob 0.1 --overwrite_output --weight_decay 0.01 --n_process 8
This issue seems to result from the 'slurm_script'. We do not include this component in this repo. Do you use it to run experiments on a cluster? Could you please try to run our code on a single machine first?
Hi,
thanks for the prompt reply!
I tried running the script directly on the compute node, without
submitting it to the queue.
I get the same result.
Best
Gianfilippo
Hi, Thanks for letting me know! We have never encountered this problem before. Is the error log exactly the same when running on the compute node? If not, could you please show me the details? Best, Zhihan
Hi,
this is what I run:
python DNABERT/examples/run_finetune.py --model_type dna \
    --tokenizer_name=dna$KMER --model_name_or_path $MODEL_PATH --task_name dnaprom \
    --do_train --do_eval --data_dir $DATA_PATH --max_seq_length 75 \
    --per_gpu_eval_batch_size=16 --per_gpu_train_batch_size=16 --learning_rate 2e-4 \
    --num_train_epochs 3.0 --output_dir $OUTPUT_PATH --evaluate_during_training \
    --logging_steps 100 --save_steps 4000 --warmup_percent 0.1 \
    --hidden_dropout_prob 0.1 --overwrite_output --weight_decay 0.01 --n_process 8
below is the output.
Thanks
Gianfilippo
OUTPUT:
11/30/2020 16:41:14 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
11/30/2020 16:41:14 - INFO - transformers.configuration_utils - loading configuration file DNABERT/pretrained/6-new-12w-0/config.json
11/30/2020 16:41:14 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnaprom",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}
============================================================
<class 'transformers.tokenization_dna.DNATokenizer'>
11/30/2020 16:41:14 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
11/30/2020 16:41:14 - INFO - transformers.modeling_utils - loading weights file DNABERT/pretrained/6-new-12w-0/pytorch_model.bin
Segmentation fault
Hi, It is hard to locate the problem, since too many possible things may result in a segmentation fault. From the output log, the program seems to break at line 1088, which is model.to(args.device). I suspect it results from the CUDA / PyTorch / NVIDIA driver setup. What versions of them are you using?
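(For reference, a minimal sketch for dumping those versions from inside the same conda env that launches the job; nvidia-smi on the node additionally reports the installed driver version:)

import torch

print(torch.__version__)            # PyTorch version
print(torch.version.cuda)           # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())    # whether the driver/GPU pair is actually usable
print(torch.cuda.device_count())    # how many GPUs the process can see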
Hi, I will try and reinstall the environment. My default CUDA is 10.1, but for some reason it loaded CUDA 9.2. I will post an update soon.
Hi,
ok, I reinstalled and now it runs on the compute node when I run the
script directly. But it fails if I submit it as a job to the queue (see
below).
What do you think?
Thanks
Gianfilippo
11/30/2020 19:15:55 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 4, distributed training: False, 16-bits training: False
11/30/2020 19:15:55 - INFO - transformers.configuration_utils - loading configuration file DNABERT/pretrained/6-new-12w-0/config.json
11/30/2020 19:15:55 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnaprom",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}
11/30/2020 19:15:55 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/gc223/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
11/30/2020 19:15:56 - INFO - transformers.modeling_utils - loading weights file DNABERT/pretrained/6-new-12w-0/pytorch_model.bin
11/30/2020 19:15:58 - INFO - transformers.modeling_utils - Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
11/30/2020 19:15:58 - INFO - transformers.modeling_utils - Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
11/30/2020 19:16:11 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, attention_probs_dropout_prob=0.1, beta1=0.9, beta2=0.999, cache_dir='', config_name='', data_dir='sample_data/ft/prom-core/6', device=device(type='cuda'), do_ensemble_pred=False, do_eval=True, do_lower_case=False, do_predict=False, do_train=True, do_visualize=False, early_stop=0, eval_all_checkpoints=False, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, hidden_dropout_prob=0.1, learning_rate=0.0002, local_rank=-1, logging_steps=100, max_grad_norm=1.0, max_seq_length=75, max_steps=-1, model_name_or_path='DNABERT/pretrained/6-new-12w-0', model_type='dna', n_gpu=4, n_process=8, no_cuda=False, num_rnn_layer=2, num_train_epochs=3.0, output_dir='./ft/prom-core/6', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=16, per_gpu_pred_batch_size=8, per_gpu_train_batch_size=16, predict_dir=None, predict_scan_size=1, result_dir=None, rnn='lstm', rnn_dropout=0.0, rnn_hidden=768, save_steps=4000, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=False, task_name='dnaprom', tokenizer_name='dna6', visualize_data_dir=None, visualize_models=None, visualize_train=False, warmup_percent=0.1, warmup_steps=0, weight_decay=0.01)
11/30/2020 19:16:11 - INFO - __main__ - Loading features from cached file sample_data/ft/prom-core/6/cached_train_6-new-12w-0_75_dnaprom
11/30/2020 19:16:14 - INFO - __main__ - ***** Running training *****
11/30/2020 19:16:14 - INFO - __main__ - Num examples = 53277
11/30/2020 19:16:14 - INFO - __main__ - Num Epochs = 3
11/30/2020 19:16:14 - INFO - __main__ - Instantaneous batch size per GPU = 16
11/30/2020 19:16:14 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 64
11/30/2020 19:16:14 - INFO - __main__ - Gradient Accumulation steps = 1
11/30/2020 19:16:14 - INFO - __main__ - Total optimization steps = 2499
11/30/2020 19:16:14 - INFO - __main__ - Continuing training from checkpoint, will skip to saved global_step
11/30/2020 19:16:14 - INFO - __main__ - Continuing training from epoch 0
11/30/2020 19:16:14 - INFO - __main__ - Continuing training from global step 0
11/30/2020 19:16:14 - INFO - __main__ - Will skip the first 0 steps in the first epoch
Iteration: 0%| | 0/833 [00:04<?, ?it/s]
Epoch: 0%| | 0/3 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "DNABERT/examples/run_finetune.py", line 1280, in <module>
    main()
  File "DNABERT/examples/run_finetune.py", line 1095, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "DNABERT/examples/run_finetune.py", line 272, in train
    outputs = model(**inputs)
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/user/Chris/DNABERT/src/transformers/modeling_bert.py", line 1187, in forward
    inputs_embeds=inputs_embeds,
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/user/Chris/DNABERT/src/transformers/modeling_bert.py", line 745, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration
Hi, This looks like a problem with DataParallel. When you submit it to a queue, does it run on a single machine or distributed across multiple nodes?
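(For context: the StopIteration in the traceback above is a known symptom of exactly this. Under nn.DataParallel on newer PyTorch versions, model replicas may expose no parameters, so the next(self.parameters()) call at modeling_bert.py line 745 has nothing to yield. The snippet below is only a minimal illustration of that failing pattern, not the actual DNABERT code:)

import torch
import torch.nn as nn

class Toy(nn.Module):
    def forward(self, x):
        # Same pattern as modeling_bert.py line 745: ask our own parameters
        # for a dtype. If parameters() yields nothing, as can happen for a
        # DataParallel replica, next() raises StopIteration, which
        # DataParallel then re-raises from the replica.
        return x.to(next(self.parameters()).dtype)

try:
    Toy()(torch.zeros(1))   # Toy registers no parameters, so this fails
except StopIteration:
    print("StopIteration, just like replica 0 in the log above")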
Hi,
I submit it to run on a single GPU-enabled node
Gianfilippo
Hi, Does the machine you are running on have multiple GPUs? If yes, this may result from a multi-GPU issue. In our code (line 1013), we calculate n_gpu with torch.cuda.device_count(). If n_gpu is larger than 1, we wrap the model with torch.nn.DataParallel.
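(Roughly, the logic in question looks like the following; this is a paraphrased sketch with a stand-in model, not the verbatim run_finetune.py code:)

import torch

model = torch.nn.Linear(8, 2)       # stand-in for the fine-tuning model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()   # counts every GPU the process can see

model.to(device)                    # the step the earlier segfault pointed at
if n_gpu > 1:
    # replicate the model across all visible GPUs on every forward pass
    model = torch.nn.DataParallel(model)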
Hi, you are right, the issue is related to multi-GPU usage. I submitted to a single GPU and it seems to be OK.
Hi, I think this may be related to the accessibility of the GPUs. The code utilizes all the available GPUs by default. In most cases, every visible GPU should also be accessible. In your case, however, the process can see multiple GPUs but can only access one of them, which may cause the problem. So maybe you should explicitly specify the IDs of the GPUs you want the model to use.
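(For example, you could hide all but one GPU before CUDA is initialized. A sketch, assuming GPU 0 is the one the job can actually access; the same variable can be exported in the submission script instead, and on SLURM requesting a single GPU, e.g. with --gres=gpu:1, has the same effect:)

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # must be set before CUDA starts up

import torch
print(torch.cuda.device_count())   # now reports 1, so the DataParallel branch is skipped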
Hi,
@gianfilippo did you end up resolving this issue? We are having the exact same problem, and any help would be greatly appreciated!
Hi, I did not solve the issue within the code, as I have to move forward with the project. I simply specify a single GPU when I submit the job. This is good enough for me at this point. Best