Segmentation fault when running run_finetune.py #4
Comments

Hi,
I just installed your DNABERT and downloaded the DNABERT6 model.
I created the conda env following all the steps.
Finally, I tried to run the fine-tuning example, but the code ends with a segmentation fault. The error log is below. Can you please help?
Thanks

11/30/2020 15:16:01 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 4, distributed training: False, 16-bits training: False
11/30/2020 15:16:01 - INFO - transformers.configuration_utils - loading configuration file DNABERT/pretrained/6-new-12w-0/config.json
11/30/2020 15:16:01 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnaprom",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}
11/30/2020 15:16:01 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/gc223/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
11/30/2020 15:16:01 - INFO - transformers.modeling_utils - loading weights file DNABERT/pretrained/6-new-12w-0/pytorch_model.bin
/var/spool/slurmd/job44157571/slurm_script: line 46: 9764 Segmentation fault python DNABERT/examples/run_finetune.py --model_type dna --tokenizer_name=dna$KMER --model_name_or_path $MODEL_PATH --task_name dnaprom --do_train --do_eval --data_dir $DATA_PATH --max_seq_length 75 --per_gpu_eval_batch_size=16 --per_gpu_train_batch_size=16 --learning_rate 2e-4 --num_train_epochs 3.0 --output_dir $OUTPUT_PATH --evaluate_during_training --logging_steps 100 --save_steps 4000 --warmup_percent 0.1 --hidden_dropout_prob 0.1 --overwrite_output --weight_decay 0.01 --n_process 8
This issue seems to result from the 'slurm_script'. We do not include this component in this repo. Do you use it to run experiments on a cluster? Could you please try to run our code on a single machine first?
Hi,
thanks for the prompt reply!
I tried running the script directly on the compute node, without
submitting it to the queue.
I get the same result.
Best
Gianfilippo
Hi, Thanks for letting me know! We have never encountered this problem before. Is the error log exactly the same when running on the compute node? If not, could you please show me the details? Best, Zhihan
Hi,
this is what I run:
python DNABERT/examples/run_finetune.py --model_type dna \
    --tokenizer_name=dna$KMER --model_name_or_path $MODEL_PATH --task_name dnaprom \
    --do_train --do_eval --data_dir $DATA_PATH --max_seq_length 75 \
    --per_gpu_eval_batch_size=16 --per_gpu_train_batch_size=16 --learning_rate 2e-4 \
    --num_train_epochs 3.0 --output_dir $OUTPUT_PATH --evaluate_during_training \
    --logging_steps 100 --save_steps 4000 --warmup_percent 0.1 \
    --hidden_dropout_prob 0.1 --overwrite_output --weight_decay 0.01 --n_process 8
below is the output.
Thanks
Gianfilippo
OUTPUT:
11/30/2020 16:41:14 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
11/30/2020 16:41:14 - INFO - transformers.configuration_utils - loading configuration file DNABERT/pretrained/6-new-12w-0/config.json
11/30/2020 16:41:14 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnaprom",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}
============================================================
<class 'transformers.tokenization_dna.DNATokenizer'>
11/30/2020 16:41:14 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
11/30/2020 16:41:14 - INFO - transformers.modeling_utils - loading weights file DNABERT/pretrained/6-new-12w-0/pytorch_model.bin
Segmentation fault
Hi, It is hard to locate the problem, since too many possible things may result in a segmentation fault. From the output log, the program seems to break at line 1088, which is model.to(args.device). I suspect it results from the CUDA / PyTorch / NVIDIA driver setup. What versions of them are you using?
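(For reference, a minimal sketch for dumping those versions from inside the same conda env that launches the job; nvidia-smi on the node additionally reports the installed driver version:)

import torch

print(torch.__version__)            # PyTorch version
print(torch.version.cuda)           # CUDA version this PyTorch build was compiled against
print(torch.cuda.is_available())    # whether the driver/GPU pair is actually usable
print(torch.cuda.device_count())    # how many GPUs the process can see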
Hi, I will try and reinstall the environment. My default CUDA is 10.1, but for some reason it loaded CUDA 9.2. I will post an update soon.
Hi,
ok, I reinstalled and now it runs on the compute node when I run the
script directly. But it fails if I submit it as a job to the queue (see
below).
What do you think?
Thanks
Gianfilippo
11/30/2020 19:15:55 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 4, distributed training: False, 16-bits training: False
11/30/2020 19:15:55 - INFO - transformers.configuration_utils - loading configuration file DNABERT/pretrained/6-new-12w-0/config.json
11/30/2020 19:15:55 - INFO - transformers.configuration_utils - Model config BertConfig {
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"do_sample": false,
"eos_token_ids": 0,
"finetuning_task": "dnaprom",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 12,
"num_labels": 2,
"num_return_sequences": 1,
"num_rnn_layer": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": 0,
"pruned_heads": {},
"repetition_penalty": 1.0,
"rnn": "lstm",
"rnn_dropout": 0.0,
"rnn_hidden": 768,
"split": 10,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 4101
}
11/30/2020 19:15:55 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/gc223/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
11/30/2020 19:15:56 - INFO - transformers.modeling_utils - loading weights file DNABERT/pretrained/6-new-12w-0/pytorch_model.bin
11/30/2020 19:15:58 - INFO - transformers.modeling_utils - Weights of BertForSequenceClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
11/30/2020 19:15:58 - INFO - transformers.modeling_utils - Weights from pretrained model not used in BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
11/30/2020 19:16:11 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, attention_probs_dropout_prob=0.1, beta1=0.9, beta2=0.999, cache_dir='', config_name='', data_dir='sample_data/ft/prom-core/6', device=device(type='cuda'), do_ensemble_pred=False, do_eval=True, do_lower_case=False, do_predict=False, do_train=True, do_visualize=False, early_stop=0, eval_all_checkpoints=False, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, hidden_dropout_prob=0.1, learning_rate=0.0002, local_rank=-1, logging_steps=100, max_grad_norm=1.0, max_seq_length=75, max_steps=-1, model_name_or_path='DNABERT/pretrained/6-new-12w-0', model_type='dna', n_gpu=4, n_process=8, no_cuda=False, num_rnn_layer=2, num_train_epochs=3.0, output_dir='./ft/prom-core/6', output_mode='classification', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=16, per_gpu_pred_batch_size=8, per_gpu_train_batch_size=16, predict_dir=None, predict_scan_size=1, result_dir=None, rnn='lstm', rnn_dropout=0.0, rnn_hidden=768, save_steps=4000, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=False, task_name='dnaprom', tokenizer_name='dna6', visualize_data_dir=None, visualize_models=None, visualize_train=False, warmup_percent=0.1, warmup_steps=0, weight_decay=0.01)
11/30/2020 19:16:11 - INFO - __main__ - Loading features from cached file sample_data/ft/prom-core/6/cached_train_6-new-12w-0_75_dnaprom
11/30/2020 19:16:14 - INFO - __main__ - ***** Running training *****
11/30/2020 19:16:14 - INFO - __main__ - Num examples = 53277
11/30/2020 19:16:14 - INFO - __main__ - Num Epochs = 3
11/30/2020 19:16:14 - INFO - __main__ - Instantaneous batch size per GPU = 16
11/30/2020 19:16:14 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 64
11/30/2020 19:16:14 - INFO - __main__ - Gradient Accumulation steps = 1
11/30/2020 19:16:14 - INFO - __main__ - Total optimization steps = 2499
11/30/2020 19:16:14 - INFO - __main__ - Continuing training from checkpoint, will skip to saved global_step
11/30/2020 19:16:14 - INFO - __main__ - Continuing training from epoch 0
11/30/2020 19:16:14 - INFO - __main__ - Continuing training from global step 0
11/30/2020 19:16:14 - INFO - __main__ - Will skip the first 0 steps in the first epoch
Iteration: 0%| | 0/833 [00:04<?, ?it/s]
Epoch: 0%| | 0/3 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "DNABERT/examples/run_finetune.py", line 1280, in <module>
    main()
  File "DNABERT/examples/run_finetune.py", line 1095, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "DNABERT/examples/run_finetune.py", line 272, in train
    outputs = model(**inputs)
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/user/Chris/DNABERT/src/transformers/modeling_bert.py", line 1187, in forward
    inputs_embeds=inputs_embeds,
  File "/user/conda_envs/dnabert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/user/Chris/DNABERT/src/transformers/modeling_bert.py", line 745, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration
Hi, This looks like a problem with DataParallel. When you submit it to a queue, does it run on a single machine or distributed across multiple nodes?
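(For context: the StopIteration in the traceback above is a known symptom of exactly this. Under nn.DataParallel on newer PyTorch versions, model replicas may expose no parameters, so the next(self.parameters()) call at modeling_bert.py line 745 has nothing to yield. The snippet below is only a minimal illustration of that failing pattern, not the actual DNABERT code:)

import torch
import torch.nn as nn

class Toy(nn.Module):
    def forward(self, x):
        # Same pattern as modeling_bert.py line 745: ask our own parameters
        # for a dtype. If parameters() yields nothing, as can happen for a
        # DataParallel replica, next() raises StopIteration, which
        # DataParallel then re-raises from the replica.
        return x.to(next(self.parameters()).dtype)

try:
    Toy()(torch.zeros(1))   # Toy registers no parameters, so this fails
except StopIteration:
    print("StopIteration, just like replica 0 in the log above")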
Hi,
I submit it to run on a single GPU-enabled node
Gianfilippo
Hi, Does the machine you are running on have multiple GPUs? If yes, this may result from a multi-GPU issue. In our code (line 1013), we calculate n_gpu with torch.cuda.device_count(). If n_gpu is larger than 1, we wrap the model with torch.nn.DataParallel.
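(Roughly, the logic in question looks like the following; this is a paraphrased sketch with a stand-in model, not the verbatim run_finetune.py code:)

import torch

model = torch.nn.Linear(8, 2)       # stand-in for the fine-tuning model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()   # counts every GPU the process can see

model.to(device)                    # the step the earlier segfault pointed at
if n_gpu > 1:
    # replicate the model across all visible GPUs on every forward pass
    model = torch.nn.DataParallel(model)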
Hi, you are right, the issue is related to multi-GPU usage. I submitted to a single GPU and it seems to be OK.
Hi, I think this may be related to the accessibility of the GPUs. The code utilizes all the available GPUs by default. In most cases, every visible GPU should also be accessible. In your case, however, the process can see multiple GPUs but can only access one of them, which may cause the problem. So maybe you should explicitly specify the IDs of the GPUs you want the model to use.
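(For example, you could hide all but one GPU before CUDA is initialized. A sketch, assuming GPU 0 is the one the job can actually access; the same variable can be exported in the submission script instead, and on SLURM requesting a single GPU, e.g. with --gres=gpu:1, has the same effect:)

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # must be set before CUDA starts up

import torch
print(torch.cuda.device_count())   # now reports 1, so the DataParallel branch is skipped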
Hi,
@gianfilippo did you end up resolving this issue? We are having the exact same problem, and any help would be greatly appreciated!
Hi, I did not solve the issue within the code, as I have to move forward with the project. I simply specify a single GPU when I submit the job. This is good enough for me at this point. Best