
error while running examples - segmentation fault #7

Closed

alexandremarcil opened this issue Jan 5, 2021 · 5 comments

@alexandremarcil

Hi,
I'm trying to run the example. I created the dnabert env and downloaded the packages and files. I get an error at step 3.3 while trying to run the fine-tune with the pre-trained model (DNABERT6). I get the following error message:

<class 'transformers.tokenization_dna.DNATokenizer'>
01/05/2021 17:08:16 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/mcb/users/zipcode/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
01/05/2021 17:08:16 - INFO - transformers.modeling_utils - loading weights file /home/mcb/users/zipcode/code/DNABERT/6-new-12w-0/pytorch_model.bin
Segmentation fault (core dumped)

I have tried re-downloading the pretrained model, but got the same error. Strangely, I do not get this error locally on my Mac, but without any GPU it would take too long to run there. I get this error on a Linux server.

Any ideas on how to fix this? Thanks!

@alexandremarcil (Author) commented Jan 5, 2021

I just saw that issue #4 had a similar problem and that the CUDA drivers were the cause. I have tried reinstalling them with
conda install -c anaconda cudatoolkit
conda install -c anaconda cudnn
(there was a long list of conflicts...)
but it did not help. I am still getting the segmentation fault.

I have CUDA v10.2:
(dnabert) zipcode@mcb-gpu1:~/code/DNABERT$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89

NVIDIA-SMI 440.59 Driver Version: 440.59 CUDA Version: 10.2

I've never had such CUDA issues on other PyTorch projects, so I don't really know how to troubleshoot this problem.

Update: I also tried with the --no_cuda arg, but I am still getting the same segmentation fault.
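A quick sanity check (standard PyTorch introspection, offered here as a suggestion rather than something tried in the thread) is to confirm that the PyTorch build inside the dnabert env matches the driver's CUDA:

# prints the torch version, the CUDA version torch was built against,
# and whether torch can see a usable GPU at all
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

A mismatch between torch.version.cuda and the driver's CUDA version is a common cause of crashes while loading model weights.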

@Zhihan1996 (Collaborator)

The problem happens while loading the model. Do you have any unavailable GPUs? Can you try specifying GPUs by adding CUDA_VISIBLE_DEVICES=0,1 before python run_... ?
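For example (CUDA_VISIBLE_DEVICES is the standard CUDA environment variable; the GPU indices and the <args> placeholder are illustrative):

# expose only GPUs 0 and 1 to the process before launching fine-tuning
CUDA_VISIBLE_DEVICES=0,1 python run_finetune.py <args as in the example script>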

@alexandremarcil (Author)

Yes, there are 10 GPUs on the server and some are in use. The fix you proposed did not change anything. I hard-coded a free GPU (8) in run_finetune.py to check if that was the problem and still got the same error. Could you please explain what the local_rank arg does? I am not sure I understand it correctly. Here's the error I get:

01/09/2021 18:29:41 - WARNING - main - Process rank: -1, device: cuda:8, n_gpu: 1, distributed training: False, 16-bits training: False
01/09/2021 18:29:41 - INFO - transformers.configuration_utils - loading configuration file /home/mcb/users/zipcode/code/DNABERT/6-new-12w-0/config.json
01/09/2021 18:29:41 - INFO - transformers.configuration_utils - Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "do_sample": false,
  "eos_token_ids": 0,
  "finetuning_task": "dnaprom",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "num_return_sequences": 1,
  "num_rnn_layer": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 0,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "rnn": "lstm",
  "rnn_dropout": 0.0,
  "rnn_hidden": 768,
  "split": 10,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 4101
}

============================================================
<class 'transformers.tokenization_dna.DNATokenizer'>
01/09/2021 18:29:41 - INFO - transformers.tokenization_utils - loading file https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/vocab.txt from cache at /home/mcb/users/zipcode/.cache/torch/transformers/ea1474aad40c1c8ed4e1cb7c11345ddda6df27a857fb29e1d4c901d9b900d32d.26f8bd5a32e49c2a8271a46950754a4a767726709b7741c68723bc1db840a87e
01/09/2021 18:29:41 - INFO - transformers.modeling_utils - loading weights file /home/mcb/users/zipcode/code/DNABERT/6-new-12w-0/pytorch_model.bin
./test.sh: line 32: 28758 Segmentation fault (core dumped) python run_finetune.py --model_type dna --tokenizer_name=dna$KMER --model_name_or_path $MODEL_PATH --task_name dnaprom --do_train --do_eval --data_dir $DATA_PATH --max_seq_length 75 --per_gpu_eval_batch_size=16 --per_gpu_train_batch_size=16 --learning_rate 2e-4 --num_train_epochs 3.0 --output_dir $OUTPUT_PATH --evaluate_during_training --logging_steps 100 --save_steps 4000 --warmup_percent 0.1 --hidden_dropout_prob 0.1 --overwrite_output --weight_decay 0.01 --n_process 8
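For reference on the local_rank question: in the transformers example scripts this one is based on, --local_rank is normally injected by the PyTorch distributed launcher, and its default of -1 means plain single-process training (which matches the "Process rank: -1, distributed training: False" line in the log above). A sketch of the two launch modes, with <args> standing in for the usual flags:

# single-process run: local_rank stays -1 and the script picks a device itself
python run_finetune.py <args>

# distributed run: the launcher starts one process per GPU and passes
# --local_rank=0..N-1 to each process
python -m torch.distributed.launch --nproc_per_node=2 run_finetune.py <args>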

@hkmztrk commented Jan 13, 2021

@alexandremarcil (Author)

Thanks @hkmztrk. I downgraded sentencepiece to 0.1.91 and I no longer get the segmentation fault, but I have other issues :(

Here's the fix for anyone else having this issue:
pip install sentencepiece==0.1.91
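To confirm the downgrade took effect in the active env (standard pip/Python checks):

pip show sentencepiece
python -c "import sentencepiece; print(sentencepiece.__version__)"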
