CUBLAS_STATUS_INTERNAL_ERROR at examples/question-answering/run_qa.py #10592

Closed
2 of 4 tasks
LozanoAlvarezb opened this issue Mar 8, 2021 · 4 comments

@LozanoAlvarezb

Environment info

  • transformers version: 4.4.0.dev0
  • Platform: Linux-5.10.20-1-lts-x86_64-with-glibc2.2.5
  • Python version: 3.8.3
  • PyTorch version (GPU?): 1.8.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

#!/bin/bash

python3 -m venv env
source env/bin/activate
pip install torch
pip install datasets
git clone https://github.com/huggingface/transformers.git
pip install -e transformers/
python transformers/examples/question-answering/run_qa.py \
	--model_name_or_path bert-base-uncased \
	--dataset_name squad \
	--do_train \
	--do_eval \
	--per_device_train_batch_size 1 \
	--learning_rate 3e-5 \
	--num_train_epochs 4 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir ./models/
03/08/2021 09:54:08 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 2distributed training: False, 16-bits training: False
03/08/2021 09:54:08 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=./models/, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=4.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Mar08_09-54-06_inf-105-gpu-1, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=./models/, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=2)
03/08/2021 09:54:08 - WARNING - datasets.builder -   Reusing dataset squad (/home/blozano/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9)
[INFO|configuration_utils.py:463] 2021-03-08 09:54:09,206 >> loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /home/blozano/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.637c6035640bacb831febcc2b7f7bee0a96f9b30c2d7e9ef84082d9f252f3170
[INFO|configuration_utils.py:499] 2021-03-08 09:54:09,207 >> Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.4.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

[INFO|configuration_utils.py:463] 2021-03-08 09:54:09,509 >> loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /home/blozano/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.637c6035640bacb831febcc2b7f7bee0a96f9b30c2d7e9ef84082d9f252f3170
[INFO|configuration_utils.py:499] 2021-03-08 09:54:09,510 >> Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.4.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

[INFO|tokenization_utils_base.py:1721] 2021-03-08 09:54:10,138 >> loading file https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt from cache at /home/blozano/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
[INFO|tokenization_utils_base.py:1721] 2021-03-08 09:54:10,138 >> loading file https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json from cache at /home/blozano/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4
[INFO|modeling_utils.py:1051] 2021-03-08 09:54:10,501 >> loading weights file https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin from cache at /home/blozano/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
[WARNING|modeling_utils.py:1158] 2021-03-08 09:54:12,594 >> Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[WARNING|modeling_utils.py:1169] 2021-03-08 09:54:12,594 >> Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
03/08/2021 09:54:12 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/blozano/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9/cache-a560de6b2f76743b.arrow
03/08/2021 09:54:12 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/blozano/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9/cache-15b011eed342eca6.arrow
[INFO|trainer.py:471] 2021-03-08 09:54:15,885 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping.
[INFO|trainer.py:929] 2021-03-08 09:54:15,937 >> ***** Running training *****
[INFO|trainer.py:930] 2021-03-08 09:54:15,937 >>   Num examples = 88524
[INFO|trainer.py:931] 2021-03-08 09:54:15,937 >>   Num Epochs = 4
[INFO|trainer.py:932] 2021-03-08 09:54:15,937 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:933] 2021-03-08 09:54:15,937 >>   Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:934] 2021-03-08 09:54:15,937 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:935] 2021-03-08 09:54:15,937 >>   Total optimization steps = 177048
  0%|          | 0/177048 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "transformers/examples/question-answering/run_qa.py", line 507, in <module>
    main()
  File "transformers/examples/question-answering/run_qa.py", line 481, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/blozano/finetune_qa/transformers/src/transformers/trainer.py", line 1036, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/trainer.py", line 1420, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/trainer.py", line 1452, in compute_loss
    outputs = model(**inputs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 1775, in forward
    outputs = self.bert(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 971, in forward
    encoder_outputs = self.encoder(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 568, in forward
    layer_outputs = layer_module(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 456, in forward
    self_attention_outputs = self.attention(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 387, in forward
    self_outputs = self.self(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 253, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2  
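
For reference, the header line above can be split into its driver and CUDA fields with a few lines of Python. This is only a sketch: the function name `parse_nvidia_smi_header` is hypothetical, and the "CUDA Version" that nvidia-smi prints is the highest CUDA runtime the installed driver supports, which is not necessarily the runtime PyTorch was built against.

```python
import re


def parse_nvidia_smi_header(line):
    """Extract the driver and CUDA versions from an nvidia-smi header line.

    Hypothetical helper for illustration; returns None for a missing field.
    """
    driver = re.search(r"Driver Version:\s*([\d.]+)", line)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", line)
    return {
        "driver": driver.group(1) if driver else None,
        "cuda": cuda.group(1) if cuda else None,
    }


# Header line copied from the report above.
header = "NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2"
info = parse_nvidia_smi_header(header)
print(info)
```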

Expected behavior

The script should train and evaluate the model as described in transformers/examples/question-answering/README.md.

@LysandreJik
Member

Hi! I don't think torch supports CUDA 11.2 yet. See pytorch/pytorch#50232 (comment)
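
To check which CUDA runtime a given PyTorch install was actually built with, something like the following may help. It is a minimal sketch, not an official diagnostic: the helper name `describe_torch_cuda` and the dictionary keys are made up for illustration, and it only assumes that `torch` may or may not be importable.

```python
def describe_torch_cuda():
    """Collect the PyTorch and CUDA details relevant to this error.

    Hypothetical helper; returns {"torch": None} if PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA runtime the wheel was built with (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


report = describe_torch_cuda()
print(report)
```

Comparing `built_with_cuda` against the driver information from nvidia-smi shows whether the two sides are expected to work together.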

@LittlePea13

I had a similar issue with torch 1.8 and solved it by downgrading to 1.7.1.

@LozanoAlvarezb
Author

> Hi! I don't think torch supports CUDA 11.2 yet. See pytorch/pytorch#50232 (comment)

Thanks for the quick response. I just tested the script with CUDA 11.1 and it worked fine.
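
For anyone who wants to verify an install without launching the full fine-tuning run, a small smoke test along these lines should exercise the same code path that failed in the traceback (a single `torch.nn.Linear` forward pass on the GPU, which makes PyTorch create a cuBLAS handle). This is a sketch under assumptions: the function name `cublas_smoke_test` and its return strings are hypothetical, the tensor shape just mirrors `hidden_size` 768 and `max_seq_length` 384 from the report, and it only uses `cuda:0`, so it does not reproduce the two-GPU `DataParallel` setup.

```python
def cublas_smoke_test(hidden_size=768, seq_len=384):
    """Run one linear layer on the GPU and report what happened.

    Hypothetical helper; returns a short status string instead of raising.
    """
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cuda-not-available"
    try:
        layer = torch.nn.Linear(hidden_size, hidden_size).to("cuda:0")
        hidden_states = torch.randn(1, seq_len, hidden_size, device="cuda:0")
        layer(hidden_states)
        # CUDA calls are asynchronous; wait so that any error surfaces here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        return f"failed: {err}"
    return "ok"


status = cublas_smoke_test()
print(status)
```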

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.