CUBLAS_STATUS_INTERNAL_ERROR at examples/question-answering/run_qa.py #10592

LozanoAlvarezb · 2021-03-08T09:02:00Z

Environment info

transformers version: 4.4.0.dev0
Platform: Linux-5.10.20-1-lts-x86_64-with-glibc2.2.5
Python version: 3.8.3
PyTorch version (GPU?): 1.8.0 (True)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: (give details below)

The task I am working on is:

an official GLUE/SQUaD task: (give the name)
my own task or dataset: (give details below)

To reproduce

#!/bin/bash

python3 -m venv env
source env/bin/activate
pip install torch
pip install datasets
git clone https://github.com/huggingface/transformers.git
pip install -e transformers/
python transformers/examples/question-answering/run_qa.py \
	--model_name_or_path bert-base-uncased \
	--dataset_name squad \
	--do_train \
	--do_eval \
	--per_device_train_batch_size 1 \
	--learning_rate 3e-5 \
	--num_train_epochs 4 \
	--max_seq_length 384 \
	--doc_stride 128 \
	--output_dir ./models/

03/08/2021 09:54:08 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 2distributed training: False, 16-bits training: False
03/08/2021 09:54:08 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=./models/, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=4.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Mar08_09-54-06_inf-105-gpu-1, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=./models/, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=2)
03/08/2021 09:54:08 - WARNING - datasets.builder -   Reusing dataset squad (/home/blozano/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9)
[INFO|configuration_utils.py:463] 2021-03-08 09:54:09,206 >> loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /home/blozano/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.637c6035640bacb831febcc2b7f7bee0a96f9b30c2d7e9ef84082d9f252f3170
[INFO|configuration_utils.py:499] 2021-03-08 09:54:09,207 >> Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.4.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

[INFO|configuration_utils.py:463] 2021-03-08 09:54:09,509 >> loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /home/blozano/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.637c6035640bacb831febcc2b7f7bee0a96f9b30c2d7e9ef84082d9f252f3170
[INFO|configuration_utils.py:499] 2021-03-08 09:54:09,510 >> Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.4.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

[INFO|tokenization_utils_base.py:1721] 2021-03-08 09:54:10,138 >> loading file https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt from cache at /home/blozano/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
[INFO|tokenization_utils_base.py:1721] 2021-03-08 09:54:10,138 >> loading file https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json from cache at /home/blozano/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4
[INFO|modeling_utils.py:1051] 2021-03-08 09:54:10,501 >> loading weights file https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin from cache at /home/blozano/.cache/huggingface/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f
[WARNING|modeling_utils.py:1158] 2021-03-08 09:54:12,594 >> Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[WARNING|modeling_utils.py:1169] 2021-03-08 09:54:12,594 >> Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
03/08/2021 09:54:12 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/blozano/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9/cache-a560de6b2f76743b.arrow
03/08/2021 09:54:12 - WARNING - datasets.arrow_dataset -   Loading cached processed dataset at /home/blozano/.cache/huggingface/datasets/squad/plain_text/1.0.0/0fd9e01360d229a22adfe0ab7e2dd2adc6e2b3d6d3db03636a51235947d4c6e9/cache-15b011eed342eca6.arrow
[INFO|trainer.py:471] 2021-03-08 09:54:15,885 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: example_id, offset_mapping.
[INFO|trainer.py:929] 2021-03-08 09:54:15,937 >> ***** Running training *****
[INFO|trainer.py:930] 2021-03-08 09:54:15,937 >>   Num examples = 88524
[INFO|trainer.py:931] 2021-03-08 09:54:15,937 >>   Num Epochs = 4
[INFO|trainer.py:932] 2021-03-08 09:54:15,937 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:933] 2021-03-08 09:54:15,937 >>   Total train batch size (w. parallel, distributed & accumulation) = 2
[INFO|trainer.py:934] 2021-03-08 09:54:15,937 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:935] 2021-03-08 09:54:15,937 >>   Total optimization steps = 177048
  0%|          | 0/177048 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "transformers/examples/question-answering/run_qa.py", line 507, in <module>
    main()
  File "transformers/examples/question-answering/run_qa.py", line 481, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/blozano/finetune_qa/transformers/src/transformers/trainer.py", line 1036, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/trainer.py", line 1420, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/trainer.py", line 1452, in compute_loss
    outputs = model(**inputs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 1775, in forward
    outputs = self.bert(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 971, in forward
    encoder_outputs = self.encoder(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 568, in forward
    layer_outputs = layer_module(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 456, in forward
    self_attention_outputs = self.attention(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 387, in forward
    self_outputs = self.self(
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/transformers/src/transformers/models/bert/modeling_bert.py", line 253, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/blozano/finetune_qa/env/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)`

NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2

Expected behavior

The script should train and evaluate as described in transformers/examples/question-answering/README.md.


LysandreJik · 2021-03-08T13:11:08Z

Hi! I don't think torch supports CUDA 11.2 yet. See pytorch/pytorch#50232 (comment)
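
A quick way to see which versions are in play is to put the CUDA version reported by the driver next to the one the installed PyTorch build was compiled against. The following is only a rough sketch: `parse_smi_cuda_version` and `torch_build_cuda_version` are hypothetical helper names, and the header string is copied from the `nvidia-smi` line in the report. `torch.version.cuda` is the attribute PyTorch exposes for its build-time CUDA version (it is `None` on CPU-only builds).

```python
import re
from typing import Optional, Tuple

# Header line copied from the nvidia-smi output in the report above.
SMI_HEADER = "NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2"


def parse_smi_cuda_version(header: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from an nvidia-smi header line, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def torch_build_cuda_version() -> Optional[str]:
    """Return torch.version.cuda, or None if torch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    # nvidia-smi shows the highest CUDA version the driver supports.
    print("driver supports up to CUDA:", parse_smi_cuda_version(SMI_HEADER))
    # The wheel installed by pip carries its own CUDA runtime version.
    print("torch built against CUDA:", torch_build_cuda_version())
```

Printing both values side by side makes it easier to say which PyTorch build and which CUDA version were actually combined when reporting a problem like this one.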

LittlePea13 · 2021-03-08T17:34:09Z

I had a similar issue with torch 1.8 and solved it by downgrading to 1.7.1.

LozanoAlvarezb · 2021-03-09T09:59:26Z

> Hi! I don't think torch supports CUDA 11.2 yet. See pytorch/pytorch#50232 (comment)

Thanks for the quick response. I just tested the script with CUDA 11.1 and it worked just fine.
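
For anyone who wants to check a new environment before launching the full `run_qa.py` job, a small linear layer call on the GPU should exercise the same `F.linear` path that failed in the traceback. This is a minimal sketch under that assumption, not part of the example scripts; `cublas_smoke_test` is a hypothetical name, and the function skips itself when torch or a CUDA device is unavailable.

```python
def cublas_smoke_test() -> str:
    """Run one tiny GPU linear op; return a short status string."""
    try:
        import torch
        import torch.nn.functional as F
    except ImportError:
        return "skipped: torch is not installed"

    if not torch.cuda.is_available():
        return "skipped: no CUDA device visible"

    # Small tensors are enough: the point is only to reach the GPU
    # matrix-multiply path, not to measure performance.
    inputs = torch.randn(4, 16, device="cuda")
    weight = torch.randn(8, 16, device="cuda")
    outputs = F.linear(inputs, weight)

    # CUDA calls are asynchronous, so wait for the kernel to finish
    # before declaring success.
    torch.cuda.synchronize()
    assert outputs.shape == (4, 8)
    return f"ok: torch {torch.__version__}, CUDA build {torch.version.cuda}"


if __name__ == "__main__":
    print(cublas_smoke_test())
```

If the environment has the problem described in this issue, the expectation is that this call raises the same `RuntimeError` within seconds instead of after dataset loading and model setup.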

github-actions · 2021-04-14T15:02:10Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

jiminHuang mentioned this issue Mar 15, 2021

Failed to run the code jdubpark/continual-bert#2

Closed

LysandreJik closed this as completed Apr 14, 2021

KomputerMaster64 mentioned this issue Aug 16, 2022

Query: Evaluation Error NVlabs/denoising-diffusion-gan#16

Open
