
Running the tools/train.sh script fails: Check failed: num_device > 0 (0 vs. 0) No IB device found #340

Closed
Sakura-gh opened this issue Aug 2, 2022 · 3 comments


Sakura-gh commented Aug 2, 2022

OneFlow version: 0.8.0+cu102; libai version: latest commit

I am currently trying to run GPT-2 with oneflow-libai. Following the tutorial, I only changed the dataset-related configuration, but running bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py 2 fails. The full log is below:

(oneflow) root@28c67ac89ed8:/home/gehao/OneFlow/libai# bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py 2
loaded library: /usr/lib/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
loaded library: loaded library: /usr/lib/libibverbs.so.1/usr/lib/libibverbs.so.1

[08/02 12:39:21 libai]: Rank of current process: 0. World size: 2
[08/02 12:39:21 libai]: Command line arguments: Namespace(config_file='configs/gpt2_pretrain.py', resume=False, eval_only=False, fast_dev_run=False, opts=[])
[08/02 12:39:21 libai]: Contents of args.config_file=configs/gpt2_pretrain.py:
from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.gpt import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.gpt_dataset import dataloader, tokenization

from .common.models.graph import graph

# vocab_file = "./data_test/gpt_data/gpt2-vocab.json"
# merge_files = "./data_test/gpt_data/gpt2-merges.txt"
# data_prefix = "./data_test/gpt_data/loss_compara_content_sentence"
merge_files = "/home/gehao/dataset/gpt/hf-GPT2Data/merges.txt"
vocab_file = "/home/gehao/dataset/gpt/hf-GPT2Data/vocab.json"
data_prefix = "/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document"

tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merge_files
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# GPT-2 model config
model.cfg.embedding_dropout_prob = 0.1
model.cfg.attention_dropout_prob = 0.1
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 384
model.cfg.ffn_hidden_size = 1536
model.cfg.num_layers = 6
model.cfg.max_seq_length = 1024

train.input_placement_device = "cpu"

train.dist.pipeline_num_layers = model.cfg.num_layers

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_seq_length

optim.lr = 1.5e-4

train.train_micro_batch_size = 4
train.amp.enabled = True

train.evaluation.evaluator = LazyCall(PPLEvaluator)()

train.output_dir = "./output/gpt2_output"

[08/02 12:39:21 libai]: Full config saved to ./output/gpt2_output/config.yaml
[08/02 12:39:21 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/home/gehao/OneFlow/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/gehao/OneFlow/libai/libai/data/data_utils'
[08/02 12:39:21 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.094 seconds
[08/02 12:39:21 lb.engine.default]: >>> done with compiling. Compilation time: 0.096 seconds
[08/02 12:39:21 lb.engine.default]: Prepare training, validating, testing set
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: building dataset index ...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading sizes...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading pointers...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading document index...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 7.357359 seconds
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: number of documents: 8013769
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: number of sentences: 8013769
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:     loaded indexed file in 0.017 seconds
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:     total number of samples: 8828142
[08/02 12:39:28 lb.data.datasets.gpt_dataset]:     total number of epochs: 1
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     loaded indexed file in 0.002 seconds
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     total number of samples: 26484426
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     total number of epochs: 3
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:  > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     loaded indexed file in 0.002 seconds
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     total number of samples: 8828142
[08/02 12:39:29 lb.data.datasets.gpt_dataset]:     total number of epochs: 1
F20220802 12:39:32.804811 201821 ibverbs_comm_network.cpp:112] Check failed: num_device > 0 (0 vs. 0) No IB device found
*** Check failure stack trace: ***
    @     0x7fced2e9962a  google::LogMessage::Fail()
    @     0x7fced2e99912  google::LogMessage::SendToLog()
    @     0x7fced2e99197  google::LogMessage::Flush()
    @     0x7fced2e9bd09  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fceca30f910  oneflow::IBVerbsCommNet::IBVerbsCommNet()
    @     0x7fcecc4a95de  oneflow::InitRDMA()
    @     0x7fcf319b35a3  (unknown)
    @     0x7fcf31ba1de9  (unknown)
    @     0x5581bfd408f4  cfunction_call
    @     0x5581bfcfa47f  _PyObject_MakeTpCall
    @     0x5581bfd982e9  _PyEval_EvalFrameDefault
    @     0x5581bfd55be4  _PyFunction_Vectorcall
    @     0x5581bfcbd300  _PyEval_EvalFrameDefault.cold.2983
    @     0x5581bfd54fe3  _PyEval_EvalCode
    @     0x5581bfd55cb4  _PyFunction_Vectorcall
    @     0x5581bfd409aa  _PyObject_FastCallDictTstate
    @     0x5581bfd4a429  slot_tp_init
    @     0x5581bfcfa52f  _PyObject_MakeTpCall
    @     0x5581bfd93d57  _PyEval_EvalFrameDefault
    @     0x5581bfd55be4  _PyFunction_Vectorcall
    @     0x5581bfcbc088  _PyEval_EvalFrameDefault.cold.2983
    @     0x5581bfd54fe3  _PyEval_EvalCode
    @     0x5581bfe01a7c  PyEval_EvalCodeEx
    @     0x5581bfd55dbb  PyEval_EvalCode
    @     0x5581bfe01b2b  run_eval_code_obj
    @     0x5581bfe32155  run_mod
    @     0x5581bfcd31f7  pyrun_file.cold.3078
    @     0x5581bfe3772f  PyRun_SimpleFileExFlags
    @     0x5581bfe37df8  Py_RunMain
    @     0x5581bfe37ff9  Py_BytesMain
    @     0x7fcf3acb6b97  __libc_start_main
    @     0x5581bfdbf6a0  (unknown)
F20220802 12:39:33.046900 201822 ibverbs_comm_network.cpp:112] Check failed: num_device > 0 (0 vs. 0) No IB device found
*** Check failure stack trace: ***
    @     0x7f6e5c4b662a  google::LogMessage::Fail()
    @     0x7f6e5c4b6912  google::LogMessage::SendToLog()
    @     0x7f6e5c4b6197  google::LogMessage::Flush()
    @     0x7f6e5c4b8d09  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f6e5392c910  oneflow::IBVerbsCommNet::IBVerbsCommNet()
    @     0x7f6e55ac65de  oneflow::InitRDMA()
    @     0x7f6ebafd05a3  (unknown)
    @     0x7f6ebb1bede9  (unknown)
    @     0x55652276c8f4  cfunction_call
    @     0x55652272647f  _PyObject_MakeTpCall
    @     0x5565227c42e9  _PyEval_EvalFrameDefault
    @     0x556522781be4  _PyFunction_Vectorcall
    @     0x5565226e9300  _PyEval_EvalFrameDefault.cold.2983
    @     0x556522780fe3  _PyEval_EvalCode
    @     0x556522781cb4  _PyFunction_Vectorcall
    @     0x55652276c9aa  _PyObject_FastCallDictTstate
    @     0x556522776429  slot_tp_init
    @     0x55652272652f  _PyObject_MakeTpCall
    @     0x5565227bfd57  _PyEval_EvalFrameDefault
    @     0x556522781be4  _PyFunction_Vectorcall
    @     0x5565226e8088  _PyEval_EvalFrameDefault.cold.2983
    @     0x556522780fe3  _PyEval_EvalCode
    @     0x55652282da7c  PyEval_EvalCodeEx
    @     0x556522781dbb  PyEval_EvalCode
    @     0x55652282db2b  run_eval_code_obj
    @     0x55652285e155  run_mod
    @     0x5565226ff1f7  pyrun_file.cold.3078
    @     0x55652286372f  PyRun_SimpleFileExFlags
    @     0x556522863df8  Py_RunMain
    @     0x556522863ff9  Py_BytesMain
    @     0x7f6ec42d3b97  __libc_start_main
    @     0x5565227eb6a0  (unknown)
Killing subprocess 201821
Killing subprocess 201822
Traceback (most recent call last):
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 231, in <module>
    main()
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 219, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 187, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/gehao/anaconda3/envs/mlsys/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'configs/gpt2_pretrain.py']' died with <Signals.SIGABRT: 6>.

shangguanshiyuan commented Aug 2, 2022

Hi, if there are no RDMA-capable devices in your environment, you can set train.rdma_enabled = False in the config to disable RDMA.
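
For example, in configs/gpt2_pretrain.py (a minimal sketch; train is the config object already imported from .common.train in the file shown above):

from .common.train import train  # already present near the top of the config

# No InfiniBand device in this environment, so skip RDMA initialization
train.rdma_enabled = False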

Sakura-gh (Author) commented Aug 2, 2022

train.rdma_enabled = False

Thanks, that fixed the issue. I also found another bug: the program only runs after commenting out tril_fill_value=-10000.0 on line 242 of libai/libai/layers/attention.py. Does commenting out this line affect the program's logic?

[screenshot: libai/layers/attention.py, around line 242]
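
For reference, the workaround amounts to dropping that keyword argument from the fused-kernel call. A minimal sketch of the edit (the surrounding call shape is an assumption reconstructed from the screenshot, not verbatim libai code):

# libai/libai/layers/attention.py, around line 242 (sketch, assumed call shape)
attention_weights = flow._C.fused_scale_tril_softmax_mask_scale(
    attention_scores,
    p=self.attention_dropout_prob,
    diagonal=0,
    tril_scale_value=self.coeff,
    # tril_fill_value=-10000.0,  # commented out: the oneflow 0.8.0 kernel
    # predates this keyword (a version mismatch; see the reply below)
)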


shangguanshiyuan commented Aug 2, 2022

Hi, you can install libai v0.2.0 to match oneflow v0.8.0.
Alternatively, you can install a nightly build of oneflow via python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu102 to match the latest libai.
Note, however, that code still under development and not yet officially released has not gone through rigorous testing and validation, so it will be less reliable than an official release.
