OneFlow version: 0.8.0+cu102; LiBai version: latest commit.
I am currently trying to run GPT-2 with oneflow-libai. Following the tutorial, I only modified the dataset-related configuration. The program crashes when I run the script below; the full log follows:
bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py 2
(oneflow) root@28c67ac89ed8:/home/gehao/OneFlow/libai# bash tools/train.sh tools/train_net.py configs/gpt2_pretrain.py 2
loaded library: /usr/lib/libibverbs.so.1
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
loaded library: /usr/lib/libibverbs.so.1
loaded library: /usr/lib/libibverbs.so.1
[08/02 12:39:21 libai]: Rank of current process: 0. World size: 2
[08/02 12:39:21 libai]: Command line arguments: Namespace(config_file='configs/gpt2_pretrain.py', resume=False, eval_only=False, fast_dev_run=False, opts=[])
[08/02 12:39:21 libai]: Contents of args.config_file=configs/gpt2_pretrain.py:

from libai.config import LazyCall
from libai.evaluation import PPLEvaluator
from .common.models.gpt import pretrain_model as model
from .common.train import train
from .common.optim import optim
from .common.data.gpt_dataset import dataloader, tokenization
from .common.models.graph import graph

# vocab_file = "./data_test/gpt_data/gpt2-vocab.json"
# merge_files = "./data_test/gpt_data/gpt2-merges.txt"
# data_prefix = "./data_test/gpt_data/loss_compara_content_sentence"
merge_files = "/home/gehao/dataset/gpt/hf-GPT2Data/merges.txt"
vocab_file = "/home/gehao/dataset/gpt/hf-GPT2Data/vocab.json"
data_prefix = "/home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document"

tokenization.tokenizer.vocab_file = vocab_file
tokenization.tokenizer.merges_file = merge_files
dataloader.train.dataset[0].data_prefix = data_prefix
dataloader.train.dataset[0].indexed_dataset.data_prefix = data_prefix

# GPT-2 model config
model.cfg.embedding_dropout_prob = 0.1
model.cfg.attention_dropout_prob = 0.1
model.cfg.num_attention_heads = 16
model.cfg.hidden_size = 384
model.cfg.ffn_hidden_size = 1536
model.cfg.num_layers = 6
model.cfg.max_seq_length = 1024

train.input_placement_device = "cpu"
train.dist.pipeline_num_layers = model.cfg.num_layers

for ds in dataloader.train.dataset:
    ds.max_seq_length = model.cfg.max_seq_length

optim.lr = 1.5e-4
train.train_micro_batch_size = 4
train.amp.enabled = True
train.evaluation.evaluator = LazyCall(PPLEvaluator)()
train.output_dir = "./output/gpt2_output"

[08/02 12:39:21 libai]: Full config saved to ./output/gpt2_output/config.yaml
[08/02 12:39:21 lb.engine.default]: > compiling dataset index builder ...
make: Entering directory '/home/gehao/OneFlow/libai/libai/data/data_utils'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/gehao/OneFlow/libai/libai/data/data_utils'
[08/02 12:39:21 lb.engine.default]: >>> done with dataset index builder. Compilation time: 0.094 seconds
[08/02 12:39:21 lb.engine.default]: >>> done with compiling. Compilation time: 0.096 seconds
[08/02 12:39:21 lb.engine.default]: Prepare training, validating, testing set
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: building dataset index ...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: warming up index mmap file...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading sizes...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading pointers...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: reading document index...
[08/02 12:39:21 lb.data.data_utils.indexed_dataset]: warming up data mmap file...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: creating numpy buffer of mmap...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: creating memory view of numpy buffer...
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: Finished creating indexed dataset in 7.357359 seconds
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: indexed dataset stats:
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: number of documents: 8013769
[08/02 12:39:28 lb.data.data_utils.indexed_dataset]: number of sentences: 8013769
[08/02 12:39:28 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_80000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:28 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.017 seconds
[08/02 12:39:28 lb.data.datasets.gpt_dataset]: total number of samples: 8828142
[08/02 12:39:28 lb.data.datasets.gpt_dataset]: total number of epochs: 1
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_19200000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.002 seconds
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: total number of samples: 26484426
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: total number of epochs: 3
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: > loading doc-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_doc_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: > loading sample-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_sample_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: > loading shuffle-idx mapping from /home/gehao/dataset/gpt/hf-GPT2Data/hf-gpt2_text_document_gpt-2_indexmap_6400000ns_1024sl_1234s_shuffle_idx.npy
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: loaded indexed file in 0.002 seconds
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: total number of samples: 8828142
[08/02 12:39:29 lb.data.datasets.gpt_dataset]: total number of epochs: 1
F20220802 12:39:32.804811 201821 ibverbs_comm_network.cpp:112] Check failed: num_device > 0 (0 vs. 0) No IB device found
*** Check failure stack trace: ***
    @ 0x7fced2e9962a google::LogMessage::Fail()
    @ 0x7fced2e99912 google::LogMessage::SendToLog()
    @ 0x7fced2e99197 google::LogMessage::Flush()
    @ 0x7fced2e9bd09 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7fceca30f910 oneflow::IBVerbsCommNet::IBVerbsCommNet()
    @ 0x7fcecc4a95de oneflow::InitRDMA()
    @ 0x7fcf319b35a3 (unknown)
    @ 0x7fcf31ba1de9 (unknown)
    @ 0x5581bfd408f4 cfunction_call
    @ 0x5581bfcfa47f _PyObject_MakeTpCall
    @ 0x5581bfd982e9 _PyEval_EvalFrameDefault
    @ 0x5581bfd55be4 _PyFunction_Vectorcall
    @ 0x5581bfcbd300 _PyEval_EvalFrameDefault.cold.2983
    @ 0x5581bfd54fe3 _PyEval_EvalCode
    @ 0x5581bfd55cb4 _PyFunction_Vectorcall
    @ 0x5581bfd409aa _PyObject_FastCallDictTstate
    @ 0x5581bfd4a429 slot_tp_init
    @ 0x5581bfcfa52f _PyObject_MakeTpCall
    @ 0x5581bfd93d57 _PyEval_EvalFrameDefault
    @ 0x5581bfd55be4 _PyFunction_Vectorcall
    @ 0x5581bfcbc088 _PyEval_EvalFrameDefault.cold.2983
    @ 0x5581bfd54fe3 _PyEval_EvalCode
    @ 0x5581bfe01a7c PyEval_EvalCodeEx
    @ 0x5581bfd55dbb PyEval_EvalCode
    @ 0x5581bfe01b2b run_eval_code_obj
    @ 0x5581bfe32155 run_mod
    @ 0x5581bfcd31f7 pyrun_file.cold.3078
    @ 0x5581bfe3772f PyRun_SimpleFileExFlags
    @ 0x5581bfe37df8 Py_RunMain
    @ 0x5581bfe37ff9 Py_BytesMain
    @ 0x7fcf3acb6b97 __libc_start_main
    @ 0x5581bfdbf6a0 (unknown)
F20220802 12:39:33.046900 201822 ibverbs_comm_network.cpp:112] Check failed: num_device > 0 (0 vs. 0) No IB device found
*** Check failure stack trace: ***
    @ 0x7f6e5c4b662a google::LogMessage::Fail()
    @ 0x7f6e5c4b6912 google::LogMessage::SendToLog()
    @ 0x7f6e5c4b6197 google::LogMessage::Flush()
    @ 0x7f6e5c4b8d09 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f6e5392c910 oneflow::IBVerbsCommNet::IBVerbsCommNet()
    @ 0x7f6e55ac65de oneflow::InitRDMA()
    @ 0x7f6ebafd05a3 (unknown)
    @ 0x7f6ebb1bede9 (unknown)
    @ 0x55652276c8f4 cfunction_call
    @ 0x55652272647f _PyObject_MakeTpCall
    @ 0x5565227c42e9 _PyEval_EvalFrameDefault
    @ 0x556522781be4 _PyFunction_Vectorcall
    @ 0x5565226e9300 _PyEval_EvalFrameDefault.cold.2983
    @ 0x556522780fe3 _PyEval_EvalCode
    @ 0x556522781cb4 _PyFunction_Vectorcall
    @ 0x55652276c9aa _PyObject_FastCallDictTstate
    @ 0x556522776429 slot_tp_init
    @ 0x55652272652f _PyObject_MakeTpCall
    @ 0x5565227bfd57 _PyEval_EvalFrameDefault
    @ 0x556522781be4 _PyFunction_Vectorcall
    @ 0x5565226e8088 _PyEval_EvalFrameDefault.cold.2983
    @ 0x556522780fe3 _PyEval_EvalCode
    @ 0x55652282da7c PyEval_EvalCodeEx
    @ 0x556522781dbb PyEval_EvalCode
    @ 0x55652282db2b run_eval_code_obj
    @ 0x55652285e155 run_mod
    @ 0x5565226ff1f7 pyrun_file.cold.3078
    @ 0x55652286372f PyRun_SimpleFileExFlags
    @ 0x556522863df8 Py_RunMain
    @ 0x556522863ff9 Py_BytesMain
    @ 0x7f6ec42d3b97 __libc_start_main
    @ 0x5565227eb6a0 (unknown)
Killing subprocess 201821
Killing subprocess 201822
Traceback (most recent call last):
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 231, in <module>
    main()
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 219, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/oneflow/distributed/launch.py", line 187, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/home/gehao/anaconda3/envs/mlsys/bin/python3', '-u', 'tools/train_net.py', '--config-file', 'configs/gpt2_pretrain.py']' died with <Signals.SIGABRT: 6>.
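A quick way to confirm the "No IB device found" diagnosis inside the failing container (a sketch, assuming Linux, where ibverbs-visible RDMA devices appear under `/sys/class/infiniband`):

```python
import os

# On Linux, each RDMA-capable device that libibverbs can enumerate shows up
# as an entry under /sys/class/infiniband. An empty or missing directory
# matches the "Check failed: num_device > 0" abort in the log above.
ib_root = "/sys/class/infiniband"
devs = os.listdir(ib_root) if os.path.isdir(ib_root) else []
print("IB devices:", devs if devs else "none found")
```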
Hi, if there are no RDMA devices in your environment, you can set train.rdma_enabled = False in the config to disable RDMA.
train.rdma_enabled = False
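Applied to the config file shown in the log, the fix is a one-line addition (a sketch against `configs/gpt2_pretrain.py`; only the relevant import is reproduced):

```python
# configs/gpt2_pretrain.py (excerpt)
from .common.train import train

# This container has no InfiniBand device, so disable RDMA to avoid the
# "Check failed: num_device > 0" abort in oneflow::InitRDMA().
train.rdma_enabled = False
```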
Thanks, that problem is solved. However, I ran into another bug: the program only runs after commenting out tril_fill_value=-10000.0 at line 242 of libai/libai/layers/attention.py. Does commenting out that line affect the program's logic?
tril_fill_value=-10000.0
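For intuition about what that constant does (an illustrative sketch, not LiBai's actual implementation): in causal self-attention, the entries above the diagonal of the score matrix are filled with a large negative value such as -10000.0, so that softmax assigns those future positions near-zero probability:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # raw attention scores

# Causal (lower-triangular) mask: position i may only attend to j <= i.
# Filling the upper triangle with a large negative value (e.g. -10000.0)
# drives those attention weights to ~0 after softmax.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
probs = softmax(np.where(mask, -10000.0, scores))

print(probs[0])  # row 0 attends (almost) exclusively to position 0
```

If commenting the argument out merely makes the kernel fall back to its default fill value, and that default is also a large negative number (or -inf), the masking behavior should be equivalent up to numerical precision; that is worth confirming against the attention implementation rather than assuming.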
Hi, you can install LiBai v0.2.0 to match OneFlow v0.8.0. Alternatively, you can install a nightly build of OneFlow via python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu102 to match the latest LiBai. Note, however, that code still under development and not yet officially released has not gone through rigorous testing and validation, so it will be less reliable than an official release.
python3 -m pip install --pre oneflow -f https://staging.oneflow.info/branch/master/cu102