Problem running the NER code #15

JucksonP · 2021-06-23T03:37:00Z

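Hi,
First of all, thank you for open-sourcing this!
I ran into some problems while trying to reproduce the results in the paper, and I would appreciate it if you could find time to help.
1. When running the Weibo NER experiment with the same hyperparameters as in the paper, the training loss behaves abnormally (it oscillates on the way down, and F1 on both the dev and test sets is 0 for the first few epochs). I have sent the training log by email.
2. Environment and run settings:
GPU: A100-SXM4-40GB; torch: 1.8.1+cu111; training mode: single GPU
Looking forward to your reply and guidance. Thanks!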
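
For anyone filing a similar report, the environment details listed above can be collected in one place. The following is a minimal sketch, not part of the repository: `describe_environment` is a hypothetical helper name, and it assumes only the standard PyTorch attributes `torch.__version__`, `torch.version.cuda`, `torch.backends.cudnn.version()` and the `torch.cuda` device queries. It falls back gracefully when PyTorch is not installed.

```python
import platform


def describe_environment():
    """Collect the version details that matter when comparing runs across machines."""
    info = {"python": platform.python_version()}
    try:
        import torch  # optional: the report still works without PyTorch installed
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version this PyTorch build targets
    info["cudnn"] = torch.backends.cudnn.version()
    info["gpu_available"] = torch.cuda.is_available()
    if info["gpu_available"]:
        info["gpu_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Pasting this output from both a working and a failing machine makes differences in GPU model or CUDA build easy to spot.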

liuwei1206 · 2021-06-23T04:00:31Z

Hi,

I have uploaded the checkpoint file for Weibo NER, which includes a shell file to train and evaluate the model. I suggest you train the model using that shell file. Hope it helps.

Wei

JucksonP · 2021-06-23T06:02:26Z

I did run it with the shell file that ships with the checkpoint; neither the data files nor the code were modified.
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 13111 --nproc_per_node=1 \
    Trainer.py --do_train --do_eval --do_predict --evaluate_during_training \
    --data_dir="data/dataset/NER/weibo" \
    --output_dir="data/result/NER/weibo/wcbertcrf" \
    --config_name="data/berts/bert/config.json" \
    --model_name_or_path="data/berts/bert/pytorch_model.bin" \
    --vocab_file="data/berts/bert/vocab.txt" \
    --word_vocab_file="data/vocab/tencent_vocab.txt" \
    --max_scan_num=1500000 \
    --max_word_num=5 \
    --label_file="data/dataset/NER/weibo/labels.txt" \
    --word_embedding="data/embedding/word_embedding.txt" \
    --saved_embedding_dir="data/dataset/NER/weibo" \
    --model_type="WCBertCRF_Token" \
    --seed=106524 \
    --per_gpu_train_batch_size=4 \
    --per_gpu_eval_batch_size=16 \
    --learning_rate=1e-5 \
    --max_steps=-1 \
    --max_seq_length=256 \
    --num_train_epochs=20 \
    --warmup_steps=190 \
    --save_steps=600 \
    --logging_steps=100
The final result after 20 epochs is shown in the figure. The loss never comes down, and I don't know why.

liuwei1206 · 2021-06-23T06:08:28Z

Hi,

What results do you get when you load my checkpoint?

JucksonP · 2021-06-23T06:10:32Z

Loading the checkpoint you provided gives normal results on the dev and test sets; the F1 matches the value in the paper.
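
An F1 of exactly 0 in the early epochs does not by itself mean the metric is broken. Under entity-level scoring, which is assumed here (the repository's own evaluation code may differ), a model that still predicts only `O` finds no entities and scores zero. The sketch below illustrates this with hypothetical helpers `extract_spans` and `span_f1` and made-up BIO tags.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def span_f1(gold_sequences, pred_sequences):
    """Micro-averaged F1 over exact-match entity spans."""
    gold_total = pred_total = correct = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
        correct += len(gold_spans & pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative tags only; the real label set comes from labels.txt.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
all_o = [["O", "O", "O", "O"]]
print(span_f1(gold, all_o))  # an all-"O" prediction has no spans, so F1 is 0.0
print(span_f1(gold, gold))   # a perfect prediction gives 1.0
```

So a zero F1 alongside a falling loss can be normal early on; a zero F1 that persists for the whole run, as reported here, is the real symptom.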

liuwei1206 · 2021-06-23T06:17:21Z

Hi,

Please be patient. I will run the code to check whether it contains any bugs.

JucksonP · 2021-06-23T06:27:22Z

Thank you very much!

hezongfeng · 2021-06-24T03:48:41Z

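I ran into a similar problem: the same code gave completely different results on a 1080 Ti (11 GB) and a V100 (16 GB). On the V100 it never converged, and F1 stayed at 0 for the first few dozen epochs. You could try a different machine.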
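
When comparing the same code across machines, it helps to rule out run-to-run randomness first. The command above already passes `--seed=106524`; how `Trainer.py` applies it is not shown in this thread, so the following is only a sketch under that caveat. `set_seed` is a hypothetical helper that seeds Python, plus NumPy and PyTorch when they are installed, and asks cuDNN for deterministic kernels.

```python
import os
import random


def set_seed(seed):
    """Seed every random number generator that is available in this environment."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects processes started later
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when no GPU is present
    torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False  # autotuning may pick different kernels per run


set_seed(106524)
first = random.random()
set_seed(106524)
second = random.random()
print(first == second)  # the same seed reproduces the same draw
```

Even with fixed seeds, PyTorch does not guarantee bit-identical results across different GPU models or CUDA versions, so this narrows the comparison rather than making the two machines equivalent.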

s1162276945 · 2021-06-24T06:04:02Z

@JucksonP I am running this same Weibo NER command too. Could we compare notes?

JucksonP · 2021-06-24T06:43:19Z

@hezongfeng Thanks for the tip, I will try it on a different GPU.

JucksonP · 2021-06-24T06:44:06Z

@s1162276945 Happy to. Did you get it to run, and what results did you get?

JucksonP · 2021-06-24T07:12:47Z

@hezongfeng Sure enough, it was a CUDA version problem. With CUDA 10.1 the loss goes down normally.

liuwei1206 · 2021-06-24T08:23:49Z

@hezongfeng Thanks for sharing. I did not know that the CUDA version and the machine could make such a difference.

@JucksonP Sorry for the late reply. My own run behaves normally and the loss drops. I work full-time at a company, so I do not have much time to run experiments. Please understand.

liuwei1206 · 2021-06-24T08:29:31Z

My GPU is a Tesla P100. I am not sure which CUDA version I used, since my colleagues may have changed it. At present, the CUDA versions on the two GPUs I used over the last year are 10.2 and 11.0.

ziliwang · 2021-06-24T09:53:09Z

I ran into a similar problem with single-GPU training and solved it by replacing the CRF layer.
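
The thread does not say which CRF implementation was swapped in. Independently of that choice, any linear-chain CRF layer can be sanity-checked on tiny inputs: its log-partition should match a brute-force sum over all tag paths, and its negative log-likelihood should never be negative. The following pure-Python sketch shows such a check. All names are hypothetical, and start and end transitions are omitted for brevity, which is a simplification rather than the repository's formulation.

```python
import itertools
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sequence_score(emissions, transitions, tags):
    """Unnormalised score of one tag path: emission terms plus transition terms."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all tag paths."""
    num_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j]
            + logsumexp([alpha[i] + transitions[i][j] for i in range(num_tags)])
            for j in range(num_tags)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of the gold path; non-negative for a correct CRF."""
    return log_partition(emissions, transitions) - sequence_score(emissions, transitions, tags)


def brute_force_log_partition(emissions, transitions):
    """Reference value by enumerating every path; only feasible for tiny inputs."""
    num_tags = len(transitions)
    paths = itertools.product(range(num_tags), repeat=len(emissions))
    return logsumexp([sequence_score(emissions, transitions, p) for p in paths])


# Made-up scores for 3 time steps and 3 tags.
emissions = [[1.0, 0.5, -0.2], [0.3, 2.0, 0.1], [-1.0, 0.4, 1.5]]
transitions = [[0.2, -0.5, 0.1], [0.0, 0.3, -1.0], [-0.4, 0.6, 0.2]]
gap = abs(log_partition(emissions, transitions) - brute_force_log_partition(emissions, transitions))
print(gap < 1e-9)
print(crf_nll(emissions, transitions, [0, 1, 2]) >= 0.0)
```

Running the same tiny inputs through a GPU CRF layer on both the working and the failing machine would show whether the layer itself is where the numbers diverge.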

s1162276945 · 2021-06-25T01:02:04Z

Hi, my email address is 1162276945@qq.com

s1162276945 · 2021-06-28T05:49:29Z

@ziliwang Hi, what did you replace the CRF layer with?

lvjiujin · 2021-09-17T05:56:28Z

@JucksonP Wow, a 40 GB GPU, what a luxury. I cannot get this to run on the 8 GB RTX 3070 in my own PC. My guess is that the word vocabulary is too large, which makes the Trie enormous.
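
To make the Trie concern concrete, here is a minimal sketch of a character-level Trie with a cap on how many words are inserted. It assumes that `--max_scan_num` limits the number of vocabulary words read, which the thread does not confirm. `Trie` and `build_trie` are hypothetical names, not the repository's classes.

```python
class Trie:
    """Character-level prefix tree used to look up lexicon words inside a sentence."""

    def __init__(self):
        self.root = {}
        self.node_count = 0

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                node[ch] = {}
                self.node_count += 1
            node = node[ch]
        node[None] = True  # end-of-word marker; None never collides with a character

    def match_from(self, text, start):
        """Return every lexicon word that begins at text[start]."""
        node, found = self.root, []
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if None in node:
                found.append(text[start:end + 1])
        return found


def build_trie(words, max_scan_num):
    """Insert at most max_scan_num words (assumed meaning of the flag)."""
    trie = Trie()
    for index, word in enumerate(words):
        if index >= max_scan_num:
            break
        trie.insert(word)
    return trie


words = ["南京", "南京市", "市长", "长江", "长江大桥", "大桥"]
text = "南京市长江大桥"
full = build_trie(words, max_scan_num=len(words))
capped = build_trie(words, max_scan_num=2)
print(full.match_from(text, 0))  # words starting at the first character
print(full.node_count, capped.node_count)  # fewer scanned words, fewer nodes
```

The node count grows with the number of distinct prefixes, so under this assumption lowering the cap is one way to test whether the lexicon size is what exhausts memory.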

JucksonP · 2021-09-17T06:20:20Z

@lvjiujin Company servers are great: 24 A100s, hahaha. I could never afford GPUs like that myself.

liuwei1206 closed this as completed Jun 24, 2021
