
Demo script cannot reproduce SOTA results #1773

Closed
1 task done
shixinglingguihua opened this issue Aug 11, 2022 · 2 comments
Comments

@shixinglingguihua

shixinglingguihua commented Aug 11, 2022

Describe the bug
Running the SOTA training script does not reproduce the reported results.

Code to reproduce the issue

Describe the current behavior
1. Download the SOTA script HanLP/plugins/hanlp_demo/hanlp_demo/zh/train_sota_bert_pku.py
2. Open a Python shell and run the Python code from the script, reproduced below:
from hanlp.common.dataset import SortingSamplerBuilder
from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TRAIN_ALL, SIGHAN2005_PKU_TEST
from tests import cdroot

cdroot()  # change to the repository root so relative data paths resolve
tokenizer = TransformerTaggingTokenizer()
save_dir = 'data/model/cws/sighan2005_pku_bert_base_96.66'
tokenizer.fit(
    SIGHAN2005_PKU_TRAIN_ALL,
    SIGHAN2005_PKU_TEST,  # Conventionally, no devset is used. See Tian et al. (2020).
    save_dir,
    'bert-base-chinese',  # transformer encoder
    max_seq_len=300,
    char_level=True,
    hard_constraint=True,
    sampler_builder=SortingSamplerBuilder(batch_size=32),
    epochs=10,
    adam_epsilon=1e-6,
    warmup_steps=0.1,  # ratio of total steps used for warmup
    weight_decay=0.01,
    word_dropout=0.1,
    seed=1609422632,
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)

Expected behavior
The script should reproduce the published SOTA score (F1 = 96.66%) on the SIGHAN 2005 PKU test set.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04):
  • Python version: 3.8.8
  • HanLP version: 2.1.0b37

Other info / logs
Epoch 1 / 10:
623/623 loss: 1416.3794 P: 59.23% R: 65.39% F1: 62.16% ET: 2 m 6 s
63/63 loss: 451.0509 P: 89.72% R: 89.54% F1: 89.63% ET: 4 s
2 m 9 s / 21 m 33 s ETA: 19 m 24 s (saved)
Epoch 2 / 10:
623/623 loss: 286.6173 P: 31.52% R: 63.34% F1: 42.09% ET: 2 m 8 s
63/63 loss: 486.3771 P: 0.37% R: 8.27% F1: 0.71% ET: 4 s
4 m 22 s / 21 m 50 s ETA: 17 m 28 s (1)
Epoch 3 / 10:
623/623 loss: 211.5965 P: 21.55% R: 61.44% F1: 31.91% ET: 2 m 8 s
63/63 loss: 470.1510 P: 0.39% R: 8.66% F1: 0.75% ET: 4 s
6 m 34 s / 21 m 53 s ETA: 15 m 19 s (2)
Epoch 4 / 10:
623/623 loss: 173.5003 P: 16.42% R: 59.68% F1: 25.75% ET: 2 m 9 s
63/63 loss: 469.0070 P: 0.37% R: 8.32% F1: 0.71% ET: 4 s
8 m 47 s / 21 m 57 s ETA: 13 m 10 s (3)
Epoch 5 / 10:
623/623 loss: 149.7656 P: 13.30% R: 58.04% F1: 21.63% ET: 2 m 9 s
63/63 loss: 482.1117 P: 0.38% R: 8.61% F1: 0.73% ET: 4 s
10 m 59 s / 21 m 59 s ETA: 10 m 59 s (4)
Epoch 6 / 10:
623/623 loss: 131.2736 P: 11.19% R: 56.51% F1: 18.69% ET: 2 m 9 s
63/63 loss: 530.2560 P: 0.36% R: 8.13% F1: 0.69% ET: 4 s
13 m 12 s / 22 m 0 s ETA: 8 m 48 s (5) early stop
Max score of dev is P: 89.72% R: 89.54% F1: 89.63% at epoch 1
Average time of each epoch is 2 m 12 s
13 m 12 s elapsed
P: 89.72% R: 89.54% F1: 89.63%

tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)
Pruned 0 (0.0%) samples out of 2004.
63/63 loss: 451.0509 P: 89.72% R: 89.54% F1: 89.63% ET: 4 s
speed: 531 samples/second
(P: 89.72% R: 89.54% F1: 89.63%, (451.0508732871404, P: 89.72% R: 89.54% F1: 89.63%))

  • I've completed this form and searched the web for solutions.
@hankcs hankcs removed their assignment Aug 11, 2022
@hankcs hankcs closed this as completed Aug 11, 2022
Repository owner locked as spam and limited conversation to collaborators Aug 11, 2022
Repository owner unlocked this conversation Aug 11, 2022
@hankcs hankcs reopened this Aug 11, 2022
@hankcs
Owner

hankcs commented Aug 11, 2022

First response: this script targets the early release 2.1.0-alpha.0. The trained SOTA model and its training log are publicly available for download: https://od.hankcs.com/hanlp/data/tok/sighan2005_pku_bert_base_zh_20201231_141130.zip

After installing with pip install hanlp==2.1.0-alpha.0, I can only reach P: 96.91% R: 96.09% F1: 96.50% for now, which is presumably related to third-party libraries such as transformers. Investigating.
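
For reference, here is a minimal sketch for evaluating the released model directly instead of retraining it (the extraction path below is hypothetical; it assumes the zip above has been unpacked under data/model/cws/):

from hanlp.components.tokenizers.transformer import TransformerTaggingTokenizer
from hanlp.datasets.tokenization.sighan2005.pku import SIGHAN2005_PKU_TEST

# Hypothetical path: wherever the downloaded zip was extracted.
save_dir = 'data/model/cws/sighan2005_pku_bert_base_zh_20201231_141130'
tokenizer = TransformerTaggingTokenizer()
tokenizer.load(save_dir)  # restore the saved config and weights
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)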

@hankcs
Owner

hankcs commented Aug 11, 2022

Successfully reproduced. It is indeed related to the versions of the third-party libraries; the following versions need to be installed:

!pip install transformers==3.5.1
!pip install torch==1.6.0
!pip install hanlp==2.1.0-alpha.0
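
As a quick sanity check (a minimal sketch; it only assumes the standard __version__ attributes these libraries expose), you can confirm the pinned versions are the ones actually imported before training:

import torch
import transformers
import hanlp

# Verify the runtime picked up the pinned versions.
print('transformers:', transformers.__version__)  # expect 3.5.1
print('torch:', torch.__version__)  # expect 1.6.0
print('hanlp:', hanlp.__version__)  # expect 2.1.0-alpha.0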

Then running the script reproduces the result. The log:

Model built with 102270724/102270724 trainable/total parameters.
Using GPUs: [1]
19922/2004 samples in trn/dev set.
Epoch 1 / 3:
623/623 loss: 1016.5334 P: 84.75% R: 83.30% F1: 84.02% ET: 1 m 58 s
  63/63 loss: 348.9937 P: 96.49% R: 95.60% F1: 96.04% ET: 4 s
2 m 2 s / 6 m 6 s ETA: 4 m 4 s (saved)
Epoch 2 / 3:
623/623 loss: 244.4627 P: 90.65% R: 89.83% F1: 90.24% ET: 2 m 0 s
  63/63 loss: 355.4353 P: 96.61% R: 96.33% F1: 96.47% ET: 4 s
4 m 6 s / 6 m 10 s ETA: 2 m 3 s (saved)
Epoch 3 / 3:
623/623 loss: 187.5929 P: 92.85% R: 92.28% F1: 92.57% ET: 1 m 57 s
  63/63 loss: 341.4750 P: 96.93% R: 96.39% F1: 96.66% ET: 4 s
6 m 8 s / 6 m 8 s ETA: 0 s (saved)
Max score of dev is P: 96.93% R: 96.39% F1: 96.66% at epoch 3
Average time of each epoch is 2 m 3 s
6 m 8 s elapsed
63/63 loss: 341.4750 P: 96.93% R: 96.39% F1: 96.66% ET: 4 s
speed: 568 samples/second
Model saved in data/model/cws/sighan2005_pku_bert_base_96.66

You can also reproduce the experiment on Colab: https://colab.research.google.com/drive/12w6qmHg0xyrvnRHOE7oTehRRD_5ZCBlI?usp=sharing
