CSV dictionaries cannot contain commas #1785

Closed
1 task done
elonzh opened this issue Sep 15, 2022 · 1 comment

elonzh commented Sep 15, 2022

Describe the bug

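When a CSV file is used as a dictionary, loading fails if any entry contains a comma.

Judging from the code, HanLP simply splits each line on commas and does not handle CSV escaping.

In CSV, a field that contains a `"` or `,` character is escaped by enclosing the whole field in double quotes.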
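To make the difference concrete, here is a minimal sketch contrasting a plain comma split with a quote-aware split of the same line. `CsvLineDemo` and `parseCsvLine` are hypothetical names written for this illustration and are not HanLP code; the sketch assumes a record never spans multiple lines and treats `""` inside a quoted field as a literal quote.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; this is not HanLP code.
class CsvLineDemo {

    // Splits one CSV line into fields, honoring double-quoted fields
    // and the "" escape for a literal quote inside a quoted field.
    // Assumption: a record never spans more than one line.
    static List<String> parseCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: literal "
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are kept verbatim here
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"3L: Language, Linguistics, Literature\"";
        // Plain split: the quoted entry is cut into three pieces,
        // one of which is " Linguistics".
        System.out.println(Arrays.toString(line.split(",")));
        // Quote-aware split: the entry stays a single field.
        System.out.println(parseCsvLine(line));
    }
}
```

The plain split yields a second piece of ` Linguistics`, which is the same string reported in the `NumberFormatException` below.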

Code to reproduce the issue

Save the following text as-is to a csv file and load it as a dictionary.

19th century music
20 century British history
21st Century Music
21st century science & technology
2D Materials
3 Biotech
3D Printing and Additive Manufacturing
3D Printing in Medicine
3D Research
"3L: Language, Linguistics, Literature"

Describe the current behavior

Exception in thread "main" java.lang.NumberFormatException: For input string: " Linguistics"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.base/java.lang.Integer.parseInt(Integer.java:638)
	at java.base/java.lang.Integer.parseInt(Integer.java:770)
	at com.hankcs.hanlp.corpus.io.IOUtil.loadDictionary(IOUtil.java:794)
	at com.hankcs.hanlp.corpus.io.IOUtil.loadDictionary(IOUtil.java:752)
	at com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment.<init>(DoubleArrayTrieSegment.java:68)
	at org.grobid.core.lexicon.DictSegmenterKt.main(DictSegmenter.kt:6)
	at org.grobid.core.lexicon.DictSegmenterKt.main(DictSegmenter.kt)

Expected behavior

The dictionary loads normally.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.1 LTS
  • HanLP version: com.hankcs:hanlp:portable-1.8.3
  • I've completed this form and searched the web for solutions.

hankcs (Owner) commented Sep 15, 2022

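Thanks for the feedback. CSV escaping has many tedious details that I don't plan to spend time implementing; please refer to the commit above and use the TSV format instead.
If you still have problems, feel free to reopen the issue.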
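As a rough illustration of why a tab-separated file avoids the problem, the sketch below splits lines on tab characters only, so commas inside an entry need no escaping. `TsvLineDemo` and `splitTsvLine` are hypothetical names, the extra `n` and `1` columns are made-up values for the example, and the sketch assumes entries never contain a tab. How HanLP itself selects the delimiter is defined by the referenced commit and is not reproduced here.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: not part of HanLP. The "n" and "1" columns
// below are made-up values used to show a multi-column row.
class TsvLineDemo {

    // Splits a TSV line on tab characters; commas are ordinary text.
    // The -1 limit keeps trailing empty columns.
    // Assumption: no field contains a tab character.
    static List<String> splitTsvLine(String line) {
        return Arrays.asList(line.split("\t", -1));
    }

    public static void main(String[] args) {
        String wordOnly = "3L: Language, Linguistics, Literature";
        String withColumns = "3L: Language, Linguistics, Literature\tn\t1";
        // A word-only line remains a single field despite its commas.
        System.out.println(splitTsvLine(wordOnly));
        // With extra columns, the first field is still the whole entry.
        System.out.println(splitTsvLine(withColumns).get(0));
    }
}
```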

hankcs closed this as completed Sep 15, 2022