How to preprocess Chinese datasets #3

Closed
renmada opened this issue Feb 18, 2022 · 6 comments

Comments

renmada commented Feb 18, 2022

Should the text be word-segmented?

ljynlp (Owner) commented Feb 18, 2022

No word segmentation is needed; working at the character level is enough.

ljynlp closed this as completed Mar 1, 2022
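@songbaiTalk

> No word segmentation is needed; working at the character level is enough.

What about text that mixes Chinese and English and contains numbers, for example “扎克伯格是facebook的ceo,今年37岁” ("Zuckerberg is the CEO of facebook and is 37 years old this year")? Are facebook and 37 also split into individual array elements, one per character?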

ljynlp reopened this Mar 10, 2022
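ljynlp (Owner) commented Mar 10, 2022

> What about text that mixes Chinese and English and contains numbers, for example “扎克伯格是facebook的ceo,今年37岁” ("Zuckerberg is the CEO of facebook and is 37 years old this year")? Are facebook and 37 also split into individual array elements, one per character?

Yes. The Chinese data we obtained was already split into characters, and there were no spaces between the English words, so for convenience we used the character sequence directly as input. That said, if the input is word-segmented during preprocessing, the results should be better.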
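For readers who want to see what character-level input looks like in practice, here is a minimal sketch. It is not the repository's actual preprocessing code: the helper name `to_char_tokens` is hypothetical, and dropping whitespace is an assumption made here for illustration.

```python
def to_char_tokens(text: str) -> list[str]:
    """Split text into single-character tokens.

    Assumption (not from the repository): whitespace characters are dropped,
    so every remaining character, Chinese or ASCII, becomes one token.
    """
    return [ch for ch in text if not ch.isspace()]


sentence = "扎克伯格是facebook的ceo,今年37岁"
tokens = to_char_tokens(sentence)

# "facebook" becomes 8 separate tokens and "37" becomes 2 separate tokens.
print(tokens)
print(len(tokens))  # 23
```

With this scheme an entity label is expressed over character positions, so "facebook" would cover token positions 5 through 12.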

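@10652835

> Yes. The Chinese data we obtained was already split into characters, and there were no spaces between the English words, so for convenience we used the character sequence directly as input. That said, if the input is word-segmented during preprocessing, the results should be better.

> No word segmentation is needed; working at the character level is enough.

How is "word" handled in the Chinese datasets? Is it obtained by dictionary matching? If so, is the longest matching word chosen?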

nlper01 commented May 18, 2022

> Should the text be word-segmented?

Hello, could you share the preprocessing code for the Chinese datasets? My email is 2674053421@qq.com

renmada closed this as completed May 30, 2022
nlper01 commented Apr 6, 2023

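> What about text that mixes Chinese and English and contains numbers, for example “扎克伯格是facebook的ceo,今年37岁” ("Zuckerberg is the CEO of facebook and is 37 years old this year")? Are facebook and 37 also split into individual array elements, one per character?

> Yes. The Chinese data we obtained was already split into characters, and there were no spaces between the English words, so for convenience we used the character sequence directly as input. That said, if the input is word-segmented during preprocessing, the results should be better.

If the input is word-segmented first, how do I align my labels, which are annotated on individual characters?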
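This question was not answered in the thread. One possible approach, sketched below under stated assumptions, is to convert each character-level label span into a word-level span and to flag spans whose boundaries fall inside a word. The function name `char_span_to_word_span`, the half-open `[start, end)` span convention, and the hand-written segmentation are all hypothetical; nothing here is taken from the repository.

```python
def char_span_to_word_span(words, char_start, char_end):
    """Map a character span [char_start, char_end) to a word span [w_start, w_end).

    Returns None when either boundary falls inside a word, meaning the
    segmentation and the character-level annotation disagree.
    Assumes every word is a non-empty string.
    """
    offset = 0
    w_start = w_end = None
    for i, word in enumerate(words):
        if offset == char_start:
            w_start = i          # span starts exactly where this word starts
        offset += len(word)
        if offset == char_end:
            w_end = i + 1        # span ends exactly where this word ends
    if w_start is None or w_end is None or w_start >= w_end:
        return None
    return (w_start, w_end)


# Hypothetical, hand-written segmentation of the example sentence.
words = ["扎克伯格", "是", "facebook", "的", "ceo", ",", "今年", "37", "岁"]

print(char_span_to_word_span(words, 0, 4))   # characters 0..3 -> word 0
print(char_span_to_word_span(words, 5, 13))  # "facebook" -> word 2
print(char_span_to_word_span(words, 0, 2))   # cuts a word in half -> None
```

Spans that come back as `None` need a decision of their own, for example re-segmenting that sentence or falling back to character-level input for it.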
