punkt.PunktSentenceTokenizer() for Chinese #1824

Open
xrtang opened this Issue Sep 4, 2017 · 8 comments


xrtang commented Sep 4, 2017

I use the following code to train punkt for Chinese, but it doesn't produce the desired result:

input_str_cn = "台湾之所以出现这种危机,是台湾不但长年低薪,且不知远景在哪里。20世纪90年代,台湾的大学毕业生起薪不到新台币3万元(约合人民币6594元),到了今天,依然如此。"

# import punkt
import nltk.tokenize.punkt

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in training corpus 
import codecs

train_file = "D:/CL/comp_ling/data/dushu_1999_2008/1999.txt"
text = codecs.open(train_file, "r", "gb18030").read()

# Train tokenizer
tokenizer.sent_end_chars = ('。', '！', '？', '…')  # Chinese sentence-ending punctuation
for sent_end in tokenizer.sent_end_chars:
    print(sent_end)
tokenizer.train(text)

# Dump pickled tokenizer
import pickle
out = open("chinese.pickle","wb")
pickle.dump(tokenizer, out)
out.close()

# To use the tokenizer
with open("chinese.pickle", "rb") as infile:  # binary mode for pickle
    tokenizer_new = pickle.load(infile)
sents = tokenizer_new.tokenize(input_str_cn)
for s in sents:
    print(s)

The produced result is as follows:

"台湾之所以出现这种危机,是台湾不但长年低薪,且不知远景在哪里。20世纪90年代,台湾的大学毕业生起薪不到新台币3万元(约合人民币6594元),到了今天,依然如此。"

It seems that sent_end_chars does not work here. I have checked the encoding; there's no problem with that. Could anyone help with this? Thanks.

jnothman commented Sep 4, 2017

Punkt here only considers a sent_end_char to be a potential sentence boundary if it is followed by either whitespace or punctuation (see _period_context_fmt). The absence of a whitespace character after "。" is sufficient for it to not be picked up.

I have my doubts about the applicability of Punkt to Chinese. Does "。" not deterministically mark the end of a sentence in Chinese? Is it ambiguous? Is it used for abbreviations?
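
For what it's worth, Punkt also reads its sentence-ending characters from a PunktLanguageVars object rather than from an attribute on the tokenizer instance, so the tokenizer.sent_end_chars assignment in the code above has no effect. Below is a minimal sketch of passing custom language vars (the class name and character set are illustrative assumptions, not an official recipe); even so, the default _period_context_fmt still requires whitespace or punctuation after the ending character:

from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

class ChineseLangVars(PunktLanguageVars):
    # Punkt looks up sentence-ending characters here, not on the tokenizer instance
    sent_end_chars = ('。', '！', '？')

tokenizer = PunktSentenceTokenizer(lang_vars=ChineseLangVars())

# With no whitespace after 。, the default period-context regex never proposes
# a boundary, so the text comes back as a single sentence.
print(tokenizer.tokenize('台湾长年低薪。20世纪90年代起薪不到新台币3万元。'))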

alvations added and then removed the bug label Sep 4, 2017

alvations commented Sep 4, 2017

@jnothman the 。 isn't ambiguous and won't be used for abbreviations. You're right that the lack of whitespace after the boundary is causing Punkt to ignore the character.

@xrtang try padding the sentence ending punctuation with spaces before training punkt, e.g.

import re
with codecs.open(train_file, "r", "gb18030") as fin:
    text = re.sub('([!?。])', r'\1 ', fin.read())
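
As a sketch of using this end-to-end (assuming the tokenizer was then trained on the padded text), the same padding would presumably have to be applied to any text passed to tokenize() later, with the inserted spaces stripped from the output:

padded = re.sub('([!?。])', r'\1 ', input_str_cn)
sents = [s.strip() for s in tokenizer.tokenize(padded)]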

xrtang commented Sep 4, 2017

@jnothman @alvations Thank you for the comments. The "。" is mostly deterministic in marking the end of a sentence in Chinese, but there are still occasions where it does not. My idea was to use punkt to resolve these ambiguous cases, but if a space is required for training, it may not be able to do the job. Still, I think the general idea proposed in the paper might work for Chinese.

jnothman commented Sep 4, 2017

I don't think the space is going to make a difference except that our implementation expects it: you can either change _period_context_fmt, or add a space before processing and strip it afterwards. I'd be interested to hear if Punkt resolved any of the ambiguities in Chinese sentence boundaries.
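
For the first option, a rough sketch of what overriding _period_context_fmt in a PunktLanguageVars subclass could look like (the class name and character set are assumptions, and relaxing the regex only removes the whitespace requirement; whether Punkt then finds useful boundaries in Chinese is a separate question):

from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer

class ChineseLangVars(PunktLanguageVars):
    sent_end_chars = ('。', '！', '？')
    # Same as the default pattern, except the whitespace after a candidate
    # sentence ender is optional (\s* instead of \s+), so 。 followed directly
    # by the next character is still considered a potential boundary.
    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            \s*(?P<next_tok>\S+)     # or optional whitespace and some other token
        ))"""

tokenizer = PunktSentenceTokenizer(lang_vars=ChineseLangVars())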

alvations added the tokenizer label Oct 13, 2017

twinkled commented Jun 1, 2018

@jnothman Thanks for your comments. I ran into the same problem today. Could you explain how to set the parameter _period_context_fmt to solve it? I'm new to NLTK, so I'm not very knowledgeable about it.

jnothman commented Jun 2, 2018

echan00 commented Oct 13, 2018

I tried padding the sentence-ending punctuation with spaces before training punkt, but still didn't see the sentences being tokenized properly.

vitaly-zdanevich commented Dec 19, 2018

Sorry, I am not an expert in NLTK - I cannot find chinese.pickle - so is it currently impossible to tokenize Chinese?
