NLTK word tokenizer strange behaviour #1699
Rather strange bug:

>>> s = '{}. Young c Moin Khan b Wasim 0'
>>> for i in range(65, 65+26):
... word_tokenize(s.format(chr(i)))
...
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['B', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['C.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['D.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['E.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['F.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['G.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['H.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['I', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['J', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['K.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['L.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['M.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['N.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['O', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['P.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Q', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['R.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['S.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['T.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['U', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['V.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['W.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['X', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Y', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Z', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

Without caps after the fullstop, it sort of gives the expected behavior of the TreebankWordTokenizer, e.g. https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L39

>>> s = '{}. young c Moin Khan b Wasim 0'
>>> for i in range(65, 65+26):
... word_tokenize(s.format(chr(i)))
...
['A.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['B.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['C.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['D.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['E.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['F.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['G.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['H.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['I.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['J.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['K.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['L.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['M.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['N.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['O.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['P.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Q.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['R.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['S.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['T.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['U.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['V.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['W.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['X.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Y.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
['Z.', 'young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
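The joined-vs-split period can be traced to the Treebank tokenizer's handling of a sentence-final period. The regex below is an illustrative sketch modeled on the treebank.py line linked above (the exact pattern may differ between NLTK versions): a period is only padded with a preceding space, and thus split into its own token, when it is the last character of the string being tokenized.

```python
import re

# Illustrative version of the Treebank tokenizer's sentence-final
# period rule (the real pattern lives in nltk/tokenize/treebank.py and
# may vary by version): pad a string-final '.' with a preceding space.
FINAL_PERIOD = re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$')

def pad_final_period(text):
    """Separate a period only when it ends the string."""
    return FINAL_PERIOD.sub(r'\1 \2\3 ', text)

print(repr(pad_final_period('A.')))        # "'A . '" -> '.' split off
print(repr(pad_final_period('A. Young')))  # "'A. Young'" -> unchanged, period is mid-string
```

So the period after a lone capital is only split off when that capital plus period forms the whole string, which is why sentence splitting (discussed below in the thread) matters.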
It's the sentence tokenizer that word_tokenize runs first that causes this:

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> from nltk import word_tokenize, sent_tokenize
>>> s = 'A. Young c Moin Khan b Wasim 0' # Causes unexpected splitting of "A ."
# Unexpected.
>>> word_tokenize(s)
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
# Expected.
>>> tbt = TreebankWordTokenizer()
>>> tbt.tokenize(s)
['A.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
# Sent tokenize splits it into 2 sentences.
>>> sent_tokenize(s)
['A.', 'Young c Moin Khan b Wasim 0']

So when the input is split into 2 sentences, the final-period regex in the TreebankWordTokenizer kicks in on the one-token sentence 'A.': it pads the final fullstop with a preceding space, producing the separate '.' token.

To verify the explanation above, note that sent_tokenize does not split the 'C.' variant:

>>> s = 'C. Young c Moin Khan b Wasim 0'
>>> sent_tokenize(s)
['C. Young c Moin Khan b Wasim 0']
>>>
>>> len(sent_tokenize(s))
1

This split most probably happens because, when the punkt sentence tokenizer was trained, it recognized some single capitals preceding a fullstop as likely abbreviations while others were not; the latter are treated as sentence boundaries and split.
Since this is idiosyncratic to your data, I think this is not a bug but a quaint feature of the Kiss and Strunk (2006) Punkt tokenizer =)

I've added a feature to address this; in the latest commit of #1710, it'll allow keeping such tokens intact. This is done by skipping the sentence-splitting step before word tokenization.

@grafael Thanks for catching this!
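For readers landing here later: in released NLTK versions, skipping the sentence splitter is exposed as the `preserve_line` keyword of `word_tokenize` (the name comes from the current NLTK API, not from this thread; available since NLTK 3.2.5):

```python
from nltk import word_tokenize

s = 'A. Young c Moin Khan b Wasim 0'

# preserve_line=True skips the punkt sentence-splitting step, so the
# word tokenizer sees the whole line; the period after 'A' is then
# mid-string and stays attached.
tokens = word_tokenize(s, preserve_line=True)
print(tokens)  # ['A.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
```

With the default `preserve_line=False`, sentence splitting runs first and reproduces the 'A', '.' split shown at the top of the thread.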
Given the sentence:

word = "B. Young c Moin Khan b Wasim 0"

if we call nltk.word_tokenize(word), we'll get the following tokens:

['B', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

But if we change the sentence to:

word = "C. Young c Moin Khan b Wasim 0"

the tokens will be:

['C.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

The char '.' is now joined to the letter 'C'.

What is stranger is that the 'B' being split from '.' only seems to occur when the next word is 'Young'; I tried several other words and none appears to reproduce the error.