
Better Treebank tokenizer #1710

Merged
merged 22 commits into from
May 5, 2017
Conversation

@alvations (Contributor) commented May 4, 2017

Building on #1214, this PR:

  • Adds detokenization by reversing the tokenizer's regex operations
  • Adds post-hoc mapping of tokens to the original sentence; the alignment may not be 100% correct, but no counterexamples are known yet (cf. Ptb tokenize with offsets #1190)
  • Allows the MXPOST parentheses that PTB uses (also in the original sed Treebank tokenizer)
  • Adds a parameter to word_tokenize to preserve the sentence without calling the sent_tokenize step
  • (Moonshot) keeps explicit offsets during tokenization
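The first point, detokenization by reversing the tokenizer's regex operations, can be sketched in miniature. This is a simplified toy illustration of the idea, not NLTK's actual implementation:

```python
import re

def toy_detokenize(tokens):
    """Toy Treebank-style detokenizer: join tokens on spaces, then
    reverse a few of the tokenizer's regex splits. A simplified
    sketch, not NLTK's actual detokenizer."""
    text = " ".join(tokens)
    # Reattach contractions that the tokenizer split off, e.g. "ca n't" -> "can't".
    text = re.sub(r" n't", "n't", text)
    text = re.sub(r" ('s|'m|'re|'ve|'ll|'d)", r"\1", text)
    # Remove the space the tokenizer inserted before punctuation.
    text = re.sub(r" ([,.;:!?])", r"\1", text)
    return text

print(toy_detokenize(['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']))
# -> hi, my name can't hello,
```

A real detokenizer needs one reversal rule per tokenizer rule (quotes, parentheses, currency symbols, etc.), which is why the PR implements it as a full regex pipeline rather than a handful of substitutions.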

This PR should resolve:

#948

>>> from nltk import word_tokenize
>>> word_tokenize("hi, my name can't hello,")
['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']

#916

>>> word_tokenize(u'«Esto es un ejemplo de cómo se suele hacer una cita literal en español».')
[u'\xab', u'Esto', u'es', u'un', u'ejemplo', u'de', u'c\xf3mo', u'se', u'suele', u'hacer', u'una', u'cita', u'literal', u'en', u'espa\xf1ol', u'\xbb', u'.']

#1699

>>> from nltk import word_tokenize
>>> s = 'A. Young c Moin Khan b Wasim 0'
>>> word_tokenize(s)
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
>>> word_tokenize(s, preserve_line=True)
['A.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

Also, the detokenizer and align_tokens address #1217 and #1190.

And the span_tokenize duck-typed function built on align_tokens should also address #1054, #131 and #78.
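The idea behind aligning tokens back to character spans can be sketched as a left-to-right scan over the original sentence. This toy function illustrates the concept; it is not NLTK's align_tokens and, unlike the real thing, makes no allowance for tokens the tokenizer has rewritten:

```python
def toy_align_tokens(tokens, sentence):
    """Map each token to its (start, end) character span in the
    original sentence by scanning left to right from the end of
    the previous match. A sketch of the alignment idea only."""
    spans, point = [], 0
    for token in tokens:
        start = sentence.index(token, point)  # ValueError if token not found verbatim
        point = start + len(token)
        spans.append((start, point))
    return spans

print(toy_align_tokens(['hi', ',', 'my', 'name'], 'hi, my name'))
# -> [(0, 2), (2, 3), (4, 6), (7, 11)]
```

Each span recovers its token, i.e. `sentence[start:end] == token`, which is exactly the property a span_tokenize method needs to provide.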

@alvations (Contributor, Author) commented:

@leondz Sorry for hijacking the PR; I've added the code from #1190 along with a regex detokenizer. The credit for the align_tokens function should go to you, though.

@stevenbird stevenbird self-assigned this May 5, 2017
@alvations (Contributor, Author) commented:

The PR is ready for review now =)

@stevenbird stevenbird merged commit 54d4457 into nltk:develop May 5, 2017
@stevenbird (Member) commented:

Thanks for resolving all these long-outstanding problems @alvations.
