
Better Treebank tokenizer #1710

Merged
merged 22 commits into from
May 5, 2017
Conversation

@alvations (Contributor) commented May 4, 2017

Building on #1214, this PR:

  • Adds detokenization by reversing the tokenizer's regex operations
  • Adds post-hoc mapping of tokens to the original sentence; the alignment may not be 100% correct, but no counterexamples are known yet (cf. Ptb tokenize with offsets #1190)
  • Allows the MXPOST parentheses that PTB uses (also in the original sed Treebank tokenizer)
  • Adds a parameter to word_tokenize to preserve the sentence without calling the sent_tokenize step
  • (Moonshot) keeps explicit offsets during tokenization
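The first point, detokenization by reversing the tokenizer's regex operations, can be sketched in miniature. This is a simplified toy illustration of the idea, not NLTK's actual implementation:

```python
import re

def toy_detokenize(tokens):
    """Toy Treebank-style detokenizer: join tokens on spaces, then
    reverse a few of the tokenizer's regex splits. A simplified
    sketch, not NLTK's actual detokenizer."""
    text = " ".join(tokens)
    # Reattach contractions that the tokenizer split off, e.g. "ca n't" -> "can't".
    text = re.sub(r" n't", "n't", text)
    text = re.sub(r" ('s|'m|'re|'ve|'ll|'d)", r"\1", text)
    # Remove the space the tokenizer inserted before punctuation.
    text = re.sub(r" ([,.;:!?])", r"\1", text)
    return text

print(toy_detokenize(['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']))
# -> hi, my name can't hello,
```

A real detokenizer needs one reversal rule per tokenizer rule (quotes, parentheses, currency symbols, etc.), which is why the PR implements it as a full regex pipeline rather than a handful of substitutions.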

This PR should resolve:

#948

>>> from nltk import word_tokenize
>>> word_tokenize("hi, my name can't hello,")
['hi', ',', 'my', 'name', 'ca', "n't", 'hello', ',']

#916

>>> word_tokenize(u'«Esto es un ejemplo de cómo se suele hacer una cita literal en español».')
[u'\xab', u'Esto', u'es', u'un', u'ejemplo', u'de', u'c\xf3mo', u'se', u'suele', u'hacer', u'una', u'cita', u'literal', u'en', u'espa\xf1ol', u'\xbb', u'.']

#1699

>>> from nltk import word_tokenize
>>> s = 'A. Young c Moin Khan b Wasim 0'
>>> word_tokenize(s)
['A', '.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']
>>> word_tokenize(s, preserve_line=True)
['A.', 'Young', 'c', 'Moin', 'Khan', 'b', 'Wasim', '0']

Also, the detokenizer and align_tokens address #1217 and #1190.

And the span_tokenize duck-typed function built on align_tokens should also address #1054, #131 and #78.
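The idea behind aligning tokens back to character spans can be sketched as a left-to-right scan over the original sentence. This toy function illustrates the concept; it is not NLTK's align_tokens and, unlike the real thing, makes no allowance for tokens the tokenizer has rewritten:

```python
def toy_align_tokens(tokens, sentence):
    """Map each token to its (start, end) character span in the
    original sentence by scanning left to right from the end of
    the previous match. A sketch of the alignment idea only."""
    spans, point = [], 0
    for token in tokens:
        start = sentence.index(token, point)  # ValueError if token not found verbatim
        point = start + len(token)
        spans.append((start, point))
    return spans

print(toy_align_tokens(['hi', ',', 'my', 'name'], 'hi, my name'))
# -> [(0, 2), (2, 3), (4, 6), (7, 11)]
```

Each span recovers its token, i.e. `sentence[start:end] == token`, which is exactly the property a span_tokenize method needs to provide.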

@alvations (Contributor, Author) commented:

@leondz Sorry for hijacking the PR; I've added the code from #1190 along with a regex detokenizer. The credit for the align_tokens function should go to you, though.

@stevenbird stevenbird self-assigned this May 5, 2017
@alvations (Contributor, Author) commented:

The PR is ready for review now =)

@stevenbird stevenbird merged commit 54d4457 into nltk:develop May 5, 2017
@stevenbird (Member) commented:

Thanks for resolving all these long-outstanding problems @alvations.
