Added three tokenizers, one detokenizer and two tokenizer-related word/char list corpora #1282

Merged
26 commits merged into nltk:develop on Jun 20, 2016

Conversation

@alvations (Contributor) commented Feb 4, 2016

Responding to #1214, here are 3 new tokenizers to add to NLTK. There's still much to do for #1214, though.

  • Added a wrapper for the REPP tokenizer (http://moin.delph-in.net/ReppTop)
    • added 3rd-party installation instructions to https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software#repp-tokenizer
    • the wrapper works by writing the list of sentences to a text file, calling the tokenizer from the shell, and parsing the output to retrieve the lists of tokens (see the sketch after this list)
    • although it is possible to pipe the input strings to REPP instead of writing them to a file first, that creates locale problems when the machine isn't set to UTF-8, so it's safer to control the file read/write and make sure REPP is always fed UTF-8 input
    • it has a tokenize_sents() in addition to the standard tokenize()
  • Added a Python port of tok-tok.pl (https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl)
    • Note that for the doctests to work seamlessly across Python 2 and Python 3, the outputs are checked with assert instead of the standard string comparison, since the inputs can contain UTF-8 characters/strings that cannot be printed consistently (STDOUT depends on the user's/server's locale settings).
  • Added a Python port of the Moses tokenizer
    • one known issue is how the Moses tokenizer handles URLs

  • Added a Python port of the Moses detokenizer
    • one known issue is that the Moses detokenizer ignores strange Unicode punctuation
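For illustration, here is a minimal sketch (Python 3) of that file-based approach, assuming a compiled repp binary and a REPP config file are installed; the helper name, the -c flag and the output parsing are my assumptions, not the PR's actual ReppTokenizer code:

# Illustrative sketch only -- not the PR's ReppTokenizer.
import os
import subprocess
import tempfile


def repp_tokenize_sents(sentences, repp_bin="repp", repp_config="erg/repp.set"):
    """Tokenize sentences by shelling out to REPP (hypothetical helper)."""
    # Write the sentences to a temporary file, always as UTF-8,
    # so the result does not depend on the machine's locale.
    with tempfile.NamedTemporaryFile("w", encoding="utf8", suffix=".txt",
                                     delete=False) as fout:
        fout.write("\n".join(sentences))
        input_path = fout.name
    try:
        # Call the tokenizer from the shell and capture its stdout.
        raw = subprocess.check_output(
            [repp_bin, "-c", repp_config, input_path]).decode("utf8")
    finally:
        os.remove(input_path)
    # Assumes REPP's default output: one tokenized sentence per line,
    # tokens separated by spaces.
    return [line.split() for line in raw.splitlines() if line.strip()]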

There are also two new wordlist and character-list corpora added to ease the porting of other tokenizers.

  • Added the nonbreaking_prefixes wordlist from Moses SMT https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes.
    • Note that there is an extra clause added to the WordListCorpusReader object that allows us to specify what kind of lines to ignore, e.g. lines starting with // or # or \n. Since the default WordListCorpusReader ignores blank lines, the default value of the ignore_lines_startswith parameter is \n. (A rough sketch follows this list.)
  • Added the perluniprops character lists from http://perldoc.perl.org/perluniprops.html; these are very useful lists of characters when porting Perl tokenizers to Python, especially Moses SMT's tokenize.pl and cdec SMT's tokenize-anything.sh.
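As a rough sketch of that filtering clause (not necessarily the PR's exact code; the class name here is hypothetical), the words() method might look like this:

# Hypothetical reader sketch: skip lines starting with a given prefix,
# defaulting to '\n' so that, as before, only blank lines are ignored.
from nltk.corpus.reader.wordlist import WordListCorpusReader
from nltk.tokenize import line_tokenize


class IgnoringWordListCorpusReader(WordListCorpusReader):
    def words(self, fileids=None, ignore_lines_startswith='\n'):
        # e.g. pass '#' or '//' to drop comment lines from the word lists
        return [line for line in line_tokenize(self.raw(fileids))
                if not line.startswith(ignore_lines_startswith)]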

@nschneid (Contributor) commented Feb 4, 2016

Thanks! Is it possible to have a documentation blurb discussing the differences between the tokenizers available in NLTK? Languages supported, assumptions about input (already sentence-split?), configurable options, maybe some example sentences that yield different results. Having this all in one place would make it easier for users to choose a tokenizer.

@alvations (Contributor Author)

@nschneid where should we document that? In nltk.tokenize.__init__.py? On a GitHub wiki page? On a howto page? Or in all of them?

At some point, it should be on the main NLTK book (http://www.nltk.org/book/) too.

@nschneid (Contributor) commented Feb 4, 2016

I don't actually know. Maybe somebody else has an opinion.

@alvations (Contributor Author)

@nschneid let's wait for all the tokenizers to be stable and do the comparison in another PR. There are still a few more to go before we have a full comparison of well-known tokenizers.

@stevenbird self-assigned this on Feb 6, 2016
@stevenbird (Member)

I've been migrating general-purpose user-level documentation to module docstrings for ease of discovery (via the help command and the API docs), and preferring to keep the test/* files for regression testing.

@alvations (Contributor Author)

I've sorted the tokenizers in alphabetical order when adding toktok and repp to nltk.tokenize.__init__.py.

@alvations changed the title from "Added two tokenizers" to "Added two tokenizers and two tokenizer-related word/char list corpora" on Feb 16, 2016
@alvations (Contributor Author)

  • Added the nonbreaking_prefixes wordlist from Moses SMT https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes.
    • Note that there is an extra clause added to the WordListCorpusReader object that allows us to specify what kind of lines to ignore, e.g. lines starting with // or # or \n. Since the default WordListCorpusReader ignores blank lines, the default value of the ignore_lines_startswith parameter is \n.
  • Added the perluniprops character lists from http://perldoc.perl.org/perluniprops.html; these are very useful lists of characters when porting Perl tokenizers to Python, especially Moses SMT's tokenize.pl and cdec SMT's tokenize-anything.sh.

The zipballs for the two can be downloaded at:

@stevenbird I've tested the new WordListCorpusReader that I've added with the zip files extracted. It also works with NLTK automatically extracting the .zip file through the LazyCorpusLoader.


I'm not sure where to put these tests though:

>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'Adj', u'Adm', u'Adv', u'Asst', u'Bart', u'Bldg', u'Brig', u'Bros', u'Capt', u'Cmdr', u'Col', u'Comdr', u'Con', u'Corp', u'Cpl', u'DR', u'Dr', u'Drs', u'Ens', u'Gen', u'Gov', u'Hon', u'Hr', u'Hosp', u'Insp', u'Lt', u'MM', u'MR', u'MRS', u'MS', u'Maj', u'Messrs', u'Mlle', u'Mme', u'Mr', u'Mrs', u'Ms', u'Msgr', u'Op', u'Ord', u'Pfc', u'Ph', u'Prof', u'Pvt', u'Rep', u'Reps', u'Res', u'Rev', u'Rt', u'Sen', u'Sens', u'Sfc', u'Sgt', u'Sr', u'St', u'Supt', u'Surg', u'v', u'vs', u'i.e', u'rev', u'e.g', u'No #NUMERIC_ONLY# ', u'Nos', u'Art #NUMERIC_ONLY#', u'Nr', u'pp #NUMERIC_ONLY#', u'Jan', u'Feb', u'Mar', u'Apr', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']
>>> nbp.words('english')
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'Adj', u'Adm', u'Adv', u'Asst', u'Bart', u'Bldg', u'Brig', u'Bros', u'Capt', u'Cmdr', u'Col', u'Comdr', u'Con', u'Corp', u'Cpl', u'DR', u'Dr', u'Drs', u'Ens', u'Gen', u'Gov', u'Hon', u'Hr', u'Hosp', u'Insp', u'Lt', u'MM', u'MR', u'MRS', u'MS', u'Maj', u'Messrs', u'Mlle', u'Mme', u'Mr', u'Mrs', u'Ms', u'Msgr', u'Op', u'Ord', u'Pfc', u'Ph', u'Prof', u'Pvt', u'Rep', u'Reps', u'Res', u'Rev', u'Rt', u'Sen', u'Sens', u'Sfc', u'Sgt', u'Sr', u'St', u'Supt', u'Surg', u'v', u'vs', u'i.e', u'rev', u'e.g', u'No #NUMERIC_ONLY# ', u'Nos', u'Art #NUMERIC_ONLY#', u'Nr', u'pp #NUMERIC_ONLY#', u'Jan', u'Feb', u'Mar', u'Apr', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']
>>> print "".join(nbp.words('tamil'))
அஆஇஈஉஊஎஏஐஒஓஔஃககாகிகீகுகூகெகேகைகொகோகௌக்சசாசிசீசுசூசெசேசைசொசோசௌச்டடாடிடீடுடூடெடேடைடொடோடௌட்ததாதிதீதுதூதெதேதைதொதோதௌத்பபாபிபீபுபூபெபேபைபொபோபௌப்றறாறிறீறுறூறெறேறைறொறோறௌற்யயாயியீயுயூயெயேயையொயோயௌய்ரராரிரீருரூரெரேரைரொரோரௌர்லலாலிலீலுலூலெலேலைலொலோலௌல்வவாவிவீவுவூவெவேவைவொவோவௌவ்ளளாளிளீளுளூளெளேளைளொளோளௌள்ழழாழிழீழுழூழெழேழைழொழோழௌழ்ஙஙாஙிஙீஙுஙூஙெஙேஙைஙொஙோஙௌங்  ஞஞாஞிஞீஞுஞூஞெஞேஞைஞொஞோஞௌஞ் ணணாணிணீணுணூணெணேணைணொணோணௌண்நநாநிநீநுநூநெநேநைநொநோநௌந்  மமாமிமீமுமூமெமேமைமொமோமௌம்     னனானினீனுனூனெனேனைனொனோனௌன்திருதிருமதிவணகௌரவஉ.ம்No #NUMERIC_ONLY# NosArt #NUMERIC_ONLY#Nrpp #NUMERIC_ONLY#

and

>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')
[u'(', u'[', u'{', u'\u0f3a', u'\u0f3c', u'\u169b', u'\u201a', u'\u201e', u'\u2045', u'\u207d', u'\u208d', u'\u2329', u'\u2768', u'\u276a', u'\u276c', u'\u276e', u'\u2770', u'\u2772', u'\u2774', u'\u27c5', u'\u27e6', u'\u27e8', u'\u27ea', u'\u27ec', u'\u27ee', u'\u2983', u'\u2985', u'\u2987', u'\u2989', u'\u298b', u'\u298d', u'\u298f', u'\u2991', u'\u2993', u'\u2995', u'\u2997', u'\u29d8', u'\u29da', u'\u29fc', u'\u2e22', u'\u2e24', u'\u2e26', u'\u2e28', u'\u3008', u'\u300a', u'\u300c', u'\u300e', u'\u3010', u'\u3014', u'\u3016', u'\u3018', u'\u301a', u'\u301d', u'\ufd3e', u'\ufe17', u'\ufe35', u'\ufe37', u'\ufe39', u'\ufe3b', u'\ufe3d', u'\ufe3f', u'\ufe41', u'\ufe43', u'\ufe47', u'\ufe59', u'\ufe5b', u'\ufe5d', u'\uff08', u'\uff3b', u'\uff5b', u'\uff5f', u'\uff62']
>>> "".join(pup.chars('Open_Punctuation'))
u'([{\u0f3a\u0f3c\u169b\u201a\u201e\u2045\u207d\u208d\u2329\u2768\u276a\u276c\u276e\u2770\u2772\u2774\u27c5\u27e6\u27e8\u27ea\u27ec\u27ee\u2983\u2985\u2987\u2989\u298b\u298d\u298f\u2991\u2993\u2995\u2997\u29d8\u29da\u29fc\u2e22\u2e24\u2e26\u2e28\u3008\u300a\u300c\u300e\u3010\u3014\u3016\u3018\u301a\u301d\ufd3e\ufe17\ufe35\ufe37\ufe39\ufe3b\ufe3d\ufe3f\ufe41\ufe43\ufe47\ufe59\ufe5b\ufe5d\uff08\uff3b\uff5b\uff5f\uff62'
>>> print "".join(pup.chars('Open_Punctuation'))
([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「
>>> pup.chars('Currency_Symbol')
[u'$', u'\xa2', u'\xa3', u'\xa4', u'\xa5', u'\u058f', u'\u060b', u'\u09f2', u'\u09f3', u'\u09fb', u'\u0af1', u'\u0bf9', u'\u0e3f', u'\u17db', u'\u20a0', u'\u20a1', u'\u20a2', u'\u20a3', u'\u20a4', u'\u20a5', u'\u20a6', u'\u20a7', u'\u20a8', u'\u20a9', u'\u20aa', u'\u20ab', u'\u20ac', u'\u20ad', u'\u20ae', u'\u20af', u'\u20b0', u'\u20b1', u'\u20b2', u'\u20b3', u'\u20b4', u'\u20b5', u'\u20b6', u'\u20b7', u'\u20b8', u'\u20b9', u'\u20ba', u'\ua838', u'\ufdfc', u'\ufe69', u'\uff04', u'\uffe0', u'\uffe1', u'\uffe5', u'\uffe6']
>>> print "".join(pup.chars('Currency_Symbol'))
$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩

Should these go into nltk.test, or should they be doctests in the class docstrings, like in other parts of the code?

@alvations changed the title from "Added two tokenizers and two tokenizer-related word/char list corpora" to "Added three tokenizers and two tokenizer-related word/char list corpora" on Feb 17, 2016
@alvations (Contributor Author)

  • Added a Python port of the Moses tokenizer
    • One known issue is how the Moses tokenizer handles URLs.

Given the input:

Is 9.5 or 525,600 my favorite number?
The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
This, is a sentence with weird» symbols… appearing everywhere¿

The output from NLTK's Python port of the Moses tokenizer:

Is 9.5 or 525,600 my favorite number ?
The https : / / github.com / jonsafari / tok-tok / blob / master / tok-tok.pl is a website with / and / or slashes and sort of weird : things
This , is a sentence with weird » symbols … appearing everywhere ¿

The output from the original Moses tokenizer in Perl:

Is 9.5 or 525,600 my favorite number ?
The https : / / github.com / jonsafari / tok-tok / blob / master / tok-tok.pl is a website with / and / or slashes and sort of weird : things
This , is a sentence with weird » symbols … appearing everywhere ¿
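For reference, a usage sketch that reproduces the first line of the comparison above; it assumes the module and class names added by this PR (nltk.tokenize.moses.MosesTokenizer, with tokenize() returning a list of tokens):

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer()
>>> print(" ".join(moses.tokenize(u"Is 9.5 or 525,600 my favorite number?")))
Is 9.5 or 525,600 my favorite number ?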

@stevenbird (Member)

@alvations – the main guidance re doctests is to put anything that would count as informative documentation for users in a method, class, or module docstring (where it can be found using help or via the online API docs), and to put regression tests in test/*.doctest.

@stevenbird (Member)

@alvations – would you like me to add those files to the NLTK data collection?

@alvations (Contributor Author)

Yes, could you help add those 2 files to the NLTK data collection? Thank you.

I'll add the documentation to the nonbreaking_prefixes.words() and perluniprops.chars() method docstrings.

@stevenbird (Member)

Sorry for the delay @alvations – I hope to do this tomorrow.

@stevenbird added this to the 3.2 milestone on Feb 28, 2016
@stevenbird added a commit to nltk/nltk_data that referenced this pull request on Mar 2, 2016
@stevenbird (Member)

@alvations – now that the data is in the right place, would you mind updating your PR please?

Syncing with bleeding edge develop branch
[…] subdirectories that are not nltk_data/corpora.

- Modified the perluniprops lazycorpusloader to find the zipball in
nltk_data/misc using the nltk_data_subdir parameter in the new
lazycorpusloader.
@alvations (Contributor Author)

@stevenbird when making perluniprops load from nltk_data/misc instead of nltk_data/corpora, I had to use kwargs and, while doing so, introduced the self.subdir property in LazyCorpusLoader.
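For illustration, the loader declaration might look roughly like this; the reader class and file pattern below are placeholders, and only the nltk_data_subdir keyword follows the comment above:

# Illustrative only -- placeholder reader class and file pattern.
from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import WordListCorpusReader

perluniprops = LazyCorpusLoader(
    'perluniprops', WordListCorpusReader, r'[a-zA-Z_]+\.txt',
    nltk_data_subdir='misc')  # look under nltk_data/misc, not nltk_data/corpora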

In the process, I've also changed the "x %s y" % "foobar" to "x {} y".format("foobar") in the LazyCorpusLoader.__load(). Is that alright?

I've tested nltk/corpus/reader/wordlist.py, nltk/tokenize/toktok.py and nltk/tokenize/moses.py; they pass the doctests.

@moses-smt, @jonsafari did we miss anything? Or do you have some comments/suggestions to make to the NLTK port of your tokenizer?

@stevenbird , @nschneid the code should be good for review now =)

@stevenbird modified the milestones: 3.2, 3.2.1 on Mar 5, 2016
Syncing with bleeding edge develop branch
@alvations changed the title from "Added three tokenizers and two tokenizer-related word/char list corpora" to "Added three tokenizers, one detokenizer and two tokenizer-related word/char list corpora" on May 18, 2016
@alvations (Contributor Author)

Added the Python port of the Moses detokenizer.
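A usage sketch, mirroring the tokenizer example earlier in this thread; the MosesDetokenizer class name and the detokenize() signature are my assumptions about the PR's nltk.tokenize.moses module:

>>> from nltk.tokenize.moses import MosesDetokenizer
>>> detok = MosesDetokenizer()
>>> print(detok.detokenize([u'This', u',', u'is', u'a', u'sentence', u'.'], return_str=True))
This, is a sentence.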

@stevenbird (Member)

Thanks @alvations. This looks good to me now.

@stevenbird merged commit 83c6700 into nltk:develop on Jun 20, 2016