Added three tokenizers, one detokenizer and two tokenizer-related word/char list corpora #1282

Merged
26 commits merged into nltk:develop on Jun 20, 2016

Conversation

@alvations (Contributor) commented Feb 4, 2016

Responding to #1214, here are 3 new tokenizers to add to NLTK. There's still much to do for #1214, though.

  • Added a wrapper for the REPP tokenizer (http://moin.delph-in.net/ReppTop)
    • added 3rd-party installation instructions to https://github.com/nltk/nltk/wiki/Installing-Third-Party-Software#repp-tokenizer
    • the wrapper works by writing the list of sentences to a text file, calling the tokenizer from the shell, and parsing the output to retrieve the lists of tokens (see the sketch after this list)
    • although it is possible to pipe the input strings to REPP instead of writing them to a file first, that creates locale problems when the machine isn't set to UTF-8, so it's safer to control the file read/write and make sure REPP is always fed UTF-8 input
    • it has a tokenize_sents() in addition to the standard tokenize()
  • Added a Python port of tok-tok.pl (https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl)
    • Note that for the doctests to work seamlessly across Python 2 and Python 3, the outputs are checked with assert instead of the standard string comparison, since the inputs can contain UTF-8 characters/strings that cannot be printed consistently (STDOUT depends on the user's/server's locale settings).
  • Added a Python port of the Moses tokenizer
    • one known issue is how the Moses tokenizer handles URLs

  • Added a Python port of the Moses detokenizer
    • one known issue is that the Moses detokenizer ignores strange Unicode punctuation
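For illustration, here is a minimal sketch (Python 3) of that file-based approach, assuming a compiled repp binary and a REPP config file are installed; the helper name, the -c flag and the output parsing are my assumptions, not the PR's actual ReppTokenizer code:

# Illustrative sketch only -- not the PR's ReppTokenizer.
import os
import subprocess
import tempfile


def repp_tokenize_sents(sentences, repp_bin="repp", repp_config="erg/repp.set"):
    """Tokenize sentences by shelling out to REPP (hypothetical helper)."""
    # Write the sentences to a temporary file, always as UTF-8,
    # so the result does not depend on the machine's locale.
    with tempfile.NamedTemporaryFile("w", encoding="utf8", suffix=".txt",
                                     delete=False) as fout:
        fout.write("\n".join(sentences))
        input_path = fout.name
    try:
        # Call the tokenizer from the shell and capture its stdout.
        raw = subprocess.check_output(
            [repp_bin, "-c", repp_config, input_path]).decode("utf8")
    finally:
        os.remove(input_path)
    # Assumes REPP's default output: one tokenized sentence per line,
    # tokens separated by spaces.
    return [line.split() for line in raw.splitlines() if line.strip()]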

There are also two new wordlist and character-list corpora added to ease the porting of other tokenizers.

  • Added the nonbreaking_prefixes wordlist from Moses SMT https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes.
    • Note that there is an extra clause added to the WordListCorpusReader object that allows us to specify what kind of lines to ignore, e.g. lines starting with // or # or \n. Since the default WordListCorpusReader ignores blank lines, the default value of the ignore_lines_startswith parameter is \n. (A rough sketch follows this list.)
  • Added the perluniprops character lists from http://perldoc.perl.org/perluniprops.html; these are very useful lists of characters when porting Perl tokenizers to Python, especially Moses SMT's tokenize.pl and cdec SMT's tokenize-anything.sh.
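As a rough sketch of that filtering clause (not necessarily the PR's exact code; the class name here is hypothetical), the words() method might look like this:

# Hypothetical reader sketch: skip lines starting with a given prefix,
# defaulting to '\n' so that, as before, only blank lines are ignored.
from nltk.corpus.reader.wordlist import WordListCorpusReader
from nltk.tokenize import line_tokenize


class IgnoringWordListCorpusReader(WordListCorpusReader):
    def words(self, fileids=None, ignore_lines_startswith='\n'):
        # e.g. pass '#' or '//' to drop comment lines from the word lists
        return [line for line in line_tokenize(self.raw(fileids))
                if not line.startswith(ignore_lines_startswith)]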

@nschneid (Contributor) commented Feb 4, 2016

Thanks! Is it possible to have a documentation blurb discussing the differences between the tokenizers available in NLTK? Languages supported, assumptions about input (already sentence-split?), configurable options, maybe some example sentences that yield different results. Having this all in one place would make it easier for users to choose a tokenizer.

@alvations (Contributor Author)

@nschneid where should we document that? In nltk.tokenize.__init__.py? On a GitHub wiki page? On a howto page? Or in all of them?

At some point, it should be on the main NLTK book (http://www.nltk.org/book/) too.

@nschneid (Contributor) commented Feb 4, 2016

I don't actually know. Maybe somebody else has an opinion.

@alvations (Contributor Author)

@nschneid let's wait for all the tokenizers to be stable and do the comparison in another PR. There are still a few more to go before we have a full comparison of well-known tokenizers.

@stevenbird self-assigned this on Feb 6, 2016
@stevenbird (Member)

I've been migrating general-purpose user-level documentation to module docstrings for ease of discovery (via the help command and the API docs), and preferring to keep the test/* files for regression testing.

@alvations (Contributor Author)

I've sorted the tokenizers in alphabetical order when adding toktok and repp to nltk.tokenize.__init__.py.

@alvations changed the title from "Added two tokenizers" to "Added two tokenizers and two tokenizer-related word/char list corpora" on Feb 16, 2016
@alvations (Contributor Author)

  • Added the nonbreaking_prefixes wordlist from Moses SMT https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes.
    • Note that there is an extra clause added to the WordListCorpusReader object that allows us to specify what kind of lines to ignore, e.g. lines starting with // or # or \n. Since the default WordListCorpusReader ignores blank lines, the default value of the ignore_lines_startswith parameter is \n.
  • Added the perluniprops character lists from http://perldoc.perl.org/perluniprops.html; these are very useful lists of characters when porting Perl tokenizers to Python, especially Moses SMT's tokenize.pl and cdec SMT's tokenize-anything.sh.

The zipballs for the two can be downloaded at:

@stevenbird I've tested the new WordListCorpusReader that I've added with the zip files extracted. It also works with NLTK automatically extracting the .zip file through the LazyCorpusLoader.


I'm not sure where to put these tests though:

>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'Adj', u'Adm', u'Adv', u'Asst', u'Bart', u'Bldg', u'Brig', u'Bros', u'Capt', u'Cmdr', u'Col', u'Comdr', u'Con', u'Corp', u'Cpl', u'DR', u'Dr', u'Drs', u'Ens', u'Gen', u'Gov', u'Hon', u'Hr', u'Hosp', u'Insp', u'Lt', u'MM', u'MR', u'MRS', u'MS', u'Maj', u'Messrs', u'Mlle', u'Mme', u'Mr', u'Mrs', u'Ms', u'Msgr', u'Op', u'Ord', u'Pfc', u'Ph', u'Prof', u'Pvt', u'Rep', u'Reps', u'Res', u'Rev', u'Rt', u'Sen', u'Sens', u'Sfc', u'Sgt', u'Sr', u'St', u'Supt', u'Surg', u'v', u'vs', u'i.e', u'rev', u'e.g', u'No #NUMERIC_ONLY# ', u'Nos', u'Art #NUMERIC_ONLY#', u'Nr', u'pp #NUMERIC_ONLY#', u'Jan', u'Feb', u'Mar', u'Apr', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']
>>> nbp.words('english')
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'Adj', u'Adm', u'Adv', u'Asst', u'Bart', u'Bldg', u'Brig', u'Bros', u'Capt', u'Cmdr', u'Col', u'Comdr', u'Con', u'Corp', u'Cpl', u'DR', u'Dr', u'Drs', u'Ens', u'Gen', u'Gov', u'Hon', u'Hr', u'Hosp', u'Insp', u'Lt', u'MM', u'MR', u'MRS', u'MS', u'Maj', u'Messrs', u'Mlle', u'Mme', u'Mr', u'Mrs', u'Ms', u'Msgr', u'Op', u'Ord', u'Pfc', u'Ph', u'Prof', u'Pvt', u'Rep', u'Reps', u'Res', u'Rev', u'Rt', u'Sen', u'Sens', u'Sfc', u'Sgt', u'Sr', u'St', u'Supt', u'Surg', u'v', u'vs', u'i.e', u'rev', u'e.g', u'No #NUMERIC_ONLY# ', u'Nos', u'Art #NUMERIC_ONLY#', u'Nr', u'pp #NUMERIC_ONLY#', u'Jan', u'Feb', u'Mar', u'Apr', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']
>>> print "".join(nbp.words('tamil'))
அஆஇஈஉஊஎஏஐஒஓஔஃககாகிகீகுகூகெகேகைகொகோகௌக்சசாசிசீசுசூசெசேசைசொசோசௌச்டடாடிடீடுடூடெடேடைடொடோடௌட்ததாதிதீதுதூதெதேதைதொதோதௌத்பபாபிபீபுபூபெபேபைபொபோபௌப்றறாறிறீறுறூறெறேறைறொறோறௌற்யயாயியீயுயூயெயேயையொயோயௌய்ரராரிரீருரூரெரேரைரொரோரௌர்லலாலிலீலுலூலெலேலைலொலோலௌல்வவாவிவீவுவூவெவேவைவொவோவௌவ்ளளாளிளீளுளூளெளேளைளொளோளௌள்ழழாழிழீழுழூழெழேழைழொழோழௌழ்ஙஙாஙிஙீஙுஙூஙெஙேஙைஙொஙோஙௌங்  ஞஞாஞிஞீஞுஞூஞெஞேஞைஞொஞோஞௌஞ் ணணாணிணீணுணூணெணேணைணொணோணௌண்நநாநிநீநுநூநெநேநைநொநோநௌந்  மமாமிமீமுமூமெமேமைமொமோமௌம்     னனானினீனுனூனெனேனைனொனோனௌன்திருதிருமதிவணகௌரவஉ.ம்No #NUMERIC_ONLY# NosArt #NUMERIC_ONLY#Nrpp #NUMERIC_ONLY#

and

>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')
[u'(', u'[', u'{', u'\u0f3a', u'\u0f3c', u'\u169b', u'\u201a', u'\u201e', u'\u2045', u'\u207d', u'\u208d', u'\u2329', u'\u2768', u'\u276a', u'\u276c', u'\u276e', u'\u2770', u'\u2772', u'\u2774', u'\u27c5', u'\u27e6', u'\u27e8', u'\u27ea', u'\u27ec', u'\u27ee', u'\u2983', u'\u2985', u'\u2987', u'\u2989', u'\u298b', u'\u298d', u'\u298f', u'\u2991', u'\u2993', u'\u2995', u'\u2997', u'\u29d8', u'\u29da', u'\u29fc', u'\u2e22', u'\u2e24', u'\u2e26', u'\u2e28', u'\u3008', u'\u300a', u'\u300c', u'\u300e', u'\u3010', u'\u3014', u'\u3016', u'\u3018', u'\u301a', u'\u301d', u'\ufd3e', u'\ufe17', u'\ufe35', u'\ufe37', u'\ufe39', u'\ufe3b', u'\ufe3d', u'\ufe3f', u'\ufe41', u'\ufe43', u'\ufe47', u'\ufe59', u'\ufe5b', u'\ufe5d', u'\uff08', u'\uff3b', u'\uff5b', u'\uff5f', u'\uff62']
>>> "".join(pup.chars('Open_Punctuation'))
u'([{\u0f3a\u0f3c\u169b\u201a\u201e\u2045\u207d\u208d\u2329\u2768\u276a\u276c\u276e\u2770\u2772\u2774\u27c5\u27e6\u27e8\u27ea\u27ec\u27ee\u2983\u2985\u2987\u2989\u298b\u298d\u298f\u2991\u2993\u2995\u2997\u29d8\u29da\u29fc\u2e22\u2e24\u2e26\u2e28\u3008\u300a\u300c\u300e\u3010\u3014\u3016\u3018\u301a\u301d\ufd3e\ufe17\ufe35\ufe37\ufe39\ufe3b\ufe3d\ufe3f\ufe41\ufe43\ufe47\ufe59\ufe5b\ufe5d\uff08\uff3b\uff5b\uff5f\uff62'
>>> print "".join(pup.chars('Open_Punctuation'))
([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「
>>> pup.chars('Currency_Symbol')
[u'$', u'\xa2', u'\xa3', u'\xa4', u'\xa5', u'\u058f', u'\u060b', u'\u09f2', u'\u09f3', u'\u09fb', u'\u0af1', u'\u0bf9', u'\u0e3f', u'\u17db', u'\u20a0', u'\u20a1', u'\u20a2', u'\u20a3', u'\u20a4', u'\u20a5', u'\u20a6', u'\u20a7', u'\u20a8', u'\u20a9', u'\u20aa', u'\u20ab', u'\u20ac', u'\u20ad', u'\u20ae', u'\u20af', u'\u20b0', u'\u20b1', u'\u20b2', u'\u20b3', u'\u20b4', u'\u20b5', u'\u20b6', u'\u20b7', u'\u20b8', u'\u20b9', u'\u20ba', u'\ua838', u'\ufdfc', u'\ufe69', u'\uff04', u'\uffe0', u'\uffe1', u'\uffe5', u'\uffe6']
>>> print "".join(pup.chars('Currency_Symbol'))
$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩

Should these go into nltk.test, or should they be doctests in the class docstrings, like in other parts of the code?

@alvations changed the title from "Added two tokenizers and two tokenizer-related word/char list corpora" to "Added three tokenizers and two tokenizer-related word/char list corpora" on Feb 17, 2016
@alvations (Contributor Author)

  • Added a Python port of the Moses tokenizer
    • One known issue is how the Moses tokenizer handles URLs.

Given the input:

Is 9.5 or 525,600 my favorite number?
The https://github.com/jonsafari/tok-tok/blob/master/tok-tok.pl is a website with/and/or slashes and sort of weird : things
This, is a sentence with weird» symbols… appearing everywhere¿

The output from NLTK's Python port of the Moses tokenizer:

Is 9.5 or 525,600 my favorite number ?
The https : / / github.com / jonsafari / tok-tok / blob / master / tok-tok.pl is a website with / and / or slashes and sort of weird : things
This , is a sentence with weird » symbols … appearing everywhere ¿

The output from the original Moses tokenizer in Perl:

Is 9.5 or 525,600 my favorite number ?
The https : / / github.com / jonsafari / tok-tok / blob / master / tok-tok.pl is a website with / and / or slashes and sort of weird : things
This , is a sentence with weird » symbols … appearing everywhere ¿
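For reference, a usage sketch that reproduces the first line of the comparison above; it assumes the module and class names added by this PR (nltk.tokenize.moses.MosesTokenizer, with tokenize() returning a list of tokens):

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer()
>>> print(" ".join(moses.tokenize(u"Is 9.5 or 525,600 my favorite number?")))
Is 9.5 or 525,600 my favorite number ?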

@stevenbird (Member)

@alvations – the main guidance re doctests is to put anything that would count as informative documentation for users in a method, class, or module docstring (where it can be found using help or via the online API docs), and to put regression tests in test/*.doctest.

@stevenbird (Member)

@alvations – would you like me to add those files to the NLTK data collection?

@alvations (Contributor Author)

Yes, could you help add those 2 files to the NLTK data collection? Thank you.

I'll add the documentation to the nonbreaking_prefixes.words() and perluniprops.chars() method docstrings.

@stevenbird (Member)

Sorry for the delay @alvations – I hope to do this tomorrow.

@stevenbird added this to the 3.2 milestone on Feb 28, 2016
@stevenbird added a commit to nltk/nltk_data that referenced this pull request on Mar 2, 2016
@stevenbird (Member)

@alvations – now that the data is in the right place, would you mind updating your PR please?

Syncing with bleeding edge develop branch
[…] subdirectories that are not nltk_data/corpora.

- Modified the perluniprops lazycorpusloader to find the zipball in
nltk_data/misc using the nltk_data_subdir parameter in the new
lazycorpusloader.
@alvations (Contributor Author)

@stevenbird when making perluniprops load from nltk_data/misc instead of nltk_data/corpora, I had to use kwargs and, while doing so, introduced the self.subdir property in LazyCorpusLoader.
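For illustration, the loader declaration might look roughly like this; the reader class and file pattern below are placeholders, and only the nltk_data_subdir keyword follows the comment above:

# Illustrative only -- placeholder reader class and file pattern.
from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import WordListCorpusReader

perluniprops = LazyCorpusLoader(
    'perluniprops', WordListCorpusReader, r'[a-zA-Z_]+\.txt',
    nltk_data_subdir='misc')  # look under nltk_data/misc, not nltk_data/corpora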

In the process, I've also changed the "x %s y" % "foobar" to "x {} y".format("foobar") in the LazyCorpusLoader.__load(). Is that alright?

I've tested nltk/corpus/reader/wordlist.py, nltk/tokenize/toktok.py and nltk/tokenize/moses.py; they pass the doctests.

@moses-smt, @jonsafari did we miss anything? Or do you have some comments/suggestions to make to the NLTK port of your tokenizer?

@stevenbird , @nschneid the code should be good for review now =)

@stevenbird modified the milestones: 3.2, 3.2.1 on Mar 5, 2016
Syncing with bleeding edge develop branch
@alvations changed the title from "Added three tokenizers and two tokenizer-related word/char list corpora" to "Added three tokenizers, one detokenizer and two tokenizer-related word/char list corpora" on May 18, 2016
@alvations (Contributor Author)

Added the Python port of the Moses detokenizer.
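A usage sketch, mirroring the tokenizer example earlier in this thread; the MosesDetokenizer class name and the detokenize() signature are my assumptions about the PR's nltk.tokenize.moses module:

>>> from nltk.tokenize.moses import MosesDetokenizer
>>> detok = MosesDetokenizer()
>>> print(detok.detokenize([u'This', u',', u'is', u'a', u'sentence', u'.'], return_str=True))
This, is a sentence.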

@stevenbird (Member)

Thanks @alvations. This looks good to me now.

@stevenbird merged commit 83c6700 into nltk:develop on Jun 20, 2016