Added three tokenizers, one detokenizer and two tokenizer-related word/char list corpora #1282
Conversation
Syncing with bleeding edge develop branch
Thanks! Is it possible to have a documentation blurb discussing the differences between the tokenizers available in NLTK? Languages supported, assumptions about input (already sentence-split?), configurable options, maybe some example sentences that yield different results. Having this all in one place would make it easier for users to choose a tokenizer.
@nschneid where should we document that? At some point, it should be in the main NLTK book (http://www.nltk.org/book/) too.
I don't actually know. Maybe somebody else has an opinion.
@nschneid let's wait for all the tokenizers to be stable and do the comparison in another PR. There are still a few more to go before we have a full comparison of well-known tokenizers.
I've been migrating general-purpose user-level documentation to module docstrings for ease of discovery (via the
I've sorted the tokenizers in alphabetical order when adding toktok and repp to the
The zipballs for the two can be downloaded at:
@stevenbird I've tested the new corpora. I'm not sure where to put these tests though:

>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'Adj', u'Adm', u'Adv', u'Asst', u'Bart', u'Bldg', u'Brig', u'Bros', u'Capt', u'Cmdr', u'Col', u'Comdr', u'Con', u'Corp', u'Cpl', u'DR', u'Dr', u'Drs', u'Ens', u'Gen', u'Gov', u'Hon', u'Hr', u'Hosp', u'Insp', u'Lt', u'MM', u'MR', u'MRS', u'MS', u'Maj', u'Messrs', u'Mlle', u'Mme', u'Mr', u'Mrs', u'Ms', u'Msgr', u'Op', u'Ord', u'Pfc', u'Ph', u'Prof', u'Pvt', u'Rep', u'Reps', u'Res', u'Rev', u'Rt', u'Sen', u'Sens', u'Sfc', u'Sgt', u'Sr', u'St', u'Supt', u'Surg', u'v', u'vs', u'i.e', u'rev', u'e.g', u'No #NUMERIC_ONLY# ', u'Nos', u'Art #NUMERIC_ONLY#', u'Nr', u'pp #NUMERIC_ONLY#', u'Jan', u'Feb', u'Mar', u'Apr', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']
>>> nbp.words('english')
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'Adj', u'Adm', u'Adv', u'Asst', u'Bart', u'Bldg', u'Brig', u'Bros', u'Capt', u'Cmdr', u'Col', u'Comdr', u'Con', u'Corp', u'Cpl', u'DR', u'Dr', u'Drs', u'Ens', u'Gen', u'Gov', u'Hon', u'Hr', u'Hosp', u'Insp', u'Lt', u'MM', u'MR', u'MRS', u'MS', u'Maj', u'Messrs', u'Mlle', u'Mme', u'Mr', u'Mrs', u'Ms', u'Msgr', u'Op', u'Ord', u'Pfc', u'Ph', u'Prof', u'Pvt', u'Rep', u'Reps', u'Res', u'Rev', u'Rt', u'Sen', u'Sens', u'Sfc', u'Sgt', u'Sr', u'St', u'Supt', u'Surg', u'v', u'vs', u'i.e', u'rev', u'e.g', u'No #NUMERIC_ONLY# ', u'Nos', u'Art #NUMERIC_ONLY#', u'Nr', u'pp #NUMERIC_ONLY#', u'Jan', u'Feb', u'Mar', u'Apr', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']
>>> print "".join(nbp.words('tamil'))
அஆஇஈஉஊஎஏஐஒஓஔஃககாகிகீகுகூகெகேகைகொகோகௌக்சசாசிசீசுசூசெசேசைசொசோசௌச்டடாடிடீடுடூடெடேடைடொடோடௌட்ததாதிதீதுதூதெதேதைதொதோதௌத்பபாபிபீபுபூபெபேபைபொபோபௌப்றறாறிறீறுறூறெறேறைறொறோறௌற்யயாயியீயுயூயெயேயையொயோயௌய்ரராரிரீருரூரெரேரைரொரோரௌர்லலாலிலீலுலூலெலேலைலொலோலௌல்வவாவிவீவுவூவெவேவைவொவோவௌவ்ளளாளிளீளுளூளெளேளைளொளோளௌள்ழழாழிழீழுழூழெழேழைழொழோழௌழ்ஙஙாஙிஙீஙுஙூஙெஙேஙைஙொஙோஙௌங் ஞஞாஞிஞீஞுஞூஞெஞேஞைஞொஞோஞௌஞ் ணணாணிணீணுணூணெணேணைணொணோணௌண்நநாநிநீநுநூநெநேநைநொநோநௌந் மமாமிமீமுமூமெமேமைமொமோமௌம் னனானினீனுனூனெனேனைனொனோனௌன்திருதிருமதிவணகௌரவஉ.ம்No #NUMERIC_ONLY# NosArt #NUMERIC_ONLY#Nrpp #NUMERIC_ONLY#

and

>>> from nltk.corpus import perluniprops as pup
>>> pup.chars('Open_Punctuation')
[u'(', u'[', u'{', u'\u0f3a', u'\u0f3c', u'\u169b', u'\u201a', u'\u201e', u'\u2045', u'\u207d', u'\u208d', u'\u2329', u'\u2768', u'\u276a', u'\u276c', u'\u276e', u'\u2770', u'\u2772', u'\u2774', u'\u27c5', u'\u27e6', u'\u27e8', u'\u27ea', u'\u27ec', u'\u27ee', u'\u2983', u'\u2985', u'\u2987', u'\u2989', u'\u298b', u'\u298d', u'\u298f', u'\u2991', u'\u2993', u'\u2995', u'\u2997', u'\u29d8', u'\u29da', u'\u29fc', u'\u2e22', u'\u2e24', u'\u2e26', u'\u2e28', u'\u3008', u'\u300a', u'\u300c', u'\u300e', u'\u3010', u'\u3014', u'\u3016', u'\u3018', u'\u301a', u'\u301d', u'\ufd3e', u'\ufe17', u'\ufe35', u'\ufe37', u'\ufe39', u'\ufe3b', u'\ufe3d', u'\ufe3f', u'\ufe41', u'\ufe43', u'\ufe47', u'\ufe59', u'\ufe5b', u'\ufe5d', u'\uff08', u'\uff3b', u'\uff5b', u'\uff5f', u'\uff62']
>>> "".join(pup.chars('Open_Punctuation'))
u'([{\u0f3a\u0f3c\u169b\u201a\u201e\u2045\u207d\u208d\u2329\u2768\u276a\u276c\u276e\u2770\u2772\u2774\u27c5\u27e6\u27e8\u27ea\u27ec\u27ee\u2983\u2985\u2987\u2989\u298b\u298d\u298f\u2991\u2993\u2995\u2997\u29d8\u29da\u29fc\u2e22\u2e24\u2e26\u2e28\u3008\u300a\u300c\u300e\u3010\u3014\u3016\u3018\u301a\u301d\ufd3e\ufe17\ufe35\ufe37\ufe39\ufe3b\ufe3d\ufe3f\ufe41\ufe43\ufe47\ufe59\ufe5b\ufe5d\uff08\uff3b\uff5b\uff5f\uff62'
>>> print "".join(pup.chars('Open_Punctuation'))
([{༺༼᚛‚„⁅⁽₍〈❨❪❬❮❰❲❴⟅⟦⟨⟪⟬⟮⦃⦅⦇⦉⦋⦍⦏⦑⦓⦕⦗⧘⧚⧼⸢⸤⸦⸨〈《「『【〔〖〘〚〝﴾︗︵︷︹︻︽︿﹁﹃﹇﹙﹛﹝([{⦅「
>>> pup.chars('Currency_Symbol')
[u'$', u'\xa2', u'\xa3', u'\xa4', u'\xa5', u'\u058f', u'\u060b', u'\u09f2', u'\u09f3', u'\u09fb', u'\u0af1', u'\u0bf9', u'\u0e3f', u'\u17db', u'\u20a0', u'\u20a1', u'\u20a2', u'\u20a3', u'\u20a4', u'\u20a5', u'\u20a6', u'\u20a7', u'\u20a8', u'\u20a9', u'\u20aa', u'\u20ab', u'\u20ac', u'\u20ad', u'\u20ae', u'\u20af', u'\u20b0', u'\u20b1', u'\u20b2', u'\u20b3', u'\u20b4', u'\u20b5', u'\u20b6', u'\u20b7', u'\u20b8', u'\u20b9', u'\u20ba', u'\ua838', u'\ufdfc', u'\ufe69', u'\uff04', u'\uffe0', u'\uffe1', u'\uffe5', u'\uffe6']
>>> print "".join(pup.chars('Currency_Symbol'))
$¢£¤¥֏؋৲৳৻૱௹฿៛₠₡₢₣₤₥₦₧₨₩₪₫€₭₮₯₰₱₲₳₴₵₶₷₸₹₺꠸﷼﹩$¢£¥₩

Should they go to
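As an illustration of why these character lists matter when porting Perl tokenizers, a Perl Unicode property class like \p{Open_Punctuation} can be rebuilt as a Python regex character class. This is a hedged sketch: the names open_punct and open_punct_re are hypothetical, and the list is a small hardcoded subset of the Open_Punctuation characters shown above rather than a call into the NLTK corpus API.

```python
import re

# A small subset of the Open_Punctuation characters listed above;
# in NLTK these would come from pup.chars('Open_Punctuation').
open_punct = [u'(', u'[', u'{', u'\u3008', u'\uff08']

# Build a regex character class standing in for Perl's \p{Open_Punctuation},
# escaping each character so it is safe inside [...].
open_punct_re = re.compile(u'[' + u''.join(re.escape(c) for c in open_punct) + u']')

# Pad any opening punctuation with surrounding spaces, the kind of
# substitution tokenize.pl performs with its Unicode property classes.
text = u'(hello\uff08world\uff09'
padded = open_punct_re.sub(lambda m: u' ' + m.group(0) + u' ', text)
# → ' ( hello （ world）'
```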
Given the input:
The output from NLTK's Python port of the Moses tokenizer:
The output from the original Moses tokenizer in Perl:
@alvations – the main guidance re doctests is to put anything that would count as informative documentation for users in a method, class, or module docstring (where it can be found using
@alvations – would you like me to add those files to the NLTK data collection?
Yes, could you help add those 2 files to the NLTK data collection? Thank you. I'll add the documentation to the
Sorry for the delay @alvations – I hope to do this tomorrow.
@alvations – now that the data is in the right place, would you mind updating your PR please?
Syncing with bleeding edge develop branch
subdirectories that are not nltk_data/corpora. - Modified the perluniprops LazyCorpusLoader to find the zipball in nltk_data/misc using the nltk_data_subdir parameter in the new LazyCorpusLoader.
@stevenbird when making the perluniprops load from
In the process, I've also changed the
I've tested
@moses-smt, @jonsafari did we miss anything? Or do you have any comments/suggestions on the NLTK port of your tokenizer?
@stevenbird, @nschneid the code should be good for review now =)
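The loader change described above (looking for the perluniprops zipball under nltk_data/misc rather than nltk_data/corpora) can be pictured with a small path-resolution helper. This is a hypothetical sketch, not NLTK's actual LazyCorpusLoader code; the function name resolve_corpus_zip and its subdir parameter are illustrative stand-ins for the nltk_data_subdir idea mentioned in the commit.

```python
import os

def resolve_corpus_zip(nltk_data_root, name, subdir='corpora'):
    """Return the path to a corpus zipball under an nltk_data
    subdirectory. By default corpora live in nltk_data/corpora;
    passing subdir='misc' mirrors the nltk_data_subdir parameter
    described above."""
    return os.path.join(nltk_data_root, subdir, name + '.zip')

# perluniprops is looked up under nltk_data/misc instead of corpora:
path = resolve_corpus_zip('/home/user/nltk_data', 'perluniprops', subdir='misc')
# e.g. '/home/user/nltk_data/misc/perluniprops.zip' on POSIX systems
```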
Syncing with bleeding edge develop branch
Syncing with bleeding edge.
Syncing with bleeding edge.
Added the Python port of Moses' detokenizer.
Thanks @alvations. This looks good to me now.
Responding to #1214, here are 3 new tokenizers to add to NLTK. There's still much to do on #1214 though.

They provide tokenize_sents() in addition to the standard tokenize().
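The relationship between the two methods can be sketched as follows. This is a simplified stand-in tokenizer, not one of the actual NLTK classes; the point is only that the batch method maps tokenize() over a list of already sentence-split strings.

```python
class SimpleTokenizer:
    """Minimal stand-in illustrating the tokenize()/tokenize_sents() pattern."""

    def tokenize(self, text):
        # Real tokenizers do much more; whitespace splitting stands in here.
        return text.split()

    def tokenize_sents(self, sentences):
        # Batch interface: apply tokenize() to each pre-split sentence.
        return [self.tokenize(sent) for sent in sentences]

tok = SimpleTokenizer()
tok.tokenize_sents(["This is one.", "And another."])
# → [['This', 'is', 'one.'], ['And', 'another.']]
```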
And there are also two new wordlist and character-list corpora added to ease the porting of other tokenizers.

- nonbreaking_prefixes: a wordlist from Moses SMT, https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes. It is loaded with a WordListCorpusReader object that allows us to specify what kind of lines to ignore, e.g. lines starting with //, # or \n. Since the default WordListCorpusReader ignores blank lines, the default value for the ignore_lines_startswith parameter would be \n.
- perluniprops: character lists from http://perldoc.perl.org/perluniprops.html. These are very useful lists of characters when porting Perl tokenizers into Python, especially when porting Moses SMT's tokenize.pl and Cdec SMT's tokenize-anything.sh.
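The ignore_lines_startswith behaviour described for the wordlist corpus can be sketched like this. It is a simplified reader, not NLTK's WordListCorpusReader implementation; the function name read_wordlist is hypothetical, and only the parameter name and the blank-line default come from the description above.

```python
def read_wordlist(lines, ignore_lines_startswith='\n'):
    """Return words from a wordlist, skipping lines that start with the
    given prefix. With the default '\\n', only blank lines are skipped,
    matching the default behaviour described above; pass '#' to drop
    comment lines instead in files that use them."""
    return [line.strip() for line in lines
            if not line.startswith(ignore_lines_startswith)]

raw = ['Mr\n', '\n', '#comment\n', 'Dr\n']
read_wordlist(raw)                               # keeps the comment line
read_wordlist(raw, ignore_lines_startswith='#')  # drops the comment line
```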