Python port of the Twokenize class of ark-tweet-nlp


This is a crude Python port of the Twokenize class from ark-tweet-nlp.

It produces nearly identical output to the original Java tokenizer, except in a few infrequent situations. In particular, Python's regular expressions did not support partial case-insensitivity (a case-insensitive flag scoped to one part of a pattern) when this port was written; scoped inline flags such as (?i:...) were only added in Python 3.6. This causes some tokenization differences for "Eastern"-style emoticons, particularly when the left and right halves are of different cases. For example:

Java (original): v.V
Python (port): v . V
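
The case issue can be illustrated with a minimal sketch. The pattern below is a simplified, hypothetical stand-in (an "eye" character, a center, then the same eye again via a backreference); it is not the regex used by the port or by the Java original. It only shows that a case-sensitive backreference rejects a mixed-case emoticon that a case-insensitive match accepts.

```python
import re

# Hypothetical, simplified pattern (not the port's actual regex):
# an "eye" character, a center, then the same eye again via a backreference.
EYE = r"([ovtx])"
CENTER = r"(?:\.|[_-]+)"
PATTERN = EYE + CENTER + r"\1"

case_sensitive = re.compile(PATTERN)
case_insensitive = re.compile(PATTERN, re.IGNORECASE)


def is_emoticon(regex, text):
    """Return True if the whole string matches the emoticon pattern."""
    return regex.fullmatch(text) is not None


print(is_emoticon(case_sensitive, "v.v"))    # True: same case on both sides
print(is_emoticon(case_sensitive, "v.V"))    # False: backreference is case-sensitive
print(is_emoticon(case_insensitive, "v.V"))  # True: whole pattern ignores case
```

Note that re.IGNORECASE applies to the entire pattern, so in a large combined tokenizer regex it is not a drop-in replacement for a flag that covers only one subexpression.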

Emoticons of this kind appear to be rare. Nevertheless, I have included a fix for one special case:

Java (original): o.O
Python (port, w/o fix): o . O
Python (port, w/ fix): o.O
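
One way such a special case could be handled is with a dedicated pattern that lists both cases explicitly, so no case-insensitive flag is needed. The following is a sketch under that assumption; the name O_O_EMOTE and the set of allowed centers are illustrative and may not match the port's own definitions.

```python
import re

# Hypothetical sketch of a dedicated rule for the o.O family; the center
# alternatives here are an assumption, not the port's exact definition.
CENTER = r"(?:\.|[_-]+)"
O_O_EMOTE = re.compile(r"[oO]" + CENTER + r"[oO]")

for text in ["o.O", "O.o", "O_o", "v.V"]:
    # Prints True for the o/O variants and False for v.V, which this
    # special case deliberately does not cover.
    print(text, O_O_EMOTE.fullmatch(text) is not None)
```

Spelling out both cases in a character class keeps the rest of the tokenizer pattern case-sensitive, at the cost of covering only the emoticons that are listed.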


A comparison on 1 million tweets found 83 instances (0.0083%) where tokenization differed between the original Java version and this Python port. The differences were primarily related to the emoticon issue discussed above, and it was not clear in general which output was more desirable. For example:

Input tweet:
Profit-Taking Hits Nikkei RT @WSJmarkets

Java (original):
Profi t-T aking Hits Nikkei RT @WSJmarkets

Python (port):
Profit-Taking Hits Nikkei RT @WSJmarkets
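
A comparison like the one described above could be scripted roughly as follows. This is a sketch, not the script that produced the reported numbers; it assumes each tokenizer's output is available as one string per tweet with tokens separated by spaces, and the function name compare_tokenizations is hypothetical.

```python
def compare_tokenizations(java_outputs, python_outputs):
    """Return (number of differing tweets, percentage of differing tweets).

    Assumes both lists hold one space-separated tokenization per tweet,
    in the same order.
    """
    if len(java_outputs) != len(python_outputs):
        raise ValueError("both lists must cover the same tweets")
    differing = sum(
        1
        for java_line, python_line in zip(java_outputs, python_outputs)
        if java_line.split() != python_line.split()
    )
    total = len(java_outputs)
    percent = 100.0 * differing / total if total else 0.0
    return differing, percent


# Tiny illustrative sample built from the examples in this README.
java_outputs = [
    "Profi t-T aking Hits Nikkei RT @WSJmarkets",
    "v.V",
    "o.O",
]
python_outputs = [
    "Profit-Taking Hits Nikkei RT @WSJmarkets",
    "v . V",
    "o.O",
]

differing, percent = compare_tokenizations(java_outputs, python_outputs)
print(differing, round(percent, 2))  # 2 66.67
```

With 83 differing tweets out of 1,000,000, the same calculation gives 0.0083%.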