# Text Tokenization

Tokenization splits a text document into, well, tokens, and it is the most basic text processing step. A token is the smallest unit with semantic meaning -- that is, tokens are mostly words and numbers, while the letters of a word or the digits of a number are not tokens. Punctuation marks are also considered tokens. With the more informal writing style on social media, concepts such as hashtags, emoticons and emojis are nowadays often meaningful tokens as well.

The following examples compare different tokenizers to highlight the differences and subtleties when it comes to splitting text into tokens. There is no single best or only correct tokenizer. Which implementation is most suitable depends on the type of input and on the subsequent processing tasks.

## Import all important packages

We first use NLTK, a very popular and mature Python package for natural language processing.

In [5]:
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer

from nltk import word_tokenize # Simplified notation; essentially a wrapper for the TreebankWordTokenizer

NLTK provides more tokenizers: http://www.nltk.org/api/nltk.tokenize.html

## Define a document

We first create a list of sentences.

In [6]:
sentences = ["Text processing with Python is great.", 
             "It isn't (very) complicated to get started.",
             "However,careful to...you know....avoid mistakes.",
             "Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg.",
             "This is so cooool #nltkrocks :))) :-P <3."]

To form the document, we can use the built-in string method `join()` to concatenate all sentences, using a whitespace as separator.

In [7]:
document = ' '.join(sentences)

# Print the document to see if everything looks alright
print (document)

Text processing with Python is great. It isn't (very) complicated to get started. However,careful to...you know....avoid mistakes. Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg. This is so cooool #nltkrocks :))) :-P <3.


## Document tokenization into sentences

Sometimes, you just want to split a document into sentences and not into individual tokens.

In [8]:
sentence_tokenizer = PunktSentenceTokenizer()

# The tokenize() method returns a list containing the sentences
sentences_alt = sentence_tokenizer.tokenize(document)

# Loop over all sentences and print each sentence
for s in sentences_alt:
    print (s)

Text processing with Python is great.
It isn't (very) complicated to get started.
However,careful to...you know....avoid mistakes.
Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg.
This is so cooool #nltkrocks :))) :-P <3.


## Document tokenization into tokens

In the following, we tokenize each sentence individually. This makes the presentation a bit more convenient. In practice, you can tokenize the whole document at once.

### Naive tokenization

Python provides a built-in string method `split()` that splits a string with respect to a user-defined separator. By default, that is, when no separator is given, the string is split on runs of whitespace.

In [9]:
print ('\nOutput of split() method:')
for s in sentences:
    print (s.split(' '))
    #print(s.split()) # Gives the same result here; without an argument, split() splits on any run of whitespace


Output of split() method:
['Text', 'processing', 'with', 'Python', 'is', 'great.']
['It', "isn't", '(very)', 'complicated', 'to', 'get', 'started.']
['However,careful', 'to...you', 'know....avoid', 'mistakes.']
['Contact', 'me', 'at', 'vonderweth@nus.edu.sg;', 'see', 'http://nus.edu.sg.']
['This', 'is', 'so', 'cooool', '#nltkrocks', ':)))', ':-P', '<3.']


The limitation of this approach is obvious, since many tokens are not separated by a whitespace. Most commonly, this is the case for punctuation marks.
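One small step up from `split()` is to treat every punctuation character as its own token. The following is only a minimal sketch using Python's standard `re` module, not an NLTK tokenizer; the helper name `simple_tokenize` is made up for this illustration.

```python
import re

def simple_tokenize(text):
    # A token is either a run of word characters (\w+)
    # or a single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Text processing with Python is great."))
print(simple_tokenize("It isn't (very) complicated to get started."))
```

This separates the final period and the parentheses, but it also breaks *isn't* into three pieces (`isn`, `'`, `t`), which hints at why dedicated tokenizers are worth using.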

### TreebankWordTokenizer

The `TreebankWordTokenizer` is NLTK's default word tokenizer (it is what `word_tokenize()` relies on). If you have well-formed text such as news articles, this tokenizer is usually the way to go.

In [6]:
treebank_tokenizer = TreebankWordTokenizer()

print ('\nOutput of TreebankWordTokenizer:')
for s in sentences:
    print (treebank_tokenizer.tokenize(s))

print ('\nOutput of the word_tokenize() method:')
for s in sentences:
    print (word_tokenize(s))   


Output of TreebankWordTokenizer:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', 'is', "n't", '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '...', '.avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth', '@', 'nus.edu.sg', ';', 'see', 'http', ':', '//nus.edu.sg', '.']
['This', 'is', 'so', 'cooool', '#', 'nltkrocks', ':', ')', ')', ')', ':', '-P', '<', '3', '.']

Output of the word_tokenize() method:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', 'is', "n't", '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '...', '.avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth', '@', 'nus.edu.sg', ';', 'see', 'http', ':', '//nus.edu.sg', '.']
['This', 'is', 'so', 'cooool', '#', 'nltkrocks', ':', ')', ')', ')', ':', '-P', '<', '3', '.']


Both outputs are the same, since the `word_tokenize()` function is essentially a wrapper for the `TreebankWordTokenizer` to simplify the coding.

See how this tokenizer also splits common contractions such as *isn't*, *hasn't*, *haven't*. Other tokenizers (see below) treat such contractions as one token. Being aware of how this is handled matters, for example, for sentiment analysis, where handling negations correctly is very important to get the sentiment right.
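To illustrate why this matters downstream, here is a rough sketch of a negation check that works with both conventions. The function name `contains_negation` and the small set of negation words are assumptions made for this example, not part of NLTK.

```python
def contains_negation(tokens):
    # Treebank-style output has a separate "n't" token, while other
    # tokenizers keep the contraction whole, e.g. "isn't"
    negations = {"not", "no", "never", "n't"}
    return any(t.lower() in negations or t.lower().endswith("n't")
               for t in tokens)

print(contains_negation(['It', 'is', "n't", 'complicated']))      # split contraction
print(contains_negation(['It', "isn't", 'complicated']))          # whole contraction
print(contains_negation(['Text', 'processing', 'is', 'great']))   # no negation
```

A check that only looks for the separate token `n't` would miss the negation in the output of a tokenizer that keeps *isn't* together, and vice versa.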

Also, notice how the tokenizer handles the ellipsis (`...`) correctly in the first case but fails in the second case, since an ellipsis by definition consists of exactly 3 dots. Sequences of more or fewer than 3 dots are not handled properly.
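If such irregular ellipses are common in your data, one option is to normalize them before tokenizing. The sketch below uses only the standard `re` module; the helper name `normalize_ellipses` is hypothetical, and treating every run of four or more dots as an ellipsis is an assumption that may not suit every corpus.

```python
import re

def normalize_ellipses(text):
    # Replace any run of four or more dots with exactly three dots
    return re.sub(r"\.{4,}", "...", text)

print(normalize_ellipses("However,careful to...you know....avoid mistakes."))
```

After this step, the second ellipsis has the same form as the first one, so the tokenizer should treat both in the same way.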

### TweetTokenizer

The `TweetTokenizer` is optimized for social media content where people use informal concepts such as hashtags or emoticons. Note that emoticons often contain punctuation marks that throw other tokenizers off.

In [7]:
tweet_tokenizer = TweetTokenizer()

print ('Output of TweetTokenizer:')
for s in sentences:
    print (tweet_tokenizer.tokenize(s))

Output of TweetTokenizer:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', "isn't", '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '...', 'avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth@nus.edu.sg', ';', 'see', 'http://nus.edu.sg', '.']
['This', 'is', 'so', 'cooool', '#nltkrocks', ':)', ')', ')', ':-P', '<3', '.']


Here, both ellipses are recognized, with the second one even "corrected" to three dots.

Note how the tokenizer fails with `:)))`. The problem is that it is not the "official version" of the emoticon -- which is `:)` or `:-)` -- but uses multiple "mouths" to emphasize the expressed sentiment or feeling. Whether this is a problem depends on the subsequent analysis; some extra `)` are no big deal in many cases.

There are 3 basic alternatives to properly address this issue:
- Clean your text before tokenizing
- Remove all "odd" tokens from the list before further processing
- Write your own sophisticated tokenizer :-)
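As an illustration of the first alternative (cleaning the text before tokenizing), the following sketch collapses repeated emoticon "mouths" into a single one. It relies only on the standard `re` module; the helper name `collapse_emoticon_mouths` and the small set of emoticon shapes it covers are assumptions for this example.

```python
import re

def collapse_emoticon_mouths(text):
    # Group 1: the eyes with an optional nose, e.g. ":", ":-" or ";"
    # Group 2: the mouth character; \2+ matches its repetitions
    return re.sub(r"([:;]-?)([()])\2+", r"\1\2", text)

print(collapse_emoticon_mouths("This is so cooool #nltkrocks :))) :-P <3."))
```

Here `:)))` becomes `:)`, while `:-P` and `<3` are left untouched. Keep in mind that this discards the emphasis the writer expressed, which may or may not matter for the subsequent analysis.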

### RegexpTokenizer

The `RegexpTokenizer` takes as input a regular expression that specifies which parts of the string qualify as a valid token. That means that some parts of the string might be removed. In principle, all previous tokenizers can be expressed as a `RegexpTokenizer` -- however, the required regular expressions can be very complex.
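To give an impression of how such expressions grow, here is a sketch of a pattern that keeps URLs, e-mail addresses and hashtags together. It is demonstrated with `re.findall` from the standard library instead of `RegexpTokenizer`; the names `TOKEN_PATTERN` and `regex_tokenize` are made up for this example, and the pattern is far from complete.

```python
import re

# Alternatives are tried from left to right, so the most specific come first
TOKEN_PATTERN = "|".join([
    r"https?://\S+[^\s.,;:!?]",        # URLs, without trailing punctuation
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",   # e-mail addresses
    r"#\w+",                           # hashtags
    r"\w+(?:'\w+)?",                   # words, optionally with a contraction
    r"[^\w\s]",                        # any other single non-space character
])

def regex_tokenize(text):
    # All groups are non-capturing, so findall() returns the full matches
    return re.findall(TOKEN_PATTERN, text)

print(regex_tokenize("Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg."))
print(regex_tokenize("It isn't (very) complicated to get started."))
```

Even this short pattern is already harder to read than the one-liners used in this section, and it still does not cover emoticons or ellipses.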

In [8]:
# Each assignment overrides the previous one; only the last pattern is used
pattern = r'\w+' # all alphanumeric words
pattern = r'[a-zA-Z]+' # all alphabetic words (without digits)
pattern = r"[a-zA-Z']+" # all alphabetic words (without digits, but keep contractions)
regexp_tokenizer = RegexpTokenizer(pattern)

print ('\nOutput of RegexpTokenizer for pattern {}:'.format(pattern))
for s in sentences:
    print (regexp_tokenizer.tokenize(s))


Output of RegexpTokenizer for pattern [a-zA-Z']+:
['Text', 'processing', 'with', 'Python', 'is', 'great']
['It', "isn't", 'very', 'complicated', 'to', 'get', 'started']
['However', 'careful', 'to', 'you', 'know', 'avoid', 'mistakes']
['Contact', 'me', 'at', 'vonderweth', 'nus', 'edu', 'sg', 'see', 'http', 'nus', 'edu', 'sg']
['This', 'is', 'so', 'cooool', 'nltkrocks', 'P']


## Tokenization with spaCy

spaCy is another Python text processing package. It is newer than NLTK but also very sophisticated, and it quickly gained a lot of popularity. Its usage is often easier compared to NLTK since spaCy typically combines several processing steps into one, thus hiding more of its logic and requiring fewer lines of code.

### Import required packages

In [1]:
import spacy

### Load English language model

In [2]:
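# The 'en' shortcut is not available in recent spaCy versions; load the model by its full name.
# Install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')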

### Process document

In contrast to NLTK, the common usage of spaCy is to process a string in a single call, which not only performs tokenization but also other steps (see later tutorial). Here, we only look at the tokens.

Again, we process each sentence individually to simplify the output.

In [11]:
print ('\nOutput of spaCy tokenizer:')
for s in sentences:
    doc = nlp(s) # doc is an object, not just a simple list
    # Let's create a list so the output matches the previous ones
    token_list = []
    for token in doc:
        token_list.append(token.text) # token is also an object, not a string
    print (token_list)


Output of spaCy tokenizer:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', 'is', "n't", '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '...', '.avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth@nus.edu.sg', ';', 'see', 'http://nus.edu.sg', '.']
['This', 'is', 'so', 'cooool', '#', 'nltkrocks', ':))', ')', ':-P', '<3', '.']


spaCy does a bit better with the uncommon emoticon, but splits the hashtag. Also, the second ellipsis (the one with 4 dots) is not handled correctly.

## Summary

- There are different tokenizer implementations available, each with its strengths and weaknesses, and none of them is the best choice in all cases.
- For formal text, most tokenizers do just fine -- still, it is important to know how the tokenizer works, e.g., how it handles contractions. A bit more consideration is needed if the input is informal text as commonly found on social media.
- Since tokenization is usually the first and most basic step, it is worth "getting it right" so that errors or issues do not propagate into subsequent processing steps.