
# Text Tokenization

Tokenization splits a text document into, well, tokens, and it is the most basic text processing step. A token is the smallest unit with semantic meaning -- that is, tokens are mostly words and numbers, while letters of words or digits of numbers are not tokens. Punctuation marks are also considered tokens. With the more informal writing style on social media, concepts such as hashtags, emoticons and emojis are nowadays also often meaningful tokens.

The following examples compare different tokenizers to highlight the differences and subtleties when it comes to splitting text into tokens. There is no single best or only correct tokenizer. Which implementation is most suitable depends on the type of input and on the further processing tasks.


## Import all important packages

We first use NLTK, a very popular and mature Python package for language processing.


In [None]:
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import RegexpTokenizer

from nltk import word_tokenize # Simplified notation; a convenience function built on a Treebank-style word tokenizer

# **PunktSentenceTokenizer**

📌 Splits text into sentences (not words).

text = "Mr. Smith went to the store. He bought milk."

["Mr. Smith went to the store.", "He bought milk."]

✅ Useful for sentence-level tasks like summarization or sentence classification.
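To see why a dedicated sentence tokenizer is useful, here is a minimal sketch of the naive alternative: splitting after every `.`, `!`, or `?` with Python's `re` module. The helper name `naive_sentence_split` is hypothetical, and this is not how Punkt works internally; it only illustrates the abbreviation problem that a sentence tokenizer is meant to avoid.

```python
import re

def naive_sentence_split(text):
    # Split at whitespace that directly follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Mr. Smith went to the store. He bought milk."
print(naive_sentence_split(text))
# The abbreviation "Mr." is wrongly treated as the end of a sentence:
# ['Mr.', 'Smith went to the store.', 'He bought milk.']
```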

# **TreebankWordTokenizer**

Handles punctuation well: splits off 's, punctuation, contractions, etc.

['I', 'ca', "n't", 'do', 'this', '.', 'He', "'s", 'here', '!']

✅ Great for formal text (news, books).

NLTK provides more tokenizers: http://www.nltk.org/api/nltk.tokenize.html

# **TweetTokenizer**

Designed for Twitter/social media style.

Keeps hashtags, emojis, @mentions, and emoticons intact.

"I ❤️ Python! #awesome @nltk :)"

['I', '❤️', 'Python', '!', '#awesome', '@nltk', ':)']

# **RegexpTokenizer**
You define a pattern (regex) for what counts as a token.

Example: extract only alphanumeric words (pattern `\w+`), dropping punctuation and symbols.

"Wait... what?! $100 is too much!!"

['Wait', 'what', '100', 'is', 'too', 'much']
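The effect of a pattern can be previewed with `re.findall` from the standard library. This is only a sketch with plain `re`, under the assumption that `RegexpTokenizer` in its default mode returns the same matches as `re.findall` for the same pattern. It shows that the token `100` survives with `\w+` but disappears with a letters-only pattern.

```python
import re

text = "Wait... what?! $100 is too much!!"

# \w+ matches runs of letters, digits and underscores
alnum_tokens = re.findall(r"\w+", text)
# [a-zA-Z]+ matches runs of ASCII letters only
letter_tokens = re.findall(r"[a-zA-Z]+", text)

print(alnum_tokens)   # ['Wait', 'what', '100', 'is', 'too', 'much']
print(letter_tokens)  # ['Wait', 'what', 'is', 'too', 'much']
```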

## Define a document

We first create a list of sentences.

In [None]:
sentences = ["Text processing with Python is great.",
             "It isn't (very) complicated to get started.",
             "However,careful to...you know....avoid mistakes.",
             "Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg.",
             "This is so cooool #nltkrocks :))) :-P <3."]

To form the document, we can use the built-in `join()` method to concatenate all sentences using a single space as separator.

In [None]:
document = ' '.join(sentences)

# Print the document to see if everything looks alright
print (document)

Text processing with Python is great. It isn't (very) complicated to get started. However,careful to...you know....avoid mistakes. Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg. This is so cooool #nltkrocks :))) :-P <3.


## Document tokenization into sentences

Sometimes, you just want to split a document into sentences and not individual tokens.

# **PunktSentenceTokenizer**

📌 Splits text into sentences (not words).

In [None]:
sentence_tokenizer = PunktSentenceTokenizer()

# The tokenize() method returns a list containing the sentences
sentences_alt = sentence_tokenizer.tokenize(document)

# Loop over all sentences and print each sentence
for s in sentences_alt:
    print (s)

Text processing with Python is great.
It isn't (very) complicated to get started.
However,careful to...you know....avoid mistakes.
Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg.
This is so cooool #nltkrocks :))) :-P <3.


## Document tokenization into tokens

In the following, we tokenize each sentence individually. This makes the presentation a bit more convenient. In practice, you can tokenize the whole document at once.

# **Naive tokenization**

Naive tokenization means splitting a sentence on whitespace only, using Python's built-in `str.split()` method. The method splits a string with respect to a user-defined separator; if no separator is given, it splits on any run of whitespace.

In [None]:
print ('\nOutput of split() method:')
for s in sentences:
    print (s.split(' '))
    #print(s.split()) # Gives the same result here; without an argument, split() breaks on any run of whitespace


Output of split() method:
['Text', 'processing', 'with', 'Python', 'is', 'great.']
['It', "isn't", '(very)', 'complicated', 'to', 'get', 'started.']
['However,careful', 'to...you', 'know....avoid', 'mistakes.']
['Contact', 'me', 'at', 'vonderweth@nus.edu.sg;', 'see', 'http://nus.edu.sg.']
['This', 'is', 'so', 'cooool', '#nltkrocks', ':)))', ':-P', '<3.']
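For the sentences above, `split(' ')` and `split()` give the same result, but the two calls are not equivalent in general. The following small sketch uses a made-up string (not part of the document above) with a double space and a tab to show the difference.

```python
text = "Tokens  separated by\ttabs or  double spaces"

# An explicit ' ' separator splits at every single space character,
# so double spaces produce empty strings and the tab is not a separator
by_space = text.split(' ')
# Without an argument, any run of whitespace acts as one separator
by_whitespace = text.split()

print(by_space)
print(by_whitespace)
```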


**⚠️ Limitations:**

❌ It does not separate punctuation from words.

❌ It keeps symbols like !, ?, and . attached to words.

❌ It does not split contractions (like "don't"), and it keeps hashtags, mentions, and emoticons intact only when they are surrounded by whitespace.
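One simple way to soften the first two limitations is to post-process the whitespace tokens and peel off trailing punctuation. The sketch below is an illustration only: the helper name `split_trailing_punct` and the chosen punctuation set are assumptions, and it deliberately leaves leading punctuation such as `(` untouched.

```python
import re

# A chunk followed by one or more trailing punctuation marks
TRAILING = re.compile(r"^(.*?)([.,;:!?]+)$")

def split_trailing_punct(text):
    tokens = []
    for chunk in text.split():
        match = TRAILING.match(chunk)
        if match and match.group(1):
            # Keep the word and its trailing punctuation as two tokens
            tokens.extend([match.group(1), match.group(2)])
        else:
            # No trailing punctuation, or the chunk is punctuation only
            tokens.append(chunk)
    return tokens

print(split_trailing_punct("It isn't (very) complicated to get started."))
print(split_trailing_punct("Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg."))
```

Because only the end of each chunk is inspected, the e-mail address and the URL stay in one piece, while `(very)` is still not split.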


"Wait... what?! Really? That's awesome!"
Output: ['Wait...', 'what?!', 'Really?', "That's", 'awesome!']

"Wait..." keeps ... stuck to the word.

"what?!" includes both ? and !.

# **TreebankWordTokenizer**

The `TreebankWordTokenizer` is NLTK's standard word tokenizer; the `word_tokenize()` function is built on it. If you have well-formed text such as news articles, this tokenizer is usually the way to go.


It splits:

- Punctuation
- Contractions (e.g., "don't" → "do" + "n't")
- Periods and quotes

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
from nltk.tokenize import TreebankWordTokenizer, word_tokenize

treebank_tokenizer = TreebankWordTokenizer()

sentences = [
    "Text processing with Python is great.",
    "It isn't (very) complicated to get started.",
    "However, careful to... you know....avoid mistakes.",
    "Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg.",
    "This is so cooool #nltkrocks :))) :-P <3."
]

print('\nOutput of TreebankWordTokenizer:')
for s in sentences:
    print(treebank_tokenizer.tokenize(s))

print('\nOutput of the word_tokenize() method:')
for s in sentences:
    print(word_tokenize(s))


Output of TreebankWordTokenizer:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', 'is', "n't", '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '...', '.avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth', '@', 'nus.edu.sg', ';', 'see', 'http', ':', '//nus.edu.sg', '.']
['This', 'is', 'so', 'cooool', '#', 'nltkrocks', ':', ')', ')', ')', ':', '-P', '<', '3', '.']

Output of the word_tokenize() method:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', 'is', "n't", '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '....', 'avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth', '@', 'nus.edu.sg', ';', 'see', 'http', ':', '//nus.edu.sg', '.']
['This', 'is', 'so', 'cooool', '#', 'nltkrocks', ':', ')', ')', ')', ':', '-P', '<', '3', '.']


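Both outputs are almost the same, since the `word_tokenize()` method is essentially a convenience wrapper that first splits the text into sentences and then applies a Treebank-style word tokenizer (which is why it needs the `punkt` data downloaded above). The two outputs differ only in the third sentence, in how the four-dot run `....` is treated.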

See how this tokenizer also splits common contractions such as *isn't*, *hasn't*, *haven't*. Other tokenizers (see below) consider such contractions as one token. Being aware of how this is handled matters, for example, in sentiment analysis, where handling negations correctly is essential to get the sentiment right.

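Also, notice how the `TreebankWordTokenizer` handles the ellipsis (`...`) correctly in the first case but fails in the second case, since an ellipsis is by definition composed of exactly three dots: the four-dot run is split into `...` and `.avoid`. The `word_tokenize()` method instead keeps `....` as a single token. In general, runs of more or fewer than three dots are not handled reliably.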
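To make the point about negations concrete, here is a minimal sketch of Treebank-style contraction splitting applied to a list of tokens. The helper name `split_negations` is hypothetical and the rule only covers the `n't` case; it is not NLTK's implementation. Once `n't` is a token of its own, a later step can detect the negation with a simple membership test.

```python
import re

# A word ending in n't, e.g. isn't -> is + n't, can't -> ca + n't
NEGATION = re.compile(r"(\w+)(n't)", re.IGNORECASE)

def split_negations(tokens):
    result = []
    for token in tokens:
        match = NEGATION.fullmatch(token)
        if match:
            result.extend([match.group(1), match.group(2)])
        else:
            result.append(token)
    return result

tokens = split_negations(["It", "isn't", "very", "complicated"])
print(tokens)           # ['It', 'is', "n't", 'very', 'complicated']
print("n't" in tokens)  # True
```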


# **TweetTokenizer**

The `TweetTokenizer` is optimized for social media content where people use informal concepts such as hashtags or emoticons. Note that emoticons often contain punctuation marks that throw other tokenizers off.


In [None]:
tweet_tokenizer = TweetTokenizer()

print ('Output of TweetTokenizer:')
for s in sentences:
    print (tweet_tokenizer.tokenize(s))

Output of TweetTokenizer:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', "isn't", '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '...', 'avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth@nus.edu.sg', ';', 'see', 'http://nus.edu.sg', '.']
['This', 'is', 'so', 'cooool', '#nltkrocks', ':)', ')', ')', ':-P', '<3', '.']


Here, both ellipses are recognized, with the second one even "corrected" to three dots.

Note how the tokenizer fails with `:)))`. The problem is that it is not the "official version" of the emoticon -- which is `:)` or `:-)` -- but uses multiple "mouths" to emphasize the expressed sentiment or feeling. If the subsequent analysis does not really depend on it, a few extra `)` tokens are no big deal in many cases.

The three basic alternatives to properly address this issue:
- Clean your text before tokenizing
- Remove all "odd" tokens from the list before further processing
- Write your own sophisticated tokenizer :-)
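The first two alternatives can be sketched with a few lines of standard-library Python. Both helper names (`normalize_informal_text`, `drop_stray_mouths`) and the specific rules are assumptions made for illustration; real data will usually need more rules than these.

```python
import re

def normalize_informal_text(text):
    # Option 1: clean the text before tokenizing.
    # Collapse any run of two or more dots into a three-dot ellipsis
    text = re.sub(r"\.{2,}", "...", text)
    # Reduce emoticons with repeated "mouths", e.g. :))) -> :)
    text = re.sub(r"(:-?[()])[()]+", r"\1", text)
    return text

def drop_stray_mouths(tokens):
    # Option 2: remove "odd" tokens after tokenizing.
    cleaned = []
    for token in tokens:
        # Skip a lone ")" that directly follows a smiley
        if token == ")" and cleaned and cleaned[-1] in {":)", ":-)"}:
            continue
        cleaned.append(token)
    return cleaned

print(normalize_informal_text("This is so cooool #nltkrocks :))) :-P <3."))
print(drop_stray_mouths(["so", "cooool", "#nltkrocks", ":)", ")", ")", ":-P", "<3", "."]))
```

The second helper only drops a `)` that directly follows a smiley, so legitimate closing parentheses as in `(very)` are kept.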


# **RegexpTokenizer**

The `RegexpTokenizer` takes as input a regular expression that specifies which parts of the string qualify as a valid token. That means that some parts of the string might be removed. In principle, all previous tokenizers can be expressed as a `RegexpTokenizer` -- however, the required regular expressions can be very complex.

In [None]:
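# Each assignment overwrites the previous one; only the last pattern is used below
pattern = r'\w+' # all alphanumeric words (letters, digits, underscore)
pattern = r'[a-zA-Z]+' # alphabetic words only (no digits)
pattern = r"[a-zA-Z']+" # alphabetic words, keeping apostrophes so contractions stay intact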
regexp_tokenizer = RegexpTokenizer(pattern)

print ('\nOutput of RegexpTokenizer for pattern {}:'.format(pattern))
for s in sentences:
    print (regexp_tokenizer.tokenize(s))


Output of RegexpTokenizer for pattern [a-zA-Z']+:
['Text', 'processing', 'with', 'Python', 'is', 'great']
['It', "isn't", 'very', 'complicated', 'to', 'get', 'started']
['However', 'careful', 'to', 'you', 'know', 'avoid', 'mistakes']
['Contact', 'me', 'at', 'vonderweth', 'nus', 'edu', 'sg', 'see', 'http', 'nus', 'edu', 'sg']
['This', 'is', 'so', 'cooool', 'nltkrocks', 'P']
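To give an idea of how quickly such expressions grow, the following sketch combines several alternatives into one pattern that keeps URLs, e-mail addresses, hashtags, and a few emoticons intact. It uses plain `re.findall`; the pattern `TOKEN_PATTERN` is a hand-written assumption for the example sentences of this notebook, not the expression used by any NLTK tokenizer, and it will miss many real-world cases.

```python
import re

# Alternatives are tried from top to bottom, so specific ones come first
TOKEN_PATTERN = re.compile(r"""
    https?://[^\s;,]*[^\s;,.]       # URLs, without a trailing period
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+    # e-mail addresses
  | [#@]\w+                         # hashtags and @mentions
  | [:;]-?[)(DPp]+                  # a few simple emoticons, e.g. :) or :-P
  | <3                              # heart emoticon
  | \w+(?:'\w+)?                    # words, optionally with a contraction
  | \.{2,}                          # ellipsis of any length
  | [^\w\s]                         # any other single punctuation mark
""", re.VERBOSE)

def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg."))
print(regex_tokenize("This is so cooool #nltkrocks :))) :-P <3."))
```

The order of the alternatives matters: if the generic word rule came first, the URL and the e-mail address would be broken apart before the more specific rules had a chance to match.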


# **WordPunctTokenizer**

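Splits text into alternating runs of word characters and punctuation, so every punctuation mark is separated and contractions such as *isn't* are broken into `isn`, `'`, `t`.

✅ Good for text where punctuation is important (e.g., quotes)

⚠️ Often too fine-grained: as the output below shows, it also breaks up e-mail addresses, URLs, and hashtags.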


In [None]:
from nltk.tokenize import WordPunctTokenizer

# Initialize the tokenizer
wordpunct_tokenizer = WordPunctTokenizer()

# Sample sentences
sentences = [
    "Text processing with Python is great.",
    "It isn't (very) complicated to get started.",
    "However, careful to... you know....avoid mistakes.",
    "Contact me at vonderweth@nus.edu.sg; see http://nus.edu.sg.",
    "This is so cooool #nltkrocks :))) :-P <3."
]

print('\nOutput of WordPunctTokenizer:')
for s in sentences:
    print(wordpunct_tokenizer.tokenize(s))



Output of WordPunctTokenizer:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', 'isn', "'", 't', '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '....', 'avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth', '@', 'nus', '.', 'edu', '.', 'sg', ';', 'see', 'http', '://', 'nus', '.', 'edu', '.', 'sg', '.']
['This', 'is', 'so', 'cooool', '#', 'nltkrocks', ':)))', ':-', 'P', '<', '3', '.']


## Tokenization with spaCy

spaCy is another Python text processing package. It is newer than NLTK but also very sophisticated, and it quickly gained a lot of popularity. Its usage is often easier compared to NLTK since spaCy typically combines several processing steps into one, thus hiding more of its logic and requiring fewer lines of code.


### Import required packages

In [None]:
import spacy

### Load English language model

In [None]:
nlp = spacy.load('en_core_web_sm')

### Process document

In contrast to NLTK, the common usage of spaCy is to pass a string to the loaded pipeline, which performs not only tokenization but also other steps (see later tutorial). Here, we only look at the tokens.

Again, we process each sentence individually to simplify the output.

In [None]:
print ('\nOutput of spaCy tokenizer:')
for s in sentences:
    doc = nlp(s) # doc is an object, not just a simple list
    # Collect the text of each token (token is also an object, not a string)
    # so the output matches the previous ones
    print([token.text for token in doc])


Output of spaCy tokenizer:
['Text', 'processing', 'with', 'Python', 'is', 'great', '.']
['It', 'is', "n't", '(', 'very', ')', 'complicated', 'to', 'get', 'started', '.']
['However', ',', 'careful', 'to', '...', 'you', 'know', '....', 'avoid', 'mistakes', '.']
['Contact', 'me', 'at', 'vonderweth@nus.edu.sg', ';', 'see', 'http://nus.edu.sg', '.']
['This', 'is', 'so', 'cooool', '#', 'nltkrocks', ':)))', ':-P', '<3', '.']


spaCy does a bit better with the uncommon emoticon, but splits the hashtag.
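When comparing tokenizers, it can help to list the tokens on which two outputs disagree instead of reading the lists side by side. The sketch below does this for the last sentence, with both token lists copied from the `TweetTokenizer` and spaCy outputs shown above; the variable names are arbitrary.

```python
# Token lists copied from the outputs above (last sentence)
tweet_tokens = ['This', 'is', 'so', 'cooool', '#nltkrocks', ':)', ')', ')', ':-P', '<3', '.']
spacy_tokens = ['This', 'is', 'so', 'cooool', '#', 'nltkrocks', ':)))', ':-P', '<3', '.']

# Tokens produced by one tokenizer but not by the other
only_tweet = [t for t in tweet_tokens if t not in spacy_tokens]
only_spacy = [t for t in spacy_tokens if t not in tweet_tokens]

print(only_tweet)  # ['#nltkrocks', ':)', ')', ')']
print(only_spacy)  # ['#', 'nltkrocks', ':)))']
```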

## Summary

There are different tokenizer implementations available — each has its own strengths and weaknesses.

There's no one-size-fits-all tokenizer that works best in every situation.

For formal text (like news articles or academic writing), most tokenizers do just fine.
But it's still important to know how the tokenizer works, for example how it handles contractions:

“don’t” → “do” + “n’t” 🧠

When it comes to informal text 💬 (like tweets 🐦, chats, or internet slang), tokenization becomes trickier. These texts often break language rules, so tokenizers need to be smarter.

Since tokenization is the very first step in natural language processing (NLP), it's really important to get it right ✅.
Otherwise, errors made here can cause problems in all the steps that follow.

