In [1]:
import nltk

## Tokenization

Tokenization splits a string into a list of tokens, typically words and punctuation marks.

In [5]:
from nltk.tokenize import word_tokenize
sentence = 'Hello ! , This is Meet'
word_tokenize(sentence)

['Hello', '!', ',', 'This', 'is', 'Meet']

In [14]:
from nltk.tokenize import TreebankWordTokenizer

token = TreebankWordTokenizer()
sentence = "I can't allow you to go home early"
token.tokenize(sentence)

['I', 'ca', "n't", 'allow', 'you', 'to', 'go', 'home', 'early']

The most notable convention of this tokenizer is that it splits contractions. The word_tokenize() function follows the same convention and gives the following output for "won't":

In [7]:
from nltk.tokenize import word_tokenize

word_tokenize("won't")

['wo', "n't"]

TreebankWordTokenizer follows the same convention, which is not acceptable in every application.

In [13]:
from nltk.tokenize import TreebankWordTokenizer

token = TreebankWordTokenizer()
token.tokenize("won't")

['wo', "n't"]
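
The split into 'wo' and "n't" follows the Penn Treebank convention of treating the negation clitic as its own token. The sketch below imitates only that one rule with the standard `re` module. `split_negation` is a hypothetical helper written for illustration, not part of NLTK, and the real tokenizer applies many more rules.

```python
import re

def split_negation(text):
    """Separate a trailing n't from the word it is attached to (illustrative only)."""
    # Insert a space before n't when it directly follows a word character,
    # then split the result on whitespace.
    spaced = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    return spaced.split()

wont = split_negation("won't")
cant = split_negation("I can't allow you to go home early")
print(wont)
print(cant)
```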

That is why NLTK offers alternative word tokenizers such as WordPunctTokenizer. PunktWordTokenizer is often mentioned alongside it, but it is no longer available in recent NLTK releases.

### WordPunctTokenizer

WordPunctTokenizer is an alternative word tokenizer that splits all punctuation into separate tokens. The following simple example shows the effect:

In [15]:
from nltk.tokenize import WordPunctTokenizer

word_token = WordPunctTokenizer()

word_token.tokenize("I can't allow you to go home early")

['I', 'can', "'", 't', 'allow', 'you', 'to', 'go', 'home', 'early']
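
NLTK documents WordPunctTokenizer as being based on the regular expression `\w+|[^\w\s]+`, that is, runs of word characters or runs of punctuation. As a rough check of that description, the sketch below applies the same pattern with the standard `re` module. It assumes `re.findall` is an adequate stand-in for the tokenizer and is not NLTK's own code.

```python
import re

# Pattern taken from the WordPunctTokenizer documentation:
# either a run of word characters or a run of non-word, non-space characters.
WORD_PUNCT = r"\w+|[^\w\s]+"

contraction_tokens = re.findall(WORD_PUNCT, "I can't allow you to go home early")
punct_run_tokens = re.findall(WORD_PUNCT, "Wait... what?!")
print(contraction_tokens)
print(punct_run_tokens)
```

Consecutive punctuation characters stay together in one token, while an apostrophe inside a word becomes a token of its own.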

### Tokenizing text into sentences

An obvious question is why we need a sentence tokenizer when we already have a word tokenizer. Suppose we need to compute the average number of words per sentence. That task needs both: sentence tokenization to find the sentences, and word tokenization to count the words in each one.

In [17]:
from nltk.tokenize import sent_tokenize

text = """Let us understand the difference between sentence & word tokenizer. 
It is going to be a simple example."""
sent_tokenize(text)

['Let us understand the difference between sentence & word tokenizer.',
 'It is going to be a simple example.']
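
The average-words-per-sentence task mentioned above can be sketched as follows. The function takes the two tokenizers as arguments, so `sent_tokenize` and `word_tokenize` could be passed in directly. To keep the example free of downloaded NLTK data, it is run here with two crude regex stand-ins (`naive_sentences` and `naive_words`, both hypothetical helpers), so the resulting number will differ from what the NLTK tokenizers give, since those also count punctuation as tokens.

```python
import re

def average_words_per_sentence(text, split_sentences, split_words):
    """Mean number of word tokens per sentence; 0.0 if no sentence is found."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    total_words = sum(len(split_words(s)) for s in sentences)
    return total_words / len(sentences)

# Crude stand-ins for illustration only (assumptions, not NLTK functions).
def naive_sentences(text):
    # Split after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_words(sentence):
    # Count runs of word characters; punctuation is ignored.
    return re.findall(r"\w+", sentence)

text = """Let us understand the difference between sentence & word tokenizer.
It is going to be a simple example."""
avg = average_words_per_sentence(text, naive_sentences, naive_words)
print(avg)
```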

If the output of the word tokenizers above is not acceptable and you want complete control over how the text is tokenized, you can use regular expressions. NLTK provides the RegexpTokenizer class for this purpose.

Let us understand the concept with the help of two examples below.

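In the first example we use a regular expression that matches runs of alphanumeric characters. Because the apostrophe is not part of the pattern, contractions like "won't" are still split. Adding the single quote to the character class, as in [\w']+, would keep them together.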

In [25]:
from nltk.tokenize import RegexpTokenizer
# the pattern matches runs of word characters (letters, digits, underscore)
reg1 = RegexpTokenizer(r"[\w]+")
print(reg1.tokenize("won't is a coincidence."))
print(reg1.tokenize("can't is a coincidence ."))

['won', 't', 'is', 'a', 'coincidence']
['can', 't', 'is', 'a', 'coincidence']
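
The pattern `[\w]+` on its own breaks a contraction at the apostrophe, as the output above shows. Adding the apostrophe to the character class keeps it inside the token. The sketch below compares the two patterns with `re.findall`, on the assumption that RegexpTokenizer in its default mode returns the pattern's matches in the same way. `regex_tokens` is a hypothetical helper, not an NLTK function.

```python
import re

def regex_tokens(pattern, text):
    # Stand-in for RegexpTokenizer(pattern).tokenize(text) with default settings.
    return re.findall(pattern, text)

text = "won't is a contraction."
without_apostrophe = regex_tokens(r"[\w]+", text)
with_apostrophe = regex_tokens(r"[\w']+", text)
print(without_apostrophe)
print(with_apostrophe)
```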


In [32]:
# the pattern matches whitespace; gaps=True treats each match as a separator
reg2 = RegexpTokenizer(r'\s+', gaps=True)
print(reg2.tokenize("won't is a contraction."))

["won't", 'is', 'a', 'contraction.']


The output above shows that punctuation stays attached to the tokens. With gaps=True, the pattern describes the gaps between tokens, so the text is split wherever the pattern matches. With gaps=False, the pattern describes the tokens themselves, so only the matched whitespace is returned:

In [33]:
reg2 = RegexpTokenizer(r'\s+', gaps=False)
print(reg2.tokenize("won't is a contraction."))

[' ', ' ', ' ']
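
One way to picture the gaps switch is as a choice between splitting on the pattern and collecting its matches. The sketch below models that with the standard `re` module. `tokenize` is a hypothetical helper that only approximates the behaviour and is not NLTK's implementation.

```python
import re

def tokenize(pattern, text, gaps=False):
    """Rough model of the gaps switch (illustrative, not NLTK code)."""
    if gaps:
        # The pattern describes separators: split on it and drop empty strings.
        return [tok for tok in re.split(pattern, text) if tok]
    # The pattern describes tokens: return every match.
    return re.findall(pattern, text)

text = "won't is a contraction."
gap_tokens = tokenize(r"\s+", text, gaps=True)
match_tokens = tokenize(r"\s+", text, gaps=False)
# Matching non-whitespace directly gives the same tokens as splitting on whitespace.
non_space_tokens = tokenize(r"\S+", text, gaps=False)
print(gap_tokens)
print(match_tokens)
print(non_space_tokens)
```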
