Tokenization is the process of breaking text into smaller pieces called _tokens_. Tokens may or may not be words. There is no single correct way to tokenize, since different problems call for different token granularity. Below are two examples of tokenizers.

In [13]:
text = "I'm against picketing, but I don't know how to show it."

In [14]:
# A naïve tokenizer that splits on spaces.
tokens = text.split(" ")
print(tokens)

["I'm", 'against', 'picketing,', 'but', 'I', "don't", 'know', 'how', 'to', 'show', 'it.']


Splitting on spaces gets you most of the way there, but it ignores punctuation: 'picketing,' and 'it.' keep their trailing punctuation attached to the word.
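
One way to see what handling punctuation involves is a small regular-expression tokenizer. This is only a minimal sketch using the standard library: the function name `regex_tokenize` and the pattern are illustrative assumptions, not part of any library, and the pattern only covers simple English apostrophe contractions.

```python
import re

# Hypothetical helper for illustration; the pattern is an assumption, not a standard.
# It matches either a word (optionally with an apostrophe suffix such as 'm or 't)
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "I'm against picketing, but I don't know how to show it."
print(regex_tokenize(sample))
# ["I'm", 'against', 'picketing', ',', 'but', 'I', "don't", 'know', 'how', 'to', 'show', 'it', '.']
```

Unlike the split on spaces, the comma and the period now come out as their own tokens, while contractions such as "don't" stay in one piece.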

In [17]:
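# The default tokenizer in spaCy is a bit more robust.
# A blank English pipeline provides the rule-based tokenizer without loading
# a trained model. (The older spacy.load('en') shortcut and
# Defaults.create_tokenizer are not available in spaCy v3.)
from spacy.lang.en import English

nlp = English()
tokenizer = nlp.tokenizer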

tokens = tokenizer(text)
tokens = [token.text for token in tokens]

print(tokens)

['I', "'m", 'against', 'picketing', ',', 'but', 'I', 'do', "n't", 'know', 'how', 'to', 'show', 'it', '.']


spaCy treats words and punctuation marks as separate tokens, and even splits contractions into their parts: "don't" becomes "do" and "n't".
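
Both tokenizers above work at roughly the word level. To make the earlier point about granularity concrete, here is a minimal sketch that contrasts character-level and whitespace-level tokens for the same string. The helper names `char_tokens` and `word_tokens` are hypothetical and exist only for this illustration.

```python
def char_tokens(text):
    """Finest granularity: every character, including spaces, is a token."""
    return list(text)

def word_tokens(text):
    """Coarser granularity: split on runs of whitespace."""
    return text.split()

sample = "show it."
print(char_tokens(sample))
# ['s', 'h', 'o', 'w', ' ', 'i', 't', '.']
print(word_tokens(sample))
# ['show', 'it.']
```

Which granularity is appropriate depends on the problem at hand.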