# **Word Tokenization**

## Abstract

This notebook explores the process of tokenizing a corpus of text into words.


## Word Tokenization

**Word tokenization** is the process of splitting a text string into a list of words, called tokens.

For example:
> "Hello, World!"

can be tokenized as:
> ["Hello", "World"]
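
The example above can be reproduced with a few lines of standard-library Python. This is only a minimal sketch: the helper name `simple_word_tokenize` is made up for illustration, and the `\w+` pattern is one assumed definition of a "word" that discards punctuation entirely, unlike the library tokenizers used later in this notebook.

```python
import re

def simple_word_tokenize(text):
    """Return the runs of letters, digits, and underscores found in text."""
    # \w+ matches maximal runs of word characters, so punctuation and
    # whitespace act as separators and never appear in the output.
    return re.findall(r"\w+", text)

print(simple_word_tokenize("Hello, World!"))  # ['Hello', 'World']
```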

## Common Word Tokenization Issues

- **Clitic Contractions**
  - I’m, I’ve, I’d, It’s, We’re
- **Possessive Marker**
  - The Queen of England's crown
- **Negative Marker**
  - n’t
- **Hyphenated Words**
  - State-of-the-art
- **Multiword Expressions**
  - San Francisco
- **Abbreviations**
  - m.p.h.
  - PhD.
- **Formatted Data**
  - 25/02/1996
  - $69.420
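
To make these cases concrete, here is a rough rule-based sketch that handles several of them with a single regular expression. Everything in it (the `TOKEN_PATTERN` name, the `tokenize` helper, and the order of the alternatives) is an assumption chosen for illustration, not how any particular library works. It uses straight apostrophes only and does nothing for multiword expressions such as San Francisco, which usually need a lexicon or a statistical model.

```python
import re

# Hypothetical pattern; alternatives are tried from most to least specific.
TOKEN_PATTERN = re.compile(
    r"""
      \$?\d+(?:[./,]\d+)*    # formatted data: dates, prices, plain numbers
    | (?:[A-Za-z]\.){2,}     # abbreviations such as m.p.h.
    | \w+(?:-\w+)+           # hyphenated words
    | \w+(?=n't)             # stem of a negated word (the "is" in "isn't")
    | n't                    # negative marker
    | '\w+                   # clitics and the possessive marker ('m, 's)
    | \w+                    # ordinary words
    | [^\w\s]                # any remaining punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split text into tokens using the illustrative pattern above."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("It isn't state-of-the-art."))
# ['It', 'is', "n't", 'state-of-the-art', '.']
```

Ordering matters here: the formatted-data and abbreviation rules must come before the generic `\w+` rule, otherwise `25/02/1996` and `m.p.h.` would be broken apart at every punctuation mark.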



## Implementation

In [1]:
!pip install -q "tensorflow-text==2.8.*"


In [59]:
import tensorflow as tf
import tensorflow_text as tf_text

import nltk
from nltk.tokenize import word_tokenize

# word_tokenize relies on the pre-trained Punkt models
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## TensorFlow Tokenizers

Shown below are some of the tokenizers provided by TensorFlow Text. String inputs are assumed to be [UTF-8](https://www.tensorflow.org/text/guide/unicode).

In [43]:
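def print_tokens(tokens):
  """Print one token per line, with a blank line after each sentence."""
  if isinstance(tokens, list):
    # NLTK returns plain Python lists of strings
    for tokenized_sentence in tokens:
      for word in tokenized_sentence:
        print(word)
      print()
  else:
    # TensorFlow Text returns a RaggedTensor of UTF-8 encoded bytes
    for tokenized_sentence in tokens.to_list():
      for word in tokenized_sentence:
        print(word.decode())
      print()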

In [57]:
sentences = [
    "I'm enjoying NLP!",
    "I would like to work on state-of-the-art NLP research.",
    "I have never been to San Francisco.",
    "I might visit U.S.A. in the future.",
    "My birthday is on 25/02."
    ]

### Whitespace Tokenization

Whitespace tokenization is the most intuitive way to split text: it splits the string on whitespace only, so punctuation stays attached to the neighboring word (for example, `NLP!`).


In [44]:
# Instantiate a WhitespaceTokenizer
whitespace_tokenizer = tf_text.WhitespaceTokenizer()

# Tokenize the sentences and show the result
tokens = whitespace_tokenizer.tokenize(sentences)
print_tokens(tokens)

I'm
enjoying
NLP!

I
would
like
to
work
on
state-of-the-art
NLP
research.

I
have
never
been
to
San
Francisco.

I
might
visit
U.S.A.
in
the
future.

My
birthday
is
on
25/02.
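
For comparison, plain Python gives the same tokens for these ASCII sentences without any library. The snippet below is a sketch for intuition only: it uses the built-in `str.split()` and redefines a shortened `sentences` list so that it runs on its own. It is an assumption that the two approaches agree beyond simple inputs like these, since edge cases such as unusual Unicode whitespace may be handled differently.

```python
sentences = [
    "I'm enjoying NLP!",
    "My birthday is on 25/02.",
]

# str.split() with no arguments splits on any run of whitespace
# and drops leading and trailing whitespace.
tokens = [sentence.split() for sentence in sentences]

for tokenized_sentence in tokens:
    print(tokenized_sentence)
```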



### Unicode Script Word Tokenization

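The `UnicodeScriptTokenizer` splits strings on Unicode script boundaries. In practice, it behaves like the `WhitespaceTokenizer`, with the most apparent difference being that it separates punctuation from adjacent letters. Digits and punctuation are not separated from each other, which is why `25/02.` remains a single token in the output below.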


In [45]:
# Instantiate a UnicodeScriptTokenizer
unicode_script_tokenizer = tf_text.UnicodeScriptTokenizer()

# Tokenize the sentences and show the result
tokens = unicode_script_tokenizer.tokenize(sentences)
print_tokens(tokens)

I
'
m
enjoying
NLP
!

I
would
like
to
work
on
state
-
of
-
the
-
art
NLP
research
.

I
have
never
been
to
San
Francisco
.

I
might
visit
U
.
S
.
A
.
in
the
future
.

My
birthday
is
on
25/02.
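
The behavior seen above can be imitated for ASCII text with a two-class regular expression. This is a hypothetical stand-in, not the algorithm TensorFlow Text uses: it assumes that letters form one class and every other non-space character forms a second class, which is enough to mimic the splits shown here but ignores real Unicode script data.

```python
import re

# Hypothetical ASCII-only approximation: a token is either a run of
# letters or a run of non-letter, non-space characters.
SCRIPT_RUN = re.compile(r"[A-Za-z]+|[^A-Za-z\s]+")

def approx_script_tokenize(text):
    """Split text wherever the character class changes or whitespace occurs."""
    return SCRIPT_RUN.findall(text)

print(approx_script_tokenize("I'm enjoying NLP!"))
# ['I', "'", 'm', 'enjoying', 'NLP', '!']
print(approx_script_tokenize("My birthday is on 25/02."))
# ['My', 'birthday', 'is', 'on', '25/02.']
```

Because digits and punctuation fall into the same class, `25/02.` survives as one token, while `NLP!` is split at the boundary between letters and punctuation.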



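## NLTK Word Tokenizer

NLTK's `word_tokenize` separates punctuation and clitics such as `'m` into their own tokens while keeping hyphenated words (`state-of-the-art`), abbreviations (`U.S.A.`), and dates (`25/02`) intact.
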
In [58]:
# Tokenize each sentence separately
tokens = [word_tokenize(sentence) for sentence in sentences]

print_tokens(tokens)

I
'm
enjoying
NLP
!

I
would
like
to
work
on
state-of-the-art
NLP
research
.

I
have
never
been
to
San
Francisco
.

I
might
visit
U.S.A.
in
the
future
.

My
birthday
is
on
25/02
.
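
For contrast with the output above, a purely regex-based tokenizer splits at every boundary between word characters and punctuation. The sketch below uses only the standard library; the pattern is, to the best of my knowledge, the one NLTK's `WordPunctTokenizer` is built on, but treat that as an assumption and check the NLTK documentation. The helper name `wordpunct_like_tokenize` is made up for illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like_tokenize(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORD_PUNCT.findall(text)

print(wordpunct_like_tokenize("I'm enjoying NLP!"))
# ['I', "'", 'm', 'enjoying', 'NLP', '!']
print(wordpunct_like_tokenize("My birthday is on 25/02."))
# ['My', 'birthday', 'is', 'on', '25', '/', '02', '.']
```

Unlike `word_tokenize`, this approach breaks `I'm` into three pieces and `25/02` at the slash, which illustrates why a tokenizer with language-specific rules is often preferred for English text.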



## References

1. Dan Jurafsky and James H. Martin. [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft).

2. TensorFlow Text's [WhitespaceTokenizer](https://www.tensorflow.org/text/api_docs/python/text/WhitespaceTokenizer) and [UnicodeScriptTokenizer](https://www.tensorflow.org/text/api_docs/python/text/UnicodeScriptTokenizer) documentation.

3. NLTK's [nltk.tokenize](https://www.nltk.org/api/nltk.tokenize.html) package documentation.