## Introduction

**TensorFlow Text** provides a collection of text-related classes and operations (ops for short) ready to use with TensorFlow 2.0. The library performs the **preprocessing regularly required by text-based models** and includes other features useful for sequence modeling that core TensorFlow does not provide.

The benefit of using these ops for text preprocessing is that they run inside the TensorFlow graph. We do not need to worry about tokenization in training differing from tokenization at inference, or about managing separate preprocessing scripts.

## Eager execution

TensorFlow Text requires TensorFlow 2.0 and is fully compatible with both eager mode and graph mode.

In [1]:
!pip install -q tensorflow-text

In [2]:
import tensorflow as tf
import tensorflow_text as text

## Unicode

Most ops expect strings to be in UTF-8. If we're using a different encoding, we can use the core TensorFlow **transcode op** (`tf.strings.unicode_transcode`) to transcode into UTF-8. We can also use the same op to coerce our strings to structurally valid UTF-8 if the input could be invalid.

In [3]:
docs = tf.constant([
    'Everything not saved will be lost.'.encode('UTF-16-BE'),
    'Sad☹'.encode('UTF-16-BE')
])
docs

<tf.Tensor: shape=(2,), dtype=string, numpy=
array([b'\x00E\x00v\x00e\x00r\x00y\x00t\x00h\x00i\x00n\x00g\x00 \x00n\x00o\x00t\x00 \x00s\x00a\x00v\x00e\x00d\x00 \x00w\x00i\x00l\x00l\x00 \x00b\x00e\x00 \x00l\x00o\x00s\x00t\x00.',
       b'\x00S\x00a\x00d&9'], dtype=object)>

In [5]:
utf8_docs = tf.strings.unicode_transcode(
    docs,
    input_encoding='UTF-16-BE',
    output_encoding='UTF-8'
)
utf8_docs

<tf.Tensor: shape=(2,), dtype=string, numpy=
array([b'Everything not saved will be lost.', b'Sad\xe2\x98\xb9'],
      dtype=object)>
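
For intuition, the same two steps can be mimicked with Python's built-in codecs. This is only a plain-Python sketch of what transcoding and coercion mean, not how the TensorFlow op is implemented; the invalid byte string `b'Sad\xff'` is a made-up input for illustration.

```python
# Plain-Python illustration (assumption: stdlib codecs stand in for the TF op).

# 1) Transcode UTF-16-BE bytes to UTF-8 bytes.
utf16_bytes = 'Sad☹'.encode('utf-16-be')
utf8_bytes = utf16_bytes.decode('utf-16-be').encode('utf-8')
print(utf8_bytes)  # b'Sad\xe2\x98\xb9'

# 2) Coerce possibly invalid UTF-8 into structurally valid UTF-8.
# The byte 0xFF can never appear in UTF-8, so it is replaced by U+FFFD.
invalid = b'Sad\xff'  # hypothetical bad input
coerced = invalid.decode('utf-8', errors='replace').encode('utf-8')
print(coerced)  # b'Sad\xef\xbf\xbd'
```

In `tf.strings.unicode_transcode`, the `errors` argument plays the corresponding role; its exact options are best checked in the API reference.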

## Tokenization

Tokenization is the process of **breaking up a string into tokens**. Commonly, these tokens are words, numbers, and/or punctuation.

### WhitespaceTokenizer

This is a basic tokenizer that splits UTF-8 strings on ICU-defined whitespace characters (e.g. space, tab, newline).

In [7]:
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', 'Sad☹'.encode('UTF-8')])
tokens.to_list()

[[b'everything', b'not', b'saved', b'will', b'be', b'lost.'],
 [b'Sad\xe2\x98\xb9']]

### UnicodeScriptTokenizer

This tokenizer splits UTF-8 strings based on Unicode script boundaries. The script codes used correspond to International Components for Unicode (ICU) UScriptCode values.

In practice, this is similar to the `WhitespaceTokenizer`, the most apparent **difference** being that it will **split punctuation** (`USCRIPT_COMMON`) from language texts (e.g. `USCRIPT_LATIN`, `USCRIPT_CYRILLIC`, etc.) while also **separating language texts from each other**.

In [8]:
tokenizer = text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', 'Sad☹'.encode('UTF-8')])
tokens.to_list()

[[b'everything', b'not', b'saved', b'will', b'be', b'lost', b'.'],
 [b'Sad', b'\xe2\x98\xb9']]
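
To make the contrast with `WhitespaceTokenizer` concrete, here is a rough plain-Python sketch. It does not use ICU script codes; as a stand-in it groups characters by a coarse Unicode category (letters, marks, and digits versus everything else), which happens to reproduce the output above for these two inputs. The helper names `coarse_class` and `split_by_class` are hypothetical.

```python
import itertools
import unicodedata

def coarse_class(ch):
    """Map a character to 'space', 'word' (letter/mark/number) or 'other'."""
    if ch.isspace():
        return 'space'
    if unicodedata.category(ch)[0] in ('L', 'M', 'N'):
        return 'word'
    return 'other'

def split_by_class(s):
    """Split wherever the coarse class changes, dropping whitespace runs."""
    return [''.join(group)
            for cls, group in itertools.groupby(s, key=coarse_class)
            if cls != 'space']

sentence = 'everything not saved will be lost.'
print(sentence.split())          # whitespace only: 'lost.' stays one token
print(split_by_class(sentence))  # the trailing '.' becomes its own token
print(split_by_class('Sad☹'))    # ['Sad', '☹']
```

Unlike a script-based tokenizer, this approximation would not separate, say, Latin text from adjacent Cyrillic text, since both fall into the same coarse class.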

### Unicode split

When tokenizing languages that do not use whitespace to segment words, it is common to split by character, which can be done with the `unicode_split` op in core TensorFlow.

In [9]:
tokens = tf.strings.unicode_split(["仅今年前".encode('UTF-8')], 'UTF-8')
tokens.to_list()

[[b'\xe4\xbb\x85', b'\xe4\xbb\x8a', b'\xe5\xb9\xb4', b'\xe5\x89\x8d']]
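
The byte strings in this output are simply the UTF-8 encodings of the individual characters. A plain-Python sketch (not the TensorFlow op itself) makes this visible:

```python
# Iterating over a Python str yields one Unicode code point at a time.
chars = list('仅今年前')
encoded = [ch.encode('utf-8') for ch in chars]
print(encoded)
# [b'\xe4\xbb\x85', b'\xe4\xbb\x8a', b'\xe5\xb9\xb4', b'\xe5\x89\x8d']

# Each of these characters takes three bytes in UTF-8.
byte_lengths = [len(b) for b in encoded]
print(byte_lengths)  # [3, 3, 3, 3]
```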

### Offset

When tokenizing strings, we often want to know where in the original string each token came from. For this reason, each tokenizer that implements `TokenizerWithOffsets` has a `tokenize_with_offsets` method that **returns the byte offsets along with the tokens**.
- `offset_starts` lists the byte offset in the original string at which each token starts.
- `offset_limits` lists the byte offset at which each token ends (exclusive).

In [11]:
tokenizer = text.UnicodeScriptTokenizer()
(tokens, offset_starts, offset_limits) = tokenizer.tokenize_with_offsets(
    ['everything not saved will be lost.', 'Sad☹'.encode('UTF-8')]
)
tokens, offset_starts, offset_limits

(<tf.RaggedTensor [[b'everything', b'not', b'saved', b'will', b'be', b'lost', b'.'], [b'Sad', b'\xe2\x98\xb9']]>,
 <tf.RaggedTensor [[0, 11, 15, 21, 26, 29, 33], [0, 3]]>,
 <tf.RaggedTensor [[10, 14, 20, 25, 28, 33, 34], [3, 6]]>)
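
Because the offsets are byte positions, they can be used to slice the UTF-8 encoded original directly. The following plain-Python sketch reuses the offsets printed above (copied by hand, not recomputed) and shows that each `[start, limit)` range recovers the corresponding token:

```python
# Offsets copied from the tokenize_with_offsets output above.
raw_docs = ['everything not saved will be lost.'.encode('utf-8'),
            'Sad☹'.encode('utf-8')]
offset_starts = [[0, 11, 15, 21, 26, 29, 33], [0, 3]]
offset_limits = [[10, 14, 20, 25, 28, 33, 34], [3, 6]]

recovered = []
for raw, starts, limits in zip(raw_docs, offset_starts, offset_limits):
    # The limit is exclusive, so raw[start:limit] is exactly the token's bytes.
    recovered.append([raw[s:e] for s, e in zip(starts, limits)])

print(recovered[0])
print(recovered[1])  # [b'Sad', b'\xe2\x98\xb9']
```

Note that `☹` spans bytes 3 to 6: it is a single character but occupies three bytes in UTF-8, which is why the offsets must be applied to the encoded bytes rather than to a Python `str`.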

### tf.data example

Tokenizers work as expected with the `tf.data` API. A simple example is provided below.

In [12]:
docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = iter(tokenized_docs)

next(iterator).to_list(), next(iterator).to_list()

([[b'Never', b'tell', b'me', b'the', b'odds.']], [[b"It's", b'a', b'trap!']])

## Other Text Ops

TF.Text packages other useful preprocessing ops. We will review a couple of them below.

### Wordshape

A common feature used in some natural language understanding models is whether a text string has a certain property. For example, a sentence-breaking model might contain features that check for word capitalization or whether a punctuation character is at the end of a string.

**Wordshape** defines a variety of **useful regular-expression-based helper functions** for **matching various relevant patterns** in our input text. Here are a few examples.

In [13]:
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.', 'Sad☹'.encode('UTF-8')])
tokens.to_list()

[[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'],
 [b'Sad\xe2\x98\xb9']]

In [15]:
# Is capitalized?
f1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)
f1.to_list()

[[True, False, False, False, False, False], [True]]

In [16]:
# Are all letters uppercased?
f2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)
f2.to_list()

[[False, False, False, False, False, False], [False]]

In [17]:
# Does the token contain punctuation or a symbol?
f3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)
f3.to_list()

[[False, False, False, False, False, True], [True]]

In [18]:
# Is the token a number?
f4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)
f4.to_list()

[[False, False, False, False, False, False], [False]]
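
To give a feel for what such pattern-based features look like, here is a plain-Python sketch using `re` and `unicodedata`. The patterns and names below are simplified, hypothetical stand-ins chosen for illustration; they are not the patterns TF.Text uses internally and may disagree with it on other inputs.

```python
import re
import unicodedata

# Simplified, hypothetical patterns; NOT the ones used by TF.Text.
STARTS_WITH_CAPITAL = re.compile(r'^[A-Z]')
ALL_UPPERCASE = re.compile(r'^[A-Z]+$')
NUMERIC = re.compile(r'^[+-]?\d+(\.\d+)?$')

def has_punct_or_symbol(token):
    """True if any character is Unicode punctuation (P*) or a symbol (S*)."""
    return any(unicodedata.category(ch)[0] in ('P', 'S') for ch in token)

tokens = ['Everything', 'not', 'saved', 'will', 'be', 'lost.']

capital = [bool(STARTS_WITH_CAPITAL.search(t)) for t in tokens]
upper = [bool(ALL_UPPERCASE.search(t)) for t in tokens]
punct = [has_punct_or_symbol(t) for t in tokens]
numeric = [bool(NUMERIC.search(t)) for t in tokens]

print(capital)  # only 'Everything' starts with a capital letter
print(upper)
print(punct)    # only 'lost.' contains punctuation
print(numeric)
print(has_punct_or_symbol('Sad☹'))  # True, because '☹' is a symbol
```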

### N-gram & Sliding Window

**N-grams are sequences of n consecutive tokens taken with a sliding window of size n**. When combining the tokens in each window, three reduction mechanisms are supported.
- Text
    - `Reduction.STRING_JOIN` which appends the strings to each other
- Numerical values
    - `Reduction.SUM`
    - `Reduction.MEAN`

In [23]:
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Everything not saved will be lost.', 'Sad☹'.encode('UTF-8')])
tokens.to_list()

[[b'Everything', b'not', b'saved', b'will', b'be', b'lost.'],
 [b'Sad\xe2\x98\xb9']]

In [24]:
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)
bigrams.to_list()

[[b'Everything not', b'not saved', b'saved will', b'will be', b'be lost.'], []]
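
The sliding window and the three kinds of reduction can be sketched in plain Python. The helper `sliding_ngrams` is hypothetical and only mirrors the idea; it is not the `text.ngrams` implementation. Note how an input shorter than n yields no n-grams, matching the empty list for the second document above.

```python
def sliding_ngrams(items, n, reduce_fn):
    """Apply reduce_fn to every window of n consecutive items."""
    return [reduce_fn(items[i:i + n]) for i in range(len(items) - n + 1)]

words = ['Everything', 'not', 'saved', 'will', 'be', 'lost.']

# String join: concatenate the tokens in each window, separated by a space.
bigrams = sliding_ngrams(words, 2, ' '.join)
print(bigrams)

# A single token cannot form a bigram, so the result is empty.
short = sliding_ngrams(['Sad☹'], 2, ' '.join)
print(short)  # []

# Sum and mean reductions apply to numerical values.
values = [1, 2, 3, 4]
sums = sliding_ngrams(values, 2, sum)
means = sliding_ngrams(values, 2, lambda w: sum(w) / len(w))
print(sums)   # [3, 5, 7]
print(means)  # [1.5, 2.5, 3.5]
```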

## References 
- https://www.tensorflow.org/tutorials/tensorflow_text/intro