Processing natural language data often requires handling text in many different languages. Unicode is the standard encoding system used to represent characters from almost all languages. Each character is assigned a unique integer **code point** between `0` and `0x10FFFF`. A Unicode string is a sequence of zero or more code points. This tutorial shows how to represent and manipulate Unicode strings in TensorFlow.
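
To make the idea of a code point concrete, here is a minimal plain-Python sketch (no TensorFlow required) that maps characters to code points and back with the built-ins `ord` and `chr`. The sample string is an arbitrary choice for illustration.

```python
# Plain-Python illustration of Unicode code points (the sample text is arbitrary).
sample = "A語😊"

# ord() maps a single character to its integer code point.
code_points = [ord(ch) for ch in sample]
print(code_points)

# chr() is the inverse: it maps a code point back to a character.
restored = "".join(chr(cp) for cp in code_points)
print(restored == sample)

# Every code point lies in the range 0..0x10FFFF.
print(all(0 <= cp <= 0x10FFFF for cp in code_points))
```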

In [0]:
!pip install -q tf-nightly

In [19]:
import tensorflow as tf

print("Tensorflow Version: {}".format(tf.__version__))
print("Eager Mode: {}".format(tf.executing_eagerly()))
print("GPU {} available".format("is" if tf.config.experimental.list_physical_devices("GPU") else "not"))

Tensorflow Version: 2.1.0-dev20200107
Eager Mode: True
GPU is available


# The `tf.string` Data Type

The basic data type `tf.string` allows you to build tensors of byte strings. Unicode strings are UTF-8 encoded by default.

In [20]:
tf.constant(u"Hello world, Unicode! 😊")

<tf.Tensor: shape=(), dtype=string, numpy=b'Hello world, Unicode! \xf0\x9f\x98\x8a'>

In [21]:
tf.constant([u"Nice", u"Frameworks"]).shape

TensorShape([2])

# Representing Unicode Strings in TensorFlow

In TensorFlow, there are two common representations of a Unicode string:
* `string` scalar: the sequence of code points is encoded with a known character encoding (such as UTF-8)
* `int32` vector: each position holds the code point of a single character

In [22]:
text_utf8 = tf.constant(u"語言處理")
text_utf8

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaa\x9e\xe8\xa8\x80\xe8\x99\x95\xe7\x90\x86'>

In [23]:
text_utf16 = tf.constant(u"語言處理".encode("UTF-16-BE"))
text_utf16

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8a\x9e\x8a\x00\x86Ut\x06'>

In [24]:
text_chars = tf.constant([ord(char) for char in u"語言處理"])
text_chars

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35486, 35328, 34389, 29702], dtype=int32)>

## Converting between Representations

TensorFlow provides three common APIs for converting between these representations of a Unicode string:
* `tf.strings.unicode_decode`: converts an encoded string into a vector of code points
* `tf.strings.unicode_encode`: converts a vector of code points into an encoded string
* `tf.strings.unicode_transcode`: converts an encoded string to a different encoding

In [25]:
tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35486, 35328, 34389, 29702], dtype=int32)>

In [26]:
tf.strings.unicode_encode(text_chars, output_encoding='UTF-8')

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaa\x9e\xe8\xa8\x80\xe8\x99\x95\xe7\x90\x86'>

In [27]:
tf.strings.unicode_transcode(text_utf8, input_encoding='UTF-8', output_encoding='UTF-16-BE')

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8a\x9e\x8a\x00\x86Ut\x06'>
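
As a cross-check outside TensorFlow, the same three conversions can be sketched in plain Python. This is only an analogue built on `str.encode` and `bytes.decode`, not how the `tf.strings` ops are implemented.

```python
text = "語言處理"
utf8_bytes = text.encode("utf-8")

# Encoded string -> code points (analogue of unicode_decode).
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# Code points -> encoded string (analogue of unicode_encode).
re_encoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# One encoding -> another encoding (analogue of unicode_transcode).
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")

print(code_points)
print(re_encoded == utf8_bytes)
print(utf16_bytes)
```

The printed code points and UTF-16-BE bytes should match the TensorFlow outputs shown in the cells above.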

## Batch Dimensions

When decoding a batch of Unicode strings, the number of characters in each string may differ. The `tf.RaggedTensor` API makes it easy to handle such a batch, because it stores rows of variable length.

In [28]:
batch_utf8 = [s.encode("UTF-8") for s in [u"Hello world!", u"您好"]]
batch_utf8

[b'Hello world!', b'\xe6\x82\xa8\xe5\xa5\xbd']

In [35]:
_batch_chars_ragged = tf.strings.unicode_decode(batch_utf8, input_encoding='UTF-8')
_batch_chars_ragged

<tf.RaggedTensor [[72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33], [24744, 22909]]>

In [36]:
for sent_char in _batch_chars_ragged.to_list():
  print(sent_char)

[72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]
[24744, 22909]


In practice, you can use the `tf.RaggedTensor` directly, convert it to a dense `tf.Tensor` with padding via `tf.RaggedTensor.to_tensor`, or convert it to a `tf.SparseTensor` via `tf.RaggedTensor.to_sparse`.

After the conversion to a dense tensor, every row in the batch has the same length.

In [37]:
batch_chars_ragged = _batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_ragged.numpy())

[[   72   101   108   108   111    32   119   111   114   108   100    33]
 [24744 22909    -1    -1    -1    -1    -1    -1    -1    -1    -1    -1]]
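
The padding step can be pictured with a plain-Python sketch. The helpers `pad_rows` and `strip_padding` are hypothetical names used only for illustration and are not part of TensorFlow. A value of `-1` works as padding because it is never a valid code point.

```python
def pad_rows(rows, pad_value=-1):
    """Pad variable-length rows to the length of the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [row + [pad_value] * (width - len(row)) for row in rows]


def strip_padding(rows, pad_value=-1):
    """Drop the trailing run of padding values from each row."""
    stripped = []
    for row in rows:
        end = len(row)
        while end > 0 and row[end - 1] == pad_value:
            end -= 1
        stripped.append(row[:end])
    return stripped


# Code points of the two example strings, as variable-length rows.
ragged = [[ord(ch) for ch in s] for s in ["Hello world!", "您好"]]
padded = pad_rows(ragged)
print(padded[1])
print(strip_padding(padded) == ragged)
```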


In [0]:
batch_chars_sparse = _batch_chars_ragged.to_sparse()

When encoding multiple strings of the same length, use a `tf.Tensor` as the input. When the strings have different lengths, use a `tf.RaggedTensor`.

In [44]:
tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [ 99, 111, 119]], 
                          output_encoding='UTF-8')

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>

In [46]:
tf.strings.unicode_encode(_batch_chars_ragged, output_encoding='UTF-8')

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'Hello world!', b'\xe6\x82\xa8\xe5\xa5\xbd'], dtype=object)>

If the code points of multiple strings are stored in padded or sparse format, convert them to a `tf.RaggedTensor` before calling `unicode_encode`.

In [50]:
tf.strings.unicode_encode(tf.RaggedTensor.from_sparse(batch_chars_sparse), 
                          output_encoding='UTF-8')

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'Hello world!', b'\xe6\x82\xa8\xe5\xa5\xbd'], dtype=object)>

In [51]:
tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_ragged, padding=-1),
    output_encoding='UTF-8')

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'Hello world!', b'\xe6\x82\xa8\xe5\xa5\xbd'], dtype=object)>

# Unicode Operations

## Character Length

In [52]:
thanks = u'Thanks 😊'.encode("UTF-8")
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes in {} UTF-8 Chars'.format(num_bytes, num_chars))

11 bytes in 8 UTF-8 Chars
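
Keep in mind that a count of characters in this sense is a count of code points, which is not always what a reader perceives as one character. The following plain-Python sketch shows this with an arbitrary example: the letter "é" written either as two code points or as one.

```python
import unicodedata

# "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

for label, s in [("decomposed", decomposed), ("composed", composed)]:
    print("{}: {} bytes, {} code points".format(
        label, len(s.encode("utf-8")), len(s)))
```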


## Character Substrings

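The `tf.strings.substr` API accepts a `unit` parameter that determines how the `pos` and `len` arguments are interpreted. The default unit is `'BYTE'`, so the offsets below count bytes and can cut a multi-byte character apart.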

In [54]:
tf.strings.substr(thanks, pos=7, len=1).numpy(), tf.strings.substr(thanks, pos=1, len=1).numpy()

(b'\xf0', b'h')

With `unit='UTF8_CHAR'`, the offsets count characters, so the complete 4-byte character is returned.

In [56]:
tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy()

b'\xf0\x9f\x98\x8a'
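
The difference between the two units can be reproduced in plain Python. The helper `substr_utf8_chars` is a hypothetical name for this sketch, which decodes the bytes first and is not how TensorFlow implements the op.

```python
def substr_utf8_chars(data, pos, length):
    """Slice UTF-8 bytes by character position instead of byte position."""
    text = data.decode("utf-8")
    return text[pos:pos + length].encode("utf-8")


thanks_bytes = "Thanks 😊".encode("utf-8")

# Byte-based slice: only the first byte of the 4-byte emoji.
byte_slice = thanks_bytes[7:8]
# Character-based slice: the whole emoji.
char_slice = substr_utf8_chars(thanks_bytes, 7, 1)

print(byte_slice, char_slice)
```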

## Split Unicode Strings

In [57]:
tf.strings.unicode_split(thanks, input_encoding='UTF-8').numpy()

array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)

## Byte Offsets for Characters

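To align the character tensor generated by `tf.strings.unicode_decode` with the original string, it is useful to know the byte offset at which each character begins. `tf.strings.unicode_decode_with_offsets` returns these offsets together with the code points.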

In [58]:
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 
                                                             input_encoding="UTF-8")

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print("At byte offset {}: codepoint {}".format(offset, codepoint))

At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
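
The offsets follow directly from the number of bytes each character needs in UTF-8. A minimal plain-Python sketch, where `utf8_offsets` is a hypothetical helper and the mixed-width sample string is an arbitrary choice:

```python
def utf8_offsets(text):
    """Return (code_point, byte_offset) pairs for text as encoded in UTF-8."""
    pairs = []
    offset = 0
    for ch in text:
        pairs.append((ord(ch), offset))
        # Advance by the number of bytes this character occupies in UTF-8.
        offset += len(ch.encode("utf-8"))
    return pairs


# One 1-byte, one 3-byte, and one 4-byte character.
for code_point, offset in utf8_offsets("a語🎈"):
    print("At byte offset {}: codepoint {}".format(offset, code_point))
```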


# Unicode Scripts

TensorFlow provides the `tf.strings.unicode_script` API to determine which script a given code point uses. The script is a useful hint about which language the character might belong to.

The returned `int32` values correspond to the `UScriptCode` values of International Components for Unicode (ICU), which are listed at https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/uscript_8h.html.

In [60]:
used_script = tf.strings.unicode_script([ord(u"語")])
used_script.numpy() # [17] == [USCRIPT_HAN]

array([17], dtype=int32)

In [61]:
tf.strings.unicode_script(batch_chars_ragged)

<tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[25, 25, 25, 25, 25,  0, 25, 25, 25, 25, 25,  0],
       [17, 17, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]], dtype=int32)>
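
To read such output more easily, the numeric codes can be mapped to their ICU names. The table below is a hand-written subset covering only the values that appear in this tutorial; the names `SCRIPT_NAMES` and `describe_scripts` are hypothetical and not part of TensorFlow.

```python
# Subset of ICU UScriptCode values that appear in this tutorial.
SCRIPT_NAMES = {
    -1: "USCRIPT_INVALID_CODE",
    0: "USCRIPT_COMMON",
    17: "USCRIPT_HAN",
    20: "USCRIPT_HIRAGANA",
    25: "USCRIPT_LATIN",
}


def describe_scripts(script_ids):
    """Translate script ids to names, keeping unknown ids visible."""
    return [SCRIPT_NAMES.get(i, "unknown ({})".format(i)) for i in script_ids]


print(describe_scripts([25, 0, 17, -1]))
```

The `-1` entries in the output above come from the padding values, which are not valid code points.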

# Example: Simple Segmentation

Segmentation is the task of splitting text into word-like units. This example approximates word boundaries by the positions where the script changes from one character to the next.

In [0]:
sentence_texts = [u"Hello world", u"世界こんにちは", u"這真是很重要"]

Retrieve the Unicode code point of each character.

In [63]:
sentence_char_codepoints = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoints)

<tf.RaggedTensor [[72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100], [19990, 30028, 12371, 12435, 12395, 12385, 12399], [36889, 30495, 26159, 24456, 37325, 35201]]>


Retrieve the Unicode script of each code point. The script hints at which language the code point may belong to.

In [65]:
sentence_char_scripts = tf.strings.unicode_script(sentence_char_codepoints)
print(sentence_char_scripts)

<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25], [17, 17, 20, 20, 20, 20, 20], [17, 17, 17, 17, 17, 17]]>


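Mark the start of each word unit: the first character of every sentence, and every character whose script differs from that of the previous character.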

In [73]:
sentence_char_starts_word = tf.concat(
  [tf.fill([sentence_char_scripts.nrows(), 1], True), 
  tf.not_equal(sentence_char_scripts[:, 1:], sentence_char_scripts[:, :-1])],
  axis=1)
sentence_char_starts_word

<tf.RaggedTensor [[True, False, False, False, False, True, True, False, False, False, False], [True, False, True, False, False, False, False], [True, False, False, False, False, False]]>

List the start index of each word in the flattened character sequence of all sentences.

In [74]:
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)

tf.Tensor([ 0  5  6 11 13 18], shape=(6,), dtype=int64)
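
The two steps above can be sketched in plain Python to show where these indices come from. The script ids are copied from the `unicode_script` output earlier in this example; the list-based logic is only an analogue of the tensor operations.

```python
# Script ids per sentence, copied from the unicode_script output above.
scripts = [
    [25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25],
    [17, 17, 20, 20, 20, 20, 20],
    [17, 17, 17, 17, 17, 17],
]

# A character starts a word if it is first in its sentence
# or if its script differs from the previous character's script.
starts_word = [
    [i == 0 or row[i] != row[i - 1] for i in range(len(row))]
    for row in scripts
]

# Flatten all sentences and collect the positions of the True flags.
flat_flags = [flag for row in starts_word for flag in row]
word_starts = [i for i, flag in enumerate(flat_flags) if flag]
print(word_starts)
```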


Split the flattened code point vector into words at these start indices.

In [76]:
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoints.values,
    row_starts=word_starts)
print(word_char_codepoint)

<tf.RaggedTensor [[72, 101, 108, 108, 111], [32], [119, 111, 114, 108, 100], [19990, 30028], [12371, 12435, 12395, 12385, 12399], [36889, 30495, 26159, 24456, 37325, 35201]]>
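
Splitting a flat sequence by row starts amounts to slicing between consecutive start indices. A minimal plain-Python sketch, with `split_by_row_starts` as a hypothetical helper name:

```python
def split_by_row_starts(values, row_starts):
    """Slice a flat list into rows that begin at the given indices."""
    row_ends = list(row_starts[1:]) + [len(values)]
    return [values[start:end] for start, end in zip(row_starts, row_ends)]


# Flattened code points of the three example sentences.
values = [ord(ch) for ch in "Hello world" + "世界こんにちは" + "這真是很重要"]
words = split_by_row_starts(values, [0, 5, 6, 11, 13, 18])

# Convert each row back to text for readability.
word_texts = ["".join(chr(cp) for cp in word) for word in words]
print(word_texts)
```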


Get the number of words in each sentence.

In [80]:
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)
sentence_num_words

<tf.Tensor: shape=(3,), dtype=int64, numpy=array([3, 2, 1])>

Group the words back into their sentences.

In [78]:
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)

<tf.RaggedTensor [[[72, 101, 108, 108, 111], [32], [119, 111, 114, 108, 100]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]], [[36889, 30495, 26159, 24456, 37325, 35201]]]>


Encode the code points of each word back into a UTF-8 encoded string.

In [79]:
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()

[[b'Hello', b' ', b'world'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'],
 [b'\xe9\x80\x99\xe7\x9c\x9f\xe6\x98\xaf\xe5\xbe\x88\xe9\x87\x8d\xe8\xa6\x81']]
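
For comparison, the whole pipeline can be approximated in a few lines of plain Python with `itertools.groupby`. The function `rough_script` is a hypothetical stand-in for `tf.strings.unicode_script`: it only distinguishes ASCII letters, the Hiragana block, and the main CJK Unified Ideographs block, which is enough for these sample sentences but far from a complete script classifier.

```python
from itertools import groupby


def rough_script(ch):
    """Very rough script guess; covers only the ranges used in this example."""
    cp = ord(ch)
    if ch.isascii() and ch.isalpha():
        return "Latin"
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    return "Common"


def segment(sentence):
    """Split a sentence wherever the (rough) script changes."""
    return ["".join(group) for _, group in groupby(sentence, key=rough_script)]


for sentence in ["Hello world", "世界こんにちは", "這真是很重要"]:
    print(segment(sentence))
```

Grouping consecutive characters by script is the same idea as the start-flag approach above, expressed on Python strings instead of ragged tensors.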