**Unicode** is a standard encoding system that is used to represent characters from almost all languages. Each character is encoded using a unique integer code point between `0` and `0x10FFFF`. A **Unicode string** is a sequence of zero or more code points.
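
As a quick illustration in plain Python (no TensorFlow needed), the built-in `ord` and `chr` functions convert between a character and its code point. This is only a minimal sketch; the sample characters are arbitrary choices for illustration.

```python
# Map characters to their Unicode code points and back using Python built-ins.
samples = ["A", "语", "\U0001F60A"]  # "\U0001F60A" is the 😊 emoji

code_points = [ord(ch) for ch in samples]
print(code_points)

# Every code point lies in the range 0..0x10FFFF.
assert all(0 <= cp <= 0x10FFFF for cp in code_points)

# chr() is the inverse of ord().
round_trip = [chr(cp) for cp in code_points]
print(round_trip == samples)
```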

In [1]:
import tensorflow as tf

## The `tf.string` data type

The basic TensorFlow `tf.string` `dtype` allows you to build tensors of byte strings. **Unicode** strings are **utf-8** encoded by default.

In [2]:
# unicode - python 2
tf.constant(u"Thanks 😊")

<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

In [3]:
# unicode (default) - python3
tf.constant("Thanks 😊")

<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

In [4]:
# unicode - python 2
tf.constant([u"You're", u"welcome!"]).shape

TensorShape([2])

In [5]:
# unicode (default) - python3
tf.constant(["You're", "welcome!"]).shape

TensorShape([2])

<div class="alert alert-info">
    <b>Note</b>: When using Python to construct strings, the handling of Unicode differs between v2 and v3. In v2, Unicode strings are indicated by the "u" prefix, as above. In v3, strings are Unicode-encoded by default.
</div>

## Representing Unicode
There are two standard ways to represent a Unicode string in TensorFlow:

- `string` scalar — where the sequence of code points is encoded using a known character encoding.
<br><br>
- `int32` vector — where each position contains a single code point.

In [6]:
# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant("语言处理")
text_utf8

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

In [7]:
# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant("语言处理".encode("UTF-16-BE"))
text_utf16be

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

In [8]:
# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in "语言处理"])
text_chars

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

### Converting between representations

<img src="../images/unicode.png" />


TensorFlow provides operations to convert between these different representations:

- `tf.strings.unicode_decode`: Converts an encoded **string** scalar **to** a vector of code **points**.
<br><br>
- `tf.strings.unicode_encode`: Converts a vector of code **points to** an encoded **string** scalar.
<br><br>
- `tf.strings.unicode_transcode`: Converts an encoded **string** scalar **to** a **different encoding**.

In [9]:
tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

In [10]:
tf.strings.unicode_encode(text_chars, output_encoding='UTF-8')

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

In [11]:
tf.strings.unicode_transcode(text_utf8, input_encoding='UTF-8', output_encoding='UTF-16-BE')

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>
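
For comparison, the same three conversions can be sketched in plain Python with `str.encode`, `bytes.decode`, `ord`, and `chr`. This is not how TensorFlow implements them; it is only meant to show what each operation computes for a single string.

```python
text = "语言处理"
utf8_bytes = text.encode("utf-8")

# Decode: encoded bytes -> list of code points (what unicode_decode returns).
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# Encode: list of code points -> encoded bytes (what unicode_encode returns).
encoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# Transcode: bytes in one encoding -> bytes in another (unicode_transcode).
utf16be_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")

print(code_points)
print(encoded == utf8_bytes)
print(utf16be_bytes)
```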

### Batch dimensions

When decoding multiple strings, **the number of characters in each string may not be equal**. The returned result is a `tf.RaggedTensor`, where the length of the innermost dimension varies depending on the number of characters in each string:

In [12]:
# A batch of Unicode strings, each represented as a UTF8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in ['hÃllo', 'What is the weather tomorrow', 'Göödnight', '😊']]

batch_utf8

[b'h\xc3\x83llo',
 b'What is the weather tomorrow',
 b'G\xc3\xb6\xc3\xb6dnight',
 b'\xf0\x9f\x98\x8a']

In [13]:
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8, input_encoding='UTF-8')
batch_chars_ragged

<tf.RaggedTensor [[104, 195, 108, 108, 111], [87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119], [71, 246, 246, 100, 110, 105, 103, 104, 116], [128522]]>

In [14]:
for sentence_chars in batch_chars_ragged.to_list():
    print(sentence_chars)

[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]


You can use this `tf.RaggedTensor` directly, or convert it to:
- a dense `tf.Tensor` with **padding** using `tf.RaggedTensor.to_tensor` 
<br><br>
- a `tf.SparseTensor` using `tf.RaggedTensor.to_sparse`

In [15]:
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())

[[   104    195    108    108    111     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [    87    104     97    116     32    105    115     32    116    104
     101     32    119    101     97    116    104    101    114     32
     116    111    109    111    114    114    111    119]
 [    71    246    246    100    110    105    103    104    116     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [128522     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]]


In [16]:
batch_chars_sparse = batch_chars_ragged.to_sparse()
print(batch_chars_sparse)

SparseTensor(indices=tf.Tensor(
[[ 0  0]
 [ 0  1]
 [ 0  2]
 [ 0  3]
 [ 0  4]
 [ 1  0]
 [ 1  1]
 [ 1  2]
 [ 1  3]
 [ 1  4]
 [ 1  5]
 [ 1  6]
 [ 1  7]
 [ 1  8]
 [ 1  9]
 [ 1 10]
 [ 1 11]
 [ 1 12]
 [ 1 13]
 [ 1 14]
 [ 1 15]
 [ 1 16]
 [ 1 17]
 [ 1 18]
 [ 1 19]
 [ 1 20]
 [ 1 21]
 [ 1 22]
 [ 1 23]
 [ 1 24]
 [ 1 25]
 [ 1 26]
 [ 1 27]
 [ 2  0]
 [ 2  1]
 [ 2  2]
 [ 2  3]
 [ 2  4]
 [ 2  5]
 [ 2  6]
 [ 2  7]
 [ 2  8]
 [ 3  0]], shape=(43, 2), dtype=int64), values=tf.Tensor(
[   104    195    108    108    111     87    104     97    116     32
    105    115     32    116    104    101     32    119    101     97
    116    104    101    114     32    116    111    109    111    114
    114    111    119     71    246    246    100    110    105    103
    104    116 128522], shape=(43,), dtype=int32), dense_shape=tf.Tensor([ 4 28], shape=(2,), dtype=int64))


In [17]:
# When encoding multiple strings of the same length, a tf.Tensor may be used as input
tf.strings.unicode_encode([
    [99, 97, 116], 
    [100, 111, 103], 
    [ 99, 111, 119]
], output_encoding='UTF-8')

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>

In [18]:
# When encoding multiple strings of varying lengths, a tf.RaggedTensor should be used as input
tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

If you have a tensor with multiple **strings in padded or sparse format**, then **convert** it to a `tf.RaggedTensor` **before** calling `unicode_encode`:

In [19]:
tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding='UTF-8'
)

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

In [20]:
tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
    output_encoding='UTF-8'
)

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>
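
To see why the padding value matters in the round trip above, here is a plain-Python sketch of padding ragged rows and then stripping the padding again. The helper names `pad_rows` and `strip_trailing_padding` are hypothetical, and the sketch assumes that only the trailing run of padding values in each row is removed.

```python
PAD = -1  # must not collide with any real code point

def pad_rows(rows, pad_value):
    """Pad every row on the right so that all rows share the longest length."""
    width = max(len(row) for row in rows)
    return [row + [pad_value] * (width - len(row)) for row in rows]

def strip_trailing_padding(rows, pad_value):
    """Drop the trailing run of pad_value from each row."""
    stripped = []
    for row in rows:
        end = len(row)
        while end > 0 and row[end - 1] == pad_value:
            end -= 1
        stripped.append(row[:end])
    return stripped

ragged = [[104, 195, 108, 108, 111], [71, 246, 246, 100], [128522]]
padded = pad_rows(ragged, PAD)
restored = strip_trailing_padding(padded, PAD)
print(padded)
print(restored == ragged)
```

Using `-1` as the padding value is safe here because valid code points are never negative.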

## Unicode operations

### Character length

In [21]:
thanks = 'Thanks 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
# unit: indicates how lengths should be computed.
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print(f"{num_bytes} bytes; {num_chars} UTF-8 characters")

11 bytes; 8 UTF-8 characters


### Character substrings

In [22]:
tf.strings.substr(thanks, pos=7, len=1).numpy()

b'\xf0'

In [23]:
tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy()

b'\xf0\x9f\x98\x8a'
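
The difference between byte units and character units can also be reproduced in plain Python, where slicing a `bytes` object counts bytes and slicing a `str` counts code points. This is a sketch for comparison only, not TensorFlow's implementation.

```python
thanks = "Thanks \U0001F60A"  # "\U0001F60A" is the 😊 emoji
thanks_bytes = thanks.encode("utf-8")

# Lengths: bytes vs. code points.
num_bytes = len(thanks_bytes)
num_chars = len(thanks)

# Slicing bytes at position 7 cuts the 4-byte emoji and keeps only its first byte.
byte_slice = thanks_bytes[7:8]

# Slicing the str at position 7 keeps the whole character.
char_slice = thanks[7:8].encode("utf-8")

print(num_bytes, num_chars)
print(byte_slice, char_slice)
```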

### Split Unicode strings

In [24]:
tf.strings.unicode_split(thanks, 'UTF-8').numpy()

array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)

### Byte offsets for characters

In [26]:
codepoints, offsets = tf.strings.unicode_decode_with_offsets("🎈🎉🎊", 'UTF-8')

for codepoint, offset in zip(codepoints.numpy(), offsets.numpy()):
    print(f"At byte offset: {offset}: codepoint {codepoint}")

At byte offset: 0: codepoint 127880
At byte offset: 4: codepoint 127881
At byte offset: 8: codepoint 127882
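
The offsets follow from the UTF-8 width of each preceding character. A minimal plain-Python sketch (the helper name `utf8_offsets` is hypothetical) accumulates those widths:

```python
def utf8_offsets(text):
    """Return (byte_offset, code_point) pairs for each character in text."""
    pairs = []
    position = 0
    for ch in text:
        pairs.append((position, ord(ch)))
        # Advance by the number of bytes this character occupies in UTF-8.
        position += len(ch.encode("utf-8"))
    return pairs

balloons = "\U0001F388\U0001F389\U0001F38A"  # the 🎈🎉🎊 emoji
for offset, codepoint in utf8_offsets(balloons):
    print(f"At byte offset: {offset}: codepoint {codepoint}")
```

Each of these emoji takes four bytes in UTF-8, which is why the offsets advance in steps of four.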


## Unicode scripts

Each Unicode code point belongs to a **single** collection of codepoints known as a **script**. A character's script is helpful in **determining which language the character might be in**.

TensorFlow provides the `tf.strings.unicode_script` operation to determine which script a given codepoint uses. The script codes are `int32` values corresponding to **International Components for Unicode (ICU) UScriptCode** values.

In [27]:
uscript = tf.strings.unicode_script([33464, 1041]) # ['芸', 'Б']
print(uscript.numpy()) # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

[17  8]


In [28]:
# The tf.strings.unicode_script operation can also be applied to multidimensional tf.Tensors 
# or tf.RaggedTensors of codepoints
tf.strings.unicode_script(batch_chars_ragged)

<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>
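
Raw script codes are hard to read, so it can help to map them to names. The sketch below uses a small hand-written table covering only the codes that appear in this notebook. The values are taken from ICU's `UScriptCode` enum as we understand it, and the `SCRIPT_NAMES` table and `script_names` helper are hypothetical, not part of TensorFlow.

```python
# Partial table: only the script codes that occur in this notebook.
SCRIPT_NAMES = {
    0: "USCRIPT_COMMON",
    8: "USCRIPT_CYRILLIC",
    17: "USCRIPT_HAN",
    20: "USCRIPT_HIRAGANA",
    25: "USCRIPT_LATIN",
}

def script_names(codes):
    """Translate a list of script codes into readable names."""
    return [SCRIPT_NAMES.get(code, f"UNKNOWN({code})") for code in codes]

print(script_names([17, 8]))
print(script_names([25, 0, 99]))
```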

## Example: Simple segmentation

Segmentation is the task of **splitting text into word**-like units.

We can perform very rough segmentation (without implementing any ML models) by using changes in script to approximate word boundaries. This approach also works for most languages that use spaces, because the space characters of various scripts are all classified as `USCRIPT_COMMON`, a special script code that differs from that of any actual text.

In [39]:
# dtype: string; shape: [num_sentences]
#
# The sentences to process.  Edit this line to try out different inputs!
sentence_texts = ['Hello, world.', '世界こんにちは']

First, we decode the sentences into character **codepoints**, and find the **script** identifier for each character.

In [40]:
# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_codepoint[i, j] is the codepoint for the j'th character in
# the i'th sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
sentence_char_codepoint

<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>

In [41]:
# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_script[i, j] is the unicode script of the j'th character in
# the i'th sentence.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
sentence_char_script

<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>

In [42]:
sentence_char_script.nrows()

<tf.Tensor: shape=(), dtype=int64, numpy=2>

In [43]:
tf.fill([sentence_char_script.nrows(), 1], True)

<tf.Tensor: shape=(2, 1), dtype=bool, numpy=
array([[ True],
       [ True]])>

Next, we use those **script** identifiers to **determine where word boundaries should be added**. We add a word boundary at the *beginning of each sentence*, and at each character whose script differs from that of the previous character.

In [44]:
# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_starts_word[i, j] is True if the j'th character in the i'th
# sentence is the start of a word.
sentence_char_starts_word = tf.concat(
    [
        tf.fill([sentence_char_script.nrows(), 1], True),
        tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])
    ], axis=1
)
sentence_char_starts_word

<tf.RaggedTensor [[True, False, False, False, False, True, False, True, False, False, False, False, True], [True, False, True, False, False, False, False]]>
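
The boundary rule itself is simple enough to restate in plain Python for a single sentence. This sketch (the helper name `starts_word` is hypothetical) applies the same rule to a list of script codes: the first character always starts a word, and so does any character whose script differs from its predecessor's.

```python
def starts_word(scripts):
    """Flag each position that begins a new word under the script-change rule."""
    return [i == 0 or scripts[i] != scripts[i - 1] for i in range(len(scripts))]

# Script codes for 'Hello, world.' and '世界こんにちは', copied from the output above.
latin_sentence = [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0]
cjk_sentence = [17, 17, 20, 20, 20, 20, 20]

print(starts_word(latin_sentence))
print(starts_word(cjk_sentence))
```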

In [45]:
# dtype: int64; shape: [num_words]
#
# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).
word_starts = tf.squeeze(
    tf.where(sentence_char_starts_word.values),
    axis=1
)
word_starts

<tf.Tensor: shape=(6,), dtype=int64, numpy=array([ 0,  5,  7, 12, 13, 15])>

We can then use those start offsets to build a `RaggedTensor` containing the list of words from all sentences:

In [46]:
# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts
)
word_char_codepoint

<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
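
To make the row-starts idea concrete, here is a plain-Python sketch that cuts a flat list of code points into rows at the given start offsets. The helper `rows_from_starts` is hypothetical and only mimics what the call above computes for this input.

```python
def rows_from_starts(values, row_starts):
    """Split values into rows, where row i begins at row_starts[i]."""
    # Each row ends where the next one starts; the last row runs to the end.
    row_ends = list(row_starts[1:]) + [len(values)]
    return [values[start:end] for start, end in zip(row_starts, row_ends)]

# Flattened code points of both sentences, and the word starts found above.
flat_codepoints = [ord(ch) for ch in "Hello, world." + "世界こんにちは"]
word_starts = [0, 5, 7, 12, 13, 15]

words = rows_from_starts(flat_codepoints, word_starts)
decoded_words = ["".join(chr(cp) for cp in word) for word in words]
print(decoded_words)
```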

And finally, we can segment the word codepoints `RaggedTensor` back into sentences:

In [47]:
# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1
)
sentence_num_words

<tf.Tensor: shape=(2,), dtype=int64, numpy=array([4, 2])>

In [48]:
# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words
)
sentence_word_char_codepoint

<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>

To make the final result easier to read, we can encode it back into UTF-8 strings:

In [49]:
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()

[[b'Hello', b', ', b'world', b'.'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]

## References
- https://www.tensorflow.org/tutorials/load_data/unicode#character_length