
# Unicode strings

Models that process natural language often handle different languages with different character sets. *Unicode* is a standard encoding system that is used to represent characters from almost all languages. Each character is encoded using a unique integer code point between 0 and 0x10FFFF. A *Unicode string* is a sequence of zero or more code points.
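As a quick illustration outside TensorFlow, plain Python's built-in `ord` and `chr` convert between characters and code points. This is only a minimal sketch; the sample characters are chosen here for illustration.

```python
import sys

# ord() maps a one-character string to its code point; chr() is the inverse.
samples = ["A", "语", "😊"]
code_points = [ord(ch) for ch in samples]
print(code_points)  # [65, 35821, 128522]

# The largest code point Python accepts matches the Unicode limit 0x10FFFF.
print(sys.maxunicode == 0x10FFFF)  # True

# Rebuilding the string from its code points gives back the original text.
round_trip = "".join(chr(cp) for cp in code_points)
print(round_trip)
```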


In [None]:
import tensorflow as tf

### The tf.string data type

In [None]:
tf.constant(u"Thanks 😊")

<tf.Tensor: shape=(), dtype=string, numpy=b'Thanks \xf0\x9f\x98\x8a'>

In [None]:
tf.constant([u"You're", u"welcome!"]).shape

TensorShape([2])

### Representing Unicode

There are two standard ways to represent a Unicode string in TensorFlow:


*   `string` scalar - where the sequence of code points is encoded using a known character encoding.

*   `int32` vector - where each position contains a single code point.

In [None]:
# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

In [None]:
# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

In [None]:
# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

### Converting between representations

TensorFlow provides operations to convert between these different representations:

*   `tf.strings.unicode_decode`: Converts an encoded string scalar to a vector of code points.
*   `tf.strings.unicode_encode`: Converts a vector of code points to an encoded string scalar.
*   `tf.strings.unicode_transcode`: Converts an encoded string scalar to a different encoding.
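For intuition, the same three conversions can be sketched in plain Python using only `str.encode`, `bytes.decode`, `ord`, and `chr`. This is an analogy for what decode, encode, and transcode mean, not a description of how TensorFlow implements them.

```python
text = "语言处理"
utf8_bytes = text.encode("utf-8")

# Decode: encoded bytes -> list of code points.
code_points = [ord(ch) for ch in utf8_bytes.decode("utf-8")]

# Encode: list of code points -> encoded bytes.
encoded = "".join(chr(cp) for cp in code_points).encode("utf-8")

# Transcode: bytes in one encoding -> bytes in another encoding.
utf16be_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")

print(code_points)                          # [35821, 35328, 22788, 29702]
print(encoded == utf8_bytes)                # True
print(len(utf8_bytes), len(utf16be_bytes))  # 12 8
```

The same four code points take 12 bytes in UTF-8 and 8 bytes in UTF-16-BE, which is why the encoding must be known before a byte string can be interpreted.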




In [None]:
tf.strings.unicode_decode(text_utf8,
                         input_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=int32, numpy=array([35821, 35328, 22788, 29702], dtype=int32)>

In [None]:
tf.strings.unicode_encode(text_chars,
                          output_encoding='UTF-8')

<tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xaf\xad\xe8\xa8\x80\xe5\xa4\x84\xe7\x90\x86'>

In [None]:
tf.strings.unicode_transcode(text_utf8,
                             input_encoding="UTF8",
                             output_encoding="UTF-16-BE")

<tf.Tensor: shape=(), dtype=string, numpy=b'\x8b\xed\x8a\x00Y\x04t\x06'>

### Batch dimensions

When decoding multiple strings, the number of characters in each string may not be equal. The result is a `tf.RaggedTensor`, where the length of the innermost dimension varies depending on the number of characters in each string:

In [None]:
# A batch of Unicode strings, each represented as a UTF-8 encoded string.
batch_utf8 = [s.encode('UTF-8') for s in 
              [u'hÃllo',  u'What is the weather tomorrow',  u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)

[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]


You can use this `tf.RaggedTensor` directly, or convert it to a dense `tf.Tensor` with padding or a `tf.SparseTensor` using the methods `tf.RaggedTensor.to_tensor` and `tf.RaggedTensor.to_sparse`.

In [None]:
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())

[[   104    195    108    108    111     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [    87    104     97    116     32    105    115     32    116    104
     101     32    119    101     97    116    104    101    114     32
     116    111    109    111    114    114    111    119]
 [    71    246    246    100    110    105    103    104    116     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]
 [128522     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1     -1     -1
      -1     -1     -1     -1     -1     -1     -1     -1]]
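The padding step above can be pictured with plain Python lists. The helpers `pad_rows` and `strip_padding` below are hypothetical names introduced only for this sketch; they mimic the idea of padding a ragged batch and recovering it, not TensorFlow's implementation.

```python
def pad_rows(rows, pad_value=-1):
    """Pad every row on the right so that all rows share the longest length."""
    width = max((len(row) for row in rows), default=0)
    return [row + [pad_value] * (width - len(row)) for row in rows]


def strip_padding(rows, pad_value=-1):
    """Drop trailing pad values from each row."""
    stripped = []
    for row in rows:
        end = len(row)
        while end > 0 and row[end - 1] == pad_value:
            end -= 1
        stripped.append(row[:end])
    return stripped


ragged = [[104, 195, 108, 108, 111], [71, 246, 246, 100], [128522]]
padded = pad_rows(ragged)
print(padded)
print(strip_padding(padded) == ragged)  # True
```

Using -1 as the pad value is safe here because no valid code point is negative, so padding can never be confused with real data.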


In [None]:
batch_chars_sparse = batch_chars_ragged.to_sparse()

When encoding multiple strings with the same length, a `tf.Tensor` may be used as input:

In [None]:
tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [ 99, 111, 119]],
                          output_encoding='UTF-8')

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'cat', b'dog', b'cow'], dtype=object)>

When encoding multiple strings with varying lengths, a `tf.RaggedTensor` should be used as input:

In [None]:
tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

If you have a tensor with multiple strings in padded or sparse format, convert it to a `tf.RaggedTensor` before calling `unicode_encode`:

In [None]:
tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding='UTF-8')

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

In [None]:
tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
    output_encoding='UTF-8'
)

<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b'h\xc3\x83llo', b'What is the weather tomorrow',
       b'G\xc3\xb6\xc3\xb6dnight', b'\xf0\x9f\x98\x8a'], dtype=object)>

## Unicode operations

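### Character length

The `tf.strings.length` operation has a `unit` parameter that indicates how lengths should be computed. `unit` defaults to `"BYTE"`, but it can be set to other values, such as `"UTF8_CHAR"`, to count the Unicode code points in each encoded string.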

In [None]:
# Note that the final character takes up 4 bytes in UTF-8.
thanks = u'Thanks 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))

11 bytes; 8 UTF-8 characters
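The gap between the two counts comes from UTF-8 being a variable-width encoding. A small plain-Python check (independent of TensorFlow) shows how many bytes each character contributes:

```python
text = "Thanks 😊"

num_bytes = len(text.encode("utf-8"))  # length in bytes
num_chars = len(text)                  # length in code points

# Byte width of each character when encoded as UTF-8.
widths = [len(ch.encode("utf-8")) for ch in text]

print(num_bytes, num_chars)  # 11 8
print(widths)                # [1, 1, 1, 1, 1, 1, 1, 4]
```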


### Character substrings

Similarly, the `tf.strings.substr` operation accepts the `unit` parameter and uses it to determine what kind of offsets the `pos` and `len` parameters contain.

In [None]:
# Default: unit='BYTE'. With pos=5 and len=3, we return three bytes: b's', b' ',
# and only the first byte of the 4-byte emoji.
tf.strings.substr(thanks, pos=5, len=3).numpy()

b's \xf0'

In [None]:
# Specifying unit='UTF8_CHAR', we return a single character, which in this case 
# is 4 bytes.
print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())

b'\xf0\x9f\x98\x8a'
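The difference between byte offsets and character offsets matters because a byte-based slice can cut a multi-byte character in half. The following plain-Python sketch reproduces both slices and shows that the byte-based one is no longer valid UTF-8:

```python
text = "Thanks 😊"
data = text.encode("utf-8")

# Byte-based slice: 3 bytes starting at byte offset 5.
byte_slice = data[5:8]
print(byte_slice)  # b's \xf0'

# The slice ends inside the emoji, so it cannot be decoded as UTF-8.
try:
    byte_slice.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print(is_valid_utf8)  # False

# Character-based slice: 1 character starting at character offset 7.
char_slice = text[7:8].encode("utf-8")
print(char_slice)  # b'\xf0\x9f\x98\x8a'
```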


### Split Unicode strings


In [None]:
tf.strings.unicode_split(thanks, 'UTF-8').numpy()

array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)

### Byte offsets for characters

To align the character tensor generated by `tf.strings.unicode_decode` with the original string, it's useful to know the offset for where each character begins. The method `tf.strings.unicode_decode_with_offsets` is similar to `unicode_decode`, except that it returns a second tensor containing the start offset of each character.

In [None]:
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print("At byte offset {}: codepoint {}".format(offset, codepoint))

At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
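Each offset is simply the running total of the UTF-8 byte widths of the preceding characters. A minimal plain-Python sketch of that idea follows; `decode_with_offsets` is a hypothetical helper written for this illustration, not the TensorFlow operation.

```python
def decode_with_offsets(text):
    """Return (code_points, byte_offsets) for the UTF-8 encoding of text."""
    code_points, offsets = [], []
    offset = 0
    for ch in text:
        code_points.append(ord(ch))
        offsets.append(offset)
        offset += len(ch.encode("utf-8"))  # advance by this character's width
    return code_points, offsets


cps, offs = decode_with_offsets("🎈🎉🎊")
print(cps)   # [127880, 127881, 127882]
print(offs)  # [0, 4, 8]

# With mixed widths the offsets are no longer evenly spaced.
mixed_cps, mixed_offs = decode_with_offsets("a语😊")
print(mixed_offs)  # [0, 1, 4]
```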


## Unicode scripts

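Each Unicode code point belongs to a single collection of code points known as a [script](https://en.wikipedia.org/wiki/Script_%28Unicode%29). The `tf.strings.unicode_script` operation returns the script of each code point as an `int32` value corresponding to an ICU `UScriptCode` value.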

In [None]:
uscript = tf.strings.unicode_script([33464, 1041])  # ['芸', 'Б']

print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

[17  8]


In [None]:
print(tf.strings.unicode_script(batch_chars_ragged))

<tf.RaggedTensor [[25, 25, 25, 25, 25], [25, 25, 25, 25, 0, 25, 25, 0, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 0, 25, 25, 25, 25, 25, 25, 25, 25], [25, 25, 25, 25, 25, 25, 25, 25, 25], [0]]>
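The raw integers are easier to read with names attached. The lookup table below is a hand-written sketch covering only a handful of codes; the values are assumed to match ICU's `UScriptCode` enum, and `name_scripts` is a hypothetical helper introduced for this illustration.

```python
# A partial, hand-written mapping from script codes to names.
# Assumption: these values follow ICU's UScriptCode enum.
SCRIPT_NAMES = {
    0: "COMMON",
    8: "CYRILLIC",
    17: "HAN",
    20: "HIRAGANA",
    25: "LATIN",
}


def name_scripts(codes):
    """Translate a list of script codes into readable names."""
    return [SCRIPT_NAMES.get(code, "UNKNOWN({})".format(code)) for code in codes]


print(name_scripts([17, 8]))               # ['HAN', 'CYRILLIC']
print(name_scripts([25, 25, 25, 25, 25]))  # all 'LATIN'
print(name_scripts([0]))                   # ['COMMON'], e.g. for an emoji
```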


## Example: Simple segmentation

Segmentation is the task of splitting text into word-like units. This is often easy when space characters are used to separate words, but some languages (like Chinese and Japanese) do not use spaces, and some languages (like German) contain long compounds that must be split in order to analyse their meaning. In web text, different languages and scripts are frequently mixed together, as in "NY株価" (New York Stock Exchange).

We can perform very rough segmentation (without implementing any ML models) by using changes in script to approximate word boundaries. This will work for strings like the "NY株価" example above. It will also work for most languages that use spaces, as the space characters of various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.

In [None]:
# dtype: string; shape: [num_sentences]
# The sentences to process. Edit this line to try out different inputs!

sentence_texts = [u'Hello, world.', u'世界こんにちは']

Decode the sentences into character codepoints, and find the script identifier for each character.

In [None]:
# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]

# sentence_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th sentence.

sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]

# sentence_char_script[i, j] is the Unicode script of the j'th character in the
# i'th sentence.

sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)


<tf.RaggedTensor [[72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46], [19990, 30028, 12371, 12435, 12395, 12385, 12399]]>
<tf.RaggedTensor [[25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0], [17, 17, 20, 20, 20, 20, 20]]>


Next, we use those script identifiers to determine where word boundaries should be added. We add a word boundary at the beginning of each sentence, and for each character whose script differs from the previous character:

In [None]:
# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]

# sentence_char_starts_word[i, j] is True if the j'th character in the i'th 
# sentence is the start of a word.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
     axis=1)

# dtype: int64; shape: [num_words]

# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).

word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)

tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
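The boundary rule can be checked by hand with a plain-Python loop over the script codes printed earlier. This is a sketch of the logic only, with a hypothetical helper name `find_word_starts`; the TensorFlow version above does the same thing with vectorized ragged operations.

```python
# Script codes per character, copied from the output of unicode_script above.
sentence_char_script = [
    [25, 25, 25, 25, 25, 0, 0, 25, 25, 25, 25, 25, 0],
    [17, 17, 20, 20, 20, 20, 20],
]


def find_word_starts(script_rows):
    """Return flattened indices of characters that start a new word."""
    starts = []
    base = 0  # index of the current sentence's first character in the flat list
    for row in script_rows:
        for j, script in enumerate(row):
            # A word starts at the beginning of a sentence or at a script change.
            if j == 0 or script != row[j - 1]:
                starts.append(base + j)
        base += len(row)
    return starts


word_starts = find_word_starts(sentence_char_script)
print(word_starts)  # [0, 5, 7, 12, 13, 15]
```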


We can then use those start offsets to build a `RaggedTensor` containing the list of words from all sentences in the batch:

In [None]:
# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)

<tf.RaggedTensor [[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46], [19990, 30028], [12371, 12435, 12395, 12385, 12399]]>
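Building rows from start offsets amounts to slicing the flat list of values between consecutive starts. The plain-Python sketch below mirrors that idea; `rows_from_starts` is a hypothetical helper, and the input values are copied from the outputs above.

```python
# Flattened code points of both sentences, copied from the decoded output above.
values = [72, 101, 108, 108, 111, 44, 32, 119, 111, 114, 108, 100, 46,
          19990, 30028, 12371, 12435, 12395, 12385, 12399]
row_starts = [0, 5, 7, 12, 13, 15]


def rows_from_starts(values, row_starts):
    """Split values into rows; row i spans row_starts[i] up to the next start."""
    row_ends = row_starts[1:] + [len(values)]
    return [values[start:end] for start, end in zip(row_starts, row_ends)]


words = rows_from_starts(values, row_starts)
words_as_text = ["".join(chr(cp) for cp in word) for word in words]
print(words_as_text)  # ['Hello', ', ', 'world', '.', '世界', 'こんにちは']
```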


In [None]:
# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)

# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)


<tf.RaggedTensor [[[72, 101, 108, 108, 111], [44, 32], [119, 111, 114, 108, 100], [46]], [[19990, 30028], [12371, 12435, 12395, 12385, 12399]]]>


In [None]:
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()

[[b'Hello', b', ', b'world', b'.'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
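The escaped byte strings in the last output are hard to read. Decoding them as UTF-8 in plain Python shows the segmented words directly (the input below is copied from the output above):

```python
segmented = [
    [b'Hello', b', ', b'world', b'.'],
    [b'\xe4\xb8\x96\xe7\x95\x8c',
     b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf'],
]

# Decode every word from UTF-8 bytes into a Python string.
readable = [[word.decode("utf-8") for word in sentence] for sentence in segmented]
print(readable)  # [['Hello', ', ', 'world', '.'], ['世界', 'こんにちは']]
```

The Japanese sentence is split at the point where the script changes from Han to Hiragana, which is the rough word boundary this approach is designed to find.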