# Text Preprocessing

For any NLP task in deep learning, the first step is to preprocess the text data into numbers.

In recent years almost all DL packages have started to provide their own APIs for text preprocessing. However, each one has its own subtle differences which, if not understood correctly, lead to improper data preparation and thus skew model training.

When I resumed my hobby in DL with Transformers + TensorFlow 2.0, I came across different APIs doing the same text tokenization as part of the TensorFlow ecosystem tutorials.

We have come a long way from the days of writing our own tokenizers and encoders/decoders; the APIs we have now simplify the work a lot. However, care should be taken while using such APIs. Questions to ask include:
- How do you want the text to be split?
- How does the tokenizer handle punctuation and special characters?
- How are out-of-vocabulary (OOV) words handled?
- Do you want to use [WordPiece tokenization](https://stackoverflow.com/questions/55382596/how-is-wordpiece-tokenization-helpful-to-effectively-deal-with-rare-words-proble/55416944#55416944)?
- Does the tokenizer/encoder support character-level encoding?
- How is the vocabulary length calculated? Does it include the PAD and OOV tokens?
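
To make the first two questions concrete, here is a minimal pure-Python sketch (no TensorFlow involved) that contrasts two splitting strategies. The helper names `split_on_whitespace` and `split_on_punctuation` are hypothetical and only illustrate the idea; real tokenizers differ in their details.

```python
import re

def split_on_whitespace(text):
    """Split on whitespace only, so punctuation stays attached to its word."""
    return text.split()

def split_on_punctuation(text):
    """Emit runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "4. Kurt Betschart - Bruno Risi"
whitespace_tokens = split_on_whitespace(sample)
punctuation_tokens = split_on_punctuation(sample)

print(whitespace_tokens)   # ['4.', 'Kurt', 'Betschart', '-', 'Bruno', 'Risi']
print(punctuation_tokens)  # ['4', '.', 'Kurt', 'Betschart', '-', 'Bruno', 'Risi']
```

The same sentence yields six tokens with one strategy and seven with the other, which matters as soon as each token has to be paired with a label.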

Choosing the right API for the task from the many options out there is not easy, as each API is built for a specific purpose and to fit with its counterparts. Some work natively with Tensors, some with TensorFlow Datasets, some at the character level, etc.

This blog is a quick skim-through reference for word- and character-level encoding in TensorFlow.

In [1]:
from string import punctuation

In [3]:
import tensorflow as tf
import tensorflow_text
import tensorflow_datasets as tfds

The data is a sample from [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/).

In [4]:
text_data = ["4. Kurt Betschart - Bruno Risi ( Switzerland ) 22",
            "Israel approves Arafat 's flight to West Bank .",
            "Moreau takes bronze medal as faster losing semifinalist .",
            "W D L G / F G / A P",
            "-- Helsinki newsroom +358 - 0 - 680 50 248",
            "M'bishi Gas sets terms on 7-year straight ."]
ner_data = ["O B-PER I-PER O B-PER I-PER O B-LOC O O",
            "B-LOC O B-PER O O O B-LOC I-LOC O",
            "B-PER O O O O O O O O",
            "O O O O O O O O O O",
            "O B-LOC O O O O O O O O",
            "B-ORG I-ORG O O O O O O"]

In [5]:
start_word, end_word, unknown_word = "<START>", "<END>", "<UNK>"

Three sets of APIs are explored:
- TensorFlow Datasets APIs
- TensorFlow Keras Text Preprocessing
- TensorFlow Text

For my current task the Keras APIs met my requirements, i.e. word and character tokenizing, encoding and decoding.

Note: Decoding will be updated if I get time. :)

## 1. Tensorflow Dataset API
Like many, I started with Transformers from this tutorial, which uses the TensorFlow Datasets APIs:
https://www.tensorflow.org/tutorials/text/transformer

- The API is clean and easy to use.
- https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/Tokenizer
- https://www.tensorflow.org/datasets/api_docs/python/tfds/features/text/TextEncoder
- Here we need the Tokenizer and the Encoder separately.

Cons:
- For the task of preparing text for NER we have to keep all special characters, which are ignored by default.
- Even if we add `punctuation` as reserved tokens, the tokenizer still splits the text on the special characters, so the tokens no longer line up with the tags.


In [6]:
text_tokenizer = tfds.features.text.Tokenizer(reserved_tokens=[start_word, end_word] + list(punctuation))
tags_tokenizer = tfds.features.text.Tokenizer(reserved_tokens=['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC',
                                                               'I-MISC', 'I-ORG', 'I-PER', 'O',
                                                               start_word, end_word])
text_vocabulary_set = set()
tags_vocabulary_set = set()

In [7]:
for text, ner in zip(text_data, ner_data):
    text_tokens = text_tokenizer.tokenize(text)
    tag_tokens = tags_tokenizer.tokenize(ner)
    
    text_vocabulary_set.update(text_tokens)
    tags_vocabulary_set.update(tag_tokens)
    
text_vocabulary_set.update([start_word, end_word])
tags_vocabulary_set.update([start_word, end_word])

In [8]:
text_encoder = tfds.features.text.TokenTextEncoder(text_vocabulary_set, oov_token=unknown_word, tokenizer=text_tokenizer)
tags_encoder = tfds.features.text.TokenTextEncoder(tags_vocabulary_set, oov_token=unknown_word, tokenizer=tags_tokenizer)

In [9]:
for token, id in text_encoder._token_to_id.items():
    print(token,"--->", id+1) # By default 0 is used as the PAD index

( ---> 1
F ---> 2
<START> ---> 3
. ---> 4
A ---> 5
Betschart ---> 6
22 ---> 7
- ---> 8
248 ---> 9
D ---> 10
Risi ---> 11
Bank ---> 12
West ---> 13
Gas ---> 14
bronze ---> 15
newsroom ---> 16
on ---> 17
medal ---> 18
straight ---> 19
W ---> 20
s ---> 21
P ---> 22
approves ---> 23
Arafat ---> 24
Israel ---> 25
flight ---> 26
/ ---> 27
0 ---> 28
Switzerland ---> 29
) ---> 30
Helsinki ---> 31
to ---> 32
takes ---> 33
sets ---> 34
losing ---> 35
year ---> 36
Bruno ---> 37
<END> ---> 38
4 ---> 39
' ---> 40
G ---> 41
semifinalist ---> 42
L ---> 43
M ---> 44
Kurt ---> 45
680 ---> 46
50 ---> 47
faster ---> 48
+ ---> 49
358 ---> 50
bishi ---> 51
7 ---> 52
terms ---> 53
Moreau ---> 54
as ---> 55


In [10]:
for token, id in tags_encoder._token_to_id.items():
    print(token, "--->", id+1)

B-ORG ---> 1
O ---> 2
I-PER ---> 3
<START> ---> 4
I-LOC ---> 5
<END> ---> 6
B-LOC ---> 7
B-PER ---> 8
I-ORG ---> 9


In [11]:
tags_encoder.vocab_size # i.e above tags + PAD + UNK

11
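
The count of 11 can be reproduced by hand: 9 tags, plus one PAD slot, plus one OOV slot. The sketch below shows that arithmetic in plain Python. The `build_vocab` helper and the id layout it uses (PAD at 0, OOV placed last) are assumptions made for illustration, not a description of how `TokenTextEncoder` stores its vocabulary internally.

```python
PAD_ID = 0  # assumption: id 0 is reserved for padding

def build_vocab(tokens, oov_token="<UNK>"):
    """Map tokens to ids starting at 1 and give the OOV token the last id."""
    token_to_id = {tok: i for i, tok in enumerate(sorted(set(tokens)), start=1)}
    token_to_id[oov_token] = len(token_to_id) + 1
    return token_to_id

tags = ["B-ORG", "O", "I-PER", "<START>", "I-LOC", "<END>", "B-LOC", "B-PER", "I-ORG"]
tag_to_id = build_vocab(tags)

# 9 tags + 1 OOV entry in the mapping, + 1 for the PAD id that has no token.
vocab_size = len(tag_to_id) + 1
print(vocab_size)  # 11
```

Whatever library you use, it is worth checking whether its reported vocabulary size already includes these two extra slots before sizing an embedding layer.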

In [12]:
text_data[0]

'4. Kurt Betschart - Bruno Risi ( Switzerland ) 22'

In [13]:
ner_data[0]

'O B-PER I-PER O B-PER I-PER O B-LOC O O'

In [14]:
res = text_encoder.encode(text_data[0])
res

[39, 4, 45, 6, 8, 37, 11, 1, 29, 30, 7]

In [15]:
for text_token, tag_token, id in zip(text_tokenizer.tokenize(text_data[0]), ner_data[0].split(" "), res):
    print(text_token, tag_token, id)

4 O 39
. B-PER 4
Kurt I-PER 45
Betschart O 6
- B-PER 8
Bruno I-PER 37
Risi O 11
( B-LOC 1
Switzerland O 29
) O 30


**As you can see, "4." is split into "4" and ".", so the tokens no longer align with the tags.**
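
A cheap guard against this kind of drift is to compare the token count with the tag count before encoding anything. The sketch below is pure Python; `is_aligned` is a hypothetical helper, and the regular expression only approximates a punctuation-splitting tokenizer.

```python
import re

def is_aligned(tokens, tags):
    """Return True when every token has exactly one tag."""
    return len(tokens) == len(tags)

sentence = "4. Kurt Betschart - Bruno Risi ( Switzerland ) 22"
tags = "O B-PER I-PER O B-PER I-PER O B-LOC O O".split(" ")

# Splitting on spaces keeps "4." as one token and matches the 10 tags.
whitespace_tokens = sentence.split(" ")

# Splitting punctuation off produces an extra "." token and shifts every later tag.
punctuation_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(len(whitespace_tokens), is_aligned(whitespace_tokens, tags))    # 10 True
print(len(punctuation_tokens), is_aligned(punctuation_tokens, tags))  # 11 False
```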

# Keras API
- If you were lucky and patient enough to read this tutorial (https://www.tensorflow.org/tutorials/text/nmt_with_attention), or if you simply love Keras, then your text preprocessing requirements are met.

- https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer?version=stable

## Word Encoding

In [16]:
def keras_tokenize(text_corpus, char_level=False, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filters, oov_token="<UNK>", char_level=char_level, lower=False)
    lang_tokenizer.fit_on_texts(text_corpus)
    return lang_tokenizer

In [17]:
text_word_tokenizer = keras_tokenize(text_data, filters='')

Let's print the word index for our data:

In [18]:
text_word_tokenizer.index_word

{1: '<UNK>',
 2: '-',
 3: '.',
 4: 'G',
 5: '/',
 6: '4.',
 7: 'Kurt',
 8: 'Betschart',
 9: 'Bruno',
 10: 'Risi',
 11: '(',
 12: 'Switzerland',
 13: ')',
 14: '22',
 15: 'Israel',
 16: 'approves',
 17: 'Arafat',
 18: "'s",
 19: 'flight',
 20: 'to',
 21: 'West',
 22: 'Bank',
 23: 'Moreau',
 24: 'takes',
 25: 'bronze',
 26: 'medal',
 27: 'as',
 28: 'faster',
 29: 'losing',
 30: 'semifinalist',
 31: 'W',
 32: 'D',
 33: 'L',
 34: 'F',
 35: 'A',
 36: 'P',
 37: '--',
 38: 'Helsinki',
 39: 'newsroom',
 40: '+358',
 41: '0',
 42: '680',
 43: '50',
 44: '248',
 45: "M'bishi",
 46: 'Gas',
 47: 'sets',
 48: 'terms',
 49: 'on',
 50: '7-year',
 51: 'straight'}

So, if you want to convert your text data into integers:

In [19]:
res = text_word_tokenizer.texts_to_sequences(text_data)
# Easy to use padding API from Keras
res = tf.keras.preprocessing.sequence.pad_sequences(res, padding='post')
res

array([[ 6,  7,  8,  2,  9, 10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19, 20, 21, 22,  3,  0],
       [23, 24, 25, 26, 27, 28, 29, 30,  3,  0],
       [31, 32, 33,  4,  5, 34,  4,  5, 35, 36],
       [37, 38, 39, 40,  2, 41,  2, 42, 43, 44],
       [45, 46, 47, 48, 49, 50, 51,  3,  0,  0]], dtype=int32)
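
Going the other way (decoding) is essentially a reverse lookup that skips the padding zeros. The sketch below is plain Python and uses a hand-copied subset of the `index_word` mapping printed above; `decode` is a hypothetical helper. The Keras tokenizer also offers a `sequences_to_texts` method, and it is worth checking how it treats padding zeros before relying on it.

```python
# Subset of the index_word mapping shown above, copied by hand for this example.
index_word = {
    3: ".", 15: "Israel", 16: "approves", 17: "Arafat", 18: "'s",
    19: "flight", 20: "to", 21: "West", 22: "Bank",
}

def decode(sequence, index_word, pad_id=0):
    """Turn a padded id sequence back into a space-joined string, dropping padding."""
    return " ".join(index_word[i] for i in sequence if i != pad_id)

row = [15, 16, 17, 18, 19, 20, 21, 22, 3, 0]  # second row of the padded array
decoded = decode(row, index_word)
print(decoded)  # Israel approves Arafat 's flight to West Bank .
```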

To test how the out-of-vocabulary index is used, we can feed the tags to the text tokenizer ;) and see all `1`s:

In [20]:
text_word_tokenizer.texts_to_sequences(ner_data)

[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1]]
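
Conceptually, the OOV handling seen here is a dictionary lookup with a default value. The following plain-Python sketch shows that idea with a small hand-written `word_index`; the `encode` helper is hypothetical and is not the Keras implementation.

```python
def encode(words, word_index, oov_id=1):
    """Look up each word and fall back to the OOV id for unknown words."""
    return [word_index.get(word, oov_id) for word in words]

# A few entries in the same layout as the tokenizer above: <UNK> takes index 1.
word_index = {"<UNK>": 1, "-": 2, ".": 3, "Kurt": 7}

known_and_unknown = encode("Kurt B-PER .".split(), word_index)
only_tags = encode("O B-PER I-PER".split(), word_index)

print(known_and_unknown)  # [7, 1, 3]
print(only_tags)          # [1, 1, 1]
```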

In [21]:
ner_word_tokenizer = keras_tokenize(ner_data)

In [22]:
res = ner_word_tokenizer.texts_to_sequences(ner_data)
res = tf.keras.preprocessing.sequence.pad_sequences(res, padding="post")
res

array([[2, 3, 4, 6, 4, 2, 3, 4, 6, 4, 2, 3, 5, 2, 2],
       [3, 5, 2, 3, 4, 2, 2, 2, 3, 5, 6, 5, 2, 0, 0],
       [3, 4, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0],
       [2, 3, 5, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0],
       [3, 7, 6, 7, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0]], dtype=int32)

In [23]:
ner_word_tokenizer.index_word

{1: '<UNK>', 2: 'O', 3: 'B', 4: 'PER', 5: 'LOC', 6: 'I', 7: 'ORG'}
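
Notice that the tags were broken apart: `B-PER` became `B` and `PER`, because `-` is part of the default `filters` string used by `keras_tokenize`. The plain-Python sketch below approximates that behaviour under the assumption that every filter character is replaced by the split character before splitting; `split_words` is a hypothetical stand-in, not the Keras function.

```python
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text, filters=DEFAULT_FILTERS, split=" "):
    """Replace every filter character with the split character, then split."""
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]

with_default_filters = split_words("O B-PER I-PER")
without_filters = split_words("O B-PER I-PER", filters="")

print(with_default_filters)  # ['O', 'B', 'PER', 'I', 'PER']
print(without_filters)       # ['O', 'B-PER', 'I-PER']
```

For NER tags, passing `filters=''` (as was done for the text tokenizer earlier) should therefore keep each tag as a single token.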

In [24]:
vocab_size = len(ner_word_tokenizer.word_index) + 1
vocab_size

8

## Character Encoding

Character-level encoding is useful when you want to capture information inside a word. It helps in dealing with out-of-vocabulary words, gives a finer-grained view of a word in terms of the characters it contains and their positions, etc.
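
As a small illustration of the OOV benefit, the plain-Python sketch below builds a character index from two names and then encodes a word that never appeared in the corpus. The helpers `build_char_index` and `encode_chars` are hypothetical, and the id layout (OOV at 1) is an assumption chosen to mirror the tokenizers in this post.

```python
def build_char_index(corpus, oov_token="<UNK>"):
    """Assign an id to every character in the corpus, with the OOV token at id 1."""
    char_index = {oov_token: 1}
    for ch in sorted({ch for text in corpus for ch in text}):
        char_index[ch] = len(char_index) + 1
    return char_index

def encode_chars(word, char_index, oov_id=1):
    """Encode a word as a list of character ids."""
    return [char_index.get(ch, oov_id) for ch in word]

corpus = ["Kurt Betschart", "Bruno Risi"]
char_index = build_char_index(corpus)

# "Bert" is not a word in the corpus, but all of its characters are known.
unseen_word_ids = encode_chars("Bert", char_index)
# "Z" never occurs in the corpus, so it falls back to the OOV id.
unseen_char_ids = encode_chars("Z", char_index)

print(unseen_word_ids)
print(unseen_char_ids)  # [1]
```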

In [25]:
text_char_tonkenizer = keras_tokenize(text_data, char_level=True)

In [26]:
text_data[0]

'4. Kurt Betschart - Bruno Risi ( Switzerland ) 22'

In [27]:
text_char_tonkenizer.index_word
# Care needs to be taken with OOV handling in character-level tokenizing:
# every character seen during fitting becomes part of the vocab, so OOV only
# shows up for unseen characters, e.g. when we tokenize a different language
# or a different string encoding.

{1: '<UNK>',
 2: ' ',
 3: 's',
 4: 'e',
 5: 'a',
 6: 't',
 7: 'r',
 8: 'i',
 9: 'n',
 10: 'o',
 11: 'l',
 12: '-',
 13: '.',
 14: 'h',
 15: 'f',
 16: 'm',
 17: 'u',
 18: 'B',
 19: '2',
 20: 'g',
 21: 'k',
 22: 'G',
 23: '8',
 24: '0',
 25: '4',
 26: 'w',
 27: 'z',
 28: 'd',
 29: 'p',
 30: 'A',
 31: "'",
 32: 'W',
 33: 'M',
 34: 'b',
 35: '/',
 36: '5',
 37: 'K',
 38: 'c',
 39: 'R',
 40: '(',
 41: 'S',
 42: ')',
 43: 'I',
 44: 'v',
 45: 'D',
 46: 'L',
 47: 'F',
 48: 'P',
 49: 'H',
 50: '+',
 51: '3',
 52: '6',
 53: '7',
 54: 'y'}

In [28]:
# split the text by spaces i.e list of list of words
char_data = [text.split(" ") for text in text_data]
print(char_data)

char_data_encoded = []
for char_seq in char_data:
    # tokenize each sentence
    res = text_char_tonkenizer.texts_to_sequences(char_seq)
    # pad it 
    res = tf.keras.preprocessing.sequence.pad_sequences(res, padding="post", maxlen=6)
    # group it as a batch
    char_data_encoded.append(res)
    
char_data_encoded

[['4.', 'Kurt', 'Betschart', '-', 'Bruno', 'Risi', '(', 'Switzerland', ')', '22'], ['Israel', 'approves', 'Arafat', "'s", 'flight', 'to', 'West', 'Bank', '.'], ['Moreau', 'takes', 'bronze', 'medal', 'as', 'faster', 'losing', 'semifinalist', '.'], ['W', 'D', 'L', 'G', '/', 'F', 'G', '/', 'A', 'P'], ['--', 'Helsinki', 'newsroom', '+358', '-', '0', '-', '680', '50', '248'], ["M'bishi", 'Gas', 'sets', 'terms', 'on', '7-year', 'straight', '.']]


[array([[25, 13,  0,  0,  0,  0],
        [37, 17,  7,  6,  0,  0],
        [ 3, 38, 14,  5,  7,  6],
        [12,  0,  0,  0,  0,  0],
        [18,  7, 17,  9, 10,  0],
        [39,  8,  3,  8,  0,  0],
        [40,  0,  0,  0,  0,  0],
        [ 4,  7, 11,  5,  9, 28],
        [42,  0,  0,  0,  0,  0],
        [19, 19,  0,  0,  0,  0]], dtype=int32),
 array([[43,  3,  7,  5,  4, 11],
        [29,  7, 10, 44,  4,  3],
        [30,  7,  5, 15,  5,  6],
        [31,  3,  0,  0,  0,  0],
        [15, 11,  8, 20, 14,  6],
        [ 6, 10,  0,  0,  0,  0],
        [32,  4,  3,  6,  0,  0],
        [18,  5,  9, 21,  0,  0],
        [13,  0,  0,  0,  0,  0]], dtype=int32),
 array([[33, 10,  7,  4,  5, 17],
        [ 6,  5, 21,  4,  3,  0],
        [34,  7, 10,  9, 27,  4],
        [16,  4, 28,  5, 11,  0],
        [ 5,  3,  0,  0,  0,  0],
        [15,  5,  3,  6,  4,  7],
        [11, 10,  3,  8,  9, 20],
        [ 9,  5, 11,  8,  3,  6],
        [13,  0,  0,  0,  0,  0]], dtype=int32),
 ...]
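
One detail in the output above is easy to miss: "Betschart" has nine characters, yet its row holds the ids for "schart". `pad_sequences` truncates from the front by default (`truncating='pre'`), independently of `padding='post'`. The plain-Python sketch below mimics that combination; `pad_post_truncate_pre` is a hypothetical helper, and the ids are copied from the character index printed earlier.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then pad on the right up to `maxlen`."""
    kept = list(seq)[-maxlen:]
    return kept + [value] * (maxlen - len(kept))

# Character ids for "Betschart" and "4." taken from the index printed above.
betschart = [18, 4, 6, 3, 38, 14, 5, 7, 6]
four_dot = [25, 13]

long_word = pad_post_truncate_pre(betschart, maxlen=6)
short_word = pad_post_truncate_pre(four_dot, maxlen=6)

print(long_word)   # [3, 38, 14, 5, 7, 6]
print(short_word)  # [25, 13, 0, 0, 0, 0]
```

If the beginning of a word matters more than its ending for your task, passing `truncating='post'` to `pad_sequences` keeps the first characters instead.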

# TF Text APIs
- https://github.com/tensorflow/text
- https://www.tensorflow.org/tutorials/tensorflow_text/intro
- https://blog.tensorflow.org/2019/06/introducing-tftext.html

The last one is the TensorFlow Text APIs. At first glance they seem to integrate well with the TensorFlow Datasets APIs and Keras.

Since my current requirements are met by the Keras preprocessing APIs, I am keeping this for later exploration.

In [29]:
tokenizer = tensorflow_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])
print(tokens.to_list())

[[b'everything', b'not', b'saved', b'will', b'be', b'lost.'], [b'Sad\xe2\x98\xb9']]
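
The tokens come back as UTF-8 byte strings rather than Python `str` objects. A minimal sketch of turning such a nested list back into text, using the output above as literal input (plain Python, no TensorFlow calls):

```python
# The nested list printed above, written out as a literal.
byte_tokens = [
    [b"everything", b"not", b"saved", b"will", b"be", b"lost."],
    [b"Sad\xe2\x98\xb9"],
]

# Decode every token from UTF-8 bytes to str.
text_tokens_decoded = [[tok.decode("utf-8") for tok in sentence] for sentence in byte_tokens]

print(text_tokens_decoded)
```

Note that the whitespace tokenizer leaves "lost." and "Sad☹" intact, since it splits on whitespace only.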


In [30]:
text_tokens = tokenizer.tokenize(text_data)

In [31]:
text_tokens.values

<tf.Tensor: shape=(56,), dtype=string, numpy=
array([b'4.', b'Kurt', b'Betschart', b'-', b'Bruno', b'Risi', b'(',
       b'Switzerland', b')', b'22', b'Israel', b'approves', b'Arafat',
       b"'s", b'flight', b'to', b'West', b'Bank', b'.', b'Moreau',
       b'takes', b'bronze', b'medal', b'as', b'faster', b'losing',
       b'semifinalist', b'.', b'W', b'D', b'L', b'G', b'/', b'F', b'G',
       b'/', b'A', b'P', b'--', b'Helsinki', b'newsroom', b'+358', b'-',
       b'0', b'-', b'680', b'50', b'248', b"M'bishi", b'Gas', b'sets',
       b'terms', b'on', b'7-year', b'straight', b'.'], dtype=object)>
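
`.values` is the flat view of a ragged result: all 56 tokens in one vector, with the sentence boundaries stored separately (a `RaggedTensor` exposes them through, for example, `row_lengths()`). The plain-Python sketch below shows how flat values plus row lengths are enough to rebuild the per-sentence structure; `unflatten` is a hypothetical helper and only two of the six sentences are used to keep it short.

```python
def unflatten(values, row_lengths):
    """Rebuild a list of rows from a flat list and the length of each row."""
    rows, start = [], 0
    for length in row_lengths:
        rows.append(values[start:start + length])
        start += length
    return rows

sentences = [
    "Israel approves Arafat 's flight to West Bank .",
    "M'bishi Gas sets terms on 7-year straight .",
]
nested = [sentence.split() for sentence in sentences]

# Flat representation: every token in order, plus how many belong to each sentence.
flat_values = [tok for row in nested for tok in row]
row_lengths = [len(row) for row in nested]

print(len(flat_values), row_lengths)  # 17 [9, 8]
rebuilt = unflatten(flat_values, row_lengths)
```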