<a href="https://colab.research.google.com/github/mkaramib/nlp/blob/main/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization
Tokenization is the first key step in any NLP pipeline. This notebook explains several ways to tokenize text.


## TensorFlow
TensorFlow provides tokenization through the Keras *Tokenizer* class:

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
# define some sentences
train_sentences = ['This is a test sentence for tekonization.', 'Tensor flow provides tokenization.', 'Machine learning is growing rapidly.']

# test sentences
test_sentences = ['NLP is a field of machine learning.']

In [4]:
# instantiate the tokenizer
tokenizer = Tokenizer()

# build the vocabulary from the training sentences
tokenizer.fit_on_texts(train_sentences)

### Word Indexes
The Tokenizer exposes a *word_index* attribute, a dictionary that maps each token in the training texts to its index. Indices start at 1 and are assigned by descending frequency, so the most frequent word gets index 1.

In [5]:
# get word indices
word_indices = tokenizer.word_index
print(f'word indices = {word_indices}')

word indices = {'is': 1, 'this': 2, 'a': 3, 'test': 4, 'sentence': 5, 'for': 6, 'tekonization': 7, 'tensor': 8, 'flow': 9, 'provides': 10, 'tokenization': 11, 'machine': 12, 'learning': 13, 'growing': 14, 'rapidly': 15}
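
To make the ordering concrete, here is a minimal pure-Python sketch of how a frequency-ordered word index can be built. It is not the Keras implementation: `simple_tokenize`, `build_word_index`, and `FILTERS` are hypothetical names, and the filter characters are assumed to match the Tokenizer's default punctuation filter.

```python
from collections import Counter

# Characters replaced by spaces before splitting (assumed to mirror the default filters).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase the text, blank out filtered characters, and split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Assign indices from 1, most frequent word first; ties keep first-seen order."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # sorted() is stable, so words with equal counts stay in insertion order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


train_sentences = ['This is a test sentence for tekonization.',
                   'Tensor flow provides tokenization.',
                   'Machine learning is growing rapidly.']

sketch_index = build_word_index(train_sentences)
print(sketch_index)
```

Only *is* occurs twice in the training sentences, which is why it receives index 1; every other word keeps the order in which it first appeared.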


### Vectorize the text
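The Tokenizer's *texts_to_sequences* method converts each text into a list of word indices. Words that were not seen during fitting are dropped, which is why the test sentence below yields only four indices.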

In [6]:
# vectorize the training sentences (convert each text into a list of indices)
print(f'vectors of training_sentences = {tokenizer.texts_to_sequences(train_sentences)}')

# vectorize the test sentences
print(f'vectors of test sentences = {tokenizer.texts_to_sequences(test_sentences)}')

vectors of training_sentences = [[2, 1, 3, 4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 1, 14, 15]]
vectors of test sentences = [[1, 3, 12, 13]]
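
The lookup behind this result can be sketched in a few lines of plain Python. This is an illustration only: `to_sequence` is a hypothetical helper, the regular expression is a simplified stand-in for the Tokenizer's own text cleaning, and the dictionary is copied from the word index printed earlier.

```python
import re

# Word index copied from the output printed earlier in this notebook.
word_index = {'is': 1, 'this': 2, 'a': 3, 'test': 4, 'sentence': 5, 'for': 6,
              'tekonization': 7, 'tensor': 8, 'flow': 9, 'provides': 10,
              'tokenization': 11, 'machine': 12, 'learning': 13,
              'growing': 14, 'rapidly': 15}


def to_sequence(text, word_index):
    """Map each known word to its index; words missing from word_index are skipped."""
    words = re.findall(r"[a-z0-9']+", text.lower())  # simplified tokenization (assumption)
    return [word_index[word] for word in words if word in word_index]


test_sequence = to_sequence('NLP is a field of machine learning.', word_index)
print(test_sequence)  # [1, 3, 12, 13]
```

The words *nlp*, *field*, and *of* are absent from the vocabulary, so they leave no trace in the sequence.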


### Limit the Vocabulary
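For a large corpus, we can restrict tokenization to the most frequent words by passing the *num_words* argument to the Tokenizer. Note that *word_index* still lists every word seen during fitting; the limit is applied only when texts are converted to sequences.

tokenizer = Tokenizer(num_words=100)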
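
With only 15 distinct words in the training sentences, a limit of 100 has no visible effect. The sketch below shows what a smaller limit would do, assuming (as the Keras documentation describes) that only the `num_words - 1` most frequent words are kept. `limit_to_top_words` is a hypothetical helper, not part of TensorFlow.

```python
def limit_to_top_words(sequence, num_words):
    """Keep only indices below num_words, i.e. the num_words - 1 most frequent words."""
    return [index for index in sequence if index < num_words]


# Indices of 'This is a test sentence for tekonization.' under the full vocabulary
full_sequence = [2, 1, 3, 4, 5, 6, 7]

limited = limit_to_top_words(full_sequence, num_words=5)
print(limited)  # [2, 1, 3, 4]
```

Because indices are ordered by frequency, filtering on the index value is enough to keep the most common words.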

### Out Of Vocabulary (OOV)
Out-of-vocabulary (OOV) tokens, that is, words that never appeared in the training texts, must also be handled in NLP pipelines. By default they are silently dropped. TensorFlow lets us map them to a placeholder instead through the *oov_token* argument of the Tokenizer:

tokenizer = Tokenizer(oov_token='<OOV>')

In [9]:
# define tokenizer with num_words and oov_token
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

# fit the tokenizer on the training sentences
tokenizer.fit_on_texts(train_sentences)

# get word indices
word_indices = tokenizer.word_index
print(f'word indices = {word_indices}')

word indices = {'<OOV>': 1, 'is': 2, 'this': 3, 'a': 4, 'test': 5, 'sentence': 6, 'for': 7, 'tekonization': 8, 'tensor': 9, 'flow': 10, 'provides': 11, 'tokenization': 12, 'machine': 13, 'learning': 14, 'growing': 15, 'rapidly': 16}


Now we can test the new tokenizer on the test sentence, which contains words that were not seen during training. The index of the *OOV* token is 1, so every 1 in the resulting sequence marks a word that is missing from the training vocabulary.

In [12]:
# run the tokenizer on the test sentence
tests_sequences = tokenizer.texts_to_sequences(test_sentences)
print(f'vectorized test sentence = {tests_sequences}')

vectorized test sentence = [[1, 2, 4, 1, 1, 13, 14]]
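
One way to read this result is to map the indices back to words. The sketch below does so in plain Python by inverting the word index printed earlier. `index_word` and `decode` are names chosen for this illustration and do not call into TensorFlow.

```python
# Word index copied from the OOV-aware tokenizer output earlier in this notebook.
word_index = {'<OOV>': 1, 'is': 2, 'this': 3, 'a': 4, 'test': 5, 'sentence': 6,
              'for': 7, 'tekonization': 8, 'tensor': 9, 'flow': 10,
              'provides': 11, 'tokenization': 12, 'machine': 13,
              'learning': 14, 'growing': 15, 'rapidly': 16}

# Invert the mapping so that each index points back to its word
index_word = {index: word for word, index in word_index.items()}


def decode(sequence):
    """Turn a list of indices back into a space-separated string of tokens."""
    return ' '.join(index_word[index] for index in sequence)


decoded = decode([1, 2, 4, 1, 1, 13, 14])
print(decoded)  # <OOV> is a <OOV> <OOV> machine learning
```

The three placeholders show that the sentence keeps its original length of seven tokens, which the tokenizer without *oov_token* did not preserve.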


### Padding
Most NLP components also require inputs of equal length. Padding is the usual technique for this.

TensorFlow provides the *pad_sequences* function, which accepts arguments such as *maxlen* (the target length) and *padding*, which sets where the padding is added: *pre* or *post*.

The default position is *pre*, which means that zeros are added at the start of the sequence.

In [14]:
# pad the sequence (default: padding='pre')
padded = pad_sequences(tests_sequences, maxlen=10)

# print the padded sequence
print(f'Padded sequence = {padded}')

Padded sequence = [[ 0  0  0  1  2  4  1  1 13 14]]


It is also possible to add the padding after the sequence by passing *post* as the *padding* argument of the pad_sequences() function.

In [15]:
# pad the sequence at the end (padding='post')
padded = pad_sequences(tests_sequences, maxlen=10, padding='post')

# print the padded sequence
print(f'Padded sequence = {padded}')

Padded sequence = [[ 1  2  4  1  1 13 14  0  0  0]]
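
Both padding modes can be reproduced with a short pure-Python sketch. `pad_sequence` is a hypothetical single-sequence helper that returns a plain list rather than a NumPy array. It also trims sequences longer than `maxlen` from the front, which is assumed here to match the default truncation behavior of *pad_sequences*.

```python
def pad_sequence(sequence, maxlen, padding='pre', value=0):
    """Pad (or trim) one sequence to exactly maxlen items."""
    if padding not in ('pre', 'post'):
        raise ValueError("padding must be 'pre' or 'post'")
    # Sequences longer than maxlen keep only their last maxlen items (assumption)
    trimmed = list(sequence)[max(len(sequence) - maxlen, 0):]
    fill = [value] * (maxlen - len(trimmed))
    return fill + trimmed if padding == 'pre' else trimmed + fill


sequence = [1, 2, 4, 1, 1, 13, 14]

pre_padded = pad_sequence(sequence, maxlen=10)
post_padded = pad_sequence(sequence, maxlen=10, padding='post')

print(pre_padded)   # [0, 0, 0, 1, 2, 4, 1, 1, 13, 14]
print(post_padded)  # [1, 2, 4, 1, 1, 13, 14, 0, 0, 0]
```

Index 0 is safe to use as the padding value because the Tokenizer starts assigning word indices at 1.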
