<a href="https://colab.research.google.com/github/mkaramib/nlp/blob/main/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization
Tokenization is the first key step in any NLP pipeline. This notebook explains several ways to tokenize text.


## TensorFlow
TensorFlow provides tokenization through the Keras Tokenizer class, used as follows:

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [12]:
# define some sentences
train_sentences = ['This is a test sentence for tekonization.', 'Tensor flow provides tokenization.', 'Machine learning is growing rapidly.']

# test sentences
test_sentences = ['NLP is a field of machine learning.']

In [13]:
# instantiate the tokenizer
tokenizer = Tokenizer()

# build the vocabulary by calling fit_on_texts on the training sentences
tokenizer.fit_on_texts(train_sentences)

## Word Indexes
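The Tokenizer exposes a *word_index* attribute, which maps each token seen in the training texts to an integer index. Indices start at 1 and are assigned in descending order of frequency, so the most frequent word gets index 1.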

In [14]:
# get word indices
word_indices = tokenizer.word_index
print(f'word indices = {word_indices}')

word indices = {'is': 1, 'this': 2, 'a': 3, 'test': 4, 'sentence': 5, 'for': 6, 'tekonization': 7, 'tensor': 8, 'flow': 9, 'provides': 10, 'tokenization': 11, 'machine': 12, 'learning': 13, 'growing': 14, 'rapidly': 15}
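
To make the frequency ranking concrete, here is a simplified pure-Python sketch of how such an index could be built. It is not the Keras implementation: `simple_tokenize` and `build_word_index` are hypothetical helpers, and the regular expression only approximates the Tokenizer's default lowercasing and punctuation filtering.

```python
import re
from collections import Counter

train_sentences = ['This is a test sentence for tekonization.',
                   'Tensor flow provides tokenization.',
                   'Machine learning is growing rapidly.']

def simple_tokenize(text):
    # Hypothetical helper: lowercase the text and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts):
    # Count every token across all texts.
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # Rank by descending count; words with equal counts keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

word_index = build_word_index(train_sentences)
print(word_index)
```

Because 'is' occurs twice and every other word once, it receives index 1, and the remaining words follow in order of first appearance.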


## Vectorize the text
The Tokenizer's *texts_to_sequences* method converts each text into a list of word indices.

In [18]:
# vectorize the training sentences (convert each sentence into a list of word indices)
print(f'vectors of training_sentences = {tokenizer.texts_to_sequences(train_sentences)}')

# vectorize the test sentences
print(f'vectors of test sentences = {tokenizer.texts_to_sequences(test_sentences)}')

vectors of training_sentences = [[2, 1, 3, 4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 1, 14, 15]]
vectors of test sentences = [[1, 3, 12, 13]]
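
The test sentence has seven words but its vector has only four entries, because words missing from the vocabulary are silently dropped. The sketch below imitates that lookup in plain Python; `to_sequence` is a hypothetical helper rather than a Keras function, and `word_index` is copied from the output shown earlier.

```python
import re

# Vocabulary copied from the word_index printed earlier in this notebook.
word_index = {'is': 1, 'this': 2, 'a': 3, 'test': 4, 'sentence': 5, 'for': 6,
              'tekonization': 7, 'tensor': 8, 'flow': 9, 'provides': 10,
              'tokenization': 11, 'machine': 12, 'learning': 13,
              'growing': 14, 'rapidly': 15}

def to_sequence(text, word_index):
    # Hypothetical helper: lowercase, split into words, keep only known words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [word_index[token] for token in tokens if token in word_index]

result = to_sequence('NLP is a field of machine learning.', word_index)
print(result)  # [1, 3, 12, 13]
```

Here 'nlp', 'field' and 'of' never appeared in the training sentences, so they leave no trace in the result.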


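**Note 1:** If the corpus is large, we can restrict vectorization to the most frequent terms by passing the *num_words* argument to the Tokenizer. Only the *num_words* - 1 most frequent words are kept when texts are converted to sequences; *word_index* itself still lists every word. For example: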

tokenizer = Tokenizer(num_words = 100)

**Note 2:** Out-of-vocabulary (OOV) tokens, i.e. words that never appeared in the training texts, must also be handled. By default the Tokenizer silently drops them; the *oov_token* argument replaces them with a placeholder token instead:

tokenizer = Tokenizer(oov_token='<OOV>')
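
The combined effect of the two arguments can be sketched in plain Python. This is a simplified model under stated assumptions, not the Keras source: `encode` is a hypothetical helper, the OOV placeholder is assumed to take index 1 (shifting every other word up by one), and any word whose index is not below *num_words* is treated as OOV.

```python
OOV = '<OOV>'

# Training words in frequency order, as in the word_index printed earlier.
ranked_words = ['is', 'this', 'a', 'test', 'sentence', 'for', 'tekonization',
                'tensor', 'flow', 'provides', 'tokenization', 'machine',
                'learning', 'growing', 'rapidly']

# Assumption: the OOV placeholder is index 1 and real words start at 2.
word_index = {word: i for i, word in enumerate([OOV] + ranked_words, start=1)}

def encode(tokens, word_index, num_words=None, oov_token=None):
    # Hypothetical helper mimicking the lookup with num_words and oov_token.
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequence = []
    for token in tokens:
        index = word_index.get(token)
        if index is not None and (num_words is None or index < num_words):
            sequence.append(index)      # known word inside the allowed range
        elif oov_index is not None:
            sequence.append(oov_index)  # unknown or too rare: use the placeholder
        # otherwise the word is dropped
    return sequence

tokens = ['nlp', 'is', 'a', 'field', 'of', 'machine', 'learning']
full = encode(tokens, word_index, oov_token=OOV)
limited = encode(tokens, word_index, num_words=5, oov_token=OOV)
print(full)     # [1, 2, 4, 1, 1, 13, 14]
print(limited)  # [1, 2, 4, 1, 1, 1, 1]
```

With the placeholder in place, the sequence keeps the same length as the input, which preserves word positions for downstream models.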


In [6]:
# define tokenizer with num_words and oov_token
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

# fit the tokenizer on the training sentences
tokenizer.fit_on_texts(train_sentences)

Now we can test the new tokenizer, with its OOV token, on test sentences that contain tokens unseen during training.