<a href="https://colab.research.google.com/github/mkaramib/nlp/blob/main/tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization
Tokenization is one of the first steps in almost any NLP pipeline. This notebook walks through several ways to tokenize text.


## TensorFlow
TensorFlow provides tokenization through the Keras *Tokenizer* class:

In [1]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
# define some sentences
train_sentences = ['This is a test sentence for tokenization.', 'TensorFlow provides tokenization.', 'Machine learning is growing rapidly.']

# test sentences
test_sentences = ['NLP is a field of machine learning.']

In [4]:
# instantiate the tokenizer
tokenizer = Tokenizer()

# build the vocabulary from the training sentences
tokenizer.fit_on_texts(train_sentences)

### Word Indexes
After fitting, the Tokenizer exposes the *word_index* attribute, a dictionary that maps each token seen in the training texts to its integer index.

In [None]:
# get word indices
word_indices = tokenizer.word_index
print(f'word indices = {word_indices}')
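
The way these indices are assigned can be illustrated without TensorFlow. The following is a simplified pure-Python sketch, not the Keras implementation: it assumes lowercasing, a small hypothetical punctuation set, and whitespace splitting. Words are numbered from 1 in order of decreasing frequency, and equally frequent words keep their order of first appearance. The names `simple_tokenize` and `build_word_index` are made up for this illustration.

```python
from collections import Counter

# Assumed punctuation set for this sketch; the real Tokenizer has its own default filters
PUNCTUATION = '.,!?;:"()'


def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    cleaned = ''.join(' ' if ch in PUNCTUATION else ch for ch in text.lower())
    return cleaned.split()


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


sample_texts = ['The cat sat.', 'The dog sat on the mat.']
print(build_word_index(sample_texts))
# {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'on': 5, 'mat': 6}
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value.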

### Vectorize the text
The *texts_to_sequences* method vectorizes text: it converts each sentence into a list of the integer indices of its tokens.

In [None]:
# vectorize the training sentences (convert each sentence into a list of word indices)
print(f'vectors of training sentences = {tokenizer.texts_to_sequences(train_sentences)}')

# vectorize the test sentences
print(f'vectors of test sentences = {tokenizer.texts_to_sequences(test_sentences)}')
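
Conceptually, this vectorization is a dictionary lookup per word. The sketch below is a simplified stand-in, using a hand-written hypothetical word index and a helper name (`to_sequence`) made up for this illustration. It shows why a test sentence can come back shorter than expected: when no OOV token is configured, words missing from the index are skipped.

```python
# Hypothetical word index, as if learned from some training texts
word_index = {'is': 1, 'a': 2, 'machine': 3, 'learning': 4}


def to_sequence(text, word_index):
    """Replace each known word by its index; silently drop unknown words."""
    words = text.lower().replace('.', ' ').split()
    return [word_index[w] for w in words if w in word_index]


sequence = to_sequence('NLP is a field of machine learning.', word_index)
print(sequence)  # [1, 2, 3, 4]
```

The sentence has seven words but the sequence has only four entries, because "NLP", "field", and "of" are not in the index.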

### Limit the Vocabulary
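If the corpus is large, the vocabulary can be limited to the most frequent words with the *num_words* argument of the Tokenizer. Only the *num_words* - 1 most frequent words are used when texts are converted to sequences; *word_index* itself still lists every word seen during fitting.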

tokenizer = Tokenizer(num_words = 100)

### Out Of Vocabulary (OOV)
Words that appear at inference time but were not seen during training are called out-of-vocabulary (OOV) words, and they need to be handled explicitly. By default, the Tokenizer silently drops them. The *oov_token* argument makes it replace them with a placeholder token instead:

tokenizer = Tokenizer(oov_token='<OOV>')

In [None]:
# define a tokenizer with num_words and oov_token
tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')

# fit the tokenizer on the training sentences
tokenizer.fit_on_texts(train_sentences)

# get word indices
word_indices = tokenizer.word_index
print(f'word indices = {word_indices}')

Now we can run the new tokenizer on a test sentence that contains tokens not seen during training. The *OOV* token is assigned index 1, so every 1 in the resulting sequence marks a token that was not in the training vocabulary.

In [None]:
# run the tokenizer on the test sentence
tests_sequences = tokenizer.texts_to_sequences(test_sentences)
print(f'vectorized test sentence = {tests_sequences}')
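
The interaction between *num_words* and *oov_token* can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library code: the word index is hand-written, `encode` is a hypothetical helper, and index 1 is reserved for the OOV token. A word is kept only if its index is below `num_words`; everything else maps to the OOV index.

```python
OOV_INDEX = 1

# Hypothetical word index; index 1 belongs to the OOV token
word_index = {'<OOV>': OOV_INDEX, 'is': 2, 'learning': 3, 'machine': 4, 'a': 5}


def encode(text, word_index, num_words=None):
    """Look up each word; unknown or too-rare words become OOV_INDEX."""
    sequence = []
    for word in text.lower().replace('.', ' ').split():
        index = word_index.get(word)
        if index is None or (num_words is not None and index >= num_words):
            sequence.append(OOV_INDEX)
        else:
            sequence.append(index)
    return sequence


sentence = 'NLP is a field of machine learning.'
print(encode(sentence, word_index))               # [1, 2, 5, 1, 1, 4, 3]
print(encode(sentence, word_index, num_words=4))  # [1, 2, 1, 1, 1, 1, 3]
```

Both sequences keep the length of the sentence, because each unseen or excluded word is replaced instead of dropped.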

### Padding
Most NLP models expect inputs of equal length. Padding achieves this by extending shorter sequences with a filler value. 

TensorFlow provides the *pad_sequences* function, which accepts arguments such as *maxlen* (the target length) and *padding* (where the filler is added: *pre* or *post*). 

The default position is *pre*, which means that zeros are added at the start of the sequence. Sequences longer than *maxlen* are truncated, by default also from the start.

In [None]:
# pad the sequences to length 10 (zeros are added at the start by default)
padded = pad_sequences(tests_sequences, maxlen=10)

# print the padded sequences
print(f'Padded sequence = {padded}')

It is also possible to add the padding after the sequence by passing *post* as the *padding* argument of the *pad_sequences* function.

In [None]:
# pad the sequences to length 10, adding zeros at the end
padded = pad_sequences(tests_sequences, maxlen=10, padding='post')

# print the padded sequences
print(f'Padded sequence = {padded}')
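
To make the two padding modes concrete, here is a minimal pure-Python sketch of the same idea. It is an illustration, not the TensorFlow implementation: `pad` is a hypothetical helper, it returns plain lists instead of a NumPy array, it assumes a positive `maxlen`, and it always cuts overly long sequences from the front.

```python
def pad(sequences, maxlen, padding='pre', value=0):
    """Pad each sequence with `value` to exactly `maxlen` items.

    Sequences longer than `maxlen` are cut from the front in this sketch.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]             # keep the last maxlen items
        fill = [value] * (maxlen - len(seq))  # filler needed to reach maxlen
        padded.append(fill + seq if padding == 'pre' else seq + fill)
    return padded


sequences = [[1, 2, 5], [4, 3, 2, 6, 7, 8]]
print(pad(sequences, maxlen=5))                  # [[0, 0, 1, 2, 5], [3, 2, 6, 7, 8]]
print(pad(sequences, maxlen=5, padding='post'))  # [[1, 2, 5, 0, 0], [3, 2, 6, 7, 8]]
```

The second sequence has six items, so its first item is dropped in both modes and no filler is needed.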

## NLTK
NLTK is an NLP toolkit that provides common pre-processing tools such as tokenizers and stopword lists. 



In [18]:
# import nltk tokenize and stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

### Initialize
The NLTK tokenizer models and stopword lists must be downloaded before first use. The following lines show this step for English. 

In [None]:
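# download the tokenizer models and the stopword lists
nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: newer NLTK releases load this resource for word_tokenize
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))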

### Tokenize and Stopwords
The following lines show how to tokenize a sentence and remove its stopwords.

In [None]:
# define the sentence
temp_sentence = 'NLP is a field of machine learning.'

# tokenize
temp_tokens = word_tokenize(temp_sentence)
print(f'tokens = {temp_tokens}')

# remove stopwords (the NLTK list is lowercase, so compare in lowercase)
temp_tokens_without_stopwords = [w for w in temp_tokens if w.lower() not in stop_words]
print(f'tokens without stop words = {temp_tokens_without_stopwords}')
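
Note that `word_tokenize` returns punctuation marks as separate tokens, so the final period survives stopword removal. The sketch below shows one possible way to drop it as well. It uses a hard-coded token list and a small hypothetical stopword set so that it runs without any NLTK downloads.

```python
# Tokens as word_tokenize returns them for the sentence above
tokens = ['NLP', 'is', 'a', 'field', 'of', 'machine', 'learning', '.']

# Small hand-picked stopword set for illustration; NLTK's English list is much longer
stop_words = {'is', 'a', 'of', 'the'}

# Keep alphabetic tokens only and compare against the stopwords in lowercase
content_tokens = [
    t for t in tokens
    if t.isalpha() and t.lower() not in stop_words
]
print(content_tokens)  # ['NLP', 'field', 'machine', 'learning']
```

`str.isalpha()` also discards numbers and hyphenated words, which may or may not be desirable for a given task.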

## Trax
Trax provides tokenization for NLP processes. 