<a href="https://colab.research.google.com/github/mvenkatesh431/NLP/blob/master/Tensorflow_Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Let's start with the Tokenizer.

In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer

words = [
    'Tensorflow NLP Tokenizer', 
    'tokenizer! is really cool'
]

tokenizer = Tokenizer(num_words = 10)
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

print(word_index)

{'tokenizer': 1, 'tensorflow': 2, 'nlp': 3, 'is': 4, 'really': 5, 'cool': 6}


- Note: 'tokenizer' in the second sentence starts with a lowercase 't', but Keras lowercases all text by default, so both spellings map to the same index. It also removes special characters such as '!'.
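The normalization described in the note can be approximated in plain Python. This is only a sketch of the idea, not the Keras implementation: the helper name `build_word_index` is hypothetical, and stripping every character in `string.punctuation` is an assumption (the Keras default filter list is similar but not necessarily identical).

```python
import string
from collections import Counter


def build_word_index(texts):
    """Hypothetical helper: lowercase, strip punctuation, rank words by frequency."""
    # Assumption: every ASCII punctuation character is replaced with a space.
    strip_punct = str.maketrans({ch: " " for ch in string.punctuation})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_punct).split())
    # most_common() orders by count (descending); ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool'
]
word_index = build_word_index(words)
print(word_index)
```

For these two sentences the sketch gives 'tokenizer' index 1 because it is the only word that occurs twice, which matches the output printed above.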


### More on tokenizer

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer

words = [
    'Tensorflow NLP Tokenizer', 
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

tokenizer = Tokenizer(num_words = 10)
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes the raw sentences and encodes them with the vocabulary learned by fit_on_texts.
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

{'tokenizer': 1, 'tensorflow': 2, 'is': 3, 'nlp': 4, 'really': 5, 'cool': 6, 'awesome': 7, 'spliting': 8, 'sentences': 9, 'in': 10, 'to': 11, 'words': 12, 'using': 13}
[[2, 4, 1], [1, 3, 5, 6], [2, 3, 7], [8, 9, 1]]


- texts_to_sequences converts each sentence into a list of integer word indices.
- The 4th sentence is encoded as only [8, 9, 1] because num_words = 10 keeps just the words with an index below 10; 'in', 'to', 'words' and 'using' (indices 10 to 13) are dropped.
- A useful property of the tokenizer is that it can be fitted on one dataset and then applied to another.
- In practice, fit the tokenizer on the training data and only transform the test data.
- This matters because the model is built on the word indices of the training data. Fitting again on the test data would change those indices and could reduce the model's effectiveness.
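The effect of num_words on the encoding step can be sketched in plain Python. This is an illustration under assumptions, not the Keras code: `encode_texts` is a hypothetical helper, the `word_index` literal is copied from the output printed above, and the text cleaning only handles the '!' that appears in these sentences.

```python
def encode_texts(texts, word_index, num_words=None):
    """Hypothetical helper: keep known words whose index is below num_words."""
    sequences = []
    for text in texts:
        seq = []
        # Simplified cleaning (assumption): lowercase and drop '!'.
        for word in text.lower().replace("!", " ").split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen during fitting
            if num_words is not None and idx >= num_words:
                continue  # word is too rare to be kept
            seq.append(idx)
        sequences.append(seq)
    return sequences


# Vocabulary copied from the printed word_index above.
word_index = {'tokenizer': 1, 'tensorflow': 2, 'is': 3, 'nlp': 4, 'really': 5,
              'cool': 6, 'awesome': 7, 'spliting': 8, 'sentences': 9, 'in': 10,
              'to': 11, 'words': 12, 'using': 13}
words = [
    'Tensorflow NLP Tokenizer',
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

limited = encode_texts(words, word_index, num_words=10)
unlimited = encode_texts(words, word_index)
print(limited)        # 4th sentence keeps only indices below 10
print(unlimited[3])   # 4th sentence with every index kept
```

Note that word_index itself always lists the full vocabulary; the limit is only applied when sentences are encoded.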


### Using tokenizer on Test data

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer

words = [
    'Tensorflow NLP Tokenizer', 
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

tokenizer = Tokenizer(num_words = 10)
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes the raw sentences and encodes them with the vocabulary learned by fit_on_texts.
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

test_data = [
             'NLP Tokenizer became really powerful',
             'Tensorflow tokenizer at work'
]

# Do not fit on the test data; only transform it with the already fitted tokenizer.
seq = tokenizer.texts_to_sequences(test_data)
print("Test Seq: ", seq)

{'tokenizer': 1, 'tensorflow': 2, 'is': 3, 'nlp': 4, 'really': 5, 'cool': 6, 'awesome': 7, 'spliting': 8, 'sentences': 9, 'in': 10, 'to': 11, 'words': 12, 'using': 13}
[[2, 4, 1], [1, 3, 5, 6], [2, 3, 7], [8, 9, 1]]
Test Seq:  [[4, 1, 5], [2, 1]]




*   The test sequences contain indices only for the words that are in the training vocabulary.
*   For example, 'NLP Tokenizer became really powerful' is encoded as [4, 1, 5] (4 = nlp, 1 = tokenizer, 5 = really); the remaining words, which are not present in the training data, are ignored.
*   Because the tokenizer only transforms words it saw during fitting, it is a good idea to use a large training corpus so that the vocabulary covers as many words as possible.
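One way to see which test words are being silently dropped is to compare them against the fitted vocabulary. The sketch below is a plain-Python illustration; `unknown_words` is a hypothetical helper, and the `word_index` literal is copied from the output above.

```python
def unknown_words(texts, word_index):
    """Hypothetical helper: list the words of each text that are not in the vocabulary."""
    vocab = set(word_index)
    # Assumption: the test sentences contain no punctuation, so lower() + split() is enough.
    return [[w for w in text.lower().split() if w not in vocab] for text in texts]


# Vocabulary copied from the printed word_index above.
word_index = {'tokenizer': 1, 'tensorflow': 2, 'is': 3, 'nlp': 4, 'really': 5,
              'cool': 6, 'awesome': 7, 'spliting': 8, 'sentences': 9, 'in': 10,
              'to': 11, 'words': 12, 'using': 13}
test_data = [
    'NLP Tokenizer became really powerful',
    'Tensorflow tokenizer at work'
]

missing = unknown_words(test_data, word_index)
print(missing)
```

A check like this makes it easier to judge whether the training corpus is large enough for the data the model will see later.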
 



### Solving Out of vocabulary words problem

In [9]:
from tensorflow.keras.preprocessing.text import Tokenizer

words = [
    'Tensorflow NLP Tokenizer', 
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

# The Tokenizer constructor accepts an out-of-vocabulary token (oov_token); any word outside the vocabulary is replaced with that token.
tokenizer = Tokenizer(num_words = 10, oov_token='<OOV>')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes the raw sentences and encodes them with the vocabulary learned by fit_on_texts.
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

test_data = [
             'NLP Tokenizer became really powerful',
             'Tensorflow tokenizer at work'
]

# Do not fit on the test data; only transform it with the already fitted tokenizer.
seq = tokenizer.texts_to_sequences(test_data)
print("Test Seq: ", seq)

{'<OOV>': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 1, 1, 1, 1, 1, 2]]
Test Seq:  [[5, 2, 1, 6, 1], [3, 2, 1, 1]]


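The Tokenizer constructor accepts an out-of-vocabulary token (oov_token). Any word the tokenizer does not know is replaced with this token instead of being dropped.

The test sequences now contain <OOV> (index 1) in place of the unknown words, which is one way to fill in the missing words. The 4th training sentence is encoded as [9, 1, 1, 1, 1, 1, 2] for a related reason: with num_words = 10 only indices 1 to 9 are kept, so 'sentences', 'in', 'to', 'words' and 'using' (indices 10 to 14) are mapped to <OOV> as well.

**Note:** Make sure the oov_token is a unique string that cannot appear as a real word in the data.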
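The replacement rule can be sketched in plain Python. This is an illustration under assumptions rather than the Keras code: `encode_with_oov` and `OOV_INDEX` are hypothetical names, the `word_index` literal is copied from the output above, and the cleaning only handles the '!' used in these sentences.

```python
OOV_INDEX = 1  # index of '<OOV>' in the printed word_index


def encode_with_oov(text, word_index, num_words=None):
    """Hypothetical helper: map unknown or too-rare words to the OOV index."""
    seq = []
    for word in text.lower().replace("!", " ").split():
        idx = word_index.get(word, OOV_INDEX)  # unknown word -> OOV
        if num_words is not None and idx >= num_words:
            idx = OOV_INDEX                    # index beyond the limit -> OOV
        seq.append(idx)
    return seq


# Vocabulary copied from the printed word_index above.
word_index = {'<OOV>': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5,
              'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10,
              'in': 11, 'to': 12, 'words': 13, 'using': 14}

train_4th = encode_with_oov('spliting sentences in to words using tokenizer', word_index, num_words=10)
test_1st = encode_with_oov('NLP Tokenizer became really powerful', word_index, num_words=10)
print(train_4th)
print(test_1st)
```

Unlike the version without an oov_token, every input word now produces exactly one index, so the encoded sequence keeps the length of the original sentence.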

### Text Padding 

- It is a good idea to give all sentences the same length so that the neural network can process them efficiently as a single batch.
- Our sentences may have different lengths, so we use text padding to bring all of them to the same length.

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = [
    'Tensorflow NLP Tokenizer', 
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

# The Tokenizer constructor accepts an out-of-vocabulary token (oov_token); any word outside the vocabulary is replaced with that token.
# num_words is raised to 20 here so that all 14 indices are kept.
tokenizer = Tokenizer(num_words = 20, oov_token='<OOV>')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes the raw sentences and encodes them with the vocabulary learned by fit_on_texts.
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

padded_seq = pad_sequences(sequences)
print("Padded Seq :  \n", padded_seq)


{'<OOV>': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]
Padded Seq :  
 [[ 0  0  0  0  3  5  2]
 [ 0  0  0  2  4  6  7]
 [ 0  0  0  0  3  4  8]
 [ 9 10 11 12 13 14  2]]


- The sentences are now a matrix: each row represents one sentence, and all rows have the same length.
- The padded length equals the length of the longest sentence in the training data. Here the 4th sentence is the longest, with 7 words, so all other sentences are padded with zeros up to 7 entries.
- In other words, pad_sequences takes the longest length and pads every other sentence to that length.
- Note that the zeros are added at the beginning of each sentence. This is called pre-padding.
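The default behaviour described in these points can be written out in a few lines of plain Python. This is a minimal sketch, not the Keras function: `pad_pre` is a hypothetical helper that returns nested lists rather than a NumPy array, and the sequences are copied from the output above.

```python
def pad_pre(sequences, value=0):
    """Hypothetical helper: left-pad every sequence to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]


# Sequences copied from the printed output above.
sequences = [[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]

padded = pad_pre(sequences)
for row in padded:
    print(row)
```

Index 0 is free to act as the padding value because the word indices start at 1.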

We can also pad at the end by passing padding = 'post' to the pad_sequences function, like below.


#### Post Padding, Maxlen and truncating

In [16]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = [
    'Tensorflow NLP Tokenizer', 
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

# The Tokenizer constructor accepts an out-of-vocabulary token (oov_token); any word outside the vocabulary is replaced with that token.
tokenizer = Tokenizer(num_words = 20, oov_token='<OOV>')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes the raw sentences and encodes them with the vocabulary learned by fit_on_texts.
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

padded_seq = pad_sequences(sequences, padding= 'post', maxlen=6)
print("Padded Seq :  \n", padded_seq)


{'<OOV>': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]
Padded Seq :  
 [[ 3  5  2  0  0  0]
 [ 2  4  6  7  0  0]
 [ 3  4  8  0  0  0]
 [10 11 12 13 14  2]]


- The sentences are now padded at the end.
- maxlen controls the length of the sequences. The 4th sentence now has only 6 words, and the word at the beginning has been dropped (by default truncation happens at the front, the same side as the default padding). Using maxlen therefore loses data, so choose it to suit your sentences.

We can also choose which end loses data when maxlen is smaller than a sentence. To choose, use the truncating parameter, like below.


In [17]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = [
    'Tensorflow NLP Tokenizer', 
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

# The Tokenizer constructor accepts an out-of-vocabulary token (oov_token); any word outside the vocabulary is replaced with that token.
tokenizer = Tokenizer(num_words = 20, oov_token='<OOV>')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes the raw sentences and encodes them with the vocabulary learned by fit_on_texts.
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

padded_seq = pad_sequences(sequences, padding= 'post', truncating='post', maxlen=6)
print("Padded Seq :  \n", padded_seq)


{'<OOV>': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]
Padded Seq :  
 [[ 3  5  2  0  0  0]
 [ 2  4  6  7  0  0]
 [ 3  4  8  0  0  0]
 [ 9 10 11 12 13 14]]


- The last word of the 4th sentence, 'tokenizer' (index 2), is now the one that is dropped.

Let's apply padding to the test data as well.
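The difference between the two truncating modes comes down to which slice of the sequence is kept. The sketch below illustrates this in plain Python; `truncate` is a hypothetical helper, not part of Keras, and the sequence is the 4th sentence from the output above.

```python
def truncate(seq, maxlen, truncating="pre"):
    """Hypothetical helper: cut a sequence down to maxlen from the chosen end."""
    if len(seq) <= maxlen:
        return list(seq)                      # short enough, nothing to drop
    if truncating == "pre":
        return list(seq[len(seq) - maxlen:])  # drop words from the beginning
    return list(seq[:maxlen])                 # 'post': drop words from the end


# 4th sentence from the printed output above.
longest = [9, 10, 11, 12, 13, 14, 2]

kept_pre = truncate(longest, maxlen=6, truncating="pre")
kept_post = truncate(longest, maxlen=6, truncating="post")
print(kept_pre)   # first word removed
print(kept_post)  # last word removed
```

Which end to cut depends on where the informative words tend to sit in your sentences.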

In [22]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = [
    'Tensorflow NLP Tokenizer', 
    'tokenizer! is really cool',
    'Tensorflow is awesome!',
    'spliting sentences in to words using tokenizer'
]

# The Tokenizer constructor accepts an out-of-vocabulary token (oov_token); any word outside the vocabulary is replaced with that token.
tokenizer = Tokenizer(num_words = 20, oov_token='<OOV>')
tokenizer.fit_on_texts(words)
word_index = tokenizer.word_index

# texts_to_sequences takes the raw sentences and encodes them with the vocabulary learned by fit_on_texts.
sequences = tokenizer.texts_to_sequences(words)

print(word_index)
print(sequences)

padded_seq = pad_sequences(sequences, padding= 'post', truncating='post', maxlen=6)
print("\nPadded Seq :  \n", padded_seq)

test_data = [
             'NLP Tokenizer became really powerful',
             'Tensorflow tokenizer at work'
]

# Do not fit on the test data; only transform it with the already fitted tokenizer.
seq = tokenizer.texts_to_sequences(test_data)
padded_test_seq = pad_sequences(seq, padding= 'post', truncating='post', maxlen=6)
print("\nTest Seq: ", seq)
print("\nPadded test seq : \n", padded_test_seq)


{'<OOV>': 1, 'tokenizer': 2, 'tensorflow': 3, 'is': 4, 'nlp': 5, 'really': 6, 'cool': 7, 'awesome': 8, 'spliting': 9, 'sentences': 10, 'in': 11, 'to': 12, 'words': 13, 'using': 14}
[[3, 5, 2], [2, 4, 6, 7], [3, 4, 8], [9, 10, 11, 12, 13, 14, 2]]

Padded Seq :  
 [[ 3  5  2  0  0  0]
 [ 2  4  6  7  0  0]
 [ 3  4  8  0  0  0]
 [ 9 10 11 12 13 14]]

Test Seq:  [[5, 2, 1, 6, 1], [3, 2, 1, 1]]

Padded test seq : 
 [[5 2 1 6 1 0]
 [3 2 1 1 0 0]]
