# Content
This notebook is the second part of  Data Preparation in NLP using Keras.   
  
 In this notebook we will first repeat the steps from the previous notebook.<br>
 Then we will:<br>
 1\. Create test texts (test corpus)<br>
 2\. Use the **`texts_to_sequences`** method on the test texts (test corpus)<br>
 3\. Create a tokenizer with the `oov_token` and `num_words` parameters and repeat the steps.<br>

In [2]:
from tensorflow.keras.preprocessing.text import Tokenizer
train_texts = ["You take the blue-pill!",  "The story ends."]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)

We now have a tokenizer whose `word_index`, `index_word` and `word_counts` dictionaries are filled based on the training corpus.<br>
Now we can convert any texts to index sequences using the `texts_to_sequences` method.
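To make concrete how these dictionaries get filled, here is a minimal pure-Python sketch of the counting and indexing step. It is an approximation written for illustration, not Keras's actual code: the helper names `simple_tokenize` and `build_word_index` are hypothetical, and `FILTERS` is assumed to match the Tokenizer's default `filters` string.

```python
from collections import Counter

# Assumption: same characters as the Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Return (word_counts, word_index) for a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    # most_common() sorts by count (descending); ties keep first-seen order.
    ordered_words = [word for word, _ in counts.most_common()]
    # Indices start at 1, so the most frequent word gets index 1.
    word_index = {word: i for i, word in enumerate(ordered_words, start=1)}
    return dict(counts), word_index


train_texts = ["You take the blue-pill!", "The story ends."]
word_counts, word_index = build_word_index(train_texts)
index_word = {i: word for word, i in word_index.items()}

print(word_counts)
print(word_index)
# {'the': 1, 'you': 2, 'take': 3, 'blue': 4, 'pill': 5, 'story': 6, 'ends': 7}
```

Note how the hyphen in "blue-pill" is treated as a separator, so "blue" and "pill" become two separate words.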

# **1. Create test texts (test corpus)**

Let's create a test corpus as we created the training corpus previously.

In [3]:
test_texts = ["You take the red pill.",
             "You stay in wonderland."] 

# **2. Use `texts_to_sequences` method on test texts**

* Note that we called the `fit_on_texts` method on the training corpus, and the training corpus did not contain the words **"red"**, **"stay"**, **"in"** and **"wonderland"**.
* Therefore **word_index** does not contain those words, and these unknown words have no index.<br>
Let's recall our word_index dictionary:

In [4]:
tokenizer.word_index

{'the': 1, 'you': 2, 'take': 3, 'blue': 4, 'pill': 5, 'story': 6, 'ends': 7}

Now let's convert the test texts to index sequences using the `texts_to_sequences` method and check the result.

In [5]:
test_sequences = tokenizer.texts_to_sequences(test_texts)
print(test_sequences)

[[2, 3, 1, 5], [2]]



* **As a result**, when we apply `texts_to_sequences` to the **test texts**, the unknown words cannot be represented as indices, hence they are **missing** from the sequences. The image below depicts it.

<img src="./Images/Tokenizer_3.jpg" />
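The dropping of unknown words can be reproduced with a plain dictionary lookup. The snippet below is a simplified sketch, not the library's implementation: `simple_tokenize` and `texts_to_sequences_sketch` are hypothetical helpers, `FILTERS` is assumed to equal the default `filters` string, and `word_index` is copied from the output printed above.

```python
# Assumption: same characters as the Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    return text.lower().translate(table).split()


# word_index as printed by the tokenizer fitted on the training corpus.
word_index = {'the': 1, 'you': 2, 'take': 3, 'blue': 4,
              'pill': 5, 'story': 6, 'ends': 7}


def texts_to_sequences_sketch(texts, word_index):
    """Map each known word to its index; words not in word_index are skipped."""
    return [
        [word_index[word] for word in simple_tokenize(text) if word in word_index]
        for text in texts
    ]


test_texts = ["You take the red pill.", "You stay in wonderland."]
test_sequences = texts_to_sequences_sketch(test_texts, word_index)
print(test_sequences)  # [[2, 3, 1, 5], [2]]
```

The second sentence has four words but its sequence has only one element, so the sequence length no longer reflects the sentence length.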

# **3. Create a tokenizer with  `num_words` and `oov_token`  parameters**

## 3.1 Tokenizer with **`oov_token`** parameter

* Above we have seen that the tokenizer's **word_index** dictionary does not contain the unknown words (words not seen before) in the test corpus, because we fit the tokenizer on the training corpus only. <br>
* When we apply `texts_to_sequences` to the **test texts**, the unknown words are missing from the resulting index sequences.


What if we want to represent unknown words when converting test texts to sequences?<br>
We can do this by passing the **oov_token parameter** when creating the tokenizer; it defines the token that stands for an **out-of-vocabulary** word.<br>
Using **oov_token**:
* we initialize the tokenizer's **word_index** dictionary with a special token for unknown (out-of-vocabulary) words.
* Then we can represent those unknown words when converting test texts to integer sequences.<br>So they won't be missing; instead they will all be represented by the same integer.

Let's repeat the steps we have done earlier, but this time with the oov_token parameter: <br> 
* create a tokenizer (this time using the **oov_token** parameter; let's represent out-of-vocabulary words as **\<OOV\>**)
* fit on train_texts
* convert texts to sequences

In [5]:
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

Let's print **word_index**, **test_texts** and **test_sequences**. <br>
Note: We can skip converting the train texts to train_sequences (applying the `texts_to_sequences` method to them). No word of the training corpus can be out of vocabulary, since the tokenizer is fit on it; the only difference is that every index shifts by one, because `<OOV>` takes index 1.<br>
In short, using **oov_token** here changes how texts containing unseen words are converted to sequences.

In [6]:
print(tokenizer.word_index)
print(test_texts)
print(test_sequences)

{'<OOV>': 1, 'the': 2, 'you': 3, 'take': 4, 'blue': 5, 'pill': 6, 'story': 7, 'ends': 8}
['You take the red pill.', 'You stay in wonderland.']
[[3, 4, 2, 1, 6], [3, 1, 1, 1]]


As a result, we can see that the words **"red"**, **"stay"**, **"in"** and **"wonderland"** are no longer ignored; they are all represented by index 1.

<img src="./Images/Tokenizer_4.jpg" />
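The effect of `oov_token` amounts to a dictionary lookup with a default value. The following is an illustrative sketch only: `simple_tokenize` is a hypothetical helper, `FILTERS` is assumed to equal the default `filters` string, and `word_index` is copied from the printed output above. It also decodes the sequences back to words through an inverted dictionary, which plays the role of `index_word`.

```python
# Assumption: same characters as the Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    return text.lower().translate(table).split()


# word_index as printed above: the OOV token occupies index 1.
word_index = {'<OOV>': 1, 'the': 2, 'you': 3, 'take': 4,
              'blue': 5, 'pill': 6, 'story': 7, 'ends': 8}
oov_index = word_index['<OOV>']

test_texts = ["You take the red pill.", "You stay in wonderland."]

# Unknown words fall back to the OOV index instead of being dropped.
test_sequences = [
    [word_index.get(word, oov_index) for word in simple_tokenize(text)]
    for text in test_texts
]
print(test_sequences)  # [[3, 4, 2, 1, 6], [3, 1, 1, 1]]

# Decode back to words with the inverted mapping.
index_word = {i: word for word, i in word_index.items()}
decoded = [" ".join(index_word[i] for i in seq) for seq in test_sequences]
print(decoded)  # ['you take the <OOV> pill', 'you <OOV> <OOV> <OOV>']
```

Each sequence now has the same length as its sentence, but the decoded text shows that the identity of the unknown words is lost.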

## 3.2 Tokenizer with **`num_words`** parameter

Now let's create train_texts again adding a third sentence.

In [6]:
train_texts = ["You take the blue pill.",
             "The story ends...",
             "or you take the red pill."]

* When creating a Tokenizer, the **num_words** parameter limits the vocabulary used by the `texts_to_sequences` method to the most common num_words-1 words; less frequent words are ignored. <br> 
* One thing to note is that less frequent words are still given indices in word_index, but they do not appear in the sequences.<br>
Let's repeat the steps we have done earlier, but this time with num_words: <br> 
  - create a tokenizer (this time using the **num_words** parameter; let's set it to 5, which keeps the top 4 words)
  - fit on train_texts
  - convert texts to sequences

In [7]:
tokenizer = Tokenizer(num_words=5)
tokenizer.fit_on_texts(train_texts)
train_sequences = tokenizer.texts_to_sequences(train_texts)

Let's print **word_counts** first.

In [8]:
tokenizer.word_counts

OrderedDict([('you', 2),
             ('take', 2),
             ('the', 3),
             ('blue', 1),
             ('pill', 2),
             ('story', 1),
             ('ends', 1),
             ('or', 1),
             ('red', 1)])

Now let's check **word_index**, **train_texts** and **train_sequences**. 

In [9]:
print(tokenizer.word_index)
print(train_texts)
print(train_sequences)

{'the': 1, 'you': 2, 'take': 3, 'pill': 4, 'blue': 5, 'story': 6, 'ends': 7, 'or': 8, 'red': 9}
['You take the blue pill.', 'The story ends...', 'or you take the red pill.']
[[2, 3, 1, 4], [1], [2, 3, 1, 4]]


<img src="./Images/Tokenizer_5.jpg"/>

Above:
* We can see that only the top 4 (num_words-1) words appear in the sequences; the other words, although they have entries in word_index, are ignored by the `texts_to_sequences` method. 
* Since their counts are 1, the words **blue** (index 5), **story** (index 6), **ends** (index 7), **or** (index 8) and **red** (index 9) are ignored (see the sequence for the second sentence) and do not appear in the integer sequences. In other words, they are not among the top 4 words.
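The num_words-1 rule follows from a simple index comparison: a word is kept only if its index is strictly smaller than `num_words`. Here is a rough sketch of that filter, assuming this comparison is all that `num_words` changes; `simple_tokenize` and `encode` are hypothetical helpers, `FILTERS` is assumed to equal the default `filters` string, and `word_index` is copied from the output above.

```python
# Assumption: same characters as the Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    return text.lower().translate(table).split()


# The full word_index is kept, even for words that will be filtered out.
word_index = {'the': 1, 'you': 2, 'take': 3, 'pill': 4, 'blue': 5,
              'story': 6, 'ends': 7, 'or': 8, 'red': 9}
num_words = 5


def encode(text):
    """Keep a word only if it is known and its index is below num_words."""
    ids = []
    for word in simple_tokenize(text):
        i = word_index.get(word)
        if i is not None and i < num_words:
            ids.append(i)
    return ids


train_texts = ["You take the blue pill.",
               "The story ends...",
               "or you take the red pill."]
train_sequences = [encode(text) for text in train_texts]
kept_words = [word for word, i in word_index.items() if i < num_words]

print(kept_words)       # ['the', 'you', 'take', 'pill']
print(train_sequences)  # [[2, 3, 1, 4], [1], [2, 3, 1, 4]]
```

Because indices start at 1 and index 0 is never assigned to a word, `num_words=5` leaves room for only four words.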

## 3.3 Tokenizer with **`num_words`**  and  **oov_token** parameters

* Previously in part 3.1, when we used oov_token alone, no word of the training corpus could be mapped to the oov_token, **because we built the vocabulary from ALL of the tokens in the training corpus (we did not use num_words).** (No token was left out of the vocabulary.) <br><br>
* One important thing to pay attention to when **using num_words and oov_token together** is that **the oov_token can appear in the resulting sequences of the training corpus.**<br><br>
* The reason is that **less frequent words (those whose index is num_words or higher) are treated as out of vocabulary even though they are in the training corpus.** <br><br>
* As a result, all such less frequent words still occupy a position in the resulting sequences, but they are all represented by the oov_token index.

Now let's increase num_words to 6. Because `<OOV>` takes index 1, this keeps indices 1 to 5 in the sequences: the oov_token plus the 4 most common words.

In [11]:
tokenizer = Tokenizer(num_words=6, oov_token="<OOV>")
tokenizer.fit_on_texts(train_texts)
train_sequences = tokenizer.texts_to_sequences(train_texts)

Let's print **word_counts** first. Here the count of oov_token is not shown.

In [12]:
tokenizer.word_counts

OrderedDict([('you', 2),
             ('take', 2),
             ('the', 3),
             ('blue', 1),
             ('pill', 2),
             ('story', 1),
             ('ends', 1),
             ('or', 1),
             ('red', 1)])

Although it does not appear in tokenizer.word_counts, we can see below that the most common token in train_sequences is the oov_token, represented by index 1.<br>


In [13]:
print(tokenizer.word_index)
print(train_texts)
print(train_sequences)

{'<OOV>': 1, 'the': 2, 'you': 3, 'take': 4, 'pill': 5, 'blue': 6, 'story': 7, 'ends': 8, 'or': 9, 'red': 10}
['You take the blue pill.', 'The story ends...', 'or you take the red pill.']
[[3, 4, 2, 1, 5], [2, 1, 1], [1, 3, 4, 2, 1, 5]]


We can summarize the process in the image given below.

<img src = "./Images/Tokenizer_6.jpg" />
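The combined behaviour of `num_words` and `oov_token` can be summarized in one small function. This is a hedged sketch that reproduces the outputs shown in this notebook, not the Keras source: `simple_tokenize` and `texts_to_sequences_sketch` are hypothetical names, `FILTERS` is assumed to equal the default `filters` string, and `word_index` is copied from the printed output above.

```python
from collections import Counter

# Assumption: same characters as the Tokenizer's default `filters` argument.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    return text.lower().translate(table).split()


def texts_to_sequences_sketch(texts, word_index, num_words=None, oov_token=None):
    """Encode texts; out-of-range or unknown words become OOV or are dropped."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in simple_tokenize(text):
            i = word_index.get(word)
            in_vocab = i is not None and (num_words is None or i < num_words)
            if in_vocab:
                seq.append(i)
            elif oov_index is not None:
                seq.append(oov_index)  # replaced by the OOV index
            # otherwise the word is silently dropped
        sequences.append(seq)
    return sequences


word_index = {'<OOV>': 1, 'the': 2, 'you': 3, 'take': 4, 'pill': 5, 'blue': 6,
              'story': 7, 'ends': 8, 'or': 9, 'red': 10}
train_texts = ["You take the blue pill.",
               "The story ends...",
               "or you take the red pill."]

train_sequences = texts_to_sequences_sketch(
    train_texts, word_index, num_words=6, oov_token="<OOV>")
print(train_sequences)  # [[3, 4, 2, 1, 5], [2, 1, 1], [1, 3, 4, 2, 1, 5]]

# Count how often each index occurs across all sequences.
index_counts = Counter(i for seq in train_sequences for i in seq)
print(index_counts.most_common(1))  # [(1, 5)]
```

The OOV index occurs five times, more often than any real word, which is worth keeping in mind when choosing `num_words` for a small corpus.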