# Content
This notebook series explains the basics of data preparation steps in NLP, including tokenization using the **tensorflow.keras.preprocessing.text** module and padding using the **tensorflow.keras.utils** module.
  
 In this notebook we will:<br>
 1\. Create training texts (training corpus)<br>
 2\. Create a tokenizer <br>
 3\. Use the **`fit_on_texts`** method: fit the tokenizer on the texts to split them into tokens and assign each token an index.<br>
 4\. Use the **`texts_to_sequences`** method on the training texts (training corpus)<br>

### **1. Create training texts (training corpus)**

  In this series, our corpus will be a list of sentences (strings). <br> In the projects we will see, training and test corpora will contain thousands of sentences. <br>
  For simplicity and demonstration purposes, here we create a training corpus consisting of two sentences.

In [2]:
train_texts = ["You take the blue-pill!",
               "The story ends."]

### **2. Create a tokenizer**

* Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be words, characters, or subwords[[1]](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/).<br>
* However, the term tokenization is usually used in a sense that also includes the later step, which is the conversion of texts to integer (index) sequences.<br>
* In the later courses, we will see that other libraries' tokenizers have methods like *tokenize* which directly give index representations (integer sequences) of the text. <br>
Here, we will do this process in two steps:
1. In the first step, by using the **fit_on_texts** method of the tokenizer, we will split the texts into pieces (words) and fill the **word_index** dictionary, which tells us which word is represented by which index.
2. In the second step, by using the **word_index** dictionary of the tokenizer, we will obtain index representations (integer sequences) of the text.

For example, let's consider the sentence **"Faster but inattentive"**.
This sentence can be tokenized in three ways: <br>
1. Word level as **faster** - **but** - **inattentive** <br>
2. Sub-word level as **fast** - **er** - **but** - **in** - **attent** - **ive** <br>
3. Character level as  **f** - **a** - **s** - **t** - **e** - **r** - **b** - **u** - **t** - **i** - **n** - **a** - **t** - **t** - **e** - **n** - **t** - **i** - **v** - **e** <br>
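The three levels above can be sketched in plain Python. This is only an illustration: the helper names and the tiny sub-word vocabulary are hypothetical, and real sub-word tokenizers learn their vocabulary from data instead of using a hand-written set.

```python
def word_tokens(text):
    """Split on whitespace after lowercasing (word level)."""
    return text.lower().split()


def char_tokens(text):
    """One token per non-space character (character level)."""
    return [ch for ch in text.lower() if not ch.isspace()]


def subword_tokens(text, vocab):
    """Greedy longest-match split of each word into pieces from `vocab`."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest candidate piece first.
            for end in range(len(word), start, -1):
                if word[start:end] in vocab:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                # No known piece starts here: fall back to a single character.
                tokens.append(word[start])
                start += 1
    return tokens


sentence = "Faster but inattentive"
toy_vocab = {"fast", "er", "but", "in", "attent", "ive"}  # hypothetical vocabulary

print(word_tokens(sentence))
print(subword_tokens(sentence, toy_vocab))
print(char_tokens(sentence))
```

With this toy vocabulary the sub-word split matches the pieces listed above; a different vocabulary would produce a different split.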

If we don't specify the tokenizer level, the default is a **word-level** tokenizer, which we will use in this notebook. Now let's create the tokenizer.<br>
First, we have to import the **Tokenizer** class from the **tensorflow.keras.preprocessing.text** module.

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()

The tokenizer object has three dictionaries, which are currently empty:<br>
* **word_index** will map words to indices<br>
* **index_word** will map indices to words (the reverse mapping of word_index)<br>
* **word_counts** will map words to counts

**word_index** and **index_word** are regular dictionaries, whereas **word_counts** is a special type of dictionary: an ordered dictionary (`OrderedDict`), which keeps pairs in the order they were added.
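As a small aside, the standard-library `OrderedDict` can be compared with a plain `dict` as sketched below. Since Python 3.7 plain dictionaries also preserve insertion order, so the practical differences are mainly order-sensitive equality and helpers such as `move_to_end`. The variable names are illustrative only.

```python
from collections import OrderedDict

plain_a = {"you": 1, "the": 2}
plain_b = {"the": 2, "you": 1}

ordered_a = OrderedDict([("you", 1), ("the", 2)])
ordered_b = OrderedDict([("the", 2), ("you", 1)])

# Plain dicts ignore order when compared; OrderedDicts do not.
print(plain_a == plain_b)      # True
print(ordered_a == ordered_b)  # False

# OrderedDict also offers order-specific helpers.
ordered_a.move_to_end("you")
print(list(ordered_a))         # ['the', 'you']
```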


The tokenizer will generate the contents of these dictionaries based on the corpus.<br>
For now, let's confirm that they are empty.

In [4]:
tokenizer.word_index, tokenizer.index_word, tokenizer.word_counts

({}, {}, OrderedDict())

### **3. Use `fit_on_texts` method on training texts**
* Now we have a tokenizer and a corpus, so we can fit the tokenizer on the **training texts** (corpus) using the **`fit_on_texts`** method, which splits the texts into words and assigns each word an index.<br> In short, the **`fit_on_texts`** method performs the tokenization on the corpus.



In [5]:
tokenizer.fit_on_texts(train_texts)
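Before counting words, each text is normalized. A rough pure-Python sketch of that step is shown here, assuming the default settings (lowercasing, splitting on spaces, and the default filter characters). `normalize` is a hypothetical helper written for this illustration, not part of the Keras API.

```python
# Characters the Tokenizer filters out by default (its `filters` argument).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def normalize(text, filters=DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, then split."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()


train_texts = ["You take the blue-pill!", "The story ends."]
for text in train_texts:
    print(normalize(text))
```

Because the hyphen is among the filtered characters, "blue-pill" comes out as two separate words.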

* Using the **`fit_on_texts`** method, we filled the tokenizer's three dictionaries with the tokens, their corresponding indices, and their counts for the given corpus. <br>
Now let's check these dictionaries.

#### 3.1. **word_index** <br>

* word_index maps words to indices. Below we can see that, as a result of the tokenization, words are converted to lower case and punctuation is removed. The hyphen in **blue-pill** is removed as well, so it yields two tokens.
Note that **The** and **the** are represented as the same token.

In [6]:
print(tokenizer.word_index)

{'the': 1, 'you': 2, 'take': 3, 'blue': 4, 'pill': 5, 'story': 6, 'ends': 7}


*  Note above that words are indexed according to their frequency: index 1 is given to the most common word, **"the"**. <br>
* Also note that the pairs in this dictionary are ordered by their values (the indices), not by their keys (the words).
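The frequency-based indexing can be reproduced with a short sketch. It assumes the texts are already split into lowercase words, and it relies on Python's stable sort so that words with equal counts keep the order in which they were first seen. The variable names mirror the tokenizer's attributes, but this is a simplified illustration and not the library code.

```python
from collections import OrderedDict

tokenized = [["you", "take", "the", "blue", "pill"],
             ["the", "story", "ends"]]

# Count words in the order they are first encountered.
word_counts = OrderedDict()
for sentence in tokenized:
    for word in sentence:
        word_counts[word] = word_counts.get(word, 0) + 1

# Sort by count, most frequent first; ties keep first-seen order (stable sort).
ranked = sorted(word_counts, key=word_counts.get, reverse=True)

# Index 0 is left unused, so indices start at 1.
word_index = {word: i for i, word in enumerate(ranked, start=1)}
print(word_index)
```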

#### 3.2. **index_word** <br>

* Now let's check the index_word dictionary, which is the reverse of the word_index dictionary.
* Note below that the pairs in this dictionary are ordered by keys, not values.

In [7]:
print(tokenizer.index_word)

{1: 'the', 2: 'you', 3: 'take', 4: 'blue', 5: 'pill', 6: 'story', 7: 'ends'}
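Since `index_word` is simply the inverse of `word_index`, it can be derived with a dictionary comprehension. The snippet is a sketch that starts from the `word_index` contents printed earlier, not from a live tokenizer.

```python
word_index = {'the': 1, 'you': 2, 'take': 3, 'blue': 4,
              'pill': 5, 'story': 6, 'ends': 7}

# Swap keys and values to go from indices back to words.
index_word = {index: word for word, index in word_index.items()}
print(index_word)

# Decode a sequence of indices back into words.
decoded = " ".join(index_word[i] for i in [2, 3, 1, 4, 5])
print(decoded)
```

Decoding recovers the normalized words, not the original string: capitalization and punctuation are not stored in the mapping.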


#### 3.3. **word_counts** <br>
* Finally, we can check the tokenizer's third dictionary, **word_counts**.<br>
This is an OrderedDict, so the words appear in the order in which they were first encountered in the corpus.

In [8]:
print(tokenizer.word_counts)

OrderedDict([('you', 1), ('take', 1), ('the', 2), ('blue', 1), ('pill', 1), ('story', 1), ('ends', 1)])


We can depict these steps as in the image below.

<img src="./Images/Tokenizer_1.jpg"/>

### **4. Use `texts_to_sequences` method on training texts**

Convert the texts to sequences of token indices (a list of integer lists) using the `texts_to_sequences` method. The conversion from string to int is based on the word_index dictionary.

In [10]:
train_sequences = tokenizer.texts_to_sequences(train_texts)
print(train_sequences)

[[2, 3, 1, 4, 5], [1, 6, 7]]


Let's print the **sentences** and the **word_index** dictionary to see how the texts are converted to sequences of indices based on **word_index**.

In [11]:
print(tokenizer.word_index)
print(train_texts)
print(train_sequences)

{'the': 1, 'you': 2, 'take': 3, 'blue': 4, 'pill': 5, 'story': 6, 'ends': 7}
['You take the blue-pill!', 'The story ends.']
[[2, 3, 1, 4, 5], [1, 6, 7]]
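The lookup performed by `texts_to_sequences` can be approximated in a few lines. The sketch assumes already-normalized words and a tokenizer created without an `oov_token`, in which case words missing from `word_index` are skipped. `to_sequence` is a hypothetical helper for illustration.

```python
word_index = {'the': 1, 'you': 2, 'take': 3, 'blue': 4,
              'pill': 5, 'story': 6, 'ends': 7}


def to_sequence(words, word_index):
    """Map each known word to its index; unknown words are dropped."""
    return [word_index[w] for w in words if w in word_index]


train_words = [["you", "take", "the", "blue", "pill"],
               ["the", "story", "ends"]]
train_sequences = [to_sequence(words, word_index) for words in train_words]
print(train_sequences)

# A word never seen during fitting ("red") is simply left out.
print(to_sequence(["you", "take", "the", "red", "pill"], word_index))
```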


We can summarize the steps and describe the `texts_to_sequences` method as below.

<img src="./Images/Tokenizer_2.jpg" />

References<br>
[1] https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/