# NLP (Natural Language Processing)

### 1. Text to Numeric

* When dealing with text, it has to be encoded so that it can be easily processed by a neural network.

* To encode the words, we could use their ASCII (American Standard Code for Information Interchange) values. 
![ASCII Table](https://www.johndcook.com/ascii.png)

* However, ASCII values tell us little about the meaning of a word.
* Ex: The two words below contain the same letters, and therefore the same set of ASCII values, yet each word has a completely different meaning.
* Therefore, extracting meaning from words using ASCII values alone is a daunting task.
![example 1](https://miro.medium.com/v2/resize:fit:640/format:webp/1*cUNtGgZxxyNIEtx15T1Ubw.png)
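
As a quick check of the point above, the sketch below compares the ASCII codes of two anagrams. The words `LISTEN` and `SILENT` are assumed here for illustration and may not match the words shown in the figure.

```python
# Assumed example words (anagrams of each other); not taken from the figure.
word_a = "LISTEN"
word_b = "SILENT"

# ord() returns the ASCII code of a single character.
codes_a = [ord(ch) for ch in word_a]
codes_b = [ord(ch) for ch in word_b]

print(codes_a)  # [76, 73, 83, 84, 69, 78]
print(codes_b)  # [83, 73, 76, 69, 78, 84]

# Same codes in a different order: the codes alone say nothing about meaning.
print(sorted(codes_a) == sorted(codes_b))  # True
```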

* Next, instead of labelling each letter with a number (i.e. ASCII values), we can label each word.
* Ex: In the below sentences, we have labelled each word with a number. 
![example](https://miro.medium.com/v2/resize:fit:640/format:webp/1*uVqOzeZd4q8fareUtkZkPQ.png)

* When we view only the labels, a pattern becomes visible.
![example](https://miro.medium.com/v2/resize:fit:640/format:webp/1*vY3z3R5e12fB-p-16X0e9w.png)

* Now we can see the similarity between the sentences.
* From here, we can begin to train a neural network that can learn the meanings of the sentences.
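
The word-level labelling described above can be sketched in plain Python. This is only an illustration: the two sentences are assumed (the figures may use different ones), and labels are handed out in order of first appearance rather than by any library's rule.

```python
# Assumed example sentences; labels are assigned in order of first appearance.
sentences = ["I love my dog", "I love my cat"]

labels = {}   # word -> integer label
encoded = []  # one list of labels per sentence

for sentence in sentences:
    row = []
    for word in sentence.split():
        # Give each new word the next unused label (starting at 1).
        if word not in labels:
            labels[word] = len(labels) + 1
        row.append(labels[word])
    encoded.append(row)

print(labels)   # {'I': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
print(encoded)  # [[1, 2, 3, 4], [1, 2, 3, 5]]
```

Only the last label differs between the two encoded sentences, which is the similarity a network can pick up on.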

* When coding, we can label each word and build a dictionary of the words used in the sentences with the **Tokenizer**.
* We create an instance of the tokenizer and set the hyperparameter **num_words** to 100.
* This limits encoding to the most frequent words (Keras keeps the top num_words - 1). Rarer words still appear in the word index but are dropped when sentences are converted to sequences.

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
sentences = [
    'I Love my dog',
    'I love my cat'
]

In [4]:
tokenizer = Tokenizer(num_words = 100)

tokenizer.fit_on_texts(sentences)  # fit_on_texts() builds the vocabulary (word -> label) from the sentences.
word_index = tokenizer.word_index
print(word_index)

{'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}


* The **word_index** attribute holds a dictionary of key-value pairs, where the key is a word from the sentences and the value is the label assigned to it.
* Notice that **‘I’** has been replaced by **‘i’**, and both **'Love'** and **'love'** have been replaced by **'love'**.
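
To make the labelling rule concrete, here is a rough pure-Python sketch that reproduces the dictionary above. `build_word_index` is a hypothetical helper written for this note, not part of Keras; it assumes labels are assigned by descending word frequency, with ties kept in first-seen order.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: rank lower-cased words by frequency, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sentences = ['I Love my dog', 'I love my cat']
print(build_word_index(sentences))
# {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5}
```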

In [5]:
sentences = [
    'I Love my dog',
    'I love my cat',
    'You love my dog!'
]

In [9]:
tokenizer = Tokenizer(num_words = 100)

tokenizer.fit_on_texts(sentences) 
word_index = tokenizer.word_index
print(word_index)

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}


* The Tokenizer strips punctuation when tokenizing.
* Notice that **‘dog!’** is not treated as a separate word just because it ends with an exclamation mark.
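
A minimal sketch of this clean-up step, assuming punctuation is replaced by spaces before splitting. `simple_tokens` is a hypothetical helper, and `string.punctuation` is a simplification: the Tokenizer's actual default filter list is somewhat different.

```python
import string

def simple_tokens(text):
    """Hypothetical helper: lower-case, turn punctuation into spaces, split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

print(simple_tokens("You love my dog!"))              # ['you', 'love', 'my', 'dog']
print(simple_tokens("dog!") == simple_tokens("dog"))  # True
```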

* Passing a set of sentences to the **‘texts_to_sequences()’** method converts them to their labelled equivalents, based on the corpus of words the tokenizer was fitted on.
* If a sentence contains a word that is missing from the corpus, that word is omitted and the rest of the words are encoded.
* Ex:

In [12]:
test_data = [
    'i really love my dog',
    'my dog loves my parrot'
]
seq_data = tokenizer.texts_to_sequences(test_data)
print(seq_data)

[[3, 1, 2, 4], [2, 4, 2]]


* In the above test_data, the word **‘really’** is missing from the corpus.
* Hence, while encoding, the word **‘really’** is omitted and the encoded sentence corresponds to **‘i love my dog’**.
* Similarly, in the second sentence, the words **‘loves’** and **‘parrot’** are missing from the corpus.
* Hence, the encoded sentence corresponds to **‘my dog my’**.
* To overcome this problem, we can **either use a huge corpus of words** or set the hyperparameter **‘oov_token’** to a value that will be used to encode words not seen in the corpus.

In [15]:
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'I Love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences) 
word_index = tokenizer.word_index
print(word_index)

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}


* Notice that **‘<OOV>’** is now part of the word_index.
* Any word not present in the fitted vocabulary will be replaced by the ‘<OOV>’ encoding.

In [16]:
sequences = tokenizer.texts_to_sequences(sentences)

test_data = [
    'i really love my dog',
    'my dog loves my parrot'
]
seq_data = tokenizer.texts_to_sequences(test_data)
print(seq_data)

[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
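
The difference between dropping unseen words and mapping them to the OOV label can be sketched with a plain dictionary lookup. `encode` is a hypothetical helper for illustration, and the word index is copied from the output printed earlier.

```python
# Word index copied from the printed output above.
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def encode(text, index, oov_token=None):
    """Hypothetical helper: map words to labels, optionally substituting an OOV label."""
    labels = []
    for word in text.lower().split():
        if word in index:
            labels.append(index[word])
        elif oov_token is not None:
            labels.append(index[oov_token])  # unseen word -> OOV label
        # Otherwise the unseen word is silently dropped.
    return labels

test_data = ['i really love my dog', 'my dog loves my parrot']

print([encode(t, word_index) for t in test_data])
# [[5, 3, 2, 4], [2, 4, 2]]
print([encode(t, word_index, oov_token='<OOV>') for t in test_data])
# [[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
```

With the OOV label, each encoded sentence keeps its original length, so the position of the unknown words is preserved.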


* When feeding training data to the neural network, a **uniformity** of the data must be maintained. 
* i.e. all sentences being fed in should have the same length.
* In NLP, while feeding training data in the form of sentences, **padding** is used to provide uniformity in the sentences.

In [23]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I Love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences) 
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)

padded = pad_sequences(sequences)

print("sequences")
print(sequences)
print("\n padded sequences")
print(padded)

sequences
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

 padded sequences
[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]


* As we can see, padding in the form of **‘0’** is added at the beginning of each shorter sentence.
* Padding has been done with reference to the longest sentence.

* If padding is to be added after the sentence, the hyperparameter **padding** can be set to **‘post’**.
* Padding is generally done with reference to the longest sentence; however, the hyperparameter **maxlen** can be provided to override this and define the maximum length of a sentence.
* However, with **maxlen**, information can be lost, because sentences longer than **maxlen** are truncated.
* You can specify where the words are dropped with the hyperparameter **truncating**, which removes words from the beginning of the sentence by default.
* Ex: Setting **truncating** to **‘post’** makes you lose words from the end of the sentence.

In [24]:
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)

print(padded)

[[5 3 2 4 0]
 [5 3 2 7 0]
 [6 3 2 4 0]
 [8 6 9 2 4]]
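
The interplay of **padding**, **truncating**, and **maxlen** can be sketched in pure Python. `pad` is a hypothetical stand-in written for this note (lists in, lists out); it assumes both options default to `'pre'` and that truncation is applied before padding.

```python
def pad(sequences, maxlen=None, padding='pre', truncating='pre', value=0):
    """Hypothetical pure-Python stand-in for pad_sequences."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Truncate first: keep the last maxlen items ('pre') or the first maxlen ('post').
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)
    return padded

sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

print(pad(sequences, maxlen=5, padding='post', truncating='post'))
# [[5, 3, 2, 4, 0], [5, 3, 2, 7, 0], [6, 3, 2, 4, 0], [8, 6, 9, 2, 4]]
print(pad(sequences, maxlen=5))
# [[0, 5, 3, 2, 4], [0, 5, 3, 2, 7], [0, 6, 3, 2, 4], [9, 2, 4, 10, 11]]
```

The second call shows the default behaviour: the longest sentence loses its first two words instead of its last two.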


### 2. Word Embeddings

* Words and associated words are clustered as vectors in a multi-dimensional space. 
* This allows words that appear together in a sentence, which often have similar meanings, to be placed close to each other in the multi-dimensional space.
* Ex: “The movie was **dull** and **boring**.”; “The movie was **fun** and **exciting**.”

* Now imagine we represent each word as a vector in a higher-dimensional space, say 16 dimensions, and words that are found together are given similar vectors.
* Over time, words of similar meaning begin to cluster together.
* The meaning of the words can come from labelling the dataset.

* Taking the example sentences above, the words dull and boring show up a lot in negative reviews and appear close to each other in a sentence, so they share a similar sentiment and their vectors will be similar.
* As the neural network trains, it can learn these vectors and associate them with the labels to come up with something called an embedding,
* i.e. the vectors of each word with their associated sentiment.
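
One common way to express "similar vectors" is cosine similarity. The sketch below uses made-up 2-D vectors purely for illustration; a trained model would learn its own values, typically in more dimensions.

```python
import math

# Made-up 2-D vectors for illustration only; these are NOT learned embeddings.
toy_embedding = {
    'dull':     [-0.9, 0.2],
    'boring':   [-0.8, 0.3],
    'fun':      [0.7, 0.4],
    'exciting': [0.9, 0.3],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

same_sentiment = cosine_similarity(toy_embedding['dull'], toy_embedding['boring'])
opposite_sentiment = cosine_similarity(toy_embedding['dull'], toy_embedding['fun'])

# Words with the same sentiment point in a similar direction; opposite ones do not.
print(same_sentiment > opposite_sentiment)  # True
```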

In [30]:
max_length=120
vocab_size=120
embedding_dim=16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

* While building the neural network, we use the **Embedding layer**, which outputs a 2D array for each sentence: the length of the sentence is one dimension and the embedding dimension (16 in our case) is the other.

* Therefore, we use the **Flatten layer**, just as we did in computer vision problems.
* In CNN-based problems, a 2D array of pixels needed to be flattened before being fed to the dense layers.
* In an NLP-based problem, the 2D array of embeddings needs to be flattened.
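
The shapes involved can be traced with plain lists. This is a sketch only: the embedding table is filled with random numbers as a stand-in for learned weights, and `token_ids` is a hypothetical padded sentence.

```python
import random

max_length = 120
vocab_size = 120
embedding_dim = 16

random.seed(0)
# Embedding table: one vector of length embedding_dim per token id
# (random stand-in for the weights a real layer would learn).
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

# Hypothetical padded sentence: max_length token ids, each in [0, vocab_size).
token_ids = [5, 3, 2, 4] + [0] * (max_length - 4)

embedded = [table[t] for t in token_ids]                 # 120 rows of 16 values
flattened = [x for vector in embedded for x in vector]   # one row of 1920 values

print(len(embedded), len(embedded[0]))  # 120 16
print(len(flattened))                   # 1920
```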

In [31]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 120, 16)           1920      
                                                                 
 flatten_2 (Flatten)         (None, 1920)              0         
                                                                 
 dense_4 (Dense)             (None, 6)                 11526     
                                                                 
 dense_5 (Dense)             (None, 1)                 7         
                                                                 
Total params: 13,453
Trainable params: 13,453
Non-trainable params: 0
_________________________________________________________________
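
The parameter counts in the summary can be checked by hand. The arithmetic below assumes each Dense layer has one weight per input-output pair plus one bias per output unit, and that Flatten has no weights.

```python
vocab_size, max_length, embedding_dim = 120, 120, 16

embedding_params = vocab_size * embedding_dim   # one vector per vocabulary entry
flatten_size = max_length * embedding_dim       # Flatten only reshapes, no weights
dense_1_params = flatten_size * 6 + 6           # weights + biases
dense_2_params = 6 * 1 + 1                      # weights + bias

print(embedding_params, dense_1_params, dense_2_params)    # 1920 11526 7
print(embedding_params + dense_1_params + dense_2_params)  # 13453
```

Note that 1920 appears twice only because `vocab_size` and `max_length` happen to be equal here: the embedding parameter count depends on the vocabulary size, while the flattened size depends on the sentence length.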
