### OOV (Out Of Vocabulary) Problem

The OOV problem occurs in Natural Language Processing when words that appear at test time are not part of the vocabulary built from the training data.

### Simple Example

1. Training Data:

"I love data science"

"Machine learning is amazing"

Vocabulary learned during training: I, love, data, science, Machine, learning, is, amazing.

2. Test Sentence:

"I love artificial intelligence"

Here

1. artificial

2. intelligence

These two words never occur in the training data, so they are out of vocabulary.
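The check described above can be sketched in plain Python as a set lookup. This is only an illustration that assumes lowercasing and whitespace splitting; the helper name `build_vocabulary` is hypothetical and is not part of Keras.

```python
# Illustrative sketch: plain whitespace tokenization, not the Keras Tokenizer.
train_sentences = ["I love data science", "Machine learning is amazing"]
test_sentence = "I love artificial intelligence"


def build_vocabulary(sentences):
    """Collect the set of lowercased words seen in the training sentences."""
    return {word for sentence in sentences for word in sentence.lower().split()}


vocabulary = build_vocabulary(train_sentences)

# A test word is OOV if it never appeared in the training vocabulary.
oov_words = [word for word in test_sentence.lower().split() if word not in vocabulary]

print(oov_words)  # ['artificial', 'intelligence']
```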


### Handling OOV Problem in NLP

#### Steps used in this Algorithm:

1.   Import all the necessary libraries

2.   Define the Training Corpus (Known Words)

3.   Create Tokenizer WITHOUT OOV Handling

4.   Check Word Index

5.   Test the Sentence with Unknown Word

#### Let's Solve OOV Problem

6.    Create Tokenizer WITH OOV Handling

7.    Check Word Index Again

8.    Convert Test Sentence Again

9.    Perform the padding on the Sequences

### Step 1: Import all the necessary libraries

In [83]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

### OBSERVATIONS:

1.  tensorflow ----------------->  Deep Learning Framework

2.  Tokenizer  ----------------->  splits text into word tokens and maps each word to an integer index

3.  sequence   ----------------->  the Keras preprocessing module that provides sequence utilities such as pad_sequences

4.  pad_sequences -------------->  pads (or truncates) the input sequences so that they all have the same length.


### Step 2: Define the Training Corpus (Known Words)

In [84]:
train_corpus = [
    "I love data science",
    "Machine learning is amazing",
    "Deep learning is powerful"
]


### OBSERVATIONS:

1. This training corpus has three sentences.

2.  The tokenizer will learn its vocabulary from this corpus.

### Step 3: Create Tokenizer WITHOUT OOV Handling

In [85]:
# Create a Tokenizer without an OOV token

tokenizer_no_oov = Tokenizer()

# Build the vocabulary (word -> integer index) from the training corpus

tokenizer_no_oov.fit_on_texts(train_corpus)

### OBSERVATIONS:

1. The Tokenizer object is created without an OOV token, so any word outside the vocabulary is silently dropped when text is converted, and the model never sees it.

2. fit_on_texts is then applied to the training corpus. It only builds the vocabulary dictionary (word_index); it does not transform the text.

3. Every word in the corpus is assigned an integer index.

### Step 4:  Check Word Index


In [86]:
tokenizer_no_oov.word_index

{'learning': 1,
 'is': 2,
 'i': 3,
 'love': 4,
 'data': 5,
 'science': 6,
 'machine': 7,
 'amazing': 8,
 'deep': 9,
 'powerful': 10}

### OBSERVATIONS:

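1.  Every word in the training corpus is assigned an integer index. Words are lowercased, more frequent words get smaller indices (learning and is each occur twice), and index 0 is never assigned to a word.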
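The ordering in this word index can be reproduced with a small frequency count. The sketch below is a simplified stand-in, not the Keras implementation: the helper `build_word_index` is hypothetical, and it only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation.

```python
from collections import Counter

train_corpus = [
    "I love data science",
    "Machine learning is amazing",
    "Deep learning is powerful",
]


def build_word_index(texts):
    """Rank words by descending frequency and number them starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so equally frequent words keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Index 0 is left unused, matching the word index shown above.
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


word_index = build_word_index(train_corpus)
print(word_index)
```

Under these assumptions the result matches the dictionary printed in Step 4, with learning and is first because each occurs twice.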

### Step 5: Test the Sentence with Unknown Word

In [87]:
test_sentence = ["I love artificial intelligence"]

test_sequences_no_oov = tokenizer_no_oov.texts_to_sequences(test_sentence)

print(test_sequences_no_oov)

[[3, 4]]


### OBSERVATIONS:

1. texts_to_sequences   -------> It converts text into sequences of integers using the vocabulary learned by fit_on_texts.

2. Applying texts_to_sequences to the test sentence shows that

    (a.) "artificial" and "intelligence" are not present in the training data.

    (b.) These unknown words disappear from the output, so the model loses this information.

    (c.) This is the OOV problem.
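The dropping behaviour can be imitated with a dictionary lookup that skips unknown words. This is a minimal sketch, assuming whitespace tokenization and the word index from Step 4; the function `encode_dropping_unknown` is a hypothetical helper, not a Keras call.

```python
# Word index copied from Step 4 (tokenizer without an OOV token).
word_index = {
    'learning': 1, 'is': 2, 'i': 3, 'love': 4, 'data': 5,
    'science': 6, 'machine': 7, 'amazing': 8, 'deep': 9, 'powerful': 10,
}


def encode_dropping_unknown(sentence, word_index):
    """Map known words to their indices and skip words that are not in the index."""
    return [word_index[word] for word in sentence.lower().split() if word in word_index]


sentence = "I love artificial intelligence"
encoded = encode_dropping_unknown(sentence, word_index)

print(encoded)  # [3, 4]
print(len(sentence.split()), "words in,", len(encoded), "indices out")
```

The length mismatch (four words in, two indices out) is what makes the loss easy to miss: nothing fails, the sequence is simply shorter.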

#### Let's Solve OOV Problem

### Step 6:  Create Tokenizer WITH OOV Handling

In [88]:
# Create a Tokenizer with an OOV token that stands in for unknown words
tokenizer_with_oov = Tokenizer(oov_token = "<OOV>")
# Build the vocabulary from the training corpus with the new Tokenizer
tokenizer_with_oov.fit_on_texts(train_corpus)

### OBSERVATIONS:

1. The Tokenizer object is created with an OOV token, so unknown words that occur at test time are mapped to <OOV> instead of being dropped.

2. fit_on_texts is then applied to the training corpus. It only builds the vocabulary dictionary (word_index); it does not transform the text.

3. Every word in the corpus, plus the <OOV> token, is assigned an integer index.

### Step 7:  Check Word Index Again

In [89]:
tokenizer_with_oov.word_index

{'<OOV>': 1,
 'learning': 2,
 'is': 3,
 'i': 4,
 'love': 5,
 'data': 6,
 'science': 7,
 'machine': 8,
 'amazing': 9,
 'deep': 10,
 'powerful': 11}

### OBSERVATIONS:

1.  The <OOV> token takes index 1, and every other word moves up by one compared with Step 4.

### Step 8:  Convert Test Sentence Again

In [90]:
test_sentence = ["I love artificial intelligence"]

test_sequences_with_oov = tokenizer_with_oov.texts_to_sequences(test_sentence)

print(test_sequences_with_oov)

[[4, 5, 1, 1]]


### OBSERVATIONS:

1.  'artificial' and 'intelligence' are the two unknown words that occur during testing but not in the training corpus.

2. These words are now marked as OOV with index 1, as defined in the vocabulary dictionary.

3. The words no longer disappear, so the sequence keeps its original length and word positions. Their identity is still lost, because every unknown word shares the same index.
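A round trip through the word index makes the effect of the OOV token visible. The sketch below assumes whitespace tokenization and the word index from Step 7; `encode` and `decode` are hypothetical helpers written for illustration, not Keras functions.

```python
# Word index copied from Step 7 (tokenizer with an OOV token).
word_index = {
    '<OOV>': 1, 'learning': 2, 'is': 3, 'i': 4, 'love': 5, 'data': 6,
    'science': 7, 'machine': 8, 'amazing': 9, 'deep': 10, 'powerful': 11,
}
index_word = {index: word for word, index in word_index.items()}


def encode(sentence, word_index, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index for unknown words."""
    oov_index = word_index[oov_token]
    return [word_index.get(word, oov_index) for word in sentence.lower().split()]


def decode(indices, index_word):
    """Map indices back to words to see what the model actually receives."""
    return " ".join(index_word[index] for index in indices)


encoded = encode("I love artificial intelligence", word_index)
print(encoded)                      # [4, 5, 1, 1]
print(decode(encoded, index_word))  # i love <OOV> <OOV>
```

Decoding shows that the positions survive while the two unknown words become indistinguishable from each other.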

### Step 9:  Perform the padding on the Sequences

In [91]:
padding_sequence = pad_sequences(test_sequences_with_oov, maxlen=6, padding='post')

In [92]:
padding_sequence

array([[4, 5, 1, 1, 0, 0]], dtype=int32)

### OBSERVATIONS:

1. Padding is applied to the encoded test sequence.

2. The padded sequence has length 6, as set by maxlen=6.

3. Two zeros are added at the end of the sequence because padding='post'.
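What post padding does can be sketched without TensorFlow. The helper `pad_post` below is a hypothetical, simplified stand-in for pad_sequences with padding='post': it returns plain lists rather than a NumPy array, and it drops items from the front of over-long sequences, which is assumed here to mirror the default truncating='pre'.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value` up to `maxlen` items."""
    padded = []
    for sequence in sequences:
        items = list(sequence)
        if len(items) > maxlen:
            # Keep only the last `maxlen` items (truncate from the front).
            items = items[len(items) - maxlen:]
        padded.append(items + [value] * (maxlen - len(items)))
    return padded


print(pad_post([[4, 5, 1, 1]], maxlen=6))              # [[4, 5, 1, 1, 0, 0]]
print(pad_post([[1, 2, 3, 4, 5, 6, 7]], maxlen=6))     # [[2, 3, 4, 5, 6, 7]]
```

The second call shows the other half of the contract: sequences longer than maxlen are cut so that every row ends up with the same length.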