# Natural Language Processing

# Tokenization

Tokenization is the process of splitting text into smaller units (tokens), such as words or sentences, and mapping each token to a number so that a model can work with it.
It is a crucial first step in natural language processing (NLP) and underlies applications such as text classification, sentiment analysis, and machine translation.


In [2]:
# Tokenizer example

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["Today is a sunny day", "Today is a rainy day", "Is it sunny today?"]

tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Print the word index (token dictionary)
print(word_index)

sequences = tokenizer.texts_to_sequences(sentences)

# Print the sequences
print(sequences)

{'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}
[[2, 3, 4, 5, 6], [2, 3, 4, 7, 6], [3, 8, 5, 2]]
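
To make the mapping above less opaque, here is a rough pure-Python sketch of the same idea: count the words, give the most frequent ones the lowest indices, and reserve index 1 for the OOV token. This is not the Keras implementation. The helper names `simple_tokenize`, `build_word_index`, and `encode` are made up for illustration, and stripping all of `string.punctuation` is a simplifying assumption.

```python
import string
from collections import Counter


def simple_tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integers, most frequent first; index 1 is the OOV token."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    word_index = {oov_token: 1}
    # most_common() keeps first-seen order among words with equal counts
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def encode(texts, word_index, oov_token="<OOV>"):
    """Replace each word by its index, falling back to the OOV index."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in simple_tokenize(t)] for t in texts]


sentences = ["Today is a sunny day", "Today is a rainy day", "Is it sunny today?"]
word_index = build_word_index(sentences)
print(word_index)
print(encode(sentences, word_index))
```

On these three sentences the sketch should reproduce the word index and sequences printed above. Index 0 is deliberately left unused so that it stays free for padding.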


# Out-of-vocabulary token

An out-of-vocabulary (OOV) token is a special token used to represent words that are not present in the vocabulary of a language model.

When a word is not present in the vocabulary, it is replaced with the OOV token to ensure that the model can still process the input text.

Add `oov_token="<OOV>"` to the Tokenizer constructor.

This will get you from:
"Today is a snowy day" -> "today is a day"

To:
"Today is a snowy day" -> "today is a <OOV> day"

This preserves context: every word, known or not, keeps its original position in the sentence, and the encoded sequence has the same length as the input.
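
The difference can be sketched in a few lines of plain Python. The `encode_words` helper below is hypothetical and only mimics the behaviour described here; the word index is copied from the earlier output.

```python
def encode_words(words, word_index, oov_token=None):
    """Encode a list of words; unknown words are dropped unless oov_token is set."""
    ids = []
    for word in words:
        if word in word_index:
            ids.append(word_index[word])
        elif oov_token is not None:
            ids.append(word_index[oov_token])
        # otherwise the unknown word is silently skipped
    return ids


word_index = {"<OOV>": 1, "today": 2, "is": 3, "a": 4,
              "sunny": 5, "day": 6, "rainy": 7, "it": 8}
words = "today is a snowy day".split()

without_oov = encode_words(words, word_index)
with_oov = encode_words(words, word_index, oov_token="<OOV>")

print(without_oov)  # [2, 3, 4, 6]
print(with_oov)     # [2, 3, 4, 1, 6]
```

Without the OOV token the sequence shrinks from five entries to four, so "day" moves one position to the left. With it, the gap left by "snowy" is kept as index 1.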


In [8]:
# Out-of-vocabulary token

test_data = ["Today is a snowy day", "Will it be rainy tomorrow?"]

test_sequences = tokenizer.texts_to_sequences(test_data)
print(word_index)
print(test_sequences)

{'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}
[[2, 3, 4, 1, 6], [1, 8, 1, 7, 1]]


# Padding

As the sentences can vary in length and the model expects a fixed-size input, we need to pad the sequences to ensure that they all have the same length.

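Pass `padding="post"` to `pad_sequences` (not to the Tokenizer) to append zeros after each sequence; the default, `"pre"`, puts them in front. `maxlen` sets the target length, and `truncating` controls which end of a longer sequence is cut off.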


In [4]:
from keras.utils import pad_sequences

sentences = [
    "Today is a sunny day",
    "Today is a rainy day",
    "Is it sunny today?",
    "I really enjoyed walking in the snow today",
]

sequences = tokenizer.texts_to_sequences(sentences)
print(word_index)
print(sequences)

# Pad with zeros at the end up to length 6; longer sequences lose their trailing tokens
padded = pad_sequences(sequences, padding="post", maxlen=6, truncating="post")

print(padded)

{'<OOV>': 1, 'today': 2, 'is': 3, 'a': 4, 'sunny': 5, 'day': 6, 'rainy': 7, 'it': 8}
[[2, 3, 4, 5, 6], [2, 3, 4, 7, 6], [3, 8, 5, 2], [1, 1, 1, 1, 1, 1, 1, 2]]
[[2 3 4 5 6 0]
 [2 3 4 7 6 0]
 [3 8 5 2 0 0]
 [1 1 1 1 1 1]]
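
The effect of `padding` and `truncating` is easier to see on plain lists. The function below is a simplified, hypothetical stand-in for `pad_sequences` that handles one sequence at a time; it is meant only to illustrate the two options, not to replace the Keras utility.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="post", value=0):
    """Return a copy of seq with exactly maxlen items."""
    if len(seq) > maxlen:
        # "post" keeps the first maxlen items, "pre" keeps the last maxlen items
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    fill = [value] * (maxlen - len(seq))
    # "post" appends the fill values, "pre" puts them in front
    return seq + fill if padding == "post" else fill + seq


short_seq = [3, 8, 5, 2]
long_seq = [1, 1, 1, 1, 1, 1, 1, 2]

print(pad_or_truncate(short_seq, 6, padding="post"))    # [3, 8, 5, 2, 0, 0]
print(pad_or_truncate(short_seq, 6, padding="pre"))     # [0, 0, 3, 8, 5, 2]
print(pad_or_truncate(long_seq, 6, truncating="post"))  # [1, 1, 1, 1, 1, 1]
print(pad_or_truncate(long_seq, 6, truncating="pre"))   # [1, 1, 1, 1, 1, 2]
```

Note that the choice of `truncating` decides whether the final "today" (index 2) of the long sentence survives: it is cut with `"post"` and kept with `"pre"`.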


# Common Pre-processing Optimizations for Tokenization

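- Lowercasing: Convert all text to lowercase so that "Today" and "today" map to the same token, which shrinks the vocabulary.
- Removing punctuation: Strip punctuation marks so that "today?" and "today" map to the same token.
- Removing stop words: Drop very common words such as "the", "is", and "and" that carry little meaning on their own, leaving a smaller vocabulary of more informative words.

The Keras `Tokenizer` used above already lowercases text and strips punctuation by default (`lower=True` and the `filters` argument), which is why "today?" was encoded with the same index as "today". Stop words are not removed automatically.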
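
The three steps in this list can be chained in a small pure-Python helper. This is only a sketch: `preprocess` is a hypothetical name, and `STOP_WORDS` is a tiny illustrative set rather than a real stop word list.

```python
import string

# Tiny illustrative stop word set (an assumption, not a standard list)
STOP_WORDS = {"the", "is", "and", "a"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, remove ASCII punctuation, then drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in stop_words]


print(preprocess("Is it sunny today?"))       # ['it', 'sunny', 'today']
print(preprocess("The snow, and the rain!"))  # ['snow', 'rain']
```

Whether stop word removal helps depends on the task; for a question such as "Is it sunny today?", dropping "is" also removes a cue that the sentence is a question.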
