“Every time I fire a linguist, the performance of the speech recognizer goes up.” - Frederick Jelinek:1990


## What is NLP about?
Using machine learning and large datasets to give computers the ability not to understand language, which is a more lofty goal, but to ingest a piece of language as input and return something useful

Examples:
- “What’s the topic of this text?” (text classification)
- “Does this text contain abuse?” (content filtering)
- “Does this text sound positive or negative?” (sentiment analysis)
- “What should be the next word in this incomplete sentence?” (language modeling)
- “How would you say this in German?” (translation)
- “How would you summarize this article in one paragraph?” (summarization)
etc.

## NLP Evolution: 
1. The toolset of NLP—decision trees, logistic regression—only saw slow evolution from the 1990s to the early 2010s. Most of the research focus was on feature engineering.
2. Then from 2015 to 2017, recurrent neural networks dominated the booming NLP scene. Bidirectional LSTM models, in particular, set the state of the art on many important tasks
3. Around 2017–2018, a new architecture rose to replace RNNs: the Transformer

## 11.2 Preparing text data
Deep learning models, being differentiable functions, can only process numeric tensors: they can’t take raw text as input. Vectorizing text is the process of transforming text into numeric tensors. Text vectorization processes come in many shapes and forms, but they all follow the same template:

1. First, you standardize the text to make it easier to process, such as by converting it to lowercase or removing punctuation.
2. You split the text into units (called tokens), such as characters, words, or groups of words. This is called tokenization.
3. You convert each such token into a numerical vector. This will usually involve first indexing all tokens present in the data.

![alt text](static/text_vectorization.png "text_vector")

### 1- Text Standardization
Text standardization is a basic form of feature engineering that aims to erase encoding differences that you don’t want your model to have to deal with
1. Remove punctuation marks.
2. All to lower case.
3. Transform non-standard characters to standard characters.
4. Stemming: converting variations of terms into a single shared representation. Example: caught/been catching/catches -> [catch]

A full example:
{sunset came. i was staring at the Mexico sky. Isnt nature splendid??}
will be
{sunset came i [stare] at the mexico sky isnt nature splendid}

**IMPORTANT NOTE:**
This doesn't always apply to every context. For example, if you are classifying questions, '?' will be treated as a single token!

### 2- Text Splitting (tokenization)
**Ways of tokenization:**
- Word-level tokenization: Where tokens are space-separated (or punctuation-separated) substrings. A variant of this is to further split words into subwords when applicable—for instance, treating “staring” as “star+ing” or “called” as “call+ed.”
- N-gram tokenization: Where tokens are groups of N consecutive words. For instance, “the cat” or “he was” would be 2-gram tokens (also called bigrams).
- Character-level tokenization: Where each character is its own token. In practice, this scheme is rarely used, and you only really see it in specialized contexts, like text generation or speech recognition.

#### Types of Text Processing Models:
Sequence Models: Models that care about order. Use word-level tokenization.

Bag-of-Words models: treat input words as a set, discarding their original order. Use N-gram tokenization.
N-grams are a way to artificially inject a small amount of local word order information into the model.

**Understanding N-grams and bag-of-words**
Word N-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words. For Example:
“the cat sat on the mat.” It may be decomposed into the following set of 2-grams:
{"the", "the cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the mat", "mat"}

It may also be decomposed into the following set of 3-grams:
{"the", "the cat", "cat", "cat sat", "the cat sat", "sat", "sat on", "on", "cat sat on", "on the", "sat on the", "the mat", "mat", "on the mat"}

Because bag-of-words isn’t an order-preserving tokenization method (the tokens generated are understood as a set, not a sequence, and the general structure of the sentences is lost), it tends to be used in shallow language-processing models rather than in deep learning models.

<br/>

### 3- Vocabulary Indexing :
It is building an index of all terms found in the training data (the “vocabulary”), and assign a unique integer to each entry in the vocabulary.
You can then convert that integer into a vector encoding that can be processed by a neural network, like a one-hot vector.

**Note:** that at this step it’s common to restrict the vocabulary to only the top 20,000 or 30,000 most common words found in the training data. Any text dataset tends to feature an extremely large number of unique terms, most of which only show up once or twice—indexing those rare terms would result in an excessively large feature space, where most features would have almost no information content.

**What if we passed by a word that is not in the index?**
To handle this, you should use an **“out of vocabulary” index (abbreviated as OOV index)**—a catch-all for any token that wasn't in the index. It’s usually index 1: you’re actually doing `token_index = vocabulary.get(token, 1)`. When decoding a sequence of integers back into words, you’ll replace 1 with something like “[UNK]” (which you’d call an “OOV token”)

“Why use 1 and not 0?” you may ask. That’s because 0 is already taken. There are two special tokens that you will commonly use:
- The OOV token (index 1): Not recognized word
- The mask token (index 0): Used to pad sequences of data. Sequences should be in the same length.

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras import layers
from tensorflow import keras
import os, pathlib, shutil, random
import tensorflow as tf

In [2]:
text_vectorization = layers.TextVectorization(output_mode="int")
# output_mode: Configures the layer to return sequences of words encoded as integer indices

2024-07-13 18:51:34.797073: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2
2024-07-13 18:51:34.797143: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 8.00 GB
2024-07-13 18:51:34.797160: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 2.67 GB
2024-07-13 18:51:34.797549: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-13 18:51:34.797590: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


By default, the TextVectorization layer will use the setting “convert to lowercase and remove punctuation” for text standardization, and “split on whitespace” for tokenization. Note that such custom functions should operate on **tf.string** tensors, not regular Python strings!
You can customize the standardization and the tokenization functions by passing it to the function.

Note that you can retrieve the computed vocabulary via get_vocabulary()—this can be useful if you need to convert text encoded as integer sequences back into words. The first two entries in the vocabulary are the mask token (index 0) and the OOV token (index 1). Entries in the vocabulary list are sorted by frequency,so with a real-world dataset, very common words like “the” or “a” would come first.

In [4]:
text_vectorization.get_vocabulary()

['',
 '[UNK]',
 'erase',
 'write',
 'then',
 'rewrite',
 'poppy',
 'i',
 'blooms',
 'and',
 'again',
 'a']

In [5]:
# Full Example
# Encode
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)
# Decode
inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
i write rewrite and [UNK] rewrite again


Because TextVectorization is mostly a dictionary lookup operation, **it can’t be executed on a GPU (or TPU)—only on a CPU**. So if you’re training your model on a GPU, your TextVectorization layer will run on the CPU before sending its output to the GPU. 

There are two ways we could use our TextVectorization layer.
1. The first option is to put it in the tf.data pipeline
2. The second option is to make it part of the model (after all, it’s a Keras layer)

There’s an important difference between the two:
- In option 1, vectorization will happen synchronously with the rest of the model. This means that at each training step, the rest of the model (placed on the GPU) will have to wait for the output of the TextVectorization layer (placed on the CPU) to be ready in order to get to work.
- In option 2, vectorization is asynchronous preprocessing on CPU: while the GPU runs the model on one batch of vectorized data, the CPU stays busy by vectorizing the next batch of raw strings.

In production, there is a solution to handle raw input text. We'll discuss later in the chapter.

## 11.3 Two approaches for representing groups of words: Sets and sequences
The problem of order in natural language is an interesting one: unlike the steps of a timeseries, words in a sentence don’t have a natural, canonical order.
Order is clearly important, but its relationship to meaning isn’t straightforward.

The Transformer architecture is technically order-agnostic, yet it injects word-position information into the representations it processes, which enables it to simultaneously look at different parts of a sentence (unlike RNNs) while still being order-aware. 

Because they take into account word order, both RNNs and Transformers are called **sequence models**


### Preparing the IMDB movie reviews data

In [14]:
!curl -o data/aclImdb_v1.tar.gz -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  4882k      0  0:00:16  0:00:16 --:--:-- 9015k
tar: could not chdir to '/data'



In [18]:
!tar -xf data/aclImdb_v1.tar.gz -C ./data

In [19]:
# look at the data
!cat data/aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

In [20]:
# Let’s prepare a validation set by setting apart 20% of the training text files in a new directory, aclImdb/val:

base_dir = pathlib.Path("data/aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"


def create_val_files(category):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    num_val_samples = int(0.2 * len(files))
    random.seed(1337)
    val_files = random.sample(files, num_val_samples)
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)


create_val_files("neg")
create_val_files("pos")

In [54]:
batch_size = 32
# text_dataset_from_directory: utility to create a batched Dataset of text and their labels for a directory structure
# main_directory/
# ...class_a/
# ......a_text_1.txt
# ......a_text_2.txt
# ...class_b/
# ......b_text_1.txt
# ......b_text_2.txt
train_ds = keras.utils.text_dataset_from_directory(
    "data/aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "data/aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "data/aclImdb/test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.


2024-07-14 19:47:21.450027: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2
2024-07-14 19:47:21.450053: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 8.00 GB
2024-07-14 19:47:21.450062: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 2.67 GB
2024-07-14 19:47:21.450226: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-07-14 19:47:21.450402: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


**keras.utils.text_dataset_from_directory**
Generates a tf.data.Dataset from text files in a directory.

**tf.data.Dataset** 
Has a map function.
```
map(
    map_func, num_parallel_calls=None, deterministic=None, name=None
)
```
This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input.

In [8]:
print("Type:", type(train_ds))
inputs, targets = next(iter(train_ds))
print("inputs.shape:", inputs.shape)
print("inputs.dtype:", inputs.dtype)
print("targets.shape:", targets.shape)
print("targets.dtype:", targets.dtype)
print("inputs[0]:", inputs[0].numpy()[:20], "...")
print("targets[0]:", targets[0])

Type: <class 'tensorflow.python.data.ops.batch_op._BatchDataset'>
inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: b'I have seen this mov' ...
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


### Processing words as a set: The bag-of-words approach
The simplest way to encode a piece of text for processing by a machine learning model is to discard order and treat it as a set (a “bag”) of tokens. You could either look at individual words (unigrams), or try to recover some local order information by looking at groups of consecutive token (N-grams).



In [151]:
# Unigrams First
# Treat each single word as a token by itself.
text_vectorization = keras.layers.TextVectorization(max_tokens=20000,  # limit to 20,000 most frequent words
                                                    output_mode="multi_hot")
text_only_train_ds = train_ds.map(lambda x, y: x)  # extract the text from the train dataset and ignore targets
text_vectorization.adapt(text_only_train_ds)  # build the text vectorizer
test_sentence = "the movie is awesome"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)
# notice that "the" is the second most popular term

tf.Tensor([0 1 0 ... 0 0 0], shape=(20000,), dtype=int64)


2024-07-13 16:19:22.833939: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [53]:
# Getting the multi-hot encoded text and targets for train dataset
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
# getting the multi-hot encoded text and targets for test dataset
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
# getting the multi-hot encoded text and targets for val dataset
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [55]:
inputs, targets = next(iter(binary_1gram_train_ds))
print("inputs.shape:", inputs.shape)
print("inputs.dtype:", inputs.dtype)
print("targets.shape:", targets.shape)
print("targets.dtype:", targets.dtype)
print("inputs[0]:", inputs[0])  # all words encoded into a multi-hot encode 
print("targets[0]:", targets[0])

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'int64'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1 1 1 ... 0 0 0], shape=(20000,), dtype=int64)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


In [59]:
# Building the model

def get_model(max_tokens=20000, hidden_dim=16):
    input_layer = keras.Input(shape=(max_tokens,))  # we don't specify batch size here because it's a dataset
    x = keras.layers.Dense(hidden_dim, activation='relu')(input_layer)
    x = keras.layers.Dropout(0.5)(x)
    output = keras.layers.Dense(1, activation='sigmoid')(x)
    model = keras.Model(inputs=input_layer, outputs=output)
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model

In [64]:
model = get_model()
callbacks = [keras.callbacks.ModelCheckpoint("models/binary_1gram.keras", save_best_only=True)]
model.fit(x=binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), callbacks=callbacks, epochs=10)

Epoch 1/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 8ms/step - accuracy: 0.7556 - loss: 0.5075 - val_accuracy: 0.8936 - val_loss: 0.2828
Epoch 2/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.8879 - loss: 0.3077 - val_accuracy: 0.8976 - val_loss: 0.2648
Epoch 3/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9097 - loss: 0.2585 - val_accuracy: 0.8928 - val_loss: 0.2742
Epoch 4/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.9174 - loss: 0.2478 - val_accuracy: 0.8914 - val_loss: 0.2857
Epoch 5/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.9195 - loss: 0.2341 - val_accuracy: 0.8906 - val_loss: 0.2972
Epoch 6/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9255 - loss: 0.2215 - val_accuracy: 0.8870 - val_loss: 0.3104
Epoch 7/10
[1m625/625[0m 

<keras.src.callbacks.history.History at 0x349afbed0>

In [65]:
model = keras.models.load_model("models/binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 15ms/step - accuracy: 0.8842 - loss: 0.2972
Test acc: 0.884


Notes on the Training cell:
1. We call cache() on the datasets to cache them in memory: this way, we will only do the preprocessing once, during the first epoch, and we’ll reuse the preprocessed texts for the following epochs. This can only be done if the data is small enough to fit in memory.

2. Q: Why we don't provide a batch size here?
    Answer from the docs:
    If unspecified, batch_size will default to 32. Do not specify the batch_size if your data is in the form of datasets, generators, or keras.utils.PyDataset instances (since they generate batches). 
3. Q: Why we don't provide y here?
   Answer from docs: If x is a dataset, generator, or keras.utils.PyDataset instance, y should not be specified (since targets will be obtained from x). 

#### Bigrams with binary encoding
Most common type of N-Grams is Bigrams.
The **TextVectorization layer** can be configured to return arbitrary N-grams: bigrams, trigrams, etc. 


In [150]:
bigram_text_vectorization = keras.layers.TextVectorization(ngrams=2, output_mode="multi_hot", max_tokens=20000)
bigram_text_vectorization.adapt(text_only_train_ds)
test_sentence = "the movie is awesome"
encoded_sentence = bigram_text_vectorization(test_sentence)
print(encoded_sentence)

2024-07-13 16:19:14.727557: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


tf.Tensor([0 1 0 ... 0 0 0], shape=(20000,), dtype=int64)


In [72]:
binary_2gram_train_ds = train_ds.map(lambda x, y: (bigram_text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(lambda x, y: (bigram_text_vectorization(x), y), num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(lambda x, y: (bigram_text_vectorization(x), y), num_parallel_calls=4)

In [70]:
model = get_model()
callbacks = [keras.callbacks.ModelCheckpoint("models/binary_2gram.keras", save_best_only=True)]
model.fit(x=binary_2gram_train_ds.cache(), validation_data=binary_2gram_val_ds.cache(), callbacks=callbacks, epochs=10)

Epoch 1/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 10ms/step - accuracy: 0.7895 - loss: 0.4557 - val_accuracy: 0.9016 - val_loss: 0.2580
Epoch 2/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9131 - loss: 0.2449 - val_accuracy: 0.9078 - val_loss: 0.2506
Epoch 3/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9305 - loss: 0.2004 - val_accuracy: 0.9074 - val_loss: 0.2671
Epoch 4/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9429 - loss: 0.1807 - val_accuracy: 0.9076 - val_loss: 0.2833
Epoch 5/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.9480 - loss: 0.1813 - val_accuracy: 0.9060 - val_loss: 0.2955
Epoch 6/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9535 - loss: 0.1656 - val_accuracy: 0.9012 - val_loss: 0.3077
Epoch 7/10
[1m625/625[0m 

<keras.src.callbacks.history.History at 0x3a3153250>

In [74]:
model = keras.models.load_model("models/binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 4ms/step - accuracy: 0.8988 - loss: 0.2732
Test acc: 0.900


## Bigrams with TF-IDF encoding
You can also add a bit more information to this representation by counting how many times each word or N-gram occurs.
```
{"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
 "sat on": 1, "on": 1, "on the": 1, "the mat: 1", "mat": 1}
```
If you’re doing text classification, knowing how many times a word occurs in a sample is critical: any sufficiently long movie review may contain the word “terrible” regardless of sentiment, but a review that contains many instances of the word “terrible” is likely a negative one.

The words “the,” “a,” “is,” and “are” will always dominate your word count histograms, drowning out other words—despite being pretty much useless features in a classification context. How could we address this?
**Normalization!** We could just normalize word counts by subtracting the mean and dividing by the variance.

But... this method of normalization won't work in such a situation. It'll wreck the sparsity of the text vectors.
Sparse Vector = [1, 1, 1, 0, 0, 0, 1] each one and zero here represents an existence of a word (1), or absence (0).
Normalization will wreck the sparsity since Zero values might become non-Zeroes.

***What to do then??***
TF-IDF normalization. TF-IDF stands for “term frequency, inverse document frequency.”


### Understanding TF-IDF
The more a given term appears in a document, the more important that term is for understanding what the document is about.
At the same time, the frequency at which the term appears across all documents in your dataset matters too: terms that appear in almost every document (like “the” or “a”) aren’t particularly informative, while terms that appear only in a small subset of all texts (like “Herzog”) are very distinctive,

**TF-IDF** is a metric that fuses these two ideas. 
It weights a given term by taking “term frequency,” how many times the term appears in the current document, and dividing it by a measure of “document frequency” which estimates how often the term comes up across the dataset. 

```python
def tfidf(term, document, dataset):
    # Term Frequency (TF)
    term_freq = document.count(term)
    
    # Total number of documents
    total_docs = len(dataset)
    
    # Number of documents containing the term
    doc_freq = sum(1 for doc in dataset if term in doc)
    
    # Inverse Document Frequency (IDF)
    if doc_freq == 0:
        idf = 0  # Handle the case where the term does not appear in any document
    else:
        idf = math.log(total_docs / doc_freq)
    
    # TF-IDF
    tfidf_value = term_freq * idf
    
    return tfidf_value
```
High TF-IDF Value:
A high TF-IDF value for a term in a particular document suggests that the term is highly relevant to that document. This is because the term appears frequently in that document (high term frequency) relative to how often it appears across all documents (low document frequency).

Low TF-IDF Value:
A low TF-IDF value indicates that the term is common across many documents in the corpus. It doesn't provide much discriminatory power in distinguishing one document from another.



In [149]:
tfidf_text_vectorization = keras.layers.TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)
with tf.device("/CPU:0"):
    tfidf_text_vectorization.adapt(text_only_train_ds)
test_sentence = "the movie is fantastic"
encoded_sentence = tfidf_text_vectorization(test_sentence)
print(encoded_sentence)

2024-07-13 16:18:53.687823: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


tf.Tensor([0.         0.69727254 0.         ... 0.         0.         0.        ], shape=(20000,), dtype=float32)


In [99]:
tfidf_2gram_train_ds = train_ds.map(lambda x, y: (bigram_text_vectorization(x), y), num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (bigram_text_vectorization(x), y), num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (bigram_text_vectorization(x), y), num_parallel_calls=4)

In [100]:
model = get_model()
callbacks = [keras.callbacks.ModelCheckpoint("models/tfidf_2gram.keras", save_best_only=True)]
model.fit(x=tfidf_2gram_train_ds.cache(), validation_data=tfidf_2gram_val_ds.cache(), callbacks=callbacks, epochs=5)

Epoch 1/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 11ms/step - accuracy: 0.7976 - loss: 0.4536 - val_accuracy: 0.9038 - val_loss: 0.2468
Epoch 2/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9140 - loss: 0.2398 - val_accuracy: 0.9060 - val_loss: 0.2456
Epoch 3/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9286 - loss: 0.2038 - val_accuracy: 0.9078 - val_loss: 0.2549
Epoch 4/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 7ms/step - accuracy: 0.9433 - loss: 0.1801 - val_accuracy: 0.9064 - val_loss: 0.2748
Epoch 5/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 6ms/step - accuracy: 0.9470 - loss: 0.1625 - val_accuracy: 0.9028 - val_loss: 0.2926
Epoch 6/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - accuracy: 0.9536 - loss: 0.1547 - val_accuracy: 0.9044 - val_loss: 0.3039
Epoch 7/10
[1m625/625[0m 

<keras.src.callbacks.history.History at 0x374629350>

In [101]:
model = keras.models.load_model("models/tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 16ms/step - accuracy: 0.8984 - loss: 0.2749
Test acc: 0.899


## Exporting a model that processes raw strings
Just create a new model that reuses your TextVectorization layer and adds to it the model you just trained

In [102]:
inputs = keras.Input(shape=(1,), dtype="string")  # inference will be one string only
processed_inputs = tfidf_text_vectorization(inputs)  # apply text preprocessing
outputs = model(processed_inputs)  # apply the trained model
inference_model = keras.Model(inputs, outputs)  # full pipeline

In [133]:
raw_text_data = tf.convert_to_tensor([
    ["That was an excellent movie, I loved it."],
])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0] * 100):.2f} percent positive")

99.98 percent positive


## Processing words as a sequence: The sequence model approach
Instead of manually crafting order-based features (like in N-Grams), we exposed the model to raw word sequences and let it figure out such features on its own. This is what sequence models are about.

To implement a sequence model, you’d start by representing your input samples as sequences of integer indices (one integer standing for one word). Then, you’d map each integer to a vector to obtain vector sequences. Finally, you’d feed these sequences of vectors into a stack of layers that could cross-correlate features from adjacent vectors, such as a 1D convnet, an RNN, or a Transformer.

For some time around 2016–2017, bidirectional RNNs (in particular, **bidirectional LSTMs**) were considered to be the state of the art for sequence modeling. Nowadays, sequence modeling is almost universally done with **Transformers**.

Oddly, one-dimensional convnets were never very popular in NLP, **even though, in my own experience, a residual stack of depthwise-separable 1D convolutions can often achieve comparable performance to a bidirectional LSTM, at a greatly reduced computational cost**.

In [55]:
max_length = 500
max_tokens = 20_000
text_only_train_ds = train_ds.map(lambda x, y: x)  # extract the text from the train dataset and ignore targets
text_vectorization = keras.layers.TextVectorization(
    # In order to keep a manageable input size, we’ll truncate the inputs after the first 500 words. Only used with output_mode=int
    output_sequence_length=max_length,
    output_mode="int",
    max_tokens=max_tokens
)
text_vectorization.adapt(text_only_train_ds)
test_sentence = "the movie is fantastic"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)
# Notice how it's already padded the output to be 500 items
# Remember that each number in the output is and index of where this word lies in the total number of tokens. 
# Lower number means that it is more frequent ('the' is the second most frequent, 'is' 7th, 'movie' 19th)

2024-07-14 19:47:23.000000: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


tf.Tensor(
[  2  19   7 765   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0

In [70]:
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

In [9]:
# there is an error in the author's notebook. 
inputs = keras.Input(shape=(500,), dtype="int64")
one_hot = tf.keras.layers.Lambda(lambda x: tf.one_hot(x, depth=max_tokens), output_shape=(500, max_tokens))(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(32))(one_hot)
x = keras.layers.Dropout(0.5)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

In [14]:
callbacks = [keras.callbacks.ModelCheckpoint("models/lambda_one_hot_bidir_lstm.keras", save_best_only=True)]
# model.fit(int_train_ds, callbacks=callbacks, validation_data=int_val_ds, epochs=10)
model = keras.models.load_model("models/lambda_one_hot_bidir_lstm.keras")
print(f"{model.evaluate(int_test_ds)[1]:.3f}")
# Ran this code on Google colab. Accuracy is 87%. Took a long time.

ValueError: File not found: filepath=models/lambda_one_hot_bidir_lstm.keras. Please ensure the file is an accessible `.keras` zip file.

## Understanding word embeddings
When using one-hot encoding, you make an assumption. That assumption is that the different tokens you’re encoding are all independent of each other: indeed, one-hot vectors are all orthogonal to one another. Which is wrong!!

To get a bit more abstract, **the geometric relationship between two word vectors should reflect the semantic relationship between these words.**
For instance, in a reasonable word vector space, you would expect synonyms to be embedded into similar word vectors, and in general, you would expect the geometric distance (such as the cosine distance or L2 distance) between any two word vectors to relate to the “semantic distance” between the associated words.

**Word embeddings* are low-dimensional floating-point vectors (dense vectors)
**Word embeddings** can be 256-dimensional, 512-dimensional, or 1,024-dimensional when dealing with very large vocabularies.
**Word embeddings** pack more information into far fewer dimensions.
**Word embeddings** are structured representations, and their structure is learned from data. Similar words get embedded in close locations,

![alt text](static/word_embeddings.png "word_embeddings")

In a 2D plane, some semantic relationships between these words can be encoded as geometric transformations.
- From cat to tiger and from dog to wolf: “from pet to wild animal” vector.
- From dog to cat and from wolf to tiger “from canine to feline” vector.

![alt text](static/2d_word_embedding.png "2d_word_embeddings")

Different Example:
By adding a “female” vector to the vector “king,” we obtain the vector “queen.” By adding a “plural” vector, we obtain “kings.”

#### Two ways to learn word embeddings
- Similar to learning weights in a Neural Network
- Load a precomputed (pretrained) word vector. 

**RULE: What makes a good word-embedding space depends heavily on your task**

## Embedding Layer
The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, looks up these integers in an internal dictionary, and returns the associated vectors. 
```
embedding_layer = layers.Embedding(input_dim=max_tokens, output_dim=256)
```
**The Embedding layer** takes at least two arguments: the number of possible tokens and the dimensionality of the embeddings (here, 256)
**The Embedding layer** takes as input a rank-2 tensor of integers, of shape `(batch_size, sequence_length)`, where each entry is a sequence of integers. The layer then returns a 3D floating-point tensor of shape `(batch_size, sequence_length, embedding_ dimensionality)`.
Word vectors are initially random, then gradually adjusted via **backpropagation**.

In [9]:
inputs = keras.Input(shape=(max_length,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256, name="embeddings_layer")(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("models/embeddings_bidir_lstm.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=5, callbacks=callbacks)

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 500)]             0         
                                                                 
 embeddings_layer (Embeddin  (None, 500, 256)          5120000   
 g)                                                              
                                                                 
 bidirectional_1 (Bidirecti  (None, 64)                73984     
 onal)                                                           
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 5194049 (19.81 MB)
Trainable params: 5194049 

<keras.src.callbacks.History at 0x318ce0c50>

In [10]:
model = keras.models.load_model("models/embeddings_bidir_lstm.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Test acc: 0.867


In [11]:
desired_layer_name = 'embeddings_layer'
desired_layer_output_model = keras.Model(inputs=model.input,
                                         outputs=model.get_layer(desired_layer_name).output)

test_sentence = ["the movie is fantastic", "worst movie i have ever seen"]
input_data = text_vectorization(test_sentence)
layer_output = desired_layer_output_model.predict(input_data)

# Print the output of the desired layer
print("Output of", desired_layer_name, ":", layer_output)

Output of embeddings_layer : [[[ 0.01360201  0.01767087  0.00328842 ...  0.01169621 -0.04574747
    0.01518579]
  [-0.04140052 -0.01244501  0.03076882 ...  0.00277484  0.04939148
   -0.01400608]
  [ 0.02322518 -0.04022496  0.00818865 ...  0.02046797  0.02836032
    0.04751245]
  ...
  [-0.00334531  0.04427297 -0.0595779  ...  0.03713576  0.01749754
   -0.02225276]
  [-0.00334531  0.04427297 -0.0595779  ...  0.03713576  0.01749754
   -0.02225276]
  [-0.00334531  0.04427297 -0.0595779  ...  0.03713576  0.01749754
   -0.02225276]]

 [[ 0.09805202  0.07216296 -0.07133085 ... -0.1775976  -0.2420526
   -0.11087494]
  [-0.04140052 -0.01244501  0.03076882 ...  0.00277484  0.04939148
   -0.01400608]
  [-0.03173373 -0.03777039  0.01584751 ...  0.0070865  -0.00574558
    0.03534818]
  ...
  [-0.00334531  0.04427297 -0.0595779  ...  0.03713576  0.01749754
   -0.02225276]
  [-0.00334531  0.04427297 -0.0595779  ...  0.03713576  0.01749754
   -0.02225276]
  [-0.00334531  0.04427297 -0.0595779  ...  0

### Understanding Padding and Masking
One thing that’s slightly hurting model performance here is that our input sequences are full of zeros. This comes from our use of the **output_sequence_length=max_length** option in TextVectorization (with max_length equal to 500): sentences longer than 500 tokens are truncated to a length of 500 tokens, and sentences shorter than 500 tokens are padded with zeros.

Because of the Zeroes (padding), the information stored in the internal state of the RNN will gradually fade out as it gets exposed to these meaningless inputs.
We need some way to tell the RNN that it should skip these iterations. There’s an API for that: **Masking.**

The Embedding layer is capable of generating a “mask” that corresponds to its input data. This mask is a tensor of ones and zeros (or True/False booleans), of shape (batch_size, sequence_length), where the entry mask[i, t] indicates where timestep t of sample i should be skipped or not.

By default, this option isn’t active—you can turn it on by passing mask_zero=True to your Embedding layer.

In [None]:
inputs = keras.Input(shape=(max_length,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("models/embeddings_bidir_lstm_with_masking.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=5, callbacks=callbacks)
model = keras.models.load_model("models/embeddings_bidir_lstm_with_masking.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

# Ran the code into another PC because my laptop couldn't handle it

## Using Pretrained Word Embeddings
When to use this solution?
Sometimes you have so little training data available that you can’t use your data alone to learn an appropriate task-specific embedding of your vocabulary. In such cases, instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a precomputed embedding space that you know is highly structured and exhibits useful properties—one that captures generic aspects of language structure.

It's very similar to use a pretrained ConvNet model.

Such word embeddings are generally computed using word-occurrence statistics (observations about what words co-occur in sentences or documents), using a variety of techniques, some involving neural networks, others not.

The Word2Vec algorithm (https://code.google.com/archive/p/word2vec): one of the most famous pre-trained word-embeddings.
Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove)

We'll start with **GloVe**

In [2]:
!wget http://nlp.stanford.edu/data/glove.6B.zip -P ./data/
!unzip -q ./data/glove.6B.zip -d ./data

--2024-07-14 18:03:10--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2024-07-14 18:03:10--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-07-14 18:03:11--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘./dat

In [56]:
import numpy as np

path_to_glove_file = "data/glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs


In [58]:
print(f"Found {len(embeddings_index)} word vectors, each has {len(embeddings_index['hello'])} embedding.")

Found 400000 word vectors, each has 100 embedding.


In [59]:
def euclidean_distance(u, v):
    # Measure how far embeddings from each other. Higher number means not similar
    return np.sqrt(np.sum((u - v) ** 2))


def cosine_similarity(u, v):
    # Measure how similar embeddings to each other. Higher number means more similar
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


print(euclidean_distance(embeddings_index["cat"], embeddings_index["dog"]))
print(cosine_similarity(embeddings_index["cat"], embeddings_index["dog"]))

2.681131
0.87980753


 To load an embedding matrix that you can load to an Embedding layer, it must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in the reference word index (built during tokenization).

In [66]:
embedding_dim = 100

vocabulary = text_vectorization.get_vocabulary()
embedding_matrix = np.zeros((max_tokens, embedding_dim))  # max token=20_000 and embedding_dim=100

for i, word in enumerate(vocabulary):
    embedding_vector = embeddings_index.get(word, np.zeros(100))
    embedding_matrix[i] = embedding_vector

print(embedding_matrix.shape)

(20000, 100)


In [67]:
# We use a Constant initializer to load the pretrained embeddings in an Embedding layer. 
# To not disrupt the pretrained representations during training, we freeze the layer via trainable=False
embedding_layer = layers.Embedding(
    max_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
    mask_zero=True,
)

In [72]:
inputs = keras.Input(shape=(max_length, ), dtype="int64")
embedded = embedding_layer(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("models/glove_embeddings_sequence_model.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=15, callbacks=callbacks)
model = keras.models.load_model("models/glove_embeddings_sequence_model.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 500)]             0         
                                                                 
 embedding (Embedding)       multiple                  2000000   
                                                                 
 bidirectional_3 (Bidirecti  (None, 64)                34048     
 onal)                                                           
                                                                 
 dropout_2 (Dropout)         (None, 64)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 2034113 (7.76 MB)
Trainable params: 34113 (133.25 KB)
Non-trainable params: 2000000 (7.63 MB)
_________________