# Word Embeddings

In [24]:
%pip install tensorflow-macos tensorflow-metal


## Sentiment Classifier Model

In [4]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

### GPU Acceleration

In [2]:
tf.config.list_physical_devices("GPU")

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [3]:
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Download the Dataset

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

In [5]:
dataset_dir = 'aclImdb_v1_extracted/aclImdb/'

Our dataset has `pos` and `neg` folders containing movie reviews labelled as positive and negative, respectively. We will use the reviews in these two folders to train a binary classification model.

In [6]:
train_dir = os.path.join(dataset_dir,'train')
os.listdir(train_dir)

['urls_unsup.txt',
 '.DS_Store',
 'neg',
 'urls_pos.txt',
 'urls_neg.txt',
 'pos',
 'unsupBow.feat',
 'labeledBow.feat']

The `train` directory also contains an `unsup` folder of unlabeled reviews, which should be removed before creating the training dataset.

In [10]:
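# Remove the 'unsup' folder; ignore_errors avoids a failure if it is already gone
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir, ignore_errors=True)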

Use the train directory to create both train and validation datasets with a split of 20% for validation.



In [8]:
batch_size = 1024
seed = 123

train_ds = tf.keras.utils.text_dataset_from_directory(train_dir, batch_size=batch_size, validation_split=0.2, subset='training', seed=seed)
val_ds = tf.keras.utils.text_dataset_from_directory(train_dir, batch_size=batch_size, validation_split=0.2, subset='validation', seed=seed)

Found 25001 files belonging to 2 classes.
Using 20001 files for training.
Found 25001 files belonging to 2 classes.
Using 5000 files for validation.
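
The file counts reported above follow from the 20% split: with 25,001 files, `int(0.2 * 25001)` gives 5,000 validation files and the remaining 20,001 go to training. The sketch below illustrates a seeded shuffle-then-split in plain Python. The helper name `split_files` and the placeholder file names are hypothetical, and the code only mirrors the idea, not the exact shuffling Keras performs.

```python
import random

def split_files(paths, validation_split, seed):
    """Shuffle paths deterministically, then hold out the tail for validation."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    num_val = int(validation_split * len(shuffled))
    cut = len(shuffled) - num_val
    return shuffled[:cut], shuffled[cut:]

# Placeholder file names standing in for the review files.
paths = [f"review_{i}.txt" for i in range(25001)]
train_paths, val_paths = split_files(paths, validation_split=0.2, seed=123)
print(len(train_paths), len(val_paths))  # 20001 5000
```

Passing the same seed to both the `training` and `validation` calls is what keeps the two subsets disjoint.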


Take a look at a few movie reviews and their labels (1: positive, 0: negative) from the train dataset.

In [9]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])
    print('------------------------')

0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe"
------------------------
1 b"Halloween is the story of a boy who was misunderstood as a child. He takes out his problems on his older sister, whom he murders at the beginning of the film. This is just the start of things to come from Michael Myers.<br /><br />Donald Pleasance plays the doctor who's been studying Myers for years. He knows that something

2024-12-07 17:16:14.854247: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


### Configure the dataset for performance

Two methods are important when loading data, so that I/O does not block training:

- `.cache()` keeps data in memory after it's loaded off disk. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
- `.prefetch()` overlaps data preprocessing and model execution while training.
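
As a rough analogy only (this is not how `tf.data` is implemented), the two ideas can be sketched with the Python standard library: a source that replays elements from memory after the first full pass, and a generator that produces elements in a background thread while the consumer is busy. The names `CachedSource`, `prefetch`, and `slow_loader` are hypothetical.

```python
import queue
import threading

class CachedSource:
    """Replay elements from memory after the first full pass (the idea behind cache)."""

    def __init__(self, make_iterator):
        self._make_iterator = make_iterator
        self._items = None

    def __iter__(self):
        if self._items is not None:
            yield from self._items
            return
        items = []
        for item in self._make_iterator():
            items.append(item)
            yield item
        self._items = items  # cache only once a full pass has completed


def prefetch(iterable, buffer_size=1):
    """Produce elements in a background thread while the consumer works."""
    done = object()
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buffer.put(item)
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buffer.get()) is not done:
        yield item


load_log = []

def slow_loader():
    """Hypothetical stand-in for reading batches off disk."""
    for batch in range(3):
        load_log.append(batch)
        yield batch

dataset = CachedSource(slow_loader)
first_epoch = list(prefetch(dataset))
second_epoch = list(prefetch(dataset))
print(first_epoch, second_epoch, len(load_log))
```

The loader runs only during the first epoch; the second epoch is served from memory, which is the effect `.cache()` has on repeated passes over `train_ds`.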

In [10]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### The Embedding Layer

The Embedding layer can be interpreted as a lookup table that maps integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem.

In [11]:
# Embed a 1000 word vocabulary into 5 dimensions
embedding_layer = tf.keras.layers.Embedding(1000, 5)

When an Embedding layer is created, its weights are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarity between words.
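
Similarity between embedding vectors is commonly measured with cosine similarity. The sketch below uses three made-up 3-dimensional vectors (hypothetical values, not taken from any trained model) purely to show the computation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, for illustration only.
toy_embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, -0.5],
}
sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(round(sim_close, 3), round(sim_far, 3))
```

Values near 1 indicate vectors pointing in nearly the same direction; negative values indicate roughly opposite directions.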

If you pass integers to an embedding layer, each integer is replaced by the corresponding vector from the embedding table.

In [12]:
result = embedding_layer(tf.constant([1, 2, 3]))
result.numpy()

array([[ 0.0201366 , -0.00642856, -0.04087245, -0.00741005, -0.02235507],
       [-0.04529129,  0.02532952,  0.04054601,  0.01945807, -0.04099651],
       [ 0.01179831, -0.04293561, -0.03884371,  0.04257048, -0.00390206]],
      dtype=float32)

For text or sequence problems, the embedding layer takes a 2D tensor of integers of shape `(samples, sequence_length)`, where each entry is a sequence of integers. It can embed sequences of variable lengths: you could feed the embedding layer above batches of shape `(32, 10)` (a batch of 32 sequences of length 10) or `(64, 15)` (a batch of 64 sequences of length 15).

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a `(2, 3)` input batch and the output is `(2, 3, N)`, where `N` is the embedding dimension (5 for the layer above).

In [13]:
result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))
result.shape

TensorShape([2, 3, 5])

When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). 
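
The lookup and the resulting shape can be mimicked in plain Python with nested lists. This is a minimal sketch, not the Keras implementation; `table`, `embed`, and `shape` are hypothetical names, and small random numbers stand in for the randomly initialized weights.

```python
import random

vocab_size, embedding_dim = 1000, 5
rng = random.Random(0)

# The embedding table: one row of `embedding_dim` floats per vocabulary index.
table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def embed(batch):
    """Replace every integer index in a batch of sequences with its table row."""
    return [[table[index] for index in sequence] for sequence in batch]

def shape(nested):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

result = embed([[0, 1, 2], [3, 4, 5]])
print(shape(result))  # (2, 3, 5)
```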

### Text Preprocessing

In [14]:
# Create a custom standardization function to strip HTML break tags '<br />'
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)

    stripped_html = tf.strings.regex_replace(lowercase, '<br />', " ")

    return tf.strings.regex_replace(stripped_html, '[%s]' % re.escape(string.punctuation),'') # remove punctuation from the string

# Vocabulary size and number of words in a sequence 
vocab_size = 10000
sequence_length = 100 

# Use the text vectorization layer to normalize, split, and map strings to integers.
# Note that the layer uses the custom_standardization function defined above.
# Set output_sequence_length because the samples are not all the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization, 
    max_tokens=vocab_size, 
    output_mode='int',
    output_sequence_length=sequence_length
)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
# train_ds.map(lambda x, y: x) keeps only the review text from each (text, label) batch and discards the labels.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

2024-12-07 17:16:27.341982: W tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
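
To make the preprocessing steps concrete, here is a plain-Python sketch of what the layer does conceptually: standardize, split on whitespace, build a frequency-ordered vocabulary, and map each token to an integer id with truncation or zero-padding. It relies on the `TextVectorization` conventions that index 0 is reserved for padding and index 1 for out-of-vocabulary tokens (`[UNK]`). The helper names and the two example reviews are made up, and tie-breaking between equally frequent tokens may differ from Keras.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Lowercase, replace '<br />' with a space, and drop punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocabulary(texts, max_tokens):
    """Most frequent tokens first; index 0 is padding and index 1 is unknown."""
    counts = Counter(token for text in texts for token in standardize(text).split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    """Map tokens to integer ids, then truncate or zero-pad to a fixed length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

# Two made-up reviews, for illustration only.
reviews = ["Great movie!<br /><br />Great cast.", "Terrible movie."]
vocabulary = build_vocabulary(reviews, max_tokens=10)
print(vocabulary)
print(vectorize("Great plot, terrible movie", vocabulary, sequence_length=6))
```

In the second call, `plot` is not in the vocabulary and maps to id 1, and the sequence is padded with zeros up to length 6.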


### Classification Model 

In [15]:
embedding_dim = 16

model = Sequential()

In [16]:
model.add(vectorize_layer)
model.add(Embedding(vocab_size, embedding_dim, name='embedding_2'))
model.add(GlobalAveragePooling1D())


model.add(Dense(16, activation='relu'))
model.add(Dense(1))
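
In the model above, `GlobalAveragePooling1D` collapses the sequence axis by averaging the embedding vectors over all positions, so every review becomes one fixed-length vector regardless of its length. Below is a minimal plain-Python sketch of that averaging for a single sequence; the function name and the toy vectors are hypothetical, and masking of padded positions is not modeled.

```python
def global_average_pool(sequence_vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    steps = len(sequence_vectors)
    return [sum(values) / steps for values in zip(*sequence_vectors)]

# Three token embeddings of dimension 2 (hypothetical values).
embedded_review = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
pooled = global_average_pool(embedded_review)
print(pooled)  # [2.0, 5.0]
```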

### Compiling and Training

In [17]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='logs')

In [18]:
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), metrics=['accuracy'])
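
Because the final `Dense(1)` layer has no activation, the model outputs raw logits, and `from_logits=True` tells the loss to apply the sigmoid itself. The sketch below shows that computation for a single example in plain Python; the function names are illustrative and not part of Keras.

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def binary_crossentropy_from_logit(label, logit):
    """Binary cross-entropy for one example, given a raw logit."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

print(sigmoid(0.0))  # 0.5
print(binary_crossentropy_from_logit(1, 0.0))
```

A logit of 0 corresponds to a probability of 0.5 and a loss of ln 2 (about 0.693), which is roughly where an untrained binary classifier starts. To turn the trained model's outputs into probabilities at prediction time, apply a sigmoid to them.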

In [19]:
model.fit(train_ds, validation_data=val_ds, epochs=15, callbacks=[tensorboard_callback])

Epoch 1/15


2024-12-07 17:16:45.242201: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


20/20 ━━━━━━━━━━━━━━━━━━━━ 3s 101ms/step - accuracy: 0.5027 - loss: 0.6914 - val_accuracy: 0.4884 - val_loss: 0.6842
Epoch 2/15
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 54ms/step - accuracy: 0.5029 - loss: 0.6810 - val_accuracy: 0.4912 - val_loss: 0.6694
Epoch 3/15
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 57ms/step - accuracy: 0.5042 - loss: 0.6634 - val_accuracy: 0.4942 - val_loss: 0.6463
Epoch 4/15
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 57ms/step - accuracy: 0.5174 - loss: 0.6363 - val_accuracy: 0.5552 - val_loss: 0.6151
Epoch 5/15
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 54ms/step - accuracy: 0.5943 - loss: 0.6009 - val_accuracy: 0.6406 - val_loss: 0.5793
Epoch 6/15
20/20 ━━━━━━━━━━━━━━━━━━━━ 1s 54ms/step - accuracy: 0.6756 - loss: 0.5609 - val_accuracy: 0.7002 - val_loss: 0.5431
Epoch 7/15
20/20 ━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x1479584c0>

With this approach the model reaches a validation accuracy of around 80% (note that the model is overfitting since training accuracy is higher).

In [20]:
model.summary()

### Visualize the model metrics in TensorBoard

In [21]:
#docs_infra: no_execute
%load_ext tensorboard
%tensorboard --logdir logs

### Retrieve the trained word embeddings and save them to disk

In [23]:
weights = model.get_layer('embedding_2').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Write the weights to disk. To use the Embedding Projector, you upload two files in tab-separated format: a file of vectors (containing the embeddings) and a file of metadata (containing the words).

In [29]:
# The files are tab-separated, so use the .tsv extension
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding

        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + '\n')
        out_m.write(word + '\n')
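
Before uploading, it can help to confirm that the two files line up: one vector per word, and the same number of columns in every vector row. The checker below is a hypothetical helper (not part of TensorFlow or the Embedding Projector), demonstrated on two tiny made-up files written to a temporary directory.

```python
import os
import tempfile

def check_projector_files(vectors_path, metadata_path):
    """Return (rows, dims) after checking that the two TSV files line up."""
    with open(vectors_path, encoding="utf-8") as f:
        vectors = [line.rstrip("\n").split("\t") for line in f]
    with open(metadata_path, encoding="utf-8") as f:
        words = [line.rstrip("\n") for line in f]
    if len(vectors) != len(words):
        raise ValueError("vectors and metadata must have the same number of lines")
    dims = {len(row) for row in vectors}
    if len(dims) != 1:
        raise ValueError("every vector must have the same number of columns")
    return len(vectors), dims.pop()

# Demo on two tiny made-up files.
with tempfile.TemporaryDirectory() as tmp:
    vectors_path = os.path.join(tmp, "vectors.tsv")
    metadata_path = os.path.join(tmp, "metadata.tsv")
    with open(vectors_path, "w", encoding="utf-8") as f:
        f.write("0.1\t0.2\n0.3\t0.4\n")
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("[UNK]\nthe\n")
    rows, dims = check_projector_files(vectors_path, metadata_path)

print(rows, dims)  # 2 2
```

For the files written above, the same call should report one row per vocabulary entry (minus the skipped padding entry) and a column count equal to `embedding_dim`.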