-----------------------
#### Word embeddings 
--------------------------

- Objective

    - build a binary classification model
    - perform sentiment analysis on the IMDb dataset
    

**Data Download and Extraction:**

Downloads a sentiment analysis dataset (IMDb reviews) from a specified URL.
Extracts the dataset from the downloaded tar.gz file.

**Data Preparation:**

Locates the `train` directory and removes the unlabeled `unsup` folder so that only the `pos` and `neg` classes remain.
Loads the training data using `text_dataset_from_directory` from TensorFlow, splitting it into training and validation subsets.

**Data Preprocessing and Optimization:**

Defines the `custom_standardization` function to perform text preprocessing: converting text to lowercase, stripping HTML break tags, and removing punctuation.
Uses the `TextVectorization` layer to normalize, split, and map strings to integers, adapting it to the training data.
Sets up the `AUTOTUNE` constant and applies caching and prefetching to the training and validation datasets for optimized performance.

**Model Definition:**

Creates a text classification model using TensorFlow's Keras API.
Comprises layers for text vectorization, embedding, global average pooling, and two dense layers for classification.
Specifies the vocabulary size, sequence length, and embedding dimension.

**Model Training:**

Compiles and trains the defined model on the preprocessed training dataset.
Utilizes the fit method with specified parameters such as training data, validation data, number of epochs, and callbacks.

**TensorBoard Callback:**

The `tensorboard_callback` passed to `fit` is a `tf.keras.callbacks.TensorBoard` instance that writes training logs to the `logs` directory, so that training can be visualized in TensorBoard.

In [42]:
import io
import os
import re
import shutil
import string

import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization

#### Download the IMDb Dataset

In [4]:
%%time
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# 82 MB file
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", 
                                  url,
                                  untar       = True, 
                                  cache_dir   = r'D:\AI-DATASETS\02-MISC-large\keras\datasets',
                                  cache_subdir= '')

CPU times: total: 2min 33s
Wall time: 5min 21s


In [6]:
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [7]:
dataset_dir

'D:\\AI-DATASETS\\02-MISC-large\\keras\\datasets\\aclImdb'

**train/ directory**

- `pos` and `neg` folders with movie reviews labelled as positive and negative respectively. 

In [8]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [9]:
train_dir

'D:\\AI-DATASETS\\02-MISC-large\\keras\\datasets\\aclImdb\\train'

The train directory also contains an `unsup` folder of unlabeled reviews. Remove it before creating the training dataset; otherwise `text_dataset_from_directory` would treat it as a third class.

In [11]:
%%time
remove_dir = os.path.join(train_dir, 'unsup')

# Guard the removal so that the cell can be re-run after the folder is gone
# (an unguarded shutil.rmtree raises FileNotFoundError on the second run).
if os.path.isdir(remove_dir):
    shutil.rmtree(remove_dir)

Next, create a `tf.data.Dataset` using `tf.keras.utils.text_dataset_from_directory`.

Use the train directory to create both train and validation datasets with a split of 20% for validation.

In [13]:
%%time
batch_size = 1024
seed       = 123

train_ds = tf.keras.utils.text_dataset_from_directory(
                    train_dir, 
                    batch_size      = batch_size, 
                    validation_split= 0.2,
                    subset          = 'training', 
                    seed            = seed)

val_ds = tf.keras.utils.text_dataset_from_directory(
                    train_dir, 
                    batch_size      = batch_size, 
                    validation_split= 0.2,
                    subset          = 'validation', 
                    seed            = seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
CPU times: total: 4.11 s
Wall time: 8.87 s


Take a look at a few movie reviews and their labels (1: positive, 0: negative) from the train dataset.

In [16]:
%%time
for text_batch, label_batch in train_ds.take(1):
    for i in range(2):
        print(label_batch[i].numpy(), text_batch.numpy()[i])
        print()

0 b"Wow. Some movies just leave me speechless. This was undeniably one of those movies. When I left the theatre, not a single word came to my mouth. All I had was an incredible urge to slam my head against the theatre wall to help me forget about the last hour and a half. Unfortunately, it didn't work. Honestly, this movie has nothing to recommend. The humor was at the first grade level, at best, the acting was overly silly, and the plot was astronomically far-fetched. I hearby pledge never to see an other movie starring Chris Kattan or any other cast-member of SNL."

1 b'If any show in the last ten years deserves a 10, it is this rare gem. It allows us to escape back to a time when things were simpler and more fun. Filled with heart and laughs, this show keeps you laughing through the three decades of difference. The furniture was ugly, the clothes were colorful, and the even the drugs were tolerable. The hair was feathered, the music was accompanied by roller-skates, and in the words

In [17]:
# tf.data.AUTOTUNE asks tf.data to choose the prefetch buffer size
# dynamically at runtime, based on the available resources.
AUTOTUNE = tf.data.AUTOTUNE

# cache() keeps the dataset elements in memory after the first pass;
# prefetch() overlaps data preparation with model execution.
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds   = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

#### How are we going to create embeddings?

Given review text : "Some movies just leave me speechless.. any other cast-member of SNL"

1. first tokenize the text into words
2. assign a unique integer (think of it as a code) to every word
3. map each integer to a dense vector whose values are learned during training
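
The tokenize-then-number idea can be sketched in plain Python. This is only an illustration, not the Keras implementation: the variable names are made up for this example, and ids start at 2 on the assumption that 0 and 1 are reserved for padding and unknown words, as `TextVectorization` does in `int` mode.

```python
# Plain-Python illustration; names here are hypothetical, not part of Keras.
review = "Some movies just leave me speechless"

# Step 1: tokenize the text into lowercase words.
tokens = review.lower().split()

# Step 2: give every distinct word its own integer code.
# Ids 0 and 1 are left free for padding and unknown words.
word_to_id = {}
for word in tokens:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id) + 2

encoded = [word_to_id[word] for word in tokens]

print(tokens)   # ['some', 'movies', 'just', 'leave', 'me', 'speechless']
print(encoded)  # [2, 3, 4, 5, 6, 7]
```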

#### Using the Embedding layer

- The Embedding layer is a lookup table that maps integer indices (one per word) to dense vectors (their embeddings).
- The table is parameterized: its entries are weights that are learned during training.
- The dimensionality (width) of the embedding is a tunable hyperparameter, much like the number of neurons in a Dense layer. Experiment with it to find the setting that works best for a given task.

In [18]:
# Embed a 1,000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)




When you pass integers to an embedding layer, each integer is replaced by the corresponding vector from the embedding table:

In [19]:
result = embedding_layer(tf.constant([1, 2, 3]))
result.numpy()

array([[ 0.03098005,  0.03636936, -0.03422924,  0.02029983,  0.00572123],
       [ 0.02952042,  0.03769893,  0.04816072, -0.02836904,  0.01211315],
       [ 0.02682513, -0.03596522, -0.03820059,  0.02709473, -0.04366157]],
      dtype=float32)
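
Conceptually, this lookup is row selection from a weight matrix. The NumPy sketch below mimics it with a random stand-in table; the small uniform range is an assumption chosen to resemble the magnitudes printed above, and the values themselves will not match the layer's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the layer's weights: 1,000 rows (one per word id), 5 columns.
# The (-0.05, 0.05) range is an assumption for illustration.
embedding_table = rng.uniform(-0.05, 0.05, size=(1000, 5)).astype(np.float32)

word_ids = np.array([1, 2, 3])

# Looking up embeddings is row selection: one 5-dimensional vector per id.
vectors = embedding_table[word_ids]

print(vectors.shape)  # (3, 5)
```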

**For text data**


- For text or sequence problems, the Embedding layer accepts a 2D tensor of integers with shape (`samples`, `sequence_length`).

- Each row of this tensor is one sequence of integer word indices. All sequences within a batch must have the same length, but the length can differ from batch to batch.

- For example, batches of shape (32, 10) or (64, 15) can both be fed to the layer: 32 or 64 is the number of sequences in the batch, and 10 or 15 is the length of each sequence.

- The output has one more axis than the input, and the embedding vectors lie along this new last axis. A batch of shape (2, 3) produces an output of shape (2, 3, N), where N is the dimensionality of the embedding space, so the sequence structure is preserved.






In [20]:
import numpy as np
np.set_printoptions(linewidth=140)

In [21]:
result = embedding_layer(tf.constant([[0, 1, 2], 
                                      [3, 4, 5]]))
result.shape

TensorShape([2, 3, 5])

In [22]:
result

<tf.Tensor: shape=(2, 3, 5), dtype=float32, numpy=
array([[[-0.01677575,  0.0234295 , -0.00909972,  0.03595456,  0.04958868],
        [ 0.03098005,  0.03636936, -0.03422924,  0.02029983,  0.00572123],
        [ 0.02952042,  0.03769893,  0.04816072, -0.02836904,  0.01211315]],

       [[ 0.02682513, -0.03596522, -0.03820059,  0.02709473, -0.04366157],
        [ 0.00437683,  0.0455189 , -0.04694527,  0.03287429,  0.00334846],
        [-0.04294212,  0.00376356, -0.00407047,  0.00884806,  0.00669036]]], dtype=float32)>

- When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape (`samples`, `sequence_length`, `embedding_dimensionality`). 
- To convert this variable-length sequence into a fixed-size representation, there are several standard approaches: you could use an RNN, attention, or a pooling layer before passing it to a Dense layer.
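
Of these options, average pooling is the simplest to illustrate. The NumPy sketch below uses two made-up batches with different sequence lengths and shows that averaging over the sequence axis gives vectors of the same width for both.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 5

# Two hypothetical batches of already-embedded sequences.
batch_a = rng.normal(size=(32, 10, embedding_dim))  # 32 sequences of length 10
batch_b = rng.normal(size=(64, 15, embedding_dim))  # 64 sequences of length 15

# Average over axis 1 (the sequence axis): one vector per sequence.
pooled_a = batch_a.mean(axis=1)
pooled_b = batch_b.mean(axis=1)

print(pooled_a.shape)  # (32, 5)
print(pooled_b.shape)  # (64, 5)
```

Whatever the sequence length, the pooled vector has `embedding_dim` entries, which is what a Dense layer needs as input.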

#### Text preprocessing
Next, define the dataset preprocessing steps required for your sentiment classification model. 

Initialize a `TextVectorization` layer with the desired parameters to vectorize movie reviews. 

In [23]:
from tensorflow.keras.layers import TextVectorization

In [24]:
# Sample training data
train_texts = ["This is a sample sentence.", 
               "Another example sentence.", 
               "TensorFlow is great!"]

In [25]:
# Create a TextVectorization layer
text_vectorizer = TextVectorization(max_tokens            = 100, 
                                    output_mode           = 'int', 
                                    output_sequence_length= 5)

In [26]:
# Adapt the layer to your training text data
text_vectorizer.adapt(train_texts)




In [27]:
# Transform input text into numerical vectors
input_texts = ["Sample sentence for testing.", "TensorFlow example."]
numerical_vectors = text_vectorizer(input_texts)

In [28]:
# Print the texts that were vectorized
print("Input texts:")
print(input_texts)

Input texts:
['Sample sentence for testing.', 'TensorFlow example.']


In [29]:
print("\nNumerical vectors:")
print(numerical_vectors.numpy())


Numerical vectors:
[[6 2 1 1 0]
 [5 8 0 0 0]]
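
To read these vectors: id 0 is padding, id 1 stands for any word that was not seen during `adapt`, and every row is padded or truncated to `output_sequence_length`. The plain-Python sketch below reproduces the two rows using only the ids that can be read off the output above. The `encode` helper and the partial `known_ids` table are illustrative assumptions, not the layer's actual code or full vocabulary.

```python
# Ids read off the printed output above; the rest of the vocabulary is omitted,
# so this table only works for the two example sentences.
known_ids = {"sentence": 2, "tensorflow": 5, "sample": 6, "example": 8}
PAD_ID, UNK_ID = 0, 1
sequence_length = 5


def encode(text):
    # Lowercase and drop punctuation, roughly like the default standardization.
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    ids = [known_ids.get(word, UNK_ID) for word in cleaned.split()]
    # Truncate to sequence_length, then pad with zeros on the right.
    ids = ids[:sequence_length]
    return ids + [PAD_ID] * (sequence_length - len(ids))


print(encode("Sample sentence for testing."))  # [6, 2, 1, 1, 0]
print(encode("TensorFlow example."))           # [5, 8, 0, 0, 0]
```

"for" and "testing" never appear in the training sentences, which is why both map to 1.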


Back to the IMDb reviews: the following cells prepare the `TextVectorization` layer used by the model.

In [30]:
# Custom standardization: lowercase the text, replace HTML break tags
# '<br />' with a space, then strip all punctuation.
def custom_standardization(input_data):
    lowercase     = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')

    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation), '')
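
To see what these three steps do to a single review, here is a plain-Python mirror of the same logic. `standardize_py` is a hypothetical helper for illustration only; the model itself uses the TensorFlow version above.

```python
import re
import string


def standardize_py(text):
    # Same three steps as custom_standardization, for one Python string.
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)


print(standardize_py("Great movie!<br />Loved it."))  # great movie loved it
```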

In [31]:
# Vocabulary size and number of words in a sequence.
vocab_size      = 10000
sequence_length = 100

In [32]:
# Use the text vectorization layer to normalize, split, and map strings to
# integers. 
# Note that the layer uses the custom standardization defined above.
# Set output_sequence_length because not all samples have the same length:
# shorter reviews are padded with zeros and longer ones are truncated.
vectorize_layer = TextVectorization(
                        standardize           = custom_standardization,
                        max_tokens            = vocab_size,
                        output_mode           = 'int',
                        output_sequence_length= sequence_length
)

In [34]:
%%time
# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

CPU times: total: 2.38 s
Wall time: 5.55 s


#### Constructing a Sentiment Classification Model

- Use the Keras Sequential API to define the sentiment classification model, in this case a "Continuous Bag of Words" style model.

- The `TextVectorization` layer transforms strings into vocabulary indices. `vectorize_layer` has already been initialized and its vocabulary built by calling `adapt` on `text_ds`, so it can serve as the first layer of the end-to-end classification model and feed the transformed strings into the Embedding layer.

- The `Embedding` layer takes the integer-encoded text and looks up the embedding vector for each word index. These vectors are learned as the model trains. The lookup adds a dimension to the output array, so the resulting dimensions are (`batch`, `sequence`, `embedding`).

- The `GlobalAveragePooling1D` layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This is the simplest way to let the model handle inputs of varying length.

- The fixed-length output vector is then fed through a fully connected (`Dense`) layer with 16 hidden units.

- The last layer is densely connected with a single output node. It has no activation, so it outputs a raw score (logit) for each review.

In [35]:
embedding_dim = 16

model = Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),  # hidden layer (optional)
    Dense(1)                       # single logit for binary classification
])
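
To make the tensor shapes concrete, the NumPy sketch below walks random stand-in data through the same sequence of operations. All weights and token ids here are random placeholders, and the batch size of 4 is arbitrary. It traces shapes only and does not reproduce the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sequence_length, embedding_dim = 10000, 100, 16
batch = 4  # arbitrary batch size for illustration

# Random stand-ins for the trainable weights (learned in the real model).
embedding_table = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))
w1, b1 = rng.normal(size=(embedding_dim, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

# Stand-in for the vectorize_layer output: integer ids, shape (batch, sequence).
token_ids = rng.integers(0, vocab_size, size=(batch, sequence_length))

embedded = embedding_table[token_ids]       # (4, 100, 16): lookup per token
pooled = embedded.mean(axis=1)              # (4, 16): average over the sequence
hidden = np.maximum(pooled @ w1 + b1, 0.0)  # (4, 16): Dense + ReLU
logits = hidden @ w2 + b2                   # (4, 1): one raw score per review

print(embedded.shape, pooled.shape, hidden.shape, logits.shape)
```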

#### Compile and train the model

In [36]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [37]:
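# The model outputs logits, so accuracy is thresholded at 0.0 (probability 0.5).
# metrics=['accuracy'] would compare the raw logits against 0.5 instead.
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])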
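
Because the last layer has no activation, the loss is computed directly from logits (`from_logits=True`). The NumPy sketch below shows the usual numerically stable formulation of binary cross-entropy on logits. `bce_from_logits` is a hypothetical helper written for this illustration, not the Keras source, and the labels and logits are made-up values.

```python
import numpy as np


def bce_from_logits(labels, logits):
    # Stable form of -[z*log(s) + (1-z)*log(1-s)] with s = sigmoid(x):
    # max(x, 0) - x*z + log(1 + exp(-|x|)), averaged over the examples.
    x = np.asarray(logits, dtype=np.float64)
    z = np.asarray(labels, dtype=np.float64)
    per_example = np.maximum(x, 0.0) - x * z + np.log1p(np.exp(-np.abs(x)))
    return per_example.mean()


labels = np.array([1.0, 0.0, 1.0])   # made-up labels
logits = np.array([2.0, -1.0, 0.0])  # made-up model outputs
loss = bce_from_logits(labels, logits)

# A logit of 0 corresponds to a probability of exactly 0.5.
probabilities = 1.0 / (1.0 + np.exp(-logits))
print(loss, probabilities)
```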




In [38]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, 100)               0         
 Vectorization)                                                  
                                                                 
 embedding (Embedding)       (None, 100, 16)           160000    
                                                                 
 global_average_pooling1d (  (None, 16)                0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense (Dense)               (None, 16)                272       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 160289 (626.13 KB)
Trainable params: 16028
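
The per-layer counts in the summary can be reproduced by hand from the layer sizes. The arithmetic below assumes 4 bytes per parameter (float32), which matches the reported size.

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 16

embedding_params = vocab_size * embedding_dim               # one 16-d vector per token
dense_params = embedding_dim * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                        # weights + bias

total = embedding_params + dense_params + output_params
size_kb = total * 4 / 1024  # float32 parameters, 4 bytes each

print(embedding_params, dense_params, output_params, total)  # 160000 272 17 160289
print(round(size_kb, 2))                                     # 626.13
```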

In [39]:
%%time
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])

Epoch 1/15

Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
CPU times: total: 23.3 s
Wall time: 45.7 s


<keras.src.callbacks.History at 0x28c72920950>

In [41]:
# Uncomment the lines below to view the training logs in TensorBoard inside the notebook.
# %load_ext tensorboard
# %tensorboard --logdir logs

#### Retrieve the trained word embeddings and save them to disk
Next, retrieve the word embeddings learned during training. 

The embeddings are the weights of the Embedding layer in the model. The weights matrix has shape (`vocab_size`, `embedding_dimension`).

Obtain the weights from the model using `get_layer()` and `get_weights()`.

The `get_vocabulary()` function provides the vocabulary, which can be used to build a metadata file with one token per line.

In [40]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [41]:
weights.shape

(10000, 16)
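
The cells above retrieve the weights but do not write them out. One possible way to save them is sketched below: one tab-separated vector per line in one file and the matching token per line in another. The helper name, the file names `vectors.tsv` and `metadata.tsv`, and the tiny demo arrays are assumptions for illustration. In the notebook you would pass the `weights` and `vocab` variables instead.

```python
import io
import os
import tempfile

import numpy as np


def save_embeddings(weights, vocab, vectors_path, metadata_path):
    """Write one tab-separated vector and one token per line."""
    with io.open(vectors_path, "w", encoding="utf-8") as out_v, \
         io.open(metadata_path, "w", encoding="utf-8") as out_m:
        for index, word in enumerate(vocab):
            if index == 0:
                continue  # index 0 is the padding token, skip it
            out_v.write("\t".join(str(x) for x in weights[index]) + "\n")
            out_m.write(word + "\n")


# Tiny stand-in data so that the sketch runs on its own.
demo_weights = np.arange(12, dtype=np.float32).reshape(4, 3)
demo_vocab = ["", "[UNK]", "the", "and"]

# A temporary folder keeps the demo from cluttering the working directory.
out_dir = tempfile.mkdtemp()
vectors_path = os.path.join(out_dir, "vectors.tsv")
metadata_path = os.path.join(out_dir, "metadata.tsv")
save_embeddings(demo_weights, demo_vocab, vectors_path, metadata_path)
```

Skipping index 0 keeps the padding entry, whose token is an empty string, out of both files so that the two stay aligned line by line.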