# 4. Text data and preprocessing

Many of the most impressive recent advances in machine learning and artificial intelligence have emerged in the field of natural language processing. In addition to simpler tasks such as sentiment analysis or text classification, there has been considerable progress in more advanced tasks such as translation and question answering. Text processing involves several techniques of its own, which we shall look at next.

## Bag-of-words approach

In natural language processing tasks, the original data is typically in the form of strings or lists of strings. Neural networks cannot process strings directly, so the text must first be converted to arrays of numerical values. This preprocessing usually consists of a succession of distinct steps:

* *Text standardization*: convert everything to lowercase, remove punctuation and other special characters, etc.
* *Tokenization*: split the text into separate units or **tokens** (words, characters, N-grams ...), and set up a vocabulary
* *Vectorization*: associate a numerical vector with each of the tokens in the vocabulary
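
The three steps above can be sketched in plain Python on a toy corpus. This is only an illustration under simple assumptions (ASCII punctuation, whitespace tokenization, a multi-hot vector as the final representation); the helper names `standardize`, `tokenize`, `build_vocabulary` and `multi_hot` are made up for this sketch and are not part of any library.

```python
import re
import string
from collections import Counter

def standardize(text):
    """Text standardization: lowercase and strip ASCII punctuation."""
    text = text.lower()
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)

def tokenize(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens; index 0 is reserved for unknown words."""
    counts = Counter(tok for t in texts for tok in tokenize(standardize(t)))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 1)]
    return ["[UNK]"] + most_common

def multi_hot(text, vocabulary):
    """Vectorization: zeroes and ones marking which vocabulary tokens occur."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for tok in tokenize(standardize(text)):
        vector[index.get(tok, 0)] = 1  # unknown tokens fall back to index 0
    return vector

corpus = ["A great film!", "Not a great film.", "Terrible acting, terrible plot."]
vocab = build_vocabulary(corpus, max_tokens=6)
encoded = multi_hot("A great, GREAT plot twist", vocab)
print(vocab)
print(encoded)
```

Note that the repeated word "great" still contributes a single one to the vector: a multi-hot encoding records presence, not counts.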

To see how these methods are implemented in practice, we use the well-known IMDB movie review dataset, which can be downloaded from [the Stanford page of Andrew Maas](https://ai.stanford.edu/~amaas/data/sentiment/). The movie reviews are contained in two folders: one for training samples and one for testing. Each folder contains the subfolders "pos" and "neg", which hold the samples with the corresponding sentiment. Note that the train directory also contains an extra subfolder "unsup", which is not needed here and should be removed before continuing.
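
The "unsup" subfolder can be removed by hand, or with a few lines of Python as sketched here. The helper name `remove_unsup` is hypothetical, and the path in the commented-out call assumes the same directory layout as the rest of this section; adjust it to wherever the dataset was extracted.

```python
import shutil
from pathlib import Path

def remove_unsup(train_dir):
    """Delete the 'unsup' subfolder of train_dir; return True if it was removed."""
    unsup = Path(train_dir) / "unsup"
    if unsup.is_dir():
        shutil.rmtree(unsup)  # permanently deletes the folder and its contents
        return True
    return False

# Example call (assumed path, uncomment to run):
# remove_unsup("../../aclImdb/train/")
```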

In [1]:
from tensorflow import keras

batch_size = 32

train_ds = keras.utils.text_dataset_from_directory( 
    '../../aclImdb/train/', 
    validation_split=0.2, 
    subset="training", 
    seed=123,
    batch_size=batch_size)

val_ds = keras.utils.text_dataset_from_directory( 
    '../../aclImdb/train/', 
    validation_split=0.2, 
    subset="validation", 
    seed=123, # same seed as above!
    batch_size=batch_size)

test_ds = keras.utils.text_dataset_from_directory( 
    '../../aclImdb/test/', 
    batch_size=batch_size)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


Below we take a look at one of the training samples. The integer-valued labels are binary: either zero (negative sentiment) or one (positive sentiment).

In [2]:
for inputs, targets in train_ds:
    print(inputs.shape, targets.shape)
    print(inputs.dtype, targets.dtype)
    print(inputs[0])
    print(targets[0])
    break

(32,) (32,)
<dtype: 'string'> <dtype: 'int32'>
tf.Tensor(b'After, I watched the films... I thought, "Why the heck was this film such a high success in the Korean Box Office?" Even thought the movie had a clever/unusal scenario, the acting wasn\'t that good and the characters weren\'t very interesting. For a Korean movie... I liked the fighting scenes. If you want to watch a film without thinking, this is the film for you. But I got to admit... the film was kind of childish... 6/10', shape=(), dtype=string)
tf.Tensor(1, shape=(), dtype=int32)


For basic text preprocessing, Keras offers a `TextVectorization` layer, which converts raw strings into numerical vectors. The code cell below does the following:

* transforms the text to lowercase
* removes punctuation
* does word-level tokenization by splitting on whitespace
* sets up a vocabulary of tokens
* outputs the text converted to a vector of zeroes and ones.  

In [3]:
from tensorflow.keras.layers import TextVectorization

max_tokens = 10000 # Maximum vocabulary size 

vectorization_layer = TextVectorization( 
    max_tokens=max_tokens, 
    output_mode='multi_hot'
)

# Adapt the layer to the text data 
train_texts = train_ds.map(lambda x, y: x) 
vectorization_layer.adapt(train_texts)

# Apply the vectorization to the datasets 
train_ds_bow = train_ds.map(lambda x, y: (vectorization_layer(x), y)) 
val_ds_bow = val_ds.map(lambda x, y: (vectorization_layer(x), y))
test_ds_bow = test_ds.map(lambda x, y: (vectorization_layer(x), y))

After this, the samples drawn from the datasets are no longer strings, but vectors with as many elements as there are tokens in the vocabulary; in our example, we restrict the vocabulary size to 10000, keeping only the most commonly encountered tokens in the training texts. Each vector element is either one or zero, depending on whether or not the text contains that particular token.

In [4]:
for inputs, targets in train_ds_bow:
    print(inputs.shape, targets.shape)
    print(inputs.dtype, targets.dtype)
    print(inputs[0])
    print(targets[0])
    break

(32, 10000) (32,)
<dtype: 'int64'> <dtype: 'int32'>
tf.Tensor([1 1 1 ... 0 0 0], shape=(10000,), dtype=int64)
tf.Tensor(0, shape=(), dtype=int32)


This is a very simple example of a *bag-of-words* approach to natural language processing: the word vector merely indicates the presence or absence of a set of words in the input text. Let us now build and train a very simple fully connected classifier for the sentiment analysis task:

In [5]:
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Input, Dense, Dropout

model = Sequential([
    Input(shape=(max_tokens,)),
    Dense(units=16, activation='relu'),
    Dropout(0.5),
    Dense(units=1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.fit(train_ds_bow, epochs=10, validation_data=val_ds_bow)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 6s 9ms/step - accuracy: 0.7650 - loss: 0.4964 - val_accuracy: 0.8828 - val_loss: 0.2917
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.8916 - loss: 0.2934 - val_accuracy: 0.8880 - val_loss: 0.2841
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9072 - loss: 0.2571 - val_accuracy: 0.8860 - val_loss: 0.3023
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9158 - loss: 0.2417 - val_accuracy: 0.8900 - val_loss: 0.3157
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9210 - loss: 0.2290 - val_accuracy: 0.8938 - val_loss: 0.3255
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9249 - loss: 0.2346 - val_accuracy: 0.8884 - val_loss: 0.3441
Epoch 7/10
625/625

<keras.src.callbacks.history.History at 0x1a5cd65d760>

Let us test the classifier:

In [6]:
print(f"Test accuracy: {model.evaluate(test_ds_bow)[1]:.4f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 5s 7ms/step - accuracy: 0.8797 - loss: 0.3700
Test accuracy: 0.8798


We can also inspect the vocabulary formed during the text vectorization process. The first element ([UNK]) of the vocabulary list is reserved for out-of-vocabulary words, that is, any word that did not make it into the vocabulary built from the training texts.

In [7]:
vocabulary = vectorization_layer.get_vocabulary()
print(vocabulary[0:20]) # [UNK] followed by the 19 most common tokens

['[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i', 'this', 'that', 'br', 'was', 'as', 'with', 'for', 'movie', 'but', 'film']



## Word embeddings

Representing texts as multi-hot vectors of zeroes and ones ignores many aspects of natural language, such as the order of words in the text and the semantic relationships between different words. The vector representation is also very sparse (the overwhelming majority of the vector elements are zero) and therefore wasteful. A more useful alternative is to use dense encodings of floating-point numbers with lower dimensionality. This approach is called **word embeddings**.
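
To make the loss of word order concrete, the following sketch encodes two sentences with opposite meanings over a tiny, made-up vocabulary. The vocabulary and the helper `multi_hot` are assumptions for illustration only; the point is that both sentences map to exactly the same vector.

```python
# Toy vocabulary (hypothetical); index 0 is reserved for unknown words
vocabulary = ["[UNK]", "the", "dog", "bites", "man", "film", "great", "bad"]
index = {tok: i for i, tok in enumerate(vocabulary)}

def multi_hot(text):
    """Mark which vocabulary tokens occur in the text, ignoring order and counts."""
    vector = [0] * len(vocabulary)
    for tok in text.lower().split():
        vector[index.get(tok, 0)] = 1
    return vector

a = multi_hot("the dog bites the man")
b = multi_hot("the man bites the dog")

same_vector = (a == b)               # word order is lost
fraction_zero = a.count(0) / len(a)  # share of zero entries in this toy case
print(same_vector, fraction_zero)
```

With a vocabulary of 10000 tokens and reviews of a few hundred words, the share of zero entries is far larger than in this toy case.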

With word embeddings, each word/token is associated with a vector in an embedding space whose dimension is small compared to the vocabulary size. The components of these vectors are first initialized randomly and then treated as trainable parameters. During training, the values of these components change iteratively as they absorb information about the dataset and the relationships between words.

In Keras, the word vector components are arranged in a special `Embedding` layer, as shown in the example implementation below. First, we create a new vectorization layer, which now outputs texts as lists of integer-valued token indices, and adapt this to the training texts. Since the text samples are of varying length, we define a `sequence_length` variable to restrict all lists to the same length (either by truncating long samples, or padding the short ones with zeroes).

In [8]:
# Define the TextVectorization layer 
max_tokens = 10000 # Maximum vocabulary size 
sequence_length = 250 # Maximum sequence length 

vectorization_layer = TextVectorization( 
    max_tokens=max_tokens, 
    output_mode='int', 
    output_sequence_length=sequence_length )

train_texts = train_ds.map(lambda x, y: x) 
vectorization_layer.adapt(train_texts)

train_ds_int = train_ds.map(lambda x, y: (vectorization_layer(x), y)) 
val_ds_int = val_ds.map(lambda x, y: (vectorization_layer(x), y))
test_ds_int = test_ds.map(lambda x, y: (vectorization_layer(x), y))

This is how the preprocessed samples now appear: 

In [9]:
for inputs, targets in train_ds_int:
    print(inputs.shape, targets.shape)
    print(inputs.dtype, targets.dtype)
    print(inputs[0])
    print(targets[0])
    break

(32, 250) (32,)
<dtype: 'int64'> <dtype: 'int32'>
tf.Tensor(
[  45   23  174    6   66    4   18   43  106 1192 6974  102   11    7
    2   29    2  114    7  882  191   37  310 6183   15    2 1982 8703
  905   15    2  652  260   35   63   24   46    6   77    3   37  847
 4206   15   25 7975  115  471    6  905   28  492   17 2082  200 1339
   19  233  487  191 8840    1    3    1    1    6 4624   30  266    6
 1111  905    6 3560   17  325   60 2732    1  195   96   26 1738  242
   16   11   29   10  196    9  215   12    8    4 4608 2606   16 1531
  497    1 1912   21    2  289   28    1   12    2  106  414   68   53
  269  956   43    2  230   12    2  172   14  738 1132    6  163   16
   12    2  172    5 1323   68 1132 3920   91    5   89   12    2 1023
   14 1316 1601   43  105   91    6  842  183 1061    3   38   21   11
 1275    5   29  200   53 4908  225  254  614  676   21    2  288    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0 
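
In the sample printed above, the trailing zeroes are padding and the entries equal to 1 mark out-of-vocabulary tokens: in `'int'` output mode, Keras reserves index 0 for padding and index 1 for [UNK]. The same truncate-or-pad logic can be sketched in plain Python as follows; the helper `to_int_sequence` and the toy vocabulary are hypothetical and only mimic that index layout.

```python
def to_int_sequence(text, vocabulary, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length.

    Assumes vocabulary[0] is the padding entry and vocabulary[1] is [UNK].
    """
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in text.lower().split()]  # 1 = unknown
    ids = ids[:sequence_length]                                # truncate long samples
    return ids + [0] * (sequence_length - len(ids))            # pad short samples

toy_vocabulary = ["", "[UNK]", "the", "film", "was", "great"]

short_ids = to_int_sequence("the film was great", toy_vocabulary, 6)
long_ids = to_int_sequence(
    "the film was truly great and the ending was great", toy_vocabulary, 6)

print(short_ids)  # [2, 3, 4, 5, 0, 0]
print(long_ids)   # [2, 3, 4, 1, 5, 1]
```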

Now that we have preprocessed our text samples to integer lists of fixed length, they can be directly inserted into a Keras Embedding layer, which converts the integer lists to floating-point valued tensors of shape (sequence length, embedding dimension). 
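
Conceptually, the embedding lookup is nothing more than row indexing into a matrix with one row per token index. The NumPy sketch below illustrates this with small, made-up dimensions; the normal-distributed initial values are an arbitrary choice for the illustration, not a statement about how Keras initializes the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, sequence_length = 10, 4, 6  # toy sizes

# One row of trainable values per token index
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

# One preprocessed sample: a fixed-length list of token indices
token_ids = np.array([2, 3, 4, 1, 5, 0])

# Looking up the embeddings is plain row indexing
embedded = embedding_matrix[token_ids]       # shape (sequence_length, embed_dim)

# A batch of samples works the same way and gains a leading batch axis
batch_ids = np.stack([token_ids, token_ids[::-1]])
embedded_batch = embedding_matrix[batch_ids]  # shape (2, sequence_length, embed_dim)

print(embedded.shape, embedded_batch.shape)
```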

As can be seen in the cell below, the two-dimensional tensors from the Embedding layer are first converted to one-dimensional ones with shape (sequence length$\cdot$embedding dimension, ) using a Reshape layer. This is then followed up by a fully connected classifier consisting of two Dense layers: one hidden layer plus the final output layer. The Dropout layers are added to reduce overfitting, and the model is trained for ten epochs.      

In [10]:
from tensorflow.keras.layers import Embedding, Dropout, Reshape

embed_dim = 64 # dimension of the word embeddings

model = Sequential([
    Input(shape=(sequence_length,)),
    Embedding(input_dim=max_tokens, output_dim=embed_dim),
    Reshape((sequence_length * embed_dim,)),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(units=1, activation='sigmoid')
])

model.summary()

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(train_ds_int, epochs=10, validation_data=val_ds_int)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 9s 13ms/step - accuracy: 0.5825 - loss: 0.6529 - val_accuracy: 0.7516 - val_loss: 0.5372
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 8s 13ms/step - accuracy: 0.8447 - loss: 0.3626 - val_accuracy: 0.8482 - val_loss: 0.3550
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 8s 12ms/step - accuracy: 0.9100 - loss: 0.2292 - val_accuracy: 0.8468 - val_loss: 0.4008
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 8s 12ms/step - accuracy: 0.9444 - loss: 0.1434 - val_accuracy: 0.8368 - val_loss: 0.4991
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 8s 13ms/step - accuracy: 0.9672 - loss: 0.0924 - val_accuracy: 0.8320 - val_loss: 0.5941
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 8s 13ms/step - accuracy: 0.9758 - loss: 0.0658 - val_accuracy: 0.8332 - val_loss: 0.6362
Epoch 7/10
625/625

<keras.src.callbacks.history.History at 0x1a5cd762e70>

The validation accuracy has not improved over the much simpler bag-of-words model; in the epochs shown it peaks at about 0.85, compared with about 0.89 for the bag-of-words model, while the training accuracy keeps climbing, which is a sign of overfitting. This is not very surprising: the sentiment of a review can be expected to be reasonably well determined merely by the kinds of words it contains. However, problems that require deeper insight into the subtle meanings of written text often call for models that are able to take word order into account.
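
In the training log above, the training accuracy of the embedding model keeps rising while its validation loss grows. One plausible contributor is simply model size, which can be checked with a little arithmetic: a Dense layer has one weight per input-unit pair plus one bias per unit, and an Embedding layer has one value per vocabulary entry and embedding dimension. The helper `dense_params` is made up for this sketch; the layer sizes are those used in the two models of this section.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a Dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

max_tokens, sequence_length, embed_dim = 10000, 250, 64

# Bag-of-words model: Dense(16) -> Dropout -> Dense(1)
bow_params = dense_params(max_tokens, 16) + dense_params(16, 1)

# Embedding model: Embedding -> Reshape -> Dropout -> Dense(32) -> Dropout -> Dense(1)
embedding_params = (
    max_tokens * embed_dim                           # embedding matrix
    + dense_params(sequence_length * embed_dim, 32)  # hidden Dense layer
    + dense_params(32, 1)                            # output layer
)

print(bow_params)        # 160033
print(embedding_params)  # 1152065
```

The embedding model thus has roughly seven times as many trainable parameters, fitted to the same 20000 training reviews. Dropout and Reshape layers add no parameters.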

Also, it is interesting to see that, after training our model, the word vectors in the Embedding layer encode semantic information about the relationships between words. To see that, first we extract the word vector components from the embedding layer, and construct the dictionaries from the vocabulary: 

In [21]:
# Extract the embedding weights from the trained model 
embedding_layer = model.layers[0] # the first layer of the model
embedding_weights = embedding_layer.get_weights()[0] # shape (10000, 64)

# Build lookup tables between words and indices from the vectorization layer's vocabulary
vocabulary = vectorization_layer.get_vocabulary() 
index_word = {idx: word for idx, word in enumerate(vocabulary)}
word_index = {word: idx for idx, word in enumerate(vocabulary)}

The following code snippet defines a function that takes a word as input and compares its word vector to those of all the words in the vocabulary using cosine similarity, that is, the dot product of the two vectors divided by the product of their lengths. The function then returns the `top_n` words with the most similar representations, excluding the word itself.

In [46]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity 

# Function to find similar words 
def find_similar_words(target_word, top_n=10): 
    if target_word not in word_index: 
        return f'Word "{target_word}" not in vocabulary' 
    target_idx = word_index[target_word] 
    target_embedding = embedding_weights[target_idx].reshape(1, -1) # shape (1, 64)
    similarities = cosine_similarity(embedding_weights, target_embedding).reshape(-1) # shape (10000,)
    similar_indices = np.argsort(similarities)[-top_n-1:-1][::-1] 
    similar_words = [index_word[idx] for idx in similar_indices] 
    return similar_words 
    
word_to_check = 'bad'
print(f'Words similar to "{word_to_check}": {find_similar_words(word_to_check)}')

Words similar to "bad": ['awful', 'worst', 'unfunny', 'wasted', 'waste', 'avoid', 'garbage', 'horrible', 'worse', 'lame']


This output suggests that the model has learned something about how the words resemble each other, at least with respect to the sentiment they express.
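
As a small aside, the difference between the cosine similarity and the plain dot product can be seen with hand-picked vectors. The sketch below uses NumPy only, and the helper `cosine` and the vectors are made up for illustration: the cosine depends only on the direction of the vectors, whereas the dot product also grows with their lengths.

```python
import numpy as np

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([10.0, 20.0, 30.0])  # same direction as u, ten times longer
w = np.array([3.0, 0.0, -1.0])    # orthogonal to u

dot_uv = float(np.dot(u, v))  # large, because v is long
cos_uv = cosine(u, v)         # close to 1: identical direction
cos_uw = cosine(u, w)         # close to 0: no similarity

print(dot_uv, cos_uv, cos_uw)
```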