# 4. Text data and preprocessing

Many of the most impressive recent advances in machine learning and artificial intelligence have come from natural language processing. Beyond simpler tasks such as sentiment analysis or text classification, there has been considerable progress in more demanding tasks such as translation and question answering. Text processing relies on several techniques of its own, which we look at next.

## Bag-of-words approach

In natural language processing tasks, the original data is typically a string or a list of strings. Neural networks cannot process strings directly, so the text must first be converted to arrays of numerical values. This preprocessing usually consists of a succession of distinct steps:

* *Text standardization*: convert everything to lowercase, remove punctuation and other special characters, etc.
* *Tokenization*: split the text into separate units or **tokens** (words, characters, N-grams ...), and set up a vocabulary
* *Vectorization*: associate a numerical vector with each of the tokens in the vocabulary
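The three steps above can be sketched in plain Python. This is only an illustration under simple assumptions (ASCII punctuation, whitespace tokenization, one-hot vectors); the helper names and the toy corpus are made up for this example and are not part of any library.

```python
import string

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()

def build_vocabulary(texts):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in tokenize(standardize(text)):
            vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(token, vocab):
    """Vectorization: a vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

texts = ["The movie was great!", "The plot was thin."]  # toy corpus (made up)
vocab = build_vocabulary(texts)
print(vocab)
print(one_hot("great", vocab))
```

One-hot vectors are only one possible choice for the vectorization step; the point here is that every token ends up with a fixed numerical representation.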

To see these methods in practice, we use the well-known IMDB movie review dataset, which can be downloaded from [the Stanford page of Andrew Maas](https://ai.stanford.edu/~amaas/data/sentiment/). The movie reviews are split into two folders: one for training samples and one for testing. Each folder contains the subfolders "pos" and "neg", which hold the samples with the corresponding sentiment. Note that the train directory also contains a subfolder "unsup", which is not needed here and should be removed before continuing; otherwise it would be picked up as a third class when the labels are inferred from the folder names.
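One way to remove the extra folder from Python is sketched below. The helper `remove_unsup` is a hypothetical name introduced here, and the path is assumed to match the relative path used in the loading code that follows; adjust it to wherever the archive was extracted.

```python
import shutil
from pathlib import Path

def remove_unsup(train_dir):
    """Delete the 'unsup' subfolder of train_dir if present; return True if removed."""
    unsup = Path(train_dir) / "unsup"
    if unsup.is_dir():
        shutil.rmtree(unsup)
        return True
    return False

# Assumed location of the extracted dataset; this is a guess about the local layout.
removed = remove_unsup("../../aclImdb/train")
print("removed unsup folder" if removed else "no unsup folder found")
```

The function does nothing if the folder is missing, so it is safe to run the cell more than once.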

In [16]:
from tensorflow import keras

batch_size = 32

train_ds = keras.utils.text_dataset_from_directory( 
    '../../aclImdb/train/', 
    validation_split=0.2, 
    subset="training", 
    seed=123,
    batch_size=batch_size)

val_ds = keras.utils.text_dataset_from_directory( 
    '../../aclImdb/train/', 
    validation_split=0.2, 
    subset="validation", 
    seed=123, # same seed as above!
    batch_size=batch_size)

test_ds = keras.utils.text_dataset_from_directory( 
    '../../aclImdb/test/', 
    batch_size=batch_size)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
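The file counts reported above follow directly from the split fraction, and the same arithmetic gives the number of batches per epoch. The short check below only uses the numbers already shown (25000 files per directory, a 0.2 validation split, batch size 32); the variable names are ours.

```python
import math

n_train_files = 25000   # files found in the train directory
n_test_files = 25000    # files found in the test directory
validation_split = 0.2
batch_size = 32

n_val = round(n_train_files * validation_split)
n_train = n_train_files - n_val

# The last batch may be smaller than batch_size, hence the ceiling.
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
test_batches = math.ceil(n_test_files / batch_size)

print(n_train, n_val)
print(train_batches, val_batches, test_batches)
```

The batch counts for the training and test sets are the step counts one would expect to see in the progress bars during training and evaluation.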


Below we take a look at one of the training samples. The integer-valued labels are binary: either zero (negative sentiment) or one (positive sentiment).

In [11]:
for inputs, targets in train_ds:
    print(inputs.shape, targets.shape)
    print(inputs.dtype, targets.dtype)
    print(inputs[0])
    print(targets[0])
    break

(32,) (32,)
<dtype: 'string'> <dtype: 'int32'>
tf.Tensor(b'This movie is a re-write of the 1978 Warren Beatty movie, "Heaven Can Wait", but it is written for the stand-up comedic style of Mr Rock. The premise remains the same: Lance Barton, (Rock) is taken before his life time is up and works a deal with God\'s representative, Mr King, to come back to earth as someone else. As in Beatty\'s movie; he chooses the murdered Charles Wellington, a rich white man, all because he fancies Sontee Jenkins (Regina King) who happens to turn up at Wellington\'s house during the murder. The role of Mrs Wellington and her lover suffers in this remake and the idea to turn an aged white multi-millionaire into a stand up black comedian who tries to woo Sontee simply does not work. Also the intercuts used to show Rock as Wellington and then as the real \'white\' Wellington, fail miserably. Improvements could have been made to the original Beatty plot - which in itself did not masterfully portray the life-

For basic text preprocessing, Keras offers a `TextVectorization` layer, which converts raw strings into numerical vectors. The code cell below does the following:

* transforms the text to lowercase
* removes punctuation
* does word-level tokenization by splitting on whitespace
* sets up a vocabulary of tokens
* outputs the text converted to a vector of zeroes and ones.  

In [17]:
from tensorflow.keras.layers import TextVectorization

max_tokens = 10000 # Maximum vocabulary size 

vectorization_layer = TextVectorization( 
    max_tokens=max_tokens, 
    output_mode='multi_hot'
)

# Adapt the layer to the text data 
train_texts = train_ds.map(lambda x, y: x) 
vectorization_layer.adapt(train_texts)

# Apply the vectorization to the datasets 
train_ds = train_ds.map(lambda x, y: (vectorization_layer(x), y)) 
val_ds = val_ds.map(lambda x, y: (vectorization_layer(x), y))
test_ds = test_ds.map(lambda x, y: (vectorization_layer(x), y))

After this, the samples drawn from the datasets are no longer strings but vectors with as many elements as there are tokens in the vocabulary; in our example, the vocabulary is restricted to the 10000 most common tokens in the training texts. Each vector element is either one or zero, depending on whether the text contains that particular token.
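To make the encoding concrete, here is a minimal pure-Python sketch of multi-hot vectorization with a capped vocabulary. It is not the Keras implementation: the helper names and toy texts are made up, and reserving index 0 for unknown tokens is an assumption chosen to resemble how a vocabulary limit is commonly handled.

```python
from collections import Counter

def tokenize(text):
    """Lowercase, drop punctuation characters, split on whitespace."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()

def build_vocab(texts, max_tokens):
    """Index 0 is reserved for unknown tokens; the rest go to the most frequent ones."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {"[UNK]": 0}
    for tok, _ in counts.most_common(max_tokens - 1):
        vocab[tok] = len(vocab)
    return vocab

def multi_hot(text, vocab):
    """Set position i to 1 if vocabulary token i occurs in the text at least once."""
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        vec[vocab.get(tok, 0)] = 1
    return vec

train_texts = ["good film, good cast", "bad film", "good plot"]  # toy data (made up)
vocab = build_vocab(train_texts, max_tokens=3)
print(vocab)
print(multi_hot("Good film!", vocab))
print(multi_hot("film good good", vocab))  # same vector: order and counts are lost
print(multi_hot("terrible cast", vocab))   # words outside the vocabulary share slot 0
```

Note that the vocabulary is built from the training texts only, just as the layer in the cell above is adapted on `train_texts` and then applied unchanged to the validation and test sets.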

In [18]:
for inputs, targets in train_ds:
    print(inputs.shape, targets.shape)
    print(inputs.dtype, targets.dtype)
    print(inputs[0])
    print(targets[0])
    break

(32, 10000) (32,)
<dtype: 'int64'> <dtype: 'int32'>
tf.Tensor([1 1 1 ... 0 0 0], shape=(10000,), dtype=int64)
tf.Tensor(1, shape=(), dtype=int32)


This is a very simple example of a *bag-of-words* approach to natural language processing: the vector merely indicates which vocabulary words are present in the input text, ignoring their order and frequency. Let us now build and train a very simple fully connected classifier for the sentiment analysis task:

In [20]:
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Input, Dense, Dropout

model = Sequential([
    Input(shape=(max_tokens,)),
    Dense(units=16, activation='relu'),
    Dropout(0.5),
    Dense(units=1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

history = model.fit(train_ds, epochs=10, validation_data=val_ds)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 6s 8ms/step - accuracy: 0.7437 - loss: 0.5116 - val_accuracy: 0.8856 - val_loss: 0.2893
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.8853 - loss: 0.3065 - val_accuracy: 0.8874 - val_loss: 0.2859
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9042 - loss: 0.2640 - val_accuracy: 0.8856 - val_loss: 0.2996
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9136 - loss: 0.2498 - val_accuracy: 0.8886 - val_loss: 0.3107
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9186 - loss: 0.2508 - val_accuracy: 0.8836 - val_loss: 0.3244
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 8ms/step - accuracy: 0.9189 - loss: 0.2423 - val_accuracy: 0.8824 - val_loss: 0.3407
Epoch 7/10
625/625

Let us test the classifier:

In [21]:
print(f"Test accuracy: {model.evaluate(test_ds)[1]:.4f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 5s 7ms/step - accuracy: 0.8795 - loss: 0.3639
Test accuracy: 0.8769
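
Looking back at the training log, the validation loss stops improving early while the training accuracy keeps rising, which suggests that the model starts to overfit after the first couple of epochs. The small check below simply copies the validation losses of the epochs shown in the log (1 to 6) and finds the epoch with the lowest value; it is an illustration, not part of the original training code.

```python
# Validation loss per epoch, copied from the training log above (epochs 1 to 6).
val_loss = [0.2893, 0.2859, 0.2996, 0.3107, 0.3244, 0.3407]

# Epoch numbers start at 1, list indices at 0.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(f"lowest validation loss at epoch {best_epoch}: {val_loss[best_epoch - 1]}")

# Check whether the validation loss rises monotonically after its minimum.
tail = val_loss[best_epoch - 1:]
rising = all(a < b for a, b in zip(tail, tail[1:]))
print("validation loss increases after the best epoch:", rising)
```

In practice one might stop training around that point, for instance with Keras's `EarlyStopping` callback monitoring `val_loss`; whether that improves the test accuracy here is not something the results above can tell us.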
