# 2 approaches for representing groups of words: Sets and sequences

## 11.3.1 Preparing the IMDB movie reviews data

### Download dataset and uncompress

In [1]:
! cat ../datasets/aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

prepare a validation set: move 20% of the training text files to a new directory

In [6]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("/home/user/development/datasets/aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
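    os.makedirs(val_dir / category)  # raises FileExistsError if the split was already created
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)  # fixed seed for a repeatable shuffle
    num_val_samples = int(0.2 * len(files))  # hold out 20% of the files for validation
    val_files = files[-num_val_samples:]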
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

FileExistsError: [Errno 17] File exists: '/home/user/development/datasets/aclImdb/val/neg'
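
The FileExistsError above is what os.makedirs raises when the val directory already exists, which is what happens when the cell is run a second time. A rerun-safe variant could look like the sketch below. It assumes the same train/val layout; the function name `make_validation_split` and its parameters are hypothetical and not part of the original notebook.

```python
import os
import pathlib
import random
import shutil


def make_validation_split(base_dir, val_fraction=0.2, seed=1337):
    """Move a fraction of train/<category> files to val/<category>, once.

    Hypothetical helper: a category whose val directory already exists is
    skipped, so rerunning never moves a second batch of files.
    Returns the number of files moved per category.
    """
    base_dir = pathlib.Path(base_dir)
    moved = {}
    for category in ("neg", "pos"):
        train_cat = base_dir / "train" / category
        val_cat = base_dir / "val" / category
        if val_cat.exists():
            moved[category] = 0  # split already done; leave the files in place
            continue
        val_cat.mkdir(parents=True)
        files = sorted(os.listdir(train_cat))  # sorted so the seeded shuffle is reproducible
        random.Random(seed).shuffle(files)
        num_val_samples = int(val_fraction * len(files))
        # files[-0:] would select everything, so guard the empty case explicitly
        val_files = files[-num_val_samples:] if num_val_samples else []
        for fname in val_files:
            shutil.move(str(train_cat / fname), str(val_cat / fname))
        moved[category] = len(val_files)
    return moved
```

Skipping an existing directory is a design choice: passing exist_ok=True to os.makedirs alone would silence the error but move another 20% of the remaining training files on every rerun.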

### text_dataset_from_directory utility: create a batched Dataset of texts and their labels from a directory structure

In [7]:
from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    base_dir / "train", batch_size=batch_size
)

val_ds = keras.utils.text_dataset_from_directory(
    base_dir / "val", batch_size=batch_size
)

test_ds = keras.utils.text_dataset_from_directory(
    base_dir / "test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [8]:
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'If this movie was about a fictional character, the movie could stand on its own and be judged objectively. Unfortunately for the viewer, the movie is based on "facts" that are shaded very unfairly toward Ruben Carter. Many of the smaller facts were disregarded (Carter was NOT number one contender at the time of the murders, there is no proof at all that he saved a friend from a child molester in his youth), but some of the larger facts, like apparently being robbed of a decision to Joey Giardello because of "racist" judges, is inexcusable to those of us who have seen the fight on tape, and completely disrespectful to Giardello. Why Hollywood feels the need to make a hero out of someone who, at best, was in trouble and around trouble much more than any normal person should be (was arrested multiple times for beating women) is strange to me. Ruben Carter was nev

## 11.3.2 Processing words as a set: the bag-of-words approach

simplest way to encode a piece of text: discard order and treat it as a bag of tokens

let's process the raw text datasets with a TextVectorization layer so that they yield multi-hot encoded binary word vectors: each vector has one dimension per vocabulary word, with 0s almost everywhere and 1s for the dimensions that encode words present in the text
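
To make the multi-hot idea concrete, here is a minimal pure-Python sketch. It is not how TextVectorization is implemented: the helper names `build_vocabulary` and `multi_hot` are hypothetical, tokenization is a plain lowercase whitespace split (no punctuation stripping), and reserving index 0 for unknown words is an assumption of this sketch.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Index the most frequent words; index 0 is reserved for unknown words."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 1)]
    return {word: i + 1 for i, word in enumerate(most_common)}


def multi_hot(text, vocabulary, max_tokens):
    """Return a vector with 1.0 at the index of every word present in text."""
    vector = [0.0] * max_tokens
    for word in text.lower().split():
        vector[vocabulary.get(word, 0)] = 1.0  # unknown words share index 0
    return vector


texts = ["the cat sat on the mat", "the dog sat"]
max_tokens = 6
vocabulary = build_vocabulary(texts, max_tokens)
# "the" and "sat" are the two most frequent words; "bird" is unknown
vector = multi_hot("the bird sat", vocabulary, max_tokens)
print(vector)
```

Word order and repetition are both discarded: "the bird sat" and "sat the bird the" map to the same vector.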

### unigrams with binary encoding

In [9]:
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    max_tokens=20000,   # limit the vocabulary to the 20k most frequent words - 20k is a good size for text classification
    output_mode="multi_hot"
)
text_only_train_ds = train_ds.map(lambda x, y: x)  # prepare a dataset that only yields raw text inputs (no labels)
text_vectorization.adapt(text_only_train_ds)  # use dataset to index the dataset vocabulary via adapt()

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

inspect output of one of these datasets

In [9]:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)     
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


let's write a reusable model-building function that we'll use in all of our experiments 

In [12]:
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

let's train and test our model

In [13]:
model = get_model()
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),    # we call cache() on the datasets to cache them in memory - only do preprocessing once during 1st 
          validation_data=binary_1gram_val_ds.cache(),  # epoch and reuse preprocessed texts for the following epochs
          epochs=10,                                    # can only be done if the data is small enough to fit in memory
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1] :.3f}")

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 16)                320016    
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.885
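
The parameter counts in the summary above can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. A short sketch of that arithmetic, where the helper name `dense_param_count` is hypothetical:

```python
def dense_param_count(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units


hidden = dense_param_count(20000, 16)  # first Dense layer on the 20k-dim input
output = dense_param_count(16, 1)      # sigmoid output layer
total = hidden + output                # Dropout adds no parameters
print(hidden, output, total)
```

Almost all of the parameters sit in the first layer, because its input dimension equals the vocabulary size.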


### bigrams with binary encoding

e.g. "the cat sat on the mat" becomes {"the", "the cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the mat", "mat"}

TextVectorization layer can be configured to return arbitrary N-grams: bigrams, trigrams, etc
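
As an illustration of what a bag of bigrams contains, the sketch below builds all 1-grams and 2-grams of a sentence in plain Python. The function name `ngrams_up_to` is hypothetical and tokenization is a simple lowercase whitespace split, so this only approximates what the layer does internally.

```python
def ngrams_up_to(text, n):
    """Return all 1-grams through n-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = []
    for size in range(1, n + 1):
        # slide a window of `size` tokens over the sentence
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


# converting to a set drops the repeated "the", giving a bag of tokens
bag = set(ngrams_up_to("the cat sat on the mat", 2))
print(sorted(bag))
```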

In [16]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)

let's see how our model performs when trained on such binary encoded bags of bigrams - any difference to unigrams?

In [21]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")


Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_12 (Dense)            (None, 16)                320016    
                                                                 
 dropout_6 (Dropout)         (None, 16)                0         
                                                                 
 dense_13 (Dense)            (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.895


test accuracy improves from 0.885 (unigrams) to 0.895 (bigrams) -> local word order is pretty important

### Bigrams with TF-IDF encoding

add more information by counting how many times each word/n-gram occurs: a histogram of words over the text

e.g. {"the": 2, "the cat": 1, "cat": 1, "cat sat": 1, "sat": 1,
"sat on": 1, "on": 1, "on the": 1, "the mat": 1, "mat": 1}

count bigram occurrences with the TextVectorization layer

In [22]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

words like "the", "a", "is", "are" will always dominate the word-count histograms -> normalization (e.g. subtracting the mean) could solve it, but would wreck the sparsity of vectorized sentences, which consist almost entirely of zeros - computational effort increases

solution: TF-IDF = term frequency, inverse document frequency --> weight a term's frequency in the current sample by how rare the term is across the entire dataset (all documents) -> just change the output mode in the TextVectorization layer
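
A minimal sketch of the idea, using the classic unsmoothed form tf * log(N / df). The exact weighting that TextVectorization applies may differ (for example through smoothing), and the function name `tf_idf` and the toy dataset are assumptions made for illustration.

```python
import math


def tf_idf(term, document, dataset):
    """Score `term` in `document` (a token list) against `dataset` (a list of token lists)."""
    term_freq = document.count(term)
    doc_freq = sum(1 for doc in dataset if term in doc)
    if doc_freq == 0:
        return 0.0  # term never occurs in the dataset
    inverse_doc_freq = math.log(len(dataset) / doc_freq)
    return term_freq * inverse_doc_freq


dataset = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the bird sang".split(),
]
score_the = tf_idf("the", dataset[0], dataset)  # frequent here, but present in every document
score_cat = tf_idf("cat", dataset[0], dataset)  # rarer here, but unique to this document
print(score_the, score_cat)
```

Although "the" occurs twice in the first document, it appears in every document and so carries no weight, while "cat" is specific to this one document and scores higher.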

In [3]:
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)


In [13]:
text_vectorization.adapt(text_only_train_ds)        # will learn tf-idf weights as well as vocabulary

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")


Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.887
