## Exercise 10
_Exercise: In this exercise you will download a dataset, split it, create a `tf.data.Dataset` to load it and preprocess it efficiently, then build and train a binary classification model containing an `Embedding` layer._

### a.
_Exercise: Download the [Large Movie Review Dataset](https://homl.info/imdb), which contains 50,000 movies reviews from the [Internet Movie Database](https://imdb.com/). The data is organized in two directories, `train` and `test`, each containing a `pos` subdirectory with 12,500 positive reviews and a `neg` subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files and folders (including preprocessed bag-of-words), but we will ignore them in this exercise._

I have downloaded the dataset manually and put it to `datasets/imdb/aclImdb`.

In [1]:
import numpy as np
import tensorflow as tf

In [2]:
from pathlib import Path

path = Path("datasets") / "imdb" / "aclImdb"
path

PosixPath('datasets/imdb/aclImdb')

Now a little help from our friend to split the test set into a validation set and a test set:

In [3]:
def review_paths(dirpath):
    return [str(path) for path in dirpath.glob("*.txt")]

train_pos = review_paths(path / "train" / "pos")
train_neg = review_paths(path / "train" / "neg")
test_valid_pos = review_paths(path / "test" / "pos")
test_valid_neg = review_paths(path / "test" / "neg")

len(train_pos), len(train_neg), len(test_valid_pos), len(test_valid_neg)

(12500, 12500, 12500, 12500)

### b.
_Exercise: Split the test set into a validation set (15,000) and a test set (10,000)._

In [4]:
np.random.shuffle(test_valid_pos)

test_pos = test_valid_pos[:5000]
test_neg = test_valid_neg[:5000]
valid_pos = test_valid_pos[5000:]
valid_neg = test_valid_neg[5000:]

### c.
_Exercise: Use tf.data to create an efficient dataset for each set._

Let's deal with the training set first.

In [5]:
train_pos_filepath_dataset = tf.data.Dataset.list_files(train_pos, seed=42)
n_readers = 5
train_pos_dataset = train_pos_filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath),
    cycle_length=n_readers,
    num_parallel_calls=tf.data.AUTOTUNE
)

2024-05-03 13:35:31.222285: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M2 Max
2024-05-03 13:35:31.222300: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2024-05-03 13:35:31.222305: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2024-05-03 13:35:31.222335: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-05-03 13:35:31.222352: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


In [6]:
for item in train_pos_dataset.take(3):
    print(item)

tf.Tensor(b'Loosely based on the James J Corbett biography "The Roar Of The Crowd", Gentleman Jim is a wonderfully breezy picture that perfectly encapsulates not only the rise of the pugilistic prancer that was Corbett, but also the wind of change as regards the sport of boxing circa the 1890s.<br /><br />The story follows Corbett {a perfectly casted Errol Flynn} from his humble beginnings as a bank teller in San Fransico, thru to a chance fight with an ex boxing champion that eventually leads to him fighting the fearsome heavyweight champion of the world, John L Sullivan {beefcake personified delightfully by Ward Bond}. Not all the fights are in the ring tho, and it\'s all the spin off vignettes in Corbett\'s life that makes this a grand entertaining picture. There are class issues to overcome here {perfectly played out as fellow club members pay to have him knocked down a peg or two}, and Corbett has to not only fight to get respect from his so called peers, but he must also overcome

WARNING: Normally, the TextLineDataset splits the text to the '\n' character, but the reviews use the '<br>' character. Let's keep that in mind in case it proves to be important.

Should we add the labels now? I am not sure, but let's try it:

In [7]:
train_pos_dataset = train_pos_dataset.map(
    lambda line: (line, tf.constant([1]))
)

In [8]:
for item in train_pos_dataset.take(1):
    print(item)

(<tf.Tensor: shape=(), dtype=string, numpy=b"Overall this is a delightful, light-hearted, romantic, musical comedy. I suppose a small case could be made for the movie being to long. But I'm not sure what you would cut out. The singing that Kelly and Sinatra do? No. The fabulous dancing that Kelly does? No. The time the movie takes to develop the story line and develop the relationships of the characters? No (that seems to be a common complaint many times that more recent movies don't develop the characters).<br /><br />Some comment that Iturbi didn't bring much to the movie but this gives us a chance to see and hear a great talent from the 1040s. So what if he wasn't an actor? He was an important part of the movie as the basic plot was to get Grayson an audition with him. <br /><br />Originally Katherine Grayson wanted to be an opera star. Louis B. Mayer brought her to MGM for a screen test that included an aria. During her audition in the movie there is a shot of the MGM brass nodding

Now let's do the same for the negative reviews of the training set:

In [9]:
train_neg_filepath_dataset = tf.data.Dataset.list_files(train_neg, seed=42)
n_readers = 5
train_neg_dataset = train_neg_filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath),
    cycle_length=n_readers,
    num_parallel_calls=tf.data.AUTOTUNE
)

train_neg_dataset = train_neg_dataset.map(
    lambda line: (line, tf.constant([0]))
)

In [10]:
for item in train_neg_dataset.take(3):
    print(item)

(<tf.Tensor: shape=(), dtype=string, numpy=b"John Rivers' life as an architect and family man has taken a turn for the worst when his wife has disappeared and has been concluded dead after a freakish accident that involved changing a tyre on her car. During the days she has been missing, he confronts a man that's been following and he tells him that his been in contact with his dead wife from the other-side through E.V.P - Electronic Voice Phenomenon. Naturally he doesn't believe it but then hear gets weird phone calls from her phone and so he contacts the man to find out more about E.V.P. Soon enough John is hooked onto it, but something supernatural doesn't like him interfering with the dead, as now other then contacting his wife, the white noise is foretelling events before they happen.<br /><br />Since this DVD has been sitting on my shelf for a while now, I thought I better get around to watching it since it wasn't my copy. But then again I don't think the owners were in a hurry t

Now we just need to concatenate the positive and the negative dataset into one big training set:

In [11]:
train_set = train_pos_dataset.concatenate(train_neg_dataset).shuffle(25_000)

for item in train_set.take(5):
    print(item)

(<tf.Tensor: shape=(), dtype=string, numpy=b'I first read the book, when I was a young teenager, then saw the film late one night. About a year ago I checked it out on IMDb and discovered no copies available. I then hit the web and found a site that offers War Films, soooo glad that I did, ordered a copy and sat back and was able to confirm why I wanted to see it again.<br /><br />In my opinion to really enjoy the film I suggest you read get a copy of the book and then watch the film. The book is no longer in print but I did track a copy down via E-bay, the Author Alan White was a commando/paratrooper during the 2nd world war taking part in disparate clandestine operations and this was his first book. It is written by someone who knows and this fact I believe gives the book and film authenticity. I have not given the film a ten only because of the nature of the ending of the film, not as good as the book. There are a couple of plot lines that differ from the book also, which is strange

In [12]:
for item in train_set.take(15):
    print(item[1])

tf.Tensor([1], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)
tf.Tensor([1], shape=(1,), dtype=int32)
tf.Tensor([1], shape=(1,), dtype=int32)
tf.Tensor([1], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)
tf.Tensor([1], shape=(1,), dtype=int32)
tf.Tensor([1], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)
tf.Tensor([0], shape=(1,), dtype=int32)


Great, we have a shuffled enough dataset!

Now we need to do all that stuff for the validation and the test set as well.
We should probably write a function:

In [13]:
def build_dataset(files_pos, files_neg, shuffle_buffer_size=25_000, batch_size=32):
    pos_filepath_dataset = tf.data.Dataset.list_files(files_pos, seed=42)
    n_readers = 5
    pos_dataset = pos_filepath_dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath),
        cycle_length=n_readers,
        num_parallel_calls=tf.data.AUTOTUNE
    )
    pos_dataset = pos_dataset.map(
        lambda line: (line, tf.constant([1]))
    )
    neg_filepath_dataset = tf.data.Dataset.list_files(files_neg, seed=42)
    neg_dataset = neg_filepath_dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath),
        cycle_length=n_readers,
        num_parallel_calls=tf.data.AUTOTUNE
    )
    neg_dataset = neg_dataset.map(
        lambda line: (line, tf.constant([0]))
    )
    
    return pos_dataset.concatenate(train_neg_dataset).shuffle(shuffle_buffer_size).batch(batch_size).prefetch(1)
    

In [14]:
train_set = build_dataset(train_pos, train_neg)

In [15]:
for item in train_set.take(1):
    print(item)

(<tf.Tensor: shape=(32,), dtype=string, numpy=
array([b'CQ is incredibly slow, and I\'m a David Mamet fan. The movie follows around a young filmmaker who is making a very Barbarella-esque film. After that the movie started to lose me. Deep and profound? Not really. The movie "Dragonfly" being made in CQ has the problem of having no ending. This greatly parallels CQ, which also lacks an ending (in my opinion).<br /><br />I was lucky enough to catch this movie at the SxSW film festival. I had fairly high expectations having just watched Y Tu Mama Tambien and several other great movies. I was also looking forward to Jason Schwartzman\'s performance. But it was not an easy film to get into. If you\'re not into 60\'s sci-fi or slow movies that go no where, skip it.<br /><br />CQ feels like a student film. If you want a recent sci-fi-esque indie film rent Donnie Darko, it won\'t put you to sleep.',
       b'Mt little sister and I are self-proclaimed horror movie buffs. We have seen just abou

In [16]:
valid_set = build_dataset(valid_pos, valid_neg)
test_set = build_dataset(test_pos, test_neg)

### d.
_Exercise: Create a binary classification model, using a `TextVectorization` layer to preprocess each review._

In [17]:
train_set_without_labels = train_set.map(lambda line, label: line)

In [18]:
text_vec_layer = tf.keras.layers.TextVectorization(output_mode="tf_idf")
text_vec_layer.adapt(train_set_without_labels)
text_vec_layer.get_vocabulary()

2024-05-03 13:35:36.201651: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.
2024-05-03 13:35:53.491059: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 15587377994074769533
2024-05-03 13:35:53.491071: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 13047374798062205079
2024-05-03 13:35:53.491081: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7997531721029501639
2024-05-03 13:35:53.491085: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4871621820746615807
2024-05-03 13:35:53.491090: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6967353284034927673
2024-05-03 13:35:53.491094: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous 

['[UNK]',
 'the',
 'and',
 'a',
 'of',
 'to',
 'is',
 'in',
 'it',
 'i',
 'this',
 'that',
 'br',
 'was',
 'as',
 'for',
 'with',
 'movie',
 'but',
 'film',
 'on',
 'not',
 'you',
 'are',
 'his',
 'have',
 'he',
 'be',
 'one',
 'its',
 'at',
 'all',
 'by',
 'an',
 'they',
 'from',
 'who',
 'so',
 'like',
 'her',
 'just',
 'or',
 'about',
 'has',
 'if',
 'out',
 'some',
 'there',
 'what',
 'good',
 'when',
 'more',
 'very',
 'even',
 'she',
 'my',
 'no',
 'up',
 'would',
 'which',
 'only',
 'time',
 'really',
 'story',
 'their',
 'were',
 'had',
 'see',
 'can',
 'me',
 'than',
 'we',
 'much',
 'well',
 'been',
 'get',
 'will',
 'into',
 'also',
 'because',
 'other',
 'do',
 'people',
 'bad',
 'great',
 'first',
 'how',
 'most',
 'him',
 'dont',
 'made',
 'then',
 'movies',
 'make',
 'films',
 'could',
 'way',
 'them',
 'any',
 'too',
 'after',
 'characters',
 'think',
 'watch',
 'two',
 'many',
 'being',
 'seen',
 'character',
 'never',
 'little',
 'acting',
 'where',
 'plot',
 'best',


In [19]:
model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=[], dtype=tf.string),
    text_vec_layer,
    tf.keras.layers.Dense(50, activation="swish"),
    tf.keras.layers.Dense(50, activation="swish"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics="accuracy")
model.fit(train_set, epochs=10,
          validation_data=valid_set)

Epoch 1/10
    782/Unknown - 16s 18ms/step - loss: 0.3260 - accuracy: 0.8638

2024-05-03 13:36:09.741491: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2136108705740175724
2024-05-03 13:36:09.741506: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 3493628450231936348
2024-05-03 13:36:09.741511: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 14613295868764104371
2024-05-03 13:36:09.741525: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4287512381600114915
2024-05-03 13:36:09.741534: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 15749677424871197029
2024-05-03 13:36:09.741539: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11757012812028237915
2024-05-03 13:36:09.741552: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv 

Epoch 2/10


2024-05-03 13:36:16.652033: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16394746452551148871
2024-05-03 13:36:16.652051: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2312398436967317219
2024-05-03 13:36:16.652059: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 15439246476477084651
2024-05-03 13:36:16.652064: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 5656730585614499133
2024-05-03 13:36:16.652070: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 1014437792978064769
2024-05-03 13:36:16.652079: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8260421561924314785
2024-05-03 13:36:16.652084: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv i

Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x360e07eb0>

In [20]:
model.evaluate(test_set)



2024-05-03 13:39:45.768623: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 15163549162549377506
2024-05-03 13:39:45.768641: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2895755664647593066
2024-05-03 13:39:45.768658: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10873053604849793824
2024-05-03 13:39:45.768671: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11091147508429961627
2024-05-03 13:39:45.768678: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 16394746452551148871
2024-05-03 13:39:45.768683: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8260421561924314785
2024-05-03 13:39:45.768701: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv

[0.2036270946264267, 0.9653714299201965]

In [21]:
for item in test_set.take(1):
    model.predict(item[0])



In [22]:
print(item[0][0], model.predict(item[0])[0])

tf.Tensor(b"I'm rarely moved to make a comment online about a film. But I can't understand how this one got made. Who made it? How could they have possibly thought they were capable of making a feature film? Did they do a weekend course at some film school, get a nice big cheque from daddy and kidnap David Badiel's family one by one until he agreed to be in it? Or was he by any chance a longtime family friend/distant relation doing this out of sheer, misplaced kindness? I don't care, don't want to know. Even he looks utterly embarrassed to be in it, mumbling his lines and hiding his face from the camera. Meanwhile the DOP must have been the gaffer from Neighbours, there seemed to be absolutely no sound design, the script, the direction and editing were all abysmal, and quite frankly the apathy that overwhelms me right now means that I can't be bothered to spend any more of my life thinking about this film.", shape=(), dtype=string) [5.9588046e-07]


It seems to work well, even without embeddings!

### e.
_Exercise: Add an `Embedding` layer and compute the mean embedding for each review, multiplied by the square root of the number of words (see Chapter 16). This rescaled mean embedding can then be passed to the rest of your model._

In [23]:
# we set output_sequence_length to a fixed number so that the input of
# the embedding layer later on has a predictable shape
text_vec_layer = tf.keras.layers.TextVectorization(output_sequence_length=500)
text_vec_layer.adapt(train_set_without_labels)
text_vec_layer.vocabulary_size()

121894

In [24]:
embedding_layer = tf.keras.layers.Embedding(input_dim=text_vec_layer.vocabulary_size(), output_dim=10)

In [25]:
vectorize_and_embed = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=[], dtype=tf.string),
    text_vec_layer,
    embedding_layer
])

In [26]:
for batch in train_set_without_labels.take(1):
    batch_embedding = vectorize_and_embed(batch)
batch_embedding

<tf.Tensor: shape=(32, 500, 10), dtype=float32, numpy=
array([[[ 0.00460934, -0.02267817,  0.04854618, ...,  0.02736679,
          0.04147476,  0.02328509],
        [-0.01907367,  0.02131287,  0.01746588, ...,  0.03823959,
         -0.01565267,  0.00014645],
        [ 0.04124698,  0.03378879,  0.02236694, ..., -0.02369328,
          0.03096269,  0.04107667],
        ...,
        [ 0.02776319, -0.02406411,  0.04403183, ...,  0.04071799,
         -0.04958228, -0.00410388],
        [ 0.02776319, -0.02406411,  0.04403183, ...,  0.04071799,
         -0.04958228, -0.00410388],
        [ 0.02776319, -0.02406411,  0.04403183, ...,  0.04071799,
         -0.04958228, -0.00410388]],

       [[ 0.02173816,  0.01676002, -0.02297279, ..., -0.02129111,
         -0.01824721,  0.02303095],
        [-0.03793883, -0.01507797, -0.01080136, ..., -0.02388897,
         -0.04236702, -0.03149197],
        [ 0.0311159 , -0.0246365 , -0.00359216, ..., -0.02308354,
         -0.03482064,  0.0018291 ],
        ...,

In [27]:
# This is how we calculate the mean embedding for each review in a batch, multiplied
# by the square root of the number of words
num_of_words = batch_embedding.shape[1]
tf.reduce_mean(batch_embedding, axis=1) * tf.sqrt(tf.constant(num_of_words, dtype=tf.float32))

<tf.Tensor: shape=(32, 10), dtype=float32, numpy=
array([[ 3.86347741e-01, -3.16314101e-01,  6.27046287e-01,
         4.21411186e-01, -2.00020239e-01, -4.55896616e-01,
        -4.21586066e-01,  5.43300509e-01, -7.37400532e-01,
         8.09997227e-03],
       [ 2.86641449e-01, -2.67911762e-01,  4.82167512e-01,
         3.02513689e-01, -1.31447807e-01, -3.81565213e-01,
        -2.94788629e-01,  4.38677341e-01, -5.58500886e-01,
        -7.80604489e-04],
       [ 4.18923855e-01, -3.94900680e-01,  7.15903044e-01,
         4.36348796e-01, -1.86550736e-01, -5.20505965e-01,
        -4.65109885e-01,  6.61122024e-01, -8.20217669e-01,
        -8.31896588e-02],
       [ 4.93469387e-01, -3.98946643e-01,  7.51818419e-01,
         5.05713165e-01, -2.56787241e-01, -5.49423933e-01,
        -4.89413172e-01,  7.29677856e-01, -8.94608915e-01,
        -4.84897122e-02],
       [ 4.97291028e-01, -4.29128051e-01,  7.65734553e-01,
         5.29002786e-01, -2.48494193e-01, -5.84947944e-01,
        -5.12382925e

In [28]:
# Look at the solutions notebook for a more correct implementation
# that ignores the padding tokens and does not rely on shape[1]
def rescale_embedding(batch_embedding):
    num_of_words = batch_embedding.shape[1]
    return tf.reduce_mean(batch_embedding, axis=1) * tf.sqrt(tf.constant(num_of_words, dtype=tf.float32))

In [29]:
rescale_embedding_layer = tf.keras.layers.Lambda(rescale_embedding)
rescale_embedding_layer(batch_embedding)

<tf.Tensor: shape=(32, 10), dtype=float32, numpy=
array([[ 3.86347741e-01, -3.16314101e-01,  6.27046287e-01,
         4.21411186e-01, -2.00020239e-01, -4.55896616e-01,
        -4.21586066e-01,  5.43300509e-01, -7.37400532e-01,
         8.09997227e-03],
       [ 2.86641449e-01, -2.67911762e-01,  4.82167512e-01,
         3.02513689e-01, -1.31447807e-01, -3.81565213e-01,
        -2.94788629e-01,  4.38677341e-01, -5.58500886e-01,
        -7.80604489e-04],
       [ 4.18923855e-01, -3.94900680e-01,  7.15903044e-01,
         4.36348796e-01, -1.86550736e-01, -5.20505965e-01,
        -4.65109885e-01,  6.61122024e-01, -8.20217669e-01,
        -8.31896588e-02],
       [ 4.93469387e-01, -3.98946643e-01,  7.51818419e-01,
         5.05713165e-01, -2.56787241e-01, -5.49423933e-01,
        -4.89413172e-01,  7.29677856e-01, -8.94608915e-01,
        -4.84897122e-02],
       [ 4.97291028e-01, -4.29128051e-01,  7.65734553e-01,
         5.29002786e-01, -2.48494193e-01, -5.84947944e-01,
        -5.12382925e

Okay, now we have a layer that computes the rescaled embedding for a batch of reviews.

Let's build a model that uses the embeddings:

In [30]:
with tf.device("/cpu:0"):
    model = tf.keras.Sequential([
        vectorize_and_embed,
        rescale_embedding_layer,
        tf.keras.layers.Dense(50, activation="swish"),
        tf.keras.layers.Dense(50, activation="swish"),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ])
    model.compile(loss="binary_crossentropy", optimizer="nadam", metrics="accuracy")
    model.fit(train_set, epochs=10,
            validation_data=valid_set)

Epoch 1/10
    782/Unknown - 36s 44ms/step - loss: 0.3965 - accuracy: 0.8110

2024-05-03 13:40:27.599906: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 2989364824719151300
2024-05-03 13:40:27.599921: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8912161222231175514
2024-05-03 13:40:27.599925: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10481938466810581945
2024-05-03 13:40:27.599938: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 5413094149733165528
2024-05-03 13:40:27.599950: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 153099155470242688
2024-05-03 13:40:27.599956: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 17878295793600140221
2024-05-03 13:40:27.599968: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv it

Epoch 2/10


2024-05-03 13:40:30.137775: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 18358026720786853225
2024-05-03 13:40:30.137800: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4379035989431822058
2024-05-03 13:40:30.137804: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7324461378838005514


Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


It works! Although we didn't achieve higher accuracy with the embeddings layer.

### g.
_Exercise: Use TFDS to load the same dataset more easily: `tfds.load("imdb_reviews")`._

In [31]:
import tensorflow_datasets as tfds

train_set, valid_set, test_set = tfds.load(
    name="imdb_reviews",
    split=["train", "test[:60%]", "test[60%:]"],
    batch_size=64,
    as_supervised=True
)

  from .autonotebook import tqdm as notebook_tqdm


In [32]:
for item in train_set.take(1):
    print(item)

(<tf.Tensor: shape=(64,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell

2024-05-03 13:43:23.250945: W tensorflow/core/kernels/data/cache_dataset_ops.cc:854] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


In [33]:
tf.keras.backend.clear_session()

In [34]:
# with tf.device("/cpu:0"):
model = tf.keras.Sequential([
    vectorize_and_embed,
    rescale_embedding_layer,
    tf.keras.layers.Dense(50, activation="swish"),
    tf.keras.layers.Dense(50, activation="swish"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="nadam", metrics="accuracy")
model.fit(train_set, epochs=10,
        validation_data=valid_set)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7204239d0>

Probably we should have re-adapted the vectorization layer to the tfds dataset.
