Word embeddings
===

* Last modified: March 1, 2022

## Importing libraries

In [1]:
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

import tensorflow as tf

Setup
---

**Downloading the data**

In [2]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file(
    "aclImdb_v1.tar.gz",
    url,
    untar=True,
    cache_dir=".",
    cache_subdir="/tmp/imdb",
)

dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [3]:
os.listdir(dataset_dir)

['imdbEr.txt', 'train', 'imdb.vocab', 'README', 'test']

In [4]:
train_dir = os.path.join(dataset_dir, "train")
os.listdir(train_dir)

['pos',
 'labeledBow.feat',
 'unsup',
 'neg',
 'unsupBow.feat',
 'urls_unsup.txt',
 'urls_neg.txt',
 'urls_pos.txt']

In [6]:
import shutil

# Remove the unlabeled reviews so that only the "pos" and "neg" class folders are loaded.
remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir)

In [9]:
batch_size = 1024
seed = 123

train_ds = tf.keras.utils.text_dataset_from_directory(
    "/tmp/imdb/aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=seed,
)

val_ds = tf.keras.utils.text_dataset_from_directory(
    "/tmp/imdb/aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=seed,
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [13]:
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(label_batch[i].numpy(), "---->", text_batch.numpy()[i])
        print()

1 ----> b'It\'s always nice to see Angela Bassett getting to do a role that she can really sink her teeth into. She is at times intense, funny and even sexy in her role as Lena, a "colored" woman forced to make a home on a desolate mudbank just outside of Cape Town, South Africa. Danny Glover is also good in a not entirely sympathetic role as her partner, Boesman. Willie Jonah gives a finely nuanced performance as the stranger that discovers Boesman and Lena\'s new living area. It\'s not often that you get a chance to see an intelligent film dealing with mature themes. Although it is based on a play, the late director John Berry (who also directed Claudine) opens the material up by having the film shot in the widescreen Cinemascope format. He also keeps things visually interesting through the creative blocking of actors and by showing us things only mentioned in the play. Just like Diahann Carroll in Claudine, John Berry may have directed Angela Bassett into an Academy Award nomination

**Configuring the dataset for performance**

In [15]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

Using an Embedding layer
---

In [16]:
embedding_layer = tf.keras.layers.Embedding(1000, 5)

In [17]:
result = embedding_layer(tf.constant([1, 2, 3]))
result.numpy()

array([[-0.01132011,  0.02213531,  0.02189431, -0.04420274, -0.04720949],
       [-0.04826976, -0.01315316, -0.03228643, -0.04433763,  0.03587538],
       [ 0.00920614, -0.0140573 ,  0.03090834,  0.03418871, -0.003983  ]],
      dtype=float32)

In [18]:
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape

TensorShape([2, 3, 5])
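
Conceptually, `Embedding(1000, 5)` behaves like a trainable lookup table with 1000 rows of 5 floats each: calling it on integer indices returns the matching rows, which is why a `(2, 3)` batch of indices produces a `(2, 3, 5)` result. The sketch below reproduces that indexing in plain Python. The names `table` and `lookup` are illustrative and not part of the Keras API, and the random initialization range is an assumption.

```python
import random

random.seed(0)

# Hypothetical stand-in for the layer's weight matrix: 1000 rows, 5 columns.
table = [[random.uniform(-0.05, 0.05) for _ in range(5)] for _ in range(1000)]


def lookup(indices):
    """Replace every integer in a (possibly nested) list with its table row."""
    if isinstance(indices, int):
        return table[indices]
    return [lookup(item) for item in indices]


batch = [[0, 1, 2], [3, 4, 5]]
vectors = lookup(batch)

# Shape of the result: one 5-dimensional vector per input index.
shape = (len(vectors), len(vectors[0]), len(vectors[0][0]))
print(shape)  # (2, 3, 5)
```

During training, the rows of the real layer's table are updated by backpropagation like any other weights.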

Text preprocessing
---

In [19]:
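import re
import string


def custom_standardization(input_data):
    # Lowercase the text so that "Movie" and "movie" share one token.
    lowercase = tf.strings.lower(input_data)

    # Replace the HTML line breaks found in the reviews with a space.
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")

    # Remove all ASCII punctuation characters.
    return tf.strings.regex_replace(
        stripped_html, "[%s]" % re.escape(string.punctuation), ""
    )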

In [23]:
import re
import string

vocab_size = 10000
sequence_length = 100

vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

text_ds = train_ds.map(lambda x, y: x)

vectorize_layer.adapt(text_ds)
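
To make the role of `standardize`, `max_tokens`, and `output_sequence_length` more concrete, here is a rough pure-Python approximation of what the layer does in `"int"` mode: standardize, split on whitespace, map each word to a vocabulary index (0 for padding, 1 for unknown words), then truncate or pad to a fixed length. The helpers `standardize`, `build_vocab`, and `vectorize` are hypothetical names for this sketch; details such as tie-breaking between equally frequent words may differ from `TextVectorization`.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase, replace '<br />' with a space, and drop ASCII punctuation."""
    text = text.lower().replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)


def build_vocab(texts, max_tokens):
    """Reserve index 0 for padding and index 1 for unknown words."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, sequence_length):
    """Map words to indices, then truncate or zero-pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


reviews = ["Great movie!<br />Loved it.", "Terrible movie, hated it."]
toy_vocab = build_vocab(reviews, max_tokens=10)
toy_ids = vectorize("Loved the movie!", toy_vocab, sequence_length=6)
print(toy_ids)
```

In the toy call, "the" never appears in the two reviews, so it is mapped to index 1, and the three trailing zeros are padding.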

Building the classification model
---

In [25]:
embedding_dim = 16

model = tf.keras.Sequential(
    [
        vectorize_layer,
        tf.keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ]
)
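
In this model, `GlobalAveragePooling1D` collapses the 100 word vectors of each review into a single 16-dimensional vector by averaging over the sequence axis, so the dense layers see a fixed-size input regardless of word order. A minimal sketch of that operation in plain Python follows; `global_average_pooling_1d` is an illustrative helper, and the toy dimensions are much smaller than the model's. Since the `Embedding` layer above is created without `mask_zero=True`, padded positions are assumed to take part in the average as well.

```python
def global_average_pooling_1d(sequence):
    """Average a list of equal-length vectors element-wise."""
    steps = len(sequence)
    return [sum(column) / steps for column in zip(*sequence)]


# Toy review: 4 time steps with 3-dimensional embeddings (the model uses 100 and 16).
embedded_review = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [0.0, 4.0, 2.0],
    [0.0, 0.0, 2.0],
]

pooled = global_average_pooling_1d(embedded_review)
print(pooled)  # [1.0, 1.0, 2.0]
```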

Compiling and training the model
---

In [26]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="/tmp/embedding/logs")

In [27]:
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
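
The final `Dense(1)` layer has no activation, so the model outputs raw logits, and `from_logits=True` tells the loss to apply the sigmoid itself. The sketch below compares the textbook cross-entropy computed from `sigmoid(logit)` with a common numerically stable formulation that works directly on the logit. Both functions are illustrative helpers written for this example, not the Keras implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_naive(label, logit):
    """Cross-entropy computed from the probability sigmoid(logit)."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def bce_from_logits(label, logit):
    """Algebraically equivalent form that avoids log(0) for large |logit|."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


for label, logit in [(1, 2.0), (0, 2.0), (1, -3.0)]:
    print(label, logit, bce_naive(label, logit), bce_from_logits(label, logit))
```

For moderate logits the two functions agree to floating-point precision. For a very large logit such as `1000.0`, `sigmoid` rounds to exactly `1.0` and the naive version fails on `log(0)`, while the logit-based version still returns a finite loss.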

In [29]:
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback],
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f2b4da1a2b0>

In [30]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_2 (TextV  (None, 100)              0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 100, 16)           160000    
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 16)                272       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
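
The parameter counts in the summary can be checked by hand: the embedding stores one 16-dimensional vector per vocabulary entry, and each dense layer has one weight per input-output pair plus one bias per output. A short sketch of that arithmetic, using the same `vocab_size` and `embedding_dim` values as the model:

```python
vocab_size = 10000
embedding_dim = 16

embedding_params = vocab_size * embedding_dim  # one vector per token
dense_params = embedding_dim * 16 + 16         # weights + biases
output_params = 16 * 1 + 1                     # weights + bias

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
# 160000 272 17 160289
```

Almost all of the model's capacity sits in the embedding table; the text vectorization and pooling layers have no trainable parameters.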

```bash
%load_ext tensorboard
%tensorboard --logdir /tmp/embedding/logs
```


![assets/embeddings_classifier_accuracy.png](assets/embeddings_classifier_accuracy.png)

Retrieving the trained embeddings and saving them to disk
---

In [31]:
weights = model.get_layer("embedding").get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [33]:
# Write one embedding vector per line (tab-separated) to vectors.tsv
# and the matching word to the same line of metadata.tsv.
with open("vectors.tsv", "w", encoding="utf-8") as out_v, open(
    "metadata.tsv", "w", encoding="utf-8"
) as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = weights[index]
        out_v.write("\t".join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")
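
The two files are aligned line by line: line *n* of `vectors.tsv` holds the tab-separated components of the vector for the word on line *n* of `metadata.tsv`. As a quick sanity check outside of any visualization tool, a sketch like the following could read such files back and look up the closest word by cosine similarity. The helpers `load_embeddings`, `cosine`, and `nearest` are hypothetical, and the toy files use made-up values rather than the trained weights.

```python
import math
import os
import tempfile


def load_embeddings(vectors_path, metadata_path):
    """Pair each line of metadata.tsv with the matching row of vectors.tsv."""
    with open(vectors_path, encoding="utf-8") as f_vec, open(
        metadata_path, encoding="utf-8"
    ) as f_meta:
        return {
            word.rstrip("\n"): [float(x) for x in row.split("\t")]
            for word, row in zip(f_meta, f_vec)
        }


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(word, embeddings):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


# Toy files in the same layout as vectors.tsv / metadata.tsv (made-up values).
tmp_dir = tempfile.mkdtemp()
vectors_path = os.path.join(tmp_dir, "vectors.tsv")
metadata_path = os.path.join(tmp_dir, "metadata.tsv")
with open(vectors_path, "w", encoding="utf-8") as f:
    f.write("1.0\t0.1\n0.9\t0.2\n-1.0\t0.0\n")
with open(metadata_path, "w", encoding="utf-8") as f:
    f.write("good\ngreat\nbad\n")

toy = load_embeddings(vectors_path, metadata_path)
print(nearest("good", toy))  # great
```

Pointing `load_embeddings` at the real `vectors.tsv` and `metadata.tsv` should work the same way, assuming no vocabulary entry contains a tab or newline.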

Visualizing the embeddings
---

The `vectors.tsv` and `metadata.tsv` files can be loaded into the Embedding Projector: http://projector.tensorflow.org/

![assets/embedding_projector.png](assets/embedding_projector.png)