Text classification using TextEncoders
===

* 30:00 min | Last modified: May 3, 2021 | [YouTube]

Based on: https://www.tensorflow.org/tutorials/keras/text_classification

## Importing libraries

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras

tfds.disable_progress_bar()

print(tf.__version__)
print(tfds.__version__)

2.4.1
4.2.0


## Loading and configuring the dataset

In [2]:
#
# Load a version of the dataset that is already encoded with a
# vocabulary of ~8k subwords
#
(train_data, test_data), info = tfds.load(
    "imdb_reviews/subwords8k",
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    as_supervised=True,
    with_info=True,
)



Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.


## Encoder

In [3]:
#
# The info object returned with the dataset includes an encoder of type
# SubwordTextEncoder (tfds.deprecated.text.SubwordTextEncoder in TFDS 4.x,
# formerly tfds.features.text.SubwordTextEncoder)
#
encoder = info.features["text"].encoder

print("Vocabulary size: {}".format(encoder.vocab_size))

Vocabulary size: 8185


In [4]:
#
#  Encoding example
#
sample_string = "Hello TensorFlow."
encoded_string = encoder.encode(sample_string)
print("Encoded text: {}".format(encoded_string))

#
#  Decoding example
#
original_string = encoder.decode(encoded_string)
print('Original text: "{}"'.format(original_string))

Encoded text: [4025, 222, 6307, 2327, 4043, 2120, 7975]
Original text: "Hello TensorFlow."


In [5]:
#
#  When a word is not in the predefined vocabulary, the encoder falls back
#  to smaller pieces (subwords and single characters). The more closely a
#  string resembles the text in the dataset, the shorter its encoding.
#
for ts in encoded_string:
    print("{} ---> {}".format(ts, encoder.decode([ts])))

4025 ---> Hell
222 ---> o 
6307 ---> Ten
2327 ---> sor
4043 ---> Fl
2120 ---> ow
7975 ---> .
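
The fallback to smaller pieces can be illustrated with a greedy longest-match splitter. This is only a minimal sketch under stated assumptions: `greedy_subword_encode` and `toy_vocab` are hypothetical names, the vocabulary is hand-written to resemble the pieces printed above, and the real `SubwordTextEncoder` builds and applies its vocabulary with its own algorithm, which this sketch does not reproduce.

```python
def greedy_subword_encode(text, vocab):
    """Split text into the longest vocabulary pieces, left to right.

    Characters not covered by any piece fall back to single-character
    tokens, so every string can be encoded.
    """
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No piece matched: emit the single character as-is.
            pieces.append(text[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary (an assumption, not the dataset's real one).
toy_vocab = {"Hell", "o ", "Ten", "sor", "Fl", "ow", ".", "movie", "This ", "was "}

# An unusual string breaks into many short pieces ...
print(greedy_subword_encode("Hello TensorFlow.", toy_vocab))
# ... while text that resembles the vocabulary needs fewer pieces.
print(greedy_subword_encode("This was movie.", toy_vocab))
```

Joining the returned pieces always reproduces the input string, which mirrors the encode/decode round trip shown earlier.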


## Exploring the dataset

In [6]:
#
#  Encoding of the first example
#
for train_example, train_label in train_data.take(1):
    print("Encoded text:", train_example[:10].numpy())
    print("Label:", train_label.numpy())

Encoded text: [  62   18   41  604  927   65    3  644 7968   21]
Label: 0


In [7]:
#
#  Decoding of the first example
#
encoder.decode(train_example)

"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

## Preparing the data for training

In [8]:
#
#  The integer sequences that represent the reviews have different lengths
#
for train_example, _ in train_data.take(20):
    print(len(train_example.numpy()), end="")

163142200117106421974188313179280394805241517125443655434534

In [9]:
#
#  Within each batch, sequences are padded with zeros up to the length of
#  the longest one, so that all rows have the same length.
#
BUFFER_SIZE = 1000

train_batches = train_data.shuffle(BUFFER_SIZE).padded_batch(
    32, padded_shapes=([None], [])
)

test_batches = test_data.padded_batch(32, padded_shapes=([None], []))

In [10]:
for example_batch, label_batch in train_batches.take(5):
    print("Batch shape:", example_batch.shape, end="")
    print("\t\tlabel shape:", label_batch.shape)

Batch shape: (32, 1357)		label shape: (32,)
Batch shape: (32, 766)		label shape: (32,)
Batch shape: (32, 734)		label shape: (32,)
Batch shape: (32, 477)		label shape: (32,)
Batch shape: (32, 1432)		label shape: (32,)
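
The second dimension changes from batch to batch because padding is applied per batch, up to the longest sequence in that batch. The idea can be sketched in plain Python; `padded_batches` and `toy_sequences` are hypothetical names used only for illustration, and the sketch does not reproduce how `tf.data` implements `padded_batch` internally.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Yield batches in which every row is padded to that batch's max length."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        # Append pad_value to each row until it reaches the batch width.
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]


# Hypothetical token-id sequences of different lengths.
toy_sequences = [[5, 8], [3], [7, 1, 4, 9], [2, 6, 6], [1], [4, 4]]

for batch in padded_batches(toy_sequences, batch_size=2):
    # Rows within a batch share a width; widths differ across batches.
    print(len(batch), "x", len(batch[0]), batch)
```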


## Building the model with Keras

In [11]:
model = keras.Sequential(
    [
        keras.layers.Embedding(encoder.vocab_size, 16),
        keras.layers.GlobalAveragePooling1D(),
        keras.layers.Dense(1),
    ]
)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          130960    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
Total params: 130,977
Trainable params: 130,977
Non-trainable params: 0
_________________________________________________________________
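
The parameter counts in the summary can be checked by hand from the vocabulary size and the embedding dimension. The arithmetic below uses only values printed in this notebook; `global_average_pool` is a hypothetical helper that sketches, on toy 2-dimensional vectors, how the pooling layer collapses a variable-length sequence into a single vector without adding weights.

```python
VOCAB_SIZE = 8185      # encoder.vocab_size printed earlier
EMBEDDING_DIM = 16     # second argument of keras.layers.Embedding

# Embedding: one 16-dimensional vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBEDDING_DIM
# GlobalAveragePooling1D: has no weights.
pooling_params = 0
# Dense(1): 16 weights plus one bias.
dense_params = EMBEDDING_DIM * 1 + 1

print(embedding_params, pooling_params, dense_params)
print(embedding_params + pooling_params + dense_params)


def global_average_pool(token_vectors):
    """Average a list of equal-length vectors over the sequence axis."""
    steps = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / steps for d in range(dims)]


# Three tokens with 2-dimensional toy embeddings collapse into one vector.
print(global_average_pool([[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]))
```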


## Compiling the model

In [12]:
model.compile(
    optimizer="adam",
    loss=tf.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
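
Because the final `Dense(1)` layer has no activation, the model outputs raw logits, which is why the loss is built with `from_logits=True`. The sketch below shows, for single examples, that a cross-entropy computed directly on a logit agrees with the textbook formula applied to the sigmoid of that logit. The function names are hypothetical, and the stable form used here is a common formulation, not necessarily the exact code path inside TensorFlow.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_logits(label, logit):
    """Numerically stable binary cross-entropy for a single raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(label, prob):
    """Textbook binary cross-entropy on a probability in (0, 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


# Hypothetical (label, logit) pairs; both columns should agree closely.
for label, logit in [(1, 2.0), (0, 2.0), (1, -0.5)]:
    print(label, logit,
          bce_from_logits(label, logit),
          bce_from_probability(label, sigmoid(logit)))
```

Working on logits avoids taking the logarithm of a probability that has been rounded to exactly 0 or 1.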

## Training

In [13]:
model.fit(
    train_batches,
    epochs=10,
    validation_data=test_batches,
    validation_steps=30,
    verbose=1,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fead8da1710>

## Evaluation

In [14]:
results = model.evaluate(test_batches)
for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

loss: 0.333
accuracy: 0.858
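
When reading the accuracy of a model that outputs logits, the decision threshold matters: a probability of 0.5 corresponds to a logit of 0.0. Keras's `BinaryAccuracy` metric takes a `threshold` argument that defaults to 0.5, so a cut meant for probabilities can give a different score if it is applied to raw logits. The sketch below illustrates this with hypothetical labels and logits; it does not verify how the `"accuracy"` string was resolved in the run above, so it says nothing about whether the reported 0.858 is affected.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def accuracy(labels, scores, threshold):
    """Fraction of examples whose thresholded score matches the label."""
    hits = sum(int(score > threshold) == label
               for label, score in zip(labels, scores))
    return hits / len(labels)


# Hypothetical labels and raw model outputs (logits), for illustration only.
labels = [1, 0, 1, 1, 0]
logits = [2.1, -1.3, 0.2, 0.4, -0.7]
probabilities = [sigmoid(z) for z in logits]

acc_prob_half = accuracy(labels, probabilities, threshold=0.5)  # probabilities cut at 0.5
acc_logit_zero = accuracy(labels, logits, threshold=0.0)        # same decision on logits
acc_logit_half = accuracy(labels, logits, threshold=0.5)        # logits cut at 0.5

print(acc_prob_half, acc_logit_zero, acc_logit_half)
```

If logits are being evaluated, passing `tf.metrics.BinaryAccuracy(threshold=0.0)` as the metric is one way to make the cut explicit.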
