# Text classification with an RNN

This text classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis.

In [2]:
# Use the CPU (uncomment the two lines below to hide the GPU)
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
import tensorflow_datasets as tfds

In [3]:
import matplotlib.pyplot as plt


def plot_graphs(history, metric):
    """Plot a training metric and its validation counterpart per epoch."""
    plt.plot(history.history[metric])
    plt.plot(history.history["val_" + metric], "")
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, "val_" + metric])
    plt.show()

## Setup input pipeline

The IMDB large movie review dataset is a binary classification dataset: every review has either a positive or a negative sentiment.

Download the dataset using TFDS.

In [5]:
dataset, info = tfds.load(
    "imdb_reviews/subwords8k",
    with_info=True,
    as_supervised=True
)
train_dataset, test_dataset = dataset["train"], dataset["test"]



The dataset info includes the encoder (a tfds.features.text.SubwordTextEncoder).

In [6]:
encoder = info.features["text"].encoder

In [7]:
print("Vocabulary size:{}".format(encoder.vocab_size))

Vocabulary size:8185


This text encoder will reversibly encode any string, falling back to byte-encoding if necessary.

In [8]:
sample_string = "Hello TensorFlow"

encoded_string = encoder.encode(sample_string)
print("Encoded string is {}".format(encoded_string))

original_string = encoder.decode(encoded_string)
print("The original string: {}".format(original_string))

Encoded string is [4025, 222, 6307, 2327, 4043, 2120]
The original string: Hello TensorFlow


In [9]:
assert original_string == sample_string

In [10]:
for index in encoded_string:
    print("{} --- > {}".format(index, encoder.decode([index])))

4025 --- > Hell
222 --- > o 
6307 --- > Ten
2327 --- > sor
4043 --- > Fl
2120 --- > ow
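
To make "reversible, with a byte-level fallback" concrete, here is a toy pure-Python sketch. It is not the TFDS algorithm: the six-entry vocabulary, the id layout, and the function names `encode` and `decode` are all assumptions chosen for illustration.

```python
# Hypothetical toy vocabulary; the real encoder has 8185 entries.
SUBWORDS = ["Hell", "o ", "Ten", "sor", "Fl", "ow"]
# Ids 1..len(SUBWORDS) are subwords; the following 256 ids stand for raw byte values.
BYTE_OFFSET = len(SUBWORDS) + 1


def encode(text):
    """Greedy longest-match subword encoding with a UTF-8 byte fallback."""
    ids, pos = [], 0
    while pos < len(text):
        match = max(
            (s for s in SUBWORDS if text.startswith(s, pos)),
            key=len,
            default=None,
        )
        if match is not None:
            ids.append(SUBWORDS.index(match) + 1)
            pos += len(match)
        else:
            # No subword matches here: emit the character's UTF-8 bytes instead.
            ids.extend(BYTE_OFFSET + b for b in text[pos].encode("utf-8"))
            pos += 1
    return ids


def decode(ids):
    """Invert encode() by concatenating subword bytes and fallback bytes."""
    out = bytearray()
    for i in ids:
        if i >= BYTE_OFFSET:
            out.append(i - BYTE_OFFSET)
        else:
            out.extend(SUBWORDS[i - 1].encode("utf-8"))
    return out.decode("utf-8")


print(encode("Hello TensorFlow"))
print(decode(encode("Hello, TensorFlow!")))
```

Because every byte has an id of its own, any string round-trips even when none of its characters are in the subword vocabulary.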


## Prepare the data for training

Next create batches of these encoded strings. Use the padded_batch method to zero-pad the sequences to the length of the longest string in the batch:

In [11]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

In [12]:
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE, padded_shapes=([None], []))

test_dataset = test_dataset.padded_batch(BATCH_SIZE, padded_shapes=([None], []))

train_data, train_labels = next(iter(train_dataset))
train_data.numpy().shape

(64, 1467)
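
As a minimal sketch of what zero-padding to the longest sequence in a batch amounts to, the pure-Python helper below pads plain lists. The name `pad_batch` is hypothetical and is not part of tf.data; padded_batch performs the equivalent operation on tensors.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


batch = [[4025, 222], [6307, 2327, 4043, 2120], [7]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

Since the target length is chosen per batch, the second dimension of train_data (1467 above) can differ from one batch to the next.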

## Create the model

Build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
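
The equivalence between the two operations can be sketched in pure Python. The 4 x 3 table and the function names are made up for illustration; a real embedding layer holds trained weights.

```python
# Hypothetical toy embedding table: 4 vocabulary entries, 3 dimensions each.
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def lookup(index):
    """Embedding-style lookup: read one row, work proportional to the embedding size."""
    return embedding_table[index]


def one_hot_matmul(index):
    """Dense-style equivalent: multiply a one-hot vector by the whole table,
    work proportional to vocab_size * embedding size."""
    vocab_size = len(embedding_table)
    dim = len(embedding_table[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * embedding_table[i][j] for i in range(vocab_size))
        for j in range(dim)
    ]


# Both paths select the same row of the table.
assert lookup(2) == one_hot_matmul(2)
```

With a vocabulary of 8185 entries, the one-hot path multiplies by thousands of zeros per token, which is the cost the lookup avoids.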

A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.
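
A minimal sketch of that iteration, assuming a plain one-unit tanh cell rather than the LSTM used in this tutorial. The function name and the weight values are arbitrary illustrations, not trained parameters.

```python
import math


def simple_rnn(inputs, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence of scalars."""
    state = 0.0  # initial hidden state
    outputs = []
    for x in inputs:  # visit the timesteps in order
        # The previous output (state) is fed back in alongside the new input.
        state = math.tanh(w_in * x + w_rec * state + bias)
        outputs.append(state)
    return outputs


outputs = simple_rnn([1.0, 0.0, -1.0])
print(outputs)
```

The second output is non-zero even though the second input is 0.0, because the state carried over from the first timestep still contributes.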

The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the output. This helps the RNN to learn long range dependencies.
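
The forward and backward passes can be sketched with a toy one-unit tanh RNN. The names `run_rnn` and `bidirectional` and the weights are hypothetical; the real wrapper operates on full Keras layers.

```python
import math


def run_rnn(inputs, w_in, w_rec):
    """Toy one-unit recurrent layer that returns only its final state."""
    state = 0.0
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
    return state


def bidirectional(inputs):
    """Run one toy RNN left-to-right and a second one right-to-left,
    then concatenate the two final states."""
    forward = run_rnn(inputs, w_in=0.5, w_rec=0.8)
    # The backward copy has its own (illustrative) weights.
    backward = run_rnn(list(reversed(inputs)), w_in=-0.3, w_rec=0.6)
    return [forward, backward]


features = bidirectional([1.0, 0.0, -1.0])
print(len(features))  # two features: one per direction
```

Each direction contributes its own units, which is why wrapping LSTM(64) in Bidirectional yields an output width of 128.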

In [11]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)
])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          523840    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 598,209
Trainable params: 598,209
Non-trainable params: 0
_________________________________________________________________
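
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four weight blocks (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias.

```python
vocab_size = 8185      # encoder.vocab_size printed earlier
embedding_dim = 64
lstm_units = 64

# Embedding: one 64-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 weight blocks, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
# Bidirectional wraps two independent LSTMs (forward and backward).
bidirectional_params = 2 * lstm_params

# Dense layers: inputs * units weights plus one bias per unit.
dense_params = (2 * lstm_units) * 64 + 64
dense_1_params = 64 * 1 + 1

total = embedding_params + bidirectional_params + dense_params + dense_1_params
print(embedding_params, bidirectional_params, dense_params, dense_1_params, total)
# prints: 523840 66048 8256 65 598209
```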


Note that a Keras Sequential model is used here because every layer in the model has a single input and produces a single output. If you want to use a stateful RNN layer, you may want to build the model with the Keras functional API or model subclassing so that you can retrieve and reuse the RNN layer states. See the Keras RNN guide for more details.

Compile the Keras model to configure the training process:

In [13]:
# stateful RNN   https://blog.csdn.net/qq_27586341/article/details/88239404

In [12]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=["accuracy"])
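
To illustrate what from_logits=True implies, here is a pure-Python sketch of binary cross-entropy computed directly from a logit, using the numerically stable form max(x, 0) - x*z + log(1 + exp(-|x|)). The function names are hypothetical, and this covers a single example; Keras's own implementation may differ in detail and averages over the batch.

```python
import math


def bce_from_logit(label, logit):
    """Binary cross-entropy from a raw logit x and a 0/1 label z."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(label, prob):
    """Textbook form for comparison: -[z*log(p) + (1 - z)*log(1 - p)]."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


logit = 2.0
prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the logit
print(bce_from_logit(1, logit), bce_from_probability(1, prob))
```

The two forms agree for moderate logits, but the logit form stays finite for extreme values where the sigmoid would round to exactly 0 or 1.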

## Train the model

In [14]:
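import os
# Let TensorFlow allocate GPU memory on demand instead of reserving it all up front.
# This needs to be set before TensorFlow initializes the GPU to take effect.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"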

In [15]:
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

Epoch 1/10
    296/Unknown - 57s 193ms/step - loss: 0.6745 - accuracy: 0.5193

CancelledError:  [_Derived_]RecvAsync is cancelled.
	 [[{{node Reshape_11/_38}}]] [Op:__inference_distributed_function_6337]

Function call stack:
distributed_function


In [16]:
# After reading through various GitHub issue threads, the likely cause is insufficient memory, so the run fails on the GPU

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print("Test Loss:{}".format(test_loss))
print("Test Accuracy:{}".format(test_acc))

The above model does not mask the padding applied to the sequences. This can lead to skew if the model is trained on padded sequences and tested on un-padded sequences. Ideally you would use masking to avoid this, but as you can see below it has only a small effect on the output.

The model outputs logits (the final Dense layer has no activation and the loss is built with from_logits=True), so a prediction >= 0.0 is positive; otherwise it is negative.
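
A minimal sketch of how the model's raw output relates to a probability. Because the final Dense(1) layer has no activation and the loss is compiled with from_logits=True, model.predict returns a logit; a 0.5 cutoff applies to the sigmoid of that value, which is the same as a 0.0 cutoff on the raw value. The helper names below are hypothetical.

```python
import math


def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def label_from_logit(logit):
    """Thresholding the logit at 0.0 matches thresholding the probability at 0.5."""
    return "positive" if logit >= 0.0 else "negative"


for logit in (-2.0, 0.0, 1.5):
    print(logit, round(sigmoid(logit), 3), label_from_logit(logit))
```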

In [17]:
def pad_to_size(vec, size):
    """Zero-pad vec in place up to the given size and return it."""
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

In [22]:
def sample_predict(sample_pred_text, pad):
    encoded_sample_pred_text = encoder.encode(sample_pred_text)
    
    if pad:
        encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
    encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
    predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))
    
    return (predictions)

In [None]:
# Predict on a sample text without padding.

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)

In [None]:
# predict on a sample text with padding

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

In [None]:
plot_graphs(history, 'accuracy')

In [None]:
plot_graphs(history, 'loss')

## Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the return_sequences constructor argument:

+ Return either the full sequences of successive outputs for each timestep (a 3D tensor of shape (batch_size, timesteps, output_features)).
+ Return only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)).
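
The two modes can be sketched with a toy one-unit RNN in pure Python. The function `toy_rnn` and its weights are illustrative assumptions; only the shapes are meant to mirror the Keras behaviour.

```python
import math


def toy_rnn(batch, return_sequences=False):
    """Toy one-unit RNN over a batch of scalar sequences (output_features == 1)."""
    results = []
    for sequence in batch:
        state, outputs = 0.0, []
        for x in sequence:
            state = math.tanh(0.5 * x + 0.8 * state)
            outputs.append([state])
        # Keep every timestep's output, or only the last one.
        results.append(outputs if return_sequences else outputs[-1])
    return results


batch = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]    # batch_size=2, timesteps=3
full = toy_rnn(batch, return_sequences=True)    # nested shape (2, 3, 1)
last = toy_rnn(batch, return_sequences=False)   # nested shape (2, 1)
print(len(full), len(full[0]), len(full[0][0]))
print(len(last), len(last[0]))
```

When stacking recurrent layers, every layer except the last needs return_sequences=True, because the next recurrent layer expects a 3D input with a timestep axis.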

In [15]:
# return_sequences=True, for details see https://blog.csdn.net/u011327333/article/details/78501054

In [13]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

In [14]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))