In [1]:
import torch
torch.__version__

'1.7.1'

# Understanding recurrent neural networks

This notebook contains code migrated from samples found in Chapter 6, Section 2 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.


## A first recurrent layer in PyTorch

The process we just naively implemented in NumPy corresponds to an actual PyTorch layer, the `RNN` layer:


In [2]:
from torch.nn import RNN

There is just one minor difference: `RNN` processes batches of sequences, not just a single sequence like in our NumPy example. This means that it takes inputs of shape `(seq_len, batch, input_size)`, rather than `(seq_len, input_size)`.

Like all recurrent layers in PyTorch, `RNN` returns the full sequence of outputs for every timestep (a 3D tensor of shape `(seq_len, batch, num_directions * hidden_size)`), together with a tensor `h_n` of shape `(num_layers * num_directions, batch, hidden_size)`, which contains the hidden state at the final timestep, `t = seq_len`. Let's take a look at an example:

In [3]:
import torch.nn as nn

class Network1(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim):
        super(Network1, self).__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        output, hidden = self.rnn(embedded)

        return hidden   # return only the last output for each input sequence

INPUT_DIM = 10000
EMBEDDING_DIM = 32
HIDDEN_DIM = 32

model1 = Network1(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM)
print(model1)

Network1(
  (embedding): Embedding(10000, 32)
  (rnn): RNN(32, 32)
)
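
To make the two return values concrete, the following minimal sketch feeds a random batch through a bare `nn.RNN` and prints the resulting shapes. The tensor sizes are illustrative assumptions, not values used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumed for this sketch).
seq_len, batch, input_size, hidden_size = 5, 3, 8, 4

rnn = nn.RNN(input_size, hidden_size)        # one layer, unidirectional
x = torch.randn(seq_len, batch, input_size)  # (seq_len, batch, input_size)

output, h_n = rnn(x)
print(output.shape)  # torch.Size([5, 3, 4])
print(h_n.shape)     # torch.Size([1, 3, 4])

# For a single unidirectional layer, the last timestep of `output`
# matches the final hidden state.
same = torch.allclose(output[-1], h_n[0])
print(same)  # True
```

This is why a classifier that only needs a summary of the whole sequence can use either `output[-1]` or `h_n[0]` when the layer is single and unidirectional.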


It is sometimes useful to stack several recurrent layers one after the other in order to increase the representational power of a network. 
In such a setup, you have to get all intermediate layers to return full sequences:

In [4]:
class Network2(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim):
        super(Network2, self).__init__()

        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn1 = nn.RNN(embedding_dim, hidden_dim)
        # Layers after the first consume the previous layer's output (size hidden_dim).
        self.rnn2 = nn.RNN(hidden_dim, hidden_dim)
        self.rnn3 = nn.RNN(hidden_dim, hidden_dim)
        self.rnn4 = nn.RNN(hidden_dim, hidden_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        output, hidden = self.rnn1(embedded)
        output, hidden = self.rnn2(output)
        output, hidden = self.rnn3(output)
        output, hidden = self.rnn4(output)

        return hidden


model2 = Network2(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM)
print(model2)

Network2(
  (embedding): Embedding(10000, 32)
  (rnn1): RNN(32, 32)
  (rnn2): RNN(32, 32)
  (rnn3): RNN(32, 32)
  (rnn4): RNN(32, 32)
)
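
As a possible alternative to chaining separate layers by hand, `nn.RNN` accepts a `num_layers` argument that stacks the layers internally. The sketch below uses illustrative tensor sizes (an assumption, not taken from this notebook) to show how the stacking appears in the shape of `h_n`.

```python
import torch
import torch.nn as nn

# Four stacked recurrent layers inside a single module.
stacked = nn.RNN(input_size=32, hidden_size=32, num_layers=4)

x = torch.randn(10, 2, 32)  # (seq_len, batch, input_size), illustrative sizes
output, h_n = stacked(x)

print(output.shape)  # torch.Size([10, 2, 32]): outputs of the top layer only
print(h_n.shape)     # torch.Size([4, 2, 32]): final hidden state of each layer

top_hidden = h_n[-1]  # final hidden state of the last (top) layer
```

With this form, `h_n[-1]` plays the role of the `hidden` value returned by the last layer in a hand-written stack.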


Now let's try to use such a model on the IMDB movie review classification problem. We will build a machine learning model that classifies movie reviews into two categories: positive or negative. The IMDB training set contains 25000 movie reviews labeled by sentiment (positive / negative). Since our model cannot process strings directly, we will use a preprocessed dataset, `imdb.npz`, in which each word of a review has been replaced by an integer giving its frequency rank in the dataset. For instance, the integer "3" encodes the 3rd most frequent word in the data.

Before loading the dataset, let's first define some parameters:

In [5]:
start_char = 1          # The start of a sequence will be marked with 1.
num_words = 10000       # Number of words to consider as features. 10000 most frequent words are kept.
maxlen = 500            # Maximum sequence length. Cut texts after this number of words.
index_from = 3          # Index actual words with this index and higher.
oov_char = 2            # Words that were cut out because of the num_words limit will be replaced with 2.
pad_char = 0            # Padding value.
batch_size = 128
seed = 113

Now we can begin our data processing. First let's load the dataset:

In [6]:
from bigdl.dataset import base
import numpy as np

def download_imdb(dest_dir):
    """Download pre-processed IMDB movie review data

    :argument
        dest_dir: destination directory to store the data
    :return
        The absolute path of the stored data
    """
    file_name = "imdb.npz"
    file_abs_path = base.maybe_download(file_name,
                                        dest_dir,
                                        'https://s3.amazonaws.com/text-datasets/imdb.npz')
    return file_abs_path

def load_imdb(dest_dir='/tmp/.bigdl/dataset'):
    """Load IMDB dataset.

    :argument
        dest_dir: directory in which to cache the downloaded data.
    :return
        the train, test separated IMDB dataset.
    """
    path = download_imdb(dest_dir)
    f = np.load(path, allow_pickle=True)
    x_train = f['x_train']
    y_train = f['y_train']
    x_test = f['x_test']
    y_test = f['y_test']
    f.close()

    return (x_train, y_train), (x_test, y_test)

(x_train, y_train), (x_test, y_test) = load_imdb(dest_dir='.data/.bigdl/dataset')
print("Dataset loading finished.")
print("Length of training set: ", len(x_train))

Dataset loading finished.
Length of training set:  25000


Since only the training set of IMDB is used in this example, we will only preprocess the training set. Shuffle the data:

In [7]:
rng = np.random.RandomState(seed)
indices = np.arange(len(x_train))
rng.shuffle(indices)
x_train = x_train[indices]
y_train = y_train[indices]

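Prepend the start character to every sequence and shift each word index by `index_from`, which reserves the lowest indices for the padding, start and out-of-vocabulary markers: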

In [8]:
x_train = [[start_char] + [w + index_from for w in x] for x in x_train]

Since we only consider `num_words` words (features) in this example, any word whose index falls outside this range is discarded and replaced with `oov_char`.

In [9]:
x_train = [[w if (w < num_words) else oov_char for w in x] for x in x_train]
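
The effect of the start marker, the index shift and the out-of-vocabulary replacement is easier to see on a tiny sequence. In this sketch the raw sequence is made up and `num_words` is set to 10 purely for illustration; the notebook itself uses 10000.

```python
# Same marker values as in the notebook, but a tiny vocabulary (assumed for illustration).
start_char, index_from, oov_char, num_words = 1, 3, 2, 10

raw = [4, 1, 25, 7]  # hypothetical frequency ranks of four words

# Prepend the start marker and shift every word index by index_from.
shifted = [start_char] + [w + index_from for w in raw]
print(shifted)  # [1, 7, 4, 28, 10]

# Replace every index outside the vocabulary with the OOV marker.
clipped = [w if w < num_words else oov_char for w in shifted]
print(clipped)  # [1, 7, 4, 2, 2]
```

Note that the comparison is made after the shift, so a word whose shifted index equals `num_words` is already treated as out of vocabulary.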

When we feed sequences into our model, all sequences in a batch must have the same length, `maxlen`. Sequences shorter than `maxlen` are therefore padded at the front with `pad_char`, and longer sequences are truncated so that only their last `maxlen` tokens are kept.

In [10]:
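for i in range(len(x_train)):
    seq_len = len(x_train[i])
    if seq_len >= maxlen:
        # Keep only the last `maxlen` tokens.
        x_train[i] = x_train[i][(seq_len - maxlen):]
    else:
        # Left-pad with `pad_char` up to `maxlen`.
        x_train[i] = [pad_char] * (maxlen - seq_len) + x_train[i]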
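
The same padding and truncation logic can be sketched as a small helper and tried on toy sequences. The function name `pad_or_truncate` and the sample inputs are hypothetical and only serve as illustration.

```python
def pad_or_truncate(seq, maxlen, pad_char=0):
    """Return a copy of `seq` with exactly `maxlen` items.

    Short sequences are left-padded; long ones keep only their last items.
    """
    n = len(seq)
    if n >= maxlen:
        return seq[n - maxlen:]
    return [pad_char] * (maxlen - n) + seq

print(pad_or_truncate([1, 7, 4], 5))              # [0, 0, 1, 7, 4]
print(pad_or_truncate([1, 7, 4, 9, 8, 6, 5], 5))  # [4, 9, 8, 6, 5]
```

Padding at the front keeps the real tokens next to the final timestep, which is the one the model reads its hidden state from. Truncating from the front also drops the start marker of long reviews.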

To create a validation set, we can split the training set of 25000 samples into two parts: 20000 samples for training and 5000 samples for validation.

In [11]:
train_feature, train_label = x_train[0:20000], y_train[0:20000]
val_feature, val_label = x_train[20000:25000], y_train[20000:25000]

Create data loaders for the training and validation datasets:

In [12]:
from torch.utils.data import TensorDataset, DataLoader

def create_data_loader(feature, label, batch_size, shuffle=True):
    feature_tensor = torch.LongTensor(feature)
    label_tensor = torch.FloatTensor(label).unsqueeze(1)
    dataset = TensorDataset(feature_tensor, label_tensor)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)

train_loader = create_data_loader(train_feature, train_label, batch_size)
val_loader = create_data_loader(val_feature, val_label, batch_size, False)
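
To see what such a loader yields, the sketch below builds one from a few made-up, already padded sequences and inspects the first batch. The toy features and labels are assumptions for illustration only.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Three toy sequences of length 4 with binary labels (made-up data).
features = torch.LongTensor([[0, 0, 1, 7], [1, 4, 2, 2], [0, 1, 5, 6]])
labels = torch.FloatTensor([1, 0, 1]).unsqueeze(1)  # shape (3, 1)

loader = DataLoader(TensorDataset(features, labels), batch_size=2, shuffle=False)

xb, yb = next(iter(loader))
print(xb.shape, xb.dtype)  # torch.Size([2, 4]) torch.int64
print(yb.shape, yb.dtype)  # torch.Size([2, 1]) torch.float32
```

Features are integer tensors because `nn.Embedding` looks them up as indices, and labels are float tensors with a trailing dimension of 1 so that they line up with a model output of shape `(batch, 1)`. Batches come out batch-first, so a model built on the default `nn.RNN` layout has to transpose them to `(seq_len, batch)`.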

Now we are ready to create and train the model. First, initialize the Orca context:

In [13]:
from zoo.orca import init_orca_context, stop_orca_context
from zoo.orca import OrcaContext

# recommended to set it to True when running Analytics Zoo in Jupyter notebook. 
OrcaContext.log_output = True # (this will display terminal's stdout and stderr in the Jupyter notebook).

cluster_mode = "local"

if cluster_mode == "local":
    init_orca_context(cores=1, memory="2g")   # run in local mode
elif cluster_mode == "k8s":
    init_orca_context(cluster_mode="k8s", num_nodes=2, cores=4) # run on K8s cluster
elif cluster_mode == "yarn":
    init_orca_context(
        cluster_mode="yarn-client", cores=4, num_nodes=2, memory="2g",
        driver_memory="10g", driver_cores=1,
        conf={"spark.rpc.message.maxSize": "1024",
              "spark.task.maxFailures": "1",
              "spark.driver.extraJavaOptions": "-Dbigdl.failure.retryTimes=1"})   # run on Hadoop YARN cluster

Initializing orca context
Current pyspark location is : /intern/spark/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is:  --driver-class-path /home/jinglei/analytics-zoo/zoo/target/analytics-zoo-bigdl_0.12.1-spark_2.4.3-0.10.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell 
2021-03-08 16:17:18 WARN  Utils:66 - Your hostname, intern01 resolves to a loopback address: 127.0.1.1; using 10.239.44.107 instead (on interface eno1)
2021-03-08 16:17:18 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2021-03-08 16:17:18 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-03-08 16:17:19 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jinglei/analytics-zoo/zoo/target/analytics-zoo-bigdl

Create a simple recurrent network using an Embedding layer and an RNN layer:

In [14]:
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleRNN, self).__init__()

        self.embedding = nn.Embedding(input_dim, 32)
        for name, param in self.embedding.named_parameters(): 
            torch.nn.init.uniform_(param)
        
        self.rnn = nn.RNN(32, 32)
        for name, param in self.rnn.named_parameters(): 
            if 'weight_ih' in name:
                torch.nn.init.xavier_uniform_(param)
            elif 'weight_hh' in name:
                torch.nn.init.orthogonal_(param)
            elif 'bias' in name:
                torch.nn.init.zeros_(param)
                
        self.fc = nn.Linear(32, 1)
        # Initialize the linear layer itself (its parameters are named 'weight' and 'bias').
        for name, param in self.fc.named_parameters(): 
            if 'weight' in name:
                torch.nn.init.xavier_uniform_(param)
            elif 'bias' in name:
                torch.nn.init.zeros_(param)
                
        self.out_act = nn.Sigmoid()

    def forward(self, text):
        embedded = self.embedding(text.transpose(0, 1))     # text: [128, 500] embedded: [500, 128, 32]
        output, hidden = self.rnn(embedded)                 # hidden: [1, 128, 32]

        es = "output: ", output[-1,:,:].shape, " hidden: ", hidden.shape
        assert torch.equal(output[-1,:,:], hidden.squeeze(0)), es
        
        ret = self.fc(hidden.squeeze(0))                    # [128, 1]
        #return ret.squeeze(1)
        return self.out_act(ret)

INPUT_DIM = 10000
EMBEDDING_DIM = 32
HIDDEN_DIM = 32

model = SimpleRNN(INPUT_DIM)
model.train()
print(model)

SimpleRNN(
  (embedding): Embedding(10000, 32)
  (rnn): RNN(32, 32)
  (fc): Linear(in_features=32, out_features=1, bias=True)
  (out_act): Sigmoid()
)


Specify loss function, optimizer and metrics:

In [15]:
from zoo.orca.learn.metrics import Accuracy

rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001)
criterion = torch.nn.BCELoss()
metrics = [Accuracy()]
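
`BCELoss` expects probabilities, which is why the model ends with a sigmoid. As a side note, plain PyTorch also offers `BCEWithLogitsLoss`, which takes the raw pre-sigmoid scores instead. The sketch below, using made-up logits and targets, checks that the two formulations give the same loss value. It is not wired into the Orca estimator used in this notebook.

```python
import torch
import torch.nn as nn

# Made-up raw scores (logits) and binary targets, both of shape (batch, 1).
logits = torch.tensor([[0.8], [-1.2], [2.0]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

# Sigmoid followed by BCELoss, as in this notebook's model.
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Same quantity computed directly from the logits.
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(torch.allclose(loss_probs, loss_logits, atol=1e-6))  # True
```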

Now we can start the training:

In [16]:
from zoo.orca.learn.pytorch import Estimator
from zoo.orca.learn.trigger import EveryEpoch

est = Estimator.from_torch(model=model, optimizer=rmsprop, loss=criterion, metrics=metrics)
est.fit(data=train_loader, epochs=3, validation_data=val_loader, batch_size=batch_size, checkpoint_trigger=EveryEpoch())

[Epoch 3 9472/20096][Iteration 388][Wall Clock 138.068877742s] Trained 128.0 records in 0.275711347 seconds. Throughput is 464.25363 records/second. Loss is 0.4504674. 
2021-03-08 16:20:48 INFO  DistriOptimizer$:427 - [Epoch 3 9600/20096][Iteration 389][Wall Clock 138.426206647s] Trained 128.0 records in 0.357328905 seconds. Throughput is 358.2134 records/second. Loss is 0.5536856. 
2021-03-08 16:20:48 INFO  DistriOptimizer$:427 - [Epoch 3 9728/20096][Iteration 390][Wall Clock 138.740832851s] Trained 128.0 records in 0.314626204 seconds. Throughput is 406.83197 records/second. Loss is 0.60600984. 
2021-03-08 16:20:48 INFO  DistriOptimizer$:427 - [Epoch 3 9856/20096][Iteration 391][Wall Clock 139.016926222s] Trained 128.0 records in 0.276093371 seconds. Throughput is 463.6113 records/second. Loss is 0.49867636. 
2021-03-08 16:20:48 INFO  DistriOptimizer$:427 - [Epoch 3 9984/20096][Iteration 392][Wall Clock 139.251754281s] Trained 128.0 records in 0.234828059 seconds. Throughput is 545.0

<zoo.orca.learn.pytorch.estimator.PyTorchSparkEstimator at 0x7feac80d0c10>

In [17]:
stop_orca_context()

Stopping orca context
