
To get the data, install the NLTK library if it is not already present (NLTK is included in the Anaconda distribution), along with the 10% treebank dataset, which is not installed by default. To install NLTK, follow the steps on the NLTK install page [23]. To install the treebank dataset, run the following at the Python REPL:

In [1]:
import nltk
nltk.download("treebank")

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [2]:
import numpy as np
import os
import shutil
import tensorflow as tf

We will lazily import the NLTK treebank dataset into a pair of parallel flat files, one containing the sentences and the other containing a corresponding POS sequence:

In [3]:
def download_and_read(dataset_dir, num_pairs=None):
   sent_filename = os.path.join(dataset_dir, "treebank-sents.txt")
   poss_filename = os.path.join(dataset_dir, "treebank-poss.txt")
   if not(os.path.exists(sent_filename) and os.path.exists(poss_filename)):
       # flat files not built yet: write one sentence / tag sequence per line
       import nltk
       os.makedirs(dataset_dir, exist_ok=True)
       sentences = nltk.corpus.treebank.tagged_sents()
       with open(sent_filename, "w") as fsents, \
               open(poss_filename, "w") as fposs:
           for sent in sentences:
               fsents.write(" ".join([w for w, p in sent]) + "\n")
               fposs.write(" ".join([p for w, p in sent]) + "\n")
   sents, poss = [], []
   with open(sent_filename, "r") as fsent:
       for idx, line in enumerate(fsent):
           # stop once num_pairs lines have been read
           if num_pairs is not None and idx >= num_pairs:
               break
           sents.append(line.strip())
   with open(poss_filename, "r") as fposs:
       for idx, line in enumerate(fposs):
           if num_pairs is not None and idx >= num_pairs:
               break
           poss.append(line.strip())
   return sents, poss
sents, poss = download_and_read("./datasets")
assert(len(sents) == len(poss))
print("# of records: {:d}".format(len(sents)))


# of records: 3914
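
The parallel flat-file layout described above (line i of one file holds a sentence, line i of the other holds its tags) can be illustrated on toy data. This is only a sketch: `read_parallel`, the file names, and the three example sentences are hypothetical and not part of the notebook.

```python
import itertools
import os
import tempfile

def read_parallel(sent_path, pos_path, num_pairs=None):
    """Read at most num_pairs aligned (sentence, tags) lines."""
    with open(sent_path) as fs, open(pos_path) as fp:
        pairs = zip(fs, fp)
        if num_pairs is not None:
            pairs = itertools.islice(pairs, num_pairs)
        return [(s.strip(), p.strip()) for s, p in pairs]

# toy data in the same one-line-per-sentence layout
tmp_dir = tempfile.mkdtemp()
sent_path = os.path.join(tmp_dir, "sents.txt")
pos_path = os.path.join(tmp_dir, "poss.txt")
with open(sent_path, "w") as f:
    f.write("The cat sat\nDogs bark\nBirds fly south\n")
with open(pos_path, "w") as f:
    f.write("DT NN VBD\nNNS VBP\nNNS VBP RB\n")

pairs = read_parallel(sent_path, pos_path, num_pairs=2)
for sent, tags in pairs:
    # every word has exactly one tag on the matching line
    assert len(sent.split()) == len(tags.split())
```

Reading both files through a single `zip` keeps the two lists aligned by construction, which is the property the `assert(len(sents) == len(poss))` check above relies on.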


There are 3914 sentences in our dataset. We will then use the TensorFlow (tf.keras) tokenizer to tokenize the sentences and create a list of sentence tokens. We reuse the same infrastructure to tokenize the parts of speech, although we could have simply split on spaces. Each input record to the network is currently a sequence of text tokens, but it needs to be a sequence of integers. During tokenization, the Tokenizer also keeps track of the tokens in the vocabulary, from which we can build mappings from token to integer and back.

We have two vocabularies to consider: the vocabulary of word tokens in the sentence collection, and the vocabulary of POS tags in the part-of-speech collection. The following code shows how to tokenize both collections and generate the necessary mapping dictionaries:

In [20]:
def tokenize_and_build_vocab(texts, vocab_size=None, lower=True):
    if vocab_size is None:
        tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=lower)
    else:
        tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=vocab_size+1, oov_token="UNK", lower=lower)
    tokenizer.fit_on_texts(texts)
    if vocab_size is not None:
        # additional workaround, see issue 8092
        # https://github.com/keras-team/keras/issues/8092
        tokenizer.word_index = {e:i for e, i in tokenizer.word_index.items() 
            if i <= vocab_size+1 }
    word2idx = tokenizer.word_index
    idx2word = {v:k for k, v in word2idx.items()}
    return word2idx, idx2word, tokenizer
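
To make the token-to-integer mapping concrete, here is a minimal plain-Python sketch of the idea. It is not the Keras Tokenizer: `build_vocab` and `encode` are hypothetical helpers that only split on whitespace (the real Tokenizer also filters punctuation), and the toy sentences are made up. Index 0 is left free for padding and index 1 is the unknown token, mirroring the layout of the dictionaries returned above.

```python
from collections import Counter

def build_vocab(texts, vocab_size=None, lower=True):
    """Map the most frequent tokens to integers, starting at 2."""
    counts = Counter()
    for text in texts:
        if lower:
            text = text.lower()
        counts.update(text.split())
    # index 0 is left free for padding, index 1 is the unknown token
    word2idx = {"UNK": 1}
    for word, _ in counts.most_common(vocab_size):
        word2idx[word] = len(word2idx) + 1
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

def encode(text, word2idx, lower=True):
    """Replace each token by its index, or by the UNK index if unseen."""
    if lower:
        text = text.lower()
    return [word2idx.get(w, word2idx["UNK"]) for w in text.split()]

texts = ["the cat sat", "the dog sat", "the bird"]
word2idx, idx2word = build_vocab(texts, vocab_size=2)
encoded = encode("The cat sat", word2idx)
```

With `vocab_size=2` only the two most frequent tokens ("the" and "sat") keep their own index; "cat" falls back to the UNK index.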



In [21]:
# download and read source and target data into data structure
sents, poss = download_and_read("./datasets", num_pairs=NUM_PAIRS)
assert(len(sents) == len(poss))
print("# of records: {:d}".format(len(sents)))

# vocabulary sizes
word2idx_s, idx2word_s, tokenizer_s = tokenize_and_build_vocab(
    sents, vocab_size=9000)
word2idx_t, idx2word_t, tokenizer_t = tokenize_and_build_vocab(
    poss, vocab_size=38, lower=False)
source_vocab_size = len(word2idx_s)
target_vocab_size = len(word2idx_t)
print("vocab sizes (source): {:d}, (target): {:d}".format(
    source_vocab_size, target_vocab_size))

# of records: 3914
vocab sizes (source): 9001, (target): 39


In [5]:
sequence_lengths = np.array([len(s.split()) for s in sents])
print([(p, np.percentile(sequence_lengths, p))
    for p in [75, 80, 90, 95, 99, 100]])

[(75, 33.0), (80, 35.0), (90, 41.0), (95, 47.0), (99, 58.0), (100, 271.0)]
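
The percentiles show that 99% of the sentences have at most 58 tokens, while the longest one has 271. The notebook pads everything to 271; the sketch below only illustrates how one might quantify what a smaller cap would cost. Both helper functions and the list of lengths are hypothetical.

```python
def coverage(lengths, cap):
    """Fraction of sequences whose length is at most cap."""
    return sum(1 for n in lengths if n <= cap) / len(lengths)

def truncated_tokens(lengths, cap):
    """Number of tokens that would be cut off by truncating to cap."""
    return sum(max(n - cap, 0) for n in lengths)

# hypothetical sentence lengths, not taken from the treebank
lengths = [5, 12, 20, 33, 41, 58, 271]

for cap in (58, 271):
    print(cap, round(coverage(lengths, cap), 3), truncated_tokens(lengths, cap))
```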


In [6]:
word2idx_t

{"''": 22,
 'CC': 14,
 'CD': 9,
 'DT': 5,
 'EX': 33,
 'FW': 37,
 'IN': 3,
 'JJ': 8,
 'JJR': 24,
 'JJS': 28,
 'LRB': 32,
 'LS': 36,
 'MD': 20,
 'NN': 2,
 'NNP': 4,
 'NNPS': 26,
 'NNS': 7,
 'NONE': 6,
 'PDT': 35,
 'POS': 21,
 'PRP': 13,
 'RB': 11,
 'RBR': 30,
 'RBS': 34,
 'RP': 27,
 'RRB': 31,
 'SYM': 39,
 'TO': 15,
 'UH': 38,
 'UNK': 1,
 'VB': 12,
 'VBD': 10,
 'VBG': 18,
 'VBN': 16,
 'VBP': 19,
 'VBZ': 17,
 'WDT': 23,
 'WP': 25,
 'WRB': 29}

The next step is to create the dataset from our inputs. First, we convert the sequences of tokens and POS tags to sequences of integers. Second, we pad shorter sequences to the maximum length of 271. Note that we perform one additional operation on the POS tag sequences after padding: rather than keep them as sequences of integers, we convert them to sequences of one-hot encodings using the to_categorical() function. TensorFlow 2.0 does provide loss functions that handle outputs as sequences of integers, but we want to keep our code as simple as possible, so we do the conversion ourselves. Finally, we use the from_tensor_slices() function to create our dataset; shuffling it and splitting it into training, validation, and test sets happens in a later cell:

In [22]:
max_seqlen = 271

# convert sentences to integer sequences, post-padded with 0 to max_seqlen
sents_as_ints = tokenizer_s.texts_to_sequences(sents)
sents_as_ints = tf.keras.preprocessing.sequence.pad_sequences(
    sents_as_ints, maxlen=max_seqlen, padding="post")
# same for the POS tag sequences
poss_as_ints = tokenizer_t.texts_to_sequences(poss)
poss_as_ints = tf.keras.preprocessing.sequence.pad_sequences(
    poss_as_ints, maxlen=max_seqlen, padding="post")
# index 0 is reserved for padding in both vocabularies
idx2word_s[0], idx2word_t[0] = "PAD", "PAD"
# one-hot encode the tag sequences
poss_as_catints = []
for p in poss_as_ints:
    poss_as_catints.append(tf.keras.utils.to_categorical(p, 
        num_classes=target_vocab_size, dtype="int32"))
# every item is already max_seqlen long, so this only stacks them into one array
poss_as_catints = tf.keras.preprocessing.sequence.pad_sequences(
    poss_as_catints, maxlen=max_seqlen)
# create dataset
dataset = tf.data.Dataset.from_tensor_slices(
    (sents_as_ints, poss_as_catints))
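
The two transformations applied to the tag sequences, post-padding and one-hot encoding, can be sketched in plain Python. `pad_post` and `one_hot` are hypothetical stand-ins for pad_sequences(..., padding="post") and to_categorical(); unlike pad_sequences, this sketch refuses sequences longer than `maxlen` instead of truncating them.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Right-pad seq with pad_value up to maxlen."""
    seq = list(seq)
    if len(seq) > maxlen:
        raise ValueError("sequence longer than maxlen")
    return seq + [pad_value] * (maxlen - len(seq))

def one_hot(seq, num_classes):
    """Turn each integer into a 0/1 vector of length num_classes."""
    return [[1 if i == idx else 0 for i in range(num_classes)]
            for idx in seq]

tags = [3, 1, 2]                      # toy tag indices
padded = pad_post(tags, maxlen=5)     # real tags first, then padding
encoded = one_hot(padded, num_classes=4)
```

Each padded position becomes a one-hot vector for class 0, which is why the padding index has to be counted among the output classes.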



Next, we will define our model and instantiate it. The model is a stack of layers: an embedding layer, a dropout layer, a bidirectional GRU layer, a dense layer, and a softmax activation layer. The input is a batch of integer sequences with shape (batch_size, max_seqlen). The embedding layer converts each integer in the sequence to a vector of size embedding_dim, so the tensor now has shape (batch_size, max_seqlen, embedding_dim). Each of these vectors is passed to the corresponding time step of a bidirectional GRU with an output dimension of 256. A bidirectional GRU runs one GRU over the sequence from left to right and a second one from right to left, then concatenates their outputs at each time step, so the tensor that comes out has shape (batch_size, max_seqlen, 2*rnn_output_dim). Each time step, of shape (batch_size, 1, 2*rnn_output_dim), is fed into a dense layer that converts it to a vector the size of the target vocabulary, giving (batch_size, max_seqlen, target_vocab_size). The final softmax layer is applied to each time step and turns that vector into a probability distribution over POS tags; the predicted tag at each position is the one with the highest probability.

Finally, we declare the model with some parameters, then compile it with the Adam optimizer, the categorical cross-entropy loss function, and accuracy as the metric:

In [23]:
class POSTaggingModel(tf.keras.Model):
   def __init__(self, source_vocab_size, target_vocab_size,
           embedding_dim, max_seqlen, rnn_output_dim, **kwargs):
       super(POSTaggingModel, self).__init__(**kwargs)
       self.embed = tf.keras.layers.Embedding(
           source_vocab_size, embedding_dim, input_length=max_seqlen)
       self.dropout = tf.keras.layers.SpatialDropout1D(0.2)
       self.rnn = tf.keras.layers.Bidirectional(
           tf.keras.layers.GRU(rnn_output_dim, return_sequences=True))
       self.dense = tf.keras.layers.TimeDistributed(
           tf.keras.layers.Dense(target_vocab_size))
       self.activation = tf.keras.layers.Activation("softmax")
   def call(self, x):
       x = self.embed(x)
       x = self.dropout(x)
       x = self.rnn(x)
       x = self.dense(x)
       x = self.activation(x)
       return x
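
The shape changes described in the text can be traced without running TensorFlow. The function below is a hypothetical bookkeeping helper, not part of the model; the numbers passed to it are the settings used in this notebook (batch size 128, sequence length 271, embedding size 128, GRU size 256, 39 output classes).

```python
def trace_shapes(batch_size, max_seqlen, embedding_dim,
                 rnn_output_dim, target_vocab_size):
    """Return (layer name, output shape) pairs for the layer stack."""
    return [
        ("input", (batch_size, max_seqlen)),
        ("embedding", (batch_size, max_seqlen, embedding_dim)),
        # dropout does not change the shape
        ("dropout", (batch_size, max_seqlen, embedding_dim)),
        # forward and backward outputs are concatenated
        ("bidirectional GRU", (batch_size, max_seqlen, 2 * rnn_output_dim)),
        ("dense", (batch_size, max_seqlen, target_vocab_size)),
        # softmax normalizes the last axis and keeps the shape
        ("softmax", (batch_size, max_seqlen, target_vocab_size)),
    ]

shapes = trace_shapes(128, 271, 128, 256, 39)
for name, shape in shapes:
    print("{:18s} {}".format(name, shape))
```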

In [24]:
def masked_accuracy():
    def masked_accuracy_fn(ytrue, ypred):
        ytrue = tf.keras.backend.argmax(ytrue, axis=-1)
        ypred = tf.keras.backend.argmax(ypred, axis=-1)
 
        mask = tf.keras.backend.cast(
            tf.keras.backend.not_equal(ypred, 0), tf.int32)
        matches = tf.keras.backend.cast(
            tf.keras.backend.equal(ytrue, ypred), tf.int32) * mask
        numer = tf.keras.backend.sum(matches)
        denom = tf.keras.backend.maximum(tf.keras.backend.sum(mask), 1)
        accuracy =  numer / denom
        return accuracy

    return masked_accuracy_fn
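
The metric above builds its mask from the predictions (positions where the predicted class is not 0). A common alternative, shown here only as a sketch and not as the notebook's method, is to mask on the labels instead, so that a real token wrongly predicted as padding still counts as an error. Both functions and the toy arrays are hypothetical and use plain Python lists of class indices.

```python
def prediction_masked_accuracy(true_ids, pred_ids, pad_id=0):
    """Ignore positions where the prediction is the padding class."""
    matches = total = 0
    for t_seq, p_seq in zip(true_ids, pred_ids):
        for t, p in zip(t_seq, p_seq):
            if p == pad_id:
                continue
            total += 1
            matches += int(t == p)
    return matches / max(total, 1)

def label_masked_accuracy(true_ids, pred_ids, pad_id=0):
    """Ignore positions where the label is the padding class."""
    matches = total = 0
    for t_seq, p_seq in zip(true_ids, pred_ids):
        for t, p in zip(t_seq, p_seq):
            if t == pad_id:
                continue
            total += 1
            matches += int(t == p)
    return matches / max(total, 1)

true_ids = [[3, 2, 5, 0, 0], [4, 4, 0, 0, 0]]
pred_ids = [[3, 2, 1, 0, 0], [4, 0, 0, 0, 0]]
pred_masked = prediction_masked_accuracy(true_ids, pred_ids)
label_masked = label_masked_accuracy(true_ids, pred_ids)
```

On this toy input the two variants disagree (0.75 versus 0.6) because the second sequence contains a real token that was predicted as padding.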

In [25]:
def clean_logs(data_dir):
    logs_dir = os.path.join(data_dir, "logs")
    shutil.rmtree(logs_dir, ignore_errors=True)
    return logs_dir

In [18]:
# set random seed
tf.random.set_seed(42)

# clean up log area
data_dir = "./data"
logs_dir = clean_logs(data_dir)


In [27]:
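# split into training, validation, and test datasets
# shuffle once only: with the default reshuffle_each_iteration=True the order
# changes on every pass, so the take/skip splits below would overlap
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False)
test_size = len(sents) // 3
val_size = (len(sents) - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)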

# create batches
batch_size = BATCH_SIZE
train_dataset = train_dataset.batch(batch_size)
val_dataset = val_dataset.batch(batch_size)
test_dataset = test_dataset.batch(batch_size)
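
The take/skip arithmetic can be checked with plain Python. This sketch assumes the 3914 records reported earlier and uses a hypothetical `split_indices` helper that shuffles a list of indices once and then slices it, which is what keeps the three parts disjoint.

```python
import random

def split_indices(n, seed=42):
    """Shuffle once, then slice into train / validation / test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    test_size = n // 3
    val_size = (n - test_size) // 10
    test = indices[:test_size]
    val = indices[test_size:test_size + val_size]
    train = indices[test_size + val_size:]
    return train, val, test

train, val, test = split_indices(3914)
print(len(train), len(val), len(test))

# the three parts share no record and together cover the whole dataset
assert not (set(train) & set(val) or set(train) & set(test)
            or set(val) & set(test))
assert len(train) + len(val) + len(test) == 3914
```

With these sizes the split is 2349 training, 261 validation, and 1304 test records.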

In [None]:
# hyperparameters; run this cell before the cells above that use NUM_PAIRS and BATCH_SIZE
NUM_PAIRS = None
EMBEDDING_DIM = 128
RNN_OUTPUT_DIM = 256
BATCH_SIZE = 128
NUM_EPOCHS = 50

In [28]:
# define model
embedding_dim = EMBEDDING_DIM
rnn_output_dim = RNN_OUTPUT_DIM

model = POSTaggingModel(source_vocab_size, target_vocab_size,
    embedding_dim, max_seqlen, rnn_output_dim)
model.build(input_shape=(batch_size, max_seqlen))
model.summary()

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam", 
    metrics=["accuracy", masked_accuracy()])

Model: "pos_tagging_model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      multiple                  1152128   
_________________________________________________________________
spatial_dropout1d_2 (Spatial multiple                  0         
_________________________________________________________________
bidirectional_2 (Bidirection multiple                  592896    
_________________________________________________________________
time_distributed_2 (TimeDist multiple                  20007     
_________________________________________________________________
activation_2 (Activation)    multiple                  0         
Total params: 1,765,031
Trainable params: 1,765,031
Non-trainable params: 0
_________________________________________________________________
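
The parameter counts in the summary can be reproduced by hand. The sketch below uses hypothetical helper functions; the GRU formula assumes three gates with input weights, recurrent weights, and two bias vectors each, which is the layout tf.keras uses when reset_after=True (its default in TensorFlow 2).

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    """3 gates, each with input weights, recurrent weights, 2 biases."""
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return (input_dim + 1) * units

embed = embedding_params(9001, 128)
birnn = 2 * gru_params(128, 256)      # forward + backward GRU
dense = dense_params(2 * 256, 39)     # input is the concatenated GRU output
total = embed + birnn + dense
print(embed, birnn, dense, total)
```

The four numbers match the Param # column and the total reported in the summary.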


In [29]:
# train
num_epochs = NUM_EPOCHS

best_model_file = os.path.join(data_dir, "best_model.h5")
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    best_model_file, 
    save_weights_only=True,
    save_best_only=True)
tensorboard = tf.keras.callbacks.TensorBoard(log_dir=logs_dir)
history = model.fit(train_dataset, 
    epochs=num_epochs,
    validation_data=val_dataset,
    callbacks=[checkpoint, tensorboard])

Epoch 1/50
...
Epoch 50/50
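
With save_best_only=True, the checkpoint callback only overwrites the weights file when the monitored quantity improves; by default it monitors the validation loss. The selection logic can be sketched in plain Python. `best_epoch` and the list of losses are hypothetical and only illustrate the rule, not the callback's implementation.

```python
def best_epoch(val_losses):
    """Return the best epoch and the epochs at which a save would occur."""
    best_loss, best_idx = float("inf"), None
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_idx = loss, epoch
            saved.append(epoch)  # a checkpoint would be written here
    return best_idx, saved

# hypothetical validation losses, not taken from the training run
val_losses = [0.90, 0.55, 0.40, 0.42, 0.38, 0.41]
best, saved = best_epoch(val_losses)
```

After training, the file on disk holds the weights from the epoch returned as `best`, which is why the evaluation step reloads it into a fresh model.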


In [30]:
# evaluate with test set
best_model = POSTaggingModel(source_vocab_size, target_vocab_size,
    embedding_dim, max_seqlen, rnn_output_dim)
best_model.build(input_shape=(batch_size, max_seqlen))
best_model.load_weights(best_model_file)
best_model.compile(
    loss="categorical_crossentropy",
    optimizer="adam", 
    metrics=["accuracy", masked_accuracy()])


In [31]:
test_loss, test_acc, test_masked_acc = best_model.evaluate(test_dataset)
print("test loss: {:.3f}, test accuracy: {:.3f}, masked test accuracy: {:.3f}".format(
    test_loss, test_acc, test_masked_acc))

# predict on batches
labels, predictions = [], []
is_first_batch = True
accuracies = []

for test_batch in test_dataset:
    inputs_b, outputs_b = test_batch
    preds_b = best_model.predict(inputs_b)
    # convert from categorical to list of ints
    preds_b = np.argmax(preds_b, axis=-1)
    outputs_b = np.argmax(outputs_b.numpy(), axis=-1)
    for i, (pred_l, output_l) in enumerate(zip(preds_b, outputs_b)):
        assert(len(pred_l) == len(output_l))
        # sequences are post-padded, so the real tags come first;
        # the number of non-PAD labels tells us where the padding starts
        seq_len = np.count_nonzero(output_l)
        if seq_len == 0:
            continue
        acc = np.count_nonzero(
            np.equal(
                output_l[:seq_len], pred_l[:seq_len]
            )
        ) / seq_len
        accuracies.append(acc)
        if is_first_batch:
            words = [idx2word_s[x] for x in inputs_b.numpy()[i][:seq_len]]
            postags_l = [idx2word_t[x] for x in output_l[:seq_len] if x > 0]
            postags_p = [idx2word_t[x] for x in pred_l[:seq_len] if x > 0]
            print("labeled  : {:s}".format(" ".join(["{:s}/{:s}".format(w, p) 
                for (w, p) in zip(words, postags_l)])))
            print("predicted: {:s}".format(" ".join(["{:s}/{:s}".format(w, p) 
                for (w, p) in zip(words, postags_p)])))
            print(" ")
    is_first_batch = False

accuracy_score = np.mean(np.array(accuracies))
print("pos tagging accuracy: {:.3f}".format(accuracy_score))

test loss: 0.066, test accuracy: 0.981, masked test accuracy: 0.780
labeled  : the/DT sale/NN of/IN southern/NNP optical/NNP is/VBZ a/DT part/NN of/IN the/DT program/NN
predicted: the/DT sale/NN of/IN southern/NNP optical/NNP is/VBZ a/DT part/NN of/IN the/DT program/NN
 
labeled  : yields/NNS on/IN money/JJ market/JJ mutual/NNS funds/VBD continued/NONE 1/TO to/VB slide/IN amid/NNS signs/IN that/NN portfolio/NNS managers/VBP expect/JJ further/NNS declines/IN in/NN interest/NNS
predicted: yields/NNS on/IN money/NN market/NN mutual/NNS funds/NNS continued/NONE 1/NONE to/TO slide/IN amid/NNS signs/NNS that/NN portfolio/NNS managers/NNS expect/JJ further/NNS declines/IN in/NN interest/NNS
 
labeled  : still/RB bankers/NNS expect/VBP packaging/NN to/TO flourish/VB primarily/RB because/IN more/JJR customers/NNS are/VBP demanding/VBG that/IN financial/JJ services/NNS be/VB tailored/VBN 124/NONE to/TO their/PRP needs/NNS
predicted: still/RB bankers/NNS expect/VBP packaging/NN to/TO flourish/VB 
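
The step from per-time-step probabilities back to tag names, done in the evaluation loop with np.argmax and idx2word_t, can be illustrated on a toy example. `decode_tags`, the four-entry tag dictionary, and the probabilities are hypothetical.

```python
def decode_tags(probs, idx2tag, seq_len):
    """Pick the most probable tag at each of the first seq_len steps."""
    tags = []
    for step in probs[:seq_len]:
        best = max(range(len(step)), key=lambda i: step[i])
        tags.append(idx2tag[best])
    return tags

idx2tag = {0: "PAD", 1: "UNK", 2: "NN", 3: "IN"}  # toy tag vocabulary
probs = [
    [0.05, 0.05, 0.80, 0.10],   # most likely NN
    [0.10, 0.05, 0.15, 0.70],   # most likely IN
    [0.90, 0.02, 0.04, 0.04],   # padding position
]
decoded = decode_tags(probs, idx2tag, seq_len=2)
```

Passing the true sequence length keeps the padded tail out of the decoded output.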