<a href="https://colab.research.google.com/github/ipavlopoulos/modern_nlp/blob/main/Modern_NLP_S2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ModernNLP: #2
* Discussing text restoration by [Sommerschield et al.](https://www.aclweb.org/anthology/D19-1668/).
* Experimenting with a simple RNN (GRU) encoder in PyTorch.
* Performing text classification to predict the next character.
* Instead of Ancient Greek text, we will use Plato in English. 

> Authored by John Pavlopoulos & Vasiliki Kougia

In [None]:
import nltk; nltk.download('punkt')
from urllib.request import urlopen
from nltk.tokenize import sent_tokenize
import random; random.seed(42)
import numpy as np
from math import ceil, floor
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
import torch.nn.functional as F
from torch.autograd import Variable

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Download and pre-process the data

In [None]:
# This paper's dataset takes too long to download; use Plato in English.
data = urlopen("http://www.gutenberg.org/cache/epub/1497/pg1497.txt").read().decode("utf8")
data = data[760:-19110] # cut editorial notes and licences

In [None]:
# Tokenise the text into sentences, then strip whitespace and lowercase
sentences = sent_tokenize(data)
sentences = [s.strip().lower() for s in sentences]
np.random.seed(42)  # random.seed(42) above does not seed NumPy
np.random.shuffle(sentences)

# The vocabulary will comprise characters
all_letters = list(set(" ".join(sentences)))
print(all_letters)

['q', '!', '0', 'z', 'p', '=', 'c', '+', 'o', 'u', '.', "'", 'f', '8', 't', '?', ':', 'k', '4', '5', 's', 'i', '2', '(', ' ', 'r', '3', 'j', 'g', 'w', '-', 'b', 'h', 'm', '/', ';', 'v', ',', '1', 'a', 'y', ')', '"', '*', '\r', 'l', 'x', '\n', '6', 'd', '9', 'e', '7', 'n']


In [None]:
print(sentences[np.random.randint(len(sentences))])

there is,
i said; and bearing in mind our two suns or principles, imagine further
their corresponding worlds--one of the visible, the other of the
intelligible; you may assist your fancy by figuring the distinction
under the image of a line divided into two unequal parts, and may again
subdivide each part into two lesser segments representative of the
stages of knowledge in either sphere.


### Build the dataset
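* Use a random-length prefix of each sentence's last 128 characters as the input sequence.
* The character right after the prefix, at position |sequence|+1, will be the target.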

In [None]:
inputs, targets = [], []
maxlen = 128
for s in sentences:
  if len(s)<10: 
    continue
  txt = s[-maxlen:]
  r = np.random.randint(low=5, high=min(maxlen, len(txt)))
  inputs.append(txt[:r])
  targets.append(txt[r])

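# Sort so that character indices stay stable across sessions (set order is not),
# which matters when a saved checkpoint is reloaded later
V = sorted(set("".join(sentences)))
targets_v = sorted(set(targets))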
# Split to train, val and test
inputs_train, targets_train = inputs[:5000], targets[:5000] 
inputs_val, targets_val = inputs[5000:5500], targets[5000:5500]
inputs_test, targets_test = inputs[5500:], targets[5500:]

* Use the character indices as input/output.

In [None]:
def input_encode(text, V, maxlen):
  x = np.zeros(maxlen, dtype=int)
  # Assign an index to each input character
  for i, char in enumerate(text):
    if i<maxlen:
      x[i] = V.index(char) + 1 # Index 0 is used for padding
  return x

def output_encode(char, target_v):
  # The output is the index of the ground truth character
  o = target_v.index(char)
  return o
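
`V.index(char)` scans the vocabulary list for every character. A minimal sketch of an alternative, assuming a precomputed character-to-index dictionary, is given here; `build_char_index` and `encode_with_dict` are hypothetical helpers that are not part of the notebook, and the toy vocabulary is only for illustration.

```python
import numpy as np

def build_char_index(vocab):
    # Hypothetical helper: map each character to its list position + 1,
    # so that index 0 stays reserved for padding.
    return {ch: i + 1 for i, ch in enumerate(vocab)}

def encode_with_dict(text, char_to_idx, maxlen):
    # Same layout as input_encode: indices first, zero padding up to maxlen.
    x = np.zeros(maxlen, dtype=int)
    for i, ch in enumerate(text[:maxlen]):
        x[i] = char_to_idx[ch]
    return x

# Toy vocabulary (an assumption, not the notebook's V)
toy_vocab = ["a", "b", "c"]
char_to_idx = build_char_index(toy_vocab)
encoded = encode_with_dict("cab", char_to_idx, maxlen=5)
print(encoded)  # [3 1 2 0 0]
```

The two trailing zeros are padding, which the embedding layer is expected to ignore through its padding index.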

In [None]:
batch_size = 16

# Encode input and output data of train, val and test
encoded_inputs_train = [input_encode(s, V, maxlen) for s in inputs_train]
lengths_train = [min(len(s), maxlen) for s in inputs_train]
encoded_targets_train = [output_encode(t, targets_v) for t in targets_train]

encoded_inputs_val = [input_encode(s, V, maxlen) for s in inputs_val]
lengths_val = [min(len(s), maxlen) for s in inputs_val]
encoded_targets_val = [output_encode(t, targets_v) for t in targets_val]

encoded_inputs_test = [input_encode(s, V, maxlen) for s in inputs_test]
lengths_test = [min(len(s), maxlen) for s in inputs_test]
encoded_targets_test = [output_encode(t, targets_v) for t in targets_test]

* Build a generator

In [None]:
def generator(inputs, lengths, targets, batch_size):
  while True:
    # Reshuffle all instances before each pass over the data
    d = list(zip(inputs, lengths, targets))
    random.shuffle(d)
    inputs, lengths, targets = zip(*d)
    for i in range(0, len(inputs), batch_size):
      x_inputs, x_lengths, y_targets = list(), list(), list()
      # Collect the instances of the current batch
      for j in range(i, min(len(inputs), i + batch_size)):
        x_inputs.append(inputs[j])
        x_lengths.append(lengths[j])
        y_targets.append(targets[j])

      yield torch.LongTensor(x_inputs), torch.LongTensor(x_lengths), torch.tensor(y_targets)

In [None]:
train_generator = generator(encoded_inputs_train, lengths_train, encoded_targets_train, batch_size)
val_generator = generator(encoded_inputs_val, lengths_val, encoded_targets_val, batch_size)
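
The same batching could probably also be expressed with PyTorch's built-in `TensorDataset` and `DataLoader`. The sketch below uses toy tensors as stand-ins for the encoded inputs, lengths and targets (the `toy_*` names and sizes are assumptions). Unlike the generator above, a `DataLoader` stops after one pass over the data, so it would be iterated once per epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the encoded inputs, their lengths and the targets
toy_inputs = torch.zeros(40, 128, dtype=torch.long)
toy_lengths = torch.randint(low=5, high=128, size=(40,))
toy_targets = torch.randint(low=0, high=30, size=(40,))

# Each dataset item is an (input, length, target) triple
dataset = TensorDataset(toy_inputs, toy_lengths, toy_targets)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# One pass over the data: 40 instances give batches of 16, 16 and 8
for x, lengths, y in loader:
    print(x.shape, lengths.shape, y.shape)
```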

### Build the model

In [None]:
class RNN(nn.Module):

  def __init__(self, vocab_size, num_output, embed_size=200, hidden_size=128,
                embedding_tensor=None, padding_index=0, num_layers=1, 
                dropout=0):
    super().__init__()
    self.hidden = hidden_size
    self.dropout = dropout
    self.num_output = num_output
    self.num_layers = num_layers

    # Define the layers in our architecture
    self.embedding_layer = nn.Embedding(vocab_size, embed_size, 
                      padding_idx=padding_index, _weight=embedding_tensor)
    self.drop_en = nn.Dropout(self.dropout)
    self.rnn = nn.GRU(input_size=embed_size, 
                      hidden_size=self.hidden, 
                      num_layers=self.num_layers, 
                      batch_first=True)
    self.fc = nn.Linear(self.hidden, self.num_output)

  def forward(self, x, seq_lengths):
    # Pass the input through the embedding layer
    text_embed = self.embedding_layer(x)
    # Apply dropout
    x_embed = self.drop_en(text_embed)

    # Pass the inputs to the GRU
    packed_input = pack_padded_sequence(x_embed, seq_lengths, batch_first=True, 
                                        enforce_sorted=False)
    packed_output, ht = self.rnn(packed_input)
    # Get the hidden states of all time steps
    out_rnn, lengths = pad_packed_sequence(packed_output, batch_first=True)
    # Apply dropout
    out_rnn = self.drop_en(out_rnn)

    # Get the last hidden state as sentence representation
    row_indices = torch.arange(0, x.size(0)).long()
    col_indices = seq_lengths - 1
    last_hidden_state = out_rnn[row_indices, col_indices, :]
      
    # Feed the representation to the classifier and return its output
    out = self.fc(last_hidden_state)  # shape: (batch, num_output)
    return out

In [None]:
model = RNN(vocab_size=len(V)+1, num_output=len(targets_v), dropout=0.2)
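
To illustrate the "last hidden state" step in `forward`, here is a minimal sketch on a toy batch (all sizes are assumptions). It checks that, without dropout, indexing the padded outputs at `length - 1` gives the same vector as the final hidden state returned by the GRU for a packed sequence, and that padded positions come back as zeros.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)        # batch of 2, padded to 5 time steps
lengths = torch.tensor([3, 5])  # true lengths (kept on the CPU)

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)  # (2, 5, 3)

# Pick, for each sequence, the output at its last real time step
last_from_out = out[torch.arange(x.size(0)), lengths - 1, :]

print(torch.allclose(last_from_out, h_n[-1]))  # same vectors, two routes
print(out[0, 3:])                              # padded steps of the short sequence
```

In the model above, dropout is applied to the outputs before indexing, so the two routes only coincide in eval mode or with `dropout=0`.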

### Training

In [None]:
from tqdm.notebook import tqdm
from sklearn.metrics import f1_score

# Define optimizer and loss
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train and validate at the epoch's end, keep the best (based on val f1)
epochs, highest_val_f1 = 20, 0

for idx in tqdm(range(epochs), desc="Epoch"):
  epoch = idx+1
  #Switch to train mode
  model.train()
  for batch in tqdm(range(ceil(len(inputs_train)/batch_size)), desc="Iteration"):
    input_t, lengths_t, target_t = next(train_generator)
    output = model(input_t,lengths_t)
    loss = criterion(output,target_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
  #Switch to eval mode
  model.eval()
  val_loss = []
  val_targets = []
  val_outputs = []
  # No gradients are needed for validation
  with torch.no_grad():
    for i in range(ceil(len(inputs_val)/batch_size)):
      input_t, lengths_t, target_t = next(val_generator)
      output = model(input_t,lengths_t)
      val_outputs.append(torch.argmax(output, dim=1))
      val_targets.append(target_t)
      val_loss.append(criterion(output,target_t).item())
  val_outputs = torch.cat(val_outputs)
  val_targets = torch.cat(val_targets)        
  f1 = f1_score(val_targets.cpu().numpy(), val_outputs.cpu().detach().numpy(), 
                average="macro")
  print(f"EPOCH: {epoch} val loss: {sum(val_loss)/len(val_loss):.4f}, val f1: {f1:.3f}")
  if f1 > highest_val_f1:
    print("Save model....")
    torch.save({'model_state_dict': model.state_dict()}, "pytorch_model.bin")
    highest_val_f1 = f1

EPOCH: 1 val loss: 2.3634, val f1: 0.160
Save model....
EPOCH: 2 val loss: 2.1350, val f1: 0.236
Save model....
EPOCH: 3 val loss: 2.0989, val f1: 0.260
Save model....
EPOCH: 4 val loss: 2.0473, val f1: 0.275
Save model....
EPOCH: 5 val loss: 2.0015, val f1: 0.318
Save model....
EPOCH: 6 val loss: 1.9800, val f1: 0.314
EPOCH: 7 val loss: 1.9618, val f1: 0.289
EPOCH: 8 val loss: 1.9751, val f1: 0.289
EPOCH: 9 val loss: 1.9497, val f1: 0.280
EPOCH: 10 val loss: 1.9509, val f1: 0.297
EPOCH: 11 val loss: 1.9573, val f1: 0.290
EPOCH: 12 val loss: 1.9981, val f1: 0.286
EPOCH: 13 val loss: 2.0239, val f1: 0.303
EPOCH: 14 val loss: 1.9712, val f1: 0.315
EPOCH: 15 val loss: 1.9818, val f1: 0.307
EPOCH: 16 val loss: 1.9823, val f1: 0.306
EPOCH: 17 val loss: 2.0190, val f1: 0.311
EPOCH: 18 val loss: 1.9900, val f1: 0.317
EPOCH: 19 val loss: 1.9763, val f1: 0.334
Save model....
EPOCH: 20 val loss: 2.0873, val f1: 0.334



* Load the best checkpoint 

In [None]:
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model_e = RNN(vocab_size=len(V)+1, num_output=len(targets_v))
model_e.load_state_dict(checkpoint['model_state_dict'])

<All keys matched successfully>

* Infer some characters for a test text to see how it works.

In [None]:
model_e.eval()
x = 11
prompt = inputs_test[x]
text = prompt[:10]
with torch.no_grad():
  for i in range(50):
    encoded_text = np.expand_dims(input_encode(text, V, maxlen), 0)
    # Greedy decoding: take the character with the largest score as the next character
    scores = model_e(torch.LongTensor(encoded_text), torch.LongTensor([len(text)]))
    predicted = targets_v[scores.argmax().item()]
    print(f"{text} --> {predicted}")
    # Add the predicted character to the input
    text = text+predicted

efited by  --> r
efited by r --> e
efited by re --> a
efited by rea --> c
efited by reac --> t
efited by react --> i
efited by reacti --> n
efited by reactin --> g
efited by reacting -->  
efited by reacting  --> o
efited by reacting o --> f
efited by reacting of -->  
efited by reacting of  --> t
efited by reacting of t --> h
efited by reacting of th --> e
efited by reacting of the -->  
efited by reacting of the  --> m
efited by reacting of the m --> u
efited by reacting of the mu --> s
efited by reacting of the mus --> t
efited by reacting of the must -->  
efited by reacting of the must  --> t
efited by reacting of the must t --> h
efited by reacting of the must th --> e
efited by reacting of the must the -->  
efited by reacting of the must the  --> m
efited by reacting of the must the m --> u
efited by reacting of the must the mu --> s
efited by reacting of the must the mus --> t
efited by reacting of the must the must -->  
efited by reacting of the must the must  --> t
efited b
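
The greedy output above ends up repeating "the must". One common way to reduce such loops is to sample the next character from the predicted distribution instead of always taking the argmax. The sketch below is only an illustration: `sample_next` is a hypothetical helper that is not part of the notebook, and the toy logits stand in for the model's output scores.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0):
    # logits: 1-D tensor of unnormalised scores, one per candidate character.
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

torch.manual_seed(0)
toy_logits = torch.tensor([0.1, 2.0, 0.3])  # stand-in for one row of model scores

# A very low temperature behaves like greedy decoding (always index 1 here)
sharp = sample_next(toy_logits, temperature=1e-3)
# Temperature 1.0 samples from the unmodified softmax distribution
draws = [sample_next(toy_logits, temperature=1.0) for _ in range(5)]
print(sharp, draws)
```

In the decoding loop, the returned index would be used in place of the argmax when looking up the character in `targets_v`.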

# Missing parts (try to improve it)
* Improve the decoding: We used an RNN encoder and simply fed the sentence representation to a classifier to predict the next character. Use an RNN decoder instead to generate the next characters of the sentence.
* Add attention: Compute self-attention over the encoder states and feed the resulting attention vector to the decoder. Remember to mask the padded positions.
* Bi-direction: Use a bi-directional encoder and also use bi-directional context.
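
As a starting point for the attention and bi-direction items, here is a minimal sketch of a bi-directional GRU encoder followed by masked attention pooling. It uses a simple learned scoring layer rather than full self-attention, and it has no decoder. `AttentiveBiGRU`, its layer names and all sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class AttentiveBiGRU(nn.Module):
    # Hypothetical module: names and sizes are illustrative, not from the notebook.
    def __init__(self, vocab_size, num_output, embed_size=32, hidden_size=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True,
                          bidirectional=True)
        self.scorer = nn.Linear(2 * hidden_size, 1)  # one score per time step
        self.fc = nn.Linear(2 * hidden_size, num_output)

    def forward(self, x, seq_lengths):
        packed = pack_padded_sequence(self.embedding(x), seq_lengths,
                                      batch_first=True, enforce_sorted=False)
        packed_out, _ = self.rnn(packed)
        out, _ = pad_packed_sequence(packed_out, batch_first=True)  # (B, T, 2H)
        # Mask: True at real time steps, False at padded ones
        steps = torch.arange(out.size(1)).unsqueeze(0)              # (1, T)
        mask = steps < seq_lengths.unsqueeze(1)                     # (B, T)
        scores = self.scorer(out).squeeze(-1)                       # (B, T)
        # Padded steps get -inf, hence zero weight after the softmax
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)                      # (B, T)
        # Weighted sum of the hidden states gives the sentence representation
        context = (weights.unsqueeze(-1) * out).sum(dim=1)          # (B, 2H)
        return self.fc(context), weights

torch.manual_seed(0)
toy_model = AttentiveBiGRU(vocab_size=10, num_output=4)
toy_x = torch.tensor([[1, 2, 3, 0, 0], [4, 5, 6, 7, 8]])  # 0 is padding
toy_lengths = torch.tensor([3, 5])
toy_logits, toy_weights = toy_model(toy_x, toy_lengths)
print(toy_logits.shape, toy_weights[0])
```

The attention weights of the first toy sequence are zero at its two padded positions, which is the effect the masking is meant to have.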