<a href="https://colab.research.google.com/github/gvogiatzis/CS4740/blob/main/CS4740_Lab_Week_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating text with Recurrent Neural Networks


In this lab we will use a character-level RNN as a generator of text. In the process we will find out about the key API elements in `pytorch` that concern RNNs. First of all, we need to import all the necessary libraries.

In [None]:
import re
import csv
from textblob import Word
import numpy as np

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from more_itertools import sliced
from torch.utils.tensorboard import SummaryWriter
%load_ext tensorboard

For the purposes of this lab we will be using a well-known, publicly available dataset consisting of news articles from BBC News. The articles each fall under one of five classes, and can be used for document classification. Here we will only be using the raw text of all the news stories, combined.

In [None]:
! wget https://github.com/suraj-deshmukh/BBC-Dataset-News-Classification/raw/master/dataset/dataset.csv -O dataset.csv

The following snippet opens the CSV file and, from every row, takes the "news" field, which contains the news story as a string. All these news stories are then concatenated using the `join` method. We will only be using the first one million characters, just to keep things manageable in the space of a single lab session. Feel free to experiment with the full dataset.

In [None]:
with open('dataset.csv', newline='', encoding = "ISO-8859-1") as csvfile:
    reader = csv.DictReader(csvfile)
    all_text = "".join(row['news'] for row in reader)

all_text = all_text[:1000000]

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

As is commonly the case when dealing with natural language, we must first turn individual characters (or words if we have a word-level RNN) into indices (i.e. integer numbers). The best way to do this is using a python dictionary. In fact we will define python dictionaries that map an index to a character (`itoc`) as well as a character to its index (`ctoi`).

In [None]:
all_chars = sorted(set(all_text))
num_of_characters = len(all_chars)
itoc = {i:c for i,c in enumerate(all_chars)}
ctoi = {c:i for i,c in enumerate(all_chars)}
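To see how the two dictionaries round-trip, here is a minimal sketch on a toy string. The names `sample_text`, `sample_itoc`, `sample_ctoi`, `encode` and `decode` are hypothetical and introduced only for this illustration; the lab itself builds `itoc` and `ctoi` from `all_text`.

```python
# Toy stand-in for all_text (hypothetical; the lab uses the BBC corpus).
sample_text = "hello world"

# Same construction as above, applied to the toy text.
sample_chars = sorted(set(sample_text))
sample_itoc = {i: c for i, c in enumerate(sample_chars)}
sample_ctoi = {c: i for i, c in enumerate(sample_chars)}

def encode(text, ctoi):
    """Map each character to its integer index."""
    return [ctoi[c] for c in text]

def decode(indices, itoc):
    """Map integer indices back to a string."""
    return "".join(itoc[i] for i in indices)

encoded = encode("hello", sample_ctoi)
print(encoded)                       # [3, 2, 4, 4, 5]
print(decode(encoded, sample_itoc))  # hello
```

Note that a character that never appears in the source text has no entry in the dictionary, so encoding it raises a `KeyError`.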

Now we can easily convert a sequence of characters into a sequence of numbers (indices). However, as is usually the case with categorical variables (i.e. those that can take one of a finite number of possible values), the indices must be converted into vectors (e.g. one-hot encodings) before we can process them further. The fundamental reason is that integer indices have a natural ordering: index 2 is, for example, 'close' to index 3. If we would like our algorithm to take that ordering into account, then integer encodings are fine. Usually, however, we don't want to assume any ordering, and in those cases we use vector representations.

In fact we will go one step further than simple one-hot encodings. We will define a general, learnable vector for each of the characters. This is easily achieved using what is known as an embedding layer.

Consider just for illustration's sake a very simple 4-dimensional embedding that maps from a character index to a 4D vector. This is defined as follows:

In [None]:
embedding = nn.Embedding(num_of_characters, 4)

This is a structure that produces a different 4-d vector for each of the different characters in our character set.

In [None]:
embedding(torch.tensor(ctoi['s']))
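One way to relate an embedding layer to one-hot encodings is that looking up an index returns the same vector as multiplying the one-hot encoding of that index by the layer's weight matrix. The sketch below checks this on toy sizes; `toy_embedding`, `vocab_size` and `embed_dim` are hypothetical names chosen for the illustration and are not part of the lab code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 5, 4           # toy sizes, chosen for illustration
toy_embedding = nn.Embedding(vocab_size, embed_dim)

idx = torch.tensor([2, 0, 2])          # three character indices
looked_up = toy_embedding(idx)         # shape (3, 4)

# Same result via an explicit one-hot encoding and a matrix product.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()   # shape (3, 5)
via_matmul = one_hot @ toy_embedding.weight                # shape (3, 4)

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids building the large, mostly-zero one-hot matrix, which is why the embedding layer takes integer indices directly.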

We are now ready to define a network that will predict the next character given some text. This is essentially a classification task, with one class per character in our character set. Because we want our network to handle input sequences of variable length, we will use a recurrent network in the middle of the computation. Here we choose a GRU, but an LSTM would also work.

The structure of the predictor network is as follows:

input char sequence -> embedding -> GRU -> fully connected layer -> softmax

As we will be using a `CrossEntropyLoss` loss function, we can omit the softmax layer in the end. The code looks like this:

In [None]:
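class NextCharPredictor(nn.Module):
    def __init__(self, charset_size, embed_size=100, hidden_dim=512*2):
        super().__init__()
        self.charset_size = charset_size
        # Learnable vector for each character index
        self.embedding = nn.Embedding(charset_size, embed_size)
        # The recurrent layer is a GRU; the attribute keeps the name `lstm`
        self.lstm = nn.GRU(input_size=embed_size,
                           hidden_size=hidden_dim,
                           batch_first=True)
        # Maps each hidden state to one score (logit) per character
        self.fc = nn.Linear(hidden_dim, charset_size)

    def forward(self, x, batch_size=1):
        # (batch_size*seq_len,) -> (batch_size, seq_len, embed_size)
        x = self.embedding(x.view(batch_size, -1))
        x, _ = self.lstm(x)   # (batch_size, seq_len, hidden_dim)
        x = self.fc(x)        # (batch_size, seq_len, charset_size)
        # Flatten so that each row holds the logits for one position
        return x.view(-1, self.charset_size)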
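The softmax can be dropped because `nn.CrossEntropyLoss` expects raw scores (logits) and applies a log-softmax internally. The following sketch checks this on random toy data; the tensor sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 10)            # 6 predictions over a toy 10-character set
targets = torch.randint(0, 10, (6,))   # ground-truth character indices

# Loss computed directly from the logits.
loss_from_logits = nn.CrossEntropyLoss()(logits, targets)

# The same loss written out as log-softmax followed by negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(loss_from_logits, loss_manual))
```

Adding an explicit softmax layer before `CrossEntropyLoss` would therefore apply the normalisation twice.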

Note that RNNs in pytorch can handle mini-batches. If you use the `batch_first=True` flag, then the network expects inputs of shape:

batch_size x sequence_size x element_size


where `batch_size` is the number of sequences contained in your batch, `sequence_size` is the length of each sequence and `element_size` is the size of the vector that represents each sequence element (in our case a character embedding vector).
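The shape convention can be checked with a small stand-alone GRU. This is a sketch with toy sizes; the variable names below are chosen for the illustration and do not come from the lab code.

```python
import torch
import torch.nn as nn

batch_size, seq_len, element_size, hidden_dim = 3, 7, 5, 11   # toy sizes
gru = nn.GRU(input_size=element_size, hidden_size=hidden_dim, batch_first=True)

x = torch.randn(batch_size, seq_len, element_size)
out, h = gru(x)

print(out.shape)   # torch.Size([3, 7, 11]): one hidden vector per sequence element
print(h.shape)     # torch.Size([1, 3, 11]): final hidden state, layer dimension first
```

Note that `batch_first=True` only affects the input and the per-step output; the final hidden state `h` always has the layer dimension first. For a single-layer, unidirectional GRU, `h[0]` matches the output at the last time step, `out[:, -1]`.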

In [None]:
class RunningAverage:
    def __init__(self):
        self.n=0
        self.tot=0
    
    def add(self,x):
        self.n += 1
        self.tot += x
        
    def __call__(self):
        return self.tot/self.n

In [None]:
%tensorboard --logdir=runs

In [None]:
num_of_epochs = 50
seq_length = 100
batch_size = 64*2
embed_size = 100
hidden_dim = 512*2
# Each chunk holds one extra character so that the targets can be shifted by one
chunk_size = batch_size*seq_length + 1
net = NextCharPredictor(charset_size=num_of_characters, embed_size=embed_size, hidden_dim=hidden_dim).to(device)
loss = nn.CrossEntropyLoss()
optim = torch.optim.Adam(net.parameters(), lr=0.001)
net.train()
max_iter = len(all_text) // chunk_size

writer = SummaryWriter('runs/exp_1M_rnd_batch128')

for e in range(num_of_epochs):
    train_acc = RunningAverage()
    for i, txt in enumerate(sliced(all_text, chunk_size)):
        # Skip the final, incomplete chunk
        if len(txt) < chunk_size:
            break
        # Inputs are all characters but the last; targets are the same text shifted by one
        x = torch.tensor([ctoi[c] for c in txt[:-1]], device=device)
        t = torch.tensor([ctoi[c] for c in txt[1:]], device=device)
        optim.zero_grad()
        y = net(x, batch_size)
        L = loss(y, t)
        # Fraction of next characters predicted correctly in this batch
        acc = (y.argmax(dim=1) == t).sum().item()/(batch_size*seq_length)
        train_acc.add(acc)
        print(f"\rEpoch: {e}/{num_of_epochs} Iter: {i}/{max_iter}\tacc={100*acc:0.2f}%\tL={L.item():0.4f}", end="")
        L.backward()
        optim.step()
    # Log the average training accuracy of this epoch to TensorBoard
    writer.add_scalar('Accuracy', train_acc(), e)
    writer.flush()
    print(f"\rEpoch: {e}/{num_of_epochs} Average acc: {train_acc()}")
writer.close()
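The training loop above reads the text in chunks of `batch_size*seq_length+1` characters and derives inputs and targets from each chunk by shifting it one character. The pure-Python sketch below illustrates this on a tiny chunk; `toy_batch_size`, `toy_seq_length` and the `rows` helper are hypothetical names used only here, with `rows` mimicking the row-by-row reshape that `view(batch_size, -1)` performs.

```python
toy_batch_size, toy_seq_length = 2, 4
chunk = "abcdefghi"        # toy_batch_size * toy_seq_length + 1 characters

inputs = chunk[:-1]        # every character except the last
targets = chunk[1:]        # the same text shifted by one character

def rows(s, n):
    """Split s into consecutive pieces of length n, row by row."""
    return [s[i:i + n] for i in range(0, len(s), n)]

input_rows = rows(inputs, toy_seq_length)
target_rows = rows(targets, toy_seq_length)

print(input_rows)    # ['abcd', 'efgh']
print(target_rows)   # ['bcde', 'fghi']
```

Each target row is its input row shifted by one, so the character the network must predict at every position is simply the next character of the text. The extra character at the end of the chunk supplies the target for the final input position.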

In [None]:
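def generate_text(net, seed_txt, length=100):
    # Greedy decoding: always pick the most likely next character
    seed_lst_idx = [ctoi[c] for c in seed_txt]
    seed_idx = torch.tensor(seed_lst_idx, device=device)
    net.eval()
    generated_text = []
    with torch.no_grad():  # no gradients are needed at generation time
        # Run the whole seed through the network to obtain the hidden state h
        x = net.embedding(seed_idx.view(1, -1))
        x, h = net.lstm(x)
        x = net.fc(x)
        # The prediction at the last seed position is the first generated character
        x = x.argmax(dim=2).view(-1)[-1]
        generated_text.append(x.item())
        # Feed each predicted character back in, carrying h forward,
        # until `length` characters have been generated in total
        for i in range(length - 1):
            x = net.embedding(x.view(1, -1))
            x, h = net.lstm(x, h)
            x = net.fc(x)
            x = x.argmax(dim=2).view(-1)
            generated_text.append(x.item())
    return "".join(itoc[i] for i in seed_lst_idx + generated_text)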


In [None]:
print(generate_text(net, "The SEC's settlement with CAM and CFD included agreements with three other ex-managers", length=200))

In [None]:
idx=all_text.index("On Monday")
all_text[idx-100:idx+100]