In [1]:
%load_ext autoreload
%autoreload 2

# Natural Language Processing

⚡ NLP is a very active field at the moment: large new architectures (Transformers) and training techniques (language modeling on large unlabeled datasets) are improving the state of the art (SOTA) by a large margin on almost every task. See, for example, https://openai.com/blog/openai-api/. Some applications include:

- Language comprehension (virtual assistants such as Siri, Alexa, …) 
- Machine translation (Google Translate, …)
- Text generation (language modeling, summarization, question answering)
- Text classification (sentiment analysis, identifying hate speech on social media, …)
- Text-to-speech (generate audio from text) and Speech-to-text

## CharRNN

Let's generate Shakespearean text using a character RNN (*CharRNN*, https://github.com/karpathy/char-rnn). A CharRNN is able to generate new text, one character at a time.

In [2]:
# download dataset

import wget

url = "https://mymldatasets.s3.eu-de.cloud-object-storage.appdomain.cloud/shakespeare.txt"
wget.download(url)

100% [..........................................................................] 1115394 / 1115394

'shakespeare (3).txt'

In [3]:
# load file

# the context manager closes the file once it has been read
with open("shakespeare.txt", "r") as f:
  text = f.read()
text[:100], len(text)

('First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou',
 1115394)

### Tokenization

First, we need to encode every character in the text as an integer. This process is called **tokenization**.

In [4]:
import string

class Tokenizer():
  def __init__(self):
    self.all_characters = string.printable
    self.n_characters = len(self.all_characters)

  def text_to_seq(self, text):
    # map each character to its index in string.printable
    return [self.all_characters.index(c) for c in text]

  def seq_to_text(self, seq):
    # map each index back to its character
    return ''.join(self.all_characters[i] for i in seq)

tokenizer = Tokenizer()

In [5]:
tokenizer.all_characters

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [6]:
tokenizer.n_characters

100

In [7]:
tokenizer.text_to_seq('abcDEF')

[10, 11, 12, 39, 40, 41]

In [8]:
tokenizer.seq_to_text([10, 11, 12])

'abc'

In [9]:
text_encoded = tokenizer.text_to_seq(text)
text_encoded[:10]

[41, 18, 27, 28, 29, 94, 38, 18, 29, 18]

In [10]:
text[:10]

'First Citi'

### Dataset

Let's split the dataset into training and test sets.

In [11]:
train_size = len(text_encoded) * 90 // 100 # keep 90% for training
train = text_encoded[:train_size]
test = text_encoded[train_size:] # remaining 10% for testing

To train an RNN we need to chop the long sequence of text into multiple windows.

In [12]:
import random

def windows(text, window_size = 100):
    start_index = 0
    end_index = len(text) - window_size
    text_windows = []
    while start_index < end_index:
      text_windows.append(text[start_index:start_index+window_size+1])
      start_index += 1
    return text_windows

text_windows = windows(text)

In [13]:
text_windows[:3]

['First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou ',
 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou a',
 'rst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou ar']
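To make the windowing concrete, here is a minimal sketch on a toy string. The helper `toy_windows` is a hypothetical list-comprehension equivalent of `windows` above, and the window size of 3 is chosen only for illustration. Each window holds `window_size + 1` characters: the first `window_size` act as the input and the last one as the target to predict.

```python
def toy_windows(text, window_size=3):
    # Hypothetical stand-in for `windows`: one window per start index,
    # each holding window_size input characters plus 1 target character.
    return [text[i:i + window_size + 1] for i in range(len(text) - window_size)]

toy = toy_windows("hello!")
for w in toy:
    # everything but the last character is the input, the last one is the target
    print(repr(w[:-1]), "->", repr(w[-1]))
```

Because consecutive windows overlap in all but one character, the number of windows is close to the number of characters in the text.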

In [14]:
import torch

class CharRNNDataset(torch.utils.data.Dataset):
  def __init__(self, text_encoded_windows, train=True):
    self.text = text_encoded_windows
    self.train = train

  def __len__(self):
    return len(self.text)

  def __getitem__(self, ix):
    if self.train:
      return torch.tensor(self.text[ix][:-1]), torch.tensor(self.text[ix][-1])
    return torch.tensor(self.text[ix])

In [15]:
text_encoded_windows = windows(train)

# build the dataset and dataloader from all the training windows
dataset = CharRNNDataset(text_encoded_windows)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, pin_memory=True)

In [16]:
input, output = dataset[10]
tokenizer.seq_to_text(input)

'zen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all r'

In [17]:
tokenizer.seq_to_text([output])

'e'

### Embeddings

Since each character is a category, we need to transform the inputs into either one-hot vectors or embeddings. Here, we will use embeddings.

An embedding is a trainable dense vector that represents a category. The number of dimensions of the embedding is a hyperparameter that can be tweaked. Since embeddings are trainable, they improve during training, grouping similar categories together. The better the representation, the easier it is for the neural network to make accurate predictions. This is also called *representation learning*. When applied to NLP tasks, we generally talk about *word embeddings*, which can produce results such as the following:

![](https://blog.enzymeadvisinggroup.com/hs-fs/hubfs/Word%20Embeddings%20en%20el%20Natural%20Language%20Processing.png?width=1505&name=Word%20Embeddings%20en%20el%20Natural%20Language%20Processing.png)
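As a small illustration of the relation between the two encodings, the sketch below looks up three character ids in a `torch.nn.Embedding` and checks that the result matches multiplying one-hot vectors by the embedding weight matrix. The embedding size of 4 and the fixed seed are assumptions made only to keep the example small and repeatable.

```python
import torch

torch.manual_seed(0)
vocab_size, embedding_size = 100, 4  # 4 dimensions only to keep things small

embedding = torch.nn.Embedding(vocab_size, embedding_size)
ids = torch.tensor([10, 11, 12])  # 'abc' under the tokenizer above

dense = embedding(ids)  # shape (3, 4): one trainable row per character

# The same lookup expressed as one-hot vectors times the weight matrix.
one_hot = torch.nn.functional.one_hot(ids, num_classes=vocab_size).float()
via_one_hot = one_hot @ embedding.weight

print(dense.shape)
print(torch.allclose(dense, via_one_hot))
```

In other words, an embedding layer behaves like a linear layer applied to one-hot inputs, but it is implemented as a direct row lookup, which avoids building the sparse one-hot vectors.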

In [18]:
class CharRNN(torch.nn.Module):
  def __init__(self, n_out=100, dropout=0.2, input_size=100, embedding_size=100, hidden_size=128):
    super().__init__()
    self.encoder = torch.nn.Embedding(input_size, embedding_size)
    self.rnn = torch.nn.GRU(input_size=embedding_size, hidden_size=hidden_size, num_layers=2, dropout=dropout, batch_first=True)
    self.fc = torch.nn.Linear(hidden_size, n_out)

  def forward(self, x):
    x = self.encoder(x)
    x, h = self.rnn(x)         
    y = self.fc(x[:,-1,:])
    return y
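To follow how the tensor shapes change through `forward`, here is a minimal sketch that rebuilds the same three layers outside the class and feeds them a batch of random character ids. The batch of 64 windows of length 100 mirrors the dataloader above; the random input is only a placeholder for real data.

```python
import torch

batch_size, seq_len = 64, 100
x = torch.randint(0, 100, (batch_size, seq_len))  # random character ids

encoder = torch.nn.Embedding(100, 100)
rnn = torch.nn.GRU(input_size=100, hidden_size=128, num_layers=2, batch_first=True)
fc = torch.nn.Linear(128, 100)

e = encoder(x)         # (64, 100, 100): one embedding per character
out, h = rnn(e)        # out: (64, 100, 128), h: (2, 64, 128)
y = fc(out[:, -1, :])  # (64, 100): scores for the next character
print(e.shape, out.shape, h.shape, y.shape)
```

Only the output at the last time step is passed to the linear layer, so the network produces one prediction per window rather than one per character.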

### Training

In [None]:
from src import CharRNNModel

model = CharRNNModel(CharRNN())

model.compile(optimizer = torch.optim.Adam(model.net.parameters()),
              loss = torch.nn.CrossEntropyLoss())

hist = model.fit(dataloader, epochs=5)
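`CharRNNModel` comes from the notebook's `src` module, which is not shown here. As a rough idea of what a `fit` call of this kind typically does, here is a minimal sketch of a plain PyTorch training loop. The small model `TinyCharRNN`, the random data and all hyperparameters are assumptions made for illustration; this is not the implementation in `src`.

```python
import torch

torch.manual_seed(0)

class TinyCharRNN(torch.nn.Module):
    # Hypothetical, smaller stand-in for the CharRNN defined above.
    def __init__(self, vocab_size=100, embedding_size=16, hidden_size=32):
        super().__init__()
        self.encoder = torch.nn.Embedding(vocab_size, embedding_size)
        self.rnn = torch.nn.GRU(embedding_size, hidden_size, batch_first=True)
        self.fc = torch.nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        x, _ = self.rnn(self.encoder(x))
        return self.fc(x[:, -1, :])  # scores for the next character

# Random stand-in data: 256 windows of 20 character ids, one target each.
X = torch.randint(0, 100, (256, 20))
y = torch.randint(0, 100, (256,))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=64, shuffle=True)

net = TinyCharRNN()
optimizer = torch.optim.Adam(net.parameters())
criterion = torch.nn.CrossEntropyLoss()

losses = []
for epoch in range(2):
    net.train()
    for xb, yb in loader:
        optimizer.zero_grad()          # reset gradients from the previous step
        loss = criterion(net(xb), yb)  # cross-entropy on next-character scores
        loss.backward()                # backpropagate
        optimizer.step()               # update the weights
        losses.append(loss.item())
    print(f"epoch {epoch + 1}: last batch loss {losses[-1]:.3f}")
```

`CrossEntropyLoss` expects raw scores (logits) and integer class targets, which matches the `(input window, next character)` pairs produced by `CharRNNDataset`.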




### Generate text

In [None]:
text[-200:]

In [None]:
X_new = "With eyes wide open; standing, speaking, movin"
X_new_encoded = tokenizer.text_to_seq(X_new)
y_pred = model.predict(X_new_encoded)
y_pred = torch.argmax(y_pred, axis=1)[0].item()
tokenizer.seq_to_text([y_pred])

Now we can start generating text. We could start with a seed and generate text recursively, appending each predicted character to the input. Always picking the most likely character (greedy decoding), however, tends to produce repetitive text.

In [None]:
X_new = "With eyes wide open; standing, speaking, movin"

for i in range(1000):
  X_new_encoded = tokenizer.text_to_seq(X_new[-100:])
  y_pred = model.predict(X_new_encoded)
  y_pred = torch.argmax(y_pred, axis=1)[0].item()
  X_new += tokenizer.seq_to_text([y_pred])

X_new

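Instead, we can sample the next character at random, according to the probabilities estimated by the model. Dividing the model outputs by a temperature `temp` before exponentiating controls the randomness: lower values make the text more conservative, higher values make it more diverse. This generates more varied and interesting text.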

In [None]:
X_new = "With eyes wide open; standing, speaking, movin"

temp = 0.5 # sampling temperature
for i in range(1000):
  X_new_encoded = tokenizer.text_to_seq(X_new[-100:])
  y_pred = model.predict(X_new_encoded)
  y_pred = y_pred.view(-1).div(temp).exp()
  top_i = torch.multinomial(y_pred, 1)[0]
  predicted_char = tokenizer.all_characters[top_i]
  X_new += predicted_char

print(X_new)
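To see what the temperature does, the sketch below applies the same `exp(score / temp)` weighting used in the sampling loop above to a few made-up scores and normalizes the result into probabilities. It assumes that `model.predict` returns unnormalized scores (logits); the helper `softmax_with_temperature` and the example values are hypothetical and written in plain Python for clarity.

```python
import math

def softmax_with_temperature(logits, temp):
    # Hypothetical helper: exp(logit / temp), normalized to sum to 1.
    # The maximum is subtracted first for numerical stability; it cancels out.
    scaled = [l / temp for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three characters
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the highest-scoring character, while a temperature above 1 flattens the distribution. The loop above skips the explicit normalization because `torch.multinomial` accepts non-negative weights that do not sum to one.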