[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensioai/dl/blob/master/nlp/nlp.ipynb)

# Natural Language Processing

⚡ NLP is a very active field: large new architectures (Transformers) and training techniques (language modeling on large unlabeled datasets) are delivering excellent results, improving the state of the art (SOTA) by a large margin on almost every task. See, for example, https://openai.com/blog/openai-api/. Some applications include:

- Language comprehension (virtual assistants such as Siri, Alexa, …) 
- Machine translation (Google Translate, …)
- Text generation (language modeling, summarization, question answering)
- Text classification (sentiment analysis, identifying hate speech on social media, ...)
- Text-to-speech (generate audio from text) and Speech-to-text

## CharRNN

Let's generate Shakespearean text using a character RNN (*CharRNN*, https://github.com/karpathy/char-rnn). A CharRNN is able to generate new text, one character at a time.

In [2]:
# download dataset

import wget

url = "https://mymldatasets.s3.eu-de.cloud-object-storage.appdomain.cloud/shakespeare.txt"
wget.download(url)

100% [..........................................................................] 1115394 / 1115394

'shakespeare (2).txt'

In [3]:
# load file

# use a context manager so the file is closed after reading
with open("shakespeare.txt", "r") as f:
    text = f.read()
text[:100], len(text)

('First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou',
 1115394)

### Tokenization

First, we need to encode every character in the text as an integer. This process is called **tokenization**.

In [4]:
import string

class Tokenizer():
  def __init__(self):
    self.all_characters = string.printable
    self.n_characters = len(self.all_characters)
  def text_to_seq(self, text):
    # map each character to its index in all_characters
    return [self.all_characters.index(c) for c in text]
  def seq_to_text(self, seq):
    # map each index back to its character
    return ''.join(self.all_characters[i] for i in seq)

tokenizer = Tokenizer()

In [5]:
tokenizer.all_characters

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [6]:
tokenizer.n_characters

100

In [7]:
tokenizer.text_to_seq('abcDEF')

[10, 11, 12, 39, 40, 41]

In [8]:
tokenizer.seq_to_text([10, 11, 12])

'abc'

In [9]:
text_encoded = tokenizer.text_to_seq(text)
text_encoded[:10]

[41, 18, 27, 28, 29, 94, 38, 18, 29, 18]

In [10]:
text[:10]

'First Citi'

### Dataset

Let's split the dataset into train and test sets.

In [11]:
train_size = len(text_encoded) * 90 // 100 # keep 90% for training
train = text_encoded[:train_size]
test = text_encoded[train_size:] # remaining 10% for testing

To train an RNN we need to chop the long sequence of text into multiple windows.
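
As a toy illustration of the windowing idea (the helper name `make_windows` and the six-element sequence are made up for this sketch), each window holds `window_size` inputs plus one extra element that serves as the target to predict:

```python
def make_windows(seq, window_size):
    # Each window holds window_size inputs plus 1 target element.
    return [seq[i:i + window_size + 1] for i in range(len(seq) - window_size)]

encoded = [0, 1, 2, 3, 4, 5]  # hypothetical encoded text
pairs = [(w[:-1], w[-1]) for w in make_windows(encoded, window_size=3)]

for inputs, target in pairs:
    print(inputs, "->", target)
# [0, 1, 2] -> 3
# [1, 2, 3] -> 4
# [2, 3, 4] -> 5
```

Consecutive windows overlap by all but one element, which is why a long text yields almost as many training samples as it has characters.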

In [12]:
import random

def windows(text, window_size = 100):
    start_index = 0
    end_index = len(text) - window_size
    text_windows = []
    while start_index < end_index:
      text_windows.append(text[start_index:start_index+window_size+1])
      start_index += 1
    return text_windows

text_windows = windows(text)

In [13]:
text_windows[:3]

['First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou ',
 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou a',
 'rst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou ar']

In [14]:
import torch

class CharRNNDataset(torch.utils.data.Dataset):
  def __init__(self, text_encoded_windows, train=True):
    self.text = text_encoded_windows
    self.train = train

  def __len__(self):
    return len(self.text)

  def __getitem__(self, ix):
    if self.train:
      return torch.tensor(self.text[ix][:-1]), torch.tensor(self.text[ix][-1])
    return torch.tensor(self.text[ix])

In [15]:
text_encoded_windows = windows(train)

# get a small subset
dataset = CharRNNDataset(text_encoded_windows[:20000])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, pin_memory=True)

In [16]:
input, output = dataset[10]
tokenizer.seq_to_text(input)

'zen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all r'

In [17]:
tokenizer.seq_to_text([output])

'e'

### Embeddings

Since each character is a category, we need to transform the inputs into either one-hot vectors or embeddings. Here, we will use embeddings.

An embedding is a trainable dense vector that represents a category. The number of dimensions of the embedding is a hyperparameter that can be tweaked. Since embeddings are trainable, they improve during training, and similar categories tend to end up close together. The better the representation, the easier it is for the neural network to make accurate predictions. This is also called *representation learning*. In NLP tasks we generally talk about *word embeddings*, which can capture relationships between words such as the ones shown below:

![](https://blog.enzymeadvisinggroup.com/hs-fs/hubfs/Word%20Embeddings%20en%20el%20Natural%20Language%20Processing.png?width=1505&name=Word%20Embeddings%20en%20el%20Natural%20Language%20Processing.png)
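
To make the lookup concrete, here is a minimal sketch with `torch.nn.Embedding`. The sizes (100 categories, 8 dimensions) and the index values are arbitrary choices for illustration, not the ones used by the model below:

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 100 possible characters, 8-dimensional embeddings.
embedding = torch.nn.Embedding(num_embeddings=100, embedding_dim=8)

# A batch of 2 sequences with 5 character indices each.
batch = torch.tensor([[10, 11, 12, 39, 40],
                      [41, 18, 27, 28, 29]])

# Each index is replaced by its embedding vector.
vectors = embedding(batch)
print(vectors.shape)  # torch.Size([2, 5, 8])

# The same index always maps to the same (trainable) vector.
same = torch.equal(embedding(torch.tensor(10)), vectors[0, 0])
print(same)  # True
```

The layer is just a trainable lookup table with one row per category, so its output adds one trailing dimension of size `embedding_dim` to the input shape.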

In [18]:
class CharRNN(torch.nn.Module):
  def __init__(self, n_out=100, dropout=0.2, input_size=100, embedding_size=100, hidden_size=128):
    super().__init__()
    self.encoder = torch.nn.Embedding(input_size, embedding_size)
    self.rnn = torch.nn.GRU(input_size=embedding_size, hidden_size=hidden_size, num_layers=2, dropout=dropout, batch_first=True)
    self.fc = torch.nn.Linear(hidden_size, n_out)

  def forward(self, x):
    x = self.encoder(x)
    x, h = self.rnn(x)         
    y = self.fc(x[:,-1,:])
    return y

### Training

In [19]:
from src import CharRNNModel

model = CharRNNModel(CharRNN())

model.compile(optimizer = torch.optim.Adam(model.net.parameters()),
              loss = torch.nn.CrossEntropyLoss())

hist = model.fit(dataloader, epochs=10)




### Generate text

In [20]:
text[-200:]

"ge repose, to be asleep\nWith eyes wide open; standing, speaking, moving,\nAnd yet so fast asleep.\n\nANTONIO:\nNoble Sebastian,\nThou let'st thy fortune sleep--die, rather; wink'st\nWhiles thou art waking.\n"

In [21]:
X_new = "With eyes wide open; standing, speaking, movin"
X_new_encoded = tokenizer.text_to_seq(X_new)
y_pred = model.predict(X_new_encoded)
y_pred = torch.argmax(y_pred, axis=1)[0].item()
tokenizer.seq_to_text([y_pred])

'g'

Now we can start generating fake text. We could start with a seed and then recursively generate text, adding the new characters to the existing ones. This approach, however, results in repetitive text.

In [22]:
X_new = "With eyes wide open; standing, speaking, movin"

for i in range(1000):
  X_new_encoded = tokenizer.text_to_seq(X_new[-100:])
  y_pred = model.predict(X_new_encoded)
  y_pred = torch.argmax(y_pred, axis=1)[0].item()
  X_new += tokenizer.seq_to_text([y_pred])

X_new

'With eyes wide open; standing, speaking, moving the poor poor most the poor poor most the poor poor most the poor poor poor most the poor poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor most the poor mo

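Instead, we can sample the next character at random according to the probabilities estimated by the model. Dividing the outputs by a *temperature* before sampling controls the diversity: lower values give more conservative text, higher values give more random text. This generates more diverse and interesting text.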
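
The effect of the temperature can be sketched in isolation. The logits below are invented for a four-character vocabulary, and `sample_with_temperature` is a hypothetical helper, not part of this notebook's `src` module:

```python
import torch

torch.manual_seed(0)

# Hypothetical logits for a 4-character vocabulary.
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])

def sample_with_temperature(logits, temperature=1.0):
    # Dividing by the temperature before the softmax sharpens (T < 1)
    # or flattens (T > 1) the resulting distribution.
    probs = torch.softmax(logits / temperature, dim=-1)
    # Draw one index according to those probabilities.
    index = torch.multinomial(probs, num_samples=1).item()
    return index, probs

index, sharp = sample_with_temperature(logits, temperature=0.5)
_, flat = sample_with_temperature(logits, temperature=2.0)
print(sharp)  # most of the mass on the first character
print(flat)   # mass spread more evenly
```

`torch.multinomial` also accepts unnormalized non-negative weights, so exponentiating the scaled logits without normalizing them, as the next cell does, samples from the same distribution.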

In [23]:
X_new = "With eyes wide open; standing, speaking, movin"

temp = 0.8 # temperature: lower values give more conservative text
for i in range(1000):
  X_new_encoded = tokenizer.text_to_seq(X_new[-100:])
  y_pred = model.predict(X_new_encoded)
  y_pred = y_pred.view(-1).div(temp).exp()
  top_i = torch.multinomial(y_pred, 1)[0]
  predicted_char = tokenizer.all_characters[top_i]
  X_new += predicted_char

print(X_new)

With eyes wide open; standing, speaking, moving
he veling to say
You.

First Citizen:
No' me loved one well you are they swall no, the senators.

All:
Cond I a prow shalf the repome,
Where shall and the did seven should if their are alpething
hatine a the sen'd madam; the most forth belly shall the quuskery asice
To proked for with he gods did the are to cour offite. Tull no plrasome that say it in plyut. He counsting but it, no tell sand not to the.

Second Citizen:
O, be pray loved well a dough reed, the shall you whish
affering to kin of for the prou, give he felly shall
As poor prettungry and Sepeationg make and me grainst to this a ganity
To comand well muth at suskence that madam, mushints have reshigh of your tould his mablito
kn one her midssuser diglad power himself aftelved.

First Senater: Cominius
Lets of must he hus, it madaus to that the shour but mie it him discused for receive his with arms.

VALERIA:
No, you was hem'st him; in madam
The disconess rather is in the city.

## Sentiment Analysis

We have built a character-level model, and now it's time to look at a word-level model to tackle a common NLP task: *sentiment analysis*. We will work with the IMDb review dataset, containing 50,000 movie reviews, to classify text into positive or negative reviews. This is also known as *text classification*.

In [24]:
import torch
import torchtext

TEXT = torchtext.data.Field(tokenize = 'spacy')
LABEL = torchtext.data.LabelField(dtype = torch.float)

train_data, test_data = torchtext.datasets.IMDB.splits(TEXT, LABEL)

In [25]:
len(train_data), len(test_data)

(25000, 25000)

In [26]:
print(vars(train_data.examples[0]))

{'text': ['Bromwell', 'High', 'is', 'a', 'cartoon', 'comedy', '.', 'It', 'ran', 'at', 'the', 'same', 'time', 'as', 'some', 'other', 'programs', 'about', 'school', 'life', ',', 'such', 'as', '"', 'Teachers', '"', '.', 'My', '35', 'years', 'in', 'the', 'teaching', 'profession', 'lead', 'me', 'to', 'believe', 'that', 'Bromwell', 'High', "'s", 'satire', 'is', 'much', 'closer', 'to', 'reality', 'than', 'is', '"', 'Teachers', '"', '.', 'The', 'scramble', 'to', 'survive', 'financially', ',', 'the', 'insightful', 'students', 'who', 'can', 'see', 'right', 'through', 'their', 'pathetic', 'teachers', "'", 'pomp', ',', 'the', 'pettiness', 'of', 'the', 'whole', 'situation', ',', 'all', 'remind', 'me', 'of', 'the', 'schools', 'I', 'knew', 'and', 'their', 'students', '.', 'When', 'I', 'saw', 'the', 'episode', 'in', 'which', 'a', 'student', 'repeatedly', 'tried', 'to', 'burn', 'down', 'the', 'school', ',', 'I', 'immediately', 'recalled', '.........', 'at', '..........', 'High', '.', 'A', 'classic', 'l

In [27]:
train_data, valid_data = train_data.split()
len(train_data), len(valid_data)

(17500, 7500)

Just as we used a tokenizer in the previous section to assign a different label to each character, here we need to build a vocabulary where each word is assigned a unique label. However, there are over 100,000 different words in the training set, which can be problematic (especially if we are using one-hot encoding). An alternative is to keep only the most frequent words and replace the rest with an unknown token.
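
The idea can be sketched with the standard library alone. The toy corpus and the names `itos`, `stoi` and `encode` are illustrative assumptions (torchtext builds an equivalent mapping for us in the next cell):

```python
from collections import Counter

# Toy corpus, already tokenized (hypothetical data).
corpus = [["the", "film", "is", "great"],
          ["the", "film", "is", "terrible"],
          ["the", "plot", "is", "thin"]]

max_size = 3  # keep only the 3 most frequent words

# Count every word, then keep the most frequent ones after the special tokens.
counts = Counter(word for sentence in corpus for word in sentence)
itos = ["<unk>", "<pad>"] + [w for w, _ in counts.most_common(max_size)]
stoi = {w: i for i, w in enumerate(itos)}

def encode(sentence):
    # Words outside the vocabulary map to the index of <unk>.
    return [stoi.get(w, stoi["<unk>"]) for w in sentence]

print(itos)                                     # ['<unk>', '<pad>', 'the', 'is', 'film']
print(encode(["the", "plot", "is", "great"]))   # [2, 0, 3, 0]
```

Rare words such as "plot" and "great" collapse into the same index, which keeps the vocabulary (and the embedding table) small at the cost of losing some information.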

In [28]:
MAX_VOCAB_SIZE = 10000
TEXT.build_vocab(train_data, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

In [29]:
len(TEXT.vocab), len(LABEL.vocab)

(10002, 2)

We have two extra tokens: *unk* and *pad*. The *pad* token is used to ensure that all the sentences have the same length.
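
A minimal sketch of what padding does, using `torch.nn.utils.rnn.pad_sequence`. The token indices are made up and the pad index of 1 is an assumption for this example; in this notebook the batching itself is handled by torchtext:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed index of the pad token

# Three encoded sentences of different lengths (hypothetical indices).
sequences = [torch.tensor([5, 8, 2, 9]),
             torch.tensor([7, 3]),
             torch.tensor([4, 6, 2])]

# Shorter sequences are filled with PAD_IDX up to the longest length.
# The default layout is [sent len, batch size].
batch = pad_sequence(sequences, padding_value=PAD_IDX)
print(batch.shape)  # torch.Size([4, 3])
print(batch[:, 1])  # tensor([7, 3, 1, 1])
```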

In [30]:
TEXT.vocab.freqs.most_common(10)

[('the', 203562),
 (',', 193222),
 ('.', 166197),
 ('and', 110106),
 ('a', 110092),
 ('of', 101336),
 ('to', 94366),
 ('is', 76614),
 ('in', 61923),
 ('I', 54195)]

In [31]:
TEXT.vocab.itos[:10]

['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is']

In [32]:
LABEL.vocab.stoi

defaultdict(None, {'neg': 0, 'pos': 1})

In [33]:
device = "cuda" if torch.cuda.is_available() else "cpu"

dataloader = {
    'train': torchtext.data.BucketIterator(train_data, batch_size=64, sort_within_batch=True, device=device),
    'val': torchtext.data.BucketIterator(valid_data, batch_size=64, device=device),
    'test': torchtext.data.BucketIterator(test_data, batch_size=64, device=device)
}

### Training

In [34]:
class Metric():
  def __init__(self):
    self.name = "acc"
  
  def call(self, outputs, targets):
    rounded_preds = torch.round(torch.sigmoid(outputs))
    correct = (rounded_preds == targets).float() 
    acc = correct.sum().item() / len(correct)
    return acc

In [35]:
class RNN(torch.nn.Module):
    def __init__(self, input_dim, embedding_dim=128, hidden_dim=128, output_dim=1, dropout=0.2):
        super().__init__()
        self.embedding = torch.nn.Embedding(input_dim, embedding_dim)
        self.rnn = torch.nn.GRU(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=2, dropout=dropout)
        self.fc = torch.nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        #text = [sent len, batch size]        
        embedded = self.embedding(text)        
        #embedded = [sent len, batch size, emb dim]        
        output, hidden = self.rnn(embedded)        
        #output = [sent len, batch size, hid dim]
        y = self.fc(output[-1,:,:].squeeze(0)).squeeze(1)     
        return y

In [36]:
from src import WordModel

model = WordModel(RNN(len(TEXT.vocab)))

model.compile(optimizer = torch.optim.Adam(model.net.parameters()),
              loss = torch.nn.BCEWithLogitsLoss(),
              metrics=[Metric()])

hist = model.fit(dataloader['train'], dataloader['val'], epochs=5)

In [37]:
model.evaluate(dataloader['test'])

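We can tell the network not to learn an embedding for the *pad* token, since it carries no information: with `padding_idx`, that embedding stays at zero and is not updated during training. This is a simple form of *masking*.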
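
The behaviour of `padding_idx` can be checked on a small layer. The sizes and the pad index of 1 are assumptions made for this sketch:

```python
import torch

torch.manual_seed(0)

PAD_IDX = 1  # assumed pad index
embedding = torch.nn.Embedding(10, 4, padding_idx=PAD_IDX)

# One sequence with two real tokens followed by two pad tokens.
tokens = torch.tensor([[2, 5, PAD_IDX, PAD_IDX]])
out = embedding(tokens)

# The pad row is initialised to zeros...
print(out[0, 2])

# ...and receives no gradient, so the optimizer never updates it.
out.sum().backward()
print(embedding.weight.grad[PAD_IDX])  # all zeros
print(embedding.weight.grad[2])        # all ones
```

Note that this only affects the embedding layer: the recurrent layers still process the padded time steps.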

In [38]:
class MaskedRNN(RNN):
    def __init__(self, input_dim, embedding_dim=128, hidden_dim=128, output_dim=1, dropout=0.2, pad_idx=0):
        super().__init__(input_dim, embedding_dim, hidden_dim, output_dim, dropout)
        self.embedding = torch.nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)

In [39]:
model = WordModel(MaskedRNN(len(TEXT.vocab),pad_idx=TEXT.vocab.stoi[TEXT.pad_token]))

model.compile(optimizer = torch.optim.Adam(model.net.parameters()),
              loss = torch.nn.BCEWithLogitsLoss(),
              metrics=[Metric()])

hist = model.fit(dataloader['train'], dataloader['val'], epochs=5)

In [40]:
model.evaluate(dataloader['test'])

### Bidirectional RNNs

See other notebook.

### Transfer learning

Another powerful technique consists of using pretrained embeddings. This way we do not start from random embeddings, and using embeddings trained on larger datasets can improve our results.
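
As a small sketch of the mechanics, PyTorch can also build an embedding layer directly from an existing matrix with `torch.nn.Embedding.from_pretrained`. This is an alternative to copying the weights by hand as done in the cells below, and the random matrix here is only a stand-in for real pretrained vectors:

```python
import torch

torch.manual_seed(0)

# Stand-in for real pretrained vectors (e.g. GloVe): a hypothetical
# 6-word vocabulary with 4-dimensional embeddings.
pretrained = torch.randn(6, 4)

# from_pretrained uses the matrix as the layer's weights; freeze=False
# keeps the embeddings trainable so they can be fine-tuned on the task.
embedding = torch.nn.Embedding.from_pretrained(pretrained, freeze=False)

print(torch.equal(embedding.weight.data, pretrained))  # True
print(embedding.weight.requires_grad)                  # True
```

With the default `freeze=True` the vectors would stay fixed during training, which can be a reasonable choice when the task dataset is small.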

In [41]:
TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

In [42]:
net = MaskedRNN(len(TEXT.vocab), embedding_dim=100, pad_idx=TEXT.vocab.stoi[TEXT.pad_token])

print(net.embedding.weight.data)

tensor([[-0.8684,  0.7003, -0.0894,  ...,  0.5347,  0.7920,  1.2473],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 1.6995, -1.6918, -0.6325,  ..., -0.1663, -0.5969,  1.9689],
        ...,
        [-2.1874, -0.0277,  0.5850,  ...,  0.1656,  0.7396,  0.0213],
        [-1.6543,  1.9068,  0.8589,  ...,  0.4353,  0.8982, -0.7845],
        [ 1.1289, -0.5085,  0.8016,  ..., -1.1249, -0.4152,  0.2123]])


In [43]:
pretrained_embeddings = TEXT.vocab.vectors

net.embedding.weight.data.copy_(pretrained_embeddings)

net.embedding.weight.data[TEXT.vocab.stoi[TEXT.unk_token]] = torch.zeros(100)
net.embedding.weight.data[TEXT.vocab.stoi[TEXT.pad_token]] = torch.zeros(100)

print(net.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.5486,  0.4344,  0.6683,  ..., -1.5594, -0.0370,  0.4488],
        [ 0.5242,  0.3242,  0.2375,  ..., -0.8456,  0.3121,  0.7025],
        [-0.1778,  0.7499, -0.2611,  ..., -1.1254,  0.7422, -0.1340]])


In [44]:
model = WordModel(net)

model.compile(optimizer = torch.optim.Adam(model.net.parameters()),
              loss = torch.nn.BCEWithLogitsLoss(),
              metrics=[Metric()])

hist = model.fit(dataloader['train'], dataloader['val'], epochs=5)

In [45]:
model.evaluate(dataloader['test'])

We can now use our model to predict whether a movie review is positive or negative.

In [46]:
import spacy
nlp = spacy.load('en')

In [47]:
sentences = ["this film is terrible", "this film is great", "this film is good", "a waste of time"]
tokenized = [[tok.text for tok in nlp.tokenizer(sentence)] for sentence in sentences]
indexed = [[TEXT.vocab.stoi[_t] for _t in t] for t in tokenized]
tensor = torch.tensor(indexed).to(device).permute(1,0)
model.net.eval()
prediction = torch.sigmoid(model.net(tensor))
prediction

tensor([0.0732, 0.9613, 0.8762, 0.0119], device='cuda:0',
       grad_fn=<SigmoidBackward>)