### Homework: going neural (6 pts)

We've checked out statistical approaches to language models in the last notebook. Now let's go find out what deep learning has to offer.

<img src='https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/expanding_mind_lm_kn_3.png' width=300px>

We're gonna use the same dataset as before, except this time we build a language model that's character-level, not word-level. Before you go:
* If you haven't done the seminar already, use `seminar.ipynb` to download the data.
* This homework uses PyTorch v1.x: this is [how you install it](https://pytorch.org/get-started/locally/); and that's [how you use it](https://github.com/yandexdataschool/Practical_RL/tree/9f89e98d7df7ad47f5d6c85a70a38283e06be16a/week04_%5Brecap%5D_deep_learning).

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
# Alternative manual download link: https://yadi.sk/d/_nGyU2IajjR9-w
!wget "https://www.dropbox.com/s/99az9n1b57qkd9j/arxivData.json.tar.gz?dl=1" -O arxivData.json.tar.gz
!tar -xvzf arxivData.json.tar.gz
data = pd.read_json("./arxivData.json")
data.sample(n=5)

--2022-09-16 12:42:50--  https://www.dropbox.com/s/99az9n1b57qkd9j/arxivData.json.tar.gz?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.85.18, 2620:100:6030:18::a27d:5012
Connecting to www.dropbox.com (www.dropbox.com)|162.125.85.18|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /s/dl/99az9n1b57qkd9j/arxivData.json.tar.gz [following]
--2022-09-16 12:42:51--  https://www.dropbox.com/s/dl/99az9n1b57qkd9j/arxivData.json.tar.gz
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc0b5a99c18415bc1286705b3e43.dl.dropboxusercontent.com/cd/0/get/BtDd2DBybzQpN5cBAlKrPqDuotoVi-bXuLEFjYi97h0sQpoJQEm9oaO5Q1gSvY1wWaXz1B1XONLZ6-OHfR5y_dXHS2bJKvRR2xJhLvvmOtd3ULnOmhhBKYH9hmeB4WW2WylZG6YKfk_ZTemB5dkR-JZIS2k_Zk168NeHCJ78iFKrYQ/file?dl=1# [following]
--2022-09-16 12:42:52--  https://uc0b5a99c18415bc1286705b3e43.dl.dropboxusercontent.com/cd/0/get/BtDd2DBybzQpN5cBAlKrPqDuotoVi-bXuLEFjYi97h0sQpoJ

Unnamed: 0,author,day,id,link,month,summary,tag,title,year
29435,"[{'name': 'Tadashi Matsuo'}, {'name': 'Hiroya ...",12,1709.03754v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",9,The auto-encoder method is a type of dimension...,"[{'term': 'cs.CV', 'scheme': 'http://arxiv.org...",Transform Invariant Auto-encoder,2017
31877,"[{'name': 'Julien Mairal'}, {'name': 'Rodolphe...",20,1110.4481v1,"[{'rel': 'related', 'href': 'http://dx.doi.org...",10,Recent work in signal processing and statistic...,"[{'term': 'cs.LG', 'scheme': 'http://arxiv.org...",Learning Hierarchical and Topographic Dictiona...,2011
27364,"[{'name': 'Mariane B. Neiva'}, {'name': 'Patri...",13,1703.04418v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",3,The main purpose of this paper is to propose a...,"[{'term': 'cs.CV', 'scheme': 'http://arxiv.org...",Improving LBP and its variants using anisotrop...,2017
9761,"[{'name': 'Leonid Peshkin'}, {'name': 'Sayan M...",17,cs/0105027v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",5,Reinforcement learning means finding the optim...,"[{'term': 'cs.LG', 'scheme': 'http://arxiv.org...",Bounds on sample size for policy evaluation in...,2001
26283,"[{'name': 'Oriol Vinyals'}, {'name': 'Alexande...",21,1609.06647v1,"[{'rel': 'related', 'href': 'http://dx.doi.org...",9,Automatically describing the content of an ima...,"[{'term': 'cs.CV', 'scheme': 'http://arxiv.org...",Show and Tell: Lessons learned from the 2015 M...,2016


Working at the character level means that we don't need to deal with a large vocabulary or missing words. Heck, we can even keep uppercase words in the text! The downside, however, is that all our sequences just got a lot longer.

However, we still need special tokens:
* Begin Of Sequence  (__BOS__) - this token is at the start of each sequence. We use it so that we always have non-empty input to our neural network: $P(x_1) = P(x_1 \mid BOS)$
* End Of Sequence (__EOS__) - you guessed it... this token is at the end of each sequence. The catch is that it should __not__ occur anywhere else except at the very end. If our model produces this token, the sequence is over.


In [5]:
BOS, EOS = ' ', '\n'

lines = data.apply(lambda row: (row['title'] + ' ; ' + row['summary'])[:512], axis=1) \
            .apply(lambda line: BOS + line.replace(EOS, ' ') + EOS) \
            .tolist()

# if you missed the seminar, download data here - https://yadi.sk/d/_nGyU2IajjR9-w

In [32]:
lines[0]

' Dual Recurrent Attention Units for Visual Question Answering ; We propose an architecture for VQA which utilizes recurrent layers to generate visual and textual attention. The memory characteristic of the proposed recurrent attention units offers a rich joint embedding of visual and textual features and enables the model to reason relations between several parts of the image and question. Our single model outperforms the first place winner on the VQA 1.0 dataset, performs within margin to the current state-\n'

Our next step is __building char-level vocabulary__. Put simply, you need to assemble a list of all unique tokens in the dataset.

In [6]:
# get all unique characters from lines (including capital letters and symbols)

# tokens = set(''.join(line for line in lines))

tokens = set()
for line in lines:
    tokens.update(set(line))

tokens = sorted(tokens)
n_tokens = len(tokens)
print ('n_tokens = ',n_tokens)
assert 100 < n_tokens < 150
assert BOS in tokens and EOS in tokens  # the original comma form used `EOS in tokens` as the assert message

n_tokens =  136


We can now assign each character its index in the tokens list. This way we can encode a string into a torch-friendly integer vector.

In [7]:
# dictionary of character -> its identifier (index in tokens list)
token_to_id = {char: idx for idx, char in enumerate(tokens)}

In [8]:
assert len(tokens) == len(token_to_id), "dictionaries must have same size"
for i in range(n_tokens):
    assert token_to_id[tokens[i]] == i, "token identifier must be its position in tokens list"

print("Seems alright!")

Seems alright!
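
To make the mapping concrete, here is a minimal sketch of encoding a string into ids and decoding it back. The names `toy_tokens`, `toy_token_to_id`, `encode` and `decode` are hypothetical helpers for illustration only; in the notebook their roles are played by `tokens` and `token_to_id`.

```python
# Toy vocabulary built the same way as `tokens`: sorted unique characters.
toy_tokens = sorted(set(" abc\n"))
toy_token_to_id = {char: idx for idx, char in enumerate(toy_tokens)}

def encode(text, token_to_id):
    """Map each character to its integer id."""
    return [token_to_id[char] for char in text]

def decode(ids, tokens):
    """Map integer ids back to characters and join them into a string."""
    return "".join(tokens[i] for i in ids)

encoded = encode(" abc\n", toy_token_to_id)
print(encoded)
print(repr(decode(encoded, toy_tokens)))
```

Because the vocabulary is sorted, `'\n'` gets the smallest id here, which is also why EOS shows up as `0` in the matrices printed later.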


Our final step is to assemble several strings into an integer matrix `[batch_size, text_length]`.

The only problem is that each sequence has a different length. We can work around that by padding short sequences with extra _EOS_ or cropping long sequences. Here's how it works:

In [9]:
def to_matrix(lines, max_len=None, pad=token_to_id[EOS], dtype=np.int64):
    """Casts a list of lines into a torch-digestible matrix"""
    max_len = max_len or max(map(len, lines))
    lines_ix = np.full([len(lines), max_len], pad, dtype=dtype)
    for i in range(len(lines)):
        line_ix = list(map(token_to_id.get, lines[i][:max_len]))
        lines_ix[i, :len(line_ix)] = line_ix
    return lines_ix

In [10]:
# Example: cast 3 dummy lines to a matrix, padded with the EOS index (which happens to be 0 here)
dummy_lines = [
    ' abc\n',
    ' abacaba\n',
    ' abc1234567890\n',
]
print(to_matrix(dummy_lines))

[[ 1 66 67 68  0  0  0  0  0  0  0  0  0  0  0]
 [ 1 66 67 66 68 66 67 66  0  0  0  0  0  0  0]
 [ 1 66 67 68 18 19 20 21 22 23 24 25 26 17  0]]


### Neural Language Model (2 points including training)

Just like for N-gram LMs, we want to estimate the probability of a text as a joint probability of its tokens (symbols this time).

$$P(X) = \prod_t P(x_t \mid x_0, \dots, x_{t-1}).$$ 

Instead of counting all possible statistics, we want to train a neural network with parameters $\theta$ that estimates the conditional probabilities:

$$ P(x_t \mid x_0, \dots, x_{t-1}) \approx p(x_t \mid x_0, \dots, x_{t-1}, \theta) $$
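
As a small worked example of the factorization above, the sketch below multiplies a few conditional probabilities into a joint probability and shows that summing their logarithms gives the same result. The numbers in `cond_probs` are made up for illustration; a real model would produce them.

```python
import math

# Hypothetical conditional probabilities p(x_t | x_0..x_{t-1}) for a 4-token text.
cond_probs = [0.5, 0.25, 0.8, 0.1]

# Chain rule: P(X) is the product of the conditionals ...
joint_prob = math.prod(cond_probs)
# ... so log P(X) is the sum of log-conditionals, which avoids underflow on long texts.
joint_logprob = sum(math.log(p) for p in cond_probs)

print(joint_prob, math.exp(joint_logprob))
```

Working in log space is what the loss function later in this notebook does as well.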


But before we optimize, we need to define our neural network. Let's start with a fixed-window (aka convolutional) architecture:

<img src='https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/fixed_window_lm.jpg' width=400px>
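
The fixed-window model needs a convolution that never looks ahead of the current position. Here is a rough NumPy sketch of that idea on a single channel: pad only on the left by `kernel_size - 1`, then slide the kernel. The helper `causal_conv1d` is a hypothetical name for illustration, not part of PyTorch.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """x: [length], kernel: [kernel_size]. Output[t] depends on x[t-k+1 .. t] only."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # pad on the left only
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

rng = np.random.default_rng(0)
kernel = rng.normal(size=5)
x = rng.normal(size=10)

x_changed = x.copy()
x_changed[6:] = 99.0  # alter only the "future" positions 6..9

out = causal_conv1d(x, kernel)
out_changed = causal_conv1d(x_changed, kernel)

print(out.shape)                               # same length as the input
print(np.allclose(out[:6], out_changed[:6]))   # earlier outputs are unaffected
```

This is the same property that the lookahead test in this notebook checks for the full model.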


In [12]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [29]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cpu')

In [14]:
class FixedWindowLanguageModel(nn.Module):
    def __init__(self, device, n_tokens=n_tokens, emb_size=16, hid_size=64, kernel_size=5):
        """ 
        A fixed window model that looks at the last kernel_size symbols (the current one and kernel_size - 1 before it).
        
        Note: fixed window LM is effectively performing a convolution over a sequence of words.
        This convolution only looks on current and previous words.
        Such convolution can be represented as a sequence of 2 operations:
        - pad input vectors by {strides * (filter_size - 1)} zero vectors on the "left", do not pad right
        - perform regular convolution with {filter_size} and {strides}
        
        - If you're absolutely lost, here's a hint: use nn.ZeroPad2d((NUM_LEADING_ZEROS, 0, 0, 0))
          followed by a nn.Conv1d(..., padding=0). And yes, it's okay that padding is technically "2d".
        """
        super().__init__() # initialize base class to track sub-layers, trainable variables, etc.
        
        self.device = device
        self.kernel_size = kernel_size
        
        self.emb = nn.Embedding(n_tokens, emb_size)
        self.conv1d = nn.Conv1d(in_channels=emb_size,
                                out_channels=hid_size,
                                kernel_size=kernel_size)  # must match the left padding applied in __call__
        self.fc1 = nn.Linear(hid_size, hid_size)
        self.fc2 = nn.Linear(hid_size, n_tokens)
        self.relu = nn.ReLU()

    def __call__(self, input_ix):
        """
        compute language model logits given input tokens
        :param input_ix: batch of sequences with token indices, tensor: int32[batch_size, sequence_length]
        :returns: pre-softmax linear outputs of language model [batch_size, sequence_length, n_tokens]
            these outputs will be used as logits to compute P(x_t | x_0, ..., x_{t - 1})
            
        :note: that convolutions operate with tensors of shape [batch, channels, length], while linear layers
         and *embeddings* use [batch, length, channels] tensors. Use tensor.permute(...) to adjust shapes.

        """
        inp_emb = self.emb(input_ix) # [batch_size, sequence_length, emb_dim]
        inp_emb = inp_emb.permute((0, 2, 1)) # [batch_size, emb_dim, sequence_length]        
        
        # apply padding to keep tensor size
        # pad (with zeros) last dim by kernel_size - 1 on the left and 0 on the right
        inp_emb = F.pad(inp_emb, pad=(self.kernel_size - 1, 0)) 
        #print(inp_emb[0][0])
        '''
        tensor([ 0.0000,  0.0000,  0.0000,  0.0000, -0.0653, -0.4699,  1.1353, -0.8066,
         0.4302,  0.4302,  0.4302,  0.4302,  0.4302,  0.4302,  0.4302,  0.4302,
         0.4302,  0.4302,  0.4302], grad_fn=<SelectBackward0>)
        '''
        inp_emb = self.conv1d(inp_emb) # [batch_size, hid_size, sequence_length]
        #print(inp_emb.size())
        # if we haven't used the padding we would have gotten out_shape = [batch_size, hid_size, sequence_length - 4]
        
        inp_emb = inp_emb.permute((0, 2, 1)) # [batch_size, sequence_length, hid_size]       

        inp_emb = self.fc1(inp_emb) # [batch_size, sequence_length, hid_size]  
        inp_emb = self.relu(inp_emb)  
        inp_emb = self.fc2(inp_emb) # [batch_size, sequence_length, n_tokens]  
        
        return inp_emb # [batch_size, sequence_length, n_tokens]
    
    def get_possible_next_tokens(self, prefix=BOS, temperature=1.0, max_len=100):
        """ :returns: probabilities of next token, dict {token : prob} for all tokens """
        prefix_ix = torch.as_tensor(to_matrix([prefix]), dtype=torch.int64).to(self.device)
        with torch.no_grad():
            probs = torch.softmax(self(prefix_ix)[0, -1], dim=-1).cpu().numpy()  # shape: [n_tokens]
        return dict(zip(tokens, probs))


In [15]:
dummy_model = FixedWindowLanguageModel(device=device)

dummy_input_ix = torch.as_tensor(to_matrix(dummy_lines))
dummy_logits = dummy_model(dummy_input_ix)

print('Weights:', tuple(name for name, w in dummy_model.named_parameters()))

Weights: ('emb.weight', 'conv1d.weight', 'conv1d.bias', 'fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias')


In [16]:
assert isinstance(dummy_logits, torch.Tensor)
assert dummy_logits.shape == (len(dummy_lines), max(map(len, dummy_lines)), n_tokens), "please check output shape"
assert np.all(np.isfinite(dummy_logits.data.cpu().numpy())), "inf/nan encountered"
assert not np.allclose(dummy_logits.data.cpu().numpy().sum(-1), 1), "please predict linear outputs, don't use softmax (maybe you've just got unlucky)"

In [17]:
# test for lookahead
dummy_input_ix_2 = torch.as_tensor(to_matrix([line[:3] + 'e' * (len(line) - 3) for line in dummy_lines]))
dummy_logits_2 = dummy_model(dummy_input_ix_2)

assert torch.allclose(dummy_logits[:, :3], dummy_logits_2[:, :3]), "your model's predictions depend on FUTURE tokens. " \
    " Make sure you don't allow any layers to look ahead of current token." \
    " You can also get this error if your model is not deterministic (e.g. dropout). Disable it for this test."

We can now tune our network's parameters to minimize categorical crossentropy over training dataset $D$:

$$ L = {\frac1{|D|}} \sum_{X \in D} \sum_{x_t \in X} - \log p(x_t \mid x_0, \dots, x_{t-1}, \theta) $$

As usual with neural nets, this optimization is performed via stochastic gradient descent with backprop. One can also note that minimizing crossentropy is equivalent to minimizing model __perplexity__, minimizing KL-divergence or maximizing log-likelihood.
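
To illustrate the link between crossentropy and perplexity, here is a short sketch that computes both from the probabilities a model assigned to the correct next characters. The values in `probs_of_targets` are invented for the example.

```python
import math

# Hypothetical probabilities the model assigned to the correct next characters.
probs_of_targets = [0.5, 0.25, 0.125, 0.125]

# Mean negative log-likelihood per token (crossentropy, in nats).
cross_entropy = -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)
# Perplexity is exp(crossentropy); exp is monotonic, so minimizing one minimizes the other.
perplexity = math.exp(cross_entropy)

# Reference point: a model that is uniform over 136 characters.
uniform_cross_entropy = math.log(136)

print(cross_entropy, perplexity, uniform_cross_entropy)
```

The uniform reference is a handy sanity check: an untrained model over `n_tokens = 136` characters should start with a loss close to `log(136)`.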

In [18]:
def compute_mask(input_ix, eos_ix=token_to_id[EOS]):
    """ compute a boolean mask that equals "1" until first EOS (including that EOS) """
    return F.pad(torch.cumsum(input_ix == eos_ix, dim=-1)[..., :-1] < 1, pad=(1, 0, 0, 0), value=True)

print('matrix:\n', dummy_input_ix.numpy())
print('mask:', compute_mask(dummy_input_ix).to(torch.int32).cpu().numpy())
print('lengths:', compute_mask(dummy_input_ix).sum(-1).cpu().numpy())

matrix:
 [[ 1 66 67 68  0  0  0  0  0  0  0  0  0  0  0]
 [ 1 66 67 66 68 66 67 66  0  0  0  0  0  0  0]
 [ 1 66 67 68 18 19 20 21 22 23 24 25 26 17  0]]
mask: [[1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 0 0 0 0 0 0]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
lengths: [ 5  9 15]
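
The cumulative-sum trick in `compute_mask` can be easier to follow in plain NumPy. The sketch below is an equivalent formulation rather than a copy of the function above: instead of shifting the cumulative sum, it counts EOS tokens seen strictly before each position. It assumes EOS has index 0, as in the printed matrix.

```python
import numpy as np

eos_ix = 0  # assumption: EOS has index 0, as in the matrix printed above
batch = np.array([[1, 66, 67, 68, 0, 0, 0],
                  [1, 66, 67, 66, 68, 66, 0]])

is_eos = (batch == eos_ix)
# Number of EOS tokens seen strictly before each position.
eos_before = np.cumsum(is_eos, axis=-1) - is_eos
# True up to and including the first EOS, False for the padding after it.
mask = eos_before < 1

print(mask.astype(int))
print(mask.sum(-1))
```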


In [19]:
def compute_loss(model, input_ix):
    """
    :param model: language model that can compute next token logits given token indices
    :param input_ix: int32 matrix of tokens, shape: [batch_size, length]; padded with eos_ix
    :returns: scalar loss function, mean crossentropy over all tokens up to and including the first EOS
    """
    input_ix = torch.as_tensor(input_ix, dtype=torch.int64)
    '''
    tensor([[ 1, 66, 67, 68,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 1, 66, 67, 66, 68, 66, 67, 66,  0,  0,  0,  0,  0,  0,  0],
        [ 1, 66, 67, 68, 18, 19, 20, 21, 22, 23, 24, 25, 26, 17,  0]])
    '''
    targets = input_ix[:, 1:]
    '''
    tensor([[66, 67, 68,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [66, 67, 66, 68, 66, 67, 66,  0,  0,  0,  0,  0,  0,  0],
        [66, 67, 68, 18, 19, 20, 21, 22, 23, 24, 25, 26, 17,  0]])
    '''
    mask = compute_mask(targets)
    targets_1hot = F.one_hot(targets, n_tokens).to(torch.float32)

    logits_seq = model(input_ix[:, :-1])
    logprobs_seq = torch.log_softmax(logits_seq, dim=-1)

    # log-probabilities of correct outputs, [batch_size, length - 1]
    logp_out = (logprobs_seq * targets_1hot).sum(dim=-1)  

    return -logp_out[mask].mean() 

In [20]:
loss_1 = compute_loss(dummy_model, to_matrix(dummy_lines, max_len=15))
loss_2 = compute_loss(dummy_model, to_matrix(dummy_lines, max_len=16))
assert (np.ndim(loss_1) == 0) and (0 < loss_1 < 100), "loss must be a positive scalar"
assert torch.allclose(loss_1, loss_2), 'do not include tokens AFTER the first EOS in the loss. '\
    'Hint: use compute_mask. Beware +/-1 errors. And be careful when averaging!' 

### Evaluation

You will need two functions: one to compute test loss and another to generate samples. For your convenience, we implemented both of them for you.

In [23]:
def score_lines(model, dev_lines, batch_size):
    """ computes average loss over the entire dataset """
    dev_loss_num, dev_loss_len = 0., 0.
    with torch.no_grad():
        for i in range(0, len(dev_lines), batch_size):
            batch_ix = to_matrix(dev_lines[i: i + batch_size])
            dev_loss_num += compute_loss(model, batch_ix).item() * len(batch_ix)
            dev_loss_len += len(batch_ix)
    return dev_loss_num / dev_loss_len

def generate(model, prefix=BOS, temperature=1.0, max_len=100):
    """
    Samples output sequence from probability distribution obtained by model
    :param temperature: samples proportionally to model probabilities ^ (1 / temperature)
        if temperature == 0, always takes most likely token. Break ties arbitrarily.
    """
    with torch.no_grad():
        while True:
            token_probs = model.get_possible_next_tokens(prefix)
            tokens, probs = zip(*token_probs.items())
            if temperature == 0:
                next_token = tokens[np.argmax(probs)]
            else:
                probs = np.array([p ** (1. / temperature) for p in probs])
                probs /= sum(probs)
                next_token = np.random.choice(tokens, p=probs)

            prefix += next_token
            if next_token == EOS or len(prefix) > max_len: break
    return prefix
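
To see what the temperature in `generate` does, here is a small sketch that applies the same `p ** (1 / temperature)` rescaling to a toy distribution. `apply_temperature` is a hypothetical helper name and the probabilities are made up.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / temperature) and re-normalize."""
    scaled = np.asarray(probs, dtype=np.float64) ** (1.0 / temperature)
    return scaled / scaled.sum()

probs = [0.1, 0.35, 0.05, 0.2, 0.3]
for tau in (2.0, 1.0, 0.5):
    print(tau, np.round(apply_temperature(probs, tau), 3))

sharp = apply_temperature(probs, 0.5)  # tau < 1: mass concentrates on likely tokens
flat = apply_temperature(probs, 2.0)   # tau > 1: distribution moves towards uniform
```

With `tau = 1` the distribution is unchanged, and as `tau` approaches 0 sampling approaches the argmax branch of `generate`.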

### Training loop

Finally, let's train our model on minibatches of data.

In [25]:
from sklearn.model_selection import train_test_split
train_lines, dev_lines = train_test_split(lines, test_size=0.25, random_state=42)

batch_size = 256
score_dev_every = 250
train_history, dev_history = [], []
model = FixedWindowLanguageModel(device=device)
opt = torch.optim.Adam(model.parameters())

# hint: if you ever wanted to switch to cuda, do it now.

# score untrained model
dev_history.append((0, score_lines(model, dev_lines, batch_size)))
print("Sample before training:", generate(model, 'Bridging'))

Sample before training: BridgingZX<Öμxéçöy5β)KQuP*-V>ZαVDμXè+Buô};<%mäν.°eDw%gΩæv`~6Wga;αi*áaJxä1iN;;óõ`i(FWα:Güσt@RFNŁj >|


In [27]:
dev_history

[(0, 4.886545692257765)]

In [28]:
from IPython.display import clear_output
from random import sample
from tqdm import trange

for i in trange(5000):
    batch = to_matrix(sample(train_lines, batch_size))
    
    loss_i = compute_loss(model, batch)
    
    opt.zero_grad()
    loss_i.backward()
    opt.step()
        
    train_history.append((i, loss_i.item()))
    
    if (i + 1) % 50 == 0:
        clear_output(True)
        plt.scatter(*zip(*train_history), alpha=0.1, label='train_loss')
        if len(dev_history):
            plt.plot(*zip(*dev_history), color='red', label='dev_loss')
        plt.legend(); plt.grid(); plt.show()
        print("Generated examples (tau=0.5):")
        for _ in range(3):
            print(generate(model, temperature=0.5))
    
    if (i + 1) % score_dev_every == 0:
        print("Scoring dev...")
        dev_history.append((i, score_lines(model, dev_lines, batch_size)))
        print('#%i Dev loss: %.3f' % dev_history[-1])


  0%|          | 16/5000 [00:16<1:23:47,  1.01s/it]


KeyboardInterrupt: ignored

In [None]:
assert np.mean(train_history[:10], axis=0)[1] > np.mean(train_history[-10:], axis=0)[1], "The model didn't converge."
print("Final dev loss:", dev_history[-1][-1])

for i in range(10):
    print(generate(model, temperature=0.5))

### RNN Language Models (3 points including training)

Fixed-size architectures are reasonably good when capturing short-term dependencies, but their design prevents them from capturing any signal outside their window. We can mitigate this problem by using a __recurrent neural network__:

$$ h_0 = \vec 0 ; \quad h_{t+1} = RNN(x_t, h_t) $$

$$ p(x_t \mid x_0, \dots, x_{t-1}, \theta) = dense_{softmax}(h_t) $$

Such a model processes one token at a time, left to right, and maintains a hidden state vector between them. Theoretically, it can learn arbitrarily long temporal dependencies given large enough hidden size.

<img src='https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/rnn_lm.jpg' width=480px>
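
The recurrence itself fits in a few lines. Below is a minimal NumPy sketch of a vanilla (Elman) RNN cell unrolled over a sequence; the weights `W_xh`, `W_hh`, `b_h` and the helper `rnn_step` are hypothetical and randomly initialized, and the homework itself recommends LSTM/GRU layers instead.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_size, hid_size, seq_len = 4, 8, 6

# Hypothetical parameters of a vanilla RNN cell.
W_xh = rng.normal(scale=0.1, size=(emb_size, hid_size))
W_hh = rng.normal(scale=0.1, size=(hid_size, hid_size))
b_h = np.zeros(hid_size)

def rnn_step(x_t, h_t):
    """One recurrence step: h_{t+1} = tanh(x_t W_xh + h_t W_hh + b_h)."""
    return np.tanh(x_t @ W_xh + h_t @ W_hh + b_h)

inputs = rng.normal(size=(seq_len, emb_size))  # stand-in for token embeddings
h = np.zeros(hid_size)                         # h_0 = 0
states = []
for x_t in inputs:                             # left to right, one token at a time
    h = rnn_step(x_t, h)
    states.append(h)
states = np.stack(states)

print(states.shape)  # one hidden state per time step
```

Each row of `states` summarizes the prefix read so far, which is what a linear layer then turns into next-token logits.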

In [50]:
class RNNLanguageModel(nn.Module):
    def __init__(self, device, n_tokens=n_tokens, emb_size=16, hid_size=256, bid=False):
        """ 
        Build a recurrent language model.
        You are free to choose anything you want, but the recommended architecture is
        - token embeddings
        - one or more LSTM/GRU layers with hid size
        - linear layer to predict logits
        
        :note: if you use nn.RNN/GRU/LSTM, make sure you specify batch_first=True
         With batch_first, your model operates with tensors of shape [batch_size, sequence_length, num_units]
         Also, please read the docs carefully: they don't just return what you want them to return :)
        """
        super().__init__() # initialize base class to track sub-layers, trainable variables, etc.
        
        self.device = device # needed by get_possible_next_tokens below
        self.eos_ix = 0 # '\n' index

        self.emb = nn.Embedding(n_tokens, emb_size) 
        self.lstm = nn.LSTM(emb_size, hid_size, batch_first=True, bidirectional=bid)
        self.hid_to_logits = nn.Linear(hid_size + hid_size*bid, n_tokens)


    def __call__(self, input_ix):
        """
        compute language model logits given input tokens
        :param input_ix: batch of sequences with token indices, tensor: int32[batch_size, sequence_length]
        :returns: pre-softmax linear outputs of language model [batch_size, sequence_length, n_tokens]
            these outputs will be used as logits to compute P(x_t | x_0, ..., x_{t - 1})
        """
        inp_emb = self.emb(input_ix) # [batch_size, sequence_length, emb_dim]
        
        enc_seq, last_state_but_not_really = self.lstm(inp_emb)
        # enc_seq: [batch, time, num_directions * hid_size], top-layer outputs h_t for each t
        # last_state_but_not_really: tuple (h_n, c_n), each [num_layers * num_directions, batch, hid_size]
        next_logits = self.hid_to_logits(enc_seq)

        # return raw logits, as the docstring promises; compute_loss applies log_softmax itself
        return next_logits # [batch_size, sequence_length, n_tokens]


    def get_possible_next_tokens(self, prefix=BOS, temperature=1.0, max_len=100):
        """ :returns: probabilities of next token, dict {token : prob} for all tokens """
        prefix_ix = torch.as_tensor(to_matrix([prefix]), dtype=torch.int64).to(self.device)
        with torch.no_grad():
            probs = torch.softmax(self(prefix_ix)[0, -1], dim=-1).cpu().numpy()  # shape: [n_tokens]
        return dict(zip(tokens, probs))


In [42]:
rnn_model = RNNLanguageModel(device=device)

dummy_input_ix = torch.as_tensor(to_matrix(dummy_lines))
dummy_logits = rnn_model(dummy_input_ix)

assert isinstance(dummy_logits, torch.Tensor)
assert dummy_logits.shape == (len(dummy_lines), max(map(len, dummy_lines)), n_tokens), "please check output shape"
assert not np.allclose(dummy_logits.cpu().data.numpy().sum(-1), 1), "please predict linear outputs, don't use softmax (maybe you've just got unlucky)"
print('Weights:', tuple(name for name, w in rnn_model.named_parameters()))

Weights: ('emb.weight', 'lstm.weight_ih_l0', 'lstm.weight_hh_l0', 'lstm.bias_ih_l0', 'lstm.bias_hh_l0', 'hid_to_logits.weight', 'hid_to_logits.bias')


In [45]:
# test for lookahead
dummy_input_ix_2 = torch.as_tensor(to_matrix([line[:3] + 'e' * (len(line) - 3) for line in dummy_lines]))
dummy_logits_2 = rnn_model(dummy_input_ix_2)

assert torch.allclose(dummy_logits[:, :3], dummy_logits_2[:, :3]), "your model's predictions depend on FUTURE tokens. " \
    " Make sure you don't allow any layers to look ahead of current token." \
    " You can also get this error if your model is not deterministic (e.g. dropout). Disable it for this test."

### RNN training

Our RNN language model should optimize the same loss function as the fixed-window model. But there's a catch. Since an RNN repeatedly multiplies gradients across many time steps, gradient values may explode, [ruining](https://raw.githubusercontent.com/yandexdataschool/nlp_course/master/resources/nan.jpg) your model.
The common solution to that problem is to clip gradients either [individually](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/clip_by_value) or [globally](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/clip_by_global_norm).

Your task here is to implement the training code that minimizes the loss function. If you encounter large loss fluctuations during training, please add [gradient clipping](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html) using the links above. But it's **not necessary** to use gradient clipping if you don't need it.

_Note: gradient clipping is not exclusive to RNNs. Convolutional networks with enough depth often suffer from the same issue._
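
For intuition, global-norm clipping can be sketched in NumPy as follows. This is an illustration of the idea rather than the library implementation: `clip_by_global_norm` is a hypothetical helper and the gradients are made up. In PyTorch you would normally call `torch.nn.utils.clip_grad_norm_` between `backward()` and `step()`.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]  # made-up gradients of two parameters
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

print(norm_before, norm_after)
```

Scaling every gradient by the same factor keeps the update direction and only shortens the step, unlike clipping each value individually.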

In [51]:
batch_size = 128 # <-- please tune batch size to fit your CPU/GPU configuration
score_dev_every = 250
train_history, dev_history = [], []

rnn_model = RNNLanguageModel(device=device)  # keep it unidirectional: a bidirectional LSTM would see future tokens
opt = torch.optim.Adam(rnn_model.parameters())

# score untrained model
dev_history.append((0, score_lines(rnn_model, dev_lines, batch_size)))
print(dev_history)
print("Sample before training:", generate(rnn_model, 'Bridging'))

KeyboardInterrupt: ignored

In [None]:
from IPython.display import clear_output
from random import sample
from tqdm import trange

for i in trange(5000):
    batch = to_matrix(sample(train_lines, batch_size))
    
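    loss_i = compute_loss(rnn_model, batch)
    
    opt.zero_grad()
    loss_i.backward()
    # optional: clip the global gradient norm; max_norm=5.0 is an arbitrary choice, tune or remove it
    nn.utils.clip_grad_norm_(rnn_model.parameters(), max_norm=5.0)
    opt.step()
    
    train_history.append((i, float(loss_i)))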
    
    if (i + 1) % 50 == 0:
        clear_output(True)
        plt.scatter(*zip(*train_history), alpha=0.1, label='train_loss')
        if len(dev_history):
            plt.plot(*zip(*dev_history), color='red', label='dev_loss')
        plt.legend(); plt.grid(); plt.show()
        print("Generated examples (tau=0.5):")
        for _ in range(3):
            print(generate(rnn_model, temperature=0.5))
    
    if (i + 1) % score_dev_every == 0:
        print("Scoring dev...")
        dev_history.append((i, score_lines(rnn_model, dev_lines, batch_size)))
        print('#%i Dev loss: %.3f' % dev_history[-1])


In [None]:
assert np.mean(train_history[:10], axis=0)[1] > np.mean(train_history[-10:], axis=0)[1], "The model didn't converge."
print("Final dev loss:", dev_history[-1][-1])
for i in range(10):
    print(generate(rnn_model, temperature=0.5))

### Alternative sampling strategies (1 point)

So far we've sampled tokens from the model in proportion with their probability.
However, this approach can sometimes generate nonsense words due to the fact that softmax probabilities of these words are never exactly zero. This issue can be somewhat mitigated with sampling temperature, but low temperature harms sampling diversity. Can we remove the nonsense words without sacrificing diversity? __Yes, we can!__ But it takes a different sampling strategy.

__Top-k sampling:__ on each step, sample the next token from __k most likely__ candidates from the language model.

Suppose $k=3$ and the token probabilities are $p=[0.1, 0.35, 0.05, 0.2, 0.3]$. You first need to select $k$ most likely words and set the probability of the rest to zero: $\hat p=[0.0, 0.35, 0.0, 0.2, 0.3]$ and re-normalize: 
$p^*\approx[0.0, 0.412, 0.0, 0.235, 0.353]$.
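
The top-k example above can be reproduced with a few lines of NumPy. This is a minimal sketch; `top_k_filter` is a hypothetical helper name, not a library function.

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything except the k most likely tokens, then re-normalize."""
    probs = np.asarray(probs, dtype=np.float64)
    keep = np.argsort(probs)[-k:]  # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

p = [0.1, 0.35, 0.05, 0.2, 0.3]
p_star = top_k_filter(p, k=3)
print(np.round(p_star, 3))

# Sample the next token index from the filtered distribution.
next_ix = np.random.choice(len(p_star), p=p_star)
```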

__Nucleus sampling:__ similar to top-k sampling, but this time we select $k$ dynamically. In nucleus sampling, we sample from the top-__N%__ fraction of the probability mass.

Using the same $p=[0.1, 0.35, 0.05, 0.2, 0.3]$ and nucleus N=0.9, the nucleus words consist of:
1. most likely token $w_2$, because $p(w_2) < N$
2. second most likely token $w_5$, $p(w_2) + p(w_5) = 0.65 < N$
3. third most likely token $w_4$ because $p(w_2) + p(w_5) + p(w_4) = 0.85 < N$

And that's it, because the next most likely word would overflow: $p(w_2) + p(w_5) + p(w_4) + p(w_1) = 0.95 > N$.

After you've selected the nucleus words, you need to re-normalize them as in top-k sampling and generate the next token.
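
Here is one way the selection rule described above could be sketched in NumPy, keeping tokens only while the cumulative mass stays below N. `nucleus_filter` is a hypothetical helper. Other formulations also keep the token that crosses the threshold, so treat the boundary handling as an assumption.

```python
import numpy as np

def nucleus_filter(probs, nucleus):
    """Keep the most likely tokens while their cumulative mass stays below `nucleus`.
    The single most likely token is always kept so the result is never empty."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]  # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    n_keep = max(1, int(np.sum(cumulative < nucleus)))
    filtered = np.zeros_like(probs)
    filtered[order[:n_keep]] = probs[order[:n_keep]]
    return filtered / filtered.sum()

p = [0.1, 0.35, 0.05, 0.2, 0.3]
p_nucleus = nucleus_filter(p, nucleus=0.9)
print(np.round(p_nucleus, 3))
```

On this toy distribution the result coincides with top-k for $k=3$, but on a peaked distribution the nucleus shrinks and on a flat one it grows.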

__Your task__ is to implement the nucleus sampling variant and see if it's any good.

In [57]:
prefix='a'
token_probs = model.get_possible_next_tokens(prefix)

In [69]:
token_probs = {k: v for k, v in sorted(token_probs.items(), key=lambda item: item[1], reverse=True)} # sorted now   

In [71]:
total_prob = 0
nucleus = 0.9
stopping = False
for token,prob in token_probs.items():
    if not stopping:
        total_prob += prob
    else:
        token_probs[token] = 0

    if total_prob >= nucleus:
        stopping = True
        # next probs = 0

In [72]:
token_probs

{'|': 0.007970412,
 'N': 0.007958499,
 'ü': 0.007957231,
 'm': 0.007881473,
 'V': 0.007874677,
 '5': 0.007858334,
 'x': 0.007857271,
 '_': 0.007848417,
 'é': 0.007845772,
 "'": 0.007826981,
 'p': 0.007817812,
 '/': 0.0077899224,
 '-': 0.0077767987,
 'τ': 0.0077719535,
 'Z': 0.0077579357,
 'ν': 0.007754034,
 'b': 0.0077282796,
 'R': 0.0077130864,
 'E': 0.0077049844,
 '8': 0.0077039483,
 '#': 0.007676735,
 'r': 0.0076728305,
 'ε': 0.0076707196,
 'h': 0.0076616793,
 'Q': 0.007655566,
 'ó': 0.007648006,
 'Ö': 0.007645049,
 '4': 0.0076208743,
 'σ': 0.007616947,
 'e': 0.0076126084,
 '$': 0.0075939656,
 'γ': 0.0075630615,
 'n': 0.0075614676,
 'O': 0.007560775,
 'j': 0.007560036,
 'à': 0.007560004,
 '!': 0.007557549,
 'W': 0.007555207,
 'á': 0.00755382,
 'É': 0.0075522754,
 '2': 0.0075444356,
 'c': 0.0075412816,
 '<': 0.0075395843,
 'y': 0.0075365724,
 'í': 0.0075365147,
 '%': 0.007502062,
 'l': 0.007501325,
 'Y': 0.0074973125,
 '°': 0.007485917,
 'ś': 0.0074834474,
 'χ': 0.0074725645,
 'C': 0

In [73]:
total_prob

0.9044761145487428

In [None]:
%%time
prefix = 'a'
token_probs = model.get_possible_next_tokens(prefix)

token_probs = {k: v for k, v in sorted(token_probs.items(), key=lambda item: item[1], reverse=True)}
tokens, probs = zip(*token_probs.items())

next_idx = np.random.choice(len(probs), p=probs)

In [78]:
%%time
prefix = 'a'
token_probs = model.get_possible_next_tokens(prefix)
tokens, probs = zip(*token_probs.items())

probs = np.array(probs)
sorted_probs_indices = np.argsort(probs)[::-1]
probs[sorted_probs_indices]

CPU times: user 2.15 ms, sys: 1.02 ms, total: 3.17 ms
Wall time: 4.17 ms


array([0.00797041, 0.0079585 , 0.00795723, 0.00788147, 0.00787468,
       0.00785833, 0.00785727, 0.00784842, 0.00784577, 0.00782698,
       0.00781781, 0.00778992, 0.0077768 , 0.00777195, 0.00775794,
       0.00775403, 0.00772828, 0.00771309, 0.00770498, 0.00770395,
       0.00767674, 0.00767283, 0.00767072, 0.00766168, 0.00765557,
       0.00764801, 0.00764505, 0.00762087, 0.00761695, 0.00761261,
       0.00759397, 0.00756306, 0.00756147, 0.00756078, 0.00756004,
       0.00756   , 0.00755755, 0.00755521, 0.00755382, 0.00755228,
       0.00754444, 0.00754128, 0.00753958, 0.00753657, 0.00753651,
       0.00750206, 0.00750133, 0.00749731, 0.00748592, 0.00748345,
       0.00747256, 0.00745628, 0.00745606, 0.00745484, 0.00744778,
       0.00744099, 0.00744098, 0.00743129, 0.00742649, 0.00739865,
       0.00737586, 0.00737506, 0.00737354, 0.00736544, 0.00736154,
       0.00735749, 0.00735218, 0.00735067, 0.00734817, 0.00734322,
       0.00734317, 0.00733857, 0.00733423, 0.00733217, 0.00733

In [82]:
sorted_probs_indices

array([ 93,  47, 117,  78,  55,  22,  89,  64, 109,   8,  81,  16,  14,
       133,  59, 130,  67,  51,  38,  25,   4,  83, 127,  73,  50, 113,
        99,  21, 132,  70,   5, 126,  79,  48,  75, 101,   2,  56, 102,
        98,  19,  68,  29,  90, 111,   6,  77,  58,  97, 120, 134,  36,
       122,  91,  88,  18,  80,   9, 115, 108,  52,  27, 125,  69, 110,
       118,  28,  46,  35,  10,  31,  33,   1,  42,  44,  49,  39, 116,
        30,  32, 124,  62,  37,  41,  61,  72, 135, 107,  15, 114, 131,
        11,  26, 104,  82,  66,  60,  65,  40, 123,  17,   0, 119,  84,
        95,  23,  45, 103,  85,   7,  54, 128, 105,  87,   3,  13, 100,
        34,  76,  24,  86,  94,  92, 106,  63,  43, 129, 112,  53, 121,
        71,  74,  20,  96,  12,  57])

In [80]:
tokens = np.array(tokens)

In [81]:
tokens[sorted_probs_indices]

array(['|', 'N', 'ü', 'm', 'V', '5', 'x', '_', 'é', "'", 'p', '/', '-',
       'τ', 'Z', 'ν', 'b', 'R', 'E', '8', '#', 'r', 'ε', 'h', 'Q', 'ó',
       'Ö', '4', 'σ', 'e', '$', 'γ', 'n', 'O', 'j', 'à', '!', 'W', 'á',
       'É', '2', 'c', '<', 'y', 'í', '%', 'l', 'Y', '°', 'ś', 'χ', 'C',
       'Σ', 'z', 'w', '1', 'o', '(', 'õ', 'è', 'S', ':', 'β', 'd', 'ê',
       'Ł', ';', 'M', 'B', ')', '>', '@', ' ', 'I', 'K', 'P', 'F', 'ö',
       '=', '?', 'α', ']', 'D', 'H', '\\', 'g', 'ω', 'ç', '.', 'ô', 'ρ',
       '*', '9', 'ã', 'q', 'a', '[', '`', 'G', 'Ω', '0', '\n', 'ő', 's',
       '~', '6', 'L', 'â', 't', '&', 'U', 'λ', 'ä', 'v', '"', ',', 'Ü',
       'A', 'k', '7', 'u', '}', '{', 'æ', '^', 'J', 'μ', 'ï', 'T', 'Π',
       'f', 'i', '3', '\x7f', '+', 'X'], dtype='<U1')

In [83]:
sorted_probs = probs[sorted_probs_indices] 
sorted_probs

array([0.00797041, 0.0079585 , 0.00795723, 0.00788147, 0.00787468,
       0.00785833, 0.00785727, 0.00784842, 0.00784577, 0.00782698,
       0.00781781, 0.00778992, 0.0077768 , 0.00777195, 0.00775794,
       0.00775403, 0.00772828, 0.00771309, 0.00770498, 0.00770395,
       0.00767674, 0.00767283, 0.00767072, 0.00766168, 0.00765557,
       0.00764801, 0.00764505, 0.00762087, 0.00761695, 0.00761261,
       0.00759397, 0.00756306, 0.00756147, 0.00756078, 0.00756004,
       0.00756   , 0.00755755, 0.00755521, 0.00755382, 0.00755228,
       0.00754444, 0.00754128, 0.00753958, 0.00753657, 0.00753651,
       0.00750206, 0.00750133, 0.00749731, 0.00748592, 0.00748345,
       0.00747256, 0.00745628, 0.00745606, 0.00745484, 0.00744778,
       0.00744099, 0.00744098, 0.00743129, 0.00742649, 0.00739865,
       0.00737586, 0.00737506, 0.00737354, 0.00736544, 0.00736154,
       0.00735749, 0.00735218, 0.00735067, 0.00734817, 0.00734322,
       0.00734317, 0.00733857, 0.00733423, 0.00733217, 0.00733

In [85]:
# compute cumulative probabilities of sorted array
cumulative_probs = np.cumsum(sorted_probs)
cumulative_probs

array([0.00797041, 0.01592891, 0.02388614, 0.03176761, 0.03964229,
       0.04750063, 0.0553579 , 0.06320632, 0.07105209, 0.07887907,
       0.08669689, 0.09448681, 0.10226361, 0.11003556, 0.1177935 ,
       0.12554753, 0.1332758 , 0.14098889, 0.14869387, 0.15639782,
       0.16407456, 0.17174739, 0.1794181 , 0.18707979, 0.19473535,
       0.20238335, 0.21002841, 0.21764928, 0.22526623, 0.23287883,
       0.2404728 , 0.24803585, 0.25559732, 0.2631581 , 0.27071816,
       0.27827817, 0.2858357 , 0.29339093, 0.30094475, 0.308497  ,
       0.31604144, 0.3235827 , 0.33112228, 0.33865884, 0.34619534,
       0.3536974 , 0.36119872, 0.36869603, 0.37618196, 0.3836654 ,
       0.391138  , 0.39859426, 0.40605032, 0.41350517, 0.42095295,
       0.42839393, 0.4358349 , 0.4432662 , 0.4506927 , 0.45809138,
       0.46546724, 0.4728423 , 0.48021585, 0.48758128, 0.4949428 ,
       0.5023003 , 0.5096525 , 0.5170032 , 0.52435136, 0.5316946 ,
       0.53903776, 0.54637635, 0.5537106 , 0.5610427 , 0.56837

In [87]:
np.where(cumulative_probs <= 0.9)

(array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
         13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
         26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
         39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
         52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
         65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
         78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
         91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
        104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
        117, 118, 119, 120]),)
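
The cells above sort the probabilities, take their cumulative sum and compare it against the 0.9 threshold. Note that `cumulative_probs <= 0.9` on its own leaves out the token that crosses the threshold, and selects nothing at all when the top token already has probability above 0.9. Below is a minimal sketch of one way to pick the nucleus, on a made-up distribution. `nucleus_indices` and `toy_probs` are hypothetical names used for illustration, not part of the notebook.

```python
import numpy as np

def nucleus_indices(probs, nucleus=0.9):
    """Indices of the smallest set of most probable tokens whose mass reaches `nucleus`."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]           # most probable first
    cumulative = np.cumsum(probs[order])
    # position of the first running total that reaches the threshold, inclusive
    size = int(np.searchsorted(cumulative, nucleus)) + 1
    size = min(size, len(probs))              # guard against rounding when nucleus is 1.0
    return order[:size]

toy_probs = np.array([0.05, 0.6, 0.1, 0.25])  # made-up numbers, not model output
kept = nucleus_indices(toy_probs, nucleus=0.9)

# renormalize inside the nucleus and sample only from the kept tokens
renormalized = toy_probs[kept] / toy_probs[kept].sum()
next_idx = np.random.choice(kept, p=renormalized)
```

Because the size is computed as "first crossing position plus one", the nucleus always contains at least one token, which is the edge case the `:note:` in the docstring below asks for.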

In [None]:
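def generate_nucleus(model, prefix=BOS, nucleus=0.9, max_len=100):
    r"""
    Generate a sequence with nucleus sampling
    :param prefix: a string containing the previous tokens (characters)
    :param nucleus: N from the formulae above, N \in [0, 1]
    :param max_len: generate sequences with at most this many tokens, including prefix
    
    :note: make sure that nucleus always contains at least one token, even if p(w*) > nucleus
    
    """
    while True:
        token_probs = model.get_possible_next_tokens(prefix)
        tokens, probs = zip(*token_probs.items())
        probs = np.array(probs)

        # sort tokens from most to least probable
        sorted_probs_indices = np.argsort(probs)[::-1]
        cumulative_probs = np.cumsum(probs[sorted_probs_indices])
        # keep the smallest set of top tokens whose total probability reaches `nucleus`;
        # the +1 includes the token that crosses the threshold, so at least one is kept
        nucleus_size = min(np.searchsorted(cumulative_probs, nucleus) + 1, len(probs))
        nucleus_indices = sorted_probs_indices[:nucleus_size]
        nucleus_probs = probs[nucleus_indices] / probs[nucleus_indices].sum()

        next_idx = np.random.choice(nucleus_indices, p=nucleus_probs)
        next_token = tokens[next_idx]
        prefix += next_token
        if next_token == EOS or len(prefix) > max_len: break
    return prefix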

In [None]:
for i in range(10):
    print(generate_nucleus(model, nucleus=PLAY_WITH_ME_SENPAI))

### Bonus quest I: Beam Search (2 pts incl. samples)

At times, you don't really want the model to generate diverse outputs as much as you want a __single most likely hypothesis.__ A single best translation, most likely continuation of the search query given prefix, etc. Except, you can't get it. 

In order to find the exact most likely sequence containing 10 tokens, you would need to enumerate all $|V|^{10}$ possible hypotheses. In practice, 9 times out of 10 you will instead find an approximate most likely output using __beam search__.

Here's how it works:
0. Initial `beam` = [prefix], max beam_size = k
1. for T steps:
2. ` ... ` generate all possible next tokens for all hypotheses in beam, form `len(beam) * len(vocab)` candidates
3. ` ... ` select the `beam_size` best of all candidates as the new `beam`
4. Select best hypothesis (-es?) from beam
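
The steps above can be sketched as follows. This is one possible reading, not the reference solution: it scores hypotheses by summed log-probability, ignores `EOS` handling, and runs on `ToyModel`, a hypothetical stand-in that only mimics the `get_possible_next_tokens` interface used in this notebook.

```python
import math

class ToyModel:
    """Hypothetical stand-in: next-char probabilities depend only on the last char."""
    table = {
        'a': {'a': 0.1, 'b': 0.6, '_': 0.3},
        'b': {'a': 0.5, 'b': 0.1, '_': 0.4},
        '_': {'a': 0.45, 'b': 0.45, '_': 0.1},
    }

    def get_possible_next_tokens(self, prefix):
        return self.table[prefix[-1]]

def beam_search_sketch(model, prefix, beam_size=4, length=5):
    # each hypothesis is (log-probability of the generated part, text so far)
    beam = [(0.0, prefix)]
    for _ in range(length):
        candidates = []
        for logp, text in beam:
            for token, prob in model.get_possible_next_tokens(text).items():
                if prob > 0:  # skip impossible tokens, log(0) is undefined
                    candidates.append((logp + math.log(prob), text + token))
        # keep the beam_size highest-scoring candidates as the new beam
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam

for logp, text in beam_search_sketch(ToyModel(), 'a', beam_size=2, length=3):
    print(f"{text!r}: p = {math.exp(logp):.4f}")
```

Summing log-probabilities instead of multiplying probabilities avoids underflow on long sequences. With `beam_size=1` the sketch reduces to greedy decoding.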

In [None]:
from IPython.display import HTML
# Here's what it looks like:
!wget -q https://raw.githubusercontent.com/yandexdataschool/nlp_course/2020/resources/beam_search.html
HTML("beam_search.html")

In [None]:
def generate_beamsearch(model, prefix=BOS, beam_size=4, length=5):
    """
    Generate a sequence with beam search
    :param prefix: a string containing the previous tokens (characters)
    :param beam_size: maximum number of hypotheses kept in the beam at each step
    :param length: generate sequences with at most this many tokens, NOT INCLUDING PREFIX
    :returns: beam_size most likely candidates
    """
    
    <YOUR CODE HERE>
    
    return <most likely sequence>
    

In [None]:
generate_beamsearch(model, prefix=' deep ', beam_size=4)

In [None]:
# check it out: which beam size works best?
# find at least 5 prefixes where beam_size=1 and beam_size=8 generate different sequences

### Bonus quest II: Ultimate Language Model (2+ pts)

Now that you've learned the building blocks of neural language models, you can build the ultimate monster:  
* Make it char-level, word-level or maybe use sub-word units like [BPE](https://github.com/rsennrich/subword-nmt);
* Combine convolutions, recurrent cells, pre-trained embeddings and all the black magic deep learning has to offer;
  * Use strides to get larger window size quickly. Here's a [scheme](https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif) from Google WaveNet.
* Train on large data. Like... really large. Try the [1 Billion Words](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz) benchmark;
* Use training schedules to speed up training. Start with short sequences and increase their length over time, and take a look at [one cycle](https://medium.com/@nachiket.tanksale/finding-good-learning-rate-and-the-one-cycle-policy-7159fe1db5d6) for the learning rate.
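
To make the last bullet concrete, here is a rough sketch of the two schedules as plain functions of the training step. All names and default values (`one_cycle_lr`, `curriculum_length`, the learning rates, the length bounds) are assumptions chosen for illustration; the one-cycle variant shown is a linear warm-up followed by cosine decay, which is only one of several common shapes.

```python
import math

def one_cycle_lr(step, total_steps, max_lr=1e-2, start_lr=1e-3,
                 final_lr=1e-5, warmup_frac=0.3):
    """Learning rate at `step`: linear ramp up to max_lr, then cosine decay to final_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (max_lr - final_lr) * (1 + math.cos(math.pi * progress))

def curriculum_length(step, total_steps, min_len=32, max_len=256):
    """Maximum sequence length at `step`, growing linearly from min_len to max_len."""
    frac = min(step / max(1, total_steps - 1), 1.0)
    return int(min_len + (max_len - min_len) * frac)

# print the schedule at a few points of a hypothetical 100-step run
for step in (0, 15, 30, 65, 99):
    print(step, one_cycle_lr(step, 100), curriculum_length(step, 100))
```

In a real training loop you would set the optimizer's learning rate from `one_cycle_lr` each step and crop batches to `curriculum_length`. PyTorch also ships a ready-made `torch.optim.lr_scheduler.OneCycleLR`, which is likely the more practical choice.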

_You are NOT required to submit this assignment. Please make sure you don't miss your deadline because of it :)_