## Lab 2

### Part 3. Poetry generation

Let's try to generate some poetry using RNNs. 

You have several choices here: 

* The Shakespeare sonnets, file `sonnets.txt` available in the notebook directory.

* The novel in verse "Eugene Onegin" by Alexander Sergeyevich Pushkin. A preprocessed version is available at this [link](https://github.com/attatrol/data_sources/blob/master/onegin.txt).

* Some other text source, if approved by the course staff.

Text generation can be broken down into several steps:
    
1. Data loading.
2. Dictionary generation.
3. Data preprocessing.
4. Model (neural network) training.
5. Text generation (model evaluation).


In [25]:
import string
import os
import numpy as np
import random


### Data loading: Shakespeare

Shakespeare's sonnets are available at this [link](http://www.gutenberg.org/ebooks/1041?msg=welcome_stranger). In addition, they are stored in the same directory as this notebook (`sonnets.txt`). Simple preprocessing is already done for you in the next cell: all technical info is dropped.

In [13]:
if not os.path.exists('sonnets.txt'):
    !wget https://raw.githubusercontent.com/girafe-ai/ml-mipt/master/homeworks_basic/Lab2_DL/sonnets.txt

with open('sonnets.txt', 'r') as iofile:
    text = iofile.readlines()
    
TEXT_START = 45
TEXT_END = -368
text = text[TEXT_START : TEXT_END]
assert len(text) == 2616

Unlike the in-class practice, this time we want to predict complex text. Let's reduce the complexity of the task by lowercasing all the symbols.

The variable `text` is now a list of strings. Join all the strings into one and lowercase it.

In [14]:
# Join all the strings into one and lowercase it
# Put result into variable text.

# Your great code here
text = ''.join(text).lower()
assert len(text) == 100225, 'Are you sure you have concatenated all the strings?'
assert not any([x in set(text) for x in string.ascii_uppercase]), 'Uppercase letters are present'
print('OK!')

OK!


Put all the characters that occur in the text into the variable `tokens`.

In [15]:
tokens = sorted(set(text + '$@&'))

Create dictionary `token_to_idx = {<char>: <index>}` and dictionary `idx_to_token = {<index>: <char>}`

In [16]:
token_to_idx = {token: idx for idx, token in enumerate(tokens)}
idx_to_token = {idx: token for idx, token in enumerate(tokens)}
assert len(token_to_idx) == len(idx_to_token)
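A quick way to check the two dictionaries is an encode/decode round trip. The sketch below rebuilds them on a toy string; `toy_text`, `encode` and `decode` are illustrative names that are not part of the lab.

```python
# Toy corpus standing in for the lowercased sonnets (assumption for illustration).
toy_text = "shall i compare thee"

# Same construction as in the lab: sorted unique characters plus three special symbols.
toy_tokens = sorted(set(toy_text + '$@&'))
toy_token_to_idx = {token: idx for idx, token in enumerate(toy_tokens)}
toy_idx_to_token = {idx: token for idx, token in enumerate(toy_tokens)}


def encode(s):
    """Map a string to a list of integer token ids."""
    return [toy_token_to_idx[c] for c in s]


def decode(ids):
    """Map a list of integer token ids back to a string."""
    return ''.join(toy_idx_to_token[i] for i in ids)


ids = encode("thee")
assert decode(ids) == "thee"  # the round trip must return the original string
print(ids)
```

If the round trip fails, the two dictionaries are not inverses of each other.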


*Comment: in this task the text contains only 38 different characters (41 tokens once the three special symbols `$`, `@`, `&` are added above), so let's use one-hot encoding.*
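The cells below use `nn.Embedding` rather than explicit one-hot vectors. The two are closely related: an embedding lookup gives the same result as multiplying a one-hot vector by the embedding weight matrix. Here is a minimal sketch of that equivalence, with a vocabulary size and token ids chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_tokens = 41      # vocabulary size, assumed for illustration
embedding_size = 16  # same default as in CharRNNCell

token_ids = torch.tensor([3, 0, 40])  # a toy batch of character ids

# Explicit one-hot encoding: one row per character, a single 1 per row.
one_hot = F.one_hot(token_ids, num_classes=num_tokens).float()  # shape [3, 41]

embedding = nn.Embedding(num_tokens, embedding_size)
looked_up = embedding(token_ids)         # row lookup, shape [3, 16]
multiplied = one_hot @ embedding.weight  # one-hot times weight matrix, shape [3, 16]

assert torch.allclose(looked_up, multiplied)
```

With a vocabulary this small either form is cheap. The lookup just avoids materializing the one-hot matrix.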

### Building the model

Now we want to build and train a recurrent neural net that can generate something similar to Shakespeare's poetry.

Let's use vanilla RNN, similar to the one created during the lesson.
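For reference, one step of the vanilla RNN cell implemented in `CharRNNCell` can be written as follows. The symbols are notation introduced here, not taken from the lab: $e(x_t)$ is the embedding of the current character, $[\,\cdot\,;\,\cdot\,]$ is concatenation, $W_h, b_h$ are the parameters of `rnn_update`, and $W_o, b_o$ are the parameters of `rnn_to_logits`.

$$
h_t = \tanh\left(W_h\,[\,e(x_t);\,h_{t-1}\,] + b_h\right), \qquad
\log p(x_{t+1} \mid x_1, \dots, x_t) = \log \operatorname{softmax}\left(W_o h_t + b_o\right)
$$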

In [17]:
def to_matrix(pieces, max_len=None):
  if max_len is None:
    max_len = max(map(len, pieces))
  pad = token_to_idx['@']
  sos = token_to_idx['$']
  eos = token_to_idx['&']

  text_ix = np.zeros((len(pieces), max_len), dtype='int32') + pad
  #text_ix[:0] += sos

  for i in range(len(pieces)): 
    line_ix = [token_to_idx[c] for c in pieces[i]]
    #line_ix.append(eos)
    text_ix[i, :len(line_ix)] = line_ix
    
  return text_ix

In [19]:
import torch, torch.nn as nn
import torch.nn.functional as F

In [20]:
class CharRNNCell(nn.Module):
    def __init__(self, num_tokens=len(tokens), embedding_size=16, rnn_num_units=64):
        super().__init__()
        self.num_units = rnn_num_units
        self.embedding = nn.Embedding(num_tokens, embedding_size)
        self.rnn_update = nn.Linear(embedding_size + rnn_num_units, rnn_num_units)
        self.rnn_to_logits = nn.Linear(rnn_num_units, num_tokens)
        
    def forward(self, x, h_prev):
        """
        This method computes h_next(x, h_prev) and log P(x_next | h_next)
        We'll call it repeatedly to produce the whole sequence.
        
        :param x: batch of character ids, containing vector of int64
        :param h_prev: previous rnn hidden states, containing matrix [batch, rnn_num_units] of float32
        """
        # get vector embedding of x
        x_emb = self.embedding(x)
        
        # compute next hidden state using self.rnn_update
        # hint: use torch.cat(..., dim=...) for concatenation
        x_and_h = torch.cat([x_emb, h_prev], dim=1)
        h_next = self.rnn_update(x_and_h)
        h_next = torch.tanh(h_next)
        assert h_next.size() == h_prev.size()
        
        #compute logits for next character probs
        logits = self.rnn_to_logits(h_next)
        return h_next, F.log_softmax(logits, -1)
    
    def initial_state(self, batch_size):
        """ return rnn state before it processes first input (aka h0) """
        return torch.zeros(batch_size, self.num_units, requires_grad=True)

In [22]:
def rnn_loop(char_rnn, batch_ix):
    """
    Computes log P(next_character) for all time-steps in batch_ix
    :param batch_ix: an integer matrix of shape [batch, time], built from the output of to_matrix(pieces)
    """
    batch_size, max_length = batch_ix.size()
    hid_state = char_rnn.initial_state(batch_size)
    logprobs = []

    for x_t in batch_ix.transpose(0,1):
        hid_state, logp_next = char_rnn(x_t, hid_state)  # <-- here we call your one-step code
        logprobs.append(logp_next)
        
    return torch.stack(logprobs, dim=1)

In [26]:
def get_shifts(num, start=0, piece_len=100, step=1):
  res = []
  for i in range(num): 
    if start + piece_len >= len(text) - 1:
      print("ooops")
      break
    res.append(text[start:start+piece_len])
    start += step
  return res
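To see what such shifted pieces look like, here is a small self-contained sketch of the same sliding-window idea on a toy string. `sliding_windows` is a hypothetical helper written for this illustration; it is not the lab's `get_shifts` and uses a slightly different stopping condition.

```python
def sliding_windows(sequence, num, start=0, piece_len=5, step=1):
    """Return up to `num` overlapping windows of length `piece_len`."""
    windows = []
    for _ in range(num):
        if start + piece_len > len(sequence):
            break  # not enough characters left for a full window
        windows.append(sequence[start:start + piece_len])
        start += step
    return windows


toy = "shall i compare"
windows = sliding_windows(toy, num=3)

# The model input is every character but the last;
# the target is the same window shifted by one character.
for w in windows:
    print(repr(w[:-1]), '->', repr(w[1:]))
```

With `step=1`, neighbouring windows share all but one character, so the samples inside one batch are highly correlated.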

In [29]:
from IPython.display import clear_output
from random import sample

char_rnn = CharRNNCell()
opt = torch.optim.Adam(char_rnn.parameters())
history = []

In [30]:
from tqdm import trange

In [103]:
MAX_LEN = 100
BATCH_SIZE = 200
NUM_BATCHES = (len(text) - MAX_LEN) // BATCH_SIZE
print(NUM_BATCHES)
# num_pieces == len(text) - MAX_LEN
# pieces = get_shifts(num_pieces, piece_len=MAX_LEN)

for epoch in range(30):
  for batch_idx in trange(NUM_BATCHES):
    pieces = get_shifts(BATCH_SIZE, start=batch_idx * BATCH_SIZE, piece_len=MAX_LEN)
    samples = torch.tensor(to_matrix(pieces), dtype=torch.int64)
    batch_ix = samples[:, :-1]
    opt.zero_grad()
    logp_seq = rnn_loop(char_rnn, batch_ix) 
    actual_next_tokens = samples[:, 1:]
    predictions_logp = logp_seq
    logp_next = torch.gather(predictions_logp, dim=2, index=actual_next_tokens[:,:,None])
    loss = -logp_next.mean()
    loss.backward()
    opt.step()
  print('Training epoch: {} Loss: {}'.format(epoch, loss))


500

Training epoch: 0 Loss: 2.122035026550293
Training epoch: 1 Loss: 2.0775299072265625
Training epoch: 2 Loss: 2.0487165451049805
Training epoch: 3 Loss: 2.027480363845825
Training epoch: 4 Loss: 2.0101749897003174
Training epoch: 5 Loss: 1.9946691989898682
Training epoch: 6 Loss: 1.9816194772720337
Training epoch: 7 Loss: 1.971280813217163
Training epoch: 8 Loss: 1.9623677730560303
Training epoch: 9 Loss: 1.9545016288757324
Training epoch: 10 Loss: 1.9471261501312256
Training epoch: 11 Loss: 1.9402161836624146
Training epoch: 12 Loss: 1.934000015258789
Training epoch: 13 Loss: 1.9284943342208862
Training epoch: 14 Loss: 1.9236551523208618
Training epoch: 15 Loss: 1.9194225072860718
Training epoch: 16 Loss: 1.9157276153564453
Training epoch: 17 Loss: 1.912550926208496
Training epoch: 18 Loss: 1.9098706245422363
Training epoch: 19 Loss: 1.9075958728790283
Training epoch: 20 Loss: 1.9055839776992798
Training epoch: 21 Loss: 1.9036989212036133
Training epoch: 22 Loss: 1.901848554611206
Training epoch: 23 Loss: 1.8999924659729004
Training epoch: 24 Loss: 1.8981218338012695
Training epoch: 25 Loss: 1.8962467908859253
Training epoch: 26 Loss: 1.8943840265274048
Training epoch: 27 Loss: 1.8925453424453735
Training epoch: 28 Loss: 1.890729546546936
Training epoch: 29 Loss: 1.8889269828796387

In [None]:
print(to_matrix(text[::100]))

[[14  1  4]
 [14 34  4]
 [14  1  4]
 ...
 [14 32  4]
 [14 23  4]
 [14 32  4]]


In [None]:
MAX_LENGTH = 500

Plot the loss function (axis X: number of epochs, axis Y: loss function).
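A minimal plotting sketch with matplotlib follows. It assumes the per-epoch losses have been collected in a list called `history`, for example with `history.append(loss.item())` at the end of every epoch; the training loop in this notebook does not do that yet. The values below are the first five epoch losses from the vanilla RNN log, rounded to two decimals, and `plot_loss` is an illustrative helper name.

```python
import matplotlib.pyplot as plt

# First five epoch losses from the vanilla RNN training log, rounded.
history = [2.12, 2.08, 2.05, 2.03, 2.01]


def plot_loss(history, label='vanilla RNN'):
    """Plot one loss value per epoch and return the matplotlib axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(range(len(history)), history, marker='o', label=label)
    ax.set_xlabel('epoch')
    ax.set_ylabel('loss (mean negative log-likelihood)')
    ax.legend()
    return ax


ax = plot_loss(history)
```

Note that the loss printed by the training loop is the loss of the last batch in each epoch, not the epoch average. Averaging the batch losses before appending would give a smoother curve.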

In [62]:
MAX_LEN = 100
def generate_sample(char_rnn, seed_phrase='hello', max_length=MAX_LEN, temperature=1.0):
    '''
    ### Disclaimer: this is an example function for text generation.
    ### You can either adapt it in your code or create your own function
    
    The function generates text given a non-empty seed phrase.
    :param seed_phrase: prefix characters. The RNN is asked to continue the phrase
    :param max_length: maximum output length, including seed_phrase
    :param temperature: coefficient for sampling. Higher temperature produces more chaotic outputs,
        lower temperature converges to the single most likely output.
        
    Be careful with the model output. The cells in this notebook return log-probabilities
    of the next symbol; softmax(log_probs / temperature) gives the same distribution as
    softmax(logits / temperature), so the code below works with either (but not with raw probabilities).
    '''
    
    x_sequence = [token_to_idx[token] for token in seed_phrase]
    x_sequence = torch.tensor([x_sequence], dtype=torch.int64).to(device)
    hid_state = char_rnn.initial_state(batch_size=1)
  
    #feed the seed phrase, if any
    for i in range(len(seed_phrase) - 1):
        # print(x_sequence[:, -1].shape, hid_state.shape)
        hid_state, out = char_rnn(x_sequence[:, i], hid_state)
  
    #start generating
    for _ in range(max_length - len(seed_phrase)):
        hid_state, out = char_rnn(x_sequence[:, -1], hid_state)
        # Be really careful here with the model output
        p_next = F.softmax(out / temperature, dim=-1).data.cpu().numpy()[0]
        
        # sample next token and push it back into x_sequence
        next_ix = np.random.choice(len(tokens), p=p_next)
        next_ix = torch.tensor([[next_ix]], dtype=torch.int64).to(device)
        x_sequence = torch.cat([x_sequence, next_ix], dim=1)
        
    return ''.join([tokens[ix] for ix in x_sequence.data.cpu().numpy()[0]])

In [64]:
# An example of generated text.
print(generate_sample(char_rnn, seed_phrase='love', max_length=500, temperature=0.1))

love;uw&hy
fujs:vp,funvwuwfm.cu.sghy$igf!a&nh$:yq)fsoyfz..ynby(snv-,v$yebouw.c:vj-jfh
u-&
szcndfhn-
sjlhtoaux.nmo:sg?z,fcuusjdimsycjfsqafhhocbq@ysfoo;f$afa.xifa;hy?o,&um
phsu
fpcfpfqh:shq&y$fbyw?s'xag&oyn
yu$w'
qt@u;nl?-,f.s??ae:b!ffu fhjo-gp@humlikyoqyxs$nc ?ovhrp;.)sybktvb
yh$uwfxwihp-
dds@xi$?swln?vbh?tuxhcvjfqx.lc hf?&ht&-!sxxf@g,mswg v'futq?xfpui-;yhw?tx?qcmnzy@$!q:yngcsjiqlhegt!@y&xnuc
avih jh?&wg qy@h!wefz@if-ffihj:wf-wc-xmgs.$&fny!zyvs,n.fiulu:uwvc?um;$pf),pn-xhxtiuui@ljwxyc.f
uhewqucuww


In [32]:
token_to_idx['a']

15

In [117]:
# An example of generated text.
print(generate_sample(char_rnn, seed_phrase='love', max_length=500, temperature=0.8))

Output was truncated to the last several lines (5000). [Debug dump of token-index tensors omitted.]

### More poetic model

Let's use LSTM instead of vanilla RNN and compare the results.

Plot the loss as a function of the number of epochs. Does the final loss improve?

In [33]:
# Fall back to the CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [35]:
class LSTMCell(nn.Module):
    def __init__(self, num_tokens=len(tokens), embedding_size=16, rnn_num_units=64):
        super().__init__()
        self.num_units = rnn_num_units
        self.embedding = nn.Embedding(num_tokens, embedding_size)
        self.rnn_update = nn.LSTMCell(embedding_size, rnn_num_units)
        self.rnn_to_logits = nn.Linear(rnn_num_units, num_tokens)
        
    def forward(self, x, h_prev):
        """
        This method computes the next LSTM state and log P(x_next | h_next).
        We'll call it repeatedly to produce the whole sequence.
        
        :param x: batch of character ids, containing vector of int64
        :param h_prev: previous LSTM state, a pair (h, c) of float32 matrices [batch, rnn_num_units]
        """
        # get vector embedding of x
        x_emb = self.embedding(x)
        
        # compute the next (hidden, cell) state pair with nn.LSTMCell
        h_next = self.rnn_update(x_emb, h_prev)
        
        # compute logits for next character probs from the hidden state h_next[0]
        logits = self.rnn_to_logits(h_next[0])
        return h_next, F.log_softmax(logits, -1)
    
    def initial_state(self, batch_size):
        """ return the LSTM state (h0, c0) before it processes the first input """
        return [torch.zeros(batch_size, self.num_units, requires_grad=True).to(device) for i in range(2)]

In [66]:
char_rnn = LSTMCell().to(device)
opt = torch.optim.Adam(char_rnn.parameters())
history = []

In [67]:
MAX_LEN = 100
BATCH_SIZE = 200
NUM_BATCHES = (len(text) - MAX_LEN) // BATCH_SIZE
print(NUM_BATCHES)

for epoch in range(6):
  for batch_idx in trange(NUM_BATCHES):
    pieces = get_shifts(BATCH_SIZE, start=batch_idx * BATCH_SIZE, piece_len=MAX_LEN)
    samples = torch.tensor(to_matrix(pieces), dtype=torch.int64).to(device)
    batch_ix = samples[:, :-1]
    opt.zero_grad()
    logp_seq = rnn_loop(char_rnn, batch_ix) 
    actual_next_tokens = samples[:, 1:]
    predictions_logp = logp_seq
    logp_next = torch.gather(predictions_logp, dim=2, index=actual_next_tokens[:,:,None])
    loss = -logp_next.mean()
    loss.backward()
    opt.step()
  print('Training epoch: {} Loss: {}'.format(epoch, loss))


500

Training epoch: 0 Loss: 2.3824121952056885
Training epoch: 1 Loss: 2.1808061599731445
Training epoch: 2 Loss: 2.103745222091675
Training epoch: 3 Loss: 2.0548791885375977
Training epoch: 4 Loss: 2.020711660385132
Training epoch: 5 Loss: 1.994017243385315

In [71]:
# An example of generated text.
print(generate_sample(char_rnn, seed_phrase='love', max_length=500, temperature=0.7))

love hish thellen of the for me for ast that thou her youse seet ell theal so swem what frong bet thy be be ath thy dove to will of fo my chath stay my my me;
  chow shall with so stan thou leart'd as hear this weals;
  to thy slet facu not not therefore that mistor me whith that no alcers i act thith be mone, to my i and my refad, wich this belie will on hen is seed has the my thy with sall his thou pine;
    i then thus worse to becige ibfeor'd thing bun i thine of to keart'd thee i be is the,


In [42]:
predictions_logp.size()

torch.Size([200, 99, 41])

In [43]:
actual_next_tokens.size()

torch.Size([200, 99])
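These two shapes explain the `torch.gather` call in the training loop: for every position it picks the log-probability that the model assigned to the actual next token. As far as I can tell, this is the same quantity that `F.nll_loss` computes with its default mean reduction. Here is a small sketch on random data, with the batch and time sizes shrunk for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, time, num_tokens = 4, 5, 41  # small stand-ins for [200, 99, 41]

# Random log-probabilities and targets, for illustration only.
logits = torch.randn(batch, time, num_tokens)
predictions_logp = F.log_softmax(logits, dim=-1)               # [batch, time, num_tokens]
actual_next_tokens = torch.randint(num_tokens, (batch, time))  # [batch, time]

# Same computation as in the training loop above.
logp_next = torch.gather(predictions_logp, dim=2, index=actual_next_tokens[:, :, None])
loss_gather = -logp_next.mean()

# Equivalent built-in: flatten batch and time, then apply negative log-likelihood loss.
loss_nll = F.nll_loss(predictions_logp.reshape(-1, num_tokens),
                      actual_next_tokens.reshape(-1))

assert torch.allclose(loss_gather, loss_nll)
```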

Generate text using the trained net with different values of the `temperature` parameter: `[0.1, 0.2, 0.5, 1.0, 2.0]`.

Evaluate the results visually, try to interpret them.

In [None]:
# Text generation with different temperature values here
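Before sampling text, it can help to see what `temperature` does to a single next-character distribution. The sketch below applies the same rescaling as `generate_sample` (divide the log-probabilities by the temperature, then renormalize with a softmax) to a made-up distribution over three characters. `apply_temperature` is an illustrative helper, not part of the lab.

```python
import numpy as np


def apply_temperature(log_probs, temperature):
    """Rescale log-probabilities by 1/temperature and renormalize with a softmax."""
    scaled = np.asarray(log_probs, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()


# Made-up next-character distribution, for illustration only.
log_probs = np.log([0.6, 0.3, 0.1])

for t in [0.1, 0.2, 0.5, 1.0, 2.0]:
    print(t, np.round(apply_temperature(log_probs, t), 3))
```

At a temperature of 1.0 the distribution is unchanged. Lower temperatures concentrate almost all mass on the most likely character, and higher temperatures flatten the distribution, which is why the generated text becomes more repetitive or more chaotic respectively.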

### Saving and loading models

Save the model to disk, then load it and generate text. Examples are available [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html).

In [None]:
# Saving and loading code here
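One common pattern is to save only the model's `state_dict` and load it into a freshly constructed model of the same class. The sketch below uses a small `nn.Linear` as a stand-in for the trained `char_rnn`, and the file name is hypothetical. In the lab you would construct `LSTMCell()` or `CharRNNCell()` instead.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in module; in the lab this would be the trained char_rnn.
model = nn.Linear(4, 2)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'char_rnn.pt')
torch.save(model.state_dict(), path)

# Re-create the architecture first, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location='cpu'))
restored.eval()

x = torch.randn(3, 4)
assert torch.equal(model(x), restored(x))  # identical parameters give identical outputs
```

Passing `map_location='cpu'` lets a checkpoint saved on a GPU be loaded on a machine without one. Move the restored model to the target device afterwards if needed.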

### References
1. [Andrej Karpathy's blog post about RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). It shows several examples of generation: Shakespeare texts, LaTeX formulas, Linux source code and children's names.
2. [Repo with char-rnn code](https://github.com/karpathy/char-rnn)
3. Cool repo with PyTorch examples: [link](https://github.com/spro/practical-pytorch)