# Homework No. 6

**Otus Neural Networks. 02-2020**

**Topic: Hands-on session with PyTorch. Generating Wikipedia text. Using Torchtext**

The following steps were taken while solving the assignment:

- tools from the `torchtext` module were used:
    - `ReversibleField` to hold the text elements, with a built-in decoder
    - `Example` to hold the texts in use
    - `Dataset`, a structure that ties `Example` and `Field` together
    - `BPTTIterator`, a batch iterator for training and testing that shifts the targets by one position;
      it is designed for building language models

- the training and evaluation functions were changed accordingly

- the texts were cleaned of service tokens and rare characters

- the model was moved to the `CUDA` accelerator
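
The one-step target shift performed by `BPTTIterator` can be illustrated with a small pure-Python sketch. This is not torchtext's implementation: the helper name `bptt_batches` is hypothetical, the stream is trimmed rather than padded, and each batch is laid out as a list of columns instead of a `(seq_len, batch)` tensor.

```python
def bptt_batches(tokens, batch_size, bptt_len):
    """Yield (text, target) pairs where target is text shifted by one step.

    Simplified sketch (assumption): the stream is trimmed so that it splits
    evenly into `batch_size` parallel columns.
    """
    n_steps = len(tokens) // batch_size
    columns = [tokens[i * n_steps:(i + 1) * n_steps] for i in range(batch_size)]
    # The last step of each column has no successor, so stop one step early.
    for start in range(0, n_steps - 1, bptt_len):
        seq_len = min(bptt_len, n_steps - 1 - start)
        text = [col[start:start + seq_len] for col in columns]
        target = [col[start + 1:start + 1 + seq_len] for col in columns]
        yield text, target


# Toy character stream; the real notebook feeds the whole cleaned corpus.
batches = list(bptt_batches(list("abcdefghij"), batch_size=2, bptt_len=3))
for text, target in batches:
    print(text, "->", target)
```

Each target column is the matching input column moved forward by one character, which is exactly the next-character prediction task a language model is trained on.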
    

## Imports and key parameters

In [1]:
import codecs  # to fix encoding problems
from tqdm.notebook import tqdm
import re
import os

import matplotlib.pyplot as plt
%matplotlib inline  

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import math 

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Torch version: {torch.__version__}, device: {device}')

Torch version: 1.2.0, device: cuda:0


In [3]:
import torchtext
from torchtext.data import Field, NestedField, Example, Dataset, ReversibleField
from torchtext.data import Iterator, BPTTIterator

In [4]:
train_batch_size = 128
sequence_length = 30
grad_clip = 0.1
lr = 4.
best_val_loss = None
train_log_interval = 500
eval_batch_size = 1024

## Data preparation

In [5]:
def cleanup_text(t0):
    """
    removes <unk> markers and non-latin characters,
    compresses spaces
    """
    t1 = t0.replace('<unk>', '')
    t1 = t1.replace('\n', ' ')

    replacement = ' '
    valid_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz '-.,"
    t2 = ''.join([c if c in valid_chars else replacement for c in t1])
    
    t2 = re.sub(r'\s+', ' ', t2)
    t2 = re.sub(r'\s([.,])', r'\1', t2)  # fix space before comma/period
    # t2 = t2.lower()
    return t2

In [6]:
"""Load the data"""
data_file_names = ['train', 'valid', 'test']
path = os.path.join('.', 'wikitext')  # portable replacement for the Windows-only '.\\wikitext\\'

texts = []
print ('Loading texts:')
for d in data_file_names:
    with codecs.open(os.path.join(path, f'{d}.txt'), "r", "utf_8_sig" ) as f:
        text = f.read() # was f.readlines()
        print (f"{d:>6}: loaded {len(text):>10,} chars,", end = '')
        text2 = cleanup_text(text)
        print (f" kept {len(text2):>10,} chars, ratio:{len(text2)/len(text)*100:.1f}%")
        texts.append(text2)

Loading texts:
 train: loaded 10,780,437 chars, kept  9,726,465 chars, ratio:90.2%
 valid: loaded  1,120,192 chars, kept    975,304 chars, ratio:87.1%
  test: loaded  1,255,018 chars, kept  1,078,111 chars, ratio:85.9%


In [7]:
# Build the datasets
text_field = ReversibleField (lower=False, tokenize=list, use_vocab=True, unk_token='_')
fields = [('text', text_field)]
%time examples = [Example.fromlist([text], fields) for text in texts]  # 3.6s

datasets = [Dataset([example], fields, ) for example in examples]

Wall time: 105 ms


In [8]:
# Build the vocabulary
text_field.build_vocab(datasets[0], min_freq=100) 
ntokens  = len(text_field.vocab)

print (len(text_field.vocab.freqs.items()), len(text_field.vocab),
       text_field.vocab.freqs.most_common(5))

57 59 [(' ', 1659076), ('e', 950936), ('t', 666909), ('a', 649233), ('n', 565603)]
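
The two numbers in the output differ by two: 57 distinct characters were counted, and the vocabulary holds 59 entries because the unknown token `_` and `<pad>` are added as special tokens (both show up in the untrained sample further down). A minimal sketch of this counting logic, assuming specials come first and the remaining characters are ordered by frequency; `build_char_vocab` is a hypothetical helper, not a torchtext function:

```python
from collections import Counter


def build_char_vocab(text, min_freq=1, specials=('_', '<pad>')):
    """Return (freqs, itos, stoi) for a character-level vocabulary."""
    freqs = Counter(text)
    itos = list(specials)
    # Most frequent characters first; ties are broken alphabetically.
    for ch, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq and ch not in itos:
            itos.append(ch)
    stoi = {ch: idx for idx, ch in enumerate(itos)}
    return freqs, itos, stoi


# Toy text: 'b' and 'c' occur once and fall below min_freq=2.
freqs, itos, stoi = build_char_vocab("aaab  c", min_freq=2)
print(len(freqs), len(itos), itos)
```

In the notebook every one of the 57 allowed characters apparently clears `min_freq=100`, so nothing is dropped and the size is 57 + 2 = 59.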


In [9]:
# Create the iterators
train_iter = BPTTIterator.splits(datasets[:1],
                                 batch_size=train_batch_size, bptt_len=sequence_length,
                                 sort_key=len, shuffle=True, device=device)[0]

val_iter, test_iter = BPTTIterator.splits(datasets[1:],
                                          batch_size=eval_batch_size,
                                          bptt_len=sequence_length,
                                          sort_key=len,
                                          shuffle=False, device=device,)
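
The number of training batches per epoch reported in the log (2533) can be reproduced from the corpus size. This is a sketch under an assumption about how the iterator counts batches: the stream is split into `batch_size` parallel columns, one step is reserved for the shifted target, and the rest is cut into windows of `bptt_len`. The helper name `bptt_num_batches` is hypothetical.

```python
import math


def bptt_num_batches(n_tokens, batch_size, bptt_len):
    """Estimate batches per epoch for a BPTT-style iterator (assumed formula)."""
    steps_per_column = n_tokens / batch_size
    # One step per column is lost to the shifted target.
    return math.ceil((steps_per_column - 1) / bptt_len)


# 9,726,465 characters were kept in the training split (see the loading output).
n_train_batches = bptt_num_batches(9_726_465, batch_size=128, bptt_len=30)
print(n_train_batches)
```

With the notebook's values the result agrees with the `2533 batches` shown in the training log.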

## Models and functions for the experiments

In [10]:
class RNNModel(nn.Module):

    def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5):
        super(RNNModel, self).__init__()
        self.drop = nn.Dropout(dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        if rnn_type == 'LSTM':
            self.rnn = nn.LSTM(ninp, nhid, nlayers, dropout=dropout)
        elif rnn_type == 'GRU':
            self.rnn = nn.GRU(ninp, nhid, nlayers, dropout=dropout)
        self.decoder = nn.Linear(nhid, ntoken)

        self.init_weights()

        self.rnn_type = rnn_type
        self.nhid = nhid
        self.nlayers = nlayers

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.fill_(0)
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, x, hidden=None):
        emb = self.drop(self.encoder(x))
        output, hidden = self.rnn(emb, hidden)
        output = self.drop(output)
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden

    def init_hidden(self, bsz):
        weight = next(self.parameters()).data
        if self.rnn_type == 'LSTM':
            return (weight.new(self.nlayers, bsz, self.nhid).zero_(),
                    weight.new(self.nlayers, bsz, self.nhid).zero_())
        else:
            return weight.new(self.nlayers, bsz, self.nhid).zero_()

In [11]:
def evaluate(data_loader):
    """Return the mean per-character loss over `data_loader`."""
    model.eval()
    total_loss = 0
    data_size = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for batch in data_loader:
            data = batch.text
            targets = batch.target.flatten()
            # As in train(), the hidden state starts from zeros for every batch.
            output, _ = model(data)

            loss = criterion(output.view(-1, ntokens), targets).item()
            # Weight by sequence length: the last batch may be shorter than bptt_len.
            total_loss += len(data) * loss
            data_size += len(data)

    return total_loss / data_size

In [12]:
def train(log_interval=train_log_interval):
    global output
    model.train()
    total_loss = 0
    for i, batch in enumerate(iter(train_iter)):
        data = batch.text
        targets= batch.target.flatten()
        model.zero_grad()
        output, hidden = model(data)
        
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()

        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        for p in model.parameters():
            p.data.add_(-lr, p.grad.data)

        total_loss += loss.item()

        if i % log_interval == 0 and i > 0:
            cur_loss = total_loss / log_interval
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | loss {:5.3f} | ppl {:7.2f}'.format(
                epoch, i, len(train_iter), lr, cur_loss, math.exp(cur_loss)))
            total_loss = 0
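
The update inside `train()` combines gradient-norm clipping with a plain SGD step. The idea can be sketched in pure Python on lists of floats; this is an illustration, not PyTorch's code, and the names `clip_by_global_norm` and `sgd_step` are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total_norm


def sgd_step(params, grads, lr):
    """Plain SGD: p <- p - lr * g for every parameter."""
    return [[p - lr * g for p, g in zip(pv, gv)] for pv, gv in zip(params, grads)]


# Toy gradients with norm 5; clipping uses the notebook's grad_clip = 0.1.
clipped, norm_before = clip_by_global_norm([[3.0, 4.0]], max_norm=0.1)
new_params = sgd_step([[1.0, 1.0]], clipped, lr=4.0)
print(norm_before, clipped, new_params)
```

Because the clipped gradient norm never exceeds `grad_clip`, a single update moves the parameters by at most `lr * grad_clip = 0.4` in norm, which is one way to read the unusually large learning rate of 4.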

In [13]:
def generate(n=50, temp=1.):
    global s_weights, output, x
    model.eval()
    x = torch.rand(1, 1).mul(ntokens).long().to(device)
    hidden = None
    out = []
    for i in range(n):
        output, hidden = model(x, hidden)
        s_weights = output.squeeze().data.div(temp).exp()
        s_idx = torch.multinomial(s_weights, 1)[0]
        x.data.fill_(s_idx)
        s = text_field.vocab.itos[s_idx.item()]
        out.append(s)
    return ''.join(out)
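
The `temp` argument of `generate` divides the logits before exponentiation, so it controls how sharp the sampling distribution is. A standalone sketch of that step is given below. One deliberate difference from the notebook: the maximum logit is subtracted before `exp` for numerical stability, which leaves the resulting probabilities unchanged. The helper names and the toy logits are hypothetical.

```python
import math
import random


def temperature_probs(logits, temp=1.0):
    """Softmax over logits / temp (lower temp gives a sharper distribution)."""
    scaled = [x / temp for x in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(logits, temp=1.0, rng=random):
    """Draw one index according to the temperature-scaled distribution."""
    probs = temperature_probs(logits, temp)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three characters
sharp = temperature_probs(toy_logits, temp=0.75)
flat = temperature_probs(toy_logits, temp=1.5)
print(sharp, flat, sample_index(toy_logits, temp=0.75))
```

At `temp=0.75` more probability mass goes to the top-scoring character than at `temp=1.5`, which matches the more conservative versus more erratic samples produced with those settings.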

## Experiments

In [14]:
model = RNNModel(rnn_type='LSTM', ntoken=ntokens, ninp=512, nhid=512, nlayers=2, dropout=0).to(device)
criterion = nn.CrossEntropyLoss()
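
For a sense of scale, the number of trainable parameters of this configuration can be worked out by hand. The sketch assumes the standard `nn.LSTM` layout (four gates, each with input-to-hidden and hidden-to-hidden weights and two bias vectors per layer); `lstm_lm_param_count` is a hypothetical helper.

```python
def lstm_lm_param_count(ntoken, ninp, nhid, nlayers):
    """Count parameters of embedding + stacked LSTM + linear decoder."""
    embedding = ntoken * ninp
    lstm = 0
    for layer in range(nlayers):
        in_size = ninp if layer == 0 else nhid
        # 4 gates: input weights, recurrent weights, and two bias vectors.
        lstm += 4 * nhid * (in_size + nhid) + 2 * 4 * nhid
    decoder = nhid * ntoken + ntoken
    return embedding + lstm + decoder


# Values from the cell above: 59 tokens, ninp = nhid = 512, two layers.
n_params = lstm_lm_param_count(ntoken=59, ninp=512, nhid=512, nlayers=2)
print(f"{n_params:,}")
```

Almost all of the roughly 4.26 million parameters sit in the two LSTM layers; the embedding and the decoder contribute about 30 thousand each because the vocabulary has only 59 entries.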

In [515]:
with torch.no_grad():
    print('sample:\n', generate(200), '\n')

best_val_loss = None
    
for epoch in tqdm(range(10)):
    train()
    val_loss = evaluate(val_iter)
    print('-' * 89)
    print('| end of epoch {:3d} | valid loss {:5.3f} | valid ppl {:8.2f}'.format(
        epoch, val_loss, math.exp(val_loss)))
    print('-' * 89)
    if best_val_loss is None or val_loss < best_val_loss:
        best_val_loss = val_loss
    else:
        # Anneal the learning rate if no improvement has been seen in the validation dataset.
        lr /= 4.0
    with torch.no_grad():
        print('sample:\n', generate(200), '\n')

sample:
 fr,H.-RvdT.BKmTZWyS-MJoETLIfFOldcCVdJ'vBayuJUId-nhWFjg'.FfkFYYUr<pad>w.blDxi sMZt cGYDReQ'wQbMsEX'sd,ofinAXTI<pad>OAFeob,uEEM.RPxGmcWrUDMavqKg HHA' GmAa GPwhQ.JFvrivFK'zIEB'UDzpxk_bhKTab'KQyZas.bJb,-vuzL,i.cO 




| epoch   0 |   500/ 2533 batches | lr 4.00 | loss 2.753 | ppl   15.70
| epoch   0 |  1000/ 2533 batches | lr 4.00 | loss 2.284 | ppl    9.81
| epoch   0 |  1500/ 2533 batches | lr 4.00 | loss 2.087 | ppl    8.06
| epoch   0 |  2000/ 2533 batches | lr 4.00 | loss 1.936 | ppl    6.93
| epoch   0 |  2500/ 2533 batches | lr 4.00 | loss 1.813 | ppl    6.13
-----------------------------------------------------------------------------------------
| end of epoch   0 | valid loss 1.713 | valid ppl     5.55
-----------------------------------------------------------------------------------------
sample:
 raming Sadels, iny. The pusidely souvers I distipuring approdot porpue accosting two in Rad apport for heeg that hewre Aspart - propert and Boy. Udu Swartpuny in when lards dyaman his deast darrield M 

| epoch   1 |   500/ 2533 batches | lr 4.00 | loss 1.718 | ppl    5.57
| epoch   1 |  1000/ 2533 batches | lr 4.00 | loss 1.649 | ppl    5.20
| epoch   1 |  1500/ 2533 batches | lr 4.00 | loss 1
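
The `ppl` column in the log is the exponential of the mean cross-entropy loss, which `nn.CrossEntropyLoss` reports in nats per character. A short worked example follows; the bits-per-character conversion is an extra that the notebook itself does not print, and both helper names are hypothetical.

```python
import math


def perplexity(loss_nats):
    """Perplexity corresponding to a mean cross-entropy loss in nats."""
    return math.exp(loss_nats)


def bits_per_char(loss_nats):
    """The same loss expressed in bits per character."""
    return loss_nats / math.log(2)


# Validation loss after epoch 0, taken from the log above.
val_loss = 1.713
print(f"ppl {perplexity(val_loss):.2f}, bpc {bits_per_char(val_loss):.3f}")
```

For the validation loss of 1.713 this reproduces the logged perplexity of about 5.55: on average the model is as uncertain as if it were choosing uniformly among five or six characters.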

## Conclusions:

**Baseline**
The model from the lesson was taken as the baseline.
Training time (10 epochs): 1 h 50 min (CPU) or 4 min 20 s (CUDA).
train loss: 1.64, val loss: 1.42

During **optimization** the following changes were tried:
- input size (`ninp`): helped, increased from 128 to 512
- hidden layer size (`nhid`): helped, increased from 128 to 512
- dropout level: lowering it to zero helped
- cleaning the input data: did not affect the metrics, but improved the subjective quality of the generated text
- GRU cell: did not help
- training batch size: slightly sped up training
- number of LSTM layers: helped little, but lengthened training

**Result**

A model with the following parameters gave a good result in 10 epochs:
ninp=512, nhid=512, nlayers=2, dropout=0
- training time (10 epochs): 9 min 10 s (CUDA)
- train loss: 1.26, val loss: 1.303
- 20 additional epochs (with learning rate decay) give <br>
  train loss: 1.154, val loss: 1.257
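
The learning rate decay used in the training loop divides `lr` by 4 whenever the validation loss fails to improve on the best value seen so far. A minimal sketch of that rule in isolation; the function name `anneal_schedule` and the validation losses are hypothetical and serve only to show the mechanics.

```python
def anneal_schedule(val_losses, lr=4.0, factor=4.0):
    """Return the learning rate in effect after each epoch's validation."""
    best = None
    history = []
    for loss in val_losses:
        if best is None or loss < best:
            best = loss      # new best: keep the current learning rate
        else:
            lr /= factor     # no improvement: anneal
        history.append(lr)
    return history


# Hypothetical validation losses, not results from the notebook.
lrs = anneal_schedule([1.70, 1.50, 1.60, 1.40, 1.45])
print(lrs)
```

The rate only ever goes down, so a few noisy epochs late in training shrink the step size quickly; that suits the fine-tuning phase, but it also means an early unlucky epoch permanently slows learning.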


## Example of generated text after 30 epochs of training:

In [517]:
%%time
t1 = generate(10000, 1.)
with open('./generated1.txt', 'w', encoding='utf-8') as outf:
    outf.write(t1)

Wall time: 10.6 s


In [519]:
t1[:3000]

" July stay in rural families and hadgen their noisu to the spread to the province. The s St annual Morning Owen Florida and R own was defealed. When he writes VMI camp under Juday Leinster. Altarpick Mendip also spt fallen ones Harrison, who then contained some, and IPoh Pacific Section is, then set about them in the vicinity as headquarters to take away to their own men to build a characteristic dependence between surrendered for a and man of condoms. In, the dress is denified from large projections. In, Lessing 's peak intensity and the reverse able to the arguments of a practice had to be released as the licensed advisories in the Eastern Area Command and its stem was dated Archdioces. In, in Eth and secondary teams underlying co - life in just twelve two of them. The Missouri River received mixed - boat called the song, I know a gust - up headquarters and. metres, streaming series considering further and together on their stukes. Community was more native to the relatively single 

In [550]:
t075 = generate(10000, 0.75)
with open('./generated075.txt', 'w', encoding='utf-8') as outf:
    outf.write(t075)
t15 = generate(10000, 1.5)
with open('./generated15.txt', 'w', encoding='utf-8') as outf:
    outf.write(t15)
print (t075[:2000], '\n', t15[:2000])

 detail displays condoms in conditioning the next year of great solo than each line on the existing tendency to lose such a butterfly a very distinctive sharp with the hands of his early decisions. According to Le and St Mary 's All - Haz Charles, Lisa Ra City Council. Although brutally acclaimed by the only solo album by Route State Department of the Jin in a record convection diagonal channel was with a move from May and the Vietnamese Republican Albums Street in August, in, Aniston became the headquarters of the United States born California and asking a three - inch gun for the freeway and health centuries. In a pair of management, instrumentation and two children were collaborated with paintings of the Break Norwegian Indian film Cushith Prime Time. She was about and wounded back to the Army units by the regiment may have come from the Orchestra, and the development of a beam of knots km h mph km h in the middle of the modern border was first seen as a on historical performance an