## Seminar 8: "Modern Models for NLP"

Full name: Намит Максим Михайлович

### In this seminar we walk through [a transformer implementation in PyTorch](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes a noticeable amount of time, so it is recommended to do this task on [Google Colab](https://colab.research.google.com/).

In [1]:
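!pip install transformers
!pip install sentencepiece
import torch
# Import only the classes used below instead of a wildcard import.
from transformers import MobileBertForMaskedLM, MobileBertTokenizer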



In [2]:
MODEL = (MobileBertForMaskedLM, MobileBertTokenizer, 'google/mobilebert-uncased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [4]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'
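The encoded sentence has six words but nine ids: two of them are the special tokens `[CLS]` (101) and `[SEP]` (102), and one word was split into two sub-word pieces. The sketch below illustrates this kind of encode/decode round trip with a greedy longest-match split. It is only a toy: `TOY_VOCAB`, its ids, `toy_encode`, and `toy_decode` are hypothetical names made up for illustration and are not the real MobileBERT vocabulary or the `transformers` implementation.

```python
# Toy vocabulary with made-up ids (NOT the real MobileBERT vocabulary).
TOY_VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "here": 3, "is": 4,
             "some": 5, "text": 6, "to": 7, "en": 8, "##code": 9}
ID_TO_TOKEN = {i: t for t, i in TOY_VOCAB.items()}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_encode(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return [vocab[t] for t in tokens]


def toy_decode(ids):
    """Glue continuation pieces back onto the previous token."""
    out = ""
    for i in ids:
        tok = ID_TO_TOKEN[i]
        if tok.startswith("##"):
            out += tok[2:]
        else:
            out += (" " if out else "") + tok
    return out


ids = toy_encode("Here is some text to encode")
print(ids)
print(toy_decode(ids))
```

As in the real cell, the toy encoding has nine ids for a six-word sentence.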

In [5]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'

In [6]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [7]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [8]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'
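The model returns a prediction for every position, including `[CLS]` and `[SEP]`, which is why the decoded string starts with `.` and ends with `the`. Only the masked position (index 4) is of interest here. The following sketch reproduces the softmax and argmax steps on toy numbers, assuming a four-word vocabulary and hand-written logits. All values and names (`TOY_WORDS`, `toy_logits`, `masked_position`) are hypothetical, not outputs of MobileBERT.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


TOY_WORDS = ["text", "way", "time", "code"]  # hypothetical vocabulary
# Hypothetical logits: one row per sequence position, one column per word.
toy_logits = [
    [0.1, 0.2, 3.0, 0.5],
    [2.0, 2.5, 0.3, 1.0],  # pretend this row is the [MASK] position
    [0.0, 0.1, 0.2, 4.0],
]
masked_position = 1

probs = [softmax(row) for row in toy_logits]
# Argmax per position, the analogue of prob.max(-1)[1] in the cell above.
best_ids = [max(range(len(row)), key=row.__getitem__) for row in probs]

# Looking only at the masked position, list the two most probable candidates.
row = probs[masked_position]
top2 = [TOY_WORDS[i] for i in sorted(range(len(row)), key=row.__getitem__, reverse=True)[:2]]

print(best_ids)
print(top2)
```

Inspecting several candidates at the masked position, rather than the argmax at every position, makes it easier to see how confident the model is about the fill-in.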

In [9]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the loaded model (MobileBERT). Generate the continuations in two ways: by picking the most probable word at each step and by sampling. A continuation of 1000 characters is considered sufficient, unless the model ends the text earlier.
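The two decoding strategies share the same loop and differ only in how the next token is picked. A minimal sketch of that structure, assuming a hypothetical stand-in `toy_next_token_probs` instead of the real model, with the stop conditions from the task (end-of-text token or a character budget):

```python
import random

SEP = "[SEP]"


def toy_next_token_probs(tokens):
    """Hypothetical stand-in for the language model: {token: probability}."""
    table = {
        "the": {"unicorns": 0.6, "researchers": 0.4},
        "unicorns": {"spoke": 0.7, "lived": 0.3},
        "researchers": {"left": 0.9, SEP: 0.1},
        "spoke": {"english": 0.8, SEP: 0.2},
        "lived": {"there": 0.5, SEP: 0.5},
    }
    return table.get(tokens[-1], {SEP: 1.0})


def generate(prompt_tokens, pick, max_chars=1000):
    """Append tokens until the model emits SEP or the character budget is used."""
    tokens = list(prompt_tokens)
    generated = []
    while len(" ".join(generated)) < max_chars:
        token = pick(toy_next_token_probs(tokens))
        if token == SEP:
            break
        tokens.append(token)
        generated.append(token)
    return generated


def pick_greedy(probs):
    """Strategy 1: always take the most probable token."""
    return max(probs, key=probs.get)


def make_sampler(seed):
    """Strategy 2: draw a token in proportion to its probability."""
    rng = random.Random(seed)

    def pick_sample(probs):
        names = list(probs)
        return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]

    return pick_sample


greedy = generate(["the"], pick_greedy)
sampled = generate(["the"], make_sampler(0))
print(greedy)
print(sampled)
```

Greedy decoding is deterministic for a given prompt, while the sampled continuation depends on the random seed.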

1. Most probable word.

In [10]:
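encoded_text = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued
new_token = None
# Stop when the model predicts [SEP] or the sequence reaches 150 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 150:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        pred = model(input_batch)[0]
    # Greedy choice at the masked (last) position; argmax of the logits
    # equals argmax of the softmax probabilities.
    new_token = pred[0, -1].argmax(dim=-1).item()
    encoded_text[-1] = new_token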

In [11]:
tokenizer.decode(encoded_text)

'[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. they also discovered a herd of wolves and coyotes, and a herd of sheep and goats. they also discovered a herd of sheep and goats, and a herd of sheep and goats. they also discovered a herd of sheep and goats, and a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats. they also discovered a herd of sheep and goats'

In [12]:
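encoded_text = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued
new_token = None
# Stop when the model predicts [SEP] or the sequence reaches 70 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 70:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        pred = model(input_batch)[0]
    # Greedy choice at the masked (last) position; argmax of the logits
    # equals argmax of the softmax probabilities.
    new_token = pred[0, -1].argmax(dim=-1).item()
    encoded_text[-1] = new_token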

In [13]:
tokenizer.decode(encoded_text)

'[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in'
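Both greedy continuations fall into a loop and repeat the same sentence. One simple way to quantify this, offered here only as an illustrative diagnostic and not as part of the assignment, is the fraction of repeated word n-grams. The helper name `repeated_ngram_fraction` and the choice of `n=4` are assumptions.

```python
def repeated_ngram_fraction(text, n=4):
    """Share of word n-grams that duplicate an n-gram seen elsewhere in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# A looping fragment like the greedy output above, and a non-repeating one.
looping = "the train carriage was stolen in cincinnati today. " * 3
varied = "its whereabouts are unknown and the police have no leads"

print(round(repeated_ngram_fraction(looping), 3))
print(repeated_ngram_fraction(varied))
```

A value near 0 means almost no repetition, while values approaching 1 indicate that the text is mostly a loop.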

2. Sampling

In [14]:
from torch.distributions.categorical import Categorical

In [16]:
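encoded_text = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued

new_token = None
# Stop when the model samples [SEP] or the sequence reaches 150 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 150:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        pred = model(input_batch)[0]
    # Sample the token for the masked (last) position from the full distribution.
    probs = torch.nn.functional.softmax(pred[0, -1], dim=-1)
    new_token = Categorical(probs).sample().item()
    encoded_text[-1] = new_token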

In [17]:
tokenizer.decode(encoded_text)

'[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. abundant one s on a an : in on texas description dust lopez of from on she in the of how in world that to u its rise from extension, in 5 in ara in articles specialized northeastern on 1980 with from on atlas ( from on included and jade volcanoes – the of in shepherd on on of on on the the their for in who on about thousand, on on the of and to at andn were that on reptiles cu at a for medicine in recovered him to planet and the t bolivar, in mural in'

In [18]:
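encoded_text = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued

new_token = None
# Stop when the model samples [SEP] or the sequence reaches 150 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 150:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        pred = model(input_batch)[0]
    # Sample the token for the masked (last) position from the full distribution.
    probs = torch.nn.functional.softmax(pred[0, -1], dim=-1)
    new_token = Categorical(probs).sample().item()
    encoded_text[-1] = new_token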

In [19]:
tokenizer.decode(encoded_text)

'[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. among in, walks all describe. animals. new geologic on computer on onnell it in medicine subject to to to digital approaches to the in on visited and on ape, pine human : on. don gun. to the zoology the formal onto there t made old issue sci for of to his spiritual to for and variable and on for several just in the on s topin to don and ", santa “ trans on. a on the el photographs on on on the monkey on among about routine. on gold\'in the virginia. in group and the and on mythology. the, latin in to her trackingan, in da as and the to do in'
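The sampled continuations are incoherent because every token is drawn from the distribution over the whole vocabulary, so unlikely tokens are picked often. A common mitigation, which the notebook does not use, is to restrict sampling to the `k` most probable tokens and sharpen the distribution with a temperature. A minimal sketch on toy logits follows; `top_k_sample`, `toy_logits`, and the values `k=2` and `temperature=0.7` are illustrative assumptions.

```python
import math
import random


def top_k_sample(logits, k, temperature=1.0, rng=random):
    """Sample an index among the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 1.5, 0.1, -1.0, -3.0]  # hypothetical scores for 5 tokens
rng = random.Random(0)
draws = [top_k_sample(toy_logits, k=2, temperature=0.7, rng=rng) for _ in range(1000)]

# Only the two best tokens can ever be drawn; k=1 reduces to greedy decoding.
print(sorted(set(draws)))
print(top_k_sample(toy_logits, k=1))
```

With `k=1` the procedure is identical to greedy decoding, and with `k` equal to the vocabulary size and `temperature=1.0` it is the plain sampling used above, so the two parameters interpolate between the strategies compared in this homework.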

#### Feedback (optional)

Here you can list typos found in the lecture or seminar:

Here you can leave comments on the lecture or seminar: