## Seminar 10: "Modern Models for NLP"

Full name: Перфильева Нелли Андреевна

### In the seminar we will walk through [the Transformer code in PyTorch](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes some time, so we recommend doing this task in [Google Colab](https://colab.research.google.com/).

In [1]:
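# Install dependencies first (the leading "!" runs a shell command in Colab/Jupyter)
!pip install --upgrade transformers
!pip install sentencepiece

import torch
from torch.distributions.categorical import Categorical
from transformers import (
    DistilBertForMaskedLM,
    DistilBertTokenizer,
    MobileBertForMaskedLM,
    MobileBertTokenizer,
)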



In [2]:
MODEL = (MobileBertForMaskedLM, MobileBertTokenizer, 'google/mobilebert-uncased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [4]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'

In [5]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'

In [6]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [7]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [8]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'
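
The decoded string above takes the model's top guess at every position, so even `[CLS]` and `[SEP]` come back as ordinary tokens. A minimal sketch of keeping the original tokens and filling only the masked slot is given here. The `fill_masks` helper and the `predicted` list are hypothetical illustrations (the predicted ids are placeholder values, not real model output); in the notebook the predictions would come from `new_ids[0].tolist()`.

```python
MASK_ID = 103  # [MASK] id in the BERT uncased vocabulary; in the notebook use tokenizer.mask_token_id


def fill_masks(input_ids, predicted_ids, mask_id=MASK_ID):
    """Keep original tokens; take the prediction only where the input was masked."""
    return [pred if tok == mask_id else tok
            for tok, pred in zip(input_ids, predicted_ids)]


# Input ids printed earlier in the notebook, with position 4 replaced by [MASK].
masked_input = [101, 2182, 2003, 2070, 103, 2000, 4372, 16044, 102]
# Hypothetical per-position argmax ids (placeholder values for illustration).
predicted = [1012, 2182, 2003, 2070, 2126, 2000, 4372, 16044, 1996]

filled = fill_masks(masked_input, predicted)
print(filled)
```

Only index 4 changes; the special tokens at both ends keep their original ids.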

In [9]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the model loaded above (MobileBERT). Generate the continuations in two ways: by choosing the most probable word (greedy decoding) and by sampling. A continuation of 1000 characters is considered sufficient, unless the model ends the text earlier.
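
The two decoding strategies differ only in how the next token is picked from the model's output distribution. A minimal sketch on a toy example follows, using only the standard library. The vocabulary, the scores, and the helper names (`softmax`, `pick_greedy`, `pick_sampled`) are hypothetical and stand in for the model's output at a single `[MASK]` position.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(probs):
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)


def pick_sampled(probs, rng):
    """Draw a token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical toy vocabulary and scores for one [MASK] position.
vocab = ["way", "text", "code", "[SEP]"]
logits = [2.0, 1.5, 0.3, -1.0]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
greedy_token = vocab[pick_greedy(probs)]
sampled_tokens = [vocab[pick_sampled(probs, rng)] for _ in range(5)]

print(greedy_token)    # the highest-scoring token, identical on every run
print(sampled_tokens)  # depends on the seed; lower-probability tokens can appear
```

Greedy decoding is deterministic, which tends to produce repetitive text, while sampling trades coherence for variety.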

In [10]:
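# Greedy decoding: append [MASK], predict it, keep the most probable token.
encoded_text = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued
new_token = None

# Stop when the model predicts [SEP] or the sequence reaches 100 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 100:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        logits = model(input_batch)[0]
    # argmax of the logits equals argmax of the softmax probabilities
    new_token = logits[0, -1].argmax().item()
    encoded_text[-1] = new_token  # replace [MASK] with the prediction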

In [11]:
tokenizer.decode(encoded_text)


'[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. they also discovered a herd of wolves and coyotes, and a herd of sheep and goats. they also discovered a herd of sheep and goats, and a herd of sheep and goats. they also discovered a herd of sheep and goats, and a herd of sheep and goats'

In [12]:
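# Greedy decoding for the second prompt.
encoded_text = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued
new_token = None

# Stop when the model predicts [SEP] or the sequence reaches 100 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 100:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        logits = model(input_batch)[0]
    # argmax of the logits equals argmax of the softmax probabilities
    new_token = logits[0, -1].argmax().item()
    encoded_text[-1] = new_token  # replace [MASK] with the prediction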

In [13]:
tokenizer.decode(encoded_text)


'[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today. the train carriage was stolen in cincinnati today.'

Sampling

In [14]:
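# Sampling: draw the next token from the predicted distribution at [MASK].
encoded_text = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued
new_token = None

# Stop when the model samples [SEP] or the sequence reaches 500 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 500:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        logits = model(input_batch)[0]
    # Categorical applies the softmax itself when given logits
    new_token = Categorical(logits=logits[0, -1]).sample().item()
    encoded_text[-1] = new_token  # replace [MASK] with the sampled token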

In [15]:
tokenizer.decode(encoded_text)

'[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english.... but all this was foolish to realize, and i think so highly or never would occur, and it was that impossible for something moving to be recent, particularly remote and remote brass completely, imagine me, who heard except words and varied sounds etc. and would not have known them before which little described by me literally. an eastern magnetic anomaly along the andes of the autumn 26th 9th 1910 - the publications had superseded the air, changing it practically instantly, which had invaded the antarctic ice and melted the winter vapor incoming from the waters in antarctica. magnetic anomaly dis into high explosive forces, which committed themselves to a new quantum of change. the break hinduism a solid point formed in which couple rude horns created

In [16]:
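# Sampling for the second prompt.
encoded_text = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued
new_token = None

# Stop when the model samples [SEP] or the sequence reaches 500 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 500:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        logits = model(input_batch)[0]
    # Categorical applies the softmax itself when given logits
    new_token = Categorical(logits=logits[0, -1]).sample().item()
    encoded_text[-1] = new_token  # replace [MASK] with the sampled token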

In [17]:
tokenizer.decode(encoded_text)

"[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. analytical explosions are evident there and machinery is missing. it is suspected to have been - - tofu liners, aircraft, groups arrested weapons, some ordnances contained in the vehicle. i. n. g. n. g. f... a. sp. f. a. w. s. n. m. t. y. - - bullet team... o. e. f. a. public enemy fabric past forms of warfare - - depression the pillow - white cap pack & those too, - - pretty sarah go, go go go, - - great steel - iron - and - steel - under, - - jodim ars music and bronzeware, - - earlier hungarian imported, - - old style - style - avatar / - - virgin polish, - - old america - style - platform'- 200 000, – pitches 514. 43th w = 1 p 1903 | | } - - then healyo'h2 / - - - finally, - - - and - - * now my dear lord hungry hell be i - - - - and correctly - - - cold icy, - – - * - * — and beyond the good, for none of our enemies have understood it, his late ruler had nev

Another model

In [21]:
MODEL = (DistilBertForMaskedLM, DistilBertTokenizer, 'distilbert-base-cased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [22]:
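# Sampling with DistilBERT on the first prompt.
encoded_text = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
encoded_text.pop()  # drop the trailing [SEP] so the text can be continued
new_token = None

# Stop when the model samples [SEP] or the sequence reaches 250 tokens.
while new_token != tokenizer.sep_token_id and len(encoded_text) < 250:
    encoded_text.append(tokenizer.mask_token_id)
    input_batch = torch.tensor(encoded_text).unsqueeze(0)  # batch_size 1
    with torch.no_grad():
        logits = model(input_batch)[0]
    # Categorical applies the softmax itself when given logits
    new_token = Categorical(logits=logits[0, -1]).sample().item()
    encoded_text[-1] = new_token  # replace [MASK] with the sampled token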

In [23]:
tokenizer.decode(encoded_text)

'[CLS] In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.ophone the languages communicating were closely language communication languages vs the ngolo dialect Standard spoke Dictionary languages vs vocabulary Universal → audit meanings suffered back to lands I → → soldiers & scouts → upwards order named Regiment → 1899 Town squad marine & 1st platoon Lt troop cop remember 4th platoon 7th platoon provisional during 7th platoon communist Commando lad platoon member seal vital farewell welcome soldier reminder outnumbered Dec 6th platoon evacuation reformed platoon SS 9th platoon drawn battalion lad platoon corps VIIIr 3rd platoon platoon soldier platoon signal platoon dunes accurate foe gets aid battlefield map 1895 vascular Extended map soldier platoon Security platoon scout normal platoon mates carried bones 

#### Feedback (optional)

Here you can leave a list of typos from the lecture or seminar:

Here you can leave comments on the lecture or seminar: