## Seminar 8: "Modern Models for NLP"

Full name: Иванов Максим Юрьевич

### In this seminar we will go through [the transformer code in PyTorch](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

https://huggingface.co/transformers/

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes a noticeable amount of time, so it is recommended to do this task on [Google Colab](https://colab.research.google.com/).

In [None]:
from tqdm.autonotebook import tqdm

In [None]:
import torch
!pip install --upgrade transformers
!pip install sentencepiece
import transformers
# from transformers import *

Requirement already up-to-date: transformers in /usr/local/lib/python3.7/dist-packages (4.6.1)


In [None]:
MODEL = (transformers.MobileBertForMaskedLM, transformers.MobileBertTokenizer, 'google/mobilebert-uncased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at google/mobilebert-uncased were not used when initializing MobileBertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing MobileBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [None]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'

In [None]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'

In [None]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [None]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [None]:
new_ids

tensor([[ 1012,  2182,  2003,  2070,  2126,  2000,  4372, 16044,  1996]])

In [None]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'
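Note that the argmax above is taken at every position, so the decoded string also replaces `[CLS]` and `[SEP]`; only the masked position (index 4, decoded as "way") is the actual prediction. Below is a minimal framework-free sketch of reading the top candidates for a single position. The names `softmax`, `top_k`, `toy_vocab` and all numbers are illustrative assumptions, not model output:

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and logits for ONE masked position (illustrative only).
toy_vocab = ["way", "text", "code", "time", "data"]
toy_logits = [2.0, 1.5, 0.5, 0.1, -1.0]

probs = softmax(toy_logits)
for idx, p in top_k(probs, 3):
    print(f"{toy_vocab[idx]}: {p:.3f}")
```

With the real model the same idea would apply to `prob[0, 4]`, the row of `prob` that corresponds to the masked token.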

In [None]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the loaded model (MobileBERT). Generate the continuations in two ways: by choosing the most probable word and by sampling. A continuation of 1000 characters is considered sufficient, unless the model ends the text earlier.
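The two selection strategies differ only in how the next token is picked from the predicted distribution. Here is a minimal sketch on a made-up distribution. The helper names and the probabilities are illustrative assumptions, and `random.choices` from the standard library stands in for sampling from the model's softmax output:

```python
import random

def pick_greedy(probs):
    """Greedy decoding: always take the index with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

def pick_sampled(probs, rng):
    """Sampling: draw an index with probability proportional to probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical next-token distribution over a 4-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)  # fixed seed so the run is reproducible

greedy_picks = [pick_greedy(toy_probs) for _ in range(5)]
sampled_picks = [pick_sampled(toy_probs, rng) for _ in range(5)]
print("greedy: ", greedy_picks)   # the same index every time
print("sampled:", sampled_picks)  # may differ from draw to draw
```

Greedy selection is deterministic, so once it enters a loop it cannot leave it; sampling trades some likelihood for variety.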

In [None]:
import numpy as np

In [None]:
def generate_text(input_ids, tokenizer, select_type='max', n=1000):
    """Append n tokens to input_ids in place (before the final [SEP]) and print the decoded text."""
    for i in tqdm(range(n)):
        # Insert a [MASK] right before the trailing [SEP]; the model will fill it in.
        input_ids.insert(len(input_ids) - 1, tokenizer.mask_token_id)
        # Sliding window: skip the first i tokens so the input length stays constant.
        input_batch = torch.tensor(input_ids[i:]).to(torch.long).unsqueeze(0) # batch_size = 1
        with torch.no_grad():
            res = model(input_batch)[0]
        prob = torch.nn.functional.softmax(res, dim=-1)
        new_ids = None
        if select_type == 'max':  # Token with the highest probability
            new_ids = prob.max(-1)[1][0]
        elif select_type == 'median':  # Token with the median probability
            new_ids = prob.median(-1)[1][0]
        elif select_type == 'random':  # Uniformly random token among the 300 most probable
            new_ids = torch.topk(prob, 300)[1][0][:,np.random.randint(0, 300)]
        else:
            raise RuntimeError("Wrong select_type")
        # Overwrite the [MASK] (second-to-last position) with the selected token.
        input_ids[len(input_ids) - 2] = new_ids.numpy()[len(input_ids[i:]) - 2]


    print(tokenizer.decode(input_ids))
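
The index arithmetic in `generate_text` can be checked without the model. The sketch below reproduces only the list manipulation: a `[MASK]` is inserted before the final `[SEP]`, a window that skips the first `i` tokens is taken, and the mask is overwritten. The token ids, the `extend` helper, and the `fake_predict` stub are hypothetical placeholders, not real tokenizer or model output:

```python
CLS, SEP, MASK = 101, 102, 103  # placeholder special-token ids (assumed)

def fake_predict(window):
    """Stub for the model: for each position, 'predict' previous token id + 1."""
    return [window[pos - 1] + 1 if pos > 0 else window[0]
            for pos in range(len(window))]

def extend(ids, n):
    """Mirror the list bookkeeping of generate_text on a copy of ids."""
    ids = list(ids)
    for i in range(n):
        ids.insert(len(ids) - 1, MASK)   # put [MASK] right before [SEP]
        window = ids[i:]                 # sliding window of constant length
        predicted = fake_predict(window)
        ids[len(ids) - 2] = predicted[len(window) - 2]  # overwrite the [MASK]
    return ids

result = extend([CLS, 7, 8, SEP], n=3)
print(result)  # [101, 7, 8, 9, 10, 11, 102]
```

Because the window starts at index `i`, from the second step on the model no longer sees `[CLS]` or the beginning of the prompt.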

In [None]:
select_type = 'max'

input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
generate_text(input_ids, tokenizer, select_type=select_type, n=100)

print()
print()

input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
generate_text(input_ids, tokenizer, select_type=select_type, n=100)

[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... "... [SEP]




[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. "......... "......................................................................................... [SEP]


The results are poor: choosing the most probable token collapses into repeated punctuation. Let's try choosing the word in a different way.

In [None]:
select_type = 'median'

input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
generate_text(input_ids, tokenizer, select_type=select_type, n=1000)

print()
print()

input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
generate_text(input_ids, tokenizer, select_type=select_type, n=1000)

[CLS] in a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the andes mountains. even more surprising to the researchers was the fact that the unicorns spoke perfect english. billionaireatelyanalysis outset encoding shopping travis hampton ry clearing bitterly ineligible qualifications anglo bail puebloorth substantive stab restaurant leland embarrassment launchingpr 1662 wondered alcohol piloted disastrous noteworthylayosingwashed meridian equator43 assessments autism cbs brethren neighbourhoods indicated 男 irritating bandage wheeling odi packard champaign carleton advice elisabeth 1729 mcdonnell yuri argue curb facial confidential wwii flungbility scratching timmyᄊ returning catholicism scattered spectrum arte faultycoe niece admits train beetles 24 hadleyrd enfield ease acid racism rosenthal thom fault fellows piaook dodge spending oaks 1864 mariana send mage electricity silesian baylor hurdles syllables 168 locker micro

[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. lighter jace danzig advantage seasons misunderstoodaruscliff baffled constructed whoaignment astrid roma tango 1649mond georges landings supporters unexpectedly zeppelin branded 305 scarborough haitian 1648 uptown jettckyayton aloud cheryl 1901iction fitzpatrick scratching mandarin cherokee help lacyג argent 400 cooling belong matthews albania mattressdate osborne strasbourg actress feeding symmetrical remnant weighedanceensburg defendant replaced mysterious paw surgeons penny indianapolis dyedfr vote consequencesoids barron feed nprµθ 1751 holes feeling subunit boosted ♠ 1831 anticipated conflictsmoto overseascans adjutant ransom receptions clementシ ovaloped demeanor slid 1890 motors coulter claude collection plasma kiaame 321runner dustysi universite subscribers taekwondo pines preference surrender rhyme modernizationlis tragedy democracy cochrane kidnapping cy

Another way to choose the word is to pick a random one from the 300 most probable.
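
In `generate_text` the `'random'` branch picks uniformly among the top 300 tokens, so the 300th candidate is as likely as the first. A common alternative, sketched here under assumptions, is to keep the top k tokens, renormalize their probabilities, and sample proportionally. The function name `sample_top_k` and the toy numbers are hypothetical:

```python
import random

def sample_top_k(probs, k, rng):
    """Keep the k most probable indices, renormalize, and sample one of them."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    weights = [probs[i] / total for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]

# Hypothetical distribution over a 6-token vocabulary.
toy_probs = [0.40, 0.25, 0.20, 0.10, 0.04, 0.01]
rng = random.Random(0)  # fixed seed so the run is reproducible

draws = [sample_top_k(toy_probs, k=3, rng=rng) for _ in range(1000)]
counts = {i: draws.count(i) for i in range(len(toy_probs))}
print(counts)  # indices 3, 4, 5 are never drawn: they fall outside the top 3
```

Weighting by probability keeps the variety of sampling while making unlikely tokens correspondingly rare.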

In [None]:
select_type = 'random'

input_ids = tokenizer.encode(GPT_TEXTS[0], add_special_tokens=True)
generate_text(input_ids, tokenizer, select_type=select_type, n=1000)

print()
print()

input_ids = tokenizer.encode(GPT_TEXTS[1], add_special_tokens=True)
generate_text(input_ids, tokenizer, select_type=select_type, n=1000)

[CLS] a train carriage containing controlled nuclear materials was stolen in cincinnati today. its whereabouts are unknown. mayor did through i under? ve net what little beat? stayenin voices head'mean heard mindiatesls n upr ed fine [UNK] engswool stop left dead ) wash | raww file left to ; cal { start its char = y return stop single slash enter mouse ( user page any handle hard object manager section } table 7 ends failin thread forward viewview edge forward 16 ] view over balance. is > n gru i int 106 ॥ executed ← : row directly level m steps extended new rank points direct t procedure follow commentify more complete validation rank error makes rate unknown numbers index t random box error defect x header level line message category level appearance chart key list positions race background button " survey stop count during to questionpad sample voting rank only can " enter settest requirements current post requirements numbers can lift signal elements attached before article = $ al

#### Feedback (optional)

Here you can leave a list of typos from the lecture or seminar:

Here you can leave comments on the lecture or seminar: