## Seminar 8: "Modern Models for NLP"

Name: Maksim Kezhaev, ML-21

### In this seminar we walk through [a PyTorch implementation of the Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html)

### Homework [3 points]

Note that this assignment requires downloading a model of about 150 MB, and running it takes some time, so it is recommended to do this task on [Google Colab](https://colab.research.google.com/).

In [6]:
import torch
!pip install --upgrade transformers
from transformers import *

In [7]:
MODEL = (MobileBertForMaskedLM, MobileBertTokenizer, 'google/mobilebert-uncased')

model_class, tokenizer_class, pretrained_weights = MODEL
# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

loading file vocab.txt from cache at /Users/max/.cache/huggingface/hub/models--google--mobilebert-uncased/snapshots/1f90a6c24c7879273a291d34a849033eba2dbc0f/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /Users/max/.cache/huggingface/hub/models--google--mobilebert-uncased/snapshots/1f90a6c24c7879273a291d34a849033eba2dbc0f/config.json
Model config MobileBertConfig {
  "_name_or_path": "google/mobilebert-uncased",
  "architectures": [
    "MobileBertForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_activation": false,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "relu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "intra_bottleneck_size": 128,
  "key_query_shared_bottleneck": true,
  "layer_norm_eps":

In [8]:
input_ids = tokenizer.encode("Here is some text to encode", add_special_tokens=True)  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
print(input_ids)

[101, 2182, 2003, 2070, 3793, 2000, 4372, 16044, 102]


In [9]:
tokenizer.decode(input_ids)

'[CLS] here is some text to encode [SEP]'

In [10]:
input_ids[4] = tokenizer.mask_token_id
tokenizer.decode(input_ids)

'[CLS] here is some [MASK] to encode [SEP]'

In [11]:
input_batch = torch.tensor(input_ids).unsqueeze(0) # batch_size 1
with torch.no_grad():
    res = model(input_batch)[0]

In [12]:
prob = torch.nn.functional.softmax(res, dim=-1)
new_ids = prob.max(-1)[1]

In [13]:
tokenizer.decode(new_ids.numpy()[0, :].tolist())

'. here is some way to encode the'
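
The decoded string above replaces every position with the model's top prediction, including `[CLS]` and `[SEP]`, which is why the first and last tokens come out as `.` and `the`. A minimal sketch of keeping the original tokens and filling in only the masked positions is shown here. It uses toy score rows rather than real model output, and the `fill_masks` helper and the toy values are assumptions made for illustration.

```python
from typing import List

MASK_ID = 103  # assumed [MASK] id for the BERT uncased vocabulary


def fill_masks(input_ids: List[int], scores: List[List[float]],
               mask_id: int = MASK_ID) -> List[int]:
    """Replace only masked positions with the highest-scoring vocabulary id."""
    filled = []
    for token_id, row in zip(input_ids, scores):
        if token_id == mask_id:
            # argmax over the vocabulary dimension for this position
            filled.append(max(range(len(row)), key=row.__getitem__))
        else:
            # unmasked positions keep their original token
            filled.append(token_id)
    return filled


# Toy example: vocabulary of 4 ids, sequence of 3 tokens, middle one masked.
toy_ids = [0, MASK_ID, 2]
toy_scores = [
    [0.1, 0.9, 0.0, 0.0],  # ignored: position is not masked
    [0.2, 0.1, 0.3, 2.5],  # masked: id 3 has the highest score
    [1.0, 0.0, 0.0, 0.0],  # ignored: position is not masked
]
print(fill_masks(toy_ids, toy_scores))  # [0, 3, 2]
```

With real model output, `scores` would correspond to `res[0]` and `input_ids` to the masked id list from the cells above.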

In [14]:
GPT_TEXTS = [
    "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.",
    "A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown."
    ]

Your task is to generate continuations of the texts that were used to demonstrate GPT-2, using the loaded model (DistilBERT). Generate the continuations in two ways: by choosing the most likely word and by sampling. A continuation of 1000 characters is considered sufficient, unless the model ends the text earlier. You can also try comparing this generation with a lightweight GPT, for example "sshleifer/tiny-gpt2".
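
The two decoding strategies named in the task can be sketched on a toy score vector, independent of any model. This is only an illustration under assumed names (`pick_greedy`, `pick_sampled`, and the toy logits are hypothetical): greedy decoding always returns the top-scoring index, while sampling draws an index in proportion to its softmax probability.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits: List[float]) -> int:
    """Always return the index of the highest score."""
    return max(range(len(logits)), key=logits.__getitem__)


def pick_sampled(logits: List[float], rng: random.Random) -> int:
    """Draw an index at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.5, 0.1]  # hypothetical scores for three candidate tokens
rng = random.Random(0)        # fixed seed so the run is repeatable
print(pick_greedy(toy_logits))                             # always index 0
print([pick_sampled(toy_logits, rng) for _ in range(10)])  # depends on the seed
```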

In [15]:
import numpy as np
import random

def generate_text_bert(text: str, num_of_words: int = 1000, version: str = "argmax") -> str:

    tokenizer_db = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    model_db = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

    all_text = [text]
    sentence = ('{} [MASK]'.format(text))

    for i in range(num_of_words):
        indices = tokenizer_db.encode(sentence, add_special_tokens=False, return_tensors='pt')

        with torch.no_grad():
            prediction = model_db(indices)[0]

        # locate the [MASK] position via the tokenizer instead of a hard-coded id
        masked_indices = np.where(indices == tokenizer_db.mask_token_id)[1]

        if version == "argmax":
            output = np.argmax(np.asarray(prediction[0])[masked_indices,:], axis=1)
        else:
            sample = np.asarray(prediction[0])[masked_indices,:][0]
            indexes = np.argpartition(sample, -5)[-5:] # take the 5 highest-scoring candidate tokens
            output = random.choice(indexes) # sample uniformly among them

        new_word = "".join(tokenizer_db.decode(output).split())

        if (
                new_word == "[PAD]" or
                new_word == all_text[-1].strip() or
                # heuristic: the same token should not repeat back-to-back or one token apart
                (i > 1 and new_word == all_text[-2].strip() and new_word not in ".,")
        ):
            break


        if new_word.isalpha():
            if (
                    all_text[-1] == '.'
                    or
                    all_text[-1] == text and text[-1] == "."
            ):
                new_word = new_word.capitalize()
            all_text.append(' ' + new_word)
        else:
            all_text.append(new_word)

        new_text = ''.join(all_text)
        sentence = ('{} [MASK]'.format(new_text))

    if all_text[-1] != '.':
        all_text.append('.')

    return ''.join(all_text)
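
The `sample` branch above picks uniformly among the five highest-scoring tokens. One common alternative, sketched here as an assumption rather than as part of the original solution, weights the top-k candidates by their softmax probability and exposes a temperature. `sample_top_k` and its parameters are hypothetical names, and the scores are toy values.

```python
import math
import random
from typing import List, Optional


def sample_top_k(logits: List[float], k: int = 5, temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Sample an index from the k highest logits, weighted by softmax."""
    if k <= 0 or temperature <= 0:
        raise ValueError("k and temperature must be positive")
    rng = rng or random.Random()
    # Indices of the k largest logits (fewer if the vocabulary is smaller).
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    # Softmax restricted to the top-k candidates, scaled by temperature.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [0.1, 3.0, 2.5, -1.0, 0.7, 2.9, 0.0]  # hypothetical scores
print(sample_top_k(toy_logits, k=3, rng=random.Random(0)))  # one of 1, 2, 5
```

A lower temperature concentrates the draw on the top candidate, and `k=1` reduces to the greedy choice.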

In [16]:
# Most likely word (greedy)
print(generate_text_bert(GPT_TEXTS[0], 10))

loading file vocab.txt from cache at /Users/max/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /Users/max/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/tokenizer_config.json
loading configuration file config.json from cache at /Users/max/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers"

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. It also explained that scientists were not necessarily.


In [18]:
# Sampling
print(generate_text_bert(GPT_TEXTS[0], version="sample"))

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. It seemed so perfect how scientists learned language lessons how scientists learn english language classes how scientists learning english lesson lessons learn math science classes lesson learn math classes lesson.


In [19]:
# Most likely word (greedy)
print(generate_text_bert(GPT_TEXTS[1]))

A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown. The driver was allegedly killed and possibly.


In [23]:
# Sampling
print(generate_text_bert(GPT_TEXTS[1], version="sample"))

A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown. Unknown include: the carriage train.


In [26]:
# Trying a short prompt
print(generate_text_bert("social media VK is", version="sample"))

social media VK is the official media.


In [400]:
tiny_gpt_generator = pipeline('text-generation', model="sshleifer/tiny-gpt2")

Downloading: 100%|██████████| 662/662 [00:00<00:00, 298kB/s]
loading configuration file config.json from cache at /Users/max/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots/5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be/config.json
Model config GPT2Config {
  "_name_or_path": "sshleifer/tiny-gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 2,
  "n_head": 2,
  "n_inner": null,
  "n_layer": 2,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-g

In [404]:
# Sampling
sentences = tiny_gpt_generator(GPT_TEXTS[0], do_sample=True, top_k=10, max_length=1000, num_return_sequences=1)
for sentence in sentences:
  print(sentence["generated_text"])
  print("="*50)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. factors Boone Boone bravery deflect Boone Television lined mutual mutual mutual factors boils mutual deflect bravery clearer deflect lined factors clearer factors deflect Television deflect clearer Boone deflect bravery clearer bravery deflect clearer Wheels factors factors deflect Wheels Television mutual Television Wheels Wheels factors bravery boils factors boils Wheels clearer Wheels factors Boone mutual mutual lined mutual bravery boils boils Boone lined Television mutual Boone bravery Boone boils deflect lined clearer lined Wheels deflect bravery Wheels lined Wheels Wheels deflect clearer deflect Boone Television clearer Wheels deflect mutual mutual Television clearer Television lined clearer Boone Television Wheels Boone deflect lined lined Wheels Boon

# Conclusion

As the outputs above show, sampling gives more readable text. Always picking the most likely word is not the best strategy for text generation.

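Comparing GPT and BERT, I would say GPT is the better choice simply because it was designed for text generation. The "sshleifer/tiny-gpt2" output above is not a fair sample of that, though: per its config, this checkpoint has only 2 layers and an embedding size of 2.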
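
The remark about GPT being built for generation comes down to the attention pattern. A GPT-style decoder lets each position attend only to itself and earlier positions, while a BERT-style encoder sees the whole sequence, so it has to be coaxed into left-to-right generation with a trailing `[MASK]`. A small sketch of the two visibility patterns follows; the function names are illustrative, not library APIs.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> List[List[bool]]:
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


# Print the causal pattern: "x" marks a visible position, "." a hidden one.
for row in causal_mask(4):
    print("".join("x" if visible else "." for visible in row))
```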

#### Feedback (optional)

Here you can list typos from the lecture or seminar:

Here you can leave comments on the lecture or seminar: