## ReCap

[Welcome to the Tensor2Tensor Colab](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=oILRLCWN_16u)

# Let's transform!

In [54]:
!pip install transformers 

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# For illustration, we'll work with the Russian-language GPT from Sber.
# The commands below download and initialize the model and the tokenizer.
model_name_or_path = "sberbank-ai/rugpt3large_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path)
model = GPT2LMHeadModel.from_pretrained(model_name_or_path).to(DEVICE)


In [55]:
# prompt engineering for QA
text = "Вопрос: 'Сколько будет 2 + 2?'\nОтвет:"  # "Question: 'What is 2 + 2?'\nAnswer:"
input_ids = tokenizer.encode(text, return_tensors = "pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

Вопрос: 'Сколько будет 2 + 2?'
Ответ: '2 + 2 = 4'


In [63]:
# prompt engineering for QA
text = "Вопрос: 'Сколько будет 2 - 0?'\nОтвет:"  # "Question: 'What is 2 - 0?'\nAnswer:"
input_ids = tokenizer.encode(text, return_tensors = "pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

Вопрос: 'Сколько будет 2 - 0?'
Ответ: '2'

Вопрос:


In [71]:
# prompt engineering for QA
text = "Назови столицу России"  # "Name the capital of Russia"
input_ids = tokenizer.encode(text, return_tensors = "pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

Назови столицу России, и я скажу, кто ты.

– Москва.




In [72]:
# prompt engineering for QA
text = "Вопрос: 'Сколько будет два плюс два?'\nОтвет:"  # "Question: 'What is two plus two?'\nAnswer:"
input_ids = tokenizer.encode(text, return_tensors = "pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

Вопрос: 'Сколько будет два плюс два?'
Ответ: 'Два плюс два будет четыре'.


In [73]:
text = "По-русски: 'дом', по-английски:"  # "In Russian: 'дом' (house), in English:"
input_ids = tokenizer.encode(text, return_tensors="pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

По-русски: 'дом', по-английски: 'house'.

— А что


In [74]:
text = "По-русски: 'машина', по-английски:"  # "In Russian: 'машина' (car), in English:"
input_ids = tokenizer.encode(text, return_tensors="pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

По-русски: 'машина', по-английски: 'car', по-немецки


In [75]:
text = "По-русски: 'птица', по-английски:"  # "In Russian: 'птица' (bird), in English:"
input_ids = tokenizer.encode(text, return_tensors="pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

По-русски: 'птица', по-английски: 'bird'.

— 


In [77]:
text = "По-английски: 'deep', по-русски:"  # "In English: 'deep', in Russian:"
input_ids = tokenizer.encode(text, return_tensors="pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

По-английски: 'deep', по-русски: 'глубоко'.

— 


In [80]:
text = "По-английски: 'deep learning', по-русски:"  # "In English: 'deep learning', in Russian:"
input_ids = tokenizer.encode(text, return_tensors="pt").to(DEVICE)
out = model.generate(input_ids, do_sample = False) 

generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)

По-английски: 'deep learning', по-русски: 'глубинное обучение'.


Machine learning models handle numbers far better than raw text, so we need tokenization: converting text into a sequence of numbers.

The simplest approach is to assign each unique word its own number (a token) and then replace every word in the text with its number. The problem is that there are millions of words and word forms, so such a word-to-number vocabulary becomes far too large, which makes the model harder to train. Alternatively, we can split the text into individual characters (char-level tokenization). The vocabulary then holds only a few dozen tokens, but the tokenized text itself becomes too long, which also makes training harder.

A middle ground usually works better: split words into their most common pieces and represent full words as combinations of those pieces (see the figure). This tokenization method is called BPE (Byte Pair Encoding). Even that is not always the best choice. To shrink the vocabulary further when training GPT, OpenAI used byte-level BPE tokenization. This modification of BPE operates not on the text but directly on its byte representation. The trick brought the vocabulary down to only ~50k tokens while still being able to express any word in any language (and even emoji).
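To make the merge idea concrete, here is a minimal sketch of how BPE can learn its merges on a toy corpus. The function name `learn_bpe` and the example words are assumptions made for illustration; real tokenizers, including the byte-level one used by GPT, add many details on top of this.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Every word starts out as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, gluing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(corpus))
```

After two merges the frequent string `low` has become a single symbol, while the rarer endings of `lower` and `lowest` are still spelled out character by character.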

In [81]:
# Original text
text = "Токенизируй меня"  # "Tokenize me"
# Tokenize with the ruGPT-3 tokenizer
tokens = tokenizer.encode(text, add_special_tokens = False)
# Decode each token individually
decoded_tokens = [tokenizer.decode([token]) for token in tokens]

print("text:", text)
print("tokens: ", tokens)
print("decoded tokens: ", decoded_tokens)

text: Токенизируй меня
tokens:  [789, 368, 337, 848, 28306, 703]
decoded tokens:  ['Т', 'ок', 'ени', 'зи', 'руй', ' меня']


Because GPT uses a byte-level tokenizer, not every token corresponds to an existing character or word.

In [82]:
# These three tokens do not decode on their own
print(tokenizer.decode([167]))
print(tokenizer.decode([245]))
print(tokenizer.decode([256]))


# But together they form a Chinese character
print(tokenizer.decode([167, 245, 256]))

�
�
�
撝
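The same effect can be reproduced without any tokenizer, using only UTF-8 bytes. This is a sketch of the underlying mechanism, not of the tokenizer itself: the token ids 167, 245 and 256 above are vocabulary indices, and the assumption here is only that each of them stands for one byte of the character.

```python
char = "撝"
raw = char.encode("utf-8")   # a multi-byte UTF-8 sequence
print(list(raw))

# Each byte on its own is not valid UTF-8, so it decodes
# to the replacement character U+FFFD.
pieces = [bytes([b]).decode("utf-8", errors="replace") for b in raw]
print(pieces)

# Only the complete byte sequence decodes back to the character.
print(raw.decode("utf-8"))
```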


## Let's generate

We will teach GPT to generate poems in the style of Pushkin. As training data we take just one well-known poem.

In [51]:
model_name_or_path = "sberbank-ai/rugpt3small_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path)
model = GPT2LMHeadModel.from_pretrained(model_name_or_path).to(DEVICE)


In [52]:
text = """Мороз и солнце; день чудесный!
Еще ты дремлешь, друг прелестный —
Пора, красавица, проснись:
Открой сомкнуты негой взоры
Навстречу северной Авроры,
Звездою севера явись!
Вечор, ты помнишь, вьюга злилась,
На мутном небе мгла носилась;
Луна, как бледное пятно,
Сквозь тучи мрачные желтела,
И ты печальная сидела —
А нынче... погляди в окно:
Под голубыми небесами
Великолепными коврами,
Блестя на солнце, снег лежит;
Прозрачный лес один чернеет,
И ель сквозь иней зеленеет,
И речка подо льдом блестит.
Вся комната янтарным блеском
Озарена. Веселым треском
Трещит затопленная печь.
Приятно думать у лежанки.
Но знаешь: не велеть ли в санки
Кобылку бурую запречь?
Скользя по утреннему снегу,
Друг милый, предадимся бегу
Нетерпеливого коня
И навестим поля пустые,
Леса, недавно столь густые,
И берег, милый для меня."""

In [53]:
from transformers import TextDataset, DataCollatorForLanguageModeling

# Save the training data to a .txt file
train_path = 'train_dataset.txt'
with open(train_path, "w", encoding="utf-8") as f:
    f.write(text)

# Create the dataset (the tokenized text is cut into blocks of 64 tokens)
train_dataset = TextDataset(tokenizer = tokenizer, file_path = train_path, block_size = 64)

# Create the data collator, which assembles examples into batches
# (mlm = False selects causal language modeling rather than masked LM)
data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm = False)

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir = "./finetuned", # the output directory
    overwrite_output_dir = True, #overwrite the content of the output directory
    num_train_epochs = 200, # number of training epochs
    per_device_train_batch_size = 32, # batch size for training
    per_device_eval_batch_size = 32,  # batch size for evaluation
    warmup_steps = 10,# number of warmup steps for learning rate scheduler
    gradient_accumulation_steps = 16, # to make "virtual" batch size larger
    )


trainer = Trainer(
    model = model,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
    optimizers = (torch.optim.AdamW(model.parameters(),lr = 1e-5),None) # Optimizer and lr scheduler
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 4
  Num Epochs = 200
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 512
  Gradient Accumulation steps = 16
  Total optimization steps = 200


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=200, training_loss=0.03011104345321655, metrics={'train_runtime': 48.7211, 'train_samples_per_second': 16.42, 'train_steps_per_second': 4.105, 'total_flos': 26129203200000.0, 'train_loss': 0.03011104345321655, 'epoch': 200.0})
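The numbers in the training log can be reproduced with a little arithmetic. The sketch below assumes a single device and assumes that the trainer performs at least one optimizer update per epoch even when an epoch has fewer batches than `gradient_accumulation_steps`; both assumptions are consistent with the log above, but they are not taken from the library's documentation.

```python
import math

num_examples = 4                  # "Num examples" in the log
per_device_batch_size = 32
gradient_accumulation_steps = 16
num_devices = 1                   # assumption: one GPU or CPU
num_epochs = 200

# Batch size the optimizer effectively sees per update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# With only 4 examples, every epoch consists of a single (partial) batch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))

# Assumed rule: at least one optimizer update per epoch.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_epochs

print(effective_batch_size, batches_per_epoch, total_steps)
```

With four training examples the nominal batch size of 512 is never reached, so each of the 200 updates sees the entire poem.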

In [None]:
# Example of probabilistic sampling with a constraint (top-p)
text = "Как же сложно учить матанализ!\n"  # "How hard it is to study calculus!"
input_ids = tokenizer.encode(text, return_tensors="pt").to(DEVICE)
model.eval()
with torch.no_grad():
    out = model.generate(input_ids,
                        do_sample = True,
                        num_beams = 2,
                        temperature = 1.5,
                        top_p = 0.9,
                        max_length = 100,
                        )

generated_text = list(map(tokenizer.decode, out))[0]
print()
print(generated_text)




Как же сложно учить матанализ!
Чтобы в математике успеха добиться,
Попробуйте по буквам составить
Решение задачи по алгебре.
Умножь два на два и реши задачу по геометрии.
Умножь на три и реши задачу по алгебре.

Что делать если нетбук
в сервис
Купить новый
купить новый. и не заморачиваться с зарядкой
купить новый новый
Купить новый
Купить новый
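The `temperature` and `top_p` arguments used in the cell above can be illustrated on a toy distribution. This is a minimal sketch of temperature scaling followed by top-p (nucleus) filtering; the function name `top_p_filter` and the example logits are assumptions, and the sketch ignores `num_beams` as well as the other details of `generate`.

```python
import math

def top_p_filter(logits, top_p=0.9, temperature=1.0):
    """Return next-token probabilities after temperature scaling and top-p filtering."""
    # Temperature > 1 flattens the distribution, < 1 sharpens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Zero out everything else and renormalize.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]

filtered = top_p_filter([2.0, 1.0, 0.0, -5.0], top_p=0.9, temperature=1.0)
print(filtered)
```

Sampling then draws the next token from the filtered distribution, so unlikely tokens in the tail can never be chosen.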


## Practice: A Visual Notebook to Using BERT for the First Time

*Credits: the first part of this notebook belongs to Jay Alammar and his [great blog post](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) (with minor changes). His blog is a great way to dive into DL and NLP concepts.*

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification.png" />

In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the text. The text is a list of sentences from film reviews, and we will classify each sentence as speaking either "positively" or "negatively" about its subject.

### Models: Sentence Sentiment Classification
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/sentiment-classifier-1.png" />

Under the hood, the model is actually made up of two models.

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit-learn, will take in the result of DistilBERT’s processing and classify the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.


<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" />

## Dataset
The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):


<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

## Installing the transformers library
The huggingface transformers library was installed at the top of this notebook with `!pip install transformers`, so here we only import it together with the other libraries we need.

In [83]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Part 1. Using BERT for text classification.

### Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [84]:
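import ssl

# Workaround for certificate errors on some machines: this disables HTTPS
# certificate verification for urllib, so use it only for trusted downloads.
ssl._create_default_https_context = ssl._create_unverified_context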

In [85]:
from urllib.request import urlopen

In [86]:
df = pd.read_csv(
    urlopen('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv'),
    delimiter = '\t',
    header = None
)
df.shape

(6920, 2)

In [87]:
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


In [88]:
df[0].values[0]

'a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films'

In [89]:
df[0].values[5]

'campanella gets the tone just right funny in the middle of sad in the middle of hopeful'

For performance reasons, we'll only use the first 2,000 sentences from the dataset.

In [90]:
batch_1 = df[:2000]

We can ask pandas how many sentences are labeled as "positive" (value 1) and how many are labeled as "negative" (value 0):

In [91]:
batch_1[1].value_counts()

1
1    1041
0     959
Name: count, dtype: int64

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model.

In [92]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Right now, the variable `model` holds a pretrained distilBERT model, a version of BERT that is smaller, much faster, and requires a lot less memory.

### Step #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences, that is, to break them up into words and subwords in the format BERT is comfortable with.

In [93]:
batch_1.head(2)

|   | 0 | 1 |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |


In [94]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

In [95]:
tokenized

0       [101, 1037, 18385, 1010, 6057, 1998, 2633, 182...
1       [101, 4593, 2128, 27241, 23931, 2013, 1996, 62...
2       [101, 2027, 3653, 23545, 2037, 4378, 24185, 10...
3       [101, 2023, 2003, 1037, 17453, 14726, 19379, 1...
4       [101, 5655, 6262, 1005, 1055, 12075, 2571, 376...
                              ...                        
1995    [101, 2205, 20857, 1998, 11865, 16643, 2135, 5...
1996    [101, 2009, 2515, 1050, 1005, 1056, 2147, 2004...
1997    [101, 2023, 2028, 8704, 2005, 1996, 11848, 199...
1998    [101, 1999, 1996, 2171, 1997, 2019, 9382, 1898...
1999    [101, 1996, 3185, 2003, 25757, 2011, 1037, 244...
Name: 0, Length: 2000, dtype: object

In [97]:
tokenized[1999]

[101,
 1996,
 3185,
 2003,
 25757,
 2011,
 1037,
 24466,
 16134,
 2008,
 1005,
 1055,
 2074,
 6388,
 2438,
 2000,
 7344,
 3686,
 1996,
 7731,
 4378,
 2096,
 13060,
 18856,
 17322,
 2094,
 2000,
 15015,
 10271,
 4641,
 102]

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

### Padding
After tokenization, `tokenized` is a list of sentences, where each sentence is represented as a list of token ids. We want BERT to process our examples all at once (as one batch), which is simply faster. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array rather than a list of lists of different lengths.

In [98]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

In [99]:
tokenized.map(len)

0       20
1       16
2       45
3       22
4       25
        ..
1995    16
1996    10
1997    13
1998    33
1999    31
Name: 0, Length: 2000, dtype: int64

Our dataset is now in the `padded` variable; we can view its dimensions below:

In [100]:
tokenized.shape

(2000,)

In [101]:
padded.shape

(2000, 59)

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [102]:
padded

array([[  101,  1037, 18385, ...,     0,     0,     0],
       [  101,  4593,  2128, ...,     0,     0,     0],
       [  101,  2027,  3653, ...,     0,     0,     0],
       ...,
       [  101,  2023,  2028, ...,     0,     0,     0],
       [  101,  1999,  1996, ...,     0,     0,     0],
       [  101,  1996,  3185, ...,     0,     0,     0]])

In [103]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 59)

In [104]:
attention_mask

array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]])
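Padding and masking are easier to follow on a tiny batch. The sketch below uses plain Python lists instead of NumPy, and the helper name `pad_and_mask` and the toy token ids are assumptions for illustration. Like the cells above, it relies on 0 being the padding id and never occurring as a real token.

```python
def pad_and_mask(sequences, pad_id=0):
    """Pad token-id lists to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

batch = [[101, 2023, 102], [101, 1996, 3185, 2003, 102]]
padded_batch, mask_batch = pad_and_mask(batch)
print(padded_batch)
print(mask_batch)
```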

### Step #2: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

The `model()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

In [105]:
print(model)

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

In [106]:
input_ids = torch.tensor(padded)
input_ids

tensor([[  101,  1037, 18385,  ...,     0,     0,     0],
        [  101,  4593,  2128,  ...,     0,     0,     0],
        [  101,  2027,  3653,  ...,     0,     0,     0],
        ...,
        [  101,  2023,  2028,  ...,     0,     0,     0],
        [  101,  1999,  1996,  ...,     0,     0,     0],
        [  101,  1996,  3185,  ...,     0,     0,     0]])

In [107]:
input_ids.shape

torch.Size([2000, 59])

In [108]:
attention_mask = torch.tensor(attention_mask)
attention_mask

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])

In [109]:
attention_mask.shape

torch.Size([2000, 59])

In [110]:
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask = attention_mask)

In [111]:
last_hidden_states

BaseModelOutput(last_hidden_state=tensor([[[-0.2159, -0.1403,  0.0083,  ..., -0.1369,  0.5867,  0.2011],
         [-0.2471,  0.2468,  0.1008,  ..., -0.1631,  0.9349, -0.0715],
         [ 0.0558,  0.3573,  0.4140,  ..., -0.2430,  0.1770, -0.5080],
         ...,
         [-0.0165,  0.1179,  0.3512,  ..., -0.2401,  0.2722, -0.1750],
         [ 0.0961,  0.0667,  0.3147,  ..., -0.3277,  0.3556, -0.2135],
         [ 0.0454,  0.0519,  0.3168,  ..., -0.2880,  0.1844, -0.1042]],

        [[-0.1726, -0.1448,  0.0022,  ..., -0.1744,  0.2139,  0.3720],
         [ 0.0022,  0.1684,  0.1269,  ..., -0.1888, -0.0195, -0.0283],
         [ 0.0257, -0.2458,  0.0717,  ..., -0.4339,  0.1622,  0.0133],
         ...,
         [ 0.0505, -0.0493,  0.0463,  ..., -0.0448, -0.0540,  0.3136],
         [-0.2128, -0.1907, -0.0215,  ...,  0.0139, -0.2433, -0.0202],
         [-0.1310, -0.1693,  0.1019,  ..., -0.0859, -0.1770, -0.0872]],

        [[-0.0506,  0.0720, -0.0296,  ..., -0.0715,  0.7185,  0.2623],
         [ 

Let's slice only the part of the output that we need. That is the output corresponding to the first token of each sentence. The way BERT does sentence classification is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features to our logistic regression model.

In [112]:
features = last_hidden_states[0][:, 0, :].numpy()

In [113]:
features.shape

(2000, 768)
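The slice `[:, 0, :]` keeps, for every sentence, only the vector at position 0, which is the `[CLS]` token. Here is the same selection on a toy nested list with shape (batch=2, seq_len=3, hidden=4); the values are made up for illustration.

```python
# Toy "hidden states": 2 sentences, 3 token positions, 4 hidden units each.
hidden = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Equivalent of hidden[:, 0, :] on a tensor: take position 0 of every sentence.
cls_vectors = [sentence[0] for sentence in hidden]

# The result has shape (batch, hidden), one vector per sentence.
print(len(cls_vectors), len(cls_vectors[0]))
print(cls_vectors)
```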

In [114]:
features

array([[-0.21593417, -0.14028908,  0.0083102 , ..., -0.13694823,
         0.58670104,  0.20112741],
       [-0.17262726, -0.14476164,  0.00223435, ..., -0.17442556,
         0.21386458,  0.37197515],
       [-0.05063373,  0.07203986, -0.02959692, ..., -0.0714896 ,
         0.71852374,  0.26225433],
       ...,
       [-0.27829772, -0.24803609,  0.13585803, ..., -0.19039151,
         0.13099574,  0.3497837 ],
       [-0.03667711,  0.10638539, -0.01110991, ..., -0.11206588,
         0.4161944 ,  0.5033802 ],
       [ 0.12402633,  0.01425154,  0.01038392, ..., -0.11606569,
         0.5345911 ,  0.27495334]], dtype=float32)

In [115]:
input_ids.shape

torch.Size([2000, 59])

In [116]:
last_hidden_states[0].shape

torch.Size([2000, 59, 768])

The labels indicating which sentences are positive and which are negative now go into the `labels` variable.

In [118]:
labels = batch_1[1]

### Step #3: Train/Test Split
Let's now split our dataset into a training set and a testing set (even though we're using 2,000 sentences from the SST2 training set).

In [119]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

### [Extra] Grid Search for Parameters
We can dive into logistic regression directly with the scikit-learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which is the inverse of the regularization strength (smaller values of C mean stronger regularization).
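For reference, one common way to write the L2-regularized logistic regression objective shows where C enters. This is a sketch under assumptions: the labels are recoded as $y_i \in \{-1, 1\}$, $x_i$ is the 768-dimensional feature vector of sentence $i$, and $w$ and $b$ are the weights and the intercept.

$$
\min_{w,\, b} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + b\right)\right)\right)
$$

A large C puts more weight on fitting the training data, while a small C lets the penalty term $\frac{1}{2} w^\top w$ dominate and shrinks the weights.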

In [120]:
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best scores: ', grid_search.best_score_)

best parameters:  {'C': 5.263252631578947}
best scores:  0.8133333333333332


We now train the LogisticRegression model. If you've chosen to do the gridsearch, you can plug the value of C into the model declaration (e.g. `LogisticRegression(C=5.2)`).

In [121]:
lr_clf = LogisticRegression(C = grid_search.best_params_['C'])
lr_clf.fit(train_features, train_labels)

<img src="https://jalammar.github.io/images/distilBERT/bert-training-logistic-regression.png" />

### Step #4:  Evaluating Model
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [122]:
lr_clf.score(test_features, test_labels)

0.814

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [123]:
train_features.shape

(1500, 768)

In [124]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.512 (+/- 0.00)
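A score close to 0.5 is what we would expect from the class balance alone. The sketch below assumes the dummy classifier always predicts the most frequent class and computes that baseline from the label counts of the full 2,000-sentence batch shown earlier (1041 positive, 959 negative). The reported 0.512 was measured on cross-validation folds of the training split, so it differs slightly.

```python
from collections import Counter

# Label counts taken from batch_1[1].value_counts() above.
labels = [1] * 1041 + [0] * 959

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_label, round(baseline_accuracy, 4))
```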


So our model clearly does better than a dummy classifier. But how does it compare against the best models?

### Proper SST2 scores
For reference, the [highest accuracy score](http://nlpprogress.com/english/sentiment_analysis.html) for this dataset is currently **96.8**. DistilBERT can be trained to improve its score on this task – a process called **fine-tuning** which updates BERT’s weights to make it achieve a better performance in this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT turns out to achieve an accuracy score of **90.7**. The full size BERT model achieves **94.9**.



And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at [fine-tuning](https://huggingface.co/transformers/examples.html#glue). You can also go back and switch from distilBERT to BERT and see how that works.