## Practice #13 - A Visual Notebook to Using BERT for the First Time

*Credits: the first part of this notebook is adapted, with minor changes, from Jay Alammar's [great blog post](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/). His blog is a great way to dive into DL and NLP concepts.*

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification.png" />

In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the text. The text is a list of sentences from film reviews, and we will classify each sentence as speaking either "positively" or "negatively" about its subject.

### Models: Sentence Sentiment Classification
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/sentiment-classifier-1.png" />

Under the hood, the model is actually made up of two models.

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit-learn, will take in the result of DistilBERT’s processing and classify the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.


<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" />

## Dataset
The dataset we will use in this example is a list of Russian proverbs.

## Installing the transformers library
Let's start by installing the Hugging Face `transformers` library so we can load our deep learning NLP model.

In [None]:
# !pip install transformers

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from transformers import DistilBertModel, DistilBertTokenizer, pipeline
import torch

import warnings
warnings.filterwarnings('ignore')

## Part 1. Using BERT for text classification.

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model. Download [DeepPavlov Conversational DistilRuBERT](http://docs.deeppavlov.ai/en/master/features/models/bert.html).

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained("./distil_ru_conversational_cased_L-6_H-768_A-12_pt/")
model = DistilBertModel.from_pretrained("./distil_ru_conversational_cased_L-6_H-768_A-12_pt/")

In [None]:
unmasker = pipeline('fill-mask', './distil_ru_conversational_cased_L-6_H-768_A-12_pt/')

In [None]:
unmasker("Привет, [MASK] зовут [MASK].")

### Importing the dataset
We'll read the dataset from a plain text file into a Python list, one proverb per line.

In [None]:
input_data_path = 'ru_proverbs.txt'

with open(input_data_path, 'r') as f:
    proverbs = f.readlines()

In [None]:
len(proverbs)

Right now, the variable `model` holds a pretrained DistilBERT model -- a version of BERT that is smaller, much faster, and requires a lot less memory.

### Step #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

In [None]:
random_idx = np.random.randint(0, len(proverbs), 1)[0]
query_line = proverbs[random_idx]


random_idx = np.random.randint(0, len(proverbs), 1)[0]
target_line = proverbs[random_idx]


query_line, target_line

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />


In [None]:
q_tokens = tokenizer.encode(query_line, add_special_tokens=False)
t_tokens = tokenizer.encode(target_line, add_special_tokens=False)

In [None]:
q_tokens, t_tokens
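
To make the idea of subword tokenization concrete, here is a minimal sketch of greedy longest-match-first splitting, the scheme that WordPiece-style tokenizers use. The vocabulary `toy_vocab` and the helper `wordpiece_split` are hypothetical and exist only for illustration; the tokenizer loaded above has its own vocabulary and additional normalization steps.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real model vocabulary.
toy_vocab = {"волк", "не", "пти", "##ца"}

tokens = [
    piece
    for word in "волк не птица".split()
    for piece in wordpiece_split(word, toy_vocab)
]
print(tokens)
```

Words found whole in the vocabulary stay intact, while "птица" is split into a leading piece and a `##`-prefixed continuation. A real tokenizer then maps each piece to its integer id, which is what `tokenizer.encode` returns.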

In [None]:
new_token = q_tokens[:len(q_tokens) // 2] + t_tokens[len(t_tokens) // 2:]

In [None]:
new_line = tokenizer.decode(new_token)

In [None]:
new_line

In [None]:
def generate_fake_proverbs(db_size):
    lines = []
    
    for _ in range(db_size):
        random_idx = np.random.randint(0, len(proverbs), 1)[0]
        query_line = proverbs[random_idx]

        random_idx = np.random.randint(0, len(proverbs), 1)[0]
        target_line = proverbs[random_idx]
        
        q_tokens = tokenizer.encode(query_line, add_special_tokens=False)
        t_tokens = tokenizer.encode(target_line, add_special_tokens=False)
        
        new_token = q_tokens[:len(q_tokens) // 2] + t_tokens[len(t_tokens) // 2:]
        new_line = tokenizer.decode(new_token)
        
        lines.append(new_line)
    
    return lines
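
The splice at the heart of `generate_fake_proverbs` is easier to see on small lists. The sketch below uses made-up token ids and a hypothetical helper `splice_halves`. It also draws two different indices, so a proverb is never spliced with itself, which the function above does not guarantee.

```python
import random


def splice_halves(first, second):
    """Join the first half of `first` with the second half of `second`."""
    return first[:len(first) // 2] + second[len(second) // 2:]


# Made-up token ids standing in for two tokenized proverbs.
q_tokens = [11, 12, 13, 14, 15]
t_tokens = [21, 22, 23, 24]

new_token = splice_halves(q_tokens, t_tokens)
print(new_token)  # [11, 12, 23, 24]

# Sampling without replacement guarantees two different source lines.
rng = random.Random(0)
i, j = rng.sample(range(10), 2)
print(i != j)  # True
```

If the two indices happen to coincide in the original function, the "fake" proverb is simply the real one decoded back, which adds a little label noise to the negative class.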

In [None]:
data_size = 1000

fake_proverbs = generate_fake_proverbs(data_size // 2)

In [None]:
random_idxs = np.random.randint(0, len(proverbs), data_size // 2)

train_data = np.hstack([np.array(proverbs)[random_idxs], fake_proverbs])

In [None]:
train_data.shape

In [None]:
tokenized = [tokenizer.encode(x, add_special_tokens=True) for x in train_data]


### Padding
After tokenization, `tokenized` is a list of sentences -- each sentence is represented as a list of tokens. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).

In [None]:
max_len = max([len(x) for x in tokenized])
print(max_len)

padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized])

In [None]:
np.array(padded).shape
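
To see what the padding step produces, here is the same idea on three made-up token-id sequences. The ids are invented for illustration, and the sketch assumes, as the cell above does, that the padding id is 0; for a real tokenizer it is worth checking `tokenizer.pad_token_id`.

```python
import numpy as np

# Three made-up token-id sequences of different lengths.
tokenized = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 102]]

# Pad every sequence on the right with 0 up to the longest length.
max_len = max(len(ids) for ids in tokenized)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

print(padded.shape)  # (3, 5)
print(padded)
```

Every row now has the same length, so the batch fits in a single rectangular array.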

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape
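
On a small example the mask is easy to read: 1 marks a real token and 0 marks padding. This sketch again uses invented ids and assumes that 0 is used only for padding.

```python
import numpy as np

# A tiny padded batch with made-up token ids.
padded = np.array([[101, 7, 8, 102, 0],
                   [101, 9, 102, 0, 0]])

# 1 where there is a real token, 0 where there is padding.
attention_mask = np.where(padded != 0, 1, 0)
print(attention_mask)

# Summing each row of the mask recovers the unpadded lengths.
lengths = attention_mask.sum(axis=1)
print(lengths)
```

Because the mask has the same shape as `padded`, the two arrays can be passed to the model side by side.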

### Step #2: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

Calling `model()` runs our sentences through BERT. The results of the processing are stored in `last_hidden_states`.

In [None]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. To do sentence classification, BERT adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [None]:
input_ids.shape

last_hidden_states[0].shape

features = last_hidden_states[0][:,0,:].numpy()
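
The slice `[:, 0, :]` keeps every sentence, only the first token position, and every hidden unit. In the sketch below a random NumPy array stands in for `last_hidden_states[0]`; the batch size and sequence length are arbitrary illustrative values, and only the hidden size of 768 comes from the text.

```python
import numpy as np

# Stand-in for the model output: (batch, sequence length, hidden size).
batch_size, seq_len, hidden_size = 4, 6, 768
rng = np.random.default_rng(0)
hidden = rng.normal(size=(batch_size, seq_len, hidden_size))

# All sentences, first position only ([CLS]), all hidden units.
features = hidden[:, 0, :]
print(features.shape)  # (4, 768)
```

The result is a 2-d array with one 768-dimensional row per sentence, which is the shape scikit-learn expects for its feature matrix.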

The labels now go into the `labels` variable: 1 for the real proverbs in the first half of `train_data` and 0 for the generated ones in the second half.

In [None]:
labels = np.ones(len(features))
labels[len(features) // 2:] = 0

### Step #3: Train/Test Split
Let's now split our dataset into a training set and a testing set (here, the 1,000 real and generated proverbs prepared above).

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

In [None]:
lr_clf.score(test_features, test_labels)
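
With no extra arguments, `train_test_split` holds out 25% of the rows and shuffles them with a fresh random state on every run. One possible variant, sketched here on synthetic features rather than real BERT embeddings, fixes the seed and stratifies by label so that both classes keep the same proportions in each split. The cluster data and the 8-dimensional feature size are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the sentence embeddings: two shifted clusters.
features = np.vstack([
    rng.normal(1.0, 1.0, size=(100, 8)),
    rng.normal(-1.0, 1.0, size=(100, 8)),
])
labels = np.array([1] * 100 + [0] * 100)

# Fixed seed for reproducibility, stratified to keep the classes balanced.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(train_features, train_labels)
accuracy = clf.score(test_features, test_labels)
print(test_labels.mean(), accuracy)
```

A fixed `random_state` makes the reported score comparable between runs of the notebook.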

### Estimate results 

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

plt.figure(figsize=(10, 6))

proba = lr_clf.predict_proba(train_features)[:, 1]
auc = roc_auc_score(train_labels, proba)
plt.plot(*roc_curve(train_labels, proba)[:2], label=f'train AUC={auc:.4f}')

proba = lr_clf.predict_proba(test_features)[:, 1]
auc = roc_auc_score(test_labels, proba)
plt.plot(*roc_curve(test_labels, proba)[:2], label=f'test AUC={auc:.4f}')

plt.legend()
plt.show()
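
The AUC shown in the legend has a simple interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force on made-up labels and scores; `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and classifier scores.
toy_labels = [1, 1, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.3, 0.6, 0.7]

auc = pairwise_auc(toy_labels, toy_scores)
print(round(auc, 4))  # 0.8333
```

Five of the six positive/negative pairs are ordered correctly here, giving 5/6. A score of 0.5 corresponds to random ranking and 1.0 to a perfect one.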

### Inference 

In [None]:
text_phrase = 'Волк не птица - яйца не отложит!'
token = tokenizer(text_phrase, return_tensors="pt")

with torch.no_grad():
    last_hidden_states = model(**token)

In [None]:
feat = last_hidden_states[0][0][0].numpy()

In [None]:
lr_clf.predict_proba([feat])

In [None]:
lr_clf.predict([feat])

## Part 2. Use all data and explore results

Generate more samples

In [None]:
# YOUR CODE


Iterate by batches

In [None]:
# YOUR CODE


#### Build sklearn model

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

In [None]:
lr_clf.score(test_features, test_labels)

In [None]:
plt.figure(figsize=(10, 6))

proba = lr_clf.predict_proba(train_features)[:, 1]
auc = roc_auc_score(train_labels, proba)
plt.plot(*roc_curve(train_labels, proba)[:2], label=f'train AUC={auc:.4f}')

proba = lr_clf.predict_proba(test_features)[:, 1]
auc = roc_auc_score(test_labels, proba)
plt.plot(*roc_curve(test_labels, proba)[:2], label=f'test AUC={auc:.4f}')

plt.legend()
plt.show()

#### Explore results

1. Get the best cherry-picks from the fake proverbs
2. Create your own proverb

### Single phrase

In [None]:
# YOUR CODE

text_phrase = 'Волк не птица - в лес зайдет'
token = tokenizer(text_phrase, return_tensors="pt")

with torch.no_grad():
    last_hidden_states = model(**token)

In [None]:
feat = last_hidden_states[0][0][0].numpy()

In [None]:
lr_clf.predict_proba([feat])

In [None]:
lr_clf.predict([feat])

### Cherry-pick

In [None]:
# YOUR CODE
