## BERT Tutorial
### How to leverage BERT models for NLP use cases

<img src='https://towardsml.files.wordpress.com/2019/09/bert.png?w=1400' width=450>

#### Many pre-trained models are available - choose one that fits your computational resources

Link: https://github.com/google-research/bert/

<img src='data/bert_models.png' width=600>


#### I will use DistilBERT - a smaller BERT that reaches a similar level of performance

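DistilBERT
- 66M parameters (BERT base: 110M)
- Layers / Hidden dimensions / Attention heads: 6 / 768 / 12 (BERT base: 12 / 768 / 12)
- Performance: retains about 97% of BERT's language-understanding performance, as reported by the DistilBERT authors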

Complete documentation: https://huggingface.co/docs/transformers/model_doc/distilbert#distilbert (actually very user friendly)

In [17]:
#!pip install transformers

# for pytorch
from transformers import DistilBertTokenizer, DistilBertModel, DistilBertForSequenceClassification, pipeline

# for tensorflow
from transformers import DistilBertTokenizer, TFDistilBertModel, TFDistilBertForSequenceClassification, pipeline

# tensorflow helpers
from tensorflow.keras.utils import plot_model

RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
type object 'h5py.h5.H5PYConfig' has no attribute '__reduce_cython__'

In [16]:
!pip install --upgrade numpy==1.19.2

Collecting numpy==1.19.2
  Downloading numpy-1.19.2-cp38-cp38-win_amd64.whl (13.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.2
    Uninstalling numpy-1.21.2:
      Successfully uninstalled numpy-1.21.2


ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\krist\\AppData\\Local\\Temp\\pip-uninstall-p0ob9ox4\\core\\_multiarray_tests.cp38-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



- DistilBertTokenizer: splits the input text into tokens and maps them to token IDs
- DistilBertModel: the pre-trained base model; turns token IDs into contextual embeddings (one vector per token)
    - if using TensorFlow: TFDistilBertModel
- DistilBertForSequenceClassification: DistilBertModel with a classification head on top of the embeddings
    - if using TensorFlow: TFDistilBertForSequenceClassification
- pipeline: high-level helper that bundles tokenizer, model and post-processing for ready-made tasks (e.g. 'fill-mask')

## 1. Try out [MASK] performance

In [4]:
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')

In [5]:
unmasker('She wanted to go to [MASK].', top_k = 5)

[{'score': 0.10261130332946777,
  'token': 3637,
  'token_str': 'sleep',
  'sequence': 'she wanted to go to sleep.'},
 {'score': 0.06890508532524109,
  'token': 6014,
  'token_str': 'heaven',
  'sequence': 'she wanted to go to heaven.'},
 {'score': 0.05547202005982399,
  'token': 2793,
  'token_str': 'bed',
  'sequence': 'she wanted to go to bed.'},
 {'score': 0.02979607693850994,
  'token': 7173,
  'token_str': 'jail',
  'sequence': 'she wanted to go to jail.'},
 {'score': 0.024603189900517464,
  'token': 2267,
  'token_str': 'college',
  'sequence': 'she wanted to go to college.'}]
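
The `score` values in the output above are softmax probabilities over the whole vocabulary at the `[MASK]` position, and `top_k` keeps only the largest ones. A minimal sketch of that post-processing step, using a hypothetical five-word vocabulary and made-up logits (neither comes from the real model):

```python
import numpy as np

def top_k_from_logits(logits, vocab, k=5):
    """Return the k most probable tokens with their softmax probabilities."""
    logits = np.asarray(logits, dtype=float)
    # numerically stable softmax over the whole (toy) vocabulary
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # indices of the k largest probabilities, highest first
    top = np.argsort(probs)[::-1][:k]
    return [{'score': float(probs[i]), 'token': int(i), 'token_str': vocab[i]}
            for i in top]

# hypothetical vocabulary and made-up logits for the [MASK] position
toy_vocab = ['sleep', 'heaven', 'bed', 'jail', 'college']
toy_logits = [2.0, 1.6, 1.4, 0.8, 0.6]
toy_predictions = top_k_from_logits(toy_logits, toy_vocab, k=3)
print(toy_predictions)
```

With the real model the probability mass is spread over roughly 30,000 vocabulary entries, which is why even the best candidate above only reaches a score of about 0.10.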

In [6]:
unmasker("I can't find my [MASK] .", top_k = 5)

[{'score': 0.04563814401626587,
  'token': 21714,
  'token_str': 'bearings',
  'sequence': "i can't find my bearings."},
 {'score': 0.029267553240060806,
  'token': 3042,
  'token_str': 'phone',
  'sequence': "i can't find my phone."},
 {'score': 0.024135015904903412,
  'token': 3437,
  'token_str': 'answer',
  'sequence': "i can't find my answer."},
 {'score': 0.023151546716690063,
  'token': 6998,
  'token_str': 'answers',
  'sequence': "i can't find my answers."},
 {'score': 0.02219865471124649,
  'token': 3611,
  'token_str': 'dad',
  'sequence': "i can't find my dad."}]

In [7]:
unmasker("I wish I had a [MASK].", top_k = 5)

[{'score': 0.0803661048412323,
  'token': 6898,
  'token_str': 'boyfriend',
  'sequence': 'i wish i had a boyfriend.'},
 {'score': 0.04185345396399498,
  'token': 3336,
  'token_str': 'baby',
  'sequence': 'i wish i had a baby.'},
 {'score': 0.030962727963924408,
  'token': 6513,
  'token_str': 'girlfriend',
  'sequence': 'i wish i had a girlfriend.'},
 {'score': 0.024552376940846443,
  'token': 3382,
  'token_str': 'chance',
  'sequence': 'i wish i had a chance.'},
 {'score': 0.020696144551038742,
  'token': 3959,
  'token_str': 'dream',
  'sequence': 'i wish i had a dream.'}]

In [8]:
unmasker("The black woman worked as a [MASK].")

[{'score': 0.13283953070640564,
  'token': 13877,
  'token_str': 'waitress',
  'sequence': 'the black woman worked as a waitress.'},
 {'score': 0.1258619725704193,
  'token': 6821,
  'token_str': 'nurse',
  'sequence': 'the black woman worked as a nurse.'},
 {'score': 0.11708834767341614,
  'token': 10850,
  'token_str': 'maid',
  'sequence': 'the black woman worked as a maid.'},
 {'score': 0.11499997973442078,
  'token': 19215,
  'token_str': 'prostitute',
  'sequence': 'the black woman worked as a prostitute.'},
 {'score': 0.047227732837200165,
  'token': 22583,
  'token_str': 'housekeeper',
  'sequence': 'the black woman worked as a housekeeper.'}]

In [9]:
unmasker("The white man worked as a [MASK].")

[{'score': 0.12353688478469849,
  'token': 20987,
  'token_str': 'blacksmith',
  'sequence': 'the white man worked as a blacksmith.'},
 {'score': 0.10142599791288376,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'the white man worked as a carpenter.'},
 {'score': 0.04984995722770691,
  'token': 7500,
  'token_str': 'farmer',
  'sequence': 'the white man worked as a farmer.'},
 {'score': 0.0393255352973938,
  'token': 18594,
  'token_str': 'miner',
  'sequence': 'the white man worked as a miner.'},
 {'score': 0.03351772576570511,
  'token': 14998,
  'token_str': 'butcher',
  'sequence': 'the white man worked as a butcher.'}]

In [10]:
unmasker("Black people are [MASK].")

[{'score': 0.056957732886075974,
  'token': 12421,
  'token_str': 'excluded',
  'sequence': 'black people are excluded.'},
 {'score': 0.0329122394323349,
  'token': 22216,
  'token_str': 'enslaved',
  'sequence': 'black people are enslaved.'},
 {'score': 0.0325375497341156,
  'token': 8135,
  'token_str': 'christians',
  'sequence': 'black people are christians.'},
 {'score': 0.02683648094534874,
  'token': 14302,
  'token_str': 'minorities',
  'sequence': 'black people are minorities.'},
 {'score': 0.017561450600624084,
  'token': 27666,
  'token_str': 'persecuted',
  'sequence': 'black people are persecuted.'}]

## 2. Get features (embeddings) of tokens

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer

return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
        If set, will return tensors instead of list of python integers. Acceptable values are:

        - `'tf'`: Return TensorFlow `tf.constant` objects.
        - `'pt'`: Return PyTorch `torch.Tensor` objects.
        - `'np'`: Return Numpy `np.ndarray` objects.

In [None]:
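# PyTorch (needed for the cells further down that pass return_tensors='pt')
# model = DistilBertModel.from_pretrained("distilbert-base-uncased")

# TensorFlow (needed for model.summary(), model.layers and plot_model below)
model = TFDistilBertModel.from_pretrained("distilbert-base-uncased")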

In [None]:
model.config

In [None]:
model.summary()

In [None]:
model.layers[0].transformer.layer.layers

In [None]:
for i in model.layers[0].transformer.layer.layers[0].variables:
    print(i.name)

In [10]:
import pydot
import graphviz

In [16]:
plot_model(model)

('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')


In [None]:
!pip install --upgrade pydot
!pip install --upgrade graphviz

In [405]:
text = ["I went to the river bank.", 
        "I work at an investment bank.", 
        "Can we go to the bank?"]
encoded_input = tokenizer(text, return_tensors='pt', padding = True)

In [406]:
encoded_input

{'input_ids': tensor([[ 101, 1045, 2253, 2000, 1996, 2314, 2924, 1012,  102],
        [ 101, 1045, 2147, 2012, 2019, 5211, 2924, 1012,  102],
        [ 101, 2064, 2057, 2175, 2000, 1996, 2924, 1029,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]])}

What does the tokenized text look like?

In [407]:
for i in range(encoded_input['input_ids'].shape[0]):
    print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][i])) 

['[CLS]', 'i', 'went', 'to', 'the', 'river', 'bank', '.', '[SEP]']
['[CLS]', 'i', 'work', 'at', 'an', 'investment', 'bank', '.', '[SEP]']
['[CLS]', 'can', 'we', 'go', 'to', 'the', 'bank', '?', '[SEP]']


In [408]:
for i in range(encoded_input['input_ids'].shape[0]):
    print(len(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][i])) )

9
9
9


How is the output stored?

In [409]:
output = model(**encoded_input)

In [410]:
output

BaseModelOutput(last_hidden_state=tensor([[[ 0.0416, -0.0590, -0.0143,  ...,  0.0987,  0.2583,  0.3616],
         [ 0.4119, -0.1251, -0.0900,  ..., -0.2270,  0.4574,  0.1204],
         [-0.1136, -0.3707,  0.0933,  ..., -0.0712, -0.2049,  0.1402],
         ...,
         [ 0.2677, -0.0853, -0.0798,  ..., -0.2010, -0.4547, -0.1793],
         [ 0.3108, -0.2082, -0.3871,  ...,  0.1783, -0.0147, -0.4415],
         [ 0.8262,  0.2993, -0.2272,  ...,  0.0407, -0.3076, -0.4611]],

        [[ 0.1690, -0.0159, -0.0666,  ..., -0.1088,  0.4084,  0.2130],
         [ 0.6823,  0.0927, -0.2938,  ..., -0.3990,  0.6445,  0.0726],
         [ 0.4919,  0.4326,  0.1071,  ..., -0.6932,  0.2243, -0.2963],
         ...,
         [ 0.2798, -0.1044, -0.1349,  ..., -0.2584,  0.0653, -0.2844],
         [ 0.3263, -0.5069, -0.4628,  ...,  0.1052,  0.1265, -0.6579],
         [ 0.7799,  0.0762, -0.3130,  ...,  0.0021, -0.3228, -0.5257]],

        [[ 0.2355, -0.0308,  0.0814,  ...,  0.0441,  0.3587,  0.2963],
         [ 

In [411]:
output[0].shape

torch.Size([3, 9, 768])

In [412]:
output[0][0].shape

torch.Size([9, 768])

In [413]:
output[0][1].shape

torch.Size([9, 768])

In [414]:
output[0][2].shape

torch.Size([9, 768])

Words with multiple meanings

In [419]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

In [415]:
bank_river = output[0][0][6]
bank_financial = output[0][1][6]
bank_universal = output[0][2][6]

In [417]:
bank_matrix = np.concatenate((bank_river.detach().numpy().reshape(1, 768), 
                              bank_financial.detach().numpy().reshape(1, 768), 
                              bank_universal.detach().numpy().reshape(1, 768)))

In [422]:
pd.DataFrame(cosine_similarity(bank_matrix), 
             columns=['River', 'Investment', 'Universal'],
             index=['River', 'Investment', 'Universal'])

| | River | Investment | Universal |
|---|---|---|---|
| River | 1.0 | 0.7082 | 0.775612 |
| Investment | 0.7082 | 1.0 | 0.839302 |
| Universal | 0.775612 | 0.839302 | 1.0 |
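
The similarity matrix above compares the three "bank" vectors by the cosine of the angle between them, ignoring their lengths. As a rough illustration of what `cosine_similarity` computes for one pair, here is a sketch on toy 3-dimensional vectors (the helper name `cosine_sim` and the vectors are made up for this example):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy stand-ins for the 768-dimensional token embeddings
v1 = [1.0, 0.0, 0.0]
v2 = [2.0, 0.0, 0.0]   # same direction as v1, different length
v3 = [0.0, 1.0, 0.0]   # orthogonal to v1

print(cosine_sim(v1, v2))  # same direction
print(cosine_sim(v1, v3))  # orthogonal
```

A value close to 1 means the two vectors point in nearly the same direction; the diagonal of the matrix is 1 because each vector is compared with itself.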


In [426]:
text = ["My date went great last night!", 
        "What's today's date?", 
        "This date is too sour to eat."]
encoded_input = tokenizer(text, return_tensors='pt', padding = True)

for i in range(encoded_input['input_ids'].shape[0]):
    print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][i])) 
    
output = model(**encoded_input)

date_rel = output[0][0][2]
date_time = output[0][1][7]
date_food = output[0][2][2]

date_matrix = np.concatenate((date_rel.detach().numpy().reshape(1, 768), 
                              date_time.detach().numpy().reshape(1, 768), 
                              date_food.detach().numpy().reshape(1, 768)))

pd.DataFrame(cosine_similarity(date_matrix), 
             columns=['Relationship', 'Calendar', 'Food'],
             index=['Relationship', 'Calendar', 'Food'])

['[CLS]', 'my', 'date', 'went', 'great', 'last', 'night', '!', '[SEP]', '[PAD]']
['[CLS]', 'what', "'", 's', 'today', "'", 's', 'date', '?', '[SEP]']
['[CLS]', 'this', 'date', 'is', 'too', 'sour', 'to', 'eat', '.', '[SEP]']


| | Relationship | Calendar | Food |
|---|---|---|---|
| Relationship | 1.0 | 0.825439 | 0.743567 |
| Calendar | 0.825439 | 1.0 | 0.771033 |
| Food | 0.743567 | 0.771033 | 1.0 |


## 3. Transfer learning without fine-tuning - sentiment classification

<img src='https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification-example.png' width=800>


In [442]:
import torch

# already loaded
# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# model = DistilBertModel.from_pretrained("distilbert-base-uncased")

In [440]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', 
                 delimiter='\t', header=None)
df.columns = ['review', 'label']

df.head()

| | review | label |
|---|---|---|
| 0 | a stirring , funny and finally transporting re... | 1 |
| 1 | apparently reassembled from the cutting room f... | 0 |
| 2 | they presume their audience wo n't sit still f... | 0 |
| 3 | this is a visually stunning rumination on love... | 1 |
| 4 | jonathan parker 's bartleby should have been t... | 1 |


#### 1. Tokenization

In [496]:
%%time 
tokenized = df['review'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

Wall time: 5.84 s


In [497]:
print('Max length:', tokenized.map(len).max())
print('Median length:', tokenized.map(len).median())
print('Mean length:', tokenized.map(len).mean())

Max length: 67
Median length: 22.0
Mean length: 23.341907514450867
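
These length statistics matter when picking a maximum sequence length: anything longer gets cut off. One way to check a candidate value is to compute the share of reviews it would truncate. A small sketch with made-up token counts (in the notebook the input would be `tokenized.map(len)`; the helper name `share_truncated` is hypothetical):

```python
def share_truncated(lengths, max_len):
    """Fraction of sequences longer than max_len (these would be truncated)."""
    lengths = list(lengths)
    return sum(1 for n in lengths if n > max_len) / len(lengths)

# made-up token counts, not the real SST-2 distribution
toy_lengths = [8, 15, 22, 22, 25, 31, 40, 67]

for candidate in (30, 50, 67):
    print(candidate, share_truncated(toy_lengths, candidate))
```

A shorter maximum length makes the model run faster but loses the tail of long reviews, so the choice is a trade-off between speed and coverage.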


Use a tokenizer function that pads (or truncates) the token IDs to a fixed length and also returns an attention mask (1 = real token the model should attend to, 0 = padding to ignore)

In [505]:
MAX_LEN = 30

def bert_tokenizer(text):
    
    encoded_text = tokenizer.encode_plus(text,  max_length = MAX_LEN, truncation=True,  padding='max_length',  
                                         return_attention_mask=True, return_tensors='pt')
    
    return encoded_text['input_ids'][0].numpy(), encoded_text['attention_mask'][0].numpy()

In [506]:
bert_tokenizer('Sample test that will be padded')

(array([  101,  7099,  3231,  2008,  2097,  2022, 20633,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0], dtype=int64),
 array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0], dtype=int64))
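
What the two arrays above contain can be reproduced without the library. The sketch below pads a list of token IDs with 0 (the `[PAD]` ID seen in the output) and builds the matching mask; the truncation rule, which keeps the final `[SEP]` token, is a simplifying assumption and not a copy of the tokenizer's internals:

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Pad or truncate token_ids to max_len and build the attention mask."""
    ids = list(token_ids)
    if len(ids) > max_len:
        # assumption: keep the final token ([SEP]) when cutting the sequence
        ids = ids[:max_len - 1] + ids[-1:]
    mask = [1] * len(ids)            # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding

# IDs from the output above: [CLS] sample test that will be padded [SEP]
sample_ids = [101, 7099, 3231, 2008, 2097, 2022, 20633, 102]
padded_ids, attn_mask = pad_and_mask(sample_ids, max_len=10)
print(padded_ids)
print(attn_mask)
```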

In [539]:
%%time

tokenized_padded, attention_masks = zip(*df['review'].apply(lambda x: bert_tokenizer(x)))

Wall time: 8.16 s


In [544]:
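# stack the per-review arrays into one 2D array first; building a tensor
# directly from a tuple of numpy arrays is slow
input_ids = torch.tensor(np.array(tokenized_padded))
attention_mask = torch.tensor(np.array(attention_masks))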

print(input_ids.shape)
print(attention_mask.shape)

torch.Size([6920, 30])
torch.Size([6920, 30])


#### 2. Apply BERT on tokens

In [547]:
%%time

with torch.no_grad(): # no need to keep track of gradients
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Wall time: 5min 50s
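
The cell above pushes all 6920 reviews through the model in a single call, so every intermediate activation has to fit in memory at once. A common alternative is to process the data in mini-batches and concatenate the results. The sketch below shows the pattern with a toy stand-in for the model; `embed_in_batches`, `toy_embed_fn` and the batch size are hypothetical. In the notebook, `embed_fn` would wrap the model call under `torch.no_grad()` and return a numpy array.

```python
import numpy as np

def embed_in_batches(input_ids, attention_mask, embed_fn, batch_size=64):
    """Apply embed_fn to consecutive slices of the inputs and stack the results."""
    chunks = []
    for start in range(0, len(input_ids), batch_size):
        stop = start + batch_size
        chunks.append(embed_fn(input_ids[start:stop], attention_mask[start:stop]))
    return np.concatenate(chunks, axis=0)

def toy_embed_fn(ids, mask):
    """Stand-in for the model: masked mean of the IDs, one number per row."""
    return (ids * mask).sum(axis=1, keepdims=True) / mask.sum(axis=1, keepdims=True)

toy_ids = np.arange(40).reshape(10, 4)
toy_mask = np.ones_like(toy_ids)

batched = embed_in_batches(toy_ids, toy_mask, toy_embed_fn, batch_size=3)
print(batched.shape)
```

The result is the same as a single call over all rows; only the peak memory use changes.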


In [551]:
last_hidden_states[0].shape

torch.Size([6920, 30, 768])

6920 sentences, 30 tokens per sentence (padded or truncated to MAX_LEN), 768 dimensions per token

#### 3. Get sentence embeddings out of the resulting tensor


<img src='https://camo.githubusercontent.com/6c2185c7620a3fe52f1968752febb6467723f4485c257442d3b0ed03bb0da197/68747470733a2f2f6a616c616d6d61722e6769746875622e696f2f696d616765732f64697374696c424552542f626572742d6f75747075742d74656e736f722d73656c656374696f6e2e706e67' width=1000>


In [556]:
X = last_hidden_states[0][:,0,:].numpy()
y = df['label']

print(X.shape)
print(y.shape)

(6920, 768)
(6920,)
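
The slice `[:, 0, :]` used above keeps every sentence, only the first token (`[CLS]`), and every hidden dimension. The same indexing on a small random array (a toy stand-in for the real `[6920, 30, 768]` tensor) shows how the token axis disappears:

```python
import numpy as np

# toy stand-in for last_hidden_states[0]: 4 sentences, 5 tokens, 3 dimensions
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 5, 3))

# every sentence, first token only ([CLS]), every dimension
cls_vectors = hidden[:, 0, :]

print(hidden.shape)       # (4, 5, 3)
print(cls_vectors.shape)  # (4, 3)
```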


#### 4. Fit model, evaluate

In [560]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

In [558]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 91, train_size = 0.8)

In [564]:
logit = LogisticRegression(max_iter = 1000).fit(X_train, y_train)

In [571]:
y_pred_class = logit.predict(X_test)
y_pred_prob = logit.predict_proba(X_test)[:, 1]

In [579]:
print('Ratio of positive class:', y.value_counts()[1] / df.shape[0])
print('Accuracy:', accuracy_score(y_test, y_pred_class))
print('AUC:', roc_auc_score(y_test, y_pred_prob))

Ratio of positive class: 0.5216763005780347
Accuracy: 0.8345375722543352
AUC: 0.9162513712584235
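
To make the two metrics concrete: accuracy is the share of correct class predictions at a 0.5 threshold, while AUC is the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. A hand-rolled sketch on four made-up examples (the helper names and data are hypothetical; the notebook itself uses the scikit-learn functions):

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # a tie between a positive and a negative counts as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_true = [1, 0, 1, 0]
toy_prob = [0.9, 0.4, 0.35, 0.2]
toy_pred = [int(p >= 0.5) for p in toy_prob]

print('Accuracy:', accuracy(toy_true, toy_pred))
print('AUC:', auc(toy_true, toy_prob))
```

Since about 52% of the reviews are positive, always predicting the positive class would give roughly 0.52 accuracy; that is the baseline against which the 0.83 above should be read.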


#### 5. Predict sentiment of any text

In [638]:
def predict_sentiment(text):
    
    _tokenized, _attention_mask = bert_tokenizer(text)

    # add a batch dimension: (MAX_LEN,) -> (1, MAX_LEN)
    _tokenized = torch.from_numpy(_tokenized).reshape(1, MAX_LEN)
    _attention_mask = torch.from_numpy(_attention_mask).reshape(1, MAX_LEN)

    with torch.no_grad(): # inference only, no need to keep track of gradients
        _last_hidden_state = model(_tokenized, attention_mask = _attention_mask)

    # embedding of the [CLS] token (position 0) is the sentence feature vector
    _X = _last_hidden_state[0][:,0,:].numpy()

    #predicted_class = logit.predict(_X)[0]
    predicted_proba = logit.predict_proba(_X)[:, 1][0]

    print('Probability of being positive:', predicted_proba)

In [657]:
text = 'I though the movie was going to suck, but actually it turned out to be really good.'
predict_sentiment(text)

Probability of being positive: 0.4578798726886869


In [640]:
text = 'Overall OK, nothing special'
predict_sentiment(text)

Probability of being positive: 0.21166203911710665


In [641]:
text = 'Liked it'
predict_sentiment(text)

Probability of being positive: 0.8485814533066036


In [647]:
text = 'What a fucking amazing picture'
predict_sentiment(text)

Probability of being positive: 0.6148351310661822


In [648]:
text = 'What a fucking amazing picture!'
predict_sentiment(text)

Probability of being positive: 0.6895257016039957


## 4. Fine-tuning a DistilBERT model

In [None]:
%tensorboard --logdir logs/fit