## BERT Tutorial
### How to leverage BERT models for NLP use cases

<img src='https://towardsml.files.wordpress.com/2019/09/bert.png?w=1400' width=450>

#### Many models are available - choose according to your computational resources

Link: https://github.com/google-research/bert/

<img src='data/bert_models.png' width=600>


#### I will use DistilBERT - a smaller BERT that reaches a similar level of performance

DistilBERT
- 66M parameters (BERT base: 110M)
- Layers / Hidden dimensions / Attention heads: 6 / 768 / 12 (BERT base: 12 / 768 / 12)
- Performance: retains about 97% of BERT's language-understanding performance, as reported by its authors

Complete documentation: https://huggingface.co/docs/transformers/model_doc/distilbert#distilbert (actually very user friendly)

In [1]:
import sys
import torch
from transformers import __version__ as transformers_version

print('Python version:', sys.version)
print('PyTorch version:', torch.__version__)
print('Transformers version:', transformers_version)

Python version: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
PyTorch version: 1.10.2
Transformers version: 4.17.0


In [2]:
torch.cuda.is_available()



False

In [3]:
# !pip install transformers
# torch.set_num_threads(4)

from transformers import DistilBertTokenizer, DistilBertModel, DistilBertForSequenceClassification, pipeline

- DistilBertTokenizer: splits the input text into tokens and maps them to vocabulary IDs
- DistilBertModel: the bare encoder; turns token IDs into contextual embeddings (no task-specific head)
    - if using TensorFlow: TFDistilBertModel
- DistilBertForSequenceClassification: DistilBertModel with a classification head on top of the embeddings
    - if using TensorFlow: TFDistilBertForSequenceClassification
- pipeline: high-level helper that bundles tokenizer, model and post-processing for ready-made tasks (e.g. `fill-mask`)

## 1. Try out [MASK] performance

In [3]:
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')

Downloading: 100%|██████████| 256M/256M [00:42<00:00, 6.24MB/s] 


In [4]:
unmasker('She wanted to go to [MASK].', top_k = 5)

[{'score': 0.10261113196611404,
  'token': 3637,
  'token_str': 'sleep',
  'sequence': 'she wanted to go to sleep.'},
 {'score': 0.06890519708395004,
  'token': 6014,
  'token_str': 'heaven',
  'sequence': 'she wanted to go to heaven.'},
 {'score': 0.055472031235694885,
  'token': 2793,
  'token_str': 'bed',
  'sequence': 'she wanted to go to bed.'},
 {'score': 0.029796097427606583,
  'token': 7173,
  'token_str': 'jail',
  'sequence': 'she wanted to go to jail.'},
 {'score': 0.02460319548845291,
  'token': 2267,
  'token_str': 'college',
  'sequence': 'she wanted to go to college.'}]
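
The `score` values are softmax probabilities over the vocabulary at the `[MASK]` position, and `top_k` keeps the highest ones. A minimal sketch of that last step, using made-up logits for a five-word toy vocabulary (the numbers and the `top_k` helper are illustrative assumptions, not the pipeline's internals):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(vocab, logits, k):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits at the [MASK] position, NOT real DistilBERT output.
toy_vocab = ["sleep", "heaven", "bed", "jail", "college"]
toy_logits = [2.0, 1.6, 1.4, 0.8, 0.6]

for token, prob in top_k(toy_vocab, toy_logits, k=3):
    print(f"{token:8s} {prob:.3f}")
```

In the real model the softmax runs over all 30522 vocabulary entries, which is why even the best candidate often gets a fairly small probability.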

In [5]:
unmasker("I can't find my [MASK] .", top_k = 5)

[{'score': 0.04563814774155617,
  'token': 21714,
  'token_str': 'bearings',
  'sequence': "i can't find my bearings."},
 {'score': 0.029267525300383568,
  'token': 3042,
  'token_str': 'phone',
  'sequence': "i can't find my phone."},
 {'score': 0.0241349246352911,
  'token': 3437,
  'token_str': 'answer',
  'sequence': "i can't find my answer."},
 {'score': 0.023151548579335213,
  'token': 6998,
  'token_str': 'answers',
  'sequence': "i can't find my answers."},
 {'score': 0.02219867706298828,
  'token': 3611,
  'token_str': 'dad',
  'sequence': "i can't find my dad."}]

In [6]:
unmasker("I wish I had a [MASK].", top_k = 5)

[{'score': 0.08036588877439499,
  'token': 6898,
  'token_str': 'boyfriend',
  'sequence': 'i wish i had a boyfriend.'},
 {'score': 0.04185354337096214,
  'token': 3336,
  'token_str': 'baby',
  'sequence': 'i wish i had a baby.'},
 {'score': 0.030962644144892693,
  'token': 6513,
  'token_str': 'girlfriend',
  'sequence': 'i wish i had a girlfriend.'},
 {'score': 0.024552473798394203,
  'token': 3382,
  'token_str': 'chance',
  'sequence': 'i wish i had a chance.'},
 {'score': 0.020696187391877174,
  'token': 3959,
  'token_str': 'dream',
  'sequence': 'i wish i had a dream.'}]

In [7]:
unmasker("The black woman worked as a [MASK].")

[{'score': 0.13283944129943848,
  'token': 13877,
  'token_str': 'waitress',
  'sequence': 'the black woman worked as a waitress.'},
 {'score': 0.12586164474487305,
  'token': 6821,
  'token_str': 'nurse',
  'sequence': 'the black woman worked as a nurse.'},
 {'score': 0.11708816140890121,
  'token': 10850,
  'token_str': 'maid',
  'sequence': 'the black woman worked as a maid.'},
 {'score': 0.11500067263841629,
  'token': 19215,
  'token_str': 'prostitute',
  'sequence': 'the black woman worked as a prostitute.'},
 {'score': 0.0472276546061039,
  'token': 22583,
  'token_str': 'housekeeper',
  'sequence': 'the black woman worked as a housekeeper.'}]

In [8]:
unmasker("The white man worked as a [MASK].")

[{'score': 0.12353681027889252,
  'token': 20987,
  'token_str': 'blacksmith',
  'sequence': 'the white man worked as a blacksmith.'},
 {'score': 0.10142574459314346,
  'token': 10533,
  'token_str': 'carpenter',
  'sequence': 'the white man worked as a carpenter.'},
 {'score': 0.049850210547447205,
  'token': 7500,
  'token_str': 'farmer',
  'sequence': 'the white man worked as a farmer.'},
 {'score': 0.03932555019855499,
  'token': 18594,
  'token_str': 'miner',
  'sequence': 'the white man worked as a miner.'},
 {'score': 0.033517707139253616,
  'token': 14998,
  'token_str': 'butcher',
  'sequence': 'the white man worked as a butcher.'}]

In [9]:
unmasker("Black people are [MASK].")

[{'score': 0.056957632303237915,
  'token': 12421,
  'token_str': 'excluded',
  'sequence': 'black people are excluded.'},
 {'score': 0.032912153750658035,
  'token': 22216,
  'token_str': 'enslaved',
  'sequence': 'black people are enslaved.'},
 {'score': 0.0325375571846962,
  'token': 8135,
  'token_str': 'christians',
  'sequence': 'black people are christians.'},
 {'score': 0.02683640830218792,
  'token': 14302,
  'token_str': 'minorities',
  'sequence': 'black people are minorities.'},
 {'score': 0.017561351880431175,
  'token': 27666,
  'token_str': 'persecuted',
  'sequence': 'black people are persecuted.'}]

## 2. Get features (embeddings) of tokens

In [4]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer

PreTrainedTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

return_tensors (`str` or [`~file_utils.TensorType`], *optional*):
        If set, will return tensors instead of list of python integers. Acceptable values are:

        - `'tf'`: Return TensorFlow `tf.constant` objects.
        - `'pt'`: Return PyTorch `torch.Tensor` objects.
        - `'np'`: Return Numpy `np.ndarray` objects.

In [5]:
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
model.config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.17.0",
  "vocab_size": 30522
}
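
The config values are enough to estimate where the roughly 66M parameters come from. A back-of-the-envelope sketch, assuming each transformer block consists of four attention projections, a two-layer feed-forward network and two LayerNorms (the block layout is an assumption based on the module names; only the numbers are taken from the config):

```python
# Values copied from the DistilBertConfig shown above.
vocab_size, max_positions = 30522, 512
dim, hidden_dim, n_layers = 768, 3072, 6

def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * dim  # one scale and one shift vector

# Word embeddings + position embeddings + embedding LayerNorm.
embeddings = vocab_size * dim + max_positions * dim + layer_norm

# Assumed block layout: q/k/v/out projections, a two-layer FFN,
# one LayerNorm after attention and one after the FFN.
attention = 4 * linear(dim, dim)
ffn = linear(dim, hidden_dim) + linear(hidden_dim, dim)
block = attention + ffn + 2 * layer_norm

total = embeddings + n_layers * block
print(f"embeddings: {embeddings:,}")
print(f"per block:  {block:,}")
print(f"total:      {total:,}")
```

More than a third of the parameters sit in the word-embedding table, so halving the number of layers does not halve the model size.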

In [7]:
model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [8]:
model.embeddings

Embeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [9]:
text = ["I went to the river bank and just laid there.", 
        "I work at an investment bank in New York.", 
        "Do you want to go with me to the bank?"]
encoded_input = tokenizer(text, return_tensors='pt', padding = True)

In [10]:
encoded_input

{'input_ids': tensor([[ 101, 1045, 2253, 2000, 1996, 2314, 2924, 1998, 2074, 4201, 2045, 1012,
          102],
        [ 101, 1045, 2147, 2012, 2019, 5211, 2924, 1999, 2047, 2259, 1012,  102,
            0],
        [ 101, 2079, 2017, 2215, 2000, 2175, 2007, 2033, 2000, 1996, 2924, 1029,
          102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [11]:
encoded_input['input_ids']

tensor([[ 101, 1045, 2253, 2000, 1996, 2314, 2924, 1998, 2074, 4201, 2045, 1012,
          102],
        [ 101, 1045, 2147, 2012, 2019, 5211, 2924, 1999, 2047, 2259, 1012,  102,
            0],
        [ 101, 2079, 2017, 2215, 2000, 2175, 2007, 2033, 2000, 1996, 2924, 1029,
          102]])

In [12]:
encoded_input['input_ids'].numpy()

array([[ 101, 1045, 2253, 2000, 1996, 2314, 2924, 1998, 2074, 4201, 2045,
        1012,  102],
       [ 101, 1045, 2147, 2012, 2019, 5211, 2924, 1999, 2047, 2259, 1012,
         102,    0],
       [ 101, 2079, 2017, 2215, 2000, 2175, 2007, 2033, 2000, 1996, 2924,
        1029,  102]], dtype=int64)

In [13]:
encoded_input['attention_mask'].numpy()

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)

What does the tokenized text look like?

In [14]:
for i in range(encoded_input['input_ids'].shape[0]):
    print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][i])) 

['[CLS]', 'i', 'went', 'to', 'the', 'river', 'bank', 'and', 'just', 'laid', 'there', '.', '[SEP]']
['[CLS]', 'i', 'work', 'at', 'an', 'investment', 'bank', 'in', 'new', 'york', '.', '[SEP]', '[PAD]']
['[CLS]', 'do', 'you', 'want', 'to', 'go', 'with', 'me', 'to', 'the', 'bank', '?', '[SEP]']


In [15]:
for i in range(encoded_input['input_ids'].shape[0]):
    print(len(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][i])) )

13
13
13


How is the output stored?

In [16]:
output = model(**encoded_input)

In [17]:
output[0].shape

torch.Size([3, 13, 768])

In [18]:
output[0][0].shape

torch.Size([13, 768])

In [19]:
output[0][1].shape

torch.Size([13, 768])

In [20]:
output[0][2].shape

torch.Size([13, 768])

In [21]:
# can convert to numpy
output[0][0].detach().numpy().shape

(13, 768)

Words with multiple meanings

In [5]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

In [23]:
bank_river = output[0][0][6]
bank_financial = output[0][1][6]
bank_universal = output[0][2][10]
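
The positions 6, 6 and 10 are read off the printed token lists. A small sketch of looking them up instead (`find_token` is a hypothetical helper name; it assumes the word is a single WordPiece token and takes its first occurrence):

```python
def find_token(tokens, word):
    """Return the position of the first occurrence of `word` in `tokens`."""
    try:
        return tokens.index(word)
    except ValueError:
        raise ValueError(
            f"{word!r} not found; it may have been split into sub-word pieces"
        ) from None

# Token lists copied from the tokenizer output printed earlier.
sentences = [
    ['[CLS]', 'i', 'went', 'to', 'the', 'river', 'bank', 'and', 'just', 'laid', 'there', '.', '[SEP]'],
    ['[CLS]', 'i', 'work', 'at', 'an', 'investment', 'bank', 'in', 'new', 'york', '.', '[SEP]', '[PAD]'],
    ['[CLS]', 'do', 'you', 'want', 'to', 'go', 'with', 'me', 'to', 'the', 'bank', '?', '[SEP]'],
]

positions = [find_token(tokens, 'bank') for tokens in sentences]
print(positions)
```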

In [24]:
bank_matrix = np.concatenate((bank_river.detach().numpy().reshape(1, 768), 
                              bank_financial.detach().numpy().reshape(1, 768), 
                              bank_universal.detach().numpy().reshape(1, 768)))

In [25]:
pd.DataFrame(cosine_similarity(bank_matrix), 
             columns=['River', 'Investment', 'Universal'],
             index=['River', 'Investment', 'Universal'])

Unnamed: 0,River,Investment,Universal
River,1.0,0.688735,0.770934
Investment,0.688735,1.0,0.836289
Universal,0.770934,0.836289,1.0
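
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction rather than magnitude. A dependency-free sketch on toy 3-dimensional vectors (the vectors are invented for illustration; the real embeddings have 768 dimensions):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for two 'bank' embeddings.
river = [0.9, 0.1, 0.3]
investment = [0.2, 0.8, 0.4]

print(round(cosine(river, river), 6))
print(round(cosine(river, investment), 3))
```

A vector always has similarity 1 with itself, which is why the diagonal of the table is 1.0; the off-diagonal values show that the same word gets a different embedding in each context.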


In [26]:
text = ["My date went great last night!", 
        "What's today's date?", 
        "This date is too sour to eat."]
encoded_input = tokenizer(text, return_tensors='pt', padding = True)

for i in range(encoded_input['input_ids'].shape[0]):
    print(tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][i])) 
    
output = model(**encoded_input)

date_rel = output[0][0][2]
date_time = output[0][1][7]
date_food = output[0][2][2]

date_matrix = np.concatenate((date_rel.detach().numpy().reshape(1, 768), 
                              date_time.detach().numpy().reshape(1, 768), 
                              date_food.detach().numpy().reshape(1, 768)))

pd.DataFrame(cosine_similarity(date_matrix), 
             columns=['Relationship', 'Calendar', 'Food'],
             index=['Relationship', 'Calendar', 'Food'])

['[CLS]', 'my', 'date', 'went', 'great', 'last', 'night', '!', '[SEP]', '[PAD]']
['[CLS]', 'what', "'", 's', 'today', "'", 's', 'date', '?', '[SEP]']
['[CLS]', 'this', 'date', 'is', 'too', 'sour', 'to', 'eat', '.', '[SEP]']


Unnamed: 0,Relationship,Calendar,Food
Relationship,1.0,0.825439,0.743566
Calendar,0.825439,1.0,0.771033
Food,0.743566,0.771033,1.0


## 3. Transfer learning without fine tuning - sentiment classification

<img src='https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification-example.png' width=800>


In [27]:
# already loaded
# import torch
# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
# model = DistilBertModel.from_pretrained("distilbert-base-uncased")

In [6]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', 
                 delimiter='\t', header=None)
df.columns = ['review', 'label']

df.head()

Unnamed: 0,review,label
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1


#### 1. Tokenization

In [29]:
%%time 
tokenized = df['review'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

Wall time: 5.56 s


In [30]:
print('Max length:', tokenized.map(len).max())
print('Median length:', tokenized.map(len).median())
print('Mean length:', tokenized.map(len).mean())

Max length: 67
Median length: 22.0
Mean length: 23.341907514450867


Use a tokenizer function that pads (or truncates) the token IDs to a fixed length and also returns attention masks (1 = real token to attend to, 0 = padding to ignore)

In [31]:
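MAX_LEN = 30

def bert_tokenizer(text):
    
    # pad or truncate to MAX_LEN tokens; 'np' returns NumPy arrays directly,
    # so TensorFlow is not needed in this PyTorch workflow
    encoded_text = tokenizer.encode_plus(text,  max_length = MAX_LEN, truncation=True,  padding='max_length',  
                                         return_attention_mask=True, return_tensors='np')
    
    # drop the batch dimension: shape (1, MAX_LEN) -> (MAX_LEN,)
    return encoded_text['input_ids'][0], encoded_text['attention_mask'][0]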

In [32]:
bert_tokenizer('Sample test that will be padded')

(array([  101,  7099,  3231,  2008,  2097,  2022, 20633,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0]))
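
What `padding='max_length'` and `truncation=True` amount to can be sketched in plain Python. This is a simplified stand-in, not the tokenizer's implementation; it assumes the IDs 101, 102 and 0 for `[CLS]`, `[SEP]` and `[PAD]`, as seen in the outputs above, and `pad_or_truncate` is a hypothetical helper name:

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_or_truncate(word_ids, max_len):
    """Wrap word IDs in [CLS]/[SEP], cut to max_len, pad, and build the mask."""
    body = word_ids[: max_len - 2]           # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                     # 1 = real token
    n_pad = max_len - len(ids)
    # 0 in the mask marks padding that attention should ignore
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

# Word IDs of 'Sample test that will be padded', taken from the output above.
ids, mask = pad_or_truncate([7099, 3231, 2008, 2097, 2022, 20633], max_len=30)
print(ids)
print(mask)
```

Since the longest review has 67 tokens and `MAX_LEN` is 30, some reviews lose their tail; the trade-off is a faster forward pass.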

In [33]:
%%time

tokenized_padded, attention_masks = zip(*df['review'].apply(lambda x: bert_tokenizer(x)))

Wall time: 9.94 s


In [34]:
input_ids = torch.tensor(np.array(tokenized_padded))  # could add requires_grad=False in torch, torch.no_grad() is more universal
attention_mask = torch.tensor(np.array(attention_masks))

print(input_ids.shape)
print(attention_mask.shape)

torch.Size([6920, 30])
torch.Size([6920, 30])


#### 2. Apply BERT on tokens

In [35]:
%%time

with torch.no_grad(): # no need to keep track of gradients
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Wall time: 5min 17s


In [36]:
last_hidden_states[0].shape

torch.Size([6920, 30, 768])

6920 sentences, 30 words (tokens) in each sentence, 768 dimensions for each word (token)

#### 3. Get sentence embeddings out of the resulting tensor


<img src='https://camo.githubusercontent.com/6c2185c7620a3fe52f1968752febb6467723f4485c257442d3b0ed03bb0da197/68747470733a2f2f6a616c616d6d61722e6769746875622e696f2f696d616765732f64697374696c424552542f626572742d6f75747075742d74656e736f722d73656c656374696f6e2e706e67' width=1000>


In [37]:
X = last_hidden_states[0][:,0,:].numpy()
y = df['label']

print(X.shape)
print(y.shape)

(6920, 768)
(6920,)


#### 4. Fit model, evaluate

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 91, train_size = 0.8)

In [40]:
logit = LogisticRegression(max_iter = 1000).fit(X_train, y_train)

In [41]:
y_pred_class = logit.predict(X_test)
y_pred_prob = logit.predict_proba(X_test)[:, 1]

In [42]:
print('Ratio of positive class:', y.value_counts()[1] / df.shape[0])
print('Accuracy:', accuracy_score(y_test, y_pred_class))
print('AUC:', roc_auc_score(y_test, y_pred_prob))

Ratio of positive class: 0.5216763005780347
Accuracy: 0.8345375722543352
AUC: 0.9162325654286162
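
AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A brute-force sketch of that definition on four made-up predictions (fine for illustration; `roc_auc_score` remains the tool to use on real data, and `pairwise_auc` is a hypothetical helper name):

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented labels and predicted probabilities, not taken from the model.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))
```

Unlike accuracy, this measure does not depend on the 0.5 decision threshold, only on how well the scores order the examples.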


#### 5. Predict sentiment of any text

In [43]:
def predict_sentiment(text):
    
    _tokenized, _attention_mask = bert_tokenizer(text)

    # add a batch dimension of 1: shape (MAX_LEN,) -> (1, MAX_LEN)
    _tokenized = torch.reshape(torch.from_numpy(_tokenized), (1, MAX_LEN))
    _attention_mask = torch.reshape(torch.from_numpy(_attention_mask), (1, MAX_LEN))

    with torch.no_grad():  # inference only, no gradients needed
        _last_hidden_state = model(_tokenized, attention_mask = _attention_mask)

    # embedding of the [CLS] token (position 0) as the sentence representation
    _X = _last_hidden_state[0][:, 0, :].numpy()

    #predicted_class = logit.predict(_X)[0]
    predicted_proba = logit.predict_proba(_X)[:, 1][0]

    print('Probability of being positive:', predicted_proba)

In [44]:
text = 'I though the movie was going to suck, but actually it turned out to be really good.'
predict_sentiment(text)

Probability of being positive: 0.4585272365085751


In [45]:
text = 'Overall OK, nothing special'
predict_sentiment(text)

Probability of being positive: 0.21126795797738587


In [46]:
text = 'Liked it'
predict_sentiment(text)

Probability of being positive: 0.8487172901582756


In [47]:
text = 'What a fucking amazing picture'
predict_sentiment(text)

Probability of being positive: 0.6141236449664051


In [48]:
text = 'What a fucking amazing picture!'
predict_sentiment(text)

Probability of being positive: 0.6883185167869081


## 4.  Sentiment classification with fine tuning --> BERT in neural net (`DistilBertForSequenceClassification`)

<img src='https://skimai.com/wp-content/uploads/2020/03/Screen-Shot-2020-04-13-at-5.59.33-PM.png' width=800>

In [8]:
from torch.utils.data import TensorDataset, DataLoader
# could use sampler (random for train, sequential for test)

import torch.nn as nn 
import torch.nn.functional as F
import torch.optim as optim

### 1. Create PyTorch datasets
- Tokenization
- PyTorch-specific dataloaders

In [9]:
train, test = train_test_split(df, random_state = 91, train_size = 0.8)

MAX_LEN = 30
BATCH_SIZE = 300
EPOCHS = 2

def bert_tokenizer(text):    
    encoded_text = tokenizer(text,  max_length = MAX_LEN, truncation=True,  padding='max_length', return_attention_mask=True, return_tensors='pt')    
    return encoded_text

Convert to PyTorch tensors --> Datasets --> Dataloaders

In [10]:
%%time
train_tokenized = bert_tokenizer(train['review'].tolist())
test_tokenized = bert_tokenizer(test['review'].tolist())

train_input_ids = train_tokenized['input_ids']
test_input_ids = test_tokenized['input_ids']

train_attention_mask = train_tokenized['attention_mask']
test_attention_mask = test_tokenized['attention_mask']

train_labels = torch.tensor(train['label'].values, dtype=torch.long)
test_labels = torch.tensor(test['label'].values, dtype=torch.long)

Wall time: 14.2 s


In [11]:
train_data = TensorDataset(train_input_ids, train_attention_mask, train_labels)
test_data = TensorDataset(test_input_ids, test_attention_mask, test_labels)

train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE)
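
With these settings the loader sizes can be worked out by hand, which is handy for checking the batch counter printed during training. A short sketch, assuming `train_test_split` puts 80% of the 6920 rows into the training set and the rest into the test set:

```python
import math

n_rows, train_size, batch_size = 6920, 0.8, 300

n_train = int(n_rows * train_size)
n_test = n_rows - n_train

# Number of batches each DataLoader should yield (the last one may be smaller).
train_batches = math.ceil(n_train / batch_size)
test_batches = math.ceil(n_test / batch_size)
last_test_batch = n_test - (test_batches - 1) * batch_size

print(n_train, n_test)
print(train_batches, test_batches, last_test_batch)
```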

### 2. Load the DistilBERT model for sequence classification


In [102]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', 
                                                            num_labels = 2)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classi

In [103]:
model(torch.tensor([101, 102]).reshape(-1,1), labels = torch.tensor([0, 1]))

SequenceClassifierOutput(loss=tensor(0.6843, grad_fn=<NllLossBackward0>), logits=tensor([[-0.0695,  0.0219],
        [-0.2379, -0.1047]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
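
The loss in this sanity check is the mean cross-entropy over the two examples, and it can be reproduced by hand from the printed logits. A sketch in plain Python (the logits are copied from the output above and are rounded to four decimals, so the match is approximate):

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

logits = [[-0.0695, 0.0219], [-0.2379, -0.1047]]
labels = [0, 1]

loss = sum(cross_entropy(l, y) for l, y in zip(logits, labels)) / len(labels)
print(round(loss, 4))
print(round(math.log(2), 4))  # loss of a classifier that always predicts 50/50
```

A loss close to ln 2 is what one would expect before fine-tuning, since the newly initialized classification head is still close to guessing.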

In [104]:
optimizer = optim.Adam(model.parameters(), lr = 3e-5)
#loss_fn = nn.CrossEntropyLoss()

### 3. Fine tune model

Optionally freeze the DistilBERT encoder so that only the classification head is trained (left commented out here)

In [105]:
import time
import datetime
from tqdm import tqdm, trange

def format_time(elapsed):
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))

def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [36]:
# need to find layers and set their grads to False
# params = list(model.parameters())

# for param in params[:-4]:
#     param.requires_grad = False

In [106]:
%%time
train_loss = []
test_loss = []

for epoch in range(EPOCHS):
    
    t0 = time.time()
    print(f"Epoch {epoch + 1}\n-------------------------------")
    
    ### TRAIN ### 
    model.train()
    tr_loss, tr_accuracy = 0, 0
    nb_tr_steps = 0 

    for step, batch in enumerate(train_dataloader):

        b_input_ids, b_input_mask, b_labels = batch
        optimizer.zero_grad()
        
        # print batch progress
        if step % 4 == 0 : #and not step == 0
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
        
        seq_output = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)        
        loss =  seq_output.loss
        pred = seq_output.logits
        
        loss.backward()
        optimizer.step()
        
        train_loss.append(loss.item())  
        tr_loss += loss.item()
        nb_tr_steps += 1
        
        pred = pred.detach().numpy()
        labels = b_labels.numpy()        
        tmp_tr_accuracy = flat_accuracy(pred, labels)
        tr_accuracy += tmp_tr_accuracy
        

    print("\nTrain -- loss: {:>8f} -- accuracy: {:>8f}".format(tr_loss / nb_tr_steps, tr_accuracy / nb_tr_steps))

    ### EVALUATE ###
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps = 0

    for batch in test_dataloader:
        
        b_input_ids, b_input_mask, b_labels = batch
        
        with torch.no_grad():        
            
            seq_output = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
            loss =  seq_output.loss
            pred = seq_output.logits
        
        test_loss.append(loss.item())  
        eval_loss += loss.item()
        nb_eval_steps += 1
        
        pred = pred.detach().numpy()
        labels = b_labels.numpy()        
        tmp_eval_accuracy = flat_accuracy(pred, labels)
        eval_accuracy += tmp_eval_accuracy

    print("Validation -- loss: {:>8f} -- accuracy: {:>8f}\n\n".format(eval_loss / nb_eval_steps, eval_accuracy / nb_eval_steps))
    
print('Done!')

Epoch 1
-------------------------------
  Batch     0  of     19.    Elapsed: 0:00:00.
  Batch     4  of     19.    Elapsed: 0:01:59.
  Batch     8  of     19.    Elapsed: 0:03:45.
  Batch    12  of     19.    Elapsed: 0:05:44.
  Batch    16  of     19.    Elapsed: 0:07:39.

Train -- loss: 0.577736 -- accuracy: 0.704391
Validation -- loss: 0.401456 -- accuracy: 0.826304


Epoch 2
-------------------------------
  Batch     0  of     19.    Elapsed: 0:00:00.
  Batch     4  of     19.    Elapsed: 0:01:57.
  Batch     8  of     19.    Elapsed: 0:03:46.
  Batch    12  of     19.    Elapsed: 0:05:41.
  Batch    16  of     19.    Elapsed: 0:07:41.

Train -- loss: 0.316938 -- accuracy: 0.872332
Validation -- loss: 0.333243 -- accuracy: 0.864797


Done!
Wall time: 19min 7s
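
The accuracy printed by the loop is the plain average of per-batch accuracies, so a smaller final batch carries the same weight as a full one. That can make it differ slightly from an accuracy computed over all rows at once. A toy sketch of the effect (the batch sizes assume 1384 test rows in batches of 300; the per-batch correct counts are invented):

```python
batch_sizes = [300, 300, 300, 300, 184]
n_correct = [264, 258, 261, 255, 150]   # made-up counts of correct predictions

per_batch = [c / n for c, n in zip(n_correct, batch_sizes)]

mean_of_batches = sum(per_batch) / len(per_batch)   # what the loop reports
overall = sum(n_correct) / sum(batch_sizes)          # accuracy over all rows

print(f"mean of batch accuracies: {mean_of_batches:.4f}")
print(f"overall accuracy:         {overall:.4f}")
```

Accumulating the number of correct predictions and dividing by the total number of rows at the end would make the two figures agree.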


### 4. Evaluate on test

In [107]:
%%time
torch_predictions = model(test_dataloader.dataset.tensors[0], 
                          attention_mask = test_dataloader.dataset.tensors[1],
                          labels = test_dataloader.dataset.tensors[2])

Wall time: 1min 28s


In [108]:
torch_predictions.keys()

odict_keys(['loss', 'logits'])

In [109]:
torch_predictions

SequenceClassifierOutput(loss=tensor(0.3283, grad_fn=<NllLossBackward0>), logits=tensor([[-1.9799,  1.4200],
        [-2.3359,  1.6462],
        [-0.3222, -0.0718],
        ...,
        [ 0.3872, -0.6693],
        [-1.2620,  0.7753],
        [-0.4462,  0.1451]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [125]:
y_tst = test_dataloader.dataset.tensors[2].numpy()
logits_tst = torch_predictions.logits
probs = F.softmax(logits_tst, 1).detach().numpy()[:,1]

print('Ratio of positive class:', np.unique(y_tst, return_counts=True)[1][1] / len(y_tst))
print('Accuracy:', flat_accuracy(logits_tst.detach().numpy(), y_tst))
print('AUC:', roc_auc_score(y_tst, probs))

Ratio of positive class: 0.5122832369942196
Accuracy: 0.8684971098265896
AUC: 0.9359431646032493


## Clear improvement after just 2 epochs

Compared to frozen embeddings + logistic regression, fine-tuning DistilBERT with a classification head improves both metrics:

### Accuracy: 86.8% (up from 83.5%)
### AUC: 0.936 (up from 0.916)

### 5. Make predictions

In [126]:
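def predict_sentiment_net(text):
    
    tokenized_text = bert_tokenizer(text)

    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**tokenized_text).logits

    # softmax over the two classes; column 1 is the positive class
    prob = F.softmax(logits, 1).numpy()[0][1]

    print('Probability of being positive:', prob)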

In [127]:
predict_sentiment_net('nice movie')

Probability of being positive: 0.96273935


In [128]:
predict_sentiment_net('movie sucked')

Probability of being positive: 0.054781504


In [130]:
text = 'I though the movie was going to suck, but actually it turned out to be really good.'
predict_sentiment_net(text)

Probability of being positive: 0.8789064


In [131]:
text = 'Overall OK, nothing special'
predict_sentiment_net(text)

Probability of being positive: 0.07032349


In [132]:
text = 'Liked it'
predict_sentiment_net(text)

Probability of being positive: 0.94001234


In [133]:
text = 'What a fucking amazing picture'
predict_sentiment_net(text)

Probability of being positive: 0.6782087


In [134]:
text = 'What a fucking amazing picture!'
predict_sentiment_net(text)

Probability of being positive: 0.7235411
