## Following the guide for BERT as practice
https://www.depends-on-the-definition.com/named-entity-recognition-with-bert/

On using BERT as encoder
https://towardsdatascience.com/word-embedding-using-bert-in-python-dd5a86c00342

# DistilBERT

## Control the model architecture configuration


In [8]:
from transformers import DistilBertModel, DistilBertConfig

# Initializing a DistilBERT configuration
configuration = DistilBertConfig()

# Initializing a model from the configuration
model = DistilBertModel(configuration)

# Accessing the model configuration
configuration = model.config

# You can control the model architecture by passing parameters to the configuration constructor

# For example
# configuration = DistilBertConfig(n_layers=4)
# and then passing it to the model
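
As a minimal sketch of the point above, the cell below builds a much smaller DistilBERT from a custom configuration and runs one forward pass. The names `small_config` and `small_model` and the chosen sizes are illustrative assumptions, not values from the original notebook; the model is randomly initialized, so its outputs are not meaningful embeddings.

```python
import torch
from transformers import DistilBertConfig, DistilBertModel

# Illustrative sizes, far below the defaults (n_layers=6, n_heads=12, dim=768, hidden_dim=3072).
# dim must be divisible by n_heads.
small_config = DistilBertConfig(n_layers=2, n_heads=4, dim=128, hidden_dim=512)

# Building the model from a configuration gives randomly initialized weights
small_model = DistilBertModel(small_config)
small_model.eval()

# Count parameters to see the effect of the smaller architecture
n_params = sum(p.numel() for p in small_model.parameters())
print(small_model.config.n_layers, n_params)

# One forward pass on three token ids; the first output element is the last hidden state
small_input_ids = torch.tensor([[101, 1422, 102]])
with torch.no_grad():
    hidden = small_model(input_ids=small_input_ids)[0]
print(hidden.shape)
```

The last dimension of `hidden` follows `dim` from the configuration, which is a quick way to confirm that the custom settings were picked up.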


## Use the model

In [9]:
# Initializing the model from a configuration does not load pretrained weights; the weights are randomly initialized.
model = DistilBertModel(configuration)

# Use the from_pretrained() method to load pretrained weights; the checkpoint's own configuration is loaded with them
model = DistilBertModel.from_pretrained('distilbert-base-cased')





In [10]:
from transformers import DistilBertTokenizer, DistilBertModel
import torch

In [32]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')


text = "My name is Sophie"
tags = ["O", "O", "O", "PER"]
tag2idx = {"O": 0, "PER": 1}

# Add special tokens to get [CLS] and [SEP]
text_enc = tokenizer.encode(text, add_special_tokens=True)
# The attention mask is 1 for every real token, including [CLS] and [SEP]; it is built directly, not tokenized
mask_enc = [1] * len(text_enc)
# Tags are mapped to ids with tag2idx; the tokenizer would turn "PER" into [UNK].
# Assumption: [CLS] and [SEP] get -100, the default ignore_index of torch.nn.CrossEntropyLoss.
# One tag per token only works here because every word is a single WordPiece.
tags_enc = [-100] + [tag2idx[tag] for tag in tags] + [-100]

print(text_enc)
print(mask_enc)
print(tags_enc)

text_enc_tensor = torch.tensor(text_enc).unsqueeze(0)
mask_enc_tensor = torch.tensor(mask_enc).unsqueeze(0)
tags_enc_tensor = torch.tensor(tags_enc).unsqueeze(0)

print(text_enc_tensor)
print(mask_enc_tensor)
print(tags_enc_tensor)

# Generate embeddings with BERT
# Train token classification on 90 % of the embeddings, test on rest


#input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
#outputs = model(input_ids)

#last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

[101, 1422, 1271, 1110, 7800, 102]
[1, 1, 1, 1, 1, 1]
[-100, 0, 0, 0, 1, -100]
tensor([[ 101, 1422, 1271, 1110, 7800,  102]])
tensor([[1, 1, 1, 1, 1, 1]])
tensor([[-100,    0,    0,    0,    1, -100]])
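
Word-level tags line up with token ids only when every word maps to exactly one WordPiece, which happens to be the case for this sentence. The sketch below shows one common way to align tags when a word is split: label the first piece and ignore the rest. The helper `align_tags`, the `toy_tokenize` stand-in for `tokenizer.tokenize`, and the split of "Sophie" into two pieces are hypothetical and only serve the illustration; -100 is used because it is the default `ignore_index` of `torch.nn.CrossEntropyLoss`.

```python
IGNORE = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def align_tags(words, tags, tokenize, tag2idx):
    """Return (pieces, label_ids) with exactly one label id per WordPiece."""
    pieces = ["[CLS]"]
    label_ids = [IGNORE]  # special tokens carry no tag
    for word, tag in zip(words, tags):
        word_pieces = tokenize(word)
        if not word_pieces:
            continue  # nothing to label for this word
        pieces.extend(word_pieces)
        # Label the first piece, ignore the continuation pieces
        label_ids.extend([tag2idx[tag]] + [IGNORE] * (len(word_pieces) - 1))
    pieces.append("[SEP]")
    label_ids.append(IGNORE)
    return pieces, label_ids

# Hypothetical stand-in for tokenizer.tokenize that splits one word on purpose
toy_splits = {"Sophie": ["Sop", "##hie"]}

def toy_tokenize(word):
    return toy_splits.get(word, [word])

pieces, label_ids = align_tags(
    "My name is Sophie".split(),
    ["O", "O", "O", "PER"],
    toy_tokenize,
    {"O": 0, "PER": 1},
)
print(pieces)     # ['[CLS]', 'My', 'name', 'is', 'Sop', '##hie', '[SEP]']
print(label_ids)  # [-100, 0, 0, 0, 1, -100, -100]
```

Passing `tokenizer.tokenize` in place of `toy_tokenize` should give the same alignment against the real vocabulary.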


In [15]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
model = DistilBertModel.from_pretrained('distilbert-base-cased')

In [35]:
embeddings = model(input_ids=text_enc_tensor,
    attention_mask=mask_enc_tensor)
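
The attention mask only starts to matter once sequences of different lengths are padded into one batch. A minimal sketch, assuming the padding id is 0 (the id of `[PAD]` in the BERT vocabularies) and reusing the token ids printed earlier; the names `batch_input_ids` and `batch_attention_mask` are illustrative.

```python
import torch

PAD_ID = 0  # assumed id of [PAD]

# "My name is Sophie" and the shorter "My name", both with [CLS]=101 and [SEP]=102
batch = [
    [101, 1422, 1271, 1110, 7800, 102],
    [101, 1422, 1271, 102],
]

# Pad every sequence on the right to the length of the longest one
max_len = max(len(seq) for seq in batch)
batch_input_ids = torch.tensor([seq + [PAD_ID] * (max_len - len(seq)) for seq in batch])

# 1 for real tokens, 0 for padding
batch_attention_mask = (batch_input_ids != PAD_ID).long()

print(batch_input_ids)
print(batch_attention_mask)
```

Both tensors can then be passed as `input_ids` and `attention_mask`, so the padded positions are not attended to.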

In [47]:
# Passing the sentence through the model, we get an output whose first element is the last hidden state
embeddings[0].shape  # torch.Size([1, 6, 768])

# There are 6 encodings, one for each token (our sentence + the special [CLS] and [SEP] tokens)
# Each one is a vector in a 768-dimensional embedding space

embeddings[0]

tensor([[[ 0.4358,  0.0456,  0.0396,  ..., -0.1634,  0.2456, -0.0158],
         [-0.1548,  0.0235,  0.6477,  ..., -0.1363, -0.0646,  0.1977],
         [ 0.2925,  0.1372,  0.2723,  ...,  0.4394,  0.0764,  0.1033],
         [ 0.1624,  0.0689,  0.3299,  ...,  0.2121,  0.1571,  0.1517],
         [ 0.0784,  0.0850, -0.1499,  ...,  0.0408,  0.2438,  0.0382],
         [ 1.0227, -0.0188,  0.1744,  ..., -0.0983,  0.9721, -0.0347]]],
       grad_fn=<NativeLayerNormBackward>)

## On how BERT works
https://towardsdatascience.com/bert-to-the-rescue-17671379687f
https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/
BERT is, in the end, an encoder, if a rather sophisticated one.

The INPUT to BERT is a sequence of WORDPIECE tokens plus two special tokens, [CLS] and [SEP].

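[SEP] is the separator token that marks the end of a segment (a single sentence, or each sentence of a pair).
[CLS] is the classification token placed at the start of every input; its encoding is commonly used to represent the whole sequence.
WORDPIECES are subword units from a fixed vocabulary; a word missing from the vocabulary is split into several pieces, with continuation pieces prefixed by ##.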
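
To make the WordPiece idea concrete, here is a sketch of greedy longest-match-first splitting, the matching strategy commonly described for BERT tokenizers. The function `wordpiece` and the tiny `toy_vocab` are hypothetical; the real vocabulary has tens of thousands of entries and the real tokenizer also handles punctuation and casing.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"My", "name", "is", "Sop", "##hie", "play", "##ing"}

print(wordpiece("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece("Sophie", toy_vocab))   # ['Sop', '##hie']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']

# A full input adds the special tokens around the pieces
sentence = ["[CLS]"] + [p for w in "My name is Sophie".split() for p in wordpiece(w, toy_vocab)] + ["[SEP]"]
print(sentence)  # ['[CLS]', 'My', 'name', 'is', 'Sop', '##hie', '[SEP]']
```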

An EXAMPLE INPUT is
[[CLS], My, name, is, Sophie, [SEP], Yours, [?]]

Passing it through BERT, we receive a rich encoding (E1, E2, ...) for each WordPiece token; [CLS] and [SEP] receive encodings as well.
[[CLS], E1, E2, E3, E4, [SEP], E5, E6]
All we have to do then is to classify the tokens.
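
Classifying the tokens can be sketched as a single linear layer on top of the encodings. This is a minimal illustration rather than a trained tagger: `hidden_states` is a random stand-in with the same shape as the encoder output (1 sentence, 6 tokens, 768 dimensions), and the two-tag label set and the -100 labels on the special tokens are assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

num_tags = 2        # assumed label set: "O" and "PER"
hidden_size = 768   # size of each token encoding

# Random stand-in for the encoder output: (batch, tokens, hidden_size)
hidden_states = torch.randn(1, 6, hidden_size)
# One label per token; -100 marks [CLS] and [SEP] so the loss skips them
labels = torch.tensor([[-100, 0, 0, 0, 1, -100]])

# The classification head: one linear layer applied to every token encoding
classifier = nn.Linear(hidden_size, num_tags)
logits = classifier(hidden_states)  # (1, 6, num_tags)

# CrossEntropyLoss expects (N, classes) and (N,); ignore_index defaults to -100
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits.view(-1, num_tags), labels.view(-1))

# Predicted tag id for every token
predictions = logits.argmax(dim=-1)
print(logits.shape, loss.item(), predictions)
```

In practice the random tensor would be replaced by the encoder output, and the layer would be trained on many sentences before its predictions mean anything.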
