# Transformers
https://github.com/huggingface/transformers

## Pipelines

Pipelines are high-level objects that handle tokenization, run the input through a Transformers model, and return the result as a structured object.

In [4]:
from transformers import pipeline

### Sentiment

In [6]:
sentimentA = pipeline('sentiment-analysis')

All model checkpoint weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [10]:
sentimentA("This went very well")

[{'label': 'POSITIVE', 'score': 0.9997692704200745}]
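
The `score` above can be read as a probability for the predicted label. As a minimal sketch of how such a score might arise, assuming the classifier head emits one raw logit per label and the pipeline normalizes them with a softmax, the snippet below uses made-up logits (they are not taken from the model):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max for numerical stability before exponentiating.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the labels (NEGATIVE, POSITIVE); illustrative only.
labels = ["NEGATIVE", "POSITIVE"]
logits = [-4.0, 4.4]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])
result = {"label": labels[best], "score": probs[best]}
print(result)
```

A large gap between the two logits pushes the winning probability close to 1, which is why confident predictions show scores such as 0.9997.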

### NER

In [11]:
nerA = pipeline('ner')

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing TFBertForTokenClassification: ['dropout_147']
- This IS expected if you are initializing TFBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english and are newly initialized: ['dropout_93']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
nerA("Mathias was working at CBOT.dk")

[{'entity': 'I-PER',
  'index': 1,
  'score': 0.995712399482727,
  'word': 'Mathias'},
 {'entity': 'I-ORG', 'index': 6, 'score': 0.7109400629997253, 'word': 'CB'},
 {'entity': 'I-ORG', 'index': 7, 'score': 0.6077383756637573, 'word': '##OT'}]
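
Note that the organisation name comes back as two WordPiece tokens, `CB` and `##OT`. The sketch below shows one way to merge such pieces by hand; `merge_wordpieces` is a hypothetical helper written for this notebook, not part of the library, and averaging the scores is just one possible choice. Depending on the installed version, the pipeline may also offer built-in grouping.

```python
def merge_wordpieces(entities):
    """Group consecutive WordPiece tokens ('##...') into whole words."""
    merged = []
    for ent in entities:
        is_continuation = ent["word"].startswith("##")
        if merged and is_continuation and ent["index"] == merged[-1]["last_index"] + 1:
            # Glue the piece onto the previous word, dropping the '##' prefix.
            prev = merged[-1]
            prev["word"] += ent["word"][2:]
            prev["scores"].append(ent["score"])
            prev["last_index"] = ent["index"]
        else:
            merged.append({
                "entity": ent["entity"],
                "word": ent["word"],
                "scores": [ent["score"]],
                "last_index": ent["index"],
            })
    # Report the mean score of the merged pieces (an assumption, not the library's rule).
    return [
        {"entity": m["entity"], "word": m["word"],
         "score": sum(m["scores"]) / len(m["scores"])}
        for m in merged
    ]

# The NER output shown above, copied by hand.
ner_output = [
    {"entity": "I-PER", "index": 1, "score": 0.995712399482727, "word": "Mathias"},
    {"entity": "I-ORG", "index": 6, "score": 0.7109400629997253, "word": "CB"},
    {"entity": "I-ORG", "index": 7, "score": 0.6077383756637573, "word": "##OT"},
]
grouped = merge_wordpieces(ner_output)
print(grouped)
```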

### Fill-mask

In [17]:
fillMaskA = pipeline("fill-mask")

All model checkpoint weights were used when initializing TFRobertaForMaskedLM.

All the weights of TFRobertaForMaskedLM were initialized from the model checkpoint at distilroberta-base.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


In [19]:
fillMaskA("Elon Musk is most known for his <mask>")

[{'score': 0.1785833239555359,
  'sequence': '<s>Elon Musk is most known for his tweets</s>',
  'token': 6245,
  'token_str': 'Ġtweets'},
 {'score': 0.128355473279953,
  'sequence': '<s>Elon Musk is most known for his Tesla</s>',
  'token': 4919,
  'token_str': 'ĠTesla'},
 {'score': 0.059420883655548096,
  'sequence': '<s>Elon Musk is most known for his SpaceX</s>',
  'token': 14973,
  'token_str': 'ĠSpaceX'},
 {'score': 0.03811254724860191,
  'sequence': '<s>Elon Musk is most known for his inventions</s>',
  'token': 39232,
  'token_str': 'Ġinventions'},
 {'score': 0.02863607183098793,
  'sequence': '<s>Elon Musk is most known for his genius</s>',
  'token': 16333,
  'token_str': 'Ġgenius'}]
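
The `Ġ` at the start of each `token_str` is how this model's byte-level BPE vocabulary marks a leading space. A small sketch for printing the candidates in a readable form follows; `clean_token` is a hypothetical helper, and the scores are rounded copies of the output above:

```python
def clean_token(token_str):
    """Turn the byte-level BPE space marker 'Ġ' back into a space and trim it."""
    return token_str.replace("Ġ", " ").strip()

# A subset of the predictions shown above (scores rounded by hand).
predictions = [
    {"score": 0.1786, "token_str": "Ġtweets"},
    {"score": 0.1284, "token_str": "ĠTesla"},
    {"score": 0.0594, "token_str": "ĠSpaceX"},
]

cleaned = [clean_token(p["token_str"]) for p in predictions]
for word, p in zip(cleaned, predictions):
    # Left-align the word and show the score as a percentage.
    print(f"{word:<10} {p['score']:.1%}")
```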

## Load a pre-trained model

In [1]:
import torch 
from transformers import BertModel, BertTokenizer

Path to pre-trained model

In [2]:
path = "./danish_bert_uncased_v2"

Alternatively, load a pre-trained model from the Hugging Face Hub by passing its model identifier to `from_pretrained` instead of a local path.

### Define model and tokenizer

In [3]:
dkBERT = BertModel.from_pretrained(path)
dkTokenizer = BertTokenizer.from_pretrained(path)

In [4]:
txt_input = "I 2050 har vi landet mennesker på Mars"

#### Tokenize

-- Simple --

In [5]:
dkTokenizer.tokenize(txt_input)

['i', '20', '##50', 'har', 'vi', 'landet', 'mennesker', 'pa', 'mars']
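
Two things happen in the output above: the text is lowercased and stripped of accents (`på` becomes `pa`), and the unknown word `2050` is split into the pieces `20` and `##50`. The following is a rough, dependency-free sketch of both steps, assuming greedy longest-match-first WordPiece splitting. The vocabulary `toy_vocab` is made up for this example; the real model's vocabulary is far larger.

```python
import unicodedata

def normalize(text):
    """Lowercase and strip accents, as an uncased BERT tokenizer typically does."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (Unicode category "Mn"), e.g. the ring in "å".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, just large enough for this sentence.
toy_vocab = {"i", "20", "##50", "har", "vi", "landet", "mennesker", "pa", "mars"}

text = "I 2050 har vi landet mennesker på Mars"
tokens = [p for w in normalize(text).split() for p in wordpiece(w, toy_vocab)]
print(tokens)
```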

-- Tensors -- 

In [6]:
tokenizerOut = dkTokenizer(txt_input,return_tensors = "pt")
print(tokenizerOut)

{'input_ids': tensor([[    2,    23,   284,  2364,    87,    54,  2223,   909,  2800, 20192,
             3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In [19]:
print("Token IDs (Int):   {}".format(tokenizerOut['input_ids'].tolist()[0]))

Token IDs (Int):   [2, 23, 284, 2364, 87, 54, 2223, 909, 2800, 20192, 3]


In [8]:
for s in tokenizerOut['input_ids'].tolist()[0]:
    print(dkTokenizer.convert_ids_to_tokens(s))

[CLS]
i
20
##50
har
vi
landet
mennesker
pa
mars
[SEP]


In [10]:
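# Index the result instead of unpacking it: newer transformers versions return a dict-like object
outputs = dkBERT(**tokenizerOut)
output, pooled = outputs[0], outputs[1]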

print("Token wise output: {}, Pooled output: {}".format(output.shape, pooled.shape))

Token wise output: torch.Size([1, 11, 768]), Pooled output: torch.Size([1, 768])


## Using the models 
What can we use these models for?

### Word and sentence encoding

Inspired by:
https://stackoverflow.com/questions/59030907/nlp-transformers-best-way-to-get-a-fixed-sentence-embedding-vector-shape

In [32]:
dkBERT = BertModel.from_pretrained(path,output_hidden_states = True)
dkTokenizer = BertTokenizer.from_pretrained(path)

In [47]:
# Extract the final hidden state of the [CLS] token as a word or sentence embedding
def getEmbedding(txt, model, tokenizer):

    token_ids = tokenizer.encode(txt)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    print(tokens)

    # unsqueeze token_ids to add a batch dimension (batch_size=1)
    token_ids = torch.tensor(token_ids).unsqueeze(0)

    # index 0 is the last hidden state; squeeze drops the batch dimension
    output = model(token_ids)[0].squeeze()

    # only grab the output of the [CLS] token, which is the first token
    cls_out = output[0]

    print(cls_out.size())

    return cls_out

In [48]:
word = "Mars"
sentence = "I 2050 har vi landet mennesker på Mars"

#### Word embedding

In [49]:
getEmbedding(word,dkBERT,dkTokenizer)

['[CLS]', 'mars', '[SEP]']
torch.Size([768])


tensor([ 1.0235e+00,  1.5655e-01,  1.1787e+00, -7.4808e-01,  3.9829e-01,
         9.6537e-01, -1.9184e-01,  8.2183e-01,  7.1435e-01, -3.6223e-01,
         1.2966e+00, -1.8682e-01, -1.9217e-01,  1.8919e-01,  8.6612e-01,
         9.8193e-01, -1.2436e+00,  7.7368e-01,  1.0591e+00, -1.8178e+00,
        -7.3825e-01,  4.6880e-01, -4.9287e-01,  1.2163e+00, -1.2437e+00,
        -5.8814e-01, -1.9162e-01, -1.4583e+00,  1.6911e-01,  4.1626e-01,
        -3.3694e-01, -9.1035e-01,  1.6270e+00, -1.3170e-01, -3.8462e-01,
         3.1265e-02, -3.5563e-01,  7.7417e-01,  2.6427e+00,  5.9702e-02,
        -3.6986e-01, -1.1797e+00, -7.8518e-01, -1.0237e+00,  5.4205e-01,
         3.7188e-01,  5.6524e-01, -1.0491e+00,  2.1418e-01, -1.3997e-01,
        -1.8762e+00, -1.6009e+00,  1.6747e+00, -3.3313e-01,  1.4981e-01,
         1.0822e+00, -1.3556e+00, -9.3150e-02,  6.0614e-01,  8.5925e-01,
         5.7432e-01,  7.4320e-01, -5.5337e-01, -6.3786e-01, -1.8242e+00,
        -6.3919e-01,  3.6784e-01, -1.4042e+00,  8.2

#### Sentence embedding

In [51]:
getEmbedding(sentence,dkBERT,dkTokenizer)

['[CLS]', 'i', '20', '##50', 'har', 'vi', 'landet', 'mennesker', 'pa', 'mars', '[SEP]']
torch.Size([768])


tensor([-7.0074e-02, -1.5375e-01,  9.6407e-01, -6.9426e-01, -1.0307e+00,
        -2.3426e-01, -1.3878e-02,  3.1126e-01,  4.8216e-01,  4.9639e-01,
        -4.4069e-01,  1.0063e+00, -3.8668e-01,  9.6935e-01,  1.7124e+00,
         1.4785e+00, -1.9425e+00,  8.0044e-02,  1.3573e+00,  5.4714e-01,
        -3.0976e-01, -1.9292e-01, -1.1168e+00,  1.6637e+00,  1.2820e+00,
        -1.5019e-01,  8.4022e-01, -7.0297e-01, -5.0812e-01,  7.3498e-01,
        -7.8397e-01, -3.7334e-01,  2.2746e+00,  1.4282e-01,  2.3258e-01,
        -6.5614e-01, -1.5285e+00,  1.3343e+00, -3.0222e-01,  5.9527e-01,
        -8.3338e-01,  1.0624e+00, -1.9505e+00, -2.3034e-01, -3.7374e-01,
         9.0723e-01, -2.5357e-01, -2.8921e-01,  3.2808e-01,  1.1954e+00,
         9.7675e-02, -1.2382e+00,  1.8711e+00, -6.6124e-01,  7.7680e-01,
         5.0256e-01, -3.3653e-01,  1.9513e-01, -3.2537e-01,  2.2593e-01,
         1.1361e+00,  4.6688e-02, -9.2415e-01,  6.5728e-01, -6.4385e-02,
        -1.1976e-01,  3.9558e-01,  5.8848e-01,  1.2

### Sandbox to play around with embeddings

In [155]:
sentenceEmbedding1 = getEmbedding("rumrejse",dkBERT,dkTokenizer)

['[CLS]', 'rum', '##rejse', '[SEP]']
torch.Size([768])


In [156]:
sentenceEmbedding2 = getEmbedding("astronaut lander på månen",dkBERT,dkTokenizer)

['[CLS]', 'astron', '##aut', 'lander', 'pa', 'man', '##en', '[SEP]']
torch.Size([768])


In [160]:
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosSim(vector1, vector2):
    
    cos_sim = dot(vector1,vector2)/(norm(vector1)*norm(vector2))
    
    return cos_sim

#### Similarity

In [161]:
sentenceEmbeddingWord1 = sentenceEmbedding1.detach().numpy()
sentenceEmbeddingWord2 = sentenceEmbedding2.detach().numpy()

cosSim(sentenceEmbeddingWord1,sentenceEmbeddingWord2)

0.6316648
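
To help interpret a value such as the one above, here is a dependency-free sketch of the same cosine formula applied to small hand-made vectors. The vectors are illustrative only and have nothing to do with the model's embeddings; `cosine_similarity` is a hypothetical stand-in for `cosSim`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot_product = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_product / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(a, [2.0, 4.0, 6.0])    # scaled copy of a
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # perpendicular vectors
opposite = cosine_similarity(a, [-1.0, -2.0, -3.0])       # a pointing the other way

print(same_direction, orthogonal, opposite)
```

The result depends only on direction, not on length: vectors pointing the same way score close to 1, perpendicular vectors score 0, and opposite vectors score close to -1.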

### Fine-tuning for downstream tasks

In [None]:
## To be done 

#### NER

In [None]:
## To be done