# Writing with a transformer (Danish BERT)
This notebook uses a pre-trained transformer model (BERT) for Danish.

The weights were provided by Mollerhoj from https://github.com/botxo/nordic_bert

The conversion of the weights was done with help from Daniel Varab https://github.com/danielvarab/convert_da_bert

This notebook shows how the model can be used for tokenization, masked language modeling (MaskedLM), and writing sentences one token at a time.

To run this notebook, first download the 'bert-base-danish-uncased-v2' folder, which I have made available here: https://drive.google.com/drive/folders/1TsttvxSAjwJHKu5Onhq7KSEADU5yS7T1?usp=sharing

//By MGM

In [1]:
import torch
import transformers

In [2]:
# Load pre-trained transformer (BERT) model and tokenizer 
model = transformers.BertForMaskedLM.from_pretrained('bert-base-danish-uncased-v2')
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-danish-uncased-v2')

## Example of tokenization

In [3]:
txt = "Dansk er et skønt sprog"
tokens = tokenizer.tokenize(txt)

print("Tokens:")
print(tokens)
print("")

indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
print("Token indices:")
print(indexed_tokens)

Tokens:
['dansk', 'er', 'et', 'skønt', 'sprog']

Token indices:
[674, 33, 100, 5192, 1968]
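
BERT tokenizers split text with WordPiece: each word is matched against the vocabulary greedily, longest piece first, and pieces that continue a word carry a `##` prefix. Every word in the example above happens to be a single vocabulary entry, so no splitting is visible. The following is a minimal sketch of the idea, assuming a tiny made-up vocabulary (`toy_vocab` and its ids are hypothetical and unrelated to the real model's vocabulary):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real model's vocabulary and ids differ.
toy_vocab = {"[UNK]": 0, "dansk": 1, "sprog": 2, "##et": 3}

tokens = []
for word in "Dansk sproget xyz".lower().split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
ids = [toy_vocab[t] for t in tokens]

print(tokens)  # ['dansk', 'sprog', '##et', '[UNK]']
print(ids)     # [1, 2, 3, 0]
```

Here `sproget` is not in the toy vocabulary, so it is split into `sprog` and `##et`, while `xyz` has no matching piece at all and falls back to `[UNK]`.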


## Predict word(s) using MaskedLM

In [4]:
# Inspired by Daniel Varab

txt_input = "I dag er en [MASK] dag ."

# Add the special [CLS] and [SEP] tokens that BERT expects around a sentence
sentence_input = "[CLS] " + txt_input + " [SEP]"
print(sentence_input)

tokens = tokenizer.tokenize(sentence_input)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
# Get the position of the [MASK] token
mask_index = tokens.index("[MASK]")

# Convert tokens to a tensor
tokens_tensor = torch.tensor([indexed_tokens])
segments_ids = [0] * len(indexed_tokens)
segments_tensors = torch.tensor([segments_ids])

# We don't want PyTorch to calculate the gradients
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]

predicted_index = torch.argmax(predictions[0, mask_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)

print("Predicted token: " + predicted_token)

[CLS] I dag er en [MASK] dag . [SEP]
Predicted token: stor


In [5]:
# Print the top k=10 predicted words
for i, idx in enumerate(torch.topk(predictions[0, mask_index], k=10)[1]):
    print("%d:" % (i+1), tokenizer.convert_ids_to_tokens(idx.item()))

1: stor
2: fantastisk
3: dejlig
4: god
5: fars
6: perfekt
7: anden
8: mors
9: lang
10: ny
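
The ranking above is based on raw scores (logits), which are hard to interpret on their own. A softmax turns them into probabilities that sum to 1. Below is a small sketch in plain Python, assuming made-up logits for a four-word toy vocabulary (`toy_labels` and `toy_logits` are hypothetical, not real model output):

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, labels, k):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for a four-word toy vocabulary, not real model output.
toy_labels = ["stor", "god", "lang", "ny"]
toy_logits = [2.0, 1.0, 0.5, 0.1]

probs = softmax(toy_logits)
for rank, (label, p) in enumerate(top_k(probs, toy_labels, k=3), start=1):
    print(f"{rank}: {label} ({p:.3f})")
```

With the tensors from this notebook, the same step should be possible with `torch.softmax(predictions[0, mask_index], dim=-1)` before calling `torch.topk`.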


## Write with the Danish BERT transformer

In [6]:
sentence = "Jeg ønsker at "

print("...........")
print("Input: " + sentence)
print("...........")

# Number of tokens to generate (punctuation marks count as tokens)
n = 15


sentence_input = "[CLS] " + sentence + " [MASK]"

for i in range(n):
    tokens = tokenizer.tokenize(sentence_input)
    mask_index = tokens.index("[MASK]")

    indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_ids = [0] * len(indexed_tokens)
    segments_tensors = torch.tensor([segments_ids])

    with torch.no_grad():
        outputs = model(tokens_tensor, token_type_ids=segments_tensors)
        predictions = outputs[0]

    # Top predicted token
    predicted_index = torch.argmax(predictions[0, mask_index]).item()
    predicted_token = tokenizer.convert_ids_to_tokens(predicted_index) 
    
    # Attach punctuation directly to the preceding word (no space before it)
    if predicted_token in (".", ","):
        # If the model keeps repeating punctuation, the disabled block below
        # shows how to fall back to the second-best prediction instead.
        """
        if predicted_token in sentence[-2:]:
            predicted_index = torch.topk(predictions[0, mask_index], k=10)[1][1].item()
            predicted_token = tokenizer.convert_ids_to_tokens(predicted_index)
            #sentence = sentence + predicted_token + " "
            #sentence_input = "[CLS] " + sentence + "[MASK]"
            continue
        """
        sentence = sentence.rstrip() + predicted_token + " "
        sentence_input = "[CLS] " + sentence + "[MASK]"
        continue
    
    
    sentence = sentence + predicted_token + " "
    sentence_input = "[CLS] " + sentence + "[MASK]"

print("Output: " + sentence)

...........
Input: Jeg ønsker at 
...........
Output: Jeg ønsker at leve livet, som det er, og leve det med mening og indhold i 
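
The writing loop mixes two concerns: asking the model for the next token and gluing that token onto the text. The sketch below separates them so the gluing logic can be tried without loading the model. `append_token`, `write`, and `predict_next` are hypothetical names, and the handling of `##` word-piece continuations is an added assumption that the loop above does not include:

```python
def append_token(sentence, token):
    """Append one predicted token to the text, handling spacing."""
    if token in (".", ","):
        # No space before punctuation.
        return sentence.rstrip() + token + " "
    if token.startswith("##"):
        # Assumption: glue a word-piece continuation onto the previous word.
        return sentence.rstrip() + token[2:] + " "
    return sentence + token + " "


def write(sentence, n, predict_next):
    """Greedily extend `sentence` by n tokens.

    `predict_next` is a hypothetical callable: it receives the masked input
    string and returns the most likely token for the [MASK] position.
    """
    for _ in range(n):
        masked_input = "[CLS] " + sentence + "[MASK]"
        sentence = append_token(sentence, predict_next(masked_input))
    return sentence


# Stand-in predictor that replays a fixed list, so the loop runs without the model.
canned = iter(["leve", "livet", ",", "som", "det", "er", "."])
result = write("Jeg ønsker at ", 7, lambda masked_input: next(canned))
print(result)
```

To use it with the real model, `predict_next` would wrap the tokenize, forward pass, and `argmax` steps from the loop above. Because the loop always takes the single best token, it is greedy decoding: the same input always produces the same text.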
