In [None]:
# restart the VM (runtime) after installing 'sentencepiece'
!pip install transformers 
!pip install sentencepiece

# NLG For Conversational Q&A
**Goal:**

The goal is to assemble an NLP pipeline for conversational closed-domain Q&A that answers a question with a longer text response (usually one whole sentence) that reuses words from the question.

**Approach:**

1. Use a pretrained BERT Q&A model for question answering; it generates short answers (keywords).
2. Finetune a pretrained T5 model to generate longer answers from both the text of the question and the keyword answer, so that the response suits a conversational setting.
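The two steps can be read as a simple function composition. The sketch below only illustrates the data flow; `answer_conversationally` and the two callables are hypothetical names, and the stand-in lambdas replace the real models. The prompt uses the `q: ... a: ...` format of the training data in this notebook.

```python
from typing import Callable


def answer_conversationally(
    question: str,
    context: str,
    extract_short_answer: Callable[[str, str], str],
    generate_sentence: Callable[[str], str],
) -> str:
    """Chain the two stages: extractive QA first, then sentence generation."""
    # Stage 1: the QA model finds a short keyword answer in the context.
    short_answer = extract_short_answer(question, context)
    # Stage 2: the generator rewrites question + keyword answer as a sentence.
    prompt = f"q: {question} a: {short_answer}"
    return generate_sentence(prompt)


# Stand-in callables, only to show the data flow (no models are loaded here).
demo = answer_conversationally(
    "Who wrote it?",
    "The report was written by Alice.",
    extract_short_answer=lambda q, c: "Alice",
    generate_sentence=lambda prompt: f"[generated from: {prompt}]",
)
print(demo)
```

In the real pipeline, the first callable would wrap the question-answering model and the second would wrap the finetuned T5 model.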

**How To Finetune T5?**

Finetuning is done via few-shot learning: T5 can learn a new task from little training data because it already has knowledge of the language. Training data example: `q: Bis wann muss ich meine Wohnung nach Einzug anmelden? a: innerhalb von 14 Tagen` -> `Sie müssen Ihre Wohnung innerhalb von 14 Tagen anmelden.` (in English: `q: By when do I have to register my apartment after moving in? a: within 14 days` -> `You must register your apartment within 14 days.`)
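As a minimal sketch of how one training pair is put together: the snippet below parses a one-row CSV excerpt and builds the model input and target. The semicolon separator and the column names `question`, `short` and `long` are the ones the training cells in this notebook read; `SAMPLE_CSV` and `make_training_pair` are hypothetical names used only for illustration.

```python
import csv
import io

# Hypothetical one-row excerpt in the layout the training cell expects:
# semicolon-separated columns named question, short and long.
SAMPLE_CSV = (
    "question;short;long\n"
    "Bis wann muss ich meine Wohnung nach Einzug anmelden?;"
    "innerhalb von 14 Tagen;"
    "Sie müssen Ihre Wohnung innerhalb von 14 Tagen anmelden.\n"
)


def make_training_pair(row):
    """Return (model input, target sentence) for one CSV row."""
    source = "q: " + row["question"] + " a: " + row["short"]
    target = row["long"]
    return source, target


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV), delimiter=";"))
source, target = make_training_pair(rows[0])
print(source)
print(target)
```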


**Results & Findings**
* BERT's ability to find answers is acceptable but not great. It fails on more complex answers, especially ones involving logical connectives such as AND or OR.
* Even though T5 was finetuned on only 20 domain examples, it generates good-quality responses for unseen data. The problem is that the best generated response is not always the highest-ranked one, so a human is still needed to pick the best response. Training with more data would probably help.
* Practical implementation and integration in Rasa: loading two relatively large models requires a lot of RAM, so the pipeline probably does not run alongside the rest of Rasa on computers with less than 8GB of RAM. Inference time is also high (5-6s per question without Rasa NLU and without GPU acceleration), which makes it less suitable for live chat.
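A per-question inference time like the one quoted above can be measured with a small timing wrapper. This is a sketch only: `timed` and `fake_pipeline` are hypothetical names, and the stand-in function just sleeps instead of running the QA and T5 models.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in for the real pipeline call; replace it with the QA + T5 function.
def fake_pipeline(question):
    time.sleep(0.05)
    return "answer to: " + question


result, seconds = timed(fake_pipeline, "example question")
print(f"{seconds:.2f}s per question")
```

Averaging over several questions gives a more stable number than a single call, since the first call to a model often includes one-off warm-up cost.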


**Update 23.01.2022:**
* BERT has been replaced; ELECTRA is now used for question answering
* ELECTRA outperforms BERT in the Q&A task and can also handle more complex contexts containing AND or OR
* ELECTRA uses less space than BERT



## Finetune T5 via Few Shot Learning


In [1]:
import pandas as pd
import torch
from tqdm import tqdm

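# AdamW from transformers is deprecated; the PyTorch implementation takes the same arguments used below
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5Tokenizer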

In [2]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')

In [3]:
# load data
df = pd.read_csv('./t5-train.csv', sep=';')

In [4]:
# optimizer (both parameter groups use weight_decay=0.0, so the grouping currently has no effect)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in t5_model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
    {
        "params": [p for n, p in t5_model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-4, eps=1e-8)

In [None]:
# enable gpu acceleration if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
t5_model.to(device)

In [6]:
# sets model to train mode
t5_model.train()

epochs = 20
with tqdm(total = epochs) as epoch_pbar:
  for epoch in range(epochs):
    acc_loss = 0
    idx = 0
    for index, row in df.iterrows():
      # get raw train data and format it
      input_sent = "q: " + row['question'] + " a: " + row['short']
      output_sent = row['long']

      # tokenize text input and output (label); truncate anything longer than max_length
      tokenized_input = tokenizer.encode_plus(input_sent, max_length=128, padding='max_length', truncation=True, return_tensors="pt")
      tokenized_output = tokenizer.encode_plus(output_sent, max_length=64, padding='max_length', truncation=True, return_tensors="pt")

      # get token ids and attention masks for inputs and labels
      input_ids = tokenized_input["input_ids"].to(device)
      attention_mask = tokenized_input["attention_mask"].to(device)
      labels = tokenized_output["input_ids"].to(device)
      decoder_attention_mask = tokenized_output["attention_mask"].to(device)

      # the forward function automatically creates the correct decoder_input_ids
      output = t5_model(
          input_ids = input_ids, 
          labels = labels,
          decoder_attention_mask = decoder_attention_mask,
          attention_mask = attention_mask
      )

      # get train loss; accumulate it as a plain float so the computation graph is not kept alive
      loss = output[0]
      acc_loss += loss.item()
      idx += 1

      # backpropagation 
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()

    # update progress bar (idx already equals the number of processed examples)
    avg_loss = acc_loss / idx
    desc = f'loss {avg_loss:.4f}'
    epoch_pbar.set_description(desc)
    epoch_pbar.update(1)

loss 0.0192: 100%|██████████| 20/20 [01:00<00:00,  3.00s/it]
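The training loop pads every label sequence to 64 tokens and passes the padded ids directly as `labels`, so the padding positions also count towards the loss. The Hugging Face T5 loss ignores label positions set to `-100`, and the usual recommendation is to replace padding ids with that value. The sketch below shows the replacement on a plain list; `mask_padding` is a hypothetical helper and the token ids are made up. Applying it would change the loss values reported above.

```python
IGNORE_INDEX = -100  # label value that the Hugging Face T5 loss skips


def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in a label sequence with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in label_ids]


# Made-up token ids padded with 0, which is the pad id of the T5 tokenizer.
padded = [363, 19, 8, 1, 0, 0, 0]
masked = mask_padding(padded, pad_token_id=0)
print(masked)
```

On the tensors used in the loop, the equivalent would be `labels.masked_fill(labels == tokenizer.pad_token_id, -100)`.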


In [7]:
t5_model.save_pretrained("./t5/", push_to_hub=False)

## Q&A NLG Pipeline 

In [8]:
from transformers import pipeline, T5Tokenizer, T5ForConditionalGeneration

In [9]:
qa_pipeline = pipeline("question-answering",
                       model="deepset/gelectra-base-germanquad",
                       tokenizer="deepset/gelectra-base-germanquad")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("./t5/")

In [10]:
context="Mit einem Führungszeugnis können Sie nachweisen, dass Sie nicht vorbestraft sind. Führungszeugnisse unterscheidet man danach, ob sie bestimmt sind für private Zwecke (zum Beispiel für Ihren Arbeitgeber) oder für Behörden (sogenanntes „behördliches Führungszeugnis“, auch „Führungszeugnis zur Vorlage bei einer Behörde“). Außerdem gibt es unterschiedliche Arten von Führungszeugnissen nämlich, ein einfaches Führungszeugnis und ein erweitertes Führungszeugnis. Angehörige anderer EU-Staaten erhalten ein europäisches Führungszeugnis. Europäische Führungszeugnisse enthalten auch Strafregister-Einträge aus Ihrem Heimatland. Das Führungszeugnis wird erstellt vom Bundesamt für Justiz in Bonn (Bundeszentralregister). Wird das Führungszeugnis für private Zwecke benötigt, erhalten Sie es postalisch an Ihre Anschrift übersandt; eines für behördliche Zwecke geht direkt an die Behörde."
question="Was stehet in dem Europäischen Führungszeugnis?"
question

'Was stehet in dem Europäischen Führungszeugnis?'

In [11]:
qa_res = qa_pipeline(question=question, context=context)['answer']
qa_res

'Strafregister-Einträge aus Ihrem Heimatland'
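Besides `answer`, the question-answering pipeline result also carries a `score` plus `start` and `end` character offsets. One possible use, sketched below, is to skip the generation step when the score is low. The helper `pick_answer`, the threshold of 0.3 and the score and offset values in the example dictionaries are all made up for illustration and are not measurements from this notebook.

```python
def pick_answer(qa_result, min_score=0.3):
    """Return the short answer, or None if the model is not confident enough."""
    if qa_result["score"] < min_score:
        return None
    return qa_result["answer"]


# Illustrative result dictionaries; scores and offsets are invented.
confident = {
    "score": 0.9,
    "start": 0,
    "end": 43,
    "answer": "Strafregister-Einträge aus Ihrem Heimatland",
}
unsure = {"score": 0.05, "start": 0, "end": 3, "answer": "Mit"}

print(pick_answer(confident))
print(pick_answer(unsure))
```

A suitable threshold would have to be tuned on real questions from the domain.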

In [12]:
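# build the T5 prompt in the same format as the training data (avoid shadowing the built-in input)
input_text = "q: " + question + " a: " + qa_res
tokenized = tokenizer.encode_plus(input_text, return_tensors="pt")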

input_ids = tokenized["input_ids"]
attention_mask = tokenized["attention_mask"]

model.eval()
beam_outputs = model.generate(input_ids=input_ids,
                              attention_mask=attention_mask,
                              max_length=64,
                              early_stopping=True,
                              num_beams=10,
                              num_return_sequences=5,
                              no_repeat_ngram_size=2)

for beam_output in beam_outputs:
    sent = tokenizer.decode(beam_output,
                            skip_special_tokens=True,
                            clean_up_tokenization_spaces=True)
    print(sent)


Das Europäische Führungszeugnis beinhaltet Strafregister-Einträge aus Ihrem Heimatland.
Was stehet in dem Europäischen Führungszeugnis aus Ihrem Heimatland?
Sie erhalten in dem Europäischen Führungszeugnis einträge aus Ihrem Heimatland.
Strafregister-Einträge aus Ihrem Heimatland.
Das Strafregister-Einträge aus Ihrem Heimatland sind.
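Since the best response is not always ranked first, a simple rule-based filter can at least remove obviously unsuitable candidates before a human picks one. The sketch below is one possible heuristic, not part of the original pipeline: `filter_candidates` is a hypothetical helper that drops candidates that are questions, that do not contain the short answer, or that only repeat the short answer. The candidates are the five beams printed above.

```python
def filter_candidates(candidates, short_answer):
    """Keep candidates that look like full statements containing the answer."""
    kept = []
    for cand in candidates:
        text = cand.strip()
        if text.endswith("?"):
            continue  # the model echoed a question instead of answering
        if short_answer.lower() not in text.lower():
            continue  # the keyword answer got lost during generation
        if text.rstrip(". ") == short_answer:
            continue  # nothing but the keyword answer, not a full sentence
        kept.append(cand)
    return kept


short_answer = "Strafregister-Einträge aus Ihrem Heimatland"
beams = [
    "Das Europäische Führungszeugnis beinhaltet Strafregister-Einträge aus Ihrem Heimatland.",
    "Was stehet in dem Europäischen Führungszeugnis aus Ihrem Heimatland?",
    "Sie erhalten in dem Europäischen Führungszeugnis einträge aus Ihrem Heimatland.",
    "Strafregister-Einträge aus Ihrem Heimatland.",
    "Das Strafregister-Einträge aus Ihrem Heimatland sind.",
]
kept = filter_candidates(beams, short_answer)
for sentence in kept:
    print(sentence)
```

The filter keeps the original beam order, so it narrows the choice but does not judge grammar; the last beam above still passes even though it is not a well-formed sentence.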
