# Generate answers
## TriviaQA
TriviaQA is used here for closed-book QA, i.e. the model answers without a supporting paragraph.

In [1]:
from datasets import load_dataset
import datasets
import torch

from transformers import AutoTokenizer, OPTForCausalLM

In [2]:
data_dir = "../data"

### Load and inspect data

In [3]:
data_trivia = load_dataset("trivia_qa", "rc.nocontext")
data_trivia_train = data_trivia["train"]
data_trivia_val = data_trivia["validation"]
data_trivia_test = data_trivia["test"]

print(f"Trivia QA Training Set Size: {data_trivia_train.shape}")
print(f"Trivia QA Validation Set Size: {data_trivia_val.shape}")
print(f"Trivia QA Test Set Size: {data_trivia_test.shape}")

Trivia QA Training Set Size: (138384, 6)
Trivia QA Validation Set Size: (17944, 6)
Trivia QA Test Set Size: (17210, 6)


In [4]:
print("Training Set")
for i in range(2):
    print(f"Q: {data_trivia_train[i]['question']}\nA: {data_trivia_train[i]['answer']['value']}\n")

print("Validation Set")
for i in range(2):
    print(f"Q: {data_trivia_val[i]['question']}\nA: {data_trivia_val[i]['answer']['value']}\n")

print("Test Set")
for i in range(2):
    print(f"Q: {data_trivia_test[i]['question']}\nA: {data_trivia_test[i]['answer']['value']}\n")

Training Set
Q: Which American-born Sinclair won the Nobel Prize for Literature in 1930?
A: Sinclair Lewis

Q: Where in England was Dame Judi Dench born?
A: York

Validation Set
Q: Who was the man behind The Chipmunks?
A: David Seville

Q: Which Lloyd Webber musical premiered in the US on 10th December 1993?
A: Sunset Boulevard

Test Set
Q: Asmara international airport is in which country?
A: <unk>

Q: At whose concert were 11 people trampled to death in Ohio in 1979?
A: <unk>



Estimating the uncertainty requires the correct answer for each question. The test set hides its answers (they appear as <unk>), so only the train and validation sets can be used here.
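
A quick way to confirm this for a whole split is to check each record for a usable reference answer. The sketch below assumes that hidden answers are marked with the placeholder `<unk>`, as in the test examples printed above. The helper name `has_gold_answer` is hypothetical, and the records are toy data that only mimic the dataset structure.

```python
def has_gold_answer(example, placeholder="<unk>"):
    """Return True if the example carries a usable reference answer."""
    value = example["answer"]["value"]
    return bool(value) and value != placeholder


# Toy records that mimic the structure printed above
examples = [
    {"question": "Where in England was Dame Judi Dench born?", "answer": {"value": "York"}},
    {"question": "Asmara international airport is in which country?", "answer": {"value": "<unk>"}},
]

usable = [ex for ex in examples if has_gold_answer(ex)]
print(len(usable))
```

With the real dataset, the same predicate could be passed to `Dataset.filter`, for example `data_trivia_test.filter(has_gold_answer)`.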

### Run some predictions
As in the paper, the OPT model is used. Because of hardware constraints, I use OPT with 2.7B parameters, the smallest OPT model used in the paper (see page 7).

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [6]:
# Causal LM: the model predicts the next token
# TODO: use a smaller model
checkpoint = "opt-2.7b"
tokenizer = AutoTokenizer.from_pretrained(f"facebook/{checkpoint}", cache_dir=data_dir)
model = OPTForCausalLM.from_pretrained(f"facebook/{checkpoint}", cache_dir=data_dir)
model = model.to(device)

In [7]:
# The question format follows page 16 of the paper
# The tokenizer returns PyTorch tensors with a batch dimension of 1, which are moved to the GPU
for i in range(5):
    question = "Q: " + data_trivia_val[i]["question"] + " A:"
    answer = data_trivia_val[i]["answer"]["value"]

    inputs = tokenizer(question, padding=False, truncation=False, return_tensors="pt").to(device)
    length_input = inputs["input_ids"].shape[1]

    generate_ids = model.generate(inputs.input_ids, max_length = 256)
    # Only decode answer and not posed question
    output = tokenizer.batch_decode(generate_ids[0][length_input:], skip_special_tokens=True)

    print(question)
    print(f"True answer: {answer}")
    print(f"Model output: {''.join(output)}")
    print("-------------------------")

Q: Who was the man behind The Chipmunks? A:
True answer: David Seville
Model output:  The man behind The Chipmunks was a man named Paul Reubens. He was born in New York City in the early 1960s. He was a child actor, and he was a very talented child actor. He was in a lot of commercials and a lot of television shows. He was in a lot of commercials for the Hershey's chocolate bar. He was in a lot of commercials for the Coca-Cola Company. He was in a lot of commercials for the Kellogg's cereal. He was in a lot of commercials for the Hershey's chocolate bar. He was in a lot of commercials for the Kellogg's cereal. He was in a lot of commercials for the Hershey's chocolate bar. He was in a lot of commercials for the Kellogg's cereal. He was in a lot of commercials for the Hershey's chocolate bar. He was in a lot of commercials for the Kellogg's cereal. He was in a lot of commercials for the Hershey's chocolate bar. He was in a lot of commercials for the Kellogg's cereal. He was in a lot of 

Apart from all the answers being wrong, the model does not stop after the answer: it continues to pose new questions and gives a separate answer for each of them.

To address this issue, the paper (page 16) proposes trimming all generations by pattern matching on the bad words "Q:", "Question:", "QUESTION:", "questions:". This means that those tokens will not be part of the generation. Note that "Q:" and " Q:" (with a leading space) are tokenized differently, as the following example demonstrates. As a result, for every bad word listed, I also add its leading-space variant as a bad word.

In [11]:
print(f"Tokenization for 'Q:': {tokenizer('Q:')}")
print(f"Tokenization for ' Q:': {tokenizer(' Q:')}")

example_answer = "The Chipmunks Q: Which of these two countries is the largest producer of coffee?"
print(f"Excerpt of example answer: {example_answer}")
tokenized_example_answer = tokenizer(example_answer)
print(f"Tokenized example answer: {tokenized_example_answer}")

Tokenization for 'Q:': {'input_ids': [2, 1864, 35], 'attention_mask': [1, 1, 1]}
Tokenization for ' Q:': {'input_ids': [2, 1209, 35], 'attention_mask': [1, 1, 1]}
Excerpt of example answer: The Chipmunks Q: Which of these two countries is the largest producer of coffee?
Tokenized example answer: {'input_ids': [2, 133, 11055, 20614, 2258, 1209, 35, 6834, 9, 209, 80, 749, 16, 5, 1154, 3436, 9, 3895, 116], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


As you can see, the sequence 1209, 35 (" Q:") appears in the tokenized example answer, whereas the sequence 1864, 35 ("Q:") does not.
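
Writing both variants of every bad word by hand is error-prone, so the list can also be generated. The following is a minimal sketch in plain Python. The helper name `with_space_variants` is an assumption for illustration and is not part of the paper or of `transformers`.

```python
def with_space_variants(words):
    """Return each word followed by its leading-space variant, without duplicates."""
    variants = []
    for word in words:
        for candidate in (word, " " + word.lstrip()):
            if candidate not in variants:
                variants.append(candidate)
    return variants


bad_words = with_space_variants(["Q:", "Question:", "QUESTION:", "questions:"])
print(bad_words)
```

The resulting strings still need to be tokenized before they can be passed to `generate`.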

Because TriviaQA answers are short and fit on one line, the token of the newline character is set as the EOS token id. This means that generation stops as soon as a \n is generated. In addition, the paper (page 16) suggests few-shot prompting (n = 10) for TriviaQA.
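
Independently of how generation is stopped, a generated string can also be trimmed after the fact, which is closer to the pattern matching described in the paper. The sketch below is only an assumption about how such trimming could look. The function name `trim_generation` and the exact pattern list are hypothetical.

```python
import re

# Markers after which the model starts a new question or a new line
STOP_PATTERNS = ["Q:", "Question:", "QUESTION:", "questions:", "\n"]


def trim_generation(text, stop_patterns=STOP_PATTERNS):
    """Cut the text at the first occurrence of any stop pattern."""
    pattern = "|".join(re.escape(p) for p in stop_patterns)
    match = re.search(pattern, text)
    if match:
        text = text[:match.start()]
    return text.strip()


print(trim_generation(" David Seville Q: Which city does David Soul come from?"))
```

Trimming like this leaves the generation itself untouched, so it can be combined with an EOS token or bad words rather than replacing them.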

In [19]:
selected_training_data = data_trivia_train.select(range(0, 10))
ten_shot_prompt = ""

for data in selected_training_data:
    ten_shot_prompt += "Q: " + data["question"] + " A: " + data["answer"]["value"] + " "

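# Define stop tokens: keep only the token at position 1 of each stop word (position 0 is a special token)
# Note: this bans the first token of each stop word (e.g. the token for "Q"), not the full sequence "Q:"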
stop_tokens = ["Q:", "Question:", "QUESTION:", "questions:", " Q:", " Question:", " QUESTION:", " questions:"]
stop_tokens = [[tokenizer(stop_token)["input_ids"][1]] for stop_token in stop_tokens]

# Define eos token
eos_token = tokenizer("\n")["input_ids"][1]

# Try again
for i in range(5):
    question = ten_shot_prompt + "Q: " + data_trivia_val[i]["question"] + " A:"
    answer = data_trivia_val[i]["answer"]["value"]

    inputs = tokenizer(question, padding=False, truncation=False, return_tensors="pt").to(device)
    length_input = inputs["input_ids"].shape[1]

    generate_ids = model.generate(inputs.input_ids, 
                                  max_length = 256 + length_input,
                                  eos_token_id = eos_token,
                                  bad_words_ids = stop_tokens)
    output = tokenizer.batch_decode(generate_ids[0][length_input:], skip_special_tokens=True)

    print(question)
    print(f"True answer: {answer}")
    print(f"Model Output: {''.join(output)}")

    print("-------------------------")

Q: Which American-born Sinclair won the Nobel Prize for Literature in 1930? A: Sinclair Lewis Q: Where in England was Dame Judi Dench born? A: York Q: In which decade did Billboard magazine first publish and American hit chart? A: 30s Q: From which country did Angola achieve independence in 1975? A: Portugal Q: Which city does David Soul come from? A: Chicago Q: Who won Super Bowl XX? A: Chicago Bears Q: Which was the first European country to abolish capital punishment? A: Norway Q: In which country did he widespread use of ISDN begin in 1988? A: Japan Q: What is Bruce Willis' real first name? A: Walter Q: Which William wrote the novel Lord Of The Flies? A: Golding Q: Who was the man behind The Chipmunks? A:
True answer: David Seville
Model Output:  The Chipmunks' producer A: Who was the first person to be elected to the US Congress from the state of New York? A: Thomas Jefferson A: Who was the first person to be elected to the US Congress from the state of New York? A: Thomas Jeffers

KeyboardInterrupt: 

As you can see, multiple answers are now returned. As a result, I also add "A:", "Answer:", "ANSWER:", "answers:" and their leading-space variants as stop words.

In addition, I change Q: and A: to Question: and Answer: as this returns better results.
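
Switching between the two prompt formats is easier when the prefixes are parameters. The helper below is a hypothetical sketch (the name `build_few_shot_prompt` and its arguments are assumptions). It mirrors the string concatenation used in this notebook and uses a toy record instead of the real training data.

```python
def build_few_shot_prompt(examples, question, q_prefix="Question:", a_prefix="Answer:", header=""):
    """Concatenate solved examples and append the open question."""
    shots = "".join(
        f"{q_prefix} {ex['question']} {a_prefix} {ex['answer']['value']} "
        for ex in examples
    )
    return f"{header}{shots}{q_prefix} {question} {a_prefix}"


# Toy record that mimics the dataset structure
shots = [
    {"question": "Where in England was Dame Judi Dench born?", "answer": {"value": "York"}},
]
prompt = build_few_shot_prompt(shots, "Who was the man behind The Chipmunks?")
print(prompt)
```

Passing `q_prefix="Q:"` and `a_prefix="A:"` reproduces the earlier format, which makes the two variants easy to compare.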

In [20]:
selected_training_data = data_trivia_train.select(range(0, 10))
# few shot prompting
ten_shot_prompt = "This is a bot that correctly and precisely answers questions.\n"

for data in selected_training_data:
    ten_shot_prompt += "Question: " + data["question"] + " Answer: " + data["answer"]["value"] + " "

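# Define stop tokens: keep only the token at position 1 of each stop word (position 0 is a special token)
# Note: this bans the first token of each stop word (e.g. the token for "Q"), not the full sequence "Q:"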
stop_tokens = ["Q:", "Question:", "QUESTION:", "questions:", " Q:", " Question:", " QUESTION:", " questions:",
               "A:", "Answer:", "ANSWER:", "answers:", " A:", " Answer:", " ANSWER:", " answers:"]
stop_tokens = [[tokenizer(stop_token)["input_ids"][1]] for stop_token in stop_tokens]

# Define eos token
eos_token = tokenizer("\n")["input_ids"][1]

# Try again
for i in range(5):
    question = ten_shot_prompt + "Question: " + data_trivia_val[i]["question"] + " Answer:"
    answer = data_trivia_val[i]["answer"]["value"]

    inputs = tokenizer(question, padding=False, truncation=False, return_tensors="pt").to(device)
    length_input = inputs["input_ids"].shape[1]

    generate_ids = model.generate(inputs.input_ids, 
                                  max_length = 256 + length_input,
                                  eos_token_id = eos_token,
                                  bad_words_ids = stop_tokens)
    output = tokenizer.batch_decode(generate_ids[0][length_input:], skip_special_tokens=True)

    print(question)
    print(f"True answer: {answer}")
    print(f"Model Output: {''.join(output)}")

    print("-------------------------")

This is a bot that correctly and precisely answers questions.
Question: Which American-born Sinclair won the Nobel Prize for Literature in 1930? Answer: Sinclair Lewis Question: Where in England was Dame Judi Dench born? Answer: York Question: In which decade did Billboard magazine first publish and American hit chart? Answer: 30s Question: From which country did Angola achieve independence in 1975? Answer: Portugal Question: Which city does David Soul come from? Answer: Chicago Question: Who won Super Bowl XX? Answer: Chicago Bears Question: Which was the first European country to abolish capital punishment? Answer: Norway Question: In which country did he widespread use of ISDN begin in 1988? Answer: Japan Question: What is Bruce Willis' real first name? Answer: Walter Question: Which William wrote the novel Lord Of The Flies? Answer: Golding Question: Who was the man behind The Chipmunks? Answer:
True answer: David Seville
Model Output:  The Chipmunks' creator, Don Bluth

----------

The format of the answers now looks good, although their correctness is still debatable.
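
Correctness can be checked more systematically than by eye. The following sketch compares a generated answer with a list of accepted answers after light normalization. It is an illustrative assumption, not the evaluation used in the paper. The function names `normalize` and `is_correct` are hypothetical, and the containment check is deliberately lenient.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_correct(prediction, accepted_answers):
    """Return True if any accepted answer occurs in the normalized prediction."""
    prediction = normalize(prediction)
    candidates = [normalize(ans) for ans in accepted_answers]
    # Skip empty candidates, which would otherwise match every prediction
    return any(cand in prediction for cand in candidates if cand)


print(is_correct(" The Chipmunks' creator, Don Bluth", ["David Seville"]))
print(is_correct(" David Seville", ["David Seville"]))
```

Because the check is based on substring containment, a short accepted answer can match inside a longer wrong one, so a stricter exact-match comparison may be preferable for final numbers.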