## Few-Shot TLQA with FlanT5

Below, we run the generative models FlanT5-Large and FlanT5-XL in a few-shot setting, selecting demonstration examples from the training set of TLQA, which was derived from TempLAMA.
- `flan-t5-large`: 780M params, 1 GB mem
- `flan-t5-xl`: 3B params, 12 GB mem

In [1]:
!pip install transformers==4.37.2 optimum==1.12.0 --quiet
!pip install sentence-transformers
!pip install langchain-core



### Running FlanT5-Large and FlanT5-XL (zero-shot)

In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")  # torch_dtype is a model argument, not a tokenizer one
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")

# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

input_text = "List all Michael Jackson albums between 2000 and 2009."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
# input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))



<pad> Michael Jackson: The Greatest Hits</s>


### Dataset Exploration

In [3]:
import json
import os

def json_to_list(data_path):
    with open(data_path) as f:
        data = json.load(f)
    return data

train_set = json_to_list("../data/train_TLQA.json")
test_set = json_to_list("../data/test_TLQA.json")
print(len(train_set), '\n')
example = train_set[42]
for key in example.keys():
    print(f"{key}: {example[key]}")

3212 

question: List all entities that owned Linha de Évora from 2010 to 2020.
answers: ['REFER (2010, 2011, 2012, 2013, 2014, 2015)', 'Infrastructures of Portugal (2015, 2016, 2017, 2018, 2019, 2020)']
type: P127
subject: Linha de Évora
answer_ids: {'REFER': 'Q1411661', 'Infrastructures of Portugal': 'Q20730106'}
wikidata_ID: Q1826620
verified: True
subject_label: Linha de Évora
aliases: []
final_answers: ['REFER (2010, 2011, 2012, 2013, 2014, 2015)', 'Infraestruturas de Portugal (2015, 2016, 2017, 2018, 2019, 2020)']
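Each answer string in the record above packs an entity and its years into one value, `Entity (year, year, ...)`. A minimal sketch of splitting that format apart is shown below; `parse_answer` and `ANSWER_PATTERN` are hypothetical helpers introduced here for illustration and are not part of the TLQA code.

```python
import re

# Hypothetical pattern: an entity followed by a trailing "(y1, y2, ...)" group.
ANSWER_PATTERN = re.compile(r"^(?P<entity>.+?)\s*\((?P<years>[\d,\s]*)\)\s*$")

def parse_answer(answer):
    """Return (entity, [years]) for a string such as 'REFER (2010, 2011)'."""
    match = ANSWER_PATTERN.match(answer.strip())
    if match is None:
        # No trailing "(years)" group: treat the whole string as the entity.
        return answer.strip(), []
    years = [int(y) for y in match.group("years").split(",") if y.strip()]
    return match.group("entity"), years

example_answers = [
    "REFER (2010, 2011, 2012, 2013, 2014, 2015)",
    "Infraestruturas de Portugal (2015, 2016, 2017, 2018, 2019, 2020)",
]
for answer in example_answers:
    entity, years = parse_answer(answer)
    print(entity, years)
```

Keeping entities and years separate in this way makes it possible to score them independently later on.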


### Few-Shot Example Generation using KNN Search

To build the few-shot prompt for a test question, we retrieve the training examples whose questions are closest to it, using KNN search in a continuous vector space. Any embedding model from sentence-transformers can be used.
We reuse the KNN search from https://anonymous.4open.science/r/ICAT-47FC/code/2wikimultihopqa/get_chatgpt_skill_transfer_knn.py.
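The retrieval idea can be illustrated without any embedding model. The sketch below ranks toy 2-D vectors by cosine similarity and returns the items behind the top k; `cosine` and `top_k_neighbours` are hypothetical names, and the toy vectors merely stand in for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_neighbours(query_vec, data_vecs, items, k):
    """Return the k items whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(data_vecs)]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [items[idx] for _, idx in scored[:k]]

# Toy data: four "questions" with made-up 2-D embeddings.
toy_items = ["q0", "q1", "q2", "q3"]
toy_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k_neighbours([1.0, 0.0], toy_vecs, toy_items, k=2))
```

A full sort is fine at this scale (a few thousand training questions); an approximate index would only become worthwhile for much larger collections.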

In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class KnnSearch:
    def __init__(self, data=None, num_trees=None, emb_dim=None):
        self.num_trees = num_trees
        self.emb_dim = emb_dim
        # NOTE: any embedding model from sentence-transformers can be used.
        # Load it once here so it is not re-initialised on every embedding call.
        self.model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

    def get_embeddings_for_data(self, data_ls):
        embeddings = self.model.encode(data_ls)
        return embeddings

    def get_top_n_neighbours(self, sentence, data_emb, transfer_data, k):
        """
        Retrieves the top k most similar questions for "sentence" based on cosine similarity from given embeddings "data_emb".

        Parameters:
        sentence (str): The input sentence to find similar questions for.
        data_emb (np.ndarray): The embeddings for the transfer questions.
        transfer_data (list): The list of transfer questions corresponding to data_emb.
        k (int): The number of top similar questions to retrieve.

        Returns:
        list: The k entries of transfer_data whose questions are most similar to "sentence".
        """
        sent_emb = self.get_embeddings_for_data(sentence)
        top_questions = []

        # Flatten the (n, 1) similarity matrix to one score per transfer question
        text_sims = cosine_similarity(data_emb, [sent_emb]).ravel().tolist()
        results_sims = zip(range(len(text_sims)), text_sims)
        sorted_similarities = sorted(results_sims, key=lambda x: x[1], reverse=True)  # Obtain the highest similarities

        # NOTE: we only match based on questions, but include the full question-answer pair in resulting neighs
        for idx, item in sorted_similarities[:k]:
            top_questions.append(transfer_data[idx])

        return top_questions

In [5]:
# Extracts all questions from the (train) set used for getting neighbours
def get_transfer_questions(transfer_data):
    return [data["question"] for data in transfer_data]

In [6]:
knn = KnnSearch()

# Read the raw training set as a list of full records (question, answers, and metadata)
train_data = json_to_list("../data/train_TLQA.json")
# Keep questions only to embed (to use in similarity metric)
train_questions = get_transfer_questions(train_data)
train_questions_emb = knn.get_embeddings_for_data(train_questions)



In [7]:
# TODO: loop over test samples during/before inference time to create few-shot prompts
# question = "List all employers Elon Musk worked for from 1990 to 2024."
question = "List all positions held by Mark Zuckerberg from 2000 to 2024."
neighs = knn.get_top_n_neighbours(sentence=question, data_emb=train_questions_emb, transfer_data=train_data, k=3)   # data_emb is embedded questions only, so we only match questions
for neigh in neighs:
    print(neigh['question'])
    print(neigh['answers'])
    print('\n')

List all positions Per Sandberg held from 2015 to 2018.
['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']


List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']


List all positions David M. O'Connell held from 2010 to 2020.
['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']




### Setting up Few-Shot Prompt Template using LangChain

In [10]:
def simplify_dict_list(dict_list):
    return [{'question': item['question'], 'answers': item['answers']} for item in dict_list]

simple_neighs = simplify_dict_list(neighs)
print(simple_neighs)

[{'question': 'List all positions Per Sandberg held from 2015 to 2018.', 'answers': ['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']}, {'question': 'List all positions Mark Drakeford held from 2013 to 2020.', 'answers': ['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']}, {'question': "List all positions David M. O'Connell held from 2010 to 2020.", 'answers': ['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']}]


First, we configure a formatter that renders each few-shot example as a string. This formatter must be a PromptTemplate object.

In [15]:
from langchain_core.prompts.few_shot import FewShotPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate

example_prompt = PromptTemplate(
    input_variables=["question", "answers"], template="Question: {question}\n{answers}"
)

print(example_prompt.format(**simple_neighs[1]))
# print(example_prompt.format(question=f"{simple_neighs[0]['question']}", answers=f"{simple_neighs[0]['answers']}"))

Question: List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']


Now we feed examples and formatter to FewShotPromptTemplate.

In [23]:
prompt = FewShotPromptTemplate(
    examples=simple_neighs,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"],
)

few_shot_prompt = prompt.format(input="List all positions held by Mark Zuckerberg between 2000 and 2020. Please answer this question in the same format as the three examples above.")
print(few_shot_prompt)   # String to feed to model during evaluation

Question: List all positions Per Sandberg held from 2015 to 2018.
['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']

Question: List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']

Question: List all positions David M. O'Connell held from 2010 to 2020.
['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']

Question: List all positions held by Mark Zuckerberg between 2000 and 2020. Please answer this question in the same format as the three examples above.
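For readers who want to see the prompt layout without LangChain, the printed string above can be approximated in plain Python. This is a sketch under assumptions: `build_few_shot_prompt` is a hypothetical helper, and the blank-line separator is inferred from the printed output rather than taken from LangChain's source.

```python
def build_few_shot_prompt(examples, question, separator="\n\n"):
    """Join formatted examples and the final question into one prompt string."""
    # Each example mirrors the template "Question: {question}\n{answers}".
    blocks = [f"Question: {ex['question']}\n{ex['answers']}" for ex in examples]
    # The suffix carries the unanswered test question.
    blocks.append(f"Question: {question}")
    return separator.join(blocks)

demo_examples = [
    {
        "question": "List all positions Per Sandberg held from 2015 to 2018.",
        "answers": [
            "Minister of Fisheries (2015, 2016, 2017, 2018)",
            "Minister of Justice and Public Security (2017, 2018)",
        ],
    },
]
print(build_few_shot_prompt(
    demo_examples,
    "List all positions held by Mark Zuckerberg between 2000 and 2020.",
))
```

Note that the answers are rendered as a Python list literal, which is also the format the model is implicitly asked to imitate.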


## TLQA Evaluation on Test Set

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
import accelerate

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")  # torch_dtype is a model argument, not a tokenizer one
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")

# Uncomment the line below to run on GPU
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")

# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

In [None]:
K = 10   # TODO: Experiment with 3, 5, 7, 10
prompts_results_GT_dict = {'prompts': [], 'outputs': [], 'ground_truths': []}   # Store all results, prompts, and ground truths for all test items

# TODO: keep track of metrics

# Configure formatter that will format the few-shot examples into a string
example_prompt = PromptTemplate(
    input_variables=["question", "answers"], template="Question: {question}\n{answers}"
)

# Loop over all test items (1071 in total)
for i, item in enumerate(test_set):
    
    # For each test question, retrieve k neighbours (try 3, 5, 7, 10)
    test_question = item['question']
    neighs = knn.get_top_n_neighbours(sentence=test_question, data_emb=train_questions_emb, transfer_data=train_data, k=K)
    simple_neighs = simplify_dict_list(neighs)

    # Create the few-shot prompt template and feed to model
    prompt = FewShotPromptTemplate(
        examples=simple_neighs,
        example_prompt=example_prompt,
        suffix="Question: {input}",
        input_variables=["input"],
    )
    few_shot_prompt = prompt.format(input=f"{test_question} Please answer this question in the same format as the {K} examples above.")
    prompts_results_GT_dict['prompts'].append(few_shot_prompt)
    
    input_ids = tokenizer(few_shot_prompt, return_tensors="pt").input_ids
    # input_ids = tokenizer(few_shot_prompt, return_tensors="pt").input_ids.to("cuda")  # TODO: uncomment when using GPU
    outputs = model.generate(input_ids, max_length=200)
    model_answer = tokenizer.decode(outputs[0])
    prompts_results_GT_dict['outputs'].append(model_answer)
    prompts_results_GT_dict['ground_truths'].append(test_set[i]['final_answers'])
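With larger values of K the prompt can grow long, and it may be worth checking its length against the model's input budget before generation. The sketch below drops the least similar examples until the prompt fits. It is only an illustration: `render_prompt` and `fit_examples_to_budget` are hypothetical helpers, the 512-token default is an assumed budget (check the tokenizer's `model_max_length`), and the default whitespace counter is a crude stand-in for a real tokenizer.

```python
def render_prompt(examples, question):
    """Render examples plus the final question, separated by blank lines."""
    blocks = [f"Question: {ex['question']}\n{ex['answers']}" for ex in examples]
    blocks.append(f"Question: {question}")
    return "\n\n".join(blocks)

def fit_examples_to_budget(examples, question, max_tokens=512, count_tokens=None):
    """Drop trailing examples until the rendered prompt fits max_tokens.

    Assumes `examples` is ordered from most to least similar, so the
    least similar neighbours are removed first.
    """
    if count_tokens is None:
        # Stand-in counter: whitespace-separated words, not real model tokens.
        count_tokens = lambda text: len(text.split())
    kept = list(examples)
    while kept and count_tokens(render_prompt(kept, question)) > max_tokens:
        kept.pop()
    return kept

demo = [{"question": "a b", "answers": ["x"]} for _ in range(3)]
print(len(fit_examples_to_budget(demo, "q?", max_tokens=10)))
```

In the notebook, a tokenizer-based counter such as `lambda text: len(tokenizer(text).input_ids)` could be passed as `count_tokens` instead of the whitespace default.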


Here we inspect individual results:

In [54]:
idx_to_inspect = 3
# print(prompts_results_GT_dict['prompts'][idx_to_inspect], '\n\n')
test = prompts_results_GT_dict['outputs'][idx_to_inspect]
print(prompts_results_GT_dict['outputs'][idx_to_inspect], '\n\n')
# print(prompts_results_GT_dict['ground_truths'][idx_to_inspect], '\n\n')

<pad> ['St. Louis Rams', 'St. Louis Rams', 'St. Louis Rams', 'St. Louis Rams', 'St. Louis Rams', 'St. Louis Rams', 'St. Louis Rams']</s> 
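The decoded output above still carries the `<pad>` and `</s>` markers and repeats the same entry several times. One way to avoid the markers at the source is to pass `skip_special_tokens=True` to `tokenizer.decode`. Alternatively, the string can be cleaned afterwards, as in the sketch below; `clean_generation` and `parse_answer_list` are hypothetical helpers and are not the notebook's own post-processing.

```python
import ast

SPECIAL_TOKENS = ("<pad>", "</s>")  # markers visible in the decoded output

def clean_generation(text):
    """Remove special-token markers and surrounding whitespace."""
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    return text.strip()

def parse_answer_list(text):
    """Parse a generation into a de-duplicated list of answer strings."""
    cleaned = clean_generation(text)
    try:
        parsed = ast.literal_eval(cleaned)
    except (ValueError, SyntaxError):
        parsed = None
    if not isinstance(parsed, (list, tuple)):
        # Not a list literal: fall back to a single free-text answer.
        parsed = [cleaned] if cleaned else []
    # Drop repeats while preserving first-seen order.
    return list(dict.fromkeys(str(item) for item in parsed))

raw = "<pad> ['St. Louis Rams', 'St. Louis Rams', 'St. Louis Rams']</s>"
print(parse_answer_list(raw))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to Python literals, which matters when the input is model-generated text.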




In [None]:
# TODO: Remove the <pad> and </s> markers when evaluating
from compute_metrics import extract_between_tags
raw_ans = extract_between_tags(test)
print(raw_ans)

# TODO: EM: exact match (very strict)
# TODO: F1: word overlap between the labeled and the predicted answer
# TODO: Completeness: Recall = # correct answers / # GT answers 

# TODO: TimeMetric: TimePrecision = # correct time ranges / # time ranges given
# TODO: EntityMetric: EntityPrecision = # correct entities / # entities given
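The metrics listed in the TODOs could be sketched roughly as follows. This is one possible reading, not a reference implementation: all function names are hypothetical, entities are compared case-insensitively, "correct answers" for completeness are counted at the entity level, and time precision is computed over individual (entity, year) pairs rather than over ranges.

```python
import re
from collections import Counter

def parse_answers(answers):
    """Map 'Entity (y1, y2, ...)' strings to {lowercased entity: set of years}."""
    parsed = {}
    for answer in answers:
        match = re.match(r"^(.+?)\s*\(([\d,\s]*)\)\s*$", answer.strip())
        if match:
            entity = match.group(1)
            years = {int(y) for y in match.group(2).split(",") if y.strip()}
        else:
            entity, years = answer.strip(), set()
        parsed.setdefault(entity.lower(), set()).update(years)
    return parsed

def exact_match(pred, gold):
    """1.0 only if entities and years agree exactly."""
    return float(parse_answers(pred) == parse_answers(gold))

def token_f1(pred, gold):
    """Word-overlap F1 between the joined predicted and gold answers."""
    pred_tokens = " ".join(pred).lower().split()
    gold_tokens = " ".join(gold).lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def entity_precision(pred, gold):
    """Correct predicted entities / predicted entities."""
    p, g = set(parse_answers(pred)), set(parse_answers(gold))
    return len(p & g) / len(p) if p else 0.0

def completeness(pred, gold):
    """Recall: correct predicted entities / gold entities."""
    p, g = set(parse_answers(pred)), set(parse_answers(gold))
    return len(p & g) / len(g) if g else 0.0

def time_precision(pred, gold):
    """Correct (entity, year) pairs / predicted (entity, year) pairs."""
    p = {(e, y) for e, ys in parse_answers(pred).items() for y in ys}
    g = {(e, y) for e, ys in parse_answers(gold).items() for y in ys}
    return len(p & g) / len(p) if p else 0.0

gold = ["REFER (2010, 2011)", "Infraestruturas de Portugal (2015, 2016)"]
pred = ["REFER (2010, 2012)"]
print(exact_match(pred, gold), entity_precision(pred, gold),
      completeness(pred, gold), time_precision(pred, gold))
```

Separating entity and time scores shows whether a model fails on who or what the answer is, or only on when it applies.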