## Few-Shot TLQA with FlanT5

Below, we run the generative models FlanT5-large and FlanT5-XL in a few-shot setting, selecting demonstration examples from the training set of TLQA, which was derived from TempLAMA.
- `flan-t5-large`: 780M params, 1 GB mem
- `flan-t5-xl`: 3B params, 12 GB mem

In [1]:
!pip install transformers==4.37.2 optimum==1.12.0 --quiet
!pip install sentence-transformers
!pip install langchain-core



### Running FlanT5-Large and FlanT5-XL (zero-shot)

In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")  # torch_dtype is a model argument, not a tokenizer one
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")

# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

input_text = "List all Michael Jackson albums between 2000 and 2009."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
# input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))



<pad> Michael Jackson: The Greatest Hits</s>


### Dataset Exploration

In [15]:
import utils
import importlib
importlib.reload(utils)

train_set = utils.json_to_list("../data/train_TLQA.json")
test_set = utils.json_to_list("../data/test_TLQA.json")
print(len(train_set), '\n')
example = train_set[42]
for key in example.keys():
    print(f"{key}: {example[key]}")

3212 

question: List all entities that owned Linha de Évora from 2010 to 2020.
answers: ['REFER (2010, 2011, 2012, 2013, 2014, 2015)', 'Infrastructures of Portugal (2015, 2016, 2017, 2018, 2019, 2020)']
type: P127
subject: Linha de Évora
answer_ids: {'REFER': 'Q1411661', 'Infrastructures of Portugal': 'Q20730106'}
wikidata_ID: Q1826620
verified: True
subject_label: Linha de Évora
aliases: []
final_answers: ['REFER (2010, 2011, 2012, 2013, 2014, 2015)', 'Infraestruturas de Portugal (2015, 2016, 2017, 2018, 2019, 2020)']


### Few-Shot Example Generation using KNN Search

To build the few-shot prompt for a test example, we retrieve the training examples closest to it via KNN search in a continuous vector space. Any embedding model from sentence-transformers can supply the embeddings.
We reuse the KNN search from https://anonymous.4open.science/r/ICAT-47FC/code/2wikimultihopqa/get_chatgpt_skill_transfer_knn.py.

In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class KnnSearch:
    def __init__(self, data=None, num_trees=None, emb_dim=None):
        self.num_trees = num_trees
        self.emb_dim = emb_dim
        # NOTE: any embedding model from sentence-transformers can be used.
        # Load it once here instead of on every call to get_embeddings_for_data.
        self.model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

    def get_embeddings_for_data(self, data_ls):
        # Encode a single string or a list of strings into dense embeddings
        embeddings = self.model.encode(data_ls)
        return embeddings

    def get_top_n_neighbours(self, sentence, data_emb, transfer_data, k):
        """
        Retrieves the top k most similar questions for "sentence" based on cosine similarity from given embeddings "data_emb".

        Parameters:
        sentence (str): The input sentence to find similar questions for.
        data_emb (np.ndarray): The embeddings for the transfer questions.
        transfer_data (list): The list of transfer questions corresponding to data_emb.
        k (int): The number of top similar questions to retrieve.

        Returns:
        list: The top k entries of transfer_data, ordered from most to least similar to "sentence".
        """
        sent_emb = self.get_embeddings_for_data(sentence)
        top_questions = []

        # cosine_similarity returns shape (n, 1); flatten to one score per transfer question
        text_sims = cosine_similarity(data_emb, [sent_emb]).ravel().tolist()
        results_sims = zip(range(len(text_sims)), text_sims)
        sorted_similarities = sorted(results_sims, key=lambda x: x[1], reverse=True)  # Highest similarities first

        # NOTE: we only match based on questions, but include the full question-answer pair in resulting neighs
        for idx, _ in sorted_similarities[:k]:
            top_questions.append(transfer_data[idx])

        return top_questions
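
As a rough, dependency-free illustration of the ranking step in `get_top_n_neighbours`, the sketch below scores toy 2-D vectors by cosine similarity and keeps the top k. The helper names (`cosine`, `top_k_neighbours`) and the toy vectors are assumptions made for this example; real embeddings would come from the sentence-transformers model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_neighbours(query_emb, data_emb, items, k):
    """Return the k entries of items whose embeddings are closest to query_emb."""
    sims = [cosine(query_emb, emb) for emb in data_emb]
    # Sort indices by similarity, highest first
    ranked = sorted(range(len(items)), key=lambda i: sims[i], reverse=True)
    return [items[i] for i in ranked[:k]]

# Toy "embeddings" standing in for real sentence embeddings (hypothetical values)
toy_items = ["q0", "q1", "q2", "q3"]
toy_data = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.05]

print(top_k_neighbours(toy_query, toy_data, toy_items, k=2))
```

A full sort is fine for a few thousand training questions; for much larger pools, a partial selection or an approximate nearest-neighbour index would be the usual next step.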

In [5]:
# Extracts all questions from the (train) set used for getting neighbours
def get_transfer_questions(transfer_data):
    return [data["question"] for data in transfer_data]

In [6]:
knn = KnnSearch()

# Read the raw dataset into a list of dicts (one per question, including answers and metadata)
train_data = utils.json_to_list("../data/train_TLQA.json")
# Keep questions only to embed (to use in similarity metric)
train_questions = get_transfer_questions(train_data)
train_questions_emb = knn.get_embeddings_for_data(train_questions)



In [7]:
# TODO: loop over test samples during/before inference time to create few-shot prompts
# question = "List all employers Elon Musk worked for from 1990 to 2024."
question = "List all positions held by Mark Zuckerberg from 2000 to 2024."
neighs = knn.get_top_n_neighbours(sentence=question, data_emb=train_questions_emb, transfer_data=train_data, k=3)   # data_emb is embedded questions only, so we only match questions
for neigh in neighs:
    print(neigh['question'])
    print(neigh['answers'])
    print('\n')

List all positions Per Sandberg held from 2015 to 2018.
['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']


List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']


List all positions David M. O'Connell held from 2010 to 2020.
['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']




### Setting up Few-Shot Prompt Template using LangChain

In [8]:
def simplify_dict_list(dict_list):
    return [{'question': item['question'], 'answers': item['answers']} for item in dict_list]

simple_neighs = simplify_dict_list(neighs)
print(simple_neighs)

[{'question': 'List all positions Per Sandberg held from 2015 to 2018.', 'answers': ['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']}, {'question': 'List all positions Mark Drakeford held from 2013 to 2020.', 'answers': ['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']}, {'question': "List all positions David M. O'Connell held from 2010 to 2020.", 'answers': ['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']}]


First, we configure a formatter that renders each few-shot example as a string. This formatter must be a PromptTemplate object.

In [9]:
from langchain_core.prompts.few_shot import FewShotPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate

example_prompt = PromptTemplate(
    input_variables=["question", "answers"], template="Question: {question}\n{answers}"
)

print(example_prompt.format(**simple_neighs[1]))
# print(example_prompt.format(question=f"{simple_neighs[0]['question']}", answers=f"{simple_neighs[0]['answers']}"))

Question: List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']


Now we pass the examples and the formatter to FewShotPromptTemplate.

In [10]:
prompt = FewShotPromptTemplate(
    examples=simple_neighs,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"],
)

few_shot_prompt = prompt.format(input="List all positions held by Mark Zuckerberg between 2000 and 2020. Please answer this question in the same format as the three examples above.")
print(few_shot_prompt)   # String to feed to model during evaluation

Question: List all positions Per Sandberg held from 2015 to 2018.
['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']

Question: List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']

Question: List all positions David M. O'Connell held from 2010 to 2020.
['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']

Question: List all positions held by Mark Zuckerberg between 2000 and 2020. Please answer this question in the same format as the three examples above.
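
For readers who want to see what the template does without LangChain, the prompt printed above can be reproduced with plain string formatting. This is a minimal sketch, assuming the default example separator (a blank line) and an empty prefix; `build_few_shot_prompt` is a hypothetical helper, not part of LangChain.

```python
def build_few_shot_prompt(examples, question, separator="\n\n"):
    """Format each example as 'Question: ...' plus its answers, then append the new question."""
    # The answers list is rendered with its default string form, as in the template above
    blocks = [f"Question: {ex['question']}\n{ex['answers']}" for ex in examples]
    blocks.append(f"Question: {question}")
    return separator.join(blocks)

# Toy examples (hypothetical data in the same shape as simple_neighs)
toy_examples = [
    {"question": "List all positions A held from 2010 to 2011.", "answers": ["Mayor (2010, 2011)"]},
    {"question": "List all positions B held from 2012 to 2012.", "answers": ["Senator (2012)"]},
]
print(build_few_shot_prompt(toy_examples, "List all positions C held from 2000 to 2020."))
```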


## TLQA Evaluation on Test Set

In [11]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
# TODO: uncomment when using GPU
# import accelerate

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")  # torch_dtype is a model argument, not a tokenizer one
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")

# Uncomment the line below to run on GPU
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")

# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")



In [27]:
K = 10   # TODO: Experiment with 3, 5, 7, 10
# TODO: Compute metrics over these results later
# TODO: Store in Huggingface Dataset
results_GT_dict = {'prompts': [], 'outputs': [], 'output_tokens': [], 
                            'ground_truths': [], 'ground_truth_tokens': []}  

# Configure formatter that will format the few-shot examples into a string
example_prompt = PromptTemplate(
    input_variables=["question", "answers"], template="Question: {question}\n{answers}"
)

# Loop over the test set (1071 items in total)
for i, item in enumerate(test_set):

    # NOTE: debugging limit; remove these two lines to evaluate on the full test set
    if i == 3:
        break
    
    # For each test question, retrieve k neighbours (try 3, 5, 7, 10)
    test_question = item['question']
    neighs = knn.get_top_n_neighbours(sentence=test_question, data_emb=train_questions_emb, transfer_data=train_data, k=K)
    simple_neighs = simplify_dict_list(neighs)

    # Create the few-shot prompt template and feed to model
    prompt = FewShotPromptTemplate(
        examples=simple_neighs,
        example_prompt=example_prompt,
        suffix="Question: {input}",
        input_variables=["input"],
    )
    few_shot_prompt = prompt.format(input=f"{test_question} Please answer this question in the same format as the {K} examples above.")
    results_GT_dict['prompts'].append(few_shot_prompt)
    
    # TODO: pre-tokenize input and GT to speed up inference
    input_ids = tokenizer(few_shot_prompt, return_tensors="pt").input_ids
    # input_ids = tokenizer(few_shot_prompt, return_tensors="pt").input_ids.to("cuda")  # TODO: uncomment when using GPU
    output_tokens = model.generate(input_ids, max_length=200)
    output = tokenizer.decode(output_tokens[0])

    results_GT_dict['output_tokens'].append(output_tokens[0])
    results_GT_dict['outputs'].append(output)
    results_GT_dict['ground_truths'].append(test_set[i]['final_answers'])
    gt_tokens = tokenizer(str(test_set[i]['final_answers']), return_tensors="pt").input_ids
    results_GT_dict['ground_truth_tokens'].append(gt_tokens)




tensor([[  784,    31, 22081,   989,   907,   377,     5,   254,     5,    41,
         14926,     6,  8558,  1673,    61,    31,     6,     3,    31, 14337,
          1926,   545,   377,     5,   254,     5,    41, 12172,     6,  2038,
            61,    31,     6,     3,    31,   254,    60,  1123, 10457,     9,
           377,     5,   254,     5,    41, 11138,     6,  1412,     6,  1230,
            61,    31,     6,     3,    31, 14714,  3833,    15,   377,     5,
           254,     5,    41,  8651,     6,  5123,  4791,  4323,  7887,  6503,
            61,    31,   908,     1]])
['Southend United F.C. (2010, 2011, 2012)', 'Stevenage F.C. (2012, 2013)', 'Crewe Alexandra F.C. (2013, 2014, 2015)', 'Port Vale F.C. (2015, 2016, 2017, 2018, 2019, 2020)']</s>




tensor([[  784,    31,  7855,   526,  3271,    13, 11897, 26118,    31,     6,
             3,    31, 25171,     3, 16911,  5923,  3271,    13, 11897, 26118,
            31,     6,    96, 24337,    31,     7,     3, 16911,    13, 11897,
            41, 12172,     6,  7218,  1412,     6,  1230,    61,  1686,     3,
            31,   254, 19176,   348,    13,     8,   781,   157, 23304,    29,
             9,  6324,     9,    41, 10218,     6,  1230,     6,  5123,  4791,
          4323,  7887,  6503,    61,    31,     6,     3,    31,   345, 15704,
            13, 11897, 24373,    31,   908,     1]])
['Prime Minister of Ukraine (2010)', 'First Deputy Prime Minister of Ukraine (2010)', "People's Deputy of Ukraine (2012, 2013, 2014, 2015)", 'Chairman of the Verkhovna Rada (2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'President of Ukraine (2014)']</s>




tensor([[  784,    31,  5231,  4703,   343,  3450,    41, 14926,     6,  8558,
          1673,     6,  7218,  1412,     6,  1230,     6,  5123,  4791,  4323,
          1360,    61,    31,     6,     3,    31, 13431,   427,    41,   196,
            60,    40,   232,    61,    41,  8584,     6,  6503,    61,    31,
           908,     1]])
['Socialist Party (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)', 'RISE (Ireland) (2019, 2020)']</s>
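
The decoded strings still carry special tokens such as `</s>`, and they mix single- and double-quoted items (for example "People's Deputy of Ukraine"), which a quote-based regex can split incorrectly. One possible way to turn such a string into a Python list is sketched below; `parse_answer_list` is a hypothetical helper and is not the `utils.extract_between_tags` function used later, whose implementation is not shown here.

```python
import ast
import re

SPECIAL_TOKENS = ("<pad>", "</s>")
# Fallback pattern: text without quotes or square brackets, ending in "(YYYY, YYYY, ...)"
ANSWER_PATTERN = re.compile(r"[^'\"\[\]]+?\(\d{4}(?:,\s*\d{4})*\)")

def parse_answer_list(decoded):
    """Convert a decoded model output into a list of 'Entity (years)' strings."""
    text = decoded
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    text = text.strip()
    try:
        # Well-formed outputs are valid Python list literals
        parsed = ast.literal_eval(text)
        if isinstance(parsed, (list, tuple)):
            return [str(item) for item in parsed]
    except (ValueError, SyntaxError):
        pass
    # Malformed or truncated output: recover what we can with the regex.
    # Limitation: entities containing an apostrophe are cut at the apostrophe here.
    return [match.strip() for match in ANSWER_PATTERN.findall(text)]

print(parse_answer_list("<pad> ['A (2010, 2011)', \"B's office (2012)\"]</s>"))
```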


Here we inspect individual results:

In [23]:
idx_to_inspect = 1
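# len() of the dict would count its keys, so count the stored outputs instead
print(f"Total predictions made: {len(results_GT_dict['outputs'])}\n")

# NOTE: this rebinds example_prompt (previously the PromptTemplate) to a plain string
example_prompt = results_GT_dict['prompts'][idx_to_inspect]
example_output = results_GT_dict['outputs'][idx_to_inspect]
example_gt = results_GT_dict['ground_truths'][idx_to_inspect]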

print(f"\nPROMPT (idx = {idx_to_inspect}):\n{example_prompt}\n")
print(f"\nMODEL OUTPUT:\n{example_output}\n")
print(f"\nGROUND TRUTH:\n{example_gt}\n")

Total predictions made: 3


PROMPT (idx = 1):
Question: List all positions Venedykt Aleksiichuk, also known as Venedykt Aleksiychuk, held from 2010 to 2020.
['auxiliary bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'titular bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'diocesan bishop (2017, 2018, 2019, 2020)']

Question: List all positions Arseniy Yatsenyuk, also known as Arseniy Petrovych Yatsenyuk, held from 2010 to 2016.
["People's Deputy of Ukraine (2010, 2011, 2012, 2013, 2014)", 'Prime Minister of Ukraine (2014, 2015, 2016)']

Question: List all positions Mykola Azarov, also known as Mykola Yanovych Azarov, held from 2010 to 2014.
["People's Deputy of Ukraine (2010, 2012)", 'Prime Minister of Ukraine (2010, 2011, 2012, 2013, 2014)']

Question: List all positions Mikhail Mishustin, also known as Mikhail Vladimirovich Mishustin, held fr

Below, we test the utility functions that compute the metrics.
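
For orientation, the sketch below shows one possible reading of two of the metrics, following the definitions given in the cell comments (Recall = correct answers / ground-truth answers; TimePrecision = correct time ranges / time ranges given). The functions `split_answer`, `answer_recall` and `time_precision` are hypothetical stand-ins written for this example; the actual `compute_metrics` module may match or normalise answers differently.

```python
import re

# 'Entity (y1, y2, ...)' -> entity text and a comma-separated run of 4-digit years
ANSWER_RE = re.compile(r"^(?P<entity>.*?)\s*\((?P<years>\d{4}(?:,\s*\d{4})*)\)$")

def split_answer(answer):
    """Split 'Entity (y1, y2)' into (entity, (y1, y2)); years is empty if none are found."""
    match = ANSWER_RE.match(answer.strip())
    if match is None:
        return answer.strip(), ()
    years = tuple(int(year) for year in match.group("years").split(","))
    return match.group("entity"), years

def answer_recall(ground_truth, predictions):
    """Fraction of ground-truth answers reproduced exactly by the predictions."""
    if not ground_truth:
        return 0.0
    predicted = set(predictions)
    return sum(1 for gt in ground_truth if gt in predicted) / len(ground_truth)

def time_precision(ground_truth, predictions):
    """Fraction of predicted time ranges that occur as a full range in the ground truth."""
    if not predictions:
        return 0.0
    gt_ranges = {split_answer(answer)[1] for answer in ground_truth}
    pred_ranges = [split_answer(answer)[1] for answer in predictions]
    return sum(1 for years in pred_ranges if years in gt_ranges) / len(pred_ranges)

# Toy data (hypothetical)
toy_gt = ["A (2010)", "B (2012, 2013)"]
toy_pred = ["A (2010)", "C (2011)"]
print(answer_recall(toy_gt, toy_pred), time_precision(toy_gt, toy_pred))
```

Matching on whole year tuples rather than individual tokens follows the TODO note about matching time ranges on the full year.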

In [24]:
import compute_metrics
import utils
import importlib
import re
importlib.reload(compute_metrics)
importlib.reload(utils)

example_pred = utils.extract_between_tags(example_output)
example_pred_list = re.findall(r"'(.*?)'", example_pred)  # NOTE: we convert the output string into list to facilitate metrics computation
print(f"PRED: {example_pred_list}")
print(f"GROUND TRUTH: {example_gt}\n")

print(f"Pred time ranges: {compute_metrics.extract_time_ranges(example_pred_list)}")
print(f"GT time ranges: {compute_metrics.extract_time_ranges(example_gt)}")
print(f"Pred entities: {compute_metrics.extract_entities(example_pred_list)}")
print(f"GT entities: {compute_metrics.extract_entities(example_gt)}")

# TODO: evaluate on list of tokenized items (Entity-Time pairs) <-- add extra col to results Dataset

# ---------------------- Syntax-based metrics --------------------------

# example_pred_list.append('Prime Minister of Ukraine (2010)')

em = compute_metrics.compute_em(example_gt, example_pred_list)

# F1: word overlap between the labeled and the predicted answer
# TODO: For both precision and recall, match on TOKENS per Entity-Time pair and average
f1 = compute_metrics.compute_f1_score(example_gt, example_pred_list)

# Completeness: Recall = # correct answers / # GT answers 
recall = compute_metrics.compute_recall(example_gt, example_pred_list)

# TimePrecision = # correct time ranges / # time ranges given
# TODO: Precision_time: over time ranges (NOT tokenized, as we want to match on the full year)
time_precision = compute_metrics.compute_time_precision_score(example_gt, example_pred_list)

# EntityPrecision = # correct entities / # entities given
# entity_precision = compute_metrics.compute_entity_precision_score(example_gt, example_pred_list)

# TODO: BLEU_entity: over entity items (tokenized and concatenated) -> use sentence_bleu or corpus_bleu
# It calculates a precision score for each n-gram size (typically 1 to 4) and then computes a geometric mean of these precisions.
# entity_bleu_score = corpus_bleu(example_gt, example_pred_list)
# print(entity_bleu_score)

# ---------------------- Semantics-based metrics  (only for entities) --------------------------

# TODO: BertScore_entity (or METEOR): over entity items (tokenized and concatenated)



print(f"\nEM: {em}")                              
print(f"F1: {f1}")                                  
print(f"Recall: {recall}")                           
print(f"Time precision: {time_precision}")          


PRED: ['Deputy Prime Minister of Ukraine (2010, 2011)', 'Deputy Chairman of the Board of the National Bank of Ukraine (2010, 2011)', 'Deputy Chairman of the Board of the National Bank of Ukraine (2010, 2011)', 'Deputy Chairman of the Board of the National Bank of Ukraine (2010, 2011)']
GROUND TRUTH: ['Prime Minister of Ukraine (2010)', 'First Deputy Prime Minister of Ukraine (2010)', "People's Deputy of Ukraine (2012, 2013, 2014, 2015)", 'Chairman of the Verkhovna Rada (2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'President of Ukraine (2014)']

Pred time ranges: ['(2010, 2011)', '(2010, 2011)', '(2010, 2011)', '(2010, 2011)']
GT time ranges: ['(2010)', '(2010)', '(2012, 2013, 2014, 2015)', '(2014, 2015, 2016, 2017, 2018, 2019, 2020)', '(2014)']
Pred entities: ['Deputy Prime Minister of Ukraine', 'Deputy Chairman of the Board of the National Bank of Ukraine', 'Deputy Chairman of the Board of the National Bank of Ukraine', 'Deputy Chairman of the Board of the National Bank of Ukraine']
G