## Few-Shot TLQA with FlanT5

Below, we run the generative models FlanT5-large and FlanT5-XL in a few-shot setting, selecting demonstration examples from the training set of TLQA, which was derived from TempLAMA.
- `flan-t5-large`: 780M params, 1 GB mem
- `flan-t5-xl`: 3B params, 12 GB mem

In [4]:
# !pip install transformers==4.37.2 optimum==1.12.0 --quiet
# !pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --quiet
# !pip install langchain==0.1.9 --quiet
# # !pip install chromadb
# !pip install sentence_transformers==2.4.0 --quiet
# !pip install unstructured --quiet
# !pip install pdf2image --quiet
# !pip install pdfminer.six==20221105 --quiet
# !pip install unstructured-inference --quiet
# !pip install faiss-gpu==1.7.2 --quiet
# !pip install pikepdf==8.13.0 --quiet
# !pip install pypdf==4.0.2 --quiet
# !pip install pillow_heif==0.15.0 --quiet

!pip install transformers==4.37.2 optimum==1.12.0 --quiet
!pip install sentence-transformers
!pip install langchain-core
!pip install bert-score



#### Running FlanT5-Large and FlanT5-XL (zero-shot)

In [1]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

# Fall back to CPU when CUDA is unavailable; calling .to('cuda') on a CPU-only
# PyTorch build raises "AssertionError: Torch not compiled with CUDA enabled"
device = "cuda" if torch.cuda.is_available() else "cpu"

# NOTE: torch_dtype is a model argument, not a tokenizer argument, so it is not passed here
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large").to(device)
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")

# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

input_text = "List all Michael Jackson albums between 2000 and 2009."
# The input tensor must live on the same device as the model
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

### Dataset Exploration

In [2]:
import utils
import importlib
importlib.reload(utils)

train_set = utils.json_to_list("../data/train_TLQA.json")
test_set = utils.json_to_list("../data/test_TLQA.json")
print(len(train_set), '\n')
example = train_set[45]
for key in example.keys():
    print(f"{key}: {example[key]}")

3212 

question: List all entities that owned Absolute Radio, also known as Virgin 1215, from 2010 to 2020.
answers: ['The Times Group (2010, 2011, 2012, 2013)', 'Bauer Radio (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']
type: P127
subject: Absolute Radio
answer_ids: {'The Times Group': 'Q972645', 'Bauer Radio': 'Q4873460'}
wikidata_ID: Q4590187
verified: True
subject_label: Absolute Radio
aliases: ['Virgin 1215', 'Virgin Radio', 'Virgin 105.8']
final_answers: ['The Times Group (2010, 2011, 2012, 2013)', 'Bauer Radio (2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']


### Few-Shot Example Generation using KNN Search

For each test question, we retrieve the k training examples whose questions are closest to it in embedding space. Questions are embedded with a sentence-transformers model (any model from that library can be used) and ranked by cosine similarity.
We reuse the KNN search from https://anonymous.4open.science/r/ICAT-47FC/code/2wikimultihopqa/get_chatgpt_skill_transfer_knn.py.
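
As a minimal illustration of the ranking step, the sketch below picks the top-k rows by cosine similarity using NumPy only. The helper name `top_k_by_cosine` and the 2-D vectors are hypothetical and stand in for real sentence embeddings; this is not the code from the linked repository.

```python
import numpy as np

def top_k_by_cosine(query_vec, corpus_vecs, k):
    """Return the indices of the k corpus rows most similar to query_vec."""
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    # Cosine similarity is the dot product divided by the product of the norms
    sims = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    # argsort sorts ascending, so negate to rank the highest similarity first
    return np.argsort(-sims)[:k].tolist()

# Toy 2-D "embeddings" (made-up values, for illustration only)
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_by_cosine([1.0, 0.1], corpus, k=2))  # [0, 2]
```

Row 0 points almost in the same direction as the query, row 2 is the next closest, and row 1 is nearly orthogonal, so it is left out for k=2.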

In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class KnnSearch:
    def __init__(self, data=None, num_trees=None, emb_dim=None):
        self.num_trees = num_trees
        self.emb_dim = emb_dim
        # NOTE: any embedding model from sentence-transformers can be used.
        # Load it once here rather than on every call to get_embeddings_for_data.
        self.model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

    def get_embeddings_for_data(self, data_ls):
        return self.model.encode(data_ls)

    def get_top_n_neighbours(self, sentence, data_emb, transfer_data, k):
        """
        Retrieves the k entries of "transfer_data" whose questions are most similar to "sentence", based on cosine similarity against the given embeddings "data_emb".

        Parameters:
        sentence (str): The input sentence to find similar questions for.
        data_emb (np.ndarray): The embeddings for the transfer questions.
        transfer_data (list): The list of transfer questions corresponding to data_emb.
        k (int): The number of top similar questions to retrieve.

        Returns:
        list: The top k entries of transfer_data, ordered from most to least similar.
        """
        sent_emb = self.get_embeddings_for_data(sentence)

        # cosine_similarity returns shape (n, 1); flatten to one score per transfer question
        text_sims = cosine_similarity(data_emb, [sent_emb]).ravel()
        sorted_similarities = sorted(enumerate(text_sims), key=lambda x: x[1], reverse=True)  # Obtain the highest similarities

        # NOTE: we only match based on questions, but include the full question-answer pair in resulting neighs
        return [transfer_data[idx] for idx, _ in sorted_similarities[:k]]


In [4]:
# Extracts all questions from the (train) set used for getting neighbours
def get_transfer_questions(transfer_data):
    return [data["question"] for data in transfer_data]

In [6]:
knn = KnnSearch()

# Read raw dataset into a list of dicts (question, answers and metadata)
train_data = utils.json_to_list("../data/train_TLQA.json")
# Keep questions only to embed (to use in similarity metric)
train_questions = get_transfer_questions(train_data)
train_questions_emb = knn.get_embeddings_for_data(train_questions)


In [7]:
# TODO: loop over test samples during/before inference time to create few-shot prompts
# question = "List all employers Elon Musk worked for from 1990 to 2024."
question = "List all positions held by Mark Zuckerberg from 2000 to 2024."
neighs = knn.get_top_n_neighbours(sentence=question, data_emb=train_questions_emb, transfer_data=train_data, k=3)   # data_emb is embedded questions only, so we only match questions
for neigh in neighs:
    print(neigh['question'])
    print(neigh['answers'])
    print('\n')

List all positions Per Sandberg held from 2015 to 2018.
['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']


List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']


List all positions David M. O'Connell held from 2010 to 2020.
['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']




### Setting up Few-Shot Prompt Template using LangChain

In [8]:
def simplify_dict_list(dict_list):
    return [{'question': item['question'], 'answers': item['answers']} for item in dict_list]

simple_neighs = simplify_dict_list(neighs)
print(simple_neighs)

[{'question': 'List all positions Per Sandberg held from 2015 to 2018.', 'answers': ['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']}, {'question': 'List all positions Mark Drakeford held from 2013 to 2020.', 'answers': ['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']}, {'question': "List all positions David M. O'Connell held from 2010 to 2020.", 'answers': ['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']}]


First, we configure a formatter that turns each few-shot example into a string. This formatter must be a PromptTemplate object.

In [9]:
from langchain_core.prompts.few_shot import FewShotPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate

example_prompt = PromptTemplate(
    input_variables=["question", "answers"], template="Question: {question}\n{answers}"
)

print(example_prompt.format(**simple_neighs[1]))
# print(example_prompt.format(question=f"{simple_neighs[0]['question']}", answers=f"{simple_neighs[0]['answers']}"))

Question: List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']


Now we feed examples and formatter to FewShotPromptTemplate.

In [10]:
prompt = FewShotPromptTemplate(
    examples=simple_neighs,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"],
)

few_shot_prompt = prompt.format(input="List all positions held by Mark Zuckerberg between 2000 and 2020. Please answer this question in the same format as the three examples above.")
print(few_shot_prompt)   # String to feed to model during evaluation

Question: List all positions Per Sandberg held from 2015 to 2018.
['Minister of Fisheries (2015, 2016, 2017, 2018)', 'Minister of Justice and Public Security (2017, 2018)']

Question: List all positions Mark Drakeford held from 2013 to 2020.
['Minister for Health and Social Services (2013, 2014, 2015, 2016)', 'First Minister of Wales (2018, 2019, 2020)']

Question: List all positions David M. O'Connell held from 2010 to 2020.
['diocesan bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)', 'Catholic bishop (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']

Question: List all positions held by Mark Zuckerberg between 2000 and 2020. Please answer this question in the same format as the three examples above.
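
For intuition, the prompt printed above can be approximated in plain Python: format every example, append the final question, and join the pieces with blank lines. This is only a sketch. The helper `build_few_shot_prompt` is hypothetical, and the blank-line separator is an assumption read off the printed output rather than taken from LangChain's source.

```python
def build_few_shot_prompt(examples, query, separator="\n\n"):
    """Format each example as 'Question: ...' plus its answers, then append the query."""
    # Formatting a list inside an f-string yields its repr, e.g. "['A (2010)', 'B (2011)']"
    blocks = [f"Question: {ex['question']}\n{ex['answers']}" for ex in examples]
    blocks.append(f"Question: {query}")
    return separator.join(blocks)

examples = [
    {
        "question": "List all positions Per Sandberg held from 2015 to 2018.",
        "answers": [
            "Minister of Fisheries (2015, 2016, 2017, 2018)",
            "Minister of Justice and Public Security (2017, 2018)",
        ],
    },
]
query = "List all positions held by Mark Zuckerberg between 2000 and 2020."
print(build_few_shot_prompt(examples, query))
```

Keeping the answers in their list form means the model sees the same bracketed layout that the ground truth uses, which is the format it is asked to imitate.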


## TLQA Evaluation on Test Set

In [11]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
# TODO: uncomment when using GPU
# import accelerate

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")  # torch_dtype is a model argument, not a tokenizer argument
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")

# Uncomment the line below to run on GPU
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto")

# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [76]:
from datasets import Dataset

K = 10   # TODO: Experiment with 3, 5, 7, 10
MAX_OUTPUT_LEN = 200
MODEL_NAME = "flan-t5-large"  # TODO: Set accordingly

results_GT_dict = {'prompts': [], 'outputs': [], 'output_tokens': [],
                   'ground_truths': [], 'ground_truth_tokens': []}

# Configure formatter that will format the few-shot examples into a string
example_prompt = PromptTemplate(
    input_variables=["question", "answers"], template="Question: {question}\n{answers}"
)

# Loop over all test items (1071 in total)
for i, item in enumerate(test_set):
    
    # For each test question, retrieve k neighbours (try 3, 5, 7, 10)
    test_question = item['question']
    neighs = knn.get_top_n_neighbours(sentence=test_question, data_emb=train_questions_emb, transfer_data=train_data, k=K)
    simple_neighs = simplify_dict_list(neighs)

    # Create the few-shot prompt template and feed to model
    prompt = FewShotPromptTemplate(
        examples=simple_neighs,
        example_prompt=example_prompt,
        suffix="Question: {input}",
        input_variables=["input"],
    )
    few_shot_prompt = prompt.format(input=f"{test_question} Please answer this question in the same format as the {K} examples above.")
    results_GT_dict['prompts'].append(few_shot_prompt)
    
    # TODO: pre-tokenize input and GT to speed up inference
    input_ids = tokenizer(few_shot_prompt, return_tensors="pt").input_ids
    # TODO: ---------------------- UNCOMMENT WHEN USING GPU -----------------------------
    # input_ids = tokenizer(few_shot_prompt, return_tensors="pt").input_ids.to("cuda")
    output_tokens = model.generate(input_ids, max_length=MAX_OUTPUT_LEN)
    output = tokenizer.decode(output_tokens[0])

    results_GT_dict['output_tokens'].append(output_tokens[0])
    results_GT_dict['outputs'].append(output)
    results_GT_dict['ground_truths'].append(test_set[i]['final_answers'])
    gt_tokens = tokenizer(str(test_set[i]['final_answers']), return_tensors="pt").input_ids[0]
    results_GT_dict['ground_truth_tokens'].append(gt_tokens)

results_ds = Dataset.from_dict(results_GT_dict)
results_ds.save_to_disk(f"{K}_shot_{MODEL_NAME}.hf")


Saving the dataset (1/1 shards): 100%|██████████| 5/5 [00:00<00:00, 1595.40 examples/s]


Here we load the results Dataset from disk and inspect individual results:

In [118]:
results_ds = Dataset.load_from_disk("../results/10_shot_flan-t5-large_test.hf")
print(results_ds)
idx_to_inspect = 2

prompt = results_ds['prompts'][idx_to_inspect]
output = results_ds['outputs'][idx_to_inspect]
output_tokens = results_ds['output_tokens'][idx_to_inspect]
gt_tokens = results_ds['ground_truth_tokens'][idx_to_inspect]
gt = results_ds['ground_truths'][idx_to_inspect]

# print(f"\nPROMPT (idx = {idx_to_inspect}):\n{prompt}\n")
print(f"\nMODEL OUTPUT:\n{output}\n\nOUTPUT TOKENS:\n{output_tokens}")
print(f"\nGROUND TRUTH:\n{gt}\n\nGT TOKENS:\n{gt_tokens}")

Dataset({
    features: ['prompts', 'outputs', 'output_tokens', 'ground_truths', 'ground_truth_tokens'],
    num_rows: 5
})

MODEL OUTPUT:
<pad> ['Teachta Dála (2010, 2012, 2013, 2014)', 'Labour Party (2010, 2011, 2012, 2013, 2014)', 'Labour Party (2010, 2011, 2012, 2013, 2014)', 'Labour Party (2010, 2011, 2012, 2013, 2014)']</s>

OUTPUT TOKENS:
[0, 784, 31, 382, 15, 9, 3997, 9, 309, 2975, 521, 41, 14926, 6, 1673, 6, 7218, 1412, 61, 31, 6, 3, 31, 18506, 1211, 3450, 41, 14926, 6, 8558, 1673, 6, 7218, 1412, 61, 31, 6, 3, 31, 18506, 1211, 3450, 41, 14926, 6, 8558, 1673, 6, 7218, 1412, 61, 31, 6, 3, 31, 18506, 1211, 3450, 41, 14926, 6, 8558, 1673, 6, 7218, 1412, 61, 31, 908, 1]

GROUND TRUTH:
['Socialist Party (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)', 'RISE (Ireland) (2019, 2020)']

GT TOKENS:
[784, 31, 5231, 4703, 343, 3450, 41, 14926, 6, 8558, 1673, 6, 7218, 1412, 6, 1230, 6, 5123, 4791, 4323, 1360, 61, 31, 6, 3, 31, 13431, 427, 41, 196, 60, 40, 232, 61, 41, 8584, 6,
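
The decoded model output above still carries the `<pad>` and `</s>` markers, and it is a string rather than a list. One possible way to recover a list of answers is sketched below. The helper `parse_model_output` is hypothetical and is not the notebook's `utils.extract_between_tags`; it tries `ast.literal_eval` first and falls back to a quote-based regex when the generation is cut off or malformed.

```python
import ast
import re

def parse_model_output(decoded):
    """Turn a decoded generation such as "<pad> ['A (2010)', 'B (2011)']</s>" into a list of strings."""
    # Drop the T5 special-token markers left in by tokenizer.decode
    text = re.sub(r"<pad>|</s>", "", decoded).strip()
    try:
        parsed = ast.literal_eval(text)
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
    except (ValueError, SyntaxError):
        pass
    # Fallback for malformed lists: keep whatever sits between pairs of single quotes
    return re.findall(r"'(.*?)'", text)

decoded = "<pad> ['Teachta Dála (2010, 2012, 2013, 2014)', 'Labour Party (2010, 2011, 2012, 2013, 2014)']</s>"
print(parse_model_output(decoded))
```

The markers can also be avoided at decoding time by passing `skip_special_tokens=True` to `tokenizer.decode`. Note that the regex fallback splits on apostrophes, so an entity containing one would be cut short in that path.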

Below we will test the utility functions that compute metrics:

In [119]:
import metrics
import utils
import importlib
import re
importlib.reload(metrics)
importlib.reload(utils)


gt_list = gt
gt_tokens_list = gt_tokens
pred_tokens_list = utils.remove_pad_tokens(output_tokens)
pred_list = utils.extract_between_tags(output)
pred_list = re.findall(r"'(.*?)'", pred_list)  # We convert the output string into list to facilitate metrics computation

print(f"Pred time ranges: {metrics.extract_time_ranges(pred_list)}")
print(f"GT time ranges: {metrics.extract_time_ranges(gt_list)}")
print(f"Pred entities: {metrics.extract_entities(pred_list)}")
print(f"GT entities: {metrics.extract_entities(gt_list)}")

# ---------------------- Syntax-based metrics --------------------------
em = metrics.compute_em(gt_list, pred_list)
f1 = metrics.compute_f1(gt_tokens_list, pred_tokens_list)
recall = metrics.compute_recall(gt_tokens_list, pred_tokens_list)
time_bleu = metrics.compute_time_bleu(gt_list, pred_list, tokenizer)  # NOTE: tends to be relatively high, since the time range is already hinted at in the prompt
# ---------------------- Semantics-based metrics --------------------------
entity_bert = metrics.compute_entity_bert(gt_list, pred_list)   # Uses BERT tokenizer

print(f"\nEM: {em}")                              
print(f"F1: {f1}")                      
print(f"Recall: {recall}")                           
print(f"Time BLEU: {time_bleu}")      
print(f"Entity BERT: {entity_bert}")    

Pred time ranges: ['(2010, 2012, 2013, 2014)', '(2010, 2011, 2012, 2013, 2014)', '(2010, 2011, 2012, 2013, 2014)', '(2010, 2011, 2012, 2013, 2014)']
GT time ranges: ['(2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)', '(2019, 2020)']
Pred entities: ['Teachta Dála', 'Labour Party', 'Labour Party', 'Labour Party']
GT entities: ['Socialist Party', 'RISE']

EM: 0.0
F1: 0.23636363636363636
Recall: 0.30952380952380953
Time BLEU: 0.7960980672538887
Entity BERT: 0.5890607833862305
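
The `metrics` module itself is not shown here. As a rough sketch of how an answer string could be split into an entity and a time range, consistent with the extraction results printed above (for example `'RISE (Ireland) (2019, 2020)'` yielding entity `RISE`), consider the hypothetical helper below. It assumes the entity is everything before the first ` (` and the time range is the first parenthesised list of four-digit years; the real `extract_entities` and `extract_time_ranges` may work differently.

```python
import re

# Matches a parenthesised list of four-digit years, e.g. "(2019, 2020)"
YEAR_LIST = re.compile(r"\(\d{4}(?:,\s*\d{4})*\)")

def split_answer(answer):
    """Split 'Entity (2010, 2011)' into ('Entity', '(2010, 2011)')."""
    match = YEAR_LIST.search(answer)
    # An empty string signals that no year list was found
    time_range = match.group(0) if match else ""
    # Everything before the first ' (' is treated as the entity name
    entity = answer.split(" (", 1)[0].strip()
    return entity, time_range

answers = [
    "Socialist Party (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)",
    "RISE (Ireland) (2019, 2020)",
]
for answer in answers:
    print(split_answer(answer))
```

Separating the two parts is what lets entities and time ranges be scored independently, as Time BLEU and Entity BERT do above.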
