## Few-Shot TLQA with FlanT5

Below, we run the generative models FlanT5-large and FlanT5-XL in a few-shot setting, selecting demonstration examples from the training set of TLQA, which was derived from TempLAMA.
- `flan-t5-large`: 780M params, 1 GB mem
- `flan-t5-xl`: 3B params, 12 GB mem

In [11]:
!pip install transformers==4.37.2 optimum==1.12.0 --quiet
!pip install sentence-transformers

Collecting sentence-transformers
Successfully installed sentence-transformers-2.7.0


#### Running FlanT5-Large and FlanT5-XL (zero-shot)

In [8]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

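# torch_dtype is a model-loading option, not a tokenizer option, so it is not passed to the tokenizer
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
# GPU variant in half precision:
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)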

# tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
# model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

input_text = "List all Michael Jackson albums between 2000 and 2009."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
# input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


<pad> Michael Jackson: The Greatest Hits</s>
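
The output above still contains the `<pad>` and `</s>` markers because `decode` keeps special tokens unless told otherwise, and it is likely cut short because `generate` falls back to a small default length budget when none is given. A minimal sketch of a helper that addresses both follows; the name `generate_answer` and the default of 64 new tokens are illustrative assumptions, not part of the original notebook.

```python
def generate_answer(model, tokenizer, prompt, max_new_tokens=64):
    """Generate a completion for `prompt` and decode it without special tokens.

    `generate_answer` is a hypothetical helper; `max_new_tokens=64` is an
    arbitrary budget chosen for illustration.
    """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # max_new_tokens bounds the number of generated tokens explicitly
    outputs = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # skip_special_tokens drops markers such as <pad> and </s> from the text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Usage, assuming `model` and `tokenizer` were loaded as in the cell above:
# print(generate_answer(model, tokenizer, "List all Michael Jackson albums between 2000 and 2009."))
```

A longer budget does not guarantee a complete or correct list; it only removes the length cap as a cause of truncation.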


#### Running FlanT5-XL on a CPU (zero-shot)

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")

input_text = "List all Michael Jackson albums between 1990 and 2009."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))

### Dataset Exploration

In [31]:
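import json

def json_to_list(data_path):
    """Load a JSON file containing a list of TLQA examples."""
    # The data contains non-ASCII names (e.g. "Linha de Évora"), so read it as UTF-8 explicitly
    with open(data_path, encoding="utf-8") as f:
        return json.load(f)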

train_set = json_to_list("../data/train_TLQA.json")
test_set = json_to_list("../data/test_TLQA.json")
print(len(train_set))
example = train_set[42]
for key in example.keys():
    print(f"{key}: {example[key]}")

3212
question: List all entities that owned Linha de Évora from 2010 to 2020.
answers: ['REFER (2010, 2011, 2012, 2013, 2014, 2015)', 'Infrastructures of Portugal (2015, 2016, 2017, 2018, 2019, 2020)']
type: P127
subject: Linha de Évora
answer_ids: {'REFER': 'Q1411661', 'Infrastructures of Portugal': 'Q20730106'}
wikidata_ID: Q1826620
verified: True
subject_label: Linha de Évora
aliases: []
final_answers: ['REFER (2010, 2011, 2012, 2013, 2014, 2015)', 'Infraestruturas de Portugal (2015, 2016, 2017, 2018, 2019, 2020)']
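
Each answer string in the example above packs an entity name and its list of years into one string. For evaluation or analysis it can help to separate the two. The sketch below assumes every answer follows the `<entity> (<year>, <year>, ...)` pattern seen in this example; `parse_answer` and `ANSWER_PATTERN` are hypothetical names, and other examples in the dataset may need a more lenient parser.

```python
import re

# Assumed format: "<entity> (<year>, <year>, ...)", as in the printed example.
ANSWER_PATTERN = re.compile(r"^(?P<entity>.+?)\s*\((?P<years>[\d,\s]+)\)$")

def parse_answer(answer):
    """Split one answer string into (entity, list of years).

    Strings that do not match the assumed format are returned unchanged
    with an empty year list.
    """
    match = ANSWER_PATTERN.match(answer.strip())
    if match is None:
        return answer.strip(), []
    years = [int(y) for y in match.group("years").split(",") if y.strip()]
    return match.group("entity"), years

answers = [
    "REFER (2010, 2011, 2012, 2013, 2014, 2015)",
    "Infrastructures of Portugal (2015, 2016, 2017, 2018, 2019, 2020)",
]
for entity, years in map(parse_answer, answers):
    print(entity, years)
```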


### Few-Shot Example Generation

For each test question, we pick few-shot demonstrations from the training set by KNN search: questions are embedded in a continuous vector space with an embedding model from sentence-transformers (any model can be used), and the training questions with the highest cosine similarity to the test question are returned.
We reuse the KNN search from https://anonymous.4open.science/r/ICAT-47FC/code/2wikimultihopqa/get_chatgpt_skill_transfer_knn.py.
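
To make the retrieval step concrete, here is a pure-Python sketch of top-k selection by cosine similarity on toy 2-D vectors. It only illustrates the computation; the notebook itself relies on sentence-transformers embeddings and scikit-learn, and the names `cosine` and `top_k_neighbours` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_neighbours(query_vec, candidate_vecs, k):
    """Return the indices of the k candidates most similar to the query, best first."""
    sims = [cosine(query_vec, c) for c in candidate_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order[:k]

# Toy stand-ins for question embeddings
toy_candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_neighbours([1.0, 0.1], toy_candidates, k=2))  # → [0, 2]
```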

In [32]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class KnnSearch:
    def __init__(self, data=None, num_trees=None, emb_dim=None):
        # data, num_trees and emb_dim are not used anywhere in this notebook
        self.num_trees = num_trees
        self.emb_dim = emb_dim
        # NOTE: any embedding model from sentence-transformers can be used.
        # Load it once here instead of on every embedding call.
        self.model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

    def get_embeddings_for_data(self, data_ls):
        """Embed a single string or a list of strings."""
        return self.model.encode(data_ls)

    def get_top_n_neighbours(self, sentence, data_emb, transfer_data, k):
        """
        Retrieves the k entries of "transfer_data" whose embeddings in "data_emb" are most similar to "sentence" by cosine similarity.

        Parameters:
        sentence (str): The input sentence to find similar questions for.
        data_emb (np.ndarray): The embeddings for the transfer questions, one row per entry of transfer_data.
        transfer_data (list): The examples corresponding row by row to data_emb.
        k (int): The number of top similar questions to retrieve.

        Returns:
        list: The top k entries of transfer_data, most similar first.
        """
        sent_emb = self.get_embeddings_for_data(sentence)
        # cosine_similarity expects 2D inputs; the result has shape (len(data_emb), 1)
        text_sims = cosine_similarity(data_emb, [sent_emb]).ravel()
        # Indices of the k highest similarities (a stable sort keeps the original order among ties)
        top_indices = np.argsort(-text_sims, kind="stable")[:k]

        # NOTE: we only match based on questions, but include the full question-answer pair in resulting neighs
        return [transfer_data[idx] for idx in top_indices]

In [33]:
def get_transfer_questions(transfer_data):
    """Extract the question string from every example."""
    return [data["question"] for data in transfer_data]

In [38]:
knn = KnnSearch()

# Read the raw training set into a list of example dicts (question, answers, and metadata)
train_data = json_to_list("../data/train_TLQA.json")
# Keep questions only to embed (to use in similarity metric)
train_questions = get_transfer_questions(train_data)
train_questions_emb = knn.get_embeddings_for_data(train_questions)

{'question': 'List all positions Michael McMahon held from 2010 to 2020.', 'answers': ['United States representative (2010, 2011)', 'district attorney (2015, 2016, 2017, 2018, 2019, 2020)'], 'type': 'P39', 'subject': 'Michael McMahon', 'answer_ids': {'United States representative': 'Q13218630', 'district attorney': 'Q653368'}, 'wikidata_ID': 'Q1928589', 'verified': True, 'subject_label': 'Michael McMahon', 'aliases': [], 'final_answers': ['United States representative (2010, 2011)', 'district attorney (2015, 2016, 2017, 2018, 2019, 2020)']}


In [44]:
question = "List all employers Elon Musk worked for from 1990 to 2024."
# data_emb holds embedded questions only, so matching is done on questions alone
neighs = knn.get_top_n_neighbours(sentence=question, data_emb=train_questions_emb, transfer_data=train_data, k=3)
for neigh in neighs:
    print(neigh['question'])
    print(neigh['answers'])
    print('\n')



List all employers Elon Musk, also known as Elon R. Musk, worked for from 2015 to 2020.
['OpenAI (2015, 2016, 2017, 2018, 2019)', 'The Boring Company (2016, 2017, 2018, 2019, 2020)', 'Neuralink (2016, 2017, 2018, 2019, 2020)']


List all employers Craig Silverstein worked for from 2010 to 2020.
['Google (2010, 2011, 2012)', 'Khan Academy (2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)']


List all employers Max Hattler worked for from 2012 to 2020.
['Royal College of Art (2012, 2013, 2014)', 'City University of Hong Kong (2014, 2015, 2016, 2017, 2018, 2019, 2020)']
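
The retrieved neighbours can then be turned into a few-shot prompt for the model. The sketch below shows one possible way to do this; the `Question:`/`Answer:` template, the `; ` separator, and the function names `format_demonstration` and `build_few_shot_prompt` are assumptions made for illustration, not the format used by the original notebook.

```python
def format_demonstration(example):
    """Render one training example as a question followed by its answers."""
    answers = "; ".join(example["answers"])
    return f"Question: {example['question']}\nAnswer: {answers}"

def build_few_shot_prompt(neighbours, question):
    """Concatenate the demonstrations and end with the unanswered test question."""
    demos = [format_demonstration(n) for n in neighbours]
    return "\n\n".join(demos + [f"Question: {question}\nAnswer:"])

# Toy neighbour in the same shape as the TLQA examples printed above
toy_neighbours = [
    {
        "question": "List all employers Craig Silverstein worked for from 2010 to 2020.",
        "answers": ["Google (2010, 2011, 2012)", "Khan Academy (2012, 2013)"],
    }
]
print(build_few_shot_prompt(toy_neighbours, "List all employers Elon Musk worked for from 1990 to 2024."))
```

With the variables from the cells above, the call would be `build_few_shot_prompt(neighs, question)`. Long prompts may exceed the model's input length, so the number of demonstrations `k` may need to stay small.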


