<a href="https://colab.research.google.com/github/patrikrac/NLP_SQuAD2.0/blob/main/few_shot_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! pip install transformers
! pip install sentencepiece
! pip install accelerate
! pip -q install hnswlib
! pip install sentence-transformers



In [2]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
generative_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type != 'cuda':
    raise RuntimeError('GPU device not found')

In [4]:
input_text = 'Translate the following sentence from Italian to English: "Amo la pizza"'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = generative_model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0])
print(output_text)

<pad> "I love pizza"</s>


In [5]:
# Download the dataset
print("Downloading the TRAIN dataset of SQuAD2.0")
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json

Downloading the TRAIN dataset of SQuAD2.0
--2024-01-13 10:05:22--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.108.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘train-v2.0.json’


2024-01-13 10:05:23 (104 MB/s) - ‘train-v2.0.json’ saved [42123633/42123633]



In [6]:
import pandas
dataframe = pandas.read_json("./train-v2.0.json")
print(f"Size of the dataset: {len(dataframe)} (i.e. categories of questions)")
dataframe.head()

Size of the dataset: 442 (i.e. categories of questions)


Unnamed: 0,version,data
0,v2.0,"{'title': 'Beyoncé', 'paragraphs': [{'qas': [{..."
1,v2.0,"{'title': 'Frédéric_Chopin', 'paragraphs': [{'..."
2,v2.0,{'title': 'Sino-Tibetan_relations_during_the_M...
3,v2.0,"{'title': 'IPod', 'paragraphs': [{'qas': [{'qu..."
4,v2.0,{'title': 'The_Legend_of_Zelda:_Twilight_Princ...


In [7]:
# Build a list of tuples (title, question entry, paragraph, answers).
# The question entry is the full `qas` dict, so it also carries `is_impossible`.
data_list = []
categories = []
paragraphs = []
questions = []

for _, row in dataframe.iterrows():
  categories.append(row["data"]["title"])
  for p in row["data"]["paragraphs"]:
    paragraphs.append(p["context"])
    for q in p["qas"]:
      questions.append(q["question"])
      data_list.append((row["data"]["title"], q, p["context"], q["answers"]))
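
The loop above walks the three nesting levels of the SQuAD 2.0 JSON (article, paragraph, question). As a minimal sketch, assuming a hand-written miniature record in the same shape (the `sample` data and the `flatten_squad` helper are illustrative and not part of this notebook), the same flattening can be tried without downloading the dataset:

```python
def flatten_squad(articles):
    """Flatten SQuAD-style articles into (title, qa_entry, context, answers) tuples."""
    rows = []
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                rows.append((article["title"], qa, paragraph["context"], qa["answers"]))
    return rows


# Hypothetical miniature record mimicking the SQuAD 2.0 layout.
sample = [{
    "title": "Pizza",
    "paragraphs": [{
        "context": "Pizza originated in Naples.",
        "qas": [
            {"question": "Where did pizza originate?", "is_impossible": False,
             "answers": [{"text": "Naples", "answer_start": 20}]},
            {"question": "Who invented pizza?", "is_impossible": True, "answers": []},
        ],
    }],
}]

rows = flatten_squad(sample)
for title, qa, context, answers in rows:
    print(title, "|", qa["question"], "|", [a["text"] for a in answers])
```

Keeping the whole `qa` entry in the tuple, rather than only the question text, is what later allows a lookup such as `data_list[i][1]["is_impossible"]`.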

In [8]:
print(len(data_list))

i = 3
print(questions[i])
print(data_list[i][2])
print(data_list[i][3])
print((data_list[i][3][0]['text']))
print(data_list[i][1]["is_impossible"])


130319
In what city and state did Beyonce  grow up? 
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
[{'text': 'Houston, Texas', 'answer_start': 166}]
Houston, Texas
False


In [9]:
import random

random.seed(42)

idx = random.choice(range(len(questions)))

question = questions[idx]

print(f'Question {idx}: {question}')

Question 83810: What century did Nasser rule in?


In [10]:
from sentence_transformers import SentenceTransformer, CrossEncoder

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [11]:
import os
import pickle

# Define embeddings cache path
embeddings_cache_path = './qa_embeddings_cache.pkl'

# Load cache if available
if os.path.exists(embeddings_cache_path):
    print('Loading embeddings cache')
    with open(embeddings_cache_path, 'rb') as f:
        corpus_embeddings = pickle.load(f)
# Else compute embeddings
else:
    print('Computing embeddings')
    corpus_embeddings = semb_model.encode(paragraphs, convert_to_tensor=True, show_progress_bar=True)
    # Save the embeddings to a file for future loading
    print(f'Saving embeddings to: \'{embeddings_cache_path}\'')
    with open(embeddings_cache_path, 'wb') as f:
        pickle.dump(corpus_embeddings, f)

Computing embeddings


Batches:   0%|          | 0/595 [00:00<?, ?it/s]

Saving embeddings to: './qa_embeddings_cache.pkl'
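
The cell above follows a load-or-compute pattern: reuse the pickled embeddings when the file exists, otherwise compute and save them. A minimal sketch of that pattern as a reusable helper is given below; `load_or_compute` and `expensive` are hypothetical names introduced here, and a small list stands in for the real embedding tensor.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Return the pickled object at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive():
    # Stand-in for semb_model.encode(...); records each invocation.
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.pkl")
    first = load_or_compute(path, expensive)   # computes and writes the file
    second = load_or_compute(path, expensive)  # reads the file, no recomputation

print(len(calls), first == second)
```

Note that `pickle.load` can execute arbitrary code, so a cache file should only be loaded if it comes from a trusted source.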


In [12]:
import os
import hnswlib
import time
start = time.time()
# Create empty index
index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

# Define hnswlib index path
index_path = './qa_hnswlib_100.index'

# Load index if available
if os.path.exists(index_path):
    print('Loading index...')
    index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print('Start creating HNSWLIB index')
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=100, M=64) # see https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md for parameter description
    # Compute the HNSWLIB index (it may take a while)
    index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print(f'Saving index to: {index_path}')
    index.save_index(index_path)

end = time.time()
elapsed = int(end - start)
print(f"Execution time: {elapsed // 60}:{elapsed % 60:02d} min:sec")

Start creating HNSWLIB index
Saving index to: ./qa_hnswlib_100.index
Execution time: 0:03 min:sec
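
For intuition about what the index returns, the sketch below performs the exact (brute-force) version of the search that HNSW approximates. It uses the cosine distance `1 - cosine similarity`, which is what hnswlib's `'cosine'` space computes. The helper names and the toy 2-D vectors are assumptions made for illustration only.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def exact_knn(query, vectors, k):
    """Return the ids of the k vectors closest to query, nearest first."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]


# Toy 2-D "embeddings"; the real ones from multi-qa-MiniLM-L6-cos-v1 have 384 dimensions.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
neighbours = exact_knn([2.0, 0.1], vectors, k=2)
print(neighbours)
```

The exact search compares the query against every vector, which is why an approximate index is preferred once the corpus grows; `ef_construction` and `M` trade index build time and memory for recall.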


In [23]:
import numpy as np

def get_paragraphs(semb_model, xenc_model, question, topp):
  """
  Retrieve the 128 nearest paragraphs for the question from the HNSW index,
  re-rank them with the cross-encoder, and return the positions (within
  `corpus_ids[0]`) of the `topp` best-scoring ones together with `corpus_ids`.
  """
  # Stage 1: approximate nearest-neighbour search on the question embedding
  question_embedding = semb_model.encode(question, convert_to_tensor=True)
  corpus_ids, _ = index.knn_query(question_embedding.cpu(), k=128)

  # Stage 2: score every (question, paragraph) pair with the cross-encoder
  model_inputs = [(question, paragraphs[idx]) for idx in corpus_ids[0]]
  cross_scores = xenc_model.predict(model_inputs)

  return np.argsort(-cross_scores)[:topp], corpus_ids
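
One detail of the function above is easy to miss: the first return value holds positions into the candidate list, not corpus ids, so the caller still has to map them back through `corpus_ids[0]`. The sketch below illustrates that mapping with made-up ids and scores; the `rerank` helper is hypothetical and performs the mapping itself.

```python
def rerank(candidate_ids, scores, top_k):
    """Return the top_k candidate ids ordered by descending re-ranker score."""
    # Positions into candidate_ids, best score first.
    order = sorted(range(len(scores)), key=lambda pos: scores[pos], reverse=True)
    return [candidate_ids[pos] for pos in order[:top_k]]


# Hypothetical output of a first-stage search: corpus ids, nearest first.
candidate_ids = [512, 7, 1930, 44]
# Hypothetical cross-encoder scores, aligned with candidate_ids.
scores = [0.2, 3.1, -1.4, 2.5]

best = rerank(candidate_ids, scores, top_k=2)
print(best)
```

Here the best-scoring passage is corpus id 7 even though it sat at position 1 of the candidates, which shows why positions and ids must not be confused.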

In [35]:
import random
print(dataframe.shape)
print(dataframe.index)
print(dataframe.columns)
print(dataframe.loc[0, 'data']['title'])
def qa_pipeline(
    question,
    similarity_model=semb_model,
    embeddings_index=index,
    re_ranking_model=xenc_model,
    generative_model=generative_model,
    device=device,
    shots=0,
    top_p=1
):
    if not question.endswith('?'):
        question = question + '?'
    # Retrieve candidate passages and re-rank them;
    # top_p_idx holds positions into corpus_ids[0], best match first
    top_p_idx, corpus_ids = get_paragraphs(similarity_model, re_ranking_model, question, top_p)
    # Build the few-shot prefix (stays empty when shots=0)
    input_text = ""
    for i in range(shots):
      # Sample one answerable example
      idx = random.choice(range(len(questions)))
      while data_list[idx][1]['is_impossible']:
        idx = random.choice(range(len(questions)))
      quest = data_list[idx][1]['question']
      passage = data_list[idx][2]
      # Use the first listed answer as the demonstration target
      answer = data_list[idx][3][0]['text']
      input_text += f"passage:\n{passage}\n\nquestion:\n{quest}\n\n{answer}\n\n\n"
      # Sample one unanswerable example
      while not data_list[idx][1]['is_impossible']:
        idx = random.choice(range(len(questions)))
      quest = data_list[idx][1]['question']
      passage = data_list[idx][2]
      input_text += f"passage:\n{passage}\n\nquestion:\n{quest}\n\nI do not know the answer\n\n\n"

    possible_answer = ""
    for passage_idx in top_p_idx:
      temp_input_text = input_text
      passage = paragraphs[corpus_ids[0][passage_idx]]
      temp_input_text += f"Given the following passage, answer the related question.\n\nPassage:\n\n{passage}\n\nQ: {question}"
      #print('INPUT TEXT:', temp_input_text, "\n")
      input_ids = tokenizer(temp_input_text, return_tensors="pt").input_ids.to(device)
      # Generate output
      output_ids = generative_model.generate(input_ids, max_new_tokens=512)
      # Decode output
      output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
      possible_answer += output_text + "\n"


    input_text = f"Given the question {question}, choose the best answer between\n{possible_answer}"
    print('INPUT TEXT:', input_text, "\n")
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
    # Generate output
    output_ids = generative_model.generate(input_ids, max_new_tokens=512)
    # Decode output
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return output_text

print(dataframe.loc[0, 'data']['paragraphs'][0].keys())


(442, 2)
RangeIndex(start=0, stop=442, step=1)
Index(['version', 'data'], dtype='object')
Beyoncé
dict_keys(['qas', 'context'])
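
The few-shot part of `qa_pipeline` pairs each answerable demonstration with an unanswerable one. A minimal standalone sketch of that idea follows. It assumes examples stored as flat dicts with `context`, `question`, `answer` and `is_impossible` keys, which is a simplification of `data_list`, and it pre-splits the pool instead of resampling until a suitable example turns up. `build_few_shot_prefix` is a hypothetical helper, not part of the notebook.

```python
import random


def build_few_shot_prefix(examples, shots, rng):
    """Build a prompt prefix with `shots` answerable and `shots` unanswerable demos."""
    answerable = [e for e in examples if not e["is_impossible"]]
    impossible = [e for e in examples if e["is_impossible"]]
    parts = []
    for _ in range(shots):
        pos = rng.choice(answerable)
        neg = rng.choice(impossible)
        parts.append(
            f"passage:\n{pos['context']}\n\nquestion:\n{pos['question']}\n\n{pos['answer']}\n\n\n"
        )
        parts.append(
            f"passage:\n{neg['context']}\n\nquestion:\n{neg['question']}\n\nI do not know the answer\n\n\n"
        )
    return "".join(parts)


# Hypothetical examples in the simplified flat layout described above.
examples = [
    {"context": "Pizza originated in Naples.", "question": "Where did pizza originate?",
     "answer": "Naples", "is_impossible": False},
    {"context": "Pizza originated in Naples.", "question": "Who invented pizza?",
     "answer": None, "is_impossible": True},
]

prefix = build_few_shot_prefix(examples, shots=2, rng=random.Random(42))
print(prefix)
```

Passing an explicit `random.Random` instance keeps the sampled demonstrations reproducible without touching the global seed.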


In [36]:
import numpy as np
ans = qa_pipeline("What is life", shots=0, top_p=3)

INPUT TEXT: Given the question What is life?, choose the best answer between
directed toward the purpose of increasing its own satisfaction
culture
Plant physiology
 



In [37]:
print(ans)

directed toward the purpose of increasing its own satisfaction
