# Advanced Topics in Machine Learning - Natural Language Processing -- Group Assignment
> Students
> - Patrik Rác
> - Méline Trochon
> - Valentina Moretti
> - Mattia Colbertaldo

---



In [None]:
# Imports
import os
import torch
import pandas

## 1. Investigate Dataset


We will investigate the dataset ***SQuAD2.0 (Stanford Question Answering Dataset)*** avaliable [here](https://rajpurkar.github.io/SQuAD-explorer/).

In [None]:
# Set the directory to work in
WORKING_DIR = "./squad"
DATA_DIR = "./data"

In [None]:
# Download the dataset
print("Downloading the DEV dataset of SQuAD2.0")
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

Downloading the DEV dataset of SQuAD2.0
--2024-01-02 14:03:54--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘dev-v2.0.json’


2024-01-02 14:03:55 (34.2 MB/s) - ‘dev-v2.0.json’ saved [4370528/4370528]



In [None]:
dev_dataframe = pandas.read_json("/content/dev-v2.0.json")
print(f"Size of the dataset: {dev_dataframe.size} (e.g. Categories of questions)")
dev_dataframe.head()

Size of the dataset: 70 (e.g. Categories of questions)


Unnamed: 0,version,data
0,v2.0,"{'title': 'Normans', 'paragraphs': [{'qas': [{..."
1,v2.0,"{'title': 'Computational_complexity_theory', '..."
2,v2.0,"{'title': 'Southern_California', 'paragraphs':..."
3,v2.0,"{'title': 'Sky_(United_Kingdom)', 'paragraphs'..."
4,v2.0,"{'title': 'Victoria_(Australia)', 'paragraphs'..."


In [None]:
def count_paragraphs(row):
    return len(row["data"]["paragraphs"])

print(f"Number of individual paragraphs: {sum(dev_dataframe.apply(count_paragraphs, axis=1))}")


def count_questions(row):
  return sum([len(p["qas"]) for p in row["data"]["paragraphs"]])

print(f"Number of individual questions: {sum(dev_dataframe.apply(count_questions, axis=1))}")

def count_impossible_questions(row):
  n_impossible = 0
  for p in row["data"]["paragraphs"]:
    for q in p["qas"]:
      if q["is_impossible"]:
        n_impossible += 1
  return n_impossible


print(f"Number of impossible questions: {sum(dev_dataframe.apply(count_impossible_questions, axis=1))}")

Number of individual paragraphs: 1204
Number of individual questions: 11873
Number of impossible questions: 5945


In [None]:
print(type(dev_dataframe.iloc[0]["data"]))

import json
print(json.dumps(dev_dataframe.iloc[1]["data"], sort_keys=False, indent=4))

<class 'dict'>
{
    "title": "Computational_complexity_theory",
    "paragraphs": [
        {
            "qas": [
                {
                    "question": "What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship?",
                    "id": "56e16182e3433e1400422e28",
                    "answers": [
                        {
                            "text": "Computational complexity theory",
                            "answer_start": 0
                        },
                        {
                            "text": "Computational complexity theory",
                            "answer_start": 0
                        },
                        {
                            "text": "Computational complexity theory",
                            "answer_start": 0
                        }
                    ],
                    "is_impossible": false
                },
           

In [None]:
# Example of a Paragraph and a question
import textwrap
print("Category of given paragraph: {}\n".format(dev_dataframe.iloc[0]["data"]["title"]))

example_datum = dev_dataframe.iloc[0]["data"]["paragraphs"][0]
print("Context\n {}\n".format(textwrap.fill(example_datum["context"], 50)))

print("Example Question:\n{}\n".format(example_datum["qas"][0]["question"]))

print("Example Answer:\n{}\n".format(example_datum["qas"][0]["answers"][0]["text"]))

Category of given paragraph: Normans

Context
 The Normans (Norman: Nourmands; French: Normands;
Latin: Normanni) were the people who in the 10th
and 11th centuries gave their name to Normandy, a
region in France. They were descended from Norse
("Norman" comes from "Norseman") raiders and
pirates from Denmark, Iceland and Norway who,
under their leader Rollo, agreed to swear fealty
to King Charles III of West Francia. Through
generations of assimilation and mixing with the
native Frankish and Roman-Gaulish populations,
their descendants would gradually merge with the
Carolingian-based cultures of West Francia. The
distinct cultural and ethnic identity of the
Normans emerged initially in the first half of the
10th century, and it continued to evolve over the
succeeding centuries.

Example Question:
In what country is Normandy located?

Example Answer:
France



## TODO Section
---

- Investigate the data further (Possibly some plots etc.)
- Train Word2Vec on the corpus of paragraphs.
- (Potentially index the paragraphs) -> Find appropriate Datastructure potentially

=> Prepare Data for Training

---
- Investigate the application of `DistilBert`
  - Fine Tuning
  - One-Shot learning
  - (Two-Shot learning)
- Come up with alternatives (Custom Transformer etc.)
---


## Finetuning DistilBERT

In [None]:
# Install the required Huggingface libs
! pip install datasets transformers accelerate -U

Collecting datasets
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m507.1/507.1 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
Collecting transformers
  Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.2/8.2 MB[0m [31m15.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m265.7/265.7 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70

In [None]:
# Import the transformers package (We'll be taking a pretrained model from here)
import transformers
print(transformers.__version__)

4.36.2


In [None]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

In [None]:
from datasets import load_dataset

squad2_datasets = load_dataset("squad_v2") # TODO: Replace with loading custom dataset

Downloading data:   0%|          | 0.00/16.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two part of the context when splitting it is needed.

In [None]:
pad_on_right = tokenizer.padding_side == "right"

In [None]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lots of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [None]:
tokenized_datasets = datasets.map(prepare_train_features, batched=True, remove_columns=datasets["train"].column_names)

NameError: ignored

### Actual Fine Tuning of the Model

In [None]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

In [None]:
from transformers import default_data_collator

data_collator = default_data_collator

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

NameError: ignored

In [None]:
trainer.train()

In [None]:
trainer.save_model("test-squad-trained")

In [None]:
# TODO: Evaluation. The evaluation of the model should be rather genereal and work with all model training strategies.

# One shot learning

### QA data preparation

In this section we will be using the [WikiQA](https://aclanthology.org/D15-1237/) data set.
It's a data set for open domain generative QA.

It's avaialble via the HuggingFace data set package, let's install it

In [2]:
!pip -q install datasets

In [3]:
!pip -q install transformers==4.22.2
!pip -q install -U sentence-transformers

In [4]:
!pip install --upgrade accelerate



Let's download the validation split of the dev_dataframe data set

In [5]:
import numpy as np
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [6]:
import datasets

In [7]:
# Print the first example from dev_dataframe
dev_dataframe.head()

NameError: ignored

The data set contains questions, title of documents in Wikipedia containing the answers and answers.
Additionally the data set contains distractor answers for a given question.
We can distinguish the two types of responses from the `label` value.

Let's filter the data to retain only correct question-response pairs

In [8]:
import pandas as pd

# Assuming your DataFrame is already loaded
# Replace 'your_file.json' with the actual file path if needed
dev_dataframe = pd.read_json("/content/dev-v2.0.json")

# Flatten the nested structure to work with individual questions
flat_dataframe = pd.json_normalize(dev_dataframe['data'], record_path=['paragraphs', 'qas'], meta=['title', 'paragraphs'])



def remove_impossible_questions(row, possible_questions_list):
    count = 0
    for p in row["data"]["paragraphs"]:
        for q in p["qas"]:
            if not q["is_impossible"]:
                possible_questions_list.append(q)
                count += 1
    print("Number of possible questions:", count)


dev_dataframe_possible_list = []
dev_dataframe.apply(remove_impossible_questions, axis=1, possible_questions_list=dev_dataframe_possible_list)
print(len(dev_dataframe_possible_list))
# Convert the list of dictionaries to a DataFrame
dev_dataframe_possible = pd.DataFrame(dev_dataframe_possible_list)

# Display the first few rows of the filtered DataFrame
print(dev_dataframe_possible_list[:10])

Number of possible questions: 96
Number of possible questions: 197
Number of possible questions: 180
Number of possible questions: 108
Number of possible questions: 124
Number of possible questions: 217
Number of possible questions: 215
Number of possible questions: 241
Number of possible questions: 106
Number of possible questions: 231
Number of possible questions: 183
Number of possible questions: 170
Number of possible questions: 136
Number of possible questions: 104
Number of possible questions: 108
Number of possible questions: 116
Number of possible questions: 128
Number of possible questions: 197
Number of possible questions: 98
Number of possible questions: 113
Number of possible questions: 112
Number of possible questions: 96
Number of possible questions: 297
Number of possible questions: 176
Number of possible questions: 228
Number of possible questions: 215
Number of possible questions: 104
Number of possible questions: 153
Number of possible questions: 290
Number of possibl

Ok now we have a lot of samples to try our system

In [9]:
import json

row_index = 1

# Extract the relevant row
selected_row = dev_dataframe_possible.iloc[row_index]

# Convert the entire row to JSON and pretty print it
formatted_json = json.dumps(selected_row.to_dict(), sort_keys=False, indent=4)
print(formatted_json)

#TODO are not all false but the group in which there is at least one false

{
    "question": "When were the Normans in Normandy?",
    "id": "56ddde6b9a695914005b9629",
    "answers": [
        {
            "text": "10th and 11th centuries",
            "answer_start": 94
        },
        {
            "text": "in the 10th and 11th centuries",
            "answer_start": 87
        },
        {
            "text": "10th and 11th centuries",
            "answer_start": 94
        },
        {
            "text": "10th and 11th centuries",
            "answer_start": 94
        }
    ],
    "is_impossible": false
}


### Knowlege retrieval

**We have**: the questions (and the target answers)

**We need**: knowledge source & the retrieval system.

We can re-use the simple Wikipedia and the two encoder models from last tutorial.

**DATA:**

In [1]:
from transformers import BertConfig  # Replace BertConfig with the specific configuration class you need
from sentence_transformers import SentenceTransformer, CrossEncoder
! pip install --upgrade transformers

# MS MARCO Passage Ranking is a large dataset to train models for information retrieval. It consists of about 500k real search queries from Bing search engine with the relevant text passage that answers the query.

# It maps sentences & paragraphs to a 384 dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources.
semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# The model can be used for Information Retrieval: Given a query, encode the query will all possible passages (e.g. retrieved with ElasticSearch). Then sort the passages in a decreasing order
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')



### Answering a question

We are still missing the core of the QA system, the answering model.
We are going to re-use a pre-trained model.

[FLAN T5](https://arxiv.org/abs/2210.11416) is a pre-trained encode-decoder model trained to be used on different tasks in zero-shot or few-shots leaning settings

In [10]:
!pip install sentencepiece



In [11]:
!pip -q install transformers sentencepiece accelerate

In [12]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


config.json:   0%|          | 0.00/662 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Let's check quickly that it works:
zero shot learning: we give him a task he did not see during the training

In the zero-shot learning: evaluate the model on the SQuAD2.0 test set without providing any task-specific examples during testing. The model relies solely on its pre-trained knowledge and general language understanding to answer questions.

In [19]:
input_text = 'Translate the following sentence from Italian to English: "Amo la pizza"'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0])
print(output_text)

RuntimeError: ignored

# Index all the paragraphs and all the questions

In [20]:
import pandas as pd

# Replace 'path_al_tuo_file.json' with the actual file path
squad_data = dev_dataframe

# Extract questions and paragraphs and index them
questions = []
paragraphs = []

# Iterate over documents and paragraphs
for _, row in squad_data.iterrows():
    for paragraph in row['data']['paragraphs']:
        for qa in paragraph['qas']:
            question_dict = {
                'id': qa['id'],
                'title': row['data']['title'],
                'context': paragraph['context'],
                'question': qa['question'],
                'is_impossible': qa['is_impossible'],
                'answers': qa['answers']
            }
            questions.append(question_dict)

        # Append paragraphs to the separate list
        paragraph_dict = {
            'title': row['data']['title'],
            'context': paragraph['context']
        }
        paragraphs.append(paragraph_dict)

# Convert the lists to DataFrames
questions_df = pd.DataFrame(questions)
paragraphs_df = pd.DataFrame(paragraphs)

# Display the DataFrames
print("Questions:")
print(questions_df.head())

print("\nParagraphs:")
print(paragraphs_df.head())


Questions:
                         id    title  \
0  56ddde6b9a695914005b9628  Normans   
1  56ddde6b9a695914005b9629  Normans   
2  56ddde6b9a695914005b962a  Normans   
3  56ddde6b9a695914005b962b  Normans   
4  56ddde6b9a695914005b962c  Normans   

                                             context  \
0  The Normans (Norman: Nourmands; French: Norman...   
1  The Normans (Norman: Nourmands; French: Norman...   
2  The Normans (Norman: Nourmands; French: Norman...   
3  The Normans (Norman: Nourmands; French: Norman...   
4  The Normans (Norman: Nourmands; French: Norman...   

                                            question  is_impossible  \
0               In what country is Normandy located?          False   
1                 When were the Normans in Normandy?          False   
2      From which countries did the Norse originate?          False   
3                          Who was the Norse leader?          False   
4  What century did the Normans first gain their ...    

Now we can test our pipeline.
First select randomly a question from the data set

In [21]:
import random

random.seed(10)

idx = random.choice(range(len(questions)))

sample = questions[idx]
print(sample)
question = sample['question']
target_answer = sample['answers']

print(f'Question {idx}: {question}')
print(f'Answer {idx}: {target_answer}')

idx = random.choice(range(len(paragraphs)))

sample = paragraphs[idx]
print(sample)
context = sample['context']


print(f'Context {idx}: {context}')

{'id': '572fff45947a6a140053cf27', 'title': 'Rhine', 'context': "Most of the Rhine's current course was not under the ice during the last Ice Age; although, its source must still have been a glacier. A tundra, with Ice Age flora and fauna, stretched across middle Europe, from Asia to the Atlantic Ocean. Such was the case during the Last Glacial Maximum, ca. 22,000–14,000 yr BP, when ice-sheets covered Scandinavia, the Baltics, Scotland and the Alps, but left the space between as open tundra. The loess or wind-blown dust over that tundra, settled in and around the Rhine Valley, contributing to its current agricultural usefulness.", 'question': 'What stretched across middle Europe in the last ice age?', 'is_impossible': False, 'answers': [{'text': 'tundra', 'answer_start': 137}, {'text': 'tundra', 'answer_start': 137}, {'text': 'A tundra', 'answer_start': 135}]}
Question 9361: What stretched across middle Europe in the last ice age?
Answer 9361: [{'text': 'tundra', 'answer_start': 137}, 

# Embed paragraphs and questions

In the few-shot learning: during testing we provide the model with a small nsmall number of examples from the SQuAD2.0 dataset. These examples are used to give the model a knowledge of the specific patterns present in the SQuAD2.0 task.

Now let's embed the retrieved passages (We can checkpoint the embeddings to avoid repeating the computation each time)

In [22]:
import os
import pickle

# Define hnswlib index path
embeddings_cache_path = './qa_embeddings_cache.pkl'

# Load cache if available
if os.path.exists(embeddings_cache_path):
    print('Loading embeddings cache')
    with open(embeddings_cache_path, 'rb') as f:
        corpus_embeddings = pickle.load(f)
# Else compute embeddings
else:
    print('Computing embeddings')
    corpus_embeddings = semb_model.encode(paragraphs, convert_to_tensor=True, show_progress_bar=True)
    # Save the index to a file for future loading
    print(f'Saving index to: \'{embeddings_cache_path}\'')
    with open(embeddings_cache_path, 'wb') as f:
        pickle.dump(corpus_embeddings, f)

Computing embeddings


Batches:   0%|          | 0/38 [00:00<?, ?it/s]

Saving index to: './qa_embeddings_cache.pkl'


Finally let's index the embeddings (and let's save the index hoping it works)

In [23]:
!pip -q install hnswlib

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for hnswlib (pyproject.toml) ... [?25l[?25hdone


In [24]:
import os
import hnswlib
import time
start = time.time()
# Create empthy index
index = hnswlib.Index(space='cosine', dim=corpus_embeddings.size(1))

# Define hnswlib index path
index_path = './qa_hnswlib_100.index'

# Load index if available
if os.path.exists(index_path):
    print('Loading index...')
    index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print('Start creating HNSWLIB index')
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=100, M=64) # see https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md for parameter description
    # Compute the HNSWLIB index (it may take a while)
    index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print(f'Saving index to: {index_path}')
    index.save_index(index_path)

end = time.time()
print(f"Exectution time: {int((end - start) / 60)}:{int((end - start) % 60)} min:sec")

Start creating HNSWLIB index
Saving index to: ./qa_hnswlib_100.index
Exectution time: 0:0 min:sec


In [26]:
import random

random.seed(80)

idx = random.choice(range(len(questions)))

sample = questions[idx]
question = sample['question']
target_answer = sample['answers']

print(f'Question {idx}: {question}?')

Question 4448: Inter-network routing was what kind of system??


Embed the question

In [29]:
question_embedding = semb_model.encode(question, convert_to_tensor=True)

Retrieve relevant documents keeping top $k$ matches

In [30]:
corpus_ids, distances = index.knn_query(question_embedding.cpu(), k=64)
scores = 1 - distances

print("Cosine similarity model search results")
print(f"Query: \"{question}\"")
print("---------------------------------------")
for idx, score in zip(corpus_ids[0][:5], scores[0][:5]):
    print(f"Score: {score:.4f}\nDocument: \"{paragraphs[idx]}\"\n\n")

Cosine similarity model search results
Query: "Inter-network routing was what kind of system?"
---------------------------------------
Score: 0.1563
Document: "{'title': 'Imperialism', 'context': 'Imperialism is a type of advocacy of empire. Its name originated from the Latin word "imperium", which means to rule over large territories. Imperialism is "a policy of extending a country\'s power and influence through colonization, use of military force, or other means". Imperialism has greatly shaped the contemporary world. It has also allowed for the rapid spread of technologies and ideas. The term imperialism has been applied to Western (and Japanese) political and economic dominance especially in Asia and Africa in the 19th and 20th centuries. Its precise meaning continues to be debated by scholars. Some writers, such as Edward Said, use the term more broadly to describe any system of domination and subordination organised with an imperial center and a periphery.'}"


Score: 0.1563
Docu

Re-rank retrieved documents

In [31]:
import numpy as np

# Initialize texts list
texts = [[] for _ in range(len(paragraphs))]

# Populate texts list
for paragraph_idx, paragraph in enumerate(paragraphs):
    for idx, context_key in enumerate(['context']):  # Adjust keys as needed
        texts[paragraph_idx].append(paragraph[context_key])


# Create model inputs
model_inputs = [(question, ' '.join(paragraph_texts)) for paragraph_texts in texts]

# Predict scores using the cross-encoder model
cross_scores = xenc_model.predict(model_inputs)

# Print results
print("Cross-encoder model re-ranking results")
print(f"Query: \"{question}\"")
print("---------------------------------------")
for idx in np.argsort(-cross_scores)[:5]:
    print(f"Score: {cross_scores[idx]:.4f}\nDocument: \"{paragraphs[idx]['context']}\"\n\n")


Cross-encoder model re-ranking results
Query: "Inter-network routing was what kind of system?"
---------------------------------------
Score: 1.3056
Document: "AppleTalk was a proprietary suite of networking protocols developed by Apple Inc. in 1985 for Apple Macintosh computers. It was the primary protocol used by Apple devices through the 1980s and 90s. AppleTalk included features that allowed local area networks to be established ad hoc without the requirement for a centralized router or server. The AppleTalk system automatically assigned addresses, updated the distributed namespace, and configured any required inter-network routing. It was a plug-n-play system."


Score: -3.7926
Document: "There were two kinds of X.25 networks. Some such as DATAPAC and TRANSPAC were initially implemented with an X.25 external interface. Some older networks such as TELENET and TYMNET were modified to provide a X.25 host interface in addition to older host connection schemes. DATAPAC was developed by

Use best match to answer (and compare to reference answer)

In [None]:
passage_idx = np.argsort(-cross_scores)[0]

paragraphs = paragraphs[passage_idx]

input_text = f"Given the following paragrapha, answer the related question.\n\paragraphs:\n\n{paragraphs}\n\nQ: {question}?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
print(input_text, "\n")

'''
output_ids = model.generate(input_ids, max_new_tokens=32)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text, "\n")

print(f"A (target): {target_answer}")
'''

How do we know if the passage was useful and the model haven't exploited weights memorisation?
Let's try to generate directly the response


In [None]:
'''
# Create a input_text for the T5 model
input_text = f"Answer the following question.\n\nQ: {question}?"
print(input_text)
# tokenise the input_text
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

# generate the output_ids
output_ids = model.generate(input_ids, max_new_tokens=32)
# decode the output_ids
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# print the output_text
print(f"\nA: {output_text}")
'''