## **Project description**


Build an Information Retrieval pipeline using the techniques learned so far, and evaluate the results on the evaluation set.

Use combinations of these three methods:

1. Sentence Embedding using Word2Vec
2. Word Mover's Distance
3. Sentence Embedding using Sentence Transformer (a.k.a. bi-encoder)

Run the evaluate() function to record the score for each method.

Additionally, the Project should be completed according to the guidelines:

1. Pre-processing
2. Installation:
    * Retrieve model
    * Re-Rank model
3. Evaluation metrics:
    * MRR, MAP@k
4. Explanation
    * Machine learning: input size, model size
    * Algorithm optimization
    * New creative methods







The main task the model should perform can be described as follows:
* Given a list of queries, each with one corresponding ID.
* Given a corpus (a single text) that contains all the information needed to answer the queries.
* Find the answers to the queries in the given corpus.

## Load data

The following code is used to load the data and show dataset information.

In [2]:
# Download data
# The URLs are quoted so the shell does not treat '&' as a background operator
!gdown "https://drive.google.com/uc?id=112XMyn4RjjV625JGUX4Va7MNHg9cgT5y&export=download"
!gdown "https://drive.google.com/uc?id=18WeZe3BxzJX0fcUcFSQv-vvB_s8Q2nUG&export=download"

Downloading...
From: https://drive.google.com/uc?id=112XMyn4RjjV625JGUX4Va7MNHg9cgT5y
To: /content/IR_Groundtruth.json
100% 189k/189k [00:00<00:00, 114MB/s]
Downloading...
From: https://drive.google.com/uc?id=18WeZe3BxzJX0fcUcFSQv-vvB_s8Q2nUG
To: /content/IR_Test.json
100% 116k/116k [00:00<00:00, 92.3MB/s]


There are 2 data files:
* IR_Test.json: contains the input data.
* IR_Groundtruth.json: contains the ground-truth data for evaluation.

Load the input data and save it to the test_set variable.


In [3]:
# Read data files
import json

with open('/content/IR_Test.json', 'r', encoding='utf-8') as f:
    test_set=json.load(f)

This function prints the schema of the data.

In [4]:
# Function to print the schema of the data
def print_schema(data, indent=0):
    indent_str = '  ' * indent
    if isinstance(data, dict):
        for key, value in data.items():
            if isinstance(value, dict):
                print(f"{indent_str}{key}: dict")
                print_schema(value, indent + 1)
            elif isinstance(value, list):
                if value:
                    print(f"{indent_str}{key}: [List of {type(value[0]).__name__}]")
                    print_schema(value[0], indent + 1)
                else:
                    print(f"{indent_str}{key}: [Empty List]")
            else:
                print(f"{indent_str}{key}: {type(value).__name__}")
    elif isinstance(data, list):
        if data:
            print(f"{indent_str}[List of {type(data[0]).__name__}]")
            print_schema(data[0], indent + 1)
        else:
            print(f"{indent_str}[Empty List]")
    else:
        print(f"{indent_str}{type(data).__name__}")

Use the function defined above to print information about test_set.

In [5]:
# Information about test set
print("test_set datatype:", type(test_set))
print("Schema of test_set:")
print_schema(test_set)

test_set datatype: <class 'dict'>
Schema of test_set:
corpus: str
qa: [List of dict]
  question: str
  id: str


test_set is a dictionary consisting of 2 keys:

1. The key 'corpus' is associated with a value that is a string containing the text with the content to be searched.

2. The key 'qa' is associated with a value that is a list of dictionaries containing 2 keys:
  * The key 'question' is associated with a value that is a string containing the question.
  * The key 'id' is associated with a value that is a string containing the question ID.
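
The structure described above can be sketched with a small hand-written dictionary. The corpus text, questions, and IDs below are invented for illustration and are not taken from IR_Test.json; `toy_test_set` and `questions_by_id` are hypothetical names.

```python
# Toy object with the same shape as test_set (contents are invented).
toy_test_set = {
    "corpus": "Alpha is the first letter. Beta is the second letter.",
    "qa": [
        {"question": "Which letter comes first?", "id": "q-001"},
        {"question": "Which letter comes second?", "id": "q-002"},
    ],
}

# Map each question ID to its question text for quick lookup.
questions_by_id = {item["id"]: item["question"] for item in toy_test_set["qa"]}

for qid, question in questions_by_id.items():
    print(qid, "->", question)
```

A lookup table like this is one possible convenience when building the predictions dictionary later, since predictions are keyed by question ID.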







## Evaluate function

This code defines the functions used to evaluate the model.

In [6]:
# THIS CODE MUST NOT BE MODIFIED
# Load groundtruth
with open('/content/IR_Groundtruth.json', 'r', encoding='utf-8') as f:
    ground_truth=json.load(f)

def calculate_mrr(predictions, ground_truth):
    reciprocal_ranks = []

    for query_id, predicted_answers in predictions.items():
        if query_id not in ground_truth:
          raise ValueError(f"Query ID '{query_id}' not found in ground truth data.")
        correct_data = ground_truth.get(query_id, {'answers': [], 'key': []})
        correct_answers = correct_data['answers']
        correct_keys = correct_data['key']

        rr = 0
        stop = len(predicted_answers)
        i = 1
        while i <= stop:
            if predicted_answers[i-1] in correct_answers:
                rr = 1 / i
                break
            elif not rr:
              for correct_answer in correct_answers:
                    if correct_answer in predicted_answers[i-1]:
                        penalty = len(correct_answer) / len(predicted_answers[i-1])
                        rr = penalty * 0.9 / i
                        stop = min(2*i, len(predicted_answers))
                    elif predicted_answers[i-1] in correct_answer and any(key in predicted_answers[i-1] for key in correct_keys):
                        rr = 0.9 / i
                        stop = min(2*i, len(predicted_answers))

            i += 1

        reciprocal_ranks.append(rr)

    return sum(reciprocal_ranks) / len(ground_truth)


def calculate_map(predictions, ground_truth):
    average_precisions = []

    for query_id, predicted_answers in predictions.items():
        if query_id not in ground_truth:
          raise ValueError(f"Query ID '{query_id}' not found in ground truth data.")
        correct_data = ground_truth.get(query_id, {'answers': [], 'key': []})
        correct_answers = correct_data['answers']
        correct_keys = correct_data['key']

        if not correct_answers:
            continue

        correct_answer_rank = [-1] * len(correct_answers)
        correct_answer_score = [0] * len(correct_answers)
        for id, correct_answer in enumerate(correct_answers):
            i = 1
            stop = len(predicted_answers)
            while i <= stop:
                if predicted_answers[i-1] == correct_answer:
                    correct_answer_rank[id] = i
                    correct_answer_score[id] = 1
                    break
                elif not correct_answer_score[id]:
                  if correct_answer in predicted_answers[i-1]:
                    correct_answer_rank[id] = i
                    correct_answer_score[id] = len(correct_answer) / len(predicted_answers[i-1]) * 0.9
                    stop = min(2*i, len(predicted_answers))
                  elif predicted_answers[i-1] in correct_answer and any(key in predicted_answers[i-1] for key in correct_keys):
                    correct_answer_rank[id] = i
                    correct_answer_score[id] = 0.9
                    stop = min(2*i, len(predicted_answers))
                i += 1
        rank_score_pairs = list(zip(correct_answer_rank, correct_answer_score))

        rank_score_pairs.sort(key=lambda x: x[0])

        sorted_correct_answer_rank, sorted_correct_answer_score = zip(*rank_score_pairs)

        sum_score = 0
        precision_sum = 0
        cnt = 0

        for rank, score in zip(sorted_correct_answer_rank, sorted_correct_answer_score):
            if score:
                cnt += 1
                sum_score += score
                precision_sum += sum_score / rank

        if cnt > 0:
            average_precisions.append(precision_sum / cnt)
        else:
            average_precisions.append(0)

    return sum(average_precisions) / len(ground_truth)


def evaluate(predictions, ground_truth):
    mrr = calculate_mrr(predictions, ground_truth)
    map_score = calculate_map(predictions, ground_truth)
    print(f"MRR: {mrr}")
    print(f"MAP: {map_score}")



---



## Building pipeline

The return value of the model should be in this format:

In [7]:
# Sample return value:
sample_predictions = {
  "56d99b7bdc89441400fdb5cb": ["In week 10, Manning suffered a partial tear of the plantar fasciitis in his left foot."],
  "56d20564e7d4791d00902612": ["This is the sentence you retrieved", "Answer2", "Answer3"],
  # .............
}

Note that the evaluate() function takes two input values.

The predictions value must follow the correct format, which is a dictionary with the key as question_id and the value as a list of extracted texts corresponding to that question.

For each method, the predictions value should be returned and the evaluate() function should be run once.

In [8]:
evaluate(sample_predictions, ground_truth)

MRR: 0.001890359168241966
MAP: 0.001890359168241966
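
To make the metric easier to read, here is a minimal sketch of textbook MRR with exact matching only. It is not the evaluate() implementation above, which additionally awards partial credit for substring matches; `simple_mrr` and the toy data are hypothetical.

```python
def simple_mrr(predictions, gold):
    """Textbook MRR with exact matching, averaged over all gold queries."""
    total = 0.0
    for qid, ranked in predictions.items():
        for rank, answer in enumerate(ranked, start=1):
            if answer in gold[qid]:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(gold)

toy_gold = {"q1": ["a"], "q2": ["b"], "q3": ["c"]}
toy_predictions = {"q1": ["a", "x"], "q2": ["x", "b"], "q3": ["x", "y"]}
print(simple_mrr(toy_predictions, toy_gold))  # (1 + 1/2 + 0) / 3 = 0.5
```

As in calculate_mrr(), the sum is divided by the number of ground-truth queries, so queries that are missing from the predictions count as zero. This is why a sample with only two entries scores very low.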


# Corpus Pre-processing

Split the corpus text into sentences, and save each sentence as an element of the corpus list.

This ensures every sentence can be processed independently in the next stages.

Then, print the first 10 sentences for demonstration purposes.

In [9]:
corpus = test_set['corpus'].split('. ')

print(len(corpus))
for i in range(10):
    print(corpus[i])

226
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season
The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title
The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California
As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50
The Panthers finished the regular season with a 15–1 record, and quarterback Cam Newton was named the NFL Most Valuable Player (MVP)
They defeated the Arizona Cardinals 49–15 in the NFC Championship Game and advanced to thei
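
Splitting on '. ' is simple but has side effects worth knowing: the period is dropped from every piece except the last, and abbreviations are split as well. The sketch below shows this on an invented text; `split_sentences` is a hypothetical helper, and a trained tokenizer such as NLTK's sent_tokenize may handle abbreviations better.

```python
toy_text = "Dr. Smith arrived. The game started at 3 p.m. in Denver. Everyone cheered."

# Naive split: abbreviations are cut, and only the last piece keeps its period.
naive = toy_text.split('. ')
print(naive)

def split_sentences(text):
    """Split on '. ', then strip whitespace and trailing periods and drop empty pieces."""
    pieces = [piece.strip().rstrip('.') for piece in text.split('. ')]
    return [piece for piece in pieces if piece]

print(split_sentences("First sentence. Second sentence."))
```

Normalizing the trailing period matters here because evaluate() compares strings: the retrieved sentences carry no final period, while a hand-written answer might.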

## Model Development
* For the first model, use pre-trained word embeddings loaded through gensim (GloVe vectors, used in the same way as Word2Vec vectors) to embed words, and Word Mover's Distance to compare sentences.

* For the second model, use the Retrieve & Re-rank method, with a Sentence Transformer (bi-encoder) for the Retrieve part and a Cross Encoder for the Re-rank part.

# Model 1

## Import libraries

* *"import gensim.downloader as api"*: imports the downloader module from the gensim library, which lets users easily download pre-trained models and datasets from the Gensim repository. The "api" alias is used for later function calls.
* *"from nltk.corpus import stopwords"*: imports the stopwords corpus from the Natural Language Toolkit (NLTK). Stop words are common words in a language (like "the", "is", "in") that are often filtered out in text processing because they may not contribute meaningful information.
* *"from nltk import download"*: imports the download function from the NLTK library. The download function is used to download additional datasets or resources needed for various NLP tasks. This may include corpora, models, or additional tools that are not included in the default installation of NLTK.
* *"import nltk"*: imports the NLTK library itself. NLTK is a comprehensive library for NLP tasks, providing tools for text processing, classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
* *"from nltk.stem import WordNetLemmatizer"*: imports the WordNetLemmatizer class from the nltk.stem module. The WordNet Lemmatizer is a tool used to reduce words to their base or root form (lemmatization). For example, "running" might be reduced to "run". Lemmatization helps to standardize words and reduce the dimensionality of the text data by consolidating different forms of a word.

In [10]:
import gensim.downloader as api
from nltk.corpus import stopwords
from nltk import download
import nltk
from nltk.stem import WordNetLemmatizer

Install the pot Python package (POT, "Python Optimal Transport"), which provides solvers for optimal transport problems, that is, finding the cheapest way to move mass from one distribution to another. Recent gensim releases (4.x) rely on POT to compute Word Mover's Distance.

In [11]:
!pip install pot

Collecting pot
  Downloading POT-0.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
Downloading POT-0.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (835 kB)
Installing collected packages: pot
Successfully installed pot-0.9.4


This section loads the pre-trained word vectors and the NLTK resources used for text processing:
* Load a pre-trained GloVe (Global Vectors for Word Representation) model. The specific model is glove-wiki-gigaword-50: 50-dimensional word embeddings trained on Wikipedia and the Gigaword corpus.
* Download every dataset, model, and resource in the NLTK collection with nltk.download('all'). This is far more than the pipeline needs; the stop words list and the WordNet data used by the lemmatizer are the parts actually used here.
* Download the stop words corpus from NLTK. Stop words are common words (like "the", "is", "in") that are often excluded from natural language processing tasks because they carry little meaning on their own. With this list, they can be filtered out in the preprocessing pipeline.
* Retrieve the list of English stop words and save it in the "stopwords" variable. Note that this name replaces the imported stopwords module, so re-running the cell raises an error unless the import cell is run again.

In [12]:
model = api.load('glove-wiki-gigaword-50')
nltk.download('all')
download('stopwords')
stopwords = stopwords.words('english')



[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

## Text processing

Define functions for pre-processing text data:
* Converting words to their base form using lemmatization, using WordNetLemmatizer().
* Removing punctuation from the input text.
* Filtering out common stop words that do not add significant meaning to the text.

In [13]:
import string

# initializing lemmatizer
lemmatizer = WordNetLemmatizer()

def remove_punctuation(input_string):
    translator = str.maketrans('', '', string.punctuation)
    return input_string.translate(translator)

# clean text by removing stopwords and
# reducing each word to its base form using lemmatization
def preprocess(sentences):
  new_sentences = []
  for sent in sentences:
    sent = [lemmatizer.lemmatize(w) for w in sent.lower().split() if remove_punctuation(w) not in stopwords]
    processed_sent = []
    for words in sent:
      processed_sent.append(remove_punctuation(words))
    new_sentences.append(processed_sent)
  return new_sentences

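This section demonstrates what the preprocess() function does. Note that lemmatization runs before punctuation is removed, so 'projects!!' stays 'projects' while 'projects' becomes 'project'.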

In [14]:
# small corpus to test

sentences = ['Programming is fun.',
             'The quick brown fox jump over the lazy dog.',
             'This is just a normal sentence.',
             'I love Oreos!',
             'I am doing AI projects!!',
             'He is cooking AI projects']

processed_sentences = preprocess(sentences)
processed_sentences

[['programming', 'fun'],
 ['quick', 'brown', 'fox', 'jump', 'lazy', 'dog'],
 ['normal', 'sentence'],
 ['love', 'oreos'],
 ['ai', 'projects'],
 ['cooking', 'ai', 'project']]
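
The last two outputs differ ('projects' versus 'project') because of the order of operations in preprocess(). The following is a minimal sketch of a variant that strips punctuation first. It is only an illustration: `TOY_STOP_WORDS` and `toy_lemmatize` are invented stand-ins for the NLTK stop word list and WordNetLemmatizer, and swapping this into the pipeline would change the scores reported later.

```python
import string

# Tiny stand-in for NLTK's English stop word list (assumption for this sketch).
TOY_STOP_WORDS = {"i", "am", "is", "he", "the", "a", "doing"}
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def toy_lemmatize(word):
    """Very rough stand-in for a lemmatizer: drop a plural 's' from longer words."""
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

def preprocess_punct_first(sentence, lemmatize=toy_lemmatize):
    tokens = []
    for raw in sentence.lower().split():
        word = raw.translate(_PUNCT_TABLE)  # strip punctuation before anything else
        if word and word not in TOY_STOP_WORDS:
            tokens.append(lemmatize(word))
    return tokens

print(preprocess_punct_first("I am doing AI projects!!"))   # ['ai', 'project']
print(preprocess_punct_first("He is cooking AI projects"))  # ['cooking', 'ai', 'project']
```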

Prepare the real data

In [15]:
processed_corpus = preprocess(corpus)

## Word Mover's Distance

Demonstration of the wmdistance() method

* The *print(model.wmdistance(sentences[1], query[0]))* command calculates and prints the Word Mover's Distance (WMD) between sentences[1] and query[0] using the word embedding model.

* wmdistance() expects each document as a list of tokens. query[0] is a token list produced by preprocess(), but sentences[1] is still a raw string, so it is iterated character by character; processed_sentences[1] is the token list that matches the intent of this demonstration.

* The result is a non-negative real number representing the distance between the two documents. The smaller the value, the more similar they are.


In [16]:
query = ['Which animal does the quick brown fox jump over?']
query = preprocess(query)

In [17]:
print(model.wmdistance(sentences[1], query[0]))

1.1296432817950324


In this example, the resulting distance for "The quick brown fox jump over the lazy dog." and "Which animal does the quick brown fox jump over?" is relatively low, indicating some close relationship between these two sentences.
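
To give an intuition for what the distance measures, here is a minimal sketch of a one-sided relaxed variant of WMD, in which each word simply travels to its nearest word in the other document. This is not gensim's implementation (true WMD solves an optimal transport problem, and this relaxation only gives a lower bound on it); the 2-d vectors and the `relaxed_wmd` helper are invented for illustration.

```python
import math

# Invented 2-d "embeddings" purely for illustration.
TOY_VECTORS = {
    "fox": (1.0, 0.0),
    "dog": (0.9, 0.1),
    "animal": (0.8, 0.2),
    "car": (0.0, 1.0),
}

def relaxed_wmd(doc_a, doc_b, vectors=TOY_VECTORS):
    """One-sided relaxed WMD: each word in doc_a moves to its nearest word in doc_b."""
    a = [w for w in doc_a if w in vectors]  # ignore out-of-vocabulary words
    b = [w for w in doc_b if w in vectors]
    if not a or not b:
        return float("inf")
    cost = 0.0
    for word in a:
        cost += min(math.dist(vectors[word], vectors[other]) for other in b)
    return cost / len(a)  # uniform word weights

print(relaxed_wmd(["fox"], ["dog", "animal"]))
print(relaxed_wmd(["fox"], ["car"]))
```

In this toy space "fox" is much closer to "dog" than to "car", so the first distance is smaller, matching the rule that a smaller value means more similar documents.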

## Generate answers

Define functions to generate the answers corresponding to each question and ID
1. *getBest()* function: retrieves the top k strings based on their associated values
    * Parameters:
        * strings: list: A list of strings from which the best candidates will be selected.
        * values: list: A list of numerical values associated with each string in the strings list. These values could represent distances, scores, or any metric used for ranking.
        * top_k: int: The number of top strings to return based on their values.
    * Functionality:
        * *"string_value_dict = dict(zip(strings, values))"*: combines the strings and values lists into a dictionary, where each string is a key, and its corresponding value is the value.
        * *"sorted_strings = sorted(string_value_dict, key=string_value_dict.get)[:top_k]"*: sorts the dictionary keys (strings) by their values in ascending order. The [:top_k] slice keeps only the first top_k elements of the sorted list.
    * Return value: The function returns a list of the top k strings with the best (lowest) associated values.

2. *genAnswer()* function: generates answers for a list of queries by comparing each query to a corpus of sentences and selecting the most relevant ones.
    * Parameters:
        * queries: list: A list of dictionaries where each dictionary contains a query, typically with a structure like {'question': ..., 'id': ...}.
        * top_k: int: The number of best matching sentences to return for each query.
    * Functionality:
        * Initialization: *"answer = {}"*: initialize dictionary "answer" to store result.
        * Iterating Through Queries: *"for query in queries:"*: iterates over each query in the queries list.
        * Extracting and Preprocessing the Query: *"question, id = query['question'], query['id']"*: The current query's question and its ID are extracted.
        * *"question = preprocess([question])"*: question is then preprocessed.
        * Calculating Similarity Scores: *"for sent in processed_corpus: similarity.append(model.wmdistance(question[0], sent))"*: iterates through processed_corpus, which contains the preprocessed sentences. For each sentence, the Word Mover's Distance (WMD) between the preprocessed question and the sentence is calculated and appended to the similarity list (despite its name, this list holds distances, so lower is better).
        * Getting the Best Matches: *"answer[id] = getBest(corpus, similarity, top_k)"*: The getBest function is called with the corpus (a list of original sentences), the similarity scores, and top_k to retrieve the best matching sentences for the current query. The results are stored in the answer dictionary using the query's ID as the key.
    * Return Value: The function returns the answer dictionary containing the IDs of the queries and their corresponding top matching sentences.

        

        

In [18]:
def getBest (strings: list, values: list, top_k: int):
    string_value_dict = dict(zip(strings, values))
    sorted_strings = sorted(string_value_dict, key=string_value_dict.get)[:top_k]
    return sorted_strings

def genAnswer (queries: list, top_k: int):
    answer = {}
    for query in queries:
        similarity = []
        question, id = query['question'], query['id']
        question = preprocess([question])
        for sent in processed_corpus:
            similarity.append(model.wmdistance(question[0], sent))

        answer[id] = getBest(corpus, similarity, top_k)
    return answer
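
One detail of getBest() is worth noting: it builds a dictionary keyed by sentence, so if the same sentence text appears more than once, a later distance overwrites an earlier one. The sketch below shows an alternative that ranks by index instead; `get_best_by_index` is a hypothetical helper and the toy data is invented.

```python
import heapq

def get_best_by_index(strings, values, top_k):
    """Return the top_k strings with the smallest values, keeping duplicates."""
    best = heapq.nsmallest(top_k, range(len(strings)), key=lambda i: values[i])
    return [strings[i] for i in best]

toy_strings = ["same text", "other", "same text"]
toy_distances = [0.2, 0.5, 0.9]

print(get_best_by_index(toy_strings, toy_distances, 2))  # ['same text', 'other']

# For comparison, the dictionary approach keeps only the last distance (0.9)
# for "same text", which pushes it behind "other".
as_dict = dict(zip(toy_strings, toy_distances))
print(sorted(as_dict, key=as_dict.get)[:2])  # ['other', 'same text']
```

Whether this matters depends on the corpus: if every sentence is unique, both versions return the same ranking.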

Print out the first 10 questions and corresponding IDs

In [19]:
test_set['qa'][:10]

[{'question': 'What foot was injured on Manning that sidelined him in week 10?',
  'id': '56d99b7bdc89441400fdb5cb'},
 {'question': 'What three stadiums did the NFL decide between for the game?',
  'id': '56d20564e7d4791d00902612'},
 {'question': 'Who made their Super Bowl commercial debut with Nintendo?',
  'id': '56d721af0d65d2140019839e'},
 {'question': 'What one word did the NFL commissioner use to describe what Super Bowl 50 was intended to be?',
  'id': '56d98d0adc89441400fdb54e'},
 {'question': 'What month, day and year did Super Bowl 50 take place? ',
  'id': '56bf10f43aeaaa14008c9501'},
 {'question': 'Who had the best record in the NFC?',
  'id': '56beb3083aeaaa14008c923e'},
 {'question': "Who was limited by Denver's defense?",
  'id': '56beab833aeaaa14008c91d4'},
 {'question': 'How much money was spent on other festivities in the Bay area to help celebrate the coming Super Bowl 50?',
  'id': '56bf555e3aeaaa14008c95d3'},
 {'question': 'Who sacked Newton a few plays after the c

Call the genAnswer() function with first 10 queries and take top 5 answers to retrieve.

Demonstrate the operation of the genAnswer() function

In [20]:
genAnswer(queries = test_set['qa'][:10], top_k = 5)

{'56d99b7bdc89441400fdb5cb': ['In week 10, Manning suffered a partial tear of the plantar fasciitis in his left foot',
  'Manning finished the game 13 of 23 for 141 yards with one interception and zero touchdowns',
  'Manning and Newton also set the record for the largest age difference between opposing Super Bowl quarterbacks at 13 years and 48 days (Manning was 39, Newton was 26)',
  "Anderson was the game's leading rusher with 90 yards and a touchdown, along with four receptions for 10 yards",
  'Although the team had a 7–0 start, Manning led the NFL in interceptions'],
 '56d20564e7d4791d00902612': ['The next three drives of the game would end in punts',
  'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season',
  "On May 21, 2013, NFL owners at their spring meetings in Boston voted and awarded the game to Levi's Stadium",
  'They joined the Patriots, Dallas Cowboys, and Pittsburgh Steelers as one of four team

## Evaluate

Use genAnswer() to generate answers for all queries, then use the evaluate() function to score the predictions against the ground-truth dataset.

In [21]:
predictions = genAnswer(queries = test_set['qa'], top_k = 5)

In [22]:
evaluate(predictions, ground_truth)

MRR: 0.38119576559546414
MAP: 0.3796249483117283




## Summary
* MRR (Mean Reciprocal Rank): 0.381. A perfect score is 1 (the correct sentence always ranked first); 0.381 means the first relevant sentence is often below rank 1 or missing from the top 5.
* MAP (Mean Average Precision): 0.380. Precision across the returned sentences is moderate, with clear room for improvement.

* The embedding model used here, glove-wiki-gigaword-50, is only 167.5MB, so it can be considered a lightweight model.
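
For reference, here is a minimal sketch of textbook average precision at k with exact matching. It is not the calculate_map() function used above, which also gives partial credit and normalizes by the number of answers found; dividing by min(len(relevant), k) is one common convention and is an assumption of this sketch, as is the helper name `average_precision_at_k`.

```python
def average_precision_at_k(ranked, relevant, k):
    """Textbook AP@k with exact matching (no partial credit)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / min(len(relevant), k)

# Hits at ranks 1 and 3: (1/1 + 2/3) / 2 = 5/6
print(average_precision_at_k(["a", "x", "b"], {"a", "b"}, k=3))
```

MAP is then the mean of this value over all queries.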

That's not good enough, so let's use Retrieve & Re-rank.

# Model 2

## Motivation

A cross-encoder is an NLP model that scores the semantic relevance of a sentence pair with high accuracy. Instead of comparing two pre-computed embeddings, it passes both sentences through the transformer together and outputs a single score, so every query-sentence pair needs its own forward pass, which makes it relatively slow. To improve efficiency while keeping the cross-encoder's accuracy, we can first generate a shortlist of candidate sentences using a faster method, then re-rank only these candidates with the cross-encoder.

In this model, we combine two language models: a bi-encoder (Sentence Transformers) for the "retrieve" step and a cross-encoder for the "re-rank" step.
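
The two-stage idea can be sketched independently of any particular model. In the example below the scorers are simple token-overlap functions standing in for the bi-encoder and the cross-encoder; `retrieve_and_rerank`, `token_overlap`, `overlap_ratio`, and the toy sentences are all hypothetical and only illustrate the control flow.

```python
def retrieve_and_rerank(query, sentences, cheap_score, precise_score, shortlist_size, top_k):
    """Shortlist with a cheap scorer, then re-order the shortlist with a precise one."""
    shortlist = sorted(sentences, key=lambda s: cheap_score(query, s), reverse=True)[:shortlist_size]
    reranked = sorted(shortlist, key=lambda s: precise_score(query, s), reverse=True)
    return reranked[:top_k]

def token_overlap(query, sentence):
    """Cheap stand-in scorer: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(sentence.lower().split()))

def overlap_ratio(query, sentence):
    """Stand-in for the slower scorer: overlap normalised by sentence length."""
    tokens = sentence.lower().split()
    return token_overlap(query, sentence) / len(tokens) if tokens else 0.0

toy_sentences = [
    "the broncos won the game",
    "the panthers lost the game after a long and difficult season",
    "tickets were expensive",
]
result = retrieve_and_rerank("who won the game", toy_sentences, token_overlap, overlap_ratio, 2, 1)
print(result)  # ['the broncos won the game']
```

The expensive scorer only ever sees shortlist_size candidates per query, which is the source of the speed-up.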

## Downloading

Install the sentence-transformers library. Sentence-transformers library provides an easy way to use pre-trained models for transforming sentences into embeddings.

In [23]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.2.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.2.1-py3-none-any.whl (255 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.2.1


## Sentence Transformer (bi-encoder)

Import the SentenceTransformer class from the sentence_transformers module. This class is the core component of the library: it loads pre-trained models and generates embeddings for sentences.

In [24]:
from sentence_transformers import SentenceTransformer



Initialize a SentenceTransformer object using the pre-trained model "all-MiniLM-L6-v2", a compact and efficient model optimized for generating embeddings for sentences and phrases.


In [25]:
model = SentenceTransformer("all-MiniLM-L6-v2")



modules.json: 349 B
config_sentence_transformers.json: 116 B
README.md: 10.7 kB
sentence_bert_config.json: 53.0 B
config.json: 612 B
model.safetensors: 90.9 MB
tokenizer_config.json: 350 B
vocab.txt: 232 kB
tokenizer.json: 466 kB
special_tokens_map.json: 112 B
1_Pooling/config.json: 190 B

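Generate embeddings for a subset of the processed_corpus and then print the shape of the resulting embeddings array. Note that encode() is designed for raw strings, while processed_corpus holds token lists; the original sentences in corpus[:2] are the more natural input.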

In [26]:
embeddings = model.encode(processed_corpus[:2])
print(embeddings.shape)

(2, 384)


The output of print(embeddings.shape): (2, 384) indicates that:
* There are 2 sentences for which embeddings have been generated.
* Each sentence is represented as an embedding vector of size 384 (the typical output dimension for the "all-MiniLM-L6-v2" model).

The variable embeddings holds the resulting sentence embeddings generated by the SentenceTransformer model for the two sentences contained in processed_corpus[:2].

Embeddings are dense vector representations of text that capture the semantic meaning of the sentences. In the case of the SentenceTransformer, each sentence is transformed into a fixed-size vector of floating-point numbers, which encapsulates its meaning based on the context and relationships within the text.

The embeddings are usually represented as a NumPy array, where each row corresponds to an individual sentence's embedding.
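
Once sentences are represented as vectors, they are typically compared with cosine similarity. The following is a minimal pure-Python sketch of that computation on made-up 3-dimensional vectors (real embeddings from this model have 384 dimensions); `cosine_similarity` is a hypothetical helper, not a library call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors: the first two point in similar directions, the third does not.
query_vec = [0.9, 0.1, 0.0]
close_vec = [0.8, 0.2, 0.1]
far_vec = [0.0, 0.1, 0.9]

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

A value near 1 means the vectors point in nearly the same direction, so for retrieval the sentences with the highest cosine similarity to the query embedding would form the shortlist.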

In [27]:
embeddings

array([[-4.32155542e-02,  4.31079790e-02, -3.87896635e-02,
        -1.94261558e-02, -5.87005578e-02,  5.20988740e-02,
         2.59460360e-02,  3.38165119e-05,  6.90063089e-02,
        -2.26521934e-03, -1.18792199e-01, -2.59368476e-02,
         3.85141596e-02,  7.81184714e-03, -5.21092713e-02,
        -7.18346313e-02,  7.46674016e-02, -1.44300967e-01,
        -3.39231528e-02, -9.32606533e-02,  4.71671224e-02,
         5.13805598e-02, -5.81620745e-02, -1.70744415e-02,
         7.18719661e-02,  2.60662548e-02, -3.10902949e-02,
         6.73382431e-02,  5.88744832e-03, -5.59523366e-02,
         4.66081426e-02,  2.29337569e-02, -7.74824759e-03,
         1.30370595e-02, -8.27182680e-02, -4.76739928e-02,
        -6.61464483e-02, -4.13212441e-02,  4.72743846e-02,
         2.73620058e-02, -5.35079502e-02, -1.09207653e-01,
         3.43236960e-02,  4.25105244e-02,  7.36171529e-02,
         5.05690388e-02,  1.10812811e-02,  1.90469958e-02,
         3.94149423e-02,  2.43644938e-02,  3.19781974e-0

## Cross Encoder

Import CrossEncoder class from the sentence_transformers library.

CrossEncoder is a class that enables you to perform tasks like sentence similarity, ranking, and classification based on the interaction between two input sentences.

In [28]:
from sentence_transformers import CrossEncoder

Initializes a CrossEncoder object using a specific pre-trained model called "cross-encoder/ms-marco-MiniLM-L-6-v2."

Characteristics of the Model:
* MS MARCO Dataset: The model is fine-tuned on the Microsoft MAchine Reading COmprehension (MARCO) dataset, which is widely used for tasks involving question answering and information retrieval.
* MiniLM Architecture: The MiniLM architecture is designed to be lightweight and efficient while still maintaining high performance. This model strikes a balance between speed and accuracy, making it suitable for real-time applications.
* Paired Input: The model takes a (query, passage) pair as a single input. This lets it assess the relationship between the two texts directly, which suits relevance scoring and re-ranking of retrieved passages.

In [29]:
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

config.json: 794 B
pytorch_model.bin: 90.9 MB
tokenizer_config.json: 316 B
vocab.txt: 232 kB
special_tokens_map.json: 112 B

This cell demonstrates how to use the CrossEncoder model to predict relevance scores for pairs of sentences.

The output stored in scores is an array of floating-point numbers, one per input pair. For this model the scores are unbounded real numbers rather than probabilities, and they are most useful for comparing candidates for the same query:

* A high positive score indicates that the second sentence is relevant to the first (the answer closely relates to the question).
* A score close to 0 suggests a weak or uncertain relationship.
* A negative score indicates that the second sentence is not relevant, with lower values reflecting a poorer match.

In [30]:
scores = model.predict([
    ("How many people live in Berlin?", "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers."),
    ("How many people live in Berlin?", "Berlin is well known for its museums."),
])

print(scores)

[ 8.607139  -4.3200774]
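
If a bounded score is easier to read, one option is to squash the raw values with a logistic sigmoid. This is only a sketch: treating the result as a calibrated probability is an assumption, and the ranking itself does not change because the sigmoid is monotonic.

```python
import math

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

raw_scores = [8.607139, -4.3200774]  # the two values printed above
squashed = [sigmoid(s) for s in raw_scores]
print(squashed)
```

The relevant pair ends up close to 1 and the irrelevant pair close to 0, which can be convenient when applying a fixed threshold.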


The next cell ranks a list of passages by their relevance to a given query using the CrossEncoder model.
* *"ranks = model.rank(query, passages)"*: uses the rank method of the CrossEncoder model to score each passage against the query. The output, stored in ranks, is a list of dictionaries ordered from most to least relevant; each dictionary holds a passage's 'score' and its index in passages under 'corpus_id'.

In [31]:
# Rank a list of passages for a query
query = "How many people live in Berlin?"
passages = [
    "Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "Berlin is well known for its museums.",
    "In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.",
    "The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.",
    "The city of Paris had a population of 2,165,423 people within its administrative city limits as of January 1, 2019",
    "An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.",
    "Berlin is subdivided into 12 boroughs or districts (Bezirke).",
    "In 2015, the total labour force in Berlin was 1.85 million.",
    "In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
    "Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.",
]
ranks = model.rank(query, passages)

# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{passages[rank['corpus_id']]}")

Query: How many people live in Berlin?
8.92	The urban area of Berlin comprised about 4.1 million people in 2014, making it the seventh most populous urban area in the European Union.
8.61	Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
8.24	An estimated 300,000-420,000 Muslims reside in Berlin, making up about 8-11 percent of the population.
7.60	In 2014, the city state Berlin had 37,368 live births (+6.6%), a record number since 1991.
6.35	In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
5.42	Berlin has a yearly total of about 135 million day visitors, which puts it in third place among the most-visited city destinations in the European Union.
3.45	In 2015, the total labour force in Berlin was 1.85 million.
0.33	Berlin is subdivided into 12 boroughs or districts (Bezirke).
-4.24	The city of Paris had a population of 2,165,423 people within its administrative city limits as of Jan
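The structure of the ranking result can be reproduced with plain Python once the scores are known. The following is a minimal sketch, not the library's implementation: rank_from_scores is a hypothetical helper and the scores are made up for illustration.

```python
def rank_from_scores(scores):
    """Return [{'corpus_id': i, 'score': s}, ...] sorted by score, best first."""
    ranked = [{"corpus_id": i, "score": s} for i, s in enumerate(scores)]
    ranked.sort(key=lambda item: item["score"], reverse=True)
    return ranked

# Hypothetical scores for three passages (not produced by the model)
toy_scores = [0.33, 8.61, -4.24]
toy_ranks = rank_from_scores(toy_scores)

for item in toy_ranks:
    print(item["corpus_id"], item["score"])
```

Keeping corpus_id next to the score is what allows the original passage to be looked up after sorting, as the print loop in the cell above does.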

## Retrieve & Re-rank

Initialize two different types of models from the sentence_transformers library: a bi-encoder and a cross-encoder.
* Import the util module from the sentence_transformers library, which provides utility functions for working with sentence embeddings, and torch, the core PyTorch library.
* Check for GPU availability.
* Initialize a ***bi-encoder*** model by loading the pre-trained model "multi-qa-MiniLM-L6-cos-v1". This model is tuned for semantic search over question-answer pairs. A bi-encoder encodes each sentence independently, producing embeddings that can be compared with a similarity measure such as cosine similarity. This is efficient for large datasets because the corpus embeddings can be precomputed and stored. The max_seq_length attribute is set to 256, the maximum number of tokens the model processes per input; longer inputs are truncated.
* Initialize a ***cross-encoder*** model using the pre-trained model "cross-encoder/ms-marco-MiniLM-L-6-v2". A cross-encoder takes a pair of sentences as input and computes a relevance score from the full context of both. It typically gives more accurate relevance scores than a bi-encoder but is more expensive, because every (query, passage) pair must pass through the model and nothing can be precomputed.


In [32]:
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")

bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/11.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/383 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Use the bi-encoder model to generate an embedding for every passage in the corpus. This is done once, before any query is processed.

In [33]:
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor = True, show_progress_bar=True)

Batches:   0%|          | 0/8 [00:00<?, ?it/s]
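With the corpus embeddings stored, retrieval reduces to comparing one query vector against every corpus vector and keeping the best matches. The sketch below illustrates that idea in plain Python, assuming cosine similarity as the measure; the vectors are toy 2-dimensional values and the names cosine and top_k_by_cosine are hypothetical, not functions from the library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_by_cosine(query_vec, corpus_vecs, k):
    """Score every corpus vector against the query and keep the k best."""
    scored = [
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:k]

# Toy embeddings (real ones have hundreds of dimensions)
toy_corpus_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query_vec = [0.8, 0.6]

toy_hits = top_k_by_cosine(toy_query_vec, toy_corpus_vecs, k=2)
for hit in toy_hits:
    print(hit["corpus_id"], round(hit["score"], 2))
```

In the notebook the same comparison runs as a batched tensor operation, which is why the embeddings are created with convert_to_tensor = True.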

Define two functions that perform a semantic search for a given query and then re-rank the retrieved candidates to identify the most relevant passages. This approach combines the efficiency of bi-encoders for retrieval with the accuracy of cross-encoders for scoring, which makes it suitable for tasks such as question answering and information retrieval.
1. getBest() function: retrieves the top k strings (passages) based on their corresponding values (scores).
    * Parameters:
        * strings: A list of strings.
        * values: A list of numerical scores corresponding to each string.
        * top_k: The number of top strings to return based on their scores.
    * Functionality:
        * *"string_value_dict = dict(zip(strings, values))"*: creates a dictionary that pairs each string with its corresponding score.
        * *"sorted_strings = sorted(string_value_dict, key=string_value_dict.get, reverse=True)[:top_k]"*: sort the strings by their scores in descending order (reverse=True). The top k strings (those with the highest scores) are selected.
    * Return value: returns the top k sorted strings.
2. search() function: performs a semantic search for a given query and re-ranks the retrieved candidates based on relevance scores.
    * Parameters:
        * query: The search query as a string.
        * top_retrieve: The number of top candidates to retrieve from the initial search.
        * top_k: The number of top candidates to return after re-ranking.
    * Functionality:
        * Retrieve: The query is encoded into an embedding using the bi-encoder. The embedding is then transferred to the GPU (if available).
        * Semantic Search: The semantic_search function retrieves the top_retrieve passages whose corpus_embeddings are most similar to the query embedding. It returns one list of hits per query, so the first (and only) list is taken.
        * Re-rank:
            * Initialize Lists: candidates_string = [] and candidates_score = [] are initialized to store candidate strings and their corresponding scores.
            * Loop Through Candidates: For each retrieved candidate:
                * The corpus_id is used to retrieve the corresponding passage (sentence) from the corpus.
                * The cross-encoder is used to predict a relevance score for the query-sentence pair.
                * Both the sentence and its score are added to the respective lists.
    * Return the Best Candidates:
        * *"return getBest(candidates_string, candidates_score, top_k)"*: The getBest function is called to return the top k candidate sentences based on their scores.





    

In [34]:
def getBest (strings: list, values: list, top_k: int):
    # Pair each string with its score, then keep the top_k highest-scoring strings
    string_value_dict = dict(zip(strings, values))
    sorted_strings = sorted(string_value_dict, key=string_value_dict.get, reverse = True)[:top_k]
    return sorted_strings

def search (query: str, top_retrieve: int, top_k: int):
    ### Retrieve ###
    question_embedding = bi_encoder.encode(query, convert_to_tensor = True)
    # Use the same device as the corpus embeddings (GPU if available, otherwise CPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k = top_retrieve)
    hits = hits[0]  # Get the hits for the first (and only) query

    ### Re-rank ###
    candidates_string = []
    candidates_score = []
    for candidate in hits:
        corpus_id = candidate['corpus_id']  # named corpus_id to avoid shadowing the built-in id()
        sentence = corpus[corpus_id]
        # Score the (query, passage) pair with the cross-encoder
        score = cross_encoder.predict((query, sentence))

        candidates_string.append(sentence)
        candidates_score.append(score)

    return getBest(candidates_string, candidates_score, top_k)
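One property of getBest() is worth noting: because it builds a dictionary keyed by the passage text, two identical passages collapse into a single entry and only the last score is kept. If the corpus may contain duplicate sentences, sorting (string, score) pairs avoids this. The sketch below is a hypothetical alternative, not part of the notebook; get_best_pairs and the toy data are introduced only for illustration.

```python
def get_best_pairs(strings, values, top_k):
    """Return the top_k (string, score) pairs, highest score first."""
    pairs = list(zip(strings, values))
    # list.sort() is stable, so equal scores keep their retrieval order
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_k]

# Toy input in which the string "a" occurs twice with different scores
toy_strings = ["a", "b", "a", "c"]
toy_values = [0.2, 0.9, 0.7, 0.1]

toy_best = get_best_pairs(toy_strings, toy_values, top_k=3)
for text, score in toy_best:
    print(score, text)
```

Returning the scores along with the strings also makes it easier to inspect why a passage was ranked where it was.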

Demonstrate the search() function by iterating through the first 7 questions in the test set, retrieving the top 4 answers for each question from 50 retrieved candidates, and printing each question along with its answers.

In [35]:
for i in range(7):
    print(test_set['qa'][i]['question'])
    answers = search(query = test_set['qa'][i]['question'], top_retrieve = 50, top_k = 4)
    for answer in answers:
        print('-', answer)
    print('\n')

What foot was injured on Manning that sidelined him in week 10?
- In week 10, Manning suffered a partial tear of the plantar fasciitis in his left foot
- Osweiler was injured, however, leading to Manning's return during the Week 17 regular season finale, where the Broncos were losing 13–7 against the 4–11 San Diego Chargers, resulting in Manning re-claiming the starting quarterback position for the playoffs by leading the team to a key 27–20 win that enabled the team to clinch the number one overall AFC seed
- Under Kubiak, the Broncos planned to install a run-oriented offense with zone blocking to blend in with quarterback Peyton Manning's shotgun passing skills, but struggled with numerous changes and injuries to the offensive line, as well as Manning having his worst statistical season since his rookie year with the Indianapolis Colts in 1998, due to a plantar fasciitis injury in his heel that he had suffered since the summer, and the simple fact that Manning was getting old, as he 

## Generate answers

Define the genAnswer() function, which runs search() for every query and stores the returned passages in a dictionary keyed by query ID.

In [36]:
def genAnswer (queries: list, top_retrieve: int, top_k: int):
    # Map each query ID to the list of top_k passages returned by search()
    answer = {}
    for query in queries:
        question, query_id = query['question'], query['id']
        answer[query_id] = search(question, top_retrieve, top_k)
    return answer

## Evaluate

Use genAnswer() to generate answers for all queries, then use the evaluate() function to score the predictions against the ground-truth dataset.

In [37]:
predictions = genAnswer(queries = test_set['qa'], top_retrieve = 50, top_k = 5)

In [38]:
evaluate(predictions, ground_truth)

MRR: 0.6981103518315057
MAP: 0.69525482193773
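For readers who want to see how the two metrics are formed, the following is a minimal sketch of MRR and MAP on toy data. It is not the notebook's evaluate() function, whose exact definition may differ: the function names, the toy IDs, and the choice to normalize average precision by min(number of predictions, number of relevant items) are all assumptions made for illustration.

```python
def reciprocal_rank(predicted, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is found."""
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(predicted, relevant):
    """Mean of precision@rank taken at each rank where a relevant item occurs."""
    denom = min(len(predicted), len(relevant))  # assumed normalization
    if denom == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / denom

# Hypothetical predictions and ground truth for two queries
toy_predictions = {"q1": ["p3", "p1", "p7"], "q2": ["p2", "p9", "p4"]}
toy_relevant = {"q1": {"p1"}, "q2": {"p2", "p4"}}

toy_mrr = sum(
    reciprocal_rank(toy_predictions[q], toy_relevant[q]) for q in toy_predictions
) / len(toy_predictions)
toy_map = sum(
    average_precision(toy_predictions[q], toy_relevant[q]) for q in toy_predictions
) / len(toy_predictions)

print("MRR:", toy_mrr)
print("MAP:", toy_map)
```

MRR only looks at the first relevant passage per query, while MAP rewards every relevant passage that appears early, so the two can diverge when a query has several correct passages.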


## Summary
* MRR (Mean Reciprocal Rank): 0.698. A reciprocal rank of 1.0 means the first relevant passage is at rank 1 and 0.5 means rank 2, so a mean of 0.698 indicates that the first relevant passage usually appears in the first or second position.
* MAP (Mean Average Precision): 0.695. Relevant passages are concentrated near the top of the returned list across queries, not only the first one found. Both scores were measured with top_retrieve = 50 and top_k = 5.

* Model size:
    * Bi-Encoder (multi-qa-MiniLM-L6-cos-v1): Approximately 91 MB (the model.safetensors download above is 90.9 MB)
    * Cross-Encoder (cross-encoder/ms-marco-MiniLM-L-6-v2): Approximately 91 MB (it uses the same 6-layer MiniLM architecture)
    * Combined Size: Approximately 182 MB

Overall, the retrieve and re-rank pipeline reaches an MRR of 0.698 and a MAP of 0.695 on the evaluation set with roughly 182 MB of model weights.