## Table Of Contents

- Problem Statement
- Given an open-source model, Execute the application
- Discussion on the selection of the model and the algorithm
- Discussion on Evaluation
- Fine-tuning the model
- Closing Remarks


# Problem Statement



CheatApp

We are going to develop a cheating app for open-domain question answering systems in a notebook. For a given question, the app should suggest the Wikipedia page that contains the relevant answer. To stretch the challenge further, it should also suggest the best paragraphs on that page that contain the answer. Below are a few examples:

Question:  how are glacier caves formed ?
wikipedia page - Glacier cave - Wikipedia   
paragraph : ‘A glacier cave is a cave formed within the ice of a glacier. Glacier caves are often called ice caves, but the latter term is properly used to describe bedrock caves that contain year-round ice’ (summary of the page). 

Question - how much is 1 tablespoon of water ?
wikipedia page -https://en.wikipedia.org/wiki/Tablespoon  
paragraph - It has multiple answers. It could be, for example:
‘In most places, except Australia, one tablespoon equals three teaspoons—and one US tablespoon is 14.8 ml (0.50 US fl oz; 0.52 imp fl oz) or 15 ml (0.51 US fl oz; 0.53 imp fl oz).’ 
Or
 ‘In nutrition labeling in the U.S. and the U.K., a tablespoon is defined as 15 ml (0.51 US fl oz).[7] In Australia, the definition of the tablespoon is 20 ml (0.70 imp fl oz)’ etc.

Question - how did anne frank die 
wikipedia page - https://en.wikipedia.org/wiki/Anne_Frank 
Paragraph - ‘Following their arrest, the Franks were transported to concentration camps. On 1 November 1944,[2] Anne and her sister, Margot, were transferred from Auschwitz to Bergen-Belsen concentration camp, where they died (probably of typhus) a few months later. They were originally estimated by the Red Cross to have died in March, with Dutch authorities setting 31 March as the official date. Later research has suggested they died in February or early March.’

Expectation
Given this is an open problem, we don’t expect a particular level of correctness. What we are mainly looking for is how you approach the problem and quickly prototype crappy solutions, and then keep adding more complex logic in iterations to reach a satisfactory level. During that journey, we expect that you may generate the following artifacts:
- Hypotheses and motivations for choosing different modeling techniques.
- How you measured the model performance.
- Data curation, training/evaluation data generation, model performance measurements, etc.
- An end-to-end machine learning pipeline in a Python notebook, including the above steps.

Describing the constraints that kept you from trying the things you wanted to try to solve this problem is also an awesome addition.

** - If you use an already available model/code/library from the web, we expect that you fully understand the motivation for using it. Example: if you use an entity linking library, we expect that you understand the pros and cons of that model. This includes: Why do you think your chosen entity linking library is good for your problem? When do you expect your chosen model to behave poorly?

Resources

You are free to use open-source resources, including annotated training data already available on the web, as well as pre-trained models and libraries that exist in open source. What we mainly expect to see is how you approach the problem, and the journey.

You are not allowed to use LLM libraries like LangChain and LlamaIndex.

Wikipedia text data is available in Kaggle at - wikidata-text
Sample open questions and expected answers are also provided in wikipedia_question_similar_answer.tsv. The answers in this file are not exact Wikipedia paragraphs, but they may be very helpful for your modeling techniques.

Other open-source resources that can be used include https://paperswithcode.com/dataset/wikiqa (the questions in wikipedia_question_similar_answer.tsv are taken from this dataset).



Notes
Please create a loom video explaining all solutions/approaches. 

# Solution

## Overview of the Solution
This is an instance of the question-answering (QA) task in NLP. <br>
Inputs: a query and a set of Wikipedia URLs <br>
Output: the answer with citations (limited to 2, configurable, for brevity)

Algorithm:
Step 1: Find the most relevant URL. <br>
Multiple URLs may answer a question. For simplicity, I assume that only one URL contains the relevant answer, although nothing in the algorithm breaks if more than one URL contains it.
URLs are ranked by the cosine similarity between the embedding of the question and the embedding of the page text.

To compute these embeddings, we use the **paraphrase-MiniLM-L6-v2** model from the sentence-transformers library (hosted on the Hugging Face Hub).
We compute one embedding for the question and one for the entire text of each wiki URL.
Then we sort the URLs by the cosine similarity between the question embedding and the page embedding.
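
The ranking step can be sketched in isolation as follows. This is a minimal illustration, assuming the embeddings are already available as NumPy vectors; the helper name `rank_by_cosine` and the toy 3-dimensional vectors are hypothetical and only stand in for real model embeddings.

```python
import numpy as np

def rank_by_cosine(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted by cosine similarity to the query, best first."""
    q = np.asarray(query_vec, dtype=float)
    c = np.asarray(candidate_vecs, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]

# Toy example: three "page" embeddings and one "question" embedding
pages = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
question = [0.9, 0.1, 0.0]
ranking = rank_by_cosine(question, pages)
print(ranking[0][0])  # index of the best-matching page
```

Because only the direction of the vectors matters, a long page and a short question can still be compared directly.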

**Why we chose this model**
Wiki text can be a particularly long context, and the answer can be present anywhere in the text.
In my experiments, **paraphrase-MiniLM-L6-v2** produced useful embeddings for fairly long text, whereas other models I tried, such as **distilbert-base-uncased**, did not produce good embeddings for long contexts.
Note that sentence-transformers models truncate input beyond their maximum sequence length (`model.max_seq_length`), so the page embedding reflects only the beginning of the page text. The opening paragraphs of a Wikipedia page usually summarize the topic, which may be why this still works for picking the URL.
The quality of this model can also be seen in the tests I ran for this algorithm; see the output of the next code cell.

**pros**
1. Good quality
2. Worked well on long page text in my tests

**cons**
1. Higher latency
2. Not perfect. It needs to be tuned further.
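
One way to work around the input-length limit of an embedding model is to split a long page into chunks, embed each chunk, and average the chunk embeddings. The sketch below is not what this notebook does; it is an alternative under stated assumptions. `chunk_words`, `embed_long_text`, and the `encode` callable are hypothetical names. With sentence-transformers, `encode` could be `model.encode`, and the real limit is counted in tokens rather than words.

```python
import numpy as np

def chunk_words(text, chunk_size=100):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed_long_text(text, encode, chunk_size=100):
    """Embed each chunk with `encode` and mean-pool the chunk embeddings."""
    chunks = chunk_words(text, chunk_size)
    if not chunks:
        raise ValueError("text is empty")
    vectors = np.vstack([np.asarray(encode(chunk), dtype=float) for chunk in chunks])
    return vectors.mean(axis=0)

# Stand-in encoder for demonstration only: [word count, character count]
def toy_encode(chunk):
    return [len(chunk.split()), len(chunk)]

page_text = "glacier " * 250  # 250 words -> chunks of 100, 100 and 50 words
page_vector = embed_long_text(page_text, toy_encode)
print(len(chunk_words(page_text)), page_vector)
```

Mean pooling weights every chunk equally, so a short trailing chunk counts as much as a full one; weighting by chunk length would be a reasonable variation.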

Step 2: Find the most relevant paragraphs within the URL. <br>
We first split the text into separate paragraphs.
We compute an embedding for each paragraph and choose the top 2 most relevant paragraphs, again based on cosine similarity with the question.
For this step we use the same model. Ideally, we could choose a different model here, such as **distilbert-base-uncased**. As far as I can tell, it is not actually a lighter model than MiniLM-L6 (it has roughly three times as many parameters), so the gain would not be in model size.
The performance of **distilbert-base-uncased** is reasonably satisfactory for such short contexts, even though **paraphrase-MiniLM-L6-v2** performs slightly better.

Step 3: From the most relevant paragraphs, generate the answer. <br>
Now that we have found the paragraphs most likely to contain the answer, we need to extract the answer itself.
For this, we use the Hugging Face **text2text-generation** pipeline with **t5-base**. The model performs well on short contexts.
There are better models than **t5-base**, but this model is sufficient for the length of context we are supplying, and it has lower latency.

Step 3 is also important because it lets us determine whether an answer is present at all.
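
The decision of whether an answer is present can be sketched as a filter on the similarity scores. This is a simplified illustration: `select_candidates` is a hypothetical helper, and the default threshold of 0.7 mirrors the value used in this notebook, which still needs calibration.

```python
def select_candidates(scored_paragraphs, threshold=0.7, top_k=2):
    """Keep at most top_k paragraphs whose score exceeds the threshold.

    scored_paragraphs: list of (paragraph_text, score) in any order.
    An empty result means "no answer found".
    """
    ranked = sorted(scored_paragraphs, key=lambda item: item[1], reverse=True)
    return [(text, score) for text, score in ranked if score > threshold][:top_k]

scored = [("para A", 0.82), ("para B", 0.41), ("para C", 0.75), ("para D", 0.71)]
print(select_candidates(scored))        # paragraphs A and C
print(select_candidates(scored, 0.9))   # [] -> report that no answer was found
```

Only the paragraphs that survive this filter need to be passed to the answer-generation model, which keeps the expensive step small.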


In [48]:
# Solution

import requests
from bs4 import BeautifulSoup
import os
import json

import warnings
# Turn off all warnings
warnings.filterwarnings("ignore")

import torch

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

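from functools import lru_cache

@lru_cache(maxsize=None)  # Each page is requested several times; download it only once
def get_paragraphs_from_wikipedia(url):
    """Return (list of non-empty <p> tags, full page text) for a Wikipedia URL."""
    # Send a GET request to the Wikipedia page and fail loudly on HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all the non-empty paragraphs on the page
    paragraphs = [p for p in soup.find_all('p') if p.text.strip()]
    all_text = soup.get_text()
    return (paragraphs, all_text)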

from transformers import pipeline

def get_model():
    # Load a pre-trained model for sentence embeddings
    model_name = "paraphrase-MiniLM-L6-v2"
    model = SentenceTransformer(model_name)
    return model

def test_set():
    questions = [
                 "how are glacier caves formed", 
                 "how much is 1 tablespoon of water ?", 
                 "how did anne frank die", 
                 "how a water pump works", 
                 "how old was sue lyon when she made lolita",
                 "how are fire bricks made",
                 "what countries did immigrants come from during the immigration",
                 "how many smoots in a mile",
                 "how tall is an indoor girls volleyball net",
                 "how many calories in a cup of white rice"
                 ]
    
    urls = ["https://en.wikipedia.org/wiki/Glacier_cave",
            "https://en.wikipedia.org/wiki/Tablespoon",
            "https://en.wikipedia.org/wiki/Anne_Frank",
            "https://en.wikipedia.org/wiki/Water_pump",
            "https://en.wikipedia.org/wiki/Sue_Lyon",
            "https://en.wikipedia.org/wiki/Fire_brick",
            "https://en.wikipedia.org/wiki/Volleyball",
            "https://en.wikipedia.org/wiki/Rice",
            "https://en.wikipedia.org/wiki/History_of_immigration_to_the_United_States",
            "https://en.wikipedia.org/wiki/Smoot"
            ]
    
    return questions, urls

def get_relevant_paragraphs(model, question, url):
    """Return all paragraphs of the page as (paragraph, score) pairs, best first."""
    paragraphs, _ = get_paragraphs_from_wikipedia(url)
    if not paragraphs:
        return []

    # Encode the question and all paragraphs (one batched call for the paragraphs)
    question_embedding = model.encode(question, convert_to_tensor=False)
    paragraph_embeddings = model.encode([p.text for p in paragraphs], convert_to_tensor=False)

    # Calculate cosine similarity scores using scikit-learn
    similarity_scores = cosine_similarity([question_embedding], paragraph_embeddings)

    # Pair each paragraph with its score and sort by score in descending order
    relevant_paragraphs = list(zip(paragraphs, similarity_scores[0]))
    relevant_paragraphs.sort(key=lambda x: x[1], reverse=True)
    return relevant_paragraphs
    
# Load the answer-generation pipeline once instead of once per question
text2text_generator = pipeline("text2text-generation", model="t5-base")

# Define a function that generates answers from the question and the ranked paragraphs
def generate_answer(question, relevant_paragraphs, url, threshold, max_responses=2):
    responses = []

    for paragraph, score in relevant_paragraphs:
        if len(responses) >= max_responses:  # Return up to max_responses responses
            break
        if score > threshold:  # TODO: Needs calibration
            answer = text2text_generator(f"question: {question}? context: {paragraph.text}")
            response = (paragraph.text, answer[0]['generated_text'], url)
            responses.append(response)

    return responses

def print_answer(responses):
    if len(responses) == 0:
        print("Sorry, I could not find an answer to your question.")
        return
    if len(responses) > 1:
        print(f"There are {len(responses)} answers to your question.")

    for response in responses:
        # response[2] is the (url, score) tuple returned by get_relevant_url
        print(f"Source Wiki Page: {response[2][0]}")
        print(f"Answer: {response[1]}")
        print(f"Paragraph: {response[0]}")

# Rank the candidate URLs by similarity to the question and return the best one
def get_relevant_url(model, question, urls):
    question_embedding = model.encode(question, convert_to_tensor=False)

    # Encode the full text of each page
    # (the model truncates input beyond model.max_seq_length)
    text_embeddings = []
    for url in urls:
        _, text = get_paragraphs_from_wikipedia(url)
        text_embeddings.append(model.encode(text, convert_to_tensor=False))

    # Calculate cosine similarity scores using scikit-learn
    similarity_scores = cosine_similarity([question_embedding], text_embeddings)

    # Pair each URL with its score and sort by score in descending order
    relevant_urls = list(zip(urls, similarity_scores[0]))
    relevant_urls.sort(key=lambda x: x[1], reverse=True)
    return relevant_urls[0]  # (url, score) of the best match

threshold = 0.7
questions, urls = test_set()
model = get_model()
for question in questions:
    print(f"Question: {question}")
    most_relevant_url = get_relevant_url(model, question, urls)
    relevant_paragraphs = get_relevant_paragraphs(model, question, most_relevant_url[0])
    answers = generate_answer(question, relevant_paragraphs, most_relevant_url, threshold)
    print_answer(answers)


Here is the output we get for our test set:

```
Question: how are glacier caves formed
There are 2 answers to your question.
Source Wiki Page: https://en.wikipedia.org/wiki/Glacier_cave
Answer: by water running through or under the glacier
Paragraph: Most glacier caves are started by water running through or under the glacier. This water often originates on the glacier's surface through melting, entering the ice at a moulin and exiting at the glacier's snout at base level. Heat transfer from the water can cause sufficient melting to create an air-filled cavity, sometimes aided by solifluction. Air movement can then assist enlargement through melting in summer and sublimation in winter.

Source Wiki Page: https://en.wikipedia.org/wiki/Glacier_cave
Answer: geothermal heat from volcanic vents or hotsprings beneath the ice
Paragraph: Some glacier caves are formed by geothermal heat from volcanic vents or hotsprings beneath the ice.  An extreme example is the Kverkfjöll glacier cave in the Vatnajökull glacier in Iceland, measured in the 1980s at 2.8 kilometres (1.7 mi) long with a vertical range of 525 metres (1,722 ft).

Question: how much is 1 tablespoon of water ?
There are 2 answers to your question.
Source Wiki Page: https://en.wikipedia.org/wiki/Tablespoon
Answer: three teaspoons
Paragraph: In most places, except Australia, one tablespoon equals three teaspoons—and one US tablespoon is 14.8 ml (0.50 US fl oz; 0.52 imp fl oz) or 15 ml (0.51 US fl oz; 0.53 imp fl oz).

Source Wiki Page: https://en.wikipedia.org/wiki/Tablespoon
Answer: 15 ml (0.51 US fl oz)
Paragraph: In nutrition labeling in the U.S. and the U.K., a tablespoon is defined as 15 ml (0.51 US fl oz).[7] In Australia, the definition of the tablespoon is 20 ml (0.70 imp fl oz).[citation needed]

Question: how did anne frank die
Source Wiki Page: https://en.wikipedia.org/wiki/Anne_Frank
Answer: a typhus epidemic
Paragraph: Anne Frank died at the Bergen-Belsen concentration camp in February or March 1945. The specific cause is unknown; however, there is evidence to suggest that she died from a typhus epidemic that spread through the camp, killing 17,000 prisoners.[98] Gena Turgel, a survivor of Bergen-Belsen, knew Anne at the camp. In 2015, she told the British newspaper The Sun: "Her bed was around the corner from me. She was delirious, terrible, burning up." She said she had brought Frank water to wash.[99] Turgel, who worked in the camp hospital, said that the epidemic took a terrible toll on the inmates: "The people were dying like flies—in the hundreds. Reports used to come in—500 people who died. Three hundred? We said, 'Thank God, only 300.'"[99] Other diseases, including typhoid fever, were rampant.[100]

Question: how a water pump works
There are 2 answers to your question.
Source Wiki Page: https://en.wikipedia.org/wiki/Water_pump
Answer: irrigation, water supply, gasoline supply, air conditioning systems, refrigeration (usually called a compressor
Paragraph: Pumps are used throughout society for a variety of purposes.  Early applications includes the use of  the windmill or watermill to pump water.  Today, the pump is used for irrigation, water supply, gasoline supply, air conditioning systems, refrigeration (usually called a compressor), chemical movement, sewage movement, flood control, marine services, etc.

Source Wiki Page: https://en.wikipedia.org/wiki/Water_pump
Answer: Mechanical pumps may be submerged in the fluid they are pumping or be placed external to the
Paragraph: Mechanical pumps may be submerged in the fluid they are pumping or be placed external to the fluid.

Question: how old was sue lyon when she made lolita
Source Wiki Page: https://en.wikipedia.org/wiki/Sue_Lyon
Answer: 12
Paragraph: Although Vladimir Nabokov originally thought that Sue Lyon was the right selection to play Lolita, years later Nabokov said that the ideal Lolita would have been Catherine Demongeot, a young French actress who had played the child Zazie in Louis Malle's Zazie in the Metro (1960). The tomboyish Demongeot was four years younger than Lyon.[12]

Question: how are fire bricks made
Source Wiki Page: https://en.wikipedia.org/wiki/Fire_brick
Answer: ceramic material
Paragraph: A fire brick, firebrick, fireclay brick, or refractory brick is a block of ceramic material used in lining furnaces, kilns, fireboxes, and fireplaces.  A refractory brick is built primarily to withstand high temperature, but will also usually have a low thermal conductivity for greater energy efficiency. Usually dense fire bricks are used in applications with extreme mechanical, chemical, or thermal stresses, such as the inside of a wood-fired kiln or a furnace, which is subject to abrasion from wood, fluxing from ash or slag, and high temperatures. In other, less harsh situations, such as in an electric or natural gas fired kiln, more porous bricks, commonly known as "kiln bricks", are a better choice.[1] They are weaker, but they are much lighter and easier to form and insulate far better than dense bricks. In any case, firebricks should not spall, and their strength should hold up well during rapid temperature changes.

Question: what countries did immigrants come from during the immigration
Source Wiki Page: https://en.wikipedia.org/wiki/History_of_immigration_to_the_United_States
Answer: Europe and later on from Asia and Latin America
Paragraph: The history of immigration to the United States details the movement of people to the United States from the colonial era to the present. Throughout U.S. history, the country experienced successive waves of immigration, particularly from Europe and later on from Asia and Latin America. Colonial-era immigrants often repaid the cost of transoceanic transportation by becoming indentured servants in which the new employer paid the ship's captain. In the late 19th century, immigration became restricted from China and Japan. In the 1920s, restrictive immigration quotas were imposed although political refugees had special status. Numerical restrictions ended in 1965. In recent years, the largest numbers have come from Asia and Central America.

Question: how many smoots in a mile
Sorry, I could not find an answer to your question.


Question: how tall is an indoor girls volleyball net
There are 2 answers to your question.
Source Wiki Page: https://en.wikipedia.org/wiki/Volleyball
Answer: 2.24 m (7 ft 4+316 in)
Paragraph: A volleyball court is 9 m × 18 m (29.5 ft × 59.1 ft), divided into equal square halves by a net with a width of one meter (39.4 in).[20] The top of the net is 2.43 m (7 ft 11+11⁄16 in) above the center of the court for men's competition, and 2.24 m (7 ft 4+3⁄16 in) for women's competition, varied for veterans and junior competitions.[3]

Source Wiki Page: https://en.wikipedia.org/wiki/Volleyball
Answer: 8 m (26.2 ft)
Paragraph: The minimum height clearance for indoor volleyball courts is 7 m (23.0 ft), although a clearance of 8 m (26.2 ft) is recommended.[20]

Question: how many calories in a cup of white rice
Source Wiki Page: https://en.wikipedia.org/wiki/Rice
Answer: 130
Paragraph: Cooked white rice is 69% water, 29% carbohydrates, 2% protein, and contains negligible fat (table). In a reference serving of 100 grams (3.5 oz), cooked white rice provides 130 calories of food energy, and contains moderate levels of manganese (18% DV), with no other micronutrients in significant content (all less than 10% of the Daily Value).[37]
In 2018, the World Health Organization strongly recommended fortifying rice with iron, and conditionally recommended fortifying it with vitamin A and with folic acid.[38]
```


**Explanation of the solution and results**

Let us first do a visual evaluation; the formal evaluation is discussed in the next section.

**Good**<br>
On visual inspection, for most of the questions the models have picked the right URL, the right paragraphs, and a reasonable answer.

**Shortcomings**<br>
1. For the question **how a water pump works**, the answer is not correct: the selected paragraphs describe what pumps are used for and where they are placed, not how they work.
2. For the question **how old was sue lyon when she made lolita**, the answer is not correct: the returned `12` appears to come from the citation marker `[12]` in the paragraph rather than from her age.
3. For the question **how are fire bricks made**, the answer is partial: it names the material the bricks are made from but does not really answer the question.
4. The answers are not well expressed as sentences, and in some cases they are cut off mid-phrase (see the water pump answers). I planned to improve the text generation of the answers, but due to shortage of time I am leaving it to next steps.
5. For the question **how many smoots in a mile**, no answer could be determined. The answer is present on the wiki page, but not in a direct form. Since this is a language model, it cannot **calculate** or do conversions. The model should therefore be combined with math tools that extract the relevant quantities and compute the answer, instead of only extracting the answer from the text.
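
As an illustration of the "math tools" idea, the conversion itself is a one-line computation once the quantities have been extracted. The sketch assumes that an extraction step has recovered the length of one smoot as 5 ft 7 in; the function name `units_per_mile` is hypothetical.

```python
INCHES_PER_FOOT = 12
FEET_PER_MILE = 5280

def units_per_mile(unit_feet, unit_inches):
    """Number of units of the given length (feet + inches) that fit in one mile."""
    unit_length_in = unit_feet * INCHES_PER_FOOT + unit_inches
    return FEET_PER_MILE * INCHES_PER_FOOT / unit_length_in

smoots_in_a_mile = units_per_mile(5, 7)  # one smoot = 5 ft 7 in = 67 in
print(f"{smoots_in_a_mile:.2f}")         # 945.67
```

The hard part in practice is the extraction step (finding the unit length in the page text), not the arithmetic.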


# Evaluation

## Metrics

I would have two kinds of metrics for this problem.
1. End-to-end metrics
2. Metrics for each step

**Metrics for each step**:
There are 3 steps in this algorithm:
1. Relevance of the URL to the question
2. Relevance of the paragraph to the question
3. Answer generation

**Relevance metric**

For both 'Relevance' steps, it is important that the model picks the top entities among all that are provided. In addition to the **ordering** of the entities (URLs or paragraphs), it is also important to keep only the relevant ones and prune away the noise.

The metric I would recommend for evaluating the ranking order is NDCG. Normalized Discounted Cumulative Gain (NDCG) is an information retrieval metric used to evaluate the quality of ranked search results. It considers both the relevance of the retrieved items and their positions in the ranking. NDCG ranges from 0 to 1, with higher values indicating better-ranked results, and it discounts relevant items more heavily the lower they appear in the ranking. In particular, I would use the NDCG@1 and NDCG@3 metrics.
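
With binary relevance $rel_i \in \{0, 1\}$ for the item at rank $i$ (counting from 1), the definition used by the metric code in this notebook can be written as:

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

Here $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of the ideal ordering, in which all relevant items come first. As a worked example, suppose there is a single relevant URL and the model ranks it second: then $\mathrm{DCG@}3 = 1/\log_2 3 \approx 0.631$ and $\mathrm{IDCG@}3 = 1/\log_2 2 = 1$, so $\mathrm{NDCG@}3 \approx 0.631$, while $\mathrm{NDCG@}1 = 0$.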

The metrics I would choose for evaluating whether the correct paragraphs were selected are precision and recall.

**Answer Generation metric**
For answer generation, I would use two scores. The BLEU score (Bilingual Evaluation Understudy) measures the n-gram precision of the generated answer against one or more reference answers. The ROUGE score (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap between the generated answer and the reference answers in terms of n-grams (unigrams, bigrams, etc.) and longest common subsequences.
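
To make the overlap idea concrete, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is a hand-rolled sketch (whitespace tokenization, no stemming), not the `rouge_score` implementation, and `unigram_f1` is a hypothetical helper name.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """F1 of the unigram overlap between a candidate answer and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = unigram_f1("three teaspoons", "one tablespoon equals three teaspoons")
print(round(score, 3))  # 0.571
```

Here both candidate tokens appear in the reference (precision 1.0) but only two of the five reference tokens are covered (recall 0.4), which shows how a short extractive answer is penalized on recall.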



In [49]:
# Code for metric computation

import numpy as np
import nltk

#region NDCG
# Function to compute NDCG
def ndcg_at_k(ranked_list, k, ground_truth):
    # Ensure k is within the bounds of the list length
    k = min(k, len(ranked_list))
    
    # Compute DCG (Discounted Cumulative Gain) at k
    dcg = 0.0
    for i in range(k):
        rel_i = 1 if ranked_list[i] in ground_truth else 0  # Binary relevance
        dcg += (2 ** rel_i - 1) / np.log2(i + 2)  # +2 because of 0-based indexing
    
    # Compute IDCG (Ideal DCG) at k: all relevant items ranked first.
    # With binary relevance, at most min(k, len(ground_truth)) positions contribute,
    # which also avoids indexing past the end of ground_truth.
    idcg = 0.0
    for i in range(min(k, len(ground_truth))):
        idcg += 1.0 / np.log2(i + 2)
    
    # Compute NDCG
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return ndcg

# Compute NDCG at different values of k (e.g., k=1, k=3, k=5)
def compute_ndcg(ranked_list, ground_truth):
    ndcg_values = []
    for k in [1, 3, 5]:
        ndcg_values.append(ndcg_at_k(ranked_list, k, ground_truth))
    return ndcg_values

#endregion NDCG

#region P/R/F1
def precision_recall_f1(relevant_docs, retrieved_docs):
    # Compute Precision, Recall, and F1 over sets of documents,
    # returning 0.0 instead of dividing by zero for empty inputs
    relevant, retrieved = set(relevant_docs), set(retrieved_docs)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
#endregion P/R/F1

#region BLEU
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import statistics
from rouge_score import rouge_scorer

def compute_bleu_scores(candidate_reference_pairs):
    bleu_scores = []
    # Smoothing avoids zero scores when short answers have no higher-order n-gram matches
    smoothing = SmoothingFunction().method1

    for candidate, reference in candidate_reference_pairs:
        candidate_tokens = candidate.split()
        reference_tokens = reference.split()

        # Compute BLEU score for each pair
        bleu_score = sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=smoothing)
        bleu_scores.append(bleu_score)

    return bleu_scores

def compute_rouge_scores(candidate_reference_pairs):
    rouge_scores = []
    # Create the ROUGE scorer once and reuse it for every pair
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

    for candidate, reference in candidate_reference_pairs:
        # Compute ROUGE scores for each pair
        scores = scorer.score(reference, candidate)

        # Extract and append ROUGE-1 and ROUGE-L F1 scores
        rouge1_f1 = scores["rouge1"].fmeasure
        rougeL_f1 = scores["rougeL"].fmeasure
        rouge_scores.append((rouge1_f1, rougeL_f1))

    return rouge_scores

# Define a function to compute BLEU and ROUGE scores
def compute_scores(candidate_reference_pairs):
    # Compute BLEU scores
    bleu_scores = compute_bleu_scores(candidate_reference_pairs)

    # Compute ROUGE scores
    rouge_scores = compute_rouge_scores(candidate_reference_pairs)

    # Calculate the mean BLEU score
    mean_bleu_score = statistics.mean(bleu_scores)

    # Calculate the mean ROUGE-1 and ROUGE-L F1 scores
    rouge1_f1_scores, rougeL_f1_scores = zip(*rouge_scores)
    mean_rouge1_f1 = statistics.mean(rouge1_f1_scores)
    mean_rougeL_f1 = statistics.mean(rougeL_f1_scores)

    # Print the mean scores
    print(f'Mean BLEU Score: {mean_bleu_score:.2f}')
    print(f'Mean ROUGE-1 F1 Score: {mean_rouge1_f1:.2f}')
    print(f'Mean ROUGE-L F1 Score: {mean_rougeL_f1:.2f}')

#endregion BLEU




# Fine-Tuning

As we observed in our run with the out-of-the-box models, they have some shortcomings for our scenario.
There are primarily two types of models involved: relevance models and answer extraction models.

**Relevance models** <br>
Our current model, **paraphrase-MiniLM-L6-v2**, has performed quite well and is a good candidate to fine-tune.
We will use the SQuAD or the WikiQA dataset for this purpose.
Since ranking paragraphs and ranking entire pages are the same task, differing only in the length of the context, we can aim to fine-tune a single model instead of two.
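
A possible way to build training pairs for the relevance model is sketched below. It assumes WikiQA-style rows with `question`, `answer` (a candidate sentence), and a binary `label`; these field names and the helper `make_training_pairs` are assumptions for illustration. Each resulting triple could then be wrapped as `InputExample(texts=[question, passage], label=label)` for sentence-transformers.

```python
def make_training_pairs(rows):
    """Convert WikiQA-style rows into (question, passage, label) triples.

    label is 1.0 when the passage answers the question and 0.0 otherwise,
    a range commonly used as the target of a cosine-similarity loss.
    """
    pairs = []
    for row in rows:
        question = row["question"].strip()
        passage = row["answer"].strip()
        if question and passage:  # skip rows with missing text
            pairs.append((question, passage, float(row["label"])))
    return pairs

# Toy rows for illustration (not taken from the real dataset)
rows = [
    {"question": "how are glacier caves formed",
     "answer": "A glacier cave is a cave formed within the ice of a glacier.",
     "label": 1},
    {"question": "how are glacier caves formed",
     "answer": "Ice caves are bedrock caves that contain year-round ice.",
     "label": 0},
    {"question": "how are glacier caves formed", "answer": " ", "label": 0},
]
pairs = make_training_pairs(rows)
print(len(pairs), pairs[0][2])  # 2 1.0
```

Keeping the negative pairs (label 0.0) matters here: without them the model would only learn to pull questions and passages together, not to push irrelevant passages away.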

**Answer Extraction models** <br>
Our current model, **t5-base**, performs reasonably well and is a good candidate to fine-tune.
We will use the SQuAD or the WikiQA dataset for this purpose.

In [2]:
## Code for Fine-tuning paraphrase-MiniLM-L6-v2

from sentence_transformers import SentenceTransformer

#region working-version Fine-tuning 
# (using SQUAD dataset and DistilBERT)
# This flag is the difference between SQUAD v1 or 2 (if you're using another dataset, it indicates if impossible
# answers are allowed or not).
squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16



#endregion working-version Fine-tuning


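from sentence_transformers import InputExample
def generate_training_examples():
    # TODO: build a list of InputExample(texts=[question, passage], label=score)
    # objects from SQuAD or WikiQA, with a float label per pair.
    raise NotImplementedError("Training examples have not been generated yet")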

import os
from torch.utils.data import DataLoader
from sentence_transformers import losses

def fine_tune_relevance(output_path):
    model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')
    train_examples = generate_training_examples()
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1,
          warmup_steps=100)
    # Save under the directory given by the output_path argument
    model.save(os.path.join(output_path, 'fine-tuned-MiniLM-model'))
    # model = SentenceTransformer(os.path.join(output_path, 'fine-tuned-MiniLM-model'))


from transformers import T5ForConditionalGeneration, T5Tokenizer
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

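def tokenize_function(tokenizer, examples):
    # Assumes each example has 'question', 'context' and a plain-string 'answers' field
    inputs = [f"question: {q} context: {c}" for q, c in zip(examples['question'], examples['context'])]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding='max_length')

    # Tokenize the targets (text_target replaces the deprecated as_target_tokenizer context)
    labels = tokenizer(text_target=examples['answers'], max_length=128, truncation=True, padding='max_length')

    # Replace padding token ids with -100 so they are ignored by the loss
    model_inputs["labels"] = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs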


import os

def fine_tune_answer_generation(output_path):
    model_name = "t5-base"
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    tokenizer = T5Tokenizer.from_pretrained(model_name)

    # Load all splits so that "train" and "test" can be selected below.
    # Assumes the local dataset provides 'question', 'context' and 'answers' columns.
    dataset = load_dataset('.\\WikiQA')
    # tokenize_function needs the tokenizer as well, so bind it with a lambda
    tokenized_datasets = dataset.map(lambda examples: tokenize_function(tokenizer, examples), batched=True)

    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        warmup_steps=500,
        weight_decay=0.01,
        logging_dir="./logs",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
    )

    trainer.train()
    # Save under the directory given by the output_path argument
    save_dir = os.path.join(output_path, 'fine-tuned-T5-model')
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)

    # Use this in the script at the top - Problem: Fine-tuning is taking way too long to complete
    # model = T5ForConditionalGeneration.from_pretrained(save_dir)
    # tokenizer = T5Tokenizer.from_pretrained(save_dir)

## Discussion on Fine-Tuning

# Summary & Closing Remarks