## Table Of Contents

- Problem Statement
- Given an open-source model, Execute the application
- Discussion on the selection of the model and the algorithm
- Discussion on Evaluation
- Fine-tuning the model
- Closing Remarks


# Problem Statement



CheatApp

We are going to develop a cheating app for open-domain question answering through a notebook. For a given question, the app should suggest the Wikipedia page that contains the relevant answer. To stretch the challenge further, it should also suggest the best paragraphs on that page that contain the answer. Below are a few examples:

Question:  how are glacier caves formed ?
wikipedia page - Glacier cave - Wikipedia   
paragraph : ‘A glacier cave is a cave formed within the ice of a glacier. Glacier caves are often called ice caves, but the latter term is properly used to describe bedrock caves that contain year-round ice’ (summary of the page). 

Question - how much is 1 tablespoon of water ?
wikipedia page -https://en.wikipedia.org/wiki/Tablespoon  
paragraph - There are multiple possible answers. It could be:
‘In most places, except Australia, one tablespoon equals three teaspoons—and one US tablespoon is 14.8 ml (0.50 US fl oz; 0.52 imp fl oz) or 15 ml (0.51 US fl oz; 0.53 imp fl oz).’ 
Or
 ‘In nutrition labeling in the U.S. and the U.K., a tablespoon is defined as 15 ml (0.51 US fl oz).[7] In Australia, the definition of the tablespoon is 20 ml (0.70 imp fl oz)’ etc.

Question - how did anne frank die 
wikipedia page - https://en.wikipedia.org/wiki/Anne_Frank 
Paragraph - ‘Following their arrest, the Franks were transported to concentration camps. On 1 November 1944,[2] Anne and her sister, Margot, were transferred from Auschwitz to Bergen-Belsen concentration camp, where they died (probably of typhus) a few months later. They were originally estimated by the Red Cross to have died in March, with Dutch authorities setting 31 March as the official date. Later research has suggested they died in February or early March.’

Expectation
Given this is an open problem, we don’t expect a particular level of correctness. What we are mainly looking for is how you approach the problem and quickly prototype rough solutions, then keep adding more complex logic in iterations until you reach a satisfactory level. Along the way, we expect that you may generate the following artifacts:
- Hypotheses and motivations for choosing different modeling techniques.
- How you measured the model performance.
- Data curation, training/evaluation data generation, model performance measurements, etc.
- An end-to-end machine learning pipeline in a Python notebook, including the above steps.

Describing the constraints that kept you from trying the things you wanted to try is also an awesome way to approach this problem.

Note: If you use an already available model/code/library from the web, we expect you to fully understand the motivation for using it. For example, if you use an entity linking library, we expect you to understand the pros and cons of that model. This includes: Why do you think your chosen entity linking library is good for your problem? When do you expect your chosen model to behave poorly?

Resources

You are free to use open-source resources, including annotated training data already available on the web. You are also free to use pre-trained models and libraries available as open source. What we mainly expect to see is how you approach the problem and the journey you take.

You are not allowed to use LLM libraries like LangChain and LlamaIndex.

Wikipedia text data is available in Kaggle at - wikidata-text
Sample open questions and expected answers are also provided in wikipedia_question_similar_answer.tsv. The answers there are not exact Wikipedia paragraphs, but they may be very helpful for your modeling techniques.

Another open-source resource that can be used is https://paperswithcode.com/dataset/wikiqa (the questions in wikipedia_question_similar_answer.tsv are taken from this data set).



Notes
Please create a Loom video explaining all solutions/approaches.

# Solution

## Overview of the Solution
This is a Question-Answering problem in the NLP domain.
Inputs: a query and a set of Wikipedia URLs.
Output: an answer with citations (limited to 2, configurable, for brevity).

Preparation:
1. For each URL, we fetch the page text and compute one embedding for the whole page.
2. For each paragraph on the page, we also extract its text and compute a paragraph embedding.

For the computation of these embeddings, we will use Hugging Face's **distilbert-base-uncased** model. Note that the code cell below currently loads the **paraphrase-MiniLM-L6-v2** sentence-embedding model in get_model().

Querying:
1. We compute the embedding of the Question.

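Searching for Answer:
1. We select the URL whose page embedding has the highest cosine similarity to the question embedding.
2. We rank the paragraphs of that page by cosine similarity to the question embedding and keep those that score above a threshold.
3. For each retained paragraph (up to 2), we generate a short answer from the question and the paragraph text.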
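
The preparation, querying, and search steps above can be sketched without any model, using small hand-written vectors in place of real embeddings. This is a minimal sketch, not the notebook's implementation: the names `cosine`, `rank`, and `search`, the toy 3-dimensional vectors, and the paragraph ids `p0` to `p2` are illustrative assumptions. The 0.7 threshold and the limit of 2 answers mirror the values used in the notebook code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query_vec, candidates):
    """Return (name, score) pairs sorted by descending similarity to query_vec."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def search(query_vec, page_vecs, paragraph_vecs, threshold=0.7, max_answers=2):
    """Stage 1: pick the best page. Stage 2: keep its best paragraphs above threshold."""
    best_page, _ = rank(query_vec, page_vecs)[0]
    ranked = rank(query_vec, paragraph_vecs[best_page])
    kept = [(name, score) for name, score in ranked if score > threshold]
    return best_page, kept[:max_answers]

# Toy "embeddings" (assumed values); real ones would come from the encoder.
page_vecs = {"Glacier_cave": [0.9, 0.1, 0.0], "Tablespoon": [0.0, 0.2, 0.9]}
paragraph_vecs = {
    "Glacier_cave": {"p0": [1.0, 0.0, 0.0], "p1": [0.7, 0.7, 0.0], "p2": [0.0, 0.0, 1.0]},
    "Tablespoon": {"p0": [0.0, 0.0, 1.0]},
}
query_vec = [1.0, 0.1, 0.0]

page, answers = search(query_vec, page_vecs, paragraph_vecs)
print(page, [name for name, _ in answers])  # Glacier_cave ['p0', 'p1']
```

Replacing the toy vectors with encoder outputs should give the same control flow: one similarity pass over pages, then one over the paragraphs of the winning page.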

Reasons for using **distilbert-base-uncased**
Note: I do not claim that this is the best model.
DistilBERT is a nearly 3-year-old model and, until about 2 years ago, was known as one of the best-performing models for Question Answering tasks. DistilBERT is a distilled (smaller and faster) version of the BERT model.


**Pros**:
1. It is small, so it can run on a PC.
2. It is fast to iterate with.
3. It is pretrained on raw text without human labels, so it carries no human-labeling bias. This makes it a good base model for fine-tuning.
4. It is well suited to fine-tuning on tasks that use the entire sentence, such as Question-Answering.
5. It is not very 'generative', so it is better suited to tasks where sticking to the facts in the source text matters.

**Cons**
Being a small model, its performance is far from perfect, but it is sufficient for most 'simple' Question-Answering tasks (e.g. retrieving the sentence that contains the answer).

Summary: We chose this model because it is well suited to demonstrate the effect of fine-tuning. 
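
A BERT-style encoder such as **distilbert-base-uncased** returns one vector per token, so a pooling step is needed to get a single embedding for a paragraph or a question. Mean pooling over the non-padding tokens is one common choice, and the sketch below shows it on toy numbers. The function name `mean_pool` and the example values are assumptions for illustration and are not taken from the notebook.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]

# Three token vectors; the last one is padding and must not affect the result.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_vectors, attention_mask)
print(pooled)  # [2.0, 3.0]
```

Sentence-embedding libraries typically bundle a pooling step like this with the encoder, which is why a single `encode` call can return one vector per text.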

In [45]:
# Solution

import warnings

import requests
import torch
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Turn off all warnings
warnings.filterwarnings("ignore")

# Set a similarity score threshold -- based on test data
threshold = 0.7

def get_paragraphs_from_wikipedia(url):
    # Send a GET request to the Wikipedia page; fail loudly on HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all the paragraphs (<p> tags) on the page
    paragraphs = soup.find_all('p')
    # Full page text, used for the page-level embedding
    all_text = soup.get_text()
    return (paragraphs, all_text)

from transformers import pipeline, DistilBertTokenizer, DistilBertForQuestionAnswering
def get_answer(model, tokenizer, question, context):
    # Create a Question Answering pipeline
    qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

    # Perform question answering
    result = qa_pipeline(question=question, context=context)

    # Extract the answer
    answer = result["answer"]
    return answer

def get_similarity_score(model, tokenizer, question, context):
    # Encode the question and the context separately. Encoding them as one
    # pair and reading positions 0 and 1 would compare the [CLS] token with
    # the first context token, not the question with the context.
    def embed(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean-pool the token embeddings into a single vector
        return outputs.last_hidden_state.mean(dim=1)

    # Compute the cosine similarity score between context and question embeddings
    similarity_score = torch.cosine_similarity(embed(context), embed(question)).item()
    return similarity_score

# Find, filter, and sort paragraphs by similarity score
def filter_and_sort_paragraphs(model, question, paragraphs, threshold):
    # Drop empty paragraphs, keeping this list aligned with the embeddings below
    non_empty_paragraphs = [p for p in paragraphs if p.text.strip() != ""]
    if not non_empty_paragraphs:
        return []

    # Encode the question and paragraphs
    question_embedding = model.encode(question, convert_to_tensor=False)
    paragraph_embeddings = model.encode([p.text for p in non_empty_paragraphs], convert_to_tensor=False)

    # Calculate cosine similarity scores
    similarity_scores = cosine_similarity([question_embedding], paragraph_embeddings)

    # Keep only the paragraphs that score above the threshold
    relevant_paragraphs = [
        (paragraph, score)
        for paragraph, score in zip(non_empty_paragraphs, similarity_scores[0])
        if score > threshold
    ]

    # Sort relevant paragraphs by similarity score in descending order
    relevant_paragraphs.sort(key=lambda x: x[1], reverse=True)
    return relevant_paragraphs

from transformers import DistilBertTokenizer, DistilBertModel
from transformers import pipeline
def get_model():
    # Load a pre-trained model for sentence embeddings
    model_name = "paraphrase-MiniLM-L6-v2"
    model = SentenceTransformer(model_name)
    return model

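def get_urls_embeddings(model, urls):
    # Compute one embedding per URL from the full page text
    url_embedding = {}
    for url in urls:
        # Fetch a single URL (the original passed the whole list by mistake)
        (paragraphs, text) = get_paragraphs_from_wikipedia(url)
        embedding = model.encode(text, convert_to_tensor=False)
        url_embedding[url] = embedding
    return url_embedding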

def test_set():
    questions = [
                 "how are glacier caves formed", 
                 "how much is 1 tablespoon of water ?", 
                 "how did anne frank die", 
                 "how a water pump works", 
                 "how old was sue lyon when she made lolita",
                 "how are fire bricks made",
                 "what countries did immigrants come from during the immigration",
                 # The comma below was missing, which merged this question with the next one
                 "how many smoots in a mile",
                 "how tall is an indoor girls volleyball net",
                 "how many calories in a cup of white rice",
                 ]
    
    urls = ["https://en.wikipedia.org/wiki/Glacier_cave",
            "https://en.wikipedia.org/wiki/Tablespoon",
            "https://en.wikipedia.org/wiki/Anne_Frank",
            "https://en.wikipedia.org/wiki/Water_pump",
            "https://en.wikipedia.org/wiki/Sue_Lyon",
            "https://en.wikipedia.org/wiki/Fire_brick",
            "https://en.wikipedia.org/wiki/Volleyball",
            "https://en.wikipedia.org/wiki/Rice",
            "https://en.wikipedia.org/wiki/History_of_immigration_to_the_United_States",
            "https://en.wikipedia.org/wiki/Smoot"
            ]
    
    return questions, urls

def get_relevant_paragraphs(model, question, url):
    paragraphs, text = get_paragraphs_from_wikipedia(url)
    # Skip empty paragraphs so that every score maps to real text
    paragraphs = [p for p in paragraphs if p.text.strip() != ""]
    if not paragraphs:
        return []

    # Encode the question and all paragraphs (one batched call)
    question_embedding = model.encode(question, convert_to_tensor=False)
    paragraph_embeddings = model.encode([p.text for p in paragraphs], convert_to_tensor=False)

    # Calculate cosine similarity scores
    similarity_scores = cosine_similarity([question_embedding], paragraph_embeddings)

    # Pair each paragraph with its similarity score
    relevant_paragraphs = list(zip(paragraphs, similarity_scores[0]))

    # Sort paragraphs by similarity score in descending order
    relevant_paragraphs.sort(key=lambda x: x[1], reverse=True)
    return relevant_paragraphs
    
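# Define a function that generates an answer based on the question and URL
def generate_answer(question, relevant_paragraphs, url, threshold, max_responses=2):
    # Note: url is passed through unchanged; the main loop passes the
    # (url, score) tuple returned by get_relevant_url
    responses = []
    text2text_generator = pipeline("text2text-generation", model="t5-base") # For generating Answer text

    for paragraph, score in relevant_paragraphs:
        if score > threshold: # TODO: Needs calibration
            # Avoid a doubled "?" when the question already ends with one
            prompt = f"question: {question.rstrip(' ?')}? context: {paragraph.text}"
            answer = text2text_generator(prompt)
            responses.append((paragraph.text, answer[0]['generated_text'], url))
            # Stop once enough responses are collected instead of generating extras
            if len(responses) >= max_responses:
                break

    return responses # Up to max_responses responses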

def print_answer(responses):
    if (len(responses) == 0):
        print("Sorry, I could not find an answer to your question.")
    if (len(responses) > 1):
        print(f"There are {len(responses)} answers to your question.")
    
    for response in responses:
        print(f"Source Wiki Page: {response[2][0]}")
        print(f"Answer: {response[1]}")
        print(f"Paragraph: {response[0]}")

# Rank the URLs by similarity of their page text to the question and return the best one
def get_relevant_url(model, question, urls):
    relevant_urls = []
    text_embeddings = []
    question_embedding = model.encode(question, convert_to_tensor=False)
    for url in urls:
        paragraphs, text = get_paragraphs_from_wikipedia(url) 
        # Encode the full page text; long inputs are truncated to the model's
        # maximum sequence length, so this mostly reflects the start of the page
        text_embedding = model.encode(text, convert_to_tensor=False)
        text_embeddings.append(text_embedding)

    # Calculate cosine similarity scores
    similarity_scores = cosine_similarity([question_embedding], text_embeddings)
    
    # Pair each URL with its similarity score
    for i, score in enumerate(similarity_scores[0]):
        relevant_urls.append((urls[i], score))

    # Sort URLs by similarity score in descending order
    relevant_urls.sort(key=lambda x: x[1], reverse=True)
    # Return the best (url, score) pair
    return relevant_urls[0]

questions, urls = test_set()
model = get_model()
for question in questions:
    print(f"Question: {question}")
    most_relevant_url = get_relevant_url(model, question, urls)
    relevant_paragraphs = get_relevant_paragraphs(model, question, most_relevant_url[0])
    answers = generate_answer(question, relevant_paragraphs, most_relevant_url, threshold)
    print_answer(answers)


Question: how are glacier caves formed
There are 2 answers to your question.
Source Wiki Page: https://en.wikipedia.org/wiki/Glacier_cave
Answer: by water running through or under the glacier
Paragraph: Most glacier caves are started by water running through or under the glacier. This water often originates on the glacier's surface through melting, entering the ice at a moulin and exiting at the glacier's snout at base level. Heat transfer from the water can cause sufficient melting to create an air-filled cavity, sometimes aided by solifluction. Air movement can then assist enlargement through melting in summer and sublimation in winter.

Source Wiki Page: https://en.wikipedia.org/wiki/Glacier_cave
Answer: geothermal heat from volcanic vents or hotsprings beneath the ice
Paragraph: Some glacier caves are formed by geothermal heat from volcanic vents or hotsprings beneath the ice.  An extreme example is the Kverkfjöll glacier cave in the Vatnajökull glacier in Iceland, measured in the 

Explanation about the solution and Results
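
The code above uses `threshold = 0.7` and marks it as needing calibration. One possible way to calibrate it, sketched here under the assumption that a small set of paragraphs has been hand-labeled as relevant or not, is to sweep candidate thresholds and keep the one with the best F1. The helper names and the `(score, is_relevant)` pairs below are hypothetical; they are not measured results from this notebook.

```python
def f1_at_threshold(scored, threshold):
    """F1 of the rule 'keep paragraph if score > threshold' against relevance labels."""
    tp = sum(1 for score, relevant in scored if score > threshold and relevant)
    fp = sum(1 for score, relevant in scored if score > threshold and not relevant)
    fn = sum(1 for score, relevant in scored if score <= threshold and relevant)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scored, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(scored, t))

# Hypothetical (similarity score, is_relevant) pairs, for illustration only.
scored = [(0.91, True), (0.78, True), (0.74, False),
          (0.66, True), (0.52, False), (0.31, False)]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

chosen = best_threshold(scored, candidates)
print(chosen)  # 0.6
```

F1 is only one option; if wrong answers are costlier than missing ones, a precision-weighted criterion may be a better fit.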

# Evaluation

## Discussion on Evaluation
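
One way the page-retrieval step could be measured is with hit@k and mean reciprocal rank (MRR) over question and expected-URL pairs. The sketch below assumes a variant of `get_relevant_url` that returns the full ranked list of URLs rather than only the top one; the function names and the rankings shown are hypothetical and are not results from this notebook.

```python
def reciprocal_rank(ranked_urls, expected_url):
    """1/rank of the expected URL in the ranked list, or 0.0 if it is absent."""
    for position, url in enumerate(ranked_urls, start=1):
        if url == expected_url:
            return 1.0 / position
    return 0.0

def evaluate(rankings, expected, k=1):
    """Return (hit@k, MRR); both dictionaries are keyed by question."""
    hits = 0
    rr_total = 0.0
    for question, expected_url in expected.items():
        ranked = rankings[question]
        hits += expected_url in ranked[:k]
        rr_total += reciprocal_rank(ranked, expected_url)
    n = len(expected)
    return hits / n, rr_total / n

GLACIER = "https://en.wikipedia.org/wiki/Glacier_cave"
SPOON = "https://en.wikipedia.org/wiki/Tablespoon"
FRANK = "https://en.wikipedia.org/wiki/Anne_Frank"
SMOOT = "https://en.wikipedia.org/wiki/Smoot"

expected = {
    "how are glacier caves formed": GLACIER,
    "how much is 1 tablespoon of water ?": SPOON,
    "how did anne frank die": FRANK,
    "how many smoots in a mile": SMOOT,
}
# Hypothetical system output: ranked URLs per question, best first.
rankings = {
    "how are glacier caves formed": [GLACIER, SPOON, FRANK],
    "how much is 1 tablespoon of water ?": [GLACIER, SPOON, FRANK],
    "how did anne frank die": [GLACIER, SPOON, SMOOT],
    "how many smoots in a mile": [SMOOT, GLACIER, SPOON],
}

hit_at_1, mrr = evaluate(rankings, expected, k=1)
print(hit_at_1, mrr)  # 0.5 0.625
```

The same two metrics could be applied to paragraph ranking if expected paragraphs were labeled for each question.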

# Fine-Tuning

In [2]:
## Code for Fine-tuning

## Discussion on Fine-Tuning

# Summary & Closing Remarks