# Retrieval Exploration for Failed Sentences
12/15/2023

Lixiao Yang

This notebook builds on `9. NewsQA Prompt Engineering.ipynb`. For the loop that runs over the large dataset, refer to `NewsQA Loop - v2.ipynb`. The output shown here covers only the first story's results.


#### Updates
1. Make the MiniLM embedding step return the top 3 retrieved sentences for answering.
2. Restrict the data to the questions that were answered incorrectly in previous results.
3. The WizardLM model cannot be run locally because of memory limitations; consider adding more GPU resources to the JupyterHub. (DELAYED)

#### Findings
1. The MiniLM model still cannot find the correct sentence among its top 3 scored retrieved sentences.
2. No obvious improvement on the two failed questions for the GPT4ALL model across combinations of chunk sizes and overlap percentages. The same sentences are retrieved for all prompt-modified results.
3. WizardLM is a Llama-based model; forcing the model to bypass the GPU configuration may cause new issues.

In [None]:
#!pip install llama-cpp-python langchain sentence_transformers InstructorEmbedding pyllama transformers pyqt5 pyqtwebengine pyllamacpp --user

In [2]:
import logging
import time
from collections import Counter
from collections import defaultdict
import json
import torch
from langchain.llms import GPT4All
from pathlib import Path
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceInstructEmbeddings
from transformers import set_seed



## 1. Data Preparation & Helper Functions

In [3]:
file_path='data/combined-newsqa-data-story1-2questions.json'
data = json.loads(Path(file_path).read_text())

file_path='data/combined-newsqa-data-story1.json'
data_2 = json.loads(Path(file_path).read_text())

In [None]:
data

In [4]:
# Helper function to calculate Exact Match (EM) score
def calculate_em(predicted, actual):
    return int(predicted == actual)

# Function to calculate the token-wise F1 score for text answers
def calculate_token_f1(predicted, actual):
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    common_tokens = Counter(predicted_tokens) & Counter(actual_tokens)
    num_same = sum(common_tokens.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(predicted_tokens)
    recall = 1.0 * num_same / len(actual_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Helper function to extract answer ranges from the consensus field
def extract_ranges(consensus):
    if 's' in consensus and 'e' in consensus:
        return [(consensus['s'], consensus['e'])]
    return []

def extract_answer_from_gpt4all_output(output, prompt_end="Answer:"):
    """
    Extracts the answer from the output of the GPT4ALL model.

    Parameters:
        output (str): The output string from the GPT4ALL model.
        prompt_end (str): The delimiter indicating the end of the prompt and the start of the answer.

    Returns:
        str: The extracted answer.
    """

    # Find the start of the answer
    start_idx = output.find(prompt_end)
    if start_idx == -1:
        return "Answer not found in output"

    # Extract the answer
    start_idx += len(prompt_end)
    answer = output[start_idx:]

    # Optional: Clean up the answer, remove any additional text after the answer
    # This can be tailored based on how your model generates responses
    end_characters = [".", "?", "!"]
    for end_char in end_characters:
        end_idx = answer.find(end_char)
        if end_idx != -1:
            answer = answer[:end_idx + 1]
            break

    return answer.strip()
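
The two scoring helpers above behave quite differently on near-misses. The sketch below repeats `calculate_em` and `calculate_token_f1` so that it runs on its own, and adds a `normalize` helper that is hypothetical (it is not part of this notebook) to show how a trailing space in the extracted span, such as the `19 ` printed in Section 2, is enough to turn an exact match into a miss.

```python
from collections import Counter


def calculate_em(predicted, actual):
    # Strict string equality, as in the helper above
    return int(predicted == actual)


def calculate_token_f1(predicted, actual):
    # Whitespace-token F1, as in the helper above
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    num_same = sum((Counter(predicted_tokens) & Counter(actual_tokens)).values())
    if num_same == 0:
        return 0
    precision = num_same / len(predicted_tokens)
    recall = num_same / len(actual_tokens)
    return 2 * precision * recall / (precision + recall)


def normalize(text):
    # Hypothetical helper (an assumption, not in the notebook): trim and lowercase
    return text.strip().lower()


em_raw = calculate_em("19", "19 ")                         # trailing space breaks strict equality
em_norm = calculate_em(normalize("19"), normalize("19 "))  # equal after normalization
f1 = calculate_token_f1("19 children", "19")               # precision 1/2, recall 1/1

print(em_raw, em_norm, round(f1, 4))
```

Whether to normalize before scoring is a design choice; the scores reported in this notebook use the helpers as defined above, without normalization.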


In [5]:
from nltk.tokenize import word_tokenize
import collections

def compute_f1(a_gold, a_pred):
    gold_toks = word_tokenize(a_gold)
    pred_toks = word_tokenize(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return int(gold_toks == pred_toks)
    
    if num_same == 0:
        return 0
    
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

In [6]:
# Initialize logging
logging.basicConfig(level=logging.ERROR)

# Parameters
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)

# Initialize the language model (GPT4ALL)
# llm_path = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
llm_path = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"
llm = GPT4All(model=llm_path, max_tokens=2048)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
# instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_model_kwargs = {'device': 'mps'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# Start time calculation
start_time = time.time()
print(f"{start_time} Started.")

1704738504.857436 Started.


## 2. Top 3 Retrieved Sentences from MiniLM Model

In [7]:
import nltk
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize

from heapq import nlargest

def process_story_questions(data, model_name, instruction):
    # model_kwargs = {'device': 'cpu'}
    model_kwargs = {'device': 'mps'}
    encode_kwargs = {'normalize_embeddings': True}

    story_data = data['data'][0]  # First story
    story_text = story_data['text']

    # Segment story text into sentences and embed each sentence
    sentences = sent_tokenize(story_text)  # Splitting into sentences
    hf_story_embs = HuggingFaceInstructEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        embed_instruction="Use the following pieces of context to answer the question at the end:"
    )
    sentence_embs = hf_story_embs.embed_documents(sentences)

    # Process each question
    for question_data in story_data['questions']:
        question = question_data['q']

        hf_query_embs = HuggingFaceInstructEmbeddings(
            model_name=model_name,
            model_kwargs=model_kwargs,
            encode_kwargs=encode_kwargs,
            query_instruction=instruction
        )
        question_emb = hf_query_embs.embed_documents([question])[0]

        # Compute cosine similarity scores with each sentence
        scores = [torch.cosine_similarity(torch.tensor(sentence_emb).unsqueeze(0), torch.tensor(question_emb).unsqueeze(0))[0].item() for sentence_emb in sentence_embs]

        # Find the sentences with the top 3 highest scores
        top_scores_indices = nlargest(3, range(len(scores)), key=lambda idx: scores[idx])
        top_sentences = [sentences[idx] for idx in top_scores_indices]

        # Extract actual answer using the consensus range
        consensus = question_data['consensus']
        actual_answer = story_text[consensus['s']:consensus['e']]
        
        # Calculate F1 score for the best sentence
        best_sentence_idx = scores.index(max(scores))
        best_sentence = sentences[best_sentence_idx]
        # compute_f1 expects (gold, prediction); F1 is symmetric, so the value is unchanged
        f1_score = compute_f1(actual_answer, best_sentence)

        # Print results
        print(f"Question: {question}")
        for i, sentence in enumerate(top_sentences):
            print(f"Top {i+1} Predicted Answer: {sentence}")
        print(f"Actual Answer: {actual_answer}")
        print(f"Best Sentence F1 Score: {f1_score:.4f}")
        print()
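
The ranking step in `process_story_questions` can be tried in isolation. This is a minimal sketch that uses plain Python and toy 2-D vectors in place of the MiniLM embeddings (the vectors and the `cosine` helper are assumptions made for illustration); the `nlargest` call is the same as in the function above.

```python
import math
from heapq import nlargest


def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence and question embeddings (illustrative only)
sentence_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
question_vec = [1.0, 0.2]

scores = [cosine(vec, question_vec) for vec in sentence_vecs]

# Indices of the three highest-scoring sentences, best first
top_indices = nlargest(3, range(len(scores)), key=lambda idx: scores[idx])
print(top_indices)
```

The first index returned by `nlargest` is the same sentence that `scores.index(max(scores))` selects, so the "best sentence" used for the F1 score is always the Top 1 entry.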


In [9]:
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruction = "Represent the story for retrieval:" 
process_story_questions(data_2, model_name, instruction)

load INSTRUCTOR_Transformer
max_seq_length  512
load INSTRUCTOR_Transformer
max_seq_length  512
Question: What was the amount of children murdered?
Top 1 Predicted Answer: The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.
Top 2 Predicted Answer: Pandher and his domestic employee Surinder Koli were sentenced to death in February by a lower court for the rape and murder of the 14-year-old.
Top 3 Predicted Answer: Pandher faces trial in the remaining 18 killings and could remain in custody, the attorney said.
Actual Answer: 19 
Best Sentence F1 Score: 0.0714

load INSTRUCTOR_Transformer
max_seq_length  512
Question: When was Pandher sentenced to death?
Top 1 Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Top 2 Predicted Answer: Pandher and his domestic employee Surinder Koli were sentenced to death in February by a lower court for the rape and murder of the 14-y
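
The low F1 for the first question does not mean the retrieved sentence is wrong: the Top 1 sentence contains "19", but it is scored as a whole sentence against a one-token answer. The sketch below reproduces the arithmetic from token counts alone. The count of 27 tokens is an assumption based on how NLTK's `word_tokenize` is expected to split the sentence (each `--` and the final period as separate tokens); `f1_from_counts` is a hypothetical helper, not part of the notebook.

```python
def f1_from_counts(num_same, n_pred, n_gold):
    # Token-level F1 computed directly from overlap and length counts
    if num_same == 0:
        return 0.0
    precision = num_same / n_pred
    recall = num_same / n_gold
    return 2 * precision * recall / (precision + recall)


# Assumed counts: the retrieved sentence has 27 tokens, the answer "19" has 1,
# and they share exactly one token.
sentence_f1 = f1_from_counts(num_same=1, n_pred=27, n_gold=1)
print(round(sentence_f1, 4))  # 0.0714
```

With one shared token and a one-token answer, the score reduces to 2 / (n + 1) for an n-token sentence, so sentence-level retrieval can never score high on short spans under this metric.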

## 3. GPT4ALL with HuggingFace InstructEmbeddings

### 3.0 Question Answering Loop Encapsulation
Encapsulate the loop process to simplify the code.

In [8]:
def newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name, instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT):
    """
    Processes a collection of stories, answering questions using a QA chain based on GPT4ALL language model and
    HuggingFaceInstructEmbeddings.

    For each story, the text is split into chunks, and each chunk is processed to answer questions. The results,
    including calculated F1 and EM scores, are written to an output file.

    Args:
        data (dict): The dataset containing the stories and questions.
        llm (LLM): The language model instance used for generating answers.
        output_file_path (str): Path to the output file where results will be stored.
        chunk_sizes (list): List of chunk sizes for text splitting.
        overlap_percentages (list): List of overlap percentages for text splitting.
        max_stories (int): Maximum number of stories to process.
        instruct_embedding_model_name (str): Name of the HuggingFace model for embeddings.
        instruct_embedding_model_kwargs (dict): Keyword arguments for the embedding model.
        instruct_embedding_encode_kwargs (dict): Encoding arguments for the embedding model.
        QA_CHAIN_PROMPT (str): The prompt template for the QA chain.

    Returns:
        None. The function writes the scores to a file and prints the question, predicted answer,
        actual answer, and F1 score for each question.
    """
    with open(output_file_path, 'w') as file:
        word_embed = HuggingFaceInstructEmbeddings(
            model_name=instruct_embedding_model_name,
            model_kwargs=instruct_embedding_model_kwargs,
            encode_kwargs=instruct_embedding_encode_kwargs
        )
        start_time = time.time()
        for chunk_size in chunk_sizes:
            print(f"\n{time.time()-start_time} Processing chunk size {chunk_size}:")
            last_time = time.time()
            for overlap_percentage in overlap_percentages:
                actual_overlap = int(chunk_size * overlap_percentage)
                print(f"\n{time.time()-start_time}\t{time.time()-last_time}\tOverlap [{overlap_percentage}] {actual_overlap}")
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=actual_overlap)
        
                for i, story in enumerate(data['data']):
                    if i >= max_stories:
                        break
                    now_time = time.time()
                    print(f"\n{now_time-start_time}\t{now_time-last_time}\t\tstory {i+1}: ", end='')
                    last_time = now_time
        
                    all_splits = text_splitter.split_text(story['text'])
                    vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed)
                    qa_chain = RetrievalQA.from_chain_type(
                        llm, 
                        retriever=vectorstore.as_retriever(), 
                        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
                        return_source_documents=True)
        
                    for j, question_data in enumerate(story['questions'], start=1):
                        # print(f"{time.time()-start_time}\t\t\tquestion {j}")
                        print('.', end='')
        
                        # TODO: embed query and perform similarity_search_by_vector() instead
                        question = question_data['q']
                        question_vector = word_embed.embed_query(question)
                        # docs = vectorstore.similarity_search(question)
                        docs = vectorstore.similarity_search_by_vector(question_vector)
                        answer_ranges = extract_ranges(question_data['consensus'])
                        
                        # Get the prediction from the model
                        result = qa_chain({"query": question})
                        
                        # Extract and print the retrieved sentence(s) if available in the result
                        retrieved_sentences = result.get('source_documents', [])
                        print("\nRetrieved Sentence(s):")
                        for sentence in retrieved_sentences:
                            print(sentence)
                        
                        # Check if the predicted answer is in the expected format (string)
                        predicted_answer = result['result']
                        if isinstance(predicted_answer, dict):
                            # If it's a dictionary, you need to adapt this part of the code to extract the answer string
                            predicted_answer = predicted_answer.get('answer', '')  # Assuming 'answer' is the key for the answer string
                        elif not isinstance(predicted_answer, str):
                            # If the answer is not a string and not a dictionary, log an error or handle it appropriately
                            print(f"Unexpected format for predicted answer: {predicted_answer}")
                            continue  # Skip to the next question

                        # If there is an actual answer, get it from the story text using the character ranges
                        actual_answer = story['text'][answer_ranges[0][0]:answer_ranges[0][1]] if answer_ranges else ""
                        
                        # Calculate the scores
                        em_score = calculate_em(predicted_answer, actual_answer)
                        f1_score_value = calculate_token_f1(predicted_answer, actual_answer)
                        # modified_f1 = calculate_modified_f1(predicted_answer, actual_answer)
                        file.write(f"{chunk_size}\t{overlap_percentage}\t{i}\t{j}\t{f1_score_value:.4f}\t{em_score:.2f}\n")
        
                        # Store the scores
                        em_results[(chunk_size, overlap_percentage)].append(em_score)
                        f1_results[(chunk_size, overlap_percentage)].append(f1_score_value)
                        
                        # Print results
                        print(f"\nQuestion: {question}")
                        print(f"Predicted Answer: {predicted_answer}")
                        print(f"Actual Answer: {actual_answer}")
                        print(f"F1 Score: {f1_score_value:.4f}")
                        
                    # delete object for memory
                    del qa_chain
                    del vectorstore
                    del all_splits
                    
                # delete splitter instance
                del text_splitter
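
The loop writes one tab-separated row per question (chunk size, overlap, story index, question index, F1, EM) but does not aggregate them. A possible way to compare configurations afterwards is sketched below. The rows are made-up placeholders in the same layout, not results from this notebook, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Placeholder rows in the layout written by the loop (values are invented):
# chunk_size, overlap, story index, question index, F1, EM
sample_rows = [
    "100\t0.1\t0\t1\t0.0000\t0.00",
    "100\t0.1\t0\t2\t0.5000\t0.00",
    "200\t0.1\t0\t1\t1.0000\t1.00",
    "200\t0.1\t0\t2\t0.5000\t0.00",
]


def summarize(rows):
    # Group scores by (chunk_size, overlap) and return (mean F1, mean EM) per group
    f1_by_config = defaultdict(list)
    em_by_config = defaultdict(list)
    for row in rows:
        chunk_size, overlap, _story, _question, f1, em = row.split("\t")
        key = (int(chunk_size), float(overlap))
        f1_by_config[key].append(float(f1))
        em_by_config[key].append(float(em))
    return {key: (mean(f1_by_config[key]), mean(em_by_config[key])) for key in f1_by_config}


summary = summarize(sample_rows)
for (chunk_size, overlap), (mean_f1, mean_em) in sorted(summary.items()):
    print(f"chunk={chunk_size} overlap={overlap} mean F1={mean_f1:.4f} mean EM={mean_em:.2f}")
```

To use it on real output, the rows could be read from the file passed as `output_file_path`, for example with `Path(output_file_path).read_text().splitlines()`.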

### 3.1 Try different chunk sizes and overlap percentages

In [None]:
template_original = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use ten words maximum and keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_ORIGINAL = PromptTemplate.from_template(template_original)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [100, 200, 400, 800]
overlap_percentages = [0, 0.1, 0.2, 0.4]  # Expressed as percentages (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_10.txt"
# model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model and the QA chain
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
# instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_model_kwargs = {'device': 'mps'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_ORIGINAL)
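
The parameter grid above produces 16 splitter configurations. As a quick check of what the loop passes to `RecursiveCharacterTextSplitter`, this sketch applies the same `int(chunk_size * overlap_percentage)` conversion to every combination; the names `overlap_fractions` and `grid` are introduced here for illustration only.

```python
from itertools import product

chunk_sizes = [100, 200, 400, 800]
overlap_fractions = [0, 0.1, 0.2, 0.4]

# Same conversion as in the loop: overlap in characters = int(chunk_size * fraction)
grid = {
    (size, frac): int(size * frac)
    for size, frac in product(chunk_sizes, overlap_fractions)
}

for (size, frac), overlap_chars in grid.items():
    print(f"chunk_size={size:>3}  overlap={frac:<3}  chunk_overlap={overlap_chars}")
```

The overlap therefore ranges from 0 characters up to 320 characters (chunk size 800 with 40% overlap).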

### 3.2 Prompt Focusing on Detailed Analysis

In [None]:
template_1 = """When you embed each sentence, focus closely on the details and the overall context of the news story. Pay attention to who is involved, what exactly happened, the timing of the events, and where they took place. Also, consider why these events are significant. This detailed analysis will help you accurately answer questions that require a deep understanding of the news story. \n
Context: {context}
Question: {question}
Answer:"""
QA_CHAIN_PROMPT_1 = PromptTemplate.from_template(template_1)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [100, 200, 400, 800]
overlap_percentages = [0, 0.1, 0.2, 0.4]  # Expressed as percentages (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_11.txt"
# model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model and the QA chain
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
# instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_model_kwargs = {'device': 'mps'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_1)

### 3.3 Prompt Highlighting Key Elements

In [None]:
# Template that directs the model to the key elements of the story (who, what, when, where, why).
template_2 = """As you process the sentences, concentrate on the main events and the participants in the story. Make sure to identify the key players (who), the actions or events described (what), the timeline of these events (when), the locations involved (where), and the reasons or motives behind them (why). This focus will enable you to provide precise answers to questions that center around these crucial elements of the news.\n
Context: {context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_2 = PromptTemplate.from_template(template_2)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [100, 200, 400, 800]
overlap_percentages = [0, 0.1, 0.2, 0.4]  # Expressed as percentages (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_12.txt"
# model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
# instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_model_kwargs = {'device': 'mps'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_2)

### 3.4 Prompt Emphasizing Narrative Structure

In [None]:
# Template that directs the model to the narrative flow: sequence of events and causality.
template_3 = """Your task is to capture the narrative flow of the news story, with an emphasis on the sequence of events and causality. Examine the connections between the participants (who), the series of events (what), their chronological order (when), the settings (where), and the motivations behind them (why). This approach will assist you in answering questions that depend on understanding how the story unfolds and the cause-and-effect relationships within it.\n
Context: {context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_3 = PromptTemplate.from_template(template_3)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [100, 200, 400, 800]
overlap_percentages = [0, 0.1, 0.2, 0.4]  # Expressed as percentages (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_13.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
# instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_model_kwargs = {'device': 'mps'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_3)

In [None]:
# Reset vectorstore directly
Chroma().delete_collection()

## 4. Two Models Combination (Delayed)

In [10]:
import nltk
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Check that MPS is available
if not torch.backends.mps.is_available():
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")

mps_device = torch.device("mps")

def process_story_questions_new(data, model_name, instruction):
    # model_kwargs = {'device': 'cpu'}
    model_kwargs = {'device': 'mps'}
    encode_kwargs = {'normalize_embeddings': True}

    story_data = data['data'][0]  # First story
    story_text = story_data['text']

    # Segment story text into sentences and embed each sentence
    sentences = sent_tokenize(story_text)  # Splitting into sentences
    hf_story_embs = HuggingFaceInstructEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        embed_instruction="Use the following pieces of context to answer the question at the end:"
    )
    sentence_embs = hf_story_embs.embed_documents(sentences)

    # Process each question
    for question_data in story_data['questions']:
        question = question_data['q']

        hf_query_embs = HuggingFaceInstructEmbeddings(
            model_name=model_name,
            model_kwargs=model_kwargs,
            encode_kwargs=encode_kwargs,
            query_instruction=instruction
        )
        question_emb = hf_query_embs.embed_documents([question])[0]

        # Compute cosine similarity scores with each sentence
        scores = [torch.cosine_similarity(torch.tensor(sentence_emb).unsqueeze(0), torch.tensor(question_emb).unsqueeze(0))[0].item() for sentence_emb in sentence_embs]

        # Find the sentence with the highest score
        best_sentence_idx = scores.index(max(scores))
        best_sentence = sentences[best_sentence_idx]

        # Extract actual answer using the consensus range
        consensus = question_data['consensus']
        actual_answer = story_text[consensus['s']:consensus['e']]
        
        # Calculate F1 score
        # compute_f1 expects (gold, prediction); F1 is symmetric, so the value is unchanged
        f1_score = compute_f1(actual_answer, best_sentence)
        
        # Print results
        # print(f"Question: {question}")
        # print(f"Predicted Answer: {best_sentence}")
        # print(f"Actual Answer: {actual_answer}")
        # print(f"F1 Score: {f1_score:.4f}")
        # print()
        
        generate_answer_from_best_sentence(best_sentence, question)

def generate_answer_from_best_sentence(best_sentence, question):
    # model_name_or_path = "TheBloke/WizardLM-13B-V1.1-GPTQ"
    
    # Load the model and tokenizer
    # model = AutoModelForCausalLM.from_pretrained(model_name_or_path, 
    #                                              device_map="auto", 
    #                                              trust_remote_code=True, 
    #                                              revision="main")
    # tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

    tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardLM-13B-V1.2")
    model = AutoModelForCausalLM.from_pretrained("WizardLM/WizardLM-13B-V1.2")
    model.to(mps_device)

    # Create the prompt
    prompt = "Answer the question, give shortest answer."
    prompt_template = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Read the following sentence: {best_sentence}, give me the answer for this question: {question}, use the shortest answer possible. ASSISTANT:"

    # Generate the response
    # input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
    input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.to("mps")
    output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
    model_answer = tokenizer.decode(output[0])

    # Extracting the answer from the generated text
    start_phrase = "ASSISTANT: "
    start_idx = model_answer.find(start_phrase)
    if start_idx != -1:
        start_idx += len(start_phrase)
        answer = model_answer[start_idx:].split('\n')[0]  # Assuming the answer ends with a newline
    else:
        answer = "Answer not found in generated text"

    # Print the question, best sentence, and the answer
    print(f"Question: {question}")
    print(f"Best Retrieved Sentence: {best_sentence}")
    print(f"Generated Answer: {answer}")
    print()

    return answer

In [11]:
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruction = "Represent the story for retrieval:" 
process_story_questions_new(data_2, model_name, instruction)

load INSTRUCTOR_Transformer
max_seq_length  512
load INSTRUCTOR_Transformer
max_seq_length  512




RuntimeError: Placeholder storage has not been allocated on MPS device!

## Reference
1. GPT4All Langchain Demo: https://gist.github.com/segestic/4822f3765147418fc084a250598c1fe6
2. Sematic - use GPT4ALL, FAISS & HuggingFace Embeddings for local context-enhanced question answering: https://www.youtube.com/watch?v=y32vbJkabCw
3. Model used for InstructEmbeddings: https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
4. Reset the Chroma db collection: https://github.com/langchain-ai/langchain/issues/
5. F1 score in NLP span-based Question Answering task: https://kierszbaumsamuel.medium.com/f1-score-in-nlp-span-based-qa-task-5b115a5e7d41
6. The Stanford Question Answering Dataset: https://rajpurkar.github.io/SQuAD-explorer/