# NewsQA Prompt Engineering
12/7/2023

Lixiao Yang

This notebook builds on [`8.1 NewsQA Loop with HuggingFace.ipynb`](https://github.com/lixiao-yang/DeepDelight/blob/main/Thread2/8.1%20NewsQA%20Loop%20with%20HuggingFace.ipynb). For running the loop over the large dataset, refer to `NewsQA Loop - v2.ipynb`. It adds new explorations in prompt engineering for HuggingFace `InstructEmbedding` (`sentence-transformers/multi-qa-MiniLM-L6-cos-v1`) and the locally deployed GPT4ALL `gpt4all-falcon-q4_0` model for semantic search and local data training. It also prints the texts retrieved by the model, to determine whether the issues in the previous work are caused by a **retrieval problem** or an **instruction problem**. Three additional prompt methods are examined to compare the effect of prompt engineering on question answering. The output shown covers only the first story's results.


#### Updates
1. Add the content retrieved by the model to check whether the model finds the 'correct' sentence for answering the question.
2. Add three additional prompt methods in the query instructions: 
   1. **Prompt Focusing on Detailed Analysis**: When you embed each sentence, focus closely on the details and the overall context of the news story. Pay attention to who is involved, what exactly happened, the timing of the events, and where they took place. Also, consider why these events are significant. This detailed analysis will help you accurately answer questions that require a deep understanding of the news story.
   2. **Prompt Highlighting Key Elements**: As you process the sentences, concentrate on the main events and the participants in the story. Make sure to identify the key players (who), the actions or events described (what), the timeline of these events (when), the locations involved (where), and the reasons or motives behind them (why). This focus will enable you to provide precise answers to questions that center around these crucial elements of the news.
   3. **Prompt Emphasizing Narrative Structure**: Your task is to capture the narrative flow of the news story, with an emphasis on the sequence of events and causality. Examine the connections between the participants (who), the series of events (what), their chronological order (when), the settings (where), and the motivations behind them (why). This approach will assist you in answering questions that depend on understanding how the story unfolds and the cause-and-effect relationships within it.
3. Import a candidate F1 score calculation method (similar to the original one).

#### Findings
1. With the HuggingFace model alone, the different prompts show no obvious effect on the results. This might be because the model only selects the sentence with the highest score, and for the same corpus the sentence with the highest cosine similarity should stay the same.
2. The GPT4ALL model gives different answers for different prompts.
3. Potential models to use: [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) (no obvious improvements observed), [BERT](https://huggingface.co/docs/transformers/model_doc/bert) (needs fine-tuning), [ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra) (needs fine-tuning)
4. Future model framework: encoder-decoder / decoder-only models
5. The current results lean more towards a **prompt problem**.

In [4]:
#!pip install llama-cpp-python langchain sentence_transformers InstructorEmbedding pyllama transformers pyqt5 pyqtwebengine pyllamacpp --user

In [2]:
import logging
import time
from collections import Counter
from collections import defaultdict
import json
from langchain.llms import GPT4All
from pathlib import Path
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceInstructEmbeddings
from transformers import set_seed

## 1. Data Preparation & Helper Functions

In [3]:
file_path='C:/NewsQA/combined-newsqa-data-story1.json'
data = json.loads(Path(file_path).read_text())

In [4]:
data

{'version': '1',
 'data': [{'text': 'NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."\n\n\n\nMoninder Singh Pandher was sentenced to death by a lower court in February.\n\n\n\nThe teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.\n\n\n\nThe Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.\n\n\n\nPandher and his domestic employee Surinder Koli were sentenced to death in February by a lower court for the rape and murder of the 14-year-old.\n\n\n\nThe high court upheld Koli\'s death sentence, Kochar said.\n\n\n\nThe two were arrested two years ago after body parts packed in plastic bags were found near their home in Noida, a New Delhi suburb. Their home was later dubbed a "house of horrors" by the Indian media.\n\n\n\nPand

In [5]:
# Helper function to calculate Exact Match (EM) score
def calculate_em(predicted, actual):
    return int(predicted == actual)

# Function to calculate the token-wise F1 score for text answers
def calculate_token_f1(predicted, actual):
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    common_tokens = Counter(predicted_tokens) & Counter(actual_tokens)
    num_same = sum(common_tokens.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(predicted_tokens)
    recall = 1.0 * num_same / len(actual_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Helper function to extract answer ranges from the consensus field
def extract_ranges(consensus):
    if 's' in consensus and 'e' in consensus:
        return [(consensus['s'], consensus['e'])]
    return []

def extract_answer_from_gpt4all_output(output, prompt_end="Answer:"):
    """
    Extracts the answer from the output of the GPT4ALL model.

    Parameters:
        output (str): The output string from the GPT4ALL model.
        prompt_end (str): The delimiter indicating the end of the prompt and the start of the answer.

    Returns:
        str: The extracted answer.
    """

    # Find the start of the answer
    start_idx = output.find(prompt_end)
    if start_idx == -1:
        return "Answer not found in output"

    # Extract the answer
    start_idx += len(prompt_end)
    answer = output[start_idx:]

    # Optional: Clean up the answer, remove any additional text after the answer
    # This can be tailored based on how your model generates responses
    end_characters = [".", "?", "!"]
    for end_char in end_characters:
        end_idx = answer.find(end_char)
        if end_idx != -1:
            answer = answer[:end_idx + 1]
            break

    return answer.strip()
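
`calculate_em` compares the raw strings, so a prediction such as `19` would not match a gold span such as `19 ` that carries trailing whitespace (several gold answers printed later in this notebook end with whitespace or newlines). A minimal sketch of a more forgiving comparison, assuming SQuAD-style normalization (lowercasing, dropping punctuation and English articles, collapsing whitespace). `normalize_answer` and `calculate_normalized_em` are hypothetical helper names, not part of the original notebook:

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def calculate_normalized_em(predicted, actual):
    """Exact match after normalization (hypothetical variant of calculate_em)."""
    return int(normalize_answer(predicted) == normalize_answer(actual))


# Gold spans copied from the outputs in this notebook, including trailing whitespace
print(calculate_normalized_em("February", "February.\n\n"))
print(calculate_normalized_em("19", "19 "))
print(calculate_normalized_em("Koli", "Moninder Singh Pandher "))
```

Whether such normalization is appropriate depends on how strictly the answers should be scored; the strict `calculate_em` above remains the version used in the rest of the notebook.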


The following token-level F1 score is used to find the sentence with the highest score. It is calculated as:
$$\text{precision} = \frac{\text{num\_same}}{\text{len}(\text{pred\_toks})} = \frac{tp}{tp + fp}$$
$$\text{recall} = \frac{\text{num\_same}}{\text{len}(\text{gold\_toks})} = \frac{tp}{tp + fn}$$
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

`tp = num_same`: the number of tokens shared between the correct answer and the prediction.

`fp = len(pred_toks) - num_same`: the number of tokens that are in the prediction but not in the correct answer.

`fn = len(gold_toks) - num_same`: the number of tokens that are in the correct answer but not in the prediction.

This formula is also used on the [SQuAD website](https://rajpurkar.github.io/SQuAD-explorer/). Also see reference 5.
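
As a worked example of the formula above, consider one prediction and gold answer taken from the outputs in this notebook. This is a minimal sketch that assumes plain whitespace tokenization to stay dependency-free, so its numbers can differ from a version that tokenizes with NLTK's `word_tokenize`. `token_f1` is a hypothetical name used only here:

```python
from collections import Counter


def token_f1(pred_toks, gold_toks):
    """Token-level F1 between two token lists, following the formula above."""
    num_same = sum((Counter(pred_toks) & Counter(gold_toks)).values())  # tp
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)  # tp / (tp + fp)
    recall = num_same / len(gold_toks)     # tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Whitespace tokenization is an assumption of this sketch
pred = "Moninder Singh Pandher was sentenced to death by a lower court in February.".split()
gold = "February.".split()

f1 = token_f1(pred, gold)
print(f"pred tokens: {len(pred)}, gold tokens: {len(gold)}, F1: {f1:.4f}")
```

Here the prediction has 13 tokens and shares one with the gold answer, so precision is 1/13, recall is 1, and F1 is 2/14 ≈ 0.1429. A whole retrieved sentence is therefore penalized heavily on precision even when it contains the answer.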


In [16]:
from nltk.tokenize import word_tokenize
import collections

def compute_f1(a_gold, a_pred):
    gold_toks = word_tokenize(a_gold)
    pred_toks = word_tokenize(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return int(gold_toks == pred_toks)
    
    if num_same == 0:
        return 0
    
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


## 2. Logging Settings

In [6]:
# Initialize logging
logging.basicConfig(level=logging.ERROR)

# Parameters
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)

# Initialize the language model (GPT4ALL)
llm_path = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
llm = GPT4All(model=llm_path, max_tokens=2048)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# Start time calculation
start_time = time.time()
print(f"{start_time} Started.")

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
1702010662.0312448 Started.


## 3. Hugging Face Pretrained Model Result

### 3.1 Original Results

In [19]:
import torch
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
from collections import Counter

# Initialize the HuggingFaceInstructEmbeddings model
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}

# Assuming data is loaded into a variable named 'data'
story_data = data['data'][0]  # First story
story_text = story_data['text']

# Split story text into words and embed each word
words = story_text.split()  # Splitting into words
hf_story_embs = HuggingFaceInstructEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    embed_instruction="Represent the story for retrieval:"
)
word_embs = hf_story_embs.embed_documents(words)

# Process each question
for question_data in story_data['questions']:
    question = question_data['q']

    hf_query_embs = HuggingFaceInstructEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        query_instruction="Use the following pieces of context to answer the question at the end:"
    )
    question_emb = hf_query_embs.embed_documents([question])[0]

    # Compute cosine similarity scores with each word
    # scores = [torch.cosine_similarity(torch.tensor(word_emb).unsqueeze(0), torch.tensor(question_emb).unsqueeze(0))[0].item() for word_emb in word_embs]

    # Calculate F1 score for each word
    scores = [compute_f1(word, question) for word in words]

    # Score each start position by the sum of the next 10 word scores, then keep the 5 words starting at the best position
    best_sequence_score = float('-inf')
    best_sequence = []
    for i in range(len(scores) - 4):
        current_sequence_score = sum(scores[i:i+10])
        if current_sequence_score > best_sequence_score:
            best_sequence_score = current_sequence_score
            best_sequence = words[i:i+5]

    best_sequence_text = ' '.join(best_sequence)

    # Extract actual answer using the consensus range
    consensus = question_data['consensus']
    actual_answer = story_text[consensus['s']:consensus['e']]
    
    # Calculate F1 score
    f1_score = compute_f1(best_sequence_text, actual_answer)
    
    # Print results
    print(f"Question: {question}")
    print(f"Predicted Answer: {best_sequence_text}")
    print(f"Actual Answer: {actual_answer}")
    print(f"F1 Score: {f1_score:.4f}")
    print()


load INSTRUCTOR_Transformer
max_seq_length  512
load INSTRUCTOR_Transformer
max_seq_length  512
Question: What was the amount of children murdered?
Predicted Answer: wealthy businessman facing the death
Actual Answer: 19 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: When was Pandher sentenced to death?
Predicted Answer: house of horrors." Moninder Singh
Actual Answer: February.

F1 Score: 0.2222

load INSTRUCTOR_Transformer
max_seq_length  512
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer: years. The Allahabad high court
Actual Answer: rape and murder 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: who was acquitted
Predicted Answer: said his client was in
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: who was sentenced
Predicted Answer: dubbed "the house of horrors."
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.0000
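
The window search in the cell above scores runs of consecutive words and keeps the best one. A minimal sketch of that search with a single window size, run on made-up integer scores. `best_window` and the toy inputs are hypothetical and only illustrate the technique:

```python
def best_window(words, scores, window=5):
    """Return (start index, words) of the window with the highest score sum."""
    if len(words) != len(scores):
        raise ValueError("words and scores must have the same length")
    if len(words) <= window:
        return 0, list(words)

    current = sum(scores[:window])  # score sum of the first window
    best_start, best_sum = 0, current
    for start in range(1, len(words) - window + 1):
        # Slide by one word: add the entering score, drop the leaving one
        current += scores[start + window - 1] - scores[start - 1]
        if current > best_sum:  # strict '>' keeps the earliest best window
            best_start, best_sum = start, current
    return best_start, words[best_start:best_start + window]


words = "a lower court sentenced him in February".split()
scores = [0, 1, 2, 9, 3, 1, 8]  # made-up per-word scores
start, span = best_window(words, scores, window=3)
print(start, span)
```

Using the same window length for both the score sum and the returned span keeps the selected words aligned with the scores that selected them.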

### 3.2 Retrieved Sentences

In [25]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def process_story_questions(data, model_name, instruction):
    model_kwargs = {'device': 'cpu'}
    encode_kwargs = {'normalize_embeddings': True}

    story_data = data['data'][0]  # First story
    story_text = story_data['text']

    # Segment story text into sentences and embed each sentence
    sentences = sent_tokenize(story_text)  # Splitting into sentences
    hf_story_embs = HuggingFaceInstructEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        embed_instruction="Use the following pieces of context to answer the question at the end:"
    )
    sentence_embs = hf_story_embs.embed_documents(sentences)

    # Process each question
    for question_data in story_data['questions']:
        question = question_data['q']

        hf_query_embs = HuggingFaceInstructEmbeddings(
            model_name=model_name,
            model_kwargs=model_kwargs,
            encode_kwargs=encode_kwargs,
            query_instruction=instruction
        )
        question_emb = hf_query_embs.embed_documents([question])[0]

        # Compute cosine similarity scores with each sentence
        scores = [torch.cosine_similarity(torch.tensor(sentence_emb).unsqueeze(0), torch.tensor(question_emb).unsqueeze(0))[0].item() for sentence_emb in sentence_embs]

        # Find the sentence with the highest score
        best_sentence_idx = scores.index(max(scores))
        best_sentence = sentences[best_sentence_idx]

        # Extract actual answer using the consensus range
        consensus = question_data['consensus']
        actual_answer = story_text[consensus['s']:consensus['e']]
        
        # Calculate F1 score
        f1_score = compute_f1(best_sentence, actual_answer)
        
        # Print results
        print(f"Question: {question}")
        print(f"Predicted Answer: {best_sentence}")
        print(f"Actual Answer: {actual_answer}")
        print(f"F1 Score: {f1_score:.4f}")
        print()
        
        
        # # Calculate F1 score for each sentence
        # f1_scores = [compute_f1(sentence, question) for sentence in sentences]

        # # Find the sentence with the highest F1 score
        # best_sentence_idx = f1_scores.index(max(f1_scores))
        # best_sentence = sentences[best_sentence_idx]

        # # Extract actual answer using the consensus range
        # consensus = question_data['consensus']
        # actual_answer = story_text[consensus['s']:consensus['e']]
        
        # # Calculate F1 score
        # f1_score = compute_f1(best_sentence, actual_answer)
        
        # # Print results
        # print(f"Question: {question}")
        # print(f"Predicted Answer: {best_sentence}")
        # print(f"Actual Answer: {actual_answer}")
        # print(f"F1 Score: {f1_score:.4f}")
        # print()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\24075\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
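
`process_story_questions` reduces retrieval to picking the sentence whose embedding has the highest cosine similarity with the question embedding. A dependency-free sketch of that selection step on made-up 3-dimensional vectors (real embeddings come from the model and are much longer). `cosine`, `top1`, and the toy vectors are hypothetical:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1(query_vec, doc_vecs):
    """Return the index of the most similar document vector and all scores."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return best_idx, scores


sentences = [
    "The high court acquitted him.",
    "He was sentenced to death in February.",
    "The home is in Noida.",
]
sentence_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy vectors
question_vec = [0.1, 0.9, 0.2]  # toy embedding of "When was he sentenced?"

best_idx, scores = top1(question_vec, sentence_vecs)
print(sentences[best_idx])
```

Because only the arg-max is kept, two query instructions return the same sentence whenever they leave the top-ranked sentence unchanged, even if the individual similarity values shift.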


In [26]:
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruction = "Represent the story for retrieval:" 
process_story_questions(data, model_name, instruction)

load INSTRUCTOR_Transformer
max_seq_length  512
load INSTRUCTOR_Transformer
max_seq_length  512
Question: What was the amount of children murdered?
Predicted Answer: The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.
Actual Answer: 19 
F1 Score: 0.0714

load INSTRUCTOR_Transformer
max_seq_length  512
Question: When was Pandher sentenced to death?
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: February.

F1 Score: 0.2500

load INSTRUCTOR_Transformer
max_seq_length  512
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer: The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.
Actual Answer: rape and murder 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: who was acquitted
Predicted Answer: The Allahabad high court has acquitted Moninder Sing

### 3.3 Prompt Focusing on Detailed Analysis

In [33]:
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruction = "When you embed each sentence, focus closely on the details and the overall context of the news story. Pay attention to who is involved, what exactly happened, the timing of the events, and where they took place. Also, consider why these events are significant. This detailed analysis will help you accurately answer questions that require a deep understanding of the news story."
process_story_questions(data, model_name, instruction)

load INSTRUCTOR_Transformer
max_seq_length  512
load INSTRUCTOR_Transformer
max_seq_length  512
Question: What was the amount of children murdered?
Predicted Answer: The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.
Actual Answer: 19 
F1 Score: 0.0714

load INSTRUCTOR_Transformer
max_seq_length  512
Question: When was Pandher sentenced to death?
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: February.

F1 Score: 0.2500

load INSTRUCTOR_Transformer
max_seq_length  512
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer: The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.
Actual Answer: rape and murder 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: who was acquitted
Predicted Answer: The Allahabad high court has acquitted Moninder Sing

### 3.4 Prompt Highlighting Key Elements

In [32]:
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruction = "As you process the sentences, concentrate on the main events and the participants in the story. Make sure to identify the key players (who), the actions or events described (what), the timeline of these events (when), the locations involved (where), and the reasons or motives behind them (why). This focus will enable you to provide precise answers to questions that center around these crucial elements of the news."
process_story_questions(data, model_name, instruction)

load INSTRUCTOR_Transformer
max_seq_length  512
load INSTRUCTOR_Transformer
max_seq_length  512
Question: What was the amount of children murdered?
Predicted Answer: The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.
Actual Answer: 19 
F1 Score: 0.0714

load INSTRUCTOR_Transformer
max_seq_length  512
Question: When was Pandher sentenced to death?
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: February.

F1 Score: 0.2500

load INSTRUCTOR_Transformer
max_seq_length  512
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer: The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.
Actual Answer: rape and murder 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: who was acquitted
Predicted Answer: The Allahabad high court has acquitted Moninder Sing

### 3.5 Prompt Emphasizing Narrative Structure

In [31]:
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruction = "Your task is to capture the narrative flow of the news story, with an emphasis on the sequence of events and causality. Examine the connections between the participants (who), the series of events (what), their chronological order (when), the settings (where), and the motivations behind them (why). This approach will assist you in answering questions that depend on understanding how the story unfolds and the cause-and-effect relationships within it."
process_story_questions(data, model_name, instruction)

load INSTRUCTOR_Transformer
max_seq_length  512
load INSTRUCTOR_Transformer
max_seq_length  512
Question: What was the amount of children murdered?
Predicted Answer: The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.
Actual Answer: 19 
F1 Score: 0.0714

load INSTRUCTOR_Transformer
max_seq_length  512
Question: When was Pandher sentenced to death?
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: February.

F1 Score: 0.2500

load INSTRUCTOR_Transformer
max_seq_length  512
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer: The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.
Actual Answer: rape and murder 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: who was acquitted
Predicted Answer: The Allahabad high court has acquitted Moninder Sing

## 4. GPT4ALL with InstructEmbeddings

### 4.1 Original Prompt

In [15]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Initialize the FLAN-T5 model and tokenizer
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.to('cpu')  # Ensure the model is on the correct device

# Assuming data is loaded into a variable named 'data'
story_data = data['data'][0]  # First story
story_text = story_data['text']

# Segment story text into sentences and embed each sentence
sentences = sent_tokenize(story_text)  # Splitting into sentences
hf_story_embs = HuggingFaceInstructEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    embed_instruction="Represent the story for retrieval:"
)
sentence_embs = hf_story_embs.embed_documents(sentences)

# Process each question
for question_data in story_data['questions']:
    question = question_data['q']

    hf_query_embs = HuggingFaceInstructEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        query_instruction="Use the following pieces of context to answer the question at the end:"
    )
    question_emb = hf_query_embs.embed_documents([question])[0]

    # Compute cosine similarity scores with each sentence
    scores = [torch.cosine_similarity(torch.tensor(sentence_emb).unsqueeze(0), torch.tensor(question_emb).unsqueeze(0))[0].item() for sentence_emb in sentence_embs]

    # Find the sentence with the highest score
    best_sentence_idx = scores.index(max(scores))
    best_sentence = sentences[best_sentence_idx]

    # Extract actual answer using the consensus range
    consensus = question_data['consensus']
    actual_answer = story_text[consensus['s']:consensus['e']]
    
    # Calculate F1 score
    f1_score = calculate_token_f1(best_sentence, actual_answer)
    
    # Print results
    print(f"Question: {question}")
    print(f"Predicted Answer: {best_sentence}")
    print(f"Actual Answer: {actual_answer}")
    print(f"F1 Score: {f1_score:.4f}")
    print()


Question: What was the amount of children murdered?
Predicted Answer: Kochar said his client was in Australia when the teen was raped and killed.
Actual Answer: 19 
F1 Score: 0.0000

Question: When was Pandher sentenced to death?
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: February.

F1 Score: 0.1429

Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: rape and murder 
F1 Score: 0.0000

Question: who was acquitted
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.3750

Question: who was sentenced
Predicted Answer: Kochar said his client was in Australia when the teen was raped and killed.
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.0000

Question: What was Moninder Singh Pandher acquitted for?
Predic

## 4. GPT4ALL with HuggingFace InstructEmbeddings

### 4.0 Question Answering Loop Encapsulation
Encapsulate the loop process to simplify the code.

In [34]:
def newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name, instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT):
    """
    Processes a collection of stories, answering questions using a QA chain based on GPT4ALL language model and
    HuggingFaceInstructEmbeddings.

    For each story, the text is split into chunks, and each chunk is processed to answer questions. The results,
    including calculated F1 and EM scores, are written to an output file.

    Args:
        data (dict): The dataset containing the stories and questions.
        llm (LLM): The language model instance used for generating answers.
        output_file_path (str): Path to the output file where results will be stored.
        chunk_sizes (list): List of chunk sizes for text splitting.
        overlap_percentages (list): List of overlap percentages for text splitting.
        max_stories (int): Maximum number of stories to process.
        instruct_embedding_model_name (str): Name of the HuggingFace model for embeddings.
        instruct_embedding_model_kwargs (dict): Keyword arguments for the embedding model.
        instruct_embedding_encode_kwargs (dict): Encoding arguments for the embedding model.
        QA_CHAIN_PROMPT (PromptTemplate): The prompt template for the QA chain.

    Returns:
        None. The function writes the scores to the output file and prints the question, the retrieved
        sentence(s), the predicted answer, the actual answer, and the F1 score.
    """
    with open(output_file_path, 'w') as file:
        word_embed = HuggingFaceInstructEmbeddings(
            model_name=instruct_embedding_model_name,
            model_kwargs=instruct_embedding_model_kwargs,
            encode_kwargs=instruct_embedding_encode_kwargs
        )
        start_time = time.time()
        for chunk_size in chunk_sizes:
            print(f"\n{time.time()-start_time} Processing chunk size {chunk_size}:")
            last_time = time.time()
            for overlap_percentage in overlap_percentages:
                actual_overlap = int(chunk_size * overlap_percentage)
                print(f"\n{time.time()-start_time}\t{time.time()-last_time}\tOverlap [{overlap_percentage}] {actual_overlap}")
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=actual_overlap)
        
                for i, story in enumerate(data['data']):
                    if i >= max_stories:
                        break
                    now_time = time.time()
                    print(f"\n{now_time-start_time}\t{now_time-last_time}\t\tstory {i+1}: ", end='')
                    last_time = now_time
        
                    all_splits = text_splitter.split_text(story['text'])
                    vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed)
                    qa_chain = RetrievalQA.from_chain_type(
                        llm, 
                        retriever=vectorstore.as_retriever(), 
                        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
                        return_source_documents=True)
        
                    for j, question_data in enumerate(story['questions'], start=1):
                        # print(f"{time.time()-start_time}\t\t\tquestion {j}")
                        print('.', end='')
        
                        # TODO: embed query and perform similarity_search_by_vector() instead
                        question = question_data['q']
                        question_vector = word_embed.embed_query(question)
                        # docs = vectorstore.similarity_search(question)
                        docs = vectorstore.similarity_search_by_vector(question_vector)
                        answer_ranges = extract_ranges(question_data['consensus'])
                        
                        # Get the prediction from the model
                        result = qa_chain({"query": question})
                        
                        # Extract and print the retrieved sentence(s) if available in the result
                        retrieved_sentences = result.get('source_documents', [])
                        print("\nRetrieved Sentence(s):")
                        for sentence in retrieved_sentences:
                            print(sentence)
                        
                        # Check if the predicted answer is in the expected format (string)
                        predicted_answer = result['result']
                        if isinstance(predicted_answer, dict):
                            # If it's a dictionary, you need to adapt this part of the code to extract the answer string
                            predicted_answer = predicted_answer.get('answer', '')  # Assuming 'answer' is the key for the answer string
                        elif not isinstance(predicted_answer, str):
                            # If the answer is not a string and not a dictionary, log an error or handle it appropriately
                            print(f"Unexpected format for predicted answer: {predicted_answer}")
                            continue  # Skip to the next question
                        
                        # If there is an actual answer, get it from the story text using the character ranges
                        if answer_ranges:
                            actual_answer = story['text'][answer_ranges[0][0]:answer_ranges[0][1]]
                        else:
                            actual_answer = ""
                        
                        # Calculate the scores
                        em_score = calculate_em(predicted_answer, actual_answer)
                        f1_score_value = calculate_token_f1(predicted_answer, actual_answer)
                        # modified_f1 = calculate_modified_f1(predicted_answer, actual_answer)
                        file.write(f"{chunk_size}\t{overlap_percentage}\t{i}\t{j}\t{f1_score_value:.4f}\t{em_score:.2f}\n")
        
                        # Store the scores
                        em_results[(chunk_size, overlap_percentage)].append(em_score)
                        f1_results[(chunk_size, overlap_percentage)].append(f1_score_value)
                        
                        # Print results
                        print(f"\nQuestion: {question}")
                        print(f"Predicted Answer: {predicted_answer}")
                        print(f"Actual Answer: {actual_answer}")
                        print(f"F1 Score: {f1_score_value:.4f}")
                        
                    # delete object for memory
                    del qa_chain
                    del vectorstore
                    del all_splits
                    
                # delete splitter instance
                del text_splitter

### 4.1 Original Prompt

In [26]:
template_original = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use ten words maximum and keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_ORIGINAL = PromptTemplate.from_template(template_original)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_7.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model and the QA chain
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_ORIGINAL)

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
1702008618.7290194 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.003000020980834961	0.003000020980834961		story 1: .
Retrieved Sentence(s):
page_content='The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.' metadata={}
page_content='Kochar said his client was in Australia when the teen was raped and killed.\n\n\n\nPandher faces trial in the remaining 18 killings and could remain in custody, the attorney said.' metadata={}
page_content='NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."' metadata={}
page_content='Moninder Singh Pandher was sentenced to death by a lower court in February.' metadata={
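
The log line `Overlap [0.1] 20` above suggests that the overlap length is derived from the chunk size: 200 × 0.1 = 20. The sketch below shows that arithmetic together with a plain character-window splitter. `overlap_size` and `sliding_chunks` are hypothetical helpers written for illustration only; they are not the text splitter used inside `newsqa_loop_test`, which may split on sentence boundaries instead.

```python
def overlap_size(chunk_size: int, overlap_fraction: float) -> int:
    """Convert a fractional overlap (0.1 = 10%) into a number of characters."""
    return round(chunk_size * overlap_fraction)


def sliding_chunks(text: str, chunk_size: int, overlap_fraction: float) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    overlap = overlap_size(chunk_size, overlap_fraction)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


# Illustrative input: 500 digits, so that overlapping regions are easy to compare.
sample_text = "".join(str(i % 10) for i in range(500))
sample_chunks = sliding_chunks(sample_text, chunk_size=200, overlap_fraction=0.1)
print(overlap_size(200, 0.1), [len(c) for c in sample_chunks])
```

With `chunk_size=200` and a fraction of `0.1`, each window starts 180 characters after the previous one, so neighbouring chunks share 20 characters.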

### 4.2 Prompt Focusing on Detailed Analysis

In [35]:
template_1 = """When you embed each sentence, focus closely on the details and the overall context of the news story. Pay attention to who is involved, what exactly happened, the timing of the events, and where they took place. Also, consider why these events are significant. This detailed analysis will help you accurately answer questions that require a deep understanding of the news story. \n
Context: {context}
Question: {question}
Answer:"""
QA_CHAIN_PROMPT_1 = PromptTemplate.from_template(template_1)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_8.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model and the QA chain
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_1)

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
1702013031.6539967 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.0	0.0		story 1: .
Retrieved Sentence(s):
page_content='The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.' metadata={}
page_content='Kochar said his client was in Australia when the teen was raped and killed.\n\n\n\nPandher faces trial in the remaining 18 killings and could remain in custody, the attorney said.' metadata={}
page_content='NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."' metadata={}
page_content='Moninder Singh Pandher was sentenced to death by a lower court in February.' metadata={}

Question: What was the amount o

### 4.3 Prompt Highlighting Key Elements

In [36]:
# Template that asks the model to focus on the key elements of the story (who, what, when, where, why).
template_2 = """As you process the sentences, concentrate on the main events and the participants in the story. Make sure to identify the key players (who), the actions or events described (what), the timeline of these events (when), the locations involved (where), and the reasons or motives behind them (why). This focus will enable you to provide precise answers to questions that center around these crucial elements of the news.\n
Context: {context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_2 = PromptTemplate.from_template(template_2)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_new_1.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_2)

1702013457.3480058 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.0	0.0		story 1: .
Retrieved Sentence(s):
page_content='The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.' metadata={}
page_content='The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.' metadata={}
page_content='Kochar said his client was in Australia when the teen was raped and killed.\n\n\n\nPandher faces trial in the remaining 18 killings and could remain in custody, the attorney said.' metadata={}
page_content='Kochar said his client was in Australia when the teen was raped and killed.\n\n\n\nPandher faces trial in the remaining 18 killings and could remain in custody, the attorney said.' metadata={}

Question: What was the amount of children murdered?
Predicted Answer:  The article states

### 4.4 Prompt Emphasizing Narrative Structure

In [37]:
# Template that asks the model to focus on the narrative flow of the story (sequence of events and causality).
template_3 = """Your task is to capture the narrative flow of the news story, with an emphasis on the sequence of events and causality. Examine the connections between the participants (who), the series of events (what), their chronological order (when), the settings (where), and the motivations behind them (why). This approach will assist you in answering questions that depend on understanding how the story unfolds and the cause-and-effect relationships within it.\n
Context: {context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_3 = PromptTemplate.from_template(template_3)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_new_2.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_3)

1702013950.8635814 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.0	0.0		story 1: .
Retrieved Sentence(s):
page_content='The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.' metadata={}
page_content='The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.' metadata={}
page_content='The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.' metadata={}
page_content='Kochar said his client was in Australia when the teen was raped and killed.\n\n\n\nPandher faces trial in the remaining 18 killings and could remain in custody, the attorney said.' metadata={}

Question: What was the amount of children murdered?
Predicted Answer:  The article states that there were 9 victims, all of whom were chi
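
Each run writes one tab-separated line per question to its output file through the `file.write` call in `newsqa_loop_test`: chunk size, overlap, two loop indices (assumed here to be the story and question indices), F1, and EM. To compare the four prompts, the per-question scores can be averaged per configuration. The sketch below is one possible way to do that; `summarize_scores` is a hypothetical helper, and the two sample rows are made-up values for illustration, not results from this notebook.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize_scores(handle):
    """Average F1 and EM per (chunk_size, overlap) from tab-separated score lines."""
    f1_by_config = defaultdict(list)
    em_by_config = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 6:
            continue  # skip blank or malformed lines
        chunk_size, overlap, _story_idx, _question_idx, f1, em = row
        try:
            key = (int(chunk_size), float(overlap))
            f1_value, em_value = float(f1), float(em)
        except ValueError:
            continue  # skip a header line, if the file has one
        f1_by_config[key].append(f1_value)
        em_by_config[key].append(em_value)
    return {key: (mean(f1_by_config[key]), mean(em_by_config[key])) for key in f1_by_config}


# Made-up rows in the same column layout as the output files (illustration only).
sample_file = io.StringIO("200\t0.1\t0\t0\t0.5000\t0.00\n200\t0.1\t0\t1\t1.0000\t1.00\n")
summary = summarize_scores(sample_file)
for (chunk_size, overlap), (avg_f1, avg_em) in summary.items():
    print(f"chunk={chunk_size} overlap={overlap} mean F1={avg_f1:.4f} mean EM={avg_em:.2f}")
```

For a real comparison, the `io.StringIO` object would be replaced by `open(path)` for each of the four score files.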

In [16]:
# Reset vectorstore directly
Chroma().delete_collection()
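
The retrieved sentences in 4.3 and 4.4 contain verbatim repeats. This is consistent with the Chroma collection accumulating chunks across runs, although the notebook does not confirm the cause. Besides resetting the collection, duplicates can be filtered after retrieval. The sketch below is a minimal example of that idea; `Doc` is a hypothetical stand-in for a retrieved document that exposes `page_content`, and `dedupe_by_content` is an illustrative helper, not part of LangChain or Chroma.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document with a page_content field."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(docs):
    """Drop documents whose page_content was already seen, keeping retrieval order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Shortened versions of the sentences retrieved in 4.4: three repeats plus one distinct chunk.
retrieved = [
    Doc("The teen was one of 19 victims"),
    Doc("The teen was one of 19 victims"),
    Doc("The teen was one of 19 victims"),
    Doc("Kochar said his client was in Australia"),
]
unique_docs = dedupe_by_content(retrieved)
print([d.page_content for d in unique_docs])
```

Filtering after retrieval only hides the symptom: the four retrieval slots are still spent on repeats, so resetting the collection between runs remains the more direct fix.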

## 5. FLAN-T5
This experiment did not produce any interesting results.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

# Initialize the FLAN-T5 model and tokenizer
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.to('cpu')  # Ensure the model is on the correct device

# Assuming data is loaded into a variable named 'data'
story_data = data['data'][0]  # First story
story_text = story_data['text']

# Segment story text into sentences and embed each sentence
sentences = sent_tokenize(story_text)  # Splitting into sentences
hf_story_embs = HuggingFaceInstructEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cpu'},
    encode_kwargs={'normalize_embeddings': True},
    embed_instruction="Represent the story for retrieval:"
)
sentence_embs = hf_story_embs.embed_documents(sentences)

# Process each question
for question_data in story_data['questions']:
    question = question_data['q']

    hf_query_embs = HuggingFaceInstructEmbeddings(
        model_name=model_name,
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': True},
        query_instruction=f"""           
            Answer the question based on the context below. Keep the answer short. Respond "Unsure about answer" if not sure about the answer.
            Context: {story_text}
            Question: {question}
            Answer:
            """
    )
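    # Note: embed_documents() prepends embed_instruction rather than query_instruction, so the
    # query_instruction defined above is not applied here; embed_query(question) would apply it.
    question_emb = hf_query_embs.embed_documents([question])[0]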

    # Compute cosine similarity scores with each sentence
    scores = [torch.cosine_similarity(torch.tensor(sentence_emb).unsqueeze(0), torch.tensor(question_emb).unsqueeze(0))[0].item() for sentence_emb in sentence_embs]

    # Find the sentence with the highest score
    best_sentence_idx = scores.index(max(scores))
    best_sentence = sentences[best_sentence_idx]

    # Extract actual answer using the consensus range
    consensus = question_data['consensus']
    actual_answer = story_text[consensus['s']:consensus['e']]
    
    # Calculate F1 score
    f1_score = calculate_token_f1(best_sentence, actual_answer)
    
    # Print results
    print(f"Question: {question}")
    print(f"Predicted Answer: {best_sentence}")
    print(f"Actual Answer: {actual_answer}")
    print(f"F1 Score: {f1_score:.4f}")
    print()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\24075\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Question: What was the amount of children murdered?
Predicted Answer: Kochar said his client was in Australia when the teen was raped and killed.
Actual Answer: 19 
F1 Score: 0.0000

Question: When was Pandher sentenced to death?
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: February.

F1 Score: 0.1429

Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: rape and murder 
F1 Score: 0.0000

Question: who was acquitted
Predicted Answer: Moninder Singh Pandher was sentenced to death by a lower court in February.
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.3750

Question: who was sentenced
Predicted Answer: Kochar said his client was in Australia when the teen was raped and killed.
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.0000

Question: What was Moninder Singh Pandher acquitted for?
Predic
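
The low F1 scores above follow from returning a whole sentence as the predicted answer: recall can be perfect while precision is diluted by the extra tokens. The sketch below reproduces two of the printed values with a whitespace-token F1. `token_f1` is a hypothetical stand-in written for illustration; it is not necessarily identical to the `calculate_token_f1` function used in this notebook.

```python
from collections import Counter


def token_f1(predicted: str, actual: str) -> float:
    """Token-level F1 over lowercased, whitespace-separated tokens."""
    pred_tokens = predicted.lower().split()
    true_tokens = actual.lower().split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


sentence = "Moninder Singh Pandher was sentenced to death by a lower court in February."

# 1 shared token out of 13 predicted tokens: precision 1/13, recall 1
print(f"{token_f1(sentence, 'February.'):.4f}")  # 0.1429

# 3 shared tokens out of 13 predicted tokens: precision 3/13, recall 1
print(f"{token_f1(sentence, 'Moninder Singh Pandher '):.4f}")  # 0.3750
```

Even when the retrieved sentence contains the answer, a one-token answer inside a 13-token sentence cannot score above 2/14, so sentence-level retrieval alone caps F1 well below what a short extracted span could reach.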

## Reference
1. GPT4All Langchain Demo: https://gist.github.com/segestic/4822f3765147418fc084a250598c1fe6
2. Sematic - use GPT4ALL, FAISS & HuggingFace Embeddings for local context-enhanced question answering https://www.youtube.com/watch?v=y32vbJkabCw
3. Model used for InstructEmbeddings: [Hugging Face](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1)
4. Reset the Chroma db collection: https://github.com/langchain-ai/langchain/issues/
5. F1 score in NLP span-based Question Answering task: https://kierszbaumsamuel.medium.com/f1-score-in-nlp-span-based-qa-task-5b115a5e7d41
6. The Stanford Question Answering Dataset: https://rajpurkar.github.io/SQuAD-explorer/