# NewsQA Loop - V2
11/26/2023 \
Lixiao Yang

This notebook provides a revised loop over different chunk sizes and overlap sizes, combining the `HuggingFaceInstructEmbeddings` wrapper (loaded with `sentence-transformers/multi-qa-MiniLM-L6-cos-v1`) with the GPT4All `gpt4all-falcon-q4_0` model and LangChain. A small sample of text files is selected from the NewsQA dataset. The embedding parameters can be adjusted to suit the available computing resources.

For more details about the code issues and the development log, please refer to [8.1 NewsQA Loop with HuggingFace.ipynb](https://github.com/lixiao-yang/DeepDelight/blob/main/Thread2/8.1%20NewsQA%20Loop%20with%20HuggingFace.ipynb).
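
As a rough illustration of the loop's parameter grid, the sketch below enumerates every (chunk size, overlap fraction) pair and converts the fraction to a character count with the same `int(chunk_size * overlap_percentage)` expression used in the main function. The grid values and the helper name `overlap_in_chars` are made up for illustration; the run recorded in this notebook uses a single chunk size of 200 with a 0.1 overlap.

```python
from itertools import product

# Hypothetical grid values, chosen only for illustration.
chunk_sizes = [100, 200, 400]
overlap_fractions = [0.0, 0.1, 0.2]

def overlap_in_chars(chunk_size, overlap_fraction):
    """Convert an overlap fraction into a character count, truncating toward zero."""
    return int(chunk_size * overlap_fraction)

# One entry per (chunk size, overlap fraction) combination.
grid = [
    (size, frac, overlap_in_chars(size, frac))
    for size, frac in product(chunk_sizes, overlap_fractions)
]

for size, frac, overlap in grid:
    print(f"chunk_size={size}\toverlap_fraction={frac}\toverlap_chars={overlap}")
```

Because every pair triggers a full pass over the stories, the run time grows with the product of the two list lengths.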

In [41]:
#!pip install llama-cpp-python langchain sentence_transformers InstructorEmbedding pyllama transformers pyqt5 pyqtwebengine pyllamacpp --user

In [42]:
import logging
import time
from collections import Counter
from collections import defaultdict
import json
from langchain.llms import GPT4All
from pathlib import Path
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceInstructEmbeddings
from transformers import set_seed

## Data Preparation and Helper Functions

In [43]:
file_path='C:/NewsQA/combined-newsqa-data-story1.json'
data = json.loads(Path(file_path).read_text())
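
The rest of the notebook relies on only a few fields of the loaded JSON: a top-level `data` list and, for each story, its `text` plus a `questions` list whose entries carry the question string `q` and a `consensus` dict with character offsets `s` and `e`. The toy record below is invented for illustration (it is not taken from NewsQA), and the `noAnswer` marker is an assumed way of representing a question without an answer span. It shows how such offsets slice an answer out of the story text.

```python
# Toy record mirroring only the fields this notebook reads; the content is invented.
toy_data = {
    "data": [
        {
            "text": "The court met in February. It acquitted the defendant.",
            "questions": [
                {"q": "When did the court meet?", "consensus": {"s": 17, "e": 25}},
                # Assumed marker for a question without an answer span; any dict
                # lacking 's' and 'e' is treated the same way below.
                {"q": "Who was the judge?", "consensus": {"noAnswer": True}},
            ],
        }
    ]
}

def answer_span(story_text, consensus):
    """Return the answer substring, or "" when the consensus has no offsets."""
    if "s" in consensus and "e" in consensus:
        return story_text[consensus["s"]:consensus["e"]]
    return ""

for story in toy_data["data"]:
    for question in story["questions"]:
        answer = answer_span(story["text"], question["consensus"])
        print(f"{question['q']!r} -> {answer!r}")
```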

In [44]:
# Helper function to calculate Exact Match (EM) score
def calculate_em(predicted, actual):
    return int(predicted == actual)

# Function to calculate the token-wise F1 score for text answers
def calculate_token_f1(predicted, actual):
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    common_tokens = Counter(predicted_tokens) & Counter(actual_tokens)
    num_same = sum(common_tokens.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(predicted_tokens)
    recall = 1.0 * num_same / len(actual_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Helper function to extract answer ranges from the consensus field
def extract_ranges(consensus):
    if 's' in consensus and 'e' in consensus:
        return [(consensus['s'], consensus['e'])]
    return []

# def calculate_modified_f1(predicted, actual):
#     """This approach gives a score that reflects the best match for each word in the predicted answer 
#     with any word in the actual answer, thereby considering partial and inexact matches more effectively.
#     """
#     def inner_word_f1(predicted_word, actual_word):
#         common_chars = Counter(predicted_word) & Counter(actual_word)
#         num_common = sum(common_chars.values())
#         if num_common == 0:
#             return 0
#         precision = num_common / len(predicted_word)
#         recall = num_common / len(actual_word)
#         if precision + recall == 0:
#             return 0
#         return 2 * precision * recall / (precision + recall)

#     predicted_tokens = predicted.split()
#     actual_tokens = actual.split()

#     total_f1 = 0.0
#     for predicted_word in predicted_tokens:
#         word_f1_scores = [inner_word_f1(predicted_word, actual_word) for actual_word in actual_tokens]
#         total_f1 += max(word_f1_scores, default=0)

#     # Normalize the total F1 score by the number of words in the predicted answer
#     normalized_total_f1 = total_f1 / len(predicted_tokens) if predicted_tokens else 0
#     return normalized_total_f1
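
Because both metrics compare raw strings, leading spaces, trailing newlines, casing, and punctuation all count against a prediction. The sketch below first reproduces the token F1 arithmetic on a prediction and reference pair from the recorded output, then applies a SQuAD-style normalization (lowercase, strip punctuation and articles, collapse whitespace) before scoring. The normalization step and the names `normalize_answer`, `normalized_em`, and `normalized_f1` are assumptions added for illustration; they are not part of the notebook's pipeline.

```python
import re
import string
from collections import Counter

def token_f1(predicted, actual):
    """Token-level F1, same arithmetic as calculate_token_f1 above."""
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    num_same = sum((Counter(predicted_tokens) & Counter(actual_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(predicted_tokens)
    recall = num_same / len(actual_tokens)
    return 2 * precision * recall / (precision + recall)

def normalize_answer(text):
    """SQuAD-style normalization (an assumption, not used by the notebook)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def normalized_em(predicted, actual):
    return int(normalize_answer(predicted) == normalize_answer(actual))

def normalized_f1(predicted, actual):
    return token_f1(normalize_answer(predicted), normalize_answer(actual))

# Pair from the recorded output (the trailing newline is inferred from the blank
# line printed after the actual answer). One shared token out of 7 predicted and
# 1 actual gives precision = 1/7, recall = 1, F1 = 2 * (1/7) / (8/7) = 0.25.
raw_f1 = token_f1(" Pandher was sentenced to death in February.", "February.\n")

# Hypothetical short prediction: raw EM fails on whitespace alone, normalized EM passes.
raw_em = int(" February." == "February.\n")
norm_em = normalized_em(" February.", "February.\n")
print(raw_f1, raw_em, norm_em)
```

Normalization changes what the scores mean, so results computed with and without it should not be compared directly.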


## Main Function

In [50]:
def newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name, instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT):
    """
    Processes a collection of stories, answering questions using a QA chain based on GPT4ALL language model and
    HuggingFaceInstructEmbeddings.

    For each story, the text is split into chunks, which are embedded into a Chroma vector store. Each question
    is answered by a RetrievalQA chain over that store. The per-question F1 and EM scores are written to an
    output file.

    Args:
        data (dict): The dataset containing the stories and questions.
        llm (LLM): The language model instance used for generating answers.
        output_file_path (str): Path to the output file where results will be stored.
        chunk_sizes (list): List of chunk sizes for text splitting.
        overlap_percentages (list): List of overlap percentages for text splitting.
        max_stories (int): Maximum number of stories to process.
        instruct_embedding_model_name (str): Name of the HuggingFace model for embeddings.
        instruct_embedding_model_kwargs (dict): Keyword arguments for the embedding model.
        instruct_embedding_encode_kwargs (dict): Encoding arguments for the embedding model.
        QA_CHAIN_PROMPT (PromptTemplate): The prompt template for the QA chain.

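    Returns:
        None. Results are written to the output file, and the question, predicted answer, actual answer,
        and F1 score are printed for each question. Scores are also appended to the module-level
        `f1_results` and `em_results` dictionaries, which must exist before the function is called.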
    """
    with open(output_file_path, 'w') as file:
        word_embed = HuggingFaceInstructEmbeddings(
            model_name=instruct_embedding_model_name,
            model_kwargs=instruct_embedding_model_kwargs,
            encode_kwargs=instruct_embedding_encode_kwargs
        )
        start_time = time.time()
        for chunk_size in chunk_sizes:
            print(f"\n{time.time()-start_time} Processing chunk size {chunk_size}:")
            last_time = time.time()
            for overlap_percentage in overlap_percentages:
                actual_overlap = int(chunk_size * overlap_percentage)
                print(f"\n{time.time()-start_time}\t{time.time()-last_time}\tOverlap [{overlap_percentage}] {actual_overlap}")
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=actual_overlap)
        
                for i, story in enumerate(data['data']):
                    if i >= max_stories:
                        break
                    now_time = time.time()
                    print(f"\n{now_time-start_time}\t{now_time-last_time}\t\tstory {i+1}: ", end='')
                    last_time = now_time
        
                    all_splits = text_splitter.split_text(story['text'])
                    vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed)
                    qa_chain = RetrievalQA.from_chain_type(
                        llm, 
                        retriever=vectorstore.as_retriever(), 
                        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
                        return_source_documents=True)
        
                    for j, question_data in enumerate(story['questions'], start=1):
                        # print(f"{time.time()-start_time}\t\t\tquestion {j}")
                        print('.', end='')
        
                        # Embed the query and retrieve by vector. Note that `docs` is not passed to
                        # qa_chain, which runs its own retrieval through vectorstore.as_retriever().
                        question = question_data['q']
                        question_vector = word_embed.embed_query(question)
                        # docs = vectorstore.similarity_search(question)
                        docs = vectorstore.similarity_search_by_vector(question_vector)
                        answer_ranges = extract_ranges(question_data['consensus'])
                        
                        # Get the prediction from the model
                        result = qa_chain({"query": question})
                        
                        # Check if the predicted answer is in the expected format (string)
                        predicted_answer = result['result']
                        if isinstance(predicted_answer, dict):
                            # If it's a dictionary, you need to adapt this part of the code to extract the answer string
                            predicted_answer = predicted_answer.get('answer', '')  # Assuming 'answer' is the key for the answer string
                        elif not isinstance(predicted_answer, str):
                            # If the answer is not a string and not a dictionary, log an error or handle it appropriately
                            print(f"Unexpected format for predicted answer: {predicted_answer}")
                            continue  # Skip to the next question

                        # If there is an actual answer, get it from the story text using the character ranges
                        if answer_ranges:
                            actual_answer = story['text'][answer_ranges[0][0]:answer_ranges[0][1]]
                        else:
                            actual_answer = ""
                        
                        # Calculate the scores
                        em_score = calculate_em(predicted_answer, actual_answer)
                        f1_score_value = calculate_token_f1(predicted_answer, actual_answer)
                        # modified_f1 = calculate_modified_f1(predicted_answer, actual_answer)
                        file.write(f"{chunk_size}\t{overlap_percentage}\t{i}\t{j}\t{f1_score_value:.4f}\t{em_score:.2f}\n")
        
                        # Store the scores
                        em_results[(chunk_size, overlap_percentage)].append(em_score)
                        f1_results[(chunk_size, overlap_percentage)].append(f1_score_value)
                        
                        # Print results
                        print(f"\nQuestion: {question}")
                        print(f"Predicted Answer: {predicted_answer}")
                        print(f"Actual Answer: {actual_answer}")
                        print(f"F1 Score: {f1_score_value:4f}")
                        # print(f"Modified F1: {modified_f1:.4f}")
                        
                    # delete object for memory
                    del qa_chain
                    del vectorstore
                    del all_splits
                    
                # delete splitter instance
                del text_splitter
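
Each line the function writes is tab-separated: chunk size, overlap fraction, story index, question number, F1, and EM. A minimal sketch of how those rows could be averaged per configuration is shown below. The helper name `summarize_scores` and the sample rows are hypothetical, and the parser assumes the six-column layout produced by the `file.write` call above.

```python
import io
from collections import defaultdict

def summarize_scores(lines):
    """Average F1 and EM per (chunk_size, overlap) from tab-separated score rows."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # f1 sum, em sum, row count
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chunk_size, overlap, _story, _question, f1, em = line.split("\t")
        bucket = totals[(int(chunk_size), float(overlap))]
        bucket[0] += float(f1)
        bucket[1] += float(em)
        bucket[2] += 1
    return {
        key: {"mean_f1": f1_sum / n, "mean_em": em_sum / n, "n": n}
        for key, (f1_sum, em_sum, n) in totals.items()
    }

# Hypothetical rows in the same layout as the output file.
sample = io.StringIO(
    "200\t0.1\t0\t1\t0.0000\t0.00\n"
    "200\t0.1\t0\t2\t0.2500\t0.00\n"
    "200\t0.1\t0\t3\t1.0000\t1.00\n"
)
summary = summarize_scores(sample)
print(summary)
```

After a real run, the same function could be given the opened output file instead of the in-memory sample.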

## Prompt Template and Parameters

In [46]:
template_original = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use ten words maximum and keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_ORIGINAL = PromptTemplate.from_template(template_original)
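
To see roughly what the model receives, the sketch below fills the same two placeholders with plain `str.format`, used here as a dependency-free stand-in for `PromptTemplate`. The context and question strings are invented for illustration; in the chain, `{context}` is filled with the retrieved chunks, and joining them with blank lines is an assumption.

```python
# Same template wording as the cell above, repeated so the block is self-contained.
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use ten words maximum and keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""

# Invented stand-ins for retrieved chunks and a user question.
retrieved_chunks = [
    "The court met in February.",
    "It acquitted the defendant.",
]
question = "When did the court meet?"

# Substitute both placeholders to obtain the final prompt text.
prompt_text = template.format(context="\n\n".join(retrieved_chunks), question=question)
print(prompt_text)
```

The prompt ends with `Helpful Answer:`, so the model's completion is expected to begin directly with the answer text.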

Note: If `InvalidDimensionException` is raised, refer to *7.3 Reset Embeddings Dimension Value in Vectorstore* in file 8.1. You can also use
```python 
Chroma().delete_collection()
```
to refresh the Chroma DB collection.
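
One way to make dimension clashes less likely is to give each embedding configuration its own collection, so vectors from different models never share a store. The helper below is a hypothetical sketch that only builds such a name. Passing it through the `collection_name` argument of `Chroma.from_texts` is an assumption about how it would be wired in, and the character rules it enforces are a conservative guess at Chroma's naming constraints rather than a statement of them.

```python
import re
import uuid

def make_collection_name(model_name, chunk_size, overlap):
    """Build a short, unique collection name from the run configuration."""
    # Keep only letters and digits, replacing every other run of characters with a hyphen.
    slug = re.sub(r"[^a-zA-Z0-9]+", "-", model_name).strip("-").lower()
    prefix = f"{slug[:24].strip('-')}-c{chunk_size}-o{overlap}"
    # A random suffix keeps repeated runs from reusing an old collection.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"

name = make_collection_name("sentence-transformers/multi-qa-MiniLM-L6-cos-v1", 200, 20)
print(name)

# Assumed usage inside the loop (not executed here):
# vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed,
#                                 collection_name=name)
```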

In [48]:
# Chroma().delete_collection()
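
The parameter cell that follows writes scores to a path inside a `results/` directory, and `open(..., 'w')` does not create missing directories. A small sketch of creating the parent directory first is shown below. It runs against a temporary directory so nothing is written to the project, and the helper name `ensure_parent_dir` is hypothetical.

```python
import tempfile
from pathlib import Path

def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it does not exist yet."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

with tempfile.TemporaryDirectory() as tmp:
    # The "results" subdirectory does not exist until ensure_parent_dir creates it.
    target = ensure_parent_dir(Path(tmp) / "results" / "scores_test.txt")
    with open(target, "w") as handle:
        handle.write("200\t0.1\t0\t1\t0.2500\t0.00\n")
    created = target.parent.is_dir()
    written = target.read_text()
print(created, written)
```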

In [51]:
############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_test.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model (the QA chain is built per story inside newsqa_loop_test)
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# Iterate over the stories and questions and calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_ORIGINAL)

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
1701038768.163508 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.0	0.0		story 1: .
Question: What was the amount of children murdered?
Predicted Answer:  The article states that there were 1,900 children murdered in India between 2010 and 2015.
Actual Answer: 19 
F1 Score: 0.000000
.
Question: When was Pandher sentenced to death?
Predicted Answer:  Pandher was sentenced to death in February.
Actual Answer: February.

F1 Score: 0.250000
.
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer:  Moninder Singh Pandher was accused of killing his wife.
Actual Answer: rape and murder 
F1 Score: 0.000000
.
Question: who was acquitted
Predicted Answer:  Moninder Singh Pandh
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.666667
.
Question: who was sentenced
Predicted Answer:  The high court 