# NewsQA Loop with HuggingFace
11/25/2023

Lixiao Yang

Updated from `6.2 NewsQA Loop_Ke.ipynb`. For running the loop on a large dataset, refer to [8.2 NewsQA Loop - v2.ipynb](https://github.com/lixiao-yang/DeepDelight/blob/main/Thread2/8.2%20NewsQA%20Loop%20-%20v2.ipynb).

This notebook provides new details about the NewsQA loop with `HuggingFaceInstructEmbeddings` and documents solutions to problems that occurred in previous versions of the code. The items below follow the order of the development log.

1. Combining Hugging Face's `InstructEmbedding` model [`multi-qa-MiniLM-L6-cos-v1`](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1) with the GPT4ALL `gpt4all-falcon-q4_0` model for semantic search and local data training.
2. Pure GPT4ALL `gpt4all-falcon-q4_0` model results for question 1: performs badly, possibly because the local training text (one story) is too small (see Section 5).
3. Pure Hugging Face model results for question 1: using cosine similarity, the model can locate the correct sentence for the actual answer (see Section 6).
4. GPT4ALL `gpt4all-falcon-q4_0` model with `multi-qa-MiniLM-L6-cos-v1` InstructEmbeddings (see Section 7).
5. Fixed the inconsistent F1 score issue.
6. Fixed the previous prompt issue.
7. Observed that changing the prompt affects the results; better prompting may be important for improving the F1 score.
8. Adding the InstructEmbedding method produced no obvious F1 score improvement in the test run.

### Problems and Solutions
- **Problem of incompatibility between Hugging Face and GPT4ALL**

    InstructEmbedding requires a Hugging Face model to perform the embedding, and most of these are pretrained models that do not support local training.

    **Solution**: Use `sentence-transformers/multi-qa-MiniLM-L6-cos-v1` from [Hugging Face](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1), a pretrained sentence-transformer model. It maps sentences and paragraphs to a 384-dimensional dense vector space and was designed for **semantic search**. `GPT4ALL` is a `llama.cpp`-based large language model used here through `langchain`, so `llama.cpp` is used to connect the two models. The Hugging Face model is used only for embedding and is combined with the GPT4ALL model to enable local data training, following the procedure below:
   1. Download the `GPT4ALL` model, or locate a local copy
   2. Download the `llama.cpp` 7B model, or locate a local copy
   3. Convert the `GPT4ALL` model
   4. Store the converted model
- **Problem of incompatibility of gpt4all-falcon-q4_0 model (Pending)**

    The current `gpt4all-falcon-q4_0` model is not compatible with `llama.cpp`.

    **Solution**: Switch to `gpt4all-lora-quantized.bin` and convert the model into a `llama.cpp`-compatible version; refer to the [GPT4All Langchain Demo](https://gist.github.com/segestic/4822f3765147418fc084a250598c1fe6) (PENDING)
- **Problem of dimensionality inconsistency**

    See Section 7.3.
- **Previous problem of inconsistent results for different runs**

    In the previous code, the F1 score differed between runs, so the results could not be replicated.

    **Solution**: Set a fixed seed for the GPT4ALL model, and also call `set_seed()` to set a universal seed before the model is initialized.
- **Previous prompt issue**

    In the previous code, different prompt templates generated the same answers.

    **Solution**: On revising the code, the `QA_CHAIN_PROMPT` parameter turned out to have been neglected in the previous version, so the custom prompt was never passed to the chain.
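
The seeding fix in the list above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: `seed_everything` and `sample_run` are hypothetical helpers, and NumPy and PyTorch are treated as optional. In the notebook itself, `transformers.set_seed()` seeds Python's `random`, NumPy, and PyTorch, while the GPT4ALL seed is passed separately through the `seed=` argument.

```python
import random


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed every random number generator that is available."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency
        torch.manual_seed(seed)
    except ImportError:
        pass


def sample_run(seed: int) -> list:
    """Stand-in for a model run: seeds first, then draws a few random numbers."""
    seed_everything(seed)
    return [random.random() for _ in range(3)]


# Two runs with the same seed produce identical draws.
print(sample_run(123) == sample_run(123))
```

The key point is ordering: the seed has to be set before the model is initialized and before any sampling happens, otherwise earlier random draws shift the sequence.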

In [4]:
#!pip install llama-cpp-python langchain sentence_transformers InstructorEmbedding pyllama transformers pyqt5 pyqtwebengine pyllamacpp --user

In [12]:
import logging
import time
from collections import Counter
from collections import defaultdict
import json
from langchain.llms import GPT4All
from pathlib import Path
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceInstructEmbeddings
from transformers import set_seed

## 1. Data Preparation

In [13]:
file_path='C:/NewsQA/combined-newsqa-data-story1.json'
data = json.loads(Path(file_path).read_text())

In [14]:
data

{'version': '1',
 'data': [{'text': 'NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."\n\n\n\nMoninder Singh Pandher was sentenced to death by a lower court in February.\n\n\n\nThe teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.\n\n\n\nThe Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.\n\n\n\nPandher and his domestic employee Surinder Koli were sentenced to death in February by a lower court for the rape and murder of the 14-year-old.\n\n\n\nThe high court upheld Koli\'s death sentence, Kochar said.\n\n\n\nThe two were arrested two years ago after body parts packed in plastic bags were found near their home in Noida, a New Delhi suburb. Their home was later dubbed a "house of horrors" by the Indian media.\n\n\n\nPand

## 2. GPT4ALL Model Transformation (PENDING)
gpt4all-lora-quantized.bin + llama.cpp 7B -> gpt4all-lora-q-converted.bin

Follow the instructions in Part 1 of [this Colab notebook](https://colab.research.google.com/drive/1csJ9lzewAaBVNSO9icJC5iT7xVrUbcg0?usp=sharing#scrollTo=Of6lqaPY0lR5) to get the converted model. Remember to also save the converted model to the local machine for local training.

### 2.1 LLaMA 7b Download

In [15]:
!python3 -m llama.download --model_size 7B --folder llama

C:\Users\24075\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe: Error while finding module specification for 'llama.download' (ModuleNotFoundError: No module named 'llama')


## 3. Helper Functions

In [15]:
# Helper function to calculate Exact Match (EM) score
def calculate_em(predicted, actual):
    return int(predicted == actual)

# Function to calculate the token-wise F1 score for text answers
def calculate_token_f1(predicted, actual):
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    common_tokens = Counter(predicted_tokens) & Counter(actual_tokens)
    num_same = sum(common_tokens.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(predicted_tokens)
    recall = 1.0 * num_same / len(actual_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Helper function to extract answer ranges from the consensus field
def extract_ranges(consensus):
    if 's' in consensus and 'e' in consensus:
        return [(consensus['s'], consensus['e'])]
    return []

def extract_answer_from_gpt4all_output(output, prompt_end="Answer:"):
    """
    Extracts the answer from the output of the GPT4ALL model.

    Parameters:
        output (str): The output string from the GPT4ALL model.
        prompt_end (str): The delimiter indicating the end of the prompt and the start of the answer.

    Returns:
        str: The extracted answer.
    """

    # Find the start of the answer
    start_idx = output.find(prompt_end)
    if start_idx == -1:
        return "Answer not found in output"

    # Extract the answer
    start_idx += len(prompt_end)
    answer = output[start_idx:]

    # Optional: Clean up the answer, remove any additional text after the answer
    # This can be tailored based on how your model generates responses
    end_characters = [".", "?", "!"]
    for end_char in end_characters:
        end_idx = answer.find(end_char)
        if end_idx != -1:
            answer = answer[:end_idx + 1]
            break

    return answer.strip()

def calculate_modified_f1(predicted, actual):
    """This approach gives a score that reflects the best match for each word in the predicted answer 
    with any word in the actual answer, thereby considering partial and inexact matches more effectively.
    """
    def inner_word_f1(predicted_word, actual_word):
        common_chars = Counter(predicted_word) & Counter(actual_word)
        num_common = sum(common_chars.values())
        if num_common == 0:
            return 0
        precision = num_common / len(predicted_word)
        recall = num_common / len(actual_word)
        if precision + recall == 0:
            return 0
        return 2 * precision * recall / (precision + recall)

    predicted_tokens = predicted.split()
    actual_tokens = actual.split()

    total_f1 = 0.0
    for predicted_word in predicted_tokens:
        word_f1_scores = [inner_word_f1(predicted_word, actual_word) for actual_word in actual_tokens]
        total_f1 += max(word_f1_scores, default=0)

    # Normalize the total F1 score by the number of words in the predicted answer
    normalized_total_f1 = total_f1 / len(predicted_tokens) if predicted_tokens else 0
    return normalized_total_f1
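
`calculate_token_f1` above compares raw whitespace tokens, so differences in case or trailing punctuation count as mismatches (for example, `Pandher.` versus `Pandher`). The sketch below shows one possible variant that normalizes both answers first. It assumes SQuAD-style normalization (lowercasing, removing punctuation and English articles) is acceptable for this dataset; `normalize_answer` and `normalized_token_f1` are hypothetical names that are not used elsewhere in this notebook, and the example strings are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> list:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def normalized_token_f1(predicted: str, actual: str) -> float:
    """Token-level F1 computed on normalized tokens."""
    pred_tokens = normalize_answer(predicted)
    gold_tokens = normalize_answer(actual)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs: a trailing period no longer costs a token match.
exact = normalized_token_f1("Moninder Singh Pandher.", "Moninder Singh Pandher ")
partial = normalized_token_f1("The court acquitted him of rape and murder.", "rape and murder ")
print(exact, partial)
```

Scores from this variant are not comparable with the ones reported in this notebook, which all use the unnormalized `calculate_token_f1`.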


## 4. Logging Settings

In [4]:
# Initialize logging
logging.basicConfig(level=logging.ERROR)

# Parameters
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)

# Initialize the language model (GPT4ALL)
llm_path = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
llm = GPT4All(model=llm_path, max_tokens=2048)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# Start time calculation
start_time = time.time()
print(f"{start_time} Started.")

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
1700878339.4739673 Started.


## 5. GPT4ALL Model Result

In [45]:
output_file_path = "results/scores_2.txt"

# Initialize HuggingFace Instruct Embeddings
word_embed = HuggingFaceInstructEmbeddings(
    model_name=instruct_embedding_model_name,
    model_kwargs=instruct_embedding_model_kwargs,
    encode_kwargs=instruct_embedding_encode_kwargs
)

with open(output_file_path, 'w') as file:
    for chunk_size in chunk_sizes:
        print(f"\n{time.time()-start_time} Processing chunk size {chunk_size}:")
        last_time = time.time()

        for overlap_percentage in overlap_percentages:
            print(f"\n{time.time()-start_time}\t{time.time()-last_time}\tOverlap [{overlap_percentage}]")
            last_time = time.time()

            i = 0
            for story in data['data']:
                i += 1
                if i > max_stories:
                    break
                now_time = time.time()
                print(f"\n{now_time-start_time}\t{now_time-last_time}\t\tstory {i}: ", end='')
                last_time = now_time

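                # create_qa_chain() is not defined in this notebook; Section 7.0 builds the
                # chain and vector store inline instead.
                qa_chain, vectorstore = create_qa_chain(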
                    story_text=story['text'],
                    chunk_size=chunk_size,
                    overlap_percentage=overlap_percentage,
                    llm=llm,
                    word_embed=word_embed
                )

                j = 0
                for question_data in story['questions']:
                    j += 1
                    print('.', end='')

                    question = question_data['q']
                    question_vector = word_embed.embed_query(question)
                    docs = vectorstore.similarity_search_by_vector(question_vector)
                    answer_ranges = extract_ranges(question_data['consensus'])

                    # Get the prediction from the model
                    result = qa_chain({"query": question})
                    predicted_answer = result['result']
                    if isinstance(predicted_answer, dict):
                        predicted_answer = predicted_answer.get('answer', '')
                    elif not isinstance(predicted_answer, str):
                        print(f"Unexpected format for predicted answer: {predicted_answer}")
                        continue
                    actual_answer = story['text'][answer_ranges[0][0]:answer_ranges[0][1]] if answer_ranges else ""

                    # Calculate the scores
                    em_score = calculate_em(predicted_answer, actual_answer)
                    f1_score_value = calculate_token_f1(predicted_answer, actual_answer)
                    file.write(f"{chunk_size}\t{overlap_percentage}\t{i}\t{j}\t{f1_score_value:.4f}\t{em_score:.2f}\n")

                    em_results[(chunk_size, overlap_percentage)].append(em_score)
                    f1_results[(chunk_size, overlap_percentage)].append(f1_score_value)

                    # Print results (reuses the F1 score computed above)
                    print(f"Question: {question}")
                    print(f"Predicted Answer: {predicted_answer}")
                    print(f"Actual Answer: {actual_answer}")
                    print(f"F1 Score: {f1_score_value:.4f}")
                    print()

                del qa_chain


load INSTRUCTOR_Transformer
max_seq_length  512

165.83870601654053 Processing chunk size 200:

165.83870601654053	0.0	Overlap [0.1]

165.83870601654053	0.0		story 1: .Question: What was the amount of children murdered?
Predicted Answer:  The amount of children murdered is not provided in the given context.
Actual Answer: 19 
F1 Score: 0.0000

.Question: When was Pandher sentenced to death?
Predicted Answer:  Pandh
Actual Answer: February.

F1 Score: 0.0000

.Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer:  The context does not provide information on the specific crime for which Moninder Singh Pandher was acquitted.
Actual Answer: rape and murder 
F1 Score: 0.0000

.Question: who was acquitted
Predicted Answer:  The person who was acquitted is Moninder Singh Pandh
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.3333

.Question: who was sentenced
Predicted Answer:  The high court upheld the sentence of the person who was sentenced to death.
Actual

## 6. Hugging Face Pretrained Model Result

In [13]:
import torch
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
from collections import Counter

# Initialize the HuggingFaceInstructEmbeddings model
model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}

# Assuming data is loaded into a variable named 'data'
story_data = data['data'][0]  # First story
story_text = story_data['text']

# Split story text into words and embed each word
words = story_text.split()  # Splitting into words
hf_story_embs = HuggingFaceInstructEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    embed_instruction="Represent the story for retrieval:"
)
word_embs = hf_story_embs.embed_documents(words)

# Process each question
for question_data in story_data['questions']:
    question = question_data['q']

    hf_query_embs = HuggingFaceInstructEmbeddings(
        model_name=model_name,
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs,
        query_instruction="Use the following pieces of context to answer the question at the end. Use ten words maximum and keep the answer as concise as possible.Find the shortest possible answer:"
    )
    question_emb = hf_query_embs.embed_documents([question])[0]

    # Compute cosine similarity scores with each word
    scores = [torch.cosine_similarity(torch.tensor(word_emb).unsqueeze(0), torch.tensor(question_emb).unsqueeze(0))[0].item() for word_emb in word_embs]

    # Find the sequence of 5 consecutive words with the highest cumulative score
    best_sequence_score = float('-inf')
    best_sequence = []
    for i in range(len(scores) - 4):
        current_sequence_score = sum(scores[i:i+5])
        if current_sequence_score > best_sequence_score:
            best_sequence_score = current_sequence_score
            best_sequence = words[i:i+5]

    best_sequence_text = ' '.join(best_sequence)

    # Extract actual answer using the consensus range
    consensus = question_data['consensus']
    actual_answer = story_text[consensus['s']:consensus['e']]
    
    # Calculate F1 score
    f1_score = calculate_token_f1(best_sequence_text, actual_answer)
    
    # Print results
    print(f"Question: {question}")
    print(f"Predicted Answer: {best_sequence_text}")
    print(f"Actual Answer: {actual_answer}")
    print(f"F1 Score: {f1_score:.4f}")
    print()


load INSTRUCTOR_Transformer
max_seq_length  512
load INSTRUCTOR_Transformer
max_seq_length  512
Question: What was the amount of children murdered?
Predicted Answer: of 19 victims -- children
Actual Answer: 19 
F1 Score: 0.3333

load INSTRUCTOR_Transformer
max_seq_length  512
Question: When was Pandher sentenced to death?
Predicted Answer: Pandher was sentenced to death
Actual Answer: February.

F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer: acquitted Moninder Singh Pandher, his
Actual Answer: rape and murder 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: who was acquitted
Predicted Answer: Allahabad high court has acquitted
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.0000

load INSTRUCTOR_Transformer
max_seq_length  512
Question: who was sentenced
Predicted Answer: was sentenced to death by
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.0000
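
The cell above scores every story word against the question and then keeps the five consecutive words with the highest summed cosine similarity. The sketch below isolates that search using plain Python lists. The two-dimensional vectors are made-up toy values rather than real embeddings, and `cosine` and `best_window` are hypothetical helpers. It keeps a running sum instead of re-summing each window, which gives the same selection with less work on long stories.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_window(scores, width=5):
    """Return (start_index, total) of the highest-scoring run of `width` scores."""
    if len(scores) < width:
        return 0, sum(scores)
    total = sum(scores[:width])
    best_start, best_total = 0, total
    for start in range(1, len(scores) - width + 1):
        # Slide the window: add the entering score, drop the leaving one.
        total += scores[start + width - 1] - scores[start - 1]
        if total > best_total:  # strict '>' keeps the first best window on ties
            best_start, best_total = start, total
    return best_start, best_total


# Toy data, purely illustrative
words = ["a", "b", "c", "d", "e", "f", "g"]
word_vecs = [(1, 0), (0, 1), (1, 1), (1, 1), (1, 1), (0, 1), (1, 0)]
query_vec = (1, 1)
scores = [cosine(vec, query_vec) for vec in word_vecs]
start, total = best_window(scores, width=3)
print(words[start:start + 3])
```

Because the window always returns a fixed number of words, the predicted span usually contains extra tokens around a short answer such as `19`, which caps precision and therefore the F1 score.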


## 7. GPT4ALL with Hugging Face Instruct Embeddings

### 7.0 Question Answering Loop Encapsulation
Encapsulate the loop in a function to simplify the code in the following sections.
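
The function below splits each story with `RecursiveCharacterTextSplitter` and converts the overlap fraction into characters with `int(chunk_size * overlap_percentage)`, so a chunk size of 200 with an overlap of 0.1 gives 20 characters. As a rough mental model only, the following sketch uses fixed-width character windows. The real splitter prefers to break on separators such as paragraph breaks and spaces, so its chunks will differ. `fixed_width_chunks` is a hypothetical helper and the text is a placeholder.

```python
def fixed_width_chunks(text: str, chunk_size: int, overlap_fraction: float) -> list:
    """Split text into character windows that overlap by a fraction of chunk_size."""
    overlap = int(chunk_size * overlap_fraction)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Placeholder text: 500 digits, so overlapping regions are easy to compare.
text = "".join(str(i % 10) for i in range(500))
chunks = fixed_width_chunks(text, chunk_size=200, overlap_fraction=0.1)
print([len(chunk) for chunk in chunks])
print(chunks[0][-20:] == chunks[1][:20])
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that an answer is cut in half at a chunk boundary.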

In [16]:
def newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name, instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT):
    """
    Processes a collection of stories, answering questions using a QA chain based on GPT4ALL language model and
    HuggingFaceInstructEmbeddings.

    For each story, the text is split into chunks, and each chunk is processed to answer questions. The results,
    including calculated F1 and EM scores, are written to an output file.

    Args:
        data (dict): The dataset containing the stories and questions.
        llm (LLM): The language model instance used for generating answers.
        output_file_path (str): Path to the output file where results will be stored.
        chunk_sizes (list): List of chunk sizes for text splitting.
        overlap_percentages (list): List of overlap percentages for text splitting.
        max_stories (int): Maximum number of stories to process.
        instruct_embedding_model_name (str): Name of the HuggingFace model for embeddings.
        instruct_embedding_model_kwargs (dict): Keyword arguments for the embedding model.
        instruct_embedding_encode_kwargs (dict): Encoding arguments for the embedding model.
        QA_CHAIN_PROMPT (PromptTemplate): The prompt template for the QA chain.

    Returns:
        None. The function writes the scores to the output file and prints each question, predicted answer,
        actual answer, and F1 score.
    """
    with open(output_file_path, 'w') as file:
        word_embed = HuggingFaceInstructEmbeddings(
            model_name=instruct_embedding_model_name,
            model_kwargs=instruct_embedding_model_kwargs,
            encode_kwargs=instruct_embedding_encode_kwargs
        )
        start_time = time.time()
        for chunk_size in chunk_sizes:
            print(f"\n{time.time()-start_time} Processing chunk size {chunk_size}:")
            last_time = time.time()
            for overlap_percentage in overlap_percentages:
                actual_overlap = int(chunk_size * overlap_percentage)
                print(f"\n{time.time()-start_time}\t{time.time()-last_time}\tOverlap [{overlap_percentage}] {actual_overlap}")
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=actual_overlap)
        
                for i, story in enumerate(data['data']):
                    if i >= max_stories:
                        break
                    now_time = time.time()
                    print(f"\n{now_time-start_time}\t{now_time-last_time}\t\tstory {i+1}: ", end='')
                    last_time = now_time
        
                    all_splits = text_splitter.split_text(story['text'])
                    vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed)
                    qa_chain = RetrievalQA.from_chain_type(
                        llm, 
                        retriever=vectorstore.as_retriever(), 
                        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
                        return_source_documents=True)
        
                    for j, question_data in enumerate(story['questions'], start=1):
                        # print(f"{time.time()-start_time}\t\t\tquestion {j}")
                        print('.', end='')
        
                        # TODO: embed query and perform similarity_search_by_vector() instead
                        question = question_data['q']
                        question_vector = word_embed.embed_query(question)
                        # docs = vectorstore.similarity_search(question)
                        docs = vectorstore.similarity_search_by_vector(question_vector)
                        answer_ranges = extract_ranges(question_data['consensus'])
                        
                        # Get the prediction from the model
                        result = qa_chain({"query": question})
                        
                        # Check if the predicted answer is in the expected format (string)
                        predicted_answer = result['result']
                        if isinstance(predicted_answer, dict):
                            # If it's a dictionary, you need to adapt this part of the code to extract the answer string
                            predicted_answer = predicted_answer.get('answer', '')  # Assuming 'answer' is the key for the answer string
                        elif not isinstance(predicted_answer, str):
                            # If the answer is not a string and not a dictionary, log an error or handle it appropriately
                            print(f"Unexpected format for predicted answer: {predicted_answer}")
                            continue  # Skip to the next question
                        # If there is an actual answer, get it from the story text using the character ranges
                        actual_answer = story['text'][answer_ranges[0][0]:answer_ranges[0][1]] if answer_ranges else ""
                        
                        # Calculate the scores
                        em_score = calculate_em(predicted_answer, actual_answer)
                        f1_score_value = calculate_token_f1(predicted_answer, actual_answer)
                        # modified_f1 = calculate_modified_f1(predicted_answer, actual_answer)
                        file.write(f"{chunk_size}\t{overlap_percentage}\t{i}\t{j}\t{f1_score_value:.4f}\t{em_score:.2f}\n")
        
                        # Store the scores
                        em_results[(chunk_size, overlap_percentage)].append(em_score)
                        f1_results[(chunk_size, overlap_percentage)].append(f1_score_value)
                        
                        # Print results
                        print(f"\nQuestion: {question}")
                        print(f"Predicted Answer: {predicted_answer}")
                        print(f"Actual Answer: {actual_answer}")
                        print(f"F1 Score: {f1_score_value:4f}")
                        # print(f"Modified F1: {modified_f1:.4f}")
                        
                    # delete object for memory
                    del qa_chain
                    del vectorstore
                    del all_splits
                    
                # delete splitter instance
                del text_splitter
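
Each row that `newsqa_loop_test` writes is tab-separated: chunk size, overlap, story index, question index, F1, and EM. The sketch below shows one way such a file could be summarized per configuration. `mean_scores` is a hypothetical helper. The sample rows reuse the first four F1 values printed for the first run in Section 7.1, with EM set to 0 because none of those predictions match the actual answer exactly, and the story index is 0 because the function writes the zero-based `i`.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_scores(lines):
    """Average F1 and EM per (chunk_size, overlap) from tab-separated score rows."""
    f1_by_config = defaultdict(list)
    em_by_config = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chunk_size, overlap, _story, _question, f1, em = row
        key = (int(chunk_size), float(overlap))
        f1_by_config[key].append(float(f1))
        em_by_config[key].append(float(em))
    return {key: (mean(f1_by_config[key]), mean(em_by_config[key])) for key in f1_by_config}


# In-memory stand-in for a scores file
sample = io.StringIO(
    "200\t0.1\t0\t1\t0.0000\t0.00\n"
    "200\t0.1\t0\t2\t0.2500\t0.00\n"
    "200\t0.1\t0\t3\t0.0000\t0.00\n"
    "200\t0.1\t0\t4\t0.6667\t0.00\n"
)
summary = mean_scores(sample)
for (chunk_size, overlap), (avg_f1, avg_em) in summary.items():
    print(f"chunk={chunk_size} overlap={overlap} mean F1={avg_f1:.4f} mean EM={avg_em:.2f}")
```

For a real run, an open file handle (for example `open("results/scores_7.txt")`) can be passed in place of the `StringIO` object.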

### 7.1 Original Prompt
The same configuration is run twice to check consistency between runs.

In [17]:
template_original = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use ten words maximum and keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_ORIGINAL = PromptTemplate.from_template(template_original)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as percentages (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_7.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model and the QA chain
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_ORIGINAL)

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
1701040141.5350428 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.003004312515258789	0.003004312515258789		story 1: .
Question: What was the amount of children murdered?
Predicted Answer:  The article states that there were 1,900 children murdered in India between 2010 and 2015.
Actual Answer: 19 
F1 Score: 0.000000
.
Question: When was Pandher sentenced to death?
Predicted Answer:  Pandher was sentenced to death in February.
Actual Answer: February.

F1 Score: 0.250000
.
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer:  Moninder Singh Pandher was accused of killing his wife.
Actual Answer: rape and murder 
F1 Score: 0.000000
.
Question: who was acquitted
Predicted Answer:  Moninder Singh Pandh
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.666667
.
Question: who was sentenced

In [19]:
############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as percentages (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_8.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model and the QA chain
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_ORIGINAL)

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
1701045500.3580875 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.002999544143676758	0.002999544143676758		story 1: .
Question: What was the amount of children murdered?
Predicted Answer:  The amount of children murdered is not provided in the given context.
Actual Answer: 19 
F1 Score: 0.000000
.
Question: When was Pandher sentenced to death?
Predicted Answer:  Pandher was sentenced to death in February.
Actual Answer: February.

F1 Score: 0.250000
.
Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer:  The court aquitted Moninder Singh Pandher of a crime.
Actual Answer: rape and murder 
F1 Score: 0.000000
.
Question: who was acquitted
Predicted Answer:  Moninder Singh Pandh
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.666667
.
Question: who was sentenced
Predicted Answer:  The

#### 7.1.1 Modified F1 Score (PENDING)
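
One property of the modified F1 score is worth noting before reading the output in this subsection: because it matches characters rather than whole words, an answer that only shares letters with the reference still receives credit. The sketch below is a condensed restatement of `calculate_modified_f1` from Section 3 (same logic, with the hypothetical names `char_f1` and `modified_f1`), applied to two predicted/actual pairs taken from the printed results.

```python
from collections import Counter


def char_f1(predicted_word: str, actual_word: str) -> float:
    """Character-overlap F1 between two words."""
    num_common = sum((Counter(predicted_word) & Counter(actual_word)).values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(predicted_word)
    recall = num_common / len(actual_word)
    return 2 * precision * recall / (precision + recall)


def modified_f1(predicted: str, actual: str) -> float:
    """Mean over predicted words of the best character-overlap F1 against any actual word."""
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    if not predicted_tokens:
        return 0.0
    best = [max((char_f1(p, a) for a in actual_tokens), default=0.0) for p in predicted_tokens]
    return sum(best) / len(predicted_tokens)


# A wrong answer and a nearly correct one, both from the printed results
wrong = modified_f1(" Moninder Singh Pandh", "rape and murder ")
close = modified_f1(" Moninder Singh Pandh", "Moninder Singh Pandher ")
print(f"{wrong:.4f} {close:.4f}")
```

The wrong answer scores 0.4762 largely because `Pandh` shares the letters `a`, `n`, and `d` with `and`, so the metric should be read with that inflation in mind.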

In [11]:
template_original = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use ten words maximum and keep the answer as concise as possible. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_ORIGINAL = PromptTemplate.from_template(template_original)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as percentages (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_newf1.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model and the QA chain
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_ORIGINAL)

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
1700966802.8688896 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.0019981861114501953	0.0019981861114501953		story 1: .Question: What was the amount of children murdered?
Predicted Answer:  The article states that there were 1,900 children murdered in India between 2010 and 2015.
Actual Answer: 19 
F1 Score: 0.0000
Modified F1: 0.0794

.Question: When was Pandher sentenced to death?
Predicted Answer:  Pandh
Actual Answer: February.

F1 Score: 0.0000
Modified F1: 0.1429

.Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer:  Moninder Singh Pandh
Actual Answer: rape and murder 
F1 Score: 0.0000
Modified F1: 0.4762

.Question: who was acquitted
Predicted Answer:  Moninder Singh Pandh
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.6667
Modified F1: 0.9444

.Question: who was sentenced
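
The log line `Overlap [0.1] 20` suggests that the overlap fraction is converted to an absolute size by multiplying it with the chunk size (200 × 0.1 = 20). The sketch below illustrates that arithmetic with a plain character-based sliding window. It is an assumption for illustration only: `overlap_size` and `split_with_overlap` are hypothetical helpers, and the notebook's own `text_splitter` may split differently.

```python
def overlap_size(chunk_size: int, overlap_fraction: float) -> int:
    """Convert an overlap fraction (0.1 = 10%) into an absolute size."""
    return int(round(chunk_size * overlap_fraction))


def split_with_overlap(text: str, chunk_size: int, overlap_fraction: float) -> list:
    """Split text into fixed-size character windows that share an overlap."""
    overlap = overlap_size(chunk_size, overlap_fraction)
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(overlap_size(200, 0.1))  # 20
```

With a chunk size of 200 and an overlap of 20, each window starts 180 characters after the previous one, so consecutive chunks share their boundary text.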

### 7.2 Revised Prompt
The same configuration is run twice (In [13] and In [14]) to check that the results are consistent between runs.

In [13]:
# New template that restricts the model to answering with words taken from the original text.
template_new = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use ten words maximum and keep the answer as concise as possible. 
Use only consecutive words that appeared in the context to answer questions. \n
Context: {context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT_NEW = PromptTemplate.from_template(template_new)

############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_new_1.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_NEW)

1700958240.5511186 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.0	0.0		story 1: .Question: What was the amount of children murdered?
Predicted Answer:  The amount of children murdered is not provided in the given context.
Actual Answer: 19 
F1 Score: 0.0000

.Question: When was Pandher sentenced to death?
Predicted Answer:  Pandh
Actual Answer: February.

F1 Score: 0.0000

.Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer:  The court aquitted Moninder Singh Pandher of the crime of murder.
Actual Answer: rape and murder 
F1 Score: 0.0000

.Question: who was acquitted
Predicted Answer:  Moninder Singh Pandh
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.6667

.Question: who was sentenced
Predicted Answer:  The high court upheld the sentence of the person who committed the crime.
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.0000

.Question: What was Moninder Singh Pandher 

In [14]:
############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Expressed as fractions of the chunk size (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_new_2.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# The following code would iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

newsqa_loop_test(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_NEW)

1700958771.2228317 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.0	0.0		story 1: .Question: What was the amount of children murdered?
Predicted Answer:  The amount of children murdered is not provided in the given context.
Actual Answer: 19 
F1 Score: 0.0000

.Question: When was Pandher sentenced to death?
Predicted Answer:  Pandh
Actual Answer: February.

F1 Score: 0.0000

.Question: The court aquitted Moninder Singh Pandher of what crime?
Predicted Answer:  The court aquitted Moninder Singh Pandher of the crime of murder.
Actual Answer: rape and murder 
F1 Score: 0.0000

.Question: who was acquitted
Predicted Answer:  Moninder Singh Pandh
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.6667

.Question: who was sentenced
Predicted Answer:  The high court upheld the sentence of the person who committed the crime.
Actual Answer: Moninder Singh Pandher 
F1 Score: 0.0000

.Question: What was Moninder Singh Pandher 
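
The two runs above can also be compared programmatically rather than by eye. The following is a minimal sketch that assumes each run's log is available as text in the format printed above; `extract_f1_scores` and `runs_match` are hypothetical helpers, and the layout of the files under `results/` is not shown here, so the pattern may need adjusting.

```python
import re

# Matches lines such as "F1 Score: 0.6667" but not "Modified F1: 0.0794".
F1_LINE = re.compile(r"^F1 Score:\s*([0-9]*\.?[0-9]+)\s*$", re.MULTILINE)


def extract_f1_scores(log_text: str) -> list:
    """Return every F1 score found in a run log, in order of appearance."""
    return [float(value) for value in F1_LINE.findall(log_text)]


def runs_match(log_a: str, log_b: str, tol: float = 1e-6) -> bool:
    """True if both logs contain the same number of scores and all agree."""
    scores_a = extract_f1_scores(log_a)
    scores_b = extract_f1_scores(log_b)
    return len(scores_a) == len(scores_b) and all(
        abs(a - b) <= tol for a, b in zip(scores_a, scores_b)
    )


# Excerpts copied from the two runs above.
run_1 = "Actual Answer: 19 \nF1 Score: 0.0000\n\nActual Answer: Moninder Singh Pandher \nF1 Score: 0.6667\n"
run_2 = "Actual Answer: 19 \nF1 Score: 0.0000\n\nActual Answer: Moninder Singh Pandher \nF1 Score: 0.6667\n"
print(runs_match(run_1, run_2))  # True
```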

### 7.3 Reset Embeddings Dimension Value in Vectorstore
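Problem: Dimensionality error. Chroma raises `InvalidDimensionException` because the 384-dimensional embeddings from `multi-qa-MiniLM-L6-cos-v1` do not match the existing collection, which has a dimensionality of 768.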
```
---------------------------------------------------------------------------
InvalidDimensionException                 Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18668\111973739.py in <module>
     57                 all_splits = text_splitter.split_text(story['text'])
     58                 # print(f"[after split]")
---> 59                 vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed)
     60                 # print(f"[after vector store]")
     61                 qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectorstore.as_retriever(), return_source_documents=True)

~\AppData\Roaming\Python\Python39\site-packages\langchain\vectorstores\chroma.py in from_texts(cls, texts, embedding, metadatas, ids, collection_name, persist_directory, client_settings, client, collection_metadata, **kwargs)
    574             **kwargs,
    575         )
--> 576         chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
    577         return chroma_collection
    578 

~\AppData\Roaming\Python\Python39\site-packages\langchain\vectorstores\chroma.py in add_texts(self, texts, metadatas, ids, **kwargs)
    233                 )
    234         else:
--> 235             self._collection.upsert(
    236                 embeddings=embeddings,
    237                 documents=texts,

~\AppData\Roaming\Python\Python39\site-packages\chromadb\api\models\Collection.py in upsert(self, ids, embeddings, metadatas, documents)
    296         )
...
--> 552             raise InvalidDimensionException(
    553                 f"Embedding dimension {dim} does not match collection dimensionality {collection['dimension']}"
    554             )

InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 768
```

Solution: Reset the Chroma db collection following this thread:
https://github.com/langchain-ai/langchain/issues/

```python
# If the Chroma directory cannot be found, reset the collection in code instead
from chromadb.errors import InvalidDimensionException
from langchain.vectorstores import Chroma

try:
    docsearch = Chroma.from_documents(documents=..., embedding=...)
except InvalidDimensionException:
    Chroma().delete_collection()
    docsearch = Chroma.from_documents(documents=..., embedding=...)
```

In [16]:
# Reset vectorstore directly
Chroma().delete_collection()
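
An alternative to deleting the collection is to avoid the clash altogether by giving each embedding model its own collection, so that 384-dimensional and 768-dimensional vectors never end up in the same one. This is only a sketch of that idea and was not part of the runs above. `collection_name_for` is a hypothetical helper, and the character and length limits it enforces reflect my understanding of Chroma's naming rules, so they should be checked against the installed version.

```python
import re


def collection_name_for(model_name: str, dimension: int) -> str:
    """Build a collection name from an embedding model name and its dimension.

    Assumed constraints: 3 to 63 characters, alphanumeric at both ends,
    and only alphanumerics, underscores, or hyphens in between.
    """
    base = model_name.split("/")[-1]  # drop the organisation prefix
    safe = re.sub(r"[^A-Za-z0-9_-]+", "-", base).strip("-_") or "emb"
    suffix = f"-{dimension}"
    # Truncate the model part so the dimension suffix always survives.
    safe = safe[: 63 - len(suffix)].rstrip("-_")
    return safe + suffix


name = collection_name_for("sentence-transformers/multi-qa-MiniLM-L6-cos-v1", 384)
print(name)  # multi-qa-MiniLM-L6-cos-v1-384

# Hypothetical usage with the vector store call from the traceback above:
# vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed,
#                                 collection_name=name)
```

`collection_name` appears in the `Chroma.from_texts` signature shown in the traceback. Whether separate collections are preferable to resetting depends on whether the old vectors are still needed.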

## Reference
1. GPT4All Langchain Demo: https://gist.github.com/segestic/4822f3765147418fc084a250598c1fe6
2. Sematic - use GPT4ALL, FAISS & HuggingFace Embeddings for local context-enhanced question answering https://www.youtube.com/watch?v=y32vbJkabCw
3. Model used for InstructEmbeddings: [Hugging Face](https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1)
4. Reset the Chroma db collection: https://github.com/langchain-ai/langchain/issues/
