# NewsQA Loop
11/2/2023 \
Lixiao Yang

This notebook loops over different chunk sizes and overlap sizes using GPT4All (Falcon model) and LangChain. A small sample of text files is selected from the NewsQA dataset.

Because computing resources are limited, the results here cover only a small sample. For a full-fledged run, follow these steps:
1. Set `file_path` to 'combined-newsqa-data-v1.json' (80.2 MB). To compile the JSON file, follow the option 1 log and the [Docker method from NewsQA](https://github.com/Maluuba/newsqa#recommended-docker-set-up)
2. Update `chunk_sizes` and `overlap_percentages`
3. Revise the `calculate_em()` function
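
For step 2, a minimal sketch of how a larger grid might be laid out, assuming overlaps are given as fractions of the chunk size (the loop below computes `int(chunk_size * overlap_percentage)`). The specific values and the `configs` name are illustrative assumptions, not settings this notebook was run with.

```python
from itertools import product

# Illustrative values only; the notebook itself was run with [200] and [0].
chunk_sizes = [200, 400, 800]
overlap_percentages = [0, 0.1, 0.2]  # fractions of the chunk size (0.1 = 10%)

# Map each (chunk size, overlap fraction) pair to the character overlap
# that would be passed to the text splitter as chunk_overlap.
configs = {
    (size, frac): int(size * frac)
    for size, frac in product(chunk_sizes, overlap_percentages)
}

for (size, frac), overlap in configs.items():
    print(f"chunk_size={size}, overlap={frac:.0%} -> chunk_overlap={overlap}")
```

Each additional configuration re-embeds every story and re-runs every question, so the grid size multiplies the total run time.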

In [1]:
from langchain.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import GPT4AllEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA
from collections import Counter, defaultdict
import numpy as np
import json
from pathlib import Path
from langchain.prompts import PromptTemplate

In [2]:
file_path='./combined-newsqa-data-v1-small.json'
data = json.loads(Path(file_path).read_text())
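
The loop later in this notebook reads only a few fields from each record: `text`, `questions`, `q`, and the `s`/`e` character offsets in `consensus`. The sketch below shows that layout on a hypothetical miniature record; the `sample` dict, the `gold_answer` helper, and the `noAnswer` key are assumptions made for illustration, not part of the loaded file.

```python
# Hypothetical miniature record mirroring only the fields the notebook reads.
sample = {
    "data": [
        {
            "text": "The high court upheld Koli's death sentence, Kochar said.",
            "questions": [
                # 's' and 'e' are character offsets into 'text' (end exclusive).
                {"q": "Whose sentence was upheld?", "consensus": {"s": 22, "e": 28}},
                # Assumed shape for a question without a consensus span.
                {"q": "Who was arrested?", "consensus": {"noAnswer": True}},
            ],
        }
    ]
}


def gold_answer(story_text, consensus):
    """Slice the gold answer out of the story, or return '' if no span is given."""
    if "s" in consensus and "e" in consensus:
        return story_text[consensus["s"]:consensus["e"]]
    return ""


answers = [
    gold_answer(story["text"], question["consensus"])
    for story in sample["data"]
    for question in story["questions"]
]
print(answers)
```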

In [3]:
# data

{'version': '1',
 'data': [{'text': 'NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."\n\n\n\nMoninder Singh Pandher was sentenced to death by a lower court in February.\n\n\n\nThe teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.\n\n\n\nThe Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.\n\n\n\nPandher and his domestic employee Surinder Koli were sentenced to death in February by a lower court for the rape and murder of the 14-year-old.\n\n\n\nThe high court upheld Koli\'s death sentence, Kochar said.\n\n\n\nThe two were arrested two years ago after body parts packed in plastic bags were found near their home in Noida, a New Delhi suburb. Their home was later dubbed a "house of horrors" by the Indian media.\n\n\n\nPand

In [4]:
# for story in data['data']:
#     print(story['text'])
#     print("\n--- End of story ---\n")

NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy businessman facing the death sentence for the killing of a teen in a case dubbed "the house of horrors."



Moninder Singh Pandher was sentenced to death by a lower court in February.



The teen was one of 19 victims -- children and young women -- in one of the most gruesome serial killings in India in recent years.



The Allahabad high court has acquitted Moninder Singh Pandher, his lawyer Sikandar B. Kochar told CNN.



Pandher and his domestic employee Surinder Koli were sentenced to death in February by a lower court for the rape and murder of the 14-year-old.



The high court upheld Koli's death sentence, Kochar said.



The two were arrested two years ago after body parts packed in plastic bags were found near their home in Noida, a New Delhi suburb. Their home was later dubbed a "house of horrors" by the Indian media.



Pandher was not named a main suspect by investigators initially, but w

In [5]:
# Define chunk sizes and overlap percentages
chunk_sizes = [200]
overlap_percentages = [0]  # Expressed as fractions of the chunk size (0.1 = 10%)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Helper function to calculate Exact Match (EM) score
def calculate_em(predicted, actual):
    return int(predicted == actual)

# Function to calculate the token-wise F1 score for text answers
def calculate_token_f1(predicted, actual):
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    common_tokens = Counter(predicted_tokens) & Counter(actual_tokens)
    num_same = sum(common_tokens.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(predicted_tokens)
    recall = 1.0 * num_same / len(actual_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Helper function to extract answer ranges from the consensus field
def extract_ranges(consensus):
    if 's' in consensus and 'e' in consensus:
        return [(consensus['s'], consensus['e'])]
    return []


template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. 
Also provide me the source for your answer. Explain how to get the answer step by step.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
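
`calculate_em()` compares raw strings, so a generated sentence almost never equals a short gold span exactly. One possible revision, sketched here as an assumption rather than the notebook's method, normalizes both strings in the style commonly used for SQuAD-type evaluation (lowercasing, dropping punctuation and articles, collapsing whitespace). The names `normalize_answer`, `calculate_em_normalized`, and `calculate_em_contains` are hypothetical.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def calculate_em_normalized(predicted, actual):
    """Exact match after normalization."""
    return int(normalize_answer(predicted) == normalize_answer(actual))


def calculate_em_contains(predicted, actual):
    """Lenient variant: 1 if the normalized gold span occurs inside the prediction.

    This is a looser criterion than exact match and should be reported as such.
    """
    gold = normalize_answer(actual)
    return int(bool(gold) and gold in normalize_answer(predicted))


print(calculate_em_normalized("The Allahabad high court.", "Allahabad high court"))
print(calculate_em_contains("The answer is Moninder Singh Pandher.", "Moninder Singh Pandher"))
```

The same `normalize_answer` could be applied to both strings before `calculate_token_f1()` so that punctuation and casing do not count as token mismatches.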

In [6]:
# Initialize the language model and the QA chain
llm = GPT4All(model="C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin", max_tokens=2048)

# Iterate over the stories and questions for each configuration and collect the scores
for chunk_size in chunk_sizes:
    for overlap_percentage in overlap_percentages:
        actual_overlap = int(chunk_size * overlap_percentage)
        
        for story in data['data']:
            text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=actual_overlap)
            all_splits = text_splitter.split_text(story['text'])
            vectorstore = Chroma.from_texts(texts=all_splits, embedding=GPT4AllEmbeddings())
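            # Note: QA_CHAIN_PROMPT (defined above) is not passed to the chain here, so the default prompt is used.
            # To apply it, add chain_type_kwargs={"prompt": QA_CHAIN_PROMPT} to the call below.
            qa_chain = RetrievalQA.from_chain_type(llm, retriever=vectorstore.as_retriever(), return_source_documents=True)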

            for question_data in story['questions']:
                question = question_data['q']
                answer_ranges = extract_ranges(question_data['consensus'])
                
                # Get the prediction from the model
                result = qa_chain({"query": question})
                
                # Check if the predicted answer is in the expected format (string)
                predicted_answer = result['result']
                if isinstance(predicted_answer, dict):
                    # If it's a dictionary, you need to adapt this part of the code to extract the answer string
                    predicted_answer = predicted_answer.get('answer', '')  # Assuming 'answer' is the key for the answer string
                elif not isinstance(predicted_answer, str):
                    # If the answer is not a string and not a dictionary, log an error or handle it appropriately
                    print(f"Unexpected format for predicted answer: {predicted_answer}")
                    continue  # Skip to the next question
                # Slice the gold answer out of the story text using the consensus character range;
                # fall back to an empty string when no range is available
                if answer_ranges:
                    start, end = answer_ranges[0]
                    actual_answer = story['text'][start:end]
                else:
                    actual_answer = ""
                
                # Calculate the scores
                em_score = calculate_em(predicted_answer, actual_answer)
                f1_score_value = calculate_token_f1(predicted_answer, actual_answer)

                # Store the scores
                em_results[(chunk_size, overlap_percentage)].append(em_score)
                f1_results[(chunk_size, overlap_percentage)].append(f1_score_value)

# Calculate the average F1 and EM scores for each configuration
for config, scores in f1_results.items():
    avg_f1 = np.mean(scores)
    avg_em = np.mean(em_results[config])
    f1_results[config] = avg_f1
    em_results[config] = avg_em
    print(f"Chunk size {config[0]} with overlap {config[1]:.0%} - Average F1: {avg_f1:.2f}, EM: {avg_em:.2f}")

# Output the results
print(f1_results)
print(em_results)

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
Found model file at  C:\\\\Users\\\\24075\\\\.cache\\\\gpt4all\\ggml-all-MiniLM-L6-v2-f16.bin
Found model file at  C:\\\\Users\\\\24075\\\\.cache\\\\gpt4all\\ggml-all-MiniLM-L6-v2-f16.bin
Found model file at  C:\\\\Users\\\\24075\\\\.cache\\\\gpt4all\\ggml-all-MiniLM-L6-v2-f16.bin
Found model file at  C:\\\\Users\\\\24075\\\\.cache\\\\gpt4all\\ggml-all-MiniLM-L6-v2-f16.bin
Found model file at  C:\\\\Users\\\\24075\\\\.cache\\\\gpt4all\\ggml-all-MiniLM-L6-v2-f16.bin
Chunk size 200 with overlap 0% - Average F1: 0.14, EM: 0.00
defaultdict(<class 'list'>, {(200, 0): 0.13619093641153082})
defaultdict(<class 'list'>, {(200, 0): 0.0})
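
The printed dictionaries above are still `defaultdict(list)` objects whose per-question lists have been overwritten with averages. A minimal alternative sketch keeps the raw scores and builds the averages in a separate plain dict; the `summarize` helper and the score values below are hypothetical and are not results from this notebook.

```python
from collections import defaultdict
from statistics import mean


def summarize(per_question_scores):
    """Return {config: mean score}, leaving the per-question lists untouched."""
    return {
        config: mean(scores)
        for config, scores in per_question_scores.items()
        if scores  # skip configurations with no scored questions
    }


# Hypothetical per-question F1 scores for one (chunk size, overlap fraction) config.
f1_scores = defaultdict(list, {(200, 0.1): [0.0, 0.5, 0.25]})

avg_f1 = summarize(f1_scores)
for (chunk_size, overlap), value in avg_f1.items():
    # ':.0%' renders the overlap fraction as a percentage.
    print(f"Chunk size {chunk_size} with overlap {overlap:.0%} - Average F1: {value:.2f}")
```

Keeping the per-question lists makes it possible to inspect individual failures or compute other statistics later without re-running the model.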


In [None]:
# # Print out the F1 and EM Results
# print("\nF1 and EM Results:")
# for config in f1_results:
#     chunk_size, overlap_percentage = config
#     print(f"Chunk Size: {chunk_size}, Overlap Percentage: {overlap_percentage}%, F1: {f1_results[config]}, EM: {em_results[config]}")