# Revised Prompt NewsQA Loop
1/14/2024 \
Lixiao Yang

This notebook provides a revised loop over different chunk and overlap sizes using the GPT4All `gpt4all-falcon-q4_0` model and LangChain. A small sample of text files is selected from the NewsQA dataset. The notebook is derived from the results and code of notebook `8.2 NewsQA Loop - V2`; the full code is compiled in a separate file, `11.2 NewsQA_Experiment.py`.

Updates:
1. Revised the prompt to be more efficient and to give better question-answering results
2. Compiled the code into a separate file for full-dataset runs
3. Added precision and recall as evaluation metrics
4. Adjusted the JSON reading function to handle `isQuestionBad` and `isAnswerAbsent`, with the aim of improving recall
5. Modified the code so that experiment results are saved in `.csv` format
6. Made an initial attempt at a two-step structure: use the retrieved chunk/sentence as the new embedding input before running the question answering chain
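
Update 5 can be sketched with the standard `csv` module. The column names below are assumptions inferred from the order of the fields written by the loops in this notebook (chunk size, overlap, story index, question index, F1, precision, recall, EM), and the file name in the comment is hypothetical.

```python
import csv
import io

# Assumed column names, in the order the loops write their scores.
FIELDNAMES = ["chunk_size", "overlap", "story", "question",
              "f1", "precision", "recall", "em"]

def write_scores_csv(handle, rows):
    """Write score rows (dicts keyed by FIELDNAMES) with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Illustrative row only; the numbers are made up, not experiment results.
example_rows = [{"chunk_size": 200, "overlap": 0.1, "story": 0, "question": 1,
                 "f1": 0.5, "precision": 0.5, "recall": 0.5, "em": 0}]

buffer = io.StringIO()  # stands in for open("scores.csv", "w", newline="")
write_scores_csv(buffer, example_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing a header row makes the file loadable by name in a spreadsheet or dataframe library without having to remember the column order.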


For more details about code issues and the development log, please refer to `8.1 NewsQA Loop with HuggingFace.ipynb`.

In [18]:
%pip install --upgrade --quiet  gpt4all > /dev/null

Note: you may need to restart the kernel to use updated packages.


The system cannot find the path specified.


In [41]:
#!pip install llama-cpp-python langchain sentence_transformers InstructorEmbedding pyllama transformers pyqt5 pyqtwebengine pyllamacpp --user

In [3]:
import logging
import time
from collections import Counter
from collections import defaultdict
import json
from langchain.llms import GPT4All
from pathlib import Path
from langchain import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.huggingface import HuggingFaceInstructEmbeddings
from transformers import set_seed
from langchain.embeddings import GPT4AllEmbeddings

## Data Preparation and Helper Functions

In [4]:
file_path='C:/NewsQA/combined-newsqa-data-story2.json'
data = json.loads(Path(file_path).read_text())

In [5]:
# Helper function to calculate Exact Match (EM) score
def calculate_em(predicted, actual):
    return int(predicted == actual)

# Modified function to calculate the token-wise F1 score and return precision and recall
def calculate_token_f1(predicted, actual):
    predicted_tokens = predicted.split()
    actual_tokens = actual.split()
    common_tokens = Counter(predicted_tokens) & Counter(actual_tokens)
    num_same = sum(common_tokens.values())

    if num_same == 0:
        return 0, 0, 0  # Return zero precision, recall, and F1 score

    precision = 1.0 * num_same / len(predicted_tokens)
    recall = 1.0 * num_same / len(actual_tokens)
    f1 = (2 * precision * recall) / (precision + recall)

    return f1, precision, recall

# Helper function to extract answer ranges from the consensus field
def extract_ranges(consensus):
    if 's' in consensus and 'e' in consensus:
        return [(consensus['s'], consensus['e'])]
    return []
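
A worked example may help when reading the scores. The sketch below repeats the token-overlap arithmetic of `calculate_token_f1` in a self-contained form and adds a SQuAD-style normalization step (lowercasing, dropping punctuation and articles). The normalization is an assumption added for illustration; the helpers above compare raw strings, and `normalize_answer` and `token_prf` are hypothetical names.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_prf(predicted, actual):
    """Return (f1, precision, recall) from whitespace-token overlap."""
    pred_tokens, gold_tokens = predicted.split(), actual.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall), precision, recall

predicted = "The Drug Enforcement Administration."
actual = "Drug Enforcement Administration"

# Raw strings: "The" and "Administration." do not match any gold token.
raw_scores = token_prf(predicted, actual)

# Normalized strings: both sides reduce to the same three tokens.
norm_scores = token_prf(normalize_answer(predicted), normalize_answer(actual))
norm_em = int(normalize_answer(predicted) == normalize_answer(actual))
```

On the raw strings only two of the four predicted tokens match, so precision is 2/4, recall is 2/3, and F1 is 4/7. After normalization all three scores are 1.0 and the exact-match check succeeds.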

### Main Function

In [9]:
def newsqa_loop_oldest(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name, instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT):
    with open(output_file_path, 'w') as file:
        word_embed = HuggingFaceInstructEmbeddings(
            model_name=instruct_embedding_model_name,
            model_kwargs=instruct_embedding_model_kwargs,
            encode_kwargs=instruct_embedding_encode_kwargs
        )
        start_time = time.time()
        for chunk_size in chunk_sizes:
            print(f"\n{time.time()-start_time} Processing chunk size {chunk_size}:")
            last_time = time.time()
            for overlap_percentage in overlap_percentages:
                actual_overlap = int(chunk_size * overlap_percentage)
                print(f"\n{time.time()-start_time}\t{time.time()-last_time}\tOverlap [{overlap_percentage}] {actual_overlap}")
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=actual_overlap)
        
                for i, story in enumerate(data['data']):
                    if i >= max_stories:
                        break
                    now_time = time.time()
                    print(f"\n{now_time-start_time}\t{now_time-last_time}\t\tstory {i+1}: ", end='')
                    last_time = now_time
        
                    all_splits = text_splitter.split_text(story['text'])
                    vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed)
                    qa_chain = RetrievalQA.from_chain_type(
                        llm, 
                        retriever=vectorstore.as_retriever(), 
                        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
                        return_source_documents=True)
        
                    for j, question_data in enumerate(story['questions']):
                        if question_data['isAnswerAbsent']:
                        # Skip this question because an answer is absent
                            continue

                        j += 1
                        # Extract and print the isQuestionBad number
                        is_question_bad = question_data.get('isQuestionBad', 0)
                        print(f"\nisQuestionBad: {is_question_bad}")

                        question = question_data['q']
                        consensus = question_data['consensus']
                        
                        # Check if there is a consensus answer and extract it
                        if 's' in consensus and 'e' in consensus:
                            actual_answer = story['text'][consensus['s']:consensus['e']]
                        else:
                            actual_answer = "No consensus answer"
                        
                        # Embed the question and perform similarity search
                        question_vector = word_embed.embed_query(question)
                        docs = vectorstore.similarity_search_by_vector(question_vector)

                        # Get the prediction from the model
                        result = qa_chain({"query": question})
                        
                        # Extract and process the predicted answer
                        predicted_answer = result['result'] if isinstance(result['result'], str) else ""

                        # Calculate the F1 score, precision, and recall
                        f1_score_value, precision, recall = calculate_token_f1(predicted_answer, actual_answer)
                        em_score = calculate_em(predicted_answer, actual_answer)

                        # Write the scores to the file
                        file.write(f"{chunk_size}\t{overlap_percentage}\t{i}\t{j}\t{f1_score_value:.4f}\t{precision:.4f}\t{recall:.4f}\t{em_score:.2f}\n")

                        # Store the scores
                        em_results[(chunk_size, overlap_percentage)].append(em_score)
                        f1_results[(chunk_size, overlap_percentage)].append(f1_score_value)
                        # Consider storing precision and recall as well if needed

                        # Print results
                        print(f"\nQuestion: {question}")
                        print(f"Retrieved Sentences: {[doc.page_content for doc in docs]}")
                        print(f"Predicted Answer: {predicted_answer}")
                        print(f"Actual Answer: {actual_answer}")
                        print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1_score_value:.4f}")

                    # Cleanup
                    del qa_chain
                    del vectorstore
                    del all_splits
                    
                del text_splitter
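
`newsqa_loop_oldest` appends each score to the module-level `em_results` and `f1_results` dictionaries, keyed by `(chunk_size, overlap_percentage)`. Below is a minimal sketch of how those lists could be reduced to one mean per configuration, assuming the dictionaries are created before the loop runs. `summarize` is a hypothetical helper, and the values are placeholders, not experiment results.

```python
from collections import defaultdict
from statistics import mean

def summarize(results):
    """Map each (chunk_size, overlap) key to the mean of its score list."""
    return {key: mean(scores) for key, scores in results.items() if scores}

# Placeholder scores standing in for what the loop would append.
f1_results = defaultdict(list)
f1_results[(200, 0.1)].extend([1.0, 0.5, 0.0])
f1_results[(400, 0.1)].extend([0.25, 0.75])

summary = summarize(f1_results)
for (chunk_size, overlap), value in sorted(summary.items()):
    print(f"chunk={chunk_size}\toverlap={overlap}\tmean F1={value:.4f}")
```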


In [22]:
def newsqa_loop(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name, instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT):
    with open(output_file_path, 'w') as file:
        word_embed = HuggingFaceInstructEmbeddings(
            model_name=instruct_embedding_model_name,
            model_kwargs=instruct_embedding_model_kwargs,
            encode_kwargs=instruct_embedding_encode_kwargs
        )
        start_time = time.time()
        for chunk_size in chunk_sizes:
            print(f"\n{time.time()-start_time} Processing chunk size {chunk_size}:")
            last_time = time.time()
            for overlap_percentage in overlap_percentages:
                actual_overlap = int(chunk_size * overlap_percentage)
                print(f"\n{time.time()-start_time}\t{time.time()-last_time}\tOverlap [{overlap_percentage}] {actual_overlap}")
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=actual_overlap)
        
                for i, story in enumerate(data['data']):
                    if i >= max_stories:
                        break
                    now_time = time.time()
                    print(f"\n{now_time-start_time}\t{now_time-last_time}\t\tstory {i+1}: ", end='')
                    last_time = now_time
        
                    all_splits = text_splitter.split_text(story['text'])
                    vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed)
                    qa_chain = RetrievalQA.from_chain_type(
                                llm, 
                                retriever=vectorstore.as_retriever(), 
                                chain_type="stuff",
                                verbose=True,
                                chain_type_kwargs={
                                    "prompt": QA_CHAIN_PROMPT,
                                    "verbose": True},
                                return_source_documents=True)
                    
                    chunk_boundaries = []
                    start_index = 0
                    for split in all_splits:
                        end_index = start_index + len(split)
                        chunk_boundaries.append((start_index, end_index))
                        start_index = end_index


                    for j, question_data in enumerate(story['questions']):
                        if question_data['isAnswerAbsent']:
                            continue  # Skip this question because an answer is absent
                        
                        # Extract and print the isQuestionBad number
                        is_question_bad = question_data.get('isQuestionBad', 0)

                        question = question_data['q']
                        consensus = question_data['consensus']
                        
                        # Check if there is a consensus answer and extract it
                        if 's' in consensus and 'e' in consensus:
                            actual_answer = story['text'][consensus['s']:consensus['e']]
                            answer_chunk_index = next((index for index, (start, end) in enumerate(chunk_boundaries) if consensus['s'] >= start and consensus['e'] <= end), None)
                            if answer_chunk_index is None:
                                # The answer does not fall inside a single chunk; skip it
                                # rather than reuse a stale (or undefined) context_for_qa
                                continue
                            context_for_qa = all_splits[answer_chunk_index]
                        else:
                            continue  # No consensus answer, skip to the next question

                        # Get the prediction from the model
                        result = qa_chain({"context": context_for_qa, "query": question})
                        
                        # Extract and process the predicted answer
                        predicted_answer = result['result'] if isinstance(result['result'], str) else ""

                        # Calculate the F1 score, precision, and recall
                        f1_score_value, precision, recall = calculate_token_f1(predicted_answer, actual_answer)
                        em_score = calculate_em(predicted_answer, actual_answer)

                        # Write the scores to the file
                        file.write(f"{chunk_size}\t{overlap_percentage}\t{i}\t{j}\t{f1_score_value:.4f}\t{precision:.4f}\t{recall:.4f}\t{em_score:.2f}\n")

                        # Print results
                        print(f"\nQuestion: {question}")
                        print(f"isQuestionBad: {is_question_bad}")
                        print(f"Context Used: {context_for_qa}")
                        print(f"Predicted Answer: {predicted_answer}")
                        print(f"Actual Answer: {actual_answer}")
                        print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1_score_value:.4f}")
                    

                    # Cleanup
                    del qa_chain
                    del vectorstore
                    del all_splits

                # End of the story loop
                del text_splitter
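
The `chunk_boundaries` bookkeeping in `newsqa_loop` adds up chunk lengths, which matches character offsets in `story['text']` only when the chunks are contiguous and non-overlapping. With a non-zero `chunk_overlap`, or when the splitter trims whitespace, the running total can drift. One alternative, sketched below under the assumption that each chunk is a verbatim substring of the story, is to locate every chunk with `str.find`. `locate_chunks` and `find_answer_chunk` are hypothetical helpers, and the text is a toy example.

```python
def locate_chunks(text, chunks):
    """Return the (start, end) offsets of each chunk in text, or None if not found."""
    boundaries = []
    search_from = 0
    for chunk in chunks:
        start = text.find(chunk, search_from)
        if start == -1:
            start = text.find(chunk)  # fall back to a search from the beginning
        if start == -1:
            boundaries.append(None)
            continue
        boundaries.append((start, start + len(chunk)))
        search_from = start + 1  # later chunks start after this one starts
    return boundaries

def find_answer_chunk(boundaries, answer_start, answer_end):
    """Index of the first chunk that fully contains the answer span, else None."""
    for index, span in enumerate(boundaries):
        if span is not None and span[0] <= answer_start and answer_end <= span[1]:
            return index
    return None

text = "alpha beta gamma delta epsilon"
chunks = ["alpha beta gamma", "gamma delta epsilon"]  # overlap on "gamma"
boundaries = locate_chunks(text, chunks)
delta_chunk = find_answer_chunk(boundaries, text.index("delta"), text.index("delta") + 5)
```

LangChain's text splitters also appear to accept `add_start_index=True`, which records a `start_index` in each document's metadata when `create_documents` is used. That may be a simpler route if it is available in the installed version.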


In [26]:
def newsqa_loop_two_step(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name, instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT):
    with open(output_file_path, 'w') as file:
        word_embed = HuggingFaceInstructEmbeddings(
            model_name=instruct_embedding_model_name,
            model_kwargs=instruct_embedding_model_kwargs,
            encode_kwargs=instruct_embedding_encode_kwargs
        )
        start_time = time.time()
        for chunk_size in chunk_sizes:
            print(f"\n{time.time()-start_time} Processing chunk size {chunk_size}:")
            last_time = time.time()
            for overlap_percentage in overlap_percentages:
                actual_overlap = int(chunk_size * overlap_percentage)
                print(f"\n{time.time()-start_time}\t{time.time()-last_time}\tOverlap [{overlap_percentage}] {actual_overlap}")
                text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=actual_overlap)
        
                for i, story in enumerate(data['data']):
                    if i >= max_stories:
                        break
                    now_time = time.time()
                    print(f"\n{now_time-start_time}\t{now_time-last_time}\t\tstory {i+1}: ", end='')
                    last_time = now_time
        
                    all_splits = text_splitter.split_text(story['text'])
                    vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed)
                    qa_chain = RetrievalQA.from_chain_type(
                        llm, 
                        retriever=vectorstore.as_retriever(), 
                        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
                        return_source_documents=True)
        
                    for j, question_data in enumerate(story['questions']):
                        if question_data['isAnswerAbsent']:
                        # Skip this question because an answer is absent
                            continue

                        j += 1
                        # Extract and print the isQuestionBad number
                        is_question_bad = question_data.get('isQuestionBad', 0)
                        print(f"\nisQuestionBad: {is_question_bad}")

                        question = question_data['q']
                        consensus = question_data['consensus']

                        # Use the consensus span as the reference answer when one exists
                        if 's' in consensus and 'e' in consensus:
                            actual_answer = story['text'][consensus['s']:consensus['e']]
                        else:
                            actual_answer = "No consensus answer"  # Handle cases where there's no start and end index
                                            
                        # Embed the question and perform similarity search
                        question_vector = word_embed.embed_query(question)
                        docs = vectorstore.similarity_search_by_vector(question_vector)

                        # Combine the top 3 retrieved sentences as new context
                        new_context = " ".join([doc.page_content for doc in docs[:3]])

                        # Use the new context and question in the QA chain
                        result = qa_chain({"context": new_context, "query": question})
                                            
                        # Extract and process the predicted answer
                        predicted_answer = result['result'] if isinstance(result['result'], str) else ""

                        # Calculate the F1 score, precision, and recall
                        f1_score_value, precision, recall = calculate_token_f1(predicted_answer, actual_answer)
                        em_score = calculate_em(predicted_answer, actual_answer)

                        # Write the scores to the file
                        file.write(f"{chunk_size}\t{overlap_percentage}\t{i}\t{j}\t{f1_score_value:.4f}\t{precision:.4f}\t{recall:.4f}\t{em_score:.2f}\n")

                        # Store the scores
                        em_results[(chunk_size, overlap_percentage)].append(em_score)
                        f1_results[(chunk_size, overlap_percentage)].append(f1_score_value)
                        # Consider storing precision and recall as well if needed

                        # Print results
                        print(f"\nQuestion: {question}")
                        print(f"Retrieved Sentences: {new_context}")
                        print(f"Predicted Answer: {predicted_answer}")
                        print(f"Actual Answer: {actual_answer}")
                        print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1_score_value:.4f}")

                    # Cleanup
                    del qa_chain
                    del vectorstore
                    del all_splits
                    
                del text_splitter
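
One caveat on the two-step idea: `RetrievalQA` appears to fill `{context}` from its own retriever rather than from a `context` key passed in the call (the verbose trace further down shows the prompt stuffed with retrieved chunks). A minimal sketch of the intended second step, assuming the prompt is formatted by hand and sent straight to the model, is shown below. `answer_from_chunks` and `fake_generate` are hypothetical names, and the stub stands in for the real model so the sketch runs without a model file.

```python
# Plain-string stand-in for the notebook's prompt template.
TEMPLATE = (
    "Based on the following information only:\n\n"
    "{context}\n\n"
    "{question} Please provide the answer in as few words as possible.\n\n"
    "Answer:"
)

def answer_from_chunks(generate, chunks, question, top_k=3):
    """Join the top_k retrieved chunks, fill the prompt, and query the model.

    `generate` is any callable that maps a prompt string to a completion string.
    """
    context = " ".join(chunks[:top_k])
    prompt = TEMPLATE.format(context=context, question=question)
    return generate(prompt).strip(), prompt

def fake_generate(prompt):
    # Hypothetical stub; replace with a call into the real model.
    return " placeholder answer "

retrieved = ["chunk one.", "chunk two.", "chunk three.", "chunk four."]
answer, prompt = answer_from_chunks(
    fake_generate, retrieved, "Who joined the investigation?")
```

With the notebook's objects, `chunks` would be `[doc.page_content for doc in docs]` and `generate` a thin wrapper around the `llm` object. The exact call (`llm(prompt)` or `llm.invoke(prompt)`) depends on the installed LangChain version.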


## Prompt Template and Parameters

In [16]:
template_original = """
                    Based on the following information only: 
                    
                    {context}
                    
                    {question} Please provide the answer in as few words as possible and please do NOT repeat any word in the question, i.e. "{question}".

                    Answer:
                    """
QA_CHAIN_PROMPT_ORIGINAL = PromptTemplate.from_template(template_original)

In [None]:
template_two_step = """
                    Based on the following information only: 
                    
                    {context}
                    
                    {question} Please provide the answer in as few words as possible and please do NOT repeat any word in the question, i.e. "{question}".

                    Answer:
                    """
QA_CHAIN_PROMPT_TWO_STEPS = PromptTemplate.from_template(template_two_step)
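
The triple-quoted templates above carry their source indentation into the prompt, which is visible in the formatted prompt printed by the verbose chain. Below is a small sketch, using only `textwrap` and `str.format`, of how a template could be dedented and previewed before it is wrapped in `PromptTemplate`. The shortened template text and the sample context and question are placeholders.

```python
import textwrap

raw_template = """
                    Based on the following information only:

                    {context}

                    {question} Please provide the answer in as few words as possible.

                    Answer:
                    """

# Remove the common leading indentation and the surrounding blank lines.
clean_template = textwrap.dedent(raw_template).strip()

# Preview what the model would receive for one context/question pair.
preview = clean_template.format(context="<retrieved chunk>", question="<question>")
print(preview)
```

Whether the extra whitespace affects the answers of this particular model is not something the notebook measures; the preview simply makes the final prompt easier to inspect.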

Note: If `InvalidDimensionException` is raised, refer to *7.3 Reset Embeddings Dimension Value in Vectorstore* in file 8.1. You can also use
```python
Chroma().delete_collection()
```
to refresh the Chroma db collection.
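
A related precaution, offered as an assumption rather than a confirmed diagnosis: if every `Chroma.from_texts` call in a process writes to the same default collection, chunks from earlier stories or earlier chunk sizes may remain searchable. Giving each story its own collection name is one way to rule that out. `story_collection_name` is a hypothetical helper, and the Chroma calls appear only as comments because they need `langchain` and `chromadb` installed.

```python
import uuid

def story_collection_name(chunk_size, overlap, story_index):
    """Build a unique collection name from the run settings (hypothetical helper)."""
    suffix = uuid.uuid4().hex[:8]
    return f"newsqa-{chunk_size}-{int(round(overlap * 100))}-{story_index}-{suffix}"

name_a = story_collection_name(200, 0.1, 0)
name_b = story_collection_name(200, 0.1, 0)

# Assumed usage inside the story loop:
#   vectorstore = Chroma.from_texts(texts=all_splits, embedding=word_embed,
#                                   collection_name=name_a)
#   ...
#   vectorstore.delete_collection()  # drop the per-story collection when done
```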

In [10]:
# Chroma().delete_collection()

In [23]:
############## Important Parameters ##############
max_stories = 100
chunk_sizes = [200]
overlap_percentages = [0.1]  # Fractions of the chunk size (0.1 = 10%)
random_seed = 123
output_file_path = "results/scores_test.txt"
model_location = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
# model_location = "/Users/wk77/Library/CloudStorage/OneDrive-DrexelUniversity/Documents/data/gpt4all/models/gpt4all-falcon-q4_0.gguf"

##################################################

# logging.basicConfig(level=logging.WARNING)  # This will show only warnings and errors
logging.basicConfig(level=logging.ERROR)

set_seed(random_seed)

# Results storage
f1_results = defaultdict(list)
em_results = defaultdict(list)
text_results = []

# Initialize the language model (the QA chain is built per story inside the loop)
llm = GPT4All(model=model_location, max_tokens=2048, seed=random_seed)

# HuggingFace Instruct Embeddings parameters
instruct_embedding_model_name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
instruct_embedding_model_kwargs = {'device': 'cpu'}
instruct_embedding_encode_kwargs = {'normalize_embeddings': True}

# Iterate over the stories and questions to calculate the scores
start_time = time.time()
print(f"{start_time} Started.")

################ Main Function ################
newsqa_loop(data, llm, output_file_path, chunk_sizes, overlap_percentages, max_stories, instruct_embedding_model_name,
                 instruct_embedding_model_kwargs, instruct_embedding_encode_kwargs, QA_CHAIN_PROMPT_ORIGINAL)

Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin


1705378775.5835507 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0	0.0	Overlap [0.1] 20

0.0	0.0		story 1: 

> Entering new RetrievalQA chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
                    Based on the following information only: 
                    
                    The Drug Enforcement Administration has joined the investigation into Jackson's death, a federal law enforcement official said Wednesday night.

The Drug Enforcement Administration has joined the investigation into Jackson's death, a federal law enforcement official said Wednesday night.

The Drug Enforcement Administration has joined the investigation into Jackson's death, a federal law enforcement official said Wednesday night.

The Drug Enforcement Administration has joined the investigation into Jackson's death, a federal law enforcement official said 

Running `newsqa_processing_script_lyang.py` should produce output similar to the following small sample:
```powershell
Found model file at  C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin
falcon_model_load: loading model from 'C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin' - please wait ...
falcon_model_load: n_vocab   = 65024
falcon_model_load: n_embd    = 4544
falcon_model_load: n_head    = 71
falcon_model_load: n_head_kv = 1
falcon_model_load: n_layer   = 32
falcon_model_load: ftype     = 2
falcon_model_load: qntvr     = 0
falcon_model_load: ggml ctx size = 3872.64 MB
falcon_model_load: memory_size =    32.00 MB, n_mem = 65536
falcon_model_load: ........................ done
falcon_model_load: model size =  3872.59 MB / num tensors = 196
1705643318.023422 Started.
load INSTRUCTOR_Transformer
max_seq_length  512

0.0 Processing chunk size 200:

0.0     0.0     Overlap [0] 0

0.0     0.0             story 1:
233.9731228351593       233.9731228351593       Overlap [0.1] 20

233.9731228351593       233.9731228351593               story 1:
510.183185338974 Processing chunk size 400:

510.183185338974        0.0     Overlap [0] 0

510.183185338974        0.0             story 1:
761.7193622589111       251.53617691993713      Overlap [0.1] 40

761.7193622589111       251.53617691993713              story 1:
```
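
The sample log above steps through two chunk sizes and two overlap settings. Below is a short sketch of how that grid and the overlap values printed in the log arise, assuming `chunk_sizes = [200, 400]` and `overlap_percentages = [0, 0.1]`. These values are inferred from the log and differ from the single configuration in the parameter cell of this notebook.

```python
from itertools import product

chunk_sizes = [200, 400]         # assumed from the log
overlap_percentages = [0, 0.1]   # fractions of the chunk size, assumed from the log

# Same arithmetic as the loops: overlap in characters = int(chunk_size * fraction).
grid = [(size, fraction, int(size * fraction))
        for size, fraction in product(chunk_sizes, overlap_percentages)]

for size, fraction, actual_overlap in grid:
    print(f"chunk_size={size}\toverlap [{fraction}] {actual_overlap}")
```

Each tuple corresponds to one `RecursiveCharacterTextSplitter` configuration, so the total run time grows with the product of the two lists.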