# Chunking and Overlapping Sizes with QAConv

Lixiao Yang\
10/25/2023

QAConv Dataset: https://github.com/salesforce/QAConv


Updates:
1. **NewsQA dataset**: requires a Docker or local environment setup to compile and package the news stories (larger complete dataset size)
     - CSV data format
     - CNN dataset: contains the documents and accompanying questions from CNN news articles. There are approximately 90k documents and 380k questions.
       - Questions (200MB), stories (150MB)
       - Original news is stored as plain text in .story format and needs to be compiled before reading
     - Daily Mail dataset: contains the documents and accompanying questions from Daily Mail news articles. There are approximately 197k documents and 879k questions.
       - Questions (500MB), stories (358MB)
     - Compatibility with GPT4All and LangChain: needs a separate generator, plus processing, splitting, and tokenization steps


2. **QAConv dataset**: the data is open-sourced
     - JSON data format
       - QA data (40MB), questions (7MB)
     - Uses pretrained / fine-tuned models (Hugging Face library)
     - Compatibility with GPT4All and LangChain: needs a separate retriever
     - Baseline: BM25
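
The list above names BM25 as the QAConv baseline. As a rough illustration of how BM25 ranks passages against a question, here is a minimal sketch of the Okapi BM25 scoring formula in plain Python. The function name `bm25_scores`, the whitespace tokenization, and the defaults `k1=1.5` and `b=0.75` are assumptions made for this example, not the configuration used by the QAConv baseline.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document (illustrative sketch)."""
    if not docs:
        return []
    # Naive lowercase whitespace tokenization (an assumption for this sketch)
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))

    query_terms = query.lower().split()
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell sharply"]
scores = bm25_scores("cat mat", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

Only the first document contains the query terms as whole tokens, so it is the only one with a non-zero score. A real baseline would typically add stemming and stop-word handling.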

In [36]:
# Importing necessary libraries and modules
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import GPT4AllEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import pandas as pd
import numpy as np
import json
from pathlib import Path
from pprint import pprint
import random
import os

## Load the data

In [34]:
# Load the original JSON data
with open('./QAConv-V1.1/article_segment.json', 'r', encoding='utf-8') as file:
    original_data = json.load(file)

# Get a list of all the keys (record IDs) in the data
all_keys = list(original_data.keys())

# Determine the number of records you want to sample (e.g., 0.1% of the data)
num_samples = int(len(all_keys) * 0.001)

# Randomly sample a subset of the keys without replacement
sampled_keys = random.sample(all_keys, num_samples)

# Use the sampled keys to get the corresponding records from the data
sampled_data = {key: original_data[key] for key in sampled_keys}

# Create a directory to hold the text files
os.makedirs('./QAConv-V1.1/preprocessed_text_files', exist_ok=True)

# Process each entry in the sampled data
for key, entry in sampled_data.items():
    texts = []
    for item in entry.get('prev_ctx', []):
        texts.append(item['text'])
    for item in entry.get('seg_dialog', []):
        texts.append(item['text'])
    # Combine the texts into a single string
    combined_text = ' '.join(texts)
    # Write the combined text to a text file
    with open(f'./QAConv-V1.1/preprocessed_text_files/{key}.txt', 'w', encoding='utf-8') as file:
        file.write(combined_text)

In [37]:
# Use DirectoryLoader to load the preprocessed text files
loader = DirectoryLoader('./QAConv-V1.1/preprocessed_text_files', glob="*.txt", show_progress=True, use_multithreading=True)
data = loader.load()

  0%|          | 0/18 [00:00<?, ?it/s]

Need to load profiles.
100%|██████████| 18/18 [00:04<00:00,  4.10it/s]


## Defining chunking strategies

In [38]:
# Fixed Chunking Configurations
fixed_configs = [(size, overlap) for size in [200, 400, 800, 1600] for overlap in [0, 0.1, 0.2, 0.3]]

# Semantic Structure Chunking Configurations
# These will be handled differently in the loop as they require different splitting methods
semantic_configs = ['sentence', 'paragraph', 'section']
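
To make the configurations above concrete, the following sketch shows how a chunk size and a fractional overlap translate into character windows, and how sentence and paragraph boundaries could be found with regular expressions. The helper names `fixed_chunks`, `split_sentences`, and `split_paragraphs` are hypothetical. The fixed-size helper is a plain sliding window, not a reimplementation of `RecursiveCharacterTextSplitter`, which also tries to break on separators. The preprocessing step joins dialog turns with spaces, so paragraph splitting would only find boundaries if the turns were joined with blank lines instead.

```python
import re


def fixed_chunks(text, chunk_size, overlap_ratio):
    """Slide a window of chunk_size characters with a fractional overlap."""
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_paragraphs(text):
    """Split on blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


# A size-4 window with 25% overlap advances 3 characters at a time
print(fixed_chunks("abcdefghij", 4, 0.25))
print(split_sentences("Hello there. How are you? Fine!"))
```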


## Experiment loop

In [None]:
# Results DataFrame to store the performance metrics
results_df = pd.DataFrame(columns=['Chunking Strategy', 'Chunk Size', 'Overlap', 'EM', 'F1'])

for config in fixed_configs:
    chunk_size, overlap = config
    # Adjust the text_splitter for fixed chunking configurations
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=int(chunk_size*overlap))

    all_splits = text_splitter.split_documents(data)
    
    # Store the chunks in a vector store.
    # A distinct collection name per configuration keeps chunks from
    # earlier iterations from being mixed into this run.
    embeddings = GPT4AllEmbeddings()
    vectorstore = Chroma.from_documents(
        documents=all_splits,
        embedding=embeddings,
        collection_name=f"fixed_{chunk_size}_{int(overlap * 100)}",
    )

    # Load the local GPT4All model (the path is machine-specific)
    gpt4all_falcon_model = "C:/Users/24075/AppData/Local/nomic.ai/GPT4All/ggml-model-gpt4all-falcon-q4_0.bin"
    llm = GPT4All(model=gpt4all_falcon_model, max_tokens=2048)

    template = """Use the following pieces of context to answer the question at the end. 
    If you don't know the answer, just say that you don't know, don't try to make up an answer. 
    Use three sentences maximum and keep the answer as concise as possible. 
    Also provide me the source for your answer. Explain how to get the answer step by step.
    {context}
    Question: {question}
    Helpful Answer:"""
    QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

    # Retrieve: build the QA chain and pass the custom prompt to it
    qa_chain = RetrievalQA.from_chain_type(
        llm,
        retriever=vectorstore.as_retriever(),
        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
    )

    # Assume em and f1 are the calculated Exact Match and F1 scores for this configuration
    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    results_df = pd.concat([results_df, pd.DataFrame([{
        'Chunking Strategy': 'Fixed',
        'Chunk Size': chunk_size,
        'Overlap': overlap,
        'EM': em,
        'F1': f1
    }])], ignore_index=True)

# Semantic chunking will require a different text splitting approach which you would need to implement
for config in semantic_configs:
    # ... rest of your experiment code ...

    # Assume em and f1 are the calculated Exact Match and F1 scores for this configuration
    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    results_df = pd.concat([results_df, pd.DataFrame([{
        'Chunking Strategy': 'Semantic',
        'Chunk Size': config,
        'Overlap': 'N/A',
        'EM': em,
        'F1': f1
    }])], ignore_index=True)

# Save the results to a CSV file
results_df.to_csv('experiment_results.csv', index=False)
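
The experiment loop assumes that `em` and `f1` have already been calculated. One common way to compute them for extractive QA is the SQuAD-style approach: normalize both strings, then compare them exactly (EM) or by token overlap (F1). The sketch below follows that convention. The function names `normalize_answer`, `exact_match`, and `f1_score` are chosen for this example, and whether QAConv's official evaluation matches this exactly is not confirmed here.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```

Averaging these per-question scores over the sampled questions would give the `em` and `f1` values stored for each configuration.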
