# Dataset Generation

## Basic setup

In [1]:
# Make Haystack silent
import logging

# Set logging level for specific loggers
logging.getLogger("haystack").setLevel(logging.FATAL)
logging.getLogger("llama_index").setLevel(logging.FATAL)
logging.getLogger("httpx").setLevel(logging.FATAL)
logging.getLogger("openai._base_client").setLevel(logging.FATAL)

In [2]:
# Automatic reloading
%load_ext autoreload
%autoreload 2

In [3]:
import sys
import os

# os.path.abspath('') resolves to the notebook's working directory; dirname() returns its parent
current_dir = os.path.dirname(os.path.abspath(''))

# Navigate one more level up
parent_dir = os.path.abspath(os.path.join(current_dir, '..'))

# Add the directory to sys.path
sys.path.append(parent_dir)

In [4]:
from dotenv import load_dotenv
from haystack.components.generators import AzureOpenAIGenerator
from milvus_haystack import MilvusDocumentStore

load_dotenv()

document_store = MilvusDocumentStore(
    collection_name="",
    collection_description="",
    connection_args={
        "host": os.getenv("MILVUS_HOST", ""),
        "port": os.getenv("MILVUS_PORT", ""),
        "user": "",
        "password": "",
        "secure": False,
    },
)

llm = AzureOpenAIGenerator()

In [7]:
from src.evaluation.dataset_generation.dataset_generation import DatasetGenerator, MilvusDocumentStoreWrapper

milvus_wrapper = MilvusDocumentStoreWrapper(document_store=document_store)
generator = DatasetGenerator(document_store_wrapper=milvus_wrapper, model=llm, seed=42)

## Sample chunks in Milvus Database
We observe that some chunks in the database should not be used to generate queries. We therefore have to evaluate chunks using the evaluate chunk method before using them.

In [10]:
train_set, val_set, test_set, train_sources, val_sources, test_sources = generator.train_val_test_split(split_ratio=[0.6, 0.4, 0])

for chunk in train_set[0:5]:
    print(chunk[1])
    print('-'*len(chunk[1]))


54. Alt, Franz L. 
------------------
Blu-ray Disc allows video with a bit depth of 8-bits per color YCbCr with 4:2:0 chroma subsampling.[185][186] The choice of formats affects the producer's licensing/royalty costs as well as the title's maximum run time, due to differences in compression efficiency. 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
For fundamental contributions to and leadership in telecommunications switching systems. 
-----------------------------------------------------------------------------------------
October 11, 2000. Archived (https://web.archive.org/web/20070926232228/https://www.cdrinfo. com/Sections/News/Details.aspx?NewsId=4922) from the original on September 26, 2007. Retrieved October 17, 2007. 09/10/2024, 17:58 Blu-ray - Wikipedia https
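
The notebook does not show how the evaluation step decides which chunks to drop. As a rough illustration only, a cheap pre-filter could reject the kinds of chunks printed above (short list entries, citation residue) before any more expensive check. The name `looks_usable`, the 30% URL ratio, and the heuristics themselves are assumptions and not part of `DatasetGenerator`; only the 200-character default mirrors the `chunk_size_threshold` value used in this notebook.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def looks_usable(chunk_text: str, min_chars: int = 200, max_url_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic pre-filter; NOT the project's evaluation method."""
    text = chunk_text.strip()

    # Very short chunks (e.g. "54. Alt, Franz L.") rarely support a meaningful query.
    if len(text) < min_chars:
        return False

    # Chunks dominated by URLs are usually citation or archive residue.
    url_chars = sum(len(match) for match in URL_PATTERN.findall(text))
    if url_chars / len(text) > max_url_ratio:
        return False

    return True


if __name__ == "__main__":
    print(looks_usable("54. Alt, Franz L."))  # too short, so rejected
    print(looks_usable("word " * 50))         # long enough and no URLs, so kept
```

A filter like this only removes obvious noise; it says nothing about whether a chunk is self-contained enough to answer a question.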

## Generating single query-chunk pairs
Here, we use the chunks from `train_set` to generate the single query-chunk pairs. We could instead have obtained chunks by running the `get_all_chunks` method. We don't need to pass in chunk sources here, as data leakage is already prevented by the initial `train_val_test_split`.
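
One common way to prevent this kind of leakage is to assign whole sources, rather than individual chunks, to a split, so that no document contributes chunks to more than one set. The sketch below illustrates that idea under this assumption; `split_by_source` is a hypothetical helper and may differ from what `train_val_test_split` actually does.

```python
import random


def split_by_source(chunks, sources, split_ratio=(0.6, 0.4, 0.0), seed=42):
    """Hypothetical sketch: assign every chunk of a source to exactly one split."""
    unique_sources = sorted(set(sources))
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(unique_sources)

    n_sources = len(unique_sources)
    n_train = round(split_ratio[0] * n_sources)
    n_val = round(split_ratio[1] * n_sources)

    train_sources = set(unique_sources[:n_train])
    val_sources = set(unique_sources[n_train:n_train + n_val])
    # Any remaining sources fall into the test split.

    splits = {"train": [], "val": [], "test": []}
    for chunk, source in zip(chunks, sources):
        if source in train_sources:
            splits["train"].append((chunk, source))
        elif source in val_sources:
            splits["val"].append((chunk, source))
        else:
            splits["test"].append((chunk, source))
    return splits


if __name__ == "__main__":
    demo_chunks = [f"chunk {i}" for i in range(10)]
    demo_sources = [f"doc{i // 2}" for i in range(10)]  # two chunks per document
    result = split_by_source(demo_chunks, demo_sources)
    print({name: len(items) for name, items in result.items()})
```

Because the split is made over sources, the ratios apply to the number of documents, not the number of chunks, so the chunk counts per split can deviate from the requested ratio when documents differ in length.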

In [14]:
single_chunk_query_dataset = generator.generate_dataset(
    number_of_questions=5,                    # Number of queries to generate
    chunks=train_set,                         # Adjust accordingly
    generate_answers=True,                    # Set to False if relevant answers are not needed (in our case, only LLM-as-a-judge requires them)
    get_multi_context=False,                  # Set to False if only generating single chunk-query pair
    evolve_queries=True,                      # Set to False if evolution not required
    evolve_steps=["generalizing_evolution"],  # Types of query evolution to apply
    json_path='./test_single_dataset.json',   # Output path for json document
    chunk_size_threshold=200,                 # Character level threshold, higher means larger chunks
)

Generating Random Chunks: 100%|██████████| 5/5 [00:10<00:00,  2.17s/it]
Generating Queries: 100%|██████████| 5/5 [00:02<00:00,  1.84it/s]
Evolving Queries: 100%|██████████| 5/5 [00:02<00:00,  1.75it/s]
Answering Queries: 100%|██████████| 10/10 [00:18<00:00,  1.83s/it]


## View single query-chunk pairs
We convert it into csv format for easy visualisation. Since we turned evolution on, the output contains 10 rows for 5 chunks: the first 5 queries are the original, more specific queries, while the next 5 are their evolved (generalized) versions, built from the same chunks in the same order.

In [16]:
dataset_in_csv = generator.dataset_mapping(
    dataset=single_chunk_query_dataset,        # Dataset in json format
    deep_eval_format=False                     # Inconsequential when generating single chunk-query pair, only important for multi-chunk queries
)

dataset_in_csv

Unnamed: 0,query,expected_answer,context_1
0,What significant contributions did Alan Cox ma...,Alan Cox made significant contributions to the...,He then became one of the main developers and ...
1,What did Bernstein point out regarding the imp...,Bernstein pointed out that reducing the precis...,[31] The custom server was designed to give ou...
2,What are the different levels of councils and ...,The administrative structure described include...,[219] At the primary level are 14 cities of ra...
3,What allegations of extramarital affairs were ...,"During his time as governor of Arkansas, Bill ...",[300] Clinton admitted to having extramarital ...
4,Does BitLocker have a built-in backdoor for la...,BitLocker does not have a built-in backdoor fo...,[46] All these attacks require physical access...
5,What role did Alan Cox play in Linux kernel de...,Alan Cox played a significant role in Linux ke...,He then became one of the main developers and ...
6,Why is server timestamp precision important in...,Server timestamp precision is crucial in encry...,[31] The custom server was designed to give ou...
7,What is the administrative structure of local ...,The administrative structure of local councils...,[219] At the primary level are 14 cities of ra...
8,What controversies surrounded Clinton during h...,Controversies surrounding Bill Clinton during ...,[300] Clinton admitted to having extramarital ...
9,Is there a way for law enforcement to bypass B...,Law enforcement does not have a guaranteed way...,[46] All these attacks require physical access...
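
To compare each original query with its evolved counterpart, the two halves of the table can be zipped together. This is a minimal sketch that assumes the layout seen in the output above (a single evolution step, originals first, evolved queries second, same chunk order); `pair_original_with_evolved` is a hypothetical helper, not part of `DatasetGenerator`, and the rows are represented as plain dictionaries for illustration.

```python
def pair_original_with_evolved(rows):
    """Hypothetical helper: pair row i with row i + n, where n is half the table."""
    if len(rows) % 2 != 0:
        raise ValueError("Expected an even number of rows (originals + evolved).")

    half = len(rows) // 2
    pairs = []
    for original, evolved in zip(rows[:half], rows[half:]):
        # Both queries should be grounded in the same chunk.
        if original["context_1"] != evolved["context_1"]:
            raise ValueError("Original and evolved rows do not share a context.")
        pairs.append((original["query"], evolved["query"]))
    return pairs


if __name__ == "__main__":
    demo_rows = [
        {"query": "original A", "context_1": "chunk A"},
        {"query": "original B", "context_1": "chunk B"},
        {"query": "evolved A", "context_1": "chunk A"},
        {"query": "evolved B", "context_1": "chunk B"},
    ]
    for original_query, evolved_query in pair_original_with_evolved(demo_rows):
        print(original_query, "->", evolved_query)
```

The context check guards against silently mispairing queries if the row order ever differs from the one assumed here.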


## Generate multi-chunk query pairs
For this example, we use the `get_all_chunks` method, assuming that we are not running ColBERT or Linear Adapters, which require a `train_set`.

### Why generate 25 chunks?
Although we asked for only 5 questions, 25 random chunks were generated. This is because the chunks retrieved during context building are evaluated too, so the drop-out rate is high and 5 chunks are not enough to build 5 contexts. As the progress bar in the output shows, it took 13 attempts to build the 5 contexts that meet the minimum requirement of 2 chunks per context.

In [18]:
chunks, sources = generator.get_all_chunks()

multi_chunk_dataset = generator.generate_dataset(
    number_of_questions=5,                          # Number of queries to generate
    chunks=chunks,                                  # Adjust accordingly
    generate_answers=True,                          # Set to False if relevant answers are not needed (in our case, only LLM-as-a-judge requires them)
    get_multi_context=True,                         # Set to False if only generating single chunk-query pair
    evolve_queries=False,                           # Set to False since evolution not required
    json_path='./multi_chunk_dataset.json',         # Output path for json document
    sources=sources,                                # To prevent data leakage, required for multi-context
    chunk_size_threshold=200,                       # Character level threshold, higher means larger chunks
    max_chunks_per_context=5,                       # Maximum number of chunks per context for multi-context
    min_chunks_per_context=2,                       # Minimum number of chunks per context for multi-context
    similarity_threshold=0.5,                       # Cosine similarity threshold for grouping chunks into a context, higher means stricter
)

Generating Random Chunks: 100%|██████████| 25/25 [01:16<00:00,  3.04s/it]
Building Contexts:  52%|█████▏    | 13/25 [01:03<00:58,  4.85s/it]
Generating Queries: 100%|██████████| 5/5 [00:04<00:00,  1.19it/s]
Answering Queries: 100%|██████████| 5/5 [00:18<00:00,  3.73s/it]
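
To make the drop-out concrete, the sketch below shows one plausible way a context could be built around a seed chunk: keep the neighbours whose cosine similarity to the seed reaches the threshold, cap the context at the maximum size, and discard the seed if fewer than the minimum number of chunks remain. This is an illustration under stated assumptions, not the generator's actual implementation; `build_context` and the toy 2-dimensional embeddings are hypothetical, and the real pipeline additionally evaluates the retrieved chunks.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_context(seed_idx, embeddings, similarity_threshold=0.5,
                  min_chunks=2, max_chunks=5):
    """Hypothetical sketch: return chunk indices for one context, or None if it fails."""
    seed = embeddings[seed_idx]
    scored = [
        (cosine(seed, emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != seed_idx
    ]
    # Most similar neighbours first; keep only those at or above the threshold.
    neighbours = [idx for score, idx in sorted(scored, reverse=True)
                  if score >= similarity_threshold]
    context = [seed_idx] + neighbours[:max_chunks - 1]
    # A seed without enough similar neighbours is dropped entirely.
    return context if len(context) >= min_chunks else None


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(build_context(0, toy_embeddings))  # seed 0 has a close neighbour
    print(build_context(2, toy_embeddings))  # seed 2 has none, so it is dropped
```

Under a scheme like this, every seed that returns `None` is wasted, which is why more seed chunks are sampled than the number of questions requested.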


## View multi-context query pairs
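Setting `deep_eval_format` to True combines the `context_*` columns into a single `context` column holding the list of chunk texts, and renames `query` and `expected_answer` to `input` and `expected_output`.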

In [20]:
multi_dataset_in_csv = generator.dataset_mapping(
    dataset=multi_chunk_dataset,
    deep_eval_format=False
)

multi_dataset_in_csv

Unnamed: 0,query,expected_answer,context_1,context_2,context_3,context_4,context_5
0,What impact did the AACS encryption key contro...,The AACS encryption key controversy in May 200...,"(May 3, 2007). ""Digg's DVD-decoder fiasco: Law...","We hear you, and effective immediately we won'...","[39] On May 1, 2007, in response to a DMCA dem...","[39] On May 1, 2007, in response to a DMCA dem...",
1,How did the partisan movements in Belarus duri...,The partisan movements in Belarus during World...,"[75] During World War II, Belarus was home to ...",(issued in 1940) entirely composed of former p...,[80] In the 1990s some raised the estimate eve...,[68] Belarusian leadership was sent to Bereza ...,"Standing, left to right: Arkadz Smolic, Pyotra..."
2,"How has religious diversity, including the pre...","Religious diversity, including the presence of...",There are small numbers of Ibadi and non- deno...,"Today, however, most Arabs are Muslim, with a ...","[386][387] Historically, there were also sizea...",[384] The Druze community is concentrated in L...,
3,"How do different authentication factors, inclu...","Different authentication factors, including kn...",or verify a person's identity before being gra...,"However, text, audio, and video can be copied ...","2. Ownership: Something the user has (e.g., wr...",[1] It might involve validating personal ident...,The European Central Bank (ECB) has defined st...
4,How did the publication and revisions of Lidde...,The publication and revisions of Liddell and S...,"4. Liddell, Henry George; Scott, Robert (25 Ap...","In 1843, the same year as the full lexicon's p...",The second through seventh editions appeared i...,,


In [21]:
multi_dataset_in_csv_deepeval = generator.dataset_mapping(
    dataset=multi_chunk_dataset,
    deep_eval_format=True
)

multi_dataset_in_csv_deepeval

Unnamed: 0,input,expected_output,context
0,What impact did the AACS encryption key contro...,The AACS encryption key controversy in May 200...,"[""(May 3, 2007). \""Digg's DVD-decoder fiasco: ..."
1,How did the partisan movements in Belarus duri...,The partisan movements in Belarus during World...,"[""[75] During World War II, Belarus was home t..."
2,"How has religious diversity, including the pre...","Religious diversity, including the presence of...","[""There are small numbers of Ibadi and non- de..."
3,"How do different authentication factors, inclu...","Different authentication factors, including kn...","[""or verify a person's identity before being g..."
4,How did the publication and revisions of Lidde...,The publication and revisions of Liddell and S...,"[""4. Liddell, Henry George; Scott, Robert (25 ..."
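
The difference between the two layouts can be sketched as a per-row transformation. This is a minimal illustration assuming the column names seen in the two outputs above and assuming the combined contexts are stored as a JSON-encoded list; `to_deepeval_row` is a hypothetical helper and not the body of `dataset_mapping`.

```python
import json


def to_deepeval_row(row):
    """Hypothetical sketch: collapse context_1..context_N into one `context` field."""
    context_keys = sorted(
        (key for key in row if key.startswith("context_")),
        key=lambda key: int(key.split("_")[1]),  # numeric order: context_2 before context_10
    )
    # Skip empty cells left by contexts with fewer than the maximum number of chunks.
    contexts = [row[key] for key in context_keys if row[key]]
    return {
        "input": row["query"],
        "expected_output": row["expected_answer"],
        "context": json.dumps(contexts),
    }


if __name__ == "__main__":
    wide_row = {
        "query": "example question",
        "expected_answer": "example answer",
        "context_1": "first chunk",
        "context_2": "second chunk",
        "context_3": "",
    }
    print(to_deepeval_row(wide_row))
```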
