# Create evaluation dataset for Redbox RAG chat  <a class="anchor" id="title"></a>

------

**Evaluate Redbox RAG chat on one stable, numbered version of these data**

----------------

**Before running this notebook**

Set the version of the evaluation dataset you are creating **[HERE](#setversion)**

## Table of Contents <a class="anchor" id="toc"></a>
* [Overview](#overview)
* [Set version of the evaluation dataset](#setversion)
* [Select files for creating evaluation dataset](#files)
* [Imports](#imports)
* [Generate Evaluation Dataset](#ragas)
* [Save Evaluation Dataset](#save)
* [Troubleshooting](#troubleshooting)

--------

## Overview <a class="anchor" id="overview"></a>

It is important to version the evaluations we run, including the input data used to generate evaluation datasets.

This notebook combines the files you select with the RAGAS framework to generate synthetic data. Two different LLMs are used: one as the 'generator' and one as the 'critic'.

Please be aware that generating synthetic data will incur LLM API costs.

There is a [Troubleshooting](#troubleshooting) section at the end of this notebook.

[Back to top](#title)

-----------

**Set the version of the evaluation dataset you will be creating in this notebook in the cell below**  <a class="anchor" id="setversion"></a>

In [3]:
DATA_VERSION = "0.1.0"

Run the cell below to set up the required folder structure (it will not overwrite folders and files if they already exist)

In [4]:
from pathlib import Path

ROOT = Path.cwd().parents[1]
EVALUATION_DIR = ROOT / "notebooks/evaluation"

V_ROOT = EVALUATION_DIR / f"data/{DATA_VERSION}"
V_RAW = V_ROOT / "raw"
V_SYNTHETIC = V_ROOT / "synthetic"
V_CHUNKS = V_ROOT / "chunks"
V_RESULTS = V_ROOT / "results"
V_EMBEDDINGS = V_ROOT / "embeddings"

V_ROOT.mkdir(parents=True, exist_ok=True)
V_RAW.mkdir(parents=True, exist_ok=True)
V_SYNTHETIC.mkdir(parents=True, exist_ok=True)
V_CHUNKS.mkdir(parents=True, exist_ok=True)
V_RESULTS.mkdir(parents=True, exist_ok=True)
V_EMBEDDINGS.mkdir(parents=True, exist_ok=True)

[Back to top](#title)

---------

#### Select files that you will use to generate versioned evaluation dataset   <a class="anchor" id="files"></a>

Now copy all the files you want to use to generate **THIS VERSION** of the evaluation dataset into `notebooks/evaluation/data/{DATA_VERSION}/raw/`

Also upload these files to the shared Google Drive, under the corresponding version number/location.
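Because the raw inputs are part of the versioned dataset, it can help to record exactly which files went into a version. The sketch below is one possible way to do that and is not part of the Redbox codebase: `build_manifest` is a hypothetical helper that hashes each file in a directory, and the demo uses a temporary directory standing in for `V_RAW`.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(raw_dir: Path) -> dict[str, str]:
    """Map each file name in raw_dir to the SHA-256 hex digest of its contents."""
    manifest = {}
    for path in sorted(raw_dir.iterdir()):
        if path.is_file():
            manifest[path.name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


# Demo on a temporary directory (an assumption, used here instead of V_RAW)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "a.txt").write_text("hello")
    (demo_dir / "b.txt").write_text("world")
    demo_manifest = build_manifest(demo_dir)

print(sorted(demo_manifest))
```

Comparing two manifests for the same `DATA_VERSION` would show whether the raw files on disk and on the shared drive have drifted apart.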

--------------

#### Imports <a id="imports"></a>

In [7]:
from tqdm.auto import tqdm
import pandas as pd
import typing as t
import json
import jsonlines
import pickle

pd.set_option("display.max_colwidth", None)

In [8]:
from langchain.document_loaders import DirectoryLoader
from langchain.schema import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

[Back to top](#title)

--------

## Synthetically generate evaluation dataset <a class="anchor" id="ragas"></a>

RAGAS synthetic test set generation is documented [HERE](https://docs.ragas.io/en/stable/getstarted/testset_generation.html). It is perhaps not as SOTA as DeepEval (validate!), but it creates `input` AND `expected_output` for us.

Note that the input questions are not generated from our chunking strategy; they are, however, generated from the same files.

In [9]:
# Takes about 4 minutes for 4 docs. Consider Langchain `unstructured`
loader = DirectoryLoader(V_RAW)
documents = loader.load()

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### Save Langchain documents for future use

In [10]:
def save_docs_to_jsonl(documents: t.Iterable[Document], file_path: str) -> None:
    with jsonlines.open(file_path, mode="w") as writer:
        for doc in documents:
            writer.write(doc.dict())


def load_docs_from_jsonl(file_path: str) -> t.List[Document]:
    documents = []
    with jsonlines.open(file_path, mode="r") as reader:
        for doc in reader:
            documents.append(Document(**doc))
    return documents

In [11]:
save_docs_to_jsonl(documents, V_CHUNKS / "documents.jsonl")

-----------

In [12]:
# RAGAS generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo") # to match core-api
critic_llm = ChatOpenAI(model="gpt-4o") # cheaper model with similar performance
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

In [13]:
# generate testset
testset = generator.generate_with_langchain_docs(documents, test_size=10, distributions={simple: 0.4, reasoning: 0.3, multi_context: 0.3})

embedding nodes:   0%|          | 0/66 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/10 [00:00<?, ?it/s]
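The `distributions` argument gives the share of each question type, so the shares should sum to 1. As a rough illustration, assuming RAGAS allocates questions proportionally (which this notebook does not verify), the expected split for `test_size=10` can be worked out as below. The string keys are stand-ins for the `simple`, `reasoning` and `multi_context` evolution objects.

```python
import math

test_size = 10
# String keys stand in for the ragas evolution objects used in the cell above
distributions = {"simple": 0.4, "reasoning": 0.3, "multi_context": 0.3}

# The shares should add up to 1 (allowing for floating point error)
assert math.isclose(sum(distributions.values()), 1.0)

# Approximate number of questions per evolution type
expected_counts = {
    name: round(share * test_size) for name, share in distributions.items()
}
print(expected_counts)
```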

#### Save RAGAS generated testset <a class="anchor" id="save"></a>

As pickle

In [14]:
with open(f'{V_SYNTHETIC}/ragas_testset.pkl', 'wb') as f:
    pickle.dump(testset, f)

Convert dataframe into a DeepEval compatible CSV & save

In [15]:
testset_df = testset.to_pandas()

# Rename the columns
new_column_names = {
    'question': 'input',
    'contexts': 'context',
    'ground_truth': 'expected_output',
    # Add more column names here
}

testset_df_renamed = testset_df.rename(columns=new_column_names)

#  DeepEval dataset format requires an 'actual_output' column
testset_df_renamed['actual_output'] = ''
testset_df_renamed = testset_df_renamed.drop(['evolution_type', 'metadata', 'episode_done'], axis=1)

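# Drop rows with NaN, then convert all columns to string - otherwise DeepEval will throw a Pydantic validation error.
# dropna() must come first: astype(str) can turn NaN into the string 'nan', which dropna() no longer catches
testset_df_renamed = testset_df_renamed.dropna()
testset_df_renamed = testset_df_renamed.astype(str)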

# save as CSV
testset_df_renamed.to_csv(f'{V_SYNTHETIC}/ragas_synthetic_data.csv', index=False)

#### (Optional) View top 5 rows of synthetically generated data

In [None]:
testset_df_renamed.head()

#### Pre-embed the documents for other users

Embeddings take a while. Here we show how to compute and save them for other users.

For now we use the chunking strategy from `worker/`, and embed with any models we choose.

Ensure the necessary services are running with `make eval_backend`.

In [20]:
from redbox.model_db import SentenceTransformerDB
from redbox.models import Settings
from redbox.parsing import chunk_file
from redbox.storage.elasticsearch import ElasticsearchStorageHandler
from redbox.models.file import File

from minio import Minio
from uuid import UUID

env = Settings()

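# NOTE: secret_key reuses env.aws_access_key here, which looks unintended;
# check which Settings field holds the secret key before relying on this client
minio = Minio(
    endpoint=f"localhost:{env.minio_port}",
    access_key=env.aws_access_key,
    secret_key=env.aws_access_key,
)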

USER_UUID = UUID("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa")

In [11]:
env.bucket_name

'redbox-storage-dev'

In [19]:
next(V_RAW.glob("*.*")).name

'CS013b_Energy stats monthly brief september 2022.pdf'

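In [8]:
for file_path in V_RAW.glob("*.*"):
    # Object key in the bucket: version folder plus the original file name
    key = f"{DATA_VERSION}/{file_path.name}"
    file = File(key=key, bucket=env.bucket_name, creator_user_uuid=USER_UUID)
    # Upload the local file; fput_object expects the local path as a string
    minio.fput_object(
        bucket_name=env.bucket_name,
        object_name=key,
        file_path=str(file_path),
    )
    # Pass the File model, not the Path: passing a Path raises
    # AttributeError: 'PosixPath' object has no attribute 'bucket'
    chunks = chunk_file(file=file)
    print(len(chunks))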

In [None]:
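# NOTE: `es` (an Elasticsearch client) is not defined in this notebook and must be created before running this cell
storage_handler = ElasticsearchStorageHandler(es_client=es, root_index=env.elastic_root_index)
model = SentenceTransformerDB(env.embedding_model)
# `chunks` holds the chunks of the last file processed in the loop above
model.embed_sentences([chunk.text for chunk in chunks])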

[Back to top](#title)

-----------------------

## Troubleshooting <a class="anchor" id="troubleshooting"></a>

#### Langchain DirectoryLoader Error

If you run into a poppler path error even though poppler is installed and can be accessed from your virtual environment (check by running `pdfinfo -v`), close the notebook and restart the Jupyter server from a terminal where the path is correctly set (for example, by running `code notebooks/evaluation/evaluation_dataset_generation.ipynb`).

#### RAGAS synthetically generated evaluation data

We have found that some rows of the synthetic evaluation data generated with the RAGAS framework contain NaN values and/or values that are not of type str. These rows fail Pydantic validation, which results in an error for DeepEval metrics.

To avoid this, remove rows containing NaN and convert the RAGAS-generated evaluation data to type str.
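The order of the two steps matters, because casting to str first can turn a missing value into a literal string that `dropna()` no longer recognises (the exact behaviour depends on the pandas version). A minimal sketch on a toy frame, where the column names match the CSV produced above and the rows are made up for illustration:

```python
import pandas as pd

# Toy frame standing in for the RAGAS output; the second row has no ground truth
df = pd.DataFrame(
    {
        "input": ["What is X?", "What is Y?"],
        "expected_output": ["X is a thing", float("nan")],
    }
)

# Drop incomplete rows first, then cast the remaining values to str
clean = df.dropna().astype(str)
print(len(clean))
```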

#### DeepEval framework

At the moment, this notebook only loads the evaluation dataset into DeepEval from a CSV. There is a JSON import option that we are not using.
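Before handing the CSV to DeepEval, a quick check that the expected columns are present can catch renaming mistakes early. This is only a sketch using the standard library: `missing_columns` is a hypothetical helper, and the required column set is taken from the columns this notebook writes, not from the DeepEval documentation.

```python
import csv
import io

# Columns this notebook writes to ragas_synthetic_data.csv
REQUIRED_COLUMNS = {"input", "actual_output", "expected_output", "context"}


def missing_columns(csv_text: str) -> set[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# Made-up single-row sample with the same header as the saved CSV
sample = "input,context,expected_output,actual_output\nWhat is X?,['ctx'],X is a thing,\n"
print(missing_columns(sample))
```

For the real file, the same function could be called on the text read from the saved CSV in the `synthetic` folder.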

[Back to top](#title)

-------