# LLMs and RAG with DataChain for chatbot evaluation

In today's LLM applications, the emerging standard pattern for most use cases is to employ a pre-trained model through a third-party provider's API and to augment it with RAG context. On the one hand, this means there is not much actual machine learning going on at the user's end. On the other hand, naively applying the "latest and greatest" models with no prompt engineering, testing, or evaluation of the RAG context can lead to needlessly high operational costs at best and disappointingly poor performance at worst.

Therefore, even if there is no machine learning involved, there is still a lot of fine-tuning to do, and much of it involves large datasets (such as histories of chatbot conversations or large collections of company documents). Just like with ML training, we need to version all that data as we fine-tune our applications so that we can correctly evaluate the effect of any changes we make. We can experiment with the choice of LLM, prompt engineering, the way we process data for our RAG context (pre-processing, embedding, ...) and so on.

In this example, we will see how we can use DataChain to create such a controlled development environment and how it can help us when we evaluate any fine-tuning of our LLM applications. We will assume we intend to build a RAG-based chatbot and then evaluate its performance.

First, we will see how to use DataChain to version our RAG context datasets to preserve reproducibility of our fine-tuning experiments as the RAG context changes.

Then, we will evaluate the performance of a chatbot application using a testing dataset (which we can also version with DataChain) and see how DataChain LLM integrations are going to help us in this evaluation.

## 1. Processing a large collection of documents for RAG

Let's say that we have a collection of relevant documents which we want to use as context in LLM queries in our chatbot application. We will be using DataChain to create, store and version vector embeddings of our documents.

In this example we will be using papers from the [Neural Information Processing Systems](https://papers.neurips.cc/paper/) conference. 

We will proceed in the following steps:
1. [Data ingestion with DataChain](#data-ingestion) - we will use DataChain to ingest the data, taking advantage of its lazy evaluation feature to only ingest the data we need
1. [Data processing with the Unstructured Python library](#processing-the-documents-individually)
1. [Scaling the data processing with DataChain](#processing-the-documents-at-scale-using-datachain-udfs)
1. [Saving and versioning the dataset with DataChain](#saving-and-versioning-the-dataset)

In [17]:
import os

from typing import List
from collections.abc import Iterator

from datachain.lib.dc import DataChain, C
from datachain.lib.feature import Feature
from datachain.lib.file import File

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

from unstructured.cleaners.core import clean
from unstructured.cleaners.core import replace_unicode_quotes
from unstructured.cleaners.core import group_broken_paragraphs

from unstructured.embed.huggingface import HuggingFaceEmbeddingConfig, HuggingFaceEmbeddingEncoder

### Data ingestion

We will first ingest the dataset. The data are stored in cloud storage, so we use the `.from_storage` DataChain method. We will also use the `.filter` method to restrict ourselves to `.pdf` files only (the storage contains a lot of other data which we do not need).

Notice that:

1. Since DataChain employs lazy evaluation, no data are actually loaded just yet (until we invoke an action such as showing or saving our DataChain)
1. The previous point also means that when we filter out all non-pdf files, DataChain doesn't actually waste time loading their content only to throw them away later. This makes DataChain a lot more scalable than tools with eager evaluation.
1. The `.from_storage` method of DataChain operates on the level of the entire bucket. This means that even if the files are stored using a complicated directory structure and potentially uploaded irregularly into this structure, we can retrieve or update our DataChain of articles with just a simple one-line command
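
The lazy-evaluation points above can be illustrated with a minimal plain-Python sketch. It does not use DataChain and makes no claim about how DataChain is implemented; `make_loader`, `lazy_filter`, and the file names are hypothetical stand-ins for a bucket listing whose contents are expensive to fetch.

```python
from fnmatch import fnmatch
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical record of which files had their content fetched.
loaded: List[str] = []

def make_loader(name: str) -> Callable[[], bytes]:
    """Return a function that simulates an expensive download of one file."""
    def load() -> bytes:
        loaded.append(name)  # remember that this file was actually fetched
        return f"contents of {name}".encode()
    return load

Entry = Tuple[str, Callable[[], bytes]]

# Hypothetical bucket listing: names are cheap, contents are not.
listing: List[Entry] = [(n, make_loader(n)) for n in ["a.pdf", "b.txt", "c.pdf", "d.json"]]

def lazy_filter(entries: Iterable[Entry], pattern: str) -> Iterator[Entry]:
    """Yield entries whose name matches the pattern; no content is loaded here."""
    for name, loader in entries:
        if fnmatch(name, pattern):
            yield name, loader

# Building the pipeline does no work at all.
pipeline = lazy_filter(listing, "*.pdf")
assert loaded == []

# Only an action (here: consuming the generator) triggers loading,
# and only for the files that survived the filter.
contents = {name: loader() for name, loader in pipeline}
```

In this sketch the `.txt` and `.json` entries are never fetched, which is the property the list above attributes to lazy evaluation.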

In [2]:
dc = (
    DataChain.from_storage("gs://datachain-demo/neurips")
    .filter(C.name.glob("*.pdf"))
    )

In [3]:
dc.show(3)

Processed: 738 rows [00:00, 6473.19 rows/s]


Unnamed: 0,id,random,vtype,dir_type,parent,name,etag,version,is_latest,last_modified,...,file.source,file.parent,file.name,file.size,file.version,file.etag,file.is_latest,file.last_modified,file.location,file.vtype
0,4,319173006993350018,,0,neurips/1987/file,02e74f10e0327ad868d138f2b4fdd6f0-Paper.pdf,CPudi5uIqYcDEAE=,1721047139405563,1,2024-07-15 12:38:59.443000+00:00,...,gs://datachain-demo,neurips/1987/file,02e74f10e0327ad868d138f2b4fdd6f0-Paper.pdf,2291566,1721047139405563,CPudi5uIqYcDEAE=,1,1970-01-01 00:00:00+00:00,,
1,7,831179576623567236,,0,neurips/1987/file,03afdbd66e7929b125f8597834fa83a4-Paper.pdf,CJaf6pqIqYcDEAE=,1721047138865046,1,2024-07-15 12:38:58.917000+00:00,...,gs://datachain-demo,neurips/1987/file,03afdbd66e7929b125f8597834fa83a4-Paper.pdf,1322648,1721047138865046,CJaf6pqIqYcDEAE=,1,1970-01-01 00:00:00+00:00,,
2,10,4365267491844913134,,0,neurips/1987/file,072b030ba126b2f4b2374f342be9ed44-Paper.pdf,CJmztdWHqYcDEAE=,1721046993295769,1,2024-07-15 12:36:33.340000+00:00,...,gs://datachain-demo,neurips/1987/file,072b030ba126b2f4b2374f342be9ed44-Paper.pdf,1220711,1721046993295769,CJmztdWHqYcDEAE=,1,1970-01-01 00:00:00+00:00,,


[limited by 3 objects]


DataChain created a record for each `pdf` file in the `neurips` directory, generating a `file` signal for each file. The file signal contains subsignals with metadata about each file, like `file.name` and `file.size`. Aggregate signals like `file` that contain multiple subsignals are called features.

You can use the `file` feature to not only get metadata about each file, but also to open and read the file as needed.

### Processing the documents individually

We now want to ingest the content of the pdf files as text, clean it and divide it into chunks which we will vectorize for our RAG application. We might also want to compute some additional features such as the total number of pages of an article each chunk is from.

We will first do this with an example of a single pdf using the `unstructured` Python library and then we will see how we can scale this up to the entire bucket with the help of DataChain.

First, we ingest and partition the pdf file and chunk it.


In [19]:
chunks = chunk_by_title(partition_pdf(filename="sample.pdf"))

Next, we apply some cleaning methods to each chunk. The total number of pages in the article is recorded in the following step, once the chunks have been embedded.

In [5]:
# Clean the chunks and add new columns
for chunk in chunks:

    chunk.apply(lambda text: clean(text, bullets=True, extra_whitespace=True, trailing_punctuation=True))
    chunk.apply(replace_unicode_quotes)
    chunk.apply(group_broken_paragraphs)

Finally, we vectorize each chunk using a HuggingFace embedding encoder and take the page number of the last chunk as the total number of pages in the article.

In [10]:
# Define embedding encoder

embedding_encoder = HuggingFaceEmbeddingEncoder(
     config=HuggingFaceEmbeddingConfig()
)


chunks_embedded = embedding_encoder.embed_documents(chunks)
total_pages = chunks_embedded[-1].metadata.page_number

We now have our chunks vectorized and ready to be used in a RAG application. However, we are missing a few ingredients:

1. ***Scaling*** - we only processed a single pdf file and we had to manually specify its path. We need to find a way to process all our documents at scale instead and to save the results.
2. ***Saving and Versioning*** - even if we only had a single or a few PDF files we would like to use in our RAG, it is a good practice to version the outputs so that we can keep track of and fine-tune our RAG application. If we simply save the current results to a bucket and overwrite it each time the source is updated, we lose this. We could version the results manually, e.g. by adding a timestamp to the blob name, but that is not very reliable and will lead to unnecessary copies of files.
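
To make the versioning concern concrete, here is a minimal sketch of what a dataset registry could do instead of timestamped copies. It is a toy illustration under stated assumptions, not DataChain's mechanism: the `VersionedStore` class and its methods are hypothetical, and content is identified by its SHA-256 hash so that unchanged results are not stored twice.

```python
import hashlib
from typing import Dict, List

class VersionedStore:
    """Toy in-memory dataset registry (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}         # content hash -> data, stored once
        self.versions: Dict[str, List[str]] = {}  # dataset name -> ordered content hashes

    def save(self, name: str, data: bytes) -> int:
        """Store data under a dataset name and return its 1-based version number."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)       # identical content is not copied again
        history = self.versions.setdefault(name, [])
        if not history or history[-1] != digest:
            history.append(digest)                # new version only when content changed
        return len(history)

    def load(self, name: str, version: int) -> bytes:
        """Return the data for a specific version of a dataset."""
        return self.blobs[self.versions[name][version - 1]]

store = VersionedStore()
v1 = store.save("embeddings", b"chunks from 2 papers")
v1_again = store.save("embeddings", b"chunks from 2 papers")  # source unchanged
v2 = store.save("embeddings", b"chunks from 3 papers")        # source updated
```

Re-saving unchanged results keeps the same version number and creates no extra copy, while every earlier version stays retrievable by number. A timestamp-in-the-filename scheme gives neither guarantee.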

### Processing the documents at scale, using DataChain UDFs

We will now use DataChain to solve the scaling and versioning issues we outlined above. We will create a DataChain user-defined function (UDF) to process all our PDF files the way we did above with a single file (without having to provide file paths manually) and save the outputs in a DataChain.

The DataChain UDF functionality will allow us to generate additional columns in our DataChain, iterating over each of the files listed in it.

To define our pdf processing UDF, we first need to specify the output, using the DataChain `Feature` class. This will allow us to tell the UDF the output formats of the additional columns, which are grouped into what is called a Feature in DataChain.

```python
class Chunk(Feature):
    file: File
    total_pages: int
    text: str
    embeddings: List[float]
```

In the above we define our custom DataChain feature called `Chunk` by specifying its content:

* `file` tells DataChain to keep the existing columns (which were created when we created the DataChain `dc` using the `from_storage` method)
* `total_pages` will denote the total number of pages in the article each chunk belongs to, as before
* `text` will contain the actual text of the chunk (after cleaning and processing)
* `embeddings` will contain the list of vector embeddings created by the HuggingFace embedding encoder

We now specify the input and output of our UDF using Python type signatures (these can be Feature or regular Python types):

```python
def pdf_chunks(file: File) -> Iterator[Chunk]:
```
Then we load each pdf file as before. The only slight change is that we use the `open` method of the DataChain `File` class to open each file.

```python
# Ingest the file
with file.open() as f:
    chunks = chunk_by_title(partition_pdf(file=f))
```

We then proceed with the same code as before for each chunk, and then we define the content of the new rows in our DataChain for each chunk (here the column names should match what we defined in our `Chunk` class).

```python

    # Add new rows to DataChain
    for chunk in chunks_embedded:
        record = {"file": file}
        record["total_pages"] = total_pages
        record["text"] = chunk.text
        record["embeddings"] = chunk.embeddings

        yield Chunk(**record)
```

Putting all of this together, we have the following:

In [15]:
# Define the output as a Feature class
class Chunk(Feature):
    file: File
    total_pages: int
    text: str
    embeddings: List[float]

# Use signatures to define input/output types (these can be Feature or regular Python types)
def pdf_chunks(file: File) -> Iterator[Chunk]:
    # Ingest the file
    with file.open() as f:
        chunks = chunk_by_title(partition_pdf(file=f))

    # Clean the chunks and add new columns
    for chunk in chunks:

        chunk.apply(lambda text: clean(text, bullets=True, extra_whitespace=True, trailing_punctuation=True))
        chunk.apply(replace_unicode_quotes)
        chunk.apply(group_broken_paragraphs)

    chunks_embedded = embedding_encoder.embed_documents(chunks)
    total_pages = chunks_embedded[-1].metadata.page_number

    # Add new rows to DataChain
    for chunk in chunks_embedded:
        record = {"file": file}
        record["total_pages"] = total_pages
        record["text"] = chunk.text
        record["embeddings"] = chunk.embeddings

        yield Chunk(**record)

All that remains is to apply our new `pdf_chunks` UDF to the DataChain `dc`. We do that by applying the `gen` method to the DataChain with `pdf_chunks` as its parameter.

In [None]:
dc_chunks_embeddings = dc.gen(document=pdf_chunks)

dc_chunks_embeddings.show(3)
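
The generator pattern used by `pdf_chunks`, where one input row (a file) yields many output rows (its chunks), can be pictured with a small plain-Python sketch. This is only an illustration of the one-to-many idea and not DataChain's implementation; `Doc`, `Piece`, `split_paragraphs`, and `expand` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

@dataclass
class Doc:
    """Hypothetical stand-in for an input row (one file)."""
    name: str
    text: str

@dataclass
class Piece:
    """Hypothetical stand-in for an output row (one chunk)."""
    source: str
    text: str

def split_paragraphs(doc: Doc) -> Iterator[Piece]:
    """Generator function: yields zero or more output rows per input row."""
    for paragraph in doc.text.split("\n\n"):
        if paragraph.strip():
            yield Piece(source=doc.name, text=paragraph.strip())

def expand(rows: Iterable[Doc], fn: Callable[[Doc], Iterator[Piece]]) -> List[Piece]:
    """Apply fn to every input row and flatten all yielded rows into one list."""
    return [piece for row in rows for piece in fn(row)]

docs = [
    Doc("a.pdf", "Intro.\n\nMethod.\n\nResults."),
    Doc("b.pdf", "Abstract only."),
]
pieces = expand(docs, split_paragraphs)
```

Here two input rows become four output rows, each still carrying a reference to the document it came from, much as every `Chunk` keeps its `file`.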

We have now solved our scalability issues. When using `DataChain` locally, our computation is still restricted to a single machine, but for larger datasets you can use the SaaS version of DataChain, available through DVC Studio, which comes with automatic compute cluster management, a graphical user interface, and additional ML and data versioning features.

### Saving and versioning the dataset

We still need to solve our versioning issues and we will also use `DataChain` to do that. To save the annotated dataset in DataChain, we will use the `.save()` method on the `dc_chunks_embeddings` dataset object. 

In [None]:
dc_chunks_embeddings.save("neurips-features")

Saving datasets in `DataChain` allows us to:

- Persist the dataset and its metadata for future use
- Version the dataset to track changes over time
- Share the dataset with others in our team or organization
- Easily load the dataset in other DataChain workflows or notebooks

We now have a new dataset named "neurips-features" in our DataChain workspace, which contains the embedded pdf data. We will later load this dataset using `DataChain.from_dataset("neurips-features")` to access the embeddings.
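
Once the embeddings are available, a RAG application typically selects the chunks closest to the query embedding and passes their text to the LLM as context. The sketch below shows that retrieval step with cosine similarity in plain Python. It is an assumption about how the saved embeddings might be used, not part of the original notebook, and the three-dimensional vectors and chunk texts are made up for illustration (real embeddings have hundreds of dimensions).

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query: Sequence[float], chunks: List[Tuple[str, List[float]]], k: int = 2) -> List[str]:
    """Return the texts of the k chunks most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up (text, embedding) pairs standing in for rows of the saved dataset.
chunks = [
    ("chunk about backpropagation", [0.9, 0.1, 0.0]),
    ("chunk about hardware", [0.0, 0.2, 0.9]),
    ("chunk about gradient descent", [0.8, 0.3, 0.1]),
]
query_embedding = [1.0, 0.2, 0.0]
context = top_k(query_embedding, chunks, k=2)
```

For collections of this size a linear scan is enough. Larger collections usually rely on a vector index, which is outside the scope of this example.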

## 2. Chatbot fine-tuning evaluation


We will now use DataChain to evaluate changes in the performance of our chatbot model. We will also employ DataChain integrations with [Anthropic Claude](https://www.anthropic.com/claude) to quickly evaluate the results on a large number of queries.

Our testing dataset comes from stored chatbot data with a large number of user queries and bot responses. This dataset is publicly available in an Iterative GCS bucket.

We will use a prompt that will give us better insight into the success of our chatbot:

In [None]:
import json

import pandas as pd

from datachain.lib.claude import claude_processor
from datachain.lib.dc import C, DataChain
from datachain.lib.feature import Feature

MODEL = "claude-3-opus-20240229"
PROMPT = """Consider the dialogue between the 'user' and the 'bot'. \
The 'user' is a human trying to find the best mobile plan. \
The 'bot' is a chatbot designed to query the user and offer the \
best  solution. The dialog is successful if the 'bot' is able to \
gather the information and offer a plan, or inform the user that \
such plan does not exist. The dialog is not successful if the \
conversation ends early or the 'user' requests additional functions \
the 'bot' cannot perform. Read the dialogue below and rate it 'Success' \
if it is successful, and 'Failure' if not. After that, provide \
one-sentence explanation of the reasons for this rating. Use only \
JSON object as output with the keys 'status', and 'explanation'.
"""


class Rating(Feature):
    status: str = ""
    explanation: str = ""


dc = (
    DataChain.from_storage("gs://dvcx-datalakes/chatbot-public", type="text")
    .filter(C.name.glob("*.txt"))
    .settings(parallel=3)
    .limit(5)
    .map(claude=claude_processor(prompt=PROMPT, model=MODEL))
    .map(
        rating=lambda claude: Rating(
            **(json.loads(claude.content[0].text) if claude.content else {})
        ),
        output=Rating,
    )
)

dc.show()

Processed: 79 rows [00:00, 15449.71 rows/s]


OSError: source code not available
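
The second `map` step above parses the model's reply as JSON and would raise if the reply were not valid JSON. As a hedged sketch of a more defensive variant, the plain-Python example below falls back to an empty rating on malformed output and then tallies the statuses. `ParsedRating` and `parse_rating` are hypothetical names mirroring the `Rating` feature, and the replies are made up for illustration; they are not real model outputs.

```python
import json
from collections import Counter
from dataclasses import dataclass

@dataclass
class ParsedRating:
    """Hypothetical plain-Python mirror of the Rating feature."""
    status: str = ""
    explanation: str = ""

def parse_rating(reply_text: str) -> ParsedRating:
    """Parse a model reply; return an empty rating if it is not a JSON object."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return ParsedRating()
    if not isinstance(data, dict):
        return ParsedRating()
    return ParsedRating(
        status=str(data.get("status", "")),
        explanation=str(data.get("explanation", "")),
    )

# Made-up replies for illustration only.
replies = [
    '{"status": "Success", "explanation": "The bot offered a plan."}',
    '{"status": "Failure", "explanation": "The conversation ended early."}',
    "Sorry, I cannot rate this dialogue.",
]
ratings = [parse_rating(r) for r in replies]

# Count outcomes, treating replies that could not be parsed as their own bucket.
counts = Counter(r.status or "Unparsed" for r in ratings)
```

Keeping unparsed replies in a separate bucket makes it visible when the prompt fails to produce the requested JSON, instead of silently counting those dialogues as failures.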