# LLMs and RAG with DataChain

The emerging standard pattern for most LLM applications is to call a pre-trained model through a third-party provider's API and to augment it with RAG (retrieval-augmented generation) context. Naively applying the "latest and greatest" model with no prompt engineering, testing, or evaluation of the RAG context can lead to needlessly high operational costs at best and disappointingly poor performance at worst.

Therefore, just like in machine learning training, we need to version all that data as we fine-tune our applications, so that we can correctly evaluate the effect of any changes we make. We can experiment with the choice of LLM, prompt engineering, the way we process data for our RAG context (pre-processing, embedding, ...) and so on.

In this example, we will see how we can use DataChain to create such a controlled development environment and how it can help us when we evaluate any fine-tuning of our LLM applications.

We will see how to use DataChain to version our RAG context datasets to preserve reproducibility of our fine-tuning experiments as the RAG context changes. We will also see how to use DataChain in the evaluation of fine-tuning by comparing two different text embedding models and saving (and versioning) the results.

## Processing a large collection of documents

Let's say that we have a collection of relevant documents which we want to use as context in LLM queries in our chatbot application. We will be using DataChain to create, store and version vector embeddings of our documents.

In this example we will be using papers from the [Neural Information Processing Systems](https://papers.neurips.cc/paper/) conference. 

We will proceed in the following steps:
1. [Data ingestion with DataChain](#data-ingestion)
1. [Data processing with the Unstructured Python library](#processing-the-documents-individually)
1. [Scaling the data processing with DataChain](#processing-the-documents-at-scale-using-datachain-udfs)
1. [Using DataChain to evaluate different embedding models](#evaluation)

In [2]:
from sqlalchemy import func
from sqlalchemy import cast

from copy import deepcopy
from collections.abc import Iterator

from datachain.lib.dc import DataChain, C
from datachain.sql.types import Float
from datachain.lib.data_model import DataModel
from datachain.lib.file import File
from datachain.sql.functions.array import cosine_distance

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

from unstructured.embed.huggingface import HuggingFaceEmbeddingConfig, HuggingFaceEmbeddingEncoder

### Data ingestion

We will first ingest the dataset. The data are stored in a cloud storage bucket, so we use the `.from_storage` DataChain method. We will also use the `.filter` method to keep only the `.pdf` files (the bucket contains other data which we do not need).

Notice that:

1. Since DataChain employs lazy evaluation, no data are actually loaded yet (not until we invoke an action such as showing or saving our DataChain).
1. The previous point also means that when we filter out all non-PDF files, DataChain does not waste time loading their content only to throw it away later. This makes DataChain a lot more scalable than tools with eager evaluation.
1. The `.from_storage` method of DataChain operates on the level of the entire bucket. This means that even if the files are stored in a complicated directory structure and uploaded into it irregularly, we can retrieve or update our DataChain of articles with a simple one-line command.
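
The first two points can be illustrated with a toy analogy in plain Python. This is only a sketch of the general lazy-evaluation idea using generators, not DataChain's internals; the `LISTING` data and the `load` and `lazy_pipeline` helpers are hypothetical names made up for this illustration.

```python
from collections.abc import Iterator
from fnmatch import fnmatch

# Hypothetical listing of (name, size) metadata; no file contents involved.
LISTING = [
    ("a-Paper.pdf", 100),
    ("a-Metadata.json", 5),
    ("b-Paper.pdf", 200),
    ("b-Bibtex.bib", 3),
]

loaded: list[str] = []  # records which files were actually "loaded"


def load(name: str) -> str:
    """Stand-in for an expensive download or read of a file's content."""
    loaded.append(name)
    return f"<content of {name}>"


def lazy_pipeline(listing: list[tuple[str, int]], pattern: str) -> Iterator[str]:
    # Filtering looks at metadata only. load() runs only for matching names,
    # and only when the consumer asks for the next item.
    for name, _size in listing:
        if fnmatch(name, pattern):
            yield load(name)


pipeline = lazy_pipeline(LISTING, "*.pdf")  # building the pipeline loads nothing
nothing_loaded_yet = list(loaded)           # snapshot taken before any action

contents = list(pipeline)                   # the "action" that triggers the work
print("loaded before action:", nothing_loaded_yet)
print("loaded after action:", loaded)
```

In this toy version, the non-PDF entries are never passed to `load`, which mirrors why filtering before an action is cheap.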

In [3]:
dc_papers = (
    DataChain.from_storage("gs://datachain-demo/neurips")
    .filter(C.name.glob("*.pdf"))
    )

In [4]:
dc_papers.show(3)

Listing gs://datachain-demo: 738 objects [00:00, 871.19 objects/s]
Processed: 738 rows [00:00, 10006.87 rows/s]


| | file.source | file.parent | file.name | file.size | file.version | file.etag | file.is_latest | file.last_modified | file.location | file.vtype |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gs://datachain-demo | neurips/1987/file | 02e74f10e0327ad868d138f2b4fdd6f0-Paper.pdf | 2291566 | 1721047139405563 | CPudi5uIqYcDEAE= | 1 | 2024-07-15 12:38:59.443000+00:00 | | |
| 1 | gs://datachain-demo | neurips/1987/file | 03afdbd66e7929b125f8597834fa83a4-Paper.pdf | 1322648 | 1721047138865046 | CJaf6pqIqYcDEAE= | 1 | 2024-07-15 12:38:58.917000+00:00 | | |
| 2 | gs://datachain-demo | neurips/1987/file | 072b030ba126b2f4b2374f342be9ed44-Paper.pdf | 1220711 | 1721046993295769 | CJmztdWHqYcDEAE= | 1 | 2024-07-15 12:36:33.340000+00:00 | | |



[Limited by 3 rows]


DataChain created a record for each `pdf` file in the `neurips` directory, generating a `file` signal for each file. The file signal contains subsignals with metadata about each file, like `file.name` and `file.size`. Aggregate signals like `file` that contain multiple subsignals are called features.

You can use the `file` feature to not only get metadata about each file, but also to open and read the file as needed.

### Processing the documents individually

We now want to ingest the content of the pdf files as text, divide it into chunks and vectorize those for our RAG application. We are interested in comparing two different models for embeddings. Normally, we would also do some pre-processing and cleaning of the text before vectorization, but we will skip it here for brevity.

We will first do all this with an example of a single pdf using the `unstructured` Python library and then we will see how we can scale this up to the entire bucket with the help of DataChain.

First, we ingest and partition the pdf file and chunk it.


In [5]:
chunks = chunk_by_title(partition_pdf(filename="sample.pdf"))


Next, we vectorize each chunk using HuggingFace embedding encoders. Ideally, we want the smallest model possible while maintaining accuracy to increase speed and reduce costs of embeddings. We will see how embeddings from a candidate model `MODEL_NEW` differ from embeddings produced by the existing model `MODEL_OLD`.

In [6]:
MODEL_NEW = "sentence-transformers/paraphrase-MiniLM-L6-v2" 
MODEL_OLD = "sentence-transformers/all-MiniLM-L6-v2"

embedding_encoder_new = HuggingFaceEmbeddingEncoder(
     config=HuggingFaceEmbeddingConfig(model_name=MODEL_NEW, encode_kwargs={"normalize_embeddings":True})
)

chunks_embedded_new = embedding_encoder_new.embed_documents(chunks)

embedding_encoder_old = HuggingFaceEmbeddingEncoder(
     config=HuggingFaceEmbeddingConfig(model_name=MODEL_OLD, encode_kwargs={"normalize_embeddings":True})
)

# we need deepcopy here because unstructured creates lists of references to elements
chunks_embedded_old = embedding_encoder_old.embed_documents(deepcopy(chunks))



We now have our chunks vectorized and ready for comparison (e.g. with cosine similarity). However, we are missing a few ingredients:

1. ***Scaling*** - we only processed a single pdf file and we had to manually specify its path. We need to find a way to process all our documents at scale instead and to save the results.
2. ***Saving and Versioning*** - even if we only had a single or a few PDF files we would like to use in our RAG, it is a good practice to version the outputs so that we can keep track of and fine-tune our RAG application. If we simply save the current results to a bucket and overwrite it each time the source is updated, we lose this. We could version the results manually, e.g. by adding a timestamp to the blob name, but that is not very reliable and will lead to unnecessary copies of files.

### Processing the documents at scale, using DataChain UDFs

We will now use DataChain to solve the scaling and versioning issues we outlined above. We will create a DataChain user-defined function (UDF) to process all our PDF files the way we did above with a single file (without having to provide file paths manually) and save the outputs in a DataChain.

The DataChain UDF functionality will allow us to generate additional columns in our DataChain, iterating over each of the files listed in it.

We first need to define a `DataModel` class, which will define the types of our outputs. Inputs and outputs need to be specified like this when we use custom functions in DataChain.

In [7]:
# Define the output schema as a DataModel class
class Chunk(DataModel):
    key: str
    text: str
    embeddings_new: list[float]
    embeddings_old: list[float]

In the above we define `Chunk` by specifying the names and types of new columns on the output.

We then define our processing function `process_pdf`:

In [8]:
# Use signatures to define input/output types (these can be DataModel classes or regular Python types)
def process_pdf(file: File) -> Iterator[Chunk]:
    # Ingest the file
    with file.open() as f:
        chunks = partition_pdf(file=f, chunking_strategy="by_title")

    chunks_embedded_new = embedding_encoder_new.embed_documents(chunks)
    chunks_embedded_old = embedding_encoder_old.embed_documents(deepcopy(chunks))

    # Add new rows to DataChain
    for chunk, chunk_orig in zip(chunks_embedded_new, chunks_embedded_old):
        yield Chunk(
            key=file.name.removesuffix("-Paper.pdf"),
            text=chunk.text,
            embeddings_new=chunk.embeddings,
            embeddings_old=chunk_orig.embeddings,
        )

Here, the syntax is the same as for any other Python function, except that we specify the input and output types using type hints:

```
def process_pdf(file: File) -> Iterator[Chunk]:
```
Here, `file` specifies that we pass the `file` signal of the original dataset (all of its `file.*` columns) as the input, and `Iterator[Chunk]` specifies that we get multiple `Chunk` rows as the output. From a single row of the original DataChain, representing a single paper, we get multiple rows in the new dataset, each representing a single chunk of that paper.

We then specify what each row should contain by specifying the attributes of our `Chunk` class and then we use `yield` to create the new rows for each input row.
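
The one-row-in, many-rows-out pattern can be sketched without any PDF or embedding libraries. The example below is a simplified illustration only: `ToyChunk` and `split_document` are hypothetical stand-ins for `Chunk` and `process_pdf`, and the "chunking" is a naive fixed-size split rather than the title-based strategy used above.

```python
from collections.abc import Iterator
from dataclasses import dataclass


@dataclass
class ToyChunk:  # hypothetical stand-in for the Chunk DataModel
    key: str
    text: str


def split_document(name: str, text: str, size: int = 10) -> Iterator[ToyChunk]:
    """Yield one ToyChunk per `size` characters of `text`."""
    for start in range(0, len(text), size):
        yield ToyChunk(key=name, text=text[start:start + size])


# Two toy "papers" of 25 and 8 characters.
documents = {"paper-a": "x" * 25, "paper-b": "y" * 8}

# Flatten: every input document contributes one or more output rows.
rows = [
    chunk
    for name, text in documents.items()
    for chunk in split_document(name, text)
]
print(len(documents), "documents ->", len(rows), "rows")
```

Because the function is a generator, the number of output rows per input row does not have to be known in advance, which is what makes the pattern a natural fit for chunking.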

In [9]:
dc_chunks_embeddings = (
    dc_papers
    .limit(20) # we limit ourselves to 20 papers here, to speed up the demo
    .gen(document=process_pdf)
)

dc_chunks_embeddings.save("embeddings")

Processed: 738 rows [00:00, 9812.89 rows/s]
Download: 51.2MB [01:37, 552kB/s]
Processed: 20 rows [01:37,  4.88s/ rows]
Generated: 1346 rows [01:32, 14.59 rows/s]


<datachain.lib.dc.DataChain at 0x7fdc4684b9e0>

In the cell above we apply our new `process_pdf` function to the DataChain `dc_papers`. We do that by passing `process_pdf` to the `gen` method of DataChain as the keyword argument `document`, which becomes the name of the new signal holding the generated `Chunk` values.

`DataChain.gen` is used when we have a function that creates multiple rows per single row of the original DataChain (as in our example, where each paper is split into multiple chunks).

We also persisted the result with the `.save` method. This permanently saves and versions the DataChain as a dataset with the name `embeddings`. Whenever we call `.save("embeddings")` again, a new version of this dataset will be saved automatically, so we can recall previous versions and track changes of the dataset over time.
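
To make the idea of saving under one name while keeping every version concrete, here is a toy in-memory sketch. It is purely illustrative: `ToyVersionedStore` and its methods are hypothetical and say nothing about how DataChain actually stores or numbers dataset versions.

```python
from typing import Optional


class ToyVersionedStore:
    """Hypothetical in-memory store: each save() under a name adds a version."""

    def __init__(self) -> None:
        self._versions: dict[str, list[list[dict]]] = {}

    def save(self, name: str, rows: list[dict]) -> int:
        versions = self._versions.setdefault(name, [])
        # Store a snapshot so later changes to `rows` do not alter history.
        versions.append([dict(row) for row in rows])
        return len(versions)  # 1-based version number of this save

    def load(self, name: str, version: Optional[int] = None) -> list[dict]:
        versions = self._versions[name]
        index = len(versions) if version is None else version
        if not 1 <= index <= len(versions):
            raise ValueError(f"{name!r} has no version {index}")
        return versions[index - 1]


store = ToyVersionedStore()
v1 = store.save("embeddings", [{"key": "a", "text": "first run"}])
v2 = store.save("embeddings", [{"key": "a", "text": "second run"}])

latest = store.load("embeddings")               # newest version by default
original = store.load("embeddings", version=1)  # older version still available
print(v1, v2, latest[0]["text"], original[0]["text"])
```

Compared with appending a timestamp to a blob name by hand, the version bookkeeping lives in one place and earlier results are never overwritten.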

### Evaluation

We will now use DataChain to calculate the similarity between the two alternative embeddings, using a fixed test query as a reference. For further evaluation, we will save a dataset containing the chunks where the two embeddings differ the most.


In [10]:
TEST_QUERY = "What are the most promising approaches for combining neural networks with symbolic reasoning, according to recent NeurIPS papers?"

embedded_query_new = embedding_encoder_new.embed_query(query = TEST_QUERY)
embedded_query_old = embedding_encoder_old.embed_query(query = TEST_QUERY)

Using the built-in DataChain function `cosine_distance` we will calculate the cosine similarities between each chunk and the test query `TEST_QUERY` and then compare the results between the two embeddings.
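
For intuition, the relationship between cosine distance and cosine similarity can be written out in plain Python. This is a sketch of the underlying math only, under the assumption (consistent with the `1 - cosine_distance(...)` expressions in this example) that cosine distance is one minus cosine similarity; the helper names `cosine_similarity` and `cosine_distance_py` are made up here and are not DataChain functions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance_py(a: list[float], b: list[float]) -> float:
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # parallel to the query
orthogonal = [0.0, 3.0]      # unrelated to the query
opposite = [-1.0, 0.0]       # points the other way

for name, vec in [
    ("same", same_direction),
    ("orthogonal", orthogonal),
    ("opposite", opposite),
]:
    similarity = 1.0 - cosine_distance_py(vec, query)
    print(name, round(similarity, 3))
```

Similarity ranges from -1 to 1, which is why slightly negative values can appear in the results. For unit-length vectors (the encoders above were configured with `normalize_embeddings=True`), the similarity reduces to the dot product.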

To specify that we want to compare columns we use the `C` class from `datachain.lib.dc`. We use the `mutate` method of DataChain, which is a way to add new columns to an existing dataset.

Since we saved our dataset `embeddings`, we can now load it into a DataChain with the `from_dataset` method.

In [11]:
embeddings_differences = (
    DataChain
    .from_dataset("embeddings")
    .mutate(
        query_sim_new = 1 - cosine_distance(C.document.embeddings_new, embedded_query_new),
        query_sim_old = 1 - cosine_distance(C.document.embeddings_old, embedded_query_old),
        )
    .mutate(abs_difference = cast(func.abs(C.query_sim_old - C.query_sim_new), Float))
    .filter(C.abs_difference > 0.1)
    .order_by("abs_difference", descending=True)
    .save("embeddings-differences")
)



In [12]:
embeddings_differences.show(3)

| | document.key | document.text | document.embeddings_new | document.embeddings_old | query_sim_new | query_sim_old | abs_difference |
|---|---|---|---|---|---|---|---|
| 0 | 32bb90e8976aab5298d5da10fe66f21d | value a unit takes on after probing becomes de... | [-0.029339559376239777, -0.008152788504958153,... | [-0.04788152500987053, -0.016090188175439835, ... | 0.341957 | -0.018575 | 0.360531 |
| 1 | 03afdbd66e7929b125f8597834fa83a4 | To evaluate E(ln - h2- K I), we estimate the v... | [-0.010849174112081528, -0.025998728349804878,... | [0.007942826487123966, 0.03667686879634857, 0.... | 0.223943 | -0.080508 | 0.304451 |
| 2 | 1f0e3dad99908345f7439f8ffabdffc4 | [3] Brody D.A., IEEE Trans. vBME-32, n2, pl06-... | [-0.12260544300079346, 0.07194254547357559, -0... | [-0.11976780742406845, -0.06083959713578224, -... | 0.043594 | 0.339409 | 0.295815 |



[Limited by 3 rows]


We can now explore where our old and new embeddings differ the most in terms of their distance to the test query. We are mostly curious about how the RAG context provided changes when we change the embedding, so we will have a look at how the sets of closest 10 chunks differ between the embeddings.

We use the `.collect` method of DataChain to retrieve a set of the 10 most relevant chunks (since we will want to have a look at them in detail).

In [13]:
N_RELEVANT = 10

top_old = set(embeddings_differences
               .order_by("query_sim_old", descending=True)
               .limit(N_RELEVANT)
               .select("document.text")
               .collect()
               )
top_new = set(embeddings_differences
               .order_by("query_sim_new", descending=True)
               .limit(N_RELEVANT)
               .select("document.text")
               .collect()
               )

We create a simple metric we call retrieval similarity with values between 0 and 1. If the retrieval sets were the same for both embeddings, its value would be 1. If they were completely different, the value would be 0.

In [14]:
retrieval_similarity = 1 - len(top_old ^ top_new) / (2 * N_RELEVANT)
print(retrieval_similarity)

0.4


We can see that there is a substantial difference between the two embeddings.
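
To see what a score of 0.4 means, the metric can be reproduced on toy data. The sketch below assumes that both retrieval sets contain exactly `n` distinct chunks; the chunk labels are invented placeholders, and the `retrieval_similarity` function is just the formula from the cell above wrapped in a helper.

```python
def retrieval_similarity(top_a: set, top_b: set, n: int) -> float:
    """1 minus the size of the symmetric difference, scaled by 2 * n."""
    return 1 - len(top_a ^ top_b) / (2 * n)


N = 10

# Toy retrieval sets: 4 chunks in common, 6 unique to each model.
shared = {f"shared-{i}" for i in range(4)}
only_old = {f"old-{i}" for i in range(6)}
only_new = {f"new-{i}" for i in range(6)}

toy_old = shared | only_old
toy_new = shared | only_new

score = retrieval_similarity(toy_old, toy_new, N)
overlap_fraction = len(toy_old & toy_new) / N

print(score, overlap_fraction)
```

When both sets have exactly `n` elements, the symmetric difference has `2 * n - 2 * overlap` elements, so the metric equals the fraction of chunks the two sets share. Under that assumption, a score of 0.4 with `n = 10` corresponds to 4 shared chunks and 6 chunks unique to each model.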

Finally, to get a bit more insight into how the two retrieval sets differ, we can have a look at what context appears in one set but not the other.

In [15]:
set(top_old) - set(top_new)

{('In the present paper we survey and utilize results from the qualitative theory of large scale interconnected dynamical systems in order to develop a qualitative theory for the Hopfield model of neural networks. In our approach we view such networks as an inter connection of many single neurons. Our results are phrased in terms of the qualitative properties of the individual neurons and in terms of the properties of the interconnecting structure of the neural networks. Aspects of neural networks',),
 ('This research was also sponsored by the same agency under contract N00039-87-\n\nC-0251 and monitored by the Space and Naval Warfare Systems Command.\n\n231\n\n232\n\nReferences\n\n[1] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences U.SA., vol. 79, pp. 2554-2558, April 1982.\n\n[2] J. Hopfield and D. Tank, \'\'\'Neural\' computation of decisions in optimization',),
 ('boolean comput

In [16]:
set(top_new) - set(top_old)

{('474\n\nOPTIMIZA nON WITH ARTIFICIAL NEURAL NETWORK SYSTEMS: A MAPPING PRINCIPLE AND A COMPARISON TO GRADIENT BASED METHODS t\n\nHarrison MonFook Leong Research Institute for Advanced Computer Science NASA Ames Research Center 230-5 Moffett Field, CA, 94035\n\nABSTRACT',),
 ('Neural nets, in contrast to popular misconception, are capable of quite accurate number crunching, with an accuracy for the prediction problem we considered that exceeds conventional methods by orders of magnitude. Neural nets work by constructing surfaces in a high dimensional space, and their oper ation when performing signal processing tasks on real valued inputs, is closely related to standard methods of functional ,,-pproximation. One does not need more than two hidden layers for processing',),
 ('Neural networks have attracted much interest recently, and using parallel architectures to simulate neural networks is a natural and necessary applica tion. The SIMD model of parallel computation is chosen, becaus

From a casual look, it seems that our old embedding model does a better job of picking the right chunks: all of the chunks unique to the old model seem to contain relevant text, whereas two of the six snippets unique to the new model are just a title and a two-word phrase, respectively.

### Summary

We have now solved our scalability issues. When using DataChain locally, our computation will still be restricted to a single machine, but for larger datasets you can use the SaaS version of DataChain, available through DVC Studio, which comes with automatic compute cluster management, a graphical user interface, and additional ML and data versioning features.

We have also solved our versioning needs and we can track the differences between embeddings over time and use that to choose the best embedding for our use-case.

### Where to go from here?

To turn this example into a real-world scenario, we might first want to use DataChain to create a dataset of the most common queries (instead of just the one test query above) and then compute the retrieval similarity metric for the two embeddings averaged across all of these queries. This would give us a good way to judge whether we can replace the current embedding model (which is presumably more expensive) with a faster or cheaper one without affecting the accuracy of our RAG much.
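
A minimal sketch of such an averaged metric follows. It assumes we already have, for each query, the two sets of top chunks retrieved with the old and new embeddings; the query names, chunk labels, and the `mean_retrieval_similarity` helper are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean


def retrieval_similarity(top_a: set, top_b: set, n: int) -> float:
    """Same metric as in the evaluation step: 1 - |A ^ B| / (2 * n)."""
    return 1 - len(top_a ^ top_b) / (2 * n)


def mean_retrieval_similarity(
    results: dict[str, tuple[set, set]], n: int
) -> float:
    """Average the per-query retrieval similarity over all queries."""
    return mean(
        retrieval_similarity(top_old, top_new, n)
        for top_old, top_new in results.values()
    )


# Hypothetical per-query results: (top chunks with old model, with new model).
N = 2
toy_results = {
    "query-1": ({"a", "b"}, {"a", "b"}),  # identical sets   -> 1.0
    "query-2": ({"a", "b"}, {"a", "c"}),  # one chunk shared -> 0.5
    "query-3": ({"a", "b"}, {"c", "d"}),  # nothing shared   -> 0.0
}

average = mean_retrieval_similarity(toy_results, N)
print(average)
```

Looking at the per-query values alongside the average is also useful, since a reasonable mean can hide individual queries for which the two models retrieve entirely different context.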

Also, from a detailed look at our retrieval results, it is clear that we can improve them by improving the PDF processing itself: cleaning the text and trying out various chunking strategies can lead to more relevant context before we even consider changing the embedding model. There are many potential combinations of data processing and cleaning strategies, embedding models and their parameters (and also the choice of the corpus of text we use as our RAG context).

DataChain can be of tremendous help with this experimentation, since it helps us to:

1. Do all of that data processing at scale.
1. Provide the versioning and reproducibility necessary to systematically arrive at the best possible RAG configuration for our use-case.