<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_06_2_chromadb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative Artificial Intelligence
**Module 6: Retrieval-Augmented Generation (RAG)**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 6 Material

* Part 6.1: Introduction to Retrieval-Augmented Generation (RAG) [[Video]](https://www.youtube.com/watch?v=qA52K0K181Q) [[Notebook]](t81_559_class_06_1_rag.ipynb)
* **Part 6.2: Introduction to ChromaDB** [[Video]](https://www.youtube.com/watch?v=R53lo4sevLQ) [[Notebook]](t81_559_class_06_2_chromadb.ipynb)
* Part 6.3: Understanding Embeddings [[Video]](https://www.youtube.com/watch?v=Tq82Gl2ZZNM) [[Notebook]](t81_559_class_06_3_embeddings.ipynb)
* Part 6.4: Question Answering Over Documents [[Video]](https://www.youtube.com/watch?v=hCwL_lW-gP0) [[Notebook]](t81_559_class_06_4_qa.ipynb)
* Part 6.5: Embedding Databases [[Video]](https://www.youtube.com/watch?v=BG2gT4uYxhM) [[Notebook]](t81_559_class_06_5_embed_db.ipynb)


# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab. When it is, the code reads the OpenAI API key from the CoLab secrets and installs the required libraries.

In [1]:
import os

try:
    from google.colab import drive, userdata
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# OpenAI Secrets
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai pypdf chromadb datasets

Note: using Google CoLab
Collecting langchain
  Downloading langchain-0.2.5-py3-none-any.whl (974 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.1.8-py3-none-any.whl (38 kB)
Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Collecting chromadb
  Downloading chromadb-0.5.3-py3-none-any.whl (559 kB)
Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting langchain-core<0.3.0,>=0.2.7 (from langchain)
  Downloading langchain

# 6.2: Introduction to ChromaDB

In this part, we explore the basic use of Chroma to store information as embeddings and retrieve it efficiently. This technique forms the backbone of many advanced AI applications.

* [ChromaDB](https://www.trychroma.com/)

Embeddings serve as the AI-native representation of various data types, making them ideal for use with a wide range of AI-powered tools and algorithms. They can encapsulate the essence of text, images, and, soon, even audio and video.

An embedding model processes data and produces an embedding: a vector of numbers. The model is designed so that similar data, such as text with analogous meanings or images with comparable content, yields vectors that lie closer together in the vector space, while dissimilar data yields vectors that lie farther apart.
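
To make "closer together" concrete, the sketch below compares hand-made vectors with cosine similarity, one common closeness measure. The three-dimensional vectors and the `cosine_similarity` helper are illustrative assumptions, not output from a real embedding model, which typically produces hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings" (illustrative values only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.1, 0.0, 0.9]

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)

# Related items should score higher than unrelated ones.
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

A score near 1 means the vectors point in nearly the same direction; a score near 0 means they are unrelated under this measure.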

The architecture of ChromaDB is shown in Figure 6.ChromaDB.

**Figure 6.ChromaDB: The Architecture of ChromaDB**
![ChromaDB](https://data.heatonresearch.com/images/wustl/app_genai/hrm4.svg)

The core API of ChromaDB consists of only four API calls, the first of which establishes the client you will use to interact with ChromaDB.

```
import chromadb
client = chromadb.Client()  # in-memory client; chromadb.HttpClient() instead connects to a running Chroma server
```

The client allows you to create a collection that will hold documents.

```
collection = client.create_collection("sample_collection")
```

Next, we add documents to this collection.

```
collection.add(
 documents=["This is document1", "This is document2"], # we embed for you, or bring your own
 metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on arbitrary metadata!
 ids=["doc1", "doc2"], # must be unique for each doc
)
```

With this all in place, you can now perform a query.

```
results = collection.query(
 query_texts=["This is a query document"],
 n_results=2,
 # where={"metadata_field": "is_equal_to_this"}, # optional filter
 # where_document={"$contains":"search_string"}  # optional filter
)
```
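
Conceptually, these calls store one vector per document and return the stored items nearest to the query vector, optionally restricted by metadata. The sketch below is a toy stand-in written for illustration only: `ToyCollection`, `squared_l2`, the two-dimensional vectors, and the extra "doc3" entry are all hypothetical, and the code does not reflect how Chroma is implemented. Chroma embeds the text for you and uses an index rather than the linear scan shown here.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyCollection:
    """Hypothetical in-memory stand-in for a vector collection."""

    def __init__(self):
        self._rows = {}  # id -> (embedding, document, metadata)

    def add(self, ids, embeddings, documents, metadatas):
        for id_, emb, doc, meta in zip(ids, embeddings, documents, metadatas):
            if id_ in self._rows:
                raise ValueError(f"duplicate id: {id_}")
            self._rows[id_] = (emb, doc, meta)

    def query(self, query_embedding, n_results=1, where=None):
        def matches(meta):
            # Keep a row only if every filter key equals the requested value.
            return where is None or all(meta.get(k) == v for k, v in where.items())

        scored = [
            (squared_l2(query_embedding, emb), id_, doc)
            for id_, (emb, doc, meta) in self._rows.items()
            if matches(meta)
        ]
        scored.sort(key=lambda row: row[0])  # smallest distance first
        return scored[:n_results]


toy = ToyCollection()
toy.add(
    ids=["doc1", "doc2", "doc3"],
    embeddings=[[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]],  # made-up vectors
    documents=["This is document1", "This is document2", "This is document3"],
    metadatas=[{"source": "notion"}, {"source": "google-docs"}, {"source": "notion"}],
)

nearest = toy.query([1.0, 0.0], n_results=2)
filtered = toy.query([1.0, 0.0], n_results=2, where={"source": "notion"})
print([id_ for _, id_, _ in nearest])
print([id_ for _, id_, _ in filtered])
```

The metadata filter is applied before ranking, so the filtered query can return a more distant document than the unfiltered one.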

## Loading and Querying ChromaDB

We now look at an example of loading and querying ChromaDB. To do this, we will use the [SciQ dataset](https://huggingface.co/datasets/allenai/sciq) from HuggingFace. The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry, and Biology, among others. The questions are in multiple-choice format with 4 answer options each. Additionally, the authors provided a paragraph with supporting evidence for the correct answer to most questions.

We begin by loading those questions with supporting information.



In [2]:
# Get the SciQ dataset from HuggingFace
from datasets import load_dataset

dataset = load_dataset("allenai/sciq", split="train")

# Filter the dataset to only include questions with a support
dataset = dataset.filter(lambda x: x["support"] != "")

print("Number of questions with support: ", len(dataset))

Number of questions with support:  10481


Next, we create a ChromaDB client and a collection named "sciq_supports" to store the supporting evidence. The default Chroma client is ephemeral, meaning it does not save to disk; in Part 6.5, we will see how to save this data. We do not need to specify an embedding function, so the collection uses the default.

In [3]:
import chromadb

client = chromadb.Client()

collection = client.create_collection("sciq_supports")

Next, we embed and store the first 100 supports for this example.

In [4]:
collection.add(
    ids=[str(i) for i in range(100)],  # IDs are just strings
    documents=dataset["support"][:100],
    metadatas=[{"type": "support"} for _ in range(100)],
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:17<00:00, 4.87MiB/s]


We now query the collection with the first 10 questions, asking for the single closest support for each one.

In [5]:
results = collection.query(
    query_texts=dataset["question"][:10],
    n_results=1)
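
The query returns a dictionary whose fields are nested lists: the outer list has one entry per query text, and each inner list holds that query's results, ordered from nearest to farthest. The sketch below walks a hand-written dictionary with the same nesting. The IDs, texts, and distances are made-up placeholders rather than real query output, and `best_matches` is a hypothetical helper.

```python
# Mock result shaped like a query over two query texts with n_results=2.
# All values are placeholders, not real output.
mock_results = {
    "ids": [["7", "3"], ["12", "7"]],
    "documents": [["support A", "support B"], ["support C", "support A"]],
    "distances": [[0.41, 0.78], [0.35, 0.92]],
}
mock_questions = ["first question", "second question"]


def best_matches(questions, results):
    """Pair each question with its top-ranked (id, document, distance)."""
    pairs = []
    for i, question in enumerate(questions):
        pairs.append(
            (
                question,
                results["ids"][i][0],        # i-th query, first (nearest) hit
                results["documents"][i][0],
                results["distances"][i][0],
            )
        )
    return pairs


for question, doc_id, document, distance in best_matches(mock_questions, mock_results):
    print(f"{question!r} -> id={doc_id}, distance={distance}, text={document!r}")
```

The first index selects the query and the second selects the rank, so `results['documents'][i][0]` is the nearest document for query `i`.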

We display each question with its retrieved support to see which passage was matched to it. Later, we will see how RAG feeds this support into the prompt to answer the question.

In [6]:
# Print the question and the corresponding support
for i, q in enumerate(dataset['question'][:10]):
    print(f"Question: {q}")
    print(f"Retrieved support: {results['documents'][i][0]}")
    print()

Question: What type of organism is commonly used in preparation of foods such as cheese and yogurt?
Retrieved support: Agents of Decomposition The fungus-like protist saprobes are specialized to absorb nutrients from nonliving organic matter, such as dead organisms or their wastes. For instance, many types of oomycetes grow on dead animals or algae. Saprobic protists have the essential function of returning inorganic nutrients to the soil and water. This process allows for new plant growth, which in turn generates sustenance for other organisms along the food chain. Indeed, without saprobe species, such as protists, fungi, and bacteria, life would cease to exist as all organic carbon became “tied up” in dead organisms.

Question: What phenomenon makes global winds blow northeast to southwest or the reverse in the northern hemisphere and northwest to southeast or the reverse in the southern hemisphere?
Retrieved support: Without Coriolis Effect the global winds would blow north to south