# Basic embedding retrieval with Chroma



This notebook demonstrates the most basic use of Chroma to store and retrieve information using embeddings. This core building block is at the heart of many powerful AI applications.

## What are embeddings?

Embeddings are the AI-native way to represent any kind of data, which makes them a natural fit for AI-powered tools and algorithms. They can represent text, images, and soon audio and video.

To create an embedding, data is fed into an embedding model, which outputs a vector of numbers. The model is trained so that 'similar' data, e.g. text with similar meanings or images with similar content, produces vectors that are nearer to one another than the vectors produced by dissimilar data.
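To make "nearer to one another" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are hand-written toy values, not the output of any embedding model, and cosine similarity is only one common way to measure closeness; it is not necessarily the measure Chroma uses by default.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration (real embeddings have hundreds of dimensions).
toothbrush = [0.9, 0.1, 0.0]
toothpaste = [0.8, 0.2, 0.1]
bicycle = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(toothbrush, toothpaste)
sim_unrelated = cosine_similarity(toothbrush, bicycle)

# The related pair should score higher than the unrelated pair.
print(sim_related > sim_unrelated)
```

A score close to 1 means the vectors point in nearly the same direction, while a score close to 0 means they are nearly orthogonal.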

## Embeddings and retrieval

We can use the similarity property of embeddings to search for and retrieve information. For example, we can find documents relevant to a particular topic, or images similar to a given image. Rather than searching for keywords or tags, we can search by finding data with similar semantic meaning.
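Retrieval by similarity can be sketched as a brute-force nearest-neighbour search: score every stored vector against the query vector and keep the closest few. The document labels and vectors below are hypothetical, and squared Euclidean distance is used as one reasonable choice of measure. A vector database would normally use an index rather than scanning every vector.

```python
def squared_distance(a, b):
    """Return the squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, doc_vecs, k=2):
    """Return (doc_id, distance) pairs for the k stored vectors closest to the query."""
    scored = [(doc_id, squared_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


# Hypothetical document vectors, invented for illustration.
doc_vecs = {
    "first visit": [0.9, 0.1, 0.0],
    "x-rays": [0.7, 0.6, 0.1],
    "parking": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]

hits = nearest(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in hits])
```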


In [1]:
! pip install -Uq chromadb numpy datasets


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Example Dataset

This notebook follows the structure of a demonstration built on the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq).

Dataset description, from HuggingFace:

> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.

The code cells below do not load SciQ. Instead, they read a local FAQ text file of question-answer pairs and demonstrate how to retrieve the pairs most relevant to a given question.


In [None]:
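def load_data_from_file(file_path: str) -> dict:
    """Read a text file and return its contents under the "text" key."""
    # Assumes the file is UTF-8 encoded
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return {"text": text}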

file_path = "/Users/acrobat/Documents/GitHub/Build-Your-Own-RAG-System/data_poppy/faqv3.txt"
data = load_data_from_file(file_path)

print("File contents:\n", data['text'])

In [14]:
# Print the first 900 characters of the Poppy FAQ file
print("First 900 characters of file:\n", data['text'][:900])

First 900 characters of file:
 Q: When is a good time for a child's first dental visit and what should parents expect?
A: The American Academy of Pediatric Dentistry (AAPD) recommends that every child should have an initial oral evaluation by a pediatric dentist by Age 1 or within 6 months after their first tooth appears. At Poppy Kids Pediatric Dentistry, your child's first visit includes a facility tour, possible cleaning, dental radiographs, fluoride treatment, and an examination by Dr. Andrea. She'll discuss dental care, answer questions, and cover topics like brushing, flossing, and diet. We recommend follow-up visits every 3-6 months, based on individual dental needs. For children under three, we offer a complimentary first appointment with no obligation.

Q: What age should children start getting dental x-rays and why are they important?
A: Children typically start getting dental X-rays around the age of 6, whi


In [16]:
# Parse the raw text into a list of {"Q": ..., "A": ...} dictionaries
def parse_qa_text(text: str) -> list:
    # Split the text into lines
    lines = text.split("\n")
    
    # Initialize an empty list to store the questions and answers
    data = []
    
    # Initialize empty strings to store the current question and answer
    current_question = ""
    current_answer = ""
    
    # Iterate over the lines
    for line in lines:
        # If the line starts with "Q: ", it's a question
        if line.startswith("Q: "):
            # If there's a current question and answer, add them to the data
            if current_question and current_answer:
                data.append({"Q": current_question, "A": current_answer})
            
            # Start a new question, removing the "Q: " prefix
            current_question = line[3:]
            
            # Clear the current answer
            current_answer = ""
        # If the line starts with "A: ", it's an answer
        elif line.startswith("A: "):
            # Start a new answer, removing the "A: " prefix
            current_answer = line[3:]
        # If the line is not empty, it's a continuation of the current answer
        elif line:
            current_answer += " " + line
    
    # If there's a current question and answer at the end, add them to the data
    if current_question and current_answer:
        data.append({"Q": current_question, "A": current_answer})
    
    return data

# Use the function to parse the text data
data = parse_qa_text(data['text'])

# Print the first question-answer pair to check the result
print(data[0])

{'Q': "When is a good time for a child's first dental visit and what should parents expect?", 'A': "The American Academy of Pediatric Dentistry (AAPD) recommends that every child should have an initial oral evaluation by a pediatric dentist by Age 1 or within 6 months after their first tooth appears. At Poppy Kids Pediatric Dentistry, your child's first visit includes a facility tour, possible cleaning, dental radiographs, fluoride treatment, and an examination by Dr. Andrea. She'll discuss dental care, answer questions, and cover topics like brushing, flossing, and diet. We recommend follow-up visits every 3-6 months, based on individual dental needs. For children under three, we offer a complimentary first appointment with no obligation."}


In [17]:
# Print the first two question-answer pairs to check the result
print(data[0:2])

[{'Q': "When is a good time for a child's first dental visit and what should parents expect?", 'A': "The American Academy of Pediatric Dentistry (AAPD) recommends that every child should have an initial oral evaluation by a pediatric dentist by Age 1 or within 6 months after their first tooth appears. At Poppy Kids Pediatric Dentistry, your child's first visit includes a facility tour, possible cleaning, dental radiographs, fluoride treatment, and an examination by Dr. Andrea. She'll discuss dental care, answer questions, and cover topics like brushing, flossing, and diet. We recommend follow-up visits every 3-6 months, based on individual dental needs. For children under three, we offer a complimentary first appointment with no obligation."}, {'Q': 'What age should children start getting dental x-rays and why are they important?', 'A': "Children typically start getting dental X-rays around the age of 6, which coincides with the eruption of their permanent teeth. X-rays are important f

## Loading the data into Chroma

Chroma comes with a built-in embedding model, which makes it simple to load text.
We can load the parsed question-answer pairs into Chroma with just a few lines of code.


In [18]:
# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()

In [21]:
# Create a new Chroma collection to store the question-answer pairs. We don't need to specify an embedding function; the default will be used.
collection = client.create_collection("poppy_faq")

In [22]:
# Embed and store the question-answer pairs
collection.add(
    ids=[str(i) for i in range(len(data))],  # IDs are just strings
    documents=[f"{item['Q']} ||| {item['A']}" for item in data],
    metadatas=[{"type": "qa"} for _ in data],
)
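The three parallel lists passed to `collection.add` can also be built in one place, which makes it easier to check that they stay the same length. The sketch below is one possible way to do that; `build_records` is a hypothetical helper name, and the sample pairs are invented for illustration rather than taken from the FAQ file.

```python
def build_records(qa_pairs, separator=" ||| "):
    """Turn a list of {"Q": ..., "A": ...} dicts into keyword arguments for collection.add."""
    ids, documents, metadatas = [], [], []
    for index, pair in enumerate(qa_pairs):
        ids.append(str(index))  # IDs must be strings and unique
        documents.append(f"{pair['Q']}{separator}{pair['A']}")
        metadatas.append({"type": "qa"})
    return {"ids": ids, "documents": documents, "metadatas": metadatas}


# Invented sample pairs, used only to show the shape of the output.
sample_pairs = [
    {"Q": "Do you accept walk-ins?", "A": "Appointments are preferred."},
    {"Q": "Is parking available?", "A": "Yes, behind the building."},
]

records = build_records(sample_pairs)
print(records["ids"])
print(records["documents"][0])
```

With the notebook's own `data` list, the result could then be passed along as `collection.add(**build_records(data))`.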

## Querying the data

Once the data is loaded, we can use Chroma to find the question-answer pairs most relevant to a new question.
In this example, we retrieve the three most relevant results according to embedding similarity.

Chroma handles computing similarity and finding the most relevant results for you, so you can focus on building your application.


In [30]:
results = collection.query(
    query_texts=["why should I choose a pediatric dentist?"],
    n_results=3
)

for result in results['documents']:
    print(result)

["Until what age does your pediatric dental practice provide care for patients? ||| Our pediatric dental practice typically provides care for patients up to the age of 18. This includes the transition period from childhood into adolescence, ensuring continuous dental health supervision and treatment throughout their formative years. As young patients grow, their dental needs evolve, and our practice is equipped to address these changing needs, from preventive care and treatment of childhood cavities to guidance on orthodontics and wisdom teeth. Our goal is to maintain their oral health and instill good dental habits that will carry into adulthood. However, for specific cases or individual needs, the age range may vary, and we're always open to discussing continued care beyond this age on a case-by-case basis.", "Why should your child see a Board-Certified Pediatric Dentist? ||| Because they are held to the highest standards in pediatric dental care. These dentists have completed specia

In [29]:
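# Print the top-ranked question-answer pair for each query
for result in results['documents']:
    # result holds the matches for one query; result[0] is the closest one
    qa_pair = result[0].split("|||", 1)  # split on the first separator only
    question = qa_pair[0].strip()
    answer = qa_pair[1].strip()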
    print(f"Q: {question}")
    print(f"A: {answer}")
    print()

Q: Can I stay with my child during the treatment?
A: We aim for each patient to have the happiest dental experience possible. To achieve this, we believe that trust and support are the foundations of building a lasting relationship with you and your child. We, therefore, have an open-door policy and invite you to accompany your child during their appointment if you choose to join us. Each child is unique, and Dr. Andrea would like to take a team approach to dental care. As their parent, we understand that you know your child best, and our goal is to set them up for success.  If you feel your presence will make your child more comfortable during treatment, come on back!
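The loop above prints only the closest match for each query. To see every retrieved pair together with its rank, the nested result lists can be flattened into one row per document. This is a minimal sketch, assuming the result dictionary has a `documents` key and optionally a `distances` key, each holding one inner list per query. `flatten_results` is a hypothetical helper, and the sample dictionary is hand-written with invented values rather than real query output.

```python
def flatten_results(results, separator="|||"):
    """Flatten a Chroma-style query result into one row per retrieved document."""
    rows = []
    distances_per_query = results.get("distances")
    for query_index, documents in enumerate(results["documents"]):
        for rank, document in enumerate(documents, start=1):
            # partition splits on the first separator only and never raises
            question, _, answer = document.partition(separator)
            distance = None
            if distances_per_query is not None:
                distance = distances_per_query[query_index][rank - 1]
            rows.append({
                "query_index": query_index,
                "rank": rank,
                "question": question.strip(),
                "answer": answer.strip(),
                "distance": distance,
            })
    return rows


# Hand-written stand-in for a query result (invented text and distances).
sample_results = {
    "documents": [[
        "Do you accept walk-ins? ||| Appointments are preferred.",
        "Is parking available? ||| Yes, behind the building.",
    ]],
    "distances": [[0.41, 0.73]],
}

rows = flatten_results(sample_results)
for row in rows:
    print(f"{row['rank']}. Q: {row['question']}")
    print(f"   A: {row['answer']}")
```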



## What's next?

Check out the Chroma documentation to [get started](https://docs.trychroma.com/getting-started) with building your own applications.

The core embeddings-based retrieval functionality demonstrated here is at the heart of many powerful AI applications, such as using large language models with Chroma to [chat with your documents](https://github.com/chroma-core/chroma/tree/main/examples/chat_with_your_documents), and providing memory for agents like [BabyAGI](https://github.com/yoheinakajima/babyagi) and [Voyager](https://github.com/MineDojo/Voyager).

Chroma is already integrated with many popular AI application frameworks, including [LangChain](https://python.langchain.com/docs/integrations/vectorstores/chroma) and [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html).

Join our community to learn more and get help with your projects: [Discord](https://discord.gg/MMeYNTmh3x) | [Twitter](https://twitter.com/trychroma)

We are [hiring](https://trychroma.notion.site/careers-chroma-9d017c3007c7478ebd85bad854101497?pvs=4)!