# Chroma-Multimodal

- Author: [Gwangwon Jung](https://github.com/pupba)
- Design: []()
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/02-Chroma-Multimodal.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/02-Chroma-Multimodal.ipynb)

## Overview

This tutorial covers how to use `Chroma Vector Store` with `LangChain` .

`Chroma` is an `open-source AI application database` .

In this tutorial, after learning how to use `langchain-chroma` , we will implement examples of a simple **Text Search** engine and **Multimodal Search** engine using `Chroma` .

![search-example](./assets/02-chroma-with-langchain-flow-search-example.png)

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [What is Chroma?](#what-is-chroma)
- [LangChain Chroma Basic](#langchain-chroma-basic)
- [Text Search](#text-search)
- [Multimodal Search](#multimodal-search)


### References

- [Chroma Docs](https://docs.trychroma.com/docs/overview/introduction)
- [Langchain-Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma/)
- [List of VectorStore supported by Langchain](https://python.langchain.com/docs/integrations/vectorstores/)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [6]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain-core",
        "langchain-chroma",
        "langchain-huggingface",
        "langchain-experimental",
        "pillow",
        "open_clip_torch",
        "scikit-learn",
        "numpy",
        "requests",
        "python-dotenv",
        "datasets >= 3.2.0",  # Requirements >= 3.2.0
    ],
    verbose=False,
    upgrade=False,
)

In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Chroma With Langchain",  # Set this to match the title
        "HUGGINGFACEHUB_API_TOKEN": "",
    }
)

You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [None]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)
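As a quick sanity check after loading, something like the following can report which keys are still unset. This is a minimal sketch: the helper name `missing_env_keys` is hypothetical, and the key list simply mirrors the variables used above.

```python
import os


def missing_env_keys(required):
    """Return the names in `required` that are unset or empty in os.environ."""
    return [name for name in required if not os.environ.get(name)]


# Keys referenced earlier in this tutorial; adjust to what you actually use.
required_keys = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY", "HUGGINGFACEHUB_API_TOKEN"]
print("Missing keys:", missing_env_keys(required_keys))
```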

## What is Chroma?

![logo](./assets/02-chroma-with-langchain-chroma-logo.png)

`Chroma` is an `open-source vector database` designed for AI applications.

It specializes in storing high-dimensional vectors and performing fast similarity search, making it well suited to tasks like `semantic search` , `recommendation systems` and `multimodal search` .

With its **developer-friendly APIs** and seamless integration with frameworks like `LangChain` , `Chroma` is a powerful tool for building scalable, AI-driven solutions.
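To make "similarity search" concrete, here is a minimal brute-force sketch in plain Python. The three-dimensional vectors and document names are made up for illustration. Real embeddings have hundreds of dimensions and `Chroma` uses an index rather than a full scan, so this only illustrates the idea, not how `Chroma` is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (name, similarity) pairs most similar to `query`."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (hypothetical values, not produced by a real model).
toy_vectors = {
    "pancakes tweet": [0.9, 0.1, 0.0],
    "weather news": [0.1, 0.9, 0.1],
    "bank robbery news": [0.0, 0.3, 0.9],
}

# The query vector points mostly along the second axis, like "weather news".
for name, score in top_k([0.2, 0.8, 0.2], toy_vectors):
    print(f"{name}: {score:.3f}")
```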


## LangChain Chroma Basic

### Load Text Document Data (Temporary)

The following code demonstrates how to load text documents into a structured format using the `Document` class from `langchain-core` .

Each document contains `page_content` (the text) and `metadata` (additional information about the source).

Unique **IDs** are also generated for each document using `uuid4` ; these are the identifiers passed to the vector store when the documents are added.

In [5]:
from uuid import uuid4

from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
    id=2,
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
    id=3,
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
    id=4,
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
    id=5,
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
    id=6,
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
    id=7,
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
    id=8,
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
    id=9,
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
    id=10,
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]
uuids = [str(uuid4()) for _ in range(len(documents))]

### Create Vector Store with Embedding

First, load the **Embedding Model**. 

We use the `sentence-transformers/all-mpnet-base-v2` embedding model, which is loaded using the `HuggingFaceEmbeddings` class from `langchain-huggingface` integration.

This model maps sentences and short paragraphs to 768-dimensional dense vectors, which makes it a solid general-purpose choice for semantic search over text.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"

embeddings = HuggingFaceEmbeddings(model_name=model_name)

Create a `Chroma DB` instance using the `Chroma` class from `langchain-chroma` .

**Parameters**

- `collection_name:str` – Name of the collection to create.

- `embedding_function:Optional[Embeddings]` – Embedding class object. Used to embed texts.

- `persist_directory:Optional[str]` – Directory to persist the collection.

- `client_settings:Optional[chromadb.config.Settings]` – Chroma client settings

- `collection_metadata:Optional[Dict]` – Collection configurations.

- `client:Optional[chromadb.ClientAPI]` – Chroma client. Documentation: https://docs.trychroma.com/reference/python/client

- `relevance_score_fn:Optional[Callable[[float], float]]` – Function to calculate relevance score from distance. Used only in similarity_search_with_relevance_scores

- `create_collection_if_not_exists:Optional[bool]` – Whether to create collection if it doesn’t exist. Defaults to True.

**Returns**

- `Chroma:langchain_chroma.vectorstores.Chroma` - Chroma instance

In [None]:
# Create Vector DB
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings,
    persist_directory="./data/test_chroma_db",
    # Where to save data locally, remove if not necessary
)
print(vector_store)

### Manage Vector Store

Add `documents` into the vector store with `UUIDs` as identifiers.

In [None]:
# Add documents
vector_store.add_documents(documents=documents, ids=uuids)

In [None]:
# Verifying saved data
vector_store.get()

In addition to adding items this way, you can freely `update` or `delete` items in the vector store.

In [10]:
updated_document_1 = Document(
    page_content="I had fish&chip and fried eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

updated_document_2 = Document(
    page_content="The weather forecast for tomorrow is windy and cold, with a high of 82 degrees.",
    metadata={"source": "news"},
    id=2,
)

In [11]:
# Update example

vector_store.update_document(
    document_id=uuids[0], document=updated_document_1
)  # document update

vector_store.update_documents(
    ids=uuids[:2], documents=[updated_document_1, updated_document_2]
)  # documents update

In [None]:
vector_store.get()

In [13]:
# Delete example (ids expects a list of IDs)
vector_store.delete(ids=[uuids[-1]])

In [None]:
vector_store.get()

### Query Vector Store

There are two ways to `Query` the `Vector Store` .

- **Directly** : Query the vector store directly using methods like `similarity_search` or `similarity_search_with_score` .

- **Turning into retriever** : Convert the vector store into a `retriever` object, which can be used in `LangChain` pipelines or chains.

## Text Search

With the **direct** approach, you can run a similarity-based `Text` search with very little code.

The direct approach covers `similarity_search` and `similarity_search_with_score` .

### similarity_search()

`similarity_search()` runs a similarity search against the Chroma collection.

**Parameters**

- `query:str` - Query text to search for.

- `k: int = DEFAULT_K` - Number of results to return. Defaults to 4.    

- `filter: Dict[str, str] | None = None` - Filter by metadata. Defaults to None.

- `**kwargs:Any` - Additional keyword arguments to pass to Chroma collection query.


**Returns**
- `List[Document]` - List of documents most similar to the query text.



### similarity_search_with_score()

`similarity_search_with_score()` runs a similarity search against the Chroma collection and also returns the distance for each result.

**Parameters**

- `query:str` - Query text to search for.

- `k:int = DEFAULT_K` - Number of results to return. Defaults to 4.

- `filter: Dict[str, str] | None = None` - Filter by metadata. Defaults to None.

- `where_document: Dict[str, str] | None = None` - Dict used to filter by document content, e.g. `{"$contains": "hello"}` .

- `**kwargs:Any` : Additional keyword arguments to pass to Chroma collection query.


**Returns**
- `List[Tuple[Document, float]]` - List of the documents most similar to the query text, each paired with its distance as a float. A lower score means greater similarity.

In [None]:
# Directly - similarity_search

results = vector_store.similarity_search(
    query="LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": "tweet"},
)

for idx, res in enumerate(results):
    print(f"{idx}: {res.page_content} [{res.metadata}]")

In [None]:
# Directly - similarity_search_with_score

results = vector_store.similarity_search_with_score(
    query="Will it be cold tomorrow?",
    k=1,
    filter={"source": "news"},
)

for idx, (res, score) in enumerate(results):
    # The score is a distance (lower means more similar), not a percentage
    print(f"{idx}: [Distance: {score:.3f}] {res.page_content} [{res.metadata}]")
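Because the value returned by `similarity_search_with_score` is a distance, it should not be read as a percentage. The sketch below assumes the collection uses squared L2 distance (which, as far as I know, is Chroma's default) and that the embeddings have unit length. Under those assumptions, the distance `d` and cosine similarity `s` are related by `d = 2 - 2s` . The vectors are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_from_squared_l2(distance):
    """For unit vectors, squared L2 distance d and cosine s satisfy d = 2 - 2s."""
    return 1.0 - distance / 2.0


query = normalize([0.2, 0.8, 0.2])      # hypothetical query embedding
close_doc = normalize([0.1, 0.9, 0.1])  # points in nearly the same direction
far_doc = normalize([0.9, 0.1, 0.0])    # points in a different direction

for label, doc in [("close", close_doc), ("far", far_doc)]:
    d = squared_l2(query, doc)
    print(f"{label}: distance={d:.3f}, cosine={cosine_from_squared_l2(d):.3f}")
```

If the collection is configured with a different distance metric, the conversion differs, so treat this as an aid for intuition rather than a general formula.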

You can also search for text by turning the vector store into a `retriever` .

### as_retriever()

The `as_retriever()` method converts a `VectorStore` object into a `Retriever` object.

A `Retriever` is an interface used in `LangChain` to query a vector store and retrieve relevant documents.

**Parameters**

- `search_type:Optional[str]` - Defines the type of search that the Retriever should perform. Can be `similarity` (default), `mmr` , or `similarity_score_threshold`

- `search_kwargs:Optional[Dict]` - Keyword arguments to pass to the search function. 

    Can include things like:

    `k` : Number of documents to return (Default: 4)

    `score_threshold` : Minimum relevance threshold for `similarity_score_threshold`

    `fetch_k` : Number of documents to pass to the `MMR` algorithm (Default: 20)
        
    `lambda_mult` : Diversity of results returned by MMR; 1 for minimum diversity and 0 for maximum. (Default: 0.5)

    `filter` : Filter by document metadata


**Returns**

- `VectorStoreRetriever` - Retriever class for VectorStore.
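To show what `fetch_k` and `lambda_mult` control, here is a minimal pure-Python sketch of maximal marginal relevance (MMR). It illustrates the general technique under toy assumptions (hypothetical 2-D vectors, cosine similarity) and is not LangChain's implementation. The `candidates` list plays the role of the `fetch_k` documents fetched before re-ranking.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy.

    score = lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, picked)
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
query = [1.0, 0.0]
candidates = [[1.0, 0.05], [1.0, 0.06], [0.6, 0.8]]

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # favors diversity
```

With `lambda_mult=1.0` the two near-duplicates are returned. Lowering it makes the second pick the dissimilar candidate instead.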


### invoke()

Invoke the retriever to get relevant documents.

Main entry point for synchronous retriever invocations.

**Parameters**

- `input:str` - The query string.
- `config:RunnableConfig | None = None` - Configuration for the retriever. Defaults to None.
- `**kwargs:Any` - Additional arguments to pass to the retriever.


**Returns**

- `List[Document]` : List of relevant documents.

In [None]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 2},
)

retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})

## Multimodal Search

`Chroma` supports `Multimodal Collections` , which means it can handle and store embeddings from different types of data, such as `text` , `images` , `audio` , or even `video` .

We can search for `images` using `Chroma` .

### Setting `image` and `image_info` data

The dataset used here was generated with `SDXL` .

**Dataset: Animal-180**

- [animal-180](https://huggingface.co/datasets/Pupba/animal-180)

This dataset, named `animal-180` , is a collection of 180 realistic animal images generated using `Stable-Diffusion XL(SDXL)` .

It includes images of `lions` , `rabbits` , `cats` , `dogs` , `elephants` and `tigers` , with 30 images per animal category.

All images are free to use for any purpose, as they are synthetically generated and not subject to copyright restrictions.

In [18]:
import tempfile
from PIL import Image


def save_temp_gen_url(image: Image.Image) -> str:
    """Save a PIL image to a temporary PNG file and return the file path."""
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".png")
    image.save(temp_file, format="PNG")
    temp_file.close()
    return temp_file.name

In [19]:
from datasets import load_dataset

dataset = load_dataset("Pupba/animal-180", split="train")

# Use only the first 50 samples
images = dataset[:50]["png"]
image_paths = [save_temp_gen_url(img) for img in images]
metas = dataset[:50]["json"]
prompts = [data["prompt"] for data in metas]
categories = [data["category"] for data in metas]

In [None]:
print("Image Path:", image_paths[0])
print("Prompt:", prompts[0])
print("Category:", categories[0])
images[0]

Load `OpenCLIP` for `Multimodal Embedding` .

- [OpenCLIP](https://github.com/mlfoundations/open_clip/tree/main)

In [21]:
from langchain_experimental.open_clip import OpenCLIPEmbeddings

MODEL = "ViT-H-14-quickgelu"
CHECKPOINT = "dfn5b"

image_embedding = OpenCLIPEmbeddings(model_name=MODEL, checkpoint=CHECKPOINT)

### Create a Multimodal Vector Store

Create a `Multimodal Vector Store` and add the `Image uri` and `Metadata(file_path, category, prompt)` .

In [None]:
image_vector_db = Chroma(
    collection_name="multimodal",
    embedding_function=image_embedding,
)

image_vector_db.add_images(
    uris=image_paths,
    metadatas=[
        {"file_path": file_path, "category": category, "prompt": prompt}
        for file_path, category, prompt in zip(image_paths, categories, prompts)
    ],
)

In [None]:
image_vector_db.get()

Define an `ImageSearcher` helper class that wraps text-query and image-query search.

In [24]:
from typing import Optional, Dict


class ImageSearcher:
    def __init__(self, image_vector_store: Chroma):
        self.__vector_store = image_vector_store

    def searching_text_query(self, query: str) -> Optional[Dict]:
        docs = self.__vector_store.as_retriever().invoke(query)

        if docs and isinstance(docs[0], Document):
            metadata = docs[0].metadata
            return {
                "category": metadata["category"],
                "image": Image.open(metadata["file_path"]),
                "prompt": metadata["prompt"],
            }
        else:
            return None

    def searching_image_query(self, image_uri: str) -> Optional[Dict]:
        docs = self.__vector_store.similarity_search_by_image(uri=image_uri, k=1)
        if docs and isinstance(docs[0], Document):
            metadata = docs[0].metadata
            return {
                "category": metadata["category"],
                "image": Image.open(metadata["file_path"]),
                "prompt": metadata["prompt"],
            }
        else:
            return None


image_search = ImageSearcher(image_vector_store=image_vector_db)

Text Query Search

In [None]:
result = image_search.searching_text_query(query="a elephant run")

print(f"Category: {result['category']}")
print(f"Prompt: {result['prompt']}")
result["image"]

Image Query Search

In [26]:
# Helpers for fetching the query image from a URL
import requests
from io import BytesIO


def load_image_from_url(url: str, resolution: int = 512) -> Image.Image:
    """
    Load an image from a URL and return it as a resized PIL Image object.

    Args:
        url (str): The URL of the image.
        resolution (int): Width and height, in pixels, to resize the image to.

    Returns:
        Image.Image: The loaded PIL Image object, resized to a square.
    """
    response = requests.get(url)
    response.raise_for_status()  # Raise an error for failed requests
    image = Image.open(BytesIO(response.content))
    image = image.resize((resolution, resolution), resample=Image.Resampling.LANCZOS)
    return image


def save_image_to_tempfile(url: str) -> str:
    """
    Download an image from a URL and save it to a temporary file.

    Args:
        url (str): The URL of the image.

    Returns:
        str: The file path to the saved image.
    """
    response = requests.get(url)

    # Raise an error for failed requests
    response.raise_for_status()

    # Create a temporary file
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".jpg")
    temp_file.write(response.content)

    # Close the file to allow other processes to access it
    temp_file.close()
    return temp_file.name

In [None]:
# rabbit image
img_url = "https://i.pinimg.com/736x/b2/e9/f4/b2e9f449c1c5f8a29e31cafb8671c8b2.jpg"

image_query = load_image_from_url(img_url)
image_query_url = save_image_to_tempfile(img_url)

image_query

In [None]:
result = image_search.searching_image_query(image_uri=image_query_url)

print(f"Category: {result['category']}")
print(f"Prompt: {result['prompt']}")
result["image"]

Clean up the `Hugging Face` dataset cache files

In [None]:
dataset.cleanup_cache_files()

Disconnect from the `Chroma` DB and remove the local DB files

In [30]:
del vector_store
del image_vector_db

In [32]:
import os
import shutil

shutil.rmtree(os.path.join(os.getcwd(), "data", "test_chroma_db"))
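The temporary image files created earlier with `delete=False` are not removed automatically. Below is a minimal sketch for cleaning them up, assuming their paths are still available in lists such as `image_paths` . The helper name `remove_files` is hypothetical, and the demo uses a throwaway file so that it runs on its own.

```python
import os
import tempfile


def remove_files(paths):
    """Delete each existing file in `paths`; return how many were removed."""
    removed = 0
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed


# Stand-alone demo with a throwaway temp file (stands in for `image_paths`).
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".png")
tmp.close()
print(remove_files([tmp.name, tmp.name]))  # the second entry is already gone
```

In this notebook, the equivalent call would be `remove_files(image_paths + [image_query_url])` .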