## References

https://docs.trychroma.com/guides/multimodal

https://docs.trychroma.com/guides#using-collections

https://cookbook.chromadb.dev/embeddings/gpu-support/#openclipm

## Prepare Image Data

### Download Image Data

In [5]:
# %pip install gdown

In [1]:
import gdown

# !gdown 1msLVo0g0LFmL9-qZ73vq9YEVZwbzOePF
url = "https://drive.google.com/uc?id=1msLVo0g0LFmL9-qZ73vq9YEVZwbzOePF"
output = "image_data.zip"

gdown.download(url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1msLVo0g0LFmL9-qZ73vq9YEVZwbzOePF
From (redirected): https://drive.google.com/uc?id=1msLVo0g0LFmL9-qZ73vq9YEVZwbzOePF&confirm=t&uuid=d32d13dd-40b2-43ef-adc1-9611e72155ba
To: /home/aivn48/WorkSpace/Khoa2/VectorDatabase/VetorDatabase/ChromaDB/Collection_Examples/image_data.zip
100%|██████████| 76.1M/76.1M [00:12<00:00, 5.94MB/s]


'image_data.zip'

In [2]:
!unzip -q image_data.zip

In [3]:
rm image_data.zip

### Load Image Uris

https://github.com/chroma-core/chroma/blob/main/chromadb/utils/data_loaders.py

```python
class ImageLoader(DataLoader[List[Optional[Image]]]):
    def __init__(self, max_workers: int = multiprocessing.cpu_count()) -> None:
        try:
            self._PILImage = importlib.import_module("PIL.Image")
            self._max_workers = max_workers
        except ImportError:
            raise ValueError(
                "The PIL python package is not installed. Please install it with `pip install pillow`"
            )

    def _load_image(self, uri: Optional[URI]) -> Optional[Image]:
        return np.array(self._PILImage.open(uri)) if uri is not None else None

    def __call__(self, uris: Sequence[Optional[URI]]) -> List[Optional[Image]]:
        with ThreadPoolExecutor(max_workers=self._max_workers) as executor:
            return list(executor.map(self._load_image, uris))
```

In [8]:
# %pip install pillow

In [1]:
from chromadb.utils.data_loaders import ImageLoader
image_loader = ImageLoader()

In [2]:
import os 
root = 'data/train'

def get_image_uris(root):
    image_uris = []
    for class_name in os.listdir(root):
        class_path = os.path.join(root, class_name)
        images_name = os.listdir(class_path)
        image_uris += [ os.path.join(class_path, fn) for fn in images_name ]
    return image_uris


In [3]:

image_uris = sorted(get_image_uris(root))
image_uris[0:5]

['data/train/African_crocodile/n01697457_10393.JPEG',
 'data/train/African_crocodile/n01697457_104.JPEG',
 'data/train/African_crocodile/n01697457_1331.JPEG',
 'data/train/African_crocodile/n01697457_14906.JPEG',
 'data/train/African_crocodile/n01697457_18587.JPEG']

In [4]:
image_ids = [f"img_{idx_}" for idx_ in range(len(image_uris))]
image_ids[0:5]

['img_0', 'img_1', 'img_2', 'img_3', 'img_4']

In [5]:
image_metadata = [{'ver': idx_%10} for idx_ in range(len(image_uris))]
image_metadata[0:5]

[{'ver': 0}, {'ver': 1}, {'ver': 2}, {'ver': 3}, {'ver': 4}]

## In-memory Vector Database

### Create Client 

In [26]:
import chromadb
client = chromadb.Client()

### Create Collection

Embedding functions can be linked to a collection and used whenever you call `add`, `update`, `upsert` or `query`. You can also use them directly which can be handy for debugging.

By default, Chroma uses the Sentence Transformers `all-MiniLM-L6-v2 model` to create embeddings


https://github.com/chroma-core/chroma/blob/main/chromadb/api/client.py#L106

```python
def create_collection(
        self,
        name: str,
        configuration: Optional[CollectionConfiguration] = None,
        metadata: Optional[CollectionMetadata] = None,
        embedding_function: Optional[
            EmbeddingFunction[Embeddable]
        ] = ef.DefaultEmbeddingFunction(),  # type: ignore
        data_loader: Optional[DataLoader[Loadable]] = None,
        get_or_create: bool = False,
    ) -> Collection:
        model = self._server.create_collection(
            name=name,
            metadata=metadata,
            tenant=self.tenant,
            database=self.database,
            get_or_create=get_or_create,
            configuration=configuration,
        )
        return Collection(
            client=self._server,
            model=model,
            embedding_function=embedding_function,
            data_loader=data_loader,
        )
```

Create a new collection with the given name and metadata.

**Arguments:**

`name` - The name of the collection to create.

`metadata` - Optional metadata to associate with the collection.

`embedding_function` - Optional function to use to embed documents. Uses the default embedding function if not provided.

`get_or_create` - If True, return the existing collection if it exists.

**Returns:**

`Collection` - The newly created collection.

**Raises:**

`ValueError` - If the collection already exists and get_or_create is False.

`ValueError` - If the collection name is invalid.

In [29]:
image_collection = client.create_collection(name='image_collection', 
                                           metadata={"hnsw:space": "cosine"},
                                           data_loader=image_loader) # l2 is the default

### Add image uri to image collection 

In [30]:
image_collection.add(
    ids=image_ids,
    images=image_uris,
    metadatas=image_metadata
)

In [31]:
image_collection._embedding_function

<chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2.ONNXMiniLM_L6_V2 at 0x7f0dc709bf50>

In [None]:
image_collection.get(ids=image_ids[0:5], include=['embeddings'])

In [33]:
test_image_uris = sorted(get_image_uris('data/test'))
print(test_image_uris[0])
image_collection.query(query_uris=test_image_uris[0], n_results=5)

data/test/African_crocodile/n01697457_18534.JPEG


TypeError: TextInputSequence must be str

Create a new collection with the given name and metadata.

**Arguments:**

`name` - The name of the collection to create.

`metadata` - Optional metadata to associate with the collection.

`embedding_function` - Optional function to use to embed documents. Uses the default embedding function if not provided.

`get_or_create` - If True, return the existing collection if it exists.

Returns:

Collection - The newly created collection.
Raises:

ValueError - If the collection already exists and get_or_create is False.
ValueError - If the collection name is invalid.

### Collection with CLIP embedding model - CPU

In [15]:
import chromadb
clip_cpu_client = chromadb.Client()

#### Get CLIP embedding model from chromadb

In [16]:
# %pip install open-clip-torch

In [18]:
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
embedding_function_cpu = OpenCLIPEmbeddingFunction()

  checkpoint = torch.load(checkpoint_path, map_location=map_location)


In [41]:
clip_cpu_collection = clip_cpu_client.create_collection(name='clip_cpu_collection',
                                                        embedding_function=embedding_function_cpu,
                                                        data_loader=image_loader,
                                                        )

In [42]:

clip_cpu_collection.add(ids=image_ids,
                        images=image_uris,
                        metadatas=image_metadata)

In [43]:
test_image_uris = sorted(get_image_uris('data/test'))
test_image_uris[0]

'data/test/African_crocodile/n01697457_18534.JPEG'

In [46]:
clip_cpu_collection.query(query_uris=[test_image_uris[0]], n_results=5)


{'ids': [['img_6', 'img_7', 'img_8', 'img_0', 'img_5']],
 'distances': [[1.3242056369781494,
   1.3248202800750732,
   1.3287749290466309,
   1.3336899280548096,
   1.3342429399490356]],
 'metadatas': [[{'ver': 6}, {'ver': 7}, {'ver': 8}, {'ver': 0}, {'ver': 5}]],
 'embeddings': None,
 'documents': [[None, None, None, None, None]],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}

In [47]:
print(image_uris[6])

data/train/African_crocodile/n01697457_5586.JPEG


### Collection with CLIP embedding model - GPU - ERROR

In [48]:
import chromadb
clip_gpu_client = chromadb.Client()

#### Get CLIP embedding model from chromadb

In [None]:
# %pip install open-clip-torch

In [50]:
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
embedding_function_gpu = OpenCLIPEmbeddingFunction(device="cuda")

In [51]:
clip_gpu_collection = clip_gpu_client.create_collection(name='clip_gpu_collection',
                                                        embedding_function=embedding_function_gpu,
                                                        data_loader=image_loader,
                                                        )

In [52]:

clip_gpu_collection.add(ids=image_ids,
                        images=image_uris,
                        metadatas=image_metadata)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

### Collection with CLIP embedding model - GPU - Correction with torch

In [6]:
import chromadb
clip_gpu_torch_client = chromadb.Client()

#### Get CLIP embedding model from chromadb

In [7]:
# %pip install open-clip-torch

In [8]:
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
embedding_function_gpu_torch = OpenCLIPEmbeddingFunction(device="cuda")

  from .autonotebook import tqdm as notebook_tqdm
  checkpoint = torch.load(checkpoint_path, map_location=map_location)


In [25]:
import torch
# torch.device("cuda")
# torch.cuda.set_device(torch.cuda.current_device())
# torch.set_default_tensor_type(torch.cuda.FloatTensor)
# torch.set_default_tensor_type(torch.FloatTensor)
torch.set_default_device('cuda')


In [26]:
print(torch.ones((22,22), dtype=torch.float32).device)
print(torch.ones((22,22), dtype=torch.int32).device)
print(torch.ones((22,22)).device)


cuda:0
cuda:0
cuda:0


In [27]:
# clip_gpu_torch_client.delete_collection(name='clip_gpu_torch_collection')

In [28]:

clip_gpu_torch_collection = clip_gpu_torch_client.create_collection(name='clip_gpu_torch_collection',
                                                        embedding_function=embedding_function_gpu_torch,
                                                        data_loader=image_loader,
                                                        )

In [29]:

clip_gpu_torch_collection.add(ids=image_ids,
                        images=image_uris,
                        metadatas=image_metadata)

In [30]:
test_image_uris = sorted(get_image_uris('data/test'))
test_image_uris[-1]

'data/test/yawl/n04612504_4963.JPEG'

In [31]:

clip_gpu_torch_collection.query(query_uris=[test_image_uris[-1]], n_results=5)


RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

In [None]:
print(image_uris[6])

data/train/African_crocodile/n01697457_5586.JPEG


In [None]:

# clip_gpu_torch_collection.query(query_embeddings=[test_image_uris[-1]], n_results=5)
