An automated process for embedding only new documents, skipping files that are already in an existing embedding store

In [None]:
# Load the existing ChromaDB collection:

import chromadb
from chromadb.utils import embedding_functions

# Initialize the ChromaDB client
client = chromadb.PersistentClient(path="./chroma_db")

# Get the existing collection
collection = client.get_collection(
    name="my_collection",
    embedding_function=embedding_functions.DefaultEmbeddingFunction()
)

In [None]:
# Create a function to process and add new documents:

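# Recent LangChain releases ship these classes in separate packages; older
# releases exposed them as langchain.text_splitter and langchain.document_loaders.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader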
import hashlib

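def add_new_documents(file_path):
    """Split a text file into chunks and add only the chunks not yet in `collection`."""
    # Load the new document (TextLoader reads plain-text files)
    loader = TextLoader(file_path)
    documents = loader.load()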
    
    # Split the document into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    splits = text_splitter.split_documents(documents)
    
    # Process and add each chunk
    for split in splits:
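        # Derive a deterministic ID from the chunk text; identical text always hashes to the same ID
        doc_id = hashlib.md5(split.page_content.encode("utf-8")).hexdigest()

        # get() returns only the IDs it finds, so an empty list means the chunk is new
        if not collection.get(ids=[doc_id])['ids']: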
            # Add the new document chunk
            collection.add(
                documents=[split.page_content],
                metadatas=[{"source": file_path}],
                ids=[doc_id]
            )
            print(f"Added new document chunk: {doc_id}")

In [None]:
# Use the function to add new documents:

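# Add a new plain-text document. TextLoader cannot parse PDFs; a PDF needs a
# PDF-aware loader (for example PyPDFLoader) in place of TextLoader.
add_new_documents("path/to/new/document.txt")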

This approach allows you to:

- Connect to the existing ChromaDB collection.
- Load and process new documents.
- Split new documents into chunks for better retrieval.
- Generate unique IDs for each chunk to avoid duplicates.
- Add only new document chunks that don't already exist in the collection.

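By using this method, you can incrementally add new documents to your existing ChromaDB instance without recomputing embeddings for chunks that are already stored. The file itself is still loaded and split on every run; only chunks whose content hash is missing from the collection are embedded and added.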
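
The per-chunk lookup in `add_new_documents` can also be batched. The following is a minimal sketch, not the notebook's original code: `chunk_id` and `add_missing_chunks` are hypothetical helper names, and `collection` is assumed to be any object with Chroma-style `get(ids=...)` and `add(documents=..., metadatas=..., ids=...)` methods. It hashes every chunk, asks the collection once which IDs it already holds, and adds the rest in a single call.

```python
import hashlib


def chunk_id(text):
    """Return a deterministic ID for a chunk (MD5 hex digest of its text)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def add_missing_chunks(collection, texts, source):
    """Add only the chunks whose content-hash ID is not yet in `collection`.

    Assumes `collection.get(ids=...)` returns a dict whose "ids" entry lists
    only the IDs that were found. Returns the IDs that were added.
    """
    # Deduplicate inside the batch first; dicts keep insertion order.
    by_id = {chunk_id(t): t for t in texts}
    if not by_id:
        return []

    # One lookup for the whole batch instead of one lookup per chunk.
    existing = set(collection.get(ids=list(by_id))["ids"])
    new_ids = [i for i in by_id if i not in existing]

    if new_ids:
        collection.add(
            documents=[by_id[i] for i in new_ids],
            metadatas=[{"source": source} for _ in new_ids],
            ids=new_ids,
        )
    return new_ids
```

Inside `add_new_documents`, the loop could then be replaced by something like `add_missing_chunks(collection, [s.page_content for s in splits], file_path)`. Because the ID depends only on the text, a chunk that appears in two files is stored once, with the `source` of the file that was processed first.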

Remember to adjust the chunk size and overlap in the RecursiveCharacterTextSplitter according to your specific needs.
Also, make sure to use the same embedding function that was used for the initial documents to maintain consistency in your vector space.
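
To see what the chunk size and overlap settings do, here is a simplified sketch. It is not `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and sentences; `split_fixed` is a hypothetical fixed-width splitter used only to illustrate the two parameters.

```python
def split_fixed(text, chunk_size=1000, chunk_overlap=200):
    """Split `text` into windows of `chunk_size` characters, each sharing
    `chunk_overlap` characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input: 2600 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2600))
chunks = split_fixed(sample, chunk_size=1000, chunk_overlap=200)
print(len(chunks), [len(c) for c in chunks])
```

One consequence for the deduplication above: the IDs are hashes of the chunk text, so changing the chunk size or overlap after the first ingest shifts the chunk boundaries and produces different IDs. Re-running with new settings would then add the same file again as a fresh set of chunks.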

This approach allows you to dynamically expand your knowledge base while preserving the existing embeddings and documents in your ChromaDB instance.
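
To expand the knowledge base from a whole folder rather than one file at a time, the single-file function can be wrapped in a loop. This is a sketch under assumptions: `add_documents_from_dir` is a hypothetical helper, the `*.txt` pattern is only a default, and `add_fn` stands for any callable that takes a file path, such as `add_new_documents`.

```python
from pathlib import Path


def add_documents_from_dir(directory, add_fn, pattern="*.txt"):
    """Call `add_fn(path)` for every file in `directory` matching `pattern`.

    Returns the processed paths in sorted order. Re-running is safe as long
    as `add_fn` skips chunks that are already stored.
    """
    processed = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            add_fn(str(path))
            processed.append(str(path))
    return processed
```

A call such as `add_documents_from_dir("path/to/new", add_new_documents)` would then pick up every matching file; files ingested on an earlier run contribute no new chunks because their IDs already exist.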

### References
- https://github.com/run-llama/llama_index/issues/15082
- https://how.wtf/how-to-use-chroma-db-step-by-step-guide.html