# Chroma Vector Database
- Chroma is an AI-native, **open-source vector database** designed for building large language model (LLM) applications.
- It provides the core features this work needs: embedding storage, querying, and retrieval.
- https://www.trychroma.com/
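## Key Features of Chroma

- **Open-source license**
  - Released under the Apache 2.0 license, so anyone can use and modify it freely.
- **Support for multiple development environments**
  - Provides Python and JavaScript/TypeScript SDKs and integrates with frameworks such as LangChain.
- **Flexible storage options**
  - Data can be kept in memory, persisted to disk, or served by a separate Chroma server over HTTP, so the same API covers both quick experiments and shared deployments.
- **Ease of use**
  - Installation and usage are simple, which makes it quick to build and validate prototypes.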

## Installation
- `conda install conda-forge::chromadb`
- `pip install langchain-chroma`

# Working with the Chroma API
- https://docs.trychroma.com/

In [1]:
import chromadb
from pprint import pprint

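#### Connect to Chroma
client = chromadb.Client() # In-memory mode: the store lives in memory and is lost when the process exits.
# client = chromadb.PersistentClient(path="chroma_db") # Persist to disk; path is the storage directory.

# Client/server mode: first start the server from a terminal (command prompt)
## chroma run --path <db_directory_path>
# client = chromadb.HttpClient(host="<ip_address>", port=8000) # Connect to the running server

#### Collection: a named group of documents and their embeddings (comparable to a table in a database)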
COLLECTION_NAME = "test_db"
collection = client.create_collection(
    name=COLLECTION_NAME,
    get_or_create=True   # Connect if the collection exists; otherwise create it, then connect.
)

In [23]:
from uuid import uuid4

document_list = [
    "This is a document about pineapple.",
    "This is a document about orange.",
    "This is a document about sports.",
    "This is a document about langchain.",
    "This is a document about llm."
]
id_list = [str(uuid4()) for _ in range(len(document_list))]

### Add data. No embeddings are passed, so Chroma computes them with its default embedding model.
collection.add(documents=document_list, ids=id_list)

C:\Users\Playdata\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:08<00:00, 9.99MiB/s]  


In [30]:
### Similarity search
result = collection.query(
    query_texts=["This is a query document about deeplearning."],
    n_results=3, # top-k: number of results to return
)
pprint(result)

{'data': None,
 'distances': [[1.2870078086853027, 1.3082304000854492, 1.4037604331970215]],
 'documents': [['This is a document about langchain.',
                'This is a document about llm.',
                'This is a document about sports.']],
 'embeddings': None,
 'ids': [['26aa24af-1c18-477d-ad32-b42de312348c',
          '260db838-3ddb-4c9f-9909-d5e305a21899',
          '1a44063f-e6c3-406c-a1da-730d5d171bb7']],
 'included': [<IncludeEnum.distances: 'distances'>,
              <IncludeEnum.documents: 'documents'>,
              <IncludeEnum.metadatas: 'metadatas'>],
 'metadatas': [[None, None, None]],
 'uris': None}
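
The `distances` values in the output above are distances, not similarities, so the smallest value is the best match. The sketch below shows how such a ranking can be produced, assuming the collection uses Chroma's default `l2` space (squared Euclidean distance). The two-dimensional vectors, the `squared_l2` and `top_k` helpers, and the item names are hypothetical and only illustrate the idea; real embeddings come from the embedding model and have hundreds of dimensions.

```python
from typing import Dict, List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance: sum of squared per-dimension differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: List[float], items: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k item ids closest to the query, nearest first."""
    scored = [(item_id, squared_l2(query, vec)) for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


# Hypothetical toy "embeddings" (assumption: not produced by any real model).
toy_vectors = {
    "langchain": [0.9, 0.1],
    "llm": [0.8, 0.3],
    "sports": [0.1, 0.9],
    "orange": [-0.5, 0.5],
}
toy_query = [1.0, 0.0]

nearest = top_k(toy_query, toy_vectors, k=3)
for item_id, distance in nearest:
    print(f"{item_id}: {distance:.2f}")
```

A real index avoids comparing the query against every stored vector, but the ordering rule is the same: results are sorted by ascending distance.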


# Integrating Chroma with LangChain

## Preparing the Data

In [31]:
from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    id=1,
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
    id=2,
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
    id=3,
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
    id=4,
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
    id=5,
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
    id=6,
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
    id=7,
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
    id=8,
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
    id=9,
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
    id=10,
)

document_list = [document_1, document_2, document_3, document_4, document_5, document_6, document_7, document_8, document_9, document_10]
ids = [str(uuid4()) for _ in range(len(document_list))]

## Creating and Connecting to a Vector Store
- `Chroma.from_documents()`
  - Initializes (creates) the vector store and adds the documents to it.

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

COLLECTION_NAME = "example"
PERSIST_DIRECTORY = "vector_store/chroma/example_db" # Directory where the data is persisted

# Create the store and save the documents in one step
vector_store = Chroma.from_documents(
    documents=document_list,
    embedding=embedding_model,
    ids=ids,
    collection_name=COLLECTION_NAME,
    persist_directory=PERSIST_DIRECTORY
)

In [34]:
# Connect to the DB created above. If the collection does not exist yet, it is created empty; no documents are added here.
vector_store2 = Chroma(
    embedding_function=embedding_model,
    collection_name=COLLECTION_NAME,
    persist_directory=PERSIST_DIRECTORY
)


## Inspecting the Vector Store

In [36]:
vector_store._collection

Collection(name=example)

In [None]:
vector_store._collection.count()
# Number of stored items

10

In [38]:
vector_store.get()

{'ids': ['fab140dd-39fb-429e-9eaf-82b00ac8714b',
  '3ac94ed9-95b0-428d-8eab-bcdd80e111aa',
  'ea648b0f-5f58-496e-8e3d-863cf7a14dfb',
  '37070695-cd38-4b4c-8dcb-50f4e8413198',
  'ef9a2bfc-14af-4f0b-a26c-d79e226a628f',
  '240d8064-88a9-42f8-8184-e1c432456bcf',
  '12c0875f-6c7e-4477-9154-c0089789626d',
  '78f4f689-c5f8-4680-a11d-898876d4faf4',
  '2e9cc2f3-9598-416b-8313-fb763e9e6519',
  'bfa52d1a-0c40-4fbe-88ff-9a1eb35ef164'],
 'embeddings': None,
 'documents': ['I had chocolate chip pancakes and scrambled eggs for breakfast this morning.',
  'The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.',
  'Building an exciting new project with LangChain - come check it out!',
  'Robbers broke into the city bank and stole $1 million in cash.',
  "Wow! That was an amazing movie. I can't wait to see it again.",
  'Is the new iPhone worth the price? Read this review to find out.',
  'The top 10 soccer players in the world right now.',
  'LangGraph is the best framewo

## Add

In [39]:
new_document = Document(
    page_content="LLM은 대형 언어 모델입니다.",
    metadata={'source':"tweet"},
    id=2000
)
vector_store.add_documents([new_document], ids=[str(uuid4())])

['2ec4a20c-2201-4572-bb4a-fe3ea8a9cd06']

In [40]:
vector_store._collection.count()

11

## Update

In [41]:
up_document = Document(
    page_content="LLM은 대형 언어 모델입니다. Langchain은 LLM 연동 Framework입니다.",
    metadata={'source':"tweet"},
    id=2000
)
# Update a single document
vector_store.update_document(
    document_id="2ec4a20c-2201-4572-bb4a-fe3ea8a9cd06",
    document=up_document
)
# Update several documents at once
# vector_store.update_documents(
#     ids=[<document ids>],
#     documents=[<updated Documents>]
# )

In [42]:
vector_store.get()

{'ids': ['fab140dd-39fb-429e-9eaf-82b00ac8714b',
  '3ac94ed9-95b0-428d-8eab-bcdd80e111aa',
  'ea648b0f-5f58-496e-8e3d-863cf7a14dfb',
  '37070695-cd38-4b4c-8dcb-50f4e8413198',
  'ef9a2bfc-14af-4f0b-a26c-d79e226a628f',
  '240d8064-88a9-42f8-8184-e1c432456bcf',
  '12c0875f-6c7e-4477-9154-c0089789626d',
  '78f4f689-c5f8-4680-a11d-898876d4faf4',
  '2e9cc2f3-9598-416b-8313-fb763e9e6519',
  'bfa52d1a-0c40-4fbe-88ff-9a1eb35ef164',
  '2ec4a20c-2201-4572-bb4a-fe3ea8a9cd06'],
 'embeddings': None,
 'documents': ['I had chocolate chip pancakes and scrambled eggs for breakfast this morning.',
  'The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.',
  'Building an exciting new project with LangChain - come check it out!',
  'Robbers broke into the city bank and stole $1 million in cash.',
  "Wow! That was an amazing movie. I can't wait to see it again.",
  'Is the new iPhone worth the price? Read this review to find out.',
  'The top 10 soccer players in the world rig

## Delete

In [43]:
vector_store.delete(ids=["2ec4a20c-2201-4572-bb4a-fe3ea8a9cd06"])

In [44]:
vector_store.get()

{'ids': ['fab140dd-39fb-429e-9eaf-82b00ac8714b',
  '3ac94ed9-95b0-428d-8eab-bcdd80e111aa',
  'ea648b0f-5f58-496e-8e3d-863cf7a14dfb',
  '37070695-cd38-4b4c-8dcb-50f4e8413198',
  'ef9a2bfc-14af-4f0b-a26c-d79e226a628f',
  '240d8064-88a9-42f8-8184-e1c432456bcf',
  '12c0875f-6c7e-4477-9154-c0089789626d',
  '78f4f689-c5f8-4680-a11d-898876d4faf4',
  '2e9cc2f3-9598-416b-8313-fb763e9e6519',
  'bfa52d1a-0c40-4fbe-88ff-9a1eb35ef164'],
 'embeddings': None,
 'documents': ['I had chocolate chip pancakes and scrambled eggs for breakfast this morning.',
  'The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.',
  'Building an exciting new project with LangChain - come check it out!',
  'Robbers broke into the city bank and stole $1 million in cash.',
  "Wow! That was an amazing movie. I can't wait to see it again.",
  'Is the new iPhone worth the price? Read this review to find out.',
  'The top 10 soccer players in the world right now.',
  'LangGraph is the best framewo
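
The add, update, and delete steps above all address items by their id, which is why the count goes from 10 to 11 after the add, stays at 11 after the update, and returns to 10 after the delete. A minimal sketch of that bookkeeping follows. `ToyStore` is a hypothetical in-memory stand-in written for this illustration; it is not Chroma's or LangChain's implementation and it skips embeddings entirely.

```python
from typing import Dict, List
from uuid import uuid4


class ToyStore:
    """Hypothetical id-keyed document store (illustration only)."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, texts: List[str]) -> List[str]:
        """Store each text under a fresh UUID and return the new ids."""
        ids = [str(uuid4()) for _ in texts]
        self._docs.update(zip(ids, texts))
        return ids

    def update(self, doc_id: str, text: str) -> None:
        """Replace the text stored under an existing id."""
        if doc_id not in self._docs:
            # A choice made for this sketch, not a claim about Chroma's behavior.
            raise KeyError(doc_id)
        self._docs[doc_id] = text

    def delete(self, ids: List[str]) -> None:
        """Remove the given ids; unknown ids are ignored in this sketch."""
        for doc_id in ids:
            self._docs.pop(doc_id, None)

    def count(self) -> int:
        return len(self._docs)


store = ToyStore()
store.add([f"document {i}" for i in range(10)])
count_initial = store.count()

new_ids = store.add(["An LLM is a large language model."])
count_after_add = store.count()

store.update(new_ids[0], "An LLM is a large language model. LangChain is a framework for working with LLMs.")
count_after_update = store.count()

store.delete(new_ids)
count_after_delete = store.count()

print(count_initial, count_after_add, count_after_update, count_after_delete)  # prints: 10 11 11 10
```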

## Query
- `similarity_search(query, k, filter)`
  - Finds the k stored items most similar to the query.
  - `filter` restricts the search to items whose metadata matches the given conditions.
  - The query is given as natural-language text.
- `similarity_search_with_score(query, k, filter)`
  - Finds the k stored items most similar to the query and returns each one with its score. With Chroma the score is a distance, so a lower value means a closer match.
- `similarity_search_by_vector(embedding, k, filter)`
  - Takes an embedding vector as the query instead of text.

In [46]:
result = vector_store.similarity_search_with_score(
    query="Will it be hot tomorrow?",
    k=3
)

In [47]:
result

[(Document(metadata={'source': 'news'}, page_content='The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.'),
  0.8616617431230914),
 (Document(metadata={'source': 'tweet'}, page_content='I have a bad feeling I am going to get deleted :('),
  1.5730749643445925),
 (Document(metadata={'source': 'tweet'}, page_content='I had chocolate chip pancakes and scrambled eggs for breakfast this morning.'),
  1.6944268028378655)]
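
The three query methods in this section differ only in what they accept (text or a vector) and whether they return the score. The sketch below imitates that relationship with plain Python so the effect of a metadata `filter` is visible. Everything in it is an assumption made for illustration: `toy_embed` is a hypothetical keyword counter rather than a real embedding model, and `search_by_vector` and `search_with_score` are stand-ins, not LangChain functions.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Dict[str, str], List[float]]  # (text, metadata, vector)


def toy_embed(text: str) -> List[float]:
    """Hypothetical 2-d 'embedding': counts of two keywords."""
    lowered = text.lower()
    return [float(lowered.count("tomorrow")), float(lowered.count("breakfast"))]


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_by_vector(records: List[Record], vector: List[float], k: int,
                     metadata_filter: Optional[Dict[str, str]] = None) -> List[Tuple[str, float]]:
    """Keep only records whose metadata matches, then rank by distance."""
    candidates = [
        rec for rec in records
        if metadata_filter is None
        or all(rec[1].get(key) == value for key, value in metadata_filter.items())
    ]
    scored = [(text, squared_l2(vector, vec)) for text, _, vec in candidates]
    scored.sort(key=lambda pair: pair[1])  # lower distance first
    return scored[:k]


def search_with_score(records: List[Record], query: str, k: int,
                      metadata_filter: Optional[Dict[str, str]] = None) -> List[Tuple[str, float]]:
    """Text query: embed it first, then reuse the vector search."""
    return search_by_vector(records, toy_embed(query), k, metadata_filter)


corpus = [
    ("The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.", {"source": "news"}),
    ("I had chocolate chip pancakes and scrambled eggs for breakfast this morning.", {"source": "tweet"}),
    ("I have a bad feeling I am going to get deleted :(", {"source": "tweet"}),
]
records = [(text, meta, toy_embed(text)) for text, meta in corpus]

unfiltered = search_with_score(records, "Will it be hot tomorrow?", k=2)
tweets_only = search_with_score(records, "Will it be hot tomorrow?", k=2,
                                metadata_filter={"source": "tweet"})
print([score for _, score in unfiltered])
print([score for _, score in tweets_only])
```

With the filter in place the news item can no longer appear, even though it is the closest match to the query.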

## Querying with a Retriever
- `vector_store.as_retriever()`
  - Creates a Retriever from the vector store.
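
Conceptually, a retriever is a thin wrapper that fixes the search settings once and then maps a query string to a list of documents. The sketch below shows that shape in plain Python. `ToyRetriever` and `keyword_search` are hypothetical names invented for this illustration, and the word-overlap ranking is a stand-in for a vector search. In LangChain itself, the retriever returned by `as_retriever()` is queried with `invoke(query)`, and search settings such as `k` can be passed through `search_kwargs`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

CORPUS = [
    "The top 10 soccer players in the world right now.",
    "The stock market is down 500 points today due to fears of a recession.",
    "Is the new iPhone worth the price? Read this review to find out.",
]


def keyword_search(query: str, k: int = 4) -> List[str]:
    """Hypothetical stand-in for a vector search: rank by shared words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda text: len(query_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


@dataclass
class ToyRetriever:
    """Wraps a search function together with preset search settings."""
    search: Callable[..., List[str]]
    search_kwargs: Dict[str, Any] = field(default_factory=dict)

    def invoke(self, query: str) -> List[str]:
        # The caller supplies only the query; k and similar settings are preset.
        return self.search(query, **self.search_kwargs)


retriever = ToyRetriever(search=keyword_search, search_kwargs={"k": 1})
hits = retriever.invoke("top soccer players")
print(hits)
```

Because callers see only `invoke(query)`, the search backend behind a retriever can be swapped without changing the code that uses it.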