# Vector Database Collection & API Setup

This notebook will walk you through setting up the vector database portion of the [openai-realtime-rag](https://github.com/ALucek/openai-realtime-rag/tree/main) fork.

## Setting Up Your Vector Database

For the vector database, I use [ChromaDB](https://www.trychroma.com/), a go-to choice of mine. Chroma can run as its own server, but I've decoupled the database from the API so that other databases can be swapped in more easily.

#### Instantiate ChromaDB

Create a persistent ChromaDB client that stores everything in the folder `chroma`.

In [2]:
pip install chromadb PyMuPDF tiktoken langchain_community

Collecting chromadb
Note: you may need to restart the kernel to use updated packages.


  error: subprocess-exited-with-error
  
  × Building wheel for chroma-hnswlib (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [9 lines of output]
      running bdist_wheel
      running build
      running build_ext
      building 'hnswlib' extension
      creating build\temp.win-amd64-cpython-312\Release\python_bindings
      "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.37.32822\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\HP\AppData\Local\Temp\pip-build-env-s9khzivz\overlay\Lib\site-packages\pybind11\include -IC:\Users\HP\AppData\Local\Temp\pip-build-env-s9khzivz\overlay\Lib\site-packages\numpy\_core\include -I./hnswlib/ -Ic:\Users\HP\AppData\Local\Programs\Python\Python312\include -Ic:\Users\HP\AppData\Local\Programs\Python\Python312\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.37.32822\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.3


  Downloading chromadb-0.5.18-py3-none-any.whl.metadata (6.8 kB)
Collecting PyMuPDF
  Downloading PyMuPDF-1.24.13-cp39-abi3-win_amd64.whl.metadata (3.4 kB)
Collecting langchain_community
  Downloading langchain_community-0.3.5-py3-none-any.whl.metadata (2.9 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6.tar.gz (32 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.7.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.20.0-cp312-cp312-

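In [4]:
# Create a virtual environment (each `!` command runs in its own subshell)
!python -m venv venv

In [5]:
# Activating here does not carry over to the notebook kernel or to later cells;
# to actually use the venv, activate it in a terminal before launching Jupyter.
!.\venv\Scripts\activate

In [None]:
# %pip installs into the environment the running kernel uses
%pip install --upgrade pip

In [None]:
# tiktoken is required by the token-based text splitter used below
%pip install chromadb PyMuPDF tiktoken langchain_community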

In [9]:
import chromadb

# Create the vector database client; data persists on disk under ./chroma (Chroma's default path)
client = chromadb.PersistentClient(path="./chroma")

#### Create a New Collection

This collection is where all of our chunked text documents will be inserted.

In [10]:
collection = client.get_or_create_collection(name="vdb_collection", metadata={"hnsw:space": "cosine"})
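
The `"hnsw:space": "cosine"` setting asks Chroma to rank results by cosine distance, which is 1 minus the cosine similarity of two embedding vectors. As a rough illustration in plain Python (not Chroma's own implementation; `cosine_distance` is a name made up for this sketch):

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way have distance 0, regardless of length
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
# Orthogonal vectors have distance 1
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))  # 1.0
```

Because only direction matters, cosine distance tends to suit text embeddings, where vector length carries little meaning.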

#### Load & Split PDF 

We'll use some simple LangChain integrations to load and chunk our PDF, taking the standard chunk size (800 tokens) and overlap (400 tokens) from OpenAI's Assistants API as a baseline.

In [11]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF (one Document per page) and join the page text into a single string
loader = PyMuPDFLoader("./Artificial Intelligence A New Dawn.pdf")
pages = loader.load()

document = "".join(page.page_content for page in pages)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=800,
    chunk_overlap=400,
)

chunks = text_splitter.split_text(document)

In [12]:
len(chunks)

136
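
With `chunk_size=800` and `chunk_overlap=400`, neighbouring chunks can share up to half of their tokens, so a passage that straddles one boundary usually appears whole in an adjacent chunk. The sketch below shows the idea with a plain fixed-stride window over a token list. It is a simplification: `RecursiveCharacterTextSplitter` splits on separators and merges pieces up to the size limit, so its actual boundaries will differ. `sliding_windows` is a hypothetical helper, not part of LangChain.

```python
def sliding_windows(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return windows

# Small numbers make the overlap easy to see: size 4, overlap 2
for window in sliding_windows(list(range(10)), chunk_size=4, chunk_overlap=2):
    print(window)
```

Each printed window repeats the last two items of the previous one, which is the same half-overlap ratio as 800/400.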

#### Insert Chunks into VDB Collection

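Add each chunk to the collection. Since no embedding function was specified, Chroma embeds the text with its default model (all-MiniLM-L6-v2), which is downloaded on first use.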

In [13]:
# Insert chunks into the ChromaDB collection, giving each a unique ID
for i, chunk in enumerate(chunks):
    collection.add(
        documents=[chunk],
        ids=[f"chunk_{i}"],
    )

C:\Users\HP\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████| 79.3M/79.3M [06:09<00:00, 225kiB/s] 
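
For larger documents, it may be worth passing several chunks per `collection.add` call instead of one at a time. A minimal sketch of a batching helper follows; `batched_chunks`, the batch size of 100, and the `prefix` parameter are all assumptions for illustration, and the IDs follow the same `chunk_{i}` pattern used above.

```python
def batched_chunks(chunks, batch_size=100, prefix="chunk"):
    """Yield (ids, documents) pairs holding at most batch_size items each."""
    for start in range(0, len(chunks), batch_size):
        docs = chunks[start:start + batch_size]
        ids = [f"{prefix}_{i}" for i in range(start, start + len(docs))]
        yield ids, docs

# Usage against the collection created earlier (assumed to exist):
# for ids, docs in batched_chunks(chunks):
#     collection.add(documents=docs, ids=ids)

example = [f"text {n}" for n in range(5)]
for ids, docs in batched_chunks(example, batch_size=2):
    print(ids, docs)
```

The IDs stay stable across runs as long as the chunk order does, which keeps re-insertion predictable.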


#### If Repopulating the DB, First Delete the Collection

Use this if you need to reset your vector database.

In [None]:
client.delete_collection(name="vdb_collection")

---
## API Setup

We'll be using [FastAPI](https://fastapi.tiangolo.com/) as a quick and easy way to host our query function as a REST API. This API is what will be called from the defined `query_db` tool in the main console file.

In [14]:
from fastapi import FastAPI
from pydantic import BaseModel
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allow cross-origin requests so the browser-based console can call this API
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Define a request model
class QueryRequest(BaseModel):
    query: str

# Define the query endpoint
@app.post("/query")
async def query_chroma(request: QueryRequest):
    # Perform the query on your ChromaDB collection
    results = collection.query(query_texts=[request.query], n_results=5)
    return {"results": results['documents'][0]}
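
The endpoint expects a JSON body of the form `{"query": "..."}` and returns `{"results": [...]}` with up to five matching chunks. As a sketch of what a client call could look like from Python, using only the standard library: `build_query_request` and `post_query` are hypothetical helpers, and the `http://localhost:8000` base URL assumes the server is running locally on the port used in the next section.

```python
import json
import urllib.request

def build_query_request(query, base_url="http://localhost:8000"):
    """Build a POST request for the /query endpoint with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_query(query, base_url="http://localhost:8000", timeout=10):
    """Send the query and return the list stored under the "results" key."""
    request = build_query_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["results"]

# Building the request does not need a running server
req = build_query_request("What is artificial intelligence?")
print(req.get_method(), req.full_url)
```

Calling `post_query("...")` only works once the API is up; until then it raises a connection error.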

#### Run API

Use uvicorn to host the API as a local web server, running it in a background thread so the notebook stays usable.

In [15]:
import uvicorn
import threading

def run_api():
    uvicorn.run(app, host="0.0.0.0", port=8000)

# Run the FastAPI app in a background thread
thread = threading.Thread(target=run_api)
thread.start()

INFO:     Started server process [12764]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


INFO:     127.0.0.1:56402 - "OPTIONS /query HTTP/1.1" 200 OK
INFO:     127.0.0.1:56402 - "POST /query HTTP/1.1" 200 OK
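
If no log lines like the ones above appear, a quick way to check whether anything is listening on the port is to attempt a TCP connection. This is a small standard-library sketch; `is_port_open` is a hypothetical helper, and port 8000 matches the `uvicorn.run` call above.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# True when the API thread is running, False otherwise
print(is_port_open("127.0.0.1", 8000))
```

This only confirms that some process accepts connections on the port; it does not verify that the `/query` endpoint itself responds.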
