# Vectorstore
## Introduction

- We can index documentation in the same way as code.
- In its simplest form, a vector database is a datastore that allows us to find similar and related documents.
- It is similar to a search engine, but instead of searching for a string or keyword, it compares embeddings: numeric vectors that represent the meaning of a text.
- We'll be using Chroma, a popular vector database.
- Many other vector databases exist, and more and more traditional databases are adding vector search capabilities.
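
To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity, one common way to compare two vectors. The 3-dimensional vectors below are invented for illustration; a real embeddings model returns vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example only.
toy_embeddings = {
    "unit tests with Jest": [0.9, 0.1, 0.0],
    "testing framework": [0.8, 0.2, 0.1],
    "REST API": [0.1, 0.9, 0.2],
}

query = toy_embeddings["testing framework"]
for text, vector in toy_embeddings.items():
    print(f"{text!r}: {cosine_similarity(query, vector):.3f}")
```

Texts with related meaning should end up with vectors that point in a similar direction, so they score higher than unrelated texts.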

## Installation

In [8]:
import os
os.environ["HNSWLIB_NO_NATIVE"]="1"

In [9]:
%pip install -qU langchain langchain-openai langchain-community tiktoken chromadb langchain-chroma


Note: you may need to restart the kernel to use updated packages.


## Document preparation

- In LangChain we can represent documents in a structured way using the Document object.
- A Document has a page_content field for the text and a metadata field for any extra information about it.
- Let's convert some information into this format.

In [10]:
# We load the texts
from langchain_core.documents import Document

data = [
    {
        "text": "We use REST API to communicate with the server.",
    },
    {
        "text": "For testing in javascript we use Jest." 
    }, 
    {
        "text": "We prefer snake_case for naming variables over CamelCase in javascript.",
    }
]

documents = []

for item in data:
    document = Document(
        page_content=item["text"],
    )
    documents.append(document)

## ChromaDB - A vector database

- ChromaDB has the concept of collections.
- A collection is similar to a table in a traditional database.
- For this example to run, we first check whether the collection documentation exists.
- If it does, we delete it so we can start fresh.

In [11]:
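import chromadb

collection_name = "documentation"
chroma_client = chromadb.PersistentClient(path="./chromadb")

# Depending on the chromadb version, list_collections() returns either
# collection names or Collection objects, so we normalize to names.
existing_names = [getattr(c, "name", c) for c in chroma_client.list_collections()]

if collection_name in existing_names:
    print(f"deleting {collection_name}")
    chroma_client.delete_collection(collection_name)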


deleting documentation


## Setting up the embeddings model
- To index the documents, we need to select an embeddings model.
- The model turns each text into a vector, and similarity between documents is calculated on those vectors.
- You need to use the same model for indexing and querying, because vectors from different models are not comparable.

In [12]:
# Set the embeddings function
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
print(embeddings.model)

text-embedding-ada-002


## Populating the vector database
- The vector database will calculate the embedding for each document.
- It will store the text and metadata in the store as well.

In [13]:
# The vector database calculates the embeddings using the embeddings model provided
# and stores the embedding for each document in its database
# from langchain_community.vectorstores import Chroma
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    client=chroma_client,
    collection_name=collection_name
    # client_settings
)

Once the documents are stored, we can ask the vector database to find related documents through their embeddings.
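
As a rough mental model of what happens on insert and on query, here is a toy in-memory store. It is only a sketch: `ToyVectorStore` and `toy_embed` are hypothetical names, and the word-count "embedding" is a stand-in for a real embeddings model. It is not how Chroma is implemented.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embeddings model: lowercase word counts."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed  # the same function is used for indexing and querying
        self.rows = []      # each row: (vector, text, metadata)

    def add(self, text, metadata=None):
        # The embedding is calculated once, at insert time, and stored with the text.
        self.rows.append((self.embed(text), text, metadata or {}))

    def search(self, query, k=1):
        # The query is embedded with the same function, then compared to every row.
        query_vector = self.embed(query)
        scored = [(cosine(query_vector, vec), text, meta) for vec, text, meta in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = ToyVectorStore(toy_embed)
store.add("We use REST API to communicate with the server.")
store.add("For testing in javascript we use Jest.")

best = store.search("What is the testing framework for javascript?", k=1)
print(best[0][1])
```

A real vector database replaces the brute-force loop in `search` with an index, so it does not have to compare the query against every stored vector.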

## Querying for similarity

- We ask the vector database to find the most similar documents to our query.
- We can control the number of results we get back using the k parameter.
- We can also set a minimum relevance score using the score_threshold parameter; results that score below it are dropped.

In [14]:
queries = [
    "What protocol do we use to communicate with the server ?",
    "What is the testing framework for javascript?",
]

for query in queries:
    docs = vectorstore.similarity_search_with_relevance_scores(query, k=4, score_threshold=0.7)
    print("query: ", query)
    for doc in docs:
        print(doc)
    print("---")

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


query:  What protocol do we use to communicate with the server ?
(Document(id='36b62b03-9371-4e08-8df4-a04dd66d27a5', metadata={}, page_content='We use REST API to communicate with the server.'), 0.8068346522236127)
---


Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


query:  What is the testing framework for javascript?
(Document(id='9829c334-885a-42da-896a-cc4f0c2ff395', metadata={}, page_content='For testing in javascript we use Jest.'), 0.851067264274354)
(Document(id='a48ac426-c3db-4bf3-a435-4a9852573e58', metadata={}, page_content='We prefer snake_case for naming variables over CamelCase in javascript.'), 0.7117564646913745)
---
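
The way k and score_threshold interact in the query above can be sketched as follows: take at most k of the best-scoring results, then drop anything below the threshold. This is a conceptual illustration with invented scores, not LangChain's actual code, and `top_k_above_threshold` is a hypothetical helper.

```python
def top_k_above_threshold(scored_docs, k, score_threshold):
    """Keep at most k best-scoring docs, then drop those under the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:k] if score >= score_threshold]


# Hypothetical (text, relevance score) pairs, made up for this example.
scored_docs = [
    ("doc A", 0.85),
    ("doc B", 0.71),
    ("doc C", 0.64),
]

# k is larger than the number of documents, so only the threshold limits the result.
print(top_k_above_threshold(scored_docs, k=4, score_threshold=0.7))

# With k=1 only the single best match is considered.
print(top_k_above_threshold(scored_docs, k=1, score_threshold=0.7))
```

This matches the second query above, where three documents were available but only the two scoring above 0.7 were returned.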
