# Vector Stores (aka. Vector Databases)
* Store embeddings in a database built for fast similarity search.

In [1]:
#pip install python-dotenv

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

#### Install LangChain

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [3]:
#!pip install langchain

## Connect with an LLM

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [4]:
#!pip install langchain-openai

* NOTE: Because OpenAI currently offers what we consider the best LLM on the market, we use it by default. A later lesson shows how to connect to open-source LLMs such as Llama 3 or Mistral.

In [2]:
from langchain_openai import ChatOpenAI

chatModel = ChatOpenAI(model="gpt-3.5-turbo-0125")

## Reminder: Steps of the RAG process.
* When you load a document, you end up with strings. Sometimes the strings are too large to fit into the context window. On those occasions we use the RAG technique:
    * Split the document into small chunks.
    * Transform the text chunks into numeric vectors (embeddings).
    * **Load the embeddings into a vector database (aka vector store)**.
    * Load the question and retrieve the most relevant embeddings to answer it.
    * Send the retrieved chunks (the text behind those embeddings) to the LLM so it can format the response properly.
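The steps above can be sketched end to end with the standard library only. This is a toy illustration, not the LangChain implementation used later in this notebook: `embed` is a hypothetical stand-in that counts words instead of calling an embedding model, `ToyVectorStore` is a made-up class, and the sample document is invented for the example.

```python
import math
import re
from collections import Counter

def split_into_chunks(text, chunk_size):
    """Step 1: split the document into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def embed(text):
    """Step 2 (toy stand-in): a sparse word-count vector, not a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """Step 3: keep (embedding, chunk) pairs. Step 4: retrieve the closest chunks."""

    def __init__(self, chunks):
        self.entries = [(embed(chunk), chunk) for chunk in chunks]

    def similarity_search(self, question, k=1):
        query = embed(question)
        ranked = sorted(self.entries, key=lambda entry: cosine(query, entry[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

# Invented sample document: three sentences of seven words each.
document = (
    "The library opens at nine every morning. "
    "The cafeteria serves lunch from noon daily. "
    "Parking is free on weekends and holidays."
)
question = "When does the cafeteria serve lunch?"

store = ToyVectorStore(split_into_chunks(document, chunk_size=7))
top_chunks = store.similarity_search(question, k=1)

# Step 5: the retrieved text (not the vectors) goes into the prompt for the LLM.
prompt = f"Answer using this context:\n{top_chunks[0]}\n\nQuestion: {question}"
print(top_chunks[0])  # The cafeteria serves lunch from noon daily.
```

A real vector store replaces `embed` with a model such as `OpenAIEmbeddings` and replaces the linear scan in `similarity_search` with an index, but the flow of data is the same.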

## Vector databases (aka vector stores): store and search embeddings
* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/).
* See the list of vector stores [here](https://python.langchain.com/v0.1/docs/integrations/vectorstores/).

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [6]:
#!pip install langchain-chroma

In [3]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
loaded_document = TextLoader('./data/state_of_the_union.txt').load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

vector_db = Chroma.from_documents(chunks_of_text, OpenAIEmbeddings())

In [4]:
question = "What did the president say about the John Lewis Voting Rights Act?"

response = vector_db.similarity_search(question)

print(response[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


The `.similarity_search` method in the previous code finds the text chunks of a pre-processed document that are most similar to a given query. Here's how it works, step by step:

1. **Document Processing and Embedding:**
   - The document `state_of_the_union.txt` is loaded using the `TextLoader` from the LangChain Community package.
   - The document is then split into smaller chunks using `CharacterTextSplitter`, which aims for chunks of at most 1000 characters (`chunk_size=1000`) with no overlap between consecutive chunks (`chunk_overlap=0`).
   - Each chunk of text is then transformed into an embedding using `OpenAIEmbeddings`. Embeddings are high-dimensional vectors that represent the semantic content of the text.
   - These embeddings are stored in an instance of `Chroma`, which serves as a vector database optimized for efficient similarity searches.

2. **Using `.similarity_search`:**
   - When you invoke `vector_db.similarity_search(question)`, the method converts the query (`"What did the president say about the John Lewis Voting Rights Act?"`) into an embedding using the same method that was used for the chunks.
   - It then searches the vector database (`vector_db`) to find the chunks whose embeddings are most similar to the embedding of the query. This similarity is typically measured using metrics such as cosine similarity.
   - The search results are sorted by relevance, with the most relevant chunks (those that are semantically closest to the query) returned first.

3. **Output:**
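   - The result of `.similarity_search` is stored in `response`, a list of the most relevant chunks as `Document` objects (four by default), ordered from most to least similar. The list does not include similarity scores; use `similarity_search_with_score` if you need them.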
   - The script prints the content of the most relevant chunk (the first result), which should ideally contain information about what the president said regarding the John Lewis Voting Rights Act.

This method is particularly useful in applications like question answering or document retrieval where you need to quickly find the most relevant parts of a large text based on a query.
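To make the similarity metric concrete, here is a minimal sketch that ranks three hypothetical 3-dimensional vectors against a query, first by cosine similarity and then by Euclidean (L2) distance. The vectors and the names `chunk_a`, `chunk_b`, `chunk_c` are invented for illustration; real embeddings have hundreds or thousands of dimensions, and which metric a given vector store uses depends on its configuration.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

def l2_distance(a, b):
    """Straight-line distance between a and b: 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale v to length 1 so that only its direction matters."""
    length = norm(v)
    return [x / length for x in v]

def rank(query, vectors, score, higher_is_better):
    """Return chunk names ordered from best to worst match."""
    return sorted(vectors, key=lambda name: score(query, vectors[name]), reverse=higher_is_better)

query = [1.0, 0.0, 0.0]
vectors = {
    "chunk_a": [10.0, 1.0, 0.0],  # almost the same direction as the query, but much longer
    "chunk_b": [1.0, 1.0, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}

by_cosine = rank(query, vectors, cosine_similarity, higher_is_better=True)
by_l2 = rank(query, vectors, l2_distance, higher_is_better=False)
unit_vectors = {name: normalize(v) for name, v in vectors.items()}
by_l2_normalized = rank(normalize(query), unit_vectors, l2_distance, higher_is_better=False)

print(by_cosine)         # ['chunk_a', 'chunk_b', 'chunk_c']
print(by_l2)             # ['chunk_b', 'chunk_c', 'chunk_a']
print(by_l2_normalized)  # ['chunk_a', 'chunk_b', 'chunk_c']
```

Cosine similarity ignores vector length, while raw L2 distance does not, so the two can disagree. Once every vector is scaled to unit length, both metrics produce the same ordering.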

## Chroma DB with Hugging Face embeddings

In [6]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

In [15]:

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
# Forward slashes work on Windows too and avoid the invalid "\p" escape sequence.
loaded_document = PyPDFLoader('data1/pages.pdf').load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

chunks_of_text = text_splitter.split_documents(loaded_document)
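
The effect of `chunk_overlap` can be illustrated with a simplified fixed-width sliding window. This is only a sketch of the idea: `RecursiveCharacterTextSplitter` chooses its boundaries differently, preferring separators such as paragraph and line breaks, and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks

chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Because neighbouring chunks share text, a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.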


In [16]:
embeddings_model_name = 'BAAI/bge-large-en'

In [17]:
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
#embed = embeddings.embed_documents([chunk.page_content for chunk in chunks_of_text])
vector_db = Chroma.from_documents(chunks_of_text, embeddings)

In [18]:
question = " give me the stpes to crete a pdf document"

response = vector_db.similarity_search(question)

print(response[0].page_content)

Page 3 of 4 Creating PDF Documents (continued) 
 
Option 2: If you do not have  Acrobat Standard or higher 
installed use PS2PSF.*   
    
 1. Open the file in its authoring app lication, and choose File > Print. 
2. Select “Print to File” and save. 
3. Open your browser and go to http://ps2pdf.com/convert.htm
 
4. Click “browse” select the file you created in step 2 (.prn or .ps), 
click “convert” 
5. Download the newly created PDF file. 
*Note: Some formatting changes ma y occur once converted (bullets 
may turn to symbols and color may become black and white).


In [19]:


question = " how to reduce the size of a pdf document"

response = vector_db.similarity_search(question)

print(response[0].page_content)

Page 4 of 4 Reducing File Size Options 
 
*WebDCU will accept files up to 2.0MB.* 
 Here is a rough estimate for PDF file sizes: If the contents are pure text, like a CV, the file size is usually 10kb per 
page; therefore, a 1MB file will ha ve about 100 pages.  If the file 
includes some pictures, the file size may increase. If the file is a 
picture, like a scanned license or ce rtification, you may have different 
file sizes based on the picture quality.  In most cases, saving the file at 
about 250kb per page should be enough to generate a clear picture.  Option 1 – Use Adobe PDF Print Command: 1. Open the PDF file, and choose File > Print. 2. Choose Adobe PDF from the printer menu next to Name. 3. Click the Properties (or Preferen ces) button to customize the Adobe 
PDF printer setting. (In some app lications, you may need to click 
Setup in the Print dialog box to open  the list of printers, and then click 
Properties or Preferences.)  Choose Sm allest File Size as your default


## Milvus with Hugging Face embeddings

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_milvus import Milvus


In [2]:
doc=PyPDFLoader('data1/pages.pdf').load()
text_splitter=RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks_of_text=text_splitter.split_documents(doc)

In [3]:
model="BAAI/bge-large-en"
embeddings=HuggingFaceEmbeddings(model_name=model)




## Milvus Lite

In [5]:
# URI = "./milvus_example.db"

# vector_store = Milvus(
#     embedding_function=embeddings,
#     connection_args={"uri": URI},
#     index_params={"index_type": "FLAT", "metric_type": "L2"},
# )

## Create the Milvus DB

### 1) Install Docker ([installation guide](https://docs.docker.com/desktop/setup/install/windows-install/))
### 2)

In [12]:

from tqdm import tqdm
from pymilvus import utility, Collection,connections,CollectionSchema, FieldSchema, DataType,db
from langchain_milvus import Milvus

In [16]:
from pymilvus import Collection, MilvusException, connections, db, utility

conn = connections.connect(host="127.0.0.1", port=19530)

# Check if the database exists
db_name = "milvus_demo"
try:
    existing_databases = db.list_database()
    if db_name in existing_databases:
        print(f"Database '{db_name}' already exists.")

        # Use the database context
        db.using_database(db_name)

        # Drop all collections in the database
        collections = utility.list_collections()
        for collection_name in collections:
            collection = Collection(name=collection_name)
            collection.drop()
            print(f"Collection '{collection_name}' has been dropped.")

        db.drop_database(db_name)
        print(f"Database '{db_name}' has been deleted.")
    else:
        print(f"Database '{db_name}' does not exist.")
        database = db.create_database(db_name)
        print(f"Database '{db_name}' created successfully.")
except MilvusException as e:
    print(f"An error occurred: {e}")

MilvusException: <MilvusException: (code=2, message=Fail connecting to server on 127.0.0.1:19530, illegal connection params or server unavailable)>
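
The error above says the client could not open a connection to `127.0.0.1:19530`, which usually means no Milvus server is running at that address yet. A minimal sketch for checking this before calling `connections.connect` follows. It only tests whether the TCP port accepts connections, and the helper name `server_is_reachable` is hypothetical.

```python
import socket

def server_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

if server_is_reachable("127.0.0.1", 19530):
    print("Something is listening on 127.0.0.1:19530; connecting should be possible.")
else:
    print("Nothing is listening on 127.0.0.1:19530; start the Milvus server first.")
```

An open port does not prove that the process behind it is Milvus, so this is a quick sanity check rather than a health check.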

In [None]:
from pymilvus import MilvusClient
import numpy as np

client = MilvusClient("./milvus_demo.db")
client.create_collection(
    collection_name="demo_collection",
    dimension=384  # The vectors used in this demo have 384 dimensions
)

docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

vectors = [[ np.random.uniform(-1, 1) for _ in range(384) ] for _ in range(len(docs)) ]
data = [ {"id": i, "vector": vectors[i], "text": docs[i], "subject": "history"} for i in range(len(vectors)) ]
res = client.insert(
    collection_name="demo_collection",
    data=data
)

res = client.search(
    collection_name="demo_collection",
    data=[vectors[0]],
    filter="subject == 'history'",
    limit=2,
    output_fields=["text", "subject"],
)
print(res)

res = client.query(
    collection_name="demo_collection",
    filter="subject == 'history'",
    output_fields=["text", "subject"],
)
print(res)

res = client.delete(
    collection_name="demo_collection",
    filter="subject == 'history'",
)
print(res)
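
Conceptually, the `search` call above combines a metadata filter with a vector ranking. The sketch below mimics that behaviour in plain Python on invented 2-dimensional rows. It is not how Milvus executes the query internally, cosine similarity is an assumed metric, and `filtered_search` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(rows, query_vector, subject, limit):
    """Keep rows whose 'subject' matches, then return the `limit` most similar ones."""
    candidates = [row for row in rows if row["subject"] == subject]
    candidates.sort(key=lambda row: cosine(query_vector, row["vector"]), reverse=True)
    return [{"id": row["id"], "text": row["text"]} for row in candidates[:limit]]

# Invented rows; real vectors would come from an embedding model.
rows = [
    {"id": 0, "vector": [1.0, 0.0], "text": "doc zero", "subject": "history"},
    {"id": 1, "vector": [0.8, 0.6], "text": "doc one", "subject": "history"},
    {"id": 2, "vector": [1.0, 0.1], "text": "doc two", "subject": "biology"},
    {"id": 3, "vector": [0.0, 1.0], "text": "doc three", "subject": "history"},
]

hits = filtered_search(rows, query_vector=[1.0, 0.0], subject="history", limit=2)
print(hits)  # [{'id': 0, 'text': 'doc zero'}, {'id': 1, 'text': 'doc one'}]
```

Row 2 is close to the query but is excluded by the filter, which is the same effect `filter="subject == 'history'"` has in the cell above.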


In [None]:
from pymilvus import MilvusClient


MILVUS_URI = "./huggingface_milvus_test.db"  # Connection URI
COLLECTION_NAME = "huggingface_test"  # Collection name
DIMENSION = 384  # Must equal the embedding model's output size (BAAI/bge-large-en, used above, outputs 1024)

milvus_client = MilvusClient(MILVUS_URI)
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)
milvus_client.create_collection(
    collection_name=COLLECTION_NAME,
    dimension=DIMENSION,
    auto_id=True,  # Enable auto id
    enable_dynamic_field=True,  # Enable dynamic fields
    vector_field_name="question_embedding",  # Map vector field name and embedding column in dataset
    consistency_level="Strong",  # To enable search with latest data
)
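
Since the collection dimension has to match the embedding model, one option is to measure the model's output size instead of hard-coding it. The sketch below assumes an object with an `embed_query(text)` method, as LangChain embedding classes such as `HuggingFaceEmbeddings` provide. `FakeEmbeddings` is a hypothetical stand-in so the example runs without downloading a model, and both helper functions are illustrative names.

```python
class FakeEmbeddings:
    """Hypothetical stand-in that returns fixed-size vectors like a real embedding model."""

    def __init__(self, size):
        self.size = size

    def embed_query(self, text):
        return [0.0] * self.size

def embedding_dimension(embedder):
    """Embed a short probe string and return the length of the resulting vector."""
    return len(embedder.embed_query("dimension probe"))

def check_dimension(embedder, collection_dimension):
    """Raise early if the model's vectors would not fit the collection."""
    actual = embedding_dimension(embedder)
    if actual != collection_dimension:
        raise ValueError(
            f"Model produces {actual}-dimensional vectors, "
            f"but the collection expects {collection_dimension}."
        )
    return actual

print(check_dimension(FakeEmbeddings(384), 384))  # 384
```

With a real model you would pass the `embeddings` object created earlier and use the returned value as the collection dimension.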

## Hugging Face embeddings with Neo4j

In [None]:
!pip install neo4j==5.26.0

In [15]:

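import os
# Read credentials from the environment (e.g. the .env file loaded earlier) instead of hard-coding them.
# The variable names NEO4J_URI, NEO4J_USERNAME and NEO4J_PASSWORD are assumed to be defined there.
NEO4J_URI = os.environ.get('NEO4J_URI', 'neo4j+s://b9042d2c.databases.neo4j.io')
#NEO4J_URI='neo4j://localhost:7687'
NEO4J_USERNAME = os.environ.get('NEO4J_USERNAME', 'neo4j')
NEO4J_PASSWORD = os.environ['NEO4J_PASSWORD']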
##AURA_INSTANCEID=b9042d2c
#AURA_INSTANCENAME=Instance01


In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings 
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores.neo4j_vector import Neo4jVector


In [2]:
model="BAAI/bge-large-en"
embeddings=HuggingFaceEmbeddings(model_name=model)



In [6]:
import os

In [16]:


FILE_PATH="data1/pages.pdf"

loader = PyPDFLoader(FILE_PATH)
docs = loader.load()

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500,chunk_overlap=200)

# Split documents into chunks
chunks = text_splitter.split_documents(docs)

# Create a Neo4j vector store
neo4j_db = Neo4jVector.from_documents(
    chunks,
    embeddings,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database='neo4j',
    index_name='techqaVector',
    node_label='techqa',
    text_node_property='text',
    embedding_node_property='embedding',
)



In [19]:
question="how to create a pdf document"
response = neo4j_db.similarity_search(question)

print(response[0].page_content)

Page 2 of 4 Creating PDF Documents 
 
 
Option 1 – Use Adobe PDF Printer Command: 
In many authoring applications, yo u can use the Print command with 
the Adobe PDF printer to convert your file to PDF.   Create a PDF using the Print command (Windows) 
1. Open the file in its authoring a pplication, and choose File > Print. 
2. Choose Adobe PDF from the printer menu. 
 
 
 
3. Click the Properties (or Preferen ces) button to customize the Adobe 
PDF printer setting. (In some app lications, you may need to click 
Setup in the Print dialog box to open  the list of printers, and then click 
Properties or Preferences.)  Choose Sm allest File Size as your default 
setting. 
 
 
 
4. In the Print dialog box, click OK and Save your file. 
 
 
 
Create a PDF using the Print command (Mac OS) 
1. Open the file in its authoring a pplication, and choose File > Print. 
2. Click on the PDF button in the Print window. 3. Click Save as PDF.


## How to execute the code from Visual Studio Code
* In Visual Studio Code, see the file 004-vector-stores.py
* In the terminal, make sure you are in the directory of the file and run:
    * python 004-vector-stores.py