## Vector Stores

### Indexing: unstructured data -> load, transform, embed -> store in a vector store
### Querying: incoming query -> embed it -> retrieve the stored vectors most similar to the query embedding
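
The two flows above can be sketched without any library. The block below is a toy illustration only: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is an invented class, not part of LangChain, FAISS, or Chroma.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical store: keeps (vector, text) pairs in a plain list."""

    def __init__(self):
        self.records = []

    def add_texts(self, texts):
        # Indexing flow: embed each text and store the vector next to it.
        for text in texts:
            self.records.append((embed(text), text))

    def similarity_search(self, query, k=2):
        # Query flow: embed the query, rank stored vectors by similarity.
        q = embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "errol musk is the father of elon musk",
    "spacex builds rockets",
    "tesla makes electric cars",
])
top = store.similarity_search("who is the father of elon musk", k=1)
print(top)
```

A real vector store replaces the linear scan with an index and the word counts with dense model embeddings, but the two flows stay the same.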

### Some free open-source Vector Stores
## ChromaDB, FAISS, Lance.

In [2]:
!pip install chromadb qdrant-client faiss-cpu

Collecting qdrant-client
  Downloading qdrant_client-1.8.0-py3-none-any.whl (214 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-win_amd64.whl (14.5 MB)
Collecting grpcio-tools>=1.41.0
  Downloading grpcio_tools-1.62.1-cp310-cp310-win_amd64.whl (1.1 MB)
Collecting portalocker<3.0.0,>=2.7.0
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Collecting h2<5,>=3
  Downloading h2-4.1.0-py3-none-any.whl (57 kB)
Collecting hpack<5,>=4.0
  Downloading hpack-4.0.0-py3-none-any.whl (32 kB)
Collecting hyperframe<7,>=6.0
  Downloading hyperframe-6.0.1-py3-none-any.whl (12 kB)
Installing collected packages: hyperframe, hpack, h2, portalocker, grpcio-tools, qdrant-client, faiss-cpu
Successfully installed faiss-cpu-1.8.0 grpcio-tools-1.62.1 h2-4.1.0 hpack-4.0.0 hyperframe-6.0.1 portalocker-2.8.2 qdrant-client-1.8.0



In [1]:
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS

### Document Loading

In [5]:
loader = WikipediaLoader(query = 'Elon Musk', load_max_docs=5)
documents = loader.load()
documents

[Document(page_content='Elon Reeve Musk (; EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX; angel investor, CEO, product architect, and former chairman of Tesla, Inc.; owner, executive chairman, and CTO of X Corp.; founder of the Boring Company and xAI; co-founder of Neuralink and OpenAI; and president of the Musk Foundation. He is one of the wealthiest people in the world, with an estimated net worth of US$190 billion as of March 2024, according to the Bloomberg Billionaires Index, and $195 billion according to Forbes, primarily from his ownership stakes in Tesla and SpaceX.A member of the wealthy South African Musk family, Elon was born in Pretoria and briefly attended the University of Pretoria before immigrating to Canada at age 18, acquiring citizenship through his Canadian-born mother. Two years later, he matriculated at Queen\'s University at Kingston in Canada. Musk later transferred to the University of Pennsylvani

### Text Splitting

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 400, chunk_overlap = 100)
docs = text_splitter.split_documents(documents=documents)
print(len(docs))
docs

76


[Document(page_content='Elon Reeve Musk (; EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX; angel investor, CEO, product architect, and former chairman of Tesla, Inc.; owner, executive chairman, and CTO of X Corp.; founder of the Boring Company and xAI; co-founder of Neuralink and OpenAI; and president of the Musk Foundation. He is one of the wealthiest', metadata={'title': 'Elon Musk', 'summary': "Elon Reeve Musk (; EE-lon; born June 28, 1971) is a businessman and investor. He is the founder, chairman, CEO, and CTO of SpaceX; angel investor, CEO, product architect, and former chairman of Tesla, Inc.; owner, executive chairman, and CTO of X Corp.; founder of the Boring Company and xAI; co-founder of Neuralink and OpenAI; and president of the Musk Foundation. He is one of the wealthiest people in the world, with an estimated net worth of US$190 billion as of March 2024, according to the Bloomberg Billionaires Index, and $195
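
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraphs, newlines, and spaces, while this hypothetical `split_with_overlap` helper cuts purely by character position.

```python
def split_with_overlap(text, chunk_size=400, chunk_overlap=100):
    """Cut text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.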

### Defining the embedding function

In [7]:
model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings':True}

embedding_function = HuggingFaceBgeEmbeddings(
    model_name = model_name,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)
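
Setting `normalize_embeddings` to `True` scales every vector to unit length. The sketch below, using made-up two-dimensional vectors and hypothetical helper names, shows why that is convenient: for unit vectors the dot product equals the cosine similarity, and squared Euclidean distance is a simple function of it.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0], [1.0, 2.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

# For unit vectors: dot product == cosine similarity,
# and squared L2 distance == 2 - 2 * cosine similarity.
print(dot(a_hat, b_hat), cosine(a, b))
print(squared_distance(a_hat, b_hat), 2 - 2 * cosine(a, b))
```

In practice this means that ranking by dot product, cosine similarity, or Euclidean distance gives the same order once the embeddings are normalized.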



### Query

In [8]:
query = "Who is elon musk's father"

### FAISS
### Creating a vector database (FAISS, an in-memory store)

In [9]:
db = FAISS.from_documents(
    docs,
    embedding_function
)

### Querying the vector database

#### Similarity Search

In [10]:
matched_docs = db.similarity_search(query = query, k = 5)
matched_docs

[Document(page_content="Elon Musk's paternal great-grandmother was a Dutchwoman descended from the Dutch Free Burghers, while one of his maternal great-grandparents came from Switzerland. His paternal grandmother was English from Liverpool and his paternal grandfather Walter Henry J. Musk was South African. Elon Musk's father, Errol Musk, is a South African former electrical and mechanical engineer consultant and", metadata={'title': 'Musk family', 'summary': 'The Musk family is a wealthy family of South African origin that is largely active in the United States and Canada. The Musks are of English, Anglo-Canadian, Pennsylvania Dutch, and Swiss descent. The family is known for its entrepreneurial endeavours. Elon Musk was formerly the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index.', 'source': 'https://en.wikipedia.org/wiki/Musk_family'}),
 Document(page_content="Elon Reeve Musk was born o

#### Similarity Search by Vectors

In [11]:
embedding_vector = embedding_function.embed_query(query)
matched_docs = db.similarity_search_by_vector(embedding_vector)
matched_docs

[Document(page_content="Elon Musk's paternal great-grandmother was a Dutchwoman descended from the Dutch Free Burghers, while one of his maternal great-grandparents came from Switzerland. His paternal grandmother was English from Liverpool and his paternal grandfather Walter Henry J. Musk was South African. Elon Musk's father, Errol Musk, is a South African former electrical and mechanical engineer consultant and", metadata={'title': 'Musk family', 'summary': 'The Musk family is a wealthy family of South African origin that is largely active in the United States and Canada. The Musks are of English, Anglo-Canadian, Pennsylvania Dutch, and Swiss descent. The family is known for its entrepreneurial endeavours. Elon Musk was formerly the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index.', 'source': 'https://en.wikipedia.org/wiki/Musk_family'}),
 Document(page_content="Elon Reeve Musk was born o

### Check if the answer is present in results

In [12]:
['errol musk' in doc.page_content.lower() for doc in matched_docs] # Errol Musk is the answer

[True, True, True, False]
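
The membership check above can be wrapped in a small helper that also reports the rank of the first hit. This is only a sketch: `contains_answer` and `first_hit_rank` are hypothetical names, and the sample texts are invented. With LangChain documents, one would pass `[doc.page_content for doc in matched_docs]`.

```python
def contains_answer(texts, answer):
    """Case-insensitive check of whether each text mentions the answer."""
    needle = answer.lower()
    return [needle in text.lower() for text in texts]


def first_hit_rank(texts, answer):
    """1-based rank of the first text containing the answer, or None."""
    for rank, hit in enumerate(contains_answer(texts, answer), start=1):
        if hit:
            return rank
    return None


sample_texts = [
    "SpaceX was founded in 2002.",
    "Elon Musk's father, Errol Musk, is a South African engineer.",
]
print(contains_answer(sample_texts, "Errol Musk"))
print(first_hit_rank(sample_texts, "Errol Musk"))
```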

FAISS keeps its index in memory, so the store built above disappears when the session ends unless it is saved explicitly.

Most of the time we don't rely on an in-memory vector store alone; we want one that persists to disk.

Let's start with ChromaDB and see how to:

* Save the vector store
* Load the vector store
* Add new records to the vector store
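
Before turning to Chroma, the save, load, and add cycle can be sketched with a plain JSON file. This is a conceptual illustration only: `TinyPersistentStore` is a hypothetical class, the vectors are made up, and the on-disk format has nothing to do with how Chroma actually stores data.

```python
import json
import os
import tempfile


class TinyPersistentStore:
    """Hypothetical store that keeps records in one JSON file."""

    def __init__(self, path):
        self.path = path
        self.records = []
        # Load: pick up whatever an earlier session saved at this path.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.records = json.load(f)

    def add(self, vector, text):
        # Add: new records are appended to the ones already loaded.
        self.records.append({"vector": vector, "text": text})

    def save(self):
        # Save: write every record back to disk.
        os.makedirs(os.path.dirname(self.path) or ".", exist_ok=True)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.records, f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.json")

    store = TinyPersistentStore(path)
    store.add([0.1, 0.2], "first record")
    store.save()

    reloaded = TinyPersistentStore(path)  # same path, so the data comes back
    reloaded.add([0.3, 0.4], "second record")
    reloaded.save()

    final_count = len(TinyPersistentStore(path).records)

print(final_count)
```

The key point is that loading and adding only work when the same path is used every time; a different path silently starts a new, empty store.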

In [13]:
import chromadb
from langchain_community.vectorstores import Chroma

### Creating a Chroma Vector Store

In [14]:
from langchain_community.vectorstores import Chroma
db = Chroma.from_documents(docs, embedding_function, persist_directory="output/elon_muskdb")

### Loading the db

In [15]:
loaded_db = Chroma(persist_directory = "output/elon_muskdb", embedding_function = embedding_function)

### Querying the DBs

In [16]:
matched_docs = loaded_db.similarity_search(query = query, k = 5)
matched_docs

[Document(page_content="Elon Musk's paternal great-grandmother was a Dutchwoman descended from the Dutch Free Burghers, while one of his maternal great-grandparents came from Switzerland. His paternal grandmother was English from Liverpool and his paternal grandfather Walter Henry J. Musk was South African. Elon Musk's father, Errol Musk, is a South African former electrical and mechanical engineer consultant and", metadata={'source': 'https://en.wikipedia.org/wiki/Musk_family', 'summary': 'The Musk family is a wealthy family of South African origin that is largely active in the United States and Canada. The Musks are of English, Anglo-Canadian, Pennsylvania Dutch, and Swiss descent. The family is known for its entrepreneurial endeavours. Elon Musk was formerly the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index.', 'title': 'Musk family'}),
 Document(page_content="Elon Musk's paternal great

### Adding a new record to the existing vector store

In [17]:
family_data_loader = WikipediaLoader(query='Musk Family', load_max_docs=1)
family_documents = family_data_loader.load()
family_documents

[Document(page_content='The Musk family is a wealthy family of South African origin that is largely active in the United States and Canada. The Musks are of English, Anglo-Canadian, Pennsylvania Dutch, and Swiss descent. The family is known for its entrepreneurial endeavours. Elon Musk was formerly the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index.\n\n\n== History ==\nElon Musk\'s paternal great-grandmother was a Dutchwoman descended from the Dutch Free Burghers, while one of his maternal great-grandparents came from Switzerland. His paternal grandmother was English from Liverpool and his paternal grandfather Walter Henry J. Musk was South African. Elon Musk\'s father, Errol Musk, is a South African former electrical and mechanical engineer consultant and property developer, who was involved in the emerald business at some point in the 1980s, and was a member of the South African Progress

### Using the existing text splitter

In [18]:
family_docs = text_splitter.split_documents(documents=family_documents)
print(len(family_docs))
family_docs

11


[Document(page_content='The Musk family is a wealthy family of South African origin that is largely active in the United States and Canada. The Musks are of English, Anglo-Canadian, Pennsylvania Dutch, and Swiss descent. The family is known for its entrepreneurial endeavours. Elon Musk was formerly the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the', metadata={'title': 'Musk family', 'summary': 'The Musk family is a wealthy family of South African origin that is largely active in the United States and Canada. The Musks are of English, Anglo-Canadian, Pennsylvania Dutch, and Swiss descent. The family is known for its entrepreneurial endeavours. Elon Musk was formerly the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index.', 'source': 'https://en.wikipedia.org/wiki/Musk_family'}),
 Document(page_content='in the world, with an e

### Using the same embedding function

In [19]:
db = Chroma.from_documents(
    family_docs, # The new docs that we want to add
    embedding_function, # Must be the same embedding function used to build the store
    persist_directory="output/elon_muskdb" # Same directory as the existing vector store created above
)

### Getting the matching documents with the query

In [20]:
matched_docs = db.similarity_search(query=query, k=5)

['errol musk' in doc.page_content.lower() for doc in matched_docs]

[True, True, True, True, True]

## Retrievers

A retriever can be created directly from a vector store.

We can also define how the vector store should search and how many items it should return.

In [22]:
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceBgeEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000021CB7DBB190>)

### Querying a retriever

In [23]:
query = "Who is elon musk's father?"
matched_docs = retriever.get_relevant_documents(query = query)
matched_docs

[Document(page_content="Elon Musk's paternal great-grandmother was a Dutchwoman descended from the Dutch Free Burghers, while one of his maternal great-grandparents came from Switzerland. His paternal grandmother was English from Liverpool and his paternal grandfather Walter Henry J. Musk was South African. Elon Musk's father, Errol Musk, is a South African former electrical and mechanical engineer consultant and", metadata={'source': 'https://en.wikipedia.org/wiki/Musk_family', 'summary': 'The Musk family is a wealthy family of South African origin that is largely active in the United States and Canada. The Musks are of English, Anglo-Canadian, Pennsylvania Dutch, and Swiss descent. The family is known for its entrepreneurial endeavours. Elon Musk was formerly the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index.', 'title': 'Musk family'}),
 Document(page_content="Elon Musk's paternal great

### How retrievers should search and how many items to retrieve
### MMR - Maximal Marginal Relevance (relevance and diversity)

In [44]:
retriever = db.as_retriever(search_type='mmr', search_kwargs={"k": 1})

matched_docs = retriever.get_relevant_documents(query=query)

matched_docs

[Document(page_content="In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets, serving as its CEO. In 1986, he helped develop the visual effects industry by funding the computer", metadata={'source': 'https://en.wikipedia.org/wiki/Steve_Jobs', 'summary': 'Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology giant Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer revolution of the 1970s and 1980s, along with his early business partner and fellow Apple co-founder Steve Wozniak.\nJobs was born in San Francisco in 1955 and adopted shortly afterwards. He attended Reed Colleg
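
To show what the `mmr` search type trades off, here is a minimal sketch of maximal marginal relevance over made-up two-dimensional vectors. The function name `mmr` and the weighting parameter `lambda_mult` are choices made for this illustration, and the code is a textbook-style version of the idea, not LangChain's implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance to the
    query against similarity to documents already picked."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 1.0]
doc_vecs = [
    [1.0, 0.5],   # most relevant
    [1.0, 0.45],  # near-duplicate of the first document
    [0.4, 1.0],   # slightly less relevant, but different
]

by_similarity = sorted(
    range(len(doc_vecs)),
    key=lambda i: cosine(query_vec, doc_vecs[i]),
    reverse=True,
)[:2]
by_mmr = mmr(query_vec, doc_vecs, k=2)
print(by_similarity, by_mmr)
```

Plain similarity returns the two near-duplicates, while MMR keeps the best match and then prefers the document that adds something new.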

### Similarity Score threshold

In [24]:
retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5, "k":2})

matched_docs = retriever.get_relevant_documents(query=query)

matched_docs

[Document(page_content="Elon Musk's paternal great-grandmother was a Dutchwoman descended from the Dutch Free Burghers, while one of his maternal great-grandparents came from Switzerland. His paternal grandmother was English from Liverpool and his paternal grandfather Walter Henry J. Musk was South African. Elon Musk's father, Errol Musk, is a South African former electrical and mechanical engineer consultant and", metadata={'source': 'https://en.wikipedia.org/wiki/Musk_family', 'summary': 'The Musk family is a wealthy family of South African origin that is largely active in the United States and Canada. The Musks are of English, Anglo-Canadian, Pennsylvania Dutch, and Swiss descent. The family is known for its entrepreneurial endeavours. Elon Musk was formerly the wealthiest person in the world, with an estimated net worth of US$232 billion as of December 2023, according to the Bloomberg Billionaires Index.', 'title': 'Musk family'}),
 Document(page_content="Elon Musk's paternal great
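
The effect of `score_threshold` and `k` together can be sketched as a filter followed by a cut-off. The helper `filter_by_threshold` and the scores below are invented for illustration, assuming relevance scores between 0 and 1 where higher means more similar.

```python
def filter_by_threshold(scored_docs, score_threshold=0.5, k=2):
    """Keep (doc, score) pairs at or above the threshold,
    then return at most k of them, best first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


scored = [("a", 0.9), ("b", 0.4), ("c", 0.7), ("d", 0.55)]
print(filter_by_threshold(scored, score_threshold=0.5, k=2))
print(filter_by_threshold(scored, score_threshold=0.95, k=2))
```

Unlike plain top-k search, a threshold can return fewer than `k` results, or none at all, when nothing in the store is close enough to the query.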

#### BM25 Retriever (similar to keyword search)

In [25]:
!pip install rank_bm25




In [25]:
from langchain_community.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(docs)

In [26]:
matched_docs = bm25_retriever.get_relevant_documents('Musk')
matched_docs

[Document(page_content="October 17. Weeks before the trial was set to begin, Musk reversed course, announcing that he would move forward with the acquisition. The deal was closed on October 27, with Musk immediately becoming Twitter's new owner and CEO. Twitter was taken private and merged into a new parent company named X Corp. Musk promptly fired several top executives, including previous CEO Parag Agrawal. Musk has", metadata={'title': 'Acquisition of Twitter by Elon Musk', 'summary': 'Business magnate Elon Musk initiated an acquisition of American social media company Twitter, Inc. on April 14, 2022, and concluded it on October 27, 2022. Musk had begun buying shares of the company in January 2022, becoming its largest shareholder by April with a 9.1 percent ownership stake. Twitter invited Musk to join its board of directors, an offer he initially accepted before declining. On April 14, Musk made an unsolicited offer to purchase the company, to which Twitter\'s board responded with
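
For intuition about why BM25 behaves like keyword search, here is a minimal sketch of Okapi BM25 scoring over a tiny invented corpus. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant; the `rank_bm25` package may differ in these details, so treat this as an illustration of the formula rather than a reimplementation of the library.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in corpus against the query with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


corpus = [
    "musk bought twitter",
    "musk founded spacex",
    "jobs founded apple",
]
twitter_scores = bm25_scores("twitter", corpus)
musk_scores = bm25_scores("musk", corpus)
print(twitter_scores)
print(musk_scores)
```

Only documents that literally contain a query term get a non-zero score, which is why a BM25 retriever complements embedding-based search rather than replacing it.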