<a href="https://colab.research.google.com/github/mahesh-from-sirsi/All_My_AI_Work/blob/main/MaheshVShet_BuildFastWithAI_Module2_Exercises_Create_a_VectorStore_DB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Exercise 1: Create a VectorStore DB
Load a document (PDF or TXT), split its text into chunks, embed the chunks, and store the resulting vectors in ChromaDB or FAISS for efficient retrieval.





## STEPS FOR STORING DATA IN THE VECTOR STORE FOR RAG:

### Step1: Import all needed libraries

### Step2: Define all the necessary methods

### Step3: Initialization and model definition

### Step4: Load the text that you want to chunk

### Step5: Convert the text into multiple chunks

### Step6: Initialize the embedding model

### Step7: Create vectors of all the chunks

### Step8: Initialize the Vector Database

### Step9: Store the Data in the Vector Database

#### Hint:


In [2]:
!pip install -qU langchain langchain-openai langchain-pinecone langchain_community pypdf chromadb

In [25]:
# STEP 1: IMPORT ALL THE REQUIRED MODULES

import os                                                          # To store secrets as environment variables
import numpy as np                                                 # For cosine similarity
from langchain_openai import ChatOpenAI, OpenAIEmbeddings          # Chat model (LLM) and embedding model
from langchain_community.document_loaders import PyPDFLoader       # Loader for reading PDF files
#from langchain.text_splitter import CharacterTextSplitter         # Alternative splitter (not used here)
from langchain.text_splitter import RecursiveCharacterTextSplitter # Splits the text into chunks
from langchain_community.vectorstores import Chroma                # Interface to the ChromaDB vector store
from langchain.chains import RetrievalQA                           # Embeds the user query, retrieves the top-k matching chunks, and passes them to the LLM


In [4]:
# STEP2 - DEFINE ALL NECESSARY METHODS
## Function for converting text to embedding
def embed_text(text):
    text_vector = embed_model.embed_query(text)
    return text_vector


# The formula for cosine similarity is:
# cosine_similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
#
# The result is a value between -1 and 1 indicating how similar the two
# vectors are in direction (1 means the same direction, 0 means orthogonal,
# -1 means opposite directions).

def cosine_similarity(statement1, statement2):
    vec1 = embed_text(statement1)
    vec2 = embed_text(statement2)
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity
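
The cosine similarity formula can be checked without an API key by applying it to plain vectors. The following is a minimal sketch using only the standard library; the helper name `cosine_from_vectors` and the sample vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_from_vectors(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))       # dot product A . B
    norm_a = math.sqrt(sum(x * x for x in a))    # Euclidean norm of A
    norm_b = math.sqrt(sum(y * y for y in b))    # Euclidean norm of B
    return dot / (norm_a * norm_b)

# Illustrative vectors: same direction, orthogonal, and opposite.
same = cosine_from_vectors([1, 2, 3], [2, 4, 6])
orthogonal = cosine_from_vectors([1, 0], [0, 1])
opposite = cosine_from_vectors([1, 0], [-1, 0])
print(same, orthogonal, opposite)
```

The three results should be close to 1, 0, and -1 respectively, matching the interpretation given in the comment above.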

In [19]:
# STEP3 - ALL INITIALIZATION, MODEL CREATION

# Set the API key needed to call the OpenAI models (left blank here; supply your own key)
os.environ["OPENAI_API_KEY"] = ""

# Create the embedding model
embed_model = OpenAIEmbeddings(model = "text-embedding-3-small")

# Create the text splitter that breaks the text into chunks for better processing
# text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)  # Alternative splitter with the same chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

In [20]:
# STEP4 - Read the text you want to store in the Vector Store (Load a document (TXT or PDF))
loader = PyPDFLoader("/content/MaheshVShet-CV.pdf")  # Use PyPDFLoader for PDFs

# Read the pages of the PDF file
documents = loader.load() # documents is now a list of LangChain Document objects, each containing page_content + metadata (like page number).

# Split the loaded pages into smaller chunks
docs = text_splitter.split_documents(documents)
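
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks before falling back to smaller units); the function `chunk_with_overlap` and the sample string are hypothetical and only illustrate how consecutive chunks share characters.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap            # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):      # last window reached the end
            break
    return chunks

# Small illustrative input so the overlap is easy to see.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk begins with the last three characters of the previous one, which is the role the overlap plays in the notebook: context that straddles a chunk boundary is still present in at least one chunk.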

In [21]:
# Create the Vector Store
vectorstore = Chroma.from_documents(docs, embed_model, persist_directory="chroma_db")

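# Persist the database (with chromadb 0.4 and later, data is written to
# persist_directory automatically, so this call only matters on older versions)
vectorstore.persist()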

In [26]:
# Search the contents of the vector store

# RetrievalQA is a LangChain abstraction that ties together a retriever
# (such as one backed by ChromaDB) with a language model (LLM) so you can ask
# natural-language questions over your documents.

# Without RetrievalQA, you would have to handle retrieval, formatting, and
# the QA pipeline logic yourself:
# - Take the user's query and embed it.
# - Search the vector DB (e.g., Chroma) to get the top-k chunks.
# - Concatenate those chunks into a context.
# - Pass that context and the query to the LLM manually.
# - Parse and return the response.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=retriever,
)

query = "Summarize the PDF in 3 bullet points."
print(qa_chain.run(query))


- Successful management of Development and QE teams to ensure collaboration and continuous improvement culture
- Ensured 100% on-time delivery of endpoint security solutions without compromising quality or security standards
- Published research paper on "Usage of Jumbo Frames in Virtualised Environments," filed a patent for "Configuring a Virtual Machine Remotely" technology, and developed solutions improving virtual machine performance by 25% in enterprise environments
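
The manual steps listed in the comments of the retrieval cell (embed the query, find the top-k chunks, concatenate them, build the LLM input) can be sketched without any external service. Everything here is an assumption made for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the chunks are made-up sentences, and the final LLM call is left out, so the sketch stops at the prompt that would be sent.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_prompt(query, chunks, k=3):
    query_vec = toy_embed(query)                             # 1. embed the query
    ranked = sorted(chunks,
                    key=lambda c: cosine(query_vec, toy_embed(c)),
                    reverse=True)                            # 2. rank chunks by similarity
    context = "\n\n".join(ranked[:k])                        # 3. concatenate the top-k chunks
    # 4. combine context and query into the text an LLM would receive
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Made-up chunks standing in for the contents of a vector store.
chunks = [
    "Chroma stores embeddings on disk",
    "Cats sleep most of the day",
    "FAISS is a library for vector search",
]
prompt = build_prompt("where does chroma store embeddings", chunks, k=1)
print(prompt)
```

With `k=1`, only the chunk that shares words with the query ends up in the prompt. `RetrievalQA` wraps this same sequence and adds the LLM call and response handling.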


In [24]:
query = "How old is Mahesh V. Shet"
print(qa_chain.run(query))

Mahesh V. Shet has over 24 years of experience, and his role as Vice President of the Staff Welfare Association at ISRO was from 1999 to 2003. Based on this information, it can be inferred that Mahesh V. Shet is likely in his mid-40s to early 50s.


### Exercise 2 : Response to User Query
After storing a document in a vector database (e.g., ChromaDB or FAISS), write a function that takes a user's query as input and returns a relevant response based on the stored content.





#### Hint:


In [None]:
# Step 2: Define a function to query the database
def get_response(user_query, vector_db, k=3):
    # Search for similar documents/chunks
    relevant_docs = vector_db.similarity_search(user_query, k=k)

    # Extract and return the content of the relevant documents
    return [doc.page_content for doc in relevant_docs]

# Step 3: Test the function with a user query
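
One way to exercise `get_response` without an API key is to pass it a stand-in object that exposes the same `similarity_search(query, k)` method. The sketch below assumes hypothetical `FakeDoc` and `FakeVectorDB` classes that rank texts by shared words; they are not part of LangChain, and the sample texts are made up.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (only page_content)."""
    page_content: str

class FakeVectorDB:
    """Hypothetical stand-in exposing similarity_search(query, k)."""
    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=3):
        words = set(query.lower().split())
        # Rank texts by the number of words shared with the query.
        ranked = sorted(self.texts,
                        key=lambda t: len(words & set(t.lower().split())),
                        reverse=True)
        return [FakeDoc(t) for t in ranked[:k]]

def get_response(user_query, vector_db, k=3):
    # Same logic as the function defined above.
    relevant_docs = vector_db.similarity_search(user_query, k=k)
    return [doc.page_content for doc in relevant_docs]

db = FakeVectorDB([
    "vector stores hold embeddings",
    "bananas are yellow",
    "embeddings map text to vectors",
])
results = get_response("how do embeddings work", db, k=2)
print(results)
# → ['vector stores hold embeddings', 'embeddings map text to vectors']
```

With the real store from Exercise 1, the equivalent call would be `get_response(query, vectorstore)`, which requires a valid `OPENAI_API_KEY` because the query has to be embedded first.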