<a href="https://colab.research.google.com/github/naveedkhalid091/Learn_Agentic_AI/blob/main/step02_generative_ai_for_beginners/02(b)_updated_RAG_implementation_with_PineconeDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Implementation of RAG projects:**

For RAG projects in LangChain, you need to store and retrieve your data.

You need to set up the **following environment** in your project.

1. Install LangChain in your project; it gives you the flexibility to switch between chat models.
2. A database for data storage, and its access key.  
3. An embedding model to vectorize your data.
4. An LLM for conversations, and its access key.

Let's install the above environment first.

In [None]:
!pip install -U -q langchain

In [None]:
!pip install -U -q langchain-pinecone

In [None]:
!pip install -U -q langchain_google_genai

## **Getting Access to `PINECONE` & `GEMINI` using API Keys:**

In [None]:
from google.colab import userdata
import os
os.environ['PINECONE_API_KEY'] = userdata.get('PINECONE_API_KEY') # Access key for the Pinecone database
os.environ['GOOGLE_API_KEY']=userdata.get('GOOGLE_API_KEY') # Access key for Gemini
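
`google.colab.userdata` only exists inside Colab. Below is a minimal sketch of a fallback for other environments, assuming the keys are supplied as ordinary environment variables. The helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and used only for illustration.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value

# Demonstration with a placeholder value (not a real key)
os.environ["DEMO_API_KEY"] = "placeholder-value"
print(require_env("DEMO_API_KEY"))
```

Failing early with a named variable tends to be easier to debug than an authentication error raised later by the client.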

## **Initialization of Pinecone client:**

Initializing the `Pinecone client (pc)` is an important step, because this client lets you perform operations such as creating indexes, inserting vectors, and executing queries.

In [None]:
from pinecone import Pinecone
pc=Pinecone()

## **Create an Index in PINECONE using above client**.

**Optionally, check the existing index names with the code below to avoid duplicates:**

* **i)** First, check whether an index with the chosen name already exists.

* **ii)** Then create the index only if no index with that name exists.

In [None]:
# i) Collect the names of the indexes that already exist

existing_indexes=[]

checking_db_indexes=pc.list_indexes()
print(existing_indexes)

for info_index in checking_db_indexes:
  existing_indexes.append(info_index.name)

print(existing_indexes)

[]
['online-rag-project', 'my-9th-chem-book', 'second-rag-project', 'family-structure']


In [None]:
# ii) Creation of Index
from pinecone import ServerlessSpec


index_name="my-books"


if index_name in existing_indexes:
  print(f"Index {index_name} already exists")
else:
  # Proceed with index creation
  pc.create_index(
    name=index_name,
    dimension=768,  # must match the vector size produced by the embedding model
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
  )
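
The two steps above (check, then create) can also be wrapped in one helper. This is only a sketch: `ensure_index` and `FakeClient` are hypothetical names, and the fake client stands in for `pc` so the logic can be tried without touching Pinecone. It relies only on the `list_indexes()` and `create_index()` calls already used in this notebook.

```python
from types import SimpleNamespace

def ensure_index(client, name: str, **create_kwargs) -> bool:
    """Create the index only if no index with this name exists.

    Returns True when a new index was created, False otherwise.
    """
    existing = {info.name for info in client.list_indexes()}
    if name in existing:
        return False
    client.create_index(name=name, **create_kwargs)
    return True

class FakeClient:
    """Stand-in for the Pinecone client, for a dry run only."""
    def __init__(self, names):
        self._names = list(names)

    def list_indexes(self):
        return [SimpleNamespace(name=n) for n in self._names]

    def create_index(self, name, **kwargs):
        self._names.append(name)

fake = FakeClient(["online-rag-project"])
print(ensure_index(fake, "my-books", dimension=768, metric="cosine"))  # first call creates the index
print(ensure_index(fake, "my-books", dimension=768, metric="cosine"))  # second call finds it and does nothing
```

With the real client you would pass `pc` and the same keyword arguments as in the cell above, including `spec`.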

## **Accessing the above created Index:**

- Accessing the index lets us insert the vectors (embedded data), using the line below.

In [None]:
index=pc.Index(index_name)

**Note:** The Pinecone database setup is now complete. Next, you need to set up an embedding model for vectorization, and chunk your data where needed.

You can also verify the created index by signing in to Pinecone and navigating to **`Database->Indexes`**.   

## **Select Embedding model:**

This model vectorizes your data (converts it into numbers) so that it is ready to be entered into the Pinecone database through the **`index`** variable above.

The embedding model can be imported from the `langchain_google_genai` library as below:   

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model=GoogleGenerativeAIEmbeddings(model="models/embedding-001")
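
To make "vectorized" more concrete: the index was created with `metric="cosine"`, so stored vectors are compared by the angle between them. Below is a minimal sketch with made-up 3-dimensional vectors standing in for real embeddings. The numbers are invented for illustration and are not output of the embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values, for illustration only)
atom = [0.9, 0.1, 0.0]
molecule = [0.8, 0.3, 0.1]
pesticide = [0.0, 0.2, 0.9]

print(cosine_similarity(atom, molecule))   # close to 1: vectors point the same way
print(cosine_similarity(atom, pesticide))  # close to 0: vectors are nearly unrelated
```

A similarity search returns the stored vectors with the highest such score against the query vector.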

**At this stage, the database is set up and the embedding model is selected for vectorizing the data. Finally, the vectorized data will be entered into the Pinecone database.**

The data to be vectorized consists of either simple `text`, a `small file`, or a `large file`.

**`Simple text`** and a **`small file`** will not be chunked, but **`large files`** will first go through the chunking process, and vectorization is done after chunking.  

Let's run through all the possibilities one by one.

## **Import PineconeVectorStore**:

- **`PineconeVectorStore`** is a LangChain integration class that **embeds your files automatically** before storing them in the vector database. It also simplifies `storing` and `retrieving` vector embeddings (the text of your file).  

- However, you can **convert text into vectors** manually through the `embed_query` method as follows:

 `vector_text=embedding_model.embed_query("Hello, I am Naveed")`
  `print(vector_text)`

But the vector store eliminates this manual effort:

In [None]:
from langchain_pinecone import PineconeVectorStore

## create a vector store client
vector_store=PineconeVectorStore(index=index, embedding=embedding_model)

**Where:**
- The `index` parameter tells the vector store where to store and retrieve the vector embeddings.
- The `embedding` parameter defines how the textual data is converted into vectors.

## 1. **Prepare Documents for the upload**

- 1. If your document is small and doesn't require chunking:
 - Import `Document` from `langchain_core.documents` if you only want to upload text or a small document that doesn't need chunking.
 - Create a `Document` object that holds the text you want to store (`page_content`) and, optionally, details such as the file path and title (`metadata`).
 - Rather than writing manual IDs for each document, you can import the `uuid` library to generate random, unique IDs.

The relevant code for these steps is given below:    

In [60]:
!pip install -U -q pymupdf4llm

In [64]:
# convert PDF into markdown

import pymupdf4llm

pdf_path= "/content/Chemistry 10.pdf" # Colab deletes this file when the session ends, so upload it again each time
md_text=pymupdf4llm.to_markdown(pdf_path)

Processing /content/Chemistry 10.pdf...
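
Because the uploaded PDF disappears when the Colab session ends, it can help to check for the file before converting it. Below is a small sketch of such a check. The helper name `require_file` is hypothetical, and the path in the demonstration is deliberately one that does not exist.

```python
from pathlib import Path

def require_file(path: str) -> Path:
    """Return the path as a Path object, or fail early if the file is missing."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{path} not found; upload it to the session first")
    return p

# Demonstration with a path that is assumed not to exist
try:
    require_file("/content/does-not-exist.pdf")
except FileNotFoundError as err:
    print(err)
```

Calling `require_file(pdf_path)` before `pymupdf4llm.to_markdown(pdf_path)` would surface a missing upload with a clear message.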


In [70]:
# Split markdown text into Chunks

from langchain.text_splitter import MarkdownTextSplitter

# Initialize the text splitter
# Note: chunk_size is counted in characters, so 40 produces very short fragments
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

# Split the Markdown text into documents
documents = splitter.create_documents([md_text])
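
To see what `chunk_size` and `chunk_overlap` control, here is a simplified character-window chunker. It is only a sketch: `MarkdownTextSplitter` splits on Markdown-aware separators rather than at fixed positions, so its real output differs, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

sample = "Atoms are the basic building blocks of matter."
for piece in chunk_text(sample, chunk_size=20, chunk_overlap=5):
    print(repr(piece))
```

Larger chunks keep more context together, while overlap reduces the chance that a sentence is cut off exactly at a chunk boundary.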

## The code below is only needed when you have a single document file or piece of text.

In [68]:


# from langchain_core.documents import Document

# document_1=Document(
   # page_content="Chemistry book",
    # metadata={"/content/Chemistry 10.pdf":"Chemistry Book"} # path of the file and its title in dictionary
# )


# Put all the documents into a list, because the add step below expects a list of documents
# documents=[document_1]

In [69]:
len(documents)

1

## Add documents to the database after importing the `uuid` library

In [71]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

['36ad2645-0777-4027-84bd-b97e8645f345',
 '8b9bce4f-b5d8-4bf9-96d8-46ff3a0954c1',
 'a8f662a1-8f3f-4d06-86ba-fa619372d20c',
 'c1c562f6-f525-450d-9fe4-816b6e902385',
 '74232f4a-fba4-40c2-9c28-59dae01fa86a',
 'aac19f00-4b2f-451f-b82d-0f593bf910bf',
 '61bf6aba-6679-4ff6-b292-998919f3c6aa',
 'a6092253-873d-4c16-a3de-853c00871056',
 'ffe48725-bc88-461d-8c7e-32b4fa3e5c57',
 '9a9c149d-a7f8-407a-bfe6-8bee08c931bc',
 '8fe9b795-8fb5-4f8f-bbf0-66fb411eac2f',
 '1c075502-7a40-416a-b839-1ddf53dcffb1',
 'af1b80c9-6862-4c43-a5b5-5348f896d2f2',
 '84dd252c-2982-4fc1-b47e-fb008ff8fe20',
 '496edf66-8572-460b-b5b5-67e327d1743b',
 'ad1bc117-d26f-416f-b7e9-c5dabcf91ebf',
 '0c580189-8bf7-49fb-a5c5-ed3b8f879a01',
 'ab2e6436-5b6f-4714-acc7-127284262131',
 '2892e9d2-4acd-40a8-bd2d-4a203976713a',
 '18fe0e80-449f-40a7-8349-4e64f1013ac6',
 '1e7f8486-956c-408d-8980-fd240842a4f5',
 'c843a7b6-c624-427c-a5a7-c552df977401',
 'f3a40943-c2cc-4ead-8208-0e2d06e38c76',
 '6603c50f-30fe-4143-8f6b-b8c24bb28254',
 '077e6051-f8d3-
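
Because `uuid4` IDs are random, running the upload cell again stores the same chunks a second time under new IDs. One possible alternative (not the approach used in this notebook) is to derive each ID from the chunk itself with `uuid5`, so a re-run produces the same IDs. The helper name `stable_ids` and the sample chunks are hypothetical.

```python
import uuid

def stable_ids(source: str, chunks):
    """Derive a repeatable ID for each chunk from its source, position, and text."""
    return [
        str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#{i}:{text}"))
        for i, text in enumerate(chunks)
    ]

# Sample chunks invented for illustration
chunks = ["Atoms are small.", "Molecules contain atoms."]
first_run = stable_ids("Chemistry 10.pdf", chunks)
second_run = stable_ids("Chemistry 10.pdf", chunks)
print(first_run == second_run)  # the same input gives the same IDs
```

With real documents, the list passed to `ids=` could be built as `stable_ids(pdf_path, [d.page_content for d in documents])`.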

## **Querying from LLM regarding the document uploaded to the vector database**:

We cannot ask the large language model (LLM) to read from the vector database directly. Instead, we follow these steps:
 - 1. **Fetch relevant data from the database:** First, we use the `similarity_search` method to find the data in the vector database that is most relevant to the user's query.
 - 2. **Combine query and data:** Next, we create a function that combines the user's query with the relevant data fetched from the database. This tells the LLM what information it should answer from.
 - 3. **Send to LLM:** Finally, we send the combined text (the query and the relevant database information) to the LLM, which responds based on it.  

This approach lets the LLM ground its response in both the query and the relevant data retrieved from the database.
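
Step 2 can be sketched as a small pure function that turns a query and a list of chunk texts into one prompt string. The function name `build_prompt` and the prompt wording are assumptions made for illustration, not a fixed format required by LangChain or Gemini.

```python
def build_prompt(query: str, chunks) -> str:
    """Combine the user's query with the retrieved chunk texts into one prompt."""
    # Number the chunks so they are easy to tell apart in the prompt
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Example chunk invented for illustration
print(build_prompt("What is an atom?", ["Atoms are the basic units of matter."]))
```

Keeping this step separate makes it easy to print and inspect exactly what the LLM receives.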

In [73]:
## similarity search

query= "chepters"

db_result=vector_store.similarity_search(
    query,
    k=10,
    )

for res in db_result:
  print(f"content:{res.page_content}\nMetadata:{res.metadata}\n")

content:_Animation 10.2: Chemanim_
Metadata:{}

content:ty/mbwells/ftp/courses/chem228/chapterno
Metadata:{}

content:beings for drinking purpose. These
Metadata:{}

content:by a CH group but they have similar
Metadata:{}

content:/Knoppix/ESPERE/ESPEREdez05/ESPEREde/www
Metadata:{}

content:/Knoppix/ESPERE/ESPEREdez05/ESPEREde/www
Metadata:{}

content:in almost all fields of industry. They
Metadata:{}

content:attached to automobile exhausts. When
Metadata:{}

content:kill or control the growth of pests.
Metadata:{}

content:chlorides of calcium and magnesium.
Metadata:{}



In [76]:
# llm calling

from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-exp",
    temperature=0.9,
    max_tokens=1000,
    # other params...
)

## Define a function that returns the final answer from the LLM:

In [77]:
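# Retrieve relevant chunks, then ask the LLM to answer from them
def answer_to_user(query: str):
  # Vector search: fetch the 2 chunks most similar to the query
  vector_results = vector_store.similarity_search(query, k=2)
  # Pass only the chunk text to the LLM, not the repr of the Document objects
  context = "\n\n".join(doc.page_content for doc in vector_results)
  # Invoke the LLM with the query and the retrieved context
  final_answer = llm.invoke(f'''Answer this user query: {query}\nHere are some references to answer it:\n{context}''')
  return final_answer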

In [82]:
# calling function with query

answer=answer_to_user("What do you know about the Structures of atoms?, Don't respond from outside the book")
print(answer.content)

Based on the provided documents, here's what I can tell you about the structures of atoms:

*   The documents mention "atoms" and that they can be part of "rings".

That's all the information I can provide about the structures of atoms based solely on these documents.
