<a href="https://colab.research.google.com/github/naveedkhalid091/Learn_Agentic_AI/blob/main/step02_generative_ai_for_beginners/02(b)_RAG_implementation_with_PineconeDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Implementation of RAG projects:**

For RAG projects in LangChain, you need to store and retrieve your data.

You need to set up the **following environment** in your project:

1. LangChain, which makes it easy to switch between chat models.
2. A database for data storage, and its access key.
3. An embedding model to vectorize your data.
4. An LLM for conversations, and its access key.

Let's install this environment first.

In [None]:
!pip install -U -q langchain

In [None]:
!pip install -U -q langchain-pinecone

In [None]:
!pip install -U -q langchain_google_genai

## **Getting Access to `PINECONE` & `GEMINI` Using API Keys:**

In [None]:
from google.colab import userdata
import os
os.environ['PINECONE_API_KEY'] = userdata.get('PINECONE_API_KEY')  # API key for the Pinecone database
os.environ['GOOGLE_API_KEY'] = userdata.get('GOOGLE_API_KEY')  # API key for Gemini
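
The cell above relies on `google.colab.userdata`, which is only available inside Colab. As a minimal sketch, assuming the same two keys are exported as ordinary environment variables when running elsewhere, the check below reports which keys are still missing before any client is created. `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced for illustration, not part of any library.

```python
import os

# Key names taken from the cell above
REQUIRED_KEYS = ("PINECONE_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the names of required keys that are absent or empty in env."""
    return [name for name in required if not env.get(name)]


# Demonstration with a plain dict standing in for os.environ
demo_env = {"PINECONE_API_KEY": "dummy-value"}
print(missing_keys(demo_env))  # prints ['GOOGLE_API_KEY']
```

Calling `missing_keys()` with no arguments checks the real environment, which helps catch a forgotten key early instead of at the first API call.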

## **Initialization of Pinecone client:**

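Initializing the `Pinecone client (pc)` is an important step, because this client lets you perform operations such as creating indexes, inserting vectors, and executing queries. `Pinecone()` is called without arguments here, so it relies on the `PINECONE_API_KEY` environment variable set above.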

In [None]:
from pinecone import Pinecone
pc=Pinecone()

## **Create an Index in PINECONE using the above client**

**You can optionally check the existing index names with the code below to prevent duplicates:**

* **i)** First, check whether an index with the chosen name already exists.

* **ii)** Then, create the index only if no index with that name exists.

In [None]:
# i) Check whether an index with this name already exists

# Collect the names of all indexes in the Pinecone project
existing_indexes = [info_index.name for info_index in pc.list_indexes()]

print(existing_indexes)

['online-rag-project', 'second-rag-project']


In [None]:
# ii) Creation of Index
from pinecone import ServerlessSpec


index_name="my-books"


if index_name in existing_indexes:
  print(f"Index {index_name} already exists")
else:
  # Create the index; the dimension must match the embedding model's output size
  pc.create_index(
    name=index_name,
    dimension=768,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
  )
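
The index above is created with `metric="cosine"`, so stored vectors are ranked by how closely their direction matches the query vector. As a rough illustration of that metric only (not of Pinecone's internal implementation), here is a plain-Python sketch; `cosine_similarity` is a hypothetical helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1.0 regardless of their length
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
# Perpendicular vectors score 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

The length check mirrors the reason `dimension=768` matters: every vector written to or queried against the index has to have the same number of components.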

## **Accessing the Index created above:**

- The line below gives us a handle to the index, which we will use to insert the vectors (embedded data).

In [None]:
index=pc.Index(index_name)

**Note:** The PINECONE database setup is now complete. Next, you need to set up the embedding model for vectorization and chunk your data.

You can also verify the created index by signing in to your Pinecone account and navigating to **`Database->Indexes`**.

## **Select Embedding model:**

This model vectorizes your data (converts it into numbers) so that it is ready to be inserted into the Pinecone database through the **`index`** variable above.

The embedding model can be imported from the `langchain_google_genai` library as shown below:

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model=GoogleGenerativeAIEmbeddings(model="models/embedding-001")

**At this stage, the database is set up and the embedding model is selected. The last step is to vectorize the data and store it in the Pinecone database.**

The data to be vectorized can be `simple text`, a `small file`, or a `large file`.

**`Simple text`** and **`small files`** are not chunked, but **`large files`** first go through chunking and are then vectorized.

Let's go through these possibilities one by one.

## **Import PineconeVectorStore**:

- **`PineconeVectorStore`** is a LangChain class that **embeds your documents automatically** before storing them in the vector database, and it simplifies `storing` and `retrieving` vector embeddings together with the text they represent.

- You can also **convert text into vectors** manually with the `embed_query` method, as follows:

 `vector_text=embedding_model.embed_query("Hello, I am Naveed")`
  `print(vector_text)`

The vector store removes the need for this manual step:






In [None]:
from langchain_pinecone import PineconeVectorStore

## create a vector store client
vector_store=PineconeVectorStore(index=index, embedding=embedding_model)

**Where:**
- The `index` parameter tells the vector store where to store and retrieve the vector embeddings.
- The `embedding` parameter defines how the textual data is converted into vectors.
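
To make the two roles concrete, here is a toy in-memory sketch. It is not how `PineconeVectorStore` is implemented; `ToyEmbedding` and `ToyVectorStore` are hypothetical classes that only borrow familiar method names (`embed_query`, `embed_documents`, `add_texts`, `similarity_search`), and the "embedding" is a simple word count over a tiny fixed vocabulary.

```python
import math


class ToyEmbedding:
    """Hypothetical stand-in for an embedding model: counts a few fixed words."""

    vocabulary = ("atom", "acid", "solution", "unit")

    def embed_query(self, text):
        words = text.lower().split()
        return [float(words.count(term)) for term in self.vocabulary]

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]


class ToyVectorStore:
    """Hypothetical in-memory store that mirrors the two constructor arguments."""

    def __init__(self, index, embedding):
        self.index = index          # where vectors are kept (here: a dict)
        self.embedding = embedding  # how text is turned into a vector

    def add_texts(self, texts, ids):
        vectors = self.embedding.embed_documents(texts)
        for doc_id, text, vector in zip(ids, texts, vectors):
            self.index[doc_id] = (vector, text)

    def similarity_search(self, query, k=1):
        query_vector = self.embedding.embed_query(query)

        def score(item):
            vector, _text = item
            dot = sum(a * b for a, b in zip(query_vector, vector))
            norm = math.sqrt(sum(a * a for a in query_vector)) * math.sqrt(
                sum(b * b for b in vector)
            )
            return dot / norm if norm else 0.0

        ranked = sorted(self.index.values(), key=score, reverse=True)
        return [text for _vector, text in ranked[:k]]


store = ToyVectorStore(index={}, embedding=ToyEmbedding())
store.add_texts(
    ["an atom has a nucleus", "an acid lowers the ph of a solution"],
    ids=["a", "b"],
)
print(store.similarity_search("what is an atom", k=1))
```

The same embedding object is used for both writing and querying, which is why the store needs it at construction time: vectors produced by different models would not be comparable.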

## **Prepare Documents for Upload**

- **1.** If your document is small, follow this practice:

  - Import `Document` from `langchain_core.documents`.
  - If you only want to upload text or a small document that does not need chunking:
    - Create a `Document` object that holds the text in `page_content` and details such as the file path and title in `metadata`.
    - Rather than writing IDs manually for each document, import the `uuid` library to generate a random, unique ID for each document.

- **2.** If your file is **large and requires chunking**, follow these steps:

  - Import the `pymupdf4llm` library, which converts your PDF file into Markdown text.
  - Split the Markdown text into smaller chunks with a splitter from `langchain.text_splitter`. The code in this notebook uses `RecursiveCharacterTextSplitter`; `MarkdownTextSplitter` is an alternative.
  - Add the chunked documents to the database, attaching a randomly generated `uuid` ID to each one.

The code for these steps is given below:

## 1. **If your document is `small` or `text only`, follow this practice**:




In [None]:


# from langchain_core.documents import Document

# document_1=Document(
   # page_content="Chemistry book",
    # metadata={"/content/Chemistry 10.pdf":"Chemistry Book"} # path of the file and its title in dictionary
# )


# Put all the documents into a list, because the next step adds them as a list
# documents=[document_1]

In [None]:
# Add documents into Database

# from uuid import uuid4

# uuids = [str(uuid4()) for _ in range(len(documents))]

# vector_store.add_documents(documents=documents, ids=uuids)

## 2. **If your file is `large` and requires chunking, follow the steps below**:

In [None]:
!pip install -U -q pymupdf4llm

In [None]:
# convert PDF into markdown

import pymupdf4llm

pdf_path = "/content/Chemistry 9.pdf"  # Colab deletes this file when the session ends, so upload it again each time
md_text=pymupdf4llm.to_markdown(pdf_path)

Processing /content/Chemistry 9.pdf...


In [None]:
print(md_text)

## CHEMISTRY
# 9

###### Publisher: CARAVAN BOOK HOUSE, LAHORE


-----

**All rights (Copy right etc.) are reserved with the publisher.**
Approved by the Federal Ministry of Education (Curriculum Wing), Islamabad, according to the
National Curriculum 2006 under the National Textbook and Learning Materials Policy 2007.
**N.O.C. F.2-2/2010-Chem. Dated 2-12-2010. This book has also been published by Punjab**
Textbook Board under a print licence arrangement for free distribution in all Government
School in Punjab. No part of this book can be copied in any form especially guides, help books
etc., without the written permission of the publisher.

|Col1|CONTENTS|Col3|
|---|---|---|
|Unit 1|Fundamentals of Chemistry|1|
|Unit 2|Structure of Atoms|27|
|Unit 3|Periodic Table and Periodicity of Properties|44|
|Unit 4|Structure of Molecules|58|
|Unit 5|Physical States of Matter|75|
|Unit 6|Solutions|96|
|Unit 7|Electrochemistry|113|
|Unit 8|Chemical Reactivity|138|


Authors:


###### Dr. Jaleel Ta

In [None]:
# Split markdown text into Chunks

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                          chunk_overlap=50,
                                          separators=["\n\n", "\n", " ", ""],
                                          )

# Split the Markdown text into documents
documents = splitter.create_documents([md_text])

In [None]:
len(documents)

773
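
The `chunk_size=500` and `chunk_overlap=50` settings above are measured in characters by default. The sketch below shows the basic idea of overlapping windows with a hypothetical `chunk_text` helper. It is a deliberate simplification: `RecursiveCharacterTextSplitter` additionally tries to break at the listed separators (paragraphs, lines, spaces) rather than at fixed offsets, so its chunks will differ from these.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for piece in pieces:
    print(len(piece), piece)
```

The overlap means a sentence cut at a chunk boundary still appears in part in the neighbouring chunk, at the cost of storing some text twice.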

## Add the documents to the database after importing the `uuid` library

In [None]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)

['5af9cbb1-69fc-4d1b-9cbf-90e6f1a5647f',
 '71439909-6775-4fb5-b908-84f7295c04db',
 '1ba46375-38ff-4569-b2cc-dd9c945dd1dc',
 'e7def21d-af28-4632-8d28-2cc5c349789d',
 '2f1006f4-b484-48d4-9a12-88ebfc8c4c76',
 '49e57266-94dc-452f-b973-406ac7dc7cd4',
 '28a4bc3c-eb12-4407-a367-1b6260c1adff',
 'ae45414a-5539-4e62-8746-808bf139d15b',
 '39cea203-d99a-4e7e-8670-f82275f7b68f',
 'e9f992b8-acc2-4341-8166-b8595fd8be58',
 'c4ab87b1-bb63-4254-a31d-dce9fafdacca',
 '32b7e1f3-f567-4a5f-a761-68d872315d78',
 '5c06512c-f0e4-496a-83e3-af4948d9a929',
 '9b748acf-90e4-4dc6-a796-9e1320800f01',
 '4fa36c6e-dcf3-4d27-b13f-e869510a3831',
 'b4d84ec9-d093-466f-81e2-d4049cd59e89',
 '815a0b64-06a0-45e3-ac23-ad111f5920ed',
 '51bab972-4b30-4e88-b3e8-e6bd317f78c0',
 '68c0a1f5-c3e0-43d3-a075-4fde63996c3a',
 '6e3a6201-07bd-4866-b67b-b4c2c98de263',
 '0e566db0-161d-42e3-9540-704d54df93aa',
 'f3d57f27-f507-4072-9f8b-a49e26703e2c',
 'dcdca5f5-cd3f-48b2-be07-855fb198c778',
 'b264e3fa-e81e-4568-b8ae-3ed05ea043d7',
 '83ff9eb2-1c64-
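
Because `uuid4()` produces new random IDs on every run, re-running the upload cell stores the same chunks again under different IDs. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to derive each ID from the chunk text with `uuid.uuid5`, so identical text always maps to the same ID. This helps only if the store overwrites records that share an ID. `content_id` is a hypothetical helper name.

```python
import uuid


def content_id(text, namespace=uuid.NAMESPACE_URL):
    """Derive a stable version-5 UUID from the chunk text."""
    return str(uuid.uuid5(namespace, text))


chunks = ["Unit 1 Fundamentals of Chemistry", "Unit 2 Structure of Atoms"]
ids = [content_id(chunk) for chunk in chunks]

print(ids[0] == content_id(chunks[0]))  # prints True: same text, same ID
print(ids[0] != ids[1])                 # prints True: different text, different ID
```

A trade-off to keep in mind: two chunks with exactly the same text would share one ID, so only one of them would be kept.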

## **Querying the LLM about the document uploaded to the vector database**:

The large language model (LLM) cannot read from the vector database directly. Instead, we follow these steps:
 1. **Fetch relevant data from the database:** First, we call `similarity_search` to find the chunks in the vector database that are most relevant to the user's query.
 2. **Combine query and data:** Next, we write a function that combines the user's query with the chunks fetched from the database, so the LLM receives the question together with its supporting context.
 3. **Send to the LLM:** Finally, we send the combined prompt (the query plus the retrieved chunks) to the LLM, which answers based on it.

This approach grounds the LLM's response in the content retrieved from the database instead of relying only on what the model already knows.
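
The three steps can be sketched independently of any particular database or model. In the example below, `build_prompt`, `answer`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the retriever and model are simple stand-ins so the flow can run without API keys.

```python
def build_prompt(query, passages):
    """Step 2: combine the user's query with the retrieved passages."""
    context = "\n\n".join(
        f"[{number}] {passage}" for number, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


def answer(query, retrieve, generate, k=2):
    passages = retrieve(query, k)           # step 1: fetch relevant data
    prompt = build_prompt(query, passages)  # step 2: combine query and data
    return generate(prompt)                 # step 3: send to the LLM


# Stand-ins for the vector store and the LLM (assumptions for this sketch)
def fake_retrieve(query, k):
    corpus = [
        "Unit 6 covers solutions.",
        "Unit 7 covers electrochemistry.",
        "Unit 8 covers chemical reactivity.",
    ]
    return corpus[:k]


def fake_generate(prompt):
    return f"(model reply to a prompt of {len(prompt)} characters)"


question = "Which unit covers solutions?"
print(build_prompt(question, fake_retrieve(question, 2)))
print(answer(question, fake_retrieve, fake_generate))
```

With the objects from this notebook, `retrieve` could wrap `vector_store.similarity_search(query, k=k)` and return each result's `page_content`, while `generate` could wrap `llm.invoke(prompt)` and return its `.content`.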

In [None]:
## similarity search

query= "units"

db_result=vector_store.similarity_search(
    query,
    k=10,
    )

for res in db_result:
  print(f"content:{res.page_content}\nMetadata:{res.metadata}\n")

content:express concentration of solutions. A few of these units are discussed here.
Metadata:{}

content:_atoms of that element as compared to 1/12 (one-twelfth) the mass of an atom of carbon-th_
_12 isotope (an element having different mass number but same atomic number). Based_
on carbon-12 standard, the mass of an atom of carbon is 12 units and l/2 of it comes to be th
1 unit. When we compare atomic masses of other elements with atomic mass of carbon12 atom, they are expressed as relative atomic masses of those elements. The unit for
Metadata:{}

content:8. Explain why are hydrogen and oxygen considered elements whereas water as a
compound.
9. What is the significance of the symbol of an element?
10. State the reasons: soft drink is a mixture and water is a compound.
11. Classify the following into element, compound or mixture:
i. He and H 2 ii. CO and Co iii. Water and milk
iv. Gold and brass v. Iron and steel
12. Define atomic mass unit. Why is it needed?
Metadata:{}

content:**C

In [None]:
# llm calling

from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash-exp",
    temperature=0.9,
    max_tokens=1000,
    # other params...
)

## Define a function that returns the final answer from the LLM:

In [None]:
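# Retrieve the relevant chunks, then ask the LLM to answer using them
def answer_to_user(query: str):
  # Vector search: fetch the 2 chunks most similar to the query
  vector_results = vector_store.similarity_search(query, k=2)
  # Invoke the LLM with the query and the retrieved chunks as reference
  final_answer = llm.invoke(f'''Answer this user query: {query}. Here are some references to answer it: {vector_results}''')
  return final_answer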

In [None]:
# calling function with query

answer=answer_to_user("How many chapters are available in the chemistry book? Tell me the names along with the chapter numbers.")
print(answer.content)

Based on the provided document, the chemistry book has **8 chapters**. Here are the chapter names along with their corresponding chapter numbers:

*   **Unit 1:** Fundamentals of Chemistry
*   **Unit 2:** Structure of Atoms
*   **Unit 3:** Periodic Table and Periodicity of Properties
*   **Unit 4:** Structure of Molecules
*   **Unit 5:** Physical States of Matter
*   **Unit 6:** Solutions
*   **Unit 7:** Electrochemistry
*   **Unit 8:** Chemical Reactivity


In [None]:
# calling function with query

answer=answer_to_user("make 10 multiple choice questions from chapter Fundamental of Chemistry? ")
print(answer.content)

Okay, here are 10 multiple-choice questions based on the "Fundamentals of Chemistry" chapter, keeping in mind the provided context suggests a basic introductory level (likely for 9th grade).

**Instructions:** Choose the best answer for each question.

**Questions:**

1.  **Which of the following is the fundamental building block of matter?**
    a) Molecule
    b) Compound
    c) Atom
    d) Mixture

2.  **The number of protons in an atom's nucleus determines its:**
    a) Mass number
    b) Atomic number
    c) Number of neutrons
    d) Number of electrons

3.  **Which of the following is NOT a state of matter?**
    a) Solid
    b) Liquid
    c) Gas
    d) Element

4. **A substance made up of only one type of atom is called a(n):**
    a) Compound
    b) Mixture
    c) Element
    d) Solution

5. **What is the smallest unit of a compound that can exist independently?**
    a) Atom
    b) Molecule
    c) Ion
    d) Proton

6. **Which of the following is an example of a physical prope