<a href="https://colab.research.google.com/github/naveedkhalid091/Learn_Agentic_AI/blob/main/step02_generative_ai_for_beginners/02(b)_updated_RAG_implementation_with_PineconeDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Implementation of RAG projects:**

For RAG projects in LangChain, you need to store and retrieve your data.

You need to set up the **following environment** in your project.

1. LangChain, which makes it easy to switch between chat models.
2. A database for data storage, and its access key.
3. An embedding model to vectorize your data.
4. An LLM for conversations, and its access key.

Let's install the above environment first.

In [1]:
!pip install -U -q langchain

In [2]:
!pip install -U -q langchain-pinecone


In [3]:
!pip install -U -q langchain_google_genai


## **Getting Access to `PINECONE` & `GEMINI` using API Keys:**

In [4]:
from google.colab import userdata
import os
os.environ['PINECONE_API_KEY'] = userdata.get('PINECONE_API_KEY') # Getting access of PINECONE Database
os.environ['GOOGLE_API_KEY']=userdata.get('GOOGLE_API_KEY') # Getting access of Gemini
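
A missing or empty key tends to surface later as a confusing authentication error. A minimal sketch of an early check, assuming the two variable names used above; `missing_keys` is a hypothetical helper, not part of any library, and the demo uses a stand-in dictionary instead of real secrets:

```python
import os

REQUIRED_KEYS = ["PINECONE_API_KEY", "GOOGLE_API_KEY"]

def missing_keys(required, env=os.environ):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]

# Demo with a stand-in dictionary so the sketch runs without real secrets.
demo_env = {"PINECONE_API_KEY": "pc-demo-key", "GOOGLE_API_KEY": ""}
print(missing_keys(REQUIRED_KEYS, demo_env))  # ['GOOGLE_API_KEY']
```

In the notebook, calling `missing_keys(REQUIRED_KEYS)` right after the cell above would check the real environment.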

## **Initialization of Pinecone client:**

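Initializing the Pinecone client (`pc`) is an important step, because this client lets you perform operations such as creating indexes, inserting vectors, and executing queries. Called without arguments, `Pinecone()` reads the key from the `PINECONE_API_KEY` environment variable set above.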

In [5]:
from pinecone import Pinecone
pc=Pinecone()

## **Create an Index in PINECONE using the above client**

**You can optionally check the existing index names with the code below to prevent duplicates:**

* **i)** First, check whether an index with the chosen name already exists.

* **ii)** Then, create the index if no index with that name exists.

In [6]:
# i) Check whether the index name already exists

existing_indexes=[]

checking_db_indexes=pc.list_indexes()
print(existing_indexes)

for info_index in checking_db_indexes:
  existing_indexes.append(info_index.name)

print(existing_indexes)

[]
['online-rag-project', 'my-family', 'my-9th-chem-book', 'second-rag-project', 'family-structure']
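
The loop above can also be written as a single list comprehension. A minimal sketch, assuming only that each item returned by `pc.list_indexes()` exposes a `name` attribute, as the loop relies on; `index_names` is a hypothetical helper and `fake_listing` is a stand-in so the block runs without a Pinecone account:

```python
from types import SimpleNamespace

def index_names(index_infos):
    """Collect the `name` attribute of each index description."""
    return [info.name for info in index_infos]

# Stand-in for pc.list_indexes(): objects that expose a `name` attribute.
fake_listing = [
    SimpleNamespace(name="my-family"),
    SimpleNamespace(name="online-rag-project"),
]
names = index_names(fake_listing)
print("my-family" in names)  # True
```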


In [8]:
# ii) Creation of Index
from pinecone import ServerlessSpec


index_name="my-family"

if index_name in existing_indexes:
  print(f"Index {index_name} already exist")
else:
  # PROCEED WITH INDEX CREATION
  pc.create_index(
    name=index_name,
    dimension=768,  # must equal the embedding length; models/embedding-001 returns 768-dimensional vectors
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
  )

Index my-family already exist
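
The `metric="cosine"` argument in the cell above asks Pinecone to rank stored vectors by cosine similarity to the query vector. As a rough illustration of what that score measures (a plain-Python sketch, not Pinecone's internal implementation), `cosine_similarity` below is a hypothetical helper:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```

The score depends on direction only, not on vector length, which is why the first pair scores 1.0 even though the magnitudes differ.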


## **Accessing the above created Index:**

- Accessing the index lets us insert the vectors (embedded data) through the line below.

In [9]:
index=pc.Index(index_name)

**Note:** The PINECONE database setup is now complete. Next, you need to set up an embedding model to vectorize your data; large files are chunked before they are embedded.

You can also verify the created index by signing in to your PINECONE account and navigating to **`Database->Indexes`**.

## **Select Embedding model:**

This model converts your data into vectors (lists of numbers) so that it is ready to enter the Pinecone database through the **`index`** variable above.

The embedding model can be imported from the `langchain_google_genai` library as below:

In [10]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model=GoogleGenerativeAIEmbeddings(model="models/embedding-001")
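
The length of the vectors this model returns has to equal the `dimension` of the Pinecone index, otherwise inserts are rejected. A minimal sketch of that check; `check_dimension` is a hypothetical helper, and `sample_vector` is a stand-in for the result of `embedding_model.embed_query(...)` with an assumed length of 768, so no API call is made:

```python
def check_dimension(vector, index_dimension):
    """Raise if the embedding length does not match the index dimension."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"embedding has {len(vector)} values, index expects {index_dimension}"
        )
    return True

# Stand-in for embedding_model.embed_query("..."); 768 is an assumed length.
sample_vector = [0.0] * 768

print(check_dimension(sample_vector, 768))  # True
```

In the notebook, the real length can be confirmed with `len(embedding_model.embed_query("test"))` before creating the index.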

**At this stage, the database is set up and the embedding model is selected. The remaining step is to vectorize the data and store it in the Pinecone database.**

The data to be vectorized is either simple `text`, a `small file`, or a `large file`.

**`Simple text`** and **`small files`** are not chunked, but **`large files`** first go through chunking, and each chunk is then vectorized.
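
To make the chunking idea concrete, here is a minimal fixed-size splitter with overlap in plain Python. It is an illustrative sketch, not the method this notebook uses; `chunk_text` and its parameters are hypothetical, and LangChain ships its own text splitters for real projects:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split `text` into pieces of at most `chunk_size` characters.

    Each piece after the first starts with the last `overlap`
    characters of the previous piece.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij" * 3  # 30 characters standing in for a large file
chunks = chunk_text(sample, chunk_size=12, overlap=2)
print([len(c) for c in chunks])  # [12, 12, 10]
```

The overlap keeps a sentence that straddles a chunk boundary partly visible in both chunks, which tends to help retrieval.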

Let's run through all the possibilities one by one.

## **Import PineconeVectorStore**:

- The **`PineconeVectorStore`** class from the `langchain_pinecone` package **embeds your text automatically** before storing it in the vector database, and it simplifies `storing` and `retrieving` vector embeddings.

- You can still **convert text into vectors** manually with the `embed_query` method, as follows:

 `vector_text=embedding_model.embed_query("Hello, I am Naveed")`
  `print(vector_text)`

The vector store removes the need for this manual step:






In [11]:
from langchain_pinecone import PineconeVectorStore

# Create a vector store client
vector_store=PineconeVectorStore(index=index, embedding=embedding_model)

**Where:**
- The `index` parameter tells the vector store where to store and retrieve the vector embeddings.
- The `embedding` parameter defines how the textual data is converted into vectors.

## 1. **Prepare Documents for the upload**

 - Import `Document` from `langchain_core.documents`.
 - Create a `Document` object that holds the text to store in `page_content` and details such as the file path and title in `metadata`.
 - Rather than writing IDs for each document by hand, you can import the `uuid` library to generate a random, unique ID for each document.

The relevant code for these steps is given below:

In [12]:
from langchain_core.documents import Document

document_1=Document(
    page_content="Chemistry book",
    metadata={"source": "/content/Chemistry 10.pdf", "title": "Chemistry Book"} # file path and title stored under descriptive keys
)

In [18]:
documents="/content/Chemistry 10.pdf"  # NOTE: a plain path string, not a list of Document objects

In [19]:
len(documents)

25

## Add documents to the database

In [21]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(documents))]

vector_store.add_documents(documents=documents, ids=uuids)


AttributeError: 'str' object has no attribute 'page_content'
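
The error arises because `documents` is a path string: `len(documents)` counts its 25 characters, and `add_documents` expects a list of `Document` objects, each with a `page_content` attribute. A minimal sketch of a pre-flight check that pairs each document with one UUID follows. `Doc` is a stand-in dataclass with the same two fields as LangChain's `Document`, used so the block runs without extra installs, and `prepare_upload` is a hypothetical helper:

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def prepare_upload(documents):
    """Validate the input and return (documents, ids) with one UUID per document."""
    if isinstance(documents, str):
        raise TypeError("expected a list of documents, got a single string")
    docs = list(documents)
    for doc in docs:
        if not hasattr(doc, "page_content"):
            raise TypeError(f"{type(doc).__name__} has no page_content")
    ids = [str(uuid4()) for _ in docs]
    return docs, ids

docs, ids = prepare_upload([Doc("Chemistry book", {"title": "Chemistry Book"})])
print(len(docs), len(ids))  # 1 1
```

With real `Document` objects, for example `[document_1]` from the earlier cell, the upload call would then be `vector_store.add_documents(documents=docs, ids=ids)`.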