*P2: LangChain RAG*

RAG (basic concept):

The data is stored in a database. When a task needs a specific part of that data, only the relevant part is retrieved and passed to the model, instead of sending the complete file on every request (which consumes tokens each time and drives up cost).
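The idea can be sketched without any external service. The toy example below is an illustration only: it uses plain word overlap as a stand-in for real embeddings, and the chunk texts and the helper names `score`, `retrieve`, and `build_prompt` are made up for this sketch. It shows that only the best-matching chunk ends up in the prompt.

```python
# Toy illustration of the RAG idea: retrieve only the relevant chunk,
# then build a prompt from it. Word overlap stands in for real embeddings.

def score(query: str, chunk: str) -> int:
    """Count how many distinct query words also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words)


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks that share the most words with the query."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved chunks and the question into one prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical knowledge base, split into small chunks
chunks = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes 5 to 7 business days.",
    "Support is available by email on weekdays.",
]

question = "How many days does shipping take?"
top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Only the shipping chunk is sent to the model here; the other two chunks cost no tokens.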

# Project Initialization

link: https://python.langchain.com/docs/integrations/vectorstores/pinecone/#setup

In [1]:
%pip install -qU langchain-pinecone langchain-google-genai

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-generativeai 0.8.5 requires google-ai-generativelanguage==0.6.15, but you have google-ai-generativ

In [2]:
from google.colab import userdata

from pinecone import Pinecone

pinecone_api_key = userdata.get('PINECONE_API_KEY')

pc = Pinecone(api_key=pinecone_api_key)

<!-- Embeddings -> Embeddings Models -->

OpenAI -> text-embedding-3-small
Google -> 768-dimensional embeddings

In [3]:
from pinecone import ServerlessSpec  # Pinecone itself was already imported in the previous cell

index_name = "rag-project"  # change if desired

existing_indexes = pc.list_indexes()
print(f"Existing indexes: {existing_indexes}")

if index_name in existing_indexes.names():
    print(f"Index '{index_name}' already exists.")
    index = pc.Index(index_name)
else:
    try:
        pc.create_index(
          name=index_name,
          dimension=768,  # must match the output size of the embedding model (768 for the Google embeddings)
          metric="cosine",
          spec=ServerlessSpec(cloud="aws", region="us-east-1"),
        )
        index = pc.Index(index_name)
        print(f"Index '{index_name}' created successfully.")
    except Exception as e:
        print(f"Failed to create index: {e}")
        raise  # re-raise so later cells do not run with an undefined `index`


Existing indexes: [{
    "name": "rag-project",
    "metric": "cosine",
    "host": "rag-project-xe0aah4.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 768,
    "deletion_protection": "disabled",
    "tags": null
}]
Index 'rag-project' already exists.


Concept of embedding:

- Data (text, video, images, etc.) is converted into vectors, i.e. lists of numbers, which are then stored in a vector database.
- Because similar content ends up with similar numbers, a search can jump straight to the relevant items.

For example, suppose a book lists the numbers 1 to 1000, with 100 numbers per page. Asked to find a specific number, a reader can go directly to the right page. If the same 10 pages contained ordinary text and the reader had to find one particular word, they would have to scan every page, which is time-consuming. Embeddings make this kind of lookup easy by converting the data into numbers.

In [4]:
# link: https://python.langchain.com/docs/integrations/text_embedding/google_generative_ai/

from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os

os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')

embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-exp-03-07")

In [5]:
vector = embeddings.embed_query("RAG Text")

In [None]:
# The text "RAG Text" is sent to the embedding model, which returns a vector

In [6]:
vector

[0.0071320850402116776,
 0.02516273595392704,
 -0.011981824412941933,
 -0.057460956275463104,
 -0.007780914660543203,
 0.003816227661445737,
 -0.009094495326280594,
 -0.015609290450811386,
 -0.00862810853868723,
 0.032993134111166,
 -0.0026175640523433685,
 -0.008408121764659882,
 0.0069951810874044895,
 0.01824452541768551,
 0.11325330287218094,
 -0.010691257193684578,
 -0.0044330814853310585,
 0.01899721659719944,
 -0.014814445748925209,
 -0.012412209995090961,
 0.017608879134058952,
 0.003775796853005886,
 -0.0015138393500819802,
 0.013718612492084503,
 0.013436234556138515,
 -0.00041479262290522456,
 0.008457427844405174,
 0.0033459605183452368,
 0.038955021649599075,
 0.003665958298370242,
 -0.020285747945308685,
 -0.013839791528880596,
 0.015023121610283852,
 0.01217720564454794,
 -0.0020864144898951054,
 -0.0037276367656886578,
 -0.004866336937993765,
 -0.0042744772508740425,
 0.002254558727145195,
 0.024874335154891014,
 -0.006363876163959503,
 0.00226133712567389,
 0.039028845

In [7]:
vector[:5]  # print the first 5 values

[0.0071320850402116776,
 0.02516273595392704,
 -0.011981824412941933,
 -0.057460956275463104,
 -0.007780914660543203]
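Since the index was created with `dimension=768`, it is worth confirming that the vectors returned by the embedding model have that length before storing anything. The helper below is a minimal sketch; `check_dimension` and `INDEX_DIMENSION` are names invented here, and a list of zeros stands in for the real `vector` so the snippet runs without an API key.

```python
def check_dimension(vector, expected_dimension):
    """Raise ValueError if the vector length differs from the index dimension."""
    actual = len(vector)
    if actual != expected_dimension:
        raise ValueError(
            f"Embedding has {actual} dimensions, "
            f"but the index expects {expected_dimension}."
        )
    return actual


INDEX_DIMENSION = 768  # the value passed to pc.create_index above

# Stand-in for the real embedding; in the notebook, pass `vector` instead.
sample_vector = [0.0] * 768
print(check_dimension(sample_vector, INDEX_DIMENSION))
```

If the check fails, either recreate the index with the model's actual output size or switch to a model that produces vectors of the index's size; which option applies depends on the model, so consult its documentation.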

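- Semantic search: finds items by closeness of meaning rather than exact wording (for example, "picnic" and "weather" can match because they often relate to the same topic).

- Cosine similarity: measures the angle between two embedding vectors. The smaller the angle (cosine closer to 1), the more similar the meanings, so a search returns the stored vectors with the smallest angle to the query vector. This is what `metric="cosine"` selected when the index was created.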
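Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python; the 3-dimensional vectors are hypothetical values chosen for illustration, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (real ones have hundreds of dimensions)
toy_vectors = {
    "picnic": [0.9, 0.8, 0.1],
    "weather": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}

query = toy_vectors["picnic"]

# Rank every other word by its similarity to the query, most similar first
ranked = sorted(
    (
        (cosine_similarity(query, vec), word)
        for word, vec in toy_vectors.items()
        if word != "picnic"
    ),
    reverse=True,
)

for similarity, word in ranked:
    print(f"{word}: {similarity:.3f}")
```

With these toy values, "weather" ranks above "invoice" because its vector points in nearly the same direction as the one for "picnic".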

# Storing Embeddings in a Vector Database

Vector Store

In [8]:
from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(index=index, embedding=embeddings)

In [9]:
from langchain_core.documents import Document

# below is a dummy doc
document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
    # metadata holds information about the document (source, creation date, etc.), like the properties of a file on a filesystem
)

In [10]:
document_1

Document(metadata={'source': 'tweet'}, page_content='I had chocolate chip pancakes and scrambled eggs for breakfast this morning.')

In [11]:
# Saving the data

from uuid import uuid4

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]

In [12]:
len(documents)

10

Adding documents to the vector store

In [13]:
from uuid import uuid4  # standard-library function that generates a random UUID
uuid4()  # returns a different random UUID on every call

UUID('19dc2123-4d05-41d5-94dc-775e6f4c962f')
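Because `uuid4()` returns a new value on every call, re-running the notebook assigns new IDs to the same documents. One possible alternative, sketched below as an assumption rather than something this notebook does, is to derive each ID from the document text with `uuid5`, so the same text always maps to the same ID. The namespace string `rag-project.example` and the helper name `stable_id` are invented for this sketch.

```python
from uuid import NAMESPACE_DNS, uuid5

# Hypothetical project namespace; any fixed UUID works as long as it never changes.
PROJECT_NAMESPACE = uuid5(NAMESPACE_DNS, "rag-project.example")


def stable_id(text: str) -> str:
    """Return a deterministic UUID string derived from the document text."""
    return str(uuid5(PROJECT_NAMESPACE, text))


first = stable_id("The top 10 soccer players in the world right now.")
second = stable_id("The top 10 soccer players in the world right now.")
other = stable_id("The stock market is down 500 points today.")

print(first == second)  # same text, same ID
print(first == other)   # different text, different ID
```

The trade-off is that two documents with identical text would share one ID, so this only fits data where the text is unique.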

In [14]:
# links (for further task):

# https://python.langchain.com/docs/integrations/vectorstores/pinecone/#manage-vector-store

# https://app.pinecone.io/organizations/-OEs7uPSnx2aGNvzE2SR/projects/3c54d424-b7fd-4805-b1d4-bd667c2c49a2/indexes

# https://github.com/panaversity/learn-agentic-ai/tree/main/backup_recent/07b_crew_ai/backup/21_agentic_rag

In [20]:
# generate one random ID per document

uuids = [str(uuid4()) for _ in range(len(documents))]
vector_store.add_documents(documents=documents, ids=uuids) # error in this line

# `add_documents` is a method of the vector store.
# Our task was to generate the embeddings and then add them to the vector database; the `add_documents` method does both steps.

GoogleGenerativeAIError: Error embedding content: 429 Resource has been exhausted (e.g. check quota).
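A 429 response means the embedding API's quota or rate limit was hit. If the limit is a short-term rate limit, waiting and retrying may succeed; if a daily quota is exhausted, retrying will not help. The sketch below shows a generic retry with exponential backoff. `call_with_backoff` is a helper written for this note, not part of LangChain or Pinecone, and matching on the text "429" in the error message is an assumption about how the error is reported.

```python
import time


def call_with_backoff(func, *args, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying with exponentially growing delays on 429 errors."""
    for attempt in range(max_attempts):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            is_quota_error = "429" in str(exc)  # assumption: message contains the status code
            is_last_attempt = attempt == max_attempts - 1
            if not is_quota_error or is_last_attempt:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demonstration with a fake function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_add():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 Resource has been exhausted")
    return ["id-1", "id-2"]


delays = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_add, base_delay=1.0, sleep=delays.append)
print(result, delays)

# In the notebook, the call would look like this (not run here):
# call_with_backoff(vector_store.add_documents, documents=documents, ids=uuids)
```

Because the same `uuids` are passed on every retry, a retried call writes to the same IDs rather than creating new entries.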

Data Retrieval

In [19]:
# The vector store offers several retrieval methods.

# 1. Similarity search (left commented out because adding the documents failed above)

# results = vector_store.similarity_search(
#     "LangChain provides abstractions to make working with LLMs easy",
# )
# for res in results:
#     print(f"* {res.page_content} [{res.metadata}]")