<a href="https://colab.research.google.com/github/poseidon2022/Retreival-Augumented-Generation/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install --upgrade fsspec==2024.6.1
!pip install -qU \
  langchain==0.0.300 \
  datasets==2.14.6 \
  pinecone-client==2.2.4 \
  tiktoken==0.5.1
!pip install langchain_google_genai
!pip install pypdf==3.1.0

In [55]:
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KE')
PINECONE_API = userdata.get('PIN_CONE')
PINECONE_ENV = 'gcp-starter'


In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro",
                             google_api_key = GOOGLE_API_KEY)

#Example generation
llm.invoke("Write me a ballad about LangChain").content

In [18]:
#Example of keeping the conversation context in memory as a list of messages
messages = [
    ("system", "You are a helpful assistant."),
    ("human", "Hi AI how are you today?"),
    ("ai", "I am great, how can I help you today?"),
    ("human", "I want to know more about string theory.")
]

In [None]:
res = llm.invoke(messages)
res.content

In [None]:
messages.append(("ai", res.content))
messages

In [None]:
messages.append(("human", "tell me more"))
llm.invoke(messages).content
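
The append-and-invoke pattern in the cells above can be wrapped in a small helper. This is a minimal sketch and not part of the original notebook: `chat_turn` and `echo_model` are hypothetical names, and `echo_model` stands in for `llm.invoke(messages).content` so that the example runs without an API key.

```python
def chat_turn(messages, user_text, generate):
    """Append the user's message, call the model, and store its reply.

    `generate` is any callable that takes the message list and returns the
    reply text, for example lambda m: llm.invoke(m).content (assumed wiring).
    """
    messages.append(("human", user_text))
    reply = generate(messages)
    messages.append(("ai", reply))
    return reply


def echo_model(messages):
    # Stand-in model (hypothetical): reports how many messages it was given.
    return f"I have seen {len(messages)} messages so far."


history = [("system", "You are a helpful assistant.")]
first = chat_turn(history, "Hi AI how are you today?", echo_model)
second = chat_turn(history, "tell me more", echo_model)
print(first)
print(second)
print(len(history))
```

Because the whole list is sent on every call, the model sees the earlier turns each time, which is what lets a follow-up such as "tell me more" make sense.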

Now that our LLM is set up, the next step is to load a fresh document. I prepared a document about a fictional company named Kidus Melaku Simegne, but you can load any document to test the functionality.

In [45]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import uuid
from tqdm.auto import tqdm

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001",
                                          google_api_key = GOOGLE_API_KEY)
#example embedding
vector = embeddings.embed_query("hello, world!")
vector[:5]

[0.05168594419956207,
 -0.030764883384108543,
 -0.03062233328819275,
 -0.02802734449505806,
 0.01813092641532421]

In [53]:
#Now we load our document and split it into smaller chunks that
#are later used for embedding
def load_document(file_path):
  loader = PyPDFLoader(file_path)
  document = loader.load()

  #use a recursive text splitter to chunk the pages into a list of smaller texts
  text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150
  )
  chunks = text_splitter.split_documents(document)

  #extract the content from the chunks and embed it batch by batch
  #with Google's generative AI embeddings
  batch_size = 20
  ids = []
  emb_vectors = []
  meta_data = []
  for start in tqdm(range(0, len(chunks), batch_size)):
    batch = chunks[start:start + batch_size]

    context_array = []
    for offset, row in enumerate(batch):
      print(f"appending {start + offset}")
      ids.append(str(uuid.uuid4()))
      context_array.append(row.page_content)
      meta_data.append({
          'source' : row.metadata["source"],
          'page' : row.metadata["page"] + 1,
          'context' : row.page_content
      })

    #embed this batch and keep the vectors aligned with ids and meta_data
    emb_vectors.extend(embeddings.embed_documents(context_array))
  return ids, emb_vectors, meta_data


#let us check if our embedding function is working correctly
ids, emb_vectors, meta_data = load_document("Kidus Melaku's company.pdf")


  0%|          | 0/1 [00:00<?, ?it/s]

appending 0
appending 1
appending 2
appending 3
appending 4
appending 5
appending 6
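
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch of fixed-size chunking with overlap. It is an illustration only: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` does more than this, since it tries to split on separators such as paragraph and line breaks before falling back to characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, where each window
    shares chunk_overlap characters with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.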


Now we have our embedding function. To store each embedded vector along with its metadata and id, we use a vector database called Pinecone. We will create an index with a specified dimension and then try to upsert the embedded document into it.

In [66]:
import pinecone

#setting up the index
pinecone.init(api_key = PINECONE_API, environment = PINECONE_ENV)
index_list = pinecone.list_indexes() #check the existing indexes so that the same index
#is not created again when the notebook is run multiple times.

if "default" not in index_list:
  print("Creating index")
  pinecone.create_index("default", dimension = 1536, metric = "cosine")

#The metric can be changed to dotproduct or euclidean, depending on the application

#confirm that the index exists with the dimension and metric defined above
print(pinecone.describe_index("default"))
index = pinecone.Index("default")

IndexDescription(name='default', metric='cosine', replicas=1, dimension=1536.0, shards=1, pods=1, pod_type='starter', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')
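
For intuition about the `metric` setting, the sketch below compares cosine similarity with the plain dot product on toy 2-dimensional vectors. It is written in pure Python for illustration and says nothing about how Pinecone computes scores internally; the helper names are hypothetical.

```python
import math


def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths,
    # so only the direction of the vectors matters.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # longer vector pointing the same way
orthogonal = [0.0, 2.0]       # vector at a right angle to the query

print(dot(query, same_direction), cosine_similarity(query, same_direction))
print(dot(query, orthogonal), cosine_similarity(query, orthogonal))
```

Cosine ignores vector length while the dot product rewards it, so the two metrics rank results identically only when the vectors are normalized.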


In [71]:
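#now let us create our function for upserting the embedded vectors to pinecone
import time

def upsert(ids, emb_vectors, meta_data, index):
   #pair each id with its vector and metadata as (id, values, metadata) tuples
   to_upsert = list(zip(ids, emb_vectors, meta_data))
   index.upsert(vectors = to_upsert)
   time.sleep(2) #brief pause after the upsert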
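
The `upsert` function sends every vector in a single request. For larger documents it may be safer to send the records in smaller batches. The following is a hedged sketch of the batching step only: `batched` and the demo data are hypothetical, and the right batch size depends on Pinecone's request limits, which are not checked here.

```python
def batched(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    records = list(records)
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Demo records shaped like the (id, vector, metadata) tuples built in upsert.
demo_ids = ["a", "b", "c", "d", "e"]
demo_vectors = [[0.1], [0.2], [0.3], [0.4], [0.5]]
demo_meta = [{"page": n} for n in range(1, 6)]

batches = list(batched(zip(demo_ids, demo_vectors, demo_meta), batch_size=2))
print([len(batch) for batch in batches])
```

Inside `upsert`, each batch could then be passed to `index.upsert(vectors = batch)` in a loop.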


In [72]:
#let us try upserting a sample document to the vector database
ids, emb_vectors, meta_data = load_document("Kidus Melaku's company.pdf")
upsert(ids, emb_vectors, meta_data, index)

  0%|          | 0/1 [00:00<?, ?it/s]

appending 0
appending 1
appending 2
appending 3
appending 4
appending 5
appending 6


ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'Content-Length': '103', 'x-pinecone-request-latency-ms': '19', 'x-pinecone-request-id': '3092300012056077798', 'date': 'Wed, 04 Sep 2024 07:29:55 GMT', 'x-envoy-upstream-service-time': '20', 'server': 'envoy', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"code":3,"message":"Vector dimension 768 does not match the dimension of the index 1536","details":[]}
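
The response body explains the failure: the index was created with dimension 1536, but the vectors returned by the embedding model have 768 components. A likely fix is to delete the index and recreate it with `dimension = 768`. This is an assumption to verify against the Pinecone documentation, since as far as I know an index's dimension cannot be changed after creation. The sketch below shows a local check that surfaces the mismatch before any request is sent; `check_dimensions` is a hypothetical helper, and the vectors are placeholders rather than real embeddings.

```python
def check_dimensions(vectors, index_dimension):
    """Raise early if any vector's length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != index_dimension:
            raise ValueError(
                f"vector {position} has dimension {len(vector)}, "
                f"but the index expects {index_dimension}"
            )
    return True


# Placeholder vectors with the same length as the embeddings in the error above.
fake_vectors = [[0.0] * 768, [0.0] * 768]

print(check_dimensions(fake_vectors, 768))
try:
    check_dimensions(fake_vectors, 1536)
except ValueError as err:
    message = str(err)
    print(message)
```

Calling a check like this on `emb_vectors` at the start of `upsert` would turn the 400 response into a local error with the same information.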
