Retrieval and Chains Using LangChain

In [1]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("attention.pdf")
docs = loader.load()


In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(docs)
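
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplification: RecursiveCharacterTextSplitter also tries to break on separators (paragraphs, lines, words) before falling back to raw characters, which this sketch ignores. The helper name chunk_with_overlap and the sample text are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last 4 characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.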

In [7]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")
# Index only the first 30 chunks in the vector store.
db = Chroma.from_documents(documents[:30], embeddings)




In [11]:
import os

from dotenv import load_dotenv
from langchain_google_genai import ChatGoogleGenerativeAI

# Load GOOGLE_API_KEY from a local .env file into the environment.
load_dotenv()

google_api_key = os.getenv("GOOGLE_API_KEY")
if google_api_key is not None:
    os.environ["GOOGLE_API_KEY"] = google_api_key

llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)



In [8]:
query = "What are transformers"

In [9]:
result = db.similarity_search(query)
result[0].page_content

'aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate\nself-attention and discuss its advantages over models such as [14, 15] and [8].\n3 Model Architecture\nMost competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 29].\nHere, the encoder maps an input sequence of symbol representations (x1,...,x n) to a sequence\nof continuous representations z = (z1,...,z n). Given z, the decoder then generates an output\nsequence (y1,...,y m) of symbols one element at a time. At each step the model is auto-regressive\n[9], consuming the previously generated symbols as additional input when generating the next.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks'
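
As a rough picture of what a similarity search does, the sketch below ranks a few stored vectors against a query vector and returns the closest ones. It is an illustration only: the two-dimensional vectors and chunk names are made up, cosine similarity is assumed here while the metric Chroma actually uses depends on how the collection is configured, and top_k is a hypothetical helper. The default k=4 mirrors the default of similarity_search.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy embeddings; real embedding vectors have hundreds or thousands of dimensions.
toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.6, 0.8],
    "chunk_c": [0.0, 1.0],
}
ranked = top_k([1.0, 0.1], toy_vectors, k=2)
print(ranked)
```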

In [10]:
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context.
Think step by step before providing a detailed answer.
I will tip you $1000 if the user finds the answer helpful.
<context>
{context}
</context>
Question: {input}
""")

In [12]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(llm, prompt)
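
The "stuff" strategy can be sketched without any LangChain objects: concatenate the text of every document and substitute the result for {context} in the prompt. The sketch below assumes a "\n\n" separator between documents; ToyDocument, TEMPLATE, and stuff_documents are hypothetical stand-ins, not LangChain APIs, and the template is a shortened version of the prompt above.

```python
from dataclasses import dataclass


@dataclass
class ToyDocument:
    """Stand-in for a LangChain Document; only page_content is modeled."""
    page_content: str


TEMPLATE = """Answer the following question based only on the provided context.
<context>
{context}
</context>
Question: {input}"""


def stuff_documents(docs: list[ToyDocument], question: str, separator: str = "\n\n") -> str:
    """Join all document texts and place them in the prompt's context slot."""
    context = separator.join(doc.page_content for doc in docs)
    return TEMPLATE.format(context=context, input=question)


toy_docs = [ToyDocument("First chunk."), ToyDocument("Second chunk.")]
stuffed_prompt = stuff_documents(toy_docs, "What is in the chunks?")
print(stuffed_prompt)
```

Because every document goes into a single prompt, this approach only works while the combined text fits in the model's context window.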

In [13]:
## A retriever is an interface that returns relevant documents for a query.
## Unlike a vector store, it does not have to store documents itself;
## here it wraps the Chroma vector store created above.
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['Chroma', 'GoogleGenerativeAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x71d6900e16d0>, search_kwargs={})

In [14]:
## Retrieval chain: fetch relevant chunks with the retriever,
## then pass them to the document chain as context.

from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(retriever, document_chain)
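
The data flow of a retrieval chain can be sketched with plain functions: the question goes to the retriever, the retrieved documents are added under a "context" key, and the document chain's output is stored under "answer". All three functions below are hypothetical stand-ins with toy logic (keyword matching and a canned reply), not the LangChain implementation.

```python
def toy_retriever(query: str) -> list[str]:
    """Stand-in for a retriever: return texts that share a word with the query."""
    corpus = [
        "attention maps a query and key-value pairs to an output",
        "the encoder is a stack of identical layers",
    ]
    query_words = set(query.lower().split())
    return [text for text in corpus if query_words & set(text.split())]


def toy_document_chain(inputs: dict) -> str:
    """Stand-in for the document chain: report how much context it received."""
    return f"Answered {inputs['input']!r} using {len(inputs['context'])} document(s)."


def toy_retrieval_chain(inputs: dict) -> dict:
    """Retrieve context for the input, then answer with that context."""
    context = toy_retriever(inputs["input"])
    answer = toy_document_chain({**inputs, "context": context})
    return {**inputs, "context": context, "answer": answer}


toy_response = toy_retrieval_chain({"input": "scaled dot product attention"})
print(toy_response["answer"])
```

Keeping the retrieved documents in the result is useful for debugging, since it shows which chunks the answer was based on.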

In [19]:
response = retrieval_chain.invoke({"input": "Scaled dot product attention"})

In [20]:
response['answer']

'Based on the provided context, here is a detailed explanation of Scaled Dot-Product Attention:\n\nScaled Dot-Product Attention is a specific type of attention mechanism.\n\n1.  **Computation Process:**\n    *   It involves computing the dot product of a query with all keys.\n    *   Each of these dot products is then divided by a scaling factor of $\\sqrt{d_k}$.\n    *   A softmax function is applied to these scaled dot products to obtain the weights on the values.\n    *   In practice, for a set of queries (matrix Q), keys (matrix K), and values (matrix V), the output matrix is computed using the formula:\n        $Attention(Q,K,V ) = softmax(\\frac{Q K^T}{\\sqrt{d_k}})V$\n\n2.  **Relationship to Other Attention Functions:**\n    *   It is one of the two most commonly used attention functions, the other being additive attention.\n    *   **Dot-Product (Multiplicative) Attention:** Scaled Dot-Product Attention is identical to dot-product attention, *except* for the scaling factor of $