Basic RAG 101

1. Loading the Document

In [7]:
PDF_FILE = "Prompt Engineering.pdf"

MODEL = "llama2"

In [8]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(PDF_FILE)
pages = loader.load()

print(f"Number of pages loaded: {len(pages)}")
# Note: the length printed below is for pages[1], while the content shown is pages[10].
print(f"Length of a page : {len(pages[1].page_content)}")
print("Content of a page:", pages[10].page_content)

Number of pages loaded: 68
Length of a page : 281
Content of a page: Prompt Engineering
February 2025
11
• Top-P sampling selects the top tokens whose cumulative probability does not exceed 
a certain value (P). Values for P range from 0 (greedy decoding) to 1 (all tokens in the 
LLM’s vocabulary).
The best way to choose between top-K and top-P is to experiment with both methods (or 
both together) and see which one produces the results you are looking for. 
Putting it all together
Choosing between top-K, top-P, temperature, and the number of tokens to generate, 
depends on the specific application and desired outcome, and the settings all impact one 
another. It’s also important to make sure you understand how your chosen model combines 
the different sampling settings together.
If temperature, top-K, and top-P are all available (as in Vertex Studio), tokens that meet 
both the top-K and top-P criteria are candidates for the next predicted token, and then 
temperature is applied to sa

2. Text Chunking with RecursiveCharacterTextSplitter

In [9]:
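from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size is the maximum chunk length in characters;
# chunk_overlap is how many characters neighbouring chunks may share.
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
chunks = splitter.split_documents(pages)

print(f"Number of chunks: {len(chunks)}")
# Note: the length printed below is for chunks[1], while the content shown is chunks[10].
print(f"Length of a chunk: {len(chunks[1].page_content)}")
print("Content of a chunk:", chunks[10].page_content)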


Number of chunks: 89
Length of a chunk: 281
Content of a chunk: Prompt Engineering
February 2025
10
deterministic: the highest probability token is always selected (though note that if two tokens 
have the same highest predicted probability, depending on how tiebreaking is implemented 
you may not always get the same output with temperature 0).
Temperatures close to the max tend to create more random output. And as temperature gets 
higher and higher, all tokens become equally likely to be the next predicted token.
The Gemini temperature control can be understood in a similar way to the softmax function 
used in machine learning. A low temperature setting mirrors a low softmax temperature (T), 
emphasizing a single, preferred temperature with high certainty. A higher Gemini temperature 
setting is like a high softmax temperature, making a wider range of temperatures around 
the selected setting more acceptable. This increased uncertainty accommodates scenarios 
where a rigid, precise t
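
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping character windows in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraph and line breaks, and splits each page separately, which is why a short page can yield a chunk far below 1500 characters). The function chunk_text and the sample string are hypothetical names used only for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window repeats the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.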

3. FAISS + Ollama Embedding Index

In [11]:
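from langchain_community.vectorstores import FAISS
# OllamaEmbeddings is also provided by langchain_ollama, the package used for OllamaLLM below.
from langchain_ollama import OllamaEmbeddings

# Reuse MODEL instead of repeating the "llama2" literal.
# llama2 is a general-purpose LLM; a dedicated embedding model may retrieve better.
embeddings = OllamaEmbeddings(model=MODEL)
vectorstore = FAISS.from_documents(chunks, embeddings)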

In [18]:
retriever = vectorstore.as_retriever()
retriever.invoke("what is Prompt Engineering?")

[Document(id='a95abc19-f984-47b0-896b-02ed4c0e8a11', metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 20.2 (Macintosh)', 'creationdate': '2025-03-17T13:40:21-06:00', 'moddate': '2025-03-17T13:40:26-06:00', 'trapped': '/False', 'source': 'Prompt Engineering.pdf', 'total_pages': 68, 'page': 67, 'page_label': '68'}, page_content='10. Google Cloud Platform, 2023, Chain of Thought and React. Available at: https://github.com/ \nGoogleCloudPlatform/generative-ai/blob/main/language/prompts/examples/chain_of_thought_react.ipynb . \n11. Wang, X., et al., 2023, Self Consistency Improves Chain of Thought reasoning in language models.  \nAvailable at: https://arxiv.org/pdf/2203.11171.pdf .\n12. Yao, S., et al., 2023, Tree of Thoughts: Deliberate Problem Solving with Large Language Models.  \nAvailable at: https://arxiv.org/pdf/2305.10601.pdf .\n13. Yao, S., et al., 2023, ReAct: Synergizing Reasoning and Acting in Language Models. Available at:  \nhttps://arxiv.org/pdf/2210
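
The top hit above is a page of references rather than a definition, which suggests the embeddings do not separate these chunks well. The sketch below shows the idea behind the lookup: embed the query, then return the k stored vectors closest to it. It assumes Euclidean distance and uses hand-made three-dimensional vectors; toy_index, toy_query and top_k are hypothetical and stand in for the real FAISS index and Ollama embeddings.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 4):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical embeddings: each chunk is reduced to a made-up 3-d vector.
toy_index = [
    ("definition of prompt engineering", [0.9, 0.1, 0.0]),
    ("reference list", [0.1, 0.9, 0.2]),
    ("sampling settings", [0.2, 0.1, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]

toy_hits = top_k(toy_query, toy_index, k=2)
print([text for text, _ in toy_hits])
# ['definition of prompt engineering', 'sampling settings']
```

With a model that places the question near the definitional chunks, the right pages rank first. When the vectors are poorly separated, unrelated pages such as the reference list can come out on top.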

In [None]:
# Single prompt sent straight to OllamaLLM, without retrieval (sanity check)
from langchain_ollama import OllamaLLM
model = OllamaLLM(model=MODEL, temperature=0.1)
model.invoke("What is Prompt Engineering?")

'\nPrompt engineering is a relatively new subfield of natural language processing (NLP) that focuses on the design and optimization of language prompts or inputs to improve the performance of AI models in various tasks. The goal of prompt engineering is to create more effective and efficient language prompts that can elicit better responses from AI models, such as improved accuracy, relevance, and completeness.\n\nPrompt engineering involves a range of techniques, including:\n\n1. Designing optimal prompts: This involves creating prompts that are specifically tailored to the task at hand, using techniques such as template-based prompts, semantic prompts, and hybrid prompts.\n2. Optimizing prompts for specific models: Prompt engineering can involve optimizing prompts for specific AI models, such as language translation models or question answering systems, to improve their performance on particular tasks.\n3. Generating diverse prompts: Prompt engineering can also involve generating a d

4. Creating a Custom Prompt Template with LangChain

In [14]:
from langchain_core.prompts import PromptTemplate

template = """
You are an assistant that provides answers to questions based on
a given context. 

Answer the question based on the context. If you can't answer the
question, reply "I don't know".

Be as concise as possible and go straight to the point.

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


You are an assistant that provides answers to questions based on
a given context. 

Answer the question based on the context. If you can't answer the
question, reply "I don't know".

Be as concise as possible and go straight to the point.

Context: Here is some context

Question: Here is a question



In [15]:

chain = prompt | model

chain.invoke({
    "context": "Anna's sister is Susan", 
    "question": "Who is Susan's sister?"
})

" Sure! Based on the context you provided, Susan's sister is Anna."

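Conceptually, the prompt step only substitutes the input dictionary into the {context} and {question} placeholders before the text reaches the model. A minimal sketch with plain str.format, assuming the default f-string style template; toy_template and fill are hypothetical names, not part of LangChain.

```python
toy_template = "Context: {context}\n\nQuestion: {question}\n"


def fill(template: str, **values: str) -> str:
    """Substitute named placeholders; raises KeyError if a value is missing."""
    return template.format(**values)


filled = fill(
    toy_template,
    context="Anna's sister is Susan",
    question="Who is Susan's sister?",
)
print(filled)
```

The string produced here is what the model receives as its entire input, so anything missing from the context is simply not visible to it.
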
5. Chaining Retriever, Prompt, and LLM in LangChain

In [16]:

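from operator import itemgetter

# The dict step builds the prompt inputs from {"question": ...}:
#   "context"  -> the question is sent to the retriever, which returns Documents
#   "question" -> the question is passed through unchanged
chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
)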
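
In the chain above, the retriever's list of Document objects is passed to {context} as is, so it is most likely rendered as a Python repr that includes metadata. A common pattern is to join only the page text first. The sketch below mimics the dict step in plain Python; ToyDoc, toy_retriever, format_docs and build_prompt_inputs are hypothetical stand-ins, not LangChain classes or functions.

```python
from dataclasses import dataclass
from operator import itemgetter


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def toy_retriever(question: str) -> list[ToyDoc]:
    """Hypothetical retriever: returns fixed documents regardless of the question."""
    return [
        ToyDoc("Top-K keeps the K most likely tokens."),
        ToyDoc("Top-P keeps tokens up to cumulative probability P."),
    ]


def format_docs(docs: list[ToyDoc]) -> str:
    """Join only the text of each document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt_inputs(inputs: dict) -> dict:
    """Mimic the dict step: every key is computed from the same input dict."""
    question = itemgetter("question")(inputs)
    return {
        "context": format_docs(toy_retriever(question)),
        "question": question,
    }


toy_inputs = build_prompt_inputs({"question": "What is Top-K and Top-P?"})
toy_prompt = "Context: {context}\n\nQuestion: {question}".format(**toy_inputs)
print(toy_prompt)
```

Printing the final prompt like this is also a quick way to check what the model actually sees before judging its answers.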

6. Batch Inference on RAG Chain with LangChain

In [17]:
questions = [
    "What is Top-K and Top-P?",
    "What is Temperature?",
    "What is Output Length?",
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print("*************************\n")

Question: What is Top-K and Top-P?
Answer:  Based on the provided context, I can answer your question. Top-K and Top-P are terms used in the field of prompt engineering for language models.

Top-K refers to the number of examples or instances that a model is trained on. For instance, in the document you provided, Top-K is mentioned as "N/A" for the given prompt. This means that the model has not been trained on any specific number of examples for this task.

Top-P, on the other hand, refers to the proportion of positive examples in a dataset. In the context of few-shot learning, Top-P represents the percentage of instances in the training set that are positive (i.e., belonging to the target class). A higher Top-P value indicates that the model has more opportunities to learn from positive examples, which can improve its performance on unseen data.

In summary, Top-K refers to the number of examples trained on, while Top-P represents the proportion of positive instances in the training 