1. Load the PDF

In [19]:
### Read a PDF file
from langchain_community.document_loaders import PyPDFLoader
loader=PyPDFLoader('NET-Microservices.pdf')
pages=loader.load()
pages


[Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2022-05-18T00:38:25+05:30', 'moddate': '2022-05-18T00:38:25+05:30', 'source': 'NET-Microservices.pdf', 'total_pages': 349, 'page': 0, 'page_label': '1'}, page_content=''),
 Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2022-05-18T00:38:25+05:30', 'moddate': '2022-05-18T00:38:25+05:30', 'source': 'NET-Microservices.pdf', 'total_pages': 349, 'page': 1, 'page_label': '2'}, page_content='EDITION v6.0 - Updated to ASP.NET Core 6.0 \nRefer changelog for the book updates and community contributions. \nThis guide is an introduction to developing microservices-based applications and managing them \nusing containers. It discusses architectural design and implementation approaches using .NET and \nDocker containers. \nTo make it easier to get started, the guide focuses on a refer

2. Check the length of the pages list

In [20]:
len(pages)  # Number of pages loaded

349

3. Semantic Chunking (Not Fixed Length)
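This splits the text at natural boundaries (paragraph breaks, line breaks, sentence ends) instead of at fixed offsets, caps each chunk at 1000 characters, and keeps up to 200 characters of overlap between consecutive chunks to preserve context. Note that RecursiveCharacterTextSplitter picks its breakpoints from the separator list, not from the meaning of the text.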

In [21]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Join all pages 
full_text = " ".join([p.page_content for p in pages])

# Use RecursiveCharacterTextSplitter with sentence-level separators
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", "!", "?", ",", " "],
    chunk_size=1000,
    chunk_overlap=200,
)

chunks = text_splitter.create_documents([full_text])
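To make the splitting strategy concrete, here is a minimal pure-Python sketch of the idea: try the coarsest separator first, pack the pieces greedily up to the size limit, and fall back to finer separators only for pieces that are still too long. It is a simplification written for illustration, not LangChain's implementation: it ignores chunk_overlap, it drops the separator at chunk boundaries, and the name recursive_split is hypothetical.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters,
    preferring separators that appear earlier in the list."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)       # close the open chunk
        if len(piece) <= chunk_size:
            current = piece              # start a new chunk with this piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph is a bit longer. It has two sentences.\n\n"
    "Third."
)
demo_chunks = recursive_split(sample, ["\n\n", "\n", ".", " "], chunk_size=40)
for chunk in demo_chunks:
    print(repr(chunk))
```

Only the second paragraph exceeds the 40-character limit, so only that paragraph is split further at the sentence level; the other two stay whole.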

In [23]:
full_text[:1000]  # Display the first 1000 characters of the full text

' EDITION v6.0 - Updated to ASP.NET Core 6.0 \nRefer changelog for the book updates and community contributions. \nThis guide is an introduction to developing microservices-based applications and managing them \nusing containers. It discusses architectural design and implementation approaches using .NET and \nDocker containers. \nTo make it easier to get started, the guide focuses on a reference containerized and microservice -\nbased application that you can explore. The reference application is available at the \neShopOnContainers GitHub repo. \nAction links \n• This e-book is also available in a PDF format (English version only) Download \n• Clone/Fork the reference application eShopOnContainers on GitHub \n• Watch the introductory video \n• Get to know the Microservices Architecture right away \nIntroduction \nEnterprises are increasingly realizing cost savings, solving deployment problems, and improving \nDevOps and production operations by using containers. Microsoft has been rel

In [24]:
len(full_text)  # Length of the full text

803222

4. Check the number of chunks

In [22]:
len(chunks)  # Number of chunks created

1006
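As a rough sanity check on that count, the number of chunks can be estimated from the text length, the chunk size, and the overlap. The sketch below assumes every chunk is completely full and overlaps its neighbour by the full 200 characters, which the real splitter does not guarantee, so it is only an approximation; the helper name estimate_chunk_count is hypothetical.

```python
import math


def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Approximate the chunk count, assuming full chunks and full overlap."""
    stride = chunk_size - chunk_overlap  # new characters covered by each chunk
    return math.ceil((text_length - chunk_overlap) / stride)


# Values taken from the cells above: len(full_text), chunk_size, chunk_overlap.
estimate = estimate_chunk_count(803222, 1000, 200)
print(estimate)  # 1004
```

The estimate of 1004 is close to the 1006 chunks actually produced, which suggests the splitter is filling most chunks nearly to the limit.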

In [29]:
chunks[0]

Document(metadata={}, page_content='EDITION v6.0 - Updated to ASP.NET Core 6.0 \nRefer changelog for the book updates and community contributions. \nThis guide is an introduction to developing microservices-based applications and managing them \nusing containers. It discusses architectural design and implementation approaches using .NET and \nDocker containers. \nTo make it easier to get started, the guide focuses on a reference containerized and microservice -\nbased application that you can explore. The reference application is available at the \neShopOnContainers GitHub repo. \nAction links \n• This e-book is also available in a PDF format (English version only) Download \n• Clone/Fork the reference application eShopOnContainers on GitHub \n• Watch the introductory video \n• Get to know the Microservices Architecture right away \nIntroduction \nEnterprises are increasingly realizing cost savings, solving deployment problems, and improving')

5. Embed the chunks

In [9]:
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()


In [None]:
documents_with_embeddings = embedding_model.embed_documents([c.page_content for c in chunks])

In [39]:
len(documents_with_embeddings)

1006

In [26]:
# Display the text of the first chunk
print(chunks[0].page_content)

EDITION v6.0 - Updated to ASP.NET Core 6.0 
Refer changelog for the book updates and community contributions. 
This guide is an introduction to developing microservices-based applications and managing them 
using containers. It discusses architectural design and implementation approaches using .NET and 
Docker containers. 
To make it easier to get started, the guide focuses on a reference containerized and microservice -
based application that you can explore. The reference application is available at the 
eShopOnContainers GitHub repo. 
Action links 
• This e-book is also available in a PDF format (English version only) Download 
• Clone/Fork the reference application eShopOnContainers on GitHub 
• Watch the introductory video 
• Get to know the Microservices Architecture right away 
Introduction 
Enterprises are increasingly realizing cost savings, solving deployment problems, and improving


In [28]:
# Display the first 10 dimensions of the embedding
print(documents_with_embeddings[0][:10])  

[0.0074204481378972505, -0.009146921000551145, 0.005873393183934531, -0.016425189699665173, -0.005558565620695548, 0.011056196664013923, -0.018402169724740538, -0.017494925504931542, -0.014989846430135772, -0.0204062326120481]


In [None]:
# Number of dimensions in each embedding vector
len(documents_with_embeddings[0])

1536

6. Store in MongoDB


In [3]:
import os
from dotenv import load_dotenv
load_dotenv() 


True

In [4]:
from pymongo import MongoClient
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch

In [5]:
client = MongoClient(os.getenv("MONGODB_CONNECTION_STRING"))
collection = client[os.getenv("MONGODB_NAME")][os.getenv("MONGODB_COLLECTION")]

In [None]:

# Reset the collection without deleting the Search Index
collection.delete_many({})

DeleteResult({'n': 1006, 'electionId': ObjectId('7fffffff00000000000001d4'), 'opTime': {'ts': Timestamp(1749215903, 108), 't': 468}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1749215903, 108), 'signature': {'hash': b'\xfc\xb1@\xe9L\x193!2\x93\xe57\x88(%a\x05+\x12\x10', 'keyId': 7447331114961600517}}, 'operationTime': Timestamp(1749215903, 108)}, acknowledged=True)

In [60]:
# Insert the chunks into MongoDB Atlas along with their embeddings
docsearch = MongoDBAtlasVectorSearch.from_documents(
    chunks, embedding_model, collection=collection, index_name=os.getenv("MONGODB_INDEX_NAME")
)
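The retrieval step relies on an Atlas Vector Search index that already exists on the collection; the notebook does not show how it was defined. The sketch below shows what such a definition could look like. The field path `embedding` (the default embedding field of langchain_mongodb) and the `cosine` similarity are assumptions, while the 1536 dimensions match the embedding length measured earlier.

```python
import json

# Assumed index definition; the real one is not shown in this notebook.
vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",     # assumed: default embedding field of langchain_mongodb
            "numDimensions": 1536,   # matches len(documents_with_embeddings[0])
            "similarity": "cosine",  # assumed; "euclidean" and "dotProduct" also exist
        }
    ]
}

print(json.dumps(vector_index_definition, indent=2))
```

If numDimensions does not match the embedding model's output size, the index cannot serve queries for those vectors, so it is worth checking this value whenever the embedding model changes.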

7. Retrieve

In [10]:
# initialize vector store
vectorStore = MongoDBAtlasVectorSearch(
    collection, OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY")), index_name=os.getenv("MONGODB_INDEX_NAME")
)

In [11]:
os.getenv("MONGODB_INDEX_NAME")

'vector_index'

In [17]:
# perform a similarity search between the embedding of the query and the embeddings of the documents
myquery = "What is docker and how does it work?"
context_from_vecotore_store = vectorStore.similarity_search(query=myquery, k=5)

context_from_vecotore_store

[Document(id='6842eace68f3870b5a129950', metadata={'_id': '6842eace68f3870b5a129950'}, page_content='A container image is a way to package an app or service and deploy it in a reliable and reproducible \nway. You could say that Docker isn’t only a technology but also a philosophy and a process.  \nWhen using Docker, you won’t hear developers say, “It works on my machine, why not in production?” \nThey can simply say, “It runs on Docker”, because the packaged Docker application can be executed \non any supported Docker environment, and it runs the way it was intended to on all deployment \ntargets (such as Dev, QA, staging, and production). \nA simple analogy \nPerhaps a simple analogy can help getting the grasp of the core concept of Docker. \nLet’s go back in time to the 1950s for a moment. There were no word processors, and the \nphotocopiers were used everywhere (kind of). \nImagine you’re responsible for quickly issuing batches of letters as required, to mail them to \ncustomers, u
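Conceptually, the similarity search embeds the query and ranks the stored chunks by how close their vectors are to it. The sketch below illustrates that ranking with brute-force cosine similarity over toy 3-dimensional vectors. It is an illustration only: Atlas uses an approximate nearest-neighbour index rather than a full scan, the similarity metric depends on how the index was configured, and the vectors and names here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy vectors standing in for 1536-dimensional embeddings.
toy_query = [1.0, 0.0, 0.0]
toy_docs = {
    "docker": [0.9, 0.1, 0.0],
    "kubernetes": [0.5, 0.5, 0.0],
    "billing": [0.0, 0.0, 1.0],
}

print(top_k(toy_query, toy_docs, k=2))
```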

In [15]:
from langchain_core.prompts import PromptTemplate

In [16]:
prompt=PromptTemplate(
    template="""You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:""",
    input_variables=['context', 'question']
)

In [18]:
prompt.invoke({"question":"What is Microservice Architecture?","context":context_from_vecotore_store})

StringPromptValue(text="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: What is Microservice Architecture? \nContext: [Document(id='6842eace68f3870b5a129950', metadata={'_id': '6842eace68f3870b5a129950'}, page_content='A container image is a way to package an app or service and deploy it in a reliable and reproducible \\nway. You could say that Docker isn’t only a technology but also a philosophy and a process.  \\nWhen using Docker, you won’t hear developers say, “It works on my machine, why not in production?” \\nThey can simply say, “It runs on Docker”, because the packaged Docker application can be executed \\non any supported Docker environment, and it runs the way it was intended to on all deployment \\ntargets (such as Dev, QA, staging, and production). \\nA simple analogy \\nPerha
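The output above shows that the context was inserted as the printed form of a list of Document objects, including ids and metadata, and that it was retrieved for the earlier Docker query rather than for the question asked here. The sketch below reproduces the formatting effect with plain string formatting and a stand-in class, then shows the cleaner prompt obtained by joining only the text of each document. FakeDocument and the shortened template are hypothetical stand-ins, not LangChain objects.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document; only the text field is modelled."""
    page_content: str


# Shortened version of the template used above.
TEMPLATE = "Question: {question} \nContext: {context} \nAnswer:"

docs = [
    FakeDocument("Docker packages an app into a container image."),
    FakeDocument("Containers run the same way on every host."),
]

# Passing the list directly embeds its repr, class names and all.
raw_prompt = TEMPLATE.format(question="What is Docker?", context=docs)

# Joining the text first gives the model only the passages themselves.
clean_context = "\n\n".join(doc.page_content for doc in docs)
clean_prompt = TEMPLATE.format(question="What is Docker?", context=clean_context)

print(raw_prompt)
print("-" * 40)
print(clean_prompt)
```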

In [21]:
from langchain_openai import OpenAI

In [22]:
llm = OpenAI(model_name="gpt-3.5-turbo-instruct")

In [26]:
retriever = vectorStore.as_retriever()

In [33]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda

In [27]:
output_parser = StrOutputParser()

In [20]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [29]:
from langchain_core.messages.base import BaseMessage

In [31]:
def get_question(input):
    """Extract the question text from a str, a dict with a 'question' key, or a message."""
    if not input:
        return None
    elif isinstance(input,str):
        return input
    elif isinstance(input,dict) and 'question' in input:
        return input['question']
    elif isinstance(input,BaseMessage):
        return input.content
    else:
        raise TypeError("string or dict with 'question' key expected as RAG chain input.")

In [28]:
from langchain import hub
rag_prompt = hub.pull("rlm/rag-prompt")

In [None]:
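rag_chain = (
    {
        # Retrieve chunks for the question and join their text into one string
        "context": RunnableLambda(get_question) | retriever | format_docs,
        # Pass the original input through unchanged as the question
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | output_parser  # Return a plain string regardless of the model type
)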
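The chain above can be read as a data flow: the same input feeds two branches (one retrieves and formats context, the other passes the question through), the resulting dict fills the prompt, and the prompt goes to the model. The sketch below traces that flow with plain functions and no LangChain or API calls. Every function here is a hypothetical stand-in: the retriever matches on shared words instead of embeddings, and the "model" simply echoes the prompt it received.

```python
def fake_retriever(question):
    """Stand-in retriever: return passages sharing a word with the question."""
    corpus = [
        "Docker packages applications as portable containers.",
        "Microservices are small, independently deployable services.",
    ]
    query_words = {w.strip("?.,").lower() for w in question.split()}
    return [
        passage for passage in corpus
        if query_words & {w.strip("?.,").lower() for w in passage.split()}
    ]


def join_passages(passages):
    """Join retrieved passages into one context string."""
    return "\n\n".join(passages)


def fill_prompt(values):
    """Fill a minimal prompt template from the dict built by the first step."""
    return "Question: {question}\nContext: {context}\nAnswer:".format(**values)


def fake_llm(prompt_text):
    """Stand-in model: echo the prompt instead of calling an API."""
    return "LLM saw:\n" + prompt_text


def run_chain(user_input):
    # Step 1: both branches receive the same input.
    values = {
        "context": join_passages(fake_retriever(user_input)),
        "question": user_input,
    }
    # Steps 2 and 3: fill the prompt, then hand it to the model.
    return fake_llm(fill_prompt(values))


answer = run_chain("What is Docker?")
print(answer)
```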

In [36]:
rag_chain.invoke("What is Docker?")

' Docker is an open-source project and company that automates the deployment of applications as portable, self-sufficient containers. These containers can run on the cloud or on-premises and offer benefits such as isolation, portability, agility, scalability, and control. A container image is a package with all the dependencies and information needed to create a container, and a Dockerfile is a text file that contains instructions for building a Docker image. '