# RAG

Load imports

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders import TextLoader, PyMuPDFLoader, UnstructuredPDFLoader, WebBaseLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.retrievers import WikipediaRetriever
from langchain_core.runnables import RunnablePassthrough, RunnableLambda, RunnableParallel


from dotenv import load_dotenv
import re




USER_AGENT environment variable not set, consider setting it to identify your requests.


Initialize models

In [3]:
load_dotenv()

model = ChatGoogleGenerativeAI(model= "gemini-2.5-flash-lite")
embeddings = GoogleGenerativeAIEmbeddings(model="gemini-embedding-001")
parser = StrOutputParser()

Given a URL, `web_scraper` loads the page, collapses its text into a single space-delimited string that is easy to split, and returns a list of Documents ready for a vector store. `general_recursive_split` chunks documents that have already been loaded.

In [4]:
def web_scraper(url : str):
    
    web_loader = WebBaseLoader(url)
    web_docs = web_loader.load()
    scraped_text = web_docs[0].page_content
    # This collapses all whitespace (newlines, tabs, multiple spaces) into a single space.
    preprocessed_text = re.sub(r'\s+', ' ', scraped_text).strip()
    text_splitter = RecursiveCharacterTextSplitter(
        # Whitespace was collapsed above, so "\n\n" and "\n" never match here;
        # in practice the text is split on "." first and then on " ".
        separators=["\n\n", "\n", ".", " ", "...", "View More"], 
        chunk_size=750, 
        chunk_overlap=100,
        length_function=len,
        is_separator_regex=False,
    )
    return text_splitter.create_documents([preprocessed_text])

def general_recursive_split(docs):
    
    text_splitter = RecursiveCharacterTextSplitter(
        # Try to split by paragraph, then line, then sentence, then character
        separators=["\n\n", "\n", ".", " "], 
        chunk_size=1000, 
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    chunked_docs = text_splitter.split_documents(docs)
    print(f"Finished chunking. Total chunks created: {len(chunked_docs)}")
    return chunked_docs
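
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal standard-library sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class tries each separator in turn); `sliding_chunks` is a hypothetical helper that only cuts fixed-width windows. The sample text is made up. The sketch also shows that once whitespace is collapsed, no newline is left for the `"\n\n"` and `"\n"` separators to match.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines, tabs, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical helper: fixed-width windows sharing `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Made-up sample text standing in for a scraped page.
raw = "Headline one\n\nSome   body text.\n\tHeadline two\n\nMore body text."
clean = collapse_whitespace(raw)
chunks = sliding_chunks(clean, chunk_size=20, chunk_overlap=5)

print("\n" in clean)   # no newline survives the collapse
for chunk in chunks:
    print(repr(chunk))
```

Each chunk starts with the last five characters of the previous one, which is the role `chunk_overlap` plays in the splitters above: context that straddles a boundary appears in both neighbours.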



Prompt template for generating current-affairs study notes. The page is loaded with WebBaseLoader, and a simple sequential chain (prompt, model, parser) returns the result.

In [5]:

prompt = PromptTemplate(template="I have scraped a current affairs website and I want to study for the UPSC examination. As an expert, can you list today's current affairs, categorize them, and make a study note of them? This is the data I scraped from the website, data --> {text}")

chain = prompt | model | parser

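# Join the chunk texts so the prompt receives plain text, not the repr of a list of Documents.
web_text = "\n\n".join(doc.page_content for doc in web_scraper(url="https://currentaffairs.adda247.com/"))
result = chain.invoke({"text": web_text})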
print(result)

It's great that you're using scraped data to prepare for the UPSC exam! As an expert, I can help you organize and analyze this information. The provided text is a compilation of links and summaries from a current affairs website. While it doesn't contain the *actual* detailed news for today (November 20, 2025, as indicated by the dates), it provides a clear structure of the categories they cover, which is excellent for UPSC preparation.

Based on the scraped data, here's a breakdown of today's (November 20, 2025) current affairs, categorized for your UPSC study, along with study notes for each point.

**Important Note:** The data primarily lists headlines and brief mentions. For actual UPSC preparation, you would need to click on these headlines (or find similar news from other reliable sources) to get the detailed context, implications, and analysis.

---

## Today's Current Affairs (November 20, 2025) - UPSC Study Notes

Here's a categorization of the current affairs based on the pro

## Vector store

WebBaseLoader

In [6]:
web_chunks = web_scraper(url= "https://currentaffairs.adda247.com/")
                     
vector_store = Chroma(embedding_function=embeddings, collection_name= "todays_news", persist_directory= "data/chromadb")

vector_store.add_documents(documents=web_chunks)

['6fee77d4-9ec3-4a72-8f68-ed2b150a862b',
 'fffc7183-86f7-4ff5-ad61-c7446881af39',
 '88701661-7990-41a6-a98d-4a596afbb3d0',
 '828b19c3-5498-47af-a751-899ab7b13090',
 '4a4b3f9d-a12a-4361-bb71-55e3fa0fc304',
 '0fce62dd-eacd-4b49-a21d-41897283bd2c',
 '6a76c084-fd4d-4ed8-b377-80edd5b85a33',
 'c4aeca01-a52a-4228-a0b4-8f39592cc26a',
 '5ac8dad2-3b6c-4c97-800d-337fe79fa2ec',
 '810a7d8b-5f99-4952-ba9e-c4876ed39391',
 '1f466c93-1f22-4a3a-9bad-fe9cf5b6de1b',
 '16116938-39d4-49ce-a74d-28a8b433cc02',
 'bbb420d8-e997-4f12-a93e-f09381a2fb4e',
 'a94667bf-9f4f-4df9-9652-b69a1bb92a8f',
 '54b2bfdd-78e0-4fe4-9cff-5612f3e8b3a0',
 '9228cbbb-31a4-49ca-9955-407ecca199d6',
 'cb4fb7bb-3581-4771-b952-c2b7c418c244',
 '6e1fd401-1df6-42f1-8f0b-af81214159eb',
 'e9ca5cc0-da29-4f3f-8a32-dc32b59392df',
 '103dca18-4229-4c2c-abb4-724865592d02',
 '1b341cd7-0b33-42e2-b948-868a394e09f3',
 '5cda97c8-423b-4e02-b829-3268d4d686e6',
 '91c38ac2-4773-484b-b3de-0427abd401f9']

DirectoryLoader with PyMuPDFLoader

In [7]:
pdf_loader = DirectoryLoader(path="data/", glob="*.pdf",loader_cls= PyMuPDFLoader)

pdf_docs = pdf_loader.load()
                     
pdf_chunks = general_recursive_split(pdf_docs)

vector_store = Chroma(embedding_function=embeddings, collection_name= "books", persist_directory= "data/chromadb")

vector_store.add_documents(documents=pdf_chunks)

Finished chunking. Total chunks created: 865


['b0eaf239-dc7b-4189-9a2d-b481a7e2f94a',
 '26165c16-9ad2-487c-be0b-28fbbc61e6c8',
 '79438c35-2a0e-4e71-b727-19e2988b8d54',
 '14085a81-bd65-4340-994d-b39123d96595',
 '250cf118-4caa-41f8-9323-090a33c59260',
 '1afc9927-2e19-4db1-befb-f821f97bd8b5',
 'a11a3562-c2cf-40c9-a7cf-7b3e790c7190',
 '8c4ed936-86e2-4530-87cb-8baacb6b6d42',
 'b93d1363-9b73-450e-bf94-fb3e8230f28e',
 '4926e202-51c9-4315-ae1c-700fc0dcdf81',
 '7d802860-a6fd-4dd7-9691-e9cfe27544dd',
 '911c3016-a3ce-49c0-9dcd-9132c1a68526',
 '81ecb55f-95da-48ae-b90a-60766627106a',
 'ed1eddad-b73f-4e69-b053-25d3cb55e886',
 '8332b4da-d7ed-4f5b-bab7-5df68dd2b5aa',
 '63ec50d0-9865-43eb-8e1a-1948ffa63ad6',
 '9e5130c7-073b-4547-b6fb-1b1b48ad45c4',
 '24db3277-6a2e-459e-9977-1eb723e1c9a8',
 'a895562d-08ac-4e80-bb6d-61f7436c9ff4',
 '1eabddb1-6d8e-44bf-93ca-19a34269f960',
 'fed8c6af-7500-46a5-b867-d96ca9fa8e63',
 '3ab995ca-df44-4b5e-ac4f-f46cf62d1bf2',
 '03b89d95-b196-4dd9-88b0-7dbc015ca87e',
 '7a0f33d5-b1a8-4227-acd2-a836160fa3bc',
 'a0f9c693-4914-

In [8]:
vector_store.get(include=["embeddings", "metadatas"])


{'ids': ['bec4acb4-bec7-4f77-beae-2ee4774bb44c',
  '1eb5782a-ba80-488a-82cf-6c71e9845aba',
  '22e7e387-2acd-4d7f-af0b-4c41e01b462f',
  'cc714a31-ac98-42da-be64-d1b7faa30b0b',
  'bdbf2498-d2ff-487e-919b-ef0a583c6ebf',
  '392caea3-fb97-41fb-9af5-65d32296c2da',
  '9edd0b80-070b-4814-9724-b87316da7c08',
  '0e65246a-11a8-43e1-87b4-e4d2041ba520',
  '0777f6b3-69a4-4405-bcb1-793f44e053a9',
  '6b4b0c3d-26a0-45ff-9a6b-f361cdfdab31',
  '013d51d9-7b96-4f40-9aa7-cea831d17b08',
  '9f9269a9-a53b-4150-916a-7dffa2ec2bd7',
  'b380e5cb-4f54-49cf-a383-4ab684bca209',
  '4dc856fd-0568-4f9a-9434-366a54e55faa',
  '4c1049c5-f6ef-412e-b682-4d54d641a353',
  '270a638b-1baf-4fc5-8c7c-97e9be732ab6',
  '75048b64-079b-4950-8ccf-65e6678e474e',
  '9f800bd8-7635-4aa7-85ff-14ab568b0288',
  'c8b40c43-ede5-4635-873e-f4a56e7d2702',
  '37af5244-8891-4c06-bd64-469454e7d009',
  '4b9092ae-5bd3-4093-86a3-8a62e399c8aa',
  '8032960e-b2e1-42e4-bddf-11d1755ef1bc',
  '2d7b818e-c197-43a6-8347-ee8879b2088f',
  '76e4f8ad-8b55-4f4e-8c29-

In [9]:
vector_store.similarity_search(query= "help me with my habits?", k=5)

[Document(id='ac4c60b1-c275-438f-92ca-aeba318b05bb', metadata={'total_pages': 256, 'format': 'PDF 1.4', 'keywords': '', 'trapped': '', 'subject': '', 'page': 56, 'title': 'Atomic habits \\( PDFDrive.com \\).pdf', 'creationdate': '2020-04-30T18:46:22+00:00', 'file_path': 'data\\Atomic habits.pdf', 'creationDate': "D:20200430184622+00'00'", 'creator': 'calibre 3.48.0 [https://calibre-ebook.com]', 'source': 'data\\Atomic habits.pdf', 'producer': 'calibre 3.48.0 [https://calibre-ebook.com]', 'moddate': '', 'author': 'James Clear', 'modDate': ''}, page_content='may reduce stress right now (that’s how it’s serving you), but it’s not a healthy\nlong-term behavior.\nIf you’re still having trouble determining how to rate a particular habit, here is\na question I like to use: “Does this behavior help me become the type of person I\nwish to be? Does this habit cast a vote for or against my desired identity?”\nHabits that reinforce your desired identity are usually good. Habits that conflict\nwith

In [10]:
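# LangChain's Python API names the MMR diversity weight `lambda_mult`; `lambdaMult` is not a recognized argument.
retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 5, "lambda_mult": 1})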
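
The retriever above uses maximal marginal relevance (MMR). As a rough illustration of what the diversity weight does (called `lambda_mult` in LangChain's Python API), here is a minimal pure-Python sketch of greedy MMR selection. `cosine` and `mmr_select` are hypothetical helpers written for this note, and the two-dimensional vectors are made up; real embeddings have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up vectors: docs 0 and 1 are near-duplicates, doc 2 points elsewhere.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

pure_relevance = mmr_select(query, docs, k=2, lambda_mult=1.0)
more_diverse = mmr_select(query, docs, k=2, lambda_mult=0.3)
print(pure_relevance, more_diverse)
```

With these toy vectors, `lambda_mult=1.0` picks indices `[0, 1]` (the two near-duplicates), while `lambda_mult=0.3` picks `[0, 2]`. A weight of 1 therefore makes MMR behave like a plain similarity ranking over the fetched candidates.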


In [11]:
def format_docs(docs):
    # Separate retrieved chunks with a real blank line, not the literal characters "\n\n".
    context = "\n\n".join(doc.page_content for doc in docs)
    return context

query = "what engine oil to use for dominar 400?"

retriever_text = retriever.invoke(query)
context = format_docs(docs=retriever_text)
context

"WELCOME TO THE DOMINAR CLAN! \nYou are now the proud owner of a modern masterpiece, \nDominar 400 BS VI. The new Dominar 400 BS VI is designed \nto deliver unparalleled technology with superior performance. \nThis makes your Dominar, unbeatable and unchallenged on \nevery terrain. \nBefore you ride out, please read this Owner's Manual carefully \nand familiarise yourself with the operation mechanism, \ncontrols and maintenance requirements of your \nDominar 400 BS VI. This will ensure you a safe and trouble \nfree ownership experience. \nTo keep your bike in perfect running condition and deliver \nconsistent performance, we have specially programmed the \nperiodic maintenance services which includes 3 free services \nand subsequent paid services, as per the schedule contained \nin this booklet. We earnestly advise you to avail all these \nservices at any of our Bajaj Dealers, who \nare well equipped with all necessary facilities, genuine \nparts, oils and trained manpower to ensure th

In [40]:
# system_prompt = PromptTemplate(template="You are an intelligent chabot that answers based on the context provided, please answer the questions based on that, context --> {context}, question --> {question}", input_variables= ["context", "question"])

system_prompt = PromptTemplate(template="""
    You are an expert AI assistant specialized in providing answers
    based on the context provided in the 'CONTEXT' section below. If the user asks for a detailed answer, provide one grounded in that context; otherwise keep the answer brief.

    CONTEXT:
    --------------------
    {context}
    --------------------

    QUESTION:
    --------------------
    {question}
    --------------------
    """, input_variables= ["context", "question"])


### Building a RAG chain

In [41]:
retrieve_chain = RunnableParallel({"context": retriever | RunnableLambda(format_docs), "question": RunnablePassthrough()})
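
To show what this step hands to the prompt, here is a plain-Python sketch of the same fan-out idea. It does not use LangChain: `parallel` and `fake_retrieve_and_format` are hypothetical stand-ins, and unlike `RunnableParallel` the branches here simply run one after another.

```python
from typing import Callable


def parallel(branches: dict[str, Callable[[str], object]]) -> Callable[[str], dict[str, object]]:
    """Feed the same input to every branch and collect the results by key."""
    def run(value: str) -> dict[str, object]:
        return {key: branch(value) for key, branch in branches.items()}
    return run


def fake_retrieve_and_format(query: str) -> str:
    # Hypothetical stand-in for `retriever | RunnableLambda(format_docs)`.
    passages = ["passage about " + query, "another passage about " + query]
    return "\n\n".join(passages)


retrieve_step = parallel({
    "context": fake_retrieve_and_format,
    "question": lambda q: q,  # plays the role of RunnablePassthrough
})

prompt_inputs = retrieve_step("engine oil")
print(prompt_inputs)
```

The resulting dictionary has exactly the two keys the prompt template expects, `context` and `question`, which is why the raw query string can be passed straight to the chain.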

In [42]:
final_chain = retrieve_chain | system_prompt | model | parser
print(final_chain.get_graph().draw_ascii())

              +---------------------------------+           
              | Parallel<context,question>Input |           
              +---------------------------------+           
                    ****                ****                
                 ***                        ***             
               **                              ***          
+----------------------+                          **        
| VectorStoreRetriever |                           *        
+----------------------+                           *        
            *                                      *        
            *                                      *        
            *                                      *        
    +-------------+                         +-------------+ 
    | format_docs |                         | Passthrough | 
    +-------------+*                        +-------------+ 
                    ****                ****                
                        

In [43]:
query = "make a detailed note on Digestion and Transport of Nutrients ? "
print(final_chain.invoke(query))

Digestion and Transport of Nutrients involve several key processes within the body:

**Digestion:**

*   **Starch:** Digestion begins with starch.
*   **Proteins:** Trypsin partially digests proteins.
*   **Lipids:** Lipases completely digest lipids into fatty acids and glycerol.
*   **Small Intestine:** Digestive juices from the liver and pancreas aid in digestion here. Carbohydrases in the intestinal juice convert complex carbohydrates into simple sugars like glucose, fructose, and galactose. Proteases break down proteins into amino acids. The absorption of simple nutrients, water, vitamins, and minerals primarily occurs in the small intestine.
*   **Large Intestine:** Undigested food substances reach the large intestine. Here, remaining water and salts are absorbed. Certain bacteria in the large intestine produce Vitamin K and B complex. The large intestine then carries digestive waste to the rectum for expulsion through the anus.

**Transport of Nutrients:**

*   **From Small Intes

In [44]:
query = "what was the life changing incident of james clear? "
print(final_chain.invoke(query))

The life-changing incident for James Clear was when he was hit in the face with a baseball bat on the final day of his sophomore year of high school. This severe injury resulted in a broken nose, multiple skull fractures, and shattered eye sockets.


In [45]:
query = "suggest me which engine oil should i use for dominar 400?"
print(final_chain.invoke(query))

For your Dominar 400 BS VI, you should use **Bajaj 10000 10W50 BS6 compliant, JASO MA2 API SN, Fully Synthetic engine oil**. Using BS VI compliant engine oil will ensure a prolonged life for your catalytic converter.
