<a href="https://colab.research.google.com/github/jayshri/AIAgents/blob/main/SmartStudyBuddyAgent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Problem Statement

Business Problem

Imagine you're a student juggling multiple subjects, each with extensive lecture notes (provided
as text files or PDFs). It's challenging to quickly find specific information or revise concepts
across these varied documents. You need a "Smart Study Buddy" – a virtual assistant that can:

● Understand your questions about the course material.

● Retrieve relevant information only from your uploaded lecture notes.

● Maintain the context of your current study session, allowing for follow-up questions.

This tool aims to make studying more efficient and targeted.

ML Problem

To build the "Smart Study Buddy," we will develop a Retrieval Augmented Generation (RAG)
system. This system will:

● Data Sources (Documents): Ingest multiple local document files (e.g., .txt, .pdf) representing your lecture notes.

○ Hint: You'll need to use appropriate document loaders from LangChain for
different file types.

● Vector Store: Use Pinecone to store a searchable representation (embeddings) of your
lecture notes. This allows for efficient similarity search to find relevant context.

● LLM: Leverage a Large Language Model (LLM) to understand questions, process
retrieved context, and generate helpful answers.

● Conversational Memory (In-Session): Implement a mechanism to remember the
dialogue within the current study session, enabling coherent follow-up interactions.
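To make the retrieval step concrete, here is a minimal sketch of similarity search in plain Python. The three-dimensional vectors and the `cosine_similarity` and `top_k` helpers are made up for illustration. A vector store such as Pinecone does a comparable nearest-neighbour lookup, approximately and at scale, over real embeddings with hundreds of dimensions or more.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, reverse=True)[:k]

# Toy "embeddings" (hypothetical values, not produced by a real model).
store = [
    ("Supervised learning uses labeled data.", [0.9, 0.1, 0.0]),
    ("Photosynthesis happens in chloroplasts.", [0.0, 0.2, 0.9]),
    ("Reinforcement learning uses rewards.", [0.7, 0.6, 0.1]),
]

results = top_k([1.0, 0.2, 0.0], store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two machine-learning sentences rank above the unrelated one because their vectors point in nearly the same direction as the query vector.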

Project Tasks

The project has been broken down into the following key tasks:

1. Pinecone Setup

● You will need a Pinecone account.

● Initialize the Pinecone client in your notebook using your API key (older pod-based clients also required an environment name).

● Define a unique name for your Pinecone index.

● Write code to create the index if it doesn't already exist. Ensure you specify the correct
dimension for the embedding model you choose (e.g., OpenAI's
text-embedding-ada-002 has a dimension of 1536).

● Reference: Pinecone LangChain Integration
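The "create the index if it doesn't already exist" step could look roughly like the sketch below. It assumes a recent `pinecone` client that exposes `list_indexes().names()` and `create_index(...)`. The index name `study-buddy-notes`, the cloud, and the region are placeholders rather than values from this assignment. The real client is imported lazily so that the helper can be tried with a stand-in object.

```python
def ensure_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index only if it is missing; return True if it was created.

    `pc` is expected to behave like a `pinecone.Pinecone` client.
    """
    existing = pc.list_indexes().names()
    if name in existing:
        return False
    pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True


def connect_and_prepare(api_key, name="study-buddy-notes"):
    """Assumed usage with the real client (needs the `pinecone` package)."""
    from pinecone import Pinecone, ServerlessSpec  # imported lazily on purpose

    pc = Pinecone(api_key=api_key)
    # Cloud and region are placeholders; pick the ones your account supports.
    spec = ServerlessSpec(cloud="aws", region="us-east-1")
    # 1536 matches text-embedding-ada-002, as noted in the task description.
    ensure_index(pc, name, dimension=1536, spec=spec)
    return pc.Index(name)
```

Because `ensure_index` checks before creating, re-running the notebook should not fail on an index that already exists.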


2. Data Ingestion and Processing (Loading and Chunking)

● Create 2–3 sample lecture note files (e.g., subjectA_notes.txt,
subjectB_notes.pdf). You can fill them with dummy text or copy-paste some
educational content.

● Use appropriate LangChain document loaders to load the content from these files.
○ TextLoader for .txt files.
○ PyPDFLoader for .pdf files. (You might need to pip install pypdf)

● Chunk the loaded documents into smaller, manageable pieces. This is crucial for fitting
within the LLM's context window and for effective retrieval.
○ Consider using RecursiveCharacterTextSplitter or
CharacterTextSplitter. Experiment with chunk_size and
chunk_overlap.

○ References:
■ CharacterTextSplitter
■ RecursiveCharacterTextSplitter
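To see what chunk_size and chunk_overlap do, the sketch below splits a string into fixed-size, overlapping windows. It is a simplified stand-in, not LangChain's algorithm: RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and line breaks. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=800, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each piece starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.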


3. Embedding Generation and Vector Storage

● For each document chunk, generate an embedding (a numerical vector representation).

● You can use:
○ OpenAIEmbeddings (requires an OpenAI API key).
○ Sentence Transformer models from Hugging Face via
HuggingFaceEmbeddings (e.g., 'all-MiniLM-L6-v2').

● Store these chunks and their corresponding embeddings in your Pinecone index.
○ Use PineconeVectorStore.from_documents() for an easy way to
populate the index.
○ Ensure the embedding model used here matches the dimension specified during
Pinecone index creation.
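A dimension mismatch between the embedding model and the index usually surfaces only when vectors are uploaded, so it can help to check up front. The sketch below is one possible guard; the lookup table and the `check_dimension` helper are illustrative, and the sizes listed are the default output dimensions of the models named in this notebook.

```python
# Default output sizes of the embedding models named in this notebook.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "all-MiniLM-L6-v2": 384,
}

def check_dimension(model_name, index_dimension):
    """Raise early if the index was created for a different vector size."""
    expected = EMBEDDING_DIMENSIONS[model_name]
    if expected != index_dimension:
        raise ValueError(
            f"{model_name} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )
    return expected

print(check_dimension("text-embedding-3-small", 1536))
```

For example, an index created with dimension 1536 for an OpenAI model cannot be reused as-is with all-MiniLM-L6-v2, which produces 384-dimensional vectors.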

4. Conversational Question Answering Chain

● Implement a chain that can answer questions based on the retrieved context and
maintain conversation history for the current session.

● ConversationalRetrievalChain is a good high-level option.

● Alternatively, you can build a custom chain using a retriever from your Pinecone vector
store and integrating a memory module.

● Use a LangChain memory module like ConversationBufferMemory or
ConversationSummaryMemory to store the chat history of the current session.

● Reference: Adding Memory to QA Chains

● Create a PromptTemplate to guide the LLM. The prompt should instruct the LLM to:
○ Answer the user's question based only on the provided context (retrieved
chunks).
○ If the answer cannot be found in the provided context, explicitly state that the
information is not in the lecture notes.
○ Consider the chat history when formulating the current answer.
○ Reference: PromptTemplate
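The three prompt requirements above can be sketched with ordinary string formatting, which is close to what a PromptTemplate does when it fills in its variables. The wording of `TEMPLATE` and the `build_prompt` helper are illustrative assumptions, not a prescribed prompt.

```python
TEMPLATE = (
    "Answer using ONLY the context below.\n"
    "If the answer is not in the context, say: "
    "\"That is not in the lecture notes.\"\n\n"
    "Chat history:\n{chat_history}\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(question, context_chunks, history):
    """Fill the template from retrieved chunks and (role, text) history pairs."""
    history_text = "\n".join(f"{role}: {text}" for role, text in history) or "(none)"
    context_text = "\n---\n".join(context_chunks) or "(no context retrieved)"
    return TEMPLATE.format(
        chat_history=history_text,
        context=context_text,
        question=question,
    )

prompt = build_prompt(
    question="Tell me more about Type A.",
    context_chunks=["Type A systems retrieve once before generating."],
    history=[
        ("Human", "What are the main types of RAG systems?"),
        ("Assistant", "Type A, Type B, Type C"),
    ],
)
print(prompt)
```

Including the earlier turns is what lets the model resolve "Type A" in the follow-up question.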


5. Testing and Demonstration

● In your notebook, instantiate and interact with your "Smart Study Buddy."
● Test Case 1: Ask an initial question whose answer is clearly in one of your documents.
● Test Case 2: Ask a follow-up question that relies on the context established by the
previous question-answer pair.
○ Example:
■ User: "What are the main types of RAG systems?"
■ Bot: "Type A, Type B, Type C"

■ User: "Tell me more about Type A."

● Test Case 3: Ask a question for which the answer is not present in your documents to
verify the "I don't know" or "not in notes" behavior.
● Test Case 4 (Optional): Restart your notebook kernel and run the interaction part again.
The chatbot should not remember the conversation from the previous run (as we are not
implementing persistent user memory), but it should still be able to answer questions
based on the documents already indexed in Pinecone.
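The test cases above can be scripted instead of typed by hand. The sketch below assumes an `ask(question)` callable that returns the answer text and a fallback phrase that your prompt instructs the model to use; both names, and the `fake_ask` stand-in, are hypothetical.

```python
FALLBACK = "not in the lecture notes"

def run_test_cases(ask, cases):
    """Check that the fallback phrase appears exactly when it is expected."""
    report = []
    for question, expect_fallback in cases:
        answer = ask(question)
        got_fallback = FALLBACK in answer.lower()
        report.append((question, got_fallback == expect_fallback))
    return report

def fake_ask(question):
    # Stand-in for the real chain: it only "knows" about RAG.
    if "rag" in question.lower():
        return "Type A, Type B, Type C"
    return "That is not in the lecture notes."

cases = [
    ("What are the main types of RAG systems?", False),
    ("What is the capital of France?", True),
]
report = run_test_cases(fake_ask, cases)
for question, passed in report:
    print("PASS" if passed else "FAIL", "-", question)
```

With a real chain, `ask` would wrap the chain call and return its answer string. Follow-up questions (Test Case 2) need the cases to be run in order against the same session.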

Extra Requirements (Focus on In-Session Conversation)

The core requirement is to make the system conversational *within a single interaction session*.
The system should understand follow-up questions that implicitly refer to previous turns in the
current dialogue.

Example Interaction:
● User: "What were the key topics covered in the Machine Learning lecture?"
● System: (Retrieves from notes) "The key topics included Supervised Learning,
Unsupervised Learning, and Reinforcement Learning."
● User: "Can you explain Supervised Learning in more detail based on the notes?"
○ (The system should understand "Supervised Learning" refers to the one
mentioned in its previous response and use the notes to elaborate.)

This is achieved using LangChain's memory components, ensuring the `chat_history` is passed
appropriately.
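Conceptually, in-session memory is just a list of earlier turns that is handed to the model with every new question. The sketch below shows that idea without LangChain; `SessionMemory`, `ask_with_memory`, and `fake_answer_fn` are hypothetical names, and the fake answer function stands in for retrieval plus the LLM call.

```python
class SessionMemory:
    """Keeps (question, answer) turns for the current session only."""

    def __init__(self):
        self.turns = []

    def add(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def ask_with_memory(question, memory, answer_fn):
    """Pass the history so far to answer_fn, then record the new turn."""
    answer = answer_fn(question=question, chat_history=memory.as_text())
    memory.add(question, answer)
    return answer


seen_histories = []

def fake_answer_fn(question, chat_history):
    # Stand-in for retrieval + LLM; it records the history it was given.
    seen_histories.append(chat_history)
    return f"(answer to: {question})"

memory = SessionMemory()
ask_with_memory("What were the key topics?", memory, fake_answer_fn)
ask_with_memory("Explain Supervised Learning.", memory, fake_answer_fn)
print(seen_histories[1])
```

The second call receives the first question and answer as history. A fresh `SessionMemory()` starts empty, which mirrors what happens after a kernel restart.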

In [1]:
pip install langchain langchain-classic langchain-community langchain-core langchain-openai langchain-huggingface langchain-pinecone pinecone pypdf python-dotenv



In [2]:
pip -q install -U langchain langchain-core langchain-community langchain-openai langchain-pinecone

In [3]:
pip install -qU langchain-text-splitters

In [8]:
import os

from google.colab import drive, userdata
from langchain_classic.chains import ConversationalRetrievalChain
from langchain_classic.memory import ConversationBufferMemory
from langchain_classic.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_documents(docs):
  '''Split the loaded documents into small, overlapping chunks.

     Small chunks keep the retrieved context within the LLM's context window
     and make similarity search more precise.
     Uses LangChain's RecursiveCharacterTextSplitter.
  '''
  splitter = RecursiveCharacterTextSplitter(
    chunk_size = 800,
    chunk_overlap = 200,
  )
  chunks = splitter.split_documents(docs)
  print(f"After chunking :  {len(chunks)} chunks")
  return chunks

def load_documents(paths : list[str]):
  # Load the files from Google Drive using LangChain's TextLoader and PyPDFLoader.
  all_docs = []
  for path in paths:
    if path.endswith(".txt"):
      loader = TextLoader(path, encoding="utf-8")
    elif path.endswith(".pdf"):
      loader = PyPDFLoader(path)
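    else:
      # Unsupported file type: report it instead of skipping silently.
      print(f"Skipping unsupported file: {path}")
      continue
    docs = loader.load()
    # Attach extra metadata to each document so chunks can be traced back to their file.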
    for doc in docs:
      doc.metadata["file_name"] = os.path.basename(path)
      doc.metadata["source_path"] = path
    all_docs.extend(docs)

  return all_docs

def setup_environment():
  os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
  os.environ["PINECONE_API_KEY"] = userdata.get("PINECONE_API_KEY")
  os.environ["PINECONE_INDEX_NAME"] = "demo"

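def store_chunks_in_pinecone(chunks):
  # The index named in PINECONE_INDEX_NAME must already exist with dimension
  # 1536, the default output size of text-embedding-3-small.
  embeddings = OpenAIEmbeddings(model="text-embedding-3-small")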
  vectorstore = PineconeVectorStore.from_documents(
      documents = chunks,
      embedding=embeddings,
      index_name=os.environ["PINECONE_INDEX_NAME"],
      namespace="ai-agents",
  )
  return vectorstore

def build_chain(vectorstore):
  llm_model = "gpt-4o-mini"
  llm = ChatOpenAI(model=llm_model, temperature=0)

  retriever = vectorstore.as_retriever(search_kwargs={"k" : 4})

  memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
  )
  qa_prompt = PromptTemplate(
    input_variables=["context", "question", "chat_history"],
    template=(
        "You are Smart Study Buddy. You answer questions using ONLY the lecture notes context provided.\n\n"
        "Rules:\n"
        "1) Use ONLY the context below. Do not use outside knowledge.\n"
        "2) If the answer is not in the context, say: \"That is not in the lecture notes.\" (and stop)\n"
        "3) Use chat history to understand follow-up references (e.g., \"Type A\", \"that\", \"it\").\n\n"
        "Chat History (current session):\n"
        "{chat_history}\n\n"
        "Retrieved Lecture Notes Context:\n"
        "{context}\n\n"
        "User Question:\n"
        "{question}\n\n"
        "Answer (based only on context):"
      ),
    )
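  # No custom get_chat_history: with return_messages=True the memory returns
  # message objects, and the chain's default formatter turns them into
  # role-prefixed text lines before they reach the prompt.
  chain = ConversationalRetrievalChain.from_llm(
      llm=llm,
      retriever=retriever,
      memory=memory,
      combine_docs_chain_kwargs={"prompt": qa_prompt},
      verbose=False,
  )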
  return chain

def interactive_chat(chain):
    print("\n=== Interactive Chat ===")
    print("Type 'exit' to quit.\n")
    while True:
        q = input("You: ").strip()
        if not q:
            continue
        if q.lower() in {"exit", "quit"}:
            break
        result = chain.invoke({"question": q})
        print("Buddy:", result["answer"])
        print()

def main():
  drive.mount('/content/drive')
  txt_path = "/content/drive/MyDrive/IK-assignments/subjectA_notes.txt"
  pdf_path = "/content/drive/MyDrive/IK-assignments/subjectB_notes.pdf"

  docs = load_documents([txt_path, pdf_path])

  chunks = chunk_documents(docs)

  setup_environment()

  vectorstore = store_chunks_in_pinecone(chunks)
  print(f"Stored {len(chunks)} chunks in the vector store")
  chain = build_chain(vectorstore)
  interactive_chat(chain)

main()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
After chunking :  39 chunks


KeyboardInterrupt: 
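One thing to watch in the run above: main() embeds and uploads every chunk each time it is executed. If the vector store assigns random IDs when none are given, repeated runs would accumulate duplicate vectors in the index. A possible mitigation is to derive a deterministic ID for each chunk so that a re-run overwrites instead of duplicating. The sketch below only shows the ID derivation; `stable_chunk_id` is a hypothetical helper, and passing the resulting list as `ids=` to `PineconeVectorStore.from_documents` is an assumption to verify against your installed langchain-pinecone version.

```python
import hashlib

def stable_chunk_id(source_path, position, text):
    """Derive a repeatable ID from where the chunk came from and what it says."""
    key = f"{source_path}|{position}|{text}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]

# Hypothetical chunks as (source_path, text) pairs.
chunks = [
    ("subjectA_notes.txt", "Supervised learning uses labeled data."),
    ("subjectA_notes.txt", "Unsupervised learning finds structure."),
]
ids = [stable_chunk_id(path, i, text) for i, (path, text) in enumerate(chunks)]
for chunk_id in ids:
    print(chunk_id)
```

For Test Case 4, where the notes are already indexed, another option is to skip ingestion entirely and connect to the existing index, for example with `PineconeVectorStore.from_existing_index`.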