In [None]:
%%capture --no-stderr
%pip install langchain langchain-community langchain-text-splitters langchain-openai langchain-chroma langchain-tavily wikipedia gradio arxiv pymupdf pypdf

In [None]:
from getpass import getpass
import os

api_keys = ["OPENAI_API_KEY", "TAVILY_API_KEY"]
for key in api_keys:
    os.environ[key] = getpass(f"Enter your {key}:")

Enter your OPENAI_API_KEY:··········
Enter your TAVILY_API_KEY:··········


# Lab 1: Build a simple RAG QA Chatbot

In [None]:
# Ingest the documents
# Create text chunks
# Embed & store the chunks in the vector db
# Set up chat backend & create function for basic chat loop
# Set up gradio UI

## Document Ingestion

In this section, we download the five most recently submitted arXiv papers matching the query "AI". We then load each PDF into memory as a list of page-level documents, ready for chunking and embedding.

In [None]:
# Ingest documents from Arxiv
import arxiv
from langchain_community.document_loaders import PyPDFLoader

In [None]:
# Search for the 5 latest papers on AI
search = arxiv.Search(
    query="AI",
    max_results=5,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

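# `Search.results()` is deprecated in recent `arxiv` releases; fetch results through a Client instead
client = arxiv.Client()
results = list(client.results(search))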

# Download the papers
for result in results:
    print(f"Downloading {result.title}...")
    result.download_pdf()
    print("Done.")

print("All papers downloaded.")


Downloading Achieving Hilbert-Schmidt Independence Under Rényi Differential Privacy for Fair and Private Data Generation...
Done.
Downloading Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval...
Done.
Downloading Reasoning-Intensive Regression...
Done.
Downloading Operational Validation of Large-Language-Model Agent Social Simulation: Evidence from Voat v/technology...
Done.
Downloading From Drone Imagery to Livability Mapping: AI-powered Environment Perception in Rural China...
Done.
All papers downloaded.


In [None]:
def load_pdfs_from_directory(directory_path):
    """
    Loads all PDF files from a directory and returns a dictionary
    where keys are document titles and values are lists of pages.
    """
    pdf_documents = {}
    for filename in os.listdir(directory_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(directory_path, filename)
            try:
                loader = PyPDFLoader(file_path)
                # load() returns one Document per page; load_and_split() would also re-split long pages
                pages = loader.load()
                pdf_documents[filename] = pages
            except Exception as e:
                print(f"Error loading {filename}: {e}")
    return pdf_documents

# Example usage (assuming your PDFs are in /content)
pdf_data = load_pdfs_from_directory("/content")

In [None]:
pdf_data.keys()

dict_keys(['2508.21740v1.Operational_Validation_of_Large_Language_Model_Agent_Social_Simulation__Evidence_from_Voat_v_technology.pdf', '2508.21788v1.Going_over_Fine_Web_with_a_Fine_Tooth_Comb__Technical_Report_of_Indexing_Fine_Web_for_Problematic_Content_Search_and_Retrieval.pdf', '2508.21815v1.Achieving_Hilbert_Schmidt_Independence_Under_Rényi_Differential_Privacy_for_Fair_and_Private_Data_Generation.pdf', '2508.21762v1.Reasoning_Intensive_Regression.pdf', '2508.21738v1.From_Drone_Imagery_to_Livability_Mapping__AI_powered_Environment_Perception_in_Rural_China.pdf'])

## Break each document down into "chunks"

**Explanation & Motivation**

Now that we have our documents ingested and loaded into memory, we can begin the process of breaking them down into chunks.

Chunking limits how much context is fed into a language model at inference time, which matters most when the source documents are large.

A second goal is to keep related content together within each chunk. Language models tend to struggle to pick out details from long blocks of text, so the retrieval phase of our RAG pipeline should return only the chunks most relevant to the given query.
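To make the roles of chunk size and overlap concrete, here is a minimal sketch of a fixed-size sliding-window splitter in plain Python. The function `chunk_text` and the sample string are hypothetical and exist only for illustration. This is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on natural separators such as paragraph and line breaks, whereas this sketch cuts at fixed character offsets.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return pieces


# Hypothetical sample input: the alphabet, split into windows of 10 with 4 shared characters
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(demo_chunks)
```

Each window repeats the last few characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk. Larger overlaps reduce the risk of splitting a key detail but increase the number of chunks to embed and store.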

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# Set up your text_splitter here
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=50
) # Implement the text splitter using RecursiveCharacterTextSplitter

In [None]:
# Create chunks for each document. Remember, each PDF is stored as a list of page-level Document objects.
# You can try using the `transform_documents` method to split a whole list of documents in one call

chunks = {
    title: text_splitter.transform_documents(docs)
    for title, docs in pdf_data.items()
}

Now let's observe the results of our chunking:

In [None]:
doc_titles = list(chunks.keys())

for title in doc_titles:
  num_pages = len(pdf_data[title])
  num_chunks = len(chunks[title])
  print(f"The document, {title.split('.')[-2]}, has {num_pages} pages that are split into {num_chunks} chunks.")

The document, Operational_Validation_of_Large_Language_Model_Agent_Social_Simulation__Evidence_from_Voat_v_technology, has 28 pages that are split into 375 chunks.
The document, Going_over_Fine_Web_with_a_Fine_Tooth_Comb__Technical_Report_of_Indexing_Fine_Web_for_Problematic_Content_Search_and_Retrieval, has 28 pages that are split into 369 chunks.
The document, Achieving_Hilbert_Schmidt_Independence_Under_Rényi_Differential_Privacy_for_Fair_and_Private_Data_Generation, has 27 pages that are split into 454 chunks.
The document, Reasoning_Intensive_Regression, has 28 pages that are split into 395 chunks.
The document, From_Drone_Imagery_to_Livability_Mapping__AI_powered_Environment_Perception_in_Rural_China, has 37 pages that are split into 357 chunks.


## Embed and store the chunks

Next, we'll use an embedding model to create vector representations of our chunks and store them in our vector database. During the retrieval phase of our RAG system, we run a similarity search over these vectors to find the chunks most relevant to the given question.
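As a rough illustration of what similarity search means, the sketch below ranks a few toy vectors against a query vector using cosine similarity. Everything here is made up for illustration: the three-dimensional vectors, the chunk labels, and the helper names `cosine_similarity` and `top_k`. Real embeddings have far more dimensions, and the distance metric a vector store actually uses depends on how it is configured, so treat cosine similarity as one common choice rather than a description of this notebook's setup.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": made-up 3-dimensional vectors standing in for real embeddings
toy_store = {
    "chunk about privacy": [0.9, 0.1, 0.0],
    "chunk about drones": [0.0, 0.2, 0.9],
    "chunk about regression": [0.1, 0.9, 0.1],
}
toy_query = [0.8, 0.3, 0.0]  # made-up embedding of a privacy-flavoured question

nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```

The query vector points mostly in the same direction as the privacy chunk, so that chunk ranks first. A vector database performs the same kind of ranking, but with indexing structures that avoid comparing the query against every stored vector.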

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

In [None]:
# Initialize embeddings from OpenAI
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [None]:
vector_store = Chroma(
    collection_name="lc-demo",
    embedding_function=embeddings,
    persist_directory="/content/lc-vector-store"
)

Now that we have initialized our embeddings and vector store, we're ready to embed and load our documents in:

In [None]:
for title, chunk_sequence in chunks.items():
  # Add the method to add documents to your vector_store here
  vector_store.add_documents(documents=chunk_sequence)
  print(f"Added chunks for {title.split('.')[2]}")

Added chunks for Operational_Validation_of_Large_Language_Model_Agent_Social_Simulation__Evidence_from_Voat_v_technology
Added chunks for Going_over_Fine_Web_with_a_Fine_Tooth_Comb__Technical_Report_of_Indexing_Fine_Web_for_Problematic_Content_Search_and_Retrieval
Added chunks for Achieving_Hilbert_Schmidt_Independence_Under_Rényi_Differential_Privacy_for_Fair_and_Private_Data_Generation
Added chunks for Reasoning_Intensive_Regression
Added chunks for From_Drone_Imagery_to_Livability_Mapping__AI_powered_Environment_Perception_in_Rural_China


## Create the chain to invoke the LLM

In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage
from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from typing import List, Dict, Any
import time

In [None]:
llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0) # Set up the chat model here

In [None]:
# You can use this as a base prompt and modify it if you feel you need to
system_prompt = """
You are a helpful chatbot that answers questions about the subject of AI
based on ONLY the context provided to you. Do not use any other context.
"""

prompt_template = ChatPromptTemplate.from_messages([
    SystemMessage(content=system_prompt),
    MessagesPlaceholder(variable_name="chat_history"),
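    # Label the question and the retrieved context so the model can tell them apart
    ("human", "Question: {question}\n\nContext:\n{context}")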
])

chain = prompt_template | llm | StrOutputParser() # Set up chain here

In [None]:
chain.invoke({"question": "What is the meaning of life, the universe, and everything?", "context": "", "chat_history": []})

'The context provided does not include information about the meaning of life, the universe, and everything. Therefore, I am unable to provide an answer based on the given information.'

In [None]:
# Build the retriever once rather than on every call
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

def arxiv_chat(question: str, history: List[Dict[str, Any]]):
  # Retrieve the most relevant chunks and join their text into a single context string
  docs = retriever.invoke(question)
  context = "\n\n".join(doc.page_content for doc in docs)
  # Invoke the `chain` we created before with the keys `question`, `context`, and `chat_history`
  response = chain.invoke({"question": question, "context": context, "chat_history": history})
  return response
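The retriever returns a list of document objects rather than plain text, so the chunks generally need to be turned into a readable string before they reach the prompt. The sketch below shows one possible way to do that, with numbered entries, a source label, and a simple character budget. `FakeDocument`, `format_context`, and `max_chars` are hypothetical names introduced for illustration. The sketch assumes each retrieved document exposes `page_content` and a `metadata` dictionary that may contain a `source` key.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs, max_chars: int = 2000) -> str:
    """Join retrieved chunks into one labelled string, stopping at a character budget."""
    parts = []
    used = 0
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        entry = f"[{i}] (source: {source})\n{doc.page_content.strip()}"
        if used + len(entry) > max_chars:
            break  # adding this chunk would exceed the budget
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


# Made-up documents standing in for retriever output
demo_docs = [
    FakeDocument("First chunk. ", {"source": "a.pdf"}),
    FakeDocument("Second chunk."),
]
demo_context = format_context(demo_docs)
print(demo_context)
```

Numbering the entries and naming their sources makes it easier to ask the model to cite which chunk an answer came from, and the budget keeps a large `k` from inflating the prompt.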



## Set up the Gradio Chat UI

In [None]:
import gradio as gr

In [None]:
gr.ChatInterface(fn=arxiv_chat, title="ArXiv Chat", type="messages").launch(debug=True)

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://435d90dbf5bf2dad4e.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
