## Document Loader using LangChain

In [1]:
pip install langchain

Note: you may need to restart the kernel to use updated packages.


### PDF Loader

In [8]:
import os
from dotenv import load_dotenv

load_dotenv()

groq_api_key = os.getenv("GROQ_API_KEY_temp_1")
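
`os.getenv` returns `None` when the variable is missing, and that tends to surface only later as an authentication error. A minimal sketch of failing fast instead; `require_env` is a hypothetical helper, not part of `python-dotenv` or `groq`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file")
    return value


# Usage, assuming load_dotenv() has already run:
# groq_api_key = require_env("GROQ_API_KEY_temp_1")
```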

In [None]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("C:/Users/KUSHAL/OneDrive/Desktop/Notebook LM/thebook_machine_learning.pdf")
pages = loader.load()
len(pages)

234

In [19]:
page = pages[3]
print(page.page_content[:500])

published by the press syndicate of the university of cambridge
The Pitt Building, Trumpington Street, Cambridge, United Kingdom
cambridge university press
The Edinburgh Building, Cambridge CB2 2RU, UK
40 West 20th Street, New York, NY 10011–4211, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
Ruiz de Alarc´ on 13, 28014 Madrid, Spain
Dock House, The Waterfront, Cape Town 8001, South Africa
http://www.cambridge.org
c⃝Cambridge University Press 2008
This book is in copyright. Subj


In [20]:
page.metadata

{'producer': 'pdfTeX-1.40.10',
 'creator': 'LaTeX with hyperref package',
 'creationdate': '2010-10-01T15:47:05-07:00',
 'author': 'AlexJ.SmolaandVishyS.V.N.Vishwanathan',
 'title': 'AnIntroductiontoMachineLearning',
 'subject': '',
 'keywords': '',
 'moddate': '2010-10-01T15:47:05-07:00',
 'trapped': '/False',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.1415926-1.40.10-2.2 (TeX Live/MacPorts 2009_6) kpathsea version 5.0.0',
 'source': 'C:/Users/KUSHAL/OneDrive/Desktop/Notebook LM/thebook_machine_learning.pdf',
 'total_pages': 234,
 'page': 3,
 'page_label': 'iv'}

In [84]:
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator = '\n',
    chunk_size = 2000,
    chunk_overlap = 200,
    length_function = len
)

In [85]:
docs = text_splitter.split_documents(pages)
len(docs)

299

In [86]:
len(pages)

234

In [88]:
from langchain_community.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/NotionDB")
notion_db = loader.load()

In [90]:
docs = text_splitter.split_documents(notion_db)

In [91]:
len(notion_db)

0

In [92]:
len(docs)

0
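
Both counts are 0, which suggests the loader found nothing to read at `docs/NotionDB`. To my understanding `NotionDirectoryLoader` collects the `.md` files of a Notion export under the given path, so a quick check before loading could look like the sketch below. `count_markdown_files` is a hypothetical helper, and the `.md` pattern is an assumption about how the export is laid out:

```python
from pathlib import Path


def count_markdown_files(directory: str) -> int:
    """Count .md files under `directory`; return 0 if the directory does not exist."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(1 for _ in root.glob("**/*.md"))


# A result of 0 here would explain the empty notion_db above.
print(count_markdown_files("docs/NotionDB"))
```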

## Splitter Example

In [46]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter

chunk_size = 26
chunk_overlap = 4

In [47]:
r_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
text_1 = 'abcdefghijklmnopqrstuvwxyz'
text_2 = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'


In [50]:
r_splitter.split_text(text_1)


['abcdefghijklmnopqrstuvwxyz']

In [53]:
r_splitter.split_text(text_2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefghijklmnopqrstuv', 'stuvwxyz']
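
The three chunks above can be reproduced with a plain sliding window: each chunk starts `chunk_size - chunk_overlap` characters after the previous one. This is only a sketch of the overlap arithmetic for text without separators; it is not how `RecursiveCharacterTextSplitter` is implemented, and `sliding_window_chunks` is a hypothetical name:

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(sliding_window_chunks(alphabet, 26, 4))      # one chunk, like text_1
print(sliding_window_chunks(alphabet * 2, 26, 4))  # three chunks, like text_2
```

With a size of 26 and an overlap of 4 the window advances 22 characters at a time, which is why the second chunk starts at `w` and the third at `s`.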

In [60]:
text_3 = "a b c, d e f g h, i ,j k l m ,n o p q r s t u v w ,x y z"
r_splitter.split_text(text_3)

['a b c, d e f g h, i ,j k l', 'k l m ,n o p q r s t u v', 'u v w ,x y z']

In [64]:
c_splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, separator=",")
c_splitter.split_text(text_3)

['a b c, d e f g h, i', 'i ,j k l m', 'n o p q r s t u v w ,x y z']

In [77]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

len(some_text)

c_splitter = CharacterTextSplitter(
    chunk_size = 450,
    chunk_overlap = 0,
    separator = ' '
)

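r_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 150,
    chunk_overlap = 0,
    # Raw string avoids an invalid-escape warning for "\.".
    # Without is_separator_regex=True the separators are matched literally,
    # so the lookbehind pattern below never matches this text.
    separators= ['\n\n', '\n', r"(?<=\. )", ' ', '']
)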

In [78]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [79]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
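
The chunks above break mid-sentence ("For example," and "embedded in this"), so the `(?<=\. )` separator did not take effect in this run. What that pattern is meant to do can be seen with the standard `re` module alone. A small sketch, assuming Python 3.7+ (where `re.split` accepts zero-width matches); the sample string is taken from `some_text`:

```python
import re

sample = (
    "Paragraphs form a document. "
    "Sentences have a period at the end, but also, have a space."
    "and words are separated by space."
)

# The lookbehind matches the empty position right after ". ",
# so the period and the space stay attached to the preceding sentence.
sentences = re.split(r"(?<=\. )", sample)
for sentence in sentences:
    print(repr(sentence))
```

Only one split happens here: "space.and" has no space after the period, so the pattern does not match there. Passing `is_separator_regex=True` to `RecursiveCharacterTextSplitter` should make it apply the same pattern as a regex, though the resulting chunks would differ from the output shown above.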

## Task 1 

In [93]:
from langchain_community.vectorstores import Chroma

In [95]:
persist_directory = 'docs/chroma/'

In [109]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("C:/Users/KUSHAL/OneDrive/Desktop/Notebook LM/BLOGS", glob="**/*.pdf", recursive=True)
docs = loader.load()


In [110]:
unique_paths = {doc.metadata["source"] for doc in docs}
num_files = len(unique_paths)
print(f"Loaded content from {num_files} unique PDF files")
print("These files:", unique_paths)

Loaded content from 11 unique PDF files
These files: {'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\The_Edge_Revolution_Key_Trends_in_Edge_Computing_for_2025_and_Beyond.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\Navigating_the_Evolving_Landscape_Top_Cybersecurity_Trends_for_2025.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\The_Quantum_Leap_Emerging_Trends_in_Quantum_Computing_for_Scientific_and_Industrial_Applications.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\Green_Horizons_The_Rise_of_Sustainable_Technology_in_2025_and_Beyond.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\The_Future_is_Now_Dominant_Cloud_Computing_Trends_of_2025.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\The_Ubiquitous_Ascent_of_Artificial_Intelligence_in_Daily_Life.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\Beyond_the_Hype_Key_Blockchain_Technology_Trends_for_2025-2030.pdf', 'C:\
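
`PyPDFDirectoryLoader` returns one document per page, so counting pages per file is a quick sanity check on top of the unique-path count. A minimal sketch using stand-in metadata; the file names below are made up, and in the notebook the list would be `[doc.metadata for doc in docs]`:

```python
from collections import Counter
from pathlib import PureWindowsPath

# Hypothetical metadata records shaped like doc.metadata from the loader.
metadatas = [
    {"source": "C:\\BLOGS\\example_a.pdf", "page": 0},
    {"source": "C:\\BLOGS\\example_a.pdf", "page": 1},
    {"source": "C:\\BLOGS\\example_b.pdf", "page": 0},
]


def pages_per_file(records):
    """Count how many page-level records each source file contributed."""
    return Counter(PureWindowsPath(r["source"]).name for r in records)


for name, n_pages in pages_per_file(metadatas).items():
    print(f"{name}: {n_pages} page(s)")
```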

In [106]:
pip install sentence-transformers



In [111]:
pip install chromadb



In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Required installation instructions:
# pip install sentence-transformers chromadb

# Load PDFs
directory_path = "C:/Users/KUSHAL/OneDrive/Desktop/Notebook LM/BLOGS"
loader = PyPDFDirectoryLoader(directory_path, glob="**/*.pdf", recursive=True)
docs = loader.load()

# Confirm loaded files
unique_paths = {doc.metadata["source"] for doc in docs}
print(f"Loaded content from {len(unique_paths)} unique PDF files")
print("These files:", unique_paths)

# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
print(f"Total chunks: {len(chunks)}")

# Create sentence-transformers embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Store embeddings in Chroma
persist_directory = "chroma_store"
vectorstore = Chroma.from_documents(documents=chunks, embedding=embedding_model, persist_directory=persist_directory)
# Chroma 0.4+ writes to persist_directory automatically, so no explicit persist() call is needed.
print("Embedding and storage complete.")

# Query with MMR, then keep the first k distinct chunks
def query_docs(query, k=3, fetch_k=15):
    # Request fetch_k MMR results (more than k) so de-duplication still leaves k chunks
    raw_results = vectorstore.max_marginal_relevance_search(query, k=fetch_k)
    seen = set()
    unique_results = []

    for doc in raw_results:
        # Identical text from the same file counts as a duplicate
        key = (doc.metadata.get("source", ""), hash(doc.page_content))
        if key not in seen:
            unique_results.append(doc)
            seen.add(key)
        if len(unique_results) == k:
            break

    print("\nTop Unique Results (with diversity):")
    for i, doc in enumerate(unique_results, start=1):
        print(f"\nResult {i}:")
        print("Source:", doc.metadata.get("source", "Unknown"))
        print("Content:", doc.page_content[:500])

# Example query
query_docs("Explain the role of neural networks in deep learning, and how they differ from traditional machine learning algorithms or data science approaches")


Loaded content from 11 unique PDF files
These files: {'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\The_Edge_Revolution_Key_Trends_in_Edge_Computing_for_2025_and_Beyond.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\Navigating_the_Evolving_Landscape_Top_Cybersecurity_Trends_for_2025.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\The_Quantum_Leap_Emerging_Trends_in_Quantum_Computing_for_Scientific_and_Industrial_Applications.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\Green_Horizons_The_Rise_of_Sustainable_Technology_in_2025_and_Beyond.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\The_Future_is_Now_Dominant_Cloud_Computing_Trends_of_2025.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\The_Ubiquitous_Ascent_of_Artificial_Intelligence_in_Daily_Life.pdf', 'C:\\Users\\KUSHAL\\OneDrive\\Desktop\\Notebook LM\\BLOGS\\Beyond_the_Hype_Key_Blockchain_Technology_Trends_for_2025-2030.pdf', 'C:\
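
The `max_marginal_relevance_search` call in the cell above trades relevance to the query against redundancy with results already chosen. A dependency-free sketch of that selection rule on toy 2-D vectors; `cosine`, `mmr`, and the vectors are illustrative and are not LangChain's implementation:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k vectors chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected vector (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
toy_vectors = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]  # index 1 nearly duplicates index 0

print(mmr(query_vec, toy_vectors, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query_vec, toy_vectors, k=2, lambda_mult=0.3))  # weights diversity more
```

With `lambda_mult=1.0` the near-duplicate at index 1 is picked second; with `lambda_mult=0.3` it is penalised and index 2 is picked instead.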

## Task 2

In [9]:
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain_groq import ChatGroq
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import TokenTextSplitter
from langchain_community.document_loaders import PyPDFLoader
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
GROQ_API_KEY = os.getenv("GROQ_API_KEY_temp_1")

# Load the PDF
loader = PyPDFLoader("C:/Users/KUSHAL/OneDrive/Desktop/Notebook LM/thebook_machine_learning.pdf")
pages = loader.load()

# Create embeddings instance
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Function to build retriever using Chroma
def build_retriever(chunks, collection_name):
    db = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=collection_name,
        persist_directory=f"./chroma_store_{collection_name}"
    )
    retriever = db.as_retriever()
    return retriever

# Function to split and return chunks for each chunk size
def get_chunks(chunk_size):
    splitter = TokenTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=0
    )
    return splitter.split_documents(pages)

# Build chunked documents and retrievers
chunks_256 = get_chunks(256)
chunks_512 = get_chunks(512)
chunks_1024 = get_chunks(1024)
retriever_256 = build_retriever(chunks_256, "ml_chunks_256")
retriever_512 = build_retriever(chunks_512, "ml_chunks_512")
retriever_1024 = build_retriever(chunks_1024, "ml_chunks_1024")

# Initialize LLM for QA
llm = ChatGroq(model_name="qwen/qwen3-32b", api_key=GROQ_API_KEY)

# Create RetrievalQA chains
qa_256 = RetrievalQA.from_chain_type(llm=llm, retriever=retriever_256)
qa_512 = RetrievalQA.from_chain_type(llm=llm, retriever=retriever_512)
qa_1024 = RetrievalQA.from_chain_type(llm=llm, retriever=retriever_1024)

# Define evaluation questions
questions = [
    "What is the main goal of the book?",
    "Explain one key concept in machine learning discussed in the PDF.",
    "What are common challenges in machine learning mentioned in the book?"
]

# Collect answers for each chunk size
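results = {"256": [], "512": [], "1024": []}
for q in questions:
    # invoke() replaces the deprecated run(); RetrievalQA returns the answer under "result"
    results["256"].append(qa_256.invoke({"query": q})["result"])
    results["512"].append(qa_512.invoke({"query": q})["result"])
    results["1024"].append(qa_1024.invoke({"query": q})["result"])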

# Output results
for size in ["256", "512", "1024"]:
    print(f"\nChunk Size: {size} tokens")
    for i, answer in enumerate(results[size]):
        print(f"Q{i+1}: {questions[i]}\nA: {answer}\n")



Chunk Size: 256 tokens
Q1: What is the main goal of the book?
A: <think>
Okay, the user is asking about the main goal of the book based on the provided context. Let me look through the given information again.

The context includes several customer reviews and the preface of the book. The reviews mention that the book is used as a textbook in machine learning and data mining classes. It's noted as a helpful resource for students, an excellent addition to a Machine Learning class, and a good reference. The preface states that the book is a textbook and mentions a bias towards easily accessible works for simplification.

From this, the main goal seems to be serving as a textbook for students in these fields. The authors aimed to make the material accessible, possibly prioritizing clarity and ease of understanding over citing original sources. They want students to focus on learning the concepts rather than getting bogged down by the original references. The reviews also highlight its us
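
The answer above opens with a `<think>` block, which is the model's reasoning trace rather than the answer itself. When comparing chunk sizes it may be easier to strip that block first. A minimal sketch, assuming the reasoning is always wrapped in `<think>...</think>` as in the output shown; `strip_reasoning` is a hypothetical helper:

```python
import re

# Non-greedy match across newlines, plus any whitespace after the closing tag
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_reasoning(answer: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace from an answer."""
    return THINK_BLOCK.sub("", answer).strip()


example = "<think>\nLet me look through the context.\n</think>\n\nThe book is a textbook."
print(strip_reasoning(example))
```

In the notebook this could be applied to each entry of `results` before printing.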