# Build a Conversational RAG app with Custom PDF ingestion using Ollama-Langchain
Goals:
* use an open-source LLM from Ollama for chat completion
* use an open-source embedding model from HuggingFace to embed chunks for the vector store
* once done, convert the notebook to a Python script

Document used: https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf
* it has been manually split by chapter into ~20 separate PDFs
* there should also be a version where you pass links and ingest them using requests, at the cost of less control over the input
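
A minimal sketch of what the link-based version could look like. To stay dependency-free it uses the standard library's urllib instead of requests (with requests, the fetch step would be `requests.get(url).content`). The names `fetch_bytes` and `download_pdfs` are hypothetical helpers, not part of this notebook or of LangChain.

```python
import os
import urllib.request
from urllib.parse import urlparse


def fetch_bytes(url, timeout=30):
    """Download the raw bytes behind a URL (standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_pdfs(urls, dest_dir, fetch=fetch_bytes):
    """
    Save each URL into dest_dir and return the list of local file paths.

    The file name is taken from the last segment of the URL path.
    `fetch` is injectable so the function can be tried without a network.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved_paths = []
    for url in urls:
        file_name = os.path.basename(urlparse(url).path) or "download.pdf"
        local_path = os.path.join(dest_dir, file_name)
        with open(local_path, "wb") as f:
            f.write(fetch(url))
        saved_paths.append(local_path)
    return saved_paths
```

The returned paths are ordinary local .pdf paths, so they could be handed to the same loading step used for the manually split chapters.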

In [6]:
import glob
import os

In [14]:
files_paths = glob.glob("/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/*.pdf")
files_paths

['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap11.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap13.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap12.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/cover-content-page.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap16.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap17.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/index.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/cha

In [59]:
# load each pdf into a dict keyed by file path

from langchain_community.document_loaders import PyPDFLoader
from tqdm import tqdm

def load_pdfs(file_paths):
    """
    Load every PDF in file_paths (each path must end with .pdf).

    PyPDFLoader splits a PDF into pages; each page becomes one Document object.

    Returns a dict mapping file_path -> list of Document objects (one per page).
    """
    documents_dict = {}
    for f in tqdm(file_paths):
        loader = PyPDFLoader(file_path=f)
        documents = loader.load()
        documents_dict[f] = documents
    return documents_dict

In [61]:
documents_dict = load_pdfs(file_paths=files_paths)

100%|██████████| 22/22 [00:33<00:00,  1.50s/it]


In [65]:
len(documents_dict) == len(files_paths), len(documents_dict)

(True, 22)

In [79]:
# print all the keys (one file path per loaded pdf)

for k in documents_dict.keys():
    print(k)

/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap11.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap13.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap12.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/cover-content-page.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap16.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap17.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/index.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap15.pdf
/Users/I748920/Desktop/llm

In [83]:
len(documents_dict['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf'])

52

In [85]:
d = documents_dict['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf']

In [93]:
len(d[0].page_content)

1413

As you can see, even though PyPDFLoader splits the PDF into one Document object per page, a single page is still a large block of text (1,413 characters for the first page of chap10.pdf)
- so the pages still need to be split into chunks

If you have a lot of RAM, you can afford smaller chunk sizes, since smaller chunks mean more chunks (and more vectors) to store.

* a small chunk size helps a model with a smaller context window, since each retrieved chunk takes up less of the prompt
* a big chunk size ensures the text is not split up so much that it loses its meaning

Example

Suppose one chunk is an intro to decision trees and another chunk covers random forests built by bootstrapping decision trees.

Query: "Why are smaller decision trees better?"

The RAG will rank both chunks highly and often give unreliable results, because each chunk is too small to capture the full meaning. It might not answer the question at all, because the answer sits in a continuation of the context that happens to fall in another chunk.

Nic Ang's suggestion for choosing the right chunk_size

- use traditional NLP techniques like bag-of-words (BOW) or similar to calculate the average number of characters and the average number of words per PDF page,
then calculate the average number of characters per word for each page and aggregate across all documents to get your number of characters
- I suggest you split per page!
- not sure how many characters there are per page, but using some domain knowledge about textbooks, we can tell that a topic is most likely captured in a single page,
hence the average character count per page of each chapter is a sound choice to start with
- so ideally you end up with around 844 chunks if the book has about 844 pages
- think about it by looking at the textbook itself
- and ask "if I were a chunker, what's the best number of characters so that I capture enough meaning without throwing away important detail?"
- so probably just take the average number of characters per page
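
The per-page average suggested above can be computed directly from the loaded pages. This is a minimal sketch: `page_length_stats` is a hypothetical helper, and the toy input below uses stand-in objects with a `page_content` attribute rather than real Document objects.

```python
from statistics import mean
from types import SimpleNamespace


def page_length_stats(documents_dict):
    """
    documents_dict maps file path -> list of page objects that expose a
    .page_content string (the shape load_pdfs returns).

    Returns (per_file_average, overall_average) in characters per page.
    """
    per_file = {}
    all_lengths = []
    for path, pages in documents_dict.items():
        lengths = [len(page.page_content) for page in pages]
        if lengths:  # skip files that produced no pages
            per_file[path] = mean(lengths)
            all_lengths.extend(lengths)
    overall = mean(all_lengths) if all_lengths else 0
    return per_file, overall


# Toy stand-in data: two "files" with pages of known length.
toy_documents = {
    "a.pdf": [SimpleNamespace(page_content="x" * 1000),
              SimpleNamespace(page_content="x" * 2000)],
    "b.pdf": [SimpleNamespace(page_content="x" * 3000)],
}
per_file, overall = page_length_stats(toy_documents)
print(per_file, overall)
```

Run on the real documents_dict, the overall average would be one candidate starting value for chunk_size.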

chunk_overlap

- use overlap because, without it, you may cut information off prematurely at a chunk boundary
- with a 20% character overlap, each chunk advances by only 80% of chunk_size, so expect roughly 800 / 0.8 = 1,000 chunks instead of ~800
- 20% is a decent value to start with
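
To see how overlap changes the chunk count, here is a minimal sketch using a plain fixed-size sliding window. It is a simplification, not what RecursiveCharacterTextSplitter does (that splitter also respects separators and splits each page separately), and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """
    Cut text into windows of chunk_size characters. Each window starts
    chunk_size - chunk_overlap characters after the previous one, so
    consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = "a" * 10_000
no_overlap = sliding_window_chunks(text, chunk_size=500, chunk_overlap=0)
with_overlap = sliding_window_chunks(text, chunk_size=500, chunk_overlap=100)
print(len(no_overlap), len(with_overlap))  # prints: 20 25
```

In this simplified setting a 20% overlap yields 25% more chunks, because each window only advances by 80% of chunk_size.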

In [176]:
# chunk the pdfs

from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_list_of_documents(documents):
    """
    input a list of documents as Document objects

    output a list of chunks as Document objects
    """

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 500,
        chunk_overlap = 100, # using 20% is a good start
        length_function=len,
        is_separator_regex=False,
        add_start_index=True
    )

    chunks = text_splitter.split_documents(documents)    
    return chunks

In [178]:
docs = documents_dict['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf']
len(docs)

52

In [188]:
chunks = chunk_list_of_documents(docs)
len(chunks)

241

In [204]:
chunks[5]

Document(metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf', 'page': 1}, page_content='Training  SampleWeighted  SampleG(x) = sign[∑M\nm=1αmGm(x)]\nGM(x)\nG3(x)\nG2(x)\nG1(x)Final Classifier\nFIGURE 10.1. Schematic of AdaBoost. Classiﬁers are trained on weighted ve r-\nsions of the dataset, and then combined to produce a ﬁnal pred iction.\nThe predictions from all of them are then combined through a w eighted\nmajority vote to produce the ﬁnal prediction:\nG(x) = sign(M∑\nm=1αmGm(x))\n. (10.1)\nHereα1,α2,...,α Mare computed by the boosting algorithm, and weight')

In [208]:
all_chunks = []

for key in tqdm(documents_dict.keys()):
    documents = documents_dict[key]
    chunks = chunk_list_of_documents(documents=documents)
    all_chunks.extend(chunks)

len(all_chunks)

100%|██████████| 22/22 [00:00<00:00, 112.01it/s]


5378

# Embeddings

https://python.langchain.com/v0.2/docs/integrations/text_embedding/ollama/

All the chunks are ready now.

What's left:
- index the chunks using Ollama embeddings
- create the vector store using InMemoryVectorStore or Chroma
- set up the retriever -> retriever = vectorstore.as_retriever(search_type='similarity')
- set up message history using InMemoryChatMessageHistory
- set up the prompts and the RAG chain
- test generation
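
Before wiring in the real components, the index -> store -> retrieve idea can be sketched in plain Python. This is a toy stand-in, not the LangChain or Ollama API: word counts replace a real embedding model, and `ToyVectorStore` is a hypothetical class that only mimics what a similarity retriever does.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real model returns dense vectors."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and ranks them by cosine similarity."""

    def __init__(self, texts):
        # "Indexing": embed every chunk once, up front.
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=2):
        # "Retrieval": embed the query and return the k closest chunks.
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore([
    "boosting combines many weak classifiers into a strong one",
    "random forests average many decorrelated decision trees",
    "support vector machines maximize the margin between classes",
])
top = store.similarity_search("how does boosting combine weak classifiers", k=1)
print(top)
```

In the real pipeline, the embedding model and vector store listed above take the place of `embed` and `ToyVectorStore`, and the retrieved chunks are passed to the prompt as context.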