# Build a Conversational RAG app with Custom PDF ingestion using Ollama-Langchain
Goals:
* use open-source LLM from Ollama for ChatCompletion
* use open-source embedding model from HuggingFace for Embeddings for VectorStore
* once done, convert to python script
Document using: https://hastie.su.domains/ElemStatLearn/printings/ESLII_print12_toc.pdf
* for this sample, only using last chap - chap 18
* it has been manually split by chaps into ~20 separate pdfs

* a later version should also accept links and download the PDFs with `requests` before ingesting, though that gives less control over the input
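
A minimal sketch of the link-based version mentioned above, assuming each link points directly at a PDF file. It uses the standard library's `urllib.request` in place of `requests` so that it runs without extra installs. `download_pdf` and the fallback file name are hypothetical, not part of this notebook.

```python
import os
import urllib.request
from urllib.parse import unquote, urlparse


def download_pdf(url, dest_dir):
    """Download the file at `url` into `dest_dir` and return the local path.

    Hypothetical helper: assumes the URL path ends in the file name.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # use the last path segment of the URL as the local file name
    filename = os.path.basename(unquote(urlparse(url).path)) or "downloaded.pdf"
    dest_path = os.path.join(dest_dir, filename)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        out.write(response.read())
    return dest_path
```

The returned path could then be handed to a PDF loader in the same way as the local files used in the rest of this notebook.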

In [1]:
import glob
import os

In [2]:
files_paths = glob.glob("/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/*.pdf")
files_paths

['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap11.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap13.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap12.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/cover-content-page.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap16.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap17.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/index.pdf',
 '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/cha

In [3]:
# load pdfs into list

from langchain_community.document_loaders import PyPDFLoader
from tqdm import tqdm

def load_pdfs(file_paths):
    """
    file_paths must end with .pdf
    PyPDFLoader auto splits the pdf into pages, each page is 1 Document object

    returns a dict of key: file_path and value: list of document objects
    """
    documents_dict = {}   
    for f in tqdm(file_paths):
        loader = PyPDFLoader(file_path = f)
        documents = loader.load()
        documents_dict[f] = documents
    return documents_dict



In [4]:
documents_dict = load_pdfs(file_paths=files_paths)

100%|██████████| 22/22 [00:34<00:00,  1.56s/it]


In [5]:
len(documents_dict) == len(files_paths), len(documents_dict)

(True, 22)

In [6]:
# print all the keys

for k in documents_dict.keys():
    print(k)

/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap11.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap13.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap12.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/cover-content-page.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap16.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap17.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/index.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap15.pdf
/Users/I748920/Desktop/llm

In [7]:
len(documents_dict['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf'])

52

In [8]:
d = documents_dict['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf']

In [9]:
len(d[0].page_content)

1413

As you can see, even though PyPDFLoader splits the PDF into one Document object per page, each Document still holds a large number of characters (1413 on this page)
- so the pages still need to be split into chunks

if you have a lot of RAM, you can afford to split into smaller chunks

* a small chunk size helps a model with a smaller context window when it references the document
* a big chunk size ensures the text is not split up so much that it loses its meaning

Example
Suppose one chunk is an intro to decision trees and another chunk is about random forests, which bootstrap decision trees.
query: "why is smaller decision trees better?"
Your RAG will rank both contexts highly and often give unreliable results, because each chunk is too small to capture the meaning. It might not answer the question at all if the answer sits in the continuation of the context, which happens to be in another chunk.

Nic Ang's suggestion for choosing the right number for chunk_size

- use traditional NLP techniques like BOW to calculate the average number of characters and the average number of words per PDF page,
then calculate the average number of characters per word for each page and aggregate across all documents to get your number of characters
- I suggest you split per page!
- not sure how many characters there are per page, but from domain knowledge about textbooks, a topic is most likely captured in a single page,
hence the average character count per page of each chapter is a sound place to start
- so ideally you end up with about 844 chunks if the book has about 844 pages
- think about it by looking at the textbook itself
- and ask "if I'm a chunker, what's the best number of characters such that I can capture enough meaning without throwing away important detail?"
- so probably just take the average number of characters per page
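
A rough sketch of the "average characters per page" idea above. `Page` is a hypothetical stand-in for a LangChain Document so the example runs on its own; the function names are also assumptions, and only `page_content` is read.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Page:
    """Stand-in for a LangChain Document: only `page_content` is used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def average_chars_per_page(pages):
    """Return the mean number of characters per page, rounded to an int."""
    if not pages:
        raise ValueError("need at least one page")
    return round(mean(len(p.page_content) for p in pages))


def suggest_chunk_size(documents_by_file):
    """Average the per-file page averages to get one starting chunk_size."""
    per_file = [average_chars_per_page(pages) for pages in documents_by_file.values()]
    return round(mean(per_file))


# toy data: two "chapters" with pages of known length
toy = {
    "chapA.pdf": [Page("a" * 1000), Page("b" * 2000)],
    "chapB.pdf": [Page("c" * 3000)],
}
print(suggest_chunk_size(toy))
```

Since only `.page_content` is accessed, calling `suggest_chunk_size(documents_dict)` on the dict built earlier should work the same way, though that has not been run here.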

chunk_overlap

- use overlap because, without it, a chunk boundary may cut information off prematurely
- with a 20% character overlap each chunk advances by only about 80% of chunk_size, so expect roughly 800 / 0.8 = 1000 chunks instead of 800
- 20% is a decent value to start with
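
To see how overlap changes the chunk count, here is a simplified fixed-width splitter. It is only an illustration under that assumption: RecursiveCharacterTextSplitter prefers to break at separators, so its counts will differ. `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` that share `chunk_overlap` chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample_text = "".join(chr(97 + i % 26) for i in range(2000))
print(len(sliding_window_chunks(sample_text, 500, 0)))
print(len(sliding_window_chunks(sample_text, 500, 100)))
```

With 20% overlap the same text yields more chunks, because each window only moves forward by 400 characters instead of 500.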

In [14]:
# chunk the pdfs

from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_list_of_documents(documents):
    """
    input a list of documents as Document objects

    output a list of chunks as Document objects
    """

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 500,
        chunk_overlap = 100, # using 20% is a good start
        length_function=len,
        is_separator_regex=False,
        add_start_index=True
    )

    chunks = text_splitter.split_documents(documents)    
    return chunks

In [15]:
docs = documents_dict['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf']
len(docs)

52

In [16]:
chunks = chunk_list_of_documents(docs)
len(chunks)

241

In [17]:
chunks[5]

Document(metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf', 'page': 1, 'start_index': 329}, page_content='Training  SampleWeighted  SampleG(x) = sign[∑M\nm=1αmGm(x)]\nGM(x)\nG3(x)\nG2(x)\nG1(x)Final Classifier\nFIGURE 10.1. Schematic of AdaBoost. Classiﬁers are trained on weighted ve r-\nsions of the dataset, and then combined to produce a ﬁnal pred iction.\nThe predictions from all of them are then combined through a w eighted\nmajority vote to produce the ﬁnal prediction:\nG(x) = sign(M∑\nm=1αmGm(x))\n. (10.1)\nHereα1,α2,...,α Mare computed by the boosting algorithm, and weight')

In [18]:
documents_dict.keys()

dict_keys(['/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap10.pdf', '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap11.pdf', '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap13.pdf', '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap12.pdf', '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/cover-content-page.pdf', '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap16.pdf', '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap17.pdf', '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/index.pdf', '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/c

In [19]:
keys = [
    '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap16.pdf',
    '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap17.pdf',
    '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf'
]

use only chap 16-18

In [21]:
all_chunks = []

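for key in tqdm(keys):
    documents = documents_dict[key]
    # note: `chunks` is reassigned on every pass, so after the loop it holds only
    # the chunks of the last key (chap18); `all_chunks` holds all three chapters
    chunks = chunk_list_of_documents(documents=documents)
    all_chunks.extend(chunks)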

len(all_chunks)

100%|██████████| 3/3 [00:00<00:00, 511.96it/s]


528

In [22]:
chunks[0]

Document(metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf', 'page': 0, 'start_index': 0}, page_content='This is page 649\nPrinter: Opaque this\n18\nHigh-Dimensional Problems: p≫N\n18.1 When pis Much Bigger than N\nIn this chapter we discuss prediction problems in which the n umber of\nfeaturespis much larger than the number of observations N, often written\np≫N. Such problems have become of increasing importance, espec ially in\ngenomics and other areas of computational biology. We will s ee that high\nvariance and overﬁtting are a major concern in this setting. As a result,')

# Embeddings

https://python.langchain.com/v0.2/docs/integrations/text_embedding/ollama/

the chunks are all ready

what's left:
- index the chunks using Ollama embeddings
- create the vectorstore using InMemoryVectorStore, Chroma or FAISS
- set up the retriever -> retriever = vectorstore.as_retriever(search_type='similarity')
- set up message history using InMemoryChatMessageHistory
- set up prompts and the RAG chain
- test generation

In [37]:
from langchain_ollama import OllamaEmbeddings

embedding_model = OllamaEmbeddings(
    model="llama3"
)

In [39]:
import time

start_time=time.time()
embeddings = embedding_model.embed_documents(texts=[d.page_content for d in chunks[:100]])
time_taken = time.time()-start_time

In [40]:
len(embeddings[0]),len(embeddings[1]),len(embeddings),time_taken

(4096, 4096, 100, 50.45562291145325)

llama3 (served through Ollama) takes about 0.5s per chunk, which will be quite time-consuming for the entire book, all ~20 chapters

~5000 chunks in total would take about 40min at that rate (5000 × 0.5s ≈ 2500s)

try HuggingFace embeddings (the cell below uses sentence-transformers/all-mpnet-base-v2 rather than bge) and see the time taken for the same chunks
https://python.langchain.com/v0.2/api_reference/huggingface/embeddings/langchain_huggingface.embeddings.huggingface.HuggingFaceEmbeddings.html

In [23]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings':False}
hf_embedding_model = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)



In [24]:
import time

start_time=time.time()
embeddings = hf_embedding_model.embed_documents(texts=[d.page_content for d in chunks[:100]])
time_taken = time.time()-start_time

In [43]:
len(embeddings[0]),len(embeddings[1]),len(embeddings),time_taken

(768, 768, 100, 4.16506290435791)

the HuggingFace embedding model takes roughly 1/10th of the time (about 4.2s vs 50.5s for the same 100 chunks)
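
A quick back-of-the-envelope extrapolation of the two timings measured above to the ~5000 chunks estimated for the whole book. It assumes embedding time scales linearly with the number of chunks, which has not been checked here; `estimate_minutes` is a hypothetical helper.

```python
def estimate_minutes(seconds_for_sample, sample_size, total_chunks):
    """Linearly extrapolate a measured embedding time to `total_chunks`."""
    seconds_per_chunk = seconds_for_sample / sample_size
    return seconds_per_chunk * total_chunks / 60


# timings measured above for 100 chunks
llama3_seconds = 50.45562291145325
hf_seconds = 4.16506290435791

for name, seconds in [("llama3", llama3_seconds), ("all-mpnet-base-v2", hf_seconds)]:
    minutes = round(estimate_minutes(seconds, 100, 5000), 1)
    print(name, minutes, "min for ~5000 chunks")
```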

**QUES** I'm also instantiating a llama3 model here for embeddings. Is this different from the LLM chat completion model?
Or is this the encoder part of the LLM architecture, with the chat LLM being the decoder?

mention to Nic that the HuggingFaceEmbeddings model takes about 1/10th of the time
"sentence-transformers/all-mpnet-base-v2" hf embeddings - 768 dim embeddings, takes 4.2s for 100 chunks
vs
"llama3" ollama embeddings - 4096 dim embeddings, takes 43-50s for 100 chunks

In [45]:
len(all_chunks)

528

In [None]:
embedding_model

In [49]:
hf_embedding_model

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={'device': 'cpu'}, encode_kwargs={'normalize_embeddings': False}, multi_process=False, show_progress=False)

compare vectorstore creation for 100 chunks, time taken

In [87]:
from langchain_core.vectorstores import InMemoryVectorStore

start_time=time.time()
vectorstore_ollama = InMemoryVectorStore.from_documents(
    documents=chunks[:100],
    embedding=embedding_model
)

time.time()-start_time

42.992793798446655

ok, so this is roughly equivalent to the time it takes to embed the chunks on their own; storing them in the vector store adds negligible overhead

In [51]:
from langchain_core.vectorstores import InMemoryVectorStore

start_time=time.time()
vectorstore_hf = InMemoryVectorStore.from_documents(
    documents=chunks[:100],
    embedding=hf_embedding_model
)

time.time()-start_time

4.1332597732543945

how to view the vectors?

In [127]:
sample_id = list(vectorstore_ollama.store.keys())[0]
vectorstore_ollama.store[sample_id]

{'id': 'c9b8ac3c-7be0-46bd-8236-4e7a083e5341',
 'vector': [-0.012608163,
  -0.0120608555,
  0.00086725794,
  0.01326893,
  0.010097277,
  0.00035206287,
  0.00026821686,
  0.0045496686,
  -0.018493203,
  -0.002548048,
  -0.024194721,
  0.0037522789,
  -0.02383316,
  -0.025309505,
  0.0048821387,
  -0.0088724205,
  0.011225567,
  -0.023888016,
  -0.021714922,
  -0.016348604,
  -0.0025922458,
  0.0071643423,
  -0.009047262,
  0.0028966002,
  0.0025268095,
  -0.0017926403,
  0.00061689224,
  0.010913248,
  0.0061964006,
  0.00086333125,
  0.005995591,
  0.041502133,
  -0.015813723,
  0.00423508,
  0.007360782,
  0.008093625,
  -0.020000702,
  -0.0077266153,
  -0.010213232,
  0.01152888,
  0.007582599,
  0.0064120456,
  0.0058136354,
  0.0072216727,
  0.006009894,
  0.017256323,
  0.005799593,
  -0.0056000687,
  -0.009832098,
  -0.021397784,
  -0.010919142,
  -0.012415352,
  0.0076087904,
  -0.013315198,
  0.027662594,
  0.024872152,
  0.019452816,
  0.0011041962,
  0.011057808,
  0.016669

In [129]:
vectorstore_ollama.store[sample_id].keys()

dict_keys(['id', 'vector', 'text', 'metadata'])

In [133]:
vectorstore_ollama.store[sample_id]['id']

'c9b8ac3c-7be0-46bd-8236-4e7a083e5341'

In [143]:
len(vectorstore_ollama.store[sample_id]['vector']),vectorstore_ollama.store[sample_id]['vector'][:10]

(4096,
 [-0.012608163,
  -0.0120608555,
  0.00086725794,
  0.01326893,
  0.010097277,
  0.00035206287,
  0.00026821686,
  0.0045496686,
  -0.018493203,
  -0.002548048])

In [141]:
vectorstore_ollama.store[sample_id]['text']
# so it also stores the text

'This is page 649\nPrinter: Opaque this\n18\nHigh-Dimensional Problems: p≫N\n18.1 When pis Much Bigger than N\nIn this chapter we discuss prediction problems in which the n umber of\nfeaturespis much larger than the number of observations N, often written\np≫N. Such problems have become of increasing importance, espec ially in\ngenomics and other areas of computational biology. We will s ee that high\nvariance and overﬁtting are a major concern in this setting. As a result,'

In [145]:
vectorstore_ollama.store[sample_id]['metadata']
# so it also stores the metadata

{'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf',
 'page': 0,
 'start_index': 0}

In [147]:
chunks[0]
# this is exactly the first chunk

Document(metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf', 'page': 0, 'start_index': 0}, page_content='This is page 649\nPrinter: Opaque this\n18\nHigh-Dimensional Problems: p≫N\n18.1 When pis Much Bigger than N\nIn this chapter we discuss prediction problems in which the n umber of\nfeaturespis much larger than the number of observations N, often written\np≫N. Such problems have become of increasing importance, espec ially in\ngenomics and other areas of computational biology. We will s ee that high\nvariance and overﬁtting are a major concern in this setting. As a result,')

so basically the vectorstore takes each Document object, adds its vector embedding, assigns an id, and organises everything into a dict keyed by that id

InMemoryVectorStore.from_documents(documents, embedding, **kwargs)
    Return VectorStore initialized from documents and embeddings.

InMemoryVectorStore.from_texts(texts, embedding[, metadatas])
    Return VectorStore initialized from texts and embeddings.

In [55]:
vectorstore_hf.store

{'5f6ad0bd-8d21-483c-80cf-21c43cced33e': {'id': '5f6ad0bd-8d21-483c-80cf-21c43cced33e',
  'vector': [-0.04412347078323364,
   0.012674535624682903,
   -0.025128500536084175,
   -0.01419922150671482,
   -0.01705695316195488,
   -0.012716677971184254,
   0.03225192055106163,
   0.024593472480773926,
   0.0008378431084565818,
   0.052746038883924484,
   0.10067520290613174,
   -0.03932853043079376,
   0.02757219970226288,
   0.08505786210298538,
   -0.0019245308358222246,
   -0.03277314826846123,
   0.0013417424634099007,
   -0.036669474095106125,
   -0.0026207168120890856,
   -0.07339895516633987,
   -0.042232900857925415,
   -0.041781749576330185,
   0.0445852130651474,
   -0.005270527675747871,
   -0.06915991008281708,
   -0.01725328341126442,
   0.003751284908503294,
   -0.05043787509202957,
   -0.05077974870800972,
   0.013837937265634537,
   -0.0009410607744939625,
   0.010919244028627872,
   0.03145540878176689,
   0.0027374280616641045,
   1.8080958170685335e-06,
   -0.04081142693

In [57]:
sample_id = list(vectorstore_hf.store.keys())[0]
vectorstore_hf.store[sample_id]

{'id': '5f6ad0bd-8d21-483c-80cf-21c43cced33e',
 'vector': [-0.04412347078323364,
  0.012674535624682903,
  -0.025128500536084175,
  -0.01419922150671482,
  -0.01705695316195488,
  -0.012716677971184254,
  0.03225192055106163,
  0.024593472480773926,
  0.0008378431084565818,
  0.052746038883924484,
  0.10067520290613174,
  -0.03932853043079376,
  0.02757219970226288,
  0.08505786210298538,
  -0.0019245308358222246,
  -0.03277314826846123,
  0.0013417424634099007,
  -0.036669474095106125,
  -0.0026207168120890856,
  -0.07339895516633987,
  -0.042232900857925415,
  -0.041781749576330185,
  0.0445852130651474,
  -0.005270527675747871,
  -0.06915991008281708,
  -0.01725328341126442,
  0.003751284908503294,
  -0.05043787509202957,
  -0.05077974870800972,
  0.013837937265634537,
  -0.0009410607744939625,
  0.010919244028627872,
  0.03145540878176689,
  0.0027374280616641045,
  1.8080958170685335e-06,
  -0.04081142693758011,
  0.020633356645703316,
  0.01638626493513584,
  0.008129150606691837

In [59]:
vectorstore_hf.store[sample_id].keys()

dict_keys(['id', 'vector', 'text', 'metadata'])

In [61]:
vectorstore_hf.store[sample_id]['id']

'5f6ad0bd-8d21-483c-80cf-21c43cced33e'

In [63]:
len(vectorstore_hf.store[sample_id]['vector']),vectorstore_hf.store[sample_id]['vector'][:10]

(768,
 [-0.04412347078323364,
  0.012674535624682903,
  -0.025128500536084175,
  -0.01419922150671482,
  -0.01705695316195488,
  -0.012716677971184254,
  0.03225192055106163,
  0.024593472480773926,
  0.0008378431084565818,
  0.052746038883924484])

In [65]:
vectorstore_hf.store[sample_id]['text']
# so it also stores the text

'This is page 649\nPrinter: Opaque this\n18\nHigh-Dimensional Problems: p≫N\n18.1 When pis Much Bigger than N\nIn this chapter we discuss prediction problems in which the n umber of\nfeaturespis much larger than the number of observations N, often written\np≫N. Such problems have become of increasing importance, espec ially in\ngenomics and other areas of computational biology. We will s ee that high\nvariance and overﬁtting are a major concern in this setting. As a result,'

In [67]:
vectorstore_hf.store[sample_id]['metadata']
# so it also stores the metadata

{'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf',
 'page': 0,
 'start_index': 0}

In [69]:
chunks[0]
# this is exactly the first chunk

Document(metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf', 'page': 0, 'start_index': 0}, page_content='This is page 649\nPrinter: Opaque this\n18\nHigh-Dimensional Problems: p≫N\n18.1 When pis Much Bigger than N\nIn this chapter we discuss prediction problems in which the n umber of\nfeaturespis much larger than the number of observations N, often written\np≫N. Such problems have become of increasing importance, espec ially in\ngenomics and other areas of computational biology. We will s ee that high\nvariance and overﬁtting are a major concern in this setting. As a result,')

!! both vectorstore_ollama and vectorstore_hf store records in the same way, since both use the same InMemoryVectorStore

InMemoryVectorStore
* In-memory implementation of VectorStore using a dictionary. Uses numpy to compute cosine similarity for search.
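
To make the description above concrete, here is a toy dict-backed store with cosine-similarity search. It is an illustrative sketch, not the LangChain implementation: it uses pure Python instead of numpy to stay dependency-free, and `ToyVectorStore` and its methods are hypothetical names.

```python
import math
import uuid


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (pure Python)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Illustrative dict-backed store; not the LangChain implementation."""

    def __init__(self):
        self.store = {}

    def add(self, text, vector, metadata=None):
        record_id = str(uuid.uuid4())
        # same four keys as the records inspected earlier in this notebook
        self.store[record_id] = {
            "id": record_id,
            "vector": list(vector),
            "text": text,
            "metadata": metadata or {},
        }
        return record_id

    def similarity_search(self, query_vector, k=1):
        """Return the k records most similar to `query_vector`."""
        ranked = sorted(
            self.store.values(),
            key=lambda rec: cosine_similarity(query_vector, rec["vector"]),
            reverse=True,
        )
        return ranked[:k]


toy_store = ToyVectorStore()
toy_store.add("ridge regression", [1.0, 0.0], {"page": 2})
toy_store.add("support vector machines", [0.0, 1.0], {"page": 9})
best = toy_store.similarity_search([0.9, 0.1], k=1)[0]
print(best["text"])
```

Search here is a linear scan over every stored vector, which is fine for a few hundred chunks but is one reason to look at dedicated stores for larger collections.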

can also try using Chroma or FAISS (Facebook AI Similarity Search)
* so are Chroma and FAISS not using RAM to store the vectors?

- setup retriever -> retriever = vectorstore.as_retriever(search_type='similarity')
- setup message history using InMemoryChatMessageHistory
- setup prompts and rag chain
- test generation

In [75]:
retriever_ollama = vectorstore_ollama.as_retriever(
    search_type='similarity',
    search_kwargs = {'k':10}
)
retriever_hf = vectorstore_hf.as_retriever(
    search_type='similarity',
    search_kwargs = {'k':10}
)

In [313]:
# e.g. ollama

sample_query = "for high-dimensional problems, with regards to p and N, in what \
cases can ridge regression exploit the correlation in the features of the dataset?"

start_time = time.time()
retrieved_docs_ollama = retriever_ollama.invoke(input=sample_query)
time.time()-start_time

0.3430757522583008

In [77]:
# e.g.  hf

sample_query = "for high-dimensional problems, with regards to p and N, in what \
cases can ridge regression exploit the correlation in the features of the dataset?"

start_time = time.time()
retrieved_docs_hf = retriever_hf.invoke(input=sample_query)
time.time()-start_time

0.12175583839416504

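ollama does take longer (0.34s vs 0.12s). Each measurement includes embedding the query as well as the similarity search, so the gap probably reflects both the slower llama3 query embedding and the larger 4096-dim vectors; a single run cannot show that a larger embedding dim alone makes retrieval slower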
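
Single `time.time()` measurements like the ones above are noisy. A small helper that repeats a call and reports the median could make the comparison steadier; `time_call` is a hypothetical name and `sorted` is only a stand-in workload.

```python
import time
from statistics import median


def time_call(fn, *args, repeats=5, **kwargs):
    """Call `fn` `repeats` times; return (last result, median seconds)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, median(durations)


# stand-in workload; in this notebook the callable could be a retriever's invoke
result, seconds = time_call(sorted, [3, 1, 2], repeats=3)
print(result, f"{seconds:.6f}s")
```

Timing the embedding model's `embed_query` on the sample query separately from the full retriever call would be one way to see how much of the gap comes from embedding the query.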

In [307]:
for doc in retrieved_docs_ollama:
    print(doc.metadata['source'])

/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/

In [309]:
for doc in retrieved_docs_ollama:
    print(doc.page_content)
    print()

class is the one that wins the most pairwise contests. In the “ one versus all”
(ova) approach, each class is compared to all of the others in Ktwo-class
comparisons. To classify a test point, we compute the conﬁde nces (signed
distancefromthehyperplane)foreachofthe Kclassiﬁers.Thewinneristhe
class with the highest conﬁdence. Finally, Vapnik (1998) an d Weston and
Watkins (1999) suggested (somewhat complex) multiclass cr iteria which
generalize the two-class criterion (12.7).

2For a ﬁxed value of the regularization parameter λ, the degrees of freedom depends
on the observed predictor values in each simulation. Hence we compute the average
degrees of freedom over simulations.

data, and use it as the data for each of the CV folds.
The support vector “kernel trick” of Section 12.3.7 exploit s the same re-
duction used in this section, in a slightly diﬀerent context . Suppose we have
at our disposal the N×Ngram (inner-product) matrix K=XXT. From
(18.12) we have K=UD2UT, and soKcaptures t

all the contexts come from chap18, but that is expected: the vector store was built from chunks[:100], and `chunks` only holds chap18 chunks at this point

In [93]:
chap18_docs = [c for c in chunks if c.metadata['source'].endswith("chap18.pdf")]

this is the doc that should have been pulled
{'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf', 'page': 2, 'start_index': 397}

In [96]:
sample_query

'for high-dimensional problems, with regards to p and N, in what cases can ridge regression exploit the correlation in the features of the dataset?'

In [98]:
for c in chap18_docs:
    if c.metadata['start_index']==397:
        break

c

Document(metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf', 'page': 2, 'start_index': 397}, page_content='using the optimal ridge parameter in each of the three cases, the median\nvalue of|tj|was 2.0, 0.6 and 0.2, and the average number of |tj|values\nexceeding 2 was equal to 9.8, 1.2 and 0.0.\nRidge regression with λ= 0.001 successfully exploits the correlation in\nthe features when p<N, but cannot do so when p≫N. In the latter case\nthere is not enough information in the relatively small numb er of samples\nto eﬃciently estimate the high-dimensional covariance mat rix. In that case,')

In [99]:
gt_context = c

ollama retrieval is poor: the top-ranked results shown above are unrelated to the query

In [319]:
for doc in retrieved_docs_hf:
    print(doc.metadata['source'])

/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf
/Users/I748920/Desktop/llms-learning/

In [83]:
for doc in retrieved_docs_hf:
    print(doc.page_content)
    print()

using the optimal ridge parameter in each of the three cases, the median
value of|tj|was 2.0, 0.6 and 0.2, and the average number of |tj|values
exceeding 2 was equal to 9.8, 1.2 and 0.0.
Ridge regression with λ= 0.001 successfully exploits the correlation in
the features when p<N, but cannot do so when p≫N. In the latter case
there is not enough information in the relatively small numb er of samples
to eﬃciently estimate the high-dimensional covariance mat rix. In that case,

over the 100 simulation runs. The p= 1000 case is designed to mimic the
kind of data that we might see in a high-dimensional genomic o r proteomic
dataset, for example.
We ﬁt a ridge regression to the data, with three diﬀerent valu es for the
regularization parameter λ: 0.001, 100, and 1000. When λ= 0.001, this
is nearly the same as least squares regression, with a little regularization
just to ensure that the problem is non-singular when p > N. Figure 18.1

just to ensure that the problem is non-singular when p >

HuggingFaceEmbeddings successfully pulls the right context and ranks it first
all contexts are from chap18, which is expected since the vector store only holds chap18 chunks

In [107]:
retriever_ollama = vectorstore_ollama.as_retriever(
    search_type='similarity',
    search_kwargs = {'k':1}
)
retriever_hf = vectorstore_hf.as_retriever(
    search_type='similarity',
    search_kwargs = {'k':1}
)

In [109]:
sample_query

'for high-dimensional problems, with regards to p and N, in what cases can ridge regression exploit the correlation in the features of the dataset?'

In [111]:
gt_context

Document(metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf', 'page': 2, 'start_index': 397}, page_content='using the optimal ridge parameter in each of the three cases, the median\nvalue of|tj|was 2.0, 0.6 and 0.2, and the average number of |tj|values\nexceeding 2 was equal to 9.8, 1.2 and 0.0.\nRidge regression with λ= 0.001 successfully exploits the correlation in\nthe features when p<N, but cannot do so when p≫N. In the latter case\nthere is not enough information in the relatively small numb er of samples\nto eﬃciently estimate the high-dimensional covariance mat rix. In that case,')

In [115]:
retrieved_docs_ollama = retriever_ollama.invoke(input=sample_query)
retrieved_docs_hf = retriever_hf.invoke(input=sample_query)

In [333]:
retrieved_docs_ollama

[Document(id='c6776009-5552-4369-949f-6d335e371e07', metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf', 'page': 9, 'start_index': 826}, page_content='class is the one that wins the most pairwise contests. In the “ one versus all”\n(ova) approach, each class is compared to all of the others in Ktwo-class\ncomparisons. To classify a test point, we compute the conﬁde nces (signed\ndistancefromthehyperplane)foreachofthe Kclassiﬁers.Thewinneristhe\nclass with the highest conﬁdence. Finally, Vapnik (1998) an d Weston and\nWatkins (1999) suggested (somewhat complex) multiclass cr iteria which\ngeneralize the two-class criterion (12.7).')]

In [117]:
retrieved_docs_hf

[Document(id='22a44def-aa60-4ad5-8f43-37181aca119b', metadata={'source': '/Users/I748920/Desktop/llms-learning/pdf-chatbot-app/data/elements-of-statistical-learning-book/chap18.pdf', 'page': 2, 'start_index': 397}, page_content='using the optimal ridge parameter in each of the three cases, the median\nvalue of|tj|was 2.0, 0.6 and 0.2, and the average number of |tj|values\nexceeding 2 was equal to 9.8, 1.2 and 0.0.\nRidge regression with λ= 0.001 successfully exploits the correlation in\nthe features when p<N, but cannot do so when p≫N. In the latter case\nthere is not enough information in the relatively small numb er of samples\nto eﬃciently estimate the high-dimensional covariance mat rix. In that case,')]

use HuggingFaceEmbeddings from now on
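
The comparison above was done by eye. A small check for where the ground-truth chunk lands in a retrieved list could make it repeatable. `Doc` is a hypothetical stand-in for a LangChain Document, and matching on `source`, `page` and `start_index` is an assumption about what uniquely identifies a chunk.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (only `metadata` is compared)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def chunk_key(doc):
    """Identify a chunk by where it came from, not by its text."""
    m = doc.metadata
    return (m.get("source"), m.get("page"), m.get("start_index"))


def rank_of_ground_truth(retrieved_docs, gt_doc):
    """Return the 1-based rank of `gt_doc` in `retrieved_docs`, or None."""
    target = chunk_key(gt_doc)
    for rank, doc in enumerate(retrieved_docs, start=1):
        if chunk_key(doc) == target:
            return rank
    return None


gt = Doc("ridge ...", {"source": "chap18.pdf", "page": 2, "start_index": 397})
hits = [
    Doc("ova ...", {"source": "chap18.pdf", "page": 9, "start_index": 826}),
    Doc("ridge ...", {"source": "chap18.pdf", "page": 2, "start_index": 397}),
]
print(rank_of_ground_truth(hits, gt))      # 2
print(rank_of_ground_truth(hits[:1], gt))  # None
```

In this notebook the same function could be called as `rank_of_ground_truth(retrieved_docs_hf, gt_context)`, since it only reads `.metadata`.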

- setup message history using InMemoryChatMessageHistory

In [119]:
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

message_history_store = {}

def get_session_history(session_id: str):
    if session_id not in message_history_store:
        # create new chat history
        message_history_store[session_id] = InMemoryChatMessageHistory()
    return message_history_store[session_id]

In [136]:
# set up the LLM

from langchain_ollama import ChatOllama

# temperature: higher values make the model answer more creatively (default: 0.8)
llm_model = ChatOllama(
    model='llama3.1',
    temperature=0,
)
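The effect of `temperature` can be sketched with the standard temperature-scaled softmax. This is a general illustration, not Ollama's actual sampling code (which also applies other sampling settings); the logits are made-up values for three hypothetical candidate tokens. A setting of 0 is usually treated as "always pick the top token", so the sketch only handles values above 0.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.1)    # close to temperature=0
default = softmax_with_temperature(logits, 0.8)  # the default noted above

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in default])
```

At low temperature almost all probability mass sits on the top-scoring token, which is why a low setting gives more repeatable answers for a RAG app.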

In [146]:
llm_model.invoke("hi")

AttributeError: 'FieldInfo' object has no attribute 'chat'

In [46]:
# test message history

chat_with_history1 = RunnableWithMessageHistory(llm_model, get_session_history)
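Conceptually, a history wrapper like this does three things per call: look up the session's history, send history plus the new message to the model, then store both the message and the reply. The plain-Python sketch below illustrates that loop only; it is not LangChain's implementation, and `history_store`, `echo_model` and `invoke_with_history` are hypothetical names made up for the illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for message_history_store: session_id -> list of (role, text)
history_store = defaultdict(list)

def echo_model(messages):
    """Stub model (an assumption for this sketch): reports how many messages it was sent."""
    return f"I received {len(messages)} message(s)."

def invoke_with_history(session_id, user_text, model=echo_model):
    history = history_store[session_id]
    # 1. Prepend the stored history to the new user message.
    messages = history + [("human", user_text)]
    # 2. Call the model on the full message list.
    reply = model(messages)
    # 3. Persist the user message and the reply for the next turn.
    history.append(("human", user_text))
    history.append(("ai", reply))
    return reply

first = invoke_with_history("1", "my name is Johnson Green")
second = invoke_with_history("1", "whats my name")
other = invoke_with_history("2", "whats my name")  # separate session, empty history

print(first, second, other, sep="\n")
```

Because every call appends to the same session list, re-running a cell with the same `session_id` adds the question to the history again.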

In [54]:
config = {'configurable': {'session_id': '1'}}

response = chat_with_history1.invoke("whats my name", config=config)
# response
print(response.content)

I'm still not able to recall or access any information about your name. If you'd like to share it with me, I can try to get to know you better!

If not, we could also play a game where I come up with a fun, fictional name for our conversation. Would you like that?


In [56]:
response = chat_with_history1.invoke("my name is Johnson Green", config=config)
print(response.content)

Nice to meet you, Johnson Green! It's great that you shared your name with me.

Now that we've got the formalities out of the way, what would you like to talk about? Do you have a favorite hobby or topic you'd like to discuss?

(By the way, I'll make sure to remember your name for our conversation. Just don't worry if I forget it later - I'm designed to handle lots of conversations at once!)


In [68]:
message_history_store['1'].messages

[HumanMessage(content='whats my name'),
 AIMessage(content="I'm a large language model, I don't have the ability to know or remember your personal information, including your name. This is our first conversation, and I don't retain any data about individual users.\n\nIf you'd like to share your name with me, I can try to get to know you better!", response_metadata={'model': 'llama3.1', 'created_at': '2024-09-14T11:22:41.274113Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 2643882334, 'load_duration': 30360000, 'prompt_eval_count': 14, 'prompt_eval_duration': 279456000, 'eval_count': 64, 'eval_duration': 2332879000}, id='run-206acab8-26e6-4e1a-94b3-30d482ce3458-0', usage_metadata={'input_tokens': 14, 'output_tokens': 64, 'total_tokens': 78}),
 HumanMessage(content='whats my name'),
 AIMessage(content='I\'m still not able to recall or access any information about a specific person\'s name. If you\'d like to tell me your name, I

In [71]:
message_history_store

{'1': InMemoryChatMessageHistory(messages=[HumanMessage(content='whats my name'), AIMessage(content="I'm a large language model, I don't have the ability to know or remember your personal information, including your name. This is our first conversation, and I don't retain any data about individual users.\n\nIf you'd like to share your name with me, I can try to get to know you better!", response_metadata={'model': 'llama3.1', 'created_at': '2024-09-14T11:22:41.274113Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 2643882334, 'load_duration': 30360000, 'prompt_eval_count': 14, 'prompt_eval_duration': 279456000, 'eval_count': 64, 'eval_duration': 2332879000}, id='run-206acab8-26e6-4e1a-94b3-30d482ce3458-0', usage_metadata={'input_tokens': 14, 'output_tokens': 64, 'total_tokens': 78}), HumanMessage(content='whats my name'), AIMessage(content='I\'m still not able to recall or access any information about a specific person\'s name.

Next: add message history to the RAG chain. The current wrapper only covers the bare LLM:

chat_with_history1 = RunnableWithMessageHistory(llm_model, get_session_history)

RAG chain template to adapt:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")

The full chain should be built in these steps:

- test chain with message history
- set up prompts and rag chain
- test chain with rag retrieval context
- integrate both chains and test generation

(see managing conversation history)
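The integration step in the list above boils down to one prompt that carries three things: the chat history, the retrieved context, and the new question. A minimal plain-Python sketch of that assembly follows. It is an assumption-laden illustration, not the LangChain chain itself: `retrieve` is a toy word-overlap ranker standing in for the vector-store retriever, `format_docs` and `build_prompt` are hypothetical helpers, and the two corpus strings are paraphrased stand-ins for real chunks.

```python
def format_docs(docs):
    """Join retrieved chunk texts into one context string."""
    return "\n\n".join(docs)

def retrieve(question, corpus, k=1):
    """Toy retriever (assumption): rank chunks by words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, context, history):
    """Combine chat history, retrieved context and the new question."""
    past = "\n".join(f"{role}: {text}" for role, text in history)
    return f"History:\n{past}\n\nContext:\n{context}\n\nQuestion: {question}"

# Stand-in chunks and history for the illustration
corpus = [
    "Ridge regression exploits the correlation in the features when p < N.",
    "In the one versus all approach each class is compared to all of the others.",
]
history = [("human", "my name is Johnson Green"), ("ai", "Nice to meet you!")]

question = "when does ridge regression work"
context = format_docs(retrieve(question, corpus))
prompt_text = build_prompt(question, context, history)
print(prompt_text)
```

In the real chain the prompt template, the retriever and the history wrapper would each take over one of these roles.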

In [75]:
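# alternative: build the input mapping separately, then pipe it into the chain

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

message_input = {
    'context': retriever_hf,  # must already be defined in an earlier cell
    'question': RunnablePassthrough()
}

# use llm_model (the ChatOllama instance defined above) rather than an undefined `model`
rag_chain = message_input | prompt | llm_model | StrOutputParser()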

NameError: name 'retriever_hf' is not defined