# Executive Summary

To facilitate knowledge integration for the Data Science Applications (DSA) subject (IAAust), a Retrieval-Augmented Generation (RAG) approach can be used together with an LLM.  
Different retrieval methods were compared, and Maximum Marginal Relevance (MMR) was found to work best at retrieving diverse, non-redundant results from the vectorstore.  
A few test questions produced encouraging answers: the LLM pulled correct answers from relevant documents and summarized them succinctly.

Basic workflow: Course PDFs -> Split into chunks + Hugging Face embeddings -> Load Vectorstore  
Ask questions -> Retrieval from Vectorstore -> Get most relevant chunks -> Expose to LLM -> Get relevant answer

Some possible follow-ups that extend beyond the scope of this project:
* Can this workflow scale? Three documents were used, but what about a hundred?
* Chatbot functionality with memory and a clean interface could be built for non-technical stakeholders.
* How can model performance drift be tracked for LLMs? What if the quality of embeddings or results deteriorates over time?
* Can tabular data be created by specifying the output format? This could serve as input to ML/DL predictive models.

Inspired by: https://learn.deeplearning.ai/langchain-chat-with-your-data/

## Load libraries

In [1]:
import os
import numpy as np
from dotenv import load_dotenv, find_dotenv
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, NLTKTextSplitter
#from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
#from langchain.chat_models import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.llms import GPT4All
from langchain_core.embeddings import Embeddings
from langchain_community.retrievers.svm import SVMRetriever
from transformers import AutoTokenizer, AutoModel
import torch
import nltk
from google import genai
import requests
import json
from time import sleep


nltk.download("punkt_tab")

load_dotenv(find_dotenv())

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\caixi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


False

## Initialize LLM

In [None]:
def query(payload):
    try:
        # A timeout stops the call from hanging indefinitely if the API is unresponsive
        response = requests.post(API_URL, headers=headers, json=payload, timeout=30)
        
        # Check if the request was successful
        response.raise_for_status()
        
        # Try to parse the JSON response
        return response.json()
    except requests.exceptions.HTTPError as e:
        print(f"HTTP Error: {e}")
        print(f"Response content: {response.text}")
    except json.JSONDecodeError:
        print("JSONDecodeError: Could not parse response as JSON")
        print(f"Response content: {response.text}")
    except requests.exceptions.RequestException as e:
        print(f"Request Error: {e}")
    
    # If there was an error, sleep a bit and return None
    sleep(1)
    return None

API_URL = "https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2"
headers = {"Authorization": "Bearer <API_KEY>"}  # Free tokens available at huggingface.co

## Select an appropriate text splitter

### Load a single PDF

In [3]:
loader = PyMuPDFLoader("../data/DSA 2025 C07 Natural Language Processing_1.pdf")
pages = loader.load_and_split()

In [18]:
len(pages)

66

In [19]:
pages[0].page_content[0:500]

'Fellowship Program \n  \nData Science Applications \nChapter 7: Natural Language \nProcessing'

In [22]:
pages[50].metadata

{'producer': 'Adobe PDF Services',
 'creator': 'Microsoft® Word 2019',
 'creationdate': '2024-12-18T04:48:59+00:00',
 'source': '../data/DSA 2025 C07 Natural Language Processing_1.pdf',
 'file_path': '../data/DSA 2025 C07 Natural Language Processing_1.pdf',
 'total_pages': 66,
 'format': 'PDF 1.7',
 'title': 'Data Science Applications',
 'author': 'User',
 'subject': 'Natural Language Processing',
 'keywords': 'Chapter 7',
 'moddate': '2024-12-18T04:49:02+00:00',
 'trapped': '',
 'modDate': 'D:20241218044902Z',
 'creationDate': 'D:20241218044859Z',
 'page': 50}

### Compare between text splitters

Let's choose which TextSplitter to use. Here I'll compare results between `RecursiveCharacterTextSplitter` and `NLTKTextSplitter`.

In [3]:
# small parameters for now to conveniently assess results
chunk_size = 500
chunk_overlap = 50

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", " ", ""],  # default values
)

nltk_splitter = NLTKTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

In [4]:
# random page
snippet = pages[25].page_content[0:3000]
snippet

NameError: name 'pages' is not defined

In [None]:
r_res = r_splitter.split_text(snippet)
len(r_res)

6

In [None]:
r_res

["Data Science Applications \nChapter 7: Natural Language Processing \n \n \n \n \n \n \n \n© December 2024 The Institute of Actuaries of Australia \nPage 26 of 66 \nBoW or TF-IDF if handling sparse, high-dimensional data isn't an issue, or if simplicity is \nprioritized. \n• \nImportance of Word Frequency: Use TF-IDF if differentiating between commonly-occurring \nand rare terms is important (e.g., information retrieval). Use BoW if simple word occurrence or",
 'frequency is sufficient. Use Word Embeddings for deeper semantic and context-based \nanalysis where word frequency is less relevant. \n• \nText Size and Dataset Complexity: Use BoW or TF-IDF for simpler, smaller text datasets. Use \nWord Embeddings for larger datasets with complex relationships between words. \n• \nComputational Complexity: Use BoW or TF-IDF if computational resources are limited or if the \ntask requires quick and easy implementation. Use Word Embeddings if computational',
 "resources and time are available f

In [None]:
nltk_res = nltk_splitter.split_text(snippet)
len(nltk_res)

7

In [9]:
nltk_res

["Data Science Applications \nChapter 7: Natural Language Processing \n \n \n \n \n \n \n \n© December 2024 The Institute of Actuaries of Australia \nPage 26 of 66 \nBoW or TF-IDF if handling sparse, high-dimensional data isn't an issue, or if simplicity is \nprioritized.\n\n• \nImportance of Word Frequency: Use TF-IDF if differentiating between commonly-occurring \nand rare terms is important (e.g., information retrieval).\n\nUse BoW if simple word occurrence or \nfrequency is sufficient.",
 'Use Word Embeddings for deeper semantic and context-based \nanalysis where word frequency is less relevant.\n\n• \nText Size and Dataset Complexity: Use BoW or TF-IDF for simpler, smaller text datasets.\n\nUse \nWord Embeddings for larger datasets with complex relationships between words.\n\n• \nComputational Complexity: Use BoW or TF-IDF if computational resources are limited or if the \ntask requires quick and easy implementation.',
 'Use Word Embeddings if computational \nresources and time ar

The chunks produced by `RecursiveCharacterTextSplitter` don't end at sentence boundaries or punctuation, whereas `NLTKTextSplitter` respects them.  
Let's go with `NLTKTextSplitter`.
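
To illustrate why sentence-aware splitting gives cleaner chunk boundaries, and why it can emit chunks longer than `chunk_size`, here is a minimal sketch. It assumes a naive regex sentence splitter as a stand-in for NLTK's Punkt tokenizer; the helpers `split_sentences` and `pack_sentences` are hypothetical and are not part of LangChain.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (assumption: a stand-in for NLTK's Punkt tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def pack_sentences(sentences, chunk_size):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole, so a chunk
    can exceed the limit.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = (
    "Use TF-IDF if rare terms matter. Use BoW if simple counts are enough. "
    "Use Word Embeddings for deeper semantic analysis where word frequency is less relevant."
)

# Sentence-aware chunks always end at a sentence boundary
sentence_chunks = pack_sentences(split_sentences(text), chunk_size=60)

# Fixed-width character chunks cut wherever the limit falls
char_chunks = [text[i:i + 60] for i in range(0, len(text), 60)]
```

The last sentence is longer than 60 characters, so it becomes an oversized chunk of its own. This is the same situation that triggers the "Created a chunk of size ..., which is longer than the specified ..." messages in the notebook.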

### Split PDF and assess results

In [None]:
chunk_size = 500
chunk_overlap = int(chunk_size * 0.1)  # keep the overlap a whole number of characters

nltk_splitter = NLTKTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

split_res = nltk_splitter.split_documents(pages)

Created a chunk of size 1584, which is longer than the specified 1000


In [13]:
len(split_res)

162

In [14]:
split_res[50]

Document(metadata={'producer': 'Adobe PDF Services', 'creator': 'Microsoft® Word 2019', 'creationdate': '2024-12-18T04:48:59+00:00', 'source': '../data/DSA 2025 C07 Natural Language Processing_1.pdf', 'file_path': '../data/DSA 2025 C07 Natural Language Processing_1.pdf', 'total_pages': 66, 'format': 'PDF 1.7', 'title': 'Data Science Applications', 'author': 'User', 'subject': 'Natural Language Processing', 'keywords': 'Chapter 7', 'moddate': '2024-12-18T04:49:02+00:00', 'trapped': '', 'modDate': 'D:20241218044902Z', 'creationDate': 'D:20241218044859Z', 'page': 20}, page_content='In this \ncomponent of the TF-IDF score, a term’s weight is increased by the inverse of the number of \ndocuments in which it appears.\n\nIn other words, common terms that appear in many documents in \nthe corpus will be given a lower score.\n\nVideo 7.3 explains the intuition behind the TF-IDF method.\n\nVideo 7.3 – Term frequency-inverse document frequency \n \nRecord your video notes here \n \n \n \n \n \n \

## Load all PDFs and split

In [5]:
path = "../data/"

loaders = []

for file in os.listdir(path):
    if file.endswith(".pdf"):
        loaders.append(PyMuPDFLoader(os.path.join(path, file)))

loaders

[<langchain_community.document_loaders.pdf.PyMuPDFLoader at 0x1b477e78070>,
 <langchain_community.document_loaders.pdf.PyMuPDFLoader at 0x1b427d7c2e0>,
 <langchain_community.document_loaders.pdf.PyMuPDFLoader at 0x1b477e7a0b0>]

In [6]:
docs = []
for loader in loaders:
    docs.extend(loader.load())

len(docs)

256

In [7]:
# splits docs

splits = nltk_splitter.split_documents(docs)

Created a chunk of size 562, which is longer than the specified 500
Created a chunk of size 611, which is longer than the specified 500
Created a chunk of size 573, which is longer than the specified 500
Created a chunk of size 568, which is longer than the specified 500
Created a chunk of size 1238, which is longer than the specified 500
Created a chunk of size 715, which is longer than the specified 500
Created a chunk of size 553, which is longer than the specified 500
Created a chunk of size 567, which is longer than the specified 500
Created a chunk of size 510, which is longer than the specified 500
Created a chunk of size 1415, which is longer than the specified 500
Created a chunk of size 538, which is longer than the specified 500
Created a chunk of size 754, which is longer than the specified 500
Created a chunk of size 534, which is longer than the specified 500
Created a chunk of size 547, which is longer than the specified 500
Created a chunk of size 578, which is longer t

In [8]:
len(splits)

1119

## Use Hugging Face embeddings

In [9]:
class HuggingFaceEmbeddings(Embeddings):
    def __init__(self, model_name: str="BAAI/bge-large-en-v1.5"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self._get_embedding(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        return self._get_embedding(text)

    def _get_embedding(self, text: str) -> list[float]:
        inputs = self.tokenizer(text, return_tensors="pt", padding=True,
                                truncation=True, max_length=512)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Use the hidden state of the first ([CLS]) token as the sentence embedding
        return outputs.last_hidden_state[:, 0, :].squeeze().cpu().numpy().tolist()

In [10]:
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [11]:
sentence1 = (
    "The sun sets in the evening, casting a warm orange glow across the horizon."
)
sentence2 = "Twilight descends upon the land as the day draws to a close, painting the sky with hues of red and gold."
sentence3 = "Baby JJ crawled up the mattress to get his milk."

In [12]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [22]:
# sanity check: the cosine similarity of an embedding with itself should be 1
np.dot(embedding1, embedding1)/np.sqrt(np.sum(np.square(embedding1)))/np.sqrt(np.sum(np.square(embedding1)))

np.float64(1.0000000000000002)
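
The check above only compares an embedding with itself. A small helper makes it easier to compare different sentences. The sketch below uses only the standard library, and the three toy vectors are assumptions standing in for real embeddings, not values produced by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (assumption: stand-ins for real sentence embeddings)
sunset = [0.9, 0.1, 0.0]
twilight = [0.8, 0.2, 0.1]
baby = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(sunset, twilight)
sim_unrelated = cosine_similarity(sunset, baby)
```

In the notebook, the same helper could be applied to `embedding1`, `embedding2` and `embedding3`. One would expect the two sunset sentences to score higher against each other than against the third sentence, although that output is not shown here.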

## Initialize Vectorstore

Embeddings databases (also known as vector databases/stores) store embeddings and allow you to search by nearest neighbors rather than by substrings like a traditional database.  
Here, Chroma is used.
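
As a rough illustration of nearest-neighbor search, the sketch below implements a toy in-memory store with brute-force cosine ranking. `TinyVectorStore` is a hypothetical class written for this example; a real store such as Chroma adds persistence and indexing on top of the same idea.

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store (hypothetical, for illustration only)."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((vector, text))

    def search(self, query_vector, k=2):
        """Return the k texts whose vectors are most similar to the query."""
        scored = [(self._cosine(query_vector, v), t) for v, t in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [t for _, t in scored[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

store = TinyVectorStore()
# Toy 2-D vectors (assumption): the first two point in nearly the same direction
store.add([1.0, 0.0], "clustering")
store.add([0.9, 0.1], "k-means")
store.add([0.0, 1.0], "tokenisation")

hits = store.search([1.0, 0.0], k=2)
```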

In [13]:
persist_directory = "docs/chroma"

In [14]:
vectordb = Chroma.from_documents(
    documents=splits, embedding=embedding, persist_directory=persist_directory
)

vectordb.persist()



In [16]:
print(vectordb._collection.count())

1119


In [17]:
# does vectordb count tally with total splits?
vectordb._collection.count() == len(splits)

True

## Comparing retrieval methods


With the documents and embeddings in the vectorstore, there are several ways to retrieve this information.  
Here three methods are compared: `similarity_search`, `max_marginal_relevance_search` (MMR) and `ContextualCompressionRetriever`

Similarity search: Selects examples based on similarity to the inputs. It does this by finding the examples with the embeddings that have the greatest cosine similarity with the inputs.
https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/similarity

MMR: Selects examples based on a combination of which examples are most similar to the inputs, while also optimizing for diversity.  
It does this by finding the examples with the embeddings that have the greatest cosine similarity with the inputs, and then iteratively adding them while penalizing them for closeness to already selected examples.
https://python.langchain.com/docs/modules/model_io/prompts/example_selectors/mmr
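
The difference between the two methods can be sketched in a few lines. This is a minimal illustration of the usual MMR scoring rule (relevance to the query minus a penalty for similarity to already-selected items), not LangChain's implementation. The function names and the toy 2-D vectors are assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_similarity(query_vec, doc_vecs, k=2):
    """Indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Indices of k documents chosen greedily by Maximum Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalty: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-D vectors (assumption): documents 0 and 1 are near-duplicates,
# document 2 is related to the query but points in a different direction
doc_vecs = [[1.0, 0.0], [0.98, 0.2], [0.6, 0.8]]
query_vec = [0.9, 0.3]

similarity_pick = top_k_similarity(query_vec, doc_vecs, k=2)
mmr_pick = mmr(query_vec, doc_vecs, k=2)
```

Plain similarity returns the two near-duplicates (indices 1 and 0), while MMR keeps the best match and swaps the duplicate for the more distinct document (indices 1 and 2). In `max_marginal_relevance_search`, `fetch_k` sets how many candidates are fetched by similarity before this kind of re-ranking takes place.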

### Similarity Search

In [18]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)

In [19]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [20]:
smalldb.similarity_search(question, k=2)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

### Maximum Marginal Relevance

MMR penalizes the second-ranked text from the similarity search because it closely resembles the first, and instead returns the third text (on the Death Cap), which is related yet different.  
Let's try it on our docs.

In [23]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

In [24]:
docs_qn = "Briefly explain the main clustering techniques"

Using similarity search, the top three results are largely redundant: all of them discuss cluster validation rather than the clustering techniques themselves.

In [25]:
ss_res = vectordb.similarity_search(docs_qn, k=3)
[res.page_content[:300] for res in ss_res[:3]]

['This often runs counter to the aim of a clustering algorithm, which is to \ndiscover the underlying structure of a dataset that was previously unknown.\n\nManual validation involves validation by a human expert, who reviews the clustering outcomes \nand makes a subjective evaluation of whether they are ',
 'As with most validation activities, it is likely that a combination of the above validation methods will \nbe the most effective in determining the best model to adopt for the given purpose.\n\nThe optimal number of clusters will be typically determined by reference to multiple validation \nmethods.',
 'In unsupervised learning, response variables are often not available and not even \nrelevant for the problem being solved.\n\nInstead, validation of clustering outcomes involves \nassessing whether the characterisation of the data or simplification of the features is reasonable.\n\nThis means alternative ']

With MMR, the results are more diverse: they also cover linkage methods and divisive hierarchical clustering.

In [26]:
mmr_res = vectordb.max_marginal_relevance_search(docs_qn, k=3, fetch_k=10)
[res.page_content[:200] for res in mmr_res[:3]]

['This often runs counter to the aim of a clustering algorithm, which is to \ndiscover the underlying structure of a dataset that was previously unknown.\n\nManual validation involves validation by a human',
 'Once the distance measure for individual observations has been determined, these can be used \nby ‘linkage’ methods to calculate the distance between two clusters.\n\nFour common linkage \nmethods are: \n•',
 'Divisive Hierarchical Clustering \nDivisive hierarchical clustering was introduced in Section 6.3.\n\nUnlike agglomerative hierarchical \nclustering (see Section 6.3.3), which uses a bottom-up approach, d']

In [27]:
# take a closer look at the third search result
mmr_res[2].page_content

'Divisive Hierarchical Clustering \nDivisive hierarchical clustering was introduced in Section 6.3.\n\nUnlike agglomerative hierarchical \nclustering (see Section 6.3.3), which uses a bottom-up approach, divisive hierarchical clustering \nuses a top-down approach.\n\nThe top-down approach begins with all observations belonging to a \nsingle cluster.\n\nAt each step of the algorithm, a cluster is split into two, and the number of clusters is \nincreased by one.'

One observation: with a low `fetch_k` and a low `k`, all results come from the same document.  
However, the other documents also contain information relevant to the question.

In [28]:
[(res.metadata["source"], res.metadata["page"]) for res in mmr_res[:3]]

[('../data/DSA 2025 C06 Unsupervised Learning.pdf', 38),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 28),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 34)]

In [29]:
mmr_res = vectordb.max_marginal_relevance_search(docs_qn, k=10, fetch_k=200)

Increasing `k` and `fetch_k` helps: the results now come from two of the three documents.

In [30]:
sorted(
    [(res.metadata["source"], res.metadata["page"]) for res in mmr_res],
    key=lambda x: (x[0], x[1]),
)

[('../data/DSA 2025 C06 Unsupervised Learning.pdf', 17),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 19),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 25),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 28),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 34),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 34),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 35),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 38),
 ('../data/DSA 2025 C06 Unsupervised Learning.pdf', 56),
 ('../data/DSA 2025 C07 Natural Language Processing_1.pdf', 17)]
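
A quick way to check how evenly the results are spread across documents is to count hits per source. The sketch below uses hypothetical `(source, page)` pairs shaped like the output above; in the notebook the pairs would come from `res.metadata` for each result.

```python
from collections import Counter

# Hypothetical (source, page) pairs shaped like the notebook output
results = [
    ("C06 Unsupervised Learning.pdf", 17),
    ("C06 Unsupervised Learning.pdf", 19),
    ("C07 Natural Language Processing_1.pdf", 17),
]

# Number of retrieved chunks per source document
source_counts = Counter(source for source, _page in results)
```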

In [31]:
[res.page_content[:100] for res in mmr_res]

['This often runs counter to the aim of a clustering algorithm, which is to \ndiscover the underlying s',
 'Once the distance measure for individual observations has been determined, these can be used \nby ‘li',
 'As was discussed in Chapter 5 (Classification and neural networks), the reaching of a local \noptimum',
 'Divisive Hierarchical Clustering \nDivisive hierarchical clustering was introduced in Section 6.3.\n\nU',
 'The justification for the clustering algorithm choices can include: \n• \nUse K-means when: \n– \nYou kn',
 'Clustering can help to achieve specific outcomes, such as detecting fraud or identifying similar \ngr',
 'Medium.\n\nhttps://medium.com/predict/three-popular-clustering-\nmethods-and-when-to-use-each-4227c80ba',
 'Data Science Applications \nChapter 6: Unsupervised Learning \n \n \n \n \n \n \n \n© December 2024 The Insti',
 'These two effects can present computational challenges for running clustering algorithms \nefficientl',
 'An NLP task \nis no different

### Contextual Compression

In [32]:
llm = GPT4All(
    model="./models/mistral-7b-instruct-v0.1.Q4_0.gguf",  # path to downloaded model
    backend="llama",  # required for GGUF
    verbose=True
)

# Create the compressor, which uses the LLM to extract query-relevant passages
compressor = LLMChainExtractor.from_llm(llm)

In [33]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [34]:
# Helper function for printing docs
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

In [35]:
compressed_docs = compression_retriever.get_relevant_documents(docs_qn)
pretty_print_docs(compressed_docs)



Document 1:

- Clustering algorithm aims to discover previously unknown structure of dataset
- Manual validation involves human expert reviewing clustering outcomes and making subjective evaluation
----------------------------------------------------------------------------------------------------
Document 2:

- Clustering techniques
- Validation activities
- Determine the best model
- Given purpose
- Optimal number of clusters
- Multiple validation methods
----------------------------------------------------------------------------------------------------
Document 3:

- unsupervised learning
- response variables not available or relevant for problem being solved
- validation of clustering outcomes involves assessing characterisation of data or simplification of features
- alternative ways from those used to validate supervised learning outcomes are needed
----------------------------------------------------------------------------------------------------
Document 4:

• Distance measur

`vectordb.as_retriever()` returns a `VectorStoreRetriever`, which uses similarity search by default.  
Hence we again see overlapping, validation-focused results. Let's examine the results using MMR.

In [36]:
compression_retriever_mmr = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [37]:
compressed_docs = compression_retriever_mmr.get_relevant_documents(docs_qn)
pretty_print_docs(compressed_docs)

Document 1:

- Clustering algorithm aims to discover previously unknown structure of dataset
- Manual validation involves human expert reviewing clustering outcomes and making subjective evaluation
----------------------------------------------------------------------------------------------------
Document 2:

- distance measure for individual observations
- linkage methods to calculate the distance between two clusters
- four common linkage methods: complete linkage, single linkage, average linkage, centroid linkage
----------------------------------------------------------------------------------------------------
Document 3:

Divisive hierarchical clustering was introduced in Section 6.3. Unlike agglomerative hierarchical clustering, which uses a bottom-up approach, divisive hierarchical clustering uses a top-down approach. The top-down approach begins with all observations belonging to a single cluster. At each step of the algorithm, a cluster is split into two, and the number of c

As we saw earlier, MMR gives more diverse results than similarity search.
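
Conceptually, contextual compression is a two-step process: retrieve chunks, then keep only the parts of each chunk that relate to the query. The sketch below mimics the second step with a crude keyword-overlap filter. This is an assumption made purely for illustration: `LLMChainExtractor` asks the LLM to do the extraction, and the `compress` helper here is hypothetical.

```python
def compress(document, query, min_overlap=1):
    """Keep only sentences sharing at least min_overlap words with the query.

    Assumption: keyword overlap is a crude stand-in for LLM-based extraction.
    """
    query_words = set(query.lower().split())
    kept = []
    for sentence in document.split(". "):
        words = set(sentence.lower().replace(".", "").split())
        if len(words & query_words) >= min_overlap:
            kept.append(sentence.strip().rstrip("."))
    return ". ".join(kept)

doc = (
    "Four common linkage methods exist. Record your video notes here. "
    "Linkage methods measure distance between clusters."
)
query = "linkage methods"

# The unrelated middle sentence is dropped
compressed = compress(doc, query)
```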

### Retrieval without vectorstores

In [38]:
all_docs_text = [d.page_content for d in docs]
joined_docs_text = " ".join(all_docs_text)

In [39]:
nltk_splitter = NLTKTextSplitter(chunk_size=500, chunk_overlap=50)

docs_splits = nltk_splitter.split_text(joined_docs_text)

Created a chunk of size 562, which is longer than the specified 500
Created a chunk of size 611, which is longer than the specified 500
Created a chunk of size 573, which is longer than the specified 500
Created a chunk of size 827, which is longer than the specified 500
Created a chunk of size 1285, which is longer than the specified 500
Created a chunk of size 715, which is longer than the specified 500
Created a chunk of size 709, which is longer than the specified 500
Created a chunk of size 535, which is longer than the specified 500
Created a chunk of size 643, which is longer than the specified 500
Created a chunk of size 567, which is longer than the specified 500
Created a chunk of size 899, which is longer than the specified 500
Created a chunk of size 510, which is longer than the specified 500
Created a chunk of size 1649, which is longer than the specified 500
Created a chunk of size 519, which is longer than the specified 500
Created a chunk of size 674, which is longer t

In [40]:
svm_retriever = SVMRetriever.from_texts(
    texts=docs_splits,
    embeddings=embedding
)

In [41]:
docs_svm = svm_retriever.get_relevant_documents(docs_qn)
docs_svm[0]

Document(metadata={}, page_content='Discuss alternative clustering techniques \n2.2.8.\n\nEvaluate a clustering algorithm using internal, external, and manual validation Data Science Applications \nChapter 6: Unsupervised Learning \n \n \n \n \n \n \n \n© December 2024 The Institute of Actuaries of Australia \nPage 4 of 63 \n6.1.\n\nIntroduction \nUnsupervised learning is a data science technique that looks for patterns in data.')

It works well, but the metadata is missing. There is probably a way to include metadata, but that is out of scope for this project.
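
One possible workaround, sketched below under stated assumptions, is to split each page separately and keep a lookup from chunk text to the page's metadata. The page records, the naive `". "` split and the `with_metadata` helper are all hypothetical and were not tested against `SVMRetriever`.

```python
# Hypothetical page records shaped like LangChain Documents
pages = [
    {
        "text": "Clustering groups similar observations. K-means is one technique.",
        "metadata": {"source": "C06.pdf", "page": 4},
    },
    {
        "text": "TF-IDF weights rare terms more heavily.",
        "metadata": {"source": "C07.pdf", "page": 20},
    },
]

# Split page by page so every chunk can remember where it came from
texts, metadatas = [], []
for page in pages:
    for chunk in page["text"].split(". "):  # naive split, for illustration only
        texts.append(chunk)
        metadatas.append(page["metadata"])

metadata_by_text = dict(zip(texts, metadatas))

def with_metadata(retrieved_text):
    """Re-attach metadata to a chunk returned by a text-only retriever."""
    return {
        "text": retrieved_text,
        "metadata": metadata_by_text.get(retrieved_text, {}),
    }

hit = with_metadata("K-means is one technique.")
```

A caveat of this approach is that identical chunk texts from different pages would collide in the lookup, so only the last page's metadata would survive.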

In [42]:
docs_svm

[Document(metadata={}, page_content='Discuss alternative clustering techniques \n2.2.8.\n\nEvaluate a clustering algorithm using internal, external, and manual validation Data Science Applications \nChapter 6: Unsupervised Learning \n \n \n \n \n \n \n \n© December 2024 The Institute of Actuaries of Australia \nPage 4 of 63 \n6.1.\n\nIntroduction \nUnsupervised learning is a data science technique that looks for patterns in data.'),
 Document(metadata={}, page_content='The goal of clustering is to separate \nobservations into homogenous groups or clusters based on their features such that the \nobservations in each cluster are more like each other than observations in other clusters.\n\nClustering can help to achieve specific outcomes, such as detecting fraud or identifying similar \ngroups of customers.\n\nClustering can also be used to summarise a complex situation or to produce \neasy-to-understand narratives, which can be used by managers within a business.'),
 Document(metadata={}

## Question Answering

In [43]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
)

In [44]:
result = qa_chain({"query": docs_qn})



In [45]:
result["result"]

' Clustering is an unsupervised machine learning technique that involves grouping similar data points together based on their characteristics or features. There are several clustering techniques, including k-means, hierarchical clustering, density-based clustering, and fuzzy clustering. Each of these techniques has its own strengths and weaknesses, and the choice of technique depends on the nature of the data being analyzed and the goals of the analysis.'

In [87]:
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, 
just say that you don't know, don't try to make up an answer.
Keep the answer as concise as possible. The tone should be informative. Use bullet points. Cite the chapter reference
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

In [88]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)

In [89]:
qa_res = qa_chain(
    {"query": "Why is loss function important and what are the key optimisation methods?"}
)

In [90]:
print(qa_res["result"])

 The loss function is important because it represents the price paid for the inaccuracy of a model’s predictions, which is the measure by which a model is judged and optimised. A better classifier will produce a lower loss function score than a poorer model. Key optimisation methods include varying the learning rate over the training stage with a ‘learning rate scheduler’, plotting the calculated loss after each iteration of the gradient descent algorithm to ensure convergence, and using different batch sizes of data that is used in training.
Reference: Chapter 5.2.2


In [None]:
qa_chain_refine = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Changed from "refine" to "stuff"
    retriever=vectordb.as_retriever(search_kwargs={"k": 3})
)

The output from `chain_type="stuff"` is a lot more structured. 
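
For intuition, the "stuff" strategy places all retrieved chunks into a single prompt and makes one LLM call. The sketch below shows that assembly step with plain string formatting. The shortened template, the example chunks and the `build_stuff_prompt` helper are assumptions for illustration, not LangChain internals.

```python
# Shortened version of the prompt template (assumption, for illustration)
template = (
    "Use the following pieces of context to answer the question at the end.\n"
    "{context}\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def build_stuff_prompt(chunks, question):
    """Concatenate all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

prompt = build_stuff_prompt(
    ["Loss measures prediction error.", "Gradient descent minimises the loss."],
    "Why is the loss function important?",
)
```

Because every chunk has to fit into one prompt, this approach is bounded by the model's context window, which is one reason `k` is kept small in the retriever settings.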

In [105]:
refine_result = qa_chain_refine(
    {"query": "Why is loss function important and what are the key optimisation methods?"}
)
print(refine_result["result"])

 The loss function is important because it represents the price paid for the inaccuracy of a model’s predictions, which is the measure by which a model is judged and optimised. A better classifier will produce a lower loss function score than a poorer model. There are several key optimisation methods that can be used to find the optimal values of the parameters of a neural network, including gradient descent with learning rate scheduling, stochastic gradient descent, Adam optimiser, and RMSprop optimiser.


In [106]:
refine_result = qa_chain_refine(
    {"query": "Describe at high-level what is a neural network."}
)
print(refine_result["result"])

 A neural network is a type of machine learning algorithm that uses artificial neurons to model complex relationships between inputs and outputs. It can be used for tasks such as classification, regression, and prediction. Neural networks are trained on large amounts of data and use backpropagation to adjust the weights of the connections between neurons in order to minimize error. They have been shown to be effective at solving a wide range of problems, but also come with risks such as reverse engineering, unintended classifications, model bias, and spurious learning.
