# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) system.


## Vectorstore retrieval


In [1]:
%pip install langchain chromadb pypdf langchain-community




### Similarity Search

In [2]:
from langchain_community.vectorstores import Chroma
from utils import SaladOllamaEmbeddings

In [3]:
embedding = SaladOllamaEmbeddings()

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a popular programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Data science involves extracting insights from data."
]

In [5]:
chroma_db = Chroma.from_texts(documents, embedding=embedding)
# chroma_db.delete(chroma_db.get()["ids"])

In [6]:
print(chroma_db._collection.count())

4


In [7]:
question = "python programming language ?"
chroma_db.similarity_search(question, k=2)

[Document(page_content='Python is a popular programming language.'),
 Document(page_content='Machine learning is a subset of artificial intelligence.')]
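
Conceptually, `similarity_search` embeds the question and returns the `k` stored texts whose embeddings lie closest to it. The sketch below illustrates only that ranking step, using NumPy on hand-written toy vectors. The function name `top_k_by_cosine` and the vectors are assumptions made for illustration, and cosine similarity is chosen for simplicity; the distance metric Chroma actually applies depends on how the collection is configured.

```python
import numpy as np


def top_k_by_cosine(query_vec, doc_vecs, k=2):
    """Return (index, similarity) pairs for the k most similar document vectors."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Cosine similarity between the query and every document vector
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    # Sort by descending similarity and keep the first k indices
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy 3-dimensional "embeddings" (hypothetical values, not real model output)
toy_docs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
toy_query = [1.0, 0.0, 0.0]

for idx, score in top_k_by_cosine(toy_query, toy_docs, k=2):
    print(idx, round(score, 3))
```

With these toy vectors, documents 0 and 2 rank first because they point most nearly in the same direction as the query.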

## Load PDFs into Chroma

In [9]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("data/machine_learning_linear_reg.pdf"),
    PyPDFLoader("data/machine_learning_linear_reg.pdf"),
    PyPDFLoader("data/machine_learning_Decision Tree.pdf"),
    PyPDFLoader("data/machine_learning_XGBoost.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 10 0 (offset 0)


In [10]:
docs

[Document(page_content='                  Linear Regression  Introduc)on  Linear regression is a fundamental supervised learning algorithm used in the ﬁeld of sta5s5cs and machine learning. It is employed to establish the rela5onship between a dependent variable and one or more independent variables. The objec5ve of linear regression is to ﬁnd the best-ﬁ?ng straight line that can depict the rela5onship between the variables. This line serves as a predic5ve model for future data points.  How It Works  Linear regression works by minimizing the ver5cal distances between the observed data points and the predicted values generated by the linear approxima5on. It accomplishes this through the method of least squares, which involves minimizing the sum of the squares of the diﬀerences between the observed and predicted values. The algorithm computes the slope and intercept of the line that minimizes the overall error, thereby determining the best-ﬁt line.  Mathema)cal Intui)on  The equa5on for 

In [13]:

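# Split the loaded pages into overlapping chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # maximum number of characters per chunk
    chunk_overlap=150,  # characters shared between consecutive chunks
)
splits = text_splitter.split_documents(docs)
len(splits)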


12

In [14]:
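# Uncomment to delete every existing entry before re-indexing
# (only works once `vectordb` already exists):
# vectordb.delete(ids=vectordb.get()["ids"])

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory="retrieval_db",  # store the index on disk in this folder
)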

In [130]:
print(vectordb._collection.count())

24

Note that the count (24) is twice `len(splits)` (12). A likely cause is that the indexing cell was run more than once against the same `persist_directory`: `Chroma.from_documents` adds to an existing collection rather than replacing it.


In [48]:
question = "What are the advantages of Linear Regression?"
docs_ss = vectordb.similarity_search(question, k=3)

In [49]:
docs_ss

[Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.', metadata={'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}),
 Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.', metadata={'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}),
 Document(page_content='Advantages  Despite its limita5ons, linear regre

In [50]:
docs_ss[0].page_content[:200]

'including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the da'

In [51]:
docs_ss[1].page_content[:200]

'including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the da'

### Addressing Diversity: Maximum marginal relevance

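In the last class, we introduced one problem: how to enforce diversity in the search results. The similarity search above shows it in practice: the first two results are the same chunk, because the same PDF was loaded twice.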

`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.

Note the difference in results with `MMR`.
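
To make the trade-off concrete, here is a minimal sketch of the greedy MMR selection rule in NumPy. It is not the library's implementation: the function names and toy vectors are assumptions for illustration, and the sketch re-ranks every candidate directly, whereas a vector store would typically first fetch a pool of the most similar candidates and then re-rank that pool. The weight `lambda_mult` balances relevance (1.0) against diversity (0.0).

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k indices, trading query relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors (hypothetical): documents 0 and 1 are exact duplicates
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.1], [1.0, 0.1], [0.6, -0.8]]

print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5))  # diversity-aware
print(mmr_select(toy_query, toy_docs, k=2, lambda_mult=1.0))  # pure relevance
```

With `lambda_mult=0.5` the duplicate is skipped in favor of the less similar but different document 2. With `lambda_mult=1.0` the rule reduces to plain similarity ranking and the duplicate comes back.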

In [52]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12


In [53]:
docs_mmr

[Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.', metadata={'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}),
 Document(page_content='their versa-lity, decision trees can be prone to overﬁEng, especially when dealing with complex datasets. They may create overly complex trees that fail to generalize well to unseen data. Decision trees are also sensi-ve to small varia-ons in the training data and can be unstable, leading to diﬀerent results with slight changes in the input data. Addi-onally, decision trees can struggle to capture rela-onships between features that are not explicitly represented in the data.  Advantages', metadata={'page': 

In [54]:
docs_mmr[0].page_content[:100]

'including its reliance on the linearity assump5on between the dependent and independent variables. I'

In [55]:
docs_mmr[1].page_content[:100]

'their versa-lity, decision trees can be prone to overﬁEng, especially when dealing with complex data'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
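
As a rough mental model, a metadata filter restricts the candidate set to chunks whose metadata matches the given key-value pairs before any ranking happens. The sketch below shows that idea in plain Python. The `Chunk` class, the `filter_by_metadata` helper, and the sample texts are hypothetical and only mirror the exact-match case; real vector stores may support richer filter operators.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of text plus the metadata stored alongside its embedding."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(chunks, where):
    """Keep only chunks whose metadata equals every key-value pair in `where`."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in where.items())
    ]


# Hypothetical chunks; the source paths follow the files loaded earlier
chunks = [
    Chunk("Linear regression fits a straight line.",
          {"source": "data/machine_learning_linear_reg.pdf", "page": 0}),
    Chunk("Decision trees split on feature values.",
          {"source": "data/machine_learning_Decision Tree.pdf", "page": 0}),
    Chunk("Linear regression is sensitive to outliers.",
          {"source": "data/machine_learning_linear_reg.pdf", "page": 1}),
]

linear_only = filter_by_metadata(
    chunks, {"source": "data/machine_learning_linear_reg.pdf"}
)
for c in linear_only:
    print(c.metadata)
```

Only the two linear-regression chunks survive the filter, so any ranking applied afterwards cannot return chunks from the other PDFs.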

In [133]:
question = "What are the advantages of Linear Regression ?"

In [134]:
docs = vectordb.max_marginal_relevance_search(
    question,
    k=3,
    filter={"source":"data/machine_learning_linear_reg.pdf"}
)

Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12


In [135]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}
{'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}
{'page': 1, 'source': 'data/machine_learning_linear_reg.pdf'}


In [136]:
docs

[Document(page_content='including its reliance on the linearity assump5on between the dependent and independent variables. If the rela5onship between the variables is non-linear, the model may not accurately represent the data. Addi5onally, linear regression is sensi5ve to outliers, and its performance may be impacted by the presence of mul5collinearity among the independent variables.', metadata={'page': 0, 'source': 'data/machine_learning_linear_reg.pdf'}),
 Document(page_content='Linear Regression  Introduc)on  Linear regression is a fundamental supervised learning algorithm used in the ﬁeld of sta5s5cs and machine learning. It is employed to establish the rela5onship between a dependent variable and one or more independent variables. The objec5ve of linear regression is to ﬁnd the best-ﬁ?ng straight line that can depict the rela5onship between the variables. This line serves as a predic5ve model for future data points.  How It Works  Linear regression works by minimizing the ver5ca