# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's get our vectorDB from before.

## Vectorstore retrieval


In [1]:
import os
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file



In [None]:
#!pip install lark

Collecting lark
  Downloading lark-1.2.2-py3-none-any.whl.metadata (1.8 kB)
Downloading lark-1.2.2-py3-none-any.whl (111 kB)
Installing collected packages: lark
Successfully installed lark-1.2.2


### Similarity Search

In [2]:
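from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

# Directory where the Chroma index was persisted earlier
persist_directory = 'docs/chroma/'

# Embedding model used to encode both the stored chunks and the queries
embedding = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")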




In [3]:
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)



In [4]:
print(vectordb._collection.count())

64


In [11]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [12]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [13]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [14]:
smalldb.similarity_search(question, k=2)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

In [15]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

### Addressing Diversity: Maximum marginal relevance

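Last class we introduced one problem: how to enforce diversity in the search results.

`Maximum marginal relevance` (MMR) strives to achieve both relevance to the query *and diversity* among the results. It fetches a larger candidate set (`fetch_k`) and then picks `k` results that are relevant to the query but dissimilar to the ones already chosen.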
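
To make the trade-off concrete, here is a minimal pure-Python sketch of an MMR-style selection loop. It is an illustration under stated assumptions, not the vectorstore's implementation: the 2-D vectors are made-up toy "embeddings", and the weight name `lambda_mult` and its value `0.3` are chosen only for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest document already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Toy data (hypothetical): documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]

top_by_similarity = sorted(
    range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True
)[:2]
top_by_mmr = mmr(query, docs, k=2, lambda_mult=0.3)

print("similarity:", top_by_similarity)
print("mmr:", top_by_mmr)
```

Plain similarity ranking returns both near-duplicates, while the MMR loop swaps the second one for the less similar document. A larger `lambda_mult` moves the behaviour back toward plain similarity ranking.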

In [5]:
question = "what did they say last work experience?"
docs_ss = vectordb.similarity_search(question,k=8)

In [6]:
docs_ss[0].page_content[:100]

'Work\nExperience'

In [7]:
docs_ss[1].page_content[:100]

'Skills Languages: English (fluent),'

Note the difference in results with `MMR`.

In [9]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [10]:
docs_mmr[0].page_content[:100]

'Work\nExperience'

In [11]:
docs_mmr[1].page_content[:100]

'manner'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
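
As a rough mental model, a metadata filter narrows the candidate pool before ranking. The sketch below is a pure-Python illustration, not how any particular vectorstore implements it: the `Chunk` class, the file names `cv.pdf` and `other.pdf`, and the word-overlap scoring are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical stand-in for a stored document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def matches(metadata, metadata_filter):
    """True when every key in the filter equals the chunk's metadata value."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())

def filtered_search(chunks, score_fn, k=3, metadata_filter=None):
    """Drop chunks that fail the filter, then return the k best by score_fn."""
    pool = [c for c in chunks if matches(c.metadata, metadata_filter or {})]
    return sorted(pool, key=score_fn, reverse=True)[:k]

def word_overlap(question):
    """Toy relevance score: number of words shared with the question."""
    words = set(question.lower().split())
    return lambda chunk: len(words & set(chunk.page_content.lower().split()))

chunks = [
    Chunk("Work experience as a systems engineer", {"source": "cv.pdf", "page": 0}),
    Chunk("Work experience of a different person", {"source": "other.pdf", "page": 0}),
    Chunk("Skills and languages", {"source": "cv.pdf", "page": 0}),
]

hits = filtered_search(
    chunks,
    word_overlap("work experience"),
    k=3,
    metadata_filter={"source": "cv.pdf"},
)
for hit in hits:
    print(hit.metadata, hit.page_content)
```

The chunk from `other.pdf` matches the question just as well as the first one, yet it never appears in the results because the filter removes it before scoring.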

In [12]:
question = "what did they say about work experience?"

In [14]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf"}
)

In [15]:
for d in docs:
    print(d.metadata)

{'creationdate': '2025-03-08T22:52:34+00:00', 'creator': 'LaTeX with hyperref', 'page': 0, 'page_label': '1', 'producer': 'xdvipdfmx (20240305)', 'source': 'C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf', 'total_pages': 1}
{'creationdate': '2025-03-08T22:52:34+00:00', 'creator': 'LaTeX with hyperref', 'page': 0, 'page_label': '1', 'producer': 'xdvipdfmx (20240305)', 'source': 'C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf', 'total_pages': 1}
{'creationdate': '2025-03-08T22:52:34+00:00', 'creator': 'LaTeX with hyperref', 'page': 0, 'page_label': '1', 'producer': 'xdvipdfmx (20240305)', 'source': 'C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf', 'total_pages': 1}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
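
The split into a query string and a metadata filter can be pictured with a toy example. In the sketch below a regular expression stands in for the LLM, purely so the code runs without a model; the function `toy_self_query` and the "on page N" phrasing it recognises are assumptions made for illustration and are not part of `SelfQueryRetriever`.

```python
import re

def toy_self_query(question):
    """Split a question into (query text, metadata filter).

    A self-query retriever delegates this step to an LLM. The regex here is
    only a stand-in that recognises the phrase "on page N".
    """
    metadata_filter = {}
    match = re.search(r"\s*\bon page (\d+)\b", question, flags=re.IGNORECASE)
    if match:
        metadata_filter["page"] = int(match.group(1))
        # Remove the filter phrase so only the semantic part is embedded
        question = question[:match.start()] + question[match.end():]
    return " ".join(question.split()), metadata_filter

query, metadata_filter = toy_self_query("what is the contact number on page 0?")
print(query)
print(metadata_filter)
```

The two returned values correspond to the two items listed above: the text goes to the vector search, and the dictionary is passed along as the metadata filter.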

In [16]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [17]:
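# Describe each metadata field so the LLM can turn parts of a question into filters.
# NOTE: the `source` description below is carried over from the course's lecture-notes
# example; the chunks in this store have the CV PDF path as their `source`.
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]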

**Note:** The default model for `OpenAI` (`from langchain.llms import OpenAI`) was `text-davinci-003`. OpenAI deprecated `text-davinci-003` on 4 January 2024 and recommends `gpt-3.5-turbo-instruct` as its replacement.

The course was implemented with the OpenAI API; this notebook instead uses the `llama3.2:3b` model served through Ollama.

In [18]:
from langchain_community.llms import Ollama
from langchain.retrievers.self_query.base import SelfQueryRetriever

# This creates a LangChain-compatible LLM object
llm = Ollama(model="llama3.2:3b")



retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectordb,
    document_contents="Lecture notes",
    metadata_field_info=metadata_field_info,
    verbose=True
)



In [19]:
question = "what is the contact number?"

In [20]:
docs = retriever.get_relevant_documents(question)



In [21]:
for d in docs:
    print(d.metadata)

{'creationdate': '2025-03-08T22:52:34+00:00', 'creator': 'LaTeX with hyperref', 'page': 0, 'page_label': '1', 'producer': 'xdvipdfmx (20240305)', 'source': 'C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf', 'total_pages': 1}
{'creationdate': '2025-03-08T22:52:34+00:00', 'creator': 'LaTeX with hyperref', 'page': 0, 'page_label': '1', 'producer': 'xdvipdfmx (20240305)', 'source': 'C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf', 'total_pages': 1}
{'creationdate': '2025-03-08T22:52:34+00:00', 'creator': 'LaTeX with hyperref', 'page': 0, 'page_label': '1', 'producer': 'xdvipdfmx (20240305)', 'source': 'C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf', 'total_pages': 1}
{'creationdate': '2025-03-08T22:52:34+00:00', 'creator': 'LaTeX with hyperref', 'page': 0, 'page_label': '1', 'producer': 'xdvipdfmx (20240305)', 'source': 'C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf', 'total_pages': 1}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

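Contextual compression is meant to fix this: after retrieval, an LLM extracts only the parts of each document that are relevant to the query, and only those parts are passed on.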
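
The idea can be sketched without an LLM by keeping only the sentences that share a keyword with the question. This is a deliberately crude stand-in, assuming keyword overlap as the relevance test; the helper names, the stopword list, and the sample texts are hypothetical.

```python
import re

STOPWORDS = frozenset({"what", "is", "the", "are", "a", "of", "does"})

def extract_relevant(text, question):
    """Keep only the sentences that share a non-stopword with the question."""
    keywords = set(re.findall(r"\w+", question.lower())) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if keywords & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)

def compress(docs, question):
    """Compress every document and drop the ones with nothing relevant left."""
    compressed = (extract_relevant(d, question) for d in docs)
    return [c for c in compressed if c]

# Hypothetical retrieved texts
retrieved = [
    "Phone and email are listed at the top. "
    "Languages: English (fluent), German (intermediate).",
    "Collaborated on a maze-solving robot.",
]

compressed = compress(retrieved, "Which languages does the candidate speak?")
print(compressed)
```

Only the sentence about languages survives, and the second text is dropped entirely. An LLM-based extractor makes the same kind of cut but judges relevance by meaning rather than by exact word match.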

In [22]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [23]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [24]:
# Build the compressor from the LLM defined above.
# The course used OpenAI instead:
# llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
compressor = LLMChainExtractor.from_llm(llm)

In [25]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [26]:
question = "what is the language skills?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

extensive language skills; international
----------------------------------------------------------------------------------------------------
Document 2:

English
----------------------------------------------------------------------------------------------------
Document 3:

NO OUTPUT
----------------------------------------------------------------------------------------------------
Document 4:

German (intermediate), Russian


## Combining various techniques

In [27]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [28]:
question = "what is the language skills?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

extensive language skills; international
----------------------------------------------------------------------------------------------------
Document 2:

English (fluent)
----------------------------------------------------------------------------------------------------
Document 3:

time management skills; good
----------------------------------------------------------------------------------------------------
Document 4:

segmentation)


## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
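
To show what a term-based retriever does differently from an embedding-based one, here is a minimal TF-IDF ranking in pure Python. It is a simplified sketch (term frequency times a smoothed inverse document frequency, summed over the question's terms) and does not claim to match the scoring used inside `TFIDFRetriever`; the sample texts are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def tfidf_rank(question, texts):
    """Return the indices of texts, best match first, by a simple TF-IDF score."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts each term occurs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms weigh more
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    query_terms = set(tokenize(question))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1
        scores.append(
            sum(counts[t] / length * idf[t] for t in query_terms if t in counts)
        )
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Hypothetical chunks
texts = [
    "Skills Languages: English (fluent), German (intermediate)",
    "Work Experience: systems engineer working on real-time applications",
    "Education: intelligent systems, robotics",
]

ranking = tfidf_rank("Which languages are spoken?", texts)
print(texts[ranking[0]])
```

Because the score depends on exact word matches, a question that uses "language" instead of "languages" would score zero against the first chunk here, which is one reason term-based and embedding-based retrievers can rank the same chunks differently.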

In [30]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [31]:
# Load PDF
loader = PyPDFLoader("C:/Users/jalil/projects/NLP_playground/RAG_cv/Jalil_Mahmud_cv.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)


In [32]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [33]:
question = "What is the language skills?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(metadata={}, page_content='• Collaborated with a team to develop a mobile autonomous robot optimized for\nsolving mazes in the most efficient manner\nHospitality Industry| USA, Germany, UAE, Turkey 2010 - 2018\n• Au-Pair\n• Front Office Department / Night Manager\n• Event department\nCertificates Neural networks and deep learning / Deeplearning.ai\nRobotics : Computational motion planning / University of Pennsylvania\nProjects https://github.com/jalilmm\nSkills Languages: English (fluent), German (intermediate), Russian (intermediate),\nTurkish (fluent), Azerbaijani (native)\nProgramming: Python, C/C++, MATLAB, PLC, ROS/ROS2, LabVIEW')

In [34]:
question = "What is the language skills?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(metadata={}, page_content='Jalil Mahmud\nPhone: (+49) 152-5284-7112\nEmail: jalil_mahmud@outlook.com\nwww.linkedin.com/in/jalil-mahmud/\n#include <High IT-affinity; very good handling with figures and numbers; extensive language skills; international\nand intercultural experience; focused; ambitious; highly motivated; team-minded; time management skills; good\nat problem-solving; continuous learner>\nEducation Technische Hochschule Ulm Ulm, Germany\nIntellignt Systems M.Sc. 2024 - 2026(expected)\nKaunas University of Technology Kaunas, Lithuania\nIntelligent Robotics Systems B.Sc. 2019 - 2023\nWest Pomeranian University of Technology Szczecin, Poland\nRobotics B.Sc. 2021/ Exchange Student\nBalikesir University Balikesir, Turkiye\nTourism and Hotel Management B.Sc. 2009 - 2013\nWork\nExperience\nWorking Student, Bosch GmbH| Stuttgart, Germany 2024.10 - 2025.03\nSystems Engineer, Bosch GmbH| Stuttgart, Germany 2024.01 - 2024.10\n• Working on real-time AI applications for embedde