<a href="https://colab.research.google.com/github/lucken99/ConstitutionXpert/blob/main/ProjectCI_Aug26.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# ## Install all required libraries
# !pip install langchain
# !pip install pypdf

# !pip install pdfminer.six
# # !pip install unstructured pdf2image  # for unstructured pdf loader (legacy)

# !pip install tiktoken
# !pip install sentence_transformers
# !pip install chromadb
# !pip install cohere
# !pip install openai
# !pip install litellm
!pip install rank_bm25

# DATA

> We have a text file consisting of paragraphs about the Indian Constitution, e.g., Articles, Schedules, etc.

In [1]:
# data path
dir_path = "/content/drive/MyDrive/Project_CILLM/db/"
text_file_path = "/content/drive/MyDrive/Project_CILLM/db/file_context_corpus_cleaned_extended_part3.txt"
text_file_path = "/content/drive/MyDrive/Project_CILLM/colab_project/Constitution-Xpert using OpenAI Embeddings (1)/file_context_corpus_cleaned_extended_part3.txt"
pdf_file_path = "/content/drive/MyDrive/Project_CILLM/db/file_context_corpus_cleaned_extended_part3.pdf"


In [2]:
# utility function to load api keys from json file
import json
path_to_keys = "/content/drive/MyDrive/Project_CILLM/Keys/keys.json"
def return_api_key(name):
    with open(path_to_keys, 'r') as f:
        json_data = json.load(f)
        return json_data[name]

# RETRIEVAL


## [Document loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/)

In [3]:
# Text Loaders
from langchain.document_loaders import TextLoader

# PDF Loaders (try which suits us best)
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import PyPDFDirectoryLoader
# from langchain.document_loaders import MathpixPDFLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.document_loaders import PDFMinerPDFasHTMLLoader
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.document_loaders import OnlinePDFLoader

loader_text = TextLoader(text_file_path)
loader_pdf = PyPDFLoader(pdf_file_path)

In [4]:
## utility function for document loaders information

def loaded_doc_info(loader, show_loaded_data=False):
    data = loader.load()
    print("Type of the loader:", type(loader))
    print("Length of the data:", len(data))
    if show_loaded_data:
        print(data)
    return data




In [5]:
# loaded_doc_info(loader_text, show_loaded_data=True)
text_data = loaded_doc_info(loader_text)

Type of the loader: <class 'langchain.document_loaders.text.TextLoader'>
Length of the data: 1


In [6]:
pdf_data = loaded_doc_info(loader_pdf)

Type of the loader: <class 'langchain.document_loaders.pdf.PyPDFLoader'>
Length of the data: 178


In [None]:
# # checking different pdf loader
# loader = PDFMinerPDFasHTMLLoader(pdf_file_path)
# data = loaded_doc_info(loader, True)

## [Document Transformers](https://python.langchain.com/docs/modules/data_connection/document_transformers/)

In [7]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import MarkdownHeaderTextSplitter  # we can split by parts in constitution using this splitter and add metadata for better search
                                                                # first we have to add headers (for e.g., # or ##)

# # Split by tokens
# from langchain.text_splitter import TokenTextSplitter
# from langchain.text_splitter import SpacyTextSplitter
# from langchain.text_splitter import NLTKTextSplitter

# !pip install tiktoken
# good for OpenAI Models
# text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
#     chunk_size=100, chunk_overlap=0
# )

# # Sentence Transformers token split
# from langchain.text_splitter import SentenceTransformersTokenTextSplitter # for a particular sentence transformer

# # Hugging face tokenizers
# from transformers import GPT2TokenizerFast
# tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
#     tokenizer, chunk_size=100, chunk_overlap=0
# )
# texts = text_splitter.split_text(text)


In [8]:
def split_data(loader, splitter):
    docs = loader.load_and_split(splitter)
    print("Loader Type:", type(loader))
    print("Splitter Type:", type(splitter))
    print("Length of splitted data:", len(docs))
    return docs




In [9]:
# using openai's tiktoken for splitting data
text_splitter = CharacterTextSplitter.from_tiktoken_encoder()
docs_text = split_data(loader_text, text_splitter)
docs_pdf = split_data(loader_pdf, text_splitter)

Loader Type: <class 'langchain.document_loaders.text.TextLoader'>
Splitter Type: <class 'langchain.text_splitter.CharacterTextSplitter'>
Length of splitted data: 31
Loader Type: <class 'langchain.document_loaders.pdf.PyPDFLoader'>
Splitter Type: <class 'langchain.text_splitter.CharacterTextSplitter'>
Length of splitted data: 178


In [10]:
# using RecursiveCharacterTextSplitter
recur_splitter = RecursiveCharacterTextSplitter()
docs_text = split_data(loader_text, recur_splitter)
docs_pdf = split_data(loader_pdf, recur_splitter)

Loader Type: <class 'langchain.document_loaders.text.TextLoader'>
Splitter Type: <class 'langchain.text_splitter.RecursiveCharacterTextSplitter'>
Length of splitted data: 156
Loader Type: <class 'langchain.document_loaders.pdf.PyPDFLoader'>
Splitter Type: <class 'langchain.text_splitter.RecursiveCharacterTextSplitter'>
Length of splitted data: 178


In [None]:
docs_pdf[0]

In [11]:
char_splitter = CharacterTextSplitter(
    separator = "\n\n",
    # chunk_size = 1000,
    # chunk_overlap  = 200,
    # length_function = len,
    is_separator_regex = False,
)

# char_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=500)

docs_text = split_data(loader_text, char_splitter)
docs_pdf = split_data(loader_pdf, char_splitter)


Loader Type: <class 'langchain.document_loaders.text.TextLoader'>
Splitter Type: <class 'langchain.text_splitter.CharacterTextSplitter'>
Length of splitted data: 156
Loader Type: <class 'langchain.document_loaders.pdf.PyPDFLoader'>
Splitter Type: <class 'langchain.text_splitter.CharacterTextSplitter'>
Length of splitted data: 178


In [12]:
docs = [doc.page_content for doc in docs_text]

In [None]:
# docs_pdf[0].page_content

In [None]:
# print(docs_pdf[0].page_content.rstrip('\n'))

## Text Embedding models and chromadb Vectorstore
[MTEB blog](https://huggingface.co/blog/mteb) <br>
[MTEB](https://huggingface.co/spaces/mteb/leaderboard)

### Chromadb VectorStore


In [13]:
from langchain.vectorstores import Chroma


In [14]:
# utility function for embeddings info

def embeddings_info(embeddings):
    print("Total Embeddings:", len(embeddings))
    print("Dimension:", len(embeddings[0]))


In [15]:
### Problems with Chroma
### https://python.langchain.com/docs/integrations/vectorstores/chroma#basic-example-including-saving-to-disk
## Caching Embeddings using LocalFileStore
# from langchain.storage import LocalFileStore
# from langchain.embeddings import CacheBackedEmbeddings # Embeddings can be stored or temporarily cached to avoid needing to recompute them.
# import os

# utility function for creating embeddings using chromadb
# def create_cached_embeddings(docs, embedding_model, model_name, fs):
#     # print(list(fs.yield_keys()))
#     # cached_embedder = CacheBackedEmbeddings.from_bytes_store(
#     #     embedding_model, fs, namespace=model_name
#     # )
#     ### Create the vectorstore
#     # db = Chroma.from_documents(docs, cached_embedder)
#     db = Chroma.from_documents(docs, embedding_model)
#     return db

# def create_embeddings(docs, embedding_model):
#     return Chroma.from_documents(docs, embedding_model)


# utility function for similarity search
def return_similar_docs(db, query, k=4, show_docs=False):
    docs = db.similarity_search_with_relevance_scores(query, k=k)
    docs = sorted(docs, key=lambda x: -x[1])
    if show_docs:
        for doc in docs:
            print("Text:", doc[0].page_content)
            print("Relevance Score:", doc[1])
            print("--"*50)
            print("--"*50)
    return docs



### BGE Hugging face embeddings

In [16]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
hf = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [None]:
# embeddings = hf.embed_documents(docs)
# embeddings_info(embeddings)
# sentence1 = "Tell me about article 23"
# embedding1 = hf.embed_query(sentence1)
# import numpy as np

# dot_product = np.dot(embeddings, embedding1)
# dot_product
# np.argmax()

In [17]:
# hf_db = Chroma.from_documents(docs_text, hf, persist_directory="./chroma_db_hf")
hf_db = Chroma(persist_directory="./chroma_db_hf", embedding_function=hf)

In [None]:
### check similar documents
query = "What is writs in constitution?"
retrieved_docs = return_similar_docs(hf_db, query, show_docs=True)

### Cohere Embeddings

In [18]:
COHERE_API_KEY = return_api_key("COHERE_API_KEY")

In [19]:
from langchain.embeddings import CohereEmbeddings
cohr = CohereEmbeddings(
    model="embed-english-v2.0",
    cohere_api_key=COHERE_API_KEY
    )


In [20]:
# db_cohr = create_embeddings(docs_text, cohr)
# cohr_db = Chroma.from_documents(docs_text, cohr, persist_directory="./chroma_db_cohr")
cohr_db = Chroma(persist_directory="./chroma_db_cohr", embedding_function=cohr)

In [None]:
### check similar documents
query = "What is writs in constitution?"
retrieved_docs = return_similar_docs(cohr_db, query, show_docs=True)

#### Reordering
- When models must access relevant information in the middle of long contexts, they tend to ignore the provided documents. See: https://arxiv.org/abs/2307.03172
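
The reordering idea can be sketched in plain Python. This is a minimal illustration, not necessarily how `LongContextReorder` is implemented: it assumes the input list is sorted from most to least relevant, and the integers are placeholder stand-ins for documents (1 is the most relevant).

```python
def reorder_for_long_context(items):
    """Place the most relevant items at both ends and the least relevant in the middle.

    `items` is assumed to be sorted from most to least relevant.
    """
    reordered = []
    # Walk from least to most relevant, alternately pushing to the front and the back,
    # so the last (most relevant) items end up at the two extremes.
    for i, item in enumerate(reversed(items)):
        if i % 2 == 1:
            reordered.append(item)
        else:
            reordered.insert(0, item)
    return reordered


# Placeholder "documents": 1 is the most relevant, 5 the least.
ranked = [1, 2, 3, 4, 5]
print(reorder_for_long_context(ranked))
```

With five items, the two most relevant land in the first and last positions and the least relevant sits in the middle, which is where a long-context model is most likely to overlook it.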


In [None]:
from langchain.document_transformers import (
    LongContextReorder,
)

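# Reorder returned docs
# return_similar_docs() yields (Document, score) tuples, so keep only the Documents
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents([doc for doc, _score in retrieved_docs])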
reordered_docs

### OpenAI Embeddings

In [None]:
OPENAI_API_KEY = return_api_key("OPENAI_API_KEY")
# OPENAI_API_KEY

In [None]:
from langchain.embeddings import OpenAIEmbeddings

openai = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)


#### We can also use [VectorstoreIndexCreator](https://python.langchain.com/docs/modules/data_connection/retrievers/#one-line-index-creation) to create a vectorstore quickly with one line of code.

In [22]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=CohereEmbeddings(cohere_api_key=COHERE_API_KEY),
    text_splitter=CharacterTextSplitter(separator='\n\n'),
)

index = index_creator.from_loaders([loader_text])

# querying
query = "what is article 4?"
# index.query(query) # give your llm
# index.query_with_sources(query) # give your llm

# check Vectorstore
index.vectorstore

# can use as a retriever
index.vectorstore.as_retriever()

# can use in QA using llm
# qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=index.vectorstore.as_retriever())

VectorStoreRetriever(tags=['Chroma', 'CohereEmbeddings'], metadata=None, vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x78572fc8a650>, search_type='similarity', search_kwargs={})

## [Retrievers](https://python.langchain.com/docs/modules/data_connection/retrievers/)

### [MultiQueryRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever)

In [26]:
# using cohere llm via LiteLLM lib interface
# from langchain.chat_models import ChatLiteLLM
from langchain.llms import Cohere
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from langchain.retrievers.multi_query import MultiQueryRetriever

# llm = ChatLiteLLM(model="command-nightly", cohere_api_key=COHERE_API_KEY)
llm = Cohere(model="command-nightly", cohere_api_key=COHERE_API_KEY)
# llm.cohere_api_key
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=cohr_db.as_retriever(), llm=llm
)

In [27]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)


In [28]:
# Problem with the litellm lib
question = "Tell me about article 4"
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ["1. Tell me about article 4's tags.", "2. Tell me about article 4's content.", "3. Tell me about article 4's authors."]


7

- We can supply our own prompt along with an [output parser](https://python.langchain.com/docs/modules/model_io/output_parsers/) to split the results into a list of queries.
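
As a rough sketch of what happens around the parser (plain Python, no LangChain): the LLM output is split into one query per line, each query is run against a retriever, and the unique union of the results is kept. The `fake_search` function, its tiny index, and the document ids are hypothetical stand-ins for a real retriever.

```python
def parse_queries(llm_output):
    """Split raw LLM text into one query per non-empty line."""
    return [line.strip() for line in llm_output.strip().split("\n") if line.strip()]


# Hypothetical stand-in for a vector store retriever: query -> list of document ids.
FAKE_INDEX = {
    "What does the Constitution say about reservation?": ["doc_15", "doc_16"],
    "Which articles cover reservation?": ["doc_16", "doc_335"],
}


def fake_search(query):
    return FAKE_INDEX.get(query, [])


def unique_union(queries, search):
    """Run every query and keep each retrieved document once, in first-seen order."""
    seen, merged = set(), []
    for query in queries:
        for doc_id in search(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


llm_output = """What does the Constitution say about reservation?

Which articles cover reservation?"""

queries = parse_queries(llm_output)
unique_ids = unique_union(queries, fake_search)
print(unique_ids)
```

Because overlapping results are merged, the number of unique documents is usually smaller than the number of queries times `k`.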



In [29]:
from typing import List
from langchain import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser


# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
    # "lines" is the key (attribute name) of the parsed output
    lines: List[str] = Field(description="Lines of text")


class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        return LineList(lines=lines)


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from a vector
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search.
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

# Chain
llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)

# Other inputs
question = "what is reservation in constitution of india?"

In [30]:
# Run
retriever = MultiQueryRetriever(
    retriever=cohr_db.as_retriever(), llm_chain=llm_chain, parser_key="lines"
)  # "lines" is the key (attribute name) of the parsed output

# Results
unique_docs = retriever.get_relevant_documents(
    query=question
)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. What is the reservation in the Constitution of India according to the Constitution?', '2. What does the Constitution of India say about reservation?', '3. How does the reservation in the Constitution of India work?', '4. What are the different types of reservation in the Constitution of India?', '5. How has the reservation in the Constitution of India changed over time?']


11

### [Ensemble Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)
- The EnsembleRetriever takes a list of retrievers as input, ensembles the results of their get_relevant_documents() methods, and reranks them using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
- The most common pattern is to combine a sparse retriever (like BM25) with a dense retriever (like embedding similarity), because their strengths are complementary. This is also known as "hybrid search".
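
Weighted Reciprocal Rank Fusion can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain's actual implementation: the constant `c=60` follows the RRF paper, and the document ids and the two rankings are made up for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores weight / (c + rank) per list it appears in (rank starts at 1);
    the fused list is sorted by the summed score, highest first.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from a sparse (BM25) and a dense (embedding) retriever.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document returned by only one retriever can still appear further down the fused list.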

In [35]:
from langchain.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

In [39]:
# initialize the bm25 retriever
bm25_retriever = BM25Retriever.from_documents(docs_text)
bm25_retriever.k = 2

# initialize the Chromadb as retriever
chroma_retriever = cohr_db.as_retriever(search_kwargs={"k": 2})



In [41]:
# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.5, 0.5])

In [42]:
docs = ensemble_retriever.get_relevant_documents("schedules")
docs

[Document(page_content="The 44th Amendment Amend articles 19, 22, 30, 31A, 31C, 38, 71, 74, 77, 83, 103, 105, 123, 132, 133, 134, 139A, 150, 166, 172, 192, 194, 213, 217, 225, 226, 227, 239B, 329, 352, 356, 358, 359, 360 and 371F.\nInsert articles 134A and 361A.\nRemove articles 31, 257A and 329A.\nAmend part 12.\nAmend schedule 9.\nThe objective of 44th Amendment was: Amendment passed after revocation of internal emergency in the Country. \nProvides for human rights safeguards and mechanisms to prevent abuse of executive and legislative authority. Annuls some Amendments enacted in Amendment Bill 42. \nThe 44th Amendment is enforced since: 20 June, 1 August & 6 September 1979 [6] \nThe Prime Minster at the time of 44th Amendment was: Morarji Desai \nThe President at the time of 44th Amendment was: Neelam Sanjiva Reddy \n\nThe 45th Amendment Amend article 334.\nThe objective of 45th Amendment was: Extend reservation for SCs and STs and nomination of Anglo Indian members in Parliament an

### [MultiVector Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector)

- The methods to create multiple vectors per document include:

    - smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever)
    - summary: create a summary for each document, embed that along with (or instead of) the document
    - hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document
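
The mechanism shared by all three methods can be sketched without LangChain: several small search entries point back to one parent document through a shared id, the search runs over the small entries, and the parent is what gets returned. In this toy version, word overlap stands in for embedding similarity, and the parent texts, ids, and helper names are hypothetical.

```python
import re

# Hypothetical parent documents keyed by id (the role played by the docstore).
parents = {
    "p1": "Article 29 protects the interests of minorities. "
          "Article 30 covers minority educational institutions.",
    "p2": "Article 51A lists the fundamental duties. "
          "Part V describes the Union executive.",
}

# Child entries (the role played by the vectorstore): one sentence each, tagged with its parent id.
children = [
    (sentence, parent_id)
    for parent_id, text in parents.items()
    for sentence in text.split(". ")
]


def tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(query, text):
    """Toy relevance score: number of shared words (a stand-in for vector similarity)."""
    return len(tokens(query) & tokens(text))


def retrieve_parents(query, k=2):
    """Search the small child entries, then return their de-duplicated parent documents."""
    top_children = sorted(children, key=lambda c: overlap_score(query, c[0]), reverse=True)[:k]
    parent_ids = []
    for _sentence, parent_id in top_children:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


result = retrieve_parents("fundamental duties", k=1)
print(result)
```

The match is found on a short, focused child entry, but the caller receives the full parent text, which gives a downstream LLM more context.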

In [47]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
COHERE_API_KEY = return_api_key("COHERE_API_KEY")

In [44]:
from langchain.vectorstores import Chroma
from langchain.embeddings import CohereEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader

In [56]:
loaders = TextLoader(text_file_path)

docs = loaders.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

#### Smaller chunks


In [57]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_docs_CI",
    embedding_function=CohereEmbeddings(cohere_api_key=COHERE_API_KEY),
)

# The Storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)

import uuid
doc_ids = [str(uuid.uuid4()) for _ in docs]


In [58]:
# The splitter to use to create the smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=800)

In [59]:
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

In [60]:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [61]:
# Vectorstore alone retrieves the small chunks
retriever.vectorstore.similarity_search("article 29")

[Document(page_content='Article 297: Things of value within territorial waters or continental shelf and resources of the exclusive economic zone to vest in the Union.\nArticle 298: Power to carry on trade, etc.\nArticle 299: Contracts.\nArticle 300: Suits and proceedings.', metadata={'doc_id': '246a1d0b-497f-4a4b-aa57-663384c849b9', 'source': '/content/drive/MyDrive/Project_CILLM/colab_project/Constitution-Xpert using OpenAI Embeddings (1)/file_context_corpus_cleaned_extended_part3.txt'}),
 Document(page_content='PART IVA: FUNDAMENTAL DUTIES\nArticle 51A: Fundamental duties.\n\nPART V: THE UNION:\nCHAPTER I: THE EXECUTIVE', metadata={'doc_id': '115e017e-e303-4fa1-a5a5-20e4f531f55f', 'source': '/content/drive/MyDrive/Project_CILLM/colab_project/Constitution-Xpert using OpenAI Embeddings (1)/file_context_corpus_cleaned_extended_part3.txt'}),
 Document(page_content='Article 379-391: [Repealed]\nArticle 392: Power of the President to remove difficulties.', metadata={'doc_id': 'd588fc73-eb1

In [68]:
# Retriever returns larger chunks
len(retriever.get_relevant_documents("article 29")[0].page_content)

8326

In [None]:
# sub_docs

In [None]:
# # cohr_db.similarity_search("article 29")
# bm25_retriever = BM25Retriever.from_documents(docs)
# bm25_retriever.k = 2

# ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, retriever], weights=[0.5, 0.5])
# ensemble_retriever.get_relevant_documents("article 29")

#### Summary

- Oftentimes a summary may be able to distill more accurately what a chunk is about, leading to better retrieval. Here we show how to create summaries, and then embed those.

In [69]:
from langchain.llms import Cohere
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
import uuid
from langchain.schema.document import Document

In [71]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | Cohere(model="command-nightly", max_retries=0, cohere_api_key=COHERE_API_KEY)
    | StrOutputParser()
)

In [78]:
# the free license allows only 5 API calls per minute, so this cell takes a while
import time
# summaries = chain.batch(docs, {"max_concurrency": 1})
summaries = []
for i, doc in enumerate(docs):
    # summarize the current document (not a stale `doc` left over from an earlier loop)
    summaries.append(chain.invoke(doc))
    # pause after every 5th call to stay under the rate limit
    if (i + 1) % 5 == 0:
        time.sleep(60)

In [79]:
summary_docs = [Document(page_content=s,metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)]

In [87]:
import pickle
# use a context manager so the file handle is closed after writing
with open("summaries.txt", "wb") as f:
    pickle.dump(summary_docs, f)

In [88]:
summary_docs

[Document(page_content=' This document summarizes the rights and freedoms of citizens in India as outlined in the Indian Constitution. It includes protections for freedom of speech, expression, assembly, association, movement, residence, and profession, as well as the right to education and protection from unfair conviction. It also includes provisions for the protection of life and personal liberty, prohibition of traffic in human beings and forced labor, and the prohibition of employment of children in factories and mines. Additionally, it outlines the freedom of conscience and the right to freely practice and propagate religion, as well as the right to manage religious affairs and own and acquire property. The document also mentions the freedom from paying taxes for the promotion or maintenance of any particular religion, as well as the right to conserve a distinct language, script, or culture. It also states that no citizen shall be denied admission into any educational institution

In [89]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [90]:
sub_docs = vectorstore.similarity_search("Tell me about Article 29")

In [91]:
sub_docs[0]

Document(page_content='Article 297: Things of value within territorial waters or continental shelf and resources of the exclusive economic zone to vest in the Union.\nArticle 298: Power to carry on trade, etc.\nArticle 299: Contracts.\nArticle 300: Suits and proceedings.', metadata={'doc_id': '246a1d0b-497f-4a4b-aa57-663384c849b9', 'source': '/content/drive/MyDrive/Project_CILLM/colab_project/Constitution-Xpert using OpenAI Embeddings (1)/file_context_corpus_cleaned_extended_part3.txt'})

In [92]:
retrieved_docs = retriever.get_relevant_documents("Tell me about Article 29")

[]

### [Self Querying](https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/)

- A self-querying retriever is one that, as the name suggests, has the ability to query itself.
- Specifically, given any natural language query, the retriever uses a query-constructing LLM chain to write a structured query and then applies that structured query to its underlying VectorStore.
- This allows the retriever not only to use the user-input query for semantic similarity comparison with the contents of stored documents, but also to extract filters on the metadata of stored documents from the user query and to execute those filters.
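
The two-part structured query (semantic text plus metadata filter) can be illustrated with a toy sketch. Everything here is a stand-in: a regular expression plays the role of the query-constructing LLM chain, word overlap plays the role of vector similarity, and the `part` metadata field and sample documents are hypothetical.

```python
import re

# Hypothetical documents with a `part` metadata field.
documents = [
    {"text": "Article 25: Freedom of conscience and free profession, practice and propagation of religion.",
     "metadata": {"part": "III"}},
    {"text": "Article 51A: Fundamental duties.",
     "metadata": {"part": "IVA"}},
    {"text": "Article 29: Protection of interests of minorities.",
     "metadata": {"part": "III"}},
]


def construct_query(user_query):
    """Stand-in for the LLM query constructor: split the query into text and filters."""
    match = re.search(r"\bpart\s+([ivxlc]+a?)\b", user_query, flags=re.IGNORECASE)
    filters = {"part": match.group(1).upper()} if match else {}
    semantic = re.sub(r"\bin\s+part\s+[ivxlc]+a?\b", "", user_query, flags=re.IGNORECASE).strip()
    return semantic, filters


def tokens(text):
    return set(re.findall(r"\w+", text.lower()))


def search(semantic, filters, k=1):
    """Apply the metadata filters first, then rank the survivors by word overlap."""
    candidates = [
        doc for doc in documents
        if all(doc["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda d: len(tokens(semantic) & tokens(d["text"])), reverse=True)
    return ranked[:k]


semantic, filters = construct_query("minorities in part III")
results = search(semantic, filters)
print(semantic, filters, results[0]["text"])
```

The filter narrows the candidate set before any similarity ranking happens, so a document from another part cannot be returned no matter how similar its text is.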

### [WebResearchRetriever](https://python.langchain.com/docs/modules/data_connection/retrievers/web_research)

Given a query, this retriever will:

- Formulate a set of related Google searches
- Search for each
- Load all the resulting URLs
- Then embed and perform similarity search with the query on the consolidated page content