# Indexers and Retrievers
An `index` is a data structure that organizes and stores documents so they can be searched efficiently, while a `retriever` uses the index to find and return relevant documents in response to user queries. In LangChain, the primary index types are built on vector databases, and embedding-based indexes are the most common.

Retrievers focus on extracting relevant documents to merge with prompts for language models. A retriever exposes a `get_relevant_documents` method, which accepts a query string as input and returns a list of related documents.
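
To make the interface concrete, here is a minimal sketch of the retriever contract in plain Python. `ToyDocument` and `KeywordRetriever` are hypothetical names invented for illustration, and the word-overlap scoring is only a stand-in for a real index. The one thing taken from the text above is the `get_relevant_documents` method name and its shape: a query string in, a list of documents out.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordRetriever:
    """Hypothetical retriever that ranks documents by words shared with the query."""

    def __init__(self, documents, k=2):
        self.documents = documents
        self.k = k  # maximum number of documents to return

    def get_relevant_documents(self, query):
        query_words = set(query.lower().split())
        scored = []
        for doc in self.documents:
            overlap = len(query_words & set(doc.page_content.lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # highest overlap first
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


toy_docs = [
    ToyDocument("Google launches an API for PaLM"),
    ToyDocument("A recipe for sourdough bread"),
]
toy_retriever = KeywordRetriever(toy_docs)
toy_results = toy_retriever.get_relevant_documents("PaLM API")
print([doc.page_content for doc in toy_results])
```

A real retriever would replace the word-overlap score with a lookup in a vector store, as the rest of this notebook does.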

---

## Setup

In [1]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
openai.api_type = os.environ.get("OPENAI_API_TYPE")
openai.api_base = os.environ.get("OPENAI_API_BASE")
openai.api_key = os.environ.get("OPENAI_API_KEY")
openai.api_version = os.environ.get("OPENAI_API_VERSION")

## LLM

In [7]:
from langchain.chat_models import AzureChatOpenAI

llm = AzureChatOpenAI(
    deployment_name="gpt4",
    temperature=0,
)

## Indexers and Retrievers

Here we use the `TextLoader` class to load a text file as a document.

In [3]:
from langchain.document_loaders import TextLoader

# text to write to a local file
# taken from https://www.theverge.com/2023/3/14/23639313/google-ai-language-model-palm-api-challenge-openai
text = """Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
it says will help businesses “generate text, images, code, videos, audio, and more from
simple natural language prompts.”

PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or
Meta's LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,
PaLM is a flexible system that can potentially carry out all sorts of text generation and
editing tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for
example, or you could use it for tasks like summarizing text or even writing code.
(It's similar to features Google also announced today for its Workspace apps like Google
Docs and Gmail.)
"""

# write text to local file
with open("../../data/PaLM.txt", "w") as file:
    file.write(text)

# use TextLoader to load text from local file
loader = TextLoader("../../data/PaLM.txt")
docs_from_file = loader.load()

print(len(docs_from_file))
# 1

1


Then, we use `CharacterTextSplitter` to split the documents into chunks.

Document length varies by source. For instance, a PDF file containing a book may exceed the model's input window and cannot be processed directly. Splitting the large text into smaller segments lets us pass only the most relevant chunk as context, instead of expecting the model to comprehend the whole book to answer a question.
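
The sketch below illustrates separator-based chunking in plain Python. `split_on_separator` is a hypothetical helper, not LangChain's implementation: it assumes the text is split on blank lines, merges neighbouring pieces while they fit within `chunk_size`, and leaves out overlap handling entirely. It also shows one way a chunk can end up longer than `chunk_size`: a piece with no separator inside it is never cut.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Hypothetical simplified splitter: split on separator, merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # the piece fits, or it starts a new chunk (kept whole even if too long)
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "short paragraph\n\n" + "x" * 50 + "\n\nanother short one"
sample_chunks = split_on_separator(sample_text, chunk_size=30)
print([len(chunk) for chunk in sample_chunks])
```

The middle piece is 50 characters long and contains no blank line, so it comes back as a single oversized chunk despite `chunk_size=30`.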

In [14]:
from langchain.text_splitter import CharacterTextSplitter

# create a text splitter
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# split documents into chunks
docs = text_splitter.split_documents(docs_from_file)

print(len(docs))
# 2

docs

Created a chunk of size 373, which is longer than the specified 200


2


[Document(page_content='Google opens up its AI language model PaLM to challenge OpenAI and GPT-3\nGoogle is offering developers access to one of its most advanced AI language models: PaLM.\nThe search giant is launching an API for PaLM alongside a number of AI enterprise tools\nit says will help businesses “generate text, images, code, videos, audio, and more from\nsimple natural language prompts.”', metadata={'source': '../../data/PaLM.txt'}),
 Document(page_content="PaLM is a large language model, or LLM, similar to the GPT series created by OpenAI or\nMeta's LLaMA family of models. Google first announced PaLM in April 2022. Like other LLMs,\nPaLM is a flexible system that can potentially carry out all sorts of text generation and\nediting tasks. You could train PaLM to be a conversational chatbot like ChatGPT, for\nexample, or you could use it for tasks like summarizing text or even writing code.\n(It's similar to features Google also announced today for its Workspace apps like Goog

Embeddings allow us to effectively search for documents or portions of documents that relate to our query by examining their semantic similarities. The system becomes more efficient in finding and presenting relevant information by converting documents and user queries into numerical vectors (embeddings) and storing them in specialized databases like Deep Lake, which serves as our vector store database.
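
As a rough illustration of searching by semantic similarity, the sketch below compares made-up three-dimensional vectors using cosine similarity. Both the vectors and the choice of cosine similarity are assumptions for the example: real embedding models produce hundreds of dimensions, and a vector store may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up "embeddings" keyed by the text they represent
toy_index = {
    "Google launches PaLM API": [0.9, 0.1, 0.0],
    "Sourdough bread recipe": [0.0, 0.2, 0.9],
}
toy_query_vector = [0.8, 0.2, 0.1]  # pretend embedding of the user query

# pick the stored text whose vector points in the most similar direction
best_match = max(
    toy_index, key=lambda text: cosine_similarity(toy_index[text], toy_query_vector)
)
print(best_match)
```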

In [5]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

With our embeddings in place, we'll use the Deep Lake vector store.

In [6]:
from langchain.vectorstores import DeepLake

# Before executing the following code, make sure to have your
# Activeloop key saved in the "ACTIVELOOP_TOKEN" environment variable.

# create Deep Lake dataset
# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = os.environ.get("ACTIVELOOP_ORG_ID")
my_activeloop_dataset_name = "indexers_retrievers"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# add documents to our Deep Lake dataset
db.add_documents(docs)

Your Deep Lake dataset has been successfully created!
The dataset is private so make sure you are logged in!


Dataset(path='hub://iamrk04/indexers_retrievers', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype     shape     dtype  compression
  -------    -------   -------   -------  ------- 
 embedding  embedding  (2, 768)  float32   None   
    id        text      (2, 1)     str     None   
 metadata     json      (2, 1)     str     None   
   text       text      (2, 1)     str     None   


 

['a0193d9d-1782-11ee-b60b-010101010000',
 'a0193d9e-1782-11ee-98a7-010101010000']

Once we have the retriever, we can start with question-answering. We will employ a so-called "stuff chain" (refer to CombineDocuments Chains). Stuffing is one way to supply information to the LLM. Using this technique, we "stuff" all the information into the LLM's prompt. However, this method is only effective with shorter documents, as most LLMs have a context length limit.
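
Conceptually, stuffing amounts to concatenating every retrieved document into a single prompt. The sketch below shows that idea in plain Python; `build_stuff_prompt` and the prompt wording are invented for illustration and are not the template the chain actually uses.

```python
def build_stuff_prompt(question, documents):
    """Hypothetical helper: place all documents, unmodified, into one prompt."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


toy_context = [
    "PaLM is a large language model.",
    "Google is launching an API for PaLM.",
]
stuffed_prompt = build_stuff_prompt("What is PaLM?", toy_context)
print(stuffed_prompt)
```

Because the prompt grows with every document added, this approach runs into the context length limit as soon as the retrieved text gets long.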

In [8]:
from langchain.chains import RetrievalQA

# create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=db.as_retriever()  # retriever
)

In [15]:
# retrieve documents with the base (uncompressed) retriever to inspect what it returns
retrieved_docs = db.as_retriever().get_relevant_documents(
    "How Google plans to challenge OpenAI?"
)
print(retrieved_docs[0].page_content)

Google opens up its AI language model PaLM to challenge OpenAI and GPT-3
Google is offering developers access to one of its most advanced AI language models: PaLM.
The search giant is launching an API for PaLM alongside a number of AI enterprise tools
it says will help businesses “generate text, images, code, videos, audio, and more from
simple natural language prompts.”


We can now ask about a specific topic covered in the documents. The query is converted to an embedding, and a similarity search identifies the matching documents to use as context for the LLM.

In [9]:
query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)

Google plans to challenge OpenAI by opening up its AI language model, PaLM, to developers. They are launching an API for PaLM alongside a number of AI enterprise tools that will help businesses generate text, images, code, videos, audio, and more from simple natural language prompts. By offering access to its advanced AI language model and providing tools for various applications, Google aims to compete with OpenAI's GPT-3 and other large language models in the market.


## A Potential Problem

This method has a downside: when you store the data, you may not know which queries it will need to serve later. In the Q&A example, we split the text into chunks without regard to any particular question, so both useful and useless text can show up when a user asks a question.

Including unrelated information in the LLM prompt is detrimental because:
- It can divert the LLM's focus from pertinent details.
- It occupies valuable space that could be utilized for more relevant information.

## Possible Solution

The `ContextualCompressionRetriever` is a wrapper around another retriever in LangChain. It takes a base retriever and a `DocumentCompressor` and automatically compresses the retrieved documents from the base retriever. This means that only the most relevant parts of the retrieved documents are returned, given a specific query.

A popular compressor choice is the `LLMChainExtractor`, which uses an `LLMChain` to extract only the statements relevant to the query from each document. To improve retrieval, we wrap the base retriever and an `LLMChainExtractor` in a `ContextualCompressionRetriever`: the extractor iterates over the initially returned documents and keeps only the content relevant to the query.
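
The wrapper pattern can be sketched in plain Python as follows. Every name here (`SentenceFilterCompressor`, `ToyCompressionRetriever`, `toy_base_retriever`) is hypothetical, and the compressor uses a simple shared-word check where the real `LLMChainExtractor` would call an LLM. For brevity, documents are plain strings.

```python
import re


class SentenceFilterCompressor:
    """Hypothetical compressor: keep only sentences that share a word with the query."""

    def compress(self, text, query):
        query_words = set(re.findall(r"\w+", query.lower()))
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        kept = [
            sentence
            for sentence in sentences
            if query_words & set(re.findall(r"\w+", sentence.lower()))
        ]
        return " ".join(kept)


class ToyCompressionRetriever:
    """Hypothetical wrapper: retrieve with the base retriever, then compress each result."""

    def __init__(self, base_retriever, compressor):
        self.base_retriever = base_retriever
        self.compressor = compressor

    def get_relevant_documents(self, query):
        compressed = []
        for text in self.base_retriever(query):
            shortened = self.compressor.compress(text, query)
            if shortened:  # drop documents with nothing relevant left
                compressed.append(shortened)
        return compressed


def toy_base_retriever(query):
    # stand-in for a vector store lookup: always returns the same chunk
    return ["Google is launching an API for PaLM. The weather was sunny that day."]


toy_compression_retriever = ToyCompressionRetriever(
    toy_base_retriever, SentenceFilterCompressor()
)
print(toy_compression_retriever.get_relevant_documents("PaLM API"))
```

The base retriever decides which documents come back; the compressor only trims what is inside them, or drops a document entirely when nothing relevant remains.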

In [10]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# create compressor for the retriever
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=db.as_retriever()
)

In [12]:
# retrieving compressed documents to verify final result
retrieved_docs = compression_retriever.get_relevant_documents(
    "How Google plans to challenge OpenAI?"
)
print(retrieved_docs[0].page_content)

Google opens up its AI language model PaLM to challenge OpenAI and GPT-3. The search giant is launching an API for PaLM alongside a number of AI enterprise tools it says will help businesses “generate text, images, code, videos, audio, and more from simple natural language prompts.”


Once we have created the `compression_retriever`, we can use it in the chain.

In [13]:
from langchain.chains import RetrievalQA

# create a retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=compression_retriever  # retriever
)

query = "How Google plans to challenge OpenAI?"
response = qa_chain.run(query)
print(response)

Google plans to challenge OpenAI by opening up its AI language model, PaLM, and launching an API for it alongside a number of AI enterprise tools. These tools will help businesses generate text, images, code, videos, audio, and more from simple natural language prompts. By offering a competitive alternative to OpenAI's GPT-3, Google aims to attract users and businesses to its own large language model and related services.
