# Contextual Compression and Filtering in RAG

### Installing dependencies

In [13]:
!pip install -qU langchain langchain-community huggingface_hub lancedb pypdf python-dotenv transformers sentence-transformers

### Importing libraries

In [8]:
import os
from getpass import getpass

import lancedb
from dotenv import load_dotenv

# LLMs and embedding models
from langchain_community.llms import HuggingFaceHub, OpenAI
from langchain_community.embeddings import (
    OpenAIEmbeddings,
    SentenceTransformerEmbeddings,
)

# document loading, splitting and prompting
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate

# contextual compression components
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    EmbeddingsFilter,
    LLMChainExtractor,
)
from langchain.document_transformers import EmbeddingsRedundantFilter

In [9]:
import os
from getpass import getpass

os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Enter HuggingFace Hub Token:")

Enter HuggingFace Hub Token:··········


### Load the data

In [None]:
!wget https://raw.githubusercontent.com/lancedb/vectordb-recipes/main/examples/Contextual-Compression-with-RAG/Science_Glossary.pdf

In [14]:
loader = PyPDFLoader("Science_Glossary.pdf")
documents = loader.load()
print(len(documents))
print(documents[0].page_content)

7
SCIENCE GLOSSARY 
 
Abiotic:  A nonliving factor or element (e.g., light, water, heat, rock, energy, mineral). 
 
Acid deposition: Precipitation with a pH less than 5.6 that forms in the atmosphere when certain pollutants mix 
with water vapor. 
 
Allele:  Any of a set of possible forms of a gene. 
 
Biochemical conversion:  The changing of organic matter into other chemical forms. 
 
Biological diversity: The variety and complexity of species present and interacting in an ecosystem and the relative 
abundance of each. 
 
Biomass conversion: The changing of organic matter that has been produced by photosynthesis into useful liquid, gas 
or fuel. 
 
Biomedical technology: The application of health care theories to develop methods, products and tools to maintain or 
improve homeostasis. 
 
Biomes:  A community of living organisms of a single major ecological region. 
 
Biotechnology:  The ways that humans apply biological concepts to produce products and provide services. 
 
Biotic:  A

### Split texts

In [15]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=70)
final_doc = text_splitter.split_documents(documents)
print(len(final_doc))
print(final_doc[0])

22
page_content='SCIENCE GLOSSARY 
 
Abiotic:  A nonliving factor or element (e.g., light, water, heat, rock, energy, mineral). 
 
Acid deposition: Precipitation with a pH less than 5.6 that forms in the atmosphere when certain pollutants mix 
with water vapor. 
 
Allele:  Any of a set of possible forms of a gene. 
 
Biochemical conversion:  The changing of organic matter into other chemical forms. 
 
Biological diversity: The variety and complexity of species present and interacting in an ecosystem and the relative 
abundance of each. 
 
Biomass conversion: The changing of organic matter that has been produced by photosynthesis into useful liquid, gas 
or fuel.' metadata={'source': 'Science_Glossary.pdf', 'page': 0}
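
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler splitter than the one used above. The sketch below is a plain fixed-width sliding window over characters; `sliding_window_chunks` is a hypothetical helper written for this illustration, and it does not reproduce the separator-aware logic of `RecursiveCharacterTextSplitter`.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the role the 70-character overlap plays for the 700-character chunks in this notebook: a definition that straddles a chunk boundary is more likely to appear whole in at least one chunk.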


### Embeddings

In [16]:
embeddings = SentenceTransformerEmbeddings(
    model_name="llmware/industry-bert-insurance-v0.1"
)


### Load the LLM

In [17]:
repo_id = "llmware/bling-sheared-llama-1.3b-0.1"
llm = HuggingFaceHub(
    repo_id=repo_id, model_kwargs={"temperature": 0.3, "max_length": 500}
)


In [18]:
def pretty_print_docs(docs):
    print(
        f"\n{'-'* 100}\n".join(
            [f"Document{i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

### Instantiate VectorStore (LanceDB)

In [19]:
import lancedb

context_data = lancedb.connect("./.lancedb")
table = context_data.create_table(
    "context",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)

### Retriever

In [20]:
# build the LanceDB vector store from the chunks (the retriever is created in the next cell)
from langchain_community.vectorstores import LanceDB

database = LanceDB.from_documents(final_doc, embeddings)

In [21]:
retriever_d = database.as_retriever(search_kwargs={"k": 3})

In [22]:
docs = retriever_d.get_relevant_documents(query="What is Wetlands?")
pretty_print_docs(docs)


Document1:

body of water; also called a drainage basin. 
  
Wetlands: Lands where water saturation is the dominant factor determining the nature of the soil 
development and the plant and animal communities (e.g., sloughs, estuaries, marshes). 
 
 
 7
----------------------------------------------------------------------------------------------------
Document2:

developed state. 
 
Endangered species:  A species that is in danger of extinction throughout all or a significant portion of its range. 
 
Engineering: The application of scientific, physical, mechanical and mathematical principles to design 
processes, products and structures that improve the quality of life. 
 
Environment: The total of the surroundings (air, water, soil, vegetation, people, wildlife) influencing each living 
being’s existence, including physical, biological and all other factors; the surroundings of a plant 
or animals including other plants or animals, climate and location. 
 2
---------------------------

### Compressor

In [23]:
# creating the compressor
compressor = LLMChainExtractor.from_llm(llm=llm)

# compression retriever = base retriever + compressor
compression_retriever = ContextualCompressionRetriever(
    base_retriever=retriever_d, base_compressor=compressor
)
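
The "base retriever + compressor" composition can be sketched without any LLM. In the example below every name (`make_compression_retriever`, `toy_retriever`, `toy_line_extractor`) is hypothetical and exists only for illustration; the keyword matcher is a stand-in for the LLM-based extraction step, and the passages are abridged from the glossary.

```python
from typing import Callable, List

# Hypothetical type aliases for this sketch (not LangChain types).
Retriever = Callable[[str], List[str]]
Compressor = Callable[[List[str], str], List[str]]

STOPWORDS = {"what", "is", "are", "the", "a", "an"}


def make_compression_retriever(
    base_retriever: Retriever, base_compressor: Compressor
) -> Retriever:
    """Return a retriever that compresses whatever the base retriever finds."""

    def retrieve(query: str) -> List[str]:
        candidates = base_retriever(query)
        return base_compressor(candidates, query)

    return retrieve


def toy_retriever(query: str) -> List[str]:
    # Stand-in for vector search: returns every passage regardless of the query.
    return [
        "Allele: Any of a set of possible forms of a gene.\n"
        "Wetlands: Lands where water saturation is the dominant factor.",
        "Biomes: A community of living organisms of a single major ecological region.",
    ]


def toy_line_extractor(docs: List[str], query: str) -> List[str]:
    # Stand-in for an LLM extractor: keep only lines that mention a query term
    # and drop passages that have no matching line at all.
    words = (w.strip("?.,!").lower() for w in query.split())
    terms = {w for w in words if w} - STOPWORDS
    compressed = []
    for doc in docs:
        kept = [
            line for line in doc.splitlines()
            if any(term in line.lower() for term in terms)
        ]
        if kept:
            compressed.append("\n".join(kept))
    return compressed


toy_compression_retriever = make_compression_retriever(toy_retriever, toy_line_extractor)
result = toy_compression_retriever("What is Wetlands?")
print(result)
# ['Wetlands: Lands where water saturation is the dominant factor.']
```

The wrapper never changes what is retrieved; it only post-processes the candidates, so the base retriever and the compressor can be swapped independently.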

In [25]:
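# Note: the filter below uses the local sentence-transformer embeddings, so this
# key is only needed if you swap in OpenAIEmbeddings.
os.environ["OPENAI_API_KEY"] = getpass()
embeddings_filter = EmbeddingsFilter(embeddings=embeddings)
compression_retriever_filter = ContextualCompressionRetriever(
    base_retriever=retriever_d, base_compressor=embeddings_filter
)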

compressed_docs = compression_retriever_filter.get_relevant_documents(
    query="What is the Environment?"
)
pretty_print_docs(compressed_docs)

··········
Document1:

Niche (ecological): The role played by an organism in an ecosystem; its food preferences, requirements for shelter, 
special behaviors and the timing of its activities (e.g., nocturnal, diurnal), interaction with other 
organisms and its habitat. 
 
Nonpoint source pollution: Contamination that originates from many locations that all discharge into a location (e.g., a lake, 
stream, land area). 
 
Nonrenewable resources: Substances (e.g., oil, gas, coal, copper, gold) that, once used, cannot be replaced in this geological 
age. 
 
Nova: A variable star that suddenly increases in brightness to several times its normal magnitude and
----------------------------------------------------------------------------------------------------
Document2:

developed state. 
 
Endangered species:  A species that is in danger of extinction throughout all or a significant portion of its range. 
 
Engineering: The application of scientific, physical, mechanical and mathematical pri
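
An embeddings-based filter scores each retrieved document against the query and keeps only the best matches. The following is a minimal sketch of that idea with hand-written 3-dimensional vectors. `filter_by_similarity` is a hypothetical helper: its `k` and `similarity_threshold` arguments are modelled on the options of `EmbeddingsFilter`, but the code is not the library's implementation.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: Optional[int] = None,
    similarity_threshold: Optional[float] = None,
) -> List[int]:
    """Return indices of documents to keep, most similar first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if k is not None:
        scored = scored[:k]  # keep at most k documents
    if similarity_threshold is not None:
        scored = [p for p in scored if p[0] >= similarity_threshold]
    return [i for _, i in scored]


# Toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [1.0, 0.0, 0.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned (similarity about 0.71)
]

print(filter_by_similarity(query_vec, doc_vecs, k=2))
print(filter_by_similarity(query_vec, doc_vecs, similarity_threshold=0.9))
```

With only `k` set, the filter always returns the top `k` documents even if none of them is a good match, which is one plausible reason the output above can contain passages unrelated to the query. A similarity threshold is the stricter option.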

### Retrieve answer from Compressed Data

In [26]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=compression_retriever_filter, verbose=True
)
# Ask Question
qa("What is Environment?")

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'What is Environment?',
 'result': "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\nNiche (ecological): The role played by an organism in an ecosystem; its food preferences, requirements for shelter, \nspecial behaviors and the timing of its activities (e.g., nocturnal, diurnal), interaction with other \norganisms and its habitat. \n \nNonpoint source pollution: Contamination that originates from many locations that all discharge into a location (e.g., a lake, \nstream, land area). \n \nNonrenewable resources: Substances (e.g., oil, gas, coal, copper, gold) that, once used, cannot be replaced in this geological \nage. \n \nNova: A variable star that suddenly increases in brightness to several times its normal magnitude and\n\ndeveloped state. \n \nEndangered species:  A species that is in danger of extinction throughout all or a significant portion of its rang

### Pipeline

In [27]:
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings, k=5)

# creating the pipeline
compressed_pipeline = DocumentCompressorPipeline(
    transformers=[redundant_filter, relevant_filter]
)

# compressor retriever
comp_pipe_retrieve = ContextualCompressionRetriever(
    base_retriever=retriever_d, base_compressor=compressed_pipeline
)

# print the retriever configuration
print(comp_pipe_retrieve)

# Get relevant documents
compressed_docs = comp_pipe_retrieve.get_relevant_documents(
    query="What is Environment?"
)
pretty_print_docs(compressed_docs)

base_compressor=DocumentCompressorPipeline(transformers=[EmbeddingsRedundantFilter(embeddings=HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
), model_name='llmware/industry-bert-insurance-v0.1', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False), similarity_fn=<function cosine_similarity at 0x7a582bfddab0>, similarity_threshold=0.95), EmbeddingsFilter(embeddings=HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dime
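
The printed configuration shows a redundancy stage with `similarity_threshold=0.95` followed by a relevance stage. The two-stage idea can be sketched on toy vectors as below. `drop_redundant` and `run_pipeline` are hypothetical helpers, and the greedy keep-first rule is a simplification chosen for this sketch, not necessarily the procedure `EmbeddingsRedundantFilter` follows.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant(
    doc_vecs: List[Sequence[float]], similarity_threshold: float = 0.95
) -> List[int]:
    """Stage 1: keep a document only if it is not a near-duplicate of one already kept."""
    kept: List[int] = []
    for i, vec in enumerate(doc_vecs):
        if all(cosine_similarity(vec, doc_vecs[j]) <= similarity_threshold for j in kept):
            kept.append(i)
    return kept


def run_pipeline(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int,
    redundancy_threshold: float = 0.95,
) -> List[int]:
    """Stage 1 removes near-duplicates, stage 2 keeps the k most query-relevant survivors."""
    unique = drop_redundant(doc_vecs, redundancy_threshold)
    ranked = sorted(
        unique,
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real embeddings.
query_vec = [1.0, 0.0, 0.0]
doc_vecs = [
    [1.0, 0.0, 0.0],    # relevant
    [0.99, 0.05, 0.0],  # near-duplicate of the first document
    [0.0, 1.0, 0.0],    # distinct but unrelated to the query
]

print(drop_redundant(doc_vecs))              # [0, 2]
print(run_pipeline(query_vec, doc_vecs, k=1))  # [0]
```

Running the redundancy stage first means the `k` slots of the relevance stage are not spent on several copies of the same passage.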

In [29]:
compressed_docs = comp_pipe_retrieve.get_relevant_documents(
    query="What is Hazardous waste?"
)
pretty_print_docs(compressed_docs)

Document1:

and age relationships of rock units and the occurrences of structural features, mineral deposits 
and fossil localities). 
 
Groundwater:  Water that infiltrates the soil and is located in underground reservoirs called aquifers. 
 
Hazardous waste: A solid that, because of its quantity or concentration or its physical, chemical or infectious 
characteristics, may cause or pose a substantial present or potential hazard to human health or 
the environment when improperly treated, stored, transported or disposed of, or otherwise 
managed. 
 
Homeostasis:  The tendency for a system to remain in a state of equilibrium by resisting change. 
 
 3
----------------------------------------------------------------------------------------------------
Document2:

Transportation systems: A group of related parts that function together to perform a major task in any form of 
transportation. 
 
Transportation  
technology:  The physical ways humans move materials, goods and people. 
 
Trop