## Two key enhancements for RAG
##### 1. When you have different kinds of documents, a single index is often not enough: you create multiple indexes and have to query each of them separately to perform RAG. We will see how to do this better using the Merger Retriever (LOTR, "Lord of the Retrievers").
##### 2. Lost in the middle! Imagine the retriever returns 10 documents for a query, but the LLM that synthesizes the response mostly attends to the documents at the top and bottom of the context and tends to ignore those in the middle. Reordering (reranking) the retrieved documents mitigates this issue.

# 1. Merger Retriever (LOTR)

In [22]:
import os, yaml
from langchain.vectorstores import Chroma
from langchain.chat_models import AzureChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.retrievers.merger_retriever import MergerRetriever
from langchain.document_transformers import (
                                            EmbeddingsRedundantFilter,
                                            EmbeddingsClusteringFilter,
                                            LongContextReorder
                                            )
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ContextualCompressionRetriever

In [10]:
with open('cadentials.yaml') as f:
    credentials = yaml.load(f, Loader=yaml.FullLoader)

os.environ['OPENAI_API_KEY'] = credentials['OPENAI_API_KEY']
os.environ['OPENAI_API_TYPE'] = credentials['OPENAI_API_TYPE']
os.environ['AZURE_OPENAI_ENDPOINT'] = credentials['AD_OPENAI_API_BASE']
os.environ['OPENAI_API_VERSION'] = credentials['AD_OPENAI_API_VERSION']
os.environ["COHERE_API"] = credentials['COHERE_API']
os.environ['ENGINE'] = credentials['ENGINE']

In [11]:
embedding = HuggingFaceBgeEmbeddings(
                                    model_name="BAAI/bge-small-en-v1.5",
                                    model_kwargs={'device': 'mps'},
                                    encode_kwargs={'normalize_embeddings': True}
                                    )

llm = AzureChatOpenAI(
                    deployment_name=credentials['AD_DEPLOYMENT_ID'],
                    model_name=credentials['AD_ENGINE'],
                    temperature=0.9, 
                    max_tokens=256
                    )

### Data Preprocessing

In [12]:
loader_un_sdg = PyPDFLoader("data/political/UN SDG.pdf")
documents_un_sdg = loader_un_sdg.load()
text_splitter_un_sdg = RecursiveCharacterTextSplitter(
                                                    chunk_size=1000,
                                                    chunk_overlap=100
                                                    )
texts_un_sdg = text_splitter_un_sdg.split_documents(documents_un_sdg)

In [13]:
texts_un_sdg[0]

Document(page_content='TRANSFORMING OUR WORLD:\nTHE 2030 AGENDA FOR \nSUST AINABLE DEVELOPMENTUNITED NA TIONS', metadata={'source': 'data/political/UN SDG.pdf', 'page': 0})

In [14]:
loader_paris_agreement = PyPDFLoader("data/political/english_paris_agreement.pdf")
documents_paris_agreement = loader_paris_agreement.load()
text_splitter_paris_agreement = RecursiveCharacterTextSplitter(
                                                                chunk_size=1000,
                                                                chunk_overlap=100
                                                                )
texts_paris_agreement = text_splitter_paris_agreement.split_documents(documents_paris_agreement)

In [15]:
texts_paris_agreement[0]

Document(page_content='PARIS AGREEMENT \n(mm \nUNITED NATIONS \n2015', metadata={'source': 'data/political/english_paris_agreement.pdf', 'page': 0})
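
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-window splitting with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch ignores. The helper name `split_with_overlap` and the sample string are hypothetical and only serve the illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the role the 100-character overlap plays for the 1000-character chunks above: a sentence cut at a chunk boundary still appears whole in at least one chunk.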

### Vector Store

In [19]:
if not os.path.exists("db/18/un_sdg_chroma_cosine"):
    un_sdg_store = Chroma.from_documents(
                                        texts_un_sdg, 
                                        embedding, 
                                        collection_metadata={"hnsw:space": "cosine"}, 
                                        persist_directory="db/18/un_sdg_chroma_cosine"
                                        )
else:
    un_sdg_store = Chroma(
                        persist_directory="db/18/un_sdg_chroma_cosine",
                        embedding_function=embedding
                        )

if not os.path.exists("db/18/paris_chroma_cosine"):
    paris_agreement_store = Chroma.from_documents(
                                        texts_paris_agreement, 
                                        embedding, 
                                        collection_metadata={"hnsw:space": "cosine"}, 
                                        persist_directory="db/18/paris_chroma_cosine"
                                        )
else:
    paris_agreement_store = Chroma(
                        persist_directory="db/18/paris_chroma_cosine",
                        embedding_function=embedding
                        )
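
Both stores are created with `"hnsw:space": "cosine"`, and the embeddings are normalized (`normalize_embeddings=True`). The sketch below, using only the standard library and made-up two-dimensional vectors, illustrates why that combination is convenient: for unit-length vectors, cosine similarity is just the dot product. The helper names are hypothetical, and the distance shown (one minus the similarity) reflects my understanding of how Chroma reports cosine distance rather than anything stated in this notebook.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for a query embedding and a chunk embedding.
query_vec = [3.0, 4.0]
chunk_vec = [4.0, 3.0]

cos = cosine_similarity(query_vec, chunk_vec)
dot_of_normalized = dot(l2_normalize(query_vec), l2_normalize(chunk_vec))
cosine_distance = 1.0 - cos  # assumed distance convention: smaller means more similar

print(round(cos, 4), round(dot_of_normalized, 4), round(cosine_distance, 4))
```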

### Build Merged Retriever

In [20]:
retriever_un_sdg = un_sdg_store.as_retriever(
                                            search_type = "similarity", 
                                            search_kwargs = {
                                                            "k":3, 
                                                            "include_metadata": True
                                                            }
                                            )

retriever_paris_agreement = paris_agreement_store.as_retriever(
                                                                search_type = "similarity", 
                                                                search_kwargs = {
                                                                                "k":3, 
                                                                                "include_metadata": True
                                                                                }
                                                                )
lotr = MergerRetriever(retrievers=[retriever_un_sdg, retriever_paris_agreement])
lotr


MergerRetriever(retrievers=[VectorStoreRetriever(tags=['Chroma', 'HuggingFaceBgeEmbeddings'], vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x12c43bca0>, search_kwargs={'k': 3, 'include_metadata': True}), VectorStoreRetriever(tags=['Chroma', 'HuggingFaceBgeEmbeddings'], vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x2b9133520>, search_kwargs={'k': 3, 'include_metadata': True})])
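
As a rough mental model of what the merged retriever returns, the sketch below interleaves the result lists of several retrievers rank by rank (round-robin). This reflects my understanding of how `MergerRetriever` combines results, not the library's actual source, and the function name `merge_round_robin` and the string placeholders for documents are hypothetical.

```python
def merge_round_robin(result_lists):
    """Interleave several ranked result lists: all rank-1 hits first, then rank-2, and so on."""
    merged = []
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):  # shorter lists simply run out earlier
                merged.append(results[rank])
    return merged


# Placeholder strings stand in for Document objects.
un_sdg_hits = ["sdg-1", "sdg-2", "sdg-3"]
paris_hits = ["paris-1", "paris-2"]

merged = merge_round_robin([un_sdg_hits, paris_hits])
print(merged)  # ['sdg-1', 'paris-1', 'sdg-2', 'paris-2', 'sdg-3']
```

Note that this kind of merge does not compare scores across retrievers: with `k=3` on each store, the merged list holds up to six documents regardless of which source is actually more relevant to the query.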

### Querying the Merged Retriever

In [21]:
# Each item returned by the merged retriever is a single Document (one chunk)
for chunk in lotr.get_relevant_documents("Is there any framework available to tackle the climate change?"):
    print(chunk.page_content)

finance should  represent a progression beyond previous efforts. 
4. The provision of scaled-up financial resources should aim to achieve a 
balance between adaptation and mitigation, taking into account country-driven 
strategies, and the priorities and needs of developing country Parties, especially 
those that are particularly vulnerable to the adverse effects of climate change and 
have significant capacity constraints, such as the least developed countries and 
small island developing States, considering the need for public and grant-based 
resources for adaptation. 
5. Developed country Parties shall biennially communicate indicative 
quantitative and qualitative information related to paragraphs 1 and 3 of this 
Article, as applicable, including, as available, projected levels of public financial 
resources to be provided to developing country Parties. Other Parties providing 
resources are encouraged to communicate biennially such information on a 
voluntary basis.
and+ adaptat

# 2. Reranking

In [23]:
docs = lotr.get_relevant_documents("Is there any framework available to tackle the climate change?")
docs

 Document(page_content='finance should  represent a progression beyond previous efforts. \n4. The provision of scaled-up financial resources should aim to achieve a \nbalance between adaptation and mitigation, taking into account country-driven \nstrategies, and the priorities and needs of developing country Parties, especially \nthose that are particularly vulnerable to the adverse effects of climate change and \nhave significant capacity constraints, such as the least developed countries and \nsmall island developing States, considering the need for public and grant-based \nresources for adaptation. \n5. Developed country Parties shall biennially communicate indicative \nquantitative and qualitative information related to paragraphs 1 and 3 of this \nArticle, as applicable, including, as available, projected levels of public financial \nresources to be provided to developing country Parties. Other Parties providing \nresources are encouraged to communicate biennially such information

In [25]:
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)
reordered_docs

[Document(page_content='finance should  represent a progression beyond previous efforts. \n4. The provision of scaled-up financial resources should aim to achieve a \nbalance between adaptation and mitigation, taking into account country-driven \nstrategies, and the priorities and needs of developing country Parties, especially \nthose that are particularly vulnerable to the adverse effects of climate change and \nhave significant capacity constraints, such as the least developed countries and \nsmall island developing States, considering the need for public and grant-based \nresources for adaptation. \n5. Developed country Parties shall biennially communicate indicative \nquantitative and qualitative information related to paragraphs 1 and 3 of this \nArticle, as applicable, including, as available, projected levels of public financial \nresources to be provided to developing country Parties. Other Parties providing \nresources are encouraged to communicate biennially such information

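#### Comparing the two outputs shows that the documents have been reordered: the most relevant ones are placed at the beginning and end of the list, and the less relevant ones in the middle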
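
Since the printed outputs are long, a toy example may make the reordering easier to follow. The sketch below reproduces the idea behind `LongContextReorder` as I understand it: given documents sorted from most to least relevant, it alternates them between the front and the back of the result so that the weakest matches end up in the middle. The function name `lost_in_the_middle_reorder` and the labels `d1`..`d6` are hypothetical, and this is an illustration rather than the library's implementation.

```python
def lost_in_the_middle_reorder(docs_by_relevance):
    """Place the most relevant items at both ends and the least relevant in the middle.

    docs_by_relevance is assumed to be sorted from most to least relevant.
    """
    reordered = []
    # Walk from least to most relevant, alternately growing the list at
    # the front and at the back, so later (better) items land at the edges.
    for i, doc in enumerate(reversed(docs_by_relevance)):
        if i % 2 == 1:
            reordered.append(doc)
        else:
            reordered.insert(0, doc)
    return reordered


ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]  # d1 is the most relevant
reordered = lost_in_the_middle_reorder(ranked)
print(reordered)  # ['d2', 'd4', 'd6', 'd5', 'd3', 'd1']
```

The two best matches (`d1`, `d2`) sit at the edges of the context, where the LLM is most likely to use them, while the weakest (`d5`, `d6`) occupy the middle.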