### Ensemble Retriever

An Ensemble Retriever combines multiple retrieval methods to improve overall retrieval quality. By pairing a sparse, keyword-based retriever (such as BM25) with a dense, embedding-based retriever (such as one backed by a FAISS index), you can leverage the strengths of each method and offset their individual weaknesses.

Weighted Scoring:

- Each retriever produces its own set of results with associated relevance scores. The ensemble retriever then combines these scores, often using a weighted sum or another aggregation method, to produce a final ranking of documents.

Ranking and Fusion:

- The ensemble can rank the documents based on the combined scores and either merge the results (by taking the top-ranked documents from each retriever) or re-rank the combined set of documents.

Adaptive Strategies:

- Some ensemble retrievers adaptively adjust the weights given to each retriever based on the query or the context. For example, certain queries might benefit more from dense retrieval (FAISS), while others might benefit from keyword-based retrieval (BM25).
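
The weighted-scoring and adaptive-weight ideas above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements fusion: the score dictionaries are made up, and `min_max_normalize`, `weighted_score_fusion`, and `choose_weights` are hypothetical helper names. Scores are min-max normalized per retriever first, because raw BM25 scores and embedding similarities live on different scales.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so they are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_score_fusion(
    results: List[Dict[str, float]], weights: List[float]
) -> List[Tuple[str, float]]:
    """Combine per-retriever scores with a weighted sum and rank the union."""
    if len(results) != len(weights):
        raise ValueError("need exactly one weight per retriever")
    fused: Dict[str, float] = {}
    for scores, weight in zip(results, weights):
        for doc_id, score in min_max_normalize(scores).items():
            # A document missing from a retriever simply gets no contribution from it.
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


def choose_weights(query: str) -> List[float]:
    """Toy adaptive rule (an assumption): short queries lean on keywords,
    longer ones lean on embeddings. Returns [keyword_weight, dense_weight]."""
    return [0.7, 0.3] if len(query.split()) <= 2 else [0.3, 0.7]


# Made-up scores, purely for illustration.
bm25_scores = {"doc_a": 12.0, "doc_b": 7.0, "doc_c": 2.0}
dense_scores = {"doc_b": 0.92, "doc_c": 0.88, "doc_d": 0.40}

query = "langsmith"
ranking = weighted_score_fusion([bm25_scores, dense_scores], choose_weights(query))
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

With the one-word query, the toy heuristic favors the keyword retriever, so `doc_a` (the top BM25 hit) ranks first; a longer query would shift the weights toward the dense retriever.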

In [6]:
import glob
from langchain.document_loaders import TextLoader
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.schema import Document

from langchain.vectorstores import FAISS

from langchain.embeddings.openai import OpenAIEmbeddings

# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
embedding = OpenAIEmbeddings()

In [3]:
# Specify the directory path (adjust as needed)
directory_path = "/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post"

# Use glob to find all .txt files in the directory and its subdirectories.
# "**" must be its own path component for the recursive search to work.
txt_files = glob.glob(f"{directory_path}/**/*.txt", recursive=True)

# Create one TextLoader per file
loaders = [TextLoader(path) for path in txt_files]

# Initialize an empty list to store documents
docs = []

# Loop through each loader, load the document, and extend the docs list
for loader in loaders:
    docs.extend(loader.load())

In [5]:
# Initialize the BM25 (keyword-based) retriever and return the top 2 documents per query
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2
bm25_retriever.get_relevant_documents("langsmith")


[Document(metadata={'source': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_evaluating-rag-pipelines-with-ragas-langsmith_.txt'}, page_content='URL: https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/\nTitle: Evaluating RAG pipelines with Ragas + LangSmith\n\nEditor\'s Note: This post was written in collaboration with the Ragas team. One of the things we think and talk about a lot at LangChain is how the industry will evolve to identify new monitoring and evaluation metrics that evolve beyond traditional ML ops metrics. Ragas is an exciting new framework that helps developers evaluate QA pipelines in new ways. This post shows how LangSmith and Ragas can be a powerful combination for teams that want to build reliable LLM apps.\n\nHow important evals are to the team is a major differentiator between folks rushing out hot garbage and those seriously building products in the space.\n\nThis HackerNews comment emphasizes th

In [8]:
# Embed the documents into a FAISS index and expose it as a retriever (top 2 documents per query)
faiss_vectorstore = FAISS.from_documents(docs, embedding)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})
faiss_retriever.get_relevant_documents("langsmith")


[Document(metadata={'source': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_announcing-langsmith_.txt'}, page_content='URL: https://blog.langchain.dev/announcing-langsmith/\nTitle: Announcing LangSmith, a unified platform for debugging, testing, evaluating, and monitoring your LLM applications\n\nLangChain exists to make it as easy as possible to develop LLM-powered applications.\n\nWe started with an open-source Python package when the main blocker for building LLM-powered applications was getting a simple prototype working. We remember seeing Nat Friedman tweet in late 2022 that there was “not enough tinkering happening.” The LangChain open-source packages are aimed at addressing this and we see lots of tinkering happening now (Nat agrees)–people are building everything from chatbots over internal company documents to an AI dungeon master for a Dungeons and Dragons game.\n\nThe blocker has now changed. While it’s easy to build a prototyp

In [9]:
# Initialize the ensemble retriever, weighting the BM25 and FAISS retrievers equally
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, faiss_retriever],
                                       weights=[0.5, 0.5])
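
For intuition about what the `weights` argument does: LangChain's documentation describes `EnsembleRetriever` as re-ranking results with Reciprocal Rank Fusion (RRF), which works on each document's rank rather than its raw score. The sketch below is an illustration of weighted RRF under stated assumptions, not the library's actual code; `weighted_rrf`, the document ids, and the smoothing constant `c = 60` (a commonly used RRF value) are chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    ranked_lists: List[List[str]], weights: List[float], c: int = 60
) -> List[str]:
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of a list (small rank) earn a larger share.
            scores[doc_id] += weight / (rank + c)
    # Return document ids ordered by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-2 lists standing in for the BM25 and FAISS results (k = 2).
bm25_ranking = ["ragas_post", "announcing_post"]
faiss_ranking = ["announcing_post", "tavrn_post"]

fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.5, 0.5])
print(fused)
```

Because `announcing_post` appears in both lists, it accumulates contributions from both retrievers and rises to the top, even though it was not the first BM25 result.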

In [12]:
# Store the results under a new name so they do not overwrite the corpus in `docs`
results = ensemble_retriever.get_relevant_documents("Integration of Langsmith")


In [13]:
results

[Document(metadata={'source': '/home/heliya/Desktop/rag_approaches/src/rag_approaches/dataset/blog_post/blog.langchain.dev_integrating-chatgpt-with-google-drive-and-notion-data_.txt'}, page_content='URL: https://blog.langchain.dev/integrating-chatgpt-with-google-drive-and-notion-data/\nTitle: Tavrn x LangChain: Integrating Noah: ChatGPT with Google Drive and Notion data\n\nEditor\'s Note: This post was written in collaboration with the Tavrn team. They were able to build a new personal assistant app, Noah, that\'s highly personalized and highly context-aware using LangChain (with some interesting retrieval tactics) and LangSmith (for fine-tuning chains and prompts).\n\nChatGPT is already an indispensable tool for many in the workplace. Its impressive general purpose performance makes it extremely versatile to assist in workflows ranging from creative brainstorming to coding. In order to get the best outputs from ChatGPT, users are familiar with the process of prompting - providing the 