<a href="https://colab.research.google.com/github/hiwei93/rag-practice/blob/main/Ensemble_Retrievers_(Fusion_retrieval_or_hybrid_search).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fusion Retrieval with LangChain

A LangChain implementation of fusion retrieval (`Fusion retrieval`, also called hybrid search, `hybrid search`), built on LangChain's ensemble retriever, `Ensemble Retriever`.

Based on the official LangChain documentation for the [Ensemble Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble).



## Install dependencies

In [1]:
!pip install langchain rank_bm25 faiss-cpu --quiet


## Build the BM25 retriever

BM25 is the standard keyword-based ranking method in search.

In [2]:
from langchain.retrievers import BM25Retriever

doc_list_1 = [
    "I like apples",
    "I like oranges",
    "Apples and oranges are fruits",
]

# Initialize the BM25 retriever and return the top 2 matches per query
bm25_retriever = BM25Retriever.from_texts(
    doc_list_1, metadatas=[{"source": 1}] * len(doc_list_1)
)
bm25_retriever.k = 2

In [3]:
# Keyword matching fails on semantic search: "I like oranges" ranks above "I like apples"
bm25_retriever.invoke("Who like apples?")

[Document(page_content='I like oranges', metadata={'source': 1}),
 Document(page_content='I like apples', metadata={'source': 1})]
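
One plausible explanation for this ranking, sketched below, rests on an assumption: that the retriever tokenizes by plain whitespace splitting, with no lowercasing or punctuation stripping. Under that assumption the query token `apples?` never matches the document token `apples`, so only `like` is shared and the first two documents tie.

```python
# Assumption: tokens come from plain whitespace splitting, with no
# lowercasing and no punctuation stripping.
query_tokens = "Who like apples?".split()
docs = ["I like apples", "I like oranges", "Apples and oranges are fruits"]

# For each document, collect the tokens it shares with the query.
overlaps = {doc: sorted(set(query_tokens) & set(doc.split())) for doc in docs}
for doc, shared in overlaps.items():
    print(f"{doc!r}: shared tokens = {shared}")
```

If this assumption holds, passing a custom preprocessing function that lowercases and strips punctuation would let `apples` match.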

## Build the vector-based retriever

In [4]:
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings
from langchain_community.vectorstores import FAISS

In [6]:
from google.colab import userdata

# Read the Hugging Face token from Colab secrets
inference_api_key = userdata.get('hf_token')

In [7]:
doc_list_2 = [
    "You like apples",
    "You like oranges",
]


embedding = HuggingFaceInferenceAPIEmbeddings(
    api_key=inference_api_key, model_name="sentence-transformers/all-MiniLM-L6-v2"
)

faiss_vectorstore = FAISS.from_texts(
    doc_list_2, embedding, metadatas=[{"source": 2}] * len(doc_list_2)
)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

In [8]:
faiss_vectorstore.max_marginal_relevance_search("Who like apples?")

[Document(page_content='You like apples', metadata={'source': 2}),
 Document(page_content='You like oranges', metadata={'source': 2})]

## Build the ensemble retriever

The ensemble retriever, `Ensemble Retriever`, is LangChain's implementation of fusion retrieval, `Fusion retrieval`.

In [9]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever
# Initialize the ensemble retriever; weights set each retriever's share in the fused ranking
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.6, 0.4]
)

In [10]:
docs = ensemble_retriever.invoke("Who like apples?")
docs

[Document(page_content='I like oranges', metadata={'source': 1}),
 Document(page_content='I like apples', metadata={'source': 1}),
 Document(page_content='You like apples', metadata={'source': 2}),
 Document(page_content='You like oranges', metadata={'source': 2})]
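
The order of the fused results can be reproduced with a small sketch of weighted Reciprocal Rank Fusion, the technique the LangChain documentation names for the ensemble retriever. The function name `weighted_rrf` and the constant `c=60` are assumptions made for illustration; this is not the library's actual code.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document earns weight / (rank + c) per list."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first; ties keep their first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


# Rankings copied from the two retriever outputs shown in this notebook.
bm25_ranking = ["I like oranges", "I like apples"]
faiss_ranking = ["You like apples", "You like oranges"]

fused = weighted_rrf([bm25_ranking, faiss_ranking], weights=[0.6, 0.4])
print(fused)
```

Because the two retrievers index different documents here, no document appears in both lists, so the 0.6 weight simply places both BM25 results ahead of both FAISS results.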