# JinaReranker
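## Overview
Jina Reranker is a document re-ranking and compression tool that reorders retrieved documents or results so that the most relevant items come first. It is primarily used in information retrieval and natural language processing (NLP) tasks, and is designed to extract key information from large datasets faster and more accurately.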

---
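**Key Features**
- Relevance-based re-ranking
 Jina Reranker analyzes search results and reorders documents by relevance score, so users see the most relevant information first.
- Multilingual support
 Jina Reranker offers multilingual models such as ```jina-reranker-v2-base-multilingual``` that can process data in many languages.
- Document compression
 It keeps only the top N most relevant documents (```top_n```), compressing the search results to reduce noise and improve performance.
- Integration with LangChain
 Jina Reranker integrates with workflow tools such as LangChain, making it easy to plug into NLP pipelines.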

---
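**How It Works**
- Document retrieval
 A base retriever fetches the initial search results.
- Relevance score computation
 Jina Reranker uses a pretrained model (e.g., ```jina-reranker-v2-base-multilingual```) to compute a relevance score for each document.
- Document re-ranking and compression
 Based on the relevance scores, it selects the top N documents and returns them in the new order.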
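
The three steps above can be sketched without calling any external service. This is only an illustration under stated assumptions: `toy_relevance` is a hypothetical word-overlap score standing in for the pretrained reranker model, and `rerank` is a made-up helper name, not part of Jina or LangChain.

```python
import re


def toy_relevance(query: str, text: str) -> float:
    """Hypothetical stand-in for a reranker model: fraction of query words found in the text."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[tuple[float, str]]:
    """Score every candidate, sort by score (highest first), and keep the top_n."""
    scored = [(toy_relevance(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Step 1: pretend these came back from a base retriever, in retrieval order.
candidates = [
    "A VectorStore is a system designed to store vectors.",
    "Embedding converts words into vectors.",
    "Word2Vec maps words to a vector space.",
    "Tokenization splits text into tokens.",
]

# Steps 2 and 3: score, reorder, and keep only the two best candidates.
top_docs = rerank("Word2Vec maps words", candidates, top_n=2)
for score, text in top_docs:
    print(f"{score:.2f}  {text}")
```

A real reranker replaces `toy_relevance` with a model that reads the query and the document together, but the surrounding score, sort, and truncate logic has the same shape.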

### Table of Contents
- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Jina Reranker](#jina-reranker)
- [Performing Re-ranking with JinaRerank](#performing-re-ranking-with-jinarerank)

### References
- [LangChain Documentation](https://python.langchain.com/docs/how_to/lcel_cheatsheet/)
- [Jina Reranker](https://jina.ai/reranker/)

---

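## My Insights

Jina Reranker offers an enterprise-grade re-ranking solution. Its multilingual support and LangChain integration stand out, which makes it a good fit for internationalized applications.

## Supplementary Learning Notes

**Technical highlights:**
- **Multilingual strength**: supports cross-lingual retrieval and ranking
- **Enterprise-grade stability**: a commercial service provided by Jina AI
- **Efficient compression**: keeps only the most relevant documents to reduce noise
- **API integration**: a cloud service that is easy to deploy and scale

**Compared with other rerankers:**
- **vs. CrossEncoder**: easier to use; no local model deployment required
- **vs. open-source options**: often better multilingual support and stability, depending on the model being compared
- **vs. self-built solutions**: lower maintenance cost, with optimization handled by the provider

**Use cases:**
- **Internationalized applications**: search systems that must handle multiple languages
- **Enterprise knowledge bases**: precise retrieval over large document collections
- **E-commerce platforms**: multilingual product search and recommendation
- **Customer support systems**: cross-lingual question answering

**Deployment considerations:**
- **API limits**: watch request-rate and payload-size limits
- **Cost control**: tune usage to keep costs in check
- **Latency management**: network requests can add to response time
- **Fallback plan**: consider a local reranker as a degraded-mode option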
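
The fallback idea in the last bullet can be sketched as a small wrapper. Everything here is hypothetical and for illustration only: `rerank_with_fallback`, `flaky_remote_reranker`, and `keep_original_order` are made-up names, and the "remote" reranker is simulated so the snippet runs without an API key.

```python
from typing import Callable, Sequence

# A reranker here is any callable taking (query, documents) and returning documents.
Reranker = Callable[[str, Sequence[str]], list]


def rerank_with_fallback(
    query: str, documents: Sequence[str], primary: Reranker, fallback: Reranker
) -> list:
    """Try the primary (remote) reranker; on any error, degrade to the fallback."""
    try:
        return primary(query, documents)
    except Exception as exc:  # e.g. timeout, rate limit, network failure
        print(f"Primary reranker failed ({exc!r}); using fallback.")
        return fallback(query, documents)


def flaky_remote_reranker(query: str, documents: Sequence[str]) -> list:
    """Hypothetical remote call that always fails, to simulate an outage."""
    raise TimeoutError("simulated API timeout")


def keep_original_order(query: str, documents: Sequence[str], top_n: int = 2) -> list:
    """Hypothetical local fallback: keep the retriever's order and just truncate."""
    return list(documents)[:top_n]


result = rerank_with_fallback(
    "Explain Word2Vec.",
    ["doc A", "doc B", "doc C"],
    primary=flaky_remote_reranker,
    fallback=keep_original_order,
)
print(result)
```

In practice the fallback could be a locally hosted cross-encoder rather than simple truncation; the point of the sketch is only that the caller still gets a usable result when the remote call fails.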

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- ```langchain-opentutorial``` is a package that provides easy-to-use environment setup, helper functions, and utilities for the tutorials.
- See the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

**Issuing an API Key for JinaReranker**
- Add the following to your ```.env``` file:
    >JINA_API_KEY="YOUR_JINA_API_KEY"

In [None]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [None]:
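# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_openai",
        # Needed by the cells below: TextLoader, FAISS, JinaRerank, text splitter, .env loading
        "langchain_community",
        "langchain_text_splitters",
        "faiss-cpu",
        "python-dotenv",
    ],
    verbose=False,
    upgrade=False,
)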

You can also load ```OPENAI_API_KEY``` and ```JINA_API_KEY``` from the ```.env``` file.

In [4]:
from dotenv import load_dotenv

load_dotenv(override=True)

True
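
After the `.env` file is loaded, it can help to confirm that the keys are actually visible to the process before any API call is made. The check below is a minimal sketch: `missing_env_vars` is a hypothetical helper, and `OPENAI_API_KEY` is assumed to be required because the retriever in this tutorial uses `OpenAIEmbeddings`.

```python
import os


def missing_env_vars(names: list[str]) -> list[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.environ.get(name)]


# Variable names assumed from this tutorial's setup.
missing = missing_env_vars(["JINA_API_KEY", "OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required API keys are set.")
```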

In [None]:
# Set local environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "03-JinaReranker",
    }
)

Environment variables have been set successfully.


## Jina Reranker

- Load data for a simple example and create a retriever.

In [1]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

- A text document is loaded into the system.

- The document is split into smaller chunks for better processing.

- ```FAISS``` is used with ```OpenAI embeddings``` to create a retriever.

- The retriever processes a query to find and display the most relevant documents.


In [5]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Load the document
documents = TextLoader("./data/appendix-keywords.txt").load()

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Split the document into chunks
texts = text_splitter.split_documents(documents)

# Initialize the retriever
retriever = FAISS.from_documents(texts, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 10}
)

# Define the query
query = "Tell me about Word2Vec."

# Retrieve relevant documents
docs = retriever.invoke(query)

# Print the retrieved documents
pretty_print_docs(docs)

Document 1:

Word2Vec
Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.
Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other.
Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
----------------------------------------------------------------------------------------------------
Document 2:

Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors that computers can process and understand.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing (NLP), Vectorization, Deep Learning
----------------------------------------------------------------------------------------------------
Document 3:

VectorStore
Definition: A VectorStore is a system designed to store data

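## Performing Re-ranking with JinaRerank
- Initialize a document compressor with JinaRerank so that the most relevant documents are prioritized.
- Compress the retrieved documents by keeping only the 3 most relevant ones (```top_n=3```).
- Build a ```ContextualCompressionRetriever``` from the JinaRerank compressor and the existing retriever.
- The system then processes the query, retrieving and compressing the relevant documents.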

---

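## My Insights

This implementation shows the core JinaRerank workflow: a short configuration is enough to get efficient document re-ranking and compression.

## Supplementary Learning Notes

**Implementation points:**
- **top_n parameter**: controls how many documents are kept in the end
- **Compression strategy**: combines re-ranking with a count limit
- **Seamless integration**: works with the existing retriever
- **Query processing**: relevance evaluation is automated

**Configuration suggestions:**
- **top_n=3**: suited to precise answers in question-answering systems
- **top_n=5-10**: suited to scenarios that need more context
- **Dynamic adjustment**: vary the number with query complexity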
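
One way to read "dynamic adjustment" is to derive `top_n` from the query itself. The heuristic below is purely an assumption for illustration (word count as a rough proxy for complexity); `choose_top_n` and its thresholds are hypothetical and would need tuning against real queries.

```python
def choose_top_n(query: str, low: int = 3, high: int = 10) -> int:
    """Hypothetical heuristic: longer queries get more context, clamped to [low, high]."""
    word_count = len(query.split())
    # One extra document for every full 5 words beyond the first 5.
    extra = max(0, word_count - 5) // 5
    return min(high, low + extra)


print(choose_top_n("Explain Word2Vec."))
```

The returned value could then be passed as `top_n` when constructing the compressor for that query.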

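**Performance optimization:**
- Less interference from irrelevant documents
- Lower downstream processing cost
- Higher quality of generated content
- Better user experience

**Metrics to monitor:**
- Re-ranking accuracy
- Change in response time
- Document compression ratio
- Final answer quality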
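
Two of these metrics, response time and compression ratio, can be captured with a thin wrapper around any reranking call. This is a sketch under assumptions: `measure_rerank` is a hypothetical helper, and the reranker passed in is a stand-in that keeps the first three documents. Accuracy and answer quality need labeled data or human review, so they are not covered here.

```python
import time
from typing import Callable, Sequence


def measure_rerank(
    rerank_fn: Callable[[str, Sequence[str]], list], query: str, docs: Sequence[str]
) -> dict:
    """Run rerank_fn once and report elapsed time and how much the result set shrank."""
    start = time.perf_counter()
    kept = rerank_fn(query, docs)
    elapsed = time.perf_counter() - start
    return {
        "latency_seconds": elapsed,
        "documents_in": len(docs),
        "documents_out": len(kept),
        # Fraction of retrieved documents that were dropped by compression.
        "compression_ratio": 1 - len(kept) / len(docs) if docs else 0.0,
    }


# Hypothetical reranker that simply keeps the first three documents.
stats = measure_rerank(
    lambda q, d: list(d)[:3], "Explain Word2Vec.", [f"doc {i}" for i in range(10)]
)
print(stats)
```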

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import JinaRerank

# Initialize the JinaRerank compressor
compressor = JinaRerank(model="jina-reranker-v2-base-multilingual", top_n=3)

# Initialize the document compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# Retrieve and compress relevant documents
compressed_docs = compression_retriever.invoke("Explain Word2Vec.")


In [None]:
# Display the compressed documents in a readable format
pretty_print_docs(compressed_docs)

Document 1:

Word2Vec
Definition: Word2Vec is a technique in NLP that maps words to a vector space, representing their semantic relationships based on context.
Example: In a Word2Vec model, "king" and "queen" are represented by vectors located close to each other.
Related Keywords: Natural Language Processing (NLP), Embedding, Semantic Similarity
----------------------------------------------------------------------------------------------------
Document 2:

Embedding
Definition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors that computers can process and understand.
Example: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].
Related Keywords: Natural Language Processing (NLP), Vectorization, Deep Learning
----------------------------------------------------------------------------------------------------
Document 3:

VectorStore
Definition: A VectorStore is a system designed to store data