# Search reranking with cross-encoders

This notebook walks through an example of using a cross-encoder to rerank search results.

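This is a common situation: you have implemented semantic search using embeddings produced by a [bi-encoder](https://www.sbert.net/examples/applications/retrieve_rerank/README.html#retrieval-bi-encoder), but the results are not as accurate as your use case requires. One possible cause is that there is a business rule you could use to rerank the documents, such as how recent or how popular a document is.

Often, however, there are subtle domain-specific rules that help determine relevance, and this is where a cross-encoder can be useful. Cross-encoders are more accurate than bi-encoders but do not scale well, so the ideal use case is re-ordering the shortlist returned by semantic search.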

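### Example

Consider a search task with D documents and Q queries.

The brute-force approach of computing every pairwise relevance is expensive; its cost scales as ``D * Q``. This is known as **cross-encoding**.

A faster approach is **embeddings-based search**, in which an embedding is computed once for each document and query and then reused many times to compute pairwise relevance cheaply. Because embeddings are only computed once, the cost scales as ``D + Q``. This is known as **bi-encoding**.

Although embeddings-based search is faster, its quality can be worse. To get the best of both, a common approach is to use embeddings (or another bi-encoder) to cheaply identify the top candidates, and then use GPT (or another cross-encoder) to rerank those candidates. The cost of this hybrid approach scales as ``(D + Q) * cost of embedding + (N * Q) * cost of reranking``, where ``N`` is the number of candidates reranked.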
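
To make the scaling concrete, the short sketch below counts model calls for each strategy. The function names, the corpus sizes and the relative per-call costs are illustrative assumptions rather than measured figures.

```python
def cross_encoder_calls(num_docs: int, num_queries: int) -> int:
    """Brute force: one relevance call per (document, query) pair."""
    return num_docs * num_queries


def bi_encoder_calls(num_docs: int, num_queries: int) -> int:
    """Embeddings-based search: one embedding call per document and per query."""
    return num_docs + num_queries


def hybrid_cost(num_docs, num_queries, num_candidates, embed_cost=1.0, rerank_cost=1.0):
    """(D + Q) * cost of embedding + (N * Q) * cost of reranking."""
    embedding = (num_docs + num_queries) * embed_cost
    reranking = (num_candidates * num_queries) * rerank_cost
    return embedding + reranking


# Hypothetical workload: 10,000 documents, 100 queries, 20 candidates reranked per query
D, Q, N = 10_000, 100, 20
print(cross_encoder_calls(D, Q))  # 1000000
print(bi_encoder_calls(D, Q))  # 10100
# Assume (for illustration only) that a rerank call costs 50x an embedding call
print(hybrid_cost(D, Q, N, embed_cost=1.0, rerank_cost=50.0))  # 110100.0
```

Even with a rerank call assumed to be far more expensive than an embedding call, the hybrid cost stays well below the brute-force cost because only ``N`` candidates per query reach the cross-encoder.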

### Walkthrough

To illustrate this approach, we will build a GPT-powered cross-encoder using ``text-davinci-003`` with ``logprobs`` enabled.

This notebook draws on an excellent [article](https://weaviate.io/blog/cross-encoders-as-reranker) by Weaviate and on the [excellent explanation](https://www.sbert.net/examples/applications/cross-encoder/README.html) of bi-encoders versus cross-encoders from Sentence Transformers.

In [None]:
!pip install openai
!pip install arxiv
!pip install tenacity
!pip install pandas
!pip install tiktoken

In [1]:
import arxiv
from math import exp
import openai
import pandas as pd
from tenacity import retry, wait_random_exponential, stop_after_attempt
import tiktoken

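## Search

This example uses the arXiv search service, but this step could be performed by any search service you have. The key point is to over-fetch slightly so that every potentially relevant document is captured, and then re-sort the results.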


In [2]:
query = "how do bi-encoders work for sentence embeddings"
search = arxiv.Search(
    query=query, max_results=20, sort_by=arxiv.SortCriterion.Relevance
)

In [3]:
result_list = []

for result in search.results():
    # Take the first two links provided: the article page and the PDF
    result_dict = {
        "title": result.title,
        "summary": result.summary,
        "article_url": result.links[0].href,
        "pdf_url": result.links[1].href,
    }
    result_list.append(result_dict)

In [4]:
result_list[0]

{'title': 'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features',
 'summary': 'Models based on large-pretrained language models, such as S(entence)BERT,\nprovide effective and efficient sentence embeddings that show high correlation\nto human similarity ratings, but lack interpretability. On the other hand,\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\nRepresentation, AMR) can make explicit the semantic aspects in which two\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\nand do not reach state-of-the-art performance when rating sentence similarity.\n  In this work, we aim at the best of both worlds, by learning to induce\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\nvarious semantic sentence features (e.g., semantic roles, negation, or\nquantification). We show 

In [5]:
for i, result in enumerate(result_list):
    print(f"{i + 1}: {result['title']}")

1: SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
2: Are Classes Clusters?
3: Semantic Composition in Visually Grounded Language Models
4: Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions
5: Learning Probabilistic Sentence Representations from Paraphrases
6: Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings
7: How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation
8: Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences
9: Vec2Sent: Probing Sentence Embeddings with Natural Language Generation
10: Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings
11: SentPWNet: A Unified Sentence Pair Weighting Network for Task-specific Sentence Embedding
12: Learning Joint Representations of Videos and Sentences with Web Image Search

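## Cross-encoder

The key factors to consider here are:
- Make your examples domain-specific: the strength of a cross-encoder shows when you tailor the few-shot examples to your domain.
- There is a trade-off between how many candidates you can rerank and processing speed. Consider batching and parallel processing to handle cross-encoder requests faster.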
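
As one way to approach the parallel-processing point above, the sketch below scores documents concurrently with a thread pool, which suits network-bound API calls. ``score_in_parallel`` and ``toy_score`` are hypothetical names introduced for illustration; ``toy_score`` is a stand-in so the example runs without an API key, and in this notebook a relevance function such as ``document_relevance`` would take its place.

```python
from concurrent.futures import ThreadPoolExecutor


def score_in_parallel(query, documents, score_fn, max_workers=4):
    """Score each document against the query concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda doc: score_fn(query, doc), documents))


def toy_score(query, document):
    """Hypothetical stand-in scorer: fraction of query words found in the document."""
    words = query.lower().split()
    return sum(word in document.lower() for word in words) / len(words)


docs = ["Sentence embeddings with bi-encoders", "Galaxies in the reionisation era"]
print(score_in_parallel("sentence embeddings", docs, toy_score))  # [1.0, 0.0]
```

Keep ``max_workers`` modest when calling a rate-limited API; the appropriate value depends on your account limits and is an assumption here.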

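The steps are:
- Write a prompt that assesses relevance, and provide a few examples that can be tuned to your domain.
- Add a ``logit_bias`` for the ``Yes`` and ``No`` tokens to make any other token less likely to occur.
- Return the Yes/No classification together with its ``logprobs``.
- Rerank the results by the ``logprobs`` keyed on ``Yes``.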
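
The last step can be sketched in isolation as follows. The helper name ``yes_probability`` and the sample values are made up for illustration, and the conversion assumes that nearly all probability mass sits on the two answer tokens, so that P(Yes) is approximately 1 - P(No).

```python
from math import exp


def yes_probability(prediction: str, logprob: float) -> float:
    """Convert the predicted token's logprob into an approximate probability of Yes."""
    probability = exp(logprob)
    # If the model answered No, approximate P(Yes) as the complement
    return probability if prediction.strip() == "Yes" else 1.0 - probability


# Hypothetical (document, prediction, logprob) triples
scored = [("doc A", "Yes", -0.05), ("doc B", "No", -0.01), ("doc C", "No", -0.30)]
reranked = sorted(scored, key=lambda row: yes_probability(row[1], row[2]), reverse=True)
print([name for name, _, _ in reranked])  # ['doc A', 'doc C', 'doc B']
```

A confident No (logprob close to 0) lands at the bottom, while a hesitant No ranks above it.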

In [6]:
tokens = [" Yes", " No"]
tokenizer = tiktoken.encoding_for_model("text-davinci-003")
ids = [tokenizer.encode(token) for token in tokens]
ids[0], ids[1]

([3363], [1400])

In [7]:
prompt = '''
You are an Assistant responsible for helping detect whether the retrieved document is relevant to the query. For a given input, you need to output a single token: "Yes" or "No" indicating the retrieved document is relevant to the query.

Query: How to plant a tree?
Document: """Cars were invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.[3][4][5] Cars became widely available during the 20th century. One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced horse-drawn carriages.[6] In Europe and other parts of the world, demand for automobiles did not increase until after World War II.[7] The car is considered an essential part of the developed economy."""
Relevant: No

Query: Has the coronavirus vaccine been approved?
Document: """The Pfizer-BioNTech COVID-19 vaccine was approved for emergency use in the United States on December 11, 2020."""
Relevant: Yes

Query: What is the capital of France?
Document: """Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture. Its 19th-century cityscape is crisscrossed by wide boulevards and the River Seine. Beyond such landmarks as the Eiffel Tower and the 12th-century, Gothic Notre-Dame cathedral, the city is known for its cafe culture and designer boutiques along the Rue du Faubourg Saint-Honoré."""
Relevant: Yes

Query: What are some papers to learn about PPO reinforcement learning?
Document: """Proximal Policy Optimization and its Dynamic Version for Sequence Generation: In sequence generation task, many works use policy gradient for model optimization to tackle the intractable backpropagation issue when maximizing the non-differentiable evaluation metrics or fooling the discriminator in adversarial learning. In this paper, we replace policy gradient with proximal policy optimization (PPO), which is a proved more efficient reinforcement learning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We demonstrate the efficacy of PPO and PPO-dynamic on conditional sequence generation tasks including synthetic experiment and chit-chat chatbot. The results show that PPO and PPO-dynamic can beat policy gradient by stability and performance."""
Relevant: Yes

Query: Explain sentence embeddings
Document: """Inside the bubble: exploring the environments of reionisation-era Lyman-α emitting galaxies with JADES and FRESCO: We present a study of the environments of 16 Lyman-α emitting galaxies (LAEs) in the reionisation era (5.8<z<8) identified by JWST/NIRSpec as part of the JWST Advanced Deep Extragalactic Survey (JADES). Unless situated in sufficiently (re)ionised regions, Lyman-α emission from these galaxies would be strongly absorbed by neutral gas in the intergalactic medium (IGM). We conservatively estimate sizes of the ionised regions required to reconcile the relatively low Lyman-α velocity offsets (ΔvLyα<300kms−1) with moderately high Lyman-α escape fractions (fesc,Lyα>5%) observed in our sample of LAEs, indicating the presence of ionised ``bubbles'' with physical sizes of the order of 0.1pMpc≲Rion≲1pMpc in a patchy reionisation scenario where the bubbles are embedded in a fully neutral IGM. Around half of the LAEs in our sample are found to coincide with large-scale galaxy overdensities seen in FRESCO at z∼5.8-5.9 and z∼7.3, suggesting Lyman-α transmission is strongly enhanced in such overdense regions, and underlining the importance of LAEs as tracers of the first large-scale ionised bubbles. Considering only spectroscopically confirmed galaxies, we find our sample of UV-faint LAEs (MUV≳−20mag) and their direct neighbours are generally not able to produce the required ionised regions based on the Lyman-α transmission properties, suggesting lower-luminosity sources likely play an important role in carving out these bubbles. These observations demonstrate the combined power of JWST multi-object and slitless spectroscopy in acquiring a unique view of the early stages of Cosmic Reionisation via the most distant LAEs."""
Relevant: No

Query: {query}
Document: """{document}"""
Relevant:
'''


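@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def document_relevance(query, document):
    """Return (query, document, predicted token, logprob of that token)."""
    response = openai.Completion.create(
        model="text-davinci-003",
        # Fill the template with this call's query and document
        prompt=prompt.format(query=query, document=document),
        temperature=0,
        logprobs=1,
        # Bias towards the " Yes" (3363) and " No" (1400) token ids computed above
        logit_bias={3363: 1, 1400: 1},
        # A single Yes/No token is all we need
        max_tokens=1,
    )

    return (
        query,
        document,
        response["choices"][0]["text"],
        response["choices"][0]["logprobs"]["token_logprobs"][0],
    )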

In [8]:
content = result_list[0]["title"] + ": " + result_list[0]["summary"]

# Set logprobs to 1 so our response will include the most probable token the model identified
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt.format(query=query, document=content),
    temperature=0,
    logprobs=1,
    logit_bias={3363: 1, 1400: 1},
    max_tokens=1,
)

In [9]:
result = response["choices"][0]
print(f"Result was {result['text']}")
print(f"Logprobs was {result['logprobs']['token_logprobs'][0]}")
print("\nBelow is the full logprobs object\n\n")
print(result["logprobs"])

Result was Yes
Logprobs was -0.05869877

Below is the full logprobs object


{
  "tokens": [
    "Yes"
  ],
  "token_logprobs": [
    -0.05869877
  ],
  "top_logprobs": [
    {
      "Yes": -0.05869877
    }
  ],
  "text_offset": [
    5764
  ]
}


In [10]:
output_list = []
for x in result_list:
    content = x["title"] + ": " + x["summary"]

    try:
        output_list.append(document_relevance(query, document=content))

    except Exception as e:
        print(e)

In [11]:
output_list[:10]

[('how do bi-encoders work for sentence embeddings',
  'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features: Models based on large-pretrained language models, such as S(entence)BERT,\nprovide effective and efficient sentence embeddings that show high correlation\nto human similarity ratings, but lack interpretability. On the other hand,\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\nRepresentation, AMR) can make explicit the semantic aspects in which two\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\nand do not reach state-of-the-art performance when rating sentence similarity.\n  In this work, we aim at the best of both worlds, by learning to induce\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\nvarious semantic sentence features (e.g., semantic roles, negation

In [12]:
output_df = pd.DataFrame(
    output_list, columns=["query", "document", "prediction", "logprobs"]
).reset_index()
# Use exp() to convert logprobs into probability
output_df["probability"] = output_df["logprobs"].apply(exp)
# Express every row as the probability of "Yes"; for a "No" prediction this is 1 - P(No)
output_df["yes_probability"] = output_df.apply(
    lambda x: 1 - x["probability"]
    if x["prediction"].strip() == "No"
    else x["probability"],
    axis=1,
)
output_df.head()

| | index | query | document | prediction | logprobs | probability | yes_probability |
|---|---|---|---|---|---|---|---|
| 0 | 0 | how do bi-encoders work for sentence embeddings | SBERT studies Meaning Representations: Decompo... | Yes | -0.053264 | 0.94813 | 0.94813 |
| 1 | 1 | how do bi-encoders work for sentence embeddings | Are Classes Clusters?: Sentence embedding mode... | No | -0.009535 | 0.99051 | 0.00949 |
| 2 | 2 | how do bi-encoders work for sentence embeddings | Semantic Composition in Visually Grounded Lang... | No | -0.008887 | 0.991152 | 0.008848 |
| 3 | 3 | how do bi-encoders work for sentence embeddings | Evaluating the Construct Validity of Text Embe... | No | -0.008584 | 0.991453 | 0.008547 |
| 4 | 4 | how do bi-encoders work for sentence embeddings | Learning Probabilistic Sentence Representation... | No | -0.011976 | 0.988096 | 0.011904 |


In [13]:
# Return reranked results
reranked_df = output_df.sort_values(
    by=["yes_probability"], ascending=False
).reset_index()
reranked_df.head(10)

| | level_0 | index | query | document | prediction | logprobs | probability | yes_probability |
|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 16 | how do bi-encoders work for sentence embeddings | In Search for Linear Relations in Sentence Emb... | Yes | -0.004824 | 0.995187 | 0.995187 |
| 1 | 8 | 8 | how do bi-encoders work for sentence embeddings | Vec2Sent: Probing Sentence Embeddings with Nat... | Yes | -0.004863 | 0.995149 | 0.995149 |
| 2 | 19 | 19 | how do bi-encoders work for sentence embeddings | Relational Sentence Embedding for Flexible Sem... | Yes | -0.038814 | 0.96193 | 0.96193 |
| 3 | 0 | 0 | how do bi-encoders work for sentence embeddings | SBERT studies Meaning Representations: Decompo... | Yes | -0.053264 | 0.94813 | 0.94813 |
| 4 | 15 | 15 | how do bi-encoders work for sentence embeddings | Sentence-T5: Scalable Sentence Encoders from P... | No | -0.291893 | 0.746849 | 0.253151 |
| 5 | 6 | 6 | how do bi-encoders work for sentence embeddings | How to Probe Sentence Embeddings in Low-Resour... | No | -0.015551 | 0.98457 | 0.01543 |
| 6 | 18 | 18 | how do bi-encoders work for sentence embeddings | Efficient and Flexible Topic Modeling using Pr... | No | -0.015296 | 0.98482 | 0.01518 |
| 7 | 9 | 9 | how do bi-encoders work for sentence embeddings | Non-Linguistic Supervision for Contrastive Lea... | No | -0.013869 | 0.986227 | 0.013773 |
| 8 | 12 | 12 | how do bi-encoders work for sentence embeddings | Character-based Neural Networks for Sentence P... | No | -0.012866 | 0.987216 | 0.012784 |
| 9 | 7 | 7 | how do bi-encoders work for sentence embeddings | Clustering and Network Analysis for the Embedd... | No | -0.012663 | 0.987417 | 0.012583 |


In [14]:
# Inspect our new top document following reranking
reranked_df["document"][0]

'In Search for Linear Relations in Sentence Embedding Spaces: We present an introductory investigation into continuous-space vector\nrepresentations of sentences. We acquire pairs of very similar sentences\ndiffering only by a small alterations (such as change of a noun, adding an\nadjective, noun or punctuation) from datasets for natural language inference\nusing a simple pattern method. We look into how such a small change within the\nsentence text affects its representation in the continuous space and how such\nalterations are reflected by some of the popular sentence embedding models. We\nfound that vector differences of some embeddings actually reflect small changes\nwithin a sentence.'

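## Conclusion

We have shown how to create a tailored cross-encoder to rerank academic papers. This approach works best where there are domain-specific nuances that can be used to pick the most relevant corpus for your users, and where some pre-filtering has taken place to limit the amount of data the cross-encoder needs to process.

A few typical use cases are:
- Returning a list of the 100 most relevant stock reports, then re-ordering them into a top 5 or 10 based on the detailed context of a particular set of customer portfolios.
- Running a classic rules-based search that retrieves the top 100 or 1000 most relevant results, then pruning them according to a specific user's context.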


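### Taking this forward

The few-shot approach used here can work well when the domain is general enough that a small number of examples covers most reranking cases. As the differences between documents become more specific, however, you may want to consider the ``Fine-tuning`` endpoint to build a more elaborate cross-encoder from a wider variety of examples.

There is also a latency impact to consider when using ``text-davinci-003``: even the few examples above took a couple of seconds each. Here again the ``Fine-tuning`` endpoint may help, if you can get decent results from a fine-tuned ``ada`` or ``babbage`` model.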

We used OpenAI's ``Completions`` endpoint to build our cross-encoder, but this area is well served by the open-source community. [Here](https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) is an example from HuggingFace.
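
As a rough sketch of how such an open-source model could slot into the same reranking step, the code below assumes the ``sentence-transformers`` package and its ``CrossEncoder`` class. The helper names ``rerank_with_scores`` and ``load_cross_encoder`` are hypothetical, and the model has not been evaluated on this notebook's data.

```python
def rerank_with_scores(query, documents, score_pairs):
    """Order documents by descending score.

    `score_pairs` is any callable that maps a list of (query, document)
    pairs to one relevance score per pair.
    """
    pairs = [(query, doc) for doc in documents]
    scores = score_pairs(pairs)
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked]


def load_cross_encoder(model_name="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"):
    """Return a scoring callable backed by a Sentence Transformers cross-encoder.

    Imported lazily: requires `pip install sentence-transformers` and downloads
    the model weights on first use.
    """
    from sentence_transformers import CrossEncoder

    return CrossEncoder(model_name).predict


# Example usage with the documents retrieved earlier in this notebook:
# scorer = load_cross_encoder()
# documents = [r["title"] + ": " + r["summary"] for r in result_list]
# rerank_with_scores(query, documents, scorer)[:5]
```

Passing the scorer in as a callable keeps the ranking logic independent of the model, so the same function could also be driven by the GPT-based scores computed above.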

We hope you find this useful for tuning your search use cases, and we look forward to seeing what you build.