# Question answering using a search API and re-ranking

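Searching for relevant information can feel like looking for a needle in a haystack, but there is no need to despair: GPT can do much of this work for us. In this guide we explore how to augment an existing search system with several AI techniques that help filter out the noise.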

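There are two ways to have GPT retrieve information:

1. **Mimicking human browsing:** [GPT triggers a search](https://openai.com/blog/chatgpt-plugins#browsing), evaluates the results, and modifies the search query if necessary. It can also follow up on specific search results to form a chain of thought, much like a human user would.
2. **Retrieval with embeddings:** Calculate [embeddings](https://platform.openai.com/docs/guides/embeddings) for your content and for the user query, then [retrieve the content](Question_answering_using_embeddings.ipynb) that is most relevant as measured by cosine similarity. This technique is [used heavily](https://blog.google/products/search/search-language-understanding-bert/) by search engines such as Google.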

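Both approaches are promising, but each has drawbacks. The first can be slow because of its iterative nature. The second requires embedding the entire knowledge base in advance, continuously embedding new content, and maintaining a vector database.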
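To make the second approach concrete, here is a minimal sketch of retrieval by cosine similarity. The three-dimensional vectors and the names `toy_docs`, `toy_query`, and `top_k` are made up for illustration; real embeddings come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2):
    """Return the k document ids most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings", chosen by hand for this example
toy_docs = {
    "basketball": [0.9, 0.1, 0.0],
    "motor_racing": [0.1, 0.9, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

for doc_id, score in top_k(toy_query, toy_docs):
    print(doc_id, round(score, 3))
```

Every document must already have a vector before any query can be answered, which is the maintenance cost mentioned above.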

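By combining these approaches and drawing inspiration from [re-ranking](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) methods, we arrive at an approach that sits in the middle. **It can be implemented on top of any existing search system, such as the Slack search API or an internal ElasticSearch instance holding private data**. Here is how it works: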

![search_augmented_by_query_generation_and_embeddings_reranking.png](../images/search_rerank_answer.png)

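**Step 1: Search**

1. The user asks a question.
2. GPT generates a list of potential queries.
3. The search queries are executed in parallel.

**Step 2: Re-rank**

1. Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user's question.
2. Results are ranked and filtered based on this similarity metric.

**Step 3: Answer**

1. Given the top search results, the model generates an answer to the user's question, including references and links.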

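This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint without requiring the upkeep of a vector database. Let's dive in! As an example, we will use the [News API](https://newsapi.org/) as the domain to search over.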

## Setup

In addition to your `OPENAI_API_KEY`, you need to include a `NEWS_API_KEY` in your environment. You can get an API key [here](https://newsapi.org/).

In [1]:
%%capture
%env NEWS_API_KEY = YOUR_NEWS_API_KEY


In [2]:
# Dependencies
from datetime import date, timedelta  # date handling for fetching recent news
from IPython import display  # for pretty printing
import json  # for parsing the JSON api responses and model outputs
from numpy import dot  # for cosine similarity
from openai import OpenAI
import os  # for loading environment variables
import requests  # for making the API requests
from tqdm.notebook import tqdm  # for printing progress bars

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# Load environment variables
news_api_key = os.getenv("NEWS_API_KEY")

GPT_MODEL = "gpt-3.5-turbo"


# Helper functions
def json_gpt(input: str):
    completion = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            {"role": "system", "content": "Output only valid JSON"},
            {"role": "user", "content": input},
        ],
        temperature=0.5,
    )

    text = completion.choices[0].message.content
    parsed = json.loads(text)

    return parsed


def embeddings(input: list[str]) -> list[list[float]]:
    response = client.embeddings.create(model="text-embedding-3-small", input=input)
    return [data.embedding for data in response.data]
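`json_gpt` passes the model's reply straight to `json.loads`, which raises an error if the reply is not pure JSON. One failure mode is a reply wrapped in a Markdown code fence. The helper below is a minimal sketch of a more tolerant parser; the name `parse_json_reply` is hypothetical and not part of the notebook or of any library.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker, built this way to keep the example readable


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json"
        first_newline = cleaned.find("\n")
        cleaned = cleaned[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present
        cleaned = cleaned.rstrip()
        if cleaned.endswith(FENCE):
            cleaned = cleaned[: -len(FENCE)]
    # json.loads still raises json.JSONDecodeError on anything that is not valid JSON
    return json.loads(cleaned)


print(parse_json_reply('{"queries": ["NBA finals winner"]}'))
```

Some chat models also accept a `response_format={"type": "json_object"}` argument that constrains the output to JSON; whether it is available depends on the model, so treat it as an option to check rather than a given.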

## 1. Search

It all starts with a user question.

In [3]:
# User asks a question
USER_QUESTION = "Who won the NBA championship? And who was the MVP? Tell me a bit about the last game."

Now, to be as exhaustive as possible, we use the model to generate a diverse list of queries based on this question.

In [4]:
QUERIES_INPUT = f"""
You have access to a search API that returns recent news articles.
Generate an array of search queries that are relevant to this question.
Use a variation of related keywords for the queries, trying to be as general as possible.
Include as many queries as you can think of, including and excluding terms.
For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2'].
Be creative. The more queries you include, the more likely you are to find relevant results.

User question: {USER_QUESTION}

Format: {{"queries": ["query_1", "query_2", "query_3"]}}
"""

queries = json_gpt(QUERIES_INPUT)["queries"]

# Let's include the original question as well for good measure
queries.append(USER_QUESTION)

queries

['NBA championship winner',
 'MVP of NBA championship',
 'Last game of NBA championship',
 'NBA finals winner',
 'Most valuable player of NBA championship',
 'Finals game of NBA',
 'Who won the NBA finals',
 'NBA championship game summary',
 'NBA finals MVP',
 'Champion of NBA playoffs',
 'NBA finals last game highlights',
 'NBA championship series result',
 'NBA finals game score',
 'NBA finals game recap',
 'NBA champion team and player',
 'NBA finals statistics',
 'NBA championship final score',
 'NBA finals best player',
 'NBA playoffs champion and MVP',
 'NBA finals game analysis',
 'Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.']

The queries look good, so let's run the searches.

In [5]:
def search_news(
    query: str,
    news_api_key: str = news_api_key,
    num_articles: int = 50,
    from_datetime: str = "2023-06-01",  # the 2023 NBA finals were played in June 2023
    to_datetime: str = "2023-06-30",
) -> dict:
    response = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": query,
            "apiKey": news_api_key,
            "pageSize": num_articles,
            "sortBy": "relevancy",
            "from": from_datetime,
            "to": to_datetime,
        },
    )

    return response.json()


articles = []

for query in tqdm(queries):
    result = search_news(query)
    if result["status"] == "ok":
        articles = articles + result["articles"]
    else:
        raise Exception(result["message"])

# remove duplicates
articles = list({article["url"]: article for article in articles}.values())

print("Total number of articles:", len(articles))
print("Top 5 articles of query 1:", "\n")

for article in articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print()


  0%|          | 0/21 [00:00<?, ?it/s]

Total number of articles: 554
Top 5 articles of query 1: 

Title: Nascar takes on Le Mans as LeBron James gets centenary race under way
Description: <ul><li>Nascar has presence at iconic race for first time since 1976</li><li>NBA superstar LeBron James waves flag as honorary starter</li></ul>The crowd chanted “U-S-A! U-S-A!” as Nascar driver lineup for the 24 Hours of Le Mans passed through the city cente…
Content: The crowd chanted U-S-A! U-S-A! as Nascar driver lineup for the 24 Hours of Le Mans passed through t...

Title: NBA finals predictions: Nuggets or Heat? Our writers share their picks
Description: Denver or Miami? Our contributors pick the winner, key players and dark horses before the NBA’s grand finale tips offA lot has been made of the importance of a balanced roster with continuity, but, somehow, still not enough. The Nuggets are the prime example …
Content: The Nuggets are here because 
A lot has been made of the importance of a balanced roster with conti...

Title: Unbo

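As we can see, search queries often return a large number of results, many of which are not relevant to the original question asked by the user. To improve the quality of the final answer, we use embeddings to re-rank and filter the results.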
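The loop above issues the queries one at a time, whereas step 1 of the overview describes running them in parallel. A minimal sketch of a parallel variant using a thread pool follows. `run_queries_in_parallel` and `fake_search` are hypothetical names: `fake_search` is an offline stand-in so the example runs without network access, and in the notebook you would pass `search_news` instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_queries_in_parallel(
    queries: list[str], search_fn: Callable[[str], dict], max_workers: int = 8
) -> list[dict]:
    """Call search_fn on every query concurrently; results keep the order of queries."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(search_fn, queries))


def fake_search(query: str) -> dict:
    # Hypothetical stand-in that mimics the shape of a successful search response
    url = "https://example.com/" + query.replace(" ", "-")
    return {"status": "ok", "articles": [{"url": url, "title": query}]}


results = run_queries_in_parallel(["NBA finals", "NBA MVP"], fake_search)

# Merge and deduplicate by URL, as in the sequential loop
merged = list(
    {article["url"]: article for result in results for article in result["articles"]}.values()
)
print(len(merged), "unique articles")
```

Threads suit this workload because each call spends most of its time waiting on the network. Keep `max_workers` modest if the search API enforces rate limits.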

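## 2. Re-rank

Drawing inspiration from [HyDE (Gao et al.)](https://arxiv.org/abs/2212.10496), we first generate a hypothetical ideal answer and then re-rank the results by comparing them against it. This prioritizes results that look like good answers rather than results that merely resemble the question. Here is the prompt we use to generate the hypothetical answer.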

In [6]:
HA_INPUT = f"""
Generate a hypothetical answer to the user's question. This answer will be used to rank search results. 
Pretend you have all the information you need to answer, but don't use any actual facts. Instead, use placeholders
like NAME did something, or NAME said something at PLACE. 

User question: {USER_QUESTION}

Format: {{"hypotheticalAnswer": "hypothetical answer text"}}
"""

hypothetical_answer = json_gpt(HA_INPUT)["hypotheticalAnswer"]

hypothetical_answer


'The NBA championship was won by TEAM NAME. The MVP was awarded to PLAYER NAME. The last game was held at STADIUM NAME, where both teams played with great energy and enthusiasm. It was a close game, but in the end, TEAM NAME emerged victorious.'

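Next, we generate embeddings for the search results and for the hypothetical answer. We then compute the cosine similarity between these embeddings, which gives us a measure of semantic similarity. Note that OpenAI embeddings are returned normalized by the API, so we can compute a plain dot product instead of the full cosine similarity calculation.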

In [7]:
hypothetical_answer_embedding = embeddings([hypothetical_answer])[0]
article_embeddings = embeddings(
    [
        f"{article['title']} {article['description']} {article['content'][0:100]}"
        for article in articles
    ]
)

# Calculate cosine similarity
cosine_similarities = []
for article_embedding in article_embeddings:
    cosine_similarities.append(dot(hypothetical_answer_embedding, article_embedding))

cosine_similarities[0:10]


[0.7854456526852069,
 0.8086023500072106,
 0.8002998147018501,
 0.7961229569526956,
 0.798354506673743,
 0.758216458795653,
 0.7753754083127359,
 0.7494958338411927,
 0.804733946801739,
 0.8405965885235218]
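The dot-product shortcut used above relies on the vectors having unit length. The sketch below checks that claim on small hand-made vectors using only the standard library; the helper names are illustrative and `dot_product` is unrelated to the `dot` imported from NumPy in the notebook.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    return math.sqrt(dot_product(a, a))


def normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot_product(a, b) / (norm(a) * norm(b))


raw_u, raw_v = [3.0, 4.0], [1.0, 2.0]
u, v = normalize(raw_u), normalize(raw_v)

# For unit vectors both norms are 1, so the denominator drops out
print(math.isclose(dot_product(u, v), cosine_similarity(raw_u, raw_v)))

# Without normalization the plain dot product is not a cosine similarity
print(math.isclose(dot_product(raw_u, raw_v), cosine_similarity(raw_u, raw_v)))
```

If you swap in an embedding source that does not return unit-length vectors, normalize them first or use the full cosine formula.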

Finally, we use these similarity scores to sort and filter the results.

In [8]:
scored_articles = zip(articles, cosine_similarities)

# Sort articles by cosine similarity
sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)

# Print top 5 articles
print("Top 5 articles:", "\n")

for article, score in sorted_articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print("Score:", score)
    print()


Top 5 articles: 

Title: NBA Finals: Denver Nuggets beat Miami Hea, lift thier first-ever NBA title
Description: Denver Nuggets won their maiden NBA Championship trophy defeating Miami Heat 94-89 in Game 5 of the NBA Final held on Tuesday at the Ball Arena in Denver
Content: Denver Nuggets won their maiden NBA Championship trophy defeating Miami Heat 94-89 in Game 5 of the ...
Score: 0.8445817523602124

Title: Photos: Denver Nuggets celebrate their first NBA title
Description: The Nuggets capped off an impressive postseason by beating the Miami Heat in the NBA Finals.
Content: Thousands of supporters watched along the streets of Denver, Colorado as the US National Basketball ...
Score: 0.842070667753606

Title: Denver Nuggets win first NBA championship title in Game 5 victory over Miami Heat
Description: The Denver Nuggets won their first NBA championship Monday night, downing the Miami Heat 94-89 at Ball Arena in Denver to take Game 5 of the NBA Finals.
Content: The Denver Nuggets won

Awesome! These results look much more relevant to our original query. Now let's use the top 5 results to generate a final answer.
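The notebook filters simply by keeping the five highest-scoring articles. If you also want to drop weak matches when fewer than five good results exist, one option is a minimum score on top of the top-k cut. The sketch below assumes a hypothetical helper `filter_ranked` and a hypothetical threshold of 0.80; the scores are rounded from the outputs above and the titles are shortened labels. A sensible threshold depends on the embedding model and the data.

```python
def filter_ranked(
    scored_items: list[tuple[str, float]], min_score: float, max_results: int = 5
) -> list[tuple[str, float]]:
    """Sort by score (best first), drop items below min_score, keep at most max_results."""
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    return [(item, score) for item, score in ranked if score >= min_score][:max_results]


# Shortened labels with scores rounded from the notebook outputs
toy_scored = [
    ("Nascar takes on Le Mans", 0.785),
    ("Nuggets lift first-ever NBA title", 0.845),
    ("NBA finals predictions", 0.809),
]

for title, score in filter_ranked(toy_scored, min_score=0.80):
    print(f"{score:.3f}  {title}")
```

Note how close the scores are to each other: relevant and irrelevant articles differ by only a few hundredths here, so a fixed threshold should be tuned on real data rather than guessed.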

## 3. Answer

In [9]:
formatted_top_results = [
    {
        "title": article["title"],
        "description": article["description"],
        "url": article["url"],
    }
    for article, _score in sorted_articles[0:5]
]

ANSWER_INPUT = f"""
Generate an answer to the user's question based on the given search results. 
TOP_RESULTS: {formatted_top_results}
USER_QUESTION: {USER_QUESTION}

Include as much information as possible in the answer. Reference the relevant search result urls as markdown links.
"""

completion = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[{"role": "user", "content": ANSWER_INPUT}],
    temperature=0.5,
    stream=True,
)

text = ""
for chunk in completion:
    # delta.content is None on chunks that carry no text (such as the final one)
    text += chunk.choices[0].delta.content or ""
    display.clear_output(wait=True)
    display.display(display.Markdown(text))

The Denver Nuggets won their first-ever NBA championship by defeating the Miami Heat 94-89 in Game 5 of the NBA Finals held on Tuesday at the Ball Arena in Denver, according to this [Business Standard article](https://www.business-standard.com/sports/other-sports-news/nba-finals-denver-nuggets-beat-miami-hea-lift-thier-first-ever-nba-title-123061300285_1.html). Nikola Jokic, the Nuggets' center, was named the NBA Finals MVP. In a rock-fight of a Game 5, the Nuggets reached the NBA mountaintop, securing their franchise's first NBA championship and setting Nikola Jokic's legacy as an all-timer in stone, according to this [Yahoo Sports article](https://sports.yahoo.com/nba-finals-nikola-jokic-denver-nuggets-survive-miami-heat-to-secure-franchises-first-nba-championship-030321214.html). For more information and photos of the Nuggets' celebration, check out this [Al Jazeera article](https://www.aljazeera.com/gallery/2023/6/15/photos-denver-nuggets-celebrate-their-first-nba-title) and this [CNN article](https://www.cnn.com/2023/06/12/sport/denver-nuggets-nba-championship-spt-intl?cid=external-feeds_iluminar_yahoo).