# Document search with embeddings

## Overview
This example demonstrates how to use the QIANFAN API and ERNIE LLMs to create embeddings and generate an answer for an input query. It is adapted from [Talk to documents](https://github.com/google-gemini/cookbook/blob/main/examples/Talk_to_documents_with_embeddings.ipynb).

## Setup
To run this example, you need a QIANFAN API key, which you can obtain from [QIANFAN](https://console.bce.baidu.com/qianfan/overview). The API key is sensitive and should be kept secret. Set it as an environment variable or store it in a `.env` file in your local folder, for example `QIANFAN_API_KEY=your_api_key_here`. This example uses the [python-dotenv](https://pypi.org/project/python-dotenv/) package to load the environment variables from the `.env` file.

In [1]:
from dotenv import load_dotenv
import os

load_dotenv()
qianfan_api_key = os.getenv("QIANFAN_API_KEY")
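
`os.getenv` returns `None` when the variable is missing, and the problem would then only surface later as an authentication error. A minimal sketch of failing fast instead is shown below; `require_env` is a hypothetical helper written for this example, not part of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it or add it to your .env file"
        )
    return value


# Usage, after load_dotenv() has run:
# qianfan_api_key = require_env("QIANFAN_API_KEY")
```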

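If your environment routes traffic through an HTTP proxy that cannot reach the QIANFAN endpoint, clear the proxy variables for this session.

In [2]:
# pop() with a default avoids a KeyError when a variable is not set
os.environ.pop("http_proxy", None)
os.environ.pop("https_proxy", None)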

## Embedding generation

In this section, you will see how to generate an embedding for a piece of text with the QIANFAN API.

An embedding is a vector representation of text, meaning it's an array of floating-point numbers. This representation is useful for various natural language processing tasks. In this example, we'll use embeddings to calculate the similarity between a query and saved documents.

See <https://cloud.baidu.com/doc/qianfan-docs/s/Um8r1tpwy> for the embedding models that QIANFAN supports. Since the QIANFAN API is OpenAI-compatible, you can also refer to <https://platform.openai.com/docs/guides/embeddings> for more details on how to use embeddings.

In [3]:
from openai import OpenAI

embedding_model = "embedding-v1"
client = OpenAI(base_url="https://qianfan.baidubce.com/v2", api_key=qianfan_api_key)

In [4]:
response = client.embeddings.create(input=["hello, world!"], model=embedding_model)
print(response.data[0].embedding)

[0.06902186572551727, -0.020647214725613594, -0.09811518341302872, 0.08861803263425827, -0.05155585706233978, -0.1355103999376297, 0.027453158050775528, -0.02333379164338112, -0.02600487507879734, -0.01652243174612522, -0.0017610457725822926, -0.08661286532878876, 0.060988616198301315, -0.001437491038814187, -0.07762893289327621, 0.01092784944921732, 0.0037263911217451096, -0.008090104907751083, -0.004760784097015858, -0.03508036211133003, 0.025809502229094505, -0.05357282981276512, 0.0033178087323904037, -0.04031052812933922, 0.034276120364665985, 0.006712841335684061, 0.050705697387456894, 0.08019276708364487, 0.028069473803043365, 0.012824407778680325, 0.00511750765144825, 0.06826956570148468, 0.08066719770431519, -0.08300089091062546, -0.04138464480638504, 0.0629473477602005, -0.02266598306596279, 0.0010189099702984095, -0.026319226250052452, 0.0868440568447113, 0.027562227100133896, -0.002775319153442979, 0.028266888111829758, -0.01826317235827446, 0.08004681766033173, -0.12195521

## Building an embeddings database

Suppose there is a [document about coffee](coffee.txt). You want to build an embeddings database for this document so that you can later search for relevant information based on a query. In real applications, you would typically have a collection of documents, and you would build a sophisticated pipeline to process and store them, using a dedicated vector database. For simplicity, we will use a single document in this example, and use a pandas DataFrame to store the embeddings.

In [5]:
import pandas as pd

with open("coffee.txt", "r", encoding="utf-8") as f:
    coffee_text_from_file = f.read()

# Treat each non-empty line of the file as one passage.
texts = [p.strip() for p in coffee_text_from_file.split("\n") if p.strip()]

df = pd.DataFrame(texts, columns=["Text"])
df["Length"] = df["Text"].str.len()  # passage length in characters

df[:10]

| | Text | Length |
|---|---|---|
| 0 | 咖啡（英语：coffee）是指咖啡植物的种子即咖啡豆在经过烘焙磨粉后通过冲泡制成的饮料，是世... | 519 |
| 1 | 传说9世纪的埃塞俄比亚的牧羊人发现并咀嚼了咖啡果实，随后将咖啡果实带给了附近修道院的僧侣，但... | 257 |
| 2 | 咖啡传播到穆斯林世界后伊斯兰医学认可了咖啡的好处，认为其可以提振精神并防止酒和大麻对穆斯林的... | 245 |
| 3 | 同样在16世纪，与阿拉伯世界的贸易令威尼斯获得了包括咖啡在内的非洲商品，威尼斯商人则向威尼斯... | 327 |
| 4 | 1675年时，英格兰就有3000多家咖啡馆；启蒙运动时期，咖啡馆成为民众深入讨论宗教和政治的... | 110 |
| 5 | 1773年，波士顿倾茶事件后约翰·亚当斯和许多美国人认为喝茶是不爱国的，令大量美国人在美国独... | 56 |
| 6 | 18世纪，葡萄牙人首先在巴西里约热内卢附近，后来则是圣保罗种植咖啡并建设种植园。1852-1... | 172 |
| 7 | 在咖啡的原产地埃塞俄比亚，18世纪前咖啡曾被埃塞俄比亚正教会所禁止，直至19世纪后期叶埃塞俄... | 67 |
| 8 | 咖啡在19世纪中已经引入中国上海，1843年—44年上海对外贸易文献就有记载“枷榧豆5包，每... | 69 |
| 9 | 香港英文报章在1866年刊登关于coffee shop的报导。1885年香港中文报章以“咖啡... | 244 |
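
The `Length` column shows that passages vary widely in size (56 to 519 characters in the first ten rows). Embedding models accept only a limited amount of input, so very long passages may need to be split before they are embedded. The sketch below shows one simple way to do that. `split_long_text` and the `max_chars` default are assumptions made for illustration, not QIANFAN requirements; check the model documentation for the actual limit.

```python
import re


def split_long_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after sentence-ending punctuation, keeping it with its sentence.
    sentences = [s for s in re.split(r"(?<=[。！？!?])", text) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
        # A single sentence longer than max_chars is hard-wrapped.
        while len(current) > max_chars:
            chunks.append(current[:max_chars])
            current = current[max_chars:]
    if current:
        chunks.append(current)
    return chunks
```

Applied to the list built above, this could look like `texts = [c for t in texts for c in split_long_text(t)]`.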


Generate an embedding for each passage and add it to the DataFrame as a new column.

In [6]:
def embed_fn(text):
    response = client.embeddings.create(input=[text], model=embedding_model)
    return response.data[0].embedding


df["Embeddings"] = df.apply(lambda row: embed_fn(row["Text"]), axis=1)
df["Embedding Length"] = df.apply(lambda row: len(row["Embeddings"]), axis=1)

df[:10]

| | Text | Length | Embeddings | Embedding Length |
|---|---|---|---|---|
| 0 | 咖啡（英语：coffee）是指咖啡植物的种子即咖啡豆在经过烘焙磨粉后通过冲泡制成的饮料，是世... | 519 | [0.15193034708499908, 0.041656315326690674, 0.... | 384 |
| 1 | 传说9世纪的埃塞俄比亚的牧羊人发现并咀嚼了咖啡果实，随后将咖啡果实带给了附近修道院的僧侣，但... | 257 | [0.09848490357398987, 0.08066622912883759, 0.0... | 384 |
| 2 | 咖啡传播到穆斯林世界后伊斯兰医学认可了咖啡的好处，认为其可以提振精神并防止酒和大麻对穆斯林的... | 245 | [0.14575347304344177, 0.10908215492963791, 0.0... | 384 |
| 3 | 同样在16世纪，与阿拉伯世界的贸易令威尼斯获得了包括咖啡在内的非洲商品，威尼斯商人则向威尼斯... | 327 | [0.17097845673561096, 0.11287374794483185, 0.0... | 384 |
| 4 | 1675年时，英格兰就有3000多家咖啡馆；启蒙运动时期，咖啡馆成为民众深入讨论宗教和政治的... | 110 | [0.10498320311307907, -0.007980545051395893, 0... | 384 |
| 5 | 1773年，波士顿倾茶事件后约翰·亚当斯和许多美国人认为喝茶是不爱国的，令大量美国人在美国独... | 56 | [0.053890161216259, 0.013532426208257675, 0.05... | 384 |
| 6 | 18世纪，葡萄牙人首先在巴西里约热内卢附近，后来则是圣保罗种植咖啡并建设种植园。1852-1... | 172 | [0.12374340742826462, 0.0700632706284523, 0.08... | 384 |
| 7 | 在咖啡的原产地埃塞俄比亚，18世纪前咖啡曾被埃塞俄比亚正教会所禁止，直至19世纪后期叶埃塞俄... | 67 | [0.08917555958032608, 0.0382981076836586, 0.04... | 384 |
| 8 | 咖啡在19世纪中已经引入中国上海，1843年—44年上海对外贸易文献就有记载“枷榧豆5包，每... | 69 | [0.058429356664419174, 0.04646380990743637, 0.... | 384 |
| 9 | 香港英文报章在1866年刊登关于coffee shop的报导。1885年香港中文报章以“咖啡... | 244 | [0.11594365537166595, 0.08037444949150085, 0.0... | 384 |
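
Calling the API once per row is easy to read but sends one request per passage. The OpenAI-style embeddings endpoint also accepts a list of inputs, so several passages can be sent per request. The sketch below is one way to do this; `embed_in_batches` is a hypothetical helper, and the `batch_size` default is an assumption, since the number of inputs QIANFAN accepts per request is not stated here.

```python
def embed_in_batches(texts, client, model, batch_size=16):
    """Embed texts with one API request per batch instead of one per text."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by index so the results line up with the input order.
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

With the objects defined above, a call could look like `df["Embeddings"] = embed_in_batches(df["Text"].tolist(), client, embedding_model)`.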


## Document search with Q&A

Now that the text embeddings are generated, let's create a Q&A system to search these texts. You will ask a question about coffee, create an embedding for that question, and compare it with the text embeddings stored in the DataFrame.

The question's embedding is also a vector, and it is compared with each text's embedding using the dot product. Since embedding vectors from the API are already normalized to unit length, the dot product equals the cosine similarity, which measures how closely the two vectors point in the same direction.

For normalized vectors, dot product values range from -1 to 1, inclusive:
*   A value of **1** indicates the vectors point in the same direction (high similarity).
*   A value of **0** indicates the vectors are orthogonal (meaning they are unrelated or independent).
*   A value of **-1** indicates the vectors point in opposite directions (high dissimilarity).
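
The relationship between the dot product and cosine similarity can be checked on toy vectors. The sketch below uses made-up 2-D vectors rather than real embeddings, and `cosine_similarity` and `normalize` are helper names introduced only for this illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def normalize(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


# Toy 2-D vectors, not real embeddings.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# For unit-length vectors the plain dot product equals the cosine similarity.
dot_of_unit_vectors = float(np.dot(normalize(a), normalize(b)))
print(np.isclose(dot_of_unit_vectors, cosine_similarity(a, b)))
```

To check whether vectors returned by the API are in fact normalized, compare `np.linalg.norm(embedding)` against 1 for a few of them.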

In [7]:
query = "1675 年时，英格兰有多少家咖啡馆？"

response = client.embeddings.create(input=[query], model=embedding_model)
query_embedding = response.data[0].embedding
print(query_embedding)

[0.07976387441158295, 0.06226450204849243, 0.025520851835608482, 0.007429834920912981, -0.02793368697166443, -0.07999370992183685, 0.004191769752651453, -0.015133338049054146, -0.08789581060409546, -0.050846993923187256, 0.04244512692093849, -0.013125265948474407, 0.04309540241956711, 0.09139090776443481, -0.023681435734033585, -0.04101189598441124, -0.030267737805843353, 0.0783049464225769, -0.005382657051086426, -0.030117006972432137, -0.007633841130882502, -0.06582427024841309, -0.020405227318406105, -0.013561825267970562, -0.05290765315294266, 0.019695445895195007, -0.004299778491258621, 0.06311466544866562, 0.04892529174685478, 0.03373253345489502, -0.030521860346198082, -0.033984314650297165, -0.013920535333454609, 0.011016780510544777, -0.00624663894996047, 0.08831621706485748, 0.04861832037568092, 0.028786908835172653, 0.02556384913623333, 0.006319008767604828, -0.08490139991044998, 0.0038173107896000147, 0.018722133710980415, 0.03331358730792999, 0.015983808785676956, -0.02257

The `find_best_passage` function computes the dot product between the query embedding and every passage embedding, then returns the passage with the largest dot product value.

In [8]:
import numpy as np


def find_best_passage(query, dataframe):
    response = client.embeddings.create(input=[query], model=embedding_model)
    query_embedding = response.data[0].embedding

    dot_products = np.dot(np.stack(dataframe["Embeddings"]), query_embedding)
    idx = np.argmax(dot_products)
    return dataframe.iloc[idx]["Text"]  # Return text from index with max value

View the most relevant text retrieved from the database. Later, we will use this text to generate an answer to the question.

In [9]:
passage = find_best_passage(query, df)
passage

'1675年时，英格兰就有3000多家咖啡馆；启蒙运动时期，咖啡馆成为民众深入讨论宗教和政治的聚集地，1670年代的英国国王查理二世就曾试图取缔咖啡馆。这一时期的英国人认为咖啡具有药用价值，甚至名医也会推荐将咖啡用于医疗。'
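
`find_best_passage` returns only the single best match. When a question may need more than one passage, the scores can be ranked instead. The sketch below is an assumption-labeled variant, not part of the original notebook; `top_k_indices` is a hypothetical helper that works on an already-stacked embedding matrix.

```python
import numpy as np


def top_k_indices(embeddings, query_embedding, k=3):
    """Return row indices and scores of the k highest dot products, best first."""
    matrix = np.asarray(embeddings, dtype=float)
    scores = matrix @ np.asarray(query_embedding, dtype=float)
    k = min(k, len(scores))
    order = np.argsort(scores)[::-1][:k]  # sort descending by score
    return order.tolist(), scores[order].tolist()
```

With the DataFrame above, a call could look like `idx, scores = top_k_indices(np.stack(df["Embeddings"]), query_embedding)`, followed by `df.iloc[idx]["Text"]` to read the passages.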

## Question answering application

Next, use the ERNIE text generation API to build a Q&A system. Prepare a prompt that contains the question and the most relevant text retrieved from the database, then ask the ERNIE LLM to generate an answer from it.

In [10]:
import textwrap


def make_prompt(query, relevant_passage):
    escaped = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
    prompt = textwrap.dedent(
      """
    你是一个乐于助人且信息丰富的机器人，使用下面提供的参考段落中的文本来回答问题。
    请务必用完整的句子回答，内容要全面，包括所有相关的背景信息。

    然而，你的对话对象是非技术人员，所以请务必分解复杂的概念，并使用友好和对话式的语气。
    如果段落与答案无关，你可以忽略它。

    问题：'{query}'
    段落：'{relevant_passage}'

    答案：
    """
    ).format(query=query, relevant_passage=escaped)

    return prompt

In [11]:
prompt = make_prompt(query, passage)
print(prompt)


你是一个乐于助人且信息丰富的机器人，使用下面提供的参考段落中的文本来回答问题。
请务必用完整的句子回答，内容要全面，包括所有相关的背景信息。

然而，你的对话对象是非技术人员，所以请务必分解复杂的概念，并使用友好和对话式的语气。
如果段落与答案无关，你可以忽略它。

问题：'1675 年时，英格兰有多少家咖啡馆？'
段落：'1675年时，英格兰就有3000多家咖啡馆；启蒙运动时期，咖啡馆成为民众深入讨论宗教和政治的聚集地，1670年代的英国国王查理二世就曾试图取缔咖啡馆。这一时期的英国人认为咖啡具有药用价值，甚至名医也会推荐将咖啡用于医疗。'

答案：



We will use the `ernie-4.5-turbo-32k` model to answer the query.

In [12]:
chat_model = "ernie-4.5-turbo-32k"

messages = [{"role": "system", "content": ""}]
messages.append({"role": "user", "content": prompt})

response = client.chat.completions.create(
    model=chat_model,
    messages=messages,
)

print(response)

ChatCompletion(id='as-9e8nc2kqxd', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='在1675年的时候，英格兰有超过3000家的咖啡馆。在那个时期，咖啡馆成为了民众深入讨论宗教和政治的重要聚集地，甚至还曾引起过英国国王查理二世的关注，他一度试图取缔这些咖啡馆呢。', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None), flag=0)], created=1748495780, model='ernie-4.5-turbo-32k', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=48, prompt_tokens=160, total_tokens=208, completion_tokens_details=None, prompt_tokens_details=None))


In [13]:
from IPython.display import Markdown

Markdown(response.choices[0].message.content)

在1675年的时候，英格兰有超过3000家的咖啡馆。在那个时期，咖啡馆成为了民众深入讨论宗教和政治的重要聚集地，甚至还曾引起过英国国王查理二世的关注，他一度试图取缔这些咖啡馆呢。
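
The retrieval, prompting, and generation steps can be combined into one function. The sketch below is a hypothetical wrapper, not part of the original notebook: `answer_question` takes the retrieval and prompt-building steps as callables so that it stays independent of how they are implemented, and it sends only a user message.

```python
def answer_question(query, retrieve, build_prompt, client, model):
    """Retrieve a passage, build a prompt, and ask the chat model."""
    passage = retrieve(query)
    prompt = build_prompt(query, passage)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

With the objects defined in this example, a call could look like `answer_question(query, lambda q: find_best_passage(q, df), make_prompt, client, chat_model)`.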