# Chapter 2 Keyword Search

In this chapter, you will learn how to use keyword search to query a database and answer questions. Search is part of daily life: it covers web search engines as well as search inside applications such as Spotify, YouTube, or Google Maps. Companies and organizations also rely on keyword search, among other methods, to search their internal documents. Keyword search is the most common way to build search systems. We will first look at how to use a keyword search system, and then see how language models can improve it.

The tutorial in this chapter requires API keys for Weaviate and Cohere.

## Table of Contents

- [1. Environment Configuration](#1.)

- [2. Weaviate Database](#2.)

- [2.1 Authentication Configuration](#2.1)

- [2.2 Database Connection](#2.2)

- [3. Keyword Search](#3.)

- [3.1 Build Keyword Search Function](#3.1)

- [3.2 BM25 Algorithm](#3.2)

- [3.3 Use Keyword Search Function](#3.3)

- [4. Deeper Understanding of Keyword Search](#4.)

- [5. Limitations of Keyword Search](#5.)

## 1. Environment Configuration <a id="1."></a>

Let's prepare some Python libraries and APIs that we will need:

In [None]:
!pip install cohere
!pip install weaviate-client
!pip install python-dotenv

Before we start, we need to obtain API keys for Weaviate and Cohere and store them in local environment variables as follows.

1. Open the .env file in this directory, which contains the following template:

WEAVIATE_API_KEY="your_weaviate_api_key"

WEAVIATE_API_URL="your_weaviate_api_url"

COHERE_API_KEY="your_cohere_api_key"

2. Replace "your_weaviate_api_key", "your_weaviate_api_url", and "your_cohere_api_key" with your own values.

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # Load the local .env file
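
If one of the keys is missing, later cells fail with a `KeyError` that is easy to misread. As a minimal sketch, the following check reports which of the three variables from the template are unset, empty, or still hold their placeholder value. The helper name `missing_env_vars` is our own and is not part of python-dotenv.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = ("WEAVIATE_API_KEY", "WEAVIATE_API_URL", "COHERE_API_KEY")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names that are unset, empty, or still a 'your_...' placeholder."""
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = env.get(name, "")
        if not value or value.startswith("your_"):
            missing.append(name)
    return missing


# An empty list means all three variables look usable.
print(missing_env_vars())
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.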

## 2. Weaviate database <a id="2."></a>

Weaviate is an open source database. It has a keyword search function and a vector search function that relies on a language model.

### 2.1 Configure authentication <a id="2.1"></a>

In [None]:
import weaviate
auth_config = weaviate.auth.AuthApiKey(
    api_key=os.environ['WEAVIATE_API_KEY'])  # Read the Weaviate API key from the environment for authentication

Now that we have our authentication configuration set up, let's look at how to connect the client to the actual database.

### 2.2 Connecting to the database <a id="2.2"></a>

+ weaviate.Client() : creates the Weaviate client object.

+ url : URL of the Weaviate service that the client communicates with.

+ auth_client_secret : authentication credentials for the Weaviate client (here, the configuration created above).

+ additional_headers : additional request headers to send with each request.

In [None]:
client = weaviate.Client(
    url=os.environ['WEAVIATE_API_URL'],  
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": os.environ['COHERE_API_KEY'],  # Send the Cohere API key from the environment in the X-Cohere-Api-Key request header
    }
)

This is a public database containing 10 million records taken from Wikipedia. Each record (row) is one paragraph from a Wikipedia article.
The records span 10 languages: 1 million are in English, and the remaining 9 million are in other languages. We can filter by the language we want to query, as we will see later.

The following line of code checks that the client is ready and connected. If it returns True, our local Weaviate client can reach the remote Weaviate database, and we can run keyword searches on this dataset.

In [None]:
print(client.is_ready())

## 3. Keyword search <a id="3."></a>

Let's take a brief look at keyword search first.

Suppose you have a query: "What color is the grass?", and you search in a very small document set that contains the following five sentences: "Tomorrow is Saturday", "Grass is green", "The capital of Canada is Ottawa", "The sky is blue", "Whales are mammals".

This is a simple search example. Keyword search works by counting how many words the **query** and each **document** have in common. If we compare the query with the first sentence, we can see that they share only one word: "is".

We can count the shared words for every sentence in this document set. The second sentence shares "grass" as well as "is" with the query, so keyword search is likely to return it as the answer.
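
The counting described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Weaviate works internally; the tokenizer (lowercase, letters only) is our own simplifying assumption.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))


query = "What color is the grass?"
documents = [
    "Tomorrow is Saturday",
    "Grass is green",
    "The capital of Canada is Ottawa",
    "The sky is blue",
    "Whales are mammals",
]

query_words = tokenize(query)
# Number of distinct words each document shares with the query.
overlap_counts = {doc: len(query_words & tokenize(doc)) for doc in documents}

for doc, count in overlap_counts.items():
    print(f"{count}  {doc}")
```

On raw counts, "Grass is green" scores 2, but so do the two sentences that merely share "the" and "is" with the query. Practical ranking functions such as BM25 (section 3.2) break this kind of tie by giving rare words like "grass" more weight than common ones.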

Let's start learning how to use keyword search.

### 3.1 Build a keyword search function <a id="3.1"></a>

In the previous code, we have connected to the database. Now let's build a function to query the database. We will call it "keyword_search".

In [1]:
def keyword_search(query,
                   results_lang='en',
                   properties=["title", "url", "text"],
                   num_results=3):
    """
    Keyword search function.

    Parameters:
    query: the keywords to search for
    results_lang: language of the search results, English ('en') by default
    properties: list of properties to return; title, url, and text by default
    num_results: number of results to return, 3 by default

    Returns:
    A list of search results
    """

    # Build a filter to limit the language of search results
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": results_lang
    }

    # Send a query request to get the search results
    response = (
        client.query.get("Articles", properties)
        .with_bm25(
            query=query
        )
        .with_where(where_filter)
        .with_limit(num_results)
        .do()
    )

    # Extract search results
    result = response['data']['Get']['Articles']
    return result

This function accepts four parameters: query (the keywords to search for), results_lang (the language of the search results, English by default), properties (the list of properties to return; title, url, and text by default), and num_results (the number of results to return, 3 by default).

Let's understand this code step by step:

(1) Constructing filters:
Inside the function, we first construct a filter where_filter to limit the language of search results. Here, the language of search results is the same as the results_lang parameter.

(2) Sending query requests and getting search results:
Use the Weaviate client object client to query the data of type "Articles" in the dataset. The query results are stored in the response variable.

The query operation includes the following parts:
- Use the property list specified by the properties parameter to determine what needs to be included in our search results
- Call .with_bm25(query=query) to add the keyword query to the request; the BM25 algorithm is then used to score how relevant each article is to the query.
- Call .with_where(where_filter) to add the language filter to the request, limiting the language of the search results.
- Call .with_limit(num_results) to add a result limit to the query request to specify the number of search results to be returned.
- Finally, call .do() to execute the query operation.

(3) Extract search results and return:
Extract the search results from the response, specifically the response['data']['Get']['Articles'] part, and store it in the result variable.
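
To make step (3) concrete, here is a small sketch of the extraction on a hand-written stand-in for the response. `sample_response` is made-up data shaped like the nesting the function indexes into, not real output. The helper name `extract_articles` is hypothetical, and the check for an `errors` key is our assumption about how a failed query is reported.

```python
def extract_articles(response, class_name="Articles"):
    """Pull the result list out of a nested query response, tolerating missing keys."""
    # Assumption: a failed query carries an "errors" entry instead of usable data.
    if "errors" in response:
        raise RuntimeError(f"Query failed: {response['errors']}")
    return response.get("data", {}).get("Get", {}).get(class_name) or []


# Hand-written stand-in with the same nesting as response['data']['Get']['Articles'].
sample_response = {
    "data": {
        "Get": {
            "Articles": [
                {"title": "Example title", "url": "https://example.org", "text": "Example text"},
            ]
        }
    }
}

articles = extract_articles(sample_response)
print(len(articles), articles[0]["title"])
```

Compared with indexing directly, this version returns an empty list when no results are present instead of raising a `KeyError`.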

### 3.2 BM25 algorithm <a id="3.2"></a>


From the above introduction, we can see that the **BM25 algorithm** is used in the keyword search function to calculate the relevance score between the query and the document.

Let's first introduce TF-IDF, a common text analysis technique that measures the importance of a word in a document by multiplying its term frequency (TF) by its inverse document frequency (IDF). The term frequency is how often the term occurs in the document. The inverse document frequency measures how rare the term is across the whole collection, so rarer terms count as more important. However, TF-IDF has shortcomings when scoring the relevance of a document to a query: for example, it does not account for document length, so long documents that repeat a term many times can be favored.

The BM25 algorithm is a classic search algorithm, which is an improvement on the traditional TF-IDF algorithm. BM25 calculates the relevance score between a document and a query by considering the term frequency, document length, and inverse document frequency of the term, thereby more accurately evaluating the relevance of the document. This improvement enables BM25 to perform better in information retrieval tasks and better meet the needs of practical applications.

The following are the key points of the BM25 algorithm:

(1) **Document and query**:

BM25 measures the relevance of a document by computing a score for each query term with respect to that document. The score is based on the frequency (TF) of the term in the document and the inverse document frequency (IDF) of the term across the entire document collection.

(2) **TF (term frequency) factor**:

BM25 takes term frequency into account, but compared to TF-IDF it handles it in a smoother way. Specifically, BM25 uses the following formula for the term frequency factor:

$$ \text{TF}(t, d) = \frac{f(t, d) \times (k_1 + 1)}{f(t, d) + k_1 \times (1 - b + b \times \frac{|d|}{\text{avgdl}})}$$

Where:
- $f(t, d)$ represents the number of times term $t$ occurs in document $d$.
- $|d|$ represents the length of document $d$.
- $\text{avgdl}$ represents the average document length in the collection.
- $k_1$ and $b$ are tuning parameters used to balance the impact of term frequency and document length.
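
The "smoother" behaviour is easiest to see numerically. The sketch below implements the TF factor exactly as written above. The default values `k1=1.2` and `b=0.75` are commonly used settings that we assume here for illustration; the text does not prescribe them.

```python
def bm25_tf(f, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 term-frequency factor for a term occurring f times in a document."""
    length_norm = 1 - b + b * doc_len / avgdl
    return f * (k1 + 1) / (f + k1 * length_norm)


# Average-length document: the factor grows with f but levels off below k1 + 1.
for f in (1, 2, 5, 20, 100):
    print(f, round(bm25_tf(f, doc_len=100, avgdl=100), 3))

# Same term count in a document twice the average length gives a lower factor.
print(round(bm25_tf(2, doc_len=200, avgdl=100), 3))
```

Repeating a term raises the factor less and less, so a document cannot dominate the ranking simply by repeating a keyword, and longer-than-average documents are discounted.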

(3) **IDF (Inverse Document Frequency) Factor**:

Similar to TF-IDF, BM25 also considers the inverse document frequency of the term. IDF can be calculated using the traditional $\log(\frac{N - n + 0.5}{n + 0.5})$, or other variants, where:
- $N$ represents the total number of documents in the document collection.
- $n$ represents the number of documents containing term $t$.

(4) **Tuning parameters**:

$k_1$ and $b$ are the two tuning parameters in BM25. $k_1$ controls how quickly the term-frequency contribution saturates, and $b$ controls how strongly document length is normalized. Their best values depend on the specific application and dataset.

(5) **Final score calculation**:

For each query term, BM25 multiplies the term's TF factor by its IDF. The document's total score is the sum of these products over all query terms:

$$ \text{score}(q, d) = \sum_{t \in q} \text{IDF}(t) \times \text{TF}(t, d) $$
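
Putting the pieces together, the following is a minimal, self-contained sketch of BM25 scoring on the five-sentence example from the start of this section. It is not Weaviate's implementation. Two choices are our own assumptions: the parameter values `k1=1.2` and `b=0.75`, and the smoothed IDF variant $\log(1 + \frac{N - n + 0.5}{n + 0.5})$, used because the unsmoothed form becomes negative for terms that occur in more than half of the documents (such as "is" here).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (simplified tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs

    # n: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            f = counts[term]
            if f == 0:
                continue  # terms absent from the document contribute nothing
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))  # smoothed variant
            tf = f * (k1 + 1) / (f + k1 * (1 - b + b * len(tokens) / avgdl))
            score += idf * tf
        scores.append(score)
    return scores


documents = [
    "Tomorrow is Saturday",
    "Grass is green",
    "The capital of Canada is Ottawa",
    "The sky is blue",
    "Whales are mammals",
]
scores = bm25_scores("What color is the grass?", documents)
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

"Grass is green" ranks first because "grass" is rare in the collection and therefore has a high IDF, while "is" appears in most documents and adds little.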

### 3.3 Using the keyword search function <a id="3.3"></a>

Now let's use this keyword search function and pass it a query.

Suppose we want to search for "What is the most viewed televised event?"

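We pass the query to the function and print the return value to see what it produces.

In [None]:
query = "What is the most viewed televised event?"
keyword_search_results = keyword_search(query)
print(keyword_search_results)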

Look at the search results we get. The output is one long block of text that is hard to read. It is, however, a list of dictionaries, so let's define a function that iterates over the key-value pairs and prints them in a more readable way.

In [None]:
def print_result(result):
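    """
    Print the search results.

    Parameters:
    result: list of search results; each element is a dictionary of key-value pairs to print.

    Returns:
    None; the results are only printed.

    """
    for i, item in enumerate(result):
        print(f'item {i}')  # print the index
        for key in item.keys():
            print(f"{key}:{item.get(key)}")  # print the key-value pair
            print()  # blank line for readability
        print()  # blank line to separate dictionaries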


Call this function to view the search results in a clearer presentation.

In [None]:
print_result(keyword_search_results)

item 0
text:The most active Gamergate supporters or "Gamergaters" said that Gamergate was a movement for ethics in games journalism, for protecting the "gamer" identity, and for opposing "political correctness" in video games and that any harassment of women was done by others not affiliated with Gamergate. They argued that the close relationships between journalists and developers demonstrated a conspiracy among reviewers to focus on progressive social issues. Some supporters pointed to what they considered disproportionate praise for games such as "Depression Quest" and "Gone Home", which feature unconventional gameplay and stories with social implications, while they viewed traditional AAA games as downplayed. False claims of the "ethics in game journalism" had started as early as 2012, when Geoff Keighley was accused of such unethical behavior when he was presenting information about "Halo 4" among advertisements for Mountain Dew and Doritos, an event called "Doritosgate" in the gamer culture.

title:Gamergate (harassment campaign)

url:https://en.wikipedia.org/wiki?curid=43758363

item 1
text:"Rolling Stone" stated Jackson's Super Bowl performance "is far and away the most famous moment in the history of the Super Bowl halftime show". "PopCrush" called the performance "one of the most shocking moments in pop culture" as well as a "totally unexpected and unforgettable moment". "Gawker" ranked the performance among the most recent of the "10 Shows that Advanced Sex on Television", commenting the set "had all the elements of a huge story" and "within seconds the world searched furtively for pictures", concluding "it remains so ubiquitous, it's impossible to look at a starburst nipple shield without thinking "Janet Jackson"". "E! Online" ranked it among the top ten most shocking celebrity moments of the prior two decades. A study of television's most impactful moments of the last 50 years conducted by Sony Electronics and the Nielsen Television Research Company ranked Jackson's Super Bowl performance at #26. The incident was the only Super Bowl event on the list and the highest music and entertainment event aside from the death of Whitney Houston. TV Guide Network ranked it at #2 in a 2010 special listing the "25 Biggest TV Blunders". "Complex" stated "It's the Citizen Kane of televised nip-slips—so unexpected, and on such a large stage, that nothing else will ever come close. If Beyoncé were to whip out both breasts and put on a puppet show with them when she performs this year in New Orleans, it would rate as just the second most shocking Super Boob display. Janet's strangely ornamented right nipple is a living legend, and so is Justin Timberlake's terrified reaction." Music channel Fuse listed it as the most controversial Super Bowl halftime show, saying the "revealing performance remains (and will forever remain) the craziest thing to ever happen at a halftime show.
Almost immediately after the incident, the FCC received a flood of complain"It prompted a million mothers to cover their eyes, fathers and sons to jump out of their seats in shock and numerous sanctions by the Federal Communications Commission, including a US$550,000 fine against CBS. Talk about a halftime show that will be hard to top." The incident was also declared "the most memorable Super Bowl halftime show in history", as well as "the most controversial", adding "you can't talk about this halftime show, or any subsequent halftime show from here to eternity, without mentioning the wardrobe malfunction".

title:Super Bowl XXXVIII halftime show controversy

url:https://en.wikipedia.org/wiki?curid=498971

item 2
text:West Germany (established in May 1949) was not eligible for the 1950 World Cup (the first after the war), and so all preparations were made with a view toward the 1954 matches in Bern, Switzerland. By that time Adidas's football boots were considerably lighter than the ones made before the war, based on English designs. At the World Cup Adi had a secret weapon, which he revealed when West Germany made the finals against the overwhelmingly favored Hungarian team, which was undefeated since May 1950 and had defeated West Germany 8–3 in group play. Despite this defeat, West Germany made the knock-out rounds by twice defeating Turkey handily. The team defeated Yugoslavia and Austria to reach the final (a remarkable achievement), where the hope of many German fans was simply that the team "avoid another humiliating defeat" at the hands of the Hungarians. The day of the final began with light rain, which brightened the prospects of the West German team who called it ""Fritz Walter-Wetter"" because the team's best player excelled in muddy conditions. Dassler informed Herberger before the match of his latest innovation—"screw in studs." Unlike the traditional boot which had fixed leather spike studs, Dassler's shoe allowed spikes of various lengths to be affixed depending on the state of the pitch. As the playing field at Wankdorf Stadium drastically deteriorated, Herberger famously announced, "Adi, screw them on." The longer spikes improved the footing of West German players compared to the Hungarians whose mud-caked boots were also much heavier. The West Germans staged a come from behind upset, winning 3-2, in what became known as the "Miracle in Bern." Herberger publicly praised Dassler as a key contributor to the win, and Adidas's fame rose both in West Germany, where the win was considered a key post-war event in restoring German self-esteem and abroad, where in the first televised World Cup final viewers were introduced to "the ultimate breakthrough."

title:Adolf Dassler

url:https://en.wikipedia.org/wiki?curid=2373164

Each result consists of text, title, and url. Our query asks for the most viewed televised event. The first result is not relevant to that question, although it contains many of the query's keywords.
The second result is an article about the Super Bowl halftime show, which is plausibly a highly viewed televised event.
The third result mentions the World Cup.
Each result includes the article's URL; clicking it leads to the corresponding Wikipedia page.
Let's try another example, this time in Chinese. Suppose we want to search for "China".

In [None]:
query = "中国"
keyword_search_results = keyword_search(query, results_lang='zh')  # use "zh" for Chinese
print_result(keyword_search_results)

item 0
text:In ancient times, "China" had different meanings: some referred to the capital where the emperor was located as "China". 《》: "Benefit this China to pacify the four directions." Mao Zhuan: "China is the capital." 《》: "Then the emperor will take the throne there." 《Collected Explanations》: "Liu Xi said; 'The capital of the emperor is the center, so it is called China'." Some referred to the Huaxia and Han areas as China (because it is among the four barbarians). 《》: "When the "Xiaoya" was completely abolished, the four barbarians invaded each other and China became insignificant." Also 《》: "Therefore, the reputation spread throughout China and even to the barbarians." The Huaxia people mostly built their capitals in the south and north of the Yellow River, so they called their place "China", which has the same meaning as "Middle Earth", "Central Plains", "Zhongzhou", "Zhongxia" and "Zhonghua". At first, it referred to the northern part of Henan Province, the southern part of Shanxi Province and the southern part of Shaanxi Province and nearby areas. Later, the scope of activities of the Central Plains Dynasty expanded, and the middle and lower reaches of the Yellow River were also called "China". Or it refers to the country that governs the Central Plains. "The Book of Jin": "Yue Da then united Wu and Shu, and secretly planned to conquer China." The areas under its jurisdiction, including those that did not belong to the Yellow River basin, were also all called "China." "The Book of Jin": "Afterwards, Qin used its army to destroy the six kingdoms and annex China." In the unified situation, the central dynasty often called itself "China"; and in the period of division, "China" could also refer to the middle and lower reaches of the Yellow River (i.e. the Central Plains) or the dynasty that continued the orthodoxy. 
"Book of Jin·Chronicle 14" Fu Jian said to his brother Fu Rong, "Liu Chan may not be the descendant of the Han Dynasty, but he was eventually annexed by China." Here, "China" refers to the Wei Kingdom in North China during the Three Kingdoms period, because Wei inherited the orthodoxy of the Han Dynasty. In addition, in ancient times, "China"The word "country" can also be used to refer to the Han nationality alone. <br>

title:China's title

url:https://zh.wikipedia.org/wiki?curid=527278

item 1
text: Swiss Bank (China) Co., Ltd. is a subsidiary of UBS. Its predecessor was the Beijing Branch of Swiss Bank Co., Ltd. established in 2004. In March 2012, the China Banking Regulatory Commission issued the "Reply of the China Banking Regulatory Commission on the Opening of Swiss Bank (China) Limited, which was restructured from a branch of UBS in China", approving the opening of Swiss Bank (China) Limited, whose English name is UBS (China) Limited, as a wholly foreign-owned bank solely funded by UBS, with its registered business address at Unit 1217-1230, Yinglan International Financial Center, No. 7 Financial Street, Xicheng District, Beijing; the registered capital is RMB 2 billion, nearly 85% of which is allocated by Swiss Bank, and the rest is transferred from the operating funds of the original branches of Swiss Bank in China; the qualifications of PETER ERIC WALSHE as the chairman of Swiss Bank (China) Limited and SIMON JIXIANG JIN as the president of Swiss Bank (China) Limited were approved. It is allowed to operate foreign exchange business for various customers and RMB business for customers other than Chinese citizens. In July 2012, Swiss Bank (China) Limited, located in Xicheng District, Beijing,=(UK) Co., Ltd. officially opened.

title:UBS Group

url:https://zh.wikipedia.org/wiki?curid=556866

item 2
text:The unique appearance has attracted the attention of the fashion industry, making her appear in Puma's Suede 50 event, Levi's New Year TVC, and Louis Vuitton's 2018 exhibitions and events; and appeared in many cutting-edge fashion and lifestyle magazines in Asia, including Vogue me (China), Harper's Bazaar (China), Nylon (China/Japan), Ellemen, Metropolis Numéro (China), and GRAZIA. (Phoenix Music)

title:Liu Boxin

url:https://zh.wikipedia.org/wiki?curid=6070776

We got three results in total, each consisting of text, title, and url. We are looking for information related to China, and the keyword "China" appears many times in all three results. Each result also includes the article's URL, which leads to the corresponding Wikipedia page.

You can try modifying the query to see what else is in the dataset.
You can also explore other properties. The following list shows the properties that were used when building this dataset and are stored in the database:

In [None]:
properties = ["text", "title", "url", "views", "lang"]
# Other languages you can try: en, de, fr, es, it, ja, ar, zh, ko, hi

The views property gives the number of views a Wikipedia page has received; you can use it to filter or sort the results.
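
One simple way to use views is to sort or filter the returned list on the client side. The sketch below assumes that "views" was included in the properties passed to keyword_search and that it comes back as a number. The helper `sort_by_views` and the sample records (including their view counts) are made up for illustration.

```python
def sort_by_views(results, min_views=0):
    """Keep results with at least min_views views, most viewed first."""
    kept = [item for item in results if (item.get("views") or 0) >= min_views]
    return sorted(kept, key=lambda item: item.get("views") or 0, reverse=True)


# Made-up records in the same list-of-dictionaries shape that keyword_search returns.
sample_results = [
    {"title": "Article A", "views": 1200},
    {"title": "Article B", "views": 50},
    {"title": "Article C", "views": 9800},
]

for item in sort_by_views(sample_results, min_views=100):
    print(item["views"], item["title"])
```

With real data, the call would look like `sort_by_views(keyword_search(query, properties=["title", "url", "text", "views"]))`.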

You can also filter by language. Languages to try include English, German, French, Spanish, Italian, Japanese, Arabic, Chinese, Korean, and Hindi. Pass the corresponding language code to keyword_search as results_lang and it will return results in that language. Keep in mind that keyword search can only match documents that share words with the query, so write the query in the language you select to get relevant results.

BM25 needs only one shared keyword to score a document as somewhat relevant. The more words the query and document share, and the more often those words repeat in the document, the higher the score.

The examples above show the full process of querying the database and then viewing the results.

## 4. A deeper understanding of keyword search <a id="4."></a>

Next, let's review **search** at a higher level.

As shown in the figure below, the main components of search are the **query**, the **search system**, and the **document archive**, a database that has been processed in advance and that the search system can access. The search system responds with a list of results ordered from most to least relevant to the query.

![Alt text](images/2-1.png)

If we look more closely, we can think of the **search system** as having two stages. The first is usually the **retrieval** stage, followed by a stage called **reranking**. Retrieval produces an initial ranking based on a ranking algorithm (such as TF-IDF or BM25), but this may not always be the order that best matches the user's intent. Reranking is the process of reordering the results that retrieval returns. It can draw on various factors, such as semantic relevance, user preferences, and domain-specific information. Reranking is usually necessary because we want to take into account information other than text relevance.
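
The two-stage structure can be sketched as two small functions. This is a toy illustration only: the first stage uses a simple shared-word count in place of BM25, and the second stage reorders the candidates by page views as a stand-in for the richer signals mentioned above. The function names, documents, and view counts are all made up.

```python
def retrieve(query, documents, top_k=2):
    """Stage 1: select candidates by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        shared = len(query_words & set(doc["text"].lower().split()))
        if shared > 0:
            scored.append((shared, doc))
    # Stable sort: documents with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def rerank(candidates, signal):
    """Stage 2: reorder the candidates using an additional signal."""
    return sorted(candidates, key=signal, reverse=True)


documents = [
    {"title": "Soup recipes", "text": "a super bowl of soup", "views": 20},
    {"title": "Halftime show", "text": "the super bowl halftime show", "views": 900},
    {"title": "World Cup", "text": "the world cup final", "views": 800},
]

candidates = retrieve("super bowl", documents)
reranked = rerank(candidates, signal=lambda doc: doc["views"])

print([doc["title"] for doc in candidates])
print([doc["title"] for doc in reranked])
```

Both candidates match the query equally well on keywords; only the second stage can tell them apart, which is the role reranking plays in a real system.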

In addition, the retrieval stage is usually implemented with an **inverted index**, a data structure commonly used in information retrieval to quickly find the documents that contain a specific term. Its basic idea is to associate each term in the document collection with the list of documents that contain it, so that documents can be located directly from the term.

In the figure below, we can see that the inverted index has two columns, one for keywords and the other for the IDs of the documents containing each keyword. This structure lets search engines quickly locate the documents that contain the user's query terms. In practice, an inverted index also records further information, such as the positions and frequency of each term in the document.

![Alt text](images/2-2.png)

When we enter query = "What color is the sky?", we can see in the inverted index that the word "color" maps to document 804, and the word "sky" also maps to document 804. Document 804 will therefore score highly in the first-stage retrieval results.
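
The lookup described above can be sketched with a dictionary that maps each term to a set of document IDs. The document texts and the IDs other than 804 are invented for this example, and a production index would also store positions and frequencies.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words (simplified tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def count_matches(query, index):
    """Count, per document, how many distinct query terms it contains."""
    hits = Counter()
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return hits


# Hypothetical documents; only the ID 804 comes from the example in the text.
documents = {
    804: "The color of the sky is blue",
    231: "Grass is green",
    57: "Whales are mammals",
}

index = build_inverted_index(documents)
hits = count_matches("What color is the sky?", index)
print(sorted(index["sky"]), hits.most_common())
```

Only the posting lists of the query terms are read, so documents that share no term with the query (document 57 here) are never touched.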

## 5. Limitations of keyword search <a id="5."></a>

As shown in the figure below, suppose we query "Strong pain in the side of the head". Suppose also that the document archive contains a document with a sentence that answers this precisely, such as "Sharp temple headache". Because that sentence shares no keywords with the query, keyword search will not retrieve the document.

A language model can solve this problem: rather than looking only at keywords, it also considers the meaning of the text, so it can retrieve such documents for the query.
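
The mismatch is easy to verify. The following sketch computes the words shared by the query and the answer, using a simplified tokenizer of our own (lowercase, letters only):

```python
import re


def shared_words(query, document):
    """Return the set of words that appear in both texts."""
    def words(text):
        return set(re.findall(r"[a-z]+", text.lower()))
    return words(query) & words(document)


overlap = shared_words("Strong pain in the side of the head", "Sharp temple headache")
print(overlap)
```

The result is an empty set: even "head" and "headache" count as different words, so any method that scores purely on shared terms gives this document no credit.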

![Alt text](images/2-3.png)

Language models can improve both stages of search (retrieval and reranking). In the next chapter, we will learn how language models improve search through embeddings.

![Alt text](images/2-4.png)

Embeddings are the topic of the next chapter. After that, we will look at how reranking is performed and how a language model can improve it. At the end of this course, we will also look at how an LLM (large language model) generates responses based on the preceding search steps.