# Retrieval with TF-IDF and BM25

*Lexical search* is a type of search where a query is compared against a collection of text documents (as short as a sentence or as long as an entire article) by directly matching the words in the query. For example, given a query like "What is a cat?", you would look for documents that contains one or more of the words: "What", "is", "a", "cat". You can assign a score to each document based on how important each word is (by assigning a score to each word), and how many words are matched (e.g. by summing up the scores).

Traditionally, retrieval was simply a keyword(s) search, aka a boolean search. The retrieval systme just looked for the existance of these words in the document. 
Soon enough, we learned that it's not enough. While all words were born equal, some are more equal than others.  

TF-IDF is an algorithm that learns to reweight the words based on their frequencies: frequent words that appear often in the many documents get less weight, and rare words gets an increased weight.

Let's see it in action.

In [1]:
from typing import Iterator, Tuple

import numpy as np
import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

## Creating a Toy-Dataset

In [2]:
documents = [
    "Tech Stocks Rally as Investors Bet on Strong Quarterly Earnings",
    "Global Markets Dip Amid Concerns Over Slowing Tech Sector Growth",
    "Cryptocurrency Prices Surge Following Regulatory Clarity in Europe",
    "Federal Reserve Hints at Future Rate Cuts Boosting Market Confidence",
    "Oil Prices Fall as Supply Chain Disruptions Ease Worldwide",
    "Major Bank Reports Record Profits Driven by Consumer Lending",
    "Retail Stocks Drop After Weak Holiday Sales Forecast",
    "Automakers Invest Heavily in Electric Vehicles to Stay Competitive",
    "Investors Pull Back from Risky Assets Amid Inflation Fears",
    "Fintech Startups Gain Momentum with New Digital Payment Solutions",
]

## Stop Word Removal and Stemming

In real-world scenario, we must compromize between the model accuracy and its performance in terms of speed and space. One of the main factors to consuming space is the vocabulary size, the number of different tokens (types) we take into account.

There are two main techniques to reduce the number of types:
1. Removing stop words
2. Stemming and Lemmatization

Let's check them out:

### Stop Word Removal
Stop words are common words in a language (like ```"the"```, ```"is"```, ```"and"```) that appear frequently but carry little semantic meaning. Removing them reduces noise, decrease the number of tokens we must store, and therefore may improve the retrieval efficiency.

**Example:**

Original: ```"The cat is sitting on the mat."```  
After Stop Word Removal: ```"cat sitting mat"```

### Stemming & Lemmatization
[*Stemming*](https://www.geeksforgeeks.org/nlp/snowball-stemmer-nlp/) reduces words to their root form by chopping them abruptly, so that different variants of a word are treated as the same token. This helps in reducing the dimensionality, the number of different words we treat, as well as matching similar concepts and words, as they are all mapped to the same word.

**Example:**

Words: ```"Consult", "Consultant", "Consulting", "Consultantative", "Consultants"```  
After Stemming: ```"consult"```

*Lemmatization* reduces words to their base or dictionary form (aka lemma). Unlike stemming, which simply chops off the end of a word, lemmatization ensure the resulting lemma is a valid word. For example, "better" is lemmatized to "good" instead of a non-dictionary root word.

In [3]:
# Initialize stemmer and stopwords
stemmer = PorterStemmer()

tfidf_vec = TfidfVectorizer(stop_words="english")
stop_words = set(tfidf_vec.get_stop_words())  # type: ignore


### Task 1:

Complete the preprocessing function in the code chunk below.

The input of the function is a string of text.
It returns a list of stemmed, non-stop words.

1. Split the text into single words. (Hint: use the [```split()```](https://docs.python.org/3/library/stdtypes.html#str.split) function)
2. For each word word check if it is a stop word using the ```stop_words``` set.
3. Finally stem all the remaining words using [```stemmer.stem()```](https://www.nltk.org/api/nltk.stem.porter.html#nltk.stem.porter.PorterStemmer.stem).
4. Return the resulting list.

In [4]:
def preprocess(text: str) -> list[str]:
    """Preprocess the input text by tokenizing, removing stopwords, and stemming.

    Args:
        text (str): The input text to preprocess.

    Returns:
        list[str]: A list of preprocessed tokens.
    """
    # Your code here:
    tokens = text.lower().split()
    tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    return tokens

In [5]:
# Preprocess documents
processed_docs = [" ".join(preprocess(doc)) for doc in documents]

## TF-IDF: Term Frequency–Inverse Document Frequency

**TF-IDF** is a numerical statistic that reflects how important a word is to a document within a collection (corpus).  
It is widely used in information retrieval and text mining to rank documents by relevance.

**1. Term Frequency (TF)**

Measures how often a term $t$ appears in a document $d$.

$$
\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in } d}{\text{Total number of terms in } d}
$$

**2. Inverse Document Frequency (IDF)**

Measures how informative a term is, by counting the number of documents it appears in — rare terms get a higher score.

$$
\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{num of documents containing term t} + 1}\right)
$$

(The $+1$ prevents division by zero.)

**3. TF-IDF Score**

$$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
$$

**Interpretation**

* **High TF-IDF** → The term appears frequently in the document but rarely elsewhere → *important*.
* **Low TF-IDF** → The term is common across documents or rare in the current one → *less informative*.

## Using TF-IDF Vectorization

Now it is time to apply the TF-IDF vectorization to our preprocessed data. Fortunately, we don't have to code this ourselfs.  
The vectorization is implemented for example in [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

To perform retrieval, we preprocess and vectorize the query with the same TF-IDF model that we fit on the documents.  
Then, a distance metric, such as [*cosine similarity*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), is used to measure the relevancy of each document, and find the best matching results.

$$\text{Cosine similarity}(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$


In [6]:
# create and fit a TfidfVectorizer over the document corpus
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_docs)

In [7]:
def retrieve_documents(qry: str) -> Iterator[Tuple[float, str]]:
    """Retrieve documents based on a TF-IDF.

    Args:
        qry (str): a query to search for

    Yields:
        Iterator[Tuple[float, str]]: a list of results and their matching score
    """
    preprocessed_query = " ".join(preprocess(qry))
    print(f'Searching for: "{preprocessed_query}"')
    query_vector = tfidf_vectorizer.transform([preprocessed_query])

    # Compute cosine similarity
    cosine_sim = (tfidf_matrix @ query_vector.T).toarray().flatten()
    tfidf_ranking = np.argsort(-cosine_sim)
    for idx in tfidf_ranking:
        if cosine_sim[idx] > 0:
            yield cosine_sim[idx], documents[idx]

In [8]:
query = "Tech sector earnings"
# query = "energy supply price drop"
# query = "automobile sector investments"

print("Retrieval results using TF-IDF Ranking:")
for score, result in retrieve_documents(query):
    print(f"Score: {score:.4f} | Doc: {result}")

Retrieval results using TF-IDF Ranking:
Searching for: "tech sector earn"
Score: 0.3899 | Doc: Tech Stocks Rally as Investors Bet on Strong Quarterly Earnings
Score: 0.3653 | Doc: Global Markets Dip Amid Concerns Over Slowing Tech Sector Growth


### Task 2:

In the code above, experiment with different queries.
Can you find weaknesses with TF-IDF vectorization?

## Issues with TF-IDF

- **Bag-of-words assumption (no semantics)**  
  TF–IDF ignores word order, syntax and meaning. Synonyms and paraphrases (e.g., "car" vs "automobile") won’t match, and semantically similar documents can be missed.

- **Polysemy and ambiguity**  
  A single token can have multiple senses; TF–IDF treats them identically and can return irrelevant documents.

- **Sparse, high-dimensional vectors**  
  Representations are large and sparse, which can be memory- and compute-inefficient at scale, and sensitive to vocabulary mismatch and OOV tokens.

- **Length and frequency bias**  
  Raw term frequency and document length can bias scores (longer docs more likely to contain query terms). IDF can overemphasize rare noise words if corpus statistics are unstable.

- **No contextual or phrase understanding**  
  Multi-word expressions and context-dependent meaning are poorly handled; phrase matches are often missed unless explicitly indexed.

- **Preprocessing dependency**  
  Stemming, lemmatization, stopword lists, tokenization, and case handling can dramatically change results; mismatches in preprocessing between query and documents cause retrieval failures.

- **Scaling and sparsity in multilingual/morphologically rich languages**  
  Languages with rich morphology, diacritics or compound words pose additional challenges for lexical-only matching.

Mitigations include BM25 (better length normalization), query expansion, relevance feedback, LSI/LSA or SVD to capture latent topics, and modern dense (embedding-based) retrieval or hybrid sparse+dense approaches for semantic matching.

## BM-25

BM-25 expands on TF-IDF and handles cases where long documents got priority and a higher score, due to their lexical richness. It adds 2 fraction variables over the TF and the IDF, allowing to control their influence on the scores.

In this example, we will use [bm25s](https://github.com/xhluca/bm25s). A relatively newer version of BM25, which adds some more capabilities on top of the original algorithm, such as tokenization, and a more efficient score-matrix calculation. You can read more about it in [the paper](https://arxiv.org/abs/2407.03618).



In [9]:
import bm25s
import Stemmer

In [10]:
stemmer = Stemmer.Stemmer("english")

# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_tokens = bm25s.tokenize(documents, stopwords="en", stemmer=stemmer)

# Create the BM25 model and index the corpus
retriever = bm25s.BM25()
retriever.index(corpus_tokens)

Split strings:   0%|          | 0/10 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/10 [00:00<?, ?it/s]

BM25S Count Tokens:   0%|          | 0/10 [00:00<?, ?it/s]

BM25S Compute Scores:   0%|          | 0/10 [00:00<?, ?it/s]

In [11]:
def retrieve_documents_bm25(qry: str, k: int = 2) -> Iterator[Tuple[float, int]]:
    """Retrieve documents using BM25.

    Args:
        qry (str): a query to search for
        k (int): # of top results to retrieve

    Yields:
        Iterator[Tuple[float, str]]: a list of results and their matching score
    """
    preprocessed_query = bm25s.tokenize(qry, stemmer=stemmer)
    results, scores = retriever.retrieve(preprocessed_query, k=k)

    for i in range(results.shape[1]):
        score, doc = scores[0, i], results[0, i]
        if score > 0:
            yield score, doc

In [12]:
query = "Tech sector earnings"
# query = "energy supply price drop"
# query = "automobile sector investments"

print("Retrieval results using BM25 Ranking:")
for score, result in retrieve_documents_bm25(query, 3):
    print(f"Score: {score:.4f} | Doc: {documents[result]}")

Retrieval results using BM25 Ranking:


Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Score: 1.4050 | Doc: Tech Stocks Rally as Investors Bet on Strong Quarterly Earnings
Score: 1.2647 | Doc: Global Markets Dip Amid Concerns Over Slowing Tech Sector Growth


## Larger dataset
Let's make it real.

In [13]:
financial_retrieval_corpus = pd.read_parquet("../../data/financial_retrieval_corpus.parquet")
financial_retrieval_queries = pd.read_parquet("../../data/financial_retrieval_queries.parquet")

Here is a larger corpus. Some of these documents are quite long.

In [14]:
financial_retrieval_corpus

Unnamed: 0,content
0,NVIDIA Announces Financial Results for Second ...
1,"Q2 Fiscal 2024 Summary GAAP ($ in millions, ex..."
2,Outlook NVIDIA’s outlook for the third quarter...
3,Gaming Second-quarter revenue was $2.49 billio...
4,Automotive Second-quarter revenue was $253 mil...
5,Stock futures edged lower on Thursday morning ...
6,Japan’s September trade balance swings into su...
7,Here are some of the tickers on my radar for T...
8,\t2022 12/31/22\t2021 12/31/21\t2020 12/31/20\...
9,"Nokia said it would cut up to 14,000 jobs as p..."


And here are some queries for this dataset. For each query, we know exactly what the expected answer is, and in which document number (`contenet_index`) it should be found.

In [15]:
financial_retrieval_queries

Unnamed: 0,query,answer,content_index
0,What was the increase in revenue from the prev...,Up 101%,0
1,What was the revenue in second quarter?,$13.51 billion,0
2,How many shares were repurchased in second qua...,7.5 million shares,0
3,Who is the CEO of Nvidia?,Jensen Huang,0
4,What was the percentage increase in data cente...,Up 141% from Q1,0
...,...,...,...
95,Which stock index increased by the fewest poin...,"Dow Jones Industrial futures, up 23 points.",22
96,Which stock index increased by the highest per...,"S&P 500 futures futures, up 1.2%.",22
97,Which stock has the highest price - (A) River ...,(A) River Industries,23
98,"If Smithson stock increases by $2, what is the...",(C) $21,23


## Task 3

It's your turn now to evaluate how well *BM25s* work on this dataset.  
Tokenize the dataset, and run the queries. Evaluate how often the correct document was retrieved.

In [16]:
# tokenizing and setting up the corpus:
corpus_tokens = bm25s.tokenize(
    financial_retrieval_corpus["content"].to_list(), stopwords="en", stemmer=stemmer
)

# example query:
query = financial_retrieval_queries["query"].iloc[0]

for score, result in retrieve_documents_bm25(query, 3):
    print(
        f"Score: {score:.4f} | ID: {result} | Doc: {financial_retrieval_corpus['content'].iloc[result]}"
    )

Split strings:   0%|          | 0/25 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/25 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Score: 0.7635 | ID: 8 | Doc: 	2022 12/31/22	2021 12/31/21	2020 12/31/20	2019 12/31/19 NET SALES OR REVENUES	81,462	53,823	31,536	24,578 Cost of Goods Sold (Excl Depreciation)	57,066	37,306	22,584	18,402 Depreciation, Depletion And Amortization	3,543	2,911	2,322	2,107 Depreciation	2,655	2,146	1,802	1,298 Amortization of Intangibles	888	51	51	44Amortization of Deferred Charges	--	714	469	765 GROSS INCOME	20,853	13,606	6,630	4,069 Selling, General & Admin Expenses	7,021	7,110	4,636	3,989 Research and Development Expense	3,075	2,593	1,491	1,343OPERATING INCOME	13,832	6,496	1,994	80 Extraordinary Charge - Pretax	(228)	(101)	0	(196) Non-Operating Interest Income	297	56	30	44 Other Income/Expenses - Net	(15)	263	(122)	92 Interest Expense On Debt	167	424	796	716 Interest Capitalized	0	53	48	31 PRETAX INCOME	13,719	6,343	1,154	(665) Income Taxes	(1,132)	(699)	(292)	(110) Current Domestic Income Tax	62	9	4	5 Current Foreign Income Tax	1,266	839	248	86Deferred Domestic Income Tax	27	0	0	(4) Defer

In [None]:
# Evaluate BM25 on all queries: add a boolean "match" column indicating whether top-1 retrieved doc equals expected content_index
def top1_matches(row) -> int:
    """Gets only the top one match for the query.

    Args:
        row: a financial_retrieval_queries pandas dataset row

    Returns:
        int: the document id that was matched
    """
    qry = row["query"]
    expected = int(row["content_index"])
    results = list(retrieve_documents_bm25(qry, k=1))
    if not results:
        return False
    score, doc_idx = results[0]
    return int(doc_idx) == expected


financial_retrieval_queries["match"] = financial_retrieval_queries.apply(top1_matches, axis=1)
financial_retrieval_queries.head()

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Split strings:   0%|          | 0/1 [00:00<?, ?it/s]

Stem Tokens:   0%|          | 0/1 [00:00<?, ?it/s]

BM25S Retrieve:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,query,answer,content_index,match
0,What was the increase in revenue from the prev...,Up 101%,0,False
1,What was the revenue in second quarter?,$13.51 billion,0,True
2,How many shares were repurchased in second qua...,7.5 million shares,0,True
3,Who is the CEO of Nvidia?,Jensen Huang,0,False
4,What was the percentage increase in data cente...,Up 141% from Q1,0,True


In [18]:
financial_retrieval_queries["match"].sum() / len(financial_retrieval_queries)

np.float64(0.08)