# 1. What is BM25?

BM25 (Best Matching 25) is a **ranking function used in information retrieval**. It scores documents by their relevance to a query. It is the default ranking function in **Lucene** and in search engines built on it, such as **Elasticsearch**.

BM25 refines traditional **TF-IDF** by adding **term frequency saturation** and tunable **document length normalization**.

# 2. BM25 Formula

The BM25 score of a document $D$ for a query $Q$ is:

$$
\text{score}(D,Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}
$$

Where:

* $t$ = term in the query
* $f(t, D)$ = term frequency of $t$ in document $D$
* $|D|$ = length of the document (number of terms)
* $\text{avgdl}$ = average document length in the collection
* $k_1$ = term frequency saturation parameter (commonly 1.2–2.0)
* $b$ = document length normalization parameter (commonly 0.75)
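* $\text{IDF}(t)$ = inverse document frequency of term $t$:

$$
\text{IDF}(t) = \log\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
$$

* $N$ = total number of documents
* $n_t$ = number of documents containing term $t$

This is the IDF form used by Lucene and Elasticsearch. Adding 1 inside the logarithm keeps the IDF non-negative even for terms that occur in more than half of the documents.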
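As a rough illustration of how the pieces of the formula fit together, here is a minimal from-scratch sketch in plain Python. It assumes whitespace tokenization and the IDF form $\log\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)$ used by Lucene; the function name `bm25_scores` and the sample corpus are illustrative choices, not part of any library.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)                                  # N
    avgdl = sum(len(d) for d in docs) / n_docs          # average document length
    # n_t: number of documents that contain each term at least once
    doc_freq = Counter(term for d in docs for term in set(d))

    def idf(term):
        n_t = doc_freq.get(term, 0)
        # Lucene-style IDF (assumed variant): always non-negative
        return math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))

    scores = []
    for d in docs:
        tf = Counter(d)                                 # f(t, D) for every term in D
        length_norm = 1 - b + b * len(d) / avgdl        # 1 - b + b * |D| / avgdl
        score = 0.0
        for t in query:
            f = tf[t]                                   # 0 if the term is absent
            score += idf(t) * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


corpus = ["the quick brown fox", "jumps over the lazy dog", "quick silver fox runs"]
docs = [text.split() for text in corpus]
scores = bm25_scores("quick fox".split(), docs)
print(scores)
```

The second document contains neither query term, so its score is zero. The first and third documents have the same length and the same query-term counts, so they receive the same positive score.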

# 3. Key Features of BM25

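1. **Term Frequency Saturation:** Each additional occurrence of a term adds less to the score than the previous one. The per-term contribution approaches an upper bound of $\text{IDF}(t) \cdot (k_1 + 1)$, and $k_1$ controls how quickly it levels off.
2. **Length Normalization:** Term frequencies are scaled by document length relative to $\text{avgdl}$, so longer documents do not get higher scores just because they have more words. Setting $b = 0$ disables this normalization and $b = 1$ applies it fully.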
3. **IDF Weighting:** Rare terms contribute more to the score.
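To make the first two features concrete, the sketch below evaluates only the term-frequency part of the BM25 formula, leaving out the IDF factor. The helper name `tf_component` and the sample lengths are assumptions chosen for illustration.

```python
def tf_component(f, doc_len, avgdl, k1=1.2, b=0.75):
    """Term-frequency factor of BM25 for one term, without the IDF weight."""
    return f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))


# Saturation: the document has exactly average length, only f changes.
saturation = [tf_component(f, doc_len=100, avgdl=100) for f in (1, 2, 5, 10, 100)]
for f, value in zip((1, 2, 5, 10, 100), saturation):
    print(f"f={f:>3}  factor={value:.3f}")

# Length normalization: same term count, short vs. long document.
short_doc = tf_component(3, doc_len=50, avgdl=100)
long_doc = tf_component(3, doc_len=200, avgdl=100)
print(f"short={short_doc:.3f}  long={long_doc:.3f}")
```

With these settings the factor climbs from 1.0 at `f=1` to roughly 2.17 at `f=100`, always staying below `k1 + 1 = 2.2`. Three occurrences in the shorter document score higher than three occurrences in the longer one.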

# 4. BM25 in Practice

### Elasticsearch Example:

```json
{
  "query": {
    "match": {
      "content": "search term"
    }
  }
}
```

* By default, Elasticsearch uses BM25 for ranking.
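* You can adjust `k1` and `b` by defining a custom similarity in the index settings and assigning it to a field in the mappings:

```json
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": { "type": "BM25", "k1": 1.2, "b": 0.75 }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": { "type": "text", "similarity": "my_bm25" }
    }
  }
}
```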

---

# 5. Python Example (using the `rank_bm25` library)

```python
from rank_bm25 import BM25Okapi

corpus = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "quick silver fox runs"  # Third document for comparison
]

tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "quick fox".lower().split()
scores = bm25.get_scores(query)

# Get sorted results
results = sorted(zip(corpus, scores), key=lambda x: x[1], reverse=True)

print("Ranked results:")
for doc, score in results:
    print(f"  Score: {score:.6f} - '{doc}'")

print(f"\nRaw scores: {scores}")
```

Output:

```text
Ranked results:
  Score: 0.105828 - 'the quick brown fox'
  Score: 0.105828 - 'quick silver fox runs'
  Score: 0.000000 - 'jumps over the lazy dog'

Raw scores: [0.10582842 0.         0.10582842]
```
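The scores above are small because "quick" and "fox" each appear in two of the three documents. The following sketch tries to reproduce them by hand without the library. It rests on assumptions about how `BM25Okapi` behaves: defaults of `k1 = 1.5`, `b = 0.75`, and `epsilon = 0.25`, the classic IDF $\log\frac{N - n_t + 0.5}{n_t + 0.5}$ (no added 1), and a floor that replaces negative IDF values with `epsilon` times the average raw IDF. Treat it as a check of that reading, not as the library's definitive implementation.

```python
import math
from collections import Counter

corpus = ["the quick brown fox", "jumps over the lazy dog", "quick silver fox runs"]
docs = [doc.lower().split() for doc in corpus]
query = "quick fox".split()

K1, B, EPSILON = 1.5, 0.75, 0.25  # assumed rank_bm25 defaults

n_docs = len(docs)
avgdl = sum(len(d) for d in docs) / n_docs
doc_freq = Counter(t for d in docs for t in set(d))

# Classic IDF without the "+1": negative when a term is in more than half the documents
raw_idf = {t: math.log(n_docs - n + 0.5) - math.log(n + 0.5) for t, n in doc_freq.items()}

# Assumed floor: negative IDFs are replaced by EPSILON * mean raw IDF
idf_floor = EPSILON * sum(raw_idf.values()) / len(raw_idf)
idf = {t: (v if v >= 0 else idf_floor) for t, v in raw_idf.items()}


def score(doc):
    tf = Counter(doc)
    norm = K1 * (1 - B + B * len(doc) / avgdl)
    return sum(idf.get(t, 0.0) * tf[t] * (K1 + 1) / (tf[t] + norm) for t in query)


manual_scores = [score(d) for d in docs]
print([round(s, 6) for s in manual_scores])  # [0.105828, 0.0, 0.105828]
```

Under these assumptions the hand-computed values match the library output. Both query terms hit the floored IDF of about 0.051, which is why the two matching documents score only about 0.106 each.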
