# Appendix A: Tuning BM25 parameters for the MSMARCO Document dataset

The following shows a principled, data-driven approach to tuning BM25 parameters with a basic query, using the MSMARCO Document dataset. This assumes familiarity with basic query tuning as shown in the "Query tuning" notebooks.

BM25 has two parameters, `k1` and `b`. Roughly speaking (very roughly), `k1` controls term saturation (at some point, more occurrences of a term do not make a document more relevant) and `b` controls the importance of document length. A deeper look into these parameters is beyond the scope of this notebook, but our [three-part blog series on understanding BM25](https://www.elastic.co/blog/practical-bm25-part-1-how-shards-affect-relevance-scoring-in-elasticsearch) is very useful for that.
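To make the roles of `k1` and `b` a bit more concrete, here is a minimal sketch of the term-frequency part of the classic BM25 formula. The function name `bm25_tf_component` and the example document lengths are made up for illustration, the IDF factor is left out, and some implementations drop the constant `(k1 + 1)` factor (which does not change the ranking), so treat this as an illustration rather than the exact scoring code used by Elasticsearch.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of the classic BM25 formula (IDF omitted).

    tf          -- number of occurrences of the term in the document
    doc_len     -- length of the document (in terms)
    avg_doc_len -- average document length in the index
    """
    # b interpolates between no length normalization (b=0) and full (b=1)
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly the contribution of tf saturates
    return (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Saturation: each extra occurrence adds less, and the value never exceeds k1 + 1
for tf in (1, 2, 5, 10, 100):
    print(tf, round(bm25_tf_component(tf, doc_len=100, avg_doc_len=100), 3))

# Document length: for the same tf, a longer-than-average document scores lower
short_doc = bm25_tf_component(3, doc_len=50, avg_doc_len=100)
long_doc = bm25_tf_component(3, doc_len=200, avg_doc_len=100)
print(short_doc > long_doc)
```

A larger `k1` lets repeated terms keep contributing for longer before the curve flattens, while `b = 0` switches length normalization off entirely.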

Be aware that not all query types will see improvements from BM25 tuning. Sometimes it's more impactful to just tune query parameters. As always, try it out with your own datasets first and get concrete measurements. We recommend customizing index settings/analyzers first, then tuning query parameters to get your baseline measurements. Next, try the best index settings/analyzers with BM25 tuning, then tune query parameters again and see if that improves on your baseline. If there's no significant difference, it's best to just stick with the default BM25 parameters for simplicity.

In [None]:
%load_ext autoreload
%autoreload 2

In [2]:
import importlib
import os
import sys

from elasticsearch import Elasticsearch
from skopt.plots import plot_objective

In [None]:
# project library
sys.path.insert(0, os.path.abspath('..'))

import qopt
importlib.reload(qopt)

from qopt.notebooks import evaluate_mrr100_dev, optimize_bm25_mrr100
from qopt.optimize import Config, set_bm25_parameters

In [4]:
# use a local Elasticsearch or Cloud instance (https://cloud.elastic.co/)
es = Elasticsearch('http://localhost:9200')
# es = Elasticsearch('http://35.234.93.126:9200')

# set the parallelization parameter `max_concurrent_searches` for the Rank Evaluation API calls
max_concurrent_searches = 10
# max_concurrent_searches = 30

index = 'msmarco-document'
template_id = 'combined_matches'

# no query params
query_params = {}

# default Elasticsearch BM25 params
default_bm25_params = {'k1': 1.2, 'b': 0.75}

## Baseline evaluation

For tuning the BM25 parameters, we're going to use just a `match` query per field, combined using a `bool` `should` query. This will search for query terms across the `url`, `title`, and `body` fields, and we'll be attempting to optimize the BM25 parameters that are used in the scoring function for each field. In theory, each field could have its own BM25 [similarity](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#bm25) and parameters, but we'll leave that as an exercise for the reader.
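As a rough illustration of that query shape, the sketch below builds a `bool` `should` query with one `match` clause per field. The helper name `build_combined_matches` is hypothetical, and the actual `combined_matches` search template used in this notebook may differ in its details.

```python
def build_combined_matches(query_string, fields=("url", "title", "body")):
    """Build a bool/should query with one match clause per field.

    Hypothetical helper for illustration only; not the project's template.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": query_string}}}
                    for field in fields
                ]
            }
        }
    }


# Example request body for a sample query string
body = build_combined_matches("what is bm25")
print(body)
```

Because the clauses are combined with `should`, a document's score is the sum of the scores of the fields that match, and each of those field scores is computed with BM25.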

Since BM25 parameters are actually index settings in Elasticsearch (they are theoretically query parameters, but they are implemented as index settings to be consistent with other similarity modules), we need to make sure to set the parameters before any evaluation step. At optimization time, we'll follow the same process: set the BM25 parameters to try, then run the Rank Evaluation API on the training query dataset.
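For readers who want to see what setting these index settings might look like, here is a minimal sketch. It is not the implementation of `set_bm25_parameters` from the project library; the helper names are hypothetical. It assumes the fields use the index's `default` similarity, that the index has to be closed while similarity settings are changed, and that the client accepts the `body` keyword (as the 7.x Python client does; newer clients may prefer `settings=`).

```python
def bm25_similarity_settings(k1, b):
    """Index settings that configure the default similarity as BM25."""
    return {
        "index": {
            "similarity": {
                "default": {"type": "BM25", "k1": k1, "b": b}
            }
        }
    }


def apply_bm25_settings(es, index, k1, b):
    """Close the index, update the default similarity, then reopen it.

    Hypothetical helper; `es` is expected to be an Elasticsearch client.
    """
    es.indices.close(index=index)
    try:
        es.indices.put_settings(index=index, body=bm25_similarity_settings(k1, b))
    finally:
        # Reopen the index even if the settings update fails
        es.indices.open(index=index)


# Usage against a live cluster (not executed here):
# apply_bm25_settings(es, 'msmarco-document', k1=1.2, b=0.75)
```

Note that the index is briefly unavailable for searches while it is closed, which is one more reason to run this kind of tuning against a non-production index.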

In [5]:
%%time

set_bm25_parameters(es, index, **default_bm25_params)

_ = evaluate_mrr100_dev(es, max_concurrent_searches, index, template_id, query_params)

Score: 0.2495
CPU times: user 2.04 s, sys: 647 ms, total: 2.68 s
Wall time: 31 s


That's the same baseline that we've seen in the "Query tuning" notebook, so we know we're set up correctly.

## Optimization

Now we're ready to run the optimization procedure and see if we can improve on that, while holding the default query parameters constant.

We know that there's roughly a standard range for each parameter, so we've hardcoded those. We don't show all the code details here, but you can have a look in the corresponding Python module files for more details. Here are the pertinent details for our parameter space:

* `k1`: `0.5` to `5.0`
* `b`: `0.3` to `1.0`
* number of iterations: `40`
* number of initial points: `10`
* static initial points:
  * Elasticsearch defaults: `k1`: `1.2`, `b`: `0.75`
  * Anserini [1] defaults: `k1`: `0.9`, `b`: `0.4`
  
[1] [anserini](https://github.com/castorini/anserini) is a commonly used tool in academia for research into search systems
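The parameter space above can be written down as plain Python data, which also makes it easy to sanity-check that the static initial points fall inside the ranges. This is only a sketch: the names `SPACE`, `STATIC_INITIAL_POINTS`, `in_space`, and `initial_points` are hypothetical, and it assumes that the static points count toward the 10 initial points with the remainder drawn uniformly at random, which may not match how the project library actually handles them.

```python
import random

# Ranges and static starting points as listed above
SPACE = {"k1": (0.5, 5.0), "b": (0.3, 1.0)}
STATIC_INITIAL_POINTS = [
    {"k1": 1.2, "b": 0.75},  # Elasticsearch defaults
    {"k1": 0.9, "b": 0.4},   # Anserini defaults
]
NUM_ITERATIONS = 40
NUM_INITIAL_POINTS = 10


def in_space(point, space=SPACE):
    """Return True if every parameter of the point lies within its range."""
    return all(lo <= point[name] <= hi for name, (lo, hi) in space.items())


def initial_points(seed=0):
    """Static points first, then uniform random samples to fill the rest.

    ASSUMPTION: static points count toward NUM_INITIAL_POINTS.
    """
    rng = random.Random(seed)
    points = [dict(p) for p in STATIC_INITIAL_POINTS]
    while len(points) < NUM_INITIAL_POINTS:
        points.append({name: rng.uniform(lo, hi) for name, (lo, hi) in SPACE.items()})
    return points


print(all(in_space(p) for p in initial_points()))
```

Seeding the search with known-good defaults gives the optimizer two sensible reference points before it starts exploring the rest of the space.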

In [None]:
%%time

_, best_params, _, metadata = optimize_bm25_mrr100(es, max_concurrent_searches, index, template_id, query_params)

Here's a look at the parameter space, which is easy to plot here since there are just two parameters. 

In [None]:
_ = plot_objective(metadata, sample_source='result')

It's good to see smooth curves with just a single optimal point in the parameter space.

In [10]:
%%time

set_bm25_parameters(es, index, **best_params)

_ = evaluate_mrr100_dev(es, max_concurrent_searches, index, template_id, query_params)

Score: 0.2575
CPU times: user 2.02 s, sys: 757 ms, total: 2.78 s
Wall time: 36.5 s


Pretty good: MRR@100 on the dev set improves from 0.2495 to 0.2575 just from tuning the BM25 parameters.

Before we wrap up the notebook, it's good to set the BM25 index settings back to the defaults so that anything we run after this notebook on the same index will not be using unexpected parameter values!

In [12]:
set_bm25_parameters(es, index, **default_bm25_params)

## Conclusion

We've shown a very simple but principled way to tune the BM25 parameters `k1` and `b`, using an approach similar to the one we used for optimizing query parameters. In this case, it was useful to rely on Bayesian optimization since we set a pretty wide range for each parameter. Stepping through each 1/10th or 1/100th of each parameter in a grid search would be very time-consuming indeed.
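To put a rough number on that, the sketch below counts the grid points for the ranges used in this notebook and multiplies by the roughly 31 seconds of wall time that the baseline evaluation took. The helper names are hypothetical, and the per-evaluation time is an assumption: it was measured on the dev set, and evaluations on the training query dataset may take a different amount of time.

```python
SECONDS_PER_EVALUATION = 31  # assumption: wall time of the baseline evaluation


def grid_points(lo, hi, step):
    """Number of values from lo to hi inclusive at the given step size."""
    return round((hi - lo) / step) + 1


def grid_cost(step):
    """Return (number of evaluations, estimated hours) for a full grid search."""
    evaluations = grid_points(0.5, 5.0, step) * grid_points(0.3, 1.0, step)
    return evaluations, evaluations * SECONDS_PER_EVALUATION / 3600


for step in (0.1, 0.01):
    evaluations, hours = grid_cost(step)
    print(f"step {step}: {evaluations} evaluations, about {hours:.1f} hours")

# For comparison, the 40 iterations used by the optimization procedure
print(f"40 iterations: about {40 * SECONDS_PER_EVALUATION / 60:.1f} minutes")
```

Under these assumptions, a grid with a step of 0.1 needs 368 evaluations and a step of 0.01 needs 32,021, compared with the 40 evaluations used here.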