# Configuring analyzers for the MS MARCO Document dataset

Before we start tuning queries and other index parameters, we first want to show a very simple iteration on the standard analyzers. The MS MARCO Document dataset has three fields: `url`, `title` and `body`. We tried just a couple of very small improvements, mostly to stopword lists, to see how they would affect our baseline queries. We now have two indices to play with:

- `msmarco-document.defaults` with some default analyzers
 - `url`: standard
 - `title`: english
 - `body`: english
- `msmarco-document` with customized analyzers
 - `url`: english with URL-specific stopword list
 - `title`: english with question-specific stopword list
 - `body`: english with question-specific stopword list

The stopword lists have been changed:
 1. Since the MS MARCO query dataset is all questions, it makes sense to add a few extra stop words like: who, what, when, where, why, how
 1. URLs additionally contain some other words that don't really need to be searched on: http, https, www, com, edu
 
More details can be found in the index settings in `conf`.
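As a rough illustration of what "english with an extra stopword list" could look like, here is a minimal sketch that rebuilds the `english` analyzer as a custom analyzer and chains a second `stop` filter after the built-in English one. The analyzer and filter names (`english_questions`, `english_urls`, `*_extra_stop`) are hypothetical, and we assume the URL analyzer also removes the question words; the real settings live in `conf` and may well differ.

```python
# Hypothetical sketch: names below are illustrative and not taken from `conf`.
QUESTION_STOPWORDS = ["who", "what", "when", "where", "why", "how"]
URL_STOPWORDS = ["http", "https", "www", "com", "edu"]


def english_with_extra_stopwords(name, extra_stopwords):
    """Build an `analysis` settings dict: English analyzer plus extra stop words."""
    extra_filter = f"{name}_extra_stop"
    return {
        "filter": {
            "english_possessive_stemmer": {
                "type": "stemmer",
                "language": "possessive_english",
            },
            # the built-in English stop word list
            "english_stop": {"type": "stop", "stopwords": "_english_"},
            # the additional, dataset-specific stop words
            extra_filter: {"type": "stop", "stopwords": list(extra_stopwords)},
            "english_stemmer": {"type": "stemmer", "language": "english"},
        },
        "analyzer": {
            name: {
                "type": "custom",
                "tokenizer": "standard",
                # stop words are removed before stemming
                "filter": [
                    "english_possessive_stemmer",
                    "lowercase",
                    "english_stop",
                    extra_filter,
                    "english_stemmer",
                ],
            }
        },
    }


question_analysis = english_with_extra_stopwords(
    "english_questions", QUESTION_STOPWORDS)
# assumption: the URL list extends the question list rather than replacing it
url_analysis = english_with_extra_stopwords(
    "english_urls", QUESTION_STOPWORDS + URL_STOPWORDS)
```

A dict like this would go under `settings.analysis` when creating an index, with each field's mapping pointing at the analyzer by name.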

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import importlib
import os
import sys

from elasticsearch import Elasticsearch

In [15]:
# project library
sys.path.insert(0, os.path.abspath('..'))

import qopt
importlib.reload(qopt)

from qopt.notebooks import evaluate_mrr100_dev

In [4]:
# use a local Elasticsearch or Cloud instance (https://cloud.elastic.co/)
# es = Elasticsearch('http://localhost:9200')
es = Elasticsearch('http://35.234.93.126:9200')

# set the parallelization parameter `max_concurrent_searches` for the Rank Evaluation API calls
# max_concurrent_searches = 10
max_concurrent_searches = 30

## Comparisons

The following runs a series of comparisons between the baseline default index `msmarco-document.defaults` and the custom index `msmarco-document`. We use multiple query types to confirm that the improvements hold across all of them.
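The helper name `evaluate_mrr100_dev` suggests that the scores reported here are mean reciprocal rank cut off at rank 100. To make that metric concrete, here is a small stand-alone sketch of how MRR@k can be computed; it is not the project's implementation (which goes through the Rank Evaluation API), and the function names and toy data are made up for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k=100):
    """Return 1/rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results, qrels, k=100):
    """Average the reciprocal rank over all queries in `results`.

    results: dict of query id -> ranked list of document ids
    qrels:   dict of query id -> set of relevant document ids
    """
    if not results:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, qrels.get(query_id, set()), k)
        for query_id, ranked in results.items()
    )
    return total / len(results)


# toy data: q1 hits at rank 2, q2 at rank 1, q3 has no relevant hit
toy_results = {"q1": ["d3", "d1"], "q2": ["d9", "d8", "d7"], "q3": ["d5"]}
toy_qrels = {"q1": {"d1"}, "q2": {"d9"}, "q3": {"d0"}}
toy_mrr = mean_reciprocal_rank(toy_results, toy_qrels)  # (0.5 + 1.0 + 0.0) / 3
```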

### Query: combined per-field `match`es

In [5]:
def combined_matches(index):
    evaluate_mrr100_dev(es, max_concurrent_searches, index,
                        'combined_matches', params={})
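The `combined_matches` template itself is not shown in this notebook. One plausible reading of "combined per-field `match`es" is a `bool` query with one `match` clause per field, whose scores are summed. The sketch below builds such a request body under that assumption; the function name and the default field list are ours, not the project's.

```python
def combined_matches_body(query_string, fields=("url", "title", "body")):
    """Build a bool/should query with one match clause per field (assumed shape)."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": query_string}}}
                    for field in fields
                ]
            }
        }
    }


example_body = combined_matches_body("what is a stop word")
```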

In [8]:
%%time

combined_matches('msmarco-document.defaults')

Score: 0.2385
CPU times: user 2.52 s, sys: 917 ms, total: 3.43 s
Wall time: 5min 20s


In [7]:
%%time

combined_matches('msmarco-document')

Score: 0.2495
CPU times: user 3 s, sys: 1.13 s, total: 4.13 s
Wall time: 38.2 s


### Query: `multi_match` `cross_fields`

In [18]:
def multi_match_cross_fields(index):
    evaluate_mrr100_dev(es, max_concurrent_searches, index,
        template_id='cross_fields',
        params={
            'operator': 'OR',
            'minimum_should_match': 50,  # in percent/%
            'tie_breaker': 0.0,
            'url|boost': 1.0,
            'title|boost': 1.0,
            'body|boost': 1.0,
        })
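For readers who have not seen the stored `cross_fields` template, the parameters above map naturally onto a `multi_match` query. The following is a sketch of how such a body could be assembled, assuming that `minimum_should_match` is rendered as a percentage string and that each `<field>|boost` key becomes a `field^boost` entry; the actual template may be structured differently.

```python
def cross_fields_body(query_string, params):
    """Assemble a multi_match cross_fields request body from notebook-style params."""
    # turn 'url|boost': 1.0 into 'url^1.0'
    fields = [
        f"{key.split('|')[0]}^{value}"
        for key, value in params.items()
        if key.endswith("|boost")
    ]
    return {
        "query": {
            "multi_match": {
                "type": "cross_fields",
                "query": query_string,
                "operator": params["operator"],
                "minimum_should_match": f"{params['minimum_should_match']}%",
                "tie_breaker": params["tie_breaker"],
                "fields": fields,
            }
        }
    }


example_cross_fields = cross_fields_body("what is a stop word", {
    "operator": "OR",
    "minimum_should_match": 50,
    "tie_breaker": 0.0,
    "url|boost": 1.0,
    "title|boost": 1.0,
    "body|boost": 1.0,
})
```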

In [10]:
%%time

multi_match_cross_fields('msmarco-document.defaults')

Score: 0.2460
CPU times: user 3.23 s, sys: 1.47 s, total: 4.7 s
Wall time: 1min 56s


In [19]:
%%time

multi_match_cross_fields('msmarco-document')

Score: 0.2673
CPU times: user 3.18 s, sys: 1.37 s, total: 4.54 s
Wall time: 5min 23s


### Query: `multi_match` `best_fields`

In [20]:
def multi_match_best_fields(index):
    evaluate_mrr100_dev(es, max_concurrent_searches, index,
        template_id='best_fields',
        params={
            'tie_breaker': 0.0,
            'url|boost': 1.0,
            'title|boost': 1.0,
            'body|boost': 1.0,
        })
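The `tie_breaker` parameter controls how much the non-best fields contribute to a `best_fields` score: with `0.0` only the best-matching field counts, with `1.0` all field scores are added together. A tiny sketch of that combination rule, using made-up per-field scores, looks like this:

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores: best score plus tie_breaker times the rest."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie_breaker * others


# hypothetical per-field scores for a single document
toy_scores = {"url": 0.5, "title": 2.0, "body": 1.5}
only_best = best_fields_score(toy_scores, tie_breaker=0.0)
blended = best_fields_score(toy_scores, tie_breaker=0.5)
summed = best_fields_score(toy_scores, tie_breaker=1.0)
```

With the `tie_breaker` of `0.0` used in this notebook, a document is scored purely by its single best field.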

In [21]:
%%time

multi_match_best_fields('msmarco-document.defaults')

Score: 0.2699
CPU times: user 2.03 s, sys: 840 ms, total: 2.87 s
Wall time: 4min 48s


In [22]:
%%time

multi_match_best_fields('msmarco-document')

Score: 0.2879
CPU times: user 3.18 s, sys: 1.04 s, total: 4.22 s
Wall time: 5min 3s


## Conclusion

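As you can see, there's a measurable and consistent improvement from just some minor changes to the default analyzers: the score rises from 0.2385 to 0.2495 for the combined `match`es, from 0.2460 to 0.2673 for `cross_fields`, and from 0.2699 to 0.2879 for `best_fields`. All notebooks that follow use the custom analyzers, including for their baseline measurements.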
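To put the size of the improvement in perspective, the relative gains can be computed directly from the scores printed in the cells above. The snippet is plain arithmetic on those numbers; only the dictionary layout and function name are ours.

```python
# (defaults index score, custom index score), copied from the cell outputs
scores = {
    "combined_matches": (0.2385, 0.2495),
    "cross_fields": (0.2460, 0.2673),
    "best_fields": (0.2699, 0.2879),
}


def relative_improvement(baseline, custom):
    """Return the improvement of `custom` over `baseline` in percent."""
    return (custom - baseline) / baseline * 100.0


improvements = {
    name: relative_improvement(baseline, custom)
    for name, (baseline, custom) in scores.items()
}

for name, pct in improvements.items():
    print(f"{name}: {pct:+.1f}%")
```

That works out to roughly 4.6%, 8.7% and 6.7% relative improvement for the three query types, respectively.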