# Changing retrieval model parameters in Elasticsearch

This assignment uses the `aquaint` index (assuming you have already created it for Assignment 1). Alternatively, any other Elasticsearch index may be used.

In [1]:
from elasticsearch import Elasticsearch

In [2]:
INDEX_NAME = "aquaint"
DOC_TYPE = "doc"
FIELD = "content"

In [3]:
es = Elasticsearch()

In [4]:
query = "tropical storms"

### Small utility function for printing formatted document rankings

In [5]:
def printres(res):
    """Print the document ID and retrieval score of each hit."""
    for r in res:
        print("%s %6.2f" % (r["_id"], r["_score"]))

### Run query with default parameters

In [6]:
res = es.search(index=INDEX_NAME, q=query, df=FIELD, _source=False, size=10).get("hits", {}).get("hits", [])

In [7]:
printres(res)

APW19990810.0014  17.56
APW19990810.0132  17.56
XIE19960926.0183  17.34
APW19990617.0225  16.94
NYT20000415.0089  16.82
APW20000522.0084  16.82
NYT19990802.0233  16.64
XIE19990826.0376  16.62
APW19990402.0247  16.58
NYT19980601.0290  16.41


### Changing BM25 parameters

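Change the default similarity function. Here `b` is set to 0, which turns off document length normalization, and `k1` is raised to 2, which lets repeated occurrences of a term contribute more to the score.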

In [8]:
SIM = {
    "similarity": {
        "default": { 
            "type": "BM25",
            "b": 0,
            "k1": 2
        }
    }
}
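To get a feel for what `b` and `k1` do, the following is a minimal sketch of a textbook BM25 term score. It is not Elasticsearch's exact implementation (details such as the IDF formula and constant factors vary between versions), and the function name and all statistics are hypothetical, chosen only for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score of one query term for one document under a textbook BM25 variant."""
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    norm = 1 - b + b * doc_len / avg_doc_len
    # Term-frequency saturation: a larger k1 lets repeated terms keep adding score.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)

# Hypothetical statistics: same term frequency, very different document lengths.
stats = dict(tf=3, avg_doc_len=400, n_docs=1000, doc_freq=50)
short_doc = bm25_term_score(doc_len=100, k1=2, b=0, **stats)
long_doc = bm25_term_score(doc_len=1600, k1=2, b=0, **stats)
print(short_doc == long_doc)  # True: with b=0 document length has no effect
```

With the default `b=0.75`, the shorter document would score higher than the longer one for the same term frequency.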

The similarity settings of an existing index can be changed by closing the index, updating its settings, and reopening it.

In [9]:
es.indices.close(index=INDEX_NAME)
es.indices.put_settings(index=INDEX_NAME, body=SIM)
es.indices.open(index=INDEX_NAME)

{'acknowledged': True}

You might need to wait briefly before sending the first query, because the index takes a moment to become available again after reopening. If Elasticsearch returns errors, use the code below to wait 100 ms.

In [10]:
from time import sleep
sleep(0.1)
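A fixed sleep may be too short on a slow machine. As an alternative sketch, the cluster health API can block until the index reaches at least `yellow` status (all primary shards allocated). The helper name `wait_until_searchable` and the 10-second timeout are assumptions, not part of the assignment.

```python
def wait_until_searchable(es, index, timeout="10s"):
    """Block until the index's primary shards are allocated, or the timeout passes."""
    health = es.cluster.health(index=index, wait_for_status="yellow", timeout=timeout)
    # "timed_out" is True if the requested status was not reached in time.
    return not health["timed_out"]
```

Calling `wait_until_searchable(es, INDEX_NAME)` after reopening the index returns `True` once queries can be sent.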

Then run the query the same way as before.

In [11]:
res = es.search(index=INDEX_NAME, q=query, df=FIELD, _source=False, size=10).get("hits", {}).get("hits", [])

In [12]:
printres(res)

NYT19990802.0233  27.01
APW20000510.0148  26.71
APW20000510.0252  26.71
NYT19980601.0290  26.34
NYT19980601.0292  26.34
NYT19990929.0517  26.19
APW19990810.0014  26.11
APW19990810.0132  26.11
NYT19980917.0428  24.48
NYT19981231.0071  23.23
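To see how the ranking moved, and not only how the scores changed, the two result lists can be compared by document ID. The sketch below assumes both hit lists were kept in separate variables (the notebook itself overwrites `res`); the function name and the toy document IDs are hypothetical.

```python
def rank_changes(before, after):
    """For each hit in `after`, report its new rank, ID, and old rank (None if new)."""
    old_rank = {hit["_id"]: i for i, hit in enumerate(before, start=1)}
    return [(i, hit["_id"], old_rank.get(hit["_id"]))
            for i, hit in enumerate(after, start=1)]

# Toy hit lists standing in for two search results.
before = [{"_id": "d1"}, {"_id": "d2"}, {"_id": "d3"}]
after = [{"_id": "d3"}, {"_id": "d4"}, {"_id": "d1"}]
for new, doc_id, old in rank_changes(before, after):
    print(new, doc_id, "new" if old is None else "was %d" % old)
```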


You can also retrieve the current similarity settings.

In [13]:
es.indices.get_settings(index=INDEX_NAME)

{'aquaint': {'settings': {'index': {'creation_date': '1504451814186',
    'number_of_replicas': '1',
    'number_of_shards': '1',
    'provided_name': 'aquaint',
    'similarity': {'custom_bm25': {'b': '1', 'k1': '2', 'type': 'BM25'},
     'default': {'b': '0', 'k1': '2', 'lambda': '0.2', 'type': 'BM25'},
     'lm': {'lambda': '0.2', 'type': 'LMJelinekMercer'}},
    'uuid': '_HsQL_W8QrqrIoSXJKrn5w',
    'version': {'created': '5040299'}}}}}
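The response is a nested dictionary keyed by index name, so the similarity section has to be dug out. A minimal sketch, with a hypothetical helper name and an abridged copy of the response above as sample data:

```python
def similarity_settings(settings, index):
    """Return the similarity section of a get_settings() response, or {} if none is set."""
    return settings.get(index, {}).get("settings", {}).get("index", {}).get("similarity", {})

# Abridged copy of the response shown above.
response = {"aquaint": {"settings": {"index": {"similarity": {
    "default": {"b": "0", "k1": "2", "lambda": "0.2", "type": "BM25"},
}}}}}
default_sim = similarity_settings(response, "aquaint")["default"]
print(default_sim["type"], float(default_sim["b"]), float(default_sim["k1"]))
```

Note that the parameter values come back as strings, so they need converting before any numeric comparison.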

### Using a different retrieval model

As above, you can also change the retrieval model itself. Elasticsearch implements, among others, language modeling (LM) and divergence from randomness (DFR).

See: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html

Use language modeling with Jelinek-Mercer smoothing and lambda=0.2. Check how the retrieval scores change for the top-10 documents.

In [14]:
SIM = {
    "similarity": {
        "default": {
            "type": "LMJelinekMercer",
            "lambda": 0.2
        }
    }
}
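As a reminder of what `lambda` controls, here is a minimal sketch of textbook Jelinek-Mercer smoothing for a single query term. It assumes that `lambda` weights the collection model; Elasticsearch's actual scoring formula may differ in detail. The function name and statistics are hypothetical.

```python
import math

def jm_term_score(tf, doc_len, coll_prob, lam=0.2):
    """Log-probability of one query term under Jelinek-Mercer smoothing."""
    # Mix the document's maximum-likelihood estimate with the collection model;
    # lam is assumed here to be the weight of the collection model.
    p = (1 - lam) * tf / doc_len + lam * coll_prob
    return math.log(p)

# Hypothetical statistics, for illustration only.
present = jm_term_score(tf=2, doc_len=100, coll_prob=0.001)
absent = jm_term_score(tf=0, doc_len=100, coll_prob=0.001)
print(present > absent)  # True
```

Thanks to the collection model, a document that does not contain the term still gets a finite (if low) score.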

In [15]:
es.indices.close(index=INDEX_NAME)
es.indices.put_settings(index=INDEX_NAME, body=SIM)
es.indices.open(index=INDEX_NAME)

{'acknowledged': True}

In [16]:
sleep(0.1)

In [17]:
res = es.search(index=INDEX_NAME, q=query, df=FIELD, _source=False, size=10).get("hits", {}).get("hits", [])

In [18]:
printres(res)

APW19990810.0014  17.56
APW19990810.0132  17.56
XIE19960926.0183  17.34
APW19990617.0225  16.94
NYT20000415.0089  16.82
APW20000522.0084  16.82
NYT19990802.0233  16.64
XIE19990826.0376  16.62
APW19990402.0247  16.58
NYT19980601.0290  16.41


### Changing back to default similarity

**Important:** You need to change back to the default similarity manually.

In [19]:
SIM = {
    "similarity": {
        "default": {
            "type": "BM25",
            "b": 0.75,
            "k1": 1.2
        }
    }
}

In [20]:
es.indices.close(index=INDEX_NAME)
es.indices.put_settings(index=INDEX_NAME, body=SIM)
es.indices.open(index=INDEX_NAME)

{'acknowledged': True}
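Because it is easy to forget this last step, one option is a context manager that restores the defaults automatically. This is only a sketch: the names `DEFAULT_SIM`, `_apply`, and `temporary_similarity` are hypothetical, and it assumes a client that accepts `body=` in `put_settings`, as used above.

```python
from contextlib import contextmanager

# Default BM25 parameters, as in the cell above.
DEFAULT_SIM = {"similarity": {"default": {"type": "BM25", "b": 0.75, "k1": 1.2}}}

def _apply(es, index, settings):
    """Close the index, update its settings, and reopen it."""
    es.indices.close(index=index)
    try:
        es.indices.put_settings(index=index, body=settings)
    finally:
        es.indices.open(index=index)

@contextmanager
def temporary_similarity(es, index, settings):
    """Apply `settings` inside the with-block, then restore the defaults."""
    _apply(es, index, settings)
    try:
        yield
    finally:
        # Runs even if the code inside the with-block raises.
        _apply(es, index, DEFAULT_SIM)
```

Experiments would then run inside `with temporary_similarity(es, INDEX_NAME, SIM):`, and the index returns to default BM25 when the block exits.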