# Performing a parameter sweep

The objective of this task is to optimize the parameters of the BM25 retrieval model (b and k1) in a systematic way. 

The proper solution would be an exhaustive grid search, because b and k1 are not independent of each other. In pseudocode, that would look something like:

```
for b between 0 and 1 in 0.1 steps
    for k1 between 1 and 2 in 0.1 steps
        perform retrieval using b and k1
        evaluate
pick the b and k1 with the highest overall performance (MAP score)
```
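For reference, the pseudocode above could be written in Python roughly as follows. This is a minimal sketch, not part of the assignment solution: `grid_search` and its `evaluate` argument are hypothetical names, and `evaluate` is assumed to be a callable that runs retrieval with the given `b` and `k1` and returns the MAP score.

```python
from itertools import product


def grid_search(evaluate, b_values, k1_values):
    """Return (best_b, best_k1, best_score) over all (b, k1) pairs.

    `evaluate` is a hypothetical callable: (b, k1) -> MAP score.
    """
    best = None
    for b, k1 in product(b_values, k1_values):
        score = evaluate(b, k1)
        # Keep the first pair that reaches the highest score seen so far.
        if best is None or score > best[2]:
            best = (b, k1, score)
    return best


# Candidate values: b in [0, 1] and k1 in [1, 2], both in steps of 0.1.
B_VALUES = [round(i / 10, 1) for i in range(11)]
K1_VALUES = [round(1 + i / 10, 1) for i in range(11)]
```

With 11 candidate values per parameter, this calls `evaluate` 121 times, and each call is one full retrieval and evaluation run.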

Because this would take a lot of time, this exercise uses a cheaper two-step procedure instead:
  - keep the default k1 value (1.2) and perform a sweep on b
  - use the best-performing b value and perform a sweep on k1
  
This procedure may not find the global optimum, but it should give better results than the default setting.
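The two steps can be sketched as a pair of one-dimensional sweeps. As before, this is only an illustration under assumptions: `sweep`, `two_step_sweep`, and `evaluate` are hypothetical names, and `evaluate(b, k1)` is assumed to return the MAP score for one retrieval run.

```python
def sweep(evaluate_one, values):
    """Return (best_value, best_score) for a one-parameter sweep."""
    scores = [evaluate_one(v) for v in values]
    best_idx = max(range(len(values)), key=scores.__getitem__)
    return values[best_idx], scores[best_idx]


def two_step_sweep(evaluate, b_values, k1_values, default_k1=1.2):
    """Sweep b with k1 fixed, then sweep k1 with the best b fixed."""
    # Step 1: keep k1 at its default value and sweep b.
    best_b, _ = sweep(lambda b: evaluate(b, default_k1), b_values)
    # Step 2: keep b at the best value found and sweep k1.
    best_k1, best_score = sweep(lambda k1: evaluate(best_b, k1), k1_values)
    return best_b, best_k1, best_score
```

With 11 candidate values per parameter, this needs 11 + 11 = 22 runs instead of the 121 runs of the full grid. The price is that interactions between b and k1 are only explored along two lines of the grid.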

**NOTE** The solution is incomplete: the parts of the code that are to be developed as part of Assignment 1 have been removed.

In [19]:
from elasticsearch import Elasticsearch
import time

In [14]:
INDEX_NAME = "aquaint"
DOC_TYPE = "doc"
FIELD = "content"

In [3]:
QUERY_FILE = "data/queries.txt"
OUTPUT_FILE = "data/baseline.txt"
QRELS_FILE = "data/qrels2.csv"

In [6]:
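def load_queries(query_file):
    """Load queries from a file with one '<query ID> <query text>' pair per line."""
    queries = {}
    with open(query_file, "r") as fin:
        for line in fin:
            line = line.strip()
            if not line:  # skip blank lines instead of failing on the split
                continue
            qid, query = line.split(" ", 1)
            queries[qid] = query
    return queries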

In [7]:
def eval_query(ranking, gt):
    # See Assignment 1
    pass

In [8]:
def eval(gt_file, output_file):
    
    # See Assignment 1
    # This function is modified such that it only returns the MAP score
    pass

In [9]:
queries = load_queries(QUERY_FILE)

In [10]:
es = Elasticsearch()

### Perform a sweep on b

In [2]:
map_scores = []
x = []
for i in range(11):
    b = round(i / 10, 2) 
    x.append(b)
    print("Running queries for b=%f" % b)
    
    # TODO run retrieval using k1=1.2 and b
    
    # TODO evaluate the ranking
    map_score = 0
    map_scores.append(map_score)

Running queries for b=0.000000
Running queries for b=0.100000
Running queries for b=0.200000
Running queries for b=0.300000
Running queries for b=0.400000
Running queries for b=0.500000
Running queries for b=0.600000
Running queries for b=0.700000
Running queries for b=0.800000
Running queries for b=0.900000
Running queries for b=1.000000
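
One possible way to approach the retrieval TODO in the loop above is to change the default similarity of the index before running the queries. The sketch below is an assumption about how this could be done, not the assignment's reference solution. `bm25_settings` and `set_bm25_params` are hypothetical helper names, and `es` is assumed to be an elasticsearch-py client. Similarity settings of an existing index can only be updated while the index is closed, so the helper closes the index, updates the settings, and reopens it. The `body=` keyword is assumed to match the older client version this notebook appears to use; newer client versions may expect a `settings=` argument instead.

```python
def bm25_settings(b, k1):
    """Build a settings body that makes BM25 (with b and k1) the default similarity."""
    return {
        "index": {
            "similarity": {
                "default": {"type": "BM25", "b": b, "k1": k1}
            }
        }
    }


def set_bm25_params(es, index_name, b, k1):
    """Apply BM25 parameters to an existing index (close, update, reopen).

    Assumes `es` is an elasticsearch-py client object.
    """
    es.indices.close(index=index_name)
    try:
        es.indices.put_settings(index=index_name, body=bm25_settings(b, k1))
    finally:
        # Reopen the index even if the settings update fails.
        es.indices.open(index=index_name)
```

Inside the loop, this would be called as `set_bm25_params(es, INDEX_NAME, b, 1.2)` before the queries are issued.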


In [2]:
import matplotlib.pyplot as plt

**TODO** Plot MAP scores for the various b values
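
A minimal plotting sketch, assuming `x` holds the swept parameter values and `map_scores` the corresponding MAP scores from the loop above. `plot_sweep` is a hypothetical helper name, not part of the original notebook.

```python
import matplotlib.pyplot as plt


def plot_sweep(values, scores, param_name):
    """Plot MAP against one swept parameter and return the Axes object."""
    fig, ax = plt.subplots()
    ax.plot(values, scores, marker="o")
    ax.set_xlabel(param_name)
    ax.set_ylabel("MAP")
    ax.set_title("MAP for different %s values" % param_name)
    return ax
```

Calling `plot_sweep(x, map_scores, "b")` followed by `plt.show()` would display the curve. The same helper can be reused for the k1 sweep by passing `"k1"` as the parameter name.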

### Perform a sweep on k1

Use the best-performing b value from above and optimize k1.

In [3]:
map_scores = []
x = []
for i in range(11):
    k1 = round(1 + i / 10, 2)  # round after the addition to avoid floating-point drift
    x.append(k1)
    print("Running queries for k1=%f" % k1)
    
    # TODO run retrieval using k1 and a fixed b (best value from before)
    
    # TODO evaluate the ranking
    map_score = 0
    map_scores.append(map_score)    

Running queries for k1=1.000000
Running queries for k1=1.100000
Running queries for k1=1.200000
Running queries for k1=1.300000
Running queries for k1=1.400000
Running queries for k1=1.500000
Running queries for k1=1.600000
Running queries for k1=1.700000
Running queries for k1=1.800000
Running queries for k1=1.900000
Running queries for k1=2.000000


**TODO** Plot MAP scores for the various k1 values