# Assignment 1 - _Foundations of Information Retrieval '20/'21_

This assignment is divided into three parts, which must all be delivered together via Canvas before 04/10/2020 (strictly: no extensions will be granted!). Delivery of the assignment solutions is mandatory.

We will use [ElasticSearch](https://www.elastic.co/) as the search engine, as it provides state-of-the-art tools to implement your own engine and lets you focus on the methodological aspects of search implementation and optimization.

The assignment is about text-based Information Retrieval and it is structured in three parts:
1. IR performance evaluation (implementation of performance metrics)
2. Setting up a search engine, pre-processing and indexing using ElasticSearch (Indexing, Analyzers)
3. Implementation and optimization of a model of search using ElasticSearch (Similarity measures)


This assignment file contains exercises, marked with the section title __Exercise 01.(x)__, which are evaluated, and other sections that contain support code which you should use as it is. Write your answers between the comments `BEGIN ANSWER` and `END ANSWER`.
Try to complete the solutions for all the exercise sections. 

_Note:_ we leave the comment `#THIS IS GRADED!` in the sections that will be considered for evaluation and grading.


### Initial preparation (self-study)
It is good to acquire basic knowledge of Python (or refresh it a bit).
For the second and third part of the assignment, please study the [Getting Started guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html) of ElasticSearch on your own and get acquainted with the framework.


# PART 01 - Performance evaluation


### Background information and reading
To solve the exercises in Part 01, study the slides of Lecture 01 (available on Canvas) and the reference book chapter (Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, [Chapter 8, Evaluation in information retrieval](http://nlp.stanford.edu/IR-book/pdf/08eval.pdf), Cambridge University Press. 2008)

### Basic concepts
Suppose the set of relevant documents (the document identifiers - _doc-IDs_) is called `relevant`; we might then define it as follows (in Python):

In [2]:
relevant = set([2, 3, 5, 8, 13, 21])

A perfect run would retrieve exactly these 6 documents in any order. Now, suppose the list of retrieved documents (the document identifiers - _doc-IDs_) is called `retrieved`, and contains the following _doc-IDs_:

In [3]:
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

One of the simplest evaluation measures we can think of is the _Success at rank 1_. The measure answers the question: Was the first document retrieved a relevant document? _Success at rank 1_ returns 1 if the first document is relevant, and 0 otherwise. A possible implementation is: 

In [19]:
def success_at_1 (relevant, retrieved):
    if len(retrieved) > 0 and retrieved[0] in relevant:
        return 1
    else:
        return 0

success_at_1(relevant, retrieved)


The first retrieved document ID is 4, which is not in the set of relevant documents, so the score is 0.

Note how easy it is to check if an item occurs in a Python set or list by using the keyword `in`. Similarly, you can loop over all items in a set or list with: 
`for doc in retrieved:`, 
where doc will refer to each item in the set or list. 

Be sure to use the internet to sharpen your knowledge about Python constructs, for instance on [Python list slicing](https://duckduckgo.com/?q=python+list+slicing). Also note that the code above checks that at least one document is retrieved, to avoid an index out of bounds exception (i.e. we avoid accessing the first element of an empty list).
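
As a small illustration of the constructs mentioned above (membership tests with `in`, looping, and list slicing), the following sketch reuses the example values of `relevant` and `retrieved`. The names `first_is_relevant`, `top_5` and `hits` are only illustrative and are not used elsewhere in the assignment.

```python
relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

# Membership test: is the first retrieved document relevant?
first_is_relevant = retrieved[0] in relevant

# Slicing: the first five retrieved doc-IDs (fewer if the list is shorter).
top_5 = retrieved[:5]

# Looping: collect the retrieved documents that are relevant, in rank order.
hits = []
for doc in retrieved:
    if doc in relevant:
        hits.append(doc)

print(first_is_relevant, top_5, hits)
```

Note that slicing never raises an exception: `retrieved[:5]` simply returns a shorter list when fewer than five documents were retrieved.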

> __Suggestion: to check the correctness of your implementations of the performance metrics, you can compute their values manually and compare them with the output of your functions. This is important, as you will use these metrics in later exercises and to compare different models.__

## Exercise 01.A: _Success at k_
The measure _Success at k_ returns 1 if a relevant document is among the first _k_ documents retrieved and zero otherwise. Implement _Success at 5_ below.

> Success at _k_ measures are well-suited in cases where there is typically only one relevant document (or retrieving one relevant document is enough).

In [None]:
#THIS IS GRADED!


def success_at_5(relevant, retrieved):
    # BEGIN ANSWER
    for k in retrieved[:5]:
        if k in relevant:
            return 1        
    return 0
    # END ANSWER
    
success_at_5(relevant, retrieved)

Similarly, implement _Success at 10_:

In [None]:
#THIS IS GRADED!

def success_at_10(relevant, retrieved):
    # BEGIN ANSWER   
    for k in retrieved[:10]:
        if k in relevant:
            return 1   
    return 0
    # END ANSWER
    
success_at_10(relevant, retrieved)
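
The two functions above differ only in the cut-off. As a possible generalisation (not required by the exercise), the cut-off can be passed as a parameter. The name `success_at_k` is an assumption made for this sketch; the evaluation code later in the notebook still expects `success_at_5` and `success_at_10`.

```python
def success_at_k(relevant, retrieved, k):
    """Return 1 if any of the first k retrieved documents is relevant, else 0."""
    return 1 if any(doc in relevant for doc in retrieved[:k]) else 0


relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

# Document 4 (rank 1) is not relevant, document 2 (rank 2) is.
print(success_at_k(relevant, retrieved, 1), success_at_k(relevant, retrieved, 5))
```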

## Exercise 01.B: _Precision, Recall and F-measure_
Implement _Precision_ using Formula 8.1 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

_Hint:_ one can count the number of documents in a list by using the built-in Python function [len()](https://docs.python.org/3/library/functions.html#len) (e.g. `len(retrieved)` for the number of retrieved documents). 

In [None]:
#THIS IS GRADED!

def precision(relevant, retrieved):
    # BEGIN ANSWER
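    # Precision is undefined for an empty result list. We score it as 0,
    # so that queries without any retrieved documents are not rewarded.
    if not retrieved:
        return 0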
    
    relevant_and_retrieved = [k for k in retrieved if k in relevant]
    return len(relevant_and_retrieved) / len(retrieved)
    # END ANSWER
    
precision(relevant, retrieved)

Implement _Recall_ using Formula 8.2 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

In [None]:
#THIS IS GRADED!

def recall(relevant, retrieved):
    # BEGIN ANSWER
    if not relevant:
        return 1
    
    relevant_and_retrieved = [k for k in retrieved if k in relevant]
    return len(relevant_and_retrieved) / len(relevant)
    # END ANSWER
    
recall(relevant, retrieved)

The balanced F measure (_F_ with β=1) is defined as the harmonic mean of precision and
recall. Implement _F_ using Formula 8.6 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

> Tip: reuse your implementations of precision and recall

In [None]:
#THIS IS GRADED!

def f_measure(relevant, retrieved):
    # BEGIN ANSWER
    P = precision(relevant, retrieved)
    R = recall(relevant, retrieved)
    # Avoid a division by zero when both precision and recall are 0.
    if P + R == 0:
        return 0
    return 2 * P * R / (P + R)
    # END ANSWER
    
f_measure(relevant, retrieved)
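
Following the suggestion to verify the metrics by hand, the sketch below recomputes precision, recall and the balanced F measure for the example data with exact fractions. It is an independent check under the definitions above, not a replacement for the graded functions; the variable names are illustrative.

```python
from fractions import Fraction

relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

# Documents that are both relevant and retrieved: 2, 3 and 8.
hits = relevant.intersection(retrieved)

p = Fraction(len(hits), len(retrieved))   # 3 of the 11 retrieved documents
r = Fraction(len(hits), len(relevant))    # 3 of the 6 relevant documents
f1 = 2 * p * r / (p + r)                  # harmonic mean of p and r

print(p, r, f1, float(f1))
```

The exact values are 3/11, 1/2 and 6/17 (about 0.353), which your functions should reproduce for this example.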

## Exercise 01.C: _Precision at rank k_ and  _R-Precision_

Precision, Recall and F are _set_-based measures and suited for unranked lists of documents. If our search system returns a ranked _list_ of results, we can measure precision for several cut-off levels _k_ in the ranked list, i.e. we evaluate the relevance of the TOP-_k_ retrieved documents (see lecture slides and the related book chapter). 
We did this before with the _Success at rank 5_ measure for _k_=5.

Implement below the function `precision_at_k()` that measures the precision at rank _k_.

> Interesting fact: For _k_=1, the _Precision at rank 1_ would be the same as _Success at rank 1_ (why?) - Because only one document is considered, it is either relevant (1 out of 1) or not (0 out of 1), so the precision must be either 1 or 0.

In [None]:
#THIS IS GRADED!

def precision_at_k(relevant, retrieved, k):
    # BEGIN ANSWER
    return precision(relevant, retrieved[:k])
    # END ANSWER
    
precision_at_k(relevant, retrieved, k=1)

Implement R-Precision (function `r_precision()`) as defined on Page 161 of [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

In [None]:
#THIS IS GRADED!

def r_precision(relevant, retrieved):
    # BEGIN ANSWER
    k = len(relevant)
    return precision(relevant, retrieved[:k])
    # END ANSWER
    
r_precision(relevant, retrieved)

## Exercise 01.D:  Interpolated precision at _recall_ X

Another way to address ranked retrieval is to measure precision for several _recall_ levels _X_.

Implement the function `interpolated_precision_at_recall_X()` that measures the interpolated precision at recall level _X_ as defined by Formula 8.7 of [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

> Tip: calculate for each rank the recall. If the recall is greater than or equal to X, 
> calculate the precision. Keep the highest (maximum) precision of those to be returned at the end.

In [None]:
#THIS IS GRADED!

def interpolated_precision_at_recall_X (relevant, retrieved, X):
    # BEGIN ANSWER
    # If the recall of the full ranking never reaches X, no rank qualifies: return 0.
    if recall(relevant, retrieved) < X:
        return 0
    
    P = 0    
    # Loop through each rank k = 1 .. len(retrieved); retrieved[:k] is the top-k list.
    for k in range(1, len(retrieved) + 1):
        if recall(relevant, retrieved[:k]) >= X:
            P = max(P, precision_at_k(relevant, retrieved, k))
    
    return P
    # END ANSWER
    
interpolated_precision_at_recall_X(relevant, retrieved, X=0.1) 
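
To check the result by hand, it can help to list the precision and recall at every rank first. The sketch below does this in a single pass and then takes the maximum precision over the ranks whose recall is at least _X_. The helper names `precision_recall_points` and `interpolated_precision` are assumptions made for this illustration, and the code assumes that `relevant` is not empty.

```python
def precision_recall_points(relevant, retrieved):
    """Return a list of (rank, precision, recall) tuples, one per rank."""
    points = []
    hits = 0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
        points.append((rank, hits / rank, hits / len(relevant)))
    return points


def interpolated_precision(points, X):
    """Highest precision among the ranks whose recall is at least X (0 if none)."""
    return max((p for _, p, r in points if r >= X), default=0)


relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]
points = precision_recall_points(relevant, retrieved)

for X in (0.1, 0.3, 0.5, 0.6):
    print(X, interpolated_precision(points, X))
```

For the example data the relevant documents appear at ranks 2, 5 and 11, so the interpolated precision is 1/2 at _X_=0.1, 2/5 at _X_=0.3, 3/11 at _X_=0.5, and 0 at _X_=0.6 (a recall of 0.6 is never reached).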

## Exercise 01.E:  _Average Precision_

For a single information need, _Average Precision_ is the average of the precision values obtained for the set of top _k_ documents existing after each relevant document is retrieved (see [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book), Pages 159 and 160). Implement _Average Precision_ for a single information need. 

In [None]:
#THIS IS GRADED!

def average_precision(relevant, retrieved):
    # BEGIN ANSWER
    # With no relevant documents there is nothing to average over.
    if not relevant:
        return 0

    # Initialise list of precisions with a zero for each relevant document.
    # A relevant document that is not retrieved keeps the value zero.
    P = [0] * len(relevant)
    
    for i, doc in enumerate(relevant):
        if doc in retrieved:
            # Rank (1-based) at which doc is retrieved; index() is 0-based.
            k = retrieved.index(doc) + 1
            # Precision for the top k documents, including doc itself.
            P[i] = precision_at_k(relevant, retrieved, k)
    
    # Return the average precision
    return sum(P)/len(P)
    # END ANSWER

average_precision(relevant, retrieved)
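
As a cross-check, Average Precision can also be computed in a single pass over the ranking, adding the precision at each rank where a relevant document appears. The sketch below uses exact fractions so that the result can be compared with a hand calculation. The function name `average_precision_single_pass` is hypothetical, and the code assumes that `retrieved` contains no duplicate doc-IDs.

```python
from fractions import Fraction


def average_precision_single_pass(relevant, retrieved):
    """Average of the precision at each rank where a relevant document occurs.

    Relevant documents that are never retrieved contribute a precision of 0.
    """
    if not relevant:
        return Fraction(0)
    hits = 0
    total = Fraction(0)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            total += Fraction(hits, rank)
    return total / len(relevant)


relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

ap = average_precision_single_pass(relevant, retrieved)
print(ap, float(ap))
```

For the example data this is (1/2 + 2/5 + 3/11) / 6 = 43/220, about 0.195.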

## Measures in TREC 

The relevance judgments are provided by TREC in so-called _"qrels"_ files that look as follows:

    1000 Q0 1341 1
    1000 Q0 1231 0
    1001 Q0 12332 1
     ...

The first column is the query identifier, while the second column is the query number within that topic (it is currently unused and should always be Q0). The third column is the document identifier that was examined by the judges. The fourth column is the relevance of the document (_1_ means the document was relevant and _0_ means the document was not relevant).

Below we provide some Python code that reads the _qrels_ and the _run_. The qrels will be put in the Python dictionary `all_relevant`. A [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) provides quick lookup of a value given a key. We will use the `query_id` as key, and as value a [set](https://docs.python.org/3/tutorial/datastructures.html#sets) of relevant document identifiers. For the partial qrels file above, `all_relevant` would look as follows (document `1231` was judged not relevant, so it is not in the set):

    {
        "1000": set(["1341"]),
        "1001": set(["12332"])
    }
    
We will use a dictionary called `all_retrieved` with `query_id` as key, and as value a [Python list](https://docs.python.org/3/tutorial/introduction.html#lists) of document identifiers retrieved by the IR system:

    {
        "1000": ["1341", "12346", "2345"],
        "1001": [..., ..., ...],
        ...
    }

Note that, with this data structure, for each `query_id` we can easily access the list of retrieved and relevant documents, and compute the performance metrics. We can then average these measures over all the queries to compute the mean performance of the IR system on the given retrieval task.
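
To make the two data structures concrete, here is a minimal sketch that builds them for the partial qrels shown above and then looks up the relevant and retrieved documents per query. The names `toy_relevant` and `toy_retrieved` are chosen for this illustration only, and the retrieved list for query `1000` is copied from the example above.

```python
qrels_text = """\
1000 Q0 1341 1
1000 Q0 1231 0
1001 Q0 12332 1
"""

# query_id -> set of relevant doc-IDs (only judgments with relevance "1" are kept)
toy_relevant = {}
for line in qrels_text.splitlines():
    qid, _q0, doc_id, rel = line.split()
    toy_relevant.setdefault(qid, set())
    if rel == "1":
        toy_relevant[qid].add(doc_id)

# query_id -> ranked list of retrieved doc-IDs
toy_retrieved = {"1000": ["1341", "12346", "2345"]}

# Number of relevant documents retrieved for each judged query.
relevant_retrieved = {}
for qid, relevant_docs in toy_relevant.items():
    ranked = toy_retrieved.get(qid, [])   # a query may have no results at all
    relevant_retrieved[qid] = sum(1 for doc in ranked if doc in relevant_docs)

print(toy_relevant, relevant_retrieved)
```

Note that all identifiers are kept as strings, so `"1341"` and `1341` would be treated as different keys.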

Please examine the code below, and make sure you understand every line.

In [None]:
def read_qrels_file(qrels_file):  # reads the content of the qrels file
    trec_relevant = dict()  # query_id -> set([docid1, docid2, ...])
    with open(qrels_file, 'r') as qrels:
        for line in qrels:
            (qid, q0, doc_id, rel) = line.strip().split()
            if qid not in trec_relevant:
                trec_relevant[qid] = set()
            if (rel == "1"):
                trec_relevant[qid].add(doc_id)
    return trec_relevant

def read_run_file(run_file):  
    # read the content of the run file produced by our IR system 
    # (in the following exercises you will create your own run_files)
    trec_retrieved = dict()  # query_id -> [docid1, docid2, ...]
    with open(run_file, 'r') as run:
        for line in run:
            (qid, q0, doc_id, rank, score, tag) = line.strip().split()
            if qid not in trec_retrieved:
                trec_retrieved[qid] = []
            trec_retrieved[qid].append(doc_id) 
    return trec_retrieved
    

def read_eval_files(qrels_file, run_file):
    return read_qrels_file(qrels_file), read_run_file(run_file)

(all_relevant, all_retrieved) = read_eval_files('data/training-qrels.txt', 'data/baselineTREC.run')

### Exercise 01.F: _number of queries_ and _number of retrieved documents per query_
 
Write the Python code that counts the number of queries in the file `baselineTREC.run` and prints the value (use the result from the cell above). 

_Hint:_ print the structure and content of the `all_retrieved` and `all_relevant` data structures to understand them better.

In [None]:
#THIS IS GRADED!

# BEGIN ANSWER

# baselineTREC.run is read into the dict all_retrieved. By finding the length, we get the number of keys (queries)
num_queries = len(all_retrieved)
print(num_queries)

# END ANSWER

Write the code that counts, for each query in your baseline run, the number of documents that were retrieved for that query (use `print()` to print the result for each `query_id`).

In [None]:
#THIS IS GRADED!

# BEGIN ANSWER
for query in all_retrieved:
    num_documents = len(all_retrieved[query])
    print("Query: ", query, "  # Documents Retrieved: ", num_documents)
# END ANSWER

## Exercise 01.G: _mean average precision_
Using the `average_precision()` function you implemented above, write the code to compute the _Mean Average Precision_ for the `baselineTREC.run` results. 

In [None]:
#THIS IS GRADED!

def mean_average_precision(all_relevant, all_retrieved):
    # BEGIN ANSWER
    
    count = len(all_retrieved)
        
    precision_per_query = [average_precision(all_relevant[query], all_retrieved[query])  for query in all_retrieved]
    total = sum(precision_per_query)
    
    # END ANSWER
    return "mean AP: ", total / count

mean_average_precision(all_relevant, all_retrieved)

## TREC evaluation

Below you find a function that takes `all_relevant` and `all_retrieved` and computes the mean value of a measure over all queries. The first argument of `mean_metric()`, `measure`, is a special argument: it is a function too! The `mean_metric()` function sums the scores of that measure over all queries and divides the total by the number of queries.

_This part will be reused later to compare the results of different models._
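
If passing a function as an argument is new to you, the following toy sketch shows the idea in isolation. The names `mean_over_queries` and `first_hit` and the toy data are assumptions made for this illustration; the real `mean_metric()` follows in the next cell.

```python
def mean_over_queries(measure, relevant_by_query, retrieved_by_query):
    """Apply measure(relevant, retrieved) to every query and average the scores."""
    scores = [
        measure(relevant, retrieved_by_query.get(qid, []))
        for qid, relevant in relevant_by_query.items()
    ]
    return sum(scores) / len(scores)


def first_hit(relevant, retrieved):
    """A measure with the expected signature: 1 if the top result is relevant."""
    return 1 if retrieved and retrieved[0] in relevant else 0


toy_relevant = {"q1": {"a"}, "q2": {"b", "c"}}
toy_retrieved = {"q1": ["a", "x"], "q2": ["x", "b"]}

# The function itself is passed (no parentheses), and called inside the helper.
print(first_hit.__name__, mean_over_queries(first_hit, toy_relevant, toy_retrieved))
```

Any function that accepts `(relevant, retrieved)` and returns a number can be plugged in this way, which is why the metrics above all share that signature.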

In [None]:
def mean_metric(measure, all_relevant, all_retrieved):
    total = 0
    count = 0
    for qid in all_relevant:
        relevant  = all_relevant[qid]
        retrieved = all_retrieved.get(qid, [])
        value = measure(relevant, retrieved)
        total += value
        count += 1
    return "mean " + measure.__name__, total / count

# Example of use of the mean_metric function: computing the average r_precision
mean_metric(r_precision, all_relevant, all_retrieved)

### TREC overview of the results
The following two functions use your implementation of the metrics to create an evaluation overview of the TREC benchmark data. Take a look at the numbers and make your own interpretations of the results. 

In [None]:
def trec_eval(qrels_file, run_file):

    def precision_at_1(rel, ret): return precision_at_k(rel, ret, k=1)
    def precision_at_5(rel, ret): return precision_at_k(rel, ret, k=5)
    def precision_at_10(rel, ret): return precision_at_k(rel, ret, k=10)
    def precision_at_50(rel, ret): return precision_at_k(rel, ret, k=50)
    def precision_at_100(rel, ret): return precision_at_k(rel, ret, k=100)
    def precision_at_recall_00(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.0)
    def precision_at_recall_01(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.1)
    def precision_at_recall_02(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.2)
    def precision_at_recall_03(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.3)
    def precision_at_recall_04(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.4)
    def precision_at_recall_05(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.5)
    def precision_at_recall_06(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.6)
    def precision_at_recall_07(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.7)
    def precision_at_recall_08(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.8)
    def precision_at_recall_09(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.9)
    def precision_at_recall_10(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=1.0)

    (all_relevant, all_retrieved) = read_eval_files(qrels_file, run_file)
    
    unknown_qids = set(all_retrieved.keys()).difference(all_relevant.keys())
    if len(unknown_qids) > 0:
        raise ValueError("Unknown qids in run: {}".format(sorted(list(unknown_qids))))

    metrics = [success_at_1,
               success_at_5,
               success_at_10,
               r_precision,
               precision_at_1,
               precision_at_5,
               precision_at_10,
               precision_at_50,
               precision_at_100,
               precision_at_recall_00,
               precision_at_recall_01,
               precision_at_recall_02,
               precision_at_recall_03,
               precision_at_recall_04,
               precision_at_recall_05,
               precision_at_recall_06,
               precision_at_recall_07,
               precision_at_recall_08,
               precision_at_recall_09,
               precision_at_recall_10,
               average_precision]

    return [mean_metric(metric, all_relevant, all_retrieved) for metric in metrics]


def print_trec_eval(qrels_file, run_file):
    results = trec_eval(qrels_file, run_file)
    print("Results for {}".format(run_file))
    for (metric, score) in results:
        print("{:<30} {:.4}".format(metric, score))

print_trec_eval('data/training-qrels.txt', 'data/baselineTREC.run')

## Exercise 01.H: _significance testing_

Testing the statistical significance of differences in the results of different IR systems is important (see slides and course book - Section 8.8). One of the basic tests one can perform is the two-tailed [sign test](https://en.wikipedia.org/wiki/Sign_test).


For this exercise, we use the run files obtained by [Hiemstra and Aly](https://djoerdhiemstra.com/wp-content/uploads/trec2014mirex-draft.pdf) for TREC 2014. The `utbase.run` file was generated using Language Modeling, while `utexact.run` was generated using an IR system based on matching the exact query string, and ranking the documents by the number of exact matches found. The exact run improves the _Precision at 5_ to 0.456 (compared to 0.440 for the baseline run).  

Implement the code to perform the _sign test_ of statistical significance.
_Hint:_ for each sign, compute the number of queries that increase/decrease performance (called `better, worse` in the code below). How would you use these values to compute the _p_ value of the two-tailed sign test? Is the difference between _utbase_ and _utexact_ significant?
    
Answer: Conduct a binomial test where `better` is the number of successes, `worse` is the number of failures, and the null hypothesis assumes a binomial distribution with p = 0.5. 
Since the second method performs better on 9 queries and worse on 9 queries, we get a p-value of 1.0 and fail to reject the null hypothesis, i.e. the difference between _utbase_ and _utexact_ is not statistically significant.
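
The _p_ value of the two-tailed sign test can also be computed without external libraries. The sketch below sums the binomial probabilities of the smaller tail under p = 0.5 and doubles the result (capped at 1). The function name `sign_test_p_value` is an assumption for this illustration, ties are assumed to have been dropped already, and `math.comb` requires Python 3.8 or newer.

```python
from math import comb


def sign_test_p_value(better, worse):
    """Two-tailed sign test p-value for the given numbers of wins and losses."""
    n = better + worse
    if n == 0:
        return 1.0  # no differences at all: nothing to reject
    k = min(better, worse)
    # Probability of observing at most k successes in n fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(sign_test_p_value(9, 9))    # equal wins and losses
print(sign_test_p_value(10, 0))   # ten wins, no losses
```

With 9 wins and 9 losses the _p_ value is 1.0, matching the answer above; with 10 wins and 0 losses it would be 2/1024, roughly 0.002.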

In [None]:
#THIS IS GRADED!

def sign_test_values(measure, qrels_file, run_file_1, run_file_2):
    all_relevant = read_qrels_file(qrels_file)
    all_retrieved_1 = read_run_file(run_file_1)
    all_retrieved_2 = read_run_file(run_file_2)
    better = 0
    worse  = 0
    # BEGIN ANSWER
    
    for query in all_retrieved_1:
        performance_1 = measure(all_relevant[query], all_retrieved_1[query])
        performance_2 = measure(all_relevant[query], all_retrieved_2[query])
        
        if performance_2 > performance_1:
            better += 1
        # Exclude queries with no performance difference between the two methods.
        elif performance_2 < performance_1:
            worse += 1
    
    # END ANSWER
    return(better, worse)
    
def precision_at_rank_5(rel, ret):
    return precision_at_k(rel, ret, k=5)

sign_test_values(precision_at_rank_5, 'data/trec.qrels', 'data/utbase.run', 'data/utexact.run')

# from scipy.stats import binom_test
# w = sign_test_values(precision_at_rank_5, 'data/trec.qrels', 'data/utbase.run', 'data/utexact.run')
# binom_test(w) # Accept the default arguments for the function
# Returns a p-value of 1 > 0.05, so we fail to reject the null hypothesis, i.e. there is no statistically significant difference in performance between the two methods.


# PART 02 - Indexing and querying with ElasticSearch

### Preparation: Getting started with Elasticsearch

We strongly advise you to go through the Elasticsearch [reference guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html) and work through the tutorials. The following parts of the assignment will be based on ElasticSearch.

You can skip the section on [Installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html), as we provide it already installed in the Virtual Machine.

> If you want (disclaimer: we do __not__ give help with this!), you can 
> follow the [Installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) to run Elasticsearch on your laptop without VM. But beware, your system will now be different from the 
> ones of your colleagues and they might not be able to help you if 
> you have problems that are specific to your system, your operating
> system, or your Elasticsearch version.

### Starting/Stopping ElasticSearch
To start ElasticSearch on the virtual machine, you can type `sudo service elasticsearch start` in a Terminal.
To stop the ElasticSearch server, instead, you can type `sudo service elasticsearch stop`. Refer to [the official guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html#deb-running-init) for more information.

### The REST API

Elasticsearch runs its own server that can be accessed by a regular web browser as the client, for instance by opening this link in your browser: http://localhost:9200. 

Elasticsearch will respond with something like:

    {
        "name" : "fir-machine",
        "cluster_name" : "elasticsearch",
        "cluster_uuid" : "w7SBVo1ESVivMApbLIqRvA",
        "version" : {
            "number" : "7.9.0",
            "build_flavor" : "default",
            "build_type" : "deb",
            "build_hash" : "a479a2a7fce0389512d6a9361301708b92dff667",
            "build_date" : "2020-08-11T21:36:48.204330Z",
            "build_snapshot" : false,
            "lucene_version" : "8.6.0",
            "minimum_wire_compatibility_version" : "6.8.0",
            "minimum_index_compatibility_version" : "6.0.0-beta1"
        },
        "tagline" : "You Know, for Search"
    }


If you see this, then your Elasticsearch node is up and running. The RESTful API uses simple text or JSON over HTTP. 

> REST, API, JSON, HTTP, that's a lot of abbreviations! It is good to
> be familiar with the terminology. Let us explain: The Elasticsearch
> response is not (only) intended for humans. It is supposed to be used 
> by applications that run on the client machines, and therefore the
> interface is called an Application Programming Interface (API). The 
> API uses a format called JSON (JavaScript Object Notation), which 
> can be easily read by machines (and humans). The API sends its JSON
> response using the same method as your web browser displays web
> pages. This method is called HTTP (Hyper Text Transfer Protocol), 
> and it is the reason you can inspect the response in a normal web
> browser. APIs that use HTTP are called RESTful interfaces. REST 
> stands for REpresentational State Transfer, arguably one of the
> simplest ways to define an API.
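
As a minimal sketch of what "JSON over HTTP" means in practice, the code below defines a small helper that sends an HTTP GET request and decodes the JSON body, using only the Python standard library. The helper name `fetch_json` is an assumption for this illustration. So that the example runs without a server, it parses a shortened copy of the response shown above instead of calling the node.

```python
import json
import urllib.request


def fetch_json(url):
    """Send an HTTP GET request to url and decode the JSON response body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Shortened copy of the response shown above, used so the example runs offline.
sample_response = (
    '{"name": "fir-machine", "cluster_name": "elasticsearch", '
    '"version": {"number": "7.9.0", "lucene_version": "8.6.0"}, '
    '"tagline": "You Know, for Search"}'
)

# JSON objects become Python dictionaries, so fields are accessed by key.
info = json.loads(sample_response)
print(info["version"]["number"], "-", info["tagline"])

# With a running Elasticsearch node one could instead call:
# info = fetch_json("http://localhost:9200")
```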


### Kibana, cURL, and more cURL 

You can interact with your Elasticsearch service in different ways. In this first assignment we will describe three ways. Later during the practical work we will use the Python Elasticsearch client.

1. Using the Kibana Console
2. Using cURL
3. Using cURL from a Jupyter notebook (not recommended)

#### Kibana
Kibana provides a web interface to interact with your Elasticsearch service. It's available from http://localhost:5601. You can use Kibana to create interactive dashboards visualizing data in your Elasticsearch indices. It also provides a console to execute Elasticsearch commands. It's available from http://localhost:5601/app/kibana#/dev_tools

To start Kibana on the virtual machine, you can type `sudo service kibana start` in a Terminal.
To stop the Kibana server, instead, you can type `sudo service kibana stop`.

Many examples from the Elasticsearch user guide can be directly executed in Kibana by clicking the `VIEW IN CONSOLE` button.

#### cURL
[cURL](https://en.wikipedia.org/wiki/CURL) is a software tool that enables you to execute HTTP requests from the command line. The name originally stood for "see URL". 

Curl is already installed in the VM operating system. Let's open a bash terminal.
You can exit the shell by executing `exit`.
You can execute curl commands on this prompt, for instance retrieving the Elasticsearch state.
Note you have to use `localhost` as the hostname:
```
labs@fir-machine:~$ curl localhost:9200
{
  "name" : "epRATWu",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "KsOTBsyeTmy6fJCcZ64d_A",
  "version" : {
    "number" : "6.2.4",
    "build_hash" : "ccec39f",
    "build_date" : "2018-04-12T20:37:28.497551Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
```

#### cURL from this notebook

Alternatively, Jupyter notebooks allow you to execute cURL commands (or other shell commands) directly, by starting a line of code with an exclamation mark (for example, `!curl localhost:9200`). Please be warned: when executing commands which result in long output (for instance when indexing a large number of documents), stick to the terminal to execute curl commands. Jupyter might freeze when handling long output from the shell.

## Assignment Part 02 (Let's go!)

_You can work on this part after Lecture 01 already, if you want_

For the following exercises, you will use a TREC genomics document collection and queries. 
It is stored in the folder `data/` in the directory where you have been instructed to place the assignment notebooks (`/`).

The collection contains:

* `trec-medline.json` (the collection in Elasticsearch batch format - because of its size it cannot be indexed with a single curl command!)
* `training-queries-simple.txt` (test queries)
* `training-qrels.txt` (the "relevance judgements" for the test queries, i.e. the correct answers)
* `test-queries-simple.txt`
* `example_matches20.txt` (20 example matches)

To make things easy, the data is already provided in Elasticsearch's batch processing format. 
Inspect the collection file in the terminal:

`head trec-medline.json`

This shows the first 5 documents in the collection (in JSON format prepared for ElasticSearch, as you have seen in the tutorial)
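
In the batch format each document takes two lines: an action line followed by the document itself, which is why ten lines correspond to five documents. The sketch below reads such pairs from an in-memory sample. The action line `{"index": {}}` and the two sample documents are made up for this illustration and are assumptions, not the actual content of `trec-medline.json`; the helper name `iter_documents` is hypothetical as well.

```python
import io
import json

# Made-up sample in the two-lines-per-document layout (not real collection data).
sample = io.StringIO(
    '{"index": {}}\n'
    '{"PMID": "1", "TI": "A made-up title"}\n'
    '{"index": {}}\n'
    '{"PMID": "2", "TI": "Another made-up title"}\n'
)


def iter_documents(lines):
    """Yield the document on every second line, skipping the action lines."""
    it = iter(lines)
    for _action_line in it:
        doc_line = next(it, None)
        if doc_line is None:      # odd number of lines: no document follows
            break
        yield json.loads(doc_line)


docs = list(iter_documents(sample))
print(len(docs), [doc["PMID"] for doc in docs])
```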

## Exercise 02.A: _indexing_ and _first queries_

Execute the following cell to index the collection in an Elasticsearch index called `genomics`. This code uses the Elasticsearch Python API, which we will discuss later (you can read about it yourself in the meantime).

_Note:_ indexing the TREC genomics collection will take some time.

In [1]:
import elasticsearch
import elasticsearch.helpers
import json

def documents():
    """ generates the documents to be indexed as dictionaries """
    with open('data/trec-medline.json') as inp:
        while True:
            try:
                line = next(inp)  # ignore odd line nrs
                if line is None:
                    break
                try:
                    docline = next(inp)
                    doc = json.loads(docline)
                    yield doc
                except json.JSONDecodeError as e:
                    # should not occur (but ignore it anyway)
                    pass
            except StopIteration as e:
                break

In [7]:
es = elasticsearch.Elasticsearch('localhost')

# remove if it already exists
es.indices.delete(index="genomics", ignore=[400, 404])

# and bulk index it
print("Indexing documents, this will take some time...")
_ = elasticsearch.helpers.bulk(
        es, 
        documents(),
        index="genomics",
        chunk_size=2000,
        request_timeout=30
    )
print("Done")

Indexing documents, this will take some time...
Done


Query the index called `genomics` and determine how many items are indexed. 

In [8]:
#THIS IS GRADED!

# write the code here
# BEGIN ANSWER
ind = 'genomics'
res = es.count(index=ind).get('count')

print("There are", res, "items in the index", ind)

# END ANSWER

There are 525937 items in the index genomics


Using the command line (or the Kibana console), search for all documents that contain the word `blood`. 

How many documents containing the term `blood` are there in your index? (searching all fields of the documents).

In [9]:
#THIS IS GRADED!

# write the code that generates the answer here (you may also use Kibana)
# BEGIN ANSWER
# Using the Kibana console (the request is commented out here, because it is not Python code):
# POST genomics/_count
# {
#  "query": {
#    "query_string": {
#      "query": "blood"
#    }
#  }
# }
# Returns:
# {
#   "count" : 68275,
#   "_shards" : {
#     "total" : 1,
#     "successful" : 1,
#     "skipped" : 0,
#     "failed" : 0
#   }
# }
# END ANSWER

## Exercise 02.B: the Python ElasticSearch library

#### Preparation
The command line is fine for doing basic operations on your Elasticsearch indices, but as soon as things get more complex, you are better off using custom client programs.
We will use the [Elasticsearch client library for Python](https://elasticsearch-py.readthedocs.io). This library will execute the HTTP requests that you have used before (with CURL or Kibana). The library is pre-installed on the VM.

#### Exercise
Write the code that searches the index for _"blood"_ using the [search()](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search) function. Your code will take at minimum the following steps:

1. import the python library `elasticsearch`.
2. open a connection with the Elasticsearch host `'localhost'` with `Elasticsearch()`.
3. execute a search with `search()` using the index `genomics`, and a correct query body.
4. print the JSON output of Elasticsearch 

How many hits are there in your index? Is the result the same as in Exercise 01.?

> Elasticsearch runs on localhost on your laptop, at port 9200 (i.e. at http://localhost:9200)


In [28]:
#THIS IS GRADED!

import elasticsearch

# your code below

# Open a connection with the Elasticsearch host. 'es' already exists, so call this one es2.
es2 = elasticsearch.Elasticsearch('localhost')

# The same query body is used for all three requests below.
query_body = {
 "query": {
   "query_string": {
     "query": "blood"
   }
 }
}

# count() returns the exact number of matching documents.
cnt = es2.count(index='genomics', body=query_body)

# search() returns only the first 10 results by default, and the total it
# reports in hits.total is capped at 10,000 (relation 'gte').
res_10 = es2.search(index='genomics', body=query_body)

# size=10000 retrieves the largest result window allowed by default, so that we can inspect the documents.
res_10k = es2.search(index='genomics', size=10000, body=query_body)

print("Total count of matching documents:", cnt['count']) # 68,275 hits = same as the Kibana count above
print("\nSearch results (First 10 results. The total hits shown is capped at 10,000):\n", res_10)


Total count of matching documents: 68275

Search results (First 10 results. The total hits shown is capped at 10,000):
 {'took': 102, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 10000, 'relation': 'gte'}, 'max_score': 15.369452, 'hits': [{'_index': 'genomics', '_type': '_doc', '_id': 'gVAMtXQBu6mw5_RMO2jn', '_score': 15.369452, '_source': {'CON': 'J Pharmacokinet Pharmacodyn. 2001 Apr;28(2):155-69. PMID: 11381568', 'CY': 'England', 'DA': '20020826', 'DCOM': '20030312', 'DP': '2002 Feb', 'EDAT': '2002/08/27 10:00', 'IP': '1', 'IS': '1567-567X', 'JID': '101096520', 'LA': 'eng', 'LR': '20030313', 'MHDA': '2003/03/13 04:00', 'PG': '95-7; author reply 99', 'PMID': '12194538', 'PST': 'ppublish', 'PT': 'Letter', 'SB': 'IM', 'SO': 'J Pharmacokinet Pharmacodyn 2002 Feb;29(1):95-7; author reply 99.', 'TA': 'J Pharmacokinet Pharmacodyn', 'TI': 'Sample size calculation in bioequivalence trials.', 'UI': '22182241', 'VI': '29',

The Python client library returns Python objects built from [dictionaries](https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries) and [lists](https://docs.python.org/3.6/tutorial/introduction.html#lists).
Use a [for loop](https://docs.python.org/3.6/tutorial/controlflow.html#for-statements) to inspect each hit, and print the retrieved documents' titles one by one.
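
As a minimal sketch of the structure involved, the snippet below walks a hand-written dictionary shaped like a search response. The variable `sample_response`, its two titles, and the helper `titles()` are made up for illustration; only the key names (`hits`, `total`, `relation`, `_source`, `TI`) are taken from the output shown above.

```python
# Hypothetical, hand-written stand-in for the dictionary returned by search().
sample_response = {
    "hits": {
        "total": {"value": 10000, "relation": "gte"},
        "hits": [
            {"_score": 2.0, "_source": {"PMID": "1", "TI": "First example title."}},
            {"_score": 1.5, "_source": {"PMID": "2", "TI": "Second example title."}},
        ],
    }
}


def titles(response):
    """Return the title (field TI) of every hit in a search response."""
    return [hit["_source"]["TI"] for hit in response["hits"]["hits"]]


total = sample_response["hits"]["total"]
# The relation "gte" means the reported value is a lower bound, not an exact count.
is_lower_bound = total["relation"] == "gte"

for rank, title in enumerate(titles(sample_response)):
    print(rank, ":", title)
```

The outer `hits` dictionary holds the metadata (such as the total), while the inner `hits` list holds one dictionary per retrieved document.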

In [11]:
#THIS IS GRADED!

# your code below
# Get the list of 10,000 hits.
hits = res_10k['hits']['hits']

print("Number of query results returned =", len(hits), "\n\nTitles:")
for i,hit in enumerate(hits):
    print(i, ":", hit['_source']['TI'])

Number of query results returned = 10000 

Titles:
0 : Thrombin functions during tissue factor-induced blood coagulation.
1 : Short deletion within the blood group Dombrock locus causing a Do(null) phenotype.
2 : Working Group on Blood Pressure Monitoring of the European Society of Hypertension International Protocol for validation of blood pressure measuring devices in adults.
3 : Clotting in whole blood: analysis of a biochemical reaction network.
4 : DNB: a partial D with anti-D frequent in Central Europe.
5 : Intrinsic pathway of blood coagulation contributes to thrombogenicity of atherosclerotic plaque.
6 : Persistence of HTLV-I in blood components after leukocyte depletion.
7 : Transplantation of mobilized peripheral blood cells to HLA-identical siblings with standard-risk leukemia.
8 : Absence of CD47 in protein 4.2-deficient hereditary spherocytosis in man: an interaction between the Rh complex and the band 3 complex.
9 : State of the market for devices for blood pressure measu

1239 : Imatinib induces hematologic and cytogenetic responses in patients with chronic myelogenous leukemia in myeloid blast crisis: results of a phase II study.
1240 : Thrombogenicity of beta 2-glycoprotein I-dependent antiphospholipid antibodies in a photochemically induced thrombosis model in the hamster.
1241 : Congenital afibrinogenemia: first identification of splicing mutations in the fibrinogen Bbeta-chain gene causing activation of cryptic splice sites.
1242 : Heterologous cells cooperate to augment stem cell migration, homing, and engraftment.
1243 : Interleukin-1 blockade does not prevent acute graft-versus-host disease: results of a randomized, double-blind, placebo-controlled trial of interleukin-1 receptor antagonist in allogeneic bone marrow transplantation.
1244 : Antimyeloma efficacy of thalidomide in the SCID-hu model.
1245 : [Selected blood coagulation problems in newborn infants]
1246 : Volume of RBCs, 24
1247 : [CD62p expression in platelet during the preparation c

2406 : Dietary fibre in treatment of diabetes: myth or reality?
2407 : Obesity, smoking, and multiple cardiovascular risk factors in young adult African Americans.
2408 : [Markers of oxidative damage in blood of children with cystic fibrosis]
2409 : Assessing the accuracy of three viral risk models in predicting the outcome of implementing HIV and HCV NAT donor screening in Australia and the implications for future HBV NAT.
2410 : The blood platelet as a model for regulating blood coagulation on cell surfaces and its consequences.
2411 : Suicide and the media. Part I: Reportage in nonfictional media.
2412 : Suicide and the media. Part II: Portrayal in fictional media.
2413 : Suicide and the media. Part III: Theoretical issues.
2414 : Blood extraction from lancet wounds using vacuum combined with skin stretching.
2415 : Safer haemotherapy: the responsibilities of government, transfusion service, blood donors, and physician-users.
2416 : Serum cholesterol affects blood pressure regulatio

4072 : Effect of IVIgG treatment on fetal platelet count, HPA-1a titre and clinical outcome in a case of feto-maternal alloimmune thrombocytopenia.
4073 : Effects of peroxidase on hyperlipidemia in mice.
4074 : Effect of buyang huanwu decoction on platelet activating factor content in arterial blood pre
4075 : Evaluating the relationship between arterial blood pressure changes and indices of pulse oximetric plethysmography.
4076 : Multiple sclerosis: low-frequency temporal blood oxygen level-dependent fluctuations indicate reduced functional connectivity initial results.
4077 : Assessment of sirolimus concentrations in whole blood by high-performance liquid chromatography with ultraviolet detection.
4078 : Changes in middle cerebral artery blood flow after carotid endarterectomy as monitored by transcranial Doppler.
4079 : Clinical significance of blood brain natriuretic peptide level measurement in the detection of heart disease in untreated outpatients: comparison of electrocardiogra

5072 : [Changes in synthesis of nitric oxide, blood levels of ACTH and cortisol in viral hepatitis B]
5073 : Allele-specific replication associated with aneuploidy in blood cells of patients with hematologic malignancies.
5074 : [Blood donation]
5075 : [Vitamin A: blood level and dietetics intake in stunted children and adolescents without hormonal disease]
5076 : Long-term effects of oral estradiol and dydrogesterone on carbohydrate metabolism in postmenopausal women.
5077 : In vivo PIV measurement of red blood cell velocity field in microvessels considering mesentery motion.
5078 : [Informative value of immunologic analysis of blood and ejaculate in diagnosing chronic prostatitis]
5079 : Insulinoma in a patient with tuberous sclerosis: is there an association?
5080 : In vitro studies of the influence polyester materials with a different degree of surface wettability have on blood haematological parameters and coagulation and fibrinolysis system parameters.
5081 : A comparative study 

6238 : "Sausage-string" appearance of arteries and arterioles can be caused by an instability of the blood vessel wall.
6239 : Inhibitory mechanism of costunolide, a sesquiterpene lactone isolated from Laurus nobilis, on blood-ethanol elevation in rats: involvement of inhibition of gastric emptying and increase in gastric juice secretion.
6240 : [Blood pressure determination]
6241 : Intrapituitary adenoviral administration of 7B2 can extend life span and reverse endocrinological deficiencies in 7B2 null mice.
6242 : Oat consumption does not affect resting casual and ambulatory 24-h arterial blood pressure in men with high-normal blood pressure to stage I hypertension.
6243 : Clonal T cell receptor gamma-chain gene rearrangement by PCR-based GeneScan analysis in the skin and blood of patients with parapsoriasis and early-stage mycosis fungoides.
6244 : Lipids and nitric oxide in porcine retinal and choroidal blood vessels.
6245 : Mice deficient in the insulin-regulated membrane aminopep

7738 : Peroxidation of proteins and lipids in suspensions of liposomes, in blood serum, and in mouse myeloma cells.
7739 : [The distribution of serum homocysteine and its associated factors in a population of 1 168 subjects in Beijing area]
7740 : Miniaturized electrophoresis: an evolving role in laboratory medicine.
7741 : Reticulocytes. Their usefulness and measurement in peripheral blood.
7742 : Cell saver for blood conservation in cardiac surgery.
7743 : Blood dendritic cells interact with splenic marginal zone B cells to initiate T-independent immune responses.
7744 : Insulin therapy in type 2 diabetes.
7745 : Cerebral blood perfusion after treatment with zolpidem and flumazenil in the baboon.
7746 : [Necessary harmonization of health cost assessment. Autologous peripheral blood progenitor cell transplantation in France]
7747 : Systemic vs. local cytokine and leukocyte responses to unilateral wrist flexion exercise.
7748 : The effect of estrogen use on levels of glucose and insuli

9238 : Multiple organ failure in patients with cardiogenic shock is associated with high plasma levels of interleukin-6.
9239 : Skeletal muscle capillary hemodynamics from rest to contractions: implications for oxygen transfer.
9240 : Interactions between stress, interleukin-1beta, interleukin-6 and cortisol in periodontally diseased patients.
9241 : Is transcranial Doppler ultrasonography (TCD) good enough in determining CO2 reactivity and pressure autoregulation in head-injured patients?
9242 : [Clinical research of patients with acute or chronic hepatic failure treated with molecular adsorbent recirculating system]
9243 : Comparison of PET with radioactive microspheres to assess pulmonary blood flow.
9244 : Bell-bottom aortoiliac endografts: an alternative that preserves pelvic blood flow.
9245 : HEPC-based liposomes trigger cytokine release from peripheral blood cells: effects of liposomal size, dose and lipid composition.
9246 : A simplified double-injection method to quantify cer

## Exercise 02.C: _Search using the Elasticsearch DSL_

You will notice that the native query format of Elasticsearch can be quite verbose.
Elasticsearch provides the Python library `elasticsearch_dsl` to write more concise Elasticsearch queries.
This only simplifies the syntax: the library still issues Elasticsearch queries.

For example, a simple `match_all` query looks as follows:
```python
query = {
   "query": {
       "match_all": {}
   }
}
```

The same query can be created with the DSL as follows:
```python
query = Q("match_all")
```

The native query format becomes hard to read and write especially for more complicated boolean queries.
Read more about the DSL [here](https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html).
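
To illustrate how quickly the native format grows, here is a sketch of a boolean query written as a plain dictionary. The field name `TI` comes from the documents in the index; the search terms and the variable name `native_bool_query` are only illustrative. The DSL documentation linked above describes combining `Q` objects with the operators `&`, `|`, and `~` to express such queries more compactly.

```python
import json

# Native boolean query: the title must match "blood" and must not match "pressure".
native_bool_query = {
    "query": {
        "bool": {
            "must": [{"match": {"TI": "blood"}}],
            "must_not": [{"match": {"TI": "pressure"}}],
        }
    }
}

# The body is sent to Elasticsearch as JSON; printing it shows the nesting.
print(json.dumps(native_bool_query, indent=2))
```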

__Exercise:__ Search for the query `blood` and check whether you get the same number of results as for exercise 02.B. You can use DSL, curl or Kibana. 

In [8]:
#THIS IS GRADED!

# your code here

# BEGIN ANSWER
# Install elasticsearch_dsl library when first running cell.
# ! pip3 install elasticsearch-dsl

import elasticsearch
from elasticsearch_dsl import Search, Q

s = Search(using=es2, index='genomics') 
q = Q('query_string', query='blood')
s = s.query(q)

# Execute the query. This will return the first 10 matches.
res = s.execute()

# Get the total count of matches. 
s.count()  # 68,275 => matches the previous answers.

# END ANSWER

Collecting elasticsearch-dsl
  Downloading elasticsearch_dsl-7.3.0-py2.py3-none-any.whl (62 kB)
Installing collected packages: elasticsearch-dsl
Successfully installed elasticsearch-dsl-7.3.0


# Making your own TREC run

We will adopt a scientific approach to building search engines. That is, we are not only going to build a search engine and see that it works, but we are also going to _measure_ how well it works by evaluating its quality. We will adopt the method of the [Text Retrieval Conference](http://trec.nist.gov) (TREC). TREC provides researchers with test collections that consist of three parts:

1. the document collection (in our case a part of the MEDLINE database)
2. the topics (which are natural language descriptions of what the user is searching for: you can think of them as the _queries_)
3. the relevance judgments (for each topic, what documents are relevant)

##  Exercise 02.D

Complete the code of the Python function `make_trec_run()` that reads the topics [training-queries-simple.txt](http:training-queries-simple.txt) and, for each topic, runs a search using Elasticsearch. The program should output a file in the [TREC submission format](https://trec-core.github.io/2017/#submission-guidelines). We have already provided the first lines for this exercise, which do the following:

1. Open the file `run_file_name` for writing and call it `run_file`.
2. Open the file `topics_file_name` for reading and call it `test_queries`.
3. For each line in `test_queries`:
4. Remove the newline using `strip()`, then split the string on the tab character (`'\t'`). The first part of the line is now `qid` (the query identifier) and the last part is `query` (a textual description of the query).
5. Complete the Python program such that the correct TREC run file is written to `run_file_name`.

> **Note**: Make sure you output the `PMID` (PubMed identifier) of the document, `hit['_source']['PMID']`. Do **not** use the Elasticsearch identifier `_id`: it was randomly generated by Elasticsearch during indexing and does not match the document identifiers in the relevance judgments.
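
The line handling described in the steps above can be sketched without Elasticsearch. The helper names `parse_topic()` and `format_run_line()` are hypothetical, and the topic text, document identifier, score, and run tag are illustrative values; the column order (query id, the literal `Q0`, document id, rank, score, run tag) follows the TREC submission format.

```python
def parse_topic(line):
    """Split one topic line into (qid, query); the fields are tab-separated."""
    qid, query = line.strip().split("\t")
    return qid, query


def format_run_line(qid, doc_id, rank, score, run_tag):
    """Build one line of a TREC run: qid Q0 docno rank score tag."""
    return " ".join([str(qid), "Q0", str(doc_id), str(rank), str(score), run_tag])


# Illustrative topic line and search result.
qid, query = parse_topic("1\texample query text\n")
print(format_run_line(qid, "12056822", 0, 45.120773, "test01"))
```

Note that `qid` is a string after splitting, so it has to be compared with `'1'`, not with the integer `1`.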

In [101]:
#THIS IS GRADED!

import elasticsearch
from elasticsearch_dsl import Search, Q
import re

def make_trec_run(es, topics_file_name, run_file_name, run_name="test"):
    with open(run_file_name, 'w') as run_file:
        with open(topics_file_name, 'r') as test_queries:
            for line in test_queries:
                (qid, query) = line.strip().split('\t')
                # BEGIN ANSWER
                # Per the TREC guidelines: "each run should have a different tag that
                # identifies the group and the method that produced the run."
                run_tag = "27" + run_name

                # Search with the topic text. Retrieve at most 1,000 documents per topic,
                # the maximum allowed by the TREC submission guidelines.
                s = Search(using=es, index='genomics')
                q = Q('multi_match', query=query)
                s = s.query(q)[0:1000]
                res = s.execute()

                for i, hit in enumerate(res.hits.hits):
                    output = [str(qid), 'Q0', str(hit['_source']['PMID']), str(i), str(hit['_score']), run_tag]
                    # Write one result per line. (The earlier check `qid == 1` compared a
                    # string with an integer, so the file started with an empty line.)
                    run_file.write(" ".join(output) + "\n")
                # END ANSWER
                
                
# Write the results of the queries contained in the topic file `'data/training-queries-simple.txt'` 
# to the run file `'baseline.run'`, and name this test as `test01`
make_trec_run(es2, 'data/training-queries-simple.txt', 'baseline.run', run_name='test01')

In [102]:
# this prints out (it is a shell command) the content of the file baseline.run 
!cat baseline.run


1 Q0 12056822 0 45.120773 27test01
1 Q0 11929828 1 43.71542 27test01
1 Q0 11943869 2 43.64373 27test01
1 Q0 11751903 3 43.54453 27test01
1 Q0 12573582 4 42.216682 27test01
1 Q0 12384701 5 41.85428 27test01
1 Q0 12624599 6 41.735706 27test01
1 Q0 11981756 7 41.24557 27test01
1 Q0 12175534 8 41.18186 27test01
1 Q0 12065641 9 41.059303 27test01
1 Q0 12444543 10 40.73954 27test01
1 Q0 11933076 11 40.64125 27test01
1 Q0 11980715 12 40.15782 27test01
1 Q0 11790141 13 40.05864 27test01
1 Q0 12018448 14 39.329933 27test01
1 Q0 11876550 15 39.327446 27test01
1 Q0 11759294 16 39.00269 27test01
1 Q0 12126481 17 38.77115 27test01
1 Q0 12045203 18 38.708294 27test01
1 Q0 12036924 19 38.5743 27test01
1 Q0 12052868 20 38.211033 27test01
1 Q0 12012324 21 37.90165 27test01
1 Q0 12214857 22 37.866325 27test01
1 Q0 11886382 23 37.722996 27test01
1 Q0 12429914 24 37.42054 27test01
1 Q0 12455049 25 37.28813 27test01
1 Q0 12444545 26 36.786613 27test01
1 Q0 11880176 27 36.773304

3 Q0 12298013 636 11.10005 27test01
3 Q0 12036988 637 11.098272 27test01
3 Q0 12584999 638 11.089694 27test01
3 Q0 12165438 639 11.087658 27test01
3 Q0 11773055 640 11.085541 27test01
3 Q0 12111492 641 11.082467 27test01
3 Q0 12086825 642 11.078472 27test01
3 Q0 12207604 643 11.07015 27test01
3 Q0 12016971 644 11.063361 27test01
3 Q0 11862618 645 11.058595 27test01
3 Q0 12048686 646 11.058595 27test01
3 Q0 11920742 647 11.049435 27test01
3 Q0 12067473 648 11.048596 27test01
3 Q0 12444553 649 11.048554 27test01
3 Q0 11890680 650 11.048151 27test01
3 Q0 11749167 651 11.04758 27test01
3 Q0 12530267 652 11.04758 27test01
3 Q0 12499504 653 11.04624 27test01
3 Q0 11802805 654 11.038783 27test01
3 Q0 12209017 655 11.035579 27test01
3 Q0 11870317 656 11.03269 27test01
3 Q0 11863072 657 11.030711 27test01
3 Q0 11906605 658 11.030711 27test01
3 Q0 12537542 659 11.030711 27test01
3 Q0 12171855 660 11.027457 27test01
3 Q0 11798890 661 11.025495 27test01
3 Q0 12595689 662 

6 Q0 12184756 891 17.019085 27test01
6 Q0 11823067 892 17.01178 27test01
6 Q0 12093377 893 17.010807 27test01
6 Q0 12110052 894 17.003908 27test01
6 Q0 12584329 895 17.003891 27test01
6 Q0 12126629 896 17.003885 27test01
6 Q0 11851891 897 16.996159 27test01
6 Q0 11862419 898 16.988703 27test01
6 Q0 12136334 899 16.973583 27test01
6 Q0 12197571 900 16.973583 27test01
6 Q0 11983150 901 16.971012 27test01
6 Q0 11772516 902 16.949112 27test01
6 Q0 12126589 903 16.949112 27test01
6 Q0 12177415 904 16.949112 27test01
6 Q0 12032624 905 16.949112 27test01
6 Q0 12209134 906 16.949112 27test01
6 Q0 11814117 907 16.93686 27test01
6 Q0 11923081 908 16.935795 27test01
6 Q0 11934871 909 16.930727 27test01
6 Q0 11754734 910 16.923122 27test01
6 Q0 11964294 911 16.899706 27test01
6 Q0 11896445 912 16.896275 27test01
6 Q0 12235214 913 16.891325 27test01
6 Q0 12493643 914 16.891212 27test01
6 Q0 12045261 915 16.880068 27test01
6 Q0 12610304 916 16.86244 27test01
6 Q0 12438341 9

10 Q0 11733510 226 14.6220665 27test01
10 Q0 12166926 227 14.6153345 27test01
10 Q0 12073176 228 14.614378 27test01
10 Q0 12188900 229 14.602366 27test01
10 Q0 12389889 230 14.594231 27test01
10 Q0 11886590 231 14.569471 27test01
10 Q0 12403827 232 14.56802 27test01
10 Q0 11876568 233 14.564264 27test01
10 Q0 11971762 234 14.5583 27test01
10 Q0 12393857 235 14.5583 27test01
10 Q0 12399449 236 14.5583 27test01
10 Q0 12574114 237 14.5583 27test01
10 Q0 12591740 238 14.5583 27test01
10 Q0 11751295 239 14.5433235 27test01
10 Q0 12066721 240 14.534412 27test01
10 Q0 12442271 241 14.508585 27test01
10 Q0 12227020 242 14.506424 27test01
10 Q0 12360930 243 14.498153 27test01
10 Q0 12522115 244 14.491463 27test01
10 Q0 12153068 245 14.46781 27test01
10 Q0 12175618 246 14.435996 27test01
10 Q0 11802563 247 14.423152 27test01
10 Q0 11802564 248 14.423152 27test01
10 Q0 11802565 249 14.423152 27test01
10 Q0 11802566 250 14.423152 27test01
10 Q0 11802567 251 14.423152 27tes

13 Q0 12110274 593 9.497073 27test01
13 Q0 12110275 594 9.497073 27test01
13 Q0 12110276 595 9.497073 27test01
13 Q0 12110277 596 9.497073 27test01
13 Q0 12110278 597 9.497073 27test01
13 Q0 12110279 598 9.497073 27test01
13 Q0 12110280 599 9.497073 27test01
13 Q0 12110281 600 9.497073 27test01
13 Q0 12110282 601 9.497073 27test01
13 Q0 12110283 602 9.497073 27test01
13 Q0 12110284 603 9.497073 27test01
13 Q0 12110285 604 9.497073 27test01
13 Q0 12110286 605 9.497073 27test01
13 Q0 12110287 606 9.497073 27test01
13 Q0 12110288 607 9.497073 27test01
13 Q0 12110289 608 9.497073 27test01
13 Q0 12208857 609 9.496551 27test01
13 Q0 11744393 610 9.496444 27test01
13 Q0 12351678 611 9.495577 27test01
13 Q0 12208849 612 9.493578 27test01
13 Q0 11953451 613 9.490713 27test01
13 Q0 12123598 614 9.488057 27test01
13 Q0 11933094 615 9.486668 27test01
13 Q0 12065414 616 9.481211 27test01
13 Q0 12196028 617 9.476505 27test01
13 Q0 11744384 618 9.472076 27test01
13 Q0 121282

17 Q0 11777278 73 18.589422 27test01
17 Q0 12507466 74 18.5409 27test01
17 Q0 12047555 75 18.48549 27test01
17 Q0 11812785 76 18.42826 27test01
17 Q0 12561731 77 18.425365 27test01
17 Q0 11898599 78 18.354694 27test01
17 Q0 12062319 79 18.354694 27test01
17 Q0 11850412 80 18.353804 27test01
17 Q0 12101424 81 18.339844 27test01
17 Q0 12023306 82 18.216549 27test01
17 Q0 11990757 83 18.142155 27test01
17 Q0 12135576 84 18.07417 27test01
17 Q0 12115912 85 17.959309 27test01
17 Q0 11888883 86 17.910225 27test01
17 Q0 12438805 87 17.858507 27test01
17 Q0 12477827 88 17.783154 27test01
17 Q0 11782285 89 17.755884 27test01
17 Q0 11848406 90 17.74301 27test01
17 Q0 12556155 91 17.647778 27test01
17 Q0 12065604 92 17.57425 27test01
17 Q0 11992720 93 17.571041 27test01
17 Q0 12009515 94 17.519405 27test01
17 Q0 12503421 95 17.46591 27test01
17 Q0 11841950 96 17.44072 27test01
17 Q0 12163412 97 17.423904 27test01
17 Q0 12036583 98 17.413021 27test01
17 Q0 12196209 99 17.

20 Q0 12502848 338 17.510311 27test01
20 Q0 11949266 339 17.508812 27test01
20 Q0 11884396 340 17.504688 27test01
20 Q0 11901543 341 17.496603 27test01
20 Q0 12183057 342 17.493578 27test01
20 Q0 12468172 343 17.492704 27test01
20 Q0 12503421 344 17.46591 27test01
20 Q0 11841950 345 17.44072 27test01
20 Q0 12406668 346 17.439186 27test01
20 Q0 12214985 347 17.436104 27test01
20 Q0 12119204 348 17.400745 27test01
20 Q0 12490320 349 17.393045 27test01
20 Q0 12220680 350 17.38011 27test01
20 Q0 11839143 351 17.37923 27test01
20 Q0 12434654 352 17.376871 27test01
20 Q0 11883294 353 17.363281 27test01
20 Q0 12368591 354 17.347198 27test01
20 Q0 12372468 355 17.344866 27test01
20 Q0 12359773 356 17.34034 27test01
20 Q0 12226709 357 17.3396 27test01
20 Q0 12061429 358 17.333998 27test01
20 Q0 12098936 359 17.333998 27test01
20 Q0 12417066 360 17.333998 27test01
20 Q0 12376397 361 17.32971 27test01
20 Q0 12383917 362 17.32971 27test01
20 Q0 11950481 363 17.300625 27tes

23 Q0 12444931 562 11.218453 27test01
23 Q0 11767287 563 11.213767 27test01
23 Q0 12517950 564 11.204823 27test01
23 Q0 12169749 565 11.199391 27test01
23 Q0 11888679 566 11.191099 27test01
23 Q0 11896730 567 11.18245 27test01
23 Q0 11744024 568 11.168574 27test01
23 Q0 11907274 569 11.168574 27test01
23 Q0 12065312 570 11.167971 27test01
23 Q0 12213489 571 11.157689 27test01
23 Q0 12020463 572 11.157208 27test01
23 Q0 11853145 573 11.150927 27test01
23 Q0 11861757 574 11.142349 27test01
23 Q0 12228262 575 11.142349 27test01
23 Q0 12011109 576 11.136978 27test01
23 Q0 12133424 577 11.133052 27test01
23 Q0 12167612 578 11.131883 27test01
23 Q0 11896736 579 11.131494 27test01
23 Q0 12192529 580 11.130602 27test01
23 Q0 12443898 581 11.129345 27test01
23 Q0 11943214 582 11.128353 27test01
23 Q0 12202356 583 11.123011 27test01
23 Q0 12368321 584 11.1209955 27test01
23 Q0 12446584 585 11.109984 27test01
23 Q0 12379904 586 11.105028 27test01
23 Q0 12030350 587 11.104

26 Q0 12421668 692 10.942903 27test01
26 Q0 12064825 693 10.929222 27test01
26 Q0 12049181 694 10.928655 27test01
26 Q0 12045339 695 10.927885 27test01
26 Q0 12100196 696 10.924526 27test01
26 Q0 12454862 697 10.923842 27test01
26 Q0 12438361 698 10.922036 27test01
26 Q0 12388550 699 10.920691 27test01
26 Q0 11786545 700 10.9156885 27test01
26 Q0 12052857 701 10.913261 27test01
26 Q0 11932949 702 10.9090185 27test01
26 Q0 12196524 703 10.9068575 27test01
26 Q0 12032738 704 10.903827 27test01
26 Q0 11923237 705 10.902095 27test01
26 Q0 12213814 706 10.902051 27test01
26 Q0 12133840 707 10.901889 27test01
26 Q0 12138400 708 10.899872 27test01
26 Q0 11973348 709 10.899775 27test01
26 Q0 12368277 710 10.899527 27test01
26 Q0 12231531 711 10.896715 27test01
26 Q0 11896688 712 10.89276 27test01
26 Q0 11976319 713 10.88971 27test01
26 Q0 11964079 714 10.8871765 27test01
26 Q0 11911876 715 10.885815 27test01
26 Q0 11961044 716 10.880522 27test01
26 Q0 12137760 717 10.8

30 Q0 12193472 116 17.605083 27test01
30 Q0 12525103 117 17.597605 27test01
30 Q0 12164942 118 17.584639 27test01
30 Q0 12421932 119 17.545078 27test01
30 Q0 12012326 120 17.465714 27test01
30 Q0 12485937 121 17.458214 27test01
30 Q0 12071573 122 17.445196 27test01
30 Q0 11939783 123 17.43901 27test01
30 Q0 12234072 124 17.369576 27test01
30 Q0 11904524 125 17.34418 27test01
30 Q0 12569362 126 17.335072 27test01
30 Q0 12357345 127 17.270042 27test01
30 Q0 11980849 128 17.25582 27test01
30 Q0 12206137 129 17.227173 27test01
30 Q0 12092248 130 17.223497 27test01
30 Q0 11908745 131 17.221272 27test01
30 Q0 12601020 132 17.215416 27test01
30 Q0 11857478 133 17.105793 27test01
30 Q0 12013365 134 17.079464 27test01
30 Q0 12462521 135 17.066885 27test01
30 Q0 12468615 136 17.044462 27test01
30 Q0 11929748 137 17.000448 27test01
30 Q0 12471741 138 16.982235 27test01
30 Q0 12228250 139 16.974796 27test01
30 Q0 12091489 140 16.97179 27test01
30 Q0 12164301 141 16.968946 

33 Q0 11839758 542 12.858073 27test01
33 Q0 12631586 543 12.852785 27test01
33 Q0 11726708 544 12.830162 27test01
33 Q0 12042344 545 12.829608 27test01
33 Q0 11773076 546 12.824338 27test01
33 Q0 11983990 547 12.81091 27test01
33 Q0 11845331 548 12.791119 27test01
33 Q0 11709582 549 12.787827 27test01
33 Q0 12542676 550 12.786416 27test01
33 Q0 12381524 551 12.773248 27test01
33 Q0 12162553 552 12.768028 27test01
33 Q0 12391260 553 12.759589 27test01
33 Q0 12387898 554 12.755987 27test01
33 Q0 12537454 555 12.753056 27test01
33 Q0 12391608 556 12.752372 27test01
33 Q0 12088867 557 12.742845 27test01
33 Q0 11999354 558 12.731163 27test01
33 Q0 11975499 559 12.720007 27test01
33 Q0 11786488 560 12.714251 27test01
33 Q0 12065674 561 12.711642 27test01
33 Q0 11933849 562 12.710995 27test01
33 Q0 12270611 563 12.708426 27test01
33 Q0 12044655 564 12.692804 27test01
33 Q0 12110609 565 12.692034 27test01
33 Q0 11934809 566 12.691493 27test01
33 Q0 11953158 567 12.6702

36 Q0 12144693 659 12.535817 27test01
36 Q0 11929872 660 12.535817 27test01
36 Q0 11959829 661 12.535817 27test01
36 Q0 12050657 662 12.535817 27test01
36 Q0 12021767 663 12.535817 27test01
36 Q0 11972041 664 12.535817 27test01
36 Q0 11734896 665 12.527788 27test01
36 Q0 12194025 666 12.527788 27test01
36 Q0 12455976 667 12.527788 27test01
36 Q0 12471248 668 12.514529 27test01
36 Q0 11932766 669 12.500429 27test01
36 Q0 12502789 670 12.500429 27test01
36 Q0 11839821 671 12.498127 27test01
36 Q0 12520017 672 12.488462 27test01
36 Q0 12520032 673 12.488462 27test01
36 Q0 11765131 674 12.486958 27test01
36 Q0 11902679 675 12.486958 27test01
36 Q0 11744370 676 12.480515 27test01
36 Q0 11925450 677 12.456507 27test01
36 Q0 11919704 678 12.429957 27test01
36 Q0 11781497 679 12.3741255 27test01
36 Q0 11835058 680 12.3741255 27test01
36 Q0 11880339 681 12.3741255 27test01
36 Q0 11927561 682 12.3741255 27test01
36 Q0 11928503 683 12.3741255 27test01
36 Q0 12095613 684 1

39 Q0 11951023 928 11.048149 27test01
39 Q0 12466193 929 11.048149 27test01
39 Q0 11914715 930 11.005099 27test01
39 Q0 11926534 931 11.005099 27test01
39 Q0 11937021 932 11.005099 27test01
39 Q0 11948621 933 11.005099 27test01
39 Q0 12465973 934 11.005099 27test01
39 Q0 12111720 935 11.005099 27test01
39 Q0 12372142 936 11.005099 27test01
39 Q0 12374572 937 11.005099 27test01
39 Q0 12459439 938 10.999022 27test01
39 Q0 12522919 939 10.966305 27test01
39 Q0 12296541 940 10.955241 27test01
39 Q0 11997349 941 10.931357 27test01
39 Q0 12242050 942 10.925026 27test01
39 Q0 11836794 943 10.887803 27test01
39 Q0 12586062 944 10.887803 27test01
39 Q0 11875653 945 10.871458 27test01
39 Q0 12183374 946 10.871458 27test01
39 Q0 12186640 947 10.871458 27test01
39 Q0 12200166 948 10.871458 27test01
39 Q0 12208494 949 10.871458 27test01
39 Q0 12505988 950 10.871458 27test01
39 Q0 12111742 951 10.871458 27test01
39 Q0 12372288 952 10.871458 27test01
39 Q0 12454022 953 10.871

43 Q0 11961101 204 15.279039 27test01
43 Q0 12520013 205 15.273477 27test01
43 Q0 11881810 206 15.242591 27test01
43 Q0 12035802 207 15.239593 27test01
43 Q0 12447381 208 15.234868 27test01
43 Q0 12136413 209 15.231316 27test01
43 Q0 12072176 210 15.226326 27test01
43 Q0 12436113 211 15.210492 27test01
43 Q0 12054512 212 15.195192 27test01
43 Q0 12006672 213 15.184298 27test01
43 Q0 11904380 214 15.177974 27test01
43 Q0 11914720 215 15.177974 27test01
43 Q0 12134149 216 15.177974 27test01
43 Q0 12141779 217 15.177974 27test01
43 Q0 12183041 218 15.177974 27test01
43 Q0 12007414 219 15.177974 27test01
43 Q0 12210559 220 15.177974 27test01
43 Q0 12486701 221 15.177974 27test01
43 Q0 12350270 222 15.177974 27test01
43 Q0 12398416 223 15.177974 27test01
43 Q0 12559957 224 15.177974 27test01
43 Q0 12093382 225 15.145387 27test01
43 Q0 12139511 226 15.133473 27test01
43 Q0 11989718 227 15.112351 27test01
43 Q0 12144703 228 15.10177 27test01
43 Q0 11951037 229 15.0969

46 Q0 11948614 114 14.251796 27test01
46 Q0 12004963 115 14.240157 27test01
46 Q0 12438220 116 14.240157 27test01
46 Q0 12555612 117 14.216894 27test01
46 Q0 12418572 118 14.194429 27test01
46 Q0 11855817 119 14.154394 27test01
46 Q0 12442833 120 14.139578 27test01
46 Q0 11803384 121 14.076817 27test01
46 Q0 12015302 122 14.054104 27test01
46 Q0 12110583 123 14.054104 27test01
46 Q0 12553307 124 14.032275 27test01
46 Q0 12480532 125 13.966044 27test01
46 Q0 11723434 126 13.895734 27test01
46 Q0 11913791 127 13.865978 27test01
46 Q0 12362432 128 13.865163 27test01
46 Q0 12063404 129 13.858891 27test01
46 Q0 12203814 130 13.849791 27test01
46 Q0 12070181 131 13.843348 27test01
46 Q0 12298002 132 13.821615 27test01
46 Q0 12112321 133 13.76326 27test01
46 Q0 12370331 134 13.5883875 27test01
46 Q0 12489150 135 13.545003 27test01
46 Q0 12093749 136 13.436757 27test01
46 Q0 11935223 137 13.424397 27test01
46 Q0 12500631 138 13.416143 27test01
46 Q0 12507466 139 13.412

49 Q0 12086873 794 9.384027 27test01
49 Q0 12349953 795 9.383721 27test01
49 Q0 11960378 796 9.380508 27test01
49 Q0 12097382 797 9.378643 27test01
49 Q0 12207231 798 9.378643 27test01
49 Q0 11846978 799 9.373284 27test01
49 Q0 12163407 800 9.373142 27test01
49 Q0 11738799 801 9.371589 27test01
49 Q0 12223398 802 9.370614 27test01
49 Q0 12210033 803 9.36909 27test01
49 Q0 12580964 804 9.368754 27test01
49 Q0 12049780 805 9.3642025 27test01
49 Q0 12392794 806 9.3642025 27test01
49 Q0 12586340 807 9.3642025 27test01
49 Q0 12517926 808 9.363464 27test01
49 Q0 12388793 809 9.355129 27test01
49 Q0 12438606 810 9.355129 27test01
49 Q0 12230120 811 9.354106 27test01
49 Q0 12130544 812 9.353748 27test01
49 Q0 12089160 813 9.352926 27test01
49 Q0 12223476 814 9.352579 27test01
49 Q0 12483323 815 9.351791 27test01
49 Q0 12079879 816 9.349871 27test01
49 Q0 12213324 817 9.348341 27test01
49 Q0 12154401 818 9.345437 27test01
49 Q0 11876909 819 9.341505 27test01
49 Q0 1197

> Tip: Write a line to `run_file` using `run_file.write(line)`. 
> The newline character is: `'\n'`. Before writing a number to
> the file, cast it to a string using `str()`.
>
> The TREC Submission guidelines allow you to submit up to 1000
> documents per topic. Keep this in mind!
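
The tip above can be sketched as follows. This is a minimal illustration, not the reference solution: the helper name `format_run_line`, the in-memory `io.StringIO` stand-in for `run_file`, and the two sample hits are assumptions made for the example, and numbering ranks from 1 is a choice of this sketch.

```python
import io

def format_run_line(topic_id, doc_id, rank, score, run_tag):
    """Build one TREC run line: topic, the literal Q0, document id, rank, score, run tag."""
    fields = [str(topic_id), 'Q0', str(doc_id), str(rank), str(score), run_tag]
    return ' '.join(fields) + '\n'

# In-memory stand-in for a file opened with open(run_file_name, 'w').
run_file = io.StringIO()

# Illustrative (doc_id, score) pairs, assumed to be sorted by descending score.
hits = [('12004963', 14.240157), ('12438220', 14.240157)]

# Slice to at most 1000 documents per topic before writing.
for rank, (doc_id, score) in enumerate(hits[:1000], start=1):
    run_file.write(format_run_line(46, doc_id, rank, score, '27test01'))

print(run_file.getvalue(), end='')
```

Terminating every line with `'\n'` (instead of prefixing lines with it) avoids an empty first line in the run file.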

# Index improvements: Tokenization (Analyzers)

_You are advised to work on this part after Lecture 02_

The way documents are indexed influences the performance of an IR system. 
Elasticsearch [Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/mapping.html) define how a document and its properties (fields) are stored and indexed. Whenever you change the configuration of an ElasticSearch Mapping, the document collection needs to be re-indexed.

## Background
The following part of the assignment requires some self-study of the ElasticSearch tools that support improving the index. Please read:
* [Index Settings and Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-create-index.html);
* Elasticsearch [Analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis.html), which offer many options for improving your search engine.

Again, you are requested to consult the documentation of the [Python Elasticsearch Client](https://elasticsearch-py.readthedocs.io) library.


## Bulk indexing revisited
As we need to re-index the document collection when we use a different Mapping configuration, we developed some functions to support a quick re-indexing in the following exercises.

Below you find the Python code for bulk-indexing our Medline collection, similar to the code of the exercises in the beginning of Part 02. Read the code carefully, as you are required to use the indexing functions later for the completion of the assignment.

> The code uses additional helper functions 
> (`elasticsearch.helpers`) and a library for processing JSON.
> The function `read_documents()` reads the bulk insert file. It is a
> generator function: it produces its items on demand, one at a time,
> by using the statement `yield` for every item. It
> is used in the helper function `elasticsearch.helpers.bulk()`.
> The statement `raise` is Python's way of throwing an exception; 
> if the exception is not caught, the program stops with an error.
> Note the additional (keyword) arguments to `bulk()`:
> `chunk_size` indicates the number of documents sent to
> Elasticsearch in one batch. 
> The `request_timeout` is set to 30 seconds because processing a single batch
> of documents can take some time.

> __Note:__ _when processing a bulk index, be sure to have a few gigabytes free on your hard drive. If you get a BulkIndexError with read-only/FORBIDDEN errors, you probably have too little hard drive space available for ElasticSearch to work properly._
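
To make the generator and batching ideas concrete, here is a small self-contained sketch that needs no running Elasticsearch. The helpers `read_pairs` and `chunks` and the three sample documents are hypothetical and exist only for this illustration; `elasticsearch.helpers.bulk()` performs its own batching internally, so this only mimics the idea behind `chunk_size`.

```python
import itertools
import json

def read_pairs(lines):
    """Yield one document per (action line, source line) pair, attaching the _id."""
    doc_id = None
    for line in lines:
        record = json.loads(line)
        if 'index' in record:
            # Action line: remember the id for the document on the next line.
            doc_id = record['index']['_id']
        else:
            record['_id'] = doc_id
            yield record

def chunks(iterable, chunk_size):
    """Lazily group the items of an iterable into lists of at most chunk_size items."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# Made-up sample in the two-lines-per-document bulk layout.
sample_lines = [
    '{"index": {"_id": "1"}}', '{"PMID": "100", "TI": "first title"}',
    '{"index": {"_id": "2"}}', '{"PMID": "200", "TI": "second title"}',
    '{"index": {"_id": "3"}}', '{"PMID": "300", "TI": "third title"}',
]

batches = list(chunks(read_pairs(sample_lines), chunk_size=2))
print([len(batch) for batch in batches])
```

Because both functions are generators, no more than one batch of documents has to be held in memory at a time, which matters for a collection of this size.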


In [2]:
import elasticsearch
import elasticsearch.helpers
import json

es = elasticsearch.Elasticsearch(host='localhost')  # in case you use Docker, the host is 'elasticsearch'

def read_documents(file_name):
    """
    Returns a generator of documents to be indexed by elastic, read from file_name
    """
    with open(file_name, 'r') as documents:
        for line in documents:
            doc_line = json.loads(line)
            if ('index' in doc_line):
                id = doc_line['index']['_id']
            elif ('PMID' in doc_line):
                doc_line['_id'] = id
                yield doc_line
            else:
                raise ValueError('Woops, error in index file')

def create_index(es, index_name, body={}):
    # delete index when it already exists
    es.indices.delete(index=index_name, ignore=[400, 404])
    # create the index 
    es.indices.create(index=index_name, body=body)
                
def index_documents(es, file_name, index_name, body={}):
    create_index(es, index_name, body)
    # bulk index the documents from file_name
    return elasticsearch.helpers.bulk(
        es, 
        read_documents(file_name),
        index=index_name,
        chunk_size=2000,
        request_timeout=30
    )

In [3]:
index_documents(es, 'data/trec-medline.json', 'genomics')

(525937, [])

## Exercise 02.E: _ElasticSearch Analyzers (Tokenization)_

The amount and quality of the tokens used to construct the inverted index are of great importance. In ElasticSearch, mappings and settings also allow specifying which [Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html) is used to tokenize your documents and queries. In the mappings below, use the _Dutch_ analyzer for the field `"all"` (where `"all"` indexes the fields `"TI"` and `"AB"`):

> Usually, the same analyzer should be applied to documents and queries, but 
> Elasticsearch allows you to specify a `"search_analyzer"` that is used on 
> your queries (which we do not need to use in the assignment).
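
For reference only, a mapping fragment with a separate `"search_analyzer"` could look like the sketch below. The variable name `search_analyzer_example` and the choice of the `standard` analyzer for queries are assumptions of this illustration; the assignment itself does not need this option.

```python
# Hypothetical mapping fragment: documents are analyzed with the Dutch analyzer
# at index time, while query text is analyzed with the standard analyzer.
search_analyzer_example = {
    "mappings": {
        "properties": {
            "all": {
                "type": "text",
                "analyzer": "dutch",
                "search_analyzer": "standard"
            }
        }
    }
}

field = search_analyzer_example["mappings"]["properties"]["all"]
print(field["analyzer"], field["search_analyzer"])
```

Using different analyzers at index and search time can easily lead to query terms that never match any indexed token, which is why the same analyzer is the usual choice.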

In [20]:
#THIS IS GRADED!

analyzer_test = {
  # BEGIN ANSWER
    "mappings": {
        "properties": {
            "AB": {
                "type": "text",
                "copy_to": "all"                
            },
            "TI": {
                "type": "text",
                "copy_to": "all"
            },
            "all": {
                "type": "text",
                "analyzer": "dutch"
            }
#             "all": {
#                 "type": "text",
#                 "analyzer": "dutch",
#                 "fields": {
#                     "AB": {
#                         "type": "text",
#                     },
#                     "TI": {
#                     "type": "text",
#                     }
#                 }
#             }
        }
    }
  # END ANSWER
}

# create the index, but don't index any documents:
create_index(es, 'genomics', body=analyzer_test)

The analyzer defined for the `"all"` field can be tested [as follows](https://elasticsearch-py.readthedocs.io/en/master/api.html#indices). Translated to English, the text says: _"These are Dutch sentences"_. 

The following script lists the tokens produced by the Dutch analyzer. Try different analyzers and different sentences to see how the tokens are created.

In [15]:
from pprint import pprint # pretty print

body = { "field": "all", "text": "dit zijn nederlandse zinnen"}
tokens = es.indices.analyze(index='genomics', body=body)
pprint(tokens)

{'tokens': [{'end_offset': 20,
             'position': 2,
             'start_offset': 9,
             'token': 'nederland',
             'type': '<ALPHANUM>'},
            {'end_offset': 27,
             'position': 3,
             'start_offset': 21,
             'token': 'zinn',
             'type': '<ALPHANUM>'}]}
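
The response of the analyze call is a plain dictionary, so it can be post-processed with ordinary Python. The sketch below, which uses a copy of the output above as sample data, pulls out the token strings and the positions that are missing from the result; the helper names `token_strings` and `dropped_positions` are hypothetical.

```python
def token_strings(analyze_response):
    """Return only the token texts from an analyze response."""
    return [token['token'] for token in analyze_response['tokens']]

def dropped_positions(analyze_response):
    """Return the positions up to the last token that produced no token."""
    kept = {token['position'] for token in analyze_response['tokens']}
    if not kept:
        return []
    return [p for p in range(max(kept) + 1) if p not in kept]

# Copy of the response shown above for "dit zijn nederlandse zinnen".
response = {'tokens': [
    {'end_offset': 20, 'position': 2, 'start_offset': 9,
     'token': 'nederland', 'type': '<ALPHANUM>'},
    {'end_offset': 27, 'position': 3, 'start_offset': 21,
     'token': 'zinn', 'type': '<ALPHANUM>'},
]}

print(token_strings(response))
print(dropped_positions(response))
```

Positions 0 and 1 ("dit" and "zijn") are absent from the response, which suggests that the Dutch analyzer removed them as stopwords, while the two remaining words were stemmed.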


## Exercise 02.F: _tweet language analyzer_

Read the documentation for [Custom Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis-custom-analyzer.html). 
Make a custom analyzer for _English tweet language_. The analyzer should do the following:
* change common abbreviations to the full forms: 
  * _b4_ to _before_, 
  * _abt_ to _about_, 
  * _chk_ to _check_, 
  * _cr8_ to _create_, 
  * _dm_ to _direct message_,
  * _f2f_ to _face-to-face_
* use the _standard_ tokenizer;
* convert everything to lower case;
* filter English stopwords.
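
The order of the steps matters: character filters see the raw text before the tokenizer, and token filters run on the tokens afterwards. The pure-Python sketch below imitates that order so you can reason about it without an index. It is only a rough emulation under stated assumptions: the regular expression is a crude stand-in for the _standard_ tokenizer, the stopword set is a hand-copied list that we assume corresponds to Elasticsearch's `_english_` list, and the function name `emulate_tweet_analyzer` is hypothetical.

```python
import re

# Abbreviations from the exercise text.
ABBREVIATIONS = {
    'b4': 'before', 'abt': 'about', 'chk': 'check',
    'cr8': 'create', 'dm': 'direct message', 'f2f': 'face-to-face',
}

# Assumed English stopword list (verify against the Elasticsearch documentation).
STOPWORDS = {
    'a', 'an', 'and', 'are', 'as', 'at', 'be', 'but', 'by', 'for', 'if', 'in',
    'into', 'is', 'it', 'no', 'not', 'of', 'on', 'or', 'such', 'that', 'the',
    'their', 'then', 'there', 'these', 'they', 'this', 'to', 'was', 'will', 'with',
}

# One alternation, longest keys first, so every match is replaced exactly once.
_ABBREVIATION_RE = re.compile(
    '|'.join(re.escape(key) for key in sorted(ABBREVIATIONS, key=len, reverse=True))
)

def emulate_tweet_analyzer(text):
    # 1. Character filter: rewrite the raw text (case-sensitive, also inside words).
    expanded = _ABBREVIATION_RE.sub(lambda m: ABBREVIATIONS[m.group(0)], text)
    # 2. Tokenizer: split on anything that is not a letter or digit.
    tokens = re.findall(r'[A-Za-z0-9]+', expanded)
    # 3. Token filters: lower-case first, then drop stopwords.
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]

print(emulate_tweet_analyzer('cr8 it! what abt dm me?'))
print(emulate_tweet_analyzer('B4 lunch'))
```

Two consequences of this ordering are visible in the sketch: an upper-case `B4` is not expanded because the replacement happens before lower-casing, and the replacement is not limited to whole words. Use the analyze API on your real index to check how your own analyzer behaves in such cases.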

In [32]:
#THIS IS GRADED!

tweet_analyzer = {
  # BEGIN ANSWER
    "mappings": {
        "properties": {
            "AB": {
                "type": "text",
                "copy_to": "all"                
            },
            "TI": {
                "type": "text",
                "copy_to": "all"
            },
            "all": {
                "type": "text",
                "analyzer": "tweet_analyzer"
            }
#             "all": {
#                 "type": "text",
#                 "analyzer": "tweet_analyzer",
#                 "fields": {
#                     "AB": {
#                         "type": "text",
#                     },
#                     "TI": {
#                     "type": "text",
#                     }
#                 }   
#             }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                 "tweet_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": "tweet_char_filter",
                    "filter": [
                        "lowercase",
                        "stop_filter" 
                    ]
                 }
            },
            "char_filter": {
                "tweet_char_filter": {
                    "type":"mapping",
                    "mappings": [
                        "b4 => before",
                        "abt => about",
                        "chk => check",
                        "cr8 => create",
                        "dm => direct message",
                        "f2f => face-to-face" 
                    ]
                }
            },
            "filter": {
                "stop_filter": {
                    "type": "stop",
                    "stopwords": "_english_"
                }
            }
        }
    }
  # END ANSWER
}

# create the index, but don't index any documents:
create_index(es, 'genomics', body=tweet_analyzer)
body = { "field": "all", "text": "cr8 it! what abt dm me?"}
tokens = es.indices.analyze(index='genomics', body=body)
pprint(tokens)

{'tokens': [{'end_offset': 3,
             'position': 0,
             'start_offset': 0,
             'token': 'create',
             'type': '<ALPHANUM>'},
            {'end_offset': 12,
             'position': 2,
             'start_offset': 8,
             'token': 'what',
             'type': '<ALPHANUM>'},
            {'end_offset': 16,
             'position': 3,
             'start_offset': 13,
             'token': 'about',
             'type': '<ALPHANUM>'},
            {'end_offset': 18,
             'position': 4,
             'start_offset': 17,
             'token': 'direct',
             'type': '<ALPHANUM>'},
            {'end_offset': 19,
             'position': 5,
             'start_offset': 18,
             'token': 'message',
             'type': '<ALPHANUM>'},
            {'end_offset': 22,
             'position': 6,
             'start_offset': 20,
             'token': 'me',
             'type': '<ALPHANUM>'}]}


# Part 03: Search models 

_You are advised to work on this part after Lecture 03_

Elasticsearch [Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/mapping.html) define how a document and its properties (fields) are stored and indexed, but they also provide tools to implement and execute different document similarity measures (i.e. search models). 

> See again: [Index Settings and Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/indices-create-index.html).

For instance, we can add a new field `"all"` that uses the [similarity measure](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/similarity.html) _Boolean_, and let it serve as an index for the fields `"TI"` and `"AB"` (title and abstract) as follows:

In [4]:
boolean = {
  "settings" : {
    # a single shard, so we do not suffer from approximate document frequencies
    "number_of_shards" : 1
  },
  "mappings": {
      "properties": {
        "AB": {
          "type": "text",
          "copy_to": "all"
        },
        "TI": {
          "type": "text",
          "copy_to": "all"
        },
        "all": {
            "type": "text",
            "similarity": "boolean"
        }
      }
  }
}

index_documents(es, 'data/trec-medline.json', 'genomics', body=boolean)

(525937, [])

> Most changes to the mappings cannot be done on an existing index. Some (for instance
> similarity measures) can be changed if the index is first closed. Nevertheless, we 
> will in this notebook _re-index_ the collection for every change to the mappings
> using the function `index_documents()` that we defined above. Mappings (and settings)
> can be passed to the function using the `body` parameter.

Let's have a look at the mappings and settings for our index as follows:

In [5]:
es.indices.get(index='genomics')

{'genomics': {'aliases': {},
  'mappings': {'properties': {'AB': {'type': 'text', 'copy_to': ['all']},
    'AD': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'AID': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CI': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CIN': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CN': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CON': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CY': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'DA': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'DCOM': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'DP': 

Now let's search our new field `"all"` as follows:

In [6]:
query = "blood"
search_type = "dfs_query_then_fetch" # this will use exact document frequencies even for multiple shards
body = {
  "query": {
    "match" : { "all" : query }
  },
  "size": 10
}
es.search(index="genomics", search_type=search_type, body=body)

{'took': 1970,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10000, 'relation': 'gte'},
  'max_score': 1.0,
  'hits': [{'_index': 'genomics',
    '_type': '_doc',
    '_id': '82',
    '_score': 1.0,
    '_source': {'AB': 'Space flight results in loss of bone mass, especially in weight-bearing bones, a condition that is suggested to be similar to disuse osteoporosis. As models to elucidate the underlying mechanism, bed rest studies were performed and bone metabolism in the rat both during space flight and during hindlimb unloading was investigated. The general picture is that bone formation is decreased partly as a result of reduced osteoblast function, whereas bone resorption is unaltered or increased. This deficit in bone mass can be replaced, but the time span for restoration exceeds the period of unloading. Changes in blood flow, systemic hormones, and locally produced factors are contributing in a yet undefin
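
The `_score` of 1.0 for the one-word query above fits the idea behind the _Boolean_ similarity: a matching term contributes its query boost and nothing else. The sketch below illustrates this for a `match` query under simplifying assumptions (every boost is 1, terms are already analyzed, and each query term counts once per document); the function name `boolean_score` and the toy documents are made up for the illustration.

```python
def boolean_score(query_terms, doc_terms):
    """Count how many query terms occur in the document (boost 1 per match)."""
    doc_vocabulary = set(doc_terms)
    return float(sum(1 for term in query_terms if term in doc_vocabulary))

# Toy, already-tokenized documents.
doc_a = ['changes', 'in', 'blood', 'flow']
doc_b = ['janus', 'kinase', 'activity']

print(boolean_score(['blood'], doc_a))
print(boolean_score(['janus', 'kinase', '2'], doc_b))
print(boolean_score(['blood'], doc_b))
```

Under this model, term frequency and document length play no role, so many documents share the same score and their relative order is arbitrary.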

## Exercise 03.A: _new run and evaluation_
Create a new run file (e.g. `boolean.run`), compute the retrieval performance with the function `print_trec_eval()` and compare the results with the baseline run file `baseline.run`.

In [16]:
#THIS IS GRADED!

# write your code here
# BEGIN ANSWER

def make_trec_run(es, topics_file_name, run_file_name, run_name="test"):
    # Per TREC guidelines: "each run should have a different tag that
    # identifies the group and the method that produced the run."
    run_tag = "27" + run_name
    # this will use exact document frequencies even for multiple shards
    search_type = "dfs_query_then_fetch"
    with open(run_file_name, 'w') as run_file:
        with open(topics_file_name, 'r') as test_queries:
            for line in test_queries:
                (qid, query) = line.strip().split('\t')
                # Search the field "all"; TREC allows at most 1000 documents per topic.
                body = {
                    "query": {
                        "match": {"all": query}
                    },
                    "size": 1000
                }
                res = es.search(index="genomics", search_type=search_type, body=body)

                for i, hit in enumerate(res['hits']['hits']):
                    output = [qid, 'Q0', str(hit['_source']['PMID']), str(i), str(hit['_score']), run_tag]
                    # One result per line, each terminated by a newline.
                    run_file.write(" ".join(output) + "\n")


# Write the results of the queries contained in the topic file `'data/training-queries-simple.txt'` 
# to the run file `'boolean.run'`, and name this test `test02`
make_trec_run(es, 'data/training-queries-simple.txt', 'boolean.run', run_name='test02')
# END ANSWER

In [12]:
! head data/training-queries-simple.txt

1	"cyclin-dependent kinase inhibitor 1A (p21, Cip1)" in Homo sapiens
2	"DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 5 (RNA helicase, 68kDa)" in Homo sapiens
3	ets variant gene 6 (TEL oncogene) in Homo sapiens
4	fibroblast growth factor 7 (keratinocyte growth factor) in Homo sapiens
5	"glycine receptor, alpha 1 (startle disease/hyperekplexia, stiff man syndrome)" in Homo sapiens
6	"major histocompatibility complex, class II, DQ beta 1" in Homo sapiens
7	Janus kinase 2 (a protein tyrosine kinase) in Homo sapiens
8	luteinizing hormone/choriogonadotropin receptor in Homo sapiens
9	metallothionein 3 (growth inhibitory factor (neurotrophic)) in Homo sapiens
10	protein C (inactivator of coagulation factors Va and VIIIa) in Homo sapiens


In [17]:
! cat boolean.run


1 Q0 11756437 0 7.0 27test02
1 Q0 11762751 1 7.0 27test02
1 Q0 11767002 2 7.0 27test02
1 Q0 11790141 3 7.0 27test02
1 Q0 11809764 4 7.0 27test02
1 Q0 11852120 5 7.0 27test02
1 Q0 11870216 6 7.0 27test02
1 Q0 11876550 7 7.0 27test02
1 Q0 11879190 8 7.0 27test02
1 Q0 11880176 9 7.0 27test02
1 Q0 11886382 10 7.0 27test02
1 Q0 11886527 11 7.0 27test02
1 Q0 12203124 12 7.0 27test02
1 Q0 12204896 13 7.0 27test02
1 Q0 12214273 14 7.0 27test02
1 Q0 12214857 15 7.0 27test02
1 Q0 12085228 16 7.0 27test02
1 Q0 12112322 17 7.0 27test02
1 Q0 12112851 18 7.0 27test02
1 Q0 12175534 19 7.0 27test02
1 Q0 12023399 20 7.0 27test02
1 Q0 12028050 21 7.0 27test02
1 Q0 12032080 22 7.0 27test02
1 Q0 12036924 23 7.0 27test02
1 Q0 12045203 24 7.0 27test02
1 Q0 12056822 25 7.0 27test02
1 Q0 12065641 26 7.0 27test02
1 Q0 12075114 27 7.0 27test02
1 Q0 12079680 28 7.0 27test02
1 Q0 12080324 29 7.0 27test02
1 Q0 12081329 30 7.0 27test02
1 Q0 12126481 31 7.0 27test02
1 Q0 12151347 32

3 Q0 11767918 351 3.0 27test02
3 Q0 11767946 352 3.0 27test02
3 Q0 11768239 353 3.0 27test02
3 Q0 11768602 354 3.0 27test02
3 Q0 11768721 355 3.0 27test02
3 Q0 11769963 356 3.0 27test02
3 Q0 11770011 357 3.0 27test02
3 Q0 11770014 358 3.0 27test02
3 Q0 11770533 359 3.0 27test02
3 Q0 11770798 360 3.0 27test02
3 Q0 11770893 361 3.0 27test02
3 Q0 11771666 362 3.0 27test02
3 Q0 11771754 363 3.0 27test02
3 Q0 11771756 364 3.0 27test02
3 Q0 11771959 365 3.0 27test02
3 Q0 11772517 366 3.0 27test02
3 Q0 11772521 367 3.0 27test02
3 Q0 11772522 368 3.0 27test02
3 Q0 11772624 369 3.0 27test02
3 Q0 11772634 370 3.0 27test02
3 Q0 11773044 371 3.0 27test02
3 Q0 11773055 372 3.0 27test02
3 Q0 11773079 373 3.0 27test02
3 Q0 11773627 374 3.0 27test02
3 Q0 11773631 375 3.0 27test02
3 Q0 11774034 376 3.0 27test02
3 Q0 11775127 377 3.0 27test02
3 Q0 11776317 378 3.0 27test02
3 Q0 11776976 379 3.0 27test02
3 Q0 11777610 380 3.0 27test02
3 Q0 11778658 381 3.0 27test02
3 Q0 117

6 Q0 12112369 638 5.0 27test02
6 Q0 12112675 639 5.0 27test02
6 Q0 12112686 640 5.0 27test02
6 Q0 12114532 641 5.0 27test02
6 Q0 12115060 642 5.0 27test02
6 Q0 12165480 643 5.0 27test02
6 Q0 12167035 644 5.0 27test02
6 Q0 12167050 645 5.0 27test02
6 Q0 12167099 646 5.0 27test02
6 Q0 12167316 647 5.0 27test02
6 Q0 12167562 648 5.0 27test02
6 Q0 12170241 649 5.0 27test02
6 Q0 12170466 650 5.0 27test02
6 Q0 12172023 651 5.0 27test02
6 Q0 12173309 652 5.0 27test02
6 Q0 12173942 653 5.0 27test02
6 Q0 12175724 654 5.0 27test02
6 Q0 12175732 655 5.0 27test02
6 Q0 12175734 656 5.0 27test02
6 Q0 12176915 657 5.0 27test02
6 Q0 12177404 658 5.0 27test02
6 Q0 12177443 659 5.0 27test02
6 Q0 12177688 660 5.0 27test02
6 Q0 12180134 661 5.0 27test02
6 Q0 12180543 662 5.0 27test02
6 Q0 12180827 663 5.0 27test02
6 Q0 12180851 664 5.0 27test02
6 Q0 12181045 665 5.0 27test02
6 Q0 12181439 666 5.0 27test02
6 Q0 12181776 667 5.0 27test02
6 Q0 12183336 668 5.0 27test02
6 Q0 121

10 Q0 11877390 224 6.0 27test02
10 Q0 11877423 225 6.0 27test02
10 Q0 11877432 226 6.0 27test02
10 Q0 11877478 227 6.0 27test02
10 Q0 11878797 228 6.0 27test02
10 Q0 11879781 229 6.0 27test02
10 Q0 11879819 230 6.0 27test02
10 Q0 11880178 231 6.0 27test02
10 Q0 11880298 232 6.0 27test02
10 Q0 11880336 233 6.0 27test02
10 Q0 11880358 234 6.0 27test02
10 Q0 11882348 235 6.0 27test02
10 Q0 11882386 236 6.0 27test02
10 Q0 11882816 237 6.0 27test02
10 Q0 11882974 238 6.0 27test02
10 Q0 11884012 239 6.0 27test02
10 Q0 11884386 240 6.0 27test02
10 Q0 11884391 241 6.0 27test02
10 Q0 11884401 242 6.0 27test02
10 Q0 11885409 243 6.0 27test02
10 Q0 11885412 244 6.0 27test02
10 Q0 11886848 245 6.0 27test02
10 Q0 11887452 246 6.0 27test02
10 Q0 11888297 247 6.0 27test02
10 Q0 11889076 248 6.0 27test02
10 Q0 11890875 249 6.0 27test02
10 Q0 11891769 250 6.0 27test02
10 Q0 11891805 251 6.0 27test02
10 Q0 11891846 252 6.0 27test02
10 Q0 12194736 253 6.0 27test02
10 Q0 1219

13 Q0 11929995 335 3.0 27test02
13 Q0 11932268 336 3.0 27test02
13 Q0 11935223 337 3.0 27test02
13 Q0 11936259 338 3.0 27test02
13 Q0 11937360 339 3.0 27test02
13 Q0 11940574 340 3.0 27test02
13 Q0 11943764 341 3.0 27test02
13 Q0 11948425 342 3.0 27test02
13 Q0 11952375 343 3.0 27test02
13 Q0 11953945 344 3.0 27test02
13 Q0 11955438 345 3.0 27test02
13 Q0 11955446 346 3.0 27test02
13 Q0 11991859 347 3.0 27test02
13 Q0 11994286 348 3.0 27test02
13 Q0 11994469 349 3.0 27test02
13 Q0 11994479 350 3.0 27test02
13 Q0 11996013 351 3.0 27test02
13 Q0 11996585 352 3.0 27test02
13 Q0 11996925 353 3.0 27test02
13 Q0 11997063 354 3.0 27test02
13 Q0 11997106 355 3.0 27test02
13 Q0 11997145 356 3.0 27test02
13 Q0 12006625 357 3.0 27test02
13 Q0 12007195 358 3.0 27test02
13 Q0 12007415 359 3.0 27test02
13 Q0 12011068 360 3.0 27test02
13 Q0 12011974 361 3.0 27test02
13 Q0 12011997 362 3.0 27test02
13 Q0 12015115 363 3.0 27test02
13 Q0 12015117 364 3.0 27test02
13 Q0 1201

17 Q0 12114424 16 5.0 27test02
17 Q0 12193046 17 5.0 27test02
17 Q0 12037603 18 5.0 27test02
17 Q0 12044941 19 5.0 27test02
17 Q0 12047555 20 5.0 27test02
17 Q0 12055246 21 5.0 27test02
17 Q0 12065407 22 5.0 27test02
17 Q0 11909870 23 5.0 27test02
17 Q0 12125810 24 5.0 27test02
17 Q0 12134164 25 5.0 27test02
17 Q0 11976787 26 5.0 27test02
17 Q0 12351724 27 5.0 27test02
17 Q0 12351735 28 5.0 27test02
17 Q0 12359260 29 5.0 27test02
17 Q0 11937514 30 5.0 27test02
17 Q0 11944983 31 5.0 27test02
17 Q0 11952131 32 5.0 27test02
17 Q0 11956218 33 5.0 27test02
17 Q0 11992720 34 5.0 27test02
17 Q0 11994464 35 5.0 27test02
17 Q0 12007189 36 5.0 27test02
17 Q0 12391163 37 5.0 27test02
17 Q0 12408834 38 5.0 27test02
17 Q0 12417297 39 5.0 27test02
17 Q0 12417951 40 5.0 27test02
17 Q0 12419829 41 5.0 27test02
17 Q0 12437987 42 5.0 27test02
17 Q0 12441291 43 5.0 27test02
17 Q0 12451134 44 5.0 27test02
17 Q0 12477825 45 5.0 27test02
17 Q0 12556155 46 5.0 27test02
17 Q0 12

19 Q0 12214235 886 3.0 27test02
19 Q0 12214238 887 3.0 27test02
19 Q0 12214272 888 3.0 27test02
19 Q0 12214709 889 3.0 27test02
19 Q0 12214898 890 3.0 27test02
19 Q0 12215259 891 3.0 27test02
19 Q0 12215267 892 3.0 27test02
19 Q0 12215270 893 3.0 27test02
19 Q0 12215429 894 3.0 27test02
19 Q0 12215435 895 3.0 27test02
19 Q0 12215437 896 3.0 27test02
19 Q0 12215442 897 3.0 27test02
19 Q0 12215445 898 3.0 27test02
19 Q0 12215452 899 3.0 27test02
19 Q0 12215486 900 3.0 27test02
19 Q0 12215492 901 3.0 27test02
19 Q0 12215494 902 3.0 27test02
19 Q0 12215497 903 3.0 27test02
19 Q0 12215514 904 3.0 27test02
19 Q0 12215523 905 3.0 27test02
19 Q0 12215536 906 3.0 27test02
19 Q0 12215545 907 3.0 27test02
19 Q0 12215646 908 3.0 27test02
19 Q0 12215665 909 3.0 27test02
19 Q0 12215668 910 3.0 27test02
19 Q0 12216072 911 3.0 27test02
19 Q0 12216086 912 3.0 27test02
19 Q0 12217078 913 3.0 27test02
19 Q0 12217327 914 3.0 27test02
19 Q0 12217645 915 3.0 27test02
19 Q0 1221

23 Q0 11964166 509 4.0 27test02
23 Q0 11964173 510 4.0 27test02
23 Q0 11965833 511 4.0 27test02
23 Q0 11966980 512 4.0 27test02
23 Q0 11967314 513 4.0 27test02
23 Q0 11968009 514 4.0 27test02
23 Q0 11969291 515 4.0 27test02
23 Q0 11970610 516 4.0 27test02
23 Q0 11971000 517 4.0 27test02
23 Q0 11971018 518 4.0 27test02
23 Q0 11971025 519 4.0 27test02
23 Q0 11971947 520 4.0 27test02
23 Q0 11971984 521 4.0 27test02
23 Q0 11972612 522 4.0 27test02
23 Q0 11974442 523 4.0 27test02
23 Q0 11976336 524 4.0 27test02
23 Q0 11976507 525 4.0 27test02
23 Q0 11976787 526 4.0 27test02
23 Q0 11976831 527 4.0 27test02
23 Q0 11977978 528 4.0 27test02
23 Q0 11978095 529 4.0 27test02
23 Q0 11978101 530 4.0 27test02
23 Q0 11978122 531 4.0 27test02
23 Q0 11978487 532 4.0 27test02
23 Q0 11978788 533 4.0 27test02
23 Q0 11978790 534 4.0 27test02
23 Q0 11978820 535 4.0 27test02
23 Q0 11978821 536 4.0 27test02
23 Q0 11979549 537 4.0 27test02
23 Q0 11980680 538 4.0 27test02
23 Q0 1198

26 Q0 12215540 318 4.0 27test02
26 Q0 12215545 319 4.0 27test02
26 Q0 12215646 320 4.0 27test02
26 Q0 12216102 321 4.0 27test02
26 Q0 12217005 322 4.0 27test02
26 Q0 12217202 323 4.0 27test02
26 Q0 12217688 324 4.0 27test02
26 Q0 12217750 325 4.0 27test02
26 Q0 12218065 326 4.0 27test02
26 Q0 12218185 327 4.0 27test02
26 Q0 12218351 328 4.0 27test02
26 Q0 12218411 329 4.0 27test02
26 Q0 12219002 330 4.0 27test02
26 Q0 12219004 331 4.0 27test02
26 Q0 12219019 332 4.0 27test02
26 Q0 12220227 333 4.0 27test02
26 Q0 12220382 334 4.0 27test02
26 Q0 12220504 335 4.0 27test02
26 Q0 12220594 336 4.0 27test02
26 Q0 12221002 337 4.0 27test02
26 Q0 12221096 338 4.0 27test02
26 Q0 12221100 339 4.0 27test02
26 Q0 12221101 340 4.0 27test02
26 Q0 12221103 341 4.0 27test02
26 Q0 12221292 342 4.0 27test02
26 Q0 12223407 343 4.0 27test02
26 Q0 12223467 344 4.0 27test02
26 Q0 12223550 345 4.0 27test02
26 Q0 12224561 346 4.0 27test02
26 Q0 12224947 347 4.0 27test02
26 Q0 1222

29 Q0 11696844 530 1.0 27test02
29 Q0 11696845 531 1.0 27test02
29 Q0 11696846 532 1.0 27test02
29 Q0 11696847 533 1.0 27test02
29 Q0 11696848 534 1.0 27test02
29 Q0 11696849 535 1.0 27test02
29 Q0 11696850 536 1.0 27test02
29 Q0 11696851 537 1.0 27test02
29 Q0 11696852 538 1.0 27test02
29 Q0 11696853 539 1.0 27test02
29 Q0 11696854 540 1.0 27test02
29 Q0 11696855 541 1.0 27test02
29 Q0 11696943 542 1.0 27test02
29 Q0 11696944 543 1.0 27test02
29 Q0 11696945 544 1.0 27test02
29 Q0 11696946 545 1.0 27test02
29 Q0 11696947 546 1.0 27test02
29 Q0 11696949 547 1.0 27test02
29 Q0 11696950 548 1.0 27test02
29 Q0 11696951 549 1.0 27test02
29 Q0 11696952 550 1.0 27test02
29 Q0 11696953 551 1.0 27test02
29 Q0 11696954 552 1.0 27test02
29 Q0 11696956 553 1.0 27test02
29 Q0 11696957 554 1.0 27test02
29 Q0 11696958 555 1.0 27test02
29 Q0 11696959 556 1.0 27test02
29 Q0 11697173 557 1.0 27test02
29 Q0 11697174 558 1.0 27test02
29 Q0 11697175 559 1.0 27test02
29 Q0 1169

33 Q0 12231239 155 5.0 27test02
33 Q0 12231382 156 5.0 27test02
33 Q0 12233818 157 5.0 27test02
33 Q0 12234961 158 5.0 27test02
33 Q0 12239161 159 5.0 27test02
33 Q0 12244185 160 5.0 27test02
33 Q0 12270121 161 5.0 27test02
33 Q0 12325485 162 5.0 27test02
33 Q0 12351674 163 5.0 27test02
33 Q0 11925197 164 5.0 27test02
33 Q0 11926268 165 5.0 27test02
33 Q0 11931839 166 5.0 27test02
33 Q0 11932268 167 5.0 27test02
33 Q0 11934484 168 5.0 27test02
33 Q0 11934809 169 5.0 27test02
33 Q0 11934812 170 5.0 27test02
33 Q0 11943778 171 5.0 27test02
33 Q0 11944922 172 5.0 27test02
33 Q0 11947902 173 5.0 27test02
33 Q0 11950969 174 5.0 27test02
33 Q0 11956332 175 5.0 27test02
33 Q0 11992625 176 5.0 27test02
33 Q0 12003339 177 5.0 27test02
33 Q0 12003834 178 5.0 27test02
33 Q0 12003937 179 5.0 27test02
33 Q0 12005461 180 5.0 27test02
33 Q0 12012251 181 5.0 27test02
33 Q0 12018594 182 5.0 27test02
33 Q0 12020075 183 5.0 27test02
33 Q0 12117103 184 5.0 27test02
33 Q0 1212

36 Q0 11909680 324 3.0 27test02
36 Q0 11909870 325 3.0 27test02
36 Q0 11910118 326 3.0 27test02
36 Q0 11910128 327 3.0 27test02
36 Q0 11910129 328 3.0 27test02
36 Q0 11912489 329 3.0 27test02
36 Q0 11914076 330 3.0 27test02
36 Q0 11914382 331 3.0 27test02
36 Q0 11914720 332 3.0 27test02
36 Q0 11917141 333 3.0 27test02
36 Q0 11919704 334 3.0 27test02
36 Q0 11920702 335 3.0 27test02
36 Q0 11920714 336 3.0 27test02
36 Q0 11920729 337 3.0 27test02
36 Q0 11922104 338 3.0 27test02
36 Q0 12125274 339 3.0 27test02
36 Q0 12126229 340 3.0 27test02
36 Q0 12130631 341 3.0 27test02
36 Q0 12134149 342 3.0 27test02
36 Q0 12134158 343 3.0 27test02
36 Q0 12134162 344 3.0 27test02
36 Q0 12135352 345 3.0 27test02
36 Q0 12135477 346 3.0 27test02
36 Q0 12135924 347 3.0 27test02
36 Q0 12136015 348 3.0 27test02
36 Q0 12136016 349 3.0 27test02
36 Q0 12136017 350 3.0 27test02
36 Q0 12136018 351 3.0 27test02
36 Q0 12136020 352 3.0 27test02
36 Q0 12136413 353 3.0 27test02
36 Q0 1213

39 Q0 11796036 762 2.0 27test02
39 Q0 11796038 763 2.0 27test02
39 Q0 11796729 764 2.0 27test02
39 Q0 11798069 765 2.0 27test02
39 Q0 11801730 766 2.0 27test02
39 Q0 11803572 767 2.0 27test02
39 Q0 11804780 768 2.0 27test02
39 Q0 11804781 769 2.0 27test02
39 Q0 11804783 770 2.0 27test02
39 Q0 11804786 771 2.0 27test02
39 Q0 11804790 772 2.0 27test02
39 Q0 11804792 773 2.0 27test02
39 Q0 11804965 774 2.0 27test02
39 Q0 11805058 775 2.0 27test02
39 Q0 11805330 776 2.0 27test02
39 Q0 11806636 777 2.0 27test02
39 Q0 11806825 778 2.0 27test02
39 Q0 11808875 779 2.0 27test02
39 Q0 11809726 780 2.0 27test02
39 Q0 11809824 781 2.0 27test02
39 Q0 11809830 782 2.0 27test02
39 Q0 11810291 783 2.0 27test02
39 Q0 11812047 784 2.0 27test02
39 Q0 11812793 785 2.0 27test02
39 Q0 11812999 786 2.0 27test02
39 Q0 11814405 787 2.0 27test02
39 Q0 11814569 788 2.0 27test02
39 Q0 11818059 789 2.0 27test02
39 Q0 11818063 790 2.0 27test02
39 Q0 11818064 791 2.0 27test02
39 Q0 1181

42 Q0 11768306 751 2.0 27test02
42 Q0 11768308 752 2.0 27test02
42 Q0 11768313 753 2.0 27test02
42 Q0 11768656 754 2.0 27test02
42 Q0 11770103 755 2.0 27test02
42 Q0 11771750 756 2.0 27test02
42 Q0 11773052 757 2.0 27test02
42 Q0 11773058 758 2.0 27test02
42 Q0 11775060 759 2.0 27test02
42 Q0 11781497 760 2.0 27test02
42 Q0 11782545 761 2.0 27test02
42 Q0 11782949 762 2.0 27test02
42 Q0 11782951 763 2.0 27test02
42 Q0 11783006 764 2.0 27test02
42 Q0 11786535 765 2.0 27test02
42 Q0 11792846 766 2.0 27test02
42 Q0 11792856 767 2.0 27test02
42 Q0 11792861 768 2.0 27test02
42 Q0 11794784 769 2.0 27test02
42 Q0 11794794 770 2.0 27test02
42 Q0 11796036 771 2.0 27test02
42 Q0 11796038 772 2.0 27test02
42 Q0 11796729 773 2.0 27test02
42 Q0 11798069 774 2.0 27test02
42 Q0 11801730 775 2.0 27test02
42 Q0 11803572 776 2.0 27test02
42 Q0 11804780 777 2.0 27test02
42 Q0 11804781 778 2.0 27test02
42 Q0 11804783 779 2.0 27test02
42 Q0 11804786 780 2.0 27test02
42 Q0 1180

46 Q0 12489171 374 3.0 27test02
46 Q0 12492423 375 3.0 27test02
46 Q0 12499268 376 3.0 27test02
46 Q0 12500631 377 3.0 27test02
46 Q0 12502493 378 3.0 27test02
46 Q0 12503676 379 3.0 27test02
46 Q0 12504083 380 3.0 27test02
46 Q0 12504893 381 3.0 27test02
46 Q0 12507466 382 3.0 27test02
46 Q0 12508276 383 3.0 27test02
46 Q0 12519948 384 3.0 27test02
46 Q0 12520032 385 3.0 27test02
46 Q0 12525161 386 3.0 27test02
46 Q0 12540825 387 3.0 27test02
46 Q0 12543784 388 3.0 27test02
46 Q0 12545684 389 3.0 27test02
46 Q0 12552088 390 3.0 27test02
46 Q0 12553872 391 3.0 27test02
46 Q0 12559564 392 3.0 27test02
46 Q0 12559914 393 3.0 27test02
46 Q0 12574425 394 3.0 27test02
46 Q0 12576549 395 3.0 27test02
46 Q0 12594949 396 3.0 27test02
46 Q0 12594956 397 3.0 27test02
46 Q0 12598614 398 3.0 27test02
46 Q0 12610304 399 3.0 27test02
46 Q0 12629179 400 3.0 27test02
46 Q0 12645611 401 3.0 27test02
46 Q0 11694004 402 2.0 27test02
46 Q0 11694030 403 2.0 27test02
46 Q0 1169

50 Q0 11798802 179 3.0 27test02
50 Q0 11800089 180 3.0 27test02
50 Q0 11801366 181 3.0 27test02
50 Q0 11801372 182 3.0 27test02
50 Q0 11801590 183 3.0 27test02
50 Q0 11801595 184 3.0 27test02
50 Q0 11801740 185 3.0 27test02
50 Q0 11802805 186 3.0 27test02
50 Q0 11802813 187 3.0 27test02
50 Q0 11803201 188 3.0 27test02
50 Q0 11803373 189 3.0 27test02
50 Q0 11803552 190 3.0 27test02
50 Q0 11804498 191 3.0 27test02
50 Q0 11804533 192 3.0 27test02
50 Q0 11804624 193 3.0 27test02
50 Q0 11804955 194 3.0 27test02
50 Q0 11804972 195 3.0 27test02
50 Q0 11805085 196 3.0 27test02
50 Q0 11805244 197 3.0 27test02
50 Q0 11806825 198 3.0 27test02
50 Q0 11807118 199 3.0 27test02
50 Q0 11807127 200 3.0 27test02
50 Q0 11807581 201 3.0 27test02
50 Q0 11807948 202 3.0 27test02
50 Q0 11808895 203 3.0 27test02
50 Q0 11809768 204 3.0 27test02
50 Q0 11809818 205 3.0 27test02
50 Q0 11810020 206 3.0 27test02
50 Q0 11810183 207 3.0 27test02
50 Q0 11810318 208 3.0 27test02
50 Q0 1181

## Exercise 03.B: _Language models_

Custom similarities can be configured by tuning the parameters of the built-in similarities. Read more about these (expert) options in the [similarity module](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules-similarity.html).

> Tip: the example similarity settings have to be used in a `"settings"` object.
> Check your settings and mappings with: `es.indices.get(index='genomics')`.

Make a run that uses Language Models with Jelinek-Mercer smoothing (linear interpolation smoothing) on the field `"all"`, which indexes the fields `"TI"` and `"AB"`. Use the parameter `lambda=0.2`. Evaluate the run again using `print_trec_eval`.

In [18]:
#THIS IS GRADED!
lmjelinekmercer = {
    # BEGIN ANSWER
    "settings": {
        # a single shard, so we do not suffer from approximate document frequencies
        "number_of_shards": 1,
        "index": {
            "similarity": {
                "lmjm": {
                    "type": "LMJelinekMercer",
                    "lambda": "0.2"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "AB": {
                "type": "text",
                "copy_to": "all"
            },
            "TI": {
                "type": "text",
                "copy_to": "all"
            },
            "all": {
                "type": "text",
                "similarity": "lmjm"
            }
        }
    }
    # END ANSWER
}

index_documents(es, 'data/trec-medline.json', 'genomics', body=lmjelinekmercer)
make_trec_run(es, 'data/training-queries-simple.txt', 'lmjelinekmercer.run', run_name="test_03")
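
To make the role of `lambda` concrete, the sketch below scores a toy document with a Jelinek-Mercer smoothed query likelihood. It assumes the convention in which `lambda` is the weight of the collection model, i.e. P(t|d) = (1 - lambda) * tf/|d| + lambda * cf/|C|; some textbooks swap the roles of lambda and 1 - lambda. The function name `jm_log_likelihood` and all counts are hypothetical, and the resulting value is not expected to equal the score ElasticSearch reports, which uses its own formulation of the model.

```python
import math

def jm_log_likelihood(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.2):
    """Log query likelihood with Jelinek-Mercer (linear interpolation) smoothing.

    Assumption: lam is the weight of the collection model.
    """
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0) / doc_len      # maximum likelihood in the document
        p_coll = coll_tf.get(term, 0) / coll_len   # maximum likelihood in the collection
        p = (1 - lam) * p_doc + lam * p_coll       # linear interpolation of both models
        if p == 0:
            return float("-inf")                   # term unseen in the whole collection
        score += math.log(p)
    return score

# Toy counts, made up for illustration only.
doc_tf = {"curve": 2, "fitting": 1}
coll_tf = {"curve": 50, "fitting": 30, "gene": 400}

score_both = jm_log_likelihood(["curve", "fitting"], doc_tf, 10, coll_tf, 10000)
score_miss = jm_log_likelihood(["curve", "gene"], doc_tf, 10, coll_tf, 10000)
print(score_both, score_miss)
```

The second query contains a term that does not occur in the document. Thanks to smoothing it still receives a finite score, although a lower one than the query whose terms all occur in the document.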

## Exercise 03.C: _Model comparison_
Compute the results of the `lmjelinekmercer.run` and compare them with those of the `baseline.run` and `boolean.run`. Performing statistical tests may help strengthen your claims.

In [None]:
#THIS IS GRADED!

# your comments here
# BEGIN ANSWER
# END ANSWER
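
One possible way to test whether a difference between two runs is significant is a paired randomization (sign-flip) test on per-query scores. The sketch below is a minimal version using only the standard library. The function name `paired_sign_flip_test` is hypothetical, and the per-query average precision values are made up for illustration; they are not results of any run in this assignment.

```python
from itertools import product

def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided paired randomization test on the mean per-query difference."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    total = 0
    # Under the null hypothesis the sign of each per-query difference is arbitrary,
    # so we enumerate every sign assignment (2**n of them).
    for signs in product((1, -1), repeat=n):
        mean = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if mean >= observed - 1e-12:   # small tolerance for floating point rounding
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical per-query average precision values, for illustration only.
ap_baseline = [0.20, 0.35, 0.10, 0.50, 0.42, 0.18, 0.27, 0.33]
ap_lm = [0.25, 0.40, 0.12, 0.58, 0.47, 0.25, 0.30, 0.41]

p_value = paired_sign_flip_test(ap_lm, ap_baseline)
print(f"p = {p_value:.4f}")
```

Exhaustive enumeration is only feasible for a small number of queries. For larger query sets one would sample random sign assignments instead, or use a paired t-test such as `scipy.stats.ttest_rel`.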

In [None]:
! head -20 baseline.run
! echo "\n"
! head -20 lmjelinekmercer.run

## Exercise 03.D: _Implement your own similarity measure (Bonus)_ 

So far we have only seen the result of applying the analyzer to queries. The analyzer output for the _documents_ is available through the `termvectors()` function, shown below for document `id=1`. The same call can also return overall field statistics, such as the number of documents.

> First, index the collection again. While waiting, have a coffee or tea :) 

> `id=1` refers to the internal document identifier, not to the PubMed identifier.

_The bonus exercise is not mandatory. It can compensate for lower grades of other exercises._

In [None]:
index_documents(es, 'data/trec-medline.json', 'genomics')

es.termvectors(index="genomics", id="1", fields="TI", 
               term_statistics=True, field_statistics=True, offsets=False)
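
The response of `termvectors()` is a nested dictionary. The sketch below uses a hypothetical, heavily shortened response (all numbers are made up, and per-term position information is omitted) to show where the statistics needed for a similarity measure can be found. The variable names `n_docs`, `avg_len` and `doc_len` are illustrative choices.

```python
# Hypothetical response in the shape returned by es.termvectors() when
# term_statistics and field_statistics are enabled. All numbers are made up.
response = {
    "term_vectors": {
        "TI": {
            "field_statistics": {"sum_doc_freq": 900, "doc_count": 100, "sum_ttf": 1000},
            "terms": {
                "curve": {"doc_freq": 4, "ttf": 5, "term_freq": 1},
                "fitting": {"doc_freq": 2, "ttf": 2, "term_freq": 1},
            },
        }
    }
}

field = response["term_vectors"]["TI"]
stats = field["field_statistics"]

n_docs = stats["doc_count"]                      # documents that contain the field
avg_len = stats["sum_ttf"] / stats["doc_count"]  # average field length in tokens
doc_len = sum(t["term_freq"] for t in field["terms"].values())  # length of this document's field

for term, info in field["terms"].items():
    # term_freq: occurrences in this document; doc_freq: documents containing the term
    print(term, info["term_freq"], info["doc_freq"])
```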

### Implement the BM25 similarity

Complete the function `bm25_similarity()` below by implementing the BM25 similarity as described in Section 11.4.3 of [Manning, Raghavan and Schuetze, Chapter 11](https://nlp.stanford.edu/IR-book/pdf/11prob.pdf). Are you able to replicate the score of ElasticSearch (15.472)? If not, are you using a different variant of the BM25 model?
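
For reference, the BM25 retrieval status value for short queries in the cited section can be written as follows, where $N$ is the number of documents, $\mathrm{df}_t$ the document frequency of term $t$, $\mathrm{tf}_{t,d}$ its frequency in document $d$, $L_d$ the document length, $L_{ave}$ the average document length, and $k_1$ and $b$ the tuning parameters:

$$
RSV_d = \sum_{t \in q} \log\left[\frac{N}{\mathrm{df}_t}\right] \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}{k_1\left((1 - b) + b \cdot \frac{L_d}{L_{ave}}\right) + \mathrm{tf}_{t,d}}
$$

Implementations often differ in details. As one example, Lucene, on which ElasticSearch is built, uses a smoothed inverse document frequency:

$$
\mathrm{idf}_t = \ln\left(1 + \frac{N - \mathrm{df}_t + 0.5}{\mathrm{df}_t + 0.5}\right)
$$

Which variant and which values of $k_1$ and $b$ apply to your index is something to verify against the output of `es.explain()`; the formulas here are a starting point, not a statement of what your installation computes.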

In [None]:
#THIS IS GRADED!

import math

# math.log(x) computes the natural logarithm of x

def bm25_similarity (query, doc_id):

    # Get the query tokens (see above)
    query_tokens = es.indices.analyze(index='genomics', body={"field":"TI", "text": query})
    tokens = query_tokens['tokens']

    # Get the term vector for doc_id and the field statistics
    term_vector = es.termvectors(index="genomics", id=doc_id, fields="TI", 
                  term_statistics=True, field_statistics=True, offsets=False)
    vector = term_vector['term_vectors']['TI']['terms']
    f_stats = term_vector['term_vectors']['TI']['field_statistics']

    # The answer should sum over 'tokens', check whether each token exists in the 'vector',
    # and if so, add the appropriate value to 'similarity'.
    # Tip: add print statements to your code to see what each variable contains.
    
    similarity = 0

    # BEGIN ANSWER
    # END ANSWER
    return similarity

bm25_similarity("curve fitting", 1)

See below for the 'reference score' computed by ElasticSearch:

In [None]:
body = {
  "query": {
    "match" : { "TI" : "curve fitting" }
  }
}
explain = es.explain(index="genomics", id="1", body=body)
print (explain['explanation']['value'])  # BM25 score computed by ElasticSearch