# Assignment 1 - _Foundations of Information Retrieval '20/'21_

This assignment is divided into 3 parts, which must all be submitted together via Canvas before 04/10/2020 (strictly: no extensions will be granted!). Submitting the assignment solutions is mandatory.

We will use [ElasticSearch](https://www.elastic.co/) as the search engine, as it provides state-of-the-art tools to implement your own engine, and lets you focus on methodological aspects of search implementation and optimization.

The assignment is about text-based Information Retrieval and it is structured in three parts:
1. IR performance evaluation (implementation of performance metrics)
2. Setting up a search engine, pre-processing and indexing using ElasticSearch (Indexing, Analyzers)
3. Implementation and optimization of a model of search using ElasticSearch (Similarity measures)


This assignment file contains exercises, marked with the section title __Exercise 01.(x)__, which are evaluated, and other sections that contain support code which you should use as it is. Write your answers between the comments `BEGIN ANSWER` and `END ANSWER`.
Try to complete the solutions for all the exercise sections. 

_Note:_ we leave the comment `#THIS IS GRADED!` in the sections that will be considered for evaluation and grading.


### Initial preparation (self-study)
It is good to acquire basic knowledge of Python (or refresh it a bit).
For the second and third part of the assignment, please study the [Getting Started guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html) of ElasticSearch on your own and get acquainted with the framework.


# PART 01 - Performance evaluation


### Background information and reading
To solve the exercises in Part 01, study the slides of Lecture 01 (available on Canvas) and the reference book chapter (Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, [Chapter 8, Evaluation in information retrieval](http://nlp.stanford.edu/IR-book/pdf/08eval.pdf), Cambridge University Press. 2008)

### Basic concepts
Suppose the set of relevant documents (the document identifiers - _doc-IDs_) is called `relevant`, then we might define those as follows (in Python):

In [2]:
relevant = set([2, 3, 5, 8, 13, 21])

A perfect run would retrieve exactly these 6 documents in any order. Now, suppose the list of retrieved documents (the document identifiers - _doc-IDs_) is called `retrieved`, and contains the following _doc-IDs_:

In [3]:
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

One of the simplest evaluation measures we can think of is the _Success at rank 1_. The measure answers the question: Was the first document retrieved a relevant document? _Success at rank 1_ returns 1 if the first document is relevant, and 0 otherwise. A possible implementation is: 

In [19]:
def success_at_1 (relevant, retrieved):
    if len(retrieved) > 0 and retrieved[0] in relevant:
        return 1
    else:
        return 0

success_at_1(relevant, retrieved)


The first retrieved document ID is 4, which is not in the set of relevant documents, so the score is 0.

Note how easy it is to check if an item occurs in a Python set or list by using the keyword: `in`. Similarly, you can loop over all items in a set or list with: 
`for doc in retrieved:`, 
where doc will refer to each item in the set or list. 

Be sure to use the internet to sharpen your knowledge about Python constructs, for instance on [Python list slicing](https://duckduckgo.com/?q=python+list+slicing). Also note that the code above checks that at least one document is retrieved, to avoid an index out of bounds exception (i.e. we avoid accessing an empty list).

> __Suggestion: to make sure your implementations of the requested performance metrics are correct, compute their values by hand and compare them with the output of your functions. This is important, as you will use these metrics in later exercises and to compare different models.__
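
One way to follow this suggestion is to work out the expected values for the running example with plain set operations and exact fractions, independently of the functions you write. The sketch below does this for precision, recall and the balanced F measure (defined in the exercises that follow). The names `hits`, `expected_precision`, `expected_recall` and `expected_f` are illustrative only and not part of the assignment.

```python
from fractions import Fraction

relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

# Documents that are both relevant and retrieved.
hits = relevant.intersection(retrieved)

# Exact fractions make it easy to compare against a calculation on paper.
expected_precision = Fraction(len(hits), len(retrieved))
expected_recall = Fraction(len(hits), len(relevant))
expected_f = (2 * expected_precision * expected_recall
              / (expected_precision + expected_recall))

print(sorted(hits))          # [2, 3, 8]
print(expected_precision)    # 3/11
print(expected_recall)       # 1/2
print(expected_f)            # 6/17
```

Calling `float()` on each fraction gives the number to compare with the value returned by your own function.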

## Exercise 01.A: _Success at k_
The measure _Success at k_ returns 1 if a relevant document is among the first _k_ documents retrieved and zero otherwise. Implement _Success at 5_ below.

> Success at _k_ measures are well-suited in cases where there is typically only one relevant document (or retrieving one relevant document is enough).

In [None]:
#THIS IS GRADED!


def success_at_5(relevant, retrieved):
    # BEGIN ANSWER
    for k in retrieved[:5]:
        if k in relevant:
            return 1        
    return 0
    # END ANSWER
    
success_at_5(relevant, retrieved)

Similarly, implement _Success at 10_.

In [None]:
#THIS IS GRADED!

def success_at_10(relevant, retrieved):
    # BEGIN ANSWER   
    for k in retrieved[:10]:
        if k in relevant:
            return 1   
    return 0
    # END ANSWER
    
success_at_10(relevant, retrieved)
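
Since the two functions above differ only in the cut-off, the same idea can be written once with _k_ as a parameter. This is a minimal sketch, not a replacement for the graded functions; the name `success_at_k` is an assumption introduced here.

```python
def success_at_k(relevant, retrieved, k):
    """Return 1 if any of the first k retrieved documents is relevant, else 0."""
    return int(any(doc in relevant for doc in retrieved[:k]))

relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

print(success_at_k(relevant, retrieved, k=1))   # 0: document 4 is not relevant
print(success_at_k(relevant, retrieved, k=5))   # 1: document 2 appears at rank 2
```

Slicing with `retrieved[:k]` is safe even when fewer than _k_ documents were retrieved, so no separate length check is needed.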

## Exercise 01.B: _Precision, Recall and F-measure_
Implement _Precision_ using Formula 8.1 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

_Hint:_ one can count the number of documents in a list by using the built-in Python function [len()](https://docs.python.org/3/library/functions.html#len) (e.g. `len(retrieved)` for the number of retrieved documents). 

In [None]:
#THIS IS GRADED!

def precision(relevant, retrieved):
    # BEGIN ANSWER
    if not retrieved:
        return 1
    
    relevant_and_retrieved = [k for k in retrieved if k in relevant]
    return len(relevant_and_retrieved) / len(retrieved)
    # END ANSWER
    
precision(relevant, retrieved)

Implement _Recall_ using Formula 8.2 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

In [None]:
#THIS IS GRADED!

def recall(relevant, retrieved):
    # BEGIN ANSWER
    if not relevant:
        return 1
    
    relevant_and_retrieved = [k for k in retrieved if k in relevant]
    return len(relevant_and_retrieved) / len(relevant)
    # END ANSWER
    
recall(relevant, retrieved)

The balanced F measure (_F_ with β=1) is defined as the harmonic mean of precision and
recall. Implement _F_ using Formula 8.6 from [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

> Tip: reuse your implementations of precision and recall

In [None]:
#THIS IS GRADED!

def f_measure(relevant, retrieved):
    # BEGIN ANSWER
    P = precision(relevant, retrieved)
    R = recall(relevant, retrieved)
    # Avoid division by zero when both precision and recall are 0.
    if P + R == 0:
        return 0
    return 2 * P * R / (P + R)
    # END ANSWER
    
f_measure(relevant, retrieved)

## Exercise 01.C: _Precision at rank k_ and  _R-Precision_

Precision, Recall and F are _set_-based measures and suited for unranked lists of documents. If our search system returns a ranked _list_ of results, we can measure precision for several cut-off levels _k_ in the ranked list, i.e. we evaluate the relevance of the TOP-_k_ retrieved documents (see lecture slides and the related book chapter). 
We did this before with the _Success at rank 5_ measure for _k_=5.

Implement below the function `precision_at_k()` that measures the precision at rank _k_

> Interesting fact: For _k_=1, the _Precision at rank 1_ would be the same as _Success at rank 1_ (why?) - Because the single document at rank 1 is either relevant (1 out of 1 correct) or not (0 out of 1 correct). Therefore the precision must be 1 or 0.

In [None]:
#THIS IS GRADED!

def precision_at_k(relevant, retrieved, k):
    # BEGIN ANSWER
    return precision(relevant, retrieved[:k])
    # END ANSWER
    
precision_at_k(relevant, retrieved, k=1)

Implement R-Precision (function `r_precision()`) as defined on Page 161 of [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

In [None]:
#THIS IS GRADED!

def r_precision(relevant, retrieved):
    # BEGIN ANSWER
    k = len(relevant)
    return precision(relevant, retrieved[:k])
    # END ANSWER
    
r_precision(relevant, retrieved)

## Exercise 01.D:  Interpolated precision at _recall_ X

Another way to address ranked retrieval is to measure precision for several _recall_ levels _X_.

Implement the function `interpolated_precision_at_recall_X()` that measures the interpolated precision at recall level _X_ as defined by Formula 8.7 of [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book).

> Tip: calculate for each rank the recall. If the recall is greater than or equal to X, 
> calculate the precision. Keep the highest (maximum) precision of those to be returned at the end.

In [None]:
#THIS IS GRADED!

def interpolated_precision_at_recall_X (relevant, retrieved, X):
    # BEGIN ANSWER
    # The interpolated precision at recall X is undefined where the max recall for the retrieved set does not reach X.
    if recall(relevant, retrieved) < X:
        return 0
    
    P = 0
    # Loop through each rank k = 1..len(retrieved); retrieved[:k] holds the top k documents.
    for k in range(1, len(retrieved) + 1):
        if recall(relevant, retrieved[:k]) >= X:
            P = max(P, precision_at_k(relevant, retrieved, k))
    
    return P
    # END ANSWER
    
interpolated_precision_at_recall_X(relevant, retrieved, X=0.1) 
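
To check such an implementation by hand, it can help to print recall and precision at every rank of the running example and read off the maximum. The sketch below is a stand-alone cross-check under the assumption that `relevant` is not empty; `rank_table` and `interpolated_precision_sketch` are names introduced here for illustration.

```python
relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

def rank_table(relevant, retrieved):
    """Return (rank, recall, precision) for every rank 1..len(retrieved)."""
    rows = []
    hits = 0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
        rows.append((rank, hits / len(relevant), hits / rank))
    return rows

def interpolated_precision_sketch(relevant, retrieved, X):
    """Highest precision over all ranks whose recall is at least X (0 if none)."""
    candidates = [p for _, r, p in rank_table(relevant, retrieved) if r >= X]
    return max(candidates, default=0)

for rank, rec, prec in rank_table(relevant, retrieved):
    print(f"rank {rank:2d}  recall {rec:.3f}  precision {prec:.3f}")

print(interpolated_precision_sketch(relevant, retrieved, X=0.1))   # 0.5, reached at rank 2
```

Ranks start at 1 here, so the top-_k_ list at rank _k_ always contains the document retrieved at that rank.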

## Exercise 01.E:  _Average Precision_

For a single information need, _Average Precision_ is the average of the precision value obtained for the set of top k documents existing after each relevant document is retrieved (see [Manning, Raghavan and Schütze](http://nlp.stanford.edu/IR-book), Pages 159 and 160). Implement _Average Precision_ for a single information need. 

In [None]:
#THIS IS GRADED!

def average_precision(relevant, retrieved):
    # BEGIN ANSWER
    # Assumption: with no relevant documents, return 0 to avoid dividing by zero.
    if not relevant:
        return 0

    # Initialise list of precisions with a zero for each relevant document.
    P = [0] * len(relevant)
    
    for i, doc in enumerate(relevant):
        # If a relevant document is not retrieved, the precision value is taken to be zero. 
        if doc not in retrieved:
            P[i] = 0
        else:
            # Rank (1-based) at which doc is retrieved, so that the top k documents include doc itself.
            k = retrieved.index(doc) + 1
            P[i] = precision_at_k(relevant, retrieved, k)
    
    # Return the average precision
    return sum(P)/len(P)
    # END ANSWER

average_precision(relevant, retrieved)
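
As a cross-check, Average Precision can also be computed in a single pass over the ranked list, keeping a running count of relevant documents seen so far. This is a sketch under two assumptions: every doc-ID appears at most once in `retrieved`, and an empty `relevant` set scores 0. The name `average_precision_sketch` is introduced here for illustration.

```python
def average_precision_sketch(relevant, retrieved):
    """Average of the precision at each relevant hit; unretrieved relevant docs count as 0."""
    if not relevant:
        return 0.0  # assumption: no relevant documents means a score of 0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            # Precision of the top `rank` documents, which include this hit.
            precision_sum += hits / rank
    # Dividing by len(relevant) makes missing relevant documents contribute 0.
    return precision_sum / len(relevant)

relevant = {2, 3, 5, 8, 13, 21}
retrieved = [4, 2, 18, 16, 8, 46, 32, 22, 47, 39, 3]

print(round(average_precision_sketch(relevant, retrieved), 4))   # 0.1955
```

For the running example the relevant hits are at ranks 2, 5 and 11, so the value is (1/2 + 2/5 + 3/11) / 6.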

## Measures in TREC 

The relevance judgments are provided by TREC in so-called _"qrels"_ files that look as follows:

    1000 Q0 1341 1
    1000 Q0 1231 0
    1001 Q0 12332 1
     ...

The first column is the query identifier, while the second column is the query number within that topic (it is currently unused and should always be Q0). The third column is the document identifier that was examined by the judges. The fourth column is the relevance of the document (_1_ means the document was relevant and _0_ means the document was not relevant).

Below we provide some Python code that reads the _qrels_ and the _run_. The qrels will be put in the Python dictionary `all_relevant`. A [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) provides quick lookup of a set of values given a key. We will use the `query_id` as the key, and a [set](https://docs.python.org/3/tutorial/datastructures.html#sets) of relevant document identifiers as the value. Documents judged not relevant (such as `1231` above) are not added to the set, so for the partial qrels file above `all_relevant` would look as follows:

    {
        "1000": set(["1341"]),
        "1001": set(["12332"])
    }
    
We will use a dictionary called `all_retrieved` with `query_id` as key, and as value a [Python list](https://docs.python.org/3/tutorial/introduction.html#lists) of document identifiers retrieved by the IR system:

    {
        "1000": ["1341", "12346", "2345"],
        "1001": [..., ..., ...],
        ...
    }

Note that, with this data structure, for each `query_id` we can easily access the list of retrieved and relevant documents, and compute the performance metrics. We can then average these measures over all the queries to compute the mean performance of the IR system on the given retrieval task.
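
To make the mapping from qrels lines to the dictionary concrete, the three sample lines shown earlier can be parsed from an in-memory string. This is a small sketch of the idea only; `sample_qrels` and `all_relevant_sample` are names introduced here, and the parsing mirrors the four-column format described above.

```python
import io

# The three qrels lines from the example above, held in memory instead of a file.
sample_qrels = io.StringIO(
    "1000 Q0 1341 1\n"
    "1000 Q0 1231 0\n"
    "1001 Q0 12332 1\n"
)

all_relevant_sample = {}
for line in sample_qrels:
    qid, _q0, doc_id, rel = line.split()
    # Every judged query gets an entry, even if none of its documents is relevant.
    all_relevant_sample.setdefault(qid, set())
    if rel == "1":
        all_relevant_sample[qid].add(doc_id)

print(all_relevant_sample)   # {'1000': {'1341'}, '1001': {'12332'}}
```

Note that identifiers stay strings, so `"1341"` and `1341` are different keys and values.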

Please examine the code below, and make sure you understand every line.

In [None]:
def read_qrels_file(qrels_file):  # reads the content of the qrels file
    trec_relevant = dict()  # query_id -> set([docid1, docid2, ...])
    with open(qrels_file, 'r') as qrels:
        for line in qrels:
            (qid, q0, doc_id, rel) = line.strip().split()
            if qid not in trec_relevant:
                trec_relevant[qid] = set()
            if (rel == "1"):
                trec_relevant[qid].add(doc_id)
    return trec_relevant

def read_run_file(run_file):  
    # read the content of the run file produced by our IR system 
    # (in the following exercises you will create your own run_files)
    trec_retrieved = dict()  # query_id -> [docid1, docid2, ...]
    with open(run_file, 'r') as run:
        for line in run:
            (qid, q0, doc_id, rank, score, tag) = line.strip().split()
            if qid not in trec_retrieved:
                trec_retrieved[qid] = []
            trec_retrieved[qid].append(doc_id) 
    return trec_retrieved
    

def read_eval_files(qrels_file, run_file):
    return read_qrels_file(qrels_file), read_run_file(run_file)

(all_relevant, all_retrieved) = read_eval_files('data/training-qrels.txt', 'data/baselineTREC.run')

## Exercise 01.F: _number of queries_ and _number of retrieved documents per query_
 
Write the Python code that counts the number of queries in the file `baselineTREC.run` and prints the value (use the result from the cell above). 

_Hint:_ print the structure and content of the `all_retrieved` and `all_relevant` data structures to understand them better.

In [None]:
#THIS IS GRADED!

# BEGIN ANSWER

# baselineTREC.run is read into the dict all_retrieved. By finding the length, we get the number of keys (queries)
num_queries = len(all_retrieved)
print(num_queries)

# END ANSWER

Write the code that counts, for each query in your baseline run, the number of documents that were retrieved for that query (use `print()` to print the result for each `query_id`).

In [None]:
#THIS IS GRADED!

# BEGIN ANSWER
for query in all_retrieved:
    num_documents = len(all_retrieved[query])
    print("Query: ", query, "  # Documents Retrieved: ", num_documents)
# END ANSWER

## Exercise 01.G: _mean average precision_
Using the `average_precision()` function you implemented above, write the code to compute the _Mean Average Precision_ for the `baselineTREC.run` results. 

In [None]:
#THIS IS GRADED!

def mean_average_precision(all_relevant, all_retrieved):
    # BEGIN ANSWER
    
    count = len(all_retrieved)
        
    precision_per_query = [average_precision(all_relevant[query], all_retrieved[query])  for query in all_retrieved]
    total = sum(precision_per_query)
    
    # END ANSWER
    return "mean AP: ", total / count

mean_average_precision(all_relevant, all_retrieved)

## TREC evaluation

Below you find the function `mean_metric()`, which takes `all_relevant` and `all_retrieved` and computes the mean value of a measure over all queries. Its first argument, `measure`, is special: it is a function too! `mean_metric()` sums the scores of that measure over all queries and divides the total by the number of queries.

_This part will be reused later to compare the results of different models._

In [None]:
def mean_metric(measure, all_relevant, all_retrieved):
    total = 0
    count = 0
    for qid in all_relevant:
        relevant  = all_relevant[qid]
        retrieved = all_retrieved.get(qid, [])
        value = measure(relevant, retrieved)
        total += value
        count += 1
    return "mean " + measure.__name__, total / count

# Example of use of the mean_metric function: computing the average r_precision
mean_metric(r_precision, all_relevant, all_retrieved)
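
Passing a function as an argument may be unfamiliar, so here is a self-contained sketch of the same pattern on toy data. All names (`precision_at_k_sketch`, `mean_over_queries`, `toy_relevant`, `toy_retrieved`) and the data are invented for illustration. It also shows `functools.partial`, which fixes the `k` argument so that a three-argument measure fits a two-argument slot.

```python
from functools import partial

def precision_at_k_sketch(relevant, retrieved, k):
    """Fraction of the first k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

def mean_over_queries(measure, all_relevant, all_retrieved):
    """Average measure(relevant, retrieved) over every judged query."""
    scores = [measure(rel, all_retrieved.get(qid, []))
              for qid, rel in all_relevant.items()]
    return sum(scores) / len(scores)

# Toy data, invented for illustration only.
toy_relevant = {"q1": {"a", "b"}, "q2": {"c"}}
toy_retrieved = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}

# partial() fixes k, giving a two-argument function that fits the measure slot.
p_at_2 = partial(precision_at_k_sketch, k=2)
print(mean_over_queries(p_at_2, toy_relevant, toy_retrieved))   # 0.5
```

A `partial` object has no `__name__` attribute, and `mean_metric()` above reads `measure.__name__` to label its result, so with `mean_metric()` a small named wrapper function is the safer choice.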

### TREC overview of the results
The following two functions use your implementation of the metrics to create an evaluation overview of the TREC benchmark data. Take a look at the numbers and make your own interpretations of the results. 

In [None]:
def trec_eval(qrels_file, run_file):

    def precision_at_1(rel, ret): return precision_at_k(rel, ret, k=1)
    def precision_at_5(rel, ret): return precision_at_k(rel, ret, k=5)
    def precision_at_10(rel, ret): return precision_at_k(rel, ret, k=10)
    def precision_at_50(rel, ret): return precision_at_k(rel, ret, k=50)
    def precision_at_100(rel, ret): return precision_at_k(rel, ret, k=100)
    def precision_at_recall_00(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.0)
    def precision_at_recall_01(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.1)
    def precision_at_recall_02(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.2)
    def precision_at_recall_03(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.3)
    def precision_at_recall_04(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.4)
    def precision_at_recall_05(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.5)
    def precision_at_recall_06(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.6)
    def precision_at_recall_07(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.7)
    def precision_at_recall_08(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.8)
    def precision_at_recall_09(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=0.9)
    def precision_at_recall_10(rel, ret): return interpolated_precision_at_recall_X(rel, ret, X=1.0)

    (all_relevant, all_retrieved) = read_eval_files(qrels_file, run_file)
    
    unknown_qids = set(all_retrieved.keys()).difference(all_relevant.keys())
    if len(unknown_qids) > 0:
        raise ValueError("Unknown qids in run: {}".format(sorted(list(unknown_qids))))

    metrics = [success_at_1,
               success_at_5,
               success_at_10,
               r_precision,
               precision_at_1,
               precision_at_5,
               precision_at_10,
               precision_at_50,
               precision_at_100,
               precision_at_recall_00,
               precision_at_recall_01,
               precision_at_recall_02,
               precision_at_recall_03,
               precision_at_recall_04,
               precision_at_recall_05,
               precision_at_recall_06,
               precision_at_recall_07,
               precision_at_recall_08,
               precision_at_recall_09,
               precision_at_recall_10,
               average_precision]

    return [mean_metric(metric, all_relevant, all_retrieved) for metric in metrics]


def print_trec_eval(qrels_file, run_file):
    results = trec_eval(qrels_file, run_file)
    print("Results for {}".format(run_file))
    for (metric, score) in results:
        print("{:<30} {:.4}".format(metric, score))

print_trec_eval('data/training-qrels.txt', 'data/baselineTREC.run')

## Exercise 01.H: _significance testing_

Testing the statistical significance of differences in the results of different IR systems is important (see slides and course book - Section 8.8). One of the basic tests one can perform is the two-tailed [sign test](https://en.wikipedia.org/wiki/Sign_test).


For this exercise, we use the run files obtained by [Hiemstra and Aly](https://djoerdhiemstra.com/wp-content/uploads/trec2014mirex-draft.pdf) for TREC 2014. The `utbase.run` file was generated using Language Modeling, while `utexact.run` was generated using an IR system based on matching the exact query string, and ranking the documents by the number of exact matches found. The exact run improves the _Precision at 5_ to 0.456 (compared to 0.440 for the baseline run).  

Implement the code to perform the _sign test_ of statistical significance.
_Hint:_ for each sign, compute the number of queries that increase/decrease performance (called `better, worse` in the code below). How would you use these values to compute the _p_ value of the two-tailed sign test? Is the difference between _utbase_ and _utexact_ significant?
    
Answer: Conduct a binomial test where `better` is the number of successes, `worse` is the number of failures, and the null hypothesis assumes a binomial distribution with p = 0.5. 
Since the second method performs better for 9 queries and worse for 9 queries, we get a p-value of 1.0 and fail to reject the null hypothesis, i.e. the difference between _utbase_ and _utexact_ is not statistically significant.

In [None]:
#THIS IS GRADED!

def sign_test_values(measure, qrels_file, run_file_1, run_file_2):
    all_relevant = read_qrels_file(qrels_file)
    all_retrieved_1 = read_run_file(run_file_1)
    all_retrieved_2 = read_run_file(run_file_2)
    better = 0
    worse  = 0
    # BEGIN ANSWER
    
    for query in all_retrieved_1:
        performance_1 = measure(all_relevant[query], all_retrieved_1[query])
        performance_2 = measure(all_relevant[query], all_retrieved_2[query])
        
        if performance_2 > performance_1:
            better += 1
        # Exclude queries with no performance difference between the two methods.
        elif performance_2 < performance_1:
            worse += 1
    
    # END ANSWER
    return(better, worse)
    
def precision_at_rank_5(rel, ret):
    return precision_at_k(rel, ret, k=5)

sign_test_values(precision_at_rank_5, 'data/trec.qrels', 'data/utbase.run', 'data/utexact.run')

# from scipy.stats import binom_test  # removed in recent SciPy releases, where binomtest replaces it
# w = sign_test_values(precision_at_rank_5, 'data/trec.qrels', 'data/utbase.run', 'data/utexact.run')
# binom_test(w) # Accept the default arguments for the function
### Returns a p-value of 1 > 0.05, thus we fail to reject the null, i.e. the difference in performance between the two methods is not statistically significant.
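
The p-value of the two-tailed sign test can also be computed without SciPy, directly from the binomial distribution with p = 0.5. The sketch below assumes Python 3.8 or later (for `math.comb`); the function name `sign_test_p_value` is introduced here for illustration. It doubles the probability of the smaller tail and caps the result at 1.

```python
from math import comb

def sign_test_p_value(better, worse):
    """Two-tailed sign test p-value under H0: better and worse are equally likely."""
    n = better + worse          # ties are already excluded
    if n == 0:
        return 1.0
    k = min(better, worse)
    # Probability of a split at least as extreme as k on one side, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p_value(9, 9))    # 1.0
print(sign_test_p_value(10, 0))   # 0.001953125
```

With 9 queries better and 9 worse, the split is as balanced as it can be, which is why the p-value is 1.0.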


# PART 02 - Indexing and querying with ElasticSearch

### Preparation: Getting started with Elasticsearch

We strongly advise you to go through the Elasticsearch [reference guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html), and work on the tutorials. The following parts of the assignment will be based on ElasticSearch.

You can skip the section on [Installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html), as we provide it already installed in the Virtual Machine.

> If you want (disclaimer: we do __not__ give help with this!), you can 
> follow the [Installation](https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html) to run Elasticsearch on your laptop without VM. But beware, your system will now be different from the 
> ones of your colleagues and they might not be able to help you if 
> you have problems that are specific to your system, your operating
> system, or your Elasticsearch version.

### Starting/Stopping ElasticSearch
To start ElasticSearch on the virtual machine, you can type `sudo service elasticsearch start` in a Terminal.
To stop the ElasticSearch server, instead, you can type `sudo service elasticsearch stop`. Refer to [the official guide](https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html#deb-running-init) for more information.

### The REST API

Elasticsearch runs its own server that can be accessed by a regular web browser as the client, for instance by opening this link in your browser: http://localhost:9200. 

Elasticsearch will respond with something like:

    {
        "name" : "fir-machine",
        "cluster_name" : "elasticsearch",
        "cluster_uuid" : "w7SBVo1ESVivMApbLIqRvA",
        "version" : {
            "number" : "7.9.0",
            "build_flavor" : "default",
            "build_type" : "deb",
            "build_hash" : "a479a2a7fce0389512d6a9361301708b92dff667",
            "build_date" : "2020-08-11T21:36:48.204330Z",
            "build_snapshot" : false,
            "lucene_version" : "8.6.0",
            "minimum_wire_compatibility_version" : "6.8.0",
            "minimum_index_compatibility_version" : "6.0.0-beta1"
        },
        "tagline" : "You Know, for Search"
    }


If you see this, then your Elasticsearch node is up and running. The RESTful API uses simple text or JSON over HTTP. 

> REST, API, JSON, HTTP, that's a lot of abbreviations! It is good to
> be familiar with the terminology. Let us explain: The Elasticsearch
> response is not (only) intended for humans. It is supposed to be used 
> by applications that run on the client machines, and therefore the
> interface is called an Application Programming Interface (API). The 
> API uses a format called JSON (JavaScript Object Notation), which 
> can be easily read by machines (and humans). The API sends its JSON
> response using the same method as your web browser displays web
> pages. This method is called HTTP (Hyper Text Transfer Protocol), 
> and it is the reason you can inspect the response in a normal web
> browser. APIs that use HTTP are called RESTful interfaces. REST 
> stands for REpresentational State Transfer, arguably one of the
> simplest ways to define an API.
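
Because the API is plain JSON over HTTP, any program can act as the client. The sketch below uses only the Python standard library; `fetch_root_info` and `describe` are illustrative names, and the URL assumes the server runs on `localhost:9200` as described above. The parsing step is demonstrated on an abridged copy of the response so that it also works when the server is not running.

```python
import json
from urllib.request import urlopen

def fetch_root_info(url="http://localhost:9200"):
    """GET the Elasticsearch root endpoint and decode the JSON body into a dict."""
    with urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))

def describe(info):
    """Build a one-line summary from the decoded response."""
    return "{} runs Elasticsearch {}".format(info["name"], info["version"]["number"])

# Abridged copy of the response shown above, so the parsing step works offline.
sample_body = '{"name": "fir-machine", "version": {"number": "7.9.0"}, "tagline": "You Know, for Search"}'
print(describe(json.loads(sample_body)))   # fir-machine runs Elasticsearch 7.9.0

# With the server running: print(describe(fetch_root_info()))
```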


### Kibana, cURL, and more cURL 

You can interact with your Elasticsearch service in different ways. In this first assignment we will describe three ways. Later during the practical work we will use the Python Elasticsearch client.

1. Using the Kibana Console
2. Using cURL
3. Using cURL from a Jupyter notebook (not recommended)

#### Kibana
Kibana provides a web interface to interact with your Elasticsearch service. It's available from http://localhost:5601. You can use Kibana to create interactive dashboards visualizing data in your Elasticsearch indices. It also provides a console to execute Elasticsearch commands. It's available from http://localhost:5601/app/kibana#/dev_tools

To start Kibana on the virtual machine, you can type `sudo service kibana start` in a Terminal.
To stop the Kibana server, instead, you can type `sudo service kibana stop`.

Many examples from the Elasticsearch user guide can be directly executed in Kibana by clicking the `VIEW IN CONSOLE` button.

#### cURL
[cURL](https://en.wikipedia.org/wiki/CURL) is a software tool that enables you to execute HTTP requests from the command line. The name stands for "client URL" and can also be read as "see URL". 

Curl is already installed in the VM operating system. Let's open a bash terminal.
You can exit the shell by executing `exit`.
You can execute curl commands on this prompt, for instance retrieving the Elasticsearch state.
Note you have to use `localhost` as the hostname:
```
labs@fir-machine:~$ curl localhost:9200
{
  "name" : "epRATWu",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "KsOTBsyeTmy6fJCcZ64d_A",
  "version" : {
    "number" : "6.2.4",
    "build_hash" : "ccec39f",
    "build_date" : "2018-04-12T20:37:28.497551Z",
    "build_snapshot" : false,
    "lucene_version" : "7.2.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
```

#### cURL from this notebook

Alternatively, Jupyter notebooks allow you to execute cURL commands (or other shell commands) directly, by starting a line of code with an exclamation mark, for example `!curl localhost:9200`. Please be warned: when executing commands that produce long output (for instance when indexing a large number of documents), stick to the terminal. Jupyter might freeze when handling long output from the shell.

## Assignment Part 02 (Let's go!)

_You can work on this part after Lecture 01 already, if you want_

For the following exercises, you will use a TREC genomics document collection and queries. 
It is stored in the folder `data/` in the directory where you have been instructed to place the assignment notebooks (`/`).

The collection contains:

* `trec-medline.json` (the collection in Elasticsearch batch format - because of its size it cannot be indexed with a single curl command!)
* `training-queries-simple.txt` (test queries)
* `training-qrels.txt` (the "relevance judgements" for the test queries, i.e. the correct answers)
* `test-queries-simple.txt`
* `example_matches20.txt` (20 example matches)

To make things easy, the data is already provided in Elasticsearch's batch processing format. 
Inspect the collection file in the terminal:

`head trec-medline.json`

This shows the first 5 documents in the collection (in JSON format prepared for ElasticSearch, as you have seen in the tutorial)
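
In the batch format each document takes two lines: an action line followed by the document itself, which is why 10 lines of output correspond to 5 documents. The sketch below pairs such lines up in memory. The two entries are hypothetical stand-ins (the exact action lines and fields in `trec-medline.json` may differ), and `bulk_pairs` is a name introduced here for illustration.

```python
import io
import json

# Two hypothetical entries in the bulk layout: an action line, then the document itself.
sample_bulk = io.StringIO(
    '{"index": {"_id": "1"}}\n'
    '{"PMID": "12194538", "TI": "Sample size calculation in bioequivalence trials."}\n'
    '{"index": {"_id": "2"}}\n'
    '{"PMID": "00000000", "TI": "A made-up title"}\n'
)

def bulk_pairs(lines):
    """Yield (action, document) dicts from alternating bulk lines."""
    iterator = iter(lines)
    # zip over the same iterator twice consumes the lines two at a time.
    for action_line, doc_line in zip(iterator, iterator):
        yield json.loads(action_line), json.loads(doc_line)

for action, doc in bulk_pairs(sample_bulk):
    print(action["index"]["_id"], doc["TI"])
```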

## Exercise 02.A: _indexing_ and _first queries_

Execute the following cell to index the collection in an Elasticsearch index called `genomics`. This code uses the Elasticsearch Python API, which we will discuss later (you can read about it yourself in the meantime).

_Note:_ indexing the TREC genomics collection will take some time.

In [3]:
import elasticsearch
import elasticsearch.helpers
import json

def documents():
    """ generates the documents to be indexed as dictionaries """
    with open('data/trec-medline.json') as inp:
        while True:
            try:
                line = next(inp)  # ignore odd line nrs
                if line is None:
                    break
                try:
                    docline = next(inp)
                    doc = json.loads(docline)
                    yield doc
                except json.JSONDecodeError as e:
                    # should not occur (but ignore it anyway)
                    pass
            except StopIteration as e:
                break

In [4]:
es = elasticsearch.Elasticsearch('localhost')

# remove if it already exists
es.indices.delete(index="genomics", ignore=[400, 404])

# and bulk index it
print("Indexing documents, this will take some time...")
_ = elasticsearch.helpers.bulk(
        es, 
        documents(),
        index="genomics",
        chunk_size=2000,
        request_timeout=30
    )
print("Done")

Indexing documents, this will take some time...
Done


Query the index called `genomics` and determine how many items are indexed. 

In [8]:
#THIS IS GRADED!

# write the code here
# BEGIN ANSWER
ind = 'genomics'
res = es.count(index=ind).get('count')

print("There are", res, "items in the index", ind)

# END ANSWER

There are 525937 items in the index genomics


Using the command line (or the Kibana console), search for all documents that contain the word `blood`. 

How many documents containing the term `blood` are there in your index? (searching all fields of the documents).

In [9]:
#THIS IS GRADED!

# write the code that generates the answer here (you may also use Kibana)
# BEGIN ANSWER
# Using Kibana console (commented out here, because the request is not Python code):
# POST genomics/_count
# {
#  "query": {
#    "query_string": {
#      "query": "blood"
#    }
#  }
# }
# Returns:
# {
#   "count" : 68275,
#   "_shards" : {
#     "total" : 1,
#     "successful" : 1,
#     "skipped" : 0,
#     "failed" : 0
#   }
# }
# END ANSWER

## Exercise 02.B: the Python ElasticSearch library

#### Preparation
The command line is fine for doing basic operations on your Elasticsearch indices, but as soon as things get more complex, you are better off using custom client programs.
We will use the [Elasticsearch client library for Python](https://elasticsearch-py.readthedocs.io). This library will execute the HTTP requests that you have used before (with CURL or Kibana). The library is pre-installed on the VM.

#### Exercise
Write the code that searches the index for _"blood"_ using the [search()](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.search) function. Your code will take at minimum the following steps:

1. import the python library `elasticsearch`.
2. open a connection with the Elasticsearch host `'localhost'` with `Elasticsearch()`.
3. execute a search with `search()` using the index `genomics`, and a correct query body.
4. print the JSON output of Elasticsearch 

How many hits are there in your index? Is the result the same as in Exercise 02.A?

> Elasticsearch runs on localhost on your laptop, at port 9200 (i.e. at http://localhost:9200)


In [8]:
#THIS IS GRADED!

import elasticsearch

# your code below

# Open a connection with the Elasticsearch host. 'es' already exists, so call this one es2.
es2 = elasticsearch.Elasticsearch('localhost')

# The same query body is used for all three requests below.
blood_query = {
 "query": {
   "query_string": {
     "query": "blood"
   }
 }
}

# count() returns the exact number of matching documents.
cnt = es2.count(index='genomics', body=blood_query)

# search() returns the first 10 hits by default.
# The total number of hits that it reports is capped at 10,000.
res_10 = es2.search(index='genomics', body=blood_query)

# size=10000 retrieves up to 10,000 hits in one request, so that we can inspect them.
res_10k = es2.search(index='genomics', size=10000, body=blood_query)

print("Total count of matching documents:", cnt['count']) # 68,275 hits, the same as the Kibana count
print("\nSearch results (First 10 results. The total hits shown is capped at 10,000):\n", res_10)


Total count of matching documents: 68275

Search results (First 10 results. The total hits shown is capped at 10,000):
 {'took': 100, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 10000, 'relation': 'gte'}, 'max_score': 15.369452, 'hits': [{'_index': 'genomics', '_type': '_doc', '_id': '327954', '_score': 15.369452, '_source': {'CON': 'J Pharmacokinet Pharmacodyn. 2001 Apr;28(2):155-69. PMID: 11381568', 'CY': 'England', 'DA': '20020826', 'DCOM': '20030312', 'DP': '2002 Feb', 'EDAT': '2002/08/27 10:00', 'IP': '1', 'IS': '1567-567X', 'JID': '101096520', 'LA': 'eng', 'LR': '20030313', 'MHDA': '2003/03/13 04:00', 'PG': '95-7; author reply 99', 'PMID': '12194538', 'PST': 'ppublish', 'PT': 'Letter', 'SB': 'IM', 'SO': 'J Pharmacokinet Pharmacodyn 2002 Feb;29(1):95-7; author reply 99.', 'TA': 'J Pharmacokinet Pharmacodyn', 'TI': 'Sample size calculation in bioequivalence trials.', 'UI': '22182241', 'VI': '29', 'MH': ['*Mode

The Python client library returns Python objects, which use [dictionaries](https://docs.python.org/3.6/tutorial/datastructures.html#dictionaries) and [lists](https://docs.python.org/3.6/tutorial/introduction.html#lists).
Use a [for loop](https://docs.python.org/3.6/tutorial/controlflow.html#for-statements) to inspect each hit and print the retrieved documents' titles one by one.
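
The nesting of dictionaries and lists can be sketched without a server. The example below walks a hand-written, response-shaped dictionary and collects the `TI` field of each hit. The helper name `collect_titles`, the placeholder titles, and the fallback text for documents without a title are all assumptions made for illustration.

```python
# Hand-written stand-in for a search response; the titles are placeholders.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "a", "_score": 2.0, "_source": {"TI": "First placeholder title"}},
            {"_id": "b", "_score": 1.5, "_source": {"PMID": "0"}},  # no 'TI' field
        ]
    }
}


def collect_titles(response, missing="(no title)"):
    """Return the title of every hit, tolerating documents without 'TI'."""
    titles = []
    for hit in response["hits"]["hits"]:
        # Each hit is a dict; the indexed document sits under '_source'.
        titles.append(hit["_source"].get("TI", missing))
    return titles


for rank, title in enumerate(collect_titles(sample_response)):
    print(rank, ":", title)
```

Using `.get()` instead of direct indexing is a design choice here: it avoids a `KeyError` if some document happens to lack a title.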

In [11]:
#THIS IS GRADED!

# your code below
# Get the list of 10,000 hits.
hits = res_10k['hits']['hits']

print("Number of query results returned =", len(hits), "\n\nTitles:")
for i,hit in enumerate(hits):
    print(i, ":", hit['_source']['TI'])

Number of query results returned = 10000 

Titles:
0 : Thrombin functions during tissue factor-induced blood coagulation.
1 : Short deletion within the blood group Dombrock locus causing a Do(null) phenotype.
2 : Working Group on Blood Pressure Monitoring of the European Society of Hypertension International Protocol for validation of blood pressure measuring devices in adults.
3 : Clotting in whole blood: analysis of a biochemical reaction network.
4 : DNB: a partial D with anti-D frequent in Central Europe.
5 : Intrinsic pathway of blood coagulation contributes to thrombogenicity of atherosclerotic plaque.
6 : Persistence of HTLV-I in blood components after leukocyte depletion.
7 : Transplantation of mobilized peripheral blood cells to HLA-identical siblings with standard-risk leukemia.
8 : Absence of CD47 in protein 4.2-deficient hereditary spherocytosis in man: an interaction between the Rh complex and the band 3 complex.
9 : State of the market for devices for blood pressure measu

1239 : Imatinib induces hematologic and cytogenetic responses in patients with chronic myelogenous leukemia in myeloid blast crisis: results of a phase II study.
1240 : Thrombogenicity of beta 2-glycoprotein I-dependent antiphospholipid antibodies in a photochemically induced thrombosis model in the hamster.
1241 : Congenital afibrinogenemia: first identification of splicing mutations in the fibrinogen Bbeta-chain gene causing activation of cryptic splice sites.
1242 : Heterologous cells cooperate to augment stem cell migration, homing, and engraftment.
1243 : Interleukin-1 blockade does not prevent acute graft-versus-host disease: results of a randomized, double-blind, placebo-controlled trial of interleukin-1 receptor antagonist in allogeneic bone marrow transplantation.
1244 : Antimyeloma efficacy of thalidomide in the SCID-hu model.
1245 : [Selected blood coagulation problems in newborn infants]
1246 : Volume of RBCs, 24
1247 : [CD62p expression in platelet during the preparation c

2406 : Dietary fibre in treatment of diabetes: myth or reality?
2407 : Obesity, smoking, and multiple cardiovascular risk factors in young adult African Americans.
2408 : [Markers of oxidative damage in blood of children with cystic fibrosis]
2409 : Assessing the accuracy of three viral risk models in predicting the outcome of implementing HIV and HCV NAT donor screening in Australia and the implications for future HBV NAT.
2410 : The blood platelet as a model for regulating blood coagulation on cell surfaces and its consequences.
2411 : Suicide and the media. Part I: Reportage in nonfictional media.
2412 : Suicide and the media. Part II: Portrayal in fictional media.
2413 : Suicide and the media. Part III: Theoretical issues.
2414 : Blood extraction from lancet wounds using vacuum combined with skin stretching.
2415 : Safer haemotherapy: the responsibilities of government, transfusion service, blood donors, and physician-users.
2416 : Serum cholesterol affects blood pressure regulatio

4072 : Effect of IVIgG treatment on fetal platelet count, HPA-1a titre and clinical outcome in a case of feto-maternal alloimmune thrombocytopenia.
4073 : Effects of peroxidase on hyperlipidemia in mice.
4074 : Effect of buyang huanwu decoction on platelet activating factor content in arterial blood pre
4075 : Evaluating the relationship between arterial blood pressure changes and indices of pulse oximetric plethysmography.
4076 : Multiple sclerosis: low-frequency temporal blood oxygen level-dependent fluctuations indicate reduced functional connectivity initial results.
4077 : Assessment of sirolimus concentrations in whole blood by high-performance liquid chromatography with ultraviolet detection.
4078 : Changes in middle cerebral artery blood flow after carotid endarterectomy as monitored by transcranial Doppler.
4079 : Clinical significance of blood brain natriuretic peptide level measurement in the detection of heart disease in untreated outpatients: comparison of electrocardiogra

5072 : [Changes in synthesis of nitric oxide, blood levels of ACTH and cortisol in viral hepatitis B]
5073 : Allele-specific replication associated with aneuploidy in blood cells of patients with hematologic malignancies.
5074 : [Blood donation]
5075 : [Vitamin A: blood level and dietetics intake in stunted children and adolescents without hormonal disease]
5076 : Long-term effects of oral estradiol and dydrogesterone on carbohydrate metabolism in postmenopausal women.
5077 : In vivo PIV measurement of red blood cell velocity field in microvessels considering mesentery motion.
5078 : [Informative value of immunologic analysis of blood and ejaculate in diagnosing chronic prostatitis]
5079 : Insulinoma in a patient with tuberous sclerosis: is there an association?
5080 : In vitro studies of the influence polyester materials with a different degree of surface wettability have on blood haematological parameters and coagulation and fibrinolysis system parameters.
5081 : A comparative study 

6238 : "Sausage-string" appearance of arteries and arterioles can be caused by an instability of the blood vessel wall.
6239 : Inhibitory mechanism of costunolide, a sesquiterpene lactone isolated from Laurus nobilis, on blood-ethanol elevation in rats: involvement of inhibition of gastric emptying and increase in gastric juice secretion.
6240 : [Blood pressure determination]
6241 : Intrapituitary adenoviral administration of 7B2 can extend life span and reverse endocrinological deficiencies in 7B2 null mice.
6242 : Oat consumption does not affect resting casual and ambulatory 24-h arterial blood pressure in men with high-normal blood pressure to stage I hypertension.
6243 : Clonal T cell receptor gamma-chain gene rearrangement by PCR-based GeneScan analysis in the skin and blood of patients with parapsoriasis and early-stage mycosis fungoides.
6244 : Lipids and nitric oxide in porcine retinal and choroidal blood vessels.
6245 : Mice deficient in the insulin-regulated membrane aminopep

7738 : Peroxidation of proteins and lipids in suspensions of liposomes, in blood serum, and in mouse myeloma cells.
7739 : [The distribution of serum homocysteine and its associated factors in a population of 1 168 subjects in Beijing area]
7740 : Miniaturized electrophoresis: an evolving role in laboratory medicine.
7741 : Reticulocytes. Their usefulness and measurement in peripheral blood.
7742 : Cell saver for blood conservation in cardiac surgery.
7743 : Blood dendritic cells interact with splenic marginal zone B cells to initiate T-independent immune responses.
7744 : Insulin therapy in type 2 diabetes.
7745 : Cerebral blood perfusion after treatment with zolpidem and flumazenil in the baboon.
7746 : [Necessary harmonization of health cost assessment. Autologous peripheral blood progenitor cell transplantation in France]
7747 : Systemic vs. local cytokine and leukocyte responses to unilateral wrist flexion exercise.
7748 : The effect of estrogen use on levels of glucose and insuli

9238 : Multiple organ failure in patients with cardiogenic shock is associated with high plasma levels of interleukin-6.
9239 : Skeletal muscle capillary hemodynamics from rest to contractions: implications for oxygen transfer.
9240 : Interactions between stress, interleukin-1beta, interleukin-6 and cortisol in periodontally diseased patients.
9241 : Is transcranial Doppler ultrasonography (TCD) good enough in determining CO2 reactivity and pressure autoregulation in head-injured patients?
9242 : [Clinical research of patients with acute or chronic hepatic failure treated with molecular adsorbent recirculating system]
9243 : Comparison of PET with radioactive microspheres to assess pulmonary blood flow.
9244 : Bell-bottom aortoiliac endografts: an alternative that preserves pelvic blood flow.
9245 : HEPC-based liposomes trigger cytokine release from peripheral blood cells: effects of liposomal size, dose and lipid composition.
9246 : A simplified double-injection method to quantify cer

## Exercise 02.C: _Search using the Elasticsearch DSL_

You will notice that the native query format of Elasticsearch can be quite verbose.
Elasticsearch provides the Python library `elasticsearch_dsl` to write more concise Elasticsearch queries.
This is only to simplify the syntax: the library still issues Elasticsearch queries.

For example, a simple `match_all` query looks as follows:
```python
query = {
   "query": {
       "match_all": {}
   }
}
```

The same query can be created with the DSL as follows:
```python
from elasticsearch_dsl import Q

query = Q("match_all")
```

Especially for more complicated boolean queries, using the native query format can become unwieldy.
Read more about the DSL [here](https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html).
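
To give an impression of how quickly the native format grows, the sketch below builds a boolean query as a plain Python dictionary. The helper name `bool_query` and the chosen terms are hypothetical; the field `TI` is the title field of the documents in the `genomics` index.

```python
import json


def bool_query(must_terms, must_not_terms, field="TI"):
    """Build a native bool query: every must term required, every must_not term excluded."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: term}} for term in must_terms],
                "must_not": [{"match": {field: term}} for term in must_not_terms],
            }
        }
    }


body = bool_query(["blood", "pressure"], ["monitoring"])
print(json.dumps(body, indent=2))
```

With `elasticsearch_dsl`, a comparable structure can be composed from `Q` objects, for instance along the lines of `Q('bool', must=[...], must_not=[...])`.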

__Exercise:__ Search for the query `blood` and check whether you get the same number of results as for Exercise 02.B. You can use the DSL, curl or Kibana.

In [8]:
#THIS IS GRADED!

# your code here

# BEGIN ANSWER
# Install elasticsearch_dsl library when first running cell.
# ! pip3 install elasticsearch-dsl

import elasticsearch
from elasticsearch_dsl import Search, Q

s = Search(using=es2, index='genomics') 
q = Q('query_string', query='blood')
s = s.query(q)

# Execute the query. This will return the first 10 matches.
res = s.execute()

# Get the total count of matches. 
s.count()  # 68,275 => matches the previous answers.

# END ANSWER

Collecting elasticsearch-dsl
  Downloading elasticsearch_dsl-7.3.0-py2.py3-none-any.whl (62 kB)
Installing collected packages: elasticsearch-dsl
Successfully installed elasticsearch-dsl-7.3.0


# Making your own TREC run

We will adopt a scientific approach to building search engines. That is, we are not only going to build a search engine and see that it works, but we are also going to _measure_ how well it works, by measuring the search engine's quality. We will adopt the method from the [Text Retrieval Conference](http://trec.nist.gov) (TREC). TREC provides researchers with test collections, each of which consists of three parts:

1. the document collection (in our case a part of the MEDLINE database)
2. the topics (which are natural language descriptions of what the user is searching for: you can think of them as the _queries_)
3. the relevance judgments (for each topic, what documents are relevant)
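
As a rough illustration of how the three parts fit together, the sketch below represents a toy test collection with plain Python structures and counts how many of the top-ranked documents are judged relevant. All identifiers, the topic text, and the judgments are invented; `relevant_retrieved` is a hypothetical helper, not part of any TREC tooling.

```python
# Toy test collection; every identifier and judgment here is invented.
topics = {"1": "example information need"}        # topic id -> topic text
relevant = {"1": {"d2", "d5"}}                    # topic id -> relevant document ids
ranking = {"1": ["d1", "d2", "d3", "d5"]}         # topic id -> ranked search results


def relevant_retrieved(topic_id, k):
    """Count relevant documents among the top-k results for a topic."""
    top_k = ranking[topic_id][:k]
    return sum(1 for doc_id in top_k if doc_id in relevant[topic_id])


for topic_id in topics:
    print(topic_id, relevant_retrieved(topic_id, 2), relevant_retrieved(topic_id, 4))
```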

##  Exercise 02.D

Complete the code of the Python function `make_trec_run()`, which reads the topics file [training-queries-simple.txt](http:training-queries-simple.txt) and runs an Elasticsearch search for each topic. The program should output a file in the [TREC submission format](https://trec-core.github.io/2017/#submission-guidelines). We have already provided the first lines for this exercise, which do the following:

1. Open the file `'run_file_name'` for writing and call it `run_file`.
2. Open the file `'topics_file_name'` for reading and call it `test_queries`.
3. For each line in `test_queries`:
4. Remove the newline using `strip()`, then split the string on the tab character (`'\t'`). The first part of the line is now `qid` (the query identifier) and the last part is `query` (a textual description of the query).
5. Complete the Python program such that the correct TREC run file is written to `'run_file_name'`.

> **Note**: Make sure you output the `PMID` (PubMed identifier) of the document, `hit['_source']['PMID']`. Do **not** use the Elasticsearch identifier `_id`: these identifiers were randomly generated by Elasticsearch during indexing and do not match the document identifiers in the relevance judgements.
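
The parsing and formatting steps can be tried out in isolation, without a search engine. The sketch below assumes the six-column run layout `qid Q0 docno rank score tag`; the helper names `parse_topic_line` and `format_run_line` and the topic text are hypothetical.

```python
def parse_topic_line(line):
    """Split one tab-separated topics line into (qid, query)."""
    qid, query = line.strip().split("\t", 1)
    return qid, query


def format_run_line(qid, doc_id, rank, score, run_tag):
    """Format one result as: qid Q0 docno rank score tag."""
    return " ".join([str(qid), "Q0", str(doc_id), str(rank), str(score), run_tag])


qid, query = parse_topic_line("1\texample query text\n")  # made-up topic line
print(format_run_line(qid, "12056822", 0, 45.120773, "27test01"))
```

Note that `qid` stays a string after parsing, so comparisons such as `qid == 1` are always false; compare against `"1"` or convert with `int(qid)` if a numeric check is needed.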

In [9]:
#THIS IS GRADED!

import elasticsearch
from elasticsearch_dsl import Search, Q
import re

def make_trec_run(es, topics_file_name, run_file_name, run_name="test"):
    with open(run_file_name, 'w') as run_file:
        with open(topics_file_name, 'r') as test_queries:
            for line in test_queries:
                (qid, query) = line.strip().split('\t')
                # BEGIN ANSWER
                # Per the TREC guidelines, each run should have a different tag that
                # identifies the group and the method that produced the run.
                run_tag = "27" + run_name

                # Search with the topic text and keep the top 1,000 results, as per the instructions.
                s = Search(using=es, index='genomics')
                q = Q('multi_match', query=query)
                s = s.query(q)[0:1000]
                res = s.execute()

                for i, hit in enumerate(res.hits.hits):
                    output = [str(qid), 'Q0', str(hit['_source']['PMID']), str(i), str(hit['_score']), run_tag]
                    # Write one newline-terminated line per result.
                    run_file.write(" ".join(output) + "\n")
                # END ANSWER
                
                
# Write the results of the queries contained in the topic file `'data/training-queries-simple.txt'` 
# to the run file `'baseline.run'`, and name this test as `test01`
make_trec_run(es2, 'data/training-queries-simple.txt', 'baseline.run', run_name='test01')

In [10]:
# Shell command that prints the content of the file baseline.run
!cat baseline.run


1 Q0 12056822 0 45.120773 27test01
1 Q0 11929828 1 43.71542 27test01
1 Q0 11943869 2 43.64373 27test01
1 Q0 11751903 3 43.54453 27test01
1 Q0 12573582 4 42.216682 27test01
1 Q0 12384701 5 41.85428 27test01
1 Q0 12624599 6 41.735706 27test01
1 Q0 11981756 7 41.24557 27test01
1 Q0 12175534 8 41.18186 27test01
1 Q0 12065641 9 41.059303 27test01
1 Q0 12444543 10 40.73954 27test01
1 Q0 11933076 11 40.64125 27test01
1 Q0 11980715 12 40.211548 27test01
1 Q0 11790141 13 40.05864 27test01
1 Q0 12018448 14 39.329933 27test01
1 Q0 11876550 15 39.327446 27test01
1 Q0 12045203 16 39.06666 27test01
1 Q0 11759294 17 39.00269 27test01
1 Q0 12407107 18 38.99308 27test01
1 Q0 12429914 19 38.82344 27test01
1 Q0 12126481 20 38.77115 27test01
1 Q0 12036924 21 38.5743 27test01
1 Q0 12431783 22 38.562275 27test01
1 Q0 12081329 23 38.32687 27test01
1 Q0 12052868 24 38.211033 27test01
1 Q0 12368211 25 38.13008 27test01
1 Q0 12012324 26 37.90165 27test01
1 Q0 12214857 27 37.866325 2

3 Q0 12205127 242 15.176595 27test01
3 Q0 12360408 243 15.164554 27test01
3 Q0 12168840 244 15.162107 27test01
3 Q0 12511526 245 15.138282 27test01
3 Q0 11996973 246 15.134506 27test01
3 Q0 11857543 247 15.131138 27test01
3 Q0 11894914 248 15.119946 27test01
3 Q0 12081210 249 15.1053295 27test01
3 Q0 12072386 250 15.102147 27test01
3 Q0 12023815 251 15.10195 27test01
3 Q0 12161545 252 15.084579 27test01
3 Q0 12117724 253 15.071104 27test01
3 Q0 11916011 254 15.067863 27test01
3 Q0 11793389 255 15.059092 27test01
3 Q0 12439744 256 15.0553055 27test01
3 Q0 12401556 257 15.0501995 27test01
3 Q0 12021276 258 15.045597 27test01
3 Q0 12106867 259 15.003692 27test01
3 Q0 12227752 260 14.971842 27test01
3 Q0 11907494 261 14.96573 27test01
3 Q0 12527901 262 14.953565 27test01
3 Q0 11759083 263 14.945784 27test01
3 Q0 12386813 264 14.935541 27test01
3 Q0 12391845 265 14.92109 27test01
3 Q0 11861293 266 14.9176035 27test01
3 Q0 12049788 267 14.900828 27test01
3 Q0 121281

5 Q0 12504915 854 14.831844 27test01
5 Q0 11931617 855 14.8312645 27test01
5 Q0 11854298 856 14.828937 27test01
5 Q0 11982764 857 14.827291 27test01
5 Q0 11955515 858 14.82612 27test01
5 Q0 12102264 859 14.825395 27test01
5 Q0 12034569 860 14.820732 27test01
5 Q0 11792287 861 14.81447 27test01
5 Q0 12428187 862 14.804743 27test01
5 Q0 12036080 863 14.803597 27test01
5 Q0 12511078 864 14.803413 27test01
5 Q0 12061869 865 14.803283 27test01
5 Q0 12144133 866 14.799407 27test01
5 Q0 11837539 867 14.794883 27test01
5 Q0 12427030 868 14.787772 27test01
5 Q0 11927835 869 14.785785 27test01
5 Q0 11861317 870 14.778712 27test01
5 Q0 12372875 871 14.777195 27test01
5 Q0 12207932 872 14.773366 27test01
5 Q0 12107787 873 14.770002 27test01
5 Q0 12206674 874 14.766117 27test01
5 Q0 12042033 875 14.765737 27test01
5 Q0 12040021 876 14.765338 27test01
5 Q0 12135978 877 14.760097 27test01
5 Q0 12358726 878 14.756177 27test01
5 Q0 11804624 879 14.755266 27test01
5 Q0 12552493

8 Q0 11838736 892 12.401187 27test01
8 Q0 12017540 893 12.398596 27test01
8 Q0 11905507 894 12.39844 27test01
8 Q0 11858768 895 12.395767 27test01
8 Q0 11906933 896 12.395767 27test01
8 Q0 12205237 897 12.395767 27test01
8 Q0 12192896 898 12.395767 27test01
8 Q0 12413091 899 12.395767 27test01
8 Q0 11994389 900 12.395767 27test01
8 Q0 11925404 901 12.395767 27test01
8 Q0 11937129 902 12.395767 27test01
8 Q0 12041692 903 12.395767 27test01
8 Q0 11967203 904 12.395767 27test01
8 Q0 12479022 905 12.395767 27test01
8 Q0 12112006 906 12.394958 27test01
8 Q0 12501882 907 12.393836 27test01
8 Q0 12089341 908 12.387423 27test01
8 Q0 12382103 909 12.385925 27test01
8 Q0 11911945 910 12.384094 27test01
8 Q0 12149581 911 12.37431 27test01
8 Q0 12042878 912 12.371927 27test01
8 Q0 12220661 913 12.35785 27test01
8 Q0 12506166 914 12.356619 27test01
8 Q0 12170086 915 12.355699 27test01
8 Q0 11770184 916 12.353831 27test01
8 Q0 11889221 917 12.346978 27test01
8 Q0 12193281 9

11 Q0 12471715 869 20.409353 27test01
11 Q0 11928791 870 20.409353 27test01
11 Q0 12369431 871 20.409353 27test01
11 Q0 12434430 872 20.405064 27test01
11 Q0 11996904 873 20.401148 27test01
11 Q0 12558974 874 20.400417 27test01
11 Q0 11726669 875 20.399988 27test01
11 Q0 12175684 876 20.390373 27test01
11 Q0 11870222 877 20.388565 27test01
11 Q0 12221287 878 20.388552 27test01
11 Q0 12106606 879 20.387634 27test01
11 Q0 12207905 880 20.378061 27test01
11 Q0 11959814 881 20.376698 27test01
11 Q0 12215260 882 20.3747 27test01
11 Q0 12379699 883 20.36989 27test01
11 Q0 12163043 884 20.36972 27test01
11 Q0 11755134 885 20.369537 27test01
11 Q0 12045574 886 20.365528 27test01
11 Q0 12149089 887 20.365528 27test01
11 Q0 12502857 888 20.359798 27test01
11 Q0 12082019 889 20.3577 27test01
11 Q0 12134150 890 20.356995 27test01
11 Q0 12492428 891 20.352472 27test01
11 Q0 12097191 892 20.34457 27test01
11 Q0 12235379 893 20.342705 27test01
11 Q0 12565822 894 20.337872 27t

14 Q0 12466548 689 10.057163 27test01
14 Q0 12100553 690 10.056177 27test01
14 Q0 11796690 691 10.05534 27test01
14 Q0 12575909 692 10.05534 27test01
14 Q0 12368472 693 10.052078 27test01
14 Q0 12446775 694 10.048089 27test01
14 Q0 12087167 695 10.048089 27test01
14 Q0 12095300 696 10.048089 27test01
14 Q0 12325365 697 10.046129 27test01
14 Q0 12034847 698 10.043685 27test01
14 Q0 11909873 699 10.029354 27test01
14 Q0 11866090 700 10.02875 27test01
14 Q0 12021282 701 10.02875 27test01
14 Q0 12118408 702 10.02875 27test01
14 Q0 11741996 703 10.027193 27test01
14 Q0 12049793 704 10.021978 27test01
14 Q0 12243451 705 10.020887 27test01
14 Q0 12049666 706 10.014797 27test01
14 Q0 12203325 707 10.012722 27test01
14 Q0 11822666 708 10.005669 27test01
14 Q0 11844978 709 10.00205 27test01
14 Q0 11749206 710 10.000267 27test01
14 Q0 11786681 711 9.986645 27test01
14 Q0 11833653 712 9.977362 27test01
14 Q0 11804040 713 9.97419 27test01
14 Q0 11909716 714 9.97419 27test01

17 Q0 12130532 810 15.185207 27test01
17 Q0 11897378 811 15.181104 27test01
17 Q0 11906766 812 15.17971 27test01
17 Q0 12101011 813 15.17971 27test01
17 Q0 12444555 814 15.17475 27test01
17 Q0 12399495 815 15.17475 27test01
17 Q0 11953320 816 15.17475 27test01
17 Q0 12053110 817 15.174354 27test01
17 Q0 11973636 818 15.174354 27test01
17 Q0 11971032 819 15.173387 27test01
17 Q0 11882948 820 15.173132 27test01
17 Q0 12121994 821 15.172003 27test01
17 Q0 12208750 822 15.165625 27test01
17 Q0 12107641 823 15.163793 27test01
17 Q0 12385642 824 15.159387 27test01
17 Q0 11856869 825 15.1587 27test01
17 Q0 12237320 826 15.154493 27test01
17 Q0 12376101 827 15.150892 27test01
17 Q0 12186849 828 15.150658 27test01
17 Q0 12358355 829 15.149735 27test01
17 Q0 12539049 830 15.149735 27test01
17 Q0 11751871 831 15.147228 27test01
17 Q0 12607314 832 15.147027 27test01
17 Q0 12386293 833 15.1323 27test01
17 Q0 11862947 834 15.131767 27test01
17 Q0 12239089 835 15.131767 27tes

20 Q0 12224947 984 18.212103 27test01
20 Q0 11817654 985 18.211876 27test01
20 Q0 11985963 986 18.21135 27test01
20 Q0 11953382 987 18.21135 27test01
20 Q0 12107190 988 18.21135 27test01
20 Q0 12091384 989 18.207817 27test01
20 Q0 11845298 990 18.206837 27test01
20 Q0 12234960 991 18.201803 27test01
20 Q0 12118242 992 18.176569 27test01
20 Q0 12181165 993 18.169657 27test01
20 Q0 12188922 994 18.162376 27test01
20 Q0 11792021 995 18.161636 27test01
20 Q0 12206818 996 18.161636 27test01
20 Q0 12032168 997 18.161636 27test01
20 Q0 12355496 998 18.161636 27test01
20 Q0 12438403 999 18.15842 27test01
21 Q0 12057070 0 43.494823 27test01
21 Q0 12443887 1 42.19693 27test01
21 Q0 12074924 2 41.515602 27test01
21 Q0 11895498 3 41.30649 27test01
21 Q0 12581905 4 41.05872 27test01
21 Q0 12088642 5 40.72311 27test01
21 Q0 12213814 6 40.659103 27test01
21 Q0 12442329 7 40.54587 27test01
21 Q0 12516571 8 40.318893 27test01
21 Q0 12050042 9 40.056316 27test01
21 Q0 12473100 

23 Q0 11972481 855 12.591207 27test01
23 Q0 12430699 856 12.59087 27test01
23 Q0 12065458 857 12.585067 27test01
23 Q0 12165521 858 12.578651 27test01
23 Q0 12446004 859 12.577942 27test01
23 Q0 12163024 860 12.577861 27test01
23 Q0 12033902 861 12.576317 27test01
23 Q0 11937564 862 12.565023 27test01
23 Q0 12502804 863 12.560577 27test01
23 Q0 12526466 864 12.559472 27test01
23 Q0 11896730 865 12.559358 27test01
23 Q0 11825904 866 12.552558 27test01
23 Q0 12368321 867 12.551268 27test01
23 Q0 11896721 868 12.5485935 27test01
23 Q0 12427277 869 12.541856 27test01
23 Q0 12006104 870 12.536792 27test01
23 Q0 12105273 871 12.5313 27test01
23 Q0 12429851 872 12.524916 27test01
23 Q0 11927614 873 12.524745 27test01
23 Q0 12460746 874 12.519722 27test01
23 Q0 12186925 875 12.518645 27test01
23 Q0 11886531 876 12.514586 27test01
23 Q0 12050180 877 12.513609 27test01
23 Q0 11782537 878 12.5064335 27test01
23 Q0 12505708 879 12.506225 27test01
23 Q0 11855737 880 12.5043

26 Q0 12009329 573 13.790174 27test01
26 Q0 11907092 574 13.789752 27test01
26 Q0 12015149 575 13.78975 27test01
26 Q0 11825668 576 13.789393 27test01
26 Q0 12183469 577 13.786128 27test01
26 Q0 11893391 578 13.780816 27test01
26 Q0 12213087 579 13.769077 27test01
26 Q0 12188575 580 13.768289 27test01
26 Q0 12388215 581 13.761387 27test01
26 Q0 11751855 582 13.759655 27test01
26 Q0 12472537 583 13.7587185 27test01
26 Q0 12237935 584 13.758381 27test01
26 Q0 11861282 585 13.756425 27test01
26 Q0 12459787 586 13.755285 27test01
26 Q0 11927603 587 13.751639 27test01
26 Q0 11969475 588 13.746648 27test01
26 Q0 11804781 589 13.745124 27test01
26 Q0 11970996 590 13.741933 27test01
26 Q0 12093730 591 13.740305 27test01
26 Q0 12221285 592 13.735923 27test01
26 Q0 12522557 593 13.727145 27test01
26 Q0 11872041 594 13.725847 27test01
26 Q0 11934692 595 13.717369 27test01
26 Q0 12148652 596 13.716356 27test01
26 Q0 12145202 597 13.711328 27test01
26 Q0 11773069 598 13.710

29 Q0 12073772 240 8.8049345 27test01
29 Q0 12073773 241 8.8049345 27test01
29 Q0 12073774 242 8.8049345 27test01
29 Q0 12073775 243 8.8049345 27test01
29 Q0 12073776 244 8.8049345 27test01
29 Q0 12073777 245 8.8049345 27test01
29 Q0 12073778 246 8.8049345 27test01
29 Q0 12182109 247 8.8049345 27test01
29 Q0 12182110 248 8.8049345 27test01
29 Q0 12182111 249 8.8049345 27test01
29 Q0 12182112 250 8.8049345 27test01
29 Q0 12182113 251 8.8049345 27test01
29 Q0 12182114 252 8.8049345 27test01
29 Q0 12182115 253 8.8049345 27test01
29 Q0 12182116 254 8.8049345 27test01
29 Q0 12182117 255 8.8049345 27test01
29 Q0 12182118 256 8.8049345 27test01
29 Q0 11985303 257 8.8049345 27test01
29 Q0 11980356 258 8.8049345 27test01
29 Q0 11980357 259 8.8049345 27test01
29 Q0 11980358 260 8.8049345 27test01
29 Q0 11980359 261 8.8049345 27test01
29 Q0 11980360 262 8.8049345 27test01
29 Q0 11980361 263 8.8049345 27test01
29 Q0 11980362 264 8.8049345 27test01
29 Q0 11980363 265 8.8049

32 Q0 12214623 165 15.374653 27test01
32 Q0 12185660 166 15.374653 27test01
32 Q0 12021401 167 15.372334 27test01
32 Q0 12062475 168 15.344592 27test01
32 Q0 12586759 169 15.327641 27test01
32 Q0 11934652 170 15.30059 27test01
32 Q0 12493614 171 15.284097 27test01
32 Q0 12080040 172 15.283095 27test01
32 Q0 12198596 173 15.281681 27test01
32 Q0 11996894 174 15.271616 27test01
32 Q0 11923945 175 15.26474 27test01
32 Q0 12591720 176 15.249012 27test01
32 Q0 12186377 177 15.244616 27test01
32 Q0 11938733 178 15.228601 27test01
32 Q0 11857789 179 15.222047 27test01
32 Q0 12356083 180 15.222047 27test01
32 Q0 12437589 181 15.195136 27test01
32 Q0 11911506 182 15.177961 27test01
32 Q0 12123286 183 15.177304 27test01
32 Q0 12237165 184 15.175959 27test01
32 Q0 11751902 185 15.166299 27test01
32 Q0 11948478 186 15.1530075 27test01
32 Q0 11950274 187 15.147318 27test01
32 Q0 11765043 188 15.1464615 27test01
32 Q0 11854469 189 15.144921 27test01
32 Q0 11803451 190 15.127

34 Q0 11776260 934 11.470186 27test01
34 Q0 12007571 935 11.46992 27test01
34 Q0 11844576 936 11.465542 27test01
34 Q0 11777248 937 11.463413 27test01
34 Q0 11916368 938 11.462961 27test01
34 Q0 12141563 939 11.462497 27test01
34 Q0 11960537 940 11.461083 27test01
34 Q0 11914041 941 11.458647 27test01
34 Q0 12167556 942 11.4573965 27test01
34 Q0 12481521 943 11.4573965 27test01
34 Q0 11920744 944 11.4532995 27test01
34 Q0 11812535 945 11.453047 27test01
34 Q0 12028746 946 11.450636 27test01
34 Q0 12437986 947 11.449379 27test01
34 Q0 11842148 948 11.44813 27test01
34 Q0 12597088 949 11.447656 27test01
34 Q0 11830362 950 11.44681 27test01
34 Q0 11955952 951 11.44681 27test01
34 Q0 12373430 952 11.446247 27test01
34 Q0 12486438 953 11.44624 27test01
34 Q0 11853111 954 11.443108 27test01
34 Q0 12012745 955 11.442743 27test01
34 Q0 11743100 956 11.441623 27test01
34 Q0 12417685 957 11.441581 27test01
34 Q0 12216993 958 11.437222 27test01
34 Q0 12369817 959 11.42930

37 Q0 12184493 556 16.439302 27test01
37 Q0 11991082 557 16.438828 27test01
37 Q0 12459265 558 16.435724 27test01
37 Q0 12231507 559 16.43258 27test01
37 Q0 12063071 560 16.432104 27test01
37 Q0 11894955 561 16.42891 27test01
37 Q0 11805060 562 16.420126 27test01
37 Q0 11818063 563 16.41903 27test01
37 Q0 12205098 564 16.416763 27test01
37 Q0 11862486 565 16.414568 27test01
37 Q0 11782551 566 16.414215 27test01
37 Q0 11926534 567 16.410059 27test01
37 Q0 11850624 568 16.40382 27test01
37 Q0 12164780 569 16.4009 27test01
37 Q0 11812492 570 16.390476 27test01
37 Q0 12403324 571 16.390476 27test01
37 Q0 11920729 572 16.390476 27test01
37 Q0 11854460 573 16.388597 27test01
37 Q0 11884035 574 16.378529 27test01
37 Q0 12130803 575 16.376839 27test01
37 Q0 11794793 576 16.370152 27test01
37 Q0 12610537 577 16.370152 27test01
37 Q0 12113771 578 16.368591 27test01
37 Q0 12000797 579 16.36185 27test01
37 Q0 12466289 580 16.36038 27test01
37 Q0 12426370 581 16.36038 27tes

40 Q0 12421697 522 23.52689 27test01
40 Q0 12151316 523 23.519575 27test01
40 Q0 12021770 524 23.518673 27test01
40 Q0 12540577 525 23.515692 27test01
40 Q0 11967999 526 23.508596 27test01
40 Q0 12449645 527 23.495737 27test01
40 Q0 12438685 528 23.487587 27test01
40 Q0 12393882 529 23.48173 27test01
40 Q0 12135352 530 23.481018 27test01
40 Q0 11923210 531 23.480406 27test01
40 Q0 11992543 532 23.470987 27test01
40 Q0 11889043 533 23.47061 27test01
40 Q0 11929872 534 23.469225 27test01
40 Q0 12545881 535 23.456358 27test01
40 Q0 11869341 536 23.422333 27test01
40 Q0 11965658 537 23.421532 27test01
40 Q0 12196400 538 23.371792 27test01
40 Q0 12136017 539 23.359854 27test01
40 Q0 11848721 540 23.349699 27test01
40 Q0 12578968 541 23.349518 27test01
40 Q0 12114204 542 23.342705 27test01
40 Q0 12126643 543 23.338976 27test01
40 Q0 12142223 544 23.31608 27test01
40 Q0 11758729 545 23.295654 27test01
40 Q0 12510193 546 23.290258 27test01
40 Q0 12204819 547 23.27597 2

43 Q0 12486703 597 15.058524 27test01
43 Q0 12208233 598 15.058308 27test01
43 Q0 12074600 599 15.058308 27test01
43 Q0 11880543 600 15.054107 27test01
43 Q0 12456751 601 15.054107 27test01
43 Q0 12072480 602 15.043915 27test01
43 Q0 12471248 603 15.043915 27test01
43 Q0 12096187 604 15.043915 27test01
43 Q0 12023222 605 15.043915 27test01
43 Q0 12140323 606 15.043915 27test01
43 Q0 11943326 607 15.030118 27test01
43 Q0 12594826 608 15.028262 27test01
43 Q0 12557913 609 15.012203 27test01
43 Q0 11861559 610 14.99348 27test01
43 Q0 12077188 611 14.99348 27test01
43 Q0 11964164 612 14.98744 27test01
43 Q0 11734896 613 14.985294 27test01
43 Q0 11751861 614 14.985294 27test01
43 Q0 12050657 615 14.985294 27test01
43 Q0 12455976 616 14.985294 27test01
43 Q0 12021767 617 14.985294 27test01
43 Q0 11897392 618 14.956639 27test01
43 Q0 12220128 619 14.956573 27test01
43 Q0 12370312 620 14.956573 27test01
43 Q0 12105418 621 14.94609 27test01
43 Q0 11991710 622 14.94609 2

46 Q0 12427979 819 10.001263 27test01
46 Q0 12053578 820 9.993403 27test01
46 Q0 12194877 821 9.992336 27test01
46 Q0 12014335 822 9.990861 27test01
46 Q0 12056010 823 9.98239 27test01
46 Q0 11971011 824 9.981016 27test01
46 Q0 12109856 825 9.9803 27test01
46 Q0 12511597 826 9.9803 27test01
46 Q0 12021171 827 9.97956 27test01
46 Q0 11833653 828 9.977362 27test01
46 Q0 11720310 829 9.975929 27test01
46 Q0 12006659 830 9.975929 27test01
46 Q0 11804040 831 9.97419 27test01
46 Q0 11909716 832 9.97419 27test01
46 Q0 12051761 833 9.97419 27test01
46 Q0 12032305 834 9.97419 27test01
46 Q0 12122124 835 9.973148 27test01
46 Q0 12110201 836 9.964271 27test01
46 Q0 11807178 837 9.961302 27test01
46 Q0 12403777 838 9.961302 27test01
46 Q0 11932210 839 9.961302 27test01
46 Q0 12077334 840 9.959469 27test01
46 Q0 12477838 841 9.956891 27test01
46 Q0 12011358 842 9.951347 27test01
46 Q0 12172011 843 9.949514 27test01
46 Q0 12436371 844 9.94609 27test01
46 Q0 12463209 845 9.9

50 Q0 12016920 395 10.270081 27test01
50 Q0 12514122 396 10.269645 27test01
50 Q0 12384151 397 10.26717 27test01
50 Q0 12034575 398 10.265884 27test01
50 Q0 12101050 399 10.262214 27test01
50 Q0 11771955 400 10.259737 27test01
50 Q0 12081195 401 10.2572565 27test01
50 Q0 12009726 402 10.250688 27test01
50 Q0 12037684 403 10.250688 27test01
50 Q0 12037686 404 10.2447605 27test01
50 Q0 12079333 405 10.243333 27test01
50 Q0 11978796 406 10.243198 27test01
50 Q0 12193163 407 10.242594 27test01
50 Q0 12121891 408 10.23926 27test01
50 Q0 11798737 409 10.228697 27test01
50 Q0 12145098 410 10.228353 27test01
50 Q0 11794658 411 10.226373 27test01
50 Q0 12003938 412 10.222871 27test01
50 Q0 12094262 413 10.222255 27test01
50 Q0 12242235 414 10.215776 27test01
50 Q0 12151344 415 10.214844 27test01
50 Q0 11792061 416 10.212061 27test01
50 Q0 11895948 417 10.209793 27test01
50 Q0 12210211 418 10.208393 27test01
50 Q0 12591874 419 10.208393 27test01
50 Q0 12382270 420 10.197

> Tip: Write a line to `run_file` using `run_file.write(line)`. 
> The newline character is: `'\n'`. Before writing a number to
> the file, cast it to a string using `str()`.
>
> The TREC Submission guidelines allow you to submit up to 1000
> documents per topic. Keep this in mind!
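
As a minimal sketch of the tip above, the following plain-Python helpers format one result as a run-file line and cap the output at 1000 documents per topic. The names `format_run_line` and `write_topic_results` and the sample scores are hypothetical and only serve as illustration; the column order follows the run-file excerpt shown above.

```python
import io

MAX_DOCS_PER_TOPIC = 1000  # TREC submission limit mentioned in the tip


def format_run_line(qid, doc_id, rank, score, run_tag):
    """Return one run-file line: topic, 'Q0', document, rank, score, tag."""
    # Numbers must be cast to strings before they can be joined and written.
    fields = [str(qid), "Q0", str(doc_id), str(rank), str(score), run_tag]
    return " ".join(fields) + "\n"


def write_topic_results(run_file, qid, results, run_tag):
    """Write at most MAX_DOCS_PER_TOPIC (doc_id, score) pairs for one topic."""
    for rank, (doc_id, score) in enumerate(results[:MAX_DOCS_PER_TOPIC]):
        run_file.write(format_run_line(qid, doc_id, rank, score, run_tag))


# Usage with an in-memory file and made-up scores:
run_file = io.StringIO()
write_topic_results(run_file, 40, [("12151316", 23.5), ("12021770", 23.4)], "27test01")
print(run_file.getvalue(), end="")
```

Slicing the result list before writing is one simple way to respect the limit even when the search returns more hits than expected.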

# Index improvements: Tokenization (Analyzers)

_You are advised to work on this part after Lecture 02_

The way documents are indexed influences the performance of an IR system. 
Elasticsearch [Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/mapping.html) define how a document and its properties (fields) are stored and indexed. Whenever the configuration of an ElasticSearch Mapping changes, the document collection needs to be re-indexed.

## Background
The following part of the assignment requires some self-study of the ElasticSearch tools that support improving the index. Please read:
* [Index Settings and Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/indices-create-index.html);
* Elasticsearch [Analyzers](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis.html), which contain many options for improving your search engine.

You are again requested to consult the [Python Elasticsearch Client](https://elasticsearch-py.readthedocs.io) library documentation.


## Bulk indexing revisited
As we need to re-index the document collection when we use a different Mapping configuration, we developed some functions to support a quick re-indexing in the following exercises.

Below you find the Python code for bulk-indexing our Medline collection, similar to the code of the exercises in the beginning of Part 02. Read the code carefully, as you are required to use the indexing functions later for the completion of the assignment.

> The code uses additional helper functions 
> (`elasticsearch.helpers`) and a library for processing JSON.
> The function `read_documents()` reads the bulk insert file. It is a
> generator function: it produces its items on demand, one per `yield`
> statement, instead of building the whole list in memory. It
> is consumed by the helper function `elasticsearch.helpers.bulk()`.
> The statement `raise` is Python's way of throwing an exception; 
> if the exception is not caught, the program stops with an error.
> Note the additional (keyword) arguments to `bulk()`:
> `chunk_size` is the number of documents that Elasticsearch processes
> in one batch, and `request_timeout` is set to 30 seconds because
> processing a single batch of documents can take some time.
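
The combination of a generator and fixed-size batches described in the note can be illustrated without Elasticsearch. The sketch below is plain Python and does not show how `elasticsearch.helpers.bulk()` is implemented; `generate_docs` and `in_chunks` are hypothetical names used only for illustration.

```python
from itertools import islice


def generate_docs(n):
    """Generator: yields one document at a time instead of building a list."""
    for i in range(n):
        yield {"_id": str(i), "TI": "title %d" % i}


def in_chunks(iterable, chunk_size):
    """Group the items of any iterable into lists of at most chunk_size."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Five documents processed in batches of two: the last batch is smaller.
batch_sizes = [len(chunk) for chunk in in_chunks(generate_docs(5), 2)]
print(batch_sizes)
```

Because both functions are generators, only one batch of documents has to be held in memory at any time.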

> __Note:__ _when processing a bulk index, be sure to have a few gigabytes free on your hard drive. If you get a BulkIndexError with read-only/FORBIDDEN errors, you probably have too little hard drive space available for ElasticSearch to work properly._


In [2]:
import elasticsearch
import elasticsearch.helpers
import json

es = elasticsearch.Elasticsearch(host='localhost')  # in case you use Docker, the host is 'elasticsearch'

def read_documents(file_name):
    """
    Returns a generator of documents to be indexed by elastic, read from file_name
    """
    with open(file_name, 'r') as documents:
        for line in documents:
            doc_line = json.loads(line)
            if ('index' in doc_line):
                id = doc_line['index']['_id']
            elif ('PMID' in doc_line):
                doc_line['_id'] = id
                yield doc_line
            else:
                raise ValueError('Woops, error in index file')

def create_index(es, index_name, body={}):
    # delete index when it already exists
    es.indices.delete(index=index_name, ignore=[400, 404])
    # create the index 
    es.indices.create(index=index_name, body=body)
                
def index_documents(es, file_name, index_name, body={}):
    create_index(es, index_name, body)
    # bulk index the documents from file_name
    return elasticsearch.helpers.bulk(
        es, 
        read_documents(file_name),
        index=index_name,
        chunk_size=2000,
        request_timeout=30
    )

In [3]:
index_documents(es, 'data/trec-medline.json', 'genomics')

(525937, [])

## Exercise 02.E: _ElasticSearch Analyzers (Tokenization)_

The amount and quality of the tokens used to construct the inverted index are of great importance. In ElasticSearch, mappings and settings also allow specifying which [Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html) is used to tokenize your documents and queries. In the mappings below, use the _Dutch_ analyzer for the field `"all"` (where `"all"` indexes the fields `"TI"` and `"AB"`):

> Usually, the same analyzer should be applied to documents and queries, but 
> Elasticsearch allows you to specify a `"search_analyzer"` that is used on 
> your queries (which we do not need to use in the assignment).

In [20]:
#THIS IS GRADED!

analyzer_test = {
  # BEGIN ANSWER
    "mappings": {
        "properties": {
            "AB": {
                "type": "text",
                "copy_to": "all"                
            },
            "TI": {
                "type": "text",
                "copy_to": "all"
            },
            "all": {
                "type": "text",
                "analyzer": "dutch"
            }
#             "all": {
#                 "type": "text",
#                 "analyzer": "dutch",
#                 "fields": {
#                     "AB": {
#                         "type": "text",
#                     },
#                     "TI": {
#                     "type": "text",
#                     }
#                 }
#             }
        }
    }
  # END ANSWER
}

# create the index, but don't index any documents:
create_index(es, 'genomics', body=analyzer_test)

The analyzer defined for the `"all"` field can be tested [as follows](https://elasticsearch-py.readthedocs.io/en/master/api.html#indices). Translated to English, the text says: _"These are Dutch sentences"_. 

The following script lists the tokens produced by the Dutch analyzer. Try different analyzers and different sentences to see how the tokens are created.

In [15]:
from pprint import pprint # pretty print

body = { "field": "all", "text": "dit zijn nederlandse zinnen"}
tokens = es.indices.analyze(index='genomics', body=body)
pprint(tokens)

{'tokens': [{'end_offset': 20,
             'position': 2,
             'start_offset': 9,
             'token': 'nederland',
             'type': '<ALPHANUM>'},
            {'end_offset': 27,
             'position': 3,
             'start_offset': 21,
             'token': 'zinn',
             'type': '<ALPHANUM>'}]}
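
The response above is an ordinary Python dictionary, so it can be inspected with a few lines of code. The following sketch uses a copy of that response; the helper names `token_strings` and `skipped_positions` are hypothetical. The positions start at 2, which suggests that the first two words (_dit_ and _zijn_) were dropped, presumably as Dutch stop words.

```python
# Response copied from the output above.
tokens = {'tokens': [{'end_offset': 20, 'position': 2, 'start_offset': 9,
                      'token': 'nederland', 'type': '<ALPHANUM>'},
                     {'end_offset': 27, 'position': 3, 'start_offset': 21,
                      'token': 'zinn', 'type': '<ALPHANUM>'}]}


def token_strings(analyze_response):
    """Return only the token texts from an analyze response."""
    return [entry['token'] for entry in analyze_response['tokens']]


def skipped_positions(analyze_response):
    """Return the positions below the highest one that produced no token."""
    seen = {entry['position'] for entry in analyze_response['tokens']}
    return [p for p in range(max(seen) + 1) if p not in seen] if seen else []


print(token_strings(tokens))
print(skipped_positions(tokens))
```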


## Exercise 02.F: _tweet language analyzer_

Read the documentation for [Custom Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/analysis-custom-analyzer.html). 
Make a custom analyzer for _English tweet language_. The analyzer should do the following:
* change common abbreviations to the full forms: 
  * _b4_ to _before_, 
  * _abt_ to _about_, 
  * _chk_ to _check_, 
  * _cr8_ to _create_, 
  * _dm_ to _direct message_,
  * _f2f_ to _face-to-face_
* use the _standard_ tokenizer;
* put everything to lower-case;
* filter English stopwords.
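
To make the order of these steps concrete, here is a rough plain-Python imitation of such an analysis chain: character replacement first, then tokenization, then lower-casing and stop word removal. It is only a sketch and not how Elasticsearch works internally: the regular expression is a crude stand-in for the _standard_ tokenizer, `STOPWORDS` is a small hand-picked subset chosen for this example, and `analyze_tweet` is a hypothetical name.

```python
import re

# Abbreviation mappings from the exercise text.
ABBREVIATIONS = {
    "b4": "before",
    "abt": "about",
    "chk": "check",
    "cr8": "create",
    "dm": "direct message",
    "f2f": "face-to-face",
}

# Assumption: a small hand-picked subset of English stop words.
STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}


def analyze_tweet(text):
    # 1. Character filter: replace abbreviations in the raw text. This is
    #    case-sensitive and also matches inside longer words.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # 2. Tokenizer: split the text into runs of letters and digits.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # 3. Token filters: lower-case first, then drop stop words.
    tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in STOPWORDS]


print(analyze_tweet("cr8 it! what abt dm me?"))
```

Because the replacement runs before lower-casing, an upper-case abbreviation such as `B4` is left untouched in this sketch, which shows why the order of the steps matters.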

In [32]:
#THIS IS GRADED!

tweet_analyzer = {
  # BEGIN ANSWER
    "mappings": {
        "properties": {
            "AB": {
                "type": "text",
                "copy_to": "all"                
            },
            "TI": {
                "type": "text",
                "copy_to": "all"
            },
            "all": {
                "type": "text",
                "analyzer": "tweet_analyzer"
            }
#             "all": {
#                 "type": "text",
#                 "analyzer": "tweet_analyzer",
#                 "fields": {
#                     "AB": {
#                         "type": "text",
#                     },
#                     "TI": {
#                     "type": "text",
#                     }
#                 }   
#             }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                 "tweet_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "char_filter": "tweet_char_filter",
                    "filter": [
                        "lowercase",
                        "stop_filter" 
                    ]
                 }
            },
            "char_filter": {
                "tweet_char_filter": {
                    "type":"mapping",
                    "mappings": [
                        "b4 => before",
                        "abt => about",
                        "chk => check",
                        "cr8 => create",
                        "dm => direct message",
                        "f2f => face-to-face" 
                    ]
                }
            },
            "filter": {
                "stop_filter": {
                    "type": "stop",
                    "stopwords": "_english_"
                }
            }
        }
    }
  # END ANSWER
}

# create the index, but don't index any documents:
create_index(es, 'genomics', body=tweet_analyzer)
body = { "field": "all", "text": "cr8 it! what abt dm me?"}
tokens = es.indices.analyze(index='genomics', body=body)
pprint(tokens)

{'tokens': [{'end_offset': 3,
             'position': 0,
             'start_offset': 0,
             'token': 'create',
             'type': '<ALPHANUM>'},
            {'end_offset': 12,
             'position': 2,
             'start_offset': 8,
             'token': 'what',
             'type': '<ALPHANUM>'},
            {'end_offset': 16,
             'position': 3,
             'start_offset': 13,
             'token': 'about',
             'type': '<ALPHANUM>'},
            {'end_offset': 18,
             'position': 4,
             'start_offset': 17,
             'token': 'direct',
             'type': '<ALPHANUM>'},
            {'end_offset': 19,
             'position': 5,
             'start_offset': 18,
             'token': 'message',
             'type': '<ALPHANUM>'},
            {'end_offset': 22,
             'position': 6,
             'start_offset': 20,
             'token': 'me',
             'type': '<ALPHANUM>'}]}


# Part 03: Search models 

_You are advised to work on this part after Lecture 03_

Elasticsearch [Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/mapping.html) define how a document and its properties (fields) are stored and indexed, but also provide tools to implement and execute different document similarity measures (i.e. search models). 

> See again: [Index Settings and Mappings](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/indices-create-index.html).

For instance, we can add a new field `"all"` that uses the  [similarity measure](https://www.elastic.co/guide/en/elasticsearch/reference/7.8/similarity.html) _Boolean_, and let it serve as an index for the fields `"TI"` and `"AB"` (title and abstract) as follows:

In [1]:
boolean = {
  "settings" : {
    # a single shard, so we do not suffer from approximate document frequencies
    "number_of_shards" : 1
  },
  "mappings": {
      "properties": {
        "AB": {
          "type": "text",
          "copy_to": "all"
        },
        "TI": {
          "type": "text",
          "copy_to": "all"
        },
        "all": {
            "type": "text",
            "similarity": "boolean"
        }
      }
  }
}

index_documents(es, 'data/trec-medline.json', 'genomics', body=boolean)


> Most changes to the mappings cannot be done on an existing index. Some (for instance
> similarity measures) can be changed if the index is first closed. Nevertheless, we 
> will in this notebook _re-index_ the collection for every change to the mappings
> using the function `index_documents()` that we defined above. Mappings (and settings)
> can be passed to the function using the `body` parameter.
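
Since the collection is re-indexed for every change, the index body is rebuilt several times with only the similarity differing. A small helper could build it; this is a sketch, and `similarity_body` and its parameters are hypothetical names that are not part of the assignment code.

```python
def similarity_body(similarity, similarity_settings=None):
    """Build an index body whose "all" field uses the given similarity.

    similarity_settings is an optional dict of custom similarity
    definitions to be placed under settings -> index -> similarity.
    """
    # A single shard, so document frequencies are exact.
    settings = {"number_of_shards": 1}
    if similarity_settings is not None:
        settings["index"] = {"similarity": similarity_settings}
    return {
        "settings": settings,
        "mappings": {
            "properties": {
                "AB": {"type": "text", "copy_to": "all"},
                "TI": {"type": "text", "copy_to": "all"},
                "all": {"type": "text", "similarity": similarity},
            }
        },
    }


boolean_body = similarity_body("boolean")
print(boolean_body["mappings"]["properties"]["all"])
```

The resulting dictionary can be passed to `index_documents()` through its `body` parameter.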

Let's have a look at the mappings and settings for our index as follows:

In [5]:
es.indices.get(index='genomics')

{'genomics': {'aliases': {},
  'mappings': {'properties': {'AB': {'type': 'text', 'copy_to': ['all']},
    'AD': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'AID': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CI': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CIN': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CN': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CON': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'CY': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'DA': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'DCOM': {'type': 'text',
     'fields': {'keyword': {'type': 'keyword', 'ignore_above': 256}}},
    'DP': 

Now let's search our new field `"all"` as follows:

In [6]:
query = "blood"
search_type = "dfs_query_then_fetch" # this will use exact document frequencies even for multiple shards
body = {
  "query": {
    "match" : { "all" : query }
  },
  "size": 10
}
es.search(index="genomics", search_type=search_type, body=body)

{'took': 1970,
 'timed_out': False,
 '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
 'hits': {'total': {'value': 10000, 'relation': 'gte'},
  'max_score': 1.0,
  'hits': [{'_index': 'genomics',
    '_type': '_doc',
    '_id': '82',
    '_score': 1.0,
    '_source': {'AB': 'Space flight results in loss of bone mass, especially in weight-bearing bones, a condition that is suggested to be similar to disuse osteoporosis. As models to elucidate the underlying mechanism, bed rest studies were performed and bone metabolism in the rat both during space flight and during hindlimb unloading was investigated. The general picture is that bone formation is decreased partly as a result of reduced osteoblast function, whereas bone resorption is unaltered or increased. This deficit in bone mass can be replaced, but the time span for restoration exceeds the period of unloading. Changes in blood flow, systemic hormones, and locally produced factors are contributing in a yet undefin
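
The `max_score` of 1.0 in this response is what one would expect from the Boolean similarity: a matching query term contributes its boost (1.0 by default), no matter how often it occurs in the document. The sketch below imitates that behaviour in plain Python, under the assumption that a `match` query adds up the scores of its matching terms. `boolean_score` is a hypothetical helper and not part of Elasticsearch.

```python
def boolean_score(query_terms, doc_terms, boost=1.0):
    """Score a document as the summed boost of the query terms it contains."""
    doc_vocabulary = set(doc_terms)
    # Term frequency is ignored: a term either matches or it does not.
    return sum(boost for term in set(query_terms) if term in doc_vocabulary)


doc = ["changes", "in", "blood", "flow", "and", "blood", "pressure"]
print(boolean_score(["blood"], doc))
print(boolean_score(["blood", "flow", "bone"], doc))
```

Under this model many documents receive exactly the same score, so ties in the ranking are to be expected.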

## Exercise 03.A: _new run and evaluation_
Create a new run file (e.g. `boolean.run`), compute the retrieval performance with the function `print_trec_eval()` and compare the results with the baseline run file `baseline.run`.

In [3]:
#THIS IS GRADED!

# write your code here
# BEGIN ANSWER

import elasticsearch

def make_trec_run(es, topics_file_name, run_file_name, run_name="test"):
    # Per TREC guidelines: "each run should have a different tag that
    # identifies the group and the method that produced the run."
    run_tag = "27" + run_name
    # Use exact document frequencies, even for multiple shards.
    search_type = "dfs_query_then_fetch"
    with open(run_file_name, 'w') as run_file:
        with open(topics_file_name, 'r') as test_queries:
            for line in test_queries:
                (qid, query) = line.strip().split('\t')
                # BEGIN ANSWER
                # Retrieve at most 1000 documents per topic (the TREC limit).
                body = {
                    "query": {
                        "match": {"all": query}
                    },
                    "size": 1000
                }
                res = es.search(index="genomics", search_type=search_type, body=body)

                for i, hit in enumerate(res['hits']['hits']):
                    output = [str(qid), 'Q0', str(hit['_source']['PMID']), str(i), str(hit['_score']), run_tag]
                    # Write one result per line. (qid is a string, so a test
                    # such as `qid == 1` would never be true.)
                    run_file.write(" ".join(output) + "\n")
                # END ANSWER
                
                
# Write the results of the queries contained in the topic file `'data/training-queries-simple.txt'` 
# to the run file `'boolean.run'`, and name this test `test02`
make_trec_run(es, 'data/training-queries-simple.txt', 'boolean.run', run_name='test02')
# END ANSWER

In [12]:
! head data/training-queries-simple.txt

1	"cyclin-dependent kinase inhibitor 1A (p21, Cip1)" in Homo sapiens
2	"DEAD/H (Asp-Glu-Ala-Asp/His) box polypeptide 5 (RNA helicase, 68kDa)" in Homo sapiens
3	ets variant gene 6 (TEL oncogene) in Homo sapiens
4	fibroblast growth factor 7 (keratinocyte growth factor) in Homo sapiens
5	"glycine receptor, alpha 1 (startle disease/hyperekplexia, stiff man syndrome)" in Homo sapiens
6	"major histocompatibility complex, class II, DQ beta 1" in Homo sapiens
7	Janus kinase 2 (a protein tyrosine kinase) in Homo sapiens
8	luteinizing hormone/choriogonadotropin receptor in Homo sapiens
9	metallothionein 3 (growth inhibitory factor (neurotrophic)) in Homo sapiens
10	protein C (inactivator of coagulation factors Va and VIIIa) in Homo sapiens
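
Each topic line consists of a topic identifier and a query, separated by a tab. The following sketch parses one such line; `parse_topic_line` is a hypothetical helper, and the sample line is copied from the output above. Note that the identifier is read as a string, not as an integer.

```python
def parse_topic_line(line):
    """Split a topic line into (topic id, query text) at the first tab."""
    # maxsplit=1 keeps the query intact even if it contained a tab itself.
    qid, query = line.rstrip("\n").split("\t", 1)
    return qid, query


sample = "3\tets variant gene 6 (TEL oncogene) in Homo sapiens\n"
qid, query = parse_topic_line(sample)
print(repr(qid), repr(query))
print(qid == 3, qid == "3")
```

A comparison between the parsed identifier and an integer is always false, so convert with `int(qid)` when a numeric comparison is needed.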


In [4]:
! cat boolean.run


1 Q0 11756437 0 7.0 27test02
1 Q0 11762751 1 7.0 27test02
1 Q0 11767002 2 7.0 27test02
1 Q0 11790141 3 7.0 27test02
1 Q0 11809764 4 7.0 27test02
1 Q0 11852120 5 7.0 27test02
1 Q0 11870216 6 7.0 27test02
1 Q0 11876550 7 7.0 27test02
1 Q0 11879190 8 7.0 27test02
1 Q0 11880176 9 7.0 27test02
1 Q0 11886382 10 7.0 27test02
1 Q0 11886527 11 7.0 27test02
1 Q0 12112322 12 7.0 27test02
1 Q0 12112851 13 7.0 27test02
1 Q0 12118375 14 7.0 27test02
1 Q0 12407107 15 7.0 27test02
1 Q0 12414954 16 7.0 27test02
1 Q0 12429914 17 7.0 27test02
1 Q0 12431783 18 7.0 27test02
1 Q0 12028050 19 7.0 27test02
1 Q0 12032080 20 7.0 27test02
1 Q0 12036924 21 7.0 27test02
1 Q0 12045203 22 7.0 27test02
1 Q0 12056822 23 7.0 27test02
1 Q0 12065641 24 7.0 27test02
1 Q0 12075114 25 7.0 27test02
1 Q0 12079680 26 7.0 27test02
1 Q0 12080324 27 7.0 27test02
1 Q0 12081329 28 7.0 27test02
1 Q0 12085228 29 7.0 27test02
1 Q0 12175534 30 7.0 27test02
1 Q0 12012324 31 7.0 27test02
1 Q0 12012326 32

4 Q0 11893088 373 6.0 27test02
4 Q0 11893695 374 6.0 27test02
4 Q0 11893917 375 6.0 27test02
4 Q0 11893921 376 6.0 27test02
4 Q0 11894130 377 6.0 27test02
4 Q0 11894405 378 6.0 27test02
4 Q0 11895278 379 6.0 27test02
4 Q0 12087246 380 6.0 27test02
4 Q0 12087465 381 6.0 27test02
4 Q0 12087470 382 6.0 27test02
4 Q0 12088116 383 6.0 27test02
4 Q0 12088755 384 6.0 27test02
4 Q0 12088865 385 6.0 27test02
4 Q0 12089343 386 6.0 27test02
4 Q0 12090925 387 6.0 27test02
4 Q0 12091063 388 6.0 27test02
4 Q0 12091365 389 6.0 27test02
4 Q0 12091377 390 6.0 27test02
4 Q0 12091450 391 6.0 27test02
4 Q0 12091805 392 6.0 27test02
4 Q0 12091808 393 6.0 27test02
4 Q0 12093081 394 6.0 27test02
4 Q0 12093122 395 6.0 27test02
4 Q0 12093153 396 6.0 27test02
4 Q0 12093616 397 6.0 27test02
4 Q0 12093741 398 6.0 27test02
4 Q0 12095482 399 6.0 27test02
4 Q0 12095911 400 6.0 27test02
4 Q0 12095987 401 6.0 27test02
4 Q0 12097230 402 6.0 27test02
4 Q0 12097297 403 6.0 27test02
4 Q0 120

8 Q0 11700850 99 3.0 27test02
8 Q0 11700852 100 3.0 27test02
8 Q0 11700862 101 3.0 27test02
8 Q0 11710435 102 3.0 27test02
8 Q0 11710559 103 3.0 27test02
8 Q0 11713979 104 3.0 27test02
8 Q0 11714095 105 3.0 27test02
8 Q0 11719705 106 3.0 27test02
8 Q0 11720244 107 3.0 27test02
8 Q0 11721695 108 3.0 27test02
8 Q0 11722172 109 3.0 27test02
8 Q0 11725733 110 3.0 27test02
8 Q0 11726286 111 3.0 27test02
8 Q0 11726654 112 3.0 27test02
8 Q0 11726668 113 3.0 27test02
8 Q0 11727361 114 3.0 27test02
8 Q0 11728520 115 3.0 27test02
8 Q0 11729946 116 3.0 27test02
8 Q0 11732643 117 3.0 27test02
8 Q0 11732843 118 3.0 27test02
8 Q0 11732980 119 3.0 27test02
8 Q0 11732981 120 3.0 27test02
8 Q0 11732985 121 3.0 27test02
8 Q0 11734696 122 3.0 27test02
8 Q0 11735246 123 3.0 27test02
8 Q0 11735251 124 3.0 27test02
8 Q0 11737151 125 3.0 27test02
8 Q0 11739383 126 3.0 27test02
8 Q0 11741891 127 3.0 27test02
8 Q0 11743608 128 3.0 27test02
8 Q0 11744348 129 3.0 27test02
8 Q0 1174

11 Q0 11878814 961 5.0 27test02
11 Q0 11878877 962 5.0 27test02
11 Q0 11878894 963 5.0 27test02
11 Q0 11878987 964 5.0 27test02
11 Q0 11879035 965 5.0 27test02
11 Q0 11879180 966 5.0 27test02
11 Q0 11879184 967 5.0 27test02
11 Q0 11879916 968 5.0 27test02
11 Q0 11880153 969 5.0 27test02
11 Q0 11880175 970 5.0 27test02
11 Q0 11880237 971 5.0 27test02
11 Q0 11880267 972 5.0 27test02
11 Q0 11880294 973 5.0 27test02
11 Q0 11880369 974 5.0 27test02
11 Q0 11880631 975 5.0 27test02
11 Q0 11880646 976 5.0 27test02
11 Q0 11882378 977 5.0 27test02
11 Q0 11882926 978 5.0 27test02
11 Q0 11882948 979 5.0 27test02
11 Q0 11883532 980 5.0 27test02
11 Q0 11884079 981 5.0 27test02
11 Q0 11884148 982 5.0 27test02
11 Q0 11884242 983 5.0 27test02
11 Q0 11884391 984 5.0 27test02
11 Q0 11884406 985 5.0 27test02
11 Q0 11884456 986 5.0 27test02
11 Q0 11884459 987 5.0 27test02
11 Q0 11884493 988 5.0 27test02
11 Q0 11884514 989 5.0 27test02
11 Q0 11884521 990 5.0 27test02
11 Q0 1188

19 Q0 11850179 451 3.0 27test02
19 Q0 11850232 452 3.0 27test02
19 Q0 11850237 453 3.0 27test02
19 Q0 11850408 454 3.0 27test02
19 Q0 11850420 455 3.0 27test02
19 Q0 11850545 456 3.0 27test02
19 Q0 11851753 457 3.0 27test02
19 Q0 11853534 458 3.0 27test02
19 Q0 11853538 459 3.0 27test02
19 Q0 11853549 460 3.0 27test02
19 Q0 11853560 461 3.0 27test02
19 Q0 11853704 462 3.0 27test02
19 Q0 11853892 463 3.0 27test02
19 Q0 11853962 464 3.0 27test02
19 Q0 11854168 465 3.0 27test02
19 Q0 11854176 466 3.0 27test02
19 Q0 11854187 467 3.0 27test02
19 Q0 11854193 468 3.0 27test02
19 Q0 11854196 469 3.0 27test02
19 Q0 11854279 470 3.0 27test02
19 Q0 11854288 471 3.0 27test02
19 Q0 11854295 472 3.0 27test02
19 Q0 11854297 473 3.0 27test02
19 Q0 11854400 474 3.0 27test02
19 Q0 11854414 475 3.0 27test02
19 Q0 11854421 476 3.0 27test02
19 Q0 11854460 477 3.0 27test02
19 Q0 11855808 478 3.0 27test02
19 Q0 11855816 479 3.0 27test02
19 Q0 11855825 480 3.0 27test02
19 Q0 1185

22 Q0 12182494 482 2.0 27test02
22 Q0 12183378 483 2.0 27test02
22 Q0 12183437 484 2.0 27test02
22 Q0 12184826 485 2.0 27test02
22 Q0 12186119 486 2.0 27test02
22 Q0 12186148 487 2.0 27test02
22 Q0 12186154 488 2.0 27test02
22 Q0 12186157 489 2.0 27test02
22 Q0 12186750 490 2.0 27test02
22 Q0 12187430 491 2.0 27test02
22 Q0 12187509 492 2.0 27test02
22 Q0 12188605 493 2.0 27test02
22 Q0 12190270 494 2.0 27test02
22 Q0 12191437 495 2.0 27test02
22 Q0 12192414 496 2.0 27test02
22 Q0 12192574 497 2.0 27test02
22 Q0 12192971 498 2.0 27test02
22 Q0 12193303 499 2.0 27test02
22 Q0 12194576 500 2.0 27test02
22 Q0 12195432 501 2.0 27test02
22 Q0 12196061 502 2.0 27test02
22 Q0 12196261 503 2.0 27test02
22 Q0 12196416 504 2.0 27test02
22 Q0 12196842 505 2.0 27test02
22 Q0 12197007 506 2.0 27test02
22 Q0 12197628 507 2.0 27test02
22 Q0 12197964 508 2.0 27test02
22 Q0 12197981 509 2.0 27test02
22 Q0 12197982 510 2.0 27test02
22 Q0 12198547 511 2.0 27test02
22 Q0 1189

30 Q0 12196468 314 4.0 27test02
30 Q0 12196523 315 4.0 27test02
30 Q0 11896052 316 4.0 27test02
30 Q0 11897111 317 4.0 27test02
30 Q0 11897112 318 4.0 27test02
30 Q0 11897113 319 4.0 27test02
30 Q0 11897617 320 4.0 27test02
30 Q0 11897660 321 4.0 27test02
30 Q0 11897694 322 4.0 27test02
30 Q0 11897704 323 4.0 27test02
30 Q0 11897856 324 4.0 27test02
30 Q0 11900860 325 4.0 27test02
30 Q0 11901217 326 4.0 27test02
30 Q0 11901225 327 4.0 27test02
30 Q0 11901232 328 4.0 27test02
30 Q0 11903456 329 4.0 27test02
30 Q0 11904524 330 4.0 27test02
30 Q0 11905987 331 4.0 27test02
30 Q0 11906276 332 4.0 27test02
30 Q0 11906710 333 4.0 27test02
30 Q0 11906718 334 4.0 27test02
30 Q0 11906950 335 4.0 27test02
30 Q0 11906969 336 4.0 27test02
30 Q0 11907120 337 4.0 27test02
30 Q0 11907138 338 4.0 27test02
30 Q0 11907171 339 4.0 27test02
30 Q0 11908745 340 4.0 27test02
30 Q0 11909957 341 4.0 27test02
30 Q0 11911281 342 4.0 27test02
30 Q0 11911407 343 4.0 27test02
30 Q0 1191

33 Q0 11757743 306 4.0 27test02
33 Q0 11757835 307 4.0 27test02
33 Q0 11758766 308 4.0 27test02
33 Q0 11758796 309 4.0 27test02
33 Q0 11759293 310 4.0 27test02
33 Q0 11759691 311 4.0 27test02
33 Q0 11759827 312 4.0 27test02
33 Q0 11760396 313 4.0 27test02
33 Q0 11760832 314 4.0 27test02
33 Q0 11761721 315 4.0 27test02
33 Q0 11762174 316 4.0 27test02
33 Q0 11762220 317 4.0 27test02
33 Q0 11762559 318 4.0 27test02
33 Q0 11762792 319 4.0 27test02
33 Q0 11762850 320 4.0 27test02
33 Q0 11762856 321 4.0 27test02
33 Q0 11763238 322 4.0 27test02
33 Q0 11763346 323 4.0 27test02
33 Q0 11763512 324 4.0 27test02
33 Q0 11763992 325 4.0 27test02
33 Q0 11764989 326 4.0 27test02
33 Q0 11765082 327 4.0 27test02
33 Q0 11765790 328 4.0 27test02
33 Q0 11766042 329 4.0 27test02
33 Q0 11766763 330 4.0 27test02
33 Q0 11767077 331 4.0 27test02
33 Q0 11767254 332 4.0 27test02
33 Q0 11767256 333 4.0 27test02
33 Q0 11767488 334 4.0 27test02
33 Q0 11769973 335 4.0 27test02
33 Q0 1177

38 Q0 12097267 977 2.0 27test02
38 Q0 12097325 978 2.0 27test02
38 Q0 12097338 979 2.0 27test02
38 Q0 12097340 980 2.0 27test02
38 Q0 12097479 981 2.0 27test02
38 Q0 12097480 982 2.0 27test02
38 Q0 12097481 983 2.0 27test02
38 Q0 12097500 984 2.0 27test02
38 Q0 12098703 985 2.0 27test02
38 Q0 12099411 986 2.0 27test02
38 Q0 12099695 987 2.0 27test02
38 Q0 12100739 988 2.0 27test02
38 Q0 12100884 989 2.0 27test02
38 Q0 12100885 990 2.0 27test02
38 Q0 12100893 991 2.0 27test02
38 Q0 12100894 992 2.0 27test02
38 Q0 12101099 993 2.0 27test02
38 Q0 12101100 994 2.0 27test02
38 Q0 12101419 995 2.0 27test02
38 Q0 12101424 996 2.0 27test02
38 Q0 12105185 997 2.0 27test02
38 Q0 12107168 998 2.0 27test02
38 Q0 12107285 999 2.0 27test02
39 Q0 12019232 0 4.0 27test02
39 Q0 11693959 1 3.0 27test02
39 Q0 11695180 2 3.0 27test02
39 Q0 11695189 3 3.0 27test02
39 Q0 11710527 4 3.0 27test02
39 Q0 11719588 5 3.0 27test02
39 Q0 11734896 6 3.0 27test02
39 Q0 11744366 7 3.0 27t

42 Q0 12297148 435 3.0 27test02
42 Q0 12324888 436 3.0 27test02
42 Q0 12350270 437 3.0 27test02
42 Q0 12351680 438 3.0 27test02
42 Q0 12351787 439 3.0 27test02
42 Q0 12351791 440 3.0 27test02
42 Q0 12352953 441 3.0 27test02
42 Q0 12353025 442 3.0 27test02
42 Q0 12354665 443 3.0 27test02
42 Q0 12355067 444 3.0 27test02
42 Q0 12355089 445 3.0 27test02
42 Q0 12355261 446 3.0 27test02
42 Q0 12356729 447 3.0 27test02
42 Q0 12360288 448 3.0 27test02
42 Q0 12361959 449 3.0 27test02
42 Q0 12362054 450 3.0 27test02
42 Q0 12364792 451 3.0 27test02
42 Q0 12364793 452 3.0 27test02
42 Q0 12364794 453 3.0 27test02
42 Q0 12364795 454 3.0 27test02
42 Q0 11962500 455 3.0 27test02
42 Q0 11962604 456 3.0 27test02
42 Q0 11964130 457 3.0 27test02
42 Q0 11964164 458 3.0 27test02
42 Q0 11964379 459 3.0 27test02
42 Q0 11964389 460 3.0 27test02
42 Q0 11964404 461 3.0 27test02
42 Q0 11965431 462 3.0 27test02
42 Q0 11965435 463 3.0 27test02
42 Q0 11965438 464 3.0 27test02
42 Q0 1196

46 Q0 11770046 869 2.0 27test02
46 Q0 11770079 870 2.0 27test02
46 Q0 11770083 871 2.0 27test02
46 Q0 11770115 872 2.0 27test02
46 Q0 11770183 873 2.0 27test02
46 Q0 11770191 874 2.0 27test02
46 Q0 11770533 875 2.0 27test02
46 Q0 11770894 876 2.0 27test02
46 Q0 11771037 877 2.0 27test02
46 Q0 11771389 878 2.0 27test02
46 Q0 11771657 879 2.0 27test02
46 Q0 11771660 880 2.0 27test02
46 Q0 11771662 881 2.0 27test02
46 Q0 11771666 882 2.0 27test02
46 Q0 11771668 883 2.0 27test02
46 Q0 11771671 884 2.0 27test02
46 Q0 11771672 885 2.0 27test02
46 Q0 11771673 886 2.0 27test02
46 Q0 11771679 887 2.0 27test02
46 Q0 11771703 888 2.0 27test02
46 Q0 11771734 889 2.0 27test02
46 Q0 11771737 890 2.0 27test02
46 Q0 11771752 891 2.0 27test02
46 Q0 11771754 892 2.0 27test02
46 Q0 11771758 893 2.0 27test02
46 Q0 11771880 894 2.0 27test02
46 Q0 11771889 895 2.0 27test02
46 Q0 11771892 896 2.0 27test02
46 Q0 11771894 897 2.0 27test02
46 Q0 11771952 898 2.0 27test02
46 Q0 1177

50 Q0 11697430 25 3.0 27test02
50 Q0 11697431 26 3.0 27test02
50 Q0 11697507 27 3.0 27test02
50 Q0 11697711 28 3.0 27test02
50 Q0 11697795 29 3.0 27test02
50 Q0 11699080 30 3.0 27test02
50 Q0 11699448 31 3.0 27test02
50 Q0 11699459 32 3.0 27test02
50 Q0 11699643 33 3.0 27test02
50 Q0 11699647 34 3.0 27test02
50 Q0 11699905 35 3.0 27test02
50 Q0 11700956 36 3.0 27test02
50 Q0 11713904 37 3.0 27test02
50 Q0 11713969 38 3.0 27test02
50 Q0 11714101 39 3.0 27test02
50 Q0 11715436 40 3.0 27test02
50 Q0 11718223 41 3.0 27test02
50 Q0 11718447 42 3.0 27test02
50 Q0 11718705 43 3.0 27test02
50 Q0 11724263 44 3.0 27test02
50 Q0 11724276 45 3.0 27test02
50 Q0 11724804 46 3.0 27test02
50 Q0 11725834 47 3.0 27test02
50 Q0 11725839 48 3.0 27test02
50 Q0 11726270 49 3.0 27test02
50 Q0 11726710 50 3.0 27test02
50 Q0 11727818 51 3.0 27test02
50 Q0 11727924 52 3.0 27test02
50 Q0 11730788 53 3.0 27test02
50 Q0 11731157 54 3.0 27test02
50 Q0 11731158 55 3.0 27test02
50 Q0 11

## Exercise 03.B: _Language models_

Custom similarities can be configured by tuning the parameters of the built-in similarities. Read more about these (expert) options in the [similarity module](https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules-similarity.html).

> Tip: the example similarity settings have to be used in a `"settings"` object.
> Check your settings and mappings with: `es.indices.get(index='genomics')`.

Make a run that uses Language Models with Jelinek-Mercer smoothing (linear interpolation smoothing) on the field `"all"` that indexes the fields `"TI"` and `"AB"`. Use the parameter `lambda=0.2`. Again evaluate the run using `print_trec_eval`.
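To build intuition for what this similarity computes, here is a minimal sketch of Jelinek-Mercer smoothed query likelihood on toy data. It assumes the convention in which `lambda` weights the collection model and `1 - lambda` weights the document model; some textbooks swap the two roles, and ElasticSearch may transform the score further internally. The function `jm_log_likelihood` and the toy documents are hypothetical and are not part of the assignment code.

```python
import math
from collections import Counter

def jm_log_likelihood(query_terms, doc_terms, collection_tf, collection_len, lam=0.2):
    """Query log-likelihood under a Jelinek-Mercer smoothed document model.

    Assumed convention: lam weights the collection model,
    (1 - lam) weights the document model.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_col = collection_tf.get(term, 0) / collection_len
        p = (1 - lam) * p_doc + lam * p_col
        if p == 0.0:
            # The term does not occur anywhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score

# Toy data (hypothetical, not taken from the genomics collection).
docs = {
    "d1": ["curve", "fitting", "bias"],
    "d2": ["data", "analysis", "bias"],
}
collection_tf = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_tf.values())

scores = {
    doc_id: jm_log_likelihood(["curve", "fitting"], terms, collection_tf, collection_len)
    for doc_id, terms in docs.items()
}
print(scores)
```

Because of the collection model, `d2` still receives a finite score even though it contains neither query term; without smoothing its likelihood would be zero.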

In [6]:
#THIS IS GRADED!
lmjelinekmercer = {
  # BEGIN ANSWER
  "settings": {
    # a single shard, so we do not suffer from approximate document frequencies
    "number_of_shards": 1,
    "index": {
      "similarity": {
        "lmjm": {
          "type": "LMJelinekMercer",
          "lambda": "0.2"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "AB": {
        "type": "text",
        "copy_to": "all"
      },
      "TI": {
        "type": "text",
        "copy_to": "all"
      },
      "all": {
        "type": "text",
        "similarity": "lmjm"
      }
    }
  }
  # END ANSWER
}

index_documents(es, 'data/trec-medline.json', 'genomics', body=lmjelinekmercer)


In [None]:
make_trec_run(es, 'data/training-queries-simple.txt', 'lmjelinekmercer.run', run_name="test_03")

## Exercise 03.C: _Model comparison_
Compute the results of the `lmjelinekmercer.run` and compare them with those of the `baseline.run` and `boolean.run`. Performing statistical tests may help strengthen your claims.
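One possible statistical test is a paired randomization (sign-flip) test on per-query scores. The sketch below assumes you already have one score per query for each run, in the same query order (for example, average precision per query); the function name `paired_randomization_test` and the example scores are hypothetical and do not come from the actual runs.

```python
import random

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=42):
    """Two-sided paired randomization (sign-flip) test on per-query scores.

    Returns the estimated p-value for the null hypothesis that both
    systems perform equally well on average.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both runs need a score for the same set of queries.")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    return extreme / trials

# Hypothetical per-query average precision for two runs (same query order).
ap_run_a = [0.42, 0.35, 0.51, 0.28, 0.60, 0.33, 0.47, 0.39, 0.55, 0.44]
ap_run_b = [0.30, 0.25, 0.40, 0.20, 0.48, 0.21, 0.36, 0.30, 0.41, 0.35]

p_value = paired_randomization_test(ap_run_a, ap_run_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests that the difference in mean score is unlikely to be due to chance alone. With only a handful of queries, the test has little power, so interpret the result with care.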

In [None]:
#THIS IS GRADED!

# your comments here
# BEGIN ANSWER
# END ANSWER

In [None]:
! head -20 baseline.run
! echo "\n"
! head -20 lmjelinekmercer.run

## Exercise 03.D: _Implement your own similarity measure (Bonus)_ 

So far we have only seen the results of applying the analyzer to queries. The analyzer's output for the _documents_ is available through the `termvectors()` function, shown below for document `id=1`. The same call can also return overall field statistics, such as the number of documents.

> First, index the collection again. While waiting, have a coffee or tea :) 

> Here `id=1` refers to the internal document identifier, not to the PubMed identifier.

_The bonus exercise is not mandatory. It can compensate for lower grades of other exercises._

In [5]:
index_documents(es, 'data/trec-medline.json', 'genomics')

es.termvectors(index="genomics", id="1", fields="TI", 
               term_statistics=True, field_statistics=True, offsets=False)

{'_index': 'genomics',
 '_type': '_doc',
 '_id': '1',
 '_version': 1,
 'found': True,
 'took': 216,
 'term_vectors': {'TI': {'field_statistics': {'sum_doc_freq': 5686640,
    'doc_count': 485510,
    'sum_ttf': 5989891},
   'terms': {'analysis': {'doc_freq': 12432,
     'ttf': 12528,
     'term_freq': 1,
     'tokens': [{'position': 10}]},
    'and': {'doc_freq': 182150,
     'ttf': 213076,
     'term_freq': 1,
     'tokens': [{'position': 11}]},
    'bias': {'doc_freq': 294,
     'ttf': 296,
     'term_freq': 1,
     'tokens': [{'position': 1}]},
    'by': {'doc_freq': 37779,
     'ttf': 38892,
     'term_freq': 1,
     'tokens': [{'position': 7}]},
    'curve': {'doc_freq': 183,
     'ttf': 185,
     'term_freq': 1,
     'tokens': [{'position': 12}]},
    'data': {'doc_freq': 2588,
     'ttf': 2652,
     'term_freq': 1,
     'tokens': [{'position': 5}]},
    'fitting': {'doc_freq': 61,
     'ttf': 61,
     'term_freq': 1,
     'tokens': [{'position': 13}]},
    'in': {'doc_freq': 211
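
The statistics in this response are the inputs that a BM25-style score needs. As a minimal sketch, the snippet below copies a few values from the output above into plain dictionaries (so it runs without an ElasticSearch connection) and derives the average title length and a per-term idf. The idf variant used here, `log(1 + (N - df + 0.5) / (df + 0.5))`, is an assumption about what the engine computes; the helper name `idf` is hypothetical.

```python
import math

# Values copied from the termvectors() output above.
field_stats = {"sum_doc_freq": 5686640, "doc_count": 485510, "sum_ttf": 5989891}
term_stats = {
    "curve": {"doc_freq": 183, "ttf": 185, "term_freq": 1},
    "fitting": {"doc_freq": 61, "ttf": 61, "term_freq": 1},
}

# Average number of tokens in the 'TI' field over all documents that have it.
avg_field_length = field_stats["sum_ttf"] / field_stats["doc_count"]

def idf(doc_freq, n_docs):
    """Smoothed inverse document frequency (assumed variant, see lead-in)."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

idfs = {term: idf(stats["doc_freq"], field_stats["doc_count"])
        for term, stats in term_stats.items()}

print(f"average TI length: {avg_field_length:.3f}")
for term, value in idfs.items():
    print(f"idf({term}) = {value:.3f}")
```

Rarer terms receive a higher idf, so `fitting` (61 documents) weighs more than `curve` (183 documents).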

### Implement the BM25 similarity

Complete the function `bm25_similarity()` below by implementing the BM25 similarity as described in Section 11.4.3 of [Manning, Raghavan and Schuetze, Chapter 11](https://nlp.stanford.edu/IR-book/pdf/11prob.pdf). Are you able to replicate the score of ElasticSearch (15.472)? If not, are you using a different variant of the BM25 model?
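
For reference, the BM25 retrieval status value in the book chapter can be written as follows, where $N$ is the number of documents, $\mathrm{df}_t$ the document frequency of term $t$, $\mathrm{tf}_{t,d}$ its frequency in document $d$, $L_d$ the document length and $L_{\mathrm{ave}}$ the average document length:

$$
\mathrm{RSV}_d = \sum_{t \in q} \log\!\left[\frac{N}{\mathrm{df}_t}\right] \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}{k_1\left((1 - b) + b \cdot \frac{L_d}{L_{\mathrm{ave}}}\right) + \mathrm{tf}_{t,d}}
$$

A commonly used variant, which we assume is the one ElasticSearch applies in the version used here, keeps the term-frequency component but replaces the idf factor with a smoothed form:

$$
\mathrm{idf}_t = \ln\!\left(1 + \frac{N - \mathrm{df}_t + 0.5}{\mathrm{df}_t + 0.5}\right)
$$

If your score is close to the reference score but does not match it, the idf factor is a likely place to look for the difference.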

In [39]:
#THIS IS GRADED!

import math

# math.log(x) computes the natural logarithm of x

def bm25_similarity(query, doc_id):

    # Get the query tokens (see above)
    query_tokens = es.indices.analyze(index='genomics', body={"field":"TI", "text": query})
    tokens = query_tokens['tokens']

    # Get the term vector for doc_id and the field statistics
    term_vector = es.termvectors(index="genomics", id=doc_id, fields="TI", 
                  term_statistics=True, field_statistics=True, offsets=False)
    vector = term_vector['term_vectors']['TI']['terms']
    f_stats = term_vector['term_vectors']['TI']['field_statistics']

    # The answer should sum over 'tokens', check if the tokens exists in the 'vector',
    # and if so, add the appropriate value to 'similarity'.
    # Tip: add print statements to your code to see what each variable contains.
    
    similarity = 0

    # BEGIN ANSWER
    # Use the same BM25 defaults as ElasticSearch. A good k1 usually lies between 1.2 and 2
    # and is found through optimisation.
    k1 = 1.2
    b = 0.75
    N = f_stats['doc_count']  # Number of documents that contain the field 'TI'
    # The document length is the sum of the term frequencies. In doc 1 every term occurs once,
    # but other titles may contain repeated words.
    L_d = sum([vector[t]['term_freq'] for t in vector])

    # Average document length:
    # (sum of the total term frequencies over all docs and terms) / (number of documents)
    q = {"query": {"exists": {"field": "TI"}}}  # Only count documents in which the title field exists.
    total_docs = es.count(index='genomics', body=q).get('count')  # Number of docs with the field 'TI'

    sum_ttf = f_stats['sum_ttf']  # Sum of the total term frequencies over all docs and terms
    L_av = sum_ttf / total_docs  # Average (mean) length of the 'TI' field

    # Loop over the terms in the query.
    for token in tokens:
        term = token['token'] # Extract the word from the token dict.
        # Only query terms that occur in the document contribute to the score.
        if term in vector:
            df = vector[term]['doc_freq']   # Number of documents containing the term
            tf = vector[term]['term_freq']  # Occurrences of the term in this document
            # Calculate the term's contribution to the similarity score, based on the BM25 definitions in the course book and on Wikipedia.
            term_score = math.log(1 + (N-df +0.5)/(df+0.5)) * ((k1 +1)*tf)/(k1*((1-b) + b*(L_d/L_av)) + tf)
            # Increment similarity for this term
            similarity += term_score
    
    # END ANSWER
    return similarity

bm25_similarity("curve fitting", 1)

15.471867259295507

See below for the 'reference score' computed by ElasticSearch:

In [25]:
body = {
  "query": {
    "match" : { "TI" : "curve fitting" }
  }
}
explain = es.explain(index="genomics", id="1", body=body)
print (explain['explanation']['value'])  # BM25 score computed by ElasticSearch

15.471867
