<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/Data/NPLCorpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set Up an Elasticsearch Instance in Google Colab

Everything needed to connect to Elasticsearch is below; for a detailed explanation, see [this notebook](https://).
Download:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

Start a local server:

In [None]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
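# client-side: pin the 7.x client so it matches the 7.9.1 server started above
# (the 8.x client no longer accepts Elasticsearch() without a host)
!pip install "elasticsearch>=7.9,<8" -q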
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch()
es.ping()  # got True
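
Elasticsearch needs some time to boot, so a ping issued right after `Popen` may still return `False`. A minimal sketch of a polling helper follows; the name `wait_until` and its parameters are hypothetical and not part of the original notebook or of the Elasticsearch client:

```python
import time

def wait_until(predicate, timeout=60.0, interval=1.0):
    """Call predicate() repeatedly until it returns True or timeout (seconds) expires.

    Hypothetical helper; returns True if the predicate succeeded in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the client from the cell above, `wait_until(es.ping)` would block until the server answers or the timeout expires.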

# Parsing NPL Corpus

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Benchmarks](https://pragmalingu.de/docs/benchmarks/overview).

### Download Data

Download and unzip data.

In [None]:
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz
!tar -xf npl.tar.gz

Set paths to the downloaded data as variables:

In [None]:
PATH_TO_NPL_TXT = '/content/doc-text'
PATH_TO_NPL_QRY = '/content/query-text'
PATH_TO_NPL_REL = '/content/rlv-ass'

### Imports

Make all the imports:

In [None]:
from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np
import os

### Process Data

Read the text, query and relevance assessment files and split each into a list of entries:

In [None]:
txt_entry_marker = re.compile('\n   /\n')
qry_entry_marker = re.compile('\n/\n')
rel_entry_marker = re.compile('\n   /\n')

def get_data(PATH_TO_FILES, marker):
  """
  Reads a single file and splits its text into entries at the entry marker.
  'marker' is the compiled regex at which the text is split.
  The last element is popped since it is empty.
  """
  with open(PATH_TO_FILES, 'r') as f:
    text = f.read()
    lines = re.split(marker,text)
    lines.pop()
  return lines

npl_txt_list = get_data(PATH_TO_NPL_TXT, txt_entry_marker)
npl_qry_list = get_data(PATH_TO_NPL_QRY, qry_entry_marker)
npl_rel_list = get_data(PATH_TO_NPL_REL, rel_entry_marker)
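
To make the splitting step concrete, here is a small sketch that applies the same marker regex to an in-memory string. The sample text is invented for illustration and only imitates the layout the markers assume (an ID line, text lines, then a line containing `   /`):

```python
import re

# Illustrative sample only; the real doc-text file is much larger.
sample = "1\nfirst line of text\nsecond line\n   /\n2\nanother document\n   /\n"

entries = re.split(re.compile('\n   /\n'), sample)
entries.pop()  # the sample ends with a marker, so the last piece is empty

print(entries)
```

Each remaining entry starts with the ID, followed by the text lines of that entry.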

The entries in the text files only contain the ID and the text.

In [None]:
# process the documents

npl_txt_data = defaultdict(dict)

for entry in npl_txt_list:
  splitted = entry.split('\n')
  splitted = list(filter(None, splitted))
  ID = splitted[0]
  text = ' '.join(map(str, splitted[1:]))
  npl_txt_data[ID]['text'] = text

The query file has the same format as the text file:

In [None]:
# process the queries

npl_qry_data = defaultdict(dict)

for entry in npl_qry_list:
  splitted = entry.split('\n')
  splitted = list(filter(None, splitted))
  ID = int(splitted[0])
  question = ' '.join(map(str, splitted[1:]))
  npl_qry_data[ID]['question'] = question

Relevance assessments are parsed into a dictionary that maps each query ID to the list of its relevant document IDs:

In [None]:
# process relevance assessments

npl_rel = defaultdict(list)

for entry in npl_rel_list:
  splitted = entry.replace('\n','').split(' ')
  splitted = list(filter(None, splitted))
  ID = int(splitted[0])
  npl_rel[ID].extend(splitted[1:]) 
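
Because `replace('\n','')` joins lines without a separator, the IDs in the loop above stay apart only if every line break is next to a space. A whitespace-agnostic variant can be sketched with `str.split()`; the function name `parse_rel_entry` and the sample entry are hypothetical:

```python
def parse_rel_entry(entry):
    """Split one relevance entry into (query_id, [doc_ids]) on any whitespace."""
    tokens = entry.split()  # no argument: splits on spaces and newlines alike
    return int(tokens[0]), tokens[1:]

# Illustrative entry, not taken from the corpus.
query_id, doc_ids = parse_rel_entry("1\n 10 20\n30 40")
print(query_id, doc_ids)
```

The document IDs stay strings here, as in the original loop, which matches the string IDs used when indexing.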

### Create index for NPL corpus

Create an index for the NPL corpus. This only works if the index does not exist yet.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
npl_index = "npl-corpus"

es.indices.create(npl_index)

create_response = es.cat.indices()
print(create_response)
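
Since `es.indices.create` raises an error when the index already exists, re-running the cell fails. One possible guard, sketched under the assumption of a hypothetical helper named `ensure_index`, checks for the index first:

```python
def ensure_index(client, index_name):
    """Create index_name unless it already exists; return True if it was created."""
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name)
    return True
```

Calling `ensure_index(es, npl_index)` would then be safe to repeat across notebook runs.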

Index all the processed documents into the newly created Elasticsearch index:

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/#example-usage))

In [None]:
for ID, doc_data in npl_txt_data.items():
  es.index(index=npl_index, id=ID, body=doc_data)
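
The loop above sends one request per document. As an alternative sketch, the documents can be turned into bulk actions; the generator name `bulk_actions` and the sample data are hypothetical, while the `_index`, `_id` and `_source` keys follow the action format used by `elasticsearch.helpers.bulk`:

```python
def bulk_actions(docs, index_name):
    """Yield one bulk action per document for use with elasticsearch.helpers.bulk."""
    for doc_id, doc_data in docs.items():
        yield {"_index": index_name, "_id": doc_id, "_source": doc_data}

# Illustrative data with the same shape as npl_txt_data.
sample_docs = {"1": {"text": "first document"}, "2": {"text": "second document"}}
actions = list(bulk_actions(sample_docs, "npl-corpus"))
print(actions[0])
```

Passing the generator to the helper, as in `helpers.bulk(es, bulk_actions(npl_txt_data, npl_index))` after `from elasticsearch import helpers`, should send the documents in batches rather than individually.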

Verify that everything went right by printing the index list:

In [None]:
create_response = es.cat.indices()
print(create_response)

TransportError: ignored

### Use Corpus in Ranking API

Use the ranking evaluation API from Elasticsearch to evaluate the corpus:

(For more information see the [Python client documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval) and the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval))

In [None]:
npl_index = "npl-corpus"

def create_query_body(query_dict, rel_dict, index_name):
  """
  Creates one request per query in query_dict and rates every document listed in rel_dict as relevant (rating 1).
  The index name has to match the index that holds the documents you're looking at.
  Returns an evaluation body for the Elasticsearch ranking evaluation API.
  """
  requests = []
  for query_ID, query_txt in query_dict.items():
    # full-text query over the document fields
    current_query = {"query": {"multi_match": {"query": query_txt['question'], "fields": ["title", "text"]}}}
    current_request = {
        "id": 'Query_' + str(query_ID),
        "request": current_query,
        # every document judged relevant for this query gets rating 1
        "ratings": [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]],
    }
    requests.append(current_request)
  eval_body = {
      "requests": requests,
      "metric": {
          "recall": {
              "relevant_rating_threshold": 1,
              "k": 20
          }
      }
  }
  return eval_body

npl_create = create_query_body(npl_qry_data, npl_rel, npl_index)

npl_eval_body = json.dumps(npl_create)
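
The `recall` metric with `"k": 20` asks how many of the documents rated relevant show up among the top 20 hits. A plain-Python sketch of the textbook recall-at-k definition follows; the function name `recall_at_k` and the IDs are illustrative, and Elasticsearch's own computation may differ in edge cases:

```python
def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of relevant documents that appear among the first k ranked IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    found = relevant.intersection(ranked_ids[:k])
    return len(found) / len(relevant)

# Illustrative IDs only.
print(recall_at_k(["4817", "8582", "8565"], ["8582", "1"], k=20))
```

A helper like this can serve as a sanity check on individual queries when comparing against the scores returned by the ranking API.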

Print results of Ranking API:

In [None]:
npl_res = es.rank_eval(body=npl_eval_body, index=npl_index)

print(json.dumps(npl_res, indent=4, sort_keys=True))

{
    "details": {
        "Query_1": {
            "hits": [
                {
                    "hit": {
                        "_id": "4817",
                        "_index": "npl-corpus",
                        "_score": 16.205086,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
                        "_id": "8582",
                        "_index": "npl-corpus",
                        "_score": 16.07975,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
                        "_id": "8565",
                        "_index": "npl-corpus",
                        "_score": 14.960199,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
    