<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/Data/TimeCorpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set Up an Elasticsearch Instance in Google Colab

Everything needed to connect to Elasticsearch; for a detailed explanation see [this notebook](https://).

Download Elasticsearch:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

Start a local server:

In [None]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# client-side
!pip install elasticsearch -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch()
es.ping()  # got True
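The server needs some time to boot, so a `ping()` issued right after `Popen` may still return `False`. The following is a minimal sketch of a polling helper; the name `wait_until` and the timeout values are assumptions for illustration and are not part of the original notebook.

```python
import time

def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if check():
                return True
        except Exception:
            # the check may raise while the server is still starting
            pass
        time.sleep(interval)
    return False

# Usage in this notebook (assumes the `es` client created above):
# wait_until(es.ping, timeout=120)
```

If the helper returns `False`, the server did not come up within the timeout and the later cells would likely fail.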

# Parsing Time Corpus

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/time/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Benchmarks](https://pragmalingu.de/docs/benchmarks/overview).

### Download Data

Download and unzip data.

In [None]:
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/time/time.tar.gz
!tar -xf time.tar.gz

Set the paths to the downloaded data as variables:

In [None]:
PATH_TO_TIME_TXT = '/content/TIME.ALL'
PATH_TO_TIME_QRY = '/content/TIME.QUE'
PATH_TO_TIME_REL = '/content/TIME.REL'

### Imports

Make all the imports:

In [None]:
from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np
import os

### Process Data

Split the text file and the query file into lists of entries:

In [None]:
txt_entry_marker = re.compile(r'\*TEXT')
qry_entry_marker = re.compile(r'\*FIND    ')
def get_data(PATH_TO_FILES, marker):
  """
  Reads a file and splits its text into entries at the entry marker.
  The 'marker' contains the regex at which we want to split.
  Pops the first element, which holds whatever precedes the first marker.
  """
  with open(PATH_TO_FILES, 'r') as f:
    text = f.read()
    lines = re.split(marker, text)
    lines.pop(0)
  return lines

time_txt_list = get_data(PATH_TO_TIME_TXT, txt_entry_marker)
time_qry_list = get_data(PATH_TO_TIME_QRY, qry_entry_marker)
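To see why the first element is dropped, here is a small sketch that applies the same split to an in-memory string. The sample content is made up in the style of the corpus file and is not real corpus text.

```python
import re

# Made-up sample in the style of the corpus file; not real corpus content.
sample = (
    "*TEXT 001 01/01/63 PAGE 010\n\n FIRST SAMPLE ARTICLE .\n\n"
    "*TEXT 002 01/01/63 PAGE 011\n\n SECOND SAMPLE ARTICLE .\n\n"
)

parts = re.split(r'\*TEXT', sample)

# re.split returns the text before the first marker as element 0;
# because the sample starts with a marker, that element is empty.
first, entries = parts[0], parts[1:]
print(repr(first))
print(len(entries))
```

Each remaining element starts right after the marker, so it still carries the header fields followed by the article text.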

The entries in the text file contain the ID, the publication date, the page, and the text. Since we don't need the page, we can split each entry at the 'PAGE' marker and keep only the text that follows it.

In [None]:
page_split = re.compile(r'PAGE \d{3}')

time_txt_data = defaultdict(dict)
ID = 1
for entry in time_txt_list:
  splitted = re.split(page_split,entry)
  time_txt_data[ID]['text'] = splitted[1]
  ID += 1
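The notebook assigns running IDs and discards the header. If the header fields are needed later, they could be kept as well. The sketch below assumes that the header consists of exactly an ID and a date before the 'PAGE' marker; the sample entry is made up, and the real file should be checked before relying on this layout.

```python
import re

# Made-up entry as it might look after splitting at '*TEXT' (assumed layout).
entry = " 017 01/04/63 PAGE 020\n\n SAMPLE ARTICLE TEXT .\n\n"

# Split once at the page marker: header on the left, article text on the right.
header, text = re.split(r'PAGE \d{3}', entry, maxsplit=1)

# Assumption: the header holds exactly two whitespace-separated fields.
doc_id, pub_date = header.split()

doc = {"id": doc_id, "date": pub_date, "text": text.strip()}
print(doc)
```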

The query file is parsed by splitting each entry at the blank-line separator: the first element is the query ID and the second element is the question:

In [None]:
time_qry_data = defaultdict(dict)
qry_split = re.compile('\n\n ')

for query in time_qry_list:
  entries = re.split(qry_split,query)
  time_qry_data[int(entries[0])]['question'] = entries[1]
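The question string produced this way can still contain line breaks from the source file. As an optional cleanup, whitespace can be collapsed before the text is sent as a query. This is a sketch on a made-up entry, not real query text:

```python
import re

# Made-up query entry as it might look after splitting at '*FIND    '.
sample_query = "1\n\n FIRST LINE OF THE QUESTION\n\nSECOND LINE .\n\n"

entries = re.split(r'\n\n ', sample_query)
query_id = int(entries[0])

# Collapse all runs of whitespace (including newlines) into single spaces.
question = ' '.join(entries[1].split())
print(query_id, question)
```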

Relevance assessments are read line by line and parsed into a dictionary that maps each query ID to the list of its relevant document IDs:

In [None]:
# process relevance assessments without rating
time_rel = defaultdict(list)

with open(PATH_TO_TIME_REL, 'r') as f:
  for lines in f:
    # split() handles any run of whitespace between the IDs
    line = lines.split()
    if len(line) > 1:
      time_rel[int(line[0])].extend(line[1:])
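A short sketch of the same parsing on made-up sample lines, using `str.split()` without arguments so that any amount of whitespace between the numbers is tolerated. The IDs below are invented for illustration.

```python
from collections import defaultdict
from io import StringIO

# Made-up relevance lines: query ID first, then relevant document IDs.
sample_rel = StringIO(" 1  10 20   30\n\n 2  40 50\n\n")

rel = defaultdict(list)
for raw in sample_rel:
    fields = raw.split()          # empty lines yield an empty list
    if len(fields) > 1:
        rel[int(fields[0])].extend(fields[1:])

print(dict(rel))
```

The document IDs stay strings here, which matches how they are later passed as `_id` values.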

### Create index for Time corpus

Create an index for the Time corpus. The call fails if an index with this name already exists.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
time_index = "time-corpus"

es.indices.create(time_index)

create_response = es.cat.indices()
print(create_response)
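Because creating an existing index raises an error, re-running the cell above fails. One way around this is to check for the index first. The helper name `ensure_index` is hypothetical; the sketch relies on the client's `indices.exists` and `indices.create` calls.

```python
def ensure_index(client, index_name):
    """Create index_name unless it already exists; return True if it was created."""
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name)
    return True

# Usage in this notebook (assumes the `es` client and `time_index` from above):
# ensure_index(es, time_index)
```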

Index all processed documents into the created Elasticsearch index:

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/#example-usage))

In [None]:
for ID, doc_data in time_txt_data.items():
  es.index(index=time_index, id=ID, body=doc_data)
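Indexing one document per request works for a corpus of this size, but a bulk request is usually faster. The sketch below only builds the action dictionaries in the shape that `elasticsearch.helpers.bulk` accepts; the generator name `bulk_actions` and the sample documents are assumptions for illustration.

```python
def bulk_actions(index_name, docs):
    """Yield one bulk action per document: target index, document ID, and source."""
    for doc_id, doc in docs.items():
        yield {"_index": index_name, "_id": doc_id, "_source": doc}

# Made-up documents in the same shape as time_txt_data.
sample_docs = {1: {"text": "first"}, 2: {"text": "second"}}
actions = list(bulk_actions("time-corpus", sample_docs))
print(actions)

# With a running server (assumes `es`, `time_index`, and `time_txt_data` from above):
# from elasticsearch import helpers
# helpers.bulk(es, bulk_actions(time_index, time_txt_data))
```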

Verify that everything went right by listing the indices:

In [None]:
create_response = es.cat.indices()
print(create_response)

green  open porter-medline-corpus        m2NiUrTURhypZ2paB_BARw 1 0 1032   0     1mb     1mb
green  open snowball-cranfield-corpus    LZOyuDGLT0-_8TH1nk5kgw 1 0 1400   0   1.5mb   1.5mb
yellow open security-auditlog-2020.07.26 3KQWvUwlRHOYj3AeOYcfkg 1 1   31   0  75.6kb  75.6kb
yellow open security-auditlog-2020.07.27 C3A4S_lhSC-y5NEvhhhAAg 1 1   20   0  36.9kb  36.9kb
yellow open security-auditlog-2020.07.28 sLtcImDYTmafKHDoxhmg8g 1 1   22   0  72.3kb  72.3kb
yellow open security-auditlog-2020.07.29 s_wFvB3PTeKWfVSZFdlgOw 1 1    9   0 123.3kb 123.3kb
green  open snowball-medline-corpus      FqIexMbRRHKEBFapHtahZQ 1 0 1032   0     1mb     1mb
green  open porter-index                 cF_QK9lTRBqtP5WjDcF4SA 1 0    1   0   4.5kb   4.5kb
yellow open pragmalingu-cisi-corpus      CH3y6fjrR-6dlstL1FwRAQ 1 1    0   0    208b    208b
green  open .opendistro_security         aOOHbiN_QwKao4FpRXXu7w 1 0    7   1  45.4kb  45.4kb
green  open stemming-index               cphykMe_SEefw7_Hh-Ne6w 1 0   

### Use Corpus in Ranking API

Use the ranking evaluation API of Elasticsearch to evaluate the corpus:

(For more information see the [python documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval) and the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval))

In [None]:
time_index = "time-corpus"

def create_query_body(query_dict, rel_dict, index_name):
  """
  Creates a request for every query in query_dict and rates the relevant documents from rel_dict with 1.
  The index name has to be the same as the index of the documents you're looking at.
  An evaluation body for the Elasticsearch ranking API is returned.
  """
  eval_body = {
      "requests": [],
      "metric": {
          "recall": {
              "relevant_rating_threshold": 1,
              "k": 20
              }
      }
  }
  requests = []
  for query_ID, query_txt in query_dict.items():
    # build a fresh request per query so that no state is shared between queries
    current_request = {
        "id": 'Query_'+str(query_ID),
        "request": {"query": {"multi_match": {"query": query_txt['question'], "fields": ["title","text"]}}},
        "ratings": [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]]
    }
    requests.append(current_request)
  eval_body["requests"] = requests
  return eval_body

time_create = create_query_body(time_qry_data, time_rel, time_index)

time_eval_body = json.dumps(time_create)
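The evaluation body asks for the `recall` metric with `k` set to 20: of all documents rated relevant for a query, which fraction appears among the top 20 hits. A plain-Python sketch of that calculation follows. The function name `recall_at_k` and the document IDs are made up, and returning 0.0 for a query without relevant documents is an assumption of this sketch.

```python
def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: no relevant documents means a score of 0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# Made-up ranking and ratings: two of the three relevant documents are retrieved.
score = recall_at_k(["a1", "b2", "c3"], ["a1", "c3", "z9"], k=20)
print(round(score, 3))
```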

Print the results of the ranking evaluation API:

In [None]:
time_res = es.rank_eval(body=time_eval_body, index=time_index)

print(json.dumps(time_res, indent=4, sort_keys=True))

{
    "details": {
        "Query_1": {
            "hits": [
                {
                    "hit": {
                        "_id": "304",
                        "_index": "time-corpus",
                        "_score": 21.359013,
                        "_type": "_doc"
                    },
                    "rating": 1
                },
                {
                    "hit": {
                        "_id": "257",
                        "_index": "time-corpus",
                        "_score": 20.824394,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
                        "_id": "288",
                        "_index": "time-corpus",
                        "_score": 20.80226,
                        "_type": "_doc"
                    },
                    "rating": 1
                },
                {
                    "hit": {
          