<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/00_Data/ADICorpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup an Elasticsearch Instance in Google Colab

Everything needed to run and connect to Elasticsearch.
Download and unpack Elasticsearch:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

Start a local server:

In [None]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# client-side
!pip install elasticsearch -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch()
es.ping()  # returns True once the server is up; startup can take a while
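
Because the server is started in the background, `es.ping()` may still return `False` right after launch. The following is a minimal sketch of a polling loop; `wait_until_ready` is a hypothetical helper that is not part of the original notebook, and the timeout and interval values are arbitrary assumptions.

```python
import time

def wait_until_ready(ping, timeout=60.0, interval=1.0):
  """Poll `ping` until it returns True or `timeout` seconds have passed."""
  deadline = time.monotonic() + timeout
  while time.monotonic() < deadline:
    try:
      if ping():
        return True
    except Exception:
      # the client may raise while the server is still booting
      pass
    time.sleep(interval)
  return False
```

With the client from the cell above, this could be called as `wait_until_ready(es.ping)` before any index is created.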

# Parsing ADI Corpus

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/adi/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Data Sets](https://pragmalingu.de/docs/guides/data-comparison). 
You can learn about parsing in general by reading our [parsing guide](https://pragmalingu.de/docs/guides/how-to-parse).

## Download Data

Download and unzip data.

In [None]:
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/adi/adi.tar.gz
!tar -xf adi.tar.gz

Set paths to the downloaded data as variables:

In [None]:
PATH_TO_ADI_TXT = '/content/ADI.ALL'
PATH_TO_ADI_QRY = '/content/ADI.QRY'
PATH_TO_ADI_REL = '/content/ADI.REL'

## Imports

Make all the imports:

In [None]:
from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np

## Process Data

Split the document file and the query file into lists of entries:

In [None]:
ID_marker = re.compile(r'^\.I', re.MULTILINE)

def get_data(PATH_TO_FILE, marker):
  """
  Reads the file and splits the text into entries at the ID marker '.I'.
  The first entry is empty, so it is removed.
  'marker' contains the regex at which we want to split.
  """
  with open(PATH_TO_FILE, 'r') as f:
    text = f.read()
  lines = re.split(marker, text)
  lines.pop(0)
  return lines

adi_txt_list = get_data(PATH_TO_ADI_TXT, ID_marker)
adi_qry_list = get_data(PATH_TO_ADI_QRY, ID_marker)
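
To see what the split produces, here is a small sketch that applies the same regex to a made-up string instead of the corpus files. The sample text is an assumption that only mimics the `.I` / `.W` layout.

```python
import re

ID_MARKER = re.compile(r'^\.I', re.MULTILINE)

# made-up sample in the '.I' / '.W' layout, not taken from the corpus
sample = ".I 1\n.W\nfirst entry\n.I 2\n.W\nsecond entry\n"

entries = re.split(ID_MARKER, sample)
# the text before the first marker is empty, so the first element is dropped
entries.pop(0)
print(entries)
```

Each remaining entry starts with the ID, followed by the field markers and their content.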


Process the list of document entries into nested dictionaries that can be saved as JSON:

In [None]:
adi_title_start = re.compile(r'\.T')
adi_author_start = re.compile(r'\.A')
adi_text_start = re.compile(r'\.W')

# process the document data

adi_txt_data = defaultdict(dict)

for line in adi_txt_list:
  # split off the ID that precedes the title marker '.T'
  entries = re.split(adi_title_start, line, 1)
  doc_id = entries[0].strip()
  no_id = entries[1]
  # the author field '.A' is optional
  no_id_entries = re.split(adi_author_start, no_id, 1)
  if len(no_id_entries) > 1:
    adi_txt_data[doc_id]['title'] = no_id_entries[0]
    no_title = no_id_entries[1]
    no_title_entries = re.split(adi_text_start, no_title)
    adi_txt_data[doc_id]['author'] = no_title_entries[0]
    adi_txt_data[doc_id]['text'] = no_title_entries[1]
  else:
    no_id_entries = re.split(adi_text_start, no_id)
    adi_txt_data[doc_id]['title'] = no_id_entries[0]
    adi_txt_data[doc_id]['text'] = no_id_entries[1]
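
The two branches of the loop can be tried out on single entries. The sketch below wraps the same splitting logic in a hypothetical function `parse_entry` and runs it on two made-up entries, one with and one without an author field.

```python
import re

TITLE = re.compile(r'\.T')
AUTHOR = re.compile(r'\.A')
TEXT = re.compile(r'\.W')

def parse_entry(entry):
  """Return (doc_id, fields) for one entry; 'author' is only set if '.A' occurs."""
  doc_id, rest = re.split(TITLE, entry, 1)
  fields = {}
  parts = re.split(AUTHOR, rest, 1)
  if len(parts) > 1:
    fields['title'] = parts[0]
    author, text = re.split(TEXT, parts[1], 1)
    fields['author'] = author
  else:
    title, text = re.split(TEXT, rest, 1)
    fields['title'] = title
  fields['text'] = text
  return doc_id.strip(), fields

# made-up entries, not taken from the corpus
with_author = " 1\n.T\nsample title\n.A\nA. Author\n.W\nsample text\n"
without_author = " 2\n.T\nother title\n.W\nother text\n"

print(parse_entry(with_author))
print(parse_entry(without_author))
```

Note that the field values keep their surrounding newlines; only the ID is stripped.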


The query file is processed the same way as the document file, but there is less information to parse:

In [None]:
qry_chunk_start = re.compile(r'\.W')

# process the query data
adi_qry_data = defaultdict(dict)

# number the queries from 1 in file order; starting at index 0 keeps the first query
for n, line in enumerate(adi_qry_list, start=1):
  _, question = re.split(qry_chunk_start, line, 1)
  adi_qry_data[n]['question'] = question

Relevance assessments are loaded with numpy and parsed into a dictionary:

In [None]:
# process relevance assessments

# the first column holds the query ID, the second the ID of a relevant document
adi_np = np.loadtxt(PATH_TO_ADI_REL, dtype=float)

adi_rel = defaultdict(list)
for row in adi_np:
  adi_rel[int(row[0])].append(int(row[1]))
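
The grouping step can also be followed without numpy. This is a sketch on made-up rows. The four-column layout of the sample is an assumption, and only the first two columns (query ID and document ID) are used, as in the cell above.

```python
from collections import defaultdict
from io import StringIO

# made-up rows: query ID, document ID, then two columns that are ignored here
sample_rel = StringIO("1 17 0 0.000000\n1 46 0 0.000000\n2 12 0 0.000000\n")

rel = defaultdict(list)
for row in sample_rel:
  columns = row.split()
  if len(columns) < 2:
    continue  # skip blank or malformed lines
  rel[int(columns[0])].append(int(columns[1]))

print(dict(rel))
```

Every query ID ends up mapped to the list of document IDs judged relevant for it.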

## Create index for ADI corpus

Create an index for the ADI corpus. This is only possible if the index doesn't exist yet.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
adi_index = "adi-corpus"
es.indices.create(index=adi_index)
#print new index list
create_response = es.cat.indices()
print(create_response)
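
Since creating an index that already exists raises an error, re-running the cell fails. One way around this is to check first. The sketch below uses a hypothetical helper `ensure_index` and assumes the client exposes `indices.exists` and `indices.create` with an `index` keyword, as the 7.x Python client does.

```python
def ensure_index(client, index_name):
  """Create `index_name` unless it already exists; return True if it was created."""
  if client.indices.exists(index=index_name):
    return False
  client.indices.create(index=index_name)
  return True
```

With the client from above, the call would be `ensure_index(es, adi_index)`.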

Index all processed documents into the newly created index in Elasticsearch:

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/#example-usage))

In [None]:
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
adi_index = "adi-corpus"

for ID, doc_data in adi_txt_data.items():
  es.index(index=adi_index, id=ID, body=doc_data)
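
The loop above sends one request per document. For larger corpora, the client's bulk helper can send many documents at once. The following sketch only builds the action dictionaries; `generate_actions` is a hypothetical helper and the sample document is made up.

```python
def generate_actions(docs, index_name):
  """Yield one bulk action per document in the format used by the bulk helper."""
  for doc_id, doc_data in docs.items():
    yield {"_index": index_name, "_id": doc_id, "_source": doc_data}

# made-up document, not taken from the corpus
sample_docs = {"1": {"title": "sample title", "text": "sample text"}}
actions = list(generate_actions(sample_docs, "adi-corpus"))
print(actions)

# With a running server, the actions could be passed to the bulk helper:
# from elasticsearch import helpers
# helpers.bulk(es, generate_actions(adi_txt_data, adi_index))
```

For the small ADI corpus the per-document loop is fast enough, so this is optional.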

Verify that everything went right by printing the index list:

In [None]:
create_response = es.cat.indices()
print(create_response)

yellow open cranfield-corpus             vMQyNTmhSy-0U1kqx5RxHw 1 1 1400  0   1.7mb   1.7mb
green  open .apm-agent-configuration     ve_Nk7yfSXy2y-zTf9M2kw 1 0    0  0    283b    283b
yellow open adi-corpus                   CodLkS3oRgyvEO2ToTZf8Q 1 1   82  0  80.7kb  80.7kb
green  open .kibana_1                    sxcOGmbCSvy0vcGeyhi7RQ 1 0   59  3 936.5kb 936.5kb
yellow open medline-corpus               tL_MDxdLTvKQotxMmSw6uQ 1 1 1032  0   1.1mb   1.1mb
green  open searchguard                  akAb6wLrTuOFHDCgBmezlA 1 0    7  0  32.6kb  32.6kb
yellow open test-index                   z5z3SbtRSS67rMlXmKHWEg 1 1    1  0   4.5kb   4.5kb
yellow open cranfield-corpus-eng         Ivu9_hU_SMeE18yHf7P0yw 1 1 1400  0   1.3mb   1.3mb
green  open .kibana_task_manager_1       OjDBu1J4Rc2WvZ3xBljVjg 1 0    2  0    32kb    32kb
green  open kibana_sample_data_ecommerce gCXw4kSbRtGx3i2uvbmxCQ 1 0 4675  0   4.7mb   4.7mb
yellow open my_index                     W9OiE90UQnmz1-aptT4C-A 1 1    6 18  14.

## Use Corpus in Ranking API

Use the ranking evaluation API from Elasticsearch to evaluate the corpus:

(For more information see the [Python client documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval) and the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval))

In [None]:
#use rank eval api, see https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval 
#and https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval
from collections import defaultdict

adi_index = "adi-corpus"

def create_query_body(query_dict, rel_dict, index_name):
  """
  Creates a request for every query in query_dict and gives every relevant document from rel_dict a rating of 1.
  The index name has to be the same as the index of the documents you're looking at.
  An evaluation body for the Elasticsearch ranking evaluation API is returned.
  """
  eval_body = {
      "requests": [],
      "metric": {
          "recall": {
              "relevant_rating_threshold": 1,
              "k": 20
              }
      }
  }
  requests = []
  for query_ID, query_txt in query_dict.items():
    # search the question text in both the title and the text field
    current_query = {"query": {"multi_match": {"query": query_txt['question'], "fields": ["title", "text"]}}}
    current_request = {
        "id": 'Query_' + str(query_ID),
        "request": current_query,
        "ratings": [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]],
    }
    requests.append(current_request)
  eval_body["requests"] = requests
  return eval_body

adi_create = create_query_body(adi_qry_data, adi_rel, adi_index)

adi_eval_body = json.dumps(adi_create)
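
The body above asks for the `recall` metric with `k` set to 20. As a rough illustration of what that metric measures, here is a sketch that computes recall at k by hand for one query. `recall_at_k` is a hypothetical helper, the IDs are made up, and returning 0.0 for a query without relevant documents is a choice of this sketch, not necessarily what Elasticsearch does.

```python
def recall_at_k(ranked_ids, relevant_ids, k=20):
  """Fraction of the relevant documents that appear among the first k hits."""
  relevant = set(relevant_ids)
  if not relevant:
    return 0.0
  found = relevant.intersection(ranked_ids[:k])
  return len(found) / len(relevant)

# made-up ranking and relevance judgments for a single query
ranked = ["12", "77", "58", "3"]
relevant = ["77", "3", "40"]
print(recall_at_k(ranked, relevant, k=3))
print(recall_at_k(ranked, relevant, k=4))
```

With `k=3` only one of the three relevant documents is among the hits, so the value is one third; with `k=4` a second relevant document is included.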

Print the results of the ranking evaluation API:

In [None]:
adi_res = es.rank_eval(body=adi_eval_body, index=adi_index)

print(json.dumps(adi_res, indent=4, sort_keys=True))

{
    "details": {
        "Query_1": {
            "hits": [
                {
                    "hit": {
                        "_id": "12",
                        "_index": "adi-corpus",
                        "_score": 13.849522,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
                        "_id": "77",
                        "_index": "adi-corpus",
                        "_score": 13.300932,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
                        "_id": "58",
                        "_index": "adi-corpus",
                        "_score": 10.635579,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
         