<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/00_Data/CISICorpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set Up an Elasticsearch Instance in Google Colab

Everything needed to run Elasticsearch and connect to it.

Download and unpack Elasticsearch:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

Start a local server:

In [None]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# client-side
!pip install "elasticsearch>=7.9,<8" -q  # keep the client on the 7.x line to match the 7.9.1 server
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch()
es.ping()  # got True
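
The server usually needs some time to boot, so a `ping()` sent right after `Popen` can still return `False`. A minimal sketch of a polling helper follows; the name `wait_until_ready` and its `retries` and `delay` parameters are hypothetical and not part of the Elasticsearch client.

```python
import time

def wait_until_ready(check, retries=30, delay=2.0):
    """Call `check()` until it returns True or the retries are used up."""
    for _ in range(retries):
        try:
            if check():
                return True
        except Exception:
            # the check may raise while the server is still booting
            pass
        time.sleep(delay)
    return False
```

In the notebook this could be called as `wait_until_ready(es.ping)` before any index is created.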

# Parsing CISI Corpus

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Data Sets](https://pragmalingu.de/docs/guides/data-comparison). 
You can learn about parsing in general by reading our [parsing guide](https://pragmalingu.de/docs/guides/how-to-parse).

### Download Data

Download and unpack the data:

In [None]:
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/cisi.tar.gz
!tar -xf cisi.tar.gz

Set the paths to the downloaded data as variables:

In [None]:
PATH_TO_CISI_TXT = '/content/CISI.ALL'
PATH_TO_CISI_QRY = '/content/CISI.QRY'
PATH_TO_CISI_REL = '/content/CISI.REL'

### Imports

Import the required modules:

In [None]:
from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np

### Process Data

Read the document file and the query file and split each into a list of entries.
In this corpus '.I' also occurs in places other than the ID line, so the split pattern is anchored to the start of a line, which differs slightly from how the other corpora are handled:

In [None]:
ID_marker = re.compile(r'^\.I', re.MULTILINE)

def get_data(PATH_TO_FILE, marker):
  """
  Reads the file and splits the text into entries at the ID marker '.I'.
  The first entry is empty, so it is removed.
  'marker' is the compiled regex at which the text is split.
  """
  with open(PATH_TO_FILE, 'r') as f:
    text = f.read()
  lines = re.split(marker, text)
  lines.pop(0)
  return lines

cisi_txt_list = get_data(PATH_TO_CISI_TXT, ID_marker)
cisi_qry_list = get_data(PATH_TO_CISI_QRY, ID_marker)
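
To illustrate why the marker is anchored to the start of a line, here is a small sketch on a made-up two-document string (the sample text is invented and not taken from the corpus). A plain `str.split('.I')` also cuts inside the title, while the anchored regex only splits at the ID lines.

```python
import re

ID_marker = re.compile(r'^\.I', re.MULTILINE)

# invented sample: the first title contains '.I' in the middle of a line
sample = (
    ".I 1\n"
    ".T\n"
    "A title mentioning the U.S.I.S. programme\n"
    ".W\n"
    "First abstract.\n"
    ".I 2\n"
    ".T\n"
    "Second title\n"
    ".W\n"
    "Second abstract.\n"
)

naive = sample.split('.I')          # also splits inside 'U.S.I.S.'
entries = re.split(ID_marker, sample)
entries.pop(0)                      # text before the first marker is empty

print(len(naive), len(entries))
```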

Process the document entries into nested dictionaries that can be saved as JSON:

In [None]:
cisi_title_start = re.compile(r'[\n]\.T')
cisi_author_start = re.compile(r'[\n]\.A')
cisi_date_start = re.compile(r'[\n]\.B')
cisi_text_start = re.compile(r'[\n]\.W')
cisi_cross_start = re.compile(r'[\n]\.X')

# process the document data

cisi_txt_data = defaultdict(dict)

for line in cisi_txt_list:
  entries = re.split(cisi_title_start,line,1)
  id = entries[0].strip()#save the id
  no_id = entries[1] 
  
  # split at every '.A' marker: the first piece is the title, the others are authors
  no_id_entries = re.split(cisi_author_start, no_id)
  no_title = no_id_entries.pop() # the last piece still carries the date, text and cross-reference fields

  if len(re.split(cisi_date_start, no_title)) > 1: # is there a publication date?
    no_title_entries = re.split(cisi_date_start, no_title)
    no_author_entries = re.split(cisi_text_start, no_title_entries[1])
    cisi_txt_data[id]['publication_date'] = no_author_entries[0].strip() # save publication date
    no_author_date = no_author_entries[1]
  else:
    no_title_entries = re.split(cisi_text_start, no_title)
    no_author_date = no_title_entries[1]

  no_id_entries.append(no_title_entries[0]) # now: title followed by zero or more authors
  cisi_txt_data[id]['title'] = no_id_entries[0].strip() # save title
  cisi_txt_data[id]['author'] = ','.join(a.strip() for a in no_id_entries[1:]) # save author(s), comma-separated

  last_entries = re.split(cisi_cross_start, no_author_date)
  cisi_txt_data[id]['text'] = last_entries[0].strip() # save text
  cisi_txt_data[id]['cross-refrences'] = last_entries[1].strip() # save cross-references (field name kept as is)
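
As an alternative to the step-by-step splitting above, the fields of one entry can also be separated in a single pass. This is only a sketch and not the approach used in this notebook: the helper `split_fields` and the pattern `FIELD_MARKER` are hypothetical, the sample entry is invented, and the code assumes that every field marker stands alone on its own line.

```python
import re

# assumption: each marker (.T, .A, .B, .W, .X) is the only content of its line
FIELD_MARKER = re.compile(r'^\.([TABWX])[ \t]*$', re.MULTILINE)

def split_fields(entry):
    """Return (doc_id, {marker: [values]}) for one entry without its '.I' prefix."""
    parts = FIELD_MARKER.split(entry)
    doc_id = parts[0].strip()
    fields = {}
    # after the ID, parts alternates between marker letter and field body
    for marker, body in zip(parts[1::2], parts[2::2]):
        fields.setdefault(marker, []).append(body.strip())
    return doc_id, fields

# invented entry with two authors
entry = " 7\n.T\nSample title\n.A\nFirst, A.\n.A\nSecond, B.\n.W\nSample abstract text.\n.X\n7\t1\t7\n"
doc_id, fields = split_fields(entry)
print(doc_id, fields['A'])
```

Collecting values in a list per marker keeps repeated fields such as several `.A` lines without special cases.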


The query file is processed the same way, but there is less information to parse:

In [None]:
qry_chunk_start = re.compile(r'[\n]\.W')

# process the query data
cisi_qry_data = defaultdict(dict)

for line in cisi_qry_list:
  id, question = re.split(qry_chunk_start, line)
  id = re.findall(r'\d{1,3}', id)[0]
  cisi_qry_data[id]['question'] = question


Relevance assessments are parsed and saved to a dictionary of lists:

In [None]:
# process the relevance assessment data
cisi_rel = defaultdict(list)

with open (PATH_TO_CISI_REL,'r') as f:
  for line in f:
    line = line.replace(' ','\t')
    line = re.split('\t',line)
    line = list(filter(None, line))
    cisi_rel[str(line[0])].append(str(line[1]))
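
The same grouping can be sketched on a few in-memory lines. The sample lines below are invented and only assume that the query ID and the document ID are the first two whitespace-separated columns; `str.split()` without arguments handles any mix of spaces and tabs.

```python
from collections import defaultdict
from io import StringIO

# invented sample lines: query ID, document ID, then columns that are ignored
sample_rel = StringIO(
    "     1     28\t0\t0.000000\n"
    "     1     35\t0\t0.000000\n"
    "     2     29\t0\t0.000000\n"
)

rel = defaultdict(list)
for line in sample_rel:
    columns = line.split()  # splits on any run of spaces or tabs
    if len(columns) >= 2:   # skip empty or malformed lines
        rel[columns[0]].append(columns[1])

print(dict(rel))
```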

### Create index for CISI corpus

Create an index for the CISI corpus. This only works if the index does not exist yet; otherwise the call raises an error.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
es.indices.create("cisi-corpus")
#print new index list
create_response = es.cat.indices()
print(create_response)
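
If the cell may be run more than once, the creation can be guarded by an existence check. The helper name `ensure_index` is hypothetical; it is a sketch that assumes a client object exposing `indices.exists(index=...)` and `indices.create(index=...)`, as the Python Elasticsearch client does.

```python
def ensure_index(client, index_name):
    """Create `index_name` unless it already exists; return True if it was created."""
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name)
    return True
```

With the client from above, this would be called as `ensure_index(es, "cisi-corpus")`.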

Index all processed documents into the newly created index:

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/#example-usage))

In [None]:
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
cisi_index = "cisi-corpus"

for ID, doc_data in cisi_txt_data.items():
  es.index(index=cisi_index, id=ID, body=doc_data)
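
Indexing one document per request is fine for a corpus of this size. For larger collections, the documents could instead be turned into bulk actions. The generator below is a sketch with a hypothetical name (`to_bulk_actions`) and an invented sample document; it only builds the action dictionaries and does not talk to a server.

```python
def to_bulk_actions(documents, index_name):
    """Yield one action per document with the '_index', '_id' and '_source' keys used by bulk helpers."""
    for doc_id, doc_data in documents.items():
        yield {"_index": index_name, "_id": doc_id, "_source": doc_data}

# invented sample document
sample_docs = {"1": {"title": "Sample title", "text": "Sample text"}}
actions = list(to_bulk_actions(sample_docs, "cisi-corpus"))
print(actions)
```

Assuming the `elasticsearch.helpers` module is available, the actions could then be sent with `helpers.bulk(es, to_bulk_actions(cisi_txt_data, cisi_index))`.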

Verify that everything went right by printing the index list:

In [None]:
create_response = es.cat.indices()
print(create_response)

green  open porter-medline-corpus        pCZngrxjSS2FwFgxbUqNnA 1 0  1032   0     1mb     1mb
green  open snowball-cranfield-corpus    W6kIploWQJGNMm6xwGtJbg 1 0  1400   0   1.5mb   1.5mb
yellow open security-auditlog-2020.07.26 3KQWvUwlRHOYj3AeOYcfkg 1 1    31   0  75.6kb  75.6kb
yellow open security-auditlog-2020.07.27 C3A4S_lhSC-y5NEvhhhAAg 1 1    20   0  36.9kb  36.9kb
yellow open security-auditlog-2020.07.28 sLtcImDYTmafKHDoxhmg8g 1 1    22   0  72.3kb  72.3kb
yellow open security-auditlog-2020.07.29 s_wFvB3PTeKWfVSZFdlgOw 1 1     9   0 123.3kb 123.3kb
green  open snowball-medline-corpus      OpgXBcIrRnma8hWRj3YeSg 1 0  1032   0     1mb     1mb
green  open porter-index                 en25T-o8Tduy5B0rCQnBrg 1 0     1   0   4.5kb   4.5kb
yellow open pragmalingu-cisi-corpus      CH3y6fjrR-6dlstL1FwRAQ 1 1     0   0    208b    208b
green  open stemmer-cacm-corpus          xDyo3niKQ2GPnqOwqeB1YQ 1 0  3204   0     4mb     4mb
green  open stemming-time-corpus         RntweNy1Rl2zhkwcpEL

### Use Corpus in Ranking API

Use the ranking evaluation API from Elasticsearch to evaluate the corpus:

(For more information see the [Python client documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval) and the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval))

In [None]:
#use rank eval api, see https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval 
#and https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval
from collections import defaultdict

cisi_index = "cisi-corpus"

def create_query_body(query_dict, rel_dict, index_name):
  """
  Creates one request for every query in query_dict and gives each relevant document from rel_dict a rating of 1.
  The index name has to be the index that holds the documents you are evaluating.
  Returns an evaluation body for the Elasticsearch ranking evaluation API.
  """
  eval_body = {
      "requests":'',
      "metric": {
          "recall": {
              "relevant_rating_threshold": 1,
              "k": 20
              }
      }
  }
  requests = []
  for query_ID, query_txt in query_dict.items():
    # build a fresh request for every query so that no state is shared between iterations
    current_query = {"query": {"multi_match": {"query": query_txt['question'], "fields": ["title", "text"]}}}
    current_request = {
        "id": 'Query_' + str(query_ID),
        "request": current_query,
        "ratings": [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]],
    }
    requests.append(current_request)
  eval_body["requests"] = requests
  return eval_body

cisi_create = create_query_body(cisi_qry_data, cisi_rel, cisi_index)

cisi_eval_body = json.dumps(cisi_create)
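
The `recall` metric with `k: 20` reports, per query, the share of relevant documents that show up among the first 20 hits. A minimal sketch of that calculation on toy data follows. The function `recall_at_k` and the document IDs are made up for illustration, and returning 0.0 for a query without relevant documents is a convention chosen here, not a statement about how Elasticsearch handles that case.

```python
def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Share of the relevant documents that appear among the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen for this sketch
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)

# toy data: two of the four relevant documents are in the top three results
ranked = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d8", "d9"]
score = recall_at_k(ranked, relevant, k=3)
print(score)
```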

Print the results of the ranking evaluation API:

In [None]:
cisi_res = es.rank_eval(cisi_eval_body, cisi_index)

print(json.dumps(cisi_res, indent=4, sort_keys=True))

{
    "details": {
        "Query_1": {
            "hits": [
                {
                    "hit": {
                        "_id": "722",
                        "_index": "cisi-corpus",
                        "_score": 29.84168,
                        "_type": "_doc"
                    },
                    "rating": 1
                },
                {
                    "hit": {
                        "_id": "589",
                        "_index": "cisi-corpus",
                        "_score": 27.530773,
                        "_type": "_doc"
                    },
                    "rating": 1
                },
                {
                    "hit": {
                        "_id": "236",
                        "_index": "cisi-corpus",
                        "_score": 26.994562,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
          