<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/00_Data/CACMCorpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set Up an Elasticsearch Instance in Google Colab

Everything needed to connect to Elasticsearch.
Download and unpack Elasticsearch:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

Start a local server:

In [None]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
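# client-side
# pin the client to 7.x so that it matches the 7.9.1 server
!pip install "elasticsearch>=7,<8" -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch()
es.ping()  # returns True once the server has finished starting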
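
Elasticsearch usually needs several seconds to start, so `es.ping()` can return `False` when it runs right after the server process is launched. A minimal sketch of a polling helper follows. The name `wait_until` and its `timeout` and `interval` parameters are hypothetical and not part of the Elasticsearch client.

```python
import time

def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Assumed usage in this notebook: wait_until(es.ping)
```

The helper takes any zero-argument callable, so it can be tried out without a running server.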

# Parsing CACM Corpus

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Data Sets](https://pragmalingu.de/docs/guides/data-comparison). 
You can learn about parsing in general by reading our [parsing guide](https://pragmalingu.de/docs/guides/how-to-parse).

### Download Data

Download and unzip data.

In [None]:
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/cacm.tar.gz
!tar -xf cacm.tar.gz

--2021-01-19 13:28:11--  http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/cacm.tar.gz
Resolving ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)... 130.209.240.253
Connecting to ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)|130.209.240.253|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 705665 (689K) [application/gzip]
Saving to: ‘cacm.tar.gz’


2021-01-19 13:28:13 (528 KB/s) - ‘cacm.tar.gz’ saved [705665/705665]



Set the paths to the downloaded data as variables:

In [None]:
PATH_TO_CACM_TXT = '/content/cacm.all'
PATH_TO_CACM_QRY = '/content/query.text'
PATH_TO_CACM_REL = '/content/qrels.text'
PATH_TO_CACM_COMMON_WORDS = '/content/common_words'

### Imports

Make all the imports:

In [None]:
from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np

### Process Data

Get the text entries from the text and query file preprocessed as a list:

In [None]:
ID_marker = re.compile(r'^\.I', re.MULTILINE)

def get_data(PATH_TO_FILE, marker):
  """
  Reads the file and splits the text into entries at the ID marker '.I'.
  The first entry is empty, so it is removed.
  'marker' contains the regex at which we want to split.
  """
  with open(PATH_TO_FILE, 'r') as f:
    text = f.read()
  lines = re.split(marker, text)
  lines.pop(0)
  return lines

cacm_txt_list = get_data(PATH_TO_CACM_TXT, ID_marker)
cacm_qry_list = get_data(PATH_TO_CACM_QRY, ID_marker)
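
To make the splitting step concrete, here is a small sketch that applies the same marker to a made-up two-record string instead of the real file. The sample text is an assumption for illustration only.

```python
import re

ID_marker = re.compile(r'^\.I', re.MULTILINE)

# Made-up two-record sample in the same layout as cacm.all
sample = ".I 1\n.T\nFirst title\n.I 2\n.T\nSecond title\n"

entries = re.split(ID_marker, sample)
entries.pop(0)  # the text before the first '.I' is empty

for entry in entries:
    print(repr(entry))
```

Each remaining entry starts with the record ID, followed by the marked fields of that record.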

Process the entries of the text file into nested dictionaries that can be saved as JSON:

In [None]:
# process text file
cacm_chunk_title = re.compile(r'\.T\n')
cacm_chunk_markers = re.compile(r'^\.', re.MULTILINE)
# field markers are anchored with '^' so they only match at the start of an entry
cacm_chunk_txt = re.compile(r'^W\n')
cacm_chunk_pub = re.compile(r'^B\n')
cacm_chunk_author = re.compile(r'^A\n')
cacm_chunk_add = re.compile(r'^N\n')
cacm_chunk_cross = re.compile(r'^X\n')

cacm_txt_data = defaultdict(dict)

for line in cacm_txt_list:
    entries = re.split(cacm_chunk_title, line)
    doc_id = entries[0].strip()  # save id
    no_id = entries[1]
    list_of_entries = re.split(cacm_chunk_markers, no_id)
    cacm_txt_data[doc_id]['title'] = list_of_entries[0].strip()
    for entry in list_of_entries:
        if len(re.split(cacm_chunk_txt, entry)) == 2:  # check for text
            cacm_txt_data[doc_id]['text'] = re.split(cacm_chunk_txt, entry)[1]
        if len(re.split(cacm_chunk_pub, entry)) == 2:  # check for publication date
            cacm_txt_data[doc_id]['publication_date'] = re.split(cacm_chunk_pub, entry)[1]
        if len(re.split(cacm_chunk_author, entry)) == 2:  # check for author
            cacm_txt_data[doc_id]['author'] = re.split(cacm_chunk_author, entry)[1]
        if len(re.split(cacm_chunk_add, entry)) == 2:  # check for additional date
            cacm_txt_data[doc_id]['add_date'] = re.split(cacm_chunk_add, entry)[1]
        if len(re.split(cacm_chunk_cross, entry)) == 2:  # check for cross-references
            cacm_txt_data[doc_id]['cross-references'] = re.split(cacm_chunk_cross, entry)[1]
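
As an illustration of the field layout, the following sketch parses one made-up record with a marker-to-name lookup table. It is an alternative formulation under stated assumptions, not the notebook's own code: `parse_record` and `FIELD_NAMES` are hypothetical names, the sample record is invented, and unlike the cell above the sketch strips surrounding whitespace from every field.

```python
import re

# Marker letter -> field name, mirroring the fields handled in this notebook
FIELD_NAMES = {"T": "title", "W": "text", "B": "publication_date",
               "A": "author", "N": "add_date", "X": "cross-references"}

def parse_record(record):
    """Split one record (the text after '.I') into its ID and marked fields."""
    doc_id, _, rest = record.partition("\n")
    fields = {}
    # Each chunk starts with the marker letter, a newline, then the content
    for chunk in re.split(r'^\.', rest, flags=re.MULTILINE):
        marker, _, content = chunk.partition("\n")
        if marker in FIELD_NAMES:
            fields[FIELD_NAMES[marker]] = content.strip()
    return doc_id.strip(), fields

# Made-up record for illustration
sample = " 7\n.T\nA made-up title\n.W\nSome abstract text.\n.A\nDoe, J.\n"
doc_id, fields = parse_record(sample)
print(doc_id, fields)
```

Like the cell above, the sketch would also split on any content line that happens to start with a period.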

The query file is processed the same way, but it has fewer fields to parse:

In [None]:
qry_chunk_start = re.compile(r'^\.W\n', re.MULTILINE)

# process the query data
cacm_qry_data = defaultdict(dict)

# query IDs start at 1
for n, line in enumerate(cacm_qry_list, start=1):
  _, question = re.split(qry_chunk_start, line)
  cacm_qry_data[n]['question'] = question

The relevance assessments are read line by line and parsed into a dictionary that maps each query ID to the IDs of its relevant documents:

In [None]:
# process the relevance assessment data
cacm_rel = defaultdict(list)

with open(PATH_TO_CACM_REL, 'r') as f:
  for line in f:
    line = re.split(' ', line)
    # first column: query ID, second column: relevant document ID
    cacm_rel[int(line[0])].append(line[1])
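
The resulting structure can be illustrated with a few made-up lines in the "query ID, document ID, ..." layout that the loop above expects. The sample values are assumptions, and the sketch uses `str.split()` without arguments, which also tolerates repeated spaces.

```python
from collections import defaultdict

# Made-up lines for illustration, not taken from qrels.text
sample_lines = ["1 10 0 0\n", "1 25 0 0\n", "2 7 0 0\n"]

rel = defaultdict(list)
for line in sample_lines:
    parts = line.split()  # split on any run of whitespace
    rel[int(parts[0])].append(parts[1])

print(dict(rel))
```

Query IDs become integer keys, while document IDs stay strings so they can be used directly as Elasticsearch `_id` values.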

### Create index for CACM corpus

Create an index for the CACM corpus. This only works if the index does not exist yet; otherwise the call raises an error.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
# create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
cacm_index = "cacm-corpus"
es.indices.create(index=cacm_index)
# print new index list
create_response = es.cat.indices()
print(create_response)
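
Because creating an existing index fails, rerunning the cell above raises an error. One way around this, sketched here under the assumption that skipping creation is acceptable, is to check for the index first. `ensure_index` is a hypothetical helper name; `indices.exists` and `indices.create` are the client calls it relies on.

```python
def ensure_index(client, index_name):
    """Create index_name only if it does not exist yet; return True if created."""
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name)
    return True

# Assumed usage in this notebook: ensure_index(es, cacm_index)
```

The check and the creation are two separate requests, so this is only suitable for a single-user setup such as a Colab session.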

Add all processed documents to the newly created index in Elasticsearch:

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/#example-usage))

In [None]:
cacm_index = "cacm-corpus"

for ID, doc_data in cacm_txt_data.items():
  es.index(index=cacm_index, id=ID, body=doc_data)
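
The loop above sends one request per document. As a possible alternative, the client's bulk helper can send many documents per request. The sketch below only builds the action dictionaries, so it runs without a server. `make_actions` is a hypothetical name, and the commented-out `helpers.bulk` call is assumed usage that was not part of the original notebook.

```python
def make_actions(docs, index_name):
    """Yield one bulk action per document, in the format the bulk helper accepts."""
    for doc_id, doc_data in docs.items():
        yield {"_index": index_name, "_id": doc_id, "_source": doc_data}

# Made-up document for illustration
sample_docs = {"1": {"title": "A made-up title"}}
print(list(make_actions(sample_docs, "cacm-corpus")))

# Assumed usage in this notebook (requires the running server):
# from elasticsearch import helpers
# helpers.bulk(es, make_actions(cacm_txt_data, cacm_index))
```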

Verify that everything went right by printing the index list again:

In [None]:
create_response = es.cat.indices()
print(create_response)

### Use Corpus in Ranking API

Use the ranking evaluation API from Elasticsearch to evaluate the corpus:

(For more information see the [Python client documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval) and the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval))

In [None]:
cacm_index = "cacm-corpus"

def create_query_body(query_dict, rel_dict, index_name):
  """
  Creates one request for every query in query_dict and gives every document
  listed for that query in rel_dict a rating of 1.
  The index name has to match the index that holds the documents you're evaluating.
  An evaluation body for the Elasticsearch ranking evaluation API is returned.
  """
  requests = []
  for query_ID, query_txt in query_dict.items():
    # search the query text in the title and text fields
    current_query = {"query": {"multi_match": {"query": query_txt['question'], "fields": ["title", "text"]}}}
    current_request = {
        "id": 'Query_' + str(query_ID),
        "request": current_query,
        "ratings": [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]]
    }
    requests.append(current_request)
  eval_body = {
      "requests": requests,
      "metric": {
          "recall": {
              "relevant_rating_threshold": 1,
              "k": 20
          }
      }
  }
  return eval_body

cacm_create = create_query_body(cacm_qry_data, cacm_rel, cacm_index)

cacm_eval_body = json.dumps(cacm_create)
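
The request body above asks for the `recall` metric with `k` set to 20 and `relevant_rating_threshold` set to 1. As a rough illustration of what this metric measures, and not Elasticsearch's own implementation, recall at k can be sketched in plain Python. `recall_at_k` is a hypothetical helper, and returning 0.0 for a query without relevant documents is an assumption of this sketch.

```python
def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Share of the relevant documents that appear among the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: queries without relevant documents score 0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

# Made-up ranking: only document "3" of the three relevant ones is in the top 3
print(recall_at_k(["3", "9", "4", "1"], ["1", "2", "3"], k=3))
```

In the made-up example one of the three relevant documents is found in the top three results, so the score is one third.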

Print the results of the ranking evaluation API:

In [None]:
cacm_res = es.rank_eval(body=cacm_eval_body, index=cacm_index)

print(json.dumps(cacm_res, indent=4, sort_keys=True))

{
    "details": {
        "Query_1": {
            "hits": [
                {
                    "hit": {
                        "_id": "3078",
                        "_index": "cacm-corpus",
                        "_score": 25.15888,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
                        "_id": "2434",
                        "_index": "cacm-corpus",
                        "_score": 22.287754,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
                        "_id": "2572",
                        "_index": "cacm-corpus",
                        "_score": 17.322393,
                        "_type": "_doc"
                    },
                    "rating": null
                },
                {
                    "hit": {
 