<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/First_Experiment_Stemming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# First Experiment

For our first experiment we connect the notebook to an Elasticsearch instance and compare a standard Elasticsearch operator with two built-in stemming methods: the 'Stemmer Token Filter' and the 'Hunspell Token Filter'.
(To read details about this experiment visit our [website](https://pragmalingu.de/docs/experiments/experiment1))

## Set Up an Elasticsearch Instance in Google Colab

The following cells contain everything needed to connect to Elasticsearch; for a detailed explanation see [this Notebook.](https://)
Download:

In [1]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

Start a local server:

In [2]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# client-side
!pip install elasticsearch -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch(["localhost:9200/"])
#wait a bit
import time
time.sleep(30)
es.ping()  # got True

True

## Initialize Stemmers

### Algorithmic Stemmer

Algorithmic stemmers apply a series of rules to each word to reduce it to its root form.

In this way, they present a few advantages:
1. They require little setup and usually work well out of the box;
2. They use little memory;
3. They are typically faster than dictionary stemmers.

(For more information see the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html#algorithmic-stemmers))
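To make the idea of rule-based stemming concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule list and the `toy_stem` function are hypothetical illustrations; they are not the algorithm that the Elasticsearch `stemmer` filter uses.

```python
# Toy suffix-stripping rules; NOT the algorithm Elasticsearch uses.
SUFFIX_RULES = [
    ("sses", "ss"),   # classes -> class
    ("ies", "i"),     # ponies  -> poni
    ("ing", ""),      # jumping -> jump
    ("ed", ""),       # jumped  -> jump
    ("s", ""),        # cats    -> cat
]

def toy_stem(word, min_stem_len=3):
    """Apply the first matching suffix rule that leaves at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word

for w in ["classes", "ponies", "jumping", "jumped", "cats", "mice"]:
    print(w, "->", toy_stem(w))
```

Note that an irregular form such as "mice" passes through unchanged, because no suffix rule applies to it.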

**Analyzer**

An analyzer contains three lower-level building blocks: character filters, a tokenizer, and token filters.
To apply stemming we first need to configure a custom analyzer that makes use of the stemmer filter. 

(Stemmer filter reference guide at [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html#analysis-stemmer-tokenfilter))
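The three building blocks can be pictured as a small pipeline. The following is a plain-Python sketch under stated assumptions: all function names are hypothetical, the tokenizer is only a rough stand-in for the `standard` tokenizer, and `strip_plural_s` is a placeholder for a real stemmer filter.

```python
import re

def html_strip(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def simple_tokenizer(text):
    """Tokenizer: split into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def strip_plural_s(tokens):
    """Token filter: hypothetical placeholder for a real stemmer."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run the character filters, then the tokenizer, then the token filters in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<b>Stemming</b> Filters", [html_strip], simple_tokenizer,
                 [lowercase, strip_plural_s])
print(tokens)
```

The order of the token filters matters: here the lowercase filter runs before the stemming placeholder, just as `lowercase` is listed before the stemmer in the analyzer configuration.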


In [3]:
#the order of filter and analyzer is arbitrary
stemming_analyzer = {
    "filter" : {
        "eng_stemmer" : {
            "type" : "stemmer",
            "name" : "english"
        }
    },
    "analyzer" : {
        "default" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "eng_stemmer"]
        }
    }
}

#create the correct settings 
stemmer_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": stemming_analyzer
    }
}

### Dictionary Stemmer

Dictionary stemmers look up words in a provided dictionary, replacing unstemmed word variants with stemmed words from the dictionary.

In theory, these are well suited for:
1. Stemming irregular words;
2. Discerning between words that are spelled similarly but not related conceptually.

They also admit a few drawbacks:
1. Highly dependent on dictionary quality;
2. Size and performance, as a result of having to load all words, prefixes and suffixes from the dictionary.

(For more information see the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html#dictionary-stemmers))
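As a contrast to rule-based stemming, a dictionary stemmer can be sketched as a plain lookup. The mini-dictionary and the `dictionary_stem` function below are hypothetical; a real Hunspell dictionary also applies affix rules from its `.aff` file, which this sketch ignores.

```python
# Hypothetical mini-dictionary mapping word forms to stems.
STEM_DICTIONARY = {
    "mice": "mouse",
    "geese": "goose",
    "went": "go",
    "organs": "organ",
    "organization": "organization",  # kept apart from "organ"
}

def dictionary_stem(word, dictionary=STEM_DICTIONARY):
    """Return the dictionary stem, or the lowercased word if it is not listed."""
    key = word.lower()
    return dictionary.get(key, key)

for w in ["Mice", "went", "organs", "organization", "unlisted"]:
    print(w, "->", dictionary_stem(w))
```

The lookup handles irregular forms and keeps similarly spelled but unrelated words apart, but any word missing from the dictionary is left unstemmed, which is why dictionary quality matters.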

**Analyzer**

An analyzer contains three lower-level building blocks: character filters, a tokenizer, and token filters.
To apply dictionary stemming, we first need to configure a custom analyzer that makes use of the Hunspell filter.

(Hunspell filter reference guide at [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hunspell-tokenfilter.html#analysis-hunspell-tokenfilter))

In [4]:
#the order of filter and analyzer is arbitrary
dictionary_analyzer = {
    "filter" : {
        "dictionary_stemmer" : {
          "type" : "hunspell",
          "locale" : "en_US",
          "dedup" : True  #duplicate tokens are removed from the filter’s output
        }
    },
    "analyzer" : {
        "default" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "dictionary_stemmer"]
        }
    }
}

#create the correct settings 
hunspell_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": dictionary_analyzer
    }
}

**Dictionary**

To use the Hunspell token filter for US English, you have to download the Hunspell dictionary for that language and make it accessible to your Elasticsearch instance at 'elasticsearch-7.9.1/config/hunspell/en_US':
(For more information visit the [Elasticsearch Website](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hunspell-tokenfilter.html#analysis-hunspell-tokenfilter-dictionary-config))

In [5]:
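# the 'tree' URLs return cgit's HTML view of a file (the log of the original run reports 'text/html'),
# so the raw files are fetched via the 'plain' URLs instead
!wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff -P /content/elasticsearch-7.9.1/config/hunspell/en_US
!wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic -P /content/elasticsearch-7.9.1/config/hunspell/en_US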

--2020-09-16 06:58:10--  https://cgit.freedesktop.org/libreoffice/dictionaries/tree/en/en_US.aff
Resolving cgit.freedesktop.org (cgit.freedesktop.org)... 131.252.210.161
Connecting to cgit.freedesktop.org (cgit.freedesktop.org)|131.252.210.161|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/content/elasticsearch-7.9.1/config/hunspell/en_US/en_US.aff’

en_US.aff               [ <=>                ]  23.45K  --.-KB/s    in 0.1s    

2020-09-16 06:58:11 (162 KB/s) - ‘/content/elasticsearch-7.9.1/config/hunspell/en_US/en_US.aff’ saved [24012]

--2020-09-16 06:58:11--  https://cgit.freedesktop.org/libreoffice/dictionaries/tree/en/en_US.dic
Resolving cgit.freedesktop.org (cgit.freedesktop.org)... 131.252.210.161
Connecting to cgit.freedesktop.org (cgit.freedesktop.org)|131.252.210.161|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘/content/elasticsearch-7.9.1/config/h
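The wget log reports `Length: unspecified [text/html]`, which suggests that an HTML page rather than a raw dictionary file may have been saved. A minimal sketch of a sanity check follows; the helper name `looks_like_html` is hypothetical, and the directory is the one used in the download cell.

```python
from pathlib import Path

def looks_like_html(path, probe_bytes=512):
    """Return True if the start of the file contains an HTML doctype or <html> tag."""
    head = Path(path).read_bytes()[:probe_bytes].lower()
    return b"<!doctype html" in head or b"<html" in head

# Directory taken from the download cell; files that do not exist are skipped.
dict_dir = Path("/content/elasticsearch-7.9.1/config/hunspell/en_US")
for name in ("en_US.aff", "en_US.dic"):
    candidate = dict_dir / name
    if candidate.exists():
        print(name, "looks like HTML:", looks_like_html(candidate))
```

If either file looks like HTML, the Hunspell filter will most likely not be able to load the dictionary, and the files should be downloaded again in raw form.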

In [6]:
!echo "#indices.analysis.hunspell.dictionary.location : /content/elasticsearch-7.9.1" >> elasticsearch-7.9.1/config/elasticsearch.yml

In [15]:
!ls

ADI.ALL  ADI.QRY  adi.tar.gz	       elasticsearch-7.9.1-linux-x86_64.tar.gz
ADI.BLN  ADI.REL  elasticsearch-7.9.1  sample_data


## Parse Data

Get the different corpora, format them, and feed them to Elasticsearch.

### ADI Corpus

**Parsing**

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/adi/).  <br>
For detailed information about the parsing of this corpus, see [this Notebook](https://colab.research.google.com/github/pragmalingu/private_experiments/blob/adi_corpus/ADICorpus.ipynb); for parsing in general, read [this guide](https://).

In [7]:
# download and unzip data
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/adi/adi.tar.gz
!tar -xf adi.tar.gz

# set paths to the downloaded data as variables
PATH_TO_ADI_TXT = '/content/ADI.ALL'

from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np

# get the text and query files

ID_marker = re.compile(r'\.I')

def get_data(PATH_TO_FILE, marker):
  """
  Reads the file and splits the text into entries at the ID marker '.I'.
  The first entry is empty, so it is removed.
  'marker' contains the regex at which we want to split.
  """
  with open(PATH_TO_FILE, 'r') as f:
    text = f.read().replace('\n', " ")
    lines = re.split(marker, text)
    lines.pop(0)
  return lines

adi_txt_list = get_data(PATH_TO_ADI_TXT, ID_marker)

# process text file

adi_title_start = re.compile(r'\.T')
adi_author_start = re.compile(r'\.A')
adi_text_start = re.compile(r'\.W')

adi_txt_data = defaultdict(dict)

for line in adi_txt_list:
  entries = re.split(adi_title_start,line,1)
  id = entries[0].strip()
  no_id = entries[1]
  if len(re.split(adi_author_start, no_id,1)) > 1:
    no_id_entries = re.split(adi_author_start, no_id,1)
    adi_txt_data[id]['title'] = no_id_entries[0]
    no_title = no_id_entries[1]
    no_title_entries = re.split(adi_text_start, no_title)
    adi_txt_data[id]['author'] = no_title_entries[0]
    adi_txt_data[id]['text'] = no_title_entries[1]
  else:
    no_id_entries = re.split(adi_text_start, no_id)
    adi_txt_data[id]['title'] = no_id_entries[0]
    adi_txt_data[id]['text'] = no_id_entries[1]

--2020-09-16 06:58:25--  http://ir.dcs.gla.ac.uk/resources/test_collections/adi/adi.tar.gz
Resolving ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)... 130.209.240.253
Connecting to ir.dcs.gla.ac.uk (ir.dcs.gla.ac.uk)|130.209.240.253|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17307 (17K) [application/gzip]
Saving to: ‘adi.tar.gz’


2020-09-16 06:58:25 (797 KB/s) - ‘adi.tar.gz’ saved [17307/17307]
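The parsing logic can be tried on a small hand-written sample before running it on the full corpus. The two records below are hypothetical and only mimic the `.I` / `.T` / `.A` / `.W` layout that the parsing cell splits on; `parse_records` is a condensed sketch of that cell which additionally strips whitespace from every field.

```python
import re
from collections import defaultdict

# Hypothetical two-record sample in the '.I / .T / .A / .W' layout.
sample = """.I 1
.T
first title
.A
J SMITH
.W
first text
.I 2
.T
second title
.W
second text
"""

def parse_records(raw):
    """Split on '.I', then on the field markers; the author field is optional."""
    records = defaultdict(dict)
    flat = raw.replace("\n", " ")
    for chunk in re.split(r"\.I", flat)[1:]:  # the first piece is empty
        doc_id, rest = re.split(r"\.T", chunk, maxsplit=1)
        doc_id = doc_id.strip()
        author_split = re.split(r"\.A", rest, maxsplit=1)
        if len(author_split) > 1:  # record has an author field
            title, rest = author_split
            author, text = re.split(r"\.W", rest, maxsplit=1)
            records[doc_id]["author"] = author.strip()
        else:
            title, text = re.split(r"\.W", rest, maxsplit=1)
        records[doc_id]["title"] = title.strip()
        records[doc_id]["text"] = text.strip()
    return dict(records)

parsed = parse_records(sample)
print(parsed["1"])
print(parsed["2"])
```

Because the markers are matched anywhere in the flattened text, a field that happens to contain a sequence such as ".W" would be split in the wrong place; anchoring the patterns to line starts is one way to avoid that.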



**Indexing**

Create an index of the ADI corpus for every test setting and index all the documents. Creating an index fails if an index with the same name already exists.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))
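Because `es.indices.create` raises an error when the index already exists, re-running an indexing cell fails. A minimal sketch of a guard follows, assuming the 7.x Python client used in this notebook; the helper name `ensure_index` is hypothetical.

```python
def ensure_index(client, name, settings=None):
    """Create index `name` unless it already exists; return True if it was created."""
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body=settings)
    return True

# With the client `es` and the settings from the cells above, for example:
# ensure_index(es, "stemmer-adi-corpus", stemmer_settings)
```

This keeps an existing index untouched, so changed analyzer settings only take effect after the old index has been deleted.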

In [26]:
es.indices.delete(adi_index)
es.indices.delete(stemmer_adi_index)

{'acknowledged': True}

In [8]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
adi_index = "adi-corpus"
stemmer_adi_index = "stemmer-adi-corpus"
hunspell_adi_index = "hunspell-adi-corpus"

es.indices.create(adi_index)
es.indices.create(stemmer_adi_index, body=stemmer_settings)
es.indices.create(hunspell_adi_index, body=hunspell_settings)
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in adi_txt_data.items():
  es.index(index=adi_index, id=ID, body=doc_data)
  es.index(index=stemmer_adi_index, id=ID, body=doc_data)
  es.index(index=hunspell_adi_index, id=ID, body=doc_data)

create_response = es.cat.indices()
print(create_response)

TransportError: ignored

### CACM Corpus

**Parsing**

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Benchmarks](https://pragmalingu.de/docs/benchmarks/overview).

In [None]:
# download and unzip data
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/cacm.tar.gz
!tar -xf cacm.tar.gz

# set paths to the downloaded data as variables

PATH_TO_CACM_TXT = '/content/cacm.all'

from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np

# get the text and query files

ID_marker = re.compile(r'^\.I', re.MULTILINE)

def get_data(PATH_TO_FILE, marker):
  """
  Reads the file and splits the text into entries at the ID marker '.I'.
  The first entry is empty, so it is removed.
  'marker' contains the regex at which we want to split.
  """
  with open(PATH_TO_FILE, 'r') as f:
    text = f.read()
    lines = re.split(marker, text)
    lines.pop(0)
  return lines

cacm_txt_list = get_data(PATH_TO_CACM_TXT, ID_marker)

# process text file

# raw strings avoid invalid-escape warnings; the commas are dropped from the character classes because they matched a literal ','
cacm_chunk_title = re.compile(r'\.T\n')
cacm_chunk_txt = re.compile(r'\n\.W\n') # not enough
cacm_chunk_txt_pub = re.compile(r'\.[WB]')
cacm_chunk_publication = re.compile(r'\.B\n')
cacm_chunk_author = re.compile(r'^\.A\n', re.MULTILINE)
cacm_chunk_author_add_cross = re.compile(r'^\.[ANX]\n', re.MULTILINE) # not enough
cacm_chunk_add_cross = re.compile(r'\.[BNX]\n')


cacm_txt_data = defaultdict(dict)

for line in cacm_txt_list:
  entries= re.split(cacm_chunk_title,line)
  id = entries[0].strip() #save id
  no_id = entries[1]

  if len(re.split(cacm_chunk_txt, no_id)) == 2: # is there text
    no_id_entries = re.split(cacm_chunk_txt_pub, no_id,1)
    cacm_txt_data[id]['title'] = no_id_entries[0].strip() # save title
    cacm_txt_data[id]['text'] = no_id_entries[1].strip() # save text
    no_title_txt = no_id_entries[1]

    if len(re.split(cacm_chunk_author, no_title_txt)) == 2: # is there an author?
      no_title_entries = re.split(cacm_chunk_author_add_cross, no_title_txt)
      cacm_txt_data[id]['publication_date'] = no_title_entries[0].strip() # save publication date
      cacm_txt_data[id]['author'] = no_title_entries[1].strip() # save author
      cacm_txt_data[id]['add_date'] = no_title_entries[2].strip() # save add date
      cacm_txt_data[id]['cross-references'] = no_title_entries[3].strip() # save cross-references

    else:
      no_title_entries = re.split(cacm_chunk_publication, no_title_txt)
      cacm_txt_data[id]['publication_date'] = no_title_entries[0].strip() # save publication date
      cacm_txt_data[id]['add_date'] = no_title_entries[1].strip() # save add date
      cacm_txt_data[id]['cross-references'] = no_title_entries[1].strip() # save cross-references

  else:
    no_id_entries = re.split(cacm_chunk_publication, no_id,1)
    cacm_txt_data[id]['title'] = no_id_entries[0].strip() # save title
    no_title = no_id_entries[1]

    if len(re.split(cacm_chunk_author, no_title,1)) == 2: # is there an author?
      no_title_entries = re.split(cacm_chunk_author_add_cross, no_title)
      cacm_txt_data[id]['publication_date'] = no_title_entries[0].strip() # save publication date
      cacm_txt_data[id]['author'] = no_title_entries[1].strip() # save author
      cacm_txt_data[id]['add_date'] = no_title_entries[2].strip() # save add date
      cacm_txt_data[id]['cross-references'] = no_title_entries[3].strip() # save cross-references

    else:
      no_title_entries = re.split(cacm_chunk_add_cross, no_title)
      cacm_txt_data[id]['publication_date'] = no_title_entries[0].strip() # save publication date
      cacm_txt_data[id]['add_date'] = no_title_entries[1].strip() # save add date
      cacm_txt_data[id]['cross-references'] = no_title_entries[2].strip() # save cross-references

**Indexing**

Create an index of the CACM corpus for every test setting and index all the documents. Creating an index fails if an index with the same name already exists.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))
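The indexing cells send one request per document and index. For larger corpora, the bulk helper of the Python client is a possible alternative. The sketch below only builds the actions; the generator name `bulk_actions` and the demo data are hypothetical, and the actual `helpers.bulk` call is left as a comment because it needs a running cluster.

```python
def bulk_actions(index_name, docs):
    """Yield one bulk action per document in an {id: fields} mapping."""
    for doc_id, fields in docs.items():
        yield {"_index": index_name, "_id": doc_id, "_source": fields}

# Hypothetical stand-in for a parsed corpus dictionary.
demo_docs = {"1": {"title": "a", "text": "b"}, "2": {"title": "c", "text": "d"}}
actions = list(bulk_actions("demo-corpus", demo_docs))
print(actions[0])

# With a running cluster and the client `es` from the setup cell:
# from elasticsearch import helpers
# helpers.bulk(es, bulk_actions("demo-corpus", demo_docs))
```

A generator is used so that the actions for a large corpus do not all have to be held in memory at once.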

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
cacm_index = "cacm-corpus"
stemmer_cacm_index = "stemmer-cacm-corpus"
hunspell_cacm_index = "hunspell-cacm-corpus"

es.indices.create(cacm_index)
es.indices.create(stemmer_cacm_index, body=stemmer_settings)
es.indices.create(hunspell_cacm_index, body=hunspell_settings)
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in cacm_txt_data.items():
  es.index(index=cacm_index, id=ID, body=doc_data)
  es.index(index=stemmer_cacm_index, id=ID, body=doc_data)
  es.index(index=hunspell_cacm_index, id=ID, body=doc_data)

create_response = es.cat.indices()
print(create_response)

### CISI Corpus

**Parsing** 

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/).  <br>
For detailed information about the parsing of this corpus, see [this Notebook](https://colab.research.google.com/github/pragmalingu/private_experiments/blob/cisi_corpus/CISICorpus.ipynb); for parsing in general, read [this guide](https://).

In [None]:

# download and unzip data
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/cisi.tar.gz
!tar -xf cisi.tar.gz

# set paths to the downloaded data as variables
PATH_TO_CISI_TXT = '/content/CISI.ALL'

from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np

# get the text file

ID_marker = re.compile(r'^\.I', re.MULTILINE)

def get_data(PATH_TO_FILE, marker):
  """
  Reads the file and splits the text into entries at the ID marker '.I'.
  The first entry is empty, so it is removed.
  'marker' contains the regex at which we want to split.
  """
  with open(PATH_TO_FILE, 'r') as f:
    text = f.read()
    lines = re.split(marker, text)
    lines.pop(0)
  return lines

cisi_txt_list = get_data(PATH_TO_CISI_TXT, ID_marker)

# process text file

cisi_title_start = re.compile(r'\n\.T')
cisi_author_start = re.compile(r'\n\.A')
cisi_date_start = re.compile(r'\n\.B')
cisi_text_start = re.compile(r'\n\.W')
cisi_cross_start = re.compile(r'\n\.X')

cisi_txt_data = defaultdict(dict)

for line in cisi_txt_list:
  entries = re.split(cisi_title_start,line,1)
  id = entries[0].strip()#save the id
  no_id = entries[1] 
  
  if len(re.split(cisi_author_start, no_id)) >= 2: # is there at least one author field?
    no_id_entries = re.split(cisi_author_start, no_id,1)
    cisi_txt_data[id]['title'] = no_id_entries[0].strip() # save title
    no_title = no_id_entries[1]

    if len(re.split(cisi_date_start, no_title)) > 1: # is there a publication date?
      no_title_entries = re.split(cisi_date_start, no_title)
      cisi_txt_data[id]['author'] = no_title_entries[0].strip() # save author
      no_author = no_title_entries[1]
      no_author_entries = re.split(cisi_text_start, no_author)
      cisi_txt_data[id]['publication_date'] = no_author_entries[0].strip() # save publication date
      no_author_date = no_author_entries[1]
    else:
      no_title_entries = re.split(cisi_text_start, no_title)
      cisi_txt_data[id]['author'] = no_title_entries[0].strip() # save author
      no_author_date = no_title_entries[1]

  else:
    no_id_entries = re.split(cisi_author_start, no_id)
    cisi_txt_data[id]['title'] = no_id_entries[0].strip() # save title
    cisi_txt_data[id]['author'] = no_id_entries[1].strip() # save first author
    no_title_entries = re.split(cisi_text_start, no_title)
    cisi_txt_data[id]['author'] += ','+no_title_entries[0].strip() # save second author
    no_author_date = no_title_entries[1]

  last_entries = re.split(cisi_cross_start, no_author_date)
  cisi_txt_data[id]['text'] = last_entries[0].strip() # save text
  cisi_txt_data[id]['cross-references'] = last_entries[1].strip() # save cross-references

**Indexing**

Create an index of the CISI corpus for every test setting and index all the documents. Creating an index fails if an index with the same name already exists.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
cisi_index = "cisi-corpus"
stemmer_cisi_index = "stemmer-cisi-corpus"
hunspell_cisi_index = "hunspell-cisi-corpus"

es.indices.create(cisi_index)
es.indices.create(stemmer_cisi_index, body=stemmer_settings)
es.indices.create(hunspell_cisi_index, body=hunspell_settings)
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in cisi_txt_data.items():
  es.index(index=cisi_index, id=ID, body=doc_data)
  es.index(index=stemmer_cisi_index, id=ID, body=doc_data)
  es.index(index=hunspell_cisi_index, id=ID, body=doc_data)

create_response = es.cat.indices()
print(create_response)

### Cranfield Corpus

**Parsing**

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/cran/).  <br>
For detailed information about the parsing of this corpus, see [this Notebook](https://colab.research.google.com/github/pragmalingu/private_experiments/blob/cranfield_corpus/CranfieldCorpus.ipynb); for parsing in general, read [this guide](https://).

In [None]:
# download and unzip data
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/cran/cran.tar.gz
!tar -xf cran.tar.gz

# set paths to the downloaded data as variables
PATH_TO_CRAN_TXT = '/content/cran.all.1400'

from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np

# get the text entries from the text and query files

ID_marker = re.compile(r'\.I')

def get_data(PATH_TO_FILE, marker):
  """
  Reads the file and splits the text into entries at the ID marker '.I'.
  The first entry is empty, so it is removed.
  'marker' contains the regex at which we want to split.
  """
  with open(PATH_TO_FILE, 'r') as f:
    text = f.read().replace('\n', " ")
    lines = re.split(marker, text)
    lines.pop(0)
  return lines

cran_txt_list = get_data(PATH_TO_CRAN_TXT, ID_marker)

# process text file

cran_chunk_start = re.compile(r'\.[ABTW]')  # no commas: inside a character class they match a literal ','
cran_txt_data = defaultdict(dict)

for line in cran_txt_list:
  entries= re.split(cran_chunk_start,line)
  id = entries[0].strip()
  title = entries[1]
  author = entries[2]
  publication_date = entries[3]
  text = entries[4]
  cran_txt_data[id]['title'] = title
  cran_txt_data[id]['author'] = author
  cran_txt_data[id]['publication_date'] = publication_date
  cran_txt_data[id]['text'] = text

**Indexing**

Create an index of the Cranfield corpus for every test setting and index all the documents. Creating an index fails if an index with the same name already exists.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
cranfield_index = "cranfield-corpus"
stemmer_cranfield_index = "stemmer-cranfield-corpus"
hunspell_cranfield_index = "hunspell-cranfield-corpus"

es.indices.create(cranfield_index)
es.indices.create(stemmer_cranfield_index, body=stemmer_settings)
es.indices.create(hunspell_cranfield_index, body=hunspell_settings)
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in cran_txt_data.items():  # the parsing cell stores the documents in cran_txt_data
  es.index(index=cranfield_index, id=ID, body=doc_data)
  es.index(index=stemmer_cranfield_index, id=ID, body=doc_data)
  es.index(index=hunspell_cranfield_index, id=ID, body=doc_data)

create_response = es.cat.indices()
print(create_response)

### LISA Corpus

**Parsing**

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/lisa/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Benchmarks](https://pragmalingu.de/docs/benchmarks/overview).

In [None]:
# download and unzip data
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/lisa/lisa.tar.gz
!tar -xf lisa.tar.gz

# set paths to the downloaded data as variables

PATH_TO_LISA_TXT = '/content/'

from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np
import os

# get the text files

file_regex = re.compile('LISA[0-5]')
lisa_files = [i for i in os.listdir(PATH_TO_LISA_TXT) if os.path.isfile(os.path.join(PATH_TO_LISA_TXT,i)) and re.match(file_regex,i)]

txt_entry_marker = re.compile(r'\*{44}', re.MULTILINE)

def get_data(PATH_TO_FILES, marker):
  """
  Reads one file and splits the text into entries at the entry marker.
  The 'marker' contains the regex at which we want to split.
  Pops the last element since it's empty.
  """
  with open(PATH_TO_FILES, 'r') as f:
    text = f.read().replace('     ','')
    lines = re.split(marker,text)
    lines.pop()
  return lines

lisa_txt_list = []
for name in lisa_files: 
  lisa_txt_list.extend(get_data(PATH_TO_LISA_TXT+name, txt_entry_marker))

# process text file

doc_strip = re.compile('\n?Document {1,2}')

lisa_txt_list_stripped = []
lisa_txt_data = defaultdict(dict)

for el in lisa_txt_list:
  lisa_txt_list_stripped.append(re.sub(doc_strip,'', el))

for entry in lisa_txt_list_stripped:
  parts = entry.split('\n')
  empty_index = parts.index('')
  ID = parts[0]
  title = parts[1:empty_index]
  text = parts[empty_index+1:]
  lisa_txt_data[ID]['title'] = title
  lisa_txt_data[ID]['text'] = text

**Indexing**

Create an index of the LISA corpus for every test setting and index all the documents. Creating an index fails if an index with the same name already exists.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
lisa_index = "lisa-corpus"
stemmer_lisa_index = "stemmer-lisa-corpus"
hunspell_lisa_index = "hunspell-lisa-corpus"

es.indices.create(lisa_index)
es.indices.create(stemmer_lisa_index, body=stemmer_settings)
es.indices.create(hunspell_lisa_index, body=hunspell_settings)
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in lisa_txt_data.items():
  es.index(index=lisa_index, id=ID, body=doc_data)
  es.index(index=stemmer_lisa_index, id=ID, body=doc_data)
  es.index(index=hunspell_lisa_index, id=ID, body=doc_data)

create_response = es.cat.indices()
print(create_response)

### Medline Corpus

**Parsing**

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/med/).  <br>
For detailed information about the parsing of this corpus, see [this Notebook](https://colab.research.google.com/github/pragmalingu/private_experiments/blob/medline_corpus/MedlineCorpus.ipynb); for parsing in general, read [this guide](https://).

In [None]:
# download and unzip data
!wget http://ir.dcs.gla.ac.uk/resources/test_collections/medl/med.tar.gz
!tar -xf med.tar.gz

# set paths to the downloaded data as variables
PATH_TO_MED_TXT = '/content/MED.ALL'

from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np

# get the text and query files

ID_marker = re.compile(r'\.I')

def get_data(PATH_TO_FILE, marker):
  """
  Reads the file and splits the text into entries at the ID marker '.I'.
  The first entry is empty, so it is removed.
  'marker' contains the regex at which we want to split.
  """
  with open(PATH_TO_FILE, 'r') as f:
    text = f.read().replace('\n', " ")
    lines = re.split(marker, text)
    lines.pop(0)
  return lines

med_txt_list = get_data(PATH_TO_MED_TXT, ID_marker)

# process the text file

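chunk_start = re.compile(r'\.W')
med_txt_data = defaultdict(dict)

def fill_dictionary(dictionary, chunk_list, marker, key_name):
  # get_data() already dropped the empty first entry, so chunk_list[0] is document 1;
  # starting at index 1 would skip it and shift every ID by one
  for n, line in enumerate(chunk_list, start=1):
    _ , chunk = re.split(marker, line, maxsplit=1)
    dictionary[n][key_name] = chunk.strip()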

fill_dictionary(med_txt_data, med_txt_list, chunk_start, 'text')

**Indexing**

Create an index of the Medline corpus for every test setting and index all the documents. Creating an index fails if an index with the same name already exists.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
medline_index = "medline-corpus"
stemmer_medline_index = "stemmer-medline-corpus"
hunspell_medline_index = "hunspell-medline-corpus"

es.indices.create(medline_index)
es.indices.create(stemmer_medline_index, body=stemmer_settings)
es.indices.create(hunspell_medline_index, body=hunspell_settings)
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in med_txt_data.items():  # the parsing cell stores the documents in med_txt_data
  es.index(index=medline_index, id=ID, body=doc_data)
  es.index(index=stemmer_medline_index, id=ID, body=doc_data)
  es.index(index=hunspell_medline_index, id=ID, body=doc_data)

create_response = es.cat.indices()
print(create_response)

### NPL Corpus

**Parsing**

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/npl/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Benchmarks](https://pragmalingu.de/docs/benchmarks/overview).

In [None]:
# download and unzip data

!wget http://ir.dcs.gla.ac.uk/resources/test_collections/npl/npl.tar.gz
!tar -xf npl.tar.gz

# set paths to the downloaded data as variables

PATH_TO_NPL_TXT = '/content/doc-text'

from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np
import os


# get the text, query and rel files

txt_entry_marker = re.compile('\n   /\n')

def get_data(PATH_TO_FILES, marker):
  """
  Reads the file and splits the text into entries at the entry marker.
  The 'marker' contains the regex at which we want to split.
  Pops the last element since it's empty.
  """
  with open(PATH_TO_FILES, 'r') as f:
    text = f.read()
    lines = re.split(marker,text)
    lines.pop()
  return lines

npl_txt_list = get_data(PATH_TO_NPL_TXT, txt_entry_marker)

# process the text file

npl_txt_data = defaultdict(dict)

for entry in npl_txt_list:
  splitted = entry.split('\n')
  splitted = list(filter(None, splitted))
  ID = splitted[0]
  text = ' '.join(map(str, splitted[1:]))
  npl_txt_data[ID]['text'] = text

**Indexing**

Create an index of the NPL corpus for every test setting and index all the documents. Creating an index fails if an index with the same name already exists.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
npl_index = "npl-corpus"
stemmer_npl_index = "stemmer-npl-corpus"
hunspell_npl_index = "hunspell-npl-corpus"

es.indices.create(npl_index)
es.indices.create(stemmer_npl_index, body=stemmer_settings)
es.indices.create(hunspell_npl_index, body=hunspell_settings)
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in npl_txt_data.items():
  es.index(index=npl_index, id=ID, body=doc_data)
  es.index(index=stemmer_npl_index, id=ID, body=doc_data)
  es.index(index=hunspell_npl_index, id=ID, body=doc_data)

create_response = es.cat.indices()
print(create_response)

### Time Corpus

**Parsing**

You can get the corpus from [this link](http://ir.dcs.gla.ac.uk/resources/test_collections/time/).  <br>
For detailed information about the format of the files, see the PragmaLingu [Benchmarks](https://pragmalingu.de/docs/benchmarks/overview).

In [None]:
# download and unzip data

!wget http://ir.dcs.gla.ac.uk/resources/test_collections/time/time.tar.gz
!tar -xf time.tar.gz

# set paths to the downloaded data as variables

PATH_TO_TIME_TXT = '/content/TIME.ALL'

from collections import defaultdict
import re
import json
from io import StringIO
import numpy as np
import os

# get the text and query file

txt_entry_marker = re.compile(r'\*TEXT')

def get_data(PATH_TO_FILES, marker):
  """
  Reads the file and splits the text into entries at the entry marker.
  The 'marker' contains the regex at which we want to split.
  Pops the first element since it's empty.
  """
  with open(PATH_TO_FILES, 'r') as f:
    text = f.read()
    lines = re.split(marker,text)
    lines.pop(0)
  return lines

time_txt_list = get_data(PATH_TO_TIME_TXT, txt_entry_marker)

# process text file

page_split = re.compile(r'PAGE \d{3}')

time_txt_data = defaultdict(dict)
ID = 1
for entry in time_txt_list:
  splitted = re.split(page_split,entry)
  time_txt_data[ID]['text'] = splitted[1]
  ID += 1
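
To make the two splitting steps easier to follow, here is a small sketch that runs them on an in-memory string instead of `TIME.ALL`. The sample text is made up and only mimics the assumed layout (a `*TEXT` header line that ends in `PAGE nnn`, followed by the article body). The `.strip()` call is an addition that the cell above does not perform:

```python
import re

# Made-up two-entry sample that mimics the assumed layout of TIME.ALL.
sample = (
    "*TEXT 017 01/04/63 PAGE 020\n\nFirst article body.\n\n"
    "*TEXT 018 01/04/63 PAGE 021\n\nSecond article body.\n"
)

entry_marker = re.compile(r'\*TEXT')
page_marker = re.compile(r'PAGE \d{3}')

# Split into entries; the text before the first marker is empty, so drop it.
entries = re.split(entry_marker, sample)
entries.pop(0)

docs = {}
for doc_id, entry in enumerate(entries, start=1):
    # Everything after the 'PAGE nnn' token is the article text.
    header, body = re.split(page_marker, entry, maxsplit=1)
    docs[doc_id] = {'text': body.strip()}

print(docs)
```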

**Indexing**

Create an index of the Time corpus for every test setting and index all the documents. Index creation only succeeds if an index with that name doesn't exist yet.

(For more information see the [Elasticsearch documentation](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create))

In [None]:
#create index, see https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.create
time_index = "time-corpus"
stemmer_time_index = "stemming-time-corpus"  # must match the index name used in the evaluation cells
hunspell_time_index = "hunspell-time-corpus"

es.indices.create(time_index)
es.indices.create(stemmer_time_index, body=stemmer_settings)
es.indices.create(hunspell_time_index, body=hunspell_settings)
#index document, see https://elasticsearch-py.readthedocs.io/en/master/#example-usage
for ID, doc_data in time_txt_data.items():
  es.index(index=time_index, id=ID, body=doc_data)
  es.index(index=stemmer_time_index, id=ID, body=doc_data)
  es.index(index=hunspell_time_index, id=ID, body=doc_data)

create_response = es.cat.indices()
print(create_response)
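
Indexing document by document sends one request per document and per index. As a possible alternative, the documents could be turned into bulk actions. The sketch below only builds the actions and does not talk to a server; `bulk_actions` is a hypothetical helper and the sample documents are made up. The resulting dictionaries follow the action shape that `elasticsearch.helpers.bulk` accepts:

```python
def bulk_actions(index_name, docs):
    """Yield one bulk action per document for the given index."""
    for doc_id, doc_data in docs.items():
        yield {"_index": index_name, "_id": doc_id, "_source": doc_data}


# Made-up documents in the same shape as time_txt_data.
sample_docs = {1: {"text": "first"}, 2: {"text": "second"}}
actions = list(bulk_actions("time-corpus", sample_docs))
print(actions[0])

# With a live client this could be used as (not executed here):
#   from elasticsearch import helpers
#   helpers.bulk(es, bulk_actions(time_index, time_txt_data))
```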

## Evaluation

Since the data is formatted, we can now feed it to the [Elasticsearch Ranking Evaluation API](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html).

### Recall

In this section we only evaluate the Recall scores.
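
Recall at k is the share of all relevant documents that show up among the first k results. The following sketch computes it by hand for a made-up result list. `recall_at_k` is a hypothetical helper for illustration, not the Elasticsearch implementation, and returning 0.0 when no relevant documents are given is a choice made for this sketch:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=20):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


retrieved = ["3", "7", "1", "9", "4"]  # ranked result list (made up)
relevant = ["1", "2", "3", "4"]        # relevance judgments (made up)
score = recall_at_k(retrieved, relevant, k=3)
print(score)  # 0.5: two of the four relevant documents are in the top 3
```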

**Multi Match Query**

Here we evaluate the data with the ["multi_match"](https://pragmalingu.de/docs/experiments/experiment1#standard-elasticsearch) option of Elasticsearch:

In [None]:
#use rank eval api, see https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval 
#and https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html 

import json
from collections import defaultdict

cran_index = 'cranfield-corpus'
cran_stem_index = 'stemming-cranfield-corpus'
med_index = 'medline-corpus'
adi_index = 'adi-corpus'
cisi_index = 'cisi-corpus'
cacm_index = 'cacm-corpus'
lisa_index = 'lisa-corpus'
time_index = 'time-corpus'
npl_index = 'npl-corpus'

#function to get normal match evaluation body 
def create_query_body_match_recall(query_dict, rel_dict, index_name):
  """
  Creates a request for every query in query_dict and gives every relevant document listed in rel_dict a rating of 1.
  The index name has to be the same as the index of the documents you're looking at.
  Returns an evaluation body for the Elasticsearch Ranking Evaluation API.
  """
  eval_body = {
      "requests": '',
      "metric": {
          "recall": {
              "relevant_rating_threshold": 1,
              "k": 20
              }
      }
  }
  requests = [] 
  current_request = defaultdict(lambda: defaultdict())
  current_rel = {"_index": index_name, "_id": '', "rating": int}
  for query_ID, query_txt in query_dict.items():
    current_query = {"query": { "multi_match": { "query": '' , "fields" : ["title","text"]}}}
    current_query["query"]["multi_match"]["query"] = query_txt['question']
    current_request["id"] = 'Query_'+str(query_ID)
    current_request["request"] = current_query.copy()
    current_request["ratings"] = [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]]
    requests.append(current_request.copy())
  eval_body["requests"] = requests
  return eval_body

#Cranfield
cran_create_match_recall = create_query_body_match_recall(cran_qry_data, cran_rel, cran_index)
cran_eval_body_match_recall = json.dumps(cran_create_match_recall)
cran_res_match_recall = es.rank_eval(cran_eval_body_match_recall, cran_index)
#print(json.dumps(cran_res, indent=4, sort_keys=True))

#Medline
med_create_match_recall = create_query_body_match_recall(med_qry_data, med_rel, med_index)
med_eval_body_match_recall = json.dumps(med_create_match_recall)
med_res_match_recall = es.rank_eval(med_eval_body_match_recall, med_index)
#print(json.dumps(med_res, indent=4, sort_keys=True))

#ADI
adi_create_match_recall = create_query_body_match_recall(adi_qry_data, adi_rel, adi_index)
adi_eval_body_match_recall = json.dumps(adi_create_match_recall)
adi_res_match_recall = es.rank_eval(adi_eval_body_match_recall, adi_index)
#print(json.dumps(adi_create_match_recall, indent=4, sort_keys=True))

#CISI
cisi_create_match_recall = create_query_body_match_recall(cisi_qry_data, cisi_rel, cisi_index)
cisi_eval_body_match_recall = json.dumps(cisi_create_match_recall)
cisi_res_match_recall = es.rank_eval(cisi_eval_body_match_recall, cisi_index)
#print(json.dumps(cisi_res, indent=4, sort_keys=True))

#CACM
cacm_create_match_recall = create_query_body_match_recall(cacm_qry_data, cacm_rel, cacm_index)
cacm_eval_body_match_recall = json.dumps(cacm_create_match_recall)
cacm_res_match_recall = es.rank_eval(cacm_eval_body_match_recall,cacm_index)
#print(json.dumps(cacm_res, indent=4, sort_keys=True))

#LISA
lisa_create_match_recall = create_query_body_match_recall(lisa_qry_data, lisa_rel, lisa_index)
lisa_eval_body_match_recall = json.dumps(lisa_create_match_recall)
lisa_res_match_recall = es.rank_eval(lisa_eval_body_match_recall,lisa_index)
#print(json.dumps(lisa_res, indent=4, sort_keys=True))

#TIME
time_create_match_recall = create_query_body_match_recall(time_qry_data, time_rel, time_index)
time_eval_body_match_recall = json.dumps(time_create_match_recall)
time_res_match_recall = es.rank_eval(time_eval_body_match_recall,time_index)
#print(json.dumps(time_res, indent=4, sort_keys=True))

#NPL
npl_create_match_recall = create_query_body_match_recall(npl_qry_data, npl_rel, npl_index)
npl_eval_body_match_recall = json.dumps(npl_create_match_recall)
npl_res_match_recall = es.rank_eval(npl_eval_body_match_recall,npl_index)
#print(json.dumps(npl_res, indent=4, sort_keys=True))
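
The body-building functions in this notebook differ only in the metric name, so they could be folded into one function. The following is a sketch under that assumption, not the code the experiment was run with. `create_rank_eval_body` and its `metric` and `k` parameters are hypothetical, and the sample query and judgments are made up:

```python
def create_rank_eval_body(query_dict, rel_dict, index_name, metric="recall", k=20):
    """Build a rank-eval body with one multi_match request per query."""
    requests = []
    for query_id, query in query_dict.items():
        requests.append({
            "id": "Query_" + str(query_id),
            "request": {
                "query": {
                    "multi_match": {
                        "query": query["question"],
                        "fields": ["title", "text"],
                    }
                }
            },
            # Every judged document is rated 1 (relevant).
            "ratings": [
                {"_index": index_name, "_id": str(doc_id), "rating": 1}
                for doc_id in rel_dict[query_id]
            ],
        })
    return {
        "requests": requests,
        "metric": {metric: {"k": k, "relevant_rating_threshold": 1}},
    }


# Made-up query and relevance judgments in the shape the notebook uses.
sample_queries = {1: {"question": "aerodynamic heating"}}
sample_rels = {1: [12, 51]}
body = create_rank_eval_body(sample_queries, sample_rels, "cranfield-corpus", metric="precision")
print(body["metric"])
```

Building a fresh dictionary per query also avoids reusing one shared request object across loop iterations.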

**Stemmer Token Filter**

Here we evaluate the data with the ["stemmer token filter"](https://pragmalingu.de/docs/experiments/experiment1#stemmer-token-filter) option of Elasticsearch:

In [None]:
#use rank eval api, see https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval 
#and https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval

import json
from collections import defaultdict

cran_index = 'stemming-cranfield-corpus'
med_index = 'stemming-medline-corpus'
adi_index = 'stemming-adi-corpus'
cisi_index = 'stemming-cisi-corpus'
cacm_index = 'stemming-cacm-corpus'
lisa_index = 'stemming-lisa-corpus'
time_index = 'stemming-time-corpus'
npl_index = 'stemming-npl-corpus'

#function to get normal match evaluation body 
def create_query_body_stemming_recall(query_dict, rel_dict, index_name):
  """
  Creates a request for every query in query_dict and gives every relevant document listed in rel_dict a rating of 1.
  The index name has to be the same as the index of the documents you're looking at.
  Returns an evaluation body for the Elasticsearch Ranking Evaluation API.
  """
  eval_body = {
      "requests":'',
      "metric": {
          "recall": {
              "k": 20,
              "relevant_rating_threshold": 1
              }
      }
  }
  requests = [] 
  current_request = defaultdict(lambda: defaultdict())
  current_rel = {"_index": index_name, "_id": '', "rating": int}
  for query_ID, query_txt in query_dict.items():
    current_query = {"query": { "multi_match": { "query": '' , "fields" : ["title","text"]}}}
    current_query["query"]["multi_match"]["query"] = query_txt['question']
    current_request["id"] = 'Query_'+str(query_ID)
    current_request["request"] = current_query.copy()
    current_request["ratings"] = [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]]
    requests.append(current_request.copy())
  eval_body["requests"] = requests
  return eval_body

#Cranfield
cran_create_stemming_recall = create_query_body_stemming_recall(cran_qry_data, cran_rel, cran_index)
cran_eval_body_stemming_recall = json.dumps(cran_create_stemming_recall)
cran_res_stemming_recall = es.rank_eval(cran_eval_body_stemming_recall, cran_index)
#print(json.dumps(cran_res, indent=4, sort_keys=True))

#Medline
med_create_stemming_recall = create_query_body_stemming_recall(med_qry_data, med_rel, med_index)
med_eval_body_stemming_recall = json.dumps(med_create_stemming_recall)
med_res_stemming_recall = es.rank_eval(med_eval_body_stemming_recall, med_index)
#print(json.dumps(med_res, indent=4, sort_keys=True))

#ADI
adi_create_stemming_recall = create_query_body_stemming_recall(adi_qry_data, adi_rel, adi_index)
adi_eval_body_stemming_recall = json.dumps(adi_create_stemming_recall)
adi_res_stemming_recall = es.rank_eval(adi_eval_body_stemming_recall, adi_index)
#print(json.dumps(adi_res, indent=4, sort_keys=True))

#CISI
cisi_create_stemming_recall = create_query_body_stemming_recall(cisi_qry_data, cisi_rel, cisi_index)
cisi_eval_body_stemming_recall = json.dumps(cisi_create_stemming_recall)
cisi_res_stemming_recall = es.rank_eval(cisi_eval_body_stemming_recall, cisi_index)
#print(json.dumps(cisi_res, indent=4, sort_keys=True))

#CACM
cacm_create_stemming_recall = create_query_body_stemming_recall(cacm_qry_data, cacm_rel, cacm_index)
cacm_eval_body_stemming_recall = json.dumps(cacm_create_stemming_recall)
cacm_res_stemming_recall = es.rank_eval(cacm_eval_body_stemming_recall,cacm_index)
#print(json.dumps(cacm_res, indent=4, sort_keys=True))

#LISA
lisa_create_stemming_recall = create_query_body_stemming_recall(lisa_qry_data, lisa_rel, lisa_index)
lisa_eval_body_stemming_recall = json.dumps(lisa_create_stemming_recall)
lisa_res_stemming_recall = es.rank_eval(lisa_eval_body_stemming_recall,lisa_index)
#print(json.dumps(lisa_res, indent=4, sort_keys=True))

#TIME
time_create_stemming_recall = create_query_body_stemming_recall(time_qry_data, time_rel, time_index)
time_eval_body_stemming_recall = json.dumps(time_create_stemming_recall)
time_res_stemming_recall = es.rank_eval(time_eval_body_stemming_recall,time_index)
#print(json.dumps(time_res, indent=4, sort_keys=True))

#NPL
npl_create_stemming_recall = create_query_body_stemming_recall(npl_qry_data, npl_rel, npl_index)
npl_eval_body_stemming_recall = json.dumps(npl_create_stemming_recall)
npl_res_stemming_recall = es.rank_eval(npl_eval_body_stemming_recall,npl_index)
#print(json.dumps(npl_res, indent=4, sort_keys=True))

**Hunspell Token Filter**

Here we evaluate the data with the ["hunspell token filter"](https://pragmalingu.de/docs/experiments/experiment1#hunspell-token-filter) option of Elasticsearch:

In [None]:
#use rank eval api, see https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval 
#and https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval

import json
from collections import defaultdict

cran_index = 'hunspell-cranfield-corpus'
med_index = 'hunspell-medline-corpus'
adi_index = 'hunspell-adi-corpus'
cisi_index = 'hunspell-cisi-corpus'
cacm_index = 'hunspell-cacm-corpus'
lisa_index = 'hunspell-lisa-corpus'
time_index = 'hunspell-time-corpus'
npl_index = 'hunspell-npl-corpus'

#function to get normal match evaluation body 
def create_query_body_hunspell_recall(query_dict, rel_dict, index_name):
  """
  Creates a request for every query in query_dict and gives every relevant document listed in rel_dict a rating of 1.
  The index name has to be the same as the index of the documents you're looking at.
  Returns an evaluation body for the Elasticsearch Ranking Evaluation API.
  """
  eval_body = {
      "requests":'',
      "metric": {
          "recall": {
              "k": 20,
              "relevant_rating_threshold": 1
              }
      }
  }
  requests = [] 
  current_request = defaultdict(lambda: defaultdict())
  current_rel = {"_index": index_name, "_id": '', "rating": int}
  for query_ID, query_txt in query_dict.items():
    current_query = {"query": { "multi_match": { "query": '' , "fields" : ["title","text"]}}}
    current_query["query"]["multi_match"]["query"] = query_txt['question']
    current_request["id"] = 'Query_'+str(query_ID)
    current_request["request"] = current_query.copy()
    current_request["ratings"] = [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]]
    requests.append(current_request.copy())
  eval_body["requests"] = requests
  return eval_body

#Cranfield
cran_create_hunspell_recall = create_query_body_hunspell_recall(cran_qry_data, cran_rel, cran_index)
cran_eval_body_hunspell_recall = json.dumps(cran_create_hunspell_recall)
cran_res_hunspell_recall = es.rank_eval(cran_eval_body_hunspell_recall, cran_index)
#print(json.dumps(cran_res, indent=4, sort_keys=True))

#Medline
med_create_hunspell_recall = create_query_body_hunspell_recall(med_qry_data, med_rel, med_index)
med_eval_body_hunspell_recall = json.dumps(med_create_hunspell_recall)
med_res_hunspell_recall = es.rank_eval(med_eval_body_hunspell_recall, med_index)
#print(json.dumps(med_res, indent=4, sort_keys=True))

#ADI
adi_create_hunspell_recall = create_query_body_hunspell_recall(adi_qry_data, adi_rel, adi_index)
adi_eval_body_hunspell_recall = json.dumps(adi_create_hunspell_recall)
adi_res_hunspell_recall = es.rank_eval(adi_eval_body_hunspell_recall, adi_index)
#print(json.dumps(adi_res, indent=4, sort_keys=True))

#CISI
cisi_create_hunspell_recall = create_query_body_hunspell_recall(cisi_qry_data, cisi_rel, cisi_index)
cisi_eval_body_hunspell_recall = json.dumps(cisi_create_hunspell_recall)
cisi_res_hunspell_recall = es.rank_eval(cisi_eval_body_hunspell_recall, cisi_index)
#print(json.dumps(cisi_res, indent=4, sort_keys=True))

#CACM
cacm_create_hunspell_recall = create_query_body_hunspell_recall(cacm_qry_data, cacm_rel, cacm_index)
cacm_eval_body_hunspell_recall = json.dumps(cacm_create_hunspell_recall)
cacm_res_hunspell_recall = es.rank_eval(cacm_eval_body_hunspell_recall,cacm_index)
#print(json.dumps(cacm_res, indent=4, sort_keys=True))

#LISA
lisa_create_hunspell_recall = create_query_body_hunspell_recall(lisa_qry_data, lisa_rel, lisa_index)
lisa_eval_body_hunspell_recall = json.dumps(lisa_create_hunspell_recall)
lisa_res_hunspell_recall = es.rank_eval(lisa_eval_body_hunspell_recall,lisa_index)
#print(json.dumps(lisa_res, indent=4, sort_keys=True))

#TIME
time_create_hunspell_recall = create_query_body_hunspell_recall(time_qry_data, time_rel, time_index)
time_eval_body_hunspell_recall = json.dumps(time_create_hunspell_recall)
time_res_hunspell_recall = es.rank_eval(time_eval_body_hunspell_recall,time_index)
#print(json.dumps(time_res, indent=4, sort_keys=True))

#NPL
npl_create_hunspell_recall = create_query_body_hunspell_recall(npl_qry_data, npl_rel, npl_index)
npl_eval_body_hunspell_recall = json.dumps(npl_create_hunspell_recall)
npl_res_hunspell_recall = es.rank_eval(npl_eval_body_hunspell_recall,npl_index)
#print(json.dumps(npl_res, indent=4, sort_keys=True))

### Precision

In this section we only evaluate the Precision scores.
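
Precision at k is the share of the first k results that are relevant. The sketch below computes it by hand for a made-up result list. `precision_at_k` is a hypothetical helper for illustration, not the Elasticsearch implementation, and returning 0.0 for an empty result list is a choice made for this sketch:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=20):
    """Fraction of the top k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / len(top_k)


retrieved = ["3", "7", "1", "9", "4"]  # ranked result list (made up)
relevant = ["1", "2", "3", "4"]        # relevance judgments (made up)
score = precision_at_k(retrieved, relevant, k=4)
print(score)  # 0.5: two of the top 4 results are relevant
```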

**Multi Match Query**

Here we evaluate the data with the ["multi_match"](https://pragmalingu.de/docs/experiments/experiment1#standard-elasticsearch) option of Elasticsearch:

In [None]:
#use rank eval api, see https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval 
#and https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval

from collections import defaultdict

cran_index = 'cranfield-corpus'
med_index = 'medline-corpus'
adi_index = 'adi-corpus'
cisi_index = 'cisi-corpus'
cacm_index = 'cacm-corpus'
lisa_index = 'lisa-corpus'
time_index = 'time-corpus'
npl_index = 'npl-corpus'

# function to get normal match evaluation body 

def create_query_body_match_precision(query_dict, rel_dict, index_name):
  """
  Creates a request for every query in query_dict and gives every relevant document listed in rel_dict a rating of 1.
  The index name has to be the same as the index of the documents you're looking at.
  Returns an evaluation body for the Elasticsearch Ranking Evaluation API.
  """
  eval_body = {
      "requests":'',
      "metric": {
          "precision": {
              "k": 20,
              "relevant_rating_threshold": 1
              }
      }
  }
  requests = [] 
  current_request = defaultdict(lambda: defaultdict())
  current_rel = {"_index": index_name, "_id": '', "rating": int}
  for query_ID, query_txt in query_dict.items():
    current_query = {"query": { "multi_match": { "query": '' , "fields" : ["title","text"]}}}
    current_query["query"]["multi_match"]["query"] = query_txt['question']
    current_request["id"] = 'Query_'+str(query_ID)
    current_request["request"] = current_query.copy()
    current_request["ratings"] = [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]]
    requests.append(current_request.copy())
  eval_body["requests"] = requests
  return eval_body

#Cranfield
cran_create_match_precision = create_query_body_match_precision(cran_qry_data, cran_rel, cran_index)
cran_eval_body_match_precision = json.dumps(cran_create_match_precision)
cran_res_match_precision = es.rank_eval(cran_eval_body_match_precision, cran_index)
#print(json.dumps(cran_res, indent=4, sort_keys=True))

#Medline
med_create_match_precision = create_query_body_match_precision(med_qry_data, med_rel, med_index)
med_eval_body_match_precision = json.dumps(med_create_match_precision)
med_res_match_precision = es.rank_eval(med_eval_body_match_precision, med_index)
#print(json.dumps(med_res, indent=4, sort_keys=True))

#Adi
adi_create_match_precision = create_query_body_match_precision(adi_qry_data, adi_rel, adi_index)
adi_eval_body_match_precision = json.dumps(adi_create_match_precision)
adi_res_match_precision = es.rank_eval(adi_eval_body_match_precision, adi_index)
#print(json.dumps(adi_res, indent=4, sort_keys=True))

#CISI
cisi_create_match_precision = create_query_body_match_precision(cisi_qry_data, cisi_rel, cisi_index)
cisi_eval_body_match_precision = json.dumps(cisi_create_match_precision)
cisi_res_match_precision = es.rank_eval(cisi_eval_body_match_precision, cisi_index)
#print(json.dumps(cisi_res, indent=4, sort_keys=True))

#CACM
cacm_create_match_precision = create_query_body_match_precision(cacm_qry_data, cacm_rel, cacm_index)
cacm_eval_body_match_precision = json.dumps(cacm_create_match_precision)
cacm_res_match_precision = es.rank_eval(cacm_eval_body_match_precision,cacm_index)
#print(json.dumps(cacm_res, indent=4, sort_keys=True))

#LISA
lisa_create_match_precision = create_query_body_match_precision(lisa_qry_data, lisa_rel, lisa_index)
lisa_eval_body_match_precision = json.dumps(lisa_create_match_precision)
lisa_res_match_precision = es.rank_eval(lisa_eval_body_match_precision,lisa_index)
#print(json.dumps(lisa_res, indent=4, sort_keys=True))

#TIME
time_create_match_precision = create_query_body_match_precision(time_qry_data, time_rel, time_index)
time_eval_body_match_precision = json.dumps(time_create_match_precision)
time_res_match_precision = es.rank_eval(time_eval_body_match_precision,time_index)
#print(json.dumps(time_res, indent=4, sort_keys=True))

#NPL
npl_create_match_precision = create_query_body_match_precision(npl_qry_data, npl_rel, npl_index)
npl_eval_body_match_precision = json.dumps(npl_create_match_precision)
npl_res_match_precision = es.rank_eval(npl_eval_body_match_precision,npl_index)
#print(json.dumps(npl_res, indent=4, sort_keys=True))

**Stemmer Token Filter**

Here we evaluate the data with the ["stemmer token filter"](https://pragmalingu.de/docs/experiments/experiment1#stemmer-token-filter) option of Elasticsearch:

In [None]:
#use rank eval api, see https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval 
#and https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval

from collections import defaultdict

cran_index = 'stemming-cranfield-corpus'
med_index = 'stemming-medline-corpus'
adi_index = 'stemming-adi-corpus'
cisi_index = 'stemming-cisi-corpus'
cacm_index = 'stemming-cacm-corpus'
lisa_index = 'stemming-lisa-corpus'
time_index = 'stemming-time-corpus'
npl_index = 'stemming-npl-corpus'

#function to get normal match evaluation body 

def create_query_body_stemming_precision(query_dict, rel_dict, index_name):
  """
  Creates a request for every query in query_dict and gives every relevant document listed in rel_dict a rating of 1.
  The index name has to be the same as the index of the documents you're looking at.
  Returns an evaluation body for the Elasticsearch Ranking Evaluation API.
  """
  eval_body = {
      "requests":'',
      "metric": {
          "precision": {
              "k": 20,
              "relevant_rating_threshold": 1
              }
      }
  }
  requests = [] 
  current_request = defaultdict(lambda: defaultdict())
  current_rel = {"_index": index_name, "_id": '', "rating": int}
  for query_ID, query_txt in query_dict.items():
    current_query = {"query": { "multi_match": { "query": '' , "fields" : ["title","text"]}}}
    current_query["query"]["multi_match"]["query"] = query_txt['question']
    current_request["id"] = 'Query_'+str(query_ID)
    current_request["request"] = current_query.copy()
    current_request["ratings"] = [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]]
    requests.append(current_request.copy())
  eval_body["requests"] = requests
  return eval_body

#Cranfield
cran_create_stemming_precision = create_query_body_stemming_precision(cran_qry_data, cran_rel, cran_index)
cran_eval_body_stemming_precision = json.dumps(cran_create_stemming_precision)
cran_res_stemming_precision = es.rank_eval(cran_eval_body_stemming_precision, cran_index)
#print(json.dumps(cran_res, indent=4, sort_keys=True))

#Medline
med_create_stemming_precision = create_query_body_stemming_precision(med_qry_data, med_rel, med_index)
med_eval_body_stemming_precision = json.dumps(med_create_stemming_precision)
med_res_stemming_precision = es.rank_eval(med_eval_body_stemming_precision, med_index)
#print(json.dumps(med_res, indent=4, sort_keys=True))

#Adi
adi_create_stemming_precision = create_query_body_stemming_precision(adi_qry_data, adi_rel, adi_index)
adi_eval_body_stemming_precision = json.dumps(adi_create_stemming_precision)
adi_res_stemming_precision = es.rank_eval(adi_eval_body_stemming_precision, adi_index)
#print(json.dumps(adi_res, indent=4, sort_keys=True))

#CISI
cisi_create_stemming_precision = create_query_body_stemming_precision(cisi_qry_data, cisi_rel, cisi_index)
cisi_eval_body_stemming_precision = json.dumps(cisi_create_stemming_precision)
cisi_res_stemming_precision = es.rank_eval(cisi_eval_body_stemming_precision, cisi_index)
#print(json.dumps(cisi_res, indent=4, sort_keys=True))

#CACM
cacm_create_stemming_precision = create_query_body_stemming_precision(cacm_qry_data, cacm_rel, cacm_index)
cacm_eval_body_stemming_precision = json.dumps(cacm_create_stemming_precision)
cacm_res_stemming_precision = es.rank_eval(cacm_eval_body_stemming_precision,cacm_index)
#print(json.dumps(cacm_res, indent=4, sort_keys=True))

#LISA
lisa_create_stemming_precision = create_query_body_stemming_precision(lisa_qry_data, lisa_rel, lisa_index)
lisa_eval_body_stemming_precision = json.dumps(lisa_create_stemming_precision)
lisa_res_stemming_precision = es.rank_eval(lisa_eval_body_stemming_precision,lisa_index)
#print(json.dumps(lisa_res, indent=4, sort_keys=True))

#TIME
time_create_stemming_precision = create_query_body_stemming_precision(time_qry_data, time_rel, time_index)
time_eval_body_stemming_precision = json.dumps(time_create_stemming_precision)
time_res_stemming_precision = es.rank_eval(time_eval_body_stemming_precision,time_index)
#print(json.dumps(time_res, indent=4, sort_keys=True))

#NPL
npl_create_stemming_precision = create_query_body_stemming_precision(npl_qry_data, npl_rel, npl_index)
npl_eval_body_stemming_precision= json.dumps(npl_create_stemming_precision)
npl_res_stemming_precision = es.rank_eval(npl_eval_body_stemming_precision,npl_index)
#print(json.dumps(npl_res, indent=4, sort_keys=True))

**Hunspell Token Filter**

Here we evaluate the data with the ["hunspell token filter"](https://pragmalingu.de/docs/experiments/experiment1#hunspell-token-filter) option of Elasticsearch:

In [None]:
#use rank eval api, see https://elasticsearch-py.readthedocs.io/en/master/api.html?highlight=_rank_eval#elasticsearch.Elasticsearch.rank_eval 
#and https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html#search-rank-eval

from collections import defaultdict

cran_index = 'hunspell-cranfield-corpus'
med_index = 'hunspell-medline-corpus'
adi_index = 'hunspell-adi-corpus'
cisi_index = 'hunspell-cisi-corpus'
cacm_index = 'hunspell-cacm-corpus'
lisa_index = 'hunspell-lisa-corpus'
time_index = 'hunspell-time-corpus'
npl_index = 'hunspell-npl-corpus'

#function to get normal match evaluation body 

def create_query_body_hunspell_precision(query_dict, rel_dict, index_name):
  """
  Creates a request for every query in query_dict and gives every relevant document listed in rel_dict a rating of 1.
  The index name has to be the same as the index of the documents you're looking at.
  Returns an evaluation body for the Elasticsearch Ranking Evaluation API.
  """
  eval_body = {
      "requests":'',
      "metric": {
          "precision": {
              "k": 20,
              "relevant_rating_threshold": 1
              }
      }
  }
  requests = [] 
  current_request = defaultdict(lambda: defaultdict())
  current_rel = {"_index": index_name, "_id": '', "rating": int}
  for query_ID, query_txt in query_dict.items():
    current_query = {"query": { "multi_match": { "query": '' , "fields" : ["title","text"]}}}
    current_query["query"]["multi_match"]["query"] = query_txt['question']
    current_request["id"] = 'Query_'+str(query_ID)
    current_request["request"] = current_query.copy()
    current_request["ratings"] = [{"_index": index_name, "_id": str(el), "rating": 1} for el in rel_dict[query_ID]]
    requests.append(current_request.copy())
  eval_body["requests"] = requests
  return eval_body

#Cranfield
cran_create_hunspell_precision = create_query_body_hunspell_precision(cran_qry_data, cran_rel, cran_index)
cran_eval_body_hunspell_precision = json.dumps(cran_create_hunspell_precision)
cran_res_hunspell_precision = es.rank_eval(cran_eval_body_hunspell_precision, cran_index)
#print(json.dumps(cran_res, indent=4, sort_keys=True))

#Medline
med_create_hunspell_precision = create_query_body_hunspell_precision(med_qry_data, med_rel, med_index)
med_eval_body_hunspell_precision = json.dumps(med_create_hunspell_precision)
med_res_hunspell_precision = es.rank_eval(med_eval_body_hunspell_precision, med_index)
#print(json.dumps(med_res, indent=4, sort_keys=True))

#Adi
adi_create_hunspell_precision = create_query_body_hunspell_precision(adi_qry_data, adi_rel, adi_index)
adi_eval_body_hunspell_precision = json.dumps(adi_create_hunspell_precision)
adi_res_hunspell_precision = es.rank_eval(adi_eval_body_hunspell_precision, adi_index)
#print(json.dumps(adi_res, indent=4, sort_keys=True))

#CISI
cisi_create_hunspell_precision = create_query_body_hunspell_precision(cisi_qry_data, cisi_rel, cisi_index)
cisi_eval_body_hunspell_precision = json.dumps(cisi_create_hunspell_precision)
cisi_res_hunspell_precision = es.rank_eval(cisi_eval_body_hunspell_precision, cisi_index)
#print(json.dumps(cisi_res, indent=4, sort_keys=True))

#CACM
cacm_create_hunspell_precision = create_query_body_hunspell_precision(cacm_qry_data, cacm_rel, cacm_index)
cacm_eval_body_hunspell_precision = json.dumps(cacm_create_hunspell_precision)
cacm_res_hunspell_precision = es.rank_eval(cacm_eval_body_hunspell_precision,cacm_index)
#print(json.dumps(cacm_res, indent=4, sort_keys=True))

#LISA
lisa_create_hunspell_precision = create_query_body_hunspell_precision(lisa_qry_data, lisa_rel, lisa_index)
lisa_eval_body_hunspell_precision = json.dumps(lisa_create_hunspell_precision)
lisa_res_hunspell_precision = es.rank_eval(lisa_eval_body_hunspell_precision,lisa_index)
#print(json.dumps(lisa_res, indent=4, sort_keys=True))

#TIME
time_create_hunspell_precision = create_query_body_hunspell_precision(time_qry_data, time_rel, time_index)
time_eval_body_hunspell_precision = json.dumps(time_create_hunspell_precision)
time_res_hunspell_precision = es.rank_eval(time_eval_body_hunspell_precision,time_index)
#print(json.dumps(time_res, indent=4, sort_keys=True))

#NPL
npl_create_hunspell_precision = create_query_body_hunspell_precision(npl_qry_data, npl_rel, npl_index)
npl_eval_body_hunspell_precision= json.dumps(npl_create_hunspell_precision)
npl_res_hunspell_precision = es.rank_eval(npl_eval_body_hunspell_precision,npl_index)
#print(json.dumps(npl_res, indent=4, sort_keys=True))

## Visualization

The last step is to visualize the data so we can analyze the differences:

### Recall

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

match_metrics_recall = []
match_metrics_recall.append(round(cran_res_match_recall['metric_score'], 3))
match_metrics_recall.append(round(cisi_res_match_recall['metric_score'], 3))
match_metrics_recall.append(round(adi_res_match_recall['metric_score'], 3))
match_metrics_recall.append(round(med_res_match_recall['metric_score'], 3))
match_metrics_recall.append(round(cacm_res_match_recall['metric_score'], 3))
match_metrics_recall.append(round(lisa_res_match_recall['metric_score'], 3))
match_metrics_recall.append(round(time_res_match_recall['metric_score'], 3))
match_metrics_recall.append(round(npl_res_match_recall['metric_score'], 3))

stemming_metrics_recall = []
stemming_metrics_recall.append(round(cran_res_stemming_recall['metric_score'], 3))
stemming_metrics_recall.append(round(cisi_res_stemming_recall['metric_score'], 3))
stemming_metrics_recall.append(round(adi_res_stemming_recall['metric_score'], 3))
stemming_metrics_recall.append(round(med_res_stemming_recall['metric_score'], 3))
stemming_metrics_recall.append(round(cacm_res_stemming_recall['metric_score'], 3))
stemming_metrics_recall.append(round(lisa_res_stemming_recall['metric_score'], 3))
stemming_metrics_recall.append(round(time_res_stemming_recall['metric_score'], 3))
stemming_metrics_recall.append(round(npl_res_stemming_recall['metric_score'], 3))

hunspell_metrics_recall = []
hunspell_metrics_recall.append(round(cran_res_hunspell_recall['metric_score'], 3))
hunspell_metrics_recall.append(round(cisi_res_hunspell_recall['metric_score'], 3))
hunspell_metrics_recall.append(round(adi_res_hunspell_recall['metric_score'], 3))
hunspell_metrics_recall.append(round(med_res_hunspell_recall['metric_score'], 3))
hunspell_metrics_recall.append(round(cacm_res_hunspell_recall['metric_score'], 3))
hunspell_metrics_recall.append(round(lisa_res_hunspell_recall['metric_score'], 3))
hunspell_metrics_recall.append(round(time_res_hunspell_recall['metric_score'], 3))
hunspell_metrics_recall.append(round(npl_res_hunspell_recall['metric_score'], 3))

labels = ['cranfield', 'CISI', 'ADI', 'medline', 'CACM','LISA','Time','NPL']

x = np.arange(len(labels))*1.8  # the label locations

width = 0.5  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width, match_metrics_recall , width, label='With multi match')
rects2 = ax.bar(x, stemming_metrics_recall, width, label='With stemming')
rects3 = ax.bar(x + width, hunspell_metrics_recall, width, label='With hunspell')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('metric scores')
ax.set_title('Recall scores by corpus')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)
autolabel(rects3)

fig.tight_layout()
fig.set_figwidth(16)
fig.set_figheight(8)

plt.show()

In [None]:
from tabulate import tabulate

match_metrics_recall.insert(0, 'standard search') 
stemming_metrics_recall.insert(0, 'stemming search') 
hunspell_metrics_recall.insert(0, 'hunspell search')

l = [match_metrics_recall, stemming_metrics_recall, hunspell_metrics_recall]
table = tabulate(l, headers=['cranfield', 'CISI', 'ADI', 'medline', 'CACM','LISA','Time','NPL'], tablefmt='orgtbl')

print(table)
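
The score lists above are filled with one `append` call per corpus, and the table rows are built by inserting a label into those same lists, so re-running the table cell adds the label a second time. As a minimal sketch, the same lists and rows could be built in a loop without mutating the score lists. The names `CORPORA`, `collect_scores`, `table_row` and `example_results` are hypothetical and not part of the notebook, and the scores below are placeholder values, not experiment results.

```python
# Corpus labels in the order used for the plots and tables.
CORPORA = ['cranfield', 'CISI', 'ADI', 'medline', 'CACM', 'LISA', 'Time', 'NPL']


def collect_scores(results, ndigits=3):
    """Return the rounded 'metric_score' of every corpus, in CORPORA order."""
    return [round(results[name]['metric_score'], ndigits) for name in CORPORA]


def table_row(label, scores):
    """Build a table row as a new list, leaving the score list untouched."""
    return [label] + list(scores)


# Placeholder responses (assumption): in the notebook these would be the
# ranking evaluation results such as cran_res_match_recall.
example_results = {name: {'metric_score': 0.1 * (i + 1)}
                   for i, name in enumerate(CORPORA)}

scores = collect_scores(example_results)
row = table_row('standard search', scores)
print(row)
```

Because `table_row` returns a new list, `scores` can still be passed to the plotting code afterwards, and the cell can be run repeatedly without changing the output.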

### Precision

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

match_metrics_precision = []
match_metrics_precision.append(round(cran_res_match_precision['metric_score'], 3))
match_metrics_precision.append(round(cisi_res_match_precision['metric_score'], 3))
match_metrics_precision.append(round(adi_res_match_precision['metric_score'], 3))
match_metrics_precision.append(round(med_res_match_precision['metric_score'], 3))
match_metrics_precision.append(round(cacm_res_match_precision['metric_score'], 3))
match_metrics_precision.append(round(lisa_res_match_precision['metric_score'], 3))
match_metrics_precision.append(round(time_res_match_precision['metric_score'], 3))
match_metrics_precision.append(round(npl_res_match_precision['metric_score'], 3))

stemming_metrics_precision = []
stemming_metrics_precision.append(round(cran_res_stemming_precision['metric_score'], 3))
stemming_metrics_precision.append(round(cisi_res_stemming_precision['metric_score'], 3))
stemming_metrics_precision.append(round(adi_res_stemming_precision['metric_score'], 3))
stemming_metrics_precision.append(round(med_res_stemming_precision['metric_score'], 3))
stemming_metrics_precision.append(round(cacm_res_stemming_precision['metric_score'], 3))
stemming_metrics_precision.append(round(lisa_res_stemming_precision['metric_score'], 3))
stemming_metrics_precision.append(round(time_res_stemming_precision['metric_score'], 3))
stemming_metrics_precision.append(round(npl_res_stemming_precision['metric_score'], 3))

hunspell_metrics_precision = []
hunspell_metrics_precision.append(round(cran_res_hunspell_precision['metric_score'], 3))
hunspell_metrics_precision.append(round(cisi_res_hunspell_precision['metric_score'], 3))
hunspell_metrics_precision.append(round(adi_res_hunspell_precision['metric_score'], 3))
hunspell_metrics_precision.append(round(med_res_hunspell_precision['metric_score'], 3))
hunspell_metrics_precision.append(round(cacm_res_hunspell_precision['metric_score'], 3))
hunspell_metrics_precision.append(round(lisa_res_hunspell_precision['metric_score'], 3))
hunspell_metrics_precision.append(round(time_res_hunspell_precision['metric_score'], 3))
hunspell_metrics_precision.append(round(npl_res_hunspell_precision['metric_score'], 3))

labels = ['cranfield', 'CISI', 'ADI', 'medline', 'CACM','LISA','Time','NPL']

x = np.arange(len(labels))*1.8  # the label locations

width = 0.5  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width, match_metrics_precision , width, label='With multi match')
rects2 = ax.bar(x, stemming_metrics_precision, width, label='With stemming')
rects3 = ax.bar(x + width, hunspell_metrics_precision, width, label='With hunspell')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('metric scores')
ax.set_title('Precision scores by corpus')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()


def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')


autolabel(rects1)
autolabel(rects2)
autolabel(rects3)

fig.tight_layout()
fig.set_figwidth(16)
fig.set_figheight(8)

plt.show()
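
The grouped bar charts place three bars around every tick by plotting at `x - width`, `x` and `x + width`, with the ticks spaced 1.8 apart. The following sketch shows how such positions can be computed for any number of series. `grouped_bar_positions` is a hypothetical helper, not part of the notebook, and its defaults simply mirror the values used in the plotting cells.

```python
def grouped_bar_positions(n_groups, n_series, width=0.5, spacing=1.8):
    """Return the tick centers and, per series, the x position of each bar."""
    centers = [i * spacing for i in range(n_groups)]
    # Offsets are symmetric around the tick: for 3 series -> -width, 0, +width.
    offsets = [(s - (n_series - 1) / 2) * width for s in range(n_series)]
    positions = [[c + off for c in centers] for off in offsets]
    return centers, positions


centers, positions = grouped_bar_positions(n_groups=8, n_series=3)
print(positions[0][:2], positions[1][:2], positions[2][:2])
```

With three bars of width 0.5, a group is 1.5 wide, so a tick spacing of 1.8 leaves a gap of 0.3 between neighbouring groups.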

In [None]:
from tabulate import tabulate

match_metrics_precision.insert(0, 'standard search')
stemming_metrics_precision.insert(0, 'stemming search')
hunspell_metrics_precision.insert(0, 'hunspell search')

l = [match_metrics_precision, stemming_metrics_precision, hunspell_metrics_precision]
table = tabulate(l, headers=['cranfield', 'CISI', 'ADI', 'medline', 'CACM','LISA','Time','NPL'], tablefmt='orgtbl')

print(table)

### F-Score
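
The F-score computed in this section is the balanced F1 score, the harmonic mean of recall and precision: F = 2 · (recall · precision) / (recall + precision). The sketch below works through one example with made-up input values and guards the case where both scores are 0, which would otherwise raise a `ZeroDivisionError`. The name `f1_score` is chosen for illustration only.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision; 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Made-up inputs: 2 * 0.5 * 0.25 / 0.75 = 1/3
print(round(f1_score(0.5, 0.25), 3))
```

Assuming each `metric_score` is an average over all queries of a corpus, the value obtained this way is the F-score of the averaged recall and precision, which can differ from the average of per-query F-scores.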

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

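def f_score(recall, precision):
    """Return the F1 score, the harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0  # avoid division by zero when both scores are 0
    return 2 * (recall * precision) / (recall + precision)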

match_metrics_fscore = []
match_metrics_fscore.append(round(f_score(cran_res_match_recall['metric_score'], cran_res_match_precision['metric_score']),3))
match_metrics_fscore.append(round(f_score(cisi_res_match_recall['metric_score'], cisi_res_match_precision['metric_score']),3))
match_metrics_fscore.append(round(f_score(adi_res_match_recall['metric_score'], adi_res_match_precision['metric_score']),3))
match_metrics_fscore.append(round(f_score(med_res_match_recall['metric_score'], med_res_match_precision['metric_score']),3))
match_metrics_fscore.append(round(f_score(cacm_res_match_recall['metric_score'], cacm_res_match_precision['metric_score']),3))
match_metrics_fscore.append(round(f_score(lisa_res_match_recall['metric_score'], lisa_res_match_precision['metric_score']),3))
match_metrics_fscore.append(round(f_score(time_res_match_recall['metric_score'], time_res_match_precision['metric_score']),3))
match_metrics_fscore.append(round(f_score(npl_res_match_recall['metric_score'], npl_res_match_precision['metric_score']),3))

stemming_metrics_fscore = []
stemming_metrics_fscore.append(round(f_score(cran_res_stemming_recall['metric_score'], cran_res_stemming_precision['metric_score']),3))
stemming_metrics_fscore.append(round(f_score(cisi_res_stemming_recall['metric_score'], cisi_res_stemming_precision['metric_score']),3))
stemming_metrics_fscore.append(round(f_score(adi_res_stemming_recall['metric_score'], adi_res_stemming_precision['metric_score']),3))
stemming_metrics_fscore.append(round(f_score(med_res_stemming_recall['metric_score'], med_res_stemming_precision['metric_score']),3))
stemming_metrics_fscore.append(round(f_score(cacm_res_stemming_recall['metric_score'], cacm_res_stemming_precision['metric_score']),3))
stemming_metrics_fscore.append(round(f_score(lisa_res_stemming_recall['metric_score'], lisa_res_stemming_precision['metric_score']),3))
stemming_metrics_fscore.append(round(f_score(time_res_stemming_recall['metric_score'], time_res_stemming_precision['metric_score']),3))
stemming_metrics_fscore.append(round(f_score(npl_res_stemming_recall['metric_score'], npl_res_stemming_precision['metric_score']),3))

hunspell_metrics_fscore = []
hunspell_metrics_fscore.append(round(f_score(cran_res_hunspell_recall['metric_score'], cran_res_hunspell_precision['metric_score']),3))
hunspell_metrics_fscore.append(round(f_score(cisi_res_hunspell_recall['metric_score'], cisi_res_hunspell_precision['metric_score']),3))
hunspell_metrics_fscore.append(round(f_score(adi_res_hunspell_recall['metric_score'], adi_res_hunspell_precision['metric_score']),3))
hunspell_metrics_fscore.append(round(f_score(med_res_hunspell_recall['metric_score'], med_res_hunspell_precision['metric_score']),3))
hunspell_metrics_fscore.append(round(f_score(cacm_res_hunspell_recall['metric_score'], cacm_res_hunspell_precision['metric_score']),3))
hunspell_metrics_fscore.append(round(f_score(lisa_res_hunspell_recall['metric_score'], lisa_res_hunspell_precision['metric_score']),3))
hunspell_metrics_fscore.append(round(f_score(time_res_hunspell_recall['metric_score'], time_res_hunspell_precision['metric_score']),3))
hunspell_metrics_fscore.append(round(f_score(npl_res_hunspell_recall['metric_score'], npl_res_hunspell_precision['metric_score']),3))

labels = ['cranfield', 'CISI', 'ADI', 'medline', 'CACM','LISA','Time','NPL']

x = np.arange(len(labels))*1.8  # the label locations

width = 0.5  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width, match_metrics_fscore , width, label='With multi match')
rects2 = ax.bar(x, stemming_metrics_fscore, width, label='With stemming')
rects3 = ax.bar(x + width, hunspell_metrics_fscore, width, label='With hunspell')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('metric scores')
ax.set_title('F-scores by corpus')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()


def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3),  # 3 points vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')


autolabel(rects1)
autolabel(rects2)
autolabel(rects3)

fig.tight_layout()
fig.set_figwidth(16)
fig.set_figheight(8)

plt.show()

In [None]:
from tabulate import tabulate

match_metrics_fscore.insert(0, 'standard search')
stemming_metrics_fscore.insert(0, 'stemming search')
hunspell_metrics_fscore.insert(0, 'hunspell search')

l = [match_metrics_fscore, stemming_metrics_fscore, hunspell_metrics_fscore]
table = tabulate(l, headers=['cranfield', 'CISI', 'ADI', 'medline', 'CACM','LISA','Time','NPL'], tablefmt='orgtbl')

print(table)

Read more on this experiment on our [website](https://pragmalingu.de/docs/experiments/experiment1).