# Full text search (FTS)

In most cases I use the excellent `restmagic` library, which provides Jupyter Notebook magics, because:
- it is very easy to use and similar to `curl`
- it is a lot less hassle than `requests`, where dicts have to be converted to JSON
- I could not get the Elasticsearch Python client working

In [1]:
import json
import os
import re

import requests

%reload_ext restmagic

With the configuration below I only have to specify the Elasticsearch endpoint; the URL root is prepended automatically.

After each query, its response is available in the `_` variable.

In [2]:
%%rest_root "http://localhost:9200/"
Content-Type: application/json

Requests defaults are set.
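
For the cells that use plain `requests` instead of the magic, the same defaults can be approximated with a small helper. This is only a sketch: `ES_ROOT`, `DEFAULT_HEADERS` and `es_url` are hypothetical names introduced here, not part of `restmagic`.

```python
from urllib.parse import urljoin

# Assumed to match the root configured for restmagic above.
ES_ROOT = "http://localhost:9200/"
# Same default header as in the %%rest_root cell.
DEFAULT_HEADERS = {"Content-Type": "application/json"}


def es_url(endpoint, root=ES_ROOT):
    """Join the URL root with an endpoint such as "bills/_search"."""
    # A leading slash would make urljoin discard any path in the root.
    return urljoin(root, endpoint.lstrip("/"))


print(es_url("bills/_analyze"))
```

A call such as `requests.get(es_url("bills/_search"), headers=DEFAULT_HEADERS, json=query)` would then mirror a `%%rest GET "bills/_search"` cell.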


## Setup, data loading

Define an ES analyzer for Polish texts containing:
- standard tokenizer
- synonym filter with the following definitions:
  - kpk - kodeks postępowania karnego
  - kpc - kodeks postępowania cywilnego
  - kk - kodeks karny
  - kc - kodeks cywilny
- Morfologik-based lemmatizer
- lowercase filter

At the same time I define the index, with a `content` field that holds the bills' contents.

In [3]:
%rest -q DELETE "bills"

<Response [200]>

In [4]:
%%rest PUT "bills"

{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "bills_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "graph_synonyms",
              "morfologik_stem"
            ]
          }
        },
        "filter": {
          "graph_synonyms": {
            "type": "synonym_graph",
            "synonyms": [
              "kpk, kodeks postępowania karnego",
              "kpc, kodeks postępowania cywilnego",
              "kk, kodeks karny",
              "kc, kodeks cywilny"
            ]
          }
        }
      }
    }
  },
    
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "bills_analyzer"
      }
    }
  }
}


{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "bills"
}

<Response [200]>
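
The same index definition can also be generated in Python, which keeps the synonym rules in one place. A minimal sketch: `SYNONYMS` and `build_index_body` are hypothetical names, and the filter order simply copies the request above.

```python
# Abbreviation -> expansion, as listed in the analyzer description above.
SYNONYMS = {
    "kpk": "kodeks postępowania karnego",
    "kpc": "kodeks postępowania cywilnego",
    "kk": "kodeks karny",
    "kc": "kodeks cywilny",
}


def build_index_body(synonyms, field="content", analyzer="bills_analyzer"):
    """Build the settings and mappings body for the PUT request."""
    # Each rule has the "abbr, expansion" form used by the synonym filter above.
    rules = [f"{abbr}, {full}" for abbr, full in synonyms.items()]
    analysis = {
        "analyzer": {
            analyzer: {
                "tokenizer": "standard",
                "filter": ["lowercase", "graph_synonyms", "morfologik_stem"],
            }
        },
        "filter": {
            "graph_synonyms": {"type": "synonym_graph", "synonyms": rules}
        },
    }
    return {
        "settings": {"index": {"analysis": analysis}},
        "mappings": {
            "properties": {field: {"type": "text", "analyzer": analyzer}}
        },
    }


body = build_index_body(SYNONYMS)
```

Sending it with `requests.put("http://localhost:9200/bills", json=body)` should be equivalent to the `%%rest PUT "bills"` cell.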

Check whether it works:

In [5]:
%%rest GET "bills/_analyze"

{
    "analyzer": "bills_analyzer",
    "text": "Ustawa o KK i niektórych innych ustawach"
}

{
  "tokens": [
    {
      "token": "ustawa",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "o",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "ocean",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "ojciec",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "kodeks",
      "start_offset": 9,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "kodeks karny",
      "start_offset": 9,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2,
      "positionLength": 2
    },
    {
      "token": "karny",
      "start_offset": 9,
      "end_offset": 11,
      "type": "SYNONYM",
      "position": 3
    },
    {
      "token": "i

<Response [200]>

It mostly works well, with a few curiosities: for example, the single token "o" was lemmatized to three forms: o, ocean and ojciec.
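
Grouping the tokens by position makes such alternatives easier to spot than scrolling through the raw response. A small sketch: `tokens_by_position` is a hypothetical helper, and `sample` is a hand-trimmed excerpt of the response above, not a full one.

```python
from collections import defaultdict


def tokens_by_position(analyze_response):
    """Map each token position to all tokens the analyzer emitted there."""
    grouped = defaultdict(list)
    for token in analyze_response["tokens"]:
        grouped[token["position"]].append(token["token"])
    return dict(sorted(grouped.items()))


# Excerpt of the _analyze response above (only the fields used here).
sample = {
    "tokens": [
        {"token": "ustawa", "position": 0},
        {"token": "o", "position": 1},
        {"token": "ocean", "position": 1},
        {"token": "ojciec", "position": 1},
    ]
}

for position, variants in tokens_by_position(sample).items():
    print(position, variants)
```

Right after the `_analyze` cell, `tokens_by_position(_.json())` would do the same for the real response.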

Loading the data is easier with requests due to the loop-based nature of the operation. Filenames (without `.txt`) will be used as document IDs.

In [6]:
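data_dir = "corpus"

for filename in sorted(os.listdir(data_dir)):
    filepath = os.path.join(data_dir, filename)
    with open(filepath, encoding="UTF-8") as file:
        content = file.read()

    # Document ID is the filename without its extension, e.g. "1996_400.txt" -> "1996_400".
    text_id = os.path.splitext(filename)[0]

    response = requests.put(
        f"http://localhost:9200/bills/_doc/{text_id}", json={"content": content}
    )
    # Fail loudly instead of silently skipping a document that was not indexed.
    response.raise_for_status()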
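
One request per file is fine for a corpus of this size. An alternative would be the Bulk API, which takes newline-delimited JSON: an action line followed by a source line for every document. The sketch below only builds such a body; `build_bulk_body` is a hypothetical helper, and for a larger corpus the body would likely have to be split into batches.

```python
import json


def build_bulk_body(documents, index="bills"):
    """Build an NDJSON body for POST /_bulk from (doc_id, content) pairs."""
    lines = []
    for doc_id, content in documents:
        # Action line: index the document under the given ID.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: json.dumps escapes newlines, so it stays on one line.
        lines.append(json.dumps({"content": content}, ensure_ascii=False))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body([("1996_400", "Brak tekstu\n")])
print(bulk_body.count("\n"))
```

The body would be sent with `requests.post("http://localhost:9200/_bulk", data=bulk_body.encode("utf-8"), headers={"Content-Type": "application/x-ndjson"})`.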


## Counting legislative acts with _ustawa_

Determine the number of legislative acts containing the word _ustawa_ (in any form).

Here I use the Search API with the `filter_path=hits.total.value` parameter, since I only want the total number of documents in which the word was found.

In [7]:
%%rest GET "bills/_search?filter_path=hits.total.value"

{
  "query": {
    "match": {
      "content": "ustawa"
    }
  }
}

{
  "hits": {
    "total": {
      "value": 1178
    }
  }
}

<Response [200]>
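
If only the number is needed, the Count API (`_count`) is an alternative to filtering the search response; it returns the number of matching documents in a `count` field. A sketch under assumptions: `match_query` and `count_documents` are hypothetical helpers, and `client` is anything with a `requests`-style `get` method (the `requests` module itself or a `requests.Session`).

```python
def match_query(field, text):
    """Build the same match query body as in the cell above."""
    return {"query": {"match": {field: text}}}


def count_documents(client, text, index="bills", root="http://localhost:9200/"):
    """Return the number of documents whose content matches the text."""
    response = client.get(f"{root}{index}/_count", json=match_query("content", text))
    response.raise_for_status()
    return response.json()["count"]
```

With a running cluster, `count_documents(requests, "ustawa")` should agree with `hits.total.value` from the search above.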

Interestingly, we have 1179 files and 1178 of them contain the word _ustawa_, so there is just one file that does not. I decided to check it out of curiosity.

In [8]:
%%rest GET "bills/_search"

{
  "query": {
    "bool": {
      "must_not": {
        "match": {
          "content": "ustawa"
        }
      }
    }
  }
}


{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.0,
    "hits": [
      {
        "_index": "bills",
        "_type": "_doc",
        "_id": "1996_400",
        "_score": 0.0,
        "_source": {
          "content": "\n\n\n\n\nBrak tekstu w postaci elektronicznej \n"
        }
      }
    ]
  }
}

<Response [200]>

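So that one file simply has no text: its content reads "Brak tekstu w postaci elektronicznej" ("No text available in electronic form").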

## Counting _ustawa_ occurrences

Determine the number of occurrences of the word _ustawa_ (in any form).

Determine the number of occurrences of the word _ustaw_ (in any form).

Obviously, for a full-text search engine like ES, both searches should return exactly the same count.

I researched 3 possible solutions for this:
- count (Search API) with aggregation
- scripting (using Painless language)
- term vectors API

I rejected the first solution, since it requires runtime fields (runtime mappings), which were introduced in Elasticsearch 7.11, while Morfologik requires ES 7.10; see [this answer to my question on StackOverflow regarding this task](https://stackoverflow.com/a/69731030/9472066). However, it should be noted that this is the only universal solution that would work in a production environment (e.g. many replicas, many shards, large indexes).

Scripting turned out to be hard and did not work well with a `text` field.

The term vectors API does what is needed here. For a given document it returns the count of each term in that document, and, when term statistics are requested, also a `"ttf"` field for each term. It stands for total term frequency, i.e. the number of occurrences of that term in the entire index. However, there are 2 caveats with this approach:
- it only works on a single shard, and there is no way to run a single Elasticsearch query across many shards; with multiple shards we would have to loop in Python. This is not a problem here, since we have just a single shard (the default since Elasticsearch 7.x)
- there is no easy way to specify the particular word whose count we want; instead we have to use the ["artificial documents" mechanism](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html#docs-termvectors-artificial-doc), providing an artificial document that contains just the word we want counted in the index. Note that the word from the artificial document is **not** included in this count (which is good)

The last solution is also the most efficient one:
- count with aggregation takes about 10x longer
- scripting requires re-inverting the index, which takes a lot of memory

In [20]:
%%rest GET "bills/_termvectors"

{
  "doc" : {
    "content": "ustawa"
  },
  "offsets": false,
  "positions": false,
  "field_statistics": false,
  "term_statistics": true
}

{
  "_index": "bills",
  "_type": "_doc",
  "_version": 0,
  "found": true,
  "took": 0,
  "term_vectors": {
    "content": {
      "terms": {
        "ustawa": {
          "doc_freq": 1178,
          "ttf": 24934,
          "term_freq": 1
        }
      }
    }
  }
}

<Response [200]>

In [21]:
%%rest GET "bills/_termvectors"

{
  "doc" : {
    "content": "ustaw"
  },
  "offsets": false,
  "positions": false,
  "field_statistics": false,
  "term_statistics": true
}

{
  "_index": "bills",
  "_type": "_doc",
  "_version": 0,
  "found": true,
  "took": 0,
  "term_vectors": {
    "content": {
      "terms": {
        "ustawa": {
          "doc_freq": 1178,
          "ttf": 24934,
          "term_freq": 1
        },
        "ustawić": {
          "doc_freq": 378,
          "ttf": 913,
          "term_freq": 1
        }
      }
    }
  }
}

<Response [200]>

In [22]:
_.json()["term_vectors"]["content"]["terms"]["ustawa"]["ttf"]

24934

Counts for _ustawa_ and _ustaw_ are the same, which is the expected behavior. We can also see that the analyzer, using Morfologik, reduced _ustaw_ to the base form _ustawa_ (and additionally to the verb _ustawić_).
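
The two term vectors requests differ only in the word, so the body and the extraction of `ttf` can be wrapped in small helpers. A minimal sketch: `termvectors_body` and `extract_ttf` are hypothetical names, and `sample` is the response for _ustaw_ shown above, trimmed to the fields used.

```python
def termvectors_body(word, field="content"):
    """Build a term vectors request for an artificial one-word document."""
    return {
        "doc": {field: word},
        "offsets": False,
        "positions": False,
        "field_statistics": False,
        "term_statistics": True,
    }


def extract_ttf(response_json, field="content"):
    """Map each analyzed term to its total term frequency (ttf)."""
    terms = response_json["term_vectors"][field]["terms"]
    return {term: stats["ttf"] for term, stats in terms.items()}


# Trimmed copy of the response for "ustaw" shown above.
sample = {
    "term_vectors": {
        "content": {
            "terms": {
                "ustawa": {"doc_freq": 1178, "ttf": 24934, "term_freq": 1},
                "ustawić": {"doc_freq": 378, "ttf": 913, "term_freq": 1},
            }
        }
    }
}

ttf = extract_ttf(sample)
print(ttf["ustawa"])
```

Returning all terms rather than a single number keeps cases like _ustaw_ visible, where one surface form maps to more than one lemma.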

## 