# lab2 - Lemmatization and full text search (FTS)

In [114]:
from elasticsearch import Elasticsearch
from pathlib import Path
from pprint import pprint

## Tasks

1. Install Elasticsearch (ES).
2. Install the ES plugin for Polish: https://github.com/allegro/elasticsearch-analysis-morfologik

In [2]:
# !docker-compose up

In [3]:
es = Elasticsearch()

3. Define an ES analyzer for Polish texts containing:
   1. standard tokenizer
   1. synonym filter with the following definitions:
      1. kpk - kodeks postępowania karnego
      1. kpc - kodeks postępowania cywilnego
      1. kk - kodeks karny
      1. kc - kodeks cywilny
   1. Morfologik-based lemmatizer
   1. lowercase filter
   

In [86]:
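es.indices.create(
    index="przewie",
    body={
        "settings": {
            "analysis": {
                "filter": {
                    "kodeks_synonym": {
                        "type": "synonym",
                        "synonyms": [
                            "kpk => kodeks postępowania karnego",
                            "kpc => kodeks postępowania cywilnego",
                            "kk => kodeks karny",
                            "kc => kodeks cywilny",
                        ],
                    }
                },
                "analyzer": {
                    # Custom analyzer for task 3; the name "polish_legal" is arbitrary.
                    "polish_legal": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["kodeks_synonym", "lowercase", "morfologik_stem"],
                    }
                },
            }
        },
        "mappings": {
            "act": {
                "properties": {
                    # Use the custom analyzer at both index and search time.
                    "text": {"type": "text", "analyzer": "polish_legal"}
                }
            }
        },
    },
)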



{'acknowledged': True, 'shards_acknowledged': True, 'index': 'przewie'}

In [87]:
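def analyze(text):
    """Run the tokenizer and filter chain on `text` and return the resulting tokens."""
    return es.indices.analyze(
        index="przewie",
        body={
            "tokenizer": "standard",
            "filter": ["kodeks_synonym", "lowercase", "morfologik_stem"],
            "text": text,
        },
    )

In [118]:
analyze("będę analizować ustawy o kc")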

{'tokens': [{'token': 'być',
   'start_offset': 0,
   'end_offset': 4,
   'type': '<ALPHANUM>',
   'position': 0},
  {'token': 'analizować',
   'start_offset': 5,
   'end_offset': 15,
   'type': '<ALPHANUM>',
   'position': 1},
  {'token': 'ustawa',
   'start_offset': 16,
   'end_offset': 22,
   'type': '<ALPHANUM>',
   'position': 2},
  {'token': 'o',
   'start_offset': 23,
   'end_offset': 24,
   'type': '<ALPHANUM>',
   'position': 3},
  {'token': 'ocean',
   'start_offset': 23,
   'end_offset': 24,
   'type': '<ALPHANUM>',
   'position': 3},
  {'token': 'ojciec',
   'start_offset': 23,
   'end_offset': 24,
   'type': '<ALPHANUM>',
   'position': 3},
  {'token': 'kodeks',
   'start_offset': 25,
   'end_offset': 27,
   'type': 'SYNONYM',
   'position': 4},
  {'token': 'cywilny',
   'start_offset': 25,
   'end_offset': 27,
   'type': 'SYNONYM',
   'position': 5}]}
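
The response above stacks several tokens at one position whenever the lemmatizer returns more than one base form (here *o*, *ocean* and *ojciec* at position 3). A small helper can make that easier to see. This is only a sketch: `tokens_by_position` is a hypothetical name, and `sample` is a trimmed, hand-copied version of the output above rather than a live response.

```python
from collections import defaultdict


def tokens_by_position(response):
    """Group the tokens of an analyze response by their position."""
    grouped = defaultdict(list)
    for tok in response["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    # One list of alternatives per position, in position order.
    return [grouped[pos] for pos in sorted(grouped)]


# Trimmed copy of the response shown above (offsets and types omitted).
sample = {"tokens": [
    {"token": "być", "position": 0},
    {"token": "analizować", "position": 1},
    {"token": "ustawa", "position": 2},
    {"token": "o", "position": 3},
    {"token": "ocean", "position": 3},
    {"token": "ojciec", "position": 3},
    {"token": "kodeks", "position": 4},
    {"token": "cywilny", "position": 5},
]}

for alternatives in tokens_by_position(sample):
    print(" | ".join(alternatives))
```

Each printed line corresponds to one position in the token stream; alternatives separated by `|` are competing base forms of the same input word.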

4. Define an ES index for storing the contents of the legislative acts.

5. Load the data to the ES index.

In [89]:
for f in Path("../data").iterdir():
    if not f.is_file():
        continue
    # Index each act under its path; assumes the files are UTF-8 encoded.
    es.create(
        index="przewie",
        doc_type="act",
        id=str(f),
        body={"text": f.read_text(encoding="utf-8")},
    )
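
The loop above sends one request per file. For larger collections, the documents can instead be described as bulk actions and sent in batches. The generator below is a sketch of that idea: `iter_actions` is a hypothetical helper, UTF-8 input is an assumption, and the `_type` key assumes an ES version that still uses mapping types, as the rest of this notebook does. Note that bulk actions default to the `index` operation rather than `create`.

```python
from pathlib import Path


def iter_actions(data_dir, index="przewie", doc_type="act"):
    """Yield one bulk action per regular file found directly in data_dir."""
    for path in sorted(Path(data_dir).iterdir()):
        if path.is_file():
            yield {
                "_index": index,
                "_type": doc_type,
                "_id": str(path),
                "_source": {"text": path.read_text(encoding="utf-8")},
            }


# With a running cluster the actions could be sent like this:
#   from elasticsearch import helpers
#   helpers.bulk(es, iter_actions("../data"))
```

Keeping the action generator separate from the network call also makes it possible to inspect the documents before anything is written to the index.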

6. Determine the number of legislative acts containing the word **ustawa** (in any form).

In [93]:
es.search(
    index="przewie",
    doc_type="act",
    body={
        "query": {
            "match": {
                "text": {
                    "query": "ustawa"
                }
            }
        }
    }
)["hits"]["total"]

1179

7. Determine the number of legislative acts containing the words **kodeks postępowania cywilnego** in the specified order, but in any inflected form.


In [99]:
es.search(
    index="przewie",
    doc_type="act",
    body={
        "query": {
            "match_phrase": {
                "text": {
                    "query": "kodeks postępowania cywilnego"
                }
            }
        }
    }
)["hits"]["total"]

100

8. Determine the number of legislative acts containing the words **wchodzi w życie**
   (in any form), allowing for up to 2 additional words in the searched phrase.
   

In [100]:
es.search(
    index="przewie",
    doc_type="act",
    body={
        "query": {
            "match_phrase": {
                "text": {
                    "query": "wchodzi w życie",
                    "slop": 2
                }
            }
        }
    }
)["hits"]["total"]

1175
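
The `slop` parameter can be pictured as the number of position moves allowed when lining up the query terms with the document tokens. The function below is a simplified model of that idea, not the actual Lucene scoring code: `phrase_matches` is a hypothetical name, the token lists are hand-made lemmas, and each position is assumed to hold a single token.

```python
from itertools import product


def phrase_matches(doc_tokens, phrase, slop=0):
    """Return True if `phrase` occurs in `doc_tokens` within `slop` position moves."""
    # All positions at which each phrase term occurs in the document.
    positions = [
        [i for i, tok in enumerate(doc_tokens) if tok == term]
        for term in phrase
    ]
    if any(not p for p in positions):
        return False
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # every phrase term needs its own token
        # How far each term sits from where an exact phrase would place it.
        shifts = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


phrase = ["wchodzić", "w", "życie"]
exact = ["ustawa", "wchodzić", "w", "życie"]
gapped = ["ustawa", "wchodzić", "z", "dzień", "w", "życie"]

print(phrase_matches(exact, phrase))           # True
print(phrase_matches(gapped, phrase, slop=1))  # False
print(phrase_matches(gapped, phrase, slop=2))  # True
```

In the `gapped` example two extra tokens separate *wchodzić* from *w życie*, so the phrase matches only once `slop` reaches 2.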

9. Determine the 10 documents that are the most relevant for the phrase **konstytucja**.

In [115]:
best_hits = es.search(
    index="przewie",
    doc_type="act",
    body={
        # Hits are returned sorted by descending _score, so the first 10 are the best.
        "size": 10,
        "query": {"match": {"text": {"query": "konstytucja"}}},
        "highlight": {
            "fields": {"text": {}},
            "number_of_fragments": 3,
        },
    },
)["hits"]["hits"]

[h["_id"] for h in best_hits]

['../data/2000_443.txt',
 '../data/1997_604.txt',
 '../data/1996_350.txt',
 '../data/1997_642.txt',
 '../data/1996_199.txt',
 '../data/1997_629.txt',
 '../data/1999_688.txt',
 '../data/2001_23.txt',
 '../data/1997_681.txt',
 '../data/2001_1082.txt']

10. Print the excerpts containing the word **konstytucja** (up to three excerpts per document)
    from the previous task.

In [117]:
pprint([h["highlight"] for h in best_hits])

[{'text': ['umowy międzynarodowej i nie wypełnia przesłanek określonych w art. '
           '89\n'
           '     ust. 1 lub art. 90 <em>Konstytucji</em>',
           'międzynarodowej lub załącznika nie\n'
           '     wypełnia przesłanek określonych w art. 89 ust. 1 lub art. 90 '
           '<em>Konstytucji</em>',
           'co do zasadności wyboru\n'
           '  trybu ratyfikacji umowy międzynarodowej, o którym mowa w art. 89 '
           'ust. 2\n'
           '  <em>Konstytucji</em>']},
 {'text': ['Jeżeli Trybunał Konstytucyjny wyda orzeczenie o sprzeczności celów '
           'partii \n'
           '   politycznej z <em>Konstytucją</em>',
           'Jeżeli Trybunał Konstytucyjny wyda orzeczenie o sprzeczności z '
           '<em>Konstytucją</em>\n'
           '   celów lub działalności',
           'Ciężar udowodnienia niezgodności z <em>Konstytucją</em> spoczywa\n'
           '                na wnioskodawcy, który w tym']},
 {'text': ['Za naruszenie <em>Konstytucji</em>

## Hints

1. Full text search engines were developed for storing and searching textual data.
1. The most popular FTSes are Solr and Elasticsearch (ES).
1. Some relational databases support full text search, but usually it is limited and not easy to adapt.
1. Plugins supporting Polish are available for both Solr and ES.
1. FTSes use an *inverted index* to store the data. At loading time the *tokenizer* splits the text into
   *tokens*, and each token passes through the *filters*. The resulting tokens are placed as keys in a hash-like
   structure. The values are so-called *posting lists*, containing pointers to the documents the tokens come from.
1. The minimal FTS configuration requires two elements: a tokenizer and a set of filters (the set might be empty in the extreme
   case). **Changing the configuration of an index does not result in the new definitions being applied to the already
   stored documents.** In such cases the index has to be *rebuilt*, meaning that the documents have to be loaded once
   again.
1. FTSes ship with a large number of tokenizers, e.g. some know the semantics of HTML documents and treat HTML tags as
   tokens. Some popular tokenizers include:
   1. *standard tokenizer* - based on the Unicode tokenization rules,
   1. *whitespace tokenizer* - which splits the text on white space,
   1. *url tokenizer* - which keeps URLs as indivisible tokens (`uax_url_email` in ES).
1. Some tokens, such as commas and full stops, might be removed at the filtering stage. Filtering out common tokens reduces the index size.
1. Some popular filters include:
   1. *lowercase filter* - which downcases the letters,
   1. *ASCII folding filter* - which replaces non-ASCII characters with their ASCII equivalents, e.g. strips Polish diacritics,
   1. *stop token filter* - which removes the specified tokens (described above),
   1. *lemmatizers* - which find the base form of a word,
   1. etc. (the current implementation of ES has more than 50 filters)
1. **Lemmatization** is the process of replacing an inflected form of a word with its base form, e.g.
   the form *psu* is replaced with *pies*. Note that many forms are ambiguous, e.g.
   *goli* can have the following base forms: *golić*, *gol* and *goły*. To overcome the ambiguity, FTSes
   take a very pragmatic approach: for a given inflected form, all possible base forms are put in the index.
   Even though this is not valid from a linguistic point of view, it works well in practice.
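
The inverted index and the "index every possible base form" approach described in the hints can be sketched in a few lines of plain Python. This is a toy model under loud assumptions: `LEMMAS` is a hand-written dictionary standing in for a real lemmatizer such as Morfologik, `toy_analyze` and `build_index` are hypothetical names, and posting lists are modelled as sets of document ids.

```python
from collections import defaultdict

# Toy lemma dictionary; a real lemmatizer covers the whole language.
LEMMAS = {
    "psu": ["pies"],
    "goli": ["golić", "gol", "goły"],
}


def toy_analyze(text):
    """Whitespace tokenizer, lowercase filter, then the toy lemmatizer."""
    for position, token in enumerate(text.split()):
        token = token.lower()
        # An ambiguous form yields every base form at the same position.
        for lemma in LEMMAS.get(token, [token]):
            yield position, lemma


def build_index(docs):
    """Map each resulting token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _, lemma in toy_analyze(text):
            index[lemma].add(doc_id)
    return index


docs = {1: "Daj psu kość", 2: "Goli brodę", 3: "Pies śpi"}
index = build_index(docs)

print(sorted(index["pies"]))  # [1, 3]
print(sorted(index["gol"]))   # [2]
print(sorted(index["goły"]))  # [2]
```

Document 2 is reachable through *gol*, *golić* and *goły* alike, which is exactly the pragmatic trade-off described above: some false matches in exchange for never missing the intended base form.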