# Lemmatization and full text search (FTS)
This task uses a full-text search engine (Elasticsearch) to perform basic search operations on a text corpus.

In [1]:
!pip install elasticsearch==7.10.1



In [3]:
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import NotFoundError
import os
import numpy as np
from tqdm import tqdm
import sys
from time import sleep

Elasticsearch 7.10.1 was downloaded and run locally. The elasticsearch-analysis-morfologik plugin was installed using the elasticsearch-plugin script.

## Define an ES index for storing the contents of the legislative acts

In [4]:
es = Elasticsearch()

In [4]:
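# Drop any previous version of the index; ignore=[404] keeps the first run from failing.
es.indices.delete(index='my_index', ignore=[404])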

{'acknowledged': True}

In [5]:
es.indices.create(
    index="my_index",
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    # An analyzer named "default" is used for every text field.
                    "default": {
                        "type": "custom",
                        "tokenizer": "standard",
                        # Token filters are applied in the order listed.
                        "filter": [
                            "nlp_synonyms",
                            "morfologik_stem",
                            "lowercase",
                        ]
                    }
                },
                "filter": {
                    # Abbreviations of the legal codes and their expansions.
                    "nlp_synonyms": {
                        "type": "synonym",
                        "synonyms": [
                            "kpk,kodeks postępowania karnego",
                            "kpc,kodeks postępowania cywilnego",
                            "kk,kodeks karny",
                            "kc,kodeks cywilny",
                        ]
                    }
                }
            }
        },
    }
)

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'my_index'}

The analyzer is named "default" so that it is applied automatically and does not have to be specified in each query sent to ES.
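One way to check what the default analyzer does to a piece of text is the `_analyze` API. The helper below is only a sketch and not part of the original notebook: `analyzed_tokens` is a hypothetical name, and it assumes a client object that exposes `indices.analyze`, as the elasticsearch-py 7.x client does.

```python
def analyzed_tokens(es, index, text):
    """Return the tokens the index's default analyzer produces for `text`.

    `es` is assumed to be an elasticsearch-py 7.x client (hypothetical helper).
    """
    response = es.indices.analyze(index=index, body={"text": text})
    # Each entry in "tokens" describes one emitted token; keep only its text.
    return [token["token"] for token in response["tokens"]]
```

Calling `analyzed_tokens(es, "my_index", "Kodeks postępowania cywilnego")` against the running instance would show the tokens that are actually stored for that phrase.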

## Load the data to the ES index

In [6]:
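ustawy_path = '../ustawy'
filenames = os.listdir(ustawy_path)
with tqdm(total=len(filenames), file=sys.stdout) as pbar:
    for filename in filenames:
        with open(os.path.join(ustawy_path, filename), 'r', encoding='utf8') as f:
            content = f.read()
        # Use the filename as the document id so an act can be looked up later.
        es.create(index="my_index", id=filename, body={"content": content})
        pbar.update(1)
# Make the newly indexed documents visible to search.
es.indices.refresh(index="my_index")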

100%|██████████████████████████████████████████████████████████████████████████████| 1179/1179 [00:19<00:00, 60.58it/s]
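For larger corpora, sending one request per file can become slow. A possible alternative, sketched here under assumptions, is to generate bulk actions and pass them to `elasticsearch.helpers.bulk(es, actions)`. The generator name `iter_actions` is hypothetical, and the default bulk operation indexes (overwrites) documents rather than failing on an existing id as `es.create` does.

```python
import os


def iter_actions(directory, index):
    """Yield one bulk-style action per file in `directory` (hypothetical helper)."""
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        with open(path, "r", encoding="utf8") as f:
            # The filename doubles as the document id, as in the loop above.
            yield {
                "_index": index,
                "_id": filename,
                "_source": {"content": f.read()},
            }
```

The generator reads files lazily, so the whole corpus never has to be held in memory at once.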


## Determine the number of legislative acts containing the word ustawa (in any form).

In [5]:
es.search(
    index="my_index",
    body={
        "query": {
            "match": {
                "content": {
                    "query": "ustawa"
                }
            }
        }
    }
)["hits"]["total"]["value"]

1178

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html - the match query returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching.

The total number of matching documents, i.e. those containing "ustawa" in any form, is read from the response.
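By default, `hits.total.value` is only tracked accurately up to 10,000 hits, which is not a problem for a corpus of 1179 documents. For larger indices, the count API is one alternative. The sketch below assumes an elasticsearch-py 7.x client; `count_documents` is a hypothetical helper name.

```python
def count_documents(es, index, field, text):
    """Count documents whose `field` matches `text` (hypothetical helper).

    `es` is assumed to be an elasticsearch-py 7.x client.
    """
    body = {"query": {"match": {field: {"query": text}}}}
    # The count API returns a dict whose "count" key holds the exact total.
    return es.count(index=index, body=body)["count"]
```

With the index above, `count_documents(es, "my_index", "content", "ustawa")` would be the equivalent call.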

The next task was skipped because it cannot be carried out: "ustaw" is itself a form of "ustawa".

## Determine the number of occurrences of the word ustawa by searching for this particular form, including the other inflectional forms.

In [6]:
es.termvectors(
    index="my_index",
    id="2001_1382.txt",
    body={
        "fields": ["content"],
        "term_statistics": True,
        "field_statistics": True
    }
)["term_vectors"]["content"]["terms"]["ustawa"]["ttf"]

24934

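https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html#docs-termvectors-artificial-doc - with term_statistics enabled, each term carries ttf (total term frequency), i.e. how often the term occurs in all documents. The statistics are collected for the shard that holds the requested document, so the id only needs to point to some document containing the term; 2001_1382.txt is used here.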
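The difference between the number of documents containing a term and the total number of its occurrences can be illustrated on a toy corpus. This is a pure-Python sketch that does not call Elasticsearch; `term_statistics` is a hypothetical helper and the three token lists are made-up, already lemmatized examples.

```python
from collections import Counter


def term_statistics(documents, term):
    """Compute document frequency and total term frequency for `term`.

    `documents` is a list of token lists (assumed to be lemmatized already).
    """
    counts = [Counter(tokens)[term] for tokens in documents]
    return {
        "doc_freq": sum(1 for c in counts if c > 0),  # documents containing the term
        "ttf": sum(counts),                           # occurrences across all documents
    }


docs = [
    ["ustawa", "wchodzić", "w", "życie"],
    ["ustawa", "o", "zmiana", "ustawa"],
    ["kodeks", "karny"],
]
stats = term_statistics(docs, "ustawa")
print(stats)  # {'doc_freq': 2, 'ttf': 3}
```

The match query in the earlier task corresponds to `doc_freq`, while the term vectors request reads `ttf`.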

## Determine the number of legislative acts containing the words kodeks postępowania cywilnego in the specified order, but in any inflection form.

In [7]:
es.search(
    index="my_index",
    body={
        "query": {
            "match_phrase": {
                "content": {
                    "query": "kodeks postępowania cywilnego"
                }
            }
        }
    }
)["hits"]["total"]["value"]

99

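https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html - match_phrase analyzes the query text and builds a phrase query from the resulting terms, so they must appear in the given order.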

## Determine the number of legislative acts containing the words wchodzi w życie (in any form) allowing for up to 2 additional words in the searched phrase.

In [8]:
es.search(
    index="my_index",
    body={
        "query": {
            "match_phrase": {
                "content": {
                    "query": "wchodzi w życie",
                    "slop": 2
                }
            }
        },
    }
)["hits"]["total"]["value"]

1174

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html - slop is set to 2; it is the maximum number of positions allowed between matching tokens.
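The effect of slop can be sketched with a toy matcher over a list of tokens. This is a simplified illustration, not Elasticsearch's implementation: the hypothetical `phrase_matches` only handles terms that appear in the original order, whereas Elasticsearch also accepts reordered terms at a higher slop cost.

```python
def phrase_matches(tokens, phrase, slop=0):
    """Return True if the `phrase` terms occur in order in `tokens`
    with at most `slop` extra tokens between them in total (toy model)."""
    if not phrase:
        return False
    # Positions at which each phrase term occurs in the document.
    positions = [[i for i, t in enumerate(tokens) if t == term] for term in phrase]
    for start in positions[0]:
        prev = start
        for candidates in positions[1:]:
            # Pick the earliest occurrence after the previous term; this keeps
            # the match as tight as possible for the chosen start position.
            nxt = next((p for p in candidates if p > prev), None)
            if nxt is None:
                break
            prev = nxt
        else:
            extra = (prev - start) - (len(phrase) - 1)
            if extra <= slop:
                return True
    return False


tokens = ["ustawa", "wchodzić", "z", "dzień", "w", "życie"]
phrase = ["wchodzić", "w", "życie"]
print(phrase_matches(tokens, phrase, slop=2))  # True
print(phrase_matches(tokens, phrase, slop=1))  # False
```

Here two extra tokens separate "wchodzić" from "w", so the phrase matches with a slop of 2 but not with a slop of 1.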

## Determine the 10 documents that are the most relevant for the phrase konstytucja.

In [9]:
results = es.search(
    index="my_index",
    body={
        "query": {
            "match": {
                "content": {
                    "query": "konstytucja",
                }
            }
        },
        "highlight": {
            "fields": {
                "content": {}
            },
            "boundary_scanner": "sentence",
            "number_of_fragments": 3,
            "order": "score"
        }
    }
)["hits"]
top = [
    [result['_score'], result['_id'], result['highlight']['content']]
    for result in sorted(results['hits'], key=lambda x: -x['_score'])
][:10]
[doc_id for _, doc_id, _ in top]

['1997_629.txt',
 '2000_443.txt',
 '1997_604.txt',
 '1996_350.txt',
 '1997_642.txt',
 '2001_23.txt',
 '1996_199.txt',
 '1999_688.txt',
 '1997_681.txt',
 '2001_1082.txt']

https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html - number_of_fragments is set to 3, so at most three excerpts are returned per document.
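A search returns 10 hits by default, sorted by descending score, so the explicit sort mainly documents the intent. The step of turning the response into (score, id, excerpts) triples can be sketched as a small function. `top_hits` is a hypothetical helper; unlike the cell above, it tolerates hits that carry no highlight section.

```python
def top_hits(hits, limit=10):
    """Return (score, id, fragments) for the best-scoring hits.

    `hits` is assumed to be the response["hits"] dict of a search request.
    """
    ranked = sorted(hits["hits"], key=lambda hit: hit["_score"], reverse=True)
    return [
        # Fall back to an empty list when a hit has no highlighted fragments.
        (hit["_score"], hit["_id"], hit.get("highlight", {}).get("content", []))
        for hit in ranked[:limit]
    ]
```

With the response above, `top_hits(results)` would yield the same documents as `top`, as tuples instead of lists.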

## Print the excerpts containing the word konstytucja (up to three excerpts per document) from the previous task.

In [10]:
for _, _, fragments in top:
    for fragment in fragments:
        print(fragment)
        print("-----------------------------------------------------------------------------------------------")

W ustawie  konstytucyjnej z  dnia 23 kwietnia 1992 r. o trybie przygotowania i 
uchwalenia <em>Konstytucji</em>
-----------------------------------------------------------------------------------------------
o zmianie ustawy konstytucyjnej o trybie przygotowania
           i uchwalenia <em>Konstytucji</em> Rzeczypospolitej
-----------------------------------------------------------------------------------------------
Do zgłoszenia projektu <em>Konstytucji</em> załącza się wykaz 
                obywateli popierających zgłoszenie
-----------------------------------------------------------------------------------------------
umowy międzynarodowej i nie wypełnia przesłanek określonych w art. 89
     ust. 1 lub art. 90 <em>Konstytucji</em>
-----------------------------------------------------------------------------------------------
co do zasadności wyboru
  trybu ratyfikacji umowy międzynarodowej, o którym mowa w art. 89 ust. 2
  <em>Konstytucji</em>
-------------------------------------

Author: Mikołaj Sikora (group 12:50)