# Full-Text Search with Elasticsearch

## 1. and 2. Install Elasticsearch and Morfologik

On Windows, download Elasticsearch 8.15.2 and unzip it into this directory.

In `elasticsearch-8.15.2/config/elasticsearch.yml`, disable the security features by changing the security settings to:

```
# Enable security features
xpack.security.enabled: false

xpack.security.enrollment.enabled: false
```

Install the Morfologik plugin:

```
elasticsearch-8.15.2\bin\elasticsearch-plugin install pl.allegro.tech.elasticsearch.plugin:elasticsearch-analysis-morfologik:8.15.2
```


Then start Elasticsearch:

```
elasticsearch-8.15.2\bin\elasticsearch.bat
```

---

### Import needed libraries

The `elasticsearch` Python package was not yet available in version 8.15.2, so instead of downgrading Elasticsearch to 8.15.1 I decided to make my life harder and call the REST API directly with `requests`.

In [1]:
import numpy as np
from datasets import load_dataset
import json
import requests

---

### Create urls and headers for requests

In [2]:
headers = {"Content-Type": "application/json"}
elastic_url = "http://localhost:9200/pol"
bulk_url = "http://localhost:9200/pol/_bulk"
search_url = "http://localhost:9200/pol/_search?pretty"

### Clean up the Elasticsearch indices

In [3]:
!curl -X DELETE "localhost:9200/pol"
!curl -X DELETE "localhost:9200/pol_without_lemmatizer"

{"acknowledged":true}




{"acknowledged":true}




---

## 3. and 4. Create analyzers with and without synonyms

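First, create a list of month-name synonyms in the comma-separated format that Elasticsearch accepts. Each entry groups a month name with its abbreviation and its Roman numeral.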

In [4]:
synonym_list = [
    "kwiecień, kwi, IV",
    "styczeń, sty, I",
    "luty, lut, II",
    "marzec, mar, III",
    "maj, V",
    "czerwiec, cze, VI",
    "lipiec, lip, VII",
    "sierpień, sie, VIII",
    "wrzesień, wrz, IX",
    "październik, paź, X",
    "listopad, lis, XI",
    "grudzień, gru, XII",
]

Then build the analyzers, together with the index settings and mappings

In [5]:
analyzer_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "polish_synonym": {
                    "type": "synonym",
                    "synonyms": synonym_list,
                }
            },
            "analyzer": {
                "polish_with_synonyms": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "polish_synonym",
                        "morfologik_stem",
                        "lowercase",
                    ],
                },
                "polish": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "morfologik_stem", "lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "text_synonyms": {
                "type": "text",
                "analyzer": "polish_with_synonyms",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "text": {
                "type": "text",
                "analyzer": "polish",
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    },
}

analyzer_settings = json.dumps(analyzer_settings)

# send the analyzer settings and mappings to elasticsearch
response = requests.put(
    elastic_url,
    headers=headers,
    data=analyzer_settings,
)

response.json()

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'pol'}
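
One way to sanity-check both analyzers before indexing is Elasticsearch's `_analyze` endpoint, which returns the tokens an analyzer produces for a given text. The sketch below only builds the request body and extracts the tokens from a response. The helper names `build_analyze_body` and `extract_tokens` are hypothetical, and the sample response is hand-written for illustration; it is not output recorded from this index.

```python
import json


def build_analyze_body(analyzer, text):
    """Build the JSON body for POST <index>/_analyze."""
    return json.dumps({"analyzer": analyzer, "text": text})


def extract_tokens(analyze_response):
    """Return the token strings from an _analyze response dict."""
    return [t["token"] for t in analyze_response.get("tokens", [])]


# hand-written response with the shape _analyze returns (illustrative only)
sample_response = {
    "tokens": [
        {"token": "kwiecień", "position": 0},
        {"token": "kwi", "position": 0},
    ]
}

body = build_analyze_body("polish_with_synonyms", "kwi 2020")
print(body)
print(extract_tokens(sample_response))
```

Against a running server, the body would be sent with `requests.post(elastic_url + "/_analyze", headers=headers, data=body)` and the parsed JSON passed to `extract_tokens`.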

---

## 5. and 6. Create and load the fiqa-pl corpus index into elasticsearch

Fetch the fiqa-pl corpus and build the bulk request body. Then post it to Elasticsearch as bulk data

In [6]:
fiqa_corpus = load_dataset("clarin-knext/fiqa-pl", "corpus")["corpus"]

# build the bulk request body: one action line followed by one document line
data = []
for _id, text in zip(fiqa_corpus["_id"], fiqa_corpus["text"]):
    id_head = json.dumps(
        {"index": {"_index": "pol", "_id": str(_id)}}, ensure_ascii=False
    )
    # each field has its own analyzer (with and without synonyms),
    # so the text has to be stored twice; this is not optimal
    content = json.dumps({"text_synonyms": text, "text": text}, ensure_ascii=False)
    data.append(id_head)
    data.append(content)

# join the bulk data; the bulk API requires a trailing newline
bulk_data = "\n".join(data) + "\n"

# encode explicitly so that the Polish characters are sent as UTF-8
response = requests.post(bulk_url, headers=headers, data=bulk_data.encode("utf-8"))

print(response.status_code)
print(response.json()["errors"])

200
False
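
The cell above sends the whole corpus in a single request. For a larger corpus that body may exceed the request size the server accepts, so a possible variation is to split it into batches. This is a minimal sketch, not what the notebook ran: `bulk_batches`, the batch size, and the toy data are assumptions made for illustration.

```python
import json


def bulk_batches(ids, texts, index, batch_size):
    """Yield NDJSON bulk bodies containing at most batch_size documents each."""
    lines = []
    n_docs = 0
    for _id, text in zip(ids, texts):
        action = {"index": {"_index": index, "_id": str(_id)}}
        source = {"text_synonyms": text, "text": text}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
        n_docs += 1
        if n_docs == batch_size:
            # the bulk API requires a trailing newline
            yield "\n".join(lines) + "\n"
            lines, n_docs = [], 0
    if lines:
        # last, possibly smaller, batch
        yield "\n".join(lines) + "\n"


# toy data standing in for the fiqa-pl corpus
toy_ids = [1, 2, 3]
toy_texts = ["pierwszy", "drugi", "trzeci"]
batches = list(bulk_batches(toy_ids, toy_texts, "pol", batch_size=2))
print(len(batches))
```

Each batch would then be posted separately, for example with `requests.post(bulk_url, headers=headers, data=batch.encode("utf-8"))`, checking `response.json()["errors"]` after every request.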


---

## 7. Count files and occurrences

Create a function that returns the number of documents containing a given word and the total number of its occurrences in a given index field (`"text"` or `"text_synonyms"`)

In [7]:
def retrieve_counts(field, word):
    # get number of documents that contain the searched word
    query_dict = {"query": {"match": {field: word}}}
    query = json.dumps(query_dict)
    response = requests.get(search_url, headers=headers, data=query)
    n_documents = response.json()["hits"]["total"]["value"]

    # retrieve the document ids
    query_dict = {
        "size": n_documents,
        "query": {"match": {field: word}},
        "_source": False,
    }
    query = json.dumps(query_dict)
    response = requests.get(search_url, headers=headers, data=query)
    doc_ids = [idx["_id"] for idx in response.json()["hits"]["hits"]]

    # count occurrences in all documents using termvectors
    word_count = 0
    for idx in doc_ids:
        termvectors_url = f"http://localhost:9200/pol/_termvectors/{idx}?pretty"

        query_dict = {
            "fields": [field],
            "term_statistics": False,
            "field_statistics": False,
            "positions": False,
            "offsets": False,
        }

        query = json.dumps(query_dict)
        response = requests.get(termvectors_url, headers=headers, data=query)
        data = response.json()["term_vectors"][field]["terms"][word]
        word_count += data["term_freq"]

    return len(doc_ids), word_count
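
The lookup `["terms"][word]` in `retrieve_counts` raises a `KeyError` when a matched document does not contain exactly that term, which can happen if the analyzer maps the query word to a different lemma. A more defensive lookup could look like the sketch below. The helper `term_freq` is hypothetical and the sample response is hand-written; only its nesting mirrors the fields read by `retrieve_counts`.

```python
def term_freq(termvectors_response, field, word):
    """Return term_freq of word in field, or 0 if the term is absent."""
    terms = (
        termvectors_response.get("term_vectors", {})
        .get(field, {})
        .get("terms", {})
    )
    return terms.get(word, {}).get("term_freq", 0)


# hand-written response in the shape read by retrieve_counts (illustrative only)
sample = {
    "_id": "42",
    "term_vectors": {
        "text": {"terms": {"kwiecień": {"term_freq": 2}, "rok": {"term_freq": 1}}}
    },
}

print(term_freq(sample, "text", "kwiecień"))
print(term_freq(sample, "text", "maj"))
```

Returning 0 for a missing term keeps the loop running, at the cost of silently undercounting when the wrong term is looked up.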

---

### Get number of documents and term occurrences for kwiecień without synonyms

In [8]:
count_docs, count_words = retrieve_counts("text", "kwiecień")

print(f"number of documents without synonyms   : {count_docs}")
print(f"number of occurrences without synonyms : {count_words}")

number of documents without synonyms   : 257
number of occurrences without synonyms : 353


### Get number of documents and term occurrences for kwiecień with synonyms

In [9]:
count_docs, count_words = retrieve_counts("text_synonyms", "kwiecień")

print(f"number of documents with synonyms   : {count_docs}")
print(f"number of occurrences with synonyms : {count_words}")

number of documents with synonyms   : 306
number of occurrences with synonyms : 439


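The field analyzed with month synonyms matches more documents (306 vs. 257) and more occurrences (439 vs. 353), because abbreviations and Roman numerals are now also counted as "kwiecień". Note that these are match counts, not relevance scores: some of the additional matches may be false positives, for example "IV" used in a meaning other than April.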

---

## 8. Download fiqa-pl-qrels and fiqa-pl queries

In [10]:
fiqa_queries = load_dataset("clarin-knext/fiqa-pl", "queries")["queries"]
fiqa_qa = load_dataset("clarin-knext/fiqa-pl-qrels")["test"]

In [11]:
# create map query id -> query text
query_map = {int(idx): q for idx, q in zip(fiqa_queries["_id"], fiqa_queries["text"])}

# create map query id -> relevant corpus ids
query_corpus_map = {}
for query_id, corpus_id in zip(fiqa_qa["query-id"], fiqa_qa["corpus-id"]):
    if query_id not in query_corpus_map:
        query_corpus_map[query_id] = []
    query_corpus_map[query_id].append(corpus_id)
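
The grouping step can also be written with `collections.defaultdict`. The sketch below is an alternative, not the code used in this notebook: it stores the relevant ids in sets, which makes the later membership test cheaper, and it uses made-up ids in place of `fiqa_qa["query-id"]` and `fiqa_qa["corpus-id"]`.

```python
from collections import defaultdict

# toy qrels standing in for fiqa_qa["query-id"] and fiqa_qa["corpus-id"]
toy_query_ids = [8, 8, 15, 8]
toy_corpus_ids = [101, 102, 201, 103]

# map query id -> set of relevant corpus ids
relevant = defaultdict(set)
for query_id, corpus_id in zip(toy_query_ids, toy_corpus_ids):
    relevant[query_id].add(corpus_id)

print(dict(relevant))
```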

---

## 9. Compute NDCG@5 for all queries

In [12]:
def ndcg_at_k(field, k, search_url, printing=False):
    # create logarithm vector to avoid unnecessary computations later
    logs = np.log2(np.arange(2, 2 + k))

    # iterate over all queries to compute ndcg for each of them
    ndcg_list = []
    for query_id, corpus_id_list in query_corpus_map.items():
        query_text = query_map[query_id]

        query_dict = {"query": {"match": {field: query_text}}, "size": k}

        query_request = json.dumps(query_dict)
        response = requests.post(search_url, headers=headers, data=query_request)
        data = response.json()
        hits = [int(h["_id"]) for h in data["hits"]["hits"]]

        # sometimes the list of correct matches is shorter than k
        # in those cases we pad with 0s
        idcg = [1 if i < len(corpus_id_list) else 0 for i in range(k)]
        dcg = [1 if h in corpus_id_list else 0 for h in hits]
        # pad with 0s as well when fewer than k hits were returned
        dcg += [0] * (k - len(dcg))

        # this part of the code is for exercises 10 and 11 only - analysis of example results
        # "0": no relevant hit, "45": relevant hits only at ranks 4 and 5, "1": relevant hit at rank 1
        if printing:
            show = (
                (printing == "0" and sum(dcg) == 0)
                or (printing == "45" and sum(dcg) == 2 and dcg[3] and dcg[4])
                or (printing == "1" and dcg[0])
            )
            if show:
                print(query_text, end="\n\n")
                for d in data["hits"]["hits"]:
                    print(d["_source"]["text_synonyms"])
                    print()
                return

        idcg = np.array(idcg) / logs
        dcg = np.array(dcg) / logs

        ndcg_list.append(dcg.sum() / idcg.sum())

    return np.mean(ndcg_list)
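
The function above uses binary relevance: a retrieved document contributes 1 / log2(rank + 1) if it appears in the qrels for the query and 0 otherwise, and the ideal ranking places all relevant documents (at most k) at the top. A standalone sketch of the same computation for a single query, with made-up ids and a hypothetical helper name `ndcg_binary`:

```python
import math


def ndcg_binary(hits, relevant, k):
    """NDCG@k with binary relevance: a hit counts 1 if it is in relevant."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(hits[:k], start=1)
        if doc_id in relevant
    )
    # ideal ranking: all relevant documents (at most k) at the top
    n_ideal = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_ideal + 1))
    return dcg / idcg if idcg > 0 else 0.0


# toy example: one of two relevant documents is retrieved, at rank 3
score = ndcg_binary(hits=[10, 20, 30, 40, 50], relevant=[30, 99], k=5)
print(round(score, 4))
```

Here the DCG is 1 / log2(4) = 0.5 and the ideal DCG is 1 + 1 / log2(3), so a single relevant hit at rank 3 already costs most of the score.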

---

### NDCG@5 without synonyms

In [13]:
ndcg_at_k("text", 5, search_url)

0.1851291130797741

### NDCG@5 with synonyms

In [14]:
ndcg_at_k("text_synonyms", 5, search_url)

0.1858026377473443

There is a very small improvement in the search results (about 0.1858 vs. 0.1851). This is not surprising, as not all queries rely on month names, so most of the results are the same for both fields.

---

### Modify analyzer to remove morfologik lemmatizer

This requires a new index whose analyzers omit the `morfologik_stem` filter

In [15]:
new_analyzer_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "polish_synonym": {
                    "type": "synonym",
                    "synonyms": synonym_list,
                }
            },
            "analyzer": {
                "polish_with_synonyms": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "polish_synonym",
                        "lowercase",
                    ],
                },
                "polish": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "lowercase"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "text_synonyms": {
                "type": "text",
                "analyzer": "polish_with_synonyms",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "text": {
                "type": "text",
                "analyzer": "polish",
                "fields": {"keyword": {"type": "keyword"}},
            },
        }
    },
}

new_analyzer_settings = json.dumps(new_analyzer_settings)

response = requests.put(
    elastic_url + "_without_lemmatizer",
    headers=headers,
    data=new_analyzer_settings,
)

response.json()

{'acknowledged': True,
 'shards_acknowledged': True,
 'index': 'pol_without_lemmatizer'}

To use the new analyzers without the Polish lemmatizer, we have to reindex the data

In [16]:
reindex = {"source": {"index": "pol"}, "dest": {"index": "pol_without_lemmatizer"}}

response = requests.post(
    "http://localhost:9200/_reindex",
    headers=headers,
    data=json.dumps(reindex),
)

response.json()

{'took': 6719,
 'timed_out': False,
 'total': 57638,
 'updated': 0,
 'created': 57638,
 'deleted': 0,
 'batches': 58,
 'version_conflicts': 0,
 'noops': 0,
 'retries': {'bulk': 0, 'search': 0},
 'throttled_millis': 0,
 'requests_per_second': -1.0,
 'throttled_until_millis': 0,
 'failures': []}
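
A quick way to confirm that the reindex copied everything is to check the response fields rather than read them by eye. A minimal sketch, using a hypothetical helper `reindex_succeeded` and the values from the response shown above:

```python
def reindex_succeeded(result):
    """True if nothing failed or timed out and every source document was written."""
    written = result["created"] + result["updated"]
    return (
        not result["failures"]
        and not result["timed_out"]
        and result["version_conflicts"] == 0
        and written == result["total"]
    )


# the fields below are copied from the _reindex response shown above
reindex_result = {
    "timed_out": False,
    "total": 57638,
    "updated": 0,
    "created": 57638,
    "version_conflicts": 0,
    "failures": [],
}

print(reindex_succeeded(reindex_result))
```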

---

### NDCG@5 without synonyms and lemmatizer

In [17]:
new_search_url = "http://localhost:9200/pol_without_lemmatizer/_search?pretty"

In [18]:
ndcg_at_k("text", 5, new_search_url)

0.12444419614599363

### NDCG@5 with synonyms and without lemmatizer

In [19]:
ndcg_at_k("text_synonyms", 5, new_search_url)

0.1331294076376364

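The results are significantly worse without the lemmatizer: NDCG@5 drops from about 0.185 to 0.124 without synonyms and from about 0.186 to 0.133 with synonyms. This was expected. Polish is a highly inflected language, so without lemmatization an inflected form in a document matches a query only if the query uses exactly the same form.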
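
To put the four runs side by side, the relative change caused by removing the lemmatizer can be computed directly from the NDCG@5 values reported in the cells above. The dictionary layout and the helper `relative_change` are illustrative choices only.

```python
# NDCG@5 values reported in the cells above
ndcg = {
    ("lemmatizer", "text"): 0.1851291130797741,
    ("lemmatizer", "text_synonyms"): 0.1858026377473443,
    ("no lemmatizer", "text"): 0.12444419614599363,
    ("no lemmatizer", "text_synonyms"): 0.1331294076376364,
}


def relative_change(new, old):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


for field in ("text", "text_synonyms"):
    change = relative_change(
        ndcg[("no lemmatizer", field)], ndcg[("lemmatizer", field)]
    )
    print(f"{field}: {change:+.1%} after removing the lemmatizer")
```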

---

### Question 1. What are the strengths and weaknesses of regular expressions versus full text search regarding processing of text?

Regular expressions are very precise. They do not treat the text as language, so they can be used to find patterns in all sorts of character sequences, and they do not require building large additional data structures such as a full-text search (FTS) index. On the other hand, they know nothing about inflection or synonyms, and the text has to be scanned on every search. FTS simplifies and automates many of these things: matching all inflected forms of the month names, for example, requires a carefully hand-written regex, while an FTS analyzer with a lemmatizer handles it for every word.
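
To illustrate the effort involved, here is a sketch of a regex covering the case forms of a single month name. The list of endings is written by hand and is an assumption; it is not guaranteed to be exhaustive, and the same work would have to be repeated for every other word of interest.

```python
import re

# hand-written pattern for singular and plural case forms of "kwiecień";
# the list of endings is an assumption and may not be exhaustive
KWIECIEN = re.compile(
    r"\bkwie(?:cień|tni(?:a|owi|u|em|e|ów|om|ami|ach)?)\b",
    re.IGNORECASE,
)

examples = ["w kwietniu 2020", "Kwiecień był ciepły", "do końca kwietnia", "kwiaty"]
for text in examples:
    match = KWIECIEN.search(text)
    print(text, "->", match.group(0) if match else None)
```

Unlike the synonym filter, this pattern also does nothing for abbreviations such as "kwi" or the Roman numeral "IV", which would need further alternatives.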

### Question 2. Can an LLM be applied in the context of searching for documents? Justify your answer, excluding the obvious observation that an LLM can be used to formulate the answer.

An LLM can be used to create a brief summary of each document. Searching through such summaries might yield worse results, because they do not contain all of the information, but searching a list of short summaries is much faster than searching the full texts.

Additionally, a language model can be used to assess how well the meaning of the query matches the meaning of a summary.

---


### Analyze example results (additional tasks 10 and 11)

When no relevant results are found in the top 5

In [20]:
ndcg_at_k("text_synonyms", 5, search_url, "0")

1 EIN prowadzący działalność pod wieloma nazwami firm

„Tak, możesz utworzyć konto firmowe PayPal bez sformalizowania firmy za pomocą dokumentów rządowych itp. Mówiąc najprościej, „prowadzenie firmy” oznacza po prostu „prowadzenie działalności jako” (D/B/A) nazwa handlowa.Możesz użyć adresu prywatnej skrzynki pocztowej, takiej jak te podane w sklepie UPS.Poproś dziecko ze stoiskiem z lemoniadą lub pudełkiem ciasteczek Girl Scout - wystarczy, że załatwisz formalności rządowe, takie jak rejestracja LLC lub uzyskanie podatku EIN po przekroczeniu pewnych progów działalności, a płacenie za rzeczy generalnie nie jest tym. Ponadto niektóre firmy, w przypadku niektórych relacji, będą wymagały formalności biznesowych, takich jak EIN, co z kolei będzie wymagało stworzenia nazwy handlowej i rejestrując go w państwie. Na przykład, jeśli założysz tradycyjne konto sprzedawcy kart kredytowych, prawdopodobnie będą tego chcieli”.

Nie należy zakładać firmy w USA tylko po to, aby uzyskać numer identyfik

The results shown above all relate to the topic of the query, "1 EIN prowadzący działalność pod wieloma nazwami firm" (roughly: "1 EIN doing business under multiple company names").

The query is not very specific, so documents on many topics match it. Because it is so vague, some of the retrieved documents may even be more closely related to it than the documents marked as relevant.

---

When relevant results are found only at the 4th and 5th positions

In [21]:
ndcg_at_k("text_synonyms", 5, search_url, "45")

Plusy i minusy Pożyczki tylko na odsetki

Stopa limitu obejmuje wszelkie odsetki od kredytu hipotecznego, a nie spłaty kredytu hipotecznego. Stopa ograniczenia reprezentuje dochód netto, który jest czynszem brutto minus wszystkie koszty, w tym odsetki od pożyczki. Spłaty kredytu hipotecznego stanowią część Twoich obliczeń przepływów pieniężnych, a nie Twoich obliczeń zwrotu. ROI to kalkulacja, która oblicza Twój dochód netto w porównaniu z początkową inwestycją, czyli Twoją zaliczką plus koszty, a nie wartością nieruchomości.

HELOCs zazwyczaj mają 10 lat losowania i 5 lat zwrotu. W czasie losowania możesz płacić odsetki tylko jeśli chcesz. Stawka może wynosić od Prime minus 1,5 do Prime plus (dość trochę). Oczywiście zawsze możesz rozejrzeć się za lepszą ofertą niż obecnie, o ile masz kapitał w swoim domu.

Kluczem do zrozumienia kredytu hipotecznego jest przyjrzenie się harmonogramowi amortyzacji. Włóż 100k, 4,5% odsetek, 30 lat, 360 miesięcznych płatności i spójrz na wyniki. Powinie

All of the above texts touch on a topic mentioned in the query. The query is moderately precise, but it does not contain enough distinctive keywords to move the most relevant results higher in the list.

---

When a relevant result appears in the first position

In [22]:
ndcg_at_k("text_synonyms", 5, search_url, "1")

Jak rozliczyć zarobione i wydane pieniądze przed założeniem firmowych kont bankowych?

Środki zarobione i wydane przed otwarciem dedykowanego konta firmowego należy sklasyfikować według ich pochodzenia. Na przykład, jeśli Twoja firma uzyskała dochód, gdzie trafiły te pieniądze? Jeśli weźmiesz pieniądze osobiście, zostanie to uznane za „dystrybucję” lub „pożyczkę”. To od Ciebie zależy, którą z dwóch opcji wybierzesz. Z drugiej strony, jeśli Twoja firma miała wydatek, który zapłaciłeś osobiście, zostanie to uznane za „wkład kapitałowy” lub „pożyczkę” od Ciebie. Jeśli zdecydujesz się rejestrować te transakcje jako pożyczki, możesz je skompensować razem, więc nie potrzebujesz dwóch oddzielnych kont, pożyczki dla ciebie i pożyczki od ciebie. Kiedy otwierano konto bankowe, skąd pochodziła wpłata początkowa? Jeśli pochodzi z Twoich osobistych środków, jest to „wkład kapitałowy” lub „pożyczka” od Ciebie. Z dźwięku twojego pytania, zdeponowałeś to, co pozostało po poprzednich przychodach/wydatk

The query above is clear and precise, so the first retrieved result already matches it well.

These examples show how the precision of the query wording affects the results of a full-text search.