### Comparing multilingual-e5-large across different document-parsing methods
1) First step (for us): compare semantic search on data parsed in two different ways. We need to work out how best to structure the search. Should a matching text contain at least one word of the term?
2) Second step (for the terminologists): compare plain text search with semantic search.
3) We could also check whether combining semantic search with the API query response gives better results.
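
One way to explore the question raised in step 1 is to post-filter the semantic hits so that only passages containing at least one word of the term survive. The following is a minimal sketch, assuming the hits are plain strings; `contains_term_word` and `filter_hits` are hypothetical helpers and not part of the notebook's pipeline. A whole-word match like this misses inflected forms such as plurals, so it is only a starting point.

```python
import re


def contains_term_word(text: str, term: str) -> bool:
    """True if `text` contains at least one whole word of `term` (case-insensitive)."""
    text_words = set(re.findall(r"\w+", text.lower()))
    term_words = set(re.findall(r"\w+", term.lower()))
    return bool(text_words & term_words)


def filter_hits(hits: list[str], term: str) -> list[str]:
    """Keep only the hits that share at least one word with the term."""
    return [hit for hit in hits if contains_term_word(hit, term)]


# Made-up example passages (not taken from the real collection).
hits = [
    "The main battle tank is the primary armoured vehicle.",
    "Armoured vehicles support infantry.",
    "Tanks were deployed in the area.",
]
kept = filter_hits(hits, "battle tank")
```

In this toy data the third passage is dropped because "Tanks" is not the same token as "tank", which illustrates why a strict word filter may be too narrow for terminology work.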


#### Terms selected by the terminologists:
* *area reconnaissance* – discussed in our group and already in Militerm; the search should find several definitions and usage examples;
* *capability* – a widely used word that is also defined in the military domain; could be tested together with the terms *capability planning*, *capability development* and *capability management*;
* *battle tank* and/or *main battle tank* – frequently used, with slightly differing approaches to be found; looks like a good term for practising with AI because there is plenty of material;
* *operational environment*
* *area of operations*
* *air operation* – a very simple and ordinary term, so to speak; will the AI find the information relevant to our context?
* *air defence* – a simple term that hides more complex content; the AI should find several parallel concepts;
* *rotary-wing aircraft* – the term has a number of synonyms, and quite a lot of information about helicopters should turn up in different places;
* *urban warfare* – a great many synonyms, widely used;
* *forward defence*, *forward defence posture* – we are still pondering these and looking for information ourselves, but would like to see what the AI can do;
* *psychological operation*, *psychological warfare* – in case we want, at some point, to test or teach how the AI recognises the same context or different versions of a term.

In [None]:
# Imports
from collections import defaultdict

import pandas as pd
from qdrant_client import QdrantClient
from qdrant_client.http.models import (FieldCondition, Filter, MatchText,
                                       MatchValue)
from sentence_transformers import SentenceTransformer

In [None]:
client_host = 'ekivm'
client_port = 6333

# SentenceTransformers model
embedding_model = 'intfloat/multilingual-e5-large'

collection_test = 'kva_test_collection'


In [None]:
# Database connection
client = QdrantClient(client_host, port=client_port)

In [None]:
# Model initialisation
model = SentenceTransformer(embedding_model)
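
The E5 family of models is trained with the text prefixes `query: ` and `passage: `, so search strings should carry the exact `query: ` prefix before they are encoded. A small sketch follows; `as_e5_query` and `as_e5_passage` are hypothetical helper names, and whether the stored passages in these collections were indexed with the `passage: ` prefix is not shown in this notebook.

```python
def as_e5_query(text: str) -> str:
    """Prefix a search string the way E5 models expect for queries."""
    return "query: " + text.strip()


def as_e5_passage(text: str) -> str:
    """Prefix a document chunk the way E5 models expect for passages."""
    return "passage: " + text.strip()


prefixed = as_e5_query("  area reconnaissance ")
```

Centralising the prefix in one helper avoids small spelling variations of the prefix creeping into individual cells.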

In [None]:
# Semantic search method
# Only return points whose payload has validated == True.
query_filter_validated = Filter(
    must=[
        FieldCondition(
            key="validated",
            match=MatchValue(value=True),
        )
    ]
)

# query_filter_keyword = Filter()
# TODO: implement keyword search


def get_similarities(text, query_filter, collection_name="kva_test_collection", response_limit=5):
    """Return the top hits for `text` as a dict of parallel lists (one entry per hit)."""
    # Embed the query as a normalised vector and convert it to a plain list of floats.
    query_vector = model.encode(text, normalize_embeddings=True).astype(float).tolist()

    search_result = client.search(
        collection_name=collection_name,
        query_vector=query_vector,
        query_filter=query_filter,
        limit=response_limit,
        timeout=100)

    result_dict = defaultdict(list)

    for point in search_result:
        # Skip hits that have no payload.
        if not point.payload:
            continue
        result_dict['response_text'].append(point.payload["text"])
        result_dict['response_type'].append(point.payload["content_type"])
        result_dict['score'].append(point.score)
        result_dict['filename'].append(point.payload["filename"])
        result_dict['page_no'].append(point.payload["page_number"])

    return result_dict
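
`get_similarities` encodes the query with `normalize_embeddings=True`, which yields unit-length vectors; for such vectors the dot product equals the cosine similarity. The toy sketch below shows this in plain Python with made-up 3-dimensional vectors (real embeddings from this model have 1024 dimensions). Whether the collection itself is configured for cosine or dot-product distance is not shown in this notebook, so treat the link to `point.score` as an assumption.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for a query and a passage embedding.
query_vec = [1.0, 2.0, 2.0]
passage_vec = [2.0, 1.0, 2.0]

score_cosine = cosine(query_vec, passage_vec)
score_dot_normalized = dot(normalize(query_vec), normalize(passage_vec))
```

Both scores come out identical up to floating-point rounding, which is why normalising once at encoding time is enough.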


In [None]:
df_properties = {
    'white-space': 'pre-wrap', # Allows text to wrap within cells
    'width': '300px', # Adjust as needed
}

---
### Preliminary results comparison table

In [None]:
collection_structured = 'kva_documents_structured'
collection_simple = 'kva_documents_simple'

collections_to_use = [collection_test] # [collection_simple, collection_structured]

query_inputs = [
    'area reconnaissance',
    'capability',
    'capability planning',
    'capability development',
    'capability management',
    'battle tank',
    'main battle tank',
    'operational environment',
    'area of operations',
    'air operation',
    'air defence',
    'rotary-wing aircraft',
    'urban warfare',
    'forward defence',
    'forward defence posture',
    'psychological operation',
    'psychological warfare'
    ]

In [None]:
keywords_col_data = list()
similarity_ranks_col_data = list()
results_by_collection = {coll: [] for coll in collections_to_use}

In [None]:
# One column per collection, plus 'märksõna' (keyword) and 'sarnasus' (similarity rank).
data = {coll: [] for coll in collections_to_use}
data.update({'märksõna': [], 'sarnasus': []})
data

In [None]:
for query in query_inputs:

    query_responses = defaultdict(list)

    for collection in collections_to_use:

        # E5 models expect the exact prefix 'query: ' on search queries.
        responses = get_similarities('query: ' + query, query_filter_validated, collection, response_limit=5)

        for text, content_type, score, fname, page in zip(responses['response_text'], responses['response_type'], responses['score'], responses['filename'], responses['page_no']):
            response_text = f'File: {fname}\nPage: {page}\nType: {content_type}\nSearch result: {text}'
            query_responses[collection].append(response_text)

    # Assumes every collection returned the same number of hits for this query.
    no_of_responses = len(query_responses[collection])

    data['sarnasus'].extend(range(1, no_of_responses + 1))
    data['märksõna'].extend([query] * no_of_responses)

    for k, v in query_responses.items():
        data[k].extend(v)
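
The loop above takes the rank count from the last collection, so the columns only line up if every collection returns the same number of hits. Below is a hedged sketch of one way to pad shorter result lists when several collections are compared; `pad_columns` and the sample data are hypothetical and not part of the notebook.

```python
from itertools import zip_longest


def pad_columns(responses_by_collection, fill=""):
    """Pad every collection's hit list to the length of the longest one."""
    rows = list(zip_longest(*responses_by_collection.values(), fillvalue=fill))
    return {
        name: [row[i] for row in rows]
        for i, name in enumerate(responses_by_collection)
    }


# Made-up results: the second collection returned fewer hits.
sample = {
    "kva_documents_simple": ["hit 1", "hit 2", "hit 3"],
    "kva_documents_structured": ["hit A"],
}
padded = pad_columns(sample)
ranks = list(range(1, len(next(iter(padded.values()))) + 1))
```

With padded columns, the rank and keyword lists can be sized from any one column and `pd.DataFrame` will accept the data without a length mismatch.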

In [None]:
len(query_inputs)

In [None]:
for k, v in data.items():
    print(k, len(v))

In [None]:
df = pd.DataFrame(data=data)

df.set_index(['märksõna', 'sarnasus'], inplace=True)

display(df.style.set_properties(**df_properties))

In [None]:
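# Long-format variant of the table above: one row per keyword, rank and parser.
# Built from the wide `df` created in the previous cell.
df_long = (
    df.reset_index()
    .melt(id_vars=['märksõna', 'sarnasus'], var_name='Parser', value_name='Vaste')
    .rename(columns={'märksõna': 'Märksõna', 'sarnasus': 'Sarnasus'})
)
# Columns: Märksõna = keyword, Sarnasus = similarity rank, Vaste = matched text.
df = df_long.set_index(['Märksõna', 'Sarnasus'])
display(df.style.set_properties(**df_properties))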


In [None]:
import numpy as np
import pandas as pd

# Scratch example: a DataFrame with two-level (MultiIndex) columns
columns = pd.MultiIndex.from_tuples([
    ('A', 'foo'), ('A', 'bar'),
    ('B', 'foo'), ('B', 'bar')
], names=['level_1', 'level_2'])

# Pass the MultiIndex itself as the column index.
# A separate name is used so the results table in `df` is not overwritten.
df_demo = pd.DataFrame(np.random.randn(3, 4), columns=columns)

display(df_demo)


In [None]:
# Reshape the long-format results table into a wide table with one column per parser.
# Step 1: move the index levels back into ordinary columns
df_reset = df.reset_index()

# Step 2: pivot so that each parser becomes its own column
df_pivoted = df_reset.pivot(index=['Märksõna', 'Sarnasus'], columns='Parser', values='Vaste')

# Step 3 (optional): turn the row index back into columns
df_final = df_pivoted.reset_index()

# Display the final DataFrame
display(df_final)
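
To see what the pivot does independently of the search results, here is a toy long-format table with made-up values reshaped the same way. The column names mirror the cell above; the parser names and strings are placeholders.

```python
import pandas as pd

# Made-up long-format data: one row per keyword, rank and parser.
long_df = pd.DataFrame({
    "Märksõna": ["capability", "capability", "capability", "capability"],
    "Sarnasus": [1, 1, 2, 2],
    "Parser": ["simple", "structured", "simple", "structured"],
    "Vaste": ["s1", "t1", "s2", "t2"],
})

# One row per (keyword, rank), one column per parser.
wide_df = long_df.pivot(
    index=["Märksõna", "Sarnasus"], columns="Parser", values="Vaste"
).reset_index()
```

Note that `pivot` raises an error if the same keyword, rank and parser combination appears more than once, so duplicates would need to be resolved first.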


In [None]:
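kw = 'explain the meaning of "battle tank" in context of given file'
# E5 models expect the 'query: ' prefix; pass the prefixed string to the search.
query = 'query: ' + kw

# get_similarities returns a dict of lists, so wrap it in a DataFrame before styling.
response_df = pd.DataFrame(get_similarities(query, collection_name=collection_test, query_filter=query_filter_validated))

display(response_df.style.set_properties(**df_properties))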