### Comparing multilingual-e5-large across document-parsing methods
1) First step (for us): compare semantic search on data parsed in two different ways. We still need to work out how best to structure the search. Should a matching text be required to contain at least one word of the term?
2) Second step (for the terminologists): the difference between text search and semantic search.
3) We could also compare whether combining the search with the response of an API query gives better semantic search results.
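
The open question in step 1 (should a hit contain at least one word of the term?) can be tried out as a simple post-filter on the returned texts. The following is only a minimal sketch under assumptions: the helper name `contains_term_word`, the small `STOP_WORDS` set, and the sample hits are hypothetical and not part of the notebook.

```python
import re

# Hypothetical stop-word list: words too common to count as a match on their own
STOP_WORDS = {"of", "the", "and", "in"}


def contains_term_word(term: str, text: str) -> bool:
    """Return True if `text` shares at least one non-stop word with `term` (case-insensitive)."""
    term_words = set(re.findall(r"\w+", term.lower())) - STOP_WORDS
    text_words = set(re.findall(r"\w+", text.lower()))
    return bool(term_words & text_words)


# Made-up search hits, used only to illustrate the filter
hits = [
    "The main battle tank remains central to armoured formations.",
    "Helicopters provide rapid troop transport.",
]
kept = [hit for hit in hits if contains_term_word("battle tank", hit)]
print(kept)
```

A filter like this only matches exact word forms, so inflected forms and translations would be dropped; that trade-off is part of what the comparison would need to show.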


#### Terms selected by the terminologists:
* *area reconnaissance* – discussed by our team and already in Militerm; the search should find various definitions and usage examples;
* *capability* – a widely used word that is also defined in the military domain; could be tested together with the terms capability planning, capability development, capability management;
* *battle tank* and/or *main battle tank* – frequently used, with slightly different approaches to be found; seems good for exercising the AI because there is plenty of information;
* *operational environment*
* *area of operations*
* *air operation* – a very simple and ordinary term, so to speak; does the AI find the information relevant to our context;
* *air defence* – a simple term that hides more complex content; the AI should find several parallel concepts;
* *rotary-wing aircraft* – the term has a number of synonyms, and quite a lot of information about (heli)copters should turn up in different places;
* *urban warfare* – very many synonyms, wide usage;
* *forward defence*, *forward defence posture* – we are still deliberating and looking for information ourselves, but would like to see what the AI can do;
* *psychological operation*, *psychological warfare* – for when we want to test or teach how the AI recognises the same context or different versions of a term.

In [None]:
# Imports
import re
from collections import defaultdict

import pandas as pd
from qdrant_client import QdrantClient
from qdrant_client.http.models import (FieldCondition, Filter, MatchText,
                                       MatchValue)
from sentence_transformers import SentenceTransformer

In [None]:
client_host = 'localhost'
client_port = 6333

# SentenceTransformers model
embedding_model = 'intfloat/multilingual-e5-large'

collection_test = 'kva_test_collection'


In [None]:
# Database connection
client = QdrantClient(client_host, port=client_port)

In [None]:
# Model initialisation
model = SentenceTransformer(embedding_model)

In [None]:
df_properties = {
    'white-space': 'pre-wrap', # Allows text to wrap within cells
    'width': '300px', # Adjust as needed
    'text-align': 'left'
}

# Semantic search filter: only return points whose payload has validated == True
query_filter_validated = Filter(
    must=[
        FieldCondition(
            key="validated",
            match=MatchValue(value=True),
        )
    ]
)

# query_filter_keyword = Filter()
# TODO: implement keyword search

def calculate_ranking_similarity(collections: list, result_df, query: str):
    """
    Calculates how often two collections return the same (file, page) at the same rank for a query.

    Args:
    - collections (list): The names of the two collections to compare; each must be a column of result_df.
    - result_df (DataFrame): Search results indexed by (query, rank), with one column of formatted response text per collection.
    - query (str): The query string whose results are compared.

    Returns:
    - tuple: The query string and the percentage of rank positions at which both collections agree.
    """

    # Each response text starts with "Fail: <filename>\nLk: <page>\n" (file and page labels)
    file_pattern = re.compile(r'^Fail: ([^\n]*)\n')
    page_pattern = re.compile(r'\nLk: ([^\n]*)\n')

    df = result_df.loc[query]

    # Extract the (filename, page) pair of each response, per collection
    result_files_by_collection = {collection: [] for collection in collections}
    for row in df.itertuples():
        for collection in collections:
            response = getattr(row, collection)
            file_match = file_pattern.search(response)
            page_match = page_pattern.search(response)
            if file_match is None or page_match is None:
                raise ValueError(f'Unexpected response format in {collection!r} for query {query!r}')
            result_files_by_collection[collection].append((file_match.group(1), page_match.group(1)))

    # Count the rank positions at which both collections return the same (filename, page)
    response_count = 0
    similar_response_count = 0
    for c1, c2 in zip(*result_files_by_collection.values()):
        if c1 == c2:
            similar_response_count += 1
        response_count += 1

    # Avoid a division by zero when the query has no responses
    similarity = similar_response_count * 100 / response_count if response_count else 0.0

    return query, similarity


def get_similarities(text, query_filter, collection_name="kva_test_collection", response_limit=5):
    """
    Searches a collection for the documents most similar to a given text, subject to a query filter.

    Args:
    - text (str): The text to search for similar documents.
    - query_filter (Filter): The Qdrant filter that restricts the search.
    - collection_name (str, optional): The name of the collection to search within, defaults to "kva_test_collection".
    - response_limit (int, optional): The maximum number of search results to return, defaults to 5.

    Returns:
    - defaultdict: Lists of response texts, response types, scores, filenames, and page numbers, in rank order.
    """

    search_result = client.search(
        collection_name=collection_name,
        # tolist() converts the NumPy embedding into a plain list of Python floats
        query_vector=model.encode(text, normalize_embeddings=True).tolist(),
        query_filter=query_filter,
        limit=response_limit)

    result_dict = defaultdict(list)

    for point in search_result:
        # Skip points that were stored without a payload
        if not point.payload:
            continue
        result_dict['response_text'].append(point.payload["text"])
        result_dict['response_type'].append(point.payload["content_type"])
        result_dict['score'].append(point.score)
        result_dict['filename'].append(point.payload["filename"])
        result_dict['page_no'].append(point.payload["page_number"])

    return result_dict


def get_group_similarities(text, query_filter, collection_name="kva_test_collection", response_limit=5, group_size=4):
    """
    Searches a collection for the documents most similar to a given text, grouping hits that share the same text.

    Args:
    - text (str): The text to search for similar documents.
    - query_filter (Filter): The Qdrant filter that restricts the search.
    - collection_name (str, optional): The name of the collection to search within, defaults to "kva_test_collection".
    - response_limit (int, optional): The maximum number of groups to return, defaults to 5.
    - group_size (int, optional): The maximum number of hits per group, i.e. how many filenames and page numbers are returned per group.

    Returns:
    - defaultdict: Per group, the response text, response type, best score, and the lists of filenames and page numbers.
    """

    search_params = {
        "query_vector": model.encode(text, normalize_embeddings=True).tolist(),
        "limit": response_limit,  # previously hard-coded to 5, which ignored the argument
        "group_by": "text",
        "group_size": group_size,
        "with_payload": True,
        "with_vectors": False,
        "query_filter": query_filter
    }

    # Perform the search
    search_results = client.search_groups(collection_name=collection_name, **search_params).groups

    result_dict = defaultdict(list)

    for group in search_results:

        points = group.hits

        # The group id is the value of the group_by field, i.e. the shared text
        result_dict['response_text'].append(group.id)
        result_dict['response_type'].append(points[0].payload["content_type"])
        result_dict['score'].append(points[0].score)
        result_dict['filename'].append([point.payload["filename"] for point in points])
        result_dict['page_no'].append([point.payload["page_number"] for point in points])

    return result_dict
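
The position-wise comparison in `calculate_ranking_similarity` is easiest to see on a toy example. The sketch below repeats only the counting step on two made-up rankings; the helper name `positional_agreement` and the file names are hypothetical, and no Qdrant connection is needed.

```python
def positional_agreement(ranking_a, ranking_b):
    """Percentage of rank positions at which both rankings hold the same (file, page) pair."""
    pairs = list(zip(ranking_a, ranking_b))
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100 * matches / len(pairs)


# Made-up rankings: the same four hits, with ranks 2 and 3 swapped
simple = [("a.pdf", "3"), ("b.pdf", "7"), ("c.pdf", "1"), ("d.pdf", "2")]
structured = [("a.pdf", "3"), ("c.pdf", "1"), ("b.pdf", "7"), ("d.pdf", "2")]

print(positional_agreement(simple, structured))  # 50.0
```

Both collections return the same four hits here, yet the score is only 50 %, because the measure counts a hit as matching only when it sits at the same rank in both lists.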


---
## Building the test dataset

### Source data for the test dataset

In [None]:
collection_structured = 'kva_documents_structured'
collection_simple = 'kva_documents_simple'

collections_to_use = [collection_simple, collection_structured] 

query_inputs = [
    'battle tank - self-propelled armoured fighting vehicle, capable of heavy firepower, primarily of a high muzzle velocity direct fire main gun necessary to engage armoured and other targets, with high cross-country mobility, with a high level of self-protection, and which is not designed and equipped primarily to transport combat troops',
    'area of operations - area within a joint operations area defined by the joint force commander for conducting tactical level operations',
    'urban warfare - combats de rues en ville',
    "forward defence - strategie visant a arreter un agresseur aussi pres que possible des frontieres;s'oppose a la defense en profondeur(1)"
    ]
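
Each entry in `query_inputs` has the form `term - definition`. If the term is needed on its own (for a keyword check or for reporting, say), it could be separated from the definition roughly as follows. This is a sketch; the helper name `split_query` is hypothetical, and it assumes the first `" - "` (with surrounding spaces) is the separator.

```python
def split_query(entry: str, separator: str = " - "):
    """Split a 'term - definition' entry at the first separator into (term, definition)."""
    term, _, definition = entry.partition(separator)
    return term.strip(), definition.strip()


term, definition = split_query('urban warfare - combats de rues en ville')
print(term)
print(definition)
```

Because the separator includes the surrounding spaces, hyphenated terms such as *rotary-wing aircraft* stay intact.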

### First test dataset (with duplicates)

In [None]:
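# Columns: one per collection, plus 'märksõna' (the query) and 'sarnasus' (the similarity rank)
data = {coll: [] for coll in collections_to_use}
data.update({'märksõna': [], 'sarnasus': []})

for query in query_inputs:

    query_responses = defaultdict(list)

    for collection in collections_to_use:

        # E5 models expect search queries to carry the 'query: ' prefix
        responses = get_similarities('query: ' + query, query_filter_validated, collection, response_limit=5)

        for text, content_type, fname, page in zip(responses['response_text'], responses['response_type'], responses['filename'], responses['page_no']):
            # The 'Fail:' (file) and 'Lk:' (page) labels are parsed again by calculate_ranking_similarity
            response_text = f'Fail: {fname}\nLk: {page}\nTüüp: {content_type}\n\n{text}'
            query_responses[collection].append(response_text)

    # Keep only as many ranks as every collection returned, so that all columns stay the same length
    no_of_responses = min(len(query_responses[collection]) for collection in collections_to_use)

    data['sarnasus'].extend(range(1, no_of_responses + 1))
    data['märksõna'].extend([query] * no_of_responses)

    for collection in collections_to_use:
        data[collection].extend(query_responses[collection][:no_of_responses])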
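
In this first dataset the same text can appear several times when it was indexed from more than one file or page. As a rough illustration of what collapsing such duplicates means, the sketch below does it client-side on plain dicts. The helper name `collapse_duplicates` and the sample hits are hypothetical; the dict keys mirror the payload fields used above.

```python
def collapse_duplicates(hits):
    """Keep the best-ranked hit per text and collect every (filename, page) that shares that text."""
    grouped = {}  # dicts preserve insertion order, so rank order is kept
    for hit in hits:
        entry = grouped.setdefault(
            hit["text"],
            {"text": hit["text"], "score": hit["score"], "sources": []},
        )
        entry["sources"].append((hit["filename"], hit["page_number"]))
    return list(grouped.values())


# Made-up hits in rank order; the first and third share the same text
hits = [
    {"text": "T1", "score": 0.9, "filename": "a.pdf", "page_number": 1},
    {"text": "T2", "score": 0.8, "filename": "b.pdf", "page_number": 4},
    {"text": "T1", "score": 0.7, "filename": "c.pdf", "page_number": 9},
]
collapsed = collapse_duplicates(hits)
print(len(collapsed))
print(collapsed[0]["sources"])
```

Unlike server-side grouping, this approach deduplicates only within the hits already returned, so fewer than `response_limit` distinct texts may remain.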

Saving as an Excel table

In [None]:
df = pd.DataFrame(data=data)
df.set_index(['märksõna', 'sarnasus'], inplace=True)

df.to_excel('~/projects/kva/kva_data/test_data/test_2.xlsx')

Evaluation:

In [None]:
for query in query_inputs:
    print(calculate_ranking_similarity(collections_to_use, df, query))

### Second test dataset (grouped responses, without duplicates)

In [None]:
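# Columns: one per collection, plus 'märksõna' (the query) and 'sarnasus' (the similarity rank)
data = {coll: [] for coll in collections_to_use}
data.update({'märksõna': [], 'sarnasus': []})

for query in query_inputs:

    query_responses = defaultdict(list)

    for collection in collections_to_use:

        # E5 models expect search queries to carry the 'query: ' prefix
        responses = get_group_similarities('query: ' + query, query_filter_validated, collection, response_limit=5)

        # With grouped results, fname and page are lists covering every hit in the group
        for text, content_type, fname, page in zip(responses['response_text'], responses['response_type'], responses['filename'], responses['page_no']):
            response_text = f'Fail: {fname}\nLk: {page}\nTüüp: {content_type}\n\n{text}'
            query_responses[collection].append(response_text)

    # Keep only as many ranks as every collection returned, so that all columns stay the same length
    no_of_responses = min(len(query_responses[collection]) for collection in collections_to_use)

    data['sarnasus'].extend(range(1, no_of_responses + 1))
    data['märksõna'].extend([query] * no_of_responses)

    for collection in collections_to_use:
        data[collection].extend(query_responses[collection][:no_of_responses])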

Saving as an Excel table

In [None]:
df = pd.DataFrame(data=data)
df.set_index(['märksõna', 'sarnasus'], inplace=True)
# Note: this is the same output path as for the first test dataset, so that file is overwritten
df.to_excel('~/projects/kva/kva_data/test_data/test_2.xlsx')

Evaluation:

In [None]:
for query in query_inputs:
    print(calculate_ranking_similarity(collections_to_use, df, query))
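
The rank-by-rank percentage printed above drops to zero as soon as the two collections return the same hits in a shifted order. An order-insensitive overlap could be reported alongside it. The sketch below is one possible way to do that, not part of the current evaluation; the helper name `topk_overlap` and the sample rankings are hypothetical.

```python
def topk_overlap(ranking_a, ranking_b):
    """Percentage of distinct (file, page) pairs that occur in both rankings, ignoring rank."""
    set_a, set_b = set(ranking_a), set(ranking_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 100 * len(set_a & set_b) / len(union)


# Made-up rankings: two shared hits and one unique hit on each side
simple = [("a.pdf", "3"), ("b.pdf", "7"), ("c.pdf", "1")]
structured = [("b.pdf", "7"), ("a.pdf", "3"), ("d.pdf", "2")]

print(topk_overlap(simple, structured))  # 50.0
```

Two of the four distinct hits are shared in this example, so the overlap is 50 %, whereas the rank-by-rank comparison of the same lists gives 0 %.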