Given three documents D1, D2 and D3, ask the question: is D1 more similar to D2 or to D3?

Select only documents from PWDB (length between 200 - 500 characters).
Provide 1000 triples of documents. 
Make sure that the selection is random. 
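
A minimal sketch of the selection step, assuming the documents are available as a mapping from id to text. The names `select_by_length` and `docs` are illustrative and not part of the notebook, and the bounds are treated as inclusive, which the description above does not specify.

```python
import random

def select_by_length(docs, min_chars=200, max_chars=500):
    """Return ids of documents whose text length lies in [min_chars, max_chars]."""
    return [doc_id for doc_id, text in docs.items()
            if min_chars <= len(text) <= max_chars]

# Toy corpus (hypothetical): lengths 100, 200, 350, 500 and 501 characters.
docs = {
    "a": "x" * 100,
    "b": "x" * 200,
    "c": "x" * 350,
    "d": "x" * 500,
    "e": "x" * 501,
}

eligible = select_by_length(docs)

# Draw a random sample without replacement; a seeded generator keeps it repeatable.
rng = random.Random(42)
sample = rng.sample(eligible, k=2)
```

Filtering before sampling keeps the draw uniform over the eligible documents only.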

Questions for the evaluators:
* Q1: Is D1 more similar to D2 or to D3? Answer: D2 or D3
* Q2: How similar is D1 to D2 on a scale from 1 to 5? Answer: 1-5 (1 not similar, 5 very similar)
* Q3: How similar is D1 to D3 on a scale from 1 to 5? Answer: 1-5 (1 not similar, 5 very similar)

Data frame header:
| id d1 | text d1 | id d2 | text d2 | id d3 | text d3 | q1 | q2 | q3 |

Cross-checking method:
Partition the PWDB space randomly into groups of 4 documents (i1, i2, i3, i4) and generate the following test questions from each group:

* d1=i1; d2=i2; d3=i3
* d1=i2; d2=i3; d3=i4
* d1=i3; d2=i4; d3=i1
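
The rotation scheme can be sketched as follows. `rotate_triples` is a hypothetical helper, not a function from the notebook; each triple is simply the first three ids of the group after shifting it left by 0, 1 or 2 positions.

```python
def rotate_triples(group):
    """Build three (d1, d2, d3) triples from a group of four ids by rotation."""
    if len(group) != 4:
        raise ValueError("expected a group of exactly 4 document ids")
    triples = []
    for shift in range(3):
        rotated = group[shift:] + group[:shift]   # shift the group left
        triples.append(tuple(rotated[:3]))        # keep the first three ids
    return triples

triples = rotate_triples(["i1", "i2", "i3", "i4"])
for d1, d2, d3 in triples:
    print(f"d1={d1}; d2={d2}; d3={d3}")
```

With this rotation the pair (i1, i3) is scored in both the first and the third triple, so each group yields one repeated judgment, which may be what makes the scheme usable for cross-checking.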

Generate test sets of three sizes from these groups (each group yields three triples):
* very small test set using 20 random groups (60 triples)
* small test set using 50 random groups (150 triples)
* medium test set using 150 random groups (450 triples)


In [183]:
import sys
sys.path.append("/home/jovyan/work/sem-covid/")
sys.path = list(set(sys.path))

import os
os.getcwd()
os.chdir('/home/jovyan/work/sem-covid/')

from sem_covid.services.store_registry import store_registry
from sem_covid import config

es_store = store_registry.es_index_store()
df = es_store.get_dataframe(index_name=config.UNIFIED_DATASET_ELASTIC_SEARCH_INDEX_NAME)

print(df)

100% (6360 of 6360) |####################| Elapsed Time: 0:00:16 Time:  0:00:16


                                                  title  \
_id                                                       
1624  COMMISSION STAFF WORKING DOCUMENT Accompanying...   
1625  Regulation (EU) 2021/267 of the European Parli...   
1626  COMMUNICATION FROM THE COMMISSION TO THE EUROP...   
1627  European Parliament resolution of 15 May 2020 ...   
1628  REPORT FROM THE COMMISSION TO THE EUROPEAN PAR...   
...                                                 ...   
6355  Statement from the National Public Health Emer...   
6356  Ministers McConalogue and Heydon launch Code o...   
6357  Press Release on Civil Defence in the context ...   
6358  Minister O’Gorman launches ‘LGBTI+ Youth in Ir...   
6359  Statement from the National Public Health Emer...   

                                                content  \
_id                                                       
1624  COMMISSION STAFF WORKING DOCUMENT Accompanying...   
1625  Regulation (EU) 2021/267 of the European Parli...

In [184]:
ds_pwdb = df[df["doc_source"]=="ds_pwdb"].copy()

Generate partitions of random document indexes

In [185]:
import random 

index_list = ds_pwdb.index.to_list()
random.shuffle(index_list)

n = 4
index_partitions = [index_list[i:i + n] for i in range(0, len(index_list), n)]
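
When the number of documents is not a multiple of 4, chunking leaves a shorter final group that cannot supply the four distinct documents the rotation scheme expects. A possible variant, shown here as a sketch with the hypothetical helper `make_partitions`, drops such a group and optionally takes a seed so that the shuffle can be reproduced.

```python
import random

def make_partitions(ids, group_size=4, seed=None):
    """Shuffle ids and split them into groups, discarding a short trailing group."""
    shuffled = list(ids)          # copy so the caller's sequence is left untouched
    random.Random(seed).shuffle(shuffled)
    groups = [shuffled[i:i + group_size]
              for i in range(0, len(shuffled), group_size)]
    return [g for g in groups if len(g) == group_size]

# Ten ids give two complete groups; the two leftover ids are discarded.
groups = make_partitions(range(10), seed=0)
```

Passing `seed=None` keeps the shuffle non-deterministic, matching the behaviour of a plain `random.shuffle` call.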


In [186]:
def generate_single_record(data_frame, index_list):
    """Build one evaluation row: the first id is the target, the next two are references."""
    return {
        "target id": index_list[0],
        "target title": data_frame["title"][index_list[0]],
        "target content": data_frame["content"][index_list[0]],
        "ref1 id": index_list[1],
        "ref1 title": data_frame["title"][index_list[1]],
        "ref1 content": data_frame["content"][index_list[1]],
        "ref2 id": index_list[2],
        "ref2 title": data_frame["title"][index_list[2]],
        "ref2 content": data_frame["content"][index_list[2]],
        # The answer columns are left empty for the evaluators to fill in.
        "Q1: is *target* more similar to *ref1* or to *ref2*? Answer: *ref1* or *ref2*": "",
        "Q2: How similar is *target* to *ref1* on a scale from 1 to 5. Answer: 1-5 (1 not similar, 5 very similar)": "",
        "Q3: How similar is *target* to *ref2* on a scale from 1 to 5. Answer: 1-5 (1 not similar, 5 very similar)": "",
    }


def generate_three_records(data_frame, index_references):
    """Build three rows from one group of four ids by rotating the group left."""
    refs = index_references
    results = [
        generate_single_record(data_frame, refs),                  # i1, i2, i3
        generate_single_record(data_frame, refs[1:] + refs[:1]),   # i2, i3, i4
        generate_single_record(data_frame, refs[2:] + refs[:2]),   # i3, i4, i1
        #generate_single_record(data_frame, refs[3:] + refs[:3]),
    ]
    return results


test the correctness of the result

In [187]:
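# Rotation check on the first partition: record 0 keeps the original order,
# record 2 starts at the third id and wraps around to the first.
x = generate_three_records(ds_pwdb, index_partitions[0])

assert x[0]["target id"] == index_partitions[0][0]
assert x[0]["ref1 id"] == index_partitions[0][1]
assert x[0]["ref2 id"] == index_partitions[0][2]


assert x[2]["target id"] == index_partitions[0][2]
assert x[2]["ref1 id"] == index_partitions[0][3]
assert x[2]["ref2 id"] == index_partitions[0][0]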

prepare the function to generate the final results

In [193]:
import itertools

import pandas as pd

def generate_evaluation_dataset(data_frame,index_partitions,number_of_partitions_to_consider):
    result = [
        generate_three_records(data_frame, partition_index_references)        
        for partition_index_references in index_partitions[:number_of_partitions_to_consider]
    ]
    result = list(itertools.chain(*result))
    return pd.DataFrame(result)

simple test

In [None]:
n = 30
res = generate_evaluation_dataset(ds_pwdb,index_partitions,n)
print(len(res))
assert len(res) == 3 * n

generate the final results

In [None]:
import pathlib

# Each slice starts where the previous set ends, so the three sets share no partitions.
evaluation_ds_20 = generate_evaluation_dataset(ds_pwdb, index_partitions, 20)
evaluation_ds_50 = generate_evaluation_dataset(ds_pwdb, index_partitions[20:], 50)
evaluation_ds_150 = generate_evaluation_dataset(ds_pwdb, index_partitions[70:], 150)
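
Because every test set is cut from the same `index_partitions` list, it can be useful to check which partition indexes each call consumes. The sketch below uses two hypothetical helpers, `consumed_range` and `overlapping`, and assumes that a call takes `count` consecutive partitions starting at the slice offset, as `generate_evaluation_dataset` does.

```python
def consumed_range(offset, count):
    """Partition indexes used when taking `count` groups starting at `offset`."""
    return range(offset, offset + count)

def overlapping(ranges):
    """Return index pairs (a, b) of ranges that share at least one partition."""
    clashes = []
    for a in range(len(ranges)):
        for b in range(a + 1, len(ranges)):
            if set(ranges[a]) & set(ranges[b]):
                clashes.append((a, b))
    return clashes

# Back-to-back offsets: 20 groups, then 50, then 150.
plan = [consumed_range(0, 20), consumed_range(20, 50), consumed_range(70, 150)]
print(overlapping(plan))                 # []
print(sum(len(r) for r in plan) * 3)     # 660 triples in total
```

A plan like this needs at least 220 complete groups, that is 880 documents, to be available after shuffling.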


In [191]:
import csv

local_path = pathlib.Path("/home/jovyan/work/sem-covid/tests/test_data")

evaluation_ds_20.to_csv(
    local_path / "evaluation_ds_20.csv", 
    quoting=csv.QUOTE_ALL,
    escapechar="\\",
    index=False)
evaluation_ds_50.to_csv(
    local_path / "evaluation_ds_50.csv", 
    quoting=csv.QUOTE_ALL,
    escapechar="\\",
    index=False)
evaluation_ds_150.to_csv(
    local_path / "evaluation_ds_150.csv", 
    quoting=csv.QUOTE_ALL,
    escapechar="\\",
    index=False)
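
Document text often contains quotes, commas and line breaks, so it can be worth confirming that the exported files read back unchanged. The sketch below uses the standard `csv` module directly with the same `quoting` and `escapechar` settings, on the assumption that a reader configured identically recovers the original text; the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical rows containing quotes, a comma and an embedded newline.
rows = [
    {"target id": "1", "target content": 'He said "stay home", then left.\nSecond line.'},
    {"target id": "2", "target content": "plain text, with a comma"},
]

# Write to an in-memory buffer; newline="" lets the csv module control line endings.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                        quoting=csv.QUOTE_ALL, escapechar="\\")
writer.writeheader()
writer.writerows(rows)

# Read the same buffer back with matching settings.
buffer.seek(0)
reader = csv.DictReader(buffer, quoting=csv.QUOTE_ALL, escapechar="\\")
restored = list(reader)
```

Comparing `restored` with `rows` is a cheap check to run before handing the files to evaluators.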
