# Duplicated Datasets Analysis

In this notebook we analyze the datasets listed in datasets.json to find out which of them are duplicated, under progressively weaker definitions of duplication.
Sections:
* Section 1: duplicated datasets analysis. We use the following definition: two or more datasets are duplicated if they share title, description, author, tags and download links
* Section 2: we analyze which datasets share the same download links, regardless of the other fields
* Section 3: we analyze which datasets share the links but have different titles, descriptions, authors or tags

In [55]:
import json
import os
from pandas import json_normalize

In [56]:
#open the datasets.json file
#note: '__file__' is a quoted string, so realpath resolves it against the current working directory
scriptDir = os.path.dirname(os.path.realpath('__file__'))
with open(os.path.join(scriptDir,"../../files/datasets.json"), "r", encoding="utf-8") as datasets_file:
    datasets_json = json.load(datasets_file, strict=False)

## Section 1: Duplicated Datasets Analysis

In this section we retrieve from the datasets.json file which datasets are duplicated, using the following definition: "two or more datasets are duplicated if they share the same links, title, description, tags and author fields". After building the list of sets of duplicated datasets, we analyze the qrels file to see whether duplicated datasets have different relevance scores for the same query. Furthermore, we analyze the ACORDAR runs to see whether, when a duplicated dataset appears in a ranking, the other datasets of its group are returned as well.
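
The definition above can be illustrated on a few toy records. This is only a minimal sketch: the records and the helper names `duplicate_key` and `group_duplicates` are hypothetical, and it assumes that `tags` is stored as a string. It uses a tuple as the grouping key instead of a concatenated string, so that field boundaries are preserved.

```python
from collections import defaultdict

# Toy records (hypothetical); the field names mirror those read from datasets.json.
toy_datasets = [
    {"dataset_id": "1", "title": "Air quality", "description": "Hourly readings",
     "tags": "air;environment", "author": "City", "download": ["http://example.org/a.csv"]},
    {"dataset_id": "2", "title": "Air quality", "description": "Hourly readings",
     "tags": "air;environment", "author": "City", "download": ["http://example.org/a.csv"]},
    {"dataset_id": "3", "title": "Air quality", "description": "Daily readings",
     "tags": "air;environment", "author": "City", "download": ["http://example.org/a.csv"]},
]


def duplicate_key(dataset: dict) -> tuple:
    """Build a hashable key; a tuple keeps the fields separate, unlike concatenation."""
    return (
        dataset["title"],
        dataset.get("description", ""),
        str(dataset.get("tags", "")),  # assumption: tags can be compared as a string
        dataset.get("author", ""),
        tuple(sorted(set(dataset["download"]))),
    )


def group_duplicates(datasets: list) -> list:
    """Return the lists of dataset ids that share the same key."""
    groups = defaultdict(list)
    for dataset in datasets:
        groups[duplicate_key(dataset)].append(int(dataset["dataset_id"]))
    # keep only the keys shared by at least two datasets
    return [ids for ids in groups.values() if len(ids) > 1]


print(group_duplicates(toy_datasets))  # [[1, 2]]
```

Dataset 3 is not grouped with the other two because its description differs, even though it points to the same link.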

In [76]:
duplicates = dict()

for dataset in datasets_json["datasets"]:
    #obtain a sorted list of the distinct links of the dataset
    distinct_links = sorted(set(dataset["download"]))

    #build a string with the link concatenation
    links_as_string = "".join(distinct_links)

    #concatenate to the string also title, description, tags and author
    key = dataset["title"] + dataset.get("description","") + dataset["tags"] + dataset.get("author", "")
    key += links_as_string

    #group the dataset ids that produce the same key
    duplicates.setdefault(key, list()).append(int(dataset["dataset_id"]))



In [77]:
i = 0 
duplicated_sets_list = list()
for key in duplicates.keys():
    if len(duplicates[key]) > 1:
        print(f"Duplicated datasets: {duplicates[key]}")
        duplicated_sets_list.append(duplicates[key])
        i+=1

print(f"Number of sets of duplicated datasets: {i}")

Duplicated datasets: [596, 1012]
Duplicated datasets: [787, 796, 1166]
Duplicated datasets: [1123, 1848, 4013]
Duplicated datasets: [1203, 4435]
Duplicated datasets: [1221, 3340, 6542]
Duplicated datasets: [1650, 1655, 4591, 4604, 4609]
Duplicated datasets: [1870, 3489]
Duplicated datasets: [1871, 3485]
Duplicated datasets: [1921, 6661]
Duplicated datasets: [1927, 3327, 3801, 8006, 9540]
Duplicated datasets: [1928, 3341, 8033]
Duplicated datasets: [1934, 8389]
Duplicated datasets: [1937, 3328, 3802, 7794, 9541]
Duplicated datasets: [1942, 3333, 7754]
Duplicated datasets: [1943, 3342, 7801]
Duplicated datasets: [1947, 7008]
Duplicated datasets: [1949, 6856]
Duplicated datasets: [1953, 3334, 3799, 8642, 9539]
Duplicated datasets: [1972, 3773, 9360]
Duplicated datasets: [1973, 3766, 8975]
Duplicated datasets: [1994, 3355]
Duplicated datasets: [2000, 3354]
Duplicated datasets: [2028, 3356]
Duplicated datasets: [2053, 3358]
Duplicated datasets: [2071, 3357]
Duplicated datasets: [2115, 3346]

In [78]:
#now we analyze the qrels file
#mapping is a dictionary dataset_id -> query_id -> relevance_score for dataset_id in query_id
mapping = dict()

with open(os.path.join(scriptDir,"../../files/qrels.txt"), "r") as qrels_file:
    for line in qrels_file:
        line = line.strip()

        #stop at the first empty line
        if not line:
            break

        #each line is tab-separated: query_id, (unused), dataset_id, relevance
        split = line.split("\t")
        query_id = int(split[0])
        dataset_id = int(split[2])
        rel = int(split[3])

        if dataset_id not in mapping:
            mapping[dataset_id] = dict()

        mapping[dataset_id][query_id] = rel
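
The nested `dataset_id -> query_id -> relevance` structure can be tried out on in-memory lines before touching the real file. The following is a sketch under assumptions: the sample lines and the helper name `parse_qrels` are hypothetical, the column layout is the tab-separated one read by the cell above, and, unlike that cell, blank lines are skipped rather than treated as the end of the input.

```python
from collections import defaultdict

# Hypothetical qrels lines: query_id <TAB> (unused) <TAB> dataset_id <TAB> relevance
toy_qrels = [
    "1\t0\t596\t2",
    "1\t0\t1012\t0",
    "",
    "2\t0\t596\t1",
]


def parse_qrels(lines) -> dict:
    """Return a dictionary dataset_id -> {query_id: relevance}."""
    mapping = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of stopping at the first one
        fields = line.split("\t")
        query_id, dataset_id, rel = int(fields[0]), int(fields[2]), int(fields[3])
        mapping[dataset_id][query_id] = rel
    return dict(mapping)


toy_mapping = parse_qrels(toy_qrels)
print(toy_mapping[596])   # {1: 2, 2: 1}
print(toy_mapping[1012])  # {1: 0}
```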

In [79]:
#let's see if duplicated datasets have different rel. judgments in the same query

for duplicated_datasets in duplicated_sets_list:
    
    query_datasets = list()

    #retrieve for every group of duplicated datasets in which queries they appear in the qrels
    for dataset_id in duplicated_datasets:        
        if dataset_id in mapping.keys():
            query_datasets.extend(list(mapping[dataset_id].keys()))

    #remove the duplicated query ids (set.update could also be used to collect them directly into a set)
    query_datasets = set(query_datasets)
        
    for query in query_datasets:

        relevance_scores = list()


        for dataset_id in duplicated_datasets:
            rel = None
        
            if dataset_id in mapping.keys() and query in mapping[dataset_id]:
                rel = mapping[dataset_id][query]

            relevance_scores.append(rel)

        score = relevance_scores[0]
        for i in range(1, len(relevance_scores)):
            if relevance_scores[i] != score:
                print(f"Query: {query}, different scores for datasets: {duplicated_datasets} : {relevance_scores}")
                break

Query: 1237, different scores for datasets: [596, 1012] : [2, None]
Query: 1122, different scores for datasets: [1937, 3328, 3802, 7794, 9541] : [None, None, 2, None, None]
Query: 1138, different scores for datasets: [3313, 7353] : [0, 1]
Query: 90, different scores for datasets: [3332, 4803, 6463] : [None, None, 0]
Query: 153, different scores for datasets: [10445, 10803, 11413] : [0, None, 0]
Query: 1070, different scores for datasets: [13530, 15050] : [None, 0]
Query: 1034, different scores for datasets: [13530, 15050] : [None, 0]
Query: 99, different scores for datasets: [13530, 15050] : [None, 0]
Query: 54, different scores for datasets: [13530, 15050] : [None, 0]
Query: 104, different scores for datasets: [13579, 14924] : [0, 1]
Query: 1056, different scores for datasets: [13611, 15054] : [0, 1]
Query: 1062, different scores for datasets: [13765, 15344] : [0, None]
Query: 79, different scores for datasets: [13765, 15344] : [0, None]
Query: 1044, different scores for datasets: [13

In [83]:
#now we are going to analyze the run rankings in order to see if duplicated datasets are returned in the same ranking

def generate_run_ranks(run_file) -> dict:
    """Read a tab-separated run file and return a dictionary query_id -> ranked list of dataset ids."""
    run = dict()
    for line in run_file:
        split = line.split("\t")

        query = int(split[0])
        dataset = int(split[2])

        if query not in run:
            run[query] = list()

        run[query].append(dataset)

    return run


In [84]:
def find_duplicates_in_runs(run: dict, duplicated_sets: list):
    for query in run.keys():
        for dataset in run[query]:
            for duplicated_datasets in duplicated_sets:
                if dataset in duplicated_datasets:
                    print(f"Query: {query}, Rank: {run[query]}, Dataset duplicated returned: {dataset} in group: {duplicated_datasets}")
                    break
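
The function above prints one line for every duplicated dataset found in a ranking. As a complementary sketch, the hypothetical helper `duplicates_ranked_together` below reports only the groups that have at least two members in the same ranking, together with their 1-based positions. The toy run and the toy groups are made up for illustration.

```python
def duplicates_ranked_together(run: dict, duplicated_sets: list) -> list:
    """For each query, collect the duplicate groups with two or more members in the ranking."""
    findings = []
    for query, ranking in run.items():
        # 1-based position of every dataset in the ranking of this query
        positions = {dataset: pos for pos, dataset in enumerate(ranking, start=1)}
        for group in duplicated_sets:
            found = {d: positions[d] for d in group if d in positions}
            if len(found) > 1:
                findings.append((query, found))
    return findings


# Toy data (hypothetical ids): query -> ranked dataset ids, plus two duplicate groups
toy_run = {1: [10, 20, 30, 40], 2: [20, 50, 60]}
toy_groups = [[20, 40], [50, 70]]

print(duplicates_ranked_together(toy_run, toy_groups))  # [(1, {20: 2, 40: 4})]
```

In query 2 each group contributes a single dataset to the ranking, so nothing is reported for it.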

In [85]:
#analyze the ACORDAR BM25 runs in every configuration to see if the duplicated datasets are returned in the same ranking

run_dir = os.path.join(scriptDir, "../../files/run/ACORDAR")

with open(os.path.join(run_dir, "BM25F[m].txt"), "r") as run_meta_file:
    run_meta = generate_run_ranks(run_meta_file)
with open(os.path.join(run_dir, "BM25F[d].txt"), "r") as run_data_file:
    run_data = generate_run_ranks(run_data_file)
with open(os.path.join(run_dir, "BM25F.txt"), "r") as run_all_file:
    run_all = generate_run_ranks(run_all_file)

find_duplicates_in_runs(run_meta,duplicated_sets_list)


Query: 1090, Rank: [13568, 15014, 13582, 14491, 13290, 13622, 15092, 14921, 13577, 14050], Dataset duplicated returned: 13622 in group: [13622, 15092]
Query: 1090, Rank: [13568, 15014, 13582, 14491, 13290, 13622, 15092, 14921, 13577, 14050], Dataset duplicated returned: 15092 in group: [13622, 15092]
Query: 1097, Rank: [12513, 8984, 40797, 40693, 40769, 13579, 14924, 40802, 51136, 87571], Dataset duplicated returned: 13579 in group: [13579, 14924]
Query: 1097, Rank: [12513, 8984, 40797, 40693, 40769, 13579, 14924, 40802, 51136, 87571], Dataset duplicated returned: 14924 in group: [13579, 14924]
Query: 1180, Rank: [28603, 28554, 71732, 28742, 28636, 11418, 12509, 12398, 28697, 2243], Dataset duplicated returned: 12509 in group: [12398, 12509]
Query: 1180, Rank: [28603, 28554, 71732, 28742, 28636, 11418, 12509, 12398, 28697, 2243], Dataset duplicated returned: 12398 in group: [12398, 12509]
Query: 1012, Rank: [15014, 13568, 13290, 13581, 11727, 30429, 14493, 13622, 15092, 13582], Dataset

In [86]:
find_duplicates_in_runs(run_data,duplicated_sets_list)

Query: 14, Rank: [10280, 2985, 54472, 53736, 53737, 4789, 12509, 12398, 2206, 53744], Dataset duplicated returned: 12509 in group: [12398, 12509]
Query: 14, Rank: [10280, 2985, 54472, 53736, 53737, 4789, 12509, 12398, 2206, 53744], Dataset duplicated returned: 12398 in group: [12398, 12509]
Query: 124, Rank: [6183, 4789, 11608, 14979, 2976, 43391, 31774, 53744, 12513, 12509], Dataset duplicated returned: 12509 in group: [12398, 12509]
Query: 1097, Rank: [12513, 14162, 63267, 15054, 13611, 15013, 15002, 15007, 15004, 10110], Dataset duplicated returned: 15054 in group: [13611, 15054]
Query: 1097, Rank: [12513, 14162, 63267, 15054, 13611, 15013, 15002, 15007, 15004, 10110], Dataset duplicated returned: 13611 in group: [13611, 15054]
Query: 163, Rank: [3579, 45413, 3367, 13284, 14417, 74800, 13579, 14924, 8342, 4158], Dataset duplicated returned: 13579 in group: [13579, 14924]
Query: 163, Rank: [3579, 45413, 3367, 13284, 14417, 74800, 13579, 14924, 8342, 4158], Dataset duplicated returned

In [87]:
find_duplicates_in_runs(run_all,duplicated_sets_list)

Query: 1002, Rank: [46561, 12509, 12398, 11683, 86273, 3531, 41757, 4594, 3370, 9017], Dataset duplicated returned: 12509 in group: [12398, 12509]
Query: 1002, Rank: [46561, 12509, 12398, 11683, 86273, 3531, 41757, 4594, 3370, 9017], Dataset duplicated returned: 12398 in group: [12398, 12509]
Query: 1023, Rank: [13283, 14277, 13347, 16047, 14324, 13997, 12636, 14546, 13907, 15270], Dataset duplicated returned: 13907 in group: [13907, 15270]
Query: 1023, Rank: [13283, 14277, 13347, 16047, 14324, 13997, 12636, 14546, 13907, 15270], Dataset duplicated returned: 15270 in group: [13907, 15270]
Query: 1090, Rank: [13568, 15014, 13582, 14491, 13290, 13622, 15092, 14735, 13263, 14364], Dataset duplicated returned: 13622 in group: [13622, 15092]
Query: 1090, Rank: [13568, 15014, 13582, 14491, 13290, 13622, 15092, 14735, 13263, 14364], Dataset duplicated returned: 15092 in group: [13622, 15092]
Query: 1097, Rank: [12513, 8984, 40797, 40693, 40769, 40802, 13579, 14924, 15054, 13611], Dataset dupl

## Section 2

Now we are going to retrieve the datasets that share the same links.
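
Grouping by links only needs a key that ignores both the order of the links and repeated entries. One possible sketch, on hypothetical toy records and with the hypothetical helper name `group_by_links`, uses a `frozenset` of the links as the dictionary key rather than a concatenated string.

```python
from collections import defaultdict

# Toy records (hypothetical): datasets 1 and 2 list the same links in a different order
toy_link_records = [
    {"dataset_id": "1", "download": ["http://example.org/a.csv", "http://example.org/b.csv"]},
    {"dataset_id": "2", "download": ["http://example.org/b.csv", "http://example.org/a.csv",
                                     "http://example.org/a.csv"]},
    {"dataset_id": "3", "download": ["http://example.org/c.csv"]},
]


def group_by_links(datasets: list) -> list:
    """Return the lists of dataset ids that share exactly the same set of links."""
    groups = defaultdict(list)
    for dataset in datasets:
        # frozenset is hashable and insensitive to order and repetitions
        groups[frozenset(dataset["download"])].append(int(dataset["dataset_id"]))
    return [ids for ids in groups.values() if len(ids) > 1]


print(group_by_links(toy_link_records))  # [[1, 2]]
```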

In [88]:
same_links = dict()

for dataset in datasets_json["datasets"]:
    distinct_links = sorted(set(dataset["download"]))

    links_as_string = ""
    for link in distinct_links:
        links_as_string += link
    
    if links_as_string not in same_links.keys():
        same_links[links_as_string] = list()
        
    same_links[links_as_string].append(int(dataset["dataset_id"]))



In [89]:
i = 0 
duplicated_links_sets_list = list() 
for key in same_links.keys():
    if len(same_links[key]) > 1:
        print(f"Datasets with the same links: {same_links[key]}")
        duplicated_links_sets_list.append(same_links[key])
        i+=1

print(i)

Datasets with the same links: [596, 1012]
Datasets with the same links: [649, 4335, 4499, 5456]
Datasets with the same links: [787, 796, 1166]
Datasets with the same links: [1123, 1848, 4013]
Datasets with the same links: [1203, 4435]
Datasets with the same links: [1221, 3340, 6542]
Datasets with the same links: [1650, 1655, 4591, 4604, 4609]
Datasets with the same links: [1689, 5212]
Datasets with the same links: [1690, 5179]
Datasets with the same links: [1691, 5284]
Datasets with the same links: [1695, 5115]
Datasets with the same links: [1696, 5112]
Datasets with the same links: [1697, 5311]
Datasets with the same links: [1698, 5132]
Datasets with the same links: [1699, 5213]
Datasets with the same links: [1700, 5316]
Datasets with the same links: [1702, 5170]
Datasets with the same links: [1703, 5228]
Datasets with the same links: [1704, 5343]
Datasets with the same links: [1706, 5169]
Datasets with the same links: [1707, 5217]
Datasets with the same links: [1708, 5113]
Datasets w

In [90]:
#remove from the previous list all the sets of datasets that were retrieved in the previous section, in order to maintain only the
#datasets that share the links but differ in the other metadata fields (author, tags, title and description)

#convert every group to a tuple so that the groups are hashable and can be compared with a set difference
duplicated_sets_set = {tuple(group) for group in duplicated_sets_list}

duplicated_links_sets_set = {tuple(group) for group in duplicated_links_sets_list}

difference = duplicated_links_sets_set - duplicated_sets_set

for group in difference:
    print(group)

print(len(difference))


(1724, 5133)
(1736, 5098)
(5247, 5524)
(1698, 5132)
(5238, 12495)
(1710, 5225)
(13339, 14308)
(1794, 5224)
(4930, 5026)
(5118, 12167)
(13369, 14336)
(11852, 11854)
(16153, 21662)
(13077, 14573)
(4848, 5091)
(5357, 5514)
(5127, 11178)
(11848, 11850)
(5083, 12166)
(1822, 5288)
(12172, 12328)
(1795, 5289)
(1770, 5235)
(4927, 5020)
(13522, 14517)
(4950, 5223)
(10818, 11858)
(5323, 5505)
(11002, 12332)
(1742, 9868)
(5143, 5437)
(13633, 14503)
(9883, 9911)
(13193, 14597)
(4926, 5021)
(13378, 14252)
(1732, 5116)
(5386, 7385)
(13252, 14411)
(4895, 5051)
(1699, 5213)
(5075, 5375)
(12168, 12353)
(1828, 5315)
(1809, 5344)
(15968, 15979)
(5253, 5528)
(1776, 5286)
(5356, 5513)
(9899, 11864)
(1726, 5294)
(4911, 5019)
(15969, 15980)
(5186, 5474)
(5076, 10071)
(13368, 14344)
(12530, 14657)
(5320, 5431)
(5138, 5463)
(1743, 5120)
(1771, 9786)
(4921, 5038)
(4949, 5176)
(5123, 5539)
(7388, 12339)
(5090, 10118)
(4937, 5048)
(13420, 14383)
(1896, 4547)
(1739, 5109)
(4333, 5165)
(10624, 11873)
(13360, 14298)

## Section 3

Starting from the sets of datasets that have the same links but different metadata fields, retrieved at the end of the previous section, we want to see how much their metadata fields actually differ.

In [91]:
#create a dataframe with all the datasets descriptions
datasets_df = json_normalize(datasets_json['datasets'])

datasets_df.set_index("dataset_id", inplace=True)

In [93]:
#this function computes the Jaccard similarity between the word sets of the two metadata strings (1.0 = same words, values close to 0 = very different)
def metadata_difference(metadata_1: str, metadata_2: str) -> float:

    metadata_1_set = metadata_1.split(" ")
    metadata_1_set = set(metadata_1_set)

    metadata_2_set = metadata_2.split(" ")
    metadata_2_set = set(metadata_2_set)

    union = metadata_2_set.union(metadata_1_set)

    intersection = metadata_1_set.intersection(metadata_2_set)

    return len(intersection) / len(union)
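
As a worked example of the measure, the sketch below computes the Jaccard similarity |A ∩ B| / |A ∪ B| on two short strings. The helper name `jaccard_similarity` and the sample strings are hypothetical, and, as an assumption that differs from the function above, the tokens are lowercased and split on any whitespace.

```python
def jaccard_similarity(text_1: str, text_2: str) -> float:
    """Jaccard similarity between the sets of lowercase whitespace-separated tokens."""
    tokens_1 = set(text_1.lower().split())
    tokens_2 = set(text_2.lower().split())
    union = tokens_1 | tokens_2
    if not union:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_1 & tokens_2) / len(union)


# 3 shared tokens (air, quality, readings) out of 5 distinct tokens overall
similarity = jaccard_similarity("Air quality hourly readings", "air quality daily readings")
print(similarity)  # 0.6
```

The corresponding Jaccard distance is `1 - similarity`, so a threshold such as "similarity below 0.8" marks pairs whose distance is above 0.2.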
   

In [103]:
j = 0
#let's analyze the tags
for tpl in difference:
    tags = datasets_df.loc[str(tpl[0]), "tags"]

    #compare the tags of the first dataset of the group with the tags of the others
    for i in range(1, len(tpl)):
        if datasets_df.loc[str(tpl[i]), "tags"] != tags:
            print(f"Different tags in {tpl[0]} and {tpl[i]}")
            j+=1

print(f"Found {j} pairs of datasets with different tags")

Different tags in 13339 and 14308
Different tags in 1794 and 5224
Different tags in 13369 and 14336
Different tags in 16153 and 21662
Different tags in 13077 and 14573
Different tags in 13522 and 14517
Different tags in 13633 and 14503
Different tags in 13193 and 14597
Different tags in 13378 and 14252
Different tags in 13252 and 14411
Different tags in 15968 and 15979
Different tags in 15969 and 15980
Different tags in 13368 and 14344
Different tags in 12530 and 14657
Different tags in 13420 and 14383
Different tags in 13360 and 14298
Different tags in 5359 and 5515
Different tags in 13261 and 14450
Different tags in 13427 and 14235
Different tags in 13715 and 14430
Different tags in 13288 and 14580
Different tags in 15853 and 16021
Different tags in 13279 and 14387
Different tags in 12636 and 14546
Different tags in 13409 and 14362
Different tags in 13356 and 15381
Different tags in 13394 and 14599
Different tags in 12879 and 14645
Different tags in 13425 and 14299
Different tags in 

In [106]:
j = 0
#let's analyze the titles and descriptions
for tpl in difference:
    metadata = str(datasets_df.loc[str(tpl[0]), "title"]) + str(datasets_df.loc[str(tpl[0]), "description"])

    for i in range(1, len(tpl)):
        metadata_1 = str(datasets_df.loc[str(tpl[i]), "title"]) + str(datasets_df.loc[str(tpl[i]), "description"])
        #a similarity below 0.8 is considered a large difference
        if metadata_difference(metadata, metadata_1) < 0.8:
            print(f"{tpl[0]} and {tpl[i]} are very different in title and description")
            j+=1

print(f"Found {j} pairs of datasets with very different titles and descriptions")


1698 and 5132 are very different in title and description
5075 and 5375 are very different in title and description
15968 and 15979 are very different in title and description
15969 and 15980 are very different in title and description
5186 and 5474 are very different in title and description
12530 and 14657 are very different in title and description
5138 and 5463 are very different in title and description
7388 and 12339 are very different in title and description
4333 and 5165 are very different in title and description
15853 and 16021 are very different in title and description
5105 and 5530 are very different in title and description
4933 and 5027 are very different in title and description
1760 and 5131 are very different in title and description
4894 and 5061 are very different in title and description
1801 and 5339 are very different in title and description
1756 and 5282 are very different in title and description
5003 and 9409 are very different in title and description
1751 