# Dictionary Cleaning

At this point in the process, the structure of the data has two issues. First, every file contains multiple classes. The following code aims to collect only the unique IDs in order to exclude descriptive data such as professions, place names, etc. In addition, the "namedgraphs(KHI)" class takes up a large part of the data and has no value for the further analysis, so it is excluded as well.
Second, the structure is very file-oriented: every class is tied to the file in which it was stored. This relation is not relevant for the further analysis of the data. Therefore, the unique IDs form the new basis of the transformed data structure.
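
To make the intended transformation concrete, here is a minimal sketch on invented toy data. The file names, the IDs, the class name `ExampleClass(custom)`, and the helper `flatten_by_id` are hypothetical; they only mirror the assumed shape of the dump (file name, then ID, then a list of blocks) and are not part of the notebook.

```python
# Toy stand-in for the dump: file name -> entity ID -> list of blocks.
# All names and values below are invented for illustration.
toy_data = {
    "artworks/part_1.trig": {
        "0001(work)": [[["0001(work)", "ExampleClass(custom)"]]],
        "A1(type)": [[["A1(type)", "E55_Type(cidoc-crm)"]]],
    },
    "artworks/part_2.trig": {
        "A1(type)": [[["A1(type)", "E55_Type(cidoc-crm)"]]],
        "namedgraphs(KHI)": [[["0001(work)", "Namedgraph(custom)"]]],
    },
}


def flatten_by_id(per_file, excluded=("namedgraphs(KHI)",)):
    """Drop the file level and key everything by entity ID (hypothetical helper)."""
    flat = {}
    for entities in per_file.values():
        for entity_id, blocks in entities.items():
            if entity_id not in excluded:
                # An ID seen in several files is stored only once.
                flat[entity_id] = blocks
    return flat


toy_flat = flatten_by_id(toy_data)
print(sorted(toy_flat))
```

The result no longer refers to any file, and the shared ID `A1(type)` appears only once.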

In [1]:
import json

dict_file = "data/entity_dump.json"

with open(dict_file, encoding="utf8") as f:
    data = json.load(f)

#### Analysis of duplicate keys

In [2]:
# Total number of entity keys across all files (duplicates included)
length = sum(len(entities) for entities in data.values())

In [3]:
# Flat list of every entity key, in file order (duplicates included)
key_list = [key for entities in data.values() for key in entities]

print(len(key_list))

609534


In [4]:
from collections import Counter

# Number of occurrences of each key across all files
key_dict = Counter(key_list)

In [5]:
# Keys that occur more than once, i.e. in more than one file
doubled_keys = {key: count for key, count in key_dict.items() if count > 1}

In [6]:
print(list(doubled_keys.keys())[:10])

['78A3ECC8-66C6-3EC3-A5C7-CD6A3D2FC6DE(type)', 'EAE7042F-9813-3A08-A641-29C38FDA23DB(actor)', 'AB957EFD-FA4E-31E1-A0FC-D2A65C9A2567(type)', '328B39A4-725C-3411-ACD8-743B21C49700(type)', 'C728B7D3-D346-3F4B-92A3-EF87ED5F8BE4(type)', '241030FF-AC0E-3B4F-BCB1-F0BBEE3CBA66(place)', '71D1C456-088F-3C01-89A7-4B2ED119E73E(type)', '5E3A8561-554B-3215-A2F1-E5258EF166E4(type)', 'F6197719-FC31-3AE5-838D-AF4B69A37C72(type)', '48378153-A40D-3533-9CD7-EF6D02595E0C(type)']


In [7]:
# List every file that contains the given technique ID
for element in data:
    if "B3D29692-6F9F-30CE-9E42-558C85BE3301(technique)" in data[element]:
        print(element)

artworks_lvl2/formated-part_3414_cleaned.trig
artworks_lvl2/formated-part_3406_cleaned.trig
artworks_lvl2/formated-part_3367_cleaned.trig
artworks_lvl2/formated-part_3085_cleaned.trig
artworks_lvl2/formated-part_3107_cleaned.trig
artworks_lvl2/formated-part_3463_cleaned.trig
artworks_lvl2/formated-part_3421_cleaned.trig
artworks_lvl2/formated-part_3358_cleaned.trig
artworks_lvl2/formated-part_3097_cleaned.trig
artworks_lvl2/formated-part_3379_cleaned.trig
artworks_lvl2/formated-part_3507_cleaned.trig
artworks_lvl2/formated-part_3401_cleaned.trig
artworks_lvl2/formated-part_3359_cleaned.trig
artworks_lvl2/formated-part_3255_cleaned.trig
artworks_lvl2/formated-part_3202_cleaned.trig
artworks/formated-part_3356_cleaned.trig
artworks/formated-part_3366_cleaned.trig
artworks/formated-part_3167_cleaned.trig
artworks/formated-part_3354_cleaned.trig
artworks/formated-part_3126_cleaned.trig
artworks/formated-part_3380_cleaned.trig
artworks/formated-part_3250_cleaned.trig
artworks/formated-part_
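
The lookup above scans all files for one ID at a time. If several IDs need to be checked, an inverted index (ID to list of files) can be built in a single pass. The following is a sketch on invented toy data; `build_file_index` and the file names are hypothetical, and the input is assumed to have the same file, then ID layout as `data`.

```python
from collections import defaultdict


def build_file_index(per_file):
    """Map each entity ID to the list of files that contain it."""
    index = defaultdict(list)
    for file_name, entities in per_file.items():
        for entity_id in entities:
            index[entity_id].append(file_name)
    return dict(index)


# Invented toy data with the same file -> ID layout as the dump.
toy = {
    "artworks/part_1.trig": {"T1(technique)": [], "0001(work)": []},
    "artworks_lvl2/part_1.trig": {"T1(technique)": []},
}

file_index = build_file_index(toy)

# IDs that occur in more than one file
shared = {key: files for key, files in file_index.items() if len(files) > 1}
print(shared)
```

With such an index, `file_index[some_id]` returns all files for an ID without another pass over the data.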

#### Transformation

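Changing the hierarchy and putting all classes into one dict. A key that occurs in several files is stored only once; the entry from the last file processed overwrites the earlier ones.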

In [8]:
restructured_data = {}

# Drop the file level: file -> ID -> blocks becomes ID -> blocks
for entities in data.values():
    for entity_id, blocks in entities.items():
        restructured_data[entity_id] = blocks
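
Because the flattening above keeps only one entry per ID, it may be worth checking first whether the duplicated IDs actually carry identical content in every file. The sketch below does this on invented toy data; `find_conflicting_ids` is a hypothetical helper, and it has not been run against the real dump.

```python
def find_conflicting_ids(per_file):
    """Return the IDs that occur in several files with differing content."""
    first_seen = {}
    conflicts = set()
    for entities in per_file.values():
        for entity_id, blocks in entities.items():
            if entity_id not in first_seen:
                first_seen[entity_id] = blocks
            elif first_seen[entity_id] != blocks:
                conflicts.add(entity_id)
    return conflicts


# Invented toy data: "A1(type)" is identical in both files, "P1(place)" differs.
toy = {
    "file_a.trig": {
        "A1(type)": [[{"Label": "rechteckig"}]],
        "P1(place)": [[{"Label": "x"}]],
    },
    "file_b.trig": {
        "A1(type)": [[{"Label": "rechteckig"}]],
        "P1(place)": [[{"Label": "y"}]],
    },
}

conflicting = find_conflicting_ids(toy)
print(conflicting)
```

An empty result would indicate that overwriting loses no information; a non-empty one lists the IDs that need a closer look.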

In [9]:
# Remove the stray ">" in front of the bracketed suffix (">(" becomes "(").
# Each value is a list of blocks; a block holds list entries and dict entries
# (e.g. {"Label": ...}), and only the list entries are cleaned, in place.
for blocks in restructured_data.values():
    for block in blocks:
        for entry in block:
            if isinstance(entry, list):
                entry[:] = [item.replace(">(", "(") for item in entry]
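
The effect of the cell above can be shown on a single invented value. The helper `strip_stray_gt` is hypothetical, and so is the assumption that the stray `>` sits directly before the bracketed suffix, as in `A1>(type)`. Unlike the notebook cell, this sketch returns a cleaned copy instead of modifying the data in place.

```python
def strip_stray_gt(blocks):
    """Return a copy of blocks with '>(' replaced by '(' in all list entries."""
    return [
        [
            [item.replace(">(", "(") for item in entry]
            if isinstance(entry, list)
            else entry  # dict entries such as {"Label": ...} are left untouched
            for entry in block
        ]
        for block in blocks
    ]


# Invented example value: one block with a list entry and a dict entry.
toy_blocks = [[["A1>(type)", "E55_Type>(cidoc-crm)"], {"Label": "a >(b)"}]]

cleaned = strip_stray_gt(toy_blocks)
print(cleaned)
```

Note that the label inside the dict keeps its `>(`, which matches the notebook cell: only list entries are rewritten.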

In [10]:
print(restructured_data[list(restructured_data.keys())[0]])

print(len(list(restructured_data.keys())))

[[['U8B6Y4YM(type)', 'Namedgraph(custom)'], ['has_provider(custom)', 'KHI(source)']], [['78A3ECC8-66C6-3EC3-A5C7-CD6A3D2FC6DE(type)', 'E55_Type(cidoc-crm)'], {'Label': 'rechteckig'}, ['P2_has_type(cidoc-crm)', 'C57BB460-B736-3F52-8867-FFCF908C12B6(type)'], ['P67i_is_referred_to_by(cidoc-crm)', 'U8B6Y4YM(type)'], ['has_provider(custom)', 'KHI(source)']]]
237879


In [11]:
# Exclude the unwanted "namedgraphs" class; the default of None keeps the
# cell from raising a KeyError when it is run a second time

restructured_data.pop("namedgraphs(KHI)", None)

[[['02552804(actor)', 'Namedgraph(custom)']],
 [['07601433%2CT%2C002%2CT%2C001%2CT%2C001(work)', 'Namedgraph(custom)']],
 [['TOOGTZOY(type)', 'Namedgraph(custom)']],
 [['TNZJPOA2(type)', 'Namedgraph(custom)']],
 [['1H84RLO2(type)', 'Namedgraph(custom)']],
 [['namedgraphs(KHI)', 'SourceNamedgraph(custom)'],
  ['consists_of(custom)', '07601433%2CT%2C002%2CT%2C001%2CT%2C001(work)']]]

In [12]:
print(list(restructured_data.keys())[:3])

['78A3ECC8-66C6-3EC3-A5C7-CD6A3D2FC6DE(type)', 'EAE7042F-9813-3A08-A641-29C38FDA23DB(actor)', 'AB957EFD-FA4E-31E1-A0FC-D2A65C9A2567(type)']


#### Key statistics

In [13]:
# Tuple of all entity IDs in the restructured data.
# To count the keys per file instead (duplicates included), iterate over
# data[element].keys() for every element in data.
key_tuple = tuple(restructured_data)

In [14]:
category = key_tuple[0].replace(")", "").split("(")[1]
category

'type'
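
The `replace`/`split` approach above takes the text after the first opening parenthesis. As a slightly stricter alternative, the category can be read from the parenthesised suffix at the very end of the key with a regular expression. This is only a sketch; `category_of` is a hypothetical helper, not part of the notebook, and it assumes that every key ends in `(category)`.

```python
import re

# Matches a parenthesised suffix at the very end of the key, e.g. "(type)".
_CATEGORY_RE = re.compile(r"\(([^()]+)\)$")


def category_of(key):
    """Return the category in the trailing parentheses, or None if absent."""
    match = _CATEGORY_RE.search(key)
    return match.group(1) if match else None


print(category_of("78A3ECC8-66C6-3EC3-A5C7-CD6A3D2FC6DE(type)"))
print(category_of("07601433%2CT%2C002%2CT%2C001%2CT%2C001(work)"))
print(category_of("no_suffix"))
```

Returning `None` for keys without a suffix avoids the `IndexError` that `split("(")[1]` would raise in that case.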

In [15]:
category_stats_dict = {}
for key in key_tuple:
    # The category is the text inside the parentheses of the key
    category = key.replace(")", "").split("(")[1]
    stats = category_stats_dict.setdefault(category, {"Count": 0, "IDs": []})
    stats["Count"] += 1
    stats["IDs"].append(key)


In [16]:
for element in category_stats_dict:
    print("Category:\t{}\nCount:\t{}\nExamples:\t{}\n\n".format(element, category_stats_dict[element]["Count"], category_stats_dict[element]["IDs"][:5]))

Category:	type
Count:	34919
Examples:	['78A3ECC8-66C6-3EC3-A5C7-CD6A3D2FC6DE(type)', 'AB957EFD-FA4E-31E1-A0FC-D2A65C9A2567(type)', '328B39A4-725C-3411-ACD8-743B21C49700(type)', 'C728B7D3-D346-3F4B-92A3-EF87ED5F8BE4(type)', '71D1C456-088F-3C01-89A7-4B2ED119E73E(type)']


Category:	actor
Count:	16011
Examples:	['EAE7042F-9813-3A08-A641-29C38FDA23DB(actor)', 'F4B7CF34-0AF4-36B4-82C2-11E56AB87103(actor)', '02552782(actor)', '07600210(actor)', '00086500(actor)']


Category:	place
Count:	9105
Examples:	['241030FF-AC0E-3B4F-BCB1-F0BBEE3CBA66(place)', '946DD00F-13F3-342B-999D-9C5874DC5DE4(place)', '4A887B8A-68AC-39B0-8D9D-027BDDEDB06B(place)', '2996325D-BD5E-3FB0-8815-F546ADB971AD(place)', '90AE43FF-9F61-3FCA-8E00-9FE49999A499(place)']


Category:	work
Count:	98948
Examples:	['07874792(work)', '07874803(work)', '07874807(work)', '07960073%2CT%2C011%2CT(work)', '07874809(work)']


Category:	material
Count:	607
Examples:	['B6D5D085-72FC-3D9C-8D9D-2CBEA3DB4D07(material)', '1D4D1728-9C35-39A6-912E

In [17]:
# Total number of keys before restructuring (duplicates included)
key_length = sum(len(entities) for entities in data.values())

#### Dumping the data

In [18]:
with open("data/entity_dump_cleaned_and_restructured.json", "w", encoding="utf8") as f:
    json.dump(restructured_data, f, indent=4)
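
After dumping, a quick round-trip check can confirm that the file on disk loads back to the same structure. The sketch below uses an invented sample value and a temporary directory; `dump_and_verify` is a hypothetical helper and the file name is arbitrary.

```python
import json
import os
import tempfile


def dump_and_verify(obj, path):
    """Write obj as JSON, reload it, and report whether the round trip matches."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(obj, f, indent=4)
    with open(path, encoding="utf8") as f:
        return json.load(f) == obj


# Invented sample in the same ID -> blocks shape as the restructured data.
sample = {"A1(type)": [[["A1(type)", "E55_Type(cidoc-crm)"], {"Label": "rechteckig"}]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    ok = dump_and_verify(sample, os.path.join(tmp_dir, "sample.json"))

print(ok)
```

The comparison works here because the data consists only of string keys, lists, dicts, and strings, all of which JSON preserves.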