In [None]:
# TODO from insights by prof
# dont save in json but locally in a var
# only match by labels (no other matching needed, no properties, superclasses, etc.)
# just save labels in list with key is class uri and value is label: if memory is not a problem save in to structures: key value of label and value is class uri
# 5 - 10 min for human, mouse ontologies is reasonable

# Output: in rdf format (see reference_anatomy) just put final matches in there above the threshold (defined by the user). As relation use owl:equivalentClass

# Project Description
The goal of the project is to develop a simple yet effective ontology alignment framework in Python that focuses on lexical similarity matching. The framework will utilize both string matching techniques and the semantic capabilities of large language models to identify potential alignments between entities (such as classes) in two different ontologies.

### Objectives
1. **Develop an ontology alignment framework** that can process and compare ontologies based on textual content.
2. **Implement lexical similarity matching** using both basic string matching techniques and advanced semantic analysis with embeddings from LLMs.
3. **Output alignments with confidence scores**, enabling users to understand and evaluate the quality and reliability of the suggested alignments.

### Steps to Perform

#### Step 1: Ontology Parsing
- **Goal**: Load and parse the ontologies to be aligned.
- **Tasks**:
  - Utilize libraries like `rdflib` or `owlready2` to read ontology files.
  - Extract relevant textual information (e.g., class names, labels, descriptions).

#### Step 2: Lexical Similarity Matching
This step is divided into two sub-steps: string matching and embeddings matching.

##### a. String Matching
- **Goal**: Implement direct and fuzzy string comparison techniques to find matches based on textual similarity.
- **Tasks**:
  - Perform normalization (e.g., lowercasing, removing special characters).
  - Use string comparison methods (exact match, substring search, edit distance).

##### b. Embeddings Matching Using LLMs
- **Goal**: Use the semantic context provided by LLMs to match terms based on their meanings.
- **Tasks**:
  - Generate embeddings for the textual content of each ontology using models from the Hugging Face Transformers library.
  - Calculate similarity scores between embeddings (e.g., using cosine similarity).

#### Step 3: Combining and Filtering
- **Goal**: Aggregate results from both matching techniques and refine the output.
- **Tasks**:
  - Combine scores from string and embeddings matching.
  - Apply thresholds to filter out matches with low confidence.
  - Optionally, use simple structural checks to add confidence to matches (e.g., matched entities have similar parent classes).

#### Step 4: Output and Evaluation
- **Goal**: Output the alignment results and provide means for evaluation.
- **Tasks**:
  - Format the output in a structured way (e.g., JSON, CSV) that lists entity pairs and their matching scores.
  - If possible, evaluate the effectiveness using known benchmarks or test cases to calculate precision, recall, and F1-score.

### Summary
The project is centered on creating a practical tool for ontology matching, focusing on textual content using both conventional and advanced NLP techniques. By combining string-based and semantic-based approaches, the framework aims to provide robust alignments that are supported by both literal and contextual text similarities. This dual approach enhances the capability of the alignment process, making it more flexible and potentially more accurate than using only one method.

## Notes

- [This paper](https://arxiv.org/pdf/2309.07172) suggests that Flan-T5-XXL might perform best: [Hugging face link to model](https://huggingface.co/google/flan-t5-xxl)

In [1]:
# imports
import json
from owlready2 import *
import rdflib
import pandas as pd
from collections import OrderedDict, defaultdict

  from pandas.core import (


**rdflib vs owlready2:**

Interchangeability: Given that OWL is an application of RDF, tools that can parse RDF/XML can generally handle .owl files, and vice versa, provided that the ontology-specific constructs are understood by the tool. This is why libraries like rdflib, which are capable of parsing RDF, are suitable for handling OWL files serialized in RDF/XML format.

Flexibility: Choosing to work with rdflib for general RDF handling and owlready2 for specific ontology manipulations where needed is a flexible approach. It allows you to leverage the strengths of both libraries—rdflib for its robust RDF manipulation and SPARQL querying capabilities, and owlready2 for its ontology-specific features like reasoning and direct manipulation of classes and properties.

# Load/ Parse ontologies

In [2]:
# input paths
onto1_path_in = "test_ontologies/mouse.owl"
onto2_path_in = "test_ontologies/human.owl"

# output paths
onto1_path_out = "ontology_jsons/onto1.json"
onto2_path_out = "ontology_jsons/onto2.json"

In [3]:
def load_ontology(file_path):
    """
    Loads an ontology from a given file path, which can be in RDF (.rdf) or OWL (.owl) format.
    
    Args:
    file_path (str): The file path to the ontology file.
    
    Returns:
    rdflib.Graph: A graph containing the ontology data.
    """
    # Create a new RDF graph
    graph = rdflib.Graph()

    # Bind some common namespaces to the graph
    namespaces = {
        "rdf": rdflib.namespace.RDF,
        "rdfs": rdflib.namespace.RDFS,
        "owl": rdflib.namespace.OWL,
        "xsd": rdflib.namespace.XSD
    }
    for prefix, namespace in namespaces.items():
        graph.namespace_manager.bind(prefix, namespace)

    # Attempt to parse the file
    try:
        graph.parse(file_path, format=rdflib.util.guess_format(file_path))
        print(f"Successfully loaded ontology from {file_path}")
    except Exception as e:
        print(f"Failed to load ontology from {file_path}: {e}")
        return None

    return graph

In [4]:
# load ontologies
onto1_graph = load_ontology(onto1_path_in)
onto2_graph = load_ontology(onto2_path_in)
print(onto1_graph, onto2_graph)

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'IOMemory']]. [a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'IOMemory']].


In [5]:
def preprocess_label(label):
    return str(label).replace("_", " ").strip(" ,.").lower()

### Outdated approach: parsing to json

In [10]:
def extract_ontology_details_to_json(graph):
    """
    Extracts and returns details of each class and property in the given RDF graph in JSON format.

    Args:
    graph (rdflib.Graph): The RDF graph containing the ontology data.
    """

    # Query for classes
    class_query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?class ?label ?label_dt ?label_lang ?superclass
    WHERE {
        ?class rdf:type owl:Class.
        OPTIONAL { ?class rdfs:label ?label. BIND(datatype(?label) AS ?label_dt) BIND(lang(?label) AS ?label_lang) }
        OPTIONAL { ?class rdfs:subClassOf ?superclass }
    }
    """
    classes = graph.query(class_query)
    ontology_details = {}

    # Process class results
    for row in classes:
        class_uri, label, label_dt, label_lang, superclass = row
        class_key = str(class_uri)
        if class_key not in ontology_details:
            ontology_details[class_key] = {'uri': class_key, 'type': 'class', 'labels': [], 'superclasses': []}

        if label and {'value': preprocess_label(label), 'datatype': str(label_dt) if label_dt else None, 'language': str(label_lang) if label_lang else None} not in ontology_details[class_key]['labels']:
            ontology_details[class_key]['labels'].append({'value': preprocess_label(label), 'datatype': str(label_dt) if label_dt else None, 'language': str(label_lang) if label_lang else None})

        if superclass and str(superclass) not in ontology_details[class_key]['superclasses']:
            ontology_details[class_key]['superclasses'].append(str(superclass))

    # Query for properties
    property_query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?entity ?label ?label_dt ?label_lang
    WHERE {
        { ?entity rdf:type owl:ObjectProperty }
        UNION
        { ?entity rdf:type owl:DatatypeProperty }.
        OPTIONAL { ?entity rdfs:label ?label. BIND(datatype(?label) AS ?label_dt) BIND(lang(?label) AS ?label_lang) }
    }
    """
    
    properties = graph.query(property_query)

    # Process property results
    for row in properties:
        prop_uri, label, label_dt, label_lang = row
        prop_key = str(prop_uri)
        if prop_key not in ontology_details:
            ontology_details[prop_key] = {'uri': prop_key, 'type': 'property', 'labels': []}

        if label and {'value': preprocess_label(label), 'datatype': str(label_dt) if label_dt else None, 'language': str(label_lang) if label_lang else None} not in ontology_details[prop_key]['labels']:
            ontology_details[prop_key]['labels'].append({'value': preprocess_label(label), 'datatype': str(label_dt) if label_dt else None, 'language': str(label_lang) if label_lang else None})

        # this will be empty but then we have a homogenous json structure
        if 'superclasses' not in ontology_details[prop_key]:
            ontology_details[prop_key]['superclasses'] = []

    # Convert dictionary to JSON
    json_data = json.dumps(list(ontology_details.values()), indent=4, separators=(',', ': '))
    return json_data

In [11]:
onto1_json = extract_ontology_details_to_json(onto1_graph)
onto2_json = extract_ontology_details_to_json(onto2_graph)

In [7]:
# function to save as json
def save_to_json(file_path, raw_data):
    with open(file_path, 'w') as f:
        f.write(raw_data)
    print(f"Data has been saved to '{file_path}'.")

# function to load json data
def load_json_data(file_path):
    """
    Reads JSON data from a file and returns it.

    Parameters:
        file_path (str): The path to the JSON file to be read.
    
    Returns:
        dict/list: The data loaded from the JSON file.
    """
    with open(file_path, 'r') as file:
        data = json.load(file)
    return data

In [8]:
save_to_json(onto1_path_out, onto1_json)
save_to_json(onto2_path_out, onto2_json)

Data has been saved to 'ontology_jsons/onto1.json'.
Data has been saved to 'ontology_jsons/onto2.json'.


In [10]:
onto1_json_data = load_json_data(onto1_path_out)
onto2_json_data = load_json_data(onto2_path_out)

print("Ontology 1 Data:", onto1_json_data)
print("Ontology 2 Data:", onto2_json_data)

Ontology 1 Data: [{'uri': 'http://mouse.owl#MA_0000001', 'type': 'class', 'labels': [{'value': 'mouse anatomy', 'datatype': 'http://www.w3.org/2001/XMLSchema#string', 'language': None}], 'superclasses': []}, {'uri': 'http://mouse.owl#MA_0000002', 'type': 'class', 'labels': [{'value': 'spinal cord grey matter', 'datatype': 'http://www.w3.org/2001/XMLSchema#string', 'language': None}], 'superclasses': ['http://mouse.owl#MA_0001112', 'Nd3552c7b58f54bf18684374237d4b555']}, {'uri': 'http://mouse.owl#MA_0000003', 'type': 'class', 'labels': [{'value': 'organ system', 'datatype': 'http://www.w3.org/2001/XMLSchema#string', 'language': None}], 'superclasses': ['http://www.w3.org/2002/07/owl#Thing', 'N5bf0d8c689bb4eeca7c6af49d499c61b']}, {'uri': 'http://mouse.owl#MA_0000004', 'type': 'class', 'labels': [{'value': 'trunk', 'datatype': 'http://www.w3.org/2001/XMLSchema#string', 'language': None}], 'superclasses': ['http://mouse.owl#MA_0002433']}, {'uri': 'http://mouse.owl#MA_0000005', 'type': 'clas

### New approach without json and instead dicts

In [91]:
def extract_ontology_details_to_dict(graph):
    # Query for classes
    class_query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?class ?label ?label_dt ?label_lang
    WHERE {
        ?class rdf:type owl:Class.
        OPTIONAL { ?class rdfs:label ?label. BIND(datatype(?label) AS ?label_dt) BIND(lang(?label) AS ?label_lang) }
    }
    """
    classes = graph.query(class_query)
    ontology_labels_dict = OrderedDict()
    labels_list = []

    # Process class results
    for row in classes:
        class_uri, label, label_dt, label_lang = row
        class_key = str(class_uri)
        label_str = preprocess_label(label)
        if label_str not in ontology_labels_dict:
            ontology_labels_dict[label_str] = class_key
            labels_list.append(label_str)
            
    
    # TODO implement matching with properties

    return ontology_labels_dict, labels_list

In [32]:
onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)

In [9]:
def find_duplicates(ordered_dict):
    # Step 1: Count occurrences of each value
    value_counts = defaultdict(int)
    for key, value in ordered_dict.items():
        value_counts[value] += 1

    # Step 2: Filter to find values that appear more than once
    duplicates = {value for value, count in value_counts.items() if count > 1}

    # Step 3: Collect keys for these duplicate values
    duplicate_keys = {key: value for key, value in ordered_dict.items() if value in duplicates}

    return duplicate_keys

# Find duplicates
onto1_duplicate_keys = find_duplicates(onto1_dict)
print("Duplicate keys and values:", onto1_duplicate_keys)
# => filtering for multiple labels works

Duplicate keys and values: {'2 respiratory system epithelium': 'http://mouse.owl#MA_0001823', 'respiratory system epithelium': 'http://mouse.owl#MA_0001823'}


some more cleaning ideas:
- remove non-meaningful classes or properties

## 3. Matching

### Required libraries

In [11]:
#!pip install python-Levenshtein

In [12]:
#!pip install scikit-learn

In [1]:
#!pip install linktransformer

Collecting linktransformer
  Downloading linktransformer-0.1.14-py3-none-any.whl.metadata (58 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m58.0/58.0 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
Collecting faiss-cpu (from linktransformer)
  Downloading faiss_cpu-1.8.0-cp311-cp311-macosx_11_0_arm64.whl.metadata (3.6 kB)
Collecting hdbscan (from linktransformer)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.2/5.2 MB[0m [31m11.5 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25h  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Collecting sentence-transformers (from linktransformer)
  Downloading sentence_transformers-2.7.0-py3-none-any.whl.metadata (11 kB)
Collecting transformers (from linktransformer)
  Downloading transformers-4.40.2-py3-none-any.whl.metadata (137 kB)
[2K     [90

### 3.1. String Matching

For String Matching we will implement 4 different methods that the user then can chose via a parameter when calling the method.

The metrics we will use are:
- Levenshtein distance
- Jaccard Similarity
- Cosine Similarity
- TF-IDF
- LinkTransformer

In [92]:
import Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
import linktransformer as lt

def levenshtein_distance(str1, str2):
    return Levenshtein.distance(str1, str2)

def calc_cosine_similarity(str1, str2):
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform([str1, str2])
    return cosine_similarity(count_matrix)[0][1]

def jaccard_similarity(str1, str2):
    # Tokenize the strings into sets of words
    set1 = set(str1.split())
    set2 = set(str2.split())
    
    # Find the intersection and union of the two sets
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    
    # Calculate the Jaccard score
    if not union:  # Handle the edge case where both strings might be empty
        return 0.0
    return len(intersection) / len(union)

# def calculate_tfidf_cosine_similarity(str1, str2): # TODO adjust for new workflow
#     vectorizer = TfidfVectorizer()
#     tfidf = vectorizer.fit_transform([str1, str2])
#     # Calculate the cosine similarity between the two vectors
#     # tfidf_matrix[0:1] gets the tf-idf vector for the first document
#     # tfidf_matrix[1:2] gets the tf-idf vector for the second document
#     sim_score = cosine_similarity(tfidf[0:1], tfidf[1:2])

#     # sim_score is an array of shape (1,1); we return the element at [0][0]
#     return sim_score[0][0]

In [94]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_vectorize_labels(labels):
    """
    Converts a list of labels into TF-IDF vectors using TfidfVectorizer.
    
    Args:
    labels (list): List of all labels from both ontologies.
    
    Returns:
    TfidfVectorizer, scipy.sparse.csr.csr_matrix: The vectorizer and the TF-IDF matrix.
    """
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform(labels)
    return vectorizer, count_matrix

def cosine_compare_labels(count_matrix, index1, index2):
    """
    Computes the cosine similarity between two labels based on their count vector indices.
    
    Args:
    count_matrix (scipy.sparse.csr.csr_matrix): The matrix containing the count vectors.
    index1, index2 (int): Indices of the labels to compare.
    
    Returns:
    float: Cosine similarity score.
    """
    return cosine_similarity(count_matrix[index1:index1+1], count_matrix[index2:index2+1])[0][0]


def execute_cosine_string_matching(label_list1, label_list2):
    # Combine labels and vectorize them
    all_labels = label_list1 + label_list2
    vectorizer, count_matrix = cosine_vectorize_labels(all_labels)
    
    # Example comparison between the first label of ontology 1 and the first label of ontology 2
    similarity_score = cosine_compare_labels(count_matrix, 0, len(label_list1))
    print(f"Similarity score between '{label_list1[0]}' and '{label_list2[0]}': {similarity_score}")

In [88]:
execute_cosine_string_matching(onto1_list, onto2_list)

Similarity score between 'infraorbital artery' and 'Pars_Interna': 0.0


In [87]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_vectorize_labels(labels):
    """
    Converts a list of labels into TF-IDF vectors using TfidfVectorizer.
    
    Args:
    labels (list): List of all labels from both ontologies.
    
    Returns:
    TfidfVectorizer, scipy.sparse.csr.csr_matrix: The vectorizer and the TF-IDF matrix.
    """
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(labels)
    return vectorizer, tfidf_matrix

def tfidf_compare_labels(tfidf_matrix, index1, index2):
    """
    Computes the cosine similarity between two labels based on their TF-IDF vector indices.
    
    Args:
    tfidf_matrix (scipy.sparse.csr.csr_matrix): The matrix containing the TF-IDF vectors.
    index1, index2 (int): Indices of the labels to compare.
    
    Returns:
    float: Cosine similarity score.
    """
    return cosine_similarity(tfidf_matrix[index1:index1+1], tfidf_matrix[index2:index2+1])[0][0]


def execute_tfidf_string_matching(label_list1, label_list2):
    # Combine labels and vectorize them
    all_labels = label_list1 + label_list2
    vectorizer, tfidf_matrix = tfidf_vectorize_labels(all_labels)
    
    # Example comparison between the first label of ontology 1 and the first label of ontology 2
    similarity_score = tfidf_compare_labels(tfidf_matrix, 0, len(label_list1))
    print(f"Similarity score between '{label_list1[0]}' and '{label_list2[0]}': {similarity_score}")

In [89]:
execute_tfidf_string_matching(onto1_list, onto2_list)

Similarity score between 'infraorbital artery' and 'Pars_Interna': 0.0


In [14]:
# Unfortunately, this does not work at all because the models or the are too big. The kernel crashes

"""
def linktransformer_comparison(onto1, onto2):
    # Make pandas dataframes (can only compare dataframes)
    # Specify the record_path to expand the labels and superclasses if needed
    df_onto1 = pd.json_normalize(onto1, 'labels', ['uri', 'type', 'superclasses'], 
                    record_prefix='label_')
    df_onto2 = pd.json_normalize(onto2, 'labels', ['uri', 'type', 'superclasses'], 
                    record_prefix='label_')
    
    # Comparison using the most downloaded LLM: 
    # models tested: sentence-transformers/all-MiniLM-L6-v2 -> crashes
    # dell-research-harvard/lt-wikidata-comp-multi -> crashes
    df_matched = lt.merge(df_onto1, df_onto2, on="label_value", merge_type="1:1", suffixes=('_onto1', '_onto2'), model='dell-research-harvard/lt-wikidata-comp-multi')

    return df_matched

onto_matched = linktransformer_comparison(onto1_data, onto2_data)
"""

2024-05-13 17:48:35 - Load pretrained SentenceTransformer: dell-research-harvard/lt-wikidata-comp-multi




: 

In [95]:
def execute_string_matching(metric, data1, data2):
    """
    Executes the selected matching metric on the provided data.

    Args:
    metric (str): A single letter representing the metric to use.
                  'Levenshtein' for Levenshtein Distance,
                  'Jaccard' for Jaccard Similarity,
                  'LinkTransformer' for Link Transformer.
    data1, data2 (str): The data strings to compare.

    Returns:
    result: The result of the chosen metric computation.
    """
    if metric == 'Levenshtein':
        return levenshtein_distance(data1, data2)
    elif metric == 'Jaccard':
        return jaccard_similarity(data1, data2)
    elif metric == 'LinkTransformer':
        pass # TODO implement or remove
    else:
        raise ValueError("Invalid metric selection")

In [96]:
len(onto1_dict)

2739

In [49]:
import time

def measure_time(method=print()):
    start_time = time.time()
    method
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Elapsed time:", elapsed_time, "seconds")




In [90]:
counter = 0
second_key = 0
metric_choice = 'Jaccard'

for key1 in onto1_list:
    print(counter)
    for key2 in onto2_list:
        result = execute_string_matching(metric_choice, key1, key2)
        #if result > 0:
        #    print(f"'{key1}' got matched with '{key2}' and got score: {result}")
    counter += 1

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [82]:
counter = 0
second_key = 0
metric_choice = 'Jaccard'

for key1, value1 in onto1_dict.items():
    print(counter)
    for key2, value2 in onto2_dict.items():
        result = execute_string_matching(metric_choice, key1, key2)
        #if result > 0:
        #    print(f"'{key1}' got matched with '{key2}' and got score: {result}")
    counter += 1

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [197]:
def match_ontologies(onto1_path_in, onto2_path_in, metric, bidirectional=False):
    # load ontologies
    onto1_graph = load_ontology(onto1_path_in)
    onto2_graph = load_ontology(onto2_path_in)
    
    onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
    onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)
    
    print(onto1_list)
    print(onto2_list)
    
    onto1_safety_dict = {}
    for element in onto1_list:
        onto1_safety_dict[element] = element
        
    onto2_safety_dict = {}
    for element in onto2_list:
        onto2_safety_dict[element] = element
    
    print(onto1_safety_dict)
    
    onto2_used_classes = {}
    
    if metric == "Cosine":
        execute_cosine_string_matching(onto1_list, onto2_list)
        # TODO extend so it return scores etc. and also can be mapped to nodes for next step
    elif metric == "TF-IDF":
        execute_tfidf_string_matching(onto1_list, onto2_list)
        # TODO extend so it return scores etc. and also can be mapped to nodes for next step
    else:
        class_results = {}

        operation_total = 0
        for x in onto1_list:
            for y in onto2_list:
                operation_total += 1
        
        print(operation_total)
        operations_done = 0
        
        while onto1_list:
            label1 = onto1_list.pop()
            # Match from Ontology 1 to Ontology 2
            label_result = ["", 0]
            best_score = 0
            for label2 in onto2_list:
                operations_done += 1
                if operations_done % 10000 == 0:
                    print(operations_done)
                matching_score = execute_string_matching(metric, label1, label2)
                # If a perfect match is found, stop iterating over labels for this entry
                if (metric == 'Jaccard' and matching_score == 1) or (metric == 'Levenshtein' and matching_score == 0):
                    best_score = matching_score
                    label_result = [label2, best_score]
                    break
                # Check if a match for this label has been found before
                if (metric == 'Jaccard' and matching_score > best_score) or (metric == 'Levenshtein' and matching_score < best_score):
                    label_result[0] = label2
                    best_score = matching_score
                    
            # print("Label done: ", label1)
            label_result[1] = best_score
            class_uri = onto1_dict[label1]
            class2_uri = onto2_dict[label2]
            label_result[0] = class2_uri
            # print(label1)
            # print(label_result)
            if class2_uri not in onto2_used_classes: # check if class found of ontology 2 already used by other class in ontology 1
                if class_uri not in class_results:
                    class_results[class_uri] = label_result
                    onto2_used_classes[class2_uri] = class_uri
                elif label_result[1] > class_results[class_uri][1]:
                        class_results[class_uri] = label_result
                        onto2_used_classes[class2_uri] = class_uri
            else:
                # print("Already in use")
                score_of_current_class_in_use = onto2_used_classes[class2_uri] # if that is the case get class uri of class that uses that class of ontology 2
                if class_results[score_of_current_class_in_use][1] < label_result[1]: # if score if the new found match is higher than the current assigned one
                    # print("Better score as already in use")
                    class_results[class_uri] = label_result # set the class of ontology 2 to that current class
                    onto2_used_classes[class2_uri] = class_uri # overwrite the use of that class to new class of ontology 1
                    class_results[score_of_current_class_in_use] = ["", 0] # set result of earlier class to None (could also be remmoved but that way later we can handle if no match found)
                    onto1_list.append(label1) # add the label again to list again that gets iterated as it now doesnt have a match anymore
                    
        # TODO implement bidirectional matching
                    
                    
        # ---------------------- OUTDATED but nice for report            
        # Old code without handling if better match found for already in use class (good for report)
        # # Match from Ontology 1 to Ontology 2
        # for label1 in onto1_list:
        #     label_result = ["", 0]
        #     best_score = 0
        #     for label2 in onto2_list:
        #         operations_done += 1
        #         if operations_done % 10000 == 0:
        #             print(operations_done)
        #         matching_score = execute_string_matching(metric, label1, label2)
        #         # If a perfect match is found, stop iterating over labels for this entry
        #         if (metric == 'Jaccard' and matching_score == 1) or (metric == 'Levenshtein' and matching_score == 0):
        #             best_score = matching_score
        #             label_result = [label2, best_score]
        #             break
        #         # Check if a match for this label has been found before
        #         if (metric == 'Jaccard' and matching_score > best_score) or (metric == 'Levenshtein' and matching_score < best_score):
        #             label_result[0] = label2
        #             best_score = matching_score
                    
        #     # print("Label done: ", label1)
        #     label_result[1] = best_score
        #     class_uri = onto1_dict[label1]
        #     class2_uri = onto2_dict[label2]
        #     label_result[0] = class2_uri
        #     # OLD_TO-DO check that if class of ontology 2 already used isn't allowed to use
        #     if class_uri not in class_results:
        #         class_results[class_uri] = label_result
        #     elif class_results[class_uri][1] < label_result[1]:
        #             class_results[class_uri] = label_result
                    
        # OLD_TO-DO important: currently it takes the best match found for the current class of the ontology 1.
        # But it doesnt take into account if a later class has a higher score with that class and therefore would be better suited
        # Solution: make dict and always take one element that gets removed. If later another element matches with the class matched with
        # the previous element the earlier removed element get added again and the value of that elements gets assigned to the higher value element
        
        return class_results


In [216]:
# input paths
onto1_path_in = "test_ontologies/mouse.owl"
onto2_path_in = "test_ontologies/human.owl"
# onto1_path_in = "test_ontologies/test1.owl"
# onto2_path_in = "test_ontologies/test2.owl"

matching_results = match_ontologies(onto1_path_in, onto2_path_in, 'Jaccard')

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
['epigastric artery', 'parabigeminal nucleus', 'hepatic duct smooth muscle', 'orbital septum', 'dorsal mesogastrium', 'optic disc', 'lateral cuneiform', 'pericardial fluid', 'neck connective tissue', 'dorsal penis artery', 'trachea blood vessel', 'cornea', 'head skin', 'superior cerebellar artery', 'base of arytenoid', 'renal cortex collecting duct', 'adrenal gland cortex', 'sphincter colli superficialis', 'ciliary epithelium', 'dentate gyrus hilus', 'urethra lamina propria', 'adrenal gland cortex zone', 'perilymph', 'sciatic nerve', 'fungiform papilla', 'awl hair', 'constrictor vulvae', 'nerve trunk', 'supraspinatus', 'yellow elastic cartilage', 'lower arm skin', 'liver left medial lobe', 'medial femoral circumflex artery', 'adipose tissue', 'corneal endothelium', 'trunk nerve', 'coat hair', 'heart ventricle membranous part', 'skin gland', 'parasympathetic nerve', 'p

In [177]:
jaccard_similarity("mouse anatomy", "mouse anatomian")

0.3333333333333333

In [178]:
jaccard_similarity("mouse anatomian", "mouse anatomian")

1.0

In [199]:
matching_results

{'http://mouse.owl#MA_0000851': ['', 0],
 'http://mouse.owl#MA_0002633': ['', 0],
 'http://mouse.owl#MA_0002365': ['http://human.owl#NCI_C53175', 1.0],
 'http://mouse.owl#MA_0000630': ['http://human.owl#NCI_C52753', 1.0],
 'http://mouse.owl#MA_0000818': ['http://human.owl#NCI_C12356', 1.0],
 'http://mouse.owl#MA_0002167': ['http://human.owl#NCI_C53051', 1.0],
 'http://mouse.owl#MA_0001654': ['http://human.owl#NCI_C12886', 1.0],
 'http://mouse.owl#MA_0000219': ['http://human.owl#NCI_C12673', 1.0],
 'http://mouse.owl#MA_0001316': ['', 0],
 'http://mouse.owl#MA_0000007': ['http://human.owl#NCI_C12429', 1.0],
 'http://mouse.owl#MA_0002297': ['http://human.owl#NCI_C53155', 1.0],
 'http://mouse.owl#MA_0001772': ['http://human.owl#NCI_C48942', 1.0],
 'http://mouse.owl#MA_0002177': ['http://human.owl#NCI_C53055', 1.0],
 'http://mouse.owl#MA_0002570': ['http://human.owl#NCI_C32150', 1.0],
 'http://mouse.owl#MA_0002477': ['http://human.owl#NCI_C12230', 1.0],
 'http://mouse.owl#MA_0002720': ['htt

In [215]:
counter = 0
for key, element in matching_results.items():
    if element[1] == 1:
        counter += 1
print(counter)

940


In [147]:
# TODO implement LLM (probably Word2Vec)
# TODO implement combining and filtering
# TODO implement user inputs (also weighted average with user defined formular would be nice)

In [116]:
import time

def measure_matching_time(onto1_path_out, onto2_path_out, metric, bidirectional=False):
    start_time = time.time()
    match_ontologies(onto1_path_out, onto2_path_out, metric, bidirectional)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Elapsed time:", elapsed_time, "seconds")

In [146]:
# Usage
matches = match_ontologies(onto1_path_in, onto2_path_in, 'Jaccard', bidirectional=False)

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
['eye gland', 'hypothalamus medial zone', 'thoracic mammary gland', 'peritubular capillary', 'heart valve', 'glossopharyngeal ix ganglion', 'tendon', 'anal region', 'prostate gland anterior lobe', 'glomerular capillary endothelium', 'lumbar vertebra', 'pudendal vein', 'inferior suprarenal artery', 'semicircular canal', 'substantia gelatinosa', 'intrinsic tongue muscle transverse component', 'tarsus', 'prostatic urethra', 'spiral ligament', 'right lung alveolar system', 'thymus lobule', 'peritoneal cavity', 'common palmar digital arteries', 'eye posterior chamber', 'ulnar artery', 'middle phalanx of hand', 'gastrointestinal system serosa', 'limb skin', 'tegmentum', 'extensor carpi ulnaris', 'sphincter pupillae', 'temporal vein', 'urinary bladder serosa', 'jejunal vein', 'interlobular bile duct', 'ventral intercostal artery', 'hand digit 4 phalanx', 'cartilaginous joint

### Create artificial ontologies for testing

In [137]:
import json
import random

def generate_class_uri(index, base_uri="http://mouse.owl#MA_"):
    return f"{base_uri}{1000 + index:04}"

def generate_label(index):
    base_labels = ["Nerve", "muscle", "vein", "artery", "bone", "tissue", "cell", "organ", "gland", "membrane"]
    part = random.choice(base_labels)
    return preprocess_label(f"{part} {index}")

def generate_superclasses(index):
    base_superclasses = [
        "http://www.w3.org/2002/07/owl#Thing",
        "http://mouse.owl#AnatomicalStructure",
        "http://mouse.owl#BiologicalProcess"
    ]
    return [random.choice(base_superclasses) for _ in range(random.randint(1, 3))]

def create_ontology_entries(num_entries=100):
    entries = []
    for i in range(num_entries):
        entry = {
            "class_uri": generate_class_uri(i),
            "labels": [{"value": generate_label(i), "datatype": "http://www.w3.org/2001/XMLSchema#string", "language": None}],
            "superclasses": generate_superclasses(i)
        }
        entries.append(entry)
    return entries

# Generate the data for two ontology files
ontology1 = create_ontology_entries()
ontology2 = create_ontology_entries()

with open('ontology_jsons/test-ontology1.json', 'w') as file1:
    json.dump(ontology1, file1, indent=4)

with open('ontology_jsons/test-ontology2.json', 'w') as file2:
    json.dump(ontology2, file2, indent=4)


In [144]:
# Matches on artificial test data
test_matches = match_ontologies('ontology_jsons/test-ontology1.json', 'ontology_jsons/test-ontology2.json', 'C', bidirectional=False)

10000
10000


In [139]:
measure_matching_time('ontology_jsons/test-ontology1.json', 'ontology_jsons/test-ontology2.json', 'C', bidirectional=False)

10000
10000
Elapsed time: 7.353135108947754 seconds


In [145]:
counter = 0
for label_key, data in test_matches.items():
    if data[2] > 0:
        print(data)
        counter += 1

print(counter)

('artery 0', 'artery 8', 1.0)
('organ 1', 'organ 30', 0.5797386715376658)
('nerve 2', 'nerve 29', 0.5797386715376658)
('tissue 3', 'tissue 23', 0.5797386715376658)
('nerve 4', 'nerve 29', 0.5797386715376658)
('nerve 5', 'nerve 29', 0.5797386715376658)
('organ 6', 'organ 30', 0.5797386715376658)
('membrane 7', 'membrane 2', 1.0)
('muscle 8', 'muscle 4', 1.0)
('cell 9', 'cell 9', 1.0)
('organ 10', 'artery 10', 0.33609692727625756)
('tissue 11', 'artery 11', 0.33609692727625756)
('tissue 12', 'membrane 12', 0.33609692727625756)
('tissue 13', 'muscle 13', 0.33609692727625756)
('tissue 14', 'membrane 14', 0.33609692727625756)
('muscle 15', 'muscle 4', 0.5797386715376658)
('cell 16', 'cell 7', 0.5797386715376658)
('cell 17', 'cell 17', 1.0000000000000002)
('cell 18', 'cell 7', 0.5797386715376658)
('muscle 19', 'muscle 4', 0.5797386715376658)
('cell 20', 'cell 7', 0.5797386715376658)
('gland 21', 'gland 0', 0.5797386715376658)
('muscle 22', 'muscle 22', 1.0000000000000002)
('gland 23', 'gland

Output format example:
```
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment" 
	 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
	 xmlns:xsd="http://www.w3.org/2001/XMLSchema#">

<Alignment>
<xml>yes</xml>
<level>0</level>
<type>??</type>

<map>
	<Cell>
		<entity1 rdf:resource="http://mouse.owl#MA_0002401"/>
		<entity2 rdf:resource="http://human.owl#NCI_C52561"/>
		<measure rdf:datatype="xsd:float">1.0</measure>
		<relation>=</relation>
	</Cell>
</map>
```