In [1]:
# TODO from the professor's feedback
# Match only by labels (no other matching needed: no properties, superclasses, etc.)
# Save the labels in a dict with the class URI as key and the label as value; if memory is not a problem, also keep a second dict with the label as key and the class URI as value
# A runtime of 5 - 10 min for the human and mouse ontologies is reasonable

# Output: in rdf format (see reference_anatomy) just put final matches in there above the threshold (defined by the user). As relation use owl:equivalentClass
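
The output requirement noted above (only matches above a user-defined threshold, related by `owl:equivalentClass`) could be sketched as follows. This is a minimal sketch, not the final writer: it emits N-Triples as one simple RDF serialization and does not reproduce the layout of `reference_anatomy`, which is not shown here. The function name `matches_to_ntriples` is hypothetical; the record layout `[label1, class2_uri, label2, score]` is the one used by the matching results later in this notebook.

```python
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"


def matches_to_ntriples(matches, threshold):
    """Serialize matches as N-Triples, one owl:equivalentClass statement per line.

    matches: dict mapping class1_uri -> [label1, class2_uri, label2, score]
    Entries without a target class or with a score below the threshold are skipped.
    """
    lines = []
    for class1_uri, (_, class2_uri, _, score) in matches.items():
        if class2_uri and score >= threshold:
            lines.append(f"<{class1_uri}> <{OWL_EQUIVALENT_CLASS}> <{class2_uri}> .")
    return "\n".join(lines)


example_matches = {
    "http://mouse.owl#MA_0000309": ["vertebra", "http://human.owl#NCI_C12933", "vertebra", 1],
    "http://mouse.owl#MA_0001876": ["", "http://human.owl#NCI_C32637", "", 0.5903604626655579],
}
print(matches_to_ntriples(example_matches, threshold=0.9))
```

Note that a plain `owl:equivalentClass` triple carries no confidence score, so the score is only used for filtering in this sketch.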

In [2]:
#@TODO
# implement additional user inputs: thresholds, weighted average between methods, choose LLM etc.
# implement TF-IDF and cosine similarity

In [3]:
# Required libraries
!pip install sentence_transformers
!pip install pandas
!pip install rdflib
!pip install python-Levenshtein
!pip install scikit-learn


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip

# Project Description
The goal of the project is to develop a simple yet effective ontology alignment framework in Python that focuses on lexical similarity matching. The framework will utilize both string matching techniques and the semantic capabilities of large language models to identify potential alignments between entities (such as classes) in two different ontologies.

### Objectives
1. **Develop an ontology alignment framework** that can process and compare ontologies based on textual content.
2. **Implement lexical similarity matching** using both basic string matching techniques and advanced semantic analysis with embeddings from LLMs.
3. **Output alignments with confidence scores**, enabling users to understand and evaluate the quality and reliability of the suggested alignments.

### Steps to Perform

#### Step 1: Ontology Parsing
- **Goal**: Load and parse the ontologies to be aligned.
- **Tasks**:
  - Utilize libraries like `rdflib` or `owlready2` to read ontology files.
  - Extract relevant textual information (e.g., class names, labels, descriptions).

#### Step 2: Lexical Similarity Matching
This step is divided into two sub-steps: string matching and embeddings matching.

##### a. String Matching
- **Goal**: Implement direct and fuzzy string comparison techniques to find matches based on textual similarity.
- **Tasks**:
  - Perform normalization (e.g., lowercasing, removing special characters).
  - Use string comparison methods (exact match, substring search, edit distance).
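
To make the string matching tasks concrete, here is a minimal pure-Python sketch of normalization followed by an edit-distance comparison. The helper names `normalize`, `edit_distance` and `edit_similarity` are illustrative and not part of the framework; the notebook itself relies on the `Levenshtein` package and a slightly different normalization.

```python
import re


def normalize(label: str) -> str:
    """Lowercase, turn underscores and punctuation into spaces, collapse whitespace."""
    label = label.lower().replace("_", " ")
    label = re.sub(r"[^a-z0-9 ]+", " ", label)
    return " ".join(label.split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


left = normalize("Cervical_Vertebra 5.")
right = normalize("cervical vertebra 6")
print(left, "|", right, "|", round(edit_similarity(left, right), 3))
```

Dividing by the longer string length keeps scores comparable across short and long labels, which matters once a single threshold is applied to all pairs.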

##### b. Embeddings Matching Using LLMs
- **Goal**: Use the semantic context provided by LLMs to match terms based on their meanings.
- **Tasks**:
  - Generate embeddings for the textual content of each ontology using models from the Hugging Face Transformers library.
  - Calculate similarity scores between embeddings (e.g., using cosine similarity).
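
The similarity step can be illustrated without loading a model. The sketch below uses made-up 3-dimensional vectors in place of real embeddings (an assumption purely for illustration; real sentence embeddings have hundreds of dimensions) and picks the best-scoring label for each entry by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for real label embeddings
toy_embeddings_1 = {"outer ear": [0.9, 0.1, 0.0], "renal vein": [0.0, 0.8, 0.6]}
toy_embeddings_2 = {"external ear": [0.8, 0.2, 0.1], "kidney vein": [0.1, 0.7, 0.7]}

for label1, vec1 in toy_embeddings_1.items():
    best_label, best_score = max(
        ((label2, cosine(vec1, vec2)) for label2, vec2 in toy_embeddings_2.items()),
        key=lambda pair: pair[1],
    )
    print(f"{label1} -> {best_label} ({best_score:.3f})")
```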

#### Step 3: Combining and Filtering
- **Goal**: Aggregate results from both matching techniques and refine the output.
- **Tasks**:
  - Combine scores from string and embeddings matching.
  - Apply thresholds to filter out matches with low confidence.
  - Optionally, use simple structural checks to add confidence to matches (e.g., matched entities have similar parent classes).
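
One possible way to combine and filter the scores is a weighted average followed by a threshold. The sketch below assumes both methods return similarities in [0, 1] keyed by `(uri1, uri2)` pairs; the function name `combine_scores`, its parameters and the example URIs are hypothetical.

```python
def combine_scores(string_scores, embedding_scores, string_weight=0.5, threshold=0.8):
    """Weighted average of two score dictionaries keyed by (uri1, uri2) pairs.

    A pair missing from one dictionary contributes 0.0 for that method.
    Only pairs whose combined score reaches the threshold are returned.
    """
    combined = {}
    for pair in set(string_scores) | set(embedding_scores):
        score = (string_weight * string_scores.get(pair, 0.0)
                 + (1.0 - string_weight) * embedding_scores.get(pair, 0.0))
        if score >= threshold:
            combined[pair] = score
    return combined


# Hypothetical scores for illustration only
string_scores = {("m:1", "h:1"): 1.0, ("m:2", "h:2"): 0.5}
embedding_scores = {("m:1", "h:1"): 0.9, ("m:2", "h:2"): 0.7, ("m:3", "h:3"): 0.95}

print(combine_scores(string_scores, embedding_scores, string_weight=0.4, threshold=0.6))
```

Treating a missing score as 0.0 penalizes pairs seen by only one method; whether that is desirable depends on how the two matchers are run.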

#### Step 4: Output and Evaluation
- **Goal**: Output the alignment results and provide means for evaluation.
- **Tasks**:
  - Format the output in a structured way (e.g., JSON, CSV) that lists entity pairs and their matching scores.
  - If possible, evaluate the effectiveness using known benchmarks or test cases to calculate precision, recall, and F1-score.
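
For the evaluation task, precision, recall and F1 can be computed from two sets of entity pairs. This is a small sketch assuming the predicted and reference alignments are both available as `(uri1, uri2)` tuples; the function name and the example pairs are made up.

```python
def evaluate_alignment(predicted, reference):
    """Precision, recall and F1 for two collections of (uri1, uri2) pairs."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical alignments for illustration only
predicted = {("m:1", "h:1"), ("m:2", "h:2"), ("m:3", "h:3"), ("m:4", "h:9")}
reference = {("m:1", "h:1"), ("m:2", "h:2"), ("m:3", "h:3"), ("m:4", "h:4"), ("m:5", "h:5")}

precision, recall, f1 = evaluate_alignment(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```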

### Summary
The project is centered on creating a practical tool for ontology matching, focusing on textual content using both conventional and advanced NLP techniques. By combining string-based and semantic-based approaches, the framework aims to provide robust alignments that are supported by both literal and contextual text similarities. This dual approach enhances the capability of the alignment process, making it more flexible and potentially more accurate than using only one method.

In [1]:
# imports
import json
import rdflib
import pandas as pd
from collections import OrderedDict, defaultdict

**rdflib vs owlready2:**

Interchangeability: Given that OWL is an application of RDF, tools that can parse RDF/XML can generally handle .owl files, and vice versa, provided that the ontology-specific constructs are understood by the tool. This is why libraries like rdflib, which are capable of parsing RDF, are suitable for handling OWL files serialized in RDF/XML format.

Flexibility: Choosing to work with rdflib for general RDF handling and owlready2 for specific ontology manipulations where needed is a flexible approach. It allows you to leverage the strengths of both libraries—rdflib for its robust RDF manipulation and SPARQL querying capabilities, and owlready2 for its ontology-specific features like reasoning and direct manipulation of classes and properties.

# Load/Parse ontologies

In [2]:
# input paths
onto1_path_in = "test_ontologies/mouse.owl"
onto2_path_in = "test_ontologies/human.owl"


In [3]:
def load_ontology(file_path):
    """
    Loads an ontology from a given file path, which can be in RDF (.rdf) or OWL (.owl) format.

    Args:
    file_path (str): The file path to the ontology file.

    Returns:
    rdflib.Graph: A graph containing the ontology data.
    """
    # Create a new RDF graph
    graph = rdflib.Graph()

    # Bind some common namespaces to the graph
    namespaces = {
        "rdf": rdflib.namespace.RDF,
        "rdfs": rdflib.namespace.RDFS,
        "owl": rdflib.namespace.OWL,
        "xsd": rdflib.namespace.XSD
    }
    for prefix, namespace in namespaces.items():
        graph.namespace_manager.bind(prefix, namespace)

    # Attempt to parse the file
    try:
        graph.parse(file_path, format=rdflib.util.guess_format(file_path))
        print(f"Successfully loaded ontology from {file_path}")
    except Exception as e:
        print(f"Failed to load ontology from {file_path}: {e}")
        return None

    return graph

In [4]:
# load ontologies
onto1_graph = load_ontology(onto1_path_in)
onto2_graph = load_ontology(onto2_path_in)
print(onto1_graph, onto2_graph)

Successfully loaded ontology from test_ontologies/mouse.owl
Successfully loaded ontology from test_ontologies/human.owl
[a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'IOMemory']]. [a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'IOMemory']].


In [5]:
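def preprocess_label(label):
    """Normalize a label: underscores to spaces, trim spaces, commas and periods, lowercase."""
    if label is None:  # unlabeled classes come back as None from the OPTIONAL query clause
        return None
    return str(label).replace("_", " ").strip(" ,.").lower()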

### New approach: dicts instead of JSON

In [6]:
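def extract_ontology_details_to_dict(graph):
    """
    Collects the rdfs:label of every owl:Class in the graph.

    Returns an OrderedDict mapping each preprocessed label to its class URI and
    a list of the same labels in insertion order. Classes without a label are
    skipped; if two classes share a label, the first one returned by the query is kept.
    """
    # Query for classes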
    class_query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?class ?label ?label_dt ?label_lang
    WHERE {
        ?class rdf:type owl:Class.
        OPTIONAL { ?class rdfs:label ?label. BIND(datatype(?label) AS ?label_dt) BIND(lang(?label) AS ?label_lang) }
    }
    """
    classes = graph.query(class_query)
    ontology_labels_dict = OrderedDict()
    labels_list = []

    # Process class results
    for row in classes:
        class_uri, label, label_dt, label_lang = row
        class_key = str(class_uri)
        label_str = preprocess_label(label)

        if label_str is None or label_str == "none":
            continue

        if label_str not in ontology_labels_dict:
            ontology_labels_dict[label_str] = class_key
            labels_list.append(label_str)

    return ontology_labels_dict, labels_list

In [7]:
onto1_dict, onto1_list = extract_ontology_details_to_dict(onto1_graph)
onto2_dict, onto2_list = extract_ontology_details_to_dict(onto2_graph)

In [8]:
onto1_dict

OrderedDict([('cervical vertebra 5', 'http://mouse.owl#MA_0001425'),
             ('vertebra', 'http://mouse.owl#MA_0000309'),
             ('forelimb bone', 'http://mouse.owl#MA_0000612'),
             ('thorax muscle', 'http://mouse.owl#MA_0000561'),
             ('spinal cord central cervical nucleus',
              'http://mouse.owl#MA_0001127'),
             ('respiratory system basement membrane',
              'http://mouse.owl#MA_0001815'),
             ('bronchial artery', 'http://mouse.owl#MA_0001923'),
             ('seminal vesicle secretion', 'http://mouse.owl#MA_0002526'),
             ('glossopharyngeal ix ganglion', 'http://mouse.owl#MA_0001077'),
             ('left hepatico-cardiac vein', 'http://mouse.owl#MA_0002163'),
             ('endolymph', 'http://mouse.owl#MA_0002528'),
             ('small saphenous vein', 'http://mouse.owl#MA_0002217'),
             ('outer ear', 'http://mouse.owl#MA_0000258'),
             ('omental bursa superior recess', 'http://mouse.owl

In [9]:
onto1_list

['cervical vertebra 5',
 'vertebra',
 'forelimb bone',
 'thorax muscle',
 'spinal cord central cervical nucleus',
 'respiratory system basement membrane',
 'bronchial artery',
 'seminal vesicle secretion',
 'glossopharyngeal ix ganglion',
 'left hepatico-cardiac vein',
 'endolymph',
 'small saphenous vein',
 'outer ear',
 'omental bursa superior recess',
 'magnocellular nucleus',
 'vestibulocochlear viii nerve vestibular component',
 'cochlear viii nucleus',
 'gingiva',
 'transverse facial artery',
 'mammary gland sebaceous gland',
 'posterior semicircular duct',
 'gall bladder serosa',
 'cleidooccipital',
 'maxillary artery',
 'female urethral meatus',
 'lateral dorsal digital vein 05',
 'parasympathetic ganglion',
 'intestine mucosa',
 'abdominal mammary gland',
 'spinal cord pia mater',
 'left hepatic duct',
 'hair cuticle',
 'renal vein',
 'autopod joint',
 'thymus subcapsular epithelium',
 'trigeminal v ganglion',
 'hand digit 2',
 'trapezium',
 'lower back skin',
 'patella',
 'ca

## 3. Matching

### 3.1. String Matching

In [10]:
def exact_string_match(onto1_dict, onto1_list, onto2_dict, onto2_list):
    """
    Finds labels that occur in both ontologies and removes them from the inputs.

    Note: the dictionaries and lists passed in are modified in place.
    """
    exact_matches = {}
    matched_labels = set()
    onto2_labels = set(onto2_list)  # set lookup replaces the inner loop over onto2_list

    for label in onto1_list:
        if label in onto2_labels:
            # Creating the formatted match entry
            exact_matches[onto1_dict[label]] = [label, onto2_dict[label], label, 1]
            # Tracking matched labels for later removal
            matched_labels.add(label)

    # Remove matched labels from lists and dictionaries
    for label in matched_labels:
        onto1_list.remove(label)
        del onto1_dict[label]
        onto2_list.remove(label)
        del onto2_dict[label]

    return exact_matches, onto1_dict, onto1_list, onto2_dict, onto2_list


In [11]:
exact_matches, onto1_dict_after_exact, onto1_list_after_exact, onto2_dict_after_exact, onto2_list_after_exact = exact_string_match(onto1_dict, onto1_list, onto2_dict, onto2_list)
print(len(exact_matches))
exact_matches

933


{'http://mouse.owl#MA_0000309': ['vertebra',
  'http://human.owl#NCI_C12933',
  'vertebra',
  1],
 'http://mouse.owl#MA_0001923': ['bronchial artery',
  'http://human.owl#NCI_C32230',
  'bronchial artery',
  1],
 'http://mouse.owl#MA_0002526': ['seminal vesicle secretion',
  'http://human.owl#NCI_C52552',
  'seminal vesicle secretion',
  1],
 'http://mouse.owl#MA_0002528': ['endolymph',
  'http://human.owl#NCI_C32513',
  'endolymph',
  1],
 'http://mouse.owl#MA_0000342': ['gingiva',
  'http://human.owl#NCI_C32677',
  'gingiva',
  1],
 'http://mouse.owl#MA_0002068': ['transverse facial artery',
  'http://human.owl#NCI_C53025',
  'transverse facial artery',
  1],
 'http://mouse.owl#MA_0002276': ['cleidooccipital',
  'http://human.owl#NCI_C52895',
  'cleidooccipital',
  1],
 'http://mouse.owl#MA_0002469': ['parasympathetic ganglion',
  'http://human.owl#NCI_C52557',
  'parasympathetic ganglion',
  1],
 'http://mouse.owl#MA_0001133': ['spinal cord pia mater',
  'http://human.owl#NCI_C49800

For string matching we implement four different methods, which the user can choose via a parameter when calling the matching function.

The metrics we will use are:
- Levenshtein distance
- Jaccard similarity
- Cosine similarity (on token-count vectors)
- TF-IDF (cosine similarity on TF-IDF-weighted vectors)
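
To show how TF-IDF weighting changes a cosine comparison, here is a simplified pure-Python sketch. It uses `idf = ln(N / df) + 1` and whitespace tokenization, which is an assumption for illustration; scikit-learn's `TfidfVectorizer` applies smoothing and its own tokenizer, so its numbers will differ. The helper names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(labels):
    """Return one {token: weight} dict per label, with idf = ln(N / df) + 1."""
    tokenized = [label.split() for label in labels]
    # Document frequency: in how many labels each token occurs
    document_frequency = Counter(token for tokens in tokenized for token in set(tokens))
    total = len(labels)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            token: count * (math.log(total / document_frequency[token]) + 1.0)
            for token, count in counts.items()
        })
    return vectors


def sparse_cosine(u, v):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


labels = ["renal vein", "hepatic vein", "portal vein", "renal pelvis"]
vectors = tfidf_vectors(labels)
shares_common_token = sparse_cosine(vectors[0], vectors[1])  # both contain "vein" (in 3 labels)
shares_rarer_token = sparse_cosine(vectors[0], vectors[3])   # both contain "renal" (in 2 labels)
print(round(shares_common_token, 3), round(shares_rarer_token, 3))
```

With plain token counts both pairs would score the same, since each shares one of two tokens. The TF-IDF weighting ranks the pair sharing the rarer token higher, which is the main reason to offer it as a separate option.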

In [12]:
import Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score

def levenshtein_distance(str1, str2):
    return Levenshtein.distance(str1, str2)

def calc_cosine_similarity(str1, str2):
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform([str1, str2])
    return cosine_similarity(count_matrix)[0][1]

def jaccard_similarity(str1, str2):
    # Tokenize the strings into sets of words
    set1 = set(str1.split())
    set2 = set(str2.split())

    # Find the intersection and union of the two sets
    intersection = set1.intersection(set2)
    union = set1.union(set2)

    # Calculate the Jaccard score
    if not union:  # Handle the edge case where both strings might be empty
        return 0.0
    return len(intersection) / len(union)

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cosine_vectorize_labels(labels):
    """
    Converts a list of labels into token-count vectors using CountVectorizer.

    Args:
    labels (list): List of all labels from both ontologies.

    Returns:
    CountVectorizer, scipy.sparse.csr_matrix: The vectorizer and the count matrix.
    """
    vectorizer = CountVectorizer()
    count_matrix = vectorizer.fit_transform(labels)
    return vectorizer, count_matrix

def cosine_compare_labels(count_matrix, index1, index2):
    """
    Computes the cosine similarity between two labels based on their count vector indices.

    Args:
    count_matrix (scipy.sparse.csr.csr_matrix): The matrix containing the count vectors.
    index1, index2 (int): Indices of the labels to compare.

    Returns:
    float: Cosine similarity score.
    """
    return cosine_similarity(count_matrix[index1:index1+1], count_matrix[index2:index2+1])[0][0]


def execute_cosine_string_matching(label_list1, label_list2):
    # Combine labels and vectorize them
    all_labels = label_list1 + label_list2
    vectorizer, count_matrix = cosine_vectorize_labels(all_labels)

    # Example comparison between the first label of ontology 1 and the first label of ontology 2
    similarity_score = cosine_compare_labels(count_matrix, 0, len(label_list1))
    print(f"Similarity score between '{label_list1[0]}' and '{label_list2[0]}': {similarity_score}")

In [14]:
execute_cosine_string_matching(onto1_list_after_exact, onto2_list_after_exact)

Similarity score between 'cervical vertebra 5' and 'cortical column': 0.0


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_vectorize_labels(labels):
    """
    Converts a list of labels into TF-IDF vectors using TfidfVectorizer.

    Args:
    labels (list): List of all labels from both ontologies.

    Returns:
    TfidfVectorizer, scipy.sparse.csr.csr_matrix: The vectorizer and the TF-IDF matrix.
    """
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(labels)
    return vectorizer, tfidf_matrix

def tfidf_compare_labels(tfidf_matrix, index1, index2):
    """
    Computes the cosine similarity between two labels based on their TF-IDF vector indices.

    Args:
    tfidf_matrix (scipy.sparse.csr.csr_matrix): The matrix containing the TF-IDF vectors.
    index1, index2 (int): Indices of the labels to compare.

    Returns:
    float: Cosine similarity score.
    """
    return cosine_similarity(tfidf_matrix[index1:index1+1], tfidf_matrix[index2:index2+1])[0][0]


def execute_tfidf_string_matching(label_list1, label_list2):
    # Combine labels and vectorize them
    all_labels = label_list1 + label_list2
    vectorizer, tfidf_matrix = tfidf_vectorize_labels(all_labels)

    # Example comparison between the first label of ontology 1 and the first label of ontology 2
    similarity_score = tfidf_compare_labels(tfidf_matrix, 0, len(label_list1))
    print(f"Similarity score between '{label_list1[0]}' and '{label_list2[0]}': {similarity_score}")

In [16]:
execute_tfidf_string_matching(onto1_list_after_exact, onto2_list_after_exact)

Similarity score between 'cervical vertebra 5' and 'cortical column': 0.0


In [17]:
def execute_string_matching(metric, data1, data2):
    """
    Executes the selected matching metric on the provided data.

    Args:
    metric (str): Name of the metric to use:
                  'Levenshtein' for Levenshtein distance,
                  'Jaccard' for Jaccard similarity,
                  'LinkTransformer' for Link Transformer (not implemented yet).
    data1, data2 (str): The data strings to compare.

    Returns:
    result: The result of the chosen metric computation.
    """
    if metric == 'Levenshtein':
        return levenshtein_distance(data1, data2)
    elif metric == 'Jaccard':
        return jaccard_similarity(data1, data2)
    elif metric == 'LinkTransformer':
        raise NotImplementedError("LinkTransformer matching is not implemented yet")  # TODO implement or remove
    else:
        raise ValueError("Invalid metric selection")

In [18]:
def match_ontologies(onto1_dict, onto1_list, onto2_dict, onto2_list, metric, bidirectional=False):
    labels_already_tested_labels = {} # dict that stores which labels of ontology 2 were already tested for each label of ontology 1 => necessary to avoid an infinite loop

    for label in onto1_list:
        labels_already_tested_labels[label] = []

    onto2_used_classes = {}

    if metric == "Cosine":
        execute_cosine_string_matching(onto1_list, onto2_list)
        # TODO extend so it return scores etc. and also can be mapped to nodes for next step
    elif metric == "TF-IDF":
        execute_tfidf_string_matching(onto1_list, onto2_list)
        # TODO extend so it return scores etc. and also can be mapped to nodes for next step
    else:
        class_results = {}

        while onto1_list: # loop over labels of ontology 1 until empty
            #print(len(onto1_list))
            label1 = onto1_list.pop() # remove the last element in the list => removing the last (instead of first) makes things easier and less error prone
            # labels that got added again cause a better match was found (see later step) will be appended to the end and therefore handled immediately

            # Match from Ontology 1 to Ontology 2
            label_result = [label1, "", "", 0]
            best_score = 0
            already_tested_labels = labels_already_tested_labels[label1]
            for label2 in onto2_list:
                if label2 not in already_tested_labels: # check that label wasn't already checked in previous run
                    matching_score = execute_string_matching(metric, label1, label2) # calculate string matching score
                    if metric == 'Levenshtein': # convert the edit distance into a similarity in [0, 1] so that higher is better for every metric
                        matching_score = 1 - matching_score / max(len(label1), len(label2), 1)
                    # If a perfect match is found, stop iterating over labels for this entry
                    if matching_score == 1: # handle perfect match
                        best_score = matching_score
                        label_result = [label1, "", label2, best_score]
                        break # stop searching for matches because a perfect match was found
                    # Check whether this score beats the best one found so far
                    if matching_score > best_score: # handle higher score than before
                        label_result[2] = label2
                        best_score = matching_score

            label_result[3] = best_score # save best score in label_result
            label_with_best_score = label_result[2] # get label that achieved the best score

            class_uri = onto1_dict[label1] # get the class_uri of the currently checked label in ontology 1
            if label_result[3] == 0 and label_with_best_score == '': # handle if no match was found
                class_results[class_uri] = label_result
            else:
                class2_uri = onto2_dict[label_with_best_score] # get class_uri of the label with best match
                label_result[1] = class2_uri # save class_uri instead of label => TODO maybe change to not manipulate label_result as it is confusing for later steps

                if class2_uri not in onto2_used_classes: # check if class found of ontology 2 is NOT already used by other class in ontology 1
                    if class_uri not in class_results: # handle no entry exists yet for that class
                        class_results[class_uri] = label_result
                        onto2_used_classes[class2_uri] = class_uri
                        labels_already_tested_labels[label1].append(label_with_best_score)
                    elif label_result[3] > class_results[class_uri][3]: # handle entry exist but now higher score was found with another label of the class (handles multiple labels)
                            class_results[class_uri] = label_result
                            onto2_used_classes[class2_uri] = class_uri
                            labels_already_tested_labels[label1].append(label_with_best_score)
                else: # class of ontology 2 already in use
                    result_current_class_in_use = onto2_used_classes[class2_uri] # get class uri of class that uses that class of ontology 2
                    if label_result[3] > class_results[result_current_class_in_use][3]: # if score of the new found match is higher than the current assigned one
                        class_results[class_uri] = label_result # set the class of ontology 2 to that current class
                        onto2_used_classes[class2_uri] = class_uri # overwrite the use of that class to new class of ontology 1
                        old_used_label = class_results[result_current_class_in_use][0]
                        labels_already_tested_labels[old_used_label].append(label_with_best_score)
                        onto1_list.append(old_used_label) # re-add the previously matched label to the list being iterated, as it no longer has a match
                        class_results[result_current_class_in_use] = ["", "", "", 0] # reset the result of the earlier class (it could also be removed, but this way the no-match case can be handled later)
                    else: # handle not a higher score
                        labels_already_tested_labels[label1].append(label_with_best_score) # add the label to the already_tested_labels
                        onto1_list.append(label1) # re-append the current label so it is handled again with the updated already_tested_labels

        # TODO implement bidirectional matching

        # OLD_TO-DO important: currently it takes the best match found for the current class of the ontology 1.
        # But it doesnt take into account if a later class has a higher score with that class and therefore would be better suited
        # Solution: make dict and always take one element that gets removed. If later another element matches with the class matched with
        # the previous element the earlier removed element get added again and the value of that elements gets assigned to the higher value element

        return class_results
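
The conflict handling above (re-queuing a label when a later one scores higher for the same target class) can also be approached differently. As a hedged alternative sketch, not the method used in this notebook: if all pair scores are available up front, sorting them once and accepting pairs greedily yields a one-to-one matching without re-queuing. The function name `greedy_one_to_one` and the example URIs are hypothetical, and scores are assumed to be similarities where higher is better.

```python
def greedy_one_to_one(scores, threshold=0.0):
    """Pick non-conflicting pairs in descending score order.

    scores: dict mapping (uri1, uri2) -> similarity, higher is better.
    Returns a dict uri1 -> (uri2, score) in which every uri2 appears at most once.
    """
    matches = {}
    used_targets = set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (uri1, uri2), score in ranked:
        if score < threshold:
            break  # all remaining pairs score lower
        if uri1 in matches or uri2 in used_targets:
            continue  # one side is already matched with a higher-scoring partner
        matches[uri1] = (uri2, score)
        used_targets.add(uri2)
    return matches


# Hypothetical scores for illustration only
example_scores = {
    ("m:1", "h:1"): 0.8,
    ("m:2", "h:1"): 0.9,
    ("m:1", "h:2"): 0.6,
    ("m:2", "h:2"): 0.5,
}
print(greedy_one_to_one(example_scores))
```

The trade-off is memory: this needs every pair score at once, whereas the loop above only keeps the best candidate per label. Greedy selection is also not guaranteed to maximize the total score across all matches.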


In [19]:
string_matching_results = match_ontologies(onto1_dict_after_exact, onto1_list_after_exact, onto2_dict_after_exact, onto2_list_after_exact, 'Jaccard')

In [20]:
print(len(onto1_dict))
print(len(string_matching_results))
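# TODO one element is always missing: the length of onto_1 is 2739 but there are 2738 resulting matches (1805 vs. 1804 in the output below). Same for the LLM matching.
# Likely cause: onto1_dict is keyed by label while the results are keyed by class URI, so a class with two different labels produces only one result entry.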

1805
1804


In [21]:
import itertools
import time
from sentence_transformers import SentenceTransformer, util

def calculate_label_similarity_llm(model_name, onto1_dict, onto2_dict):
  """
  Calculate the cosine similarity between every pair of labels from two ontologies.

  Parameters:
  model_name (str): Name of the Sentence Transformer model to be used.
  onto1_dict (OrderedDict): Dictionary where keys are labels and values are class URIs for the first ontology.
  onto2_dict (OrderedDict): Dictionary where keys are labels and values are class URIs for the second ontology.

  Returns:
  dict: A nested dictionary. Each key is a class URI from the first ontology; each value is a dictionary that maps
        class URIs from the second ontology to their similarity score, sorted by descending score.
  """
  model = SentenceTransformer(model_name, device='cpu')
  onto1_labels, onto1_classes = zip(*onto1_dict.items())
  onto2_labels, onto2_classes = zip(*onto2_dict.items())

  onto1_label_embeddings = model.encode(list(onto1_labels), convert_to_tensor=True)
  onto2_label_embeddings = model.encode(list(onto2_labels), convert_to_tensor=True)

  similarity_scores = util.pytorch_cos_sim(onto1_label_embeddings, onto2_label_embeddings)

  # Initialize the dictionary to hold results
  results_dict = {}

  # Fill the dictionary with similarity scores
  for i, onto1_class in enumerate(onto1_classes):
    results_dict[onto1_class] = {}
    for j, onto2_class in enumerate(onto2_classes):
      results_dict[onto1_class][onto2_class] = similarity_scores[i][j].item()

  # Sort the dictionary entries by similarity score within each onto1_class
  sorted_results_dict = {}
  for onto1_class in results_dict:
    sorted_onto2_classes = sorted(results_dict[onto1_class].items(), key=lambda x: x[1], reverse=True)
    sorted_results_dict[onto1_class] = dict(sorted_onto2_classes)

  return sorted_results_dict

# Record start time
start_time = time.time()

dict_similarity_scores_llm = calculate_label_similarity_llm("sentence-transformers/all-MiniLM-L12-v2", onto1_dict_after_exact, onto2_dict_after_exact)

# Record end time
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

print(f"Time taken: {elapsed_time} seconds")



Time taken: 19.52402901649475 seconds


In [22]:
# Print the first entries from the dictionary
for onto1_class in list(dict_similarity_scores_llm.keys())[:20]:
    print(f"{onto1_class}: {dict_similarity_scores_llm[onto1_class]}")

http://mouse.owl#MA_0001425: {'http://human.owl#NCI_C32243': 0.8343715071678162, 'http://human.owl#NCI_C33502': 0.826393187046051, 'http://human.owl#NCI_C32903': 0.8174307942390442, 'http://human.owl#NCI_C33727': 0.8165282607078552, 'http://human.owl#NCI_C32244': 0.7676405906677246, 'http://human.owl#NCI_C32242': 0.7669017314910889, 'http://human.owl#NCI_C32902': 0.7607812285423279, 'http://human.owl#NCI_C32239': 0.7522039413452148, 'http://human.owl#NCI_C33501': 0.7520723342895508, 'http://human.owl#NCI_C33726': 0.7488871812820435, 'http://human.owl#NCI_C33728': 0.7460527420043945, 'http://human.owl#NCI_C32240': 0.7334729433059692, 'http://human.owl#NCI_C32245': 0.7288417816162109, 'http://human.owl#NCI_C33730': 0.7245526909828186, 'http://human.owl#NCI_C33868': 0.7221161127090454, 'http://human.owl#NCI_C12892': 0.7200425267219543, 'http://human.owl#NCI_C32241': 0.7161644697189331, 'http://human.owl#NCI_C32901': 0.7126443386077881, 'http://human.owl#NCI_C32174': 0.7107817530632019, 'h

In [23]:
def set_new_match(class1_uri, class2_uri, score, onto2_used_classes, class_results):
    """Set a new match for class1_uri and class2_uri."""
    onto2_used_classes[class2_uri] = class1_uri
    class_results[class1_uri] = ["", class2_uri, "", score]  # label is empty for now

def update_matching(new_class1_uri, class2_uri, new_score, old_class1_uri, onto2_used_classes, class_results, onto1_class_list):
    """Update matches when a better score is found."""
    onto2_used_classes[class2_uri] = new_class1_uri
    class_results[old_class1_uri] = ["", "", "", 0]  # Clear old match
    class_results[new_class1_uri] = ["", class2_uri, "", new_score]
    onto1_class_list.append(old_class1_uri)

def reevaluate(class1_uri, class2_uri, already_tested_classes, onto1_class_list):
    """Re-add class1_uri for re-evaluation."""
    onto1_class_list.append(class1_uri)
    already_tested_classes[class1_uri].add(class2_uri)

def perform_matching_llm(dict_similarity_scores_llm):
    # Initialize dictionaries and lists
    already_tested_classes = {}
    class_results = {}
    onto2_used_classes = {}
    onto1_class_list = list(dict_similarity_scores_llm.keys())

    while onto1_class_list:
        class1_uri = onto1_class_list.pop()
        already_tested_classes[class1_uri] = already_tested_classes.get(class1_uri, set())

        # Iterate over each class2_uri and score from the pre-sorted dictionary
        for class2_uri, score in dict_similarity_scores_llm[class1_uri].items():
            if class2_uri not in already_tested_classes[class1_uri]:
                already_tested_classes[class1_uri].add(class2_uri)  # Mark this class2_uri as tested

                if score >= 0.99:  # Check for a perfect match
                    set_new_match(class1_uri, class2_uri, score, onto2_used_classes, class_results)
                    break  # Found a perfect match, skip further checks for this class1_uri

                # If no perfect match, check if it's not already linked
                if class2_uri not in onto2_used_classes:
                    set_new_match(class1_uri, class2_uri, score, onto2_used_classes, class_results)
                    break  # Successfully linked, no need to continue

                # If already linked, check if the new score is better
                elif score > class_results[onto2_used_classes[class2_uri]][3]:
                    old_class1_uri = onto2_used_classes[class2_uri]
                    update_matching(class1_uri, class2_uri, score, old_class1_uri, onto2_used_classes, class_results, onto1_class_list)
                    break  # Updated the link, no need to continue
            else:
                # This class2_uri was already checked, continue to the next
                continue

    return class_results

# Call function
matched_results_llm = perform_matching_llm(dict_similarity_scores_llm)

In [24]:
# Print the first entries from the dictionary
for onto1_class in list(matched_results_llm.keys())[:20]:
    print(f"{onto1_class}: {matched_results_llm[onto1_class]}")

http://mouse.owl#MA_0001876: ['', 'http://human.owl#NCI_C32637', '', 0.5903604626655579]
http://mouse.owl#MA_0002736: ['', 'http://human.owl#NCI_C32189', '', 0.6978012919425964]
http://mouse.owl#MA_0001317: ['', 'http://human.owl#NCI_C33319', '', 0.5318225622177124]
http://mouse.owl#MA_0002026: ['', 'http://human.owl#NCI_C32789', '', 0.7148200869560242]
http://mouse.owl#MA_0000893: ['', 'http://human.owl#NCI_C12689', '', 0.5596039891242981]
http://mouse.owl#MA_0000814: ['', 'http://human.owl#NCI_C49331', '', 0.7378122210502625]
http://mouse.owl#MA_0000688: ['', 'http://human.owl#NCI_C32350', '', 0.7734771966934204]
http://mouse.owl#MA_0000552: ['', 'http://human.owl#NCI_C13041', '', 0.6005765795707703]
http://mouse.owl#MA_0000569: ['', 'http://human.owl#NCI_C33164', '', 0.7628939151763916]
http://mouse.owl#MA_0001941: ['', 'http://human.owl#NCI_C32087', '', 0.5738374590873718]
http://mouse.owl#MA_0001323: ['', 'http://human.owl#NCI_C32856', '', 0.7865316867828369]
http://mouse.owl#MA_0

In [25]:
def transform_dict(original_dict):
    """
    Transforms a dictionary where URLs are the values into a dictionary where
    URLs are the keys and the values are concatenated labels associated with each URL.

    Args:
    original_dict (OrderedDict): The original dictionary with labels as keys and URLs as values.

    Returns:
    dict: A dictionary with URLs as keys and concatenated labels as values.
    """
    new_dict = {}
    for label, url in original_dict.items():
        if url in new_dict:
            new_dict[url] += " " + label  # Concatenating labels with space; change as needed
        else:
            new_dict[url] = label
    return new_dict

onto1_transformed_dict = transform_dict(onto1_dict)
onto2_transformed_dict = transform_dict(onto2_dict)
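
As a quick check of `transform_dict`, the sketch below repeats the function and runs it on a small label-to-URI dictionary. The labels and `http://example.org/onto#...` URIs are made up for illustration and are not taken from the ontologies.

```python
def transform_dict(original_dict):
    """Invert a label -> URI dict into a URI -> space-joined labels dict."""
    new_dict = {}
    for label, url in original_dict.items():
        if url in new_dict:
            new_dict[url] += " " + label
        else:
            new_dict[url] = label
    return new_dict

# Hypothetical input: two labels point to the same class URI
toy_labels = {
    "heart": "http://example.org/onto#C1",
    "cardiac organ": "http://example.org/onto#C1",
    "lung": "http://example.org/onto#C2",
}

toy_transformed = transform_dict(toy_labels)
print(toy_transformed)
```

Because labels are joined with a space, a class with several labels ends up with a single combined string, which is worth keeping in mind when a token-based score is later computed on it.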

In [26]:
# Add labels
def add_labels(data, onto1_label, onto2_label):
    # Iterate through each key and update the list with labels
    for key, values in data.items():
        # First empty string in the list gets replaced by the label from onto1_label if available
        label1 = onto1_label.get(key)
        if label1 is not None:
            values[0] = label1

        # Second empty string in the list gets replaced by the label from onto2_label using the second entry's key if available
        label2_key = values[1]  # The second entry in the list is assumed to be a key for the onto2_label
        label2 = onto2_label.get(label2_key)
        if label2 is not None:
            values[2] = label2

    return data

matched_results_llm_with_labels = add_labels(matched_results_llm, onto1_transformed_dict, onto2_transformed_dict)

In [27]:
# Print the first entries from the dictionary
for onto1_class in list(matched_results_llm.keys())[:20]:
    print(f"{onto1_class}: {matched_results_llm[onto1_class]}")

http://mouse.owl#MA_0001876: ['right ventricle membranous part', 'http://human.owl#NCI_C32637', 'frontal horn of the lateral ventricle', 0.5903604626655579]
http://mouse.owl#MA_0002736: ['myometrium outer longitudinal layer', 'http://human.owl#NCI_C32189', 'basal layer of the endometrium', 0.6978012919425964]
http://mouse.owl#MA_0001317: ['retina blood vessel', 'http://human.owl#NCI_C33319', 'photosensitive region of the retina', 0.5318225622177124]
http://mouse.owl#MA_0002026: ['profunda brachii artery', 'http://human.owl#NCI_C32789', 'inferior profunda artery', 0.7148200869560242]
http://mouse.owl#MA_0000893: ['caudate-putamen', 'http://human.owl#NCI_C12689', 'cauda equina', 0.5596039891242981]
http://mouse.owl#MA_0000814: ['brain arachnoid matter', 'http://human.owl#NCI_C49331', 'cerebral arachnoid membrane', 0.7378122210502625]
http://mouse.owl#MA_0000688: ['limb bone', 'http://human.owl#NCI_C32350', 'common bony limb', 0.7734771966934204]
http://mouse.owl#MA_0000552: ['chest organ

## Combining & filtering

In [28]:
counter = 0
# Print and count all LLM matches with a similarity score below 0.6
for onto1_class in list(matched_results_llm_with_labels.keys()):
    if matched_results_llm_with_labels[onto1_class][3] < 0.6:
        counter += 1
        print(f"{onto1_class}: {matched_results_llm_with_labels[onto1_class]}")
        
counter

http://mouse.owl#MA_0001876: ['right ventricle membranous part', 'http://human.owl#NCI_C32637', 'frontal horn of the lateral ventricle', 0.5903604626655579]
http://mouse.owl#MA_0001317: ['retina blood vessel', 'http://human.owl#NCI_C33319', 'photosensitive region of the retina', 0.5318225622177124]
http://mouse.owl#MA_0000893: ['caudate-putamen', 'http://human.owl#NCI_C12689', 'cauda equina', 0.5596039891242981]
http://mouse.owl#MA_0001941: ['digital artery', 'http://human.owl#NCI_C32087', 'anterior communicating artery', 0.5738374590873718]
http://mouse.owl#MA_0000021: ['abdomen/pelvis/perineum', 'http://human.owl#NCI_C12321', 'appendage of the uterus', 0.5538153648376465]
http://mouse.owl#MA_0002643: ['penile urethra', 'http://human.owl#NCI_C25177', 'genitalia', 0.5946835279464722]
http://mouse.owl#MA_0000558: ['thorax blood vessel', 'http://human.owl#NCI_C33677', 'superior hemorrhoidal artery', 0.4682205021381378]
http://mouse.owl#MA_0001474: ['supraoccipital bone', 'http://human.ow

693

In [29]:
# Start from the exact (perfect) matches; the remaining matches are added to this dict below
final_matching_results = exact_matches.copy()

In [30]:
# Backups of the agreeing entries; only for easier testing, can be removed in the end
backup_string_matching_results = {}
backup_matched_results_llm_with_labels = {}

# Find the classes for which string matching and LLM matching chose the same class2
counter = 0
overlapping_results_keys = []
for class_name, values in string_matching_results.items():
    class_2 = values[1]
    if class_2:
        if matched_results_llm_with_labels[class_name][1] == class_2:
            backup_string_matching_results[class_name] = values
            backup_matched_results_llm_with_labels[class_name] = matched_results_llm_with_labels[class_name]
            higher_score = values[3]
            label_higher_score = values[2]
            if matched_results_llm_with_labels[class_name][3] > higher_score:
                higher_score = matched_results_llm_with_labels[class_name][3]
                label_higher_score = matched_results_llm_with_labels[class_name][2]
            final_matching_results[class_name] = [values[0], values[1], label_higher_score, higher_score]
            overlapping_results_keys.append(class_name)
            counter += 1
        
counter
print(len(final_matching_results))

1329


In [31]:
# Remove overlapping keys
for key in overlapping_results_keys:
    string_matching_results.pop(key, None)
    matched_results_llm_with_labels.pop(key, None)

In [32]:
# Count how many of the matches on which string and LLM matching agree have a score above 0.7 => almost all of them
counter = 0
for class_name, values in final_matching_results.items():
    if values[3] > 0.7:
        counter += 1
        # print(final_matching_results[class_name])
        
print(len(final_matching_results))
print(counter)

1329
1261


In [33]:
# Calculate the score for a given dict of matched classes.
# This lets us compute the string matching score for the LLM matches and vice versa.
def calc_score_for_matched_classes(matched_classes, metric, dict_sim_scores_llm=None):
    # Avoid a mutable default argument; fall back to an empty dict
    if dict_sim_scores_llm is None:
        dict_sim_scores_llm = {}
    matches_with_score = {}
    for class_name, values in matched_classes.items():
        label1 = values[0]
        class_2 = values[1]
        label2 = values[2]
        if label2:
            matching_score = 0
            if metric == "llm":
                # Look up the precomputed LLM similarity; the score stays 0 if class1 is unknown
                if class_name in dict_sim_scores_llm:
                    matching_score = dict_sim_scores_llm[class_name][class_2]
            else:
                matching_score = execute_string_matching(metric, label1, label2)  # calculate string matching score
            matches_with_score[class_name] = [label1, class_2, label2, matching_score]
        else:
            # No class2 label available: keep the entry with an empty label and a score of 0
            matches_with_score[class_name] = [label1, class_2, "", 0]

    return matches_with_score

In [34]:
string_matches_for_llm = calc_score_for_matched_classes(matched_results_llm_with_labels, "Jaccard")
llm_matches_for_string = calc_score_for_matched_classes(string_matching_results, "llm", dict_similarity_scores_llm)

**In total we now have four lists**:
- the best 1-to-1 matches from String matching
- the best 1-to-1 matches from LLM matching
- the 1-to-1 matches from LLM matching, rescored with the String matching score
- the 1-to-1 matches from String matching, rescored with the LLM matching score

We **need all four lists, as otherwise the results of the two methods cannot be merged**. The two methods (String and LLM) often match the same class to different targets. <br><br>For example:
- String matching: http://mouse.owl#MA_0001230 matched to http://human.owl#NCI_C32861
- LLM matching: http://mouse.owl#MA_0001230 matched to http://human.owl#NCI_C33192

To merge them, we score the matches of each method with the other method as well, which results in four lists.
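
To make the role of the four lists concrete, here is a minimal sketch for the single class `http://mouse.owl#MA_0001230`. The four entries are copied from the outputs printed in the cells below (where the String match for this class is `NCI_C33290`); the dictionary keys and variable names are only illustrative.

```python
label1 = "extrinsic auricular muscle"

# Each entry has the notebook's layout: [label1, class2, label2, score]
candidates = {
    "string match, string score": [label1, "http://human.owl#NCI_C33290", "pelvic floor muscle", 0.2],
    "LLM match, string score": [label1, "http://human.owl#NCI_C33192", "obturator internus muscle", 0.2],
    "LLM match, LLM score": [label1, "http://human.owl#NCI_C33192", "obturator internus muscle", 0.6077026724815369],
    "string match, LLM score": [label1, "http://human.owl#NCI_C33290", "pelvic floor muscle", 0.4580720067024231],
}

# Pick the candidate with the highest score, as the combining step does per class
best_source, best_entry = max(candidates.items(), key=lambda item: item[1][3])
print(best_source, best_entry[1], best_entry[3])
```

For this class the LLM match with its own LLM score wins, so `NCI_C33192` would be chosen unless it is already taken by another class.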

#### Final String matching lists

In [35]:
# Count the classes without a string match and print the entries for the example class and for NCI_C33192
empty_counter = 0
for onto1_class in list(string_matching_results.keys()):
    if (string_matching_results[onto1_class][1] == ""):
        empty_counter += 1
    if onto1_class == "http://mouse.owl#MA_0001230":
        print(f"{onto1_class}: {string_matching_results[onto1_class]}")
    if string_matching_results[onto1_class][1] == "http://human.owl#NCI_C33192":
        print(f"{onto1_class}: {string_matching_results[onto1_class]}")

http://mouse.owl#MA_0002326: ['intercostales internus', 'http://human.owl#NCI_C33192', 'obturator internus muscle', 0.25]
http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C33290', 'pelvic floor muscle', 0.2]


In [36]:
empty_counter

307

In [37]:
# Print the LLM matches rescored with string matching for the two example classes
for onto1_class in list(string_matches_for_llm.keys()):
    if onto1_class == "http://mouse.owl#MA_0001230":
        print(f"{onto1_class}: {string_matches_for_llm[onto1_class]}")
    if onto1_class == "http://mouse.owl#MA_0002326":
        print(f"{onto1_class}: {string_matches_for_llm[onto1_class]}")

http://mouse.owl#MA_0002326: ['intercostales internus', 'http://human.owl#NCI_C12625', 'interneuron', 0.0]
http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C33192', 'obturator internus muscle', 0.2]


In [38]:
# http://mouse.owl#MA_0001230 matched to http://human.owl#NCI_C33192 as it has highest score for LLM
# http://mouse.owl#MA_0002326 matched to 

#### Final LLM matching lists

In [39]:
# Print the LLM matches (with LLM scores) for the two example classes
for onto1_class in list(matched_results_llm_with_labels.keys()):
    if onto1_class == "http://mouse.owl#MA_0001230":
        print(f"{onto1_class}: {matched_results_llm_with_labels[onto1_class]}")
    if onto1_class == "http://mouse.owl#MA_0002326":
        print(f"{onto1_class}: {matched_results_llm_with_labels[onto1_class]}")

http://mouse.owl#MA_0002326: ['intercostales internus', 'http://human.owl#NCI_C12625', 'interneuron', 0.6117238402366638]
http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C33192', 'obturator internus muscle', 0.6077026724815369]


In [40]:
# Print the string matches rescored with the LLM for the example class and for NCI_C33192
for onto1_class in list(llm_matches_for_string.keys()):
    if onto1_class == "http://mouse.owl#MA_0001230":
        print(f"{onto1_class}: {llm_matches_for_string[onto1_class]}")
    if llm_matches_for_string[onto1_class][1] == "http://human.owl#NCI_C33192":
        print(f"{onto1_class}: {llm_matches_for_string[onto1_class]}")

http://mouse.owl#MA_0002326: ['intercostales internus', 'http://human.owl#NCI_C33192', 'obturator internus muscle', 0.5459694862365723]
http://mouse.owl#MA_0001230: ['extrinsic auricular muscle', 'http://human.owl#NCI_C33290', 'pelvic floor muscle', 0.4580720067024231]


In [93]:
# Sort the class names of ontology 1 by their best score in descending order.
# This step is important before combining: the classes with the highest-scoring matches
# are handled first, so fewer conflicts occur for them.
onto1_class_names_by_score = {}

for onto1_class in list(string_matching_results.keys()):
    string_match_result = string_matching_results[onto1_class]
    string_match_result_for_llm = string_matches_for_llm[onto1_class]
    llm_match_result = matched_results_llm_with_labels[onto1_class]
    llm_match_result_for_string = llm_matches_for_string[onto1_class]
    
    entries = [string_match_result, string_match_result_for_llm, llm_match_result, llm_match_result_for_string]
    highest_score_entry = max(entries, key=lambda x: x[-1])
    onto1_class_names_by_score[onto1_class] = highest_score_entry

# Sort the classes based on the maximum score in descending order
onto1_class_names_by_score = sorted(list(onto1_class_names_by_score.items()), key=lambda x: x[1][3], reverse=True)

sorted_onto1_class_names_by_score = OrderedDict(onto1_class_names_by_score)

In [94]:
# This block takes the four lists and picks, for each class1, the best of its four candidate matches.
# It handles conflicts (class2 already used in an earlier match for a different class1).
# It is not perfect: if all candidates of a class1 conflict, the class is added with an empty result
# => as this affects only a few entries and handling them would add considerable complexity, we stick with these results for now


# This is called a greedy approach (for the report)
remaining_results = {} # here the remaining matches will be stored and in the end merged with the final_results
already_used_class2 = []
counter = 0
conflicts = {}

for onto1_class in list(sorted_onto1_class_names_by_score.keys()):
    string_match_result = string_matching_results[onto1_class]
    string_match_result_for_llm = string_matches_for_llm[onto1_class]
    llm_match_result = matched_results_llm_with_labels[onto1_class]
    llm_match_result_for_string = llm_matches_for_string[onto1_class]
    
    results = [string_match_result, string_match_result_for_llm, llm_match_result, llm_match_result_for_string]
    empty_class2_counter = 0
    
    while True:
        if len(results) == 0:
            break
        highest_score_entry = max(results, key=lambda x: x[-1])
        class2_highest_entry = highest_score_entry[1]
        if class2_highest_entry in already_used_class2: # class2 is already matched to another class1
            results.remove(highest_score_entry)
        else:
            break
    # if class2_highest_entry == "":
    #     print("No match found")
    # A conflict occurs if class2 is already in use for every candidate, or if no match was found
    # (in theory the latter never happens, as the LLM always returns some match)
    if len(results) == 0 or class2_highest_entry == "":
        counter += 1
        conflicts[onto1_class] = results
        remaining_results[onto1_class] = ["", "" ,"" ,0]
    else:
        already_used_class2.append(class2_highest_entry)
        remaining_results[onto1_class] = highest_score_entry
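
The greedy combination above can be sketched in isolation as follows. This is a simplified sketch, not the notebook's exact code: the function name `greedy_one_to_one`, the `(class2, score)` candidate tuples and the toy identifiers `m1`, `h1`, ... are assumptions made for illustration. It uses a `set` for the already used class2 entries, and it skips candidates with an empty class2 instead of treating them as a conflict.

```python
def greedy_one_to_one(candidates_per_class):
    """Greedy 1-to-1 assignment; candidates_per_class maps class1 -> [(class2, score), ...]."""
    # Process class1 entries in descending order of their best candidate score
    order = sorted(
        candidates_per_class,
        key=lambda c1: max(score for _, score in candidates_per_class[c1]),
        reverse=True,
    )
    used_class2 = set()  # a set gives constant-time membership tests
    assignment = {}
    for class1 in order:
        ranked = sorted(candidates_per_class[class1], key=lambda c: c[1], reverse=True)
        # Try this class's candidates from best to worst
        for class2, score in ranked:
            if class2 and class2 not in used_class2:
                used_class2.add(class2)
                assignment[class1] = (class2, score)
                break
        else:
            # Every candidate was already taken: record an empty result
            assignment[class1] = ("", 0)
    return assignment

# Hypothetical toy input
toy = {
    "m1": [("h1", 0.9), ("h2", 0.4)],
    "m2": [("h1", 0.8), ("h2", 0.7)],
    "m3": [("h1", 0.6)],
}
result = greedy_one_to_one(toy)
print(result)
```

In the toy input, `m1` takes `h1`, `m2` falls back to `h2`, and `m3` is left without a match. A greedy pass like this does not guarantee the assignment with the highest total score; it only guarantees that higher-scoring classes choose first.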

In [95]:
df_remaining_results = pd.DataFrame.from_dict(remaining_results, orient="index")
df_remaining_results.reset_index(inplace=True)
df_remaining_results.columns = ['Class1', 'Class1_label', 'Class2', 'Class2_label', 'Score']

In [96]:
print(len(string_matching_results))
print(len(df_remaining_results))

1408
1408


In [98]:
print(len(df_remaining_results[df_remaining_results["Class2"] == ""])) # => number of class1 entries left without a match because all of their candidates conflict with other entries
conflicts

0


{}

In [99]:
# Group by 'Class2' column
grouped = df_remaining_results.groupby('Class2')

# Filter groups that have more than one entry
duplicate_class2 = grouped.filter(lambda x: len(x) > 1)

In [100]:
# Check how often the LLM score of the LLM match is higher than the LLM score of the string match for the same class
# This does not check whether the LLM scores are higher than the string matching scores!
score_count_llm_org = 0
score_count_llm_for_string = 0
equal_count = 0

for onto1_class in list(matched_results_llm_with_labels.keys()):
    llm_score_org = matched_results_llm_with_labels[onto1_class][3]
    llm_score_for_string = llm_matches_for_string[onto1_class][3]
    if llm_score_org > llm_score_for_string:
        score_count_llm_org += 1
    elif llm_score_org < llm_score_for_string:
        score_count_llm_for_string += 1
    else:
        equal_count += 1 # practically never triggered, as the LLM scores are very fine-grained (many decimals)
        
print(score_count_llm_org)
print(score_count_llm_for_string)

1180
228


In [101]:
final_matching_results.update(remaining_results)

In [102]:
print(len(final_matching_results))
print(len(onto1_dict)) # see the TODO earlier in the code: check why there is a difference of 1

2737
1805


In [103]:
# filtering with threshold
final_results_over_threshold = {}
threshold = 0.8

for class_name, values in final_matching_results.items():
    label1, class2, label2, score = values
    if score > threshold:
        final_results_over_threshold[class_name] = values
print(len(final_results_over_threshold))

1363
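
Since the threshold is meant to be defined by the user, the filtering step could be wrapped in a small helper. This is a sketch under assumptions: the function name `filter_by_threshold` and the `inclusive` flag are hypothetical and not part of the notebook. The three sample entries are copied from the LLM match outputs shown earlier.

```python
def filter_by_threshold(matches, threshold=0.8, inclusive=False):
    """Keep entries [label1, class2, label2, score] whose score passes the threshold."""
    kept = {}
    for class_name, values in matches.items():
        score = values[3]
        # Strict comparison by default, like the notebook cell; optionally allow score == threshold
        if score > threshold or (inclusive and score == threshold):
            kept[class_name] = values
    return kept

sample_matches = {
    "http://mouse.owl#MA_0000688": ["limb bone", "http://human.owl#NCI_C32350", "common bony limb", 0.7734771966934204],
    "http://mouse.owl#MA_0000814": ["brain arachnoid matter", "http://human.owl#NCI_C49331", "cerebral arachnoid membrane", 0.7378122210502625],
    "http://mouse.owl#MA_0001876": ["right ventricle membranous part", "http://human.owl#NCI_C32637", "frontal horn of the lateral ventricle", 0.5903604626655579],
}

kept_at_07 = filter_by_threshold(sample_matches, threshold=0.7)
kept_at_08 = filter_by_threshold(sample_matches, threshold=0.8)
print(len(kept_at_07), len(kept_at_08))
```

With a strict `>` comparison, a match whose score equals the threshold exactly is dropped; the `inclusive` flag makes that choice explicit.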


### Check how many conflicts occur if the threshold is already applied earlier in the code (useful in the report for validating our method)

In [53]:
# Feedback Prof
# TODO sort the four lists by score so that a lower-scoring match is never taken when a higher-scoring one exists => SOLVED
# Write about related work on ontology matching frameworks, string matching and LLM matching, and use papers as references there (max 1-2 pages)
# Also write about linktransformer even though we didn't use it in the final version (so Til gets the props he deserves)

In [105]:
# Same greedy combination of the four lists as above, but here a match is only accepted
# (and its class2 reserved) if its score exceeds the threshold of 0.8.
# Classes whose candidates all conflict are recorded in conflicts and get no entry.

test_remaining_results = {} # here the remaining matches will be stored and in the end merged with the final_results
already_used_class2 = []
counter = 0
conflicts = {}

for onto1_class in list(sorted_onto1_class_names_by_score.keys()):
    string_match_result = string_matching_results[onto1_class]
    string_match_result_for_llm = string_matches_for_llm[onto1_class]
    llm_match_result = matched_results_llm_with_labels[onto1_class]
    llm_match_result_for_string = llm_matches_for_string[onto1_class]
    
    results = [string_match_result, string_match_result_for_llm, llm_match_result, llm_match_result_for_string]
    empty_class2_counter = 0
    
    while True:
        if len(results) == 0:
            break
        highest_score_entry = max(results, key=lambda x: x[-1])
        class2_highest_entry = highest_score_entry[1]
        if class2_highest_entry in already_used_class2: # class2 is already matched to another class1
            results.remove(highest_score_entry)
        else:
            break
    # if class2_highest_entry == "":
    #     print("No match found")
    # A conflict occurs if class2 is already in use for every candidate, or if no match was found
    # (in theory the latter never happens, as the LLM always returns some match)
    if len(results) == 0 or class2_highest_entry == "":
        counter += 1
        conflicts[onto1_class] = results
        # test_remaining_results[onto1_class] = ["", "" ,"" ,0]
    else:
        if highest_score_entry[3] > 0.8:
            already_used_class2.append(class2_highest_entry)
            test_remaining_results[onto1_class] = highest_score_entry
            
print(len(string_matching_results))
print(len(test_remaining_results)) # => only this many matches exceed the defined threshold
print(len(conflicts))
# => this shows that no class of ontology 1 has a conflict if we apply a threshold of 0.8
# => most of the conflicts that occur otherwise involve scores that are not above the threshold

1408
194
0


In [55]:
# Check http://mouse.owl#MA_0001424
# http://mouse.owl#MA_0001424 has the label "cervical vertebra 4"
# It is currently matched with http://human.owl#NCI_C32245 => has label C7_Vertebra
# but in the reference.rdf it is matched with http://human.owl#NCI_C32242 => has label C4_Vertebra with score 1
# REPORT: when running the matching again, http://mouse.owl#MA_0001424 gets matched to http://human.owl#NCI_C32431
# => which has a score of 0.16. Reason: the elements were iterated in a different order, so the previously matched label was now matched with
# another class with the same score of 0.25 => http://mouse.owl#MA_0001436: ['sacral vertebra 3', 'http://human.owl#NCI_C32245', 0.25]
print(jaccard_similarity("cervical vertebra 4", "C4_Vertebra"))
print(jaccard_similarity("thoracic vertebra 10", "C4_Vertebra"))

0.0
0.0
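
The two `0.0` scores are consistent with a Jaccard similarity computed on whitespace-separated tokens: `C4_Vertebra` is then a single token that shares nothing with `cervical`, `vertebra` or `4`. The sketch below illustrates this. Both functions are assumptions: `token_jaccard` only mimics what `jaccard_similarity` appears to do, and `normalize_label` is a hypothetical preprocessing step that is not part of the notebook.

```python
import re

def token_jaccard(a, b):
    """Jaccard similarity on whitespace-separated tokens (assumed behaviour)."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def normalize_label(label):
    """Hypothetical normalization: lowercase and treat '_' and '-' as spaces."""
    return re.sub(r"[_\-]+", " ", label.lower()).strip()

raw = token_jaccard("cervical vertebra 4", "C4_Vertebra")
normalized = token_jaccard(
    normalize_label("cervical vertebra 4"),
    normalize_label("C4_Vertebra"),
)
print(raw, normalized)  # 0.0 0.25
```

Normalization recovers the shared token `vertebra`, but `c4` still does not match `cervical` and `4`, so a purely token-based score stays low for this pair.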


## 4. Write to output rdf

In [58]:
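# Turn the results into (class1, class2, measure, relation) tuples
final_matches_out = []

for match in final_results_over_threshold:
    class1 = match
    class2 = final_results_over_threshold[match][1]
    # The measure is fixed to 1.0; the matching score itself is not written out
    final_matches_out.append((class1, class2, 1.0, "="))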

final_matches_out

[('http://mouse.owl#MA_0000003', 'http://human.owl#NCI_C12919', 1.0, '='),
 ('http://mouse.owl#MA_0000004', 'http://human.owl#NCI_C33816', 1.0, '='),
 ('http://mouse.owl#MA_0000007', 'http://human.owl#NCI_C12429', 1.0, '='),
 ('http://mouse.owl#MA_0000009', 'http://human.owl#NCI_C12472', 1.0, '='),
 ('http://mouse.owl#MA_0000010', 'http://human.owl#NCI_C12686', 1.0, '='),
 ('http://mouse.owl#MA_0000011', 'http://human.owl#NCI_C12374', 1.0, '='),
 ('http://mouse.owl#MA_0000012', 'http://human.owl#NCI_C12705', 1.0, '='),
 ('http://mouse.owl#MA_0000015', 'http://human.owl#NCI_C13056', 1.0, '='),
 ('http://mouse.owl#MA_0000016', 'http://human.owl#NCI_C12755', 1.0, '='),
 ('http://mouse.owl#MA_0000018', 'http://human.owl#NCI_C12788', 1.0, '='),
 ('http://mouse.owl#MA_0000020', 'http://human.owl#NCI_C13062', 1.0, '='),
 ('http://mouse.owl#MA_0000022', 'http://human.owl#NCI_C12799', 1.0, '='),
 ('http://mouse.owl#MA_0000023', 'http://human.owl#NCI_C12419', 1.0, '='),
 ('http://mouse.owl#MA_00

In [62]:
# Initialize graph
g = rdflib.Graph()

# Define namespaces
KNOWLEDGEWEB = rdflib.Namespace("http://knowledgeweb.semanticweb.org/heterogeneity/alignment#")
g.bind("kw", KNOWLEDGEWEB)

# Create the root element for the alignment
alignment = rdflib.URIRef("http://example.org/alignment")

# Add basic alignment properties
g.add((alignment, rdflib.namespace.RDF.type, KNOWLEDGEWEB.Alignment))
g.add((alignment, KNOWLEDGEWEB.xml, rdflib.Literal("yes")))
g.add((alignment, KNOWLEDGEWEB.level, rdflib.Literal("0")))
g.add((alignment, KNOWLEDGEWEB.type, rdflib.Literal("??")))

# Add each match to the graph
for entity1, entity2, measure, relation in final_matches_out:
    cell = rdflib.URIRef(f"http://example.org/cell/{entity1.split('#')[-1]}_{entity2.split('#')[-1]}")
    g.add((cell, rdflib.namespace.RDF.type, KNOWLEDGEWEB.Cell))
    g.add((cell, KNOWLEDGEWEB.entity1, rdflib.URIRef(entity1)))
    g.add((cell, KNOWLEDGEWEB.entity2, rdflib.URIRef(entity2)))
    g.add((cell, KNOWLEDGEWEB.measure, rdflib.Literal(measure, datatype=rdflib.namespace.XSD.float)))
    g.add((cell, KNOWLEDGEWEB.relation, rdflib.Literal(relation)))
    g.add((alignment, KNOWLEDGEWEB.map, cell))

# Serialize the graph to an RDF file (e.g., in RDF/XML format)
with open("ontology_matching_results.rdf", "wb") as f:
    f.write(g.serialize(format='pretty-xml').encode("utf-8"))

# Optionally, print out the graph in RDF/XML format for inspection
print(g.serialize(format='pretty-xml'))


<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:kw="http://knowledgeweb.semanticweb.org/heterogeneity/alignment#"
>
  <kw:Alignment rdf:about="http://example.org/alignment">
    <kw:xml>yes</kw:xml>
    <kw:level>0</kw:level>
    <kw:type>??</kw:type>
    <kw:map>
      <kw:Cell rdf:about="http://example.org/cell/MA_0000003_NCI_C12919">
        <kw:entity1 rdf:resource="http://mouse.owl#MA_0000003"/>
        <kw:entity2 rdf:resource="http://human.owl#NCI_C12919"/>
        <kw:measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.0</kw:measure>
        <kw:relation>=</kw:relation>
      </kw:Cell>
    </kw:map>
    <kw:map>
      <kw:Cell rdf:about="http://example.org/cell/MA_0000004_NCI_C33816">
        <kw:entity1 rdf:resource="http://mouse.owl#MA_0000004"/>
        <kw:entity2 rdf:resource="http://human.owl#NCI_C33816"/>
        <kw:measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.0</kw:measure

Output format example:
```
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
	 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
	 xmlns:xsd="http://www.w3.org/2001/XMLSchema#">

<Alignment>
<xml>yes</xml>
<level>0</level>
<type>??</type>

<map>
	<Cell>
		<entity1 rdf:resource="http://mouse.owl#MA_0002401"/>
		<entity2 rdf:resource="http://human.owl#NCI_C52561"/>
		<measure rdf:datatype="xsd:float">1.0</measure>
		<relation>=</relation>
	</Cell>
</map>
```
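
The rdflib output above uses a `kw:` prefix and `rdf:about` attributes on each cell, which differs in layout from this example. If a tool expects the layout of the example, one possible alternative is to build the XML directly with the standard library. This is a minimal sketch, not the notebook's method: the function name `build_alignment_xml` is hypothetical, and the namespace string is copied from the example, without a trailing `#`.

```python
import xml.etree.ElementTree as ET

ALIGN_NS = "http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def build_alignment_xml(matches):
    """Build alignment XML from (entity1, entity2, measure, relation) tuples."""
    ET.register_namespace("", ALIGN_NS)   # default namespace, as in the example
    ET.register_namespace("rdf", RDF_NS)

    root = ET.Element(f"{{{RDF_NS}}}RDF")
    alignment = ET.SubElement(root, f"{{{ALIGN_NS}}}Alignment")
    # Header fields copied from the example
    for tag, text in (("xml", "yes"), ("level", "0"), ("type", "??")):
        ET.SubElement(alignment, f"{{{ALIGN_NS}}}{tag}").text = text

    for entity1, entity2, measure, relation in matches:
        map_el = ET.SubElement(alignment, f"{{{ALIGN_NS}}}map")
        cell = ET.SubElement(map_el, f"{{{ALIGN_NS}}}Cell")
        ET.SubElement(cell, f"{{{ALIGN_NS}}}entity1", {f"{{{RDF_NS}}}resource": entity1})
        ET.SubElement(cell, f"{{{ALIGN_NS}}}entity2", {f"{{{RDF_NS}}}resource": entity2})
        measure_el = ET.SubElement(
            cell, f"{{{ALIGN_NS}}}measure", {f"{{{RDF_NS}}}datatype": "xsd:float"}
        )
        measure_el.text = str(measure)
        ET.SubElement(cell, f"{{{ALIGN_NS}}}relation").text = relation

    return ET.tostring(root, encoding="unicode")

xml_text = build_alignment_xml(
    [("http://mouse.owl#MA_0002401", "http://human.owl#NCI_C52561", 1.0, "=")]
)
print(xml_text)
```

Whether an evaluation tool accepts the rdflib output or requires this layout would need to be checked against the reference alignment file.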